Fall 2020 Concepts of Programming Languages 12th Edition PDF
Summary
This document is a chapter from a textbook on programming languages, specifically about data types, type checking, and type equivalence. The chapter covers the fundamental concepts of data types and how they are implemented and used in programming languages.
6 Data Types
1. 6.1 Introduction
2. 6.2 Primitive Data Types
3. 6.3 Character String Types
4. 6.4 Enumeration Types
5. 6.5 Array Types
6. 6.6 Associative Arrays
7. 6.7 Record Types
8. 6.8 Tuple Types
9. 6.9 List Types
10. 6.10 Union Types
11. 6.11 Pointer and Reference Types
12. 6.12 Optional Types
13. 6.13 Type Checking
14. 6.14 Strong Typing
15. 6.15 Type Equivalence
16. 6.16 Theory and Data Types

This chapter first introduces the concept of a data type and the characteristics of the common primitive data types. Then, the designs of enumeration and subrange types are discussed. Next, the details of structured data types (specifically arrays, associative arrays, records, tuples, lists, and unions) are investigated. This section is followed by an in-depth look at pointers and references. The last category of types discussed is the optional types. For each of the various categories of data types, the design issues are stated and the design choices made by the designers of some common languages are described. These designs are then evaluated. The next three sections provide a thorough investigation of type checking, strong typing, and type equivalence rules. The last section of the chapter briefly introduces the fundamentals of the theory of data types. Implementation methods for data types sometimes have a significant impact on their design. Therefore, implementation of the various data types is another important part of this chapter, especially arrays.

6.1 Introduction
A data type defines a collection of data values and a set of predefined operations on those values. Computer programs produce results by manipulating data. An important factor in determining the ease with which they can perform this task is how well the data types available in the language being used match the objects in the real world of the problem being addressed. Therefore, it is crucial that a language supports an appropriate collection of data types and structures.
The contemporary concepts of data typing have evolved over the last 60 years. In the earliest languages, all problem space data structures had to be modeled with only a few basic language-supported data structures. For example, in pre-90 Fortrans, linked lists and binary trees were implemented with arrays. The data structures of COBOL took the first step away from the Fortran I model by allowing programmers to specify the accuracy of decimal data values, and also by providing a structured data type for records of information. PL/I extended the capability of accuracy specification to integer and floating-point types. The designers of PL/I included many data types, with the intent of supporting a large range of applications. A better approach, introduced in ALGOL 68, is to provide a few basic types and a few flexible structure-defining operators that allow a programmer to design a data structure for each need. Clearly, this was one of the most important advances in the evolution of data type design. User-defined types also provide improved readability through the use of meaningful names for types. They allow type checking of the variables of a special category of use, which would otherwise not be possible. User-defined types also aid modifiability: A programmer can change the type of a category of variables in a program by changing a type definition statement only. Taking the concept of a user-defined type a step further, we arrive at abstract data types, which are supported by most programming languages designed since the mid-1980s. The fundamental idea of an abstract data type is that the interface of a type, which is visible to the user, is separated from the representation and set of operations on values of that type, which are hidden from the user. All of the types provided by a high-level programming language are abstract data types. User-defined abstract data types are discussed in detail in Chapter 11. 
There are a number of uses of the type system of a programming language. The most practical of these is error detection. The process and value of type checking, which is directed by the type system of the language, are discussed in Section 6.13. A second use of a type system is the assistance it provides for program modularization. This results from the cross-module type checking that ensures the consistency of the interfaces among modules. Another use of a type system is documentation. The type declarations in a program document information about its data, which provides clues about the program’s behavior.

The type system of a programming language defines how a type is associated with each expression in the language and includes its rules for type equivalence and type compatibility. Certainly, one of the most important parts of understanding the semantics of a programming language is understanding its type system.

The two most common structured (nonscalar) data types in the imperative languages are arrays and records, although the popularity of associative arrays has increased significantly in recent years. Lists have been a central part of functional programming languages since the first such language appeared in 1959 (Lisp). Over the last decade, the increasing popularity of functional programming has led to lists being added to primarily imperative languages, such as Python and C#. The structured data types are defined with type operators, or constructors, which are used to form type expressions. For example, C uses brackets and asterisks as type operators to specify arrays and pointers.

It is convenient, both logically and concretely, to think of variables in terms of descriptors. A descriptor is the collection of the attributes of a variable. In an implementation, a descriptor is an area of memory that stores the attributes of a variable. If the attributes are all static, descriptors are required only at compile time.
These descriptors are built by the compiler, usually as a part of the symbol table, and are used during compilation. For dynamic attributes, however, part or all of the descriptor must be maintained during execution. In this case, the descriptor is used by the run-time system. In all cases, descriptors are used for type checking and building the code for the allocation and deallocation operations. Care must be taken when using the term variable. One who uses only traditional imperative languages may think of identifiers as variables, but that can lead to confusion when considering data types. Identifiers do not have data types in some programming languages. It is wise to remember that identifiers are just one of the attributes of a variable. The word object is often associated with the value of a variable and the space it occupies. In this book, however, we reserve object exclusively for instances of user-defined and language-defined abstract data types, rather than for the values of all program variables of predefined types. Objects are discussed in detail in Chapters 11 and 12. In the following sections, many common data types are discussed. For most, design issues particular to the type are stated. For all, one or more example designs are described. One design issue is fundamental to all data types: What operations are provided for variables of the type, and how are they specified? 6.2 Primitive Data Types Data types that are not defined in terms of other types are called primitive data types. Nearly all programming languages provide a set of primitive data types. Some of the primitive types are merely reflections of the hardware— for example, most integer types. Others require only a little nonhardware support for their implementation. To specify the structured types, the primitive data types of a language are used, along with one or more type constructors. 6.2.1 Numeric Types Some early programming languages only had numeric primitive types. 
Numeric types still play a central role among the collections of types supported by contemporary languages.

6.2.1.1 Integer
The most common primitive numeric data type is integer. The hardware of many computers supports several sizes of integers. These sizes of integers, and often a few others, are supported by some programming languages. For example, Java includes four signed integer sizes: byte, short, int, and long. Some languages, for example, C++ and C#, include unsigned integer types, which are types for integer values without signs. Unsigned types are often used for binary data.

A signed integer value is represented in a computer by a string of bits, with one of the bits (typically the leftmost) representing the sign. Most integer types are supported directly by the hardware. One example of an integer type that is not supported directly by the hardware is the long integer type of Python 2 (F# also provides such integers). Values of this type can have unlimited length. Long integer values can be specified as literals, as in the following example: 243725839182756281923L Integer arithmetic operations in Python 2 that produce values too large to be represented with int type store them as long integer type values. (Python 3 merged the two types into a single int type of unlimited length and no longer accepts the L suffix.)

A negative integer could be stored in sign-magnitude notation, in which the sign bit is set to indicate negative and the remainder of the bit string represents the absolute value of the number. Sign-magnitude notation, however, does not lend itself to computer arithmetic. Most computers now use a notation called twos complement to store negative integers, which is convenient for addition and subtraction. In twos-complement notation, the representation of a negative integer is formed by taking the logical complement of the positive version of the number and adding one. Ones-complement notation is still used by some computers. In ones-complement notation, the negative of an integer is stored as the logical complement of its absolute value.
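The three notations for negative integers described above can be compared with a small sketch. The function names and the 8-bit default width are assumptions made for illustration, and no range checking is attempted; this is not how any particular machine implements its arithmetic.

```python
def sign_magnitude(value: int, bits: int = 8) -> str:
    """Sign bit on the left, absolute value in the remaining bits."""
    sign = 1 if value < 0 else 0
    return format((sign << (bits - 1)) | abs(value), f"0{bits}b")


def ones_complement(value: int, bits: int = 8) -> str:
    """A negative value is the logical complement of its absolute value."""
    mask = (1 << bits) - 1
    if value >= 0:
        return format(value, f"0{bits}b")
    return format(~abs(value) & mask, f"0{bits}b")


def twos_complement(value: int, bits: int = 8) -> str:
    """A negative value is the logical complement of the positive value, plus one."""
    mask = (1 << bits) - 1
    if value >= 0:
        return format(value, f"0{bits}b")
    return format(((~abs(value) & mask) + 1) & mask, f"0{bits}b")


for encode in (sign_magnitude, ones_complement, twos_complement):
    print(encode.__name__, encode(5), encode(-5))
```

All three functions agree on nonnegative values; they differ only in how the bit pattern for a negative value is formed.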
Ones-complement notation has the disadvantage that it has two representations of zero. See any book on assembly language programming for details of integer representations.

6.2.1.2 Floating-Point
Floating-point data types model real numbers, but the representations are only approximations for many real values. For example, neither of the fundamental numbers π or e (the base for the natural logarithms) can be correctly represented in floating-point notation. Of course, neither of these numbers can be precisely represented in any finite amount of computer memory. On most computers, floating-point numbers are stored in binary, which exacerbates the problem. For example, even the value 0.1 in decimal cannot be represented by a finite number of binary digits (0.1 in decimal is 0.0001100110011 . . . in binary). Another problem with floating-point types is the loss of accuracy through arithmetic operations. For more information on the problems of floating-point notation, see any book on numerical analysis.

Floating-point values are represented as fractions and exponents, a form that is borrowed from scientific notation. Older computers used a variety of different representations for floating-point values. However, most newer machines use the IEEE Floating-Point Standard 754 format. Language implementors use whatever representation is supported by the hardware. Most languages include two floating-point types, often called float and double. The float type is the standard size, usually stored in four bytes of memory. The double type is provided for situations where larger fractional parts and/or a larger range of exponents is needed. Double-precision variables usually occupy twice as much storage as float variables and provide at least twice the number of bits of fraction.

The collection of values that can be represented by a floating-point type is defined in terms of precision and range.
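The claim that 0.1 has no finite binary representation can be checked with exact rational arithmetic. The following is a minimal sketch; the helper name binary_fraction_digits is an assumption introduced here, and Python's float is taken to be an IEEE 754 double, which holds on all common platforms.

```python
from fractions import Fraction


def binary_fraction_digits(value: Fraction, count: int) -> str:
    """Return the first `count` binary digits after the point of 0 <= value < 1."""
    digits = []
    for _ in range(count):
        value *= 2                 # shift the next binary digit left of the point
        bit = int(value >= 1)
        digits.append(str(bit))
        value -= bit               # keep only the fractional part
    return "".join(digits)


# The digit pattern 0011 repeats without end, so no finite expansion exists.
print("0." + binary_fraction_digits(Fraction(1, 10), 13))

# The stored double is therefore only an approximation of one tenth.
print(Fraction(0.1) == Fraction(1, 10))

# The approximation error becomes visible through arithmetic.
print(0.1 + 0.2 == 0.3)
```

Using Fraction keeps the digit extraction exact, so the repeating pattern comes from the value itself and not from rounding inside the helper.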
Precision is the accuracy of the fractional part of a value, measured as the number of bits. Range is a combination of the range of fractions and, more important, the range of exponents. Figure 6.1 shows the IEEE Floating-Point Standard 754 format for single- and double-precision representation (IEEE, 1985). Details of the IEEE formats can be found in Tanenbaum (2005).

[Figure 6.1: IEEE floating-point formats: (a) single precision, (b) double precision]

6.2.1.3 Complex
Some programming languages support a complex data type, for example, Fortran and Python. Complex values are represented as ordered pairs of floating-point values. In Python, the imaginary part of a complex literal is specified by following it with a j or J, for example, (7 + 3j) Languages that support a complex type include operations for arithmetic on complex values.

6.2.1.4 Decimal
Most larger computers that are designed to support business systems applications have hardware support for decimal data types. Decimal data types store a fixed number of decimal digits, with the implied decimal point at a fixed position in the value. These are the primary data types for business data processing and are therefore essential to COBOL. C# and F# also have decimal data types.

Decimal types have the advantage of being able to precisely store decimal values, at least those within a restricted range, which cannot be done with floating-point. For example, the number 0.1 (in decimal) can be exactly represented in a decimal type, but not in a floating-point type, as is noted in Section 6.2.1.2. The disadvantages of decimal types are that the range of values is restricted because no exponents are allowed, and their representation in memory is mildly wasteful, for reasons discussed in the following paragraph.

Decimal types are stored very much like character strings, using binary codes for the decimal digits. These representations are called binary coded decimal (BCD).
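Both properties of decimal types mentioned above, exact storage of values such as 0.1 and a digit-by-digit binary coding, can be illustrated with a short sketch. Two caveats apply: Python's standard decimal module is a software decimal floating-point type and is used here only to show exactness, not as the fixed-point hardware type the text describes; and the helpers to_packed_bcd and from_packed_bcd are hypothetical names that handle nonnegative integers only.

```python
from decimal import Decimal


def to_packed_bcd(number: int) -> bytes:
    """Pack the decimal digits of a nonnegative integer two per byte."""
    digits = str(number)
    if len(digits) % 2:
        digits = "0" + digits      # pad to an even number of digits
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


def from_packed_bcd(data: bytes) -> int:
    """Recover the integer from its packed BCD bytes."""
    return int("".join(f"{byte >> 4}{byte & 0x0F}" for byte in data))


# Exactness: one tenth is stored exactly as decimal digits, but not as a binary float.
print(Decimal("0.1") * 3 == Decimal("0.3"))
print(0.1 * 3 == 0.3)

# Storage: four bits per digit in BCD versus a plain binary integer.
packed = to_packed_bcd(999999)
print(packed.hex(), len(packed) * 8, (999999).bit_length())
```

The last line prints the BCD size in bits next to the number of bits the same value needs in pure binary, which is the storage trade-off this section describes.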
In some cases, they are stored one digit per byte, but in others, they are packed two digits per byte. Either way, they take more storage than binary representations. It takes at least four bits to code a decimal digit. Therefore, to store a six-digit coded decimal number requires 24 bits of memory. However, it takes only 20 bits to store the same number in binary. (Of course, unless a program needs to maintain a large number of large decimal values, the difference is insignificant.) The operations on decimal values are done in hardware on machines that have such capabilities; otherwise, they are simulated in software.

6.2.2 Boolean Types
Boolean types are perhaps the simplest of all types. Their range of values has only two elements: one for true and one for false. They were introduced in ALGOL 60 and have been included in most general-purpose languages designed since 1960. One popular exception is C89, in which numeric expressions are used as conditionals. In such expressions, all operands with nonzero values are considered true, and zero is considered false. Although C99 and C++ have a Boolean type, they also allow numeric expressions to be used as if they were Boolean. This is not the case in the subsequent languages, Java and C#.

Boolean types are often used to represent switches or flags in programs. Although other types, such as integers, can be used for these purposes, the use of Boolean types is more readable. A Boolean value could be represented by a single bit, but because a single bit of memory cannot be accessed efficiently on many machines, they are often stored in the smallest efficiently addressable cell of memory, typically a byte.

6.2.3 Character Types
Character data are stored in computers as numeric codings. Traditionally, the most commonly used coding was ASCII (American Standard Code for Information Interchange), a 7-bit code usually stored one character per 8-bit byte, which uses the values 0 to 127 to code 128 different characters.
ISO 8859-1 is another 8-bit character code, but it allows 256 different characters. Because of the globalization of business and the need for computers to communicate with other computers around the world, the ASCII character set became inadequate. In response, in 1991, the Unicode Consortium published the UCS-2 standard, a 16-bit character set. This character code is often called Unicode. Unicode includes the characters from most of the world’s natural languages. For example, Unicode includes the Cyrillic alphabet, as used in Serbia, and the Thai digits. The first 128 characters of Unicode are identical to those of ASCII. Java was the first widely used language to use the Unicode character set. Since then, it has found its way into JavaScript, Python, Perl, C#, F#, and Swift. After 1991, the Unicode Consortium, in cooperation with the International Standards Organization (ISO), developed a 4-byte character code named UCS-4, or UTF-32, which is described in the ISO/IEC 10646 Standard, published in 2000. To provide the means of processing codings of single characters, most programming languages include a primitive type for them. However, Python supports single characters only as character strings of length 1. 6.3 Character String Types A character string type is one in which the values consist of sequences of characters. Character string constants are used to label output, and the input and output of all kinds of data are often done in terms of strings. Of course, character strings also are an essential type for all programs that do character manipulation. 6.3.1 Design Issues The two most important design issues that are specific to character string types are the following: Should strings be a special kind of character array or a primitive type? Should strings have static or dynamic length? 6.3.2 Strings and Their Operations The most common string operations are assignment, catenation, substring reference, comparison, and pattern matching. 
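The operations just listed look as follows in Python, chosen here only as a convenient illustration. The variable names are assumptions, and the last lines also show the earlier point that Python has no separate character type.

```python
import re

greeting = "Hello"
alias = greeting                    # assignment: both names refer to the same string
full = greeting + ", world"         # catenation
sub = full[7:12]                    # substring reference (a slice)
ordered = "apple" < "banana"        # comparison by character codes
found = re.search(r"wor", full)     # pattern matching through the re library

print(full, sub, ordered, found is not None)

# A single character is simply a string of length 1.
first = full[0]
print(type(first).__name__, len(first))
```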
A substring reference is a reference to a substring of a given string. Substring references are discussed in the more general context of arrays, where the substring references are called slices. In general, both assignment and comparison operations on character strings are complicated by the possibility of string operands of different lengths. For example, what happens when a longer string is assigned to a shorter string, or vice versa? Usually, simple and sensible choices are made for these situations, although programmers often have trouble remembering them. In some languages, pattern matching is supported directly in the language. In others, it is provided by a function or class library. If strings are not defined as a primitive type, string data is usually stored in arrays of single characters and referenced as such in the language. This is the approach taken by C and C++, which use char arrays to store character strings. These languages provide a collection of string operations through standard libraries. Many users of strings and many of the library functions use the convention that character strings are terminated with a special character, null, which is represented with zero. This is an alternative to maintaining the length of string variables. The library operations simply carry out their operations until the null character appears in the string being operated on. Library functions that produce strings often supply the null character. The character string literals that are built by the compiler also have the null character. For example, consider the following declaration: char str[] = "apples"; In this example, str represents an array of char elements, specifically apples0, where 0 is the null character. 
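The null-terminated layout of the str example can be inspected from Python through the standard ctypes module, whose create_string_buffer function allocates a C char array holding the given bytes plus a terminating zero. The helper c_strlen below is a hypothetical stand-in that imitates how a C library routine scans for the terminator; it is a sketch, not the library's implementation.

```python
import ctypes


def c_strlen(buffer: bytes) -> int:
    """Count characters up to, but not including, the first zero byte."""
    length = 0
    while buffer[length] != 0:
        length += 1
    return length


buf = ctypes.create_string_buffer(b"apples")

print(ctypes.sizeof(buf))    # six characters plus the null character
print(buf.raw)               # the full array contents, terminator included
print(c_strlen(buf.raw))     # length as a C library function would report it
```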
Some of the most commonly used library functions for character strings in C and C++ are strcpy, which moves strings; strcat, which catenates one given string onto another; strcmp, which lexicographically compares (by the order of their character codes) two given strings; and strlen, which returns the number of characters, not counting the null character, in the given string. The parameters and return values for most of the string manipulation functions are char pointers that point to arrays of char. Parameters can also be string literals. The string manipulation functions of the C standard library, which are also available in C++, are inherently unsafe and have led to numerous programming errors. The problem is that the functions in this library that move string data do not guard against overflowing the destination. For example, consider the following call to strcpy: strcpy(dest, src); If the length of dest is 20 and the length of src is 50, strcpy will write over the 30 bytes that follow dest. The point is that strcpy does not know the length of dest, so it cannot ensure that the memory following it will not be overwritten. The same problem can occur with several of the other functions in the C string library. In addition to C-style strings, C++ also supports strings through its standard class library, which is also similar to that of Java. Because of the insecurities of the C string library, C++ programmers should use the string class from the standard library, rather than char arrays and the C string library. In Java, strings are supported by the String class, whose values are constant strings, and the StringBuffer class, whose values are changeable and are more like arrays of single characters. These values are specified with methods of the StringBuffer class. C# and Ruby include string classes that are similar to those of Java. 
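The strcpy hazard described above can be imitated safely by modeling memory as a bytearray in which a small destination area is followed directly by unrelated data. Everything here is an assumption made for illustration: the 8-byte destination, the NEIGHBOR marker, and the function names. The bounded variant is one possible guard, not a reproduction of any specific C library function.

```python
def unchecked_copy(memory: bytearray, dest: int, src: bytes) -> None:
    """Copy src, including its terminating zero byte, with no bounds check."""
    i = 0
    while True:
        memory[dest + i] = src[i]
        if src[i] == 0:
            break
        i += 1


def bounded_copy(memory: bytearray, dest: int, dest_size: int, src: bytes) -> None:
    """Copy at most dest_size - 1 characters, then write the terminator."""
    i = 0
    while i < dest_size - 1 and src[i] != 0:
        memory[dest + i] = src[i]
        i += 1
    memory[dest + i] = 0


def fresh_memory() -> bytearray:
    """An 8-byte destination area followed by 8 bytes of unrelated data."""
    return bytearray(b"\x00" * 8 + b"NEIGHBOR")


source = b"0123456789\x00"          # longer than the destination area

overflowed = fresh_memory()
unchecked_copy(overflowed, 0, source)
print(bytes(overflowed[8:]))        # the neighboring data has been partly overwritten

protected = fresh_memory()
bounded_copy(protected, 0, 8, source)
print(bytes(protected[:8]), bytes(protected[8:]))
```

The unchecked version has no way to know where the destination ends, which is exactly the weakness attributed to strcpy; the bounded version must be told the destination size by its caller.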
Python includes strings as a primitive type and has operations for substring reference, catenation, indexing to access individual characters, as well as methods for searching and replacement. There is also an operation for character membership in a string. So, even though Python’s strings are primitive types, for character and substring references, they act very much like arrays of characters. However, Python strings are immutable, similar to the String class objects of Java.

In F#, strings are a class. Individual characters, which are represented in Unicode UTF-16, can be accessed, but not changed. Strings can be catenated with the + operator. In ML, string is a primitive immutable type. It uses ^ for its catenation operator and includes functions for substring referencing and getting the size of a string.

In Swift, the String class supports its character strings. String objects can be either constants or variables. The binary + operator catenates String variables. The append method is used to add a Character object to a String object. The characters method of String is used to examine individual characters of a String object.

History note: SNOBOL 4 was the first widely known language to support pattern matching.

Perl, JavaScript, Ruby, and PHP include built-in pattern-matching operations. In these languages, the pattern-matching expressions are somewhat loosely based on mathematical regular expressions. In fact, they are often called regular expressions. They evolved from the early UNIX line editor, ed, to become part of the UNIX shell languages. Eventually, they grew to their current complex form. There is at least one complete book on this kind of pattern-matching expression (Friedl, 2006).

In this section, we provide a brief look at the style of these expressions through two relatively simple examples. Consider the following pattern expression: /[A-Za-z][A-Za-z\d]+/ This pattern matches (or describes) the typical name form in programming languages.
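The name pattern can be tried out with Python's standard re module. This is a small sketch: the slashes of the Perl-style notation are dropped because Python passes the pattern as a string, the helper is_name is a hypothetical name, and re.ASCII is added as an assumption so that \d means only the digits 0 to 9.

```python
import re

# Same pattern as in the text, written as a Python raw string.
NAME = re.compile(r"[A-Za-z][A-Za-z\d]+", re.ASCII)


def is_name(text: str) -> bool:
    """True when the whole string has the name form described by the pattern."""
    return NAME.fullmatch(text) is not None


for candidate in ("count2", "2count", "x", "sum_total"):
    print(candidate, is_name(candidate))
```

Note that the single letter x is rejected, because the pattern as written requires at least one further letter or digit after the first letter.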
The brackets enclose character classes. The first character class specifies all letters; the second specifies all letters and digits (a digit is specified with the abbreviation \d). If only the second character class were included, we could not prevent a name from beginning with a digit. The plus operator following the second category specifies that there must be one or more of what is in the category. So, the whole pattern matches strings that begin with a letter, followed by one or more letters or digits.

Next, consider the following pattern expression: /\d+\.?\d*|\.\d+/ This pattern matches numeric literals. The \. specifies a literal decimal point. (The period must be “escaped” with the backslash because period has special meaning in a regular expression.) The question mark quantifies what it follows to have zero or one appearance. The vertical bar (|) separates two alternatives in the whole pattern. The first alternative matches strings of one or more digits, possibly followed by a decimal point, followed by zero or more digits; the second alternative matches strings that begin with a decimal point, followed by one or more digits.

Pattern-matching capabilities using regular expressions are included in the class libraries of C++, Java, Python, C#, and F#.

6.3.3 String Length Options
There are several design choices regarding the length of string values. First, the length can be static and set when the string is created. Such a string is called a static length string. This is the choice for the strings of Python, the immutable objects of Java’s String class, as well as similar classes in the C++ standard class library, Ruby’s built-in String class, and the .NET class library available to C# and F#.

The second option is to allow strings to have varying length up to a declared and fixed maximum set by the variable’s definition, as exemplified by the strings in C and the C-style strings of C++. These are called limited dynamic length strings.
Such string variables can store any number of characters between zero and the maximum. Recall that strings in C use a special character to indicate the end of the string’s characters, rather than maintaining the string length. The third option is to allow strings to have varying length with no maximum, as in JavaScript, Perl, and the standard C++ library. These are called dynamic length strings. This option requires the overhead of dynamic storage allocation and deallocation but provides maximum flexibility. 6.3.4 Evaluation String types are important to the writability of a language. Dealing with strings as arrays can be more cumbersome than dealing with a primitive string type. For example, consider a language that treats strings as arrays of characters and does not have a predefined function that does what strcpy in C does. Then, a simple assignment of one string to another would require a loop. The addition of strings as a primitive type to a language is not costly in terms of either language or compiler complexity. Therefore, it is difficult to justify the omission of primitive string types in some contemporary languages. Of course, providing strings through a standard library is nearly as convenient as having them as a primitive type. String operations such as simple pattern matching and catenation are essential and should be included for string type values. Although dynamic length strings are obviously the most flexible, the overhead of their implementation must be weighed against that additional flexibility. 6.3.5 Implementation of Character String Types Character string types could be supported directly in hardware; but in most cases, software is used to implement string storage, retrieval, and manipulation. When character string types are represented as character arrays, the language often supplies few operations. A descriptor for a static character string type, which is required only during compilation, has three fields. 
The first field of every descriptor is the name of the type. In the case of static character strings, the second field is the type’s length (in characters). The third field is the address of the first character. This descriptor is shown in Figure 6.2. Limited dynamic strings require a run-time descriptor to store the fixed maximum length, the current length, and the address, as shown in Figure 6.3. Dynamic length strings require a simpler run-time descriptor because only the current length and the address need to be stored. Although we depict descriptors as independent blocks of storage, in most cases, they are stored in the symbol table.

[Figure 6.2: Compile-time descriptor for static strings]
[Figure 6.3: Run-time descriptor for limited dynamic strings]

The limited dynamic strings of C and C++ do not require run-time descriptors, because the end of a string is marked with the null character. They do not need the maximum length, because index values in array references are not range checked in these languages.

Static length and limited dynamic length strings require no special dynamic storage allocation. In the case of limited dynamic length strings, sufficient storage for the maximum length is allocated when the string variable is bound to storage, so only a single allocation process is involved.

Dynamic length strings require more complex storage management. The length of a string, and therefore the storage to which it is bound, must grow and shrink dynamically. There are three approaches to supporting the dynamic allocation and deallocation that is required for dynamic length strings. First, strings can be stored in a linked list, so that when a string grows, the newly required cells can come from anywhere in the heap. The drawbacks to this method are the extra storage occupied by the links in the list representation and the necessary complexity of string operations.
The second approach is to store strings as arrays of pointers to individual characters allocated in the heap. This method still uses extra memory, but string processing can be faster than with the linked-list approach. The third alternative is to store complete strings in adjacent storage cells. The problem with this method arises when a string grows: How can storage that is adjacent to the existing cells continue to be allocated for the string variable? Frequently, such storage is not available. Instead, a new area of memory is found that can store the complete new string, and the old part is moved to this area. Then, the memory cells used for the old string are deallocated. This latter approach is the one typically used. The general problem of managing allocation and deallocation of variable-size segments is discussed in Section 6.11.7.3. Although the linked-list method requires more storage, the associated allocation and deallocation processes are simple. However, some string operations are slowed by the required pointer chasing. On the other hand, using adjacent memory for complete strings results in faster string operations and requires significantly less storage, but the allocation and deallocation processes are slower. 6.4 Enumeration Types An enumeration type is one in which all of the possible values, which are named constants, are provided, or enumerated, in the definition. Enumeration types provide a way of defining and grouping collections of named constants, which are called enumeration constants. The definition of a typical enumeration type is shown in the following C# example: enum days {Mon, Tue, Wed, Thu, Fri, Sat, Sun}; The enumeration constants are typically implicitly assigned the integer values, 0, 1, . . . but can be explicitly assigned any integer literal in the type’s definition. 
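The implicit and explicit numbering of enumeration constants can be sketched with Python's standard enum module (available since Python 3.4), used here only as a stand-in for the C# definition above. The start=0 argument is needed because Python's default numbering begins at 1, and the Priority type with its values is a hypothetical example.

```python
from enum import IntEnum

# Implicit values 0, 1, 2, ... in definition order, mirroring the days example.
Days = IntEnum("Days", ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"], start=0)


class Priority(IntEnum):
    """Constants with explicitly assigned integer values (hypothetical type)."""
    LOW = 10
    HIGH = 20


print([(day.name, day.value) for day in Days])
print(int(Days.Sun), Priority.HIGH.value)
```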
6.4.1 Design Issues
The design issues for enumeration types are as follows:
Is an enumeration constant allowed to appear in more than one type definition, and if so, how is the type of an occurrence of that constant in the program checked?
Are enumeration values coerced to integer?
Are any other types coerced to an enumeration type?
All of these design issues are related to type checking. If an enumeration variable is coerced to a numeric type, then there is little control over its range of legal operations or its range of values. If an int type value is coerced to an enumeration type, then an enumeration type variable could be assigned any integer value, whether it represented an enumeration constant or not.

6.4.2 Designs
In languages that do not have enumeration types, programmers usually simulate them with integer values. For example, suppose we needed to represent colors in a C program and C did not have an enumeration type. We might use 0 to represent red, 1 to represent blue, and so forth. These values could be defined as follows: int red = 0, blue = 1; Now, in the program, we could use red and blue as if they were of a color type. The problem with this approach is that because we have not defined a type for our colors, there is no type checking when they are used. For example, it would be legal to add the two together, although that would rarely be an intended operation. They could also be combined with any other numeric type operand using any arithmetic operator, which would also rarely be useful. Furthermore, because they are just variables, they could be assigned any integer value, thereby destroying the relationship with the colors. This latter problem could be prevented by making them named constants.

C and Pascal were the first widely used languages to include an enumeration data type. C++ includes C’s enumeration types.
In C++, we could have the following: enum colors {red, blue, green, yellow, black}; colors myColor = blue, yourColor = red; The colors type uses the default internal values for the enumeration constants, 0, 1, . . . , although the constants could have been specifically assigned any integer literal (or any constant-valued expression) by the programmer. The enumeration values are coerced to int when they are put in integer context. This allows their use in any numeric expression. For example, if the current value of myColor is blue, then the expression myColor++ would assign the integer code for green to myColor. C++ also allows enumeration constants to be assigned to variables of any numeric type, though that would likely be an error. However, no other type value is coerced to an enumeration type in C++. For example, myColor = 4; is illegal in C++. This assignment would be legal if the right side had been cast to colors type. This prevents some potential errors. C++ enumeration constants can appear in only one enumeration type in the same referencing environment. In 2004, an enumeration type was added to Java in Java 5.0. All enumeration types in Java are implicitly subclasses of the predefined class Enum. Because enumeration types are classes, they can have instance data fields, constructors, and methods. Syntactically, Java enumeration type definitions appear like those of C++, except that they can include fields, constructors, and methods. The possible values of an enumeration are the only possible instances of the class. All enumeration types inherit toString, as well as a few other methods. An array of the instances of an enumeration type can be fetched with the static method values. The internal numeric value of an enumeration variable can be fetched with the ordinal method. No expression of any other type can be assigned to an enumeration variable. Also, an enumeration variable is never coerced to any other type. 
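The properties just listed for Java's enumeration types (the constants are the only instances, the set of instances and an internal code can be fetched, and nothing is coerced in either direction) can be illustrated with a minimal sketch in Python's standard enum module. This is an analogy, not Java's implementation: the class Color and its describe method are hypothetical, list(Color) stands in for Java's values, and the value attribute stands in for ordinal.

```python
from enum import Enum

class Color(Enum):
    RED = 0
    BLUE = 1
    GREEN = 2

    # Enumeration types that are classes can carry their own methods.
    def describe(self):
        return f"{self.name.lower()} (code {self.value})"

my_color = Color.BLUE
all_colors = list(Color)   # every instance of the type, in definition order
code = my_color.value      # the internal numeric value of this constant

# Arithmetic on an enumeration value is rejected rather than coerced.
try:
    my_color + 1
    arithmetic_allowed = True
except TypeError:
    arithmetic_allowed = False

# An integer outside the defined constants cannot become a Color.
try:
    Color(4)
    out_of_range_allowed = True
except ValueError:
    out_of_range_allowed = False
```

Both flags end up False, and Color.BLUE does not compare equal to the integer 1, which is the kind of type separation the Java design provides.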
C# enumeration types are like those of C++, except that they are never coerced to integer. So, operations on enumeration types are restricted to those that make sense. Also, the range of values is restricted to that of the particular enumeration type.

In ML, enumeration types are defined as new types with datatype declarations. For example, we could have the following:

datatype weekdays = Monday | Tuesday | Wednesday | Thursday | Friday

The type of the elements of weekdays is integer.

F# has enumeration types that are similar to those of ML, except the reserved word type is used instead of datatype and the first value is preceded by an OR operator (|).

Swift has an enumeration type in which the enumeration values are names, which represent themselves, rather than having internal integer values. An enumeration type is defined in a structure that is similar to a switch structure, as in:

enum fruit {
  case orange
  case apple
  case banana
}

Dot notation is used to reference enumeration values, so in our example, the value of apple is referenced as fruit.apple.

Interestingly, the relatively recent scripting languages were designed without enumeration types. Perl, JavaScript, and Ruby still have none built in, while Python (through its standard enum module, since version 3.4) and PHP (since version 8.1) added them only later. Even Java was a decade old before enumeration types were added.

6.4.3 Evaluation

Enumeration types can provide advantages in both readability and reliability. Readability is enhanced very directly: Named values are easily recognized, whereas coded values are not.

In the area of reliability, the enumeration types of C#, F#, Java 5.0, and Swift provide two advantages: (1) no arithmetic operations are legal on enumeration types, which prevents adding days of the week, for example, and (2) no enumeration variable can be assigned a value outside its defined range (see footnote 4). If the colors enumeration type has 10 enumeration constants and uses 0..9 as its internal values, no number greater than 9 can be assigned to a colors type variable.

Because C treats enumeration variables like integer variables, it does not provide either of these two advantages.

C++ is a little better. Numeric values can be assigned to enumeration type variables only if they are cast to the type of the assigned variable. Numeric values assigned to enumeration type variables are checked to determine whether they are in the range of the internal values of the enumeration type. Unfortunately, if the user uses a wide range of explicitly assigned values, this checking is not effective. For example,

enum colors {red = 1, blue = 1000, green = 100000}

In this example, a value assigned to a variable of colors type will only be checked to determine whether it is in the range of 1..100000.

4. In C# and F#, an integer value can be cast to an enumeration type and assigned to the name of an enumeration variable. Such values must be tested with the Enum.IsDefined method before assigning them to the name of an enumeration variable.

6.5 Array Types

An array is a homogeneous aggregate of data elements in which an individual element is identified by its position in the aggregate, relative to the first element. The individual data elements of an array are of the same type. References to individual array elements are specified using subscript expressions. If any of the subscript expressions in a reference include variables, then the reference will require an additional run-time calculation to determine the address of the memory location being referenced.

In many languages, such as C, C++, Java, and C#, all of the elements of an array are required to be of the same type. In these languages, pointers and references are restricted to point to or reference a single type. So the objects or data values being pointed to or referenced are also of a single type. In some other languages, such as JavaScript, Python, and Ruby, variables are typeless references to objects or data values.
In these cases, arrays still consist of elements of a single type, but the elements can reference objects or data values of different types. Such arrays are still homogeneous, because the array elements are of the same type. In Swift, arrays can be typed, that is, they will contain values only of a single type, or untyped, which means they can contain values of any type.

C# and Java 5.0 provide generic arrays, that is, arrays whose elements are references to objects, through their class libraries. These are discussed in Section 6.5.3.

6.5.1 Design Issues

The primary design issues specific to arrays are the following:

- What types are legal for subscripts?
- Are subscripting expressions in element references range checked?
- When are subscript ranges bound?
- When does array allocation take place?
- Are ragged or rectangular multidimensioned arrays allowed, or both?
- Can arrays be initialized when they have their storage allocated?
- What kinds of slices are allowed, if any?

In the following sections, examples of the design choices made for the arrays of the most common programming languages are discussed.

6.5.2 Arrays and Indices

Specific elements of an array are referenced by means of a two-level syntactic mechanism, where the first part is the aggregate name, and the second part is a possibly dynamic selector consisting of one or more items known as subscripts or indices. If all of the subscripts in a reference are constants, the selector is static; otherwise, it is dynamic. The selection operation can be thought of as a mapping from the array name and the set of subscript values to an element in the aggregate. Indeed, arrays are sometimes called finite mappings. Symbolically, this mapping can be shown as

array_name(subscript_value_list) → element

The syntax of array references is fairly universal: The array name is followed by the list of subscripts, which is surrounded by either parentheses or brackets. In some languages that provide multidimensioned arrays as arrays of arrays, each subscript appears in its own brackets.

History note: The designers of pre-90 Fortrans and PL/I chose parentheses for array subscripts because no other suitable characters were available at the time. Card punches did not include bracket characters.

A problem with using parentheses to enclose subscript expressions is that they often are also used to enclose the parameters in subprogram calls; this use makes references to arrays appear exactly like those calls. For example, consider the following Ada assignment statement:

Sum := Sum + B(I);

Because parentheses are used for both subprogram parameters and array subscripts in Ada, both program readers and compilers are forced to use other information to determine whether B(I) in this assignment is a function call or a reference to an array element. This results in reduced readability.

History note: Fortran I limited the number of array subscripts to three, because at the time of the design, execution efficiency was a primary concern. Fortran I designers had developed a very fast method for accessing the elements of arrays of up to three dimensions, using the three index registers of the IBM 704. Fortran IV was first implemented on an IBM 7094, which had seven index registers. This allowed Fortran IV’s designers to allow arrays with up to seven subscripts. Most other contemporary languages enforce no such limits.

The designers of Ada specifically chose parentheses to enclose subscripts so there would be uniformity between array references and function calls in expressions, in spite of potential readability problems. They made this choice in part because both array element references and function calls are mappings. Array element references map the subscripts to a particular element of the array. Function calls map the actual parameters to the function definition and, eventually, a functional value.
Most languages other than Fortran and Ada use brackets to delimit their array indices.

Two distinct types are involved in an array type: the element type and the type of the subscripts. The type of the subscripts is often integer.

Early programming languages did not specify that subscript ranges must be implicitly checked. Range errors in subscripts are common in programs, so requiring range checking is an important factor in the reliability of languages. Many contemporary languages also do not specify range checking of subscripts, but Java, ML, and C# do.

Subscripting in Perl is a bit unusual: although the names of all arrays begin with at signs (@), references to array elements use dollar signs ($) instead, because array elements are always scalars and the names of scalars always begin with dollar signs. For example, for the array @list, the second element is referenced with $list[1].

One can reference an array element in Perl with a negative subscript, in which case the subscript value is an offset from the end of the array. For example, if the array @list has five elements with the subscripts 0..4, $list[-2] references the element with the subscript 3. A reference to a nonexistent element in Perl yields undef, but no error is reported.

6.5.3 Subscript Bindings and Array Categories

The binding of the subscript type to an array variable is usually static, but the subscript value ranges are sometimes dynamically bound. In some languages, the lower bound of the subscript range is implicit. For example, in the C-based languages, the lower bound of all subscript ranges is fixed at 0. In some other languages, the lower bounds of the subscript ranges must be specified by the programmer.

There are four categories of arrays, based on the binding to subscript ranges, the binding to storage, and from where the storage is allocated. The category names indicate the design choices of these three.
In the first three of these categories, once the subscript ranges are bound and the storage is allocated, they remain fixed for the lifetime of the variable. Of course, when the subscript ranges are fixed, the array cannot change size. A static array is one in which the subscript ranges are statically bound and storage allocation is static (done before run time). The advantage of static arrays is efficiency: No dynamic allocation or deallocation is required. The disadvantage is that the storage for the array is fixed for the entire execution time of the program. A fixed stack-dynamic array is one in which the subscript ranges are statically bound, but the allocation is done at declaration elaboration time during execution. The advantage of fixed stack-dynamic arrays over static arrays is space efficiency. A large array in one subprogram can use the same space as a large array in a different subprogram, as long as both subprograms are not active at the same time. The same is true if the two arrays are in different blocks that are not active at the same time. The disadvantage is the required allocation and deallocation time. A fixed heap-dynamic array is similar to a fixed stack-dynamic array, in that the subscript ranges and the storage binding are both fixed after storage is allocated. The differences are that both the subscript ranges and storage bindings are done when the user program requests them during execution, and the storage is allocated from the heap, rather than the stack. The advantage of fixed heap-dynamic arrays is flexibility—the array’s size always fits the problem. The disadvantage is allocation time from the heap, which is longer than allocation time from the stack. A heap-dynamic array is one in which the binding of subscript ranges and storage allocation is dynamic and can change any number of times during the array’s lifetime. 
The advantage of heap-dynamic arrays over the others is flexibility: Arrays can grow and shrink during program execution as the need for space changes. The disadvantage is that allocation and deallocation take longer and may happen many times during execution of the program. Examples of the four categories are given in the following paragraphs. Arrays declared in C and C++ functions that include the static modifier are static. Arrays that are declared in C and C++ functions without the static specifier are examples of fixed stack-dynamic arrays. C and C++ also provide fixed heap-dynamic arrays. The standard C library functions malloc and free, which are general heap allocation and deallocation operations, respectively, can be used for C arrays. C++ uses the operators new and delete to manage heap storage. An array is treated as a pointer to a collection of storage cells, where the pointer can be indexed, as discussed in Section 6.11.5. In Java, all non-generic arrays are fixed heap-dynamic. Once created, these arrays keep the same subscript ranges and storage. C# also provides fixed heap-dynamic arrays. Objects of the C# List class are generic heap-dynamic arrays. These array objects are created without any elements, as in List<String> stringList = new List<String>(); Elements are added to this object with the Add method, as in stringList.Add("Michael"); Access to elements of these arrays is through subscripting. Java includes a generic class similar to C#’s List, named ArrayList. It is different from C#’s List in that subscripting is not supported—get and set methods must be used to access the elements. A Perl array can be made to grow by using the push (puts one or more new elements on the end of the array) and unshift (puts one or more new elements on the beginning of the array), or by assigning a value to the array specifying a subscript beyond the highest current subscript of the array. 
An array can be made to shrink to no elements by assigning it the empty list, (). The length of an array is defined to be the largest subscript plus one.

Like Perl, JavaScript allows arrays to grow with the push and unshift methods and shrink by setting them to the empty list. However, negative subscripts are not supported. JavaScript arrays can be sparse, meaning the subscript values need not be contiguous. For example, suppose we have an array named list that has 10 elements with the subscripts 0..9 (see footnote 5). Consider the following assignment statement:

list[50] = 42;

Now, list has 11 elements and length 51. The elements with subscripts 10..49 are not defined and therefore do not require storage. A reference to a nonexistent element in a JavaScript array yields undefined.

5. The subscript range could just as easily have been 1000..1009.

Arrays in Python and Ruby can be made to grow only through methods to add elements or catenate other arrays. Ruby, Perl, and Python all support negative subscripts, which count back from the end of the array. In Python an element or slice of an array can be deleted. A reference to a nonexistent element in Python results in a runtime error, whereas a similar reference in Ruby yields nil and no error is reported.

Swift dynamic arrays are objects that use integer subscripts, beginning at zero, and include several useful methods. The append method adds an element to the end of an array. The insert method inserts a new element at any position in the array, but results in an error if the insertion is at a subscript beyond the current length of the array. Elements can be removed from an array with the removeAtIndex method. There are also reverse and count methods.

Although the ML definition does not include arrays, its widely used implementation, SML/NJ, does.

The only predefined collection type that is part of F# is the array (other collection types are provided through the .NET Framework Library). These arrays are like those of C#.
A foreach statement is included in the language for array processing.

6.5.4 Array Initialization

Some languages provide the means to initialize arrays at the time their storage is allocated. C, C++, Java, Swift, and C# allow initialization of their arrays. Consider the following C declaration:

int list [] = {4, 5, 7, 83};

The array list is created and initialized with the values 4, 5, 7, and 83. The compiler also sets the length of the array. This is meant to be a convenience but is not without cost. It effectively removes the possibility that the system could detect some kinds of programmer errors, such as mistakenly leaving a value out of the list.

As discussed in Section 6.3.2, character strings in C and C++ are implemented as arrays of char. These arrays can be initialized to string constants, as in

char name [] = "freddie";

The array name will have eight elements, because all strings are terminated with a null character (zero), which is implicitly supplied by the system for string constants.

Arrays of strings in C and C++ can also be initialized with string literals. For example,

char *names [] = {"Bob", "Jake", "Darcie"};

This example illustrates the nature of string literals in C and C++. In the previous example of a string literal being used to initialize the char array name, the literal is taken to be a char array. But in the latter example (names), the literals are taken to be pointers to characters, so the array is an array of pointers to characters. For example, names[0] is a pointer to the letter 'B' in the literal character array that contains the characters 'B', 'o', 'b', and the null character.

In Java, similar syntax is used to define and initialize an array of references to String objects. For example,

String[] names = {"Bob", "Jake", "Darcie"};

6.5.5 Array Operations

An array operation is one that operates on an array as a unit.
The most common array operations are assignment, catenation, comparison for equality and inequality, and slices, which are discussed separately in Section 6.5.7.

The C-based languages do not provide any array operations, except through the methods of Java, C++, and C#. Perl supports array assignments but does not support comparisons.

Python’s arrays are called lists, although they have all the characteristics of dynamic arrays. Because the objects can be of any types, these arrays are heterogeneous. Python provides array assignment, although it is only a reference change. Python also has operations for array catenation (+) and element membership (in). It includes two different comparison operators: one that determines whether the two variables reference the same object (is) and one that compares all corresponding objects in the referenced objects, regardless of how deeply they are nested, for equality (==).

Like Python, the elements of Ruby’s arrays are references to objects. And like Python, when a == operator is used between two arrays, the result is true only if the two arrays have the same length and the corresponding elements are equal. Ruby’s arrays can be catenated with an Array method.

F# includes many array operators in its Array module. Among these are Array.append, Array.copy, and Array.length.

Arrays and their operations are the heart of APL; it is the most powerful array-processing language ever devised. Because of its relative obscurity and its lack of effect on subsequent languages, however, we present here only a glimpse into its array operations.

In APL, the four basic arithmetic operations are defined for vectors (single-dimensioned arrays) and matrices, as well as scalar operands. For example, A + B is a valid expression, whether A and B are scalar variables, vectors, or matrices.
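APL's A + B adds element by element, whereas the + of Python lists described earlier in this section catenates. A minimal sketch of the contrast follows; it also exercises Python's reference assignment, in, is, and ==. The helper elementwise_add is hypothetical, written only to imitate the APL behavior for operands of matching shape, and is not part of Python or APL.

```python
a = [1, 2, 3]
b = a                      # assignment copies the reference, not the list
c = [1, 2, 3]              # a separate list with equal contents

b.append(4)                # the change is visible through a as well
c.append(4)

same_object = b is a                              # True: one object, two names
equal_but_distinct = (c == a) and (c is not a)    # True: equal contents only
catenated = a + [5, 6]                            # + joins the two lists
has_three = 3 in a                                # element membership

def elementwise_add(x, y):
    """Hypothetical APL-like +: works on scalars, vectors, or matrices
    (nested lists) whose shapes match; no scalar extension is attempted."""
    if isinstance(x, list):
        if len(x) != len(y):
            raise ValueError("operands must have the same shape")
        return [elementwise_add(p, q) for p, q in zip(x, y)]
    return x + y

vector_sum = elementwise_add([1, 2, 3], [10, 20, 30])
matrix_sum = elementwise_add([[1, 2], [3, 4]], [[10, 20], [30, 40]])
```

Here catenated has six elements, while vector_sum has three, which is the difference between catenation and the element-wise arithmetic that APL builds in.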
APL includes a collection of unary operators for vectors and matrices, some of which are as follows (where V is a vector and M is a matrix):

APL also includes several special operators that take other operators as operands. One of these is the inner product operator, which is specified with a period (.). It takes two operands, which are binary operators. For example, +.× is a new operator that takes two arguments, either vectors or matrices. It first multiplies the corresponding elements of two arguments, and then it sums the results. For example, if A and B are vectors, A × B is the mathematical inner product of A and B (a vector of the products of the corresponding elements of A and B). The statement A +.× B is the sum of the inner product of A and B. If A and B are matrices, this expression specifies the matrix multiplication of A and B.

The special operators of APL are actually functional forms, which are described in Chapter 15.

6.5.6 Rectangular and Jagged Arrays

A rectangular array is a multidimensioned array in which all of the rows have the same number of elements and all of the columns have the same number of elements. Rectangular arrays model rectangular tables exactly.

A jagged array is one in which the lengths of the rows need not be the same. For example, a jagged matrix may consist of three rows, one with 5 elements, one with 7 elements, and one with 12 elements. This also applies to the columns and higher dimensions. So, if there is a third dimension (layers), each layer can have a different number of elements. Jagged arrays are made possible when multidimensioned arrays are actually arrays of arrays. For example, a matrix would appear as an array of single-dimensioned arrays.

C, C++, and Java support jagged arrays but not rectangular arrays. In those languages, a reference to an element of a multidimensioned array uses a separate pair of brackets for each dimension. For example,

myArray[3][7]

C# and F# support both rectangular and jagged arrays.
For rectangular arrays, all subscript expressions in references to elements are placed in a single pair of brackets. For example,

myArray[3, 7]

6.5.7 Slices

A slice of an array is some substructure of that array. For example, if A is a matrix, then the first row of A is one possible slice, as are the last row and the first column. It is important to realize that a slice is not a new data type. Rather, it is a mechanism for referencing part of an array as a unit. If arrays cannot be manipulated as units in a language, that language has no use for slices.

Consider the following Python declarations:

vector = [2, 4, 6, 8, 10, 12, 14, 16]
mat = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

Recall that the default lower bound for Python arrays is 0. The syntax of a Python slice reference is a pair of numeric expressions separated by a colon. The first is the first subscript of the slice; the second is the first subscript after the last subscript in the slice. Therefore, vector[3:6] is a three-element array with the fourth through sixth elements of vector (those elements with the subscripts 3, 4, and 5). A row of a matrix is specified by giving just one subscript. For example, mat[1] refers to the second row of mat; a part of a row can be specified with the same syntax as a part of a single-dimensioned array. For example, mat[0][0:2] refers to the first and second element of the first row of mat, which is [1, 2].

Python also supports more complex slices of arrays. For example, vector[0:7:2] references every other element of vector, up to but not including the element with the subscript 7, starting with the subscript 0, which is [2, 6, 10, 14].

Perl supports slices of two forms, a list of specific subscripts or a range of subscripts. For example,

@list[1..5] = @list2[3, 5, 7, 9, 13];

Notice that slice references use array names, not scalar names, because slices are arrays (not scalars).
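The Python slice references described above can be checked directly. The sketch below uses the same vector and mat declarations; the result variable names are illustrative. It also adds two observations that the discussion above does not make: a column of a list-of-lists matrix has no slice syntax and is gathered here with a comprehension, and a slice of a Python list used as a value is a new list, so changing it leaves the original unchanged.

```python
vector = [2, 4, 6, 8, 10, 12, 14, 16]
mat = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

middle = vector[3:6]           # subscripts 3, 4, 5 -> [8, 10, 12]
second_row = mat[1]            # one subscript selects a whole row -> [4, 5, 6]
row_part = mat[0][0:2]         # first two elements of the first row -> [1, 2]
every_other = vector[0:7:2]    # subscripts 0, 2, 4, 6 -> [2, 6, 10, 14]

# A column is not expressible as a slice of nested lists; collect it explicitly.
first_column = [row[0] for row in mat]   # [1, 4, 7]

# Slicing produced a separate list: modifying it does not touch vector.
middle_copy = vector[3:6]
middle_copy[0] = 99
original_unchanged = vector[3] == 8
```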
Ruby supports slices with the slice method of its Array object, which can take three forms of parameters. A single integer expression parameter is interpreted as a subscript, in which case slice returns the element with the given subscript. If slice is given two integer expression parameters, the first is interpreted as a beginning subscript and the second is interpreted as the number of elements in the slice. For example, suppose list is defined as follows:

list = [2, 4, 6, 8, 10]

Then list.slice(2, 2) returns [6, 8]. The third parameter form for slice is a range, which has the form of an integer expression, two periods, and a second integer expression. With a range parameter, slice returns an array of the elements with the given range of subscripts. For example, list.slice(1..3) returns [4, 6, 8].

6.5.8 Evaluation

Arrays have been included in virtually all programming languages. The primary advances since their introduction in Fortran I have been slices and dynamic arrays. As discussed in Section 6.6, the latest advances in arrays have been in associative arrays.

6.5.9 Implementation of Array Types

Implementing arrays requires considerably more compile-time effort than does implementing primitive types. The code to allow accessing of array elements must be generated at compile time. At run time, this code must be executed to produce element addresses. There is no way to precompute the address to be accessed by a reference such as

list[k]

A single-dimensioned array is implemented as a list of adjacent memory cells. Suppose the array list is defined to have a subscript range lower bound of 0. The access function for list is often of the form

address(list[k]) = address(list[0]) + k * element_size

where the first operand of the addition is the constant part of the access function, and the second is the variable part. If the element type is statically bound and the array is statically bound to storage, then t