PLP6 copy.pdf

6.4 Enumeration Types An enumeration type is one in which all of the possible values, which are named constants, are provided, or enumerated, in the definition. Enumeration types provide a way of defining and grouping collections of named constants, which are called enumeration constants. The definition of a typical enumeration type is shown in the following C# example: enum days {Mon, Tue, Wed, Thu, Fri, Sat, Sun}; The enumeration constants are typically implicitly assigned the integer values, 0, 1, . . . but can be explicitly assigned any integer literal in the type’s definition. 6.4.1 Design Issues The design issues for enumeration types are as follows: Is an enumeration constant allowed to appear in more than one type definition, and if so, how is the type of an occurrence of that constant in the program checked? Are enumeration values coerced to integer? Are any other types coerced to an enumeration type? All of these design issues are related to type checking. If an enumeration variable is coerced to a numeric type, then there is little control over its range of legal operations or its range of values. If an int type value is coerced to an enumeration type, then an enumeration type variable could be assigned any integer value, whether it represented an enumeration constant or not. 6.4.2 Designs In languages that do not have enumeration types, programmers usually simulate them with integer values. For example, suppose we needed to represent colors in a C program and C did not have an enumeration type. We might use 0 to represent blue, 1 to represent red, and so forth. These values could be defined as follows: int red = 0, blue = 1; Now, in the program, we could use red and blue as if they were of a color type. The problem with this approach is that because we have not defined a type for our colors, there is no type checking when they are used. For example, it would be legal to add the two together, although that would rarely be an intended operation. They could also be combined with any other numeric type operand using any arithmetic operator, which would also rarely be useful. Furthermore, because they are just variables, they could be assigned any integer value, thereby destroying the relationship with the colors. This latter problem could be prevented by making them named constants. C and Pascal were the first widely used languages to include an enumeration data type. C++ includes C’s enumeration types. In C++, we could have the following: enum colors {red, blue, green, yellow, black}; colors myColor = blue, yourColor = red; The colors type uses the default internal values for the enumeration constants, 0, 1, . . . , although the constants could have been specifically assigned any integer literal (or any constant-valued expression) by the programmer. The enumeration values are coerced to int when they are put in integer context. This allows their use in any numeric expression. For example, if the current value of myColor is blue, then the expression myColor++ would assign the integer code for green to myColor. C++ also allows enumeration constants to be assigned to variables of any numeric type, though that would likely be an error. However, no other type value is coerced to an enumeration type in C++. For example, myColor = 4; is illegal in C++. This assignment would be legal if the right side had been cast to colors type. This prevents some potential errors. C++ enumeration constants can appear in only one enumeration type in the same referencing environment. In 2004, an enumeration type was added to Java in Java 5.0. All enumeration types in Java are implicitly subclasses of the predefined class Enum. Because enumeration types are classes, they can have instance data fields, constructors, and methods. Syntactically, Java enumeration type definitions appear like those of C++, except that they can include fields, constructors, and methods. The possible values of an enumeration are the only possible instances of the class. All enumeration types inherit toString, as well as a few other methods. An array of the instances of an enumeration type can be fetched with the static method values. The internal numeric value of an enumeration variable can be fetched with the ordinal method. No expression of any other type can be assigned to an enumeration variable. Also, an enumeration variable is never coerced to any other type. C# enumeration types are like those of C++, except that they are never coerced to integer. So, operations on enumeration types are restricted to those that make sense. Also, the range of values is restricted to that of the particular enumeration type. In ML, enumeration types are defined as new types with datatype declarations. For example, we could have the following: datatype weekdays = Monday | Tuesday | Wednesday | Thursday | Friday The type of the elements of weekdays is integer. F# has enumeration types that are similar to those of ML, except the reserved word type is used instead of datatype and the first value is preceded by an OR operator (|). Swift has an enumeration type in which the enumeration values are names, which represent themselves, rather than having internal integer values. An enumeration type is defined in a structure that is similar to a switch structure, as in: enum fruit { case orange case apple case banana } Dot notation is used to reference enumeration values, so in our example, the value of apple is referenced as fruit.apple. Interestingly, none of the relatively recent scripting languages include enumeration types. These include Perl, JavaScript, PHP, Python, and Ruby. Even Java was a decade old before enumeration types were added. 6.4.3 Evaluation Enumeration types can provide advantages in both readability and reliability. Readability is enhanced very directly: Named values are easily recognized, whereas coded values are not. In the area of reliability, the enumeration types of C#, F#, Java 5.0, and Swift provide two advantages: (1) No arithmetic operations are legal on enumeration types; this prevents adding days of the week, for example, and (2) second, no enumeration variable can be assigned a value outside its defined range.4 If the colors enumeration type has 10 enumeration constants and uses 0..9 as its internal values, no number greater than 9 can be assigned to a c olors type variable. 4. In C# and F#, an integer value can be cast to an enumeration type and assigned to the name of an enumeration variable. Such values must be tested with Enum.IsDefined method before assigning them to the name of an enumeration variable. Because C treats enumeration variables like integer variables, it does not provide either of these two advantages. C++ is a little better. Numeric values can be assigned to enumeration type variables only if they are cast to the type of the assigned variable. Numeric values assigned to enumeration type variables are checked to determine whether they are in the range of the internal values of the enumeration type. Unfortunately, if the user uses a wide range of explicitly assigned values, this checking is not effective. For example, enum colors {red = 1, blue = 1000, green = 100000} In this example, a value assigned to a variable of colors type will only be checked to determine whether it is in the range of 1..100000. 6.5 Array Types An array is a homogeneous aggregate of data elements in which an individual element is identified by its position in the aggregate, relative to the first element. The individual data elements of an array are of the same type. References to individual array elements are specified using subscript expressions. If any of the subscript expressions in a reference include variables, then the reference will require an additional run-time calculation to determine the address of the memory location being referenced. In many languages, such as C, C++, Java, and C#, all of the elements of an array are required to be of the same type. In these languages, pointers and references are restricted to point to or reference a single type. So the objects or data values being pointed to or referenced are also of a single type. In some other languages, such as JavaScript, Python, and Ruby, variables are typeless references to objects or data values. In these cases, arrays still consist of elements of a single type, but the elements can reference objects or data values of different types. Such arrays are still homogeneous, because the array elements are of the same type. In Swift, arrays can be typed, that is, they will contain values only of a single type, or untyped, which means they can contain values of any type. C# and Java 5.0 provide generic arrays, that is, arrays whose elements are references to objects, through their class libraries. These are discussed in Section 6.5.3. 6.5.1 Design Issues The primary design issues specific to arrays are the following: What types are legal for subscripts? Are subscripting expressions in element references range checked? When are subscript ranges bound? When does array allocation take place? Are ragged or rectangular multidimensioned arrays allowed, or both? Can arrays be initialized when they have their storage allocated? What kinds of slices are allowed, if any? In the following sections, examples of the design choices made for the arrays of the most common programming languages are discussed. 6.5.2 Arrays and Indices Specific elements of an array are referenced by means of a two-level syntactic mechanism, where the first part is the aggregate name, and the second part is a possibly dynamic selector consisting of one or more items known as subscripts or indices. If all of the subscripts in a reference are constants, the selector is static; otherwise, it is dynamic. The selection operation can be thought of as a mapping from the array name and the set of subscript values to an element in the aggregate. Indeed, arrays are sometimes called finite mappings. Symbolically, this mapping can be shown as history note The designers of pre-90 Fortrans and PL/I chose parentheses for array subscripts because no other suitable characters were available at the time. Card punches did not include bracket characters. array\_name(subscript\_value\_list)→element The syntax of array references is fairly universal: The array name is followed by the list of subscripts, which is surrounded by either parentheses or brackets. In some languages that provide multidimensioned arrays as arrays of arrays, each subscript appears in its own brackets. A problem with using parentheses to enclose subscript expressions is that they often are also used to enclose the parameters in subprogram calls; this use makes references to arrays appear exactly like those calls. For example, consider the following Ada assignment statement: Sum := Sum + B(I); Because parentheses are used for both subprogram parameters and array subscripts in Ada, both program readers and compilers are forced to use other information to determine whether B(I) in this assignment is a function call or a reference to an array element. This results in reduced readability. history note Fortran I limited the number of array subscripts to three, because at the time of the design, execution efficiency was a primary concern. Fortran I designers had developed a very fast method for accessing the elements of arrays of up to three dimensions, using the three index registers of the IBM 704. Fortran IV was first implemented on an IBM 7094, which had seven index registers. This allowed Fortran IV’s designers to allow arrays with up to seven subscripts. Most other contemporary languages enforce no such limits. The designers of Ada specifically chose parentheses to enclose subscripts so there would be uniformity between array references and function calls in expressions, in spite of potential readability problems. They made this choice in part because both array element references and function calls are mappings. Array element references map the subscripts to a particular element of the array. Function calls map the actual parameters to the function definition and, eventually, a functional value. Most languages other than Fortran and Ada use brackets to delimit their array indices. Two distinct types are involved in an array type: the element type and the type of the subscripts. The type of the subscripts is often integer. Early programming languages did not specify that subscript ranges must be implicitly checked. Range errors in subscripts are common in programs, so requiring range checking is an important factor in the reliability of languages. Many contemporary languages also do not specify range checking of subscripts, but Java, ML, and C# do. Subscripting in Perl is a bit unusual in that although the names of all arrays begin with at signs (@), because array elements are always scalars and the names of scalars always begin with dollar signs ($), references to array elements use dollar signs rather than at signs in their names. For example, for the array @list, the second element is referenced with $list[1]. One can reference an array element in Perl with a negative subscript, in which case the subscript value is an offset from the end of the array. For example, if the array @list has five elements with the subscripts 0..4, $list [−2] references the element with the subscript 3. A reference to a nonexistent element in Perl yields undef, but no error is reported. 6.5.3 Subscript Bindings and Array Categories The binding of the subscript type to an array variable is usually static, but the subscript value ranges are sometimes dynamically bound. In some languages, the lower bound of the subscript range is implicit. For example, in the C-based languages, the lower bound of all subscript ranges is fixed at 0. In some other languages, the lower bounds of the subscript ranges must be specified by the programmer. There are four categories of arrays, based on the binding to subscript ranges, the binding to storage, and from where the storage is allocated. The category names indicate the design choices of these three. In the first three of these categories, once the subscript ranges are bound and the storage is allocated, they remain fixed for the lifetime of the variable. Of course, when the subscript ranges are fixed, the array cannot change size. A static array is one in which the subscript ranges are statically bound and storage allocation is static (done before run time). The advantage of static arrays is efficiency: No dynamic allocation or deallocation is required. The disadvantage is that the storage for the array is fixed for the entire execution time of the program. A fixed stack-dynamic array is one in which the subscript ranges are statically bound, but the allocation is done at declaration elaboration time during execution. The advantage of fixed stack-dynamic arrays over static arrays is space efficiency. A large array in one subprogram can use the same space as a large array in a different subprogram, as long as both subprograms are not active at the same time. The same is true if the two arrays are in different blocks that are not active at the same time. The disadvantage is the required allocation and deallocation time. A fixed heap-dynamic array is similar to a fixed stack-dynamic array, in that the subscript ranges and the storage binding are both fixed after storage is allocated. The differences are that both the subscript ranges and storage bindings are done when the user program requests them during execution, and the storage is allocated from the heap, rather than the stack. The advantage of fixed heap-dynamic arrays is flexibility—the array’s size always fits the problem. The disadvantage is allocation time from the heap, which is longer than allocation time from the stack. A heap-dynamic array is one in which the binding of subscript ranges and storage allocation is dynamic and can change any number of times during the array’s lifetime. The advantage of heap-dynamic arrays over the others is flexibility: Arrays can grow and shrink during program execution as the need for space changes. The disadvantage is that allocation and deallocation take longer and may happen many times during execution of the program. Examples of the four categories are given in the following paragraphs. Arrays declared in C and C++ functions that include the static modifier are static. Arrays that are declared in C and C++ functions without the static specifier are examples of fixed stack-dynamic arrays. C and C++ also provide fixed heap-dynamic arrays. The standard C library functions malloc and free, which are general heap allocation and deallocation operations, respectively, can be used for C arrays. C++ uses the operators new and delete to manage heap storage. An array is treated as a pointer to a collection of storage cells, where the pointer can be indexed, as discussed in Section 6.11.5. In Java, all non-generic arrays are fixed heap-dynamic. Once created, these arrays keep the same subscript ranges and storage. C# also provides fixed heap-dynamic arrays. Objects of the C# List class are generic heap-dynamic arrays. These array objects are created without any elements, as in List<String> stringList = new List<String>(); Elements are added to this object with the Add method, as in stringList.Add("Michael"); Access to elements of these arrays is through subscripting. Java includes a generic class similar to C#’s List, named ArrayList. It is different from C#’s List in that subscripting is not supported—get and set methods must be used to access the elements. A Perl array can be made to grow by using the push (puts one or more new elements on the end of the array) and unshift (puts one or more new elements on the beginning of the array), or by assigning a value to the array specifying a subscript beyond the highest current subscript of the array. An array can be made to shrink to no elements by assigning it the empty list, (). The length of an array is defined to be the largest subscript plus one. Like Perl, JavaScript allows arrays to grow with the push and unshift methods and shrink by setting them to the empty list. However, negative subscripts are not supported. JavaScript arrays can be sparse, meaning the subscript values need not be contiguous. For example, suppose we have an array named list that has 10 elements with the subscripts 0..9.5 Consider the following assignment statement: 5. The subscript range could just as easily have been 1000 .. 1009. list[50] = 42; Now, list has 11 elements and length 51. The elements with subscripts 11..49 are not defined and therefore do not require storage. A reference to a nonexistent element in a JavaScript array yields undefined. Arrays in Python and Ruby can be made to grow only through methods to add elements or catenate other arrays. Ruby and Perl support negative subscripts, but Python does not. In Python an element or slice of an array can be deleted. A reference to a nonexistent element in Python results in a runtime error, whereas a similar reference in Ruby yields nil and no error is reported. Swift dynamic arrays are objects that use integer subscripts, beginning at zero, and include several useful methods. The append method adds an element to the end of an array. The insert method inserts a new element at any position in the array, but results in an error if the insertion is at a subscript beyond the current length of the array. Elements can be removed from an array with the removeAtIndex method. There are also reverse and count methods. Although the ML definition does not include arrays, its widely used implementation, SML/NJ, does. The only predefined collection type that is part of F# is the array (other collection types are provided through the .NET Framework Library). These arrays are like those of C#. A foreach statement is included in the language for array processing. 6.5.4 Array Initialization Some languages provide the means to initialize arrays at the time their storage is allocated. C, C++, Java, Swift, and C# allow initialization of their arrays. Consider the following C declaration: int list [] = {4, 5, 7, 83}; The array list is created and initialized with the values 4, 5, 7, and 83. The compiler also sets the length of the array. This is meant to be a convenience but is not without cost. It effectively removes the possibility that the system could detect some kinds of programmer errors, such as mistakenly leaving a value out of the list. As discussed in Section 6.3.2, character strings in C and C++ are implemented as arrays of char. These arrays can be initialized to string constants, as in char name [] = "freddie"; The array name will have eight elements, because all strings are terminated with a null character (zero), which is implicitly supplied by the system for string constants. Arrays of strings in C and C++ can also be initialized with string literals. For example, char *names [] = {"Bob", "Jake", "Darcie"}; This example illustrates the nature of character literals in C and C++. In the previous example of a string literal being used to initialize the char array name, the literal is taken to be a char array. But in the latter example (names), the literals are taken to be pointers to characters, so the array is an array of pointers to characters. For example, names[0] is a pointer to the letter 'B' in the literal character array that contains the characters 'B', 'o', 'b', and the null character. In Java, similar syntax is used to define and initialize an array of references to String objects. For example, String[] names = ["Bob", "Jake", "Darcie"]; 6.5.5 Array Operations An array operation is one that operates on an array as a unit. The most common array operations are assignment, catenation, comparison for equality and inequality, and slices, which are discussed separately in Section 6.5.5. The C-based languages do not provide any array operations, except through the methods of Java, C++, and C#. Perl supports array assignments but does not support comparisons. Python’s arrays are called lists, although they have all the characteristics of dynamic arrays. Because the objects can be of any types, these arrays are heterogeneous. Python provides array assignment, although it is only a reference change. Python also has operations for array catenation (+) and element membership (in). It includes two different comparison operators: one that determines whether the two variables reference the same object (is) and one that compares all corresponding objects in the referenced objects, regardless of how deeply they are nested, for equality (==). Like Python, the elements of Ruby’s arrays are references to objects. And like Python, when a == operator is used between two arrays, the result is true only if the two arrays have the same length and the corresponding elements are equal. Ruby’s arrays can be catenated with an Array method. F# includes many array operators in its Array module. Among these are Array.append, Array.copy, and Array.length. Arrays and their operations are the heart of APL; it is the most powerful array-processing language ever devised. Because of its relative obscurity and its lack of effect on subsequent languages, however, we present here only a glimpse into its array operations. In APL, the four basic arithmetic operations are defined for vectors (singledimensioned arrays) and matrices, as well as scalar operands. For example, A + B is a valid expression, whether A and B are scalar variables, vectors, or matrices. APL includes a collection of unary operators for vectors and matrices, some of which are as follows (where V is a vector and M is a matrix): APL also includes several special operators that take other operators as operands. One of these is the inner product operator, which is specified with a period (.). It takes two operands, which are binary operators. For example, +.× is a new operator that takes two arguments, either vectors or matrices. It first multiplies the corresponding elements of two arguments, and then it sums the results. For example, if A and B are vectors, A × B is the mathematical inner product of A and B (a vector of the products of the corresponding elements of A and B). The statement A +.× B is the sum of the inner product of A and B. If A and B are matrices, this expression specifies the matrix multiplication of A and B. The special operators of APL are actually functional forms, which are described in Chapter 15. 6.5.6 Rectangular and Jagged Arrays A rectangular array is a multidimensioned array in which all of the rows have the same number of elements and all of the columns have the same number of elements. Rectangular arrays model rectangular tables exactly. A jagged array is one in which the lengths of the rows need not be the same. For example, a jagged matrix may consist of three rows, one with 5 elements, one with 7 elements, and one with 12 elements. This also applies to the columns and higher dimensions. So, if there is a third dimension (layers), each layer can have a different number of elements. Jagged arrays are made possible when multidimensioned arrays are actually arrays of arrays. For example, a matrix would appear as an array of single-dimensioned arrays. C, C++, and Java support jagged arrays but not rectangular arrays. In those languages, a reference to an element of a multidimensioned array uses a separate pair of brackets for each dimension. For example, myArray[3][7] C# and F# support both rectangular and jagged arrays. For rectangular arrays, all subscript expressions in references to elements are placed in a single pair of brackets. For example, myArray[3, 7] 6.5.7 Slices A slice of an array is some substructure of that array. For example, if A is a matrix, then the first row of A is one possible slice, as are the last row and the first column. It is important to realize that a slice is not a new data type. Rather, it is a mechanism for referencing part of an array as a unit. If arrays cannot be manipulated as units in a language, that language has no use for slices. Consider the following Python declarations: vector = [2, 4, 6, 8, 10, 12, 14, 16] mat = [[1, 2, 3],[4, 5, 6],[7, 8, 9]] Recall that the default lower bound for Python arrays is 0. The syntax of a Python slice reference is a pair of numeric expressions separated by a colon. The first is the first subscript of the slice; the second is the first subscript after the last subscript in the slice. Therefore, vector[3:6] is a three-element array with the fourth through sixth elements of vector (those elements with the subscripts 3, 4, and 5). A row of a matrix is specified by giving just one subscript. For example, mat[1] refers to the second row of mat; a part of a row can be specified with the same syntax as a part of a single-dimensioned array. For example, mat[0][0:2] refers to the first and second element of the first row of mat, which is [1, 2]. Python also supports more complex slices of arrays. For example, vector[0:7:2] references every other element of vector, up to but not including the element with the subscript 7, starting with the subscript 0, which is [2, 6, 10, 14]. Perl supports slices of two forms, a list of specific subscripts or a range of subscripts. For example, @list[1..5] = @list2[3, 5, 7, 9, 13]; Notice that slice references use array names, not scalar names, because slices are arrays (not scalars). Ruby supports slices with the slice method of its Array object, which can take three forms of parameters. A single integer expression parameter is interpreted as a subscript, in which case slice returns the element with the given subscript. If slice is given two integer expression parameters, the first is interpreted as a beginning subscript and the second is interpreted as the number of elements in the slice. For example, suppose list is defined as follows: list = [2, 4, 6, 8, 10] returns [6, 8]. The third parameter form for slice is a range, which has the form of an integer expression, two periods, and a second list.slice(2, 2) integer expression. With a range parameter, slice returns an array of the element with the given range of subscripts. For example, list.slice (1..3) returns [4, 6, 8]. 6.5.8 Evaluation Arrays have been included in virtually all programming languages. The primary advances since their introduction in Fortran I have been slices and dynamic arrays. As discussed in Section 6.6, the latest advances in arrays have been in associative arrays. 6.5.9 Implementation of Array Types Implementing arrays requires considerably more compile-time effort than does implementing primitive types. The code to allow accessing of array elements must be generated at compile time. At run time, this code must be executed to produce element addresses. There is no way to precompute the address to be accessed by a reference such as list[k] A single-dimensioned array is implemented as a list of adjacent memory cells. Suppose the array list is defined to have a subscript range lower bound of 0. The access function for list is often of the form address (list[k]) = address (list[0]) + k * element_size where the first operand of the addition is the constant part of the access function, and the second is the variable part. If the element type is statically bound and the array is statically bound to storage, then the value of the constant part can be computed before run time. However, the addition and multiplication operations must be done at run time. The generalization of this access function for an arbitrary lower bound is address (list[k]) = address (list[lower_bound]) + ((k −lower_bound) * element_size) The compile-time descriptor for single-dimensioned arrays can have the form shown in Figure 6.4. The descriptor includes information required to construct the access function. If run-time checking of index ranges is not done and the attributes are all static, then only the access function is required during execution; no descriptor is needed. If run-time checking of index ranges is done, then those index ranges may need to be stored in a run-time descriptor. If the subscript ranges of a particular array type are static, then the ranges may be incorporated into the code that does the checking, thus eliminating the need for the run-time descriptor. If any of the descriptor entries are dynamically bound, then those parts of the descriptor must be maintained at run time. Figure 6.4 Compile-time descriptor for singledimensioned arrays True multidimensional arrays, that is, those that are not arrays of arrays, are more complex to implement than single-dimensioned arrays, although the extension to more dimensions is straightforward. Hardware memory is linear —just a simple sequence of bytes. So values of data types that have two or more dimensions must be mapped onto the single-dimensioned memory. There are two ways in which multidimensional arrays can be mapped to one dimension: row major order and column major order (not used in any widely used language). In row major order, the elements of the array that have as their first subscript the lower bound value of that subscript are stored first, followed by the elements of the second value of the first subscript, and so forth. If the array is a matrix, it is stored by rows. For example, if the matrix had the values 347 625 138 it would be stored in row major order as 3, 4, 7, 6, 2, 5, 1, 3, 8 The access function for a multidimensional array is the mapping of its base address and a set of index values to the address in memory of the element specified by the index values. The access function for two-dimensional arrays stored in row major order can be developed as follows. In general, the address of an element is the base address of the structure plus the element size times the number of elements that precede it in the structure. For a matrix in row major order, the number of elements that precede an element is the number of rows above the element times the size of a row, plus the number of elements to the left of the element in its row. This is illustrated in Figure 6.5, in which we assume that subscript lower bounds are all zero. Figure 6.5 The location of the [i,j] element in a matrix Figure 6.5 Full Alternative Text To get an actual address value, the number of elements that precede the desired element must be multiplied by the element size. Now, the access function can be written as location(a[i,j]) = address of a[0, 0] + ((((number of rows above the ith row) * (size of a row)) + (number of elements left of the jth column)) * element size) Because the number of rows above the ith row is i and the number of elements to the left of the jth column is j, we have location(a[i, j]) = address of a[0, 0] + (((i * n) + j) * element_size) where n is the number of elements per row. The first term is the constant part and the last is the variable part. The generalization to arbitrary lower bounds results in the following access function: location(a[i, j]) = address of a[row_lb, col_lb] = (((i − row_lb) * n) + (j − col_lb)) * element_size where row_lb is the lower bound of the rows and col_lb is the lower bound of the columns. This can be rearranged to the form location(a[i, j]) = address of a[row_lb, col_lb] − (((row_lb * n) + col_lb) * element_size) + (((i * n) + j) * element_size) where the first two terms are the constant part and the last is the variable part. This can be generalized rather easily to an arbitrary number of dimensions. For each dimension of an array, one add and one multiply instruction are required for the access function. Therefore, accesses to elements of arrays with several subscripts are costly. The compile-time descriptor for a multidimensional array is shown in Figure 6.6. Figure 6.6 A compile-time descriptor for a multidimensional array 6.6 Associative Arrays An associative array is an unordered collection of data elements that are indexed by an equal number of values called keys. In the case of non-associative arrays, the indices never need to be stored (because of their regularity). In an associative array, however, the user-defined keys must be stored in the structure. So each element of an associative array is in fact a pair of entities, a key and a value. We use Perl’s design of associative arrays to illustrate this data structure. Associative arrays are also supported directly by Python, Ruby, and Swift and by the standard class libraries of Java, C++, C#, and F#. The only design issue that is specific for associative arrays is the form of references to their elements. 6.6.1 Structure and Operations In Perl, associative arrays are called hashes, because in the implementation their elements are stored and retrieved with hash functions. The namespace for Perl hashes is distinct: Every hash variable name must begin with a percent sign (%). Each hash element consists of two parts: a key, which is a string, and a value, which is a scalar (number, string, or reference). Hashes can be set to literal values with the assignment statement, as in %salaries = ("Gary" => 75000, "Perry" => 57000, "Mary" => 55750, "Cedric" => 47850); Individual element values are referenced using notation that is similar to that used for Perl arrays. The key value is placed in braces and the hash name is replaced by a scalar variable name that is the same except for the first character. Although hashes are not scalars, the value parts of hash elements are scalars, so references to hash element values use scalar names. Recall that scalar variable names begin with dollar signs ($). So, an assignment of 58850 to the element of %salaries with the key "Perry" would appear as follows: $salaries{"Perry"} = 58850; A new element is added using the same assignment statement form. An element can be removed from the hash with the delete operator, as in the following: delete $salaries{"Gary"}; The entire hash can be emptied by assigning the empty literal to it, as in the following: @salaries = (); The size of a Perl hash is dynamic: It grows when an element is added and shrinks when an element is deleted, and also when it is emptied by assignment of the empty literal. The exists operator returns true or false, depending on whether its operand key is an element in the hash. For example, if (exists $salaries{"Shelly"}) . . . The keys operator, when applied to a hash, returns an array of the keys of the hash. The values operator does the same for the values of the hash. The each operator iterates over the element pairs of a hash. Python’s associative arrays, which are called dictionaries, are similar to those of Perl, except the values are all references to objects. The associative arrays supported by Ruby are similar to those of Python, except that the keys can be any object,6 rather than just strings. There is a progression from Perl’s hashes, in which the keys must be strings, to PHP’s arrays, in which the keys can be integers or strings, to Ruby’s hashes, in which any type object can be a key. 6. Objects that change do not make good keys, because the changes could change the hash function value. Therefore, arrays and hashes are never used as keys. PHP’s arrays are both normal arrays and associative arrays. They can be treated as either. The language provides functions that allow both indexed and hashed access to elements. An array can have elements that are created with simple numeric indices and elements that are created with string hash keys. Swift’s associative arrays are called dictionaries. The keys can be of one specific type, but the values can be of mixed types, in which case they are objects. An associative array is much better than an array if searches of the elements are required, because the implicit hashing operation used to access elements is very efficient. Furthermore, associative arrays are ideal when the data to be stored is paired, as with employee names and their salaries. On the other hand, if every element of a list must be processed, it is more efficient to use an array. 6.6.2 Implementing Associative Arrays The implementation of Perl’s associative arrays is optimized for fast lookups, but it also provides relatively fast reorganization when array growth requires it. A 32-bit hash value is computed for each entry and is stored with the entry, although an associative array initially uses only a small part of the hash value. When an associative array must be expanded beyond its initial size, the hash function need not be changed; rather, more bits of the hash value are used. Only half of the entries must be moved when this happens. So, although expansion of an associative array is not free, it is not as costly as might be expected. The elements in PHP’s arrays are placed in memory through a hash function. However, all elements are linked together in the order in which they were created. The links are used to support iterative access to elements through the current and next functions. 6.7 Record Types A record is an aggregate of data elements in which the individual elements are identified by names and accessed through offsets from the beginning of the structure. There is frequently a need in programs to model a collection of data in which the individual elements are not of the same type or size. For example, information about a college student might include name, student number, grade point average, and so forth. A data type for such a collection might use a character string for the name, an integer for the student number, a floatingpoint for the grade point average, and so forth. Records are designed for this kind of need. It may appear that records and heterogeneous arrays are the same, but that is not the case. The elements of a heterogeneous array are all references to data objects that reside in scattered locations, often on the heap. The elements of a record are of potentially different sizes and reside in adjacent memory locations. Records have been part of all of the most popular programming languages, except pre-90 versions of Fortran, since the early 1960s, when they were introduced by COBOL. In some languages that support object-oriented programming, data classes serve as records. In C, C++, C#, and Swift, records are supported with the struct data type. In C++, structures are a minor variation on classes. In C#, structs also are related to classes, but are quite different from them. C# structs are stack-allocated value types, as opposed to class objects, which are heap-allocated reference types. Structs in C++ and C# are normally used as encapsulation structures, rather than data structures. They are further discussed in this capacity in Chapter 11. Structs are also included in ML and F#. In Python and Ruby, records can be implemented as hashes, which themselves can be elements of arrays. The following sections describe how records are declared or defined, how references to fields within records are made, and the common record operations. The following design issues are specific to records: What is the syntactic form of references to fields? Are elliptical references allowed? 6.7.1 Definitions of Records The fundamental difference between a record and an array is that record elements, or fields, are not referenced by indices. Instead, the fields are named with identifiers, and references to the fields are made using these identifiers. Another difference between arrays and records is that records in some languages are allowed to include unions, which are discussed in Section 6.10. The COBOL form of a record declaration, which is part of the data division of a COBOL program, is illustrated in the following example: 01 EMPLOYEE-RECORD. 02 EMPLOYEE-NAME. 05 FIRST PICTURE 05 Middle PICTURE 05 LAST PICTURE 02 HOURLY-RATE PICTURE IS IS IS IS X(20). X(10). X(20). 99V99. The EMPLOYEE-RECORD record consists of the EMPLOYEE-NAME record and the HOURLY-RATE field. The numerals 01, 02, and 05 that begin the lines of the record declaration are level numbers, which indicate by their relative values the hierarchical structure of the record. Any line that is followed by a line with a higher-level number is itself a record. The PICTURE clauses show the formats of the field storage locations, with X(20) specifying 20 alphanumeric characters and 99V99 specifying four decimal digits with the decimal point in the middle. In Java, records can be defined as data classes, with nested records defined as nested classes. Data members of such classes serve as the record fields. 6.7.2 References to Record Fields References to the individual fields of records are syntactically specified by several different methods, two of which name the desired field and its enclosing records. COBOL field references have the form field_name OF record_name_1 OF . . . OF record_name_n where the first record named is the smallest or innermost record that contains the field. The next record name in the sequence is that of the record that contains the previous record, and so forth. For example, the Middle field in the COBOL record example above can be referenced with Middle OF EMPLOYEE-NAME OF EMPLOYEE-RECORD Most of the other languages use dot notation for field references, where the components of the reference are connected with periods. Names in dot notation have the opposite order of COBOL references: They use the name of the largest enclosing record first and the field name last. For example, if Middle is a field in the Employee_Name record which is embedded in the Employee_Record record, it would be referenced with the following: Employee_Record.Employee_Name.Middle A fully qualified reference to a record field is one in which all intermediate record names, from the largest enclosing record to the specific field, are named in the reference. In the COBOL example above the field reference is fully qualified. As an alternative to fully qualified references, COBOL allows elliptical references to record fields. In an elliptical reference, the field is named, but any or all of the enclosing record names can be omitted, as long as the resulting reference is unambiguous in the referencing environment. For example, FIRST, FIRST OF EMPLOYEE-NAME, and FIRST OF EMPLOYEE-RECORD are elliptical references to the employee’s first name in the COBOL record declared above. Although elliptical references are a programmer convenience, they require a compiler to have elaborate data structures and procedures in order to correctly identify the referenced field. They are also somewhat detrimental to readability. 6.7.3 Evaluation Records are frequently valuable data types in programming languages. The design of record types is straightforward, and their use is safe. Records and arrays are closely related structural forms, and therefore it is interesting to compare them. Arrays are used when all the data values have the same type and/or are processed in the same way. This processing is easily done when there is a systematic way of sequencing through the structure. Such processing is well supported by using dynamic subscripting as the addressing method. Records are used when the collection of data values is heterogeneous and the different fields are not processed in the same way. Also, the fields of a record often need not be processed in a particular order. Field names are like literal, or constant, subscripts. Because they are static, they provide very efficient access to the fields. Dynamic subscripts could be used to access record fields, but it would disallow type checking and would also be slower. Records and arrays represent thoughtful and efficient methods of fulfilling two separate but related applications of data structures. 6.7.4 Implementation of Record Types The fields of records are stored in adjacent memory locations. But because the sizes of the fields are not necessarily the same, the access method used for arrays is not used for records. Instead, the offset address, relative to the beginning of the record, is associated with each field. Field accesses are all handled using these offsets. The compile-time descriptor for a record has the general form shown in Figure 6.7. Run-time descriptors for records are unnecessary. Figure 6.7 A compile-time descriptor for a record Figure 6.7 Full Alternative Text 6.8 Tuple Types A tuple is a data type that is similar to a record, except that the elements are not named. Python includes an immutable tuple type. If a tuple needs to be changed, it can be converted to an array with the list function. After the change, it can be converted back to a tuple with the tuple function. One use of tuples is when an array must be write protected, such as when it is sent as a parameter to an external function and the user does not want the function to be able to modify the parameter. Python’s tuples are closely related to its lists, except that tuples are immutable. A tuple is created by assigning a tuple literal, as in the following example: myTuple = (3, 5.8, 'apple') Notice that the elements of a tuple need not be of the same type. The elements of a tuple can be referenced with indexing in brackets, as in the following: myTuple[1] This references the first element of the tuple, because tuple indexing begins at 1. Tuples can be catenated with the plus (+) operator. They can be deleted with the del statement. There are also other operators and functions that operate on tuples. ML includes a tuple data type. An ML tuple must have at least two elements, whereas Python’s tuples can be empty or contain one element. As in Python, an ML tuple can include elements of mixed types. The following statement creates a tuple: val myTuple = (3, 5.8, 'apple'); The syntax of a tuple element access is as follows: #1(myTuple); This references the first element of the tuple. A new tuple type can be defined in ML with a type declaration, such as the following: type intReal = int * real; Values of this type consist of an integer and a real. The asterisk can be misleading. It is used to separate the tuple components, indicating a type product, and has nothing to do with arithmetic. F# also has tuples. A tuple is created by assigning a tuple value, which is a list of expressions separated by commas and delimited by parentheses, to a name in a let statement. If a tuple has two elements, they can be referenced with the functions fst and snd, respectively. The elements of a tuple with more than two elements are often referenced with a tuple pattern on the left side of a let statement. A tuple pattern is simply a sequence of names, one for each element of the tuple, with or without the delimiting parentheses. When a tuple pattern is the left side of a let construct, it is a multiple assignment. For example, consider the following let constructs: let tup = (3, 5, 7);; let a, b, c = tup;; This assigns 3 to a, 5 to b, and 7 to c. Tuples are used in Python, ML, and F# to allow functions to return multiple values. In Swift, tuples are passed by value, so they are sometimes used to pass data to a function when the function is not to change that data. 6.9 List Types Lists were first supported in the first functional programming language, Lisp. They have always been part of the functional languages, but in recent years they have found their way into some imperative languages. Lists in Scheme and Common Lisp are delimited by parentheses and the elements are not separated by any punctuation. For example, (A B C D) Nested lists have the same form, so we could have (A (B C) D) In this list, (B C) is a list nested inside the outer list. Data and code have the same syntactic form in Lisp and its descendants. If the list (A B C) is interpreted as code, it is a call to the function A with parameters B and C. The fundamental list operations in Scheme are two functions that take lists apart and two that build lists. The CAR function returns the first element of its list parameter. For example, consider the following example: (CAR '(A B C)) The quote before the parameter list is to prevent the interpreter from considering the list a call to the A function with the parameters B and C, in which case it would interpret it. This call to CAR returns A. The CDR function returns its parameter list minus its first element. For example, consider the following example: (CDR '(A B C)) This function call returns the list (B C). Common Lisp also has the functions FIRST (same as CAR), SECOND, . . . , TENTH, which return the element of their list parameters that is specified by their names. In Scheme and Common Lisp, new lists are constructed with the CONS and LIST functions. The function CONS takes two parameters and returns a new list with its first parameter as the first element and its second parameter as the remainder of that list. For example, consider the following: (CONS 'A '(B C)) This call returns the new list (A B C). The LIST function takes any number of parameters and returns a new list with the parameters as its elements. For example, consider the following call to LIST: (LIST 'A 'B '(C D)) This call returns the new list (A B (C D)). ML has lists and list operations, although their appearance is not like those of Scheme. Lists are specified in square brackets, with the elements separated by commas, as in the following list of integers: [5, 7, 9] [] is the empty list, which could also be specified with nil. The Scheme CONS function is implemented as a binary infix operator in ML, represented as ::. For example, 3 :: [5, 7, 9] returns the following new list: [3, 5, 7, 9]. The elements of a list must be of the same type, so the following list would be illegal: [5, 7.3, 9] ML has functions that correspond to Scheme’s CAR and CDR, named hd (head) and tl (tail). For example, hd [5, 7, 9] is 5 tl [5, 7, 9] is [7, 9] Lists and list operations in Scheme and ML are more fully discussed in Chapter 15. Lists in F# are related to those of ML with a few notable differences. Elements of a list in F# are separated by semicolons, rather than the commas of ML. The operations hd and tl are the same, but they are called as methods of the List class, as in List.hd [1; 3; 5; 7], which returns 1. The CONS operation of F# is specified as two colons, as in ML. Python includes a list data type, which also serves as Python’s arrays. Unlike the lists of Scheme, Common Lisp, ML, and F#, the lists of Python are mutable. They can contain any data value or object. A Python list is created with an assignment of a list value to a name. A list value is a sequence of expressions that are separated by commas and delimited with brackets. For example, consider the following statement: myList = [3, 5.8, "grape"] The elements of a list are referenced with subscripts in brackets, as in the following example: x = myList[1] This statement assigns 5.8 to x. The elements of a list are indexed starting at zero. List elements also can be updated by assignment. A list element can be deleted with del, as in the following statement: del myList[1] This statement removes the second element of myList. Python includes a powerful mechanism for creating arrays called list comprehensions. A list comprehension is an idea derived from set notation. It first appeared in the functional programming language Haskell (see Chapter 15). The mechanics of a list comprehension is that a function is applied to each of the elements of a given array and a new array is constructed from the results. The syntax of a Python list comprehension is as follows: [expression for iterate_var in array if condition] Consider the following example: [x * x for x in range(12) if x % 3 == 0] The range function creates the array [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. The conditional filters out all numbers in the array that are not evenly divisible by 3. Then, the expression squares the remaining numbers. The results of the squaring are collected in an array, which is returned. This list comprehension returns the following array: [0, 9, 36, 81] Slices of lists are also supported in Python. Haskell’s list comprehensions have the following form: [body | qualifiers] For example, consider the following definition of a list: [n * n | n <- [1..10]] This defines a list of the squares of the numbers from 1 to 10. F# includes list comprehensions, which in that language can also be used to create arrays. For example, consider the following statement: let myArray = [|for i in 1 .. 5 -> (i * i) |];; This statement creates the array [1; 4; 9; 16; 25] and names it myArray. Recall from Section 6.5 that C# and Java support generic heap-dynamic collection classes, List and ArrayList, respectively. These structures are actually lists. 6.10 Union Types A union is a type whose variables may store different type values at different times during program execution. As an example of the need for a union type, consider a table of constants for a compiler, which is used to store the constants found in a program being compiled. One field of each table entry is for the value of the constant. Suppose that for a particular language being compiled, the types of constants were integer, floating-point, and Boolean. In terms of table management, it would be convenient if the same location, a table field, could store a value of any of these three types. Then all constant values could be addressed in the same way. The type of such a location is, in a sense, the union of the three value types it can store. 6.10.1 Design Issues The problem of type checking union types, which is discussed in Section 6.12, is their major design issue. 6.10.2 Discriminated Versus Free Unions C and C++ provide union constructs in which there is no language support for type checking. In C and C++, the union construct is used to specify union structures. The unions in these languages are called free unions, because programmers are allowed complete freedom from type checking in their use. For example, consider the following C union: union flexType { int intEl; float floatEl; }; union flexType el1; float x; . . . el1.intEl = 27; x = el1.floatEl; This last assignment is not type checked, because the system cannot determine the current type of the current value of el1, so it assigns the bit string representation of 27 to the float variable x, which, of course, is nonsense. Type checking of unions requires that each union construct include a type indicator. Such an indicator is called a tag, or discriminant, and a union with a discriminant is called a discriminated union. The first language to provide discriminated unions was ALGOL 68. They are now supported by ML, Haskell, and F#. 6.10.3 Unions in F# A union is declared in F# with a type statement using OR operators (|) to define the components. For example, we could have the following: type intReal = | IntValue of int | RealValue of float;; In this example, intReal is the union type. IntValue and RealValue are constructors. Values of type intReal can be created using the constructors as if they were a function, as in the following examples:7 7. The let statement is used to assign values to names and to create a static scope; the double semicolons are used to terminate statements when the F# interactive interpreter is being used. let ir1 = IntValue 17;; let ir2 = RealValue 3.4;; Accessing the value of a union is done with a pattern-matching structure. Pattern matching in F# is specified with the match reserved word. The general form of the construct is as follows: match | pattern with expression_list1−>expression1 | . . . | expression_listn−>expressionn The pattern can be any data type. The expression list can include wild card characters (_) or be solely a wild card character. For example, consider the following match construct: let a = 7;; let b = "grape";; let x = match (a, b) with | 4, "apple" -> apple | _, "grape" -> grape | _ -> fruit;; To display the type of the intReal union, the following function could be used: let printType value = match value with | IntValue value -> printfn "It is an integer" | RealValue value -> printfn "It is a float";; The following lines show calls to this function and the output: printType ir1;; It is an integer printType ir2;; It is a float 6.10.4 Evaluation Unions are potentially unsafe constructs in some languages. They are one of the reasons why C and C++ are not strongly typed: These languages do not allow type checking of references to their unions. On the other hand, unions can be safely used, as in their design in ML, Haskell, and F#. Neither Java nor C# includes unions, which may be reflective of the growing concern for safety in some programming languages. 6.10.5 Implementation of Union Types Unions are implemented by simply using the same address for every possible variant. Sufficient storage for the largest variant is allocated. 6.11 Pointer and Reference Types A pointer type is one in which the variables have a range of values that consists of memory addresses and a special value, nil. The value nil is not a valid address and is used to indicate that a pointer cannot currently be used to reference a memory cell. Pointers are designed for two distinct kinds of uses. First, pointers provide some of the power of indirect addressing, which is frequently used in assembly language programming. Second, pointers provide a way to manage dynamic storage. A pointer can be used to access a location in an area where storage is dynamically allocated called a heap. Variables that are dynamically allocated from the heap are called heap-dynamic variables. They often do not have identifiers associated with them and thus can be referenced only by pointe

Document Details

Tags

Related

Full Transcript