PLP6 copy 2.pdf

6.6 Associative Arrays An associative array is an unordered collection of data elements that are indexed by an equal number of values called keys. In the case of non-associative arrays, the indices never need to be stored (because of their regularity). In an associative array, however, the user-defined keys must be stored in the structure. So each element of an associative array is in fact a pair of entities, a key and a value. We use Perl’s design of associative arrays to illustrate this data structure. Associative arrays are also supported directly by Python, Ruby, and Swift and by the standard class libraries of Java, C++, C#, and F#. The only design issue that is specific for associative arrays is the form of references to their elements. 6.6.1 Structure and Operations In Perl, associative arrays are called hashes, because in the implementation their elements are stored and retrieved with hash functions. The namespace for Perl hashes is distinct: Every hash variable name must begin with a percent sign (%). Each hash element consists of two parts: a key, which is a string, and a value, which is a scalar (number, string, or reference). Hashes can be set to literal values with the assignment statement, as in %salaries = ("Gary" => 75000, "Perry" => 57000, "Mary" => 55750, "Cedric" => 47850); Individual element values are referenced using notation that is similar to that used for Perl arrays. The key value is placed in braces and the hash name is replaced by a scalar variable name that is the same except for the first character. Although hashes are not scalars, the value parts of hash elements are scalars, so references to hash element values use scalar names. Recall that scalar variable names begin with dollar signs ($). So, an assignment of 58850 to the element of %salaries with the key "Perry" would appear as follows: $salaries{"Perry"} = 58850; A new element is added using the same assignment statement form. An element can be removed from the hash with the delete operator, as in the following: delete $salaries{"Gary"}; The entire hash can be emptied by assigning the empty literal to it, as in the following: @salaries = (); The size of a Perl hash is dynamic: It grows when an element is added and shrinks when an element is deleted, and also when it is emptied by assignment of the empty literal. The exists operator returns true or false, depending on whether its operand key is an element in the hash. For example, if (exists $salaries{"Shelly"}) . . . The keys operator, when applied to a hash, returns an array of the keys of the hash. The values operator does the same for the values of the hash. The each operator iterates over the element pairs of a hash. Python’s associative arrays, which are called dictionaries, are similar to those of Perl, except the values are all references to objects. The associative arrays supported by Ruby are similar to those of Python, except that the keys can be any object,6 rather than just strings. There is a progression from Perl’s hashes, in which the keys must be strings, to PHP’s arrays, in which the keys can be integers or strings, to Ruby’s hashes, in which any type object can be a key. 6. Objects that change do not make good keys, because the changes could change the hash function value. Therefore, arrays and hashes are never used as keys. PHP’s arrays are both normal arrays and associative arrays. They can be treated as either. The language provides functions that allow both indexed and hashed access to elements. An array can have elements that are created with simple numeric indices and elements that are created with string hash keys. Swift’s associative arrays are called dictionaries. The keys can be of one specific type, but the values can be of mixed types, in which case they are objects. An associative array is much better than an array if searches of the elements are required, because the implicit hashing operation used to access elements is very efficient. Furthermore, associative arrays are ideal when the data to be stored is paired, as with employee names and their salaries. On the other hand, if every element of a list must be processed, it is more efficient to use an array. 6.6.2 Implementing Associative Arrays The implementation of Perl’s associative arrays is optimized for fast lookups, but it also provides relatively fast reorganization when array growth requires it. A 32-bit hash value is computed for each entry and is stored with the entry, although an associative array initially uses only a small part of the hash value. When an associative array must be expanded beyond its initial size, the hash function need not be changed; rather, more bits of the hash value are used. Only half of the entries must be moved when this happens. So, although expansion of an associative array is not free, it is not as costly as might be expected. The elements in PHP’s arrays are placed in memory through a hash function. However, all elements are linked together in the order in which they were created. The links are used to support iterative access to elements through the current and next functions. 6.7 Record Types A record is an aggregate of data elements in which the individual elements are identified by names and accessed through offsets from the beginning of the structure. There is frequently a need in programs to model a collection of data in which the individual elements are not of the same type or size. For example, information about a college student might include name, student number, grade point average, and so forth. A data type for such a collection might use a character string for the name, an integer for the student number, a floatingpoint for the grade point average, and so forth. Records are designed for this kind of need. It may appear that records and heterogeneous arrays are the same, but that is not the case. The elements of a heterogeneous array are all references to data objects that reside in scattered locations, often on the heap. The elements of a record are of potentially different sizes and reside in adjacent memory locations. Records have been part of all of the most popular programming languages, except pre-90 versions of Fortran, since the early 1960s, when they were introduced by COBOL. In some languages that support object-oriented programming, data classes serve as records. In C, C++, C#, and Swift, records are supported with the struct data type. In C++, structures are a minor variation on classes. In C#, structs also are related to classes, but are quite different from them. C# structs are stack-allocated value types, as opposed to class objects, which are heap-allocated reference types. Structs in C++ and C# are normally used as encapsulation structures, rather than data structures. They are further discussed in this capacity in Chapter 11. Structs are also included in ML and F#. In Python and Ruby, records can be implemented as hashes, which themselves can be elements of arrays. The following sections describe how records are declared or defined, how references to fields within records are made, and the common record operations. The following design issues are specific to records: What is the syntactic form of references to fields? Are elliptical references allowed? 6.7.1 Definitions of Records The fundamental difference between a record and an array is that record elements, or fields, are not referenced by indices. Instead, the fields are named with identifiers, and references to the fields are made using these identifiers. Another difference between arrays and records is that records in some languages are allowed to include unions, which are discussed in Section 6.10. The COBOL form of a record declaration, which is part of the data division of a COBOL program, is illustrated in the following example: 01 EMPLOYEE-RECORD. 02 EMPLOYEE-NAME. 05 FIRST PICTURE 05 Middle PICTURE 05 LAST PICTURE 02 HOURLY-RATE PICTURE IS IS IS IS X(20). X(10). X(20). 99V99. The EMPLOYEE-RECORD record consists of the EMPLOYEE-NAME record and the HOURLY-RATE field. The numerals 01, 02, and 05 that begin the lines of the record declaration are level numbers, which indicate by their relative values the hierarchical structure of the record. Any line that is followed by a line with a higher-level number is itself a record. The PICTURE clauses show the formats of the field storage locations, with X(20) specifying 20 alphanumeric characters and 99V99 specifying four decimal digits with the decimal point in the middle. In Java, records can be defined as data classes, with nested records defined as nested classes. Data members of such classes serve as the record fields. 6.7.2 References to Record Fields References to the individual fields of records are syntactically specified by several different methods, two of which name the desired field and its enclosing records. COBOL field references have the form field_name OF record_name_1 OF . . . OF record_name_n where the first record named is the smallest or innermost record that contains the field. The next record name in the sequence is that of the record that contains the previous record, and so forth. For example, the Middle field in the COBOL record example above can be referenced with Middle OF EMPLOYEE-NAME OF EMPLOYEE-RECORD Most of the other languages use dot notation for field references, where the components of the reference are connected with periods. Names in dot notation have the opposite order of COBOL references: They use the name of the largest enclosing record first and the field name last. For example, if Middle is a field in the Employee_Name record which is embedded in the Employee_Record record, it would be referenced with the following: Employee_Record.Employee_Name.Middle A fully qualified reference to a record field is one in which all intermediate record names, from the largest enclosing record to the specific field, are named in the reference. In the COBOL example above the field reference is fully qualified. As an alternative to fully qualified references, COBOL allows elliptical references to record fields. In an elliptical reference, the field is named, but any or all of the enclosing record names can be omitted, as long as the resulting reference is unambiguous in the referencing environment. For example, FIRST, FIRST OF EMPLOYEE-NAME, and FIRST OF EMPLOYEE-RECORD are elliptical references to the employee’s first name in the COBOL record declared above. Although elliptical references are a programmer convenience, they require a compiler to have elaborate data structures and procedures in order to correctly identify the referenced field. They are also somewhat detrimental to readability. 6.7.3 Evaluation Records are frequently valuable data types in programming languages. The design of record types is straightforward, and their use is safe. Records and arrays are closely related structural forms, and therefore it is interesting to compare them. Arrays are used when all the data values have the same type and/or are processed in the same way. This processing is easily done when there is a systematic way of sequencing through the structure. Such processing is well supported by using dynamic subscripting as the addressing method. Records are used when the collection of data values is heterogeneous and the different fields are not processed in the same way. Also, the fields of a record often need not be processed in a particular order. Field names are like literal, or constant, subscripts. Because they are static, they provide very efficient access to the fields. Dynamic subscripts could be used to access record fields, but it would disallow type checking and would also be slower. Records and arrays represent thoughtful and efficient methods of fulfilling two separate but related applications of data structures. 6.7.4 Implementation of Record Types The fields of records are stored in adjacent memory locations. But because the sizes of the fields are not necessarily the same, the access method used for arrays is not used for records. Instead, the offset address, relative to the beginning of the record, is associated with each field. Field accesses are all handled using these offsets. The compile-time descriptor for a record has the general form shown in Figure 6.7. Run-time descriptors for records are unnecessary. Figure 6.7 A compile-time descriptor for a record Figure 6.7 Full Alternative Text 6.8 Tuple Types A tuple is a data type that is similar to a record, except that the elements are not named. Python includes an immutable tuple type. If a tuple needs to be changed, it can be converted to an array with the list function. After the change, it can be converted back to a tuple with the tuple function. One use of tuples is when an array must be write protected, such as when it is sent as a parameter to an external function and the user does not want the function to be able to modify the parameter. Python’s tuples are closely related to its lists, except that tuples are immutable. A tuple is created by assigning a tuple literal, as in the following example: myTuple = (3, 5.8, 'apple') Notice that the elements of a tuple need not be of the same type. The elements of a tuple can be referenced with indexing in brackets, as in the following: myTuple[1] This references the first element of the tuple, because tuple indexing begins at 1. Tuples can be catenated with the plus (+) operator. They can be deleted with the del statement. There are also other operators and functions that operate on tuples. ML includes a tuple data type. An ML tuple must have at least two elements, whereas Python’s tuples can be empty or contain one element. As in Python, an ML tuple can include elements of mixed types. The following statement creates a tuple: val myTuple = (3, 5.8, 'apple'); The syntax of a tuple element access is as follows: #1(myTuple); This references the first element of the tuple. A new tuple type can be defined in ML with a type declaration, such as the following: type intReal = int * real; Values of this type consist of an integer and a real. The asterisk can be misleading. It is used to separate the tuple components, indicating a type product, and has nothing to do with arithmetic. F# also has tuples. A tuple is created by assigning a tuple value, which is a list of expressions separated by commas and delimited by parentheses, to a name in a let statement. If a tuple has two elements, they can be referenced with the functions fst and snd, respectively. The elements of a tuple with more than two elements are often referenced with a tuple pattern on the left side of a let statement. A tuple pattern is simply a sequence of names, one for each element of the tuple, with or without the delimiting parentheses. When a tuple pattern is the left side of a let construct, it is a multiple assignment. For example, consider the following let constructs: let tup = (3, 5, 7);; let a, b, c = tup;; This assigns 3 to a, 5 to b, and 7 to c. Tuples are used in Python, ML, and F# to allow functions to return multiple values. In Swift, tuples are passed by value, so they are sometimes used to pass data to a function when the function is not to change that data. 6.9 List Types Lists were first supported in the first functional programming language, Lisp. They have always been part of the functional languages, but in recent years they have found their way into some imperative languages. Lists in Scheme and Common Lisp are delimited by parentheses and the elements are not separated by any punctuation. For example, (A B C D) Nested lists have the same form, so we could have (A (B C) D) In this list, (B C) is a list nested inside the outer list. Data and code have the same syntactic form in Lisp and its descendants. If the list (A B C) is interpreted as code, it is a call to the function A with parameters B and C. The fundamental list operations in Scheme are two functions that take lists apart and two that build lists. The CAR function returns the first element of its list parameter. For example, consider the following example: (CAR '(A B C)) The quote before the parameter list is to prevent the interpreter from considering the list a call to the A function with the parameters B and C, in which case it would interpret it. This call to CAR returns A. The CDR function returns its parameter list minus its first element. For example, consider the following example: (CDR '(A B C)) This function call returns the list (B C). Common Lisp also has the functions FIRST (same as CAR), SECOND, . . . , TENTH, which return the element of their list parameters that is specified by their names. In Scheme and Common Lisp, new lists are constructed with the CONS and LIST functions. The function CONS takes two parameters and returns a new list with its first parameter as the first element and its second parameter as the remainder of that list. For example, consider the following: (CONS 'A '(B C)) This call returns the new list (A B C). The LIST function takes any number of parameters and returns a new list with the parameters as its elements. For example, consider the following call to LIST: (LIST 'A 'B '(C D)) This call returns the new list (A B (C D)). ML has lists and list operations, although their appearance is not like those of Scheme. Lists are specified in square brackets, with the elements separated by commas, as in the following list of integers: [5, 7, 9] [] is the empty list, which could also be specified with nil. The Scheme CONS function is implemented as a binary infix operator in ML, represented as ::. For example, 3 :: [5, 7, 9] returns the following new list: [3, 5, 7, 9]. The elements of a list must be of the same type, so the following list would be illegal: [5, 7.3, 9] ML has functions that correspond to Scheme’s CAR and CDR, named hd (head) and tl (tail). For example, hd [5, 7, 9] is 5 tl [5, 7, 9] is [7, 9] Lists and list operations in Scheme and ML are more fully discussed in Chapter 15. Lists in F# are related to those of ML with a few notable differences. Elements of a list in F# are separated by semicolons, rather than the commas of ML. The operations hd and tl are the same, but they are called as methods of the List class, as in List.hd [1; 3; 5; 7], which returns 1. The CONS operation of F# is specified as two colons, as in ML. Python includes a list data type, which also serves as Python’s arrays. Unlike the lists of Scheme, Common Lisp, ML, and F#, the lists of Python are mutable. They can contain any data value or object. A Python list is created with an assignment of a list value to a name. A list value is a sequence of expressions that are separated by commas and delimited with brackets. For example, consider the following statement: myList = [3, 5.8, "grape"] The elements of a list are referenced with subscripts in brackets, as in the following example: x = myList[1] This statement assigns 5.8 to x. The elements of a list are indexed starting at zero. List elements also can be updated by assignment. A list element can be deleted with del, as in the following statement: del myList[1] This statement removes the second element of myList. Python includes a powerful mechanism for creating arrays called list comprehensions. A list comprehension is an idea derived from set notation. It first appeared in the functional programming language Haskell (see Chapter 15). The mechanics of a list comprehension is that a function is applied to each of the elements of a given array and a new array is constructed from the results. The syntax of a Python list comprehension is as follows: [expression for iterate_var in array if condition] Consider the following example: [x * x for x in range(12) if x % 3 == 0] The range function creates the array [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. The conditional filters out all numbers in the array that are not evenly divisible by 3. Then, the expression squares the remaining numbers. The results of the squaring are collected in an array, which is returned. This list comprehension returns the following array: [0, 9, 36, 81] Slices of lists are also supported in Python. Haskell’s list comprehensions have the following form: [body | qualifiers] For example, consider the following definition of a list: [n * n | n <- [1..10]] This defines a list of the squares of the numbers from 1 to 10. F# includes list comprehensions, which in that language can also be used to create arrays. For example, consider the following statement: let myArray = [|for i in 1 .. 5 -> (i * i) |];; This statement creates the array [1; 4; 9; 16; 25] and names it myArray. Recall from Section 6.5 that C# and Java support generic heap-dynamic collection classes, List and ArrayList, respectively. These structures are actually lists. 6.10 Union Types A union is a type whose variables may store different type values at different times during program execution. As an example of the need for a union type, consider a table of constants for a compiler, which is used to store the constants found in a program being compiled. One field of each table entry is for the value of the constant. Suppose that for a particular language being compiled, the types of constants were integer, floating-point, and Boolean. In terms of table management, it would be convenient if the same location, a table field, could store a value of any of these three types. Then all constant values could be addressed in the same way. The type of such a location is, in a sense, the union of the three value types it can store. 6.10.1 Design Issues The problem of type checking union types, which is discussed in Section 6.12, is their major design issue. 6.10.2 Discriminated Versus Free Unions C and C++ provide union constructs in which there is no language support for type checking. In C and C++, the union construct is used to specify union structures. The unions in these languages are called free unions, because programmers are allowed complete freedom from type checking in their use. For example, consider the following C union: union flexType { int intEl; float floatEl; }; union flexType el1; float x; . . . el1.intEl = 27; x = el1.floatEl; This last assignment is not type checked, because the system cannot determine the current type of the current value of el1, so it assigns the bit string representation of 27 to the float variable x, which, of course, is nonsense. Type checking of unions requires that each union construct include a type indicator. Such an indicator is called a tag, or discriminant, and a union with a discriminant is called a discriminated union. The first language to provide discriminated unions was ALGOL 68. They are now supported by ML, Haskell, and F#. 6.10.3 Unions in F# A union is declared in F# with a type statement using OR operators (|) to define the components. For example, we could have the following: type intReal = | IntValue of int | RealValue of float;; In this example, intReal is the union type. IntValue and RealValue are constructors. Values of type intReal can be created using the constructors as if they were a function, as in the following examples:7 7. The let statement is used to assign values to names and to create a static scope; the double semicolons are used to terminate statements when the F# interactive interpreter is being used. let ir1 = IntValue 17;; let ir2 = RealValue 3.4;; Accessing the value of a union is done with a pattern-matching structure. Pattern matching in F# is specified with the match reserved word. The general form of the construct is as follows: match | pattern with expression_list1−>expression1 | . . . | expression_listn−>expressionn The pattern can be any data type. The expression list can include wild card characters (_) or be solely a wild card character. For example, consider the following match construct: let a = 7;; let b = "grape";; let x = match (a, b) with | 4, "apple" -> apple | _, "grape" -> grape | _ -> fruit;; To display the type of the intReal union, the following function could be used: let printType value = match value with | IntValue value -> printfn "It is an integer" | RealValue value -> printfn "It is a float";; The following lines show calls to this function and the output: printType ir1;; It is an integer printType ir2;; It is a float 6.10.4 Evaluation Unions are potentially unsafe constructs in some languages. They are one of the reasons why C and C++ are not strongly typed: These languages do not allow type checking of references to their unions. On the other hand, unions can be safely used, as in their design in ML, Haskell, and F#. Neither Java nor C# includes unions, which may be reflective of the growing concern for safety in some programming languages. 6.10.5 Implementation of Union Types Unions are implemented by simply using the same address for every possible variant. Sufficient storage for the largest variant is allocated. 6.11 Pointer and Reference Types A pointer type is one in which the variables have a range of values that consists of memory addresses and a special value, nil. The value nil is not a valid address and is used to indicate that a pointer cannot currently be used to reference a memory cell. Pointers are designed for two distinct kinds of uses. First, pointers provide some of the power of indirect addressing, which is frequently used in assembly language programming. Second, pointers provide a way to manage dynamic storage. A pointer can be used to access a location in an area where storage is dynamically allocated called a heap. Variables that are dynamically allocated from the heap are called heap-dynamic variables. They often do not have identifiers associated with them and thus can be referenced only by pointer or reference type variables. Variables without names are called anonymous variables. It is in this latter application area of pointers that the most important design issues arise. Pointers, unlike arrays and records, are not structured types, although they are defined using a type operator (* in C and C++). Furthermore, they are also different from scalar variables because they are used to reference some other variable, rather than being used to store data. These two categories of variables are called reference types and value types, respectively. Both kinds of uses of pointers add writability to a language. For example, suppose it is necessary to implement a dynamic structure like a binary tree in a language that does not have pointers or dynamic storage. This would require the programmer to provide and maintain a pool of available tree nodes, which would probably be implemented in parallel arrays. Also, it would be necessary for the programmer to guess the maximum number of required nodes. This is clearly an awkward and error-prone way to deal with binary trees. Reference variables, which are discussed in Section 6.11.6, are closely related to pointers. 6.11.1 Design Issues The primary design issues particular to pointers are the following: What are the scope and lifetime of a pointer variable? What is the lifetime of a heap-dynamic variable (the value a pointer references)? Are pointers restricted as to the type of value to which they can point? Are pointers used for dynamic storage management, indirect addressing, or both? Should the language support pointer types, reference types, or both? 6.11.2 Pointer Operations Languages that provide a pointer type usually include two fundamental pointer operations: assignment and dereferencing. The first operation sets a pointer variable’s value to some useful address. If pointer variables are used only to manage dynamic storage, then the allocation mechanism, whether by operator or built-in subprogram, serves to initialize the pointer variable. If pointers are used for indirect addressing to variables that are not heap dynamic, then there must be an explicit operator or built-in subprogram for fetching the address of a variable, which can then be assigned to the pointer variable. An occurrence of a pointer variable in an expression can be interpreted in two distinct ways. First, it could be interpreted as a reference to the contents of the memory cell to which it is bound, which in the case of a pointer is an address. This is exactly how a nonpointer variable in an expression would be interpreted, although in that case its value likely would not be an address. However, the pointer also could be interpreted as a reference to the value in the memory cell pointed to by the memory cell to which the pointer variable is bound. In this case, the pointer is interpreted as an indirect reference. The former case is a normal pointer reference; the latter is the result of dereferencing the pointer. Dereferencing, which takes a reference through one level of indirection, is the second fundamental pointer operation. Dereferencing of pointers can be either explicit or implicit. In many contemporary languages, it occurs only when explicitly specified. In C++, it is explicitly specified with the asterisk (*) as a prefix unary operator. Consider the following example of dereferencing: If ptr is a pointer variable with the value 7080 and the cell whose address is 7080 has the value 206, then the assignment j = *ptr sets j to 206. This process is shown in Figure 6.8. Figure 6.8 The assignment - operation j = *ptr Figure 6.8 Full Alternative Text When pointers point to records, the syntax of the references to the fields of these records varies among languages. In C and C++, there are two ways a pointer to a record can be used to reference a field in that record. If a pointer variable p points to a record with a field named age, (*p).age can be used to refer to that field. The operator ->, when used between a pointer to a struct and a field of that struct, combines dereferencing and field reference. For example, the expression p -> age is equivalent to (*p).age. Languages that provide pointers for the management of a heap must include an explicit allocation operation. Allocation is sometimes specified with a subprogram, such as malloc in C. In languages that support object-oriented programming, allocation of heap objects is often specified with the new operator. C++, which does not provide implicit deallocation, uses delete as its deallocation operator. 6.11.3 Pointer Problems The first high-level programming language to include pointer variables was PL/I, in which pointers could be used to refer to both heap-dynamic variables and other program variables. The pointers of PL/I were highly flexible, but their use could lead to several kinds of programming errors. Some of the problems of PL/I pointers are also present in the pointers of subsequent languages. Some recent languages, such as Java, have replaced pointers completely with reference types, which, along with implicit deallocation, minimize the primary problems with pointers. A reference type is really only a pointer with restricted operations. 6.11.3.1 Dangling Pointers A dangling pointer, or dangling reference, is a pointer that contains the address of a heap-dynamic variable that has been deallocated. Dangling pointers are dangerous for several reasons. First, the location being pointed to may have been reallocated to some new heap-dynamic variable. If the new variable is not the same type as the old one, type checks of uses of the dangling pointer are invalid. Even if the new dynamic variable is the same type, its new value will have no relationship to the old pointer’s dereferenced value. Furthermore, if the dangling pointer is used to change the heapdynamic variable, the value of the new heap-dynamic variable will be destroyed. Finally, it is possible that the location now is being temporarily used by the storage management system, possibly as a pointer in a chain of available blocks of storage, thereby allowing a change to the location to cause the storage manager to fail. The following sequence of operations creates a dangling pointer in many languages: 1. A new heap-dynamic variable is created and pointer p1 is set to point to it. 2. Pointer p2 is assigned p1’s value. 3. The heap-dynamic variable pointed to by p1 is explicitly deallocated (possibly setting p1 to nil), but p2 is not changed by the operation. p2 is now a dangling pointer. If the deallocation operation did not change p1, both p1 and p2 would be dangling. (Of course, this is a problem of aliasing—p1 and p2 are aliases.) For example, in C++ we could have the following: int * arrayPtr1; int * arrayPtr2 = new int[100]; arrayPtr1 = arrayPtr2; delete [] arrayPtr2; // Now, arrayPtr1 is dangling, because the heap storage // to which it was pointing has been deallocated. In C++, both arrayPtr1 and arrayPtr2 are now dangling pointers, because the C++ delete operator has no effect on the value of its operand pointer. In C++, it is common (and safe) to follow a delete operator with an assignment of zero, which represents null, to the pointer whose pointed-to value has been deallocated. Notice that the explicit deallocation of dynamic variables is the cause of dangling pointers. history note Pascal included an explicit deallocate operator: dispose. Because of the problem of dangling pointers caused by dispose, some Pascal implementations simply ignored dispose when it appeared in a program. Although this effectively prevents dangling pointers, it also disallows the reuse of heap storage that the program no longer needs. Recall that Pascal initially was designed as a teaching language, rather than as an industrial tool. 6.11.3.2 Lost Heap-Dynamic Variables A lost heap-dynamic variable is an allocated heap-dynamic variable that is no longer accessible to the user program. Such variables are often called garbage, because they are not useful for their original purpose, and they also cannot be reallocated for some new use in the program. Lost heap-dynamic variables are most often created by the following sequence of operations: 1. Pointer p1 is set to point to a newly created heap-dynamic variable. 2. p1 is later set to point to another newly created heap-dynamic variable. The first heap-dynamic variable is now inaccessible, or lost. This is sometimes called memory leakage. Memory leakage is a problem, regardless of whether the language uses implicit or explicit deallocation. In the following sections, we investigate how language designers have dealt with the problems of dangling pointers and lost heap-dynamic variables. 6.11.4 Pointers in C and C++ In C and C++, pointers can be used in the same ways as addresses are used in assembly languages. This means they are extremely flexible but must be used with great care. This design offers no solutions to the dangling pointer or lost heap-dynamic variable problems. However, the fact that pointer arithmetic is possible in C and C++ makes their pointers more interesting than those of the other programming languages. C and C++ pointers can point at any variable, regardless of where it is allocated. In fact, they can point anywhere in memory, whether there is a variable there or not, which is one of the dangers of such pointers. In C and C++, the asterisk (*) denotes the dereferencing operation and the ampersand (&) denotes the operator for producing the address of a variable. For example, consider the following code: int *ptr; int count, init; . . . ptr = &init; count = *ptr; The assignment to the variable ptr sets it to the address of init. The assignment to count dereferences ptr to produce the value at init, which is then assigned to count. So, the effect of the two assignment statements is to assign the value of init to count. Notice that the declaration of a pointer specifies its domain type. Pointers can be assigned the address value of any variable of the correct domain type, or they can be assigned the constant zero, which is used for nil. Pointer arithmetic is also possible in some restricted forms. For example, if ptr is a pointer variable that is declared to point at some variable of some data type, then ptr + index is a legal expression. The semantics of such an expression is as follows. Instead of simply adding the value of index to ptr, the value of index is first scaled by the size of the memory cell (in memory units) to which ptr is pointing (its base type). For example, if ptr points to a memory cell for a type that is four memory units in size, then index is multiplied by 4, and the result is added to ptr. The primary purpose of this sort of address arithmetic is array manipulation. The following discussion is related to single-dimensioned arrays only. In C and C++, all arrays use zero as the lower bound of their subscript ranges, and array names without subscripts always refer to the address of the first element. Consider the following declarations: int list [10]; int *ptr; Now consider the assignment ptr = list; This assigns the address of list[0] to ptr. Given this assignment, the following are true: *(ptr + 1) is equivalent to list[1]. *(ptr + index) ptr[index] is equivalent to list[index]. is equivalent to list[index]. It is clear from these statements that the pointer operations include the same scaling that is used in indexing operations. Furthermore, pointers to arrays can be indexed as if they were array names. Pointers in C and C++ can point to functions. This feature is used to pass functions as parameters to other functions. Pointers are also used for parameter passing, as discussed in Chapter 9. C and C++ include pointers of type void *, which can point at values of any type. In effect they are generic pointers. However, type checking is not a problem with void * pointers, because these languages disallow dereferencing them. One common use of void * pointers is as the types of parameters of functions that operate on memory. For example, suppose we wanted a function to move a sequence of bytes of data from one place in memory to another. It would be most general if it could be passed two pointers of any type. This would be legal if the corresponding formal parameters in the function were void * type. The function could then convert them to char * type and do the operation, regardless of what type pointers were sent as actual parameters. 6.11.5 Reference Types A reference type variable is similar to a pointer, with one important and fundamental difference: A pointer refers to an address in memory, while a reference refers to an object or a value in memory. As a result, although it is natural to perform arithmetic on addresses, it is not sensible to do arithmetic on references. C++ includes a special kind of reference type that is used primarily for the formal parameters in function definitions. A C++ reference type variable is a constant pointer that is always implicitly dereferenced. Because a C++ reference type variable is a constant, it must be initialized with the address of some variable in its definition, and after initialization a reference type variable can never be set to reference any other variable. The implicit dereference prevents assignment to the address value of a reference variable. Reference type variables are specified in definitions by preceding their names with ampersands (&). For example, int result = 0; int &ref_result = result; . . . ref_result = 100; In this code segment, result and ref_result are aliases. When used as formal parameters in function definitions, C++ reference types provide for two-way communication between the caller function and the called function. This is not possible with nonpointer primitive parameter types, because C++ parameters are passed by value. Passing a pointer as a parameter accomplishes the same two-way communication, but pointer formal parameters require explicit dereferencing, making the code less readable and less safe. Reference parameters are referenced in the called function exactly as are other parameters. The calling function need not specify that a parameter whose corresponding formal parameter is a reference type is anything unusual. The compiler passes addresses, rather than values, to reference parameters. In their quest for increased safety over C++, the designers of Java removed C++-style pointers altogether. Unlike C++ reference variables, Java reference variables can be assigned to refer to different class instances; they are not constants. All Java class instances are referenced by reference variables. That is, in fact, the only use of reference variables in Java. These issues are discussed further in Chapter 12. In the following, String is a standard Java class: String str1; . . . str1 = "This is a Java literal string"; In this code, str1 is defined to be a reference to a String class instance or object. It is initially set to null. The subsequent assignment sets str1 to reference the String object, "This is a Java literal string". Because Java class instances are implicitly deallocated (there is no explicit deallocation operator), there cannot be dangling references in Java. C# includes both the references of Java and the pointers of C++. However, the use of pointers is strongly discouraged. In fact, any subprogram that uses pointers must include the unsafe modifier. Note that although objects pointed to by references are implicitly deallocated, that is not true for objects pointed to by pointers. Pointers were included in C# primarily to allow C# programs to interoperate with C and C++ code. All variables in the object-oriented languages Smalltalk, Python, and Ruby are references. They are always implicitly dereferenced. Furthermore, the direct values of these variables cannot be accessed. 6.11.6 Evaluation The problems of dangling pointers and garbage have already been discussed at length. The problems of heap management are discussed in Section 6.11.7.3. Pointers have been compared with the goto. The goto statement widens the range of statements that can be executed next. Pointer variables widen the range of memory cells that can be referenced by a variable. Perhaps the most damning statement about pointers was made by Hoare (1973): “Their introduction into high-level languages has been a step backward from which we may never recover.” On the other hand, pointers are essential in some kinds of programming applications. For example, pointers are necessary to write device drivers, in which specific absolute addresses must be accessed. The references of Java and C# provide some of the flexibility and the capabilities of pointers, without the hazards. It remains to be seen whether programmers will be willing to trade the full power of C and C++ pointers for the greater safety of references. The extent to which C# programs use pointers will be one measure of this. 6.11.7 Implementation of Pointer and Reference Types In most languages, pointers are used in heap management. The same is true for Java and C# references, as well as the variables in Smalltalk and Ruby, so we cannot treat pointers and references separately. First, we briefly describe how pointers and references are represented internally. We then discuss two possible solutions to the dangling pointer problem. Finally, we describe the major problems with heap-management techniques. 6.11.7.1 Representations of Pointers and References In most larger computers, pointers and references are single values stored in memory cells. However, in early microcomputers based on Intel microprocessors, addresses have two parts: a segment and an offset. So, pointers and references are implemented in these systems as pairs of 16-bit cells, one for each of the two parts of an address. 6.11.7.2 Solutions to the DanglingPointer Problem There have been several proposed solutions to the dangling-pointer problem. Among these are tombstones (Lomet, 1975), in which every heap-dynamic variable includes a special cell, called a tombstone, that is itself a pointer to the heap-dynamic variable. The actual pointer variable points only at tombstones and never to heap-dynamic variables. When a heap-dynamic variable is deallocated, the tombstone remains but is set to nil, indicating that the heap-dynamic variable no longer exists. This approach prevents a pointer from ever pointing to a deallocated variable. Any reference to any pointer that points to a nil tombstone can be detected as an error. Tombstones are costly in both time and space. Because tombstones are never deallocated, their storage is never reclaimed. Every access to a heap-dynamic variable through a tombstone requires one more level of indirection, which requires an additional machine cycle on most computers. Apparently none of the designers of the more popular languages have found the additional safety to be worth this additional cost, because no widely used language uses tombstones. An alternative to tombstones is the locks-and-keys approach used in the implementation of UW-Pascal (Fischer and LeBlanc, 1977, 1980). In this compiler, pointer values are represented as ordered pairs (key, address), where the key is an integer value. Heap-dynamic variables are represented as the storage for the variable plus a header cell that stores an integer lock value. When a heap-dynamic variable is allocated, a lock value is created and placed both in the lock cell of the heap-dynamic variable and in the key cell of the pointer that is specified in the call to new. Every access to the dereferenced pointer compares the key value of the pointer to the lock value in the heapdynamic variable. If they match, the access is legal; otherwise the access is treated as a run-time error. Any copies of the pointer value to other pointers must copy the key value. Therefore, any number of pointers can reference a given heap-dynamic variable. When a heap-dynamic variable is deallocated with dispose, its lock value is cleared to an illegal lock value. Then, if a pointer other than the one specified in the dispose is dereferenced, its address value will still be intact, but its key value will no longer match the lock, so the access will not be allowed. Of course, the best solution to the dangling-pointer problem is to take deallocation of heap-dynamic variables out of the hands of programmers. If programs cannot explicitly deallocate heap-dynamic variables, there will be no dangling pointers. To do this, the run-time system must implicitly deallocate heap-dynamic variables when they are no longer useful. Lisp systems have always done this. Both Java and C# also use this approach for their reference variables. Recall that C#’s pointers do not include implicit deallocation. 6.11.7.3 Heap Management Heap management can be a very complex run-time process. We examine the process in two separate situations: one in which all heap storage is allocated and deallocated in units of a single size, and one in which variable-size segments are allocated and deallocated. Note that for deallocation, we discuss only implicit approaches. Our discussion will be brief and far from comprehensive, since a thorough analysis of these processes and their associated problems is not so much a language design issue as it is an implementation issue. Single-Size Cells The simplest situation is when all allocation and deallocation is of single-size cells. It is further simplified when every cell already contains a pointer. This is the scenario of many implementations of Lisp, where the problems of dynamic storage allocation were first encountered on a large scale. All Lisp programs and most Lisp data consist of cells in linked lists. In a single-size allocation heap, all available cells are linked together using the pointers in the cells, forming a list of available space. Allocation is a simple matter of taking the required number of cells from this list when they are needed. Deallocation is a much more complex process. A heap-dynamic variable can be pointed to by more than one pointer, making it difficult to determine when the variable is no longer useful to the program. Simply because one pointer is disconnected from a cell obviously does not make it garbage; there could be several other pointers still pointing to the cell. In Lisp, several of the most frequent operations in programs create collections of cells that are no longer accessible to the program and therefore should be deallocated (put back on the list of available space). One of the fundamental design goals of Lisp was to ensure that reclamation of unused cells would not be the task of the programmer but rather that of the run-time system. This goal left Lisp implementors with the fundamental design question: When should deallocation be performed? There are several different approaches to garbage collection. The two most common traditional techniques are in some ways opposite processes. These are named reference counters, in which reclamation is incremental and is done when inaccessible cells are created, and mark-sweep, in which reclamation occurs only when the list of available space becomes empty. These two methods are sometimes called the eager approach and the lazy approach, respectively. Many variations of these two approaches have been developed. In this section, however, we discuss only the basic processes. The reference counter method of storage reclamation accomplishes its goal by maintaining in every cell a counter that stores the number of pointers that are currently pointing at the cell. Embedded in the decrement operation for the reference counters, which occurs when a pointer is disconnected from the cell, is a check for a zero value. If the reference counter reaches zero, it means that no program pointers are pointing at the cell, and it has thus become garbage and can be returned to the list of available space. There are three distinct problems with the reference counter method. First, if storage cells are relatively small, the space required for the counters is significant. Second, some execution time is obviously required to maintain the counter values. Every time a pointer value is changed, the cell to which it was pointing must have its counter decremented, and the cell to which it is now pointing must have its counter incremented. In a language like Lisp, in which nearly every action involves changing pointers, that can be a significant portion of the total execution time of a program. Of course, if pointer changes are not too frequent, this is not a problem. Some of the inefficiency of reference counters can be eliminated by an approach named deferred reference counting, which avoids reference counters for some pointers. The third problem is that complications arise when a collection of cells is connected circularly. The problem here is that each cell in the circular list has a reference counter value of at least 1, which prevents it from being collected and placed back on the list of available space. A solution to this problem can be found in Friedman and Wise (1979). The advantage of the reference counter approach is that it is intrinsically incremental. Its actions are interleaved with those of the application, so it never causes significant delays in the execution of the application. The original mark-sweep process of garbage collection operates as follows: The run-time system allocates storage cells as requested and disconnects pointers from cells as necessary, without regard for storage reclamation (allowing garbage to accumulate), until it has allocated all available cells. At this point, a mark-sweep process is begun to gather all the garbage left floating around in the heap. To facilitate the process, every heap cell has an extra indicator bit or field that is used by the collection algorithm. The mark-sweep process consists of three distinct phases. First, all cells in the heap have their indicators set to indicate they are garbage. This is, of course, a correct assumption for only some of the cells. The second part, called the marking phase, is the most difficult. Every pointer in the program is traced into the heap, and all reachable cells are marked as not being garbage. After this, the third phase, called the sweep phase, is executed: All cells in the heap that have not been specifically marked as still being used are returned to the list of available space. To illustrate the flavor of algorithms used to mark the cells that are currently in use, we provide the following simple version of a marking algorithm. We assume that all heap-dynamic variables, or heap cells, consist of an information part; a part for the mark, named marker; and two pointers named llink and rlink. These cells are used to build directed graphs with at most two edges leading from any node. The marking algorithm traverses all spanning trees of the graphs, marking all cells that are found. Like other graph traversals, the marking algorithm uses recursion. for every pointer r do mark(r) void mark(void * ptr) { if (ptr != 0) if (*ptr.marker is not marked) { set *ptr.marker mark(*ptr.llink) mark(*ptr.rlink) } } An example of the actions of this procedure on a given graph is shown in Figure 6.9. This simple marking algorithm requires a great deal of storage (for stack space to support recursion). A marking process that does not require additional stack space was developed by Schorr and Waite (1967). Their method reverses pointers as it traces out linked structures. Then, when the end of a list is reached, the process can follow the pointers back out of the structure. Figure 6.9 An example of the actions of the marking algorithm Figure 6.9 Full Alternative Text The most serious problem with the original version of mark-sweep was that it was done too infrequently—only when a program had used all or nearly all of the heap storage. Mark-sweep in that situation takes a good deal of time, because most of the cells must be traced and marked as being currently used. This causes a significant delay in the progress of the application. Furthermore, the process may yield only a small number of cells that can be placed on the list of available space. This problem has been addressed in a variety of improvements. For example, incremental mark-sweep garbage collection occurs more frequently, long before memory is exhausted, making the process more effective in terms of the amount of storage that is reclaimed. Also, the time required for each run of the process is obviously shorter, thus reducing the delay in application execution. Another alternative is to perform the mark-sweep process on parts, rather than all of the memory associated with the application, at different times. This provides the same kinds of improvements as incremental mark-sweep. Both the marking algorithms for the mark-sweep method and the processes required by the reference counter method can be made more efficient by use of the pointer rotation and slide operations that are described by Suzuki (1982). Variable-Size Cells Managing a heap from which variable-size cells8 are allocated has all the difficulties of managing one for single-size cells, but also has additional problems. Unfortunately, variable-size cells are required by most programming languages. The additional problems posed by variable-size cell management depend on the method used. If mark-sweep is used, the following additional problems occur: 8. The cells have variable sizes because these are abstract cells, which store the values of variables, regardless of their types. Furthermore, a variable could be a structured type. The initial setting of the indicators of all cells in the heap to indicate that they are garbage is difficult. Because the cells are different sizes, scanning them is a problem. One solution is to require each cell to have the cell size as its first field. Then the scanning can be done, although it takes slightly more space and somewhat more time than its counterpart for fixed-size cells. The marking process is nontrivial. How can a chain be followed from a pointer if there is no predefined location for the pointer in the pointed-to cell? Cells that do not contain pointers at all are also a problem. Adding an internal pointer to each cell, which is maintained in the background by the run-time system, will work. However, this background maintenance processing adds both space and execution time overhead to the cost of running the program. Maintaining the list of available space is another source of overhead. The list can begin with a single cell consisting of all available space. Requests for segments simply reduce the size of this block. Reclaimed cells are added to the list. The problem is that before long, the list becomes a long list of various-size segments, or blocks. This slows allocation because requests cause the list to be searched for sufficiently large blocks. Eventually, the list may consist of a large number of very small blocks, which are not large enough for most requests. At this point, adjacent blocks may need to be collapsed into larger blocks. Alternatives to using the first sufficiently large block on the list can shorten the search but require the list to be ordered by block size. In either case, maintaining the list is additional overhead. If reference counters are used, the first two problems are avoided, but the available-space list-maintenance problem remains. For a comprehensive study of memory management problems, see Wilson (2005). 6.12 Optional Types There are situations in programming when there is a need to be able to indicate that a variable does not currently have a value. Some older languages use zero as a nonvalue for numeric variables. This approach has the disadvantage of not being able to distinguish between when the variable is supposed to have the zero value and when the zero indicates that it has no value. Some newer languages provide types that can have a normal value or a special value to indicate that their variables have no value. Variables that have this capability are called optional types. Optional types are now directly supported in C#, F#, and Swift, among others. C# has two categories of variables, value and reference types. Reference types, which are classes, are optional types by their nature. The null value indicates that a reference type has no value. Value types, which are all struct types, can be declared to be optional types, which allows them to have the value null. A variable is declared to be an optional type by following its type name with a question mark (?), as in int? x; To determine whether a variable has a normal value, it can be tested against null, as in int? x; . . . if(x == null) Console.WriteLine("x has no value"); else Console.WriteLine("The value of x is: {0}", x); Swift’s optional types are similar to those of C#, except that the nonvalue is named nil, instead of null. The Swift version of the above code is: var Int? x; . . . if x == nil print("x has no value") else print("The value of x is: \(x)") 6.13 Type Checking For our discussion of type checking, the concept of operands and operators is generalized to include subprograms and assignment statements. Subprograms will be thought of as operators whose operands are their parameters. The assignment symbol will be thought of as a binary operator, with its target variable and its expression being the operands. Type checking is the activity of ensuring that the operands of an operator are of compatible types. A compatible type is one that either is legal for the operator or is allowed under language rules to be implicitly converted by compiler-generated code (or the interpreter) to a legal type. This automatic conversion is called a coercion. For example, if an int variable and a float variable are added in Java, the value of the int variab

Document Details

Tags

Related

Full Transcript

Upgrade to continue