Garbage collection - 6. 5 Data types (Gábor Pécsy)


6.6.1.2. Garbage collection

In other languages, deallocating unused memory blocks is not the developer's responsibility; it is handled by the runtime environment. In the simplest model of such a runtime, a reference counter is maintained for each dynamic variable, indicating whether the variable is still accessible from somewhere within the application (many real collectors trace reachability from the program's variables instead of counting references). When a variable becomes inaccessible, it becomes eligible for garbage collection. The garbage collector is a mechanism in the runtime environment which is responsible for finding memory blocks that can be released and for deallocating them according to the rules of the language. Depending on the runtime implementation, garbage collection can be triggered by a timer, by a certain level of memory usage, or by a combination of both. Some languages allow the developer to trigger garbage collection explicitly as well.

The pointers in languages which support garbage collection are often called references. Their usage is safer: the developer cannot forget to release an unused variable, nor refer to a variable which has already been released. However, the developer has much looser control over the memory usage of the application. Another potential drawback of this approach is that garbage collection generates periodic peaks in the CPU usage of the application; the collector might even have to stop all other threads of execution, which can have an undesirable effect on the real-time characteristics of the application. Garbage collection is also a complex task: it is not enough to release variables whose reference counter is 0, because there might be circular references between variables.

The garbage collector must be able to discover groups of objects which reference each other but have no external references anymore, and release all objects in them together. Nevertheless, the added safety and ease of use are a strong motivation for using garbage collectors. The popularity of Java and C# contributed a lot to understanding the challenges of garbage collection and to the creation of better and better garbage collectors.
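The problem with circular references can be illustrated with a minimal C sketch of manual reference counting. The Node type and the node_new, node_link and node_release helpers are hypothetical names invented for this illustration; they do not belong to any real runtime. The sketch only shows why a counter alone is not sufficient.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted object with one outgoing reference. */
typedef struct Node {
    int refcount;
    struct Node *other;
} Node;

static int live_nodes = 0;  /* nodes allocated and not yet freed */

static Node *node_new(void) {
    Node *n = malloc(sizeof *n);
    if (n != NULL) {
        n->refcount = 1;    /* the caller holds the first reference */
        n->other = NULL;
        live_nodes++;
    }
    return n;
}

/* Drops one reference; frees the node when no references remain. */
static void node_release(Node *n) {
    if (n != NULL && --n->refcount == 0) {
        node_release(n->other);
        free(n);
        live_nodes--;
    }
}

/* Makes `from` reference `to` and counts the new reference. */
static void node_link(Node *from, Node *to) {
    to->refcount++;
    from->other = to;
}

/* a -> b without a cycle: returns the number of nodes left allocated. */
static int leaked_after_chain(void) {
    int before = live_nodes;
    Node *a = node_new();
    Node *b = node_new();
    if (a == NULL || b == NULL) { node_release(a); node_release(b); return -1; }
    node_link(a, b);
    node_release(b);        /* b: 2 -> 1, still referenced by a */
    node_release(a);        /* a: 1 -> 0, which also releases b */
    return live_nodes - before;
}

/* a -> b -> a: the counters never reach 0, so both nodes stay allocated. */
static int leaked_after_cycle(void) {
    int before = live_nodes;
    Node *a = node_new();
    Node *b = node_new();
    if (a == NULL || b == NULL) { node_release(a); node_release(b); return -1; }
    node_link(a, b);
    node_link(b, a);
    node_release(a);        /* a: 2 -> 1 */
    node_release(b);        /* b: 2 -> 1 */
    return live_nodes - before;
}
```

In the cyclic case both counters stay at 1 although nothing outside the cycle refers to the nodes, which is exactly the situation a garbage collector has to detect by other means.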

Memory management has fundamental influence on the usage of pointers and their available operations. Further details of memory management are provided in Chapter 4.

6.6.2. 5.6.2 Type-value set

A pointer is an object which represents the location of another object in memory. The actual representation is implementation dependent but often it is an unsigned integer value which is the index (address) of the first byte of the referenced object in memory. Other representations are also possible. Disconnecting pointers from actual memory addresses might enable relocating objects at runtime for instance to enable defragmenting the memory.

6.6.2.1. 5.6.2.1 Untyped pointers

In most languages pointers are typed, meaning each pointer type can refer only to objects of a certain type. Usually this typedness does not change the representation of the pointer, but it enables the compiler to verify the correct usage of the referenced objects. However, in some languages it is possible to define untyped pointers as well.

Untyped pointers represent a memory address without restricting the type of object located at that position.

The usage of untyped pointers is rather limited. As the type of the referenced object is not determined, dereference is not permitted on them. Usually any pointer can be converted to an untyped pointer automatically, while the opposite direction requires an explicit conversion that changes the interpretation of the memory (C++ requires a cast here, whereas C performs even this conversion implicitly).

A typical use case for untyped pointers in languages which do not have a common superclass for all objects is to create containers which can store any kind of object. The following - C language - example is the implementation of a list type which can contain arbitrary objects.

#include <stdlib.h>  /* malloc, NULL */

struct listelement_struct;
typedef struct listelement_struct* List;

struct listelement_struct {
    List next;
    void* element;  /* untyped pointer: can refer to any object */
};

/* Inserts element at the front of l; returns the new head, or NULL if allocation fails. */
List insert(List l, void* element) {
    List p = malloc(sizeof(struct listelement_struct));
    if (p != NULL) {
        p->next = l;
        p->element = element;
    }
    return p;
}
...
List l = NULL;
int i = 5;
char text[] = "Hello world!";
l = insert(l, &i);
l = insert(l, &text);

As illustrated above, any kind of object can be inserted into the list. But in order to use the objects after having retrieved them from the list, we need to know their exact type. In C the void* pointers can be converted to any pointer type, but the developer needs some solution to determine the type of the retrieved objects to be able to use them. The language itself does not offer such a mechanism.
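One common workaround, sketched here under the assumption that the set of stored types is known in advance, is to keep a type tag next to each untyped pointer. The names Tag, Tagged and describe are hypothetical and serve only this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical type tag maintained by the developer, not by the language. */
typedef enum { TAG_INT, TAG_TEXT } Tag;

typedef struct {
    Tag tag;        /* records what element really points to */
    void *element;  /* untyped pointer to the stored object */
} Tagged;

/* Writes a textual description of the element into buf, choosing the
   conversion of the void* according to the tag. */
static void describe(const Tagged *t, char *buf, size_t size) {
    switch (t->tag) {
    case TAG_INT:
        snprintf(buf, size, "int: %d", *(int *)t->element);
        break;
    case TAG_TEXT:
        snprintf(buf, size, "text: %s", (char *)t->element);
        break;
    }
}
```

The compiler cannot check that the tag and the pointer agree; keeping them consistent remains the developer's responsibility.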

In object-oriented languages where the language defines a common base class for all classes (for example Java or Eiffel), references of the common base class can serve as untyped pointers. Though not exactly untyped, since their type is the common base class, these references share the ability to point to any object in the language.

However, a significant advantage is that objects of these languages typically carry information about their dynamic type, which enables us to safely perform the conversion when using such an "untyped" reference.

Before introducing generics to Java, the collection framework of the language used Object references for the stored objects.

6.6.2.2. 5.6.2.2 Pointers to nowhere

In almost all programming languages that have pointers, the type-value set of pointers contains a special value, the pointer to nowhere.

This pointer is called NULL in C, 0 (or nullptr since C++11) in C++, null in Ada and Java, nil in CLU, void in Eiffel, etc. It has many names, but its properties and function are the same. This pointer can be automatically converted to any pointer type; therefore, it can be assigned to any pointer variable. The value is almost always represented as the all-zero bit sequence, which on most systems is not the address of any valid object. Dereferencing the pointer to nowhere is always an error: Java and Ada raise an exception at runtime, while in C and C++ the behaviour is undefined and typically ends in a crash.

In many languages, especially the ones which provide references instead of pointers (e.g. Ada, Java, Eiffel, CLU), the default value of references is the pointer to nowhere. This means that uninitialized references can be clearly distinguished. In C or C++, uninitialized automatic pointers have indeterminate values.

If a language supports garbage collection and its pointers or references by default are initialized to the pointer to nowhere, the pointers have a very useful invariant property. Their value is either the pointer to nowhere or they point to an existing object. This invariant greatly simplifies the development of sound applications.
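C does not guarantee this invariant, so it is usually approximated by convention: pointers are set to NULL when they refer to nothing and are checked before being dereferenced. The sketch below assumes this convention; find_first_negative and value_or_default are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Pointers with static storage duration are zero-initialized,
   so this one starts out as a null pointer. */
static int *shared_slot;

/* Returns a pointer to the first negative element, or NULL if there is none. */
static const int *find_first_negative(const int *values, size_t count) {
    for (size_t i = 0; i < count; i++) {
        if (values[i] < 0) {
            return &values[i];
        }
    }
    return NULL;
}

/* Dereferences p only after checking it against the pointer to nowhere. */
static int value_or_default(const int *p, int fallback) {
    return (p != NULL) ? *p : fallback;
}
```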

6.6.2.3. 5.6.2.3 Forward declaration

A very important property of pointer types is that their representation is independent from the referenced data type. We rely on this property to resolve the "chicken and egg" problem of creating linked data structures. In the example below, written in Ada, we describe the representation of a linked list type. The ListElement type consists of an integer value (the data) and a reference pointing to the next element of the list. When declaring the type, we run into a problem immediately: in Ada a reference type can only be declared for an existing type, but the components of a record type must also be of existing types. The solution to this contradiction is the forward declaration of the ListElement type. With the forward declaration we let the compiler know that the type ListElement exists. As the representation of references does not depend on the details of the referenced type, the existence is enough to define the ListElementAccess type, which is a fully defined type at this point. Therefore, it can be used in the definition of ListElement as the type of one of its components.

-- Forward declaration of the ListElement type
type ListElement;

-- We can now define the reference type to it
type ListElementAccess is access all ListElement;

-- And now we can define the ListElement type as well
type ListElement is record
   Data : Integer;
   Next : ListElementAccess;
end record;
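The same idea applies in C, where a struct can be declared before it is defined. The sketch below uses two hypothetical types, Author and Book, that refer to each other; neither could be written without a forward declaration or a pointer to a not yet complete type.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declarations: the types exist, but their layout is not known yet. */
struct Author;
struct Book;

struct Author {
    const char *name;
    struct Book *first_book;   /* pointer to a still incomplete type */
};

struct Book {
    const char *title;
    struct Author *author;
};

/* Returns the title of the author's first book, or NULL if there is none. */
static const char *first_title(const struct Author *a) {
    return (a->first_book != NULL) ? a->first_book->title : NULL;
}
```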

6.6.3. 5.6.3 Operations

Moving on to the operations of pointer types, some of them (e.g. assignment or allocation) are present in all languages, while others are language-specific (e.g. the pointer arithmetic of C and C++).

6.6.3.1. 5.6.3.1 Assignment

Assignment of pointers usually means copying the address of the referenced memory block. However, in garbage-collected languages whose runtime relies on reference counting, assignment includes the maintenance of reference counters as well. In the assignment p := q, if both pointer variables refer to existing objects, the reference counter of the object referenced by p needs to be reduced by one, since p does not reference it anymore; if its value drops to 0, this might also trigger garbage collection, depending on the implementation. At the same time, the reference counter of the object referenced by q needs to be incremented by one, as it is now referenced by p as well.
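The bookkeeping behind p := q can be sketched in C as follows. This is only an illustration of the counting rule described above, not how any particular runtime implements it; Cell, cell_new and assign are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted cell. */
typedef struct {
    int refcount;
    int value;
} Cell;

/* Allocates a cell that is not referenced by any variable yet. */
static Cell *cell_new(int value) {
    Cell *c = malloc(sizeof *c);
    if (c != NULL) {
        c->refcount = 0;
        c->value = value;
    }
    return c;
}

/* Performs *target := source while maintaining both counters. */
static void assign(Cell **target, Cell *source) {
    Cell *old = *target;
    if (source != NULL) {
        source->refcount++;         /* source gains a reference first, */
    }                               /* so self-assignment stays safe   */
    *target = source;
    if (old != NULL && --old->refcount == 0) {
        free(old);                  /* nothing refers to the old object */
    }
}

/* Returns the counter of the shared object after p := q, or -1 on failure. */
static int demo_assignment(void) {
    Cell *p = NULL;
    Cell *q = NULL;
    assign(&p, cell_new(1));
    assign(&q, cell_new(2));
    assign(&p, q);                  /* frees cell 1; cell 2 is now shared */
    int count = (q != NULL) ? q->refcount : -1;
    assign(&p, NULL);
    assign(&q, NULL);               /* last reference gone, cell 2 freed */
    return count;
}
```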

Generally, there are five ways of assigning a new value to a pointer variable:

• The pointer to nowhere is assigned to it. In many languages this is the default value of all pointers.

• The value of another pointer variable is assigned to it.

• The pointer to a newly allocated object is assigned to it using an allocator.

• The address of a static or automatic variable is assigned to it. However, not all languages support this option as it can be unsafe, especially in the case of automatic variables.

• Some languages allow assigning concrete memory addresses to pointers. For instance, in C an integer value can be converted to a pointer with an explicit cast, although the result is implementation-defined.
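The five ways listed above can be sketched in C in a single function. The names global_counter and demo_five_ways are hypothetical; the last step goes through uintptr_t, assuming the platform provides that optional type, because a round trip through it is the portable way to show an integer being converted to a pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static int global_counter = 7;      /* a static variable */

/* Exercises each way of giving a pointer a value and returns how many
   of the five steps behaved as expected. */
static int demo_five_ways(void) {
    int local = 3;                  /* an automatic variable */
    int *p;
    int ok = 0;

    p = NULL;                       /* 1. the pointer to nowhere */
    if (p == NULL) ok++;

    int *other = &global_counter;
    p = other;                      /* 2. value of another pointer variable */
    if (*p == 7) ok++;

    p = malloc(sizeof *p);          /* 3. newly allocated object */
    if (p != NULL) {
        *p = 1;
        ok += *p;
        free(p);
    }

    p = &local;                     /* 4. address of an automatic variable; */
    if (*p == 3) ok++;              /*    valid only while local is alive   */

    uintptr_t raw = (uintptr_t)(void *)&global_counter;
    p = (int *)(void *)raw;         /* 5. integer value converted to a pointer */
    if (p == &global_counter) ok++;

    return ok;
}
```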

6.6.3.2. 5.6.3.2 Allocators

All languages that support dynamic variables must provide an allocator operation. This operation is responsible for obtaining a free memory block of sufficient size from the free list and returning its address. In many languages - in object-oriented languages such as Java and C++ in particular - the allocator is responsible for initializing the allocated memory block according to the type of the object to be stored there.
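In C the two responsibilities are separate: malloc only obtains the memory block, and initialization is left to the developer. A minimal sketch of an allocator that also initializes the object, roughly the role a constructor call plays in Java or C++, could look like this; Point and point_new are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    double x;
    double y;
} Point;

/* Obtains a block large enough for a Point and initializes it.
   Returns NULL if no memory is available. */
static Point *point_new(double x, double y) {
    Point *p = malloc(sizeof *p);
    if (p != NULL) {
        p->x = x;
        p->y = y;
    }
    return p;
}
```

Since C has no garbage collector, every object obtained this way must eventually be passed to free by the caller.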

6.6.3.3. 5.6.3.3 Deallocators

Not all languages have an operator for deallocation. Languages that use garbage collection typically do not provide explicit deallocators. However, some of these languages might provide some means for the developers to control deallocation.

It is quite common to provide a mechanism to explicitly trigger garbage collection, e.g. in Java we can use the System.gc() method, which asks the runtime to run the collector.

Ada offers even deeper access. Developers can declare that they take full responsibility for maintaining the consistency of memory management related to a certain type by instantiating the Unchecked_Deallocation generic (Ada.Unchecked_Deallocation since Ada 95) for that type. The instance of this generic is a deallocator usable for objects of the given type. It is important to note that the type is removed from the scope of garbage collection.

6.6.3.4. 5.6.3.4 Referencing

Many languages allow assigning the address of a static or automatic variable to a pointer variable. The address of the variable is created using the reference operator of the language. Not all languages provide such a mechanism: Ada 83, Eiffel and Java, for example, do not. In these languages pointers can only reference dynamic variables.

In C and C++ the reference operator is the prefix &. It can be used on almost any object, with hardly any limitations.

The Ada 83 version of Ada did not allow the referencing of static and automatic objects, but this restriction was loosened in the Ada 95 revision. As the language is essentially garbage collection based, the references created using the reference operator also had to be integrated into this system for safety.

The challenge of introducing a reference operator is that the lifetime of automatic variables depends on the block structure of the language and the thread of execution. When the thread enters a block, the automatic variables of the block are created, and when it exits, the variables are destroyed.

Assume that we have a program which consists of blocks A and B, where B is embedded in A. In block A we declare a pointer IP and in block B we have an automatic variable I. We also have a statement in block B which assigns the reference of I to IP using the reference operator. The following code snippet results in a compilation error:

type IntegerAccess is access all Integer;
IP : IntegerAccess;

procedure B is
   I : Integer;  -- ordinary automatic variable
begin
   -- IP is visible here as it is global to this block
   -- However, this assignment is invalid...
   IP := I'Access;
end B;

begin
   B;
   -- ... because at this point IP would point to invalid memory area
   IP.all := 42;
end;

As shown in part a.) of Figure 10, when the execution exits B, the variable I is destroyed. However, the pointer IP would still point to the same memory area. Destroying I would break the invariant of references described earlier, which guarantees that they either point to a valid memory area or their value is null.

Therefore, Ada does not allow the use of the reference operator on ordinary automatic variables.

The solution to this problem in Ada 95 is that developers must mark an automatic variable with the keyword aliased if they want to take a reference to it. In the model described here, the compiler places these variables, even though they are automatic, in the dynamic storage area and replaces them with a pointer within the block.

In other words, it converts these variables to dynamic ones, and creates an automatic variable which holds the only reference to them. If the referencing operator is not used, the automatic (reference) variable is destroyed when the block exits, which drops the reference counter of the dynamic variable to 0, so it can be garbage collected. If the reference of the variable is assigned to some other pointer, the variable can live safely after leaving the block as it resides in the dynamic storage. See part b. of Figure 10. Note that the Ada standard does not require this implementation, and Ada compilers additionally apply accessibility checks to 'Access: with an access type declared outside the procedure, as here, a compiler may still reject the assignment unless 'Unchecked_Access is used. Below is the corrected example of using the reference operator (the 'Access attribute) in Ada 95:

type IntegerAccess is access all Integer;
IP : IntegerAccess;

procedure B is
   I : aliased Integer;  -- Allocated I on heap, variable I is a reference.
begin
   IP := I'Access;
end B;

begin
   B;
   -- This is now valid, IP points to an address in the heap
   IP.all := 42;
end;

6.6.4. 5.6.4 Dereference

One of the most important operations of pointers, if not the most important, is dereference. The dereference operation reduces the reference level by one, i.e. it produces the referenced object based on its address.

In most languages dereference needs to be explicitly marked in code. For example, in C and C++ the operator of dereference is the prefix *. If p is a pointer, *p denotes the referenced object. If p points to a structure (struct) or, in the case of C++, to an object of a class which has a field f, the (*p).f reference to f can be shortened as p->f.
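The notations described above can be compared in a short C sketch; Record, read_three_ways and set_through_pointer are hypothetical names used only here.

```c
#include <assert.h>

typedef struct {
    int f;
} Record;

/* Reads the field f through p in three equivalent ways and sums the results. */
static int read_three_ways(const Record *p) {
    int via_star = (*p).f;      /* explicit dereference, then field selection */
    int via_arrow = p->f;       /* shorthand for (*p).f */
    Record copy = *p;           /* *p denotes the whole referenced object */
    return via_star + via_arrow + copy.f;
}

/* Writes to the referenced object rather than to the pointer itself. */
static void set_through_pointer(int *p, int value) {
    *p = value;
}
```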

In object-oriented languages where objects are accessed through references (for example in Java, CLU or partly Eiffel) there is no need to explicitly mark dereference. The language defines for each operation whether it works on the reference or on the referenced object. For example, if a and b are two object references in Java, the a = b assignment and the a == b equality operator are operations on the references. However, method invocation is always an operation on the referenced object; therefore, a.equals(b) is executed by the object itself and, in classes that override equals appropriately, compares the objects and not their references.

Ada combines the two approaches. Dereference is indicated explicitly using the .all qualifier, but it may be omitted where the context makes it unambiguous that the referenced object is meant, for example when a component of a referenced record is selected (R.Data stands for R.all.Data if R is an access value). Let us assume that I is an integer variable, and IP and JP are two integer access (reference) variables. To read or write the referenced integer as a whole, dereference must be written out: I := IP.all and IP.all := I.

The assignment IP := JP and the comparison IP = JP are defined on the references themselves, so they are interpreted at the reference level; if we want to assign the referenced object, we have to use the IP.all := JP.all form.

6.6.4.1. 5.6.4.1 Equality

Almost all languages provide means to check the equality of pointers. This equality is always interpreted at the reference level, meaning it is independent from the equality operation of the referenced objects, irrespective of whether they have such operations. Two pointers are equal if they point to exactly the same object, i.e. they contain the same memory address.
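The difference between the two levels of equality can be shown with a small C sketch; same_object and same_value are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>

/* Equality at the reference level: do both pointers hold the same address? */
static bool same_object(const int *p, const int *q) {
    return p == q;
}

/* Equality of the referenced objects: requires dereferencing both pointers. */
static bool same_value(const int *p, const int *q) {
    return *p == *q;
}
```

Two distinct variables holding the same number are equal by value, but pointers to them are not equal.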

6.6.5. 5.6.5 Pointers to subprograms

A special variant of pointers is the pointer to subprograms. The type-value sets of these types consist of the entry points of subprograms, rather than of the memory addresses of objects. The pointer to nowhere is an element of these type-value sets as well.

Pointers to subprograms are always typed. The type of the referenced subprogram is the signature of the subprogram, which is the list of formal parameters and in case of functions the type of the returned value.

The set of operations for subprogram pointers is much more limited than for pointers to objects. There are no allocators or deallocators, as subprograms cannot be dynamically created. Assignment and equality checks are similar to those of object pointers. There are reference and dereference operators; the latter in practice means the invocation of the subprogram.

C, and therefore C++ as well, supports subprogram pointers. The example below is a function which calculates the definite integral of a real function over a specified interval. The first parameter of the function is the integrand real function, which is followed by the end points of the interval and the number of integration points.

Notice that neither reference, nor dereference need to be indicated explicitly, though both are allowed.

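#include <math.h>  /* sin */

typedef double (*RealFunction)(double);

/* Trapezoidal rule over n equal subintervals of [a, b]. */
double Integral(RealFunction fn, double a, double b, int n) {
    if (b <= a || n <= 0) return 0.0;
    double delta = (b - a) / (double)(n);
    double s = 0.0;
    /* Counting the subintervals avoids losing the last one to rounding in x. */
    for (int i = 0; i < n; i++) {
        double x = a + i * delta;
        s += delta * (fn(x) + fn(x + delta)) / 2.0;
    }
    return s;
}
...
double sininteg = Integral(sin, 0.0, 1.0, 1000);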
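The implicit and explicit forms of reference and dereference mentioned before the example can be put side by side in a short sketch. The functions square, apply_implicit, apply_explicit and forms_agree are hypothetical and exist only for this comparison.

```c
#include <assert.h>

typedef double (*RealFunction)(double);

static double square(double x) {
    return x * x;
}

/* Calls through the pointer without marking the dereference. */
static double apply_implicit(RealFunction fn, double x) {
    return fn(x);
}

/* Calls through the pointer with an explicit dereference. */
static double apply_explicit(RealFunction fn, double x) {
    return (*fn)(x);
}

/* Returns nonzero if both notations denote the same function and result. */
static int forms_agree(void) {
    RealFunction f = square;    /* implicit reference: the name alone */
    RealFunction g = &square;   /* explicit reference operator */
    return f == g && apply_implicit(f, 3.0) == apply_explicit(g, 3.0);
}
```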

In object-oriented languages, the importance of subprogram pointers is smaller. Subprogram pointer types can be substituted with an interface or abstract class which declares a single method, the subprogram to be passed.

Then objects that implement the specified interface or abstract class can be created, which realize the required variant of the method. These objects typically do not have data members at all. They are called function objects or functors. Below is a Java implementation of the definite integral calculation. Notice the usage of an anonymous inner class as the functor.

public interface RealFunction {
    double calc(double x);
}

// Trapezoidal rule over n equal subintervals of [a, b].
public static double integral(RealFunction fn, double a, double b, int n) {
    if (b <= a || n <= 0) return 0.0;
    double delta = (b - a) / (double)(n);
    double s = 0.0;
    // Counting the subintervals avoids losing the last one to rounding in x.
    for (int i = 0; i < n; i++) {
        double x = a + i * delta;
        s += delta * (fn.calc(x) + fn.calc(x + delta)) / 2.0;
    }
    return s;
}
...
double sininteg = integral(new RealFunction() {
    public double calc(double x) {
        return Math.sin(x);
    }
}, 0.0, 1.0, 1000);

6.6.6. 5.6.6 Language specialties

In this section we will examine some special, language specific solutions related to pointers.

6.6.6.1. 5.6.6.1 Pointer arithmetic of C

One of the most interesting and probably most often used properties of C is pointer arithmetic. Pointer arithmetic makes pointers very convenient to use and creates a close connection between arrays and pointers. Pointers in C are typically represented as unsigned integers whose value is the memory address (index) of the first byte of the referenced object.

C supports typed pointers. The type of pointers to a type T is T*. It has untyped pointers as well. The type of untyped pointers is void*, which behaves as if it were a pointer to byte type (unsigned char) but dereference is not allowed for this type.

The pointer arithmetic of C extends the usual set of pointer operations in the following respects:

• The operator sizeof specifies the number of bytes used for the representation of a type T.

• Any pointer type can be converted automatically to void*.
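The two points above, together with the connection between arrays and pointers, can be sketched as follows. The function names byte_step, third_element and as_untyped are hypothetical; the sketch relies only on the standard rule that adding 1 to a pointer of type T* advances it by sizeof(T) bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the distance in bytes between an int pointer and its successor. */
static size_t byte_step(void) {
    int values[3] = {10, 20, 30};
    int *p = values;            /* the array name converts to &values[0] */
    int *next = p + 1;          /* advances by one element, not one byte */
    return (size_t)((char *)next - (char *)p);
}

/* Indexing written as pointer arithmetic: values[2] is *(values + 2). */
static int third_element(const int *values) {
    return *(values + 2);
}

/* Any object pointer converts to void* without a cast. */
static const void *as_untyped(const int *p) {
    return p;
}
```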

In document Advanced Programming Languages (pages 145-169)