
2.6.1.22. Flexibility versus safety

Pascal variant records allow a memory cell to contain different types at different times. For example, the cell may contain either a pointer or an integer. So a pointer value put in such a cell can be operated on as if it were an integer, using any operation defined for integer values. This provides a loophole in Pascal's type checking that allows a program to do arithmetic on pointers, which is sometimes convenient. However, this unchecked use of memory cells is, in general, a dangerous practice [Tucker and Noonan, 2006].

2.6.2. 1.6.2 Language design

In the design of a new language, certain matters require careful assessment well before any consideration is given to the details of the design. The first and most important question that must be asked is whether it is necessary to design a new language. Is there an existing language that can be used to satisfy the requirement?

Even if it requires a new implementation, implementing an existing language is easier and faster than designing and then implementing a new language.

The language designer's first problem, therefore, is a judicious selection of concepts. What to omit is just as important a decision as what to include, because the success or failure of a programming language depends on many factors. A language is successful if it satisfies any of the following criteria:

• Achieves the goals of its designers.

• Attains widespread use in an application area.

• Serves as a model for other languages that are themselves successful.

When creating a new language, it is essential to decide on an overall goal for the language, and then keep that goal in mind through the entire design process.

Nevertheless, it is extremely difficult to describe good programming language design. Even recognized computer scientists and successful language designers offer conflicting advice. Niklaus Wirth, the designer of Pascal, advises that simplicity is paramount [Wirth, 1974]. C. A. R. Hoare, a prominent computer scientist and ALGOL designer, emphasizes the design of individual language constructs [Hoare, 1973]. Bjarne Stroustrup, the designer of C++, notes that a language cannot be merely a collection of neat features [Stroustrup, 1994].

Horowitz suggested the following ten-step protocol to design a new programming language [Horowitz, 1994]:

1. Choose an application area;

2. Make the design committee as small as possible;

3. Choose some precise design goals;

4. Release version one to a small set of people;

5. Revise the language definition;

6. Build a prototype compiler;

7. Revise the language definition again;

8. Write the manual;

9. Write a good compiler and distribute it;

10. Write primers.

2.7. 1.7 The standardization process

We have already emphasized the importance of a well-defined and widely accepted language standard. Documentation for the early programming languages was written in an informal way, in ordinary English. However, programmers soon became aware of the need for a more precise description of a language, and argued for the type of formal definitions used in mathematics. A further reason for a formal definition was the need for machine and implementation independence. The best way to achieve this was through standardization, which requires an independent and precise language definition that is universally accepted.

Standards organizations such as ANSI (American National Standards Institute) and ISO (International Organization for Standardization) have published definitions for a number of languages including C, C++, Ada, Common Lisp, and Prolog.

Once a language is in widespread use, it becomes very important to have a complete and precise definition of the language so that compatible implementations may be produced for a variety of hardware and system environments. The standardization process was developed in response to this need. A language standard is a formal definition of the syntax and semantics of a language; it must be a complete, unambiguous statement of both. All aspects covered by the standard must be defined clearly, while those that go beyond its limits must be clearly designated as "undefined".

A language translator that implements the standard must produce code that conforms to all the defined aspects of the standard, but for an undefined aspect it is permitted to produce any convenient translation. The right to define an unstandardized language, or to change a language definition, may belong to the individual language designer, to the agency that sponsored the language design, or to a committee of the American National Standards Institute (ANSI) or the International Organization for Standardization (ISO). The FORTRAN standard was originated by ANSI, the Pascal standard by ISO. The definition of Ada is controlled by the U.S. Department of Defense, which funded the design of Ada. New or experimental languages are usually controlled by their designers.

When a standards organization decides to sponsor a new standard for a language, it convenes a committee of people from industry and academia who have a strong interest in and extensive experience with that language.

The standardization process is not easy or smooth. The committee must decide which dialect, or combination of ideas from different dialects, will become the standard. The committee members approach the task with different notions of what is good or bad, and have different preferences. Agreement at the start is rare, and the harmonization process may take several years. This was the case with the original ISO Pascal standard, the ANSI C standard, and the new FORTRAN-90 standard.

After a standard is adopted by one standards organization (ISO or ANSI), the definition is re-evaluated by the other. Ideally, the new standard is accepted by the other organization as well; for example, ANSI adopted the ISO standard for Pascal nearly unchanged. However, smooth sailing is not always the rule. The new ANSI C standard was rejected by some ISO committee members, and a number of amendments had to be made during the standardization process. The first standard for a language often clears up ambiguities, fixes obvious defects, and defines a better and more portable language; the ANSI C and ANSI LISP standards, for instance, do all of these. Programmers writing new translators for these languages must then conform to the common standard.

Implementations may also include words and structures, called extensions, that go beyond anything specified in the standard.

2.8. 1.8 Summary

In this chapter, we first introduced general concepts about programming language design and implementation options, all of which strongly determine the overall efficiency of a programming language. We then went through the history of programming languages. We saw how new languages inherited successful concepts from their ancestors, and how they sometimes introduced new concepts of their own. We discussed how major programming paradigms evolved. Finally, we identified a number of technical and economic criteria that must be taken into account when selecting a language for a particular software development project. All this information will be of key importance in the following chapters of this textbook.

2.9. 1.9 Exercises

Exercise 1.1. Write an evaluation of a programming language you are familiar with, using the criteria described in this chapter.

Exercise 1.2. C requires a semicolon to be placed between the then and else branches of a conditional if-then-else statement, whereas this is prohibited in Pascal. What are the pros and cons of the two regulations?

Exercise 1.3. Certain programming languages distinguish between uppercase and lowercase characters in identifiers. What are the pros and cons of this design decision?

Exercise 1.4. FORTRAN does not require all variables to be declared before being used. What problems may result from this during syntax processing?

Exercise 1.5. Explain the different factors that determine the overall cost of a programming language.

Exercise 1.6. Describe, in your own words, the concept of orthogonality in programming language design.

2.10. 1.10 Useful tips

Tip 1.1. Consider the following: overloading, memory allocation, support of data abstraction, different ways of using complex conditions in loops, etc.

Tip 1.2. Consider what semicolons are usually used for and whether this use is justified within an if-then-else statement. Also think about a case when you later add an else clause to an existing if-then statement. What happens to the semicolon?

Tip 1.3. Consider whether additional information about the case of letters is necessary or useful. Does this extra information positively or negatively affect the readability of the program code?

Tip 1.4. Ask yourself what would happen to typographical errors in FORTRAN.

Tip 1.5. The different aspects of the cost of a programming language are a) the cost of deployment, b) the cost of maintenance, and c) the cost of support.

Tip 1.6. Orthogonality is the property that means "Changing A does not change B". An example of an orthogonal system would be a radio, where changing the station does not change the volume and vice-versa. A non-orthogonal system would be like a helicopter where changing the speed may change the direction.

2.11. 1.11 Solutions

Solution 1.1. We will consider the Java language.

Readability. In terms of simplicity, Java has some readability issues. There is feature multiplicity in Java, as shown in the textbook with the example of

count=count+1, count++, count+=1, ++count

being four different ways to increment an integer by one. Another problem is operator overloading, since Java allows some operators, such as the + sign, to operate on integers, floating-point numbers, and other number types (and + also concatenates strings).
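A minimal Java sketch of both points (our own illustration, not taken from the textbook):

    public class ReadabilityDemo {
        public static void main(String[] args) {
            // Feature multiplicity: four equivalent ways to add one
            int count = 0;
            count = count + 1;
            count++;
            count += 1;
            ++count;
            System.out.println(count);       // 4

            // Operator overloading: the same + sign means different things
            System.out.println(1 + 2);       // integer addition: 3
            System.out.println(1.5 + 2.5);   // floating-point addition: 4.0
            System.out.println("1" + 2);     // string concatenation: "12"
        }
    }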

Control statements in Java are more readable than those in Basic and Fortran programs because loops can use complex conditions, and there is no need for goto statements that have the reader leaping to other lines of code that could be far away or out of order. However, the use of braces to designate the starting and stopping points of all compound statements can lead to some confusion.

Writability. Java has a fair bit of orthogonality in that its primitive constructs can be used in various different ways. Because Java is an imperative language that also supports object-oriented programming, it can be fairly complex. Java supports data abstraction, so it is easier to create a binary tree in Java with its dynamic storage and references than in a language like Fortran 77. Java also has a for statement, which is easier to use than a typical while statement. Java is a high-level programming language, so specifying details like memory allocation is unnecessary thanks to Java's dynamic array system.
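As a minimal sketch of the data-abstraction point (class and field names are our own, chosen for the example):

    // A minimal generic binary tree node: dynamic allocation and
    // object references keep the definition short and direct.
    public class TreeNode<T> {
        T value;
        TreeNode<T> left, right;

        TreeNode(T value) { this.value = value; }

        public static void main(String[] args) {
            // Build a two-level tree, allocating nodes on demand
            TreeNode<String> root = new TreeNode<>("root");
            root.left = new TreeNode<>("left child");
            root.right = new TreeNode<>("right child");
            System.out.println(root.left.value);
        }
    }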

Solution 1.2.

Semicolons are mostly used between two statements, either to separate them (Pascal) or to terminate the preceding statement (C). Since a conditional if-then-else statement is a single statement, there is no justifiable reason to place a semicolon between the then and else branches. In this theoretical sense, the Pascal version is more appropriate.

On the other hand, when you want to add a new else branch to an existing if-then statement in Pascal, you need to go back to delete the semicolon in the preceding line; if you forget this, a syntax error is generated. This problem does not occur in C since the semicolon can remain at the end of the then branch. Therefore, the C version is more practical.
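A sketch in Java, which follows C's rule here (the Pascal counterpart appears in comments for contrast):

    public class SemicolonDemo {
        public static void main(String[] args) {
            int x = 5;
            int y;
            // Java follows C: the semicolon terminates each statement,
            // so it stays in place when an else branch is added later.
            if (x > 0)
                y = 1;
            else
                y = -1;
            System.out.println(y);
            // In Pascal, a semicolon before "else" is a syntax error:
            //   if x > 0 then y := 1   { no semicolon allowed here }
            //   else y := -1;
        }
    }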

Solution 1.3. We consider the distinction between uppercase and lowercase characters in identifiers.

Pros:

The same word can be used with different meanings depending on the use of uppercase or lowercase letters. For example, in Java, Byte is a class whereas byte is a primitive type. We can also differentiate between constants and variables, or between dynamic and static names.
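A small Java illustration of both uses (the uppercase-constant style below is a convention, not a language rule):

    public class CaseDemo {
        public static void main(String[] args) {
            byte small = 42;                  // primitive type
            Byte boxed = Byte.valueOf(small); // wrapper class: same word, different case
            System.out.println(boxed + " fits in " + Byte.SIZE + " bits");

            // Case also separates constants from variables by convention:
            final int MAX_RETRIES = 3;
            int maxRetries = MAX_RETRIES;
            System.out.println(maxRetries);
        }
    }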

Cons:

Case sensitivity may lead to small, hard-to-detect differences between identifiers.

Note that a different situation arises when both uppercase and lowercase characters are allowed but no distinction is made between them at the syntactic level. A typical example is Visual Basic.

Solution 1.4. In the case of a typographical error, the compiler cannot tell whether it is seeing an error or a new variable, so it may silently create a new variable instead of reporting a syntax error. A misspelled variable name is thus implicitly declared as a fresh, uninitialized variable, and the program computes with a wrong value.

Solution 1.5. The different aspects of the cost of a programming language are:

• The cost of deployment;

• The cost of maintenance;

• The cost of support.

Solution 1.6.

Orthogonality is the property that means "Changing A does not change B". An example of an orthogonal system would be a radio, where changing the station does not change the volume and vice-versa. A non-orthogonal system would be like a helicopter where changing the speed can change the direction.

In programming languages this means that when you execute an instruction, nothing but that instruction happens (very important for debugging). The term also has a specific meaning when referring to instruction sets, where it indicates that operations and addressing modes can be combined independently.

3. 2 Lexical elements (Judit Nyéky-Gaizler, Attila Kispitye)

The common characteristic of source code in different programming languages is that it is made up of sequences of symbols from a given set. The structure of these sequences is described by the lexical and syntactic rules of the given programming language. The basic language units, called lexical elements, are the building blocks of program units. In this chapter we examine what kinds of symbol sets lexical units can be built from, to what extent this is standardized for each programming language, how identifiers are constructed, and which numeric, character and text literals are allowed. We also discuss the comment forms applicable in source code, since these can affect the reliability of our programs.

Source code is made of one or more compilation units.{About compilation units see Chapter 4.} Compilation units are built from sequences of lexical elements, which are defined by given rules as character sequences separated by delimiters. Lexical elements thus include delimiters, identifiers, numeric, character and text literals, and comments.

3.1. 2.1 Symbol sets

The symbol sets usable in source code define not only the program text but also the program's data input. For this reason, standardization of these symbol sets is a key factor for portability.

Computers manage and communicate data in binary form, based on bits in groups of 8, called octets. Therefore, the value range of an octet is an integer between 0 and 255, which is normally given in decimal, octal or hexadecimal form for better readability. Octets are often called bytes, but mind the difference: although an octet is represented with 8 bits (that is, with a byte), interpreting it as a byte means the above-mentioned positive value range, but on 8 bits negative values can also be encoded by assigning a sign bit, or by using different coding methods (like BCD{Binary Coded Decimal: the low and high four bits of a byte each represent a decimal digit.} or two's complement).
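A short Java sketch of these notations (a minimal illustration; Java's byte is a signed two's-complement type, which makes the octet/byte distinction visible):

    public class OctetDemo {
        public static void main(String[] args) {
            int octet = 255;                                    // largest octet value
            System.out.println(Integer.toBinaryString(octet)); // 11111111
            System.out.println(Integer.toOctalString(octet));  // 377
            System.out.println(Integer.toHexString(octet));    // ff

            // Java's byte is a signed, two's-complement 8-bit type,
            // so the same bit pattern reads as a negative number:
            byte signed = (byte) 0xFF;
            System.out.println(signed);                         // -1
        }
    }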

There are many conventions on how an octet or a sequence of octets represents data, and naming conventions are used interchangeably for the number of octets and the number of bits that represent them. For example, four consecutive octets (32 bits) often represent a real number using some standardized encoding, while characters are implemented in ASCII by one octet and in UTF-16 by two octets (16 bits, 2 bytes).

It is important to distinguish between character set, character code and character encoding. We define these according to Jukka Korpela's study about characters [Korpela, 2002].

The character set is simply the set of all allowed characters. Nothing is presumed about the internal representation of the characters within the computer. The set does not even require an ordering of the characters; this must be defined separately. Character sets are normally defined by enumerating the name and visual appearance pairs of their elements. Keep in mind that a set may contain different characters with the same visual appearance, like the Latin capital A, the Cyrillic capital A and the Greek capital Alpha (A).

Examples:

EXCLAMATION !
QUESTION_MARK ?
SEMICOLON ;

These character set element names (or shorter: character names) are identifiers rather than definitions. Such names can usually contain letters from A to Z, spaces and underscores. The same character can have different names in different character set definitions. Character names presumably suggest some general meaning and hint at a usage scope, but be advised that the actual range of use can be much broader.

A character code is a mapping, usually given in tables, which defines a one-to-one correspondence between the elements of the character set and integer numbers. This means that a unique numeric code, a so-called code position, is assigned to every member of the set. The code mapping is seen as one contiguous table (irrespective of the actual number of defining tables), which is indexed by the code positions.

Synonyms for code position are code value, code point, code-set value, or simply just code.{It is not required that mapped character codes cover a contiguous integer range. Actually, most character codes have "holes": empty code positions, which are mapped to control sequences or are reserved for future use.}

Character encoding is an algorithm that defines a digital format for handling characters. It maps sequences of character codes to sequences of octets. In the simplest case every character code falls in the range 0-255 and is stored directly as a single octet; this allows, of course, for a maximum of 256 characters.

A character code table directly defines a character set, and the character encoding is often given by character codes (and the defining character set). Logically the character set is primary, providing the set of characters. It also gives the character codes, the numeric values assigned to the characters; for example, in the ISO 10646 character code the codes of the characters 'a', 'ä' and '‰' (the per mille sign) are 97, 228 and 8240. The character encoding defines how character codes are encoded as octet sequences. For example, one possible encoding of ISO 10646 uses two octets for every character, encoding 'a', 'ä' and the '‰' sign with the octet pairs (0, 97), (0, 228) and (32, 48). Using some of these notions ambiguously can lead to problems: character set can mean the character set proper, but also the character codes, or sometimes the character encoding.{Using the notion character set in the meaning of "encoding" is troublesome.}
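A hedged Java sketch that reproduces these numbers (Java strings use exactly such a two-octet-per-character scheme, UTF-16; the big-endian variant is requested explicitly below):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class EncodingDemo {
        public static void main(String[] args) {
            String s = "a\u00E4\u2030";   // 'a', 'ä', '‰'

            // Character codes (code positions)
            s.chars().forEach(System.out::println);   // 97, 228, 8240

            // A two-octets-per-character encoding: UTF-16, big-endian
            byte[] octets = s.getBytes(StandardCharsets.UTF_16BE);
            System.out.println(Arrays.toString(octets));
            // [0, 97, 0, -28, 32, 48] -- Java bytes are signed,
            // so octet 228 prints as -28 (228 - 256)
        }
    }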

The most widely used, internationally accepted and standardized character codes are ASCII, EBCDIC, ISO 8859-1, ISO 8859-2 and Unicode, the latter with multiple possible encodings. The growing demand for specific national characters played an important role in establishing these standards.

3.1.1. 2.1.1 The ASCII code

To understand why the introduction of ASCII in 1963 had such a big impact, it is worth mentioning that before that time different computers were unable to communicate with each other: every manufacturer had its own method to represent the letters of the alphabet, numbers and control codes. "Characters were represented in computers in more than 60 different ways. This was a real Tower of Babel," explained Bob Bemer [Brandel, 1999], who actively participated in the development of ASCII and is also known as the "father of ASCII".

ASCII stands for American Standard Code for Information Interchange; it acts as a "common denominator" for computers that may otherwise have nothing in common. It took more than two years until this code set, suggested by ANSI (American National Standards Institute), was established. Today this is the most common encoding. It is so prevalent that an "ASCII file" now simply denotes a text (that is, non-binary) file, even if its encoding is actually something different. Most encodings contain ASCII as a subset. The first 32 code points and code point 127 define control characters, like line feed (LF) or escape (ESC). The actual printable part of the character set is shown in Table 3. (The whole code table is in the Appendix 17.)
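A small Java sketch of the code-point layout described above (our own illustration):

    public class AsciiDemo {
        public static void main(String[] args) {
            // Printable characters map to small integer code points
            System.out.println((int) 'A');    // 65
            System.out.println((int) '0');    // 48

            // Control characters sit at code points 0-31 and 127
            char lf = '\n';                   // line feed
            char esc = '\u001B';              // escape
            System.out.println((int) lf);     // 10
            System.out.println((int) esc);    // 27

            // Walk the printable range 32..126
            for (char c = ' '; c <= '~'; c++) {
                System.out.print(c);
            }
            System.out.println();
        }
    }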

In the original standard the value range 128-255 was not used, but later, as code points were running out, various extensions were introduced. Such a widely used extension is shown in the Appendix 193. The ISO 8859-1 or Latin-1 and the ISO 8859-2 or Latin-2 character sets are actually extensions of ASCII, so the original ASCII code points keep their meaning in them.
