
Algorithms and Data Structures
© N. Wirth 1985 (Oberon version: August 2004)

Contents

Preface
1 Fundamental Data Structures
   1.1 Introduction
   1.2 The Concept of Data Type
   1.3 Primitive Data Types
   1.4 Standard Primitive Types
      1.4.1 Integer types
      1.4.2 The type REAL
      1.4.3 The type BOOLEAN
      1.4.4 The type CHAR
      1.4.5 The type SET
   1.5 The Array Structure
   1.6 The Record Structure
   1.7 Representation of Arrays, Records, and Sets
      1.7.1 Representation of Arrays
      1.7.2 Representation of Records
      1.7.3 Representation of Sets
   1.8 The File (Sequence)
      1.8.1 Elementary File Operators
      1.8.2 Buffering Sequences
      1.8.3 Buffering between Concurrent Processes
      1.8.4 Textual Input and Output
   1.9 Searching
      1.9.1 Linear Search
      1.9.2 Binary Search
      1.9.3 Table Search
      1.9.4 Straight String Search
      1.9.5 The Knuth-Morris-Pratt String Search
      1.9.6 The Boyer-Moore String Search
   Exercises
2 Sorting
   2.1 Introduction
   2.2 Sorting Arrays
      2.2.1 Sorting by Straight Insertion
      2.2.2 Sorting by Straight Selection
      2.2.3 Sorting by Straight Exchange
   2.3 Advanced Sorting Methods
      2.3.1 Insertion Sort by Diminishing Increment
      2.3.2 Tree Sort
      2.3.3 Partition Sort
      2.3.4 Finding the Median
      2.3.5 A Comparison of Array Sorting Methods
   2.4 Sorting Sequences
      2.4.1 Straight Merging
      2.4.2 Natural Merging
      2.4.3 Balanced Multiway Merging
      2.4.4 Polyphase Sort
      2.4.5 Distribution of Initial Runs
   Exercises
3 Recursive Algorithms
   3.1 Introduction
   3.2 When Not to Use Recursion
   3.3 Two Examples of Recursive Programs
   3.4 Backtracking Algorithms
   3.5 The Eight Queens Problem
   3.6 The Stable Marriage Problem
   3.7 The Optimal Selection Problem
   Exercises
4 Dynamic Information Structures
   4.1 Recursive Data Types
   4.2 Pointers
   4.3 Linear Lists
      4.3.1 Basic Operations
      4.3.2 Ordered Lists and Reorganizing Lists
      4.3.3 An Application: Topological Sorting
   4.4 Tree Structures
      4.4.1 Basic Concepts and Definitions
      4.4.2 Basic Operations on Binary Trees
      4.4.3 Tree Search and Insertion
      4.4.4 Tree Deletion
      4.4.5 Analysis of Tree Search and Insertion
   4.5 Balanced Trees
      4.5.1 Balanced Tree Insertion
      4.5.2 Balanced Tree Deletion
   4.6 Optimal Search Trees
   4.7 B-Trees
      4.7.1 Multiway B-Trees
      4.7.2 Binary B-Trees
   4.8 Priority Search Trees
   Exercises
5 Key Transformations (Hashing)
   5.1 Introduction
   5.2 Choice of a Hash Function
   5.3 Collision handling
   5.4 Analysis of Key Transformation
   Exercises
Appendices
   A The ASCII Character Set
   B The Syntax of Oberon
Index

Preface

In recent years the subject of computer programming has been recognized as a discipline whose mastery is fundamental and crucial to the success of many engineering projects and which is amenable to scientific treatment and presentation. It has advanced from a craft to an academic discipline. The initial outstanding contributions toward this development were made by E.W. Dijkstra and C.A.R. Hoare. Dijkstra's Notes on Structured Programming [1] opened a new view of programming as a scientific subject and intellectual challenge, and it coined the title for a "revolution" in programming. Hoare's Axiomatic Basis of Computer Programming [2] showed in a lucid manner that programs are amenable to an exacting analysis based on mathematical reasoning. Both these papers argue convincingly that many programming errors can be prevented by making programmers aware of the methods and techniques which they hitherto applied intuitively and often unconsciously. These papers focused their attention on the aspects of composition and analysis of programs, or more explicitly, on the structure of algorithms represented by program texts. Yet, it is abundantly clear that a systematic and scientific approach to program construction primarily has a bearing in the case of large, complex programs which involve complicated sets of data. Hence, a methodology of programming is also bound to include all aspects of data structuring. Programs, after all, are concrete formulations of abstract algorithms based on particular representations and structures of data. An outstanding contribution to bring order into the bewildering variety of terminology and concepts on data structures was made by Hoare through his Notes on Data Structuring [3]. It made clear that decisions about structuring data cannot be made without knowledge of the algorithms applied to the data and that, vice versa, the structure and choice of algorithms often depend strongly on the structure of the underlying data.
In short, the subjects of program composition and data structures are inseparably intertwined. Yet, this book starts with a chapter on data structure for two reasons. First, one has an intuitive feeling that data precede algorithms: you must have some objects before you can perform operations on them. Second, and this is the more immediate reason, this book assumes that the reader is familiar with the basic notions of computer programming. Traditionally and sensibly, however, introductory programming courses concentrate on algorithms operating on relatively simple structures of data. Hence, an introductory chapter on data structures seems appropriate. Throughout the book, and particularly in Chap. 1, we follow the theory and terminology expounded by Hoare and realized in the programming language Pascal [4]. The essence of this theory is that data in the first instance represent abstractions of real phenomena and are preferably formulated as abstract structures not necessarily realized in common programming languages. In the process of program construction the data representation is gradually refined -- in step with the refinement of the algorithm -- to comply more and more with the constraints imposed by an available programming system [5]. We therefore postulate a number of basic building principles of data structures, called the fundamental structures. It is most important that they are constructs that are known to be quite easily implementable on actual computers, for only in this case can they be considered the true elements of an actual data representation, as the molecules emerging from the final step of refinements of the data description. They are the record, the array (with fixed size), and the set. Not surprisingly, these basic building principles correspond to mathematical notions that are fundamental as well. A cornerstone of this theory of data structures is the distinction between fundamental and "advanced" structures.
The former are the molecules -- themselves built out of atoms -- that are the components of the latter. Variables of a fundamental structure change only their value, but never their structure and never the set of values they can assume. As a consequence, the size of the store they occupy remains constant. "Advanced" structures, however, are characterized by their change of value and structure during the execution of a program. More sophisticated techniques are therefore needed for their implementation. The sequence appears as a hybrid in this classification. It certainly varies its length; but that change in structure is of a trivial nature. Since the sequence plays a truly fundamental role in practically all computer systems, its treatment is included in Chap. 1. The second chapter treats sorting algorithms. It displays a variety of different methods, all serving the same purpose. Mathematical analysis of some of these algorithms shows the advantages and disadvantages of the methods, and it makes the programmer aware of the importance of analysis in the

choice of good solutions for a given problem. The partitioning into methods for sorting arrays and methods for sorting files (often called internal and external sorting) exhibits the crucial influence of data representation on the choice of applicable algorithms and on their complexity. The space allocated to sorting would not be so large were it not for the fact that sorting constitutes an ideal vehicle for illustrating so many principles of programming and situations occurring in most other applications. It often seems that one could compose an entire programming course by selecting examples from sorting only. Another topic that is usually omitted in introductory programming courses but one that plays an important role in the conception of many algorithmic solutions is recursion. Therefore, the third chapter is devoted to recursive algorithms. Recursion is shown to be a generalization of repetition (iteration), and as such it is an important and powerful concept in programming. In many programming tutorials, it is unfortunately exemplified by cases in which simple iteration would suffice. Instead, Chap. 3 concentrates on several examples of problems in which recursion allows for a most natural formulation of a solution, whereas use of iteration would lead to obscure and cumbersome programs. The class of backtracking algorithms emerges as an ideal application of recursion, but the most obvious candidates for the use of recursion are algorithms operating on data whose structure is defined recursively. These cases are treated in the last two chapters, for which the third chapter provides a welcome background. Chapter 4 deals with dynamic data structures, i.e., with data that change their structure during the execution of the program. It is shown that the recursive data structures are an important subclass of the dynamic structures commonly used. Although a recursive definition is both natural and possible in these cases, it is usually not used in practice.
Instead, the mechanism used in its implementation is made evident to the programmer by forcing him to use explicit reference or pointer variables. This book follows this technique and reflects the present state of the art: Chapter 4 is devoted to programming with pointers, to lists, trees and to examples involving even more complicated meshes of data. It presents what is often (and somewhat inappropriately) called list processing. A fair amount of space is devoted to tree organizations, and in particular to search trees. The chapter ends with a presentation of scatter tables, also called "hash" codes, which are often preferred to search trees. This provides the possibility of comparing two fundamentally different techniques for a frequently encountered application. Programming is a constructive activity. How can a constructive, inventive activity be taught? One method is to crystallize elementary composition principles out of many cases and exhibit them in a systematic manner. But programming is a field of vast variety often involving complex intellectual activities. The belief that it could ever be condensed into a sort of pure recipe teaching is mistaken. What remains in our arsenal of teaching methods is the careful selection and presentation of master examples. Naturally, we should not believe that every person is capable of gaining equally much from the study of examples. It is the characteristic of this approach that much is left to the student, to his diligence and intuition. This is particularly true of the relatively involved and long example programs. Their inclusion in this book is not accidental. Longer programs are the prevalent case in practice, and they are much more suitable for exhibiting that elusive but essential ingredient called style and orderly structure. They are also meant to serve as exercises in the art of program reading, which too often is neglected in favor of program writing.
This is a primary motivation behind the inclusion of larger programs as examples in their entirety. The reader is led through a gradual development of the program; he is given various snapshots in the evolution of a program, whereby this development becomes manifest as a stepwise refinement of the details. I consider it essential that programs are shown in final form with sufficient attention to details, for in programming, the devil hides in the details. Although the mere presentation of an algorithm's principle and its mathematical analysis may be stimulating and challenging to the academic mind, it seems dishonest to the engineering practitioner. I have therefore strictly adhered to the rule of presenting the final programs in a language in which they can actually be run on a computer. Of course, this raises the problem of finding a form which at the same time is both machine executable and sufficiently machine independent to be included in such a text. In this respect, neither widely used languages nor abstract notations proved to be adequate. The language Pascal provides an appropriate compromise; it had been developed with exactly this aim in mind, and it is therefore used throughout this book. The programs can easily be understood by programmers who are familiar with some other high-level language, such as ALGOL 60 or PL/1, because it is easy to understand the Pascal notation while proceeding through the text. However, this is not to say that some preparation would not be beneficial. The

book Systematic Programming [6] provides an ideal background because it is also based on the Pascal notation. The present book was, however, not intended as a manual on the language Pascal; there exist more appropriate texts for this purpose [7]. This book is a condensation -- and at the same time an elaboration -- of several courses on programming taught at the Federal Institute of Technology (ETH) at Zürich. I owe many ideas and views expressed in this book to discussions with my collaborators at ETH. In particular, I wish to thank Mr. H. Sandmayr for his careful reading of the manuscript, and Miss Heidi Theiler and my wife for their care and patience in typing the text. I should also like to mention the stimulating influence provided by meetings of the Working Groups 2.1 and 2.3 of IFIP, and particularly the many memorable arguments I had on these occasions with E. W. Dijkstra and C.A.R. Hoare. Last but not least, ETH generously provided the environment and the computing facilities without which the preparation of this text would have been impossible.

Zürich, Aug. 1975
N. Wirth

1. In Structured Programming. O.-J. Dahl, E.W. Dijkstra, C.A.R. Hoare. F. Genuys, Ed. (New York; Academic Press, 1972), pp. 1-82.
2. In Comm. ACM, 12, No. 10 (1969), 576-83.
3. In Structured Programming, pp. 83-174.
4. N. Wirth. The Programming Language Pascal. Acta Informatica, 1, No. 1 (1971), 35-63.
5. N. Wirth. Program Development by Stepwise Refinement. Comm. ACM, 14, No. 4 (1971), 221-27.
6. N. Wirth. Systematic Programming. (Englewood Cliffs, N.J. Prentice-Hall, Inc., 1973.)
7. K. Jensen and N. Wirth, PASCAL-User Manual and Report. (Berlin, Heidelberg, New York; Springer-Verlag, 1974).

Preface To The 1985 Edition

This new Edition incorporates many revisions of details and several changes of more significant nature. They were all motivated by experiences made in the ten years since the first Edition appeared. Most of the contents and the style of the text, however, have been retained. We briefly summarize the major alterations. The major change which pervades the entire text concerns the programming language used to express the algorithms. Pascal has been replaced by Modula-2. Although this change is of no fundamental influence to the presentation of the algorithms, the choice is justified by the simpler and more elegant syntactic structures of Modula-2, which often lead to a more lucid representation of an algorithm's structure. Apart from this, it appeared advisable to use a notation that is rapidly gaining acceptance by a wide community, because it is well-suited for the development of large programming systems. Nevertheless, the fact that Pascal is Modula's ancestor is very evident and eases the task of a transition. The syntax of Modula is summarized in the Appendix for easy reference. As a direct consequence of this change of programming language, Sect. 1.11 on the sequential file structure has been rewritten. Modula-2 does not offer a built-in file type. The revised Sect. 1.11 presents the concept of a sequence as a data structure in a more general manner, and it introduces a set of program modules that incorporate the sequence concept in Modula-2 specifically. The last part of Chapter 1 is new. It is dedicated to the subject of searching and, starting out with linear and binary search, leads to some recently invented fast string searching algorithms. In this section in particular we use assertions and loop invariants to demonstrate the correctness of the presented algorithms. A new section on priority search trees rounds off the chapter on dynamic data structures. This species of trees was also unknown when the first Edition appeared.
They allow an economical representation and a fast search of point sets in a plane.

The entire fifth chapter of the first Edition has been omitted. It was felt that the subject of compiler construction was somewhat isolated from the preceding chapters and would rather merit a more extensive treatment in its own volume. Finally, the appearance of the new Edition reflects a development that has profoundly influenced publications in the last ten years: the use of computers and sophisticated algorithms to prepare and automatically typeset documents. This book was edited and laid out by the author with the aid of a Lilith computer and its document editor Lara. Without these tools, not only would the book become more costly, but it would certainly not be finished yet.

Palo Alto, March 1985

N. Wirth

Notation

The following notations, adopted from publications of E.W. Dijkstra, are used in this book. In logical expressions, the character & denotes conjunction and is pronounced as and. The character ~ denotes negation and is pronounced as not. Boldface A and E are used to denote the universal and existential quantifiers. In the following formulas, the left part is the notation used and defined here in terms of the right part. Note that the left parts avoid the use of the symbol "...", which appeals to the reader's intuition.

    A i: m ≤ i < n : P_i   ≡   P_m & P_{m+1} & ... & P_{n-1}

The P_i are predicates, and the formula asserts that for all indices i ranging from a given value m to, but excluding, a value n, P_i holds.

    E i: m ≤ i < n : P_i   ≡   P_m or P_{m+1} or ... or P_{n-1}

The P_i are predicates, and the formula asserts that for some indices i ranging from a given value m to, but excluding, a value n, P_i holds.

    S i: m ≤ i < n : x_i     =   x_m + x_{m+1} + ... + x_{n-1}

    MIN i: m ≤ i < n : x_i   =   minimum(x_m, ..., x_{n-1})

    MAX i: m ≤ i < n : x_i   =   maximum(x_m, ..., x_{n-1})
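As an illustration only, the five notations above can be mirrored by small executable functions. The sketch below is written in Python rather than in the book's Oberon, and the function names (forall, exists, summation, minimum, maximum) are our own assumptions, not part of the notation. In each case the index i ranges over m ≤ i < n, i.e., n itself is excluded.

```python
# A minimal sketch (Python, not the book's Oberon); all names are illustrative.

def forall(m, n, p):
    """A i: m <= i < n : P_i  -- True if p(i) holds for every i in the range."""
    return all(p(i) for i in range(m, n))

def exists(m, n, p):
    """E i: m <= i < n : P_i  -- True if p(i) holds for at least one i."""
    return any(p(i) for i in range(m, n))

def summation(m, n, x):
    """S i: m <= i < n : x_i  -- sum of x[m] ... x[n-1]."""
    return sum(x[i] for i in range(m, n))

def minimum(m, n, x):
    """MIN i: m <= i < n : x_i  -- requires a non-empty range."""
    return min(x[i] for i in range(m, n))

def maximum(m, n, x):
    """MAX i: m <= i < n : x_i  -- requires a non-empty range."""
    return max(x[i] for i in range(m, n))

x = [3, 1, 4, 1, 5, 9, 2, 6]
all_positive = forall(0, len(x), lambda i: x[i] > 0)   # every element is positive
has_five = exists(2, 5, lambda i: x[i] == 5)           # x[4] == 5 lies in the range
partial_sum = summation(1, 4, x)                       # x[1] + x[2] + x[3]
```

For an empty range (m = n) the sketch yields TRUE for the universal quantifier, FALSE for the existential quantifier, and 0 for the sum, whereas minimum and maximum are undefined and raise an error in this Python rendering.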


1. Fundamental Data Structures

1.1. Introduction

The modern digital computer was invented and intended as a device that should facilitate and speed up complicated and time-consuming computations. In the majority of applications its capability to store and access large amounts of information plays the dominant part and is considered to be its primary characteristic, and its ability to compute, i.e., to calculate, to perform arithmetic, has in many cases become almost irrelevant. In all these cases, the large amount of information that is to be processed in some sense represents an abstraction of a part of reality. The information that is available to the computer consists of a selected set of data about the actual problem, namely that set that is considered relevant to the problem at hand, that set from which it is believed that the desired results can be derived. The data represent an abstraction of reality in the sense that certain properties and characteristics of the real objects are ignored because they are peripheral and irrelevant to the particular problem. An abstraction is thereby also a simplification of facts. We may regard a personnel file of an employer as an example. Every employee is represented (abstracted) on this file by a set of data relevant either to the employer or to his accounting procedures. This set may include some identification of the employee, for example, his or her name and salary. But it will most probably not include irrelevant data such as the hair color, weight, and height. In solving a problem with or without a computer it is necessary to choose an abstraction of reality, i.e., to define a set of data that is to represent the real situation. This choice must be guided by the problem to be solved. Then follows a choice of representation of this information. This choice is guided by the tool that is to solve the problem, i.e., by the facilities offered by the computer. In most cases these two steps are not entirely separable.
The choice of representation of data is often a fairly difficult one, and it is not uniquely determined by the facilities available. It must always be taken in the light of the operations that are to be performed on the data. A good example is the representation of numbers, which are themselves abstractions of properties of objects to be characterized. If addition is the only (or at least the dominant) operation to be performed, then a good way to represent the number n is to write n strokes. The addition rule on this representation is indeed very obvious and simple. The Roman numerals are based on the same principle of simplicity, and the adding rules are similarly straightforward for small numbers. On the other hand, the representation by Arabic numerals requires rules that are far from obvious (for small numbers) and they must be memorized. However, the situation is reversed when we consider either addition of large numbers or multiplication and division. The decomposition of these operations into simpler ones is much easier in the case of representation by Arabic numerals because of their systematic structuring principle that is based on positional weight of the digits. It is generally known that computers use an internal representation based on binary digits (bits). This representation is unsuitable for human beings because of the usually large number of digits involved, but it is most suitable for electronic circuits because the two values 0 and 1 can be represented conveniently and reliably by the presence or absence of electric currents, electric charge, or magnetic fields. From this example we can also see that the question of representation often transcends several levels of detail. Given the problem of representing, say, the position of an object, the first decision may lead to the choice of a pair of real numbers in, say, either Cartesian or polar coordinates. 
The second decision may lead to a floating-point representation, where every real number x consists of a pair of integers denoting a fraction f and an exponent e to a certain base (such that x = f × 2^e). The third decision, based on the knowledge that the data are to be stored in a computer, may lead to a binary, positional representation of integers, and the final decision could be to represent binary digits by the electric charge in a semiconductor storage device. Evidently, the first decision in this chain is mainly influenced by the problem situation, and the later ones are progressively dependent on the tool and its technology. Thus, it can hardly be required that a programmer decide on the number representation to be employed, or even on the storage device characteristics. These lower-level decisions can be left to the designers of computer equipment, who have the most information available on current technology with which to make a sensible choice that will be acceptable for all (or almost all) applications where numbers play a role.
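The second decision in this chain, the representation of a real number x by a pair of integers f and e with x = f × 2^e, can be made concrete with a small sketch. It is written in Python rather than Oberon, and it assumes that the host's floating-point numbers are IEEE 754 double-precision values with a 53-bit fraction (true of common Python implementations, but an assumption nonetheless). The names decompose, recompose, and MANTISSA_BITS are our own.

```python
import math

# Assumption: Python floats are IEEE 754 doubles with a 53-bit fraction.
MANTISSA_BITS = 53

def decompose(x):
    """Return integers (f, e) such that x == f * 2**e."""
    m, e = math.frexp(x)             # x == m * 2**e with 0.5 <= |m| < 1, or m == 0
    f = int(m * 2**MANTISSA_BITS)    # scale the fraction so that it becomes an integer
    return f, e - MANTISSA_BITS      # compensate for the scaling in the exponent

def recompose(f, e):
    """Rebuild the real number from its pair of integers: f * 2**e."""
    return math.ldexp(f, e)

f, e = decompose(6.0)                # f is an integer, e is a (negative) integer
roundtrip = recompose(*decompose(0.1))
```

The pair (f, e) is not unique, since doubling f and decrementing e denotes the same number; the sketch simply picks the pair in which f uses the full 53 bits. How f and e are themselves stored as bits is the subject of the third and fourth decisions described above.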

In this context, the significance of programming languages becomes apparent. A programming language represents an abstract computer capable of interpreting the terms used in this language, which may embody a certain level of abstraction from the objects used by the actual machine. Thus, the programmer who uses such a higher-level language will be freed (and barred) from questions of number representation, if the number is an elementary object in the realm of this language. The importance of using a language that offers a convenient set of basic abstractions common to most problems of data processing lies mainly in the area of reliability of the resulting programs. It is easier to design a program based on reasoning with familiar notions of numbers, sets, sequences, and repetitions than on bits, storage units, and jumps. Of course, an actual computer represents all data, whether numbers, sets, or sequences, as a large mass of bits. But this is irrelevant to the programmer as long as he or she does not have to worry about the details of representation of the chosen abstractions, and as long as he or she can rest assured that the corresponding representation chosen by the computer (or compiler) is reasonable for the stated purposes. The closer the abstractions are to a given computer, the easier it is to make a representation choice for the engineer or implementor of the language, and the higher is the probability that a single choice will be suitable for all (or almost all) conceivable applications. This fact sets definite limits on the degree of abstraction from a given real computer. For example, it would not make sense to include geometric objects as basic data items in a general-purpose language, since their proper representation will, because of its inherent complexity, be largely dependent on the operations to be applied to these objects.
The nature and frequency of these operations will, however, not be known to the designer of a general-purpose language and its compiler, and any choice the designer makes may be inappropriate for some potential applications. In this book these deliberations determine the choice of notation for the description of algorithms and their data. Clearly, we wish to use familiar notions of mathematics, such as numbers, sets, sequences, and so on, rather than computer-dependent entities such as bitstrings. But equally clearly we wish to use a notation for which efficient compilers are known to exist. It is equally unwise to use a closely machine-oriented and machine-dependent language, as it is unhelpful to describe computer programs in an abstract notation that leaves problems of representation widely open. The programming language Pascal had been designed in an attempt to find a compromise between these extremes, and the successor languages Modula-2 and Oberon are the result of decades of experience [1-3]. Oberon retains Pascal's basic concepts and incorporates some improvements and some extensions; it is used throughout this book [1-5]. It has been successfully implemented on several computers, and it has been shown that the notation is sufficiently close to real machines that the chosen features and their representations can be clearly explained. The language is also sufficiently close to other languages, and hence the lessons taught here may equally well be applied in their use.

1.2. The Concept of Data Type

In mathematics it is customary to classify variables according to certain important characteristics. Clear distinctions are made between real, complex, and logical variables or between variables representing individual values, or sets of values, or sets of sets, or between functions, functionals, sets of functions, and so on. This notion of classification is equally if not more important in data processing. We will adhere to the principle that every constant, variable, expression, or function is of a certain type. This type essentially characterizes the set of values to which a constant belongs, or which can be assumed by a variable or expression, or which can be generated by a function. In mathematical texts the type of a variable is usually deducible from the typeface without consideration of context; this is not feasible in computer programs. Usually there is one typeface available on computer equipment (i.e., Latin letters). The rule is therefore widely accepted that the associated type is made explicit in a declaration of the constant, variable, or function, and that this declaration textually precedes the application of that constant, variable, or function. This rule is particularly sensible if one considers the fact that a compiler has to make a choice of representation of the object within the store of a computer. Evidently, the amount of storage allocated to a variable will have to be chosen according to the size of the range of values that the variable may assume. If this information is known to a compiler, so-called dynamic storage allocation can be avoided. This is very often the key to an efficient realization of an algorithm.

The primary characteristics of the concept of type that is used throughout this text, and that is embodied in the programming language Oberon, are the following [1-2]:

1. A data type determines the set of values to which a constant belongs, or which may be assumed by a variable or an expression, or which may be generated by an operator or a function.
2. The type of a value denoted by a constant, variable, or expression may be derived from its form or its declaration without the necessity of executing the computational process.
3. Each operator or function expects arguments of a fixed type and yields a result of a fixed type. If an operator admits arguments of several types (e.g., + is used for addition of both integers and real numbers), then the type of the result can be determined from specific language rules.

As a consequence, a compiler may use this information on types to check the legality of various constructs. For example, the mistaken assignment of a Boolean (logical) value to an arithmetic variable may be detected without executing the program. This kind of redundancy in the program text is extremely useful as an aid in the development of programs, and it must be considered as the primary advantage of good high-level languages over machine code (or symbolic assembly code). Evidently, the data will ultimately be represented by a large number of binary digits, irrespective of whether or not the program had initially been conceived in a high-level language using the concept of type or in a typeless assembly code. To the computer, the store is a homogeneous mass of bits without apparent structure. But it is exactly this abstract structure which alone is enabling human programmers to recognize meaning in the monotonous landscape of a computer store. The theory presented in this book and the programming language Oberon specify certain methods of defining data types. In most cases new data types are defined in terms of previously defined data types.
Values of such a type are usually conglomerates of component values of the previously defined constituent types, and they are said to be structured. If there is only one constituent type, that is, if all components are of the same constituent type, then it is known as the base type. The number of distinct values belonging to a type T is called its cardinality. The cardinality provides a measure for the amount of storage needed to represent a variable x of the type T, denoted by x: T. Since constituent types may again be structured, entire hierarchies of structures may be built up, but, obviously, the ultimate components of a structure are atomic. Therefore, it is necessary that a notation is provided to introduce such primitive, unstructured types as well. A straightforward method is that of enumerating the values that are to constitute the type. For example in a program concerned with plane geometric figures, we may introduce a primitive type called shape, whose values may be denoted by the identifiers rectangle, square, ellipse, circle. But apart from such programmer-defined types, there will have to be some standard, predefined types. They usually include numbers and logical values. If an ordering exists among the individual values, then the type is said to be ordered or scalar. In Oberon, all unstructured types are ordered; in the case of explicit enumeration, the values are assumed to be ordered by their enumeration sequence. With this tool in hand, it is possible to define primitive types and to build conglomerates, structured types up to an arbitrary degree of nesting. In practice, it is not sufficient to have only one general method of combining constituent types into a structure. With due regard to practical problems of representation and use, a general-purpose programming language must offer several methods of structuring. In a mathematical sense, they are equivalent; they differ in the operators available to select components of these structures. 
The basic structuring methods presented here are the array, the record, the set, and the sequence. More complicated structures are not usually defined as static types, but are instead dynamically generated during the execution of the program, when they may vary in size and shape. Such structures are the subject of Chap. 4 and include lists, rings, trees, and general, finite graphs.

Variables and data types are introduced in a program in order to be used for computation. To this end, a set of operators must be available. For each standard data type a programming language offers a certain set of primitive, standard operators, and likewise with each structuring method a distinct operation and notation for selecting a component. The task of composition of operations is often considered the heart of the art of programming. However, it will become evident that the appropriate composition of data is equally fundamental and essential.

The most important basic operators are comparison and assignment, i.e., the test for equality (and for order in the case of ordered types), and the command to enforce equality. The fundamental difference between these two operations is emphasized by the clear distinction in their denotation throughout this text.

    Test for equality:   x = y     (an expression with value TRUE or FALSE)
    Assignment to x:     x := y    (a statement making x equal to y)

These fundamental operators are defined for most data types, but it should be noted that their execution may involve a substantial amount of computational effort, if the data are large and highly structured. For the standard primitive data types, we postulate not only the availability of assignment and comparison, but also a set of operators to create (compute) new values. Thus we introduce the standard operations of arithmetic for numeric types and the elementary operators of propositional logic for logical values.

1.3. Primitive Data Types

A new, primitive type is definable by enumerating the distinct values belonging to it. Such a type is called an enumeration type. Its definition has the form

    TYPE T = (c1, c2, ... , cn)

T is the new type identifier, and the ci are the new constant identifiers.

Examples

    TYPE shape = (rectangle, square, ellipse, circle)
    TYPE color = (red, yellow, green)
    TYPE sex = (male, female)
    TYPE weekday = (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday)
    TYPE currency = (franc, mark, pound, dollar, shilling, lira, guilder, krone, ruble, cruzeiro, yen)
    TYPE destination = (hell, purgatory, heaven)
    TYPE vehicle = (train, bus, automobile, boat, airplane)
    TYPE rank = (private, corporal, sergeant, lieutenant, captain, major, colonel, general)
    TYPE object = (constant, type, variable, procedure, module)
    TYPE structure = (array, record, set, sequence)
    TYPE condition = (manual, unloaded, parity, skew)

The definition of such types introduces not only a new type identifier, but at the same time the set of identifiers denoting the values of the new type. These identifiers may then be used as constants throughout the program, and they enhance its understandability considerably. If, as an example, we introduce variables s, d, r, and b

    VAR s: sex
    VAR d: weekday
    VAR r: rank
    VAR b: BOOLEAN

then the following assignment statements are possible:

    s := male
    d := Sunday
    r := major
    b := TRUE

Evidently, they are considerably more informative than their counterparts

    s := 1    d := 7    r := 6    b := 2

which are based on the assumption that s, d, r, and b are defined as integers and that the constants are mapped onto the natural numbers in the order of their enumeration. Furthermore, a compiler can check against the inconsistent use of operators. For example, given the declaration of s above, the statement s := s+1 would be meaningless. If, however, we recall that enumerations are ordered, then it is sensible to introduce operators that generate the successor and predecessor of their argument. We therefore postulate the following standard operators, which assign to their argument its successor and predecessor respectively:

    INC(x)    DEC(x)
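As a minimal sketch of an ordered enumeration with successor and predecessor, the following fragment models the type weekday in Python (used here purely for illustration; it is not the notation of this text). The names Weekday, succ, and pred are assumptions of the sketch. Unlike INC and DEC, which update their argument, these functions return a new value.

```python
from enum import IntEnum

# Hypothetical model of: TYPE weekday = (Monday, ..., Sunday)
# The values are numbered 0..6 in the order of their enumeration.
Weekday = IntEnum(
    "Weekday",
    "Monday Tuesday Wednesday Thursday Friday Saturday Sunday",
    start=0,
)

def succ(x):
    """Return the successor of x; raises ValueError beyond the last value."""
    return type(x)(x + 1)

def pred(x):
    """Return the predecessor of x; raises ValueError before the first value."""
    return type(x)(x - 1)

d = Weekday.Sunday
print(d.name, pred(d).name, Weekday.Monday < Weekday.Sunday)
```

Stepping past either end of the enumeration is rejected at run time, which is one possible reading of what a checking system would do; the text itself does not prescribe this behaviour.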

1.4. Standard Primitive Types

Standard primitive types are those types that are available on most computers as built-in features. They include the whole numbers, the logical truth values, and a set of printable characters. On many computers fractional numbers are also incorporated, together with the standard arithmetic operations. We denote these types by the identifiers

    INTEGER, REAL, BOOLEAN, CHAR, SET

1.4.1. Integer types

The type INTEGER comprises a subset of the whole numbers whose size may vary among individual computer systems. If a computer uses n bits to represent an integer in two's complement notation, then the admissible values x must satisfy -2^(n-1) ≤ x < 2^(n-1). It is assumed that all operations on data of this type are exact and correspond to the ordinary laws of arithmetic, and that the computation will be interrupted in the case of a result lying outside the representable subset. This event is called overflow. The standard operators are the four basic arithmetic operations of addition (+), subtraction (-), multiplication (*), and division (/, DIV). Whereas the slash denotes ordinary division resulting in a value of type REAL, the operator DIV denotes integer division resulting in a value of type INTEGER. If we define the quotient q = m DIV n and the remainder r = m MOD n, the following relations hold, assuming n > 0:

    q*n + r = m   and   0 ≤ r < n

Examples:

    31 DIV 10 = 3        31 MOD 10 = 1
    -31 DIV 10 = -4      -31 MOD 10 = 9

We know that dividing by 10^n can be achieved by merely shifting the decimal digits n places to the right and thereby ignoring the lost digits. The same method applies if numbers are represented in binary instead of decimal form. If two's complement representation is used (as in practically all modern computers), then the shifts implement a division as defined by the above DIV operation. Moderately sophisticated compilers will therefore represent an operation of the form m DIV 2^n or m MOD 2^n by a fast shift (or mask) operation.

1.4.2. The type REAL

The type REAL denotes a subset of the real numbers. Whereas arithmetic with operands of the type INTEGER is assumed to yield exact results, arithmetic on values of type REAL is permitted to be inaccurate within the limits of round-off errors caused by computation on a finite number of digits. This is the principal reason for the explicit distinction between the types INTEGER and REAL, as it is made in most programming languages. The standard operators are the four basic arithmetic operations of addition (+), subtraction (-), multiplication (*), and division (/).

It is an essential aspect of data typing that different types are incompatible under assignment. An exception to this rule is made for assignment of integer values to real variables, because here the semantics are unambiguous. After all, integers form a subset of real numbers. However, the inverse direction is not permissible: assignment of a real value to an integer variable requires an operation such as truncation or rounding. The standard transfer function Entier(x) yields the integral part of x. Rounding of x is obtained by Entier(x + 0.5).
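The relations for DIV and MOD and the rounding rule can be checked with a short Python sketch (Python is used for illustration only). It relies on the fact that Python's // and % round toward minus infinity, which agrees with the definition of DIV and MOD given above for n > 0. The function names entier and round_half_up are assumptions of the sketch, and Entier is taken here to mean the largest integer not exceeding x.

```python
import math

def entier(x):
    """Largest integer not exceeding x (assumed meaning of Entier)."""
    return math.floor(x)

def round_half_up(x):
    """Rounding as described in the text: Entier(x + 0.5)."""
    return entier(x + 0.5)

# DIV and MOD as defined in the text correspond to // and % in Python.
for m, n in [(31, 10), (-31, 10)]:
    q, r = m // n, m % n
    assert q * n + r == m and 0 <= r < n
    print(m, "DIV", n, "=", q, "   ", m, "MOD", n, "=", r)

# For powers of two, a right shift and a mask give the same results.
m, n = -31, 3
assert m >> n == m // 2**n
assert m & (2**n - 1) == m % 2**n
```

Note that Python's built-in round uses a different rule for values exactly halfway between two integers, so it is not a substitute for Entier(x + 0.5).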

Many programming languages do not include an exponentiation operator. The following is an algorithm for the fast computation of y = x^n, where n is a non-negative integer.

    y := 1.0; i := n;
    WHILE i > 0 DO  (* x0^n = x^i * y *)
      IF ODD(i) THEN y := y*x END ;
      x := x*x; i := i DIV 2
    END

1.4.3. The type BOOLEAN

The two values of the standard type BOOLEAN are denoted by the identifiers TRUE and FALSE. The Boolean operators are the logical conjunction, disjunction, and negation whose values are defined in Table 1.1. The logical conjunction is denoted by the symbol &, the logical disjunction by OR, and negation by "~". Note that comparisons are operations yielding a result of type BOOLEAN. Thus, the result of a comparison may be assigned to a variable, or it may be used as an operand of a logical operator in a Boolean expression. For instance, given Boolean variables p and q and integer variables x = 5, y = 8, z = 10, the two assignments

    p := x = y
    q := (x ≤ y) & (y < z)

yield p = FALSE and q = TRUE.

    p       q       p & q   p OR q  ~p
    TRUE    TRUE    TRUE    TRUE    FALSE
    TRUE    FALSE   FALSE   TRUE    FALSE
    FALSE   TRUE    FALSE   TRUE    TRUE
    FALSE   FALSE   FALSE   FALSE   TRUE

Table 1.1. Boolean Operators.

The Boolean operators & (AND) and OR have an additional property in most programming languages, which distinguishes them from other dyadic operators. Whereas, for example, the sum x+y is not defined if either x or y is undefined, the conjunction p&q is defined even if q is undefined, provided that p is FALSE. This conditionality is an important and useful property. The exact definition of & and OR is therefore given by the following equations:

    p & q   = if p then q else FALSE
    p OR q  = if p then TRUE else q
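The two defining equations can be written out directly. In the following Python sketch (for illustration only) the second operand is passed as a function without parameters, so that it is evaluated only when the equations require it; the names cand, cor, and find are assumptions of the sketch. Python's own operators and and or behave in the same conditional manner, which the search loop makes use of.

```python
def cand(p, q):
    """p & q  =  if p then q else FALSE; q is a zero-argument function."""
    return q() if p else False

def cor(p, q):
    """p OR q  =  if p then TRUE else q; q is a zero-argument function."""
    return True if p else q()

# The second operand is never evaluated here, so no division by zero occurs.
print(cand(False, lambda: 1 / 0 > 0), cor(True, lambda: 1 / 0 > 0))

def find(a, x):
    """Return the index of the first element equal to x, or len(a) if absent.
    The test a[i] != x is reached only when i < len(a) holds."""
    i = 0
    while i < len(a) and a[i] != x:
        i += 1
    return i
```

Without the conditional property, the loop in find would have to guard the access a[i] separately in order to avoid an index outside the array.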

1.4.4. The type CHAR

The standard type CHAR comprises a set of printable characters. Unfortunately, there is no generally accepted standard character set used on all computer systems. Therefore, the use of the predicate "standard" may in this case be almost misleading; it is to be understood in the sense of "standard on the computer system on which a certain program is to be executed."

The character set defined by the International Standards Organization (ISO), and particularly its American version ASCII (American Standard Code for Information Interchange) is the most widely accepted set. The ASCII set is therefore tabulated in Appendix A. It consists of 95 printable (graphic) characters and 33 control characters, the latter mainly being used in data transmission and for the control of printing equipment.

In order to be able to design algorithms involving characters (i.e., values of type CHAR) that are system independent, we should like to be able to assume certain minimal properties of character sets, namely:

1. The type CHAR contains the 26 capital Latin letters, the 26 lower-case letters, the 10 decimal digits, and a number of other graphic characters, such as punctuation marks.
2. The subsets of letters and digits are ordered and contiguous, i.e.,

       ("A" ≤ x) & (x ≤ "Z")    implies that x is a capital letter
       ("a" ≤ x) & (x ≤ "z")    implies that x is a lower-case letter
       ("0" ≤ x) & (x ≤ "9")    implies that x is a decimal digit

3. The type CHAR contains a non-printing, blank character and a line-end character that may be used as separators.

    THIS IS A TEXT

Fig. 1.1. Representations of a text

The availability of two standard type transfer functions between the types CHAR and INTEGER is particularly important in the quest to write programs in a machine independent form. We will call them ORD(ch), denoting the ordinal number of ch in the character set, and CHR(i), denoting the character with ordinal number i. Thus, CHR is the inverse function of ORD, and vice versa, that is,

    ORD(CHR(i)) = i    (if CHR(i) is defined)
    CHR(ORD(c)) = c

Furthermore, we postulate a standard function CAP(ch). Its value is defined as the capital letter corresponding to ch, provided ch is a letter.

    ch is a lower-case letter   implies that   CAP(ch) = corresponding capital letter
    ch is a capital letter      implies that   CAP(ch) = ch
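The functions ORD, CHR, and CAP have close counterparts in many languages. The following Python sketch (for illustration only) uses the built-in functions ord and chr, and defines cap and digit_value using nothing but the contiguity of letters and digits postulated above. The names cap and digit_value are assumptions of the sketch, as is the decision to return characters other than letters unchanged, a case the definition of CAP leaves open.

```python
def cap(ch):
    """Capital letter corresponding to ch.
    Characters that are not lower-case letters are returned unchanged
    (an assumption; the text defines CAP for letters only)."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - ord("a") + ord("A"))
    return ch

def digit_value(ch):
    """Numeric value of a decimal digit character."""
    if not ("0" <= ch <= "9"):
        raise ValueError("not a decimal digit")
    return ord(ch) - ord("0")

# CHR and ORD are inverses of each other.
for c in "Az09 ":
    assert chr(ord(c)) == c

print(cap("q"), cap("Q"), digit_value("7"))
```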

1.4.5. The type SET

The type SET denotes sets whose elements are integers in the range 0 to a small number, typically 31 or 63. Given, for example, variables

    VAR r, s, t: SET

possible assignments are

    r := {5}; s := {x, y .. z}; t := {}

Here, the value assigned to r is the singleton set consisting of the single element 5; to t is assigned the empty set, and to s the elements x, y, y+1, … , z-1, z. The following elementary operators are defined on variables of type SET:

    *     set intersection
    +     set union
    -     set difference
    /     symmetric set difference
    IN    set membership

Constructing the intersection or the union of two sets is often called set multiplication or set addition, respectively; the priorities of the set operators are defined accordingly, with the intersection operator having priority over the union and difference operators, which in turn have priority over the membership operator, which is classified as a relational operator. Following are examples of set expressions and their fully parenthesized equivalents:

    r*s+t        = (r*s) + t
    r-s*t        = r - (s*t)
    r-s+t        = (r-s) + t
    r+s/t        = r + (s/t)
    x IN s+t     = x IN (s+t)

1.5. The Array Structure

The array is probably the most widely used data structure; in some languages it is even the only one available. An array consists of components which are all of the same type, called its base type; it is therefore called a homogeneous structure. The array is a random-access structure, because all components can be selected at random and are equally quickly accessible. In order to denote an individual component, the name of the entire structure is augmented by the index selecting the component. This index is to be an integer between 0 and n-1, where n is the number of elements, the size, of the array.

    TYPE T = ARRAY n OF T0

Examples

    TYPE Row = ARRAY 4 OF REAL
    TYPE Card = ARRAY 80 OF CHAR
    TYPE Name = ARRAY 32 OF CHAR

A particular value of a variable

    VAR x: Row

with all components satisfying the equation x_i = 2^(-i), may be visualized as shown in Fig. 1.2.

    x0    1.0
    x1    0.5
    x2    0.25
    x3    0.125

Fig. 1.2. Array of type Row with x_i = 2^(-i)

An individual component of an array can be selected by an index. Given an array variable x, we denote an array selector by the array name followed by the respective component's index i, and we write x_i or x[i]. Because of the first, conventional notation, a component of an array is also called a subscripted variable.

The common way of operating with arrays, particularly with large arrays, is to selectively update single components rather than to construct entirely new structured values. This is expressed by considering an array variable as an array of component variables and by permitting assignments to selected components, such as for example x[i] := 0.125. Although selective updating causes only a single component value to change, from a conceptual point of view we must regard the entire composite value as having changed too.

The fact that array indices, i.e., names of array components, are integers, has a most important consequence: indices may be computed. A general index expression may be substituted in place of an index constant; this expression is to be evaluated, and the result identifies the selected component. This generality not only provides a most significant and powerful programming facility, but at the same time it also gives rise to one of the most frequently encountered programming mistakes: the resulting value may be outside the interval specified as the range of indices of the array. We will assume that decent computing systems provide a warning in the case of such a mistaken access to a non-existent array component.

The cardinality of a structured type, i.e., the number of values belonging to this type, is the product of the cardinalities of its components. Since all components of an array type T are of the same base type T0, we obtain

    card(T) = card(T0)^n
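For small base types the cardinality formula can be confirmed by brute force. The following Python sketch (for illustration only) enumerates every possible value of an array with n components over a given list of base values and counts them; the name card_array is an assumption of the sketch.

```python
from itertools import product

def card_array(base_values, n):
    """Count all arrays of n components drawn from base_values
    by enumerating them one by one."""
    return sum(1 for _ in product(base_values, repeat=n))

booleans = [False, True]
shapes = ["rectangle", "square", "ellipse", "circle"]

# card(T) = card(T0)^n
assert card_array(booleans, 4) == len(booleans) ** 4
assert card_array(shapes, 3) == len(shapes) ** 3
print(card_array(booleans, 4), card_array(shapes, 3))
```

The count grows exponentially with n, so the enumeration is practical only for very small arrays; the formula itself holds for any n.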

Constituents of array types may themselves be structured. An array variable whose components are again arrays is called a matrix. For example,

    M: ARRAY 10 OF Row

is an array consisting of ten components (rows), each consisting of four components of type REAL, and is called a 10 × 4 matrix with real components. Selectors may be concatenated accordingly, such that M_ij and M[i][j] denote the j-th component of row M_i, which is the i-th component of M. This is usually abbreviated as M[i, j], and in the same spirit the declaration

    M: ARRAY 10 OF ARRAY 4 OF REAL

can be written more concisely as M: ARRAY 10, 4 OF REAL.

If a certain operation has to be performed on all components of an array or on adjacent components of a section of the array, then this fact may conveniently be emphasized by using the FOR statement, as shown in the following examples for computing the sum and for finding the maximal element of an array declared as

    VAR a: ARRAY N OF INTEGER

    sum := 0;
    FOR i := 0 TO N-1 DO sum := a[i] + sum END

    k := 0; max := a[0];
    FOR i := 1 TO N-1 DO
      IF max < a[i] THEN k := i; max := a[k] END
    END

In a further example, assume that a fraction f is represented in its decimal form with k-1 digits, i.e., by an array d such that

    f = Σ i: 0 ≤ i < k: d_i * 10^(-i)

or

    f = d_0 + d_1 * 10^(-1) + d_2 * 10^(-2) + … + d_(k-1) * 10^(-(k-1))

Now assume that we wish to divide f by 2. This is done by repeating the familiar division operation for all k-1 digits d_i, starting with i=1. It consists of dividing each digit by 2 taking into account a possible carry from the previous position, and of retaining a possible remainder r for the next position:

    r := 10*r + d[i]; d[i] := r DIV 2; r := r MOD 2

This algorithm is used to compute a table of negative powers of 2. The repetition of halving to compute 2^(-1), 2^(-2), ... , 2^(-N) is again appropriately expressed by a FOR statement, thus leading to a nesting of two FOR statements.

    PROCEDURE Power(VAR W: Texts.Writer; N: INTEGER);
      (*compute decimal representation of negative powers of 2*)
      VAR i, k, r: INTEGER;
        d: ARRAY N OF INTEGER;
    BEGIN
      FOR k := 0 TO N-1 DO
        Texts.Write(W, "."); r := 0;
        FOR i := 0 TO k-1 DO
          r := 10*r + d[i]; d[i] := r DIV 2; r := r MOD 2;
          Texts.Write(W, CHR(d[i] + ORD("0")))
        END ;
        d[k] := 5; Texts.Write(W, "5"); Texts.WriteLn(W)
      END
    END Power.

The resulting output text for N = 10 is

    .5
    .25
    .125
    .0625
    .03125
    .015625
    .0078125
    .00390625
    .001953125
    .0009765625
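As a cross-check, the procedure Power can be transcribed almost line by line into Python (used for illustration only). In this sketch the lines are collected in a list instead of being written to a Texts.Writer, and the name negative_powers_of_two is an assumption; the digit array d and the carry r play the same roles as above.

```python
def negative_powers_of_two(n):
    """Return the decimal representations of 2^(-1) .. 2^(-n) as strings,
    computed by repeated halving of the digit array d."""
    d = [0] * n
    lines = []
    for k in range(n):
        out = ["."]
        r = 0
        for i in range(k):
            # halve digit i, taking the carry r from the previous position
            r = 10 * r + d[i]
            d[i] = r // 2
            r = r % 2
            out.append(str(d[i]))
        d[k] = 5          # the remainder 1 carried into a new digit gives 5
        out.append("5")
        lines.append("".join(out))
    return lines

for line in negative_powers_of_two(10):
    print(line)
```

Since every digit is computed with integer arithmetic only, the table is exact for any n, which would not be the case if the powers were computed with values of type REAL.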

1.6. The Record Structure

The most general method to obtain structured types is to join elements of arbitrary types, that are possibly themselves structured types, into a compound. Examples from mathematics are complex numbers, composed of two real numbers, and coordinates of points, composed of two or more numbers according to the dimensionality of the space spanned by the coordinate system. An example from data processing is describing people by a few relevant characteristics, such as their first and last names, their date of birth, sex, and marital status.

In mathematics such a compound type is the Cartesian product of its constituent types. This stems from the fact that the set of values defined by this compound type consists of all possible combinations of values, taken one from each set defined by each constituent type. Thus, the number of such combinations, also called n-tuples, is the product of the number of elements in each constituent set, that is, the cardinality of the compound type is the product of the cardinalities of the constituent types.

In data processing, composite types, such as descriptions of persons or objects, usually occur in files or data banks and record the relevant characteristics of a person or object. The word record has therefore become widely accepted to describe a compound of data of this nature, and we adopt this nomenclature in preference to the term Cartesian product. In general, a record type T with components of the types T1, T2, ... , Tn is defined as follows:

    TYPE T = RECORD s1: T1; s2: T2; ... sn: Tn END

    card(T) = card(T1) * card(T2) * ... * card(Tn)

Examples

    TYPE Complex = RECORD re, im: REAL END
    TYPE Date = RECORD day, month, year: INTEGER END
    TYPE Person = RECORD name, firstname: Name;
        birthdate: Date;
        sex: (male, female);
        marstatus: (single, married, widowed, divorced)
      END

We may visualize particular, record-structured values of, for example, the variables

    z: Complex
    d: Date
    p: Person

as shown in Fig. 1.3.

    Complex z      Date d      Person p
    1.0            1           SMITH
    -1.0           4           JOHN
                   1973        18  1  1986
                               male
                               single

Fig. 1.3. Records of type Complex, Date, and Person

The identifiers s1, s2, ... , sn introduced by a record type definition are the names given to the individual components of variables of that type. As components of records are called fields, the names are field identifiers. They are used in record selectors applied to record structured variables. Given a variable x: T, its i-th field is denoted by x.s_i. Selective updating of x is achieved by using the same selector denotation on the left side in an assignment statement:

    x.s_i := e

where e is a value (expression) of type T_i. Given, for example, the record variables z, d, and p declared above, the following are selectors of components:

    z.im               (of type REAL)
    d.month            (of type INTEGER)
    p.name             (of type Name)
    p.birthdate        (of type Date)
    p.birthdate.day    (of type INTEGER)

The example of the type Person shows that a constituent of a record type may itself be structured. Thus, selectors may be concatenated. Naturally, different structuring types may also be used in a nested fashion. For example, the i-th component of an array a being a component of a record variable r is denoted by r.a[i], and the component with the selector name s of the i-th record structured component of the array a is denoted by a[i].s.

It is a characteristic of the Cartesian product that it contains all combinations of elements of the constituent types. But it must be noted that in practical applications not all of them may be meaningful. For instance, the type Date as defined above includes the 31st April as well as the 29th February 1985, which are both dates that never occurred. Thus, the definition of this type does not mirror the actual situation entirely correctly; but it is close enough for practical purposes, and it is the responsibility of the programmer to ensure that meaningless values never occur during the execution of a program.

The following short excerpt from a program shows the use of record variables. Its purpose is to count the number of persons represented by the array variable family that are both female and single:

    VAR count: INTEGER;
      family: ARRAY N OF Person;

    count := 0;
    FOR i := 0 TO N-1 DO
      IF (family[i].sex = female) & (family[i].marstatus = single) THEN INC(count) END
    END

The record structure and the array structure have the common property that both are random-access structures. The record is more general in the sense that there is no requirement that all constituent types must be identical. In turn, the array offers greater flexibility by allowing its component selectors to be computable values (expressions), whereas the selectors of record components are field identifiers declared in the record type definition.
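The counting excerpt can be tried out with a small Python sketch (for illustration only) in which records are modelled by data classes and the two enumerations by Enum types. The class names, the function name count_single_women, and all sample persons other than the one shown in Fig. 1.3 are assumptions made up for this sketch.

```python
from dataclasses import dataclass
from enum import Enum

class Sex(Enum):
    MALE = 0
    FEMALE = 1

class MarStatus(Enum):
    SINGLE = 0
    MARRIED = 1
    WIDOWED = 2
    DIVORCED = 3

@dataclass
class Date:
    day: int
    month: int
    year: int

@dataclass
class Person:
    name: str
    firstname: str
    birthdate: Date
    sex: Sex
    marstatus: MarStatus

def count_single_women(family):
    """Count the persons that are both female and single."""
    count = 0
    for p in family:
        if p.sex == Sex.FEMALE and p.marstatus == MarStatus.SINGLE:
            count += 1
    return count

family = [
    Person("SMITH", "JOHN", Date(18, 1, 1986), Sex.MALE, MarStatus.SINGLE),
    # the following two entries are invented sample data
    Person("SMITH", "ANN", Date(1, 4, 1973), Sex.FEMALE, MarStatus.SINGLE),
    Person("SMITH", "MARY", Date(2, 5, 1950), Sex.FEMALE, MarStatus.MARRIED),
]
print(count_single_women(family), family[0].birthdate.day)
```

As with the type Date in the text, nothing in this sketch prevents the construction of a meaningless value such as Date(31, 4, 1985); that check remains the programmer's responsibility.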

1.7. Representation of Arrays, Records, and Sets

The essence of the use of abstractions in programming is that a program may be conceived, understood, and verified on the basis of the laws governing the abstractions, and that it is not necessary to have further insight and knowledge about the ways in which the abstractions are implemented and represented in a particular computer. Nevertheless, it is essential for a professional programmer to have an understanding of widely used techniques for representing the basic concepts of programming abstractions, such as the fundamental data structures. It is helpful insofar as it might enable the programmer to make sensible decisions about program and data design in the light not only of the abstract properties of structures, but also of their realizations on actual computers, taking into account a computer's particular capabilities and limitations.

The problem of data representation is that of mapping the abstract structure onto a computer store. Computer stores are, in a first approximation, arrays of individual storage cells called bytes. They are understood to be groups of 8 bits. The indices of the bytes are called addresses.

    VAR store: ARRAY StoreSize OF BYTE

The basic types are represented by a small number of bytes, typically 2, 4, or 8. Computers are designed to transfer internally such small numbers (possibly 1) of contiguous bytes concurrently, "in parallel". The unit transferable concurrently is called a word.

1.7.1. Representation of Arrays

A representation of an array structure is a mapping of the (abstract) array with components of type T onto the store which is an array with components of type BYTE. The array should be mapped in such a way that the computation of addresses of array components is as simple (and therefore as efficient) as possible.
The address i of the j-th array component is computed by the linear mapping function

    i = i0 + j*s

where i0 is the address of the first component, and s is the number of words that a component occupies. Assuming that the word is the smallest individually transferable unit of store, it is evidently highly desirable that s be a whole number, the simplest case being s = 1. If s is not a whole number (and this is the normal case), then s is usually rounded up to the next larger integer S. Each array component then occupies S words, whereby S-s words are left unused (see Figs. 1.5 and 1.6). Rounding up of the number of words needed to the next whole number is called padding. The storage utilization factor u is the quotient of the minimal amount of storage needed to represent a structure and of the amount actually used:

    u = s / (s rounded up to nearest integer)

Fig. 1.5. Mapping an array onto a store

Fig. 1.6. Padded representation of a record (s = 2.3, S = 3, remainder unused)

Since an implementor has to aim for a storage utilization as close to 1 as possible, and since accessing parts of words is a cumbersome and relatively inefficient process, he or she must compromise. The following considerations are relevant:

1. Padding decreases storage utilization.
2. Omission of padding may necessitate inefficient partial word access.
3. Partial word access may cause the code (compiled program) to expand and therefore to counteract the gain obtained by omission of padding.

In fact, considerations 2 and 3 are usually so dominant that compilers always use padding automatically. We notice that the utilization factor is always u > 0.5, if s > 0.5. However, if s ≤ 0.5, the utilization factor may be significantly increased by putting more than one array component into each word. This technique is called packing. If n components are packed into a word, the utilization factor is (see Fig. 1.7)

    u = n*s / (n*s rounded up to nearest integer)

Fig. 1.7. Packing 6 components into one word

Access to the i-th component of a packed array involves the computation of the word address j in which the desired component is located, and it involves the computation of the respective component position k within the word.

    j = i DIV n        k = i MOD n
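The formulas for address mapping, utilization, and packing are simple enough to be tried out numerically. The following Python sketch (for illustration only) restates them as functions; all function names are assumptions of the sketch, sizes are measured in words, and the values 2.3 and 6 are taken from Figs. 1.6 and 1.7.

```python
import math

def address(i0, j, s):
    """Address of the j-th component: i = i0 + j*s (s a whole number of words)."""
    return i0 + j * s

def utilization(s):
    """u = s / (s rounded up to nearest integer), i.e. with padding."""
    return s / math.ceil(s)

def packed_utilization(s, n):
    """u = n*s / (n*s rounded up to nearest integer), n components per word."""
    return n * s / math.ceil(n * s)

def packed_position(i, n):
    """Word index j and position k within the word of the i-th component."""
    return i // n, i % n       # j = i DIV n, k = i MOD n

print(address(100, 3, 2))              # component 3 of an array starting at 100
print(round(utilization(2.3), 3))      # s = 2.3 padded to S = 3
print(packed_position(13, 6))          # 6 components per word
```

For a component size of s = 0.15 words, for example, padding alone gives u = 0.15, whereas packing six components into each word raises the factor to about 0.9.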

In most programming languages the programmer is given no control over the representation of the abstract data structures. However, it should be possible to indicate the desirability of packing at least in those cases in which more than one component would fit into a single word, i.e., when a gain of storage economy by a factor of 2 and more could be achieved. We propose the convention to indicate the desirability of packing by prefixing the symbol ARRAY (or RECORD) in the declaration by the symbol PACKED.

1.7.2. Representation of Records

Records are mapped onto a computer store by simply juxtaposing their components. The address of a component (field) r_i relative to the origin address of the record r is called the field's offset k_i. It is computed as

    k_i = s_1 + s_2 + ... + s_(i-1)        k_0 = 0

where s_j is the size (in words) of the j-th component. We now realize that the fact that all components of an array are of equal type has the welcome consequence that k_i = i×s. The generality of the record structure does unfortunately not allow such a simple, linear function for offset address computation, and it is therefore the very reason for the requirement that record components be selectable only by fixed identifiers. This restriction has the desirable benefit that the respective offsets are known at compile time. The resulting greater efficiency of record field access is well-known.

The technique of packing may be beneficial if several record components can be fitted into a single storage word (see Fig. 1.8). Since offsets are computable by the compiler, the offset of a field packed within a word may also be determined by the compiler. This means that on many computers packing of records causes a deterioration in access efficiency considerably smaller than that caused by the packing of arrays.
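The computation of field offsets amounts to a running sum over the component sizes. The following Python sketch (for illustration only) shows the computation a compiler might perform once per record type; the name field_offsets and the sample sizes are assumptions, and fields are numbered from 0 in the list, so that the first field has offset 0.

```python
def field_offsets(sizes):
    """Return the offset of every field, given the field sizes in words.
    Each offset is the sum of the sizes of all preceding fields."""
    offsets = []
    k = 0
    for s in sizes:
        offsets.append(k)
        k += s
    return offsets

# A record with fields of 1, 2, 1, and 4 words (hypothetical sizes).
print(field_offsets([1, 2, 1, 4]))

# If all components have the same size s, as in an array,
# the offsets reduce to the linear function i*s.
s = 3
assert field_offsets([s] * 5) == [i * s for i in range(5)]
```

Because the list of sizes is fixed by the record declaration, the whole computation can be done at compile time, which is the benefit noted above.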

Fig. 1.8. Representation of a packed record (fields s1 to s8, with padding)

1.7.3. Representation of Sets

A set s is conveniently represented in a computer store by its characteristic function C(s). This is an array of logical values whose i-th component has the meaning "i is present in s". As an example, the set of small integers s = {2, 3, 5, 7, 11, 13} is represented by the sequence of bits, by a bitstring:

    C(s) = (… 0010100010101100)

The representation of sets by their characteristic function has the advantage that the operations of computing the union, intersection, and difference of two sets may be implemented as elementary logical operations. The following equivalences, which hold for all elements i of the base type of the sets x and y, relate logical operations with operations on sets:

    i IN (x+y) = (i IN x) OR (i IN y)
    i IN (x*y) = (i IN x) & (i IN y)
    i IN (x-y) = (i IN x) & ~(i IN y)

These logical operations are available on all digital computers, and moreover they operate concurrently on all corresponding elements (bits) of a word. It therefore appears that in order to be able to implement the basic set operations in an efficient manner, sets must be represented in a small, fixed number of words upon which not only the basic logical operations, but also those of shifting are available. Testing for membership is then implemented by a single shift and a subsequent (sign) bit test operation. As a consequence, a test of the form x IN {c1, c2, ... , cn} can be implemented considerably more efficiently than the equivalent Boolean expression

    (x = c1) OR (x = c2) OR ... OR (x = cn)

A corollary is that the set structure should be used only for small integers as elements, the largest one being the wordlength of the underlying computer (minus 1).
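The representation by a characteristic function can be sketched in Python (for illustration only), using an integer as the bitstring: bit i, counted from the right, is 1 exactly when i is an element. The function names are assumptions of the sketch. Python integers are not limited to one word, so the sketch does not model the restriction to small elements.

```python
def characteristic(elements):
    """Bitstring C(s): bit i is set if and only if i is in the set."""
    c = 0
    for i in elements:
        c |= 1 << i
    return c

def member(i, c):
    """i IN s, implemented by a shift and a test of the lowest bit."""
    return ((c >> i) & 1) == 1

def elements_of(c):
    """Recover the set of elements from its bitstring."""
    return {i for i in range(c.bit_length()) if member(i, c)}

s = characteristic({2, 3, 5, 7, 11, 13})
print(format(s, "016b"))

x = characteristic({1, 2, 3})
y = characteristic({3, 4})
union = x | y           # i IN (x+y) = (i IN x) OR (i IN y)
intersection = x & y    # i IN (x*y) = (i IN x) & (i IN y)
difference = x & ~y     # i IN (x-y) = (i IN x) & ~(i IN y)
symdiff = x ^ y         # symmetric set difference
```

Each set operation is a single logical operation on the whole bitstring, which is the source of the efficiency claimed in the text.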

1.8. The File or Sequence

Another elementary structuring method is the sequence. A sequence is typically a homogeneous structure like the array. That is, all its elements are of the same type, the base type of the sequence. We shall denote a sequence s with n elements by

    s = <s0, s1, s2, ... , sn-1>

n is called the length of the sequence. This structure looks exactly like the array. The essential difference is that in the case of the array the number of elements is fixed by the array's declaration, whereas for the sequence it is left open. This implies that it may vary during execution of the program. Although every sequence has at any time a specific, finite length, we must consider the cardinality of a sequence type as infinite, because there is no fixed limit to the potential length of sequence variables.

A direct consequence of the variable length of sequences is the impossibility to allocate a fixed amount of storage to sequence variables. Instead, storage has to be allocated during program execution, namely whenever the sequence grows. Perhaps storage can be reclaimed when the sequence shrinks. In any case, a dynamic storage allocation scheme must be employed. All structures with variable size share this property, which is so essential that we classify them as advanced structures in contrast to the fundamental structures discussed so far. What, then, causes us to place the discussion of sequences in this chapter on fundamental structures?

The primary reason is that the storage management strategy is sufficiently simple for sequences (in contrast to other advanced structures), if we enforce a certain discipline in the use of sequences. In fact, under this proviso the handling of storage can safely be delegated to a mechanism that can be guaranteed to be reasonably effective. The secondary reason is that sequences are indeed ubiquitous in all computer applications. This structure is prevalent in all cases where different kinds of storage media are involved, i.e., where data are to be moved from one medium to another, such as from disk or tape to primary store or vice versa.

The discipline mentioned is the restraint to use sequential access only. By this we mean that a sequence is inspected by strictly proceeding from one element to its immediate successor, and that it is generated by repeatedly appending an element at its end. The immediate consequence is that elements are not directly accessible, with the exception of the one element which currently is up for inspection. It is this accessing discipline which fundamentally distinguishes sequences from arrays. As we shall see in Chapter 2, the influence of an access discipline on programs is profound.

The advantage of adhering to sequential access which, after all, is a serious restriction, is the relative simplicity of needed storage management. But even more important is the possibility to use effective buffering techniques when moving data to or from secondary storage devices. Sequential access allows us to feed streams of data through pipes between the different media.
Buffering implies the collection of sections of a stream in a buffer, and the subsequent shipment of the whole buffer content once the buffer is filled. This results in a significantly more effective use of secondary storage. Given sequential access only, the buffering mechanism is reasonably straightforward for all sequences and all media. It can therefore safely be built into a system for general use, and the programmer need not be burdened by incorporating it in the program. Such a system is usually called a file system, because the high-volume, sequential access devices are used for permanent storage of (persistent) data, and they retain them even when the computer is switched off. The unit of data on these media is commonly called a (sequential) file. Here we will use the term file as a synonym for sequence.

There exist certain storage media in which sequential access is indeed the only possible one. Among them are evidently all kinds of tapes. But even on magnetic disks each recording track constitutes a storage facility allowing only sequential access. Strictly sequential access is the primary characteristic of every mechanically moving device and of some other ones as well.

It follows that it is appropriate to distinguish between the data structure, the sequence, on one hand, and the mechanism to access elements on the other hand. The former is declared as a data structure, the latter typically by the introduction of a record with associated operators, or, according to more modern terminology, by a rider object. The distinction between data and mechanism declarations is also useful in view of the fact that several access points may exist concurrently on one and the same sequence, each one representing a sequential access at a (possibly) different location.

We summarize the essence of the foregoing as follows:

1. Arrays and records are random access structures. They are used when located in primary, random-access store.
2. Sequences are used to access data on secondary, sequential-access stores, such as disks and tapes.
3. We distinguish between the declaration of a sequence variable, and that of an access mechanism located at a certain position within the sequence.

1.8.1 Elementary File Operators

The discipline of sequential access can be enforced by providing a set of sequencing operators through which files can be accessed exclusively. Hence, although we may here refer to the i-th element of a sequence s by writing $s_i$, this shall not be possible in a program.

Sequences, files, are typically large, dynamic data structures stored on a secondary storage device. Such a device retains the data even if a program is terminated, or a computer is switched off. Therefore the introduction of a file variable is a complex operation connecting the data on the external device with the file variable in the program. We therefore define the type File in a separate module, whose definition specifies the type together with its operators. We call this module Files and postulate that a sequence or file variable must be explicitly initialized (opened) by calling an appropriate operator or function:

    VAR f: File
    f := Open(name)

where name identifies the file as recorded on the persistent data carrier. Some systems distinguish between opening an existing file and opening a new file:

    f := Old(name)
    f := New(name)

The disconnection between secondary storage and the file variable then must also be explicitly requested by, for example, a call of Close(f).

Evidently, the set of operators must contain an operator for generating (writing) and one for inspecting (reading) a sequence. We postulate that these operations apply not to a file directly, but to an object called a rider, which itself is connected with a file (sequence), and which implements a certain access mechanism. The sequential access discipline is guaranteed by a restrictive set of access operators (procedures).

A sequence is generated by appending elements at its end after having placed a rider on the file. Assuming the declaration

    VAR r: Rider

we position the rider r on the file f by the statement

    Set(r, f, pos)

where pos = 0 designates the beginning of the file (sequence). A typical pattern for generating the sequence is:

    WHILE more DO compute next element x; Write(r, x) END

A sequence is inspected by first positioning a rider as shown above, and then proceeding from element to element. A typical pattern for reading a sequence is:

    Read(r, x);
    WHILE ~r.eof DO process element x; Read(r, x) END

Evidently, a certain position is always associated with every rider. It is denoted by r.pos. Furthermore, we postulate that a rider contains a predicate (flag) r.eof indicating whether a preceding read operation had reached the sequence's end. We can now postulate and describe informally the following set of primitive operators:

    1a. New(f, name)    defines f to be the empty sequence.
    1b. Old(f, name)    defines f to be the sequence persistently stored with given name.
    2.  Set(r, f, pos)  associates rider r with sequence f, and places it at position pos.
    3.  Write(r, x)     places element with value x in the sequence designated by rider r, and advances.
    4.  Read(r, x)      assigns to x the value of the element designated by rider r, and advances.
    5.  Close(f)        registers the written file f in the persistent store (flushes buffers).
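The operators listed above can be tried out with a small Python sketch (Python rather than Oberon, so that it can be run directly). It assumes that a file is simply an in-memory list, so there is no persistent store behind it and nothing corresponding to Old or Close. The names File, Rider, set_rider, write, read, generate and inspect are illustrative only and belong to no real library.

```python
class File:
    """A sequence held in memory (assumption: no persistent store)."""
    def __init__(self):
        self.data = []              # the elements written so far


class Rider:
    """Access mechanism: a position on a file plus the eof flag."""
    def __init__(self):
        self.file = None
        self.pos = 0
        self.eof = False


def set_rider(r, f, pos):
    """Associate rider r with file f and place it at position pos."""
    r.file = f
    r.pos = max(0, min(pos, len(f.data)))   # keep 0 <= pos <= length
    r.eof = False


def write(r, x):
    """Place x at the rider's position (appending at the end) and advance."""
    if r.pos < len(r.file.data):
        r.file.data[r.pos] = x
    else:
        r.file.data.append(x)
    r.pos += 1


def read(r):
    """Return the element under the rider and advance; set eof at the end."""
    if r.pos < len(r.file.data):
        x = r.file.data[r.pos]
        r.pos += 1
        return x
    r.eof = True
    return None


def generate(f, elements):
    """Pattern for generating a sequence: position a rider, then append."""
    r = Rider()
    set_rider(r, f, 0)
    for x in elements:
        write(r, x)


def inspect(f):
    """Pattern for reading a sequence: read ahead, loop until eof."""
    r = Rider()
    set_rider(r, f, 0)
    result = []
    x = read(r)
    while not r.eof:
        result.append(x)
        x = read(r)
    return result


f = File()
generate(f, "abc")
print(inspect(f))
```

Note that eof becomes true only after a read has failed, which is why the reading pattern reads one element ahead of the loop.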

Note: Writing an element in a sequence is often a complex operation. However, mostly, files are created by appending elements at the end.

In order to convey a more precise understanding of the sequencing operators, the following example of an implementation is provided. It shows how they might be expressed if sequences were represented by arrays. This example of an implementation intentionally builds upon concepts introduced and discussed earlier, and it does not involve either buffering or sequential stores which, as mentioned above, make the sequence concept truly necessary and attractive. Nevertheless, this example exhibits all the essential characteristics of the primitive sequence operators, independently of how the sequences are represented in store.

The operators are presented in terms of conventional procedures. This collection of definitions of types, variables, and procedure headings (signatures) is called a definition. We assume that we are to deal with sequences of characters, i.e. text files whose elements are of type CHAR. The declarations of File and Rider are good examples of an application of record structures because, in addition to the field denoting the array which represents the data, further fields are required to denote the current length and position, i.e. the state of the rider.

    DEFINITION Files;
      TYPE File; (*sequence of characters*)
        Rider = RECORD eof: BOOLEAN END ;
      PROCEDURE New(VAR name: ARRAY OF CHAR): File;
      PROCEDURE Old(VAR name: ARRAY OF CHAR): File;
      PROCEDURE Close(VAR f: File);
      PROCEDURE Set(VAR r: Rider; VAR f: File; pos: INTEGER);
      PROCEDURE Write (VAR r: Rider; ch: CHAR);
      PROCEDURE Read (VAR r: Rider; VAR ch: CHAR);
    END Files.

A definition represents an abstraction. Here we are given the two data types, File and Rider, together with their operations, but without further details revealing their actual representation in store. Of the operators, declared as procedures, we see their headings only. This hiding of the details of implementation is intentional. The concept is called information hiding. About riders we only learn that there is a property called eof. This flag is set if a read operation reaches the end of the file. The rider's position is invisible, and hence the rider's invariant cannot be falsified by direct access. The invariant expresses the fact that the position always lies within the limits given by the associated sequence. The invariant is established by procedure Set, and required and maintained by procedures Read and Write.
The statements that implement the procedures and further, internal details of the data types, are specified in a construct called module. Many representations of data and implementations of procedures are possible. We chose the following as a simple example (with fixed maximal file length):

    MODULE Files;
      CONST MaxLength = 4096;
      TYPE File = POINTER TO RECORD
          len: INTEGER;
          a: ARRAY MaxLength OF CHAR
        END ;
        Rider = RECORD (* 0 [...]

    [...] = 0*)
      nonempty: Signals.Signal; (*nf >= 0*)
      buf: ARRAY N OF CHAR;

      PROCEDURE deposit(VAR x: ARRAY OF CHAR);
      BEGIN ne := ne - Np;
        IF ne < 0 THEN Signals.Wait(nonfull) END ;
        FOR i := 0 TO Np-1 DO buf[in] := x[i]; INC(in) END ;
        IF in = N THEN in := 0 END ;
        nf := nf + Np;
        IF nf >= 0 THEN Signals.Send(nonempty) END
      END deposit;

      PROCEDURE fetch(VAR x: ARRAY OF CHAR);
      BEGIN nf := nf - Nc;
        IF nf < 0 THEN Signals.Wait(nonempty) END ;
        FOR i := 0 TO Nc-1 DO x[i] := buf[out]; INC(out) END ;
        IF out = N THEN out := 0 END ;
        ne := ne + Nc;
        IF ne >= 0 THEN Signals.Send(nonfull) END
      END fetch;

    BEGIN ne := N; nf := 0; in := 0; out := 0;
      Signals.Init(nonfull); Signals.Init(nonempty)
    END Buffer.

1.8.4 Textual Input and Output

By standard input and output we understand the transfer of data to (from) a computer system from (to) genuinely external agents, in particular its human operator. Input may typically originate at a keyboard and output may sink into a display screen. In any case, its characteristic is that it is readable, and it typically consists of a sequence of characters. It is a text. This readability condition is responsible for yet another complication incurred in most genuine input and output operations. Apart from the actual data transfer, they also involve a transformation of representation. For example, numbers, usually considered as atomic units and represented in binary form, need to be transformed into readable, decimal notation. Structures need to be represented in a suitable layout, whose generation is called formatting.

Whatever the transformation may be, the concept of the sequence is once again instrumental for a considerable simplification of the task. The key is the observation that, if the data set can be considered as a sequence of characters, the transformation of the sequence can be implemented as a sequence of (identical) transformations of elements.

    $T(\langle s_0, s_1, \ldots, s_{n-1} \rangle) = \langle T(s_0), T(s_1), \ldots, T(s_{n-1}) \rangle$

We shall briefly investigate the necessary operations for transforming representations of natural numbers for input and output. The basis is that a number x represented by the sequence of decimal digits $d = \langle d_{n-1}, \ldots, d_1, d_0 \rangle$ has the value

    $x = \sum_{i=0}^{n-1} d_i \times 10^i$
    $x = d_{n-1} \times 10^{n-1} + d_{n-2} \times 10^{n-2} + \ldots + d_1 \times 10 + d_0$
    $x = ( \ldots ((d_{n-1} \times 10) + d_{n-2}) \times 10 + \ldots + d_1) \times 10 + d_0$

Assume now that the sequence d is to be read and transformed, and the resulting numeric value to be assigned to x. The simple algorithm terminates with the reading of the first character that is not a digit. (Arithmetic overflow is not considered).

    x := 0; Read(ch);
    WHILE ("0" [...]

    [...] R) $\&\ (\mathbf{A}k : 0 \le k < L : a_k < x) \;\&\; (\mathbf{A}k : R < k < N : a_k > x))$

which implies

    $(a_m = x)$ OR $(\mathbf{A}k : 0 \le k < N : a_k \ne x)$

The choice of m is apparently arbitrary in the sense that correctness does not depend on it. But it does influence the algorithm's effectiveness. Clearly our goal must be to eliminate in each step as many elements as possible from further searches, no matter what the outcome of the comparison is.
The optimal solution is to choose the middle element, because this eliminates half of the array in any case. As a result, the maximum number of steps is $\log_2 N$, rounded up to the nearest integer. Hence, this algorithm offers a drastic improvement over linear search, where the expected number of comparisons is N/2.

The efficiency can be somewhat improved by interchanging the two if-clauses. Equality should be tested second, because it occurs only once and causes termination. But more relevant is the question whether, as in the case of linear search, a solution could be found that allows a simpler condition for termination. We indeed find such a faster algorithm, if we abandon the naive wish to terminate the search as soon as a match is established. This seems unwise at first glance, but on closer inspection we realize that the gain in efficiency at every step is greater than the loss incurred in comparing a few extra elements. Remember that the number of steps is at most log N. The faster solution is based on the following invariant:

    $(\mathbf{A}k : 0 \le k < L : a_k < x) \;\&\; (\mathbf{A}k : R \le k < N : a_k \ge x)$

and the search is continued until the two sections span the entire array.

    L := 0; R := N;
    WHILE L < R DO
      m := (L+R) DIV 2;
      IF a[m] < x THEN L := m+1 ELSE R := m END
    END

The terminating condition is L ≥ R. Is it guaranteed to be reached? In order to establish this guarantee, we must show that under all circumstances the difference R-L is diminished in each step. L < R holds at the beginning of each step. The arithmetic mean m then satisfies L ≤ m < R. Hence, the difference is indeed diminished by either assigning m+1 to L (increasing L) or m to R (decreasing R), and the repetition terminates with L = R.

However, the invariant and L = R do not yet establish a match. Certainly, if R = N, no match exists. Otherwise we must take into consideration that the element a[R] had never been compared. Hence, an additional test for equality a[R] = x is necessary. In contrast to the first solution, this algorithm, like linear search, finds the matching element with the least index.

1.9.3 Table Search

A search through an array is sometimes also called a table search, particularly if the keys are themselves structured objects, such as arrays of numbers or characters. The latter is a frequently encountered case; the character arrays are called strings or words. Let us define a type String as

    String = ARRAY M OF CHAR

and let order on strings x and y be defined as follows:

    $(x = y) \equiv (\mathbf{A}j : 0 \le j < M : x_j = y_j)$
    $(x < y) \equiv \mathbf{E}i : 0 \le i < M : ((\mathbf{A}j : 0 \le j < i : x_j = y_j) \;\&\; (x_i < y_i))$

In order to establish a match, we evidently must find all characters of the comparands to be equal. Such a comparison of structured operands therefore turns out to be a search for an unequal pair of comparands, i.e. a search for inequality. If no unequal pair exists, equality is established. Assuming that the length of the words is quite small, say less than 30, we shall use a linear search in the following solution.

In most practical applications, one wishes to consider strings as having a variable length. This is accomplished by associating a length indication with each individual string value. Using the type declared above, this length must not exceed the maximum length M. This scheme allows for sufficient flexibility for many cases, yet avoids the complexities of dynamic storage allocation. Two representations of string lengths are most commonly used:

1. The length is implicitly specified by appending a terminating character which does not otherwise occur. Usually, the non-printing value 0X is used for this purpose. (It is important for the subsequent applications that it be the least character in the character set).
2. The length is explicitly stored as the first element of the array, i.e. the string s has the form $s = s_0, s_1, s_2, \ldots, s_{N-1}$ where $s_1 \ldots s_{N-1}$ are the actual characters of the string and $s_0$ = CHR(N). This solution has the advantage that the length is directly available, and the disadvantage that the maximum length is limited to the size of the character set, that is, to 256 in the case of the ASCII set.

For the subsequent search algorithm, we shall adhere to the first scheme. A string comparison then takes the form

    i := 0;
    WHILE (x[i] = y[i]) & (x[i] # 0X) DO i := i+1 END

The terminating character now functions as a sentinel, the loop invariant is

    $\mathbf{A}j : 0 \le j < i : x_j = y_j \ne 0X$

and the resulting condition is therefore

    $((x_i \ne y_i)$ OR $(x_i = 0X)) \;\&\; (\mathbf{A}j : 0 \le j < i : x_j = y_j \ne 0X)$

It establishes a match between x and y, provided that $x_i = y_i$, and it establishes x < y, if $x_i < y_i$.

We are now prepared to return to the task of table searching. It calls for a nested search, namely a search through the entries of the table, and for each entry a sequence of comparisons between components. For example, let the table T and the search argument x be defined as

    T: ARRAY N OF String;
    x: String
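The two building blocks developed so far, the binary search that yields the least matching index and the comparison of strings terminated by a sentinel, might be sketched in Python as follows. This is only an illustration under stated assumptions: Python strings carry their own length, so an explicit "\0" is appended here to imitate the terminator 0X, and the function names search_least and compare are hypothetical.

```python
def search_least(a, x):
    """Binary search in the sorted list a.

    Returns the least index R with a[R] >= x, or len(a) if there is none.
    Invariant: all a[k] with k < L are < x, all a[k] with k >= R are >= x.
    """
    L, R = 0, len(a)
    while L < R:
        m = (L + R) // 2
        if a[m] < x:
            L = m + 1
        else:
            R = m
    return R


def compare(x, y):
    """Compare two strings that both end with the sentinel "\0".

    Returns -1, 0 or 1 for x < y, x = y, x > y.
    """
    i = 0
    while x[i] == y[i] and x[i] != "\0":
        i += 1
    # Here either x[i] != y[i], or both strings ended at position i.
    return (x[i] > y[i]) - (x[i] < y[i])


a = [2, 3, 3, 5, 8]
R = search_least(a, 3)
found = R < len(a) and a[R] == 3     # the additional test for equality
print(R, found)
print(compare("word\0", "words\0"))
```

Because the sentinel is the least character, a string that is a proper prefix of another compares as smaller without any special case.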

Assuming that N may be fairly large and that the table is alphabetically ordered, we shall use a binary search. Using the algorithms for binary search and string comparison developed above, we obtain the following program segment.

    L := 0; R := N;
    WHILE L < R DO
      m := (L+R) DIV 2; i := 0;
      WHILE (T[m,i] = x[i]) & (x[i] # 0X) DO i := i+1 END ;
      IF T[m,i] < x[i] THEN L := m+1 ELSE R := m END
    END ;
    IF R < N THEN
      i := 0;
      WHILE (T[R,i] = x[i]) & (x[i] # 0X) DO i := i+1 END
    END
    (* (R < N) & (T[R,i] = x[i]) establish a match*)

1.9.4. Straight String Search

A frequently encountered kind of search is the so-called string search. It is characterized as follows. Given an array s of N elements and an array p of M elements, where 0 < M < N, declared as

    s: ARRAY N OF Item
    p: ARRAY M OF Item

string search is the task of finding the first occurrence of p in s. Typically, the items are characters; then s may be regarded as a text and p as a pattern or word, and we wish to find the first occurrence of the word in the text. This operation is basic to every text processing system, and there is obvious interest in finding an efficient algorithm for this task. Before paying particular attention to efficiency, however, let us first present a straightforward searching algorithm. We shall call it straight string search.

A more precise formulation of the desired result of a search is indispensable before we attempt to specify an algorithm to compute it. Let the result be the index i which points to the first occurrence of a match of the pattern within the string. To this end, we introduce a predicate P(i, j):

    $P(i, j) = \mathbf{A}k : 0 \le k < j : s_{i+k} = p_k$

Then evidently our resulting index i must satisfy P(i, M). But this condition is not sufficient. Because the search is to locate the first occurrence of the pattern, P(k, M) must be false for all k < i. We denote this condition by Q(i).
    $Q(i) = \mathbf{A}k : 0 \le k < i : \sim P(k, M)$

The posed problem immediately suggests formulating the search as an iteration of comparisons, and we propose the following approach:

    i := -1;
    REPEAT INC(i); (* Q(i) *)
      found := P(i, M)
    UNTIL found OR (i = N-M)

The computation of P again results naturally in an iteration of individual character comparisons. When we apply DeMorgan's theorem to P, it appears that the iteration must be a search for inequality among corresponding pattern and string characters.

    $P(i, j) = (\mathbf{A}k : 0 \le k < j : s_{i+k} = p_k) = (\sim \mathbf{E}k : 0 \le k < j : s_{i+k} \ne p_k)$

The result of the next refinement is a repetition within a repetition. The predicates P and Q are inserted at appropriate places in the program as comments. They act as invariants of the iteration loops.

    i := -1;
    REPEAT INC(i); j := 0; (* Q(i) *)
      WHILE (j < M) & (s[i+j] = p[j]) DO (* P(i, j+1) *) INC(j) END
      (* Q(i) & P(i, j) & ((j = M) OR (s[i+j] # p[j])) *)
    UNTIL (j = M) OR (i = N-M)

The term j = M in the terminating condition indeed corresponds to the condition found, because it implies P(i, M). The term i = N-M implies Q(N-M) and thereby the nonexistence of a match anywhere in the string. If the iteration continues with j < M, then it must do so with $s_{i+j} \ne p_j$. This implies ~P(i, M), which together with Q(i) implies Q(i+1), which establishes Q(i) after the next incrementing of i.

Analysis of straight string search. This algorithm operates quite effectively, if we can assume that a mismatch between character pairs occurs after at most a few comparisons in the inner loop. This is likely to be the case, if the cardinality of the item type is large. For text searches with a character set size of 128 we may well assume that a mismatch occurs after inspecting 1 or 2 characters only. Nevertheless, the worst case performance is rather alarming. Consider, for example, that the string consists of N-1 A's followed by a single B, and that the pattern consists of M-1 A's followed by a B. Then on the order of N*M comparisons are necessary to find the match at the end of the string. As we shall subsequently see, there fortunately exist methods that drastically improve this worst case behaviour.

1.9.5. The Knuth-Morris-Pratt String Search

Around 1970, D.E. Knuth, J.H. Morris, and V.R. Pratt invented an algorithm that requires essentially on the order of N character comparisons only, even in the worst case [1-8]. The new algorithm is based on the observation that by starting the next pattern comparison at its beginning each time, we may be discarding valuable information gathered during previous comparisons. After a partial match of the beginning of the pattern with corresponding characters in the string, we indeed know the last part of the string, and perhaps could have precompiled some data (from the pattern) which could be used for a more rapid advance in the text string.
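As a point of comparison for the method that follows, the straight string search of Section 1.9.4 and its worst case can be observed with a short Python sketch. The function name straight_search and the comparison counter are additions made for illustration; the counter is not part of the algorithm itself.

```python
def straight_search(s, p):
    """Straight string search.

    Returns (i, comparisons): i is the index of the first occurrence of p
    in s, or -1 if there is none; comparisons counts character comparisons.
    """
    n, m = len(s), len(p)
    comparisons = 0
    for i in range(n - m + 1):          # candidate positions 0 .. n-m
        j = 0
        while j < m:
            comparisons += 1
            if s[i + j] != p[j]:        # search for inequality
                break
            j += 1
        if j == m:                      # P(i, m) holds: match found
            return i, comparisons
    return -1, comparisons


# Worst case from the analysis: N-1 A's followed by B, pattern M-1 A's and B.
N, M = 10, 3
text = "A" * (N - 1) + "B"
pattern = "A" * (M - 1) + "B"
print(straight_search(text, pattern))
```

In this run every one of the N-M+1 alignments costs M comparisons, which is the N*M behaviour described in the analysis.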
The following example of a search for the word Hooligan illustrates the principle of the algorithm. Underlined characters are those which were compared. Note that each time two compared characters do not match, the pattern is shifted all the way, because a smaller shift could not possibly lead to a full match.

    Hoola-Hoola girls like Hooligans.
    Hooligan
        Hooligan
         Hooligan
          Hooligan
              Hooligan
               Hooligan
                ......
                           Hooligan

Using the predicates P and Q, the KMP-algorithm is the following:

    i := 0; j := 0;
    WHILE (j < M) & (i < N) DO
      (* Q(i-j) & P(i-j, j) *)
      WHILE (j >= 0) & (s[i] # p[j]) DO j := D END ;
      INC(i); INC(j)
    END

This formulation is admittedly not quite complete, because it contains an unspecified shift value D. We shall return to it shortly, but first point out that the conditions Q(i-j) and P(i-j, j) are maintained as global invariants, to which we may add the relations 0 ≤ i < N and 0 ≤ j < M. This suggests that we must abandon the notion that i always marks the current position of the first pattern character in the text. Rather, the alignment position of the pattern is now i-j.

If the algorithm terminates due to j = M, the term P(i-j, j) of the invariant implies P(i-M, M), that is, a match at position i-M. Otherwise it terminates with i = N, and since j < M, the invariant Q(i) implies that no match exists at all.

We must now demonstrate that the algorithm never falsifies the invariant. It is easy to show that it is established at the beginning with the values i = j = 0. Let us first investigate the effect of the two statements incrementing i and j by 1. They apparently neither represent a shift of the pattern to the right, nor do they falsify Q(i-j), since the difference remains unchanged. But could they falsify P(i-j, j), the second factor of the invariant? We notice that at this point the negation of the inner while clause holds, i.e. either j < 0 or $s_i = p_j$. The latter extends the partial match and establishes P(i-j, j+1). In the former case, we postulate that P(i-j, j+1) hold as well. Hence, incrementing both i and j by 1 cannot falsify the invariant either.

The only other assignment left in the algorithm is j := D. We shall simply postulate that the value D always be such that replacing j by D will maintain the invariant. In order to find an appropriate expression for D, we must first understand the effect of the assignment. Provided that D < j, it represents a shift of the pattern to the right by j-D positions. Naturally, we wish this shift to be as large as possible, i.e., D to be as small as possible. This is illustrated by Fig. 1.10.

[Fig. 1.10. Assignment j := D shifts pattern by j-D positions. String: A B C D; pattern: A B C E, aligned at j = 3; with D = 0 the shifted pattern is aligned at j = 0.]

Evidently the condition P(i-D, D) & Q(i-D) must hold before assigning D to j, if the invariant P(i-j, j) & Q(i-j) is to hold thereafter. This precondition is therefore our guideline for finding an appropriate expression for D. The key observation is that thanks to P(i-j, j) we know that

    $s_{i-j} \ldots s_{i-1} = p_0 \ldots p_{j-1}$

(we had just scanned the first j characters of the pattern and found them to match). Therefore the condition P(i-D, D) with D ≤ j, i.e.,

    $p_0 \ldots p_{D-1} = s_{i-D} \ldots s_{i-1}$

translates into

    $p_0 \ldots p_{D-1} = p_{j-D} \ldots p_{j-1}$

and (for the purpose of establishing the invariance of Q(i-D)) the predicate ~P(i-k, M) for k = 1 ... j-D translates into

    $p_0 \ldots p_{k-1} \ne p_{j-k} \ldots p_{j-1}$   for all k = 1 ... j-D

The essential result is that the value D apparently is determined by the pattern alone and does not depend on the text string. The conditions tell us that in order to find D we must, for every j, search for the smallest D, and hence for the longest sequence of pattern characters just preceding position j, which matches an equal number of characters at the beginning of the pattern. We shall denote D for a given j by $d_j$. Since these values depend on the pattern only, the auxiliary table d may be computed before starting the actual search; this computation amounts to a precompilation of the pattern. This effort is evidently only worthwhile if the text is considerably longer than the pattern (M [...]

    [...] = 0) & (s[i] # p[j]) DO j := d[j] END ;
        INC(i); INC(j)
      END ;
      IF j = m THEN r := i-m ELSE r := -1 END
    END Search

Analysis of KMP search. The exact analysis of the performance of KMP-search is, like the algorithm itself, very intricate. In [1-8] its inventors prove that the number of character comparisons is on the order of M+N, which suggests a substantial improvement over M*N for the straight search. They also point out the welcome property that the scanning pointer i never backs up, whereas in straight string search the scan always begins at the first pattern character after a mismatch, and therefore may involve characters that had actually been scanned already. This may cause awkward problems when the string is read from secondary storage where backing up is costly. Even when the input is buffered, the pattern may be such that the backing up extends beyond the buffer contents.

1.9.6. The Boyer-Moore String Search

The clever scheme of the KMP-search yields genuine benefits only if a mismatch was preceded by a partial match of some length. Only in this case is the pattern shift increased to more than 1. Unfortunately, this is the exception rather than the rule; matches occur much less often than mismatches.
Therefore the gain in using the KMP strategy is marginal in most cases of normal text searching. The method to be discussed here does indeed not only improve performance in the worst case, but also in the average case. It was invented by R.S. Boyer and J.S. Moore around 1975, and we shall call it BM search. We shall here present a simplified version of BM-search before proceeding to the one given by Boyer and Moore.

BM-search is based on the unconventional idea of comparing characters starting at the end of the pattern rather than at the beginning. As in the case of KMP-search, the pattern is precompiled into a table d before the actual search starts. Let, for every character x in the character set, $d_x$ be the distance of the rightmost occurrence of x in the pattern from its end. Now assume that a mismatch between string and pattern was discovered. Then the pattern can immediately be shifted to the right by $d_{p_{M-1}}$ positions, an amount that is quite likely to be greater than 1. If $p_{M-1}$ does not occur in the pattern at all, the shift is even greater, namely equal to the entire pattern's length. The following example illustrates this process.

    Hoola-Hoola girls like Hooligans.
    Hooligan
         Hooligan
           Hooligan
                   Hooligan
                           Hooligan

Since individual character comparisons now proceed from right to left, the following, slightly modified versions of the predicates P and Q are more convenient.

    $P(i, j) = \mathbf{A}k : j \le k < M : s_{i-j+k} = p_k$
    $Q(i) = \mathbf{A}k : 0 \le k < i : \sim P(k, 0)$

These predicates are used in the following formulation of the BM-algorithm to denote the invariant conditions.

    i := M; j := M;
    WHILE (j > 0) & (i [...]

    [...] 0) & (s[k-1] = p[j-1]) DO
        (* P(k-j, j) & (k-j = i-M) *)
        DEC(k); DEC(j)
      END ;
      i := i + d[s[i-1]]
    END

The indices satisfy 0 < j < M and 0 < i,k < N. Therefore, termination with j = 0, together with P(k-j, j), implies P(k, 0), i.e., a match at position k. Termination with j > 0 demands that i = N; hence Q(i-M) implies Q(N-M), signalling that no match exists. Of course we still have to convince ourselves that Q(i-M) and P(k-j, j) are indeed invariants of the two repetitions. They are trivially satisfied when repetition starts, since Q(0) and P(x, M) are always true.

Let us first consider the effect of the two statements decrementing k and j. Q(i-M) is not affected, and, since $s_{k-1} = p_{j-1}$ had been established, P(k-j, j-1) holds as precondition, guaranteeing P(k-j, j) as postcondition. If the inner loop terminates with j > 0, the fact that $s_{k-1} \ne p_{j-1}$ implies ~P(k-j, 0), since

    $\sim P(i, 0) = \mathbf{E}k : 0 \le k < M : s_{i+k} \ne p_k$

Moreover, because k-j = i-M, Q(i-M) & ~P(k-j, 0) = Q(i+1-M), establishing a non-match at position i-M+1.

Next we must show that the statement i := i + d[s[i-1]] never falsifies the invariant. This is the case, provided that before the assignment $Q(i + d_{s_{i-1}} - M)$ is guaranteed. Since we know that Q(i+1-M) holds, it suffices to establish ~P(i+h-M) for h = 2, 3, ... , $d_{s_{i-1}}$. We now recall that $d_x$ is defined as the distance of the rightmost occurrence of x in the pattern from the end.
This is formally expressed as

    $\mathbf{A}k : M - d_x \le k < M-1 : p_k \ne x$

Substituting $s_{i-1}$ for x, we obtain

    $\mathbf{A}h : M - d_{s_{i-1}} \le h < M-1 : s_{i-1} \ne p_h$
    $\mathbf{A}h : 1 < h \le d_{s_{i-1}} : s_{i-1} \ne p_{M-h}$
    $\mathbf{A}h : 1 < h \le d_{s_{i-1}} : \sim P(i+h-M)$

The following program includes the presented, simplified Boyer-Moore strategy in a setting similar to that of the preceding KMP-search program. Note as a detail that a repeat statement is used in the inner loop, decrementing k and j before comparing s and p. This eliminates the -1 terms in the index expressions.

    PROCEDURE Search(VAR s, p: ARRAY OF CHAR; m, n: INTEGER; VAR r: INTEGER);
      (*search for pattern p of length m in text s of length n*)
      (*if p is found, then r indicates the position in s, otherwise r = -1*)
      VAR i, j, k: INTEGER;
        d: ARRAY 128 OF INTEGER;
    BEGIN
      FOR i := 0 TO 127 DO d[i] := m END ;
      FOR j := 0 TO m-2 DO d[ORD(p[j])] := m-j-1 END ;
      i := m;
      REPEAT j := m; k := i;
        REPEAT DEC(k); DEC(j)
        UNTIL (j < 0) OR (p[j] # s[k]);
        i := i + d[ORD(s[i-1])]
      UNTIL (j < 0) OR (i > n);
      (*after a full match k has been decremented m+1 times, so the match begins at k+1*)
      IF j < 0 THEN r := k+1 ELSE r := -1 END
    END Search

Analysis of Boyer-Moore Search. The original publication of this algorithm [1-9] contains a detailed analysis of its performance. The remarkable property is that in all except especially constructed cases it requires substantially fewer than N comparisons. In the luckiest case, where the last character of the pattern always hits an unequal character of the text, the number of comparisons is N/M.

The authors provide several ideas on possible further improvements. One is to combine the strategy explained above, which provides greater shifting steps when a mismatch is present, with the Knuth-Morris-Pratt strategy, which allows larger shifts after detection of a (partial) match. This method requires two precomputed tables; d1 is the table used above, and d2 is the table corresponding to the one of the KMP-algorithm. The step taken is then the larger of the two, both indicating that no smaller step could possibly lead to a match. We refrain from further elaborating the subject, because the additional complexity of the table generation and the search itself does not seem to yield any appreciable efficiency gain. In fact, the additional overhead is larger, and casts some uncertainty on whether the sophisticated extension is an improvement or a deterioration.
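The simplified Boyer-Moore strategy can also be sketched in Python to observe how few comparisons it needs on ordinary text. This is an illustration under assumptions, not a transcription of the program above: the table d is a dictionary instead of a 128-element array, the inner loop is a while loop with the -1 index terms, and the name bm_search and the comparison counter are introduced here.

```python
def bm_search(s, p):
    """Simplified Boyer-Moore search (shift by the last-character table only).

    Returns (r, comparisons): r is the index of the first occurrence of p
    in s, or -1 if there is none; comparisons counts character comparisons.
    """
    n, m = len(s), len(p)
    comparisons = 0
    if m == 0:
        return 0, comparisons
    # d[x]: distance of the rightmost occurrence of x in p[0 .. m-2] from
    # the end of the pattern; characters not listed get the full length m.
    d = {}
    for j in range(m - 1):
        d[p[j]] = m - j - 1
    i = m                         # s[i-1] is aligned with p[m-1]
    while i <= n:
        j, k = m, i
        while j > 0:              # compare from right to left
            comparisons += 1
            if s[k - 1] != p[j - 1]:
                break
            k -= 1
            j -= 1
        if j == 0:
            return k, comparisons # match at position k = i - m
        i += d.get(s[i - 1], m)   # shift according to the text character
    return -1, comparisons


text = "Hoola-Hoola girls like Hooligans."
print(bm_search(text, "Hooligan"), "text length:", len(text))
```

On this example the pattern is aligned only five times, and four of those alignments are rejected after a single comparison.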

Exercises

1.1. Assume that the cardinalities of the standard types INTEGER, REAL, and CHAR are denoted by $c_{int}$, $c_{real}$, and $c_{char}$. What are the cardinalities of the following data types defined as examples in this chapter: sex, weekday, row, alfa, complex, date, person?

1.2. Which are the instruction sequences (on your computer) for the following:
(a) Fetch and store operations for an element of packed records and arrays?
(b) Set operations, including the test for membership?

1.3. What are the reasons for defining certain sets of data as sequences instead of arrays?

1.4. Given is a railway timetable listing the daily services on several lines of a railway system. Find a representation of these data in terms of arrays, records, or sequences, which is suitable for lookup of arrival and departure times, given a certain station and desired direction of the train.

1.5. Given a text T in the form of a sequence and lists of a small number of words in the form of two arrays A and B. Assume that words are short arrays of characters of a small and fixed maximum length. Write a program that transforms the text T into a text S by replacing each occurrence of a word $A_i$ by its corresponding word $B_i$.

1.6. Compare the following three versions of the binary search with the one presented in the text. Which of the three programs are correct? Determine the relevant invariants. Which versions are more efficient? We assume the following variables, and the constant N > 0:

    VAR i, j, k, x: INTEGER;
      a: ARRAY N OF INTEGER;

Program A:

    i := 0; j := N-1;
    REPEAT k := (i+j) DIV 2;
      IF a[k] < x THEN i := k ELSE j := k END
    UNTIL (a[k] = x) OR (i > j)

Program B:

    i := 0; j := N-1;
    REPEAT k := (i+j) DIV 2;
      IF x < a[k] THEN j := k-1 END ;
      IF a[k] < x THEN i := k+1 END
    UNTIL i > j

Program C:

    i := 0; j := N-1;
    REPEAT k := (i+j) DIV 2;
      IF x < a[k] THEN j := k ELSE i := k+1 END
    UNTIL i > j

Hint: All programs must terminate with $a_k = x$, if such an element exists, or $a_k \ne x$, if there exists no element with value x.

1.7. A company organizes a poll to determine the success of its products. Its products are records and tapes of hits, and the most popular hits are to be broadcast in a hit parade. The polled population is to be divided into four categories according to sex and age (say, less than or equal to 20, and older than 20). Every person is asked to name five hits. Hits are identified by the numbers 1 to N (say, N = 30). The results of the poll are to be appropriately encoded as a sequence of characters. Hint: use procedures Read and ReadInt to read the values of the poll.

    TYPE hit = [N];
      sex = (male, female);
      reponse = RECORD
        name, firstname: alfa;
        s: sex;
        age: INTEGER;
        choice: ARRAY 5 OF hit
      END ;
    VAR poll: Files.File

This file is the input to a program which computes the following results:

1. A list of hits in the order of their popularity. Each entry consists of the hit number and the number of times it was mentioned in the poll. Hits that were never mentioned are omitted from the list.
2. Four separate lists with the names and first names of all respondents who had mentioned in first place one of the three hits most popular in their category.

The five lists are to be preceded by suitable titles.

References

1-1. O.-J. Dahl, E.W. Dijkstra, and C.A.R. Hoare. Structured Programming. (New York: Academic Press, 1972), pp. 155-65.
1-2. C.A.R. Hoare. Notes on data structuring; in Structured Programming. Dahl, Dijkstra, and Hoare, pp. 83-174.
1-3. K. Jensen and N. Wirth. Pascal User Manual and Report. (Berlin: Springer-Verlag, 1974).
1-4. N. Wirth. Program development by stepwise refinement. Comm. ACM, 14, No. 4 (1971), 221-27.
1-5. ------, Programming in Modula-2. (Berlin, Heidelberg, New York: Springer-Verlag, 1982).
1-6. ------, On the composition of well-structured programs. Computing Surveys, 6, No. 4, (1974) 247-59.
1-7. C.A.R. Hoare. The Monitor: An operating systems structuring concept. Comm. ACM, 17, 10 (Oct. 1974), 549-557.
1-8. D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6, 2, (June 1977), 323-349.
1-9. R.S. Boyer and J.S. Moore. A fast string searching algorithm. Comm. ACM, 20, 10 (Oct. 1977), 762-772.


2. SORTING 2.1. Introduction The primary purpose of this chapter is to provide an extensive set of examples illustrating the use of the data structures introduced in the preceding chapter and to show how the choice of structure for the underlying data profoundly influences the algorithms that perform a given task. Sorting is also a good example to show that such a task may be performed according to many different algorithms, each one having certain advantages and disadvantages that have to be weighed against each other in the light of the particular application. Sorting is generally understood to be the process of rearranging a given set of objects in a specific order. The purpose of sorting is to facilitate the later search for members of the sorted set. As such it is an almost universally performed, fundamental activity. Objects are sorted in telephone books, in income tax files, in tables of contents, in libraries, in dictionaries, in warehouses, and almost everywhere that stored objects have to be searched and retrieved. Even small children are taught to put their things "in order", and they are confronted with some sort of sorting long before they learn anything about arithmetic. Hence, sorting is a relevant and essential activity, particularly in data processing. What else would be easier to sort than data! Nevertheless, our primary interest in sorting is devoted to the even more fundamental techniques used in the construction of algorithms. There are not many techniques that do not occur somewhere in connection with sorting algorithms. In particular, sorting is an ideal subject to demonstrate a great diversity of algorithms, all having the same purpose, many of them being optimal in some sense, and most of them having advantages over others. It is therefore an ideal subject to demonstrate the necessity of performance analysis of algorithms. 
The example of sorting is moreover well suited for showing how a very significant gain in performance may be obtained by the development of sophisticated algorithms when obvious methods are readily available.

The dependence of the choice of an algorithm on the structure of the data to be processed -- a ubiquitous phenomenon -- is so profound in the case of sorting that sorting methods are generally classified into two categories, namely, sorting of arrays and sorting of (sequential) files. The two classes are often called internal and external sorting because arrays are stored in the fast, high-speed, random-access "internal" store of computers and files are appropriate for the slower, but more spacious "external" stores based on mechanically moving devices (disks and tapes). The importance of this distinction is obvious from the example of sorting numbered cards. Structuring the cards as an array corresponds to laying them out in front of the sorter so that each card is visible and individually accessible (see Fig. 2.1). Structuring the cards as a file, however, implies that from each pile only the card on the top is visible (see Fig. 2.2). Such a restriction will evidently have serious consequences on the sorting method to be used, but it is unavoidable if the number of cards to be laid out is larger than the available table.

Before proceeding, we introduce some terminology and notation to be used throughout this chapter. If we are given n items

  a_0, a_1, ... , a_{n-1}

sorting consists of permuting these items into an array

  a_{k_0}, a_{k_1}, ... , a_{k_{n-1}}

such that, given an ordering function f,

  f(a_{k_0}) ≤ f(a_{k_1}) ≤ ... ≤ f(a_{k_{n-1}})

Ordinarily, the ordering function is not evaluated according to a specified rule of computation but is stored as an explicit component (field) of each item. Its value is called the key of the item.
As a consequence, the record structure is particularly well suited to represent items and might for example be declared as follows: TYPE Item = RECORD key: INTEGER; (*other components declared here*) END

The other components represent relevant data about the items in the collection; the key merely assumes the purpose of identifying the items. As far as our sorting algorithms are concerned, however, the key is the only relevant component, and there is no need to define any particular remaining components. In the following discussions, we shall therefore discard any associated information and assume that the type Item is defined as INTEGER. This choice of the key type is somewhat arbitrary. Evidently, any type on which a total ordering relation is defined could be used just as well.

A sorting method is called stable if the relative order of items with equal keys remains unchanged by the sorting process. Stability of sorting is often desirable if items are already ordered (sorted) according to some secondary keys, i.e., properties not reflected by the (primary) key itself.

This chapter is not to be regarded as a comprehensive survey of sorting techniques. Rather, some selected, specific methods are exemplified in greater detail. For a thorough treatment of sorting, the interested reader is referred to the excellent and comprehensive compendium by D. E. Knuth [2-7] (see also Lorin [2-10]).
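The notion of stability can be illustrated by a short sketch. It is written in Python rather than Oberon; the records, their names, and the helper straight_selection are hypothetical and serve only as an illustration. Python's built-in sorted is documented to be stable, whereas a plain selection sort with exchanges is not.

```python
def straight_selection(items, key):
    """Selection sort with exchanges; used here only as an unstable contrast."""
    a = list(items)
    for i in range(len(a) - 1):
        # index of the first item with the least key in a[i:]
        k = min(range(i, len(a)), key=lambda j: key(a[j]))
        a[i], a[k] = a[k], a[i]  # the exchange may jump over equal keys
    return a


def primary(record):
    """The (primary) key is the second component of a record."""
    return record[1]


# Hypothetical records, already ordered by name (the secondary key).
records = [("Adams", 2), ("Brown", 2), ("Jones", 1)]

stable = sorted(records, key=primary)            # stable built-in sort
unstable = straight_selection(records, primary)  # loses the name order
```

Here stable is [("Jones", 1), ("Adams", 2), ("Brown", 2)], so items with key 2 keep their alphabetical order, while unstable is [("Jones", 1), ("Brown", 2), ("Adams", 2)].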

2.2. Sorting Arrays

The predominant requirement that has to be made for sorting methods on arrays is an economical use of the available store. This implies that the permutation of items which brings the items into order has to be performed in situ, and that methods which transport items from an array a to a result array b are intrinsically of minor interest. Having thus restricted our choice of methods among the many possible solutions by the criterion of economy of storage, we proceed to a first classification according to their efficiency, i.e., their economy of time. A good measure of efficiency is obtained by counting the numbers C of needed key comparisons and M of moves (transpositions) of items. These numbers are functions of the number n of items to be sorted. Whereas good sorting algorithms require in the order of n*log(n) comparisons, we first discuss several simple and obvious sorting techniques, called straight methods, all of which require in the order of n^2 comparisons of keys. There are three good reasons for presenting straight methods before proceeding to the faster algorithms.

1. Straight methods are particularly well suited for elucidating the characteristics of the major sorting principles.
2. Their programs are easy to understand and are short. Remember that programs occupy storage as well!
3. Although sophisticated methods require fewer operations, these operations are usually more complex in their details; consequently, straight methods are faster for sufficiently small n, although they must not be used for large n.

Sorting methods that sort items in situ can be classified into three principal categories according to their underlying method:

  Sorting by insertion
  Sorting by selection
  Sorting by exchange

These three principles will now be examined and compared. The procedures operate on a global variable a whose components are to be sorted in situ, i.e., without requiring additional, temporary storage. The components are the keys themselves.
We discard other data represented by the record type Item, thereby simplifying matters. In all algorithms to be developed in this chapter, we will assume the presence of an array a and a constant n, the number of elements of a:

  TYPE Item = INTEGER;
  VAR a: ARRAY n OF Item

2.2.1. Sorting by Straight Insertion

This method is widely used by card players. The items (cards) are conceptually divided into a destination sequence a_0 ... a_{i-1} and a source sequence a_i ... a_{n-1}. In each step, starting with i = 1 and incrementing i by unity, the first element a_i of the source sequence is picked and transferred into the destination sequence by inserting it at the appropriate place.

  Initial keys:  44  55  12  42  94  18  06  67
  i=1            44  55  12  42  94  18  06  67
  i=2            12  44  55  42  94  18  06  67
  i=3            12  42  44  55  94  18  06  67
  i=4            12  42  44  55  94  18  06  67
  i=5            12  18  42  44  55  94  06  67
  i=6            06  12  18  42  44  55  94  67
  i=7            06  12  18  42  44  55  67  94

Table 2.1 A Sample Process of Straight Insertion Sorting.

The process of sorting by insertion is shown in an example of eight numbers chosen at random (see Table 2.1). The algorithm of straight insertion is

  FOR i := 1 TO n-1 DO
    x := a[i];
    insert x at the appropriate place in a_0 ... a_i
  END

In the process of actually finding the appropriate place, it is convenient to alternate between comparisons and moves, i.e., to let x sift down by comparing x with the next item a_j, and either inserting x or moving a_j to the right and proceeding to the left. We note that there are two distinct conditions that may cause the termination of the sifting down process:

1. An item a_j is found with a key less than the key of x.
2. The left end of the destination sequence is reached.

  PROCEDURE StraightInsertion;
    VAR i, j: INTEGER; x: Item;
  BEGIN
    FOR i := 1 TO n-1 DO
      x := a[i]; j := i;
      WHILE (j > 0) & (x < a[j-1]) DO a[j] := a[j-1]; DEC(j) END ;
      a[j] := x
    END
  END StraightInsertion

Analysis of straight insertion. The number C_i of key comparisons in the i-th sift is at most i-1, at least 1, and -- assuming that all permutations of the n keys are equally probable -- i/2 on average. The number M_i of moves (assignments of items) is C_i + 2 (including the sentinel). Therefore, the total numbers of comparisons and moves are

  C_min = n - 1                 M_min = 3*(n - 1)
  C_ave = (n^2 + n - 2)/4       M_ave = (n^2 + 9n - 10)/4
  C_max = (n^2 - n)/2           M_max = (n^2 + 3n - 4)/2
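The behavior of straight insertion on ordered and reversed input can be checked with a small sketch. It is a Python transcription of the procedure above (not the author's Oberon), and the two counters for key comparisons and item moves are additions made for illustration. Because this variant uses no sentinel, its move counts need not coincide with formulas that include one.

```python
def straight_insertion(a):
    """Sort the list a in place by straight insertion.

    Returns (comparisons, moves): key comparisons x < a[j-1] and
    assignments of items, counted as the loop executes.
    """
    comparisons = moves = 0
    for i in range(1, len(a)):
        x = a[i]
        moves += 1
        j = i
        while j > 0:
            comparisons += 1
            if not (x < a[j - 1]):
                break          # an item with a key not greater than x was found
            a[j] = a[j - 1]    # move the larger item one place to the right
            moves += 1
            j -= 1
        a[j] = x
        moves += 1
    return comparisons, moves


keys = [44, 55, 12, 42, 94, 18, 6, 67]   # the keys of Table 2.1
counts = straight_insertion(keys)
```

For 8 keys that are already in order the sketch performs 7 comparisons, and for 8 keys in reverse order it performs 28 comparisons and 42 moves.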

The minimal numbers occur if the items are initially in order; the worst case occurs if the items are initially in reverse order. In this sense, sorting by insertion exhibits a truly natural behavior. It is plain that the given algorithm also describes a stable sorting process: it leaves the order of items with equal keys unchanged.

The algorithm of straight insertion is easily improved by noting that the destination sequence a_0 ... a_{i-1}, in which the new item has to be inserted, is already ordered. Therefore, a faster method of determining the insertion point can be used. The obvious choice is a binary search that samples the destination sequence in the middle and continues bisecting until the insertion point is found. The modified sorting algorithm is called binary insertion.

  PROCEDURE BinaryInsertion(VAR a: ARRAY OF Item; n: INTEGER);
    VAR i, j, m, L, R: INTEGER; x: Item;
  BEGIN
    FOR i := 1 TO n-1 DO

      x := a[i]; L := 1; R := i;
      WHILE L < R DO
        m := (L+R) DIV 2;
        IF a[m] ...

  WHILE (q > 0) & (r > 0) DO
    IF a[i] < a[j] THEN
      move an item from i-source to k-destination;
      advance i and k; q := q-1
    ELSE
      move an item from j-source to k-destination;
      advance j and k; r := r-1
    END
  END ;
  copy tail of i-sequence; copy tail of j-sequence

After this further refinement of the tail copying operations, the program is laid out in complete detail. Before writing it out in full, we wish to eliminate the restriction that n be a power of 2. Which parts of the algorithm are affected by this relaxation of constraints? We easily convince ourselves that the best way to cope with the more general situation is to adhere to the old method as long as possible. In this example this means that we continue merging p-tuples until the remainders of the source sequences are of length less than p. The only part affected is the statements that determine the values of q and r, the lengths of the sequences to be merged. The following four statements replace the three statements

  q := p; r := p; m := m - 2*p

and, as the reader should convince himself, they represent an effective implementation of the strategy specified above; note that m denotes the total number of items in the two source sequences that remain to be merged:

  IF m >= p THEN q := p ELSE q := m END ;
  m := m-q;
  IF m >= p THEN r := p ELSE r := m END ;
  m := m-r

In addition, in order to guarantee termination of the program, the condition p = n, which controls the outer repetition, must be changed to p ≥ n. After these modifications, we may now proceed to describe the entire algorithm in terms of a procedure operating on the global array a with 2n elements.

  PROCEDURE StraightMerge;
    VAR i, j, k, L, t: INTEGER; (*index range of a is 0 .. 2*n-1 *)
      h, m, p, q, r: INTEGER; up: BOOLEAN;
  BEGIN up := TRUE; p := 1;
    REPEAT h := 1; m := n;
      IF up THEN i := 0; j := n-1; k := n; L := 2*n-1
      ELSE k := 0; L := n-1; i := n; j := 2*n-1
      END ;

      REPEAT (*merge a run from i- and j-sources to k-destination*)
        IF m >= p THEN q := p ELSE q := m END ;
        m := m-q;
        IF m >= p THEN r := p ELSE r := m END ;
        m := m-r;
        WHILE (q > 0) & (r > 0) DO
          IF a[i] < a[j] THEN
            a[k] := a[i]; k := k+h; i := i+1; q := q-1
          ELSE
            a[k] := a[j]; k := k+h; j := j-1; r := r-1
          END
        END ;
        WHILE r > 0 DO
          a[k] := a[j]; k := k+h; j := j-1; r := r-1
        END ;
        WHILE q > 0 DO
          a[k] := a[i]; k := k+h; i := i+1; q := q-1
        END ;
        h := -h; t := k; k := L; L := t
      UNTIL m = 0;
      up := ~up; p := 2*p
    UNTIL p >= n;
    IF ~up THEN
      FOR i := 0 TO n-1 DO a[i] := a[i+n] END
    END
  END StraightMerge

Analysis of Mergesort. Since each pass doubles p, and since the sort is terminated as soon as p ≥ n, it involves ⌈log n⌉ passes. Each pass, by definition, copies the entire set of n items exactly once. As a consequence, the total number of moves is exactly

  M = n × ⌈log n⌉

The number C of key comparisons is even less than M since no comparisons are involved in the tail copying operations. However, since the mergesort technique is usually applied in connection with the use of peripheral storage devices, the computational effort involved in the move operations dominates the effort of comparisons often by several orders of magnitude. The detailed analysis of the number of comparisons is therefore of little practical interest.

The merge sort algorithm apparently compares well with even the advanced sorting techniques discussed in the previous chapter. However, the administrative overhead for the manipulation of indices is relatively high, and the decisive disadvantage is the need for storage of 2n items. This is the reason sorting by merging is rarely used on arrays, i.e., on data located in main store. Figures comparing the real time behavior of this Mergesort algorithm appear in the last line of Table 2.9. They compare favorably with Heapsort but unfavorably with Quicksort.

2.4.2. Natural Merging

In straight merging no advantage is gained when the data are initially already partially sorted. The length of all merged subsequences in the k-th pass is less than or equal to 2^k, independent of whether longer subsequences are already ordered and could as well be merged. In fact, any two ordered subsequences of lengths m and n might be merged directly into a single sequence of m+n items. A mergesort that at any time merges the two longest possible subsequences is called a natural merge sort.

An ordered subsequence is often called a string. However, since the word string is even more frequently used to describe sequences of characters, we will follow Knuth in our terminology and use the word run instead of string when referring to ordered subsequences. We call a subsequence a_i ... a_j such that

  (a_{i-1} > a_i) & (∀k : i ≤ k < j : a_k ≤ a_{k+1}) & (a_j > a_{j+1})

a maximal run or, for short, a run. A natural merge sort, therefore, merges (maximal) runs instead of sequences of fixed, predetermined length. Runs have the property that if two sequences of n runs are merged, a single sequence of exactly n runs emerges. Therefore, the total number of runs is halved in each pass, and the number of required moves of items is in the worst case n*log(n), but in the average case it is even less. The expected number of comparisons, however, is much larger because in addition to the comparisons necessary for the selection of items, further comparisons are needed between consecutive items of each file in order to determine the end of each run.

Our next programming exercise develops a natural merge algorithm in the same stepwise fashion that was used to explain the straight merging algorithm. It employs the sequence structure (represented by files, see Sect. 1.8) instead of the array, and it represents an unbalanced, two-phase, three-tape merge sort. We assume that the file variable c represents the initial sequence of items. (Naturally, in an actual data processing application, the initial data are first copied from the original source to c for reasons of safety.) a and b are two auxiliary file variables. Each pass consists of a distribution phase that distributes runs equally from c to a and b, and a merge phase that merges runs from a and b to c. This process is illustrated in Fig. 2.13.

[Figure: files a, b, and c over the 1st, 2nd, ..., n-th run, each run divided into a distribution phase and a merge phase]

Fig. 2.13. Sort phases and passes

  17 31' 05 59' 13 41 43 67' 11 23 29 47' 03 07 71' 02 19 57' 37 61
  05 17 31 59' 11 13 23 29 41 43 47 67' 02 03 07 19 57 71' 37 61
  05 11 13 17 23 29 31 41 43 47 59 67' 02 03 07 19 37 57 61 71
  02 03 05 07 11 13 17 19 23 29 31 37 41 43 47 57 59 61 67 71

Table 2.11. Example of a Natural Mergesort.

As an example, Table 2.11 shows the file c in its original state (line 1) and after each pass (lines 2-4) in a natural merge sort involving 20 numbers. Note that only three passes are needed. The sort terminates as soon as the number of runs on c is 1. (We assume that there exists at least one non-empty run on the initial sequence.) We therefore let a variable L be used for counting the number of runs merged onto c. By making use of the type Rider defined in Sect. 1.8.1, the program can be formulated as follows:

  VAR L: INTEGER;
    r0, r1, r2: Files.Rider; (*see 1.8.1*)

  REPEAT
    Files.Set(r0, a, 0); Files.Set(r1, b, 0); Files.Set(r2, c, 0);
    distribute(r2, r0, r1); (*c to a and b*)
    Files.Set(r0, a, 0); Files.Set(r1, b, 0); Files.Set(r2, c, 0);
    L := 0;
    merge(r0, r1, r2) (*a and b into c*)
  UNTIL L = 1

The two phases clearly emerge as two distinct statements. They are now to be refined, i.e., expressed in more detail. The refined descriptions of distribute (from rider r2 to riders r0 and r1) and merge (from riders r0 and r1 to rider r2) follow:

  REPEAT copyrun(r2, r0);
    IF ~r2.eof THEN copyrun(r2, r1) END
  UNTIL r2.eof

  REPEAT mergerun(r0, r1, r2); INC(L)
  UNTIL r1.eof;
  IF ~r0.eof THEN copyrun(r0, r2); INC(L) END

This method of distribution supposedly results in either equal numbers of runs in both a and b, or in sequence a containing one run more than b. Since corresponding pairs of runs are merged, a leftover run may still be on file a, which simply has to be copied. The statements merge and distribute are formulated in terms of a refined statement mergerun and a subordinate procedure copyrun with obvious tasks. When attempting to formulate them, one runs into a serious difficulty: In order to determine the end of a run, two consecutive keys must be compared. However, files are such that only a single element is immediately accessible. We evidently cannot avoid looking ahead, i.e., associating a buffer with every sequence. The buffer is to contain the first element of the file still to be read and constitutes something like a window sliding over the file.

Instead of programming this mechanism explicitly into our program, we prefer to define yet another level of abstraction. It is represented by a new module Runs. It can be regarded as an extension of module Files of Sect. 1.8, introducing a new type Rider, which we may consider as an extension of type Files.Rider. This new type will not only accept all operations available on Riders and indicate the end of a file, but also indicate the end of a run and the first element of the remaining part of the file. The new type as well as its operators are presented by the following definition.

  DEFINITION Runs;
    IMPORT Files, Texts;
    TYPE Rider = RECORD (Files.Rider) first: INTEGER; eor: BOOLEAN END ;
    PROCEDURE OpenRandomSeq(f: Files.File; length, seed: INTEGER);
    PROCEDURE Set(VAR r: Rider; f: Files.File);
    PROCEDURE copy(VAR source, destination: Rider);
    PROCEDURE ListSeq(VAR W: Texts.Writer; f: Files.File);
  END Runs.
A few additional explanations for the choice of the procedures are necessary. As we shall see, the sorting algorithms discussed here and later are based on copying elements from one file to another. A procedure copy therefore takes the place of separate read and write operations. For convenience of testing the following examples, we also introduce a procedure ListSeq, converting a file of integers into a text. Also for convenience an additional procedure is included: OpenRandomSeq initializes a file with numbers in random order. These two procedures will serve to test the algorithms to be discussed below. The values of the fields eof and eor are defined as results of copy in analogy to eof having been defined as result of a read operation.

  MODULE Runs;
    IMPORT Files, Texts;
    TYPE Rider* = RECORD (Files.Rider) first: INTEGER; eor: BOOLEAN END ;

    PROCEDURE OpenRandomSeq*(f: Files.File; length, seed: INTEGER);
      VAR i: INTEGER; w: Files.Rider;
    BEGIN Files.Set(w, f, 0);
      FOR i := 0 TO length-1 DO
        Files.WriteInt(w, seed); seed := (31*seed) MOD 997 + 5
      END ;
      Files.Close(f)
    END OpenRandomSeq;

    PROCEDURE Set*(VAR r: Rider; f: Files.File);
    BEGIN Files.Set(r, f, 0); Files.ReadInt(r, r.first); r.eor := r.eof
    END Set;

    PROCEDURE copy*(VAR src, dest: Rider);
    BEGIN dest.first := src.first;
      Files.WriteInt(dest, dest.first); Files.ReadInt(src, src.first);
      src.eor := src.eof OR (src.first < dest.first)
    END copy;

    PROCEDURE ListSeq*(VAR W: Texts.Writer; f: Files.File);
      VAR x, y, k, n: INTEGER; r: Files.Rider;
    BEGIN k := 0; n := 0;
      Files.Set(r, f, 0); Files.ReadInt(r, x);
      WHILE ~r.eof DO
        Texts.WriteInt(W, x, 6); INC(k); Files.ReadInt(r, y);
        IF y < x THEN (*run ends*) Texts.Write(W, "|"); INC(n) END ;
        x := y
      END ;
      Texts.Write(W, "$"); Texts.WriteInt(W, k, 5); Texts.WriteInt(W, n, 5);
      Texts.WriteLn(W)
    END ListSeq;

  END Runs.

We now return to the process of successive refinement of the process of natural merging. Procedure copyrun and the statement merge are now conveniently expressible as shown below. Note that we refer to the sequences (files) indirectly via the riders attached to them. In passing, we also note that the rider's field first represents the next key on a sequence being read, and the last key of a sequence being written.

  PROCEDURE copyrun(VAR x, y: Runs.Rider);
  BEGIN (*copy from x to y*)
    REPEAT Runs.copy(x, y) UNTIL x.eor
  END copyrun

  (*merge from r0 and r1 to r2*)
  REPEAT
    IF r0.first < r1.first THEN
      Runs.copy(r0, r2);
      IF r0.eor THEN copyrun(r1, r2) END
    ELSE
      Runs.copy(r1, r2);
      IF r1.eor THEN copyrun(r0, r2) END
    END
  UNTIL r0.eor OR r1.eor

The comparison and selection process of keys in merging a run terminates as soon as one of the two runs is exhausted. After this, the other run (which is not exhausted yet) has to be transferred to the resulting run by merely copying its tail. This is done by a call of procedure copyrun.

This would seem to conclude the development of the natural merging sort procedure. Regrettably, the program is incorrect, as the very careful reader may have noticed. The program is incorrect in the sense that it does not sort properly in some cases.
Consider, for example, the following sequence of input data:

  03 02 05 11 07 13 19 17 23 31 29 37 43 41 47 59 57 61 71 67

By distributing consecutive runs alternately to a and b, we obtain

  a = 03 ' 07 13 19 ' 29 37 43 ' 57 61 71 '
  b = 02 05 11 ' 17 23 31 ' 41 47 59 ' 67

These sequences are readily merged into a single run, whereafter the sort terminates successfully. The example, although it does not lead to an erroneous behaviour of the program, makes us aware that mere distribution of runs to several files may result in a number of output runs that is less than the number of input runs. This is because the first item of the (i+2)nd run may be larger than the last item of the i-th run, thereby causing the two runs to merge automatically into a single run.
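The automatic merging of runs during distribution can be reproduced with a short sketch. It is written in Python with lists standing in for files; the helper names split_runs and distribute are assumptions made for illustration and do not appear in the text.

```python
def split_runs(seq):
    """Split a sequence into its maximal runs (non-decreasing subsequences)."""
    runs = []
    for x in seq:
        if runs and runs[-1][-1] <= x:
            runs[-1].append(x)   # x continues the current run
        else:
            runs.append([x])     # x starts a new run
    return runs


def distribute(seq):
    """Distribute consecutive runs of seq alternately onto two lists a and b."""
    a, b = [], []
    for index, run in enumerate(split_runs(seq)):
        (a if index % 2 == 0 else b).extend(run)
    return a, b


data = [3, 2, 5, 11, 7, 13, 19, 17, 23, 31,
        29, 37, 43, 41, 47, 59, 57, 61, 71, 67]
a, b = distribute(data)
```

The input consists of 8 runs, of which 4 go to each list. Since every run written to a list begins with a key larger than the last key already there, len(split_runs(a)) and len(split_runs(b)) are both 1: each list ends up holding a single run.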

Although procedure distribute supposedly outputs runs in equal numbers to the two files, the important consequence is that the actual number of resulting runs on a and b may differ significantly. Our merge procedure, however, only merges pairs of runs and terminates as soon as b is read, thereby losing the tail of one of the sequences. Consider the following input data that are sorted (and truncated) in two subsequent passes:

  17 19 13 57 23 29 11 59 31 37 07 61 41 43 05 67 47 71 02 03
  13 17 19 23 29 31 37 41 43 47 57 71 11 59
  11 13 17 19 23 29 31 37 41 43 47 57 59 71

Table 2.12 Incorrect Result of Mergesort Program.

The example of this programming mistake is typical for many programming situations. The mistake is caused by an oversight of one of the possible consequences of a presumably simple operation. It is also typical in the sense that several ways of correcting the mistake are open and that one of them has to be chosen. Often there exist two possibilities that differ in a very important, fundamental way:

1. We recognize that the operation of distribution is incorrectly programmed and does not satisfy the requirement that the number of runs differ by at most 1. We stick to the original scheme of operation and correct the faulty procedure accordingly.
2. We recognize that the correction of the faulty part involves far-reaching modifications, and we try to find ways in which other parts of the algorithm may be changed to accommodate the currently incorrect part.

In general, the first path seems to be the safer, cleaner one, the more honest way, providing a fair degree of immunity from later consequences of overlooked, intricate side effects. It is, therefore, the way toward a solution that is generally recommended. It is to be pointed out, however, that the second possibility should sometimes not be entirely ignored.
It is for this reason that we further elaborate on this example and illustrate a fix by modification of the merge procedure rather than the distribution procedure, which is primarily at fault.

This implies that we leave the distribution scheme untouched and renounce the condition that runs be equally distributed. This may result in a less than optimal performance. However, the worst-case performance remains unchanged, and moreover, the case of highly unequal distribution is statistically very unlikely. Efficiency considerations are therefore no serious argument against this solution.

If the condition of equal distribution of runs no longer exists, then the merge procedure has to be changed so that, after reaching the end of one file, the entire tail of the remaining file is copied instead of at most one run. This change is straightforward and is very simple in comparison with any change in the distribution scheme. (The reader is urged to convince himself of the truth of this claim.) The revised version of the merge algorithm is shown below in the form of a function procedure:

  PROCEDURE NaturalMerge(src: Files.File): Files.File;
    VAR L: INTEGER; (*no. of runs merged*)
      f0, f1, f2: Files.File;
      r0, r1, r2: Runs.Rider;

    PROCEDURE copyrun(VAR x, y: Runs.Rider);
    BEGIN (*from x to y*)
      REPEAT Runs.copy(x, y) UNTIL x.eor
    END copyrun;

  BEGIN Runs.Set(r2, src);
    REPEAT
      f0 := Files.New("test0"); Files.Set(r0, f0, 0);
      f1 := Files.New("test1"); Files.Set(r1, f1, 0);
      (*distribute from r2 to r0 and r1*)
      REPEAT copyrun(r2, r0);
        IF ~r2.eof THEN copyrun(r2, r1) END
      UNTIL r2.eof;
      Runs.Set(r0, f0); Runs.Set(r1, f1);
      f2 := Files.New(""); Files.Set(r2, f2, 0);
      L := 0;

      (*merge from r0 and r1 to r2*)
      REPEAT
        REPEAT
          IF r0.first < r1.first THEN
            Runs.copy(r0, r2);
            IF r0.eor THEN copyrun(r1, r2) END
          ELSE
            Runs.copy(r1, r2);
            IF r1.eor THEN copyrun(r0, r2) END
          END
        UNTIL r0.eor OR r1.eor;
        INC(L)
      UNTIL r0.eof OR r1.eof;
      WHILE ~r0.eof DO copyrun(r0, r2); INC(L) END ;
      WHILE ~r1.eof DO copyrun(r1, r2); INC(L) END ;
      Runs.Set(r2, f2)
    UNTIL L = 1;
    RETURN f2
  END NaturalMerge;

2.4.3. Balanced Multiway Merging

The effort involved in a sequential sort is proportional to the number of required passes since, by definition, every pass involves the copying of the entire set of data. One way to reduce this number is to distribute runs onto more than two files. Merging r runs that are equally distributed on N files results in a sequence of r/N runs. A second pass reduces their number to r/N^2, a third pass to r/N^3, and after k passes there are r/N^k runs left. The total number of passes required to sort n items by N-way merging is therefore k = log_N(n). Since each pass requires n copy operations, the total number of copy operations is in the worst case

  M = n × log_N(n)

As the next programming exercise, we will develop a sort program based on multiway merging. In order to further contrast the program from the previous natural two-phase merging procedure, we shall formulate the multiway merge as a single phase, balanced mergesort. This implies that in each pass there are an equal number of input and output files onto which consecutive runs are alternately distributed. Using 2N files, the algorithm will therefore be based on N-way merging. Following the previously adopted strategy, we will not bother to detect the automatic merging of two consecutive runs distributed onto the same file. Consequently, we are forced to design the merge program without assuming strictly equal numbers of runs on the input files. In this program we encounter for the first time a natural application of a data structure consisting of arrays of files.
As a matter of fact, it is surprising how strongly the following program differs from the previous one because of the change from two-way to multiway merging. The change is primarily a result of the circumstance that the merge process can no longer simply be terminated after one of the input runs is exhausted. Instead, a list of inputs that are still active, i.e., not yet exhausted, must be kept. Another complication stems from the need to switch the groups of input and output files after each pass. Here the indirection of access to files via riders comes in handy. In each pass, data may be copied from the same riders r to the same riders w. At the end of each pass we merely need to reset the input and output files to different riders.

Obviously, file numbers are used to index the array of files. Let us then assume that the initial file is the parameter src, and that for the sorting process 2N files are available:

    f, g: ARRAY N OF Files.File;
    r, w: ARRAY N OF Runs.Rider

The algorithm can now be sketched as follows:

    PROCEDURE BalancedMerge(src: Files.File): Files.File;
      VAR i, j: INTEGER;
        L: INTEGER; (*no. of runs distributed*)
        R: Runs.Rider;
    BEGIN
      Runs.Set(R, src); (*distribute initial runs from R to w[0] ... w[N-1]*)
      j := 0; L := 0;
      position riders w on files g;
      REPEAT
        copy one run from R to w[j];
        INC(j); INC(L);
        IF j = N THEN j := 0 END
      UNTIL R.eof;
      REPEAT (*merge from riders r to riders w*)
        switch files g to riders r;
        L := 0; j := 0; (*j = index of output file*)
        REPEAT
          INC(L);
          merge one run from inputs to w[j];
          INC(j); IF j = N THEN j := 0 END
        UNTIL all inputs exhausted
      UNTIL L = 1
      (*sorted file is with w[0]*)
    END BalancedMerge.

Having associated a rider R with the source file, we now refine the statement for the initial distribution of runs. Using the definition of copy, we replace copy one run from R to w[j] by:

    REPEAT Runs.copy(R, w[j]) UNTIL R.eor

Copying a run terminates when either the first item of the next run is encountered or when the end of the entire input file is reached. In the actual sort algorithm, the following statements remain to be specified in more detail:

    1. Position riders w on files g
    2. Merge one run from inputs to w[j]
    3. Switch files g to riders r
    4. All inputs exhausted

First, we must accurately identify the current input sequences. Notably, the number of active inputs may be less than N. Obviously, there can be at most as many sources as there are runs; the sort terminates as soon as there is one single sequence left. This leaves open the possibility that at the initiation of the last sort pass there are fewer than N runs. We therefore introduce a variable, say k1, to denote the actual number of inputs used. We incorporate the initialization of k1 in the statement switch files as follows:

    IF L < N THEN k1 := L ELSE k1 := N END ;
    FOR i := 0 TO k1-1 DO Runs.Set(r[i], g[i]) END

Naturally, statement (2) is to decrement k1 whenever an input source ceases. Hence, predicate (4) may easily be expressed by the relation k1 = 0.
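The initial distribution and the choice of k1 can be tried out on ordinary lists. The sketch below is in Python rather than Oberon and makes several assumptions: plain lists stand in for files and riders, and the names split_into_runs and distribute are hypothetical. Like the program in the text, it does not detect two consecutive runs on the same file that happen to merge into one.

```python
def split_into_runs(items):
    """Split a sequence into maximal non-decreasing runs."""
    runs = []
    for x in items:
        if runs and runs[-1][-1] <= x:
            runs[-1].append(x)      # x continues the current run
        else:
            runs.append([x])        # x starts a new run
    return runs


def distribute(items, n):
    """Distribute the runs of `items` cyclically onto n lists ("files").

    Returns the n lists and the number L of runs distributed.
    """
    files = [[] for _ in range(n)]
    j = 0
    count = 0
    for run in split_into_runs(items):
        files[j].extend(run)
        count += 1
        j = (j + 1) % n             # same effect as: INC(j); IF j = N THEN j := 0 END
    return files, count


if __name__ == "__main__":
    N = 3
    files, L = distribute([3, 5, 1, 4, 4, 2, 9, 0], N)
    k1 = min(L, N)                  # number of inputs actually used in the next pass
    print(files, L, k1)
```

With fewer runs than files, k1 equals L, which is the case the text guards against at the start of the last pass.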
Statement (2), however, is more difficult to refine; it consists of the repeated selection of the least key among the available sources and its subsequent transport to the destination, i.e., the current output sequence. The process is further complicated by the necessity of determining the end of each run. The end of a run may be reached because (1) the subsequent key is less than the current key or (2) the end of the source is reached. In the latter case the source is eliminated by decrementing k1; in the former case the run is closed by excluding the sequence from further selection of items, but only until the creation of the current output run is completed. This makes it obvious that a second variable, say k2, is needed to denote the number of sources actually available for the selection of the next item. This value is initially set equal to k1 and is decremented whenever a run terminates because of condition (1).

Unfortunately, the introduction of k2 is not sufficient. We need to know not only the number of files, but also which files are still in actual use. An obvious solution is to use an array with Boolean components indicating the availability of the files. We choose, however, a different method that leads to a more efficient selection procedure which, after all, is the most frequently repeated part of the entire algorithm. Instead of using a Boolean array, a file index map, say t, is introduced. This map is used so that $t_0 \ldots t_{k2-1}$ are the indices of the available sequences. Thus statement (2) can be formulated as follows:

    k2 := k1;
    REPEAT
      select the minimal key, let t[m] be the sequence number on which it occurs;
      Runs.copy(r[t[m]], w[j]);
      IF r[t[m]].eof THEN eliminate sequence
      ELSIF r[t[m]].eor THEN close run
      END
    UNTIL k2 = 0

Since the number of sequences will be fairly small for any practical purpose, the selection algorithm to be specified in further detail in the next refinement step may as well be a straightforward linear search. The statement eliminate sequence implies a decrease of k1 as well as k2 and also a reassignment of indices in the map t. The statement close run merely decrements k2 and rearranges components of t accordingly. The details are shown in the following procedure, being the last refinement. The statement switch sequences is elaborated according to explanations given earlier.

    PROCEDURE BalancedMerge(src: Files.File): Files.File;
      VAR i, j, m, tx: INTEGER;
        L, k1, k2: INTEGER;
        min, x: INTEGER;
        t: ARRAY N OF INTEGER; (*index map*)
        R: Runs.Rider; (*source*)
        f, g: ARRAY N OF Files.File;
        r, w: ARRAY N OF Runs.Rider;
    BEGIN
      Runs.Set(R, src);
      FOR i := 0 TO N-1 DO
        g[i] := Files.New(""); Files.Set(w[i], g[i], 0)
      END ;
      (*distribute initial runs from src to g[0] ... g[N-1]*)
      j := 0; L := 0;
      REPEAT
        REPEAT Runs.copy(R, w[j]) UNTIL R.eor;
        INC(L); INC(j);
        IF j = N THEN j := 0 END
      UNTIL R.eof;
      FOR i := 0 TO N-1 DO t[i] := i END ;
      REPEAT
        IF L < N THEN k1 := L ELSE k1 := N END ;
        FOR i := 0 TO k1-1 DO Runs.Set(r[i], g[i]) END ; (*set input riders*)
        FOR i := 0 TO k1-1 DO
          g[i] := Files.New(""); Files.Set(w[i], g[i], 0)
        END ; (*set output riders*)
        (*merge from r[0] ... r[N-1] to w[0] ... w[N-1]*)
        FOR i := 0 TO N-1 DO t[i] := i END ;
        L := 0; (*nof runs merged*)
        j := 0;
        REPEAT (*merge one run from inputs to w[j]*)
          INC(L); k2 := k1;
          REPEAT (*select the minimal key*)
            m := 0; min := r[t[0]].first; i := 1;
            WHILE i < k2 DO
              x := r[t[i]].first;
              IF x < min THEN min := x; m := i END ;
              INC(i)
            END ;
            Runs.copy(r[t[m]], w[j]);
            IF r[t[m]].eof THEN (*eliminate this sequence*)
              DEC(k1); DEC(k2);
              t[m] := t[k2]; t[k2] := t[k1]
            ELSIF r[t[m]].eor THEN (*close run*)
              DEC(k2);
              tx := t[m]; t[m] := t[k2]; t[k2] := tx
            END
          UNTIL k2 = 0;
          INC(j);
          IF j = N THEN j := 0 END
        UNTIL k1 = 0
      UNTIL L = 1;
      RETURN Files.Base(w[t[0]])
    END BalancedMerge

2.4.4. Polyphase Sort

We have now discussed the necessary techniques and have acquired the proper background to investigate and program yet another sorting algorithm whose performance is superior to the balanced sort. We have seen that balanced merging eliminates the pure copying operations necessary when the distribution and the merging operations are united into a single phase. The question arises whether or not the given sequences could be processed even more efficiently. This is indeed the case; the key to this next improvement lies in abandoning the rigid notion of strict passes, i.e., to use the sequences in a more sophisticated way than by always having N/2 sources and as many destinations and exchanging sources and destinations at the end of each distinct pass. Instead, the notion of a pass becomes diffuse. The method was invented by R.L. Gilstad [2-3] and called Polyphase Sort.

It is first illustrated by an example using three sequences. At any time, items are merged from two sources into a third sequence variable. Whenever one of the source sequences is exhausted, it immediately becomes the destination of the merge operations of data from the non-exhausted source and the previous destination sequence.

As we know that n runs on each input are transformed into n runs on the output, we need to list only the number of runs present on each sequence (instead of specifying actual keys). In Fig. 2.14 we assume that initially the two input sequences f1 and f2 contain 13 and 8 runs, respectively. Thus, in the first pass 8 runs are merged from f1 and f2 to f3, in the second pass the remaining 5 runs are merged from f3 and f1 onto f2, etc.
In the end, f1 is the sorted sequence.

    f1   f2   f3
    13    8
     5    0    8
     0    5    3
     3    2    0
     1    0    2
     0    1    1
     1    0    0

Fig. 2.14. Polyphase mergesort of 21 runs with 3 sequences
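The run counts of Fig. 2.14 follow mechanically from the rule stated above: merge until one source is exhausted, then let that source become the new destination. The following Python sketch (the text itself uses Oberon) tracks only the run counts, not the keys; the function name polyphase_phases is hypothetical, and the sketch assumes that the initial distribution is a perfect one with exactly one empty sequence.

```python
def polyphase_phases(counts):
    """Return the successive run counts of a Polyphase merge.

    `counts` holds the number of runs on each sequence; exactly one
    entry must be 0 (the first destination). In each phase as many runs
    are merged as the shortest source holds, after which that source is
    empty and serves as the next destination.
    """
    counts = list(counts)
    history = [tuple(counts)]
    while sum(counts) > 1:
        out = counts.index(0)                       # current destination
        merged = min(c for i, c in enumerate(counts) if i != out)
        if merged == 0:
            raise ValueError("not a perfect distribution")
        counts = [merged if i == out else c - merged
                  for i, c in enumerate(counts)]
        history.append(tuple(counts))
    return history


if __name__ == "__main__":
    for row in polyphase_phases([13, 8, 0]):
        print(row)
```

Calling the function with a count list of six entries, five of them non-zero, traces a merge with six sequences in the same way.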

A second example shows the Polyphase method with 6 sequences. Let there initially be 16 runs on f1, 15 on f2, 14 on f3, 12 on f4, and 8 on f5. In the first partial pass, 8 runs are merged onto f6. In the end, f2 contains the sorted set of items (see Fig. 2.15).

    f1   f2   f3   f4   f5   f6
    16   15   14   12    8
     8    7    6    4    0    8
     4    3    2    0    4    4
     2    1    0    2    2    2
     1    0    1    1    1    1
     0    1    0    0    0    0

Fig. 2.15. Polyphase mergesort of 65 runs with 6 sequences

Polyphase is more efficient than balanced merge because, given N sequences, it always operates with an (N-1)-way merge instead of an N/2-way merge. As the number of required passes is approximately $\log_N n$, n being the number of items to be sorted and N being the degree of the merge operations, Polyphase promises a significant improvement over balanced merging.

Of course, the distribution of initial runs was carefully chosen in the above examples. In order to find out which initial distributions of runs lead to a proper functioning, we work backward, starting with the final distribution (last line in Fig. 2.15). Rewriting the tables of the two examples and rotating each row by one position with respect to the prior row yields Tables 2.13 and 2.14 for six passes and for three and six sequences, respectively.

    L   a1(L)   a2(L)   Sum ai(L)
    0     1       0        1
    1     1       1        2
    2     2       1        3
    3     3       2        5
    4     5       3        8
    5     8       5       13
    6    13       8       21

Table 2.13 Perfect distribution of runs on two sequences.

    L   a1(L)   a2(L)   a3(L)   a4(L)   a5(L)   Sum ai(L)
    0     1       0       0       0       0        1
    1     1       1       1       1       1        5
    2     2       2       2       2       1        9
    3     4       4       4       3       2       17
    4     8       8       7       6       4       33
    5    16      15      14      12       8       65

Table 2.14 Perfect distribution of runs on five sequences.

From Table 2.13 we can deduce for L > 0 the relations

    $a_2(L+1) = a_1(L)$
    $a_1(L+1) = a_1(L) + a_2(L)$

and $a_1(0) = 1$, $a_2(0) = 0$. Defining $f_{i+1} = a_1(i)$, we obtain for $i > 0$

    $f_{i+1} = f_i + f_{i-1}$,   $f_1 = 1$,   $f_0 = 0$

These are the recursive rules (or recurrence relations) defining the Fibonacci numbers:

    f = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...

Each Fibonacci number is the sum of its two predecessors. As a consequence, the numbers of initial runs on the two input sequences must be two consecutive Fibonacci numbers in order to make Polyphase work properly with three sequences.

How about the second example (Table 2.14) with six sequences? The formation rules are easily derived as

    $a_5(L+1) = a_1(L)$
    $a_4(L+1) = a_1(L) + a_5(L) = a_1(L) + a_1(L-1)$
    $a_3(L+1) = a_1(L) + a_4(L) = a_1(L) + a_1(L-1) + a_1(L-2)$
    $a_2(L+1) = a_1(L) + a_3(L) = a_1(L) + a_1(L-1) + a_1(L-2) + a_1(L-3)$
    $a_1(L+1) = a_1(L) + a_2(L) = a_1(L) + a_1(L-1) + a_1(L-2) + a_1(L-3) + a_1(L-4)$

Substituting $f_i$ for $a_1(i)$ yields

    $f_{i+1} = f_i + f_{i-1} + f_{i-2} + f_{i-3} + f_{i-4}$   for $i \ge 4$
    $f_4 = 1$
    $f_i = 0$   for $i < 4$

These numbers are the Fibonacci numbers of order 4. In general, the Fibonacci numbers of order p are defined as follows:

    $f_{i+1}^{(p)} = f_i^{(p)} + f_{i-1}^{(p)} + \ldots + f_{i-p}^{(p)}$   for $i \ge p$
    $f_p^{(p)} = 1$
    $f_i^{(p)} = 0$   for $0 \le i < p$

Note that the ordinary Fibonacci numbers are those of order 1.

We have now seen that the initial numbers of runs for a perfect Polyphase Sort with N sequences are the sums of any N-1, N-2, ... , 1 (see Table 2.15) consecutive Fibonacci numbers of order N-2. This apparently implies that this method is only applicable to inputs whose number of runs is the sum of N-1 such Fibonacci sums. The important question thus arises: What is to be done when the number of initial runs is not such an ideal sum? The answer is simple (and typical for such situations): we simulate the existence of hypothetical empty runs, such that the sum of real and hypothetical runs is a perfect sum. The empty runs are called dummy runs.

But this is not really a satisfactory answer because it immediately raises the further and more difficult question: How do we recognize dummy runs during merging? Before answering this question we must first investigate the prior problem of initial run distribution and decide upon a rule for the distribution of actual and dummy runs onto the N-1 tapes.

     L   N = 3      4       5       6       7       8
     1      2       3       4       5       6       7
     2      3       5       7       9      11      13
     3      5       9      13      17      21      25
     4      8      17      25      33      41      49
     5     13      31      49      65      81      97
     6     21      57      94     129     161     193
     7     34     105     181     253     321     385
     8     55     193     349     497     636     769
     9     89     355     673     977    1261    1531
    10    144     653    1297    1921    2501    3049
    11    233    1201    2500    3777    4961    6073
    12    377    2209    4819    7425    9841   12097
    13    610    4063    9289   14597   19521   24097
    14    987    7473   17905   28697   38721   48001

Table 2.15 Numbers of runs allowing for perfect distribution.
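The formation rules for the perfect distributions translate directly into a short program. The sketch below is written in Python instead of Oberon, and the name perfect_distribution is hypothetical. It starts from the level-0 row (1, 0, ..., 0) and applies the rule that each new entry is the old first entry plus the old right-hand neighbour, the last entry being the old first entry alone.

```python
def perfect_distribution(n_sources, level):
    """Return the row a_1(L) ... a_k(L) of a perfect distribution.

    n_sources is k = N-1, the number of input sequences; level is L.
    Rule: a_i(L+1) = a_1(L) + a_{i+1}(L) for i < k, and a_k(L+1) = a_1(L).
    """
    a = [1] + [0] * (n_sources - 1)          # level 0
    for _ in range(level):
        first = a[0]
        a = [first + a[i + 1] for i in range(n_sources - 1)] + [first]
    return a


if __name__ == "__main__":
    for level in range(6):
        row = perfect_distribution(5, level)
        print(level, row, sum(row))
```

For five sources this prints the rows of Table 2.14 together with their sums; with two sources the rows consist of consecutive Fibonacci numbers.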

In order to find an appropriate rule for distribution, however, we must know how actual and dummy runs are merged. Clearly, the selection of a dummy run from sequence i means precisely that sequence i is ignored during this merge, resulting in a merge from fewer than N-1 sources. Merging of a dummy run from all N-1 sources implies no actual merge operation, but instead the recording of the resulting dummy run on the output sequence. From this we conclude that dummy runs should be distributed to the N-1 sequences as uniformly as possible, since we are interested in active merges from as many sources as possible.

Let us forget dummy runs for a moment and consider the problem of distributing an unknown number of runs onto N-1 sequences. It is plain that the Fibonacci numbers of order N-2 specifying the desired numbers of runs on each source can be generated while the distribution progresses. Assuming, for example, N = 6 and referring to Table 2.14, we start by distributing runs as indicated by the row with index L = 1 (1, 1, 1, 1, 1); if there are more runs available, we proceed to the second row (2, 2, 2, 2, 1); if the source is still not exhausted, the distribution proceeds according to the third row (4, 4, 4, 3, 2), and so on. We shall call the row index level. Evidently, the larger the number of runs, the higher is the level of Fibonacci numbers which, incidentally, is equal to the number of merge passes or switchings necessary for the subsequent sort. The distribution algorithm can now be formulated in a first version as follows:

    1. Let the distribution goal be the Fibonacci numbers of order N-2, level 1.
    2. Distribute according to the set goal.
    3. If the goal is reached, compute the next level of Fibonacci numbers; the difference between them and those on the former level constitutes the new distribution goal. Return to step 2. If the goal cannot be reached because the source is exhausted, terminate the distribution process.
The rules for calculating the next level of Fibonacci numbers are contained in their definition. We can thus concentrate our attention on step 2, where, with a given goal, the subsequent runs are to be distributed one after the other onto the N-1 output sequences. It is here where the dummy runs have to reappear in our considerations.

Let us assume that when raising the level, we record the next goal by the differences $d_i$ for $i = 1 \ldots N-1$, where $d_i$ denotes the number of runs to be put onto sequence i in this step. We can now assume that we immediately put $d_i$ dummy runs onto sequence i and then regard the subsequent distribution as the replacement of dummy runs by actual runs, each time recording a replacement by subtracting 1 from the count $d_i$. Thus, $d_i$ indicates the number of dummy runs on sequence i when the source becomes empty.

It is not known which algorithm yields the optimal distribution, but the following has proved to be a very good method. It is called horizontal distribution (cf. Knuth, Vol. 3, p. 270), a term that can be understood by imagining the runs as being piled up in the form of silos, as shown in Fig. 2.16 for N = 6, level 5 (cf. Table 2.14). In order to reach an equal distribution of remaining dummy runs as quickly as possible, their replacement by actual runs reduces the size of the piles by picking off dummy runs on horizontal levels proceeding from left to right. In this way, the runs are distributed onto the sequences as indicated by their numbers as shown in Fig. 2.16.

    8    1
    7    2    3    4
    6    5    6    7    8
    5    9   10   11   12
    4   13   14   15   16   17
    3   18   19   20   21   22
    2   23   24   25   26   27
    1   28   29   30   31   32

Fig. 2.16. Horizontal distribution of runs
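The picking-off order of Fig. 2.16 can be generated by scanning the piles level by level from the top. This is a small Python sketch (not the Oberon of the text) of that description only; the name horizontal_order is hypothetical, and the pile heights 8, 7, 7, 6, 4 are the differences between rows 5 and 4 of Table 2.14.

```python
def horizontal_order(piles):
    """Return, for each run in turn, the index of the sequence receiving it.

    `piles` holds the number of dummy runs on each sequence. The piles are
    scanned one horizontal level at a time, from the highest level down,
    and within a level from left to right.
    """
    order = []
    for height in range(max(piles), 0, -1):
        for seq, size in enumerate(piles):
            if size >= height:          # this pile reaches the current level
                order.append(seq)
    return order


if __name__ == "__main__":
    order = horizontal_order([8, 7, 7, 6, 4])
    print(len(order), order[:8])
```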

We are now in a position to describe the algorithm in the form of a procedure called select, which is activated each time a run has been copied and a new source is selected for the next run. We assume the existence of a variable j denoting the index of the current destination sequence. $a_i$ and $d_i$ denote the ideal and dummy distribution numbers for sequence i.

    j, level: INTEGER;
    a, d: ARRAY N OF INTEGER;

These variables are initialized with the following values:

    $a_i = 1$, $d_i = 1$   for $i = 0 \ldots N-2$
    $a_{N-1} = 0$, $d_{N-1} = 0$   (dummy)
    j = 0, level = 0

Note that select is to compute the next row of Table 2.14, i.e., the values $a_1(L) \ldots a_{N-1}(L)$ each time that the level is increased. The next goal, i.e., the differences $d_i = a_i(L) - a_i(L-1)$ are also computed at that time. The indicated algorithm relies on the fact that the resulting $d_i$ decrease with increasing index (descending stair in Fig. 2.16). Note that the exception is the transition from level 0 to level 1; this algorithm must therefore be used starting at level 1. Select ends by decrementing $d_j$ by 1; this operation stands for the replacement of a dummy run on sequence j by an actual run.

    PROCEDURE select;
      VAR i, z: INTEGER;
    BEGIN
      IF d[j] < d[j+1] THEN INC(j)
      ELSE
        IF d[j] = 0 THEN
          INC(level); z := a[0];
          FOR i := 0 TO N-2 DO
            d[i] := z + a[i+1] - a[i]; a[i] := z + a[i+1]
          END
        END ;
        j := 0
      END ;
      DEC(d[j])
    END select

Assuming the availability of a routine to copy a run from the source src with rider R onto $f_j$ with rider $r_j$, we can formulate the initial distribution phase as follows (assuming that the source contains at least one run):

    REPEAT select; copyrun UNTIL R.eof

Here, however, we must pause for a moment to recall the effect encountered in distributing runs in the previously discussed natural merge algorithm: The fact that two runs consecutively arriving at the same destination may merge into a single run causes the assumed numbers of runs to be incorrect.
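To see select at work, the procedure can be transcribed and run on run counts alone. The following is a Python sketch of the Oberon procedure above, under the assumption that only the bookkeeping matters and no keys are copied; the wrapper name distribute_counts and the list destinations are hypothetical additions for inspection.

```python
def distribute_counts(n_files, n_runs):
    """Simulate `n_runs` calls of select for a sort with n_files sequences.

    Returns the ideal counts a, the remaining dummy counts d (both without
    the dummy last entry), and the destination index chosen for each run.
    """
    a = [1] * (n_files - 1) + [0]       # ideal distribution, dummy entry last
    d = [1] * (n_files - 1) + [0]       # dummy runs still to be replaced
    j = 0
    level = 0
    destinations = []
    for _ in range(n_runs):
        if d[j] < d[j + 1]:
            j += 1
        else:
            if d[j] == 0:               # goal reached: compute the next level
                level += 1
                z = a[0]
                for i in range(n_files - 1):
                    d[i] = z + a[i + 1] - a[i]
                    a[i] = z + a[i + 1]
            j = 0
        d[j] -= 1                       # a dummy run is replaced by an actual run
        destinations.append(j)
    return a[:-1], d[:-1], destinations


if __name__ == "__main__":
    a, d, dest = distribute_counts(6, 65)
    print(a, d)
    print(dest[33:41])
```

With 6 sequences and 65 runs the simulation ends without dummy runs, and the final 32 destinations follow the horizontal order of Fig. 2.16.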
By devising the sort algorithm such that its correctness does not depend on the number of runs, this side effect can safely be ignored. In the Polyphase Sort, however, we are particularly concerned about keeping track of the exact number of runs on each file. Consequently, we cannot afford to overlook the effect of such a coincidental merge. An additional complication of the distribution algorithm therefore cannot be avoided. It becomes necessary to retain the keys of the last item of the last run on each sequence. Fortunately, our implementation of Runs does exactly this. In the case of output sequences, f.first represents the item last written. A next attempt to describe the distribution algorithm could therefore be

    REPEAT select; IF f[j].first [...]

[...] f(optimum) THEN optimum := solution END

The variable optimum records the best solution so far encountered. Naturally, it has to be properly initialized; moreover, it is customary to record the value f(optimum) by another variable in order to avoid its frequent recomputation.

An example of the general problem of finding an optimal solution to a given problem follows: We choose the important and frequently encountered problem of finding an optimal selection out of a given set of objects subject to constraints. Selections that constitute acceptable solutions are gradually built up by investigating individual objects from the base set. A procedure Try describes the process of investigating the suitability of one individual object, and it is called recursively (to investigate the next object) until all objects have been considered.

We note that the consideration of each object (called candidates in previous examples) has two possible outcomes, namely, either the inclusion of the investigated object in the current selection or its exclusion. This makes the use of a repeat or for statement inappropriate; instead, the two cases may as well be explicitly written out. This is shown, assuming that the objects are numbered 1, 2, ... , n.

    PROCEDURE Try(i: INTEGER);
    BEGIN
      IF inclusion is acceptable THEN
        include i th object;
        IF i < n THEN Try(i+1) ELSE check optimality END ;
        eliminate i th object
      END ;
      IF exclusion is acceptable THEN
        IF i < n THEN Try(i+1) ELSE check optimality END
      END
    END Try

From this pattern it is evident that there are $2^n$ possible sets; clearly, appropriate acceptability criteria must be employed to reduce the number of investigated candidates very drastically. In order to elucidate this process, let us choose a concrete example for a selection problem: Let each of the n objects $a_0, \ldots, a_{n-1}$ be characterized by its weight and its value. Let the optimal set be the one with the largest sum of the values of its components, and let the constraint be a limit on the sum of their weight. This is a problem well known to all travellers who pack suitcases by selecting from n items in such a way that their total value is optimal and that their total weight does not exceed a specific allowance.

We are now in a position to decide upon the representation of the given facts in terms of global variables. The choices are easily derived from the foregoing developments:

    TYPE object = RECORD weight, value: INTEGER END ;
    VAR obj: ARRAY n OF object;
      limw, totv, maxv: INTEGER;
      s, opts: SET

The variables limw and totv denote the weight limit and the total value of all n objects. These two values are actually constant during the entire selection process. s represents the current selection of objects in which each object is represented by its name (index). opts is the optimal selection so far encountered, and maxv is its value.

What are now the criteria for acceptability of an object for the current selection? If we consider inclusion, then an object is selectable if it fits into the weight allowance. If it does not fit, we may stop trying to add further objects to the current selection. If, however, we consider exclusion, then the criterion for acceptability, i.e., for the continuation of building up the current selection, is that the total value which is still achievable after this exclusion is not less than the value of the optimum so far encountered. For, if it is less, continuation of the search, although it may produce some solution, will not yield the optimal solution. Hence any further search on the current path is fruitless. From these two conditions we determine the relevant quantities to be computed for each step in the selection process:

    1. The total weight tw of the selection s so far made.
    2. The still achievable value av of the current selection s.

These two entities are appropriately represented as parameters of the procedure Try. The condition inclusion is acceptable can now be formulated as

    tw + obj[i].weight <= limw

and the subsequent check for optimality as

    IF av > maxv THEN (*new optimum, record it*)
      opts := s; maxv := av
    END

The last assignment is based on the reasoning that the achievable value is the achieved value, once all n objects have been dealt with.
The condition exclusion is acceptable is expressed by

    av - obj[i].value > maxv

Since it is used again thereafter, the value av - obj[i].value is given the name av1 in order to circumvent its reevaluation.

The entire procedure is now composed of the discussed parts with the addition of appropriate initialization statements for the global variables. The ease of expressing inclusion and exclusion from the set s by use of set operators is noteworthy. The results opts and maxv of Selection with weight allowances ranging from 10 to 120 are listed in Table 3.5.

    TYPE Object = RECORD value, weight: INTEGER END ;
    VAR obj: ARRAY n OF Object;
      limw, totv, maxv: INTEGER;
      s, opts: SET;

    PROCEDURE Try(i, tw, av: INTEGER);
      VAR av1: INTEGER;
    BEGIN (*try inclusion*)
      IF tw + obj[i].weight <= limw THEN
        s := s + {i};
        IF i < n-1 THEN Try(i+1, tw + obj[i].weight, av)
        ELSIF av > maxv THEN maxv := av; opts := s
        END ;
        s := s - {i}
      END ;
      (*try exclusion*)
      IF av > maxv + obj[i].value THEN
        IF i < n-1 THEN Try(i+1, tw, av - obj[i].value)
        ELSE maxv := av - obj[i].value; opts := s
        END
      END
    END Try;

    PROCEDURE Selection(n, WeightInc, WeightLimit: INTEGER);
      VAR i: INTEGER;
    BEGIN limw := 0;
      REPEAT
        limw := limw + WeightInc; maxv := 0;
        s := {}; opts := {}; Try(0, 0, totv)
      UNTIL limw >= WeightLimit
    END Selection.

    Weight   10   11   12   13   14   15   16   17   18   19
    Value    18   20   17   19   25   21   27   23   25   24

    Weight limit    Tot
         10          18
         20          27
         30          52
         40          70
         50          84
         60          99
         70         115
         80         130
         90         139
        100         157
        110         172
        120         183

Table 3.5 Sample Output from Optimal Selection Program.

This backtracking scheme with a limitation factor curtailing the growth of the potential search tree is also known as branch and bound algorithm.
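The selection scheme can be exercised on the weights and values of Table 3.5. The sketch below is a Python rendering (the text uses Oberon) with some assumptions of its own: the function name select_optimal is hypothetical, the optimality check is done in one place once all objects have been considered, and a Python set replaces the Oberon SET.

```python
def select_optimal(weights, values, limit):
    """Branch and bound search for the most valuable selection within `limit`.

    Returns (best value, frozenset of selected indices).
    tw is the total weight of the current selection, av the value still
    achievable, i.e. the total value minus the values of excluded objects.
    """
    n = len(weights)
    best_value = 0
    best_set = frozenset()
    current = set()

    def try_object(i, tw, av):
        nonlocal best_value, best_set
        if i == n:                              # all objects considered
            if av > best_value:
                best_value, best_set = av, frozenset(current)
            return
        if tw + weights[i] <= limit:            # inclusion is acceptable
            current.add(i)
            try_object(i + 1, tw + weights[i], av)
            current.discard(i)
        if av - values[i] > best_value:         # exclusion is acceptable
            try_object(i + 1, tw, av - values[i])

    try_object(0, 0, sum(values))
    return best_value, best_set


if __name__ == "__main__":
    W = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
    V = [18, 20, 17, 19, 25, 21, 27, 23, 25, 24]
    for limit in range(10, 130, 10):
        value, chosen = select_optimal(W, V, limit)
        print(limit, value, sorted(W[i] for i in chosen))
```

The exclusion test is what bounds the search: a branch is abandoned as soon as the value still achievable cannot exceed the best value recorded so far.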

Exercises

3.1 (Towers of Hanoi). Given are three rods and n disks of different sizes. The disks can be stacked up on the rods, thereby forming towers. Let the n disks initially be placed on rod A in the order of decreasing size, as shown in Fig. 3.10 for n = 3. The task is to move the n disks from rod A to rod C such that they are ordered in the original way. This has to be achieved under the constraints that

    1. In each step exactly one disk is moved from one rod to another rod.
    2. A disk may never be placed on top of a smaller disk.
    3. Rod B may be used as an auxiliary store.

Find an algorithm that performs this task. Note that a tower may conveniently be considered as consisting of the single disk at the top, and the tower consisting of the remaining disks. Describe the algorithm as a recursive program.

Fig. 3.10. The towers of Hanoi

3.2. Write a procedure that generates all n! permutations of n elements $a_1, \ldots, a_n$ in situ, i.e., without the aid of another array. Upon generating the next permutation, a parametric procedure Q is to be called which may, for instance, output the generated permutation. Hint: Consider the task of generating all permutations of the elements $a_1, \ldots, a_m$ as consisting of the m subtasks of generating all permutations of $a_1, \ldots, a_{m-1}$ followed by $a_m$, where in the i th subtask the two elements $a_i$ and $a_m$ had initially been interchanged.

3.3. Deduce the recursion scheme of Fig. 3.11 which is a superposition of the four curves W1, W2, W3, W4. The structure is similar to that of the Sierpinski curves (3.21) and (3.22). From the recursion pattern, derive a recursive program that draws these curves.

Fig. 3.11. Curves W1 – W4

3.4. Only 12 of the 92 solutions computed by the Eight Queens algorithm are essentially different. The other ones can be derived by reflections about axes or the center point. Devise a program that determines the 12 principal solutions. Note that, for example, the search in column 1 may be restricted to positions 1-4.

3.5 Change the Stable Marriage Program so that it determines the optimal solution (male or female). It therefore becomes a branch and bound program of the type represented by Program 3.7.

3.6 A certain railway company serves n stations $S_0, \ldots, S_{n-1}$. It intends to improve its customer information service by computerized information terminals. A customer types in his departure station $S_A$ and his destination $S_D$, and he is supposed to be (immediately) given the schedule of the train connections with minimum total time of the journey. Devise a program to compute the desired information. Assume that the timetable (which is your data bank) is provided in a suitable data structure containing departure (= arrival) times of all available trains. Naturally, not all stations are connected by direct lines (see also Exercise 1.6).

3.7 The Ackermann Function A is defined for all non-negative integer arguments m and n as follows:

    A(0, n) = n + 1
    A(m, 0) = A(m-1, 1)             (m > 0)
    A(m, n) = A(m-1, A(m, n-1))     (m, n > 0)

Design a program that computes A(m,n) without the use of recursion. As a guideline, use Program 2.11, the non-recursive version of Quicksort. Devise a set of rules for the transformation of recursive into iterative programs in general.

References

3-1. D.G. McVitie and L.B. Wilson. The Stable Marriage Problem. Comm. ACM, 14, No. 7 (1971), 486-92.
3-2. -------. Stable Marriage Assignment for Unequal Sets. Bit, 10, (1970), 295-309.
3-3. Space Filling Curves, or How to Waste Time on a Plotter. Software - Practice and Experience, 1, No. 4 (1971), 403-40.
3-4. N. Wirth. Program Development by Stepwise Refinement. Comm. ACM, 14, No. 4 (1971), 221-27.


4 Dynamic Information Structures

4.1. Recursive Data Types

In Chap. 1 the array, record, and set structures were introduced as fundamental data structures. They are called fundamental because they constitute the building blocks out of which more complex structures are formed, and because in practice they do occur most frequently. The purpose of defining a data type, and of thereafter specifying that certain variables be of that type, is that the range of values assumed by these variables, and therefore their storage pattern, is fixed once and for all. Hence, variables declared in this way are said to be static. However, there are many problems which involve far more complicated information structures. The characteristic of these problems is that not only the values but also the structures of variables change during the computation. They are therefore called dynamic structures. Naturally, the components of such structures are -- at some level of resolution -- static, i.e., of one of the fundamental data types. This chapter is devoted to the construction, analysis, and management of dynamic information structures.

It is noteworthy that there exist some close analogies between the methods used for structuring algorithms and those for structuring data. As with all analogies, there remain some differences, but a comparison of structuring methods for programs and data is nevertheless illuminating.

The elementary, unstructured statement is the assignment of an expression's value to a variable. Its corresponding member in the family of data structures is the scalar, unstructured type. These two are the atomic building blocks for composite statements and data types. The simplest structures, obtained through enumeration or sequencing, are the compound statement and the record structure. They both consist of a finite (usually small) number of explicitly enumerated components, which may themselves all be different from each other.
If all components are identical, they need not be written out individually: we use the for statement and the array structure to indicate replication by a known, finite factor. A choice among two or more elements is expressed by the conditional or the case statement and by extensions of record types, respectively. And finally, a repetition by an initially unknown (and potentially infinite) factor is expressed by the while and repeat statements. The corresponding data structure is the sequence (file), the simplest kind which allows the construction of types of infinite cardinality.

The question arises whether or not there exists a data structure that corresponds in a similar way to the procedure statement. Naturally, the most interesting and novel property of procedures in this respect is recursion. Values of such a recursive data type would contain one or more components belonging to the same type as itself, in analogy to a procedure containing one or more calls to itself. Like procedures, data type definitions might be directly or indirectly recursive.

A simple example of an object that would most appropriately be represented as a recursively defined type is the arithmetic expression found in programming languages. Recursion is used to reflect the possibility of nesting, i.e., of using parenthesized subexpressions as operands in expressions. Hence, let an expression here be defined informally as follows:

An expression consists of a term, followed by an operator, followed by a term. (The two terms constitute the operands of the operator.) A term is either a variable -- represented by an identifier -- or an expression enclosed in parentheses.

A data type whose values represent such expressions can easily be described by using the tools already available with the addition of recursion:

    TYPE expression = RECORD op: INTEGER; opd1, opd2: term END

    TYPE term = RECORD
        IF t: BOOLEAN THEN id: Name ELSE subex: expression END
      END

110 Hence, every variable of type term consists of two components, namely, the tagfield t and, if t is true, the field id, or of the field subex otherwise. Consider now, for example, the following four expressions: 1. x + y 2. x - (y * z) 3. (x + y) * (z - w) 4. (x/(y + z)) * w These expressions may be visualized by the patterns in Fig. 4.1, which exhibit their nested, recursive structure, and they determine the layout or mapping of these expressions onto a store. 1.


Fig. 4.1. Storage patterns for recursive record structures

A second example of a recursive information structure is the family pedigree: Let a pedigree be defined by (the name of) a person and the two pedigrees of the parents. This definition leads inevitably to an infinite structure. Real pedigrees are bounded because at some level of ancestry information is missing. Assume that this can be taken into account by again using a conditional structure:

    TYPE ped = RECORD IF known: BOOLEAN THEN name: Name; father, mother: ped END END

Note that every variable of type ped has at least one component, namely, the tagfield called known. If its value is TRUE, then there are three more fields; otherwise there is none. A particular value is shown here in the forms of a nested expression and of a diagram that may suggest a possible storage pattern (see Fig. 4.2).

    (T, Ted, (T, Fred, (T, Adam, (F), (F)), (F)), (T, Mary, (F), (T, Eva, (F), (F))))

The important role of the variant facility becomes clear; it is the only means by which a recursive data structure can be bounded, and it is therefore an inevitable companion of every recursive definition. The analogy between program and data structuring concepts is particularly pronounced in this case. A conditional (or selective) statement must necessarily be part of every recursive procedure in order that execution of the procedure can terminate. In practice, dynamic structures involve references or pointers to their elements, and the concept of an alternative (to terminate the recursion) is implied in the pointer, as shown in the next section.


Fig. 4.2. An example of a recursive data structure

4.2. Pointers

The characteristic property of recursive structures which clearly distinguishes them from the fundamental structures (arrays, records, sets) is their ability to vary in size. Hence, it is impossible to assign a fixed amount of storage to a recursively defined structure, and as a consequence a compiler cannot associate specific addresses to the components of such variables. The technique most commonly used to master this problem involves dynamic allocation of storage, i.e., allocation of store to individual components at the time when they come into existence during program execution, instead of at translation time. The compiler then allocates a fixed amount of storage to hold the address of the dynamically allocated component instead of the component itself. For instance, the pedigree illustrated in Fig. 4.2 would be represented by individual -- quite possibly noncontiguous -- records, one for each person. These persons are then linked by their addresses assigned to the respective father and mother fields. Graphically, this situation is best expressed by the use of arrows or pointers (Fig. 4.3).


Fig. 4.3. Data structure linked by pointers

It must be emphasized that the use of pointers to implement recursive structures is merely a technique. The programmer need not be aware of their existence. Storage may be allocated automatically the first time a new component is referenced. However, if the technique of using references or pointers is made explicit, more general data structures can be constructed than those definable by purely recursive data definition. In particular, it is then possible to define potentially infinite or circular (graph) structures and to dictate that certain structures are shared. It has therefore become common in advanced programming languages to make possible the explicit manipulation of references to data in addition to the data themselves. This implies that a clear notational distinction must exist between data and references to data and that consequently data types must be introduced whose values are pointers (references) to other data. The notation we use for this purpose is the following:

    TYPE T = POINTER TO T0

This type declaration expresses that values of type T are pointers to data of type T0. It is fundamentally important that the type of elements pointed to is evident from the declaration of T. We say that T is bound to T0. This binding distinguishes pointers in higher-level languages from addresses in assembly codes, and it is a most important facility to increase security in programming through redundancy of the underlying notation.

Values of pointer types are generated whenever a data item is dynamically allocated. We will adhere to the convention that such an occasion be explicitly mentioned at all times. This is in contrast to the situation in which the first time that an item is mentioned it is automatically allocated. For this purpose, we introduce a procedure NEW. Given a pointer variable p of type T, the statement NEW(p) effectively allocates a variable of type T0 and assigns the pointer referencing this new variable to p (see Fig. 4.4). The pointer value itself can now be referred to as p (i.e., as the value of the pointer variable p). In contrast, the variable which is referenced by p is denoted by p^. The referenced structures are typically records. If the referenced record has, for example, a field x, then it is denoted by p^.x. Because it is clear that only the referenced record p^ has fields, and not the pointer p itself, we allow the abbreviated notation p.x in place of p^.x.


Fig. 4.4. Dynamic allocation of variable p^

It was mentioned above that a variant component is essential in every recursive type to ensure finite instances. The example of the family pedigree is of a pattern that exhibits a most frequently occurring constellation, namely, the case in which one of the two cases features no further components. This is expressed by the following declaration schema:

    TYPE T = RECORD IF nonterminal: BOOLEAN THEN S(T) END END

S(T) denotes a sequence of field definitions which includes one or more fields of type T, thereby ensuring recursivity. All structures of a type patterned after this schema exhibit a tree (or list) structure similar to that shown in Fig. 4.3. Its peculiar property is that it contains pointers to data components with a tag field only, i.e., without further relevant information. The implementation technique using pointers suggests an easy way of saving storage space by letting the tag information be included in the pointer value itself. The common solution is to extend the range of values of all pointer types by a single value that is pointing to no element at all. We denote this value by the special symbol NIL, and we postulate that the value NIL can be assumed by all pointer typed variables. This extension of the range of pointer values explains why finite structures may be generated without the explicit presence of variants (conditions) in their (recursive) declaration.

The explicitly recursive data types declared above are reformulated using pointers as shown below. Note that the field known has vanished, since ~p.known is now expressed as p = NIL. The renaming of the type ped to person reflects the difference in the viewpoint brought about by the introduction of explicit pointer values.
Instead of first considering the given structure in its entirety and then investigating its substructure and its components, attention is focused on the components in the first place, and their interrelationship (represented by pointers) is not evident from any fixed declaration.

    TYPE term = POINTER TO TermDescriptor;
    TYPE exp = POINTER TO ExpDescriptor;
    TYPE ExpDescriptor = RECORD op: INTEGER; opd1, opd2: term END ;
    TYPE TermDescriptor = RECORD id: ARRAY 32 OF CHAR END

    TYPE Person = POINTER TO RECORD name: ARRAY 32 OF CHAR; father, mother: Person END

Note: The type Person points to records of an anonymous type (PersonDescriptor). The data structure representing the pedigree shown in Figs. 4.2 and 4.3 is again shown in Fig. 4.5 in which pointers to unknown persons are denoted by NIL. The resulting improvement in storage economy is obvious.
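The pointer formulation of the pedigree can be sketched in Python, where every object variable is already a reference and None can stand in for NIL. This is only an illustration under those assumptions: the class layout follows the declaration of Person above, while the function count_known and the variable names are introduced here and do not appear in the text.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    name: str
    father: Optional[Person] = None  # None stands for NIL (unknown person)
    mother: Optional[Person] = None


def count_known(p: Optional[Person]) -> int:
    """Count the persons reachable from p; the test for None ends the recursion."""
    if p is None:
        return 0
    return 1 + count_known(p.father) + count_known(p.mother)


# The pedigree of Figs. 4.2 and 4.3, with unknown parents left as None.
adam = Person("Adam")
eva = Person("Eva")
fred = Person("Fred", father=adam)
mary = Person("Mary", mother=eva)
ted = Person("Ted", father=fred, mother=mary)

print(count_known(ted))  # 5
```

No tag field is needed: the comparison with None in count_known takes over the role that the field known played in the variant record.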


Fig. 4.5. Data structure with NIL pointers

Again referring to Fig. 4.5, assume that Fred and Mary are siblings, i.e., have the same father and mother. This situation is easily expressed by replacing the two NIL values in the respective fields of the two records. An implementation that hides the concept of pointers or uses a different technique of storage handling would force the programmer to represent the ancestor records of Adam and Eva twice. Although in accessing their data for inspection it does not matter whether the two fathers (and the two mothers) are duplicated or represented by a single record, the difference is essential when selective updating is permitted. Treating pointers as explicit data items instead of as hidden implementation aids allows the programmer to express clearly where storage sharing is intended and where it is not.

A further consequence of the explicitness of pointers is that it is possible to define and manipulate cyclic data structures. This additional flexibility not only yields increased power but also requires increased care by the programmer, because the manipulation of cyclic data structures may easily lead to nonterminating processes. This phenomenon of power and flexibility being intimately coupled with the danger of misuse is well known in programming, and it particularly recalls the GOTO statement. Indeed, if the analogy between program structures and data structures is to be extended, the purely recursive data structure could well be placed at the level corresponding with the procedure, whereas the introduction of pointers is comparable to the use of GOTO statements. For, as the GOTO statement allows the construction of any kind of program pattern (including loops), so do pointers allow for the composition of any kind of data structure (including rings). The parallel development of corresponding program and data structures is shown in condensed form in Table 4.1.

    Construction Pattern        Program Statement            Data Type
    Atomic element              Assignment                   Scalar type
    Enumeration                 Compound statement           Record type
    Repetition (known factor)   For statement                Array type
    Choice                      Conditional statement        Type union (Variant record)
    Repetition                  While or repeat statement    Sequence type
    Recursion                   Procedure statement          Recursive data type
    General graph               GO TO statement              Structure linked by pointers

Table 4.1 Correspondences of Program and Data Structures.

In Chap. 3, we have seen that iteration is a special case of recursion, and that a call of a recursive procedure P defined according to the following schema:

    PROCEDURE P;
    BEGIN
      IF B THEN P0; P END
    END

where P0 is a statement not involving P, is equivalent to and replaceable by the iterative statement

    WHILE B DO P0 END

The analogies outlined in Table 4.1 reveal that a similar relationship holds between recursive data types and the sequence. In fact, a recursive type defined according to the schema

    TYPE T = RECORD IF b: BOOLEAN THEN t0: T0; t: T END END

where T0 is a type not involving T, is equivalent to and replaceable by a sequence of T0s.

The remainder of this chapter is devoted to the generation and manipulation of data structures whose components are linked by explicit pointers. Structures with specific simple patterns are emphasized in particular; recipes for handling more complex structures may be derived from those for manipulating basic formations. These are the linear list or chained sequence -- the simplest case -- and trees. Our preoccupation with these building blocks of data structuring does not imply that more involved structures do not occur in practice. In fact, the following story appeared in a Zürich newspaper in July 1922 and is a proof that irregularity may even occur in cases which usually serve as examples for regular structures, such as (family) trees. The story tells of a man who laments the misery of his life in the following words:

    I married a widow who had a grown-up daughter. My father, who visited us quite often, fell in love with my step-daughter and married her. Hence, my father became my son-in-law, and my step-daughter became my mother. Some months later, my wife gave birth to a son, who became the brother-in-law of my father as well as my uncle. The wife of my father, that is my step-daughter, also had a son. Thereby, I got a brother and at the same time a grandson. My wife is my grandmother, since she is my mother's mother. Hence, I am my wife's husband and at the same time her step-grandson; in other words, I am my own grandfather.

4.3. Linear Lists

4.3.1. Basic Operations

The simplest way to interrelate or link a set of elements is to line them up in a single list or queue. For, in this case, only a single link is needed for each element to refer to its successor. Assume that types Node and NodeDesc are defined as shown below. Every variable of type NodeDesc consists of three components, namely, an identifying key, the pointer to its successor, and possibly further associated information. For our further discussion, only key and next will be relevant.

    TYPE Node = POINTER TO NodeDesc;
    TYPE NodeDesc = RECORD key: INTEGER; next: Node; data: ... END ;
    VAR p, q: Node (*pointer variables*)

A list of nodes, with a pointer to its first component being assigned to a variable p, is illustrated in Fig. 4.6. Probably the simplest operation to be performed with a list as shown in Fig. 4.6 is the insertion of an element at its head. First, an element of type NodeDesc is allocated, its reference (pointer) being assigned to an auxiliary pointer variable, say q. Thereafter, a simple reassignment of pointers completes the operation. Note that the order of these three statements is essential.

    NEW(q); q.next := p; p := q

    p -> 1 -> 2 -> 3 -> 4 -> NIL

Fig. 4.6. Example of a linked list

The operation of inserting an element at the head of a list immediately suggests how such a list can be generated: starting with the empty list, a heading element is added repeatedly. The process of list generation is expressed by the following piece of program; here the number of elements to be linked is n.

    p := NIL; (*start with empty list*)
    WHILE n > 0 DO
      NEW(q); q.next := p; p := q;
      q.key := n; DEC(n)
    END

This is the simplest way of forming a list. However, the resulting order of elements is the inverse of the order of their insertion. In some applications this is undesirable, and consequently, new elements must be appended at the end instead of the head of the list. Although the end can easily be determined by a scan of the list, this naive approach involves an effort that may as well be saved by using a second pointer, say q, always designating the last element. This method is, for example, applied in Program 4.4, which generates cross-references to a given text. Its disadvantage is that the first element inserted has to be treated differently from all later ones.

The explicit availability of pointers makes certain operations very simple which are otherwise cumbersome; among the elementary list operations are those of inserting and deleting elements (selective updating of a list), and, of course, the traversal of a list. We first investigate list insertion.

Assume that an element designated by a pointer (variable) q is to be inserted in a list after the element designated by the pointer p. The necessary pointer assignments are expressed as follows, and their effect is visualized by Fig. 4.7.

    q.next := p.next; p.next := q


Fig. 4.7. Insertion after p^

If insertion before instead of after the designated element p^ is desired, the unidirectional link chain seems to cause a problem, because it does not provide any kind of path to an element's predecessors. However, a simple trick solves our dilemma. It is illustrated in Fig. 4.8. Assume that the key of the new element is k = 8.

    NEW(q); q^ := p^; p.key := k; p.next := q
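The three operations introduced so far (generation by repeated insertion at the head, insertion after p^, and insertion before p^ by copying) can be tried out with the following Python sketch. It is a transcription made for illustration, not part of the original program text: None stands for NIL, and the function names generate, insert_after, insert_before and keys are assumptions of this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    key: int
    next: Optional[Node] = None  # None stands for NIL


def generate(n: int) -> Optional[Node]:
    """Build the list 1, 2, ..., n by repeated insertion at the head."""
    p = None  # start with the empty list
    while n > 0:
        q = Node(n)  # NEW(q); q.key := n
        q.next = p   # q.next := p
        p = q        # p := q
        n -= 1
    return p


def insert_after(p: Node, q: Node) -> None:
    """Insert the node q after the node p."""
    q.next = p.next
    p.next = q


def insert_before(p: Node, k: int) -> None:
    """Insert key k before p: copy p into a new successor, then overwrite p."""
    q = Node(p.key, p.next)  # NEW(q); q^ := p^
    p.key = k
    p.next = q


def keys(p: Optional[Node]) -> List[int]:
    """Traverse the list and collect its keys in order."""
    result = []
    while p is not None:
        result.append(p.key)
        p = p.next
    return result


head = generate(4)
insert_before(head.next.next, 8)  # put 8 in front of the node with key 3
insert_after(head, Node(9))       # put 9 behind the first node
print(keys(head))                 # [1, 9, 2, 8, 3, 4]
```

Note that insert_before leaves the node object p in place and changes its contents, so any other reference to that node now sees the new key.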


Fig. 4.8. Insertion before p^

The trick evidently consists of actually inserting a new component after p^ and thereafter interchanging the values of the new element and p^.

Next, we consider the process of list deletion. Deleting the successor of p^ is straightforward. This is shown here in combination with the reinsertion of the deleted element at the head of another list (designated by q). Figure 4.9 illustrates the situation and shows that it constitutes a cyclic exchange of three pointers.

    r := p.next; p.next := r.next; r.next := q; q := r
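The cyclic exchange of three pointers can be followed step by step in a small Python sketch. It assumes that p^ has a successor; the helper names from_keys, keys and move_successor are introduced for this example only. Because Python passes references by value, the new head of the second list is returned instead of being assigned to q inside the function.

```python
class Node:
    def __init__(self, key, next=None):
        self.key = key
        self.next = next  # None stands for NIL


def from_keys(ks):
    """Build a list holding the given keys in the given order."""
    head = None
    for k in reversed(ks):
        head = Node(k, head)
    return head


def keys(p):
    """Collect the keys of the list starting at p."""
    result = []
    while p is not None:
        result.append(p.key)
        p = p.next
    return result


def move_successor(p, q):
    """Detach the successor of p and insert it at the head of list q.

    Returns the new head of the second list. Assumes p.next is not None.
    """
    r = p.next         # r := p.next
    p.next = r.next    # p.next := r.next
    r.next = q         # r.next := q
    return r           # q := r


a = from_keys([1, 2, 3])
b = from_keys([7, 8])
b = move_successor(a, b)
print(keys(a), keys(b))  # [1, 3] [2, 7, 8]
```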


Fig. 4.9. Deletion and re-insertion

The removal of a designated element itself (instead of its successor) is more difficult, because we encounter the same problem as with insertion: tracing backward to the denoted element's predecessor is impossible. But deleting the successor after moving its value forward is a relatively obvious and simple solution. It can be applied whenever p^ has a successor, i.e., is not the last element on the list. However, it must be assured that there exist no other variables pointing to the now deleted element.

We now turn to the fundamental operation of list traversal. Let us assume that an operation P(x) has to be performed for every element of the list whose first element is p^. This task is expressible as follows:

    WHILE list designated by p is not empty DO
      perform operation P;
      proceed to the successor
    END

In detail, this operation is described by the following statement:

    WHILE p # NIL DO
      P(p); p := p.next
    END

It follows from the definitions of the while statement and of the linking structure that P is applied to all elements of the list and to no other ones.

A very frequent operation performed is list searching for an element with a given key x. Unlike for arrays, the search must here be purely sequential. The search terminates either if an element is found or if the end of the list is reached. This is reflected by a logical conjunction consisting of two terms. Again, we assume that the head of the list is designated by a pointer p.

    WHILE (p # NIL) & (p.key # x) DO p := p.next END

p = NIL implies that p^ does not exist, and hence that the expression p.key # x is undefined. The order of the two terms is therefore essential.

4.3.2. Ordered Lists and Reorganizing Lists

The given linear list search strongly resembles the search routines for scanning an array or a sequence. In fact, a sequence is precisely a linear list for which the technique of linkage to the successor is left unspecified or implicit. Since the primitive sequence operators do not allow insertion of new elements (except at the end) or deletion (except removal of all elements), the choice of representation is left wide open to the implementor, and he may well use sequential allocation, leaving successive components in contiguous storage areas. Linear lists with explicit pointers provide more flexibility, and therefore they should be used whenever this additional flexibility is needed.

To exemplify, we will now consider a problem that will occur throughout this chapter in order to illustrate alternative solutions and techniques. It is the problem of reading a text, collecting all its words, and counting the frequency of their occurrence. It is called the construction of a concordance or the generation of a cross-reference list.

An obvious solution is to construct a list of words found in the text. The list is scanned for each word.
If the word is found, its frequency count is incremented; otherwise the word is added to the list. We shall simply call this process search, although it may actually also include an insertion. In order to be able to concentrate our attention on the essential part of list handling, we assume that the words have already been extracted from the text under investigation, have been encoded as integers, and are available in the form of an input sequence.

The formulation of the procedure called search follows in a straightforward manner. The variable root refers to the head of the list, at which new words are inserted. The complete algorithm is listed below; it includes a routine for tabulating the constructed cross-reference list. The tabulation process is an example in which an action is executed once for each element of the list.

    TYPE Word = POINTER TO
      RECORD key, count: INTEGER; next: Word END ;

    PROCEDURE search(x: INTEGER; VAR root: Word);
      VAR w: Word;
    BEGIN w := root;
      WHILE (w # NIL) & (w.key # x) DO w := w.next END ;
      (* (w = NIL) OR (w.key = x) *)
      IF w = NIL THEN (*new entry*)
        w := root; NEW(root);
        root.key := x; root.count := 1; root.next := w
      ELSE INC(w.count)
      END
    END search;

    PROCEDURE PrintList(w: Word);
    BEGIN (*uses global writer W *)
      WHILE w # NIL DO
        Texts.WriteInt(W, w.key, 8); Texts.WriteInt(W, w.count, 8);
        Texts.WriteLn(W); w := w.next
      END
    END PrintList;

The linear scan algorithm resembles the search procedure for arrays, and reminds us of a simple technique used to simplify the loop termination condition: the use of a sentinel. A sentinel may as well be used in list search; it is represented by a dummy element at the end of the list. The new procedure is listed below. We must assume that a global variable sentinel is added and that the initialization of root := NIL is replaced by the statements

    NEW(sentinel); root := sentinel

which generate the element to be used as sentinel.

    PROCEDURE search(x: INTEGER; VAR root: Word);
      VAR w: Word;
    BEGIN w := root; sentinel.key := x;
      WHILE w.key # x DO w := w.next END ;
      IF w = sentinel THEN (*new entry*)
        w := root; NEW(root);
        root.key := x; root.count := 1; root.next := w
      ELSE INC(w.count)
      END
    END search

Obviously, the power and flexibility of the linked list are ill used in this example, and the linear scan of the entire list can only be accepted in cases in which the number of elements is limited. An easy improvement, however, is readily at hand: the ordered list search. If the list is ordered (say by increasing keys), then the search may be terminated at the latest upon encountering the first key that is larger than the new one. Ordering of the list is achieved by inserting new elements at the appropriate place instead of at the head. In effect, ordering is practically obtained free of charge. This is because of the ease by which insertion in a linked list is achieved, i.e., by making full use of its flexibility. It is a possibility not provided by the array and sequence structures. (Note, however, that even in ordered lists no equivalent to the binary search of arrays is available).
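The sentinel variant of search given above can be exercised with the following Python sketch. It is an illustrative transcription under stated assumptions: the list and its sentinel are wrapped in a class named Concordance, and the method table (standing in for PrintList) returns the pairs instead of writing them; both names are inventions of this example.

```python
class Word:
    def __init__(self, key=0, count=0, next=None):
        self.key = key
        self.count = count
        self.next = next


class Concordance:
    def __init__(self):
        self.sentinel = Word()     # NEW(sentinel)
        self.root = self.sentinel  # root := sentinel

    def search(self, x):
        """Count one more occurrence of x, adding x at the head if it is new."""
        self.sentinel.key = x      # guarantees that the scan terminates
        w = self.root
        while w.key != x:
            w = w.next
        if w is self.sentinel:     # new entry
            self.root = Word(x, 1, self.root)
        else:
            w.count += 1

    def table(self):
        """Return (key, count) pairs in list order, excluding the sentinel."""
        result = []
        w = self.root
        while w is not self.sentinel:
            result.append((w.key, w.count))
            w = w.next
        return result


c = Concordance()
for word in [3, 1, 3, 2, 1, 3]:
    c.search(word)
print(c.table())  # [(2, 1), (1, 2), (3, 3)]
```

The loop condition consists of a single comparison; the price is that the result of the scan must afterwards be tested against the sentinel to tell a hit from a miss.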


Fig. 4.10. Insertion in ordered list

Ordered list search is a typical example of the situation in which an element must be inserted ahead of a given item, here in front of the first one whose key is too large. The technique shown here, however, differs from the one shown earlier. Instead of copying values, two pointers are carried along in the list traversal; w2 lags one step behind w1 and thus identifies the proper insertion place when w1 has found too large a key. The general insertion step is shown in Fig. 4.10. The pointer to the new element (w3) is to be assigned to w2^.next, except when the list is still empty. For reasons of simplicity and effectiveness, we prefer not to make this distinction by means of a conditional statement. The only way to avoid it is to introduce a dummy element at the list head. The initializing statement root := NIL is accordingly replaced by

    NEW(root); root.next := NIL

Referring to Fig. 4.10, we determine the condition under which the scan continues to proceed to the next element; it consists of two factors, namely,

    (w1 # NIL) & (w1.key < x)

The resulting search procedure is:

    PROCEDURE search(x: INTEGER; VAR root: Word);
      VAR w1, w2, w3: Word;
    BEGIN (*w2 # NIL*)
      w2 := root; w1 := w2.next;
      WHILE (w1 # NIL) & (w1.key < x) DO
        w2 := w1; w1 := w2.next
      END ;
      (* (w1 = NIL) OR (w1.key >= x) *)
      IF (w1 = NIL) OR (w1.key > x) THEN (*new entry*)
        NEW(w3); w2.next := w3;
        w3.key := x; w3.count := 1; w3.next := w1
      ELSE INC(w1.count)
      END
    END search

In order to speed up the search, the continuation condition of the while statement can once again be simplified by using a sentinel. This requires the initial presence of a dummy header as well as a sentinel at the tail.

It is now high time to ask what gain can be expected from ordered list search. Remembering that the additional complexity incurred is small, one should not expect an overwhelming improvement.

Assume that all words in the text occur with equal frequency. In this case the gain through lexicographical ordering is indeed also nil, once all words are listed, because the position of a word does not matter if only the total of all access steps is significant and if all words have the same frequency of occurrence. However, a gain is obtained whenever a new word is to be inserted. Instead of first scanning the entire list, on the average only half the list is scanned. Hence, ordered list insertion pays off only if a concordance is to be generated with many distinct words compared to their frequency of occurrence. The preceding examples are therefore suitable primarily as programming exercises rather than for practical applications.
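The ordered search with a dummy header and two chasing pointers can be tried out with the Python sketch below. It transcribes the procedure above under a few assumptions: None stands for NIL, the header is kept inside a class named OrderedConcordance, and the method table is added only to inspect the result.

```python
class Word:
    def __init__(self, key=0, count=0, next=None):
        self.key = key
        self.count = count
        self.next = next


class OrderedConcordance:
    def __init__(self):
        self.root = Word()  # dummy header; NEW(root); root.next := NIL

    def search(self, x):
        """Count one occurrence of x, keeping the list ordered by key."""
        w2 = self.root      # w2 lags one step behind w1 and is never None
        w1 = w2.next
        while w1 is not None and w1.key < x:
            w2 = w1
            w1 = w2.next
        # here: w1 is None, or w1.key >= x
        if w1 is None or w1.key > x:      # new entry between w2 and w1
            w2.next = Word(x, 1, w1)
        else:
            w1.count += 1

    def table(self):
        """Return (key, count) pairs in list order, skipping the header."""
        result = []
        w = self.root.next
        while w is not None:
            result.append((w.key, w.count))
            w = w.next
        return result


c = OrderedConcordance()
for word in [5, 1, 12, 7, 5]:
    c.search(word)
print(c.table())  # [(1, 1), (5, 2), (7, 1), (12, 1)]
```

Thanks to the dummy header, insertion into the empty list, at the front, in the middle and at the end are all handled by the same assignment to w2.next.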
The arrangement of data in a linked list is recommended when the number of elements is relatively small (< 50), varies, and, moreover, when no information is given about their frequencies of access. A typical example is the symbol table in compilers of programming languages. Each declaration causes the addition of a new symbol, and upon exit from its scope of validity, it is deleted from the list. The use of simple linked lists is appropriate for applications with relatively short programs. Even in this case a considerable improvement in access method can be achieved by a very simple technique which is mentioned here again primarily because it constitutes a pretty example for demonstrating the flexibilities of the linked list structure.

A characteristic property of programs is that occurrences of the same identifier are very often clustered, that is, one occurrence is often followed by one or more reoccurrences of the same word. This information is an invitation to reorganize the list after each access by moving the word that was found to the top of the list, thereby minimizing the length of the search path the next time it is sought. This method of access is called list search with reordering, or -- somewhat pompously -- self-organizing list search. In presenting the corresponding algorithm in the form of a procedure, we take advantage of the experience gained so far and introduce a sentinel right from the start. In fact, a sentinel not only speeds up the search, but in this case it also simplifies the program. The list must initially not be empty, but contains the sentinel element already. The initialization statements are

    NEW(sentinel); root := sentinel

Note that the main difference between the new algorithm and the straight list search is the action of reordering when an element has been found. It is then detached or deleted from its old position and inserted at the top. This deletion again requires the use of two chasing pointers, such that the predecessor w2 of an identified element w1 is still locatable. This, in turn, calls for the special treatment of the first element (i.e., the empty list). To conceive the linking process, we refer to Fig. 4.11. It shows the two pointers when w1 was identified as the desired element. The configuration after correct reordering is represented in Fig. 4.12, and the complete new search procedure is listed below.

Fig. 4.11. List before re-ordering

Fig. 4.12. List after re-ordering

    PROCEDURE search(x: INTEGER; VAR root: Word);
      VAR w1, w2: Word;
    BEGIN w1 := root; sentinel.key := x;
      IF w1 = sentinel THEN (*first element*)
        NEW(root); root.key := x; root.count := 1; root.next := sentinel
      ELSIF w1.key = x THEN INC(w1.count)
      ELSE (*search*)
        REPEAT w2 := w1; w1 := w2.next
        UNTIL w1.key = x;
        IF w1 = sentinel THEN (*new entry*)
          w2 := root; NEW(root);
          root.key := x; root.count := 1; root.next := w2
        ELSE (*found, now reorder*)
          INC(w1^.count);
          w2.next := w1.next; w1.next := root; root := w1
        END
      END
    END search

The improvement in this search method strongly depends on the degree of clustering in the input data. For a given factor of clustering, the improvement will be more pronounced for large lists. To provide an idea of how much gain can be expected, an empirical measurement was made by applying the above cross-reference program to a short and a relatively long text and then comparing the methods of linear list ordering and of list reorganization. The measured data are condensed into Table 4.2. Unfortunately, the improvement is greatest when a different data organization is needed anyway. We will return to this example in Sect. 4.4.
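The effect of reordering is easy to observe with a Python transcription of the procedure above. This is a sketch under stated assumptions: the list and its sentinel live in a class named SelfOrganizingList, and the method table is added here to display the list order; neither name appears in the text.

```python
class Word:
    def __init__(self, key=0, count=0, next=None):
        self.key = key
        self.count = count
        self.next = next


class SelfOrganizingList:
    def __init__(self):
        self.sentinel = Word()     # NEW(sentinel)
        self.root = self.sentinel  # root := sentinel

    def search(self, x):
        """Count one occurrence of x and move its element to the top."""
        self.sentinel.key = x
        w1 = self.root
        if w1 is self.sentinel:                # empty list: first element
            self.root = Word(x, 1, self.sentinel)
        elif w1.key == x:                      # already at the top
            w1.count += 1
        else:                                  # search with chasing pointers
            while True:
                w2 = w1
                w1 = w2.next
                if w1.key == x:
                    break
            if w1 is self.sentinel:            # new entry at the top
                self.root = Word(x, 1, self.root)
            else:                              # found, now reorder
                w1.count += 1
                w2.next = w1.next              # detach w1 from its old place
                w1.next = self.root            # and insert it at the top
                self.root = w1

    def table(self):
        """Return (key, count) pairs in list order, excluding the sentinel."""
        result = []
        w = self.root
        while w is not self.sentinel:
            result.append((w.key, w.count))
            w = w.next
        return result


s = SelfOrganizingList()
for word in [1, 2, 3, 1, 1, 2]:
    s.search(word)
print(s.table())  # [(2, 2), (1, 3), (3, 1)]
```

In the sample run the second consecutive access to key 1 finds it at the top of the list after a single comparison, which is the clustering effect the method relies on.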

                                        Test 1      Test 2
    Number of distinct keys                 53         582
    Number of occurrences of keys          315       14341
    Time for search with ordering         6207     3200622
    Time for search with reordering       4529      681584
    Improvement factor                    1.37        4.70

Table 4.2 Comparison of List Search Methods.

4.3.3. An Application: Partial Ordering (Topological Sorting)

An appropriate example of the use of a flexible, dynamic data structure is the process of topological sorting. This is a sorting process of items over which a partial ordering is defined, i.e., where an ordering is given over some pairs of items but not between all of them. The following are examples of partial orderings:

1. In a dictionary or glossary, words are defined in terms of other words. If a word v is defined in terms of a word w, we denote this by v 〈 w. Topological sorting of the words in a dictionary means arranging them in an order such that there will be no forward references.

2. A task (e.g., an engineering project) is broken up into subtasks. Completion of certain subtasks must usually precede the execution of other subtasks. If a subtask v must precede a subtask w, we write v 〈 w. Topological sorting means their arrangement in an order such that upon initiation of each subtask all its prerequisite subtasks have been completed.

3. In a university curriculum, certain courses must be taken before others since they rely on the material presented in their prerequisites. If a course v is a prerequisite for course w, we write v 〈 w. Topological sorting means arranging the courses in such an order that no course lists a later course as prerequisite.

4. In a program, some procedures may contain calls of other procedures. If a procedure v is called by a procedure w, we write v 〈 w. Topological sorting implies the arrangement of procedure declarations in such a way that there are no forward references.

In general, a partial ordering of a set S is a relation between the elements of S. It is denoted by the symbol “〈”, verbalized by precedes, and satisfies the following three properties (axioms) for any distinct elements x, y, z of S:

1. if x 〈 y and y 〈 z, then x 〈 z (transitivity)
2. if x 〈 y, then not y 〈 x (asymmetry)
3. not z 〈 z (irreflexivity)

For evident reasons, we will assume that the sets S to be topologically sorted by an algorithm are finite. Hence, a partial ordering can be illustrated by drawing a diagram or graph in which the vertices denote the elements of S and the directed edges represent ordering relationships. An example is shown in Fig. 4.13.


Fig. 4.13. Partially ordered set

The problem of topological sorting is to embed the partial order in a linear order. Graphically, this implies the arrangement of the vertices of the graph in a row, such that all arrows point to the right, as shown in Fig. 4.14. Properties (1) and (2) of partial orderings ensure that the graph contains no loops. This is exactly the prerequisite condition under which such an embedding in a linear order is possible.

7 9 1 2 4 6 3 5 8 10

Fig. 4.14. Linear arrangement of the partially ordered set of Fig. 4.13.

How do we proceed to find one of the possible linear orderings? The recipe is quite simple. We start by choosing any item that is not preceded by another item (there must be at least one; otherwise a loop would exist). This object is placed at the head of the resulting list and removed from the set S. The remaining set is still partially ordered, and so the same algorithm can be applied again until the set is empty.

In order to describe this algorithm more rigorously, we must settle on a data structure and representation of S and its ordering. The choice of this representation is determined by the operations to be performed, particularly the operation of selecting elements with zero predecessors. Every item should therefore be represented by three characteristics: its identification key, its set of successors, and a count of its predecessors. Since the number n of elements in S is not given a priori, the set is conveniently organized as a linked list. Consequently, an additional entry in the description of each item contains the link to the next item in the list. We will assume that the keys are integers (but not necessarily the consecutive integers from 1 to n). Analogously, the set of each item's successors is conveniently represented as a linked list. Each element of the successor list is described by an identification and a link to the next item on this list. If we call the descriptors of the main list, in which each item of S occurs exactly once, leaders, and the descriptors of elements on the successor chains trailers, we obtain the following declarations of data types:

TYPE Leader = POINTER TO LeaderDesc;
  Trailer = POINTER TO TrailerDesc;
  LeaderDesc = RECORD key, count: INTEGER;
    trail: Trailer; next: Leader
  END;
  TrailerDesc = RECORD id: Leader; next: Trailer
  END

Assume that the set S and its ordering relations are initially represented as a sequence of pairs of keys in the input file. The input data for the example in Fig. 4.13 are shown below, in which the symbols 〈 are added for the sake of clarity, symbolizing partial order:

1 〈 2    2 〈 4    4 〈 6    2 〈 10    4 〈 8    6 〈 3    1 〈 3
3 〈 5    5 〈 8    7 〈 5    7 〈 9     9 〈 4    9 〈 10
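The recipe described above (repeatedly output an item without predecessors and remove it) can be sketched in Python. This is only an illustration, not the Oberon program of the text: dictionaries stand in for the linked lists of leaders and trailers, and the names `PAIRS`, `topological_sort` and `find` are chosen here for the example.

```python
from collections import deque

# Ordered pairs (x, y) meaning "x precedes y"; the input data for Fig. 4.13.
PAIRS = [(1, 2), (2, 4), (4, 6), (2, 10), (4, 8), (6, 3), (1, 3),
         (3, 5), (5, 8), (7, 5), (7, 9), (9, 4), (9, 10)]


def topological_sort(pairs):
    """Return the keys in an order compatible with all given pairs.

    Raises ValueError if the relation contains a loop.
    """
    count = {}       # key -> number of predecessors not yet output
    successors = {}  # key -> list of successor keys (the "trailers")

    def find(key):
        # Register a key on first sight (list search with insertion).
        if key not in count:
            count[key] = 0
            successors[key] = []

    # Input phase: build predecessor counts and successor lists.
    for x, y in pairs:
        find(x)
        find(y)
        successors[x].append(y)
        count[y] += 1

    # Output phase: repeatedly emit an item with zero predecessors.
    ready = deque(k for k in count if count[k] == 0)
    order = []
    while ready:
        k = ready.popleft()
        order.append(k)
        for s in successors[k]:
            count[s] -= 1
            if count[s] == 0:
                ready.append(s)

    if len(order) != len(count):
        raise ValueError("the relation is not a partial ordering (loop found)")
    return order


if __name__ == "__main__":
    print(topological_sort(PAIRS))
```

Several linear orders are compatible with the same partial order, so the sequence printed here need not coincide with the one drawn in Fig. 4.14; which one is produced depends on the order in which items without predecessors are picked.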

The first part of the topological sort program must read the input and transform the data into a list structure. This is performed by successively reading a pair of keys x and y (x 〈 y). Let us denote the pointers to their representations on the linked list of leaders by p and q. These records must be located by a list search and, if not yet present, be inserted in the list. This task is performed by a function procedure called find. Subsequently, a new entry is added in the list of trailers of x, along with an identification of y; the count of predecessors of y is incremented by 1. This algorithm is called the input phase. Figure 4.15 illustrates the data structure generated while processing the given input data. The function find(w) yields the pointer to the list element with key w.

In the following piece of program we make use of text scanning, a feature of the Oberon system’s text concept. Instead of considering a text (file) as a sequence of characters, a text is considered as a sequence of tokens, which are identifiers, numbers, strings, and special characters (such as +, *, ...

... a THEN search(w.left, a)

    ELSE (*old entry*)
      NEW(q); q.lno := line; w.last.next := q; w.last := q
    END
  END search;

  PROCEDURE Tabulate(w: Node);
    VAR m: INTEGER; item: Item;
  BEGIN
    IF w # NIL THEN
      Tabulate(w.left);
      Texts.WriteString(W, w.key); item := w.first; m := 0;
      REPEAT
        IF m = 10 THEN Texts.WriteLn(W); Texts.Write(W, TAB); m := 0; END ;
        INC(m); Texts.WriteInt(W, item.lno, 6); item := item.next
      UNTIL item = NIL;
      Texts.WriteLn(W); Tabulate(w.right)
    END
  END Tabulate;

  PROCEDURE CrossRef(VAR R: Texts.Reader);
    VAR root: Node; (*uses global writer W*)
      i: INTEGER; ch: CHAR; w: Word;
  BEGIN
    root := NIL; line := 0;
    Texts.WriteInt(W, 0, 6); Texts.Write(W, TAB); Texts.Read(R, ch);
    WHILE ~R.eot DO
      IF ch = 0DX THEN (*line end*)
        Texts.WriteLn(W); INC(line);
        Texts.WriteInt(W, line, 6); Texts.Write(W, 9X); Texts.Read(R, ch)
      ELSIF ("A" ...

... 1, we will provide the root with two subtrees which again have a minimal number of nodes. Hence, the subtrees are also T's. Evidently, one subtree must have height h-1, and the other is then allowed to have a height of one less, i.e. h-2. Figure 4.30 shows the trees with height 2, 3, and 4. Since their composition principle very strongly resembles that of Fibonacci numbers, they are called Fibonacci-trees (see Fig. 4.30). They are defined as follows:

1. The empty tree is the Fibonacci-tree of height 0.
2. A single node is the Fibonacci-tree of height 1.
3. If $T_{h-1}$ and $T_{h-2}$ are Fibonacci-trees of heights h-1 and h-2, then $T_h = \langle T_{h-1}, x, T_{h-2} \rangle$ is a Fibonacci-tree.
4. No other trees are Fibonacci-trees.
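The definition above can be tried out with a short Python sketch. It builds the Fibonacci-tree of a given height from rules 1 to 3 (a root with subtrees of heights h-1 and h-2), numbers the nodes in order, and counts them. The names `Node`, `fibonacci_tree`, `size`, `height` and `is_balanced` are chosen for this illustration only.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def fibonacci_tree(h, counter=None):
    """Build the Fibonacci-tree of height h; keys 1, 2, ... are assigned in order."""
    if counter is None:
        counter = [0]                 # mutable key counter shared by the recursion
    if h == 0:
        return None                   # rule 1: the empty tree
    # Rule 2 (h = 1) is a root without subtrees; rule 3 applies for h > 1.
    left = fibonacci_tree(h - 1, counter) if h > 1 else None
    counter[0] += 1
    key = counter[0]
    right = fibonacci_tree(h - 2, counter) if h > 1 else None
    return Node(key, left, right)


def size(t):
    return 0 if t is None else size(t.left) + 1 + size(t.right)


def height(t):
    return 0 if t is None else 1 + max(height(t.left), height(t.right))


def is_balanced(t):
    """True if at every node the subtree heights differ by at most 1."""
    if t is None:
        return True
    return (abs(height(t.left) - height(t.right)) <= 1
            and is_balanced(t.left) and is_balanced(t.right))


if __name__ == "__main__":
    for h in range(6):
        t = fibonacci_tree(h)
        print(h, size(t), is_balanced(t))
```

Because each tree consists of a root plus the two next smaller trees, the node counts grow like the Fibonacci numbers, which is why so few nodes suffice to reach a given height.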

Fig. 4.30. Fibonacci-trees of height 2, 3, and 4

The number of nodes of $T_h$ is defined by the following simple recurrence relation:

\[ N_0 = 0, \quad N_1 = 1, \quad N_h = N_{h-1} + 1 + N_{h-2} \]

The $N_i$ are those numbers of nodes for which the worst case (upper limit of h) can be attained, and they are called Leonardo numbers.

4.5.1. Balanced Tree Insertion

Let us now consider what may happen when a new node is inserted in a balanced tree. Given a root r with the left and right subtrees L and R, three cases must be distinguished. Assume that the new node is inserted in L causing its height to increase by 1:

1. $h_L = h_R$: L and R become of unequal height, but the balance criterion is not violated.
2. $h_L < h_R$: L and R obtain equal height, i.e., the balance has even been improved.
3. $h_L > h_R$: the balance criterion is violated, and the tree must be restructured.

Consider the tree in Fig. 4.31. Nodes with keys 9 and 11 may be inserted without rebalancing; the tree with root 10 will become one-sided (case 1); the one with root 8 will improve its balance (case 2). Insertion of nodes 1, 3, 5, or 7, however, requires subsequent rebalancing.

Fig. 4.31. Balanced tree

Some careful scrutiny of the situation reveals that there are only two essentially different constellations needing individual treatment. The remaining ones can be derived by symmetry considerations from those two. Case 1 is characterized by inserting keys 1 or 3 in the tree of Fig. 4.31, case 2 by inserting nodes 5 or 7. The two cases are generalized in Fig. 4.32 in which rectangular boxes denote subtrees, and the height added by the insertion is indicated by crosses. Simple transformations of the two structures restore the desired balance. Their result is shown in Fig. 4.33; note that the only movements allowed are those occurring in the vertical direction, whereas the relative horizontal positions of the shown nodes and subtrees must remain unchanged.

Fig. 4.32. Imbalance resulting from insertion (case 1 and case 2)

Fig. 4.33. Restoring the balance (case 1 and case 2)

An algorithm for insertion and rebalancing critically depends on the way information about the tree's balance is stored. An extreme solution lies in keeping balance information entirely implicit in the tree structure itself. In this case, however, a node's balance factor must be rediscovered each time it is affected by an insertion, resulting in an excessively high overhead. The other extreme is to attribute an explicitly stored balance factor to every node. The definition of the type Node is then extended into

TYPE Node = POINTER TO RECORD
    key, count, bal: INTEGER; (*bal = -1, 0, +1*)
    left, right: Node
  END

We shall subsequently interpret a node's balance factor as the height of its right subtree minus the height of its left subtree, and we shall base the resulting algorithm on this node type. The process of node insertion consists essentially of the following three consecutive parts:

1. Follow the search path until it is verified that the key is not already in the tree.
2. Insert the new node and determine the resulting balance factor.
3. Retreat along the search path and check the balance factor at each node. Rebalance if necessary.

Although this method involves some redundant checking (once balance is established, it need not be checked on that node's ancestors), we shall first adhere to this evidently correct schema because it can be implemented through a pure extension of the already established search and insertion procedures. This procedure describes the search operation needed at each single node, and because of its recursive formulation it can easily accommodate an additional operation on the way back along the search path. At each step, information must be passed as to whether or not the height of the subtree (in which the insertion had been performed) had increased. We therefore extend the procedure's parameter list by the Boolean h with the meaning “the subtree height has increased”. Clearly, h must denote a variable parameter since it is used to transmit a result.

Assume now that the process is returning to a node p^ from the left branch (see Fig. 4.32), with the indication that it has increased its height. We now must distinguish between the three conditions involving the subtree heights prior to insertion:

1. $h_L < h_R$, p.bal = +1: the previous imbalance at p has been equilibrated.
2. $h_L = h_R$, p.bal = 0: the weight is now slanted to the left.
3. $h_L > h_R$, p.bal = -1: rebalancing is necessary.

In the third case, inspection of the balance factor of the root of the left subtree (say, p1.bal) determines whether case 1 or case 2 of Fig. 4.32 is present. If that node has also a higher left than right subtree, then we have to deal with case 1, otherwise with case 2. (Convince yourself that a left subtree with a balance factor equal to 0 at its root cannot occur in this case.) The rebalancing operations necessary are entirely expressed as sequences of pointer reassignments. In fact, pointers are cyclically exchanged, resulting in either a single or a double rotation of the two or three nodes involved. In addition to pointer rotation, the respective node balance factors have to be updated. The details are shown in the search, insertion, and rebalancing procedures.
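The three cases and the two rotation patterns described above can be sketched in Python as follows. This is an illustrative transcription under stated assumptions, not the program of the text: instead of a variable parameter h, the hypothetical function `insert` returns a pair (subtree root, "height has increased"), and the helper `shape` exists only to inspect the result.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.count = 1
        self.bal = 0            # height(right) - height(left), as in the text
        self.left = None
        self.right = None


def insert(p, x):
    """Insert key x into subtree p; return (new root, height_has_increased)."""
    if p is None:
        return Node(x), True
    if x < p.key:
        p.left, grown = insert(p.left, x)
        if not grown:
            return p, False
        if p.bal == 1:          # right was higher: imbalance equilibrated
            p.bal = 0
            return p, False
        if p.bal == 0:          # weight is now slanted to the left
            p.bal = -1
            return p, True
        p1 = p.left             # p.bal == -1: rebalancing is necessary
        if p1.bal == -1:        # single LL rotation
            p.left, p1.right = p1.right, p
            p.bal = 0
            p = p1
        else:                   # double LR rotation
            p2 = p1.right
            p1.right, p2.left = p2.left, p1
            p.left, p2.right = p2.right, p
            p.bal = 1 if p2.bal == -1 else 0
            p1.bal = -1 if p2.bal == 1 else 0
            p = p2
        p.bal = 0
        return p, False
    if x > p.key:
        p.right, grown = insert(p.right, x)
        if not grown:
            return p, False
        if p.bal == -1:         # left was higher: imbalance equilibrated
            p.bal = 0
            return p, False
        if p.bal == 0:          # weight is now slanted to the right
            p.bal = 1
            return p, True
        p1 = p.right            # p.bal == +1: rebalancing is necessary
        if p1.bal == 1:         # single RR rotation
            p.right, p1.left = p1.left, p
            p.bal = 0
            p = p1
        else:                   # double RL rotation
            p2 = p1.left
            p1.left, p2.right = p2.right, p1
            p.right, p2.left = p2.left, p
            p.bal = -1 if p2.bal == 1 else 0
            p1.bal = 1 if p2.bal == -1 else 0
            p = p2
        p.bal = 0
        return p, False
    p.count += 1                # key already present
    return p, False


def shape(p):
    """Nested-tuple view (left, key, right) of a tree, for inspection."""
    return None if p is None else (shape(p.left), p.key, shape(p.right))


root = None
for key in (4, 5, 7, 2, 1, 3, 6):
    root, _ = insert(root, key)
```

Inserting the keys 4, 5, 7, 2, 1, 3, 6 in this order exercises all four rotations (RR, LL, LR, RL) and ends in a perfectly balanced tree with root 4.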

Fig. 4.34. Insertions in balanced tree

The working principle is shown by Fig. 4.34. Consider the binary tree (a) which consists of two nodes only. Insertion of key 7 first results in an unbalanced tree (i.e., a linear list). Its balancing involves an RR single rotation, resulting in the perfectly balanced tree (b). Further insertion of nodes 2 and 1 results in an imbalance of the subtree with root 4. This subtree is balanced by an LL single rotation (d). The subsequent insertion of key 3 immediately offsets the balance criterion at the root node 5. Balance is thereafter reestablished by the more complicated LR double rotation; the outcome is tree (e). The only candidate for losing balance after a next insertion is node 5. Indeed, insertion of node 6 must invoke the fourth case of rebalancing outlined below, the RL double rotation. The final tree is shown in Fig. 4.34 (f).

PROCEDURE search(x: INTEGER; VAR p: Node; VAR h: BOOLEAN);
  VAR p1, p2: Node; (*~h*)
BEGIN
  IF p = NIL THEN (*insert*)
    NEW(p); h := TRUE;
    p.key := x; p.count := 1; p.left := NIL; p.right := NIL; p.bal := 0
  ELSIF p.key > x THEN
    search(x, p.left, h);
    IF h THEN (*left branch has grown*)
      IF p.bal = 1 THEN p.bal := 0; h := FALSE
      ELSIF p.bal = 0 THEN p.bal := -1
      ELSE (*bal = -1, rebalance*) p1 := p.left;
        IF p1.bal = -1 THEN (*single LL rotation*)
          p.left := p1.right; p1.right := p;
          p.bal := 0; p := p1
        ELSE (*double LR rotation*)
          p2 := p1.right;
          p1.right := p2.left; p2.left := p1;
          p.left := p2.right; p2.right := p;
          IF p2.bal = -1 THEN p.bal := 1 ELSE p.bal := 0 END ;
          IF p2.bal = +1 THEN p1.bal := -1 ELSE p1.bal := 0 END ;
          p := p2
        END ;
        p.bal := 0; h := FALSE
      END
    END
  ELSIF p.key < x THEN
    search(x, p.right, h);
    IF h THEN (*right branch has grown*)
      IF p.bal = -1 THEN p.bal := 0; h := FALSE
      ELSIF p.bal = 0 THEN p.bal := 1
      ELSE (*bal = +1, rebalance*) p1 := p.right;
        IF p1.bal = 1 THEN (*single RR rotation*)
          p.right := p1.left; p1.left := p;
          p.bal := 0; p := p1
        ELSE (*double RL rotation*)
          p2 := p1.left;
          p1.left := p2.right; p2.right := p1;
          p.right := p2.left; p2.left := p;
          IF p2.bal = +1 THEN p.bal := -1 ELSE p.bal := 0 END ;
          IF p2.bal = -1 THEN p1.bal := 1 ELSE p1.bal := 0 END ;
          p := p2
        END ;
        p.bal := 0; h := FALSE
      END
    END
  ELSE INC(p.count)
  END
END search

Two particularly interesting questions concerning the performance of the balanced tree insertion algorithm are the following:

1. If all n! permutations of n keys occur with equal probability, what is the expected height of the constructed balanced tree?
2. What is the probability that an insertion requires rebalancing?

Mathematical analysis of this complicated algorithm is still an open problem. Empirical tests support the conjecture that the expected height of the balanced tree thus generated is h = log(n)+c, where c is a small constant (c ≈ 0.25). This means that in practice the AVL-balanced tree behaves as well as the perfectly balanced tree, although it is much simpler to maintain. Empirical evidence also suggests that, on the average, rebalancing is necessary once for approximately every two insertions. Here single and double rotations are equally probable. The example of Fig. 4.34 has evidently been carefully chosen to demonstrate as many rotations as possible in a minimum number of insertions.

The complexity of the balancing operations suggests that balanced trees should be used only if information retrievals are considerably more frequent than insertions. This is particularly true because the nodes of such search trees are usually implemented as densely packed records in order to economize storage.
The speed of access and of updating the balance factors -- each requiring two bits only -- is therefore often a decisive factor to the efficiency of the rebalancing operation. Empirical evaluations show that balanced trees lose much of their appeal if tight record packing is mandatory. It is indeed difficult to beat the straightforward, simple tree insertion algorithm.

4.5.2. Balanced Tree Deletion

Our experience with tree deletion suggests that in the case of balanced trees deletion will also be more complicated than insertion. This is indeed true, although the rebalancing operation remains essentially the same as for insertion. In particular, rebalancing consists again of either a single or a double rotation of nodes.

The basis for balanced tree deletion is the ordinary tree deletion algorithm. The easy cases are terminal nodes and nodes with only a single descendant. If the node to be deleted has two subtrees, we will again replace it by the rightmost node of its left subtree. As in the case of insertion, a Boolean variable parameter h is added with the meaning “the height of the subtree has been reduced”. Rebalancing has to be considered only when h is true. h is made true upon finding and deleting a node, or if rebalancing itself reduces the height of a subtree. We now introduce the two (symmetric) balancing operations in the form of procedures, because they have to be invoked from more than one point in the deletion algorithm. Note that balanceL is applied after the left branch, and balanceR after the right branch, has been reduced in height.

Fig. 4.35. Deletions in balanced tree

The operation of the procedure is illustrated in Fig. 4.35. Given the balanced tree (a), successive deletion of the nodes with keys 4, 8, 6, 5, 2, 1, and 7 results in the trees (b) ... (h). Deletion of key 4 is simple in itself, because it represents a terminal node. However, it results in an unbalanced node 3. Its rebalancing operation involves an LL single rotation. Rebalancing becomes again necessary after the deletion of node 6. This time the right subtree of the root (7) is rebalanced by an RR single rotation. Deletion of node 2, although in itself straightforward since it has only a single descendant, calls for a complicated RL double rotation. The fourth case, an LR double rotation, is finally invoked after the removal of node 7, which at first was replaced by the rightmost element of its left subtree, i.e., by the node with key 3.

PROCEDURE balanceL(VAR p: Node; VAR h: BOOLEAN);
  VAR p1, p2: Node;
BEGIN (*h; left branch has shrunk*)
  IF p.bal = -1 THEN p.bal := 0
  ELSIF p.bal = 0 THEN p.bal := 1; h := FALSE
  ELSE (*bal = 1, rebalance*) p1 := p.right;
    IF p1.bal >= 0 THEN (*single RR rotation*)
      p.right := p1.left; p1.left := p;
      IF p1.bal = 0 THEN p.bal := 1; p1.bal := -1; h := FALSE
      ELSE p.bal := 0; p1.bal := 0
      END ;
      p := p1
    ELSE (*double RL rotation*)
      p2 := p1.left;
      p1.left := p2.right; p2.right := p1;
      p.right := p2.left; p2.left := p;
      IF p2.bal = +1 THEN p.bal := -1 ELSE p.bal := 0 END ;
      IF p2.bal = -1 THEN p1.bal := 1 ELSE p1.bal := 0 END ;
      p := p2; p2.bal := 0
    END
  END
END balanceL;

PROCEDURE balanceR(VAR p: Node; VAR h: BOOLEAN);
  VAR p1, p2: Node;
BEGIN (*h; right branch has shrunk*)
  IF p.bal = 1 THEN p.bal := 0
  ELSIF p.bal = 0 THEN p.bal := -1; h := FALSE
  ELSE (*bal = -1, rebalance*) p1 := p.left;
    IF p1.bal ...

  ... x THEN delete(x, p.left, h);
    IF h THEN balanceL(p, h) END
  ELSIF p.key < x THEN delete(x, p.right, h);
    IF h THEN balanceR(p, h) END
  ELSE (*delete p^*) q := p;
    IF q.right = NIL THEN p := q.left; h := TRUE
    ELSIF q.left = NIL THEN p := q.right; h := TRUE
    ELSE del(q.left, h);
      IF h THEN balanceL(p, h) END
    END
  END
END delete

Fortunately, deletion of an element in a balanced tree can also be performed with -- in the worst case -- O(log n) operations. An essential difference between the behaviour of the insertion and deletion procedures must not be overlooked, however. Whereas insertion of a single key may result in at most one rotation (of two or three nodes), deletion may require a rotation at every node along the search path. Consider, for instance, deletion of the rightmost node of a Fibonacci-tree. In this case the deletion of any single node leads to a reduction of the height of the tree; in addition, deletion of its rightmost node requires the maximum number of rotations. This therefore represents the worst choice of node in the worst case of a balanced tree, a rather unlucky combination of chances. How probable are rotations, then, in general?

The surprising result of empirical tests is that whereas one rotation is invoked for approximately every two insertions, one is required for every five deletions only. Deletion in balanced trees is therefore about as easy -- or as complicated -- as insertion.
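For experimentation, deletion with rebalancing can be sketched in Python as below. Note the deliberate simplification: this sketch stores the height of every subtree instead of the two-bit balance factor used in the text, and it rebalances through one helper `fix` instead of the procedures balanceL and balanceR. It is therefore a different formulation of the same rotations, not a transcription of the program above, and all names (`Node`, `fix`, `rotate_left`, `rotate_right`, `insert`, `delete`, `check`, `inorder`) are assumptions of this illustration.

```python
def height(t):
    return t.height if t else 0


class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.height = 1 + max(height(left), height(right))


def update(t):
    t.height = 1 + max(height(t.left), height(t.right))


def balance(t):
    # Same sign convention as the text: right height minus left height.
    return height(t.right) - height(t.left)


def rotate_left(t):
    r = t.right
    t.right, r.left = r.left, t
    update(t)
    update(r)
    return r


def rotate_right(t):
    l = t.left
    t.left, l.right = l.right, t
    update(t)
    update(l)
    return l


def fix(t):
    """Recompute the height of t and rotate if it is out of balance."""
    update(t)
    if balance(t) == 2:                     # right side too high
        if balance(t.right) < 0:            # double RL rotation
            t.right = rotate_right(t.right)
        return rotate_left(t)               # single RR rotation
    if balance(t) == -2:                    # left side too high
        if balance(t.left) > 0:             # double LR rotation
            t.left = rotate_left(t.left)
        return rotate_right(t)              # single LL rotation
    return t


def insert(t, x):
    if t is None:
        return Node(x)
    if x < t.key:
        t.left = insert(t.left, x)
    elif x > t.key:
        t.right = insert(t.right, x)
    return fix(t)


def delete(t, x):
    if t is None:
        return None                         # key not in tree
    if x < t.key:
        t.left = delete(t.left, x)
    elif x > t.key:
        t.right = delete(t.right, x)
    elif t.right is None:
        return t.left
    elif t.left is None:
        return t.right
    else:
        # Two subtrees: replace by the rightmost node of the left subtree.
        q = t.left
        while q.right:
            q = q.right
        t.key = q.key
        t.left = delete(t.left, q.key)
    return fix(t)


def check(t):
    """Return the height of t, asserting the balance criterion at every node."""
    if t is None:
        return 0
    hl, hr = check(t.left), check(t.right)
    assert abs(hr - hl) <= 1
    return 1 + max(hl, hr)


def inorder(t):
    return inorder(t.left) + [t.key] + inorder(t.right) if t else []
```

A possible use is to insert the keys 1 to 11, delete 4, 8, 6, 5, 2, 1, and 7 as in the walk-through above, and call `check` after every step. The starting tree built this way need not have the shape of Fig. 4.35 (a), so the individual rotations may differ from those named in the text.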

4.6. Optimal Search Trees

So far our consideration of organizing search trees has been based on the assumption that the frequency of access is equal for all nodes, that is, that all keys are equally probable to occur as a search argument. This is probably the best assumption if one has no idea of access distribution. However, there are cases (they are the exception rather than the rule) in which information about the probabilities of access to individual keys is available. These cases usually have the characteristic that the keys always remain the same, i.e., the search tree is subjected neither to insertion nor deletion, but retains a constant structure. A typical example is the scanner of a compiler which determines for each word (identifier) whether or not it is a keyword (reserved word). Statistical measurements over hundreds of compiled programs may in this case yield accurate information on the relative frequencies of occurrence, and thereby of access, of individual keys.

Assume that in a search tree the probability with which node i is accessed is

\[ \Pr\{x = k_i\} = p_i, \qquad \sum_{i=1}^{n} p_i = 1 \]

We now wish to organize the search tree in a way that the total number of search steps -- counted over sufficiently many trials -- becomes minimal. For this purpose the definition of path length is modified by (1) attributing a certain weight to each node and by (2) assuming the root to be at level 1 (instead of 0), because it accounts for the first comparison along the search path. Nodes that are frequently accessed become heavy nodes; those that are rarely visited become light nodes. The (internal) weighted path length is then the sum of all paths from the root to each node weighted by that node's probability of access.

\[ P = \sum_{i=1}^{n} p_i h_i \]

Here $h_i$ is the level of node i. The goal is now to minimize the weighted path length for a given probability distribution. As an example, consider the set of keys 1, 2, 3, with probabilities of access $p_1 = 1/7$, $p_2 = 2/7$, and $p_3 = 4/7$. These three keys can be arranged in five different ways as search trees (see Fig. 4.36).
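The five arrangements and their weighted path lengths can be enumerated mechanically. The following Python sketch is an illustration only; trees are represented as nested tuples (left, key, right), and the names `search_trees` and `weighted_path_length` are chosen here.

```python
from fractions import Fraction


def search_trees(lo, hi):
    """Yield every binary search tree over the keys lo..hi.

    A tree is a nested tuple (left, key, right); the empty tree is None.
    """
    if lo > hi:
        yield None
        return
    for root in range(lo, hi + 1):
        for left in search_trees(lo, root - 1):
            for right in search_trees(root + 1, hi):
                yield (left, root, right)


def weighted_path_length(tree, p, level=1):
    """Sum of p[key] * level over all nodes, with the root at level 1."""
    if tree is None:
        return 0
    left, key, right = tree
    return (p[key] * level
            + weighted_path_length(left, p, level + 1)
            + weighted_path_length(right, p, level + 1))


p = {1: Fraction(1, 7), 2: Fraction(2, 7), 3: Fraction(4, 7)}
trees = list(search_trees(1, 3))
lengths = sorted(weighted_path_length(t, p) for t in trees)
best = min(trees, key=lambda t: weighted_path_length(t, p))

if __name__ == "__main__":
    print(len(trees), [str(x) for x in lengths])
    print(best)
```

With these probabilities the minimum, 11/7, is attained by the degenerate tree with root 3, whose left descendant is 2, whose left descendant in turn is 1; the perfectly balanced tree with root 2 has length 12/7.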

Fig. 4.36. The search trees with 3 nodes

The weighted path lengths of trees (a) to (e) are computed according to their definition as

P(a) = 11/7, P(b) = 12/7, P(c) = 12/7, P(d) = 15/7, P(e) = 17/7

Hence, in this example, not the perfectly balanced tree (c), but the degenerate tree (a) turns out to be optimal.

The example of the compiler scanner immediately suggests that this problem should be viewed under a slightly more general condition: words occurring in the source text are not always keywords; as a matter of fact, their being keywords is rather the exception. Finding that a given word k is not a key in the search tree can be considered as an access to a hypothetical "special node" inserted between the next lower and next higher key (see Fig. 4.19) with an associated external path length. If the probability $q_i$ of a search argument x lying between the two keys $k_i$ and $k_{i+1}$ is also known, this information may considerably change the structure of the optimal search tree. Hence, we generalize the problem by also considering unsuccessful searches. The overall average weighted path length is now

\[ P = \sum_{i=1}^{n} p_i h_i + \sum_{j=1}^{m} q_j h'_j \]

where

\[ \sum_{i=1}^{n} p_i + \sum_{j=1}^{m} q_j = 1 \]

and where $h_i$ is the level of the (internal) node i and $h'_j$ is the level of the external node j. The average weighted path length may be called the cost of the search tree, since it represents a measure for the expected amount of effort to be spent for searching. The search tree that requires the minimal cost among all trees with a given set of keys $k_i$ and probabilities $p_i$ and $q_i$ is called the optimal tree.

Fig. 4.37. Search tree with associated access frequencies

For finding the optimal tree, there is no need to require that the p's and q's sum up to 1. In fact, these probabilities are commonly determined by experiments in which the accesses to nodes are counted. Instead of using the probabilities $p_i$ and $q_j$, we will subsequently use such frequency counts and denote them by

$a_i$ = number of times the search argument x equals $k_i$
$b_j$ = number of times the search argument x lies between $k_j$ and $k_{j+1}$

By convention, $b_0$ is the number of times that x is less than $k_1$, and $b_n$ is the frequency of x being greater than $k_n$ (see Fig. 4.37). We will subsequently use P to denote the accumulated weighted path length instead of the average path length:

\[ P = \sum_{i=1}^{n} a_i h_i + \sum_{j=1}^{m} b_j h'_j \]

Thus, apart from avoiding the computation of the probabilities from measured frequency counts, we gain the further advantage of being able to use integers instead of fractions in our search for the optimal tree.

Considering the fact that the number of possible configurations of n nodes grows exponentially with n, the task of finding the optimum seems rather hopeless for large n. Optimal trees, however, have one significant property that helps to find them: all their subtrees are optimal too. For instance, if the tree in Fig. 4.37 is optimal, then the subtree with keys $k_3$ and $k_4$ is also optimal as shown. This property suggests an algorithm that systematically finds larger and larger trees, starting with individual nodes as smallest possible subtrees. The tree thus grows from the leaves to the root, which is, since we are used to drawing trees upside-down, the bottom-up direction [4-6].

The equation that is the key to this algorithm is derived as follows: Let P be the weighted path length of a tree, and let $P_L$ and $P_R$ be those of the left and right subtrees of its root. Clearly, P is the sum of $P_L$ and $P_R$, and the number of times a search travels on the leg to the root, which is simply the total number W of search trials. We call W the weight of the tree. Its average path length is then P/W.

\[ P = P_L + W + P_R \]
\[ W = \sum_{i=1}^{n} a_i + \sum_{j=1}^{m} b_j \]

These considerations show the need for a denotation of the weights and the path lengths of any subtree consisting of a number of adjacent keys. Let $T_{ij}$ be the optimal subtree consisting of nodes with keys $k_{i+1}, k_{i+2}, \ldots, k_j$. Then let $w_{ij}$ denote the weight and let $p_{ij}$ denote the path length of $T_{ij}$. Clearly $P = p_{0,n}$ and $W = w_{0,n}$. These quantities are defined by the following recurrence relations:

\[
\begin{aligned}
w_{ii} &= b_i && (0 \le i \le n) \\
w_{ij} &= w_{i,j-1} + a_j + b_j && (0 \le i < j \le n) \\
p_{ii} &= w_{ii} && (0 \le i \le n) \\
p_{ij} &= w_{ij} + \min_{i < k \le j} \,(p_{i,k-1} + p_{kj}) && (0 \le i < j \le n)
\end{aligned}
\]

The last equation follows immediately from the definitions of P and of optimality. Since there are approximately $n^2/2$ values $p_{ij}$, and because its definition calls for a choice among all cases such that $0 < j-i \le n$, the minimization operation will involve approximately $n^3/6$ operations. Knuth pointed out that a factor n can be saved by the following consideration, which alone makes this algorithm usable for practical purposes.

Let $r_{ij}$ be a value of k which achieves the minimum for $p_{ij}$. It is possible to limit the search for $r_{ij}$ to a much smaller interval, i.e., to reduce the number of the j-i evaluation steps. The key is the observation that if we have found the root $r_{ij}$ of the optimal subtree $T_{ij}$, then neither extending the tree by adding a node at the right, nor shrinking the tree by removing its leftmost node ever can cause the optimal root to move to the left. This is expressed by the relation

\[ r_{i,j-1} \le r_{ij} \le r_{i+1,j} \]

which limits the search for possible solutions for $r_{ij}$ to the range $r_{i,j-1} \ldots r_{i+1,j}$. This results in a total number of elementary steps in the order of $n^2$.

We are now ready to construct the optimization algorithm in detail. We recall the following definitions, which are based on optimal trees $T_{ij}$ consisting of nodes with keys $k_{i+1} \ldots k_j$.

1. $a_i$: the frequency of a search for $k_i$.
2. $b_j$: the frequency of a search argument x between $k_j$ and $k_{j+1}$.
3. $w_{ij}$: the weight of $T_{ij}$.
4. $p_{ij}$: the weighted path length of $T_{ij}$.
5. $r_{ij}$: the index of the root of $T_{ij}$.

We declare the following arrays:

a: ARRAY n+1 OF INTEGER; (*a[0] not used*)
b: ARRAY n+1 OF INTEGER;
p,w,r: ARRAY n+1, n+1 OF INTEGER;

Assume that the weights $w_{ij}$ have been computed from a and b in a straightforward way. Now consider w as the argument of the procedure OptTree to be developed and consider r as its result, because r describes the tree structure completely. p may be considered an intermediate result. Starting out by considering the smallest possible subtrees, namely those consisting of no nodes at all, we proceed to larger and larger trees. Let us denote the width j-i of the subtree $T_{ij}$ by h. Then we can trivially determine the values $p_{ii}$ for all trees with h = 0 according to the definition of $p_{ij}$.

FOR i := 0 TO n DO p[i,i] := b[i] END

In the case h = 1 we deal with trees consisting of a single node, which plainly is also the root (see Fig. 4.38).

FOR i := 0 TO n-1 DO
  j := i+1; p[i,j] := w[i,j] + p[i,i] + p[j,j]; r[i,j] := j
END

Fig. 4.38. Optimal search tree with single node

Note that i denotes the left index limit and j the right index limit in the considered tree $T_{ij}$. For the cases h > 1 we use a repetitive statement with h ranging from 2 to n, the case h = n spanning the entire tree $T_{0,n}$. In each case the minimal path length $p_{ij}$ and the associated root index $r_{ij}$ are determined by a simple repetitive statement with an index k ranging over the interval given for $r_{ij}$.

FOR h := 2 TO n DO
  FOR i := 0 TO n-h DO
    j := i+h;
    find k and min = MIN k: i < k ≤ j : ($p_{i,k-1} + p_{kj}$) such that $r_{i,j-1} \le k \le r_{i+1,j}$;
    p[i,j] := min + w[i,j]; r[i,j] := k
  END
END

The details of the refinement of the statement in italics can be found in Program 4.6. The average path length of $T_{0,n}$ is now given by the quotient $p_{0,n}/w_{0,n}$, and its root is the node with index $r_{0,n}$.

Let us now describe the structure of the program to be designed. Its two main components are the procedures to find the optimal search tree, given a weight distribution w, and to display the tree given the indices r. First, the counts a and b and the keys are read from an input source. The keys are actually not involved in the computation of the tree structure; they are merely used in the subsequent display of the tree. After printing the frequency statistics, the program proceeds to compute the path length of the perfectly balanced tree, in passing also determining the roots of its subtrees. Thereafter, the average weighted path length is printed and the tree is displayed.

In the third part, procedure OptTree is activated in order to compute the optimal search tree; thereafter, the tree is displayed. And finally, the same procedures are used to compute and display the optimal tree considering the key frequencies only, ignoring the frequencies of non-keys. To summarize, the following are the global constants and variables:

CONST N = 100; (*max no. of keywords*)
  WordLen = 16; (*max keyword length*)
VAR key: ARRAY N+1, WordLen OF CHAR;
  a, b: ARRAY N+1 OF INTEGER;
  p, w, r: ARRAY N+1, N+1 OF INTEGER;

PROCEDURE BalTree(i, j: INTEGER): INTEGER;
  VAR k: INTEGER;
BEGIN
  k := (i+j+1) DIV 2; r[i, j] := k;
  IF i >= j THEN RETURN 0
  ELSE RETURN BalTree(i, k-1) + BalTree(k, j) + w[i, j]
  END
END BalTree;

PROCEDURE ComputeOptTree(n: INTEGER);
  VAR x, min, tmp: INTEGER;
    i, j, k, h, m: INTEGER;
BEGIN (*argument: W, results: p, r*)
  FOR i := 0 TO n DO p[i, i] := 0 END ;
  FOR i := 0 TO n-1 DO
    j := i+1; p[i, j] := w[i, j]; r[i, j] := j
  END ;
  FOR h := 2 TO n DO
    FOR i := 0 TO n-h DO
      j := i+h; m := r[i, j-1]; min := p[i, m-1] + p[m, j];
      FOR k := m+1 TO r[i+1, j] DO
        tmp := p[i, k-1]; x := p[k, j] + tmp;
        IF x < min THEN m := k; min := x END
      END ;
      p[i, j] := min + w[i, j]; r[i, j] := m
    END
  END
END ComputeOptTree;

PROCEDURE WriteTree(i, j, level: INTEGER);
  VAR k: INTEGER; (*uses global writer W*)
BEGIN
  IF i < j THEN
    WriteTree(i, r[i, j]-1, level+1);
    FOR k := 1 TO level DO Texts.Write(W, TAB) END ;
    Texts.WriteString(W, key[r[i, j]]); Texts.WriteLn(W);
    WriteTree(r[i, j], j, level+1)
  END
END WriteTree;

PROCEDURE Find(VAR S: Texts.Scanner);
  VAR i, j, n: INTEGER; (*uses global writer W*)
BEGIN Texts.Scan(S); b[0] := SHORT(S.i);
  n := 0; Texts.Scan(S); (*input a, key, b*)
  WHILE S.class = Texts.Int DO
    INC(n); a[n] := SHORT(S.i); Texts.Scan(S); COPY(S.s, key[n]);
    Texts.Scan(S); b[n] := SHORT(S.i); Texts.Scan(S)
  END ;
  (*compute w from a and b*)
  FOR i := 0 TO n DO
    w[i, i] := b[i];
    FOR j := i+1 TO n DO w[i, j] := w[i, j-1] + a[j] + b[j] END
  END ;
  Texts.WriteString(W, "Total weight = "); Texts.WriteInt(W, w[0, n], 6); Texts.WriteLn(W);
  Texts.WriteString(W, "Pathlength of balanced tree = "); Texts.WriteInt(W, BalTree(0, n), 6); Texts.WriteLn(W);
  WriteTree(0, n, 0); Texts.WriteLn(W);
  ComputeOptTree(n);
  Texts.WriteString(W, "Pathlength of optimal tree = "); Texts.WriteInt(W, p[0, n], 6); Texts.WriteLn(W);
  WriteTree(0, n, 0); Texts.WriteLn(W);
  FOR i := 0 TO n DO
    w[i, i] := 0;
    FOR j := i+1 TO n DO w[i, j] := w[i, j-1] + a[j] END
  END ;
  ComputeOptTree(n);
  Texts.WriteString(W, "optimal tree not considering b"); Texts.WriteLn(W);
  WriteTree(0, n, 0); Texts.WriteLn(W)
END Find;

As an example, let us consider the following input data of a tree with 3 keys:

20 1 Albert 10 2 Ernst 1 5 Peter 1

b0 = 20
a1 = 1    key1 = Albert    b1 = 10
a2 = 2    key2 = Ernst     b2 = 1
a3 = 5    key3 = Peter     b3 = 1
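As a cross-check of this example, the following Python sketch (Python rather than the book's Oberon, so that it can be run directly) mirrors the weight computation of Find and the loop structure of ComputeOptTree. The function name optimal_tree and the list-based representation are assumptions made for illustration; the frequencies are those of the example above, with a3 = 5 as given in the input line.

```python
def optimal_tree(a, b):
    """a[1..n]: key frequencies (a[0] is unused); b[0..n]: miss frequencies.

    Returns (w, p, r): weights, weighted path lengths, and subtree roots.
    """
    n = len(b) - 1
    # w[i][j]: total weight of the tree holding keys i+1 .. j
    w = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        w[i][i] = b[i]
        for j in range(i + 1, n + 1):
            w[i][j] = w[i][j - 1] + a[j] + b[j]
    p = [[0] * (n + 1) for _ in range(n + 1)]   # p[i][i] = 0
    r = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):                          # trees with a single key
        p[i][i + 1] = w[i][i + 1]
        r[i][i + 1] = i + 1
    for h in range(2, n + 1):                   # trees with h keys
        for i in range(n - h + 1):
            j = i + h
            m = r[i][j - 1]
            best = p[i][m - 1] + p[m][j]
            # candidate roots are restricted to r[i][j-1] .. r[i+1][j]
            for k in range(m + 1, r[i + 1][j] + 1):
                x = p[i][k - 1] + p[k][j]
                if x < best:
                    m, best = k, x
            p[i][j] = best + w[i][j]
            r[i][j] = m
    return w, p, r


a = [0, 1, 2, 5]        # a1, a2, a3 for Albert, Ernst, Peter
b = [20, 10, 1, 1]      # b0 .. b3
w, p, r = optimal_tree(a, b)
print(w[0][3], p[0][3], r[0][3])   # 40 66 1
```

The printed root index 1 means that, for these frequencies, key1 (Albert) is the root of the optimal tree.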

The results of procedure Find are shown in Fig. 4.40 and demonstrate that the structures obtained for the three cases may differ significantly. The total weight is 40, the pathlength of the balanced tree is 78, and that of the optimal tree is 66.

[Figure: three trees over the keys Albert, Ernst, and Peter, labelled "balanced tree", "optimal tree", and "not considering key misses"]

Fig. 4.40. The 3 trees generated by the Optimal Tree procedure

It is evident from this algorithm that the effort to determine the optimal structure is of the order of n^2; also, the amount of required storage is of the order n^2. This is unacceptable if n is very large. Algorithms with greater efficiency are therefore highly desirable. One of them is the algorithm developed by Hu and Tucker [4-5] which requires only O(n) storage and O(n*log(n)) computations. However, it considers only the case in which the key frequencies are zero, i.e., where only the unsuccessful search trials are registered. Another algorithm, also requiring O(n) storage elements and O(n*log(n)) computations, was described by Walker and Gotlieb [4-7]. Instead of trying to find the optimum, this algorithm merely promises to yield a nearly optimal tree. It can therefore be based on heuristic principles. The basic idea is the following.

Consider the nodes (genuine and special nodes) being distributed on a linear scale, weighted by their frequencies (or probabilities) of access. Then find the node which is closest to the center of gravity. This node is called the centroid, and its index is

((Σi: 1 ≤ i ≤ n : i*a_i) + (Σi: 1 ≤ i ≤ m : i*b_i)) / W

rounded to the nearest integer. If all nodes have equal weight, then the root of the desired optimal tree evidently coincides with the centroid. Otherwise -- so the reasoning goes -- it will in most cases be in the close neighborhood of the centroid. A limited search is then used to find the local optimum, whereafter this procedure is applied to the resulting two subtrees. The likelihood of the root lying very close to the centroid grows with the size n of the tree. As soon as the subtrees have reached a manageable size, their optimum can be determined by the above exact algorithm.

4.7. B-Trees

So far, we have restricted our discussion to trees in which every node has at most two descendants, i.e., to binary trees. This is entirely satisfactory if, for instance, we wish to represent family relationships with a preference to the pedigree view, in which every person is associated with his parents. After all, no one has more than two parents. But what about someone who prefers the posterity view? He has to cope with the fact that some people have more than two children, and his trees will contain nodes with many branches. For lack of a better term, we shall call them multiway trees.

Of course, there is nothing special about such structures, and we have already encountered all the programming and data definition facilities to cope with such situations. If, for instance, an absolute upper limit on the number of children is given (which is admittedly a somewhat futuristic assumption), then one may represent the children as an array component of the record representing a person. If the number of children varies strongly among different persons, however, this may result in a poor utilization of available storage. In this case it will be much more appropriate to arrange the offspring as a linear list, with a pointer to the youngest (or eldest) offspring assigned to the parent. A possible type definition for this case is the following, and a possible data structure is shown in Fig. 4.43.

TYPE Person = POINTER TO RECORD
    name: alfa;
    sibling, offspring: Person
  END

[Figure: persons JOHN, ALBERT, PETER, MARY, PAUL, ROBERT, CAROL, CHRIS, GEORGE, PAMELA, and TINA, linked by sibling and offspring pointers]

Fig. 4.43. Multiway tree represented as binary tree

We now realize that by tilting this picture by 45 degrees it will look like a perfect binary tree. But this view is misleading because functionally the two references have entirely different meanings. One usually doesn't treat a sibling as an offspring and get away unpunished, and hence one should not do so even in constructing data definitions. This example could also be easily extended into an even more complicated data structure by introducing more components in each person's record, thus being able to represent further family relationships. A likely candidate that cannot generally be derived from the sibling and offspring references is that of husband and wife, or even the inverse relationship of father and mother. Such a structure quickly grows into a complex relational data bank, and it may be possible to map several trees into it. The algorithms operating on such structures are intimately tied to their data definitions, and it does not make sense to specify any general rules or widely applicable techniques.

However, there is a very practical area of application of multiway trees which is of general interest. This is the construction and maintenance of large-scale search trees in which insertions and deletions are necessary, but in which the primary store of a computer is not large enough or is too costly to be used for long-time storage.

Assume, then, that the nodes of a tree are to be stored on a secondary storage medium such as a disk store. Dynamic data structures introduced in this chapter are particularly suitable for incorporation of secondary storage media. The principal innovation is merely that pointers are represented by disk store addresses instead of main store addresses. Using a binary tree for a data set of, say, a million items, requires on the average approximately log_2(10^6) (i.e. about 20) search steps.
Since each step now involves a disk access (with inherent latency time), a storage organization using fewer accesses will be highly desirable. The multiway tree is a perfect solution to this problem. If an item located on a secondary store is accessed, an entire group of items may also be accessed without much additional cost. This suggests that a tree be subdivided into subtrees, and that the subtrees are represented as units that are accessed all together. We shall call these subtrees pages. Figure 4.44 shows a binary tree subdivided into pages, each page consisting of 7 nodes.


Fig. 4.44. Binary tree subdivided into pages

The saving in the number of disk accesses -- each page access now involves a disk access -- can be considerable. Assume that we choose to place 100 nodes on a page (this is a reasonable figure); then the million item search tree will on the average require only log_100(10^6) (i.e. about 3) page accesses instead of 20. But, of course, if the tree is left to grow at random, then the worst case may still be as large as 10^4. It is plain that a scheme for controlled growth is almost mandatory in the case of multiway trees.

4.7.1. Multiway B-Trees

If one is looking for a controlled growth criterion, the one requiring a perfect balance is quickly eliminated because it involves too much balancing overhead. The rules must clearly be somewhat relaxed. A very sensible criterion was postulated by R. Bayer and E.M. McCreight [4-2] in 1970: every page (except one) contains between n and 2n nodes for a given constant n. Hence, in a tree with N items and a maximum page size of 2n nodes per page, the worst case requires log_n(N) page accesses; and page accesses clearly dominate the entire search effort. Moreover, the important factor of store utilization is at least 50% since pages are always at least half full. With all these advantages, the scheme involves comparatively simple algorithms for search, insertion, and deletion. We will subsequently study them in detail.

The underlying data structures are called B-trees, and have the following characteristics; n is said to be the order of the B-tree.

1. Every page contains at most 2n items (keys).
2. Every page, except the root page, contains at least n items.
3. Every page is either a leaf page, i.e. has no descendants, or it has m+1 descendants, where m is the number of keys on this page.
4. All leaf pages appear at the same level.
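The four characteristics lend themselves to a mechanical check. The following Python sketch (not the book's Oberon) is one way to express them; the representation of a page as a pair (keys, children) and the names check_btree, leaf, and FIG_4_45 are assumptions made for illustration. The sample tree is the B-tree of order 2 of Fig. 4.45 as read from the figure. The sketch does not verify the ordering of keys across pages.

```python
def check_btree(page, n, is_root=True):
    """Return the height of the B-tree rooted at page.

    A page is a pair (keys, children); children is empty for a leaf page.
    Raises AssertionError if one of the four characteristics is violated.
    """
    keys, children = page
    assert 1 <= len(keys) <= 2 * n, "1: at most 2n items per page"
    assert is_root or len(keys) >= n, "2: at least n items, except the root"
    if not children:
        return 1                                  # leaf page
    assert len(children) == len(keys) + 1, "3: m+1 descendants"
    heights = {check_btree(c, n, False) for c in children}
    assert len(heights) == 1, "4: all leaf pages at the same level"
    return heights.pop() + 1


def leaf(*keys):
    return (list(keys), [])


# The tree of Fig. 4.45, as read from the figure
FIG_4_45 = ([25], [
    ([10, 20], [leaf(2, 5, 7, 8), leaf(13, 14, 15, 18), leaf(22, 24)]),
    ([30, 40], [leaf(26, 27, 28), leaf(32, 35, 38), leaf(41, 42, 45, 46)]),
])

print(check_btree(FIG_4_45, 2))   # 3
```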

                                  25
              10 20                             30 40
2 5 7 8   13 14 15 18   22 24       26 27 28   32 35 38   41 42 45 46

Fig. 4.45. B-tree of order 2

Figure 4.45 shows a B-tree of order 2 with 3 levels. All pages contain 2, 3, or 4 items; the exception is the root which is allowed to contain a single item only. All leaf pages appear at level 3. The keys appear in increasing order from left to right if the B-tree is squeezed into a single level by inserting the descendants in between the keys of their ancestor page. This arrangement represents a natural extension of binary search trees, and it determines the method of searching an item with given key. Consider a page of the form shown in Fig. 4.46 and a given search argument x. Assuming that the page has been moved into the primary store, we may use conventional search methods among the keys k_1 ... k_m. If m is sufficiently large, one may use binary search; if it is rather small, an ordinary sequential search will do. (Note that the time required for a search in main store is probably negligible compared to the time it takes to move the page from secondary into primary store.) If the search is unsuccessful, we are in one of the following situations:

1. k_i < x < k_(i+1), for 1 ≤ i < m. The search continues on page p_i^.
2. k_m < x. The search continues on page p_m^.
3. x < k_1. The search continues on page p_0^.

p_0 k_1 p_1 k_2 p_2 ... p_(m-1) k_m p_m

Fig. 4.46. B-tree page with m keys

If in some case the designated pointer is NIL, i.e., if there is no descendant page, then there is no item with key x in the whole tree, and the search is terminated.

Surprisingly, insertion in a B-tree is comparatively simple too. If an item is to be inserted in a page with m < 2n items, the insertion process remains constrained to that page. It is only insertion into an already full page that has consequences upon the tree structure and may cause the allocation of new pages. To understand what happens in this case, refer to Fig. 4.47, which illustrates the insertion of key 22 in a B-tree of order 2. It proceeds in the following steps:

1. Key 22 is found to be missing; insertion in page C is impossible because C is already full.
2. Page C is split into two pages (i.e., a new page D is allocated).
3. The 2n+1 keys are equally distributed onto C and D, and the middle key is moved up one level into the ancestor page A.

Before:   A: 20       B: 7 10 15 18    C: 26 30 35 40
After:    A: 20 30    B: 7 10 15 18    C: 22 26    D: 35 40

Fig. 4.47. Insertion of key 22 in B-tree

This very elegant scheme preserves all the characteristic properties of B-trees. In particular, the split pages contain exactly n items. Of course, the insertion of an item in the ancestor page may again cause that page to overflow, thereby causing the splitting to propagate. In the extreme case it may propagate up to the root. This is, in fact, the only way that the B-tree may increase its height. The B-tree has thus a strange manner of growing: it grows from its leaves upward to the root.

We shall now develop a detailed program from these sketchy descriptions. It is already apparent that a recursive formulation will be most convenient because of the property of the splitting process to propagate back along the search path. The general structure of the program will therefore be similar to balanced tree insertion, although the details are different. First of all, a definition of the page structure has to be formulated. We choose to represent the items in the form of an array.

TYPE Page = POINTER TO PageDescriptor;
  Item = RECORD key: INTEGER;
      p: Page;
      count: INTEGER (*data*)
    END ;
  PageDescriptor = RECORD m: INTEGER; (* 0 .. 2n *)
      p0: Page;
      e: ARRAY 2*n OF Item
    END

Again, the item component count stands for all kinds of other information that may be associated with each item, but it plays no role in the actual search process. Note that each page offers space for 2n items. The field m indicates the actual number of items on the page. As m ≥ n (except for the root page), a storage utilization of at least 50% is guaranteed.

The algorithm of B-tree search and insertion is formulated below as a procedure called search. Its main structure is straightforward and similar to that for the balanced binary tree search, with the exception that the branching decision is not a binary choice. Instead, the "within-page search" is represented as a binary search on the array e of elements.

The insertion algorithm is formulated as a separate procedure merely for clarity. It is activated after search has indicated that an item is to be passed up on the tree (in the direction toward the root). This fact is indicated by the Boolean result parameter h; it assumes a similar role as in the algorithm for balanced tree insertion, where h indicates that the subtree had grown. If h is true, the second result parameter, u, represents the item being passed up. Note that insertions start in hypothetical pages, namely, the "special nodes" of Fig. 4.19; the new item is immediately handed up via the parameter u to the leaf page for actual insertion. The scheme is sketched here:

PROCEDURE search(x: INTEGER; a: Page; VAR h: BOOLEAN; VAR u: Item);
BEGIN
  IF a = NIL THEN (*x not in tree, insert*)
    Assign x to item u, set h to TRUE, indicating that an item u is passed up in the tree
  ELSE
    binary search for x in array a.e;
    IF found THEN process data
    ELSE
      search(x, descendant, h, u);
      IF h THEN (*an item was passed up*)
        IF no. of items on page a^ < 2n THEN
          insert u on page a^ and set h to FALSE
        ELSE split page and pass middle item up
        END
      END
    END
  END
END search

If the parameter h is true after the call of search in the main program, a split of the root page is requested. Since the root page plays an exceptional role, this process has to be programmed separately. It consists merely of the allocation of a new root page and the insertion of the single item given by the parameter u. As a consequence, the new root page contains a single item only. The details can be gathered from Program 4.7, and Fig. 4.48 shows the result of using Program 4.7 to construct a B-tree with the following insertion sequence of keys:

20; 40 10 30 15; 35 7 26 18 22; 5; 42 13 46 27 8 32; 38 24 45 25;

The semicolons designate the positions of the snapshots taken upon each page allocation. Insertion of the last key causes two splits and the allocation of three new pages.
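The scheme can be tried out with a compact Python sketch (not the book's Oberon, and not Program 4.7 itself). The class Page and the functions search, insert, and levels are hypothetical names chosen for illustration; pages hold plain key lists instead of fixed-size item arrays, and an overflowing page is split after the new key has been placed, so that the middle of the 2n+1 keys moves up as described above.

```python
from bisect import bisect_left


class Page:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []    # empty for a leaf page


def search(page, x):
    """Return True if key x occurs in the B-tree rooted at page."""
    while page is not None:
        i = bisect_left(page.keys, x)     # binary search within the page
        if i < len(page.keys) and page.keys[i] == x:
            return True
        page = page.children[i] if page.children else None
    return False


def _insert(page, x, n):
    """Insert x below page; return None or (key, new page) passed up."""
    i = bisect_left(page.keys, x)
    if i < len(page.keys) and page.keys[i] == x:
        return None                       # key already present
    right = None
    if page.children:
        up = _insert(page.children[i], x, n)
        if up is None:
            return None
        x, right = up                     # an item was passed up
    page.keys.insert(i, x)
    if right is not None:
        page.children.insert(i + 1, right)
    if len(page.keys) <= 2 * n:
        return None
    # overflow: split the 2n+1 keys, the middle key moves up
    mid = page.keys[n]
    new = Page(page.keys[n + 1:], page.children[n + 1:])
    page.keys = page.keys[:n]
    page.children = page.children[:n + 1]
    return mid, new


def insert(root, x, n):
    """Insert x and return the (possibly new) root page."""
    if root is None:
        return Page([x])
    up = _insert(root, x, n)
    if up is None:
        return root
    key, right = up                       # the root was split
    return Page([key], [root, right])


def levels(root):
    """List the key lists of all pages, level by level."""
    out, row = [], [root]
    while row:
        out.append([p.keys for p in row])
        row = [c for p in row for c in p.children]
    return out


root = None
for x in [20, 40, 10, 30, 15, 35, 7, 26, 18, 22, 5,
          42, 13, 46, 27, 8, 32, 38, 24, 45, 25]:
    root = insert(root, x, 2)
for row in levels(root):
    print(row)
```

For the insertion sequence above and n = 2, the final tree has the single key 25 in its root page, the pages 10 20 and 30 40 on the second level, and six leaf pages.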

a)  20

b)  20
    10 15 | 30 40

c)  20 30
    7 10 15 18 | 22 26 | 35 40

d)  10 20 30
    5 7 | 15 18 | 22 26 | 35 40

e)  10 20 30 40
    5 7 8 | 13 15 18 | 22 26 27 | 32 35 | 42 46

f)  25
    10 20 | 30 40
    5 7 8 | 13 15 18 | 22 24 | 26 27 | 32 35 38 | 42 45 46

Fig. 4.48. Growth of B-tree of order 2

Since each activation of search implies one page transfer to main store, k = log_n(N) recursive calls are necessary at most, if the tree contains N items. Hence, we must be capable of accommodating k pages in main store. This is one limiting factor on the page size 2n. In fact, we need to accommodate even more than k pages, because insertion may cause page splitting to occur. A corollary is that the root page is best allocated permanently in the primary store, because each query proceeds necessarily through the root page.

Another positive quality of the B-tree organization is its suitability and economy in the case of purely sequential updating of the entire data base. Every page is fetched into primary store exactly once.

Deletion of items from a B-tree is fairly straightforward in principle, but it is complicated in the details. We may distinguish two different circumstances:

1. The item to be deleted is on a leaf page; here its removal algorithm is plain and simple.
2. The item is not on a leaf page; it must be replaced by one of the two lexicographically adjacent items, which happen to be on leaf pages and can easily be deleted.

In case 2 finding the adjacent key is analogous to finding the one used in binary tree deletion. We descend along the rightmost pointers down to the leaf page P, replace the item to be deleted by the rightmost item on P, and then reduce the size of P by 1. In any case, reduction of size must be followed by a check of the number of items m on the reduced page, because, if m < n, the primary characteristic of B-trees would be violated. Some additional action has to be taken; this underflow condition is indicated by the Boolean variable parameter h.

The only recourse is to borrow or annex an item from one of the neighboring pages, say from Q. Since this involves fetching page Q into main store -- a relatively costly operation -- one is tempted to make the best of this undesirable situation and to annex more than a single item at once. The usual strategy is to distribute the items on pages P and Q evenly on both pages. This is called page balancing.

Of course, it may happen that there is no item left to be annexed since Q has already reached its minimal size n. In this case the total number of items on pages P and Q is 2n-1; we may merge the two pages into one, adding the middle item from the ancestor page of P and Q, and then entirely dispose of page Q. This is exactly the inverse process of page splitting. The process may be visualized by considering the deletion of key 22 in Fig. 4.47. Once again, the removal of the middle key in the ancestor page may cause its size to drop below the permissible limit n, thereby requiring that further special action (either balancing or merging) be undertaken at the next level. In the extreme case page merging may propagate all the way up to the root. If the root is reduced to size 0, it is itself deleted, thereby causing a reduction in the height of the B-tree. This is, in fact, the only way that a B-tree may shrink in height. Figure 4.49 shows the gradual decay of the B-tree of Fig. 4.48 upon the sequential deletion of the keys

25 45 24; 38 32; 8 27 46 13 42; 5 22 18 26; 7 35 15;

The semicolons again mark the places where the snapshots are taken, namely where pages are being eliminated. The similarity of the structure of the deletion algorithm to that of balanced tree deletion is particularly noteworthy.

[Figure: six snapshots a) to f) of the B-tree of Fig. 4.48, one after each group of deletions]

Fig. 4.49. Decay of B-tree of order 2

TYPE Page = POINTER TO PageRec;
  Entry = RECORD
      key: INTEGER; p: Page
    END ;
  PageRec = RECORD
      m: INTEGER; (*no. of entries on page*)
      p0: Page;
      e: ARRAY 2*N OF Entry
    END ;

VAR root: Page; W: Texts.Writer;

PROCEDURE search(x: INTEGER; VAR p: Page; VAR k: INTEGER);
  VAR i, L, R: INTEGER; found: BOOLEAN; a: Page;
BEGIN a := root; found := FALSE;
  WHILE (a # NIL) & ~found DO
    L := 0; R := a.m; (*binary search*)
    WHILE L < R DO
      i := (L+R) DIV 2;
      IF x [...]

[...] 0 THEN
      FOR i := N-2 TO 0 BY -1 DO a.e[i+k] := a.e[i] END ;
      a.e[k-1] := c.e[s]; a.e[k-1].p := a.p0;
      (*move k-1 items from b to a, one to c*) DEC(b.m, k);
      FOR i := k-2 TO 0 BY -1 DO a.e[i] := b.e[i+b.m+1] END ;
      c.e[s] := b.e[b.m]; a.p0 := c.e[s].p;
      c.e[s].p := a; a.m := N-1+k; h := FALSE
    ELSE (*merge pages a and b, discard a*)
      c.e[s].p := a.p0; b.e[N] := c.e[s];
      FOR i := 0 TO N-2 DO b.e[i+N+1] := a.e[i] END ;
      b.m := 2*N; DEC(c.m); h := c.m < N
    END
  END
END underflow;

PROCEDURE delete(x: INTEGER; a: Page; VAR h: BOOLEAN);
  (*search and delete key x in B-tree a; if a page underflow arises,
    balance with adjacent page or merge; h := "page a is undersize"*)
  VAR i, L, R: INTEGER; q: Page;

  PROCEDURE del(p: Page; VAR h: BOOLEAN);
    VAR k: INTEGER; q: Page; (*global a, R*)
  BEGIN k := p.m-1; q := p.e[k].p;
    IF q # NIL THEN del(q, h);
      IF h THEN underflow(p, q, p.m, h) END
    ELSE p.e[k].p := a.e[R].p; a.e[R] := p.e[k];
      DEC(p.m); h := p.m < N
    END
  END del;

BEGIN
  IF a # NIL THEN
    L := 0; R := a.m; (*binary search*)
    WHILE L < R DO
      i := (L+R) DIV 2;
      IF x [...]

[...] 0 THEN ShowTree(p.p0, level+1) END ;
    FOR i := 0 TO p.m-1 DO ShowTree(p.e[i].p, level+1) END
  END
END ShowTree;

Extensive analysis of B-tree performance has been undertaken and is reported in the referenced article (Bayer and McCreight). In particular, it includes a treatment of the question of optimal page size, which strongly depends on the characteristics of the storage and computing system available.

Variations of the B-tree scheme are discussed in Knuth, Vol. 3, pp. 476-479. The one notable observation is that page splitting should be delayed in the same way that page merging is delayed, by first attempting to balance neighboring pages. Apart from this, the suggested improvements seem to yield marginal gains. A comprehensive survey of B-trees may be found in [4-8].

4.7.2. Binary B-Trees

The species of B-trees that seems to be least interesting is the first order B-tree (n = 1). But sometimes it is worthwhile to pay attention to the exceptional case. It is plain, however, that first-order B-trees are not useful in representing large, ordered, indexed data sets involving secondary stores; approximately 50% of all pages will contain a single item only. Therefore, we shall forget secondary stores and again consider the problem of search trees involving a one-level store only.

A binary B-tree (BB-tree) consists of nodes (pages) with either one or two items. Hence, a page contains either two or three pointers to descendants; this suggested the term 2-3 tree. According to the definition of B-trees, all leaf pages appear at the same level, and all non-leaf pages of BB-trees have either two or three descendants (including the root). Since we now are dealing with primary store only, an optimal economy of storage space is mandatory, and the representation of the items inside a node in the form of an array appears unsuitable. An alternative is the dynamic, linked allocation; that is, inside each node there exists a linked list of items of length 1 or 2. Since each node has at most three descendants and thus needs to harbor only up to three pointers, one is tempted to combine the pointers for descendants and pointers in the item list as shown in Fig. 4.50. The B-tree node thereby loses its actual identity, and the items assume the role of nodes in a regular binary tree. It remains necessary, however, to distinguish between pointers to descendants (vertical) and pointers to siblings on the same page (horizontal). Since only the pointers to the right may be horizontal, a single bit is sufficient to record this distinction. We therefore introduce the Boolean field h with the meaning horizontal. The definition of a tree node based on this representation is given below. It was suggested and investigated by R. Bayer [4-3] in 1971 and represents a search tree organization guaranteeing p = 2*log(N) as maximum path length.

TYPE Node = POINTER TO RECORD
    key: INTEGER;
    ...........
    left, right: Node;
    h: BOOLEAN (*right branch horizontal*)
  END

[Figure: a node with one item and a node with two items, the second item reached by a horizontal pointer]

Fig. 4.50. Representation of BB-tree nodes

Considering the problem of key insertion, one must distinguish four possible situations that arise from growth of the left or right subtrees. The four cases are illustrated in Fig. 4.51. Remember that B-trees have the characteristic of growing from the bottom toward the root and that the property of all leaves being at the same level must be maintained. The simplest case (1) is when the right subtree of a node A grows and when A is the only key on its (hypothetical) page. Then, the descendant B merely becomes the sibling of A, i.e., the vertical pointer becomes a horizontal pointer. This simple raising of the right arm is not possible if A already has a sibling. Then we would obtain a page with 3 nodes, and we have to split it (case 2). Its middle node B is passed up to the next higher level.

Now assume that the left subtree of a node B has grown in height. If B is again alone on a page (case 3), i.e., its right pointer refers to a descendant, then the left subtree (A) is allowed to become B's sibling. (A simple rotation of pointers is necessary since the left pointer cannot be horizontal.) If, however, B already has a sibling, the raising of A yields a page with three members, requiring a split. This split is realized in a very straightforward manner: C becomes a descendant of B, which is raised to the next higher level (case 4).

[Figure: the four cases 1 to 4 of subtree growth in a BB-tree over the nodes A, B, C with subtrees a, b, c, d, each shown before and after rearrangement]

Fig. 4.51. Node insertion in BB-tree

It should be noted that upon searching a key, it makes no effective difference whether we proceed along a horizontal or a vertical pointer. It therefore appears artificial to worry about a left pointer in case 3 becoming horizontal, although its page still contains not more than two members. Indeed, the insertion algorithm reveals a strange asymmetry in handling the growth of left and right subtrees, and it lets the BB-tree organization appear rather artificial. There is no proof of strangeness of this organization; yet a healthy intuition tells us that something is fishy, and that we should remove this asymmetry. It leads to the notion of the symmetric binary B-tree (SBB-tree) which was also investigated by Bayer [4-4] in 1972. On the average it leads to slightly more efficient search trees, but the algorithms for insertion and deletion are also slightly more complex. Furthermore, each node now requires two bits (Boolean variables lh and rh) to indicate the nature of its two pointers.

Since we will restrict our detailed considerations to the problem of insertion, we have once again to distinguish among four cases of grown subtrees. They are illustrated in Fig. 4.52, which makes the gained symmetry evident. Note that whenever a subtree of node A without siblings grows, the root of the subtree becomes the sibling of A. This case need not be considered any further.

[Figure: the four cases (LL), (LR), (RR), (RL) over the nodes A, B, C, each shown in three stages]

Fig. 4.52. Insertion in SBB-trees

The four cases considered in Fig. 4.52 all reflect the occurrence of a page overflow and the subsequent page split. They are labelled according to the directions of the horizontal pointers linking the three siblings in the middle figures. The initial situation is shown in the left column; the middle column illustrates the fact that the lower node has been raised as its subtree has grown; the figures in the right column show the result of node rearrangement.

It is advisable to stick no longer to the notion of pages out of which this organization had developed, for we are only interested in bounding the maximum path length to 2*log(N). For this we need only ensure that two horizontal pointers may never occur in succession on any search path. However, there is no reason to forbid nodes with horizontal pointers to the left and right, i.e. to treat the left and right sides differently. We therefore define the symmetric binary B-tree as a tree that has the following properties:

1. Every node contains one key and at most two (pointers to) subtrees.

2. Every pointer is either horizontal or vertical. There are no two consecutive horizontal pointers on any search path.
3. All terminal nodes (nodes without descendants) appear at the same (terminal) level.

From this definition it follows that the longest search path is no longer than twice the height of the tree. Since no SBB-tree with N nodes can have a height larger than log(N), it follows immediately that 2*log(N) is an upper bound on the search path length. In order to visualize how these trees grow, we refer to Fig. 4.53. The lines represent snapshots taken during the insertion of the following sequences of keys, where every semicolon marks a snapshot.

(1)  1 2; 3; 4 5 6; 7;
(2)  5 4; 3; 1 2 7 6;
(3)  6 2; 4; 1 7 3 5;
(4)  4 2 6; 1 7; 3 5;

[Figure: four rows of tree snapshots, one row for each of the insertion sequences (1) to (4)]

Fig. 4.53. Insertion of keys 1 to 7

These pictures make the third property of B-trees particularly obvious: all terminal nodes appear on the same level. One is therefore inclined to compare these structures with garden hedges that have been recently trimmed with hedge scissors.

The algorithm for the construction of SBB-trees is shown below. It is based on a definition of the type Node with the two components lh and rh indicating whether or not the left and right pointers are horizontal.

TYPE Node = POINTER TO RECORD
    key, count: INTEGER;
    L, R: Node;
    lh, rh: BOOLEAN
  END

The recursive procedure search again follows the pattern of the basic binary tree insertion algorithm. A third parameter h is added; it indicates whether or not the subtree with root p has changed, and it corresponds directly to the parameter h of the B-tree search program. We must note, however, the consequence of representing pages as linked lists: a page is traversed by either one or two calls of the search procedure. We must distinguish between the case of a subtree (indicated by a vertical pointer) that has grown and a sibling node (indicated by a horizontal pointer) that has obtained another sibling and hence requires a page split. The problem is easily solved by introducing a three-valued h with the following meanings:

1. h = 0: the subtree p requires no changes of the tree structure.
2. h = 1: node p has obtained a sibling.
3. h = 2: the subtree p has increased in height.

PROCEDURE search(VAR p: Node; x: INTEGER; VAR h: INTEGER);
  VAR q, r: Node;
BEGIN (*h=0*)
  IF p = NIL THEN (*insert new node*)
    NEW(p); p.key := x; p.L := NIL; p.R := NIL; p.lh := FALSE; p.rh := FALSE; h := 2
  ELSIF x < p.key THEN
    search(p.L, x, h);
    IF h > 0 THEN (*left branch has grown or received sibling*)
      q := p.L;
      IF p.lh THEN
        h := 2; p.lh := FALSE;
        IF q.lh THEN (*LL*)
          p.L := q.R; q.lh := FALSE; q.R := p; p := q
        ELSE (*q.rh, LR*)
          r := q.R; q.R := r.L; q.rh := FALSE; r.L := p.L; p.L := r.R; r.R := p; p := r
        END
      ELSE
        DEC(h);
        IF h = 1 THEN p.lh := TRUE END
      END
    END
  ELSIF x > p.key THEN
    search(p.R, x, h);
    IF h > 0 THEN (*right branch has grown or received sibling*)
      q := p.R;
      IF p.rh THEN
        h := 2; p.rh := FALSE;
        IF q.rh THEN (*RR*)
          p.R := q.L; q.rh := FALSE; q.L := p; p := q
        ELSE (*q.lh, RL*)
          r := q.L; q.L := r.R; q.lh := FALSE; r.R := p.R; p.R := r.L; r.L := p; p := r
        END
      ELSE
        DEC(h);
        IF h = 1 THEN p.rh := TRUE END
      END
    END
  END
END search;

Note that the actions to be taken for node rearrangement very strongly resemble those developed in the AVL-balanced tree search algorithm.
It is evident that all four cases can be implemented by simple pointer rotations: single rotations in the LL and RR cases, double rotations in the LR and RL cases. In fact, procedure search appears here slightly simpler than in the AVL case. Clearly, the SBB-tree scheme emerges as an alternative to the AVL-balancing criterion. A performance comparison is therefore both possible and desirable.
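For experimentation, the procedure can be transcribed into Python (chosen here instead of Oberon so that it can be run directly). In this sketch the VAR parameters p and h become return values; the helper functions build, inorder, and vheight are assumptions added for illustration and are not part of the original program. vheight counts vertical levels only and asserts the two structural properties: no two consecutive horizontal pointers, and all terminal nodes on the same level.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.L = self.R = None
        self.lh = self.rh = False     # is the left/right pointer horizontal?


def search(p, x):
    """Insert x into subtree p; return (new subtree root, h)."""
    if p is None:                     # insert new node
        return Node(x), 2
    h = 0
    if x < p.key:
        p.L, h = search(p.L, x)
        if h > 0:                     # left branch grew or received a sibling
            q = p.L
            if p.lh:
                h = 2
                p.lh = False
                if q.lh:              # LL
                    p.L = q.R; q.lh = False; q.R = p; p = q
                else:                 # q.rh, LR
                    r = q.R; q.R = r.L; q.rh = False
                    r.L = p.L; p.L = r.R; r.R = p; p = r
            else:
                h -= 1
                if h == 1:
                    p.lh = True
    elif x > p.key:
        p.R, h = search(p.R, x)
        if h > 0:                     # right branch grew or received a sibling
            q = p.R
            if p.rh:
                h = 2
                p.rh = False
                if q.rh:              # RR
                    p.R = q.L; q.rh = False; q.L = p; p = q
                else:                 # q.lh, RL
                    r = q.L; q.L = r.R; q.lh = False
                    r.R = p.R; p.R = r.L; r.L = p; p = r
            else:
                h -= 1
                if h == 1:
                    p.rh = True
    return p, h


def build(keys):
    root = None
    for x in keys:
        root, _ = search(root, x)
    return root


def inorder(p):
    return [] if p is None else inorder(p.L) + [p.key] + inorder(p.R)


def vheight(p):
    """Number of vertical levels below and including p; checks the invariants."""
    if p is None:
        return 0
    if p.lh:
        assert not (p.L.lh or p.L.rh), "two consecutive horizontal pointers"
    if p.rh:
        assert not (p.R.lh or p.R.rh), "two consecutive horizontal pointers"
    hl = vheight(p.L) + (0 if p.lh else 1)
    hr = vheight(p.R) + (0 if p.rh else 1)
    assert hl == hr, "terminal nodes on different levels"
    return hl


for seq in ([1, 2, 3, 4, 5, 6, 7], [5, 4, 3, 1, 2, 7, 6],
            [6, 2, 4, 1, 7, 3, 5], [4, 2, 6, 1, 7, 3, 5]):
    t = build(seq)
    print(t.key, inorder(t), vheight(t))
```

The four key sequences are those used for Fig. 4.53.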

We refrain from involved mathematical analysis and concentrate on some basic differences. It can be proven that the AVL-balanced trees are a subset of the SBB-trees. Hence, the class of the latter is larger. It follows that their path length is on the average larger than in the AVL case. Note in this connection the worst-case tree (4) in Fig. 4.53. On the other hand, node rearrangement is called for less frequently. The balanced tree is therefore preferred in those applications in which key retrievals are much more frequent than insertions (or deletions); if this quotient is moderate, the SBB-tree scheme may be preferred. It is very difficult to say where the borderline lies. It strongly depends not only on the quotient between the frequencies of retrieval and structural change, but also on the characteristics of an implementation. This is particularly the case if the node records have a densely packed representation, and if therefore access to fields involves part-word selection.

The SBB-tree has later found a rebirth under the name of red-black tree. The difference is that whereas in the case of the symmetric, binary B-tree every node contains two h-fields indicating whether the emanating pointers are horizontal, every node of the red-black tree contains a single h-field, indicating whether the incoming pointer is horizontal. The name stems from the idea to color nodes with incoming down-pointer black, and those with incoming horizontal pointer red. No two red nodes can immediately follow each other on any path. Therefore, as in the cases of the BB- and SBB-trees, every search path is at most twice as long as the height of the tree. There exists a canonical mapping from binary B-trees to red-black trees.
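The coloring rule can be stated in a few lines of Python (not the book's Oberon). The Node class with keyword arguments and the function colors are hypothetical and serve only this illustration: a node is red if the pointer leading to it is horizontal, and black otherwise; the root, having no incoming horizontal pointer, is black.

```python
class Node:
    def __init__(self, key, L=None, R=None, lh=False, rh=False):
        self.key, self.L, self.R = key, L, R
        self.lh, self.rh = lh, rh       # SBB-tree fields: pointer is horizontal


def colors(p, incoming_horizontal=False, out=None):
    """Map each key of an SBB-tree to its red-black color."""
    if out is None:
        out = {}
    if p is not None:
        out[p.key] = "red" if incoming_horizontal else "black"
        colors(p.L, p.lh, out)
        colors(p.R, p.rh, out)
    return out


# Node 4 with horizontal pointers to both of its siblings 2 and 6
root = Node(4, Node(2), Node(6), lh=True, rh=True)
print(colors(root))   # {4: 'black', 2: 'red', 6: 'red'}
```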

4.8. Priority Search Trees

Trees, and in particular binary trees, constitute very effective organisations for data that can be ordered on a linear scale. The preceding chapters have exposed the most frequently used ingenious schemes for efficient searching and maintenance (insertion, deletion). Trees, however, do not seem to be helpful in problems where the data are located not in a one-dimensional, but in a multi-dimensional space. In fact, efficient searching in multi-dimensional spaces is still one of the more elusive problems in computer science, the case of two dimensions being of particular importance to many practical applications.

Upon closer inspection of the subject, trees might still be applied usefully at least in the two-dimensional case. After all, we draw trees on paper in a two-dimensional space. Let us therefore briefly review the characteristics of the two major kinds of trees so far encountered.

1. A search tree is governed by the invariants

    p.left ≠ NIL implies p.left.x < p.x
    p.right ≠ NIL implies p.x < p.right.x

holding for all nodes p with key x. It is apparent that only the horizontal position of nodes is at all constrained by the invariant, and that the vertical positions of nodes can be arbitrarily chosen such that access times in searching (i.e. path lengths) are minimized.

2. A heap, also called priority tree, is governed by the invariants

    p.left ≠ NIL implies p.y ≤ p.left.y
    p.right ≠ NIL implies p.y ≤ p.right.y

holding for all nodes p with key y. Here evidently only the vertical positions are constrained by the invariants.

It seems straightforward to combine these two conditions in a definition of a tree organization in a two-dimensional space, with each node having two keys x and y, which can be regarded as coordinates of the node. Such a tree represents a point set in a plane, i.e. in a two-dimensional Cartesian space; it is therefore called Cartesian tree [4-9]. We prefer the term priority search tree, because it exhibits that this structure emerged from a combination of the priority tree and the search tree. It is characterized by the following invariants holding for each node p:

    p.left ≠ NIL implies (p.left.x < p.x) & (p.y ≤ p.left.y)
    p.right ≠ NIL implies (p.x < p.right.x) & (p.y ≤ p.right.y)

It should come as no big surprise, however, that the search properties of such trees are not particularly wonderful. After all, a considerable degree of freedom in positioning nodes has been taken away and is no longer available for choosing arrangements yielding short path lengths. Indeed, no logarithmic bounds on efforts involved in searching, inserting, or deleting elements can be assured. Although this had already been the case for the ordinary, unbalanced search tree, the chances for good average behaviour are slim. Even worse, maintenance operations can become rather unwieldy. Consider, for example, the tree of Fig. 4.54 (a). Insertion of a new node C whose coordinates force it to be inserted above and between A and B requires a considerable effort transforming (a) into (b).

McCreight discovered a scheme, similar to balancing, that, at the expense of a more complicated insertion and deletion operation, guarantees logarithmic time bounds for these operations. He calls that structure a priority search tree [4-10]; in terms of our classification, however, it should be called a balanced priority search tree. We refrain from discussing that structure, because the scheme is very intricate and in practice hardly used. By considering a somewhat more restricted, but in practice no less relevant problem, McCreight arrived at yet another tree structure, which shall be presented here in detail. Instead of assuming that the search space be unbounded, he considered the data space to be delimited by a rectangle with two sides open. We denote the limiting values of the x-coordinate by xmin and xmax.

In the scheme of the (unbalanced) priority search tree outlined above, each node p divides the plane into two parts along the line x = p.x. All nodes of the left subtree lie to its left, all those in the right subtree to its right. For the efficiency of searching this choice may be bad. Fortunately, we may choose the dividing line differently. Let us associate with each node p an interval [p.L .. p.R), ranging over all x values including p.L up to but excluding p.R. This shall be the interval within which the x-value of the node may lie. Then we postulate that the left descendant (if any) must lie within the left half, the right descendant within the right half of this interval.
Hence, the dividing line is not p.x, but (p.L+p.R)/2. For each descendant the interval is halved, thus limiting the height of the tree to log(xmax-xmin). This result holds only if no two nodes have the same x-value, a condition which, however, is guaranteed by the invariant (4.90). If we deal with integer coordinates, this limit is at most equal to the wordlength of the computer used. Effectively, the search proceeds like a bisection or radix search, and therefore these trees are called radix priority search trees [4-10]. They feature logarithmic bounds on the number of operations required for searching, inserting, and deleting an element, and are governed by the following invariants for each node p:

    p.left ≠ NIL implies (p.L ≤ p.left.x < p.M) & (p.y ≤ p.left.y)
    p.right ≠ NIL implies (p.M ≤ p.right.x < p.R) & (p.y ≤ p.right.y)

where

    p.M = (p.L + p.R) DIV 2
    p.left.L = p.L
    p.left.R = p.M
    p.right.L = p.M
    p.right.R = p.R

for all nodes p, and root.L = xmin, root.R = xmax. A decisive advantage of the radix scheme is that maintenance operations (preserving the invariants under insertion and deletion) are confined to a single spine of the tree, because the dividing lines have fixed values of x irrespective of the x-values of the inserted nodes.

Typical operations on priority search trees are insertion, deletion, finding an element with the least (largest) value of x (or y) larger (smaller) than a given limit, and enumerating the points lying within a given rectangle. Given below are procedures for inserting and enumerating. They are based on the following type declarations:

    TYPE Node = POINTER TO RECORD
        x, y: INTEGER;
        left, right: Node
      END

Notice that the attributes xL and xR need not be recorded in the nodes themselves. They are rather computed during each search. This, however, requires two additional parameters of the recursive procedure insert. Their values for the first call (with p = root) are xmin and xmax respectively. Apart from this, a search proceeds similarly to that of a regular search tree. If an empty node is encountered, the element is inserted. If the node to be inserted has a y-value smaller than the one being inspected, the new node is exchanged with the inspected node. Finally, the node is inserted in the left subtree, if its x-value is less than the middle value of the interval, or the right subtree otherwise.

    PROCEDURE insert(VAR p: Node; X, Y, xL, xR: INTEGER);
      VAR xm, t: INTEGER;
    BEGIN
      IF p = NIL THEN (*not in tree, insert*)
        NEW(p); p.x := X; p.y := Y; p.left := NIL; p.right := NIL
      ELSIF p.x = X THEN (*found; don't insert*)
      ELSE
        IF p.y > Y THEN
          t := p.x; p.x := X; X := t;
          t := p.y; p.y := Y; Y := t
        END ;
        xm := (xL + xR) DIV 2;
        IF X < xm THEN insert(p.left, X, Y, xL, xm)
        ELSE insert(p.right, X, Y, xm, xR)
        END
      END
    END insert

The task of enumerating all points x,y lying in a given rectangle, i.e. satisfying x0 ≤ x < x1 and y ≤ y1, is accomplished by the following procedure enumerate (the bound y1 is passed as parameter y). It calls a procedure report(x,y) for each point found. Note that one side of the rectangle lies on the x-axis, i.e. the lower bound for y is 0. This guarantees that enumeration requires at most O(log(N) + s) operations, where N is the cardinality of the search space in x and s is the number of nodes enumerated.

    PROCEDURE enumerate(p: Node; x0, x1, y, xL, xR: INTEGER);
      VAR xm: INTEGER;
    BEGIN
      IF p # NIL THEN
        IF (p.y
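The listing of enumerate is cut off in this text. The following module is a minimal sketch of how insertion and enumeration could fit together; it is not the book's listing. The body of enumerate is reconstructed from the prose description above. The test p.y <= y in the outer condition is an added pruning step, justified by the heap invariant (no descendant of p can have a smaller y-value than p). The module name RadixPST, the bounds xmin = 0 and xmax = 16, the procedure Demo, and the use of an Oakwood-style module Out for output are assumptions made for illustration.

```oberon
MODULE RadixPST; (*hypothetical module; a sketch under the assumptions stated above*)
  IMPORT Out; (*assumed: Oakwood-style output module*)

  CONST xmin = 0; xmax = 16; (*assumed search space: xmin <= x < xmax*)

  TYPE Node = POINTER TO NodeDesc;
    NodeDesc = RECORD x, y: INTEGER; left, right: Node END;

  VAR root: Node;

  PROCEDURE report(x, y: INTEGER);
  BEGIN Out.Int(x, 4); Out.Int(y, 4); Out.Ln
  END report;

  (*insertion as given in the text*)
  PROCEDURE insert(VAR p: Node; X, Y, xL, xR: INTEGER);
    VAR xm, t: INTEGER;
  BEGIN
    IF p = NIL THEN (*not in tree, insert*)
      NEW(p); p.x := X; p.y := Y; p.left := NIL; p.right := NIL
    ELSIF p.x = X THEN (*found; don't insert*)
    ELSE
      IF p.y > Y THEN (*new point has higher priority: exchange*)
        t := p.x; p.x := X; X := t;
        t := p.y; p.y := Y; Y := t
      END;
      xm := (xL + xR) DIV 2;
      IF X < xm THEN insert(p.left, X, Y, xL, xm)
      ELSE insert(p.right, X, Y, xm, xR)
      END
    END
  END insert;

  (*report all points with x0 <= x < x1 and y-value <= y*)
  PROCEDURE enumerate(p: Node; x0, x1, y, xL, xR: INTEGER);
    VAR xm: INTEGER;
  BEGIN
    IF (p # NIL) & (p.y <= y) THEN (*otherwise the whole subtree lies above y*)
      IF (x0 <= p.x) & (p.x < x1) THEN report(p.x, p.y) END;
      xm := (xL + xR) DIV 2;
      (*left subtree covers [xL, xm), right subtree covers [xm, xR)*)
      IF x0 < xm THEN enumerate(p.left, x0, x1, y, xL, xm) END;
      IF xm < x1 THEN enumerate(p.right, x0, x1, y, xm, xR) END
    END
  END enumerate;

  PROCEDURE Demo*;
  BEGIN root := NIL;
    insert(root, 5, 7, xmin, xmax);
    insert(root, 2, 3, xmin, xmax);
    insert(root, 12, 1, xmin, xmax);
    insert(root, 9, 6, xmin, xmax);
    insert(root, 6, 4, xmin, xmax);
    enumerate(root, 4, 10, 6, xmin, xmax)
  END Demo;

END RadixPST.
```

With the five points inserted by Demo, the call enumerates the rectangle 4 ≤ x < 10, y ≤ 6 and reports the points (6, 4) and (9, 6), in that order. The points (5, 7), (2, 3) and (12, 1) are rejected because of their y- or x-values.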


Preface

In recent years the subject of computer programming has been recognized as a discipline whose mastery is fundamental and crucial to the success of many engineering projects and which is amenable to scientific treatment and presentation. It has advanced from a craft to an academic discipline. The initial outstanding contributions toward this development were made by E.W. Dijkstra and C.A.R. Hoare. Dijkstra's Notes on Structured Programming [1] opened a new view of programming as a scientific subject and intellectual challenge, and it coined the title for a "revolution" in programming. Hoare's Axiomatic Basis of Computer Programming [2] showed in a lucid manner that programs are amenable to an exacting analysis based on mathematical reasoning. Both these papers argue convincingly that many programming errors can be prevented by making programmers aware of the methods and techniques which they hitherto applied intuitively and often unconsciously.

These papers focused their attention on the aspects of composition and analysis of programs, or more explicitly, on the structure of algorithms represented by program texts. Yet, it is abundantly clear that a systematic and scientific approach to program construction primarily has a bearing in the case of large, complex programs which involve complicated sets of data. Hence, a methodology of programming is also bound to include all aspects of data structuring. Programs, after all, are concrete formulations of abstract algorithms based on particular representations and structures of data. An outstanding contribution to bring order into the bewildering variety of terminology and concepts on data structures was made by Hoare through his Notes on Data Structuring [3]. It made clear that decisions about structuring data cannot be made without knowledge of the algorithms applied to the data and that, vice versa, the structure and choice of algorithms often depend strongly on the structure of the underlying data.
In short, the subjects of program composition and data structures are inseparably intertwined. Yet, this book starts with a chapter on data structure for two reasons. First, one has an intuitive feeling that data precede algorithms: you must have some objects before you can perform operations on them. Second, and this is the more immediate reason, this book assumes that the reader is familiar with the basic notions of computer programming. Traditionally and sensibly, however, introductory programming courses concentrate on algorithms operating on relatively simple structures of data. Hence, an introductory chapter on data structures seems appropriate.

Throughout the book, and particularly in Chap. 1, we follow the theory and terminology expounded by Hoare and realized in the programming language Pascal [4]. The essence of this theory is that data in the first instance represent abstractions of real phenomena and are preferably formulated as abstract structures not necessarily realized in common programming languages. In the process of program construction the data representation is gradually refined -- in step with the refinement of the algorithm -- to comply more and more with the constraints imposed by an available programming system [5]. We therefore postulate a number of basic building principles of data structures, called the fundamental structures. It is most important that they are constructs that are known to be quite easily implementable on actual computers, for only in this case can they be considered the true elements of an actual data representation, as the molecules emerging from the final step of refinements of the data description. They are the record, the array (with fixed size), and the set. Not surprisingly, these basic building principles correspond to mathematical notions that are fundamental as well. A cornerstone of this theory of data structures is the distinction between fundamental and "advanced" structures.
The former are the molecules -- themselves built out of atoms -- that are the components of the latter. Variables of a fundamental structure change only their value, but never their structure and never the set of values they can assume. As a consequence, the size of the store they occupy remains constant. "Advanced" structures, however, are characterized by their change of value and structure during the execution of a program. More sophisticated techniques are therefore needed for their implementation. The sequence appears as a hybrid in this classification. It certainly varies its length; but that change in structure is of a trivial nature. Since the sequence plays a truly fundamental role in practically all computer systems, its treatment is included in Chap. 1.

The second chapter treats sorting algorithms. It displays a variety of different methods, all serving the same purpose. Mathematical analysis of some of these algorithms shows the advantages and disadvantages of the methods, and it makes the programmer aware of the importance of analysis in the choice of good solutions for a given problem. The partitioning into methods for sorting arrays and methods for sorting files (often called internal and external sorting) exhibits the crucial influence of data representation on the choice of applicable algorithms and on their complexity. The space allocated to sorting would not be so large were it not for the fact that sorting constitutes an ideal vehicle for illustrating so many principles of programming and situations occurring in most other applications. It often seems that one could compose an entire programming course by selecting examples from sorting only.

Another topic that is usually omitted in introductory programming courses but one that plays an important role in the conception of many algorithmic solutions is recursion. Therefore, the third chapter is devoted to recursive algorithms. Recursion is shown to be a generalization of repetition (iteration), and as such it is an important and powerful concept in programming. In many programming tutorials, it is unfortunately exemplified by cases in which simple iteration would suffice. Instead, Chap. 3 concentrates on several examples of problems in which recursion allows for a most natural formulation of a solution, whereas use of iteration would lead to obscure and cumbersome programs. The class of backtracking algorithms emerges as an ideal application of recursion, but the most obvious candidates for the use of recursion are algorithms operating on data whose structure is defined recursively. These cases are treated in the last two chapters, for which the third chapter provides a welcome background.

Chapter 4 deals with dynamic data structures, i.e., with data that change their structure during the execution of the program. It is shown that the recursive data structures are an important subclass of the dynamic structures commonly used. Although a recursive definition is both natural and possible in these cases, it is usually not used in practice.
Instead, the mechanism used in its implementation is made evident to the programmer by forcing him to use explicit reference or pointer variables. This book follows this technique and reflects the present state of the art: Chapter 4 is devoted to programming with pointers, to lists, trees and to examples involving even more complicated meshes of data. It presents what is often (and somewhat inappropriately) called list processing. A fair amount of space is devoted to tree organizations, and in particular to search trees. The chapter ends with a presentation of scatter tables, also called "hash" codes, which are often preferred to search trees. This provides the possibility of comparing two fundamentally different techniques for a frequently encountered application.

Programming is a constructive activity. How can a constructive, inventive activity be taught? One method is to crystallize elementary composition principles out of many cases and exhibit them in a systematic manner. But programming is a field of vast variety often involving complex intellectual activities. The belief that it could ever be condensed into a sort of pure recipe teaching is mistaken. What remains in our arsenal of teaching methods is the careful selection and presentation of master examples. Naturally, we should not believe that every person is capable of gaining equally much from the study of examples. It is the characteristic of this approach that much is left to the student, to his diligence and intuition. This is particularly true of the relatively involved and long example programs. Their inclusion in this book is not accidental. Longer programs are the prevalent case in practice, and they are much more suitable for exhibiting that elusive but essential ingredient called style and orderly structure. They are also meant to serve as exercises in the art of program reading, which too often is neglected in favor of program writing.
This is a primary motivation behind the inclusion of larger programs as examples in their entirety. The reader is led through a gradual development of the program; he is given various snapshots in the evolution of a program, whereby this development becomes manifest as a stepwise refinement of the details. I consider it essential that programs are shown in final form with sufficient attention to details, for in programming, the devil hides in the details. Although the mere presentation of an algorithm's principle and its mathematical analysis may be stimulating and challenging to the academic mind, it seems dishonest to the engineering practitioner. I have therefore strictly adhered to the rule of presenting the final programs in a language in which they can actually be run on a computer.

Of course, this raises the problem of finding a form which at the same time is both machine executable and sufficiently machine independent to be included in such a text. In this respect, neither widely used languages nor abstract notations proved to be adequate. The language Pascal provides an appropriate compromise; it had been developed with exactly this aim in mind, and it is therefore used throughout this book. The programs can easily be understood by programmers who are familiar with some other high-level language, such as ALGOL 60 or PL/1, because it is easy to understand the Pascal notation while proceeding through the text. However, this is not to say that some preparation would not be beneficial. The book Systematic Programming [6] provides an ideal background because it is also based on the Pascal notation. The present book was, however, not intended as a manual on the language Pascal; there exist more appropriate texts for this purpose [7].

This book is a condensation -- and at the same time an elaboration -- of several courses on programming taught at the Federal Institute of Technology (ETH) at Zürich. I owe many ideas and views expressed in this book to discussions with my collaborators at ETH. In particular, I wish to thank Mr. H. Sandmayr for his careful reading of the manuscript, and Miss Heidi Theiler and my wife for their care and patience in typing the text. I should also like to mention the stimulating influence provided by meetings of the Working Groups 2.1 and 2.3 of IFIP, and particularly the many memorable arguments I had on these occasions with E. W. Dijkstra and C.A.R. Hoare. Last but not least, ETH generously provided the environment and the computing facilities without which the preparation of this text would have been impossible.

Zürich, Aug. 1975
N. Wirth

1. In Structured Programming. O.-J. Dahl, E.W. Dijkstra, C.A.R. Hoare. F. Genuys, Ed. (New York; Academic Press, 1972), pp. 1-82.
2. In Comm. ACM, 12, No. 10 (1969), 576-83.
3. In Structured Programming, pp. 83-174.
4. N. Wirth. The Programming Language Pascal. Acta Informatica, 1, No. 1 (1971), 35-63.
5. N. Wirth. Program Development by Stepwise Refinement. Comm. ACM, 14, No. 4 (1971), 221-27.
6. N. Wirth. Systematic Programming. (Englewood Cliffs, N.J. Prentice-Hall, Inc., 1973.)
7. K. Jensen and N. Wirth. PASCAL-User Manual and Report. (Berlin, Heidelberg, New York; Springer-Verlag, 1974).

Preface To The 1985 Edition

This new Edition incorporates many revisions of details and several changes of more significant nature. They were all motivated by experiences made in the ten years since the first Edition appeared. Most of the contents and the style of the text, however, have been retained. We briefly summarize the major alterations.

The major change which pervades the entire text concerns the programming language used to express the algorithms. Pascal has been replaced by Modula-2. Although this change is of no fundamental influence to the presentation of the algorithms, the choice is justified by the simpler and more elegant syntactic structures of Modula-2, which often lead to a more lucid representation of an algorithm's structure. Apart from this, it appeared advisable to use a notation that is rapidly gaining acceptance by a wide community, because it is well-suited for the development of large programming systems. Nevertheless, the fact that Pascal is Modula's ancestor is very evident and eases the task of a transition. The syntax of Modula is summarized in the Appendix for easy reference.

As a direct consequence of this change of programming language, Sect. 1.11 on the sequential file structure has been rewritten. Modula-2 does not offer a built-in file type. The revised Sect. 1.11 presents the concept of a sequence as a data structure in a more general manner, and it introduces a set of program modules that incorporate the sequence concept in Modula-2 specifically.

The last part of Chapter 1 is new. It is dedicated to the subject of searching and, starting out with linear and binary search, leads to some recently invented fast string searching algorithms. In this section in particular we use assertions and loop invariants to demonstrate the correctness of the presented algorithms.

A new section on priority search trees rounds off the chapter on dynamic data structures. Also this species of trees was unknown when the first Edition appeared.
They allow an economical representation and a fast search of point sets in a plane.

The entire fifth chapter of the first Edition has been omitted. It was felt that the subject of compiler construction was somewhat isolated from the preceding chapters and would rather merit a more extensive treatment in its own volume.

Finally, the appearance of the new Edition reflects a development that has profoundly influenced publications in the last ten years: the use of computers and sophisticated algorithms to prepare and automatically typeset documents. This book was edited and laid out by the author with the aid of a Lilith computer and its document editor Lara. Without these tools, not only would the book become more costly, but it would certainly not be finished yet.

Palo Alto, March 1985
N. Wirth

Notation

The following notations, adopted from publications of E.W. Dijkstra, are used in this book. In logical expressions, the character & denotes conjunction and is pronounced as and. The character ~ denotes negation and is pronounced as not. Boldface A and E are used to denote the universal and existential quantifiers. In the following formulas, the left part is the notation used and defined here in terms of the right part. Note that the left parts avoid the use of the symbol "...", which appeals to the reader's intuition.

    Ai: m ≤ i < n : P_i  ≡  P_m & P_{m+1} & ... & P_{n-1}

The P_i are predicates, and the formula asserts that for all indices i ranging from a given value m to, but excluding a value n, P_i holds.

    Ei: m ≤ i < n : P_i  ≡  P_m or P_{m+1} or ... or P_{n-1}

The P_i are predicates, and the formula asserts that for some index i ranging from a given value m to, but excluding a value n, P_i holds.

    Si: m ≤ i < n : x_i  =  x_m + x_{m+1} + ... + x_{n-1}

    MIN i: m ≤ i < n : x_i  =  minimum(x_m, ..., x_{n-1})

    MAX i: m ≤ i < n : x_i  =  maximum(x_m, ..., x_{n-1})
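Each of these notations corresponds to a simple loop over the index range m ≤ i < n. The following Oberon module is only a sketch of that correspondence. The module name Quantifiers, the procedure names, and the choice of the predicate P_i as x[i] > 0 are assumptions made for illustration and do not appear in the text. All procedures assume 0 ≤ m and n ≤ LEN(x).

```oberon
MODULE Quantifiers; (*hypothetical module; the predicate x[i] > 0 is an assumed example*)

  (* Ai: m <= i < n : x[i] > 0 *)
  PROCEDURE AllPositive* (x: ARRAY OF INTEGER; m, n: INTEGER): BOOLEAN;
    VAR i: INTEGER;
  BEGIN i := m;
    WHILE (i < n) & (x[i] > 0) DO INC(i) END;
    RETURN i >= n (*TRUE also for the empty range*)
  END AllPositive;

  (* Ei: m <= i < n : x[i] > 0 *)
  PROCEDURE SomePositive* (x: ARRAY OF INTEGER; m, n: INTEGER): BOOLEAN;
    VAR i: INTEGER;
  BEGIN i := m;
    WHILE (i < n) & ~(x[i] > 0) DO INC(i) END;
    RETURN i < n (*FALSE for the empty range*)
  END SomePositive;

  (* Si: m <= i < n : x[i] *)
  PROCEDURE Sum* (x: ARRAY OF INTEGER; m, n: INTEGER): INTEGER;
    VAR i, s: INTEGER;
  BEGIN s := 0;
    FOR i := m TO n-1 DO s := s + x[i] END;
    RETURN s
  END Sum;

  (* MIN i: m <= i < n : x[i]; requires a non-empty range, m < n *)
  PROCEDURE Least* (x: ARRAY OF INTEGER; m, n: INTEGER): INTEGER;
    VAR i, least: INTEGER;
  BEGIN least := x[m];
    FOR i := m+1 TO n-1 DO
      IF x[i] < least THEN least := x[i] END
    END;
    RETURN least
  END Least;

END Quantifiers.
```

The two quantifiers differ only in the loop guard and the final test: the universal form stops at the first index where the predicate fails, the existential form at the first index where it holds. MAX is obtained from Least by reversing the comparison.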


1. Fundamental Data Structures

1.1. Introduction

The modern digital computer was invented and intended as a device that should facilitate and speed up complicated and time-consuming computations. In the majority of applications its capability to store and access large amounts of information plays the dominant part and is considered to be its primary characteristic, and its ability to compute, i.e., to calculate, to perform arithmetic, has in many cases become almost irrelevant.

In all these cases, the large amount of information that is to be processed in some sense represents an abstraction of a part of reality. The information that is available to the computer consists of a selected set of data about the actual problem, namely that set that is considered relevant to the problem at hand, that set from which it is believed that the desired results can be derived. The data represent an abstraction of reality in the sense that certain properties and characteristics of the real objects are ignored because they are peripheral and irrelevant to the particular problem. An abstraction is thereby also a simplification of facts.

We may regard a personnel file of an employer as an example. Every employee is represented (abstracted) on this file by a set of data relevant either to the employer or to his accounting procedures. This set may include some identification of the employee, for example, his or her name and salary. But it will most probably not include irrelevant data such as the hair color, weight, and height.

In solving a problem with or without a computer it is necessary to choose an abstraction of reality, i.e., to define a set of data that is to represent the real situation. This choice must be guided by the problem to be solved. Then follows a choice of representation of this information. This choice is guided by the tool that is to solve the problem, i.e., by the facilities offered by the computer. In most cases these two steps are not entirely separable.
The choice of representation of data is often a fairly difficult one, and it is not uniquely determined by the facilities available. It must always be taken in the light of the operations that are to be performed on the data. A good example is the representation of numbers, which are themselves abstractions of properties of objects to be characterized. If addition is the only (or at least the dominant) operation to be performed, then a good way to represent the number n is to write n strokes. The addition rule on this representation is indeed very obvious and simple. The Roman numerals are based on the same principle of simplicity, and the adding rules are similarly straightforward for small numbers. On the other hand, the representation by Arabic numerals requires rules that are far from obvious (for small numbers) and they must be memorized. However, the situation is reversed when we consider either addition of large numbers or multiplication and division. The decomposition of these operations into simpler ones is much easier in the case of representation by Arabic numerals because of their systematic structuring principle that is based on positional weight of the digits. It is generally known that computers use an internal representation based on binary digits (bits). This representation is unsuitable for human beings because of the usually large number of digits involved, but it is most suitable for electronic circuits because the two values 0 and 1 can be represented conveniently and reliably by the presence or absence of electric currents, electric charge, or magnetic fields. From this example we can also see that the question of representation often transcends several levels of detail. Given the problem of representing, say, the position of an object, the first decision may lead to the choice of a pair of real numbers in, say, either Cartesian or polar coordinates. 
The second decision may lead to a floating-point representation, where every real number x consists of a pair of integers denoting a fraction f and an exponent e to a certain base (such that x = f × 2^e). The third decision, based on the knowledge that the data are to be stored in a computer, may lead to a binary, positional representation of integers, and the final decision could be to represent binary digits by the electric charge in a semiconductor storage device. Evidently, the first decision in this chain is mainly influenced by the problem situation, and the later ones are progressively dependent on the tool and its technology. Thus, it can hardly be required that a programmer decide on the number representation to be employed, or even on the storage device characteristics. These lower-level decisions can be left to the designers of computer equipment, who have the most information available on current technology with which to make a sensible choice that will be acceptable for all (or almost all) applications where numbers play a role.
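As a small worked illustration of the second and third decisions in this chain, consider a single coordinate value. The value 6.5 and the particular pair (f, e) are chosen arbitrarily here and are not taken from the text; many other pairs represent the same number.

```latex
\begin{align*}
x &= 6.5 = f \times 2^{e} \quad\text{with}\quad f = 13,\ e = -1 \\
f &= 13 = 1\cdot 2^{3} + 1\cdot 2^{2} + 0\cdot 2^{1} + 1\cdot 2^{0} = 1101_{2}
\end{align*}
```

The first line is the floating-point decision (a pair of integers), the second the positional binary representation of one of those integers. Only the final step, mapping the digits 1101 to electric charges, depends on the storage technology.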

In this context, the significance of programming languages becomes apparent. A programming language represents an abstract computer capable of interpreting the terms used in this language, which may embody a certain level of abstraction from the objects used by the actual machine. Thus, the programmer who uses such a higher-level language will be freed (and barred) from questions of number representation, if the number is an elementary object in the realm of this language.

The importance of using a language that offers a convenient set of basic abstractions common to most problems of data processing lies mainly in the area of reliability of the resulting programs. It is easier to design a program based on reasoning with familiar notions of numbers, sets, sequences, and repetitions than on bits, storage units, and jumps. Of course, an actual computer represents all data, whether numbers, sets, or sequences, as a large mass of bits. But this is irrelevant to the programmer as long as he or she does not have to worry about the details of representation of the chosen abstractions, and as long as he or she can rest assured that the corresponding representation chosen by the computer (or compiler) is reasonable for the stated purposes.

The closer the abstractions are to a given computer, the easier it is to make a representation choice for the engineer or implementor of the language, and the higher is the probability that a single choice will be suitable for all (or almost all) conceivable applications. This fact sets definite limits on the degree of abstraction from a given real computer. For example, it would not make sense to include geometric objects as basic data items in a general-purpose language, since their proper representation will, because of its inherent complexity, be largely dependent on the operations to be applied to these objects.
The nature and frequency of these operations will, however, not be known to the designer of a general-purpose language and its compiler, and any choice the designer makes may be inappropriate for some potential applications. In this book these deliberations determine the choice of notation for the description of algorithms and their data. Clearly, we wish to use familiar notions of mathematics, such as numbers, sets, sequences, and so on, rather than computer-dependent entities such as bitstrings. But equally clearly we wish to use a notation for which efficient compilers are known to exist. It is equally unwise to use a closely machine-oriented and machine-dependent language, as it is unhelpful to describe computer programs in an abstract notation that leaves problems of representation widely open. The programming language Pascal had been designed in an attempt to find a compromise between these extremes, and the successor languages Modula-2 and Oberon are the result of decades of experience [1-3]. Oberon retains Pascal's basic concepts and incorporates some improvements and some extensions; it is used throughout this book [1-5]. It has been successfully implemented on several computers, and it has been shown that the notation is sufficiently close to real machines that the chosen features and their representations can be clearly explained. The language is also sufficiently close to other languages, and hence the lessons taught here may equally well be applied in their use.

1.2. The Concept of Data Type

In mathematics it is customary to classify variables according to certain important characteristics. Clear distinctions are made between real, complex, and logical variables or between variables representing individual values, or sets of values, or sets of sets, or between functions, functionals, sets of functions, and so on. This notion of classification is equally if not more important in data processing. We will adhere to the principle that every constant, variable, expression, or function is of a certain type. This type essentially characterizes the set of values to which a constant belongs, or which can be assumed by a variable or expression, or which can be generated by a function.

In mathematical texts the type of a variable is usually deducible from the typeface without consideration of context; this is not feasible in computer programs. Usually there is one typeface available on computer equipment (i.e., Latin letters). The rule is therefore widely accepted that the associated type is made explicit in a declaration of the constant, variable, or function, and that this declaration textually precedes the application of that constant, variable, or function.

This rule is particularly sensible if one considers the fact that a compiler has to make a choice of representation of the object within the store of a computer. Evidently, the amount of storage allocated to a variable will have to be chosen according to the size of the range of values that the variable may assume. If this information is known to a compiler, so-called dynamic storage allocation can be avoided. This is very often the key to an efficient realization of an algorithm.

The primary characteristics of the concept of type that is used throughout this text, and that is embodied in the programming language Oberon, are the following [1-2]:

1. A data type determines the set of values to which a constant belongs, or which may be assumed by a variable or an expression, or which may be generated by an operator or a function.
2. The type of a value denoted by a constant, variable, or expression may be derived from its form or its declaration without the necessity of executing the computational process.
3. Each operator or function expects arguments of a fixed type and yields a result of a fixed type. If an operator admits arguments of several types (e.g., + is used for addition of both integers and real numbers), then the type of the result can be determined from specific language rules.

As a consequence, a compiler may use this information on types to check the legality of various constructs. For example, the mistaken assignment of a Boolean (logical) value to an arithmetic variable may be detected without executing the program. This kind of redundancy in the program text is extremely useful as an aid in the development of programs, and it must be considered as the primary advantage of good high-level languages over machine code (or symbolic assembly code). Evidently, the data will ultimately be represented by a large number of binary digits, irrespective of whether or not the program had initially been conceived in a high-level language using the concept of type or in a typeless assembly code. To the computer, the store is a homogeneous mass of bits without apparent structure. But it is exactly this abstract structure which alone enables human programmers to recognize meaning in the monotonous landscape of a computer store.

The theory presented in this book and the programming language Oberon specify certain methods of defining data types. In most cases new data types are defined in terms of previously defined data types.
Values of such a type are usually conglomerates of component values of the previously defined constituent types, and they are said to be structured. If there is only one constituent type, that is, if all components are of the same constituent type, then it is known as the base type. The number of distinct values belonging to a type T is called its cardinality. The cardinality provides a measure for the amount of storage needed to represent a variable x of the type T, denoted by x: T. Since constituent types may again be structured, entire hierarchies of structures may be built up, but, obviously, the ultimate components of a structure are atomic. Therefore, it is necessary that a notation is provided to introduce such primitive, unstructured types as well. A straightforward method is that of enumerating the values that are to constitute the type. For example in a program concerned with plane geometric figures, we may introduce a primitive type called shape, whose values may be denoted by the identifiers rectangle, square, ellipse, circle. But apart from such programmer-defined types, there will have to be some standard, predefined types. They usually include numbers and logical values. If an ordering exists among the individual values, then the type is said to be ordered or scalar. In Oberon, all unstructured types are ordered; in the case of explicit enumeration, the values are assumed to be ordered by their enumeration sequence. With this tool in hand, it is possible to define primitive types and to build conglomerates, structured types up to an arbitrary degree of nesting. In practice, it is not sufficient to have only one general method of combining constituent types into a structure. With due regard to practical problems of representation and use, a general-purpose programming language must offer several methods of structuring. In a mathematical sense, they are equivalent; they differ in the operators available to select components of these structures. 
The basic structuring methods presented here are the array, the record, the set, and the sequence. More complicated structures are not usually defined as static types, but are instead dynamically generated during the execution of the program, when they may vary in size and shape. Such structures are the subject of Chap. 4 and include lists, rings, trees, and general, finite graphs. Variables and data types are introduced in a program in order to be used for computation. To this end, a set of operators must be available. For each standard data type a programming language offers a certain set of primitive, standard operators, and likewise with each structuring method a distinct operation and notation for selecting a component. The task of composition of operations is often considered the heart of the art of programming. However, it will become evident that the appropriate composition of data is equally fundamental and essential.

The most important basic operators are comparison and assignment, i.e., the test for equality (and for order in the case of ordered types), and the command to enforce equality. The fundamental difference between these two operations is emphasized by the clear distinction in their denotation throughout this text.

    Test for equality:   x = y     (an expression with value TRUE or FALSE)
    Assignment to x:     x := y    (a statement making x equal to y)

These fundamental operators are defined for most data types, but it should be noted that their execution may involve a substantial amount of computational effort, if the data are large and highly structured. For the standard primitive data types, we postulate not only the availability of assignment and comparison, but also a set of operators to create (compute) new values. Thus we introduce the standard operations of arithmetic for numeric types and the elementary operators of propositional logic for logical values.

1.3. Primitive Data Types

A new, primitive type is definable by enumerating the distinct values belonging to it. Such a type is called an enumeration type. Its definition has the form

    TYPE T = (c1, c2, ... , cn)

T is the new type identifier, and the ci are the new constant identifiers.

Examples

    TYPE shape = (rectangle, square, ellipse, circle)
    TYPE color = (red, yellow, green)
    TYPE sex = (male, female)
    TYPE weekday = (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday)
    TYPE currency = (franc, mark, pound, dollar, shilling, lira, guilder, krone, ruble, cruzeiro, yen)
    TYPE destination = (hell, purgatory, heaven)
    TYPE vehicle = (train, bus, automobile, boat, airplane)
    TYPE rank = (private, corporal, sergeant, lieutenant, captain, major, colonel, general)
    TYPE object = (constant, type, variable, procedure, module)
    TYPE structure = (array, record, set, sequence)
    TYPE condition = (manual, unloaded, parity, skew)

The definition of such types introduces not only a new type identifier, but at the same time the set of identifiers denoting the values of the new type. These identifiers may then be used as constants throughout the program, and they enhance its understandability considerably. If, as an example, we introduce variables s, d, r, and b

    VAR s: sex
    VAR d: weekday
    VAR r: rank
    VAR b: BOOLEAN

then the following assignment statements are possible:

    s := male
    d := Sunday
    r := major
    b := TRUE

Evidently, they are considerably more informative than their counterparts

    s := 1
    d := 7
    r := 6
    b := 2

which are based on the assumption that s, d, r, and b are defined as integers and that the constants are mapped onto the natural numbers in the order of their enumeration. Furthermore, a compiler can check against the inconsistent use of operators. For example, given the declaration of s above, the statement s := s+1 would be meaningless. If, however, we recall that enumerations are ordered, then it is sensible to introduce operators that generate the successor and predecessor of their argument. We therefore postulate the following standard operators, which assign to their argument its successor and predecessor respectively:

    INC(x)
    DEC(x)
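The behaviour of an enumeration type can be tried out in a few lines. The following is only a sketch in Python (chosen so that it can be run directly; it is not Oberon). The names Weekday, succ, and pred are assumptions made for this illustration, and succ and pred return the new value instead of assigning it to their argument as INC and DEC do.

```python
from enum import Enum


class Weekday(Enum):
    """Enumeration type; the values are numbered in enumeration order, from 1."""
    MONDAY = 1
    TUESDAY = 2
    WEDNESDAY = 3
    THURSDAY = 4
    FRIDAY = 5
    SATURDAY = 6
    SUNDAY = 7


def succ(x):
    """Successor of an enumeration value (the value that INC(x) would assign)."""
    # Raises ValueError if x is already the last value of its type.
    return type(x)(x.value + 1)


def pred(x):
    """Predecessor of an enumeration value (the value that DEC(x) would assign)."""
    # Raises ValueError if x is already the first value of its type.
    return type(x)(x.value - 1)


d = Weekday.SUNDAY            # more informative than d = 7
d = pred(d)                   # d is now Weekday.SATURDAY
```

An expression such as `Weekday.MONDAY + 1` is rejected with a TypeError, although only when the program runs, whereas a compiler for a statically typed language reports the corresponding mistake before execution.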

1.4. Standard Primitive Types

Standard primitive types are those types that are available on most computers as built-in features. They include the whole numbers, the logical truth values, and a set of printable characters. On many computers fractional numbers are also incorporated, together with the standard arithmetic operations. We denote these types by the identifiers

    INTEGER, REAL, BOOLEAN, CHAR, SET

1.4.1. Integer types

The type INTEGER comprises a subset of the whole numbers whose size may vary among individual computer systems. If a computer uses n bits to represent an integer in two's complement notation, then the admissible values x must satisfy $-2^{n-1} \le x < 2^{n-1}$. It is assumed that all operations on data of this type are exact and correspond to the ordinary laws of arithmetic, and that the computation will be interrupted in the case of a result lying outside the representable subset. This event is called overflow. The standard operators are the four basic arithmetic operations of addition (+), subtraction (-), multiplication (*), and division (/, DIV).

Whereas the slash denotes ordinary division resulting in a value of type REAL, the operator DIV denotes integer division resulting in a value of type INTEGER. If we define the quotient q = m DIV n and the remainder r = m MOD n, the following relations hold, assuming n > 0:

    q*n + r = m   and   0 ≤ r < n

Examples:

    31 DIV 10 = 3        31 MOD 10 = 1
    -31 DIV 10 = -4      -31 MOD 10 = 9
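The two relations can be checked mechanically. The sketch below uses Python rather than Oberon, because Python's floor division `//` and remainder `%` happen to follow the same definition for n > 0. The function name div_mod is an assumption made for this illustration.

```python
def div_mod(m, n):
    """Return (q, r) such that q*n + r == m and 0 <= r < n, assuming n > 0."""
    q = m // n   # rounds towards minus infinity, like DIV as defined above
    r = m % n    # never negative for n > 0, like MOD as defined above
    return q, r


for m in (31, -31):
    q, r = div_mod(m, 10)
    # The defining relations from the text.
    assert q * 10 + r == m and 0 <= r < 10
```

Here div_mod(31, 10) yields (3, 1) and div_mod(-31, 10) yields (-4, 9), in agreement with the examples. Languages whose integer division truncates towards zero would give -3 and -1 in the second case, which violates 0 ≤ r < n.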

We know that dividing by $10^n$ can be achieved by merely shifting the decimal digits n places to the right and thereby ignoring the lost digits. The same method applies if numbers are represented in binary instead of decimal form. If two's complement representation is used (as in practically all modern computers), then the shifts implement a division as defined by the above DIV operation. Moderately sophisticated compilers will therefore represent an operation of the form m DIV $2^n$ or m MOD $2^n$ by a fast shift (or mask) operation.

1.4.2. The type REAL

The type REAL denotes a subset of the real numbers. Whereas arithmetic with operands of the type INTEGER is assumed to yield exact results, arithmetic on values of type REAL is permitted to be inaccurate within the limits of round-off errors caused by computation on a finite number of digits. This is the principal reason for the explicit distinction between the types INTEGER and REAL, as it is made in most programming languages. The standard operators are the four basic arithmetic operations of addition (+), subtraction (-), multiplication (*), and division (/).

It is of the essence of data typing that different types are incompatible under assignment. An exception to this rule is made for assignment of integer values to real variables, because here the semantics are unambiguous. After all, integers form a subset of real numbers. However, the inverse direction is not permissible: assignment of a real value to an integer variable requires an operation such as truncation or rounding. The standard transfer function Entier(x) yields the integral part of x. Rounding of x is obtained by Entier(x + 0.5).
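Two of the statements above lend themselves to a quick experiment: the equivalence of DIV and MOD by a power of 2 with a shift and a mask, and rounding by means of Entier. The following Python sketch rests on two assumptions. First, Python integers behave under `>>` and `&` like two's complement numbers of unlimited length. Second, Entier(x) is taken to be the largest integer not greater than x. All function names are chosen for this illustration only.

```python
import math


def div_pow2(m, n):
    """m DIV 2^n, computed by shifting m right by n binary places."""
    return m >> n


def mod_pow2(m, n):
    """m MOD 2^n, computed by masking out the n lowest binary digits."""
    return m & ((1 << n) - 1)


def entier(x):
    """Largest integer not greater than x (assumed meaning of Entier)."""
    return math.floor(x)


def round_to_integer(x):
    """Rounding as described in the text: Entier(x + 0.5)."""
    return entier(x + 0.5)


assert div_pow2(-31, 2) == -31 // 4      # both give -8
assert mod_pow2(-31, 2) == -31 % 4       # both give 1
assert round_to_integer(2.4) == 2 and round_to_integer(2.5) == 3
```

The shift agrees with DIV for negative operands only because DIV rounds towards minus infinity; a division that truncates towards zero could not be replaced by a plain shift.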

Many programming languages do not include an exponentiation operator. The following is an algorithm for the fast computation of $y = x^n$, where n is a non-negative integer.

    y := 1.0; i := n;
    WHILE i > 0 DO (* x0^n = x^i * y *)
      IF ODD(i) THEN y := y*x END ;
      x := x*x; i := i DIV 2
    END

1.4.3. The type BOOLEAN

The two values of the standard type BOOLEAN are denoted by the identifiers TRUE and FALSE. The Boolean operators are the logical conjunction, disjunction, and negation whose values are defined in Table 1.1. The logical conjunction is denoted by the symbol &, the logical disjunction by OR, and negation by "~". Note that comparisons are operations yielding a result of type BOOLEAN. Thus, the result of a comparison may be assigned to a variable, or it may be used as an operand of a logical operator in a Boolean expression. For instance, given Boolean variables p and q and integer variables x = 5, y = 8, z = 10, the two assignments

    p := x = y
    q := (x ≤ y) & (y < z)

yield p = FALSE and q = TRUE.

    p       q       p & q   p OR q  ~p
    TRUE    TRUE    TRUE    TRUE    FALSE
    TRUE    FALSE   FALSE   TRUE    FALSE
    FALSE   TRUE    FALSE   TRUE    TRUE
    FALSE   FALSE   FALSE   FALSE   TRUE

Table 1.1 Boolean Operators.

The Boolean operators & (AND) and OR have an additional property in most programming languages, which distinguishes them from other dyadic operators. Whereas, for example, the sum x+y is not defined if either x or y is undefined, the conjunction p&q is defined even if q is undefined, provided that p is FALSE. This conditionality is an important and useful property. The exact definition of & and OR is therefore given by the following equations:

    p & q    =  if p then q else FALSE
    p OR q   =  if p then TRUE else q
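The two defining equations can be transcribed almost literally. In the Python sketch below (not Oberon), the second operand is passed as a parameterless function so that it is evaluated only when needed. The names cand, cor, and contains are assumptions made for this illustration. The search loop shows a typical use: the guard a[i] != x is evaluated only if i is a valid index.

```python
def cand(p, q):
    """p & q  =  if p then q else FALSE   (q is a function, called only if p holds)."""
    return q() if p else False


def cor(p, q):
    """p OR q  =  if p then TRUE else q   (q is a function, called only if p fails)."""
    return True if p else q()


def contains(a, x):
    """Linear search relying on the conditional evaluation of Python's 'and'."""
    i = 0
    # a[i] is never accessed with i == len(a), because the second
    # operand is skipped as soon as the first one is false.
    while i < len(a) and a[i] != x:
        i += 1
    return i < len(a)


# The second operand is undefined (division by zero), yet the result is defined.
assert cand(False, lambda: 1 / 0 > 0) is False
assert cor(True, lambda: 1 / 0 > 0) is True
```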

1.4.4. The type CHAR

The standard type CHAR comprises a set of printable characters. Unfortunately, there is no generally accepted standard character set used on all computer systems. Therefore, the use of the predicate "standard" may in this case be almost misleading; it is to be understood in the sense of "standard on the computer system on which a certain program is to be executed."

The character set defined by the International Organization for Standardization (ISO), and particularly its American version ASCII (American Standard Code for Information Interchange), is the most widely accepted set. The ASCII set is therefore tabulated in Appendix A. It consists of 95 printable (graphic) characters and 33 control characters, the latter mainly being used in data transmission and for the control of printing equipment.

In order to be able to design algorithms involving characters (i.e., values of type CHAR) that are system independent, we should like to be able to assume certain minimal properties of character sets, namely:

1. The type CHAR contains the 26 capital Latin letters, the 26 lower-case letters, the 10 decimal digits, and a number of other graphic characters, such as punctuation marks.
2. The subsets of letters and digits are ordered and contiguous, i.e.,

       ("A" ≤ x) & (x ≤ "Z")   implies that x is a capital letter
       ("a" ≤ x) & (x ≤ "z")   implies that x is a lower-case letter
       ("0" ≤ x) & (x ≤ "9")   implies that x is a decimal digit

3. The type CHAR contains a non-printing, blank character and a line-end character that may be used as separators.

    THIS IS A TEXT

Fig. 1.1. Representations of a text

The availability of two standard type transfer functions between the types CHAR and INTEGER is particularly important in the quest to write programs in a machine independent form. We will call them ORD(ch), denoting the ordinal number of ch in the character set, and CHR(i), denoting the character with ordinal number i. Thus, CHR is the inverse function of ORD, and vice versa, that is,

    ORD(CHR(i)) = i    (if CHR(i) is defined)
    CHR(ORD(c)) = c

Furthermore, we postulate a standard function CAP(ch). Its value is defined as the capital letter corresponding to ch, provided ch is a letter.

    ch is a lower-case letter   implies that   CAP(ch) = corresponding capital letter
    ch is a capital letter      implies that   CAP(ch) = ch
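The assumed properties of the character set are sufficient to program CAP and similar functions without reference to a particular code table. The following Python sketch (not Oberon) uses Python's built-in ord and chr in the roles of ORD and CHR. The function names and the choice to return non-letters unchanged are assumptions made for this illustration; the text leaves CAP undefined for non-letters.

```python
def is_capital(x):
    return "A" <= x <= "Z"


def is_lower(x):
    return "a" <= x <= "z"


def is_digit(x):
    return "0" <= x <= "9"


def cap(ch):
    """Capital letter corresponding to ch; relies only on contiguity of the letters."""
    if is_lower(ch):
        return chr(ord(ch) - ord("a") + ord("A"))
    return ch   # capital letters and (by assumption) all other characters are unchanged


def digit_value(ch):
    """Numeric value of a decimal digit; relies only on contiguity of the digits."""
    return ord(ch) - ord("0")


assert cap("q") == "Q" and cap("Q") == "Q"
assert digit_value("7") == 7
assert chr(ord("x")) == "x" and ord(chr(65)) == 65
```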

1.4.5. The type SET

The type SET denotes sets whose elements are integers in the range 0 to a small number, typically 31 or 63. Given, for example, variables

    VAR r, s, t: SET

possible assignments are

    r := {5}; s := {x, y .. z}; t := {}

Here, the value assigned to r is the singleton set consisting of the single element 5; to t is assigned the empty set, and to s the elements x, y, y+1, … , z-1, z. The following elementary operators are defined on variables of type SET:

    *    set intersection
    +    set union
    -    set difference
    /    symmetric set difference
    IN   set membership

Constructing the intersection or the union of two sets is often called set multiplication or set addition, respectively; the priorities of the set operators are defined accordingly, with the intersection operator having priority over the union and difference operators, which in turn have priority over the membership operator, which is classified as a relational operator. Following are examples of set expressions and their fully parenthesized equivalents:

    r*s+t      = (r*s) + t
    r-s*t      = r - (s*t)
    r-s+t      = (r-s) + t
    r+s/t      = r + (s/t)
    x IN s+t   = x IN (s+t)
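The parenthesized forms above can be evaluated with concrete values. The sketch below uses Python's built-in sets (not Oberon), with the correspondence * to &, + to |, - to -, / to ^, and IN to in. The sample values of r, s, and t are assumptions chosen for this illustration. The parentheses are written out explicitly because Python assigns different priorities to these operators than the rules stated above.

```python
r = {1, 2, 3, 5}
s = {2, 3, 4}
t = {3, 4, 6}

# Fully parenthesized equivalents of the five expressions in the text.
e1 = (r & s) | t        # r*s+t  = (r*s) + t
e2 = r - (s & t)        # r-s*t  = r - (s*t)
e3 = (r - s) | t        # r-s+t  = (r-s) + t
e4 = r | (s ^ t)        # r+s/t  = r + (s/t)
e5 = 6 in (s | t)       # x IN s+t = x IN (s+t), with x = 6

assert e1 == {2, 3, 4, 6}
assert e2 == {1, 2, 5}
assert e3 == {1, 3, 4, 5, 6}
assert e4 == {1, 2, 3, 5, 6}
assert e5 is True

# Without parentheses Python groups r - s & t as (r - s) & t,
# which differs from the priority rule given in the text.
assert (r - s & t) == set()
```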

1.5. The Array Structure

The array is probably the most widely used data structure; in some languages it is even the only one available. An array consists of components which are all of the same type, called its base type; it is therefore called a homogeneous structure. The array is a random-access structure, because all components can be selected at random and are equally quickly accessible. In order to denote an individual component, the name of the entire structure is augmented by the index selecting the component. This index is to be an integer between 0 and n-1, where n is the number of elements, the size, of the array.

    TYPE T = ARRAY n OF T0

Examples

    TYPE Row = ARRAY 4 OF REAL
    TYPE Card = ARRAY 80 OF CHAR
    TYPE Name = ARRAY 32 OF CHAR

A particular value of a variable

    VAR x: Row

with all components satisfying the equation $x_i = 2^{-i}$, may be visualized as shown in Fig. 1.2.

    x0    1.0
    x1    0.5
    x2    0.25
    x3    0.125

Fig. 1.2. Array of type Row with $x_i = 2^{-i}$

An individual component of an array can be selected by an index. Given an array variable x, we denote an array selector by the array name followed by the respective component's index i, and we write $x_i$ or x[i]. Because of the first, conventional notation, a component of an array is therefore also called a subscripted variable.

The common way of operating with arrays, particularly with large arrays, is to selectively update single components rather than to construct entirely new structured values. This is expressed by considering an array variable as an array of component variables and by permitting assignments to selected components, such as for example x[i] := 0.125. Although selective updating causes only a single component value to change, from a conceptual point of view we must regard the entire composite value as having changed too.

The fact that array indices, i.e., names of array components, are integers, has a most important consequence: indices may be computed. A general index expression may be substituted in place of an index constant; this expression is to be evaluated, and the result identifies the selected component. This generality not only provides a most significant and powerful programming facility, but at the same time it also gives rise to one of the most frequently encountered programming mistakes: the resulting value may be outside the interval specified as the range of indices of the array. We will assume that decent computing systems provide a warning in the case of such a mistaken access to a non-existent array component.

The cardinality of a structured type, i.e., the number of values belonging to this type, is the product of the cardinalities of its components. Since all n components of an array type T are of the same base type T0, we obtain

    $card(T) = card(T0)^n$
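Both the index check and the cardinality formula can be illustrated briefly. The following Python sketch (not Oberon) is based on two assumptions of this illustration: an array of size n is modelled by a Python list, and the function select stands for the checked selector x[i]. The explicit check is needed because Python itself accepts negative indices. The function array_values enumerates every value of the type ARRAY n OF T0 for a small base type.

```python
from itertools import product


def select(x, i):
    """Checked array selector x[i]: the index must lie between 0 and n-1."""
    if not 0 <= i < len(x):
        raise IndexError("index %d outside 0 .. %d" % (i, len(x) - 1))
    return x[i]


def array_values(base_values, n):
    """All values of the type ARRAY n OF T0, given the values of T0."""
    return list(product(base_values, repeat=n))


x = [2.0 ** -i for i in range(4)]        # the value of Fig. 1.2
assert select(x, 3) == 0.125

booleans = (False, True)
# card(T) = card(T0)^n: there are 2^3 = 8 arrays of three Booleans.
assert len(array_values(booleans, 3)) == len(booleans) ** 3
```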

Constituents of array types may themselves be structured. An array variable whose components are again arrays is called a matrix. For example,

    M: ARRAY 10 OF Row

is an array consisting of ten components (rows), each consisting of four components of type REAL, and is called a 10 × 4 matrix with real components. Selectors may be concatenated accordingly, such that $M_{ij}$ and M[i][j] denote the j-th component of row $M_i$, which is the i-th component of M. This is usually abbreviated as M[i, j], and in the same spirit the declaration

    M: ARRAY 10 OF ARRAY 4 OF REAL

can be written more concisely as M: ARRAY 10, 4 OF REAL.

If a certain operation has to be performed on all components of an array or on adjacent components of a section of the array, then this fact may conveniently be emphasized by using the FOR statement, as shown in the following examples for computing the sum and for finding the maximal element of an array declared as

    VAR a: ARRAY N OF INTEGER

    sum := 0;
    FOR i := 0 TO N-1 DO sum := a[i] + sum END

    k := 0; max := a[0];
    FOR i := 1 TO N-1 DO
      IF max < a[i] THEN k := i; max := a[k] END
    END

In a further example, assume that a fraction f is represented in its decimal form with k-1 digits, i.e., by an array d such that

    $f = \sum_{i:\, 0 \le i < k} d_i \times 10^{-i}$

or

    $f = d_0 + 10^{-1} d_1 + 10^{-2} d_2 + \dots + 10^{-(k-1)} d_{k-1}$

Now assume that we wish to divide f by 2. This is done by repeating the familiar division operation for all k-1 digits $d_i$, starting with i=1. It consists of dividing each digit by 2 taking into account a possible carry from the previous position, and of retaining a possible remainder r for the next position:

    r := 10*r + d[i]; d[i] := r DIV 2; r := r MOD 2

This algorithm is used to compute a table of negative powers of 2. The repetition of halving to compute $2^{-1}, 2^{-2}, \dots, 2^{-N}$ is again appropriately expressed by a FOR statement, thus leading to a nesting of two FOR statements.

    PROCEDURE Power(VAR W: Texts.Writer; N: INTEGER);
      (*compute decimal representation of negative powers of 2*)
      VAR i, k, r: INTEGER;
        d: ARRAY N OF INTEGER;
    BEGIN
      FOR k := 0 TO N-1 DO
        Texts.Write(W, "."); r := 0;
        FOR i := 0 TO k-1 DO
          r := 10*r + d[i]; d[i] := r DIV 2; r := r MOD 2;
          Texts.Write(W, CHR(d[i] + ORD("0")))
        END ;
        d[k] := 5; Texts.Write(W, "5"); Texts.WriteLn(W)
      END
    END Power.

The resulting output text for N = 10 is

    .5
    .25
    .125
    .0625
    .03125
    .015625
    .0078125
    .00390625
    .001953125
    .0009765625
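The table can be reproduced with a direct transcription of the procedure. The following Python sketch (not Oberon) returns the lines as strings instead of writing them to a Texts.Writer; the function name negative_powers_of_two and this change of interface are assumptions made for this illustration.

```python
def negative_powers_of_two(n):
    """Decimal representations of 2^-1, 2^-2, ..., 2^-n, one string per power."""
    d = [0] * n                  # d[i] is the (i+1)-th digit after the decimal point
    lines = []
    for k in range(n):
        r = 0                    # remainder carried into the next position
        digits = []
        for i in range(k):       # halve the k digits of the previous power
            r = 10 * r + d[i]
            d[i] = r // 2
            r = r % 2
            digits.append(str(d[i]))
        d[k] = 5                 # the previous power ends in an odd digit, hence a final 5
        lines.append("." + "".join(digits) + "5")
    return lines


for line in negative_powers_of_two(10):
    print(line)
```

Since only integer operations on single digits are involved, every line is exact, however large n is chosen.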

1.6. The Record Structure

The most general method to obtain structured types is to join elements of arbitrary types, which are possibly themselves structured types, into a compound. Examples from mathematics are complex numbers, composed of two real numbers, and coordinates of points, composed of two or more numbers according to the dimensionality of the space spanned by the coordinate system. An example from data processing is describing people by a few relevant characteristics, such as their first and last names, their date of birth, sex, and marital status.

In mathematics such a compound type is the Cartesian product of its constituent types. This stems from the fact that the set of values defined by this compound type consists of all possible combinations of values, taken one from each set defined by each constituent type. Thus, the number of such combinations, also called n-tuples, is the product of the number of elements in each constituent set, that is, the cardinality of the compound type is the product of the cardinalities of the constituent types.

In data processing, composite types, such as descriptions of persons or objects, usually occur in files or data banks and record the relevant characteristics of a person or object. The word record has therefore become widely accepted to describe a compound of data of this nature, and we adopt this nomenclature in preference to the term Cartesian product. In general, a record type T with components of the types T1, T2, ... , Tn is defined as follows:

    TYPE T = RECORD s1: T1; s2: T2; ... sn: Tn END

    card(T) = card(T1) * card(T2) * ... * card(Tn)

Examples

    TYPE Complex = RECORD re, im: REAL END
    TYPE Date = RECORD day, month, year: INTEGER END
    TYPE Person = RECORD
        name, firstname: Name;
        birthdate: Date;
        sex: (male, female);
        marstatus: (single, married, widowed, divorced)
      END

We may visualize particular, record-structured values of, for example, the variables

    z: Complex
    d: Date
    p: Person

as shown in Fig. 1.3.

    Complex z      Date d       Person p

    1.0            1            SMITH
    -1.0           4            JOHN
                   1973         18  1  1986
                                male
                                single

Fig. 1.3. Records of type Complex, Date, and Person

The identifiers s1, s2, ... , sn introduced by a record type definition are the names given to the individual components of variables of that type. As components of records are called fields, the names are field identifiers. They are used in record selectors applied to record structured variables. Given a variable x: T, its i-th field is denoted by x.si. Selective updating of x is achieved by using the same selector denotation on the left side in an assignment statement:

    x.si := e

where e is a value (expression) of type Ti. Given, for example, the record variables z, d, and p declared above, the following are selectors of components:

    z.im              (of type REAL)
    d.month           (of type INTEGER)
    p.name            (of type Name)
    p.birthdate       (of type Date)
    p.birthdate.day   (of type INTEGER)

The example of the type Person shows that a constituent of a record type may itself be structured. Thus, selectors may be concatenated. Naturally, different structuring types may also be used in a nested fashion. For example, the i-th component of an array a being a component of a record variable r is denoted by r.a[i], and the component with the selector name s of the i-th record structured component of the array a is denoted by a[i].s.

It is a characteristic of the Cartesian product that it contains all combinations of elements of the constituent types. But it must be noted that in practical applications not all of them may be meaningful. For instance, the type Date as defined above includes the 31st April as well as the 29th February 1985, which are both dates that never occurred. Thus, the definition of this type does not mirror the actual situation entirely correctly; but it is close enough for practical purposes, and it is the responsibility of the programmer to ensure that meaningless values never occur during the execution of a program.

The following short excerpt from a program shows the use of record variables. Its purpose is to count the number of persons represented by the array variable family that are both female and single:

    VAR count: INTEGER;
      family: ARRAY N OF Person;

    count := 0;
    FOR i := 0 TO N-1 DO
      IF (family[i].sex = female) & (family[i].marstatus = single) THEN INC(count) END
    END

The record structure and the array structure have the common property that both are random-access structures. The record is more general in the sense that there is no requirement that all constituent types must be identical. In turn, the array offers greater flexibility by allowing its component selectors to be computable values (expressions), whereas the selectors of record components are field identifiers declared in the record type definition.
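The record types and the counting loop can be tried out as follows. This is only a sketch in Python (not Oberon), in which records are modelled by data classes and the two enumerations by Enum types. The class and function names, the sample persons other than the one shown in Fig. 1.3, and the use of the standard module datetime to recognize meaningless dates are all assumptions made for this illustration.

```python
import datetime
from dataclasses import dataclass
from enum import Enum


class Sex(Enum):
    MALE = 1
    FEMALE = 2


class MarStatus(Enum):
    SINGLE = 1
    MARRIED = 2
    WIDOWED = 3
    DIVORCED = 4


@dataclass
class Date:
    day: int
    month: int
    year: int


@dataclass
class Person:
    name: str
    firstname: str
    birthdate: Date
    sex: Sex
    marstatus: MarStatus


def count_single_women(family):
    """Number of persons in family that are both female and single."""
    count = 0
    for p in family:
        if p.sex is Sex.FEMALE and p.marstatus is MarStatus.SINGLE:
            count += 1
    return count


def is_meaningful(d):
    """True if d denotes a date that exists in the calendar."""
    try:
        datetime.date(d.year, d.month, d.day)
    except ValueError:
        return False
    return True


family = [
    Person("SMITH", "JOHN", Date(18, 1, 1986), Sex.MALE, MarStatus.SINGLE),
    Person("SMITH", "MARY", Date(2, 3, 1988), Sex.FEMALE, MarStatus.SINGLE),   # hypothetical
    Person("SMITH", "ANN", Date(5, 6, 1960), Sex.FEMALE, MarStatus.MARRIED),   # hypothetical
]
assert count_single_women(family) == 1
assert not is_meaningful(Date(31, 4, 1986)) and not is_meaningful(Date(29, 2, 1985))
```

The type Date accepts Date(31, 4, 1986) without complaint; it is the separate function is_meaningful that detects the problem, which mirrors the remark that the responsibility rests with the programmer.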

1.7. Representation of Arrays, Records, and Sets

The essence of the use of abstractions in programming is that a program may be conceived, understood, and verified on the basis of the laws governing the abstractions, and that it is not necessary to have further insight and knowledge about the ways in which the abstractions are implemented and represented in a particular computer. Nevertheless, it is essential for a professional programmer to have an understanding of widely used techniques for representing the basic concepts of programming abstractions, such as the fundamental data structures. It is helpful insofar as it might enable the programmer to make sensible decisions about program and data design in the light not only of the abstract properties of structures, but also of their realizations on actual computers, taking into account a computer's particular capabilities and limitations.

The problem of data representation is that of mapping the abstract structure onto a computer store. Computer stores are, in a first approximation, arrays of individual storage cells called bytes. They are understood to be groups of 8 bits. The indices of the bytes are called addresses.

    VAR store: ARRAY StoreSize OF BYTE

The basic types are represented by a small number of bytes, typically 2, 4, or 8. Computers are designed to transfer internally such small numbers (possibly 1) of contiguous bytes concurrently, "in parallel". The unit transferable concurrently is called a word.

1.7.1. Representation of Arrays

A representation of an array structure is a mapping of the (abstract) array with components of type T onto the store which is an array with components of type BYTE. The array should be mapped in such a way that the computation of addresses of array components is as simple (and therefore as efficient) as possible.
The address i of the j-th array component is computed by the linear mapping function

  i = i0 + j*s

where i0 is the address of the first component, and s is the number of words that a component occupies. Assuming that the word is the smallest individually transferable unit of store, it is evidently highly desirable that s be a whole number, the simplest case being s = 1. If s is not a whole number (and this is the normal case), then s is usually rounded up to the next larger integer S. Each array component then occupies S words, whereby S-s words are left unused (see Figs. 1.5 and 1.6). Rounding up of the number of words needed to the next whole number is called padding. The storage utilization factor u is the quotient of the minimal amount of storage needed to represent a structure and of the amount actually used:

  u = s / (s rounded up to nearest integer)

Fig. 1.5. Mapping an array onto a store

Fig. 1.6. Padded representation of a record (s = 2.3, S = 3; the remainder is unused)

Since an implementor has to aim for a storage utilization as close to 1 as possible, and since accessing parts of words is a cumbersome and relatively inefficient process, he or she must compromise. The following considerations are relevant:

1. Padding decreases storage utilization.
2. Omission of padding may necessitate inefficient partial word access.
3. Partial word access may cause the code (compiled program) to expand and therefore to counteract the gain obtained by omission of padding.

In fact, considerations 2 and 3 are usually so dominant that compilers always use padding automatically. We notice that the utilization factor is always u > 0.5, if s > 0.5. However, if s ≤ 0.5, the utilization factor may be significantly increased by putting more than one array component into each word. This technique is called packing. If n components are packed into a word, the utilization factor is (see Fig. 1.7)

  u = n*s / (n*s rounded up to nearest integer)
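The address mapping and the two utilization formulas can be tried out numerically. The following is a minimal sketch in Python (not the Oberon notation used in this text); the function names padded_size, component_address, and utilization are hypothetical and chosen only for this illustration, and sizes are measured in words as above.

```python
import math

def padded_size(s):
    """Words actually reserved per component: s rounded up to the next integer S."""
    return math.ceil(s)

def component_address(i0, j, s):
    """Address of the j-th component when every component is padded to S words."""
    return i0 + j * padded_size(s)

def utilization(s, n=1):
    """u = n*s / (n*s rounded up); n = 1 is plain padding, n > 1 is packing."""
    return (n * s) / math.ceil(n * s)

if __name__ == "__main__":
    # A component of 2.3 words is padded to 3 words.
    print(component_address(1000, 4, 2.3))
    print(round(utilization(2.3), 3))
    # Six small components sharing one word.
    print(round(utilization(0.15, n=6), 3))
```

With s = 2.3 the sketch reserves S = 3 words per component, so roughly a quarter of the store is padding; packing six components of 0.15 words each into one word wastes far less.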

Fig. 1.7. Packing 6 components into one word

Access to the i-th component of a packed array involves the computation of the word address j in which the desired component is located, and it involves the computation of the respective component position k within the word.

  j = i DIV n
  k = i MOD n
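As an illustration of the two computations above, here is a small Python sketch of a packed array (Python is used here instead of Oberon). The class PackedArray, its parameters bits and word_bits, and the choice of shifting and masking to reach position k are assumptions made for this sketch, not a description of any particular compiler.

```python
class PackedArray:
    """Array of small unsigned integers, n of them packed into each word."""

    def __init__(self, length, bits, word_bits=32):
        self.bits = bits                      # bits per component
        self.n = word_bits // bits            # components per word
        self.mask = (1 << bits) - 1           # selects one component
        self.length = length
        self.words = [0] * ((length + self.n - 1) // self.n)

    def get(self, i):
        j, k = divmod(i, self.n)              # j = i DIV n, k = i MOD n
        return (self.words[j] >> (k * self.bits)) & self.mask

    def set(self, i, value):
        j, k = divmod(i, self.n)
        shift = k * self.bits
        cleared = self.words[j] & ~(self.mask << shift)
        self.words[j] = cleared | ((value & self.mask) << shift)


if __name__ == "__main__":
    a = PackedArray(20, bits=5)               # 6 components fit into a 32-bit word
    a.set(7, 19)
    print(a.n, len(a.words), a.get(7))
```

Every access pays for a division, a shift, and a mask, which is the partial word access cost mentioned among the considerations above.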

In most programming languages the programmer is given no control over the representation of the abstract data structures. However, it should be possible to indicate the desirability of packing at least in those cases in which more than one component would fit into a single word, i.e., when a gain of storage economy by a factor of 2 and more could be achieved. We propose the convention to indicate the desirability of packing by prefixing the symbol ARRAY (or RECORD) in the declaration by the symbol PACKED.

1.7.2. Representation of Records

Records are mapped onto a computer store by simply juxtaposing their components. The address of a component (field) r_i relative to the origin address of the record r is called the field's offset k_i. It is computed as

  k_i = s_1 + s_2 + ... + s_{i-1}
  k_0 = 0

where sj is the size (in words) of the j-th component. We now realize that the fact that all components of an array are of equal type has the welcome consequence that ki = i×s. The generality of the record structure does unfortunately not allow such a simple, linear function for offset address computation, and it is therefore the very reason for the requirement that record components be selectable only by fixed identifiers. This restriction has the desirable benefit that the respective offsets are known at compile time. The resulting greater efficiency of record field access is well-known. The technique of packing may be beneficial, if several record components can be fitted into a single storage word (see Fig. 1.8). Since offsets are computable by the compiler, the offset of a field packed within a word may also be determined by the compiler. This means that on many computers packing of records causes a deterioration in access efficiency considerably smaller than that caused by the packing of arrays.
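The offset rule is a running sum of the sizes of the preceding fields. A minimal Python sketch follows; the function name field_offsets and the example sizes are hypothetical, and padding and packing are ignored.

```python
def field_offsets(sizes):
    """Offset of each field = sum of the sizes (in words) of all fields before it."""
    offsets = []
    k = 0
    for s in sizes:
        offsets.append(k)
        k += s
    return offsets


if __name__ == "__main__":
    # A record with four fields of different sizes.
    print(field_offsets([2, 1, 4, 1]))
    # Equal sizes, as in an array: the offsets reduce to i*s.
    print(field_offsets([3, 3, 3, 3]))
```

For a record the list of sizes is fixed by the declaration, so a compiler can evaluate this sum once per field and embed the result as a constant.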


Fig. 1.8. Representation of a packed record

1.7.3. Representation of Sets

A set s is conveniently represented in a computer store by its characteristic function C(s). This is an array of logical values whose ith component has the meaning “i is present in s”. As an example, the set of small integers s = {2, 3, 5, 7, 11, 13} is represented by the sequence of bits, by a bitstring:

  C(s) = (… 0010100010101100)

The representation of sets by their characteristic function has the advantage that the operations of computing the union, intersection, and difference of two sets may be implemented as elementary logical operations. The following equivalences, which hold for all elements i of the base type of the sets x and y, relate logical operations with operations on sets:

  i IN (x+y) = (i IN x) OR (i IN y)
  i IN (x*y) = (i IN x) & (i IN y)
  i IN (x-y) = (i IN x) & ~(i IN y)

These logical operations are available on all digital computers, and moreover they operate concurrently on all corresponding elements (bits) of a word. It therefore appears that in order to be able to implement the basic set operations in an efficient manner, sets must be represented in a small, fixed number of words upon which not only the basic logical operations, but also those of shifting are available. Testing for membership is then implemented by a single shift and a subsequent (sign) bit test operation. As a consequence, a test of the form

  x IN {c1, c2, ... , cn}

can be implemented considerably more efficiently than the equivalent Boolean expression

  (x = c1) OR (x = c2) OR ... OR (x = cn)

A corollary is that the set structure should be used only for small integers as elements, the largest one being the wordlength of the underlying computer (minus 1).
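The bitstring representation can be sketched in a few lines of Python, where an ordinary integer plays the role of the word. The function names below are hypothetical; the membership test uses a shift and a mask of the lowest bit rather than a sign bit test, which is an assumption of this sketch.

```python
def make_set(elements):
    """Characteristic function as an integer: bit i is 1 iff i is in the set."""
    bits = 0
    for i in elements:
        bits |= 1 << i
    return bits

def member(i, s):
    """i IN s: one shift followed by a single-bit test."""
    return (s >> i) & 1 == 1

def union(x, y):
    return x | y          # i IN (x+y) = (i IN x) OR (i IN y)

def intersection(x, y):
    return x & y          # i IN (x*y) = (i IN x) & (i IN y)

def difference(x, y):
    return x & ~y         # i IN (x-y) = (i IN x) & ~(i IN y)


if __name__ == "__main__":
    s = make_set([2, 3, 5, 7, 11, 13])
    print(format(s, "016b"))
    print(member(5, s), member(6, s))
```

Each set operation is a single logical operation on the whole word, regardless of how many elements the sets contain.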

1.8. The File or Sequence

Another elementary structuring method is the sequence. A sequence is typically a homogeneous structure like the array. That is, all its elements are of the same type, the base type of the sequence. We shall denote a sequence s with n elements by

  s = <s_0, s_1, s_2, ... , s_{n-1}>

n is called the length of the sequence. This structure looks exactly like the array. The essential difference is that in the case of the array the number of elements is fixed by the array's declaration, whereas for the sequence it is left open. This implies that it may vary during execution of the program. Although every sequence has at any time a specific, finite length, we must consider the cardinality of a sequence type as infinite, because there is no fixed limit to the potential length of sequence variables.

A direct consequence of the variable length of sequences is the impossibility to allocate a fixed amount of storage to sequence variables. Instead, storage has to be allocated during program execution, namely whenever the sequence grows. Perhaps storage can be reclaimed when the sequence shrinks. In any case, a dynamic storage allocation scheme must be employed. All structures with variable size share this property, which is so essential that we classify them as advanced structures in contrast to the fundamental structures discussed so far. What, then, causes us to place the discussion of sequences in this chapter on fundamental structures?

The primary reason is that the storage management strategy is sufficiently simple for sequences (in contrast to other advanced structures), if we enforce a certain discipline in the use of sequences. In fact, under this proviso the handling of storage can safely be delegated to a mechanism that can be guaranteed to be reasonably effective. The secondary reason is that sequences are indeed ubiquitous in all computer applications. This structure is prevalent in all cases where different kinds of storage media are involved, i.e. where data are to be moved from one medium to another, such as from disk or tape to primary store or vice-versa.

The discipline mentioned is the restraint to use sequential access only. By this we mean that a sequence is inspected by strictly proceeding from one element to its immediate successor, and that it is generated by repeatedly appending an element at its end. The immediate consequence is that elements are not directly accessible, with the exception of the one element which currently is up for inspection. It is this accessing discipline which fundamentally distinguishes sequences from arrays. As we shall see in Chapter 2, the influence of an access discipline on programs is profound.

The advantage of adhering to sequential access which, after all, is a serious restriction, is the relative simplicity of needed storage management. But even more important is the possibility to use effective buffering techniques when moving data to or from secondary storage devices. Sequential access allows us to feed streams of data through pipes between the different media.
Buffering implies the collection of sections of a stream in a buffer, and the subsequent shipment of the whole buffer content once the buffer is filled. This results in a significantly more effective use of secondary storage. Given sequential access only, the buffering mechanism is reasonably straightforward for all sequences and all media. It can therefore safely be built into a system for general use, and the programmer need not be burdened by incorporating it in the program. Such a system is usually called a file system, because the high-volume, sequential access devices are used for permanent storage of (persistent) data, and they retain them even when the computer is switched off. The unit of data on these media is commonly called a (sequential) file. Here we will use the term file as a synonym for sequence.

There exist certain storage media in which the sequential access is indeed the only possible one. Among them are evidently all kinds of tapes. But even on magnetic disks each recording track constitutes a storage facility allowing only sequential access. Strictly sequential access is the primary characteristic of every mechanically moving device and of some other ones as well.

It follows that it is appropriate to distinguish between the data structure, the sequence, on one hand, and the mechanism to access elements on the other hand. The former is declared as a data structure, the latter typically by the introduction of a record with associated operators, or, according to more modern terminology, by a rider object. The distinction between data and mechanism declarations is also useful in view of the fact that several access points may exist concurrently on one and the same sequence, each one representing a sequential access at a (possibly) different location.

We summarize the essence of the foregoing as follows:

1. Arrays and records are random access structures. They are used when located in primary, random-access store.
2. Sequences are used to access data on secondary, sequential-access stores, such as disks and tapes.
3. We distinguish between the declaration of a sequence variable, and that of an access mechanism located at a certain position within the sequence.

1.8.1 Elementary File Operators

The discipline of sequential access can be enforced by providing a set of sequencing operators through which files can be accessed exclusively. Hence, although we may here refer to the i-th element of a sequence s by writing si, this shall not be possible in a program.

Sequences, files, are typically large, dynamic data structures stored on a secondary storage device. Such a device retains the data even if a program is terminated, or a computer is switched off. Therefore the introduction of a file variable is a complex operation connecting the data on the external device with the file variable in the program. We therefore define the type File in a separate module, whose definition specifies the type together with its operators. We call this module Files and postulate that a sequence or file variable must be explicitly initialized (opened) by calling an appropriate operator or function:

  VAR f: File
  f := Open(name)

where name identifies the file as recorded on the persistent data carrier. Some systems distinguish between opening an existing file and opening a new file:

  f := Old(name)
  f := New(name)

The disconnection between secondary storage and the file variable then must also be explicitly requested by, for example, a call of Close(f).

Evidently, the set of operators must contain an operator for generating (writing) and one for inspecting (reading) a sequence. We postulate that these operations apply not to a file directly, but to an object called a rider, which itself is connected with a file (sequence), and which implements a certain access mechanism. The sequential access discipline is guaranteed by a restrictive set of access operators (procedures).

A sequence is generated by appending elements at its end after having placed a rider on the file. Assuming the declaration

  VAR r: Rider

we position the rider r on the file f by the statement

  Set(r, f, pos)

where pos = 0 designates the beginning of the file (sequence). A typical pattern for generating the sequence is:

  WHILE more DO compute next element x; Write(r, x) END

A sequence is inspected by first positioning a rider as shown above, and then proceeding from element to element. A typical pattern for reading a sequence is:

  Read(r, x);
  WHILE ~r.eof DO process element x; Read(r, x) END

Evidently, a certain position is always associated with every rider. It is denoted by r.pos. Furthermore, we postulate that a rider contain a predicate (flag) r.eof indicating whether a preceding read operation had reached the sequence’s end. We can now postulate and describe informally the following set of primitive operators:

  1a. New(f, name)     defines f to be the empty sequence.
  1b. Old(f, name)     defines f to be the sequence persistently stored with given name.
  2.  Set(r, f, pos)   associate rider r with sequence f, and place it at position pos.
  3.  Write(r, x)      place element with value x in the sequence designated by rider r, and advance.
  4.  Read(r, x)       assign to x the value of the element designated by rider r, and advance.
  5.  Close(f)         registers the written file f in the persistent store (flush buffers).
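The interplay of the operators and the two access patterns can be tried out with a small in-memory model. The sketch below is written in Python rather than Oberon and rests on several assumptions: the file is a plain list, file names and persistence are left out entirely, Read returns the element instead of assigning to a variable parameter, and the names new_file, set_rider, write, read, and read_all are hypothetical.

```python
class File:
    """A sequence held in memory; a stand-in for data on a persistent carrier."""
    def __init__(self):
        self.data = []

class Rider:
    """Access mechanism: a position within a file plus the eof flag."""
    def __init__(self):
        self.file = None
        self.pos = 0
        self.eof = False

def new_file():
    return File()                       # the empty sequence

def set_rider(r, f, pos):
    r.file, r.pos, r.eof = f, pos, False

def write(r, x):
    data = r.file.data
    if r.pos < len(data):
        data[r.pos] = x                 # overwrite an existing element
    else:
        data.append(x)                  # append at the end
    r.pos += 1

def read(r):
    data = r.file.data
    if r.pos < len(data):
        x = data[r.pos]
        r.pos += 1
        return x
    r.eof = True                        # the read went past the end
    return None

def read_all(f):
    """The reading pattern: Read; WHILE ~eof DO process; Read END."""
    r = Rider()
    set_rider(r, f, 0)
    result = []
    x = read(r)
    while not r.eof:
        result.append(x)
        x = read(r)
    return result


if __name__ == "__main__":
    f = new_file()
    r = Rider()
    set_rider(r, f, 0)
    for ch in "text":                   # the generating pattern
        write(r, ch)
    print(read_all(f))
```

Note that eof becomes true only after a read has failed, which is why the reading pattern issues one Read before the loop and one at the end of each iteration.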

Note: Writing an element in a sequence is often a complex operation. However, mostly, files are created by appending elements at the end.

In order to convey a more precise understanding of the sequencing operators, the following example of an implementation is provided. It shows how they might be expressed if sequences were represented by arrays. This example of an implementation intentionally builds upon concepts introduced and discussed earlier, and it does not involve either buffering or sequential stores which, as mentioned above, make the sequence concept truly necessary and attractive. Nevertheless, this example exhibits all the essential characteristics of the primitive sequence operators, independently of how the sequences are represented in store.

The operators are presented in terms of conventional procedures. This collection of definitions of types, variables, and procedure headings (signatures) is called a definition. We assume that we are to deal with sequences of characters, i.e. text files whose elements are of type CHAR. The declarations of File and Rider are good examples of an application of record structures because, in addition to the field denoting the array which represents the data, further fields are required to denote the current length and position, i.e. the state of the rider.

  DEFINITION Files;
    TYPE File; (*sequence of characters*)
      Rider = RECORD eof: BOOLEAN END ;
    PROCEDURE New(VAR name: ARRAY OF CHAR): File;
    PROCEDURE Old(VAR name: ARRAY OF CHAR): File;
    PROCEDURE Close(VAR f: File);
    PROCEDURE Set(VAR r: Rider; VAR f: File; pos: INTEGER);
    PROCEDURE Write (VAR r: Rider; ch: CHAR);
    PROCEDURE Read (VAR r: Rider; VAR ch: CHAR);
  END Files.

A definition represents an abstraction. Here we are given the two data types, File and Rider, together with their operations, but without further details revealing their actual representation in store. Of the operators, declared as procedures, we see their headings only. This hiding of the details of implementation is intentional. The concept is called information hiding. About riders we only learn that there is a property called eof. This flag is set, if a read operation reaches the end of the file. The rider’s position is invisible, and hence the rider’s invariant cannot be falsified by direct access. The invariant expresses the fact that the position always lies within the limits given by the associated sequence. The invariant is established by procedure Set, and required and maintained by procedures Read and Write.
The statements that implement the procedures and further, internal details of the data types, are specified in a construct called module. Many representations of data and implementations of procedures are possible. We chose the following as a simple example (with fixed maximal file length):

  MODULE Files;
    CONST MaxLength = 4096;
    TYPE File = POINTER TO RECORD
        len: INTEGER;
        a: ARRAY MaxLength OF CHAR
      END ;
      Rider = RECORD (* 0 […]

  […] = 0*) nonempty: Signals.Signal; (*nf >= 0*)
    buf: ARRAY N OF CHAR;

    PROCEDURE deposit(VAR x: ARRAY OF CHAR);
    BEGIN
      ne := ne - Np;
      IF ne < 0 THEN Signals.Wait(nonfull) END ;
      FOR i := 0 TO Np-1 DO buf[in] := x[i]; INC(in) END ;
      IF in = N THEN in := 0 END ;
      nf := nf + Np;
      IF nf >= 0 THEN Signals.Send(nonempty) END
    END deposit;

    PROCEDURE fetch(VAR x: ARRAY OF CHAR);
    BEGIN
      nf := nf - Nc;
      IF nf < 0 THEN Signals.Wait(nonempty) END ;
      FOR i := 0 TO Nc-1 DO x[i] := buf[out]; INC(out) END ;
      IF out = N THEN out := 0 END ;
      ne := ne + Nc;
      IF ne >= 0 THEN Signals.Send(nonfull) END
    END fetch;

  BEGIN ne := N; nf := 0; in := 0; out := 0;
    Signals.Init(nonfull); Signals.Init(nonempty)
  END Buffer.

1.8.4 Textual Input and Output

By standard input and output we understand the transfer of data to (from) a computer system from (to) genuinely external agents, in particular its human operator. Input may typically originate at a keyboard and output may sink into a display screen. In any case, its characteristic is that it is readable, and it typically consists of a sequence of characters. It is a text. This readability condition is responsible for yet another complication incurred in most genuine input and output operations. Apart from the actual data transfer, they also involve a transformation of representation. For example, numbers, usually considered as atomic units and represented in binary form, need to be transformed into readable, decimal notation. Structures need to be represented in a suitable layout, whose generation is called formatting.

Whatever the transformation may be, the concept of the sequence is once again instrumental for a considerable simplification of the task. The key is the observation that, if the data set can be considered as a sequence of characters, the transformation of the sequence can be implemented as a sequence of (identical) transformations of elements.

  T(<s_0, s_1, ... , s_{n-1}>) = <T(s_0), T(s_1), ... , T(s_{n-1})>

We shall briefly investigate the necessary operations for transforming representations of natural numbers for input and output. The basis is that a number x represented by the sequence of decimal digits d = <d_{n-1}, ... , d_1, d_0> has the value

  x = Si: i = 0 .. n-1 : d_i * 10^i
  x = d_{n-1}×10^{n-1} + d_{n-2}×10^{n-2} + … + d_1×10 + d_0
  x = ( … ((d_{n-1}×10) + d_{n-2}) ×10 + … + d_1) ×10 + d_0

Assume now that the sequence d is to be read and transformed, and the resulting numeric value to be assigned to x. The simple algorithm terminates with the reading of the first character that is not a digit. (Arithmetic overflow is not considered).

  x := 0; Read(ch);
  WHILE ("0" […]

  […] R) & (Ak : 0 ≤ k < L : ak < x ) & (Ak : R < k < N : ak > x))

which implies

  (am = x) OR (Ak : 0 ≤ k < N : ak ≠ x)

The choice of m is apparently arbitrary in the sense that correctness does not depend on it. But it does influence the algorithm's effectiveness. Clearly our goal must be to eliminate in each step as many elements as possible from further searches, no matter what the outcome of the comparison is.
The optimal solution is to choose the middle element, because this eliminates half of the array in any case. As a result, the maximum number of steps is log2 N, rounded up to the nearest integer. Hence, this algorithm offers a drastic improvement over linear search, where the expected number of comparisons is N/2.

The efficiency can be somewhat improved by interchanging the two if-clauses. Equality should be tested second, because it occurs only once and causes termination. But more relevant is the question, whether -- as in the case of linear search -- a solution could be found that allows a simpler condition for termination. We indeed find such a faster algorithm, if we abandon the naive wish to terminate the search as soon as a match is established. This seems unwise at first glance, but on closer inspection we realize that the gain in efficiency at every step is greater than the loss incurred in comparing a few extra elements. Remember that the number of steps is at most log N.

The faster solution is based on the following invariant:

  (Ak : 0 ≤ k < L : ak < x) & (Ak : R ≤ k < N : ak ≥ x)

and the search is continued until the two sections span the entire array.

  L := 0; R := N;
  WHILE L < R DO
    m := (L+R) DIV 2;
    IF a[m] < x THEN L := m+1 ELSE R := m END
  END

The terminating condition is L ≥ R. Is it guaranteed to be reached? In order to establish this guarantee, we must show that under all circumstances the difference R-L is diminished in each step. L < R holds at the beginning of each step. The arithmetic mean m then satisfies L ≤ m < R. Hence, the difference is indeed diminished by either assigning m+1 to L (increasing L) or m to R (decreasing R), and the repetition terminates with L = R. However, the invariant and L = R do not yet establish a match. Certainly, if R = N, no match exists. Otherwise we must take into consideration that the element a[R] had never been compared. Hence, an additional test for equality a[R] = x is necessary. In contrast to the first solution, this algorithm -- like linear search -- finds the matching element with the least index.

1.9.3 Table Search

A search through an array is sometimes also called a table search, particularly if the keys are themselves structured objects, such as arrays of numbers or characters. The latter is a frequently encountered case; the character arrays are called strings or words. Let us define a type String as

  String = ARRAY M OF CHAR

and let order on strings x and y be defined as follows:

  (x = y) ≡ (Aj: 0 ≤ j < M : x_j = y_j)
  (x < y) ≡ Ei: 0 ≤ i < M : ((Aj: 0 ≤ j < i : x_j = y_j) & (x_i < y_i))

In order to establish a match, we evidently must find all characters of the comparands to be equal. Such a comparison of structured operands therefore turns out to be a search for an unequal pair of comparands, i.e. a search for inequality. If no unequal pair exists, equality is established. Assuming that the length of the words be quite small, say less than 30, we shall use a linear search in the following solution.

In most practical applications, one wishes to consider strings as having a variable length. This is accomplished by associating a length indication with each individual string value. Using the type declared above, this length must not exceed the maximum length M. This scheme allows for sufficient flexibility for many cases, yet avoids the complexities of dynamic storage allocation. Two representations of string lengths are most commonly used:

1. The length is implicitly specified by appending a terminating character which does not otherwise occur. Usually, the non-printing value 0X is used for this purpose. (It is important for the subsequent applications that it be the least character in the character set).
2. The length is explicitly stored as the first element of the array, i.e. the string s has the form s = s0, s1, s2, ... , sN-1 where s1 ... sN-1 are the actual characters of the string and s0 = CHR(N). This solution has the advantage that the length is directly available, and the disadvantage that the maximum length is limited to the size of the character set, that is, to 256 in the case of the ASCII set.

For the subsequent search algorithm, we shall adhere to the first scheme. A string comparison then takes the form

  i := 0;
  WHILE (x[i] = y[i]) & (x[i] # 0X) DO i := i+1 END

The terminating character now functions as a sentinel, the loop invariant is

  Aj: 0 ≤ j < i : x_j = y_j ≠ 0X,

and the resulting condition is therefore

  ((x_i ≠ y_i) OR (x_i = 0X)) & (Aj: 0 ≤ j < i : x_j = y_j ≠ 0X)

It establishes a match between x and y, provided that x_i = y_i, and it establishes x < y, if x_i < y_i.

We are now prepared to return to the task of table searching. It calls for a nested search, namely a search through the entries of the table, and for each entry a sequence of comparisons between components. For example, let the table T and the search argument x be defined as

  T: ARRAY N OF String;
  x: String

Assuming that N may be fairly large and that the table is alphabetically ordered, we shall use a binary search. Using the algorithms for binary search and string comparison developed above, we obtain the following program segment.

  L := 0; R := N;
  WHILE L < R DO
    m := (L+R) DIV 2; i := 0;
    WHILE (T[m,i] = x[i]) & (x[i] # 0X) DO i := i+1 END ;
    IF T[m,i] < x[i] THEN L := m+1 ELSE R := m END
  END ;
  IF R < N THEN
    i := 0;
    WHILE (T[R,i] = x[i]) & (x[i] # 0X) DO i := i+1 END
  END
  (* (R < N) & (T[R,i] = x[i]) establish a match*)

1.9.4. Straight String Search

A frequently encountered kind of search is the so-called string search. It is characterized as follows. Given an array s of N elements and an array p of M elements, where 0 < M < N, declared as

  s: ARRAY N OF Item
  p: ARRAY M OF Item

string search is the task of finding the first occurrence of p in s. Typically, the items are characters; then s may be regarded as a text and p as a pattern or word, and we wish to find the first occurrence of the word in the text. This operation is basic to every text processing system, and there is obvious interest in finding an efficient algorithm for this task. Before paying particular attention to efficiency, however, let us first present a straightforward searching algorithm. We shall call it straight string search.

A more precise formulation of the desired result of a search is indispensable before we attempt to specify an algorithm to compute it. Let the result be the index i which points to the first occurrence of a match of the pattern within the string. To this end, we introduce a predicate P(i, j)

  P(i, j) = Ak : 0 ≤ k < j : s_{i+k} = p_k

Then evidently our resulting index i must satisfy P(i, M). But this condition is not sufficient. Because the search is to locate the first occurrence of the pattern, P(k, M) must be false for all k < i. We denote this condition by Q(i).

  Q(i) = Ak : 0 ≤ k < i : ~P(k, M)

The posed problem immediately suggests to formulate the search as an iteration of comparisons, and we propose the following approach:

  i := -1;
  REPEAT INC(i); (* Q(i) *)
    found := P(i, M)
  UNTIL found OR (i = N-M)

The computation of P again results naturally in an iteration of individual character comparisons. When we apply DeMorgan's theorem to P, it appears that the iteration must be a search for inequality among corresponding pattern and string characters.

  P(i, j) = (Ak : 0 ≤ k < j : s_{i+k} = p_k) = (~Ek : 0 ≤ k < j : s_{i+k} ≠ p_k)

The result of the next refinement is a repetition within a repetition. The predicates P and Q are inserted at appropriate places in the program as comments. They act as invariants of the iteration loops.

  i := -1;
  REPEAT INC(i); j := 0; (* Q(i) *)
    WHILE (j < M) & (s[i+j] = p[j]) DO (* P(i, j+1) *) INC(j) END
    (* Q(i) & P(i, j) & ((j = M) OR (s[i+j] # p[j])) *)
  UNTIL (j = M) OR (i = N-M)

The term j = M in the terminating condition indeed corresponds to the condition found, because it implies P(i, M). The term i = N-M implies Q(N-M) and thereby the nonexistence of a match anywhere in the string. If the iteration continues with j < M, then it must do so with s_{i+j} ≠ p_j. This implies ~P(i, M), which together with Q(i) implies Q(i+1), which establishes Q(i) after the next incrementing of i.

Analysis of straight string search. This algorithm operates quite effectively, if we can assume that a mismatch between character pairs occurs after at most a few comparisons in the inner loop. This is likely to be the case, if the cardinality of the item type is large. For text searches with a character set size of 128 we may well assume that a mismatch occurs after inspecting 1 or 2 characters only. Nevertheless, the worst case performance is rather alarming. Consider, for example, that the string consist of N-1 A's followed by a single B, and that the pattern consist of M-1 A's followed by a B. Then in the order of N*M comparisons are necessary to find the match at the end of the string. As we shall subsequently see, there fortunately exist methods that drastically improve this worst case behaviour.

1.9.5. The Knuth-Morris-Pratt String Search

Around 1970, D.E. Knuth, J.H. Morris, and V.R. Pratt invented an algorithm that requires essentially in the order of N character comparisons only, even in the worst case [1-8]. The new algorithm is based on the observation that by starting the next pattern comparison at its beginning each time, we may be discarding valuable information gathered during previous comparisons. After a partial match of the beginning of the pattern with corresponding characters in the string, we indeed know the last part of the string, and perhaps could have precompiled some data (from the pattern) which could be used for a more rapid advance in the text string.
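For reference, the cost of restarting the pattern comparison at every position can be observed directly. The following Python sketch (not Oberon) follows the straight string search while counting character comparisons; the function name straight_search and the counter are additions made for this illustration, and the outer REPEAT loop is written as a for loop.

```python
def straight_search(s, p):
    """Return (index of the first occurrence of p in s, or -1, and the
    number of character comparisons performed)."""
    n, m = len(s), len(p)
    comparisons = 0
    for i in range(n - m + 1):          # candidate alignments 0 .. N-M
        j = 0
        while j < m:                    # search for inequality
            comparisons += 1
            if s[i + j] != p[j]:
                break
            j += 1
        if j == m:                      # P(i, M) holds
            return i, comparisons
    return -1, comparisons


if __name__ == "__main__":
    print(straight_search("Hoola-Hoola girls like Hooligans.", "Hooligan"))
    # The unfavourable case: N-1 A's followed by B, pattern M-1 A's followed by B.
    print(straight_search("A" * 19 + "B", "AAAAB"))
```

In the second call every one of the N-M+1 alignments costs M comparisons, so the count is (N-M+1)*M, which is in the order of N*M.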
The following example of a search for the word Hooligan illustrates the principle of the algorithm. Underlined characters are those which were compared. Note that each time two compared characters do not match, the pattern is shifted all the way, because a smaller shift could not possibly lead to a full match.

  Hoola-Hoola girls like Hooligans.
  Hooligan
      Hooligan
       Hooligan
        Hooligan
            Hooligan
             Hooligan
              ......
                         Hooligan

Using the predicates P and Q, the KMP-algorithm is the following:

  i := 0; j := 0;
  WHILE (j < M) & (i < N) DO
    (* Q(i-j) & P(i-j, j) *)
    WHILE (j >= 0) & (s[i] # p[j]) DO j := D END ;
    INC(i); INC(j)
  END

This formulation is admittedly not quite complete, because it contains an unspecified shift value D. We shall return to it shortly, but first point out that the conditions Q(i-j) and P(i-j, j) are maintained as global invariants, to which we may add the relations 0 ≤ i < N and 0 ≤ j < M. This suggests that we must abandon the notion that i always marks the current position of the first pattern character in the text. Rather, the alignment position of the pattern is now i-j.

If the algorithm terminates due to j = M, the term P(i-j, j) of the invariant implies P(i-M, M), that is, a match at position i-M. Otherwise it terminates with i = N, and since j < M, the invariant Q(i) implies that no match exists at all.

We must now demonstrate that the algorithm never falsifies the invariant. It is easy to show that it is established at the beginning with the values i = j = 0. Let us first investigate the effect of the two statements incrementing i and j by 1. They apparently neither represent a shift of the pattern to the right, nor do they falsify Q(i-j), since the difference remains unchanged. But could they falsify P(i-j, j), the second factor of the invariant? We notice that at this point the negation of the inner while clause holds, i.e. either j < 0 or s_i = p_j. The latter extends the partial match and establishes P(i-j, j+1). In the former case, we postulate that P(i-j, j+1) hold as well. Hence, incrementing both i and j by 1 cannot falsify the invariant either.

The only other assignment left in the algorithm is j := D. We shall simply postulate that the value D always be such that replacing j by D will maintain the invariant. In order to find an appropriate expression for D, we must first understand the effect of the assignment. Provided that D < j, it represents a shift of the pattern to the right by j-D positions. Naturally, we wish this shift to be as large as possible, i.e., D to be as small as possible. This is illustrated by Fig. 1.10.

B

C

D

string

A

B

C

E

pattern j=3

D=0

A

B

C

D A

B

C

E

j=0

Fig. 1.10. Assignment j := D shifts pattern by j-D positions

Evidently the condition P(i-D, D) & Q(i-D) must hold before assigning D to j, if the invariant P(i-j, j) & Q(i-j) is to hold thereafter. This precondition is therefore our guideline for finding an appropriate expression for D. The key observation is that thanks to P(i-j, j) we know that

$s_{i-j} \ldots s_{i-1} = p_0 \ldots p_{j-1}$

(we had just scanned the first j characters of the pattern and found them to match). Therefore the condition P(i-D, D) with D ≤ j, i.e.,

$p_0 \ldots p_{D-1} = s_{i-D} \ldots s_{i-1}$

translates into

$p_0 \ldots p_{D-1} = p_{j-D} \ldots p_{j-1}$

and (for the purpose of establishing the invariance of Q(i-D)) the predicate ~P(i-k, M) for k = 1 ... j-D translates into

$p_0 \ldots p_{k-1} \neq p_{j-k} \ldots p_{j-1}$   for all k = 1 ... j-D
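The first of the conditions above involves the pattern only, which a few lines of code can make concrete. The following is a minimal Python sketch (Python rather than the Oberon of the text, purely so that it can be run directly); the function name border_length is hypothetical. It finds by brute force, for a given j, the longest D < j with p[0..D-1] = p[j-D..j-1]. It is not the table computation of the KMP program and ignores any further refinement of the table.

```python
def border_length(p: str, j: int) -> int:
    """Largest D < j such that p[0:D] == p[j-D:j].

    Returns 0 when only the empty sequence matches (or when j == 0).
    Brute force, for illustration only.
    """
    for d in range(j - 1, -1, -1):      # try the longest candidate first
        if p[:d] == p[j - d:j]:
            return d
    return 0


# Situation of Fig. 1.10: pattern ABCE, mismatch at j = 3.
d_fig = border_length("ABCE", 3)
```

For the pattern of Fig. 1.10 no non-empty sequence preceding position 3 matches the beginning of the pattern, so the sketch yields D = 0, in agreement with the figure.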

The essential result is that the value D is determined by the pattern alone and does not depend on the text string. The conditions tell us that in order to find D we must, for every j, search for the smallest D, and hence for the longest sequence of pattern characters just preceding position j, which matches an equal number of characters at the beginning of the pattern. We shall denote D for a given j by d[j]. Since these values depend on the pattern only, the auxiliary table d may be computed before starting the actual search; this computation amounts to a precompilation of the pattern. This effort is evidently only worthwhile if the text is considerably longer than the pattern (M << N). ...

      ...
      WHILE (j >= 0) & (s[i] # p[j]) DO j := d[j] END ;
      INC(i); INC(j)
    END ;
    IF j = m THEN r := i-m ELSE r := -1 END
  END Search

Analysis of KMP search. The exact analysis of the performance of KMP-search is, like the algorithm itself, very intricate. In [1-8] its inventors prove that the number of character comparisons is in the order of M+N, which suggests a substantial improvement over M*N for the straight search. They also point out the welcome property that the scanning pointer i never backs up, whereas in straight string search the scan always begins at the first pattern character after a mismatch, and therefore may involve characters that had actually been scanned already. This may cause awkward problems when the string is read from secondary storage where backing up is costly. Even when the input is buffered, the pattern may be such that the backing up extends beyond the buffer contents.

1.9.6. The Boyer-Moore String Search

The clever scheme of the KMP-search yields genuine benefits only if a mismatch was preceded by a partial match of some length. Only in this case is the pattern shift increased to more than 1. Unfortunately, this is the exception rather than the rule; matches occur much more seldom than mismatches. Therefore the gain in using the KMP strategy is marginal in most cases of normal text searching. The method to be discussed here improves performance not only in the worst case, but also in the average case. It was invented by R.S. Boyer and J.S. Moore around 1975, and we shall call it BM search. We shall here present a simplified version of BM-search before proceeding to the one given by Boyer and Moore.

BM-search is based on the unconventional idea to start comparing characters at the end of the pattern rather than at the beginning. As in the case of KMP-search, the pattern is precompiled into a table d before the actual search starts. Let, for every character x in the character set, d[x] be the distance of the rightmost occurrence of x in the pattern from its end. Now assume that a mismatch between string and pattern was discovered. Then the pattern can immediately be shifted to the right by d[x] positions, where x is the text character currently aligned with the last pattern character p[M-1]; this amount is quite likely to be greater than 1. If x does not occur in the pattern at all, the shift is even greater, namely equal to the entire pattern's length. The following example illustrates this process.

  Hoola-Hoola girls like Hooligans.
  Hooligan
       Hooligan
         Hooligan
                 Hooligan
                         Hooligan

Since individual character comparisons now proceed from right to left, the following, slightly modified versions of the predicates P and Q are more convenient.

$P(i, j) = \mathbf{A}k: j \le k < M : s_{i-j+k} = p_k$
$Q(i) = \mathbf{A}k: 0 \le k < i : {\sim}P(k, 0)$

These predicates are used in the following formulation of the BM-algorithm to denote the invariant conditions.

  i := M; j := M;
  WHILE (j > 0) & (i <= N) DO
    (* Q(i-M) *)
    j := M; k := i;
    WHILE (j > 0) & (s[k-1] = p[j-1]) DO
      (* P(k-j, j) & (k-j = i-M) *)
      DEC(k); DEC(j)
    END ;
    i := i + d[s[i-1]]
  END

The indices satisfy 0 ≤ j ≤ M and 0 ≤ i, k ≤ N. Therefore, termination with j = 0, together with P(k-j, j), implies P(k, 0), i.e., a match at position k. Termination with j > 0 demands that i = N; hence Q(i-M) implies Q(N-M), signalling that no match exists. Of course we still have to convince ourselves that Q(i-M) and P(k-j, j) are indeed invariants of the two repetitions. They are trivially satisfied when repetition starts, since Q(0) and P(x, M) are always true.

Let us first consider the effect of the two statements decrementing k and j. Q(i-M) is not affected, and, since s[k-1] = p[j-1] had been established, P(k-j, j-1) holds as precondition, guaranteeing P(k-j, j) as postcondition. If the inner loop terminates with j > 0, the fact that s[k-1] ≠ p[j-1] implies ~P(k-j, 0), since

${\sim}P(i, 0) = \mathbf{E}k: 0 \le k < M : s_{i+k} \neq p_k$

Moreover, because k-j = i-M, Q(i-M) & ~P(k-j, 0) = Q(i+1-M), establishing a non-match at position i-M+1.

Next we must show that the statement i := i + d[s[i-1]] never falsifies the invariant. This is the case, provided that before the assignment Q(i+d[s[i-1]]-M) is guaranteed. Since we know that Q(i+1-M) holds, it suffices to establish ~P(i+h-M) for h = 2, 3, ... , d[s[i-1]]. We now recall that d[x] is defined as the distance of the rightmost occurrence of x in the pattern from the end.
This is formally expressed as

$\mathbf{A}k: M-d_x \le k < M-1 : p_k \neq x$

Substituting s[i-1] for x, we obtain

$\mathbf{A}h: M-d_{s[i-1]} \le h < M-1 : s_{i-1} \neq p_h$
$\mathbf{A}h: 1 < h \le d_{s[i-1]} : s_{i-1} \neq p_{M-h}$
$\mathbf{A}h: 1 < h \le d_{s[i-1]} : {\sim}P(i+h-M)$

The following program includes the presented, simplified Boyer-Moore strategy in a setting similar to that of the preceding KMP-search program. Note as a detail that a repeat statement is used in the inner loop, decrementing k and j before comparing s and p. This eliminates the -1 terms in the index expressions.

  PROCEDURE Search(VAR s, p: ARRAY OF CHAR; m, n: INTEGER; VAR r: INTEGER);
    (*search for pattern p of length m in text s of length n*)
    (*if p is found, then r indicates the position in s, otherwise r = -1*)
    VAR i, j, k: INTEGER;
      d: ARRAY 128 OF INTEGER;
  BEGIN
    FOR i := 0 TO 127 DO d[i] := m END ;
    FOR j := 0 TO m-2 DO d[ORD(p[j])] := m-j-1 END ;
    i := m;
    REPEAT j := m; k := i;
      REPEAT DEC(k); DEC(j)
      UNTIL (j < 0) OR (p[j] # s[k]);
      i := i + d[ORD(s[i-1])]
    UNTIL (j < 0) OR (i > n);
    (*on a match, k has been decremented once past the first matching character*)
    IF j < 0 THEN r := k+1 ELSE r := -1 END
  END Search

Analysis of Boyer-Moore Search. The original publication of this algorithm [1-9] contains a detailed analysis of its performance. The remarkable property is that in all except specially constructed cases it requires substantially less than N comparisons. In the luckiest case, where the last character of the pattern always hits an unequal character of the text, the number of comparisons is N/M.

The authors provide several ideas on possible further improvements. One is to combine the strategy explained above, which provides greater shifting steps when a mismatch is present, with the Knuth-Morris-Pratt strategy, which allows larger shifts after detection of a (partial) match. This method requires two precomputed tables; d1 is the table used above, and d2 is the table corresponding to the one of the KMP-algorithm. The step taken is then the larger of the two, both indicating that no smaller step could possibly lead to a match. We refrain from further elaborating the subject, because the additional complexity of the table generation and the search itself does not seem to yield any appreciable efficiency gain. In fact, the additional overhead is larger, and casts some uncertainty whether the sophisticated extension is an improvement or a deterioration.
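The simplified strategy can also be tried out in a few lines. The following is a minimal Python sketch, assuming the table d as defined in the text; it is written in Python rather than Oberon only so that it can be run directly. The name bm_search and the use of a dictionary instead of a 128-entry array are choices made for this illustration. It follows the while-loop formulation with the explicit -1 index terms.

```python
def bm_search(s: str, p: str) -> int:
    """Simplified Boyer-Moore search: position of p in s, or -1 if absent."""
    m, n = len(p), len(s)
    if m == 0:
        return 0                      # convention chosen for this sketch
    # d[x]: distance of the rightmost occurrence of x in p[0..m-2] from the
    # end of the pattern; characters not listed get the default shift m.
    d = {}
    for j in range(m - 1):
        d[p[j]] = m - j - 1
    i = m                             # i-1 is the text index under p[m-1]
    while i <= n:
        j, k = m, i
        while j > 0 and s[k - 1] == p[j - 1]:   # compare right to left
            k -= 1
            j -= 1
        if j == 0:
            return k                  # all m characters matched
        i += d.get(s[i - 1], m)       # shift by the table entry, at least 1
    return -1


text = "Hoola-Hoola girls like Hooligans."
pos = bm_search(text, "Hooligan")
```

Because every table entry is at least 1, the outer loop always advances, and the returned position can be compared against the built-in text.find.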

Exercises

1.1. Assume that the cardinalities of the standard types INTEGER, REAL, and CHAR are denoted by cint, creal, and cchar. What are the cardinalities of the following data types defined as examples in this chapter: sex, weekday, row, alfa, complex, date, person?

1.2. Which are the instruction sequences (on your computer) for the following:
(a) Fetch and store operations for an element of packed records and arrays?
(b) Set operations, including the test for membership?

1.3. What are the reasons for defining certain sets of data as sequences instead of arrays?

1.4. Given is a railway timetable listing the daily services on several lines of a railway system. Find a representation of these data in terms of arrays, records, or sequences, which is suitable for lookup of arrival and departure times, given a certain station and desired direction of the train.

1.5. Given a text T in the form of a sequence and lists of a small number of words in the form of two arrays A and B. Assume that words are short arrays of characters of a small and fixed maximum length. Write a program that transforms the text T into a text S by replacing each occurrence of a word A[i] by its corresponding word B[i].

1.6. Compare the following three versions of the binary search with the one presented in the text. Which of the three programs are correct? Determine the relevant invariants. Which versions are more efficient? We assume the following variables, and the constant N > 0:

  VAR i, j, k, x: INTEGER;
    a: ARRAY N OF INTEGER;

Program A:

  i := 0; j := N-1;
  REPEAT k := (i+j) DIV 2;
    IF a[k] < x THEN i := k ELSE j := k END
  UNTIL (a[k] = x) OR (i > j)

Program B:

  i := 0; j := N-1;
  REPEAT k := (i+j) DIV 2;
    IF x < a[k] THEN j := k-1 END ;
    IF a[k] < x THEN i := k+1 END
  UNTIL i > j

Program C:

  i := 0; j := N-1;
  REPEAT k := (i+j) DIV 2;
    IF x < a[k] THEN j := k ELSE i := k+1 END
  UNTIL i > j

Hint: All programs must terminate with a[k] = x, if such an element exists, or a[k] ≠ x, if there exists no element with value x.

1.7. A company organizes a poll to determine the success of its products. Its products are records and tapes of hits, and the most popular hits are to be broadcast in a hit parade. The polled population is to be divided into four categories according to sex and age (say, less than or equal to 20, and older than 20). Every person is asked to name five hits. Hits are identified by the numbers 1 to N (say, N = 30). The results of the poll are to be appropriately encoded as a sequence of characters. Hint: use procedures Read and ReadInt to read the values of the poll.

  TYPE hit = [N];
    sex = (male, female);
    reponse = RECORD
      name, firstname: alfa;
      s: sex;
      age: INTEGER;
      choice: ARRAY 5 OF hit
    END ;
  VAR poll: Files.File

This file is the input to a program which computes the following results:
1. A list of hits in the order of their popularity. Each entry consists of the hit number and the number of times it was mentioned in the poll. Hits that were never mentioned are omitted from the list.
2. Four separate lists with the names and first names of all respondents who had mentioned in first place one of the three hits most popular in their category.
The five lists are to be preceded by suitable titles.

References

1-1. O.-J. Dahl, E.W. Dijkstra, and C.A.R. Hoare. Structured Programming. (New York: Academic Press, 1972), pp. 155-65.
1-2. C.A.R. Hoare. Notes on data structuring; in Structured Programming. Dahl, Dijkstra, and Hoare, pp. 83-174.
1-3. K. Jensen and N. Wirth. Pascal User Manual and Report. (Berlin: Springer-Verlag, 1974).
1-4. N. Wirth. Program development by stepwise refinement. Comm. ACM, 14, No. 4 (1971), 221-27.
1-5. ------, Programming in Modula-2. (Berlin, Heidelberg, New York: Springer-Verlag, 1982).
1-6. ------, On the composition of well-structured programs. Computing Surveys, 6, No. 4, (1974) 247-59.
1-7. C.A.R. Hoare. The Monitor: An operating systems structuring concept. Comm. ACM, 17, 10 (Oct. 1974), 549-557.
1-8. D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6, 2, (June 1977), 323-349.
1-9. R.S. Boyer and J.S. Moore. A fast string searching algorithm. Comm. ACM, 20, 10 (Oct. 1977), 762-772.


2. SORTING

2.1. Introduction

The primary purpose of this chapter is to provide an extensive set of examples illustrating the use of the data structures introduced in the preceding chapter and to show how the choice of structure for the underlying data profoundly influences the algorithms that perform a given task. Sorting is also a good example to show that such a task may be performed according to many different algorithms, each one having certain advantages and disadvantages that have to be weighed against each other in the light of the particular application.

Sorting is generally understood to be the process of rearranging a given set of objects in a specific order. The purpose of sorting is to facilitate the later search for members of the sorted set. As such it is an almost universally performed, fundamental activity. Objects are sorted in telephone books, in income tax files, in tables of contents, in libraries, in dictionaries, in warehouses, and almost everywhere that stored objects have to be searched and retrieved. Even small children are taught to put their things "in order", and they are confronted with some sort of sorting long before they learn anything about arithmetic.

Hence, sorting is a relevant and essential activity, particularly in data processing. What else would be easier to sort than data! Nevertheless, our primary interest in sorting is devoted to the even more fundamental techniques used in the construction of algorithms. There are not many techniques that do not occur somewhere in connection with sorting algorithms. In particular, sorting is an ideal subject to demonstrate a great diversity of algorithms, all having the same purpose, many of them being optimal in some sense, and most of them having advantages over others. It is therefore an ideal subject to demonstrate the necessity of performance analysis of algorithms.
The example of sorting is moreover well suited for showing how a very significant gain in performance may be obtained by the development of sophisticated algorithms when obvious methods are readily available.

The dependence of the choice of an algorithm on the structure of the data to be processed -- a ubiquitous phenomenon -- is so profound in the case of sorting that sorting methods are generally classified into two categories, namely, sorting of arrays and sorting of (sequential) files. The two classes are often called internal and external sorting because arrays are stored in the fast, random-access "internal" store of computers and files are appropriate on the slower, but more spacious "external" stores based on mechanically moving devices (disks and tapes). The importance of this distinction is obvious from the example of sorting numbered cards. Structuring the cards as an array corresponds to laying them out in front of the sorter so that each card is visible and individually accessible (see Fig. 2.1). Structuring the cards as a file, however, implies that from each pile only the card on the top is visible (see Fig. 2.2). Such a restriction will evidently have serious consequences on the sorting method to be used, but it is unavoidable if the number of cards to be laid out is larger than the available table.

Before proceeding, we introduce some terminology and notation to be used throughout this chapter. If we are given n items

$a_0, a_1, \ldots, a_{n-1}$

sorting consists of permuting these items into an array

$a_{k_0}, a_{k_1}, \ldots, a_{k_{n-1}}$

such that, given an ordering function f,

$f(a_{k_0}) \le f(a_{k_1}) \le \ldots \le f(a_{k_{n-1}})$

Ordinarily, the ordering function is not evaluated according to a specified rule of computation but is stored as an explicit component (field) of each item. Its value is called the key of the item. As a consequence, the record structure is particularly well suited to represent items and might for example be declared as follows:

  TYPE Item = RECORD
      key: INTEGER;
      (*other components declared here*)
    END

The other components represent relevant data about the items in the collection; the key merely assumes the purpose of identifying the items. As far as our sorting algorithms are concerned, however, the key is the only relevant component, and there is no need to define any particular remaining components. In the following discussions, we shall therefore discard any associated information and assume that the type Item be defined as INTEGER. This choice of the key type is somewhat arbitrary. Evidently, any type on which a total ordering relation is defined could be used just as well.

A sorting method is called stable if the relative order of items with equal keys remains unchanged by the sorting process. Stability of sorting is often desirable, if items are already ordered (sorted) according to some secondary keys, i.e., properties not reflected by the (primary) key itself.

This chapter is not to be regarded as a comprehensive survey of sorting techniques. Rather, some selected, specific methods are exemplified in greater detail. For a thorough treatment of sorting, the interested reader is referred to the excellent and comprehensive compendium by D. E. Knuth [2-7] (see also Lorin [2-10]).
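The notion of stability can be made concrete with a small example. The following Python sketch is only an illustration: the record type Item with a second field name (standing for the "other components") and the helper is_stable_result are hypothetical names, and Python's built-in sorted, which is documented to be stable, takes the place of a sorting procedure.

```python
from typing import NamedTuple


class Item(NamedTuple):
    key: int
    name: str          # stands for the "other components" of an item


def is_stable_result(before, after):
    """True if items with equal keys keep their relative order in `after`."""
    for key in {it.key for it in before}:
        same_before = [it for it in before if it.key == key]
        same_after = [it for it in after if it.key == key]
        if same_before != same_after:
            return False
    return True


items = [Item(2, "b"), Item(1, "x"), Item(2, "a"), Item(1, "y")]
# sorted() is stable: "x" stays ahead of "y", and "b" stays ahead of "a".
stable = sorted(items, key=lambda it: it.key)
# A correctly ordered but unstable arrangement, written out by hand:
unstable = [Item(1, "y"), Item(1, "x"), Item(2, "b"), Item(2, "a")]
```

Both arrangements are in key order; only the first preserves the original order among items with equal keys.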

2.2. Sorting Arrays

The predominant requirement that has to be made for sorting methods on arrays is an economical use of the available store. This implies that the permutation of items which brings the items into order has to be performed in situ, and that methods which transport items from an array a to a result array b are intrinsically of minor interest. Having thus restricted our choice of methods among the many possible solutions by the criterion of economy of storage, we proceed to a first classification according to their efficiency, i.e., their economy of time. A good measure of efficiency is obtained by counting the numbers C of needed key comparisons and M of moves (transpositions) of items. These numbers are functions of the number n of items to be sorted. Whereas good sorting algorithms require in the order of n*log(n) comparisons, we first discuss several simple and obvious sorting techniques, called straight methods, all of which require in the order of $n^2$ comparisons of keys. There are three good reasons for presenting straight methods before proceeding to the faster algorithms.

1. Straight methods are particularly well suited for elucidating the characteristics of the major sorting principles.
2. Their programs are easy to understand and are short. Remember that programs occupy storage as well!
3. Although sophisticated methods require fewer operations, these operations are usually more complex in their details; consequently, straight methods are faster for sufficiently small n, although they must not be used for large n.

Sorting methods that sort items in situ can be classified into three principal categories according to their underlying method:

  Sorting by insertion
  Sorting by selection
  Sorting by exchange

These three principles will now be examined and compared. The procedures operate on a global variable a whose components are to be sorted in situ, i.e. without requiring additional, temporary storage. The components are the keys themselves. We discard other data represented by the record type Item, thereby simplifying matters. In all algorithms to be developed in this chapter, we will assume the presence of an array a and a constant n, the number of elements of a:

  TYPE Item = INTEGER;
  VAR a: ARRAY n OF Item

2.2.1. Sorting by Straight Insertion

This method is widely used by card players. The items (cards) are conceptually divided into a destination sequence $a_0 \ldots a_{i-1}$ and a source sequence $a_i \ldots a_{n-1}$. In each step, starting with i = 1 and incrementing i by unity, the element $a_i$ of the source sequence is picked and transferred into the destination sequence by inserting it at the appropriate place.

  Initial keys:  44  55  12  42  94  18  06  67
  i=1            44  55  12  42  94  18  06  67
  i=2            12  44  55  42  94  18  06  67
  i=3            12  42  44  55  94  18  06  67
  i=4            12  42  44  55  94  18  06  67
  i=5            12  18  42  44  55  94  06  67
  i=6            06  12  18  42  44  55  94  67
  i=7            06  12  18  42  44  55  67  94

Table 2.1 A Sample Process of Straight Insertion Sorting.

The process of sorting by insertion is shown in an example of eight numbers chosen at random (see Table 2.1). The algorithm of straight insertion is

  FOR i := 1 TO n-1 DO
    x := a[i];
    insert x at the appropriate place in a[0] ... a[i]
  END

In the process of actually finding the appropriate place, it is convenient to alternate between comparisons and moves, i.e., to let x sift down by comparing x with the next item a[j], and either inserting x or moving a[j] to the right and proceeding to the left. We note that there are two distinct conditions that may cause the termination of the sifting down process:

1. An item a[j] is found with a key less than or equal to the key of x.
2. The left end of the destination sequence is reached.

  PROCEDURE StraightInsertion;
    VAR i, j: INTEGER; x: Item;
  BEGIN
    FOR i := 1 TO n-1 DO
      x := a[i]; j := i;
      WHILE (j > 0) & (x < a[j-1]) DO a[j] := a[j-1]; DEC(j) END ;
      a[j] := x
    END
  END StraightInsertion

Analysis of straight insertion. The number $C_i$ of key comparisons in the i-th sift is at most i-1, at least 1, and -- assuming that all permutations of the n keys are equally probable -- i/2 in the average. The number $M_i$ of moves (assignments of items) is $C_i + 2$ (including the sentinel). Therefore, the total numbers of comparisons and moves are

  $C_{min} = n-1$                $M_{min} = 3(n-1)$
  $C_{ave} = (n^2 + n - 2)/4$    $M_{ave} = (n^2 + 9n - 10)/4$
  $C_{max} = (n^2 - n)/2$        $M_{max} = (n^2 + 3n - 4)/2$
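The behaviour of the procedure can be observed directly by instrumenting it. The following Python sketch transcribes StraightInsertion and counts key comparisons and item moves. The counting convention is an assumption of this sketch: one move for x := a[i], one for each shift, and one for the final a[j] := x; the test j > 0 is not counted as a key comparison.

```python
def straight_insertion(a):
    """Sort the list a in place; return (key comparisons, item moves)."""
    comparisons = moves = 0
    for i in range(1, len(a)):
        x = a[i]
        moves += 1                    # x := a[i]
        j = i
        while j > 0:
            comparisons += 1          # the key comparison x < a[j-1]
            if not (x < a[j - 1]):
                break                 # a[j-1] <= x: insertion point found
            a[j] = a[j - 1]           # shift one item to the right
            moves += 1
            j -= 1
        a[j] = x
        moves += 1                    # a[j] := x
    return comparisons, moves


keys = [44, 55, 12, 42, 94, 18, 6, 67]     # the keys of Table 2.1
counts_sample = straight_insertion(keys)
counts_sorted = straight_insertion(list(range(8)))
counts_reversed = straight_insertion(list(range(8, 0, -1)))
```

Under this convention, eight items in reverse order require 28 comparisons and 42 moves, while eight items already in order require only 7 comparisons.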

The minimal numbers occur if the items are initially in order; the worst case occurs if the items are initially in reverse order. In this sense, sorting by insertion exhibits a truly natural behavior. It is plain that the given algorithm also describes a stable sorting process: it leaves the order of items with equal keys unchanged. The algorithm of straight insertion is easily improved by noting that the destination sequence a0 ... ai-1, in which the new item has to be inserted, is already ordered. Therefore, a faster method of determining the insertion point can be used. The obvious choice is a binary search that samples the destination sequence in the middle and continues bisecting until the insertion point is found. The modified sorting algorithm is called binary insertion. PROCEDURE BinaryInsertion(VAR a: ARRAY OF Item; n: INTEGER); VAR i, j, m, L, R: INTEGER; x: Item; BEGIN FOR i := 1 TO n-1 DO

      x := a[i]; L := 1; R := i;
      WHILE L < R DO
        m := (L+R) DIV 2;
        IF a[m] ...

  ...
  WHILE (q > 0) & (r > 0) DO
    IF a[i] < a[j] THEN
      move an item from i-source to k-destination;
      advance i and k; q := q-1
    ELSE
      move an item from j-source to k-destination;
      advance j and k; r := r-1
    END
  END ;
  copy tail of i-sequence; copy tail of j-sequence

After this further refinement of the tail copying operations, the program is laid out in complete detail. Before writing it out in full, we wish to eliminate the restriction that n be a power of 2. Which parts of the algorithm are affected by this relaxation of constraints? We easily convince ourselves that the best way to cope with the more general situation is to adhere to the old method as long as possible. In this example this means that we continue merging p-tuples until the remainders of the source sequences are of length less than p. The one and only part that is influenced are the statements that determine the values of q and r, the lengths of the sequences to be merged. The following four statements replace the three statements

  q := p; r := p; m := m - 2*p

and, as the reader should convince himself, they represent an effective implementation of the strategy specified above; note that m denotes the total number of items in the two source sequences that remain to be merged:

  IF m >= p THEN q := p ELSE q := m END ;
  m := m-q;
  IF m >= p THEN r := p ELSE r := m END ;
  m := m-r

In addition, in order to guarantee termination of the program, the condition p = n, which controls the outer repetition, must be changed to p ≥ n. After these modifications, we may now proceed to describe the entire algorithm in terms of a procedure operating on the global array a with 2n elements.

  PROCEDURE StraightMerge;
    VAR i, j, k, L, t: INTEGER; (*index range of a is 0 .. 2*n-1 *)
      h, m, p, q, r: INTEGER; up: BOOLEAN;
  BEGIN up := TRUE; p := 1;
    REPEAT h := 1; m := n;
      IF up THEN i := 0; j := n-1; k := n; L := 2*n-1
      ELSE k := 0; L := n-1; i := n; j := 2*n-1
      END ;

      REPEAT (*merge a run from i- and j-sources to k-destination*)
        IF m >= p THEN q := p ELSE q := m END ;
        m := m-q;
        IF m >= p THEN r := p ELSE r := m END ;
        m := m-r;
        WHILE (q > 0) & (r > 0) DO
          IF a[i] < a[j] THEN
            a[k] := a[i]; k := k+h; i := i+1; q := q-1
          ELSE
            a[k] := a[j]; k := k+h; j := j-1; r := r-1
          END
        END ;
        WHILE r > 0 DO a[k] := a[j]; k := k+h; j := j-1; r := r-1 END ;
        WHILE q > 0 DO a[k] := a[i]; k := k+h; i := i+1; q := q-1 END ;
        h := -h; t := k; k := L; L := t
      UNTIL m = 0;
      up := ~up; p := 2*p
    UNTIL p >= n;
    IF ~up THEN (*result is in the upper half; copy it to a[0] .. a[n-1]*)
      FOR i := 0 TO n-1 DO a[i] := a[i+n] END
    END
  END StraightMerge

Analysis of Mergesort. Since each pass doubles p, and since the sort is terminated as soon as p ≥ n, it involves ⌈log n⌉ passes. Each pass, by definition, copies the entire set of n items exactly once. As a consequence, the total number of moves is exactly

  M = n × ⌈log n⌉

The number C of key comparisons is even less than M since no comparisons are involved in the tail copying operations. However, since the mergesort technique is usually applied in connection with the use of peripheral storage devices, the computational effort involved in the move operations dominates the effort of comparisons often by several orders of magnitude. The detailed analysis of the number of comparisons is therefore of little practical interest.

The merge sort algorithm apparently compares well with even the advanced sorting techniques discussed in the previous chapter. However, the administrative overhead for the manipulation of indices is relatively high, and the decisive disadvantage is the need for storage of 2n items. This is the reason sorting by merging is rarely used on arrays, i.e., on data located in main store. Figures comparing the real time behavior of this Mergesort algorithm appear in the last line of Table 2.9. They compare favorably with Heapsort but unfavorably with Quicksort.

2.4.2. Natural Merging

In straight merging no advantage is gained when the data are initially already partially sorted. The length of all merged subsequences in the k-th pass is less than or equal to $2^k$, independent of whether longer subsequences are already ordered and could as well be merged. In fact, any two ordered subsequences of lengths m and n might be merged directly into a single sequence of m+n items. A mergesort that at any time merges the two longest possible subsequences is called a natural merge sort.

An ordered subsequence is often called a string. However, since the word string is even more frequently used to describe sequences of characters, we will follow Knuth in our terminology and use the word run instead of string when referring to ordered subsequences. We call a subsequence $a_i \ldots a_j$ such that

$(a_{i-1} > a_i) \;\&\; (\mathbf{A}k : i \le k < j : a_k \le a_{k+1}) \;\&\; (a_j > a_{j+1})$

a maximal run or, for short, a run. A natural merge sort, therefore, merges (maximal) runs instead of sequences of fixed, predetermined length. Runs have the property that if two sequences of n runs are merged, a single sequence of exactly n runs emerges. Therefore, the total number of runs is halved in each pass, and the number of required moves of items is in the worst case n*log(n), but in the average case it is even less. The expected number of comparisons, however, is much larger because in addition to the comparisons necessary for the selection of items, further comparisons are needed between consecutive items of each file in order to determine the end of each run.

Our next programming exercise develops a natural merge algorithm in the same stepwise fashion that was used to explain the straight merging algorithm. It employs the sequence structure (represented by files, see Sect. 1.8) instead of the array, and it represents an unbalanced, two-phase, three-tape merge sort. We assume that the file variable c represents the initial sequence of items. (Naturally, in actual data processing applications, the initial data are first copied from the original source to c for reasons of safety.) a and b are two auxiliary file variables. Each pass consists of a distribution phase that distributes runs equally from c to a and b, and a merge phase that merges runs from a and b to c. This process is illustrated in Fig. 2.13.

[Figure: in each pass (1st run, 2nd run, ..., nth run) a distribution phase splits file c onto files a and b, and a merge phase merges a and b back onto c.]

Fig. 2.13. Sort phases and passes

  17  31' 05  59' 13  41  43  67' 11  23  29  47' 03  07  71' 02  19  57' 37  61
  05  17  31  59' 11  13  23  29  41  43  47  67' 02  03  07  19  57  71' 37  61
  05  11  13  17  23  29  31  41  43  47  59  67' 02  03  07  19  37  57  61  71
  02  03  05  07  11  13  17  19  23  29  31  37  41  43  47  57  59  61  67  71

Table 2.11. Example of a Natural Mergesort. As an example, Table 2.11 shows the file c in its original state (line1) and after each pass (lines 2-4) in a natural merge sort involving 20 numbers. Note that only three passes are needed. The sort terminates as soon as the number of runs on c is 1. (We assume that there exists at least one non-empty run on the initial sequence). We therefore let a variable L be used for counting the number of runs merged onto c. By making use of the type Rider defined in Sect. 1.8.1, the program can be formulated as follows: VAR L: INTEGER; r0, r1, r2: Files.Rider; (*see 1.8.1*) REPEAT Files.Set(r0, a, 0); Files.Set(r1, b, 0); Files.Set(r2, c, 0); distribute(r2, r0, r1); (*c to a and b*) Files.Set(r0, a, 0); Files.Set(r1, b, 0); Files.Set(r2, c, 0); L := 0; merge(r0, r1, r2) (*a and b into c*) UNTIL L = 1 The two phases clearly emerge as two distinct statements. They are now to be refined, i.e., expressed in more detail. The refined descriptions of distribute (from rider r2 to riders r0 and r1) and merge (from riders r0 and r1 to rider r2) follow:

  REPEAT copyrun(r2, r0);
    IF ~r2.eof THEN copyrun(r2, r1) END
  UNTIL r2.eof

  REPEAT mergerun(r0, r1, r2); INC(L)
  UNTIL r1.eof;
  IF ~r0.eof THEN copyrun(r0, r2); INC(L) END

This method of distribution supposedly results in either equal numbers of runs in both a and b, or in sequence a containing one run more than b. Since corresponding pairs of runs are merged, a leftover run may still be on file a, which simply has to be copied. The statements merge and distribute are formulated in terms of a refined statement mergerun and a subordinate procedure copyrun with obvious tasks. When attempting to do so, one runs into a serious difficulty: In order to determine the end of a run, two consecutive keys must be compared. However, files are such that only a single element is immediately accessible. We evidently cannot avoid looking ahead, i.e., associating a buffer with every sequence. The buffer is to contain the first element of the file still to be read and constitutes something like a window sliding over the file.

Instead of programming this mechanism explicitly into our program, we prefer to define yet another level of abstraction. It is represented by a new module Runs. It can be regarded as an extension of module Files of Sect. 1.8, introducing a new type Rider, which we may consider as an extension of type Files.Rider. This new type will not only accept all operations available on Riders and indicate the end of a file, but also indicate the end of a run and the first element of the remaining part of the file. The new type as well as its operators are presented by the following definition.

  DEFINITION Runs;
    IMPORT Files, Texts;
    TYPE Rider = RECORD (Files.Rider) first: INTEGER; eor: BOOLEAN END ;
    PROCEDURE OpenRandomSeq(f: Files.File; length, seed: INTEGER);
    PROCEDURE Set (VAR r: Rider; VAR f: Files.File);
    PROCEDURE copy(VAR source, destination: Rider);
    PROCEDURE ListSeq(VAR W: Texts.Writer; f: Files.File);
  END Runs.
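The look-ahead buffer described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a Python list stands in for a file, the class name Reader is hypothetical, the destination is a plain list rather than a rider, and copyrun expects a non-empty source. The fields first, eof, and eor are meant to play the roles they have in the definition of Runs.

```python
class Reader:
    """Read-only 'rider' over a list with one element of look-ahead."""

    def __init__(self, data):
        self._data = list(data)
        self._pos = 0
        self.first = None      # next element still to be consumed
        self.eof = False
        self._read()
        self.eor = self.eof    # an empty sequence is at end of run

    def _read(self):
        if self._pos < len(self._data):
            self.first = self._data[self._pos]
            self._pos += 1
        else:
            self.eof = True


def copy(src, dest):
    """Append src.first to dest, advance src, and update src.eor."""
    last = src.first
    dest.append(last)
    src._read()
    # The run ends at end of file or when the next key is smaller.
    src.eor = src.eof or src.first < last


def copyrun(src, dest):
    """Copy one maximal run from src to dest (src must not be at eof)."""
    while True:
        copy(src, dest)
        if src.eor:
            break


r = Reader([3, 7, 2])
out = []
copyrun(r, out)                # copies the run 3 7
state_after_first = (list(out), r.first, r.eof)
copyrun(r, out)                # copies the run 2 and reaches the end
```

After the first call, the window holds the key 2, which belongs to the next run; this is the single element of look-ahead that makes the end of a run detectable.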
A few additional explanations for the choice of the procedures are necessary. As we shall see, the sorting algorithms discussed here and later are based on copying elements from one file to another. A procedure copy therefore takes the place of separate read and write operations. For convenience of testing the following examples, we also introduce a procedure ListSeq, converting a file of integers into a text. Also for convenience an additional procedure is included: OpenRandomSeq initializes a file with numbers in random order. These two procedures will serve to test the algorithms to be discussed below. The values of the fields eof and eor are defined as results of copy in analogy to eof having been defined as result of a read operation.

  MODULE Runs;
    IMPORT Files, Texts;
    TYPE Rider* = RECORD (Files.Rider) first: INTEGER; eor: BOOLEAN END ;

    PROCEDURE OpenRandomSeq*(f: Files.File; length, seed: INTEGER);
      VAR i: INTEGER; w: Files.Rider;
    BEGIN Files.Set(w, f, 0);
      FOR i := 0 TO length-1 DO
        Files.WriteInt(w, seed); seed := (31*seed) MOD 997 + 5
      END ;
      Files.Close(f)
    END OpenRandomSeq;

    PROCEDURE Set*(VAR r: Rider; f: Files.File);
    BEGIN Files.Set(r, f, 0); Files.Read(r, r.first); r.eor := r.eof
    END Set;

    PROCEDURE copy*(VAR src, dest: Rider);
    BEGIN dest.first := src.first;
      Files.Write(dest, dest.first); Files.Read(src, src.first);
      src.eor := src.eof OR (src.first < dest.first)
    END copy;

    PROCEDURE ListSeq*(VAR W: Texts.Writer; f: Files.File);
      VAR x, y, k, n: INTEGER; r: Files.Rider;
    BEGIN k := 0; n := 0;
      Files.Set(r, f, 0); Files.ReadInt(r, x);
      WHILE ~r.eof DO
        Texts.WriteInt(W, x, 6); INC(k); Files.Read(r, y);
        IF y < x THEN (*run ends*) Texts.Write(W, "|"); INC(n) END ;
        x := y
      END ;
      Texts.Write(W, "$"); Texts.WriteInt(W, k, 5); Texts.WriteInt(W, n, 5);
      Texts.WriteLn(W)
    END ListSeq;

  END Runs.

We now return to the process of successive refinement of the process of natural merging. Procedure copyrun and the statement merge are now conveniently expressible as shown below. Note that we refer to the sequences (files) indirectly via the riders attached to them. In passing, we also note that the rider's field first represents the next key on a sequence being read, and the last key of a sequence being written.

  PROCEDURE copyrun(VAR x, y: Runs.Rider);
  BEGIN (*copy from x to y*)
    REPEAT Runs.copy(x, y) UNTIL x.eor
  END copyrun

  (*merge from r0 and r1 to r2*)
  REPEAT
    IF r0.first < r1.first THEN
      Runs.copy(r0, r2);
      IF r0.eor THEN copyrun(r1, r2) END
    ELSE
      Runs.copy(r1, r2);
      IF r1.eor THEN copyrun(r0, r2) END
    END
  UNTIL r0.eor OR r1.eor

The comparison and selection process of keys in merging a run terminates as soon as one of the two runs is exhausted. After this, the other run (which is not exhausted yet) has to be transferred to the resulting run by merely copying its tail. This is done by a call of procedure copyrun.

This should supposedly terminate the development of the natural merging sort procedure. Regrettably, the program is incorrect, as the very careful reader may have noticed. The program is incorrect in the sense that it does not sort properly in some cases.
Consider, for example, the following sequence of input data:

    03 02 05 11 07 13 19 17 23 31 29 37 43 41 47 59 57 61 71 67

By distributing consecutive runs alternately to a and b, we obtain

    a = 03 ' 07 13 19 ' 29 37 43 ' 57 61 71 '
    b = 02 05 11 ' 17 23 31 ' 41 47 59 ' 67

These sequences are readily merged into a single run, whereafter the sort terminates successfully. The example, although it does not lead to an erroneous behaviour of the program, makes us aware that mere distribution of runs to several files may result in a number of output runs that is less than the number of input runs. This is because the first item of the (i+2)nd run may be larger than the last item of the i-th run, thereby causing the two runs to merge automatically into a single run.
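The effect can be checked mechanically. The following sketch is written in Python rather than Oberon, purely for illustration; the helper names split_runs and distribute are hypothetical and are not part of module Runs. It distributes the runs of the example alternately onto two lists and then counts the runs actually present on each of them, using the same rule as Runs.copy (a run ends when the next key is smaller than the one just copied).

```python
def split_runs(seq):
    """Split seq into maximal non-decreasing runs."""
    runs = []
    for x in seq:
        if runs and runs[-1][-1] <= x:
            runs[-1].append(x)      # x continues the current run
        else:
            runs.append([x])        # x is smaller than its predecessor: a new run starts
    return runs


def distribute(seq):
    """Copy consecutive runs of seq alternately onto two sequences a and b."""
    a, b = [], []
    for i, run in enumerate(split_runs(seq)):
        (a if i % 2 == 0 else b).extend(run)
    return a, b


data = [3, 2, 5, 11, 7, 13, 19, 17, 23, 31, 29, 37, 43, 41, 47, 59, 57, 61, 71, 67]
a, b = distribute(data)
# number of runs in the input, and on a and b after distribution
print(len(split_runs(data)), len(split_runs(a)), len(split_runs(b)))
```

Although eight runs are distributed, each destination ends up holding a single run, because every run copied to a destination happens to start with a key larger than the last key already there.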

Although procedure distribute supposedly outputs runs in equal numbers to the two files, the important consequence is that the actual number of resulting runs on a and b may differ significantly. Our merge procedure, however, only merges pairs of runs and terminates as soon as b is read, thereby losing the tail of one of the sequences. Consider the following input data that are sorted (and truncated) in two subsequent passes:

    17 19 13 57 23 29 11 59 31 37 07 61 41 43 05 67 47 71 02 03
    13 17 19 23 29 31 37 41 43 47 57 71 11 59
    11 13 17 19 23 29 31 37 41 43 47 57 59 71

    Table 2.12 Incorrect Result of Mergesort Program.

The example of this programming mistake is typical for many programming situations. The mistake is caused by an oversight of one of the possible consequences of a presumably simple operation. It is also typical in the sense that several ways of correcting the mistake are open and that one of them has to be chosen. Often there exist two possibilities that differ in a very important, fundamental way:

1. We recognize that the operation of distribution is incorrectly programmed and does not satisfy the requirement that the number of runs differ by at most 1. We stick to the original scheme of operation and correct the faulty procedure accordingly.
2. We recognize that the correction of the faulty part involves far-reaching modifications, and we try to find ways in which other parts of the algorithm may be changed to accommodate the currently incorrect part.

In general, the first path seems to be the safer, cleaner one, the more honest way, providing a fair degree of immunity from later consequences of overlooked, intricate side effects. It is, therefore, the way toward a solution that is generally recommended. It is to be pointed out, however, that the second possibility should sometimes not be entirely ignored.
It is for this reason that we further elaborate on this example and illustrate a fix by modification of the merge procedure rather than the distribution procedure, which is primarily at fault. This implies that we leave the distribution scheme untouched and renounce the condition that runs be equally distributed. This may result in a less than optimal performance. However, the worst-case performance remains unchanged, and moreover, the case of highly unequal distribution is statistically very unlikely. Efficiency considerations are therefore no serious argument against this solution.

If the condition of equal distribution of runs no longer exists, then the merge procedure has to be changed so that, after reaching the end of one file, the entire tail of the remaining file is copied instead of at most one run. This change is straightforward and is very simple in comparison with any change in the distribution scheme. (The reader is urged to convince himself of the truth of this claim.) The revised version of the merge algorithm is shown below in the form of a function procedure:

    PROCEDURE NaturalMerge(src: Files.File): Files.File;
      VAR L: INTEGER; (*no. of runs merged*)
        f0, f1, f2: Files.File;
        r0, r1, r2: Runs.Rider;

      PROCEDURE copyrun(VAR x, y: Runs.Rider);
      BEGIN (*from x to y*)
        REPEAT Runs.copy(x, y) UNTIL x.eor
      END copyrun;

    BEGIN Runs.Set(r2, src);
      REPEAT
        f0 := Files.New("test0"); Files.Set(r0, f0, 0);
        f1 := Files.New("test1"); Files.Set(r1, f1, 0);
        (*distribute from r2 to r0 and r1*)
        REPEAT copyrun(r2, r0);
          IF ~r2.eof THEN copyrun(r2, r1) END
        UNTIL r2.eof;
        Runs.Set(r0, f0); Runs.Set(r1, f1);
        f2 := Files.New(""); Files.Set(r2, f2, 0);
        L := 0;
        (*merge from r0 and r1 to r2*)
        REPEAT
          REPEAT
            IF r0.first < r1.first THEN
              Runs.copy(r0, r2);
              IF r0.eor THEN copyrun(r1, r2) END
            ELSE
              Runs.copy(r1, r2);
              IF r1.eor THEN copyrun(r0, r2) END
            END
          UNTIL r0.eor OR r1.eor;
          INC(L)
        UNTIL r0.eof OR r1.eof;
        WHILE ~r0.eof DO copyrun(r0, r2); INC(L) END ;
        WHILE ~r1.eof DO copyrun(r1, r2); INC(L) END ;
        Runs.Set(r2, f2)
      UNTIL L = 1;
      RETURN f2
    END NaturalMerge;

2.4.3. Balanced Multiway Merging

The effort involved in a sequential sort is proportional to the number of required passes since, by definition, every pass involves the copying of the entire set of data. One way to reduce this number is to distribute runs onto more than two files. Merging r runs that are equally distributed on N files results in a sequence of $r/N$ runs. A second pass reduces their number to $r/N^2$, a third pass to $r/N^3$, and after k passes there are $r/N^k$ runs left. The total number of passes required to sort n items by N-way merging is therefore $k = \log_N(n)$. Since each pass requires n copy operations, the total number of copy operations is in the worst case

$$M = n \times \log_N(n)$$

As the next programming exercise, we will develop a sort program based on multiway merging. In order to further contrast the program with the previous natural two-phase merging procedure, we shall formulate the multiway merge as a single-phase, balanced mergesort. This implies that in each pass there are an equal number of input and output files onto which consecutive runs are alternately distributed. Using 2N files, the algorithm will therefore be based on N-way merging. Following the previously adopted strategy, we will not bother to detect the automatic merging of two consecutive runs distributed onto the same file. Consequently, we are forced to design the merge program without assuming strictly equal numbers of runs on the input files. In this program we encounter for the first time a natural application of a data structure consisting of arrays of files.
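The pass count estimated at the start of this section can be made concrete with a few lines of code. The following Python sketch is an illustration only (the function name merge_passes is hypothetical). It assumes, as the text does, that the runs are equally distributed, and it counts passes by repeatedly dividing the number of runs by N, rounding up.

```python
def merge_passes(runs, n_way):
    """Number of passes needed to reduce `runs` runs to a single run by n_way merging."""
    passes = 0
    while runs > 1:
        runs = -(-runs // n_way)    # ceiling division: one pass merges n_way runs into one
        passes += 1
    return passes


# In the worst case every item is a run of its own, so runs = n.
for n_way in (2, 3, 4, 8):
    print(n_way, merge_passes(1000, n_way))
```

For 1000 initial runs the loop reports 10, 7, 5, and 4 passes for 2-, 3-, 4-, and 8-way merging, which is the logarithm to base N rounded up to the next integer.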
As a matter of fact, it is surprising how strongly the following program differs from the previous one because of the change from two-way to multiway merging. The change is primarily a result of the circumstance that the merge process can no longer simply be terminated after one of the input runs is exhausted. Instead, a list of inputs that are still active, i.e., not yet exhausted, must be kept. Another complication stems from the need to switch the groups of input and output files after each pass. Here the indirection of access to files via riders comes in handy. In each pass, data may be copied from the same riders r to the same riders w. At the end of each pass we merely need to reset the input and output files to different riders.

Obviously, file numbers are used to index the array of files. Let us then assume that the initial file is the parameter src, and that for the sorting process 2N files are available:

    f, g: ARRAY N OF Files.File;
    r, w: ARRAY N OF Runs.Rider

The algorithm can now be sketched as follows:

    PROCEDURE BalancedMerge(src: Files.File): Files.File;
      VAR i, j: INTEGER;
        L: INTEGER; (*no. of runs distributed*)
        R: Runs.Rider;
    BEGIN Runs.Set(R, src);
      (*distribute initial runs from R to w[0] ... w[N-1]*)
      j := 0; L := 0;
      position riders w on files g;
      REPEAT
        copy one run from R to w[j];
        INC(j); INC(L);
        IF j = N THEN j := 0 END
      UNTIL R.eof;
      REPEAT (*merge from riders r to riders w*)
        switch files g to riders r;
        L := 0; j := 0; (*j = index of output file*)
        REPEAT
          INC(L);
          merge one run from inputs to w[j];
          INC(j); IF j = N THEN j := 0 END
        UNTIL all inputs exhausted
      UNTIL L = 1
      (*sorted file is with w[0]*)
    END BalancedMerge.

Having associated a rider R with the source file, we now refine the statement for the initial distribution of runs. Using the definition of copy, we replace copy one run from R to w[j] by:

    REPEAT Runs.copy(R, w[j]) UNTIL R.eor

Copying a run terminates when either the first item of the next run is encountered or when the end of the entire input file is reached.

In the actual sort algorithm, the following statements remain to be specified in more detail:

1. Position riders w on files g
2. Merge one run from inputs to w[j]
3. Switch files g to riders r
4. All inputs exhausted

First, we must accurately identify the current input sequences. Notably, the number of active inputs may be less than N. Obviously, there can be at most as many sources as there are runs; the sort terminates as soon as there is one single sequence left. This leaves open the possibility that at the initiation of the last sort pass there are fewer than N runs. We therefore introduce a variable, say k1, to denote the actual number of inputs used. We incorporate the initialization of k1 in the statement switch files as follows:

    IF L < N THEN k1 := L ELSE k1 := N END ;
    FOR i := 0 TO k1-1 DO Runs.Set(r[i], g[i]) END

Naturally, statement (2) is to decrement k1 whenever an input source ceases. Hence, predicate (4) may easily be expressed by the relation k1 = 0.
Statement (2), however, is more difficult to refine; it consists of the repeated selection of the least key among the available sources and its subsequent transport to the destination, i.e., the current output sequence. The process is further complicated by the necessity of determining the end of each run. The end of a run may be reached because (1) the subsequent key is less than the current key or (2) the end of the source is reached. In the latter case the source is eliminated by decrementing k1; in the former case the run is closed by excluding the sequence from further selection of items, but only until the creation of the current output run is completed. This makes it obvious that a second variable, say k2, is needed to denote the number of sources actually available for the selection of the next item. This value is initially set equal to k1 and is decremented whenever a run terminates because of condition (1).

Unfortunately, the introduction of k2 is not sufficient. We need to know not only the number of files, but also which files are still in actual use. An obvious solution is to use an array with Boolean components indicating the availability of the files. We choose, however, a different method that leads to a more efficient selection procedure which, after all, is the most frequently repeated part of the entire algorithm. Instead of using a Boolean array, a file index map, say t, is introduced. This map is used so that t[0] ... t[k2-1] are the indices of the available sequences. Thus statement (2) can be formulated as follows:

    k2 := k1;
    REPEAT
      select the minimal key, let t[m] be the sequence number on which it occurs;
      Runs.copy(r[t[m]], w[j]);
      IF r[t[m]].eof THEN eliminate sequence
      ELSIF r[t[m]].eor THEN close run
      END
    UNTIL k2 = 0

Since the number of sequences will be fairly small for any practical purpose, the selection algorithm to be specified in further detail in the next refinement step may as well be a straightforward linear search. The statement eliminate sequence implies a decrease of k1 as well as k2 and also a reassignment of indices in the map t. The statement close run merely decrements k2 and rearranges components of t accordingly. The details are shown in the following procedure, being the last refinement. The statement switch sequences is elaborated according to explanations given earlier.

    PROCEDURE BalancedMerge(src: Files.File): Files.File;
      VAR i, j, m, tx: INTEGER;
        L, k1, k2: INTEGER;
        min, x: INTEGER;
        t: ARRAY N OF INTEGER; (*index map*)
        R: Runs.Rider; (*source*)
        f, g: ARRAY N OF Files.File;
        r, w: ARRAY N OF Runs.Rider;
    BEGIN Runs.Set(R, src);
      FOR i := 0 TO N-1 DO
        g[i] := Files.New(""); Files.Set(w[i], g[i], 0)
      END ;
      (*distribute initial runs from src to g[0] ... g[N-1]*)
      j := 0; L := 0;
      REPEAT
        REPEAT Runs.copy(R, w[j]) UNTIL R.eor;
        INC(L); INC(j);
        IF j = N THEN j := 0 END
      UNTIL R.eof;
      FOR i := 0 TO N-1 DO t[i] := i END ;
      REPEAT
        IF L < N THEN k1 := L ELSE k1 := N END ;
        FOR i := 0 TO k1-1 DO Runs.Set(r[i], g[i]) END ; (*set input riders*)
        FOR i := 0 TO k1-1 DO
          g[i] := Files.New(""); Files.Set(w[i], g[i], 0)
        END ; (*set output riders*)
        (*merge from r[0] ... r[N-1] to w[0] ... w[N-1]*)
        FOR i := 0 TO N-1 DO t[i] := i END ;
        L := 0; (*nof runs merged*)
        j := 0;
        REPEAT (*merge one run from inputs to w[j]*)
          INC(L); k2 := k1;
          REPEAT (*select the minimal key*)
            m := 0; min := r[t[0]].first; i := 1;
            WHILE i < k2 DO
              x := r[t[i]].first;
              IF x < min THEN min := x; m := i END ;
              INC(i)
            END ;
            Runs.copy(r[t[m]], w[j]);
            IF r[t[m]].eof THEN (*eliminate this sequence*)
              DEC(k1); DEC(k2); t[m] := t[k2]; t[k2] := t[k1]
            ELSIF r[t[m]].eor THEN (*close run*)
              DEC(k2); tx := t[m]; t[m] := t[k2]; t[k2] := tx
            END
          UNTIL k2 = 0;
          INC(j);
          IF j = N THEN j := 0 END
        UNTIL k1 = 0
      UNTIL L = 1;
      RETURN Files.Base(w[t[0]])
    END BalancedMerge

2.4.4. Polyphase Sort

We have now discussed the necessary techniques and have acquired the proper background to investigate and program yet another sorting algorithm whose performance is superior to the balanced sort. We have seen that balanced merging eliminates the pure copying operations necessary when the distribution and the merging operations are united into a single phase. The question arises whether or not the given sequences could be processed even more efficiently. This is indeed the case; the key to this next improvement lies in abandoning the rigid notion of strict passes, i.e., to use the sequences in a more sophisticated way than by always having N/2 sources and as many destinations and exchanging sources and destinations at the end of each distinct pass. Instead, the notion of a pass becomes diffuse. The method was invented by R.L. Gilstad [2-3] and called Polyphase Sort.

It is first illustrated by an example using three sequences. At any time, items are merged from two sources into a third sequence variable. Whenever one of the source sequences is exhausted, it immediately becomes the destination of the merge operations of data from the non-exhausted source and the previous destination sequence.

As we know that n runs on each input are transformed into n runs on the output, we need to list only the number of runs present on each sequence (instead of specifying actual keys). In Fig. 2.14 we assume that initially the two input sequences f1 and f2 contain 13 and 8 runs, respectively. Thus, in the first pass 8 runs are merged from f1 and f2 to f3, in the second pass the remaining 5 runs are merged from f3 and f1 onto f2, etc.
In the end, f1 is the sorted sequence.

    f1   f2   f3
    13    8
     5    0    8
     0    5    3
     3    2    0
     1    0    2
     0    1    1
     1    0    0

Fig. 2.14. Polyphase mergesort of 21 runs with 3 sequences
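The rows of such a figure follow mechanically from the initial distribution. The Python sketch below is an illustration only and not part of the book's Oberon programs; the function name polyphase_counts is hypothetical. It tracks run counts, not keys: in each partial pass the empty sequence becomes the destination, and as many runs are merged as the shortest source holds.

```python
def polyphase_counts(runs):
    """Return the successive run counts of a Polyphase merge.

    `runs` lists the number of runs on each sequence; the initial
    destination is the entry that is 0.
    """
    counts = list(runs)
    rows = [counts]
    while sum(counts) > 1:
        sources = [c for c in counts if c > 0]
        if len(sources) < 2:
            raise ValueError("not a perfect initial distribution")
        dest = counts.index(0)          # the empty sequence receives the merged runs
        k = min(sources)                # merge until the shortest source is exhausted
        counts = [c - k if c > 0 else 0 for c in counts]
        counts[dest] = k
        rows.append(counts)
    return rows


for row in polyphase_counts([13, 8, 0]):
    print(row)
```

Starting from 13 and 8 runs, the sketch prints the seven rows of Fig. 2.14. It works for any number of sequences, provided the initial distribution is one for which the method terminates with a single run; otherwise it raises an error.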

A second example shows the Polyphase method with 6 sequences. Let there initially be 16 runs on f1, 15 on f2, 14 on f3, 12 on f4, and 8 on f5. In the first partial pass, 8 runs are merged onto f6. In the end, f2 contains the sorted set of items (see Fig. 2.15).

    f1   f2   f3   f4   f5   f6
    16   15   14   12    8
     8    7    6    4    0    8
     4    3    2    0    4    4
     2    1    0    2    2    2
     1    0    1    1    1    1
     0    1    0    0    0    0

Fig. 2.15. Polyphase mergesort of 65 runs with 6 sequences

Polyphase is more efficient than balanced merge because, given N sequences, it always operates with an (N-1)-way merge instead of an N/2-way merge. As the number of required passes is approximately $\log_N n$, n being the number of items to be sorted and N being the degree of the merge operations, Polyphase promises a significant improvement over balanced merging.

Of course, the distribution of initial runs was carefully chosen in the above examples. In order to find out which initial distributions of runs lead to a proper functioning, we work backward, starting with the final distribution (last line in Fig. 2.15). Rewriting the tables of the two examples and rotating each row by one position with respect to the prior row yields Tables 2.13 and 2.14 for six passes and for three and six sequences, respectively.

    L   a1(L)   a2(L)   Sum ai(L)
    0     1       0         1
    1     1       1         2
    2     2       1         3
    3     3       2         5
    4     5       3         8
    5     8       5        13
    6    13       8        21

Table 2.13 Perfect distribution of runs on two sequences.

    L   a1(L)   a2(L)   a3(L)   a4(L)   a5(L)   Sum ai(L)
    0     1       0       0       0       0         1
    1     1       1       1       1       1         5
    2     2       2       2       2       1         9
    3     4       4       4       3       2        17
    4     8       8       7       6       4        33
    5    16      15      14      12       8        65

Table 2.14 Perfect distribution of runs on five sequences.

From Table 2.13 we can deduce for L > 0 the relations

$$a_2(L+1) = a_1(L)$$
$$a_1(L+1) = a_1(L) + a_2(L)$$

and $a_1(0) = 1$, $a_2(0) = 0$. Defining $f_{i+1} = a_1(i)$, we obtain for i > 0

$$f_{i+1} = f_i + f_{i-1}, \quad f_1 = 1, \quad f_0 = 0$$

These are the recursive rules (or recurrence relations) defining the Fibonacci numbers:

    f = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, ...

Each Fibonacci number is the sum of its two predecessors. As a consequence, the numbers of initial runs on the two input sequences must be two consecutive Fibonacci numbers in order to make Polyphase work properly with three sequences.

How about the second example (Table 2.14) with six sequences? The formation rules are easily derived as

$$
\begin{aligned}
a_5(L+1) &= a_1(L) \\
a_4(L+1) &= a_1(L) + a_5(L) = a_1(L) + a_1(L-1) \\
a_3(L+1) &= a_1(L) + a_4(L) = a_1(L) + a_1(L-1) + a_1(L-2) \\
a_2(L+1) &= a_1(L) + a_3(L) = a_1(L) + a_1(L-1) + a_1(L-2) + a_1(L-3) \\
a_1(L+1) &= a_1(L) + a_2(L) = a_1(L) + a_1(L-1) + a_1(L-2) + a_1(L-3) + a_1(L-4)
\end{aligned}
$$

Substituting $f_{i+4}$ for $a_1(i)$ yields

$$
\begin{aligned}
f_{i+1} &= f_i + f_{i-1} + f_{i-2} + f_{i-3} + f_{i-4} && \text{for } i \ge 4 \\
f_4 &= 1 \\
f_i &= 0 && \text{for } i < 4
\end{aligned}
$$

These numbers are the Fibonacci numbers of order 4. In general, the Fibonacci numbers of order p are defined as follows:

$$
\begin{aligned}
f_{i+1}^{(p)} &= f_i^{(p)} + f_{i-1}^{(p)} + \dots + f_{i-p}^{(p)} && \text{for } i \ge p \\
f_p^{(p)} &= 1 \\
f_i^{(p)} &= 0 && \text{for } 0 \le i < p
\end{aligned}
$$

Note that the ordinary Fibonacci numbers are those of order 1.

We have now seen that the initial numbers of runs for a perfect Polyphase Sort with N sequences are the sums of any N-1, N-2, ... , 1 (see Table 2.15) consecutive Fibonacci numbers of order N-2. This apparently implies that this method is only applicable to inputs whose number of runs is the sum of N-1 such Fibonacci sums. The important question thus arises: What is to be done when the number of initial runs is not such an ideal sum? The answer is simple (and typical for such situations): we simulate the existence of hypothetical empty runs, such that the sum of real and hypothetical runs is a perfect sum. The empty runs are called dummy runs.

But this is not really a satisfactory answer because it immediately raises the further and more difficult question: How do we recognize dummy runs during merging? Before answering this question we must first investigate the prior problem of initial run distribution and decide upon a rule for the distribution of actual and dummy runs onto the N-1 tapes.

     L   N = 3      4       5       6       7       8
     1       2      3       4       5       6       7
     2       3      5       7       9      11      13
     3       5      9      13      17      21      25
     4       8     17      25      33      41      49
     5      13     31      49      65      81      97
     6      21     57      94     129     161     193
     7      34    105     181     253     321     385
     8      55    193     349     497     636     769
     9      89    355     673     977    1261    1531
    10     144    653    1297    1921    2501    3049
    11     233   1201    2500    3777    4961    6073
    12     377   2209    4819    7425    9841   12097
    13     610   4063    9289   14597   19521   24097
    14     987   7473   17905   28697   38721   48001

Table 2.15 Numbers of runs allowing for perfect distribution.
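The entries of Tables 2.14 and 2.15 can be regenerated from the formation rules. The following Python sketch is an illustration only; the function name perfect_distribution is hypothetical. It applies the rule that each a_i at the next level equals a_1 plus the old a_(i+1), with the last sequence receiving a_1 itself, starting from a single run on the first sequence.

```python
def perfect_distribution(n_seq, level):
    """Ideal run counts on the n_seq - 1 input sequences at the given level."""
    a = [1] + [0] * (n_seq - 2)             # level 0: a single run on the first sequence
    for _ in range(level):
        z = a[0]
        a = [z + nxt for nxt in a[1:]] + [z]    # a[i] := z + a[i+1]; the last one gets z
    return a


# Row L = 5 of Table 2.14, then row L = 5 of Table 2.15 (N = 3 ... 8).
print(perfect_distribution(6, 5))
print([sum(perfect_distribution(n, 5)) for n in range(3, 9)])
```

With six sequences and level 5 the sketch yields the distribution 16, 15, 14, 12, 8 used in Fig. 2.15, and summing each distribution gives the corresponding row of Table 2.15.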

In order to find an appropriate rule for distribution, however, we must know how actual and dummy runs are merged. Clearly, the selection of a dummy run from sequence i means precisely that sequence i is ignored during this merge, resulting in a merge from fewer than N-1 sources. Merging of a dummy run from all N-1 sources implies no actual merge operation, but instead the recording of the resulting dummy run on the output sequence. From this we conclude that dummy runs should be distributed to the N-1 sequences as uniformly as possible, since we are interested in active merges from as many sources as possible.

Let us forget dummy runs for a moment and consider the problem of distributing an unknown number of runs onto N-1 sequences. It is plain that the Fibonacci numbers of order N-2 specifying the desired numbers of runs on each source can be generated while the distribution progresses. Assuming, for example, N = 6 and referring to Table 2.14, we start by distributing runs as indicated by the row with index L = 1 (1, 1, 1, 1, 1); if there are more runs available, we proceed to the second row (2, 2, 2, 2, 1); if the source is still not exhausted, the distribution proceeds according to the third row (4, 4, 4, 3, 2), and so on. We shall call the row index level. Evidently, the larger the number of runs, the higher is the level of Fibonacci numbers which, incidentally, is equal to the number of merge passes or switchings necessary for the subsequent sort. The distribution algorithm can now be formulated in a first version as follows:

1. Let the distribution goal be the Fibonacci numbers of order N-2, level 1.
2. Distribute according to the set goal.
3. If the goal is reached, compute the next level of Fibonacci numbers; the difference between them and those on the former level constitutes the new distribution goal. Return to step 2. If the goal cannot be reached because the source is exhausted, terminate the distribution process.
The rules for calculating the next level of Fibonacci numbers are contained in their definition. We can thus concentrate our attention on step 2, where, with a given goal, the subsequent runs are to be distributed one after the other onto the N-1 output sequences. It is here where the dummy runs have to reappear in our considerations.

Let us assume that when raising the level, we record the next goal by the differences d_i for i = 1 ... N-1, where d_i denotes the number of runs to be put onto sequence i in this step. We can now assume that we immediately put d_i dummy runs onto sequence i and then regard the subsequent distribution as the replacement of dummy runs by actual runs, each time recording a replacement by subtracting 1 from the count d_i. Thus, d_i indicates the number of dummy runs on sequence i when the source becomes empty.

It is not known which algorithm yields the optimal distribution, but the following has proved to be a very good method. It is called horizontal distribution (cf. Knuth, Vol. 3, p. 270), a term that can be understood by imagining the runs as being piled up in the form of silos, as shown in Fig. 2.16 for N = 6, level 5 (cf. Table 2.14). In order to reach an equal distribution of remaining dummy runs as quickly as possible, their replacement by actual runs reduces the size of the piles by picking off dummy runs on horizontal levels proceeding from left to right. In this way, the runs are distributed onto the sequences as indicated by their numbers as shown in Fig. 2.16.

    8     1
    7     2    3    4
    6     5    6    7    8
    5     9   10   11   12
    4    13   14   15   16   17
    3    18   19   20   21   22
    2    23   24   25   26   27
    1    28   29   30   31   32

Fig. 2.16. Horizontal distribution of runs
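The numbering in Fig. 2.16 can be reproduced by a small simulation. The Python sketch below is an illustration only; it transcribes the procedure select that this section goes on to develop, with the initial values stated there, and wraps it in a hypothetical function horizontal_distribution that records the destination chosen for each run. Sequence indices are 0-based, as in the Oberon arrays.

```python
def horizontal_distribution(n_seq, n_runs):
    """Return (destinations, a, d) after distributing n_runs runs.

    destinations[k] is the 0-based index of the sequence receiving run k;
    a holds the ideal run counts, d the dummy runs not yet replaced.
    """
    a = [1] * (n_seq - 1) + [0]
    d = [1] * (n_seq - 1) + [0]
    j = 0
    destinations = []
    for _ in range(n_runs):
        # body of select
        if d[j] < d[j + 1]:
            j += 1                      # step right along the current horizontal level
        else:
            if d[j] == 0:               # goal reached: compute the next level
                z = a[0]
                for i in range(n_seq - 1):
                    d[i] = z + a[i + 1] - a[i]
                    a[i] = z + a[i + 1]
            j = 0                       # start a new horizontal level at the left
        d[j] -= 1                       # a dummy run on sequence j becomes an actual run
        destinations.append(j)
    return destinations, a, d


dest, a, d = horizontal_distribution(6, 65)
print(dest[33:])    # destinations of the 32 runs added at level 5
print(a[:5], d)
```

With N = 6 the first 33 runs fill the distribution of level 4; the remaining 32 are the runs numbered 1 to 32 in Fig. 2.16. After 65 runs no dummy runs are left, and the five sequences hold 16, 15, 14, 12, and 8 runs.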

We are now in a position to describe the algorithm in the form of a procedure called select, which is activated each time a run has been copied and a new source is selected for the next run. We assume the existence of a variable j denoting the index of the current destination sequence. a[i] and d[i] denote the ideal and dummy distribution numbers for sequence i.

    j, level: INTEGER;
    a, d: ARRAY N OF INTEGER;

These variables are initialized with the following values:

    a[i] = 1, d[i] = 1          for i = 0 ... N-2
    a[N-1] = 0, d[N-1] = 0      (dummy)
    j = 0, level = 0

Note that select is to compute the next row of Table 2.14, i.e., the values a1(L) ... aN-1(L), each time that the level is increased. The next goal, i.e., the differences d_i = a_i(L) - a_i(L-1), is also computed at that time. The indicated algorithm relies on the fact that the resulting d_i decrease with increasing index (descending stair in Fig. 2.16). Note that the exception is the transition from level 0 to level 1; this algorithm must therefore be used starting at level 1. Select ends by decrementing d[j] by 1; this operation stands for the replacement of a dummy run on sequence j by an actual run.

    PROCEDURE select;
      VAR i, z: INTEGER;
    BEGIN
      IF d[j] < d[j+1] THEN INC(j)
      ELSE
        IF d[j] = 0 THEN
          INC(level); z := a[0];
          FOR i := 0 TO N-2 DO
            d[i] := z + a[i+1] - a[i]; a[i] := z + a[i+1]
          END
        END ;
        j := 0
      END ;
      DEC(d[j])
    END select

Assuming the availability of a routine to copy a run from the source src with rider R onto f[j] with rider r[j], we can formulate the initial distribution phase as follows (assuming that the source contains at least one run):

    REPEAT select; copyrun
    UNTIL R.eof

Here, however, we must pause for a moment to recall the effect encountered in distributing runs in the previously discussed natural merge algorithm: The fact that two runs consecutively arriving at the same destination may merge into a single run causes the assumed numbers of runs to be incorrect.
By devising the sort algorithm such that its correctness does not depend on the number of runs, this side effect can safely be ignored. In the Polyphase Sort, however, we are particularly concerned about keeping track of the exact number of runs on each file. Consequently, we cannot afford to overlook the effect of such a coincidental merge. An additional complication of the distribution algorithm therefore cannot be avoided. It becomes necessary to retain the keys of the last item of the last run on each sequence. Fortunately, our implementation of Runs does exactly this. In the case of output sequences, f.first represents the item last written. A next attempt to describe the distribution algorithm could therefore be REPEAT select; IF f[j].first f(optimum) THEN optimum := solution END

The variable optimum records the best solution so far encountered. Naturally, it has to be properly initialized; moreover, it is customary to record the value f(optimum) by another variable in order to avoid its frequent recomputation.

An example of the general problem of finding an optimal solution to a given problem follows: We choose the important and frequently encountered problem of finding an optimal selection out of a given set of objects subject to constraints. Selections that constitute acceptable solutions are gradually built up by investigating individual objects from the base set. A procedure Try describes the process of investigating the suitability of one individual object, and it is called recursively (to investigate the next object) until all objects have been considered.

We note that the consideration of each object (called candidates in previous examples) has two possible outcomes, namely, either the inclusion of the investigated object in the current selection or its exclusion. This makes the use of a repeat or for statement inappropriate; instead, the two cases may as well be explicitly written out. This is shown, assuming that the objects are numbered 1, 2, ... , n.

    PROCEDURE Try(i: INTEGER);
    BEGIN
      IF inclusion is acceptable THEN
        include i-th object;
        IF i < n THEN Try(i+1) ELSE check optimality END ;
        eliminate i-th object
      END ;
      IF exclusion is acceptable THEN
        IF i < n THEN Try(i+1) ELSE check optimality END
      END
    END Try

From this pattern it is evident that there are $2^n$ possible sets; clearly, appropriate acceptability criteria must be employed to reduce the number of investigated candidates very drastically. In order to elucidate this process, let us choose a concrete example for a selection problem: Let each of the n objects a0, ... , an-1 be characterized by its weight and its value. Let the optimal set be the one with the largest sum of the values of its components, and let the constraint be a limit on the sum of their weight. This is a problem well known to all travellers who pack suitcases by selecting from n items in such a way that their total value is optimal and that their total weight does not exceed a specific allowance.

We are now in a position to decide upon the representation of the given facts in terms of global variables. The choices are easily derived from the foregoing developments:

    TYPE object = RECORD weight, value: INTEGER END ;
    VAR obj: ARRAY n OF object;
      limw, totv, maxv: INTEGER;
      s, opts: SET

The variables limw and totv denote the weight limit and the total value of all n objects. These two values are actually constant during the entire selection process. s represents the current selection of objects in which each object is represented by its name (index). opts is the optimal selection so far encountered, and maxv is its value.

What are now the criteria for acceptability of an object for the current selection? If we consider inclusion, then an object is selectable if it fits into the weight allowance. If it does not fit, we may stop trying to add further objects to the current selection. If, however, we consider exclusion, then the criterion for acceptability, i.e., for the continuation of building up the current selection, is that the total value which is still achievable after this exclusion is not less than the value of the optimum so far encountered. For, if it is less, continuation of the search, although it may produce some solution, will not yield the optimal solution. Hence any further search on the current path is fruitless. From these two conditions we determine the relevant quantities to be computed for each step in the selection process:

1. The total weight tw of the selection s so far made.
2. The still achievable value av of the current selection s.

These two entities are appropriately represented as parameters of the procedure Try. The condition inclusion is acceptable can now be formulated as

    tw + obj[i].weight <= limw

and the subsequent check for optimality as

    IF av > maxv THEN (*new optimum, record it*)
      opts := s; maxv := av
    END

The last assignment is based on the reasoning that the achievable value is the achieved value, once all n objects have been dealt with.
The condition exclusion is acceptable is expressed by

    av - obj[i].value > maxv

Since it is used again thereafter, the value av - obj[i].value is given the name av1 in order to circumvent its reevaluation.

The entire procedure is now composed of the discussed parts with the addition of appropriate initialization statements for the global variables. The ease of expressing inclusion and exclusion from the set s by use of set operators is noteworthy. The results opts and maxv of Selection with weight allowances ranging from 10 to 120 are listed in Table 3.5.

    TYPE Object = RECORD value, weight: INTEGER END ;
    VAR obj: ARRAY n OF Object;
      limw, totv, maxv: INTEGER;
      s, opts: SET;

    PROCEDURE Try(i, tw, av: INTEGER);
      VAR av1: INTEGER;
    BEGIN (*try inclusion*)
      IF tw + obj[i].weight <= limw THEN
        s := s + {i};
        IF i < n THEN Try(i+1, tw + obj[i].weight, av)
        ELSIF av > maxv THEN maxv := av; opts := s
        END ;
        s := s - {i}
      END ;
      (*try exclusion*)

107 IF av > maxv + obj[i].value THEN IF i < n THEN Try(i+1, tw, av - obj[i].value) ELSE maxv := av - obj[i].value; opts := s END END END Try; PROCEDURE Selection(n, Weightinc, WeightLimit: INTEGER); VAR i: INTEGER; BEGIN limw := 0; REPEAT limw := limw + WeightInc; maxv := 0; s := {}; opts := {}; Try(0, 0, totv); UNTIL limw >= WeightLimit END Selection. Weight 10 Value 18 10 20 30 40 50 60 70 80 90 100 110 120

11 20

12 17

13 19

14 25

15 21

16 27

17 23

18 25

19 24

*

* * * * * * * * *

* * * *

* * * * * * * * * *

* *

* *

* *

*

* * * * * * *

* * * * * *

* *

* * * *

* * * *

*

*

Tot 18 27 52 70 84 99 115 130 139 157 172 183

Table 3.5 Sample Output from Optimal Selection Program.

This backtracking scheme, with a limitation factor curtailing the growth of the potential search tree, is also known as the branch and bound algorithm.
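The selection procedure can also be sketched in Python. This is an illustrative transcription under stated assumptions, not the program of the text: the names optimal_selection and try_object are hypothetical, the objects are passed as two parallel lists, and the last object is assumed to have index n-1.

```python
def optimal_selection(weights, values, limw):
    """Branch and bound over inclusion/exclusion of each object.

    Returns (maxv, opts): the best total value within the weight limit limw
    and the set of indices of the selected objects.
    """
    n = len(weights)
    if n == 0:
        return 0, frozenset()
    best = {"maxv": 0, "opts": frozenset()}
    s = set()  # current selection (indices)

    def try_object(i, tw, av):
        # tw: total weight of s so far; av: value still achievable
        # try inclusion: acceptable only if the object fits
        if tw + weights[i] <= limw:
            s.add(i)
            if i < n - 1:
                try_object(i + 1, tw + weights[i], av)
            elif av > best["maxv"]:
                best["maxv"], best["opts"] = av, frozenset(s)
            s.remove(i)
        # try exclusion: acceptable only if the optimum can still be beaten
        av1 = av - values[i]
        if av1 > best["maxv"]:
            if i < n - 1:
                try_object(i + 1, tw, av1)
            else:
                best["maxv"], best["opts"] = av1, frozenset(s)

    try_object(0, 0, sum(values))
    return best["maxv"], best["opts"]
```

With the ten objects of Table 3.5 (weights 10 to 19), a weight allowance of 30 yields the value 52, obtained from the objects of weight 14 and 16, in agreement with the Tot row.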

Exercises 3.1 (Towers of Hanoi). Given are three rods and n disks of different sizes. The disks can be stacked up on the rods, thereby forming towers. Let the n disks initially be placed on rod A in the order of decreasing size, as shown in Fig. 3.10 for n = 3. The task is to move the n disks from rod A to rod C such that they are ordered in the original way. This has to be achieved under the constraints that 1. In each step exactly one disk is moved from one rod to another rod. 2. A disk may never be placed on top of a smaller disk. 3. Rod B may be used as an auxiliary store. Find an algorithm that performs this task. Note that a tower may conveniently be considered as consisting of the single disk at the top, and the tower consisting of the remaining disks. Describe the algorithm as a recursive program.

Fig. 3.10. The towers of Hanoi (rods A, B, and C; disks 1, 2, 3)

3.2. Write a procedure that generates all n! permutations of n elements a_1, ..., a_n in situ, i.e., without the aid of another array. Upon generating the next permutation, a parametric procedure Q is to be called which may, for instance, output the generated permutation. Hint: Consider the task of generating all permutations of the elements a_1, ..., a_m as consisting of the m subtasks of generating all permutations of a_1, ..., a_{m-1} followed by a_m, where in the i-th subtask the two elements a_i and a_m had initially been interchanged.

3.3. Deduce the recursion scheme of Fig. 3.11 which is a superposition of the four curves W1, W2, W3, W4. The structure is similar to that of the Sierpinski curves (3.21) and (3.22). From the recursion pattern, derive a recursive program that draws these curves.

Fig. 3.11. Curves W1 - W4

3.4. Only 12 of the 92 solutions computed by the Eight Queens algorithm are essentially different. The other ones can be derived by reflections about axes or the center point. Devise a program that determines the 12 principal solutions. Note that, for example, the search in column 1 may be restricted to positions 1-4.

3.5. Change the Stable Marriage Program so that it determines the optimal solution (male or female). It therefore becomes a branch and bound program of the type represented by Program 3.7.

3.6. A certain railway company serves n stations S_0, ..., S_{n-1}. It intends to improve its customer information service by computerized information terminals. A customer types in his departure station S_A and his destination S_D, and he is supposed to be (immediately) given the schedule of the train connections with minimum total time of the journey. Devise a program to compute the desired information. Assume that the timetable (which is your data bank) is provided in a suitable data structure containing departure (= arrival) times of all available trains. Naturally, not all stations are connected by direct lines (see also Exercise 1.6).

3.7. The Ackermann Function A is defined for all non-negative integer arguments m and n as follows:

    A(0, n) = n + 1
    A(m, 0) = A(m-1, 1)             (m > 0)
    A(m, n) = A(m-1, A(m, n-1))     (m, n > 0)

Design a program that computes A(m,n) without the use of recursion. As a guideline, use Program 2.11, the non-recursive version of Quicksort. Devise a set of rules for the transformation of recursive into iterative programs in general.

References

3-1. D.G. McVitie and L.B. Wilson. The Stable Marriage Problem. Comm. ACM, 14, No. 7 (1971), 486-92.
3-2. -------. Stable Marriage Assignment for Unequal Sets. Bit, 10, (1970), 295-309.
3-3. Space Filling Curves, or How to Waste Time on a Plotter. Software - Practice and Experience, 1, No. 4 (1971), 403-40.
3-4. N. Wirth. Program Development by Stepwise Refinement. Comm. ACM, 14, No. 4 (1971), 221-27.


4 Dynamic Information Structures

4.1. Recursive Data Types

In Chap. 1 the array, record, and set structures were introduced as fundamental data structures. They are called fundamental because they constitute the building blocks out of which more complex structures are formed, and because in practice they do occur most frequently. The purpose of defining a data type, and of thereafter specifying that certain variables be of that type, is that the range of values assumed by these variables, and therefore their storage pattern, is fixed once and for all. Hence, variables declared in this way are said to be static. However, there are many problems which involve far more complicated information structures. The characteristic of these problems is that not only the values but also the structures of variables change during the computation. They are therefore called dynamic structures. Naturally, the components of such structures are -- at some level of resolution -- static, i.e., of one of the fundamental data types. This chapter is devoted to the construction, analysis, and management of dynamic information structures.

It is noteworthy that there exist some close analogies between the methods used for structuring algorithms and those for structuring data. As with all analogies, there remain some differences, but a comparison of structuring methods for programs and data is nevertheless illuminating. The elementary, unstructured statement is the assignment of an expression's value to a variable. Its corresponding member in the family of data structures is the scalar, unstructured type. These two are the atomic building blocks for composite statements and data types. The simplest structures, obtained through enumeration or sequencing, are the compound statement and the record structure. They both consist of a finite (usually small) number of explicitly enumerated components, which may themselves all be different from each other.
If all components are identical, they need not be written out individually: we use the for statement and the array structure to indicate replication by a known, finite factor. A choice among two or more elements is expressed by the conditional or the case statement and by extensions of record types, respectively. And finally, a repetition by an initially unknown (and potentially infinite) factor is expressed by the while and repeat statements. The corresponding data structure is the sequence (file), the simplest kind which allows the construction of types of infinite cardinality.

The question arises whether or not there exists a data structure that corresponds in a similar way to the procedure statement. Naturally, the most interesting and novel property of procedures in this respect is recursion. Values of such a recursive data type would contain one or more components belonging to the same type as itself, in analogy to a procedure containing one or more calls to itself. Like procedures, data type definitions might be directly or indirectly recursive.

A simple example of an object that would most appropriately be represented as a recursively defined type is the arithmetic expression found in programming languages. Recursion is used to reflect the possibility of nesting, i.e., of using parenthesized subexpressions as operands in expressions. Hence, let an expression here be defined informally as follows:

An expression consists of a term, followed by an operator, followed by a term. (The two terms constitute the operands of the operator.) A term is either a variable -- represented by an identifier -- or an expression enclosed in parentheses.

A data type whose values represent such expressions can easily be described by using the tools already available with the addition of recursion:

    TYPE expression = RECORD op: INTEGER;
        opd1, opd2: term
      END

    TYPE term = RECORD
        IF t: BOOLEAN THEN id: Name ELSE subex: expression END
      END

Hence, every variable of type term consists of two components, namely, the tagfield t and, if t is true, the field id, or of the field subex otherwise. Consider now, for example, the following four expressions:

    1. x + y
    2. x - (y * z)
    3. (x + y) * (z - w)
    4. (x/(y + z)) * w

These expressions may be visualized by the patterns in Fig. 4.1, which exhibit their nested, recursive structure, and they determine the layout or mapping of these expressions onto a store.

Fig. 4.1. Storage patterns for recursive record structures

A second example of a recursive information structure is the family pedigree: Let a pedigree be defined by (the name of) a person and the two pedigrees of the parents. This definition leads inevitably to an infinite structure. Real pedigrees are bounded because at some level of ancestry information is missing. Assume that this can be taken into account by again using a conditional structure:

    TYPE ped = RECORD
        IF known: BOOLEAN THEN name: Name; father, mother: ped END
      END

Note that every variable of type ped has at least one component, namely, the tagfield called known. If its value is TRUE, then there are three more fields; otherwise there is none. A particular value is shown here in the forms of a nested expression and of a diagram that may suggest a possible storage pattern (see Fig. 4.2).

    (T, Ted, (T, Fred, (T, Adam, (F), (F)), (F)), (T, Mary, (F), (T, Eva, (F), (F))))

The important role of the variant facility becomes clear; it is the only means by which a recursive data structure can be bounded, and it is therefore an inevitable companion of every recursive definition. The analogy between program and data structuring concepts is particularly pronounced in this case. A conditional (or selective) statement must necessarily be part of every recursive procedure in order that execution of the procedure can terminate. In practice, dynamic structures involve references or pointers to their elements, and the concept of an alternative (to terminate the recursion) is implied in the pointer, as shown in the next section.


Fig. 4.2. An example of a recursive data structure
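The pedigree value given above can be written down directly as nested Python tuples. This is only a sketch: the layout (known, name, father, mother) and the function name count_records are assumptions made for illustration. It shows how the tag bounds the recursion, since a record with the tag FALSE has no further fields to descend into.

```python
# A record is either (False,) or (True, name, father, mother).
F = (False,)
ted = (True, "Ted",
       (True, "Fred", (True, "Adam", F, F), F),
       (True, "Mary", F, (True, "Eva", F, F)))


def count_records(ped):
    """Return (total number of records, number of records with known = TRUE)."""
    if not ped[0]:
        # tag is FALSE: no name and no parent fields, the recursion stops
        return 1, 0
    _, _, father, mother = ped
    f_total, f_known = count_records(father)
    m_total, m_known = count_records(mother)
    return 1 + f_total + m_total, 1 + f_known + m_known
```

For the value above, count_records(ted) reports 11 records, of which only 5 describe known persons; the remaining 6 consist of a tag alone.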

4.2. Pointers The characteristic property of recursive structures which clearly distinguishes them from the fundamental structures (arrays, records, sets) is their ability to vary in size. Hence, it is impossible to assign a fixed amount of storage to a recursively defined structure, and as a consequence a compiler cannot associate specific addresses to the components of such variables. The technique most commonly used to master this problem involves dynamic allocation of storage, i.e., allocation of store to individual components at the time when they come into existence during program execution, instead of at translation time. The compiler then allocates a fixed amount of storage to hold the address of the dynamically allocated component instead of the component itself. For instance, the pedigree illustrated in Fig. 4.2 would be represented by individual -- quite possibly noncontiguous -- records, one for each person. These persons are then linked by their addresses assigned to the respective father and mother fields. Graphically, this situation is best expressed by the use of arrows or pointers (Fig. 4.3).


Fig. 4.3. Data structure linked by pointers

It must be emphasized that the use of pointers to implement recursive structures is merely a technique. The programmer need not be aware of their existence. Storage may be allocated automatically the first time a new component is referenced. However, if the technique of using references or pointers is made explicit, more general data structures can be constructed than those definable by purely recursive data definition. In particular, it is then possible to define potentially infinite or circular (graph) structures and to dictate that certain structures are shared. It has therefore become common in advanced programming languages to make possible the explicit manipulation of references to data in addition to the data themselves. This implies that a clear notational distinction must exist between data and references to data and that consequently data types must be introduced whose values are pointers (references) to other data. The notation we use for this purpose is the following:

    TYPE T = POINTER TO T0

This type declaration expresses that values of type T are pointers to data of type T0. It is fundamentally important that the type of elements pointed to is evident from the declaration of T. We say that T is bound to T0. This binding distinguishes pointers in higher-level languages from addresses in assembly codes, and it is a most important facility to increase security in programming through redundancy of the underlying notation.

Values of pointer types are generated whenever a data item is dynamically allocated. We will adhere to the convention that such an occasion be explicitly mentioned at all times. This is in contrast to the situation in which the first time that an item is mentioned it is automatically allocated. For this purpose, we introduce a procedure NEW. Given a pointer variable p of type T, the statement NEW(p) effectively allocates a variable of type T0 and assigns the pointer referencing this new variable to p (see Fig. 4.4). The pointer value itself can now be referred to as p (i.e., as the value of the pointer variable p). In contrast, the variable which is referenced by p is denoted by p^. The referenced structures are typically records. If the referenced record has, for example, a field x, then it is denoted by p^.x. Because it is clear that the pointer p itself has no fields, and that only the referenced record p^ has, we allow the abbreviated notation p.x in place of p^.x.
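The distinction between the pointer value p and the referenced variable p^ can be illustrated in Python, with the caveat that the analogy is loose: Python has no explicit pointer types, and every variable that holds an object is already a reference. The class name T0 and the field x follow the text; the variables q and r are introduced only for this sketch.

```python
class T0:
    """Stand-in for the referenced record type T0, with a single field x."""

    def __init__(self):
        self.x = 0


p = T0()   # plays the role of NEW(p): a record is allocated, p refers to it
q = p      # copies the pointer value only; no second record comes into existence
q.x = 7    # updates the one record that both p and q reference
r = T0()   # a further allocation yields an independent record
```

After these statements p.x is 7 although only q.x was assigned, whereas r.x is still 0.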


Fig. 4.4. Dynamic allocation of variable p^

It was mentioned above that a variant component is essential in every recursive type to ensure finite instances. The example of the family pedigree is of a pattern that exhibits a most frequently occurring constellation, namely, the case in which one of the two cases features no further components. This is expressed by the following declaration schema:

    TYPE T = RECORD
        IF nonterminal: BOOLEAN THEN S(T) END
      END

S(T) denotes a sequence of field definitions which includes one or more fields of type T, thereby ensuring recursivity. All structures of a type patterned after this schema exhibit a tree (or list) structure similar to that shown in Fig. 4.3. Its peculiar property is that it contains pointers to data components with a tag field only, i.e., without further relevant information. The implementation technique using pointers suggests an easy way of saving storage space by letting the tag information be included in the pointer value itself. The common solution is to extend the range of values of all pointer types by a single value that is pointing to no element at all. We denote this value by the special symbol NIL, and we postulate that the value NIL can be assumed by all pointer typed variables. This extension of the range of pointer values explains why finite structures may be generated without the explicit presence of variants (conditions) in their (recursive) declaration.

The explicitly recursive data types declared above are reformulated using pointers as shown below. Note that the field known has vanished, since ~p.known is now expressed as p = NIL. The renaming of the type ped to person reflects the difference in the viewpoint brought about by the introduction of explicit pointer values.
Instead of first considering the given structure in its entirety and then investigating its substructure and its components, attention is focused on the components in the first place, and their interrelationship (represented by pointers) is not evident from any fixed declaration.

    TYPE term = POINTER TO TermDescriptor;
    TYPE exp = POINTER TO ExpDescriptor;
    TYPE ExpDescriptor = RECORD op: INTEGER; opd1, opd2: term END ;
    TYPE TermDescriptor = RECORD id: ARRAY 32 OF CHAR END

    TYPE Person = POINTER TO RECORD
        name: ARRAY 32 OF CHAR;
        father, mother: Person
      END

Note: The type Person points to records of an anonymous type (PersonDescriptor). The data structure representing the pedigree shown in Figs. 4.2 and 4.3 is again shown in Fig. 4.5 in which pointers to unknown persons are denoted by NIL. The resulting improvement in storage economy is obvious.
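The pointer formulation of the pedigree can be sketched in Python, where None is taken to play the role of NIL. The dataclass layout and the helper count_persons are assumptions made for this illustration; only the field names name, father, and mother come from the declaration above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    name: str
    father: Optional["Person"] = None  # None stands for NIL: parent unknown
    mother: Optional["Person"] = None


def count_persons(p: Optional[Person]) -> int:
    """Count the records reachable from p; a NIL pointer contributes none."""
    if p is None:
        return 0
    return 1 + count_persons(p.father) + count_persons(p.mother)


# The pedigree used in the text: Ted, his parents Fred and Mary,
# Fred's father Adam, and Mary's mother Eva.
ted = Person("Ted",
             father=Person("Fred", father=Person("Adam")),
             mother=Person("Mary", mother=Person("Eva")))
```

Here count_persons(ted) is 5: one record per known person, with no separate records for the unknown ancestors.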


Fig. 4.5. Data structure with NIL pointers

Again referring to Fig. 4.5, assume that Fred and Mary are siblings, i.e., have the same father and mother. This situation is easily expressed by replacing the two NIL values in the respective fields of the two records. An implementation that hides the concept of pointers or uses a different technique of storage handling would force the programmer to represent the ancestor records of Adam and Eva twice. Although in accessing their data for inspection it does not matter whether the two fathers (and the two mothers) are duplicated or represented by a single record, the difference is essential when selective updating is permitted. Treating pointers as explicit data items instead of as hidden implementation aids allows the programmer to express clearly where storage sharing is intended and where it is not.

A further consequence of the explicitness of pointers is that it is possible to define and manipulate cyclic data structures. This additional flexibility yields, of course, not only increased power but also requires increased care by the programmer, because the manipulation of cyclic data structures may easily lead to nonterminating processes. This phenomenon of power and flexibility being intimately coupled with the danger of misuse is well known in programming, and it particularly recalls the GOTO statement. Indeed, if the analogy between program structures and data structures is to be extended, the purely recursive data structure could well be placed at the level corresponding with the procedure, whereas the introduction of pointers is comparable to the use of GOTO statements. For, as the GOTO statement allows the construction of any kind of program pattern (including loops), so do pointers allow for the composition of any kind of data structure (including rings). The parallel development of corresponding program and data structures is shown in condensed form in Table 4.1.

    Construction Pattern        Program Statement            Data Type
    Atomic element              Assignment                   Scalar type
    Enumeration                 Compound statement           Record type
    Repetition (known factor)   For statement                Array type
    Choice                      Conditional statement        Type union (Variant record)
    Repetition                  While or repeat statement    Sequence type
    Recursion                   Procedure statement          Recursive data type
    General graph               GO TO statement              Structure linked by pointers

Table 4.1 Correspondences of Program and Data Structures. In Chap. 3, we have seen that iteration is a special case of recursion, and that a call of a recursive procedure P defined according to the following schema: PROCEDURE P; BEGIN IF B THEN P0; P END END where P0 is a statement not involving P, is equivalent to and replaceable by the iterative statement WHILE B DO P0 END
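The equivalence of the recursive schema and the while statement can be checked on a small example. The following Python sketch passes the condition B and the statement P0 as functions acting on a shared state; the names p_recursive, p_iterative, and the particular B and P0 used in the check are hypothetical.

```python
def p_recursive(state, b, p0):
    """PROCEDURE P; BEGIN IF B THEN P0; P END END"""
    if b(state):
        p0(state)
        p_recursive(state, b, p0)


def p_iterative(state, b, p0):
    """WHILE B DO P0 END"""
    while b(state):
        p0(state)


def run(p, limit):
    """Apply schema p with B: len(state) < limit and P0: append the current length."""
    state = []
    p(state, lambda s: len(s) < limit, lambda s: s.append(len(s)))
    return state
```

Both run(p_recursive, 5) and run(p_iterative, 5) leave the state [0, 1, 2, 3, 4]. The recursive form is additionally limited by Python's recursion depth, which is one practical reason to prefer the iterative form.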

The analogies outlined in Table 4.1 reveal that a similar relationship holds between recursive data types and the sequence. In fact, a recursive type defined according to the schema

    TYPE T = RECORD
        IF b: BOOLEAN THEN t0: T0; t: T END
      END

where T0 is a type not involving T, is equivalent to and replaceable by a sequence of T0s.

The remainder of this chapter is devoted to the generation and manipulation of data structures whose components are linked by explicit pointers. Structures with specific simple patterns are emphasized in particular; recipes for handling more complex structures may be derived from those for manipulating basic formations. These are the linear list or chained sequence -- the simplest case -- and trees. Our preoccupation with these building blocks of data structuring does not imply that more involved structures do not occur in practice. In fact, the following story appeared in a Zürich newspaper in July 1922 and is a proof that irregularity may even occur in cases which usually serve as examples for regular structures, such as (family) trees. The story tells of a man who laments the misery of his life in the following words:

    I married a widow who had a grown-up daughter. My father, who visited us quite often, fell in love with my step-daughter and married her. Hence, my father became my son-in-law, and my step-daughter became my mother. Some months later, my wife gave birth to a son, who became the brother-in-law of my father as well as my uncle. The wife of my father, that is my step-daughter, also had a son. Thereby, I got a brother and at the same time a grandson. My wife is my grandmother, since she is my mother's mother. Hence, I am my wife's husband and at the same time her step-grandson; in other words, I am my own grandfather.

4.3. Linear Lists

4.3.1. Basic Operations

The simplest way to interrelate or link a set of elements is to line them up in a single list or queue. For, in this case, only a single link is needed for each element to refer to its successor. Assume that types Node and NodeDesc are defined as shown below. Every variable of type NodeDesc consists of three components, namely, an identifying key, the pointer to its successor, and possibly further associated information. For our further discussion, only key and next will be relevant.

    TYPE Node = POINTER TO NodeDesc;
    TYPE NodeDesc = RECORD key: INTEGER; next: Node; data: ... END ;
    VAR p, q: Node (*pointer variables*)

A list of nodes, with a pointer to its first component being assigned to a variable p, is illustrated in Fig. 4.6. Probably the simplest operation to be performed with a list as shown in Fig. 4.6 is the insertion of an element at its head. First, an element of type NodeDesc is allocated, its reference (pointer) being assigned to an auxiliary pointer variable, say q. Thereafter, a simple reassignment of pointers completes the operation. Note that the order of these three statements is essential.

    NEW(q); q.next := p; p := q

Fig. 4.6. Example of a linked list

The operation of inserting an element at the head of a list immediately suggests how such a list can be generated: starting with the empty list, a heading element is added repeatedly. The process of list generation is expressed by the following piece of program; here the number of elements to be linked is n.

    p := NIL; (*start with empty list*)
    WHILE n > 0 DO
      NEW(q); q.next := p; p := q;
      q.key := n; DEC(n)
    END

This is the simplest way of forming a list. However, the resulting order of elements is the inverse of the order of their insertion. In some applications this is undesirable, and consequently, new elements must be appended at the end instead of the head of the list. Although the end can easily be determined by a scan of the list, this naive approach involves an effort that may as well be saved by using a second pointer, say q, always designating the last element. This method is, for example, applied in Program 4.4, which generates cross-references to a given text. Its disadvantage is that the first element inserted has to be treated differently from all later ones.

The explicit availability of pointers makes certain operations very simple which are otherwise cumbersome; among the elementary list operations are those of inserting and deleting elements (selective updating of a list), and, of course, the traversal of a list. We first investigate list insertion.

Assume that an element designated by a pointer (variable) q is to be inserted in a list after the element designated by the pointer p. The necessary pointer assignments are expressed as follows, and their effect is visualized by Fig. 4.7.

    q.next := p.next; p.next := q
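The list generation and the insertion after a designated element can be sketched in Python, with None standing for NIL. The class Node mirrors the fields key and next of NodeDesc; the function names generate, insert_after, and keys are hypothetical helpers introduced for this illustration.

```python
class Node:
    def __init__(self, key, next=None):
        self.key = key
        self.next = next


def generate(n):
    """Build a list by repeated insertion at the head; returns the head pointer."""
    p = None                 # start with empty list
    while n > 0:
        q = Node(n)          # NEW(q); q.key := n
        q.next = p           # the new element precedes the old head
        p = q                # the new element becomes the head
        n -= 1
    return p


def insert_after(p, q):
    """Insert the element designated by q after the element designated by p."""
    q.next = p.next          # must come first, otherwise the old successor is lost
    p.next = q


def keys(p):
    """Collect the keys in list order (a plain traversal)."""
    out = []
    while p is not None:
        out.append(p.key)
        p = p.next
    return out
```

Because each element is inserted at the head and n is counted down, generate(4) yields the keys 1, 2, 3, 4 in that order. A call insert_after(head, Node(9)) on that list then gives 1, 9, 2, 3, 4.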

Fig. 4.7. Insertion after p^

If insertion before instead of after the designated element p^ is desired, the unidirectional link chain seems to cause a problem, because it does not provide any kind of path to an element's predecessors. However, a simple trick solves our dilemma. It is illustrated in Fig. 4.8. Assume that the key of the new element is k = 8.

    NEW(q); q^ := p^; p.key := k; p.next := q
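The four statements above can be traced in a small Python sketch. The names Node, insert_before, and keys are assumptions made for illustration, and the sample keys are chosen freely rather than read off Fig. 4.8.

```python
class Node:
    def __init__(self, key, next=None):
        self.key = key
        self.next = next


def insert_before(p, k):
    """Insert key k in front of the element designated by p, without a predecessor link."""
    q = Node(p.key, p.next)  # NEW(q); q^ := p^  (q becomes a copy of p's record)
    p.key = k                # the old record now carries the new key ...
    p.next = q               # ... and is followed by the copy of the old element
    return q


def keys(p):
    out = []
    while p is not None:
        out.append(p.key)
        p = p.next
    return out


# Sample list 27 -> 13 -> 21, with p designating the element with key 13.
third = Node(21)
p = Node(13, third)
head = Node(27, p)
insert_before(p, 8)
```

The list now reads 27, 8, 13, 21. Note that p still designates the same record, which now holds the key 8; any other variable that pointed to the element 13 is affected in the same way.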

Fig. 4.8. Insertion before p^

The trick evidently consists of actually inserting a new component after p^ and thereafter interchanging the values of the new element and p^.

Next, we consider the process of list deletion. Deleting the successor of a p^ is straightforward. This is shown here in combination with the reinsertion of the deleted element at the head of another list (designated by q). Figure 4.9 illustrates the situation and shows that it constitutes a cyclic exchange of three pointers.

    r := p.next; p.next := r.next; r.next := q; q := r
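A Python sketch of this deletion with reinsertion follows. Since Python passes references by value, the new head of the second list is returned instead of being assigned to q inside the function; this, like the names move_successor, from_keys, and keys, is an assumption of the sketch.

```python
class Node:
    def __init__(self, key, next=None):
        self.key = key
        self.next = next


def move_successor(p, q):
    """Unlink the successor of p and insert it at the head of the list q.

    Returns the new head of the second list. Assumes p has a successor.
    """
    r = p.next
    p.next = r.next   # bypass r in the first list
    r.next = q        # r now precedes the old head of the second list
    return r          # corresponds to q := r


def from_keys(ks):
    """Build a list holding the given keys in order."""
    head = None
    for k in reversed(ks):
        head = Node(k, head)
    return head


def keys(p):
    out = []
    while p is not None:
        out.append(p.key)
        p = p.next
    return out
```

For a first list 1, 2, 3 with p designating its first element and a second list 10, 11, the call q = move_successor(p, q) leaves the first list as 1, 3 and the second as 2, 10, 11.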

Fig. 4.9. Deletion and re-insertion

The removal of a designated element itself (instead of its successor) is more difficult, because we encounter the same problem as with insertion: tracing backward to the denoted element's predecessor is impossible. But deleting the successor after moving its value forward is a relatively obvious and simple solution. It can be applied whenever p^ has a successor, i.e., is not the last element on the list. However, it must be assured that there exist no other variables pointing to the now deleted element.

We now turn to the fundamental operation of list traversal. Let us assume that an operation P(x) has to be performed for every element of the list whose first element is p^. This task is expressible as follows:

    WHILE list designated by p is not empty DO
      perform operation P;
      proceed to the successor
    END

In detail, this operation is described by the following statement:

    WHILE p # NIL DO
      P(p); p := p.next
    END

It follows from the definitions of the while statement and of the linking structure that P is applied to all elements of the list and to no other ones.

A very frequent operation performed is list searching for an element with a given key x. Unlike for arrays, the search must here be purely sequential. The search terminates either if an element is found or if the end of the list is reached. This is reflected by a logical conjunction consisting of two terms. Again, we assume that the head of the list is designated by a pointer p.

    WHILE (p # NIL) & (p.key # x) DO p := p.next END

p = NIL implies that p^ does not exist, and hence that the expression p.key # x is undefined. The order of the two terms is therefore essential.

4.3.2. Ordered Lists and Reorganizing Lists

The given linear list search strongly resembles the search routines for scanning an array or a sequence. In fact, a sequence is precisely a linear list for which the technique of linkage to the successor is left unspecified or implicit. Since the primitive sequence operators do not allow insertion of new elements (except at the end) or deletion (except removal of all elements), the choice of representation is left wide open to the implementor, and he may well use sequential allocation, leaving successive components in contiguous storage areas. Linear lists with explicit pointers provide more flexibility, and therefore they should be used whenever this additional flexibility is needed.

To exemplify, we will now consider a problem that will occur throughout this chapter in order to illustrate alternative solutions and techniques. It is the problem of reading a text, collecting all its words, and counting the frequency of their occurrence. It is called the construction of a concordance or the generation of a cross-reference list. An obvious solution is to construct a list of words found in the text. The list is scanned for each word.
If the word is found, its frequency count is incremented; otherwise the word is added to the list. We shall simply call this process search, although it may actually also include an insertion. In order to be able to concentrate our attention on the essential part of list handling, we assume that the words have already been extracted from the text under investigation, have been encoded as integers, and are available in the form of an input sequence.

The formulation of the procedure called search follows in a straightforward manner. The variable root refers to the head of the list in which new words are inserted accordingly. The complete algorithm is listed below; it includes a routine for tabulating the constructed cross-reference list. The tabulation process is an example in which an action is executed once for each element of the list.

    TYPE Word = POINTER TO RECORD
        key, count: INTEGER; next: Word
      END ;

    PROCEDURE search(x: INTEGER; VAR root: Word);
      VAR w: Word;
    BEGIN w := root;
      WHILE (w # NIL) & (w.key # x) DO w := w.next END ;
      (* (w = NIL) OR (w.key = x) *)
      IF w = NIL THEN (*new entry*)
        w := root; NEW(root);
        root.key := x; root.count := 1; root.next := w
      ELSE INC(w.count)
      END
    END search;

    PROCEDURE PrintList(w: Word);
    BEGIN (*uses global writer W *)
      WHILE w # NIL DO
        Texts.WriteInt(W, w.key, 8); Texts.WriteInt(W, w.count, 8);
        Texts.WriteLn(W); w := w.next
      END
    END PrintList;

The linear scan algorithm resembles the search procedure for arrays, and reminds us of a simple technique used to simplify the loop termination condition: the use of a sentinel. A sentinel may as well be used in list search; it is represented by a dummy element at the end of the list. The new procedure is listed below. We must assume that a global variable sentinel is added and that the initialization of root := NIL is replaced by the statements

    NEW(sentinel); root := sentinel

which generate the element to be used as sentinel.

    PROCEDURE search(x: INTEGER; VAR root: Word);
      VAR w: Word;
    BEGIN w := root; sentinel.key := x;
      WHILE w.key # x DO w := w.next END ;
      IF w = sentinel THEN (*new entry*)
        w := root; NEW(root);
        root.key := x; root.count := 1; root.next := w
      ELSE INC(w.count)
      END
    END search

Obviously, the power and flexibility of the linked list are ill used in this example, and the linear scan of the entire list can only be accepted in cases in which the number of elements is limited. An easy improvement, however, is readily at hand: the ordered list search. If the list is ordered (say by increasing keys), then the search may be terminated at the latest upon encountering the first key that is larger than the new one. Ordering of the list is achieved by inserting new elements at the appropriate place instead of at the head. In effect, ordering is practically obtained free of charge. This is because of the ease by which insertion in a linked list is achieved, i.e., by making full use of its flexibility. It is a possibility not provided by the array and sequence structures. (Note, however, that even in ordered lists no equivalent to the binary search of arrays is available).
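Both versions of search, with and without a sentinel, can be sketched in Python. The VAR parameter root has no direct counterpart, so each function returns the possibly changed root; this convention and the names search_sentinel and tabulate are assumptions of the sketch, not part of the original program.

```python
class Word:
    def __init__(self, key=0, count=0, next=None):
        self.key = key
        self.count = count
        self.next = next


def search(x, root):
    """Plain search: two-term loop condition, new entries go to the head."""
    w = root
    while w is not None and w.key != x:   # order of the two terms is essential
        w = w.next
    if w is None:                         # new entry
        root = Word(x, 1, root)
    else:
        w.count += 1
    return root


def search_sentinel(x, root, sentinel):
    """Sentinel search: the dummy last element guarantees that x is found."""
    sentinel.key = x
    w = root
    while w.key != x:
        w = w.next
    if w is sentinel:                     # only the dummy matched: new entry
        root = Word(x, 1, root)
    else:
        w.count += 1
    return root


def tabulate(root, end=None):
    """Return (key, count) pairs in list order, stopping at end (NIL or the sentinel)."""
    out = []
    w = root
    while w is not end:
        out.append((w.key, w.count))
        w = w.next
    return out
```

Feeding the word codes 3, 1, 3, 2, 1, 3 into either version produces the entries (2, 1), (1, 2), (3, 3): the most recently introduced word is at the head, and the counts are independent of the list order.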

Fig. 4.10. Insertion in ordered list

Ordered list search is a typical example of the situation where an element must be inserted ahead of a given item, here in front of the first one whose key is too large. The technique shown here, however, differs from the one shown earlier. Instead of copying values, two pointers are carried along in the list traversal; w2 lags one step behind w1 and thus identifies the proper insertion place when w1 has found too large a key. The general insertion step is shown in Fig. 4.10. The pointer to the new element (w3) is to be assigned to w2.next, except when the list is still empty. For reasons of simplicity and effectiveness, we prefer not to make this distinction by means of a conditional statement. The only way to avoid it is to introduce a dummy element at the list head. The initializing statement root := NIL is accordingly replaced by

  NEW(root); root.next := NIL

Referring to Fig. 4.10, we determine the condition under which the scan continues to proceed to the next element; it consists of two factors, namely,

  (w1 # NIL) & (w1.key < x)

The resulting search procedure is:

  PROCEDURE search(x: INTEGER; VAR root: Word);
    VAR w1, w2, w3: Word;
  BEGIN (*w2 # NIL*)
    w2 := root; w1 := w2.next;
    WHILE (w1 # NIL) & (w1.key < x) DO w2 := w1; w1 := w2.next END ;
    (* (w1 = NIL) OR (w1.key >= x) *)
    IF (w1 = NIL) OR (w1.key > x) THEN (*new entry*)
      NEW(w3); w2.next := w3; w3.key := x; w3.count := 1; w3.next := w1
    ELSE INC(w1.count)
    END
  END search

In order to speed up the search, the continuation condition of the while statement can once again be simplified by using a sentinel. This requires the initial presence of a dummy header as well as a sentinel at the tail.

It is now high time to ask what gain can be expected from ordered list search. Remembering that the additional complexity incurred is small, one should not expect an overwhelming improvement. Assume that all words in the text occur with equal frequency. In this case the gain through lexicographical ordering is indeed also nil, once all words are listed, because the position of a word does not matter if only the total of all access steps is significant and if all words have the same frequency of occurrence. However, a gain is obtained whenever a new word is to be inserted. Instead of first scanning the entire list, on the average only half the list is scanned. Hence, ordered list insertion pays off only if a concordance is to be generated with many distinct words compared to their frequency of occurrence. The preceding examples are therefore suitable primarily as programming exercises rather than for practical applications.
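The sentinel variant of ordered list search mentioned above is not listed in the text. The following is a minimal sketch of how it might look, assuming that root and sentinel are global variables, that root denotes the dummy header, and that the module name OrderedList is chosen freely for illustration. The key x is stored in the sentinel before the scan, so that the loop needs only the single test w1.key < x.

```oberon
MODULE OrderedList; (*hypothetical module name, for illustration only*)

  TYPE Word = POINTER TO WordDesc;
    WordDesc = RECORD key, count: INTEGER; next: Word END;

  VAR root, sentinel: Word; (*root is the dummy header, sentinel the tail*)

  PROCEDURE search*(x: INTEGER);
    VAR w1, w2, w3: Word;
  BEGIN
    sentinel.key := x; (*guarantees that the scan stops at the tail at the latest*)
    w2 := root; w1 := w2.next;
    WHILE w1.key < x DO w2 := w1; w1 := w2.next END ;
    (*w1.key >= x; w1 = sentinel means that no list element has a key >= x*)
    IF (w1 = sentinel) OR (w1.key > x) THEN (*new entry, linked in after w2*)
      NEW(w3); w3.key := x; w3.count := 1; w3.next := w1; w2.next := w3
    ELSE INC(w1.count)
    END
  END search;

BEGIN (*initially the list consists of header and sentinel only*)
  NEW(sentinel); sentinel.next := NIL;
  NEW(root); root.next := sentinel
END OrderedList.
```

Because the header is never removed, w2 is always defined, and the test w1 = sentinel replaces the test w1 = NIL of the version without sentinel.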
The arrangement of data in a linked list is recommended when the number of elements is relatively small (< 50), varies, and, moreover, when no information is given about their frequencies of access. A typical example is the symbol table in compilers of programming languages. Each declaration causes the addition of a new symbol, and upon exit from its scope of validity, it is deleted from the list. The use of simple linked lists is appropriate for applications with relatively short programs. Even in this case a considerable improvement in access method can be achieved by a very simple technique which is mentioned here again primarily because it constitutes a pretty example for demonstrating the flexibilities of the linked list structure. A characteristic property of programs is that occurrences of the same identifier are very often clustered, that is, one occurrence is often followed by one or more reoccurrences of the same word. This information is an invitation to reorganize the list after each access by moving the word that was found to the top of the list, thereby minimizing the length of the search path the next time it is sought. This method of access is called list search with reordering, or -- somewhat pompously -- self-organizing list search. In presenting the corresponding algorithm in the form of a procedure, we take advantage of our experience made so far and introduce a sentinel right from the start. In fact, a sentinel not only speeds up the search, but in this case it also simplifies the program. The list must initially not be empty, but contains the sentinel element already. The initialization statements are

  NEW(sentinel); root := sentinel

Note that the main difference between the new algorithm and the straight list search is the action of reordering when an element has been found. It is then detached or deleted from its old position and inserted at the top. This deletion again requires the use of two chasing pointers, such that the predecessor w2 of an identified element w1 is still locatable. This, in turn, calls for the special treatment of the first element (i.e., the empty list). To conceive the linking process, we refer to Fig. 4.11. It shows the two pointers when w1 was identified as the desired element. The configuration after correct reordering is represented in Fig. 4.12, and the complete new search procedure is listed below.

Fig. 4.11. List before re-ordering

Fig. 4.12. List after re-ordering

  PROCEDURE search(x: INTEGER; VAR root: Word);
    VAR w1, w2: Word;
  BEGIN w1 := root; sentinel.key := x;
    IF w1 = sentinel THEN (*first element*)
      NEW(root); root.key := x; root.count := 1; root.next := sentinel
    ELSIF w1.key = x THEN INC(w1.count)
    ELSE (*search*)
      REPEAT w2 := w1; w1 := w2.next UNTIL w1.key = x;
      IF w1 = sentinel THEN (*new entry*)
        w2 := root; NEW(root); root.key := x; root.count := 1; root.next := w2
      ELSE (*found, now reorder*)
        INC(w1.count); w2.next := w1.next; w1.next := root; root := w1
      END
    END
  END search

The improvement in this search method strongly depends on the degree of clustering in the input data. For a given factor of clustering, the improvement will be more pronounced for large lists. To provide an idea of how much gain can be expected, an empirical measurement was made by applying the above crossreference program to a short and a relatively long text and then comparing the methods of linear list ordering and of list reorganization. The measured data are condensed into Table 4.2. Unfortunately, the improvement is greatest when a different data organization is needed anyway. We will return to this example in Sect. 4.4.

                                      Test 1     Test 2
  Number of distinct keys                 53        582
  Number of occurrences of keys          315      14341
  Time for search with ordering         6207    3200622
  Time for search with reordering       4529     681584
  Improvement factor                    1.37       4.70

Table 4.2 Comparison of List Search Methods.

4.3.3. An Application: Partial Ordering (Topological Sorting)

An appropriate example of the use of a flexible, dynamic data structure is the process of topological sorting. This is a sorting process of items over which a partial ordering is defined, i.e., where an ordering is given over some pairs of items but not between all of them. The following are examples of partial orderings:

1. In a dictionary or glossary, words are defined in terms of other words. If a word v is defined in terms of a word w, we denote this by v 〈 w. Topological sorting of the words in a dictionary means arranging them in an order such that there will be no forward references.
2. A task (e.g., an engineering project) is broken up into subtasks. Completion of certain subtasks must usually precede the execution of other subtasks. If a subtask v must precede a subtask w, we write v 〈 w. Topological sorting means their arrangement in an order such that upon initiation of each subtask all its prerequisite subtasks have been completed.
3. In a university curriculum, certain courses must be taken before others since they rely on the material presented in their prerequisites. If a course v is a prerequisite for course w, we write v 〈 w. Topological sorting means arranging the courses in such an order that no course lists a later course as prerequisite.
4. In a program, some procedures may contain calls of other procedures. If a procedure v is called by a procedure w, we write v 〈 w. Topological sorting implies the arrangement of procedure declarations in such a way that there are no forward references.

In general, a partial ordering of a set S is a relation between the elements of S. It is denoted by the symbol “〈”, verbalized by precedes, and satisfies the following three properties (axioms) for any distinct elements x, y, z of S:

1. if x 〈 y and y 〈 z, then x 〈 z (transitivity)
2. if x 〈 y, then not y 〈 x (asymmetry)
3. not z 〈 z (irreflexivity)

For evident reasons, we will assume that the sets S to be topologically sorted by an algorithm are finite. Hence, a partial ordering can be illustrated by drawing a diagram or graph in which the vertices denote the elements of S and the directed edges represent ordering relationships. An example is shown in Fig. 4.13.


Fig. 4.13. Partially ordered set

The problem of topological sorting is to embed the partial order in a linear order. Graphically, this implies the arrangement of the vertices of the graph in a row, such that all arrows point to the right, as shown in Fig. 4.14. Properties (1) and (2) of partial orderings ensure that the graph contains no loops. This is exactly the prerequisite condition under which such an embedding in a linear order is possible.


Fig. 4.14. Linear arrangement of the partially ordered set of Fig. 4.13. How do we proceed to find one of the possible linear orderings? The recipe is quite simple. We start by choosing any item that is not preceded by another item (there must be at least one; otherwise a loop would exist). This object is placed at the head of the resulting list and removed from the set S. The remaining set is still partially ordered, and so the same algorithm can be applied again until the set is empty. In order to describe this algorithm more rigorously, we must settle on a data structure and representation of S and its ordering. The choice of this representation is determined by the operations to be performed, particularly the operation of selecting elements with zero predecessors. Every item should therefore be represented by three characteristics: its identification key, its set of successors, and a count of its predecessors. Since the number n of elements in S is not given a priori, the set is conveniently organized as a linked list. Consequently, an additional entry in the description of each item contains the link to the next item in the list. We will assume that the keys are integers (but not necessarily the consecutive integers from 1 to n). Analogously, the set of each item's successors is conveniently represented as a linked list. Each element of the successor list is described by an identification and a link to the next item on this list. If we call the descriptors of the main list, in which each item of S occurs exactly once, leaders, and the descriptors of elements on the successor chains trailers, we obtain the following declarations of data types: TYPE Leader = POINTER TO LeaderDesc; Trailer = POINTER TO TrailerDesc; LeaderDesc = RECORD key, count: INTEGER; trail: Trailer; next: Leader END; TrailerDesc = RECORD id: Leader; next: Trailer END
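The recipe given above (repeatedly pick an item without predecessors, output it, and remove it from S) can be sketched directly in terms of the Leader and Trailer types. The following is an illustrative sketch only, not the program developed in the text: the module name TopSortSketch, the procedure name Sort, and the parameter ok are assumptions. Instead of printing the items, the procedure relinks the leaders into the resulting linear order.

```oberon
MODULE TopSortSketch; (*hypothetical module name, for illustration only*)

  TYPE Leader = POINTER TO LeaderDesc;
    Trailer = POINTER TO TrailerDesc;
    LeaderDesc = RECORD key, count: INTEGER; trail: Trailer; next: Leader END;
    TrailerDesc = RECORD id: Leader; next: Trailer END;

  (*Relinks the leaders starting at head into a linear order compatible with
    the partial order. ok is FALSE if some items could not be placed, i.e.,
    if the given relation contains a loop.*)
  PROCEDURE Sort*(VAR head: Leader; VAR ok: BOOLEAN);
    VAR p, q, zero, last: Leader; t: Trailer; n: INTEGER;
  BEGIN
    (*collect all leaders without predecessors in the list zero*)
    p := head; zero := NIL; n := 0;
    WHILE p # NIL DO
      q := p; p := q.next; INC(n);
      IF q.count = 0 THEN q.next := zero; zero := q END
    END ;
    head := NIL; last := NIL;
    WHILE zero # NIL DO
      q := zero; zero := q.next; DEC(n);
      (*append q to the result list*)
      q.next := NIL;
      IF last = NIL THEN head := q ELSE last.next := q END ;
      last := q;
      (*q leaves the set S: each of its successors loses one predecessor*)
      t := q.trail;
      WHILE t # NIL DO
        p := t.id; DEC(p.count);
        IF p.count = 0 THEN p.next := zero; zero := p END ;
        t := t.next
      END
    END ;
    ok := n = 0 (*n counts the items not yet placed*)
  END Sort;

END TopSortSketch.
```

Leaders with a nonzero count are reachable only through the trailers of their predecessors until their count drops to zero. This is why the main list can be dissolved in the first loop. If ok is FALSE, the items that take part in a loop, or follow one, are not contained in the result.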

Assume that the set S and its ordering relations are initially represented as a sequence of pairs of keys in the input file. The input data for the example in Fig. 4.13 are shown below, in which the symbols 〈 are added for the sake of clarity, symbolizing partial order:

  1 〈 2   2 〈 4   4 〈 6   2 〈 10   4 〈 8   6 〈 3   1 〈 3
  3 〈 5   5 〈 8   7 〈 5   7 〈 9    9 〈 4   9 〈 10

The first part of the topological sort program must read the input and transform the data into a list structure. This is performed by successively reading a pair of keys x and y (x 〈 y). Let us denote the pointers to their representations on the linked list of leaders by p and q. These records must be located by a list search and, if not yet present, be inserted in the list. This task is performed by a function procedure called find. Subsequently, a new entry is added in the list of trailers of x, along with an identification of y; the count of predecessors of y is incremented by 1. This algorithm is called input phase. Figure 4.15 illustrates the data structure generated during processing the given input data. The function find(w) yields the pointer to the list element with key w. In the following piece of program we make use of text scanning, a feature of the Oberon system’s text concept. Instead of considering a text (file) as a sequence of characters, a text is considered as a sequence of tokens, which are identifiers, numbers, strings, and special characters (such as +, *, a THEN search(w.left, a)

    ELSE (*old entry*)
      NEW(q); q.lno := line; w.last.next := q; w.last := q
    END
  END search;

  PROCEDURE Tabulate(w: Node);
    VAR m: INTEGER; item: Item;
  BEGIN
    IF w # NIL THEN
      Tabulate(w.left);
      Texts.WriteString(W, w.key); item := w.first; m := 0;
      REPEAT
        IF m = 10 THEN Texts.WriteLn(W); Texts.Write(W, TAB); m := 0; END ;
        INC(m); Texts.WriteInt(W, item.lno, 6); item := item.next
      UNTIL item = NIL;
      Texts.WriteLn(W); Tabulate(w.right)
    END
  END Tabulate;

  PROCEDURE CrossRef(VAR R: Texts.Reader);
    VAR root: Node; (*uses global writer W*)
      i: INTEGER; ch: CHAR; w: Word;
  BEGIN root := NIL; line := 0;
    Texts.WriteInt(W, 0, 6); Texts.Write(W, TAB); Texts.Read(R, ch);
    WHILE ~R.eot DO
      IF ch = 0DX THEN (*line end*)
        Texts.WriteLn(W); INC(line); Texts.WriteInt(W, line, 6); Texts.Write(W, 9X); Texts.Read(R, ch)
      ELSIF ("A"

1, we will provide the root with two subtrees which again have a minimal number of nodes. Hence, the subtrees are also T's. Evidently, one subtree must have height h-1, and the other is then allowed to have a height of one less, i.e. h-2. Figure 4.30 shows the trees with height 2, 3, and 4. Since their composition principle very strongly resembles that of Fibonacci numbers, they are called Fibonacci-trees (see Fig. 4.30). They are defined as follows:

1. The empty tree is the Fibonacci-tree of height 0.
2. A single node is the Fibonacci-tree of height 1.
3. If $T_{h-1}$ and $T_{h-2}$ are Fibonacci-trees of heights h-1 and h-2, then $T_h = \langle T_{h-1}, x, T_{h-2} \rangle$ is a Fibonacci-tree.
4. No other trees are Fibonacci-trees.


Fig. 4.30. Fibonacci-trees of height 2, 3, and 4

The number of nodes of $T_h$ is defined by the following simple recurrence relation:

  $N_0 = 0, \quad N_1 = 1$
  $N_h = N_{h-1} + 1 + N_{h-2}$

For example, $N_2 = 2$, $N_3 = 4$, and $N_4 = 7$, in agreement with the trees of Fig. 4.30. The $N_i$ are those numbers of nodes for which the worst case (upper limit of h) can be attained, and they are called Leonardo numbers.

4.5.1. Balanced Tree Insertion

Let us now consider what may happen when a new node is inserted in a balanced tree. Given a root r with the left and right subtrees L and R, three cases must be distinguished. Assume that the new node is inserted in L causing its height to increase by 1:

1. $h_L = h_R$: L and R become of unequal height, but the balance criterion is not violated.
2. $h_L < h_R$: L and R obtain equal height, i.e., the balance has even been improved.
3. $h_L > h_R$: the balance criterion is violated, and the tree must be restructured.

Consider the tree in Fig. 4.31. Nodes with keys 9 and 11 may be inserted without rebalancing; the tree with root 10 will become one-sided (case 1); the one with root 8 will improve its balance (case 2). Insertion of nodes 1, 3, 5, or 7, however, requires subsequent rebalancing.

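[Figure: balanced tree with root 8, whose left subtree has root 4 with leaves 2 and 6, and whose right subtree is the single node 10]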

Fig. 4.31. Balanced tree

Some careful scrutiny of the situation reveals that there are only two essentially different constellations needing individual treatment. The remaining ones can be derived by symmetry considerations from those two. Case 1 is characterized by inserting keys 1 or 3 in the tree of Fig. 4.31, case 2 by inserting nodes 5 or 7. The two cases are generalized in Fig. 4.32 in which rectangular boxes denote subtrees, and the height added by the insertion is indicated by crosses. Simple transformations of the two structures restore the desired balance. Their result is shown in Fig. 4.33; note that the only movements allowed are those occurring in the vertical direction, whereas the relative horizontal positions of the shown nodes and subtrees must remain unchanged.

Fig. 4.32. Imbalance resulting from insertion

Fig. 4.33. Restoring the balance

An algorithm for insertion and rebalancing critically depends on the way information about the tree's balance is stored. An extreme solution lies in keeping balance information entirely implicit in the tree structure itself. In this case, however, a node's balance factor must be rediscovered each time it is affected by an insertion, resulting in an excessively high overhead. The other extreme is to attribute an explicitly stored balance factor to every node. The definition of the type Node is then extended into

  TYPE Node = POINTER TO RECORD
      key, count, bal: INTEGER; (*bal = -1, 0, +1*)
      left, right: Node
    END

We shall subsequently interpret a node's balance factor as the height of its right subtree minus the height of its left subtree, and we shall base the resulting algorithm on this node type. The process of node insertion consists essentially of the following three consecutive parts:

1. Follow the search path until it is verified that the key is not already in the tree.
2. Insert the new node and determine the resulting balance factor.
3. Retreat along the search path and check the balance factor at each node. Rebalance if necessary.

Although this method involves some redundant checking (once balance is established, it need not be checked on that node's ancestors), we shall first adhere to this evidently correct schema because it can be implemented through a pure extension of the already established search and insertion procedures. This procedure describes the search operation needed at each single node, and because of its recursive formulation it can easily accommodate an additional operation on the way back along the search path. At each step, information must be passed as to whether or not the height of the subtree (in which the insertion had been performed) had increased. We therefore extend the procedure's parameter list by the Boolean h with the meaning “the subtree height has increased”. Clearly, h must denote a variable parameter since it is used to transmit a result. Assume now that the process is returning to a node p^ from the left branch (see Fig. 4.32), with the indication that it has increased its height. We now must distinguish between the three conditions involving the subtree heights prior to insertion:

1. $h_L < h_R$, p.bal = +1: the previous imbalance at p has been equilibrated.
2. $h_L = h_R$, p.bal = 0: the weight is now slanted to the left.
3. $h_L > h_R$, p.bal = -1: rebalancing is necessary.

In the third case, inspection of the balance factor of the root of the left subtree (say, p1.bal) determines whether case 1 or case 2 of Fig. 4.32 is present. If that node has also a higher left than right subtree, then we have to deal with case 1, otherwise with case 2. (Convince yourself that a left subtree with a balance factor equal to 0 at its root cannot occur in this case.) The rebalancing operations necessary are entirely expressed as sequences of pointer reassignments. In fact, pointers are cyclically exchanged, resulting in either a single or a double rotation of the two or three nodes involved. In addition to pointer rotation, the respective node balance factors have to be updated. The details are shown in the search, insertion, and rebalancing procedures.
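Since the stored balance factors must agree at all times with the actual subtree heights, a small checking procedure is useful when experimenting with the insertion and rebalancing procedures. The following sketch is an illustration only, under the assumption that the record type is given a name (NodeDesc), which the text does not do; the names BalanceCheck and Check are likewise hypothetical. It recomputes the heights, which is exactly the implicit solution dismissed above as too costly for use during insertion.

```oberon
MODULE BalanceCheck; (*hypothetical module name, for illustration only*)

  TYPE Node = POINTER TO NodeDesc;
    NodeDesc = RECORD key, count, bal: INTEGER; left, right: Node END;

  (*Computes the height of the tree p. ok is TRUE if every node's stored bal
    equals height(right) - height(left) and lies in the range -1 .. +1.*)
  PROCEDURE Check*(p: Node; VAR height: INTEGER; VAR ok: BOOLEAN);
    VAR hl, hr: INTEGER; okl, okr: BOOLEAN;
  BEGIN
    IF p = NIL THEN height := 0; ok := TRUE
    ELSE
      Check(p.left, hl, okl); Check(p.right, hr, okr);
      IF hl > hr THEN height := hl + 1 ELSE height := hr + 1 END ;
      ok := okl & okr & (p.bal = hr - hl) & (p.bal >= -1) & (p.bal <= 1)
    END
  END Check;

END BalanceCheck.
```

Calling Check on the root after each insertion visits every node once, so it is suitable for testing only.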


Fig. 4.34. Insertions in balanced tree

The working principle is shown by Fig. 4.34. Consider the binary tree (a) which consists of two nodes only. Insertion of key 7 first results in an unbalanced tree (i.e., a linear list). Its balancing involves an RR single rotation, resulting in the perfectly balanced tree (b). Further insertion of nodes 2 and 1 results in an imbalance of the subtree with root 4. This subtree is balanced by an LL single rotation (d). The subsequent insertion of key 3 immediately offsets the balance criterion at the root node 5. Balance is thereafter reestablished by the more complicated LR double rotation; the outcome is tree (e). The only candidate for losing balance after a next insertion is node 5. Indeed, insertion of node 6 must invoke the fourth case of rebalancing outlined below, the RL double rotation. The final tree is shown in Fig. 4.34 (f).

  PROCEDURE search(x: INTEGER; VAR p: Node; VAR h: BOOLEAN);
    VAR p1, p2: Node; (*~h*)
  BEGIN
    IF p = NIL THEN (*insert*)
      NEW(p); h := TRUE;
      p.key := x; p.count := 1; p.left := NIL; p.right := NIL; p.bal := 0
    ELSIF p.key > x THEN
      search(x, p.left, h);
      IF h THEN (*left branch has grown*)
        IF p.bal = 1 THEN p.bal := 0; h := FALSE
        ELSIF p.bal = 0 THEN p.bal := -1
        ELSE (*bal = -1, rebalance*) p1 := p.left;
          IF p1.bal = -1 THEN (*single LL rotation*)
            p.left := p1.right; p1.right := p; p.bal := 0; p := p1
          ELSE (*double LR rotation*) p2 := p1.right;
            p1.right := p2.left; p2.left := p1;
            p.left := p2.right; p2.right := p;
            IF p2.bal = -1 THEN p.bal := 1 ELSE p.bal := 0 END ;
            IF p2.bal = +1 THEN p1.bal := -1 ELSE p1.bal := 0 END ;
            p := p2
          END ;
          p.bal := 0; h := FALSE
        END
      END
    ELSIF p.key < x THEN
      search(x, p.right, h);
      IF h THEN (*right branch has grown*)
        IF p.bal = -1 THEN p.bal := 0; h := FALSE
        ELSIF p.bal = 0 THEN p.bal := 1
        ELSE (*bal = +1, rebalance*) p1 := p.right;
          IF p1.bal = 1 THEN (*single RR rotation*)
            p.right := p1.left; p1.left := p; p.bal := 0; p := p1
          ELSE (*double RL rotation*) p2 := p1.left;
            p1.left := p2.right; p2.right := p1;
            p.right := p2.left; p2.left := p;
            IF p2.bal = +1 THEN p.bal := -1 ELSE p.bal := 0 END ;
            IF p2.bal = -1 THEN p1.bal := 1 ELSE p1.bal := 0 END ;
            p := p2
          END ;
          p.bal := 0; h := FALSE
        END
      END
    ELSE INC(p.count)
    END
  END search

Two particularly interesting questions concerning the performance of the balanced tree insertion algorithm are the following:

1. If all n! permutations of n keys occur with equal probability, what is the expected height of the constructed balanced tree?
2. What is the probability that an insertion requires rebalancing?

Mathematical analysis of this complicated algorithm is still an open problem. Empirical tests support the conjecture that the expected height of the balanced tree thus generated is h = log(n)+c, where c is a small constant (c ≈ 0.25). This means that in practice the AVL-balanced tree behaves as well as the perfectly balanced tree, although it is much simpler to maintain. Empirical evidence also suggests that, on the average, rebalancing is necessary once for approximately every two insertions. Here single and double rotations are equally probable. The example of Fig. 4.34 has evidently been carefully chosen to demonstrate as many rotations as possible in a minimum number of insertions.

The complexity of the balancing operations suggests that balanced trees should be used only if information retrievals are considerably more frequent than insertions. This is particularly true because the nodes of such search trees are usually implemented as densely packed records in order to economize storage.
The speed of access and of updating the balance factors -- each requiring two bits only -- is therefore often a decisive factor to the efficiency of the rebalancing operation. Empirical evaluations show that balanced trees lose much of their appeal if tight record packing is mandatory. It is indeed difficult to beat the straightforward, simple tree insertion algorithm.

4.5.2. Balanced Tree Deletion

Our experience with tree deletion suggests that in the case of balanced trees deletion will also be more complicated than insertion. This is indeed true, although the rebalancing operation remains essentially the same as for insertion. In particular, rebalancing consists again of either single or double rotations of nodes.

The basis for balanced tree deletion is the ordinary tree deletion algorithm. The easy cases are terminal nodes and nodes with only a single descendant. If the node to be deleted has two subtrees, we will again replace it by the rightmost node of its left subtree. As in the case of insertion, a Boolean variable parameter h is added with the meaning “the height of the subtree has been reduced”. Rebalancing has to be considered only when h is true. h is made true upon finding and deleting a node, or if rebalancing itself reduces the height of a subtree. We now introduce the two (symmetric) balancing operations in the form of procedures, because they have to be invoked from more than one point in the deletion algorithm. Note that balanceL is applied after the left branch has been reduced in height, balanceR after the right branch.


Fig. 4.35. Deletions in balanced tree

The operation of the procedure is illustrated in Fig. 4.35. Given the balanced tree (a), successive deletion of the nodes with keys 4, 8, 6, 5, 2, 1, and 7 results in the trees (b) ... (h). Deletion of key 4 is simple in itself, because it represents a terminal node. However, it results in an unbalanced node 3. Its rebalancing operation involves an LL single rotation. Rebalancing becomes again necessary after the deletion of node 6. This time the right subtree of the root (7) is rebalanced by an RR single rotation. Deletion of node 2, although in itself straightforward since it has only a single descendant, calls for a complicated RL double rotation. The fourth case, an LR double rotation, is finally invoked after the removal of node 7, which at first was replaced by the rightmost element of its left subtree, i.e., by the node with key 3.

  PROCEDURE balanceL(VAR p: Node; VAR h: BOOLEAN);
    VAR p1, p2: Node;
  BEGIN (*h; left branch has shrunk*)
    IF p.bal = -1 THEN p.bal := 0
    ELSIF p.bal = 0 THEN p.bal := 1; h := FALSE
    ELSE (*bal = 1, rebalance*) p1 := p.right;
      IF p1.bal >= 0 THEN (*single RR rotation*)
        p.right := p1.left; p1.left := p;
        IF p1.bal = 0 THEN p.bal := 1; p1.bal := -1; h := FALSE
        ELSE p.bal := 0; p1.bal := 0
        END ;
        p := p1
      ELSE (*double RL rotation*) p2 := p1.left;
        p1.left := p2.right; p2.right := p1;
        p.right := p2.left; p2.left := p;
        IF p2.bal = +1 THEN p.bal := -1 ELSE p.bal := 0 END ;
        IF p2.bal = -1 THEN p1.bal := 1 ELSE p1.bal := 0 END ;
        p := p2; p2.bal := 0
      END
    END
  END balanceL;

  PROCEDURE balanceR(VAR p: Node; VAR h: BOOLEAN);
    VAR p1, p2: Node;
  BEGIN (*h; right branch has shrunk*)
    IF p.bal = 1 THEN p.bal := 0
    ELSIF p.bal = 0 THEN p.bal := -1; h := FALSE
    ELSE (*bal = -1, rebalance*) p1 := p.left;
      IF p1.bal

    x THEN delete(x, p.left, h);
      IF h THEN balanceL(p, h) END
    ELSIF p.key < x THEN delete(x, p.right, h);
      IF h THEN balanceR(p, h) END
    ELSE (*delete p^*) q := p;
      IF q.right = NIL THEN p := q.left; h := TRUE
      ELSIF q.left = NIL THEN p := q.right; h := TRUE
      ELSE del(q.left, h);
        IF h THEN balanceL(p, h) END
      END
    END
  END delete

Fortunately, deletion of an element in a balanced tree can also be performed with, in the worst case, O(log n) operations. An essential difference between the behaviour of the insertion and deletion procedures must not be overlooked, however. Whereas insertion of a single key may result in at most one rotation (of two or three nodes), deletion may require a rotation at every node along the search path. Consider, for instance, deletion of the rightmost node of a Fibonacci-tree. In this case the deletion of any single node leads to a reduction of the height of the tree; in addition, deletion of its rightmost node requires the maximum number of rotations. This therefore represents the worst choice of node in the worst case of a balanced tree, a rather unlucky combination of chances. How probable are rotations, then, in general?

The surprising result of empirical tests is that whereas one rotation is invoked for approximately every two insertions, one is required for every five deletions only. Deletion in balanced trees is therefore about as easy -- or as complicated -- as insertion.
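The procedure balanceR appears above only in part. Since the text describes the two balancing operations as symmetric, a complete version can be obtained by mirroring balanceL, that is, by exchanging left and right and the signs of the balance factors. The following sketch is such a mirror image; it is a reconstruction under this symmetry assumption, not the original listing, and the module name and the named record type NodeDesc are hypothetical.

```oberon
MODULE BalanceRSketch; (*hypothetical module name, for illustration only*)

  TYPE Node = POINTER TO NodeDesc;
    NodeDesc = RECORD key, count, bal: INTEGER; left, right: Node END;

  (*Mirror image of balanceL: to be called when the right branch of p has shrunk.*)
  PROCEDURE balanceR*(VAR p: Node; VAR h: BOOLEAN);
    VAR p1, p2: Node;
  BEGIN (*h; right branch has shrunk*)
    IF p.bal = 1 THEN p.bal := 0
    ELSIF p.bal = 0 THEN p.bal := -1; h := FALSE
    ELSE (*bal = -1, rebalance*) p1 := p.left;
      IF p1.bal <= 0 THEN (*single LL rotation*)
        p.left := p1.right; p1.right := p;
        IF p1.bal = 0 THEN p.bal := -1; p1.bal := 1; h := FALSE
        ELSE p.bal := 0; p1.bal := 0
        END ;
        p := p1
      ELSE (*double LR rotation*) p2 := p1.right;
        p1.right := p2.left; p2.left := p1;
        p.left := p2.right; p2.right := p;
        IF p2.bal = -1 THEN p.bal := 1 ELSE p.bal := 0 END ;
        IF p2.bal = +1 THEN p1.bal := -1 ELSE p1.bal := 0 END ;
        p := p2; p2.bal := 0
      END
    END
  END balanceR;

END BalanceRSketch.
```

Unlike in insertion, the case p1.bal = 0 can occur here. The single rotation then leaves the height of the subtree unchanged, which is why h is set to FALSE in that branch.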

4.6. Optimal Search Trees

So far our consideration of organizing search trees has been based on the assumption that the frequency of access is equal for all nodes, that is, that all keys are equally probable to occur as a search argument. This is probably the best assumption if one has no idea of access distribution. However, there are cases (they are the exception rather than the rule) in which information about the probabilities of access to individual keys is available. These cases usually have the characteristic that the keys always remain the same, i.e., the search tree is subjected neither to insertion nor deletion, but retains a constant structure. A typical example is the scanner of a compiler which determines for each word (identifier) whether or not it is a keyword (reserved word). Statistical measurements over hundreds of compiled programs may in this case yield accurate information on the relative frequencies of occurrence, and thereby of access, of individual keys.

Assume that in a search tree the probability with which node i is accessed is

  $\Pr\{x = k_i\} = p_i, \qquad \sum_{i=1}^{n} p_i = 1$

We now wish to organize the search tree in a way that the total number of search steps -- counted over sufficiently many trials -- becomes minimal. For this purpose the definition of path length is modified by (1) attributing a certain weight to each node and by (2) assuming the root to be at level 1 (instead of 0), because it accounts for the first comparison along the search path. Nodes that are frequently accessed become heavy nodes; those that are rarely visited become light nodes. The (internal) weighted path length is then the sum of all paths from the root to each node weighted by that node's probability of access.

  $P = \sum_{i=1}^{n} p_i h_i$

where $h_i$ is the level of node i. The goal is now to minimize the weighted path length for a given probability distribution. As an example, consider the set of keys 1, 2, 3, with probabilities of access p1 = 1/7, p2 = 2/7, and p3 = 4/7. These three keys can be arranged in five different ways as search trees (see Fig. 4.36).
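The definition of the weighted path length translates directly into a recursive procedure. The following is a minimal sketch, assuming a node type with an integer field weight (for instance an access count, or a probability multiplied by a common denominator); the names PathLength, NodeDesc, weight, and WPL are chosen for illustration and do not occur in the text.

```oberon
MODULE PathLength; (*hypothetical module name, for illustration only*)

  TYPE Node = POINTER TO NodeDesc;
    NodeDesc = RECORD key, weight: INTEGER; left, right: Node END;

  (*Sum of weight * level over all nodes of t, where the root of t is at the given level.*)
  PROCEDURE WPL*(t: Node; level: INTEGER): INTEGER;
    VAR sum: INTEGER;
  BEGIN
    IF t = NIL THEN sum := 0
    ELSE sum := t.weight * level + WPL(t.left, level + 1) + WPL(t.right, level + 1)
    END ;
    RETURN sum
  END WPL;

END PathLength.
```

The call WPL(root, 1) places the root at level 1, as required by the modified definition of path length. With integer weights the result is P multiplied by the sum of all weights.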


Fig. 4.36. The search trees with 3 nodes

The weighted path lengths of trees (a) to (e) are computed according to their definition as

  P(a) = 11/7, P(b) = 12/7, P(c) = 12/7, P(d) = 15/7, P(e) = 17/7

Hence, in this example, not the perfectly balanced tree (c), but the degenerate tree (a) turns out to be optimal.

The example of the compiler scanner immediately suggests that this problem should be viewed under a slightly more general condition: words occurring in the source text are not always keywords; as a matter of fact, their being keywords is rather the exception. Finding that a given word k is not a key in the search tree can be considered as an access to a hypothetical "special node" inserted between the next lower and next higher key (see Fig. 4.19) with an associated external path length. If the probability $q_i$ of a search argument x lying between the two keys $k_i$ and $k_{i+1}$ is also known, this information may considerably change the structure of the optimal search tree. Hence, we generalize the problem by also considering unsuccessful searches. The overall average weighted path length is now

  $P = \sum_{i=1}^{n} p_i h_i + \sum_{j=1}^{m} q_j h'_j$

where

  $\sum_{i=1}^{n} p_i + \sum_{j=1}^{m} q_j = 1$

and where $h_i$ is the level of the (internal) node i and $h'_j$ is the level of the external node j. The average weighted path length may be called the cost of the search tree, since it represents a measure for the expected amount of effort to be spent for searching. The search tree that requires the minimal cost among all trees with a given set of keys $k_i$ and probabilities $p_i$ and $q_i$ is called the optimal tree.
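The values P(a) to P(e) quoted above for the example with p1 = 1/7, p2 = 2/7, p3 = 4/7 can be verified term by term. Each product below is (numerator of the probability) times (level of the node), listed from the root downwards. The tree shapes noted on the right are an assumption inferred from the quoted values and from the remark that (a) is degenerate and (c) perfectly balanced; they are not read off the figure.

```latex
\begin{align*}
P(a) &= \tfrac{1}{7}(4\cdot 1 + 2\cdot 2 + 1\cdot 3) = \tfrac{11}{7}
  && \text{root 3, left child 2, left grandchild 1}\\
P(b) &= \tfrac{1}{7}(4\cdot 1 + 1\cdot 2 + 2\cdot 3) = \tfrac{12}{7}
  && \text{root 3, left child 1, whose right child is 2}\\
P(c) &= \tfrac{1}{7}(2\cdot 1 + 1\cdot 2 + 4\cdot 2) = \tfrac{12}{7}
  && \text{root 2, children 1 and 3}\\
P(d) &= \tfrac{1}{7}(1\cdot 1 + 4\cdot 2 + 2\cdot 3) = \tfrac{15}{7}
  && \text{root 1, right child 3, whose left child is 2}\\
P(e) &= \tfrac{1}{7}(1\cdot 1 + 2\cdot 2 + 4\cdot 3) = \tfrac{17}{7}
  && \text{root 1, right child 2, right grandchild 3}
\end{align*}
```

The heaviest key 3 sits at the root in the two cheapest trees, which illustrates why the balanced tree (c) is not optimal for this distribution.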


Fig. 4.37. Search tree with associated access frequencies

For finding the optimal tree, there is no need to require that the p's and q's sum up to 1. In fact, these probabilities are commonly determined by experiments in which the accesses to nodes are counted. Instead of using the probabilities $p_i$ and $q_j$, we will subsequently use such frequency counts and denote them by

  $a_i$ = number of times the search argument x equals $k_i$
  $b_j$ = number of times the search argument x lies between $k_j$ and $k_{j+1}$

By convention, $b_0$ is the number of times that x is less than $k_1$, and $b_n$ is the frequency of x being greater than $k_n$ (see Fig. 4.37). We will subsequently use P to denote the accumulated weighted path length instead of the average path length:

  $P = \sum_{i=1}^{n} a_i h_i + \sum_{j=1}^{m} b_j h'_j$

Thus, apart from avoiding the computation of the probabilities from measured frequency counts, we gain the further advantage of being able to use integers instead of fractions in our search for the optimal tree.

Considering the fact that the number of possible configurations of n nodes grows exponentially with n, the task of finding the optimum seems rather hopeless for large n. Optimal trees, however, have one significant property that helps to find them: all their subtrees are optimal too. For instance, if the tree in Fig. 4.37 is optimal, then the subtree with keys $k_3$ and $k_4$ is also optimal as shown. This property suggests an algorithm that systematically finds larger and larger trees, starting with individual nodes as smallest possible subtrees. The tree thus grows from the leaves to the root, which is, since we are used to drawing trees upside-down, the bottom-up direction [4-6].

The equation that is the key to this algorithm is derived as follows: Let P be the weighted path length of a tree, and let $P_L$ and $P_R$ be those of the left and right subtrees of its root. Clearly, P is the sum of $P_L$ and $P_R$, and the number of times a search travels on the leg to the root, which is simply the total number W of search trials. We call W the weight of the tree. Its average path length is then P/W.

  $P = P_L + W + P_R$
  $W = \sum_{i=1}^{n} a_i + \sum_{j=1}^{m} b_j$

These considerations show the need for a denotation of the weights and the path lengths of any subtree consisting of a number of adjacent keys. Let $T_{ij}$ be the optimal subtree consisting of nodes with keys $k_{i+1}, k_{i+2}, \ldots, k_j$. Then let $w_{ij}$ denote the weight and let $p_{ij}$ denote the path length of $T_{ij}$. Clearly $P = p_{0,n}$ and $W = w_{0,n}$. These quantities are defined by the following recurrence relations:

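$$
\begin{aligned}
w_{ii} &= b_i && (0 \le i \le n) \\
w_{ij} &= w_{i,j-1} + a_j + b_j && (0 \le i < j \le n) \\
p_{ii} &= w_{ii} && (0 \le i \le n) \\
p_{ij} &= w_{ij} + \min_{i < k \le j} \left( p_{i,k-1} + p_{kj} \right) && (0 \le i < j \le n)
\end{aligned}
$$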
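The recurrence relations can be evaluated literally, without any attempt at efficiency. The sketch below is in Python rather than Oberon, and the frequency counts `a` and `b` are illustrative values chosen for a tree with three keys; the function name is hypothetical. It fills the tables w and p in order of increasing width j-i and tries every admissible root k.

```python
def optimal_cost(a, b):
    """Evaluate the recurrences for w and p directly.

    a[1..n] are key frequencies (a[0] is unused); b[0..n] are the
    frequencies of unsuccessful searches. Returns the tables (p, w).
    """
    n = len(b) - 1
    w = [[0] * (n + 1) for _ in range(n + 1)]
    p = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        w[i][i] = p[i][i] = b[i]                  # w_ii = b_i, p_ii = w_ii
        for j in range(i + 1, n + 1):
            w[i][j] = w[i][j - 1] + a[j] + b[j]   # w_ij = w_i,j-1 + a_j + b_j
    for h in range(1, n + 1):                     # width h = j - i
        for i in range(n - h + 1):
            j = i + h
            # try every root k with i < k <= j
            p[i][j] = w[i][j] + min(p[i][k - 1] + p[k][j]
                                    for k in range(i + 1, j + 1))
    return p, w

# Illustrative counts for n = 3 keys.
a = [0, 1, 2, 5]
b = [20, 10, 1, 1]
p, w = optimal_cost(a, b)
print(w[0][3], p[0][3])   # 40 98
```

The triple nesting (h, i, k) makes the cubic growth of the effort directly visible.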

The last equation follows immediately from the definitions of P and of optimality. Since there are approximately $n^2/2$ values $p_{ij}$, and because its definition calls for a choice among all cases such that $0 < j-i \le n$, the minimization operation will involve approximately $n^3/6$ operations. Knuth pointed out that a factor n can be saved by the following consideration, which alone makes this algorithm usable for practical purposes.

Let $r_{ij}$ be a value of k which achieves the minimum for $p_{ij}$. It is possible to limit the search for $r_{ij}$ to a much smaller interval, i.e., to reduce the number of the j-i evaluation steps. The key is the observation that if we have found the root $r_{ij}$ of the optimal subtree $T_{ij}$, then neither extending the tree by adding a node at the right, nor shrinking the tree by removing its leftmost node can ever cause the optimal root to move to the left. This is expressed by the relation

$$r_{i,j-1} \le r_{ij} \le r_{i+1,j}$$

which limits the search for possible solutions for $r_{ij}$ to the range $r_{i,j-1} \ldots r_{i+1,j}$. This results in a total number of elementary steps in the order of $n^2$.

We are now ready to construct the optimization algorithm in detail. We recall the following definitions, which are based on optimal trees $T_{ij}$ consisting of nodes with keys $k_{i+1} \ldots k_j$.

1. $a_i$: the frequency of a search for $k_i$.
2. $b_j$: the frequency of a search argument x between $k_j$ and $k_{j+1}$.
3. $w_{ij}$: the weight of $T_{ij}$.
4. $p_{ij}$: the weighted path length of $T_{ij}$.
5. $r_{ij}$: the index of the root of $T_{ij}$.

We declare the following arrays:

  a: ARRAY n+1 OF INTEGER; (*a[0] not used*)
  b: ARRAY n+1 OF INTEGER;
  p, w, r: ARRAY n+1, n+1 OF INTEGER;

Assume that the weights $w_{ij}$ have been computed from a and b in a straightforward way. Now consider w as the argument of the procedure OptTree to be developed and consider r as its result, because r describes the tree structure completely. p may be considered an intermediate result. Starting out by considering the smallest possible subtrees, namely those consisting of no nodes at all, we proceed to larger and larger trees. Let us denote the width j-i of the subtree $T_{ij}$ by h. Then we can trivially determine the values $p_{ii}$ for all trees with h = 0 according to the definition of $p_{ij}$.

  FOR i := 0 TO n DO p[i,i] := b[i] END

In the case h = 1 we deal with trees consisting of a single node, which plainly is also the root (see Fig. 4.38).

  FOR i := 0 TO n-1 DO
    j := i+1; p[i,j] := w[i,j] + p[i,i] + p[j,j]; r[i,j] := j
  END

[Figure 4.38 (diagram not reproduced): single node kj|aj with external nodes bj-1 and bj; weights wj-1,j-1 and wj-1,j]

Fig. 4.38. Optimal search tree with single node

Note that i denotes the left index limit and j the right index limit in the considered tree $T_{ij}$. For the cases h > 1 we use a repetitive statement with h ranging from 2 to n, the case h = n spanning the entire tree $T_{0,n}$. In each case the minimal path length $p_{ij}$ and the associated root index $r_{ij}$ are determined by a simple repetitive statement with an index k ranging over the interval given for $r_{ij}$.

  FOR h := 2 TO n DO
    FOR i := 0 TO n-h DO
      j := i+h;
      find k and min = MIN k: i < k ≤ j : (p[i,k-1] + p[k,j]) such that r[i,j-1] ≤ k ≤ r[i+1,j];
      p[i,j] := min + w[i,j]; r[i,j] := k
    END
  END

The details of the refinement of the informal statement (find k and min ...) can be found in Program 4.6. The average path length of $T_{0,n}$ is now given by the quotient $p_{0,n}/w_{0,n}$, and its root is the node with index $r_{0,n}$.

Let us now describe the structure of the program to be designed. Its two main components are the procedures to find the optimal search tree, given a weight distribution w, and to display the tree given the indices r. First, the counts a and b and the keys are read from an input source. The keys are actually not involved in the computation of the tree structure; they are merely used in the subsequent display of the tree. After printing the frequency statistics, the program proceeds to compute the path length of the perfectly balanced tree, in passing also determining the roots of its subtrees. Thereafter, the average weighted path length is printed and the tree is displayed.

In the third part, procedure OptTree is activated in order to compute the optimal search tree; thereafter, the tree is displayed. And finally, the same procedures are used to compute and display the optimal tree considering the key frequencies only, ignoring the frequencies of non-keys. To summarize, the following are the global constants and variables:

  CONST N = 100; (*max no. of keywords*)
    WordLen = 16; (*max keyword length*)
  VAR key: ARRAY N+1, WordLen OF CHAR;
    a, b: ARRAY N+1 OF INTEGER;
    p, w, r: ARRAY N+1, N+1 OF INTEGER;

  PROCEDURE BalTree(i, j: INTEGER): INTEGER;
    VAR k: INTEGER;
  BEGIN k := (i+j+1) DIV 2; r[i, j] := k;
    IF i >= j THEN RETURN 0
    ELSE RETURN BalTree(i, k-1) + BalTree(k, j) + w[i, j]
    END
  END BalTree;

  PROCEDURE ComputeOptTree(n: INTEGER);
    VAR x, min, tmp: INTEGER;
      i, j, k, h, m: INTEGER;
  BEGIN (*argument: w, results: p, r*)
    FOR i := 0 TO n DO p[i, i] := 0 END ;
    FOR i := 0 TO n-1 DO
      j := i+1; p[i, j] := w[i, j]; r[i, j] := j
    END ;
    FOR h := 2 TO n DO
      FOR i := 0 TO n-h DO
        j := i+h; m := r[i, j-1]; min := p[i, m-1] + p[m, j];
        FOR k := m+1 TO r[i+1, j] DO
          tmp := p[i, k-1]; x := p[k, j] + tmp;
          IF x < min THEN m := k; min := x END
        END ;
        p[i, j] := min + w[i, j]; r[i, j] := m
      END
    END
  END ComputeOptTree;

  PROCEDURE WriteTree(i, j, level: INTEGER);
    VAR k: INTEGER; (*uses global writer W*)
  BEGIN
    IF i < j THEN
      WriteTree(i, r[i, j]-1, level+1);
      FOR k := 1 TO level DO Texts.Write(W, TAB) END ;
      Texts.WriteString(W, key[r[i, j]]); Texts.WriteLn(W);
      WriteTree(r[i, j], j, level+1)
    END
  END WriteTree;

  PROCEDURE Find(VAR S: Texts.Scanner);
    VAR i, j, n: INTEGER; (*uses global writer W*)
  BEGIN Texts.Scan(S); b[0] := SHORT(S.i);
    n := 0; Texts.Scan(S); (*input a, key, b*)
    WHILE S.class = Texts.Int DO
      INC(n); a[n] := SHORT(S.i); Texts.Scan(S); COPY(S.s, key[n]);
      Texts.Scan(S); b[n] := SHORT(S.i); Texts.Scan(S)
    END ;
    (*compute w from a and b*)
    FOR i := 0 TO n DO
      w[i, i] := b[i];
      FOR j := i+1 TO n DO w[i, j] := w[i, j-1] + a[j] + b[j] END
    END ;
    Texts.WriteString(W, "Total weight = ");
    Texts.WriteInt(W, w[0, n], 6); Texts.WriteLn(W);
    Texts.WriteString(W, "Pathlength of balanced tree = ");
    Texts.WriteInt(W, BalTree(0, n), 6); Texts.WriteLn(W);
    WriteTree(0, n, 0); Texts.WriteLn(W);
    ComputeOptTree(n);
    Texts.WriteString(W, "Pathlength of optimal tree = ");
    Texts.WriteInt(W, p[0, n], 6); Texts.WriteLn(W);
    WriteTree(0, n, 0); Texts.WriteLn(W);
    FOR i := 0 TO n DO
      w[i, i] := 0;
      FOR j := i+1 TO n DO w[i, j] := w[i, j-1] + a[j] END
    END ;
    ComputeOptTree(n);
    Texts.WriteString(W, "optimal tree not considering b"); Texts.WriteLn(W);
    WriteTree(0, n, 0); Texts.WriteLn(W)
  END Find;

As an example, let us consider the following input data of a tree with 3 keys:

  20 1 Albert 10 2 Ernst 1 5 Peter 1

  b0 = 20
  a1 = 1    key1 = Albert    b1 = 10
  a2 = 2    key2 = Ernst     b2 = 1
  a3 = 5    key3 = Peter     b3 = 1
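For readers who want to experiment, the procedures BalTree and ComputeOptTree can be transcribed almost line by line. The sketch below uses Python rather than Oberon and local tables instead of global arrays; the function names are my own. It follows the procedure in setting p[i][i] to 0, so the path lengths it reports are smaller by the sum of the b counts than those defined by the recurrence p_ii = w_ii; the offset is the same for every tree shape, so the choice of roots is unaffected.

```python
def weights(a, b):
    """w[i][j] for all 0 <= i <= j <= n, computed from the counts a and b."""
    n = len(b) - 1
    w = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        w[i][i] = b[i]
        for j in range(i + 1, n + 1):
            w[i][j] = w[i][j - 1] + a[j] + b[j]
    return w

def bal_tree(w, r, i, j):
    """Path length of the balanced tree over keys i+1..j; roots are stored in r."""
    k = (i + j + 1) // 2
    r[i][j] = k
    if i >= j:
        return 0
    return bal_tree(w, r, i, k - 1) + bal_tree(w, r, k, j) + w[i][j]

def opt_tree(w):
    """Optimal path lengths p and roots r, searching k only in r[i][j-1]..r[i+1][j]."""
    n = len(w) - 1
    p = [[0] * (n + 1) for _ in range(n + 1)]
    r = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):                       # trees with a single node
        p[i][i + 1] = w[i][i + 1]
        r[i][i + 1] = i + 1
    for h in range(2, n + 1):
        for i in range(n - h + 1):
            j = i + h
            m = r[i][j - 1]
            best = p[i][m - 1] + p[m][j]
            for k in range(m + 1, r[i + 1][j] + 1):
                x = p[i][k - 1] + p[k][j]
                if x < best:
                    m, best = k, x
            p[i][j] = best + w[i][j]
            r[i][j] = m
    return p, r

a = [0, 1, 2, 5]             # a[0] is unused
b = [20, 10, 1, 1]
n = len(b) - 1
w = weights(a, b)
r_bal = [[0] * (n + 1) for _ in range(n + 1)]
balanced = bal_tree(w, r_bal, 0, n)
p, r = opt_tree(w)
print(w[0][n], balanced, p[0][n], r[0][n])   # 40 78 66 1
```

With the example input, the root of the optimal tree is key 1 (Albert), whereas the balanced tree has key 2 (Ernst) at its root.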

The results of procedure Find are shown in Fig. 4.40 and demonstrate that the structures obtained for the three cases may differ significantly. The total weight is 40, the pathlength of the balanced tree is 78, and that of the optimal tree is 66.

[Figure 4.40, panels: balanced tree; optimal tree; not considering key misses. Each panel shows the keys Albert, Ernst, Peter; the indentation that encoded the tree shapes is lost in the extracted text]

Fig. 4.40. The 3 trees generated by the Optimal Tree procedure

It is evident from this algorithm that the effort to determine the optimal structure is of the order of $n^2$; also, the amount of required storage is of the order $n^2$. This is unacceptable if n is very large. Algorithms with greater efficiency are therefore highly desirable. One of them is the algorithm developed by Hu and Tucker [4-5] which requires only O(n) storage and O(n*log(n)) computations. However, it considers only the case in which the key frequencies are zero, i.e., where only the unsuccessful search trials are registered. Another algorithm, also requiring O(n) storage elements and O(n*log(n)) computations, was described by Walker and Gotlieb [4-7]. Instead of trying to find the optimum, this algorithm merely promises to yield a nearly optimal tree. It can therefore be based on heuristic principles. The basic idea is the following.

Consider the nodes (genuine and special nodes) being distributed on a linear scale, weighted by their frequencies (or probabilities) of access. Then find the node which is closest to the center of gravity. This node is called the centroid, and its index is

$$\left( \sum_{i=1}^{n} i \cdot a_i + \sum_{j=1}^{m} j \cdot b_j \right) / W$$

rounded to the nearest integer. If all nodes have equal weight, then the root of the desired optimal tree evidently coincides with the centroid. Otherwise -- so the reasoning goes -- it will in most cases be in the close neighborhood of the centroid. A limited search is then used to find the local optimum, whereafter this procedure is applied to the resulting two subtrees. The likelihood of the root lying very close to the centroid grows with the size n of the tree. As soon as the subtrees have reached a manageable size, their optimum can be determined by the above exact algorithm.

4.7. B-Trees

So far, we have restricted our discussion to trees in which every node has at most two descendants, i.e., to binary trees. This is entirely satisfactory if, for instance, we wish to represent family relationships with a preference for the pedigree view, in which every person is associated with his parents. After all, no one has more than two parents. But what about someone who prefers the posterity view? He has to cope with the fact that some people have more than two children, and his trees will contain nodes with many branches. For lack of a better term, we shall call them multiway trees.

Of course, there is nothing special about such structures, and we have already encountered all the programming and data definition facilities to cope with such situations. If, for instance, an absolute upper limit on the number of children is given (which is admittedly a somewhat futuristic assumption), then one may represent the children as an array component of the record representing a person. If the number of children varies strongly among different persons, however, this may result in a poor utilization of available storage. In this case it will be much more appropriate to arrange the offspring as a linear list, with a pointer to the youngest (or eldest) offspring assigned to the parent. A possible type definition for this case is the following, and a possible data structure is shown in Fig. 4.43.

  TYPE Person = POINTER TO RECORD
      name: alfa;
      sibling, offspring: Person
    END

[Figure 4.43 (diagram not reproduced): persons JOHN, ALBERT, PETER, MARY, PAUL, ROBERT, CAROL, CHRIS, GEORGE, PAMELA, TINA linked by sibling and offspring pointers]

Fig. 4.43. Multiway tree represented as binary tree

We now realize that by tilting this picture by 45 degrees it will look like a perfect binary tree. But this view is misleading because functionally the two references have entirely different meanings. One usually doesn't treat a sibling as an offspring and get away unpunished, and hence one should not do so even in constructing data definitions. This example could also be easily extended into an even more complicated data structure by introducing more components in each person's record, thus being able to represent further family relationships. A likely candidate that cannot generally be derived from the sibling and offspring references is that of husband and wife, or even the inverse relationship of father and mother. Such a structure quickly grows into a complex relational data bank, and it may be possible to map several trees into it. The algorithms operating on such structures are intimately tied to their data definitions, and it does not make sense to specify any general rules or widely applicable techniques.

However, there is a very practical area of application of multiway trees which is of general interest. This is the construction and maintenance of large-scale search trees in which insertions and deletions are necessary, but in which the primary store of a computer is not large enough or is too costly to be used for long-time storage.

Assume, then, that the nodes of a tree are to be stored on a secondary storage medium such as a disk store. Dynamic data structures introduced in this chapter are particularly suitable for incorporation of secondary storage media. The principal innovation is merely that pointers are represented by disk store addresses instead of main store addresses. Using a binary tree for a data set of, say, a million items, requires on the average approximately $\log_2 10^6$ (i.e. about 20) search steps.
Since each step now involves a disk access (with inherent latency time), a storage organization using fewer accesses will be highly desirable. The multiway tree is a perfect solution to this problem. If an item located on a secondary store is accessed, an entire group of items may also be accessed without much additional cost. This suggests that a tree be subdivided into subtrees, and that the subtrees are represented as units that are accessed all together. We shall call these subtrees pages. Figure 4.44 shows a binary tree subdivided into pages, each page consisting of 7 nodes.


Fig. 4.44. Binary tree subdivided into pages

The saving in the number of disk accesses -- each page access now involves a disk access -- can be considerable. Assume that we choose to place 100 nodes on a page (this is a reasonable figure); then the million item search tree will on the average require only $\log_{100} 10^6$ (i.e. about 3) page accesses instead of 20. But, of course, if the tree is left to grow at random, then the worst case may still be as large as $10^4$. It is plain that a scheme for controlled growth is almost mandatory in the case of multiway trees.

4.7.1. Multiway B-Trees

If one is looking for a controlled growth criterion, the one requiring a perfect balance is quickly eliminated because it involves too much balancing overhead. The rules must clearly be somewhat relaxed. A very sensible criterion was postulated by R. Bayer and E.M. McCreight [4-2] in 1970: every page (except one) contains between n and 2n nodes for a given constant n. Hence, in a tree with N items and a maximum page size of 2n nodes per page, the worst case requires $\log_n N$ page accesses; and page accesses clearly dominate the entire search effort. Moreover, the important factor of store utilization is at least 50% since pages are always at least half full. With all these advantages, the scheme involves comparatively simple algorithms for search, insertion, and deletion. We will subsequently study them in detail.

The underlying data structures are called B-trees, and have the following characteristics; n is said to be the order of the B-tree.

1. Every page contains at most 2n items (keys).
2. Every page, except the root page, contains at least n items.
3. Every page is either a leaf page, i.e. has no descendants, or it has m+1 descendants, where m is the number of keys on the page.
4. All leaf pages appear at the same level.
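The four characteristics lend themselves to a mechanical check. The sketch below is in Python rather than Oberon; the class `Page`, the function `check_btree`, and the sample tree of order 2 are assumptions made for illustration. In addition to the four rules it verifies that the keys are ordered across pages, which the search method presupposes.

```python
class Page:
    """A B-tree page: sorted keys and either no children (leaf) or len(keys)+1 children."""
    def __init__(self, keys, children=()):
        self.keys = list(keys)
        self.children = list(children)

def check_btree(root, n):
    """Return the number of levels if the tree is a B-tree of order n, else raise ValueError."""
    leaf_levels = set()

    def require(cond, msg):
        if not cond:
            raise ValueError(msg)

    def visit(page, level, lo, hi):
        m = len(page.keys)
        require(m <= 2 * n, "more than 2n items on a page")             # rule 1
        require(m >= (1 if page is root else n), "page is under-full")  # rule 2
        require(page.keys == sorted(page.keys), "keys out of order")
        require(all((lo is None or lo < k) and (hi is None or k < hi)
                    for k in page.keys), "key outside its subtree range")
        if not page.children:
            leaf_levels.add(level)
            return
        require(len(page.children) == m + 1, "need m+1 descendants")    # rule 3
        bounds = [lo] + page.keys + [hi]
        for c, child in enumerate(page.children):
            visit(child, level + 1, bounds[c], bounds[c + 1])

    visit(root, 1, None, None)
    require(len(leaf_levels) == 1, "leaf pages on different levels")    # rule 4
    return leaf_levels.pop()

# Illustrative B-tree of order 2 with three levels.
sample = Page([25], [
    Page([10, 20], [Page([2, 5, 7, 8]), Page([13, 14, 15, 18]), Page([22, 24])]),
    Page([30, 40], [Page([26, 27, 28]), Page([32, 35, 38]), Page([41, 42, 45, 46])]),
])
print(check_btree(sample, 2))   # 3
```

A tree such as `Page([25], [Page([10]), Page([30, 40])])` is rejected for order 2, because the leaf page holding only the key 10 violates rule 2.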

[Figure 4.45 (diagram not reproduced): root page 25; second-level pages 10 20 and 30 40; leaf pages 2 5 7 8, 13 14 15 18, 22 24, 26 27 28, 32 35 38, 41 42 45 46]

Fig. 4.45. B-tree of order 2

Figure 4.45 shows a B-tree of order 2 with 3 levels. All pages contain 2, 3, or 4 items; the exception is the root which is allowed to contain a single item only. All leaf pages appear at level 3. The keys appear in increasing order from left to right if the B-tree is squeezed into a single level by inserting the descendants in between the keys of their ancestor page. This arrangement represents a natural extension of binary search trees, and it determines the method of searching an item with given key. Consider a page of the form shown in Fig. 4.46 and a given search argument x. Assuming that the page has been moved into the primary store, we may use conventional search methods among the keys $k_1 \ldots k_m$. If m is sufficiently large, one may use binary search; if it is rather small, an ordinary sequential search will do. (Note that the time required for a search in main store is probably negligible compared to the time it takes to move the page from secondary into primary store.) If the search is unsuccessful, we are in one of the following situations:

1. $k_i < x < k_{i+1}$, for 1 ≤ i < m. The search continues on page $p_i$^.
2. $k_m < x$. The search continues on page $p_m$^.
3. $x < k_1$. The search continues on page $p_0$^.
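The three situations collapse into a single index computation when the within-page search is a binary search. The following sketch is in Python rather than Oberon; `Page`, `search`, and the sample tree are hypothetical, and `children[i]` plays the role of the pointer p_i (with `None` standing in for a page whose descendant pointers are all NIL).

```python
from bisect import bisect_left

class Page:
    def __init__(self, keys, children=None):
        self.keys = keys              # k1 .. km in increasing order
        self.children = children      # None for a leaf, else the m+1 pages p0 .. pm

def search(page, x):
    """Return (page, index) of the item with key x, or None if x is not in the tree.

    Each iteration of the loop corresponds to one page access.
    """
    while page is not None:
        i = bisect_left(page.keys, x)         # first index with keys[i] >= x
        if i < len(page.keys) and page.keys[i] == x:
            return page, i
        # i == 0: x < k1; i == m: km < x; otherwise x lies between two keys.
        page = page.children[i] if page.children else None
    return None

root = Page([10, 20], [Page([2, 5, 7, 8]), Page([13, 14, 15, 18]), Page([22, 24])])
hit = search(root, 14)
print(hit[0].keys, hit[1])    # [13, 14, 15, 18] 1
print(search(root, 21))       # None
```

Because `bisect_left` returns the number of keys smaller than x, its result directly selects the descendant on which the search continues.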

p0 k1 p1 k2 p2 ... pm-1 km pm

Fig. 4.46. B-tree page with m keys

If in some case the designated pointer is NIL, i.e., if there is no descendant page, then there is no item with key x in the whole tree, and the search is terminated.

Surprisingly, insertion in a B-tree is comparatively simple too. If an item is to be inserted in a page with m < 2n items, the insertion process remains constrained to that page. It is only insertion into an already full page that has consequences upon the tree structure and may cause the allocation of new pages. To understand what happens in this case, refer to Fig. 4.47, which illustrates the insertion of key 22 in a B-tree of order 2. It proceeds in the following steps:

1. Key 22 is found to be missing; insertion in page C is impossible because C is already full.
2. Page C is split into two pages (i.e., a new page D is allocated).
3. The 2n+1 keys are equally distributed onto C and D, and the middle key is moved up one level into the ancestor page A.

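[Figure 4.47 (diagram not reproduced). Before the insertion: page A = 20, page B = 7 10 15 18, page C = 26 30 35 40. After the insertion: page A = 20 30, page B = 7 10 15 18, page C = 22 26, page D = 35 40]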

Fig. 4.47. Insertion of key 22 in B-tree

This very elegant scheme preserves all the characteristic properties of B-trees. In particular, the split pages contain exactly n items. Of course, the insertion of an item in the ancestor page may again cause that page to overflow, thereby causing the splitting to propagate. In the extreme case it may propagate up to the root. This is, in fact, the only way that the B-tree may increase its height. The B-tree has thus a strange manner of growing: it grows from its leaves upward to the root.

We shall now develop a detailed program from these sketchy descriptions. It is already apparent that a recursive formulation will be most convenient because of the property of the splitting process to propagate back along the search path. The general structure of the program will therefore be similar to balanced tree insertion, although the details are different. First of all, a definition of the page structure has to be formulated. We choose to represent the items in the form of an array.

  TYPE Page = POINTER TO PageDescriptor;
    Item = RECORD
      key: INTEGER;
      p: Page;
      count: INTEGER (*data*)
    END ;
    PageDescriptor = RECORD
      m: INTEGER; (* 0 .. 2n *)
      p0: Page;
      e: ARRAY 2*n OF Item
    END

Again, the item component count stands for all kinds of other information that may be associated with each item, but it plays no role in the actual search process. Note that each page offers space for 2n items. The field m indicates the actual number of items on the page. As m ≥ n (except for the root page), a storage utilization of at least 50% is guaranteed.

The algorithm of B-tree search and insertion is formulated below as a procedure called search. Its main structure is straightforward and similar to that for the balanced binary tree search, with the exception that the branching decision is not a binary choice. Instead, the "within-page search" is represented as a binary search on the array e of elements.

The insertion algorithm is formulated as a separate procedure merely for clarity. It is activated after search has indicated that an item is to be passed up on the tree (in the direction toward the root). This fact is indicated by the Boolean result parameter h; it assumes a similar role as in the algorithm for balanced tree insertion, where h indicates that the subtree had grown. If h is true, the second result parameter, u, represents the item being passed up. Note that insertions start in hypothetical pages, namely, the "special nodes" of Fig. 4.19; the new item is immediately handed up via the parameter u to the leaf page for actual insertion. The scheme is sketched here:

  PROCEDURE search(x: INTEGER; a: Page; VAR h: BOOLEAN; VAR u: Item);
  BEGIN
    IF a = NIL THEN (*x not in tree, insert*)
      Assign x to item u, set h to TRUE, indicating that an item u is passed up in the tree
    ELSE
      binary search for x in array a.e;
      IF found THEN process data
      ELSE
        search(x, descendant, h, u);
        IF h THEN (*an item was passed up*)
          IF no. of items on page a^ < 2n THEN
            insert u on page a^ and set h to FALSE
          ELSE split page and pass middle item up
          END
        END
      END
    END
  END search

If the parameter h is true after the call of search in the main program, a split of the root page is requested.
Since the root page plays an exceptional role, this process has to be programmed separately. It consists merely of the allocation of a new root page and the insertion of the single item given by the parameter u. As a consequence, the new root page contains a single item only. The details can be gathered from Program 4.7, and Fig. 4.48 shows the result of using Program 4.7 to construct a B-tree with the following insertion sequence of keys:

  20; 40 10 30 15; 35 7 26 18 22; 5; 42 13 46 27 8 32; 38 24 45 25;

The semicolons designate the positions of the snapshots taken upon each page allocation. Insertion of the last key causes two splits and the allocation of three new pages.
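The insertion scheme can be tried out on the key sequence above. The sketch below is in Python rather than Oberon and is not a transcription of Program 4.7: it keeps keys and descendants in two lists per page, and the recursive helper returns the item passed up (instead of using the result parameters h and u). All names are illustrative. A page that overflows to 2n+1 keys is split into two pages of n keys each, and the middle key moves up.

```python
from bisect import bisect_left

N = 2   # order of the B-tree: every page holds at most 2*N keys

class Page:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []   # empty for a leaf page

def _insert(page, x):
    """Insert x below page. Return (middle_key, new_page) if page was split, else None."""
    i = bisect_left(page.keys, x)
    if i < len(page.keys) and page.keys[i] == x:
        return None                          # key already present
    if not page.children:                    # leaf page: insert here
        page.keys.insert(i, x)
    else:
        passed_up = _insert(page.children[i], x)
        if passed_up is None:
            return None
        key, right = passed_up               # descendant was split
        page.keys.insert(i, key)
        page.children.insert(i + 1, right)
    if len(page.keys) <= 2 * N:
        return None
    # Overflow: distribute the 2N+1 keys and pass the middle one up.
    middle = page.keys[N]
    new_page = Page(page.keys[N + 1:], page.children[N + 1:])
    page.keys = page.keys[:N]
    page.children = page.children[:N + 1]
    return middle, new_page

def insert(root, x):
    """Insert x and return the (possibly new) root page."""
    if root is None:
        return Page([x])
    passed_up = _insert(root, x)
    if passed_up is None:
        return root
    key, right = passed_up                   # root was split: the tree grows in height
    return Page([key], [root, right])

def levels(root):
    """Keys of all pages, level by level from the root."""
    out, row = [], [root]
    while row:
        out.append([p.keys for p in row])
        row = [c for p in row for c in p.children]
    return out

root = None
for key in [20, 40, 10, 30, 15, 35, 7, 26, 18, 22, 5,
            42, 13, 46, 27, 8, 32, 38, 24, 45, 25]:
    root = insert(root, key)
for row in levels(root):
    print(row)
```

After the last key, the tree has three levels: the root page holds 25, the second level holds the pages 10 20 and 30 40, and there are six leaf pages, nine pages in all.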

[Figure 4.48, panels a) to f): snapshots of the B-tree taken upon each page allocation; the diagram is not recoverable from the extracted text]

Fig. 4.48. Growth of B-tree of order 2

Since each activation of search implies one page transfer to main store, $k = \log_n N$ recursive calls are necessary at most, if the tree contains N items. Hence, we must be capable of accommodating k pages in main store. This is one limiting factor on the page size 2n. In fact, we need to accommodate even more than k pages, because insertion may cause page splitting to occur. A corollary is that the root page is best allocated permanently in the primary store, because each query proceeds necessarily through the root page.

Another positive quality of the B-tree organization is its suitability and economy in the case of purely sequential updating of the entire data base. Every page is fetched into primary store exactly once.

Deletion of items from a B-tree is fairly straightforward in principle, but it is complicated in the details. We may distinguish two different circumstances:

1. The item to be deleted is on a leaf page; here its removal algorithm is plain and simple.
2. The item is not on a leaf page; it must be replaced by one of the two lexicographically adjacent items, which happen to be on leaf pages and can easily be deleted.

In case 2 finding the adjacent key is analogous to finding the one used in binary tree deletion. We descend along the rightmost pointers down to the leaf page P, replace the item to be deleted by the rightmost item on P, and then reduce the size of P by 1. In any case, reduction of size must be followed by a check of the number of items m on the reduced page, because, if m < n, the primary characteristic of B-trees would be violated. Some additional action has to be taken; this underflow condition is indicated by the Boolean variable parameter h.

The only recourse is to borrow or annect an item from one of the neighboring pages, say from Q.
Since this involves fetching page Q into main store -- a relatively costly operation -- one is tempted to make the best of this undesirable situation and to annect more than a single item at once. The usual strategy is to distribute the items on pages P and Q evenly on both pages. This is called page balancing.

Of course, it may happen that there is no item left to be annected since Q has already reached its minimal size n. In this case the total number of items on pages P and Q is 2n-1; we may merge the two pages into one, adding the middle item from the ancestor page of P and Q, and then entirely dispose of page Q. This is exactly the inverse process of page splitting. The process may be visualized by considering the deletion of key 22 in Fig. 4.47. Once again, the removal of the middle key in the ancestor page may cause its size to drop below the permissible limit n, thereby requiring that further special action (either balancing or merging) be undertaken at the next level. In the extreme case page merging may propagate all the way up to the root. If the root is reduced to size 0, it is itself deleted, thereby causing a reduction in the height of the B-tree. This is, in fact, the only way that a B-tree may shrink in height. Figure 4.49 shows the gradual decay of the B-tree of Fig. 4.48 upon the sequential deletion of the keys

  25 45 24; 38 32; 8 27 46 13 42; 5 22 18 26; 7 35 15;

The semicolons again mark the places where the snapshots are taken, namely where pages are being eliminated. The similarity of its structure to that of balanced tree deletion is particularly noteworthy.
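The choice between balancing and merging depends only on the sizes of the two adjacent pages. The sketch below, in Python rather than Oberon, shows that decision for two leaf pages and the key that separates them in the ancestor page. It is a simplification made for illustration, not the procedure underflow of Program 4.7: descendant pointers are ignored, and the function name and return convention are hypothetical.

```python
def fix_underflow(left, sep, right, n):
    """Repair an underflow between two adjacent leaf pages.

    left, right: key lists of the two pages (one of them holds n-1 keys);
    sep: the key between them in the ancestor page.
    Returns ('balance', new_left, new_sep, new_right) or ('merge', page).
    """
    pool = left + [sep] + right
    if len(pool) > 2 * n:
        # The neighbor holds more than n keys: distribute the keys evenly
        # and put the middle key back into the ancestor page.
        mid = len(pool) // 2
        return ('balance', pool[:mid], pool[mid], pool[mid + 1:])
    # The neighbor holds exactly n keys: (n-1) + 1 + n = 2n keys fit on one page;
    # the ancestor page loses the separating key.
    return ('merge', pool)

# Keys 26 | 30 | 35 40: the left page has lost a key, the neighbor is at minimum size.
print(fix_underflow([26], 30, [35, 40], 2))       # ('merge', [26, 30, 35, 40])
# Same situation, but the neighbor has a key to spare.
print(fix_underflow([26], 30, [35, 40, 42], 2))   # ('balance', [26, 30], 35, [40, 42])
```

In the merge case the ancestor page shrinks by one key, which is why the underflow may propagate toward the root.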

[Figure 4.49, panels a) to f): snapshots of the B-tree taken where pages are being eliminated; the diagram is not recoverable from the extracted text]

Fig. 4.49. Decay of B-tree of order 2

  TYPE Page = POINTER TO PageRec;
    Entry = RECORD key: INTEGER; p: Page END ;
    PageRec = RECORD
      m: INTEGER; (*no. of entries on page*)
      p0: Page;
      e: ARRAY 2*N OF Entry
    END ;

  VAR root: Page; W: Texts.Writer;

  PROCEDURE search(x: INTEGER; VAR p: Page; VAR k: INTEGER);
    VAR i, L, R: INTEGER; found: BOOLEAN; a: Page;
  BEGIN a := root; found := FALSE;
    WHILE (a # NIL) & ~found DO
      L := 0; R := a.m; (*binary search*)
      WHILE L < R DO
        i := (L+R) DIV 2;
        IF x

  [text missing in the source: the remainder of procedure search up to the middle of procedure underflow]

      0 THEN
        FOR i := N-2 TO 0 BY -1 DO a.e[i+k] := a.e[i] END ;
        a.e[k-1] := c.e[s]; a.e[k-1].p := a.p0;
        (*move k-1 items from b to a, one to c*) DEC(b.m, k);
        FOR i := k-2 TO 0 BY -1 DO a.e[i] := b.e[i+b.m+1] END ;
        c.e[s] := b.e[b.m]; a.p0 := c.e[s].p;
        c.e[s].p := a; a.m := N-1+k; h := FALSE
      ELSE (*merge pages a and b, discard a*)
        c.e[s].p := a.p0; b.e[N] := c.e[s];
        FOR i := 0 TO N-2 DO b.e[i+N+1] := a.e[i] END ;
        b.m := 2*N; DEC(c.m); h := c.m < N
      END
    END
  END underflow;

  PROCEDURE delete(x: INTEGER; a: Page; VAR h: BOOLEAN);
    (*search and delete key x in B-tree a; if a page underflow arises,
      balance with adjacent page or merge; h := "page a is undersize"*)
    VAR i, L, R: INTEGER; q: Page;

    PROCEDURE del(p: Page; VAR h: BOOLEAN);
      VAR k: INTEGER; q: Page; (*global a, R*)
    BEGIN k := p.m-1; q := p.e[k].p;
      IF q # NIL THEN del(q, h);
        IF h THEN underflow(p, q, p.m, h) END
      ELSE p.e[k].p := a.e[R].p; a.e[R] := p.e[k];
        DEC(p.m); h := p.m < N
      END
    END del;

  BEGIN
    IF a # NIL THEN
      L := 0; R := a.m; (*binary search*)
      WHILE L < R DO
        i := (L+R) DIV 2;
        IF x

  [text missing in the source: the remainder of procedure delete up to the middle of procedure ShowTree]

      0 THEN ShowTree(p.p0, level+1) END ;
      FOR i := 0 TO p.m-1 DO ShowTree(p.e[i].p, level+1) END
    END
  END ShowTree;

Extensive analysis of B-tree performance has been undertaken and is reported in the referenced article (Bayer and McCreight). In particular, it includes a treatment of the question of optimal page size, which strongly depends on the characteristics of the storage and computing system available.

Variations of the B-tree scheme are discussed in Knuth, Vol. 3, pp. 476-479. The one notable observation is that page splitting should be delayed in the same way that page merging is delayed, by first attempting to balance neighboring pages. Apart from this, the suggested improvements seem to yield marginal gains. A comprehensive survey of B-trees may be found in [4-8].

4.7.2. Binary B-Trees

The species of B-trees that seems to be least interesting is the first order B-tree (n = 1). But sometimes it is worthwhile to pay attention to the exceptional case. It is plain, however, that first-order B-trees are not useful in representing large, ordered, indexed data sets involving secondary stores; approximately 50% of all pages will contain a single item only. Therefore, we shall forget secondary stores and again consider the problem of search trees involving a one-level store only.

A binary B-tree (BB-tree) consists of nodes (pages) with either one or two items. Hence, a page contains either two or three pointers to descendants; this suggested the term 2-3 tree. According to the definition of B-trees, all leaf pages appear at the same level, and all non-leaf pages of BB-trees have either two or three descendants (including the root). Since we now are dealing with primary store only, an optimal economy of storage space is mandatory, and the representation of the items inside a node in the form of an array appears unsuitable.
An alternative is the dynamic, linked allocation; that is, inside each node there exists a linked list of items of length 1 or 2. Since each node has at most three descendants and thus needs to harbor only up to three pointers, one is tempted to combine the pointers for descendants and pointers in the item list as shown in Fig. 4.50. The B-tree node thereby loses its actual identity, and the items assume the role of nodes in a regular binary tree. It remains necessary, however, to distinguish between pointers to descendants (vertical) and pointers to siblings on the same page (horizontal). Since only the pointers to the right may be horizontal, a single bit is sufficient to record this distinction. We therefore introduce the Boolean field h with the meaning horizontal. The definition of a tree node based on this representation is given below. It was suggested and investigated by R. Bayer [4-3] in 1971 and represents a search tree organization guaranteeing p = 2*log(N) as maximum path length.

  TYPE Node = POINTER TO RECORD
      key: INTEGER;
      ...........
      left, right: Node;
      h: BOOLEAN (*right branch horizontal*)
    END

[Figure 4.50: diagram with subtrees labelled a, b and a, b, c; not recoverable from the extracted text]

Fig. 4.50. Representation of BB-tree nodes

Considering the problem of key insertion, one must distinguish four possible situations that arise from growth of the left or right subtrees. The four cases are illustrated in Fig. 4.51. Remember that B-trees have the characteristic of growing from the bottom toward the root and that the property of all leaves being at the same level must be maintained. The simplest case (1) is when the right subtree of a node A grows and when A is the only key on its (hypothetical) page. Then, the descendant B merely becomes the sibling of A, i.e., the vertical pointer becomes a horizontal pointer. This simple raising of the right arm is not possible if A already has a sibling. Then we would obtain a page with 3 nodes, and we have to split it (case 2). Its middle node B is passed up to the next higher level.

Now assume that the left subtree of a node B has grown in height. If B is again alone on a page (case 3), i.e., its right pointer refers to a descendant, then the left subtree (A) is allowed to become B's sibling. (A simple rotation of pointers is necessary since the left pointer cannot be horizontal.) If, however, B already has a sibling, the raising of A yields a page with three members, requiring a split. This split is realized in a very straightforward manner: C becomes a descendant of B, which is raised to the next higher level (case 4).
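The four cases amount to small local pointer rearrangements. The sketch below, in Python rather than Oberon, spells out only these rearrangements as read from the description above; it is not a complete insertion procedure, and the function names and the returned flag are hypothetical. Each function returns the node that now heads the subtree together with a flag telling whether the subtree has grown in height.

```python
class Node:
    def __init__(self, key, left=None, right=None, h=False):
        self.key, self.left, self.right = key, left, right
        self.h = h                     # True: the right pointer is horizontal

def case1(a):
    """Right subtree of a grew, a is alone on its page: raise the right arm."""
    a.h = True
    return a, False

def case2(a):
    """a -> b -> c are three siblings on one page: split, the middle node b moves up."""
    b = a.right
    a.right, a.h = b.left, False       # a keeps b's former left subtree
    b.left, b.h = a, False             # c stays b's right descendant
    return b, True

def case3(b):
    """Left subtree a of b grew, b is alone: rotate so that a becomes the left sibling."""
    a = b.left
    b.left = a.right
    a.right, a.h = b, True
    return a, False

def case4(b):
    """Left subtree a of b grew and b has sibling c: c becomes a descendant, b moves up."""
    b.h = False
    return b, True

def inorder(t):
    return [] if t is None else inorder(t.left) + [t.key] + inorder(t.right)

# Case 2: nodes 1 -> 2 -> 3 linked horizontally.
a = Node(1, h=True, right=Node(2, h=True, right=Node(3)))
top, grown = case2(a)
print(top.key, grown, inorder(top))    # 2 True [1, 2, 3]
```

None of the four rearrangements changes the in-order sequence of keys; only the levels of the nodes are affected.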


Fig. 4.51. Node insertion in BB-tree

It should be noted that upon searching a key, it makes no effective difference whether we proceed along a horizontal or a vertical pointer. It therefore appears artificial to worry about a left pointer in case 3 becoming horizontal, although its page still contains not more than two members. Indeed, the insertion algorithm reveals a strange asymmetry in handling the growth of left and right subtrees, and it lets the BB-tree organization appear rather artificial. There is no proof of strangeness of this organization; yet a healthy intuition tells us that something is fishy, and that we should remove this asymmetry. It leads to the notion of the symmetric binary B-tree (SBB-tree) which was also investigated by Bayer [4-4] in 1972. On the average it leads to slightly more efficient search trees, but the algorithms for insertion and deletion are also slightly more complex. Furthermore, each node now requires two bits (Boolean variables lh and rh) to indicate the nature of its two pointers.

Since we will restrict our detailed considerations to the problem of insertion, we have once again to distinguish among four cases of grown subtrees. They are illustrated in Fig. 4.52, which makes the gained symmetry evident. Note that whenever a subtree of node A without siblings grows, the root of the subtree becomes the sibling of A. This case need not be considered any further.


Fig. 4.52. Insertion in SBB-trees

The four cases considered in Fig. 4.52 all reflect the occurrence of a page overflow and the subsequent page split. They are labelled (LL, LR, RR, RL) according to the directions of the horizontal pointers linking the three siblings in the middle figures. The initial situation is shown in the left column; the middle column illustrates the fact that the lower node has been raised as its subtree has grown; the figures in the right column show the result of node rearrangement.

It is advisable to stick no longer to the notion of pages out of which this organization had developed, for we are only interested in bounding the maximum path length to 2*log(N). For this we need only ensure that two horizontal pointers may never occur in succession on any search path. However, there is no reason to forbid nodes with horizontal pointers to the left and right, i.e. to treat the left and right sides differently. We therefore define the symmetric binary B-tree as a tree that has the following properties:

1. Every node contains one key and at most two (pointers to) subtrees.

2. Every pointer is either horizontal or vertical. There are no two consecutive horizontal pointers on any search path.

3. All terminal nodes (nodes without descendants) appear at the same (terminal) level.

From this definition it follows that the longest search path is no longer than twice the height of the tree. Since no SBB-tree with N nodes can have a height larger than log(N), it follows immediately that 2*log(N) is an upper bound on the search path length. In order to visualize how these trees grow, we refer to Fig. 4.53. The lines represent snapshots taken during the insertion of the following sequences of keys, where every semicolon marks a snapshot.

(1)  1  2;  3;  4  5  6;  7;
(2)  5  4;  3;  1  2  7  6;
(3)  6  2;  4;  1  7  3  5;
(4)  4  2  6;  1  7;  3  5;

Fig. 4.53. Insertion of keys 1 to 7

These pictures make the third property of B-trees particularly obvious: all terminal nodes appear on the same level. One is therefore inclined to compare these structures with garden hedges that have been recently trimmed with hedge scissors.

The algorithm for the construction of SBB-trees is shown below. It is based on a definition of the type Node with the two components lh and rh indicating whether or not the left and right pointers are horizontal.

TYPE Node = POINTER TO RECORD
    key, count: INTEGER;
    left, right: Node;
    lh, rh: BOOLEAN
  END

The recursive procedure search again follows the pattern of the basic binary tree insertion algorithm. A third parameter h is added; it indicates whether or not the subtree with root p has changed, and it corresponds directly to the parameter h of the B-tree search program. We must note, however, the consequence of representing pages as linked lists: a page is traversed by either one or two calls of the search procedure. We must distinguish between the case of a subtree (indicated by a vertical pointer) that has grown and a sibling node (indicated by a horizontal pointer) that has obtained another sibling and hence requires a page split. The problem is easily solved by introducing a three-valued h with the following meanings:

1. h = 0: the subtree p requires no changes of the tree structure.
2. h = 1: node p has obtained a sibling.
3. h = 2: the subtree p has increased in height.

PROCEDURE search(VAR p: Node; x: INTEGER; VAR h: INTEGER);
  VAR q, r: Node;
BEGIN (*h=0*)
  IF p = NIL THEN (*insert new node*)
    NEW(p); p.key := x; p.left := NIL; p.right := NIL;
    p.lh := FALSE; p.rh := FALSE; h := 2
  ELSIF x < p.key THEN
    search(p.left, x, h);
    IF h > 0 THEN (*left branch has grown or received sibling*)
      q := p.left;
      IF p.lh THEN
        h := 2; p.lh := FALSE;
        IF q.lh THEN (*LL*)
          p.left := q.right; q.lh := FALSE; q.right := p; p := q
        ELSE (*q.rh, LR*)
          r := q.right; q.right := r.left; q.rh := FALSE;
          r.left := p.left; p.left := r.right; r.right := p; p := r
        END
      ELSE
        DEC(h);
        IF h = 1 THEN p.lh := TRUE END
      END
    END
  ELSIF x > p.key THEN
    search(p.right, x, h);
    IF h > 0 THEN (*right branch has grown or received sibling*)
      q := p.right;
      IF p.rh THEN
        h := 2; p.rh := FALSE;
        IF q.rh THEN (*RR*)
          p.right := q.left; q.rh := FALSE; q.left := p; p := q
        ELSE (*q.lh, RL*)
          r := q.left; q.left := r.right; q.lh := FALSE;
          r.right := p.right; p.right := r.left; r.left := p; p := r
        END
      ELSE
        DEC(h);
        IF h = 1 THEN p.rh := TRUE END
      END
    END
  END
END search;

Note that the actions to be taken for node rearrangement very strongly resemble those developed in the AVL-balanced tree search algorithm. It is evident that all four cases can be implemented by simple pointer rotations: single rotations in the LL and RR cases, double rotations in the LR and RL cases. In fact, procedure search appears here slightly simpler than in the AVL case. Clearly, the SBB-tree scheme emerges as an alternative to the AVL-balancing criterion. A performance comparison is therefore both possible and desirable.
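For experimentation, the procedure can be transcribed into Python roughly as follows. Since Python has no VAR parameters, this sketch returns the pair (new root, h) instead. The helpers height, inorder and build are assumptions added here to check the defining properties of SBB-trees; they do not appear in the original program.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = None
        self.lh = self.rh = False          # is the left / right pointer horizontal?


def search(p, x):
    """Insert x into the SBB-tree p; return (new_root, h) with h as in the text."""
    if p is None:
        return Node(x), 2
    if x < p.key:
        p.left, h = search(p.left, x)
        if h == 0:
            return p, 0
        if not p.lh:                       # vertical pointer: DEC(h)
            p.lh = (h == 2)
            return p, h - 1
        q, p.lh = p.left, False            # page split
        if q.lh:                           # LL: single rotation
            p.left, q.lh, q.right = q.right, False, p
            return q, 2
        r = q.right                        # LR: double rotation
        q.right, q.rh = r.left, False
        r.left, p.left, r.right = q, r.right, p
        return r, 2
    if x > p.key:
        p.right, h = search(p.right, x)
        if h == 0:
            return p, 0
        if not p.rh:
            p.rh = (h == 2)
            return p, h - 1
        q, p.rh = p.right, False
        if q.rh:                           # RR: single rotation
            p.right, q.rh, q.left = q.left, False, p
            return q, 2
        r = q.left                         # RL: double rotation
        q.left, q.lh = r.right, False
        r.right, p.right, r.left = q, r.left, p
        return r, 2
    return p, 0                            # key already present


def height(p):
    """Number of levels below p; fails if properties 2 or 3 are violated."""
    if p is None:
        return 0
    if p.lh:
        assert not (p.left.lh or p.left.rh), "consecutive horizontal pointers"
    if p.rh:
        assert not (p.right.lh or p.right.rh), "consecutive horizontal pointers"
    hl = height(p.left) + (0 if p.lh else 1)
    hr = height(p.right) + (0 if p.rh else 1)
    assert hl == hr, "terminal nodes on different levels"
    return hl


def inorder(p):
    return [] if p is None else inorder(p.left) + [p.key] + inorder(p.right)


def build(keys):
    root = None
    for k in keys:
        root, _ = search(root, k)
    return root


sequences = [
    [1, 2, 3, 4, 5, 6, 7],
    [5, 4, 3, 1, 2, 7, 6],
    [6, 2, 4, 1, 7, 3, 5],
    [4, 2, 6, 1, 7, 3, 5],
]
trees = [build(s) for s in sequences]
```

The four sequences are those of Fig. 4.53. In this sketch, sequence (1) yields key 4 at the root with three levels, whereas sequence (4) yields key 2 at the root with two levels and a longest search path of four nodes.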

We refrain from involved mathematical analysis and concentrate on some basic differences. It can be proven that the AVL-balanced trees are a subset of the SBB-trees. Hence, the class of the latter is larger. It follows that their path length is on the average larger than in the AVL case. Note in this connection the worst-case tree (4) in Fig. 4.53. On the other hand, node rearrangement is called for less frequently. The AVL-balanced tree is therefore preferred in those applications in which key retrievals are much more frequent than insertions (or deletions); if this quotient is moderate, the SBB-tree scheme may be preferred. It is very difficult to say where the borderline lies. It strongly depends not only on the quotient between the frequencies of retrieval and structural change, but also on the characteristics of an implementation. This is particularly the case if the node records have a densely packed representation, and if therefore access to fields involves part-word selection.

The SBB-tree has later found a rebirth under the name of red-black tree. The difference is that whereas in the case of the symmetric, binary B-tree every node contains two h-fields indicating whether the emanating pointers are horizontal, every node of the red-black tree contains a single h-field, indicating whether the incoming pointer is horizontal. The name stems from the idea to color nodes with incoming down-pointer black, and those with incoming horizontal pointer red. No two red nodes can immediately follow each other on any path. Therefore, as in the cases of the BB- and SBB-trees, every search path is at most twice as long as the height of the tree. There exists a canonical mapping from binary B-trees to red-black trees.
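The mapping from the two outgoing h-fields to a single colour per node can be illustrated by a short Python sketch. The class SBBNode and the function colours are hypothetical names chosen for this example; the tree built by hand is the final tree of sequence (4) in Fig. 4.53.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SBBNode:
    key: int
    left: Optional["SBBNode"] = None
    right: Optional["SBBNode"] = None
    lh: bool = False                       # left pointer horizontal
    rh: bool = False                       # right pointer horizontal


def colours(p, horizontal_in=False, out=None):
    """Colour each node red if its incoming pointer is horizontal, else black."""
    if out is None:
        out = {}
    if p is not None:
        out[p.key] = "red" if horizontal_in else "black"
        colours(p.left, p.lh, out)
        colours(p.right, p.rh, out)
    return out


# 2 -> 6 is horizontal; 4 has the horizontal siblings 3 and 5.
n4 = SBBNode(4, SBBNode(3), SBBNode(5), lh=True, rh=True)
n6 = SBBNode(6, n4, SBBNode(7))
root = SBBNode(2, SBBNode(1), n6, rh=True)
colouring = colours(root)
```

In the result the nodes 6, 3 and 5 are red and all others black. No red node has a red descendant, which is the same condition as the absence of two consecutive horizontal pointers.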

4.8. Priority Search Trees

Trees, and in particular binary trees, constitute very effective organisations for data that can be ordered on a linear scale. The preceding chapters have exposed the most frequently used ingenious schemes for efficient searching and maintenance (insertion, deletion). Trees, however, do not seem to be helpful in problems where the data are located not in a one-dimensional, but in a multi-dimensional space. In fact, efficient searching in multi-dimensional spaces is still one of the more elusive problems in computer science, the case of two dimensions being of particular importance to many practical applications.

Upon closer inspection of the subject, trees might still be applied usefully at least in the two-dimensional case. After all, we draw trees on paper in a two-dimensional space. Let us therefore briefly review the characteristics of the two major kinds of trees so far encountered.

1. A search tree is governed by the invariants

    p.left ≠ NIL implies p.left.x < p.x
    p.right ≠ NIL implies p.x < p.right.x

holding for all nodes p with key x. It is apparent that only the horizontal position of nodes is at all constrained by the invariant, and that the vertical positions of nodes can be arbitrarily chosen such that access times in searching (i.e. path lengths) are minimized.

2. A heap, also called priority tree, is governed by the invariants

    p.left ≠ NIL implies p.y ≤ p.left.y
    p.right ≠ NIL implies p.y ≤ p.right.y

holding for all nodes p with key y. Here evidently only the vertical positions are constrained by the invariants.

It seems straightforward to combine these two conditions in a definition of a tree organization in a two-dimensional space, with each node having two keys x and y, which can be regarded as coordinates of the node. Such a tree represents a point set in a plane, i.e. in a two-dimensional Cartesian space; it is therefore called Cartesian tree [4-9].
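A small Python sketch may help to see how the two conditions interact. It tests whether a given tree satisfies the search-tree condition on x and the heap condition on y at the same time. The names Point and is_cartesian are assumptions of this example. The x-condition is checked against all ancestors by passing bounds downward, which is somewhat stronger than the parent-descendant form in which the invariants are written above.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Point:
    x: int
    y: int
    left: Optional["Point"] = None
    right: Optional["Point"] = None


def is_cartesian(p, lo=-math.inf, hi=math.inf, ymin=-math.inf):
    """True if x obeys the search-tree order and y the heap order below p."""
    if p is None:
        return True
    if not (lo < p.x < hi and ymin <= p.y):
        return False
    return (is_cartesian(p.left, lo, p.x, p.y)
            and is_cartesian(p.right, p.x, hi, p.y))


good = Point(5, 1, Point(2, 3, None, Point(4, 6)), Point(8, 2))
bad_y = Point(5, 4, Point(2, 3), Point(8, 5))     # left descendant lies above the root
bad_x = Point(5, 1, Point(2, 3, None, Point(6, 7)), Point(8, 2))  # 6 is not left of 5
results = [is_cartesian(t) for t in (good, bad_y, bad_x)]
```

Only the first of the three trees passes. Once the coordinates of the points are fixed, the conditions leave little freedom in the shape of the tree.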
We prefer the term priority search tree, because it shows that this structure emerged from a combination of the priority tree and the search tree. It is characterized by the following invariants holding for each node p:

    p.left ≠ NIL implies (p.left.x < p.x) & (p.y ≤ p.left.y)
    p.right ≠ NIL implies (p.x < p.right.x) & (p.y ≤ p.right.y)

It should come as no big surprise, however, that the search properties of such trees are not particularly wonderful. After all, a considerable degree of freedom in positioning nodes has been taken away and is no longer available for choosing arrangements yielding short path lengths. Indeed, no logarithmic bounds on efforts involved in searching, inserting, or deleting elements can be assured. Although this had already been the case for the ordinary, unbalanced search tree, the chances for good average behaviour are slim. Even worse, maintenance operations can become rather unwieldy. Consider, for example, the tree of Fig. 4.54 (a). Insertion of a new node C whose coordinates force it to be inserted above and between A and B requires a considerable effort transforming (a) into (b).

McCreight discovered a scheme, similar to balancing, that, at the expense of a more complicated insertion and deletion operation, guarantees logarithmic time bounds for these operations. He calls that structure a priority search tree [4-10]; in terms of our classification, however, it should be called a balanced priority search tree. We refrain from discussing that structure, because the scheme is very intricate and in practice hardly used. By considering a somewhat more restricted, but in practice no less relevant problem, McCreight arrived at yet another tree structure, which shall be presented here in detail. Instead of assuming that the search space be unbounded, he considered the data space to be delimited by a rectangle with two sides open. We denote the limiting values of the x-coordinate by xmin and xmax.

In the scheme of the (unbalanced) priority search tree outlined above, each node p divides the plane into two parts along the line x = p.x. All nodes of the left subtree lie to its left, all those in the right subtree to its right. For the efficiency of searching this choice may be bad. Fortunately, we may choose the dividing line differently. Let us associate with each node p an interval [p.L .. p.R), ranging over all x values including p.L up to but excluding p.R. This shall be the interval within which the x-value of the node may lie. Then we postulate that the left descendant (if any) must lie within the left half, the right descendant within the right half of this interval.
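The successive intervals met on the way down to a given x-value can be listed by a few lines of Python. The function name intervals and the decision to stop at an interval of width 1 are assumptions of this sketch.

```python
def intervals(x, xmin, xmax):
    """Return the intervals [L, R) visited when descending towards x."""
    assert xmin <= x < xmax
    L, R = xmin, xmax
    path = [(L, R)]
    while R - L > 1:
        M = (L + R) // 2          # midpoint of the current interval
        if x < M:
            R = M                 # continue in the left half
        else:
            L = M                 # continue in the right half
        path.append((L, R))
    return path


path_5 = intervals(5, 0, 8)       # [(0, 8), (4, 8), (4, 6), (5, 6)]
steps_9 = len(intervals(9, 0, 10)) - 1
```

With xmin = 0 and xmax = 8 the descent towards x = 5 takes three steps. Each step splits the current interval at its midpoint, independently of the x-values of the nodes stored in the tree.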
Hence, the dividing line is not p.x, but (p.L+p.R)/2. For each descendant the interval is halved, thus limiting the height of the tree to log(xmax-xmin). This result holds only if no two nodes have the same x-value, a condition which, however, is guaranteed by the invariant (4.90). If we deal with integer coordinates, this limit is at most equal to the wordlength of the computer used. Effectively, the search proceeds like a bisection or radix search, and therefore these trees are called radix priority search trees [4-10]. They feature logarithmic bounds on the number of operations required for searching, inserting, and deleting an element, and are governed by the following invariants for each node p:

    p.left ≠ NIL implies (p.L ≤ p.left.x < p.M) & (p.y ≤ p.left.y)
    p.right ≠ NIL implies (p.M ≤ p.right.x < p.R) & (p.y ≤ p.right.y)

where

    p.M = (p.L + p.R) DIV 2
    p.left.L = p.L
    p.left.R = p.M
    p.right.L = p.M
    p.right.R = p.R

for all nodes p, and root.L = xmin, root.R = xmax. A decisive advantage of the radix scheme is that maintenance operations (preserving the invariants under insertion and deletion) are confined to a single spine of the tree, because the dividing lines have fixed values of x irrespective of the x-values of the inserted nodes.

Typical operations on priority search trees are insertion, deletion, finding an element with the least (largest) value of x (or y) larger (smaller) than a given limit, and enumerating the points lying within a given rectangle. Given below are procedures for inserting and enumerating. They are based on the following type declarations:

TYPE Node = POINTER TO RECORD
    x, y: INTEGER;
    left, right: Node
  END

Notice that the attributes xL and xR need not be recorded in the nodes themselves. They are rather computed during each search. This, however, requires two additional parameters of the recursive procedure insert. Their values for the first call (with p = root) are xmin and xmax respectively. Apart from this, a search proceeds similarly to that of a regular search tree. If an empty node is encountered, the element is inserted. If the node to be inserted has a y-value smaller than the one being inspected, the new node is exchanged with the inspected node. Finally, the node is inserted in the left subtree, if its x-value is less than the middle value of the interval, or the right subtree otherwise.

PROCEDURE insert(VAR p: Node; X, Y, xL, xR: INTEGER);
  VAR xm, t: INTEGER;
BEGIN
  IF p = NIL THEN (*not in tree, insert*)
    NEW(p); p.x := X; p.y := Y; p.left := NIL; p.right := NIL
  ELSIF p.x = X THEN (*found; don't insert*)
  ELSE
    IF p.y > Y THEN (*exchange new node with inspected node*)
      t := p.x; p.x := X; X := t;
      t := p.y; p.y := Y; Y := t
    END ;
    xm := (xL + xR) DIV 2;
    IF X < xm THEN insert(p.left, X, Y, xL, xm)
    ELSE insert(p.right, X, Y, xm, xR)
    END
  END
END insert

The task of enumerating all points x,y lying in a given rectangle, i.e. satisfying x0 ≤ x < x1 and y ≤ y1 is accomplished by the following procedure enumerate. It calls a procedure report(x,y) for each point found. Note that one side of the rectangle lies on the x-axis, i.e. the lower bound for y is 0. This guarantees that enumeration requires at most O(log(N) + s) operations, where N is the cardinality of the search space in x and s is the number of nodes enumerated.

PROCEDURE enumerate(p: Node; x0, x1, y, xL, xR: INTEGER);
  VAR xm: INTEGER;
BEGIN
  IF p # NIL THEN
    IF (p.y