GGPerf: A Perfect Hash Function Generator

Jiejun KONG

June 30, 1997

Contents

1 Introduction
  1.1 Minimal Perfect Hash Function
  1.2 Generators and Scripting
2 Description
  2.1 Algorithm
  2.2 Explanation of the Algorithm
  2.3 Input Format
  2.4 Output Format
  2.5 Options
3 Examples
4 Comparison between GPerf and GGPerf
5 Discussion: A Solution for Very Large Inputs

List of Figures

A The acyclic graph with its g values

Abstract

This program is a yet-another-program-program. It reads in primary keys with relevant specifications, then generates a high level language program. The resulting program is a minimal perfect hash function that can retrieve information in only one probe.

Keys yet-another-program-program, hash function, perfect hash function, minimal perfect hash function, cyclic graph, acyclic graph


Perfect Hash Function Generator

1 Introduction

ggperf is a minimal perfect hash function generator utility written in Java. For basic concepts and algorithms, please read Sections 1.1, 1.2, 2.1, and 2.2. If you merely care about the program's input and output, just read Section 2.3 and later.

1.1 Minimal Perfect Hash Function

In a database holding a set of records, every record is associated with a primary key, which is used to retrieve the record. The primary key of each record should be unique. If we use hashing techniques to compute the correspondence between a primary key and the address of the record associated with that key, the database is called a hash table.

Let K be a set of x primary keys, each of which is a finite string of symbols over an ordered alphabet $\Sigma$. A hash function maps a primary key into some given interval of integers I, say $[0..m-1]$ with $m \ge x$, where m is an integer. Every integer of I indicates the address where the record associated with the primary key is stored.

\[ h : K \to I, \quad |K| \le |I|. \]

If multiple keys $k_0, k_1, \ldots, k_s \in K$ are mapped into the same integer i, a situation called a collision arises. The keys $k_0, k_1, \ldots, k_s$ are called synonyms. To resolve the collision, we need to relocate s of these s + 1 records to other locations and leave just one record in location i. Therefore a hash function only yields a probable address.

In contrast to a hash function, a perfect hash function is an injection

\[ p : K \to I, \quad |K| \le |I|. \]

A perfect hash function uniquely transforms each key of K into an address in the hash table without any collision. It therefore yields a definite address instead of a probable address, and only a single probe is needed to retrieve every record. In a nutshell, perfect hash functions imply time efficiency. If $|K| = |I|$, that is, x = m, then p is a minimal perfect hash function.

\[ p : K \to I, \quad |K| = |I|. \]

For a minimal perfect hash function, the hash table occupies as little storage as possible, so it is also the most space efficient.
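The three cases above (a hash function with collisions, a perfect hash function, and a minimal perfect hash function) can be checked mechanically for any candidate function. The following Python sketch is only an illustration: the helper name `classify` and the toy key set are made up for this example and are not part of ggperf, and the keys are assumed to be distinct.

```python
def classify(h, keys, m):
    """Classify h: keys -> [0..m-1] as 'not perfect', 'perfect' or 'minimal perfect'."""
    addresses = [h(k) for k in keys]
    if any(not 0 <= a < m for a in addresses):
        raise ValueError("address outside [0..m-1]")
    if len(set(addresses)) < len(keys):
        return "not perfect"          # at least two synonyms share an address
    # h is an injection; it is minimal when the table has no spare slots
    return "minimal perfect" if m == len(keys) else "perfect"


keys = ["a", "bb", "ccc"]                            # hypothetical key set
print(classify(lambda k: len(k) % 2, keys, 3))       # "a" and "ccc" collide
print(classify(lambda k: len(k), keys, 4))           # injective, one unused slot
print(classify(lambda k: len(k) - 1, keys, 3))       # injective, no unused slot
```

The check costs one pass over the keys, which is also how a generated function can be validated after the fact.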

1.2 Generators and Scripting

Here the word `generator' means that ggperf is a yet-another-program-program. Like the famous programs yacc, lex, and gperf, ggperf outputs another program instead of object code. The resulting program is written in a high level programming language, and is normally highly configurable and portable, especially compared with object code. This kind of programming is also called scripting, when the concept of programming language covers all kinds of computer languages. An example would be the popular HTML scripting.

Unfortunately, or maybe fortunately, by writing the ggperf program I found that Java is a programming language good at scripting, but not a language good at being scripted. On the one hand, Java's string operations are strong enough to handle all kinds of primitive scripting operations. On the other hand, Java has so many kinds of syntactical elements and rules that both printing those elements and verifying those rules are unlikely to be handled easily. Assuming that Java is not the last member of the programming language family, and that scripting may have significant influence on future computer programming, my suggestion is that self-scripting could be one of the criteria for designing a new programming language.

Like yacc, lex, and gperf, ggperf divides its input into two parts. One part is the data that needs to be processed. The other part is transcribed verbatim to the output. ggperf will not paraphrase the verbatim part; it is the user's responsibility to ensure that the code contained in the verbatim part is valid.

2 Description

The perfect hash function generator ggperf reads a set of "keywords" from an input "key file". It attempts to derive a minimal perfect hash function that recognizes a member of the "static keyword set" with at most a single probe into the lookup table. If ggperf succeeds in generating such a function, it produces a pair of source code routines in a high level programming language (currently Java or C) that perform hashing and table lookup recognition. All generated code is directed to the output file. Command-line options described in Section 2.5 allow you to modify the input and output format of ggperf.

2.1 Algorithm

In [Czech92], Z. Czech, G. Havas, and B. Majewski have presented an optimal algorithm for generating minimal perfect hash functions based on random graphs. In Czech's algorithm, the minimal perfect hash functions are of the form

\[ h(k) = (g(f_1(k)) + g(f_2(k))) \bmod m, \]

where $f_1$ and $f_2$ are functions that map strings into integers, and g is a function that maps integers into $[0..m-1]$. The authors claim that the expected time complexity of the algorithm is O(m), while its space complexity is O(m log m). To illustrate Czech's algorithm, I use it here to create a perfect hash function with the 12 month names as primary keys, so that the i-th month, $i \in [1..12]$, is kept in the (i-1)-th location of the hash table.

     a  b  c  d  e  f  g  h  i  j  l  m  n  o  p  r  s  t  u  v  y
1   17  -  - 10  - 22  -  -  - 11  -  1 24 10  -  -  0  -  -  -  -
2   22  - 10  - 11  -  -  -  -  -  -  -  - 19  6  -  -  -  1  -  -
3    - 20 21  -  -  -  0  -  -  -  8  -  2  - 20 10  - 24  - 16  6
4    -  - 24  - 16  -  -  - 15  -  -  -  - 24  -  -  - 16  8  - 24
5    0 15  -  - 23  -  - 12  -  -  7  6  -  -  - 13 20  -  -  -  -
6    -  0  -  - 14  -  -  -  -  -  - 20  -  -  - 12  -  8  6  -  -
7   23  1  -  - 11  -  -  -  -  -  -  -  -  -  - 22  -  -  -  -  4
8    -  -  -  - 18  -  -  -  -  -  -  -  -  -  - 21  -  -  -  -  -
9    -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  6  -  -  -  - 17

Table I: Table T1 (rows are character positions, columns are characters; - marks an unused entry)

According to the mathematical description given before, we are going to generate a perfect hash function

\[ p : K \to [0..11], \]

where K = {january, february, march, april, may, june, july, august, september, october, november, december} [1]. That is to say, in this example, x = m = 12.

At first, Czech's algorithm generates two tables. Each item of a table is a randomly generated integer value. The value at row i, column char of a table is the integer to which the i-th character of a primary key is converted when that character is char. Each integer value lies in $[0..n-1]$, where $n = c \cdot m$ for some constant c. The recommended value of c is $c \in [2..10]$ [2]. In this example, c is $2\frac{1}{12}$ as in [Czech92]. Therefore every integer value in the tables is generated by a random integer value generator mod 25. Table I and Table II show the two randomly generated tables.

For a $k \in K$, f(k) is computed by adding the integer values corresponding to all characters of k, i.e.

\[ f(k) = \left( \sum_{i=1}^{|k|} T(i, k_i) \right) \bmod n, \]

where $k_i$ indicates the i-th character of k. $f_1$ is computed by looking up the values in table T1, while $f_2$ is computed by looking up the values in table T2. For example, f1(january) = (11 + 22 + 2 + 8 + 0 + 12 + 4) mod 25 = 9 and f2(january) = (3 + 4 + 14 + 11 + 7 + 3 + 21) mod 25 = 13, and so on. The result of every f function determines a node in an intermediate graph. Figure A shows the intermediate graph. During the construction of the intermediate graph, adding a new edge may close a loop; in other words, the graph becomes cyclic.

[1] This example is different from the one supplied in [Czech92] although it seems to be the same.
[2] Please refer to Section 2.2 for the reason why c should be greater than 2.
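The mapping step described so far (two random tables, the two f values of every key, and the test for a cyclic graph) can be sketched in Python as follows. This is an illustration under stated assumptions, not ggperf's code: the function names, the default c = 3, the fixed seed, the retry limit, and the union-find cycle test are all choices made for this sketch.

```python
import random


def make_table(keys, n, rng):
    """Random table T: one integer in [0..n-1] per (position, character) pair in use."""
    table = {}
    for key in keys:
        for i, ch in enumerate(key, start=1):
            if (i, ch) not in table:
                table[(i, ch)] = rng.randrange(n)
    return table


def f(table, key, n):
    """f(k) = (sum over i of T(i, k_i)) mod n."""
    return sum(table[(i, ch)] for i, ch in enumerate(key, start=1)) % n


def is_acyclic(edges, n):
    """Union-find cycle test; self-loops and repeated edges count as cycles."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False                    # this edge would close a loop
        parent[ra] = rb
    return True


def map_keys(keys, c=3.0, seed=0, max_tries=1000):
    """Draw fresh tables until the graph over all keys is acyclic."""
    n = int(c * len(keys))
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        t1 = make_table(keys, n, rng)
        t2 = make_table(keys, n, rng)
        edges = [(f(t1, k, n), f(t2, k, n)) for k in keys]
        if is_acyclic(edges, n):
            return t1, t2, edges, attempt
    raise RuntimeError("no acyclic graph found; try a larger c")


MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
t1, t2, edges, attempts = map_keys(MONTHS)
print("passes needed:", attempts)
print("edges (f1, f2):", edges)
```

Each key contributes one edge whose endpoints are its two f values, so the cycle test only needs the list of endpoint pairs.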

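     a  b  c  d  e  f  g  h  i  j  l  m  n  o  p  r  s  t  u  v  y
1   13  -  - 17  - 19  -  -  -  3  - 20 22  2  -  -  9  -  -  -  -
2    4  -  5  -  6  -  -  -  -  -  -  -  -  7  3  -  -  -  9  -  -
3    - 11  0  -  -  - 11  -  -  - 20  - 14  -  1 19  - 11  -  3 13
4    -  -  1  - 12  -  -  - 20  -  -  -  - 22  -  -  - 16 11  - 10
5    7 10  -  - 10  -  -  8  -  -  7 21  -  -  -  3 17  -  -  -  -
6    - 20  -  - 11  -  -  -  -  -  - 10  -  -  -  3  - 20 21  -  -
7    3 24  -  - 19  -  -  -  -  -  -  -  -  -  - 18  -  -  -  - 21
8    -  -  -  - 13  -  -  -  -  -  -  -  -  -  - 14  -  -  -  -  -
9    -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  3  -  -  -  -  6

Table II: Table T2 (rows are character positions, columns are characters; - marks an unused entry)

[Figure A: The acyclic graph with its g values. Each edge is labeled with a key and its hash value: january 0, feberuary 1, march 2, april 3, may 4, june 5, july 6, august 7, september 8, october 9, november 10, december 11. The g values of the nodes are:]

node   2  4  5  6  9 12 13 15 17 18 19 20 21 24
g     11  6  5  1  0 10  0  5  3 10  3  1 11  0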

In these cases, the whole process is restarted until we get an acyclic graph over all keys. One may ask whether it is practical to obtain such an acyclic graph within a reasonable number of passes of table assignment. In fact, Czech's algorithm depends on the probability of cyclic graphs among all possible graphs. Increasing the value of c dramatically decreases that probability. Since the range of Java's long integer is so large that c can be assigned a very large value, the answer to this question is positive. Experience shows that one pass is expected when c is greater than 10. Please refer to Section 2.2 about this problem.

Having obtained the acyclic graph, we continue by labeling the graph: every node is assigned an integer value, which is the result of function g for that node. In this step, for every component of the graph, we select a node $v_a$ and assign 0 to it. After that, every neighbor of the node is labeled using either breadth-first search or depth-first search. The integer value of a neighbor $v_b$ is determined by the formula

\[ g(v_b) = (w_{ab} - g(v_a)) \bmod m, \]

where m = 12 and $w_{ab}$ denotes the weight of the edge from $v_a$ to $v_b$. Figure A also shows the g values of the intermediate acyclic graph.

Having obtained T1, T2, and g, the generation phase of the algorithm ends. When we want to know the address of a key, say "june", we compute

\[ f_1(\text{june}) = (11 + 1 + 2 + 16) \bmod 25 = 5, \qquad f_2(\text{june}) = (3 + 9 + 14 + 12) \bmod 25 = 13. \]

Then the hash table address of "june" is (g(5) + g(13)) mod 12 = 5. Now T1, T2, and g are all we need to retrieve a key.
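The labeling and lookup steps can be replayed on the month example. The Python sketch below is an illustration, not ggperf's code: the table values are transcribed from Tables I and II as best they can be read, so treat them as an assumption, and the helper names (`rows`, `label`, `h`) are made up here. Note that the tables only reproduce the addresses 0 to 11 when the second key is spelled "feberuary", as it is in Figure A. Roots are picked in key order (the f1 node of the first key whose component is still unlabeled), which is one possible reading of "select a node and assign 0 to it".

```python
from collections import deque

KEYS = ["january", "feberuary", "march", "april", "may", "june", "july",
        "august", "september", "october", "november", "december"]
N, M = 25, 12                      # n = 25 table values, m = 12 addresses


def rows(*specs):
    """Parse strings like 'a17 d10' into one {char: value} dict per position."""
    return [{tok[0]: int(tok[1:]) for tok in spec.split()} for spec in specs]


# Transcribed from Table I and Table II (assumption: transcription is correct).
T1 = rows("a17 d10 f22 j11 m1 n24 o10 s0", "a22 c10 e11 o19 p6 u1",
          "b20 c21 g0 l8 n2 p20 r10 t24 v16 y6", "c24 e16 i15 o24 t16 u8 y24",
          "a0 b15 e23 h12 l7 m6 r13 s20", "b0 e14 m20 r12 t8 u6",
          "a23 b1 e11 r22 y4", "e18 r21", "r6 y17")
T2 = rows("a13 d17 f19 j3 m20 n22 o2 s9", "a4 c5 e6 o7 p3 u9",
          "b11 c0 g11 l20 n14 p1 r19 t11 v3 y13", "c1 e12 i20 o22 t16 u11 y10",
          "a7 b10 e10 h8 l7 m21 r3 s17", "b20 e11 m10 r3 t20 u21",
          "a3 b24 e19 r18 y21", "e13 r14", "r3 y6")


def f(table, key):
    """Sum the table entries of all characters of key, mod n."""
    return sum(table[i][ch] for i, ch in enumerate(key)) % N


def label(edges):
    """Breadth-first labeling: g(vb) = (w_ab - g(va)) mod m, each root gets 0."""
    adj = {}
    for a, b, w in edges:
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    g = {}
    for root, _, _ in edges:       # candidate roots in key order
        if root in g:
            continue
        g[root] = 0
        queue = deque([root])
        while queue:
            va = queue.popleft()
            for vb, w in adj[va]:
                if vb not in g:
                    g[vb] = (w - g[va]) % M
                    queue.append(vb)
    return g


# One edge per key: endpoints f1, f2; weight = desired address (month index).
edges = [(f(T1, k), f(T2, k), i) for i, k in enumerate(KEYS)]
g = label(edges)


def h(key):
    """h(k) = (g(f1(k)) + g(f2(k))) mod m."""
    return (g[f(T1, key)] + g[f(T2, key)]) % M


for key in KEYS:
    print(key, f(T1, key), f(T2, key), h(key))
```

Because the graph is a forest, the labeling is fully determined once the root of each component is fixed, so a different root choice gives different g values but the same addresses.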

2.2 Explanation of the Algorithm

Czech's algorithm exploits a property of weighted undirected graphs. In a weighted undirected graph, every edge links two nodes. We may assign an appropriate integer value to each of these two nodes and derive the weight of the edge from the two values. Now if we map a primary key into two integer values representing two nodes in a weighted undirected graph, and let the hash value be the weight of the edge linking the two nodes, then we get the idea of the algorithm. In this situation, T1 and T2 are introduced to map a primary key into two integer values representing two nodes. A labeling process is introduced to map the two integer values to an appropriate hash value. To achieve the goal, the weighted undirected graph should have no loops; otherwise the labeling process would fail. In other words, the graph should be acyclic.

At first sight, one may question the step that maps keys onto the graph. Since the integer values in T1 and T2 are randomly generated, there is no guarantee that the graph will be acyclic. If the graph turned out to be cyclic again and again, the algorithm would not be practical for a real-time application. In the following text I want to show that this case will not happen.

Given a set V of n distinct nodes and a set E of m edges, each of which links two nodes in V, the question is how many different undirected graphs there are, and how many of them are acyclic.

There are in total $n^{2m}$ different undirected graphs, including those with loops and self-loops. The result can be derived in this way: at the beginning, each node is unlinked. To add an edge, there are n possibilities to choose one vertex and n possibilities to choose another vertex, and the pattern repeats m times.

There is a simple way to construct acyclic undirected graphs: at the beginning, each node is unlinked. Whenever we add a new edge to an acyclic undirected graph, we always choose a node that does not belong to the nodes already linked to an edge. There is no way to generate cyclic graphs by this means. By induction we have

1. For adding the first edge, there are $C_n^1$ possibilities to choose a node and $C_{n-1}^1$ possibilities to choose another node, preventing a self-loop.
2. Having added k edges, for adding the (k+1)-th edge, there are at least $C_{n-2k}^1$ possibilities to choose a node that does not belong to the nodes already linked to the former k edges (because there are at most 2k such nodes), and $C_{n-1}^1$ possibilities to choose another node, preventing a self-loop.

Since one factor arises for each of the m edges, the lower bound is

\[ \prod_{k=0}^{m-1} \left( C_{n-1}^1 \, C_{n-2k}^1 \right) = (n-1)^m \cdot \prod_{k=0}^{m-1} (n-2k). \]

Therefore the probability P of generating an acyclic graph satisfies

\[ P \ge \frac{(n-1)^m \cdot \prod_{k=0}^{m-1} (n-2k)}{n^{2m}}. \]

When $n \gg 1$, the bound is approximately

\[ P \gtrsim \frac{\prod_{k=0}^{m-1} (n-2k)}{n^m}. \]

From here we know why n > 2m, that is, why the appropriate value for c should be greater than 2. When $n \gg 2m$, the probability is nearly 1, thus the randomness of generating the graph is actually not a problem.
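To see how the lower bound behaves as n grows, it can be evaluated exactly. The following Python sketch assumes the product runs over k = 0, ..., m-1 (one factor per edge); the function name `acyclic_lower_bound` is hypothetical, and the bound is reported as 0 whenever some factor (n - 2k) is not positive, since the counting argument says nothing in that case.

```python
from fractions import Fraction


def acyclic_lower_bound(n, m):
    """Evaluate (n-1)^m * prod_{k=0}^{m-1} (n-2k) / n^(2m) as an exact fraction."""
    if n <= 2 * (m - 1):
        return Fraction(0)          # a factor (n - 2k) is zero or negative
    bound = Fraction(n - 1, n) ** m
    for k in range(m):
        bound *= Fraction(n - 2 * k, n)
    return bound


m = 12                              # the twelve month names
for c in (Fraction(25, 12), 3, 10, 100):
    n = int(c * m)
    bound = float(acyclic_lower_bound(n, m))
    print(f"c = {float(c):7.3f}  n = {n:5d}  lower bound = {bound:.6f}")
```

The bound counts only one simple family of acyclic graphs, so it is very pessimistic when n is close to 2m; it becomes informative only once n is much larger than 2m.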

2.3 Input Format

The input format of ggperf is similar to that of the UNIX utilities lex and yacc. Here's an outline of the general format:

    declarations
    %%
    keywords
    %%
    functions
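A reader of this format first has to cut the input at the `%%` lines. The Python sketch below shows one way to do that; it is not ggperf's parser. The function name `split_sections` is hypothetical, a separator is assumed to be a line consisting of `%%` starting in the first column, and `%{ ... %}` blocks are not interpreted here.

```python
def split_sections(text):
    """Split ggperf-style input into (declarations, keywords, functions) line lists."""
    parts = [[]]
    for line in text.splitlines():
        # Only the first two %% lines separate sections; later ones are verbatim text.
        if line.rstrip() == "%%" and len(parts) < 3:
            parts.append([])
        else:
            parts[-1].append(line)
    if len(parts) < 2:
        raise ValueError("missing '%%' line between declarations and keywords")
    declarations, keywords = parts[0], parts[1]
    functions = parts[2] if len(parts) == 3 else []   # third section is optional
    return declarations, keywords, functions


sample = '(months ("String" key))\n%%\njanuary\nfebruary\n%%\nint extra() { return 0; }\n'
declarations, keywords, functions = split_sections(sample)
print(declarations, keywords, functions)
```

An input that starts directly with a `%%` line yields an empty declarations list, which matches the "at least a `%%` line" requirement for the first section.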

The declarations section and the keywords section are mandatory; only the functions section is optional. The following sections describe the input format for each section.

Declarations

The keyword input file contains a section for including arbitrary high level programming language declarations and definitions, as well as provisions for providing a user-supplied structure which specifies all associated attributes besides primary keys. The structure declaration is in the format of a Lisp symbolic expression:

    (Structure-Name
      ("Member-Type" Member-Name)
      ("Member-Type" Member-Name)
      ...
    )

Here is a simple example, using months of the year and their attributes as input:

    (months
      ("String" key)
      ("int" number)
      ("int" days)
      ("int" leap_days)
    )
    %%
    january, 1, 31, 31
    february, 2, 28, 29
    march, 3, 31, 31
    april, 4, 30, 30
    may, 5, 31, 31
    june, 6, 30, 30
    july, 7, 31, 31
    august, 8, 31, 31
    september, 9, 30, 30
    october, 10, 31, 31
    november, 11, 30, 30
    december, 12, 31, 31
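The structure declaration and the keyword lines of the example above are simple enough to parse with regular expressions. The Python sketch below is an illustration only: the names `parse_structure` and `parse_record` are hypothetical, identifiers are assumed to follow the usual C/Java rules, and keyword lines are split on plain commas without handling escape sequences.

```python
import re

MEMBER = r'\(\s*"([^"]*)"\s+([A-Za-z_]\w*)\s*\)'   # ("Member-Type" Member-Name)


def parse_structure(decl):
    """Parse '(Name ("Type" member) ...)' into (name, [(type, member), ...])."""
    match = re.fullmatch(r'\s*\(\s*([A-Za-z_]\w*)(.*)\)\s*', decl, re.S)
    if not match:
        raise ValueError("expected (Structure-Name ...)")
    name, body = match.group(1), match.group(2)
    members = re.findall(MEMBER, body)
    if re.sub(MEMBER, "", body).strip():
        raise ValueError("unexpected text inside the structure declaration")
    return name, members


def parse_record(line, members):
    """Split one keyword line into {member name: field text}."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != len(members):
        raise ValueError("number of fields differs from number of members")
    return dict(zip((member for _, member in members), fields))


DECL = '(months ("String" key) ("int" number) ("int" days) ("int" leap_days) )'
name, members = parse_structure(DECL)
print(name, members)
print(parse_record("february, 2, 28, 29", members))
```

Keeping the member type as an opaque quoted string is what lets composite types such as "char []" pass through untouched.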


The reason why Member-Types should be enclosed in double quotes is that in Java and C a type specification can be composite, as in "char []". Therefore, I chose double quotes as the delimiter to detect its bounds. Meanwhile, Structure-Name and Member-Names are merely identifiers, so blank characters act as their delimiters.

Member-Type declarations should conform to the output high level language. Currently only C and Java are supported. Therefore, if a Member-Type is actually a string, "String" is used when the output is a Java program, and "char *" is used when the output is a C program. Structure-Name will be converted to a struct name in C or a class name in Java. The type of the first member declaration, which represents the primary keyword, must be "String" in Java or "char *" in C.

Separating the structure declaration from the list of keywords and other fields is a pair of consecutive percent signs, `%%', appearing left justified in the first column, as in the UNIX utility lex.

Using a syntax similar to the GNU utilities flex and bison, it is possible to directly include high level language source text and comments verbatim in the generated output file. This is accomplished by enclosing the region inside left-justified surrounding `%{', `%}' pairs. Here is an input fragment based on the previous example that illustrates this feature:

    %{
    /* This section of code is inserted directly into the output. */
    #include
    %}
    (months
      ("char *" key)
      ("int" number)
      ("int" days)
      ("int" leap_days)
    )
    %%
    january, 1, 31, 31
    february, 2, 28, 29
    march, 3, 31, 31
    ...

It is possible to have an empty declaration section, e.g.:

    %%
    january,  1, 31, 31
    february, 2, 28, 29
    march,    3, 31, 31
    april,    4, 30, 30
    ...

Keywords

The second key file format section contains lines of keywords and any associated attributes. A line beginning with `#' in the first column is considered a comment. Everything following the `#' is ignored, up to and including the following newline. The first field of each non-comment line is always the key itself. It may include any valid ASCII character except backspace, tab, space, newline, carriage return, formfeed, double quote, single quote, backslash, and comma. If one of these is needed in the key, use an escape sequence led by a backslash instead. In this context, a field is considered to extend up to, but not include, the first blank, comma, or newline.

Additional fields may optionally follow the leading key. Fields should be separated by commas, and terminate at the end of the line. What these fields mean is entirely up to the user; they are used to initialize the elements of the user-defined structure provided in the declaration section. The number of attributes should be identical to the number of members of the user-defined structure.

Additional Functions or Java Methods

The optional third section also corresponds closely with conventions found in yacc and lex. All text in this section, starting at the final `%%' and extending to the end of the input file, is included verbatim in the generated output file. Naturally, it is the user's responsibility to ensure that the code contained in this section is valid C or Java. If the output is in Java, this section is used to output ad hoc methods; therefore the user should supply valid individual methods.

2.4 Output Format

Two Java methods (or C functions) are generated. They are called `hash' and `in_word_set', although you may modify the name of `in_word_set' with a command-line option. Both functions require the key to be the only argument.

    C functions               Java methods
    hash(char *key)           hash(String key)
    in_word_set(char *key)    in_word_set(String key)

By default, the generated `hash' method/function returns an integer value that indexes into an associated values table stored in a local static array. The associated values table is constructed internally by ggperf and later output as an array. By default, the lookup `in_word_set' method/function returns a key with its associated attributes. The `-returnboolean' option can be used to let `in_word_set' return a Boolean value telling whether the string being looked up is one of the keys. The names of both methods/functions can be changed by the options described in Section 2.5.
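The division of labor between the two generated routines can be pictured with a small Python analogue. This is a sketch under assumptions, not ggperf output: `hash_key` is a hand-made stand-in for a generated minimal perfect hash over three toy keys, `WORDLIST` and the `return_boolean` parameter mimic the `wordlist' array and the `-returnboolean' option, and the final string comparison is assumed to be how non-keys are rejected.

```python
# Slot i holds the key whose hash value is i, together with its attribute.
WORDLIST = [("ccc", 3), ("a", 1), ("bb", 2)]


def hash_key(key):
    """Stand-in minimal perfect hash for the three toy keys (not generated code)."""
    return len(key) % 3


def in_word_set(key, return_boolean=False):
    """Single probe: hash once, then compare with the key stored in that slot."""
    entry = WORDLIST[hash_key(key)]
    found = entry[0] == key          # non-keys hash somewhere too, so compare
    if return_boolean:
        return found
    return entry if found else None


print(in_word_set("bb"))
print(in_word_set("zz"))
print(in_word_set("a", return_boolean=True))
```

The comparison matters because a perfect hash function is only guaranteed to be collision-free on the static key set; any other string still maps to some slot.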


The `-printstructure' option is an important option when there are associated attributes. Using this option saves you the time of writing your own declarations. In Java output, this option prints a complete class including all attribute variables and their accessors. In C output, this option simply prints a C struct.

ggperf also requires all keys to be stored somewhere in the resulting program. In C output, this is done by array initialization. In Java output, this is done by calling a method named `init()' in the `main' method.

2.5 Options

-C=float
    Set the C value to float; please refer to the algorithm section for the meaning of C.

-java
    Generate the perfect hash function in Java.

-ansic
    Generate ANSI C code. This is the default.

-gnuc
    Assume a GNU C compiler for C output. This makes all generated routines use the "inline" keyword to remove the cost of function calls.

-printstructure
    When there are associated attributes, print the Java class or C struct declaration of those attributes automatically. Otherwise the user must supply valid declarations.

-global
    Make the array `wordlist' public. `wordlist' holds all keys with their associated attributes, if there are any.

-returnboolean
    Let the lookup `in_word_set' method/function return a Boolean instead of keys. The Boolean value tells whether the string being looked up is in `wordlist' or not. The default is to return keys with their associated attributes, if there are any.

-hash_name=name
    Change the hash method/function name from `hash' to name.

-lookup_name=name
    Change the lookup method/function name from `in_word_set' to name.

-wordlist_name=name
    Change the name of the array holding keys from `wordlist' to name.


3 Examples

There are several examples supplied in the `test' subdirectory. Of those examples, `c.ggperf' and `java.ggperf' are typical for C output and Java output, respectively. When you use your own input to test the ggperf program, don't forget to prepend a declaration section, at least a `%%' line, since this section is mandatory in ggperf.

4 Comparison between GPerf and GGPerf

GNU already has a similar generator named gperf, which is included in GNU's libg++ utilities [3]. The input and output format of gperf was helpful in designing ggperf. However, ggperf is totally different from gperf in algorithm and implementation. Its performance is also much better than that of gperf. The name ggperf actually means Greater than GPERF, not only because ggperf can handle much larger input than gperf, but also because the resulting program of ggperf is only a little bigger than that of gperf.

The major reason why ggperf performs better is that the algorithm used by gperf is out of date. When the number of input keys is greater than 100, gperf becomes unbearably slow and liable to crash. ggperf is robust even when the number of input keys becomes very large. Because Java's integer is 32-bit, the upper limit on the number of input keys is really not a significant problem.

Due to the non-linear complexity of Czech's algorithm, ggperf becomes slower as the number of input keys increases. On a SPARCstation 4 system running Solaris 2.4 with 64 MB of main memory, ggperf can generate a perfect hash function for the top 100 words of /usr/dict/words in 10 seconds, the top 1,000 words in 40 seconds, the top 2,000 words in 150 seconds, and the top 3,000 words in 400 seconds. In any case, GNU's gperf cannot achieve this result.

Part of the poor performance of ggperf on very large input is due to the performance of Java. I have implemented ggperf in both Java and C. On the same SPARCstation 4 system, the implementation in C can generate a perfect hash function for UNIX's /usr/dict/words file (25143 keys) in 500 seconds.

[3] libg++ can be acquired from ftp://prep.ai.mit.edu/pub/gnu.

5 Discussion: A Solution for Very Large Inputs

When a user really needs to process a very large input, there is a way to handle the problem: split the large input into several key groups, each holding thousands of keys, and then use double hashing techniques.

For example, the Merriam Webster dictionary on a NeXT station has about 200,000 entries.

- To choose a key group, we can use the initial two letters of a key as the first hashing function. E.g., `abduct' falls into the `ab' group; `zoo' falls into the `zo' group. There are at most 26 × 26 = 676 such groups. Some groups may be empty.
- In each group, ggperf can be used to perfectly hash the keys.

This solution needs only one additional hashing step, whose cost is not significant.
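The two-level scheme can be sketched in a few lines of Python. This is an illustration under assumptions: the names `build_two_level`, `lookup`, and `standin_group_hash` are hypothetical, and the per-group function is a plain dictionary-based stand-in for whatever ggperf would generate for that group, so only the dispatch on the first two letters is being demonstrated.

```python
from collections import defaultdict


def standin_group_hash(members):
    """Stand-in for a generated per-group function: an injection onto [0..len-1]."""
    index = {key: i for i, key in enumerate(members)}
    return lambda key: index.get(key, 0)     # non-keys fall on slot 0


def build_two_level(keys, make_hash=standin_group_hash):
    """First level: initial two letters. Second level: one perfect hash per group."""
    groups = defaultdict(list)
    for key in keys:
        groups[key[:2]].append(key)
    table = {}
    for prefix, members in groups.items():
        h = make_hash(members)
        slots = [None] * len(members)
        for key in members:
            slots[h(key)] = key
        table[prefix] = (h, slots)
    return table


def lookup(table, key):
    """Pick the group by prefix, then do a single probe inside that group."""
    entry = table.get(key[:2])
    if entry is None:
        return False                         # this group is empty
    h, slots = entry
    return slots[h(key)] == key


table = build_two_level(["abduct", "abide", "zoo", "zone", "zoom"])
print(sorted(table))
print(lookup(table, "zoo"), lookup(table, "zebra"))
```

Since every group is built independently, each one can use its own tables and its own value of c, which keeps every individual generation run small.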

Bibliography

[Czech92] Zbigniew J. Czech, George Havas, and Bohdan S. Majewski. "An Optimal Algorithm for Generating Minimal Perfect Hash Functions". Information Processing Letters, 43 (1992), pp. 257-264, Oct. 1992.

[Tharp88] Alan Tharp. File Organization and Processing. John Wiley & Sons, Inc., 1988.