FINDING MINIMAL PERFECT HASH FUNCTIONS

Gary Haggard
Department of Computer Science
University of Maine at Orono

and

Kevin Karplus
Department of Computer Science
Cornell University

ABSTRACT

A heuristic is given for finding minimal perfect hash functions without extensive searching. The procedure is to construct a set of graph (or hypergraph) models for the dictionary, then choose one of the models for use in constructing the minimal perfect hashing function. The construction of this function relies on a backtracking algorithm for numbering the vertices of the graph. Careful selection of the graph model limits the time spent searching. Good results have been obtained for dictionaries of up to 181 words. Using the same techniques, non-minimal perfect hash functions have been found for sets of up to 667 words.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1986 ACM-0-89791-178-4/86/0002/0191 $00.75

INTRODUCTION

A minimal perfect hashing function is a one-to-one, onto mapping from a set of keys $K$ to $n$ consecutive integers. This paper presents a method for quickly finding such functions for sets of up to about 180 words. The same techniques can be applied to larger sets to find perfect hash functions (still one-to-one, but $n > |K|$).

Ordinary hash functions are cheap to compute, and families of good hash functions have been described in the literature [CW]. Examining arbitrary hash functions until a perfect one is found has been attempted [Sp], but perfect hash functions are too rare for this to be feasible on sets of size $n$ where $n$ is large (of the $n^{|K|}$ possible hash functions, only $n!/(n-|K|)!$ are perfect).

Cichelli presented a method for finding minimal perfect hash functions [C]. Only hash functions of the form

$$h(\mathrm{word}) = g(\text{first letter}) + g(\text{last letter}) + \mathrm{length}(\mathrm{word})$$

were considered. Many useful word sets have no perfect hash function with that form. Even using different functions for the first and last letter, or considering other pairs of letter positions, is not enough. For example, the complete list of PASCAL reserved words and pre-declared identifiers contains the words CASE, ELSE, PAGE, READ, REAL, TRUE, and TYPE.

Figure 1. No two selector functions can distinguish CASE, ELSE, PAGE, READ, REAL, TRUE, and TYPE.

Sager [Sa] proposes an optimization of the method of Cichelli which uses a different intermediate process to prepare for the backtracking search for the required functions. Our method generalizes that of Cichelli, and uses a more flexible intermediate processing step to prepare for the backtracking search than Sager. We search for hash functions of the form

$$h(\mathrm{word}) = \mathrm{length}(\mathrm{word}) + \sum_i g_i(\sigma_i(\mathrm{word})),$$

where $\sigma_i(\mathrm{word})$ selects a letter from the word based on the length of the word, and $g_i(\mathrm{letter})$ is computed by table lookup. The method described here allows the construction of a minimal perfect hash function for the 76 word dictionary of Pascal reserved words and predefined identifiers without special considerations.
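The claim about CASE, ELSE, PAGE, READ, REAL, TRUE, and TYPE can be checked by brute force. The sketch below assumes that a selector picks a letter position as a function of word length only, so that for these equal-length words every selector reduces to one fixed position. The function names are illustrative and do not come from the paper.

```python
from itertools import combinations

WORDS = ["CASE", "ELSE", "PAGE", "READ", "REAL", "TRUE", "TYPE"]


def colliding_pairs(words, positions):
    """Return pairs of words whose letters agree at every given position."""
    seen = {}
    clashes = []
    for word in words:
        key = tuple(word[p] for p in positions)
        if key in seen:
            clashes.append((seen[key], word))
        else:
            seen[key] = word
    return clashes


def check_all_position_pairs(words):
    """Map every pair of letter positions to the collisions it causes."""
    length = len(words[0])
    return {pair: colliding_pairs(words, pair)
            for pair in combinations(range(length), 2)}


if __name__ == "__main__":
    for pair, clashes in check_all_position_pairs(WORDS).items():
        print(pair, clashes)
```

Every one of the six position pairs leaves at least two words indistinguishable, so a third selector is needed for any word set containing these words.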

The search has two parts. First, we look for selector functions such that the vector $(\mathrm{length}, \sigma_1, \ldots, \sigma_m)$ uniquely identifies each word. The vectors can be thought of as the edges of an m-partite hypergraph whose vertices are the letters selected by $\sigma_i$. The word length is kept as a label for the edge. Second, we look for values of $g_i(\mathrm{letter})$ such that each word maps to a different integer in $1, 2, \ldots, n$, where $n$ is the size of the word set. That is, a value is assigned to each vertex of the hypergraph, so that the sum of the edge label and the values on the vertices is a different integer in $1, \ldots, n$ for each hyperedge.

Edges incident on vertices of degree one can be assigned any desired hash value, since the vertex can be assigned a value independent of any other vertex value. Thus the vertex assignment problem can be simplified by deleting all edges containing a vertex unique to that edge. The reduced graph may have new vertices of degree one, allowing more edges to be deleted. Repeating the process eventually results in a graph with no vertices of degree one. The removed edges (which we call tree edges) are assigned hash values in reverse order of their removal after all the edges in the reduced graph have been assigned values.

For small sets the tree edges are a substantial part of the graph, and minimal perfect hash functions can often be found by hand. For example, for the month abbreviations JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, and DEC, choosing the second and third letters gives a graph containing only tree edges (see Figure 2). Arbitrarily choosing JAN=1, ..., DEC=12, we can assign vertex values as shown in Figure 3.
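The tree-edge reduction can be illustrated on the month abbreviations. The following is a minimal sketch, not the authors' program: it peels one edge at a time, remembers which vertex was unique to the edge, and then assigns values in reverse order of removal. Vertices that are still unconstrained when first met are set to 0, which is an arbitrary choice made here, so the resulting table need not match the one in Figure 3.

```python
from collections import Counter

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def edge_of(word, positions):
    """Vertices are (selector index, letter), so the graph is m-partite."""
    return tuple((i, word[p]) for i, p in enumerate(positions))


def peel_tree_edges(edges):
    """Repeatedly remove an edge that owns a degree-one vertex.

    Returns (removed, core): removed lists (word, free_vertex) in removal
    order; core holds the edges that could not be removed.
    """
    core = dict(edges)
    removed = []
    while True:
        degree = Counter(v for verts in core.values() for v in verts)
        hit = next(((w, v) for w, verts in core.items()
                    for v in verts if degree[v] == 1), None)
        if hit is None:
            return removed, core
        removed.append(hit)
        del core[hit[0]]


def assign_tree_values(edges, removed, targets, values=None):
    """Give tree edges their target hash values, last removed first."""
    values = dict(values or {})
    for word, free in reversed(removed):
        others = [v for v in edges[word] if v != free]
        for v in others:
            values.setdefault(v, 0)  # still unconstrained: any value works
        values[free] = targets[word] - len(word) - sum(values[v] for v in others)
    return values


def h(word, positions, values):
    """h(word) = length(word) + sum of the selected letters' values."""
    return len(word) + sum(values[v] for v in edge_of(word, positions))


POSITIONS = (1, 2)  # second and third letters
EDGES = {w: edge_of(w, POSITIONS) for w in MONTHS}
REMOVED, CORE = peel_tree_edges(EDGES)
TARGETS = {w: i + 1 for i, w in enumerate(MONTHS)}  # JAN=1 ... DEC=12
VALUES = assign_tree_values(EDGES, REMOVED, TARGETS)

if __name__ == "__main__":
    print("edges left after peeling:", len(CORE))
    for w in MONTHS:
        print(w, h(w, POSITIONS, VALUES))
```

Because every edge is a tree edge here, no backtracking is needed and any target numbering of the twelve words can be met.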

[Figure 2: bipartite graph with second-letter vertices A, E, P, U, C, O and third-letter vertices G, L, N, Y, R, P, B, C, V, T; each month abbreviation is an edge.]

Figure 2. Graph for second and third letters of the abbreviations for month names.

    Second-letter values:  A = -2   E = -1   P = -1   U = 3   C = 7   O = 8

    Third-letter values:   N = 0    B = 0    R = 2    Y = 4   L = 1
                           G = 2    P = 7    T = 0    V = 0   C = 10

    Hash values:           JAN = 1  FEB = 2  MAR = 3  APR = 4   MAY = 5   JUN = 6
                           JUL = 7  AUG = 8  SEP = 9  OCT = 10  NOV = 11  DEC = 12

Figure 3. Vertex assignments for the abbreviations of month names.

A set of selector functions (choice of letter positions within the words) is chosen by doing a limited search of all sets composed of selector functions from a fixed family of functions. We use a family of 27 different selector functions, but new functions can be easily added to the family. For each set of selector functions, a word hypergraph is built. If no two words map to the same edge with the same length label, the set of selector functions is accepted, and vertex value assignment starts. If all the sets allow words to map to identical edges, the best few sets are kept, and new larger sets are generated from them. From a set of selector functions, a larger set is constructed by adding a function in the family that is not already in the set.

First we consider the empty set: are the words separated by length alone? Then we consider all extensions of the empty set: does any single selector function suffice? The best few selector functions are remembered, and all pairs of selector functions that include one of the best functions are tried. This continues for higher dimensions. The best few sets of k selector functions are remembered, and all sets of k+1 functions that include a remembered set are tried. Sets of selector functions are tried until a good set is found, or the size of the sets gets too large.

The quality of a set of selector functions is measured by a weighted sum of the number of distinct edges, the number of tree edges, and the number of vertices in the word hypergraph. Of these, the tree edge count is the most important. For more details on the weights used, see [KH].
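The limited search over selector sets can be sketched as a small beam search. Everything specific below is an assumption made for illustration: the five-member `SELECTORS` family stands in for the paper's 27 functions (which are not listed here), the `WEIGHTS` are placeholders rather than the values from [KH], and `beam` and `max_size` are hypothetical parameters for "best few" and "too large".

```python
from collections import Counter
from itertools import chain

# Hypothetical selector family: each member picks one letter of the word.
SELECTORS = {
    "first": lambda w: w[0],
    "second": lambda w: w[min(1, len(w) - 1)],
    "third": lambda w: w[min(2, len(w) - 1)],
    "next_to_last": lambda w: w[max(len(w) - 2, 0)],
    "last": lambda w: w[-1],
}
# Placeholder weights; only "tree edges matter most" is taken from the text.
WEIGHTS = {"tree": 4, "edges": 2, "vertices": 1}


def build_edges(words, names):
    """Map each word to (length label, tuple of tagged letter vertices)."""
    return {w: (len(w), tuple((n, SELECTORS[n](w)) for n in names))
            for w in words}


def separates(words, names):
    """True if no two words share both the length label and the edge."""
    return len(set(build_edges(words, names).values())) == len(words)


def count_tree_edges(vertex_sets):
    """Count edges removed by repeatedly deleting edges with a unique vertex."""
    remaining = list(vertex_sets)
    while True:
        degree = Counter(chain.from_iterable(remaining))
        kept = [vs for vs in remaining if all(degree[v] > 1 for v in vs)]
        if len(kept) == len(remaining):
            return len(vertex_sets) - len(kept)
        remaining = kept


def quality(words, names):
    """Weighted sum of tree edges, distinct edges, and vertices."""
    edges = build_edges(words, names)
    vertex_sets = [verts for _, verts in edges.values()]
    return (WEIGHTS["tree"] * count_tree_edges(vertex_sets)
            + WEIGHTS["edges"] * len(set(edges.values()))
            + WEIGHTS["vertices"] * len(set(chain.from_iterable(vertex_sets))))


def find_selectors(words, beam=3, max_size=3):
    """Grow selector sets from the empty set, keeping the best few per size."""
    frontier = [()]  # the empty set: are the words separated by length alone?
    for _ in range(max_size + 1):
        for names in frontier:
            if separates(words, names):
                return names
        best = sorted(frontier, key=lambda s: quality(words, s),
                      reverse=True)[:beam]
        frontier = sorted({tuple(sorted(s + (n,)))
                           for s in best for n in SELECTORS if n not in s})
    return None


MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

if __name__ == "__main__":
    print(find_selectors(MONTHS))
```

For the month abbreviations no single selector in this small family separates the words, so the search returns a pair. A different family or different weights would change which sets are remembered at each size.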

After the selector functions have been chosen, values have to be assigned to all vertices of the word hypergraph. The tree edges can be removed, and the corresponding vertices assigned values in reverse order after the rest of the vertices have values. For the main body of the graph, the vertex assignment proceeds as follows:

1) choose a vertex,

2) if no legal assignment exists, backtrack and change a previous choice,

3) otherwise, assign the smallest legal value to the vertex (0 if the vertex value is unconstrained),

4) repeat 1-3 until all vertices have been assigned.

The simplest possible backtracking scheme is to have a fixed ordering of the vertices, and undo the most recent choice when a conflict occurs. This scheme works well for small graphs (such as those in [C]), but can take a long time on larger ones. Heuristics were used to speed up both the vertex choice and the backtracking.

Vertex choice heuristics attempt to choose the most difficult vertices first, thus triggering necessary backtracks as soon as possible. Define $E_{min}$ as the set of edges with the fewest unassigned vertices (excluding all edges with all vertices assigned). The best vertex-choice heuristic found was to choose among the vertices with the most edges in $E_{min}$ the vertex that has the widest range of partial sums for edges in $E_{min}$. This heuristic works well for hypergraphs of dimension one or two, but not as well for higher dimensions. Only edges that have only one vertex value unassigned affect vertex values (we call these almost completed edges). If the graph has no almost completed edges, a value is assigned arbitrarily to the chosen vertex.

Backtracking heuristics are more complicated than the vertex-choice ones. The hash function searches that succeed backtrack rarely, so the heuristics don't affect them much. The searches that run a long time spend almost all the time doing backtracking. Three different conflicts can trigger backtracking. Edge conflicts occur when two different almost completed edges have the same value. Too-big conflicts occur when the range of partial sums for the almost completed edges is larger than the range of unused edge values. No-fit conflicts occur when every vertex assignment will make some almost completed edge conflict with an existing completed edge.

For edge conflicts, vertex values are popped until the partial sums of the conflicting edges differ. A larger value is assigned to the most recently popped vertex, and the vertex value assignment proceeds forward again. For too-big conflicts, vertex values are popped until the last vertex removed is in the edge with the smallest partial sum and not in the edge with the largest partial sum, or a vertex assignment can be made so that the partial sums of the almost completed edges will fit. For no-fit conflicts, vertex values are popped until some vertex of an almost completed edge has been removed. The almost completed edge with the highest partial sum is excluded, since increasing the assignment for its vertices is not likely to resolve the conflict.

For more details on the backtracking heuristics, and statistics on the occurrences of the different conflicts, see [KH]. Theoretical analysis of running time is difficult, since we lack a convincing model for sets of words. Empirically, our program takes about $0.06 \cdot (\mathrm{words})^{1.5}$ CPU seconds for a successful search on a VAX 11/780. The time doesn't seem to depend on whether a minimal perfect hashing function or a perfect hashing function is sought. Unsuccessful searches take far longer, and have not been allowed to run to completion.
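To get a feel for the empirical running time quoted in the text, the fit can be evaluated for the word-set sizes mentioned in this paper. This is only arithmetic on the stated formula, taking the coefficient as 0.06 and the exponent as 1.5. The function name is illustrative, and the numbers apply to the authors' program on their machine, not to any other implementation.

```python
def estimated_cpu_seconds(num_words, coefficient=0.06, exponent=1.5):
    """Empirical fit from the text: about 0.06 * words**1.5 CPU seconds."""
    return coefficient * num_words ** exponent


if __name__ == "__main__":
    for n in (76, 181, 667):  # word-set sizes mentioned in this paper
        print(n, round(estimated_cpu_seconds(n), 1))
```

Under this fit a successful search for the 181-word dictionary would take on the order of two and a half minutes of CPU time.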

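The simplest backtracking scheme described in the text (a fixed vertex ordering, smallest legal value first, undo the most recent choice on conflict) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' program: it skips the tree-edge reduction and all heuristics, orders vertices by first appearance, and limits an unconstrained vertex to the values 0..n-1, a bound chosen here only to keep the search finite.

```python
def hash_value(edge, values):
    """Edge label (word length) plus the values of the edge's vertices."""
    label, verts = edge
    return label + sum(values[v] for v in verts)


def assign_vertices(edges):
    """edges: dict word -> (length label, tuple of vertices).

    Returns a dict vertex -> value giving every word a distinct hash value
    in 1..n, or None if the bounded search below is exhausted.
    """
    n = len(edges)
    order = []  # fixed ordering: first appearance
    for _, verts in edges.values():
        for v in verts:
            if v not in order:
                order.append(v)
    values, used = {}, set()

    def solve(i):
        if i == len(order):
            return True
        v = order[i]  # step 1: choose a vertex
        # Partial sums of the almost completed edges that v would complete.
        partial = [label + sum(values[u] for u in verts if u != v)
                   for label, verts in edges.values()
                   if v in verts and all(u == v or u in values for u in verts)]
        if partial:
            candidates = range(1 - min(partial), n - max(partial) + 1)
        else:
            candidates = range(0, n)  # unconstrained: start at 0 (bound assumed)
        for val in candidates:  # step 3: smallest legal value first
            sums = [p + val for p in partial]
            if len(set(sums)) == len(sums) and used.isdisjoint(sums):
                values[v] = val
                used.update(sums)
                if solve(i + 1):  # step 4: continue with the next vertex
                    return True
                del values[v]  # step 2: undo the most recent choice
                used.difference_update(sums)
        return False

    return dict(values) if solve(0) else None


def letter_edges(words, positions):
    """Build labelled edges from fixed letter positions (hypothetical helper)."""
    return {w: (len(w), tuple((i, w[p]) for i, p in enumerate(positions)))
            for w in words}


MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
MONTH_EDGES = letter_edges(MONTHS, (1, 2))
MONTH_VALUES = assign_vertices(MONTH_EDGES)

if __name__ == "__main__":
    for word, edge in MONTH_EDGES.items():
        print(word, hash_value(edge, MONTH_VALUES))
```

An empty candidate range corresponds to the too-big conflict, and a range in which every value collides with a used hash value corresponds to the no-fit conflict. This sketch handles both by plain chronological backtracking rather than by the targeted popping rules described in the text.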

REFERENCES

[C] Richard J. Cichelli. "Minimal Perfect Hash Functions Made Simple." Communications of the ACM 23(1) (January 1980), 17-19.

[CW] J. Lawrence Carter and Mark N. Wegman. "Universal Classes of Hash Functions." Proceedings of the 9th Annual ACM Symposium on Theory of Computing (May 1977), 106-112.

[KH] Kevin Karplus and Gary Haggard. Finding Minimal Perfect Hash Functions. Cornell Computer Science Technical Report TR84-637 (September 1984).

[Sa] T. J. Sager. "A Polynomial Time Generator for Minimal Perfect Hash Functions." Communications of the ACM 28(5) (May 1985), 523-532.

[Sp] R. Sprugnoli. "Perfect hashing functions: A single probe retrieving method for static sets." Communications of the ACM 20(11) (November 1977), 841-850.