
GP-hash: Automatic Generation of Non Cryptographic Hash Functions Using Genetic Programming

César Estébanez, Yago Saez, and Pedro Isasi
Universidad Carlos III de Madrid, Av. de la Universidad, 30, 28911, Leganés, Madrid, Spain.
email: [email protected], [email protected], [email protected]


Computational Intelligence, Volume XX, Number 000, 2011

Non cryptographic hash functions have an immense number of important practical applications due to their powerful search properties. However, those properties critically depend on good designs: inappropriately chosen hash functions are a very common source of performance losses. On the other hand, hash functions are difficult to design: they are extremely nonlinear and counterintuitive, and the relationships between their variables are often intricate and obscure. In this work we demonstrate the utility of Genetic Programming to automatically generate non cryptographic hashes that can compete with state-of-the-art hash functions. We describe the design and implementation of our system, called GP-hash. We also experimentally identify the most appropriate terminal and function set, fitness function, and parameter set for this task, providing useful information for future research on this topic. Using GP-hash, we were able to generate a non cryptographic hash, which we call gp-hash01, that competes with a selection of the most important functions in the hashing literature, most of them widely used in industry and created by world-class hashing experts with years of experience.

Key words: Hash Functions, Genetic Programming, Evolutionary Computation

1. INTRODUCTION AND DEFINITIONS

Hashing is everywhere. Hash functions are the core of hash tables, of course, but they also have a multitude of other applications: Bloom Filters, Distributed Hash Tables, Locality-Sensitive Hashing, Geometric Hashing, string search algorithms, error detection schemes, transposition tables, cache implementations, and many more. For example, Robert Jenkins reports on his webpage that his hash function lookup3 has been used by top-class companies like Google, Oracle, or Dreamworks (they used it for the Shrek movie). He also reported that lookup3 was used in implementations of Infoseek, Perl, Ruby, and Linux, among others. The creators of the FNV hash function also report some impressive real-life applications of their function: DNS servers, NFS implementations (FreeBSD 4.3, IRIX, Linux), videogames (PS2, GameCube and Xbox consoles), Twitter, etc.

Why is hashing so important? The answer is that, under some reasonable assumptions, hashing allows searching for objects in a set in constant time O(1), independently of the size of the set. So it is not only that access times are optimal: the most important feature is the perfect scalability of the system. Lookup times remain constant no matter how large the set is. Considering that we live in a world in which governments, companies, and research centers use every day massive databases containing thousands of terabytes of data that must be constantly accessed and updated, it should not be a surprise that hashing is such a popular technique.

Of course, finding elements in time O(1) is the ideal case. In fact, one of the most important drawbacks of hashing is that it has a terrible worst case: finding an object in a set of n elements could have a cost of O(n). This happens only when the hash function maps every input key to the same hash value, and this extreme behaviour is very unlikely as long as we design a decent function. But performance losses due to unsuitable hash functions are very common. The performance of a hashing system entirely depends on how we design (or choose) the hash function.

The problem is that designing top-quality hash functions is a difficult process. They are extremely nonlinear, counterintuitive mathematical constructions in which the relationships between the variables are intentionally obscure and intricate. In fact, most of the non cryptographic hashes that are commonly used in the software industry were created by experts in almost handicraft processes. Some very popular functions, like FNV for example, use magic numbers, which are numerical constants arbitrarily selected in a trial-and-error process. On top of that, there is no generally accepted way of measuring the quality of non cryptographic hash functions, so, even if one does a good job designing a hash function, it is very difficult to compare it with the state of the art.

These difficulties in the design of good hash functions suggest that Artificial Intelligence techniques such as Genetic Programming (GP) could do a good job replacing humans in the task of creating new hashes. The reason is that GP is especially suitable for this specific kind of problem: in Poli et al. (2008), which at this moment is probably the most comprehensive and up-to-date reference covering all the theoretical and practical aspects of GP, the authors claim that, based on the experience of numerous researchers over many years, GP is especially productive in problems having some or all of the following properties:

© 2011 The Authors. Journal Compilation © 2011 Wiley Periodicals, Inc.


(1) The interrelationships among the relevant variables are unknown or poorly understood.
(2) Finding the size and shape of the ultimate solution is a major part of the problem.
(3) Significant amounts of test data are available in computer-readable form.
(4) There are good simulators to test the performance of tentative solutions to a problem, but poor methods to directly obtain good solutions.
(5) Conventional mathematical analysis does not, or cannot, provide analytic solutions.
(6) An approximate solution is acceptable.
(7) Small improvements in performance are routinely measured (or easily measurable) and highly prized.

We can say that the problem of finding new hash functions completely fulfills at least conditions 1, 3, 4, and 7, and it probably also fulfills all the others in some way.

In this work we used GP to automatically generate non cryptographic hash functions. We started from ProGen, an elegant, robust, and efficient GP framework (Estébanez and Reta, 2010), and we combined it with a hashing library specifically written in Java for this work. We call this system GP-hash. Using GP-hash, we were able to generate hashes that compete in performance with state-of-the-art functions that are massively used in industry, like lookup3, FNV, SuperFastHash or MurmurHash2. We used many different combinations of operators and three different families of fitness functions, and we generated dozens of new hash functions that are competitive with those created by humans. The last contribution of this work is the hash function benchmarking tool that we designed: it compiles and organizes the most important hash metrics in the literature in order to fairly compare hash functions. We used this benchmark to compare our functions with the state of the art.

The rest of this document is organized as follows: in Sections 2 and 3 we introduce, respectively, non cryptographic hash functions and GP, the two main technologies on which this work is based. Those sections give a very brief introduction to the most important concepts and suggest bibliography for readers interested in going into them in depth. Then, in Section 4, we review some papers that involve the application of Evolutionary Computation techniques (and Artificial Intelligence in general) to hashing. Section 5 is dedicated to our GP-hash system: we describe all the design and implementation issues, including terminal and function sets, fitness functions, parameter tuning, etc. Then, in Section 6, we use our experimental results to show the utility of GP-hash to generate non cryptographic hashes. Finally, in Section 7, we summarize the most important achievements and contributions of this work, and explain in detail what we have learned from it.

2. NON CRYPTOGRAPHIC HASH FUNCTIONS

Hash functions are a family of mathematical expressions that take a message of variable length as input and return a hash value of fixed length as output (see Figure 1). This asymmetry between the sizes of inputs and outputs is one of the most important properties of hash functions. Another desirable and important property, also illustrated in Figure 1, is that minimal changes in the input of a hash function should produce maximal changes in the output.

Most hash functions (both cryptographic and non cryptographic) follow the Merkle–Damgård construction scheme (independently developed by Merkle (Merkle, 1989) and Damgård (Damgård, 1990)). Figure 2 illustrates how it works: the input of the hash function is split into smaller blocks of fixed size; then the blocks are processed one by one by the mixing function, whose mission is to scramble the input bits and the internal state, producing a highly entropic output. In step i, the inputs of the mixing function are block i and the output of processing block i − 1. If the length of the message is not a multiple of the block size, then padding must be added to the last block.

There is a huge number of practical applications of hash functions, but the most important one (and the base for most of the others) is the hash table. Hash tables are data structures composed of a random-access container (e.g. an array) with M slots (usually called buckets) that can store entries, and a hash function. Entries consist of two elements: the data we actually want to store, and a key that identifies the entry. To insert an entry into the table, the hash function is fed with the key, producing a hash value. This value is translated into a valid index of the table, and then

the key-data pair is inserted into the bucket indicated by the generated index. When looking for a particular entry in the table, the process is reversed: the key associated with the entry is hashed and the hash value is translated into an index. The entry is expected to be in the bucket indicated by the produced index.

Figure 1. Example of a typical hash function: input values can have any length; outputs are 32-bit values; the two last inputs differ in only a few letters, but their outputs are completely different.

Figure 2. Merkle–Damgård construction scheme.
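To make the scheme concrete, a minimal Java sketch of a Merkle–Damgård driver over 8-bit blocks could look as follows. This is our own illustration: the names mix and INIT_VALUE are ours, and with 8-bit blocks every message length is a multiple of the block size, so no padding is needed in this simplified case.

    class MerkleDamgardSketch {
        static final int INIT_VALUE = 0x0;       // illustrative initialization value

        // The mixing function is applied block by block, chaining the
        // internal state from one step to the next.
        static int hash(byte[] message) {
            int state = INIT_VALUE;
            for (byte b : message) {
                int block = b & 0xFF;            // current 8-bit block
                state = mix(state, block);       // scramble block into the state
            }
            return state;                        // the final state is the hash value
        }

        // Any mixing function fits here; this trivial one is NOT a good hash.
        static int mix(int state, int block) {
            return (state * 31) ^ block;
        }
    }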

Ideally, every hash value should identify a unique input message. But, as stated above, the inputs of a hash function have variable size, while the outputs have fixed size. This means that there is an infinite number of possible inputs and a finite number of possible outputs. The consequence is that some inputs must produce exactly the same output. We call this a collision. Collisions are an unavoidable problem that can dramatically decrease the performance of hashing. Keeping a low collision rate is one of the most desirable properties of a good hash function for table lookup (Knott, 1975; Heileman, 1998; Valloud, 2008; Goodrich and Tamassia, 2009). Another important property, perhaps even more important than the collision rate, is the distribution of the outputs: we want the outputs of a hash function to be uniformly distributed (Knott, 1975; Sedgewick, 1995; Heileman, 1998; Cormen, 2001; Valloud, 2008). In other words, each hash value must have the same probability of being generated, independently of the distribution of the inputs. In fact, this property is closely related to the collision rate: if some hash values have a higher probability of being generated, then clustering problems appear and the collision rate rises. The last interesting property of a non cryptographic hash function is the avalanche effect. This concept was introduced by Horst Feistel as an important property of block ciphers (Feistel, 1973). Later, the concept was extended to s-boxes (Schneier, 1996), cryptographic hash functions (Preneel, 1993), non cryptographic hashes (Valloud, 2008; Mulvey, 2007), etc.


We say that a hash function achieves good avalanche when minimal changes in the input produce maximal changes in the output. This happens if each input bit has some influence on every output bit. The consequence is that flipping a single bit in the input produces an avalanche of bit flips in the output. If a hash function achieves a high avalanche effect, then the disorder caused by the hash is maximal (see Figure 3).

Figure 3. A hash function h with a nice avalanche effect.

A more rigorous concept is the Strict Avalanche Criterion (SAC), introduced in Webster and Tavares (1986): a hash function satisfies the SAC if, for every change in any of the input bits (a toggle between 0 and 1), all the bits of the output change with probability 1/2. In other words, flipping one bit of the input changes, on average, half of the output bits:

\forall x, y : H(x, y) = 1 \Rightarrow E[H(f(x), f(y))] = n/2    (1)

where H(x, y) is the Hamming distance between x and y, f is a hash function, and n is the number of output bits of f.

Finally, the only reason for using hashing techniques is speed: we want to perform lookup operations in very reduced times. So, of course, a good hash function must be very fast. Summarizing, a non cryptographic hash function:

(1) Must minimize collisions.
(2) Must distribute outputs evenly.
(3) Must achieve avalanche.
(4) Must be very fast.
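The criterion is easy to check empirically with a single-bit-flip experiment. The sketch below (our own Java illustration, not part of any tool described in this paper) estimates E[H(f(x), f(y))] by Monte Carlo sampling; for a 32-bit function close to the SAC, the returned value should approach n/2 = 16:

    import java.util.Random;
    import java.util.function.IntUnaryOperator;

    class SacCheck {
        // Mean Hamming distance between f(x) and f(y), where y is x with
        // one random bit flipped; a value near 16 indicates SAC-like behavior.
        static double meanDistance(IntUnaryOperator f, int trials) {
            Random rnd = new Random();
            long total = 0;
            for (int t = 0; t < trials; t++) {
                int x = rnd.nextInt();
                int y = x ^ (1 << rnd.nextInt(32));   // flip one random input bit
                total += Integer.bitCount(f.applyAsInt(x) ^ f.applyAsInt(y));
            }
            return (double) total / trials;
        }
    }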

2.1. Further Reading

According to Donald E. Knuth, the first publication about hashing is an internal memorandum written by H. P. Luhn, an IBM employee, in 1953, but the most cited reference on hashing is Knuth (1998). It is probably the first textbook that gives a serious introduction to hashing, but its first edition is from the 70's and could be a bit outdated. There are other modern textbooks that are also worth reading: Valloud (2008) is the only textbook we know of that is entirely dedicated to hashing. Its only drawback is that the book is focused on Smalltalk; even so, it is a comprehensive guide on hashing. Other textbooks containing interesting chapters about hashing are Sedgewick (1995); Cormen (2001); Goodrich and Tamassia (2009); Heileman (1998). Another great source of information is the set of video lectures of the CS course Introduction to Algorithms at MIT, publicly available through MIT OpenCourseWare.

3. GENETIC PROGRAMMING

GP (Koza, 1992) is a stochastic search technique that tries to automatically generate solutions to a problem starting from high-level statements of what needs to be done. GP belongs to the family of Evolutionary Computation techniques. GP populations are composed of computer programs: GP starts from a random population of programs and tries to improve them through generations using mechanisms inspired by natural selection and evolution. In order to exert selective pressure over the population and properly guide the search, GP uses a combination of two elements: first, a cost function (or fitness function) that evaluates computer programs and assigns them a score indicating their level of adaptation to the problem; and second, a number of operators that select, recombine or modify individuals of the population.

The whole process is illustrated in Figure 4: we start from a random initial population. Each


generation, the current population is evaluated. The best individuals of the population are selected for reproduction. Those individuals are evolved using genetic operators (the most common are reproduction, crossover, and mutation). The individuals produced during the evolution are inserted into the offspring population, which will be used in the next generation. This process is repeated generation after generation until a perfect individual is found or the maximum number of generations is reached.

Figure 4. Basic schema of a GP run.

Individuals are usually represented as parse trees or their equivalent syntactic expressions in Polish notation (see Figure 5). Internal nodes of the tree are functions (operators that accept parameters), and leaves are terminals (variables or constants).

Figure 5. Example of a GP individual represented as a tree and its equivalent syntax expression: (* (log (- X 7)) (+ 3 Y)).

Genetic operators work as follows: reproduction just copies the selected individual into the offspring population; mutation changes an individual tree by randomly replacing a node or even a subtree; crossover takes two individuals, selects a random crossover point in each of them, and finally creates the offspring by replacing the subtree rooted at the crossover point of the first parent with the subtree rooted at the crossover point of the second parent.
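As a minimal illustration of the tree representation in Figure 5 (a sketch of our own, not ProGen's actual classes), an individual can be encoded as a recursive node structure and evaluated by walking the tree:

    // Sketch of a GP parse tree like the one in Figure 5:
    // internal nodes hold functions, leaves hold terminals or constants.
    class Node {
        String op;           // function name, variable name, or a constant
        Node left, right;    // right is null for unary ops; both null for leaves

        double eval(double x, double y) {
            switch (op) {
                case "*":   return left.eval(x, y) * right.eval(x, y);
                case "+":   return left.eval(x, y) + right.eval(x, y);
                case "-":   return left.eval(x, y) - right.eval(x, y);
                case "log": return Math.log(left.eval(x, y));
                case "X":   return x;
                case "Y":   return y;
                default:    return Double.parseDouble(op);  // constant leaf
            }
        }
    }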


4. ARTIFICIAL INTELLIGENCE + HASHING

In Section 1, we explained the reasons why GP could be very suitable for automatically generating hash functions: Evolutionary Computation techniques in general have proved to be very good at finding approximate solutions to poorly understood problems in which the relationships between the variables are obscure or intricate. Furthermore, GP is particularly interesting for discovering unexpected hash functions, because the individuals it evolves do not need to have a fixed size or shape, which is perfect for evolving mathematical expressions.

Surprisingly, there is not much research on this topic. Actually, it is hard to find research work that uses AI techniques to automatically generate non cryptographic hash functions. The most similar work we found is (Safdari, 2009). In fact, the author cites a previous work that we published in 2006 (Estébanez et al., 2006a) describing the first prototype of our GP-hash system. In his paper, M. Safdari starts from the following family of universal hash functions:

h_{a,b}(k) = ((ak + b) mod p) mod N

and uses a Genetic Algorithm to evolve the parameters a and b, trying to find the best function for a particular set of inputs. All the databases he uses are sets of random integers lying in a predefined range. The results are promising, but the methodology is questionable: if the input data is purely random, which is the case, then a hash function is not needed at all: just use the input values (or a portion of them) as hash values to obtain a perfectly uniform distribution and a minimal collision rate. It would be much more interesting to try the same experiments with biased input sets. Furthermore, there are no significance tests (or at least mean values over a number of runs) on the results. The cost function used to guide the search is based on the collision rate and the load factor of the table, two concepts that are mixed in the same function in an apparently arbitrary way. Anyway, even though the methodology could be improved, the ideas in this work are interesting and deserve more research.

In a previous work (Berarducci et al., 2004), the authors already followed a similar approach, trying to automatically generate hash functions for hashing integers. Their system, called GEVOSH, is even closer to GP-hash than the work of M. Safdari, because GEVOSH uses Grammatical Evolution (Ryan et al., 1998; O'Neill and Ryan, 2003), a technique which is closely related to GP, and because complete hash functions are evolved instead of using a fixed schema and evolving its parameters. The fitness function is based on the collision rate. Two hash functions are obtained and compared with six hashes extracted from (Wang, 2007). The authors claim that their hashes are competitive with the other six, but the reader cannot really tell, because the charts in the paper are very difficult to understand (very low quality graphics and no explanation in the text). It is not clear in the paper whether the datasets they used were random or not.

Hussain and Malliaris (2000) is a very short paper in which the authors use a Genetic Algorithm with a collision-based fitness to evolve some kind of polynomials that they use as hash functions. There is no explanation of how those polynomials are constructed or used to hash, so we assume that they are using a schema similar to the Polynomial Hash Codes studied in (Goodrich and Tamassia, 2009). The experimental results appear to be good, but the extreme lack of details makes it very difficult to evaluate their real impact.
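For concreteness, the universal family used by Safdari, shown above, is straightforward to express in code (our own sketch; the parameter choices are illustrative, and p is assumed to be a prime larger than any key value):

    // Universal hash family h_{a,b}(k) = ((a*k + b) mod p) mod N.
    // Long arithmetic avoids intermediate overflow for 32-bit keys;
    // floorMod keeps the result non-negative.
    static int universalHash(int k, long a, long b, long p, int N) {
        return (int) (Math.floorMod(a * k + b, p) % N);
    }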
Another interesting flavour of this problem is the automatic design of hashing circuits using Evolvable Hardware, a technique that uses evolutionary algorithms to automatically design electronic devices (Sipper et al., 1997; Gordon and Bentley, 2002). In this domain, (Damiani et al., 1998) offers an interesting approach: it uses an evolutionary algorithm to evolve an FPGA-based digital circuit which computes a hash function mapping 16-bit entries into 8-bit hash values. The evolutionary algorithm uses dynamic mutation and uniform crossover, and the fitness function is based on the uniformity of the output distribution. In (Damiani and Tettamanzi, 1999) this system is adapted to on-line reconfiguration of the circuits. Finally, (Widiger et al., 2006) is another example of the application of Evolvable Hardware to the generation of FPGA hashing circuits. In this case, the hash circuits are intended to work as hardware packet classifiers inside routers. The routing rules that the device needs to hash are constantly changing, so the designed hash function must be adaptive, and the circuits must allow on-line reconfiguration. Different hash schemes are used and the results are very interesting.


The automatic generation of cryptographic hashes is completely out of the scope of this work, because their design goals and restrictions are completely different from those of non cryptographic hashes. Even so, we suggest two interesting publications on this topic: (Snasel et al., 2009) and (Bedau et al., 2004). In (Estévez-Tapiador et al., 2008), some of our colleagues at Universidad Carlos III de Madrid continued our previous work presented in (Estébanez et al., 2006b) and created a variation of our system that evolves cryptographic hashes. Although we are not dealing with cryptographic functions in this work, we think it is important to cite this work because it shows that the central idea of GP-hash is flexible and powerful enough to be easily adapted to different domains. In the aforementioned paper, the authors were able to generate a block cipher that they used as the compression function of a cryptographic hash following the Miyaguchi–Preneel construction scheme (Miyaguchi et al., 1990; Preneel, 1993). The function they generated was very fast and passed several statistical tests, which showed that the function has no evident weaknesses and suggested that it could be secure enough to resist some attacks. They used the same fitness function based on the avalanche effect that we developed in our previous work.

5. GP-HASH: (RE)DESIGN AND IMPLEMENTATION

The objective of this work is to automatically discover general purpose, state-of-the-art, non cryptographic hash functions using GP. In order to do so, we used our GP system for automatic generation of non cryptographic hashes, which we call GP-hash. GP-hash was previously proposed in (Estébanez et al., 2006b), but for the present work we completely redesigned the whole system and rewrote all the source code from scratch. The original version of GP-hash was written in C and was constructed on top of the lil-gp GP system (Punch and Zongker, 1998). For this work, we used a combination of three independent software systems: ProGen, HashBenchmarkTool, and GP-hash.

ProGen is a powerful GP framework written in Java. It was designed to be efficient, elegant, highly modular, and very easy to use. HashBenchmarkTool is a Java application created to fill the need for a unified benchmarking tool for non cryptographic hash functions: there are many different properties that have been considered important for hash functions, but different authors use different combinations of them, so it is really difficult to make fair comparisons between competing hash functions. In HashBenchmarkTool we compiled the most important quality measures and most of the hash functions that form the current state of the art, and we put everything together in a usable environment. With those two powerful tools in our hands, the next step was to redesign GP-hash to work as a bridge between them. We translated GP-hash to Java, integrated it with the ProGen framework, and incorporated some of the functionality of HashBenchmarkTool. ProGen provides the whole GP evolution engine (populations, representation of individuals, ERCs, genetic operators, strongly typed GP when needed, etc.), and HashBenchmarkTool provides the hashing quality measurements needed by the fitness function that is at the very heart of GP-hash. The result of this three-way cooperation is that GP-hash is now much more flexible and scalable.

Let's look now into the details of the GP-hash implementation. We want to evolve 32-bit hash functions [1]. But we do not need the GP to discover how to break the message into blocks and feed the mixing function: we already know that the standard way to do this is to use the Merkle–Damgård construction scheme. Therefore, the only thing we actually need to evolve is the mixing function itself. Internally, individuals of GP-hash only code mixing functions whose inputs in a particular step are the block being processed and the output from the previous step. When we want to externally use those individuals, we wrap them with Merkle–Damgård constructions, obtaining fully functional hash functions. Mixing functions are coded as regular GP trees.

The final objective is to optimize the performance of the evolved hash functions. In order to

[1] In this work we focus on 32-bit hashes because those are the most common, and we do not want to unnecessarily complicate the explanations. But it is trivial to configure GP-hash to produce functions with a different output size (64, 128, etc.).


achieve this, we must configure the GP-hash system as we usually do when tackling a problem with GP. Basically, we must make three important decisions:

(1) Define the terminal and function set.
(2) Design the fitness function.
(3) Tune the parameters of the GP runs, including: maximum number of generations (G), size of the population (M), initialization method (for the individuals of the initial population), genetic operators, selection methods, size limits for the individuals, etc.

5.1. Terminal and Function Set

First we will explain how we chose the terminals of our problem, then we will talk about the functions, and finally we will show the experimental evidence that supports our choices.

5.1.1. Terminal Set. Given that the evolved functions will follow the Merkle–Damgård scheme, we need at least two different terminals:

• hval: 32-bit variable containing the internal state of the hash function. When processing block Mi, hval contains the result of processing the previous block Mi−1. By default it is initialized to zero, but it could be initialized to any other value.
• a0, a1 . . . an: These variables contain the block being processed in the current step. In the Merkle–Damgård scheme, these blocks have a fixed size, but internally the mixing function could process them in separate parts. In the most common case, blocks are 32 or 8 bits long, and we only use one variable a0 coded as an integer (32 bits) or as a byte (8 bits). But other combinations are possible and very common in the hashing literature. For example, the mixing function of lookup3 consumes blocks of 96 bits on each step, and internally the function divides each block into 3 variables of 32 bits and mixes them separately. To obtain a similar mixing function in GP-hash we would use 3 integer variables (a0, a1, a2). By default, we always use one 8-bit variable a0.

Other important building blocks are magic constants. They are very common in the hashing literature, in the form of big numbers that are combined with the variables of the system (hval, a0, a1 . . . an) to improve the overall entropy. There are no established rules about how to choose those numbers, but, in general, prime numbers are preferred, because they are considered to provide more disorder (see for example (Partow, 2010)). In GP-hash, magic constants are implemented as Ephemeral Random Constants or ERCs (special terminals that are randomly initialized the first time they are evaluated, but that keep their values during the rest of the GP run, as defined in (Koza, 1992)). Each ERC is initialized with an integer randomly selected from a list [2] of one million prime numbers between 15,485,867 and 32,452,843.

Terminal Set = {hval, a0, a1 . . . an, PrimesERC}

5.1.2. Function Set. The approach we followed to create the function set was to gather some of the most widely used non cryptographic hash functions and check which operators appear most frequently in them. In this way we defined a basic function set by putting together the most common operators in the hashing literature. Then we carried out a battery of experiments to refine this basic function set. In Table 1 we show some of the most important non cryptographic hashes and the operators they use.

Addition (+), subtraction (−), multiplication (∗) and division (/) are the usual arithmetic operators we use every day [3]. The bitwise operators xor (∧), and (&), or (|), and not (¬) are also very usual, and do not need explanation.
Right shift (>>) and right rotation (≫) are bitwise operators that literally move the bits of a variable to the right. The difference between >> and ≫ is that in the former, the bits originally placed at the right end are discarded and zeros are injected at the left end, while in ≫, the bits that are shifted out on the right are shifted back in on the left (see Figure 6 for a graphical clarification). The left shift (<<) and left rotation (≪) operators work exactly the same way but in the opposite direction.

[2] This list was obtained from (Caldwell, 2009).
[3] Except for the division, which is protected to avoid divide-by-zero errors and to respect the closure property as defined in (Koza, 1992).
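In Java, for instance (our own illustration), a 32-bit right rotation can be built from two shifts, and the equivalence between left and right rotations used later in this section is easy to verify:

    class RotationDemo {
        // Right rotation built from two shifts: the bits shifted out on the
        // right re-enter on the left, so no information is lost.
        static int rotr(int x, int n) {
            return (x >>> n) | (x << (32 - n));
        }

        public static void main(String[] args) {
            int x = 0xB5A33;
            System.out.println(rotr(x, 12) == Integer.rotateRight(x, 12));  // true
            // Right and left rotations are equivalent up to the complement
            // of the rotation amount: (x >>> rot n) == (x <<< rot 32 - n).
            System.out.println(Integer.rotateRight(x, 12)
                    == Integer.rotateLeft(x, 32 - 12));                     // true
        }
    }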

Table 1. Operators used by some state-of-the-art non cryptographic hashes (rows: APartow, Bernstein, BKDR, Jenkins, BuzHash, FNV, Murmur2, Hsieh SFH; columns: +, −, ∗, /, shifts, rotations, ∧, &, |, ¬, with a mark indicating that the function uses the operator).

Figure 6. Bitwise shift and rotation operators.

We observe in Table 1 that almost every non cryptographic hash function uses a combination of some of the following operators: {+, −, ∗, <<, >>, ≫, ≪, ∧, ¬}. The operators /, &, and | are not used by any hash. That is not a surprise, given that those operators are not reversible. Using only reversible operators guarantees that the mixing function is reversible, which means that the inputs of the function can be calculated from the outputs. In other words, there is a one-to-one mapping between inputs and outputs, so the mixing function is collision-free. If the function is not reversible, then at least two different inputs must produce the same output, which means that the mixing function is introducing totally avoidable collisions, which will finally propagate to the hash function. See (Mulvey, 2007) for more information about reversible operators and mixing functions.

Multiplication is reversible only in certain circumstances and can be slow on some architectures, but it is widely used because it introduces a lot of entropy. Bit shifts are very popular because they are highly entropic and also because they are extremely efficient (only 1 CPU cycle of latency on most modern microprocessors). But they are not reversible unless they are combined with other operators (e.g. h >>= constant is not reversible, but h ^= h >> constant is reversible), so they must be used with care. We cannot expect the GP to be careful when putting building blocks together, so given that bit rotations have a very similar behavior (and efficiency), and that they are always reversible, we prefer rotations in our function set rather than shifts. Furthermore, right rotation and left rotation are completely equivalent (i.e. (x ≫ n) = (x ≪ (32 − n))), so when using rotations we arbitrarily discard left rotation and keep only right rotation.

Apart from shifts and rotations, the most frequent operators are clearly addition, multiplication and exclusive or. Thus, we can define a basic function set for GP-hash based on the popularity of the operators and on our own hypotheses:

Basic Function Set = {+, ∗, ≫, ∧}

5.1.3. Validation of the Terminal and Function Set. Combining the selected functions and terminals, we create the basic terminal and function set for GP-hash. Then, following an approach similar to (Wang and Soule, 2004), we carried out a battery of experiments to test whether this set is complete and minimal, and whether our hypotheses about the functions were correct. We include a summary of the results in Table 2. Each row represents the average fitness obtained with a different terminal and function set over 50 runs [4]. The terminal and function sets are labeled F1, F2 . . . F10. The row labeled BTFS represents the average fitness obtained with the Basic Terminal and Function Set defined above, and it is used as the reference. The last column contains a symbol that encodes the statistical significance: ↓ means that the results are significantly worse, and = means that there are no significant differences between the row's average fitness and the BTFS average fitness. It is important to note that we are minimizing fitness values, so the lower the fitness, the better the individual. We used the Shapiro–Wilk test for normality, and the t-test and Wilcoxon significance tests for normal and non-normal distributions, respectively.

Table 2. Average results of 50 GP-hash runs with different Terminal and Function Sets.

Label   Terminal and Function Set                      Avg. Fitness   Significance
BTFS    {+, ∗, ≫, ∧, hval, a0, PrimesERC}              0.05026        (reference)
F1      {+, ∗, ≫, ∧, &, |, hval, a0, PrimesERC}        0.05134        =
F2      {+, ∗, ≫, ∧, <<, >>, hval, a0, PrimesERC}      0.05206        =
F3      {+, ∗, ≪, ∧, hval, a0, PrimesERC}              0.05015        =
F4      {∗, ≫, ∧, hval, a0, PrimesERC}                 0.05113        =
F5      {+, ≫, ∧, hval, a0, PrimesERC}                 0.23917        ↓
F6      {+, ∗, ≫, hval, a0, PrimesERC}                 0.0508         =
F7      {+, ∗, ∧, hval, a0, PrimesERC}                 0.1739         ↓
F8      {+, ∗, ≫, ∧, a0, PrimesERC}                    0.43425        ↓
F9      {+, ∗, ≫, ∧, hval, a0}                         0.05113        =
F10     {∗, ≫, hval, a0, PrimesERC}                    0.43419        ↓

Conclusions obtained from the results:

• F1 and F2: Including & and | does not improve the average fitness of GP-hash. The & operator was never selected to be part of the best individual of a GP-hash run. The | operator was selected in only around 30% of the runs. This is probably related to the non-reversibility of those operators, and it is interesting to see that operators which are unpopular among hashing experts are also unpopular in GP-hash solutions. Including bit shifts does not have any effect on the average fitness of GP-hash runs either. Hash functions generated with F2 contain shifts, so shifts are used in the evolution even though they do not improve the performance over having only rotations. These results support our decision to exclude the &, | and shift operators from the BTFS.

[4] For these experiments we used the avalanche fitness based on avalanche matrices and RMSE explained in Section 5.2.3, and the standard parameters shown in Section 5.3.


• F3: As expected, replacing right rotation with left rotation does not have any effect on the average fitness of GP-hash. As we already predicted, both operators are completely equivalent.
• F4 and F6: Surprisingly, removing either the addition or the xor operator from the BTFS has no effect on the average fitness. This was completely unexpected: these operators are very popular in the hashing literature, but GP-hash seems to work fine without them. We want to stress that we are talking about two separate experiments: in the first one we removed addition; in the second one we removed xor. The lack of impact on the fitness could be explained if these two apparently important functions belong to a function group as defined in (Wang and Soule, 2004). We tested this possibility with function set F10.
• F5 and F7: On the other hand, removing either the multiplication or the rotation does have a drastic impact on the average fitness. Both changes produce a significant worsening of GP-hash performance. These operators are clearly needed.
• F8 and F9: We also tested the impact of hval and PrimesERC on the average fitness. The results show that the hval terminal is definitely needed for a correct evolution, which was totally expected. What was unexpected is that PrimesERC seems not to be needed: removing it from the BTFS does not affect the average fitness.
• F10: Removing both the addition and the xor operators produces an important worsening of the average fitness. As we suspected from the results of F4 and F6, xor and addition form a function group. In other words, at least one of these operators must appear in the function set, but it does not matter which one. This explains the apparent lack of effect on the fitness of these very popular operators observed in F4 and F6. It is interesting to note that every hash function in Table 1 that does not use addition uses xor, and vice versa. According to (Wang and Soule, 2004), the optimal solution is to choose only one of those operators for the function set. Since we already have an arithmetic operator (∗) but we do not have any boolean operator, we arbitrarily decided to include xor and remove addition from the BTFS.

Finally, we have defined the terminal and function set for GP-hash:

Terminal and Function Set = {∗, ≫, ∧, hval, a0}

5.2. Fitness Function

The most challenging part of the GP-hash design was deciding how to evaluate individuals. In Section 2 we already stated that a non cryptographic hash function:

(1) Must minimize collisions.
(2) Must distribute outputs evenly.
(3) Must achieve avalanche.
(4) Must be very fast.

The speed itself cannot be an optimization objective. We want our function to be very fast, but that is not enough. This expression, for example: h = 0x0; return h; is a syntactically valid hash function and it is extremely fast, but it is completely useless. If we used speed as the objective function of a GP run, then we would obtain many individuals like that. The speed could be seen as a secondary objective that influences the fitness through a weighted addition, or it could be considered a constraint of the problem. GP-hash follows the latter approach: the size (number of nodes) of the evolved individuals is always limited, so the evolved hashes can only have a limited number of operators. This way, the execution time of the evolved hashes is bounded.

Excluding speed, we have three desired properties for hash functions, so we designed three different fitness functions for GP-hash, each one measuring one of those properties.

5.2.1. Collision Fitness. We implemented a fitness function for GP-hash based on the collision rate. Every individual in the population is used to hash a specific data set into a predefined hash table, and the collision rate is measured (the collision rate is the number of collisions divided by the number of hash values generated). The data set and the size of the hash table are parameters of this function. The data set used in the evolution defines the distribution of the inputs that the hash function receives. Theoretically, the mission of GP-hash is to detect patterns in the distribution of the input bits: it will reinforce individuals that show a better-than-average collision rate (and discard individuals which show weaknesses) when dealing with those patterns. In consequence, this fitness directly measures how many collisions we should expect when the hash function hashes a particular data set, and also other data sets with similar distributions.
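As an illustration of this fitness (a sketch under our own naming: hashCandidate stands for the evolved individual and M for the table size):

    import java.util.List;
    import java.util.function.ToIntFunction;

    class CollisionFitness {
        // Hash every key of the training set into M buckets and return the
        // collision rate: collisions divided by the number of hashed keys.
        static double collisionRate(List<byte[]> dataSet, int M,
                                    ToIntFunction<byte[]> hashCandidate) {
            boolean[] occupied = new boolean[M];
            int collisions = 0;
            for (byte[] key : dataSet) {
                int index = Math.floorMod(hashCandidate.applyAsInt(key), M);
                if (occupied[index]) collisions++;       // bucket already taken
                else occupied[index] = true;
            }
            return (double) collisions / dataSet.size(); // lower is better
        }
    }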

The main problem of this fitness function is how to choose the data set for training the hash function. In this work we are trying to use GP-hash to automatically discover general purpose hash functions, but choosing a data set with a particular distribution only guarantees that the discovered hash will do a good job hashing similar distributions of keys. We can say nothing about the expected performance with other kinds of data sets. Besides, in the battery of experiments we carried out with this fitness function, the evolution curves were almost completely flat: individuals of the population got stuck on local optima at very early generations. The fitness landscape defined by this function (in conjunction with the genetic operators and the representation) is probably rugged, deceptive or very rich in local optima (see (Talbi, 2009) for a comprehensive study on fitness landscapes and metaheuristics, and (Langdon and Poli, 2002) for more on fitness landscapes and GP). This fitness is a complete failure, which is surprising, given that at least half of the papers reviewed in Section 4 used a collision-based fitness. Also, as expected from the evolution curves, the hashes found with this fitness are of really low quality, obtaining very poor performance in all the benchmarks.

5.2.2. Distribution Fitness. Another family of fitness functions, based on the distribution of the outputs, was created for GP-hash. Again, the hash function must be fed with a data set. The produced output is analyzed to determine how close it is to the uniform distribution. There are many different ways of measuring how evenly a hash function distributes its outputs. We implemented three different fitness functions based on three different metrics for uniformity: Entropy, Bhattacharyya distance, and the χ² statistic.

Shannon Entropy (Shannon, 1948, 1951) is a concept from Information Theory. It can be used to measure the uncertainty of a random variable, and in GP-hash it is calculated as:

H(X) = \sum_{i=1}^{n} p(x_i) \log(1/p(x_i))

When the outputs are uniformly distributed, the Entropy H is maximal. Another fitness of this family measures the Bhattacharyya distance (Bhattacharyya, 1943) between the distribution of the outputs and the uniform distribution, calculated as:

D_B(X) = -\ln\left( \sum_{i=1}^{n} \sqrt{p(x_i) \cdot 0.5} \right)

The last function of this fitness family is based on calculating the χ² statistic for the goodness of fit between the hash output and the uniform distribution, using the well-known formula:

\chi^2 = \sum_{i=1}^{n} \frac{(x_i - \mathrm{Expected})^2}{\mathrm{Expected}}

In all the previous formulas, X is a vector {x_0, x_1, . . . , x_{n−1}}, where n is the number of possible outputs of the hash function (i.e. the number of buckets of the corresponding hash table); x_i is the number of times the hash function selected output i; and p(x_i) is the probability of x_i (i.e. x_i divided by the total number of hashed entries).

With this family of fitness functions we have the same problem as with the collision fitness: how to choose the training data set. The great difference with collisions is that the results obtained with these fitnesses and some representative data sets are excellent: the evolution curves are smooth and many interesting hashes were found. But again, we have no information about how those hashes will perform with data sets different from the training set. The conclusion is that this family of fitnesses is not the most appropriate for finding general purpose hash functions either. But they are very good at finding hashes tailored to the training data set. This clearly suggests a possible future work: using a fitness of this family to evolve hashes which are specifically optimized for a particular data set.

5.2.3. Avalanche Fitness. We also designed a fitness function based on avalanche for GP-hash. The Strict Avalanche Criterion (SAC) is the most precise measurement of avalanche, but checking whether an individual satisfies the SAC is not practical: for individuals with 32-bit inputs and outputs, that means hashing 32 ∗ 2^32 (i.e. 137,438,953,472) bitstrings for each individual, in each generation. That is a huge amount of CPU time that we cannot afford. Instead, the avalanche fitness uses a Monte Carlo simulation: it generates N random bitstrings [5] and the hash values for those bitstrings. Then, for each bitstring, it generates the 32 possible flipped bitstrings (a flipped bitstring is the original bitstring with a single bit flipped) and their hash values. Finally, the avalanche function checks the differences between h(bitstring) and each h(flippedBitstring). Then there are two possibilities (and two different fitness functions):

(1) Measure the probability p_{i,j} of each input bit i affecting every output bit j (i.e. if p_{i,j} is 0.8, that means that if input bit i changes, then output bit j changes 80% of the time). With all the probabilities, construct the Avalanche Matrix, which contains the probabilities of every input bit affecting every output bit:

AM = \begin{pmatrix} p_{0,0} & p_{0,1} & \cdots & p_{0,31} \\ p_{1,0} & p_{1,1} & \cdots & p_{1,31} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n,0} & p_{n,1} & \cdots & p_{n,31} \end{pmatrix}

For a perfect avalanche, all the probabilities must be 0.5, so we can calculate the total error (we used RMSE) and use this value as the fitness of the individual.

(2) Calculate the Hamming distances between the hash values of the original bitstrings and the corresponding flipped bitstrings. We know that those distances should follow a Binomial distribution with parameters 1/2 and n:

\forall x, y : H(x, y) = 1 \Rightarrow H(F(x), F(y)) \approx B(1/2, n)

This can be used to calculate the goodness of fit using Pearson's chi-square test:

\chi^2 = \sum_{i=1}^{N} \frac{(H_i - n/2)^2}{n/2}

Comparing χ² with a chi-square distribution with N − 1 degrees of freedom, we obtain the goodness of fit and thus the fitness of the evaluated individual.

Both methods work very well, but for the default settings of GP-hash we prefer avalanche matrices, because they offer the possibility of nice graphical representations like those shown in Figure 7. The color of the square in position (i, j) represents the probability that input bit i affects output bit j. A red square means that changes in bit i do not change bit j at all, or change it always [6] (0.0 or 1.0 probability of change). A green square means that i has a perfect influence on j (i.e. probability = 0.5).

This fitness function (in both versions) works very well, and its main feature is that it is not domain-dependent: it is a statistical measure of how well the hash function disseminates the value of each input bit into every output bit. We do not need a data set for training, because avalanche is an inherent property of the hash and the mixing function. This makes the avalanche fitness the perfect choice for this work.

[5] In our experiments we used N = 100 by default.
[6] Note that a probability of 1.0 is as bad as 0.0, because 1.0 means that the value of the output bit is determined by the input bit (every time we change the input bit, the output bit changes).
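A compact sketch of the first variant (avalanche matrix plus RMSE), in our own Java illustration with N random inputs as in the text:

    import java.util.Random;
    import java.util.function.IntUnaryOperator;

    class AvalancheFitness {
        // Estimates the 32x32 avalanche matrix of f over N random inputs and
        // returns the RMSE of its entries against the ideal probability 0.5.
        static double rmse(IntUnaryOperator f, int N) {
            double[][] count = new double[32][32];
            Random rnd = new Random();
            for (int s = 0; s < N; s++) {
                int x = rnd.nextInt();
                int hx = f.applyAsInt(x);
                for (int i = 0; i < 32; i++) {               // flip input bit i
                    int diff = hx ^ f.applyAsInt(x ^ (1 << i));
                    for (int j = 0; j < 32; j++)
                        count[i][j] += (diff >>> j) & 1;     // did output bit j change?
                }
            }
            double sse = 0.0;
            for (int i = 0; i < 32; i++)
                for (int j = 0; j < 32; j++) {
                    double e = count[i][j] / N - 0.5;        // deviation from ideal 0.5
                    sse += e * e;
                }
            return Math.sqrt(sse / (32.0 * 32.0));
        }
    }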



Figure 7. Examples of graphical representations of avalanche matrices (input bits vs. output bits) for the APartow, Bernstein, FNV-1 and Jenkins functions.

5.3. Parameter Tuning

We carried out extensive experimentation to find the best parameter set. We followed an approach similar to that of Section 5.1.3: start from an initial, arbitrary configuration based on our knowledge of the problem and our experience working with GP; then, using this basic configuration as a reference, try different changes to the parameters, looking for fitness improvements. We started our experiments with the basic configuration shown in Table 3, and we progressively introduced changes in all the important parameters: genetic operator rates (±30% for each one), tournament sizes (±5), population size (100, 200, 500 and 1000), initialization method (grow, full, and half and half), init depth interval (2-4, 2-6, 3-6 and 4-6) and size limits (25, 50 and 75 nodes). We could not find any configuration that significantly improved the average fitness of the reference tableau. Furthermore, we found out that the GP-hash system is very robust: with a large number of different parameter configurations, GP-hash keeps working fine, obtaining approximately the same average fitness and very similar best individuals. Only when using extreme values is the average fitness significantly deteriorated. This is not surprising, since GP is well known to be a very robust technique in general (Poli et al., 2008, Section 3.4).

We were especially careful in tuning the maximum number of generations: we started from 50 generations and tried raising this parameter. We found out that in GP-hash the evolution curves, as we will see in the example of Section 6, show very large fitness improvements in the earlier generations,


and very small improvements later on. This is a very typical behavior of GP populations, as stated in (Luke, 2001). The improvements obtained with long runs are not proportional to the amount of extra CPU time needed. That made us prefer the initial value of 50 generations per run. The conclusion is that, in the light of our experimental results, we can keep the reference tableau of Table 3 as the default parameters for GP-hash.

Table 3. Basic tableau for GP-hash.

Max Generations:            50
Pop. Size:                  100
Max Nodes:                  25
Terminal and function set:  {∗, ≫, ∧, hval, a0}
Fitness:                    Avalanche Matrices (RMSE)
Crossover:                  Rate = 0.8; Selection = Tournament; Tournament Size = 4
Point Mutation:             Rate = 0.1; Selection = Tournament; Tournament Size = 4
Reproduction:               Rate = 0.1; Selection = Fitness Proportional
Elitism:                    No
Initialization:             Half and half, init depth 2-4

6. EXPERIMENTAL RESULTS

We want to show the utility of GP-hash with an example of a non cryptographic hash function evolved with this system. We will generate a hash function and compare it with the state of the art: a selection of the most important hash functions of the non cryptographic hashing literature. These hashes are:

• FNV-1 and FNV-1a (Fowler et al., 1991): Two different versions of the Fowler/Noll/Vo hash function, designed by Glenn Fowler and Phong Vo in 1991 and later improved by Landon Curt Noll. There are dozens of very important software products using FNV: Linux, BSD and IRIX distributions, Twitter, Visual C++, the Symbian mobile OS, videogames, etc.
• Jenkins (Jenkins, 1997): The hash function lookup3, designed by Robert Jenkins, a world-renowned expert in non cryptographic hashes, former employee of Oracle and, according to his resume, currently working on the Cosmos team at Microsoft [7]. This hash function is probably one of the most important references in the hashing community. Some examples of companies and products that use lookup3 are Oracle, Dreamworks, Perl, Ruby and Linux.
• Hsieh's SuperFastHash (Hsieh, 2008): This hash was created by Paul Hsieh using another function by Robert Jenkins as a model. It is extraordinarily fast, but it also shows outstanding avalanche properties.
• MurmurHash2 (Appleby, 2008): Designed by Austin Appleby in 2008. Despite its short lifetime,

it has gained great renown in the hashing community. It is used by some important open source projects, like libmemcached, Maatkit, or Apache Hadoop.
• APartow (Partow, 2010): A hash function designed by Arash Partow and inspired by the ideas of many different hashes.
• Bernstein: A very efficient non cryptographic hash created by Professor Daniel J. Bernstein, who is also involved in the creation of cryptographic hashes, like CubeHash, one of the candidates to become the new SHA-3 NIST standard (NIST, 2007).
• BKDR (Kernighan and Ritchie, 1988): Proposed in the book "The C Programming Language" by Brian Kernighan and Dennis Ritchie.
• Knuth (Knuth, 1998): A hash function by Donald E. Knuth, described in Volume 3 of his book "The Art of Computer Programming".

[7] Cosmos is an internal database system used by Microsoft to perform petabyte-scale distributed data analysis. The Bing search engine internally uses Cosmos.

The source code of most of those hashes is an adaptation of the code provided by Arash Partow in his General Hash Functions Library, available online at (Partow, 2010).

In the previous section we made the three most important decisions of a GP experiment: the terminal and function set, the fitness function, and the parameters of the GP runs. Using this configuration, we carried out 100 GP-hash runs. We used avalanche matrices and RMSE for the fitness function, as described in Section 5.2.3. The number of bitstrings used to construct each avalanche matrix is 100 (N = 100). The average fitness of this battery of experiments was 0.0514, and the best overall individual found had a fitness of 0.0477. We call this automatically generated hash function gp-hash01. The pseudocode of this hash function is shown in Listing 1.

Listing 1. Pseudocode of gp-hash01

    int hash(byte[] key) {
        int hval = 0x0;                   // internal state (Merkle-Damgard chaining)
        int A0   = 0x0;                   // current 8-bit input block
        int tmp1, tmp2 = 0x0;

        for (int i = 0; i < key.length; i++) {
            A0   = key[i] & 0xFF;         // consume one byte of the message
            tmp1 = rotateRight((hval * A0), 12);
            tmp2 = ((A0 ^ hval) * A0);
            hval = tmp1 ^ tmp2;
            hval = rotateRight(hval, 1);  // rotateRight is Integer.rotateRight in Java
        }
        return hval;
    }

It uses only two rotations, two multiplications and two xors, which makes gp-hash01 very efficient. Other functions like APartow, Bernstein or FNV are even simpler, so they could be slightly more efficient. Some others, like lookup3, are far more complicated. This is clearly not a precise speed measurement; in fact, it is very difficult to compare the speed of hash functions because, as we already said, it greatly depends on the architecture on which the hashes are executed. Anyway, the extreme simplicity of gp-hash01 ensures a competitive speed.

In Figure 8 we show the avalanche matrix of gp-hash01. All the squares are green, so the avalanche of gp-hash01 is perfect, or at least very close to it. In the same figure we show the error of the avalanche matrices of gp-hash01 and the reference hash functions (measured in terms of RMSE). The RMSE of gp-hash01 is very close to zero (0.0045). Only Robert Jenkins' lookup3 (0.0027) and Murmur2 (5.25E-4) show better avalanche properties. A hash function as renowned as FNV-1 only achieves an RMSE of 0.3637.

Using HashBenchmarkTool, we tested the performance of gp-hash01 against the reference set of hash functions. We made one test for collisions and another one for the distribution of the outputs. In the collision test we used each function to hash three very different data sets. We obtained the total number of collisions generated by each function and also the average probe length (i.e. the average number of data entries that share the same hash value). In the distribution test we measured the entropy of the produced hash values.

Figure 8. (a): Avalanche matrix of gp-hash01. (b): Comparison between the avalanche matrix RMSE of gp-hash01 and the reference functions.

The data sets used in those experiments are the following:

• Passwords: 41Mb text file containing alphanumeric strings and dictionary words in 13 different languages.
• Symbols: List of compiler symbols extracted from the symbol table of the lcc compiler (Fraser et al., 1995) during the compilation of a big piece of ANSI-C code.
• Synthetic: Synthetic data set specifically created for testing non cryptographic hashes. It contains 1000 binary strings with a non-uniform distribution (i.e. it contains patterns).

Figure 9 shows the results for the Passwords dataset. All the tested hashes show a very similar performance with this dataset. Still, there are some differences: only FNV (both versions) produces a more entropic output than gp-hash01; all the other functions are slightly less entropic. The Knuth function is the exception: its performance is clearly the worst. In fact, the entropy graph of Figure 9 does not even show the entropy of Knuth, because it is so bad that it changes the scale and makes the graph unintelligible. In the collision test we find the same pattern: except for the Knuth hash, the collision rates and probe lengths of all the other hashes are very similar.

Figure 10 shows the results for the Symbols dataset. The performance of gp-hash01 with this data set is outstanding: it produces the highest entropy, the lowest collision rate, and the shortest average probe length.

Figure 11 shows the results for the Synthetic dataset. The entropy generated by gp-hash01 on this data set is 6.2, an average score. BuzHash is the most entropic hash in this case, with 6.36, and Knuth is again the least entropic, with an entropy of 4.45. In the collision test, gp-hash01 is the second best function, only after BuzHash.

Figure 9. Entropy (a), total number of collisions (b), and mean probe length (c) of gp-hash01 and the reference functions on the Passwords dataset.

Figure 10. The same measurements on the Symbols dataset.

Figure 11. The same measurements on the Synthetic dataset.

7. CONCLUSIONS AND FUTURE WORK

Hashing is of capital importance in the software industry. The possibility of finding objects in a set in constant O(1) time, independently of the size of the set, is essential for software engineers, who have been using hashing massively for the last three or four decades. However, engineers very often do not pay enough attention to the critical process of designing appropriate hash functions for their particular problems. This is understandable: designing good hash functions is a difficult process due to the extremely nonlinear constructions they use. Hash functions are designed in such a way that humans cannot easily invert them, so it is perfectly natural that these expressions are difficult to design. But the same design principles that make this process difficult for humans also seem to make it very suitable for GP: highly nonlinear domains, in which the interrelationships among the relevant variables are unknown or not completely understood, are precisely the most adequate for GP, as stated in (Poli et al., 2008).

But, surprisingly, there is not much research about the application of GP, Evolutionary Computation, or Artificial Intelligence to the design of good non cryptographic hash functions. In Section 4 we reviewed the most interesting papers on this topic that we know of. The approaches of those works have some merit, but we still think that this topic deserves a lot more research.

In this work we have shown that it is possible to use GP to automatically evolve non cryptographic hash functions. We redesigned the GP-hash system for this purpose, and we learned some important facts in the process.

GP-hash uses three different families of fitness functions, one for each important hashing property: collision rate, distribution of the outputs, and avalanche effect. Collision and distribution based fitnesses must be trained with a data set, so the generated solutions should be expected to be optimal for a specific dataset, but not for others. Those fitnesses do not seem adequate for discovering general purpose hash functions, so we decided to use only the avalanche fitness in this work. Even so, we identified a possibility for further research: applying some flavor of the distribution based fitness to generate hash functions that are specialized for the training dataset. This could allow practitioners and software engineers to stop worrying about choosing the right hash function for their specific problems. Instead, they could run GP-hash on their data sets and obtain a customized hash function, specifically (and automatically) designed for their particular problem. We want to remark that this is completely different from Perfect Hashing (Cormen, 2001): with this system it is not necessary to know the key set in advance, but only the distribution of the keys (i.e., their internal patterns), which could be learned from a subset of them.

On the other hand, we found that the performance of the collision based fitness is not up to expectations, considering that it is a popular choice in other works on the automatic generation of hash functions. We are working on a new collision fitness based on the so-called Birthday Attack (Schneier, 1996). This approach was already suggested in (Bedau et al., 2004), and the methodology is very similar to that of the classical collision fitness: we keep hashing randomly selected keys looking for collisions. The difference is that this time we measure how close the hash function gets to passing the Birthday Attack, as originally proposed in (Yuval, 1979). Furthermore, as (Wiener, 2004) states, the cost of this fitness function could be greatly reduced in CPU time and memory by implementing the parallelized version proposed in (Oorschot and Wiener, 1996) and the Pollard's rho method for memory reduction proposed in (Pollard, 1978).
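Since this fitness is still future work, the following C sketch only illustrates the idea under our own assumptions: the candidate is a stand-in mixer truncated to 24 bits so that a naive bitmap can detect repeated hash values, and the fitness is the ratio between the observed mean number of keys hashed before the first repeat and the birthday bound sqrt(pi/2 * 2^b) for a b-bit hash.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <math.h>

    #define BITS 24                     /* truncated width keeps the bitmap small */
    #define MASK ((1u << BITS) - 1)

    /* Stand-in for the evolved candidate under evaluation (NOT gp-hash01). */
    static uint32_t hash32(uint32_t x) {
        x ^= x >> 16; x *= 0x7FEB352Du;
        x ^= x >> 15; x *= 0x846CA68Bu;
        return x ^ (x >> 16);
    }

    /* Hash random keys until a hash value repeats; return how many keys it
       took. (Drawing the same key twice also counts as a repeat; with 32-bit
       random keys this is a negligible simplification.) */
    static unsigned long time_to_collision(unsigned char *seen) {
        memset(seen, 0, (MASK + 1ul) / 8);
        unsigned long n = 0;
        for (;;) {
            uint32_t h = hash32(((uint32_t)rand() << 16) ^ (uint32_t)rand()) & MASK;
            n++;
            if (seen[h >> 3] & (1u << (h & 7u))) return n;  /* repeat found */
            seen[h >> 3] |= (unsigned char)(1u << (h & 7u));
        }
    }

    int main(void) {
        enum { RUNS = 50 };
        unsigned char *seen = malloc((MASK + 1ul) / 8);
        if (!seen) return 1;
        double sum = 0.0;
        for (int r = 0; r < RUNS; r++) sum += (double)time_to_collision(seen);
        free(seen);
        /* Birthday bound: an ideal BITS-bit hash first repeats after about
           sqrt(pi/2 * 2^BITS) random keys (roughly 5133 for 24 bits). */
        double bound = sqrt(3.141592653589793 / 2.0 * pow(2.0, BITS));
        printf("birthday fitness = %.3f (1.0 = ideal)\n", (sum / RUNS) / bound);
        return 0;
    }

A score near 1.0 means the candidate resists the attack like an ideal random mapping, while premature collisions push the score toward zero. A real implementation would replace the bitmap with the parallelized search or the Pollard's rho method cited above to cut memory and CPU time.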
Concerning the terminal and function sets, we gathered together ten of the most important functions of the hashing literature and the software industry. We studied the operators and variables they use to generate a basic terminal and function set, and then we applied a methodology similar to that of (Wang and Soule, 2004) to refine this set. We discovered some interesting facts: first, that magic constants are not needed to evolve hashes with a high avalanche effect; and second, that two very popular operators, addition and xor, form a group, so only one of them is needed (this is intuitively supported by the fact that hash functions that do not use addition always use xor, and vice versa). These two discoveries could help other researchers who want to apply Evolutionary Algorithms to hashing, but they also suggest to hashing experts that magic constants may not be necessary in the construction of non cryptographic hashes.

We also found that the GP-hash system is highly robust and can work well with very different parameter configurations. This also supports the accepted idea that GP is, in general, a very robust technique.

Finally, we wanted to demonstrate the utility of GP-hash with a practical example. We carried out a battery of experiments and selected the best overall generated hash function, which we called gp-hash01. This hash achieves outstanding avalanche properties, it is composed of only six operations (which ensures its efficiency), and it competes in terms of generated entropy and collision rate with a selection of the most renowned non cryptographic hash functions, most of them currently in use in important projects of top class companies and Open Source communities. However, we want to stress that gp-hash01 is not a special case: during our experiments, GP-hash generated dozens of hashes with almost the same fitness and similar avalanche, entropy, and collision properties. All these facts support the central claim of this work: that GP, when using the avalanche fitness and an appropriate function and terminal set, is able to generate non cryptographic hash functions that are similar to those generated by hashing experts with years of experience.


REFERENCES

Appleby, A. (2008). Murmurhash 2.0.
Bedau, M. A., Crandall, R., and Raven, M. (2004). Cryptographic hash functions based on artificial life. Unpublished manuscript.
Berarducci, P., Jordan, D., Martin, D., and Seitzer, J. (2004). Gevosh: Using grammatical evolution to generate hashing functions. In E. G. Berkowitz, editor, MAICS, pages 31–39. Omnipress.
Bhattacharyya, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society, 35, 99–110.
Caldwell, C. (1994–2009). The prime pages.
Cormen, T. H. (2001). Introduction to Algorithms. The MIT Press.
Damgård, I. (1990). A design principle for hash functions. In CRYPTO ’89: Proceedings of the 9th Annual International Cryptology Conference on Advances in Cryptology, pages 416–427, London, UK. Springer-Verlag.
Damiani, E. and Tettamanzi, A. G. B. (1999). On-line evolution of FPGA-based circuits: A case study on hash functions. In Proceedings of the First NASA/DoD Workshop on Evolvable Hardware, pages 26–33. IEEE Computer Society.
Damiani, E., Liberali, V., and Tettamanzi, A. (1998). Evolutionary design of hashing function circuits using an FPGA. In ICES ’98: Proceedings of the Second International Conference on Evolvable Systems, pages 36–46, London, UK. Springer-Verlag.
Estébanez, C. and Reta, J. I. (2010). Introducing progen.
Estébanez, C., Castro, J. C. H., Ribagorda, A., and Isasi, P. (2006a). Evolving hash functions by means of genetic programming. In M. Cattolico, editor, GECCO, pages 1861–1862. ACM.
Estébanez, C., Castro, J. C. H., Ribagorda, A., and Viñuela, P. I. (2006b). Finding state-of-the-art non-cryptographic hashes with genetic programming. In T. P. Runarsson, H.-G. Beyer, E. K. Burke, J. J. M. Guervós, L. D. Whitley, and X. Yao, editors, PPSN, volume 4193 of Lecture Notes in Computer Science, pages 818–827. Springer.
Estévez-Tapiador, J. M., Castro, J. C. H., Peris-Lopez, P., and Ribagorda, A. (2008). Automated design of cryptographic hash schemes by evolving highly-nonlinear functions. J. Inf. Sci. Eng., 24(5), 1485–1504.
Feistel, H. (1973). Cryptography and computer privacy. Scientific American, 228(5), 15–23.
Fowler, G., Vo, P., and Noll, L. C. (1991). Fowler / Noll / Vo (FNV) hash.
Fraser, C., Hansen, D., and Hanson, D. (1995). A Retargetable C Compiler: Design and Implementation. Addison-Wesley Professional.
Goodrich, M. T. and Tamassia, R. (2009). Algorithm Design: Foundations, Analysis and Internet Examples. John Wiley & Sons, Inc., New York, NY, USA.
Gordon, T. G. W. and Bentley, P. J. (2002). On evolvable hardware. In S. Ovaska and L. Sztandera, editors, Soft Computing in Industrial Electronics, pages 279–323. Physica-Verlag.
Heileman, G. L. (1998). Estructuras de datos, algoritmos y programación orientada a objetos. McGraw-Hill.
Hsieh, P. (2004–2008). Hash functions.
Hussain, D. and Malliaris, S. (2000). Evolutionary techniques applied to hashing: An efficient data retrieval method. In L. D. Whitley, D. E. Goldberg, E. Cantú-Paz, L. Spector, I. C. Parmee, and H.-G. Beyer, editors, GECCO, page 760. Morgan Kaufmann.
Jenkins, R. J. (1997). Hash functions for hash table lookup. Dr. Dobb’s Journal. http://burtleburtle.net/bob/hash/evahash.html.
Kernighan, B. W. and Ritchie, D. (1988). The C Programming Language, Second Edition. Prentice-Hall.
Knott, G. D. (1975). Hashing functions. Comput. J., 18(3), 265–278.
Knuth, D. E. (1998). The Art of Computer Programming, Volume 3: Sorting and Searching (2nd Edition). Addison-Wesley Professional.
Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA.
Langdon, W. B. and Poli, R. (2002). Foundations of Genetic Programming. Springer-Verlag.
Luke, S. (2001). When short runs beat long runs. In L. Spector, E. D. Goodman, A. Wu, W. B. Langdon, H.-M. Voigt, M. Gen, S. Sen, M. Dorigo, S. Pezeshk, M. H. Garzon, and E. Burke, editors, Proceedings of the Third Genetic and Evolutionary Computation Conference (GECCO 2001), pages 74–80, San Francisco, CA. Morgan Kaufmann Publishers, Inc.
Merkle, R. C. (1989). One way hash functions and DES. In CRYPTO ’89: Proceedings on Advances in Cryptology, pages 428–446, New York, NY, USA. Springer-Verlag New York, Inc.
Miyaguchi, S., Ohta, K., and Iwata, M. (1990). 128-bit hash function (N-hash). NTT Review, 2(6), 128–132.
Mulvey, B. (2007). Hash functions.
NIST (2007). National Institute of Standards and Technology: Announcing request for candidate algorithm nominations for a new cryptographic hash algorithm (SHA-3) family. Federal Register.
O’Neill, M. and Ryan, C. (2003). Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language, volume 4 of Genetic Programming. Kluwer Academic Publishers.
Oorschot, P. C. V. and Wiener, M. J. (1996). Parallel collision search with cryptanalytic applications. Journal of Cryptology, 12, 1–28.
Partow, A. (2010). General purpose hash function algorithms.
Poli, R., Langdon, W. B., and McPhee, N. F. (2008). A Field Guide to Genetic Programming. Published via http://www.lulu.com. (With contributions by John R. Koza.) Available for download at: http://www.gp-field-guide.org.uk.
Pollard, J. M. (1978). Monte Carlo methods for index computation (mod p). Mathematics of Computation, 32(143), 918–924.
Preneel, B. (1993). Analysis and design of cryptographic hash functions. Ph.D. thesis, Katholieke Universiteit Leuven.
Punch, B. and Zongker, D. (1998). lil-gp genetic programming system.
Ryan, C., Collins, J., and O’Neill, M. (1998). Grammatical evolution: Evolving programs for an arbitrary language. In Lecture Notes in Computer Science 1391, Proceedings of the First European Workshop on Genetic Programming, pages 83–95. Springer-Verlag.
Safdari, M. (2009). Evolving universal hash functions using genetic algorithms. In F. Rothlauf, editor, GECCO (Companion), pages 2729–2732. ACM.
Schneier, B. (1996). Applied Cryptography (Second Edition). John Wiley & Sons.
Sedgewick, R. (1995). Algoritmos en C++. Addison-Wesley Iberoamericana.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Shannon, C. E. (1951). Prediction and entropy of printed English. The Bell System Technical Journal, 30, 50–64.
Sipper, M., Sanchez, E., Mange, D., Tomassini, M., Perez-Uribe, A., and Stauffer, A. (1997). A phylogenetic, ontogenetic, and epigenetic view of bio-inspired hardware systems. IEEE Transactions on Evolutionary Computation, 1(1), 83–97.
Snasel, V., Abraham, A., Dvorsky, J., Ochodkova, E., Platos, J., and Kromer, P. (2009). Searching for quasigroups for hash functions with genetic algorithms. In World Congress on Nature & Biologically Inspired Computing (NaBIC 2009), pages 367–372.
Talbi, E.-G. (2009). Metaheuristics: From Design to Implementation; electronic version. Wiley, Hoboken, NJ.
Valloud, A. (2008). Hashing in Smalltalk: Theory and Practice. Self-published (www.lulu.com).
Wang, G. and Soule, T. (2004). How to choose appropriate function sets for genetic programming. In M. Keijzer, U.-M. O’Reilly, S. M. Lucas, E. Costa, and T. Soule, editors, Genetic Programming, volume 3003 of Lecture Notes in Computer Science, pages 198–207. Springer Berlin / Heidelberg.
Wang, T. (2007). Integer hash function.
Webster, A. F. and Tavares, S. E. (1986). On the design of S-boxes. In Advances in Cryptology – CRYPTO ’85, volume 218 of Lecture Notes in Computer Science, pages 523–534, New York, NY, USA. Springer-Verlag New York, Inc.
Widiger, H., Salomon, R., and Timmermann, D. (2006). Packet classification with evolvable hardware hash functions - an intrinsic approach. In A. J. Ijspeert, T. Masuzawa, and S. Kusumoto, editors, BioADIT, volume 3853 of Lecture Notes in Computer Science, pages 64–79. Springer.
Wiener, M. J. (2004). The full cost of cryptanalytic attacks. Journal of Cryptology, 17(2), 105–124.
Yuval, G. (1979). How to swindle Rabin. Cryptologia, 3(3), 187–191.

LIST OF FIGURES
1 Example of a typical hash function: input values can have any length; outputs are 32-bit values; the last two inputs only differ in a few letters, but their outputs are completely different.
2 Merkle-Damgård construction scheme.
3 A hash function h with a nice avalanche effect.
4 Basic schema of a GP run.
5 Example of a GP individual represented as a tree and its equivalent syntax expression.
6 Bitwise shift and rotation operators.
7 Examples of graphical representation of avalanche matrices.
8 (a): Avalanche matrix of gp-hash01. (b): Comparison between the avalanche matrix RMSE of gp-hash01 and the reference functions.
9 Entropy (a), total number of collisions (b), and mean probe length (c) of gp-hash01 and Passwords dataset.
10 Entropy (a), total number of collisions (b), and mean probe length (c) of gp-hash01 and Symbols dataset.
11 Entropy (a), total number of collisions (b), and mean probe length (c) of gp-hash01 and Synthetic dataset.

LIST OF TABLES
1 Operators used by some state of the art non cryptographic hashes.
2 Average results of 50 GP-hash runs with different Terminal and Function Sets.
3 Basic Tableau for GP-hash.