Discovery of Symbolic, Neuro-Symbolic and Neural Networks with Parallel Distributed Genetic Programming

Riccardo Poli
School of Computer Science
The University of Birmingham
Birmingham B15 2TT, United Kingdom
E-mail: [email protected]

Technical Report: CSRP-96-14
August 1996

Abstract

Genetic Programming is a method of program discovery consisting of a special kind of genetic algorithm capable of operating on parse trees representing programs, and an interpreter which can run the programs being optimised. This paper describes Parallel Distributed Genetic Programming (PDGP), a new form of genetic programming which is suitable for the development of parallel programs in which symbolic and neural processing elements can be combined in a free and natural way. PDGP is based on a graph-like representation for parallel programs which is manipulated by crossover and mutation operators which guarantee the syntactic correctness of the offspring. The paper describes these operators and reports some results obtained with the exclusive-or problem.

1 Introduction

Genetic Programming (GP) is an extension of Genetic Algorithms (GAs) in which the structures that make up the population to be optimised are not fixed-length character strings that encode possible solutions to a problem, but programs that, when executed, are the candidate solutions to the problem [7, 8]. Programs are expressed in GP as parse trees, rather than as lines of code. For example, the simple expression max(x * y, 3 + x * y) would be represented as shown in Figure 1. The basic search algorithm used in GP is a classical GA with mutation and crossover specifically designed to handle parse trees.

The set of possible internal (non-leaf) nodes used in GP parse trees is called the function set, F = {f_1, ..., f_{N_F}}. F can include almost any kind of programming construct: arithmetic operators, mathematical and Boolean functions, conditionals, looping constructs, procedures with side effects, etc. The set of terminal (leaf) nodes in the parse trees is called the terminal set, T = {t_1, ..., t_{N_T}}. T can include: variables, constants, 0-arity functions with side effects, random constants, etc.

This form of GP has been applied successfully to a large number of difficult problems like automated design, pattern recognition, robot control, symbolic regression, music generation, image compression, image analysis, etc. [7, 8, 5, 6, 1, 11]. When appropriate terminals, functions and/or interpreters are defined, standard GP can go beyond the production of sequential tree-like programs. For example, using cellular encoding, GP can be used to develop structures, like neural nets [3, 4] or electronic circuits [10, 9], which can be thought of as performing some form of parallel computation. Also, in conjunction with an interpreter implementing a parallel virtual machine, GP can be used to develop special kinds of parallel programs [2, 14] or to translate sequential programs into parallel ones [15].

This paper describes Parallel Distributed Genetic Programming (PDGP), a new form of genetic programming which is specialised in the development of parallel programs in which symbolic, numeric and neural processing elements can be combined in a totally free and natural way. PDGP is based on a graph-like representation for parallel programs and genetic operators which guarantee the syntactic correctness of the offspring. In the following sections the representation and operators used in PDGP are described, and some results obtained by applying this paradigm to the XOR problem are reported.
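To make the tree representation concrete, the following minimal Python sketch (our own illustration; the paper contains no code) encodes max(x * y, 3 + x * y) as a nested tuple and evaluates it recursively over a small function set:

    import operator

    # Function set: maps node labels to Python callables.
    F = {
        'max': max,
        '+': operator.add,
        '*': operator.mul,
    }

    # max(x * y, 3 + x * y) as a parse tree of nested tuples;
    # leaves are variable names or numeric constants.
    tree = ('max', ('*', 'x', 'y'), ('+', 3, ('*', 'x', 'y')))

    def eval_tree(node, env):
        if isinstance(node, tuple):       # internal node: apply its function
            label, *args = node
            return F[label](*(eval_tree(a, env) for a in args))
        if isinstance(node, str):         # terminal: variable lookup
            return env[node]
        return node                       # terminal: numeric constant

    print(eval_tree(tree, {'x': 2, 'y': 5}))   # max(10, 13) -> 13

Note that the sub-expression x * y appears, and is evaluated, twice; the graph representation of Section 2 removes exactly this redundancy.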

2 Representation

Taking inspiration from the parallel distributed processing performed in neural nets [13], we represent fine-grained parallel programs as graphs with labelled nodes and oriented links. The nodes are the functions and terminals used in the program, while the links determine which arguments are used by each function-node when it is next evaluated.

Figure 1: Parse-tree representation of the expression max(x * y, 3 + x * y).

Figure 2: Graph-like representation of the expression max(x * y, 3 + x * y).

Figure 2 shows an example of a parallel distributed program represented as a graph. The program implements the same function as the one shown in Figure 1, i.e. max(x * y, 3 + x * y). Its execution should be imagined as a "wave of computations" starting from the terminals and propagating upwards along the graph, more or less like the updating of the activations of the neurons in a multi-layer feed-forward neural net. This tiny-scale example shows that graph-like representations of programs can be more compact (in terms of number of nodes) and more efficient (the sub-expression x * y is computed only once instead of twice) than tree-like representations.

However, the direct handling of graphs within a genetic algorithm presents some problems. PDGP uses a direct representation of graphs which, although not completely general, allows the definition of crossover operators which always produce valid offspring (without requiring any repair operation) in a very efficient way. The representation is based on the idea of assigning each node in the graph to a physical location in a multi-dimensional (evenly spaced) grid with a pre-fixed (regular or irregular) shape, and limiting the connections between nodes to be upwards. Also, connections can only be established between nodes belonging to adjacent rows, like the connections in a standard feed-forward multi-layer neural network. This representation for parallel distributed programs is illustrated in Figure 3, where we assumed that the program has a single output at coordinates (0,0) (the y axis is pointing downwards) and that the grid is two-dimensional and includes 6 x 6 + 1 cells (we adopt the convention that the first row of the grid includes as many cells as the number of outputs of the program).

By adding the identity function (i.e. a wire or pass-through node) to the function set, any parallel distributed program (i.e. any directed acyclic graph) can be rearranged so that it can be described with this grid-like graph representation. For example, the program shown in Figure 2 can be transformed into the layered network in Figure 4.

In order to study all the possibilities offered by our network-based representation of programs, we decided to expand the representation described above to explicitly include introns ("unexpressed" parts of code). In particular, we assumed that once the size and shape of the grid is fixed, a function or a terminal is associated with every node in the grid, i.e. also with the nodes that are not directly or indirectly connected with the output (the introns).


Figure 3: Grid-based representation of graphs representing programs in PDGP.

Figure 4: Grid-like representation of the expression max(x * y, 3 + x * y).

Figure 5: Intron-augmented representation of programs in PDGP.

For example, the program shown in Figure 3 could have an expanded representation like the one in Figure 5. In some experiments we also extended the representation by associating labels with links. If the labels are real numbers, this technique allows the direct development of neural networks.
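To fix ideas, here is a minimal Python sketch of such a grid-based genome (an illustration with hypothetical names, not the actual PDGP implementation): each cell holds a primitive label and the column indices of its arguments in the row below, and the program is run as a bottom-up wave of computation, like the forward pass of a layered neural network.

    # Illustrative sketch of a PDGP-style grid genome (hypothetical names).
    # rows[0] is the output row; each cell is (label, columns_in_row_below).
    PRIMITIVES = {
        '+':   lambda a, b: a + b,
        '*':   lambda a, b: a * b,
        'max': max,
        'I':   lambda a: a,      # identity: a "wire" / pass-through node
    }

    # The layered version of max(x * y, 3 + x * y) (cf. Figure 4):
    # x * y is computed once and shared by 'max' (via a wire) and '+'.
    rows = [
        [('max', [0, 1])],            # row 0: max(wire, plus)
        [('I', [0]), ('+', [1, 0])],  # row 1: wire for x*y; 3 + (x*y)
        [('*', [0, 1]), ('3', [])],   # row 2: x*y; constant 3
        [('x', []), ('y', [])],       # row 3: terminals
    ]

    def run(rows, env):
        below = []                               # outputs of the row below
        for row in reversed(rows):               # bottom-up "wave"
            values = []
            for label, links in row:
                if label in PRIMITIVES:
                    values.append(PRIMITIVES[label](*(below[c] for c in links)))
                elif label in env:
                    values.append(env[label])    # variable terminal
                else:
                    values.append(float(label))  # constant terminal
            below = values
        return below[0]                          # single output at (0, 0)

    print(run(rows, {'x': 2, 'y': 5}))           # -> 13.0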

3 Genetic Operators

With the representations described in the previous section, several kinds of crossover and mutation can be defined (see [12] for a more complete description). The crossover operator most similar to the one used in standard GP is called Sub-graph Active-Active Node (SAAN) crossover. It works as follows (a simplified sketch of the insertion step is given at the end of this section):

1. A random active node is selected in the first parent.

2. The sub-graph including all the active nodes which are used to compute the output value of the selected node is extracted, and its height h and width w are determined.

3. An active node in the second parent is selected such that its y coordinate is compatible with the height of the sub-graph, i.e. y_max - y > h.

4. The sub-graph is inserted in the second parent to generate the offspring. If the x coordinate of the insertion node in the second parent is not compatible with the width of the sub-graph, the sub-graph is wrapped around.

Figure 6: Sub-graph Active-Active Node (SAAN) crossover (inactive nodes are not shown).

An example of SAAN crossover is shown in Figure 6. The idea behind this form of crossover is that connected sub-graphs are functional units whose output is used by other functional units. Therefore, by replacing a sub-graph with another sub-graph, we tend to explore different ways of combining the functional units discovered during evolution. So, sub-graphs act as building blocks. It should be noted that in SAAN crossover inactive nodes play no role (for this reason they are not shown in the figure).

Several different forms of crossover can be defined by modifying SAAN crossover. In this paper we have adopted the Sub-graph Inactive-Active Node (SIAN) crossover, in which the crossover point in the second parent is randomly selected among the active nodes, while the crossover point in the first parent is randomly chosen among all nodes (active or inactive).

The standard GP technique of defining mutation as the swapping of a randomly selected sub-tree in an individual with a new randomly generated tree can be naturally applied in PDGP as well. We call this form of mutation global mutation. It is also possible to define another form of mutation, link mutation, which makes local modifications to the connection topology of the graph [12].
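The wrap-around insertion at the heart of SAAN crossover can be sketched as follows. This is a deliberately simplified illustration with hypothetical names: the real operator also extracts the sub-graph feeding the selected active node, checks height compatibility as in step 3, and handles link rewiring (see [12]).

    import copy

    def insert_subgraph(parent2, block, at_row, at_col):
        # Copy `block` (a list of rows of cells, the sub-graph taken from
        # parent 1) into a copy of parent 2, anchored at (at_row, at_col).
        # Columns that overflow the grid width wrap around, as in Figure 6.
        # Height compatibility (step 3) is assumed to have been checked.
        child = copy.deepcopy(parent2)
        for dy, block_row in enumerate(block):
            row = child[at_row + dy]
            for dx, cell in enumerate(block_row):
                row[(at_col + dx) % len(row)] = cell   # wrap-around in x
        return child

In the full operator, at_row and at_col would be the coordinates of the crossover point chosen in the second parent.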

Figure 7: Symbolic networks implementing the exclusive-or function with Boolean processing elements (introns are drawn with thin lines).

4 Experimental Results

In this section we report on some preliminary experimental results obtained by applying PDGP to the exclusive-or problem. The problem is finding a parallel distributed program that implements the function XOR(x1, x2) = 1 if x1 != x2, 0 otherwise.

In all the experiments reported below, the population size was P=200 individuals, the maximum number of generations was G=20, the crossover probability was 0.7, the global mutation probability was 0.25, and the link mutation probability was 0.25. The GA used tournament selection with tournament size 7. The other parameters were the "grow" initialisation method and SIAN crossover. The fitness of a solution was the number of correct predictions of the entries in the XOR truth table.
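As a sketch, the fitness measure just described amounts to the following (run_program is a hypothetical stand-in for the PDGP interpreter):

    # Fitness sketch: the number of correct predictions over the XOR
    # truth table (maximum fitness = 4).
    XOR_CASES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    def fitness(individual, run_program):
        # run_program(individual, x1, x2) is assumed to return the
        # program's (possibly thresholded) output for inputs x1, x2.
        return sum(int(run_program(individual, x1, x2) == target)
                   for (x1, x2), target in XOR_CASES)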

4.1 Logic solutions


In these experiments we used the function set F = {AND, OR, NAND, NOR, I} (I is the identity function) and the terminal set T = {x1, x2}. Figure 7 shows some typical solutions to the XOR problem obtained with PDGP. The figure shows the active nodes in bold and the active links as thick lines. All the rest are introns. (It should be noted that in the figure the output node, which, having coordinates (0,0), should be in the top-left corner, is actually centred horizontally for displaying purposes.)

In order to assess the behaviour of PDGP on this problem we performed 20 runs (with different seeds for the random number generator) with three different grid sizes: 2 x 2, 2 x 3 and 3 x 4. One of the criteria we used to assess the performance of PDGP was the computational effort E used in the GP literature (E is the number of fitness evaluations necessary to get a correct program, in multiple runs, with probability 99%). As the effort of evaluating each individual (at least on a sequential machine) depends on the number of nodes it includes, we also used as a criterion the total number of nodes N to be evaluated in order to get a solution with 99% probability. The results are summarised in the following table:

    Grid size      E        N
    2 x 2      5,200   26,000
    2 x 3      4,200   29,400
    3 x 4      1,600   20,800

These results indicate that increasing the size of the grid considerably reduces the number of fitness evaluations necessary to get a solution. This seems reasonable considering that smaller grids impose harder constraints on the search. However, if we look at the results in terms of the total number of nodes to be evaluated, the advantage of larger grids is not so clear. In fact, it may be worth spending a bit more in terms of nodes to be evaluated and getting a better solution in terms of size, execution speed and generalisation.
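For reference, E is computed here in the standard way introduced in [7]: if P(M, i) is the observed fraction of runs (population size M) that have found a solution by generation i, the number of independent runs needed to succeed with probability z = 0.99 is R(z) = ceil(ln(1 - z) / ln(1 - P(M, i))), and E is the minimum over i of M * (i + 1) * R(z). A small sketch:

    import math

    def computational_effort(success_by_gen, M, z=0.99):
        # success_by_gen[i]: fraction of runs solved by generation i.
        best = float('inf')
        for i, p in enumerate(success_by_gen):
            if p <= 0:
                continue                 # no run has succeeded yet
            runs = 1 if p >= 1 else math.ceil(math.log(1 - z) / math.log(1 - p))
            best = min(best, M * (i + 1) * runs)
        return best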

4.2 Algebraic solutions

In these experiments we used the function set F = {+, -, *, PDIV, I} (PDIV is the protected division, which returns its first argument if the second is 0) and the terminal set T = {x1, x2}. In these, and the following, experiments the output is considered to be 1 if it is greater than 0.5, and 0 otherwise. Figure 8 shows some typical solutions to the XOR problem obtained with PDGP using algebraic operators. (For the sake of clarity, in this and the following figures introns are not drawn: they are simply represented by small crosses. It should also be noted that in general the interpretation of the non-commutative nodes, like - and PDIV, requires knowledge of the order of evaluation of their incoming links. However, for clarity, we have preferred not to add this information to the figures, as the order of evaluation can easily be inferred given the simplicity of the examples reported.)

Also in this case we performed 20 runs with three different grid sizes. The results are summarised in the following table:

    Grid size      E        N
    2 x 2      2,400   12,000
    2 x 3      2,800   19,600
    3 x 4      1,600   20,800

These results indicate that algebraic operators make the search easier and that, again, larger grids reduce the number of fitness evaluations but not the number of nodes to be evaluated.
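The two conventions specific to these runs, protected division as defined above and the 0.5 output threshold, can be written as (sketch):

    def pdiv(a, b):
        # Protected division as defined in the text: return the first
        # argument when the divisor is 0 (other GP work often returns 1).
        return a if b == 0 else a / b

    def as_boolean(output):
        # Output convention for the algebraic and neural runs.
        return 1 if output > 0.5 else 0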


Figure 8: Algebraic network-like realisations of the exclusive-or function.

Figure 9: Exclusive-or implementations based on neuro-algebraic parallel distributed programs.

4.3 Neuro-Algebraic solutions

In these experiments we used the same function set and terminal set as in Section 4.2, but we added random weights in the range [-1, 1] to the links. The weights act as pre-multipliers for the arguments of the functions in F. Figure 9 shows some typical solutions to the XOR problem obtained with PDGP using neuro-algebraic operators. The results obtained in 60 runs with this paradigm are reported in the following table:

    Grid size       E        N
    2 x 2       9,600   48,000
    2 x 3      12,000   84,000
    3 x 4       7,000   91,000

The table indicates that the use of random weights makes the search much harder and that, again, increasing the size of the grid reduces the number of fitness evaluations but not necessarily the number of nodes to be evaluated. This might seem surprising at first, as in theory by adding weights to the connections we can explore a much larger space of programs. However, enlarging the search space does not necessarily increase the frequency of solutions. As PDGP can explore only part of the search space, it is quite likely that, at least for this problem, by adding weights we make a less efficient use of the available resources (the fitness evaluations).
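In this variant each incoming link carries a weight that pre-multiplies the corresponding argument before the node's function is applied; a one-line sketch (hypothetical names):

    def eval_weighted_node(f, links, below):
        # links: (source_column, weight) pairs, with weights from [-1, 1];
        # each weight pre-multiplies its argument before f is applied.
        return f(*(w * below[src] for src, w in links))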


Figure 10: Weight-less neural networks implementing the exclusive-or function.

4.4 Weight-less neural solutions

In these experiments we used the function set F = {+, -, S2, S3, P2, P3, I}, where + and - are introduced to simulate linear neurons, S2 and S3 are neurons with a sigmoid activation function, and P2 and P3 are product (Pi) neurons which compute the product of their inputs. The terminal set also included a random constant generator, to create biases in the range [-1.0, +1.0]. The links had no weights. Figure 10 shows some typical solutions to the XOR problem obtained with these operators, while the following table reports the results obtained in 60 runs:

    Grid size       E        N
    2 x 2      10,200   54,000
    2 x 3       6,800   47,600
    3 x 4       6,000   78,000
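The neural primitives of this section can be sketched as follows (an assumption on our part: the paper does not give the exact form of the sigmoid, so the standard logistic function of the summed inputs is used):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    S2 = lambda a, b: sigmoid(a + b)         # 2-input sigmoid neuron
    S3 = lambda a, b, c: sigmoid(a + b + c)  # 3-input sigmoid neuron
    P2 = lambda a, b: a * b                  # 2-input product (Pi) neuron
    P3 = lambda a, b, c: a * b * c           # 3-input product (Pi) neuron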

These results suggest that the use of neurons instead of Boolean or algebraic nodes makes the search harder. The reason for the degradation in performance might be related to the limited expressive power of neurons with respect to other classes of functions, at least for Boolean classification problems. The table also shows that, again, increasing the size of the grid is not advantageous.

4.5 Neural solutions

In these experiments we used the same function set and terminal set as in Section 4.4, but we added weights (in the range [-1, 1]) to the links to obtain a more standard form of neural nets. Figure 11 shows some typical solutions to the XOR problem obtained with these operators. The results obtained in 60 runs are summarised in the following table:

    Grid size        E          N
    2 x 2      342,000  1,710,000
    2 x 3      378,000  2,646,000
    3 x 4       46,200    600,600

Figure 11: Neural realisations of the exclusive-or function.

As expected given the results reported in the previous sections, the combination of the negative effects of weights and neural processing elements has led to a considerable degradation of the performance of PDGP. However, in this case a large grid does seem to offer a significant relative advantage.

5 Discussion and Conclusions

In this paper we have presented PDGP, a new form of genetic programming which is suitable for the automatic discovery of parallel network-like programs in which symbolic and sub-symbolic (neural, numeric, etc.) primitives can be combined in a free and natural way. The grid-like representation of programs used in PDGP allowed us to develop efficient forms of crossover and mutation. By changing the size, shape and dimensionality of the grid, this representation allows fine control over the size, efficiency and degree of parallelism of the programs being developed.

In this paper we have studied the representational capabilities offered by PDGP using a simple problem: learning the exclusive-or function. The results with this problem are very promising, as they clearly show how PDGP can explore an entirely new space of programs in which non-recurrent neural nets and classical tree-like programs are just special cases. Some work is currently being devoted to removing some of the constraints we are for now imposing on the graphs produced by PDGP (e.g. connections limited to adjacent layers, acyclicity), to give PDGP an even bigger representational power.

The results with XOR are promising in other respects as well. For example, the very small number of fitness evaluations required to develop symbolic and algebraic networks (comparable with the computational effort required by the standard backpropagation algorithm) suggests that evolutionary methods can finally compete with other well-established machine learning techniques in terms of efficiency. This result, together with the small size of the populations used in this work, suggests that PDGP will be able to scale up and face much harder problems (this is confirmed by preliminary experiments with the even-3 and even-4 parity functions, the Iris classification problem, the Monks problems, the encoder-decoder, etc.). However, the increased computational effort required to develop programs with weighted links seems to indicate that the operators used by PDGP to optimise the topology and the processing elements of parallel distributed programs are probably ineffective in optimising the connection weights. Specialised operators which should solve these problems, like weight mutation or gradient-based tuning, are currently being tested.

In the paper we have used a form of program interpretation similar to the propagation of activation in the neurons of feed-forward neural nets. This is quite suitable for programs which do not include functions or terminals with side effects. If this is the case, our grid-based representation of programs can be directly mapped onto the nodes of fine-grained parallel machines, thus leading to very efficient implementations of PDGP programs capable of producing a new result at every clock tick. However, a different form of program interpretation, which solves the problems generated by the use of terminals and functions with side effects, has been developed and will be described in a later paper.

Finally, it should be noted that PDGP has quite a broad applicability, which goes beyond programming. In fact, the graphs produced by PDGP need not be interpreted as programs: they can be seen as designs, semantic nets, transition nets, belief nets, etc. For example, PDGP programs (possibly developed with some additional constraints) could be used to define the functions of field programmable gate arrays, and therefore PDGP can be considered as a tool for the evolution of hardware as well as software.

Acknowledgements The author wishes to thank the members of the EEBIC (Evolutionary and Emergent Behaviour Intelligence and Computation) group for useful discussions and comments. This research is partially supported by a grant under the British Council-MURST/CRUI agreement.

References

[1] Late Breaking Papers at the Genetic Programming 1996 Conference, Stanford University, July 1996. Stanford Bookstore.

[2] Forrest H. Bennett III. Automatic creation of an efficient multi-agent architecture using genetic programming with architecture-altering operations. In John R. Koza, David E. Goldberg, David B. Fogel, and Rick L. Riolo, editors, Genetic Programming 1996: Proceedings of the First Annual Conference, page 30, Stanford University, CA, USA, 28-31 July 1996. MIT Press.

[3] F. Gruau and D. Whitley. Adding learning to the cellular development process: a comparative study. Evolutionary Computation, 1(3):213-233, 1993.

[4] Frederic Gruau. Genetic micro programming of neural networks. In Kenneth E. Kinnear, Jr., editor, Advances in Genetic Programming, chapter 24. MIT Press, 1994.

[5] K. E. Kinnear, Jr., editor. Advances in Genetic Programming. MIT Press, 1994.

[6] J. R. Koza, D. E. Goldberg, D. B. Fogel, and R. L. Riolo, editors. Proceedings of the First International Conference on Genetic Programming, Stanford University, July 1996. MIT Press.

[7] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.

[8] John R. Koza. Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press, Cambridge, Massachusetts, 1994.

[9] John R. Koza, David Andre, Forrest H. Bennett III, and Martin A. Keane. Use of automatically defined functions and architecture-altering operations in automated circuit synthesis using genetic programming. In John R. Koza, David E. Goldberg, David B. Fogel, and Rick L. Riolo, editors, Genetic Programming 1996: Proceedings of the First Annual Conference, page 132, Stanford University, CA, USA, 28-31 July 1996. MIT Press.

[10] John R. Koza, Forrest H. Bennett III, David Andre, and Martin A. Keane. Automated WYWIWYG design of both the topology and component values of electrical circuits using genetic programming. In John R. Koza, David E. Goldberg, David B. Fogel, and Rick L. Riolo, editors, Genetic Programming 1996: Proceedings of the First Annual Conference, page 123, Stanford University, CA, USA, 28-31 July 1996. MIT Press.

[11] Riccardo Poli. Genetic programming for image analysis. In John R. Koza, David E. Goldberg, David B. Fogel, and Rick L. Riolo, editors, Genetic Programming 1996: Proceedings of the First Annual Conference, page 363, Stanford University, CA, USA, 28-31 July 1996. MIT Press.

[12] Riccardo Poli. Some steps towards a form of parallel distributed genetic programming. In Proceedings of the First On-line Workshop on Soft Computing, August 1996.

[13] D. E. Rumelhart and J. L. McClelland, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1-2. MIT Press, Cambridge, MA, 1986.


[14] Astro Teller and Manuela Veloso. Neural programming and an internal reinforcement policy. In John R. Koza, editor, Late Breaking Papers at the Genetic Programming 1996 Conference, pages 186-192, Stanford University, CA, USA, 28-31 July 1996. Stanford Bookstore.

[15] Paul Walsh and Conor Ryan. Paragen: A novel technique for the autoparallelisation of sequential programs using genetic programming. In John R. Koza, David E. Goldberg, David B. Fogel, and Rick L. Riolo, editors, Genetic Programming 1996: Proceedings of the First Annual Conference, page 406, Stanford University, CA, USA, 28-31 July 1996. MIT Press.
