How GAs do NOT Work: Understanding GAs without Schemata and Building Blocks

Hans-Georg Beyer

Bericht Nr. SYS-2/95, April 25, 1995

Systems Analysis Research Group, Universität Dortmund, Fachbereich Informatik, Lehrstuhl XI, D-44221 Dortmund

ISSN 0941-4568

Abstract

An alternative explanation for the working of GAs is presented, focusing on the collective phenomena taking place in populations performing recombination. The new theory is mainly based upon three `basic principles': the `evolutionary progress principle' (EPP), the `genetic repair' (GR) hypothesis, and the `mutation induced species by recombination' (MISR) principle.

This paper has been rejected for presentation at the 6th ICGA. Having lost the opportunity to defend the new ideas and the points attacked by the reviewers at the conference, the author has added his replies to the reviewers' comments.


How GAs do NOT Work: Understanding GAs without Schemata and Building Blocks

Hans-Georg Beyer†
University of Dortmund, Department of Computer Science, Systems Analysis Research Group, D-44221 Dortmund, Germany

4th January 1995

Abstract

An alternative explanation for the working of GAs is presented, focusing on the collective phenomena taking place in populations performing recombination. The new theory is mainly based upon three `basic principles': the `evolutionary progress principle' (EPP), the `genetic repair' (GR) hypothesis, and the `mutation induced species by recombination' (MISR) principle.

† This work was funded by DFG grant Be 1578/1-1. E-mail: [email protected]

1 Introduction

Although GA-theory has gained considerable attention in the past few years, a theory is still missing that gives us an insight into the basic working principles of GAs. Many GA-researchers may not agree, but one should take into account the existence of different `EA-schools'. Each one claims that its basic principles are the essence of EAs. One prime example concerns the debate about the importance of mutation and recombination/crossover, separating very well the GA-community from the EP-community. For a long period of time the GA-community has considered recombination as the major search operator, whereas the EP-protagonists deny up until now any significance of recombination [10]. But where is the truth? Since mutation vs. recombination is not the only difference between GA and EP, it will be difficult to come to a definite conclusion by experiments. There are too many degrees of freedom that have influence on the assessments made.

The major flaw is the lack of a theory that quantitatively predicts the influence of the genetic operators. In order to prove the usefulness of an operator, the theory should provide a performance measure for the operators. Thus, it would be possible to compare, e.g., a GA consisting of recombination & mutation with a GA without recombination. And if there were a significant performance difference, then it could be traced back to `basic principles' inherent in the theory developed.

This is not merely a program, but it has been performed for a special variant of Evolution Strategies (ES), the so-called (μ/μ, λ)-ES [7, 6]. The results obtained are useful for the GA theory as well. The extracted `basic principles', in particular, should be considered by the EA-community as a whole. They shed light on the interaction between mutation and crossover with - at first glance - unexpected results pointing in a direction diametrically opposed to the building block hypothesis [14, 11]. This raises the question of whether the schema theorem (SchT) and its (quasi) corollaries, i.e., the building block hypothesis (BBH) and the implicit parallelism (IP), are really the essence of the GA.

Do these principles separate the EA-based algorithms from other optimization techniques?

This paper is organized as follows. First, a short discussion about the questions just raised is given in section 2. Section 3 is devoted to the (μ/μ, λ)-Evolution Strategies. This class of strategies, performing multi-recombination and (μ, λ)-truncation selection, exhibits larger progress rates than its non-recombinative counterpart, the (μ, λ)-ES (theoretically provable!). The basic working principles will be extracted in order to explain the benefit of recombination. In section 4 we will discuss the findings made for the (μ/μ, λ)-ES in the light of GAs. There is some evidence indicating that there is at least `a certain substance' in the principles proposed.

2 Current GA-Theories

2.1 The Schema Theorem

For nearly one and a half decades the SchT (Schema Theorem) [14] had been the only existing approach to the understanding of the canonical GA (with fitness proportionate selection). The SchT states that [11, p. 33]: "Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations". Although this statement is not wrong, it is only a weak predictor for the working of the GA. The main emphasis in the SchT is on the fitness proportionate selection. Mutation and recombination appear in this theorem as `disturbances', changing the selection equation into the (much) weaker inequality. But even if selection is considered alone, the SchT cannot predict the performance of the GA, as can be seen in the case of the minimal deceptive problem [11, p. 46]. The reasons for this failure are well known. One of them, concerning selection, is the infinite population size. Closely connected to this - though not equivalent - is Grefenstette's distinction between the observed average fitness of the schemata/hyperplanes and the static average fitness of the schemata/hyperplanes [13].

From the viewpoint of function optimization, the SchT does not provide any guarantee for convergence to, or divergence from, the optimum state. This is the major flaw of schema analysis. The SchT cannot be used to prove/disprove whether a certain GA is a function optimizer or not. Taking this into account, the next steps in GA-theory, i.e., the building block hypothesis (BBH) and the implicit parallelism (IP), become very questionable, because they are built upon sandy ground. We should especially draw our attention to the extended version of the BBH that tries to incorporate the working of the recombination operator, since it is often invoked to explain the working of the GA [11, p. 41]: "Just as a child creates magnificent fortresses through the arrangement of simple blocks, so does a genetic algorithm seek near optimal performance through the juxtaposition of short, low-order, high-performance schemata, or building blocks."

This statement is not wrong if taken as an (empirical) observation from the optimization process. However, if the BBH is regarded as the reason why the GA works, then cause and effect are reversed. Furthermore, the observation that leads to the BBH is not a feature exclusively observed in GAs. It holds for each iterative procedure that successively approaches the optimum, if building blocks are interpreted as certain decompositions of the optimum solution. Even the exponential growth of schemata can be observed in gradient strategies under such conditions. We are interested in knowing the main difference between GAs (or EAs, generally) and other optimization techniques. Grefenstette [12] advocates: "...any admissible genetic algorithm exhibits a form of implicit parallelism." Goldberg writes [11, p. 40]: "Even though each generation we perform computation proportional to the size of the population, we get useful processing of something like n³ schemata in parallel ...", and [11, p. 20]: "Because this processing leverage is so important (and apparently unique to genetic algorithms), we give it a special name, implicit parallelism." Apart from the fact that the n³-counting argument is questionable, as pointed out by Grefenstette [12], there remain some qualms about the IP's "uniqueness to GAs". The counting argument is mainly based upon the number of schemata owned by an individual string of length ℓ. Each individual is a member of exactly 2^ℓ schemata, and a population of λ individuals contains (theoretically) up to λ·2^ℓ schemata. However, consider a gradient-like strategy. Each data point needed to compute the gradient direction is a member of 2^ℓ schemata, too! Thus, even a gradient strategy would exhibit some kind of implicit parallelism. The author does not claim that there are not any significant differences between gradient strategies and GAs. Furthermore, GAs have proven to be an efficacious optimization technique. However, the BBH and the IP should not be invoked for explaining the working or the uniqueness of GAs, because they point in a wrong direction.

2.2 Alternative Approaches

2.2.1 Markov Chain Analysis

Markov chains provide an exact microscopic model of GAs. Results obtained up until now have concentrated on convergence properties only. A detailed analysis has been done by Davis et al. [8]. They state: "...the simple GA, which seeks convergence to a uniform population of globally optimal solutions, fails to provide convergence to global optimality." This does not come as a big surprise, since the `simple' GA performs fitness proportionate selection. Thus, there is always a probability of losing the best solution found so far. Convergence, however, can be obtained and proved if some kind of elitism is incorporated into the GA, e.g., by keeping track of the best individual obtained so far [15, 19].

As already stated, the Markov chain model has been used to prove/disprove the convergence to stationary state probabilities only. The theory developed (up to now) cannot give any insight into the working mechanisms of crossover, mutation, and selection, as pointed out (indirectly) by Davis et al. [8] (concerning the crossover): "...for our established bound, crossover does not influence the asymptotic behavior of the two-operator algorithm" and "...the addition of crossover does not seem to bring any additional advantage."

This leads us to the major flaw of current Markov chain analysis. Though they are exact models, a link is still missing that connects the microscopic level of description to `macroscopic' variables such as the expected fitness change over time (or number of generations). The problem is very similar to physics, e.g., the derivation of hydrodynamics and its transport coefficients from quantum mechanical master equations. A technique for deriving macroscopic dynamical equations from the transition matrices of the Markov chain has yet to be developed. Therefore, from this approach one cannot expect any satisfactory answers on how the GA really works. Things may change in the future. For the time being, however, the `mesoscopic approach' seems to be the most promising one.
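The elitism argument can be made concrete with a toy model (not from the paper): a single-bit "population" with two states, where state 1 is optimal and mutation flips the bit with probability p_m. Without elitism the optimum is repeatedly lost; with elitism ("keep the best-so-far") state 1 becomes absorbing. All numbers below are illustrative choices.

```python
import numpy as np

p_m = 0.1

# Without elitism: from state 1 the optimum is destroyed with probability p_m,
# so the chain has no absorbing optimal state.
P = np.array([[1 - p_m, p_m],
              [p_m, 1 - p_m]])

# With elitism: once the optimum is found it is never given up.
P_elite = np.array([[1 - p_m, p_m],
                    [0.0, 1.0]])

start = np.array([1.0, 0.0])  # start in the suboptimal state 0
dist = start @ np.linalg.matrix_power(P, 200)
dist_elite = start @ np.linalg.matrix_power(P_elite, 200)
```

Here `dist` converges to the uniform stationary distribution (half the mass stays on the suboptimal state), whereas `dist_elite` converges to the optimal state - a minimal instance of the convergence-only statements that Markov chain analysis can deliver.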

2.2.2 The `Mesoscopic' Approach

The notion `mesoscopic' indicates that this approach is sort of in-between: it has some features of the exact microscopic theory and some phenomenologic ingredients obtained by empirical methods. Attempts in this direction are made in various articles by Mühlenbein et al. (e.g. [16, 17]) and recently by Goldberg et al. [22]. The GA analysis is concentrated on a few simple fitness functions. Especially the bit-counting function (ONEMAX) has been investigated. Since there is no epistasis in this function, the bit positions can be treated independently from each other. This should simplify the analysis considerably. In spite of this, the analytical treatment of the GA, including the three operators selection, mutation, and crossover, is still a pending problem. From the current state of the theory one cannot expect the derivation of `basic principles'. But unlike the SchT, the mesoscopic theory is able to give estimates for optimal mutation rates and the expected run-time complexity of the algorithms [17].

3 Basic Principles from the (μ/μ, λ)-ES Theory

In this section a short introduction to the theory of the (μ, λ)- and (μ/μ, λ)-Evolution Strategies (ES) will be given. The theory is at a state that allows the extraction of `basic evolution principles'. First, a short definition of the (μ, λ)-, (μ/μ_I, λ)-, and (μ/μ_D, λ)-ES will be provided. The second point concerns the means of measuring evolutionary progress. The third point presents the results of the progress rate theory. There, we will formulate the `basic principles' and we will explain why recombination is beneficial. The author is firmly convinced that the `basic principles' of GAs, ESs, and all the other EA-variants are the same. Therefore, we will always try to establish connections and correspondences between the results of the ES-theory and the GA.

3.1 The ES-Algorithms

Modern ESs are multi-membered strategies working with population(s). Schwefel's nomenclature (μ, λ)-ES (cp. [20]) indicates that λ offspring y_l are generated from a pool of μ parents (μ < λ) by random selection (reproduction) and application of a mutation operator. The μ parents y_{m;λ} are obtained by truncation selection from the λ offspring y_l of the preceding generation. Truncation selection means that the μ best offspring are selected according to their fitness F(y) (expressed by the `m;λ' nomenclature).

The form of the mutation Z depends on the problem domain considered (e.g. combinatorial optimization, integer programming, real parameter optimization). The theory developed here belongs to the field of real-valued N-dimensional parameter optimization. Therefore, the mutations are generated by addition of N(0, σ) normally distributed random components to the parental states y_{m;λ}. An offspring y_l is generated as

    (μ, λ)-ES:  m := Random{1 ... μ},  y_l := y_{m;λ} + Z,  l = 1 ... λ  (1)

(NB: y and Z are N-dimensional vectors). All components of the random vector Z fluctuate with equal standard deviation σ. The mutation strength σ plays the same role as the bit-flipping probability p_m in GAs (ES-GA correspondence principle, see below). It is an important strategy parameter that controls the performance of the ES (cf. 3.2, 3.3).

Now, recombination is introduced. In contrast to the biological standard, it is possible to define multi-mixing strategies denoted by (μ/ρ, λ), where ρ indicates the number of parents involved in the act of procreation. In this paper, only ρ = μ strategies are considered. There are two variants of recombination. First, the (μ/μ_I, λ) intermediate recombination: its result is equivalent to the `center of mass' of the μ parental states y_{m;λ}. The rest is equal to the (μ, λ)-ES, eq. (1). Thus, the offspring y_l is generated by

    (μ/μ_I, λ)-ES:  y_l := (1/μ) Σ_{m=1}^{μ} y_{m;λ} + Z.  (2)

Second, the (μ/μ_D, λ) dominant recombination. The i-th component of an offspring vector (denoted by {.}_i) is obtained by random choice of one of the coordinate-i values from the μ parents:

    (μ/μ_D, λ)-ES:  m_i := Random{1 ... μ},  {y_l}_i := {y_{m_i;λ}}_i + {Z}_i.  (3)
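The three reproduction schemes of eqs. (1)-(3) can be sketched as follows; this is a minimal NumPy sketch, and the function names are this sketch's own, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutate(y, sigma):
    # add N(0, sigma^2) noise to every component (the mutation Z)
    return y + sigma * rng.normal(size=y.shape)

def offspring_comma(parents, sigma):
    # (mu,lambda)-ES, eq. (1): pick one parent at random, then mutate
    m = rng.integers(len(parents))
    return mutate(parents[m], sigma)

def offspring_intermediate(parents, sigma):
    # (mu/mu_I,lambda)-ES, eq. (2): mutate the centre of mass of all mu parents
    return mutate(parents.mean(axis=0), sigma)

def offspring_dominant(parents, sigma):
    # (mu/mu_D,lambda)-ES, eq. (3): component i is inherited from a randomly
    # chosen parent m_i, then mutated
    mu, n = parents.shape
    picks = rng.integers(mu, size=n)
    return mutate(parents[picks, np.arange(n)], sigma)
```

A full generation would call one of these λ times and keep the μ best offspring by fitness (truncation selection).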

This strategy is equivalent to Schwefel's global, discrete recombination scheme [3] and is reminiscent of uniform crossover introduced by Syswerda [21] for bit strings. It is indeed very similar to uniform crossover if multi-mixing is introduced in GAs, as proposed by the author [6] (see below).

3.2 Measuring Performance

A natural performance measure for function optimizers is the distance-to-optimum change ΔR observed from generation (g) to (g+1). In the ES-theory the expectation of ΔR is called the progress rate φ := E{ΔR}. For a population, there are different possibilities to define the distance-to-optimum. Usually the distance from the center of mass point to the optimum point ŷ is taken. If, for the sake of simplicity, the coordinate origin is defined at the optimum point ŷ, then φ can be written as

    φ := E{ R^(g) − R^(g+1) },  with  R^(g) := || (1/μ) Σ_{m=1}^{μ} y_{m;λ}^(g) ||.

It is obvious that φ also depends on the fitness landscape defined by the fitness function F(y). As in the case of the `mesoscopic' approach to the GA-theory, an F(y) has to be chosen which is both simple enough to yield analytical results and sufficient to describe (at least) the local behavior of the fitness landscape. This is the case for the spherical model, i.e., F(y) depends on the radius R = ||y|| only. Note, the spherical model is nonlinear; it incorporates epistatic effects. Furthermore, the results of this model are scalable, i.e., they can be expressed in a state (generation number) independent form by the normalizations

    φ* := φ N/R,  σ* := σ N/R.  (4)

Below, we will present the main results and the `basic principles' of the theory. For their derivation the asymptotic formulae (N → ∞ [18]) suffice. Details for the real-world case N < ∞ can be found in [7, 6].

3.3 Basic Principles

3.3.1 The (μ, λ)-Theory

For the case N → ∞ one finds

    φ* = c_{μ,λ} σ* − (σ*)²/2.  (5)

This formula contains all the important information necessary to understand an ES comprising a mutation operator with strength σ* and (μ, λ) truncation selection. The influence of the selection is described by the so-called progress coefficient c_{μ,λ} [7], bounded by 0 ≤ c_{μ,λ} ≤ c_{1,λ} < √(2 ln λ).

Positive progress, i.e., convergence to the optimum, is achieved for σ*-values smaller than 2c_{μ,λ}. The maximal (normalized) progress is (c_{μ,λ})²/2, reachable if σ* is tuned such that σ* = c_{μ,λ} holds. If re-normalized by eq. (4), σ = c_{μ,λ} R/N indicates that on approaching the optimum (R decreases) the mutation strength has to be decreased (g → ∞ ⇒ σ → 0). Usually, this is attained by the σ-self-adaptation developed by Schwefel [20]. In the opposite case, σ = const., the (μ, λ)-ES would not be a function optimizer. That is, for g → ∞ a residual distance R_∞ to the optimum remains, R_∞ = Nσ/(2c_{μ,λ}) [4]. Note, similar statements can be made for the mutation rate p_m in GAs (see e.g. [8]). Keeping p_m constant at a very small value is not always the best policy.

From eq. (5) we can extract the governing principle, i.e., the `Evolutionary Progress Principle' (EPP), valid for all EAs: progress (positive or negative) is the result of two opposite tendencies, progress gain and progress loss. Carefully vetting the derivation of eq. (5) reveals that there is a decomposition of the mutation vector Z into two parts: a part x in optimum direction −e_R and a perpendicular part h:

    Z = −x e_R + h.  (6)

Thus, the progress gain is connected to x and the progress loss is due to the length of the h part. Each additional (recombination) operator is aiming at a decrease of ||h|| and/or at an increase of x. Note, at a more abstract level this even holds for combinatorial optimization problems, e.g. for the TSP.

3.3.2 The (μ/μ_I, λ)-Theory

The asymptotic φ* formula reads

    φ* = c_{μ/μ,λ} σ* − (σ*)²/(2μ)  (7)

with the progress coefficient c_{μ/μ,λ} [6] bounded by 0 ≤ c_{μ/μ,λ} ≤ c_{μ,λ} ≤ c_{1,λ} < √(2 ln λ). Comparison of the progress loss terms in (5) and (7) reveals that the intermediate recombination decreases the loss term (σ*)²/2 by a factor of 1/μ, whereas the gain term is not increased, because c_{μ/μ,λ} ≤ c_{μ,λ} holds. This is a very important observation, leading to the `basic principle' of (intermediate) recombination. Because of decomposition (6), the intermediate averaging in eq. (2) is applied to the x-values (NB, x does not necessarily coincide with a coordinate direction) of the μ best offspring and to their h-vectors as well. Selection acts mainly on the x-values, whereas the h-vectors are (almost) selectively neutral. Moreover, the h-vectors are statistically independent from each other. The averaging of these h-vectors produces a recombinative h-vector ⟨h⟩ with an expected squared length which is by a factor of 1/μ smaller than the expected squared length of a single h-vector:

    E{||⟨h⟩||²} = (1/μ) E{||h||²} ≈ σ²N/μ.  (8)

If one interprets the h-components as the `harmful part' of the mutation Z (because they increase the distance to the optimum), then a recombination producing ⟨h⟩ decreases this part. The author calls this the `basic principle' of intermediate recombination, the `Genetic Repair' (GR) [6]. It is evident that the building block hypothesis does not hold. The statement often invoked that recombination `combines good properties of the mates to form better offspring' is wrong. In our model the `good properties' are simply the x-values of the μ best offspring. Since they are averaged, the result cannot be better than the x of the best offspring. The improved performance of (μ/μ, λ)-ESs is due to the larger admissible mutation strengths. Optimal performance is achieved for σ* = μ c_{μ/μ,λ}, in contrast to σ* = c_{μ,λ} for the (μ, λ)-ES (NB, though c_{μ/μ,λ} ≤ c_{μ,λ} holds, both coefficients are of the same magnitude). At this point it also becomes clear why multi-mixing is preferred to the ρ = 2 (biological) standard case: due to eq. (8), multi-mixing (ρ = μ) provides the smallest progress loss.

3.3.3 The (μ/μ_D, λ)-Theory

For the progress rate one finds

    φ* = √μ c_{μ/μ,λ} σ* − (σ*)²/2.  (9)

At first glance one might think that there is no GR (genetic repair) in dominant strategies, because there is no parental averaging in algorithm (3). And it seems that the loss term in eq. (9) is equal to that of the (μ, λ)-theory. But the matter of fact is that eq. (9) can be derived from the φ*-formula of the intermediate theory (7) by introduction of surrogate mutations S with standard deviation σ_s:

    σ_s = √μ σ.  (10)

The theory behind this model is based on the fact that each parent distribution (population) can be described by a probability density. The density may be expressed by its first moment - the center of mass parent - and the (higher-order) central moments. Application of the dominant recombination and the mutation operators changes these central moments; however, the resulting distribution can still be interpreted as if it were generated from the center of mass parent by the surrogate mutation S. A first approximation of this distribution can be obtained by neglecting the influence of selection on the shape of the N-dimensional surrogate mutation density. The examination of the single component standard deviation σ_s of S suffices. Consider a population (comprising λ elements) of a single (vector) component s of S. Let this population be subject to an evolutionary process consisting of recombination and `physical' mutation with strength σ, iteratively applied; then the steady-state standard deviation σ_s of the variate s is given by eq. (10) [6]. This is a very astonishing and important result, which is in contrast to the single operator effects.¹ The dominant recombination transforms physical mutations into larger surrogate mutations. However, although there is no selective pressure in this model(!), the resulting mutation strength (10) is restricted. The population is concentrated around the center of mass parent (the `wild-type'), forming what biologists would call a `species'. This result is very general. It relies on the statistical independence of mutation and recombination only. Therefore, it holds for GAs, too (see below). The `basic principle' of dominant recombination can be formulated as: `Mutation Induced Species by Recombination' (MISR). The variety restriction in species goes well with the GR-mechanism. GR can only work with a restricted mutation size (cp. 3.3.2). Mixing between different species does not work (yet another argument against the BBH).

But is there GR in (μ/μ_D, λ)-strategies at all? - There is, but implicitly. This becomes clear if one interprets the dominant recombination as a sampling process estimating the parental distribution. Thus, the center of mass - as the first moment of the distribution - is (implicitly) estimated. Note, the estimation process works best (i.e. with the smallest statistical error) if all coordinates are sampled independently from each other. This might be the reason for which uniform crossover in GAs is often superior to one- or two-point crossover [21].

4 Evidences for EPP, MISR, and GR in GAs

4.1 The EPP-Hypothesis

Although the mesoscopic GA-theory provides estimates for the computational complexity [17], it cannot be used for the identification of the evolutionary progress principle (EPP). This is due to the semi-empirical character of the theory, leaving some phenomenologic coefficients undetermined (up to now). There is one exception, concerning ONEMAX with (1, λ)-truncation selection, where the influence of the mutation rate p_m on the evolutionary progress φ has been determined. Bäck [2] was able to express φ as a sum of probability expressions, but failed to derive an analytical φ-formula. Plots numerically obtained indicate that there is a dependence of φ on p_m similar to the findings of section 3.3.1, eq. (5), if one accepts the correspondence p_m ↔ σ*. In order to support the EPP, the author has tried to derive an analytical φ-expression - and has succeeded. With a technique developed in [5], he easily obtains a rough but very instructive approximation

    φ(p_m) = c_{1,λ} √(ℓ p_m (1 − p_m)) − (2Q − ℓ) p_m

(ℓ - string length, Q - number of correct bits), which clearly supports the EPP.
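The gain/loss structure of this approximation can be seen numerically. A small sketch (the value c_{1,λ} ≈ 1.54 is roughly the λ = 10 progress coefficient; ℓ, Q, and the grid are illustrative choices, not from the paper):

```python
import math

def phi_onemax(p_m, ell, Q, c1lam):
    """Rough EPP approximation for ONEMAX with (1,lam) truncation selection:
    gain term  c_{1,lam} * sqrt(ell * p_m * (1 - p_m)),
    loss term  (2Q - ell) * p_m   (Q = number of correct bits)."""
    return c1lam * math.sqrt(ell * p_m * (1.0 - p_m)) - (2 * Q - ell) * p_m

# sweep p_m for a string of ell = 100 bits with Q = 90 correct bits
rates = [i / 1000 for i in range(1, 200)]
best = max(rates, key=lambda p: phi_onemax(p, 100, 90, c1lam=1.54))
```

The sweep exhibits the EPP shape of eq. (5): positive progress for small p_m, an interior optimum (here near p_m ≈ 0.01), and negative progress once the loss term dominates.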

4.2 The MISR-Principle

The `Mutation Induced Species by Recombination' (MISR) principle applied to the binary GA postulates (for each bit position) a larger (but restricted) population variance σ_s² than the single bit-flipping variance generated with the mutation rate p_m. The variance of a single bit-flip is σ²{b} = p_m(1 − p_m). The analytical determination of the (averaged) population variance σ_s² has not been done yet. Note, eq. (10) cannot be used directly, because it holds for continuous variates with zero-mean mutations. There is no upper limit in eq. (10), whereas the variance of a bit-population is always bounded, σ_s² ≤ 1/4. However, the effect of the MISR-principle can easily be simulated and verified by iterated application of the operators mutation and recombination on a bit-population. Since selection is switched off, the center of mass (defined as the bit-count divided by the population size λ) performs a random walk, whereas the (generation-averaged) population variance proves to be a function of p_m, monotonically increasing with λ. A comparison of the experiments with the predictions of eq. (10) shows that (10) may serve as a (very) rough estimate, if p_m is sufficiently small.

¹ Recombination alone results in random genetic drift [1], whereas mutation alone produces a random walk.
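The experiment just described - dominant (gene-pool) recombination over the whole population plus bit-flip mutation, selection switched off - can be sketched as follows. Population size, string length, mutation rate, and generation count are illustrative choices:

```python
import random

random.seed(3)

def misr_variance(pop_size=50, length=30, p_m=0.02, gens=400):
    """Iterate gene-pool (multi-parent dominant) recombination and bit-flip
    mutation WITHOUT selection; return the per-bit population variance
    f*(1-f), averaged over bit positions and the second half of the run."""
    pop = [[0] * length for _ in range(pop_size)]
    acc, count = 0.0, 0
    for g in range(gens):
        new = []
        for _ in range(pop_size):
            # each bit is inherited from a randomly chosen parent ...
            child = [pop[random.randrange(pop_size)][i] for i in range(length)]
            # ... and then flipped with probability p_m
            child = [b ^ (random.random() < p_m) for b in child]
            new.append(child)
        pop = new
        if g >= gens // 2:  # average over the (roughly) stationary phase
            for i in range(length):
                f = sum(ind[i] for ind in pop) / pop_size
                acc += f * (1 - f)
                count += 1
    return acc / count

v = misr_variance()
```

In line with the MISR-principle, the measured per-bit variance `v` settles well above the single-flip variance p_m(1 − p_m) = 0.0196, yet stays below the hard bound 1/4.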

4.3 The GR-Hypothesis

Up until now, there are no theoretical proofs of the GR (genetic repair) hypothesis in GAs. Unlike the ES-theory, where GR is well understood as a principle that works beneficially in curved, i.e. nonlinear, fitness landscapes, the corresponding main characteristic of GAs has not yet been extracted. What is the analogue of `curved' in GAs? Nonlinearity in ESs can be measured by the success probability P_ss.² P_ss is a function of the mutation strength σ. Exploiting the ES-GA correspondence principle

    σ ↔ p_m  (11)

that postulates the analogy of mutation strength σ and mutation rate p_m, nonlinearity could be expressed by P_ss(p_m) > 1/2 (concave) and P_ss(p_m) < 1/2 (convex), respectively. GR should be beneficial in fitness landscapes with P_ss(p_m) < 1/2. E.g., for the ONEMAX(b) function this would be the case if ONEMAX(b) > ℓ/2 holds (ℓ - string length of b).

Due to a lack of time and resources, the author was not able to thoroughly perform experiments supporting the GR hypothesis. Such experiments are involved, since the dependence on the mutation rate p_m has to be taken into account. However, there are predictions from the MISR-GR-principle that might support the theory. In manuscript [6], the author has suggested the investigation of multi-mixing GAs. Such GAs should exhibit - provided the strategy parameters (p_m, μ, λ) are right - faster convergence than the standard GA. This hypothesis received unexpected support from Eiben et al. [9], presented at the PPSN III in Jerusalem. They investigated simple GAs (proportionate selection) with `multi-parent recombination', the so-called `gene scanning'. Among the different variants there have been two of special interest. The first variant is `uniform gene scanning', which is similar to the dominant recombination described in section 3.1 (exceptions: all parents are fitness-proportionally selected, mutation by bit-flipping). The second variant is `occurrence-based scanning', a straightforward implementation of the intermediate recombination on binary strings. Since the averaging of bits would produce `fractional bits', the analogy breaks down.³ Therefore, the intermediate bit value is determined by majority decision (occurrence-based).⁴

The results obtained by Eiben et al. are promising [9]: "The experiments show that 2-parent recombination is inferior on the classical DeJong functions. For the other problems the results are not conclusive, in some cases 2 parents are optimal, while in some others more parents are better." Though there are many open questions in their work (concerning truncation selection and the mutation rate p_m), it is at least a clue that supports the MISR-GR hypothesis.

² P_ss is the probability of producing a successful (fitter) offspring from the parental state by a single trial.
³ From the viewpoint of ES-theory, this indicates that the binary alphabet should be exchanged for higher-cardinality alphabets (if possible). GR could work much better. Why does nature use quaternary DNA instead of bits?!
⁴ Notice, this technique can produce large round-off errors diminishing the effect of GR.
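For ONEMAX, P_ss can be estimated by direct sampling, since only the number of flips among the correct and the wrong bits matters. A sketch (the parameter choices are illustrative, not from the paper):

```python
import random

random.seed(7)

def p_success(ones, length, p_m, trials=10000):
    """Estimate P_ss for ONEMAX: probability that one bit-flip mutation
    (rate p_m per bit) of a parent with `ones` correct bits yields a
    strictly fitter offspring."""
    hits = 0
    for _ in range(trials):
        # flips among the correct bits lower the fitness,
        # flips among the remaining (wrong) bits raise it
        down = sum(random.random() < p_m for _ in range(ones))
        up = sum(random.random() < p_m for _ in range(length - ones))
        hits += up > down
    return hits / trials

near_opt = p_success(90, 100, 0.01)  # parent above l/2: convex regime
below = p_success(10, 100, 0.01)     # parent below l/2
```

Consistent with the criterion above, the estimate falls below 1/2 once the parent has more than ℓ/2 correct bits - exactly the regime where GR should pay off.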


5 Concluding Remarks

In this paper an alternative hypothesis on how GAs do work has been presented. It was argued that the BBH (building block hypothesis) does not explain the evolutionary mechanisms, i.e. the cause that makes a GA a function optimizer or not. The observed building block evolution is rather an effect of each function optimizer successively approaching the optimum. The main goal of this paper was to shed light on the process of mutation/recombination, whereas the aspects of selection (see e.g. [17]) and the prerequisite of evolvability [18] have been omitted here. They build up two additional `basic principles' of similar importance.

Let us give a short summary of what the EPP-GR-MISR hypotheses mean. The governing principle is the EPP. Each designer of a GA (or, generally, an EA) should ask for the processes that provide progress gain and progress loss. The analysis of schema disruption/production, however, is not equivalent to progress loss/gain. Therefore, in general, the results of schema analysis will not be as significant as those obtained by the EPP (though it may be more difficult to extract the EPP). Once the EPP is accepted, GR aims at the decrease of the loss part by statistical error correction. Thus, it is possible to increase the mutation rate p_m with the effect of a higher progress gain. Once again, mutations are not directed and without bias: a larger mutation rate p_m increases both the fitness-increasing components and the harmful components of a mutation. Selection extracts those individuals produced by the mutations with the larger fitness-increasing components (on average), and GR diminishes the influence of their harmful parts.

As already pointed out, the magnitude of the mutations is a crucial parameter for the EA performance. This holds for the spreading of the population as well, because the whole population can be interpreted as if it were produced by surrogate mutations from an average parent (wild-type, center of mass parent). GR-performing populations are very similar to species: the individuals are crowded around the average individual. The MISR-hypothesis explains why the small physical mutations are transformed into larger (surrogate) mutations at the population level by means of the recombination/crossover operator. However, these mutations are still restricted, which is a prerequisite for GR. At least for the multi-recombinant ES, the species produced by MISR performs GR well. It is an open question whether this holds for GAs, too, especially if the standard recombination operators one-point/two-point crossover are considered.

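The summary above can be made concrete with a toy experiment. The following is a minimal sketch (added here as an illustration, not taken from the paper) of a (μ/μ, λ)-ES with intermediate, i.e., center-of-mass, recombination on the spherical model; the function names and the parameter settings (n = 10, μ = 5, λ = 30, σ = 0.1) are illustrative assumptions:

```python
import random

def sphere(x):
    # spherical model: fitness to be minimized, optimum at the origin
    return sum(xi * xi for xi in x)

def es_generation(parent, mu, lam, sigma):
    """One generation of a (mu/mu, lambda)-ES with intermediate
    (center-of-mass) recombination: every offspring is a mutant of the
    average parent, and the mu best mutants are averaged again."""
    offspring = [[xi + random.gauss(0.0, sigma) for xi in parent]
                 for _ in range(lam)]
    offspring.sort(key=sphere)
    best = offspring[:mu]
    # genetic repair (GR): averaging tends to cancel the harmful mutation
    # components that are uncorrelated among the selected mutants
    return [sum(ind[i] for ind in best) / mu for i in range(len(parent))]

random.seed(1)
x = [1.0] * 10                      # start away from the optimum
start_fitness = sphere(x)
for _ in range(100):
    x = es_generation(x, mu=5, lam=30, sigma=0.1)
print(sphere(x) < start_fitness)    # the population center has progressed
```

Averaging the μ selected mutants is the point of the sketch: the mutation components that are uncorrelated across the selected individuals largely cancel in the center of mass, while the components favored by selection survive, so a larger mutation strength remains tolerable than without recombination.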
References

[1] H. Asoh and H. Mühlenbein. On the Mean Convergence Time of Evolutionary Algorithms without Selection and Mutation. In Y. Davidor, R. Männer, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature, 3, pages 88-97, Heidelberg, 1994. Springer.

[2] T. Bäck. The Interaction of Mutation Rate, Selection, and Self-Adaptation Within a Genetic Algorithm. In R. Männer and B. Manderick, editors, Parallel Problem Solving from Nature, 2, pages 85-94. North Holland, Amsterdam, 1992.

[3] T. Bäck and H.-P. Schwefel. An Overview of Evolutionary Algorithms for Parameter Optimization. Evolutionary Computation, 1(1):1-23, 1993.

[4] H.-G. Beyer. Toward a Theory of Evolution Strategies: Some Asymptotical Results from the (1,+ λ)-Theory. Evolutionary Computation, 1(2):165-188, 1993.

[5] H.-G. Beyer. Towards a Theory of `Evolution Strategies': Progress Rates and Quality Gain for (1,+ λ)-Strategies on (Nearly) Arbitrary Fitness Functions. In Y. Davidor, R. Männer, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature, 3, pages 58-67, Heidelberg, 1994. Springer.

[6] H.-G. Beyer. Toward a Theory of Evolution Strategies: On the Benefit of Sex - the (μ/μ, λ)-Theory. Evolutionary Computation, 3(1):xxx-yyy, 1995.

[7] H.-G. Beyer. Toward a Theory of Evolution Strategies: The (μ, λ)-Theory. Evolutionary Computation, 2(4):xxx-yyy, 1995.

[8] T. E. Davis and J. C. Principe. A Markov Chain Framework for the Simple Genetic Algorithm. Evolutionary Computation, 1(3):269-288, 1993.

[9] A. E. Eiben, P.-E. Raue, and Z. Ruttkay. Genetic algorithms with multi-parent recombination. In Y. Davidor, R. Männer, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature, 3, pages 78-87, Heidelberg, 1994. Springer.

[10] D. B. Fogel and L. C. Stayton. On the effectiveness of crossover in simulated evolutionary optimization. BioSystems, 32:171-182, 1994.

[11] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison Wesley, Reading, MA, 1989.

[12] J. J. Grefenstette. Conditions for Implicit Parallelism. In G. J. E. Rawlins, editor, Foundations of Genetic Algorithms, 1, pages 252-261. Morgan Kaufmann, San Mateo, CA, 1991.

[13] J. J. Grefenstette. Deception considered harmful. In L. D. Whitley, editor, Foundations of Genetic Algorithms, 2, pages 75-91. Morgan Kaufmann, San Mateo, CA, 1993.

[14] J. H. Holland. Adaptation in natural and artificial systems. The University of Michigan Press, Ann Arbor, 1975.

[15] K. De Jong. Are genetic algorithms function optimizers? In R. Männer and B. Manderick, editors, Parallel Problem Solving from Nature, 2, pages 3-13. North Holland, Amsterdam, 1992.

[16] H. Mühlenbein. How Genetic Algorithms Really Work I: Mutation and Hillclimbing. In R. Männer and B. Manderick, editors, Parallel Problem Solving from Nature, 2, pages 15-25. North Holland, Amsterdam, 1992.

[17] H. Mühlenbein and D. Schlierkamp-Voosen. The science of breeding and its application to the breeder genetic algorithm BGA. Evolutionary Computation, 1:335-360, 1994.

[18] I. Rechenberg. Evolutionsstrategie '94. Frommann-Holzboog Verlag, Stuttgart, 1994.

[19] G. Rudolph. Convergence properties of canonical genetic algorithms. IEEE Transactions on Neural Networks, 5(1):96-101, 1994.

[20] H.-P. Schwefel. Numerical Optimization of Computer Models. Wiley, Chichester, 1981.

[21] G. Syswerda. Uniform Crossover in Genetic Algorithms. In J. D. Schaffer, editor, Proc. 3rd Int'l Conf. on Genetic Algorithms, pages 2-9, San Mateo, CA, 1989. Morgan Kaufmann.

[22] D. Thierens and D. Goldberg. Convergence Models of Genetic Algorithm Selection Schemes. In Y. Davidor, R. Männer, and H.-P. Schwefel, editors, Parallel Problem Solving from Nature, 3, pages 119-129, Heidelberg, 1994. Springer.


The ICGA-reviews

This paper has been rejected for presentation at the 6th ICGA. Having lost the opportunity to defend my new ideas and the points attacked by the reviewers at the conference, I've added my replies to the reviewers' comments. The review text is in typewriter font; this has been chosen in order to avoid quotation marks. My comments are written in standard font (sometimes italics and bold are used). Unfortunately, I cannot give citations/references concerning the review text, since it was an anonymous reviewing process. However, if some reviewers insist on reading their names, then I will add them. Feel free to contact me.

------------------------- REVIEWS --------------------------------

COMMENTS FOR AUTHORS: ICGA-95 Paper Review Form
Comments for the Author(s)
Paper Num: 146
Title: How GAs do NOT Work: Understanding GAs without ...
Reviewer Num: 1
EVALUATION: 8 (1 = excellent, 9 = poor)
  Overall significance:             9
  Technical Content:                3
  Scholarship/Technical Soundness:  9
  Originality:                      3
  Organization/writing/clarity:     7
  Relevance to Conference:          2

Further comments, advice for the author: The paper is confused in several respects. 1) You say: "the SchT cannot predict the performance of the GA as can be seen in the case of the minimal deceptive problem [11, p.46]". The reference [11, p.46] does NOT use Holland's schema theorem, but instead uses equations which produce the expected next generation under the assumption of no mutation. The reference does a fairly good job of discussing the situation (as far as it goes) given that one is interested in using the expected transition to model behavior.


I fail to understand how failure "can be seen" in this context.

The only difference to the SchT comes from the no-mutation (i.e., selection-only) condition, which removes the relation sign from the SchT. (The SchT becomes an equation.) You seem to be a real expert in the field of (mis-)interpretation (I will come back to this later on). As pointed out by reviewer no. 093: "...since Goldberg used principles based on the Schema Theorem to construct this problem" (the MDP is meant). The MDP is a good example to demonstrate the weakness of the SchT.

2) You go on to say: "The reasons for this failure are well known. One of them, concerning selection, is the infinite population size". O.K., so by GAs you mean finite population GAs. Again, in what sense does Goldberg's analysis of the minimal deceptive problem "fail"? It seems to me that it only fails at being what it is not. I have a cat which "fails" at being a dog, so what? In any case, his equations produce the expected next generation for *all* population sizes, FINITE as well as infinite.

You should REALLY READ the paper as it is and NOT put INTERPRETATIONS on it which fit your argumentation. Your statement is simply wrong; I have not claimed in this paper that "...Goldberg's analysis of the minimal deceptive problem 'fails'". Actually, it hits the point that the SchT is weak.

3) You continue: "Closely connected to this - though not equivalent - is Grefenstette's distinction between the observed average fitness of the schema/hyper-planes and the static average fitness of the schemata/hyperplanes" It is not connected at all. The dynamics of Goldberg's equations are based on the present population's composition -- i.e., the *observed* fitness -- and not on any static average fitness of schemata or hyperplanes.

See 2)

4) You go on to say: "the SchT does not provide any guarantee for convergence to/or divergence from the optimal state. This is the major flaw of schemata analysis."


By your previous remarks, I presume by "SchT" you mean models like Goldberg's, and when you refer to GAs implicitly then you mean finite population GAs. I am not trying to be perverse, but you do not make yourself clear!

Again, you try to INTERPRET the text; however, this is not The Bible. You are strongly recommended to stick to the text.

In this case, observe that a GA is a *stochastic* algorithm. This means *nothing* can predict convergence to or divergence from the optimal state.

Really interesting; you should at least say what kind of GA you mean. In general, your statement is WRONG. References are given in the text [15, 19]. Oh sorry, I forgot that you are still concerned with Goldberg's MDP analysis.

The inability to guarantee the impossible is not a flaw, it is a simple fact. Therefore, if Goldberg's analysis does not provide any guarantee for convergence to, or divergence from, the optimal state, then that is as it should be; were the analysis to do otherwise, it would be in error.

5) You conclude the paragraph with "Taking this into account, the building block hypothesis (BBH) and the implicit parallelism (IP), become very questionable, because they are built upon sandy ground." The "sandy ground" to which you refer appears to be your previous inaccurate description of the situation.

Due to your misinterpretations, this statement does not come as a surprise.

I too have serious questions regarding IP (and to a lesser extent BBH). However, these concepts cannot be attacked by first assuming they are built upon some foundation, and then questioning the underpinnings which have been shoved beneath! To be concrete, suppose I claim 2+2 = 4 and offer the reason that 1+1 = 9. You will not succeed in making 2+2 = 4 questionable by attacking 1+1 = 9.

6) You make the claim: "if the BBH is regarded as the reason for which the GA works, then cause and effect are reversed". This is the key question. To be honest, I don't know which way it goes, but you nevertheless have offered no evidence to support your view.


Oh, that really comes as a big surprise. First, under paragraph 9) you say: "...previously existing theory is applied there to shed light on the working mechanisms of crossover and selection", but you are not able to answer your key question. That is a shame! It seems to me that your `light-providing lamp' is not very lucid. (See also my comments on paragraphs 11) and 12).) Second, it seems that the reviewer has not really read the rest of the paper (sections ≥ 3), because I DO OFFER evidence for the EPP-MISR-GR hypotheses.

7) You state: "We are interested in knowing the main difference between GAs (or EAs, generally) and other optimization techniques", and then point out that the BBH and IP may not be unique to GAs, and finally conclude: "the BBH and the IP should not be invoked for explaining the working or uniqueness of GAs, because they point in the wrong direction". I will grant that you are interested in differences between GAs and other algorithms and that the BBH and IP may not be unique to GAs. But your conclusion is nevertheless wrong. Yes, the BBH and the IP *perhaps* should not be invoked for explaining the uniqueness of GAs, but that would really depend on *how* they relate to GAs and *whether* it, in some important sense, is unique with respect to GAs. To be concrete, dogs and cats both have eyes. Yet we can use eyes to distinguish between them because "eyes" relates to cats differently than how "eyes" relates to dogs.

I think that it is much easier to distinguish between cats and dogs by the sound they produce than by the differences between their eyes. And this is exactly the problem of current GA-theory: some GA-theorists stare at the eyes of the dog (i.e., the BBH), totally hypnotized. AND they try to explain the behavior of the dog (i.e., the working of the GA) by looking at the dog's eyes ONLY.

As far as your assertion that the BBH and the IP should not be invoked for explaining the *working* of GAs: ...if the correctness of the explanation could be proved, then why not?

Yes, if ... !!!

8) You say: "The Markov chain model has been used to prove/disprove the convergence to stationary state probabilities only."


Wrong. Transient behavior has also been analyzed, and the Markov model has been used to link finite and infinite population behavior.

Right, but this is not the point. Such results are of minor interest. What is really of interest is the convergence rate (and order) toward the optimum point.

9) You continue with: "The theory developed (up to now) cannot give any insight into the working mechanisms of crossover, mutation and selection" Wrong. Although the FOGA III proceedings are not yet publicly available, previously existing theory is applied there to shed light on the working mechanisms of crossover and selection. The models used are old expected transition models which for years have been linked to finite population GA behavior through previously existing Markov theory. The fact that no one until recently has been applying the theory says nothing about whether it *can* be applied, neither does it indicate anything about what insights that theory is *unable* to give.

Again, this is a misinterpretation of my text. I did not claim that the theory of Markov chains cannot be applied to GAs.

10) You quote Davis to support your claim (in point 9 above). You misconstrue his statement. Davis made remarks with respect to the bound which he was able to prove relating to steady state behavior in his formalism.

Does that mean you are able to prove more? If so, congratulations!

11) You state: "a link is still missing that connects the microscopic level of description to `macroscopic' variables such as the expected fitness change over time". The link is easy to write down in terms of the transition matrix of the Markov chain.


Writing down the transition matrix does not provide the link I have spoken about. This is just a formal step (see below, paragraph 12). In order to solve the linking problem you have to find approximations to the Chapman-Kolmogorov equation (also known as the Master equation) in terms of state probabilities, and then you have to compute the fitness expectation. Sure, this can be done numerically, as long as the state space is small. However, numerical results are NOT analytical ones and are difficult to interpret.

12) You say, in the context of the counting ones problem: "the analytical treatment of the GA, including the three operators selection, mutation and crossover, is still a pending problem". The expected increase in population fitness for a GA using crossover, mutation, and any fitness function is known. It may be a complicated expression, but then the GA is a complicated algorithm; some things are inherently hard. This is not a fault, it is a fact.

First, as to paragraphs 11) and 12): these statements are of NO VALUE, since no pointers to relevant articles, technical reports, etc. are presented. So I have no possibility to assess your statements. Second, what does it mean that "...some things are inherently hard"? For example, solving Newton's equations for a vessel filled with gas is inherently hard, because of the high number of molecules (> 10^23). However, its thermodynamic state at normal temperature can easily be predicted by the ideal gas theory. The real challenge is to derive such laws from the basic equations. To be more GA-specific: writing down the transition matrix for the GA is in most cases an almost trivial task. However, to derive analytically the macroscopic equations of interest (e.g., the fitness change over time) from the constituted Markov chain - that is the real challenge. Maybe you have done this. It should be possible for simple fitness functions. BUT, even if you were successful, the question of the basic EA principles still remains.

13) You next launch into ES. The problem with this entire section is *relevance*. What you lose sight of is that the ES theory you consider is based on a completely *trivial* model, the spherical model, and that even if that were not an issue (which it is, from all except the most shallow of perspectives), you give absolutely no proof that principles which appear in that context apply to GAs.

Concerning the "completely *trivial* model, the spherical model": you seem to be totally unaware that this model can be extended to more complicated fitness landscapes (see reference [5]).

Partially right: I "...give absolutely no proof that principles which appear in that context apply to GAs." However, EVIDENCE IS PROVIDED. The situation is the same as for the BBH: NO ONE has given (up to now - as far as I know) a proof that the BBH is the CAUSE for the working of the GA.

COMMENTS FOR AUTHORS: ICGA-95 Paper Review Form
Comments for the Author(s)
Paper Num: 146
Title: How GAs do NOT Work
Reviewer Num:
EVALUATION: (1 = excellent, 9 = poor)
  Overall significance:             2
  Technical Content:                2
  Scholarship/Technical Soundness:  2
  Originality:                      2
  Organization/writing/clarity:     2
  Relevance to Conference:          1

Further comments, advice for the author: Good paper, very much interesting! In any case, I point out that, being the quantitative results in Sections 3 and 4 simply reported from other papers already in publication, I simply "assumed" their correctness without trying to check it again. In any case they seem plausible.

Thank you - I appreciate your words. I really need them.

As a personal opinion, I'm not so shure that the BBH is to give up!!!

That is OK. Science ought to live from dispute.

------------------------------ REVIEW FORM --------------------------------

COMMENTS FOR AUTHORS: ICGA-95 Paper Review Form
Comments for the Author(s)
Paper Num: 146
Title: How GAs do NOT Work: Understanding GAs without Schemata and Building Blocks
Reviewer Num: 89
EVALUATION: (1 = excellent, 9 = poor)
  Overall significance:             3
  Technical Content:                2
  Scholarship/Technical Soundness:  3
  Originality:                      3
  Organization/writing/clarity:     5
  Relevance to Conference:          1

Further comments, advice for the author: The main reason this technically sound paper is given a weak recommendation is that what it claims departs from what it delivers. It claims to be about GAs generally, when it deals with very specialized fitness functions that are basically smooth with a single optimum, the spherical model. Once this is chosen, the rest of the conclusions of the paper fall out logically. But no justification is given as to why the spherical model is valid for studying the general properties of GAs the author claims it to be. And I do not believe it is valid. If the author revised the paper to be more circumspect about its claims, addressing in detail the class of problems to which the model applies, and to which it does not apply, it would be recommended strongly. The reviewer would reject the paper outright judging only on whether the conclusions were justified; but there are enough ideas of technical interest and potential to merit acceptance.

Well, concerning the justification you are partially right. However, the main goal of this paper was to build a bridge between different EA variants and to show that there are certain similarities that can be summed up into basic principles of EA. However, basic principles do hold for simple fitness functions, too! The idea behind this is to analyze simple fitness functions, extract basic principles, and then proceed to more difficult ones.

The author would be well-served rhetorically by less self-promotion.


This is simply a matter of taste, not of self-promotion. It would be a simple task for me to write a paper on GAs that would be accepted in all likelihood; however, this was NOT my intention. My intention was to present a NEW PARADIGM that may serve as a guideline in GA-theory.

COMMENTS FOR AUTHORS: ICGA-95 Paper Review Form
Comments for the Author(s)
Paper Num: 146
Title: How GAs do NOT Work: Understanding GAs without ...
Reviewer Num: 93
EVALUATION: 8 (1 = excellent, 9 = poor)
  Overall significance:             7
  Technical Content:                5
  Scholarship/Technical Soundness:  7
  Originality:                      7
  Organization/writing/clarity:     7
  Relevance to Conference:          5

Further comments, advice for the author:

I review 3 or 4 papers like this every year.

Oh, that is not good news for scientists working on alternative GA-theories.

Such papers attack the obvious weaknesses of the Schema Theorem (SchT), which are now well known. Quoting Goldberg's textbook is just a cheap shot.

Well, it is cited very often with respect to the SchT and especially to the BBH.

The quote from Goldberg on page 2 (reference [11, p. 33]) is somewhat taken out of context.


I do not think so.

The statement that the "SchT cannot predict the performance of the GA as can be seen in the case of the minimal deceptive problem" is also somewhat misleading, since Goldberg used principles based on the Schema Theorem to construct this problem.

This is not misleading. It is just an example of the weakness of the SchT. One might as well think that the MDP was constructed by a GA-opponent!

The last part of section 2.1 (on page 3) is trivial. Gradient methods don't statistically sample the space.

Sure, it is trivial; nevertheless, it needs to be mentioned.

Convergence based on the use of elitism is old news and largely trivial.

I did not claim that this is a novelty. The real news can be found in the sections ≥ 3.

At the beginning of page 4, the author seems clearly unaware that these models have existed for a few years now. Markov and non-Markov models exist for infinite and finite populations.

This is really funny: the referee seems to have skipped sections 2.2.1 and 2.2.2. No comments on the relevant (i.e., new-information-containing) part of the paper (sections ≥ 3.1) - perhaps he forgot to read the rest?!

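One point in the exchange above can be checked directly: the claim that, under fitness-proportionate selection alone (no mutation, no crossover), the relation sign of the SchT disappears, so the schema proportions follow an exact equation m(H, t+1) = m(H, t) · f(H, t) / f_mean(t). The following sketch (added as an illustration; the 2-bit strings and fitness values are arbitrary assumptions, not from the paper) verifies this in the infinite-population model:

```python
# Infinite-population model of fitness-proportionate selection only
# (no mutation, no crossover) over all 2-bit strings.
strings = ['00', '01', '10', '11']
f = {'00': 1.0, '01': 0.5, '10': 0.6, '11': 1.2}   # arbitrary fitness values
p = {s: 1.0 / len(strings) for s in strings}       # uniform initial population

def mean_fitness(p):
    return sum(p[s] * f[s] for s in strings)

def select(p):
    # expected proportions after one round of proportionate selection
    fbar = mean_fitness(p)
    return {s: p[s] * f[s] / fbar for s in strings}

def matches(s, schema):                            # schema like '1*'
    return all(c == '*' or c == b for c, b in zip(schema, s))

def schema_prop(p, schema):                        # m(H, t)
    return sum(p[s] for s in strings if matches(s, schema))

def schema_fitness(p, schema):                     # observed f(H, t)
    return (sum(p[s] * f[s] for s in strings if matches(s, schema))
            / schema_prop(p, schema))

# SchT with equality: m(H, t+1) = m(H, t) * f(H, t) / f_mean(t)
predicted = schema_prop(p, '1*') * schema_fitness(p, '1*') / mean_fitness(p)
p_next = select(p)
print(abs(schema_prop(p_next, '1*') - predicted) < 1e-12)  # True
```

Without mutation and crossover no schema instances are created or destroyed; only their relative weights change, so the equality is exact. Adding disruptive operators is what turns this equation into the familiar lower-bound inequality.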