Computationally Intensive and Noisy Tasks: Co-Evolutionary Learning and Temporal Difference Learning on Backgammon

Paul J. Darwen
Department of Computer Science and Electrical Engineering
The University of Queensland
Brisbane, Queensland 4072, Australia
Email: [email protected]

Abstract: The most difficult but realistic learning tasks are both noisy and computationally intensive. This paper investigates how, for a given solution representation, co-evolutionary learning can achieve the highest ability from the least computation time. Using a population of Backgammon strategies, this paper examines ways to make computational costs reasonable. With the same simple architecture Gerald Tesauro used for Temporal Difference learning to create the Backgammon strategy "Pubeval", co-evolutionary learning here creates a better player.

1 Introduction

Real-world tasks are often both computationally demanding and contain a degree of randomness. Such problems are more challenging for machine learning, and more important to real-world applications, than are deterministic toy problems.

Co-evolutionary learning can discover solutions to problems without prior knowledge from human experts. The method has achieved impressive results, both on games such as Checkers [3] and the iterated Prisoner's Dilemma [1] [6] [7], and on other tasks [12] [17], such as creating a sorting algorithm [10] and schedule optimization [11].

The question facing this paper is, for a given solution representation, how can co-evolutionary learning obtain the highest skill from the least CPU time on a noisy task with heavy computational demands? There are some surprising interactions between population size, CPU time, and the number of samples per evaluation. Most surprisingly, more samples per evaluation (for more accurate fitness, costing more computation time) can actually make learning worse, as shown in Section 4.1.1.

Co-evolution here uses the same simple representation that Gerald Tesauro used to create the Backgammon player called Pubeval [23] using Temporal Difference learning [21]. Even though Pubeval is presumably highly optimized, here co-evolution creates a slightly better player on the same simple architecture, without seeing Pubeval or any other opponent.

1.1 Overview

Section 2 reviews why tasks with random uncertainty are of such tremendous practical importance for real-world applications, and why the game of Backgammon is a reasonable proxy for those real-world tasks. Section 3 describes the implementation of the co-evolutionary learning system used here. To compare performance, the unseen benchmark opponent is Gerald Tesauro's Pubeval [23], which was optimized with Temporal Difference learning [21]. Here co-evolution uses Pubeval's architecture, as described in Section 3.1. Section 4 observes how learning performance varies with such parameters as population size and the number of games per evaluation (and thus CPU time). Section 5 discusses these results, and the paper concludes in Section 6.

2 Motivation

For many optimization problems, such as job shop scheduling, numerous static benchmark problems exist. Spending days on a supercomputer with a static benchmark problem avoids the unpredictability of many real-world situations. Consider job shop scheduling: there is little point in starting a three-day supercomputer run to optimize a schedule for the situation at 9 o'clock on a Monday morning, if a customer makes a lucrative rush order at 9:15 and a lathe breaks down 20 minutes later. The situation changes too quickly. When dealing with an uncertain future in any dynamic and fast-changing environment, optimizing a static version of the problem avoids the real issue: the interesting part is the randomness of an uncertain future.

The game of Backgammon uses random dice, which gives it an unpredictable flavor. A novice will occasionally beat an expert with lucky dice rolls, so a player can't know in advance what dice rolls will occur or how an opponent will react. This resembles many real-world tasks, where managers can't be sure how a solution will perform until it's deployed. Therefore, it is reasonable to expect that whatever machine learning techniques work on Backgammon should also work on many real-world tasks that face an unpredictable future, such as training simulations [8] and financial forecasting [15].

This paper is not just about Backgammon, which is simply a proxy for real-world problems. Creating a good Backgammon player is less interesting than learning how to create the best possible player from a given architecture in the least CPU time, without prior human expertise.

3 Experimental Setup

3.1 A Benchmark and a Representation

To measure the performance of the strategies produced by co-evolution after it has finished learning, we need a fixed, unseen opponent to permit an off-line comparison (co-evolution does not learn against an outside expert; it learns without any prior knowledge of the game). The benchmark used here is Pubeval, courtesy of Gerald Tesauro [23]. For a fair comparison with Pubeval, the co-evolutionary system here also uses Pubeval's simple linear representation.

Pubeval was created by Temporal Difference learning [21]. On a more sophisticated neural network architecture, this method created the world's best Backgammon computer, TD-Gammon [22]. So if co-evolution can match or exceed Pubeval's skill at Backgammon, then we can infer it has obtained a good solution for that representation.

Pubeval is a move evaluation function. Each move, a pair of pseudo-random dice are rolled, and a legal move generator produces the board configuration of every possible move that can be reached from the current board with those dice rolls, i.e., a one-move look-ahead search. The move evaluation function (i.e., an individual from the population, or Pubeval) returns a value for each possible board position. The move with the highest value is taken.

Pubeval is actually two linear functions: one for the main part of a game of Backgammon, the other for the final "racing" stage, when pieces no longer have to pass the opponent's pieces. This "racing" stage is less interesting than the main part, because there is an algorithm to exactly solve the end game [21, page 265]. Therefore, and following Tesauro's Pub-1 metric [23, Table 1], co-evolution here only optimizes the first function, for the main part of the game. The final "racing" part of the game uses Pubeval's racing weights.

The move evaluator does not use hand-written feature detectors, but takes 122 inputs directly from the board position:

- One to indicate how many of one's own pieces are on the bar (the temporary prison for captured pieces).
- One to indicate how many of one's own pieces are off the board (having completed their journey).
- For each of the board's 24 positions, 5 inputs: one if the position is occupied by only one of the opponent's pieces (one can move to a position occupied by a single opposing piece, but not to a position occupied by more than one, so there is no detector for multiple opposing pieces), and four more for counting one's own pieces.

The output is the sum of the products of each input and its corresponding weight. It's just a linear function, with no hidden nodes or sigmoiding. Co-evolution directly manipulates the values of the 122 weights; there is no backpropagation.

A more elaborate architecture, with sophisticated hand-written feature detectors, would no doubt have evolved a better player. However, this paper uses Backgammon merely as a typical noisy task resembling real-world problems, rather than because of any specific interest in the game. Section 5.1 discusses architecture in more detail.
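To make the representation concrete, here is a minimal sketch (not from the paper; names such as encode_board, evaluate, and choose_move are illustrative, and the board-encoding details are abbreviated) of how such a linear evaluator selects a move with one-move look-ahead:

    import numpy as np

    NUM_INPUTS = 122  # 2 special inputs + 24 board positions x 5 inputs each

    def encode_board(board):
        """Hypothetical encoding of a board into the 122 raw inputs described
        above (own pieces on the bar, own pieces off the board, and 5 inputs
        per board position).  Details omitted for brevity."""
        x = np.zeros(NUM_INPUTS)
        # ... fill x from `board` ...
        return x

    def evaluate(weights, board):
        # Linear evaluator: dot product of the 122 weights with the raw
        # inputs.  No hidden nodes, no sigmoid, no backpropagation.
        return float(np.dot(weights, encode_board(board)))

    def choose_move(weights, legal_boards):
        """One-move look-ahead: score every board reachable with the current
        dice roll and take the highest-scoring one."""
        return max(legal_boards, key=lambda b: evaluate(weights, b))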

3.1.1 Fitness Function and Other Parameters

A strategy's fitness is the fraction of games won by playing all the other members of the current population (not including itself), for a fixed number of games per pairwise interaction. Some previous studies have used two populations, where a member of one population is evaluated by how it performs against the members of the other population, and vice versa [10] [14]. However, learning a game like Backgammon has no pressing reason to use two populations, so this paper uses only a single population of Backgammon strategies.

The number of games per two-player interaction is as low as 10 games for some runs, and as high as 100. For example, a run in Figure 1 has a population of 200 where each player plays every other player except itself, giving 19,900 pairwise interactions. At 10 games per interaction, each generation requires just under two hundred thousand games. For 350 generations, that comes to almost seventy million games for the entire run. This may sound excessive, but evolutionary computation is easily parallelized, and such runs can be done in a day or two on affordable parallel hardware [5].

Other details of the implementation are as follows. Population size is constant during a run, with sizes from 100 to 240. Each of the 122 weights is represented as a real number, not a bit string. Elitism is 5%: the best 5% of the population are copied unchanged to the next generation. Selection is linear ranked selection with stochastic universal selection [2, page 16], with the best individual expecting 2 offspring and the worst expecting none. Each allele has a 10% chance of a Gaussian mutation with mean 0 and standard deviation 0.1, giving many small changes to reduce major disruptions. Half the new offspring are mutated, the other half are crossed over, but nobody suffers both. Crossover is uniform crossover. Initial values are uniformly random between -20 and +20, although the initial range is irrelevant because dividing or multiplying all weights by a positive constant will not change a player's behavior. We used the PGA-Pack software package (available at http://www.mcs.anl.gov/~levine/PGAPACK) and the MPICH implementation of the Message Passing Interface (available at http://www.mcs.anl.gov/mpi) for parallel machines.

3.2 Measuring Genetic Diversity

A popular measure of genetic diversity is the Shannon index. Given n different groups, each of which holds a fraction f_i of the total number of individuals, the Shannon index H is [20]:

    H = -\sum_{i=1}^{n} f_i \ln(f_i)    (1)

A population of Backgammon strategies has no clearly partitioned groups. However, for any given board position, the population partitions itself according to what move each member wants to make next. So on a per-move basis, the population's diversity can be measured with Equation 1. A single board position and dice roll would not give a fair measure of diversity: what if the whole population agreed on what to do for that particular position? To provide a representative sample, we generated 1306 board positions by playing Pubeval against itself for 20 games. Measuring the Shannon index of a population for each of these 1306 board positions gives an average Shannon index. This measure of diversity is used below.
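As an illustration of this measure, the following sketch (not from the paper; choose_move is a caller-supplied function like the evaluator sketched in Section 3.1, and the chosen moves are assumed to be hashable) averages the per-position Shannon index over a fixed sample of positions:

    import math
    from collections import Counter

    def shannon_index(fractions):
        # Equation 1: H = -sum_i f_i ln(f_i); empty groups contribute nothing.
        return -sum(f * math.log(f) for f in fractions if f > 0)

    def average_diversity(population, sampled_positions, choose_move):
        """Average Shannon index over a fixed sample of positions.
        `population` is a list of weight vectors, `sampled_positions` a list
        of pre-generated positions (e.g. the 1306 positions from Pubeval
        self-play), and `choose_move(weights, position)` returns the move
        that strategy would make."""
        total = 0.0
        for position in sampled_positions:
            chosen = [choose_move(weights, position) for weights in population]
            fractions = [c / len(population) for c in Counter(chosen).values()]
            total += shannon_index(fractions)
        return total / len(sampled_positions)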

4 Results


The question facing this paper is, for a given way to represent a solution, how can co-evolutionary learning obtain the highest ability from the least CPU time? One plausible answer is that on this noisy task, more samples (i.e., more games, and thus a more accurate measurement of ability) would improve learning. In fact, the extra sampling turns out to be an uneconomical use of valuable computation time.

4.1 How Big a Population, How Many Games

Goldberg et al. [9, page 342] show that a genetic algorithm typically needs a population size comparable to the number of variables being optimized. With a generous population size (200 and 240) for the simple representation of 122 floating-point values, varying the number of games played per pairwise interaction reveals an interesting effect in Figures 1 and 2: too few evaluations give worse peak ability (which is no surprise), but more than a certain number of evaluations does not make learning faster, nor bring it to a higher final ability.

It would be reasonable to expect that if a noisy evaluation function makes learning more difficult, then spending extra CPU on sampling to reduce that noise would learn better. But this is not what happens in Figures 1 and 2. Do those extra games make any difference at all? Perhaps the extra games are unnecessary because the uncertainty has already been reduced essentially to zero? Figures 3 and 4 demonstrate that this is not the case: the extra games (and thus more accurate evaluation) have an effect, but a bad one.

Figure 1: Population size 200, for varying accuracy in evaluation (4, 6, 8, 10, or 50 games per pairwise interaction; the vertical axis is the best of the population against Pubeval over 10,000 games, plotted by generation). Too few games does not accurately measure ability and gives poor learning, as expected. But beyond a certain number of games, the extra computation and thus more accurate measurement of ability does not produce faster or better learning.

Figure 2: Population size 240, for varying accuracy in evaluation (4, 6, 10, or 20 games per pairwise interaction). Too few games means inaccurate measurement of ability and gives poor learning, as expected. But beyond a certain number of games, extra computation and thus more accurate measurement of ability does not produce faster learning or a higher final ability.

Figures 3 and 4 show the population diversity (as measured by the Shannon index described in Section 3.2) corresponding to the runs in Figures 1 and 2. Basically, diversity goes down sharply until about generation 100, after which it stays more or less constant, but at a level determined by the number of evaluations.

Figure 3: Population size 200, showing the genetic diversity of the runs in Figure 1, as measured with the Shannon index described in Section 3.2 (average Shannon index by generation). More games, and thus more accurate evaluation, reduces diversity.

Figure 4: Population size 240, showing the genetic diversity of the runs in Figure 2. Using more games per pairwise interaction, to give a more accurate measure of ability in the fitness function, drives down the diversity of the population.

Sampling more games would more accurately discern the differences in ability among the members of the population, and thus be beneficial, only if those differences in ability were themselves independent of the number of games. Figures 3 and 4 show that the extra precision in those evaluations does indeed have an effect, but a negative one: more games reduce the behavioral diversity, which in turn requires more evaluations to discern the resulting smaller differences between players.

So what's happening? Early in a run, when there are many players of widely differing ability, extra evaluations allow the weeding out of the poor players (and their genes) faster, reducing the diversity of those left behind. With fewer games, and thus more lenient evaluation, those second-best players can survive for longer, and their genes find their way into better offspring. Diversity thus stays high for the entire run.

4.1.1 Corollary: Small Population, More Games is Worse

For the representation of only 122 floating-point values, one might consider that a population of 200 or more was rather generous. Since we're interested in achieving a representation's peak ability from the least CPU time, it is reasonable to ask what happens when the population size is barely large enough, instead of generously large. Smaller populations use less CPU time. But a smaller population has more trouble maintaining diversity, and Figures 3 and 4 show that more games reduce diversity still further, maybe too far. So we may test the following prediction: there will be certain small-ish population sizes where more games per interaction (and thus more accurate evaluation) will actually make learning worse, not better, by reducing diversity too far. The runs in Figures 5 and 6 test this prediction for populations of 120 and 160.

Figures 5 and 6 show that around 20 games per pairwise interaction (so each individual is involved in 1600 games) gives worse learning than only 8 or 10 games per interaction. A massive 100 games per interaction reduces uncertainty so much that learning improves again, a tremendously inefficient use of valuable computation time.

Figure 5: Population size 120, for different levels of precision in the noisy evaluation (4, 6, 10, 20, or 100 games per pairwise interaction). Too few games gives poor learning, as expected. More games (and thus more accurate measurement of ability) tends to reduce diversity, as shown in Figures 3 and 4. A smaller population has enough trouble maintaining diversity already, so here more games can make learning worse.

Figure 6: Population size 160, for different levels of precision in the noisy evaluation. Doing 20 games per pairwise interaction does worse than only 6 or 8 games per interaction. This is because more games (and thus more accurate measurement of ability) tends to reduce diversity, and a small population has trouble maintaining diversity at the best of times.

Figures 5 and 6 represent a real trap for anyone trying to get maximum learning from minimum CPU time. Skimping on population size, while using more CPU to reduce inaccuracies, sounds like a reasonable way to do faster runs, especially for parallel hardware, which goes faster by coarsely dividing the problem. But more evaluations reduce diversity, so marginal population sizes (which already have trouble maintaining enough diversity) can do worse, not better.

4.2 Some Suggestions That Don't Work

4.2.1 Greedy and Noisy Is Better

Perhaps Figure 6 is a side-effect of a parameter in Section 3.1.1: the best individual expects two offspring, the worst expects zero, declining linearly. This may be too greedy. Perhaps it is merely getting stuck in a local optimum. To test this, Figure 7 shows two pairs of runs, all with population 160, using either 6 or 20 games per pairwise evaluation. Those with greedy selection (where the best expects 2.0 offspring, the worst none) do better than those that are less greedy (the best expects 1.2 offspring, the worst 0.8).

Figure 7: Less greedy search does worse, so premature convergence is not to blame for more games doing worse in Figure 6 (population 160; greedy and not-greedy selection, at 6 or 20 games per pairwise interaction).

Figure 7 by itself does not prove that greed is good, since we have not elaborated the precise mechanism at work. But suffice it to say, too-high selection pressure is apparently not the villain here, and being less greedy is not a simple cure-all.

4.2.2 Annealing the Number of Games Doesn't Work

Another suggestion is to anneal the number of games. Early in a run (when differences in ability are large) few games should be needed to discern those differences, and later in a run more games could discern the smaller differences. Alas, it doesn't work, as shown in Figure 8.

More games (and thus more accurate measurement of ability) is indeed unnecessary early in a run, when differences in ability are large. For the same amount of CPU time, the run with only 4 games per pairwise interaction is ahead of the run with 6 games per interaction, but only at first. Later in the run, when only reasonably good players have survived, differences in ability are smaller and thus require more games to discern. Increasing the number of games per pairwise interaction from 4 to 6 does not, however, close the gap in ability that has already emerged between the two runs in Figure 8. For the same CPU time, better results are obtained by simply using 6 games per pairwise interaction from the beginning.

Figure 8: Instead of having all-plays-all for 20 games per pairwise interaction from the start, what if we play fewer games at first, then play more later on? (The compared runs use population 160 with 4 games per interaction throughout, 6 games throughout, or a restart at generation 20 or 40 that switches to 6 games; points of equal CPU time are marked.) This shows that learning does not improve.

It may be that a more elaborate annealing schedule (for increasing the accuracy of evaluations during a run) might be able to reduce the amount of computation time. But this brief glance at the issue suggests that coming up with a good annealing schedule is not as easy as it first seemed.

4.3 Voting, Bagging, Boosting

The performance in Figures 1 and 2 is rather unreliable: the best individual in one generation may be slightly better, the next generation slightly worse. This is due to minor changes from mutation and crossover. How can we extract a single strategy from a population which, although nearing convergence, is still a population of imperfect members, each with minor flaws? We need a single strategy to deploy against unseen opponents.

Some researchers [4] [6] [13] use evolutionary computation to come up with not just a single monolithic solution, but a population of diverse solutions, an ensemble of specialists, which can then be used together, as have ensembles created by hand. Table 1 shows that the simple expedient of having the population vote on what move to make next can fix individual imperfections and give consistently good performance. Using a simple voting setup on an entire population (instead of picking the best individual in that population) provides quite stable performance, beating Pubeval at least 51% of the time.

    Learning  Pop   Games  Strategy       NN arch   Wins out    Binomial      Pub-1
    method    size                                  of 10,000   confid.
    TD         -     -     Pubeval        linear    5,000       0.500         0.500
    GA        200    8     Best gen 350   linear    4,896       0.981         0.490
    GA        200    8     Vote gen 350   linear    5,123       7.14 x 10^-3  0.512
    GA        200   10     Best gen 370   linear    4,976       0.681         0.498
    GA        200   10     Vote gen 370   linear    5,185       1.12 x 10^-4  0.519
    GA        240    6     Best gen 370   linear    5,023       0.326         0.502
    GA        240    6     Vote gen 370   linear    5,140       2.63 x 10^-3  0.514
    TD         -     -     -              10 nodes  -           -             0.527
    TD         -     -     -              20 nodes  -           -             0.571
    TD         -     -     -              40 nodes  -           -             0.611

Table 1: Compares the co-evolutionary genetic algorithm (GA) with Temporal Difference (TD) learning on the Pub-1 metric [23, Table 1], which plays 10,000 games of Backgammon against the unseen benchmark strategy, Pubeval, itself created by TD-learning. Utilizing the diversity of the whole GA population by the simple expedient of voting gives performance approaching that of TD-learning on a more sophisticated nonlinear architecture with 10 nodes. Here, binomial confidence is the probability of the null hypothesis that the player is only as good as Pubeval; a value of 0.05 or less means it is almost certainly better than Pubeval.

That 51% might seem a small margin, but bear in mind that Pubeval is already highly optimized. Pubeval was created by Temporal Difference learning, which also created the world's best Backgammon computer [21], so improving upon TD-learning on its favorite task is no small achievement. We may test whether the difference is a fluke by calculating the binomial confidence [19, page 6]: for a player that wins with probability p = 0.5 (the null hypothesis, that both players are in fact equally good), the probability of it winning s or more games out of n is given by summing:

    P(\text{No. of wins} \ge s) = \sum_{i=s}^{n} \frac{n!}{i!(n-i)!} \, p^i (1-p)^{n-i}    (2)
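As a check on the arithmetic, the following sketch (not from the paper) evaluates Equation 2 directly, working in log space to avoid floating-point underflow:

    from math import lgamma, log, exp

    def binomial_confidence(wins, n=10000, p=0.5):
        """Equation 2: the probability of winning `wins` or more games out of
        n when the true win probability is p (the null hypothesis)."""
        total = 0.0
        for i in range(wins, n + 1):
            log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                        + i * log(p) + (n - i) * log(1 - p))
            total += exp(log_term)
        return total

    # For example, 5,123 wins out of 10,000 gives roughly 7 x 10^-3,
    # in line with the "Vote gen 350" row of Table 1.
    print(binomial_confidence(5123))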

The Pub-1 metric [23, Table 1] plays a strategy against the unseen benchmark Pubeval for 10,000 games, which allows us to calculate the binomial confidence and thus to test the null hypothesis that they are equal in ability. If Equation 2 gives less than 0.05, we may be confident that the difference is not a fluke. The binomial confidence values in Table 1 show that getting the population to vote consistently outperforms Pubeval, even though the GA used Pubeval's own architecture. Even TD-learning could only reach 52.7% on a more sophisticated architecture with 10 hidden nodes; co-evolutionary learning covers much of that gap without needing the fancier architecture. Voting is by no means the only possible approach: bagging and boosting [18] are popular methods to utilize diverse opinions. Another possibility is using local search methods [26, Section 2.4], such as a variation on hill-climbing [16], to fix the imperfections in a single individual.
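To make the voting scheme concrete, here is a minimal sketch (not from the paper; it reuses the hypothetical evaluate function from the Section 3.1 sketch via a parameter, and assumes candidate boards are hashable so they can be counted) in which every member of the population votes for its preferred move and the most popular move is played:

    from collections import Counter

    def vote_on_move(population, legal_boards, evaluate):
        """Each weight vector in `population` votes for the reachable board it
        scores highest; the board with the most votes is played.  Ties are
        broken arbitrarily."""
        votes = Counter(
            max(legal_boards, key=lambda b: evaluate(w, b)) for w in population
        )
        return votes.most_common(1)[0][0]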

5 Discussion

5.1 TD-Gammon and Neural Network Issues

The voting runs in Table 1 give performance of more than 51% against Pubeval on Pubeval's own representation. While this may sound modest, calculating the binomial confidence in Table 1 shows it is a small but statistically significant improvement over the already highly optimized solution of Pubeval.

Temporal Difference learning achieves only 52.7% on the Pub-1 metric [23, Table 1] with a more sophisticated architecture of 10 hidden nodes; co-evolutionary learning covers much of that gap without needing the fancier architecture. One suggestion is to apply co-evolutionary learning to a more elaborate architecture, just as TD-learning did to create TD-Gammon, the world's best Backgammon computer [22]. Unfortunately, there are some obstacles.

5.1.1 Elaborate Feature Detectors

The feature detectors used by TD-Gammon were written by human experts: Gerald Tesauro obtained them from the Gammontool program, courtesy of Sun Microsystems [21, page 389]. While this is feasible, it increases the need for human engineering and expertise, and the whole point of artificial intelligence is to get the machine to do as much work as possible by itself. More interesting than TD-Gammon, in that no hand-written feature detectors were needed, is the recent work of Chellapilla and Fogel [3], which used co-evolution to create a Checkers player that plays at the Expert level (a US Chess Federation standard) using only raw board positions.

5.1.2 Crossover and Neural Networks

More significantly, a fixed feedforward neural network architecture is ill-suited to the genetic algorithm used here, because the competing label convention problem [24] [26, Section 2.1] makes crossover problematic. Consider the two feedforward networks in Figure 9: if they use a sigmoid function that is odd-symmetric, such as the tanh function, they are functionally identical; two hidden nodes have merely been swapped over, and all of the weights associated with one of the nodes have been multiplied by -1. Despite being functionally identical, the two networks in Figure 9 will appear very different to a genetic algorithm. Crossover will produce weak offspring, because the corresponding parts are not in corresponding places. In general, for a network with n hidden nodes, there are 2^n n! different ways to write down the same network [24, page 572], obtained by permuting hidden nodes and (if the sigmoid function is odd-symmetric, like the tanh function) by flipping signs.
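The following sketch (not from the paper; the network sizes and weights are arbitrary) checks numerically that permuting hidden nodes and flipping the signs of one node's weights, with a tanh hidden layer, leaves the network's output unchanged:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden = 4, 3
    W1 = rng.normal(size=(n_hidden, n_in))   # input-to-hidden weights
    w2 = rng.normal(size=n_hidden)           # hidden-to-output weights

    def output(W1, w2, x):
        # One hidden layer with tanh activation, linear output.
        return float(w2 @ np.tanh(W1 @ x))

    # Swap hidden nodes 0 and 1, and flip every weight attached to node 2.
    perm = [1, 0, 2]
    W1b, w2b = W1[perm].copy(), w2[perm].copy()
    W1b[2] *= -1.0
    w2b[2] *= -1.0   # tanh(-z) = -tanh(z), so the two sign flips cancel

    x = rng.normal(size=n_in)
    print(np.isclose(output(W1, w2, x), output(W1b, w2b, x)))  # prints True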

Figure 9: These two feedforward neural networks are functionally equivalent. The hidden nodes have been swapped over, and one has had all its weights multiplied by -1. For n hidden nodes, there are 2^n n! different permutations of the same network, from swapping nodes and flipping signs. Crossover has trouble because the functionally equivalent nodes are not in the same place, or do not have the same sign.

One way around the competing label convention problem is to sort the nodes in some way before crossover. Thierens [24] obtained interesting results with an approach like this, but presented limited experimental evidence to support it. The usual way to deal with this problem is to simply avoid doing crossover altogether, or to invent tailored genetic operators [26, Section 2.2].

5.1.3 Enormous Population Sizes

Even with a way to cross over feedforward networks, a more daunting problem is that genetic algorithms need a population size proportional to the number of variables [9, page 342]. On the linear architecture of only 122 weights, a population of 160 to 200 was required here. To emulate TD-Gammon with its 160 hidden nodes, even ignoring any elaborate feature detectors and simply using the 122 raw board inputs, would require a network with about twenty thousand weights. That would need a population of twenty-five thousand or so, and would require a much heavier investment of computation time than TD-learning required. Even though evolutionary computation is easily parallelized, and TD-learning is not, such demands are a considerable obstacle.

5.2 Co-Evolution, What Is It Good For?

Which noisy tasks are better suited to Temporal Difference learning [21], or some other non-population method such as hill-climbing [16], than to co-evolution? The answer depends partly on your learning task and computational resources, and of course on the No Free Lunch theorem [25].

The voting strategies in Table 1 beat Pubeval over 51% of the time, which suggests that co-evolution is the better option. However, that small but real margin required vastly more computation. If your task is such that a small improvement doesn't count, and parallel hardware is unavailable, then Temporal Difference learning is attractive. If the task requires the best possible competitive advantage, and coming second means losing, then using much more CPU time for the best possible results may be worth it.

Parallel implementation means co-evolution can be just as fast as a non-population method on a serial computer. Evolutionary computation is easily parallelized, unlike TD-learning or hill-climbing. So if parallel hardware is available, co-evolution becomes attractive. For example, all the runs in this paper were done on idle student lab machines in a manner that did not interrupt students; the program detects when a student starts doing something, and quietly turns itself off [5]. This approach provides supercomputer capabilities at zero extra cost. All the runs in this paper took at most several days. Most were merely overnight runs.

6 Conclusion

The question facing this paper is, for a given solution representation, how can co-evolutionary learning obtain the highest skill from the least CPU time? This question can be answered with the following brief points.

- Use a generously large population: more noise requires more population [9, page 342]. If you skimp on population size, more evaluations can be worse for learning, not better, because of their tendency to reduce diversity.

- Use just enough evaluations, so that more does not improve learning. This depends on the task and your implementation. Here, each individual needs to take part in about 1600 games.

Computationally intensive noisy tasks are tractable to co-evolutionary learning on inexpensive parallel hardware, and given enough computational power, co-evolution can create a solution comparable to Temporal Difference learning.

Acknowledgments

The author is grateful to Chris Pascoe, David Reeves, and Adrian Lee for invaluable technical support. Discussions with colleagues from Brandeis University's DEMO Lab, the University of Queensland's Cognitive Science Group, and the University of Melbourne's Computer Science Department were most helpful. Gerald Tesauro made available the Backgammon strategy Pubeval. The author's research is partly supported by the Australian Research Council.

Bibliography

[1] Robert M. Axelrod. The evolution of strategies in the iterated prisoner's dilemma. In Genetic Algorithms and Simulated Annealing, chapter 3, pages 32–41. Morgan Kaufmann, 1987.
[2] James E. Baker. Reducing bias and inefficiency in the selection algorithm. In Proceedings of the Second International Conference on Genetic Algorithms, pages 14–21. Lawrence Erlbaum Associates, 1987.
[3] Kumar Chellapilla and David B. Fogel. Evolution, neural networks, games, and intelligence. Proceedings of the IEEE, 87(9):1471–1496, September 1999.
[4] Sung-Bae Cho and Katsunori Shimohara. Modular neural networks evolved by genetic programming. In Proceedings of the 1996 IEEE Conference on Evolutionary Computation, pages 681–686. IEEE Press, 1996.
[5] Paul J. Darwen. Unobtrusive workstation farming without inconveniencing owners: Learning Backgammon with a genetic algorithm. In IEEE International Workshop on Cluster Computing, pages 303–311. IEEE Computer Society Press, 1999.
[6] Paul J. Darwen and Xin Yao. Speciation as automatic categorical modularization. IEEE Transactions on Evolutionary Computation, 1(2):101–108, July 1997.
[7] David B. Fogel. Evolving behaviours in the iterated prisoner's dilemma. Evolutionary Computation, 1(1):77–97, 1993.
[8] Richard P. Gagan. Artificial intelligence in training applications. Electronic Progress, 21(1):22–27, 1992.
[9] David E. Goldberg, Kalyanmoy Deb, and James H. Clark. Genetic algorithms, noise, and the sizing of populations. Complex Systems, 6:333–362, 1992.
[10] W. Daniel Hillis. Co-evolving parasites improve simulated evolution as an optimization procedure. In Artificial Life II, pages 313–323. Addison-Wesley, 1991.
[11] Philip Husbands and Frank Mill. Simulated coevolution as the mechanism for emergent planning and scheduling. In Proceedings of the Fourth International Conference on Genetic Algorithms, pages 264–270. Morgan Kaufmann, 1991.
[12] Hugues Juillé and Jordan B. Pollack. Co-evolving intertwined spirals. In Proceedings of the Fifth Annual Conference on Evolutionary Programming, pages 461–468. MIT Press, 1996.
[13] Yong Liu and Xin Yao. Ensemble learning via negative correlation. Neural Networks, 12(10):1399–1404, December 1999.
[14] Helmut A. Mayer and Roland Schwaiger. Evolutionary and coevolutionary approaches to time series prediction using generalized multi-layer perceptrons. In Congress on Evolutionary Computation, pages 275–280. IEEE Press, 1999.
[15] Robert G. Palmer, W. Brian Arthur, John H. Holland, Blake LeBaron, and Paul Tayler. Artificial economic life: a simple model of a stockmarket. Physica D, 75:264–274, 1994.
[16] Jordan B. Pollack and Alan D. Blair. Co-evolution in the successful learning of Backgammon strategy. Machine Learning, 32(3):225–240, 1998.
[17] Mitchell A. Potter, Kenneth A. De Jong, and John J. Grefenstette. A coevolutionary approach to learning sequential decision rules. In Proceedings of the Sixth International Conference on Genetic Algorithms, pages 366–372. Morgan Kaufmann, 1995.
[18] J. Ross Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth AAAI National Conference on Artificial Intelligence, pages 725–730. AAAI Press, 1996.
[19] Steven L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1(3):317–327, 1997.
[20] Claude E. Shannon and Warren Weaver. The Mathematical Theory of Communication. University of Illinois Press, 1949.
[21] Gerald Tesauro. Practical issues in temporal difference learning. Machine Learning, 8:257–277, 1992.
[22] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, March 1995.
[23] Gerald Tesauro. Comments on "Co-evolution in the successful learning of Backgammon strategy". Machine Learning, 32(3):241–243, 1998.
[24] Dirk Thierens. Non-redundant genetic coding of neural networks. In Proceedings of the 1996 IEEE International Conference on Evolutionary Computation, pages 571–575. IEEE Press, 1996.
[25] David H. Wolpert and William G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, April 1997.
[26] Xin Yao. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447, September 1999.