Social interaction as a heuristic for combinatorial optimization problems

José F. Fontanari

arXiv:1009.1114v2 [cs.DS] 4 Nov 2010

Instituto de Física de São Carlos, Universidade de São Paulo, Caixa Postal 369, 13560-970 São Carlos, São Paulo, Brazil

We investigate the performance of a variant of Axelrod's model for dissemination of culture, the Adaptive Culture Heuristic (ACH), on solving an NP-Complete optimization problem, namely, the classification of binary input patterns of size F by a Boolean Binary Perceptron. In this heuristic, N agents, characterized by binary strings of length F which represent possible solutions to the optimization problem, are fixed at the sites of a square lattice and interact with their nearest neighbors only. The interactions are such that the agents' strings (or cultures) become more similar to the low-cost strings of their neighbors, resulting in the dissemination of these strings across the lattice. Eventually the dynamics freezes into a homogeneous absorbing configuration in which all agents exhibit identical solutions to the optimization problem. We find through extensive simulations that the probability of finding the optimal solution is a function of the reduced variable $F/N^{1/4}$, so that the number of agents must increase with the fourth power of the problem size, $N \propto F^4$, to guarantee a fixed probability of success. In this case, we find that the relaxation time to reach an absorbing configuration scales with $F^6$, which can be interpreted as the overall computational cost of the ACH to find an optimal set of weights for a Boolean Binary Perceptron, given a fixed probability of success.

PACS numbers: 87.23.Ge, 89.75.Da, 89.70.Eg, 05.50.+q

I. INTRODUCTION

In the early eighties, the perception that the dynamics of the celebrated Hopfield model of associative memory [1] was solving an optimization problem, namely, that of finding which stored pattern is closest to the input configuration, led to the proposal of a powerful general-purpose optimization heuristic, the so-called Hopfield-Tank neural network [2]. A similar situation arose in the late nineties, when Kennedy [3] pointed out that Axelrod's model of culture dissemination [4] could work as a collective problem-solving system provided that one associates the cultures of the agents (represented by strings of integer numbers) with the trial solutions of a given optimization problem. That proof-of-concept paper demonstrated that social interaction is a natural computation method.

In contrast with the Hopfield-Tank neural network, the optimization heuristic based on social interaction, which henceforth we refer to as the Adaptive Culture Heuristic (ACH), has not enjoyed great popularity among the physics and computer science communities, perhaps because of the appearance at the same time of a related algorithm, called particle swarm optimization, which has by now become an established optimization paradigm [5, 6]. Particle swarm optimization, however, is best suited to searching spaces of real-valued variables, whereas the ACH is suited to exploring configuration spaces of discrete-valued variables, which is the case of most combinatorial optimization problems that have attracted the attention of the statistical physics community [7]. Here we attempt to change this situation by showing that the performance of the ACH seems to scale very favorably (it improves exponentially fast) with the number of agents in the system.

Following Axelrod's model [4], the ACH requires a population of $N = L^2$ agents placed at the sites of a square lattice of size L × L with periodic boundary conditions. The agents can interact with their four nearest neighbors only. Each agent is characterized by a binary string of length F, which represents the agent's solution to the optimization problem in the ACH interpretation. In Axelrod's model this string, which is not necessarily binary, represents the culture of the agent. The interaction between any two neighboring agents occurs whenever the agents have different strings, regardless of their associated cost, and it is such that the string of the agent with the higher cost solution is slightly modified to become more similar to that of the more efficient partner.

We recall that in Axelrod's model the interaction between two neighboring agents takes place with probability proportional to the number of entries their cultural strings have in common, and so agents with completely different cultures do not interact. When the agents are allowed to interact, the interaction increases the similarity between the cultures of the two agents, as in the ACH update rule. The fact that some agents are prohibited from interacting is the key ingredient for the existence of stable globally polarized states (i.e., culturally heterogeneous absorbing configurations), which is the major outcome of Axelrod's model [4]. In the ACH, however, we seek homogeneous absorbing configurations associated with low-cost solutions of the target optimization problem, and so the homogenizing interactions are always allowed regardless of the similarity between the strings of the neighboring agents [3].

In order to obtain statistically reliable results on the scaling of the performance of the ACH with the size F of the optimization problem and the number N of agents in the lattice, we focus on a specific optimization problem which involves the manipulation of binary variables only,

namely, the categorization of binary input patterns by the Boolean Binary Perceptron. This is an NP-Complete problem [8] for which there is no efficient specific heuristic optimization method available [9] and whose random version has received considerable attention from the statistical mechanics community (see, e.g., [10–15]) because its phase diagram exhibits a frozen phase similar to that of the Random Energy Model [16].

The main result of this paper is that, given a fixed probability of success, the overall computational cost of the ACH to find a minimum-cost solution for the learning problem in a Boolean Binary Perceptron scales with the sixth power of the size of the input string. Of course, this finding has no implication for the celebrated NP ≠ P conjecture of computer science, since the $F^6$ scaling holds for typical realizations of the input-output mapping, rather than for all realizations as would be required to disprove that assertion. In addition, the ACH is not a deterministic algorithm, which disqualifies the heuristic as a candidate to disprove the NP ≠ P conjecture.

The rest of this paper is organized as follows. First we introduce the target optimization problem (categorization of binary patterns by the Boolean Binary Perceptron) on which we will measure the performance of the Adaptive Culture Heuristic (Sect. II). This heuristic is then described in detail in Sect. III, and the results of its performance on the training task, measured by the probability that the heuristic finds a minimum-cost solution, are presented in Sect. IV. In that section we also present the performance of the ACH when the agents are placed at the nodes of random symmetric graphs and argue that the square lattice connectivity C = 4 yields the best performance. Finally, in Sect. V we present our concluding remarks.

II. THE BOOLEAN BINARY PERCEPTRON

The Boolean Binary Perceptron is a single-layer neural network whose weights are constrained to take on binary values only. More explicitly, the network consists of an input layer with F binary neurons $s_k = \pm 1$, k = 1, ..., F, with each input neuron connected to the output unit $o = \pm 1$ through the weights $w_k = \pm 1$, k = 1, ..., F. The state of the output unit is given by the equation

$$o = \mathrm{sign}\left( \sum_{k=1}^{F} w_k s_k \right) \qquad (1)$$

where sign(x) = 1 for x ≥ 0 and −1 otherwise. We will restrict F to take on odd integer values only, so we can guarantee that the argument of the sign function will never vanish. The learning task is to find a set of weights $\hat{w}_k = \pm 1$, k = 1, ..., F, that emulates the input-output mapping $(s_1^l, \ldots, s_F^l) \to t^l$ for l = 1, ..., M. If the weights were allowed to assume real values then this learning task could easily be accomplished by the perceptron learning algorithm or by the Widrow-Hoff rule [17]. However, when the binary constraint is taken into account the learning task becomes an NP-complete problem since it is equivalent to integer programming [8]. Assuming that NP ≠ P, this means that no deterministic algorithm can find $\hat{w}_k$, k = 1, ..., F (if it exists) for any realization of the input-output mapping in a time that grows polynomially with the parameter F.

Here we focus on random versions of the input-output mapping where the input entries $s_k^l$ are statistically independent random variables chosen as ±1 with equal probability. As for the output $t^l$ we consider two schemes. In the first scheme, we choose $t^l = \pm 1$ at random with equal probability, the so-called random output mapping. In this case, it is not possible to guarantee that there is a set of binary weights that emulates the input-output mapping perfectly. In fact, statistical mechanics studies based on the landmark paper by Gardner [18] show that in the limit F → ∞ there are optimal sets of weights provided that the ratio $\alpha \equiv M/F$ is less than $\alpha_{cr} \approx 0.83$ [11, 15]. So, in this limit, we say that the input-output mapping is linearly Boolean separable for $\alpha < \alpha_{cr}$. However, it is convenient to consider input-output mappings which are linearly Boolean separable for any choice of the parameters F and M. This observation motivates the second scheme to set the values of the outputs $t^l$, which are given by

$$t^l = \mathrm{sign}\left( \sum_{k=1}^{F} w_k^0 s_k^l \right) \qquad (2)$$

for l = 1, ..., M. Here $w_k^0$, k = 1, ..., F, are statistically independent random variables that take on the values ±1 with equal probability. Clearly, such an input-output mapping is linearly Boolean separable by construction, since the set of binary weights $w_k^0$, k = 1, ..., F, emulates it perfectly. The solution weight space of this problem was studied numerically [10, 12] and analytically [13, 14], resulting in the conclusion that in the limit F → ∞ the only solution to the mapping is the teacher perceptron $w_k^0$, k = 1, ..., F, for $\alpha > \alpha_c^0 \approx 1.245$. From the perspective of interpreting the neural network training as an optimization problem we define the following cost function

$$E(\{w_k\}) = \sum_{l=1}^{M} \Theta\left( -t^l \sum_{k=1}^{F} w_k s_k^l \right) \qquad (3)$$

where Θ(x) = 1 if x ≥ 0 and 0 otherwise. Hence the cost E yields the number of misclassified inputs and so its minimum (optimum) value is zero in the case of a linearly Boolean separable mapping. In this paper we will concentrate mostly on the linearly Boolean separable mappings defined by Eq. (2) because in this case the optimal solution is known a priori, so we can evaluate the performance of the ACH for relatively large problems (F < 200), whereas in the random output mapping we are restricted to the range F < 25, since we need to carry out an exhaustive search over the $2^F$ possible weight configurations in order to find the minimum-cost solution. However, our findings indicate that, regarding the scaling with respect to the relevant parameters of the problem, the performance of the heuristic is essentially the same regardless of whether the mapping is linearly Boolean separable or not.
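The definitions in Eqs. (1) to (3) translate almost line by line into code. The following is a minimal Python sketch, not the author's implementation: the function names (`output`, `teacher_mapping`, `cost`), the random seed, and the small sizes F = 11, M = 22 are illustrative assumptions. It builds a linearly Boolean separable mapping from a random teacher and evaluates the cost of a candidate weight string.

```python
import random

def sign(x):
    """sign(x) = 1 for x >= 0 and -1 otherwise, as defined below Eq. (1)."""
    return 1 if x >= 0 else -1

def output(weights, pattern):
    """Output of the Boolean Binary Perceptron, Eq. (1)."""
    return sign(sum(w * s for w, s in zip(weights, pattern)))

def random_string(F, rng):
    """F independent entries equal to +1 or -1 with equal probability."""
    return [rng.choice((-1, 1)) for _ in range(F)]

def teacher_mapping(F, M, rng):
    """Linearly Boolean separable mapping: targets set by a random teacher, Eq. (2)."""
    teacher = random_string(F, rng)
    patterns = [random_string(F, rng) for _ in range(M)]
    targets = [output(teacher, s) for s in patterns]
    return teacher, patterns, targets

def cost(weights, patterns, targets):
    """Number of misclassified patterns, Eq. (3).

    For odd F the weighted sum never vanishes, so counting wrong outputs
    is the same as summing Theta(-t * sum_k w_k s_k).
    """
    return sum(1 for s, t in zip(patterns, targets) if output(weights, s) != t)

rng = random.Random(0)   # arbitrary seed, for reproducibility only
F, M = 11, 22            # F odd; M = 2F as used in Sect. IV
teacher, patterns, targets = teacher_mapping(F, M, rng)
print(cost(teacher, patterns, targets))  # the teacher has zero cost by construction
```

Because F is odd, reversing every weight reverses every output, so the string opposite to the teacher misclassifies all M patterns; this gives a quick sanity check of `cost`.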

III. THE ADAPTIVE CULTURE HEURISTIC

The set of weights of a Boolean Binary Perceptron is completely specified by a binary string of length F. In the adaptive culture heuristic, each such string is interpreted as the culture of an agent and its cost, given by Eq. (3), measures the unworthiness of the culture. The idea behind the ACH is that the agents should prefer to adopt more valuable cultures, i.e., those cultures associated with low cost values [3]. In this context, it is more convenient to refer to the strings that characterize the agents as solutions rather than cultures. As already pointed out, the agents are fixed at the sites of a square lattice of size L × L with periodic boundary conditions and can interact with their four nearest neighbors only. At each time step we pick an agent at random (this is the target agent) as well as one of its four neighbors. These two agents will interact provided that the cost (3) of the solution associated with the target agent is greater than or equal to the cost of the solution associated with the randomly selected neighbor. An interaction consists of selecting at random and then flipping one of the entries which distinguish the target agent from its neighbor. Note that only the string of the target agent is updated, i.e., the agent with the higher cost solution is changed to become more similar to its neighbor. This change may actually increase the cost of the solution of the target agent, due to the highly nonlinear dependence of the cost (3) on the individual entries of the binary string. This procedure is repeated until the dynamics freezes in a homogeneous absorbing configuration. We can guarantee that the frozen configurations are homogeneous because we allow interactions, and so changes in the target agent, even when the two interacting agents have the same cost value.

Because of the need to re-calculate the cost function after each interaction, the implementation of the ACH to search for near optimal weights of the Boolean Binary Perceptron is computationally very demanding, and so an extensive statistical analysis of the performance of this heuristic requires a highly optimized code. In particular, to simulate the ACH efficiently for large lattices we use a procedure based on the concept of active agents (see [19, 20]). An active agent is an agent whose solution differs from the solution of at least one of its four neighbors. Clearly, only active agents can change their strings and so it is more efficient to select the target agent randomly from the list of active agents rather than from the entire lattice. In the case that the solution string of the target agent is modified by the updating rule, we need to re-examine the active/inactive status of the target agent as well as of all its neighbors so as to update the list of active agents. The dynamics is frozen when the list of active agents is empty. Note that the cost of the solution string plays no role in the definition of active agents.

IV. RESULTS

All our results are obtained for M = 2F so that for the linearly Boolean separable case the teacher set of weights $w_k^0$ is the only global minimum (zero-cost) solution of the cost function (3), provided that F is sufficiently large. However, what is crucial for our purposes is the knowledge that for any value of F there is at least one solution for which the cost is zero, so that we can focus on the number of runs of the ACH which result in this minimal cost, regardless of whether the actual solution found by the heuristic is the teacher solution or another degenerate zero-cost solution. In particular, for each realization of the input-output mapping we run the ACH for 100 random initial settings of the agents' solutions and calculate the fraction of runs for which the heuristic reaches a minimum-cost solution. This fraction is then averaged over a variable number, ranging from 500 to $10^6$, of realizations of the input-output mapping. As pointed out before, most of our results are for the linearly Boolean separable case since in this case we know by construction the cost of the optimum solution and so we can study the performance of the heuristic for large values of F. At the end of this section we present some results for the random Boolean mapping, restricted to the region F ≤ 25 because in that case we first need to perform an exhaustive search in the solution space to find the minimum cost.

The main quantity we focus on here is the mean fraction of runs for which the heuristic reached the minimum-cost solution, which can be interpreted as the probability $P_m$ that a run of the ACH finds the optimum cost. This quantity is shown in Fig. 1 for the linearly Boolean separable case as function of the size F of the problem and of the number N of agents in the system.
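To make the measurement protocol concrete, the update rule of Sect. III and the success fraction for a single realization of the mapping can be sketched in Python as follows. This is a small illustrative sketch under stated assumptions, not the highly optimized code used for the results: the names `run_ach`, `success_fraction`, `refresh`, and `max_steps`, the list-plus-index bookkeeping for the active agents, and the tiny parameter values are all hypothetical choices, and the relaxation time is not tracked.

```python
import random

def cost(w, patterns, targets):
    """Number of misclassified patterns, Eq. (3)."""
    errors = 0
    for s, t in zip(patterns, targets):
        field = sum(wk * sk for wk, sk in zip(w, s))
        if (1 if field >= 0 else -1) != t:
            errors += 1
    return errors

def neighbors(i, L):
    """Four nearest neighbors of site i on an L x L lattice with periodic boundaries."""
    x, y = divmod(i, L)
    return [((x + 1) % L) * L + y, ((x - 1) % L) * L + y,
            x * L + (y + 1) % L, x * L + (y - 1) % L]

def run_ach(L, patterns, targets, rng, max_steps=10**6):
    """One ACH run; returns the cost shared by the frozen lattice.

    max_steps is a safety cap added for this sketch (None is returned if it is hit).
    """
    F, N = len(patterns[0]), L * L
    agents = [[rng.choice((-1, 1)) for _ in range(F)] for _ in range(N)]
    costs = [cost(a, patterns, targets) for a in agents]
    nbrs = [neighbors(i, L) for i in range(N)]

    def is_active(i):
        # Active means: differs from at least one neighbor (cost plays no role).
        return any(agents[i] != agents[j] for j in nbrs[i])

    active = [i for i in range(N) if is_active(i)]
    pos = {i: p for p, i in enumerate(active)}  # site -> index in `active`

    def refresh(i):
        """Add or remove site i so that `active` matches its current status."""
        if is_active(i):
            if i not in pos:
                pos[i] = len(active)
                active.append(i)
        elif i in pos:
            p, last = pos.pop(i), active.pop()
            if last != i:          # fill the hole with the former last element
                active[p] = last
                pos[last] = p

    for _ in range(max_steps):
        if not active:
            break
        target = rng.choice(active)
        partner = rng.choice(nbrs[target])
        if costs[target] < costs[partner]:
            continue               # interact only if target cost >= partner cost
        differing = [k for k in range(F) if agents[target][k] != agents[partner][k]]
        if not differing:
            continue
        k = rng.choice(differing)
        agents[target][k] = agents[partner][k]   # flip one distinguishing entry
        costs[target] = cost(agents[target], patterns, targets)
        for i in [target] + nbrs[target]:
            refresh(i)
    return costs[0] if not active else None

def success_fraction(L, F, n_runs, rng):
    """Fraction of runs that freeze at zero cost for one teacher mapping with M = 2F."""
    teacher = [rng.choice((-1, 1)) for _ in range(F)]
    patterns = [[rng.choice((-1, 1)) for _ in range(F)] for _ in range(2 * F)]
    targets = [1 if sum(w * s for w, s in zip(teacher, p)) >= 0 else -1 for p in patterns]
    hits = sum(run_ach(L, patterns, targets, rng) == 0 for _ in range(n_runs))
    return hits / n_runs

print(success_fraction(L=5, F=7, n_runs=10, rng=random.Random(1)))
```

Averaging `success_fraction` over many independently generated mappings would give an estimate of $P_m$ in the sense described above; the sizes used here are far too small to reproduce the scaling results.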
Figure 1 reveals a most surprising aspect of the performance of the ACH, namely, that for small N, say $N = 5^2$, a fourfold increase in the number of agents in the system increases the probability of finding an optimal solution by several orders of magnitude. Actually, this observation holds true even for large N, provided that F is large enough. To quantify this observation, in Fig. 2 we show how $P_m$ approaches 1 as the number of agents N increases for two values of the input size F. This analysis shows that for $N > 30^2$, the probability $1 - P_m$ that the heuristic fails to find the optimum cost vanishes like $\exp(-a_F N^{1/4})$, where the (fitting) parameter $a_F$ is inversely proportional to F. These findings prompt us to redraw Fig. 1 in terms of the rescaled variable $u \equiv F/N^{1/4}$, which is done in Fig. 3. The collapse of the data for $N \geq 30^2$ into a single curve implies that $P_m = g(u)$. We note that the failure of the scaling function g(u) to describe the data for $N < 30^2$ was already expected from the results of Fig. 2. In fact, those results show that in the limit u → 0 we have $1 - g(u) \sim \exp(-a/u)$ with a ≈ 0.5.

FIG. 1. The probability $P_m$ that a run of the ACH finds a zero-cost solution for linearly Boolean separable mappings as function of the input size F for lattices with (left to right) $N = 5^2, 10^2, 20^2, 30^2, 40^2, 50^2$ and $60^2$ agents. The error bars are smaller than the sizes of the symbols and the lines are guides to the eye.

FIG. 2. Semi-logarithmic plot of the probability $1 - P_m$ that a run of the ACH does not find a zero-cost solution for linearly Boolean separable mappings as function of $N^{1/4}$ for F = 91 (○) and F = 41 (△). The dashed straight lines are the fittings $1 - P_m = b_F \exp(-a_F N^{1/4})$.

FIG. 3. The same data exhibited in Fig. 1 plotted in terms of the rescaled variable $u \equiv F/N^{1/4}$. The data for $N \geq 30^2$ lie on approximately the same curve, given by the scaling function $P_m = g(u)$.

The study of the scaling function g(u) in the other extreme limit, u → ∞, requires very large input sizes (F > 200) for relatively large lattices ($N \geq 30^2$), which is computationally unfeasible because of the need to use a huge number of samples to get reliable statistics, since $P_m \to 0$ in this limit. Nevertheless, in Fig. 4 we present such an analysis in the case of small lattices, $N = 5^2$ and $N = 10^2$, for which we know the scaling behavior is not valid. As expected, the results show that $P_m$ vanishes exponentially with increasing F, i.e., $P_m \sim \exp(-a_N F)$. Here the fitting parameter is given by $a_N \approx 1/N^{1/2}$, indicating that for small N the gain in performance obtained by increasing the number of agents is much larger than the gain in the scaling regime, where $a_N \sim 1/N^{1/4}$. In addition, Fig. 4 is useful to highlight the enormous gain in performance resulting from the increase of the number of agents involved in the optimization procedure.

FIG. 4. Semi-logarithmic plot of the probability $P_m$ that a run of the ACH finds a zero-cost solution for linearly Boolean separable mappings as function of the input size F for $N = 5^2$ (+) and $N = 10^2$ (×). The error bars are smaller than the sizes of the symbols. The solid straight line yields the probability that the optimal solution is chosen in a random selection, $2^{-F}$, whereas the dashed straight lines are the fittings $P_m = b_N \exp(-a_N F)$.

A most appealing feature of the ACH is that the dynamics always freezes in a homogeneous absorbing configuration and so the algorithm halts. We must note, however, that the ACH is a stochastic heuristic since the same initial configuration of the lattice can lead to different absorbing configurations depending on the sequence of site updates. The fact that the dynamics eventually freezes allows us to define a relaxation time for the ACH, which is a quite unexpected bonus for a stochastic heuristic. Accordingly, in Fig. 5 we show the scaled average relaxation time T/N as function of the input size F. The unsurprising fact that T scales linearly with the number of agents N is manifested by the coincidence of results for different lattice sizes. The instructive result here is that T grows only with the square of the input size. This result will be useful for the evaluation of the overall computational demand of the ACH (see Sect. V).

FIG. 5. Scaled relaxation time T/N of the ACH as function of the input size F for $N = 30^2$ (○), $40^2$ (△) and $50^2$ (▽). The error bars are smaller than the sizes of the symbols. The dashed curve is the fitting $T/N = 0.12 F^2$.

The effect of the use of linearly Boolean separable input-output mappings on the measured performance of the ACH can be appreciated in Fig. 6, where we show a comparison between the performance of the heuristic for the random and the linearly Boolean separable mappings. As mentioned before, in the case of the random mapping the minimum cost is not necessarily zero and the global minimum is obtained through an exhaustive search in the configuration space (hence the restriction to F ≤ 25). Although the random mapping seems to be a harder problem for the ACH, there is no qualitative difference between the dependence of our performance measure $P_m$ on the parameters N and F for the two mappings, and so our scaling results are likely to remain true for the random mapping as well.

FIG. 6. Comparison between the performances ($P_m$ as function of F) of the ACH for the random mapping (open symbols) and the linearly Boolean separable mapping (filled symbols) for $N = 5^2$ (○), $10^2$ (△) and $20^2$ (▽). The error bars are smaller than the sizes of the symbols and the lines are guides to the eye.

To conclude our analysis, a word is in order about the impact of the connectivity between the agents on the performance of the ACH. It is well known that the expansion of the influence range of the agents, modeled by increasing the connectivity of the lattice [21, 22] or by placing the agents in more complex networks [23] (e.g., small-world and scale-free networks), results in the cultural homogenization of the population in Axelrod's model. Hence, it is not unreasonable to expect that by increasing the connectivity of the lattice (or network) the relaxation time would decrease and so the computational cost of the heuristic would be reduced. Alas, that is not so. In fact, the results of Fig. 7, which shows the scaled relaxation time T/N as function of the connectivity C of a random symmetric network composed of $N = 10^2$ agents, indicate that T/N reaches a minimum around C = 4. As expected, we find that the probability $P_m$ of reaching the optimal solution is not affected by the choice of the connectivity C, and so the connectivity C = 4 yields the best performance of the ACH, in the sense of the least computational cost, for not too small F. In addition, the finding that the results of the random symmetric network with C = 4 are indistinguishable from the results obtained for the regular square lattice (data not shown) suggests that the topology of the network does not influence the performance of the ACH.
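As an aside on the scaling analysis earlier in this section, the small-u form of the scaling function can be obtained from the fit of Fig. 2 in two lines. The sketch below assumes that the fitted slope is exactly $a_F = a/F$ and that the prefactor $b_F$ depends only weakly on F; neither assumption is stated beyond the fits quoted in the text. Since $P_m \to 1$ as u → 0, the exponential form applies to the failure probability $1 - g(u)$.

```latex
\begin{align}
1 - P_m &\simeq b_F \exp\!\left(-a_F N^{1/4}\right), \qquad a_F = \frac{a}{F}, \\
        &= b_F \exp\!\left(-\frac{a\,N^{1/4}}{F}\right)
         = b_F \exp\!\left(-\frac{a}{u}\right), \qquad u \equiv \frac{F}{N^{1/4}} .
\end{align}
```

With $P_m = g(u)$ this reads $1 - g(u) \sim \exp(-a/u)$, where a ≈ 0.5 is the fitted value quoted in the text.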

FIG. 7. Scaled relaxation time T/N of the ACH as function of the connectivity C of random symmetric networks of $N = 10^2$ agents for F = 11, 21 (▽), 31 (△) and 41 (○). Each symbol represents the average over $10^3$ distinct random symmetric networks of fixed connectivity. The error bars are smaller than the sizes of the symbols and the lines are guides to the eye.

V. CONCLUSION

Understanding and quantifying how cooperation can improve the performance of groups of individuals to solve problems is an issue of great interest to many areas, ranging from computer science to business administration [24]. Our findings about the performance of the Adaptive Culture Heuristic (ACH) indicate that the number of agents participating in the collective solution of an optimization problem may influence the outcome of the process in a highly non-linear way (see, e.g., Fig. 1).

Our results were derived for a particular NP-Complete optimization problem, namely, the classification of linearly Boolean separable input patterns by a Boolean Binary Perceptron, whose optimal (zero-cost) solution is known by construction and which involves the manipulation of binary variables only. These two features allowed the study of the performance of the ACH for very large input sizes F, which essentially measures the 'size' of the optimization problem, and for a large number N of agents involved in the collective problem solving task. We focused on a single performance measure $P_m$, which yields the probability that a run of the ACH finds an optimal solution, and found that it is a function of the reduced variable $u = F/N^{1/4}$ for $N \geq 30^2$ (see Figs. 2 and 3). This is a most remarkable and useful result which informs how the number of agents must scale with the problem size for a given fixed performance of the ACH, namely, $N \propto F^4$. Recalling that the scaled relaxation time T/N scales with $F^2$ (see Fig. 5), we find that the overall computational cost to find an optimal solution with a fixed probability scales with $F^6$. As mentioned in Sect. I, this finding has no bearing on the NP ≠ P conjecture of computer science. In addition, a surprising result, which is summarized in Fig. 7, indicates that the implementation of the ACH on a square lattice or on a random symmetric network of connectivity C = 4 yields the best performance when compared with the implementation on a random network of different connectivity.

It would be most interesting to find out whether the $F^6$ scaling law derived for the problem of learning linearly separable patterns by a Boolean Binary Perceptron holds for other optimization problems as well. In that case, one would have revealed a genuine property of the ACH which, given the minimal nature of the underlying social interaction mechanism, might serve as a bound to the performance of heuristics based on collective computation.
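The $F^6$ estimate can be written out explicitly by combining the two empirical relations reported in the paper. The following short worked derivation takes the relaxation time T as the measure of computational cost, as the text does, and uses the fitted prefactor 0.12 from Fig. 5; extrapolating that fit to arbitrary F and N is an assumption.

```latex
\begin{align}
u = \frac{F}{N^{1/4}} = \text{const}
  \;&\Longrightarrow\; N = \left(\frac{F}{u}\right)^{4} \propto F^{4}, \\
\frac{T}{N} \simeq 0.12\,F^{2}
  \;&\Longrightarrow\; T \simeq 0.12\,F^{2}\left(\frac{F}{u}\right)^{4}
  = \frac{0.12}{u^{4}}\,F^{6} .
\end{align}
```

For example, doubling F at fixed u (that is, at fixed probability of success) multiplies the required number of agents by $2^4 = 16$ and the relaxation time by $2^6 = 64$.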

ACKNOWLEDGMENTS

This research was supported by The Southern Office of Aerospace Research and Development (SOARD), grant FA9550-10-1-0006, and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).

[1] J.J. Hopfield, Proc. Natl. Acad. Sci. USA 79, 2554 (1982).
[2] J.J. Hopfield and D.W. Tank, Biol. Cybern. 52, 141 (1985).
[3] J. Kennedy, J. Conflict Res. 42, 56 (1998).
[4] R. Axelrod, J. Conflict Res. 41, 203 (1997).
[5] E. Bonabeau, M. Dorigo and G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems (Oxford University Press, Oxford, UK, 1999).
[6] R.C. Eberhart and Y. Shi, Particle swarm optimization: developments, applications and resources, in Proceedings of the 2001 Congress on Evolutionary Computation, Seoul, South Korea, pp. 81–86 (2001).
[7] M. Mézard, G. Parisi and M.A. Virasoro, Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications (World Scientific, Singapore, 1986).
[8] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness (Freeman, San Francisco, CA, 1979).
[9] J.F. Fontanari and R. Meir, Network: Comput. Neur. Syst. 2, 353 (1991).
[10] E. Gardner and B. Derrida, J. Phys. A: Math. Gen. 22, 1983 (1989).
[11] W. Krauth and M. Mézard, J. Physique 50, 3057 (1989).
[12] J.F. Fontanari and R. Köberle, J. Phys. France 51, 1403 (1990).
[13] G. Györgyi, Phys. Rev. A 41, 7097 (1990).
[14] H.S. Seung, H. Sompolinsky and N. Tishby, Phys. Rev. A 45, 6056 (1992).
[15] J.F. Fontanari and R. Meir, J. Phys. A: Math. Gen. 26, 1077 (1993).
[16] B. Derrida, Phys. Rev. B 24, 2613 (1981).
[17] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1973).
[18] E. Gardner, J. Phys. A: Math. Gen. 21, 257 (1988).
[19] L.A. Barbosa and J.F. Fontanari, Theory Biosci. 128, 205 (2009).
[20] L.R. Peres and J.F. Fontanari, J. Phys. A: Math. Theor. 43, 055003 (2010).
[21] J.M. Greig, J. Conflict Res. 46, 225 (2002).
[22] K. Klemm, V.M. Eguíluz, R. Toral and M. San Miguel, Physica A 327, 1 (2003).
[23] K. Klemm, V.M. Eguíluz, R. Toral and M. San Miguel, Phys. Rev. E 67, 026120 (2003).
[24] S.H. Clearwater, B.A. Huberman and T. Hogg, Science 254, 1181 (1991).