Annealing Evolutionary Stochastic Approximation Monte Carlo for Global Optimization

Faming Liang

Texas A&M University

Literature Review

Problems

Consider the global optimization problem

min_{x ∈ X} U(x),

where X is the sample space. Without loss of generality, assume X is compact. The function may contain multiple local minima separated by high barriers.

Two basic stochastic algorithms:
(1) Simulated annealing (Kirkpatrick et al., 1983).
(2) Genetic algorithm (Holland, 1975; Goldberg, 1989).

Literature Review

Simulated Annealing

Let T1 > T2 > · · · > Tk > · · · be a sequence of monotonically decreasing temperatures, in which T1 is reasonably large and lim_{k→∞} Tk = 0.

(a) Initialize at an arbitrary configuration x^(0) and temperature level T1.

(b) For each k, run Nk iterations of an MCMC scheme with π_k ∝ exp(−U(x)/Tk) as its target distribution. Pass the last configuration to the next temperature level.

(c) Increase k to k + 1.

Temperature: T_{k+1} = αT_k, where α is a number close to 1.
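The cooling loop above can be sketched in a few lines of Python. This is a minimal illustration only: the objective, the Gaussian random-walk proposal, and the schedule parameters are placeholder choices, not values from the slides.

```python
import math
import random

def simulated_annealing(U, x0, T1=10.0, alpha=0.99, n_levels=400,
                        n_iters=50, step=0.5, seed=0):
    """Minimize U by simulated annealing with geometric cooling
    T_{k+1} = alpha * T_k. At each temperature level, run n_iters
    Metropolis moves targeting pi_k(x) ∝ exp(-U(x) / T_k), then pass
    the last configuration to the next level."""
    rng = random.Random(seed)
    x, T = list(x0), T1
    best_x, best_u = list(x0), U(x0)
    for _ in range(n_levels):
        for _ in range(n_iters):
            # Random-walk proposal: perturb one coordinate.
            y = list(x)
            i = rng.randrange(len(y))
            y[i] += rng.gauss(0.0, step)
            # Metropolis acceptance at temperature T.
            if math.log(rng.random()) < -(U(y) - U(x)) / T:
                x = y
                if U(x) < best_u:
                    best_x, best_u = list(x), U(x)
        T *= alpha  # geometric cooling, alpha close to 1
    return best_x, best_u
```

On a simple convex test function this reliably approaches the minimum; on multimodal functions the logarithmic-rate results quoted below explain why geometric cooling carries no global guarantee.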

Literature Review

Simulated Annealing

Geman and Geman (1984): the global minimum of U(x) can be reached by simulated annealing with probability 1 if the temperature Tk decreases at a rate of Ω(1/ log(Lk)), where Lk = N1 + · · · + Nk.

Holley et al. (1989): No cooling schedule faster than a logarithmic rate can be guaranteed to find the global minimum.

Literature Review

Genetic Algorithm

This algorithm mimics the natural evolutionary process. It works with a population of samples, which undergo genetic-type operations, such as mutation and crossover, to produce samples that minimize the objective function. Mutation: an individual sample is selected to undergo a point change. Crossover: two individual samples are selected to recombine into offspring samples. Through this operation, the sample information distributed across the population is used effectively in the process of function optimization. Although the algorithm works well for many hard optimization problems, there is no rigorous theory to support its convergence to the global minimum.
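A toy version of such an algorithm can be sketched as follows. The selection scheme (tournament of size 2), the one-point crossover, the mutation scale, and all parameter values are illustrative choices, not those of any specific method discussed here.

```python
import random

def genetic_minimize(U, dim, pop_size=30, n_gens=200, p_mut=0.2,
                     lower=-5.0, upper=5.0, seed=0):
    """Toy genetic algorithm: tournament selection, one-point
    crossover, and pointwise Gaussian mutation, minimizing U over
    the box [lower, upper]^dim."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lower, upper) for _ in range(dim)]
           for _ in range(pop_size)]
    def tournament():
        # Pick the better of two random individuals.
        a, b = rng.choice(pop), rng.choice(pop)
        return a if U(a) < U(b) else b
    for _ in range(n_gens):
        new_pop = []
        while len(new_pop) < pop_size:
            p1, p2 = tournament(), tournament()
            c = rng.randrange(1, dim) if dim > 1 else 0  # crossover point
            child = p1[:c] + p2[c:]
            if rng.random() < p_mut:  # point mutation on one coordinate
                i = rng.randrange(dim)
                child[i] += rng.gauss(0.0, 0.3)
            new_pop.append(child)
        pop = new_pop
    return min(pop, key=U)
```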

Literature Review

ASAMC

Annealing Stochastic Approximation Monte Carlo (Liang, 2006)

– Set ψ(x) = exp{−U(x)/τ}.

– Sample space partition: E1 = {x : U(x) ≤ u1}, E2 = {x : u1 < U(x) ≤ u2}, . . . , Em = {x : U(x) > u_{m−1}}.

– θ_i^(t): working weight associated with the subregion E_i.

– γ_t: gain factor satisfying the conditions γ_t > 0, Σ_{t=0}^∞ γ_t = ∞, and Σ_{t=0}^∞ γ_t^ζ < ∞ for any ζ ∈ (1, 2).

– Let π = (π1, . . . , πm) be an m-vector with 0 < πi < 1 and Σ πi = 1. The vector π is called the desired sampling distribution.

Literature Review

ASAMC

ASAMC initiates the search in the entire sample space X0 = ∪_{i=1}^m E_i, and then iteratively searches in the set

X_t = ∪_{i=1}^{$(U_min^(t) + ℵ)} E_i,  t = 1, 2, . . . ,   (1)

where U_min^(t) is the best function value obtained up to iteration t, $(u) denotes the index of the subregion to which a sample with function value u belongs, and ℵ > 0 is a user-specified parameter which determines the broadness of the sample space at each iteration. Since the sample space shrinks iteration by iteration, the algorithm is called annealing SAMC.
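In code, the partition and the truncation rule in (1) amount to a simple index lookup. This is a sketch: `grid` holds the cut points u_1 < · · · < u_{m−1}, and the helper names are hypothetical.

```python
def subregion_index(u, grid):
    """Map an energy value u to its subregion index:
    E_1 = {U <= u_1}, E_i = {u_{i-1} < U <= u_i}, E_m = {U > u_{m-1}}.
    grid = [u_1, ..., u_{m-1}]; returns an index in 1..m."""
    for i, ui in enumerate(grid, start=1):
        if u <= ui:
            return i
    return len(grid) + 1

def truncated_space_bound(u_min, aleph, grid):
    """Largest subregion index retained at this iteration:
    X_t is the union of E_1, ..., E_{$(u_min + aleph)}."""
    return subregion_index(u_min + aleph, grid)
```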

Literature Review

ASAMC

(a) (Initialization) Partition the sample space X into m disjoint subregions E1, . . . , Em according to the objective function U(x); specify a desired sampling distribution π; initialize x^(0) by a sample randomly drawn from the sample space X; set θ^(0) = (θ1^(0), . . . , θm^(0)) = (0, 0, . . . , 0), choose ℵ, set X0 = ∪_{i=1}^m E_i, and set the iteration number t = 0.

(b) (Sampling) Update the current sample x^(t) by a single or a few MH moves which leave the following distribution invariant:

f_{θ^(t)}(x) ∝ Σ_{i=1}^{$(U_min^(t)+ℵ)} (π_i ψ(x)/e^{θ_i^(t)}) I(x ∈ E_i),   (2)

where I(x ∈ E_i) is an indicator function, θ_i^(t) is a working weight associated with the subregion E_i, ψ(x) = exp{−U(x)/τ} is an unnormalized Boltzmann density, and τ is a user-specified parameter. Denote the new sample by x^(t+1).

Literature Review

ASAMC

(c) (Working weight updating) Update the working weight θ^(t) as follows:

θ_i^(t+1) = θ_i^(t) + γ_{t+1} [I(x^(t+1) ∈ E_i) − π_i],  i = 1, . . . , m,

where γ_t is called the gain factor.

(d) (Termination checking) Check the termination condition, e.g., whether a fixed number of iterations or an optimal solution set has been reached. Otherwise, set t → t + 1 and go to step (b).
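Steps (a)-(d) can be sketched as a single loop. This is a much-simplified illustration, assuming a Gaussian random-walk proposal, a uniform desired sampling distribution, and a gain factor γ_t = t0/max(t0, t); it omits many details of the actual algorithm.

```python
import math
import random

def asamc(U, x0, grid, aleph=10.0, tau=1.0, t0=100.0, n_iters=20000,
          step=0.5, seed=0):
    """Sketch of Annealing SAMC. Subregions are energy slices defined
    by `grid`; theta holds the log working weights; the sample space
    is truncated to subregions with index <= $(U_min + aleph)."""
    rng = random.Random(seed)
    m = len(grid) + 1
    theta = [0.0] * m
    pi = [1.0 / m] * m                      # desired sampling distribution
    def idx(u):                             # subregion index $(u), 0-based
        for i, ui in enumerate(grid):
            if u <= ui:
                return i
        return m - 1
    x, ux = list(x0), U(x0)
    best_x, best_u = list(x0), ux
    for t in range(n_iters):
        k_max = idx(best_u + aleph)         # truncation bound for X_t
        # MH move targeting f(x) ∝ exp(-U(x)/tau - theta_{$(U(x))}).
        y = [xi + rng.gauss(0.0, step) for xi in x]
        uy = U(y)
        if idx(uy) <= k_max:                # stay inside truncated space
            log_r = (-uy / tau - theta[idx(uy)]) \
                    - (-ux / tau - theta[idx(ux)])
            if math.log(rng.random()) < log_r:
                x, ux = y, uy
                if ux < best_u:
                    best_x, best_u = list(x), ux
        gamma = t0 / max(t0, t)             # gain factor (sums to infinity)
        for i in range(k_max + 1):          # stochastic approximation update
            theta[i] += gamma * ((1.0 if i == idx(ux) else 0.0) - pi[i])
    return best_x, best_u
```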

Algorithm Description

AESAMC

Let x = (x1, . . . , xn) denote the population, where n is the population size, and x_i = (x_{i1}, . . . , x_{id}) is a d-dimensional vector called an individual or chromosome in terms of genetic algorithms.

The minimum of U(x) can be obtained by minimizing the function U(x) = Σ_{i=1}^n U(x_i). An unnormalized Boltzmann density can be defined for the population as

ψ(x) = exp{−U(x)/τ},  x ∈ X^n,   (3)

where X^n = X × · · · × X is a product sample space.

Algorithm Description

AESAMC

Genetic Operators: They need to satisfy the reversibility condition of MCMC, but they do not need to keep the individual samples independent across iterations. They can therefore differ from those used in the genetic algorithm and in Evolutionary Monte Carlo.

Algorithm Description

AESAMC

K-point Crossover

(x_{i1}, . . . , x_{id}) ⟹ (x_{i1}, . . . , x_{ia}, x_{j,a+1}, . . . , x_{jb}, x_{i,b+1}, . . . , x_{id}),
(x_{j1}, . . . , x_{jd}) ⟹ (x_{j1}, . . . , x_{ja}, x_{i,a+1}, . . . , x_{ib}, x_{j,b+1}, . . . , x_{jd}).

A new population can then be formed as x′ = (x1, . . . , x_{i−1}, x′_i, x_{i+1}, . . . , x_{j−1}, x′_j, x_{j+1}, . . . , xn).
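For K = 2, the segment exchange above can be sketched as follows (illustrative only; the cut points a < b are drawn at random here).

```python
import random

def two_point_crossover(xi, xj, rng=random):
    """Exchange the middle segment (a, b] of two chromosomes, as in
    K-point crossover with K = 2. Returns the two offspring."""
    d = len(xi)
    a, b = sorted(rng.sample(range(d + 1), 2))  # cut points a < b
    xi_new = xi[:a] + xj[a:b] + xi[b:]
    xj_new = xj[:a] + xi[a:b] + xj[b:]
    return xi_new, xj_new
```

Note that the operation only swaps coordinates between the two parents, so the combined multiset of coordinate values is preserved.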

Algorithm Description

AESAMC

A modified snooker crossover

(a) Randomly select one chromosome, say x_i, from the current population x.

(b) Select another chromosome, say x_j, from the subpopulation x \ {x_i}.

(c) Let e = (x_j − x_i)/‖x_j − x_i‖, and let x′_i = x_i + re, where r is a random variable drawn from a normal distribution N(0, σ_c²), with the standard deviation σ_c calibrated such that the operation has a reasonable overall acceptance probability.

(d) Construct a new population x′ by replacing x_i with the offspring x′_i, and accept the new population with probability min{1, f_{θ^(t)}(x′)/f_{θ^(t)}(x)}. For this operator, the transition probability ratio is 1; that is, T(x → x′) = T(x′ → x).
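The proposal in step (c) can be sketched as below; σ_c is a placeholder value, and the MH acceptance of step (d) is left to the caller since it needs the full target density.

```python
import math
import random

def snooker_offspring(xi, xj, sigma_c=0.5, rng=random):
    """Propose x_i' = x_i + r * e, where e is the unit vector from x_i
    toward x_j and r ~ N(0, sigma_c^2). The proposal is symmetric, so
    the MH ratio reduces to f(x') / f(x)."""
    diff = [b - a for a, b in zip(xi, xj)]
    norm = math.sqrt(sum(t * t for t in diff))
    e = [t / norm for t in diff]            # unit direction x_i -> x_j
    r = rng.gauss(0.0, sigma_c)
    return [a + r * ei for a, ei in zip(xi, e)]
```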

Algorithm Description

AESAMC

Linear crossover

This operator follows the same procedure as the snooker crossover operator, except that step (c) is changed as follows.

(c) Set x′_i = x_i + r x_j, where r ∼ Unif[−1, 1].

Algorithm Description

AESAMC

Mutation

For i = 1, . . . , n, given the population x^(t+1,i−1) = (x1^(t+1), . . . , x_{i−1}^(t+1), x_i^(t), . . . , x_n^(t)):

(a) Generate a random direction vector e from a uniform distribution on the surface of a unit d-dimensional hypersphere.

(b) Generate a random number r ∼ N(0, σ_m²) and set x′_i = x_i^(t) + re, where σ_m is calibrated such that the operator has a reasonable overall acceptance rate.

(c) Construct a new population by replacing x_i^(t) with x′_i, accept the new population according to the MH rule as in the snooker crossover, and denote the new population by x^(t+1,i). For this operator, the transition is symmetric, i.e., T(x^(t+1,i−1) → x^(t+1,i)) = T(x^(t+1,i) → x^(t+1,i−1)).
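Steps (a)-(b) can be sketched as follows; the uniform direction is obtained by normalizing a vector of independent Gaussians, a standard construction, and σ_m is a placeholder value.

```python
import math
import random

def hypersphere_mutation(xi, sigma_m=0.5, rng=random):
    """Propose x_i' = x_i + r * e with e uniform on the unit
    d-dimensional hypersphere (normalized Gaussians) and
    r ~ N(0, sigma_m^2); the move is symmetric, T(x -> x') = T(x' -> x)."""
    d = len(xi)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in g))
    e = [t / norm for t in g]               # uniform random direction
    r = rng.gauss(0.0, sigma_m)
    return [a + r * ei for a, ei in zip(xi, e)]
```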

Algorithm Description

AESAMC

AESAMC algorithm

(a) (Initialization) Partition the sample space X^n into m disjoint subregions E1, . . . , Em; choose the threshold value ℵ and the working probabilities ρ1, . . . , ρ5; initialize a population x^(0) at random; and set θ^(0) = (θ1^(0), . . . , θm^(0)) = (0, 0, . . . , 0), X0^n = ∪_{i=1}^m E_i, U_min^(0) = U(x^(0)) and t = 0.

(b) (Sampling) Update the current population x^(t) using the MH-Gibbs mutation, K-point mutation, K-point crossover, snooker crossover, and linear crossover operators according to the respective working probabilities.

(c) (Working weight updating) Update the working weight θ^(t):

θ_i^(t+1) = θ_i^(t) + γ_{t+1} H_i(θ^(t), x^(t+1)),  i = 1, . . . , $(U_min^(t) + ℵ),

where H_i(θ^(t), x^(t+1)) = I(x^(t+1) ∈ E_i) − π_i for the crossover operators, and H_i(θ^(t), x^(t+1)) = Σ_{j=1}^n I(x^(t+1,j) ∈ E_i)/n − π_i for the mutation operators.

Algorithm Description

AESAMC

(d) (Termination checking) Check the termination condition, e.g., whether a fixed number of iterations or an optimal solution set has been reached. Otherwise, set t → t + 1 and go to step (b).
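Step (c) distinguishes crossover moves (one new population) from mutation moves (n intermediate populations). The update can be sketched as below; `region_of` is a hypothetical helper mapping a population to its 0-based subregion index.

```python
def update_theta(theta, gamma, pi, region_of, pop_states, is_mutation, k_max):
    """Working-weight update of step (c): theta_i += gamma * H_i.
    For crossover moves, H_i = I(x in E_i) - pi_i, using the single new
    population (the last entry of pop_states). For mutation moves, H_i
    averages the indicator over the n intermediate populations
    x^{(t+1,1)}, ..., x^{(t+1,n)} held in pop_states."""
    n = len(pop_states)
    for i in range(k_max):
        if is_mutation:
            ind = sum(1.0 for s in pop_states if region_of(s) == i) / n
        else:
            ind = 1.0 if region_of(pop_states[-1]) == i else 0.0
        theta[i] += gamma * (ind - pi[i])
    return theta
```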

Algorithm Description

AESAMC

Convergence conditions

• f_{θ^(t)}(x) is bounded away from 0 and ∞ on X^n.

• There exists a marginally global mutation operator, and the working probability of the mutation operator is greater than 0.

Algorithm Description

AESAMC

Convergence

Theorem 0.1 Let E1, . . . , Em form a partition of a compact sample space X^n and let ψ(x) be a non-negative function defined on X^n with 0 < ∫_{E_i} ψ(x)dx < ∞ for all E_i's. Let π = (π1, . . . , πm) be an m-vector with 0 < πi < 1 and Σ_{i=1}^m πi = 1. Let Θ be a compact set of m dimensions, and assume that there exists at least one constant C such that θ̆ ∈ Θ, where θ̆ = (θ̆1, . . . , θ̆m) and θ̆_i = C + log(∫_{E_i} ψ(x)dx) − log(π_i). Let θ^(t) ∈ Θ be the estimate of θ̆ at iteration t. Under the above conditions, we have

P{ lim_{t→∞} θ_i^(t) = C + log(∫_{E_i} ψ(x)dx) − log(π_i) } = 1,  i = 1, . . . , m,   (4)

where C is an arbitrary constant.

Algorithm Description

AESAMC

Convergence Rate

Theorem 0.2 Suppose that

E[H(θ^(t), x^(t+1)) − h(θ^(t)) | F_t] = 0,   (5)

with h(θ) = ∫_{X^n} H(θ, x) p(dx) for θ ∈ Θ, and that there exists a constant δ > 0 such that Σ_{i=1}^m π_i²/m > min_{1≤i≤m} π_i − δ > 0. Let the gain factor be chosen with T0 > δ^{−1}. Define v(θ) = (1/2) Σ_{k=1}^m (S_k/S − π_k)², where S_i = ∫_{E_i} ψ(x)dx / e^{θ_{t,i}} and S = Σ_{k=1}^m S_k. We then have

E v(θ^(t)) ≤ λ* γ_t,

for some suitable constant λ* > 0.

Numerical Examples

AESAMC

Consider minimizing the rotated Rastrigin function,

U(x) = Σ_{i=1}^D (y_i² − 10 cos(2π y_i) + 10),  (y1, . . . , yD)′ = R(x1, . . . , xD)′,   (6)

within the region [−5.12, 5.12]^D, where D = 30, and R is a rotation matrix generated using Salomon's method (Salomon, 1996) by multiplying 2D − 3 elementary rotation matrices with the rotation coordinates selected systematically. The global minimizer of this function is located at x* = (0, 0, . . . , 0) with U(x*) = 0. The function has a multitude of local minima; for D = 2 there are already about 50 local minimizers (Ali et al., 2005).
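The test function (6) is straightforward to code; the sketch below takes the rotation matrix R as a list of rows (constructing R by Salomon's method is omitted).

```python
import math

def rastrigin(y):
    """Rastrigin's function: sum_i (y_i^2 - 10 cos(2 pi y_i) + 10)."""
    return sum(t * t - 10.0 * math.cos(2.0 * math.pi * t) + 10.0
               for t in y)

def rotated_rastrigin(x, R):
    """Rotated variant of (6): evaluate Rastrigin at y = R x, where the
    rotation matrix R is given as a list of rows."""
    y = [sum(rij * xj for rij, xj in zip(row, x)) for row in R]
    return rastrigin(y)
```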

Numerical Examples

AESAMC

Algorithm | Setting      | Minimum | Maximum | Average | SD     | Cost
AESAMC    | ℵ = 20       | 0.000   | 0.006   | 0.002   | 3.2e-4 | 5.5 × 10^5
AESAMC    | ℵ = 50       | 0.000   | 0.010   | 0.003   | 4.8e-4 | 5.5 × 10^5
SAMC      | ℵ = 100      | 19.75   | 54.95   | 31.85   | 1.56   | 10^6
SAMC      | ℵ = 200      | 19.75   | 48.08   | 29.55   | 1.24   | 10^6
SAMC      | ℵ = 300      | 15.30   | 46.71   | 29.16   | 1.23   | 10^6
SA        | τ_high = 10  | 19.36   | 62.07   | 35.16   | 1.88   | 10^6
SA        | τ_high = 20  | 12.96   | 71.03   | 35.11   | 2.45   | 10^6
SA        | τ_high = 50  | 24.60   | 73.52   | 36.11   | 1.73   | 10^6
RGA       | n = 50       | 13.93   | 53.73   | 27.16   | 1.71   | 1.01 × 10^6
RGA       | n = 100      | 7.96    | 32.84   | 21.99   | 1.02   | 1.01 × 10^6
RGA       | n = 250      | 8.95    | 56.71   | 23.98   | 2.23   | 1.02 × 10^6

Table 1: Comparison of AESAMC, SAMC, SA and RGA for minimizing rotated Rastrigin's function. Let u_i denote the best function value produced in the i-th run. The numbers in the columns Minimum, Maximum, Average, and SD are calculated, respectively, as min_{1≤i≤30} u_i, max_{1≤i≤30} u_i, Σ_{i=1}^{30} u_i/30, and the standard deviation of the average. Cost: the number of function evaluations in each run.

Numerical Examples

AESAMC

Figure 1: Average progression curves of the best function values (over 30 runs) and their 95% confidence regions (shaded area) produced by AESAMC (ℵ = 50), SAMC (ℵ = 200), SA (τhigh = 20) and RGA (n = 100) for rotated Rastrigin’s function.

Numerical Examples

AESAMC

This example is based on the rational-expectations model of Hoffman and Schmidt (1981). Minimize the function

U(α, β, γ, x01, x02, x03) = Σ_{t=1}^T (y_{1t} − g_{1t}(α, β, γ))² + Σ_{t=1}^T (y_{2t} − g_{2t}(α, β, γ))² + Σ_{i=1}^3 Σ_{t=1}^T (x_{ti} − γ_i x_{t−1,i})²,   (7)

on the space |α_{ij}| ≤ 10, |β_{ij}| ≤ 10, |γ_i| ≤ 1 and |x_{0i}| ≤ 10.

Numerical Examples

Algorithm AESAMC

SAMC

SA

RGA

AESAMC

Setting

Minimum

Maximum

Average

SD

Cost

ℵ = 5000, T0 = 20000

62694

62984

62815

19

1.85 × 107

ℵ = 15000, T0 = 50000 ℵ = 10000, T0 = 20000

62694

63078

62813

29

62694

63893

62864

71

1.85 × 107 1.85 × 107

ℵ = 20000, T0 = 50000 τhigh = 500 τhigh = 1000 τhigh = 2000 n = 50 n = 100 n = 200

62694

63370

62940

64

62699 62695 62700

63956 64242 64779

63412 63543 63501

84 84 118

64585 64579 64579

66703 65299 65288

65140 64822 64861

122 65 63

Table 2: Comparison of AESAMC, SAMC, SA and RGA for minimization of the function (7). Refer to Table 1 for the notations.

1.85 × 107 1.85 × 107 1.85 × 107 1.85 × 107 1.9 × 107 1.9 × 107 1.9 × 107

Numerical Examples

AESAMC

Figure 2: Average progression curves of the best function values (over 20 runs) and their 95% confidence regions (shaded area) produced by AESAMC (ℵ = 5000, T0 = 20000), SAMC (ℵ = 10000, T0 = 20000), SA (τhigh = 10) and RGA (n = 100) for minimizing the function (7).

Numerical Examples

AESAMC

Comparing AESAMC with other metaheuristics

AESAMC was compared with the following algorithms on a set of benchmark multimodal test functions whose global minima are known. The names of the test functions are given in Table 3, and their definitions can be found in Laguna and Martí (2005), Hedar and Fukushima (2006), or Hirsch et al. (2006).

• Genetic algorithm (Genocop III) (Michalewicz and Nazhiyath, 1995) Genocop III is an implementation of the genetic algorithm that is customized for solving nonlinear optimization problems with continuous and bounded variables.

• Scatter search algorithm (Laguna and Martí, 2005) The scatter search algorithm is an evolutionary algorithm that, unlike the genetic algorithm, operates on a small population of solutions and makes only limited use of randomization as a proxy for diversification when searching for a globally optimal solution.

Numerical Examples

AESAMC

• Directed tabu search (DTS) algorithm (Hedar and Fukushima, 2006) The DTS algorithm is a hybrid of the tabu search algorithm (Glover and Laguna, 1997) and a direct search algorithm.

• Continuous GRASP (C-GRASP) algorithm (Hirsch et al., 2006; Hirsch et al., 2007) The C-GRASP algorithm is a multi-start search algorithm for continuous optimization problems subject to box constraints, where a starting solution for local improvement is constructed in a greedy randomized fashion.

Numerical Examples

AESAMC

No. | Function                    | d^a | U_min^b
1   | Branin RCOS (RC)            | 2   | 0.398
2   | Bohachevsky (B2)            | 2   | 0
3   | Easom (ES)                  | 2   | -1
4   | Goldstein & Price (GP)      | 2   | 3
5   | Shubert (SH)                | 2   | -186.731
6   | Beale (BL)                  | 2   | 0
7   | Booth (BO)                  | 2   | 0
8   | Matyas (MT)                 | 2   | 0
9   | Hump (HM)                   | 2   | 0
10  | Schwefel (SC2)              | 2   | 0
11  | Rosenbrock (R2)             | 2   | 0
12  | Zakharov (Z2)               | 2   | 0
13  | De Joung (DJ)               | 3   | 0
14  | Hartmann (H3,4)             | 3   | -3.863
15  | Colville (CV)               | 4   | 0
16  | Shekel (S4,5)               | 4   | -10.153
17  | Shekel (S4,7)               | 4   | -10.403
18  | Shekel (S4,10)              | 4   | -10.536
19  | Perm (P4,0.5)               | 4   | 0
20  | Perm (P'4,0.5)              | 4   | 0
21  | Power Sum (PS8,18,44,114)   | 4   | 0
22  | Hartmann (H6,4)             | 6   | -3.335
23  | Schwefel (SC6)              | 6   | 0
24  | Trid (T6)                   | 6   | -50
25  | Trid (T10)                  | 10  | -186.731
26  | Rastrigin (RT10)            | 10  | 0
27  | Griewank (G10)              | 10  | 0
28  | Sum Squares (SS10)          | 10  | 0
29  | Rosenbrock (R10)            | 10  | 0
30  | Zakharov (Z10)              | 10  | 0
31  | Rastrigin (RT20)            | 20  | 0
32  | Griewank (G20)              | 20  | 0
33  | Sum Squares (SS20)          | 20  | 0
34  | Rosenbrock (R20)            | 20  | 0
35  | Zakharov (Z20)              | 20  | 0
36  | Powell (PW24)               | 24  | 0
37  | Dixon & Price (DP25)        | 25  | 0
38  | Levy (L30)                  | 30  | 0
39  | Sphere (SR30)               | 30  | 0
40  | Ackley (AK30)               | 30  | 0

Table 3: Test functions (Laguna and Martí, 2005). ^a Dimension of the domain X. ^b Global minimum value of the test function.

Numerical Examples

AESAMC

Algorithm            | 5000^a       | 10000^a      | 20000^a      | 50000^a
Genocop III^b        | 636.37       | 399.52       | 320.84       | 313.34
Scatter Search^b     | 4.96         | 3.60         | 3.52         | 3.46
C-GRASP^c            | 6.20         | 4.73         | 3.92         | 3.02
DTS^d                | 4.22         | 1.80         | 1.70         | 1.29
ASAMC^e (ℵ = 10)     | 4.105(0.114) | 2.550(0.079) | 1.757(0.079) | 1.048(0.057)
ASAMC^e (ℵ = 5)      | 3.422(0.093) | 2.363(0.073) | 1.505(0.063) | 0.939(0.034)
ASAMC^e (ℵ = 1)      | 3.034(0.114) | 2.023(0.090) | 1.391(0.074) | 0.946(0.057)
SA^e (τ_high = 20)   | 3.584(0.105) | 2.594(0.080) | 1.907(0.056) | 1.161(0.042)
SA^e (τ_high = 5)    | 3.053(0.106) | 1.800(0.085) | 1.170(0.058) | 0.713(0.032)
SA^e (τ_high = 2)    | 2.986(0.117) | 1.889(0.086) | 1.116(0.062) | 0.689(0.032)
SA^e (τ_high = 1)    | 2.360(0.106) | 1.547(0.080) | 1.063(0.062) | 0.668(0.031)
SA^e (τ_high = 0.5)  | 2.445(0.105) | 1.389(0.073) | 1.061(0.068) | 0.754(0.056)
AESAMC^f (ℵ = 10)    | 1.592(0.083) | 0.822(0.018) | 0.676(0.012) | 0.499(0.008)
AESAMC^f (ℵ = 1)     | 1.934(0.272) | 0.785(0.013) | 0.657(0.016) | 0.488(0.011)

Table 4: Average optimality gap values over the 40 test functions. ^a The number of function evaluations. ^b The results are from Laguna and Martí (2005). ^c The results are from Hirsch et al. (2006). ^d The results are from Hedar and Fukushima (2006). ^e The number in the parentheses denotes the standard deviation of the average.

Numerical Examples

AESAMC

Figure 3: Comparison of optimality gap values produced by AESAMC (with ℵ = 1), ASAMC and SA with 50000 function evaluations for each of the 40 test functions. (a) Difference of the optimality gap values produced by ASAMC and AESAMC (with ℵ = 1). (b) Difference of the optimality gap values produced by SA and AESAMC (with ℵ = 1).

Discussion

AESAMC

• Self-adjusting ability: AESAMC is not easily trapped by local minima.

• It converges in distribution to the global minimum at a rate of O(1/t), which is much faster than the convergence rate of simulated annealing.

• It improves the acceptance rate of crossover operations, so the distribution information carried by the population can be used more effectively in optimization.