Annealing Evolutionary Stochastic Approximation Monte Carlo for Global Optimization
Faming Liang
Texas A&M University
Literature Review
Problems
Consider the global optimization problem

min_{x∈X} U(x),

where X is the sample space. Without loss of generality, assume that X is compact. The function U(x) may contain multiple local minima separated by high barriers.
Two basic stochastic algorithms: (1) simulated annealing (Kirkpatrick et al., 1983); (2) the genetic algorithm (Holland, 1975; Goldberg, 1989).
Simulated Annealing
Let T_1 > T_2 > ··· > T_k > ··· be a sequence of monotonically decreasing temperatures, where T_1 is reasonably large and lim_{k→∞} T_k = 0.
(a) Initialize at an arbitrary configuration x^{(0)} and temperature level T_1.
(b) For each k, run N_k iterations of an MCMC scheme with π_k ∝ exp(−U(x)/T_k) as its target distribution. Pass the last configuration to the next temperature level.
(c) Increase k to k + 1. Temperature: T_{k+1} = αT_k, where α is a number close to 1.
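The annealing loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 1-D Rastrigin-type test function, the step size, and the cooling parameters are all assumptions chosen for the example.

```python
import math
import random

def simulated_annealing(U, x0, T1=10.0, alpha=0.95, n_levels=200,
                        n_iters=50, step=0.5, seed=0):
    """Minimize U by simulated annealing with geometric cooling T_{k+1} = alpha*T_k.

    At each temperature level, n_iters Metropolis moves target
    pi_k(x) ∝ exp(-U(x)/T_k); the last configuration is passed on.
    """
    rng = random.Random(seed)
    x, T = x0, T1
    best_x, best_u = x0, U(x0)
    for _ in range(n_levels):
        for _ in range(n_iters):
            y = x + rng.gauss(0.0, step)           # symmetric random-walk proposal
            du = U(y) - U(x)
            if du <= 0 or rng.random() < math.exp(-du / T):
                x = y                               # Metropolis acceptance
            if U(x) < best_u:
                best_x, best_u = x, U(x)
        T *= alpha                                  # cool the temperature
    return best_x, best_u

# A 1-D multimodal test: global minimum U(0) = 0, local minima near the integers.
U = lambda x: x * x - 10 * math.cos(2 * math.pi * x) + 10
x_best, u_best = simulated_annealing(U, x0=4.0)
```

With the fixed seed the chain escapes the starting basin and settles near the global minimum well before the temperature freezes.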
Geman and Geman (1984): The global minimum of U (x) can be reached by simulated annealing with probability 1 if the temperature Tk decreases at an order of Ω(1/ log(Lk )), where Lk = N1 + · · · + Nk .
Holley et al. (1989): No cooling schedule faster than a logarithmic rate can be guaranteed to find the global minimum.
Genetic Algorithm
This algorithm mimics the natural evolutionary process. It works with a population of samples, which undergo genetic-type operations, such as mutation and crossover, in order to produce samples that minimize the objective function.
Mutation: an individual sample is selected to undergo a point change.
Crossover: two individual samples are selected and recombined to produce offspring samples. Through this operation, the sample information distributed across the population is used effectively in the process of function optimization.
Although the algorithm works well for many hard optimization problems, there is no rigorous theory to support its convergence to the global minimum.
ASAMC
Annealing Stochastic Approximation Monte Carlo (Liang, 2006)
– Set ψ(x) = exp{−U(x)/τ}.
– Sample space partition: E_1 = {x : U(x) ≤ u_1}, E_2 = {x : u_1 < U(x) ≤ u_2}, ..., E_m = {x : U(x) > u_{m−1}}.
– θ_i^{(t)}: working weight associated with the subregion E_i.
– γ_t: gain factor satisfying the conditions γ_t > 0, Σ_{t=0}^∞ γ_t = ∞, and Σ_{t=0}^∞ γ_t^ζ < ∞ for any ζ ∈ (1, 2).
– Let π = (π_1, ..., π_m) be an m-vector with 0 < π_i < 1 and Σ_{i=1}^m π_i = 1; π is called the desired sampling distribution.
ASAMC initiates the search in the entire sample space X_0 = ∪_{i=1}^m E_i and then iteratively restricts the search to the set

X_t = ∪_{i=1}^{ϖ(U_min^{(t)} + ℵ)} E_i,  t = 1, 2, ...,   (1)

where ϖ(u) denotes the index of the subregion that a sample with function value u belongs to, U_min^{(t)} is the best function value obtained up to iteration t, and ℵ > 0 is a user-specified parameter which determines the broadness of the sample space at each iteration. Since the sample space shrinks iteration by iteration, the algorithm is called annealing SAMC.
(a) (Initialization) Partition the sample space X into m disjoint subregions E_1, ..., E_m according to the objective function U(x); specify a desired sampling distribution π; initialize x^{(0)} by a sample randomly drawn from the sample space X; set θ^{(0)} = (θ_1^{(0)}, ..., θ_m^{(0)}) = (0, 0, ..., 0); choose ℵ and set X_0 = ∪_{i=1}^m E_i; and set the iteration number t = 0.

(b) (Sampling) Update the current sample x^{(t)} by a single or a few MH moves which leave the following distribution invariant,

f_{θ^{(t)}}(x) ∝ Σ_{i=1}^{ϖ(U_min^{(t)} + ℵ)} [π_i ψ(x) / e^{θ_i^{(t)}}] I(x ∈ E_i),   (2)

where I(x ∈ E_i) is an indicator function, θ_i^{(t)} is a working weight associated with the subregion E_i, ψ(x) = exp{−U(x)/τ} is an unnormalized Boltzmann density, and τ is a user-specified parameter. Denote the new sample by x^{(t+1)}.
(c) (Working weight updating) Update the working weight θ^{(t)} as follows:

θ_i^{(t+1)} = θ_i^{(t)} + γ_{t+1} [I(x^{(t+1)} ∈ E_i) − π_i],  i = 1, ..., m,

where γ_t is called the gain factor.

(d) (Termination checking) Check the termination condition, e.g., whether a fixed number of iterations has been reached or an optimal solution set has been found. Otherwise, set t → t + 1 and go to step (b).
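Steps (a)–(c) of ASAMC can be sketched in one dimension as follows. This is a toy sketch under assumed choices not specified here: an energy-grid partition, a uniform desired distribution π, and the common gain factor γ_t = t0/max(t0, t).

```python
import math
import random

def asamc(U, x0, u_grid, tau=1.0, aleph=5.0, n_iter=20000,
          t0=500.0, step=0.5, seed=0):
    """Toy ASAMC sketch for a 1-D function U.

    Subregions E_i are defined by the increasing energy grid u_grid;
    gamma_t = t0 / max(t0, t) is one common gain-factor choice.
    """
    rng = random.Random(seed)
    m = len(u_grid) + 1
    pi = [1.0 / m] * m                       # desired sampling distribution
    theta = [0.0] * m                        # working weights

    def region(u):                           # index of the subregion containing u
        for i, level in enumerate(u_grid):
            if u <= level:
                return i
        return m - 1

    x, ux = x0, U(x0)
    u_min = ux
    for t in range(1, n_iter + 1):
        k_max = region(u_min + aleph)        # shrunken sample space X_t, eq. (1)
        y = x + rng.gauss(0.0, step)
        uy = U(y)
        j, i = region(uy), region(ux)
        if j <= k_max:                       # reject moves outside X_t
            # MH ratio for f_theta(x) ∝ pi_i psi(x) / exp(theta_i) on E_i
            log_r = (-uy / tau - theta[j] + math.log(pi[j])) \
                    - (-ux / tau - theta[i] + math.log(pi[i]))
            if math.log(rng.random()) < log_r:
                x, ux, i = y, uy, j
        u_min = min(u_min, ux)
        gamma = t0 / max(t0, t)              # gain factor
        for k in range(m):                   # weight update, step (c)
            theta[k] += gamma * ((1.0 if k == i else 0.0) - pi[k])
    return u_min

U = lambda x: x * x - 10 * math.cos(2 * math.pi * x) + 10
u_best = asamc(U, x0=4.0, u_grid=[i * 2.0 for i in range(1, 20)])
```

Note how ℵ enters only through the upper index k_max: states whose energy exceeds U_min^{(t)} + ℵ are simply rejected, which is what shrinks the sample space over time.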
Algorithm Description
AESAMC
Let x = (x_1, ..., x_n) denote the population, where n is the population size, and x_i = (x_{i1}, ..., x_{id}) is a d-dimensional vector called an individual or chromosome in terms of genetic algorithms.

The minimum of U(x) can be obtained by minimizing the function U(x) = Σ_{i=1}^n U(x_i). An unnormalized Boltzmann density can be defined for the population as

ψ(x) = exp{−U(x)/τ},  x ∈ X^n,   (3)

where X^n = X × ··· × X is a product sample space.
Genetic Operators
The operators need to satisfy the reversibility condition of MCMC, but they do not need to keep the individual samples independent across iterations. They can be different from the operators used in the genetic algorithm and in evolutionary Monte Carlo.
K-point Crossover

(x_{i1}, ..., x_{id}) ⟹ (x_{i1}, ..., x_{ia}, x_{j,a+1}, ..., x_{jb}, x_{i,b+1}, ..., x_{id}),
(x_{j1}, ..., x_{jd}) ⟹ (x_{j1}, ..., x_{ja}, x_{i,a+1}, ..., x_{ib}, x_{j,b+1}, ..., x_{jd}).

A new population can then be formed as x′ = (x_1, ..., x_{i−1}, x′_i, x_{i+1}, ..., x_{j−1}, x′_j, x_{j+1}, ..., x_n).
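The segment swap above can be sketched as follows. For clarity the K cut points are passed in explicitly; a real implementation would sample them at random (and would then apply the MH acceptance step, omitted here).

```python
def k_point_crossover(xi, xj, cuts):
    """Exchange the segments of two parent chromosomes between cut points.

    cuts is the list of K cut positions; for K = 2 with cuts [a, b] the
    slice [a:b] is swapped, matching the (a, b) swap shown above.
    """
    yi, yj = list(xi), list(xj)
    swap = False
    prev = 0
    for c in sorted(cuts) + [len(xi)]:
        if swap:                             # exchange every other segment
            yi[prev:c], yj[prev:c] = yj[prev:c], yi[prev:c]
        swap = not swap
        prev = c
    return yi, yj

# Deterministic illustration: 2-point crossover with a = 2, b = 4.
yi, yj = k_point_crossover([1, 1, 1, 1, 1, 1], [2, 2, 2, 2, 2, 2], [2, 4])
```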
A Modified Snooker Crossover
(a) Randomly select one chromosome, say x_i, from the current population x.
(b) Select another chromosome, say x_j, from the subpopulation x \ {x_i}.
(c) Let e = (x_j − x_i)/‖x_j − x_i‖ and set x′_i = x_i + r e, where r is a random variable drawn from a normal distribution N(0, σ_c²), with the standard deviation σ_c calibrated such that the operation has a reasonable overall acceptance probability.
(d) Construct a new population x′ by replacing x_i with the offspring x′_i, and accept the new population with probability min{1, f_{θ^{(t)}}(x′)/f_{θ^{(t)}}(x)}. For this operator, the transition probability ratio is 1; that is, T(x → x′) = T(x′ → x).
Linear Crossover
This operator follows the same procedure as the snooker crossover operator, except that step (c) is changed as follows.
(c) Set x′_i = x_i + r x_j, where r ∼ Unif[−1, 1].
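The two proposal steps (c) can be sketched as below. The value of σ_c and the RNG plumbing are assumptions for the example, and the MH acceptance step that follows each proposal is omitted.

```python
import math
import random

def snooker_proposal(xi, xj, sigma_c, rng=random):
    """Propose x_i' = x_i + r*e, where e is the unit vector from x_i
    toward x_j and r ~ N(0, sigma_c^2); sigma_c would be tuned for a
    reasonable acceptance rate."""
    norm = math.sqrt(sum((b - a) ** 2 for a, b in zip(xi, xj)))
    e = [(b - a) / norm for a, b in zip(xi, xj)]
    r = rng.gauss(0.0, sigma_c)
    return [a + r * ek for a, ek in zip(xi, e)]

def linear_proposal(xi, xj, rng=random):
    """Propose x_i' = x_i + r*x_j with r ~ Unif[-1, 1]."""
    r = rng.uniform(-1.0, 1.0)
    return [a + r * b for a, b in zip(xi, xj)]
```

The snooker offspring always lies on the line through x_i and x_j, which is why the transition probability ratio is 1.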
Mutation
For i = 1, ..., n, given the population x^{(t+1,i−1)} = (x_1^{(t+1)}, ..., x_{i−1}^{(t+1)}, x_i^{(t)}, ..., x_n^{(t)}):
(a) Generate a random direction vector e from a uniform distribution on the surface of a unit d-dimensional hypersphere.
(b) Generate a random number r ∼ N(0, σ_m²) and set x′_i = x_i^{(t)} + r e, where σ_m is calibrated such that the operator has a reasonable overall acceptance rate.
(c) Construct a new population by replacing x_i^{(t)} with x′_i, accept the new population according to the MH rule as in the snooker crossover, and denote the new population by x^{(t+1,i)}. For this operator, the transition is symmetric, i.e., T(x^{(t+1,i−1)} → x^{(t+1,i)}) = T(x^{(t+1,i)} → x^{(t+1,i−1)}).
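Steps (a)–(b) can be sketched as below; the uniform direction on the hypersphere is obtained by normalizing a standard Gaussian vector, a standard construction.

```python
import math
import random

def mutation_proposal(xi, sigma_m, rng=random):
    """Propose x_i' = x_i + r*e, where e is uniform on the unit
    d-dimensional hypersphere (a normalized Gaussian vector) and
    r ~ N(0, sigma_m^2)."""
    d = len(xi)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    e = [v / norm for v in g]                # uniform random direction
    r = rng.gauss(0.0, sigma_m)
    return [a + r * ek for a, ek in zip(xi, e)]
```

Because the proposal is an isotropic offset, the transition density is symmetric, matching step (c).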
AESAMC algorithm
(a) (Initialization) Partition the sample space X^n into m disjoint subregions E_1, ..., E_m; choose the threshold value ℵ and the working probabilities ρ_1, ..., ρ_5; initialize a population x^{(0)} at random; and set θ^{(0)} = (θ_1^{(0)}, ..., θ_m^{(0)}) = (0, 0, ..., 0), X_0^n = ∪_{i=1}^m E_i, U_min^{(0)} = U(x^{(0)}), and t = 0.

(b) (Sampling) Update the current population x^{(t)} using the MH-Gibbs mutation, K-point mutation, K-point crossover, snooker crossover, and linear crossover operators according to the respective working probabilities.

(c) (Working weight updating) Update the working weight θ^{(t)}:

θ_i^{(t+1)} = θ_i^{(t)} + γ_{t+1} H_i(θ^{(t)}, x^{(t+1)}),  i = 1, ..., ϖ(U_min^{(t)} + ℵ),

where H_i(θ^{(t)}, x^{(t+1)}) = I(x^{(t+1)} ∈ E_i) − π_i for the crossover operators, and H_i(θ^{(t)}, x^{(t+1)}) = Σ_{j=1}^n I(x^{(t+1,j)} ∈ E_i)/n − π_i for the mutation operators.
(d) (Termination checking) Check the termination condition, e.g., whether a fixed number of iterations has been reached or an optimal solution set has been found. Otherwise, set t → t + 1 and go to step (b).
Convergence conditions
• f_{θ^{(t)}}(x) is bounded away from 0 and ∞ on X^n.
• There exists a marginally global mutation operator, and the working probability of the mutation operator is greater than 0.
Convergence
Theorem 0.1 Let E_1, ..., E_m form a partition of a compact sample space X^n and let ψ(x) be a non-negative function defined on X^n with 0 < ∫_{E_i} ψ(x) dx < ∞ for all E_i's. Let π = (π_1, ..., π_m) be an m-vector with 0 < π_i < 1 and Σ_{i=1}^m π_i = 1. Let Θ be a compact set of m dimensions, and assume that there exists at least one constant C such that θ̆ ∈ Θ, where θ̆ = (θ̆_1, ..., θ̆_m) and θ̆_i = C + log(∫_{E_i} ψ(x) dx) − log(π_i). Let θ^{(t)} ∈ Θ be the estimate of θ̆ at iteration t. Under the above conditions, we have

P{ lim_{t→∞} θ_i^{(t)} = C + log(∫_{E_i} ψ(x) dx) − log(π_i) } = 1,  i = 1, ..., m,   (4)

where C is an arbitrary constant.
Convergence Rate
Theorem 0.2 Suppose that

E[ H(θ^{(t)}, x^{(t+1)}) − h(θ^{(t)}) | F_t ] = 0,   (5)

with h(θ) = ∫_{X^n} H(θ, x) p(dx) for θ ∈ Θ, and that there exists a constant δ > 0 such that Σ_{i=1}^m π_i²/m > min_{1≤i≤m} π_i − δ > 0. Let the gain factor be chosen with T_0 > δ^{−1}. Define v(θ) = (1/2) Σ_{k=1}^m (S_k/S − π_k)², where S_i = ∫_{E_i} ψ(x) dx / e^{θ_i} and S = Σ_{k=1}^m S_k. We then have

E v(θ^{(t)}) ≤ λ* γ_t

for some suitable constant λ* > 0.
Numerical Examples
AESAMC
Consider minimizing the rotated Rastrigin function,

U(x) = Σ_{i=1}^D (y_i² − 10 cos(2πy_i) + 10),  (y_1, ..., y_D)′ = R(x_1, ..., x_D)′,   (6)

within the region [−5.12, 5.12]^D, where D = 30 and R is a rotation matrix generated using Salomon's method (Salomon, 1996) by multiplying 2D − 3 elementary rotation matrices with the rotation coordinates selected systematically. The global minimizer of this function is located at x* = (0, 0, ..., 0) with U(x*) = 0. The function has a multitude of local minima; for D = 2 there are about 50 local minimizers (Ali et al., 2005).
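A sketch of objective (6) follows. For brevity it applies a single elementary (Givens) rotation rather than Salomon's full product of 2D − 3 rotations, so the rotation matrix here is only a stand-in; the dimension D = 5 is likewise just for illustration.

```python
import math

def elementary_rotation(d, i, j, angle):
    """d x d identity with a Givens rotation in the (i, j) plane."""
    R = [[1.0 if a == b else 0.0 for b in range(d)] for a in range(d)]
    c, s = math.cos(angle), math.sin(angle)
    R[i][i], R[j][j], R[i][j], R[j][i] = c, c, -s, s
    return R

def matvec(R, x):
    return [sum(R[a][b] * x[b] for b in range(len(x))) for a in range(len(x))]

def rotated_rastrigin(x, R):
    """U(x) = sum_i (y_i^2 - 10 cos(2 pi y_i) + 10), with y = R x."""
    y = matvec(R, x)
    return sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) + 10.0 for v in y)

D = 5
R = elementary_rotation(D, 0, 1, math.pi / 6)
u_star = rotated_rastrigin([0.0] * D, R)   # global minimum at the origin
```

Since a rotation fixes the origin, the global minimizer x* = 0 and value U(x*) = 0 are unchanged by R.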
Algorithm | Setting      | Minimum | Maximum | Average | SD     | Cost
AESAMC    | ℵ = 20       | 0.000   | 0.006   | 0.002   | 3.2e-4 | 5.5 × 10^5
AESAMC    | ℵ = 50       | 0.000   | 0.010   | 0.003   | 4.8e-4 | 5.5 × 10^5
SAMC      | ℵ = 100      | 19.75   | 54.95   | 31.85   | 1.56   | 10^6
SAMC      | ℵ = 200      | 19.75   | 48.08   | 29.55   | 1.24   | 10^6
SAMC      | ℵ = 300      | 15.30   | 46.71   | 29.16   | 1.23   | 10^6
SA        | τhigh = 10   | 19.36   | 62.07   | 35.16   | 1.88   | 10^6
SA        | τhigh = 20   | 12.96   | 71.03   | 35.11   | 2.45   | 10^6
SA        | τhigh = 50   | 24.60   | 73.52   | 36.11   | 1.73   | 10^6
RGA       | n = 50       | 13.93   | 53.73   | 27.16   | 1.71   | 1.01 × 10^6
RGA       | n = 100      | 7.96    | 32.84   | 21.99   | 1.02   | 1.01 × 10^6
RGA       | n = 250      | 8.95    | 56.71   | 23.98   | 2.23   | 1.02 × 10^6

Table 1: Comparison of AESAMC, SAMC, SA and RGA for minimizing the rotated Rastrigin function. Let u_i denote the best function value produced in the i-th run. The numbers in the Minimum, Maximum, Average, and SD columns are calculated, respectively, as min_{1≤i≤30} u_i, max_{1≤i≤30} u_i, Σ_{i=1}^{30} u_i/30, and the standard deviation of the average. Cost: the number of function evaluations in each run.
Figure 1: Average progression curves of the best function values (over 30 runs) and their 95% confidence regions (shaded area) produced by AESAMC (ℵ = 50), SAMC (ℵ = 200), SA (τhigh = 20) and RGA (n = 100) for rotated Rastrigin’s function.
This example is based on the rational-expectations model of Hoffman and Schmidt (1981). Minimize the function

U(α, β, γ, x_{01}, x_{02}, x_{03}) = Σ_{t=1}^T (y_{1t} − g_{1t}(α, β, γ))² + Σ_{t=1}^T (y_{2t} − g_{2t}(α, β, γ))² + Σ_{i=1}^3 Σ_{t=1}^T (x_{ti} − γ_i x_{t−1,i})²,   (7)

on the space |α_{ij}| ≤ 10, |β_{ij}| ≤ 10, |γ_i| ≤ 1, and |x_{0i}| ≤ 10.
Algorithm | Setting                | Minimum | Maximum | Average | SD  | Cost
AESAMC    | ℵ = 5000, T0 = 20000   | 62694   | 62984   | 62815   | 19  | 1.85 × 10^7
AESAMC    | ℵ = 15000, T0 = 50000  | 62694   | 63078   | 62813   | 29  | 1.85 × 10^7
SAMC      | ℵ = 10000, T0 = 20000  | 62694   | 63893   | 62864   | 71  | 1.85 × 10^7
SAMC      | ℵ = 20000, T0 = 50000  | 62694   | 63370   | 62940   | 64  | 1.85 × 10^7
SA        | τhigh = 500            | 62699   | 63956   | 63412   | 84  | 1.85 × 10^7
SA        | τhigh = 1000           | 62695   | 64242   | 63543   | 84  | 1.85 × 10^7
SA        | τhigh = 2000           | 62700   | 64779   | 63501   | 118 | 1.85 × 10^7
RGA       | n = 50                 | 64585   | 66703   | 65140   | 122 | 1.9 × 10^7
RGA       | n = 100                | 64579   | 65299   | 64822   | 65  | 1.9 × 10^7
RGA       | n = 200                | 64579   | 65288   | 64861   | 63  | 1.9 × 10^7

Table 2: Comparison of AESAMC, SAMC, SA and RGA for minimization of the function (7). Refer to Table 1 for the notation.
Figure 2: Average progression curves of the best function values (over 20 runs) and their 95% confidence regions (shaded area) produced by AESAMC (ℵ = 5000, T0 = 20000), SAMC (ℵ = 10000, T0 = 20000), SA (τhigh = 10) and RGA (n = 100) for minimizing the function (7).
Comparing AESAMC with other metaheuristics
AESAMC was compared with the following algorithms on a set of benchmark multimodal test functions whose global minima are known. The names of the test functions are given in Table 3, and their definitions can be found in Laguna and Martí (2005), Hedar and Fukushima (2006), or Hirsch et al. (2006).
• Genetic algorithm (Genocop III) (Michalewicz and Nazhiyath, 1995) Genocop III is an implementation of the genetic algorithm that is customized for solving nonlinear optimization problems with continuous and bounded variables.
• Scatter search algorithm (Laguna and Martí, 2005) The scatter search algorithm is an evolutionary algorithm that, unlike the genetic algorithm, operates on a small population of solutions and makes only limited use of randomization as a proxy for diversification when searching for a globally optimal solution.
• Directed tabu search (DTS) algorithm (Hedar and Fukushima, 2006) The DTS algorithm is a hybrid of the tabu search algorithm (Glover and Laguna, 1997) and a direct search algorithm.
• Continuous GRASP (C-GRASP) algorithm (Hirsch et al., 2006; Hirsch et al., 2007) The C-GRASP algorithm is a multi-start search algorithm for continuous optimization problems subject to box constraints, where a starting solution for local improvement is constructed in a greedy randomized fashion.
No. | Function                    | d^a | U^b_min  || No. | Function                | d^a | U^b_min
1   | Branin RCOS (RC)            | 2   | 0.398    || 21  | Power Sum (PS8,18,44,114) | 4  | 0
2   | Bohachevsky (B2)            | 2   | 0        || 22  | Hartmann (H6,4)         | 6   | -3.335
3   | Easom (ES)                  | 2   | -1       || 23  | Schwefel (SC6)          | 6   | 0
4   | Goldstein & Price (GP)      | 2   | 3        || 24  | Trid (T6)               | 6   | -50
5   | Shubert (SH)                | 2   | -186.731 || 25  | Trid (T10)              | 10  | -210
6   | Beale (BL)                  | 2   | 0        || 26  | Rastrigin (RT10)        | 10  | 0
7   | Booth (BO)                  | 2   | 0        || 27  | Griewank (G10)          | 10  | 0
8   | Matyas (MT)                 | 2   | 0        || 28  | Sum Squares (SS10)      | 10  | 0
9   | Hump (HM)                   | 2   | 0        || 29  | Rosenbrock (R10)        | 10  | 0
10  | Schwefel (SC2)              | 2   | 0        || 30  | Zakharov (Z10)          | 10  | 0
11  | Rosenbrock (R2)             | 2   | 0        || 31  | Rastrigin (RT20)        | 20  | 0
12  | Zakharov (Z2)               | 2   | 0        || 32  | Griewank (G20)          | 20  | 0
13  | De Joung (DJ)               | 3   | 0        || 33  | Sum Squares (SS20)      | 20  | 0
14  | Hartmann (H3,4)             | 3   | -3.863   || 34  | Rosenbrock (R20)        | 20  | 0
15  | Colville (CV)               | 4   | 0        || 35  | Zakharov (Z20)          | 20  | 0
16  | Shekel (S4,5)               | 4   | -10.153  || 36  | Powell (PW24)           | 24  | 0
17  | Shekel (S4,7)               | 4   | -10.403  || 37  | Dixon & Price (DP25)    | 25  | 0
18  | Shekel (S4,10)              | 4   | -10.536  || 38  | Levy (L30)              | 30  | 0
19  | Perm (P4,0.5)               | 4   | 0        || 39  | Sphere (SR30)           | 30  | 0
20  | Perm (P′4,0.5)              | 4   | 0        || 40  | Ackley (AK30)           | 30  | 0

Table 3: Test functions (Laguna and Martí, 2005). ^a Dimension of the domain X. ^b Global minimum value of the test function.
Algorithm              | 5000^a       | 10000^a      | 20000^a      | 50000^a
Genocop III^b          | 636.37       | 399.52       | 320.84       | 313.34
Scatter Search^b       | 4.96         | 3.60         | 3.52         | 3.46
C-GRASP^c              | 6.20         | 4.73         | 3.92         | 3.02
DTS^d                  | 4.22         | 1.80         | 1.70         | 1.29
ASAMC^e (ℵ = 10)       | 4.105(0.114) | 2.550(0.079) | 1.757(0.079) | 1.048(0.057)
ASAMC^e (ℵ = 5)        | 3.422(0.093) | 2.363(0.073) | 1.505(0.063) | 0.939(0.034)
ASAMC^e (ℵ = 1)        | 3.034(0.114) | 2.023(0.090) | 1.391(0.074) | 0.946(0.057)
SA^e (τhigh = 20)      | 3.584(0.105) | 2.594(0.080) | 1.907(0.056) | 1.161(0.042)
SA^e (τhigh = 5)       | 3.053(0.106) | 1.800(0.085) | 1.170(0.058) | 0.713(0.032)
SA^e (τhigh = 2)       | 2.986(0.117) | 1.889(0.086) | 1.116(0.062) | 0.689(0.032)
SA^e (τhigh = 1)       | 2.360(0.106) | 1.547(0.080) | 1.063(0.062) | 0.668(0.031)
SA^e (τhigh = 0.5)     | 2.445(0.105) | 1.389(0.073) | 1.061(0.068) | 0.754(0.056)
AESAMC^f (ℵ = 10)      | 1.592(0.083) | 0.822(0.018) | 0.676(0.012) | 0.499(0.008)
AESAMC^f (ℵ = 1)       | 1.934(0.272) | 0.785(0.013) | 0.657(0.016) | 0.488(0.011)

Table 4: Average optimality gap values over the 40 test functions. ^a The number of function evaluations. ^b The results are from Laguna and Martí (2005). ^c The results are from Hirsch et al. (2006). ^d The results are from Hedar and Fukushima (2006). ^e,f The number in parentheses denotes the standard deviation of the average.
Figure 3: Comparison of optimality gap values produced by AESAMC (with ℵ = 1), ASAMC and SA with 50000 function evaluations for each of the 40 test functions. (a) Difference of the optimality gap values produced by ASAMC and AESAMC (with ℵ = 1). (b) Difference of the optimality gap values produced by SA and AESAMC (with ℵ = 1).
Discussion
• Self-adjusting ability: AESAMC is not trapped by local minima.
• It converges in distribution to the global minimum at a rate of O(1/t), which is much faster than the convergence rate of simulated annealing.
• It improves the acceptance rate of crossover operations, so the distribution information in the population can be used more effectively in optimization.