Revisited: The Article "Covariance Matrix Adaptation Revisited – the CMSA Evolution Strategy"
The paper Covariance Matrix Adaptation Revisited – the CMSA Evolution Strategy by Beyer and Sendhoff [1] proposes the so-called covariance matrix self-adaptation evolution strategy, CMSA-ES. Compared to CMA-ES, cumulative step-size adaptation of the global step-size is replaced by self-adaptation in CMSA-ES. Furthermore, the internal parameter setting is modified. More specifically, the CMSA-ES algorithm deviates from the default CMA-ES [3,4] in the following ways:
- Step-size σ is adapted via mutative step-size control (i.e., "pure" self-adaptation) instead of cumulative step-size adaptation (i.e., derandomized and non-local self-adaptation, also known as cumulative path-length control).
- The rank-one update of the covariance matrix is omitted. In the default CMA-ES this can be accomplished by setting the learning rate for the rank-one update of the covariance matrix to zero.
(The parameter is denoted c_{1} in [4] (see Appendix A), α_{cov}×c_{cov} in [2], c_{cov} / μ_{cov} in [3], and 1 / (τ_{c} μ_{eff}) in [1]. In the notation of [2] this means α_{cov} = 0; in the notation of [3], μ_{cov} = ∞; in the notation of [1] the option is not available, because μ_{eff} cannot be freely chosen, however setting τ_{p} = 1 comes close.)
- The recombination weights are set to 1/μ, with μ ≈ λ/4, as in [2]. The current default recombination weights in CMA-ES are positive and monotonically decreasing up to μ ≈ λ/2. In both cases μ_{eff} ≈ λ/4.
The latter two points are simply deviations from the current default parameter setting of CMA-ES, where the third point is likely of minor relevance.
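The first point can be illustrated with a minimal sketch of the two step-size rules. The formulas below follow the standard textbook/tutorial setting [4]; the function and variable names are ours, and the sketch omits the surrounding sampling and selection loop:

```python
import numpy as np

def mutative_sigma(sigma, n, rng):
    """Mutative self-adaptation (CMSA-ES style): each offspring perturbs
    the parental step-size log-normally before sampling its search step;
    the step-sizes of the selected offspring are then recombined."""
    tau = 1.0 / np.sqrt(2.0 * n)  # a common choice for the learning parameter
    return sigma * np.exp(tau * rng.standard_normal())

def csa_sigma(sigma, p_sigma, n, mu_eff):
    """Cumulative step-size adaptation (CMA-ES): increase sigma if the
    evolution path p_sigma is longer than expected under random
    selection, decrease it otherwise."""
    c_sigma = (mu_eff + 2.0) / (n + mu_eff + 5.0)
    d_sigma = 1.0 + 2.0 * max(0.0, np.sqrt((mu_eff - 1.0) / (n + 1.0)) - 1.0) + c_sigma
    chi_n = np.sqrt(n) * (1.0 - 1.0 / (4.0 * n) + 1.0 / (21.0 * n ** 2))  # E||N(0,I)||
    return sigma * np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1.0))
```

Note that the damping d_sigma contains a term roughly proportional to (μ_{eff}/n)^{1/2}, so the CSA step-size change per iteration fades for very large populations.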
In [1], experimental results compare CMSA-ES with CMA-ES with population sizes λ = 8, 4n, and 4n^{2} on the following four functions:
- Sphere
- Schwefel's ellipsoid
- Rosenbrock
- Rastrigin (only for λ = 4n^{2})
The following relevant effects on the number of function evaluations needed to reach a small target function value can be observed.

- With λ = 8 and λ = 4n and n > 10, CMSA becomes worse than CMA on the Rosenbrock function by a factor of approximately ten.

- With λ = 4n^{2} and n > 10, CMSA becomes better than CMA on all functions but the Rosenbrock function. The effect increases with the dimension n and reaches about a factor of 15 in 80-D (with population size 25600).
The presented results suggest that on functions where covariance matrix adaptation does not play a significant role, CMSA-ES remarkably outperforms CMA-ES when the population size λ is at least about 1000 and λ ≫ n > 10. However, this conclusion does not hold.
Discussion
Changes to the algorithm

- Mutative step-size control (σ-self-adaptation), as used in CMSA-ES, yields (almost) invariably a smaller step-size than cumulative step-size adaptation as used in CMA-ES. This means we expect CMSA-ES to improve on rather simple functions, where fast convergence is the major concern, but to become worse on more difficult functions, where too-small step-sizes are a common problem. For very large population size λ, the effect of cumulative step-size adaptation in CMA-ES fades out (the damping has a component roughly proportional to (μ_{eff}/n)^{1/2}); however, summing a large number of steps can counteract this effect, and significant step-size increments can still be observed.

- Changing the default recombination weights is expected to have a small effect. The default setting is typically about 15% faster.

- Omitting the rank-one update (setting c_{1} = 0) has a severe effect with small population size: it impairs the scaling (of the number of needed f-evaluations) of CMA-ES on the class of cigar and smooth ridge-like functions from linear in the search-space dimension to subquadratic (see, e.g., Principled Design..., Sect. 7.2). Furthermore, on functions where the full covariance matrix needs to be adapted, the typical loss factor for setting c_{1} = 0 is 2 to 4. For very large population size, the relevance of the rank-one update is expected to diminish, because by default c_{1} → 0 for λ → ∞ (however, summing a larger number of steps might counteract this effect).
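The latter two points can be made concrete with a small sketch computing the default recombination weights, μ_{eff}, and the rank-one learning rate c_{1}. The formulas follow the tutorial [4] (positive weights only); the function names are ours:

```python
import numpy as np

def default_weights(lam):
    """Default positive CMA-ES recombination weights: log-decreasing
    over the best mu = lam // 2 offspring (cf. the tutorial [4])."""
    mu = lam // 2
    w = np.log((lam + 1) / 2.0) - np.log(np.arange(1, mu + 1))
    return w / w.sum()

def mu_eff(w):
    """Variance effective selection mass, 1 / sum(w_i^2)."""
    return 1.0 / np.sum(w ** 2)

def c1(n, mueff):
    """Default learning rate of the rank-one update; it vanishes
    as mu_eff (and hence lambda) grows."""
    return 2.0 / ((n + 1.3) ** 2 + mueff)
```

For λ = 320 the resulting μ_{eff} is close to λ/4, i.e., close to the equal-weights setting of CMSA-ES, which is one reason the change of weights matters little; and c1(n, μ_{eff}) shrinks with growing λ, which is the default c_{1} → 0 behavior mentioned above.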
Selection of test functions
With small λ, covariance matrix adaptation has a significant impact on the performance only on the Rosenbrock function, where it speeds up an isotropic ES by a factor of about 10 to 30 (the effect is more pronounced in smaller dimensions). On Schwefel's ellipsoid, the known speed-up is about a factor of two. This effect is small compared to the effect on actually ill-conditioned functions: the speed-up on the "classic ill-conditioned ellipsoid" (with condition number 10^{6} and log-uniform factors) is between a factor of 100 and 1000; on the cigar function the factor is 50000 (sic!) and independent of the dimension.
In order to solve the Rastrigin function, a large population size is needed. On the Rastrigin function, the distribution shape does not need to be adapted, but with a large population size covariance matrix adaptation becomes increasingly important for the convergence process.
On the tested functions, the shaping of the sample distribution is of remarkable benefit only on the Rosenbrock function.
Results
The following tables show the number of function evaluations to reach the target value 10^{−10} for dimension n = 80, taken from [1] and from [2] (HybrCMA, which uses the same equal recombination weights as CMSA-ES and an odd setting of the parameters for cumulative step-size adaptation), together with simulation results of CMA-ES with default settings according to [3] (where the defect in [2] was addressed by adjusting step-size parameters for large population size and by introducing H_{σ}, which indirectly modulates c_{1}) and [4] (with the significant adjustment of c_{c} = (4 + μ_{eff}/n) / (4 + n + 2μ_{eff}/n), Hansen 2009, for large population size, i.e., large μ_{eff}, analogous to the adjustment of c_{σ} in [3]).
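As a side note, the adjusted c_{c} formula quoted above can be compared with the older default in a short sketch (the value μ_{eff} ≈ λ/4 is taken from the settings under discussion; function names are ours):

```python
def cc_old(n):
    """Older default for the backward time horizon constant of the
    evolution path of the rank-one update, independent of mu_eff."""
    return 4.0 / (n + 4.0)

def cc_adjusted(n, mu_eff):
    """Adjusted c_c (Hansen 2009, used in [4]): approaches the old value
    for small mu_eff and grows toward 1/2 for mu_eff >> n^2, i.e., the
    path is updated faster when the population is very large."""
    return (4.0 + mu_eff / n) / (4.0 + n + 2.0 * mu_eff / n)

n = 80
for lam in (320, 25600):
    print(lam, round(cc_old(n), 4), round(cc_adjusted(n, lam / 4.0), 4))
```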
Table 1. Population Size λ = 4n = 320
function evaluations [× 10000] to reach the target function value 10^{−10} for n = 80

                   sphere   Schwefel   Rosenbrock
CMSA from [1]         6        37        1100
CMA from [1]         10        19         150
HybrCMA from [2]     20       N/A         200
CMA as in [3]        10        19         130
CMA as in [4]        10        19         130
Table 2. Population Size λ = 4n^{2} = 25600
function evaluations [× 25600] to reach the target function value 10^{−10} for n = 80

                   sphere   Schwefel   Rastrigin   Rosenbrock
CMSA from [1]        90       100         180         6000
CMA from [1]       1400      1500        3000         8000
HybrCMA from [2]    400       N/A         N/A         6000
CMA as in [3]       205       227         324         2250
CMA as in [4]       130       136         229         1033
Table 2 shows the setting (n = 80 and λ = 4n^{2}) where CMSA-ES seemed to excel the most (Table 1 serves as a cross-check for moderately large population size). The standard deviations for the simulated results (CMA as in [3] and [4]) are about ±1%. Results from the literature are read from graphs and can have an error of about ±10%.
The results for CMA-ES as presented in [1] for λ = 4n^{2} cannot be reproduced, even with the odd parameter settings from [2]. A review of the source code used in [1] reveals that the results are due to a bug: the update of the sample distribution from the covariance matrix was postponed λ times longer than advised for the original algorithm.
On the sphere, Schwefel's function, and Rastrigin's function, performance with very large population size is to a large extent determined by approaching the optimum fast (on the sphere function this is virtually the only concern). Self-adaptation of the global step-size in CMSA-ES reduces the step-size faster than cumulative step-size adaptation in CMA-ES; however, the performance difference hardly exceeds a factor of two. In CMA-ES, for very large population size, the major part of the variance reduction is due to covariance matrix adaptation rather than step-size control (because d_{σ} is large). This also explains why the bug strongly affects the performance on the sphere function.
The parameter correction for very large population size in [4] seems to be reasonably effective.
Why CMA-ES is considerably faster on the Rosenbrock function even for very large population size is not immediately obvious. Additional experiments reveal that the rank-one update in [4] is irrelevant for this result (in [3] it leads to the observed impairment) and that weighted recombination has only a minor effect (indeed, about 15%). Consequently, replacing cumulative step-size adaptation with σ-self-adaptation must be the reason for the performance decline of CMSA with λ = 4n^{2} on the Rosenbrock function. Without step-size adaptation, CMA-ES with λ = 4n^{2} fails to solve the Rosenbrock function (it still works with λ ≤ 4n). We find that also with λ = 4n, step-size adaptation is the main reason for the performance decline of CMSA-ES on the Rosenbrock function.
Conclusion
The simulation results in [1], which led to the conclusion that CMSA-ES is superior with large population size, are due to a bug in the source code used for CMA-ES. When comparing CMSA-ES with a correct implementation of CMA-ES, or with results from the literature, the most pronounced advantage does not exceed a factor of two. This advantage is obtained when a rapid decrease of the step-size is desirable. It can be explained by known algorithm properties: self-adaptation of the global step-size, as used in CMSA-ES, adapts smaller step-sizes and decreases the step-size more rapidly.
When adaptation of the covariance matrix is decisive, CMA-ES outperforms CMSA-ES, where the difference is less pronounced with larger population size and with smaller dimension. Above, we observed a factor of up to 8 on the Rosenbrock function. With population size λ ≥ 4n, self-adaptation of the global step-size has been identified as the main reason for the performance decline of CMSA-ES on the Rosenbrock function. Here, unlike on the sphere function, the smaller adapted step-size becomes a disadvantage.
References
[1] Beyer, H.-G. and B. Sendhoff (2008). Covariance Matrix Adaptation Revisited – the CMSA Evolution Strategy. In Rudolph et al. (eds.), Parallel Problem Solving from Nature, PPSN X, Proceedings, pp. 123–132, Springer.
[2] Hansen, N., S.D. Müller and P. Koumoutsakos (2003). Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp. 1–18.
[3] Hansen, N. and S. Kern (2004). Evaluating the CMA Evolution Strategy on Multimodal Test Functions. In Eighth International Conference on Parallel Problem Solving from Nature, PPSN VIII, Proceedings, pp. 282–291, Springer.
[4] Hansen, N. (2011). The CMA Evolution Strategy: A Tutorial. Continuously updated technical report, June 28, 2011.
last change: 2014-06-14 13:19:16 +0200 (Sat, 14 Jun 2014)