Efficient Covariance Matrix Update for Variable Metric
Evolution Strategies
Thorsten Suttorp, Nikolaus Hansen and Christian Igel
Randomized direct search algorithms for continuous domains, such as
Evolution Strategies, are basic tools in machine learning. They are
especially needed when the gradient of an objective function (e.g.,
loss, energy, or reward function) cannot be computed or estimated
efficiently. Application areas include supervised and reinforcement
learning as well as model selection. These randomized search
strategies often rely on normally distributed additive variations of
candidate solutions. In order to efficiently search in non-separable
and ill-conditioned landscapes the covariance matrix of the normal
distribution must be adapted, amounting to a variable metric
method. Consequently, Covariance Matrix Adaptation (CMA) is
considered state-of-the-art in Evolution Strategies. In order to
sample the normal distribution, the adapted covariance matrix needs
to be decomposed, requiring in general $\Theta(n^3)$ operations,
where $n$ is the search space dimension. We propose a new update
mechanism which can replace a rank-one covariance matrix update and
the computationally expensive decomposition of the covariance
matrix. The newly developed update rule reduces the computational
complexity of the rank-one covariance matrix adaptation to
$\Theta(n^2)$ without resorting to outdated distributions. We
derive new versions of the elitist Covariance Matrix Adaptation
Evolution Strategy (CMA-ES) and the multi-objective CMA-ES. These
algorithms are equivalent to the original procedures except that the
update step for the variable metric distribution scales better in
the problem dimension. We also introduce a simplified variant of the
non-elitist CMA-ES with the incremental covariance matrix update and
investigate its performance. Apart from the reduced time-complexity
of the distribution update, the algebraic computations involved in
all new algorithms are simpler compared to the original versions.
The new update rule improves the performance of the CMA-ES for large-scale machine learning problems in which the objective function can be evaluated quickly.
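The core algebraic idea behind the $\Theta(n^2)$ update can be illustrated as follows: if $C = AA^{\mathsf T}$ and the rank-one update has the form $C' = \alpha C + \beta v v^{\mathsf T}$ with $v = Aw$, then a factor $A'$ of $C'$ can be obtained directly from $A$ by a rank-one correction, avoiding a fresh $\Theta(n^3)$ decomposition. The following numpy sketch verifies this identity numerically; the function name, parameters, and constants are illustrative, not the paper's exact algorithm (which embeds this update into the CMA-ES evolution-path machinery).

```python
import numpy as np

def factor_rank_one_update(A, w, alpha, beta):
    """O(n^2) update of a factor A with C = A A^T such that the
    returned A' satisfies A' A'^T = alpha*C + beta*v v^T, v = A w.
    Sketch of the kind of incremental factor update the abstract
    refers to; names and constants are illustrative."""
    v = A @ w                                  # O(n^2) matrix-vector product
    w2 = w @ w                                 # squared norm of w
    s = np.sqrt(1.0 + (beta / alpha) * w2)
    coef = np.sqrt(alpha) / w2 * (s - 1.0)
    # rank-one outer-product correction, O(n^2) overall
    return np.sqrt(alpha) * A + coef * np.outer(v, w)

rng = np.random.default_rng(0)
n = 6
A = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)  # some initial factor
w = rng.standard_normal(n)
alpha, beta = 0.9, 0.1                        # illustrative update weights

A_new = factor_rank_one_update(A, w, alpha, beta)
C_new = alpha * (A @ A.T) + beta * np.outer(A @ w, A @ w)
print(np.allclose(A_new @ A_new.T, C_new))    # True
```

Note that the updated factor need not remain triangular; any factor $A'$ with $A'A'^{\mathsf T} = C'$ suffices to sample the distribution via $x = m + A'z$ with $z \sim \mathcal{N}(0, I)$.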
ERRATUM: On p. 181 the parameter $c_c$ is discussed:
"... Only for $1/c_c \propto n$ the performance scale-up of the
optimization on ridge-like functions becomes linear with $n$
(Hansen and Ostermeier 2001) which rules out considerably larger
values, for example $1/c_c = \sqrt{n}$."
This sentence misinterprets the cited source. In fact, values of
$1/c_c = \sqrt{n}$ can be used without considerable loss in performance
for the $(\mu/\mu_w,\lambda)$-CMA-ES from Hansen and Ostermeier (2001).
For the (1+1)-CMA-ES this has not been tested.