For some time now, I have been thinking about ways to make texel tuning - or optimization in general - more precise while keeping it fast. As you may have read in the Progress on Loki thread, I have also been working on a neural network for my engine, and developing a trainer for that. During this time, I have been trying to implement backpropagation in order to train the networks, and I got an idea from that.
Discussing SPSA and local optimization
Before I elaborate on my idea, let's go through some of the advantages/disadvantages of both SPSA and local optimization:
Advantages of local optimization:
- It is very stable since it measures the objective function for one parameter at a time.
- You are nearly always guaranteed to find a minimum and stay at it.
- Some local optimization algorithms don't need any hyper-parameters in order to work, which makes implementation and setup easier.
Disadvantages of local optimization:
- It is very slow, since the objective function has to be measured at least once and at most twice for each parameter. This gives it a time complexity between the best case O(n) and the worst case O(2n). Note: this is for one iteration of the pseudo code from CPW's Texel's tuning method.
- Even though one is very likely to find a minimum, it is in no way guaranteed to be a global one. Local optimization needs very good initial guesses in order to reach a global minimum, and such algorithms are notorious for getting stuck in sub-optimal local minima.
Advantages of SPSA:
- It is very fast, with a time complexity independent of the number of parameters: O(2), since there are only two objective function measurements each iteration.
- More likely to find a global minimum due to its searching of the solution space, which comes from the random perturbations of the parameter vector.
Disadvantages of SPSA:
- Very hyper-parameter dependent. The algorithm is often quite sensitive to hyper-parameters like the step size and the perturbation size. Unless these are chosen perfectly (which is nearly impossible), the algorithm will in some sense forget the less important parameters (e.g. the queen's value will be so dominant that the value of a passed pawn on the third rank can't be optimized properly).
- The gradient components all have the same magnitude and differ only in sign, which is a byproduct of computing the objective function only twice. This means that small values (roughly -40 to 40 cp) get changed drastically each iteration due to over-compensation.
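To make the sign-only gradient concrete, here is a minimal sketch of one vanilla SPSA iteration. The loss function, step size a and perturbation size c are placeholders for illustration, not any engine's actual tuner:

```python
import random

def spsa_step(theta, loss, a=0.1, c=1.0):
    # Rademacher perturbation: each delta_i is +1 or -1.
    delta = [random.choice((-1.0, 1.0)) for _ in theta]
    theta_plus = [t + c * d for t, d in zip(theta, delta)]
    theta_minus = [t - c * d for t, d in zip(theta, delta)]
    # Only two objective-function measurements, regardless of len(theta).
    diff = loss(theta_plus) - loss(theta_minus)
    # Every gradient component is diff / (2*c*delta_i): identical magnitude,
    # only the sign differs -- the over-compensation problem described above.
    grad = [diff / (2.0 * c * d) for d in delta]
    return [t - a * g for t, g in zip(theta, grad)]
```

Note how `diff` is a single scalar shared by all components, which is exactly why small parameters get the same-sized update as dominant ones.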
While working on backpropagation, I wondered if it would be possible to apply the same principles to a normal HCE. I don't know if this is possible in general, but if we use Loki's HCE as an example, it is computed the following way (the same as CPW-Engine's):
Eval(pos) = (p[mg] * E[mg] + p[eg] * E[eg]) / p[total] = (1 / p[total]) * (p[mg] * sum(T[mg] * W[mg]) + p[eg] * sum(T[eg] * W[eg]))
(Note: the [] are only notation for the different phases of the tapered eval. Not indexing)
Where E[mg] and E[eg] are the middlegame and endgame evaluations respectively, and p[mg] and p[eg] are the phase weights of the two. T[mg]/T[eg] and W[mg]/W[eg] are the terms and weights for the corresponding phases. As in normal Texel tuning, all evaluations are mapped to a WDL (win-draw-loss) probability using the following sigmoid function:
σ(x) = 1 / (1 + 10^(-K*x / 400))
Where K is a numerically determined constant, specific to each individual evaluation function. We can now define the loss function as a mean squared error; thus, the loss of a single training example becomes C = (a - y)^2, where a is σ(Eval) and y is the game's outcome (1 for a white win, 0.5 for a draw and 0 for a black win). Note that the evaluation needs to be white-relative.
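The WDL mapping and per-position loss above can be sketched in a few lines. The value of K below is only an example; it has to be fitted numerically to the evaluation function being tuned:

```python
K = 1.13  # example value; determined numerically per evaluation function

def sigmoid(x):
    # sigma(x) = 1 / (1 + 10^(-K*x/400)), mapping a white-relative
    # centipawn score to a WDL probability.
    return 1.0 / (1.0 + 10.0 ** (-K * x / 400.0))

def loss(eval_cp, outcome):
    # outcome: 1.0 white win, 0.5 draw, 0.0 black win.
    # Squared error of a single training example: C = (sigma(Eval) - y)^2.
    return (sigmoid(eval_cp) - outcome) ** 2
```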
Now comes the idea from backpropagation in neural networks: Suppose we want to optimize the i'th middlegame weight, denoted by Wi[mg]. We want to know what the gradient of the loss function with respect to this specific weight is, ∂C/∂Wi[mg]. From the chain rule, we can now express this as:
∂C/∂Wi[mg] = ∂C/∂Eval * ∂Eval/∂Wi[mg]
Since we know that C = (a - y)^2 = (σ(Eval) - y)^2, we find that ∂C/∂Eval = 2 * (σ(Eval) - y) * σ'(Eval). For the other term, it can be shown that
∂Eval/∂Wi[mg] = (p[mg] / p[total]) * T[mg]
Thus, we can express the gradient of the loss function for a single training example, with respect to the i'th middlegame weight, as:
∂C/∂Wi[mg] = 2 * (σ(Eval) - y) * σ'(Eval) * (p[mg] / p[total]) * T[mg]
The derivation is analogous for any endgame weight (with p[eg] and T[eg] instead). One thing to note is that weights which do not enter the evaluation linearly have to be handled differently, but it should still be possible.
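The per-example gradient above translates directly into code. This is a sketch under assumed names (phase_mg, phase_total, term_mg are illustrative, not Loki's actual identifiers); the sigmoid is repeated here with an example K so the snippet is self-contained:

```python
import math

K = 1.13  # example scaling constant, fitted per evaluation function

def sigmoid(x):
    return 1.0 / (1.0 + 10.0 ** (-K * x / 400.0))

def sigmoid_prime(x):
    # d/dx of 1/(1+10^(-K*x/400)) = sigma(x) * (1 - sigma(x)) * K*ln(10)/400
    s = sigmoid(x)
    return s * (1.0 - s) * K * math.log(10.0) / 400.0

def grad_mg_weight(eval_cp, outcome, phase_mg, phase_total, term_mg):
    # dC/dW_i[mg] = 2*(sigma(Eval) - y) * sigma'(Eval) * (p[mg]/p[total]) * T_i[mg]
    return (2.0 * (sigmoid(eval_cp) - outcome)
            * sigmoid_prime(eval_cp)
            * (phase_mg / phase_total)
            * term_mg)
```

When the predicted WDL probability already matches the outcome, the residual (sigma(Eval) - y) is zero and so is the gradient, as expected.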
Now, gradient estimations are nothing new. The primary idea is to use the perturbations from SPSA to do the optimization while also searching the solution space for global minima:
- At the beginning of each iteration, we compute theta_plus and theta_minus exactly as we would for regular SPSA.
- Now we use the backpropagation-inspired algorithm to compute an average of the gradients over the entire training set (Note: This can also be done for batches if the dataset is very large) for both vectors separately.
- Apply these gradients to update the weights of theta_plus and theta_minus until they reach local minima.
- Measure the loss function of theta_plus and theta_minus after they have been optimized.
- Choose theta for the next iteration as whichever of the two has the lower error after backprop.
- Repeat this process until theta_plus and theta_minus converge.
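The steps above can be sketched as a single iteration of the hybrid scheme. The dataset format, the loss/gradient callbacks and the hyper-parameter values are all assumptions for illustration, not a finished tuner:

```python
import random

def average_gradient(theta, dataset, grad_fn):
    # Mean of the per-example gradients over the whole training set
    # (for very large datasets this could be done per batch instead).
    total = [0.0] * len(theta)
    for example in dataset:
        g = grad_fn(theta, example)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / len(dataset) for t in total]

def hybrid_iteration(theta, dataset, loss_fn, grad_fn,
                     c=1.0, lr=0.01, descent_steps=10):
    # Step 1: perturb exactly as in regular SPSA.
    delta = [random.choice((-1.0, 1.0)) for _ in theta]
    theta_plus = [t + c * d for t, d in zip(theta, delta)]
    theta_minus = [t - c * d for t, d in zip(theta, delta)]
    # Steps 2-3: drive each perturbed vector toward a local minimum
    # with gradient descent on the averaged gradients.
    for _ in range(descent_steps):
        gp = average_gradient(theta_plus, dataset, grad_fn)
        gm = average_gradient(theta_minus, dataset, grad_fn)
        theta_plus = [t - lr * g for t, g in zip(theta_plus, gp)]
        theta_minus = [t - lr * g for t, g in zip(theta_minus, gm)]
    # Steps 4-5: keep whichever optimized vector has the lower loss.
    if loss_fn(theta_plus, dataset) <= loss_fn(theta_minus, dataset):
        return theta_plus
    return theta_minus
```

The outer loop (step 6) would simply call hybrid_iteration repeatedly until the two vectors converge.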
- Because of the perturbations from SPSA, we get the desired effect of searching the solution space. This makes finding a global minimum possible - or at least more likely than with regular local optimization. This solution-space search reminds me a bit of genetic algorithms, but instead of a big, randomly initialized population, it is more controlled and better suited to good initial guesses.
- There are fewer hyper-parameters: only a learning rate (for optimizing theta_plus and theta_minus) and a perturbation size.
- The gradients differ not just in sign but also in magnitude, as compared to normal SPSA. This removes the earlier-mentioned problem of over-compensation for smaller/less dominant parameters.
- The algorithm's execution speed is nearly the same for different numbers of parameters. By nearly, I mean that the backpropagation takes some extra time for bigger parameter sets, but this is negligible compared to local optimization. Therefore its time complexity is ~O(2) (two measurements of the objective function).
- Due to the local optimization of theta_plus and theta_minus, we are less likely (compared to normal SPSA) to shoot past a global minimum, because theta_plus and theta_minus will always return to said minimum if it is strong enough.
Any feedback/questions will be much appreciated! I would also like to hear if anyone has done something similar in the past, and if so, how it turned out.