For some time now, I have been thinking about ways to make texel tuning - or optimization in general - more precise while keeping it fast. As you may have read in the Progress on Loki thread, I have also been working on a neural network for my engine, and developing a trainer for that. During this time, I have been trying to implement backpropagation in order to train the networks, and I got an idea from that.
Discussing SPSA and local optimization
Before I elaborate on my idea, let's go through some of the advantages/disadvantages of both SPSA and local optimization:
Advantages of local optimization:
- It is very stable since it measures the objective function for one parameter at a time.
- You are nearly always guaranteed to find a minimum and stay at it.
- Some local optimization algorithms don't need any hyper-parameters in order to work, which makes implementation and setup easier.
Disadvantages of local optimization:
- It is very slow, since the objective function has to be measured at least once and at most twice for each parameter. This gives it a time complexity between the best case O(n) and the worst case O(2n). Note: this is for one iteration of the pseudo code from CPW's Texel's tuning method.
- Even though one is very likely to find a minimum, it is in no way guaranteed to be a global one. Local optimization needs very good initial guesses in order to reach a global minimum, and such algorithms are notorious for getting stuck in sub-optimal local minima.
Advantages of SPSA:
- It is very fast, with a time complexity independent of the number of parameters: O(2), since there are only two objective function measurements each iteration.
- More likely to find a global minimum due to its searching of the solution space, which comes from the random perturbations of the parameter vector.
Disadvantages of SPSA:
- Very hyper-parameter dependent. The algorithm is often quite sensitive to hyper-parameters like the step size and the perturbation size. Unless these are chosen perfectly (which is nearly impossible), the algorithm will in some sense forget the less important parameters (e.g. the queen's value will be so dominant that the value of a passed pawn on the third rank can't be optimized properly).
- The gradient components all have the same magnitude and differ only in sign, which is a byproduct of computing the objective function only twice. This means that small values (roughly -40 to 40 cp) get changed drastically each iteration due to over-compensation.
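To make the sign-only gradient concrete, here is a minimal sketch of one vanilla SPSA iteration. The loss function, step size a and perturbation size c are placeholders for illustration, not any engine's actual tuner:

```python
import random

def spsa_step(theta, loss, a=0.1, c=1.0):
    # Rademacher perturbation: each delta_i is +1 or -1.
    delta = [random.choice((-1.0, 1.0)) for _ in theta]
    theta_plus = [t + c * d for t, d in zip(theta, delta)]
    theta_minus = [t - c * d for t, d in zip(theta, delta)]
    # Only two objective-function measurements, regardless of len(theta).
    diff = loss(theta_plus) - loss(theta_minus)
    # Every gradient component is diff / (2*c*delta_i): identical magnitude,
    # only the sign differs -- the over-compensation problem described above.
    grad = [diff / (2.0 * c * d) for d in delta]
    return [t - a * g for t, g in zip(theta, grad)]
```

Note how `diff` is a single scalar shared by all components, which is exactly why small parameters get the same-sized update as dominant ones.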
While working on backpropagation, I wondered if it would be possible to apply the same principles to a normal HCE. I don't know if this is possible in general, but if we use Loki's HCE as an example, it is computed the following way (the same as CPW-Engine's):
Eval(pos) = (p[mg] * E[mg] + p[eg] * E[eg]) / p[total] = (1 / p[total]) * (p[mg] * sum(T[mg] * W[mg]) + p[eg] * sum(T[eg] * W[eg]))
(Note: the [] are only notation for the different phases of the tapered eval. Not indexing)
Where E[mg] and E[eg] are the middlegame and endgame evaluations respectively, and p[mg] and p[eg] are the phase weights of the two. T[mg]/T[eg] and W[mg]/W[eg] are the terms and weights for the corresponding phases. As in normal Texel tuning, all evaluations are mapped to a WDL (win-draw-loss) probability using the following sigmoid function:
σ(x) = 1 / (1 + 10^(-K*x / 400))
Where K is a numerically determined constant, specific to each individual evaluation function. We can now define the loss function as a mean squared error; thus, the loss of a single training example becomes C = (a - y)^2, where a is σ(Eval) and y is the game's outcome (1 for a white win, 0.5 for a draw and 0 for a black win). Note that the evaluation needs to be white-relative.
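The WDL mapping and per-position loss above can be sketched in a few lines. The value of K below is only an example; it has to be fitted numerically to the evaluation function being tuned:

```python
K = 1.13  # example value; determined numerically per evaluation function

def sigmoid(x):
    # sigma(x) = 1 / (1 + 10^(-K*x/400)), mapping a white-relative
    # centipawn score to a WDL probability.
    return 1.0 / (1.0 + 10.0 ** (-K * x / 400.0))

def loss(eval_cp, outcome):
    # outcome: 1.0 white win, 0.5 draw, 0.0 black win.
    # Squared error of a single training example: C = (sigma(Eval) - y)^2.
    return (sigmoid(eval_cp) - outcome) ** 2
```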
Now comes the idea from backpropagation in neural networks: Suppose we want to optimize the i'th middlegame weight, denoted by Wi[mg]. We want to know what the gradient of the loss function with respect to this specific weight is, ∂C/∂Wi[mg]. From the chain rule, we can now express this as:
∂C/∂Wi[mg] = ∂C/∂Eval * ∂Eval/∂Wi[mg]
Since we know that C = (a - y)^2 = (σ(Eval) - y)^2, we find that ∂C/∂Eval = 2 * (σ(Eval) - y) * σ'(Eval). For the other term, it can be shown that
∂Eval/∂Wi[mg] = (p[mg] / p[total]) * T[mg]
Thus, we can express the gradient of the loss function for a single training example, with respect to the i'th middlegame weight, as:
∂C/∂Wi[mg] = 2 * (σ(Eval) - y) * σ'(Eval) * (p[mg] / p[total]) * T[mg]
The derivation is analogous for any endgame weight (with p[eg] and T[eg] instead). One thing to note is that weights which do not enter the evaluation linearly have to be handled differently, but it should still be possible.
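The per-example gradient above translates directly into code. This is a sketch under assumed names (phase_mg, phase_total, term_mg are illustrative, not Loki's actual identifiers); the sigmoid is repeated here with an example K so the snippet is self-contained:

```python
import math

K = 1.13  # example scaling constant, fitted per evaluation function

def sigmoid(x):
    return 1.0 / (1.0 + 10.0 ** (-K * x / 400.0))

def sigmoid_prime(x):
    # d/dx of 1/(1+10^(-K*x/400)) = sigma(x) * (1 - sigma(x)) * K*ln(10)/400
    s = sigmoid(x)
    return s * (1.0 - s) * K * math.log(10.0) / 400.0

def grad_mg_weight(eval_cp, outcome, phase_mg, phase_total, term_mg):
    # dC/dW_i[mg] = 2*(sigma(Eval) - y) * sigma'(Eval) * (p[mg]/p[total]) * T_i[mg]
    return (2.0 * (sigmoid(eval_cp) - outcome)
            * sigmoid_prime(eval_cp)
            * (phase_mg / phase_total)
            * term_mg)
```

When the predicted WDL probability already matches the outcome, the residual (sigma(Eval) - y) is zero and so is the gradient, as expected.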
Now, gradient estimations are nothing new. The primary idea is to use the perturbations from SPSA to do the optimization while also searching the solution space for global minima:
- At the beginning of each iteration, we compute theta_plus and theta_minus exactly as we would for regular SPSA.
- Now we use the backpropagation-inspired algorithm to compute an average of the gradients over the entire training set (Note: This can also be done for batches if the dataset is very large) for both vectors separately.
- Apply these gradients to update the weights of theta_plus and theta_minus until they reach local minima.
- Measure the loss function of theta_plus and theta_minus after they have been optimized.
- Choose theta for the next iteration as whichever of the two has the lower error after backprop.
- Repeat this process until theta_plus and theta_minus converge.
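The steps above can be sketched as a single iteration of the hybrid scheme. The dataset format, the loss/gradient callbacks and the hyper-parameter values are all assumptions for illustration, not a finished tuner:

```python
import random

def average_gradient(theta, dataset, grad_fn):
    # Mean of the per-example gradients over the whole training set
    # (for very large datasets this could be done per batch instead).
    total = [0.0] * len(theta)
    for example in dataset:
        g = grad_fn(theta, example)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / len(dataset) for t in total]

def hybrid_iteration(theta, dataset, loss_fn, grad_fn,
                     c=1.0, lr=0.01, descent_steps=10):
    # Step 1: perturb exactly as in regular SPSA.
    delta = [random.choice((-1.0, 1.0)) for _ in theta]
    theta_plus = [t + c * d for t, d in zip(theta, delta)]
    theta_minus = [t - c * d for t, d in zip(theta, delta)]
    # Steps 2-3: drive each perturbed vector toward a local minimum
    # with gradient descent on the averaged gradients.
    for _ in range(descent_steps):
        gp = average_gradient(theta_plus, dataset, grad_fn)
        gm = average_gradient(theta_minus, dataset, grad_fn)
        theta_plus = [t - lr * g for t, g in zip(theta_plus, gp)]
        theta_minus = [t - lr * g for t, g in zip(theta_minus, gm)]
    # Steps 4-5: keep whichever optimized vector has the lower loss.
    if loss_fn(theta_plus, dataset) <= loss_fn(theta_minus, dataset):
        return theta_plus
    return theta_minus
```

The outer loop (step 6) would simply call hybrid_iteration repeatedly until the two vectors converge.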
- Because of the perturbations from SPSA, we get the desired effect of searching the solution space. This makes finding a global minimum possible - or at least more likely than with regular local optimization. This solution-space search reminds me a bit of genetic algorithms, but instead of a big, randomly initialized population, it is more controlled and better suited to good initial guesses.
- There are fewer hyper-parameters: only a learning rate (for optimizing theta_plus and theta_minus) and a perturbation size.
- The gradients differ not just in sign but also in magnitude, as compared to normal SPSA. This removes the earlier-mentioned problem of over-compensation for smaller/less dominant parameters.
- The algorithm's execution speed is nearly the same for different numbers of parameters. By nearly, I mean that the backpropagation takes some extra time for bigger parameter sets, but this is negligible compared to local optimization. Therefore its time complexity is ~O(2) (two measurements of the objective function).
- Due to the local optimization of theta_plus and theta_minus, we are less likely (compared to normal SPSA) to shoot past a global minimum, because theta_plus and theta_minus will always return to said minimum if it is strong enough.
Any feedback/questions will be much appreciated! I would also like to hear if anyone has done something similar in the past, and if so, how it turned out.