Tuning: A success story

Discussion of chess software programming and technical issues.

Moderator: Ras

Whiskers
Posts: 243
Joined: Tue Jan 31, 2023 4:34 pm
Full name: Adam Kulju

Tuning: A success story

Post by Whiskers »

I hit a bit of a wall with further improvements to my engine Willow and decided to try auto-tuning some material evals. I started out with PeSTO material and piece-square evals, and while they worked great as a quick fix early on, after adding evals for mobility, space, king safety, passed pawns, etc., I started to wonder how much Elo I could squeeze out of retuning the PSTs with regard to these other evaluation terms.

For a set of positions, I took both Zurichess quiet datasets and filtered out all the positions where the quiescence search score didn't match the static eval score. This left me with about 2.5 million positions, which should be more than enough.
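The filtering step itself is simple; roughly like this (a sketch - the two score arrays stand in for the engine's real static eval and quiescence search outputs, and the names are illustrative, not Willow's actual code):

```c
#include <stddef.h>

/* Keep a position only when the quiescence score agrees with the static
   eval, i.e. nothing tactical is hanging and the static eval is
   "trustworthy" as a label for the game result. */
size_t filter_quiet(const int static_cp[], const int qsearch_cp[],
                    size_t n, size_t kept[]) {
    size_t nkept = 0;
    for (size_t i = 0; i < n; i++)
        if (static_cp[i] == qsearch_cp[i])
            kept[nkept++] = i;   /* remember the index of each quiet position */
    return nkept;
}
```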

Once I had the positions and was able to compute the mean error between the predicted result and the actual game result, I first tried a simple method (described here http://www.talkchess.com/forum3/viewtop ... =7&t=76238) because I was afraid of gradient descent. It did not work well, for two reasons:
1. This may have been due to human error, but I found that if you started the tuning process with piece values such as {0,0,0,0,0}, it failed to converge anywhere near as well as when you started with reasonable values. This was probably due to only increasing and decreasing the values by 1 at a time, but I couldn't get more flexible parameter steps to work either.
2. I would turn into a skeleton before it finished optimizing the 788 parameters of my material + PSTs.

So I ended up having to confront gradient descent after all. It turned out to be... much less scary than I thought lol. This thread (http://www.talkchess.com/forum3/viewtopic.php?t=55265) was a massive help, and in code, calculating the gradient for each parameter is pretty much just:

Code: Select all

// After evaluating each position: for each of the N eval parameters,
// add how many times it was used. For example, if white has 5 pawns
// to black's 7, weights[WPAWNMATERIAL] = -2.
// "sigmoid" is the win probability the eval corresponds to: if the eval
// was too optimistic for white, the accumulated gradient shrinks by the
// weight; if too pessimistic, it grows - and the closer the prediction,
// the smaller the change.
for (int i = 0; i < N; i++) {
    GRADIENTEW[i] += (result - sigmoid) * weights[i];
}

// After all positions have been evaluated, average out the gradient,
// where a is the number of positions.
for (int i = 0; i < N; i++) {
    GRADIENTEW[i] = (-1.0 / a) * GRADIENTEW[i];
}

// Between iterations, step each parameter against the gradient.
// K can be determined with a line search, but I kept it simple and made
// it a constant that slowly decreases over time. It still works, it just
// takes longer.
for (int i = 0; i < N; i++) {
    params[i] -= K * GRADIENTEW[i];
}
After first getting it to work on a couple thousand positions at a time, and then verifying that it did indeed work, I threw it at the big test set. After 10,000 iterations, it was done.

Some interesting things to note about the new sets of values:
In general, it despises pawns. Willow has always been a fairly enterprising engine with a tendency to sac a pawn or two for positional compensation, but its middlegame pawn value is now all the way down at 70 centipawns, compared to 363 cps for knights and 368 cps for bishops. The pawn PSTs are also not changed too much, with one exception: far-advanced endgame pawns, where it shows a huge favoritism for flank pawns compared to center pawns.

The knight is worth pretty much the same as a bishop; in fact, the base value for a knight in the endgame is ever so slightly higher than that of a bishop. However, this is immediately explained by the fact that 7 pseudolegal moves for a bishop give +30 cps, while the highest mobility bonus for a knight doesn't even hit 20.

The king table remained fairly similar, but the parameter for a king on e1 went down from "small gain" to "40 cps worse than being castled". This should probably encourage Willow to castle faster.

PST-based bonuses for developing pieces and central control went down quite a bit. The improvement in PST score from a knight on g1 to a knight on f3 went from 36 to 19, for e4 the bonus plunged from 32 to 3, and bishops on their starting squares now even get a small bonus! (Although they still improve their score by moving, of course.) This is again to be expected, as mobility and space both get bonuses from playing moves like e4 and are effectively doing part of the old PSTs' job.

Not one parameter went off the rails due to overfitting, which I think confirms the quality of the dataset :D

As for the Elo difference?
440-game test: +188 -124 =128, about +51 Elo
:mrgreen:
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Tuning: A success story

Post by algerbrex »

Congratulations on getting your tuner working. I remember updating my tuner from the naive method found on the CPW to basic gradient descent. It's pretty cool seeing your tuner become blazing fast AND produce better values than the old one.

Just out of curiosity: I'm a little confused about the process you're using to compute your gradient. What's the cost function you're basing your gradient on?
Whiskers
Posts: 243
Joined: Tue Jan 31, 2023 4:34 pm
Full name: Adam Kulju

Re: Tuning: A success story

Post by Whiskers »

algerbrex wrote: Mon Feb 20, 2023 8:44 am Congratulations on getting your tuner working. I remember updating my tuner from the naive method found on the CPW to basic gradient descent. It's pretty cool seeing your tuner become blazing fast AND produce better values than the old one.

Just out of curiosity: I'm a little confused about the process you're using to compute your gradient. What's the cost function you're basing your gradient on?
Just a least-squares cost, (result - sigmoid)^2 for each position. I thought I had updated it to a more advanced one when switching to gradient descent, but it turns out I forgot, and by the time I realized it I was already a few thousand iterations in and didn't want to start over.
clayt
Posts: 29
Joined: Thu Jun 09, 2022 5:09 am
Full name: Clayton Ramsey

Re: Tuning: A success story

Post by clayt »

If a thousand tuning iterations is taking that long, that may mean that there could be room for improvement in your tuner.

My implementation runs an epoch of 725000 training samples in about 20ms (roughly 35M samples/sec). The main speedup was from using a sparse representation of the feature vector for each training sample, since most features on a board are zero.
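The idea, roughly (illustrative names, not my actual code):

```c
#include <stddef.h>

/* Sparse representation: store only the nonzero features of a training
   position as (index, count) pairs instead of a full N-wide vector. */
typedef struct { int index; double count; } Feature;

typedef struct {
    const Feature *features;  /* only the nonzero entries */
    size_t nfeatures;
    double result;            /* game result: 1, 0.5 or 0 */
} Sample;

/* Dot product of the parameters with a sparse feature vector: the cost is
   proportional to the handful of nonzero features on the board, not to
   the total parameter count. */
double sparse_eval(const double params[], const Sample *s) {
    double e = 0.0;
    for (size_t i = 0; i < s->nfeatures; i++)
        e += params[s->features[i].index] * s->features[i].count;
    return e;
}
```

The gradient accumulation loops over the same short feature list, which is where most of the speedup comes from.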
Whiskers
Posts: 243
Joined: Tue Jan 31, 2023 4:34 pm
Full name: Adam Kulju

Re: Tuning: A success story

Post by Whiskers »

clayt wrote: Tue Feb 21, 2023 9:41 pm If a thousand tuning iterations is taking that long, that may mean that there could be room for improvement in your tuner.

My implementation runs an epoch of 725000 training samples in about 20ms (roughly 35M samples/sec). The main speedup was from using a sparse representation of the feature vector for each training sample, since most features on a board are zero.
It was 10,000 iterations, but I definitely can't do 35M samples per second like you do lol. The evaluation function is the same one as in the original program, but all the parameters are lumped together in an array, so it still does the evaluation as it normally would. This has the advantage of being able to account for king safety, space, and mobility without my ever having to edit those parameters by hand (because I am not ready to tune king safety automatically just yet).

My program is in general a bit of a slow evaluator/searcher: it uses an 8x8 array of chars for its board, it gathers mobility information pretty inefficiently, etc. And my laptop is pretty slow as well - even SF with 4 cores only hits about 1.2 million NPS.
jdart
Posts: 4398
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Tuning: A success story

Post by jdart »

I used Adam, in which each parameter gets its own learning rate: see https://optimization.cbe.cornell.edu/in ... title=Adam.

Gradient-tuned evals are good, but NNUE evals are better by a lot.
Sazgr
Posts: 66
Joined: Thu Dec 09, 2021 8:26 pm
Full name: Kyle Zhang

Re: Tuning: A success story

Post by Sazgr »

clayt wrote: Tue Feb 21, 2023 9:41 pm If a thousand tuning iterations is taking that long, that may mean that there could be room for improvement in your tuner.

My implementation runs an epoch of 725000 training samples in about 20ms (roughly 35M samples/sec). The main speedup was from using a sparse representation of the feature vector for each training sample, since most features on a board are zero.
Upon reading this, I revisited my tuner implementation, which was completing an iteration with a 2.5 million position test set in 2 seconds. By making the code more efficient, I managed to get each iteration of gradient descent to run in 80-100 ms (around 25 million positions every second). :o I was just wondering if your tuning is multi-threaded, or single threaded, to see if I have more room for improvement, as the 25m positions/second is with 6 threads.
JVMerlino
Posts: 1396
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Re: Tuning: A success story

Post by JVMerlino »

Sazgr wrote: Thu Feb 23, 2023 9:19 pm
clayt wrote: Tue Feb 21, 2023 9:41 pm If a thousand tuning iterations is taking that long, that may mean that there could be room for improvement in your tuner.

My implementation runs an epoch of 725000 training samples in about 20ms (roughly 35M samples/sec). The main speedup was from using a sparse representation of the feature vector for each training sample, since most features on a board are zero.
Upon reading this, I revisited my tuner implementation, which was completing an iteration with a 2.5 million position test set in 2 seconds. By making the code more efficient, I managed to get each iteration of gradient descent to run in 80-100 ms (around 25 million positions every second). :o I was just wondering if your tuning is multi-threaded, or single threaded, to see if I have more room for improvement, as the 25m positions/second is with 6 threads.
Sigh. My Texel implementation is single-threaded and needs 4 seconds to do 2.8m positions. Still don't understand this gradient stuff, but I'm still bashing my head against it in the hopes of a breakthrough. :)
clayt
Posts: 29
Joined: Thu Jun 09, 2022 5:09 am
Full name: Clayton Ramsey

Re: Tuning: A success story

Post by clayt »

Sazgr wrote: Thu Feb 23, 2023 9:19 pm
clayt wrote: Tue Feb 21, 2023 9:41 pm If a thousand tuning iterations is taking that long, that may mean that there could be room for improvement in your tuner.

My implementation runs an epoch of 725000 training samples in about 20ms (roughly 35M samples/sec). The main speedup was from using a sparse representation of the feature vector for each training sample, since most features on a board are zero.
Upon reading this, I revisited my tuner implementation, which was completing an iteration with a 2.5 million position test set in 2 seconds. By making the code more efficient, I managed to get each iteration of gradient descent to run in 80-100 ms (around 25 million positions every second). :o I was just wondering if your tuning is multi-threaded, or single threaded, to see if I have more room for improvement, as the 25m positions/second is with 6 threads.
Mine is multi-threaded, so yours sounds perfect to me.
Whiskers
Posts: 243
Joined: Tue Jan 31, 2023 4:34 pm
Full name: Adam Kulju

Re: Tuning: A success story

Post by Whiskers »

I tuned all my other evaluation parameters as well, not just the PSTs, and it gave another +80 Elo in self-play. It's amazing how much of an improvement it is to have all the parameters working in sync rather than stacked up one on top of another!

Willow played this cute game today:
[pgn]1. e4 c5 2. Nf3 d6 3. Nc3 Nc6 4. d4 cxd4 5. Nxd4 Nf6 6. Bb5 Bd7 7. Bg5 a6 8.
Bxc6 Bxc6 9. Bxf6 gxf6 10. O-O Rg8 11. Qh5 h6 12. Ne6 Qc8 13. Nd5 Bxd5 14. exd5
Rg6 15. c3 Qc4 16. Rfe1 Rc8 17. h3 Qb5 18. Qf5 Rc4 19. Rab1 Qd7 20. g4 h5 21.
Qxh5 Rh6 22. Nxf8 Rxh5 23. Nxd7 Rxh3 24. Nxf6+ Kd8 25. Ne4 Rh6 26. f3 Rg6 27.
Kf2 Rc8 28. Rh1 Kd7 29. Rh7 Rf8 30. Ng3 Rg5 31. Nf5 e6 32. dxe6+ Kxe6 33. Re1+
Kd5 34. f4 1-0[/pgn]