Some more HCE "Texel Tuning" Data

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

AndrewGrant
Posts: 1756
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Some more HCE "Texel Tuning" Data

Post by AndrewGrant »

I made a thread on this a long while back, but I'll link all the sets I've generated so far. The older ones are tried and true, and are used by Ethereal and at least a dozen other engines for tweaking their HCE. The tuning scheme I laid out in my little paper is only concerned with the game label, so to me there is not much issue with using the data despite it being "generated" by another engine.

E12.33-STD: ~10.0M positions of D12 resolved PVs, 1s+.01s :
E12.41-STD: ~10.0M positions of D12 resolved PVs, 2s+.02s :
E12.52-STD: ~10.0M positions of D12 resolved PVs, 4s+.04s :
E12.46-FRC: ~12.5M positions of D12 resolved PVs, 1s+.01s :

And the newest set, generated just to see how an NNUE Ethereal would look. It was made using the same paradigm as the previous sets, but with E13.04, which runs my current best NNUE network.

E13.04-STD: ~10.0M positions of D12 resolved PVs, 1s+.01s :

Some (all?) of the datasets contain evals as well. The first four sets, I believe, have the static eval of the position (useless to you), whereas the final set has a D9 eval of the position (useful if you want to try pruning the dataset, e.g. excluding all positions with |eval| > N). A trivial little example of how that looks:
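(A sketch only; it assumes each line ends with the eval in centipawns as its last token, which may not match the exact layout of these files.)

Code: Select all

/* Prune a dataset by |eval| > N, reading from stdin and writing to stdout.
   Assumes each line looks like: <FEN> [<result>] <eval in cp>
   -- the real field layout of the files may differ. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {

    const int limit = argc > 1 ? atoi(argv[1]) : 1000; /* drop |eval| > limit */
    char line[512], copy[512];

    while (fgets(line, sizeof(line), stdin)) {

        /* the eval is assumed to be the last whitespace-separated token */
        strcpy(copy, line);
        char *token = strtok(copy, " \t\r\n"), *last = token;
        while ((token = strtok(NULL, " \t\r\n")) != NULL)
            last = token;

        if (last != NULL && abs(atoi(last)) <= limit)
            fputs(line, stdout);
    }

    return 0;
}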
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Some more HCE "Texel Tuning" Data

Post by Edsel Apostol »

Thanks Andrew. This is just what I'm looking for. I'm on my final push to improve my HCE before I hop on the NNUE train as well.
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Some more HCE "Texel Tuning" Data

Post by Edsel Apostol »

So I have this naive implementation of Texel tuning and it seems to work (just a local search). I tried implementing the Adam optimizer, but the default learning rate is very small and it takes forever to tune values, something like 1 centipawn per 100 epochs. So I increased the learning rate from 0.001 to 0.3 and that seems to work, but I'm not comfortable with my implementation being so far from the suggested default hyperparameters. I was also too lazy with my gradients: I was just changing single parameters and recomputing the MSE, using (p1 - p), (p2 - p)/2, or (p1 - m1)/2 as the gradients.

Now I've tried what you did by collecting the eval features (coefficients) on the first run and just reusing them, while at the same time getting the gradients from the derivative as described in the paper. It was indeed faster evaluation-wise, but I still have the same issue with the gradients and the learning rate: it takes forever. Maybe I am missing something. Say you set the queen values to (mid: 500, end: 500), how many epochs does it take your tuner to get them back to around (mid: ~1000, end: ~1600)?
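For reference, my understanding of the derivative-based gradient for a single linear eval term is roughly the following (just a sketch with made-up names, not my actual tuner code):

Code: Select all

/* Sketch: gradient of the sigmoid MSE for a linear eval,
   after the features (coefficients) were collected once up front. */
#include <math.h>

typedef struct {
    double result;   /* game label: 0.0, 0.5 or 1.0 */
    double *coeffs;  /* coeffs[j] = how often feature j fires in this position */
} Entry;

static double sigmoid(double K, double s) {
    return 1.0 / (1.0 + exp(-K * s));
}

/* Loss = 1/N * sum_i (result_i - sigmoid(K * eval_i))^2,
   with eval_i = sum_j coeffs[j] * params[j] */
void compute_gradient(const Entry *data, int N, const double *params,
                      int nparams, double K, double *gradient) {

    for (int j = 0; j < nparams; j++)
        gradient[j] = 0.0;

    for (int i = 0; i < N; i++) {

        double eval = 0.0;
        for (int j = 0; j < nparams; j++)
            eval += data[i].coeffs[j] * params[j];

        double s     = sigmoid(K, eval);
        double error = data[i].result - s;

        /* d/dparams[j] of (result - s)^2 = -2 * error * K * s * (1 - s) * coeffs[j] */
        for (int j = 0; j < nparams; j++)
            gradient[j] -= 2.0 * error * K * s * (1.0 - s) * data[i].coeffs[j] / N;
    }
}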
AndrewGrant
Posts: 1756
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Some more HCE "Texel Tuning" Data

Post by AndrewGrant »

Edsel Apostol wrote: Sun Jun 27, 2021 7:41 am So I have this naive implementation of Texel tuning and it seems to work (just a local search). I tried implementing the Adam optimizer, but the default learning rate is very small and it takes forever to tune values, something like 1 centipawn per 100 epochs. So I increased the learning rate from 0.001 to 0.3 and that seems to work, but I'm not comfortable with my implementation being so far from the suggested default hyperparameters. I was also too lazy with my gradients: I was just changing single parameters and recomputing the MSE, using (p1 - p), (p2 - p)/2, or (p1 - m1)/2 as the gradients.

Now I've tried what you did by collecting the eval features (coefficients) on the first run and just reusing them, while at the same time getting the gradients from the derivative as described in the paper. It was indeed faster evaluation-wise, but I still have the same issue with the gradients and the learning rate: it takes forever. Maybe I am missing something. Say you set the queen values to (mid: 500, end: 500), how many epochs does it take your tuner to get them back to around (mid: ~1000, end: ~1600)?
So my paper was actually using AdaGrad, not Adam. That said, Weiss has, at my suggestion, used Adam together with everything else I wrote about, and that has gained Elo in tuning. I tend to use an LR of about 0.1 when using AdaGrad. However, here is code for Adam, which I've tried a number of times (but tuning Ethereal now is hard, especially with NNUE being the main evaluator): https://github.com/AndyGrant/EtherealDe ... 9396510ee5

I'll also note that the queen might be a hard thing to tune to a great degree by itself. How many positions have a queen imbalance? Not many, I would imagine. I just ran the AdaGrad version, and it's about +1 cp per epoch, starting from the queen value cut in half (LR=0.1, batch size=16384). So it takes a few hundred epochs to ramp it all the way back up.

Using the Adam code I shared, with the same batch size and its default hyperparameters (LR=0.001, batch size=16384, BETA_1=0.9, BETA_2=0.999), I get about +3 cp per epoch. So it's only about a hundred epochs or so until reaching the original values.

I'll note that Adam is my go-to now. When I wrote my NNUE training toolset, I was initially using AdaGrad. It was so slow to converge (or to make any meaningful progress from randomness) that I actually deleted the entire codebase, assuming I had made an error. A fresh rewrite, with no copy-pasted code and new maths, had the same result. Then I tried Adam, and boom, the errors zoomed towards the expected values. Adam seemed miles above AdaGrad for HalfKP problems.

TL;DR: Try Adam instead of AdaGrad. If you don't see fairly fast convergence, then something is off. I'm able to do about an epoch per second on ~55M positions here, so returning to the original values should not take long.
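For comparison, the two update rules boil down to roughly this per tuned value (a sketch, not the exact Ethereal code):

Code: Select all

/* Sketch of the per-parameter AdaGrad and Adam updates,
   using the hyperparameters mentioned above. */
#include <math.h>

#define LR_ADAGRAD 0.1
#define LR_ADAM    0.001
#define BETA_1     0.9
#define BETA_2     0.999
#define EPSILON    1e-8

/* AdaGrad: the effective LR shrinks as squared gradients accumulate */
void adagrad_update(double *param, double grad, double *grad_sq_sum) {
    *grad_sq_sum += grad * grad;
    *param -= LR_ADAGRAD * grad / (sqrt(*grad_sq_sum) + EPSILON);
}

/* Adam: exponential moving averages of the gradient and its square,
   with bias correction. t is the 1-based update count. */
void adam_update(double *param, double grad, double *m, double *v, int t) {
    *m = BETA_1 * (*m) + (1.0 - BETA_1) * grad;
    *v = BETA_2 * (*v) + (1.0 - BETA_2) * grad * grad;
    double m_hat = *m / (1.0 - pow(BETA_1, t));
    double v_hat = *v / (1.0 - pow(BETA_2, t));
    *param -= LR_ADAM * m_hat / (sqrt(v_hat) + EPSILON);
}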
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Some more HCE "Texel Tuning" Data

Post by Desperado »

... I'm able to do about an epoch per second on ~55M positions here ...
How many cores do you use to get this performance?
AndrewGrant
Posts: 1756
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Some more HCE "Texel Tuning" Data

Post by AndrewGrant »

Desperado wrote: Sun Jun 27, 2021 11:51 am
... I'm able to do about an epoch per second on ~55M positions here ...
How many cores do you use to get this performance?
Ryzen 3700x. So 8 cores, 16 threads.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Some more HCE "Texel Tuning" Data

Post by lithander »

Even though my self-tuned PSTs perform pretty well (it's what saves my otherwise slow engine from being utterly rubbish), reading your paper and having such a large, high-quality dataset of training positions to go with it makes me (almost) want to start the whole endeavor from scratch. Maybe at some point I will...

Thanks for sharing!

*bookmarked*
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Some more HCE "Texel Tuning" Data

Post by Edsel Apostol »

AndrewGrant wrote: Sun Jun 27, 2021 8:01 am
Edsel Apostol wrote: Sun Jun 27, 2021 7:41 am So I have this naive implementation of Texel tuning and it seems to work (just a local search). I tried implementing the Adam optimizer, but the default learning rate is very small and it takes forever to tune values, something like 1 centipawn per 100 epochs. So I increased the learning rate from 0.001 to 0.3 and that seems to work, but I'm not comfortable with my implementation being so far from the suggested default hyperparameters. I was also too lazy with my gradients: I was just changing single parameters and recomputing the MSE, using (p1 - p), (p2 - p)/2, or (p1 - m1)/2 as the gradients.

Now I've tried what you did by collecting the eval features (coefficients) on the first run and just reusing them, while at the same time getting the gradients from the derivative as described in the paper. It was indeed faster evaluation-wise, but I still have the same issue with the gradients and the learning rate: it takes forever. Maybe I am missing something. Say you set the queen values to (mid: 500, end: 500), how many epochs does it take your tuner to get them back to around (mid: ~1000, end: ~1600)?
So my paper was actually using AdaGrad, not Adam. That said, Weiss has, at my suggestion, used Adam together with everything else I wrote about, and that has gained Elo in tuning. I tend to use an LR of about 0.1 when using AdaGrad. However, here is code for Adam, which I've tried a number of times (but tuning Ethereal now is hard, especially with NNUE being the main evaluator): https://github.com/AndyGrant/EtherealDe ... 9396510ee5

I'll also note that the queen might be a hard thing to tune to a great degree by itself. How many positions have a queen imbalance? Not many, I would imagine. I just ran the AdaGrad version, and it's about +1 cp per epoch, starting from the queen value cut in half (LR=0.1, batch size=16384). So it takes a few hundred epochs to ramp it all the way back up.

Using the Adam code I shared, with the same batch size and its default hyperparameters (LR=0.001, batch size=16384, BETA_1=0.9, BETA_2=0.999), I get about +3 cp per epoch. So it's only about a hundred epochs or so until reaching the original values.

I'll note that Adam is my go-to now. When I wrote my NNUE training toolset, I was initially using AdaGrad. It was so slow to converge (or to make any meaningful progress from randomness) that I actually deleted the entire codebase, assuming I had made an error. A fresh rewrite, with no copy-pasted code and new maths, had the same result. Then I tried Adam, and boom, the errors zoomed towards the expected values. Adam seemed miles above AdaGrad for HalfKP problems.

TL;DR: Try Adam instead of AdaGrad. If you don't see fairly fast convergence, then something is off. I'm able to do about an epoch per second on ~55M positions here, so returning to the original values should not take long.
I have had Adam implemented since 2019, but I was mostly using local search, as I found that in my setup it works better when tuning just a few parameters at a time (e.g. just king safety, material, or passed-pawn terms).
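The local search itself is just the usual +1/-1 probing, roughly like this (a sketch; compute_mse is assumed to evaluate the error over the whole dataset):

Code: Select all

/* Sketch of the naive local search: nudge each parameter by +1/-1
   and keep the change whenever the MSE over the dataset improves. */
extern double compute_mse(const int *params);  /* assumed to exist elsewhere */

void local_search(int *params, int nparams) {
    int improved = 1;
    while (improved) {
        improved = 0;
        for (int j = 0; j < nparams; j++) {
            double best = compute_mse(params);
            params[j] += 1;                       /* try +1 */
            if (compute_mse(params) < best) { improved = 1; continue; }
            params[j] -= 2;                       /* try -1 */
            if (compute_mse(params) < best) { improved = 1; continue; }
            params[j] += 1;                       /* no gain, restore */
        }
    }
}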

I managed to make your way of tuning work. My initial issue was that I was fixing K to my previous value, but I have since adopted the exp form of the sigmoid, which ends up with a K around 2 to 3 times higher.
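For what it's worth, that factor looks consistent with switching from the base-10 sigmoid to the exp one, since an equivalent K has to grow by ln(10) ≈ 2.3 (a sketch of the two forms; the /400 scaling is just one common convention):

Code: Select all

/* The two common sigmoid forms used in Texel tuning. The exp form
   needs K scaled up by ln(10) ~= 2.3 to match the base-10 form,
   which fits the 2x-3x difference mentioned above. */
#include <math.h>

double sigmoid_pow10(double K, double s) {   /* classic 10^x form */
    return 1.0 / (1.0 + pow(10.0, -K * s / 400.0));
}

double sigmoid_exp(double K, double s) {     /* exp form, expects K' = K * ln(10) */
    return 1.0 / (1.0 + exp(-K * s / 400.0));
}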

What I like about this method is that it is fast, at the expense of memory (for the collected features/coefficients). What I don't like about it is the additional code in eval.

I have also implemented the Demon (decaying momentum) rule in Adam. It basically just decays the beta1 parameter over the course of training. It might be more useful in NNUE training.
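In case it is useful to anyone, the beta1 decay follows the schedule from the Demon paper, roughly (a sketch; the initial beta and iteration counts are just examples):

Code: Select all

/* Sketch of the Demon (decaying momentum) schedule applied to beta1.
   t is the current iteration, T the total planned iterations. */
#define BETA_INIT 0.9

double demon_beta1(int t, int T) {
    double remaining = 1.0 - (double) t / T;
    return BETA_INIT * remaining / ((1.0 - BETA_INIT) + BETA_INIT * remaining);
}
/* The returned value then replaces the fixed beta1 in the Adam momentum term. */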
op12no2
Posts: 490
Joined: Tue Feb 04, 2014 12:25 pm
Full name: Colin Jenkins

Re: Some more HCE "Texel Tuning" Data

Post by op12no2 »

Cheers Andrew, I am playing around with small nets in JavaScript using your 13.04 data.

https://github.com/op12no2/lozza
AndrewGrant
Posts: 1756
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Some more HCE "Texel Tuning" Data

Post by AndrewGrant »

op12no2 wrote: Fri Jul 09, 2021 12:52 pm Cheers Andrew, I am playing around with small nets in JavaScript using your 13.04 data.

https://github.com/op12no2/lozza
I'll note that this data proved terrible for NNUE, since it was designed specifically for HCE training. Anything is better than nothing, I suppose, but one can easily generate much better data, with more diversity, on demand.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )