Experiments in generating Texel Tuning data

Discussion of chess software programming and technical issues.

Moderators: hgm, chrisw, Rebel

Tearth
Posts: 69
Joined: Thu Feb 25, 2021 5:12 pm
Location: Poland
Full name: Pawel Osikowski

Re: Experiments in generating Texel Tuning data

Post by Tearth »

Yesterday I finished my little tool in Inanis for parsing PGN files and picking quiet positions (plus some other criteria, mostly covered by the ones in the first post), so I could finally start tests. As the first PGN source I chose CCRL games, since my assumption was that they should have good quality and variety - about 100,000 games, which is not that much but should give some decent results. Sadly it did not work at all: the engine suffered an embarrassing three-digit Elo loss, and some values came out very strange, like the end-game pawn PST not rewarding pawns on the last ranks at all. The only explanation I could find is that these games are frequently adjudicated well before promotions happen, which distorts the end result.

As a second attempt (in progress), I've decided to run 40/1+0.02 self-play, gather 100,000 games and try again. I'm very skeptical though, since the quality of these games is going to be rather poor to say the least, but I guess it's worth checking on my own.
akanalytics
Posts: 16
Joined: Tue May 04, 2021 12:43 pm
Location: London
Full name: Andy Watkins

Re: Experiments in generating Texel Tuning data

Post by akanalytics »

I had success tuning Odonata using my own games rather than Zurichess.
A couple of items:
- saw a gain of +30 Elo over a Zurichess-tuned Odonata
- short time controls didn't hurt, except as detailed below
- the time control needed to be sufficient to allow a deep enough search to resolve simple endgames to a win rather than a draw. For instance, an average search depth of just 5 does not allow Odonata to checkmate KBB vs k (or it didn't when I collected my PGNs). I used a generous move-time increment.
- adjudication off
- ignored positions within 10 or so ply of the opening (unsure why this was needed tbh, but it helped)
- taking several positions from each game seemed to work fine, though I randomly discarded 3/4 or more of the positions, fearing correlation between them (no proof that this helped) - see the sketch after this list
- used more middle-game positions than end-game ones (again, no idea why this helped)
- used about 1.4M FENs overall for tuning
- I use L1 regularisation, and very clearly some evaluation features were not getting exposure in the dataset (tripled pawns, fianchetto), so currently these remain untuned at zero weight. At some point I'll address this.
- I used games with Odonata playing other engines rather than self-play. This might not be to everyone's taste if you are striving for some purity.
- I used engines of comparable strength (again, no evidence that this helped)
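
A rough Rust sketch of the kind of sampling filter described in the list above (not Odonata's actual code). The Sample struct, the assumed phase convention and every threshold are illustrative; the caller is assumed to supply a uniform random number in [0, 1).

Code:

// Illustrative sketch only - not Odonata's implementation.
struct Sample {
    ply: u32,   // ply at which the position was recorded
    phase: u32, // assumed: 24 = full middlegame material, 0 = bare endgame
}

fn keep_for_tuning(s: &Sample, rnd: f32) -> bool {
    // Ignore positions within ~10 ply of the opening.
    if s.ply < 10 {
        return false;
    }
    // Randomly discard roughly 3/4 of positions to reduce correlation
    // between positions taken from the same game.
    if rnd > 0.25 {
        return false;
    }
    // Mild bias towards middle-game positions: end-game positions
    // (low phase) are kept about half as often.
    if s.phase < 8 && rnd > 0.125 {
        return false;
    }
    true
}

Every position that survives this check is written out as a FEN together with the game result, which is what the tuner consumes.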

Andy
Author of Odonata.
algerbrex
Posts: 600
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

akanalytics wrote: Fri Jul 08, 2022 7:02 pm I had success tuning Odonata using my own games rather than Zurichess.
A couple of items:
- saw a gain of +30 Elo over a Zurichess-tuned Odonata
- short time controls didn't hurt, except as detailed below
- the time control needed to be sufficient to allow a deep enough search to resolve simple endgames to a win rather than a draw. For instance, an average search depth of just 5 does not allow Odonata to checkmate KBB vs k (or it didn't when I collected my PGNs). I used a generous move-time increment.
- adjudication off
- ignored positions within 10 or so ply of the opening (unsure why this was needed tbh, but it helped)
- taking several positions from each game seemed to work fine, though I randomly discarded 3/4 or more of the positions, fearing correlation between them (no proof that this helped)
- used more middle-game positions than end-game ones (again, no idea why this helped)
- used about 1.4M FENs overall for tuning
- I use L1 regularisation, and very clearly some evaluation features were not getting exposure in the dataset (tripled pawns, fianchetto), so currently these remain untuned at zero weight. At some point I'll address this.
- I used games with Odonata playing other engines rather than self-play. This might not be to everyone's taste if you are striving for some purity.
- I used engines of comparable strength (again, no evidence that this helped)
Andy
All of that makes sense to me. I think at the end of the day, what's important is getting a good variety of positions with positional or material themes that correlate with your evaluation weights. This is why I think some datasets, like Andrew Grant's, have always hurt Blunder: the themes in the dataset just don't correlate very well with the way Blunder's evaluation and weights are structured.
Tearth
Posts: 69
Joined: Thu Feb 25, 2021 5:12 pm
Location: Poland
Full name: Pawel Osikowski

Re: Experiments in generating Texel Tuning data

Post by Tearth »

Tearth wrote: Tue Jul 05, 2022 5:25 pm As a second attempt (in progress), I've decided to run 40/1+0.02 self-play, gather 100,000 games and try again. I'm very skeptical though, since the quality of these games is going to be rather poor to say the least, but I guess it's worth checking on my own.
It turned out to be actually not bad: after 100,000 games I selected 10 random positions from every game, subject to the following restrictions: no duplicates, not in check, the difference between the quiescence score and the static evaluation less than 50, evaluation less than 250, and the first 8 and last 6 moves excluded. With the tuned parameters the engine was playing only 10 Elo worse than before, so considering that it was an entirely new set, I see it as some success. Now I've tried to extend the set provided by @algerbrex by adding 4 positions from every game (so about 400,000 positions), and I see a few more Elo, but it's not certain yet since only 6,000 games have been played.
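
For anyone curious how that selection might look in code, here is a rough Rust sketch of the per-game filtering (not Inanis's actual implementation). The Candidate struct, the helper names and the reading of "evaluation less than 250" as an absolute centipawn bound are my assumptions; the candidates are assumed to be passed in random order.

Code:

// Rough sketch only - not Inanis's code. Values in Candidate are assumed to
// be precomputed by the engine for every position of the game.
use std::collections::HashSet;

struct Candidate {
    fen: String,
    move_number: u32, // full-move number within the game
    in_check: bool,
    static_eval: i32, // static evaluation in centipawns
    qs_score: i32,    // quiescence search score from the same position
}

fn select_from_game(
    candidates: &[Candidate],   // assumed to be shuffled by the caller
    game_length: u32,           // total number of moves in the game
    seen: &mut HashSet<String>, // global set of FENs, for de-duplication
    max_per_game: usize,        // 10 here, 4 for the extension experiment
) -> Vec<String> {
    let mut picked = Vec::new();
    for c in candidates {
        if picked.len() >= max_per_game {
            break;
        }
        let too_early = c.move_number <= 8;             // skip the first 8 moves
        let too_late = c.move_number + 6 > game_length; // skip the last 6 moves
        let not_quiet = (c.qs_score - c.static_eval).abs() >= 50;
        let lopsided = c.static_eval.abs() >= 250;      // assumed |eval| bound
        if too_early || too_late || c.in_check || not_quiet || lopsided {
            continue;
        }
        if seen.insert(c.fen.clone()) {
            // insert() returns true only if the FEN was not already in the set
            picked.push(c.fen.clone());
        }
    }
    picked
}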
akanalytics
Posts: 16
Joined: Tue May 04, 2021 12:43 pm
Location: London
Full name: Andy Watkins

Re: Experiments in generating Texel Tuning data

Post by akanalytics »

Sounds promising.

What's the logic behind "difference between quiescence score and evaluation less than 50"?

I'm interpreting "evaluation" here as static eval, not a search eval. I'm also assuming a QS search is applied to positions before using them.
Author of Odonata.
Tearth
Posts: 69
Joined: Thu Feb 25, 2021 5:12 pm
Location: Poland
Full name: Pawel Osikowski

Re: Experiments in generating Texel Tuning data

Post by Tearth »

Yes, I basically ignored all positions where the score returned by the quiescence search differed from the static evaluation by 50 or more, in order to pick only quiet ones. These margins are subject to tuning, but I think I will leave that for the future, since I've got enough Elo to finally polish the code and make a new release.

Also, after 16,000 games I've got a solid 5 Elo advantage over the last try (although I can't say for sure that it's because of these new positions and not because the tuner was luckier with the initial random values), so if anyone is interested, here is the extended-extended-Zurichess dataset.

https://mega.nz/file/TstgiBbR#Q6OIISJFQ ... ypnAIWdOWY
algerbrex
Posts: 600
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

Tearth wrote: Sat Jul 09, 2022 12:09 pm Yes, I basically ignored all positions where the score returned by the quiescence search differed from the static evaluation by 50 or more, in order to pick only quiet ones. These margins are subject to tuning, but I think I will leave that for the future, since I've got enough Elo to finally polish the code and make a new release.

Also, after 16,000 games I've got a solid 5 Elo advantage over the last try (although I can't say for sure that it's because of these new positions and not because the tuner was luckier with the initial random values), so if anyone is interested, here is the extended-extended-Zurichess dataset.

https://mega.nz/file/TstgiBbR#Q6OIISJFQ ... ypnAIWdOWY
Thanks. Glad to see you ended up making some progress.
Patrice Duhamel
Posts: 194
Joined: Sat May 25, 2013 11:17 am
Location: France
Full name: Patrice Duhamel

Re: Experiments in generating Texel Tuning data

Post by Patrice Duhamel »

I need to test more, but for Cheese I currently have good results mixing 50% positions from games at 1s+0.01s and 50% positions from games at 10s+0.1s (1.8 million positions in total).

It seems better for me to tune all parameters at once, using a middlegame and an endgame value for all parameters.
With one exception: I don't tune king safety with Texel's tuning method, but I will try.
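
For readers unfamiliar with the "middlegame and endgame value for every parameter" idea, here is a minimal Rust sketch of a tapered parameter; the 0-24 phase scale and the interpolation shown are a common convention, not necessarily what Cheese does.

Code:

// Minimal sketch of a tapered evaluation parameter, assuming a 0..=24 phase
// scale (a common convention; not necessarily Cheese's).
struct Param {
    mg: i32, // value used with full middlegame material on the board
    eg: i32, // value used in a bare endgame
}

const MAX_PHASE: i32 = 24;

// Interpolate one parameter for the current phase. A Texel tuner adjusts
// mg and eg independently, and every training position contributes to both
// (weighted by its phase), which is why tuning them together works well.
fn taper(p: &Param, phase: i32) -> i32 {
    (p.mg * phase + p.eg * (MAX_PHASE - phase)) / MAX_PHASE
}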
Anything that can go wrong will go wrong.
tmokonen
Posts: 1332
Joined: Sun Mar 12, 2006 6:46 pm
Location: Kelowna
Full name: Tony Mokonen

Re: Experiments in generating Texel Tuning data

Post by tmokonen »

hgm wrote: Sun Jul 03, 2022 1:25 pm
The only explanation I can think of is that a PST-based evaluation utterly sucks, and that projecting the eval terms that would really correlate well with the game result onto PST does more harm than good. The tuned eval gives a very high bonus for pieces attacking the Palace, which is understandable, because you would need to attack the Palace in order to checkmate the King that is confined in it. So most won games will have many pieces attacking the Palace. The problem is that a single attacker can never achieve a checkmate, and if there are some defenders in the Palace (as there usually are), you might need 3 or 4 attackers before it gets dangerous. And with the 'tuned' PST the engine gets so excited when it can get 2 attackers on optimal squares that it is willing to sacrifice material in order to achieve that.

So it seems I would need a completely different type of evaluation, before tuning can have any effect.
I am wondering if you could mitigate this over-eager attacking behavior by adding a low-effort, array-based evaluation term indexed by (attackers - defenders).
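
Something along these lines, for instance: a tiny lookup table where a lone attacker is worth nothing and the bonus only ramps up once attackers clearly outnumber defenders. The table values and clamping range below are made-up placeholders that the tuner would fill in.

Code:

// Hypothetical sketch of an attackers-minus-defenders term; all numbers are
// placeholders meant to be tuned, not values from any real engine.
const ATTACK_BALANCE_BONUS: [i32; 9] = [0, 0, 0, 0, 0, 10, 40, 90, 160];
// index corresponds to balance = -4, -3, -2, -1, 0, +1, +2, +3, +4

fn attack_balance_term(attackers: i32, defenders: i32) -> i32 {
    let balance = (attackers - defenders).clamp(-4, 4);
    ATTACK_BALANCE_BONUS[(balance + 4) as usize]
}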
algerbrex
Posts: 600
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

Patrice Duhamel wrote: Sat Jul 09, 2022 5:59 pm I need to test more, but for Cheese I currently have good results mixing 50% positions from games at 1s+0.01s and 50% positions from games at 10s+0.1s (1.8 million positions in total).

It seems better for me to tune all parameters at once, using a middlegame and an endgame value for all parameters.
With one exception: I don't tune king safety with Texel's tuning method, but I will try.
My findings mostly concur with yours, Patrice.

I haven't tried mixing super-fast and fast time control games yet, but I'll put that down.

Funny enough, I haven't quite been able to use middlegame and endgame values for all parameters. Since I'm not the best chess player, my tuner has been a bit like my oracle. So when I try to tune, say, a middlegame value for doubled pawns and it tunes the value to 0, I just remove the term, since the tuner is basically telling me it isn't as important as I think it is.

And with king safety, ironically, Texel tuning has been the only way I've gotten good results from it. When I tried to tune anything manually, it was always an Elo loss, or barely an Elo gain. With my most recent king safety scheme, after running the king safety values through the tuner, I managed to get about 46 Elo in self-play.

What does your king safety scheme look like? For Blunder, right now, it's very simple: just collecting points from features such as semi-open files around our king and how many squares different pieces attack in the "king zone" (16 squares around the king), and then running those points through a quadratic model to get a centipawn value.
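
In rough Rust, the shape of such a scheme is something like the sketch below. The feature weights, the quadratic coefficients and the sign convention are placeholders of mine (exactly the numbers a Texel tuner would fit), not Blunder's actual values, and the attack counts are assumed to be computed elsewhere.

Code:

// Shape of a simple points-then-quadratic king safety term. All constants
// are placeholder guesses to be tuned; this is not Blunder's code.
struct KingZone {
    semi_open_files: i32, // semi-open files next to our king
    minor_attacks: i32,   // squares of the 16-square king zone hit by N/B
    rook_attacks: i32,    // ... hit by rooks
    queen_attacks: i32,   // ... hit by queens
}

fn king_safety_cp(z: &KingZone) -> i32 {
    // Step 1: collect danger points from the individual features.
    let points = 3 * z.semi_open_files
        + 2 * z.minor_attacks
        + 3 * z.rook_attacks
        + 5 * z.queen_attacks;

    // Step 2: quadratic model - a few attacked squares cost almost nothing,
    // but the penalty grows faster than linearly as attacks pile up.
    let points = points.min(60); // cap so the parabola stays bounded
    -(points * points) / 20      // assumed sign: a penalty for the side whose king it is
}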