Experiments in generating Texel Tuning data

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Experiments in generating Texel Tuning data

Post by algerbrex »

I've recently been experimenting with re-tuning Blunder's evaluation parameters. The current parameters (for material, PSTs, and mobility) were tuned using the set from Zurichess, so for this tuning session, I decided I'd experiment and try data from two different sources.

For the first set of data, I downloaded several thousand games played by Magnus Carlsen and Hikaru Nakamura from here: http://www.pgnmentor.com/files.html#players, about 8,000 games in total. For the second set, I collected self-play games using Blunder at a hyper-bullet time control, as was originally suggested.

For the self-play games I used the following cutechess command:

Code: Select all

cutechess-cli -pgnout games.pgn -srand $RANDOM -engine cmd=blunder name="blunder1" -engine cmd=blunder name="blunder2" -openings file=$HOME/2moves_v2a.pgn format=pgn order=random -each tc=40/2+0.05 proto=uci option.Hash=64 timemargin=60000 -games 2 -rounds 4000 -repeat 2 -concurrency 8 -recover -ratinginterval 50 -tournament gauntlet
For each data set, I parsed the PGNs and extracted a list of positions using the following criteria:
  • No duplicate positions
  • No check or checkmate positions
  • No positions before 16 half-moves into the game
  • No positions within six moves of the end of the game
  • No positions where the static eval was greater than 60 cp
  • No positions where the static eval and the quiescence search score were more than 30 cp apart
This provided me with roughly 400k-500k FENs for tuning. I also randomized the order of the positions before writing them to a data file, to avoid overfitting. From these, I loaded 400K positions into the tuner and let it run while watching how the parameters were adjusted. (There's a rough sketch of the filtering step below.)
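To make the criteria concrete, the filtering step looks roughly like the sketch below. This is illustrative Go rather than Blunder's actual extractor; the Candidate fields and the keep and abs functions are hypothetical names.

Code: Select all

package extract

// Candidate is one position pulled from a parsed game. The field names here
// are made up for illustration; the real extractor is structured differently.
type Candidate struct {
	FEN          string
	Ply          int  // half-moves played before this position
	MovesFromEnd int  // full moves remaining until the game ended
	InCheck      bool // side to move is in check (or checkmated)
	StaticEval   int  // static eval in centipawns
	QSearchEval  int  // quiescence search score in centipawns
}

// keep reports whether a position passes the selection criteria above.
// seen tracks FENs already accepted, so duplicates are rejected.
func keep(c Candidate, seen map[string]bool) bool {
	switch {
	case seen[c.FEN]:
		return false // no duplicate positions
	case c.InCheck:
		return false // no check or checkmate positions
	case c.Ply < 16:
		return false // no positions before 16 half-moves
	case c.MovesFromEnd < 6:
		return false // no positions within six moves of the end
	case abs(c.StaticEval) > 60:
		return false // no positions with a static eval above 60 cp
	case abs(c.StaticEval-c.QSearchEval) > 30:
		return false // no positions where eval and qsearch disagree by more than 30 cp
	}
	seen[c.FEN] = true
	return true
}

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}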

From both sets, it seems the values the parameters are being tweaked toward are inferior to the current ones. In particular, the mobility parameters (knight mobility, bishop mobility, rook midgame and endgame mobility, and queen midgame and endgame mobility) are driven from their current values down to one or zero, which suggests the tuner is trying to make mobility irrelevant in the evaluation. That certainly minimizes the mean squared error, but I highly doubt making the engine blind to mobility will make it play better (from my current testing, with the features Blunder now has, mobility is worth ~50-60 Elo).
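For reference, the quantity the tuner minimizes is the usual Texel mean squared error, with K scaling centipawn scores into expected results. A minimal sketch in Go; the Entry type and function names are purely illustrative, not Blunder's actual tuner code:

Code: Select all

package tuning

import "math"

// Entry is one training position: the game result and the engine's score.
type Entry struct {
	Result float64 // game result from White's point of view: 1.0, 0.5 or 0.0
	Score  float64 // eval (or qsearch) score in centipawns, from White's point of view
}

// sigmoid maps a centipawn score to an expected result in [0, 1].
func sigmoid(score, k float64) float64 {
	return 1.0 / (1.0 + math.Pow(10, -k*score/400.0))
}

// meanSquaredError is the objective both the K search and the parameter
// tuning try to minimize.
func meanSquaredError(entries []Entry, k float64) float64 {
	var sum float64
	for _, e := range entries {
		diff := e.Result - sigmoid(e.Score, k)
		sum += diff * diff
	}
	return sum / float64(len(entries))
}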

I'm going to keep experimenting to see if I can generate anything useful. Computing the K value for both data sets also seems a bit off: K is driven to an incredibly small value, which to me suggests overfitting. So perhaps I'll experiment with using more positions (1M-2M), though I'd probably need better hardware to do that reasonably efficiently. I'll also try to get a better variety of grandmaster games from tournaments and the like.

I'm curious to know what others' experiences have been like with regards to generating training data from self-play or human (grandmaster) games? I've also read the Zurichess paper (https://bitbucket.org/zurichess/zuriche ... g%20Method) and the experimenting described there.
Last edited by algerbrex on Sat Oct 30, 2021 12:26 am, edited 1 time in total.
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Experiments in generating Texel Tuning data

Post by mvanthoor »

algerbrex wrote: Sat Oct 30, 2021 12:14 am I'm curious to know what others' experiences have been like with regards to generating training data from self-play or human (grandmaster) games?
I'm almost done completing the XBoard-protocol (darn, that's a lot of work compared to UCI...), so I'll get to this myself in some time; hopefully before the end of next month.

I think, looking at this as a chess player, there are two things you could be looking at:

- If you download lots and lots of games of the same few players, you are training the engine into a particular style of play; a style it may not (yet) be compatible with, because it doesn't have either enough depth or enough evaluation parameters.
- If you use hyper-bullet games (Blunder self-play?) the games may be of too low a quality.

My suggestion would be to download millionbase 3.45 (https://rebel13.nl/download/data.html) and extend it with the games from The Week in Chess (TWIC, https://theweekinchess.com/). Then dump the games played between 2001 and 2021 by grandmasters rated 2600 Elo and above; you could use SCID for this. Create the tuning FENs from that.

It could be fun to try and tune on games played from 1850 to 1950, to see if you can create an engine that is as strong as your current version, but plays in a really old style....

I think it will work better if you have a more diverse game pool, and when you have more evaluation terms.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

mvanthoor wrote: Sat Oct 30, 2021 12:26 am I'm almost done completing the XBoard-protocol (darn, that's a lot of work compared to UCI...), so I'll get to this myself in some time; hopefully before the end of next month.
Ah. Hearing this just makes me want to put off implementing XBoard protocol in Blunder even more... :lol:
mvanthoor wrote: Sat Oct 30, 2021 12:26 am I think, looking at this as a chess player, there are two things you could be looking at:

- If you download lots and lots of games of the same few players, you are training the engine into a particular style of play; a style it may not (yet) be compatible with because it doesn't have either enough depth or enough evaluation parameters.
- If you use hyper-bullet games (Blunder self-play?) the games may be of too low a quality.
Hmm, I see. Those were some of my ideas about what might be going on as well. I'm definitely going to look into getting a much wider variety of human grandmaster games for an actual tuning session, since this quick-and-dirty experimenting has shown that selecting games from only a few players has quite a nasty overfitting effect on Blunder's current evaluation parameters.

As far as the hyper-bullet games go, I suppose so as well. Reading Peter Österlund's original description of the tuning method on the CPW, hyper-bullet games were the suggested way of generating self-play data, and the Zurichess article seemed to suggest the same idea. Of course, both Texel and Zurichess are currently much stronger than Blunder, with much more mature evaluations, so they may be able to get away with searching far shallower than Blunder would need to, since their evaluations pick up the slack in game quality. I'll probably look into a longer time control to see what sort of effect it has.
mvanthoor wrote: Sat Oct 30, 2021 12:26 am My suggestion would be to download millionbase 3.45 (https://rebel13.nl/download/data.html) and extend it with the games from The Week in Chess (TWIC, https://theweekinchess.com/). Then dump the games played between 2001 and 2021 by grandmasters rated 2600 Elo and above; you could use SCID for this. Create the tuning FENs from that.
Thanks, I've been looking for more data, and this sounds like a very good idea; I'll report back on how it goes. SCID would probably be the easier option, although technically I could modify my PGN parser to read the dates and Elo ratings of the games and select them according to the criteria you outlined.
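If I do go the parser route, the header check is simple enough. A rough sketch, assuming my parser exposes the PGN tag pairs as a map (the names here are hypothetical, not my actual code):

Code: Select all

package pgnfilter

import (
	"strconv"
	"strings"
)

// keepGame reports whether a game's headers satisfy "both players rated
// 2600 or higher, played between 2001 and 2021".
func keepGame(tags map[string]string) bool {
	whiteElo, err1 := strconv.Atoi(tags["WhiteElo"])
	blackElo, err2 := strconv.Atoi(tags["BlackElo"])
	if err1 != nil || err2 != nil {
		return false // unrated or malformed rating tags
	}
	if whiteElo < 2600 || blackElo < 2600 {
		return false
	}
	// PGN dates look like "2015.04.11" or "2015.??.??"; the year is enough here.
	year, err := strconv.Atoi(strings.SplitN(tags["Date"], ".", 2)[0])
	if err != nil {
		return false
	}
	return year >= 2001 && year <= 2021
}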
mvanthoor wrote: Sat Oct 30, 2021 12:26 am It could be fun to try and tune on games played from 1850 to 1950, to see if you can create an engine that is as strong as your current version but plays in a really old style....

I think it will work better if you have a more diverse game pool, and when you have more evaluation terms.
That's definitely something I eventually want to look into as well, especially since I'll probably be studying older games to improve my own chess. I've heard they're easier for beginners to understand than the "hypermodern" approaches taken by current grandmasters (still not entirely sure what that means, but I'm assuming it'll become clearer when I start studying more).
jdart
Posts: 4397
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Experiments in generating Texel Tuning data

Post by jdart »

I think using human games is inadvisable, because humans blunder quite a lot, more than computers. Even GM games have blunders. So you're tuning based on game results but the position value may not correlate with the game result all that well. Maybe it doesn't matter, but pretty much the standard here is to use selfplay games or at least computer games.
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Experiments in generating Texel Tuning data

Post by mvanthoor »

jdart wrote: Sat Oct 30, 2021 10:43 am I think using human games is inadvisable, because humans blunder quite a lot, more than computers. Even GM games have blunders. So you're tuning based on game results but the position value may not correlate with the game result all that well. Maybe it doesn't matter, but pretty much the standard here is to use selfplay games or at least computer games.
Granted, but for engines in "our" range (Blunder, Rustic), between 2100 and 2400, grandmaster games are probably still better than self-play or gauntlet games. For an engine rated 3000 Elo or higher, it would probably be better to use self-play or gauntlet games.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
op12no2
Posts: 547
Joined: Tue Feb 04, 2014 12:25 pm
Location: Gower, Wales
Full name: Colin Jenkins

Re: Experiments in generating Texel Tuning data

Post by op12no2 »

more data: http://www.talkchess.com/forum3/viewtop ... =7&t=75350
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

op12no2 wrote: Sun Oct 31, 2021 9:57 am more data: http://www.talkchess.com/forum3/viewtop ... =7&t=75350
Thanks, I may look into using some positions from there as well.
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

Update.

So after some more investigation, I discovered a bug in my position extractor! It caused my tuner to load every position as if White had won the game, which of course caused issues.

After fixing this bug, I'm getting a much more reasonable K value for the datasets I generate:

Code: Select all

Done loading 400000 positions...
Best K of 1.000000 on iteration 0
Best K of 1.200000 on iteration 1
Best K of 1.160000 on iteration 2
Best K of 1.161000 on iteration 3
Best K of 1.161100 on iteration 4
Best K of 1.161050 on iteration 5
Best K of 1.161053 on iteration 6
Best K of 1.161052 on iteration 7
Best K of 1.161052 on iteration 8
Best K of 1.161052 on iteration 9
Best K is: 1.1610522079999988
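For context, each iteration in the log above just refines K by one more decimal place, scanning a small window around the current best value and keeping whichever K gives the lowest mean squared error. A rough sketch of that kind of search, continuing the earlier tuning sketch (so Entry and meanSquaredError are the illustrative definitions from before, not Blunder's actual code):

Code: Select all

package tuning

// findK refines K one decimal place per iteration by scanning a window
// around the current best value and keeping the K with the lowest error.
func findK(entries []Entry, iterations int) float64 {
	bestK := 1.0
	bestErr := meanSquaredError(entries, bestK)
	step := 0.1
	for i := 0; i < iterations; i++ {
		center := bestK
		for j := -10; j <= 10; j++ {
			k := center + float64(j)*step
			if e := meanSquaredError(entries, k); e < bestErr {
				bestErr, bestK = e, k
			}
		}
		step /= 10 // one more decimal place of precision next iteration
	}
	return bestK
}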
Everything looks pretty good, and I've double-checked for any other bugs. So I'll retune Blunder's evaluation parameters using this dataset and see if I can get some improvement. I'll post another update tomorrow, which is when I expect the tuning session to have finished.
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

It's been a while since I posted here, and hopefully, this isn't seen as necromancy, but I finally got around this past weekend to generating my own Texel Tuning data and trying to tune from it.

I first downloaded the 3.5 million high-quality human chess games from https://rebel13.nl/download/data.html, then used SCID to select all games between players rated 2700 and higher, restricted to roughly the last 20 years (2001-2020). This gave me ~21K games to select my FENs from.

I'm going to be doing a series of experiments with generating and using Texel Tuning data, so for this first trial run, I went as basic as possible. I only excluded FENs:
  • that had a king in check or checkmate
  • where the qsearch score and raw eval score differed by more than 25 centipawns
These selection criteria gave me ~1.4M positions, which I then used to tune all of Blunder's evaluation parameters from scratch. The tuning session ran for roughly 13 hours, with rather anticlimactic results:

Code: Select all

Score of Blunder 7.7.0 vs Blunder 7.6.0: 93 - 287 - 87  [0.292] 467
...      Blunder 7.7.0 playing White: 49 - 134 - 50  [0.318] 233
...      Blunder 7.7.0 playing Black: 44 - 153 - 37  [0.267] 234
...      White vs Black: 202 - 178 - 87  [0.526] 467
Elo difference: -153.6 +/- 30.6, LOS: 0.0 %, DrawRatio: 18.6 %
SPRT: llr -2.95 (-100.1%), lbound -2.94, ubound 2.94 - H0 was accepted
Finished match
Blunder 7.7.0 was the dev version of Blunder, with the newly tuned values.

As I said, this was my first experiment, and I have some ideas of what to try next time around, hopefully this weekend. But at least for now, the Zurichess dataset is still the best dataset Blunder has been tuned with.

Which makes sense when you think about it: if I remember correctly, he only selected a few FENs from each game he used, so he got a greater variety of FENs and introduced less bias into his training set. He also played each position out with Stockfish to ensure the accuracy of the W/L/D labels, which clearly gave a very high-quality dataset. I may explore doing the same next time around, though I'd like to find a way to avoid having to bring Stockfish into the equation.
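If I do try the few-FENs-per-game approach, the per-game sampling itself is simple enough; something like the following (illustrative names, not actual Blunder code):

Code: Select all

package extract

import "math/rand"

// sampleFENs picks at most maxPerGame positions uniformly at random from the
// quiet positions of a single game, instead of keeping all of them.
func sampleFENs(quietFENs []string, maxPerGame int, rng *rand.Rand) []string {
	if len(quietFENs) <= maxPerGame {
		return quietFENs
	}
	picked := make([]string, 0, maxPerGame)
	for _, i := range rng.Perm(len(quietFENs))[:maxPerGame] {
		picked = append(picked, quietFENs[i])
	}
	return picked
}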

Also, given Jon Dart's earlier comment, I think it'll be helpful to experiment with tuning on CCRL games and on self-play gauntlet games from Blunder. For the latter I'll need to decide on an appropriately long time control, and I'll need to produce quite a large number of games.
chrisw
Posts: 4624
Joined: Tue Apr 03, 2012 4:28 pm
Location: Midi-Pyrénées
Full name: Christopher Whittington

Re: Experiments in generating Texel Tuning data

Post by chrisw »

algerbrex wrote: Mon Jan 24, 2022 3:26 pm It's been a while since I posted here, and hopefully, this isn't seen as necromancy, but I finally got around this past weekend to generating my own Texel Tuning data and trying to tune from it.

I first downloaded the 3.5 million high-quality human chess games from https://rebel13.nl/download/data.html, then used SCID to select all games between players rated 2700 and higher, restricted to roughly the last 20 years (2001-2020). This gave me ~21K games to select my FENs from.

I'm going to be doing a series of experiments with generating and using Texel Tuning data, so for this first trial run, I went as basic as possible. I only excluded FENs:
  • that had a king in check or checkmate
  • where the qsearch score and raw eval score differed by more than 25 centipawns
These selection criteria gave me ~1.4M positions, which I then used to tune all of Blunder's evaluation parameters from scratch. The tuning session ran for roughly 13 hours, with rather anticlimactic results:

Code: Select all

Score of Blunder 7.7.0 vs Blunder 7.6.0: 93 - 287 - 87  [0.292] 467
...      Blunder 7.7.0 playing White: 49 - 134 - 50  [0.318] 233
...      Blunder 7.7.0 playing Black: 44 - 153 - 37  [0.267] 234
...      White vs Black: 202 - 178 - 87  [0.526] 467
Elo difference: -153.6 +/- 30.6, LOS: 0.0 %, DrawRatio: 18.6 %
SPRT: llr -2.95 (-100.1%), lbound -2.94, ubound 2.94 - H0 was accepted
Finished match
Blunder 7.7.0 was the dev version of Blunder, with the newly tuned values.

As I said, this was my first experiment, and I have some ideas of what to try next time around, hopefully this weekend. But at least for now, the Zurichess dataset is still the best dataset Blunder has been tuned with.

Which makes sense when you think about it: if I remember correctly, he only selected a few FENs from each game he used, so he got a greater variety of FENs and introduced less bias into his training set. He also played each position out with Stockfish to ensure the accuracy of the W/L/D labels, which clearly gave a very high-quality dataset. I may explore doing the same next time around, though I'd like to find a way to avoid having to bring Stockfish into the equation.

Also, given Jon Dart's earlier comment, I think it'll be helpful to experiment with tuning on CCRL games and on self-play gauntlet games from Blunder. For the latter I'll need to decide on an appropriately long time control, and I'll need to produce quite a large number of games.
There may be some purists who complain if you use SF, but if you use it for WDL game check/result only, then there is no extraction of evaluation function values. Seems perfectly legit.
In any case, you can only afford to do relatively low-depth playouts, with inevitable noise. Is there any good reason why SF self-play would be any better than any other engine for this purpose? SF is not known for high accuracy at low search depths.