Experiments in generating Texel Tuning data

Discussion of chess software programming and technical issues.

Moderator: Ras

Tearth
Posts: 70
Joined: Thu Feb 25, 2021 5:12 pm
Location: Poland
Full name: Pawel Osikowski

Re: Experiments in generating Texel Tuning data

Post by Tearth »

algerbrex wrote: Fri Jul 01, 2022 3:44 pm If anyone's interested, I can upload the dataset a little later. Have no clue if it'll be beneficial to another engine besides Blunder or not:
I would be interested, mainly to see if expanding the Zurichess dataset can in practice be helpful for Inanis' tuner before I start making my own tools and sets. Nice job!
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

Tearth wrote: Fri Jul 01, 2022 3:59 pm
algerbrex wrote: Fri Jul 01, 2022 3:44 pm If anyone's interested, I can upload the dataset a little later. Have no clue if it'll be beneficial to another engine besides Blunder or not:
I would be interested, mainly to see if expanding the Zurichess dataset can in practice be helpful for Inanis' tuner before I start making my own tools and sets. Nice job!
Thanks! And sure thing, here it is:
Tearth
Posts: 70
Joined: Thu Feb 25, 2021 5:12 pm
Location: Poland
Full name: Pawel Osikowski

Re: Experiments in generating Texel Tuning data

Post by Tearth »

Thanks, I will try to find some time between other tests and check it.
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

I decided to run some more thorough gauntlet testing on the new set of training positions I produced. Here's the current version of Blunder, 8.0.0, against the gauntlet, with the ratings anchored to Blunder 7.6.0's CCRL rating and computed using Ordo:

Code: Select all

   # PLAYER           : RATING    POINTS  PLAYED    (%)
   1 Inanis 1.0.1     : 2780.5     257.5     400   64.4%
   2 GreKo 2018.02    : 2749.7     241.0     400   60.3%
   3 Zahak 5.0        : 2739.7     235.5     400   58.9%
   4 Blunder 8.0.0    : 2676.8     931.0    2000   46.5%
   5 Blunder 7.6.0    : 2631.0     174.0     400   43.5%
   6 Nalwald 1.9      : 2607.6     161.0     400   40.3%
Now here's the development version of Blunder, 8.1.0, with the evaluation tuned from scratch using the same training process, but using the extended Zurichess dataset:

Code: Select all

   # PLAYER           : RATING    POINTS  PLAYED    (%)
   1 Inanis 1.0.1     : 2756.8     210.5     372   56.6%
   2 Zahak 5.0        : 2732.8     199.0     374   53.2%
   3 GreKo 2018.02    : 2716.0     189.0     372   50.8%
   4 Blunder 8.1.0    : 2710.3     971.5    1863   52.1%
   5 Nalwald 1.9      : 2637.7     148.0     372   39.8%
   6 Blunder 7.6.0    : 2631.0     145.0     373   38.9%
Now, as you can see, there are about 150 games left to run in the latest gauntlet test, but I don't expect the values to change much. If that holds, it shows that Blunder gained ~34 Elo from the extended dataset, which I'm a little surprised by, to be honest, but happy that my experiments have been successful so far.

If I find the time between other tests, I'm going to try comparing the above results to training using only the positions I extracted from Blunder's self-play games during testing.
User avatar
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Experiments in generating Texel Tuning data

Post by hgm »

I am trying Texel tuning for my engine for Korean chess, but so far with very little success. I generated games by ultra-bullet self-play, and sampled from the positions at the end of the PVs to make sure they were quiet. With an evaluation that is mostly PST, the tuning enormously reduced the MSE compared to that of the HCE. But when I match them against each other, the HCE wins nearly 100% of the games.
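For reference, the error that Texel tuning minimizes is the mean squared difference between each game result and a sigmoid of the static evaluation. Here is a minimal sketch in Go; the scaling constant K, the placeholder values, and the function names are illustrative assumptions, not code from any engine in this thread:

Code: Select all

// Minimal sketch of the Texel tuning objective: the mean squared
// difference between game results and a sigmoid of the static eval.
// K is a scaling constant that is normally fitted to the data first;
// the 1.13 here is only a placeholder assumption.
package main

import (
	"fmt"
	"math"
)

const K = 1.13

// sigmoid maps an eval in centipawns to an expected score in [0, 1].
func sigmoid(eval float64) float64 {
	return 1.0 / (1.0 + math.Pow(10, -K*eval/400.0))
}

// meanSquaredError compares predicted scores against game results
// (1.0 = white win, 0.5 = draw, 0.0 = white loss).
func meanSquaredError(evals, results []float64) float64 {
	var sum float64
	for i, e := range evals {
		d := results[i] - sigmoid(e)
		sum += d * d
	}
	return sum / float64(len(evals))
}

func main() {
	evals := []float64{120, -80, 15}
	results := []float64{1.0, 0.0, 0.5}
	fmt.Println(meanSquaredError(evals, results))
}

The tuner then nudges the evaluation parameters to reduce this error, which is why a lower MSE does not automatically mean a stronger engine if the evaluation model itself cannot express what actually wins games.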

The only explanation I can think of is that a PST-based evaluation utterly sucks, and that projecting the eval terms that would really correlate well with the game result onto PST does more harm than good. The tuned eval gives a very high bonus for pieces attacking the Palace, which is understandable, because you would need to attack the Palace in order to checkmate the King that is confined in it. So most won games will have many pieces attacking the Palace. The problem is that a single attacker can never achieve a checkmate, and if there are some defenders in the Palace (as there usually are) you might need 3 or 4 attackers before it gets dangerous. And with the 'tuned' PST the engine gets so excited when it can get 2 attackers on optimal squares that it is willing to sacrifice material in order to achieve that.

So it seems I would need a completely different type of evaluation, before tuning can have any effect.
Tearth
Posts: 70
Joined: Thu Feb 25, 2021 5:12 pm
Location: Poland
Full name: Pawel Osikowski

Re: Experiments in generating Texel Tuning data

Post by Tearth »

algerbrex wrote: Fri Jul 01, 2022 4:09 pm Thanks! And sure thing, here it is:
I've done all my previous tests, so I could finally check this dataset today - and the result is astonishing. After running Inanis' tuner for a few hours and then a gauntlet, I've noticed +25 Elo after 4000 games (which is still not that solid since I usually wait until 20000 games, but the advantage is already very clear at this point, so I wanted to share the result now) compared to the values tuned using the traditional Zurichess quiet positions. This proved to me that there's still a lot of interesting work that can be done here, so it seems like writing my own tools to generate these positions will be the next task. Summarizing, @algerbrex, you have independent proof that this dataset indeed works : )

EDIT: comparing the differences between the old and new parameters: there's a much higher focus on mobility now, and the piece-square tables have been almost completely rebuilt in some areas. The rest of the parameters (king safety, pawn structure) seem similar.
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

Tearth wrote: Sun Jul 03, 2022 8:53 pm I've done all my previous tests, so I could finally check this dataset today - and the result is astonishing. After running Inanis' tuner for a few hours and then a gauntlet, I've noticed +25 Elo after 4000 games (which is still not that solid since I usually wait until 20000 games, but the advantage is already very clear at this point, so I wanted to share the result now) compared to the values tuned using the traditional Zurichess quiet positions. This proved to me that there's still a lot of interesting work that can be done here, so it seems like writing my own tools to generate these positions will be the next task. Summarizing, @algerbrex, you have independent proof that this dataset indeed works : )
Thanks for the update, that's very surprising to me. Looks like my little experiment worked pretty well, at least for now :) Glad I could help confirm, for both of us, that extending the Zurichess dataset through self-play is something worthwhile to explore. I'm now probably going to dedicate a bit more time to exploring that area.
Tearth wrote: Sun Jul 03, 2022 8:53 pm EDIT: comparing the differences between the old and new parameters: there's a much higher focus on mobility now, and the piece-square tables have been almost completely rebuilt in some areas. The rest of the parameters (king safety, pawn structure) seem similar.
Interesting. For me, the biggest change was actually in the material values. The value of the queen dropped a good bit, as did the value of a pawn in the middle game, while the value of rooks rose. Further, many of the peculiar-looking extreme values I always seemed to get from tuning with Zurichess's dataset were scaled down to much more reasonable-looking values. For example, a knight on a8 was often given an oddly heavy penalty, but now has something I think is a bit more reasonable.
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

hgm wrote: Sun Jul 03, 2022 1:25 pm I am trying Texel tuning for my engine for Korean chess, but so far with very little success. I generated games by ultra-bullet self-play, and sampled from the positions at the end of the PVs to make sure they were quiet. With an evaluation that is mostly PST, the tuning enormously reduced the MSE compared to that of the HCE. But when I match them against each other, the HCE wins nearly 100% of the games.
That's interesting, to say the least. I'm doing everything you described, but I'm also making sure to exclude certain positions that could skew the tuning, like positions past ply 200, since at that point it's most likely a boring draw. I also exclude positions near the end of the game when a checkmate sequence has likely been found, since those positions might look odd: the attacking side is fine with throwing away some material or putting its pieces on odd squares to achieve checkmate. I'm also sampling no more than 10 random positions per game.
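Roughly, that filtering amounts to something like the sketch below. The types, helper names, and exact cutoffs are illustrative assumptions, not Blunder's actual code:

Code: Select all

// Rough sketch of the sampling described above: skip very late plies,
// skip positions close to a decisive finish, and keep at most 10
// randomly chosen quiet positions per game.
package sampling

import "math/rand"

type Position struct {
	FEN    string
	Ply    int
	Result float64 // 1.0 white win, 0.5 draw, 0.0 white loss
}

type Game struct {
	Positions []Position
	Decisive  bool // game ended in checkmate or with a mate score found
}

const (
	maxPly          = 200 // past this, it's most likely a boring draw
	mateBufferPlies = 10  // drop positions this close to a decisive end
	maxPerGame      = 10
)

func SamplePositions(g Game, isQuiet func(Position) bool) []Position {
	var candidates []Position
	for _, p := range g.Positions {
		if p.Ply > maxPly {
			continue
		}
		if g.Decisive && p.Ply > len(g.Positions)-mateBufferPlies {
			continue // likely inside a forced mating sequence
		}
		if !isQuiet(p) {
			continue
		}
		candidates = append(candidates, p)
	}
	// Keep at most maxPerGame randomly chosen candidates.
	rand.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if len(candidates) > maxPerGame {
		candidates = candidates[:maxPerGame]
	}
	return candidates
}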

I'm assuming you're also already doing something along these lines, where applicable for Korean chess, which I'm not very familiar with. My only thought is that it might be the ultra-bullet games. For the games I used to generate this new dataset, the vast majority were no quicker than 10+0.1s per game, and some were 15+0.2s and 5+0.5s. I've seen people like Andrew describe how they were able to use very fast games, something like 1+0.1s or 2+0.2s, which I've tried before but have never had success with. I hypothesize this is simply something Ethereal and stronger engines can get away with, since their search and evaluation are often much more accurate at lower depths than Blunder's, so the positions extracted aren't of such poor quality.

hgm wrote: Sun Jul 03, 2022 1:25 pm The only explanation I can think of is that a PST-based evaluation utterly sucks, and that projecting the eval terms that would really correlate well with the game result onto PST does more harm than good. The tuned eval gives a very high bonus for pieces attacking the Palace, which is understandable, because you would need to attack the Palace in order to checkmate the King that is confined in it. So most won games will have many pieces attacking the Palace. The problem is that a single attacker can never achieve a checkmate, and if there are some defenders in the Palace (as there usually are) you might need 3 or 4 attackers before it gets dangerous. And with the 'tuned' PST the engine gets so excited when it can get 2 attackers on optimal squares that it is willing to sacrifice material in order to achieve that.

So it seems I would need a completely different type of evaluation, before tuning can have any effect.
Again, I might be off base here since I'm not very familiar with Korean chess, but you might be right if your engine currently just has a PST evaluation for attacking the king. I feel like you'd need to find a way to tune some non-linearity into the bonuses for pieces attacking the king, since, as you said, an attack is something that has to be built up slowly. You can't just plop a couple of pieces around a secure king and pretend you have an attack.
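One common way to get that non-linearity (a generic sketch of the idea, not code from hgm's engine or Blunder, and with made-up numbers) is to count the attackers and look the bonus up in a table that stays flat for one or two attackers and only ramps up afterwards:

Code: Select all

// Sketch of a non-linear king (or palace) attack bonus: instead of a
// per-attacker PST bonus, accumulate an attacker count and look the
// score up in a table that is near zero for one or two attackers and
// grows quickly after that. The values are purely illustrative.
package kingsafety

var attackScale = [8]int{0, 0, 10, 40, 90, 160, 250, 300}

// PalaceAttackBonus converts a raw attacker count into centipawns.
func PalaceAttackBonus(attackers int) int {
	if attackers >= len(attackScale) {
		attackers = len(attackScale) - 1
	}
	return attackScale[attackers]
}

With something like this, two attackers are worth almost nothing, so the engine has less incentive to sacrifice material just to park a couple of pieces near the palace.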
User avatar
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Experiments in generating Texel Tuning data

Post by lithander »

hgm wrote: Sun Jul 03, 2022 1:25 pm The only explanation I can think of is that a PST-based evaluation utterly sucks, and that projecting the eval terms that would really correlate well with the game result onto PST does more harm than good. The tuned eval gives a very high bonus for pieces attacking the Palace, which is understandable, because you would need to attack the Palace in order to checkmate the King that is confined in it. So most won games will have many pieces attacking the Palace. The problem is that a single attacker can never achieve a checkmate, and if there are some defenders in the Palace (as there usually are) you might need 3 or 4 attackers before it gets dangerous. And with the 'tuned' PST the engine gets so excited when it can get 2 attackers on optimal squares that it is willing to sacrifice material in order to achieve that.
Are you using phases already? Most chess engines seem to have settled on two sets of PSTs, one for the midgame and one for the endgame. The difference in value between the two phases is often above 100 centipawns. When I was still using one phase, the engine had serious problems winning endgames or knowing when to start moving the king forward.
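For reference, the usual tapered interpolation between the two sets looks roughly like this sketch, which uses the common 0-24 phase convention; the weights are illustrative, not anyone's tuned values:

Code: Select all

// Sketch of tapered evaluation: keep a midgame and an endgame score,
// then blend them by a game phase derived from the remaining material.
package taper

const maxPhase = 24

// Phase: 0 = bare endgame, 24 = full starting material. The counts are
// the number of each piece type still on the board (both sides).
func Phase(knights, bishops, rooks, queens int) int {
	p := knights + bishops + 2*rooks + 4*queens
	if p > maxPhase {
		p = maxPhase
	}
	return p
}

// Taper blends the midgame and endgame scores by the current phase.
func Taper(mg, eg, phase int) int {
	return (mg*phase + eg*(maxPhase-phase)) / maxPhase
}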

The way you describe it, the number of attackers on the palace could determine the phase. And only in the later phase (3-4 attackers) would you get these high values on attack squares that cause your engine to sacrifice material to stack even more attackers.

It's a moonshot... I don't even know the game! ;)
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
User avatar
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Experiments in generating Texel Tuning data

Post by j.t. »

lithander wrote: Sun Jul 03, 2022 11:02 pm Most chess engines seem to have settled on two sets of PSTs, one for the midgame and one for the endgame.
I even do this for all evaluation parameters, e.g. I have two sets of values for mobility, one for the opening and one for the endgame.
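In code that usually just means every tuned term is stored as an (opening, endgame) pair, along these lines; the names and values are hypothetical, not taken from Inanis, Nalwald, or any other engine here:

Code: Select all

// Sketch of storing every evaluation parameter as an (opening, endgame)
// pair, so mobility, PSTs, etc. all get tapered the same way.
package params

type Score struct {
	Mg int16 // opening / midgame value
	Eg int16 // endgame value
}

// Example: a knight mobility bonus, one pair per number of reachable
// squares (0 through 8). Values are made up for illustration.
var KnightMobility = [9]Score{
	{-25, -30}, {-10, -15}, {0, -5}, {5, 0}, {10, 5},
	{15, 10}, {18, 12}, {20, 14}, {22, 15},
}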