I would be interested, mainly to see whether expanding the Zurichess dataset can be helpful in practice for Inanis' tuner before I start making my own tools and sets. Nice job!
Experiments in generating Texel Tuning data
Moderator: Ras
-
- Posts: 70
- Joined: Thu Feb 25, 2021 5:12 pm
- Location: Poland
- Full name: Pawel Osikowski
Re: Experiments in generating Texel Tuning data
Inanis (Rust, active development) - https://github.com/Tearth/Inanis, http://talkchess.com/forum3/viewtopic.php?f=7&t=79625
Latest version: 1.6.0 (3100 Elo) - https://github.com/Tearth/Inanis/releases/tag/v1.6.0
Cosette, Bitboard Viewer
-
- Posts: 608
- Joined: Sun May 30, 2021 5:03 am
- Location: United States
- Full name: Christian Dean
Re: Experiments in generating Texel Tuning data
-
- Posts: 70
- Joined: Thu Feb 25, 2021 5:12 pm
- Location: Poland
- Full name: Pawel Osikowski
Re: Experiments in generating Texel Tuning data
Thanks, I will try to find some time between other tests and check it.
Inanis (Rust, active development) - https://github.com/Tearth/Inanis, http://talkchess.com/forum3/viewtopic.php?f=7&t=79625
Latest version: 1.6.0 (3100 Elo) - https://github.com/Tearth/Inanis/releases/tag/v1.6.0
Cosette, Bitboard Viewer
-
- Posts: 608
- Joined: Sun May 30, 2021 5:03 am
- Location: United States
- Full name: Christian Dean
Re: Experiments in generating Texel Tuning data
I decided to run some more thorough gauntlet testing on the new set of training positions I produced. Here's the current version of Blunder, 8.0.0, against the gauntlet, with the ratings anchored to Blunder 7.6.0's CCRL rating and computed using Ordo:
Code:
   #  PLAYER           :   RATING  POINTS  PLAYED    (%)
   1  Inanis 1.0.1     :   2780.5   257.5     400  64.4%
   2  GreKo 2018.02    :   2749.7   241.0     400  60.3%
   3  Zahak 5.0        :   2739.7   235.5     400  58.9%
   4  Blunder 8.0.0    :   2676.8   931.0    2000  46.5%
   5  Blunder 7.6.0    :   2631.0   174.0     400  43.5%
   6  Nalwald 1.9      :   2607.6   161.0     400  40.3%
Now here's the development version of Blunder, 8.1.0, with the evaluation tuned from scratch using the same training process, but on the extended Zurichess dataset:
Code:
   #  PLAYER           :   RATING  POINTS  PLAYED    (%)
   1  Inanis 1.0.1     :   2756.8   210.5     372  56.6%
   2  Zahak 5.0        :   2732.8   199.0     374  53.2%
   3  GreKo 2018.02    :   2716.0   189.0     372  50.8%
   4  Blunder 8.1.0    :   2710.3   971.5    1863  52.1%
   5  Nalwald 1.9      :   2637.7   148.0     372  39.8%
   6  Blunder 7.6.0    :   2631.0   145.0     373  38.9%
As you can see, there are about 150 games left to run in the latest gauntlet test, but I don't expect the values to change much. If the result holds, it shows that Blunder gained ~34 Elo from the extended dataset, which I'm a little surprised by, to be honest, but happy that my experiments have been successful so far.
If I find the time between other tests, I'm going to try comparing the above results to training using only the positions I extracted from Blunder's self-play games during testing.
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Experiments in generating Texel Tuning data
I am trying Texel tuning for my engine for Korean chess, but so far with very little success. I generated games by ultra-bullet self-play and sampled from the positions at the end of the PVs to make sure they were quiet. With an evaluation that is mostly PST, the tuning enormously reduced the MSE compared to that of the HCE. But when I match them against each other, the HCE wins nearly 100%.
The only explanation I can think of is that a PST-based evaluation utterly sucks, and that projecting the eval terms that would really correlate well with the game result onto PST does more harm than good. The tuned eval gives a very high bonus for pieces attacking the Palace. Which is understandable, because you would need to attack the Palace in order to checkmate the King that is confined in it, so most won games will have many pieces attacking the Palace. The problem is that a single attacker can never achieve a checkmate, and if there are some defenders in the Palace (as there usually are) you might need 3 or 4 attackers before it gets dangerous. And with the 'tuned' PST the engine gets so excited when it can get 2 attackers on optimal squares that it is willing to sacrifice material in order to achieve that.
So it seems I would need a completely different type of evaluation before tuning can have any effect.
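For reference, the MSE here is the standard Texel tuning objective: the mean squared error between the game results and a sigmoid of the static evaluation. A minimal Go sketch of that objective (the Position type and the evaluate callback are placeholders, not any engine's actual code):
Code:
package tuner

import "math"

// Position stands in for the engine's own board representation.
type Position struct{ FEN string }

// Sample pairs a (quiet) training position with the result of the game it
// came from: 1.0 = White win, 0.5 = draw, 0.0 = White loss.
type Sample struct {
	Pos    Position
	Result float64
}

// sigmoid maps a centipawn score to an expected game result in [0, 1].
// k is a scaling constant, usually fitted once so the untuned evaluation
// already minimises the error.
func sigmoid(score, k float64) float64 {
	return 1.0 / (1.0 + math.Pow(10, -k*score/400.0))
}

// MeanSquaredError is the quantity the tuner drives down by nudging the
// evaluation parameters; evaluate is the static evaluation in centipawns
// from White's point of view.
func MeanSquaredError(samples []Sample, evaluate func(Position) int, k float64) float64 {
	sum := 0.0
	for _, s := range samples {
		diff := s.Result - sigmoid(float64(evaluate(s.Pos)), k)
		sum += diff * diff
	}
	return sum / float64(len(samples))
}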
-
- Posts: 70
- Joined: Thu Feb 25, 2021 5:12 pm
- Location: Poland
- Full name: Pawel Osikowski
Re: Experiments in generating Texel Tuning data
I've done all my previous tests, so I could finally check this dataset today - and the result is astonishing: after running Inanis' tuner for a few hours and then a gauntlet, I've noticed +25 Elo after 4000 games (which is still not that solid, since I usually wait until 20000 games, but the advantage is already very clear at this point, so I wanted to share the result now) compared to the values tuned using the traditional Zurichess quiet positions. This proved to me that there's still a lot of interesting work that can be done here, so it seems like writing my own tools to generate these positions will be the next task. To summarize, @algerbrex, you have independent proof that this dataset indeed works : )
EDIT: Comparing the differences between the old and new parameters: there's a much higher focus on mobility now, and the piece-square tables are almost completely rebuilt in some areas. The rest of the parameters (king safety, pawn structure) seem similar.
Inanis (Rust, active development) - https://github.com/Tearth/Inanis, http://talkchess.com/forum3/viewtopic.php?f=7&t=79625
Latest version: 1.6.0 (3100 Elo) - https://github.com/Tearth/Inanis/releases/tag/v1.6.0
Cosette, Bitboard Viewer
-
- Posts: 608
- Joined: Sun May 30, 2021 5:03 am
- Location: United States
- Full name: Christian Dean
Re: Experiments in generating Texel Tuning data
Thanks for the update, that's very surprising to me. Looks like my little experiment worked pretty well, at least for now.
Tearth wrote: ↑Sun Jul 03, 2022 8:53 pm
I've done all my previous tests, so I could finally check this dataset today - and the result is astonishing: after running Inanis' tuner for a few hours and then a gauntlet, I've noticed +25 Elo after 4000 games (which is still not that solid, since I usually wait until 20000 games, but the advantage is already very clear at this point, so I wanted to share the result now) compared to the values tuned using the traditional Zurichess quiet positions. This proved to me that there's still a lot of interesting work that can be done here, so it seems like writing my own tools to generate these positions will be the next task. To summarize, @algerbrex, you have independent proof that this dataset indeed works : )
Interesting. For me, the biggest change was actually in the material values. The value of the queen dropped a good bit, as did the value of a pawn in the middle game, while the value of the rooks rose. Further, many of the peculiar-looking extreme values I always seemed to get from tuning with Zurichess's dataset were scaled down to much more reasonable-looking values. For example, a knight on a8 was often given an oddly heavy penalty, but now has something I think is a bit more reasonable.
-
- Posts: 608
- Joined: Sun May 30, 2021 5:03 am
- Location: United States
- Full name: Christian Dean
Re: Experiments in generating Texel Tuning data
That's interesting, to say the least. I'm doing everything you described, but I'm also making sure to exclude certain positions that could skew the tuning, like positions from games after ply 200, since at that point it's most likely a boring draw. I also exclude positions near the end of a game once a checkmate sequence has likely been found, since those positions can look odd: the attacking side is fine with throwing away material or putting its pieces on strange squares to force the mate. And I'm sampling no more than 10 random positions per game.
hgm wrote: ↑Sun Jul 03, 2022 1:25 pm
I am trying Texel tuning for my engine for Korean chess, but so far with very little success. I generated games by ultra-bullet self-play and sampled from the positions at the end of the PVs to make sure they were quiet. With an evaluation that is mostly PST, the tuning enormously reduced the MSE compared to that of the HCE. But when I match them against each other, the HCE wins nearly 100%.
I'm assuming you're already doing something along these lines, where applicable for Korean chess, which I'm not very familiar with. My only other thought is the ultra-bullet games. For the games I used to generate this new dataset, the vast majority were no quicker than 10+0.1s per game, with some at 15+0.2s and 5+0.5s. I've seen people like Andrew describe being able to use very fast games, something like 1+0.1s or 2+0.2s, which I've tried before but have never had success with. I suspect this is simply something Ethereal and stronger engines can get away with, since their search and evaluation are much more accurate at low depths than Blunder's, so the extracted positions aren't of such poor quality.
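To make that concrete, here is a minimal Go sketch of the kind of filtering and sampling described above (the GamePosition type and its fields are hypothetical placeholders, not Blunder's actual code):
Code:
package tuner

import "math/rand"

// GamePosition is a hypothetical record of one position taken from a
// self-play game; Ply and MateFound would come from the game record and
// the search output.
type GamePosition struct {
	FEN       string
	Ply       int
	MateFound bool // search already reported a forced-mate score here
}

// samplePositions applies the filters described above: drop positions past
// ply 200 (probably a dead draw), drop positions once a mate score has
// shown up (material sacrifices near mate look strange to a tuner), then
// keep at most maxPerGame randomly chosen positions from what survives.
func samplePositions(game []GamePosition, maxPerGame int, rng *rand.Rand) []GamePosition {
	var candidates []GamePosition
	for _, p := range game {
		if p.Ply > 200 || p.MateFound {
			continue
		}
		candidates = append(candidates, p)
	}
	rng.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if len(candidates) > maxPerGame {
		candidates = candidates[:maxPerGame]
	}
	return candidates
}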
Again, I might be off base here since I'm not very familiar with Korean chess, but you might be right if your engine currently has only a PST evaluation for attacking the king. I feel like you'd need to find a way to tune some non-linearity into the bonuses for pieces attacking the king, since, as you said, an attack is something that has to be built up slowly. You can't just plop a couple of pieces around a secure king and pretend you have an attack.
hgm wrote: ↑Sun Jul 03, 2022 1:25 pm
The only explanation I can think of is that a PST-based evaluation utterly sucks, and that projecting the eval terms that would really correlate well with the game result onto PST does more harm than good. The tuned eval gives a very high bonus for pieces attacking the Palace. Which is understandable, because you would need to attack the Palace in order to checkmate the King that is confined in it, so most won games will have many pieces attacking the Palace. The problem is that a single attacker can never achieve a checkmate, and if there are some defenders in the Palace (as there usually are) you might need 3 or 4 attackers before it gets dangerous. And with the 'tuned' PST the engine gets so excited when it can get 2 attackers on optimal squares that it is willing to sacrifice material in order to achieve that.
So it seems I would need a completely different type of evaluation before tuning can have any effect.
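One common way to get that kind of non-linearity, sketched in Go with made-up numbers (this is the general idea only, not a recipe for any particular engine), is to index a tunable bonus table by the number of attackers instead of summing a fixed per-piece bonus:
Code:
package tuner

// palaceAttackBonus is a hypothetical, tunable table indexed by the number
// of pieces attacking the Palace (or king zone). Keeping the first entries
// near zero makes a lone attacker almost worthless, while three or four
// attackers become worth real material - a non-linearity that a flat
// per-attacker PST bonus cannot express. The numbers here are invented.
var palaceAttackBonus = [8]int{0, 5, 15, 60, 140, 220, 280, 320}

// attackScore counts the attackers (however the engine defines an attack
// on the Palace) and looks the bonus up, instead of summing a fixed
// per-piece value.
func attackScore(attackers int) int {
	if attackers >= len(palaceAttackBonus) {
		attackers = len(palaceAttackBonus) - 1
	}
	return palaceAttackBonus[attackers]
}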
-
- Posts: 915
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Experiments in generating Texel Tuning data
Are you using phases already? Most chess engines seem to have settled on two sets of PSTs, one for the midgame and one for the endgame. The difference in value between the two phases is often above 100 centipawns. When I was still using one phase, the engine had serious problems winning endgames or knowing when to start moving the king forward.
hgm wrote: ↑Sun Jul 03, 2022 1:25 pm
The only explanation I can think of is that a PST-based evaluation utterly sucks, and that projecting the eval terms that would really correlate well with the game result onto PST does more harm than good. The tuned eval gives a very high bonus for pieces attacking the Palace. Which is understandable, because you would need to attack the Palace in order to checkmate the King that is confined in it, so most won games will have many pieces attacking the Palace. The problem is that a single attacker can never achieve a checkmate, and if there are some defenders in the Palace (as there usually are) you might need 3 or 4 attackers before it gets dangerous. And with the 'tuned' PST the engine gets so excited when it can get 2 attackers on optimal squares that it is willing to sacrifice material in order to achieve that.
The way you describe it, the amount of attackers on the Palace could determine the phase. And only in the later phase (3-4 attackers) would you get these high values on attack squares that cause your engine to sacrifice material for stacking even more attackers.
It's a moonshot... I don't even know the game!
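As a rough Go sketch of both ideas (assumed names and values, not taken from any engine here): the standard midgame/endgame blend, plus a variant where the number of Palace attackers plays the role of the phase:
Code:
package tuner

// taperedScore blends a midgame and an endgame value by a phase in
// [0, maxPhase]; this is the usual two-PST scheme described above.
func taperedScore(mg, eg, phase, maxPhase int) int {
	return (mg*phase + eg*(maxPhase-phase)) / maxPhase
}

// palaceAttackTaper is one way to read the suggestion (purely hypothetical):
// treat the number of Palace attackers as the phase, so the aggressive
// "attack" value only fully applies once three or four pieces join in.
func palaceAttackTaper(quietValue, attackValue, attackers int) int {
	const maxAttackers = 4
	if attackers > maxAttackers {
		attackers = maxAttackers
	}
	return taperedScore(attackValue, quietValue, attackers, maxAttackers)
}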

-
- Posts: 263
- Joined: Wed Jun 16, 2021 2:08 am
- Location: Berlin
- Full name: Jost Triller
Re: Experiments in generating Texel Tuning data
I even do this for all evaluation parameters, e.g. I have two sets of values for mobility, one for the opening and one for the endgame.
lithander wrote: ↑Sun Jul 03, 2022 11:02 pm
Most chess engines seem to have settled on two sets of PSTs, one for the midgame and one for the endgame.
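A minimal Go sketch of what "two values per parameter" can look like for a non-PST term such as mobility (the names and numbers are illustrative, not taken from any engine mentioned here):
Code:
package tuner

// Score holds an opening and an endgame value for one evaluation term, so
// every parameter, not just the PSTs, gets two tunable weights.
type Score struct{ MG, EG int }

// knightMobility is a hypothetical tuned parameter: a bonus per square the
// knight can move to, with separate opening and endgame weights.
var knightMobility = Score{MG: 4, EG: 6}

// Blend resolves a Score to a single centipawn value once the game phase
// is known, using the same interpolation as the tapered PSTs.
func (s Score) Blend(phase, maxPhase int) int {
	return (s.MG*phase + s.EG*(maxPhase-phase)) / maxPhase
}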