Evaluation & Tuning in Chess Engines

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

AndrewGrant
Posts: 1754
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Evaluation & Tuning in Chess Engines

Post by AndrewGrant »

RubiChess wrote: Wed Sep 09, 2020 1:28 pm
AndrewGrant wrote: Fri Sep 04, 2020 3:10 am As a testament to the data creation, which was as follows:
1) Generate as many games as possible, using a mix of 1s+.01s and 2s+.02s using heavy adjudication
2) Select at random 10 positions from each of those games, and perform depth 12 searches on them.
3) Apply all of the moves in the PV of the depth 12 search to the position
4) Save the results
Hi Andrew.

I don't understand steps 3 and 4 in your "generate tuning data howto".
You have a random position from some game you played before and a pv with 12 moves following this position. And then?
What does "apply all of the moves in the depth 12 search" exactly mean? Go to the positions 12 plies later? And then??
Or "for pvmoves = 1 to 12: apply move and do something (what?)"
And where do the results come from? We usually need a WDL score for each position, don't we? Where does it come from? Probably not from the original game, as the depth 12 search would be useless?

Here is an exchange I had with Alayan the other day, that paints a better picture.

Code: Select all

[8:50 AM] Andrews: Let me just lay out the whole process:
[8:50 AM] Alayan: Do it
[8:50 AM] Andrews: 1) I play a set of 1 million games of Ethereal vs Ethereal at 2s+.02s.
[8:50 AM] Andrews: 2) I parse those PGNs, toss out any games with fewer than 10 moves
[8:50 AM] Andrews: 3) From each PGN, I randomly sample 10 positions from the game.
[8:51 AM] Andrews: 4) Now I have 10 million positions. I perform a depth 12 search on all of them, and save the principal variation.
[8:51 AM] Andrews: 5) Now for each (position, principal variation), I take the position and apply each move in the PV to it.
[8:51 AM] Andrews: 6) I save that final position (i.e., the last position of the PV). Those are the lines listed in the final dataset.
[8:52 AM] Andrews: Note that my PV includes qsearch, so "tactical" resolutions are somewhat inherent. I demonstrated this by tuning with and without resolving the positions with an additional qsearch, and saw no difference.
[8:53 AM] Andrews: There is one big flaw: If position A has a known result of R, who is to say that A + 12 or more moves STILL should have the result of R.
[8:53 AM] Alayan: So, the final position that comes from the d12 PV is rated depending on the 2s+0.02s game result from which the position at the start of PV was extracted from
[8:53 AM] Andrews: Precisely.
[8:53 AM] Andrews: I can rationalize how I ended up with this (seemingly/somewhat) flawed system, but that's for another day.
[8:53 AM] Alayan: Yeah, that's because of this big flaw I didn't understand correctly when you explained it to me the first time, it simply didn't occur to me that this could actually work
[8:54 AM] Andrews: TL;DR: It's important that the positions in the dataset are ones that Ethereal would reach if allowed to make a bunch of moves.
[8:54 AM] Andrews: Everything else is up for debate.
[8:54 AM] Alayan: Playing ultra-short games is good to reach a lot of uncommon positions. The d12 PV is also important to reach positions that are uncommon in games but relevant in the tree.
[8:55 AM] Alayan: That's the flaw of "high-quality games" datasets, if you only get positions that end up being played in good quality games, you miss out on the mass of positions that will never get played but that need to be evaluated well to actually not make mistakes
[8:55 AM] Andrews: If I had all the compute in the world, I would go back and take my ~32 million positions and play fresh games using 10s+.1s on them, and update the results for each entry accordingly. This is what I STARTED doing with our old datasets, before throwing in the towel, being a man, and doing the math
[8:56 AM] Andrews: Yeah, so we have "diverse" positions, but not necessarily "highly accurate" results for them.
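In code, the pipeline from the chat log above could be sketched roughly like this. All helper names (`search`, `apply_move`) are hypothetical stand-ins, not Ethereal's actual code:

```python
import random

def build_dataset(games, search, apply_move, samples_per_game=10):
    """Sketch of the sampling pipeline described above.

    Each element of `games` is (positions, result): the positions of
    one self-play game plus its WDL outcome. `search(pos)` returns the
    depth-12 PV (including the qsearch tail) as a list of moves;
    `apply_move(pos, move)` plays one move. Both are assumptions here.
    """
    dataset = []
    for positions, result in games:
        if len(positions) < samples_per_game:   # toss out short games
            continue
        for pos in random.sample(positions, samples_per_game):
            for move in search(pos):            # walk the whole PV
                pos = apply_move(pos, move)
            # label the FINAL PV position with the ORIGINAL game result
            dataset.append((pos, result))
    return dataset
```

Note how the "big flaw" shows up directly in the last line: the position at the end of the PV inherits the result of the game its ancestor was sampled from.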
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
RubiChess
Posts: 584
Joined: Fri Mar 30, 2018 7:20 am
Full name: Andreas Matthies

Re: Evaluation & Tuning in Chess Engines

Post by RubiChess »

AndrewGrant wrote: Wed Sep 09, 2020 1:39 pm
Okay, that makes it clearer. The "big flaw" you and Alayan mentioned is what I meant with "Probably not from the original game as the depth 12 search would be useless?". So the accurate "look into the future" evaluation is worth more than the error introduced by a wrong WDL result in some positions. Still strange that it works. What is the average depth you reach in your 2s+0.02s games? I guess it is > 12? So the moves in the game should be about as good as the depth 12 PV, and taking 10 positions at (random + 12) and doing a qsearch on them should also work, shouldn't it? Maybe you can "rationalize why this works" on another day :-)
Kieren Pearson
Posts: 70
Joined: Tue Dec 31, 2019 2:52 am
Full name: Kieren Pearson

Re: Evaluation & Tuning in Chess Engines

Post by Kieren Pearson »

AndrewGrant wrote: Wed Sep 09, 2020 1:39 pm
If my engine were quite weak, would it be much better to use SF to create the dataset? Or would there still be a benefit to using my own engine, as it would then favour going into positions that it knows how to win, rather than positions that SF could win but my engine can't?
chrisw
Posts: 4315
Joined: Tue Apr 03, 2012 4:28 pm

Re: Evaluation & Tuning in Chess Engines

Post by chrisw »

Kieren Pearson wrote: Wed Sep 09, 2020 4:59 pm
If my engine were quite weak, would it be much better to use SF to create the dataset? Or would there still be a benefit to using my own engine, as it would then favour going into positions that it knows how to win, rather than positions that SF could win but my engine can't?
You have to start from something; the cycle time for game generation and analysis is measured in days (unless you own a few data centres), so starting from “Zero” is not really an option. I use(d) a mix of CCRL games and evaluations merged from LC0 and SF11, which gives my Texelated home-brew evaluator/search combo slightly better results than the great mass of engines around 3100. The same thing test-tuned recently on 25 million distilled LiChess human games (plus SF evals) gave a result around 200 Elo less before getting switched off, so a good guess would be that where the training sets come from is important, and how they get evaluated is important. Using your own eval once it’s matured seems sensible, using your own games likewise, and this original idea of EPDs + N ply looks worth experimenting with. Different engines will no doubt get different results; one size doesn’t fit all, etc. The challenge for proto engines is to get well past what seems to be a quite widely and easily achievable 3100. Mine has only just had pondering built in and is now facing the move from a one-thread engine to N threads.
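For reference, the core of the Texel-style tuning chrisw alludes to is minimizing the squared error between game results and a sigmoid of the static eval. A minimal sketch, with hypothetical names and the scaling constant `k` assumed to have been fitted to the data beforehand:

```python
import math

def texel_error(entries, evaluate, k=1.13):
    """Mean squared error between game results and the sigmoid-mapped
    static eval, as in Texel tuning.

    `entries` is a list of (position, result) with result in
    {0, 0.5, 1} from White's point of view; `evaluate(pos)` returns
    centipawns from White's point of view. Both are assumptions; the
    tuner's job is to pick eval parameters that minimize this error.
    """
    total = 0.0
    for pos, result in entries:
        sigmoid = 1.0 / (1.0 + 10.0 ** (-k * evaluate(pos) / 400.0))
        total += (result - sigmoid) ** 2
    return total / len(entries)
```

The dataset-generation debate in this thread is about which `entries` to feed this error function, and where their `result` labels come from.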
Tony P.
Posts: 216
Joined: Sun Jan 22, 2017 8:30 pm
Location: Russia

Re: Evaluation & Tuning in Chess Engines

Post by Tony P. »

D Sceviour wrote: Mon Aug 24, 2020 8:37 pm Has anybody tried to use a larger PST database such as:

PST [piece][square][enemy king position]
PK once had an idea to index by both kings' locations. Afaik, he hasn't got it to work. However, later in that thread, I cited an article about a fork of the Xiangqi engine Chimo where a similar idea has worked.
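For concreteness, the enlarged table D Sceviour asks about would be indexed along these lines. This is a sketch with illustrative zero values; a real table would be filled by the tuner:

```python
# One value per (piece, square, enemy king square):
# 6 piece types x 64 squares x 64 king squares = 24576 parameters
# per side, versus 6 x 64 = 384 for a plain PST.
NUM_PIECES, NUM_SQUARES = 6, 64
pst = [[[0] * NUM_SQUARES for _ in range(NUM_SQUARES)]
       for _ in range(NUM_PIECES)]

def pst_bonus(piece, square, enemy_king_square):
    """Look up the bonus for `piece` on `square`, conditioned on
    where the enemy king stands."""
    return pst[piece][square][enemy_king_square]
```

The size is the catch: with 64x as many parameters as a plain PST, each entry sees far fewer training positions, which is why quality and quantity of the training set matter so much here.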
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Evaluation & Tuning in Chess Engines

Post by D Sceviour »

Tony P. wrote: Sat Oct 03, 2020 9:37 pm
PK once had an idea to index by both kings' locations. Afaik, he hasn't got it to work. However, later in that thread, I cited an article about a Xiangqi engine Chimo, where a similar idea has worked.
I am still working on it, but am held up now for the want of quality training sets. I have an idea to reduce the size of the table using inverse matrix influence, but it is probably a crazy idea and will never work.
AndrewGrant
Posts: 1754
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Evaluation & Tuning in Chess Engines

Post by AndrewGrant »

D Sceviour wrote: Sat Oct 03, 2020 9:47 pm
I am still working on it, but am held up now for the want of quality training sets. I have an idea to reduce the size of the table using inverse matrix influence, but it is probably a crazy idea and will never work.
I managed to tune an array of [0][our king sq][pawn], [1][their king sq][pawn], but it was supposed to just be a joke. 2x4096 param tables. Managed to gain +2 or +3 elo at LTC testing, but I never committed it.
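A sketch of what such a pair of 2x4096-parameter tables could look like — illustrative values only, not Andrew's actual code:

```python
# Two tables: pawn bonus indexed by (own king sq, pawn sq) and by
# (enemy king sq, pawn sq). 2 x 64 x 64 = 8192 parameters total,
# which a tuner would fill in; zeros here are placeholders.
OWN_KING, ENEMY_KING = 0, 1
pawn_king_pst = [[[0] * 64 for _ in range(64)] for _ in range(2)]

def pawn_king_term(own_king_sq, enemy_king_sq, pawn_squares):
    """Sum the bonuses for each pawn relative to both kings."""
    score = 0
    for sq in pawn_squares:
        score += pawn_king_pst[OWN_KING][own_king_sq][sq]
        score += pawn_king_pst[ENEMY_KING][enemy_king_sq][sq]
    return score
```

This factored form (two 64x64 tables) is much smaller than the full PST[piece][square][king] table discussed earlier in the thread, which may be why it was tunable at all.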
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Tony P.
Posts: 216
Joined: Sun Jan 22, 2017 8:30 pm
Location: Russia

Re: Evaluation & Tuning in Chess Engines

Post by Tony P. »

AndrewGrant wrote: Sat Oct 03, 2020 9:58 pm I managed to tune an array of [0][our king sq][pawn], [1][their king sq][pawn], but it was supposed to just be a joke.
You have a talent for making jokes with a massive strength gain potential. Please keep joking!