I am working on the search component of a new engine, and have been testing it using tactical test suites (since I don't care about eval yet).
The problem is that, a lot of the time, scores on test suites do not correspond to playing strength.
For example, I introduced a severe bug into futility pruning at one point (forgetting to negate the material score), and the number of solved problems actually went way up, while playing strength obviously went down the toilet (at least -500 Elo).
It turned out the engine was doing pretty much random pruning, and the increased depth solved more problems by luck than it lost to blunders.
Obviously, the most reliable way to measure strength is to play a lot of games. But I have been thinking that it may be possible to do more useful testing with test suites.
The problem with test suites is that all wrong solutions are scored the same.
If a position has 5 possible moves, with scores
300, 5, 5, 0, -500
The best move would be the one that wins a knight. However, a typical test suite run would give the same score to an engine that picked one of the score=5 moves as to an engine that picked the move that lost a rook.
I am thinking about taking a bunch of random positions from lower-level games (since the quality of the games doesn't really matter, and at higher levels, players resign earlier and would not give the engine enough exposure to one-sided positions).
Then I would have a strong engine analyze all possible moves from each position, and record all the scores.
Then, on a test suite run, we take the move the engine picks and compare its score to the score of the best move. The differences can be summed to give the final result of the test run (lower is better).
With the example above (the serious bug), the engine would get a very bad result due to its serious blunders, and the fact that it happens to pick more best moves would be inconsequential.
This can also be used to test risky pruning techniques, etc., to see if they help on average.
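The scoring scheme above could be sketched like this (a minimal illustration, not an actual tool; the move names are invented for the example):

```python
def position_penalty(move_scores, chosen_move):
    """Penalty = score of the best move minus score of the chosen move."""
    best = max(move_scores.values())
    return best - move_scores[chosen_move]

def suite_penalty(per_position_choices):
    """Sum the penalties over all positions; lower is better."""
    return sum(position_penalty(scores, choice)
               for scores, choice in per_position_choices)

# The example position above: five moves scoring 300, 5, 5, 0, -500
# (move names are made up for illustration).
scores = {"Nxd4": 300, "a3": 5, "h3": 5, "Kh1": 0, "Rxd4": -500}
print(position_penalty(scores, "a3"))    # 295: missed the knight win
print(position_penalty(scores, "Rxd4"))  # 800: the blunder is punished much harder
```

Under this metric a single blunder costs far more than several missed best moves, which is exactly the property a solved/not-solved count lacks.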
Has anyone been doing similar things?
More details from test suites
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
-
- Posts: 2204
- Joined: Sat Jan 18, 2014 10:24 am
- Location: Andorra
Re: More details from test suites
I did a relatively similar test some time ago.
I created a file of FEN positions followed by acceptable moves (analyzed by Stockfish), for example all the moves that were at most 20cp worse than the best one.
Then I used it to tune the parameters of pruning, razoring, null move, etc., testing those parameters with random values, so as to increase speed while losing as little strength as possible.
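That acceptability criterion could be sketched roughly like this (function name, margin, and the example scores are mine, not from Daniel's actual code):

```python
def acceptable_moves(move_scores, margin_cp=20):
    """A move counts as correct if its score is within margin_cp
    centipawns of the best move's score."""
    best = max(move_scores.values())
    return {m for m, s in move_scores.items() if best - s <= margin_cp}

# Hypothetical scores for four moves in some position: only the two
# moves within 20cp of the best one are accepted.
print(acceptable_moves({"e4": 35, "d4": 30, "Nf3": 10, "g4": -80}))
```

A set of acceptable moves per position lets a tuning run count near-best moves as correct, instead of insisting on the single best move.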
Daniel José - http://www.andscacs.com
-
- Posts: 64
- Joined: Fri Oct 18, 2013 11:40 pm
- Location: New York
Re: More details from test suites
I use the "Strategic Test Suite" (STS). It is big (1,000 positions), and moves are scored according to their value. I usually run it at 0.5 seconds/position, for 500 seconds of total run time.
My results on that test match playing strength reasonably closely (the match is not perfect, so it does not work for small changes).
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: More details from test suites
ymatioun wrote: I use "strategic test suite". It is big (1,000 positions), and moves are scored according to their value. I usually run it at 0.5 second/position, for 500 seconds total run time. My results on that test match reasonably close to playing strength (the match is not perfect, so it does not work for small changes).
I didn't know something like that existed. That's awesome!
EDIT: I just took a quick look. The problem with using STS is that it doesn't have scores for bad moves, e.g. a passive move would still get the same score as throwing away a queen.
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: More details from test suites
I used test suites for years, tuning for more solutions. That wasn't entirely futile but it was largely wasted effort because of the poor correlation with games.
Your method sounds plausible, but may still do poorly unless you have a large collection of positions for tuning. I think one of the core issues with test suites is that billions of positions can be visited during a game, but most test suites are small, so you are tuning against only the tiniest sample of those possible positions. On top of that, many common test suites consist of uncommon positions where there is a hidden deep solution move not found by shallow search.
--Jon
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: More details from test suites
jdart wrote: I used test suites for years, tuning for more solutions. That wasn't entirely futile but it was largely wasted effort because of the poor correlation with games.
Your method sounds plausible, but still may do poorly unless you have a large collection of positions for tuning. I think one of the core issues with test suites is that billions of positions can be visited during a game, but most test suites are small, so you are getting tuning against only the tiniest sample of those possible positions. On top of that many common test suites have uncommon positions where there is a hidden deep solution move not found by shallow search.
--Jon
I think the biggest problem is that test suites don't punish blunders nearly enough, so they would strongly favour aggressive pruning.
For example, a change that makes an engine find 20 more solutions out of 200, but blunder 10 more times, will be favoured by the test suite while being detrimental to game play.
I guess the optimal combination would be mostly normal positions + a few positions from test suites. That way deep solutions are rewarded, but only if they don't significantly increase number of blunders.
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: More details from test suites
Code:
8446 matthew 20 0 231564 138404 1464 S 100.5 0.2 1:03.53 stockfish
8458 matthew 20 0 231564 136376 1468 S 100.5 0.2 1:03.52 stockfish
8468 matthew 20 0 231564 136360 1464 S 100.5 0.2 1:03.50 stockfish
8469 matthew 20 0 231564 136356 1468 S 100.5 0.2 1:03.52 stockfish
8470 matthew 20 0 231564 136364 1468 S 100.5 0.2 1:03.52 stockfish
8456 matthew 20 0 231564 136372 1468 S 100.2 0.2 1:03.52 stockfish
8460 matthew 20 0 231564 136376 1468 S 100.2 0.2 1:03.52 stockfish
8461 matthew 20 0 231564 136364 1468 S 100.2 0.2 1:03.52 stockfish
8462 matthew 20 0 231564 138340 1464 S 100.2 0.2 1:03.54 stockfish
8465 matthew 20 0 231564 136376 1468 S 100.2 0.2 1:03.52 stockfish
8466 matthew 20 0 231564 136364 1468 S 100.2 0.2 1:03.52 stockfish
8467 matthew 20 0 231564 136364 1468 S 100.2 0.2 1:03.52 stockfish
8471 matthew 20 0 231564 138408 1468 S 100.2 0.2 1:03.51 stockfish
8472 matthew 20 0 231564 136360 1468 S 100.2 0.2 1:03.53 stockfish
8473 matthew 20 0 231564 136364 1468 S 100.2 0.2 1:03.53 stockfish
8474 matthew 20 0 231564 136360 1468 S 100.2 0.2 1:03.50 stockfish
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: More details from test suites
Aaaand it's done.
http://matthewlai.ca/random_positions.scores
These are 500 random positions from the gm2006 database. I decided not to bother with lower-level games, since there is surprisingly quite a bit of variety in these positions already.
At most one position is selected from each game, to minimize correlation, and duplicate positions (from early opening moves) are removed. Otherwise, all positions are selected completely at random.
Each position begins with the FEN on its own line.
Next line is number of legal moves from that position.
This is followed by one line for each move, giving the score after making that move.
Scores are from Stockfish, with 30 seconds of analysis per position on a Xeon E5-2670. This was run on Amazon's older-generation cc2.8xlarge instances with 16 processes, so I had the whole physical machine and there shouldn't have been any other load on it.
I initially thought about adding some positions from tactical test suites to reward deeper solutions, but decided against it, since I want the set to be an accurate representation of positions from actual games, and what better sampling method is there than random sampling from actual games? Biasing it towards tactical positions would probably encourage risky pruning.
Next - writing a tool to run the test suite with any engine.
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: More details from test suites
Some results with random engines I have around in my gauntlet:
Stockfish did the best, but that's not surprising since Stockfish also developed the solutions (though at 30 seconds per move).
If you want to run it on your engine:
1. Make sure you have Mercurial and GCC 4.9 or LLVM installed. GCC 4.8 has horribly broken <regex>. This tool only works on Linux/OSX.
2. "hg clone https://bitbucket.org/waterreaction/chessenginetools"
3. Go into searchtestrun
4. "make"
5. "./searchtestrun random_positions.scores <engine directory> <max nodes> <max time (seconds)>"
<engine directory> is absolute or relative path to your engine. If your engine is xboard, create an "engine_def.txt" file in the directory, with one line "exec <executable>".
If your engine is UCI, copy polyglot binary to the directory, and just use "exec polyglot".
Max nodes and max time are whichever is reached first.
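The per-position penalties in the results below are summarized as an average plus a histogram; the binning could look something like this (bin edges copied from the output, the rest is my own guess at the aggregation):

```python
# Upper bounds of the penalty bins reported by the tool:
# [0, 0], (0, 60], (60, 140], (140, 450], (450, 750], (750, 1000]
UPPER = [0, 60, 140, 450, 750, 1000]

def bin_penalties(penalties):
    """Count each per-position penalty into the first bin whose upper
    bound it does not exceed (penalties above 1000 are ignored here)."""
    counts = [0] * len(UPPER)
    for p in penalties:
        for i, hi in enumerate(UPPER):
            if p <= hi:
                counts[i] += 1
                break
    return counts
```

The first bin counts positions where the engine found a best move exactly; the last bins isolate outright blunders, which is the information a plain solved/unsolved count throws away.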
This is at 0.1 second per position.
Crafty
Average node count: 481728
Average plies reached: 12.04
Average NPS: 4.13285e+06
Average penalty: 31.314
Penalty bins:
[0, 0]: 263
(0, 60]: 189
(60, 140]: 28
(140, 450]: 14
(450, 750]: 0
(750, 1000]: 6
Stockfish
Average node count: 210545
Average plies reached: 13.15
Average NPS: 1.77024e+06
Average penalty: 25.266
Penalty bins:
[0, 0]: 305
(0, 60]: 162
(60, 140]: 19
(140, 450]: 8
(450, 750]: 0
(750, 1000]: 6
Greko
Average node count: 270613
Average plies reached: 9.30862
Average NPS: 2.31583e+06
Average penalty: 45.6934
Penalty bins:
[0, 0]: 245
(0, 60]: 196
(60, 140]: 26
(140, 450]: 19
(450, 750]: 3
(750, 1000]: 10
Diablo
Average node count: 140827
Average plies reached: 7.526
Average NPS: 1.31305e+06
Average penalty: 50.01
Penalty bins:
[0, 0]: 242
(0, 60]: 193
(60, 140]: 35
(140, 450]: 12
(450, 750]: 8
(750, 1000]: 10
RobboLito
Average node count: 278180
Average plies reached: 11.778
Average NPS: 2.38186e+06
Average penalty: 31.01
Penalty bins:
[0, 0]: 287
(0, 60]: 177
(60, 140]: 18
(140, 450]: 8
(450, 750]: 3
(750, 1000]: 7
Giraffe (my new engine, ~2000)
Average node count: 158418
Average plies reached: 6.836
Average NPS: 1.80894e+06
Average penalty: 64.528
Penalty bins:
[0, 0]: 198
(0, 60]: 204
(60, 140]: 52
(140, 450]: 26
(450, 750]: 8
(750, 1000]: 12
Brainless (my old engine, ~2200)
Average node count: 239766
Average plies reached: 7.518
Average NPS: 2.44139e+06
Average penalty: 61.034
Penalty bins:
[0, 0]: 212
(0, 60]: 204
(60, 140]: 37
(140, 450]: 27
(450, 750]: 10
(750, 1000]: 10