cdani wrote:
So here it is:
www.andscacs.com/varis/test_quiesce.zip
Contains:
Stockfish_quiesce.exe. Always searches all the root moves, calls quiesce with an open window on each, and plays the best move.
Andscacs_quiesce.exe. The same for Andscacs.
The modified Stockfish source is included. I have changed only search.cpp. The changes are marked with //170523. If anyone finds an error, just tell me and I will do a new version.
I also included test.bat, a cutechess-cli bat file that I used to run the test that generated the included pgn file.
The search was at fixed depth 1. The result was:
  #  PLAYER                      :  RATING  ERROR  POINTS  PLAYED    (%)
  1  Stockfish 230517 64 POPCNT  :  2900.7    5.0  1907.0    3052  62.5%
  2  Andscacs 0.91028            :  2811.3    5.0  1145.0    3052  37.5%

Details like whether you include checks in qsearch, whether you prune "losing" moves in qsearch, etc. have a huge effect on one ply search results. Unless these things are all very similar, it's not a test of eval. Also check extension rules play a huge role even at one ply.
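As a cross-check on the result table, the rating gap can be estimated directly from the score fraction with the standard logistic Elo formula. This is only a sketch: the function name `elo_diff` is made up for illustration, and the rating tool that produced the table may fit ratings differently, so a small deviation from the tabulated 2900.7 - 2811.3 is expected.

```python
import math

def elo_diff(score_fraction: float) -> float:
    """Elo difference implied by a score fraction (0 < score_fraction < 1),
    using the usual logistic model with a 400-point scale."""
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

# Points and games taken from the table: Stockfish scored 1907.0 out of 3052.
stockfish_fraction = 1907.0 / 3052
gap = elo_diff(stockfish_fraction)
print(f"score {stockfish_fraction:.1%} -> about {gap:.0f} Elo")
```

With the table's numbers this comes out at roughly 89 Elo, in line with the reported ratings.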
Another attempt at comparing Evals ELO-wise
-
- Posts: 5960
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
Re: Another attempt at comparing Evals ELO-wise
Komodo rules!
-
- Posts: 2204
- Joined: Sat Jan 18, 2014 10:24 am
- Location: Andorra
Re: Another attempt at comparing Evals ELO-wise
lkaufman wrote: Details like whether you include checks in qsearch, whether you prune "losing" moves in qsearch, etc. have a huge effect on one ply search results. Unless these things are all very similar, it's not a test of eval. Also check extension rules play a huge role even at one ply.

Of course. I tried to make the two executables equal:
* At root search do not prune/extend.
* At first qsearch depth do not generate checks, unless a capture is a check.
* Qsearch futility pruning is in effect.
Probably the two versions are not 100% equal, but they should be mostly.
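
For readers who want to see the shape of such a test harness, here is a minimal sketch of "search every root move, call quiesce with an open window, play the best one". It is not the code from either engine: the `Node` class is a hypothetical toy position holding a static eval and a list of moves, and the sketch leaves out check generation and qsearch futility pruning entirely.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

INF = 10**9

@dataclass
class Node:
    # Static eval in centipawns, from the point of view of the side to move.
    static_eval: int
    # Each move is (name, is_capture, resulting position).
    moves: List[Tuple[str, bool, "Node"]] = field(default_factory=list)

def qsearch(node: Node, alpha: int, beta: int) -> int:
    """Captures-only quiescence search with stand-pat (fail-soft)."""
    best = node.static_eval          # stand pat: the side to move may decline to capture
    if best >= beta:
        return best
    alpha = max(alpha, best)
    for _name, is_capture, child in node.moves:
        if not is_capture:
            continue                 # quiet moves are not searched in qsearch
        score = -qsearch(child, -beta, -alpha)
        if score > best:
            best = score
        if score >= beta:
            return score
        alpha = max(alpha, score)
    return best

def one_ply_best_move(root: Node) -> Tuple[str, int]:
    """Try every root move (no pruning, no extensions) and score each one
    with an open-window qsearch; return the best move and its score."""
    best_name, best_score = "", -INF
    for name, _is_capture, child in root.moves:
        score = -qsearch(child, -INF, INF)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Toy position: "grab" wins a pawn but allows a recapture that loses a knight.
after_recapture = Node(static_eval=-200)
after_grab = Node(static_eval=-100, moves=[("recapture", True, after_recapture)])
after_quiet = Node(static_eval=-20)
root = Node(static_eval=0, moves=[("grab", True, after_grab), ("quiet", False, after_quiet)])
print(one_ply_best_move(root))
```

In the toy position a pure static eval after one ply would prefer "grab", while the qsearch version sees the recapture and prefers "quiet". That is the kind of difference that makes the qsearch details matter for a depth 1 comparison.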
Daniel José - http://www.andscacs.com
-
- Posts: 1494
- Joined: Thu Mar 30, 2006 2:08 pm
Re: Oops...
Stockfish uses very aggressive move count pruning even in PV nodes. Basically at low depths Stockfish looks at captures, checks, advanced pawn moves, and that is about it (around 2-3 moves, a number that increases with depth). So in a 2 ply search it looks at the root moves, then just a few responses by the opponent. This is very fast, but not accurate in shallow searches. Since move count pruning requires a lot of search information to generate the history tables that order non-captures, move count pruning is not good at very shallow searches. But it must get a lot better for them at more realistic search depths.
You can modify Stockfish to take that out. We have experimented a bit to try to determine how good the evaluations are. Basically making the programs into non-pruning full width searches with no or identical extensions. Of course, eval is also used by pruning, so they go together, making the eval comparison not quite fair. Tuning the eval to do best in a full width program might not perform as well in one that uses the evaluation to prune (futility and static null move pruning).
It is all about time. How much time does your pruning save, and how much time does one extra evaluation feature take. Nothing seems to do better than a lot of timed games.
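
To make the move count pruning idea concrete, here is a rough sketch. The limit formula `2 + depth` and the function names are invented for illustration and are not Stockfish's actual values. Real engines also apply further conditions (history scores, whether the side is in check, and so on) that are left out here.

```python
from typing import List, Tuple

def quiet_move_limit(depth: int) -> int:
    """Hypothetical limit on the number of quiet moves searched at a given
    remaining depth; it grows with depth, as described above."""
    return 2 + depth

def moves_to_search(ordered_moves: List[Tuple[str, bool]], depth: int) -> List[str]:
    """ordered_moves is a list of (name, is_tactical) in move-ordering order.
    Tactical moves (captures, checks, advanced pawn pushes) are always kept;
    quiet moves are kept only until the limit is reached."""
    limit = quiet_move_limit(depth)
    kept: List[str] = []
    quiets_seen = 0
    for name, is_tactical in ordered_moves:
        if is_tactical:
            kept.append(name)
        elif quiets_seen < limit:
            kept.append(name)
            quiets_seen += 1
        # Remaining quiet moves are pruned without being searched.
    return kept

moves = [("Bxh5", True), ("Qa4+", True), ("Nf3", False), ("Nd2", False),
         ("Be3", False), ("a3", False), ("h4", False), ("Kf1", False)]
print(moves_to_search(moves, depth=1))
print(moves_to_search(moves, depth=4))
```

The quality of this pruning depends entirely on the move ordering, which is why it works poorly when there is no history information yet, as in a very shallow search.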
-
- Posts: 6994
- Joined: Thu Aug 18, 2011 12:04 pm
Re: Oops...
In a utopian world programmers would take a couple of days, replace the TSCP evaluation with their own, and release the executable, so that fair comparisons could be made.
I once did it for fun, see: http://rebel13.nl/misc/efs/tscp.html
-
- Posts: 2204
- Joined: Sat Jan 18, 2014 10:24 am
- Location: Andorra
Re: Oops...
Rebel wrote:
In an utopian world programmers take a couple of days and replace the TSCP evaluation with their own and release the executable and fair comparisons can be made.
I once did it for fun, see: http://rebel13.nl/misc/efs/tscp.html

Yes. Good work is never easy. But for a more reasonable estimate I think that was enough.
Daniel José - http://www.andscacs.com
-
- Posts: 5960
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
Re: Another attempt at comparing Evals ELO-wise
cdani wrote:
lkaufman wrote: Details like whether you include checks in qsearch, whether you prune "losing" moves in qsearch, etc. have a huge effect on one ply search results. Unless these things are all very similar, it's not a test of eval. Also check extension rules play a huge role even at one ply.
Of course. I tried to make the two executables equal:
* At root search do not prune/extend.
* At first qsearch depth do not generate checks, unless a capture is a check.
* Qsearch futility pruning is in effect.
Probably the two versions are not 100% equal, but they should be mostly.

I wouldn't be too concerned about the 90 elo gap you report from SF at one ply, even if the searches are identical. The evals of all of the top engines are totally unsuitable for one ply games. Similarly, an absolutely optimum one-ply eval would probably be more than a hundred elo worse than what top engines currently use for games with reasonable time limits. Are your weights for things like potential checks, or threats to enemy pieces, significantly higher or lower than those of SF? If so, that could account for the one-ply results.
Komodo rules!
-
- Posts: 6052
- Joined: Tue Jun 12, 2012 12:41 pm
Re: Another attempt at comparing Evals ELO-wise
cdani wrote:
lkaufman wrote: Details like whether you include checks in qsearch, whether you prune "losing" moves in qsearch, etc. have a huge effect on one ply search results. Unless these things are all very similar, it's not a test of eval. Also check extension rules play a huge role even at one ply.
Of course. I tried to make the two executables equal:
* At root search do not prune/extend.
* At first qsearch depth do not generate checks, unless a capture is a check.
* Qsearch futility pruning is in effect.
Probably the two versions are not 100% equal, but they should be mostly.

thanks Daniel.
great work!
it is good that a knowledgeable person has put their hand to the work.
so, SF is beating Andscacs quite convincingly. it simply could not be the other way round. In my tests, it was probably quiescence that turned things upside down.
But we still do not know which engine, Komodo or SF, has the best eval.
PS. I also share Larry's opinion that it is impossible to measure eval objectively, without any distortions, at depth 1. But you did most of it.
-
- Posts: 6052
- Joined: Tue Jun 12, 2012 12:41 pm
Re: Another attempt at comparing Evals ELO-wise
and that is the level of play of SF and Komodo at depth 1 without quiescence search:
[pgn][Event "OWNER-PC, 1Ply / 1Ply"]
[Site "Microsoft"]
[Date "2017.05.24"]
[Round "1"]
[White "Komodo 10.1 64-bit"]
[Black "Stockfish 8 64 POPCNT"]
[Result "0-1"]
[ECO "B22"]
[PlyCount "58"]
1. e4 {B 0} c5 {B 0} 2. c3 {B 0} Nf6 {B 0 Both last book move} 3. d3 {0.67/1 0}
d5 {0.00/1 0} 4. e5 {0.89/1 0} Ng4 {-0.42/1 0} 5. d4 {1.49/1 0} Bf5 {-0.01/1 0
(cxd4)} 6. Be2 {1.39/1 0} h5 {0.10/1 0} 7. h3 {1.83/1 0} Nh6 {1.88/1 0} 8. dxc5
{2.61/1 0 (Bxh5)} Nc6 {0.45/1 0} 9. Qa4 {1.09/1 0} e6 {0.20/1 0} 10. b4 {1.09/
1 0} Qd7 {0.18/1 0} 11. Nf3 {1.09/1 0 (Bxh5)} Nxe5 {-0.16/1 0} 12. Qxd7+ {0.26/
1 0} Nxd7 {-0.16/1 0} 13. Nd4 {0.55/1 0} Bg6 {-0.05/1 0} 14. Bb5 {0.48/1 0} e5
{-0.29/1 0} 15. Nf3 {0.34/1 0} a6 {-0.31/1 0} 16. Bxd7+ {1.61/1 0} Kxd7 {1.52/
1 0} 17. Nxe5+ {1.61/1 0} Kc7 {1.52/1 0 (Ke6)} 18. Na3 {1.59/1 0 (Nxg6)} Re8 {
0.08/1 0} 19. Bf4 {0.78/1 0} f6 {-0.16/1 0} 20. Bxh6 {-0.55/1 0} Rxh6 {-0.93/1
0 (fxe5)} 21. O-O-O {-2.93/1 0} Rxe5 {-3.22/1 0} 22. f4 {-2.46/1 0} Re2 {-3.43/
1 0} 23. Rxd5 {-2.89/1 0} Rxa2 {-3.43/1 0} 24. f5 {-2.71/1 0} Bf7 {-3.71/1 0
(Rxa3)} 25. Rd3 {-6.54/1 0} Rxa3 {-6.37/1 0} 26. Rhd1 {-6.31/1 0} Ra1+ {-6.01/
1 0} 27. Kc2 {-7.16/1 0} Ra2+ {-6.77/1 0} 28. Kc1 {-7.16/1 0} Bb3 {-6.37/1 0
(Rxg2)} 29. R1d2 {-6.62/1 0 (Rd7+)} Rxd2 {-6.35/1 0} 0-1
[Event "OWNER-PC, 1Ply / 1Ply"]
[Site "Microsoft"]
[Date "2017.05.24"]
[Round "2"]
[White "Stockfish 8 64 POPCNT"]
[Black "Komodo 10.1 64-bit"]
[Result "0-1"]
[ECO "B50"]
[PlyCount "32"]
{600MB, CM8000.ctg, OWNER-PC} 1. e4 {B 0} c5 {B 0} 2. c3 {B 0} Nf6 {B 0 Both
last book move} 3. d3 {0.80/1 0} Nc6 {0.14/1 0} 4. Nf3 {0.83/1 0} d6 {-0.06/1 0
} 5. Nbd2 {0.19/1 0} Bg4 {-0.21/1 0} 6. h3 {0.19/1 0} Be6 {0.00/1 0} 7. d4 {0.
72/1 0} cxd4 {0.11/1 0} 8. cxd4 {0.72/1 0 (Nxd4)} Qa5 {-0.10/1 0} 9. d5 {1.33/
1 0} Nxd5 {0.85/1 0} 10. exd5 {1.17/1 0} Bxd5 {0.85/1 0} 11. a3 {1.29/1 0} e5 {
0.65/1 0} 12. b4 {2.23/1 0} Nxb4 {0.93/1 0} 13. Nc4 {1.29/1 0} Nc2+ {-15.11/1 0
} 14. Ke2 {-17.22/1 0} Bxc4+ {-15.11/1 0} 15. Qd3 {-17.22/1 0} Bxd3+ {-15.11/1
0} 16. Kxd3 {-17.22/1 0} Rc8 {-15.14/1 0 (Nxa1)} 0-1
[/pgn]
around 1200-1500 elo strength.
I guess any patzer out there would be able to beat them.
so the theory that all engines currently know is captures and checks should be more or less confirmed.
I can almost bet the strongest engine in the not very distant future will be a single-plier with no quiescence search; it is seemingly the best way to proceed forward.
-
- Posts: 6052
- Joined: Tue Jun 12, 2012 12:41 pm
Re: Another attempt at comparing Evals ELO-wise
[d]rnbqkb1r/pp2pppp/8/2ppP3/3P2n1/2P5/PP3PPP/RNBQKBNR b KQkq - 0 5
Komodo evaluates this position with 150cps white advantage, and there are no captures in sight.
scores throughout the game regularly jump by 1-2 full pawns.
-
- Posts: 6052
- Joined: Tue Jun 12, 2012 12:41 pm
Re: Another attempt at comparing Evals ELO-wise
[d]r3kb1r/pp3ppp/3p4/q2bp3/1nN5/P4N1P/5PP1/R1BQKB1R b KQkq - 1 13
and here, SF has just played Nd2-c4, with 130cps white edge eval.
one move later, SF assessment plunges down to -1730cps disedge!
19 full pawns discrepancy in a single move.
long live SF!
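
The size of that jump can be read straight off the PGN comments. Below is a small sketch that pulls the scores out of moves 12 to 16 of the second game (copied from the PGN above, with the line wrapping removed) and reports the largest change between two consecutive White evaluations. The regex and the helper name `largest_swing` are just for illustration, and the sketch assumes every move carries a score comment.

```python
import re
from typing import List

# Moves 12-16 of game 2 (Stockfish is White), with score/depth comments.
snippet = (
    "12. b4 {2.23/1 0} Nxb4 {0.93/1 0} 13. Nc4 {1.29/1 0} Nc2+ {-15.11/1 0} "
    "14. Ke2 {-17.22/1 0} Bxc4+ {-15.11/1 0} 15. Qd3 {-17.22/1 0} "
    "Bxd3+ {-15.11/1 0} 16. Kxd3 {-17.22/1 0} Rc8 {-15.14/1 0 (Nxa1)}"
)

def largest_swing(evals: List[float]) -> float:
    """Largest absolute change between consecutive evaluations, in pawns."""
    return max(abs(b - a) for a, b in zip(evals, evals[1:]))

# Scores appear in move order: White's comment, then Black's, and so on.
scores = [float(s) for s in re.findall(r"\{\s*(-?\d+\.\d+)/\d+", snippet)]
white_evals = scores[0::2]   # every other score belongs to White (Stockfish)
print(white_evals)
print(f"largest swing: {largest_swing(white_evals):.2f} pawns")
```

For this fragment the largest swing is between 13. Nc4 (1.29) and 14. Ke2 (-17.22), about 18.5 pawns.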