Another attempt at comparing Evals ELO-wise

lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Another attempt at comparing Evals ELO-wise

Post by lkaufman »

cdani wrote:So here it is:

www.andscacs.com/varis/test_quiesce.zip

Contains:

Stockfish_quiesce.exe. It always searches all the root moves, calls quiesce with an open window on each, and simply plays the best-scoring move.

Andscacs_quiesce.exe. The same for Andscacs.

The modified Stockfish source is included. I have changed only search.cpp. The changes are marked with //170523. If anyone finds an error, just tell me and I will make a new version.

I also included test.bat, the cutechess-cli batch file that I used to run the test that generated the included pgn file.

The search was at fixed depth 1. The result was:

 # PLAYER                        : RATING  ERROR   POINTS  PLAYED    (%)
 1 Stockfish 230517 64 POPCNT    : 2900.7    5.0   1907.0    3052   62.5%
 2 Andscacs 0.91028              : 2811.3    5.0   1145.0    3052   37.5%
Details like whether you include checks in qsearch, whether you prune "losing" moves in qsearch, etc. have a huge effect on one ply search results. Unless these things are all very similar, it's not a test of eval. Also check extension rules play a huge role even at one ply.
Komodo rules!
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Another attempt at comparing Evals ELO-wise

Post by cdani »

lkaufman wrote: Details like whether you include checks in qsearch, whether you prune "losing" moves in qsearch, etc. have a huge effect on one ply search results. Unless these things are all very similar, it's not a test of eval. Also check extension rules play a huge role even at one ply.
Of course. I tried to make the two executables equal:

* At root search do not prune/extend.
* At first qsearch depth do not generate checks, unless a capture is a check.
* Qsearch futility pruning is in effect.

Probably the two versions are not 100% equal, but they should be mostly.
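To make the setup above concrete, here is a minimal sketch on a toy game tree. It is not the code from the modified search.cpp: `Node`, `qsearch` and `root_search` are hypothetical names, the tree is hand-built, and the quiescence shown (stand pat, then captures only, no checks) is just a common textbook form, without the futility pruning mentioned in the list.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

INF = 10**9

@dataclass
class Node:
    """Toy position; static_eval is in centipawns from the side to move's view."""
    static_eval: int
    captures: List["Node"] = field(default_factory=list)  # positions after captures
    quiets: List["Node"] = field(default_factory=list)    # positions after quiet moves

def qsearch(node: Node, alpha: int, beta: int) -> int:
    """Negamax quiescence: stand pat, then try captures only (no checks)."""
    stand_pat = node.static_eval
    if stand_pat >= beta:
        return stand_pat
    alpha = max(alpha, stand_pat)
    best = stand_pat
    for child in node.captures:
        score = -qsearch(child, -beta, -alpha)
        best = max(best, score)
        if score >= beta:
            break
        alpha = max(alpha, score)
    return best

def root_search(root: Node) -> Tuple[int, int]:
    """Depth 1: try every root move, no pruning or extensions, open-window qsearch."""
    best_index, best_score = -1, -INF
    moves = root.captures + root.quiets
    for i, child in enumerate(moves):
        score = -qsearch(child, -INF, INF)  # open window on every root move
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score

# Root move 0 wins a pawn but allows a recapture that leaves us 200cp down.
# Root move 1 is quiet and leaves us 20cp up.
after_recapture = Node(static_eval=-200)
after_capture = Node(static_eval=-100, captures=[after_recapture])
after_quiet = Node(static_eval=-20)
root = Node(static_eval=0, captures=[after_capture], quiets=[after_quiet])
print(root_search(root))
```

In this toy case the quiet move is chosen, because quiescence sees the recapture. Whether qsearch also looks at checks or prunes "losing" captures changes which leaf evals are ever compared, which is why those details matter for a one-ply test.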
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Oops...

Post by mjlef »

Stockfish uses very aggressive move count pruning even in PV nodes. Basically, at low depths Stockfish looks at captures, checks, advanced pawn moves, and that is about it (around 2-3 moves, a number which increases with depth). So in a 2 ply search it looks at the root moves, then just a few responses by the opponent. This is very fast, but not accurate in shallow searches. Since move count pruning relies on a lot of search information to build the history tables that order non-captures, it does not work well at very shallow depths. But it must get a lot better for them at more realistic search depths.
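As a rough illustration of the idea, here is a sketch of move count pruning. The threshold formula and the names `Move`, `quiet_move_limit` and `select_moves` are assumptions made up for this example; they are not Stockfish's actual code or numbers.

```python
from typing import List, NamedTuple

class Move(NamedTuple):
    name: str
    tactical: bool  # capture, check, or advanced pawn push

def quiet_move_limit(depth: int) -> int:
    """Hypothetical threshold: small at low depth, growing with depth."""
    return 2 + depth * depth

def select_moves(ordered_moves: List[Move], depth: int) -> List[Move]:
    """Keep every tactical move; keep quiet moves only up to the limit."""
    limit = quiet_move_limit(depth)
    kept: List[Move] = []
    quiets_seen = 0
    for move in ordered_moves:
        if move.tactical:
            kept.append(move)
            continue
        quiets_seen += 1
        if quiets_seen <= limit:
            kept.append(move)
    return kept

# Moves are assumed to be already ordered (in a real engine, by history tables).
moves = [
    Move("Qxd5", True),
    Move("Nf3", False),
    Move("Bc4", False),
    Move("a3", False),
    Move("h4", False),
    Move("Rb1", False),
]
print([m.name for m in select_moves(moves, depth=1)])
print([m.name for m in select_moves(moves, depth=2)])
```

With poor move ordering, the quiet moves that survive the cut are close to arbitrary, which fits the point that this kind of pruning is weak in very shallow searches.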

You can modify Stockfish to take that out. We have experimented a bit to try to determine how good the evaluations are, basically by turning the programs into non-pruning full-width searches with either no extensions or identical ones. Of course, eval is also used by pruning, so they go together, making the eval comparison not quite fair. An eval tuned to do best in a full-width program might not perform as well in one that uses the evaluation to prune (futility and static null move pruning).
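A small sketch of how the static eval feeds those two pruning decisions. The margin of 150 centipawns per ply and the function names are assumptions for illustration only; real engines tune their own margins and conditions.

```python
def futility_margin(depth: int) -> int:
    """Hypothetical margin in centipawns; not taken from any real engine."""
    return 150 * depth

def static_null_move_prunes(static_eval: int, beta: int, depth: int) -> bool:
    """Static null move pruning: eval is so far above beta that we cut off."""
    return static_eval - futility_margin(depth) >= beta

def futility_prunes_quiet(static_eval: int, alpha: int, depth: int) -> bool:
    """Futility pruning: eval is so far below alpha that quiet moves are skipped."""
    return static_eval + futility_margin(depth) <= alpha

print(static_null_move_prunes(static_eval=400, beta=100, depth=1))
print(futility_prunes_quiet(static_eval=-300, alpha=0, depth=3))
```

Both tests compare the raw eval against the search window, so an eval with larger swings prunes more often at the same margins. That is one way the eval and the pruning end up tuned together.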

It is all about time. How much time does your pruning save, and how much time does one extra evaluation feature take? Nothing seems to do better than a lot of timed games.
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: Oops...

Post by Rebel »

In a utopian world programmers would take a couple of days, replace the TSCP evaluation with their own and release the executable, so that fair comparisons could be made.

I once did it for fun, see: http://rebel13.nl/misc/efs/tscp.html
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Oops...

Post by cdani »

Rebel wrote:In a utopian world programmers would take a couple of days, replace the TSCP evaluation with their own and release the executable, so that fair comparisons could be made.

I once did it for fun, see: http://rebel13.nl/misc/efs/tscp.html
Yes. Good work is never easy :-) But for getting a more reasonable estimate I think that was enough.
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Another attempt at comparing Evals ELO-wise

Post by lkaufman »

cdani wrote:
lkaufman wrote: Details like whether you include checks in qsearch, whether you prune "losing" moves in qsearch, etc. have a huge effect on one ply search results. Unless these things are all very similar, it's not a test of eval. Also check extension rules play a huge role even at one ply.
Of course. I tried to make the two executables equal:

* At root search do not prune/extend.
* At first qsearch depth do not generate checks, unless a capture is a check.
* Qsearch futility pruning is in effect.

Probably the two versions are not 100% equal, but they should be mostly.
I wouldn't be too concerned about the 90 elo gap you report from SF at one ply, even if the searches are identical. The evals of all of the top engines are totally unsuitable for one ply games. Similarly an absolutely optimum one-ply eval would probably be more than a hundred elo worse than what top engines use currently for games with reasonable time limits. Are your weights for things like potential checks, or threats to enemy pieces, significantly higher or lower than those of SF? If so that could account for the one-ply results.
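For reference, the gap can be cross-checked from the raw score with the usual logistic Elo formula. This is only a sketch: the rating tool that produced the table may use a slightly different model, and `elo_diff` is a name chosen for this example.

```python
import math

def elo_diff(score_fraction: float) -> float:
    """Elo difference implied by a score fraction under the logistic model."""
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

# Figures from the result table: Stockfish scored 1907.0 points in 3052 games.
points, played = 1907.0, 3052
gap = elo_diff(points / played)
print(round(gap, 1))
```

The result lands within about a point of the 89.4 difference between the two ratings in the table (2900.7 and 2811.3).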
Komodo rules!
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Another attempt at comparing Evals ELO-wise

Post by Lyudmil Tsvetkov »

cdani wrote:
lkaufman wrote: Details like whether you include checks in qsearch, whether you prune "losing" moves in qsearch, etc. have a huge effect on one ply search results. Unless these things are all very similar, it's not a test of eval. Also check extension rules play a huge role even at one ply.
Of course. I tried to make the two executables equal:

* At root search do not prune/extend.
* At first qsearch depth do not generate checks, unless a capture is a check.
* Qsearch futility pruning is in effect.

Probably the two versions are not 100% equal, but they should be mostly.
thanks Daniel.
great work!

it is good that a knowledgeable person puts their hand to work.

so, SF is beating Andscacs quite convincingly. it simply could not be the other way round. In my tests, it was probably quiescence that turned things upside down.

But we still do not know what engine, Komodo or SF, has best eval. :)

PS. I am also of Larry's opinion that it is impossible to objectively measure eval without any distortions at depth 1. But you made the most of it.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Another attempt at comparing Evals ELO-wise

Post by Lyudmil Tsvetkov »

and that is the level of play of SF and Komodo at depth 1 without quiescence search:

[pgn][Event "OWNER-PC, 1Ply / 1Ply"]
[Site "Microsoft"]
[Date "2017.05.24"]
[Round "1"]
[White "Komodo 10.1 64-bit"]
[Black "Stockfish 8 64 POPCNT"]
[Result "0-1"]
[ECO "B22"]
[PlyCount "58"]

1. e4 {B 0} c5 {B 0} 2. c3 {B 0} Nf6 {B 0 Both last book move} 3. d3 {0.67/1 0}
d5 {0.00/1 0} 4. e5 {0.89/1 0} Ng4 {-0.42/1 0} 5. d4 {1.49/1 0} Bf5 {-0.01/1 0
(cxd4)} 6. Be2 {1.39/1 0} h5 {0.10/1 0} 7. h3 {1.83/1 0} Nh6 {1.88/1 0} 8. dxc5
{2.61/1 0 (Bxh5)} Nc6 {0.45/1 0} 9. Qa4 {1.09/1 0} e6 {0.20/1 0} 10. b4 {1.09/
1 0} Qd7 {0.18/1 0} 11. Nf3 {1.09/1 0 (Bxh5)} Nxe5 {-0.16/1 0} 12. Qxd7+ {0.26/
1 0} Nxd7 {-0.16/1 0} 13. Nd4 {0.55/1 0} Bg6 {-0.05/1 0} 14. Bb5 {0.48/1 0} e5
{-0.29/1 0} 15. Nf3 {0.34/1 0} a6 {-0.31/1 0} 16. Bxd7+ {1.61/1 0} Kxd7 {1.52/
1 0} 17. Nxe5+ {1.61/1 0} Kc7 {1.52/1 0 (Ke6)} 18. Na3 {1.59/1 0 (Nxg6)} Re8 {
0.08/1 0} 19. Bf4 {0.78/1 0} f6 {-0.16/1 0} 20. Bxh6 {-0.55/1 0} Rxh6 {-0.93/1
0 (fxe5)} 21. O-O-O {-2.93/1 0} Rxe5 {-3.22/1 0} 22. f4 {-2.46/1 0} Re2 {-3.43/
1 0} 23. Rxd5 {-2.89/1 0} Rxa2 {-3.43/1 0} 24. f5 {-2.71/1 0} Bf7 {-3.71/1 0
(Rxa3)} 25. Rd3 {-6.54/1 0} Rxa3 {-6.37/1 0} 26. Rhd1 {-6.31/1 0} Ra1+ {-6.01/
1 0} 27. Kc2 {-7.16/1 0} Ra2+ {-6.77/1 0} 28. Kc1 {-7.16/1 0} Bb3 {-6.37/1 0
(Rxg2)} 29. R1d2 {-6.62/1 0 (Rd7+)} Rxd2 {-6.35/1 0} 0-1

[Event "OWNER-PC, 1Ply / 1Ply"]
[Site "Microsoft"]
[Date "2017.05.24"]
[Round "2"]
[White "Stockfish 8 64 POPCNT"]
[Black "Komodo 10.1 64-bit"]
[Result "0-1"]
[ECO "B50"]
[PlyCount "32"]

{600MB, CM8000.ctg, OWNER-PC} 1. e4 {B 0} c5 {B 0} 2. c3 {B 0} Nf6 {B 0 Both
last book move} 3. d3 {0.80/1 0} Nc6 {0.14/1 0} 4. Nf3 {0.83/1 0} d6 {-0.06/1 0
} 5. Nbd2 {0.19/1 0} Bg4 {-0.21/1 0} 6. h3 {0.19/1 0} Be6 {0.00/1 0} 7. d4 {0.
72/1 0} cxd4 {0.11/1 0} 8. cxd4 {0.72/1 0 (Nxd4)} Qa5 {-0.10/1 0} 9. d5 {1.33/
1 0} Nxd5 {0.85/1 0} 10. exd5 {1.17/1 0} Bxd5 {0.85/1 0} 11. a3 {1.29/1 0} e5 {
0.65/1 0} 12. b4 {2.23/1 0} Nxb4 {0.93/1 0} 13. Nc4 {1.29/1 0} Nc2+ {-15.11/1 0
} 14. Ke2 {-17.22/1 0} Bxc4+ {-15.11/1 0} 15. Qd3 {-17.22/1 0} Bxd3+ {-15.11/1
0} 16. Kxd3 {-17.22/1 0} Rc8 {-15.14/1 0 (Nxa1)} 0-1

[/pgn]

around 1200-1500 elo strength.
I guess any patzer out there would be able to beat them.
so the theory that all engines currently know is captures and checks should be more or less confirmed.

I can almost bet the strongest engine in the not very distant future will be a single-plier with no quiescence search, it is seemingly the best way to proceed forward.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Another attempt at comparing Evals ELO-wise

Post by Lyudmil Tsvetkov »

[d]rnbqkb1r/pp2pppp/8/2ppP3/3P2n1/2P5/PP3PPP/RNBQKBNR b KQkq - 0 5

Komodo evaluates this position with 150cps white advantage, and there are no captures in sight.

scores throughout the game regularly jump by 1-2 full pawns.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Another attempt at comparing Evals ELO-wise

Post by Lyudmil Tsvetkov »

[d]r3kb1r/pp3ppp/3p4/q2bp3/1nN5/P4N1P/5PP1/R1BQKB1R b KQkq - 1 13

and here, SF has just played Nd2-c4, with 130cps white edge eval.

one move later, SF's assessment plunges down to a -1730cps disadvantage!

19 full pawns discrepancy in a single move. :)

long live SF!