Another attempt at comparing Evals ELO-wise

lkaufman · Post by **lkaufman** » Tue May 23, 2017 10:48 pm

cdani wrote:So here it is:

www.andscacs.com/varis/test_quiesce.zip

Contains:

Stockfish_quiesce.exe. It always searches all the root moves, calls quiesce on each with an open window, and simply plays the best move.

Andscacs_quiesce.exe. The same for Andscacs.

The modified Stockfish source is included. I have changed only search.cpp; the changes are marked with //170523. If anyone finds an error, just tell me and I will do a new version.

I also included test.bat, a cutechess-cli bat file that I used to run the test that generated the included pgn file.

The search was at fixed depth 1. The result was:
 # PLAYER                        : RATING  ERROR   POINTS  PLAYED    (%)
 1 Stockfish 230517 64 POPCNT    : 2900.7    5.0   1907.0    3052   62.5%
 2 Andscacs 0.91028              : 2811.3    5.0   1145.0    3052   37.5%
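
As a rough cross-check of the table above, a score percentage can be converted to a rating difference with the standard logistic Elo formula. This is only a minimal sketch: the function name `elo_diff` is made up for illustration, and the rating tool that produced the table may use a different model, so a small deviation from its 89.4-point gap is expected.

```python
import math

def elo_diff(points: float, games: int) -> float:
    """Elo difference implied by a score, using the logistic Elo model."""
    score = points / games          # fraction of points scored, between 0 and 1
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))

# Stockfish's line from the table: 1907.0 points out of 3052 games
print(round(elo_diff(1907.0, 3052), 1))
```

The result lands within about one Elo point of the gap between the two RATING entries, which suggests the table is internally consistent.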

Details like whether you include checks in qsearch, whether you prune "losing" moves in qsearch, etc. have a huge effect on one ply search results. Unless these things are all very similar, it's not a test of eval. Also check extension rules play a huge role even at one ply.

cdani · Post by **cdani** » Tue May 23, 2017 11:00 pm

lkaufman wrote: Details like whether you include checks in qsearch, whether you prune "losing" moves in qsearch, etc. have a huge effect on one ply search results. Unless these things are all very similar, it's not a test of eval. Also check extension rules play a huge role even at one ply.

Of course. I tried to make the two executables equal:

* At root search do not prune/extend.
* At first qsearch depth do not generate checks, unless a capture is a check.
* Qsearch futility pruning is in effect.

Probably the two versions are not 100% equal, but they should be mostly.
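
The setup described in the list above can be sketched on a toy game tree. This is not cdani's code or the modified search.cpp: `Node`, `quiesce`, and `pick_root_move` are hypothetical names, a position is reduced to a static score plus lists of replies, only captures are searched in quiesce, and the qsearch futility pruning mentioned in the list is left out for brevity.

```python
from dataclasses import dataclass, field
from math import inf

@dataclass
class Node:
    static_eval: int                               # centipawns, from the side to move
    captures: list = field(default_factory=list)   # positions reached by captures
    quiets: list = field(default_factory=list)     # positions reached by quiet moves

def quiesce(node, alpha, beta):
    """Capture-only negamax search with a stand-pat score (fail-soft)."""
    best = node.static_eval                        # stand pat: the side to move may decline to capture
    if best >= beta:
        return best
    alpha = max(alpha, best)
    for child in node.captures:
        score = -quiesce(child, -beta, -alpha)
        if score > best:
            best = score
        if score >= beta:
            break                                  # beta cutoff
        alpha = max(alpha, score)
    return best

def pick_root_move(root):
    """Try every root move, score it with an open-window quiesce, keep the best."""
    best_index, best_score = None, -inf
    for index, child in enumerate(root.captures + root.quiets):
        score = -quiesce(child, -inf, inf)         # open window: nothing is cut off at the root
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score

# Toy position: move 0 wins a knight but allows the queen to be recaptured; move 1 is quiet.
after_recapture = Node(static_eval=-600)
after_capture = Node(static_eval=-300, captures=[after_recapture])
after_quiet = Node(static_eval=-50)
root = Node(static_eval=0, captures=[after_capture], quiets=[after_quiet])
print(pick_root_move(root))
```

In this toy position the quiet move is chosen, because quiesce sees the recapture that a bare static evaluation after the capture would miss.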

mjlef · Post by **mjlef** » Tue May 23, 2017 11:07 pm

Stockfish uses very aggressive move count pruning even in PV nodes. Basically at low depths Stockfish looks at captures, checks, advanced pawn moves, and that is about it (around 2-3 moves, which increases with depth). So in a 2 ply search it looks at the root moves, then just a few responses by the opponent. This is very fast, but not accurate in shallow searches. Since move count pruning requires a lot of search information to generate the history tables that order non-captures, move count pruning is not good at very shallow searches. But it must get a lot better for them at more realistic search depths.
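
To make the idea concrete, here is a minimal sketch of move count pruning. It is not Stockfish's implementation: the `quiet_limit` formula, the `Move` fields, and the function names are assumptions chosen only to illustrate "a few quiet moves at low depth, more as depth grows", and the special handling of advanced pawn moves is omitted.

```python
from dataclasses import dataclass

@dataclass
class Move:
    name: str
    is_capture: bool = False
    gives_check: bool = False
    history: int = 0      # move-ordering score for quiet moves (hypothetical)

def quiet_limit(depth: int) -> int:
    """Hypothetical limit on the number of quiet moves searched; grows with depth."""
    return 2 + depth * depth

def moves_to_search(moves, depth):
    """Keep all captures and checks; keep only the best-ordered quiet moves."""
    tactical = [m for m in moves if m.is_capture or m.gives_check]
    quiet = [m for m in moves if not (m.is_capture or m.gives_check)]
    quiet.sort(key=lambda m: m.history, reverse=True)   # history scores drive the ordering
    return tactical + quiet[:quiet_limit(depth)]

moves = [
    Move("Nxe5", is_capture=True),
    Move("Qh5+", gives_check=True),
    Move("a3", history=5),
    Move("h3", history=1),
    Move("Nc3", history=40),
    Move("d4", history=30),
    Move("Be2", history=20),
]
print([m.name for m in moves_to_search(moves, depth=1)])
```

If every `history` value were still zero, as it would be at the start of a very shallow search, the quiet moves that survive would simply be the first ones generated, which is the weakness described above.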

You can modify Stockfish to take that out. We have experimented a bit to try to determine how good the evaluations are. Basically making the programs into non-pruning full width searches with no or identical extensions. Of course, eval is also used by pruning, so they go together, making the eval comparison not quite fair. Tuning the eval to do best in a full width program might not perform as well in one that uses the evaluation to prune (futility and static null move pruning).

It is all about time. How much time does your pruning save, and how much time does one extra evaluation feature take. Nothing seems to do better than a lot of timed games.

Rebel · Post by **Rebel** » Tue May 23, 2017 11:40 pm

In a utopian world programmers would take a couple of days, replace the TSCP evaluation with their own, and release the executable, so that fair comparisons could be made.

I once did it for fun, see: http://rebel13.nl/misc/efs/tscp.html

cdani · Post by **cdani** » Wed May 24, 2017 12:03 am

Rebel wrote: In a utopian world programmers would take a couple of days, replace the TSCP evaluation with their own, and release the executable, so that fair comparisons could be made.

I once did it for fun, see: http://rebel13.nl/misc/efs/tscp.html

Yes. Good work is never easy.

But for getting a more reasonable estimate, I think that was enough.

lkaufman · Post by **lkaufman** » Wed May 24, 2017 6:47 am

cdani wrote:
lkaufman wrote: Details like whether you include checks in qsearch, whether you prune "losing" moves in qsearch, etc. have a huge effect on one ply search results. Unless these things are all very similar, it's not a test of eval. Also check extension rules play a huge role even at one ply.
Of course. I tried to make the two executables equal:

* At root search do not prune/extend.
* At first qsearch depth do not generate checks, unless a capture is a check.
* Qsearch futility pruning is in effect.

Probably the two versions are not 100% equal, but they should be mostly.

I wouldn't be too concerned about the 90 elo gap you report from SF at one ply, even if the searches are identical. The evals of all of the top engines are totally unsuitable for one ply games. Similarly an absolutely optimum one-ply eval would probably be more than a hundred elo worse than what top engines use currently for games with reasonable time limits. Are your weights for things like potential checks, or threats to enemy pieces, significantly higher or lower than those of SF? If so that could account for the one-ply results.

Lyudmil Tsvetkov · Post by **Lyudmil Tsvetkov** » Wed May 24, 2017 7:49 am

cdani wrote:
lkaufman wrote: Details like whether you include checks in qsearch, whether you prune "losing" moves in qsearch, etc. have a huge effect on one ply search results. Unless these things are all very similar, it's not a test of eval. Also check extension rules play a huge role even at one ply.
Of course. I tried to make the two executables equal:

* At root search do not prune/extend.
* At first qsearch depth do not generate checks, unless a capture is a check.
* Qsearch futility pruning is in effect.

Probably the two versions are not 100% equal, but they should be mostly.

thanks Daniel.
great work!

it is good that a knowledgeable person puts their hand to the work.

so, SF is beating Andscacs quite convincingly. it simply could not be the other way round. In my tests, it was probably quiescence that turned things upside down.

But we still do not know which engine, Komodo or SF, has the best eval.

PS. I am also of Larry's opinion that it is impossible to objectively measure eval without any distortions at depth 1. But you made the most of it.

Lyudmil Tsvetkov · Post by **Lyudmil Tsvetkov** » Wed May 24, 2017 8:06 am

and that is the level of play of SF and Komodo at depth 1 without quiescence search:

[pgn][Event "OWNER-PC, 1Ply / 1Ply"]
[Site "Microsoft"]
[Date "2017.05.24"]
[Round "1"]
[White "Komodo 10.1 64-bit"]
[Black "Stockfish 8 64 POPCNT"]
[Result "0-1"]
[ECO "B22"]
[PlyCount "58"]

1. e4 {B 0} c5 {B 0} 2. c3 {B 0} Nf6 {B 0 Both last book move} 3. d3 {0.67/1 0}
d5 {0.00/1 0} 4. e5 {0.89/1 0} Ng4 {-0.42/1 0} 5. d4 {1.49/1 0} Bf5 {-0.01/1 0
(cxd4)} 6. Be2 {1.39/1 0} h5 {0.10/1 0} 7. h3 {1.83/1 0} Nh6 {1.88/1 0} 8. dxc5
{2.61/1 0 (Bxh5)} Nc6 {0.45/1 0} 9. Qa4 {1.09/1 0} e6 {0.20/1 0} 10. b4 {1.09/
1 0} Qd7 {0.18/1 0} 11. Nf3 {1.09/1 0 (Bxh5)} Nxe5 {-0.16/1 0} 12. Qxd7+ {0.26/
1 0} Nxd7 {-0.16/1 0} 13. Nd4 {0.55/1 0} Bg6 {-0.05/1 0} 14. Bb5 {0.48/1 0} e5
{-0.29/1 0} 15. Nf3 {0.34/1 0} a6 {-0.31/1 0} 16. Bxd7+ {1.61/1 0} Kxd7 {1.52/
1 0} 17. Nxe5+ {1.61/1 0} Kc7 {1.52/1 0 (Ke6)} 18. Na3 {1.59/1 0 (Nxg6)} Re8 {
0.08/1 0} 19. Bf4 {0.78/1 0} f6 {-0.16/1 0} 20. Bxh6 {-0.55/1 0} Rxh6 {-0.93/1
0 (fxe5)} 21. O-O-O {-2.93/1 0} Rxe5 {-3.22/1 0} 22. f4 {-2.46/1 0} Re2 {-3.43/
1 0} 23. Rxd5 {-2.89/1 0} Rxa2 {-3.43/1 0} 24. f5 {-2.71/1 0} Bf7 {-3.71/1 0
(Rxa3)} 25. Rd3 {-6.54/1 0} Rxa3 {-6.37/1 0} 26. Rhd1 {-6.31/1 0} Ra1+ {-6.01/
1 0} 27. Kc2 {-7.16/1 0} Ra2+ {-6.77/1 0} 28. Kc1 {-7.16/1 0} Bb3 {-6.37/1 0
(Rxg2)} 29. R1d2 {-6.62/1 0 (Rd7+)} Rxd2 {-6.35/1 0} 0-1

[Event "OWNER-PC, 1Ply / 1Ply"]
[Site "Microsoft"]
[Date "2017.05.24"]
[Round "2"]
[White "Stockfish 8 64 POPCNT"]
[Black "Komodo 10.1 64-bit"]
[Result "0-1"]
[ECO "B50"]
[PlyCount "32"]

{600MB, CM8000.ctg, OWNER-PC} 1. e4 {B 0} c5 {B 0} 2. c3 {B 0} Nf6 {B 0 Both
last book move} 3. d3 {0.80/1 0} Nc6 {0.14/1 0} 4. Nf3 {0.83/1 0} d6 {-0.06/1 0
} 5. Nbd2 {0.19/1 0} Bg4 {-0.21/1 0} 6. h3 {0.19/1 0} Be6 {0.00/1 0} 7. d4 {0.
72/1 0} cxd4 {0.11/1 0} 8. cxd4 {0.72/1 0 (Nxd4)} Qa5 {-0.10/1 0} 9. d5 {1.33/
1 0} Nxd5 {0.85/1 0} 10. exd5 {1.17/1 0} Bxd5 {0.85/1 0} 11. a3 {1.29/1 0} e5 {
0.65/1 0} 12. b4 {2.23/1 0} Nxb4 {0.93/1 0} 13. Nc4 {1.29/1 0} Nc2+ {-15.11/1 0
} 14. Ke2 {-17.22/1 0} Bxc4+ {-15.11/1 0} 15. Qd3 {-17.22/1 0} Bxd3+ {-15.11/1
0} 16. Kxd3 {-17.22/1 0} Rc8 {-15.14/1 0 (Nxa1)} 0-1

[/pgn]

around 1200-1500 elo strength.
I guess any patzer out there would be able to beat them.
so the theory that all engines currently know is captures and checks should be more or less confirmed.

I can almost bet the strongest engine in the not very distant future will be a single-plier with no quiescence search, it is seemingly the best way to proceed forward.

Lyudmil Tsvetkov · Post by **Lyudmil Tsvetkov** » Wed May 24, 2017 8:17 am

[d]rnbqkb1r/pp2pppp/8/2ppP3/3P2n1/2P5/PP3PPP/RNBQKBNR b KQkq - 0 5

Komodo evaluates this position with 150cps white advantage, and there are no captures in sight.

scores throughout the game regularly jump by 1-2 full pawns.

Lyudmil Tsvetkov · Post by **Lyudmil Tsvetkov** » Wed May 24, 2017 8:22 am

[d]r3kb1r/pp3ppp/3p4/q2bp3/1nN5/P4N1P/5PP1/R1BQKB1R b KQkq - 1 13

and here, SF has just played Nd2-c4, with 130cps white edge eval.

one move later, SF assessment plunges down to -1730cps disedge!

19 full pawns discrepancy in a single move.

long live SF!
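
The size of such a jump can be read straight from the engine comments in the PGN posted earlier. This is a minimal sketch, assuming the `{score/depth time}` comment layout seen in that PGN and that all scores are given from White's point of view (which the signs in game 2 suggest); `white_evals` and `largest_swing` are illustrative names.

```python
import re

# Moves 12-14 of game 2, with the line-wrapped comments rejoined.
SNIPPET = (
    "12. b4 {2.23/1 0} Nxb4 {0.93/1 0} 13. Nc4 {1.29/1 0} "
    "Nc2+ {-15.11/1 0} 14. Ke2 {-17.22/1 0} Bxc4+ {-15.11/1 0}"
)

def white_evals(pgn_text):
    """Scores (in pawns) attached to White's moves, in game order.

    Assumes the text starts with a White move and that every move carries a score.
    """
    scores = [float(s) for s in re.findall(r"\{\s*(-?\d+\.\d+)/\d+", pgn_text)]
    return scores[0::2]          # comments alternate White, Black, White, ...

def largest_swing(evals):
    """Biggest absolute change between consecutive evaluations."""
    return max(abs(b - a) for a, b in zip(evals, evals[1:]))

evals = white_evals(SNIPPET)
print(evals, round(largest_swing(evals), 2))
```

On these three White moves the largest change is about 18.5 pawns, in line with the roughly 19 pawns quoted in the post.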
