Stockfish seems definitely the strongest engine

Jouni · Post by **Jouni** » Wed Jan 22, 2014 10:30 am

Bobby is testing with 12 cores!

Stefan Schiffermueller · Wed Jan 22, 2014 10:46 am

Laskos wrote:

Games Completed = 100 of 100 (Avg game length = 2650.582 sec)
Settings = Gauntlet/512MB/600000ms+10000ms/M 700000cp for 1000 moves, D 150000 moves/EPD:C:\LittleBlitzer\8moves_v2.epd(32000)
Time = 66710 sec elapsed, 0 sec remaining
 1.  SF 19.01.14                56.0/100	23-11-66  	(L: m=11 t=0 i=0 a=0)	(D: r=40 i=16 f=10 s=0 a=0)	(tpm=15184.2 d=34.90 nps=2007547)
 2.  H4 Contempt=0              44.0/100	11-23-66  	(L: m=23 t=0 i=0 a=0)	(D: r=40 i=16 f=10 s=0 a=0)	(tpm=14413.0 d=23.69 nps=2349506)

LOS = 98.0%, highly significant.

Is 56 points out of 100 games highly significant? Sorry, I don't understand.

Ajedrecista · Post by **Ajedrecista** » Wed Jan 22, 2014 11:08 am

Hello again:

Laskos wrote:Hello Jesus,

Can you simulate a "real" 35 Elo gap, drawelo=270, elo0 = 0, elo1 = 30, alpha, beta=0.05, to see the median at the stop? I suspect it will be several hundred. SPRT is not only about small Elo differences, that's entirely depending on the window. Thanks!

A 35 Elo gap with drawelo = 270 translates more or less into 60.75 Bayeselo units (expected draw ratio ~ 63.97% ~ 64%):


Shortest simulation:   39 games (+22 -1 =16).
Longest simulation:  1380 games (+271 -215 =894).

Average number of games per simulation: 281

Type I errors  (false positives): 0.00 %
Type II errors (false negatives): 0.01 %

I ran 80000 simulations: the stronger engine failed SPRT(0, 30) in 11 of them. I have no idea of the median this time (only that it is between 39 and 999 games); I need to add code to my simulator... probably next month.

So, on average only 281 games and not several hundred (this is why I think SPRT seems to be intended more for small differences, although I am not an expert). I think you should write your own SPRT simulator, just in case mine is wrong.
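For anyone who wants to cross-check these numbers independently, here is a minimal sketch of such a simulator in Python. It is not the simulator used above: it assumes the Bayeselo win/draw/loss model with a fixed, known drawelo, an exact per-game log-likelihood ratio, and Wald's stopping bounds. The function names (`outcome_probs`, `sprt_run`, `simulate`) are made up for illustration.

```python
import math
import random

def outcome_probs(delta, drawelo):
    """(win, draw, loss) probabilities in the Bayeselo model (assumed here)."""
    win = 1.0 / (1.0 + 10.0 ** ((drawelo - delta) / 400.0))
    loss = 1.0 / (1.0 + 10.0 ** ((drawelo + delta) / 400.0))
    return win, 1.0 - win - loss, loss

def sprt_run(true_delta, elo0, elo1, rng, drawelo=270.0, alpha=0.05, beta=0.05):
    """Play simulated games until the LLR leaves (lower, upper).

    Returns (accepted_h1, games_played). All Elo values are in Bayeselo units.
    """
    p_win, p_draw, _ = outcome_probs(true_delta, drawelo)
    p0 = outcome_probs(elo0, drawelo)
    p1 = outcome_probs(elo1, drawelo)
    # LLR increment for a win, a draw and a loss, respectively.
    step = [math.log(b / a) for a, b in zip(p0, p1)]
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr, games = 0.0, 0
    while lower < llr < upper:
        u = rng.random()
        outcome = 0 if u < p_win else (1 if u < p_win + p_draw else 2)
        llr += step[outcome]
        games += 1
    return llr >= upper, games

def simulate(n_runs, true_delta, elo0, elo1, seed=1):
    """Return (fraction of runs accepting H1, mean games per run)."""
    rng = random.Random(seed)
    results = [sprt_run(true_delta, elo0, elo1, rng) for _ in range(n_runs)]
    accepted = sum(1 for ok, _ in results if ok)
    mean_games = sum(g for _, g in results) / n_runs
    return accepted / n_runs, mean_games

if __name__ == "__main__":
    rate, mean_games = simulate(1000, 60.75, 0.0, 30.0)
    print(f"H1 accepted in {rate:.2%} of runs, mean stop at {mean_games:.0f} games")
```

If the assumptions match, the mean stop for a 60.75 Bayeselo gap with SPRT(0, 30) should land in the same neighbourhood as the 281 games reported above; a simulator that estimates drawelo from the sample instead of fixing it may give somewhat different figures.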

Just a question: when you say 'window', do you refer to 'elo1 - elo0'?

If I modify the bounds to more or less SPRT(0, 0.2) with the rest of the parameters unchanged, then:


Shortest simulation: 26944 games (+6480 -3373 =17091).
Longest simulation:  35310 games (+7942 -4830 =22538).

Average number of games per simulation: 30728

Type I errors  (false positives): 0.00 %
Type II errors (false negatives): 0.00 %

I ran 10000 simulations this time. The median is between 30000 and 31000 games, probably very near to 30700 games.

So, gap = 0 Bayeselo and SPRT(-3, 3) gives more or less the same average stop as gap ~ 60.75 Bayeselo and SPRT(0, 0.2)! (Bounds of SPRT are also expressed in Bayeselo).
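The conversion between the 35 Elo gap and 60.75 Bayeselo units can be checked with a few lines of Python. This is only a sketch, assuming the usual Bayeselo win/draw/loss model and the logistic Elo formula; the function names are made up for illustration.

```python
import math

def bayeselo_probs(delta, drawelo):
    """(win, draw, loss) probabilities for a Bayeselo gap `delta` (assumed model)."""
    win = 1.0 / (1.0 + 10.0 ** ((drawelo - delta) / 400.0))
    loss = 1.0 / (1.0 + 10.0 ** ((drawelo + delta) / 400.0))
    return win, 1.0 - win - loss, loss

def elo_from_score(score):
    """Logistic Elo difference for an expected score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

win, draw, loss = bayeselo_probs(60.75, 270.0)
score = win + 0.5 * draw
print(f"draw ratio = {draw:.2%}, Elo gap = {elo_from_score(score):.1f}")
```

With delta = 60.75 and drawelo = 270 this gives a draw ratio of about 64% and a gap of about 35 ordinary Elo, in line with the figures quoted above.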

Regards from Spain.

Ajedrecista.

Laskos · Post by **Laskos** » Wed Jan 22, 2014 11:08 am

Stefan Schiffermueller wrote:

Laskos wrote:


Games Completed = 100 of 100 (Avg game length = 2650.582 sec)
Settings = Gauntlet/512MB/600000ms+10000ms/M 700000cp for 1000 moves, D 150000 moves/EPD:C:\LittleBlitzer\8moves_v2.epd(32000)
Time = 66710 sec elapsed, 0 sec remaining
 1.  SF 19.01.14                56.0/100	23-11-66  	(L: m=11 t=0 i=0 a=0)	(D: r=40 i=16 f=10 s=0 a=0)	(tpm=15184.2 d=34.90 nps=2007547)
 2.  H4 Contempt=0              44.0/100	11-23-66  	(L: m=23 t=0 i=0 a=0)	(D: r=40 i=16 f=10 s=0 a=0)	(tpm=14413.0 d=23.69 nps=2349506)

LOS = 98.0%, highly significant.

Is 56 points out of 100 games highly significant? Sorry, I don't understand.

Many draws, LOS 98%. Yes, I would have liked an SPRT stop, but I would probably have needed 2-3 days of testing.
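The reason the draws matter: in the usual normal approximation, LOS depends only on wins and losses, so 23-11 with 66 draws is judged on 34 decisive games. A minimal sketch, assuming that standard approximation (it may not be exactly what LittleBlitzer computes internally; the function name `los` is illustrative):

```python
import math

def los(wins, losses):
    """Likelihood of superiority, normal approximation; draws do not enter."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

print(f"LOS = {los(23, 11):.1%}")  # prints "LOS = 98.0%"
```

So a 56% score can carry a high LOS when most of the remaining games are draws, although LOS by itself says nothing about how large the Elo gap is.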

Laskos · Post by **Laskos** » Wed Jan 22, 2014 11:17 am

Ajedrecista wrote:Hello again:

Laskos wrote:Hello Jesus,

Can you simulate a "real" 35 Elo gap, drawelo=270, elo0 = 0, elo1 = 30, alpha, beta=0.05, to see the median at the stop? I suspect it will be several hundred. SPRT is not only about small Elo differences, that's entirely depending on the window. Thanks!
A 35 Elo gap with drawelo = 270 translates more or less into 60.75 Bayeselo units (expected draw ratio ~ 63.97% ~ 64%):
Shortest simulation:   39 games (+22 -1 =16).
Longest simulation:  1380 games (+271 -215 =894).

Average number of games per simulation: 281

Yes, this was my guess when I said that in these conditions SPRT with alpha, beta = 0.05 and a reasonable window would need 2-3 times more games (200-300 games). 281 is very close to this guess.

Type I errors  (false positives): 0.00 %
Type II errors (false negatives): 0.01 %
I ran 80000 simulations: the stronger engine failed SPRT(0, 30) in 11 of them. I have no idea of the median this time (only that it is between 39 and 999 games); I need to add code to my simulator... probably next month.

So, on average only 281 games and not several hundred (this is why I think SPRT seems to be intended more for small differences, although I am not an expert). I think you should write your own SPRT simulator, just in case mine is wrong.

Just a question: when you say 'window', do you refer to 'elo1 - elo0'?

Yes

If I modify the bounds to more or less SPRT(0, 0.2) with the rest of the parameters unchanged, then:
Shortest simulation: 26944 games (+6480 -3373 =17091).
Longest simulation:  35310 games (+7942 -4830 =22538).

Average number of games per simulation: 30728

Type I errors  (false positives): 0.00 %
Type II errors (false negatives): 0.00 %
I ran 10000 simulations this time. The median is between 30000 and 31000 games, probably very near to 30700 games.

So, gap = 0 Bayeselo and SPRT(-3, 3) gives more or less the same average stop as gap ~ 60.75 Bayeselo and SPRT(0, 0.2)! (Bounds of SPRT are also expressed in Bayeselo).

Regards from Spain.

Ajedrecista.

Thanks, Jesus.

ouachita · Post by **ouachita** » Wed Jan 22, 2014 2:58 pm

Laskos wrote:10'+10'' is almost double TC compared to 10'+1''.

1. Probably closer to 50% more time for an average game, and,
2. 12 cores run at approx. 8-10 times the kN/s of one core.
3. Engines using 12 cores each will beat engines using one core each >90% of the time when using the same time control.

The results of one core matches cannot be reasonably compared to 4, 6, 8, 12 or 16 core matches, all other variables being the same.
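Whether 10'+10'' is "almost double" or "about 50% more" than 10'+1'' depends mostly on how many moves a game lasts. A rough sketch, assuming each side spends its whole allotment (base time plus one increment per move); the move counts are illustrative guesses, not measured values:

```python
def seconds_per_side(base_s, inc_s, moves):
    """Total clock time available to one side over `moves` moves."""
    return base_s + inc_s * moves

def tc_ratio(moves):
    """Ratio of 10'+10'' to 10'+1'' for a game of the given length."""
    return seconds_per_side(600, 10, moves) / seconds_per_side(600, 1, moves)

for moves in (40, 60, 80):
    print(f"{moves} moves: 10'+10'' is {tc_ratio(moves):.2f}x the time of 10'+1''")
```

Under this assumption the ratio is about 1.56 for 40-move games and about 1.82 for 60-move games, so both estimates are defensible depending on game length and on how much of the clock is actually used.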

Laskos · Post by **Laskos** » Wed Jan 22, 2014 4:41 pm

ouachita wrote:
Laskos wrote:10'+10'' is almost double TC compared to 10'+1''.
1. Probably closer to 50% more time for an average game, and,
2. 12 cores run at approx. 8-10 times the kN/s of one core.
3. Engines using 12 cores each will beat engines using one core each >90% of the time when using the same time control.

The results of one core matches cannot be reasonably compared to 4, 6, 8, 12 or 16 core matches, all other variables being the same.

Ah, ok, you are testing engines on 12 cores. A few more games would be useful, and I am solely testing 1-core performance, not the MP one. Anyway, I decided to leave the new match running until an SPRT stop (alpha, beta=0.05, elo0=0, elo1=30); maybe it will stop in 2-3 days, after 200-400 games at 10'+10''.

ouachita · Post by **ouachita** » Wed Jan 22, 2014 9:44 pm

There are a lot of one core tests posted here, so I wanted to post these results to again highlight the point that one core results are not related to multi-core results, in this case, 12 cores:


1-22-14

SF0901014IP-12 core v SF 080114-1 core

1+1
50 positions, alternating colors
defaults
# of cores is sole setting difference.
                                        
1   Stockfish 090114IP 64 SSE4.2  +205  +53/=47/-0 76.50%   76.5/100
2   Stockfish 080114 64 SSE4.2    -205  +0/=47/-53 23.50%   23.5/100

Also, I misspoke by saying 12 cores win >90%. Here, 12 cores scored 76.5%, but won every decisive game.

Food for thought.
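The +205 in the table follows from the 76.5% score through the logistic Elo formula. The sketch below reproduces it and adds a rough 95% interval, assuming independent games and a normal approximation to the score (with only 100 games this is approximate); the function names are made up for illustration.

```python
import math

def elo_from_score(score):
    """Logistic Elo difference for a score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_interval(wins, draws, losses, z=1.96):
    """Return (elo, low, high) from a W/D/L record, normal approximation."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0 points).
    variance = (wins + 0.25 * draws) / n - score ** 2
    half = z * math.sqrt(variance / n)
    return (elo_from_score(score),
            elo_from_score(score - half),
            elo_from_score(score + half))

elo, low, high = elo_with_interval(53, 47, 0)
print(f"Elo = {elo:+.0f}, approx. 95% interval [{low:+.0f}, {high:+.0f}]")
```

For +53 =47 -0 this gives about +205 Elo, with an interval of very roughly +160 to +255 under these assumptions.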

Laskos · Post by **Laskos** » Wed Jan 22, 2014 10:13 pm

ouachita wrote:There are a lot of one core tests posted here, so I wanted to post these results to again highlight the point that one core results are not related to multi-core results, in this case, 12 cores:
1-22-14

SF0901014IP-12 core v SF 080114-1 core

1+1
50 positions, alternating colors
defaults
# of cores is sole setting difference.
                                        
1   Stockfish 090114IP 64 SSE4.2  +205  +53/=47/-0 76.50%   76.5/100
2   Stockfish 080114 64 SSE4.2    -205  +0/=47/-53 23.50%   23.5/100
Also, I misspoke by saying 12 cores win >90%. Here, 12 cores scored 76.5%, but won every decisive game.

Food for thought.

150-200 Elo points from 1 to 12 cores at 1'+1'' are to be expected, but I don't agree that 1-core results are unrelated to 12-core results. The MP scaling of top engines is comparable. In your place I would play 10'+10'' games on one core or on several cores SF against Houdini Contempt=0 until a SPRT stop, to dispel some myths (that SF does not scale better, for example). There were many 1-core results, but none of them had LOS of 98% SF against Houdini 4 Contempt=0, and that happens at somewhat larger TC than blitz. I will now wait for a SPRT stop in Cutechess-Cli to show that SF overtook Houdini (if that is the case).

ernest · Post by **ernest** » Thu Jan 23, 2014 1:11 am

Laskos wrote:
ernest wrote:
Laskos wrote:1. Stockfish scales better to longer TC
55% => 56% with 56% based on 100 games...
Kai, I am ashamed of you...
45% -> 56%

Indeed you wrote
At 15s + 0.05s TC Houdini scored 55% in 2,000 games.

Now I am ashamed...
