Stockfish seems definitely the strongest engine

Discussion of computer chess matches and engine tournaments.

Moderator: Ras

Jouni
Posts: 3743
Joined: Wed Mar 08, 2006 8:15 pm
Full name: Jouni Uski

Re: Stockfish seems definitely the strongest engine

Post by Jouni »

Bobby is testing with 12 cores!
Jouni
Stefan Schiffermueller
Posts: 12
Joined: Thu Dec 05, 2013 10:48 am

Re: Stockfish seems definitely the strongest engine

Post by Stefan Schiffermueller »

Laskos wrote:

Code:

Games Completed = 100 of 100 (Avg game length = 2650.582 sec)
Settings = Gauntlet/512MB/600000ms+10000ms/M 700000cp for 1000 moves, D 150000 moves/EPD:C:\LittleBlitzer\8moves_v2.epd(32000)
Time = 66710 sec elapsed, 0 sec remaining
 1.  SF 19.01.14                56.0/100	23-11-66  	(L: m=11 t=0 i=0 a=0)	(D: r=40 i=16 f=10 s=0 a=0)	(tpm=15184.2 d=34.90 nps=2007547)
 2.  H4 Contempt=0              44.0/100	11-23-66  	(L: m=23 t=0 i=0 a=0)	(D: r=40 i=16 f=10 s=0 a=0)	(tpm=14413.0 d=23.69 nps=2349506)
LOS = 98.0%, highly significant.
56 points out of 100 games is highly significant? Sorry, I don't understand.
Ajedrecista
Posts: 2156
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Stockfish seems definitely the strongest engine.

Post by Ajedrecista »

Hello again:
Laskos wrote:Hello Jesus,

Can you simulate a "real" 35 Elo gap, drawelo=270, elo0 = 0, elo1 = 30, alpha, beta=0.05, to see the median at the stop? I suspect it will be several hundred. SPRT is not only about small Elo differences, that's entirely depending on the window. Thanks!
A 35 Elo gap with drawelo = 270 translates, more or less, into 60.75 Bayeselo units (expected draw ratio ~ 63.97% ~ 64%):

Code:

Shortest simulation:   39 games (+22 -1 =16).
Longest simulation:  1380 games (+271 -215 =894).

Average number of games per simulation: 281

Type I errors  (false positives): 0.00 %
Type II errors (false negatives): 0.01 %
I ran 80000 simulations: the stronger engine failed SPRT(0, 30) in 11 of them. I have no idea of the median this time (only that it is between 39 and 999 games); I need to add code to my simulator... probably next month.
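
The conversion quoted above (35 Elo, drawelo = 270, about 60.75 Bayeselo units, about 64% draws) can be checked with a few lines. This is only a sketch, assuming the usual Bayeselo win/draw/loss model and the standard logistic Elo formula; the function names are made up for illustration and are not taken from any simulator mentioned in this thread.

```python
import math

def bayeselo_probs(delta, drawelo):
    """Win/draw/loss probabilities under the assumed Bayeselo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - delta) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + delta) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def score_to_elo(score):
    """Logistic Elo difference for an expected score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

p_win, p_draw, p_loss = bayeselo_probs(60.75, 270.0)
expected_score = p_win + 0.5 * p_draw
print(f"draw ratio = {p_draw:.4f}")
print(f"ordinary Elo gap = {score_to_elo(expected_score):.1f}")
```

The draw ratio comes out near 0.64 and the ordinary Elo gap near 35, consistent with the figures in the post.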

So, on average only 281 games and not several hundred (this is why I think SPRT seems more intended for small differences, although I am not an expert). I think you should write your own SPRT simulator, just in case mine is wrong.

Just a question: when you say 'window', do you refer to 'elo1 - elo0'?

If I modify the bounds to, more or less, SPRT(0, 0.2) with the rest of the parameters unchanged, then:

Code:

Shortest simulation: 26944 games (+6480 -3373 =17091).
Longest simulation:  35310 games (+7942 -4830 =22538).

Average number of games per simulation: 30728

Type I errors  (false positives): 0.00 %
Type II errors (false negatives): 0.00 %
I ran 10000 simulations this time. The median is between 30000 and 31000 games, probably very near to 30700 games.

So, gap = 0 Bayeselo and SPRT(-3, 3) gives more or less the same average stop as gap ~ 60.75 Bayeselo and SPRT(0, 0.2)! (Bounds of SPRT are also expressed in Bayeselo).
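
The first simulation in this post (true gap 60.75 Bayeselo, SPRT(0, 30), alpha = beta = 0.05, drawelo = 270) can be sketched as below. This is not the simulator used for the numbers above: it is a minimal sketch that assumes the Bayeselo win/draw/loss model with drawelo known and fixed, and Wald's stopping bounds log(beta / (1 - alpha)) and log((1 - beta) / alpha). All function names and the seed are made up for illustration.

```python
import math
import random

def bayeselo_probs(delta, drawelo):
    """Win/draw/loss probabilities under the assumed Bayeselo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - delta) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + delta) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def sprt_run(true_delta, elo0, elo1, drawelo, alpha, beta, rng, max_games=100_000):
    """Play simulated games until the log-likelihood ratio crosses a bound.

    Returns (games_played, accepted_h1). Hitting max_games counts as not accepted.
    """
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    p0 = bayeselo_probs(elo0, drawelo)
    p1 = bayeselo_probs(elo1, drawelo)
    # LLR increment for a win, a draw and a loss, in that order.
    steps = [math.log(b / a) for a, b in zip(p0, p1)]
    p_win, p_draw, _ = bayeselo_probs(true_delta, drawelo)
    llr = 0.0
    for games in range(1, max_games + 1):
        r = rng.random()
        if r < p_win:
            llr += steps[0]
        elif r < p_win + p_draw:
            llr += steps[1]
        else:
            llr += steps[2]
        if llr >= upper:
            return games, True
        if llr <= lower:
            return games, False
    return max_games, False

def simulate(n_sims, seed, **kwargs):
    """Average stopping time and fraction of runs that accept H1."""
    rng = random.Random(seed)
    runs = [sprt_run(rng=rng, **kwargs) for _ in range(n_sims)]
    avg_games = sum(g for g, _ in runs) / n_sims
    accept_rate = sum(1 for _, ok in runs if ok) / n_sims
    return avg_games, accept_rate

avg_games, accept_rate = simulate(
    2000, seed=1, true_delta=60.75, elo0=0.0, elo1=30.0,
    drawelo=270.0, alpha=0.05, beta=0.05,
)
print(f"average games per run: {avg_games:.0f}, H1 accepted in {accept_rate:.2%} of runs")
```

Under these assumptions the expected LLR gain per game is roughly 0.011 against an upper bound of log(19) ~ 2.94, so an average stop in the neighbourhood of 270-290 games is what one would expect, in line with the 281 reported above.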

Regards from Spain.

Ajedrecista.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish seems definitely the strongest engine

Post by Laskos »

Stefan Schiffermueller wrote:
Laskos wrote:

Code:

Games Completed = 100 of 100 (Avg game length = 2650.582 sec)
Settings = Gauntlet/512MB/600000ms+10000ms/M 700000cp for 1000 moves, D 150000 moves/EPD:C:\LittleBlitzer\8moves_v2.epd(32000)
Time = 66710 sec elapsed, 0 sec remaining
 1.  SF 19.01.14                56.0/100	23-11-66  	(L: m=11 t=0 i=0 a=0)	(D: r=40 i=16 f=10 s=0 a=0)	(tpm=15184.2 d=34.90 nps=2007547)
 2.  H4 Contempt=0              44.0/100	11-23-66  	(L: m=23 t=0 i=0 a=0)	(D: r=40 i=16 f=10 s=0 a=0)	(tpm=14413.0 d=23.69 nps=2349506)
LOS = 98.0%, highly significant.
56 points out of 100 games is highly significant? Sorry, I don't understand.
Many draws, LOS 98%. Yes, I would have liked an SPRT stop, but I would probably have needed 2-3 days of testing.
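
For readers with the same question: LOS (likelihood of superiority) depends only on wins and losses, so the 66 draws do not dilute it. A minimal sketch, assuming the common normal approximation LOS = 0.5 * (1 + erf((W - L) / sqrt(2 * (W + L)))); the function name is illustrative, and LittleBlitzer's own calculation may differ in detail.

```python
import math

def los(wins, losses):
    """Likelihood of superiority (normal approximation); draws are ignored."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games, no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

# SF 19.01.14 vs H4 Contempt=0: +23 -11 =66
print(f"LOS = {los(23, 11):.1%}")
```

With 23 wins against 11 losses this gives about 98%, even though the score is only 56/100.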
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish seems definitely the strongest engine.

Post by Laskos »

Ajedrecista wrote:Hello again:
Laskos wrote:Hello Jesus,

Can you simulate a "real" 35 Elo gap, drawelo=270, elo0 = 0, elo1 = 30, alpha, beta=0.05, to see the median at the stop? I suspect it will be several hundred. SPRT is not only about small Elo differences, that's entirely depending on the window. Thanks!
A 35 Elo gap with drawelo = 270 translates, more or less, into 60.75 Bayeselo units (expected draw ratio ~ 63.97% ~ 64%):

Code:

Shortest simulation:   39 games (+22 -1 =16).
Longest simulation:  1380 games (+271 -215 =894).

Average number of games per simulation: 281
Yes, this was my guess when I said that in these conditions SPRT with alpha, beta = 0.05 and a reasonable window would need 2-3 times more games (200-300 games). 281 is very close to this guess.

Code:

Type I errors  (false positives): 0.00 %
Type II errors (false negatives): 0.01 %
I ran 80000 simulations: the stronger engine failed SPRT(0, 30) in 11 of them. I have no idea of the median this time (only that it is between 39 and 999 games); I need to add code to my simulator... probably next month.

So, on average only 281 games and not several hundred (this is why I think SPRT seems more intended for small differences, although I am not an expert). I think you should write your own SPRT simulator, just in case mine is wrong.

Just a question: when you say 'window', do you refer to 'elo1 - elo0'?
Yes
If I modify the bounds to, more or less, SPRT(0, 0.2) with the rest of the parameters unchanged, then:

Code:

Shortest simulation: 26944 games (+6480 -3373 =17091).
Longest simulation:  35310 games (+7942 -4830 =22538).

Average number of games per simulation: 30728

Type I errors  (false positives): 0.00 %
Type II errors (false negatives): 0.00 %
I ran 10000 simulations this time. The median is between 30000 and 31000 games, probably very near to 30700 games.

So, gap = 0 Bayeselo and SPRT(-3, 3) gives more or less the same average stop as gap ~ 60.75 Bayeselo and SPRT(0, 0.2)! (Bounds of SPRT are also expressed in Bayeselo).

Regards from Spain.

Ajedrecista.
Thanks, Jesus.
ouachita
Posts: 454
Joined: Tue Jan 15, 2013 4:33 pm
Location: Ritz-Carlton, NYC
Full name: Bobby Johnson

Re: Stockfish seems definitely the strongest engine

Post by ouachita »

Laskos wrote:10'+10'' is almost double TC compared to 10'+1''.
1. Probably closer to 50% more time for an average game, and,
2. 12 cores run at approx. 8-10 times the kN/s of one core.
3. Engines using 12 cores each will beat engines using one core each >90% when using the same time control.

The results of one core matches cannot be reasonably compared to 4, 6, 8, 12 or 16 core matches, all other variables being the same.
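
How much longer 10'+10'' is than 10'+1'' depends on how many moves a game lasts. A rough sketch, assuming each side uses its whole clock (an upper bound, since real games usually end with time left); `clock_budget` and `budget_ratio` are made-up helper names.

```python
def clock_budget(base_s, inc_s, moves):
    """Maximum thinking time in seconds for one side over a game of `moves` moves."""
    return base_s + inc_s * moves

def budget_ratio(moves, base_s=600, slow_inc_s=10, fast_inc_s=1):
    """Ratio of the 10'+10'' budget to the 10'+1'' budget for a given game length."""
    return clock_budget(base_s, slow_inc_s, moves) / clock_budget(base_s, fast_inc_s, moves)

for moves in (40, 60, 80):
    print(f"{moves} moves: ratio = {budget_ratio(moves):.2f}")
```

On this upper-bound view the ratio is about 1.56 at 40 moves and about 1.82 at 60 moves, so both the "50% more" and the "almost double" estimates are plausible depending on the typical game length.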
SIM, PhD, MBA, PE
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish seems definitely the strongest engine

Post by Laskos »

ouachita wrote:
Laskos wrote:10'+10'' is almost double TC compared to 10'+1''.
1. Probably closer to 50% more time for an average game, and,
2. 12 cores run at approx. 8-10 times the kN/s of one core.
3. Engines using 12 cores each will beat engines using one core each >90% when using the same time control.

The results of one core matches cannot be reasonably compared to 4, 6, 8, 12 or 16 core matches, all other variables being the same.
Ah, ok, you are testing engines on 12 cores. A few more games would be useful, and I am solely testing 1-core performance, not the MP one. Anyway, I decided to leave the new match running until an SPRT stop (alpha, beta=0.05, elo0=0, elo1=30); maybe it will stop in 2-3 days, after 200-400 games at 10'+10''.
ouachita
Posts: 454
Joined: Tue Jan 15, 2013 4:33 pm
Location: Ritz-Carlton, NYC
Full name: Bobby Johnson

Re: Stockfish seems definitely the strongest engine

Post by ouachita »

There are a lot of one core tests posted here, so I wanted to post these results to again highlight the point that one core results are not related to multi-core results, in this case, 12 cores:

Code:

1-22-14

SF0901014IP-12 core v SF 080114-1 core

1+1
50 positions, alternating colors
defaults
# of cores is sole setting difference.
                                        
1   Stockfish 090114IP 64 SSE4.2  +205  +53/=47/-0 76.50%   76.5/100
2   Stockfish 080114 64 SSE4.2    -205  +0/=47/-53 23.50%   23.5/100

Also, I misspoke by saying 12 cores win >90%. Here, 12 cores scored 76.5%, but had 100% of the wins.
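
The +205 in the table is what the logistic Elo formula gives for a 76.5% score. A small sketch, assuming the standard formula Elo = -400 * log10(1 / score - 1); the helper names are illustrative, and the GUI that produced the table may round or estimate differently.

```python
import math

def match_score(wins, draws, losses):
    """Fraction of points scored, counting a draw as half a point."""
    return (wins + 0.5 * draws) / (wins + draws + losses)

def score_to_elo(score):
    """Logistic Elo difference for a score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# 12-core Stockfish vs 1-core Stockfish: +53 =47 -0
score = match_score(53, 47, 0)
print(f"score = {score:.3f}, Elo difference = {score_to_elo(score):+.0f}")
```

With only 100 games the error bars on such an estimate are wide, so the figure is best read as approximate.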

Food for thought.
SIM, PhD, MBA, PE
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish seems definitely the strongest engine

Post by Laskos »

ouachita wrote:There are a lot of one core tests posted here, so I wanted to post these results to again highlight the point that one core results are not related to multi-core results, in this case, 12 cores:

Code:

1-22-14

SF0901014IP-12 core v SF 080114-1 core

1+1
50 positions, alternating colors
defaults
# of cores is sole setting difference.
                                        
1   Stockfish 090114IP 64 SSE4.2  +205  +53/=47/-0 76.50%   76.5/100
2   Stockfish 080114 64 SSE4.2    -205  +0/=47/-53 23.50%   23.5/100

Also, I misspoke by saying 12 cores win >90%. Here, 12 cores scored 76.5%, but had 100% of the wins.

Food for thought.
150-200 Elo points from 1 to 12 cores at 1'+1'' are to be expected, but I don't agree that 1-core results are unrelated to 12-core results. The MP scaling of top engines is comparable. In your place I would play 10'+10'' games on one core or on several cores SF against Houdini Contempt=0 until a SPRT stop, to dispel some myths (that SF does not scale better, for example). There were many 1-core results, but none of them had LOS of 98% SF against Houdini 4 Contempt=0, and that happens at somewhat larger TC than blitz. I will now wait for a SPRT stop in Cutechess-Cli to show that SF overtook Houdini (if that is the case).
ernest
Posts: 2053
Joined: Wed Mar 08, 2006 8:30 pm

Re: Stockfish seems definitely the strongest engine

Post by ernest »

Laskos wrote:
ernest wrote:
Laskos wrote:1. Stockfish scales better to longer TC
55% => 56% with 56% based on 100 games...
Kai, I am ashamed of you... 8-)
45% -> 56%
Indeed you wrote
At 15s + 0.05s TC Houdini scored 55% in 2,000 games.

:oops: :oops: :oops:
Now I am ashamed...