Stockfish seems definitely the strongest engine
Moderator: Ras
-
Jouni
- Posts: 3743
- Joined: Wed Mar 08, 2006 8:15 pm
- Full name: Jouni Uski
-
Stefan Schiffermueller
- Posts: 12
- Joined: Thu Dec 05, 2013 10:48 am
Re: Stockfish seems definitely the strongest engine
Laskos wrote:
LOS = 98.0%, highly significant.
Code: Select all
Games Completed = 100 of 100 (Avg game length = 2650.582 sec)
Settings = Gauntlet/512MB/600000ms+10000ms/M 700000cp for 1000 moves, D 150000 moves/EPD:C:\LittleBlitzer\8moves_v2.epd(32000)
Time = 66710 sec elapsed, 0 sec remaining
1. SF 19.01.14 56.0/100 23-11-66 (L: m=11 t=0 i=0 a=0) (D: r=40 i=16 f=10 s=0 a=0) (tpm=15184.2 d=34.90 nps=2007547)
2. H4 Contempt=0 44.0/100 11-23-66 (L: m=23 t=0 i=0 a=0) (D: r=40 i=16 f=10 s=0 a=0) (tpm=14413.0 d=23.69 nps=2349506)

56 points out of 100 games is highly significant? Sorry, I don't understand.
-
Ajedrecista
- Posts: 2156
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: Stockfish seems definitely the strongest engine.
Hello again:
Laskos wrote:
Hello Jesus,
Can you simulate a "real" 35 Elo gap, drawelo=270, elo0 = 0, elo1 = 30, alpha, beta=0.05, to see the median at the stop? I suspect it will be several hundred. SPRT is not only about small Elo differences, that's entirely depending on the window. Thanks!

A 35 Elo gap with drawelo = 270 more or less translates into 60.75 Bayeselo units (expected draw ratio ~ 63.97% ~ 64%):
Code: Select all
Shortest simulation: 39 games (+22 -1 =16).
Longest simulation: 1380 games (+271 -215 =894).
Average number of games per simulation: 281
Type I errors (false positives): 0.00 %
Type II errors (false negatives): 0.01 %
I ran 80000 simulations: the stronger engine failed SPRT(0, 30) in 11 of them. I do not have any idea of the median this time (only that it is between 39 and 999 games), I need to add code to my simulator... probably next month.
So, on average only 281 games and not several hundred (this is why I think that SPRT seems to be more intended for small differences, although I am not an expert). I think you should write your own SPRT simulator, just in case mine is wrong.
Just a question: when you say 'window', do you refer to 'elo1 - elo0'?
If I modify the bounds to more or less SPRT(0, 0.2) with the rest of the parameters unchanged, then:
Code: Select all
Shortest simulation: 26944 games (+6480 -3373 =17091).
Longest simulation: 35310 games (+7942 -4830 =22538).
Average number of games per simulation: 30728
Type I errors (false positives): 0.00 %
Type II errors (false negatives): 0.00 %
I ran 10000 simulations this time. The median is between 30000 and 31000 games, probably very near 30700 games.
So, gap = 0 Bayeselo and SPRT(-3, 3) gives more or less the same average stop as gap ~ 60.75 Bayeselo and SPRT(0, 0.2)! (Bounds of SPRT are also expressed in Bayeselo).
Regards from Spain.
Ajedrecista.
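For anyone wanting to check the 35 Elo -> 60.75 Bayeselo figure and the ~64% draw ratio, here is a minimal sketch. It assumes the standard BayesElo win/draw/loss model and the usual linearised scale factor between logistic Elo and Bayeselo; the function names are made up for illustration and are not taken from any simulator mentioned in this thread.

```python
import math

def bayeselo_probs(bayeselo, drawelo):
    """Win/draw/loss probabilities under the BayesElo draw model (assumed here)."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - bayeselo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + bayeselo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def elo_to_bayeselo(elo, drawelo):
    """Linearised conversion from logistic Elo to Bayeselo units (assumed form)."""
    x = 10.0 ** (-drawelo / 400.0)
    scale = 4.0 * x / (1.0 + x) ** 2  # slope of the expected score around elo = 0
    return elo / scale

gap = elo_to_bayeselo(35.0, 270.0)
p_win, p_draw, p_loss = bayeselo_probs(gap, 270.0)
print(f"{gap:.2f} Bayeselo, draw ratio {100 * p_draw:.2f}%")
```

Under these assumptions the result should land close to the 60.75 Bayeselo and roughly 64% draws quoted above.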
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Stockfish seems definitely the strongest engine
Stefan Schiffermueller wrote:
Laskos wrote:
LOS = 98.0%, highly significant.
Code: Select all
Games Completed = 100 of 100 (Avg game length = 2650.582 sec)
Settings = Gauntlet/512MB/600000ms+10000ms/M 700000cp for 1000 moves, D 150000 moves/EPD:C:\LittleBlitzer\8moves_v2.epd(32000)
Time = 66710 sec elapsed, 0 sec remaining
1. SF 19.01.14 56.0/100 23-11-66 (L: m=11 t=0 i=0 a=0) (D: r=40 i=16 f=10 s=0 a=0) (tpm=15184.2 d=34.90 nps=2007547)
2. H4 Contempt=0 44.0/100 11-23-66 (L: m=23 t=0 i=0 a=0) (D: r=40 i=16 f=10 s=0 a=0) (tpm=14413.0 d=23.69 nps=2349506)
56 points out of 100 games is highly significant? Sorry, I don't understand.

Many draws, LOS 98%. Yes, I would have liked a SPRT stop, but I would probably have needed 2-3 days of testing.
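To see why "many draws" matters here: LOS depends only on wins and losses, so 23 wins against 11 losses is what counts, not 56/100. A minimal sketch, assuming the common normal-approximation formula for LOS (the function name `los` is just illustrative):

```python
import math

def los(wins, losses):
    """Likelihood of superiority from decisive games only (normal approximation)."""
    if wins + losses == 0:
        return 0.5  # no decisive games, no information either way
    z = (wins - losses) / math.sqrt(2.0 * (wins + losses))
    return 0.5 * (1.0 + math.erf(z))

# SF 19.01.14 vs H4 Contempt=0: +23 -11 =66
print(f"LOS = {100 * los(23, 11):.1f}%")
```

With +23 -11 this gives about 98%, in line with the figure quoted above; the 66 draws do not enter the formula at all.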
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Stockfish seems definitely the strongest engine.
Ajedrecista wrote:
Hello again:
Laskos wrote:
Hello Jesus,
Can you simulate a "real" 35 Elo gap, drawelo=270, elo0 = 0, elo1 = 30, alpha, beta=0.05, to see the median at the stop? I suspect it will be several hundred. SPRT is not only about small Elo differences, that's entirely depending on the window. Thanks!
35 Elo gap with drawelo = 270 more less translate into 60.75 Bayeselo units (expected draw ratio ~ 63.97% ~ 64%):
Code: Select all
Shortest simulation: 39 games (+22 -1 =16).
Longest simulation: 1380 games (+271 -215 =894).
Average number of games per simulation: 281

Yes, this was my guess when I said that in these conditions SPRT alpha, beta 0.05 with reasonable window would need 2-3 times more games (200-300 games). 281 is very close to this guess.

Code: Select all
Type I errors (false positives): 0.00 %
Type II errors (false negatives): 0.01 %
I ran 80000 simulations: the stronger engine failed SPRT(0, 30) in 11 of them. I do not have any idea on the median this time (only that it is between 39 and 999 games), I need to add code to my simulator... probably the next month.
So, in average only 281 games and not several hundreds (this is what I think that SPRT seems to be more intended to small differences, although I am not an expert). I think you should write your own SPRT simulator, just in case mine is wrong.
Just a question: when you say 'window', do you refer to 'elo1 - elo0'?

Yes.

If I modifiy bounds to more less SPRT(0, 0.2) with the rest of parameters unchanged, then:
Code: Select all
Shortest simulation: 26944 games (+6480 -3373 =17091).
Longest simulation: 35310 games (+7942 -4830 =22538).
Average number of games per simulation: 30728
Type I errors (false positives): 0.00 %
Type II errors (false negatives): 0.00 %
I ran 10000 simulations this time. The median is between 30000 and 31000 games, probably very near to 30700 games.
So, gap = 0 Bayeselo and SPRT(-3, 3) gives more less the same average stop than gap ~ 60.75 Bayeselo and SPRT(0, 0.2)! (Bounds of SPRT are also expressed in Bayeselo).
Regards from Spain.
Ajedrecista.

Thanks, Jesus.
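Following the suggestion to write an independent SPRT simulator, here is a rough sketch, not Ajedrecista's program. It assumes the BayesElo win/draw/loss model, an exact trinomial log-likelihood ratio summed game by game, Wald's bounds, and hypotheses expressed in Bayeselo; the names, the seed, the run count, and the `max_games` cap are all arbitrary choices made for this illustration.

```python
import math
import random

def bayeselo_probs(bayeselo, drawelo):
    """Win/draw/loss probabilities under the BayesElo draw model (assumed here)."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - bayeselo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + bayeselo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def sprt_run(true_elo, elo0, elo1, drawelo, alpha, beta, rng, max_games=100000):
    """Simulate games until the LLR crosses a Wald bound; returns (games, h1_accepted)."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    p0 = bayeselo_probs(elo0, drawelo)
    p1 = bayeselo_probs(elo1, drawelo)
    # LLR increment for a win, a draw and a loss, in that order
    steps = [math.log(b / a) for a, b in zip(p0, p1)]
    p_win, p_draw, _ = bayeselo_probs(true_elo, drawelo)
    llr = 0.0
    for games in range(1, max_games + 1):
        r = rng.random()
        outcome = 0 if r < p_win else (1 if r < p_win + p_draw else 2)
        llr += steps[outcome]
        if llr >= upper:
            return games, True
        if llr <= lower:
            return games, False
    return max_games, llr > 0.0  # cap reached: arbitrary fallback for this sketch

rng = random.Random(2014)  # fixed seed so the run is repeatable
runs = [sprt_run(60.75, 0.0, 30.0, 270.0, 0.05, 0.05, rng) for _ in range(2000)]
average = sum(n for n, _ in runs) / len(runs)
failures = sum(1 for _, accepted in runs if not accepted)
print(f"average stop: {average:.0f} games, H1 rejected in {failures} of {len(runs)} runs")
```

With a true gap of 60.75 Bayeselo and SPRT(0, 30), Wald's approximation puts the expected stop near 270 to 280 games, so an average in that region would be consistent with the 281 reported above. SPRT(0, 0.2) can be tried by changing `elo1`, but it needs tens of thousands of games per run.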
-
ouachita
- Posts: 454
- Joined: Tue Jan 15, 2013 4:33 pm
- Location: Ritz-Carlton, NYC
- Full name: Bobby Johnson
Re: Stockfish seems definitely the strongest engine
Laskos wrote:
10'+10'' is almost double TC compared to 10'+1''.

1. Probably closer to 50% more time for the average game, and,
2. 12 cores run at approx. 8-10 times the kN/s of one core.
3. Engines using 12 cores each will beat engines using one core each >90% of the time when using the same time control.
The results of one-core matches cannot be reasonably compared to 4, 6, 8, 12 or 16 core matches, all other variables being the same.
SIM, PhD, MBA, PE
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Stockfish seems definitely the strongest engine
ouachita wrote:
Laskos wrote:
10'+10'' is almost double TC compared to 10'+1''.
1. Probably closer to 50% more time on average game, and,
2. 12 cores run at approx. 8-10 times the kN/s as one core.
3. Engines using 12 cores/each will be beat engines using one core/each >90% when using same time control.
The results of one core matches cannot be reasonably compared to 4, 6, 8, 12 or 16 core matches, all other variables being the same.

Ah, ok, you are testing engines on 12 cores. A bit more games would be useful, and I am solely testing 1-core performance, not the MP one. Anyway, I decided to leave the new match until a SPRT stop (alpha, beta=0.05, elo0=0, elo1=30), maybe in 2-3 days it will stop, after 200-400 games at 10'+10''.
-
ouachita
- Posts: 454
- Joined: Tue Jan 15, 2013 4:33 pm
- Location: Ritz-Carlton, NYC
- Full name: Bobby Johnson
Re: Stockfish seems definitely the strongest engine
There are a lot of one core tests posted here, so I wanted to post these results to again highlight the point that one core results are not related to multi-core results, in this case, 12 cores:
Code: Select all
1-22-14
SF0901014IP-12 core v SF 080114-1 core
1+1
50 positions, alternating colors
defaults
# of cores is sole setting difference.
1 Stockfish 090114IP 64 SSE4.2 +205 +53/=47/-0 76.50% 76.5/100
2 Stockfish 080114 64 SSE4.2 -205 +0/=47/-53 23.50% 23.5/100
Also, I misspoke by saying 12 cores win >90%. Here, 12 cores scored 76.5, but had 100% of wins.
Food for thought.
SIM, PhD, MBA, PE
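The +205 in the table is simply the 76.5% score put through the logistic Elo formula. A small sketch, assuming that standard formula (the function name is illustrative):

```python
import math

def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction strictly between 0 and 1."""
    return 400.0 * math.log10(score / (1.0 - score))

print(round(elo_from_score(0.765)))  # 12 cores vs 1 core, 76.5/100
print(round(elo_from_score(0.56)))   # SF vs H4 Contempt=0, 56/100
```

The same formula applied to 56/100 from the 10'+10'' match gives roughly 42 Elo, though with only 100 games the error bars on that number are wide.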
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Stockfish seems definitely the strongest engine
ouachita wrote:
There are a lot of one core tests posted here, so I wanted to post these results to again highlight the point that one core results are not related to multi-core results, in this case, 12 cores:
Code: Select all
1-22-14
SF0901014IP-12 core v SF 080114-1 core
1+1
50 positions, alternating colors
defaults
# of cores is sole setting difference.
1 Stockfish 090114IP 64 SSE4.2 +205 +53/=47/-0 76.50% 76.5/100
2 Stockfish 080114 64 SSE4.2 -205 +0/=47/-53 23.50% 23.5/100
Also, I misspoke by saying 12 cores win >90%. Here, 12 cores scored 76.5, but had 100% of wins.
Food for thought.

150-200 Elo points from 1 to 12 cores at 1'+1'' are to be expected, but I don't agree that 1-core results are unrelated to 12-core results. The MP scaling of top engines is comparable. In your place I would play 10'+10'' games on one core or on several cores SF against Houdini Contempt=0 until a SPRT stop, to dispel some myths (that SF does not scale better, for example). There were many 1-core results, but none of them had LOS of 98% SF against Houdini 4 Contempt=0, and that happens at somewhat larger TC than blitz. I will now wait for a SPRT stop in Cutechess-Cli to show that SF overtook Houdini (if that is the case).
-
ernest
- Posts: 2053
- Joined: Wed Mar 08, 2006 8:30 pm
Re: Stockfish seems definitely the strongest engine
Laskos wrote:
ernest wrote:
Laskos wrote:
1. Stockfish scales better to longer TC
55% => 56% with 56% based on 100 games...
Kai, I am ashamed of you...
45% -> 56%

Indeed you wrote
At 15s + 0.05s TC Houdini scored 55% in 2,000 games.
Now I am ashamed...