Most new versions include something like this:
LLR: 2.95 (-2.94,2.94)
Total: 39818 W: 8174 L: 7956 D: 23688
bench: 3453941
Can anyone explain what they mean and how they are calculated? I get the second line, but what conditions are used for the games?
Thanks.
Stats and bench on Stockfish development site
Re: Stats and bench on Stockfish development site.
Hello John:

JohnS wrote:
Most new versions include something like this:
LLR: 2.95 (-2.94,2.94)
Total: 39818 W: 8174 L: 7956 D: 23688
bench: 3453941
Can anyone explain what they mean and how they are calculated? I get the second line, but what conditions are used for the games?
Thanks.

I am not an expert on the subject but I will try my best.
You are referring to this sequential test. LLR stands for Log-Likelihood Ratio, the statistic that SPRT (Sequential Probability Ratio Test) tracks. Stage I of that test had these testing conditions:
Code:
sprt @ 15+0.05 th 1
This means a sequential probability ratio test at a time control of 15" + 0.05"/move (a Fischer time control) per player, with one thread (th 1) for each engine. Besides this TC of 15+0.05, Stage I uses some SPRT parameters: alpha = 0.05 (5%) and beta = 0.05 (5%), which are the probabilities of type I and type II errors of the test, respectively.
The code of the calculation of LLR can be found here between lines 54 and 121. The numbers between parentheses (-2.94, 2.94) are the lower and upper bounds. If LLR < -2.94, the patch is discarded; else if LLR > 2.94, the patch is accepted (in this case, it is tested again at a longer TC at Stage II: 60+0.05).
Looking at the code (lines 95 and 96), these bounds are easily calculated: (lower bound) = ln[beta/(1 - alpha)] = ln(0.05/0.95) = -ln(19) ~ -2.9444; (upper bound) = ln[(1 - beta)/alpha] = ln(0.95/0.05) = ln(19) ~ 2.9444. In this case LLR > (upper bound), so this patch will be tested again at a longer TC. The bounds on LLR act as a stopping rule.
SPRT has different parameters depending on the stage: alpha and beta remain constant, but the elo0 and elo1 parameters (measured in BayesElo units) vary:
Code:
Always: elo0 < elo1.
Stage I:
elo0 = -1.5
elo1 = 4.5
Stage II:
elo0 = 0
elo1 = 6
Stage II is more restrictive and therefore more difficult to pass: a lot of patches accepted at Stage I fail at Stage II. Stage I is a kind of fast filter for bad patches.
I can give you some additional numbers for the example you posted:
Code:
Parameters found at LLR_parameters.txt file:
alpha: 0.0500
beta: 0.0500
bayeselo_0: -1.5000
bayeselo_1: 4.5000
----------------------------
Lower bound for LLR: -2.9444
Upper bound for LLR: 2.9444
----------------------------
Games: 39818
Wins: 8174 (20.53 %).
Loses: 7956 (19.98 %).
Draws: 23688 (59.49 %).
bayeselo: 2.9443
drawelo: 238.0870
----------------------------
LLR(wins): 224.7448
LLR(loses): -219.5174
LLR(draws): -2.2820
LLR: 2.9454
I got these results with my own programme (I copied the underlying mathematics from the piece of code I mentioned before). Of course: LLR = LLR(wins) + LLR(loses) + LLR(draws).
Draws count negatively in LLR here because (elo0 + elo1)/2 > 0, that is, elo0 + elo1 > 0. Draws do not affect LLR if (elo0 + elo1)/2 = 0, that is, elo0 = -elo1. Logically, draws count positively in LLR if (elo0 + elo1)/2 < 0.
I said before that Stage II is more restrictive than Stage I, and this is due to the parameters elo0 and elo1. SPRT(-1.5, 4.5) gave LLR ~ 2.9454; if we take the same number of wins, draws and losses but work with SPRT(0, 6), then LLR ~ -0.1136, which lies between the two bounds, so the test would continue.
------------------------
Regarding bench, here is an interesting thing on the subject. Bench is run with a command line instruction. Open a command prompt in the folder that contains the executable and type the following:
Code:
C:\Documents and Settings\[...]\StockFish\stockfish-4-win\stockfish-4-win\Windows>stockfish_4_32bit bench 128 1 12 default depth
(You have to type stockfish_4_32bit bench 128 1 12 default depth; the text before the word 'bench' is the name of the executable without '.exe'.) 128 means 128 MB of hash; 1 means one core; 12 means depth 12. At the end of the benchmark something like this should appear:
Code:
===========================
Total time (ms) : 11469
Nodes searched : 4132352
Nodes/second : 360306
Bench = Nodes searched = 4132352 for my copy of SF 4, which is different from the bench reported at this site:
Code:
Author: Marco Costalba
Date: Tue Aug 20 09:01:25 2013 +0200
Timestamp: 1376982085
Stockfish 4
Stockfish bench signature is: 4132374
My copy is probably corrupted.
Bench is done with these 16 positions (lines 35 to 50 of the benchmark.cpp file):
Code:
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 10
8/2p5/3p4/KP5r/1R3p1k/8/4P1P1/8 w - - 0 11
4rrk1/pp1n3p/3q2pQ/2p1pb2/2PP4/2P3N1/P2B2PP/4RRK1 b - - 7 19
rq3rk1/ppp2ppp/1bnpb3/3N2B1/3NP3/7P/PPPQ1PP1/2KR3R w - - 7 14
r1bq1r1k/1pp1n1pp/1p1p4/4p2Q/4Pp2/1BNP4/PPP2PPP/3R1RK1 w - - 2 14
r3r1k1/2p2ppp/p1p1bn2/8/1q2P3/2NPQN2/PPP3PP/R4RK1 b - - 2 15
r1bbk1nr/pp3p1p/2n5/1N4p1/2Np1B2/8/PPP2PPP/2KR1B1R w kq - 0 13
r1bq1rk1/ppp1nppp/4n3/3p3Q/3P4/1BP1B3/PP1N2PP/R4RK1 w - - 1 16
4r1k1/r1q2ppp/ppp2n2/4P3/5Rb1/1N1BQ3/PPP3PP/R5K1 w - - 1 17
2rqkb1r/ppp2p2/2npb1p1/1N1Nn2p/2P1PP2/8/PP2B1PP/R1BQK2R b KQ - 0 11
r1bq1r1k/b1p1npp1/p2p3p/1p6/3PP3/1B2NN2/PP3PPP/R2Q1RK1 w - - 1 16
3r1rk1/p5pp/bpp1pp2/8/q1PP1P2/b3P3/P2NQRPP/1R2B1K1 b - - 6 22
r1q2rk1/2p1bppp/2Pp4/p6b/Q1PNp3/4B3/PP1R1PPP/2K4R w - - 2 18
4k2r/1pb2ppp/1p2p3/1R1p4/3P4/2r1PN2/P4PPP/1R4K1 b - - 3 22
3q2k1/pb3p1p/4pbp1/2r5/PpN2N2/1P2P2P/5PP1/Q2R2K1 b - - 4 26
I hope that this post will be helpful for you.
Regards from Spain.
Ajedrecista.
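
For anyone who wants to redo the numbers in the post above, here is a minimal Python sketch of the bounds and of the three LLR terms. It assumes the BayesElo win/draw/loss model with drawelo estimated from the sample itself, which is consistent with the bayeselo and drawelo values quoted in the post. The function names are made up for this illustration and are not taken from the testing framework's code.

```python
import math

def sprt_bounds(alpha, beta):
    """Lower and upper LLR stopping bounds, using the formulas from the post."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper

def proba_to_bayeselo(p_win, p_loss):
    """Estimate (bayeselo, drawelo) from observed win and loss frequencies."""
    elo = 200.0 * math.log10(p_win / p_loss * (1.0 - p_loss) / (1.0 - p_win))
    drawelo = 200.0 * math.log10((1.0 - p_loss) / p_loss * (1.0 - p_win) / p_win)
    return elo, drawelo

def bayeselo_to_proba(elo, drawelo):
    """Win, loss and draw probabilities for a given bayeselo and drawelo."""
    p_win = 1.0 / (1.0 + 10.0 ** ((-elo + drawelo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((elo + drawelo) / 400.0))
    return p_win, p_loss, 1.0 - p_win - p_loss

def llr_terms(wins, losses, draws, elo0, elo1):
    """Return the (wins, losses, draws) contributions to the LLR."""
    n = wins + losses + draws
    # drawelo is estimated from the observed results (assumed model).
    _, drawelo = proba_to_bayeselo(wins / n, losses / n)
    w0, l0, d0 = bayeselo_to_proba(elo0, drawelo)
    w1, l1, d1 = bayeselo_to_proba(elo1, drawelo)
    return (wins * math.log(w1 / w0),
            losses * math.log(l1 / l0),
            draws * math.log(d1 / d0))

lower, upper = sprt_bounds(0.05, 0.05)
stage1 = llr_terms(8174, 7956, 23688, elo0=-1.5, elo1=4.5)
stage2 = llr_terms(8174, 7956, 23688, elo0=0.0, elo1=6.0)
print("bounds: %.4f %.4f" % (lower, upper))
print("Stage I  terms:", stage1, "LLR = %.4f" % sum(stage1))
print("Stage II terms:", stage2, "LLR = %.4f" % sum(stage2))
```

If the assumed model matches the one used in the post, the Stage I sum should land close to the 2.9454 quoted there and the Stage II sum close to -0.1136, with the draws term negative for Stage I.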
Re: Stats and bench on Stockfish development site.
Thank you for this explanation
Re: Stats and bench on Stockfish development site.
Thanks to Paul and Jesus for the explanations.
Re: Stats and bench on Stockfish development site.
Ajedrecista wrote: My copy is probably corrupted.
Just run 'stockfish bench'
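
To compare a local bench against a published signature without reading the console by hand, something like the following Python sketch could be used. It only relies on the "Nodes searched" line shown earlier in the thread. The `run_bench` helper and the executable path passed to it are hypothetical, and scanning both stdout and stderr is an assumption, since the thread does not say which stream carries the summary.

```python
import re
import subprocess

def parse_nodes_searched(bench_output):
    """Return the integer after 'Nodes searched', or None if it is absent."""
    match = re.search(r"Nodes searched\s*:\s*(\d+)", bench_output)
    return int(match.group(1)) if match else None

def run_bench(executable, args=("bench",)):
    """Run the engine's bench and return its node count (hypothetical helper).

    `executable` is a path to a Stockfish binary on your machine. Both
    output streams are scanned because the stream used is an assumption.
    """
    proc = subprocess.run([executable, *args], capture_output=True, text=True)
    return parse_nodes_searched(proc.stdout + proc.stderr)

# Sample summary copied from the bench output quoted in this thread.
sample = """===========================
Total time (ms) : 11469
Nodes searched  : 4132352
Nodes/second    : 360306"""
print(parse_nodes_searched(sample))
```

Calling `run_bench` with the path to your own executable would then give a number to compare with the "bench:" value listed for the matching version.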
Re: Stats and bench on Stockfish development site.
Ajedrecista wrote:
Regarding bench, here is an interesting thing on the subject. Bench is ran with a command line instruction. Open the cmd in the folder that contains the executable and type the following:
Code:
C:\Documents and Settings\[...]\StockFish\stockfish-4-win\stockfish-4-win\Windows>stockfish_4_32bit bench 128 1 12 default depth
(You have to type stockfish_4_32bit bench 128 1 12 default depth; the text before the word 'bench' is the name of the executable without '.exe'). 128 means 128 MB of hash; 1 means one core; 12 means depth 12. At the end of the benchmark task it should appear something like this:
Code:
===========================
Total time (ms) : 11469
Nodes searched : 4132352
Nodes/second : 360306
Bench = Nodes searched = 4132352 for my copy of SF 4, which is different from the reported bench at this site:
Code:
Author: Marco Costalba
Date: Tue Aug 20 09:01:25 2013 +0200
Timestamp: 1376982085
Stockfish 4
Stockfish bench signature is: 4132374
My copy is probably corrupted.

I think at the time of the thread, it was not yet possible to run the bench command from inside the program. Marco added that possibility later. So you now just look for the stockfish.exe in your download and double-click it, or right-click -> Open. Then, after the message
Stockfish 050913 by Tord Romstad, Marco Costalba and Joona Kiiski
(depending on the date the code was compiled)
appears, type bench and hit the Enter key.
That way you don't have to open a separate command window in the same folder as stockfish.exe is in. Very useful change for us simple Windows users, thanks to Marco!
Eelco
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan