AndrewGrant wrote: ↑Mon Jan 18, 2021 12:53 pm
Time and time again, users of this forum will claim to know the value of an engine pairing based only on 100 games, or even 10 games. Those people lack a basic understanding of statistics, and here is a live example for you.
I launched a tuning test on Ethereal's hand-crafted eval using some new ideas. Here were the results initially...
ELO | 42.43 +- 25.95 (95%)
SPRT | 10.0+0.1s Threads=1 Hash=8MB
LLR | 0.63 (-2.94, 2.94) [0.00, 4.00]
Games | N: 288 W: 77 L: 42 D: 169
If you ignore the bias introduced by SPRT cutoffs, the patch gains Elo with well over 95% confidence, by multiple standard deviations. But the people who actually program the engines tend to know a bit better.
Here are the results now, as I type this...
ELO | -4.22 +- 6.11 (95%)
SPRT | 10.0+0.1s Threads=1 Hash=8MB
LLR | -1.37 (-2.94, 2.94) [0.00, 4.00]
Games | N: 4688 W: 858 L: 915 D: 2915
http://chess.grantnet.us/test/9548/
Stop playing tiny samples, or you will commit garbage or claim garbage, which is what I would have done here if I had not employed SPRT.
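To make the two snapshots above concrete, here is a minimal sketch of how an Elo estimate and a 95% margin can be derived from W/L/D counts. It assumes the usual logistic Elo model and treats every game as an independent trial. The function names are made up for illustration, and the exact formula used by the testing framework is not shown in the post, so the margins should come out close to the posted ones rather than identical.

```python
import math
from statistics import NormalDist


def elo_from_score(score):
    """Convert an expected score in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins, losses, draws, confidence=0.95):
    """Return (elo, margin), assuming every game is an independent trial."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    # Two-sided normal quantile, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    low = elo_from_score(score - z * stderr)
    high = elo_from_score(score + z * stderr)
    return elo_from_score(score), (high - low) / 2.0


# The two snapshots from the quoted post, as (W, L, D).
for w, l, d in [(77, 42, 169), (858, 915, 2915)]:
    elo, margin = elo_with_margin(w, l, d)
    print(f"N={w + l + d}: {elo:+.2f} +- {margin:.2f}")
```

The point estimates reproduce the posted 42.43 and -4.22. The margin shrinks roughly with the square root of the number of games, which is why the 288-game interval is about four times wider than the 4688-game one.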
I think the interesting question is whether you are sure that the testing conditions are good.
I can imagine bad testing conditions, when you test X against Y, such as one of the following:
1) In some of the games X is slowed down by a significant factor while Y is not, and in other games it is the opposite.
If you also have statistics on the nodes per second of both engines, you can identify this type of problem. For example, if you see that Y gets more nodes per second than X in all of games 1-300 while X gets more nodes per second in all of games 301-600, then it is obvious that something is wrong in the testing.
2) You do not play every position with both colors.
I believe that, for the same number of games, playing every position with both colors should reduce the error in testing.
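The check described under 1) can be done mechanically if the per-game nodes per second of both engines are logged. The following is only a sketch under that assumption: the two input lists, the block size, and the 10% tolerance are all hypothetical choices, not something taken from any particular testing framework.

```python
from statistics import mean


def nps_ratio_by_block(nps_x, nps_y, block_size=100):
    """Mean NPS ratio X/Y for each consecutive block of games."""
    ratios = [x / y for x, y in zip(nps_x, nps_y)]
    return [mean(ratios[i:i + block_size])
            for i in range(0, len(ratios), block_size)]


def suspicious_blocks(nps_x, nps_y, block_size=100, tolerance=0.10):
    """Indices of blocks whose mean X/Y ratio differs from the overall
    mean ratio by more than `tolerance` (relative difference)."""
    overall = mean(x / y for x, y in zip(nps_x, nps_y))
    blocks = nps_ratio_by_block(nps_x, nps_y, block_size)
    return [i for i, r in enumerate(blocks)
            if abs(r / overall - 1.0) > tolerance]


# Made-up data: Y is faster in games 1-300, X is faster in games 301-600.
nps_x = [800_000] * 300 + [1_000_000] * 300
nps_y = [1_000_000] * 300 + [800_000] * 300
print(suspicious_blocks(nps_x, nps_y))
```

With healthy conditions the ratio should stay roughly constant from block to block, so the list of flagged blocks should be empty. In the made-up data above every block is flagged, because the ratio flips halfway through the test.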
Note that I believe that with correct testing the +- 25.95 after 288 games can be reduced, because I believe this number is based on the assumption that the results of a pair of consecutive games are independent, which does not have to be the case with the best testing.
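Here is a small sketch of that idea. If the two games of a pair (same opening, colors reversed) have correlation rho, the variance of the pair's average score is (sigma^2 / 2) * (1 + rho), so a negative rho gives a smaller error than the independent-games formula. The per-game results below are invented for illustration, since the post only gives W/L/D totals and not the pair-by-pair results, and the function names are hypothetical.

```python
import math


def stderr_per_game(results):
    """Standard error of the mean score, assuming all games are independent."""
    n = len(results)
    mu = sum(results) / n
    var = sum((r - mu) ** 2 for r in results) / n
    return math.sqrt(var / n)


def stderr_per_pair(results):
    """Standard error of the mean score, treating each two-game pair
    (same opening, colors reversed) as a single observation."""
    pairs = [(results[i] + results[i + 1]) / 2.0
             for i in range(0, len(results) - 1, 2)]
    m = len(pairs)
    mu = sum(pairs) / m
    var = sum((p - mu) ** 2 for p in pairs) / m
    return math.sqrt(var / m)


# Invented 288-game sample, listed pair by pair from X's point of view:
# 40 pairs where the opening decides both games (X wins one, loses one),
# 20 pairs where X wins one game and draws the other, 84 drawn pairs.
results = [1.0, 0.0] * 40 + [1.0, 0.5] * 20 + [0.5, 0.5] * 84

print(f"per-game stderr: {stderr_per_game(results):.4f}")
print(f"per-pair stderr: {stderr_per_pair(results):.4f}")
```

In this invented sample the per-pair standard error is well below the per-game one, because many openings produce one win and one loss that cancel out within the pair. If the two games of each pair were truly independent, both estimates would agree on average.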