I used the jackknife estimator (https://en.wikipedia.org/wiki/Jackknife_resampling) to check, for one particular and very widely used type of test suite, whether the statistical predictions of the binomial model match the estimated value. The quantity compared was the variance of the engine's results on the suite.

I found that the variance of the test suite results predicted by the binomial model greatly exceeds the variance calculated by the jackknife estimator. The most marked difference was observed for the single-threaded results. The results also show that the variance of engine results on a suite depends on the engine used, or at least on the type of engine used.

Test suite: Arasan 10, 200 positions, each of which is either solved or not: 1 point for solving a position, 0 otherwise. Note that Arasan is a particular type of test suite in which the distinction between the best move and the other moves is clear-cut, both in the (correct) engine eval and in the scoring.

Engines used: Stockfish 12 1 thread, Stockfish 12 4 threads, Lc0 v26.2 SV4300 2 threads

Analysis time per position: 0.1s

Number of times the test suite is run for each engine: 100
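
The post does not spell out exactly how the jackknife was applied to the 100 runs, so what follows is only a sketch of one plausible reading: take the 100 suite scores of one engine, apply the delete-one jackknife to the plug-in variance of those scores, and take the square root. The function names and the example scores are hypothetical, not data from this test.

```python
import math
import statistics

def jackknife(values, estimator):
    """Delete-one jackknife for an arbitrary estimator.

    Returns (bias_corrected_estimate, jackknife_standard_error).
    """
    n = len(values)
    full = estimator(values)
    # Recompute the estimator n times, each time leaving out one run.
    leave_one_out = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    loo_mean = sum(leave_one_out) / n
    bias_corrected = n * full - (n - 1) * loo_mean
    std_error = math.sqrt(
        (n - 1) / n * sum((t - loo_mean) ** 2 for t in leave_one_out)
    )
    return bias_corrected, std_error

def jackknife_sd(scores):
    """Run-to-run sd of the suite score (an assumed reading of the post)."""
    variance, _ = jackknife(scores, statistics.pvariance)
    return math.sqrt(variance)

# Hypothetical scores for 100 runs of a 200-position suite (not real data).
scores = [float(125 + i % 5) for i in range(100)]
print(round(jackknife_sd(scores), 3))
```

For the plug-in variance, the bias-corrected jackknife estimate coincides with the usual unbiased sample variance, so under this reading the jackknife sd is simply the observed run-to-run spread of the suite score.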

Results:

Stockfish 12 1 thread:

Binomial model standard deviation: 6.8

Jackknife estimator standard deviation: 1.2

Stockfish 12 4 threads:

Binomial model standard deviation: 7.0

Jackknife estimator standard deviation: 4.3

Lc0 v26.2 SV4300 2 threads:

Binomial model standard deviation: 7.0

Jackknife estimator standard deviation: 1.3

It can be seen that the "real life" standard deviation of engine results on test suites is much smaller than that predicted by the binomial model, especially for single-threaded AB engines and Lc0.
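
The size of the gap is easiest to see as a ratio. The snippet below only divides each reported jackknife sd by the corresponding binomial sd; the dictionary layout is illustrative, while the numbers are the ones reported in the results.

```python
# (binomial model sd, jackknife sd) as reported in the results
results = {
    "Stockfish 12 1 thread": (6.8, 1.2),
    "Stockfish 12 4 threads": (7.0, 4.3),
    "Lc0 v26.2 SV4300 2 threads": (7.0, 1.3),
}

ratios = {name: jack / binom for name, (binom, jack) in results.items()}
for name, ratio in ratios.items():
    print(f"{name}: jackknife sd is {ratio:.0%} of binomial sd")
# Stockfish 12 1 thread: jackknife sd is 18% of binomial sd
# Stockfish 12 4 threads: jackknife sd is 61% of binomial sd
# Lc0 v26.2 SV4300 2 threads: jackknife sd is 19% of binomial sd
```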

Some useful things to remember. The binomial standard deviation (sd) of the results is given by

binomial model sd = sqrt[ (solved positions) * (unsolved positions) / (total positions) ] ;
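
As a worked instance of this formula, here is a minimal sketch. The function name is made up for illustration, and the solved count of 100 is a hypothetical example, since the actual scores are not listed in the post.

```python
import math

def binomial_sd(solved, total):
    """Binomial model sd of the suite score: sqrt(solved * unsolved / total)."""
    unsolved = total - solved
    return math.sqrt(solved * unsolved / total)

# Hypothetical score: 100 of the 200 Arasan positions solved.
print(round(binomial_sd(100, 200), 2))  # 7.07
```

This is the same as sqrt(N * p * (1 - p)) with N = total positions and p = solved / total, the usual binomial standard deviation.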

Rule of thumb for results in test suites for single-threaded AB engines or Lc0:

real sd = 20% of binomial model sd ;

Rule of thumb for results in test suites for multi-threaded AB engines:

real sd = 60% of binomial model sd ;

A quick-and-dirty binomial model sd can be obtained as 0.5 * sqrt[ total positions ]

as long as the engine neither solves almost all of the positions nor misses almost all of them.

"total positions" is the total number of positions in the test suite.
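
The quick approximation and the two rules of thumb can be combined into a small helper. This is only a sketch: the function names and the `multi_threaded_ab` flag are invented for illustration, and the 20% and 60% factors are the rough figures given above, not exact constants.

```python
import math

def quick_binomial_sd(total_positions):
    """Quick approximation: 0.5 * sqrt(total positions)."""
    return 0.5 * math.sqrt(total_positions)

def rule_of_thumb_sd(binomial_sd, multi_threaded_ab=False):
    """Scale the binomial sd by the rule-of-thumb factor from the post."""
    factor = 0.6 if multi_threaded_ab else 0.2
    return factor * binomial_sd

quick = quick_binomial_sd(200)  # Arasan: 200 positions
print(round(quick, 2))                                            # 7.07
print(round(rule_of_thumb_sd(quick), 2))                          # 1.41
print(round(rule_of_thumb_sd(quick, multi_threaded_ab=True), 2))  # 4.24
```

The quick value is the maximum of the exact formula, reached when half of the positions are solved. At 25% solved, for instance, the exact binomial sd for 200 positions is about 6.12, so the approximation overestimates only modestly until the score gets close to 0 or to the full total.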

## Jackknife estimate of the variance of engine results for Arasan test suite
