Michel wrote: Actually that is a brilliant idea. To put a truncated Wald test in cutechess-cli (I believe cutechess-cli can run tournaments now). You could use cutechess-cli to run one experiment (i.e. a game vs each opponent) and a Python script that runs it and tests the stopping rule.
The math for determining the parameters of a truncated Wald test is quite complicated (not the execution of the test itself, which is trivial), but I wrote some notes about it, so I am available for consulting if necessary.

Indeed, the sequential Wald test should ideally be built into cutechess-cli. As I'm working on my own tournament manager, it's definitely on my todo list. I think the proper way of testing for an improvement with an elo resolution of, let's say, epsilon is as follows:
1/ we play 2 games against each program P(1)...P(n), one with white and one with black, in a given position. The results give a sequence X(t). Note that this avoids the pitfall of self-play, and the X(t) are independent and approximately identically distributed (if positions are chosen to give equal chances and a varied representation of the set of possibilities).
2/ we do a sequential Wald test of elo_diff = 0 vs elo_diff = epsilon, where scores and elos are derived from one another using the formula from BayesElo (removing the color bias correction, which isn't relevant here).
It's just about finding a way that is generic enough to not make the command line ugly and messy. cutechess-cli already plays gauntlets, so it's just a question of sampling double gauntlets and defining a stopping rule. The stopping rule takes 3 parameters: the type I and type II error probabilities, and epsilon (the elo resolution, where [0,epsilon] is the "grey zone", as usual with the Wald test, sequential or not).
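To make step 2/ and the three-parameter stopping rule concrete, here is a rough Python sketch of a plain (non-truncated) sequential Wald test. It is only an illustration under assumptions of mine: the function names are made up, the draw_elo value of 200 is an arbitrary placeholder for the BayesElo draw parameter, and each game is treated as one independent win/draw/loss observation rather than working with the 2-game pairs X(t) directly.

```python
import math

def bayeselo_probs(elo_diff, draw_elo=200.0):
    """Win/draw/loss probabilities in the BayesElo model,
    with the color advantage term removed.
    draw_elo=200 is an illustrative value, not a fitted one."""
    p_win = 1.0 / (1.0 + 10.0 ** ((draw_elo - elo_diff) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((draw_elo + elo_diff) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def sprt(results, epsilon, alpha=0.05, beta=0.05, draw_elo=200.0):
    """Sequential Wald test of H0: elo_diff = 0 vs H1: elo_diff = epsilon.

    results: iterable of 'w', 'd' or 'l' (one entry per game).
    alpha, beta: type I and type II error probabilities.
    Returns (decision, games_used), decision in {'H0', 'H1', 'continue'}.
    """
    p0 = bayeselo_probs(0.0, draw_elo)
    p1 = bayeselo_probs(epsilon, draw_elo)
    # Log-likelihood ratio contributed by each possible game outcome.
    step = {k: math.log(b / a) for k, a, b in zip("wdl", p0, p1)}
    # Wald's approximate stopping bounds.
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = 0.0
    n = 0
    for n, r in enumerate(results, start=1):
        llr += step[r]
        if llr >= upper:
            return "H1", n   # accept elo_diff = epsilon
        if llr <= lower:
            return "H0", n   # accept elo_diff = 0
    return "continue", n     # data exhausted before a decision

if __name__ == "__main__":
    # Toy input only: a long winning streak should end in favour of H1.
    print(sprt(["w"] * 1000, epsilon=5.0))
```

A tournament manager would feed the results in as the games finish and stop scheduling games as soon as the decision is no longer 'continue'. Truncating the test, as Michel suggests, would need an extra cap on the number of games and adjusted bounds, which this sketch does not attempt.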
This would REALLY be a nice feature to have built into cutechess-cli, but unfortunately patching cutechess-cli is not trivial for me, as it requires a lot of Qt knowledge (on top of C++). I've never even figured out how to install qmake on my Fedora 17, to begin with.
