running tests: how many?

Discussion of chess software programming and technical issues.

Moderator: Ras

flok

running tests: how many?

Post by flok »

Hi,

I was wondering: when I want to test whether a change helped the strength or not, how many games should it play to give some confidence?

For that matter: I have set up a computer which plays games 24/7 between a couple of chess engines. Some for reference (fairymax, crafty, fruit, etc.) and the rest revisions of my own program (http://vanheusden.com/cchess/battle.html).
Rebel
Posts: 7521
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: running tests: how many?

Post by Rebel »

xmas79
Posts: 286
Joined: Mon Jun 03, 2013 7:05 pm
Location: Italy

Re: running tests: how many?

Post by xmas79 »

From that page:
A depth-based testing system looks somewhat suspicious at first glance (and it is), but with a few bells and whistles it has become an elegant and working system that excludes any form of user and/or Windows (and friends) interference that influences time-control-based testing. Doing depth-based matches, you (or the operating system) can do as many things in the background as you wish; the match result will be the same.
This works only for single-threaded matches IMHO... Play a bunch of SMP games and I'm pretty sure most won't match because OSes introduce race conditions here and there...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: running tests: how many?

Post by bob »

flok wrote:Hi,

I was wondering: when I want to test whether a change helped the strength or not, how many games should it play to give some confidence?

For that matter: I have set up a computer which plays games 24/7 between a couple of chess engines. Some for reference (fairymax, crafty, fruit, etc.) and the rest revisions of my own program (http://vanheusden.com/cchess/battle.html).
You have asked an unanswerable question. If you want to measure a change that is 4-5 Elo or so, it takes on the order of 30,000 games. If you want to measure a change that is 1-2 Elo, figure 100,000 games. If the change is 30 Elo, then you can get by with fewer than 30K games...
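The scale of these numbers can be sanity-checked with a back-of-the-envelope normal approximation. The sketch below is not Bob's methodology; the draw ratio and 95% confidence level are illustrative assumptions.

```python
import math

def elo_to_score(elo):
    """Expected score (0..1) for a given Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def games_needed(elo_diff, draw_ratio=0.4, z=1.96):
    """Rough number of games needed to detect elo_diff at ~95%
    confidence (two-sided normal approximation). draw_ratio is an
    assumed typical value, not a measured one."""
    p = elo_to_score(elo_diff)
    # Crude per-game score variance: draws pin the score at 0.5,
    # decisive games spread it by 0.5 around the mean.
    var = 0.25 * (1.0 - draw_ratio)
    return math.ceil(var * (z / (p - 0.5)) ** 2)
```

With these assumptions a 5 Elo difference needs roughly ten thousand games and a 2 Elo difference tens of thousands, which is the same order of magnitude as the figures above; the exact counts depend on the draw ratio and confidence level chosen.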
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: running tests: how many?

Post by lucasart »

flok wrote:Hi,

I was wondering: when I want to test whether a change helped the strength or not, how many games should it play to give some confidence?

For that matter: I have set up a computer which plays games 24/7 between a couple of chess engines. Some for reference (fairymax, crafty, fruit, etc.) and the rest revisions of my own program (http://vanheusden.com/cchess/battle.html).
Bob said your question cannot be answered. This is only partially true:
  • It cannot be answered by a number. Let's say we assume (naively) that 10k games are enough. You play 10k games and get (W,L,D)=(2510,2490,5000). The likelihood of superiority is only 61.1%, which is not significant at all. On the other hand, if the result had been (2600,2400,5000), that would be significant, with a likelihood of superiority of 99.8%.
  • But it can be answered by an algorithm: the sequential probability ratio test (SPRT). That's what I use, and it's built into cutechess-cli. SPRT is a bit difficult for beginners to understand, and the idea is that there is a tradeoff: the smaller the resolution you want, the higher the expected (random) stopping time will be. For example, if you use SPRT(elo0=0, elo1=4, alpha=0.05, beta=0.05), you have a type I error alpha=5% (chance of a negative-Elo patch slipping through) and a type II error beta=5% (chance of a patch worth 4 Elo or more being rejected). You are basically following a risk-averse strategy (asymmetry between type I error elo<0 and type II error elo>4) with a 4 Elo resolution. You could instead implement a risk-neutral strategy (elo0=-2, elo1=+2), for example to test a patch that simplifies the code. Or even a risky strategy, where you voluntarily sacrifice a (controlled) amount of Elo to test a big simplification that is an investment in the long term (elo0=-4, elo1=0).
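To make the stopping rule concrete, here is a sketch of a commonly used trinomial normal approximation of the generalized log-likelihood ratio for SPRT. This is not necessarily cutechess-cli's exact implementation; it is one standard approximation, shown with the hypothetical bounds from the parameters above.

```python
import math

def expected_score(elo):
    """Expected score (0..1) for a given Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo >= elo1)
    vs H0 (elo <= elo0), trinomial normal approximation."""
    n = wins + draws + losses
    if wins == 0 or losses == 0:
        return 0.0
    w, d = wins / n, draws / n
    s = w + 0.5 * d                # observed mean score per game
    var = (w + 0.25 * d) - s * s   # per-game score variance
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return (s1 - s0) * (2.0 * s - s0 - s1) / (2.0 * var / n)

def sprt_state(wins, draws, losses, elo0=0, elo1=4,
               alpha=0.05, beta=0.05):
    """Classify a running match: accept H1, accept H0, or keep playing."""
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this
    val = llr(wins, draws, losses, elo0, elo1)
    if val >= upper:
        return "H1"
    if val <= lower:
        return "H0"
    return "continue"
```

On the example above, (2600,5000,2400) crosses the upper bound and accepts the patch, while the inconclusive (2510,5000,2490) stays between the bounds and the test simply keeps playing games.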
Regarding the time control, I would not recommend fixed-depth (or fixed-nodes) testing. I've seen many results behave completely differently from one depth to the next, and turn out to be worth zero Elo at a reasonable time control. Besides, time management is part of the Elo of an engine, so you need to test it too. What I always use is a time limit, but a very fast one:
* 1.2"+0.01" for CLOP tuning
* 2.4"+0.02" for a first step SPRT test (pre-selection phase)
* 6"+0.05" for a validation SPRT test (before commit)

But your engine needs to be capable of playing such time controls without losing on time. And, of course, it needs to be capable of playing tens of thousands of games without ever crashing. There is also the problem of timer resolution, which becomes important at 1.2"+0.01": the timer resolution used in the engine and the UI must be sub-millisecond (e.g. std::chrono::high_resolution_clock). It is also nice to have a configurable time buffer in your engine (the time it always leaves on the clock, no matter what): for that I use 10ms, and it never loses on time with cutechess-cli. In my experience the only UI capable of handling such extreme time controls properly, without incorrectly awarding time losses, is cutechess-cli.
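The time-buffer idea can be sketched in a few lines. The function name, the even-share allocation, and the moves-to-go default are all hypothetical; the only point taken from the post is that the engine never budgets past `remaining - buffer`.

```python
def allocate_time_ms(remaining_ms, increment_ms,
                     moves_to_go=30, buffer_ms=10):
    """Hypothetical sketch of time allocation with a safety buffer:
    spend an even share of the remaining time plus the increment,
    but always leave buffer_ms on the clock."""
    budget = remaining_ms / moves_to_go + increment_ms
    # Clamp: never plan to dip into the buffer, and always think
    # for at least 1 ms.
    return max(1.0, min(budget, remaining_ms - buffer_ms))
```

At 1.2"+0.01" this means the nominal share (40 ms + 10 ms increment) is used while time is plentiful, and the allocation collapses toward the buffer as the clock runs down instead of flagging.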

I used to use longer time controls, but I became impatient, and had to push the time vs. resolution tradeoff further in the direction of improving resolution. This is what happens when your engine becomes strong: diminishing returns, and good patches provide fewer and fewer Elo.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.