I don't know exactly how BayesElo calculates the LOS, so I can't say for sure. But I get the impression that stopping on the LOS is unsound. Your algo would have a margin of error <= 5% at time t only if you decide t in advance. If you allow yourself to stop at any t where the error is <= 5%, you cannot say that the overall error is <= 5%.

Ferdy wrote:
I made an auto-testing system (composed of batch files and an executable) to run cutechess-cli and bayeselo, and depending on the result of bayeselo, it will stop the test. Sample bayeselo report.

lucasart wrote:
also another feature, which derives from the previous one quite simply is the following:
rather than playing 200 games, let's say.
maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelihood of superiority) is above a given threshold, like 95% or something.
the good thing about this is that, sometimes when I test, either
* I end up doing too many games, because after only 50 the score is already so much in favor of engine A, that it is already significant to say A beats B
* or, I do 200 games, and even then the result is not significant.
this would be a unique feature, because in terms of automatic engine testing, there's really only one program that is good (cutechess-cli). And programs doing statistics only (bayeselo, elostat or whatever) don't do engine matches. So combining those in such a way is currently not possible.
Let me know what you think.
Lucas
Rating table
    Rank Name                     Elo    +    - games score oppo. draws
       1 Crafty_23.4               84   36   35   200   63%    -8   22%
       2 Deuterium-v11.02.29.82    42   49   49   100   44%    84   24%
       3 Bison-v9.11               28   49   48   100   63%   -57   29%
       4 Daydreamer-v1.75         -26   47   47   100   55%   -57   34%
       5 Deuterium-v11.02.29.87   -57   24   25   400   41%     4   28%
       6 N2-v0.4                  -70   48   48   100   48%   -57   28%

LOS is here
                           Cr De Bi Da De N2
    Crafty_23.4               92 90 99 99 99
    Deuterium-v11.02.29.82  7    60 90 98 98
    Bison-v9.11             9 39    90 99 99
    Daydreamer-v1.75        0  9  9    86 86
    Deuterium-v11.02.29.87  0  1  0 13    67
    N2-v0.4                 0  1  0 13 32

My base engine is Deuterium-v11.02.29.87 with 400 games (I know this is low, just an experiment), and my test engine is Deuterium-v11.02.29.82. First I played the base engine up to 400 games, then I let the test engine run under the auto-tester, with conditions to stop if it is better/worse by the elo criteria* and number of games (100) against the base engine.
* The elo criteria are calculated as follows:
After running the test engine to 100 games, run bayeselo and get the info on the base engine and the test engine.
From the rating table,
    te_min_elo = te_base_elo - te_min_margin = 42 - 49 = -7
    be_max_elo = be_base_elo + be_max_margin = -57 + 24 = -33
    te = test engine
    be = base engine
Since the base engine's max elo (-33) is already below the min elo (-7) of the test engine, testing was stopped. The LOS was showing 98. I would like to ask an expert's opinion: could it be acceptable to stop the games this early? Another stopping condition I have is to stop the test if the max elo of the test engine is below the min elo of the base engine after 100 games.
I am a little bit unhappy with the 100 games; maybe it should be at least 50% of the base engine's games.
Is there another method to stop testing based on the generated report of bayeselo?
Thanks.
On the other hand, if d(t) = 5%/(t*(t+1)) for t >= 1, so that sum(t=1,infinity) d(t) = 5%, then stopping at time t if LOS >= 1-d(t) or LOS <= d(t) should be theoretically sound.
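As a quick sanity check on such a schedule, here is a minimal sketch assuming the per-check budget is d(t) = 5%/(t*(t+1)) with t starting at 1. The names ALPHA, d and spent are just mine, for illustration. The sum telescopes, so the budget spent after n checks is 5% * n/(n+1), which stays below 5% and approaches it as n grows.

```python
from fractions import Fraction

ALPHA = Fraction(5, 100)  # total error budget (5%), assumed for illustration


def d(t: int) -> Fraction:
    """Error budget spent at check number t (t >= 1)."""
    if t < 1:
        raise ValueError("t must be >= 1")
    return ALPHA / (t * (t + 1))


def spent(n: int) -> Fraction:
    """Total budget spent over checks 1..n.

    Since 1/(t*(t+1)) = 1/t - 1/(t+1), this telescopes to ALPHA * n / (n + 1).
    """
    return sum((d(t) for t in range(1, n + 1)), Fraction(0))


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, float(spent(n)))
```

Exact fractions are used only so the telescoping identity can be checked without rounding noise.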
But again, maybe the LOS calculated by BayesElo isn't what I think it is, so I don't know.
The only way to see if the algo works in practice is to feed random results with given probabilities P(win) and P(draw) into BayesElo (which is *much* faster than playing games), and run the stopping algo many times to see what happens, and whether your risk of error is less than what it should be or not. Especially when P(win)+P(draw)/2 is close to 50%...
For that you'd have to take the bit of code in BayesElo that calculates the LOS, and feed in random numbers into it.
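Here is a rough sketch of that kind of experiment. Big caveat: I am NOT using BayesElo's LOS code here. The los() function below is the usual normal approximation based on wins and losses only, which is an assumption on my part and may differ from what BayesElo computes. All names and parameters (min_games, max_games, the 30% draw rate) are made up for illustration. It simulates two engines of exactly equal strength and counts how often a stopping rule wrongly declares A better, for a fixed 95% threshold and for a threshold that tightens as 1 - 5%/(t*(t+1)).

```python
import math
import random


def los(wins: int, losses: int) -> float:
    """Normal-approximation LOS from wins and losses (assumption, not BayesElo's code)."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5
    z = (wins - losses) / math.sqrt(decisive)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fixed_threshold(t: int) -> float:
    """Stop as soon as LOS >= 95%, whatever the game number t."""
    return 0.95


def shrinking_threshold(t: int) -> float:
    """Stop only if LOS >= 1 - d(t), with d(t) = 5% / (t * (t + 1))."""
    return 1.0 - 0.05 / (t * (t + 1))


def declares_a_better(p_win, p_draw, threshold, rng, min_games=20, max_games=400):
    """Play one simulated match; return True if the rule stops and declares A better."""
    wins = losses = 0
    for t in range(1, max_games + 1):
        r = rng.random()
        if r < p_win:
            wins += 1
        elif r >= p_win + p_draw:
            losses += 1
        # anything in between is a draw and changes neither counter
        if t >= min_games and los(wins, losses) >= threshold(t):
            return True
    return False


def error_rate(threshold, trials=400, seed=1):
    """Fraction of matches between EQUAL engines where A is declared better."""
    rng = random.Random(seed)
    p_draw = 0.30
    p_win = (1.0 - p_draw) / 2.0  # equal strength: P(win) == P(loss)
    hits = sum(declares_a_better(p_win, p_draw, threshold, rng) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    print("fixed 95% threshold :", error_rate(fixed_threshold))
    print("shrinking threshold :", error_rate(shrinking_threshold))
```

Since the engines are equal, every "A is better" verdict is an error, so the printed numbers are direct estimates of the risk. To test the real thing you would swap los() for the LOS that BayesElo actually reports.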
Perhaps Remi Coulom is best placed to give some insights on this.