Testing results interpretation.

Discussion of chess software programming and technical issues.


mathmoi
Posts: 290
Joined: Mon Mar 13, 2006 5:23 pm
Location: Québec
Full name: Mathieu Pagé

Testing results interpretation.

Post by mathmoi »

I've run two versions of my engine against five opponents, for a total of 3000 games per engine. As you can see below, BayesElo gives me results with confidence ranges that overlap for the two engines (MatMoi 7.13.1d and MatMoi 7.14.0-see) by 1 point.

Is it reasonably safe to assume that MatMoi 7.13.1d is stronger than MatMoi 7.14.0-see even though the confidence ranges overlap? If not, can I simply have them play maybe 500 more games to see if the ranges shrink enough that they no longer overlap, or do I need to run a new test with 3500+ games for each engine?

Code: Select all

Rank Name                Elo    +    - games score oppo. draws 
   1 Matheus              97   18   18  1200   65%   -15   20% 
   2 Prophet              68   18   18  1200   61%   -15   14% 
   3 Matant               63   18   18  1200   60%   -15   18% 
   4 Monarch              62   17   17  1211   61%   -16   22% 
   5 MatMoi 7.13.1d       -5   11   11  3000   46%    20   18% 
   6 MatMoi 7.14.0-see   -26   11   11  3000   44%    20   18% 
   7 MatMoi 7.12.4e      -71  182  182    11   32%    62    9% 
   8 Sharper            -189   19   19  1200   28%   -15   16% 
Eelco de Groot
Posts: 4669
Joined: Sun Mar 12, 2006 2:40 am
Full name:   Eelco de Groot

Re: Testing results interpretation.

Post by Eelco de Groot »

If it is really stronger, running more games and adding them to the total PGN for BayesElo should bring the uncertainty range down. But the differences are not very large; it is possible that with a different set of opponents the results would be reversed. Such is the life of a tester :) If you are not sure, I would just save the old version and try the parts that differed again sometime later. It may depend on a combination of features whether the older code works better, and one test is usually not sufficient to determine that.
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: Testing results interpretation.

Post by Dirt »

BayesElo lets you print a likelihood of superiority (LOS) table. See if that number is convincing enough for you here. I suspect it is.
hgm
Posts: 28387
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing results interpretation.

Post by hgm »

The LOS should be above 99% here (two 95% confidence intervals nearly touching), so I guess you are pretty safe.

You are more likely to suffer from systematic errors by not using enough different opponents, than from statistical flukes.
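
To make hgm's reasoning concrete, here is a rough back-of-the-envelope check using the numbers from the table above. It is only a sketch: it treats each rating as Gaussian with the printed 95% error bar and assumes the two estimates are independent (they are not quite, since both versions share the same opponent pool), so BayesElo's own LOS table is the number to trust.

Code: Select all

# Rough LOS estimate from the BayesElo output above (a sketch, not BayesElo's
# exact computation): treat each rating as Gaussian with its 95% error bar
# and assume the two estimates are independent.
from math import erf, sqrt

elo_a, bar_a = -5.0, 11.0    # MatMoi 7.13.1d, +/- from the table
elo_b, bar_b = -26.0, 11.0   # MatMoi 7.14.0-see

se_a = bar_a / 1.96          # convert 95% error bar to a standard error
se_b = bar_b / 1.96
se_diff = sqrt(se_a**2 + se_b**2)

z = (elo_a - elo_b) / se_diff
los = 0.5 * (1.0 + erf(z / sqrt(2.0)))   # P(A really stronger than B)
print("z = %.2f, LOS ~ %.1f%%" % (z, 100.0 * los))
# prints roughly: z = 2.65, LOS ~ 99.6%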
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Testing results interpretation.

Post by Edmund »

Eelco de Groot wrote:If it is really stronger, running more games and adding them to the total PGN for BayesElo should bring the uncertainty range down. (...)
You would have to re-run the whole tournament, because you may not adjust the sample size based on the results so far.

E.g., let's say 100 games were run between two engines of equal strength,
and in the end one engine is ahead by 11 wins (this would happen in <10% of all cases),
so this would suggest a LOS of 90%.

Let's now say you are not happy with the result (you were hoping for 95%) and want to run another 100 games.
For a LOS of 95% after 200 games you need to be 20 wins ahead.

So you only need to gain another 9 wins in the next 100 games to reach your objective. The likelihood of this is 13.81%.


Let's now say you happen to test 10 changes (all of them of equal strength): 9 of them come out with LOS < 90% and 1 of them with LOS > 90%.
If you now keep only the one above 90% and run another test on it, you effectively only need to reach a LOS of 86.19% even though you really wanted 95% (and 95% is what BayesElo would display).
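
A small Monte Carlo sketch of this effect, for anyone who wants to play with it. It ignores draws and uses the lead thresholds from the example above as rough stand-ins for the 90%/95% LOS levels, so the exact percentages will not match the figures quoted, but the inflation above the 5% you would naively expect from "95% LOS" is the point.

Code: Select all

# Two engines of exactly equal strength; we "promote" a change when the
# first 100-game batch looks good, then give only the promoted change
# another 100 games.  Thresholds (lead >= 11 after 100 games, lead >= 20
# after 200) are taken from the post above and are only approximate.
import random

def lead(n):
    """Net win lead of engine A over engine B after n games (no draws)."""
    wins = sum(random.random() < 0.5 for _ in range(n))
    return 2 * wins - n

random.seed(1)
trials = 20000
promoted = passed = 0
for _ in range(trials):
    first = lead(100)
    if first >= 11:                    # looked like "LOS > 90%" after 100 games
        promoted += 1
        if first + lead(100) >= 20:    # now looks like "LOS > 95%" after 200
            passed += 1

print("promoted after 100 games: %.1f%% of equal-strength changes"
      % (100.0 * promoted / trials))
print("of those, %.1f%% end up looking 95%% superior -- far above the"
      " 5%% you would naively expect" % (100.0 * passed / promoted))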

regards,
Edmund
mathmoi
Posts: 290
Joined: Mon Mar 13, 2006 5:23 pm
Location: Québec
Full name: Mathieu Pagé

Re: Testing results interpretation.

Post by mathmoi »

Hi,

Thanks, that's what I was looking for. I did not know BayesElo provided it.
mathmoi
Posts: 290
Joined: Mon Mar 13, 2006 5:23 pm
Location: Québec
Full name: Mathieu Pagé

Re: Testing results interpretation.

Post by mathmoi »

hgm wrote:The LOS should be above 99% here (two 95% confidence intervals nearly touching), so I guess you are pretty safe.

You are more likely to suffer from systematic errors by not using enough different opponents, than from statistical flukes.
Hi,

I know that it would be better if I had, say, 10 or 15 opponents, but it's not easy to find engines that are freely available, within 200-300 Elo points of my engine, run under Linux natively or through Wine, and play well at a 45/0:15 time control.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Testing results interpretation.

Post by Daniel Shawul »

I happen to use a similar setup for my tests: 3000 games, which gives a +/-11 error bar. I suggest that you play a round-robin match among the "non-MatMoi" engines once, and then add your versions one by one with a gauntlet. With your current setup, the Elo of the other engines depends only on how well they perform against MatMoi. Matheus could turn out to be the weakest performer after a round-robin, and that could change the Elo of your two versions significantly.
mathmoi wrote:do I need to run a new test with 3500+ games for each engine?
Forget about doing more games ;) To halve the error bar (down to around 5 Elo) you already need roughly four times as many games, and to bring it down to 1 Elo, hundreds of thousands...
So I believe playing the round-robin match will help more than anything else.
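
For reference, a quick sketch of the square-root scaling, using the +/-11 at 3000 games from the table above as the baseline. It assumes the bar shrinks like 1/sqrt(games) and ignores draws and the uncertainty in the opponents' own ratings, so the exact figures are only indicative.

Code: Select all

# Error bar vs. number of games, scaled from the observed +/-11 at 3000
# games (simple 1/sqrt(n) assumption, so treat the numbers as rough).
from math import sqrt

base_games, base_bar = 3000, 11.0
for games in (3000, 12000, 48000, 200000, 400000):
    bar = base_bar * sqrt(base_games / games)
    print("%7d games -> about +/- %.1f Elo" % (games, bar))
# Quadrupling the games only halves the bar; getting near +/-1 Elo takes
# hundreds of thousands of games.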

You can use the LOS table, but I don't believe it helps much, IMO. You see 99% superiority for such a small increase in average score.
Daniel
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: Testing results interpretation.

Post by Dirt »

Daniel Shawul wrote:I happen to use a similar setup for my tests: 3000 games, which gives a +/-11 error bar. I suggest that you play a round-robin match among the "non-MatMoi" engines once, and then add your versions one by one with a gauntlet.
The round robin is, I think, useless for determining which of the MatMoi versions is better.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Testing results interpretation.

Post by Daniel Shawul »

Are you sure? If one of the MatMoi versions (v1) improves its score against Matheus by 1% and the other (v2) scores 1% better against Matant, the current scheme will assign a higher Elo to v1, because Matheus is ranked higher than Matant. But Matant could actually be much stronger than Matheus if the two were matched against each other. What we have here is only their strength against MatMoi, which is why I suggested playing an initial round-robin tournament only _once_. For all I care, initial ratings from CEGT could be assigned, but that would be a very rough approximation. After that, you add your engines one by one with a gauntlet match, so that _all engines_ have played the same number of games at all times.