Testing results interpretation.

Discussion of chess software programming and technical issues.


mathmoi
Posts: 290
Joined: Mon Mar 13, 2006 5:23 pm
Location: Québec
Full name: Mathieu Pagé

Testing results interpretation.

Post by mathmoi »

I've run two versions of my engine against five opponents, for a total of 3000 games per engine. As you can see below, BayesElo gives me results with confidence ranges that overlap for the two engines (MatMoi 7.13.1d and MatMoi 7.14.0-see) by 1 point.

Is it reasonably safe to assume that MatMoi 7.13.1d is stronger than MatMoi 7.14.0-see even though the confidence ranges overlap? If not, can I simply have them play maybe 500 more games to see if the ranges shrink enough that they no longer overlap, or do I need to run a new test with 3500+ games for each engine?

Code: Select all

Rank Name                Elo    +    - games score oppo. draws 
   1 Matheus              97   18   18  1200   65%   -15   20% 
   2 Prophet              68   18   18  1200   61%   -15   14% 
   3 Matant               63   18   18  1200   60%   -15   18% 
   4 Monarch              62   17   17  1211   61%   -16   22% 
   5 MatMoi 7.13.1d       -5   11   11  3000   46%    20   18% 
   6 MatMoi 7.14.0-see   -26   11   11  3000   44%    20   18% 
   7 MatMoi 7.12.4e      -71  182  182    11   32%    62    9% 
   8 Sharper            -189   19   19  1200   28%   -15   16% 
Eelco de Groot
Posts: 4669
Joined: Sun Mar 12, 2006 2:40 am
Full name:   Eelco de Groot

Re: Testing results interpretation.

Post by Eelco de Groot »

If it is really stronger, running more games and adding them to the total PGN for BayesElo should bring the uncertainty range down. But the differences are not very large; it is possible that with a different set of opponents the results would be reversed. Such is the life of a tester :) If you are not sure, I would just save the old version and try the parts that differed again sometime later. It may depend on a combination of features whether the older code works better, and one test is usually not sufficient to determine that.
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: Testing results interpretation.

Post by Dirt »

BayesElo lets you print a likelihood of superiority (LOS) table. See if that number is convincing enough for you here. I suspect it is.
hgm
Posts: 28387
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing results interpretation.

Post by hgm »

The LOS should be above 99% here (two 95% confidence intervals nearly touching), so I guess you are pretty safe.

You are more likely to suffer from systematic errors by not using enough different opponents, than from statistical flukes.
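
To make hgm's reasoning concrete, here is a rough back-of-the-envelope check using the numbers from the table above. It is only a sketch: it treats each rating as Gaussian with the printed 95% error bar and assumes the two estimates are independent (they are not quite, since both versions share the same opponent pool), so BayesElo's own LOS table is the number to trust.

Code: Select all

# Rough LOS estimate from the BayesElo output above (a sketch, not BayesElo's
# exact computation): treat each rating as Gaussian with its 95% error bar
# and assume the two estimates are independent.
from math import erf, sqrt

elo_a, bar_a = -5.0, 11.0    # MatMoi 7.13.1d, +/- from the table
elo_b, bar_b = -26.0, 11.0   # MatMoi 7.14.0-see

se_a = bar_a / 1.96          # convert 95% error bar to a standard error
se_b = bar_b / 1.96
se_diff = sqrt(se_a**2 + se_b**2)

z = (elo_a - elo_b) / se_diff
los = 0.5 * (1.0 + erf(z / sqrt(2.0)))   # P(A really stronger than B)
print("z = %.2f, LOS ~ %.1f%%" % (z, 100.0 * los))
# prints roughly: z = 2.65, LOS ~ 99.6%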
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Testing results interpretation.

Post by Edmund »

Eelco de Groot wrote:If it is really stronger, running more games and adding them to the total PGN for BayesElo should bring the uncertainty range down. (...)
You would have to re-run the whole tournament, because you may not adjust the sample size based on the results so far.

E.g., let's say 100 games were run between two engines of equal strength,
and in the end one engine is ahead by 11 wins (this would happen in <10% of all cases),
so this would suggest a LOS of 90%.

Let's now say you are not happy with the result (you were hoping for 95%) and want to run another 100 games.
For a LOS of 95% after 200 games you need to be 20 wins ahead.

So you only need to gain another 9 wins in the next 100 games to reach your objective. The likelihood of this is 13.81%.


Let's now say you happen to test 10 changes (all of them of equal strength): 9 of them come out with LOS < 90% and 1 of them with LOS > 90%.
If you now keep only the one above 90% and run another test on it, you effectively only need to reach a LOS of 86.19% even though you really wanted 95% (and 95% is what BayesElo would display).
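
A small Monte Carlo sketch of this effect, for anyone who wants to play with it. It ignores draws and uses the lead thresholds from the example above as rough stand-ins for the 90%/95% LOS levels, so the exact percentages will not match the figures quoted, but the inflation above the 5% you would naively expect from "95% LOS" is the point.

Code: Select all

# Two engines of exactly equal strength; we "promote" a change when the
# first 100-game batch looks good, then give only the promoted change
# another 100 games.  Thresholds (lead >= 11 after 100 games, lead >= 20
# after 200) are taken from the post above and are only approximate.
import random

def lead(n):
    """Net win lead of engine A over engine B after n games (no draws)."""
    wins = sum(random.random() < 0.5 for _ in range(n))
    return 2 * wins - n

random.seed(1)
trials = 20000
promoted = passed = 0
for _ in range(trials):
    first = lead(100)
    if first >= 11:                    # looked like "LOS > 90%" after 100 games
        promoted += 1
        if first + lead(100) >= 20:    # now looks like "LOS > 95%" after 200
            passed += 1

print("promoted after 100 games: %.1f%% of equal-strength changes"
      % (100.0 * promoted / trials))
print("of those, %.1f%% end up looking 95%% superior -- far above the"
      " 5%% you would naively expect" % (100.0 * passed / promoted))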

regards,
Edmund
mathmoi
Posts: 290
Joined: Mon Mar 13, 2006 5:23 pm
Location: Québec
Full name: Mathieu Pagé

Re: Testing results interpretation.

Post by mathmoi »

Hi,

Thanks, that's what I was looking for. I did not know BayesElo provided it.
mathmoi
Posts: 290
Joined: Mon Mar 13, 2006 5:23 pm
Location: Québec
Full name: Mathieu Pagé

Re: Testing results interpretation.

Post by mathmoi »

hgm wrote:The LOS should be above 99% here (two 95% confidence intervals nearly touching), so I guess you are pretty safe.

You are more likely to suffer from systematic errors by not using enough different opponents, than from statistical flukes.
Hi,

I know that it would be better if I had, say, 10 or 15 opponents, but it's not easy to find engines that are freely available, within 200-300 Elo points of my engine, run under Linux natively or through Wine, and play well at a 45/0:15 time control.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Testing results interpretation.

Post by Daniel Shawul »

I happen to use a similar setup for my tests: 3000 games, which gives a +/-11 error bar. I suggest that you play a round-robin match among the "non-MatMoi" engines once, and then add your versions one by one with a gauntlet. With your current setup, the Elo of the other engines depends only on how well they perform against MatMoi. Matheus could turn out to be the weakest performer after a round-robin, and that could change the Elo of your two versions significantly.
mathmoi wrote:do I need to run a new test with 3500+ games for each engine?
Forget about doing more games ;) To halve the error bar (down to around 5 Elo) you already need roughly four times as many games, and to bring it down to 1 Elo, hundreds of thousands...
So I believe playing the round-robin match will help more than anything else.
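
For reference, a quick sketch of the square-root scaling, using the +/-11 at 3000 games from the table above as the baseline. It assumes the bar shrinks like 1/sqrt(games) and ignores draws and the uncertainty in the opponents' own ratings, so the exact figures are only indicative.

Code: Select all

# Error bar vs. number of games, scaled from the observed +/-11 at 3000
# games (simple 1/sqrt(n) assumption, so treat the numbers as rough).
from math import sqrt

base_games, base_bar = 3000, 11.0
for games in (3000, 12000, 48000, 200000, 400000):
    bar = base_bar * sqrt(base_games / games)
    print("%7d games -> about +/- %.1f Elo" % (games, bar))
# Quadrupling the games only halves the bar; getting near +/-1 Elo takes
# hundreds of thousands of games.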

You can use the LOS table, but I don't believe it helps much, IMO. You see 99% superiority for such a small increase in average score.
Daniel
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: Testing results interpretation.

Post by Dirt »

Daniel Shawul wrote:I happen to use a similar setup for my tests: 3000 games, which gives a +/-11 error bar. I suggest that you play a round-robin match among the "non-MatMoi" engines once, and then add your versions one by one with a gauntlet.
The round robin is, I think, useless for determining which of the MatMoi versions is better.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Testing results interpretation.

Post by Daniel Shawul »

Are you sure? If one of the MatMoi versions (v1) improves its score against Matheus by 1% and the other (v2) scores 1% better against Matant, the current scheme will assign a higher Elo to v1, because Matheus is ranked higher than Matant. But Matant could actually be much stronger than Matheus if the two were matched against each other. What we have here is only their strength against MatMoi, which is why I suggested playing an initial round-robin tournament only _once_. For all I care, initial ratings from CEGT could be assigned, but that would be a very rough approximation. After that, you add your engines one by one with a gauntlet match, so that _all engines_ have played the same number of games at all times.