Interpreting test results

Discussion of chess software programming and technical issues.

Moderator: Ras

nthom
Posts: 112
Joined: Thu Mar 09, 2006 6:15 am
Location: Australia

Interpreting test results

Post by nthom »

I've recently started running larger volumes of very fast games. These results are from a 10,000-game gauntlet, rated with BayesElo with LT 1.03 fixed at 2300.

LT v1.05.01

Code:

Rank Name                    Elo    +    - games score oppo. Draws
1 Twisted Logic 20090922    2839   23   22  1432   90%  2437    3%
2 Hermann 2.5               2647   17   16  1428   74%  2437    6%
3 Cheese 1.3                2554   16   15  1428   64%  2437    7%
4 DanaSah226                2461   15   15  1428   53%  2437   10%
5 LittleThought-1.05.01     2437    6    7 10000   42%  2513    9%
6 Hamsters 0.0.6            2415   15   15  1428   47%  2437    5%
7 LittleThought-1.04        2377   15   14  1428   42%  2437   17%
8 LittleThought-1.03        2300   15   15  1428   33%  2437   14%
Replacing 1.05.01 with 1.05.02, which includes a small change:

Code:

Rank Name                    Elo    +    - games score oppo. Draws
1 Twisted Logic 20090922    2770   20   20  1432   85%  2428    3%
2 Hermann 2.5               2666   17   17  1428   77%  2428    6%
3 Cheese 1.3                2582   16   16  1428   68%  2428    7%
4 DanaSah226                2453   15   15  1428   53%  2428   10%
5 LittleThought-1.05.02     2428    7    6 10000   42%  2506    9%
6 Hamsters 0.0.6            2397   15   15  1428   46%  2428    7%
7 LittleThought-1.04        2377   14   15  1428   43%  2428   17%
8 LittleThought-1.03        2300   15   15  1428   34%  2428   13%
How would you interpret these results? I'm assuming the change was bad (2437 reduced to 2428), but some of the other engines' ratings varied by more than that.
Hart

Re: Interpreting test results

Post by Hart »

If, after 10,000 games against a variety of opponents, your results are worse, then you should be able to say with high confidence that your change was not positive.
My guess is that the other engines' ratings vary so much because each of them is connected only to the one engine being tested, so their individual ratings can swing more than you would expect from a small change in your program. Why not rate the rest of them with several thousand games each, or would that take too much time?
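A back-of-envelope calculation supports Hart's point about the error bars scaling with game count. This is a simple logistic-model sketch, not BayesElo's actual computation, and the function names are mine:

```python
import math

def elo_diff(score):
    """Elo difference implied by an expected score under the logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_stderr(score, games):
    """Rough 1-sigma error of the Elo estimate. Treats every game as a
    win or loss (ignoring draws), so it slightly overstates the error."""
    return (400.0 / math.log(10)) / math.sqrt(score * (1.0 - score) * games)

# 10,000 games at 42% -> about 3.5 Elo (1 sigma), i.e. roughly +/-7 at 95%,
# in line with the tight +6/-7 interval on the tested engine.
print(round(elo_stderr(0.42, 10000), 1))

# 1,428 games at 47% -> about 9 Elo (1 sigma), i.e. roughly +/-18 at 95%,
# which is why an opponent's rating can wobble by more than the
# 9-point difference between the two LT versions.
print(round(elo_stderr(0.47, 1428), 1))
```

So with 1,400-odd games per opponent, swings of 15-20 Elo in the opponents' ratings are just noise.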
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Interpreting test results

Post by michiguel »

nthom wrote:I've recently started running larger volumes of very fast games. These results are from a 10,000-game gauntlet, rated with BayesElo with LT 1.03 fixed at 2300.

LT v1.05.01

Code:

Rank Name                    Elo    +    - games score oppo. Draws
1 Twisted Logic 20090922    2839   23   22  1432   90%  2437    3%
2 Hermann 2.5               2647   17   16  1428   74%  2437    6%
3 Cheese 1.3                2554   16   15  1428   64%  2437    7%
4 DanaSah226                2461   15   15  1428   53%  2437   10%
5 LittleThought-1.05.01     2437    6    7 10000   42%  2513    9%
6 Hamsters 0.0.6            2415   15   15  1428   47%  2437    5%
7 LittleThought-1.04        2377   15   14  1428   42%  2437   17%
8 LittleThought-1.03        2300   15   15  1428   33%  2437   14%
Can't you include both versions in the same calculation?

You cannot directly compare the rating points from two different calculation sets.

Miguel

Replacing 1.05.01 with 1.05.02, which includes a small change:

Code:

Rank Name                    Elo    +    - games score oppo. Draws
1 Twisted Logic 20090922    2770   20   20  1432   85%  2428    3%
2 Hermann 2.5               2666   17   17  1428   77%  2428    6%
3 Cheese 1.3                2582   16   16  1428   68%  2428    7%
4 DanaSah226                2453   15   15  1428   53%  2428   10%
5 LittleThought-1.05.02     2428    7    6 10000   42%  2506    9%
6 Hamsters 0.0.6            2397   15   15  1428   46%  2428    7%
7 LittleThought-1.04        2377   14   15  1428   43%  2428   17%
8 LittleThought-1.03        2300   15   15  1428   34%  2428   13%
How would you interpret these results? I'm assuming the change was bad (2437 reduced to 2428), but some of the other engines' ratings varied by more than that.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Interpreting test results

Post by Sven »

Just make one BayesElo table from all 20,000 games and come back with the result. Rémi once stated that this is necessary.

When adding a third version, just add the new 10000 games to the same pool and recalculate.

Sven
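Sven's advice translates into a single BayesElo session along these lines. The PGN filenames are placeholders for your two gauntlet outputs, and the command list is from memory of BayesElo's interface, so check it against the program's built-in help:

```
readpgn gauntlet-1.05.01.pgn
readpgn gauntlet-1.05.02.pgn
elo
mm
exactdist
ratings
```

Both LT versions then get their ratings from one fit over the combined pool, which makes them directly comparable.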
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Interpreting test results

Post by Adam Hair »

Miguel is right. No accurate comparison can be made unless you know the Elo of each of the other engines under the test conditions. As Rémi has pointed out before, combine the two sets of games in order to compare LT 1.05.01 and 1.05.02. Also use the LOS function in BayesElo to get a qualitative comparison of strength.
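For intuition about what LOS reports: BayesElo derives it from the fitted model, but over decisive head-to-head games the usual normal approximation looks like this (a sketch under that approximation, not BayesElo's exact computation):

```python
import math

def los(wins, losses):
    """Likelihood of superiority of one engine over another, from decisive
    games only, via the normal approximation (draws contribute nothing here)."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

print(round(los(60, 40), 3))  # about 0.977: strong but not conclusive
```

A LOS above ~0.95 is the usual threshold for calling one version stronger with reasonable confidence.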
nthom
Posts: 112
Joined: Thu Mar 09, 2006 6:15 am
Location: Australia

Re: Interpreting test results

Post by nthom »

Thanks all, I'll start combining the PGNs from now on (my testing program overwrites the file each time - doh!).