I've just recently started testing larger volumes of very fast games. These results are from a 10,000-game gauntlet, rated with bayeselo with LT 1.03 fixed at 2300.
How would you interpret these results? I'm assuming the change was bad (the rating dropped from 2437 to 2428), but some of the other engines' results varied by more than that.
If, after 10,000 games against a variety of opponents, your results are worse, then you should be able to say with high confidence that your change was not positive.
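For a rough sense of scale (a back-of-envelope sketch, not bayeselo's actual computation): if you assume each game's score has a standard deviation of about 0.4 points, which is typical at a normal draw rate, the 95% error margin on a single 10,000-game measurement works out to only a few Elo.

[code]
import math

def elo_margin(games, per_game_sd=0.4, z=1.96):
    """Approximate 95% error margin, in Elo, for a score measured over
    `games` games near a 50% result. Near 50%, one point of average
    score corresponds to 1600/ln(10) ~ 695 Elo."""
    se_score = per_game_sd / math.sqrt(games)  # standard error of the mean score
    return z * se_score * 1600.0 / math.log(10)

print(f"+/- {elo_margin(10000):.1f} Elo")  # about +/- 5.5 Elo over 10,000 games
[/code]

One caveat: when you compare two separate 10,000-game runs (old version vs. new version), each run contributes its own error, so the margin on the difference is about sqrt(2) times larger, roughly +/- 7.7 Elo here. That makes a 9 Elo drop suggestive but marginal on its own.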
My guess is that the ratings of the other engines vary so much because they are connected only to the one engine being tested, so their individual ratings can swing more than you would expect from small changes in your program, IMO. Why not rate the rest of them with several thousand games each, or would that take too much time?
nthom wrote: I've just recently started testing larger volumes of very fast games. These results are from a 10,000-game gauntlet, rated with bayeselo with LT 1.03 fixed at 2300.
How would you interpret these results? I'm assuming the change was bad (the rating dropped from 2437 to 2428), but some of the other engines' results varied by more than that.
Miguel is right. No accurate comparison can be made unless you know the Elo of each of the other engines under the test conditions. As Remi has pointed out before, combine the two sets of games in order to get a direct comparison of LT 1.05.01 and 1.05.02. Also use the LOS function in Bayeselo to get the likelihood that one version is stronger than the other.
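For what it's worth, a common closed-form approximation to LOS (the usual normal-approximation formula from decisive game counts, not necessarily exactly what Bayeselo computes internally) is easy to evaluate yourself:

[code]
import math

def los(wins, losses):
    """Likelihood of superiority: probability that the true strength
    difference is positive, given the decisive game counts. Normal
    approximation; draws say nothing about the sign of the difference
    and are ignored here."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# Hypothetical counts, just to illustrate:
print(f"LOS = {los(3100, 2950):.3f}")  # ~0.97 for +150 over 6,050 decisive games
[/code]

An LOS above roughly 0.95 is usually taken as reasonable evidence that one version really is stronger; anything near 0.5 means the match told you essentially nothing about the sign of the change.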