A question on testing methodology

Discussion of chess software programming and technical issues.

Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: A question on testing methodology [here is the data]

Post by Edsel Apostol »

bob wrote:
bob wrote:
Edsel Apostol wrote:
bob wrote:First, two Crafty versions (at the moment, these are 23.0 and 23.1... I am having to search manually for the two that were 10 Elo apart, and that is a pain). The first set is with 23.0 and 23.1 playing the usual gauntlet of 5 programs, 4,000 positions.

Code: Select all

Rank Name                  Elo    +    - games score oppo. draws
   2 Crafty-23.1-1        2654    4    4 40000   57%  2597   22% 
   3 Glaurung 2.2         2625    5    5 16000   52%  2607   22% 
   4 Toga2                2616    5    5 16000   51%  2607   23% 
   5 Crafty-23.0-1        2561    3    4 40000   45%  2597   21% 
   6 Fruit 2.1            2522    5    5 16000   38%  2607   22% 
And now with the complete round robin (RR) between the 5 opponents factored in. Again, each plays the other 4 for 4,000 positions, alternating colors as well.

Code: Select all

Rank Name                  Elo    +    - games score oppo. draws
   2 Crafty-23.1-1        2653    4    3 40000   57%  2597   22% 
   3 Glaurung 2.2         2625    3    3 48000   54%  2596   23% 
   4 Toga2                2615    4    4 48000   52%  2597   23% 
   5 Crafty-23.0-1        2561    3    4 40000   45%  2597   21% 
   6 Fruit 2.1            2521    4    4 48000   38%  2613   23% 
So the winning percentages change, since the opponents played each other, and there was a very slight reduction in the error margin. But check out the ratings for 23.1 and 23.0: 23.1 dropped by 1 point... Notice that the error margins for the opponents all dropped, since they played 3x as many games, which shrinks Crafty's error bar a bit (Crafty itself played no additional games between the two samples).

So pretty much what I saw the last time I ran this kind of test to see if the RR data was useful.

Bob
I think one could only notice a small difference when there is already a large number of games, as in the data above. The difference should be more pronounced when there are only a small number of positions, so that the opponents' total game count is much larger than the tested engine's. That would be useful for those of us with limited testing resources, who can only afford a little more than 1000 games per version.
The error bars for the RR participants will go down a lot if you play enough games. But with a limited number of games for your own program, you are stuck with a big error margin. I'll run 1000 games and give the bayeselo output for that, then include the large RR results and run it again. Will do this right after an 11:00am class I have coming up.

Should be able to get this done by 12:00 or so.
Actually, this took almost no time. I ran 1040 games with the two versions of Crafty.

Test results alone:

Code: Select all

Rank Name                  Elo    +    - games score oppo. draws
   2 Crafty-23.1-1        2654   17   16  1040   58%  2595   23% 
   3 Toga2                2639   25   25   416   54%  2613   24% 
   4 Glaurung 2.2         2617   25   25   416   51%  2613   23% 
   5 Crafty-23.0-1        2571   16   16  1040   47%  2595   22% 
   6 Fruit 2.1            2532   25   26   416   39%  2613   22% 
Test results + RR games:

Code: Select all

Rank Name                  Elo    +    - games score oppo. draws
   2 Crafty-23.1-1        2654   17   17  1040   58%  2595   23% 
   3 Glaurung 2.2         2623    5    4 32416   54%  2589   24% 
   4 Toga2                2613    4    4 32416   53%  2591   23% 
   5 Crafty-23.0-1        2570   16   17  1040   47%  2595   22% 
   6 Fruit 2.1            2519    4    4 32416   37%  2614   23% 
As I said, you cannot "cheat the sample size gods". That is just not going to help recognize the effect of even +/-10 Elo changes, which are large, much less the 3-5 Elo changes we search for. More games between the RR opponents doesn't affect our error margin much at all.
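To make that scaling concrete, here is a minimal back-of-the-envelope sketch (a plain win/draw/loss approximation used purely for illustration, not bayeselo's actual model) of how a 95% Elo error margin shrinks with the number of games:

Code: Select all

import math

def elo_error_margin(games, score=0.50, draw_rate=0.22):
    """Approximate 95% error margin (in Elo) for a result over `games` games.

    Treats each game as win/draw/loss with the given expected score and
    draw rate, then converts the standard error of the score into Elo
    via the slope of the logistic rating curve."""
    win_rate = score - draw_rate / 2.0
    # Per-game variance of the result (0, 0.5, or 1 point).
    var = win_rate + 0.25 * draw_rate - score ** 2
    se_score = math.sqrt(var / games)
    # d(Elo)/d(score) = 400 / (ln 10 * score * (1 - score)).
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return 1.96 * slope * se_score

for n in (416, 1040, 16000, 40000):
    print(f"{n:6d} games: +/- {elo_error_margin(n):.1f} Elo")

This gives roughly +/-29 at 416 games, +/-19 at 1,040, +/-5 at 16,000, and +/-3 at 40,000, close to the shape of the tables above. The point is the 1/sqrt(n) dependence on the candidate's own game count, which is why piling extra games onto the opponents barely moves the candidate's interval.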
Thanks Bob for the data. Can you try the covariance feature of bayeselo on this? I'm just curious whether there will be a difference in the output.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology [here is the data]

Post by bob »

Edsel Apostol wrote:
[...]
Thanks Bob for the data. Can you try the covariance feature of bayeselo on this? I'm just curious whether there will be a difference in the output.
Sure. Can you remind me how to do that and I'll be happy to run the data through it...
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: A question on testing methodology [here is the data]

Post by Edsel Apostol »

bob wrote:
[...]
Sure. Can you remind me how to do that and I'll be happy to run the data through it...
http://www.talkchess.com/forum/viewtopi ... &start=163

Rémi says to use it after mm. I think it should replace exactdist, though I'm not sure of that.
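For reference, the interactive sequence would look something like this (a sketch based on bayeselo's command set; "games.pgn" is a placeholder file name):

Code: Select all

readpgn games.pgn
elo
mm
covariance
ratings
los

Here readpgn loads the game file, elo enters rating-estimation mode, mm computes the maximum-likelihood ratings, covariance then derives the error bars in place of exactdist, and ratings / los print the rating table and the likelihood-of-superiority matrix.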

I had not read much of the testing thread from last year; it seems most of my questions were answered there.

I'm still interested in the result on your data though.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology [here is the data]

Post by bob »

OK, here is "covariance" output for full match:

Code: Select all

Rank Name                  Elo    +    - games score oppo. draws
   2 Crafty-23.1-1        2654   16   16  1040   58%  2595   23% 
   3 Glaurung 2.2         2623    5    5 32416   54%  2589   24% 
   4 Toga2                2613    5    5 32416   53%  2591   23% 
   5 Crafty-23.0-1        2570   16   16  1040   47%  2595   22% 
   6 Fruit 2.1            2519    5    5 32416   37%  2614   23% 
and same without the RR data.

Code: Select all

Rank Name                  Elo    +    - games score oppo. draws
   2 Crafty-23.1-1        2654   16   16  1040   58%  2595   23% 
   3 Toga2                2639   26   26   416   54%  2613   24% 
   4 Glaurung 2.2         2617   26   26   416   51%  2613   23% 
   5 Crafty-23.0-1        2571   16   16  1040   47%  2595   22% 
   6 Fruit 2.1            2532   27   27   416   39%  2613   22% 
Note that they are in the reverse order compared to the previous set, which had RR data last.
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: A question on testing methodology [here is the data]

Post by Edsel Apostol »

bob wrote:OK, here is "covariance" output for full match:
[...]
Thanks. There seems to be less difference than I was hoping for. I'm currently running a round-robin match and will post my own findings here after the tests are done, most probably tomorrow.
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: A question on testing methodology [here is the data]

Post by Edsel Apostol »

OK, here are my results. The round robin between the opponents seems to have no effect on the ordering of the engines, at least in this test. It only tightens the opponents' error bars; I couldn't find any effect at all on the error bars of the engines being tested. Maybe more games are needed, or maybe it simply has no effect. I was hoping it would at least improve the error bars, so that we could be assured of a little more accuracy than the normal gauntlet gives. Oh well.

Gauntlet with exactdist:

Code: Select all

Rank Name                     Elo    +    - games score oppo. draws 
   1 Stockfish 1.4 JA        2943   17   17  1320   74%  2767   23% 
   2 Strelka 2.0 B           2913   17   16  1320   70%  2767   25% 
   3 Cyclone 3.4             2893   17   17  1320   67%  2767   18% 
   4 Grapefruit 1.0 beta     2891   16   16  1320   67%  2767   24% 
   5 Fruit 2.3.1             2872   16   16  1320   65%  2767   26% 
   6 Protector 1.3.1 x32     2841   16   16  1320   60%  2767   23% 
   7 Doch32 09.980 JA        2798   16   16  1320   54%  2767   23% 
   8 t20091113FHwPVf         2785   17   17  1200   43%  2837   22% 
   9 t20091113KAAL7          2774   17   17  1200   42%  2837   23% 
  10 t20091113               2769   17   17  1200   41%  2837   23% 
  11 t20091113KAAL8          2768   17   17  1200   41%  2837   22% 
  12 t20091113KAAL9          2767   17   17  1200   40%  2837   24% 
  13 t20091113SAtrue         2766   17   17  1200   40%  2837   23% 
  14 t20091113FHwPV2         2766   17   17  1200   40%  2837   25% 
  15 t20091113NTfalse        2763   17   17  1200   40%  2837   25% 
  16 bright-0.4a             2762   16   16  1320   49%  2767   20% 
  17 t20091113IIDSf          2760   17   17  1200   40%  2837   23% 
  18 t20091113KAAL6          2758   17   17  1200   39%  2837   23% 
  19 t20091113KAAL5          2756   17   17  1200   39%  2837   23% 
  20 Fruit 2.1               2737   16   16  1320   46%  2767   24% 
  21 Twisted Logic 20080620  2718   16   16  1320   43%  2767   27% 
Round Robin with exactdist:

Code: Select all

Rank Name                     Elo    +    - games score oppo. draws 
   1 Stockfish 1.4 JA        2933   12   12  2400   69%  2793   22% 
   2 Strelka 2.0 B           2910   12   12  2400   66%  2795   26% 
   3 Cyclone 3.4             2891   12   12  2400   63%  2795   23% 
   4 Grapefruit 1.0 beta     2889   12   12  2400   63%  2796   26% 
   5 Fruit 2.3.1             2869   11   11  2400   60%  2797   27% 
   6 Protector 1.3.1 x32     2841   12   12  2400   56%  2798   25% 
   7 Doch32 09.980 JA        2796   12   12  2400   49%  2800   24% 
   8 t20091113FHwPVf         2785   17   17  1200   43%  2837   22% 
   9 t20091113KAAL7          2774   17   17  1200   42%  2837   23% 
  10 bright-0.4a             2773   12   12  2400   46%  2801   22% 
  11 t20091113               2769   17   17  1200   41%  2837   23% 
  12 t20091113KAAL8          2768   17   17  1200   41%  2837   22% 
  13 t20091113KAAL9          2767   17   17  1200   40%  2837   24% 
  14 t20091113FHwPV2         2766   17   17  1200   40%  2837   25% 
  15 t20091113SAtrue         2766   17   17  1200   40%  2837   23% 
  16 t20091113NTfalse        2763   17   17  1200   40%  2837   25% 
  17 t20091113IIDSf          2761   17   17  1200   40%  2837   23% 
  18 t20091113KAAL6          2759   17   17  1200   39%  2837   23% 
  19 t20091113KAAL5          2756   17   17  1200   39%  2837   23% 
  20 Fruit 2.1               2734   12   12  2400   40%  2803   24% 
  21 Twisted Logic 20080620  2731   12   12  2400   40%  2803   26% 
Gauntlet with covariance:

Code: Select all

Rank Name                     Elo    +    - games score oppo. draws 
   1 Stockfish 1.4 JA        2943   17   17  1320   74%  2767   23% 
   2 Strelka 2.0 B           2913   16   16  1320   70%  2767   25% 
   3 Cyclone 3.4             2893   16   16  1320   67%  2767   18% 
   4 Grapefruit 1.0 beta     2891   16   16  1320   67%  2767   24% 
   5 Fruit 2.3.1             2872   16   16  1320   65%  2767   26% 
   6 Protector 1.3.1 x32     2841   16   16  1320   60%  2767   23% 
   7 Doch32 09.980 JA        2798   16   16  1320   54%  2767   23% 
   8 t20091113FHwPVf         2785   17   17  1200   43%  2837   22% 
   9 t20091113KAAL7          2774   17   17  1200   42%  2837   23% 
  10 t20091113               2769   17   17  1200   41%  2837   23% 
  11 t20091113KAAL8          2768   17   17  1200   41%  2837   22% 
  12 t20091113KAAL9          2767   17   17  1200   40%  2837   24% 
  13 t20091113SAtrue         2766   17   17  1200   40%  2837   23% 
  14 t20091113FHwPV2         2766   17   17  1200   40%  2837   25% 
  15 t20091113NTfalse        2763   17   17  1200   40%  2837   25% 
  16 bright-0.4a             2762   16   16  1320   49%  2767   20% 
  17 t20091113IIDSf          2760   17   17  1200   40%  2837   23% 
  18 t20091113KAAL6          2758   17   17  1200   39%  2837   23% 
  19 t20091113KAAL5          2756   17   17  1200   39%  2837   23% 
  20 Fruit 2.1               2737   16   16  1320   46%  2767   24% 
  21 Twisted Logic 20080620  2718   16   16  1320   43%  2767   27% 
Round Robin with covariance:

Code: Select all

Rank Name                     Elo    +    - games score oppo. draws 
   1 Stockfish 1.4 JA        2933   12   12  2400   69%  2793   22% 
   2 Strelka 2.0 B           2910   12   12  2400   66%  2795   26% 
   3 Cyclone 3.4             2891   12   12  2400   63%  2795   23% 
   4 Grapefruit 1.0 beta     2889   12   12  2400   63%  2796   26% 
   5 Fruit 2.3.1             2869   11   11  2400   60%  2797   27% 
   6 Protector 1.3.1 x32     2841   12   12  2400   56%  2798   25% 
   7 Doch32 09.980 JA        2796   12   12  2400   49%  2800   24% 
   8 t20091113FHwPVf         2785   17   17  1200   43%  2837   22% 
   9 t20091113KAAL7          2774   17   17  1200   42%  2837   23% 
  10 bright-0.4a             2773   12   12  2400   46%  2801   22% 
  11 t20091113               2769   17   17  1200   41%  2837   23% 
  12 t20091113KAAL8          2768   17   17  1200   41%  2837   22% 
  13 t20091113KAAL9          2767   17   17  1200   40%  2837   24% 
  14 t20091113FHwPV2         2766   17   17  1200   40%  2837   25% 
  15 t20091113SAtrue         2766   17   17  1200   40%  2837   23% 
  16 t20091113NTfalse        2763   17   17  1200   40%  2837   25% 
  17 t20091113IIDSf          2761   17   17  1200   40%  2837   23% 
  18 t20091113KAAL6          2759   17   17  1200   39%  2837   23% 
  19 t20091113KAAL5          2756   17   17  1200   39%  2837   23% 
  20 Fruit 2.1               2734   12   12  2400   40%  2803   24% 
  21 Twisted Logic 20080620  2731   12   12  2400   40%  2803   26% 
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: A question on testing methodology [here is the data]

Post by Sven »

These results, yours and Bob's, are indeed surprising to me. The error bars for the candidate engines do not go down with RR games + covariance, so the number of gauntlet games required to meet predefined quality criteria can't be reduced, contrary to what I had hoped.

It seems that I have to give up ... although I still don't understand why. Maybe Rémi can comment on this and give us a mathematical explanation?

At least we now have concrete results that can be discussed or referred to, instead of just statements and assumptions. Even if the outcome is not what I expected, it was definitely better to test it than to just assert that including RR games would not help.

One firm result, at least, is that including RR games does not hurt.

Sven
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology [here is the data]

Post by bob »

Sven Schüle wrote:These results, yours and Bob's, are indeed surprising to me. The error bars for the candidate engines do not go down with RR games + covariance, so the number of gauntlet games required to meet predefined quality criteria can't be reduced, contrary to what I had hoped.
[...]
Even if the outcome is not what I expected, it was definitely better to test it than to just assert that including RR games would not help.
Just for the record, I did not "just state that it didn't work." I ran this exact same experiment last year and got the same results. I, too, thought it would make a difference, but it didn't. If I precede a comment with "I believe" or "I think", you can assume that I am not certain. If I say "I know", that means I have actually tested the idea to determine the answer.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A question on testing methodology [here is the data]

Post by bob »

Edsel Apostol wrote:OK, here are my results. The round robin between the opponents seems to have no effect on the ordering of the engines, at least in this test. It only tightens the opponents' error bars; I couldn't find any effect at all on the error bars of the engines being tested. Maybe more games are needed, or maybe it simply has no effect. I was hoping it would at least improve the error bars, so that we could be assured of a little more accuracy than the normal gauntlet gives. Oh well.
[...]
Remember, I had warned you. This was already tested last year when someone raised the same point. And the test produced the same results then as now. :)
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: A question on testing methodology [here is the data]

Post by Rémi Coulom »

Sven Schüle wrote:These results, yours and Bob's, are indeed surprising to me. The error bars for the candidate engines do not go down with RR games + covariance, so the number of gauntlet games required to meet predefined quality criteria can't be reduced, contrary to what I had hoped.

It seems that I have to give up ... although I still don't understand why. Maybe Rémi can comment on this and give us a mathematical explanation?
Bayeselo offers three different methods to compute confidence intervals.

The default method assumes that the posterior distribution is an independent Gaussian (i.e., without covariance). It is the fastest and least accurate of the three.

The "exactdist" method assumes that the opponents' estimated ratings are their true ratings, but does not make a Gaussian assumption. It produces more accurate results when the number of games is small or the winning rate is close to 0% or 100% (its intervals can be non-symmetrical). With "exactdist", the uncertainty of the opponents' ratings is not taken into consideration at all.

The "covariance" method assumes a Gaussian distribution of the posterior and always produces symmetrical intervals. It is the only method of the three that takes opponent uncertainty into consideration, and it is the most accurate when the number of games is high. The "los" command is based on the same approximation as "covariance".

"covariance" has a cost that is cubic in the number of players (it requires a matrix inversion). That is why I did not make it the default: it is too slow when the number of players is very high.

In general, it is not a good idea to infer anything from confidence intervals when trying to compare two programs. Using LOS is better.
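Concretely, under that joint-Gaussian approximation, LOS is just the probability that the rating difference is positive. Here is a minimal sketch (an illustration in Python, not bayeselo's code; the variance and covariance numbers are made up for the example):

Code: Select all

import math

def los(elo_a, elo_b, var_a, var_b, cov_ab):
    """P(true rating of A > true rating of B) under a joint Gaussian
    posterior, the approximation behind bayeselo's 'los' command."""
    diff = elo_a - elo_b
    # Variance of the rating difference; the covariance term is what
    # the independent-Gaussian default method ignores.
    sigma = math.sqrt(var_a + var_b - 2.0 * cov_ab)
    return 0.5 * (1.0 + math.erf(diff / (sigma * math.sqrt(2.0))))

# Illustrative numbers only: a 10-Elo measured gap with ~8.5-Elo
# standard errors (comparable to the +/-17 margins at 1040 games).
print(f"LOS = {los(2654, 2644, 72.0, 72.0, 20.0):.3f}")

With numbers of that size the LOS comes out around 0.84: suggestive, but far from conclusive, which is consistent with Bob's point that 1040-game samples cannot resolve changes of a few Elo.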

So maybe it is a good idea to re-run those experiments with "covariance" instead of "exactdist", and to compare LOS matrices, not confidence intervals.

Rémi