A question on testing methodology

Moderator: Ras

Thanks Bob for the data. Can you try the covariance feature of bayeselo on this? I'm just curious whether there will be a difference in the output.

bob wrote: First, two Crafty versions (at the moment, these are 23.0 and 23.1... I am having to manually search for the two that were 10 Elo apart, and that is a pain). The first set is with 23.0 and 23.1 playing the usual gauntlet of 5 programs, 4,000 positions.

Code: Select all
2 Crafty-23.1-1 2654 4 4 40000 57% 2597 22%
3 Glaurung 2.2 2625 5 5 16000 52% 2607 22%
4 Toga2 2616 5 5 16000 51% 2607 23%
5 Crafty-23.0-1 2561 3 4 40000 45% 2597 21%
6 Fruit 2.1 2522 5 5 16000 38% 2607 22%

And now factoring in the complete RR between the 5 opponents. Again, each plays the other 4 for 4,000 positions, alternating colors as well.

Code: Select all
2 Crafty-23.1-1 2653 4 3 40000 57% 2597 22%
3 Glaurung 2.2 2625 3 3 48000 54% 2596 23%
4 Toga2 2615 4 4 48000 52% 2597 23%
5 Crafty-23.0-1 2561 3 4 40000 45% 2597 21%
6 Fruit 2.1 2521 4 4 48000 38% 2613 23%

So the game winning percentages change, since the opponents played each other. And there was a very slight reduction in the error margin. But check out the ratings for 23.1 and 23.0: 23.1 dropped by 1 point. Notice that the error margins for the opponents all dropped since they played 3x as many games, which drops Crafty's error bar a bit (it played no additional games between the two samples).

So pretty much what I saw the last time I ran this kind of test to see if the RR data was useful.

Bob

Edsel Apostol wrote: I think one could only notice a small difference if there is already a big number of games, just like the data above. The differences would be more pronounced when there are only a small number of positions, so that the total number of the opponents' games is much bigger than the engine's number of games. This is useful for those of us who have limited testing resources and can only afford to test each version with a little more than 1000 games.

bob wrote: The error bar for the RR participants will go down a lot if you play enough games. But with a limited number of games for your program, you are stuck with a big error margin. I'll run 1000 games and give the bayeselo for that, then include the large RR results and run again. Will do this right after an 11:00am class I have coming up. Should be able to get this done by 12:00 or so.

bob wrote: Actually, this took almost no time. I ran 1040 games with the two versions of Crafty.

Test results alone:

Code: Select all
Rank Name Elo + - games score oppo. draws
2 Crafty-23.1-1 2654 17 16 1040 58% 2595 23%
3 Toga2 2639 25 25 416 54% 2613 24%
4 Glaurung 2.2 2617 25 25 416 51% 2613 23%
5 Crafty-23.0-1 2571 16 16 1040 47% 2595 22%
6 Fruit 2.1 2532 25 26 416 39% 2613 22%

Test results + RR games:

Code: Select all
2 Crafty-23.1-1 2654 17 17 1040 58% 2595 23%
3 Glaurung 2.2 2623 5 4 32416 54% 2589 24%
4 Toga2 2613 4 4 32416 53% 2591 23%
5 Crafty-23.0-1 2570 16 17 1040 47% 2595 22%
6 Fruit 2.1 2519 4 4 32416 37% 2614 23%

As I said, you can not "cheat the sample size gods". That is just not going to help recognize the effect of even +/- 10 Elo changes, which are large, much less the 3-5 we search for. More games between the RR opponents doesn't affect our error margin much at all.
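Bob's "sample size gods" point can be made concrete with a quick binomial estimate. Below is a minimal sketch in Python, using the usual normal approximation around a 50% score with the ~23% draw rate seen in the tables above; a back-of-the-envelope model, not bayeselo's actual computation.

Code: Select all
# Rough 95% error bar on one engine's Elo after n games, and the game count
# needed to resolve a given Elo difference. Normal approximation at a 50%
# score; a higher draw rate only lowers the per-game score variance.
from math import log, sqrt

ELO_PER_SCOREPOINT = 400 / log(10) / 0.25    # dElo/dscore at a 50% score, ~695

def error_bar(n_games, draw_rate=0.23):
    sd = sqrt(0.25 - draw_rate / 4)          # per-game score std dev at 50%
    return 1.96 * ELO_PER_SCOREPOINT * sd / sqrt(n_games)

def games_needed(delta_elo, draw_rate=0.23):
    sd = sqrt(0.25 - draw_rate / 4)
    return (1.96 * ELO_PER_SCOREPOINT * sd / delta_elo) ** 2

print(error_bar(1040))      # ~18.5 Elo, close to the +/-17 in the table above
print(games_needed(10))     # ~3,600 games to resolve a 10 Elo change
print(games_needed(5))      # ~14,300 games to resolve a 5 Elo change

Note that only the candidate's own game count appears in these formulas, which is essentially why extra games among the RR opponents barely move the candidate's error bar.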
- Posts: 803
- Joined: Mon Jul 17, 2006 5:53 am
- Full name: Edsel Apostol
Re: A question on testing methodology [here is the data]
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: A question on testing methodology [here is the data]
Sure. Can you remind me how to do that and I'll be happy to run the data through it...

Edsel Apostol wrote: Thanks Bob for the data. Can you try the covariance feature of bayeselo on this? I'm just curious whether there will be a difference in the output.
- Posts: 803
- Joined: Mon Jul 17, 2006 5:53 am
- Full name: Edsel Apostol
Re: A question on testing methodology [here is the data]
bob wrote: Sure. Can you remind me how to do that and I'll be happy to run the data through it...

http://www.talkchess.com/forum/viewtopi ... &start=163
Rémi says to use it after mm. I think it should replace exactdist, though I'm not sure of that.

I have not read much of the testing thread from last year, and it seems most of my questions are answered there.

I'm still interested in the result on your data, though.
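For anyone who wants to try the same thing, the sequence inside bayeselo should look something like the sketch below. The command names are from memory of bayeselo's console (this thread only confirms mm, covariance, exactdist and los), and games.pgn is a placeholder for your own PGN file:

Code: Select all
readpgn games.pgn
elo
mm
covariance
ratings
los

readpgn loads the games, elo enters the rating interface, mm fits the ratings, covariance switches the interval computation to the covariance method (where one would otherwise type exactdist), and ratings / los print the rating table and the LOS matrix.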
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: A question on testing methodology [here is the data]
OK, here is the "covariance" output. Note that the two sets below are in the reverse order compared to the previous set, which had the RR data last.

Full match (test games + RR games):
Code: Select all
2 Crafty-23.1-1 2654 16 16 1040 58% 2595 23%
3 Glaurung 2.2 2623 5 5 32416 54% 2589 24%
4 Toga2 2613 5 5 32416 53% 2591 23%
5 Crafty-23.0-1 2570 16 16 1040 47% 2595 22%
6 Fruit 2.1 2519 5 5 32416 37% 2614 23%
And the same without the RR data:

Code: Select all
2 Crafty-23.1-1 2654 16 16 1040 58% 2595 23%
3 Toga2 2639 26 26 416 54% 2613 24%
4 Glaurung 2.2 2617 26 26 416 51% 2613 23%
5 Crafty-23.0-1 2571 16 16 1040 47% 2595 22%
6 Fruit 2.1 2532 27 27 416 39% 2613 22%
- Posts: 803
- Joined: Mon Jul 17, 2006 5:53 am
- Full name: Edsel Apostol
Re: A question on testing methodology [here is the data]
Thanks. There seems to be less difference than I was hoping for. I'm currently running a round-robin match and will post my own findings here after the tests are done, most probably tomorrow.

bob wrote: OK, here is the "covariance" output for the full match, and the same without the RR data.
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
- Posts: 803
- Joined: Mon Jul 17, 2006 5:53 am
- Full name: Edsel Apostol
Re: A question on testing methodology [here is the data]
OK, here are my results. The round robin between the opponents seems to have no effect on the ordering of the engines, at least in this test. It only makes the opponents' error bars more accurate; I couldn't find any effect at all on the error bars of the engines being tested. Maybe more games are needed, or maybe it has no effect at all. I was hoping that it would at least improve the error bars, so that we could be assured of a little more accuracy than the normal gauntlet gives. Oh well.
Gauntlet with exactdist:
Code: Select all
Rank Name Elo + - games score oppo. draws
1 Stockfish 1.4 JA 2943 17 17 1320 74% 2767 23%
2 Strelka 2.0 B 2913 17 16 1320 70% 2767 25%
3 Cyclone 3.4 2893 17 17 1320 67% 2767 18%
4 Grapefruit 1.0 beta 2891 16 16 1320 67% 2767 24%
5 Fruit 2.3.1 2872 16 16 1320 65% 2767 26%
6 Protector 1.3.1 x32 2841 16 16 1320 60% 2767 23%
7 Doch32 09.980 JA 2798 16 16 1320 54% 2767 23%
8 t20091113FHwPVf 2785 17 17 1200 43% 2837 22%
9 t20091113KAAL7 2774 17 17 1200 42% 2837 23%
10 t20091113 2769 17 17 1200 41% 2837 23%
11 t20091113KAAL8 2768 17 17 1200 41% 2837 22%
12 t20091113KAAL9 2767 17 17 1200 40% 2837 24%
13 t20091113SAtrue 2766 17 17 1200 40% 2837 23%
14 t20091113FHwPV2 2766 17 17 1200 40% 2837 25%
15 t20091113NTfalse 2763 17 17 1200 40% 2837 25%
16 bright-0.4a 2762 16 16 1320 49% 2767 20%
17 t20091113IIDSf 2760 17 17 1200 40% 2837 23%
18 t20091113KAAL6 2758 17 17 1200 39% 2837 23%
19 t20091113KAAL5 2756 17 17 1200 39% 2837 23%
20 Fruit 2.1 2737 16 16 1320 46% 2767 24%
21 Twisted Logic 20080620 2718 16 16 1320 43% 2767 27%
Round Robin with exactdist:

Code: Select all
Rank Name Elo + - games score oppo. draws
1 Stockfish 1.4 JA 2933 12 12 2400 69% 2793 22%
2 Strelka 2.0 B 2910 12 12 2400 66% 2795 26%
3 Cyclone 3.4 2891 12 12 2400 63% 2795 23%
4 Grapefruit 1.0 beta 2889 12 12 2400 63% 2796 26%
5 Fruit 2.3.1 2869 11 11 2400 60% 2797 27%
6 Protector 1.3.1 x32 2841 12 12 2400 56% 2798 25%
7 Doch32 09.980 JA 2796 12 12 2400 49% 2800 24%
8 t20091113FHwPVf 2785 17 17 1200 43% 2837 22%
9 t20091113KAAL7 2774 17 17 1200 42% 2837 23%
10 bright-0.4a 2773 12 12 2400 46% 2801 22%
11 t20091113 2769 17 17 1200 41% 2837 23%
12 t20091113KAAL8 2768 17 17 1200 41% 2837 22%
13 t20091113KAAL9 2767 17 17 1200 40% 2837 24%
14 t20091113FHwPV2 2766 17 17 1200 40% 2837 25%
15 t20091113SAtrue 2766 17 17 1200 40% 2837 23%
16 t20091113NTfalse 2763 17 17 1200 40% 2837 25%
17 t20091113IIDSf 2761 17 17 1200 40% 2837 23%
18 t20091113KAAL6 2759 17 17 1200 39% 2837 23%
19 t20091113KAAL5 2756 17 17 1200 39% 2837 23%
20 Fruit 2.1 2734 12 12 2400 40% 2803 24%
21 Twisted Logic 20080620 2731 12 12 2400 40% 2803 26%
Gauntlet with covariance:

Code: Select all
Rank Name Elo + - games score oppo. draws
1 Stockfish 1.4 JA 2943 17 17 1320 74% 2767 23%
2 Strelka 2.0 B 2913 16 16 1320 70% 2767 25%
3 Cyclone 3.4 2893 16 16 1320 67% 2767 18%
4 Grapefruit 1.0 beta 2891 16 16 1320 67% 2767 24%
5 Fruit 2.3.1 2872 16 16 1320 65% 2767 26%
6 Protector 1.3.1 x32 2841 16 16 1320 60% 2767 23%
7 Doch32 09.980 JA 2798 16 16 1320 54% 2767 23%
8 t20091113FHwPVf 2785 17 17 1200 43% 2837 22%
9 t20091113KAAL7 2774 17 17 1200 42% 2837 23%
10 t20091113 2769 17 17 1200 41% 2837 23%
11 t20091113KAAL8 2768 17 17 1200 41% 2837 22%
12 t20091113KAAL9 2767 17 17 1200 40% 2837 24%
13 t20091113SAtrue 2766 17 17 1200 40% 2837 23%
14 t20091113FHwPV2 2766 17 17 1200 40% 2837 25%
15 t20091113NTfalse 2763 17 17 1200 40% 2837 25%
16 bright-0.4a 2762 16 16 1320 49% 2767 20%
17 t20091113IIDSf 2760 17 17 1200 40% 2837 23%
18 t20091113KAAL6 2758 17 17 1200 39% 2837 23%
19 t20091113KAAL5 2756 17 17 1200 39% 2837 23%
20 Fruit 2.1 2737 16 16 1320 46% 2767 24%
21 Twisted Logic 20080620 2718 16 16 1320 43% 2767 27%
Round Robin with covariance:

Code: Select all
Rank Name Elo + - games score oppo. draws
1 Stockfish 1.4 JA 2933 12 12 2400 69% 2793 22%
2 Strelka 2.0 B 2910 12 12 2400 66% 2795 26%
3 Cyclone 3.4 2891 12 12 2400 63% 2795 23%
4 Grapefruit 1.0 beta 2889 12 12 2400 63% 2796 26%
5 Fruit 2.3.1 2869 11 11 2400 60% 2797 27%
6 Protector 1.3.1 x32 2841 12 12 2400 56% 2798 25%
7 Doch32 09.980 JA 2796 12 12 2400 49% 2800 24%
8 t20091113FHwPVf 2785 17 17 1200 43% 2837 22%
9 t20091113KAAL7 2774 17 17 1200 42% 2837 23%
10 bright-0.4a 2773 12 12 2400 46% 2801 22%
11 t20091113 2769 17 17 1200 41% 2837 23%
12 t20091113KAAL8 2768 17 17 1200 41% 2837 22%
13 t20091113KAAL9 2767 17 17 1200 40% 2837 24%
14 t20091113FHwPV2 2766 17 17 1200 40% 2837 25%
15 t20091113SAtrue 2766 17 17 1200 40% 2837 23%
16 t20091113NTfalse 2763 17 17 1200 40% 2837 25%
17 t20091113IIDSf 2761 17 17 1200 40% 2837 23%
18 t20091113KAAL6 2759 17 17 1200 39% 2837 23%
19 t20091113KAAL5 2756 17 17 1200 39% 2837 23%
20 Fruit 2.1 2734 12 12 2400 40% 2803 24%
21 Twisted Logic 20080620 2731 12 12 2400 40% 2803 26%
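As a sanity check on how these tables hang together: each engine's rating is roughly the average opponent rating plus the standard logistic conversion of its score percentage. A minimal sketch (bayeselo's joint fit and prior shift the numbers a little):

Code: Select all
# Recover a rating from the "oppo." and "score" columns with the standard
# logistic Elo model; small deviations from the tables are expected.
from math import log10

def elo_from_score(oppo_avg, score):
    return oppo_avg + 400 * log10(score / (1 - score))

print(round(elo_from_score(2837, 0.43)))   # ~2788; the table says 2785
print(round(elo_from_score(2767, 0.74)))   # ~2949; the table says 2943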
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: A question on testing methodology [here is the data]
These results, yours and those of Bob, are indeed surprising to me. The error bars for the candidate engines do not go down with RR games + covariance, so the number of gauntlet games required to meet predefined quality criteria can't be reduced, contrary to what I had hoped.
It seems that I have to give up ... although I still don't understand why. Maybe Rémi can comment on this, giving us a mathematical explanation for it?
At least we now have concrete results that can be discussed or referred to, instead of just statements and assumptions. Even if the outcome is not what I expected, it has definitely been better to test it instead of just saying that including RR games will not help.
Also, one sure result is that including RR games does not hurt.
Sven
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: A question on testing methodology [here is the data]
Just for the record, I did not "just state that it didn't work." I ran this exact same experiment last year and got the same results. I, too, thought it would make a difference. But it didn't. If I precede a comment with "I believe" or "I think", then you can assume that I am not certain. If I say "I know", that means I have actually tested the idea to determine the answer.

Sven Schüle wrote: Even if the outcome is not what I expected, it has definitely been better to test it instead of just saying that including RR games will not help.
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: A question on testing methodology [here is the data]
Remember, I had warned you. This was already tested last year when someone raised the same point, and the test produced the same results then as now.

Edsel Apostol wrote: OK, here are my results. The round robin between the opponents seems to have no effect on the ordering of the engines, at least in this test.
- Posts: 438
- Joined: Mon Apr 24, 2006 8:06 pm
Re: A question on testing methodology [here is the data]
Sven Schüle wrote: It seems that I have to give up ... although I still don't understand why. Maybe Rémi can comment on this, giving us a mathematical explanation for it?

Bayeselo offers 3 different methods to compute confidence intervals.
The default method assumes that the posterior distribution is an independent Gaussian (i.e., without covariance). It is the fastest, and the least accurate, of the three.
The "exactdist" method assumes that the ratings of the opponents are their true ratings, but does not make a Gaussian assumption. This produces more accurate results when the number of games is small, or the winning rate is close to 0% or 100% (with non-symmetrical intervals). With "exactdist", the uncertainty of opponent ratings is not taken into consideration at all.
The "covariance" method assumes a Gaussian distribution of the posterior. It always produces symmetrical intervals. It is the only method of all 3 that takes opponent uncertainty into consideration. It is the most accurate method when the number of games is high. The "los" command is also based on the same approximation as "covariance".
"covariance" has a cost cubic in the number of players (a matrix inversion). That is the reason why I did not make it the default: it is too slow when the number of players is very high.
In general, it is not a good idea to infer anything from confidence intervals when trying to compare two programs. Using LOS is better.

So, maybe, it is a good idea to re-run those experiments with "covariance" instead of "exactdist", and to compare LOS matrices, not confidence intervals.
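For reference, the usual closed-form approximation for LOS between two programs from their head-to-head results is the sketch below (the standard normal approximation, not necessarily the exact computation behind bayeselo's los command; draws drop out of the formula):

Code: Select all
from math import erf, sqrt

def los(wins, losses):
    # P(A is stronger than B) under a normal approximation; draws cancel.
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

print("%.3f" % los(120, 90))    # ~0.981: A is very likely stronger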
Rémi