I will bet you didn't train for a marathon by running 400M races, however. I saw some quirks I didn't like, and stopped that kind of testing for anything other than debugging.

Don wrote: However I do not believe there is much difference in testing against one program vs another, even self.

bob wrote: However, you are measuring two different things. When you test against yourself, you are asking "how does this change perform against a program whose only difference is that it does not have this change?" When you test against others, you ask a different question: "How does this change influence my play against a variety of opponents?"

Don wrote: There was a discussion recently concerning this on this forum. Basically in head to head matches the error margins can be taken at face value. When more than 2 programs are involved such as in the fashion you are describing, the error margins don't mean what you think they do because there are 2 sources of errors.

bob wrote: I don't follow your "more efficient use of resources."
In self testing you play your latest (hopefully best) against the previous best. The CPU time used running the previous best is wasted.
In testing against others, the cpu time spent running those programs is wasted.
To get the same error bar, you need the same number of games. Hence you waste the same amount of time whether it is by running your own program, or running the others.
So exactly how do you see self-play as "more efficient"? I see absolutely no difference in efficiency, with the risk that self-play distorts the improvement.
I think the reason is that you can view one program's rating as having NO error and just treat it as the reference version. For all you know it was rated with a million games and the rating has no error. Then you are concerned with the amount of error in the program of interest.
But let's say you are testing 2 versions against a single foreign program. You can consider that foreign program as a fixed reference, but EACH of the 2 versions you are interested in have an error margin. You are basically extrapolating the results indirectly. Hence my stick analogy, if 2 sticks are very close to the same length the easiest and most accurate way to determine which is longest is to hold them side by side, not to use a third stick with marks in it (a yardstick.)
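Editorial aside, not part of the post: the stick analogy can be put in numbers. The Python sketch below compares the statistical error of an estimated Elo difference when the two versions play each other directly versus when each plays the same fixed reference opponent. The 1000-game count and the 55% score are arbitrary assumptions, and draws are ignored for simplicity.

Code: Select all

import math

def se_score(p, n):
    """Standard error of an observed score fraction over n games (no draws)."""
    return math.sqrt(p * (1.0 - p) / n)

def elo_per_score(p):
    """Slope of the logistic Elo curve (Elo per unit of score) near score p."""
    return 400.0 / (math.log(10) * p * (1.0 - p))

n = 1000   # games per pairing (assumed)
p = 0.55   # assumed score of the stronger side

# Direct match A1 vs A2: every game measures the difference itself.
se_direct = se_score(p, n) * elo_per_score(p)

# Indirect: A1 vs B and A2 vs B, n games each (2n games in total).
# The two measurements are independent, so their errors add in quadrature.
se_indirect = math.sqrt(2) * se_score(p, n) * elo_per_score(p)

print(f"direct A1-A2 match : +/- {se_direct:.1f} Elo from {n} games")
print(f"indirect via B     : +/- {se_indirect:.1f} Elo from {2 * n} games")

With these assumptions the direct match resolves the difference to about +/- 11 Elo from 1000 games, while the indirect route gives about +/- 16 Elo even though it uses twice as many games, which is the "hold the sticks side by side" point.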
You believe in testing as you plan to run? I have no idea which program Komodo will play against so that is already a problem. Right now it is playing Nemo, a program I never considered testing against.

While I don't know who I WILL play, obviously, I DO KNOW who I WON'T play. I WON'T be playing against Crafty...
A pool of opponents with different searches and different evals gives, IMHO, a better indication of whether a change is good.
It better be really a big deal to be worth the extra testing time. And I know for a fact that it isn't a big deal.

Those two questions are related, but they are absolutely NOT the same. One can prove this simply by playing against two different groups of opponents and notice that the ratings are not identical between the two tests. Or looking more carefully, the number of wins, draws and losses changes.
I am an absolutely firm believer in "test like you plan to run". Whether it be in drag racing, or chess. I will try to dig up my old testing data where I addressed this specific data. It was quite a few years ago so it might take some digging, and it might not even be around...
I used to run marathons and 10k races but I never trained on the same course I ran on. But I felt that my training was similar enough. I think there is little difference in playing Critter, or Komodo for these kinds of tests.
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: obvious/easy move - final results
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: obvious/easy move - final results
What I meant was "test in the same environment you expect to see when running." For example, I test with ponder=on to make sure nothing changes. I test against a group of strong opponents. I use a varying set of starting positions so that I don't tune just to play better against the Ruy Lopez (or whatever). I even do some SMP testing although I limit that because it impacts testing speed (it uses extra cores for one game that could be used to play multiple games).

Rebel wrote: You say it yourself, different groups of opponents give different ratings. However self-play does not but has its own buts. Whatever system you chose it's imperfect. Combining them gives some more security, still imperfect.

bob wrote: However, you are measuring two different things. When you test against yourself, you are asking "how does this change perform against a program whose only difference is that it does not have this change?" When you test against others, you ask a different question: "How does this change influence my play against a variety of opponents?"
Those two questions are related, but they are absolutely NOT the same. One can prove this simply by playing against two different groups of opponents and notice that the ratings are not identical between the two tests. Or looking more carefully, the number of wins, draws and losses changes.
That's an intriguing statement, no idea what you mean by that.

I am an absolutely firm believer in "test like you plan to run".
There is no absolute truth but feel free to make an interesting contribution.

I will try to dig up my old testing data where I addressed this specific data. It was quite a few years ago so it might take some digging, and it might not even be around...
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: obvious/easy move - final results
That last point I KNOW to be wrong. I HAVE had test results (Crafty vs Crafty) that showed an improvement, while Crafty vs gauntlet showed a LOSS. That was back when I was first starting the cluster testing approach and it caught my eye. I tweaked king safety to make the new version more aggressive, and it was a significant jump over the previous version. UNTIL I played it against non-crafty opponents that quickly showed that it was TOO aggressive. I don't see any difference between playing 30K games against an opponent group, or 30K games against crafty (or 15K games). The entire test (for me) can be run in as short as 15 minutes...

Don wrote: So what is the problem with the results being overstated? We don't care if the result is overstated as we rarely run automated tests to measure how much improvement, we only use it in an attempt to show that there is some improvement.

bob wrote: This is not quite "urban legend". I ran such a test years ago, and what I saw, was an "overstated gain" when using self-play. If I would see +20 in self-play, it was ALWAYS much less than that against other opponents.
If self-testing is a major transitivity issue then testing computer vs computer is highly flawed too. The similarity between two programs is something like 99%, compared to how similar they are to humans.
You should run this test again. I think you will find that if you improve Crafty in self testing it will always translate to an improvement against other opponents if you run this test to the point that it is statistically convincing.
I can actually run this test again, and will do so. But it will take a bit of time as I don't have any cluster testing set up to run crafty vs crafty. But I can do something like Crafty-23.4 vs Crafty-23.5 and then each against the normal gauntlet. Will report back...
Every time we thought we saw this intransitivity it turned out that we were just looking at statistical noise, and running more games corrected the situation.
I already said I believe that intransitivity exists, so if you really wanted to show intransitivity you could if you worked very hard at it, perhaps rigging the test, searching for a very idiosyncratic way to take advantage of a Crafty weakness (but which doesn't work against other programs), or some other means. But it's not the kind of thing I lose any sleep over and it has not prevented rapid progress in Komodo.
-
Evert
- Posts: 2929
- Joined: Sat Jan 22, 2011 12:42 am
- Location: NL
Re: obvious/easy move - final results
I have a gut-feeling that it may make a difference whether you play against programs that are of similar strength to yours, or whether you play against programs that are perhaps slightly stronger.
In general, I like it if I play against an opponent that will exploit some weakness in my program, because then I can more easily detect what is wrong and fix it. I have seen situations where self-play showed a notable gain but I didn't gain anything at all against a group of opponents, and I've also seen the reverse: no difference at all in self-play but a notable improvement against other programs.
I try to mix it up a bit, but I always include a (the) previous version of my own program in the gauntlet because on average I find it a quick way to see if there is a regression.
-
Don
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: obvious/easy move - final results
Actually, training by running races is a poor training strategy. You train by running lots of miles, but not so many and not so fast that you wear yourself out.

bob wrote: I will bet you didn't train for a marathon by running 400M races, however. I saw some quirks I didn't like, and stopped that kind of testing for anything other than debugging.

Don wrote: However I do not believe there is much difference in testing against one program vs another, even self.

bob wrote: However, you are measuring two different things. When you test against yourself, you are asking "how does this change perform against a program whose only difference is that it does not have this change?" When you test against others, you ask a different question: "How does this change influence my play against a variety of opponents?"

Don wrote: There was a discussion recently concerning this on this forum. Basically in head to head matches the error margins can be taken at face value. When more than 2 programs are involved such as in the fashion you are describing, the error margins don't mean what you think they do because there are 2 sources of errors.

bob wrote: I don't follow your "more efficient use of resources."
In self testing you play your latest (hopefully best) against the previous best. The CPU time used running the previous best is wasted.
In testing against others, the cpu time spent running those programs is wasted.
To get the same error bar, you need the same number of games. Hence you waste the same amount of time whether it is by running your own program, or running the others.
So exactly how do you see self-play as "more efficient"? I see absolutely no difference in efficiency, with the risk that self-play distorts the improvement.
I think the reason is that you can view one program's rating as having NO error and just treat it as the reference version. For all you know it was rated with a million games and the rating has no error. Then you are concerned with the amount of error in the program of interest.
But let's say you are testing 2 versions against a single foreign program. You can consider that foreign program as a fixed reference, but EACH of the 2 versions you are interested in have an error margin. You are basically extrapolating the results indirectly. Hence my stick analogy, if 2 sticks are very close to the same length the easiest and most accurate way to determine which is longest is to hold them side by side, not to use a third stick with marks in it (a yardstick.)
You believe in testing as you plan to run? I have no idea which program Komodo will play against so that is already a problem. Right now it is playing Nemo, a program I never considered testing against.

While I don't know who I WILL play, obviously, I DO KNOW who I WON'T play. I WON'T be playing against Crafty...
A pool of opponents with different searches and different evals give, IMHO, a better indication of whether a change is good.
It better be really a big deal to be worth the extra testing time. And I know for a fact that it isn't a big deal.

Those two questions are related, but they are absolutely NOT the same. One can prove this simply by playing against two different groups of opponents and notice that the ratings are not identical between the two tests. Or looking more carefully, the number of wins, draws and losses changes.
I am an absolutely firm believer in "test like you plan to run". Whether it be in drag racing, or chess. I will try to dig up my old testing data where I addressed this specific data. It was quite a few years ago so it might take some digging, and it might not even be around...
I used to run marathons and 10k races but I never trained on the same course I ran on. But I felt that my training was similar enough. I think there is little difference in playing Critter, or Komodo for these kinds of tests.
The point is that you have to RUN to train and minor details are unimportant as long as you get the more important "big picture" right.
For example I want Komodo to primarily be good at time controls such as 40/1 hour or more, but I test considerably faster. I use a varied opening book even though I don't expect to play that way, and my primary opponent is a chess program (Komodo) but I don't expect to meet Komodo every round in a chess tournament. My opponents probably won't be running the hardware I am running now either; each one will be running something different.
None of that matters very much. For testing I just need to play games and get the more important aspects correct. I consider the time control a lot more important than the exact opponent I use. You can train for the marathon by running about 7 miles a day (perhaps alternating between 5 and 9) even though the distance of a marathon is 26.2 miles. Likewise, you would like to test at 40/1 (or whatever your ideal target is) but you can do almost as well testing considerably faster. You can use 100 different opponents but that is not necessary (even though you may meet one you didn't test against).
So I seriously doubt your testing is as specific as you think it is. If I had hardware to waste I would prefer to run at longer time controls; that would give greater benefit for the same error margins and time invested.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
-
Sven
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: obvious/easy move - final results
Agreed, but why do you and I get the same result, where I do not use all that sqrt() stuff?

hgm wrote: The point you missed is that the error bar in a difference of two independently measured ratings is the root-mean-square sum of the individual error bars. (So sqrt(2) as large if they were equal.) So if you are interested in the difference of the two ratings (to know which was better), you need an extra doubling of the number of games to get the individual error bars sqrt(2) smaller.
Case 1: A1-A2 self-play, 1000 games

Code: Select all

A1 - A2  +580 -60 =360

Rank Name   Elo    +    - games score oppo. draws
   1 A1     100   18   18  1000   76%  -100   36%
   2 A2    -100   18   18  1000   24%   100   36%

Case 2: A1-B and A2-B, 1000 games each

Code: Select all

A1 - B  +460 -180 =360
A2 - B  +180 -460 =360

Rank Name   Elo    +    - games score oppo. draws
   1 A1     100   18   18  1000   64%     0   36%
   2 B        0   13   13  2000   50%     0   36%
   3 A2    -100   18   18  1000   36%     0   36%

In case 1 I need to play a total of 1000 games to get error margins of +/- 18 for the ratings of A1 and A2, and an error margin of 18 * sqrt(2) = 25 for the comparison A1 vs. A2.
In case 2 I need to play a total of 2000 games to get error margins of +/- 18 for the ratings of A1 and A2, and an error margin of 18 * sqrt(2) = 25 for the comparison A1 vs. A2.
So both ways have the square root in it for the direct comparison of the two candidates, and don't have it for the individual ratings. Therefore I still see the reason for the doubling of games from self-play to gauntlet in the simple fact that N games of self-play produce N rated games for A1 and A2 simultaneously, while in the gauntlet variant there is always one "non-candidate" in each game. For me that "square root" is not the key argument.
Right or wrong?
Sven
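An editorial sketch, not part of Sven's post: the error margins in the two cases can be estimated directly from the win/draw/loss counts with a simple trinomial model. The snippet below uses the counts quoted above; the one-standard-deviation convention is an assumption, since rating tools report their intervals differently.

Code: Select all

import math

def elo_error(wins, draws, losses):
    """Approximate 1-sigma Elo error for one pairing from W/D/L counts."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - score ** 2             # per-game score variance
    se_score = math.sqrt(var / n)
    slope = 400.0 / (math.log(10) * score * (1.0 - score))   # Elo per unit score
    return slope * se_score

# Case 1: A1 vs A2 self-play, 1000 games (+580 -60 =360).
# Each game measures the A1-A2 difference directly.
diff_case1 = elo_error(580, 360, 60)

# Case 2: A1 vs B and A2 vs B, 1000 games each (+460 -180 =360 and the mirror).
# The two pairings are independent, so their errors add in quadrature.
diff_case2 = math.sqrt(2) * elo_error(460, 360, 180)

print(f"case 1: error of A1-A2 difference ~ +/- {diff_case1:.1f} Elo (1000 games)")
print(f"case 2: error of A1-A2 difference ~ +/- {diff_case2:.1f} Elo (2000 games)")

Under this model the 1000-game self-play match pins the difference down to roughly +/- 9 Elo, while the 2000-game gauntlet gives roughly +/- 13 Elo; to match the self-play precision the gauntlet would need about 4000 games, which is the figure argued for in the reply below.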
-
hgm
- Posts: 28452
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: obvious/easy move - final results
Wrong!
1) We don't get the same thing. I say that you need 4000 games (2000 A1-B and 2000 A2-B) to get the same accuracy for the difference between the A versions as with 1000 A1-A2 games. You say you need only 2000.
2) The ratings of A1 and A2 calculated by the rating program are not independent, but 100% anti-correlated. So the error in their difference adds directly, not through the sqrt stuff. You say the quoted data is synthetic, and I really wonder if BayesElo would report it like that. I think it should report 9 Elo error bars in the A1-A2 case, not 18. So that the error bar in the difference will be 18.
In any case, using the error bars reported by rating programs is tricky with few players, as the condition that the average rating is zero creates a correlation between the ratings. So the simple sqrt addition no longer applies, and you have to take account of the covariance as well as the variances.
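A small Monte Carlo sketch (an editorial illustration, not from the post) makes the point concrete. It repeatedly simulates a 1000-game self-play match and a 1000+1000-game gauntlet against a reference B, with assumed true scores taken from the numbers quoted above, and compares the spread of the estimated A1-A2 Elo difference:

Code: Select all

import math
import random

def elo(p):
    """Elo difference implied by an expected score p."""
    return 400.0 * math.log10(p / (1.0 - p))

def play(n, p_win, p_draw):
    """Simulate n games; return the score fraction for the first player."""
    score = 0.0
    for _ in range(n):
        r = random.random()
        if r < p_win:
            score += 1.0
        elif r < p_win + p_draw:
            score += 0.5
    return score / n

def sd(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

random.seed(1)
trials = 2000
diff_self, diff_gauntlet = [], []

for _ in range(trials):
    # Self-play: 1000 games A1 vs A2, true 58% wins / 36% draws (76% score).
    diff_self.append(elo(play(1000, 0.58, 0.36)))

    # Gauntlet: 1000 games A1 vs B (64% score) and 1000 games A2 vs B (36% score).
    p1 = play(1000, 0.46, 0.36)
    p2 = play(1000, 0.18, 0.36)
    diff_gauntlet.append(elo(p1) - elo(p2))

print(f"self-play, 1000 games: spread of estimated diff ~ {sd(diff_self):.1f} Elo")
print(f"gauntlet,  2000 games: spread of estimated diff ~ {sd(diff_gauntlet):.1f} Elo")

Typical output is a spread of roughly 9 Elo for the self-play match and roughly 13 Elo for the gauntlet, consistent with needing about four times as many gauntlet games (4000 rather than 1000) for the same precision on the difference.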
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: obvious/easy move - final results
I believe you need to test against as broad a group of programs as possible. You need endgame experts, attacking experts, "duck your head, and bolt the doors" experts, etc. The NUMBER is not that important, if you can be certain you have a "representative group". Testing against GNUChess will teach your program the best way to beat gnu, even though that program is not exactly the primary competition you will be up against. I'd love to be able to include some commercial programs in my testing, but my test environment is linux, no wine allowed on cluster nodes, no graphical devices, just computing.

Don wrote: Actually, training by running races is a poor training strategy. You train by running lots of miles, but not so many and not so fast that you wear yourself out.

bob wrote: I will bet you didn't train for a marathon by running 400M races, however. I saw some quirks I didn't like, and stopped that kind of testing for anything other than debugging.

Don wrote: However I do not believe there is much difference in testing against one program vs another, even self.

bob wrote: However, you are measuring two different things. When you test against yourself, you are asking "how does this change perform against a program whose only difference is that it does not have this change?" When you test against others, you ask a different question: "How does this change influence my play against a variety of opponents?"

Don wrote: There was a discussion recently concerning this on this forum. Basically in head to head matches the error margins can be taken at face value. When more than 2 programs are involved such as in the fashion you are describing, the error margins don't mean what you think they do because there are 2 sources of errors.

bob wrote: I don't follow your "more efficient use of resources."
In self testing you play your latest (hopefully best) against the previous best. The CPU time used running the previous best is wasted.
In testing against others, the cpu time spent running those programs is wasted.
To get the same error bar, you need the same number of games. Hence you waste the same amount of time whether it is by running your own program, or running the others.
So exactly how do you see self-play as "more efficient"? I see absolutely no difference in efficiency, with the risk that self-play distorts the improvement.
I think the reason is that you can view one program's rating as having NO error and just treat it as the reference version. For all you know it was rated with a million games and the rating has no error. Then you are concerned with the amount of error in the program of interest.
But let's say you are testing 2 versions against a single foreign program. You can consider that foreign program as a fixed reference, but EACH of the 2 versions you are interested in have an error margin. You are basically extrapolating the results indirectly. Hence my stick analogy, if 2 sticks are very close to the same length the easiest and most accurate way to determine which is longest is to hold them side by side, not to use a third stick with marks in it (a yardstick.)
You believe in testing as you plan to run? I have no idea which program Komodo will play against so that is already a problem. Right now it is playing Nemo, a program I never considered testing against.

While I don't know who I WILL play, obviously, I DO KNOW who I WON'T play. I WON'T be playing against Crafty...
A pool of opponents with different searches and different evals give, IMHO, a better indication of whether a change is good.
It better be really a big deal to be worth the extra testing time. And I know for a fact that it isn't a big deal.

Those two questions are related, but they are absolutely NOT the same. One can prove this simply by playing against two different groups of opponents and notice that the ratings are not identical between the two tests. Or looking more carefully, the number of wins, draws and losses changes.
I am an absolutely firm believer in "test like you plan to run". Whether it be in drag racing, or chess. I will try to dig up my old testing data where I addressed this specific data. It was quite a few years ago so it might take some digging, and it might not even be around...
I used to run marathons and 10k races but I never trained on the same course I ran on. But I felt that my training was similar enough. I think there is little difference in playing Critter, or Komodo for these kinds of tests.
The point is that you have to RUN to train and minor details are unimportant as long as you get the more important "big picture" right.
For example I want Komodo to primarily be good at time controls such as 40/1 hour or more, but I test considerably faster. I use a varied opening book even though I don't expect to play that way and my primary opponent is a chess program (Komodo) but I don't expect to meet Komodo every round in a chess tournament. My opponents probably won't be running the hardware I am running now either but each one something different.
None of that matters very much. For testing I just need to play games and get the more important aspects correct. I consider the time control a lot more important that the exact opponent I use. You can train for the marathon by running about 7 miles a day (perhaps alternating between 5 and 9) even though the distance of a marathon is 26.2 miles. Likewise, you would like to test at 40/1 (or whatever your ideal target is) but you can do almost as well testing considerably faster. You can use 100 different opponents but that is not necessary (even though you may meet one you didn't test against.)
So I seriously doubt your testing is as specific as you think it is. If I had hardware to waste I would prefer to run at longer time controls, that would give greater benefit for the same error margins and time invested.
Anyone can feel free to test as they want. I just don't want to see the lasting impression that "self-test" is OK. It MIGHT be, but I have seen evidence that suggests problems. I see fewer problems with using several opponents. And you can learn quite a bit with some study.
For example, your new version is +10 elo, not bad. But when you look at the per-opponent tests, you see a -10 against one opponent, and a +15 against the others. Interesting to look and see what caused the drop against one opponent, and determine if that can be eliminated. Which will gain even more elo. We see this fairly frequently, in fact. Fiddle with king safety against a strong attacker or defender and it will exploit holes. The others won't since they don't quite understand the problem.
There is the issue of number of games. I'm not sure I understand why you need fewer games in self-play. When you play A vs B, you STILL need the same number of games played to reduce the error bar, and the games against the old version still waste 1/2 of the CPU time since the old program is irrelevant. It takes me 30K games to get to +/-4 elo, regardless of whether those 30K games are against 1 opponent, or several. Unless I have overlooked something...
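As a rough cross-check of the "30K games for +/- 4 Elo" figure (an editorial sketch; the draw rates, the two-standard-deviation interval, and the balanced-match assumption are all assumptions, and BayesElo's own model will give somewhat different numbers), the games needed for a target error bar follow directly from the per-game score variance:

Code: Select all

import math

def games_needed(target_elo, draw_rate, sigmas=2.0):
    """Games needed so that `sigmas` standard deviations of the measured
    Elo, for a roughly balanced match, fit inside +/- target_elo."""
    p_win = (1.0 - draw_rate) / 2.0            # balanced match assumed
    var = p_win + 0.25 * draw_rate - 0.25      # per-game score variance
    slope = 400.0 / (math.log(10) * 0.25)      # Elo per unit score at a 50% score
    se_needed = target_elo / (sigmas * slope)
    return math.ceil(var / se_needed ** 2)

for dr in (0.3, 0.5, 0.7):
    print(f"draw rate {dr:.0%}: ~{games_needed(4.0, dr):,} games for +/- 4 Elo")

With these assumptions the answer lands in the 9,000-21,000 game range depending on the draw rate; the exact figure from a rating tool differs, but the order of magnitude agrees with the tens of thousands of games quoted above.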
-
Sven
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: obvious/easy move - final results
To replace "synthetic" with "real" BayesElo output, here is what I get for the same results quoted above.hgm wrote:Wrong!
1) We don't get the same thing. I say that you need 4000 games (2000 A1-B ad 2000 A2-B) to get the same accuracy for the difference between the A versions as with 1000 A1-A2 games. You say you need only 2000.
2) The ratings of A1 and A2 calculated by the rating program are not independent, but 100% anti-correlating. So the error in their difference does directly add, and not throug the sqrt stuff. You say the quoted data is synthetic, and I really wonder if BaysElo would report itt like that. I think it should report 9 Elo error bars in the A1-A2 case, not 18. So that the error bar in the difference will be 18.
In any case, using the error bars reported by rating programs is tricky with few players, as the condition that the average rating is zero creates a correlation between the ratings. So the simple sqrt addition no longer applies, and you have to take acount of the covariance as well as the variances.
Code: Select all

Rank Name   Elo    +    - games score oppo. draws
   1 A1      47    8    8  1000   76%   -47   36%
   2 A2     -47    8    8  1000   24%    47   36%

Code: Select all

Rank Name   Elo    +    - games score oppo. draws
   1 A1      99   12   12  1000   64%    -1   36%
   2 B       -1    8    9  2000   50%     0   36%
   3 A2     -99   12   12  1000   36%    -1   36%

Code: Select all

Rank Name   Elo    +    - games score oppo. draws
   1 A1     100    8    9  2000   64%    -1   36%
   2 B       -1    6    6  4000   50%     1   36%
   3 A2     -99    9    9  2000   36%    -1   36%

Sven
-
Michel
- Posts: 2292
- Joined: Mon Sep 29, 2008 1:50 am
Re: obvious/easy move - final results
In principle you need fewer games to prove that the new version is stronger than the old version when using self testing.

It takes me 30K games to get to +/-4 elo, regardless of whether those 30K games are against 1 opponent, or several. Unless I have overlooked something...
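A minimal sketch of that claim (editorial, with assumed numbers): to resolve a given true edge with the same confidence, a direct self-play match needs fewer total games than a gauntlet through a common reference, because the gauntlet's estimate of the difference carries the combined error of two independent pairings. The 40% draw rate, the +5 Elo edge, and the 1.96-sigma threshold below are assumptions, and none of this says anything about whether a self-play gain transfers to other opponents.

Code: Select all

import math

def games_to_detect(edge_elo, draw_rate=0.4, z=1.96, indirect=False):
    """Rough number of games per pairing needed to resolve a true edge of
    `edge_elo` at z standard deviations, for roughly balanced matches.
    With indirect=True the pairing is 'candidate vs reference', and the
    variance of the estimated difference doubles."""
    p_win = (1.0 - draw_rate) / 2.0
    var = p_win + 0.25 * draw_rate - 0.25      # per-game score variance
    slope = 400.0 / (math.log(10) * 0.25)      # Elo per unit score at a 50% score
    var_elo = var * slope ** 2                 # per-game variance in Elo terms
    if indirect:
        var_elo *= 2.0                         # two independent pairings
    return math.ceil(var_elo * (z / edge_elo) ** 2)

print("self-play match, total games:       ", games_to_detect(5.0))
print("gauntlet via reference, total games:", 2 * games_to_detect(5.0, indirect=True))

Under these assumptions the gauntlet needs roughly four times as many total games as the direct match for the same resolving power, which is one way to read "fewer games in self testing".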