I will bet you didn't train for a marathon by running 400M races, however. I saw some quirks I didn't like, and stopped that kind of testing for anything other than debugging.

Don wrote: However I do not believe there is much difference in testing against one program vs another, even self.

bob wrote: However, you are measuring two different things. When you test against yourself, you are asking "how does this change perform against a program whose only difference is that it does not have this change?" When you test against others, you ask a different question: "How does this change influence my play against a variety of opponents?"

Don wrote: There was a discussion recently concerning this on this forum. Basically in head to head matches the error margins can be taken at face value. When more than 2 programs are involved such as in the fashion you are describing, the error margins don't mean what you think they do because there are 2 sources of errors.

bob wrote: I don't follow your "more efficient use of resources."
In self testing you play your latest (hopefully best) against the previous best. The CPU time used running the previous best is wasted.
In testing against others, the cpu time spent running those programs is wasted.
To get the same error bar, you need the same number of games. Hence you waste the same amount of time whether it is by running your own program, or running the others.
So exactly how do you see self-play as "more efficient"? I see absolutely no difference in efficiency, with the risk that self-play distorts the improvement.
I think the reason is that you can view one program's rating as having NO error and just treat it as the reference version. For all you know it was rated with a million games and the rating has no error. Then you are concerned with the amount of error in the program of interest.
But let's say you are testing 2 versions against a single foreign program. You can consider that foreign program as a fixed reference, but EACH of the 2 versions you are interested in have an error margin. You are basically extrapolating the results indirectly. Hence my stick analogy, if 2 sticks are very close to the same length the easiest and most accurate way to determine which is longest is to hold them side by side, not to use a third stick with marks in it (a yardstick.)
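Editorial aside, not part of the post: the stick analogy can be put in numbers. The Python sketch below compares the statistical error of an estimated Elo difference when the two versions play each other directly versus when each plays the same fixed reference opponent. The 1000-game count and the 55% score are arbitrary assumptions, and draws are ignored for simplicity.

Code: Select all

import math

def se_score(p, n):
    """Standard error of an observed score fraction over n games (no draws)."""
    return math.sqrt(p * (1.0 - p) / n)

def elo_per_score(p):
    """Slope of the logistic Elo curve (Elo per unit of score) near score p."""
    return 400.0 / (math.log(10) * p * (1.0 - p))

n = 1000   # games per pairing (assumed)
p = 0.55   # assumed score of the stronger side

# Direct match A1 vs A2: every game measures the difference itself.
se_direct = se_score(p, n) * elo_per_score(p)

# Indirect: A1 vs B and A2 vs B, n games each (2n games in total).
# The two measurements are independent, so their errors add in quadrature.
se_indirect = math.sqrt(2) * se_score(p, n) * elo_per_score(p)

print(f"direct A1-A2 match : +/- {se_direct:.1f} Elo from {n} games")
print(f"indirect via B     : +/- {se_indirect:.1f} Elo from {2 * n} games")

With these assumptions the direct match resolves the difference to about +/- 11 Elo from 1000 games, while the indirect route gives about +/- 16 Elo even though it uses twice as many games, which is the "hold the sticks side by side" point.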
You believe in testing as you plan to run? I have no idea which program Komodo will play against so that is already a problem. Right now it is playing Nemo, a program I never considered testing against.

While I don't know who I WILL play, obviously, I DO KNOW who I WON'T play. I WON'T be playing against Crafty...
A pool of opponents with different searches and different evals gives, IMHO, a better indication of whether a change is good.
It better be really a big deal to be worth the extra testing time. And I know for a fact that it isn't a big deal.

Those two questions are related, but they are absolutely NOT the same. One can prove this simply by playing against two different groups of opponents and notice that the ratings are not identical between the two tests. Or looking more carefully, the number of wins, draws and losses changes.
I am an absolutely firm believer in "test like you plan to run". Whether it be in drag racing, or chess. I will try to dig up my old testing data where I addressed this specific data. It was quite a few years ago so it might take some digging, and it might not even be around...
I used to run marathons and 10k races but I never trained on the same course I ran on. But I felt that my training was similar enough. I think there is little difference in playing Critter, or Komodo for these kinds of tests.
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: obvious/easy move - final results
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: obvious/easy move - final results
What I meant was "test in the same environment you expect to see when running." For example, I test with ponder=on to make sure nothing changes. I test against a group of strong opponents. I use a varying set of starting positions so that I don't tune just to play better against the Ruy Lopez (or whatever). I even do some SMP testing although I limit that because it impacts testing speed (it uses extra cores for one game that could be used to play multiple games).

Rebel wrote: You say it yourself, different groups of opponents give different ratings. However self-play does not but has its own buts. Whatever system you chose it's imperfect. Combining them gives some more security, still imperfect.

bob wrote: However, you are measuring two different things. When you test against yourself, you are asking "how does this change perform against a program whose only difference is that it does not have this change?" When you test against others, you ask a different question: "How does this change influence my play against a variety of opponents?"
Those two questions are related, but they are absolutely NOT the same. One can prove this simply by playing against two different groups of opponents and notice that the ratings are not identical between the two tests. Or looking more carefully, the number of wins, draws and losses changes.
That's an intriguing statement, no idea what you mean by that.

I am an absolutely firm believer in "test like you plan to run".
There is no absolute truth but feel free to make an interesting contribution.

I will try to dig up my old testing data where I addressed this specific data. It was quite a few years ago so it might take some digging, and it might not even be around...
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: obvious/easy move - final results
That last point I KNOW to be wrong. I HAVE had test results (Crafty vs Crafty) that showed an improvement, while Crafty vs gauntlet showed a LOSS. That was back when I was first starting the cluster testing approach and it caught my eye. I tweaked king safety to make the new version more aggressive, and it was a significant jump over the previous version. UNTIL I played it against non-crafty opponents that quickly showed that it was TOO aggressive. I don't see any difference between playing 30K games against an opponent group, or 30K games against crafty (or 15K games). The entire test (for me) can be run in as short as 15 minutes...

Don wrote: So what is the problem with the results being overstated? We don't care if the result is overstated as we rarely run automated tests to measure how much improvement, we only use it in an attempt to show that there is some improvement.

bob wrote: This is not quite "urban legend". I ran such a test years ago, and what I saw, was an "overstated gain" when using self-play. If I would see +20 in self-play, it was ALWAYS much less than that against other opponents.
If self-testing is a major transitivity issue then testing computer vs computer is highly flawed too. The similarity between two programs is something like 99%, compared to how similar they are to humans.
You should run this test again. I think you will find that if you improve Crafty in self testing it will always translate to an improvement against other opponents if you run this test to the point that it is statistically convincing.
I can actually run this test again, and will do so. But it will take a bit of time as I don't have any cluster testing set up to run crafty vs crafty. But I can do something like Crafty-23.4 vs Crafty-23.5 and then each against the normal gauntlet. Will report back...
Every time we thought we saw this intransitivity it turned out that we were just looking at statistical noise, and running more games corrected the situation.
I already said I believe that intransitivity exists, so if you really wanted to show intransitivity you could if you worked very hard at it, perhaps rigging the test, searching for a very idiosyncratic way to take advantage of a Crafty weakness (but which doesn't work against other programs), or some other means. But it's not the kind of thing I lose any sleep over and it has not prevented rapid progress in Komodo.
-
Evert
- Posts: 2929
- Joined: Sat Jan 22, 2011 12:42 am
- Location: NL
Re: obvious/easy move - final results
I have a gut-feeling that it may make a difference whether you play against programs that are of similar strength to yours, or whether you play against programs that are perhaps slightly stronger.
In general, I like it if I play against an opponent that will exploit some weakness in my program, because then I can more easily detect what is wrong and fix it. I have seen situations where self-play showed a notable gain but I didn't gain anything at all against a group of opponents, and I've also seen the reverse: no difference at all in self-play but a notable improvement against other programs.
I try to mix it up a bit, but I always include a (the) previous version of my own program in the gauntlet because on average I find it a quick way to see if there is a regression.
-
Don
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: obvious/easy move - final results
Actually, training by running races is a poor training strategy. You train by running lots of miles, but not so many and not so fast that you wear yourself out.

bob wrote: I will bet you didn't train for a marathon by running 400M races, however. I saw some quirks I didn't like, and stopped that kind of testing for anything other than debugging.

Don wrote: However I do not believe there is much difference in testing against one program vs another, even self.

bob wrote: However, you are measuring two different things. When you test against yourself, you are asking "how does this change perform against a program whose only difference is that it does not have this change?" When you test against others, you ask a different question: "How does this change influence my play against a variety of opponents?"

Don wrote: There was a discussion recently concerning this on this forum. Basically in head to head matches the error margins can be taken at face value. When more than 2 programs are involved such as in the fashion you are describing, the error margins don't mean what you think they do because there are 2 sources of errors.

bob wrote: I don't follow your "more efficient use of resources."
In self testing you play your latest (hopefully best) against the previous best. The CPU time used running the previous best is wasted.
In testing against others, the cpu time spent running those programs is wasted.
To get the same error bar, you need the same number of games. Hence you waste the same amount of time whether it is by running your own program, or running the others.
So exactly how do you see self-play as "more efficient"? I see absolutely no difference in efficiency, with the risk that self-play distorts the improvement.
I think the reason is that you can view one program's rating as having NO error and just treat it as the reference version. For all you know it was rated with a million games and the rating has no error. Then you are concerned with the amount of error in the program of interest.
But let's say you are testing 2 versions against a single foreign program. You can consider that foreign program as a fixed reference, but EACH of the 2 versions you are interested in have an error margin. You are basically extrapolating the results indirectly. Hence my stick analogy, if 2 sticks are very close to the same length the easiest and most accurate way to determine which is longest is to hold them side by side, not to use a third stick with marks in it (a yardstick.)
You believe in testing as you plan to run? I have no idea which program Komodo will play against so that is already a problem. Right now it is playing Nemo, a program I never considered testing against.

While I don't know who I WILL play, obviously, I DO KNOW who I WON'T play. I WON'T be playing against Crafty...
A pool of opponents with different searches and different evals give, IMHO, a better indication of whether a change is good.
It better be really a big deal to be worth the extra testing time. And I know for a fact that it isn't a big deal.

Those two questions are related, but they are absolutely NOT the same. One can prove this simply by playing against two different groups of opponents and notice that the ratings are not identical between the two tests. Or looking more carefully, the number of wins, draws and losses changes.
I am an absolutely firm believer in "test like you plan to run". Whether it be in drag racing, or chess. I will try to dig up my old testing data where I addressed this specific data. It was quite a few years ago so it might take some digging, and it might not even be around...
I used to run marathons and 10k races but I never trained on the same course I ran on. But I felt that my training was similar enough. I think there is little difference in playing Critter, or Komodo for these kinds of tests.
The point is that you have to RUN to train and minor details are unimportant as long as you get the more important "big picture" right.
For example I want Komodo to primarily be good at time controls such as 40/1 hour or more, but I test considerably faster. I use a varied opening book even though I don't expect to play that way, and my primary opponent is a chess program (Komodo) but I don't expect to meet Komodo every round in a chess tournament. My opponents probably won't be running the hardware I am running now either; each one will be running something different.
None of that matters very much. For testing I just need to play games and get the more important aspects correct. I consider the time control a lot more important than the exact opponent I use. You can train for the marathon by running about 7 miles a day (perhaps alternating between 5 and 9) even though the distance of a marathon is 26.2 miles. Likewise, you would like to test at 40/1 (or whatever your ideal target is) but you can do almost as well testing considerably faster. You can use 100 different opponents but that is not necessary (even though you may meet one you didn't test against).
So I seriously doubt your testing is as specific as you think it is. If I had hardware to waste I would prefer to run at longer time controls; that would give greater benefit for the same error margins and time invested.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
-
Sven
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: obvious/easy move - final results
Agreed, but why do you and I get the same result, where I do not use all that sqrt() stuff?

hgm wrote: The point you missed is that the error bar in a difference of two independently measured ratings is the root-mean-square sum of the individual error bars. (So sqrt(2) as large if they were equal.) So if you are interested in the difference of the two ratings (to know which was better), you need an extra doubling of the number of games to get the individual error bars sqrt(2) smaller.
Case 1: A1-A2 self-play, 1000 games

Code: Select all

A1 - A2  +580 -60 =360

Rank Name   Elo    +    - games score oppo. draws
   1 A1     100   18   18  1000   76%  -100   36%
   2 A2    -100   18   18  1000   24%   100   36%

Case 2: A1-B and A2-B, 1000 games each

Code: Select all

A1 - B  +460 -180 =360
A2 - B  +180 -460 =360

Rank Name   Elo    +    - games score oppo. draws
   1 A1     100   18   18  1000   64%     0   36%
   2 B        0   13   13  2000   50%     0   36%
   3 A2    -100   18   18  1000   36%     0   36%

In case 1 I need to play a total of 1000 games to get error margins of +/- 18 for the ratings of A1 and A2, and an error margin of 18 * sqrt(2) = 25 for the comparison A1 vs. A2.
In case 2 I need to play a total of 2000 games to get error margins of +/- 18 for the ratings of A1 and A2, and an error margin of 18 * sqrt(2) = 25 for the comparison A1 vs. A2.
So both ways have the square root in it for the direct comparison of the two candidates, and don't have it for the individual ratings. Therefore I still see the reason for the doubling of games from self-play to gauntlet in the simple fact that N games of self-play produce N rated games for A1 and A2 simultaneously, while in the gauntlet variant there is always one "non-candidate" in each game. For me that "square root" is not the key argument.
Right or wrong?
Sven
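An editorial sketch, not part of Sven's post: the error margins in the two cases can be estimated directly from the win/draw/loss counts with a simple trinomial model. The snippet below uses the counts quoted above; the one-standard-deviation convention is an assumption, since rating tools report their intervals differently.

Code: Select all

import math

def elo_error(wins, draws, losses):
    """Approximate 1-sigma Elo error for one pairing from W/D/L counts."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - score ** 2             # per-game score variance
    se_score = math.sqrt(var / n)
    slope = 400.0 / (math.log(10) * score * (1.0 - score))   # Elo per unit score
    return slope * se_score

# Case 1: A1 vs A2 self-play, 1000 games (+580 -60 =360).
# Each game measures the A1-A2 difference directly.
diff_case1 = elo_error(580, 360, 60)

# Case 2: A1 vs B and A2 vs B, 1000 games each (+460 -180 =360 and the mirror).
# The two pairings are independent, so their errors add in quadrature.
diff_case2 = math.sqrt(2) * elo_error(460, 360, 180)

print(f"case 1: error of A1-A2 difference ~ +/- {diff_case1:.1f} Elo (1000 games)")
print(f"case 2: error of A1-A2 difference ~ +/- {diff_case2:.1f} Elo (2000 games)")

Under this model the 1000-game self-play match pins the difference down to roughly +/- 9 Elo, while the 2000-game gauntlet gives roughly +/- 13 Elo; to match the self-play precision the gauntlet would need about 4000 games, which is the figure argued for in the reply below.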
-
hgm
- Posts: 28452
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: obvious/easy move - final results
Wrong!
1) We don't get the same thing. I say that you need 4000 games (2000 A1-B and 2000 A2-B) to get the same accuracy for the difference between the A versions as with 1000 A1-A2 games. You say you need only 2000.
2) The ratings of A1 and A2 calculated by the rating program are not independent, but 100% anti-correlated. So the error in their difference adds directly, not through the sqrt stuff. You say the quoted data is synthetic, and I really wonder if BayesElo would report it like that. I think it should report 9 Elo error bars in the A1-A2 case, not 18. So that the error bar in the difference will be 18.
In any case, using the error bars reported by rating programs is tricky with few players, as the condition that the average rating is zero creates a correlation between the ratings. So the simple sqrt addition no longer applies, and you have to take account of the covariance as well as the variances.
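A small Monte Carlo sketch (an editorial illustration, not from the post) makes the point concrete. It repeatedly simulates a 1000-game self-play match and a 1000+1000-game gauntlet against a reference B, with assumed true scores taken from the numbers quoted above, and compares the spread of the estimated A1-A2 Elo difference:

Code: Select all

import math
import random

def elo(p):
    """Elo difference implied by an expected score p."""
    return 400.0 * math.log10(p / (1.0 - p))

def play(n, p_win, p_draw):
    """Simulate n games; return the score fraction for the first player."""
    score = 0.0
    for _ in range(n):
        r = random.random()
        if r < p_win:
            score += 1.0
        elif r < p_win + p_draw:
            score += 0.5
    return score / n

def sd(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

random.seed(1)
trials = 2000
diff_self, diff_gauntlet = [], []

for _ in range(trials):
    # Self-play: 1000 games A1 vs A2, true 58% wins / 36% draws (76% score).
    diff_self.append(elo(play(1000, 0.58, 0.36)))

    # Gauntlet: 1000 games A1 vs B (64% score) and 1000 games A2 vs B (36% score).
    p1 = play(1000, 0.46, 0.36)
    p2 = play(1000, 0.18, 0.36)
    diff_gauntlet.append(elo(p1) - elo(p2))

print(f"self-play, 1000 games: spread of estimated diff ~ {sd(diff_self):.1f} Elo")
print(f"gauntlet,  2000 games: spread of estimated diff ~ {sd(diff_gauntlet):.1f} Elo")

Typical output is a spread of roughly 9 Elo for the self-play match and roughly 13 Elo for the gauntlet, consistent with needing about four times as many gauntlet games (4000 rather than 1000) for the same precision on the difference.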
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: obvious/easy move - final results
I believe you need to test against as broad a group of programs as possible. You need endgame experts, attacking experts, "duck your head, and bolt the doors" experts, etc. The NUMBER is not that important, if you can be certain you have a "representative group". Testing against GNUChess will teach your program the best way to beat gnu, even though that program is not exactly the primary competition you will be up against. I'd love to be able to include some commercial programs in my testing, but my test environment is linux, no wine allowed on cluster nodes, no graphical devices, just computing.

Don wrote: Actually, training by running races is a poor training strategy. You train by running lots of miles, but not so many and not so fast that you wear yourself out.

bob wrote: I will bet you didn't train for a marathon by running 400M races, however. I saw some quirks I didn't like, and stopped that kind of testing for anything other than debugging.

Don wrote: However I do not believe there is much difference in testing against one program vs another, even self.

bob wrote: However, you are measuring two different things. When you test against yourself, you are asking "how does this change perform against a program whose only difference is that it does not have this change?" When you test against others, you ask a different question: "How does this change influence my play against a variety of opponents?"

Don wrote: There was a discussion recently concerning this on this forum. Basically in head to head matches the error margins can be taken at face value. When more than 2 programs are involved such as in the fashion you are describing, the error margins don't mean what you think they do because there are 2 sources of errors.

bob wrote: I don't follow your "more efficient use of resources."
In self testing you play your latest (hopefully best) against the previous best. The CPU time used running the previous best is wasted.
In testing against others, the cpu time spent running those programs is wasted.
To get the same error bar, you need the same number of games. Hence you waste the same amount of time whether it is by running your own program, or running the others.
So exactly how do you see self-play as "more efficient"? I see absolutely no difference in efficiency, with the risk that self-play distorts the improvement.
I think the reason is that you can view one program's rating as having NO error and just treat it as the reference version. For all you know it was rated with a million games and the rating has no error. Then you are concerned with the amount of error in the program of interest.
But let's say you are testing 2 versions against a single foreign program. You can consider that foreign program as a fixed reference, but EACH of the 2 versions you are interested in have an error margin. You are basically extrapolating the results indirectly. Hence my stick analogy, if 2 sticks are very close to the same length the easiest and most accurate way to determine which is longest is to hold them side by side, not to use a third stick with marks in it (a yardstick.)
You believe in testing as you plan to run? I have no idea which program Komodo will play against so that is already a problem. Right now it is playing Nemo, a program I never considered testing against.

While I don't know who I WILL play, obviously, I DO KNOW who I WON'T play. I WON'T be playing against Crafty...
A pool of opponents with different searches and different evals give, IMHO, a better indication of whether a change is good.
It better be really a big deal to be worth the extra testing time. And I know for a fact that it isn't a big deal.

Those two questions are related, but they are absolutely NOT the same. One can prove this simply by playing against two different groups of opponents and notice that the ratings are not identical between the two tests. Or looking more carefully, the number of wins, draws and losses changes.
I am an absolutely firm believer in "test like you plan to run". Whether it be in drag racing, or chess. I will try to dig up my old testing data where I addressed this specific data. It was quite a few years ago so it might take some digging, and it might not even be around...
I used to run marathons and 10k races but I never trained on the same course I ran on. But I felt that my training was similar enough. I think there is little difference in playing Critter, or Komodo for these kinds of tests.
The point is that you have to RUN to train and minor details are unimportant as long as you get the more important "big picture" right.
For example I want Komodo to primarily be good at time controls such as 40/1 hour or more, but I test considerably faster. I use a varied opening book even though I don't expect to play that way and my primary opponent is a chess program (Komodo) but I don't expect to meet Komodo every round in a chess tournament. My opponents probably won't be running the hardware I am running now either but each one something different.
None of that matters very much. For testing I just need to play games and get the more important aspects correct. I consider the time control a lot more important that the exact opponent I use. You can train for the marathon by running about 7 miles a day (perhaps alternating between 5 and 9) even though the distance of a marathon is 26.2 miles. Likewise, you would like to test at 40/1 (or whatever your ideal target is) but you can do almost as well testing considerably faster. You can use 100 different opponents but that is not necessary (even though you may meet one you didn't test against.)
So I seriously doubt your testing is as specific as you think it is. If I had hardware to waste I would prefer to run at longer time controls, that would give greater benefit for the same error margins and time invested.
Anyone can feel free to test as they want. I just don't want to see the lasting impression that "self-test" is OK. It MIGHT be, but I have seen evidence that suggests problems. I see fewer problems with using several opponents. And you can learn quite a bit with some study.
For example, your new version is +10 elo, not bad. But when you look at the per-opponent tests, you see a -10 against one opponent, and a +15 against the others. Interesting to look and see what caused the drop against one opponent, and determine if that can be eliminated. Which will gain even more elo. We see this fairly frequently, in fact. Fiddle with king safety against a strong attacker or defender and it will exploit holes. The others won't since they don't quite understand the problem.
There is the issue of number of games. I'm not sure I understand why you need fewer games in self-play. When you play A vs B, you STILL need the same number of games played to reduce the error bar, and the games against the old version still waste 1/2 of the CPU time since the old program is irrelevant. It takes me 30K games to get to +/-4 elo, regardless of whether those 30K games are against 1 opponent, or several. Unless I have overlooked something...
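As a rough cross-check of the "30K games for +/- 4 Elo" figure (an editorial sketch; the draw rates, the two-standard-deviation interval, and the balanced-match assumption are all assumptions, and BayesElo's own model will give somewhat different numbers), the games needed for a target error bar follow directly from the per-game score variance:

Code: Select all

import math

def games_needed(target_elo, draw_rate, sigmas=2.0):
    """Games needed so that `sigmas` standard deviations of the measured
    Elo, for a roughly balanced match, fit inside +/- target_elo."""
    p_win = (1.0 - draw_rate) / 2.0            # balanced match assumed
    var = p_win + 0.25 * draw_rate - 0.25      # per-game score variance
    slope = 400.0 / (math.log(10) * 0.25)      # Elo per unit score at a 50% score
    se_needed = target_elo / (sigmas * slope)
    return math.ceil(var / se_needed ** 2)

for dr in (0.3, 0.5, 0.7):
    print(f"draw rate {dr:.0%}: ~{games_needed(4.0, dr):,} games for +/- 4 Elo")

With these assumptions the answer lands in the 9,000-21,000 game range depending on the draw rate; the exact figure from a rating tool differs, but the order of magnitude agrees with the tens of thousands of games quoted above.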
-
Sven
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: obvious/easy move - final results
To replace "synthetic" with "real" BayesElo output, here is what I get for the same results quoted above.hgm wrote:Wrong!
1) We don't get the same thing. I say that you need 4000 games (2000 A1-B ad 2000 A2-B) to get the same accuracy for the difference between the A versions as with 1000 A1-A2 games. You say you need only 2000.
2) The ratings of A1 and A2 calculated by the rating program are not independent, but 100% anti-correlating. So the error in their difference does directly add, and not throug the sqrt stuff. You say the quoted data is synthetic, and I really wonder if BaysElo would report itt like that. I think it should report 9 Elo error bars in the A1-A2 case, not 18. So that the error bar in the difference will be 18.
In any case, using the error bars reported by rating programs is tricky with few players, as the condition that the average rating is zero creates a correlation between the ratings. So the simple sqrt addition no longer applies, and you have to take acount of the covariance as well as the variances.
Code: Select all

Rank Name   Elo    +    - games score oppo. draws
   1 A1      47    8    8  1000   76%   -47   36%
   2 A2     -47    8    8  1000   24%    47   36%

Code: Select all

Rank Name   Elo    +    - games score oppo. draws
   1 A1      99   12   12  1000   64%    -1   36%
   2 B       -1    8    9  2000   50%     0   36%
   3 A2     -99   12   12  1000   36%    -1   36%

Code: Select all

Rank Name   Elo    +    - games score oppo. draws
   1 A1     100    8    9  2000   64%    -1   36%
   2 B       -1    6    6  4000   50%     1   36%
   3 A2     -99    9    9  2000   36%    -1   36%

Sven
-
Michel
- Posts: 2292
- Joined: Mon Sep 29, 2008 1:50 am
Re: obvious/easy move - final results
In principle you need fewer games to prove that the new version is stronger than the old version when using self testing.

It takes me 30K games to get to +/-4 elo, regardless of whether those 30K games are against 1 opponent, or several. Unless I have overlooked something...
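A minimal sketch of that claim (editorial, with assumed numbers): to resolve a given true edge with the same confidence, a direct self-play match needs fewer total games than a gauntlet through a common reference, because the gauntlet's estimate of the difference carries the combined error of two independent pairings. The 40% draw rate, the +5 Elo edge, and the 1.96-sigma threshold below are assumptions, and none of this says anything about whether a self-play gain transfers to other opponents.

Code: Select all

import math

def games_to_detect(edge_elo, draw_rate=0.4, z=1.96, indirect=False):
    """Rough number of games per pairing needed to resolve a true edge of
    `edge_elo` at z standard deviations, for roughly balanced matches.
    With indirect=True the pairing is 'candidate vs reference', and the
    variance of the estimated difference doubles."""
    p_win = (1.0 - draw_rate) / 2.0
    var = p_win + 0.25 * draw_rate - 0.25      # per-game score variance
    slope = 400.0 / (math.log(10) * 0.25)      # Elo per unit score at a 50% score
    var_elo = var * slope ** 2                 # per-game variance in Elo terms
    if indirect:
        var_elo *= 2.0                         # two independent pairings
    return math.ceil(var_elo * (z / edge_elo) ** 2)

print("self-play match, total games:       ", games_to_detect(5.0))
print("gauntlet via reference, total games:", 2 * games_to_detect(5.0, indirect=True))

Under these assumptions the gauntlet needs roughly four times as many total games as the direct match for the same resolving power, which is one way to read "fewer games in self testing".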