Don wrote:
bob wrote:
Don wrote:
AlvaroBegue wrote:
bob wrote:
But again, I did not see anyone explain why the following is supposedly false:
I want to determine how much better version Y is than version X. I can play version Y against X for 30K games to get an error bar of +/-4. I can play version Y against ANYTHING for 30K games to get an error bar of +/-4. How can playing against version X require fewer games to get the same accuracy?
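(As a sanity check on the +/-4 figure for 30K games, here is a rough back-of-the-envelope sketch in Python. The draw rates and the 95% confidence level are my own assumptions, not numbers anyone stated above; the point is only that 30K games lands you in the +/-3 to +/-4 ELO range.)

[code]
import math

def elo_error_bar(games, draw_rate=0.0, confidence_z=1.96):
    """Approximate 95% error bar (in ELO) on a match result.

    Assumes the two opponents are close in strength (score near 50%),
    so the ELO curve can be linearized around 0.5.
    """
    # Per-game standard deviation of the score (win=1, draw=0.5, loss=0):
    # with the mean score near 0.5, Var = 0.25 * (1 - draw_rate).
    per_game_sd = math.sqrt(0.25 * (1.0 - draw_rate))
    score_se = per_game_sd / math.sqrt(games)      # standard error of the mean score
    elo_per_score = 400.0 / math.log(10) / 0.25    # d(ELO)/d(score) at score = 0.5
    return confidence_z * score_se * elo_per_score

print(round(elo_error_bar(30000, draw_rate=0.0), 1))  # ~3.9, i.e. the +/-4 quoted above
print(round(elo_error_bar(30000, draw_rate=0.3), 1))  # ~3.3, draws shrink the bar a little
[/code]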
I'll give it a try. We are not particularly interested in measuring the ELO score of our program: we are interested in measuring whether a particular modification is an improvement. If we match the proposed version against the previous version, we get a +/-4 ELO estimate of the difference after 30K games. If we match the new version against ANYTHING else, we know the ELO of the new version within +/-4, but we then need the difference between that value and the ELO of the previous version, which also has to be measured. That takes a factor of 2 more games, and when you subtract two numbers that each carry a +/-4 error bar, you get a larger error bar of +/-4*sqrt(2). If you want to recover the +/-4 error bar on the difference, you have to double the number of games again. That's where the factor of 4 comes from.
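(To make that bookkeeping concrete, here is a small self-contained sketch. The only assumption is the standard one already being used above: the error bar of a single measurement scales as 1/sqrt(games), and independent errors add in quadrature when you subtract.)

[code]
import math

# Pick units so that 30K games gives a +/-4 error bar, as quoted above.
def error_bar(games, k=4.0 * math.sqrt(30000)):
    return k / math.sqrt(games)

# Head-to-head: 30K games measure the DIFFERENCE directly.
direct = error_bar(30000)
print(direct)                                   # 4.0

# Gauntlet route: measure old and new separately (30K games each), then subtract.
# Independent errors add in quadrature.
indirect = math.hypot(error_bar(30000), error_bar(30000))
print(indirect)                                 # 4*sqrt(2) ~ 5.66, despite 60K total games

# To push the difference back down to +/-4 you need 60K games per side,
# i.e. 120K games total: the factor of 4.
recovered = math.hypot(error_bar(60000), error_bar(60000))
print(recovered)                                # back to ~4.0
[/code]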
Someone (Don?) said that you can get away with "just" a factor of 2 because presumably you have already run the ELO measurement of the previous version. However, this has a bit of a problem: if the testing of the previous version was lucky and produced an ELO estimate inflated by 12 points (a 3-sigma event, which does happen sometimes), then every subsequent change will tend to look bad, simply because the later tests are unlikely to be that lucky.
It was me who said it. However, my statement came with some qualifications, and I agree the right number is 4x. We have actually run across the problem you describe, at least in the past when we primarily ran gauntlets against foreign programs. We would run our 50,000 games and one version would get "lucky" by a few ELO. Even then we knew this could be a problem, because we would then find it almost impossible to "beat" that version.
Self-testing solves this problem by simply re-testing the same reference version over and over again. There is a tradeoff here, though: if you compare the two methods directly, you cannot claim the full 4x when one method keeps re-testing the same version and the other does not.
However, I think 4x is the number you have to go with, because these other considerations are logistical and irrelevant from a cost-analysis point of view. You have convinced me that the self-testing method (even compared to a rotating gauntlet used to save computation) is more than 2x superior. First of all, if the new version tests better you never have to run the old version again, which is a savings comparable to the gauntlet's 2x saving I mentioned. But you also get something for doing this even when the candidate doesn't test better: some statistical protection against a temporary false read.
Now I am lost once again. Why would I EVER "run the old version again"? I never do this. Crafty is a continuous series of versions, where I run one against the gauntlet and it is then retired forever. If it was better than the previous version, it becomes the new best, but it is never tested again. If it is worse, it is discarded unless something about the change deserves further study...
I think you just misunderstood what I said. You have a "reference" version which is your current best. With self-testing you test a candidate against it. If the candidate is better, you retire the reference version forever and the candidate becomes the new reference version.
However, if the candidate does not prove to be better, you don't retire the reference version. Until a candidate proves to be better, you must always play the reference version against each candidate. That is why I said the 4X is really 2X: you have to amortize the total work. With your method you are doing twice as much work on average.
This I don't buy very much.
(1) Suppose your test version is better immediately: you do NOT do any extra testing on that old version. (2) Suppose your test version is worse: the games against that test version are meaningless, since you played against a flawed opponent, and you would again test against a new version.
I think testing has to be methodical and repeatable, without depending on anything other than the quality of the new version. I still don't like the idea of A vs A', since A' is usually just A + a small change. That "small change" could help or hurt against YOUR program, and it could do the opposite against OTHER programs. That's what keeps me away from much of this kind of testing. Once I observed that effect a few times, I decided it was too risky. My intent for testing is to make progress in the general case, as best I can, and deal with whatever the cost might be.
Let's say we both have a good version and we produce 10 versions that are no good, but we are compelled to test them. How much extra work?
Since your reference version has already been tested, we will give you that for free. Let's say you play 40,000 games; to get the same error I have to play 20,000 games. So I end up playing 20,000 games 10 times, and you have to play 40,000 games 10 times. I play half as many games despite the fact that I am testing the same reference version 10 times from scratch!
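(Spelling out that arithmetic: with 10 failed candidates in a row, the totals work out as below. The 40,000-vs-20,000 split is the factor of 2 discussed earlier in the thread, once the gauntlet's reference measurement is granted for free.)

[code]
failed_candidates = 10

games_per_candidate_self_test = 20_000   # reference replayed from scratch each time
games_per_candidate_gauntlet  = 40_000   # reference measurement reused "for free"

print(failed_candidates * games_per_candidate_self_test)  # 200000 games
print(failed_candidates * games_per_candidate_gauntlet)   # 400000 games: twice as many for the same error bars
[/code]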
Why do you need 1/2 the games? We are back to that statistical anomaly again, where you already get a larger ERROR than I do because self-testing demonstrably inflates ratings. I'm not saying this is no good; I am simply saying it seems to be based solely on accepting additional error to reduce the number of games required.
The scientific test would be to take 10 old versions of a program, test them as you do, then test them as I do, and see if BOTH tests lead to the same conclusion. I can't precisely identify how many self-tests it takes before you find a change that looks good against your old version but looks bad against a gauntlet. But it definitely happens, and it happens more than "near zero times" based on past testing. That leaves me a little suspicious. I almost lost the 1986 WCCC event because of using self-testing to validate changes, back when a variety of opponents was really not available. Self-testing identified a problem that a fix cured, and the cure was far worse than the problem against OTHER programs... And I am talking about a 4-line change to Cray Blitz, not a big change. It looked much better in the testing Bert and I did, but it fell flat in 1985, and it took me until the first round of the 1986 WCCC event (yet another loss) to figure out what was broken...
As has been pointed out, my test is STILL more robust, because in your test the reference version may have been unusually fortunate and end up over-rated or under-rated. That will have a long-term effect on all the tests that come after it, until you get a new reference version. In my test such an event can happen too, but it only affects a single test, not all the ones that come after it.
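(That correlation point is easy to see in a toy Monte Carlo. Everything below is my own illustration, not anything from the thread: the noise level, the trial count, and the choice to give both schemes the same per-test error size are assumptions picked so that only the correlation structure differs. Needs Python 3.10+ for statistics.correlation.)

[code]
import random, statistics

random.seed(1)
SIGMA = 2.0        # assumed 1-sigma ELO noise of a single measurement
TRIALS = 50_000    # simulated "testing campaigns", each testing two 0-ELO candidates

shared_a, shared_b = [], []   # one baseline measurement reused for both candidates
fresh_a, fresh_b = [], []     # a fresh reference match played for each candidate

for _ in range(TRIALS):
    baseline = random.gauss(0, SIGMA)
    # Reused baseline: its error enters BOTH estimated differences.
    shared_a.append(random.gauss(0, SIGMA) - baseline)
    shared_b.append(random.gauss(0, SIGMA) - baseline)
    # Fresh head-to-head: an independent error each time (scaled so the
    # per-test error bar matches the shared-baseline case).
    fresh_a.append(random.gauss(0, SIGMA * 2 ** 0.5))
    fresh_b.append(random.gauss(0, SIGMA * 2 ** 0.5))

print(statistics.correlation(shared_a, shared_b))  # ~0.5: a lucky/unlucky baseline skews BOTH verdicts
print(statistics.correlation(fresh_a, fresh_b))    # ~0.0: each test's fluke stays in that test
[/code]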