Don wrote:
bob wrote:
Don wrote:
AlvaroBegue wrote:
bob wrote:
But again, I did not see anyone explain why the following is supposedly false:
I want to determine how much better version Y is than version X. I can play version Y against X for 30K games to get an error bar of +/-4. I can play version Y against ANYTHING for 30K games to get an error bar of +/-4. How can playing against version X require fewer games to get the same accuracy?
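(As a sanity check on the +/-4 figure for 30K games, here is a rough back-of-the-envelope sketch in Python. The draw rates and the 95% confidence level are my own assumptions, not numbers anyone stated above; the point is only that 30K games lands you in the +/-3 to +/-4 ELO range.)

[code]
import math

def elo_error_bar(games, draw_rate=0.0, confidence_z=1.96):
    """Approximate 95% error bar (in ELO) on a match result.

    Assumes the two opponents are close in strength (score near 50%),
    so the ELO curve can be linearized around 0.5.
    """
    # Per-game standard deviation of the score (win=1, draw=0.5, loss=0):
    # with the mean score near 0.5, Var = 0.25 * (1 - draw_rate).
    per_game_sd = math.sqrt(0.25 * (1.0 - draw_rate))
    score_se = per_game_sd / math.sqrt(games)      # standard error of the mean score
    elo_per_score = 400.0 / math.log(10) / 0.25    # d(ELO)/d(score) at score = 0.5
    return confidence_z * score_se * elo_per_score

print(round(elo_error_bar(30000, draw_rate=0.0), 1))  # ~3.9, i.e. the +/-4 quoted above
print(round(elo_error_bar(30000, draw_rate=0.3), 1))  # ~3.3, draws shrink the bar a little
[/code]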
I'll give it a try. We are not particularly interested in measuring the ELO score of our program: we are interested in measuring whether a particular modification is an improvement. If we match the proposed version against the previous version, we get a +/-4 ELO estimate of the difference after 30K games. If we match the new version against ANYTHING else, we know the ELO of the new version within +/-4, but we then need the difference between that value and the ELO of the previous version, which also has to be measured. That takes a factor of 2 more games, and when you subtract two numbers that each carry a +/-4 error bar, you get a larger error bar of +/-4*sqrt(2). If you want to recover the +/-4 error bar on the difference, you have to double the number of games again. That's where the factor of 4 comes from.
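(To make that bookkeeping concrete, here is a small self-contained sketch. The only assumption is the standard one already being used above: the error bar of a single measurement scales as 1/sqrt(games), and independent errors add in quadrature when you subtract.)

[code]
import math

# Pick units so that 30K games gives a +/-4 error bar, as quoted above.
def error_bar(games, k=4.0 * math.sqrt(30000)):
    return k / math.sqrt(games)

# Head-to-head: 30K games measure the DIFFERENCE directly.
direct = error_bar(30000)
print(direct)                                   # 4.0

# Gauntlet route: measure old and new separately (30K games each), then subtract.
# Independent errors add in quadrature.
indirect = math.hypot(error_bar(30000), error_bar(30000))
print(indirect)                                 # 4*sqrt(2) ~ 5.66, despite 60K total games

# To push the difference back down to +/-4 you need 60K games per side,
# i.e. 120K games total: the factor of 4.
recovered = math.hypot(error_bar(60000), error_bar(60000))
print(recovered)                                # back to ~4.0
[/code]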
Someone (Don?) said that you can get away with "just" a factor of 2 because presumably you have already run the ELO measurement of the previous version. However, this has a bit of a problem: if the testing of the previous version was lucky and produced an ELO estimate inflated by 12 points (a 3-sigma event, which does happen sometimes), then every subsequent change will tend to look bad, simply because the later tests are unlikely to be that lucky.
It was me who said it. However, my statement came with some qualifications, and I agree the right number is 4x. We have actually run across the problem you describe, at least in the past when we primarily ran gauntlets against foreign programs. We would run our 50,000 games and one version would get "lucky" by a few ELO. Even then we knew this could be a problem, because we would then find it almost impossible to "beat" that version.
Self-testing solves this problem by simply re-testing the same reference version over and over again. There is a tradeoff here, though: if you compare the two methods directly, you cannot claim the full 4x when one method keeps re-testing the same version and the other does not.
However, I think 4x is the number you have to go with, because these other considerations are logistical and irrelevant from a cost-analysis point of view. You have convinced me that the self-testing method (even compared to a rotating gauntlet used to save computation) is more than 2x superior. First of all, if the new version tests better you never have to run the old version again, which is a savings comparable to the gauntlet's 2x saving I mentioned. But you also get something for doing this even when the candidate doesn't test better: some statistical protection against a temporary false read.
Now I am lost once again. Why would I EVER "run the old version again"? I never do this. Crafty is a continuous series of versions, where I run one against the gauntlet and it is then retired forever. If it was better than the previous version, it becomes the new best, but it is never tested again. If it is worse, it is discarded unless something about the change deserves further study...
I think you just misunderstood what I said. You have a "reference" version which is your current best. With self-testing you test a candidate against it. If the candidate is better, you retire the reference version forever and the candidate becomes the new reference version.
However, if the candidate does not prove to be better, you don't retire the reference version. Until a candidate proves to be better, you must always play the reference version against each candidate. That is why I said the 4X is really 2X: you have to amortize the total work. With your method you are doing twice as much work on average.
This I don't buy very much.
(1) Suppose your test version is better immediately: you do NOT do any extra testing on that old version. (2) Suppose your test version is worse: the games against that test version are meaningless, since you played against a flawed opponent, and you would again test against a new version.
I think testing has to be methodical and repeatable, without depending on anything other than the quality of the new version. I still don't like the idea of A vs A', since A' is usually just A + a small change. That "small change" could help or hurt against YOUR program, and it could do the opposite against OTHER programs. That's what keeps me away from much of this kind of testing. Once I observed that effect a few times, I decided it was too risky. My intent for testing is to make progress in the general case, as best I can, and deal with whatever the cost might be.
Let's say we both have a good version and we produce 10 versions that are no good, but we are compelled to test them. How much extra work?
Since your reference version has already been tested, we will give you that for free. Let's say you play 40,000 games; to get the same error I have to play 20,000 games. So I end up playing 20,000 games 10 times, and you have to play 40,000 games 10 times. I play half as many games despite the fact that I am testing the same reference version 10 times from scratch!
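(Spelling out that arithmetic: with 10 failed candidates in a row, the totals work out as below. The 40,000-vs-20,000 split is the factor of 2 discussed earlier in the thread, once the gauntlet's reference measurement is granted for free.)

[code]
failed_candidates = 10

games_per_candidate_self_test = 20_000   # reference replayed from scratch each time
games_per_candidate_gauntlet  = 40_000   # reference measurement reused "for free"

print(failed_candidates * games_per_candidate_self_test)  # 200000 games
print(failed_candidates * games_per_candidate_gauntlet)   # 400000 games: twice as many for the same error bars
[/code]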
Why do you need 1/2 the games? We are back to that statistical anomaly again, where you already get a larger ERROR than I do because self-testing demonstrably inflates ratings. I'm not saying this is no good; I am simply saying it seems to be based solely on accepting additional error to reduce the number of games required.
The scientific test would be to take 10 old versions of a program, test them as you do, then test them as I do, and see if BOTH tests lead to the same conclusion. I can't precisely identify how many self-tests it takes before you find a change that looks good against your old version but looks bad against a gauntlet. But it definitely happens, and it happens more than "near zero times" based on past testing. That leaves me a little suspicious. I almost lost the 1986 WCCC event because of using self-testing to validate changes, back when a variety of opponents was really not available. Self-testing identified a problem that a fix cured, and the cure was far worse than the problem against OTHER programs... And I am talking about a 4-line change to Cray Blitz, not a big change. It looked much better in the testing Bert and I did, but it fell flat in 1985, and it took me until the first round of the 1986 WCCC event (yet another loss) to figure out what was broken...
As has been pointed out, my test is STILL more robust, because in your test the reference version may have been unusually fortunate and end up over-rated or under-rated. That will have a long-term effect on all the tests that come after it, until you get a new reference version. In my test such an event can happen too, but it only affects a single test, not all the ones that come after it.
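(That correlation point is easy to see in a toy Monte Carlo. Everything below is my own illustration, not anything from the thread: the noise level, the trial count, and the choice to give both schemes the same per-test error size are assumptions picked so that only the correlation structure differs. Needs Python 3.10+ for statistics.correlation.)

[code]
import random, statistics

random.seed(1)
SIGMA = 2.0        # assumed 1-sigma ELO noise of a single measurement
TRIALS = 50_000    # simulated "testing campaigns", each testing two 0-ELO candidates

shared_a, shared_b = [], []   # one baseline measurement reused for both candidates
fresh_a, fresh_b = [], []     # a fresh reference match played for each candidate

for _ in range(TRIALS):
    baseline = random.gauss(0, SIGMA)
    # Reused baseline: its error enters BOTH estimated differences.
    shared_a.append(random.gauss(0, SIGMA) - baseline)
    shared_b.append(random.gauss(0, SIGMA) - baseline)
    # Fresh head-to-head: an independent error each time (scaled so the
    # per-test error bar matches the shared-baseline case).
    fresh_a.append(random.gauss(0, SIGMA * 2 ** 0.5))
    fresh_b.append(random.gauss(0, SIGMA * 2 ** 0.5))

print(statistics.correlation(shared_a, shared_b))  # ~0.5: a lucky/unlucky baseline skews BOTH verdicts
print(statistics.correlation(fresh_a, fresh_b))    # ~0.0: each test's fluke stays in that test
[/code]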