Ralph Stoesser wrote:
I think this is a good example where a test with many more games at an unreasonably fast time control is more reliable than a test with few games at a reasonable time control.

mcostalba wrote:
Ralph, no offence, but for me the definition of reliable is not: reliable = what I would like to see in the results.
I consider Joona's comment much more to the point: probably that single piece of code has no effect at all, in either direction.
P.S.: Regarding testing at unreasonably fast time controls, you have not proved that they work. To prove that they work, you need to verify the changes at longer time controls and check that the results are the same. So IMHO tests at unreasonable time controls are not validated; they can be useful to quickly filter out some bad patches, but not to validate candidate good ones.

bob wrote:
This +has+ been verified to work. I have played millions of very fast games, dealing primarily with evaluation changes or changes that make the program faster, and then verified that with longer and longer time controls the results remain consistent.
Search changes are a bit more difficult, but at least 80% of those changes have been verified with both fast and slow games...
All you have to avoid is time controls where a program loses too many games due to flag falling, rather than by getting beat.

Ralph Stoesser wrote:
If this is true, and I'm a firm believer, it should be rather pointless to outvote a test with 20000 games using a test with 1000 games, especially in this case where opposite-side castling must be possible, or must have happened, for eval differences to show up at all. I still assume 1000 games are probably not enough to reveal an effect, regardless of the time control used.
Also, it would be easy for Marco to verify my result. For a 20000-game test @1 sec we don't exactly need a cluster.
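The statistical intuition behind "1000 games is not enough" is easy to make concrete. A minimal sketch (the helper name `elo_error_bar` is mine, and it treats each game as an independent win/loss trial near a 50% score; draws shrink the variance, so the real error bar is somewhat narrower):

```python
import math

def elo_error_bar(n_games, score=0.5, z=1.96):
    """Approximate 95% confidence half-width, in Elo, for a match of
    n_games with the given mean score (fraction of points won).
    Models each game as an independent Bernoulli trial, which gives a
    conservative (wide) estimate since draws reduce the variance."""
    se_score = math.sqrt(score * (1.0 - score) / n_games)
    # Slope of the logistic Elo curve at this score:
    # elo(s) = -400*log10(1/s - 1), so d(elo)/ds = 400 / (ln(10)*s*(1-s))
    elo_per_point = 400.0 / (math.log(10) * score * (1.0 - score))
    return z * se_score * elo_per_point

print(round(elo_error_bar(1000)))   # → 22
print(round(elo_error_bar(20000)))  # → 5
```

So a 1000-game match can only resolve differences on the order of ±20 Elo, while 20000 games get the bar down to about ±5 Elo; a few-Elo eval tweak is invisible in the smaller sample no matter what time control was used.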
The only exceptions I have found to this process revolve around two themes:
(1) time allocation. Changes where you use more time, or alter the way time is allocated to each move, extending the time on fail-lows, or other considerations, etc. You have to check representative time controls to be sure that things work the same when you have more time vs less time.
(2) search changes where you might see an exponential problem pop up. For example, in 1994 we had some odd failures in Cray Blitz due to singular extensions. Our testing was on slow hardware, but we played on hardware that would peak at about 7M nps. The significant extra depth led to many more (than expected) singular extensions. So things that might well cause tree explosion at deeper depths (extensions, reductions, etc.) need to be verified at longer time controls. But in our testing, which is now beyond the 100M game level, most of these changes remain consistent.
Note that by consistent, I mean that A and A' (original and modified) show about the same "gap" (in terms of Elo) across all time controls. I have lots of examples where program A does worse against B (two different programs) at different time controls. But with the same program, just two versions, this has not been a problem. A and A' might both do worse against B at fast time controls than at slow ones, but if A' is better than A, it will be consistently better, so that measuring the Elo gap between them produces a near-constant number.
1000 games have such a high error bar that, unless the change is dramatic in nature, such a match will produce noise but no usable results. If you think a change is in the 2-3-4 Elo range, you need to produce between 40,000 and 100,000 games.
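That 40,000-100,000 figure can be sanity-checked by inverting the error-bar formula. A back-of-envelope sketch (the helper name `games_needed` is mine; it again assumes scores near 50% and no draws, so the true counts would be somewhat lower):

```python
import math

def games_needed(elo_diff, z=1.96):
    """Games needed so that the z-sigma error bar in Elo just equals
    the Elo difference being measured, for scores near 50%."""
    # Slope of the logistic Elo curve at a 50% score:
    # d(elo)/d(score) = 400 / (ln(10) * 0.5 * 0.5)
    elo_per_point = 400.0 / (math.log(10) * 0.25)
    se_target = elo_diff / (z * elo_per_point)          # required std error in score
    return math.ceil(0.25 / se_target ** 2)             # var per game / se^2

for d in (2, 3, 4):
    print(d, games_needed(d))
```

This gives roughly 116,000 games to resolve 2 Elo, 52,000 for 3 Elo, and 29,000 for 4 Elo, in line with the 40,000-100,000 range once the variance-reducing effect of draws is taken into account.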
By the way, the 1,000 slow games vs 20,000 fast games myth is worthy of "The Myth Busters" TV program. It sounds plausible, but is far from it in reality. Yet we will continue to see this over and over.