Milton wrote:
By lkaufman, Date 2008-07-08 09:51: Since yesterday I've been testing a version of Rybka that is very close to Rybka 3, with the improved scaling and all my latest eval terms added. I'm running it against 2.3.2a mp. It appears that on a direct-match basis we will reach the goal of a 100 Elo gain, at least on quads. As of now, after 900 games total, the lead is 110 Elo (105 Elo on quads, 120 on my octal). This is with both programs using the same short generic book, each taking White once in every opening. To achieve this result Rybka 3 has to win about 4 games for each win by 2.3.2a on the quads, and about 5 for 1 on the octal, due to draws. How this will translate to gains on the rating lists remains to be seen.
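To check Larry's arithmetic, here is a minimal sketch using the standard logistic Elo formula. The 49% draw rate below is my own inference from the quoted 4-to-1 win ratio, not a number reported in the post.

import math

def elo_from_score(score: float) -> float:
    """Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_from_wdl(wins: float, draws: float, losses: float) -> float:
    """Match score: a win counts 1, a draw 0.5, a loss 0."""
    return (wins + 0.5 * draws) / (wins + draws + losses)

# A 110 Elo lead corresponds to an expected score of about 65.3%:
target = 110.0
score = 1.0 / (1.0 + 10.0 ** (-target / 400.0))
print(f"Expected score at +{target:.0f} Elo: {score:.3f}")

# With roughly half the games drawn (assumed), reaching that score
# indeed requires about 4 decisive wins per decisive loss:
draws = 0.49                 # assumed draw fraction, not from the post
wins = score - 0.5 * draws   # decisive wins as a fraction of all games
losses = 1.0 - draws - wins
print(f"Win:loss ratio at {draws:.0%} draws: {wins / losses:.1f} : 1")
print(f"Implied Elo: {elo_from_score(score_from_wdl(wins, draws, losses)):.0f}")

On those assumptions the implied ratio comes out at almost exactly 4 to 1, matching the quoted figure.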
bob wrote:
Personally I think this is a _terrible_ way of estimating Elo gain. I quit doing this years ago because it horribly inflates the Elo, for a simple reason: when you add some new piece of knowledge that might be helpful here and there, and that is the _only_ difference between the two engines, then any rating change is a direct result of that change plus the normal randomness that games between equal opponents produce. Since the two programs are identical except for the new piece of knowledge, the one that has it will occasionally use it to win a game.
But in real games between _different_ opponents, that new piece of knowledge might produce absolutely no improvement at all, or one so small that it takes thousands of games to measure. Once you think about it for a few minutes, you see why this is pretty meaningless. The fact that it produces _any_ improvement is certainly significant, but the fact that it produces a 100 Elo improvement is worthless...
I could probably find some test results to show this: at times we add an old version of Crafty to our gauntlet for testing, and new changes tend to exaggerate the score against it compared to the scores against the other programs in the mix.
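A rough back-of-the-envelope sketch of the "thousands of games" point, assuming independent games, a 35% draw rate, a 95% confidence level, and the usual normal approximation (all my assumptions, not figures from the thread):

import math

def games_needed(elo_edge: float, draw_rate: float = 0.35, z: float = 1.96) -> int:
    """Games needed before the error bar on the measured score is
    smaller than the score edge implied by elo_edge."""
    # Near equality, d(score)/d(Elo) = ln(10)/1600 per Elo point.
    score_edge = elo_edge * math.log(10.0) / 1600.0
    # Per-game score variance; draws at 0.5 shrink it to (1 - d)/4.
    variance = (1.0 - draw_rate) / 4.0
    return math.ceil(variance * (z / score_edge) ** 2)

for edge in (5, 10, 20, 50):
    print(f"{edge:>3} Elo edge: ~{games_needed(edge):,} games")

On these assumptions a 5 Elo change needs on the order of ten thousand games to resolve, which is the scale bob is pointing at.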
Uri Blass wrote:
Your assumption ("it horribly inflates the Elo") seems not to be correct here.
Larry explained that the new knowledge also made Rybka slower, so it was outsearched by the older Rybka.

He claims that this is the reason the improvement was smaller in Rybka-vs-Rybka games (relative to Rybka against other opponents).

Tests against other opponents posted in the Rybka forum suggest a slightly bigger improvement than the Rybka-vs-Rybka games show.
Uri
bob wrote:
I don't get your point. _Any_ knowledge added to a program will slow it down. So this test answers the question "is the new knowledge worth the loss in speed?" pretty effectively. No dispute there. But it will _not_ answer the question "how much _better_ is the new version than the old version?" with any degree of accuracy. The new knowledge will produce an exaggerated result against the old program, unless the speed loss offsets the new knowledge, or worse.

Anything is possible in computer chess. And on occasion N vs. N+1 testing might well produce more accurate answers than N+1 vs. the world. But not generally, which was the point I tried to make.
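To put a number on the "normal randomness" bob mentions, here is a small Monte Carlo sketch: two truly identical engines, 900-game matches, and the spread of the Elo figure you would measure anyway. The 49% draw rate and the trial count are my choices; the point is only the size of the noise floor, not anyone's actual test results.

import math
import random

def simulate_match(games: int, draw_rate: float) -> float:
    """Measured Elo for one match between perfectly equal engines."""
    score = 0.0
    for _ in range(games):
        r = random.random()
        if r < draw_rate:
            score += 0.5                          # draw
        elif r < draw_rate + (1.0 - draw_rate) / 2.0:
            score += 1.0                          # win
    s = score / games
    return -400.0 * math.log10(1.0 / s - 1.0)

random.seed(1)
results = sorted(simulate_match(900, 0.49) for _ in range(1000))
print(f"Middle 95% of measured Elo: {results[25]:+.1f} to {results[975]:+.1f}")

On these assumptions the middle 95% comes out around plus or minus 16 Elo for a 900-game match, so a 110 Elo lead is well outside the noise; bob's objection is about what the number means against _other_ opponents, not about its statistical significance.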

