Milton wrote:
By lkaufman, Date 2008-07-08 09:51:
Since yesterday I've been testing a version of Rybka that is very close to Rybka 3, with the improved scaling and all my latest eval terms added. I'm running it against 2.3.2a mp. It appears that on a direct-match basis we will reach the goal of a 100 Elo gain, at least on quads. As of now, after 900 games total, the lead is 110 Elo (105 Elo on quads, 120 on my octal). This is with both programs using the same short generic book, each taking White once in every opening. To achieve this result Rybka 3 has to win about 4 games for each win by 2.3.2a on the quads, and about 5 for 1 on the octal, due to draws. How this will translate to gains on the rating lists remains to be seen.
Personally I think this is a _terrible_ way of estimating Elo gain. I quit doing this years ago because it horribly inflates the Elo, for a simple reason...
When you add some new piece of knowledge that might be helpful here and there, and that is the _only_ difference between the two engines, then any rating change is a direct result of that change plus the normal randomness that games between equal opponents produce. Since the two programs are identical except for the new piece of knowledge, the one with the new piece will occasionally use it to win a game.
But in real games between _different_ opponents, that new piece of knowledge might produce absolutely no improvement at all, or one so small that it takes thousands of games to measure. Once you think about it for a few minutes, you see why this is pretty meaningless. The fact that it produces _any_ improvement is certainly significant, but the fact that it produces a 100 Elo improvement is worthless...
I could probably find some test results to show this, as at times we add an old version of Crafty to our gauntlet for testing, and new changes tend to exaggerate the score against it compared to the scores against the other programs in the mix.
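The "thousands of games" point can be made quantitative. A minimal sketch under a normal approximation (the function name and the decision to ignore draws are mine; ignoring draws only overstates the per-game variance, so the estimate is conservative): the games needed to distinguish a given Elo edge from zero at roughly 95% confidence.

```python
import math

def games_needed(elo_edge, z=1.96):
    """Approximate number of games needed to detect an Elo edge
    over an otherwise-equal opponent at ~95% confidence.
    Normal approximation; draws are ignored, which makes the
    per-game variance (and thus the estimate) conservative."""
    s = 1 / (1 + 10 ** (-elo_edge / 400))   # expected score fraction
    sigma = math.sqrt(s * (1 - s))          # per-game standard deviation
    return math.ceil((z * sigma / (s - 0.5)) ** 2)

print(games_needed(100))  # a large edge shows up within a few dozen games
print(games_needed(10))   # a small edge needs thousands of games
```

This matches both sides of the argument: a genuine 100 Elo jump is visible quickly, while the few-Elo gains from a single knowledge term really do take thousands of games to measure.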
Your assumption ("it horribly inflates the Elo") seems not to be correct here.
Larry explained that the new knowledge also made Rybka slower, so it was outsearched by the older Rybka.
He claims that for this reason the improvement was smaller in Rybka-vs-Rybka games (relative to Rybka against other opponents).
Tests against other opponents in the Rybka forum suggest a slightly bigger improvement relative to the Rybka-vs-Rybka games.
Uri
I don't get your point. _Any_ knowledge added to a program will slow it down. So this test answers the question "is the new knowledge worth the loss in speed?" in a pretty effective way. No dispute there. But it will _not_ answer the question "how much _better_ is the new version than the old version?" with any degree of accuracy. The new knowledge will get an exaggerated result against the old program, unless the speed loss offsets the new knowledge, or worse...
Anything is possible in computer chess. And on occasion the N vs N+1 testing might well produce more accurate answers than N+1 vs the world. But not generally, which was the point I tried to make.
Jeroen came up with some new numbers against opponents other than Rybka:
I decided to run some matches against competing programs to verify whether the 100 Elo gain I've measured in direct play will show up on the rating lists, which test against unrelated programs.
Based solely on 1'+1" quad games against Deep Fritz 10 and Deep Shredder 11 (and bearing in mind Rybka 3 is still not quite finalized), it appears that the answer is "yes".
Results so far: against Deep Fritz 10, +128=31-10 for +300 Elo;
against Deep Shredder 11 +98=34-8 for +265 Elo.
Based on CCRL blitz ratings (CEGT doesn't have a blitz rating for Deep Fritz 10 quad) that works out to a performance rating of 3242 against Fritz and 3296 (!) against Shredder, or about 3267 vs. 3132 for Rybka 2.3.2a quad, a gain of 135 Elo. It's only two opponents and the quick time control probably favors the stronger program slightly, but it looks pretty clear that the rating lists will confirm the 100+ rating gains (at least in blitz) for Rybka 3 quad as measured in direct play.
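The quoted Elo figures can be reproduced from the raw W/D/L scores with the standard logistic Elo formula. A quick check (the function name is mine):

```python
import math

def match_elo(wins, draws, losses):
    """Elo difference implied by a W/D/L match score,
    using the standard logistic model."""
    score = (wins + draws / 2) / (wins + draws + losses)
    return 400 * math.log10(score / (1 - score))

print(round(match_elo(128, 31, 10)))  # vs Deep Fritz 10    -> 300
print(round(match_elo(98, 34, 8)))    # vs Deep Shredder 11 -> 265
```

Both results agree with the +300 and +265 Elo figures reported above; adding each Elo difference to the opponent's list rating gives the quoted performance ratings.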
bob wrote:
When you add some new piece of knowledge that might be helpful here and there, and that is the _only_ difference between the two engines, then any rating change is a direct result of that change plus the normal randomness that games between equal opponents produce. Since the two programs are identical except for the new piece of knowledge, the one with the new piece will occasionally use it to win a game.
But in real games between _different_ opponents, that new piece of knowledge might produce absolutely no improvement at all, or one so small that it takes thousands of games to measure. Once you think about it for a few minutes, you see why this is pretty meaningless. The fact that it produces _any_ improvement is certainly significant, but the fact that it produces a 100 Elo improvement is worthless...
I could probably find some test results to show this, as at times we add an old version of Crafty to our gauntlet for testing, and new changes tend to exaggerate the score against it compared to the scores against the other programs in the mix.
I had the exact same objection some time ago about that.
A change may do better in a Rybka_b1 versus Rybka_b2 contest, but on rating lists, where Rybka_b1 and Rybka_b2 face different opponents, things might be different.
The answer I got was that this isn't so illogical, since Rybka's game is quite balanced in all aspects of chess, so an improvement against its previous version will be an improvement against all programs.
I don't quite accept this idea, but so far, judging from their testing method (if something works, Elo-wise, in Rybka_b1 vs. Rybka_b2, then keep it), it really works.
After his son's birth, they asked him:
"Is it a boy or a girl?"
"YES!" he replied.....
Yes, knowledge and speed are a trade-off..... However, with the computers of today (or tomorrow), is this going to be relevant for long? Yes, programming the code for knowledge (of closed positions) is daunting..... however, someone is going to plunge into the abyss sooner or later...... No? Then again, as with Shredder's triple-brain concept, getting the positional engine's eval to be relevant, and comparing its analysis to that of the engine that plays open positions and perhaps primarily tactics, is frightening to attempt, I am sure, with regard to which engine's eval the program should trust. DEEP SIGH........
Nimzovik wrote:Yes, knowledge and speed are a trade-off..... However, with the computers of today (or tomorrow), is this going to be relevant for long? Yes, programming the code for knowledge (of closed positions) is daunting..... however, someone is going to plunge into the abyss sooner or later...... No? Then again, as with Shredder's triple-brain concept, getting the positional engine's eval to be relevant, and comparing its analysis to that of the engine that plays open positions and perhaps primarily tactics, is frightening to attempt, I am sure, with regard to which engine's eval the program should trust. DEEP SIGH........
One day, perhaps. But I doubt in my lifetime. That is _still_ a long way away from reality.
Note that these are Rybka 3 beta versions, in the first 2 matches the Leiden version of Rybka played (it won the ICT Leiden end of May).
I think it would be better to test Rybka 3 with a generic book, just like CEGT and CCRL do, but... it is very impressive to see Rybka 2.3.2a totally destroyed by Rybka 3.
Heh, which book do you expect the Rybka bookcooker to use, if not his own?
No, a generic opening book will harm Rybka's performance....
Note that the new Rybka will be packed with its own commercial opening book, so what's the point of testing with a generic one?
_No one can hit as hard as life. But it ain't about how hard you can hit. It's about how hard you can get hit and keep moving forward. How much you can take and keep moving forward...._