lkaufman wrote: Of all the factors that can influence test results, such as time limit, increment vs. repeating time controls, ponder, hardware, etc., the one we are currently most interested in is the effect of opening books/testsuites. Our own distributed tester uses a five-move book, rather shorter than that used by most testers. Since it shows a sixteen-Elo lead for Komodo 5 over Houdini 1.5 (after over 11k games) which is not shown by the testing agencies, and since the only result on this forum showing Komodo 5 beating Houdini 2 in a long match used a four-move book, we decided to make a new test book that is more typical of books normally used in tests - it averages six moves, but some popular lines are much longer than this. Based on hyper-fast testing, our performance drops by 12 Elo playing against Critter (the closest opponent at hyperspeed levels) after 6700 games. Assuming this also holds at the normal blitz levels used in the distributed test, it would appear to account for most of the discrepancy between our own test results and the others.
Has anyone else run long tests to compare the effect of different opening books on test results? The tests would have to be several thousand games long, but they can be at very fast time controls.
Probably we will modify our tester to use this or a similar new book, so that it will better predict future results. My conclusion is that Komodo is better than the other top programs at playing the early opening, but the longer the book line supplied, the less valuable this asset becomes. Perhaps switching to a more normal book for testing will gradually help Komodo as different features are tuned using this new book.
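For a sense of scale: under the standard logistic Elo model, gaps of the size quoted above correspond to only slightly lopsided expected scores. A minimal sketch (the formula is the usual one; the 16 and 12 Elo figures are the ones from the quoted post):

[code]
# Standard logistic Elo model: expected score for a given rating gap.
def expected_score(elo_diff):
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

for diff in (16, 12):
    print("%3d Elo -> %.3f expected score" % (diff, expected_score(diff)))

# Output:
#  16 Elo -> 0.523 expected score
#  12 Elo -> 0.517 expected score
[/code]

So the discrepancy being discussed amounts to well under one percentage point of match score, which is why thousands of games are needed to resolve it.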
I never considered the opening book to be much of a factor in test results (assuming colors are switched for each book position tested), but I am gradually becoming a believer.
My testing used a set of 18,000 positions that were all 4 moves deep. These positions were derived from the game databases of the CCRL, CEGT, SWCR, UEL, and my own games. Though I am certain that there are some unbalanced positions in this set, for the most part they are neither too unbalanced nor too drawish. The White score for my games has been just under 53%.
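For anyone wanting to build a similar set, here is a minimal sketch using the python-chess library. The input file name, the dedup-by-FEN step, and reading "4 moves deep" as an 8-ply cutoff are my assumptions, not a description of how the actual set was derived:

[code]
import chess.pgn  # pip install python-chess

def openings_from_pgn(path, depth_plies=8, limit=18000):
    """Collect unique FENs reached after depth_plies half-moves (4 full moves)."""
    seen = set()
    with open(path, encoding="utf-8", errors="ignore") as handle:
        while len(seen) < limit:
            game = chess.pgn.read_game(handle)
            if game is None:          # end of file
                break
            moves = list(game.mainline_moves())
            if len(moves) < depth_plies:
                continue              # game too short to reach the cutoff
            board = game.board()
            for move in moves[:depth_plies]:
                board.push(move)
            seen.add(board.fen())     # dedupe identical opening lines
    return seen

positions = openings_from_pgn("ccrl_games.pgn")   # hypothetical input file
print(len(positions), "unique 4-move positions")
[/code]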
I do not use reversed colors. Doing so automatically reduces the independence of the positions used, which increases the actual error of the measurements. I depend on randomness to keep White (or Black) bias low. I think that shows in the White score of my games, a sample that includes many more games than just those played by the Also-Ran engines.
I have evidence that using a large set of positions without reversed colors is much better than using a small set of positions with reversed colors. Not using reversed colors does introduce some variance, especially if the pool of opponents is wide. But in my experience it is more than offset by the large number of positions used, which cover more of the situations found in general play.
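To illustrate the trade-off, here is a rough Monte Carlo sketch. It assumes each position carries a color-independent "suitability" for one engine plus a color-dependent White/Black bias; all the spread figures are made-up assumptions, so this demonstrates the mechanism rather than the actual magnitudes:

[code]
import numpy as np

rng = np.random.default_rng(1)

GAMES  = 6000    # games per simulated match
TRIALS = 2000    # simulated matches per scheme
EDGE   = 0.02    # true strength edge, in score units
S_SD   = 0.10    # spread of color-independent position "suitability" (assumed)
B_SD   = 0.04    # spread of color-dependent White/Black bias (assumed)
NOISE  = 0.45    # per-game result noise (results treated as continuous)

def fresh_positions(games):
    """Every game gets a new position; no color reversal."""
    s = rng.normal(0, S_SD, games)
    b = rng.normal(0, B_SD, games)
    return np.mean(0.5 + EDGE + s + b + rng.normal(0, NOISE, games))

def small_pool_reversed(games, pool=50):
    """A fixed small suite, every position replayed with colors reversed."""
    s = rng.normal(0, S_SD, pool)     # drawn once: the suite is fixed
    b = rng.normal(0, B_SD, pool)
    reps, total = games // (2 * pool), 0.0
    for _ in range(reps):
        white = 0.5 + EDGE + s + b + rng.normal(0, NOISE, pool)
        black = 0.5 + EDGE + s - b + rng.normal(0, NOISE, pool)  # bias cancels
        total += np.sum(white + black) / 2
    return total / (reps * pool)

print("std error, 6000 fresh positions:",
      np.std([fresh_positions(GAMES) for _ in range(TRIALS)]))
print("std error, 50-position suite  :",
      np.std([small_pool_reversed(GAMES) for _ in range(TRIALS)]))
[/code]

With these (made-up) spreads the small reversed suite comes out noticeably noisier, because the 50 suitability draws never average out no matter how many games are replayed; shrink S_SD below B_SD and the ordering can flip. So this is a sketch of the mechanism, not a general verdict.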
I realize that you would like to adjust Komodo's testing in such a way that it would better predict the results of the rating list testers. And possibly you could achieve this. But it is not certain that it would make Komodo better (stronger). It could even make it worse.