The influence of books on test results.

Discussion of computer chess matches and engine tournaments.

Moderator: Ras

Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: The influence of books on test results.

Post by Adam Hair »

carldaman wrote:I hope everyone realizes that playing each opening with reversed colors only makes sense if the resulting position (where the book ends) has a lot of fight in it, and both sides have chances.

For example, if the book ends with a clear advantage to White, then that will lead to real rating distortions, especially in matches between engines of unequal strength. The stronger engine will win with White as expected, but the weaker engine will also win far too many times with White.

Likewise, if the book ends with a very dead/drawish position, the weaker engine will again benefit by drawing too often due to the opening.

In these cases, playing the same opening with both colors would only do harm to the test. Sorry if I'm stating the obvious, but a lot of people seem to treat testing with reversed colors as being fair by default.

Regards,
CL
That is also my interpretation.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: The influence of books on test results.

Post by Adam Hair »

lkaufman wrote:
Adam Hair wrote:
lkaufman wrote:Of all the factors that can influence test results, such as time limit, increment vs. repeating controls, ponder, hardware, etc., the one we are currently most interested in is the effect of opening books/testsuites. Our own distributed tester uses a five move book, rather shorter than that used by most testers. Since it shows a sixteen Elo lead for Komodo 5 over Houdini 1.5 (after over 11k games) which is not shown by the testing agencies, and since the only result on this forum showing Komodo 5 beating Houdini 2 in a long match used a four move book, we decided to make a new testbook that is more typical of books normally used in tests - it averages six moves, but some popular lines are much longer than this. Based on hyper-fast testing, our performance drops by 12 Elo playing against Critter (the closest opponent at hyperspeed levels) after 6700 games. So assuming this would also be true at the normal blitz levels used in the distributed test, this would appear to account for most of the discrepancy between our own test results and the others.
Has anyone else run long tests to compare the effect of different opening books on test results? The tests would have to be several thousand games long, but can be at very fast levels.
Probably we will modify our tester to use this or a similar new book, so that future results will be better predicted by it. My conclusion is that Komodo is better than other top programs at playing the early opening, but the longer the book line supplied, the less valuable this asset becomes. Perhaps switching to a more normal book for testing will gradually help Komodo as different features are tuned using this new book.
I never considered the opening book to be much of a factor in test results (assuming colors are switched for each book position tested), but I am gradually becoming a believer.
My testing used a set of 18,000 positions that were all 4 moves deep. These positions were derived from the databases of the CCRL, CEGT, SWCR, UEL, and my own games. Though I am certain that there are some unbalanced positions in this set, for the most part they are neither too unbalanced nor too drawish. The White score for my games has been just under 53%.

I do not use reversed colors. Doing so automatically reduces the independence of the positions used, which increases the actual error of the measurements. I depend on randomness to keep White (or Black) bias low. I think that shows in the White score of my games, which includes many more games than just those played by the Also-Ran engines.

I have evidence that using a large set of positions without reversed colors is much better than using a small set of positions with reversed colors. There is some variance that comes into play by not using reversed colors, especially if the pool of opponents is wide. But, it is more than offset (in my experience) by the large number of positions used, covering more situations that would be found in general.

I realize that you would like to adjust Komodo's testing in such a way that it would better predict the results of the rating list testers. And possibly you could achieve this. But it is not certain that it would make Komodo better (stronger). It could even make it worse.
Ironically, your testing is similar to the way we have been testing up until now. But I have to agree with Robert here: it seems better to have a book that represents openings more or less in proportion to their use in master play.
Question: Have you found that your own results differ noticeably from those of other testers who use more conventional, deeper books/test sets? For example, have you found that Komodo scores better, worse, or about the same with your book as with standard ones? Thanks.
I have not seen any real inconsistencies between my method and other people's methods. I have not compared Komodo's results with my opening pgn with other opening books/pgns. I will say that IPON (though I do not know the true nature of Ingo's positions) and CEGT used a different approach than I did, and we all found essentially the same result for Komodo 5.

When I first started in 2009, I used a book (HS Arena Mainbook 7 moves). After a while, I started using a pgn of opening positions (Silver, LaCrosse, Franklin Titus, Sedat, and others). Then I studied the rating list databases. I saw that many (maybe 20% to 30%, IIRC) of the most common opening positions (the position where the engines start thinking) had a White score outside the range of 52% to 56%. And other positions had a draw percentage of 40% or more. I delved deeper into this, studying White performance for some of these positions. That was sobering to me. In many cases, I saw a difference in excess of 50 Elo between White strength and White performance.

I will say that what I am calling "positions" were actually groups of positions. I used extended ECO to sort the games, but that was not enough to sort the games into individual positions.

I attempted to construct my own "Perfect" opening database, but my knowledge of openings is woefully inadequate. I started using a set of positions constructed by someone who used statistics and opening theory. But that set was not big enough for some of my projects. So I constructed several databases, for my own use, from the rating list databases. I removed all positional duplicates and made an attempt to minimize the number of positions that would tend to quickly transpose into the natural line of another position (I really have no idea how successful that attempt was). I do not claim that my opening pgns are better than other pgns or books. All I do know is that I do check my accumulated games (which is probably 170,000 games at 40 moves in 3 minutes, repeating, since I stopped using books and positional suites) periodically, and I am satisfied with my White score and draw rate.
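The selection criteria described above (a White score near the historical average and a draw rate that is not excessive) can be sketched as a simple predicate. This is an editor's illustrative reconstruction, not Adam's actual tooling; the function name and the 52%-56% / 40% thresholds are assumptions taken from the figures quoted in the post.

```python
# Illustrative sketch of the position-filtering criteria described above.
# keep_position() and its default thresholds are assumptions drawn from the
# figures quoted in the post, not actual CCRL tooling.

def keep_position(white_wins, draws, black_wins,
                  lo=0.52, hi=0.56, max_draw_rate=0.40):
    """Keep an opening position if its historical White score lies in
    [lo, hi] and its draw rate stays below max_draw_rate."""
    games = white_wins + draws + black_wins
    if games == 0:
        return False
    white_score = (white_wins + 0.5 * draws) / games
    draw_rate = draws / games
    return lo <= white_score <= hi and draw_rate < max_draw_rate

# A balanced, fighting position passes the filter...
print(keep_position(400, 280, 320))  # White score 0.54, draw rate 0.28 -> True
# ...while a clearly White-favoured or a drawish one does not.
print(keep_position(540, 300, 160))  # White score 0.69 -> False
print(keep_position(300, 450, 250))  # draw rate 0.45 -> False
```

The same predicate could be applied to any group of games reaching a given opening position, regardless of how the games are sorted (by extended ECO or by exact position).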
User avatar
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: The influence of books on test results.

Post by Houdini »

Adam Hair wrote:All I do know is that I do check my accumulated games (which is probably 170,000 games at 40 moves in 3 minutes, repeating, since I stopped using books and positional suites) periodically, and I am satisfied with my White score and draw rate.
Very interesting!
While you have found a great degree of internal consistency in terms of draw rate and win percentage, an important question is whether the results you obtain are relevant for the users of the ratings you produce.

If I understand correctly, in all of the games you play only a very small fraction will be "real-life" opening variations encountered by the people that consult the rating list. For example, you only have a single 4-move position out of 18,000 that leads to the main Ruy Lopez: 1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 4.Ba4 Nf6 - that's it. In other words: your rating results don't contain any Ruy Lopez games at all.

It implies that most of the tuning that engine authors perform with a normal, "real-life" set of opening positions is more or less wasted for CCRL.

Robert
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: The influence of books on test results.

Post by Laskos »

carldaman wrote:I hope everyone realizes that playing each opening with reversed colors only makes sense if the resulting position (where the book ends) has a lot of fight in it, and both sides have chances.

For example, if the book ends with a clear advantage to White, then that will lead to real rating distortions, especially in matches between engines of unequal strength. The stronger engine will win with White as expected, but the weaker engine will also win far too many times with White.

Likewise, if the book ends with a very dead/drawish position, the weaker engine will again benefit by drawing too often due to the opening.

In these cases, playing the same opening with both colors would only do harm to the test. Sorry if I'm stating the obvious, but a lot of people seem to treat testing with reversed colors as being fair by default.

Regards,
CL
I already mentioned that playing with reversed colours is especially beneficial when the engines are very closely matched, which is the case when searching for 3 Elo point improvements. The most massive tests run by developers involve very closely matched engines. Then a "wrong" opening (too unbalanced) played once gives a deterministic 1-0, while the same opening played with reversed colours gives a deterministic 1-0 and 0-1. If the engines' strengths are 0.505:0.495, the error introduced by a single game is ~0.5 points, while the error introduced by two opposite-colour games is 0.01 points.

This is an extreme case, but with very closely matched engines and a limited number of games, playing with reversed colours is better. Not until millions of games per match will the faster 1/sqrt(N) decay of the error for non-reversed colours start to show its efficiency.
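The arithmetic in Kai's example can be checked in a few lines. This is an editor's illustrative sketch using the 0.505:0.495 strengths quoted in the post; the variable names are not from the original.

```python
# Illustrative check of the error arithmetic above (editor's sketch).
p = 0.505  # engine A's true expected score per game (0.505:0.495 match-up)

# A "wrong" (too unbalanced) opening is deterministic: White always wins.
# Played once, with A as White, the observed score is 1.0 instead of p:
single_game_error = abs(1.0 - p)                 # ~0.5 points

# Played with reversed colours, White wins both games, so A scores
# 1 point out of 2, against an expectation of 2*p points:
reversed_pair_error = abs((1.0 + 0.0) - 2 * p)   # 0.01 points

print(single_game_error, reversed_pair_error)
```

The 0.5-point error from the single game is pure systematic bias from the opening, while the reversed pair cancels it almost exactly, leaving only the small imbalance between the two engines.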

Kai
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: The influence of books on test results.

Post by Adam Hair »

Houdini wrote:
Adam Hair wrote:All I do know is that I do check my accumulated games (which is probably 170,000 games at 40 moves in 3 minutes, repeating, since I stopped using books and positional suites) periodically, and I am satisfied with my White score and draw rate.
Very interesting!
While you have found a great degree of internal consistency in terms of draw rate and win percentage, an important question is whether the results you obtain are relevant for the users of the ratings you produce.

If I understand correctly, in all of the games you play only a very small fraction will be "real-life" opening variations encountered by the people that consult the rating list. For example, you only have a single 4-move position out of 18,000 that leads to the main Ruy Lopez: 1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 4.Ba4 Nf6 - that's it. In other words: your rating results don't contain any Ruy Lopez games at all.

It implies that most of the tuning that engine authors perform with a normal, "real-life" set of opening positions is more or less wasted for CCRL.

Robert
Perhaps wasted for the CCRL 40/4 list, though I am not the only tester for that list.

I can only say that my results do not vary considerably from other people's results, and tend to match what most authors have measured for themselves, including you. I do keep watch on those sorts of things, to see if anything I do based on my thinking produces results at odds with everybody else's.

And again, if I tested with a long time control, my focus would be different.

If my testing method possibly negates something you are trying to do with Houdini, then you are free to ban me from receiving a copy of the next version or from publishing any results. I would not complain, nor would I have hard feelings about you.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: The influence of books on test results.

Post by Adam Hair »

Laskos wrote:
carldaman wrote:I hope everyone realizes that playing each opening with reversed colors only makes sense if the resulting position (where the book ends) has a lot of fight in it, and both sides have chances.

For example, if the book ends with a clear advantage to White, then that will lead to real rating distortions, especially in matches between engines of unequal strength. The stronger engine will win with White as expected, but the weaker engine will also win far too many times with White.

Likewise, if the book ends with a very dead/drawish position, the weaker engine will again benefit by drawing too often due to the opening.

In these cases, playing the same opening with both colors would only do harm to the test. Sorry if I'm stating the obvious, but a lot of people seem to treat testing with reversed colors as being fair by default.

Regards,
CL
I already mentioned that playing with reversed colours is especially beneficial when the engines are very closely matched, which is the case when searching for 3 Elo point improvements. The most massive tests run by developers involve very closely matched engines. Then a "wrong" opening (too unbalanced) played once gives a deterministic 1-0, while the same opening played with reversed colours gives a deterministic 1-0 and 0-1. If the engines' strengths are 0.505:0.495, the error introduced by a single game is ~0.5 points, while the error introduced by two opposite-colour games is 0.01 points.

This is an extreme case, but with very closely matched engines and a limited number of games, playing with reversed colours is better. Not until millions of games per match will the faster 1/sqrt(N) decay of the error for non-reversed colours start to show its efficiency.

Kai
The error from a single game per position is ~0.5, but the cumulative error is not d*0.5, where d is the number of deterministic games. And it is certain that reversed colors have error margins √2 larger than a single game per position when N games are played with each method (not counting the deterministic games). Plus, a drawish position gets played twice with reversed colors. I do not see why it should take millions of games before playing one game per position wins out over reversed positions. I am not saying you are wrong; I am just not convinced yet.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: The influence of books on test results.

Post by Laskos »

Adam Hair wrote:
Laskos wrote:
carldaman wrote:I hope everyone realizes that playing each opening with reversed colors only makes sense if the resulting position (where the book ends) has a lot of fight in it, and both sides have chances.

For example, if the book ends with a clear advantage to White, then that will lead to real rating distortions, especially in matches between engines of unequal strength. The stronger engine will win with White as expected, but the weaker engine will also win far too many times with White.

Likewise, if the book ends with a very dead/drawish position, the weaker engine will again benefit by drawing too often due to the opening.

In these cases, playing the same opening with both colors would only do harm to the test. Sorry if I'm stating the obvious, but a lot of people seem to treat testing with reversed colors as being fair by default.

Regards,
CL
I already mentioned that playing with reversed colours is especially beneficial when the engines are very closely matched, which is the case when searching for 3 Elo point improvements. The most massive tests run by developers involve very closely matched engines. Then a "wrong" opening (too unbalanced) played once gives a deterministic 1-0, while the same opening played with reversed colours gives a deterministic 1-0 and 0-1. If the engines' strengths are 0.505:0.495, the error introduced by a single game is ~0.5 points, while the error introduced by two opposite-colour games is 0.01 points.

This is an extreme case, but with very closely matched engines and a limited number of games, playing with reversed colours is better. Not until millions of games per match will the faster 1/sqrt(N) decay of the error for non-reversed colours start to show its efficiency.

Kai
The error from a single game per position is ~0.5, but the cumulative error is not d*0.5, where d is the number of deterministic games. And it is certain that reversed colors have error margins √2 larger than a single game per position when N games are played with each method (not counting the deterministic games). Plus, a drawish position gets played twice with reversed colors. I do not see why it should take millions of games before playing one game per position wins out over reversed positions. I am not saying you are wrong; I am just not convinced yet.
If the correlation between the reversed-colour, non-deterministic games is not high, say 0.1 (for fast games it is even lower), then the factor is not sqrt(2) but closer to sqrt(1.1), assuming the number of "wrong" openings is 10% and the engines are closely matched, say a 0.52:0.48 score. The error is certainly not d*0.5 but sqrt(d)*0.5; that 0.5 is large, though, and is offset only slowly, as the number of games grows, in the combined systematic + statistical errors.
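The trade-off being argued here can be explored with a small Monte Carlo sketch. This is an editor's illustration, not a model of any actual rating-list methodology: all parameters are illustrative, draws are ignored, and "wrong" openings are modelled as deterministic White wins, per Kai's example.

```python
# Monte Carlo sketch (editor's illustration) of the reversed-colour vs.
# one-game-per-position trade-off discussed above. Draws are ignored and
# "wrong" openings are modelled as deterministic White wins.
import random

def match_error(n_games, reversed_pairs, p=0.52, wrong_frac=0.10, trials=2000):
    """Mean absolute error of the measured score versus the true score p."""
    rng = random.Random(1)
    errors = []
    for _ in range(trials):
        score, played = 0.0, 0
        while played < n_games:
            wrong = rng.random() < wrong_frac  # deterministic White win?
            if reversed_pairs:
                if wrong:
                    score += 1.0  # 1-0 and 0-1 cancel: exactly 1 point of 2
                else:
                    score += (rng.random() < p) + (rng.random() < p)
                played += 2
            else:
                # fresh opening each game, colours assigned at random
                score += (rng.random() < 0.5) if wrong else (rng.random() < p)
                played += 1
        errors.append(abs(score / n_games - p))
    return sum(errors) / trials

# With many "wrong" openings, the reversed scheme measures p more tightly
# for the same game budget, because each bad opening cancels out exactly:
print(match_error(400, reversed_pairs=True, wrong_frac=0.5))
print(match_error(400, reversed_pairs=False, wrong_frac=0.5))
```

Under these assumptions the reversed scheme removes the variance contributed by the wrong openings entirely, which matches Kai's point that its advantage is largest when the engines are close and the game budget is limited.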

Kai
User avatar
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: The influence of books on test results.

Post by Houdini »

Adam Hair wrote:If my testing method possibly negates something you are trying to do with Houdini, then you are free to ban me from receiving a copy of the next version or from publishing any results. I would not complain, nor would I have hard feelings about you.
I have no problem whatsoever with your testing, you know what you do and do it intelligently.
It's interesting to see how the testing approaches can be different, it can explain some of the systematic differences between rating lists and my own development testing.

Robert
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: The influence of books on test results.

Post by Adam Hair »

Laskos wrote:
Adam Hair wrote:
Laskos wrote:
carldaman wrote:I hope everyone realizes that playing each opening with reversed colors only makes sense if the resulting position (where the book ends) has a lot of fight in it, and both sides have chances.

For example, if the book ends with a clear advantage to White, then that will lead to real rating distortions, especially in matches between engines of unequal strength. The stronger engine will win with White as expected, but the weaker engine will also win far too many times with White.

Likewise, if the book ends with a very dead/drawish position, the weaker engine will again benefit by drawing too often due to the opening.

In these cases, playing the same opening with both colors would only do harm to the test. Sorry if I'm stating the obvious, but a lot of people seem to treat testing with reversed colors as being fair by default.

Regards,
CL
I already mentioned that playing with reversed colours is especially beneficial when the engines are very closely matched, which is the case when searching for 3 Elo point improvements. The most massive tests run by developers involve very closely matched engines. Then a "wrong" opening (too unbalanced) played once gives a deterministic 1-0, while the same opening played with reversed colours gives a deterministic 1-0 and 0-1. If the engines' strengths are 0.505:0.495, the error introduced by a single game is ~0.5 points, while the error introduced by two opposite-colour games is 0.01 points.

This is an extreme case, but with very closely matched engines and a limited number of games, playing with reversed colours is better. Not until millions of games per match will the faster 1/sqrt(N) decay of the error for non-reversed colours start to show its efficiency.

Kai
The error from a single game per position is ~0.5, but the cumulative error is not d*0.5, where d is the number of deterministic games. And it is certain that reversed colors have error margins √2 larger than a single game per position when N games are played with each method (not counting the deterministic games). Plus, a drawish position gets played twice with reversed colors. I do not see why it should take millions of games before playing one game per position wins out over reversed positions. I am not saying you are wrong; I am just not convinced yet.
If the correlation between the reversed-colour, non-deterministic games is not high, say 0.1 (for fast games it is even lower), then the factor is not sqrt(2) but closer to sqrt(1.1), assuming the number of "wrong" openings is 10% and the engines are closely matched, say a 0.52:0.48 score. The error is certainly not d*0.5 but sqrt(d)*0.5; that 0.5 is large, though, and is offset only slowly, as the number of games grows, in the combined systematic + statistical errors.

Kai
Under those conditions, I both understand and agree with you.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: The influence of books on test results.

Post by Adam Hair »

Houdini wrote:
Adam Hair wrote:If my testing method possibly negates something you are trying to do with Houdini, then you are free to ban me from receiving a copy of the next version or from publishing any results. I would not complain, nor would I have hard feelings about you.
I have no problem whatsoever with your testing, you know what you do and do it intelligently.
It's interesting to see how the testing approaches can be different, it can explain some of the systematic differences between rating lists and my own development testing.

Robert
I do not want you to feel that Houdini would be unfairly handicapped. People disagree with various CCRL testing conditions, but no one should ever feel that there is bias against their engine.