Is Naum 4 a serious challenge for Rybka

Laskos · Post by **Laskos** » Tue Jan 06, 2009 4:33 am

Dr.Wael Deeb wrote:
Under the ChessBase GUI you can use one opening book for both engines, with each playing the same opening line as both White and Black. This is even better than using test suites.

There are still problems with suites and books. For example, some suites could favour one engine because it likes those openings. The same goes for books, and moreover I do not like to include book learning in the equation. I have seen on the CSS forum that some engines prefer certain openings, and a biased suite could ruin the result. A Rybka anti-Rybka playchess book could favour Rybka in a match with another engine by showing the lines Rybka should prefer and avoid.

Kai

Mike S. · Post by **Mike S.** » Tue Jan 06, 2009 5:56 am

Laskos wrote: There are still problems with suites and books. For example some suites could favour one engine, because it likes those openings.

This is very, very theoretical, or speculative, especially if the variations are repeated with switched sides. What matters most is that the vast majority of positions where the variations end are fairly balanced.

Sedat Canbaz offers five sets of different size (top-10 to top-200), and I think he has selected the variations by frequency. I think that is an objective and independent criterion. So even if a particular engine "liked" that set more than other sets, it wouldn't spoil the significance of a test IMO, because these are simply "realistic" openings. An engine that is good at them in the lab will also be good in the wild.

http://www.sedatchess.com/download.html

(just as example; I don't doubt that other sets are also good and suitable)

Also, the conclusion that an engine "(dis-)likes" a particular book or opening suite requires that a trusted result or rating already exists to compare with. But I have doubts about an approach which uses such a comparison for test design. If I finish this thought, it would mean saying that my tests are wrong as long as they produce other results.

That is certainly wrong.

As for the openings in tests, I think variety is required (no Sicilian thematic tournament), and a certain quality. Just nothing extreme. That should be sufficient. Over time, I've changed my mind from (very) short to medium-deep test books.

I dislike "no book" tests very much, except as experiment, but then one shouldn't draw conclusions about general playing strength, from such a test (at least not from the results only). I have also seen tests and even ratings being published, based on only one single opening variation which was used for all games. That is absurd.

I also somewhat doubt the idea of permanent book tuning, if conclusions about engine strength are to be drawn (not in general). The influence of such tuning is something "artificial" which does not exist in a normal chess player's practice with an engine, because he maybe tunes openings for himself to use, not for the engine!

Albert Silver · Post by **Albert Silver** » Tue Jan 06, 2009 6:34 am

Laskos wrote:
Dr.Wael Deeb wrote:
Under the ChessBase GUI you can use one opening book for both engines, with each playing the same opening line as both White and Black. This is even better than using test suites.
There are still problems with suites and books. For example, some suites could favour one engine because it likes those openings. The same goes for books, and moreover I do not like to include book learning in the equation. I have seen on the CSS forum that some engines prefer certain openings, and a biased suite could ruin the result. A Rybka anti-Rybka playchess book could favour Rybka in a match with another engine by showing the lines Rybka should prefer and avoid.

Kai

I don't think this is correct, unless the suite is poorly designed. My suite was designed to test a wide variety of openings, each with several acceptable playable moves. I also tested several engines to be sure these alternatives were actually played, since what would be the point if the engines all played only one move? The openings were chosen by how frequently they are played according to the databases. That said, whenever testing with any suite, one must assuredly test against a variety of opponents. I have seen tweaks give large improvements against one opponent but then be canceled out when tested against another. One gauntlet is definitely not enough.

Laskos · Post by **Laskos** » Wed Jan 07, 2009 2:37 am

Mike S. wrote:
Laskos wrote: There are still problems with suites and books. For example some suites could favour one engine, because it likes those openings.
This is very, very theoretical, or speculative, especially if the variations are repeated with switched sides. What matters most is that the vast majority of positions where the variations end are fairly balanced.

Sedat Canbaz offers five sets of different size (top-10 to top-200), and I think he has selected the variations by frequency. I think that is an objective and independent criterion. So even if a particular engine "liked" that set more than other sets, it wouldn't spoil the significance of a test IMO, because these are simply "realistic" openings. An engine that is good at them in the lab will also be good in the wild.

http://www.sedatchess.com/download.html

You are certainly right, but a person deliberately building a suite to favour one engine can accomplish that easily. From the CSS rating list, if I am not wrong, some engines scored more than 15% better in some openings (as both Black and White) than in others. I think the criteria should be variety and representativeness. An unbiased person building a suite will probably create one which does not favour a particular engine.

(just as example; I don't doubt that other sets are also good and suitable)

Also, the conclusion that an engine "(dis-)likes" a particular book or opening suite requires that a trusted result or rating already exists to compare with. But I have doubts about an approach which uses such a comparison for test design. If I finish this thought, it would mean saying that my tests are wrong as long as they produce other results. That is certainly wrong.

As for the openings in tests, I think variety is required (no Sicilian thematic tournament), and a certain quality. Just nothing extreme. That should be sufficient. Over time, I've changed my mind from (very) short to medium-deep test books.

I dislike "no book" tests very much

Why? I actually prefer suites to book tests. They are almost reproducible and give a better hint of, for example, what happens at longer time controls compared to shorter ones. With books, I get a huge variability that is hard for me to gauge, even as error bars (2 sigma or so).

, except as experiment, but then one shouldn't draw conclusions about general playing strength, from such a test (at least not from the results only). I have also seen tests and even ratings being published, based on only one single opening variation which was used for all games. That is absurd.

I also somewhat doubt the idea of permanent book tuning, if conclusions about engine strength are to be drawn (not in general). The influence of such tuning is something "artificial" which does not exist in a normal chess player's practice with an engine, because he maybe tunes openings for himself to use, not for the engine!

Totally agree. There could even be, for example, playchess ratings with highly tuned books and general ratings with representative openings.

Regards,
Kai

Laskos · Post by **Laskos** » Wed Jan 07, 2009 3:03 am

Albert Silver wrote:
Laskos wrote:
Dr.Wael Deeb wrote:
Under the ChessBase GUI you can use one opening book for both engines, with each playing the same opening line as both White and Black. This is even better than using test suites.
There are still problems with suites and books. For example, some suites could favour one engine because it likes those openings. The same goes for books, and moreover I do not like to include book learning in the equation. I have seen on the CSS forum that some engines prefer certain openings, and a biased suite could ruin the result. A Rybka anti-Rybka playchess book could favour Rybka in a match with another engine by showing the lines Rybka should prefer and avoid.

Kai
I don't think this is correct, unless the suite is poorly designed. My suite was designed to test a wide variety of openings, each with several acceptable playable moves. I also tested several engines to be sure these alternatives were actually played, since what would be the point if the engines all played only one move? The openings were chosen by how frequently they are played according to the databases. That said, whenever testing with any suite, one must assuredly test against a variety of opponents. I have seen tweaks give large improvements against one opponent but then be canceled out when tested against another. One gauntlet is definitely not enough.

I agree. I only wanted to say that one can build a suite favouring one engine if one wants to, just by looking at CSS results. A carefully selected, unbiased and representative suite is very useful in testing. There could also be a suite "according to playchess" for highly tuned books (the database is changed), as Mike said. But that would be a "Playchess rating list". Btw, where can I get your suite? Also, can I merge your suite with Noomen 30? I would like to do that in order to have more games and so decrease the error bars, but maybe these suites are different in substance.

Regards,
Kai
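
A rough sketch of why merging suites should shrink the error bars, assuming the games are independent, the engines are roughly equal, and the Elo curve can be linearised around a 50% score. The draw ratio and the game counts below are hypothetical and are not taken from any of the suites mentioned in this thread.

```python
import math

# Slope of the logistic Elo curve at a 50% score, in Elo per unit of score.
# Follows from D(s) = -400 * log10(1/s - 1), whose derivative at s = 0.5
# is 1600 / ln(10).
ELO_PER_SCORE_AT_50 = 1600.0 / math.log(10)

def elo_margin(n_games, draw_ratio=0.3, z=2.0):
    """Approximate +/- Elo error bar (z sigma) for n independent games
    between two engines of about equal strength."""
    # With wins and losses equally likely, only decisive games add variance.
    per_game_sd = 0.5 * math.sqrt(1.0 - draw_ratio)
    return z * per_game_sd / math.sqrt(n_games) * ELO_PER_SCORE_AT_50

# Hypothetical game counts, e.g. a small suite alone versus merged suites.
for n in (60, 120, 240, 1000):
    print(f"{n:5d} games: about +/- {elo_margin(n):.0f} Elo at 2 sigma")
```

The margin falls only with the square root of the number of games, so four times as many games are needed to halve it. If the merged positions overlap or produce correlated games, the real gain would be smaller than this estimate.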

Zach Wegner · Post by **Zach Wegner** » Wed Jan 07, 2009 3:22 am

Albert Silver wrote: I don't think this is correct, unless the suite is poorly designed. My suite was designed to test a wide variety of openings, each with several acceptable playable moves. I also tested several engines to be sure these alternatives were actually played, since what would be the point if the engines all played only one move? The openings were chosen by how frequently they are played according to the databases. That said, whenever testing with any suite, one must assuredly test against a variety of opponents. I have seen tweaks give large improvements against one opponent but then be canceled out when tested against another. One gauntlet is definitely not enough.

I don't agree. It is easy to say that a test suite is balanced and that both engines play both sides, but there are many very small discrepancies that are not noticed on a small scale.

In a thread a few months ago, Bob Hyatt posted some results where the error bars for identical engines did not overlap. IIRC the chance of this happening, assuming that the result of a chess game is randomly distributed according to Elo theory, was something like 0.06%. IMO, there are two main causes for this: a limited set of positions, and a limited set of opponents. Whenever two engines play from the same starting position with the same colors, the result is very highly correlated (non-random). In fact there is a pretty good chance you will get the exact same game. Say you take a starting position where with two engines of equal strength X and Y, X beats engine Y, regardless of color, 95% of the time (say by understanding some sort of compensation better). If you play the same setup a thousand times and feed it into an Elo program, you will get the impression that X is hundreds of rating points better with a very good confidence. But if you play many different positions with many different engines, they might be equal. This is an exaggeration, but the point is that these results are not random, and cannot be used to draw conclusions about engine strength. IMO one engine should only play another one two times from the same starting position.
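
A minimal sketch of the arithmetic behind that exaggerated example, assuming the standard logistic Elo formula and treating every game as an independent trial (which is exactly the assumption being questioned here). The function names and the 950/50 split are only illustrative.

```python
import math

def elo_diff(score):
    """Elo difference implied by a score fraction, logistic Elo model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

def naive_elo_interval(wins, draws, losses, z=2.0):
    """Elo estimate with a z-sigma interval, assuming independent games."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)
    return elo_diff(score), elo_diff(score - z * se), elo_diff(score + z * se)

# One starting position replayed 1000 times, X scoring 95% against Y.
est, low, high = naive_elo_interval(wins=950, draws=0, losses=50)
print(f"naive estimate: {est:.0f} Elo, interval {low:.0f} to {high:.0f}")
```

The interval looks tight only because the formula counts 1000 independent games. If the repeated games are near copies of each other, the effective sample is much smaller and the interval says little about general strength.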

Laskos · Post by **Laskos** » Wed Jan 07, 2009 3:56 am

Zach Wegner wrote:
Albert Silver wrote: I don't think this is correct, unless the suite is poorly designed. My suite was designed to test a wide variety of openings, each with several acceptable playable moves. I also tested several engines to be sure these alternatives were actually played, since what would be the point if the engines all played only one move? The openings were chosen by how frequently they are played according to the databases. That said, whenever testing with any suite, one must assuredly test against a variety of opponents. I have seen tweaks give large improvements against one opponent but then be canceled out when tested against another. One gauntlet is definitely not enough.
I don't agree. It is easy to say that a test suite is balanced and that both engines play both sides, but there are many very small discrepancies that are not noticed on a small scale.

In a thread a few months ago, Bob Hyatt posted some results where the error bars for identical engines did not overlap. IIRC the chance of this happening, assuming that the result of a chess game is randomly distributed according to Elo theory, was something like 0.06%. IMO, there are two main causes for this: a limited set of positions, and a limited set of opponents. Whenever two engines play from the same starting position with the same colors, the result is very highly correlated (non-random). In fact there is a pretty good chance you will get the exact same game. Say you take a starting position where with two engines of equal strength X and Y, X beats engine Y, regardless of color, 95% of the time (say by understanding some sort of compensation better). If you play the same setup a thousand times and feed it into an Elo program, you will get the impression that X is hundreds of rating points better with a very good confidence. But if you play many different positions with many different engines, they might be equal. This is an exaggeration, but the point is that these results are not random, and cannot be used to draw conclusions about engine strength. IMO one engine should only play another one two times from the same starting position.

I never play more than white/black on the same position; it would be stupid. The results would be highly correlated (I would like the engines to be deterministic, i.e. correlation 1) and it is meaningless. That's why I would like a larger, representative opening suite. I think you are addressing a different issue.

Kai

Tomcass · Post by **Tomcass** » Wed Jan 07, 2009 12:42 pm

It is not so difficult to understand, Ernst, that I am referring to a victory under certain circumstances, which I have explained in detail.

I am also not implying anything about the strength of the two programmes. I know very well what it is. There are good tests available. My only purpose here is:

- To enjoy the process of testing engines and exploring new books. This is my main driver.

- To share the results and my impressions with this chess community (with people who could be interested in them, obviously).

Enjoy!

Tom.

ernest · Post by **ernest** » Wed Jan 07, 2009 6:29 pm

Hi Mike,

I usually use your books for my engine matches:
your old 5moves.ctg
and your recent (2008) PB5moves.ctg

Can you comment on the differences?

ernst · Post by **ernst** » Thu Jan 08, 2009 8:20 am

Tomcass wrote:It is not so difficult to understand, Ernst, that I am referring to a victory under certain circumstances, which I have explained in detail.

I am also not implying anything about the strength of the two programmes. I know very well what it is. There are good tests available. My only purpose here is:

- To enjoy the process of testing engines and exploring new books. This is my main driver.

- To share the results and my impressions with this chess community (with people who could be interested in them, obviously).

Enjoy!

Tom.

A quote from an earlier post...

Posted: Sat Jan 03, 2009 11:59 pm Post subject: Re: Is Naum 4 a serious challenge for Rybka

--------------------------------------------------------------------------------

It is very clear to me that the answer to the title of this post is a big YES! when we talk about long time controls.

Tomorrow I will post the final results of my test of 200 games at 60 minutes on a Quad 6700 between both engines. The victory of Naum4 has been impressive.

My feeling is that at long time controls and fast hardware, Naum4 is slightly better than Rybka3.

Regards from Barcelona.

Tom.

This clearly talks about program strength, doesn't it?