A question concerning testing methods

Discussion of computer chess matches and engine tournaments.

Moderator: Ras

Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: A question concerning testing methods

Post by Lyudmil Tsvetkov »

velmarin wrote: With your permission, I'll take this ...

Giving a theoretical lesson, even a marvelous compendium of assessment (which I have translated into Spanish), is a fairly easy job to handle.

Getting an engine to play good chess is somewhat complicated, very difficult, very frustrating.
Beyond that, tuning it to be even better than some engine "X" is a long story...
You have to study engine "X", implement your ideas properly, then run the tests, analyse the tests, look for remedies, and go back again...

Really, I think it is a huge effort for very little...
After all the work with "engine no X" ...
Hi Jose, you are a great guy. :)

I bow to each personal effort, especially when it is well spent.

I know very little of engine development apart from evaluation, but I bet it is the same as everything else: the better you know your job, the better results you get. You have to know all the intricacies of your engine, as well as chess knowledge, to be really successful.

My simple question was the following (and I am interested in it because I would like to know which engine is currently the strongest in the world): why does Stockfish DD manage to draw Houdini 4, while Houdini scores consistently better against other opponents? This decides the fate of the engine chess championship. Any insight on this simple fact would be very welcome.
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: A question concerning testing methods

Post by tpetzke »

Hi,

It can happen that a patch that looks good in self-testing scores worse against other engines, but this is rare. More common is that the improvement you see in the self-test is smaller against other engines, but still an improvement.

This is not about "thinking what is best". It is all about statistics. There was once a thread where people showed that you need 4 times as many games when testing against other engines to reach the same level of confidence. You might get that down to 2 times if the engines in the pool have played a lot of games and their strength is accurately known.
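A rough sketch of where such a factor can come from, assuming (as is often reported) that self-play roughly doubles the Elo difference a patch shows against a mixed pool. The per-game sigma, the 95% z-value and the normal approximation are my own simplifications, not anyone's actual testing framework:

Code:

#include <cmath>
#include <cstdio>

// Per-game standard deviation of the score. 0.5 is the no-draw worst
// case; ~0.4 is a typical value once draws are counted (assumption).
const double SIGMA = 0.4;

// Expected score from an Elo difference (standard logistic model).
double expected_score(double elo) {
    return 1.0 / (1.0 + std::pow(10.0, -elo / 400.0));
}

// Games needed to separate an Elo gain from zero at ~95% confidence
// (two-sided z = 1.96), using the normal approximation.
double games_needed(double elo) {
    const double z = 1.96;
    double effect = expected_score(elo) - 0.5;   // shift in mean score
    return std::pow(z * SIGMA / effect, 2.0);
}

int main() {
    // If self-play doubles the measured difference, a +10 Elo patch
    // shows as +20 in self-play: half the effect size against the
    // pool means four times the games for the same confidence.
    std::printf("self-play  (+20 Elo signal): %6.0f games\n", games_needed(20.0));
    std::printf("vs. others (+10 Elo signal): %6.0f games\n", games_needed(10.0));
}

Halving the signal quadruples the games needed, which matches the 4x figure from that thread.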

Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: A question concerning testing methods

Post by Lyudmil Tsvetkov »

Daniel Shawul wrote:
My point of view is that it is not contempt
Well, it is contempt :) That has been the case with Houdini 3, so there is no need to change that point of view unless you are sure it has changed. Being tuned for blitz rating lists is another issue; otherwise it can't be, what, 60 Elo better than Stockfish/Komodo and then lose embarrassingly at TCEC.
Hi Daniel. I am not certain, but I could agree that contempt is good for blitz and bullet and bad for long games; that makes a lot of sense. However, how do you explain the fact that Houdini scores 75% against Rybka 4.1 and similarly against Critter, while Stockfish manages 10-12% less? Contempt is not supposed to work against engines of considerable strength such as Rybka and Critter.
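For scale, the standard logistic model used by rating lists converts those percentages into Elo; a small sketch (the scores are the ones quoted above):

Code:

#include <cmath>
#include <cstdio>

// Elo difference implied by a match score, inverting the standard
// logistic model used by rating lists: s = 1 / (1 + 10^(-d/400)).
double elo_from_score(double score) {
    return -400.0 * std::log10(1.0 / score - 1.0);
}

int main() {
    std::printf("75%% score ~ %+5.0f Elo\n", elo_from_score(0.75)); // ~ +191
    std::printf("64%% score ~ %+5.0f Elo\n", elo_from_score(0.64)); // ~ +100
}

On that model, a 10-12% lower score against the same opponents corresponds to roughly 90-100 Elo of performance.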
ouachita
Posts: 454
Joined: Tue Jan 15, 2013 4:33 pm
Location: Ritz-Carlton, NYC
Full name: Bobby Johnson

Re: A question concerning testing methods

Post by ouachita »

Stockfish and Komodo do much better against Houdini in a direct clash than in a larger pool of engines.

...Stockfish mostly plays self tests...
I understand that self-tests tell the developer whether, and by how much, the new version is better or worse than the previous one. But players like me don't care so much about this aspect; we mainly care about how engines fare when playing against each other. Which is why I will test H4, the next Komodo release and Stockfish DD against each other here.
SIM, PhD, MBA, PE
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: A question concerning testing methods

Post by Lyudmil Tsvetkov »

tpetzke wrote: Hi,

It can happen that a patch that looks good in self-testing scores worse against other engines, but this is rare. More common is that the improvement you see in the self-test is smaller against other engines, but still an improvement.

Thomas...
But that is the point. What is preferable: to test against a predecessor and get a 50 Elo gain, which will translate into a 30 Elo gain against different opponents, or to test changes directly against different opponents, gain 50 Elo, and keep it?

In the first case you lose some 20 Elo, the very 20 or more Elo that Stockfish needs to definitely catch Houdini and become the undisputed world champion. Or Komodo, for that matter.

Larry was very concerned (and so was Don, may he rest in peace) about why Komodo ties with Houdini but scores worse against most other engines. If it is not contempt, what is it then? I say it is the simple fact that Houdini has the best testing conditions: without many foreign opponents you will never be that accurate.

Again, I am very interested to know which engine is currently the strongest in the world (so that I can devote more time to beating it :)). The simple answer: Houdini is the king of bullet (Pohl's list is excellent and statistically very sound, but the quality of the games is very poor, so it does not satisfy me completely); Houdini maybe still has an edge at blitz (where most rating lists are statistically relevant and the quality of the chess is pretty good, so it could qualify as a world championship); and Komodo most probably is still the strongest at LTC.

Which is the true champion? TCEC and most other LTC tests are great but not statistically significant, so the only option is to consider as world champion (if we do not split the categories) the engine that scores best across the various blitz rating lists. And here we face a dilemma: Stockfish and Komodo would be almost certain world champions at blitz, were it not for the small detail that both are outperformed by Houdini in their matches against the other opponents. This decides the world championship.

If contempt were responsible for Houdini's superb blitz performance against the larger pool, then why don't the Stockfish and Komodo authors implement it? I think the answer is simple: because it is practically impossible. If it were possible, they would certainly have done so already. They test all settings and set the best one as the default.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: A question concerning testing methods

Post by Lyudmil Tsvetkov »

Milos wrote: H4 scores 73% against Rybka and 71% against Critter. They are weak engines compared to H4, and using them for testing is less beneficial than self-testing.
But then, if Critter and Rybka are weak engines, which engine is not weak? There would be no one to test against... I do not buy that theory: both Rybka and Critter are very strong engines, against which contempt should not work successfully.

Milos wrote: Despite whatever RH says or writes in his manual, Houdini's contempt is tuned for H to perform best in the typical rating lists. Period.
Unfortunately, the typical rating lists are also the most representative ones.

And a simple question: if contempt helps you perform better overall, then why doesn't everybody implement it? More specifically: why don't Komodo and Stockfish implement it? Do you believe they could do so successfully?
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: A question concerning testing methods

Post by Milos »

Lyudmil Tsvetkov wrote:
Milos wrote: H4 scores 73% against Rybka and 71% against Critter. They are weak engines compared to H4, and using them for testing is less beneficial than self-testing.
But then, if Critter and Rybka are weak engines, which engine is not weak? There would be no one to test against... I do not buy that theory: both Rybka and Critter are very strong engines, against which contempt should not work successfully.
A 150 Elo weaker engine is weak compared to the engine that is 150 Elo stronger ;). Rybka (and partially Critter, since it is particularly weak at LTC) is the borderline case where H4 scores identically with both contempt=1 and contempt=0. Against weaker engines it scores better with contempt=1; against stronger ones it scores better with contempt=0. The problem is that the only stronger engines are Komodo and SF (and maybe Gull to some extent). But that is the current state.
Lyudmil Tsvetkov wrote: And a simple question: if contempt helps you perform better overall, then why doesn't everybody implement it? More specifically: why don't Komodo and Stockfish implement it? Do you believe they could do so successfully?
That is a question for their authors, not for me. I can only assume that implementing it in the right way (where it is sufficiently beneficial) is not as trivial a task as you might think.
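For readers who want to see the mechanism being debated: the textbook form of contempt scores draws (repetitions, dead-drawn endings) as slightly negative for the engine's own side, so that it avoids draws against opponents it assumes to be weaker. A minimal sketch, not Houdini's actual code; the type names and the centipawn value are illustrative assumptions:

Code:

#include <cstdio>

// Generic sketch of draw-score contempt as commonly described for
// alpha-beta engines. Names and the centipawn value are assumptions.
enum Color { WHITE, BLACK };

struct SearchContext {
    Color rootColor;   // side the engine plays at the root
    int   contemptCp;  // contempt in centipawns, e.g. ~10 cp
};

// Score of a drawn position from the side-to-move's point of view.
// With positive contempt the engine's own side sees a draw as slightly
// bad, so it keeps pieces on and plays for a win; with contempt = 0 a
// draw is scored as exactly equal.
int draw_value(const SearchContext& ctx, Color sideToMove) {
    return (sideToMove == ctx.rootColor) ? -ctx.contemptCp : +ctx.contemptCp;
}

int main() {
    SearchContext ctx{WHITE, 10};
    std::printf("draw for own side:     %+d cp\n", draw_value(ctx, WHITE));
    std::printf("draw for the opponent: %+d cp\n", draw_value(ctx, BLACK));
}

The flip side, which is one plausible reading of Milos's point: against a stronger opponent the same asymmetry makes the engine decline draws it should accept, so a nonzero contempt that helps against a weak pool can cost points against Komodo and SF.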
User avatar
velmarin
Posts: 1600
Joined: Mon Feb 21, 2011 9:48 am

Re: A question concerning testing methods

Post by velmarin »

Lyudmil Tsvetkov wrote: My simple question was the following (and I am interested in it because I would like to know which engine is currently the strongest in the world): why does Stockfish DD manage to draw Houdini 4, while Houdini scores consistently better against other opponents? This decides the fate of the engine chess championship. Any insight on this simple fact would be very welcome.
That is not the question of the first post, "A question concerning testing methods";
you changed the subject.
You speak of a World Championship, but there is no such event. If you mean the recent TCEC,
the statistics and results are what count,
and where do you get that the two drew? I guess you are talking about some other event.

Table for Houdini, Stage 3 and Stage 4: [crosstable image]

Table for Stockfish, Stage 4: [crosstable image]