Houdini 1.03a The New Nr 1 !!!

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, chrisw, Rebel

Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Houdini 1.03a The New Nr 1 !!!

Post by Adam Hair »

Milos wrote:
Adam Hair wrote: Milos, you keep stating that the CCRL lists have no scientific significance,
yet you do not state why you say this. Furthermore, you also have stated
that some tests done with the Ippolit derivatives are much less flawed.
I have not seen one yet that is any better than what is done in the CCRL,
including one that I ran and posted in the Open Chess forum. Perhaps
you could state your grievances concerning CCRL methodologies, either
in a new thread or pm me. The CCRL most definitely has flaws, yet the
total disregard of its results by you seems to be due to things unrelated
to science. I have seen many outlandish statements and insults from you.
Perhaps you could dial back the antagonism and start giving thoughtful
statements. I have found several things that you have said quite interesting.
Unfortunately, too much of the time you are more interested in attacking
than in debating and pointing out the holes in other people's arguments.
Don't get offended. I'm aware you are not professionals but hobby chess enthusiasts. The problem I have is with presenting results from these lists as official, despite the disclaimer that you have (I'm not saying that you or Graham are doing it, but a lot of people get that impression).

Now, regarding scientific insignificance, let me just name a few flaws that come to mind (the actual list would be much longer). I will skip personal biases and even presume that none of you has any intention of distorting results.
So let's see:
1) Inequality of testing conditions

a) You do testing on various machines, most of them much stronger than what you claim is your benchmark machine. You adjust speed by running a Fritz (or Crafty, I'm not sure which) benchmark. Both are synthetic, outdated, and don't give the right picture: they measure just a couple of positions that are certainly not representative of a chess game. You then adjust the benchmark results to the TC you use. Even if that makes the engines search the same number of nodes to a fixed depth as on the reference machine, you change the TC and thereby directly change the time management (TM; I'll say more on TC in later points). So all your tests are done with effectively different TM for the same engine. Moreover, various machines have different multi-core implementations, and they have different caches, which can distort things very much in SMP engine matches.
I agree 100%. In my case, the two computers I use for testing are
identical except that one CPU is an Intel 8400 and the other is
an Intel QX6700. I have adjusted the speed of the two processors so
that they both have approximately the same Crafty 19.17 benchmark,
but I am under no illusion that they are now identical. They have
different Fritz and Arena benchmarks. They do have the same L1 cache
but different L2 caches. The internal core architectures are not identical.
They are definitely two different computers.

I don't remember if anybody in the CCRL is currently using an AMD
processor. Several are using i5s and i7s, which are definitely different
from the Core 2 processors.

My point is that if everybody in the CCRL tested the same engine with
tens of thousands of games against the same group of opponents, I would
expect to see results that varied significantly more than if identical
computers were used.
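
For anyone curious how such a benchmark-based adjustment works in principle, here is a rough sketch. This is only the general idea, not the CCRL's exact procedure, and the benchmark scores below are made-up numbers for illustration.

Code:
# Scale the nominal time control by the ratio of the reference machine's
# benchmark score to the test machine's, so the test machine searches
# roughly as many nodes per game as the reference machine would.

def adjusted_tc_minutes(nominal_minutes, reference_score, test_score):
    return nominal_minutes * reference_score / test_score

# Hypothetical scores: the reference Athlon 64 X2 4600+ scores 1.0,
# a faster test machine scores 2.2 on the same benchmark.
print(adjusted_tc_minutes(40, 1.0, 2.2))   # ~18.2 minutes on the test machine

And, as you point out, equalizing nodes this way still changes the wall-clock time control the engine sees, so its time management behaves differently than it would on the reference machine.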
b) You do testing on various OSes, where you have different programs running in parallel, different services running in the background, and different schedulers and prioritization. Moreover, I strongly doubt that all of you use those machines only for engine testing and never run other things on them at the same time.
Definitely another source of variation. However, I do believe everybody
realizes the importance of not using the computer for other things while
testing.
c) You do tests under various GUIs. GUIs handle time differently, handle the UCI protocol differently, adjudicate games differently, etc. In some cases a GUI even tends to favor its own engine.
Not as great a source of variation. I can't vouch for every member,
but adjudication is not left up to the GUI. And the GUIs used have
not been reported to favor one engine over another - unless Matthias
has slipped something into ChessGUI to favor Big Lion :) .
The conclusion here: even though you make some effort to balance things, you are effectively comparing apples and oranges.
I think that the situation is more like comparing different varieties of
apples from different regions.
2) You do not have a representative sample of opponents for each engine that you test. There is no methodology there. Since some engines are correlated, engines tend to do better against some opponents than others, and there are huge rating differences that you make no effort to balance, you are introducing even more noise than would be the case if you tested, e.g., only the same 3 standard opponents against each engine.
The situation is not as bad as it may seem to you. On the CCRL website,
when you look at the list of opponents that an engine has played, you
will see an ugly assortment of opponents for many of the engines. Some
engines have played multiple versions of other engines. Some do not have
enough games played or do not have a wide pool of opponents. And these
results are used to generate the complete Elo listing. People may think
that, since many more games are used to generate that list and the
error bars are smaller, the complete list is more accurate. Nonsense.
If you look at the pure lists, however, the matches used to generate those
lists do, for the most part, follow the proper methodology. And
there is an effort under way to make some corrections.
3) You use a general book. This is far worse than using a huge set of start positions. You think that with a general book and many games you'll get a good distribution. Instead you get a lot of openings of the same type, not enough diversity, and certainly not a representative sample of computer chess. This introduces even more noise.
Some use general books. Some use starting positions. I do agree that,
with some of the general books available, the diversity of the openings
is not great enough. That fact, and a distrust of the evenness of the positions
coming out of book, led me to start using a PGN of starting positions
even before I joined the CCRL 1-1/2 months ago. The PGN I am using
at the moment was developed by a member of the CCRL using high-rated,
longer time control engine matches.
4) You use EGTBs. In this way you disfavor engines that don't use tablebases, or that use other types of them.
I have seen no consensus on exactly how much EGTBs help, though there
are a couple of 2600+ Elo engines that can't win a KQK endgame. I do
think there is some unfairness, due solely to the fact that it seems to be very
difficult to get permission to use the EGTB code. Thankfully, everybody
now has the choice to use GTB, if they so choose.
5) Your reference machine, an Athlon 64 X2 4600+ (2.4 GHz), is way too weak, and what you claim is long TC testing (40/40) is effectively blitz on today's most powerful machines.
I am guessing that on a stock i7, 40/40 will actually be played
at around 20 minutes for 40 moves. Not exactly blitz. I am not up to date
on overclocking i7s, but I doubt that a high-end i7 running at 4.5 GHz
would have to play at less than 15 minutes per 40 moves.
6) Tournament TCs might not be the best choice. Even though 40/40 is the FIDE standard, that doesn't mean it is beneficial for computer chess. TCs that distribute time more evenly across moves are much better for engine testing, since they capture more of the raw engine strength (which is what matters for analysis). So incremental TCs are better. The best would be not to use a TC at all and instead use a fixed time per move.
I would prefer to use an incremental TC myself. However, as noted above,
different computers are being used, and not everybody has the technical
know-how (or confidence) to overclock or underclock their machine to achieve
a set benchmark, which would be necessary in order to use an incremental
TC.

If we were testing one engine with the purpose of checking whether code
changes were improving strength, then I would use time per move. However,
we are testing multiple engines to determine their relative strengths.
Note that we are testing a wide range of opponents for their overall
ability. Time management is a factor in that. If the goal were to measure
raw strength for use in analysis, then we would test only
the top tier of engines.
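
For concreteness, the arithmetic behind comparing the schemes is simple. Here is a small sketch; the 60-move game length and the particular base/increment values are illustrative assumptions, not anybody's actual settings.

Code:
GAME_MOVES = 60   # assumed average game length

def avg_repeating(period_minutes, period_moves):
    # repeating control (e.g. 40 moves in 40 minutes): average seconds per move
    return period_minutes * 60.0 / period_moves

def avg_incremental(base_minutes, increment_seconds, moves=GAME_MOVES):
    # base + increment: average seconds per move over the whole game
    return (base_minutes * 60.0 + moves * increment_seconds) / moves

print(avg_repeating(40, 40))      # 40/40         -> 60 s/move on average
print(avg_incremental(25, 35))    # 25 min + 35 s -> 60 s/move over a 60-move game
# A fixed 60 s/move gives the same average but removes time management entirely.

All three give the engine roughly the same average time per move; they differ in how freely the engine may redistribute that time, which is exactly the time-management factor mentioned above.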
7) You play way too few games. Bayeselo gives error margins assuming ideal testing conditions on the same machine with the same OS and the same GUI. In other words, it assumes the noise is Gaussian and comes only from small uncertainties in the CPU, the OS (the same CPU and OS for both engines and the whole tournament) and the SMP implementations. In your case there are so many sources of noise (points 1, 2 and 3), some of them not even Gaussian, that the real error margins are actually a couple of times bigger than what you show. To claim valid results under your testing conditions you would have to run one to two orders of magnitude more games - at least tens of thousands, if not hundreds of thousands.
The validity of the results is related to the goals of the testing. The goal of
the CCRL is not, and cannot be, to determine actual Elo differences between
chess engines. The goal is to give programmers and computer chess enthusiasts
an idea of the relative strength of the engines. As I am writing this, I see that you have written:
Milos wrote:Just to simplify things for you guys. Your lists can be used for detecting
differences between engines that are more than 30 elo points apart. For anything else they are useless.
If the resolution of the CCRL pure lists is that good, that would be great.
Better than I would guess.
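
For perspective on that 30 Elo figure, here is a rough back-of-the-envelope calculation. It is only a sketch under simplifying assumptions (independent games, a draw rate around 35%, a normal approximation) and does not capture the extra noise sources you list.

Code:
import math

ELO_PER_SCORE = 1600.0 / math.log(10)   # slope of the Elo curve near a 50% score

def elo_margin_95(games, draw_rate=0.35):
    # approximate 95% error margin, in Elo, after `games` games
    se_score = 0.5 * math.sqrt(1.0 - draw_rate) / math.sqrt(games)
    return 1.96 * se_score * ELO_PER_SCORE

def games_for_margin(margin_elo, draw_rate=0.35):
    # approximate number of games needed for a given 95% margin
    se_needed = margin_elo / (1.96 * ELO_PER_SCORE)
    return math.ceil((0.5 * math.sqrt(1.0 - draw_rate) / se_needed) ** 2)

print(round(elo_margin_95(1000), 1))   # ~17 Elo after 1000 games
print(games_for_margin(15))            # ~1340 games to separate engines 30 Elo apart (+/- 15)
print(games_for_margin(5))             # ~12000 games for +/- 5 Elo

By this crude estimate, separating engines 30 Elo apart takes on the order of a thousand games against a comparable pool, and anything finer grows quickly, which fits both your point 7 and the 30 Elo figure.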

Another purpose of the CCRL is to try to highlight the work of every chess author.
According to Guenther Simon's WB/UCI anthology, there are over 500
chess engines. That means probably at least 450 people have written
a chess engine. There is an attempt to test the older engines and to
test new versions of the active engines. It is the same goal that I had
while testing on my own. I personally feel that is more important than
trying to precisely measure the Elo difference between engines.


I have a problem with the logic you use in your arguments concerning
Rybka vs Houdini, Ivanhoe, etc. You have rightly pointed out flaws
concerning the use of the CCRL lists as "official" lists, though the CCRL
does not have a role in that debate. However, you fail to point out flaws
in the testing of Rybka, Houdini, Ivanhoe, etc.

1) The list that I would consider most likely to be accurate in determining
which engine is strongest, Ingo's list, has a major flaw.
He does not want to release the data upon which the list has been constructed.
Without the data being available, his list might as well not
exist. No one interested in scientific significance could rely on hearsay.
I believe Ingo is unbiased and knows what he is doing, but his results
are just hearsay without the data being available.

2) Many of the tournament results posted are head-to-head matchups
consisting of too few games. As you stated above, some engines match
up favorably against certain opponents. To run a head-to-head match
and claim that the winner of that match is stronger is an invalid conclusion.
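
To illustrate how little a short head-to-head match proves, here is a rough likelihood-of-superiority calculation. It is a sketch using the common normal approximation, and the 100-game score below is a made-up example.

Code:
import math

def likelihood_of_superiority(wins, losses):
    # approximate probability that the match winner really is the stronger
    # engine, using the usual normal approximation (draws drop out)
    if wins + losses == 0:
        return 0.5
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A hypothetical 100-game match finishing +30 -25 =45:
print(round(likelihood_of_superiority(30, 25), 2))   # ~0.75 - far from conclusive

A 75% likelihood of superiority sounds decent, but it still means roughly a one in four chance that the "winner" is not actually the stronger engine.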

3) Time controls, GUIs, books and/or starting positions, and computers
vary from tester to tester.

4) Unfortunately, the pool of available opponents for these matches is small.

I will note that the above points hold whether a particular tester
claims Rybka 4 to be stronger or Houdini, Ivanhoe, etc. to be stronger.

I have seen some testing done that attempts to be accurate. The results so
far seem to be inconclusive.

Thank you for responding to me. As I said before, I find your thoughts on
certain subjects interesting. I do think, in this case, your statement that
the CCRL has no scientific significance is an imprecise statement.

Adam
Roger Brown
Posts: 782
Joined: Wed Mar 08, 2006 9:22 pm

Re: Houdini 1.03a The New Nr 1 !!!

Post by Roger Brown »

Adam Hair wrote:
[SNIP]

Thank you for responding to me. As I said before, I find your thoughts on
certain subjects interesting. I do think, in this case, your statement that
the CCRL has no scientific significance is an imprecise statement.

Adam

Hello Adam,

No, thank you for responding with class and in such detail that even an idiot such as myself could get something useful from it.

Now, if others could follow your example, think how much information, useful information, could be spread by this forum. Thanks also to Milos for giving me some food for thought as well.

To dream of a day when conversations and measured responses instead of bitter arguments (nothing wrong with a good tussle now and then!) could dominate.......

Later.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Houdini 1.03a The New Nr 1 !!!

Post by Adam Hair »

Roger Brown wrote:
Adam Hair wrote:
[SNIP]

Thank you for responding to me. As I said before, I find your thoughts on
certain subjects interesting. I do think, in this case, your statement that
the CCRL has no scientific significance is an imprecise statement.

Adam

Hello Adam,

No, thank you for responding with class and in such detail that even an idiot such as myself could get something useful from it.

Now, if others could follow your example, think how much information, useful information, could be spread by this forum. Thanks also to Milos for giving me some food for thought as well.

To dream of a day when conversations and measured responses instead of bitter arguments (nothing wrong with a good tussle now and then!) could dominate.......

Later.
Thank you very much, Roger.

The Internet is a very casual medium for many. I guess that plays a role
in the ease and frequency with which arguments occur. In my case, I tend to
struggle with putting my thoughts into words, so most of my responses
are measured (and too formal).

It would be nice to see actual debates rather than insults and ridicule.

Adam
BTO7

Re: Houdini 1.03a The New Nr 1 !!!

Post by BTO7 »

De Vos W wrote:
Graham Banks wrote:
Mark Mason wrote:Hi,

How do those results square with the ones posted here, which suggest Deep Rybka 4 tops Houdini?

http://www.talkchess.com/forum/viewtopi ... 7&start=20
Must be a Belgian conspiracy. :wink: :lol:

Conspiracy? 100 Euros for Rybka 4, but I get better customer service from free programs like Houdini 1.03a. A bug appears in Houdini and it is fixed with a new release in a few days, with an apology from the programmer for putting out a version with a bug!! Maybe for you it's a Belgian conspiracy, but for us it's just Belgian service.
Bugs are bugs, and they will show up from time to time. It is how the author reacts to the bugs that makes the difference. Does the author take care of his customers?
We don't expect Vasik Rajlich to give any of his customers their money back, but not fixing what are now known problems is just disrespectful to his customers.
This statement is spot on. Robert is a class act who cares about his customers and what he's doing... all while not making a dime. The Vas lovers I personally just feel sorry for. Vas couldn't care less... it's all about the money in that camp, and you don't even get what you pay for. Robert deserves to be on top. Rybka fanboys need to stop being sore losers ;)

Regards
BT
De Vos W
Posts: 431
Joined: Tue Dec 01, 2009 11:59 am

Re: Houdini 1.03a The New Nr 1 !!!

Post by De Vos W »

That's right!