Houdini 1.03a The New Nr 1 !!!

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, chrisw, Rebel

Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Houdini 1.03a The New Nr 1 !!!

Post by Milos »

Adam Hair wrote: Milos, you keep stating that the CCRL lists have no scientific significance,
yet you do not state why you say this. Furthermore, you also have stated
that some tests done with the Ippolit derivatives are much less flawed.
I have not seen one yet that is any better than what is done in the CCRL,
including one that I ran and posted in the Open Chess forum. Perhaps
you could state your grievances concerning CCRL methodologies, either
in a new thread or pm me. The CCRL most definitely has flaws, yet the
total disregard of its results by you seems to be due to things unrelated
to science. I have seen many outlandish statements and insults from you.
Perhaps you could dial back the antagonism and start giving thoughtful
statements. I have found several things that you have said quite interesting.
Unfortunately, too much of the time you are more interested in attacking
than in debating and pointing out the holes in other people's arguments.
Don't get offended. I'm aware you are not professionals but hobby chess enthusiasts. The problem I have is with presenting results from these lists as official despite the disclaimer that you have (I'm not saying you or Graham are doing it, but a lot of people get that impression).

Now, regarding scientific insignificance, let me just name a few flaws that come to mind (the actual list would be much longer). I will skip personal biases and even presume that none of you has any intention of distorting results.
So let's see:
1) Inequality of testing conditions

a) You do testing on various machines, most of them much stronger than what you claim is your benchmark machine. You adjust speed by running the Fritz (or Crafty, I'm not sure which) benchmark. Both are synthetic and outdated: they measure just a couple of positions that are certainly not representative of a chess game, so they don't give the right picture. You then scale the benchmark results to the TC you use. Even though this makes an engine search roughly the same number of nodes to a fixed depth as on the reference machine, changing the TC directly changes the time management (TM; I'll talk more about TCs in the later points). So all your tests are effectively done with different TM for the same engine (see the sketch after point 1c below). Moreover, various machines have different multi-core implementations and different caches, which can distort things very much in SMP engine matches.

b) You do testing on various OSes, each with different programs running in parallel, different services in the background, and different schedulers and prioritization. Moreover, I strongly doubt that all of you use those machines only for engine testing and never run other things on them at the same time.

c) You do tests under various GUIs. GUIs handle time differently, handle the UCI protocol differently, adjudicate games differently, etc. Some GUIs even have a tendency to favor their own engine.

The conclusion here: even though you make some effort to balance things, you are effectively comparing apples and oranges.
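
To make point 1a concrete, here is a minimal Python sketch (the machine speeds are invented and the time-management rule is deliberately naive; this is not CCRL's actual adjustment procedure): scaling the TC by a benchmark ratio keeps the node budget roughly constant, but it rescales every absolute time the engine's TM logic ever sees.

[code]
# Toy sketch: scaling a time control by a benchmark speed ratio.
# REFERENCE_KNPS / TESTER_KNPS are invented numbers, and the TM rule
# below is a deliberately naive "remaining / moves to go" heuristic.

REFERENCE_KNPS = 1000.0   # hypothetical speed of the reference machine
TESTER_KNPS = 2500.0      # hypothetical speed of a tester's faster machine

def scaled_tc(base_seconds: float) -> float:
    """Shrink the TC so the faster machine searches ~the same nodes."""
    return base_seconds * (REFERENCE_KNPS / TESTER_KNPS)

def naive_allocation(remaining: float, moves_to_go: int) -> float:
    """Naive time management: split the remaining time evenly."""
    return remaining / moves_to_go

base_tc = 40 * 60.0               # 40 moves in 40 minutes, in seconds
adjusted_tc = scaled_tc(base_tc)  # equivalent node budget on the fast box

# The node budget matches, but the absolute per-move times do not, so
# any TM logic with absolute thresholds (minimum think time, move
# overhead, panic time) behaves differently on each machine.
print(f"reference machine: {naive_allocation(base_tc, 40):6.1f} s/move")
print(f"tester's machine:  {naive_allocation(adjusted_tc, 40):6.1f} s/move")
[/code]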

2) You do not have a representative sample of opponents for each engine that you test; there is no methodology there. Engines are correlated, engines tend to favor some opponents and not others, and there are huge rating differences that you make no effort to balance. So you introduce even more noise than would be the case if you tested against, e.g., the same 3 standard opponents for every engine (a small sketch of the mismatch effect follows below).
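
A minimal sketch of the mismatch effect, in Python, using only the standard logistic Elo expectation (draws are ignored and the rating gaps are arbitrary examples): the per-game variance, a rough proxy for how much one game tells you, collapses as the gap grows.

[code]
# Why lopsided pairings carry little information, under the standard
# logistic Elo model. Gaps are arbitrary examples; draws are ignored.

def expected_score(elo_diff: float) -> float:
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

for gap in (0, 100, 300, 500):
    p = expected_score(gap)
    # Treating a game as a Bernoulli trial, its variance p*(1-p) is a
    # rough proxy for the information one game carries; it peaks at p=0.5.
    print(f"gap {gap:>3} Elo: expected score {p:.3f}, per-game variance {p * (1 - p):.3f}")
[/code]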

3) You use a general book. This is far worse than using a large set of start positions. You think that with a general book and many games you'll get a good distribution; instead you get a lot of openings of the same type, not enough diversity, and certainly not a representative sample of computer chess. This introduces even more noise.

4) You use EGTBs. In this way you disfavor engines that don't use tablebases, or that use other formats of them.

5) Your reference machine, an Athlon 64 X2 4600+ (2.4 GHz), is way too weak, and what you claim is long TC testing (40/40) is effectively blitz on today's most powerful machines.

6) Tournament TCs might not be the best choice of TC. Even though 40/40 is the FIDE-style standard, that doesn't mean it is beneficial for computer chess. TCs that distribute time more evenly across moves are much better for engine testing, since they capture more of the raw engine strength (which is what really matters for analysis). So incremental TCs are better, and the best would be to use no TC at all and instead a fixed time per move (compare the toy allocation sketch below).
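
A toy Python comparison of the two schemes (the front-loading time-management rule is invented for illustration, not any real engine's): under a repeating 40/40 TC the per-move time drifts considerably, while a fixed time per move spends the same budget perfectly evenly.

[code]
# Toy comparison (invented TM rule, not any real engine's) of per-move
# thinking time under a repeating 40/40 TC versus a fixed time per move.

def repeating_tc_times(total: float = 2400.0, moves: int = 40) -> list[float]:
    """Toy TM: allocate remaining/moves_to_go, but spend 50% extra on
    the first 15 moves, as engines often front-load the middlegame."""
    times, remaining = [], total
    for move in range(moves):
        base = remaining / (moves - move)
        spend = min(base * (1.5 if move < 15 else 1.0), remaining)
        times.append(spend)
        remaining -= spend
    return times

uneven = repeating_tc_times()
fixed = [2400.0 / 40] * 40   # same total budget, fixed time per move

print(f"repeating 40/40: min {min(uneven):.1f}s, max {max(uneven):.1f}s per move")
print(f"fixed per move:  every move gets {fixed[0]:.1f}s")
[/code]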

7) You play way too few games. BayesElo computes error margins assuming ideal testing conditions: the same machine, the same OS, and the same GUI for the whole tournament, so that the noise is Gaussian and comes only from small uncertainties in the CPU, the OS, and SMP implementations. In your case there are many more sources of noise (points 1, 2 and 3), some of them not even Gaussian, and the real error margins are a couple of times bigger than what you show. To claim valid results under your testing conditions you would have to run one to two orders of magnitude more games: at least tens of thousands, if not hundreds of thousands (see the margin sketch below).
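
To put rough numbers on this, a back-of-the-envelope Python sketch (plain binomial statistics at a ~50% score, draws ignored; this is not BayesElo's actual model, and the 2x noise factor is an arbitrary illustration): margins shrink only with the square root of the game count, and any extra variance scales them straight back up.

[code]
# Back-of-the-envelope Elo error margins from binomial statistics.
# Not BayesElo's model; draws ignored; noise_factor is illustrative.
import math

def elo_margin(games: int, noise_factor: float = 1.0, z: float = 1.96) -> float:
    """Approximate 95% Elo margin for a ~50% score over `games` games.
    `noise_factor` crudely models extra (non-ideal) variance sources."""
    se_score = noise_factor * math.sqrt(0.25 / games)  # binomial SE at p=0.5
    # d(score)/d(Elo) at equal strength is ln(10)/1600 per Elo point,
    # so dividing the score margin by it converts to an Elo margin.
    return z * se_score * 1600.0 / math.log(10.0)

for n in (1000, 10000, 100000):
    ideal = elo_margin(n)
    noisy = elo_margin(n, noise_factor=2.0)
    print(f"{n:>6} games: \u00b1{ideal:.1f} Elo ideal, \u00b1{noisy:.1f} Elo with 2x noise")
[/code]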
Graham Banks
Posts: 42757
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: Houdini 1.03a The New Nr 1 !!!

Post by Graham Banks »

frcha wrote: I think the main flaw is the exclusion of certain engines... The topic of this thread (Houdini is the new #1) cannot be answered by looking at the CCRL, since Houdini is not allowed there.
Still, the origin of the Ippolit engines is unclear, and I guess you will only include them if they are proven NOT to be derivatives!
One can find flaws or dislikes with any rating list, so perfection is an unattainable goal.

You're right about Ippo and co. We've discussed them internally and decided not to test them officially.

Another thing to consider is this. Even if we had decided to test them, new versions seem to come out every other day, so it would be a waste of time trying to keep up. :wink:

Cheers,
Graham.
gbanksnz at gmail.com
Graham Banks
Posts: 42757
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: Houdini 1.03a The New Nr 1 !!!

Post by Graham Banks »

Milos wrote:
[full quote of Adam Hair's post and Milos's methodology critique above snipped]
Although you have made some valid points, it's still difficult to argue against the fact that all the various rating lists show a similar correlation in the comparative ratings of engines.
gbanksnz at gmail.com
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Houdini 1.03a The New Nr 1 !!!

Post by Milos »

Graham Banks wrote:Although you have made some valid points, it's still difficult to argue against the fact that all the various rating lists show a similar correlation in the comparative ratings of engines.
What you call correlation is just the ordering on the rating lists. Regarding, for example, the absolute values of the Elo differences between engines (note I'm not talking about rating-list offsets, but about Elo differences between engines, which should be universal under the same conditions), there's almost no correlation at all between the various lists.

Just to simplify things for you guys: your lists can be used for detecting differences between engines that are more than 30 Elo points apart. For anything else they are useless. (A rough check of the 30 Elo figure is sketched below.)
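
For what it's worth, a rough binomial check of that figure in Python (draws ignored, ideal single-machine conditions assumed; with the inflated noise described under point 7 earlier, the game counts would be several times higher):

[code]
# How many games before an Elo edge clears a 95% margin?
# Simple binomial approximation; draws ignored; ideal conditions.
import math

def expected_score(elo_diff: float) -> float:
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_to_resolve(elo_diff: float, z: float = 1.96) -> int:
    """Smallest N where the z*SE margin falls below the score edge.
    Needs z * sqrt(p*(1-p)/N) < p - 0.5, i.e. N > (z/edge)^2 * p*(1-p)."""
    p = expected_score(elo_diff)
    edge = p - 0.5
    return math.ceil((z / edge) ** 2 * p * (1.0 - p))

for diff in (10, 30, 100):
    print(f"{diff:>3} Elo: ~{games_to_resolve(diff)} games")
[/code]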
Graham Banks
Posts: 42757
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: Houdini 1.03a The New Nr 1 !!!

Post by Graham Banks »

Milos wrote:
Graham Banks wrote:Although you have made some valid points, it's still difficult to argue against the fact that all the various rating lists show a similar correlation in the comparative ratings of engines.
What you call correlation is just the ordering on the rating lists. Regarding, for example, the absolute values of the Elo differences between engines (note I'm not talking about rating-list offsets, but about Elo differences between engines, which should be universal under the same conditions), there's almost no correlation at all between the various lists.
There will always be margins of error. Even taking those into account, the various rating lists give a similar picture of relative engine strength.
However, I realise that rating lists aren't everybody's cup of tea, so there's no need for those people to bother looking at them. :)
gbanksnz at gmail.com
Graham Banks
Posts: 42757
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: Houdini 1.03a The New Nr 1 !!!

Post by Graham Banks »

Milos wrote:..... For anything else they are useless.
One could use the same argument against wannabe programmers trying to rip off closed source engines. :wink:
gbanksnz at gmail.com
Chan Rasjid
Posts: 588
Joined: Thu Mar 09, 2006 4:47 pm
Location: Singapore

Re: Houdini 1.03a The New Nr 1 !!!

Post by Chan Rasjid »

frcha wrote: Still, the origin of the Ippolit engines is unclear, and I guess you will only include them if they are proven NOT to be derivatives!
Houdini, by any other name, would play the same.
Eastendboy

Re: Houdini 1.03a The New Nr 1 !!!

Post by Eastendboy »

Graham Banks wrote:
[nested quotes of the earlier exchange snipped]
There will always be margins of error. Even taking those into account, the various rating lists give a similar picture of relative engine strength.
However, I realise that rating lists aren't everybody's cup of tea, so there's no need for those people to bother looking at them. :)
You can't argue with them, Graham. When the CCRL rating list displays information they agree with, the methods are more than adequate; when it doesn't, the methods are terrible and the half a million games and countless CPU cycles of the last decade mean nothing.

I would argue that the single most important factor for any testing regime is consistency. Even if the methodology contains flaws, consistency over time gives a very accurate picture of where each engine sits in comparison to the other engines on the list, and in that regard the CCRL has, by all appearances, done an excellent job. Judging by the number of engine authors who are eager to have their engines tested, I think it's safe to say that I'm not alone in my thinking.

There's no glory in testing engines. Only higher electric bills and hardware that has a shorter-than-average life expectancy. As with most things in life, people are quick to complain and slow to help. Ignore that type of person and rest easy knowing that your work is appreciated very much by the community!
De Vos W
Posts: 431
Joined: Tue Dec 01, 2009 11:59 am

Re: Houdini 1.03a The New Nr 1 !!!

Post by De Vos W »

Myamoto Musashi wrote:


There's no glory in testing engines. Only higher electric bills and hardware that has a shorter-than-average life expectancy. As with most things in life, people are quick to complain and slow to help. Ignore that type of person and rest easy knowing that your work is appreciated very much by the community!

Don't speak in the name of the computer chess community! How the hell
can you know what everybody feels and thinks?
Guenther
Posts: 4718
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: Houdini 1.03a The New Nr 1 !!!

Post by Guenther »

De Vos W wrote:Myamoto Musashi wrote:

There's no glory in testing engines. Only higher electric bills and hardware that has a shorter-than-average life expectancy. As with most things in life, people are quick to complain and slow to help. Ignore that type of person and rest easy knowing that your work is appreciated very much by the community!
Don't speak in the name of the computer chess community! How the hell
can you know what everybody feels and thinks?
Don't be silly, he just gave Graham some very good advice. Nowhere did he
say what you read into his words...