44 elo swing depending on hardware!

kranium
Posts: 2130
Joined: Thu May 29, 2008 10:43 am

Re: 44 elo swing depending on hardware!

Post by kranium »

lkaufman wrote:
kranium wrote:
ThatsIt wrote:
pohl4711 wrote: [...snip...]
It's all about playing a lot, lot, lot of games! And with 5'+3'' most
people don't play a lot of games, but only 100 or 150 in the
head-to-head competition. And that's obviously not enough.
[...snip...]
Best - Stefan
Modesty is not your thing, is it?
My view:
better 50, 100 or 150 games with 5'+3" than thousands of ultra bullet scrap !
the top engine's ratings in Stefan's Lightspeed list match fairly closely with CEGT, and other lists

but Stefan's list has a couple big advantages:

it's unbiased and all-inclusive
and more importantly: he plays enough games to achieve a high level of accuracy
(an error margin of +-5 ELO compared to CEGT 40/20's +-15 ELO or more)
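To put rough numbers on how the error margin shrinks with more games, here is a minimal sketch. The 50% draw ratio, the evenly matched engines, the 95% normal approximation, and the game counts of 1000 and 9000 are all my assumptions for illustration, not figures taken from the LS or CEGT lists.

```python
import math

def elo_from_score(score):
    """Convert a score fraction (0 < score < 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_margin(games, draw_ratio=0.5, score=0.5, z=1.96):
    """Approximate error margin in Elo for a match of `games` games.

    draw_ratio, score and z are assumed values, not measured ones.
    """
    wins = score - draw_ratio / 2.0
    losses = 1.0 - draw_ratio - wins
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (wins * (1.0 - score) ** 2
                + draw_ratio * (0.5 - score) ** 2
                + losses * score ** 2)
    std_err = math.sqrt(variance / games)
    upper = elo_from_score(score + z * std_err)
    lower = elo_from_score(score - z * std_err)
    return (upper - lower) / 2.0

for n in (1000, 9000):
    print(n, "games: about +-%.1f Elo" % elo_margin(n))
```

Under these assumptions about 1000 games gives roughly +-15 Elo and about 9000 games gives roughly +-5 Elo, so cutting the margin to a third costs about nine times as many games.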

unlike CEGT, he does not play the games on different hardware and simply combine the results
(which may result in big ELO swing, as Larry points out in this topic)

that said, i'm not surprised to see just how popular his site has become...
and it may begin to explain your animosity towards him
Now that I have the hardware, I'm planning to get an answer once and for all to the question of whether bullet chess (like the LS list) correlates well with blitz lists (like IPON and now the 5' + 3" CEGT list). I'm running a gauntlet for the new Komodo (against five top engines) at 2' + 1" (HT off, same book as LS uses, 36 cores running on it so 36 games at once). When I'm done, I'll cut the time in half and repeat, and if time permits I'll do 4' + 2". I'll have enough games to be able to say once and for all how valid bullet testing is, if the goal is to predict results at 5' + 3" or so. Although I've often said that I think bullet testing favors Ippo-related engines, I'm open-minded; if the results show otherwise I won't hesitate to admit I was wrong. Actually it would be very good news for the computer chess community if I am wrong, because it means that we can get much more reliable sample sizes just by playing faster games.
So far my result (for TCEC stage 3 version) against Houdini 3 is 47.1% out of 1900 games, about 20 elo down. If there really is no difference in relative strength of engines at different levels, I would expect something like 48% at 4'+2" and 46% at 1' + 30". The percentage should asymptotically approach 50% at super long time controls. But I claim that there is some reasonable level where Komodo actually will score over 50% in a long match. Maybe this will shed some light on the question. I may actually just run a fairly slow match on my quad to see if I get a plus score.
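For anyone who wants to check the percentages above, here is a small sketch that converts scores to Elo differences with the standard logistic formula. The worst-case interval assumes no draws at all and a 95% normal approximation; both are my simplifications, since the actual draw count of the 1900 games is not given here.

```python
import math

def elo_diff(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def expected_score(elo):
    """Inverse of elo_diff: expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def worst_case_interval(score, games, z=1.96):
    """Score interval assuming no draws, which gives the largest possible variance."""
    std_err = math.sqrt(score * (1.0 - score) / games)
    return score - z * std_err, score + z * std_err

# Scores mentioned in the post: measured 47.1%, guessed 48% and 46%.
for pct in (0.471, 0.48, 0.46):
    print("%.1f%% -> %.1f Elo" % (pct * 100, elo_diff(pct)))

low, high = worst_case_interval(0.471, 1900)
print("worst-case 95%% interval: %.1f%% to %.1f%%" % (low * 100, high * 100))
```

The 47.1% score comes out at about 20 Elo down, matching the figure quoted. Even with the pessimistic no-draw variance the upper end of the interval stays below 50%; real games include draws, which only narrows it.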

Larry,
fact is: CEGT has recently gotten your wholehearted thumbs up as they enthusiastically adopted your preferred TC...
but they use various hardware and LS provides a consistent hardware platform

so, i fail to understand why this/your topic: "44 elo swing depending on hardware!"
is now turning into a referendum on the validity of lightning speeds (i.e. -> the LS rating list)
(which have been used with great success by engine developers for many many years, Bob H., Vas R. especially)

it seems especially inappropriate after the severe LS put-downs by CEGT and IPON, (whom you have recently praised)
lkaufman wrote: Actually it would be very good news for the computer chess community if I am wrong, because it means that we can get much more reliable sample sizes just by playing faster games.
you're a great guy, and i believe you are fair (despite a commercial conflict of interest)...
but with all due respect (and you deserve a lot), IMO, your individual tests != empirical truth for the CC community on this issue

Norm
lkaufman
Posts: 6284
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: 44 elo swing depending on hardware!

Post by lkaufman »

kranium wrote: Larry,
fact is: CEGT has recently gotten your wholehearted thumbs up as they enthusiastically adopted your preferred TC...
but they use various hardware and LS provides a consistent hardware platform

so, i fail to understand why this/your topic: "44 elo swing depending on hardware!"
is now turning into a referendum on the validity of lightning speeds (i.e. -> the LS rating list)
(which have been used with great success by engine developers for many many years, Bob H., Vas R. especially)

it seems especially inappropriate after the severe LS put-downs by CEGT and IPON, (whom you have recently praised)
lkaufman wrote: Actually it would be very good news for the computer chess community if I am wrong, because it means that we can get much more reliable sample sizes just by playing faster games.
you're a great guy, and i believe you are fair (despite a commercial conflict of interest)...
but with all due respect (and you deserve a lot), IMO, your individual tests != empirical truth for the CC community on this issue

Norm
The original topic of the thread has been answered to my satisfaction -- turning off HT solved the problem. Maybe the new topic should have been in a separate thread. I think both the IPON/CEGT 5'+3" lists and the LS list are very valuable; the first because it is the most reliable for comparing unrelated engines, and the second because it is the most reliable for comparing engines of the same family. These tests should determine for me whether LS level lists are reliable for unrelated engines; it's up to each person to believe whatever conclusion I reach or not. On the topic at issue, if it turns out that I'm wrong and LS level testing is valid for unrelated engines, I would have no reason to favor the slower lists with smaller samples.
kranium
Posts: 2130
Joined: Thu May 29, 2008 10:43 am

Re: 44 elo swing depending on hardware!

Post by kranium »

lkaufman wrote:
kranium wrote: Larry,
fact is: CEGT has recently gotten your wholehearted thumbs up as they enthusiastically adopted your preferred TC...
but they use various hardware and LS provides a consistent hardware platform

so, i fail to understand why this/your topic: "44 elo swing depending on hardware!"
is now turning into a referendum on the validity of lightning speeds (i.e. -> the LS rating list)
(which have been used with great success by engine developers for many many years, Bob H., Vas R. especially)

it seems especially inappropriate after the severe LS put-downs by CEGT and IPON, (whom you have recently praised)
lkaufman wrote: Actually it would be very good news for the computer chess community if I am wrong, because it means that we can get much more reliable sample sizes just by playing faster games.
you're a great guy, and i believe you are fair (despite a commercial conflict of interest)...
but with all due respect (and you deserve a lot), IMO, your individual tests != empirical truth for the CC community on this issue

Norm
The original topic of the thread has been answered to my satisfaction -- turning off HT solved the problem. Maybe the new topic should have been in a separate thread. I think both the IPON/CEGT 5'+3" lists and the LS list are very valuable; the first because it is the most reliable for comparing unrelated engines, and the second because it is the most reliable for comparing engines of the same family. These tests should determine for me whether LS level lists are reliable for unrelated engines; it's up to each person to believe whatever conclusion I reach or not. On the topic at issue, if it turns out that I'm wrong and LS level testing is valid for unrelated engines, I would have no reason to favor the slower lists with smaller samples.
ultra-fast testing methodology (which includes Stefan's Lightspeed list) has been discussed at length (here and elsewhere) for many years and i believe has been proven effective

IMO Marc Lacrosse's initial work was instrumental, but unfortunately he died unexpectedly at age 48

maybe these links are interesting for you:
http://chessprogramming.wikispaces.com/Marc+Lacrosse
https://sites.google.com/site/chessbaza ... mlmfl-test

and a discussion here from 2009:
http://talkchess.com/forum/viewtopic.ph ... at&start=7
kranium
Posts: 2130
Joined: Thu May 29, 2008 10:43 am

Re: 44 elo swing depending on hardware!

Post by kranium »

keeping in mind Moore's law:
"over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years"

the fact is:
because today's inexpensive and powerful multicore CPUs are so fast, and allow engines to search so much deeper (than in 2006!),
ultra-fast testing methodology is becoming more and more relevant

instead of being unfairly criticized by the outdated and entrenched establishment,
Stefan should be congratulated for recognizing the fact and incorporating it into a rating list
lkaufman
Posts: 6284
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: 44 elo swing depending on hardware!

Post by lkaufman »

kranium wrote:
lkaufman wrote:
kranium wrote: Larry,
fact is: CEGT has recently gotten your wholehearted thumbs up as they enthusiastically adopted your preferred TC...
but they use various hardware and LS provides a consistent hardware platform

so, i fail to understand why this/your topic: "44 elo swing depending on hardware!"
is now turning into a referendum on the validity of lightning speeds (i.e. -> the LS rating list)
(which have been used with great success by engine developers for many many years, Bob H., Vas R. especially)

it seems especially inappropriate after the severe LS put-downs by CEGT and IPON, (whom you have recently praised)
lkaufman wrote: Actually it would be very good news for the computer chess community if I am wrong, because it means that we can get much more reliable sample sizes just by playing faster games.
you're a great guy, and i believe you are fair (despite a commercial conflict of interest)...
but with all due respect (and you deserve a lot), IMO, your individual tests != empirical truth for the CC community on this issue

Norm
The original topic of the thread has been answered to my satisfaction -- turning off HT solved the problem. Maybe the new topic should have been in a separate thread. I think both the IPON/CEGT 5'+3" lists and the LS list are very valuable; the first because it is the most reliable for comparing unrelated engines, and the second because it is the most reliable for comparing engines of the same family. These tests should determine for me whether LS level lists are reliable for unrelated engines; it's up to each person to believe whatever conclusion I reach or not. On the topic at issue, if it turns out that I'm wrong and LS level testing is valid for unrelated engines, I would have no reason to favor the slower lists with smaller samples.
ultra-fast testing methodology (which includes Stefan's Lightspeed list) has been discussed at length (here and elsewhere) for many years and i believe has been proven effective

IMO Marc Lacrosse's initial work was instrumental, but unfortunately he died unexpectedly at age 48

maybe these links are interesting for you:
http://chessprogramming.wikispaces.com/Marc+Lacrosse
https://sites.google.com/site/chessbaza ... mlmfl-test

and a discussion here from 2009:
http://talkchess.com/forum/viewtopic.php?t=28130&postdays=0&postorder=asc&topic_view=flat&start=7
The last reference is about the validity of very fast testing for measuring program improvements, with which I fully agree; we rely on it ourselves. That is the biggest value of the LS list; it gives a quick and accurate measure of program improvement over a previous version. The second reference is simply too old, quoting a study that ended with Rybka 1. It is my contention that the massive pruning introduced in Rybka 2 and 3, and then adopted by Ippolit, Stockfish, Komodo etc., was a game-changer. Previously the programs searched in pretty much the same way, but now there are great differences in the search, which may favor short or long time controls. It is my contention, with which many but by no means all agree, that Ippolit is especially strong at very fast chess due to something about its pruning, and that therefore very fast testing makes Ippo and all its derivatives and relatives look better than they "really" are. There is certainly no reason just to assume that totally unrelated engines scale equally well with more time. How strong this effect is I hope to determine. It may turn out that I have overestimated it.
Graham Banks
Posts: 45244
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: 44 elo swing depending on hardware!

Post by Graham Banks »

kranium wrote:keeping in mind Moore's law:
"over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years"

the fact is:
because today's inexpensive and powerful multicore CPUs are so fast, and allow engines to search so much deeper (than in 2006!),
ultra-fast testing methodology is becoming more and more relevant

instead of being unfairly criticized by the outdated and entrenched establishment,
Stefan should be congratulated for recognizing the fact and incorporating it into a rating list
Keep in mind that some of us prefer longer time controls because we can enjoy watching and gathering games of a better quality.

Of course, whatever time control people choose to test at is entirely their choice, just as others are free to either embrace or ignore what they do.
There doesn't tend to be too much difference in comparative ratings for most engines.
gbanksnz at gmail.com
kranium
Posts: 2130
Joined: Thu May 29, 2008 10:43 am

Re: 44 elo swing depending on hardware!

Post by kranium »

Graham Banks wrote:
kranium wrote:keeping in mind Moore's law:
"over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years"

the fact is:
because today's inexpensive and powerful multicore CPUs are so fast, and allow engines to search so much deeper (than in 2006!),
ultra-fast testing methodology is becoming more and more relevant

instead of being unfairly criticized by the outdated and entrenched establishment,
Stefan should be congratulated for recognizing the fact and incorporating it into a rating list
Keep in mind that some of us prefer longer time controls because we can enjoy watching and gathering games of a better quality.

Of course, whatever time control people choose to test at is entirely their choice, just as others are free to either embrace or ignore what they do.
There doesn't tend to be too much difference in comparative ratings for most engines.
agreed...

ultra fast TCs (less than 1 minute) are great for testing (provided # of games is large enough), but probably not appropriate from a human entertainment perspective

but with the power of PCs today, the quality of play is still relevant...and this is improving daily

i.e. could the top engines today with 3 minutes per game (on a really fast multi core system) beat most grandmasters (w/ oodles of time)?

my guess is: more than likely (IMO it's well beyond the point where GMs have any chance)
pohl4711
Posts: 2900
Joined: Sat Sep 03, 2011 7:25 am
Location: Berlin, Germany
Full name: Stefan Pohl

Re: 44 elo swing depending on hardware!

Post by pohl4711 »

lkaufman wrote:
Now that I have the hardware, I'm planning to get an answer once and for all to the question of whether bullet chess (like the LS list) correlates well with blitz lists (like IPON and now the 5' + 3" CEGT list). I'm running a gauntlet for the new Komodo (against five top engines) at 2' + 1" (HT off, same book as LS uses, 36 cores running on it so 36 games at once). When I'm done, I'll cut the time in half and repeat, and if time permits I'll do 4' + 2". I'll have enough games to be able to say once and for all how valid bullet testing is, if the goal is to predict results at 5' + 3" or so. Although I've often said that I think bullet testing favors Ippo-related engines, I'm open-minded; if the results show otherwise I won't hesitate to admit I was wrong. Actually it would be very good news for the computer chess community if I am wrong, because it means that we can get much more reliable sample sizes just by playing faster games.
So far my result (for TCEC stage 3 version) against Houdini 3 is 47.1% out of 1900 games, about 20 elo down. If there really is no difference in relative strength of engines at different levels, I would expect something like 48% at 4'+2" and 46% at 1' + 30". The percentage should asymptotically approach 50% at super long time controls. But I claim that there is some reasonable level where Komodo actually will score over 50% in a long match. Maybe this will shed some light on the question. I may actually just run a fairly slow match on my quad to see if I get a plus score.
A good idea. But before doing so, you should check out the excellent testwork of Andreas Strangmüller: http://www.fastgm.de. Perhaps you find all answers there?

What we see there is that with longer thinking times, the gap between the first and last positions in a rating list gets smaller. That happens because the draw ratio increases with longer thinking time, so head-to-head results move closer to 50%.
But we also see that in all 3 rating lists Houdini 3 is number 1 and Komodo CCT is number 2. (I ignore the list with 3.75''+0.0375'' because its thinking time is too short; at such short times, Windows system operations or engine initialization can distort the results.) Only Stockfish climbs a little with more time:

http://www.fastgm.de/15+0.15.html / http://www.fastgm.de/60+0.60.html / http://www.fastgm.de/240+2.40.html
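The draw-ratio effect described above can be illustrated with a toy model. This is only a sketch under my own assumptions: the stronger engine is assumed to win a fixed 60% of the decisive games, and the draw ratios are made-up values, not numbers from the fastgm.de lists.

```python
import math

def score_with_draws(decisive_win_share, draw_ratio):
    """Overall score when a fraction `draw_ratio` of games is drawn.

    decisive_win_share is the (assumed constant) share of decisive games
    won by the stronger engine.
    """
    return 0.5 * draw_ratio + (1.0 - draw_ratio) * decisive_win_share

def elo_diff(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Hypothetical draw ratios, rising as the thinking time gets longer.
for draw_ratio in (0.3, 0.5, 0.7):
    score = score_with_draws(0.6, draw_ratio)
    print("draws %.0f%%: score %.1f%%, %.1f Elo"
          % (draw_ratio * 100, score * 100, elo_diff(score)))
```

With the win share among decisive games held constant, more draws pull the score toward 50% and compress the Elo gap, while the order of the engines stays the same.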

Stefan
ThatsIt
Posts: 992
Joined: Thu Mar 09, 2006 2:11 pm

Re: 44 elo swing depending on hardware!

Post by ThatsIt »

I wrote my own view, not our or CEGT's view,
and therefore there is no need to claim quirky things like this:
"unlike CEGT, he does not play the games on different hardware
and simply combine the results (which may result in big ELO swing,
as Larry points out in this topic) that said, i'm not surprised to
see just how popular his site has become...
and it may begin to explain your animosity towards him"


G.S.
beram
Posts: 1187
Joined: Wed Jan 06, 2010 3:11 pm

Re: 44 elo swing depending on hardware!

Post by beram »

pohl4711 wrote:
lkaufman wrote:
Now that I have the hardware, I'm planning to get an answer once and for all to the question of whether bullet chess (like the LS list) correlates well with blitz lists (like IPON and now the 5' + 3" CEGT list). I'm running a gauntlet for the new Komodo (against five top engines) at 2' + 1" (HT off, same book as LS uses, 36 cores running on it so 36 games at once). When I'm done, I'll cut the time in half and repeat, and if time permits I'll do 4' + 2". I'll have enough games to be able to say once and for all how valid bullet testing is, if the goal is to predict results at 5' + 3" or so. Although I've often said that I think bullet testing favors Ippo-related engines, I'm open-minded; if the results show otherwise I won't hesitate to admit I was wrong. Actually it would be very good news for the computer chess community if I am wrong, because it means that we can get much more reliable sample sizes just by playing faster games.
So far my result (for TCEC stage 3 version) against Houdini 3 is 47.1% out of 1900 games, about 20 elo down. If there really is no difference in relative strength of engines at different levels, I would expect something like 48% at 4'+2" and 46% at 1' + 30". The percentage should asymptotically approach 50% at super long time controls. But I claim that there is some reasonable level where Komodo actually will score over 50% in a long match. Maybe this will shed some light on the question. I may actually just run a fairly slow match on my quad to see if I get a plus score.
A good idea. But before doing so, you should check out the excellent testwork of Andreas Strangmüller: http://www.fastgm.de. Perhaps you find all answers there?

What we see there is that with longer thinking times, the gap between the first and last positions in a rating list gets smaller. That happens because the draw ratio increases with longer thinking time, so head-to-head results move closer to 50%.
But we also see that in all 3 rating lists Houdini 3 is number 1 and Komodo CCT is number 2. (I ignore the list with 3.75''+0.0375'' because its thinking time is too short; at such short times, Windows system operations or engine initialization can distort the results.) Only Stockfish climbs a little with more time:

http://www.fastgm.de/15+0.15.html / http://www.fastgm.de/60+0.60.html / http://www.fastgm.de/240+2.40.html

Stefan
I agree completely with you, Stefan.
And I have said and discussed this, in other words, with Larry before.
But he won't agree and keeps saying that he wants to examine this or that by himself, each time with different arguments, over and over again.
Bottom line: there is no such thing as Komodo playing better than Houdini at longer time controls.