Scaling Study

Discussion of computer chess matches and engine tournaments.

Moderator: Ras

beram
Posts: 1187
Joined: Wed Jan 06, 2010 3:11 pm

Re: Scaling Study

Post by beram »

Don wrote:
beram wrote:
Modern Times wrote:Most rating lists don't have the large numbers of games that Larry has above.
I'm also confident in Larry's results now that he has HT off on his machines.

Those results are certainly very interesting, but it is just one opponent (Komodo vs Houdini) so it doesn't tell us much. If he repeated those same tests with Komodo vs Stockfish, that would be fascinating, I think.

Ray,
It is true that these lists don't have very large numbers of games
But one list that does have these numbers is the LS list, and there it is 547-453 after 1000 games at bullet speed, i.e. 54.7% for Houdini against Komodo 6
The latest CCRL 40/40 4cpu list (26 Oct.) has 124 games between them with a score of 54% for Houdini 3 4cpu (on 1 cpu 54.7%, from just 32 games)
If Larry's theory of an 8 Elo increase per doubling were true, then this would be a very, very strange anomaly in the results between those opponents
As far as we can observe from these lists at the moment, it just makes no sense

Besides that, you are right that more tests against other opponents are needed
A huge problem with the LS list is that multiple dev versions of Stockfish are included. That's not very scientific because it leads to statistical anomalies.

The reason is that if several versions of any given program are in the same list, then it's not likely to be the best one that is on top; instead it will be the luckiest one that is on top. If you submit 3 or 4 more, one of them will likely end up ahead of Houdini even if there is no improvement at all, simply because of statistical anomaly. It's sort of like flipping a coin and, if you lose, saying, "let's try again." Sooner or later you will win the coin flip.

There should be a rule that you are not allowed to have more than 1 version every 4 months or something like that, and that if you do, all the others should not be reported. It's just not right to do this. I see this in other lists where they rate the same program twice, sometimes using different "modes" such as "Houdini tactical" along with Houdini normal. Because Houdini tactical is significantly weaker it's probably no big deal - but it does represent 2 opportunities to be on top, which only certain programs get. If I could put 10 versions of Komodo on the lists (and they all had very minor changes), people would pick the one on top and ignore the rest, and I would get the benefit of the sampling noise which other programs don't get.

Playing a huge number of games helps mitigate this problem, but the error margins are still given as 5 ELO and the "true" error margin is higher since this is not a controlled study with a fixed and stated number of games to be played.

The Ipon test was the only test I believed to be run scientifically and correctly but it suffered from very low sample size. Even though that was eventually improved upon it was still a problem but at least it did not give some program multiple chances to defeat the error margins.
Well Don, that is why Stephan has also created this list:
http://ls-ratinglist.beepworld.de/ls-to ... nament.htm
and there is not much difference in the top three there
in fact the difference between SF 2210 and Komodo 6 is even bigger: 9 Elo instead of 7
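
Just to put those percentages on the Elo scale, here is a rough Python sketch (it assumes the standard logistic Elo model; the inputs are simply the scores quoted above):

Code:
import math

def elo_diff(score):
    # Elo difference implied by a score fraction under the logistic model
    return -400.0 * math.log10(1.0 / score - 1.0)

print(round(elo_diff(547 / 1000.0)))  # LS list, 1000 bullet games: about 33 Elo for Houdini
print(round(elo_diff(0.54)))          # CCRL 40/40 4cpu, 124 games: about 28 Elo

So both samples put Houdini roughly 28-33 Elo ahead of Komodo 6 at those fast settings.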
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Scaling Study

Post by Don »

beram wrote:
Don wrote:
beram wrote:
Modern Times wrote:Most rating lists don't have the large numbers of games that Larry has above.
I'm also confident in Larry's results now that he has HT off on his machines.

Those results are certainly very interesting, but it is just one opponent (Komodo vs Houdini) so it doesn't tell us much. If he repeated those same tests with Komodo vs Stockfish, that would be fascinating, I think.

Ray,
It is true that these lists don't have very large numbers of games
But one list that does have these numbers is the LS list, and there it is 547-453 after 1000 games at bullet speed, i.e. 54.7% for Houdini against Komodo 6
The latest CCRL 40/40 4cpu list (26 Oct.) has 124 games between them with a score of 54% for Houdini 3 4cpu (on 1 cpu 54.7%, from just 32 games)
If Larry's theory of an 8 Elo increase per doubling were true, then this would be a very, very strange anomaly in the results between those opponents
As far as we can observe from these lists at the moment, it just makes no sense

Besides that, you are right that more tests against other opponents are needed
A huge problem with the LS list is that multiple dev versions of Stockfish are included. That's not very scientific because it leads to statistical anomalies.

The reason is that if several versions of any given program are in the same list, then it's not likely to be the best one that is on top; instead it will be the luckiest one that is on top. If you submit 3 or 4 more, one of them will likely end up ahead of Houdini even if there is no improvement at all, simply because of statistical anomaly. It's sort of like flipping a coin and, if you lose, saying, "let's try again." Sooner or later you will win the coin flip.

There should be a rule that you are not allowed to have more than 1 version every 4 months or something like that, and that if you do, all the others should not be reported. It's just not right to do this. I see this in other lists where they rate the same program twice, sometimes using different "modes" such as "Houdini tactical" along with Houdini normal. Because Houdini tactical is significantly weaker it's probably no big deal - but it does represent 2 opportunities to be on top, which only certain programs get. If I could put 10 versions of Komodo on the lists (and they all had very minor changes), people would pick the one on top and ignore the rest, and I would get the benefit of the sampling noise which other programs don't get.

Playing a huge number of games helps mitigate this problem, but the error margins are still given as 5 ELO and the "true" error margin is higher since this is not a controlled study with a fixed and stated number of games to be played.

The Ipon test was the only test I believed to be run scientifically and correctly but it suffered from very low sample size. Even though that was eventually improved upon it was still a problem but at least it did not give some program multiple chances to defeat the error margins.
Well Don, that is why Stephan has also created this list:
http://ls-ratinglist.beepworld.de/ls-to ... nament.htm
and there is not much difference in the top three there
in fact the difference between SF 2210 and Komodo 6 is even bigger: 9 Elo instead of 7
I'm glad Stephan has taken what I said to heart, however ....

This new list is based on the same data and puts the highest-scoring SF first, so just pulling out the lower-scoring versions doesn't change anything at all.

The point isn't whether the list is correct at any given moment, it's that it's not scientifically valid when done in a slipshod manner. Maybe it IS correct, maybe it isn't but just showing a different version of the list doesn't prove anything.

The way it should work is that if a new version of a program is added to the list, it should not be reported until 50,000 games are played (or some strictly set number that is decided in advance) and it has proved to be stronger by well over the error margin, perhaps by 2 times the margin. This protects against just submitting the same version (or something very close) over and over again until one gets lucky.
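
As a rough illustration of why fixing the number of games in advance matters, here is a back-of-the-envelope Python sketch (it assumes independent games, an evenly matched pair and a 40% draw rate; those are only assumptions for the example):

Code:
import math

def elo_error_margin(games, draw_rate=0.4, z=1.96):
    # approximate 95% error margin in Elo for two engines of equal strength
    var_per_game = (1.0 - draw_rate) / 4.0       # variance of a single game's score
    se_score = math.sqrt(var_per_game / games)   # standard error of the mean score
    elo_per_score = 400.0 / math.log(10) / 0.25  # slope of the Elo curve at a 50% score
    return z * se_score * elo_per_score

for n in (1000, 10000, 50000):
    print(n, round(elo_error_margin(n), 1))      # the margin shrinks with the square root of the games

With those assumptions the margin only gets down to a couple of Elo somewhere around the 50,000-game mark, which is the point of deciding the number in advance.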

It is possible to generate a new test version of Stockfish every day, so it's not valid to just keep testing a new version every time someone thinks the newest one is a little better, see how it does, and keep it if it happens to come out stronger. You have to protect against that to have a valid list; otherwise I could ask him to test a new version of Komodo every day too.

In our own testing we often get a version of Komodo that is several Elo stronger, only to find LATER that it's a couple of Elo weaker and just got lucky on the test.
ANY rating list already has significant error (and hence the error bars) and you don't want to compound that problem.
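
To make the "luckiest one on top" effect concrete, here is a toy Python simulation (every number in it is invented, purely for illustration): five copies of an engine of exactly equal strength each play a 1000-game match against the same reference, and we look at how strong the luckiest copy appears.

Code:
import math
import random

def simulated_score(games, draw_rate=0.4):
    # score fraction of one copy against a reference of identical strength
    total = 0.0
    for _ in range(games):
        r = random.random()
        if r < draw_rate:
            total += 0.5                               # draw
        elif r < draw_rate + (1.0 - draw_rate) / 2.0:
            total += 1.0                               # win (same chance as a loss)
    return total / games

def elo(score):
    return -400.0 * math.log10(1.0 / score - 1.0)

ratings = [elo(simulated_score(1000)) for _ in range(5)]  # five "versions", all truly equal
print([round(r, 1) for r in ratings])
print("apparent gain of the luckiest copy:", round(max(ratings), 1), "Elo")

Every copy is exactly as strong as the reference, yet the best-looking one typically comes out around ten Elo ahead with these numbers, purely from sampling noise - that is the extra chance to be on top that multiple submissions buy.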

I'm sorry if you don't like what I'm saying, but I'm not just making it up. It's a real problem that should be protected against. It's not generally a big problem because usually only versions several months apart that have improved a lot are tested. But if incremental improvements are submitted often it tends to put the results in question.

Another version of this same phenomenon was illustrated at a world championship where the Deep Blue team estimated their winning chances at around 50% - despite the fact that they were the clear favorites to win and by far the strongest program. I don't remember how many programs were there, but statistically the more programs you have, the lower the winning chances for any single program, due to the huge margin of error of only 5 rounds. So even if their chances were 50%, most of the other programs only had 5% - 50% was focused on only 1 program! As it turned out, Deep Blue did NOT win that tournament despite being the strongest program by far.
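
The same point can be illustrated with a quick Monte Carlo sketch (a made-up five-round all-play-all with invented ratings and a flat draw rate; it is not a reconstruction of that actual event): even a favorite that is 100 Elo stronger than everyone else often fails to finish clear first.

Code:
import random

def game_score(elo_a, elo_b, draw_rate=0.3):
    # one game under a logistic model with a flat draw rate (fine for small rating gaps)
    expected = 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))
    r = random.random()
    if r < draw_rate:
        return 0.5
    return 1.0 if r < draw_rate + (expected - draw_rate / 2.0) else 0.0

def tournament(ratings):
    scores = [0.0] * len(ratings)
    for i in range(len(ratings)):
        for j in range(i + 1, len(ratings)):
            s = game_score(ratings[i], ratings[j])
            scores[i] += s
            scores[j] += 1.0 - s
    return scores

ratings = [100, 0, 0, 0, 0, 0]   # one clear favorite, five equal rivals, 5 games each
trials = 20000
clear_first = 0
for _ in range(trials):
    scores = tournament(ratings)
    if scores[0] > max(scores[1:]):   # strictly ahead of every rival
        clear_first += 1
print("favorite alone in first place in about", round(100.0 * clear_first / trials), "% of events")

The exact percentage depends on the invented numbers, but with only five rounds it comes out well short of certainty even for a clearly strongest program, which is roughly the situation described above.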

Having said all of that, I DO like the LS list - in particular it's refreshing to see a list with relatively small error margins and huge samples of games.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Uri Blass
Posts: 11152
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Scaling Study

Post by Uri Blass »

I have no problem with testing every version and I think that more knowledge is always better than less knowledge.

If people get wrong conclusions from the data then it is not the fault of the person who gives the data.
lkaufman
Posts: 6284
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Scaling Study

Post by lkaufman »

Modern Times wrote:
beram wrote: But one list that does have these numbers is the LS list, and there it is 547-453 after 1000 games at bullet speed, i.e. 54.7% for Houdini against Komodo 6
Indeed, the LS list is the only one where the statistical margin of error is consistently low.

I think you know I ran 500 games between this pair, 5'+3" 4CPU, and got 273.5 - 226.5 (+148 =251 -101) in Houdini's favour, which is also 54.7% coincidentally. So that is four times the number of CPUs and six times longer thinking time, for exactly the same result. Hardware was different of course, all my chess machines are AMD, so that certainly might affect things.
I'm not making any claims regarding the MP performance of Komodo; all of my tests are single core. I suspect that the way we do MP is less efficient for 4 cores than the Houdini/Stockfish method but more efficient for 16 cores. Anyway it is sufficiently different to make MP results irrelevant to the question of scaling. If someone with an i7 machine wants to run a match between Komodo 6 and Houdini 3 at a reasonably slow time limit (I'd suggest the TCEC limit of one hour plus half a minute) in single-core mode (so four or six games at a time), a hundred-game match would only take two or three days. I would bet on Komodo under those conditions.
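
For reference, the 500-game result quoted above can be turned into an Elo figure with a rough confidence interval. The Python sketch below is just that arithmetic; the second line is a hypothetical even 100-game match with a similar draw rate, added only to show how wide the margin is at that sample size.

Code:
import math

def perf(wins, draws, losses, z=1.96):
    # Elo estimate and approximate 95% margin from a win/draw/loss result
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    var = (wins + 0.25 * draws) / games - score ** 2   # per-game score variance
    se = math.sqrt(var / games)                        # standard error of the score
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    slope = 400.0 / math.log(10) / (score * (1.0 - score))
    return round(elo, 1), round(z * se * slope, 1)

print(perf(148, 251, 101))   # the 5'+3" 4CPU result above: about +33 Elo, margin around 21
print(perf(25, 50, 25))      # hypothetical even 100-game match: margin close to 48 Elo

With these numbers the 500-game sample works out to roughly +33 Elo for Houdini with a margin of about 21, while a 100-game match would carry a margin in the region of 48 Elo.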
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Scaling Study

Post by Don »

Uri Blass wrote:I have no problem with testing every version and I think that more knowledge is always better than less knowledge.

If people get wrong conclusions from the data then it is not the fault of the person who gives the data.
That's one point of view.

Very often people take advantage of the fact that people will draw the wrong conclusions, so I feel it's my duty as a good scientist to try to minimize that. You will notice that I often point out flaws in people's reasoning even when those flaws make Komodo look good. I have done that many times in this forum.

More knowledge is always better than less knowledge in principle, but in practice most people don't have any clue about how to process data. It's fun to watch politicians use data to mislead people.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
beram
Posts: 1187
Joined: Wed Jan 06, 2010 3:11 pm

Re: Scaling Study

Post by beram »

Don wrote:
Uri Blass wrote:I have no problem with testing every version and I think that more knowledge is always better than less knowledge.

If people get wrong conclusions from the data then it is not the fault of the person who gives the data.
That's one point of view.

Very often people take advantage of the fact that people will draw the wrong conclusions, so I feel it's my duty as a good scientist to try to minimize that. You will notice that I often point out flaws in people's reasoning even when those flaws make Komodo look good. I have done that many times in this forum.

More knowledge is always better than less knowledge in principle, but in practice most people don't have any clue about how to process data. It's fun to watch politicians use data to mislead people.
It's also fun watching engine authors trying to prove things wrong when they are not
By your own standards, your latest comments, quoted below, are nothing other than obfuscating information.

"I'm glad Stephan has taken what I said to heart, however ....
This new list is based on the same data and puts the highest-scoring SF first, so just pulling out the lower-scoring versions doesn't change anything at all.
The point isn't whether the list is correct at any given moment, it's that it's not scientifically valid when done in a slipshod manner.
Maybe it IS correct, maybe it isn't but just showing a different version of the list doesn't prove anything. ..."
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Scaling Study

Post by Don »

beram wrote:
Don wrote:
Uri Blass wrote:I have no problem with testing every version and I think that more knowledge is always better than less knowledge.

If people get wrong conclusions from the data then it is not the fault of the person who gives the data.
That's one point of view.

Very often people take advantage of the fact that people will draw the wrong conclusions, so I feel it's my duty as a good scientist to try to minimize that. You will notice that I often point out flaws in people's reasoning even when those flaws make Komodo look good. I have done that many times in this forum.

More knowledge is always better than less knowledge in principle, but in practice most people don't have any clue about how to process data. It's fun to watch politicians use data to mislead people.
It's also fun watching engine authors trying to prove things wrong when they are not
Brian, I did not try to prove that anything is wrong; you are being obtuse, illogical and unreasonable. Everything I said is a real factor and should be considered. I would feel the same if it were Komodo being tested multiple times. You are getting all emotional and subjective about Stockfish.

My post was NOT about proving the SF result was wrong; I was pointing out a flaw in the testing procedure that should be a real consideration.

Anyway, you are not mathematical, so I don't wish to argue with you. If it were HG or someone else with some math sense, we would not even be having this discussion; they would immediately know that what I'm saying is a consideration.

By your own standards, your latest comments, quoted below, are nothing other than obfuscating information.

"I'm glad Stephan has taken what I said to heart, however ....
This new list is based on the same data and puts the highest-scoring SF first, so just pulling out the lower-scoring versions doesn't change anything at all.
The point isn't whether the list is correct at any given moment, it's that it's not scientifically valid when done in a slipshod manner.
Maybe it IS correct, maybe it isn't but just showing a different version of the list doesn't prove anything. ..."
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.