Modern Times wrote:
Most rating lists don't have the large numbers of games that Larry has above.
I'm also confident in Larry's results now that he has HT (hyper-threading) off on his machines.
Those results are certainly very interesting, but it is just one opponent (Komodo vs Houdini), so it doesn't tell us much. If he repeated those same tests Komodo vs Stockfish, that would be fascinating, I think.
beram wrote:
Ray,
It is true that those lists do not have very large numbers of games.
But one list that does is the LS list, where Houdini leads Komodo 6 by 547-453 after 1000 games at bullet speed, a score of 54.7%.
The latest CCRL 40/40 4cpu list (26 Oct.) has 124 games between them, with a 54% score for Houdini 3 on 4 CPUs (54.7% on 1 CPU, but from just 32 games).
If Larry's theory of an 8 Elo increase per doubling were true, then this would be a very strange anomaly in the results between those opponents.
As far as we can observe from these lists at the moment, it just makes no sense.
Besides that, you are right that more tests against other opponents are needed.
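The numbers beram quotes can be sanity-checked with the standard logistic Elo formula. The sketch below is my own, not anything published by the LS or CCRL testers, and it treats every game as a win or a loss; since draws shrink the spread, the margins it prints are if anything too wide.

Code:
import math

def elo_diff(score):
    # Elo difference implied by a score fraction, standard logistic model
    return -400 * math.log10(1 / score - 1)

def margin_95(score, games):
    # rough 95% margin in Elo: binomial standard error times d(Elo)/d(score)
    se = math.sqrt(score * (1 - score) / games)
    slope = 400 / (math.log(10) * score * (1 - score))
    return 1.96 * se * slope

for label, score, games in [("LS list (bullet)", 0.547, 1000),
                            ("CCRL 40/40 4cpu", 0.540, 124)]:
    print(f"{label}: {elo_diff(score):+.0f} Elo +/- {margin_95(score, games):.0f} "
          f"over {games} games")

On those assumptions the LS result comes out near +33 Elo +/- 22 and the CCRL one near +28 Elo +/- 61, so the two results are statistically compatible with each other, and an effect as small as 8 Elo would be invisible in a 124-game sample anyway.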
Don wrote:
A huge problem with the LS list is that multiple dev versions of Stockfish are included. That's not very scientific, because it leads to statistical anomalies.
The reason is that if several versions of any given program are in the same list, it's not likely to be the best one that ends up on top; instead it will be the luckiest one. If you submit 3 or 4 more, one of them will likely end up ahead of Houdini even if there is no improvement at all, simply because of statistical anomaly. It's sort of like flipping a coin and, if you lose, saying "let's try again." Sooner or later you will win the coin flip.
There should be a rule that you are not allowed to have more than one version every 4 months or so, and that if you submit more, the others should not be reported. It's just not right to do this. I see the same thing in other lists that rate one program twice using different "modes", such as "Houdini tactical" alongside normal Houdini. Because Houdini tactical is significantly weaker it's probably no big deal - but it still represents two opportunities to be on top, which only certain programs get. If I could put 10 versions of Komodo on the lists (all with very minor changes), people would pick the one on top and ignore the rest, and I would get the benefit of the sampling noise that other programs don't get.
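Don's coin-flip point is easy to simulate. The following is my own sketch, not Don's code, and the match sizes are made-up assumptions: four versions of identical true strength each play 1000 win/loss games, and we count how often the best of them clears a margin that a single match would clear only about 2.5% of the time by luck.

Code:
import random

def match_score(games, p_win=0.5):
    # score of one version over `games` decisive games at true 50% strength
    return sum(random.random() < p_win for _ in range(games)) / games

random.seed(1)
GAMES, VERSIONS, TRIALS = 1000, 4, 1000
threshold = 0.5 + 1.96 * (0.25 / GAMES) ** 0.5  # upper edge of the 95% band

single = sum(match_score(GAMES) > threshold for _ in range(TRIALS)) / TRIALS
best_of = sum(max(match_score(GAMES) for _ in range(VERSIONS)) > threshold
              for _ in range(TRIALS)) / TRIALS
print(f"one version looks 'improved' by luck:      {single:.1%}")
print(f"best of {VERSIONS} versions looks 'improved': {best_of:.1%}")

With four submissions the chance that at least one lands above the margin roughly quadruples (about 1 - 0.975^4, i.e. near 10%), which is exactly the "luckiest one on top" effect.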
Playing a huge number of games helps mitigate this problem, but the error margins are still given as 5 Elo, and the "true" error margin is higher, since this is not a controlled study with a fixed, stated number of games to be played.
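The "no fixed number of games" point can also be made concrete. The sketch below is purely my own assumption about how peeking might happen, not how any rating list actually operates: two truly equal engines keep playing, and after every 500 games we check whether the running result has crossed the nominal 95% margin; repeated checking produces far more than 5% false alarms.

Code:
import math, random

def crosses_margin_somewhere(max_games=20000, check_every=500):
    wins = 0
    for n in range(1, max_games + 1):
        wins += random.random() < 0.5           # equal opponents, decisive games
        if n % check_every == 0:
            score = wins / n
            elo = -400 * math.log10(1 / score - 1)
            # nominal 95% margin for a FIXED match of n games
            margin = 1.96 * math.sqrt(0.25 / n) * 400 / (math.log(10) * 0.25)
            if abs(elo) > margin:
                return True                     # looked "significant" at this peek
    return False

random.seed(2)
TRIALS = 200
hits = sum(crosses_margin_somewhere() for _ in range(TRIALS))
print(f"{hits / TRIALS:.0%} of equal-strength matches cross the margin at some point")

So a list that keeps accumulating games, and is read whenever it happens to be updated, has an effective error margin well above the stated one.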
The IPON test was the only one I believed to be run scientifically and correctly, but it suffered from a very low sample size. Even though that was eventually improved, it remained a problem - but at least IPON did not give some programs multiple chances to defeat the error margins.
Well Don, therefore Stephan has also created this list:
http://ls-ratinglist.beepworld.de/ls-to ... nament.htm
and there is not much difference between the top three there; in fact, the difference between SF 2210 and Komodo 6 is even bigger: 9 Elo instead of 7.
