Don't get offended. I'm aware you are not professionals but hobby chess enthusiasts. The problem I have is with results from these lists being presented as official despite the disclaimer you have (I'm not saying you or Graham is doing it, but a lot of people get that impression).

Adam Hair wrote:
Milos, you keep stating that the CCRL lists have no scientific significance,
yet you do not state why you say this. Furthermore, you also have stated
that some tests done with the Ippolit derivatives are much less flawed.
I have not seen one yet that is any better than what is done in the CCRL,
including one that I ran and posted in the Open Chess forum. Perhaps
you could state your grievances concerning CCRL methodologies, either
in a new thread or pm me. The CCRL most definitely has flaws, yet the
total disregard of its results by you seems to be due to things unrelated
to science. I have seen many outlandish statements and insults from you.
Perhaps you could dial back the antagonism and start giving thoughtful
statements. I have found several things that you have said quite interesting.
Unfortunately, too much of the time you are more interested in attacking
than in debating and pointing out the holes in other people's arguments.
Now, regarding scientific insignificance, let me name just a few flaws that come to mind (the actual list would be much longer). I will skip personal biases and even presume that none of you has any intention of distorting results.
So let's see:
1) Inequality of testing conditions
a) You do your testing on various machines, most of them much stronger than what you claim as your reference machine. You adjust for speed by running the Fritz (or Crafty, I'm not sure which) benchmark; both benchmarks are synthetic, outdated, and measure just a couple of positions that are certainly not representative of a chess game, so they don't give the right picture. You then scale the TC you use according to the benchmark result. Even if that makes the engines search roughly the same number of nodes as they would for a fixed depth on the reference machine, changing the TC directly changes the time management (TM); I'll say more about TCs in later points, and see the first sketch after this list. So all your tests are effectively run with a different TM for the same engine. Moreover, the various machines have different multi-core implementations and different caches, which can distort things very badly in SMP engine matches.
b) You do your testing on various OSes, with different programs running in parallel, different services running in the background, and different schedulers and prioritization. Moreover, I strongly doubt that all of you use those machines only for engine testing and never run other things on them at the same time.
c) You run tests under various GUIs. GUIs handle time differently, handle the UCI protocol differently, adjudicate games differently, etc. In some cases a GUI even tends to favor its own engine.
The conclusion here: even though you make some effort to balance things, you are effectively comparing apples and oranges.
2) You do not have a representative sample of opponents for each engine you test; there is no methodology there. Some engines are correlated, engines tend to do well against some opponents and poorly against others, and there are huge rating differences that you make no effort to balance. You are introducing even more noise than there would be if, for example, every engine were tested against the same three standard opponents.
3) You use a general book. This is far worse than using a large set of start positions. You think that with a general book and many games you will get a good distribution; instead you get a lot of openings of the same type, not enough diversity, and certainly not a representative sample of computer chess. This introduces even more noise.
4) You use EGTBs. In this way you disfavor engines that don't use tablebases, or that use other tablebase formats.
5) Your reference machine, an Athlon 64 X2 4600+ (2.4 GHz), is far too weak, and what you claim is long-TC testing (40/40) is effectively blitz on today's most powerful machines (see the rough arithmetic after this list).
6) Tournament TCs might not be the best choice. Even though 40/40 is the FIDE standard, that doesn't mean it is beneficial for computer chess. TCs that distribute time more evenly across moves are much better for engine testing, since they capture more of an engine's raw strength (which is what matters most for analysis). So incremental TCs are better, and best of all would be to use no TC at all and instead a fixed time per move.
7) You play way too few games. Bayeselo reports error margins that assume ideal testing conditions: the same machine, the same OS, and the same GUI for both engines and for the whole tournament, so that the noise is Gaussian and comes only from small uncertainties in the CPU, the OS, and the SMP implementations. In your case there are so many sources of noise (points 1, 2 and 3), some of them not even Gaussian, that the real error margins are actually a couple of times bigger than what you show (see the last sketch after this list). To claim valid results under your testing conditions you would have to run one to two orders of magnitude more games - at least tens of thousands, if not hundreds of thousands.
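To make point 1a concrete, here is a minimal sketch of what benchmark-based speed normalization looks like. The node rates, the scaling rule and the toy time-management formula are my own illustrative assumptions, not CCRL's actual numbers or procedure; the point is only that even with a roughly equal node budget, the engine's time manager is handed a different clock.

[code]
# Minimal sketch of benchmark-based TC scaling (point 1a). All numbers are
# illustrative assumptions, not CCRL's actual figures.
REFERENCE_NPS = 1_000_000      # assumed speed of the reference machine (nodes/second)
TESTER_NPS    = 2_500_000      # assumed speed of one tester's machine

def scaled_tc_minutes(reference_minutes: float) -> float:
    """Shrink the time budget so the tester's machine searches roughly the same
    number of nodes per game as the reference machine would."""
    return reference_minutes * REFERENCE_NPS / TESTER_NPS

def time_for_this_move(remaining_seconds: float, moves_to_go: int) -> float:
    """A toy time-management rule: remaining time divided by moves to go, plus 10%.
    Real engines use far more elaborate heuristics, which is exactly the problem:
    the heuristic plans around a different clock once the TC has been rescaled."""
    return remaining_seconds / max(moves_to_go, 1) * 1.1

# 40 moves in 40 minutes on the reference machine becomes 40 in 16 on this tester...
minutes = scaled_tc_minutes(40.0)
print(f"scaled TC: 40 moves in {minutes:.0f} minutes")
# ...so even with a similar node budget, the time manager now allocates time
# against a 16-minute clock instead of a 40-minute one.
print(f"budget for the first move: {time_for_this_move(minutes * 60, 40):.0f} s")
[/code]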
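And a back-of-the-envelope version of point 5. The overall speed factor of a modern machine over the reference Athlon is an assumption for illustration, not a measurement.

[code]
# Rough arithmetic for point 5: what "long TC" on the reference machine amounts to
# on a faster machine. The 8x speedup is an assumed figure.
reference_tc_minutes = 40     # "long TC": 40 moves in 40 minutes on the reference
assumed_speedup = 8           # cores + clock + IPC of a current high-end machine
equivalent_minutes = reference_tc_minutes / assumed_speedup
print(f"40/{reference_tc_minutes} on the reference plays like "
      f"40/{equivalent_minutes:.0f} on the fast machine - i.e. blitz")
[/code]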
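Finally, a sketch of the statistics behind point 7, using the plain binomial/normal approximation rather than Bayeselo's actual model; the draw rate and the noise-inflation factor k are illustrative assumptions of mine.

[code]
# Why understated noise blows up the required number of games (point 7).
from math import log, sqrt

def elo_error_95(n_games: int, draw_rate: float = 0.4) -> float:
    """Approximate 95% error margin, in Elo, of a measured score near 50%."""
    var_per_game = (1.0 - draw_rate) * 0.25          # draws add no variance to the score
    score_error = 1.96 * sqrt(var_per_game / n_games)
    return score_error * 4.0 * 400.0 / log(10)       # slope of the Elo curve at a 50% score

ideal = elo_error_95(1000)
print(f"1000 games, ideal conditions: about +/- {ideal:.1f} Elo")

# If unmodelled noise makes the true per-game standard deviation k times larger,
# the real margin is k times the reported one, and because margins shrink only as
# 1/sqrt(N) you need k**2 times as many games to win it back. k = 3 is assumed.
k = 3
print(f"real margin at 1000 games with {k}x the noise: +/- {k * ideal:.1f} Elo")
print(f"games needed to get back to +/- {ideal:.1f} Elo: {1000 * k ** 2}")
[/code]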