michiguel wrote:
If they are really weird then there is no reason to play them twice.
Exactly that is the reason. If they are balanced then there is no reason. You seem to miss the point. I am mostly testing pretty equal engines, I do not need freak errors of openings. Better (+1 -1) than +1.
Michel example was of course an illustration to make a point and not a real example.
Yes, but the general public may get a wrong impression out of these weird, absurd examples. These things do not happen in real life.
The fact is, unbalanced positions happen and are part of the bell curve of possibilities.
I checked Bob's EPD file for 500 out of 4,000 positions. The only flaw I saw was that some clusters of positions were differing by only 1-2 moves, I told that to Bob and he scrambled them. They are all balanced, 8-12 moves deep into the opening positions.
Making this manual correction disrupts the assumption that all the positions are independent events. Once you disrupt this, the formulas to calculate standard errors are not valid anymore and become just limits. If you play 500 positions with black and white (1000 games total), the standard deviation you get will be lower than the one that corresponds to n=500 but higher than the one for n=1000. It is something in between, but you do not know what it is.
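Miguel's point about the in-between standard error can be sketched numerically. A minimal sketch, assuming a per-game standard deviation of 0.4 points (an illustrative figure, not one from the thread):

```python
import math

# Standard error of the mean score for a match of n independent games.
# per_game_sd = 0.4 points is an assumed, typical-looking value when
# draws are common; it is not a number taken from the post.
def std_error(n, per_game_sd=0.4):
    return per_game_sd / math.sqrt(n)

se_500 = std_error(500)    # if only the 500 positions were independent events
se_1000 = std_error(1000)  # if all 1000 games were fully independent

# For 500 positions played with both colours, the true standard error
# lies somewhere between these two bounds.
print(f"SE assuming n=500:  {se_500:.4f}")
print(f"SE assuming n=1000: {se_1000:.4f}")
```

Whatever the per-game spread actually is, the two bounds differ only by a factor of sqrt(2), which is why the dispute below is about how close the effective n is to 1000.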
I can estimate what the number is, and it's >998 out of 1,000. Did you try to play advanced engines such as Crafty, Rybka or IvanHoe from the single standard opening position at fixed total time + increment? I checked 500 games, not a single one was a repeat. The randomness of these engines is now up to this level, at least at ultra-short time controls. In fact, on Windows the standard C clock() function only has a resolution of about 16ms, therefore playing at 1,000 msecs + 100 msecs increment will give large uncertainties at every move!
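The clock-granularity point is easy to check empirically. A minimal sketch that measures the apparent resolution of a clock by sampling until its reported value changes (on Windows, `time.time()` historically ticked in ~15.6 ms steps, the same granularity the post attributes to C's `clock()`; `time.perf_counter()` is much finer):

```python
import time

# Estimate a clock's apparent resolution: read it repeatedly until the
# reported value changes, and take the smallest observed step.
def observed_resolution(clock, samples=50):
    deltas = []
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        while t1 == t0:       # spin until the clock ticks
            t1 = clock()
        deltas.append(t1 - t0)
    return min(deltas)

print("perf_counter step:", observed_resolution(time.perf_counter))
```

A ~16 ms tick against a 100 ms increment means each move's time budget is uncertain by a noticeable fraction, which is one source of the game-to-game randomness described above.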
In practice, most likely it will be closer to 1000
Yes, very close to 1,000, >998.
but there is no reason to mess with the randomness of the sampling to counteract the occurrence of events that are in the tail of the bell curve.
Could you say what percentage of openings determine the outcome by a point or half-point? This is a macroscopic value, in my estimation 10-15%. I do not want this mess to interfere in my testing of pretty equal engines.
Now very seriously, you have a point:
This, your choice of mess, decays as 1/sqrt(N) with the number of games.
My reversed-colours mess does not decay.
I made some calculations and came to the conclusion that if you play fewer than 100,000-1,000,000 games in a match, my mess is smaller. Otherwise your mess is smaller. You have a point, and you could make your own calculations to see the numbers; I can give only an order of magnitude. As I usually play thousands to several tens of thousands of games for testing, my choice is reversed colours.
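The crossover argument can be sketched with toy numbers. A minimal illustration, assuming an opening-imbalance spread of 0.05 points and a fixed residual of 1e-4 for the paired-game correlation; both figures are assumptions for illustration, not Kai's actual calculation:

```python
import math

# Two sources of error ("messes") from the post:
# - Single-colour sampling: unbalanced openings add noise to the mean
#   score that decays like opening_sd / sqrt(n).
# - Reversed colours: opening bias cancels, but paired games leave a
#   fixed residual, modelled here as a constant that does not shrink.

def single_colour_error(n, opening_sd=0.05):
    # opening_sd is an assumed spread of opening imbalance, in points
    return opening_sd / math.sqrt(n)

REVERSED_COLOUR_ERROR = 1e-4  # assumed fixed residual, independent of n

for n in (1_000, 10_000, 100_000, 1_000_000):
    better = "reversed colours" if single_colour_error(n) > REVERSED_COLOUR_ERROR \
             else "single colour"
    print(f"n={n:>9}: {better} wins")
```

With these toy values the crossover falls between 100,000 and 1,000,000 games, matching the order of magnitude claimed above: for match sizes of thousands to tens of thousands of games, the reversed-colours "mess" is the smaller one.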
The bigger problem is not only that you may have unbalanced positions, but too balanced (drawish), or openings that one engine does not understand well. In the latter case, you amplify the advantage or disadvantage that a given engine may have.
Miguel
If we enter such subtleties, even the standard starting position has some flaws, as it scores 52-54% for white. Do you want the set of opening positions to score 50%, or 53% as is normal for white? The main point is to leave the opening close to the beginning, with representative openings, balanced to a 48-57% result for white, and to play them with both colours. That I will lose fewer than 2 games out of 1,000 in confidence margins does not bother me too much, except perhaps in 1,000,000-game matches. These arguments arise even between Elostat and Bayeselo, but for error margins I prefer Elostat.
The problems are too subtle and tiny to be presented in such an absurd manner.
Kai