Do You really need 1000s of games for testing?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Do You really need 1000s of games for testing?

Post by bob »

Michiel Koorn wrote:the assumption is that the better player finds quicker ways to win and takes longer to be defeated. Assumptions need to be validated...
What about when two really good players meet? Or, as I have seen in tournaments, when a GM plays out a known simple win that is slow and methodical rather than wasting energy trying to find a tactical bust? Or when one side blunders, for whatever reason (time trouble, etc.)?

I think that just taking the 0 - 1/2 - 1 result is iffy enough; adding the length of the game in moves, or in time used, may well make it more iffy, unless someone does a huge study to see whether the results are meaningful. Clearly it is easier to use W/L/D results as we do today, as anything else would require far more effort to validate before the results could be trusted. And it still is not clear to me that this idea would reduce the number of games required...
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Do You really need 1000s of games for testing?

Post by Michel »

if an engine gains or loses because of an unbalanced position (decisively or not), the effect will be canceled by losing or gaining with opposite colour (decisively or not). Unjust +1 or -1 is worse than unjust (+1 -1) or (-1 +1).
Assume all positions are mate in one. The test will give no information at all and hence the error bar is infinite. This is not cured by playing each position with both colors.
abulmo
Posts: 151
Joined: Thu Nov 12, 2009 6:31 pm

Re: Do You really need 1000s of games for testing?

Post by abulmo »

hgm wrote:The mathematics really tells everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed, because you know something is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.
If you mix 1 liter of water with 1 liter of alcohol, you get 1.98 l. of alcoholic solution, not 2 l. Real life may be more complicated than simple mathematics.
hgm wrote:I don't know how solid your observation is. (E.g., what does "usually" mean, and how many engines were included in this observation?) With 200 games the score standard deviation should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where it changes by 5 or less. But in more than half the cases, it should change by 10 or more. If not, something fishy is going on with that rating list...
Or maybe the mathematics behind Elo needs to be more sophisticated, for example when dealing with draws, correlations between players, etc.
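
For what it's worth, the 2.5%-to-17-Elo arithmetic in the quote checks out; here is a minimal sketch (the per-game standard deviation of 0.35 is an assumption corresponding to a typical draw rate):

    import math

    def score_sd_and_elo(n_games, sigma_game=0.35):
        # Standard deviation of the mean score after n_games,
        # converted to Elo near a 50% score (the slope of the Elo
        # curve at 50% is 1600/ln(10) =~ 695 Elo per unit of score).
        sd_score = sigma_game / math.sqrt(n_games)
        return sd_score, sd_score * 1600 / math.log(10)

    print(score_sd_and_elo(200))  # -> (~0.025, ~17 Elo)
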
Richard
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Do You really need 1000s of games for testing?

Post by Laskos »

Michel wrote:
if an engine gains or loses because of an unbalanced position (decisively or not), the effect will be canceled by losing or gaining with opposite colour (decisively or not). Unjust +1 or -1 is worse than unjust (+1 -1) or (-1 +1).
Assume all positions are mate in one. The test will give no information at all and hence the error bar is infinite. This is not cured by playing each position with both colors.
1. Playing both colours cures the problem a bit, (+1 -1). Playing each position only once is worse.

2. I do not want my testing positions to be mate in 1. If it happens in rare cases, then some estimate of the systematic error can be made by just counting the number of freak positions. If I were so stupid as to make a set of only very unbalanced positions, then my test would certainly be flawed.

3. I do not multiply the rate of freak positions by playing them from both sides, as I am playing every non-freak position from both sides too.

4. Examples like yours and Uri's are hypothetical, weird situations, which would require a tester incompetent to the point of absurdity.

Kai
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Do You really need 1000s of games for testing?

Post by michiguel »

Laskos wrote:
Michel wrote:
if an engine gains or loses because of an unbalanced position (decisively or not), the effect will be canceled by losing or gaining with opposite colour (decisively or not). Unjust +1 or -1 is worse than unjust (+1 -1) or (-1 +1).
Assume all positions are mate in one. The test will give no information at all and hence the error bar is infinite. This is not cured by playing each position with both colors.
1. Playing both colours cures the problem a bit, (+1 -1). Playing each position only once is worse.

2. I do not want my testing positions to be mate in 1. If it happens in rare cases, then some estimate of the systematic error can be made by just counting the number of freak positions. If I were so stupid as to make a set of only very unbalanced positions, then my test would certainly be flawed.

3. I do not multiply the rate of freak positions by playing them from both sides, as I am playing every non-freak position from both sides too.

4. Examples like yours and Uri's are hypothetical, weird situations, which would require a tester incompetent to the point of absurdity.

Kai
If they are really weird then there is no reason to play them twice. Michel's example was of course an illustration to make a point, not a real example. The fact is, unbalanced positions happen and are part of the bell curve of possibilities. Making this manual correction disrupts the assumption that all the positions are independent events. Once you disrupt that, the formulas for calculating standard errors are no longer valid and become mere bounds. If you play 500 positions with black and white (1000 games total), the standard deviation you get is lower than the one corresponding to n=500 but higher than the one for n=1000. It is something in between, but you do not know what. In practice it will most likely be closer to n=1000, but there is no reason to mess with the randomness of the sampling to counteract the occurrence of events that are in the tail of the bell curve.
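
The "something in between" can be made concrete under one assumption: suppose the two games from the same position have correlation rho. Then the standard error interpolates between the n=1000 and n=500 formulas (a sketch with illustrative values for sigma and rho; rho itself is the unknown quantity):

    import math

    def se_paired(n_pairs, sigma_game=0.35, rho=0.2):
        # Var(mean) over 2*n_pairs games grouped in correlated pairs is
        # sigma^2 * (1 + rho) / (2 * n_pairs): rho=0 recovers the
        # independent n=1000 formula, rho=1 collapses to n=500.
        return sigma_game * math.sqrt((1 + rho) / (2 * n_pairs))

    for rho in (0.0, 0.2, 1.0):
        print(rho, round(se_paired(500, rho=rho), 4))

A negative rho, i.e. the colour bias dominating the pair, would even push the error below the naive n=1000 value.
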

The bigger problem is not only that you may have unbalanced positions, but also too-balanced (drawish) ones, or openings that one engine does not understand well. In the latter case, you amplify the advantage or disadvantage that a given engine may have.

Miguel
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Do You really need 1000s of games for testing?

Post by Milos »

abulmo wrote:If you mix 1 liter of water with 1 liter of alcohol, you get 1.98 l. of alcoholic solution, not 2 l. Real life may be more complicated than simple mathematics.
This is quite a ridiculous example, and it compares apples and oranges.
It's as if you said that if you take 3 liters of air and put it in a car tire, you'll get 1 liter of air.
The real point is that if you take 1 kg of water and 1 kg of alcohol, you will absolutely always get exactly 2 kg of alcoholic solution.
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: Do You really need 1000s of games for testing?

Post by IWB »

Hello

I checked this theory against my IPON database. Maybe the number of games is not large enough, but looking at the data there is nothing that jumps out to support your assumption:


Engine                 Avg. moves   No. of games
Rybka 4                        64           2800
Houdini 1.03a                  66           2800
Stockfish 1.9.1                62           2200
Naum 4.2                       69           3900
Critter 0.8                    69           2600
Komodo 1.2                     68           3100
Deep Shredder 12               67           5100
Deep Fritz 12                  70           4100
Gull 1.0a                      69           2300
Hiarcs 13.1                    67           3100
Zappa Mexico II                70           6500
Spark 0.4                      65           2800
Protector 1.3.2                66           4500
Deep Onno 1.2.7                67           3100
Hannibal 1.0a                  67           2500
Deep Junior 12                 69           2100
Deep Sjeng WC2008              69           5700
Toga II 1.4 beta5c             66           6200
Jonny 4.0                      64           2200
Loop 2007                      65           5000
Crafty 23.3                    70           2300
Spike 1.2 Turin                65           6700

Average                        67        3709.09

Bye
Ingo

PS: Sorry, I can't make a proper list out of that here :-(
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Do You really need 1000s of games for testing?

Post by Laskos »

michiguel wrote:
If they are really weird then there is no reason to play them twice.
Exactly that is the reason to play them twice. If they are balanced, then there is no reason. You seem to miss the point: I am mostly testing pretty equal engines, and I do not need freak errors from the openings. Better (+1 -1) than +1.

Michel's example was of course an illustration to make a point, not a real example.
Yes, but the general public may get a wrong impression from these weird, absurd examples. These things do not happen in real life.

The fact is, unbalanced positions happen and are part of the bell curve of possibilities.
I checked 500 of the 4,000 positions in Bob's EPD file. The only flaw I saw was that some clusters of positions differed by only 1-2 moves; I told Bob and he scrambled them. They are all balanced, 8-12 moves deep into the opening.

Making this manual correction disrupts the assumption that all the positions are independent events. Once you disrupt that, the formulas for calculating standard errors are no longer valid and become mere bounds. If you play 500 positions with black and white (1000 games total), the standard deviation you get is lower than the one corresponding to n=500 but higher than the one for n=1000. It is something in between, but you do not know what.
I can estimate what the number is, and it's >998 out of 1,000. Did you try playing advanced engines such as Crafty, Rybka, or IvanHoe from the single standard opening position at fixed total time + increment? I checked 500 games; not a single one was a repeat. The randomness of these engines is now at this level, at least at ultra-short time controls. In fact, on Windows the standard C clock() function only has a resolution of about 16 ms, so playing at 1,000 ms + 100 ms increment gives large uncertainties at every move!
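
Checking for repeats is easy to automate; here is a sketch using the python-chess package (Python 3.8+), with "games.pgn" as a hypothetical file name:

    import chess.pgn

    seen, duplicates = set(), 0
    with open("games.pgn") as pgn:
        while (game := chess.pgn.read_game(pgn)) is not None:
            # Two games are repeats if they share the exact move sequence.
            key = tuple(move.uci() for move in game.mainline_moves())
            if key in seen:
                duplicates += 1
            seen.add(key)
    print(duplicates, "repeats among", duplicates + len(seen), "games")
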

In practice it will most likely be closer to n=1000
Yes, very close to 1,000: >998.

but there is no reason to mess with the randomness of the sampling to counteract the occurrence of events that are in the tail of the bell curve.
Could you say what percentage of openings determine the outcome in points or half-points? This is a macroscopic quantity; my estimate is 10-15%. I do not want this mess to interfere in my testing of pretty equal engines.

Now, very seriously, you have a point:

Your choice of mess decays as 1/sqrt(number of games).
My reversed-colours mess does not decay.

I made some calculations and came to the conclusion that if you play fewer than about 100,000-1,000,000 games in a match, my mess is smaller; otherwise your mess is smaller. You have a point, and you could do your own calculations to see the numbers; I can only give an order of magnitude. As I usually play thousands to several tens of thousands of games for testing, my choice is reversed colours.
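
A back-of-the-envelope version of that calculation, with a purely illustrative value for the residual bias of an unpaired opening set:

    sigma = 0.35   # per-game standard deviation of the score
    bias = 0.001   # systematic score error that does not decay (illustrative)

    # The statistical error sigma/sqrt(n) shrinks with n; the bias does not.
    # They cross where sigma/sqrt(n) == bias:
    n_cross = (sigma / bias) ** 2
    print(f"crossover near {n_cross:,.0f} games")  # ~122,500 here
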

The bigger problem is not only that you may have unbalanced positions, but also too-balanced (drawish) ones, or openings that one engine does not understand well. In the latter case, you amplify the advantage or disadvantage that a given engine may have.

Miguel
If we get into such subtleties, even the standard starting position has some flaws, as it scores 52-54% for White. Do you want the set of opening positions to score 50%, or 53% as is normal for White? The main point is to leave the opening close to the beginning, with representative openings balanced to a 48-57% result for White, and to play them with both colours. That I lose the equivalent of <2 games out of 1,000 in the confidence margins does not bother me much, except in 1,000,000-game matches. These arguments exist even between Elostat and Bayeselo, but for error margins I prefer Elostat.

The problems are too subtle and tiny to be presented in such an absurd manner.

Kai
Michiel Koorn

Re: Do You really need 1000s of games for testing?

Post by Michiel Koorn »

bob wrote:And it still is not clear to me that this idea would reduce the number of games required...
Fundamentally, continuous data contains more information than attribute or categorical data, allowing statistical sample sizes based on the normal probability distribution (or some less well-known ones) as opposed to the binomial or chi-square. Depending on the variance and the detection threshold, sample sizes can be reduced by orders of magnitude.

In the analysis I propose, two things can happen (see the sketch after this list):
1) game length correlates with the difference in playing strength, and the sample size will go down, or
2) game length is irrelevant, and the sample size will not change, or will perhaps go up.
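
The textbook sample-size formulas behind the claimed reduction, with illustrative numbers for the continuous case:

    z = 1.96  # 95% confidence

    # Binomial: games needed to resolve a 1% shift in a ~50% score.
    delta_score = 0.01
    n_binomial = z**2 * 0.5 * 0.5 / delta_score**2  # ~9604

    # Continuous: games needed to resolve a 3-move shift in average
    # game length with a 15-move standard deviation (illustrative).
    sd_len, delta_len = 15.0, 3.0
    n_continuous = (z * sd_len / delta_len) ** 2    # ~96

    print(round(n_binomial), round(n_continuous))
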
Michiel Koorn

Re: Do You really need 1000s of games for testing?

Post by Michiel Koorn »

IWB wrote:Hello
I checked this theory against my IPON database. Maybe the number of games is not large enough, but looking at the data there is nothing that jumps out to support your assumption:
Am I right that I am looking at ranking versus game length?
If so, this list does not prove anything one way or the other. If a good engine beats a bad engine quickly, they both record a short game.
What is needed is the following data per game:
player - opponent - outcome - game length, and if possible the players' ratings. Of course these can also be calculated from the data set. If you can provide me the raw data behind your list, I could do it.
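
Given such raw data, the check is a few lines; a sketch assuming a hypothetical CSV file "ipon.csv" with those columns plus the two ratings:

    import csv
    import statistics  # statistics.correlation needs Python 3.10+

    diffs, lengths = [], []
    with open("ipon.csv") as f:
        for row in csv.DictReader(f):
            diffs.append(abs(float(row["rating1"]) - float(row["rating2"])))
            lengths.append(int(row["length"]))

    # A clearly negative correlation would support the hypothesis that
    # larger strength differences produce shorter games.
    print(statistics.correlation(diffs, lengths))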