Do You really need 1000s of games for testing?

Laskos · Post by **Laskos** » Fri Nov 05, 2010 12:42 pm

Uri Blass wrote:
My point is not that I know a way to do better, but that it has not been proved mathematically: in theory it is possible to have 200 positions for which a match based on them tells us correctly which program is better whenever the real difference is at least 2 Elo (even if the real difference is calculated from some fixed large set of 1,000,000 positions).

If you admit that N = W + D + L is the number of trials (games) and that W/N, D/N, L/N are the probabilities of a categorical distribution, then you cannot cheat statistics. If you do not accept that description, then you can say anything; I can only say that you will have an even harder time than with pure statistics.

Kai
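To make the categorical model above concrete, here is a minimal sketch that turns a W/D/L record into an Elo estimate with an approximate 95% interval. It assumes independent games, the usual logistic Elo formula and a normal approximation; the function names and the sample record are illustrative, not something posted in the thread.

```python
import math

def elo_from_score(p):
    """Logistic Elo difference implied by an expected score p (0 < p < 1)."""
    return 400.0 * math.log10(p / (1.0 - p))

def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, elo_low, elo_high) for a W/D/L record.

    Assumes independent games drawn from one categorical distribution,
    and a normal approximation for the mean score.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the outcome (1, 0.5 or 0), estimated from the sample.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    low = elo_from_score(score - z * stderr)
    high = elo_from_score(score + z * stderr)
    return elo_from_score(score), low, high

# Made-up record: 400 games, 52.5% score.
elo, low, high = elo_with_margin(wins=150, draws=120, losses=130)
print(f"{elo:+.1f} Elo, 95% interval [{low:+.1f}, {high:+.1f}]")
```

With this made-up record the point estimate is positive, but the interval still includes zero, which is the situation the thread is arguing about. The sketch does not guard against a score so extreme that the interval leaves (0, 1).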

Uri Blass · Post by **Uri Blass** » Fri Nov 05, 2010 1:43 pm

Laskos wrote:
Uri Blass wrote:
My point is not that I know a way to do better, but that it has not been proved mathematically: in theory it is possible to have 200 positions for which a match based on them tells us correctly which program is better whenever the real difference is at least 2 Elo (even if the real difference is calculated from some fixed large set of 1,000,000 positions).
If you admit that N = W + D + L is the number of trials (games) and that W/N, D/N, L/N are the probabilities of a categorical distribution, then you cannot cheat statistics. If you do not accept that description, then you can say anything; I can only say that you will have an even harder time than with pure statistics.

Kai

I do not admit that the probabilities are W/N, D/N, L/N; the probabilities may be a function of the starting position.

For example, the probabilities may be the following:
1) 1 for White to win if A plays White against B in position 1.
2) 1 for a draw if B plays White against A in position 1.

zamar · Post by **zamar** » Fri Nov 05, 2010 2:20 pm

Uri Blass wrote: I do not know 200 positions for which it is correct, but I cannot prove that you cannot build 200 positions for which it is correct.

Of course you're right in a way.
- There might be a unicorn behind my front door right now. I cannot prove that it could not possibly be true.
- It's entirely possible that the entire universe was made from olive leaves by angels. I cannot prove that incorrect.

But as you can hopefully see yourself, this "you cannot prove that something might not be true" approach leads nowhere. That's why I'd much rather stay with the widely accepted statistical model that H.G. and Bob are talking about.

Laskos · Post by **Laskos** » Fri Nov 05, 2010 2:23 pm

Uri Blass wrote:
Laskos wrote:
Uri Blass wrote:
My point is not that I know a way to do better, but that it has not been proved mathematically: in theory it is possible to have 200 positions for which a match based on them tells us correctly which program is better whenever the real difference is at least 2 Elo (even if the real difference is calculated from some fixed large set of 1,000,000 positions).
If you admit that N = W + D + L is the number of trials (games) and that W/N, D/N, L/N are the probabilities of a categorical distribution, then you cannot cheat statistics. If you do not accept that description, then you can say anything; I can only say that you will have an even harder time than with pure statistics.

Kai
I do not admit that the probabilities are W/N, D/N, L/N; the probabilities may be a function of the starting position.

For example, the probabilities may be the following:
1) 1 for White to win if A plays White against B in position 1.
2) 1 for a draw if B plays White against A in position 1.

That is a highly artificial proposal. The position must be unbalanced, and the engines must play deterministically. If you can give one single example of engines and a position at a fixed time control that behave this way, then I will eliminate it from my EPD file. These freak positions are contributors to the systematic error, the thing I wrote about before.

Trying to cheat statistics you would cheat yourself with a systematic error. Based only on your position, and behaving as you described in each game, with what precision can you determine the strength of engines? The difference is clear, 191 Elo points, but 191 +/- ? 95% confidence in how many games? As I said, you make things worse than pure statistics.

Kai
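The 191 Elo figure follows from the example itself: a win as White plus a draw as Black is a 75% score. The sketch below (assuming the logistic Elo formula; the names are illustrative) shows that replaying the same deterministic pair any number of times gives the same number.

```python
import math

def elo_from_score(p):
    """Logistic Elo difference implied by an expected score p (0 < p < 1)."""
    return 400.0 * math.log10(p / (1.0 - p))

# Deterministic outcomes for position 1, from engine A's point of view:
# A wins as White (1.0) and draws as Black (0.5).
pair_results = [1.0, 0.5]

def match_score(pair, repeats):
    """Mean score when the same deterministic pair is replayed `repeats` times."""
    games = pair * repeats
    return sum(games) / len(games)

for repeats in (1, 10, 1000):
    s = match_score(pair_results, repeats)
    print(2 * repeats, "games ->", round(elo_from_score(s)), "Elo")
```

Every line prints 191 Elo. The sample scatter is zero, yet the extra games add no information about how the engines compare on any other position, so the uncertainty that matters is not reduced.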

Roger Brown · Post by **Roger Brown** » Fri Nov 05, 2010 2:43 pm

zamar wrote:
Uri Blass wrote: I do not know 200 positions for which it is correct, but I cannot prove that you cannot build 200 positions for which it is correct.
Of course you're right in a way.
- There might be a unicorn behind my front door right now. I cannot prove that it could not possibly be true.
- It's entirely possible that the entire universe was made from olive leaves by angels. I cannot prove that incorrect.

But as you can hopefully see yourself, this "you cannot prove that something might not be true" approach leads nowhere. That's why I'd much rather stay with the widely accepted statistical model that H.G. and Bob are talking about.

Hello Joona Kiiski,

Give up. Save yourself the effort. What I find fascinating is that only one member is given license to post arguments like this without the dreaded s.... word being invoked. Proving the negative is fraught with practical and logical difficulties.

Arguing that because you cannot do it then XX leads to a nowhere destination. At the end of it, the depressing message of persons like Theron, Hyatt, Muller et al is that yes, you need many, many games.

Once I realised that, it set me free from posting all sorts of "findings" from my ten game tournaments. This question of obtaining meaningful results from a small sample of games will never die though.

Later.

Ps. There is a unicorn lurking outside your door but the mere act of attempting to observe it will cause it to disappear. Didn't you know that? I do not know what they teach people these days!

Michiel Koorn · Post by **Michiel Koorn** » Fri Nov 05, 2010 3:56 pm

I recognise the issue, and have a question to all.
Currently only the w/l/d information is taken from the game. From this datastructure the sample sizes follow.
What if you use game length as an additional parameter:
1/#turns for a win, -1/#turns for a loss and 0 for a draw? This way the data becomes continuous, not discrete, with all the benefits to statistics.
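A minimal sketch of the proposed scoring, assuming "turns" means the number of moves in the game; the function name and the sample games are made up for illustration.

```python
def length_weighted_score(result, turns):
    """Score one game as +1/turns for a win, -1/turns for a loss, 0 for a draw.

    `result` is 'win', 'loss' or 'draw' from the tested engine's point of view;
    `turns` is the game length in moves. Both names are illustrative.
    """
    if turns <= 0:
        raise ValueError("turns must be positive")
    sign = {"win": 1, "draw": 0, "loss": -1}[result]
    return sign / turns

# Made-up games: (result, game length in moves).
games = [("win", 40), ("loss", 80), ("draw", 60), ("win", 25)]
scores = [length_weighted_score(r, t) for r, t in games]
mean = sum(scores) / len(scores)
print(scores, mean)
```

Whether game length carries any information about strength is an open assumption; the sketch only shows how the statistic would be computed.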

michiguel · Post by **michiguel** » Fri Nov 05, 2010 4:29 pm

Laskos wrote:
Uri Blass wrote:
hgm wrote: The mathematics really tells everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed because you know what is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.

I don't know how solid your observation is. (E.g. what does "usually" mean, and how many engines were included in this observation?) With 200 games the score standard deviation should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where it changes by 5 or less. But in more than half the cases, it should change 10 or more. If not, something fishy is going on with that rating list...
I do not know a way to reach a conclusion about a small difference from a small number of games, but nothing in math tells me that it is impossible.

You make assumptions, such as the assumption that the games are independent (if you play the same opening with both colors, this assumption is not correct).

Nothing in math tells me that it is impossible to have 200 positions such that every time A is better than B by at least 2 Elo, A beats B in a match of 400 games based on the 200 positions.

I do not know 200 positions for which it is correct, but I cannot prove that you cannot build 200 positions for which it is correct.
Uri, you mean by that the systematic error, which is much worse than statistical error. You can get every time 2 Elo points difference in a 400 games match which means nothing until you eliminate exactly that systematic error. You can get even a correct result with the systematic error, but you can not know when. For better or worse, most engines exhibit a sufficiently random behaviour even at fixed time or fixed depth controls to get rid of systematic errors. That is, with a set of balanced, random opening positions. I am even in favour of playing each position with reversed colours too, as this eliminates unbalanced positions.

If you have an unbalanced position, you do not solve the problem by reversing the colors. You make it worse. Now you have two unbalanced positions. This may seem to be a fair practice for sporting reasons, but it is not a good sampling procedure.

Miguel

ONE CAN NOT DO BETTER THAN THE STATISTICAL ERROR MARGINS.
ONE CAN DO WORSE THAN THAT.

Kai
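The figures hgm quotes above (200 games, about 2.5% score standard deviation, about 17 Elo) can be checked with a short sketch. The per-game standard deviation of 0.35 is an assumption, corresponding to roughly half the games being drawn between equal engines; the function names are illustrative.

```python
import math

def score_sd(n_games, per_game_sd):
    """Standard deviation of the mean score over n independent games."""
    return per_game_sd / math.sqrt(n_games)

def elo_per_score_unit(p=0.5):
    """Slope of the logistic Elo curve at expected score p."""
    return 400.0 / (math.log(10) * p * (1.0 - p))

def games_for_margin(elo_margin, per_game_sd, z=1.96):
    """Games needed so that z standard deviations equal elo_margin near 50%."""
    root_n = z * per_game_sd * elo_per_score_unit() / elo_margin
    return math.ceil(root_n ** 2)

PER_GAME_SD = 0.35  # assumed; 0.5 is the maximum, reached with no draws at a 50% score

sd = score_sd(200, PER_GAME_SD)
elo_sd = sd * elo_per_score_unit()
print(f"200 games: score SD {sd:.1%}, about {elo_sd:.0f} Elo")
print("games for a +/-2 Elo 95% margin:", games_for_margin(2.0, PER_GAME_SD))
```

Under these assumptions, resolving a 2 Elo difference at 95% confidence takes tens of thousands of independent games, not 400.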

Laskos · Post by **Laskos** » Fri Nov 05, 2010 5:00 pm

michiguel wrote: If you have an unbalanced position, you do not solve the problem by reversing the colors. You make it worse. Now you have two unbalanced positions. This may seem to be a fair practice for sporting reasons, but it is not a good sampling procedure.

Miguel

Maybe we mean different things by "solve". In my sense, if an engine gains or loses because of an unbalanced position (decisively or not), the effect will be cancelled by losing or gaining with the opposite colour (decisively or not). An unjust +1 or -1 is worse than an unjust (+1 -1) or (-1 +1). Similar things can be said of unjust draws, etc. I clearly prefer playing each position in my matches with both colours.

Kai
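Both views can be illustrated with a toy model in which the expected score is 0.5 plus a strength edge, plus or minus a bias for the side playing White, clipped to [0, 1]. The linear form and all the numbers are assumptions made for illustration, not something either poster proposed.

```python
def expected_score(strength_edge, white_bias, plays_white):
    """Toy linear model: expected score of engine A in one game, clipped to [0, 1]."""
    raw = 0.5 + strength_edge + (white_bias if plays_white else -white_bias)
    return min(1.0, max(0.0, raw))

def single_game(strength_edge, white_bias):
    """Expected score when A plays the position once, as White."""
    return expected_score(strength_edge, white_bias, plays_white=True)

def reversed_pair(strength_edge, white_bias):
    """Mean expected score when A plays the position once with each colour."""
    as_white = expected_score(strength_edge, white_bias, True)
    as_black = expected_score(strength_edge, white_bias, False)
    return (as_white + as_black) / 2.0

edge = 0.02  # assumed true edge of A over B, in score units
for bias in (0.0, 0.1, 0.6):
    print(bias, single_game(edge, bias), reversed_pair(edge, bias))
```

For a moderate bias (0.1) the reversed pair recovers 0.5 + edge, while the single game is off by the full bias. For a decisive bias (0.6) both games are decided by colour and the pair averages exactly 0.5, so the cancellation holds but the pair carries no information about the edge.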

bob · Post by **bob** » Fri Nov 05, 2010 5:59 pm

Michiel Koorn wrote: I recognise the issue, and have a question to all.
Currently only the w/l/d information is taken from the game. From this datastructure the sample sizes follow.
What if you use game length as an additional parameter:
1/#turns for a win, -1/#turns for a loss and 0 for a draw? This way the data becomes continuous, not discrete, with all the benefits to statistics.

It still won't be continuous, just "less discrete" since you have more outcomes.

But the first question is, what suggests that the length of a game has anything to do with the skill level of either player?

Michiel Koorn · Post by **Michiel Koorn** » Fri Nov 05, 2010 9:15 pm

The assumption is that the better player finds quicker ways to win and, when it loses, takes longer to be defeated. Assumptions need to be validated...
