obvious/easy move

Sven · Post by **Sven** » Sat Jan 26, 2013 11:18 am

Adam Hair wrote:I screwed up. Sigma is a function of the inverse square root of the number of games. What HGM is pointing out is that you have to play twice the number of games per version to make the error equal to that when you self-test.

If the error is y Elo after x games, then the error when comparing gauntlet results is y*√2. To reduce that to y, you have to play 2x games against the gauntlet for each version. So, when you are starting from scratch you have to play 4x games in total to equal the error bars of x games of self-testing. For each new version, you will have to play 2x games against the gauntlet.

The error after x games self-testing is y
The error when comparing two versions via a gauntlet after x games each is √(y²+y²) = y*√2
To reduce this to y, the individual errors must be y/√2.
Since y is proportional to √(1/x), y/√2 is proportional to √(1/(2x))
So each version has to play 2x games against the gauntlet to make the error when comparing be y Elo.

Hi Adam,

can you explain to me what "self-testing" means for you, other than playing a "trivial gauntlet" against only one opponent which is another (previous) version of the same engine? I don't understand that different error bar calculation at all (gauntlet with N=1 opponents vs. gauntlet with N>1 opponents, why should it make a difference?), so there must be some basic detail that I missed.

Sven

hgm · Post by **hgm** » Sat Jan 26, 2013 11:27 am

"Self-testing" in this context is equal to "direct comparison", i.e. playing the programs you want to compare against each other. The alternative, "indirect comparison", is playing the programs you want to compare each independently against a third party.

Sven · Post by **Sven** » Sat Jan 26, 2013 12:10 pm

hgm wrote:"Self-testing" in this context is equal to "direct comparison", i.e. playing the programs you want to compare against each other. The alternative, "indirect comparison", is playing the programs you want to compare each independently against a third party.

O.k., thanks, so self-testing is not the same as a "gauntlet vs. 1 opponent". Then the difference in the required number of games to get the same error bars is obvious: if you play N games A1 vs. A2 then both A1 and A2 have N games. If you play A1 vs. B and A2 vs. B and want to get the same error bars for the A1 and A2 ratings then you need to play games such that both A1 and A2 have N games, so you need to play 2*N games in total. That does not change if you replace the single opponent B by K opponents B1 .. BK: both A1 and A2 still need N games, so still 2*N games in total.

Is it that simple, or did I simplify too much?

EDIT: there is a difference in interpretation, though, that should not be missed: the ratings (and their error bars) obtained from self-play are an estimate for the strength relation between these two engine versions and nothing else. The ratings + error bars obtained from any set of games between more than two engines, either via gauntlet or RR or anything in the middle of that, are an estimate for the strength relation between all these participating engines - and nothing else. If we are interested in a good estimate of the "real world" ratings of our tested engine versions then using more opponents usually comes closer to that "reality" than using fewer, and using only one opponent (as in self-play) is almost as far away from that "reality" as possible. So we should be aware of what we want to estimate, and what we actually are estimating with our tests.

Sven

hgm · Post by **hgm** » Sat Jan 26, 2013 12:35 pm

The point you missed is that the error bar in a difference of two independently measured ratings is the root-sum-square (quadrature sum) of the individual error bars. (So sqrt(2) times as large if they are equal.) So if you are interested in the difference of the two ratings (to know which was better), you need an extra doubling of the number of games to get the individual error bars sqrt(2) smaller.
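The argument above can be checked numerically with a minimal sketch. It assumes, as the thread does, that the error bar scales as 1/sqrt(games) with the same constant for self-play and gauntlet games; the function names and the value of `x` are hypothetical and chosen only for illustration.

```python
import math

def error_after(games, error_per_sqrt_game=1.0):
    """Error bar after `games` games, assuming it scales as 1/sqrt(games)."""
    return error_per_sqrt_game / math.sqrt(games)

def difference_error(err_a, err_b):
    """Error bar of the difference of two independently measured ratings."""
    return math.hypot(err_a, err_b)  # sqrt(err_a**2 + err_b**2)

x = 1000                      # hypothetical number of self-test games
y = error_after(x)            # error of the direct comparison

# Indirect comparison: each version plays x games against the gauntlet.
indirect_same = difference_error(error_after(x), error_after(x))

# Indirect comparison: each version plays 2x games against the gauntlet.
indirect_double = difference_error(error_after(2 * x), error_after(2 * x))

print(indirect_same / y)      # ratio to the self-test error with x games each
print(indirect_double / y)    # ratio to the self-test error with 2x games each
```

Under these assumptions the first ratio is sqrt(2) and the second is 1, which is the "extra doubling" described above.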

Rebel · Post by **Rebel** » Sat Jan 26, 2013 1:01 pm

bob wrote: However, you are measuring two different things. When you test against yourself, you are asking "how does this change perform against a program whose only difference is that it does not have this change?" When you test against others, you ask a different question: "How does this change influence my play against a variety of opponents?"

Those two questions are related, but they are absolutely NOT the same. One can prove this simply by playing against two different groups of opponents and notice that the ratings are not identical between the two tests. Or looking more carefully, the number of wins, draws and losses changes.

You say it yourself, different groups of opponents give different ratings. Self-play does not, but it has its own drawbacks. Whatever system you choose, it's imperfect. Combining them gives some more security, but it is still imperfect.

bob wrote: I am an absolutely firm believer in "test like you plan to run".

That's an intriguing statement, no idea what you mean by that.

bob wrote: I will try to dig up my old testing data where I addressed this specific data. It was quite a few years ago so it might take some digging, and it might not even be around...

There is no absolute truth but feel free to make an interesting contribution.

Adam Hair · Post by **Adam Hair** » Sat Jan 26, 2013 2:55 pm

Sven Schüle wrote: EDIT: there is a difference in interpretation, though, that should not be missed: the ratings (and their error bars) obtained from self-play are an estimate for the strength relation between these two engine versions and nothing else. The ratings + error bars obtained from any set of games between more than two engines, either via gauntlet or RR or anything in the middle of that, are an estimate for the strength relation between all these participating engines - and nothing else. If we are interested in a good estimate of the "real world" ratings of our tested engine versions then using more opponents usually comes closer to that "reality" than using fewer, and using only one opponent (as in self-play) is almost as far away from that "reality" as possible. So we should be aware of what we want to estimate, and what we actually are estimating with our tests.

Sven

I agree with you, Sven.

This is what we do with Gaviota. When Miguel makes a change to Gaviota's code for the purpose of increasing strength, he runs 80k fixed-node games (30k nodes) against the previous version. If it appears to be stronger, he keeps the change. I think that once the accumulated gain reaches about 20 Elo, he confirms the changes by playing 40k games at *short time controls against a gauntlet. I confirm his findings by running my own 26k-game gauntlet at *short time controls. I also run longer time control games to estimate Gaviota's strength at normal usage.

The amount of Elo increase from the fixed node self-testing is different from what we measure from the gauntlets, but they do indicate positive changes. It is the measurement of Elo increase from the gauntlet results that gives us a better indication of how well Gaviota will do in the real world. And even that is only a rough estimate.

* Short time controls for us means 40 moves in 16 seconds for Gaviota. I use stronger engines with time odds as the gauntlet opponents. Miguel does the same.

Richard Allbert · Post by **Richard Allbert** » Sat Jan 26, 2013 4:26 pm

Just for my own sanity...

Name            Elo   +   -  games  score  oppo.  draws
Crafty-23.6-1   2652  4   4  30000  62%    2557   25%
Crafty-23.6R02  2641  7   7   8515  60%    2558   25%

You say 11 Elo here, but isn't that result saying the strength is the same?

2641 + 7 = 2648 and 2652 - 4 = 2648 .. the error bars meet each other.

I understand after 30k games the result was the same anyway, but I thought we were always supposed to take error bars into account.

hgm · Post by **hgm** » Sat Jan 26, 2013 5:11 pm

Errors should be added 'a la Pythagoras', so the error bar in the difference is sqrt(4*4+7*7) = sqrt(65) ≈ 8. So the difference of 11 is outside the (95%) error bar for the difference. It is more like a 3-sigma difference.
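As a rough check of this arithmetic, here is a small sketch using the rounded values from the rating list. The conversion from a 95% error bar to one sigma uses a factor of 1.96, which is an assumption (normal errors, two-sided 95% interval); the variable names are hypothetical.

```python
import math

def difference_error_bar(bar_a, bar_b):
    """Error bar of a difference of two independent ratings (added in quadrature)."""
    return math.hypot(bar_a, bar_b)

bar_diff = difference_error_bar(4, 7)   # sqrt(65), about 8.06 Elo
elo_diff = 2652 - 2641                  # 11 Elo between the two versions

Z_95 = 1.96                             # assumed sigma multiple for a 95% bar
sigma_diff = bar_diff / Z_95            # one-sigma error of the difference
n_sigma = elo_diff / sigma_diff         # size of the difference in sigmas

print(f"error bar of difference: {bar_diff:.2f} Elo")
print(f"difference: {elo_diff} Elo = {n_sigma:.2f} sigma")
```

With these rounded inputs the difference comes out a little under 3 sigma (roughly 2.7), so it lies outside the 95% error bar for the difference even though the individual error bars touch.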

Richard Allbert · Post by **Richard Allbert** » Sat Jan 26, 2013 5:23 pm

Thanks

I'd always read the error bars by simply adding / subtracting them.

Ajedrecista · Post by **Ajedrecista** » Sat Jan 26, 2013 6:13 pm

Hello:

hgm wrote:Errors should be added 'a la Pythagoras', so the error bar in the difference is sqrt(4*4+7*7) = sqrt(65) ≈ 8. So the difference of 11 is outside the (95%) error bar for the difference. It is more like a 3-sigma difference.

I fully agree with you. Here is a brief explanation of a normal difference distribution (I assume that each error follows a normal distribution).

I deduced a formula for error bars (using my own model) more than a year ago:

Comparison between formulæ of standard deviations.

Ajedrecista wrote:
<e> = ± 200·log{[mu + k·(sd)][1 - mu + k·(sd)] / ([mu - k·(sd)][1 - mu - k·(sd)])}
Where k gives the confidence level (k ~ 1.96 for 95% confidence, k = 2 for ~ 95.45% confidence...).

I use sample standard deviations:

Code:

_i stands for subindex i.

Number of games: wins_i + draws_i + losses_i = n_i
Draw ratio: D_i = (draws_i)/n_i

µ_i = (wins_i + 0.5*draws_i)/n_i
1 - µ_i = (losses_i + 0.5*draws_i)/n_i

s_i = sqrt{[µ_i*(1 - µ_i) - D_i/4]/(n_i - 1)}

i = 1 and 2 in this case:

Code:

n_1 = 30000; n_2 = 8515
µ_1 = 0.62; µ_2 = 0.6
D_1 = 0.25 = D_2

I used Derive 6 for some calculations. Here are the details:

Code:

s_1 = sqrt[(0.62*0.38 - 0.25/4)/29999] ~ 0.002402122465
s_2 = sqrt[(0.6*0.4 - 0.25/4)/8514] ~ 0.004565962662

error_1 = 200*log{[(0.62 + z*s_1)(0.38 + z*s_1)]/[(0.62 - z*s_1)(0.38 - z*s_1)]}
error_2 = 200*log{[(0.6 + z*s_2)(0.4 + z*s_2)]/[(0.6 - z*s_2)(0.4 - z*s_2)]}

11 = sqrt[(error_1)² + (error_2)²]

Solving for z gives z ~ 2.932893139, which is almost 3, as you wrote. Going a little further:

Code:

erf(x): error function of x.

Confidence[(Crafty-23.6-1) is better than (Crafty-23.6R02)] = erf[z/sqrt(2)] ~ 0.9966418054 ~ 99.6642%.

LOS[(Crafty-23.6-1) is better than (Crafty-23.6R02)] = 0.5*{1 + erf[z/sqrt(2)]} ~ 0.9983209027 ~ 99.8321%.

I hope that all my calculations are clear and correct. I know that the only exact numbers are 30000 and 8515 (the numbers of games), while the rest of the numbers (scores, draw ratios and the Elo difference of 11) are rounded.

Regards from Spain.

Ajedrecista.
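The calculation in this post can be reproduced with a short script. This is only a sketch under stated assumptions: `log` is taken to be the base-10 logarithm (as in the usual Elo formula), z is found by plain bisection rather than with Derive 6, and all function names are hypothetical.

```python
import math

def sample_sd(mu, draw_ratio, n):
    """Sample standard deviation of the score, per the s_i formula above."""
    return math.sqrt((mu * (1 - mu) - draw_ratio / 4) / (n - 1))

def elo_error(mu, sd, z):
    """Error bar in Elo at z standard deviations (log assumed to be base 10)."""
    num = (mu + z * sd) * (1 - mu + z * sd)
    den = (mu - z * sd) * (1 - mu - z * sd)
    return 200 * math.log10(num / den)

# Rounded inputs taken from the rating list.
mu_1, mu_2 = 0.62, 0.60
s_1 = sample_sd(mu_1, 0.25, 30000)
s_2 = sample_sd(mu_2, 0.25, 8515)

def combined_error(z):
    """Quadrature sum of both error bars at the same z."""
    return math.hypot(elo_error(mu_1, s_1, z), elo_error(mu_2, s_2, z))

def solve_z(target, lo=0.0, hi=10.0, tol=1e-12):
    """Bisection: combined_error grows with z on this interval."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if combined_error(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = solve_z(11)                                # Elo difference of 11
confidence = math.erf(z / math.sqrt(2))        # two-sided confidence
los = 0.5 * (1 + math.erf(z / math.sqrt(2)))   # likelihood of superiority

print(f"s_1 = {s_1:.12f}, s_2 = {s_2:.12f}")
print(f"z = {z:.6f}, confidence = {confidence:.6f}, LOS = {los:.6f}")
```

Because the inputs are rounded, the result should be read as an approximation, in line with the caveat in the post.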
