Math Test 4 All

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

Which Version is better?

Poll ended at Tue Apr 09, 2013 1:05 am

A is better
3
17%
B is better
1
6%
Can't tell - they may be the same
14
78%
 
Total votes: 18

CRoberson
Posts: 2094
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Math Test 4 All

Post by CRoberson »

CRoberson wrote:Which version is better A or B?

Score of A vs Telepath6.030: 364 - 144 - 262
A scores (364+262/2)/770 = 495.0/770 = 64.286%
~= +104 Elo
Margins are +/- 21 Elo

Score of B vs Telepath6.030: 271 - 133 - 195
B scores (271+195/2)/599 = 368.5/599 = 61.519%
~= +84 Elo
Margins are +/- 25 Elo

vote in the poll and post your reasons.
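For reference, the score, Elo, and margin figures in the quoted results can be reproduced approximately from the W-L-D counts with the usual logistic model. This is a minimal Python sketch (the helper name `score_and_elo` is mine, not from any tool mentioned in the thread; exact margins depend on the draw model and confidence level used, so the numbers agree only roughly with the ±21/±25 quoted above):

```python
import math

def score_and_elo(wins, losses, draws, z=1.96):
    """Return (score fraction, Elo estimate, ~95% Elo margin) from a W-L-D record."""
    n = wins + losses + draws
    s = (wins + draws / 2) / n
    # logistic Elo model: s = 1 / (1 + 10^(-elo/400))
    elo = -400 * math.log10(1 / s - 1)
    # per-game variance of the trinomial score (outcomes 1, 0.5, 0)
    var = (wins + draws / 4) / n - s * s
    se = math.sqrt(var / n)
    # convert the score margin to Elo via the local slope of the logistic curve
    margin = z * se * 400 / (math.log(10) * s * (1 - s))
    return s, elo, margin

print(score_and_elo(364, 144, 262))  # A vs Telepath: ~64.3%, roughly +100 Elo
print(score_and_elo(271, 133, 195))  # B vs Telepath: ~61.5%, roughly +80 Elo
```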
Ok, guys. This wasn't meant as a cry for help. I knew the answer before posting. I was hoping to point out this real-world scenario to all the "testers" who claim they know a new version is better.

So, let's look at Conkie's analysis. It was quite correct and simple. Both results are barely within each other's margins, so we can't tell whether the programs are different.

I have seen others' statements in the last year that if the results are barely within the margins, then the top one is probably better. Well, that is not the way the math works. The math purely goes: you have a value that is either outside or inside the margins. If it is inside (anywhere inside), then you can't claim one is better unless (as Miguel stated) you lower the margins. If they are outside each other's margins, then you can make a statement. The problem here is a little more complicated than the classical stats test that I stated. Instead of comparing one variable against a constant value, we are comparing two variables. Thus, we have to consider the fact that the ranges for each overlap.
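The "two variables" comparison can be sketched as a two-sample z-test on the difference of the score fractions, which is the statistically correct alternative to eyeballing overlapping intervals. A sketch under the assumption that the two matches are independent samples against the same opponent (not any poster's actual tool):

```python
import math

def two_sample_z(r1, r2):
    """z-score for the difference of two independent score fractions.
    Each r is a (wins, losses, draws) record versus the same opponent."""
    def stats(wins, losses, draws):
        n = wins + losses + draws
        s = (wins + draws / 2) / n
        var = (wins + draws / 4) / n - s * s  # trinomial per-game variance
        return s, var / n
    s1, v1 = stats(*r1)
    s2, v2 = stats(*r2)
    return (s1 - s2) / math.sqrt(v1 + v2)

z = two_sample_z((364, 144, 262), (271, 133, 195))
print(round(z, 2))
```

With the data above, z comes out around 1.3, well below the ~1.96 needed for 95% confidence, which is exactly the "can't tell" conclusion.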

If I understood Lucas' statement correctly, he used LOS to decide that A is better than B. But Munoz did an LOS calculation and said that it is unclear. So, I don't get that.

Well, here is the answer: A == B. I set up the first test to run 2,400 games, and it crashed after 770 games. So, I ran the same programs again, and that run crashed after 599 games. So, A and B are the same program.

The point for the testers is that here is real-world data: two seemingly different results that are really two rather varied samples of the same program, still within each other's margins.

So, two genuinely different programs that exhibit data like that shown here are not clearly stronger or weaker than each other.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Math Test 4 All

Post by Laskos »

CRoberson wrote:
CRoberson wrote:Which version is better A or B?

Score of A vs Telepath6.030: 364 - 144 - 262
A scores (364+262/2)/770 = 495.0/770 = 64.286%
~= +104 Elo
Margins are +/- 21 Elo

Score of B vs Telepath6.030: 271 - 133 - 195
B scores (271+195/2)/599 = 368.5/599 = 61.519%
~= +84 Elo
Margins are +/- 25 Elo

vote in the poll and post your reasons.
Ok, guys. This wasn't meant as a cry for help. I knew the answer before posting. I was hoping to point out this real-world scenario to all the "testers" who claim they know a new version is better.

So, let's look at Conkie's analysis. It was quite correct and simple. Both results are barely within each other's margins, so we can't tell whether the programs are different.

I have seen others' statements in the last year that if the results are barely within the margins, then the top one is probably better. Well, that is not the way the math works. The math purely goes: you have a value that is either outside or inside the margins. If it is inside (anywhere inside), then you can't claim one is better unless (as Miguel stated) you lower the margins. If they are outside each other's margins, then you can make a statement.
Error margins are quoted at a given confidence level, so you can make a statement at that confidence.
The problem here is a little more complicated than the classical stats test that I stated. Instead of comparing one variable against a constant value, we are comparing two variables. Thus, we have to consider the fact that the ranges for each overlap.

If I understood Lucas' statement correctly, he used LOS to decide that A is better than B. But Munoz did an LOS calculation and said that it is unclear. So, I don't get that.
I think Lucas said that LOS of A against Telepath and LOS of B against Telepath are very close to 100%. LOS of A against B is some 75%, smaller than the desired 95% or whatever, and we would usually call that undecided.
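One common way LOS is approximated in engine testing is from decisive games only, via the normal approximation (a sketch of that convention, not a quote of what either Lucas or Munoz actually ran; an A-versus-B LOS additionally requires modelling the two independent matches):

```python
import math

def los(wins, losses):
    """Likelihood of superiority from decisive games only (draws ignored),
    using the usual normal approximation."""
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * (wins + losses))))

print(los(364, 144))  # A vs Telepath: essentially 1.0
print(los(271, 133))  # B vs Telepath: essentially 1.0
```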

Well, here is the answer: A == B. I set up the first test to run 2,400 games, and it crashed after 770 games. So, I ran the same programs again, and that run crashed after 599 games. So, A and B are the same program.

The point for the testers is that here is real-world data: two seemingly different results that are really two rather varied samples of the same program, still within each other's margins.

So, two genuinely different programs that exhibit data like that shown here are not clearly stronger or weaker than each other.
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Math Test 4 All

Post by lucasart »

Laskos wrote:I think Lucas said that LOS A against Telepath and LOS B against Telepath are very close to 100%.
Yes, that was what I meant.

But I still think that the real problem of this experiment is uncontrolled early stopping. No one is talking about this!

At best, there's no bias in the stopping time (it's random, due to a power failure or a crash in the program), and you merely introduce extra variance that you, by definition, cannot predict, since it's an exogenous source of variance.

At worst, it's biased, as the operator stops the test when he "likes" the result.
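The biased case is easy to demonstrate with a tiny Monte Carlo sketch (all numbers here are hypothetical: a truly equal 50% match-up, and an operator who stops a run as soon as the score tops 55% after at least 50 games):

```python
import random

def mean_with_opportunistic_stopping(n_trials=2000, max_games=1000, p=0.5):
    """Simulate a truly 50% match-up, but stop each run early as soon as
    the running score 'looks good' (>55% after >=50 games).  The average
    recorded score ends up biased above the true 0.5."""
    random.seed(1)
    total = 0.0
    for _ in range(n_trials):
        score = 0
        for g in range(1, max_games + 1):
            score += random.random() < p  # 1 point for a 'win'
            if g >= 50 and score / g > 0.55:
                break  # operator "likes" the result and stops here
        total += score / g
    return total / n_trials

print(mean_with_opportunistic_stopping())  # noticeably above 0.5
```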
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.