Math Test 4 All

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

Which Version is better?

Poll ended at Tue Apr 09, 2013 1:05 am

A is better
3
17%
B is better
1
6%
Can't tell - they may be the same
14
78%
 
Total votes: 18

CRoberson
Posts: 2094
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Math Test 4 All

Post by CRoberson »

CRoberson wrote:Which version is better A or B?

Score of A vs Telepath6.030: 364 - 144 - 262
A scores (364+262/2)/770 = 495.0/770 = 64.286%
~= +104 Elo
Margins are +/- 21 Elo

Score of B vs Telepath6.030: 271 - 133 - 195
B scores (271+195/2)/599 = 368.5/599 = 61.519%
~= +84 Elo
Margins are +/- 25 Elo

vote in the poll and post your reasons.
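For reference, the score, Elo, and margin figures in the quoted results can be reproduced approximately from the W-L-D counts with the usual logistic model. This is a minimal Python sketch (the helper name `score_and_elo` is mine, not from any tool mentioned in the thread; exact margins depend on the draw model and confidence level used, so the numbers agree only roughly with the ±21/±25 quoted above):

```python
import math

def score_and_elo(wins, losses, draws, z=1.96):
    """Return (score fraction, Elo estimate, ~95% Elo margin) from a W-L-D record."""
    n = wins + losses + draws
    s = (wins + draws / 2) / n
    # logistic Elo model: s = 1 / (1 + 10^(-elo/400))
    elo = -400 * math.log10(1 / s - 1)
    # per-game variance of the trinomial score (outcomes 1, 0.5, 0)
    var = (wins + draws / 4) / n - s * s
    se = math.sqrt(var / n)
    # convert the score margin to Elo via the local slope of the logistic curve
    margin = z * se * 400 / (math.log(10) * s * (1 - s))
    return s, elo, margin

print(score_and_elo(364, 144, 262))  # A vs Telepath: ~64.3%, roughly +100 Elo
print(score_and_elo(271, 133, 195))  # B vs Telepath: ~61.5%, roughly +80 Elo
```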
Ok, guys. This wasn't meant as a cry for help. I knew the answer before posting. I was hoping to point out this real-world scenario to all the "testers" who claim they know a new version is better.

So, let's look at Conkie's analysis. It was quite correct and simple. Both results are barely within each other's margins, so we can't tell whether the programs are different.

I have seen others' statements in the last year that if the results are barely within the margins, then the top one is probably better. Well, that is not the way the math works. The math purely goes: you have a value that is either outside or inside the margins. If it is inside (anywhere inside), then you can't claim one is better unless (as Miguel stated) you lower the margins. If they are outside each other's margins, then you can make a statement. The problem here is a little more complicated than the classical stats test that I stated. Instead of comparing one variable against a constant value, we are comparing two variables. Thus, we have to consider the fact that the ranges for each overlap.
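The "two variables" comparison can be sketched as a two-sample z-test on the difference of the score fractions, which is the statistically correct alternative to eyeballing overlapping intervals. A sketch under the assumption that the two matches are independent samples against the same opponent (not any poster's actual tool):

```python
import math

def two_sample_z(r1, r2):
    """z-score for the difference of two independent score fractions.
    Each r is a (wins, losses, draws) record versus the same opponent."""
    def stats(wins, losses, draws):
        n = wins + losses + draws
        s = (wins + draws / 2) / n
        var = (wins + draws / 4) / n - s * s  # trinomial per-game variance
        return s, var / n
    s1, v1 = stats(*r1)
    s2, v2 = stats(*r2)
    return (s1 - s2) / math.sqrt(v1 + v2)

z = two_sample_z((364, 144, 262), (271, 133, 195))
print(round(z, 2))
```

With the data above, z comes out around 1.3, well below the ~1.96 needed for 95% confidence, which is exactly the "can't tell" conclusion.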

If I understood Lucas' statement correctly, he used LOS to decide that A is better than B. But Munoz did an LOS calculation and said that it is unclear. So, I don't get that.

Well, here is the answer: A == B. I set up the first test to run 2,400 games, and it crashed after 770 games. So, I ran the same programs again, and that run crashed after 599 games. So, A and B are the same program.

The point for the testers is that here is real-world data: two seemingly different results that are really two rather varied samples of the same program, still within each other's margins.

So, two genuinely different programs that exhibit data like that shown here are not clearly stronger or weaker than each other.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Math Test 4 All

Post by Laskos »

CRoberson wrote:
CRoberson wrote:Which version is better A or B?

Score of A vs Telepath6.030: 364 - 144 - 262
A scores (364+262/2)/770 = 495.0/770 = 64.286%
~= +104 Elo
Margins are +/- 21 Elo

Score of B vs Telepath6.030: 271 - 133 - 195
B scores (271+195/2)/599 = 368.5/599 = 61.519%
~= +84 Elo
Margins are +/- 25 Elo

vote in the poll and post your reasons.
Ok, guys. This wasn't meant as a cry for help. I knew the answer before posting. I was hoping to point out this real-world scenario to all the "testers" who claim they know a new version is better.

So, let's look at Conkie's analysis. It was quite correct and simple. Both results are barely within each other's margins, so we can't tell whether the programs are different.

I have seen others' statements in the last year that if the results are barely within the margins, then the top one is probably better. Well, that is not the way the math works. The math purely goes: you have a value that is either outside or inside the margins. If it is inside (anywhere inside), then you can't claim one is better unless (as Miguel stated) you lower the margins. If they are outside each other's margins, then you can make a statement.
Error margins are quoted at a given confidence level, so you can make a statement at that confidence.
The problem here is a little more complicated than the classical stats test that I stated. Instead of comparing one variable against a constant value, we are comparing two variables. Thus, we have to consider the fact that the ranges for each overlap.

If I understood Lucas' statement correctly, he used LOS to decide that A is better than B. But Munoz did an LOS calculation and said that it is unclear. So, I don't get that.
I think Lucas said that LOS of A against Telepath and LOS of B against Telepath are very close to 100%. LOS of A against B is some 75%, smaller than the desired 95% or whatever, and we would usually call that undecided.
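One common way LOS is approximated in engine testing is from decisive games only, via the normal approximation (a sketch of that convention, not a quote of what either Lucas or Munoz actually ran; an A-versus-B LOS additionally requires modelling the two independent matches):

```python
import math

def los(wins, losses):
    """Likelihood of superiority from decisive games only (draws ignored),
    using the usual normal approximation."""
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * (wins + losses))))

print(los(364, 144))  # A vs Telepath: essentially 1.0
print(los(271, 133))  # B vs Telepath: essentially 1.0
```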

Well, here is the answer: A == B. I set up the first test to run 2,400 games, and it crashed after 770 games. So, I ran the same programs again, and that run crashed after 599 games. So, A and B are the same program.

The point for the testers is that here is real-world data: two seemingly different results that are really two rather varied samples of the same program, still within each other's margins.

So, two genuinely different programs that exhibit data like that shown here are not clearly stronger or weaker than each other.
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Math Test 4 All

Post by lucasart »

Laskos wrote:I think Lucas said that LOS A against Telepath and LOS B against Telepath are very close to 100%.
Yes, that was what I meant.

But I still think that the real problem of this experiment is uncontrolled early stopping. No one is talking about this!

At best, there's no bias in the stopping time (it's random, due to a power failure or a crash in the program), and you merely introduce extra variance that you, by definition, cannot predict, since it's an exogenous source of variance.

At worst, it's biased, as the operator stops the test when he "likes" the result.
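The biased case is easy to demonstrate with a tiny Monte Carlo sketch (all numbers here are hypothetical: a truly equal 50% match-up, and an operator who stops a run as soon as the score tops 55% after at least 50 games):

```python
import random

def mean_with_opportunistic_stopping(n_trials=2000, max_games=1000, p=0.5):
    """Simulate a truly 50% match-up, but stop each run early as soon as
    the running score 'looks good' (>55% after >=50 games).  The average
    recorded score ends up biased above the true 0.5."""
    random.seed(1)
    total = 0.0
    for _ in range(n_trials):
        score = 0
        for g in range(1, max_games + 1):
            score += random.random() < p  # 1 point for a 'win'
            if g >= 50 and score / g > 0.55:
                break  # operator "likes" the result and stops here
        total += score / g
    return total / n_trials

print(mean_with_opportunistic_stopping())  # noticeably above 0.5
```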
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.