Another Firebird-Rybka match (120 games @ 20mn+5s)

Andre · Post by **Andre** » Sun Jan 24, 2010 7:26 pm

Conditions: Win 2003 w32 / 2CPU / 128MB Hash / HS 8 Moves / 20mn+5s / 120 games

Firebird 1.0 - Rybka 3.0 : 69,5-50,5 (+33/=73/-14) 57,92%-42,08%
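
As a quick cross-check, the usual logistic Elo model turns a score fraction p into a rating difference D = -400·log10(1/p - 1). The Python sketch below applies it to the +33/=73/-14 result. The helper names are illustrative and not taken from any rating tool. It gives about 55 Elo, slightly below the 57 quoted in the replies; rating programs may round or compute the difference somewhat differently.

```python
import math


def match_score(wins: int, draws: int, losses: int) -> tuple[float, int]:
    """Return (points, games), counting a draw as half a point."""
    return wins + 0.5 * draws, wins + draws + losses


def elo_difference(score_fraction: float) -> float:
    """Rating difference implied by a score fraction under the logistic Elo model."""
    return -400.0 * math.log10(1.0 / score_fraction - 1.0)


points, games = match_score(33, 73, 14)  # Firebird's +33/=73/-14 from the post
fraction = points / games
print(f"{points}/{games} = {fraction:.2%} -> {elo_difference(fraction):+.1f} Elo")
```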

CRoberson · Post by **CRoberson** » Sun Jan 24, 2010 8:24 pm

Andre wrote:
Conditions: Win 2003 w32 / 2CPU / 128MB Hash / HS 8 Moves / 20mn+5s / 120 games

Firebird 1.0 - Rybka 3.0 : 69,5-50,5 (+33/=73/-14) 57,92%-42,08%

Interesting data. That suggests a 57 Elo improvement. However, the margins on 120 games may be around 58 Elo, so there is insufficient data to prove an improvement. I don't know the exact margins for 120 games. For 96 games they are 60 and for 200 games they are 42. From that, I guessed at 58 for 120.
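
Margins like these shrink roughly with the square root of the number of games, since the standard error of a mean score scales as 1/√n. The sketch below rescales the two reference points from the post under that assumption; it also assumes both margins were produced by the same formula at the same confidence level, which the post does not state.

```python
import math


def scale_margin(known_margin: float, known_games: int, target_games: int) -> float:
    """Rescale an Elo error margin assuming it shrinks like 1/sqrt(games)."""
    return known_margin * math.sqrt(known_games / target_games)


# Reference points from the post: +/-60 Elo at 96 games, +/-42 Elo at 200 games.
for margin, games in [(60, 96), (42, 200)]:
    at_120 = scale_margin(margin, games, 120)
    at_400 = scale_margin(margin, games, 400)
    print(f"from {games} games: ~{at_120:.0f} Elo at 120 games, ~{at_400:.0f} Elo at 400 games")
```

Both reference points give about 54 Elo for 120 games, a little tighter than the 58 interpolated in the post, and roughly 29 to 30 Elo for 400 games.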

slobo · Post by **slobo** » Sun Jan 24, 2010 8:42 pm

CRoberson wrote:
Interesting data. That suggests a 57 Elo improvement. However, the margins on 120 games may be around 58 Elo, so there is insufficient data to prove an improvement. I don't know the exact margins for 120 games. For 96 games they are 60 and for 200 games they are 42. From that, I guessed at 58 for 120.

I have new information:

Combined score after 402 games

227.0 - 175.0 in favor of RobboLito

+46 Elo for RobboLito

| Rank | Program | Elo | + | - | Games | Score | Av.Op. | Draws |
|---|---|---|---|---|---|---|---|---|
| 1 | RobboLite 0.085d3 x64 | 3148 | 20 | 20 | 402 | 56.5 % | 3102 | 65.2 % |
| 2 | Rybka 3 sp | 3102 | 20 | 20 | 402 | 43.5 % | 3148 | 65.2 % |

PGN available
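
The ±20 in this table is much narrower than a draw-free estimate would suggest, because about 65% of the games were drawn and draws reduce the per-game variance of the score. A minimal sketch using a normal approximation (not necessarily the exact method Elostat uses) recovers roughly that figure. The win/draw/loss split of 96/262/44 is an assumption reconstructed from the 227-175 score and the 65.2% draw rate, not a number quoted in the post.

```python
import math


def elo_difference(score_fraction: float) -> float:
    """Logistic Elo difference implied by a score fraction."""
    return -400.0 * math.log10(1.0 / score_fraction - 1.0)


def elo_interval(wins: int, draws: int, losses: int, z: float = 1.96):
    """Approximate (low, estimate, high) Elo difference via a normal approximation.

    Each game scores 1, 0.5 or 0; the variance of those per-game scores,
    divided by the number of games, is the variance of the mean score.
    """
    games = wins + draws + losses
    mean = (wins + 0.5 * draws) / games
    mean_of_squares = (wins + 0.25 * draws) / games
    std_err = math.sqrt((mean_of_squares - mean ** 2) / games)
    return (elo_difference(mean - z * std_err),
            elo_difference(mean),
            elo_difference(mean + z * std_err))


# Assumed split for the 402-game match: 96 wins, 262 draws, 44 losses.
low, estimate, high = elo_interval(96, 262, 44)
print(f"{estimate:+.0f} Elo, 95% interval {low:+.0f} .. {high:+.0f}")
```

This prints an estimate of about +45 Elo with an interval of roughly +25 to +65. Applied to the 120-game Firebird match (`elo_interval(33, 73, 14)`), the same approximation gives roughly +17 to +95 Elo, a narrower band than the ±58 discussed in the thread, again because of the high draw rate.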

Spacious_Mind · Post by **Spacious_Mind** » Sun Jan 24, 2010 8:58 pm

Hi Andre,

Nice results, which clearly seem to indicate that, on your machine with your settings, Firebird is stronger. There is nothing that can dispute that. Elostat, or whatever you use, calculates average ratings over a given number of games. The further you move from the middle of the plus/minus range, the more extreme and unlikely the values become. For example, in the next 120 games at your settings, Rybka would have to win 33 and lose 14 for the Elo to be the same for both engines. That is probably unlikely, based on what you have seen so far on your machine, right?
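
The arithmetic here is easy to check: if Rybka wins 33 and loses 14 of the next 120 games (so 73 are drawn), the two matches together come out exactly level, which corresponds to a 0 Elo difference. The helper below is illustrative only.

```python
Record = tuple[int, int, int]  # (wins, draws, losses) from Firebird's side


def combined_points(first: Record, second: Record) -> tuple[float, float]:
    """Total points (Firebird, Rybka) over two matches, a draw being half a point each."""
    wins = first[0] + second[0]
    draws = first[1] + second[1]
    losses = first[2] + second[2]
    return wins + 0.5 * draws, losses + 0.5 * draws


first_match = (33, 73, 14)   # the reported 120-game result
mirrored = (14, 73, 33)      # hypothetical: Rybka wins 33, loses 14, 73 draws
print(combined_points(first_match, mirrored))  # (120.0, 120.0)
```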

If the error margin is +/- 58 and your performance difference is 57, then you are grasping at straws if you seriously think the other program will miraculously turn the next 120 games upside down by a difference as large as the one you are showing.

Therefore the only question is whether the 57 Elo difference remains stable or fluctuates. The fact that Firebird is stronger on your machine with your settings is hard to argue against.

best regards

Nick

CRoberson · Post by **CRoberson** » Sun Jan 24, 2010 9:10 pm

Spacious_Mind wrote:
Hi Andre,

Nice results, which clearly seem to indicate that, on your machine with your settings, Firebird is stronger. There is nothing that can dispute that. Elostat, or whatever you use, calculates average ratings over a given number of games. The further you move from the middle of the plus/minus range, the more extreme and unlikely the values become. For example, in the next 120 games at your settings, Rybka would have to win 33 and lose 14 for the Elo to be the same. That is probably unlikely, based on what you have seen so far on your machine, right?

If the error margin is +/- 58 and your performance difference is 57, then you are grasping at straws if you seriously think the other program will miraculously turn the next 120 games upside down by a difference as large as the one you are showing.

Therefore the only question is whether the 57 Elo difference remains stable or fluctuates. The fact that Firebird is stronger on your machine with your settings is hard to argue against.

best regards

Nick

You don't understand. If A outperforms B by N-1 Elo and the margins are +/- N, then there is insufficient evidence to say that A is better than B. A must outperform B by more than N Elo with margins at +/- N. With margins at 58, you must outperform by 59 Elo or more.

If your score is within the margins (even by a little) then you are within the fluctuation range for that number of games.
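
The rule stated here amounts to asking whether the interval diff ± margin excludes zero. A tiny illustrative sketch (the function names are hypothetical) using the 57 ± 58 case from this exchange and the 59 Elo threshold:

```python
def margin_interval(diff: float, margin: float) -> tuple[float, float]:
    """Interval diff - margin .. diff + margin around a measured Elo difference."""
    return diff - margin, diff + margin


def clearly_better(diff: float, margin: float) -> bool:
    """True only if the whole interval lies above zero, i.e. diff > margin."""
    low, _high = margin_interval(diff, margin)
    return low > 0


print(margin_interval(57, 58), clearly_better(57, 58))  # (-1, 115) False
print(margin_interval(59, 58), clearly_better(59, 58))  # (1, 117) True
```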

Spacious_Mind · Post by **Spacious_Mind** » Sun Jan 24, 2010 9:14 pm

CRoberson wrote:
You don't understand. If A outperforms B by N-1 Elo and the margins are +/- N, then there is insufficient evidence to say that A is better than B. A must outperform B by more than N Elo with margins at +/- N. With margins at 58, you must outperform by 59 Elo or more.

If your score is within the margins (even by a little) then you are within the fluctuation range for that number of games.

Yes, I fully understand. You are going from -1 to +115, so both -1 and +115 are unlikely extremes.

regards

Nick

CRoberson · Post by **CRoberson** » Sun Jan 24, 2010 9:16 pm

slobo wrote:
I have new information:

Combined score after 402 games

227.0 - 175.0 in favor of RobboLito

+46 Elo for RobboLito

| Rank | Program | Elo | + | - | Games | Score | Av.Op. | Draws |
|---|---|---|---|---|---|---|---|---|
| 1 | RobboLite 0.085d3 x64 | 3148 | 20 | 20 | 402 | 56.5 % | 3102 | 65.2 % |
| 2 | Rybka 3 sp | 3102 | 20 | 20 | 402 | 43.5 % | 3148 | 65.2 % |

PGN available

Now you are talking about Robbo instead of Firebird. OK.

Two things come to mind with your data.
1) 46 Elo is a far cry from the 100 Elo you were claiming.
2) At 400 games the margins are +/- 30 Elo, so Robbo is above the margins in this case. But the question is how much better it is. The answer is in the data: it could be 46 Elo stronger, or as little as 16 Elo stronger, or as much as 76 Elo stronger. The odds are that it is far from the 100 Elo you have been claiming in the past.

CRoberson · Post by **CRoberson** » Sun Jan 24, 2010 9:30 pm

Spacious_Mind wrote:
Yes, I fully understand. You are going from -1 to +115, so both -1 and +115 are unlikely extremes.

regards

Nick

If you understand that, then why did you say that Firebird is clearly better and that it is unlikely for the other engine to turn the tables in the next 120 games? If the results are within the margins, then the results are within the fluctuation range, and it is possible for the other engine to turn the tables.
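
Whether 69.5-50.5 lies inside the "fluctuation range" can be made concrete by computing the exact score distribution for two engines of equal strength. This sketch assumes independent games and a draw probability equal to the observed 73/120 for both sides; those are modelling assumptions, not facts from the thread. The printed value is the one-sided chance that one of two equal engines scores 69.5 or more out of 120.

```python
def score_distribution(games: int, p_win: float, p_draw: float,
                       p_loss: float) -> list[float]:
    """Exact distribution of one engine's total score, indexed in half-points.

    A win adds 2 half-points, a draw 1 and a loss 0; games are independent.
    """
    dist = [1.0]  # before any game: 0 half-points with probability 1
    for _ in range(games):
        nxt = [0.0] * (len(dist) + 2)
        for half_points, prob in enumerate(dist):
            nxt[half_points] += prob * p_loss
            nxt[half_points + 1] += prob * p_draw
            nxt[half_points + 2] += prob * p_win
        dist = nxt
    return dist


# Assumptions: equal strength, independent games, draw rate as observed (73/120).
draw_rate = 73 / 120
decisive = (1 - draw_rate) / 2  # equal chance of winning or losing a decisive game
dist = score_distribution(120, decisive, draw_rate, decisive)

threshold = round(2 * 69.5)  # 69.5 points = 139 half-points
p_at_least = sum(dist[threshold:])
print(f"P(score >= 69.5 of 120 | equal strength) = {p_at_least:.4f}")
```

Repeating the calculation with a lower draw rate gives a noticeably larger probability, which is one reason margins quoted without regard to draws can disagree with tools that account for them.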

Marc MP · Post by **Marc MP** » Sun Jan 24, 2010 9:44 pm

CRoberson wrote:
If you understand that, then why did you say that Firebird is clearly better and that it is unlikely for the other engine to turn the tables in the next 120 games? If the results are within the margins, then the results are within the fluctuation range, and it is possible for the other engine to turn the tables.

What confidence level is attached to your 58 Elo margin? Is it the commonly used 95%?

If so, and he gets 57 Elo, the probability that Robbolito is stronger should be around 93-94%.

Statistics never give "proof" with perfect certainty. Even if he got +100 Elo, there would still be a very small probability that the engine is not stronger and that the results could be overturned.

As the Elo gap increases, this probability becomes vanishingly small.
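
The probability described here is often called the likelihood of superiority (LOS). Under a normal approximation it can be recovered from the quoted margin, but the answer depends on whether the 58 Elo margin is a two-sided 95% bound (z ≈ 1.96) or a one-sided one (z ≈ 1.645). The thread does not say which, so the sketch computes both. For comparison it also includes a shortcut common in engine testing that uses only wins and losses, here the 33 wins and 14 losses from the first match. Function names are illustrative.

```python
import math
from statistics import NormalDist


def prob_stronger(diff: float, margin: float, confidence: float = 0.95,
                  two_sided: bool = True) -> float:
    """P(true Elo difference > 0), treating `margin` as a normal confidence bound.

    Assumes the measured `diff` is normally distributed around the true value.
    """
    tail = (1 - confidence) / 2 if two_sided else 1 - confidence
    z = NormalDist().inv_cdf(1 - tail)  # about 1.96 two-sided, 1.645 one-sided
    sigma = margin / z                  # standard error implied by the margin
    return NormalDist().cdf(diff / sigma)


def los_from_wins_losses(wins: int, losses: int) -> float:
    """Likelihood of superiority from decisive games only (draws are ignored)."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


for two_sided in (True, False):
    label = "two-sided" if two_sided else "one-sided"
    print(f"margin read as {label} 95%: {prob_stronger(57, 58, two_sided=two_sided):.1%}")
print(f"wins/losses shortcut (33 vs 14): {los_from_wins_losses(33, 14):.1%}")
```

Under these assumptions the margin-based figures come out at roughly 95% (one-sided reading) and 97% (two-sided reading), while the wins-and-losses shortcut gives above 99%.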

Spacious_Mind · Post by **Spacious_Mind** » Sun Jan 24, 2010 9:48 pm

CRoberson wrote:
If you understand that, then why did you say that Firebird is clearly better and that it is unlikely for the other engine to turn the tables in the next 120 games? If the results are within the margins, then the results are within the fluctuation range, and it is possible for the other engine to turn the tables.

First of all, your 58 might not be right, because I see other examples where 115 games give +/- 56.

But that's beside the point. Surely you have to agree that -1 and +115 are too extreme and statistically unlikely, so the next 120 games, played with exactly the same settings on exactly the same computer (whatever those are), will likely show similar results? Either that, or we might as well throw the Elo system out the window.

Therefore the next 120 games will, more likely than not, again show that Firebird is better under those EXACT same conditions.

I really don't care whether it is Tom, Dick or Harry playing; it makes no difference to me. What I find baffling is the impression that the other engine, whichever it is, will miraculously turn things around in the next 120 games.

best regards

Nick
