Another Firebird-Rybka match (120 games @ 20mn+5s)

Discussion of computer chess matches and engine tournaments.

Moderator: Ras

slobo
Posts: 2331
Joined: Mon Apr 09, 2007 5:36 pm

Re: Another Firebird-Rybka match (120 games @ 20mn+5s)

Post by slobo »

CRoberson wrote:
slobo wrote:
CRoberson wrote:
Andre wrote:Conditions: Win 2003 w32 / 2CPU / 128MB Hash / HS 8 Moves / 20mn+5s / 120 games

Firebird 1.0 - Rybka 3.0 : 69,5-50,5 (+33/=73/-14) 57,92%-42,08%

Interesting data. That suggests a 57 Elo improvement. However, the margins on 120 games may be around 58 Elo. So, insufficient data to
prove an improvement. I don't know the exact margins for 120 games. For 96 games it is 60 and for 200 games it is 42. From that, I guessed at 58 for 120.
I have new information:

Combined score after 402 games

227.0 - 175.0 in favor of RobboLito

+46 Elo for RobboLito


   Program                    Elo    +    -  Games   Score  Av.Op.   Draws

1  RobboLite 0.085d3 x64  :  3148   20   20    402  56.5 %    3102  65.2 %
2  Rybka 3 sp             :  3102   20   20    402  43.5 %    3148  65.2 %
PGN available
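
For reference, a score percentage can be turned into an Elo difference with the usual logistic formula. The snippet below is only a sketch: `elo_diff` is a hypothetical helper name, and rating tools such as Elostat may round or model things slightly differently, so small deviations from the posted +46 and +57 are to be expected.

```python
import math


def elo_diff(score_fraction: float) -> float:
    """Elo difference implied by a score fraction under the logistic Elo model."""
    return -400.0 * math.log10(1.0 / score_fraction - 1.0)


# Combined 402-game score quoted above: 227.0 - 175.0
print(round(elo_diff(227.0 / 402.0), 1))

# 120-game match quoted above: 69.5 - 50.5
print(round(elo_diff(69.5 / 120.0), 1))
```

Both values land within a point or two of the figures given in the posts.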
Now, you are talking about Robbo instead of Firebird. ok.

Two things come to mind with your data.
1) 46 Elo is a far cry from the 100 Elo you were claiming.
2) At 400 games the margins are +/- 30 Elo. So, Robbo is above the margins in this case. But, the question is how much better is it?
The answer is in the data. It could be 46 Elo stronger or as little as 16 Elo stronger or as much as 76 Elo stronger. The odds are that it
is far from the 100 Elo that you have been claiming in the past.
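
The margins quoted for different game counts can be related to each other with a rough rule: for a fixed confidence level and a similar draw rate, the margin shrinks in proportion to 1/sqrt(games). The sketch below rescales the two reference points from the post (96 games at +/- 60, 200 games at +/- 42); `scale_margin` is a hypothetical helper, and the constant per-game spread is an assumption.

```python
import math


def scale_margin(known_margin: float, known_games: int, games: int) -> float:
    """Rescale an Elo margin to another game count, assuming margin ~ 1/sqrt(games)."""
    return known_margin * math.sqrt(known_games / games)


for games in (120, 400):
    from_96 = scale_margin(60.0, 96, games)
    from_200 = scale_margin(42.0, 200, games)
    print(games, round(from_96, 1), round(from_200, 1))
```

Under this assumption both reference points give a margin near 54 for 120 games and just under 30 for 400 games.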
It is probably because you have analysed RobboLite 0.085d3 x64 data, not Firebird's.

I don't know whether FB's data is available at the moment.
"Well, I´m just a soul whose intentions are good,
Oh Lord, please don´t let me be misunderstood."
CRoberson
Posts: 2094
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Another Firebird-Rybka match (120 games @ 20mn+5s)

Post by CRoberson »

Spacious_Mind wrote:
CRoberson wrote:
Spacious_Mind wrote:
CRoberson wrote:
Spacious_Mind wrote:Hi Andre,

Nice results, which clearly seem to indicate that Firebird is stronger on your machine with your settings. Nothing can dispute that. Elostat, or whatever rating tool you use, calculates average ratings over a given number of games. The further you move from the middle of the plus/minus range, the more extreme and unlikely the result becomes. For example, in the next 120 games at your settings Rybka would have to win 33 and lose 14 for the Elo to come out the same, which is probably unlikely based on what you have seen so far on your machine, right?

If the variance is +/- 58 and your performance difference is 57, then you are grasping at straws if you seriously think the other program will miraculously turn the next 120 games upside down by a difference like the one you are showing.

Therefore the only question is whether the 57 Elo remains stable or fluctuates. The fact that Firebird is stronger on your machine with your settings is hard to argue against.

best regards

Nick
You don't understand. If A outperforms B by N-1 Elo and the margins are +/- N, then there is insufficient evidence to say that
A is better than B. A must outperform B by more than N Elo with margins at +/- N. With margins at 58, you must score 59 or better.

If your score is within the margins (even by a little) then you are within the fluctuation range for that number of games.
Yes I fully understand. You are going from -1 to +115 therefore both -1 and +115 are unlikely extremes.

regards

Nick
If you understand that, then why did you say that Firebird is clearly better and that it is unlikely for the other engine to turn the tables in
the next 120 games? If the results are within the margins, then the results are within the fluctuation range and it is possible for the other
engine to turn the tables.
First of all, your 58 might not be right, because I have seen other examples where 115 games give +/- 56.

But that's beside the point. Surely you have to agree that -1 and +115 are too extreme and statistically unlikely, so the next 120 games will likely show similar results given exactly the same settings and exactly the same computer, whatever those are? Either that or we might as well throw the Elo system out the window.

Therefore the next 120 games will more likely than not again show that Firebird is better under those EXACT same conditions.

I really don't care if it is Tom, Dick or Harry playing, it makes no difference to me. It's the impression that the other engine whichever it is will miraculously turn things around in the next 120 games which I find baffling.

best regards

Nick
You are making the classically incorrect argument.

If the margins are +/- 100 and your performance is +90, then you are in the fluctuation range of the margins. That means that
the next set of games could come out very different (maybe -50) even under the exact same conditions.

Also, there is more. If the margins are +/- 50 and you score +70, then you could be only 20 Elo stronger. So, a much larger number of test games may reveal that you are only +28 Elo stronger, or (just as likely) that you are 110 Elo stronger.

I run thousands of tests per week. I have seen the swings that you think are unlikely on numerous occasions.

That is why the USCF gives only a provisional rating to chess players with fewer than 25 games. It is also why they
raised the number of games required for an established rating. They raised it in the last 15 years; I forget exactly when.
Personally, I thought the number of games should have been raised to 40 or 50. Even then there are large margins, but humans
don't play often enough for a higher threshold than that. Another big issue with rating humans
is that we learn as we play. By the time a human has played 50 tournament games, they are playing stronger than when they started.
Spacious_Mind
Posts: 317
Joined: Mon Nov 02, 2009 12:05 am
Location: Alabama

Re: Another Firebird-Rybka match (120 games @ 20mn+5s)

Post by Spacious_Mind »

Hi Charles,

That argument I would buy if you are playing multiple opponents and you are improving while the other declines, but not if you are playing the same two engines over and over again..sorry. Unless you are saying that one engine is learning and the other is not? Is that the case here?

But still even if you are correct, I would counter that I see 56 for 115 games :)

btw. I love discussions therefore please don't read into this as me fighting with you because I am not ;)

best regards

Nick
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Another Firebird-Rybka match (120 games @ 20mn+5s)

Post by Milos »

CRoberson wrote:However, the margins on 120 games may be around 58 Elo. So, insufficient data to
prove an improvement. I don't know the exact margins for 120 games. For 96 games it is 60 and for 200 games it is 42. From that, I guessed at 58 for 120.
You are totally wrong, and obviously you are not able to calculate the error margin given a match result between two engines. For the given data set, at 2 sigma (a 95% confidence interval), the error margin is +/- 39 Elo.
Before placing ridiculous posts, try to learn a bit of elementary statistics.
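
The +/- 39 figure can be reproduced, at least approximately, from the posted result (+33/=73/-14). The sketch below treats each game as an independent draw from the observed win/draw/loss frequencies, uses a normal approximation for the mean score, and converts the score interval to Elo with the logistic formula. `elo_interval` is a hypothetical helper, and z = 1.96 is the usual 95% value.

```python
import math


def elo_from_score(p: float) -> float:
    """Elo difference implied by score fraction p (logistic model)."""
    return -400.0 * math.log10(1.0 / p - 1.0)


def elo_interval(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (low, centre, high) Elo estimates for a two-engine match."""
    games = wins + draws + losses
    mean = (wins + 0.5 * draws) / games
    # Per-game variance of the score (1, 0.5 or 0 points per game).
    variance = (
        wins * (1.0 - mean) ** 2
        + draws * (0.5 - mean) ** 2
        + losses * (0.0 - mean) ** 2
    ) / games
    std_error = math.sqrt(variance / games)
    return (
        elo_from_score(mean - z * std_error),
        elo_from_score(mean),
        elo_from_score(mean + z * std_error),
    )


low, centre, high = elo_interval(33, 73, 14)
print(round(low, 1), round(centre, 1), round(high, 1))
print("half-width:", round((high - low) / 2.0, 1))
```

The high draw rate matters here: draws reduce the per-game variance, which is why the margin comes out smaller than a generic table for 120 games would suggest.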
CRoberson
Posts: 2094
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Another Firebird-Rybka match (120 games @ 20mn+5s)

Post by CRoberson »

Spacious_Mind wrote:
Hi Charles,

That argument I would buy if you are playing multiple opponents and you are improving while the other declines, but not if you are playing the same two engines over and over again..sorry. Unless you are saying that one engine is learning and the other is not? Is that the case here?

But still even if you are correct, I would counter that I see 56 for 115 games :)

btw. I love discussions therefore please don't read into this as me fighting with you because I am not ;)

best regards

Nick
Hi Nick,

I don't think you are arguing, because you are making almost all the classical mistakes in this discussion which just means you don't
understand statistical testing well. Nothing bad about that. 99% of the world's population doesn't understand it.

I run numerous batches of tests between two programs and the results are exactly as I have been stating. The only exception is when
some sort of learning is turned on and we are not talking about that in this discussion.

Back to the specifics. If the margins are +/- 56 and A scores +58 Elo, that means all of these things:
1) A is better than B. (But, by how much).
2) A could be as low as 2 Elo stronger (58-56).
3) A could be as high as 114 Elo stronger (58+56).
4) Another run of 10 times as many games may show A to be less than 20 Elo stronger.

That is the problem with most people playing around with chess programs: they run too few tests before jumping to their conclusions.

I've run a two program match of 100 games and the first program wins all of the first 10 games. The next day when the games are done, the score is 80 to 20 in favor of the 2nd program. The first only won 20 and half of them were the first 10 games.

Haven't you ever flipped a coin 10 times and had 8 heads or even 10 heads? That is how casinos make loads of money. People invent
some system, then test it 50 times and decide it is great. However, they didn't factor in the statistical meanings and they go to a casino
and lose all their money.
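
To put numbers on the coin example, the binomial tail can be computed directly. This is a minimal sketch assuming a fair coin and independent flips; `prob_at_least` is a hypothetical helper name.

```python
from math import comb


def prob_at_least(heads: int, flips: int, p: float = 0.5) -> float:
    """Probability of getting at least `heads` heads in `flips` independent flips."""
    return sum(
        comb(flips, k) * p**k * (1.0 - p) ** (flips - k)
        for k in range(heads, flips + 1)
    )


print(prob_at_least(8, 10))   # 56/1024, a bit over 5%
print(prob_at_least(10, 10))  # 1/1024, about 0.1%
```

So 8 or more heads in 10 flips turns up roughly once in every 18 attempts, which is why short runs are weak evidence.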
CRoberson
Posts: 2094
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Another Firebird-Rybka match (120 games @ 20mn+5s)

Post by CRoberson »

Milos wrote:
CRoberson wrote:However, the margins on 120 games may be around 58 Elo. So, insufficient data to
prove an improvement. I don't know the exact margins for 120 games. For 96 games it is 60 and for 200 games it is 42. From that, I guessed at 58 for 120.
You are totally wrong, and obviously you are not able to calculate the error margin given a match result between two engines. For the given data set, at 2 sigma (a 95% confidence interval), the error margin is +/- 39 Elo.
Before placing ridiculous posts, try to learn a bit of elementary statistics.
Try again. Who said I want as little as 95% significance? I want better than 95%.
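
One common way in engine testing to express "how sure are we that A is better at all" is the likelihood of superiority (LOS), which depends only on wins and losses. The sketch below uses the usual normal approximation; it is an illustration for the posted result (+33/-14), not a calculation made by anyone in this thread, and `likelihood_of_superiority` is a hypothetical helper name.

```python
import math


def likelihood_of_superiority(wins: int, losses: int) -> float:
    """Approximate probability that the first engine is the stronger one.

    Draws are ignored; the win/loss difference is treated as normally distributed.
    """
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


los = likelihood_of_superiority(33, 14)
print(f"{los:.4f}")
```

A value above 0.95 supports "A is stronger" at the 95% level, and a stricter threshold can be applied to the same number. It says nothing about how much stronger, which is the separate question of the Elo margin.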
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Another Firebird-Rybka match (120 games @ 20mn+5s)

Post by Milos »

CRoberson wrote: I've run a two program match of 100 games and the first program wins all of the first 10 games. The next day when the games are done, the score is 80 to 20 in favor of the 2nd program. The first only won 20 and half of them were the first 10 games.
Again you prove that you do not know elementary statistics.
Assuming that the final result is 80:20 for engine B against engine A, the chance for engine A to win a game against engine B is 20%.
The chance that engine A wins any 10 games in a row is 0.2^10=0.00001%. The chance of winning the first 10 games of the match is even lower.
In other words, you will not see it in your lifetime, even if you spend all of it testing just these 2 engines.
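
The arithmetic here can be checked with a short script, under a deliberately simplified model: every game is an independent win for engine A with probability 0.2, and draws are ignored. `prob_run` is a hypothetical helper that computes the chance of at least one streak of `run_len` wins somewhere in `n_games`. With `n_games` equal to `run_len` it reduces to the chance of winning the first ten games.

```python
def prob_run(n_games: int, run_len: int, p_win: float) -> float:
    """Probability of at least one streak of `run_len` consecutive wins in `n_games`."""
    # state[k]: probability that no full streak has occurred yet and the
    # current trailing win streak has length k (0 <= k < run_len).
    state = [1.0] + [0.0] * (run_len - 1)
    reached = 0.0
    for _ in range(n_games):
        nxt = [0.0] * run_len
        for k, prob in enumerate(state):
            nxt[0] += prob * (1.0 - p_win)   # a non-win resets the streak
            if k + 1 == run_len:
                reached += prob * p_win      # streak completed
            else:
                nxt[k + 1] += prob * p_win   # streak grows by one
        state = nxt
    return reached


print(prob_run(10, 10, 0.2))   # first ten games all won: 0.2**10
print(prob_run(100, 10, 0.2))  # a ten-game streak anywhere in 100 games
```

Under this model a streak anywhere in 100 games is more likely than the first-ten case, though both are tiny. Real engine matches need not satisfy the independence assumption.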
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Another Firebird-Rybka match (120 games @ 20mn+5s)

Post by Milos »

CRoberson wrote:Try again. Who said I want as little as 95% significance? I want better than 95%.
This kind of reply is quite fundamentalist. And it is a fact that it is impossible to convince a fundamentalist of anything contrary to their beliefs.
Spacious_Mind
Posts: 317
Joined: Mon Nov 02, 2009 12:05 am
Location: Alabama

Re: Another Firebird-Rybka match (120 games @ 20mn+5s)

Post by Spacious_Mind »

CRoberson wrote:
Back to the specifics. If the margins are +/- 56 and A scores +58 Elo, that means all of these things:
1) A is better than B. (But, by how much).
2) A could be as low as 2 Elo stronger (58-56).
3) A could be as high as 114 Elo stronger (58+56).
4) Another run of 10 times as many games may show A to be less than 20 Elo stronger.
Hi Charles

You are making an assumption that I don't know how Elostat works :)

Show me anywhere in my previous post where I have not stated exactly what you are now finally showing above. All I have been saying is that, per Elostat or whatever other system, the greater likelihood remains somewhere in the middle. You indicated extremes, which prompted my original post.

All I am saying is that your Number 2 and Number 3 are unlikely given 120 games, which is not a fantastic amount but is enough to make 2 and 3 unlikely to happen in the next 120 games between these engines. I have also run thousands of games and have never seen anything that extreme. So surely you agree it is more a question of what more games will show about the stability of the reported +57 Elo. It might be +30 or it might be +80, who knows. But in either case it will still show that Firebird is better at these settings.

Best regards

Nick
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Another Firebird-Rybka match (120 games @ 20mn+5s)

Post by BubbaTough »

Milos wrote:
CRoberson wrote: I've run a two program match of 100 games and the first program wins all of the first 10 games. The next day when the games are done, the score is 80 to 20 in favor of the 2nd program. The first only won 20 and half of them were the first 10 games.
Again you prove that you do not know elementary statistics.
Assuming that the final result is 80:20 for engine B against engine A, the chance for engine A to win a game against engine B is 20%.
The chance that engine A wins any 10 games in a row is 0.2^10=0.00001%. The chance of winning the first 10 games of the match is even lower.
In other words, you will not see it in your lifetime, even if you spend all of it testing just these 2 engines.
I have no chicken in this fight, but feel compelled to mention that, unlikely or not, I have also seen this happen. It almost makes one wonder if the assumptions on which your statistical models are based are potentially imperfect :).

-Sam