Another Firebird-Rybka match (120 games @ 20mn+5s)

CRoberson · Post by **CRoberson** » Mon Jan 25, 2010 12:10 am

Milos wrote:
CRoberson wrote:Try again. Who said I want as little as 95% significance. I want better than 95%.
This kind of reply is quite fundamentalist. And it is a fact that it is impossible to convince a fundamentalist of anything contrary to their belief.

Ah, now you are insulting me.

I'll make it simple. I test numerous mods per day. In a 2 month period, I easily test 100 mods. If I use only 95% confidence levels, I
will make the wrong decision 5% of the time. Therefore, I don't use as little as a 95% confidence level.
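
To put rough numbers on this point, here is a minimal sketch. It assumes every test is independent and that the confidence level translates directly into a per-test error rate of 1 - confidence, which is a simplification. The function names are illustrative only.

```python
def expected_wrong_decisions(num_tests: int, confidence: float) -> float:
    """Expected number of wrong calls if each test errs with probability 1 - confidence."""
    return num_tests * (1.0 - confidence)


def prob_at_least_one_wrong(num_tests: int, confidence: float) -> float:
    """Chance that at least one of num_tests independent tests gives a wrong call."""
    return 1.0 - confidence ** num_tests


# 100 mods tested at several confidence levels.
for level in (0.95, 0.99, 0.999):
    print(
        f"confidence {level}: "
        f"expected wrong = {expected_wrong_decisions(100, level):.2f}, "
        f"P(at least one wrong) = {prob_at_least_one_wrong(100, level):.3f}"
    )
```

At 95% over 100 tests the expected number of wrong decisions is about 5, and at least one wrong decision is almost certain, which is the motivation for demanding a higher level.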

Michael Sherwin · Post by **Michael Sherwin** » Mon Jan 25, 2010 12:11 am

CRoberson wrote:
Spacious_Mind wrote:
CRoberson wrote:
Spacious_Mind wrote:Hi Andre,

Nice results, which clearly seem to indicate that on your machine, with your settings, Firebird is stronger. There is nothing that can dispute that. Elostat, or whatever you use, calculates the average ratings over a given number of games. The further you move away from the middle, the more extreme and unlikely the plus/minus difference becomes. For example, in the next 120 games at your settings Rybka would have to win 33 and lose 14 in order for the Elo to be the same, which is probably unlikely based on what you have experienced so far on your machine, right?

If the variance is +/- 58 and your performance difference is 57, then you are grasping at straws if you seriously think that the other program will miraculously turn the next 120 games upside down with a difference like the one you are showing.

Therefore the only question is whether the 57 Elo remains stable or whether there are fluctuations. The fact that Firebird is stronger on your machine with your settings is hard to argue against.

best regards

Nick
You don't understand. If A outperforms B by N-1 Elo and the margins are +/- N, then there is insufficient evidence to say that
A is better than B. A must outperform B by more than N Elo with margins at +/- N. With margins at 58, you must score 59 or better.

If your score is within the margins (even by a little) then you are within the fluctuation range for that number of games.
Yes I fully understand. You are going from -1 to +115 therefore both -1 and +115 are unlikely extremes.

regards

Nick
If you understand that, then why did you say that Firebird is clearly better and that it is unlikely for the other engine to turn the tables in
the next 120 games? If the results are within the margins, then the results are within the fluctuation range and it is possible for the other
engine to turn the tables.

I do not know where to attach this so I am doing so here.

Elo calculation and statistics, with their error bars and confidence levels, are fine and only slightly flawed for human vs. human. However, given the random effects of hardware and starting positions, many more games are needed for the same error bars and confidence levels to be accurate for computer vs. computer.
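
The margin rule argued over in the quotes above (a difference only counts if it exceeds the +/- margin) can be sketched as follows. This is a minimal illustration, not how Elostat or any other tool actually computes its bars: it assumes independent games, the logistic Elo model, and a normal approximation with z = 1.96 for roughly 95%. The function names and the sample result are hypothetical and are not the match data from this thread.

```python
import math


def _clamp(score: float) -> float:
    """Keep a score strictly inside (0, 1) so the Elo formula stays defined."""
    return min(max(score, 1e-9), 1.0 - 1e-9)


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction under the logistic Elo model."""
    score = _clamp(score)
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, low, high) for a match result using a normal approximation."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / games
    half_width = z * math.sqrt(variance / games)
    return (
        elo_from_score(score),
        elo_from_score(score - half_width),
        elo_from_score(score + half_width),
    )


# Hypothetical 120-game result, chosen only for illustration.
elo, low, high = elo_with_margin(wins=40, draws=50, losses=30)
print(f"Elo diff {elo:+.1f}, interval [{low:+.1f}, {high:+.1f}]")
print("outside the margins" if low > 0 else "within the margins")
```

If the lower bound is at or below zero, the result is within the fluctuation range for that number of games. Note that this sketch does not model the extra hardware and opening-position effects mentioned above.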

CRoberson · Post by **CRoberson** » Mon Jan 25, 2010 12:16 am

BubbaTough wrote:
Milos wrote:
CRoberson wrote: I've run a two program match of 100 games and the first program wins all of the first 10 games. The next day when the games are done, the score is 80 to 20 in favor of the 2nd program. The first only won 20 and half of them were the first 10 games.
Again you prove you do not know the elementary statistics.
Assuming that the final result is 80:20 for engine B against engine A, this means chance for engine A to win against engine B is 20%.
Chance that engine A wins any 10 games in a row is 0.2^10=0.00001%. Chance to win first 10 games of the match is even lower.
In other words you will not see it in your life, even if you spend it whole testing just these 2 engines.
I have no chicken in this fight, but feel compelled to mention that, unlikely or not, I have also seen this happen. It almost makes one wonder if the assumptions on which your statistical models are based are potentially imperfect.

-Sam

Sam,

You are correct. His assumptions are wrong and his math is wrong. 0.2^10 does not come out to what he quotes.
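
For reference, the disputed figure can be checked directly. This is a minimal sketch that takes the quoted model at face value: each game is an independent win with probability 0.2 and draws are ignored, which is itself one of the assumptions questioned in this thread. The function name is illustrative.

```python
def streak_probability(p_win: float, streak: int) -> float:
    """Probability of winning `streak` specific games in a row, assuming independence."""
    return p_win ** streak


p = streak_probability(0.2, 10)
print(f"as a fraction:   {p:.4e}")
print(f"as a percentage: {p * 100:.7f} %")
print(f"roughly 1 in {round(1 / p):,}")
```

Because 0.2^10 = 1/5^10, the exact value is 1 in 9,765,625, which is about 0.00001 when written as a percentage.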

Milos · Post by **Milos** » Mon Jan 25, 2010 12:27 am

CRoberson wrote:I test numerous mods per day. In a 2 month period, I easily test 100 mods. If I use only 95% confidence levels, I
will make the wrong decision 5% of the time. Therefore, I don't use as little as a 95% confidence level.

If this is really true, you must have a pretty large cluster. We regular mortals don't have this privilege. Therefore, it would be quite easy for you to prove with, say, 10k games (if you think that satisfies your accuracy level) at a 40/40 time control that Robbo is really not stronger than Rybka. I guess it would take you only a few days.

So my question for you, why don't you do it?

Andre · Post by **Andre** » Mon Jan 25, 2010 12:28 am

It's quite a long time control, so it takes some time to run such a match, but I'll add another 200 games to see if the difference holds.

Milos · Post by **Milos** » Mon Jan 25, 2010 12:34 am

CRoberson wrote:You are correct. His assumptions are wrong and his math is wrong. 0.2^10 does not come out to what he quotes.

LOL, so maybe you can show us how the math of someone who extrapolates Elo bars just according to the number of games is right.

Spacious_Mind · Post by **Spacious_Mind** » Mon Jan 25, 2010 12:40 am

Andre wrote:It's quite a long time control, so it takes some time to run such a match, but I'll add another 200 games to see if the difference holds.

Hi Andre,

Well, this discussion has piqued my interest, so I would love to see another one or two sets of 120 games to see how these sets compare.

But that is easy for me to say because I do not have to play the games.

Best regards

Nick

CRoberson · Post by **CRoberson** » Mon Jan 25, 2010 3:37 am

Milos wrote:
CRoberson wrote: I've run a two program match of 100 games and the first program wins all of the first 10 games. The next day when the games are done, the score is 80 to 20 in favor of the 2nd program. The first only won 20 and half of them were the first 10 games.
Again you prove you do not know the elementary statistics.
Assuming that the final result is 80:20 for engine B against engine A, this means chance for engine A to win against engine B is 20%.
Chance that engine A wins any 10 games in a row is 0.2^10=0.00001%. Chance to win first 10 games of the match is even lower.
In other words you will not see it in your life, even if you spend it whole testing just these 2 engines.

First, I said the score is 80 to 20. That doesn't mean that the other 10 games were won. Second, the fact that the score is 80 to 20 in 100
games does not mean that it is an exact reflection of the two programs' relative strengths.

Now, for your misunderstanding of your math. Here is a real life true example. I have 2 children and only 2. They were born on the same
day exactly 2 years apart. By your logic, I would have to have almost 365 children for that to happen, because the odds of 2
siblings being born on the same day 2 years apart are 1 in 365 (not 1 in 365^2). I only have 2 children and they were born on the same day exactly 2 years apart.

You claim the odds are 1 in 100,000 (0.00001%) to see an example of what I claim. Statistics like that only tell how often it happens, not when something is
going to happen. As in the example of my children's births, it is a statistical fallacy to assume that you have to go through the full
range of events before the unlikely event happens.

Also, you have not made any effort to find out my background in computer chess before you called me a liar. So, you
don't know how long I've been doing it or what are the odds that I have seen it. You are just talking about a single instance random chance. To calculate the odds that I have seen it, requires knowledge of how often I test programs and for how long I have
been doing it. You must factor all of that into your calculations before using the math to call me a liar.

I have been working on computer games for around 18 years. I have posts in rec.games.chess dating back to somewhere between
1991 and 1993. Over 18 years, it would take an average of 15 tests of 100 games per day to complete 100,000 tests.
As I pointed out, you don't always need to run 100,000 tests to see the 1 test in 100,000. But, let's say
that on average I tested once per day at 100 games per test. That puts the odds at 1 in 5 (20%) that I have seen it. This is far better
odds than you quoted. So, it is theoretically possible. One in five are not terrible odds for something like this.

I notice that you did not ask what TC I use for testing before you did some superficial math.

Now, to the facts:
Given that I can run 400 games in about 8 hours on just one of my computers, that means I can easily run 400 games per day or up to
1200 games per day on that computer. (Yes, I have more than one. If you don't believe that, just ask anybody that has attended the
Pan American Computer Chess Championships in the last 3 years. I can name names if you like - they are all members of this forum. If you don't believe them, I have pictures on the web.) At 1200 games per day, that is 12 tests of 100 per day. At that rate, I
can complete 4,380 tests per year on 1 machine. That would be about 10,000 tests in the last two years. That would mean on something
where the odds are 1 in 100,000, there is a 1 in 10 chance that I have seen it. This is far better than the stats that you did, and did far
too quickly and with too little info. How did I come up with 1 in 10 odds? Well, 10,000 tests times the probability (as you stated)
of 1 in 100,000 gives odds of 10,000/100,000 = 1/10. Also, I have more than one computer. Sometimes, I have run all of them for
weeks at a time to test something. Even with just 2 computers, the odds are doubled that I have seen it which puts it to 2 x 1/10 = 1/5.

You must admit, 1 in 5 odds are not very bad for this. The odds are better than that as I have more than 2 computers. Again,
provable via personal references or pictures that already exist on the web.
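
The "number of tests times probability" estimate used above is a linear approximation. For independent tests, the chance of seeing a rare event at least once in n tests is 1 - (1 - p)^n. Here is a minimal sketch with the per-test probability left as a parameter, since its value is disputed in this thread: 1e-5 corresponds to the "1 in 100,000" reading, and 0.2 ** 10 is the raw value of the quoted formula. The function name is illustrative.

```python
def prob_seen_at_least_once(p_event: float, num_tests: int) -> float:
    """Chance that an event of per-test probability p_event occurs in num_tests independent tests."""
    return 1.0 - (1.0 - p_event) ** num_tests


# Two candidate per-test probabilities, both taken from the discussion above.
for label, p in (("1 in 100,000", 1e-5), ("0.2 ** 10", 0.2 ** 10)):
    for n in (10_000, 20_000, 100_000):
        print(f"p = {label:>12}, tests = {n:>7,}: {prob_seen_at_least_once(p, n):.4f}")
```

For small n * p the exact value is close to n * p, but it never exceeds 1, and it depends heavily on which per-test probability is plugged in.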

So, I have a question for you. Do you still contend that I am lying about this?

CRoberson · Post by **CRoberson** » Mon Jan 25, 2010 3:45 am

Milos wrote:
CRoberson wrote:I test numerous mods per day. In a 2 month period, I easily test 100 mods. If I use only 95% confidence levels, I
will make the wrong decision 5% of the time. Therefore, I don't use as little as a 95% confidence level.
If this is really true, you must have a pretty large cluster. We regular mortals don't have this privilege. Therefore, it would be quite easy for you to prove with, say, 10k games (if you think that satisfies your accuracy level) at a 40/40 time control that Robbo is really not stronger than Rybka. I guess it would take you only a few days.
So my question for you, why don't you do it?

It doesn't take a large cluster to test 100 modifications in 2 months. That is on average 1 and 2/3rds tests per day (less than 2 per day).
One of my machines (a quad) can run 400 games in 8 hours, as stated in the other part of this thread. I did not state the TC (time control) that I use for testing. It is not G in 20 mins + 5 sec. That is too long.

So you tell me. At 400 games per test, what Elo margins do I use and at what confidence level?

I have no interest in testing Robbo. I have other tests more interesting than that. If you did your research on me you'd know that.
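
As a rough guide to the margins question for 400 games, here is a sketch under stated assumptions: two evenly matched engines, independent games, a normal approximation, and a hypothetical draw ratio of 30% (not a figure from this thread). The function name and parameters are illustrative.

```python
import math
from statistics import NormalDist


def elo_margin(games: int, confidence: float, draw_ratio: float = 0.3) -> float:
    """Approximate +/- Elo margin around a 50% score for an evenly matched pair."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided z-score
    per_game_sd = 0.5 * math.sqrt(1.0 - draw_ratio)   # sd of one game's score at 50%
    half_width = z * per_game_sd / math.sqrt(games)
    upper = min(0.5 + half_width, 1.0 - 1e-9)         # keep the score below 1
    return 400.0 * math.log10(upper / (1.0 - upper))


for level in (0.95, 0.99, 0.999):
    print(f"400 games at {level:.1%} confidence: +/- {elo_margin(400, level):.1f} Elo")
```

The margin shrinks with the square root of the number of games and grows as the confidence level is raised, so a stricter level needs more games for the same margin.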

CRoberson · Post by **CRoberson** » Mon Jan 25, 2010 4:07 am

Andre wrote:It's quite a long time control, so it takes some time to run such a match, but I'll add another 200 games to see if the difference holds.

Hello Andre,

You are correct. Your TC is rather long. You get about 1 game per hour, so 120 games is 5 days. Then another 120 would make 10 days total.

I humbly suggest that you speed it up by 5x and do it in 2 days by cutting the TC by 5x. That way you could run 4 sets of 120 games in 4 days and show the individual test results each day. That would be a TC of G/4 mins + 1 sec.
