Uri Blass wrote:
bob wrote:
Uri Blass wrote:
Richard Allbert wrote:
Sven Schüle wrote:
Fourth theory, brought up by Richard Allbert and supported also by me (see also this subthread:
http://64.68.157.89/forum/viewtopic.php ... 51&t=22731):
Elo ratings are calculated based only on games of Crafty vs. its opponents, while the opponents did not play each other for that calculation (so far the facts); therefore the Crafty Elo results are relative to unstable ratings of its opponents and hence too inaccurate (that's the theory).
Bob is preparing data that could be used to verify this theory. We will see.
Sven
Just to add uneducated fuel to this fire.... I did a test of four different versions, altering the search in three (checks in qsearch, null move reduction depth made less aggressive, null move bug fixed). They all scored within 2 points of each other over 300 games vs 5 opponents, black and white vs each engine in the 30 Noomen starting positions. I also ran an RR between the five opponents from the 30 starting positions.
Bayesian elo rated one version 40 points higher than the other versions, even though the total scores were almost the same. This seemed to be pretty clear...
Interestingly, I've found a null move bug as a result.
Richard
I think that in this case your program to calculate elo has a bug.
A difference of 2 points in 300 games cannot translate to a difference of 40 elo, except when the result is close to 100% or 0%, and I assume you did not choose opponents against whom the results are more than 90% or less than 10%.
Uri
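To put Uri's arithmetic into numbers, here is a minimal sketch assuming the standard logistic Elo model; the helper function and the example scores are illustrative only, not taken from the thread.
Code:
# Score-to-Elo conversion under the standard logistic model.
# The example scores below are illustrative.
import math

def elo_diff(score):
    """Elo difference implied by a score fraction (0 < score < 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Two extra points out of 300 games near an even score: a tiny shift.
print(elo_diff(152 / 300) - elo_diff(150 / 300))   # about 4.6 elo

# Near the extremes the same two points are worth far more, which is
# the exception Uri allows for (e.g. 296/300 vs 298/300).
print(elo_diff(298 / 300) - elo_diff(296 / 300))   # about 122 elo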
That has become the "designer excuse" of choice, it seems, sort of like "designer drugs": "you have a bug." Never any other possible explanation, such as some inherent randomness with different characteristics than we believe should be present. He didn't mention the error bar, just the number BayesElo gives, which is a _single_ number. They could easily be exactly the same, with exactly the same error margins, which most (but not me, at least) would conclude means "no difference in the programs."
You need to quit assuming so much, and take what people write at face value. If it says 2500 +/- 40, and the next run says 2500 +/- 40, anyone in their right mind would say "BayesElo gave the same results for both." Amazingly, that is _exactly_ what he said. And that is why these discussions go on and on...
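For scale, here is a back-of-envelope sketch of where an error bar near +/- 40 comes from over 300 games. This is a rough binomial estimate, not BayesElo's actual Bayesian computation.
Code:
# Rough error bar on a 300-game match, treating each game as an
# independent result; a back-of-envelope estimate, not BayesElo.
import math

def elo_diff(score):
    return -400.0 * math.log10(1.0 / score - 1.0)

n, score = 300, 0.5
# Worst-case per-game variance is 0.25 (no draws); draws shrink it,
# so this slightly overstates the interval.
se = math.sqrt(score * (1.0 - score) / n)
hi = score + 1.96 * se                 # 95% upper bound on the score
print(elo_diff(hi) - elo_diff(score))  # about +39 elo
Under that estimate, two runs reporting 2500 +/- 40 overlap completely and are statistically indistinguishable.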
The possible error is not important for this point.
Richard wrote:
"Bayesian elo rated one version 40 points higher than the other versions"
I understood him to mean that one version got a rating of 2540 based on only 300 games, while the other versions got at most 2500 based on 300 games against the same opponents.
The error bar may be 40 elo, but the error bar is not important here, because the difference in performance is clearly less than 40 elo if you get a difference of 2 points in 300 games, except in extreme cases that I believe did not happen, like a difference between 298/300 and 300/300.
Uri
Here is the complete post he made:
==========================================================
Just to add uneducated fuel to this fire.... I did a test of four different versions, altering the search in three (checks in qsearch, null move reduction depth made less aggressive, null move bug fixed). They all scored within 2 points of each other over 300 games vs 5 opponents, black and white vs each engine in the 30 Noomen starting positions. I also ran an RR between the five opponents from the 30 starting positions.
Bayesian elo rated one version 40 points higher than the other versions, even though the total scores were almost the same. This seemed to be pretty clear...
Interestingly, I've found a null move bug as a result.

===========================================================
Now where are you drawing your conclusions from? It seems to say 75 games for each of his four versions, against each of 5 opponents. So the logical assumption is 15 games per opponent. 2 points could make a _big_ difference in one of those, depending on how the opponents end up rated themselves...
Again, you are drawing too many conclusions from assumptions that appear to have nothing to do with the actual conditions being described. "Total of 300 games" seems quite clear to me...
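To put numbers on that, the same logistic conversion as above, with an illustrative split of the points:
Code:
# How far 2 points move the implied rating in a 15-game sub-match,
# assuming the standard logistic model; the scores are illustrative.
import math

def elo_diff(score):
    return -400.0 * math.log10(1.0 / score - 1.0)

# 7.5/15 vs 9.5/15 against one opponent: the same "2 points".
print(elo_diff(9.5 / 15) - elo_diff(7.5 / 15))   # about 95 elo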
Here is a 15-game match, with program A losing the first 5 rounds, drawing the next 5, and winning the final 5. The rating difference between the two programs is computed at 26 points +/- 68. Want to know why?

It's all well documented in Remi's writeup: white vs. black. All the games had A as white. Fix that?
Code:
Rank Name       Elo    +    -  games  score  oppo.  draws
   1 program B   13   68   68     15    50%    -13    33%
   2 program A  -13   68   68     15    50%     13    33%
Correct it so that black/white is equalized as well as possible with an odd number of games, and I get this:
Code:
Rank Name       Elo    +    -  games  score  oppo.  draws
   1 program B    1   68   68     15    50%     -1    33%
   2 program A   -1   68   68     15    50%      1    33%
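Here is a toy illustration of that color effect, assuming a fixed white advantage of 32.8 elo, which I believe is BayesElo's default eloAdvantage; treat that figure as an assumption. It is not BayesElo itself.
Code:
# Toy illustration of color bias: if every game had A as white, an
# even score is evidence of weakness once a fixed white advantage is
# assumed. Not BayesElo; 32.8 is assumed to be its default
# eloAdvantage parameter.
import math

def expected(diff):
    """Expected score for a rating edge of `diff` elo (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

white_adv = 32.8   # assumed default white advantage, in elo
score = 0.5        # A scored 50%, always playing white

# Solve expected(d + white_adv) == score for d by bisection.
lo, hi = -400.0, 400.0
for _ in range(60):
    mid = (lo + hi) / 2.0
    if expected(mid + white_adv) < score:
        lo = mid
    else:
        hi = mid
print(lo)   # about -32.8: A lands below B despite the even score
The actual BayesElo output above shows a smaller gap (13 vs -13), presumably because its draw model and prior shrink the estimate toward zero; the sketch only shows the direction of the pull.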
I think _nobody_ is putting enough thought into considering this entire process. Just the usual
stampee foot... can't be.... stampee foot.. bug... stampee foot. cluster broken... stampee foot.. software broken... stampee foot... researcher can't produce good data... stampee foot... statistically invalid... stampee foot... impossible result... stampee foot... cherry-picked data... stampee foot... I could fix this trivially if I wanted to... stampee foot.... stampee foot...
time after time.
It would seem to me that a _logically-thinking_ person would begin to say "Hmmm, we are seeing more of this than we would expect, from multiple different sources, so maybe there is something wrong with our thought processes." For example, of _course_ the games are somewhat dependent. Why? Same damned players. How well would 100 games of me vs Kasparov correlate? 100%, since I would lose every game? Data no good. Data dependent... stampee foot...
<sigh>
A beats C by 2 points
A plays even with B.