more on engine testing

mathmoi · Post by **mathmoi** » Fri Aug 08, 2008 5:44 pm

hgm wrote: Yes you can, if he scored only 62% in group B, playing all other players there, and group B as a whole was crushed 100-0 by group A. He is only ~85 Elo stronger than the average player of group B. And the 100-0 shows you that group B is, on average, at least about 1000 Elo weaker.

1) Where was it said that he scored 62% in group B? Maybe he scored 100%.

2) Doesn't a 100-0 result show you that group B is at least 400 Elo weaker (instead of 1000)? I'm not arguing with you here, I'm honestly asking.

hgm · Post by **hgm** » Fri Aug 08, 2008 6:27 pm

hgm wrote:
Rémi Coulom wrote: When the number of games is high and the winning rate is not close to 0% or 100%, there is virtually no difference between expected rating and maximum a posteriori.
This is very tricky: sometimes there are 'hidden massacres' in the pairing network:

Suppose you have 200 players. A group of 100 of them plays a complete round robin (i.e. 99 games each), and they turn out to be all about equally strong, so the scores are normally distributed with sigma ~4 and all players had results between 38 and 61. (+/-3 sigma should be enough to catch all 100 players, as the probability that a player exceeds that by chance is <1%.) The same is true for the other group of 100 players.
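As a rough check on the numbers above, here is a minimal Python sketch of the spread of a 99-game score among equal-strength players. The function name `round_robin_sigma` and the 35% draw rate are assumptions made for illustration; the post only states sigma ~4 and does not say what draw rate was assumed.

```python
import math

def round_robin_sigma(games: int, draw_rate: float = 0.0) -> float:
    """Standard deviation of a player's total points over `games` games
    against equal-strength opponents (hypothetical helper).

    Each game scores 1, 0.5 or 0 with mean 0.5, so the per-game
    variance is 0.25 * (1 - draw_rate).
    """
    per_game_variance = 0.25 * (1.0 - draw_rate)
    return math.sqrt(games * per_game_variance)

# Without draws the spread is close to 5 points.
sigma_no_draws = round_robin_sigma(99)

# An assumed draw rate of 35% brings it down to about 4 points.
sigma_with_draws = round_robin_sigma(99, draw_rate=0.35)

# Chance that a normally distributed score lands outside +/-3 sigma.
tail_probability = math.erfc(3.0 / math.sqrt(2.0))

print(sigma_no_draws, sigma_with_draws, tail_probability)
```

With sigma around 4, the +/-3 sigma band around the mean of 49.5 points covers scores from 38 to 61, which matches the range quoted in the post.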

But now suppose that each of the players from group A has played one game against a player of group B (and vice versa), and that all the A players won that game. (This is pretty much what happens in the first round of a Swiss tournament, seeded by rating.)

The 1 game hardly affects the result of the individual players, they now each have 100 games, and for the A players the score was between 39% and 62%, for the B players between 38% and 61%. All very far from 100%. But B as a group was slaughtered by the A players with a 100% score!

This case is totally mishandled by BayesElo. It will not get the difference between the average Elo of groups A and B anywhere near correct, even though the 100 games played between the two groups should be enough to give a good clue (or, in this case, a good lower limit for the difference).
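One way to see what kind of lower limit a clean sweep can give is the sketch below. It is not how BayesElo works, and it makes several simplifying assumptions: draws are ignored, all 100 inter-group games are treated as independent with the same win probability, and the 95% confidence level is an arbitrary choice. The function name `min_elo_gap_for_sweep` is hypothetical.

```python
import math

def min_elo_gap_for_sweep(games: int, confidence: float = 0.95) -> float:
    """Smallest Elo gap at which winning all `games` games still has
    probability of at least 1 - confidence (draw-free logistic model)."""
    alpha = 1.0 - confidence
    # A sweep has probability p ** games, so solve p ** games = alpha.
    p = alpha ** (1.0 / games)
    # Invert the logistic Elo curve: gap = 400 * log10(p / (1 - p)).
    return 400.0 * math.log10(p / (1.0 - p))

print(min_elo_gap_for_sweep(100))
```

Under these assumptions the bound comes out at roughly 600 Elo for 100 games. The exact figure depends strongly on the chosen confidence level and on how draws are modelled, so this sketch does not attempt to reproduce the "about 1000" mentioned earlier in the thread.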

The rating model used by BayesElo assumes a score percentage of

100%/(1+10^(ratingDifference/400))

So for a difference of 400, you would expect a score of 100%/(1+10^1) = 100%/11 ~ 9%. 100-0 is a very large deviation from that. Even if you ignore draws, the chance that you get 100 losses when your winning chance is 9% is (10/11)^100 = 7.2e-5, i.e. less than 1 in 13,000.
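The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the logistic formula quoted in the post, with draws ignored; the function names are made up for illustration and are not part of BayesElo.

```python
import math

def expected_score(rating_deficit: float) -> float:
    """Expected score of the weaker side, given how many Elo points
    it is behind: 1 / (1 + 10^(deficit / 400))."""
    return 1.0 / (1.0 + 10.0 ** (rating_deficit / 400.0))

def prob_all_losses(rating_deficit: float, games: int) -> float:
    """Chance that the weaker side loses every one of `games` games,
    treating the expected score as a win probability (no draws)."""
    return (1.0 - expected_score(rating_deficit)) ** games

def elo_from_score(score: float) -> float:
    """Elo advantage implied by a score fraction strictly between 0 and 1."""
    return 400.0 * math.log10(score / (1.0 - score))

print(expected_score(400))        # 1/11, about 9%
print(prob_all_losses(400, 100))  # (10/11)^100
print(elo_from_score(0.62))       # a 62% score expressed as an Elo gap
```

The last call applies the same formula in reverse and gives about 85 Elo for a 62% score, which is where the "~85 Elo stronger" figure quoted at the top of the thread comes from.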

bob · Post by **bob** » Fri Aug 08, 2008 8:24 pm

hgm wrote:
bob wrote: I would draw a completely different conclusion. Ratings within each pool should be highly accurate. Ratings between the pools can't possibly be accurate with so few games. So the old GIGO applies here. The original data is garbage from the perspective of being useful to predict overall ratings...
Well, of course you would consider a 100-0 result garbage, because 100 is too few games. With you even a 25,000 game result is garbage...

But when I test engines, you can be pretty sure that the engine scoring 0 in a 100-0 match was not the stronger one...

Yeah, but you didn't get a 0-100 result between two opponents. 100 opponents got a 1, another 100 got a 0, and that is all you have to go on. Not much...

hgm · Post by **hgm** » Fri Aug 08, 2008 9:36 pm

Well, if you would have bothered to read the discussion, rather than trolling, you would have seen that Rémi thinks differently, and BayesElo thinks differently, and I think differently, and that we all agree now that this is pretty much to go on. So you are quite alone in this...

bob · Post by **bob** » Sat Aug 09, 2008 5:31 pm

hgm wrote: Well, if you would have bothered to read the discussion, rather than trolling, you would have seen that Rémi thinks differently, and BayesElo thinks differently, and I think differently, and that we all agree now that this is pretty much to go on. So you are quite alone in this...

If you had bothered to read, you would have noticed that the round-robin gave a much _narrower_ range of Elo ratings for the rating pool, or did you overlook that? And since Elo ratings are _relative_, and since I actually care about how I do against all opponents but also care about how I did against the _best_ opponent(s), having more accurate ratings for all actually does offer some information. If you had bothered to read, of course.

hgm · Post by **hgm** » Sat Aug 09, 2008 6:07 pm

Are you sure you are posting in the right thread?

bob · Post by **bob** » Sat Aug 09, 2008 8:21 pm

hgm wrote: Are you sure you are posting in the right thread?

Seems so to me.

hgm · Post by **hgm** » Sat Aug 09, 2008 10:26 pm

Then I am sure you will have no difficulty quoting the round-robin that I am supposed to have read, and explain to us how that ties in with the 100-0 in my hypothetical example we are discussing...

Narrower than what?
