1 draw=1 win + 1 loss (always!)

BubbaTough · Post by **BubbaTough** » Sun Sep 22, 2013 3:24 pm

lkaufman wrote:
BubbaTough wrote:
lkaufman wrote: The theoreticians would say that since empirical evidence indicates that two draws should count for more than one win plus one loss, the scoring system should reflect this.
In what sense? In terms of producing a rating that predicts future results more accurately? If so, are these future results measured using current scoring systems (loss 0, draw 0.5, win 1)?

-Sam
The experts here say that weighting draws more heavily in the rating formula improves the predictability of the system. It would seem like common sense that if the rating system were changed to reflect this, the scoring system should match it as much as possible. Theoretically this should increase the chance that the best player will win the tournament, or to put it another way it should decrease the luck factor in the final standings.

If one were to change the scoring system because of the results of the rating system, then one would want to adjust the rating system to reflect the new scoring system. Presumably it would converge somewhere.

When I see things starting to get complicated like that, my usual conclusion is that it's best to just keep it simple (like the current scoring system). It is a little like when tablebases started to show that some positions were wins requiring more than 50 moves with no pawn moves or captures, and chess organizations started to encode exceptions in the 50-move rule. Luckily, that experiment was ended reasonably quickly.

-Sam

lkaufman · Post by **lkaufman** » Sun Sep 22, 2013 3:41 pm

BubbaTough wrote:
lkaufman wrote:
BubbaTough wrote:
lkaufman wrote: The theoreticians would say that since empirical evidence indicates that two draws should count for more than one win plus one loss, the scoring system should reflect this.
In what sense? In terms of producing a rating that predicts future results more accurately? If so, are these future results measured using current scoring systems (loss 0, draw 0.5, win 1)?

-Sam
The experts here say that weighting draws more heavily in the rating formula improves the predictability of the system. It would seem like common sense that if the rating system were changed to reflect this, the scoring system should match it as much as possible. Theoretically this should increase the chance that the best player will win the tournament, or to put it another way it should decrease the luck factor in the final standings.
If one were to change the scoring system because of the results of the rating system, then one would want to adjust the rating system to reflect the new scoring system. Presumably it would converge somewhere.

When I see things starting to get complicated like that, my usual conclusion is that it's best to just keep it simple (like the current scoring system). It is a little like when tablebases started to show that some positions were wins requiring more than 50 moves with no pawn moves or captures, and chess organizations started to encode exceptions in the 50-move rule. Luckily, that experiment was ended reasonably quickly.

-Sam

Well, I don't agree about the fifty-move rule; I think some sort of generalized exception is called for. But in general, simple is best. Probably just using number of wins as a tiebreak is all that is needed with the current scoring system, and at least now we have a theoretical justification for it.

Daniel Shawul · Post by **Daniel Shawul** » Mon Sep 23, 2013 12:13 am

I don't know what you are talking about with relative/absolute error... But what you said earlier about thresholds is exactly how the likelihood of a draw is assigned in GD, as Remi said. Given the strengths of two players, each with some performance distribution (be it logistic or normal), the game is a draw when the difference in their performances falls within a threshold. That is how the law of comparative judgement works. It doesn't matter what kind of distribution the performance of a player follows. GD follows Thurstone-Mosteller and the others follow Bradley-Terry, so both models are included in the paper. As HG mentioned, you can't tell the outcome via a random walk because you don't know what the cumulative distribution looks like. Here is a quote from a paper that I referenced that says it all with regard to chess games:

Henery, R. J. (1992a). An extension to the Thurstone-Mosteller model for chess. Journal of the Royal Statistical Society, 41(5):559–567.

The normal model for chess is appropriate if winning at chess is by the accumulation of small advantages, so that Y is a sum of small quantities. If gross blunders play an important part, the Bradley-Terry model is more appropriate. Neither model gives a convincing description of chess: some kind of mixture of distributions or scale parameters would be necessary (cf Section 3.3). However, if, as here, we are concerned with players with very similar abilities, the precise mathematical form of the distribution for X is not too important. Substantial bias may arise when using the wrong model with large differences in abilities: this general point is discussed by Latta (1979).

Daniel Shawul · Post by **Daniel Shawul** » Mon Sep 23, 2013 3:06 pm

This is also wrong. There is no way you can tell, visually or by any better means, when you only have ratings calculated with one draw model. Adam used just one set of ratings, calculated with one draw model, so I don't know how you can tell which is better. So what you say about guessing which draw model is correct, giving numbers like 1.8D or 1.9D, is not correct at all. Davidson will better fit ratings calculated with it (a circle, as you call it), and Rao-Kupper will better fit ratings calculated with its own model (parabola-like). In fact, without scaling, the data/model plots for DV and RK are so different that you would think one of them is completely bogus. This is the model vs. observation result, so how the hell can you tell which fits better??

hgm · Post by **hgm** » Mon Sep 23, 2013 8:06 pm

It is true that one model compresses the rating scale compared to another, but that should not matter. Models that differ in how they weight draws relative to wins and losses should eventually (in the limit of a large number of games) provide the same ratings, except for the scaling.

For any model you can plot drawRate as a function of DeltaElo, and compare it with (winRate * lossRate)^(N/2), to see which N gives the best fit. That determines for how many wins + losses a draw has to be counted to get fastest convergence with the number of games, if you 'naively' calculate the ratings from the score percentage, rather than from maximum likelihood with an assumed draw model.
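A minimal sketch of that fit, assuming the games have already been binned by DeltaElo and the win, draw and loss rates measured per bin. Taking logs turns drawRate ≈ c * (winRate * lossRate)^(N/2) into a straight line, so N is twice the least-squares slope. The function names and the synthetic data are illustrative only; the Davidson-type generator uses a common textbook parametrisation that is not spelled out in this thread.

```python
import math

def fit_draw_exponent(win_rates, draw_rates, loss_rates):
    """Fit log(D) = log(c) + (N/2) * log(W * L) by least squares; return (N, c)."""
    xs = [math.log(w * l) for w, l in zip(win_rates, loss_rates)]
    ys = [math.log(d) for d in draw_rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return 2.0 * slope, math.exp(intercept)

def davidson_rates(delta_elo, nu=1.0):
    """Synthetic (W, D, L) rates from a Davidson-type model (assumed form)."""
    g = 10 ** (delta_elo / 400.0)          # strength ratio of the two players
    denom = g + 1.0 + nu * math.sqrt(g)
    return g / denom, nu * math.sqrt(g) / denom, 1.0 / denom

# Hypothetical rating-difference bins, for illustration only.
deltas = [0, 50, 100, 200, 300, 400]
wins, draws, losses = zip(*(davidson_rates(d) for d in deltas))
fitted_n, fitted_c = fit_draw_exponent(wins, draws, losses)
print(fitted_n, fitted_c)
```

For this synthetic input the draw rate is exactly proportional to the square root of winRate * lossRate, so the exponent N/2 comes out as 1/2. With real data the points will scatter around the line, and the residuals show how well a single N describes the whole DeltaElo range.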

Uri Blass · Post by **Uri Blass** » Mon Sep 23, 2013 8:59 pm

Daniel Shawul wrote:
BubbaTough wrote:
lkaufman wrote:Thanks. One consequence of the notion that one win plus one loss equals less than two draws is that it gives theoretical validation to using number of wins as a tiebreak, I think. The current popular practice of scoring draws as 1 but wins as 3 might also be justified with this logic,although it seems way too severe to me (and also works in reverse for the players in the bottom half of the table). Can you suggest a better scoring system for tournaments that captures this idea? The best I can come up with is something like this: wins 3 out of 3, draws 2 out of 4, losses zero out of 3, scoring by percentage of total possible points. So one win and three draws gets 9 out of 15 for 60%, while two wins, one draw, and one loss (the same by standard count) gets 8 out of 13 for 61.5%. Practical effect would be the same as just making number of wins the tiebreak with standard count for the top half, but it gets the bottom half right too. Comments welcome.
The whole idea is ridiculous to me, though I haven't been following the thread carefully. Could you summarize why in the world one would want to use anything other than a standard {0, 0.5, 1} scoring metric?

-Sam
The guy is seriously confused; you are advised to ignore him. The winning percentages are the same for both 1W+3D and 2W+1D+1L, namely 2.5 points out of 4 => 62.5%. The goal is not to switch to a soccer scoring system by changing this percentage, but to change the _ratings_ assigned to the players. For example, if I draw with white for a 50% score, I would get a lower rating than if I did it with black. All draw models give exactly the same winning percentages for every player, even though the players are assigned different Elos based on home advantage (white or black) or draw ratio (handled via drawElo by bayeselo)...
So Glenn-David, Rao-Kupper, and Davidson give the same winning percentages but different Elos, so someone please stop this guy from spreading his misinformed crap.
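For reference, the arithmetic in the quoted scoring proposal (win 3 out of 3, draw 2 out of 4, loss 0 out of 3) and in the standard count can be checked with a few lines of Python. This is only a sketch; the function names are made up for illustration.

```python
from fractions import Fraction

def proposed_percentage(wins, draws, losses):
    """Quoted proposal: win = 3 of 3, draw = 2 of 4, loss = 0 of 3 (hypothetical helper)."""
    earned = 3 * wins + 2 * draws
    possible = 3 * wins + 4 * draws + 3 * losses
    return Fraction(earned, possible)

def standard_percentage(wins, draws, losses):
    """Standard count: win = 1, draw = 0.5, loss = 0, as a fraction of games played."""
    return Fraction(2 * wins + draws, 2 * (wins + draws + losses))

for record in [(1, 3, 0), (2, 1, 1)]:
    print(record,
          float(standard_percentage(*record)),
          float(proposed_percentage(*record)))
```

Both records score 2.5 out of 4 under the standard count, while the proposal gives 9/15 for one win and three draws and 8/13 for two wins, a draw and a loss.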

The question is whether the models can give different ratings to players who got the same score from the same number of games with white and with black.

Suppose we do not have the problem of one player playing more with white and another playing more with black (for example, 5 players in a round robin, so that everyone plays twice with white and twice with black).

Suppose also that all players start with the same rating.
The question is whether 2 wins, a draw and a loss can give a different rating than a win and 3 draws.

If they can give different ratings, then common sense suggests that you should also give them different rankings.

Note that the right way to test the candidate models is to find the model that gives the smallest error function when predicting results.

Basically, you have a rating r that you calculate from the games played before the game in question, and you also have a function for the expected result:
expected_result(rating_white,rating_black)

You calculate an error function that is the sum of the squares of the differences between expected_result and observed_result, and the best model is the one that gives the smallest value of the error function over many games.
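The error function described above could be sketched as follows. The logistic expected_result, the white_advantage parameter and the sample games are assumptions made for illustration; in a real test the ratings fed in must come only from games played before the game being predicted.

```python
def expected_result(rating_white, rating_black, white_advantage=0.0):
    """Logistic expected score for white (one possible choice, assumed here)."""
    diff = rating_white + white_advantage - rating_black
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))

def squared_error(games, predictor):
    """Sum of squared differences between predicted and observed results.

    games: iterable of (rating_white, rating_black, observed_result),
    with observed_result being 1, 0.5 or 0 from white's point of view.
    """
    return sum((predictor(rw, rb) - observed) ** 2 for rw, rb, observed in games)

# Hypothetical games, for illustration only.
games = [
    (2400, 2400, 0.5),
    (2500, 2300, 1.0),
    (2300, 2500, 0.0),
    (2450, 2400, 0.5),
]

error_logistic = squared_error(games, expected_result)
error_coin_flip = squared_error(games, lambda rw, rb: 0.5)  # baseline that ignores ratings
print(error_logistic, error_coin_flip)
```

Comparing two rating models then means running the same loop with each model's ratings and expected_result function and keeping the one with the smaller total.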

Daniel Shawul · Post by **Daniel Shawul** » Mon Sep 23, 2013 9:41 pm

There is a third model that says 1.5 draws = 1 win + 1 loss.

I have to qualify a previous statement about the GD model being a model of 1.5D=W+L. That is true only for the parameters Remi chose for the plots, namely P(W|delta=0)=P(L|delta=0)=P(D|delta=0)=0.3333, meaning that games between two equally strong players are 33% draws. For other parameters, chosen such that P(D|delta=0)~60%, the GD model behaves more like 2D~W+L, as shown in the 2nd plot. And for fewer draws, P(D|delta=0)=10%, it becomes 1.3D~W+L, as shown in the 3rd plot.

Therefore the GD model behaves differently from RK and DV, which give the same relations, 1D=W+L and 2D=W+L respectively, for any chosen parameters. This is due to the 'invariant nature' of the relations below, which hold for any values of the parameters and become obvious for the DV model once P(D)^2 is expanded to P(D)*P(D). The GD model cannot be formulated in the same manner because of the erf() function, but apparently it can have an effect like the 'fixed' DV or RK model ...

RK:  P(D)=factor*P(W)*P(L)
DV:  P(D)*P(D)=factor*P(W)*P(L)
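The two invariants can be checked numerically. The sketch below assumes the usual textbook parametrisations of Rao-Kupper (draw parameter theta > 1) and Davidson (draw parameter nu), which are not written out in this thread; under those forms the factor is theta^2 - 1 for RK and nu^2 for DV, whatever the rating difference.

```python
import math

def rao_kupper(pi_i, pi_j, theta):
    """(win, draw, loss) probabilities under Rao-Kupper, assumed standard form."""
    p_win = pi_i / (pi_i + theta * pi_j)
    p_loss = pi_j / (pi_j + theta * pi_i)
    return p_win, 1.0 - p_win - p_loss, p_loss

def davidson(pi_i, pi_j, nu):
    """(win, draw, loss) probabilities under Davidson, assumed standard form."""
    root = math.sqrt(pi_i * pi_j)
    denom = pi_i + pi_j + nu * root
    return pi_i / denom, nu * root / denom, pi_j / denom

theta, nu = 1.5, 0.8          # arbitrary illustrative draw parameters
rk_factors, dv_factors = [], []
for delta_elo in (0, 100, 200, 400):
    pi_i, pi_j = 10 ** (delta_elo / 400.0), 1.0
    w, d, l = rao_kupper(pi_i, pi_j, theta)
    rk_factors.append(d / (w * l))        # RK: P(D) = factor * P(W) * P(L)
    w, d, l = davidson(pi_i, pi_j, nu)
    dv_factors.append(d * d / (w * l))    # DV: P(D)^2 = factor * P(W) * P(L)

print(rk_factors)   # constant across rating differences
print(dv_factors)   # constant across rating differences
```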

Daniel Shawul · Post by **Daniel Shawul** » Mon Sep 23, 2013 9:54 pm

Note that the right way to test the candidate models is to find the model that gives the smallest error function when predicting results.

Basically, you have a rating r that you calculate from the games played before the game in question, and you also have a function for the expected result:
expected_result(rating_white,rating_black)

You calculate an error function that is the sum of the squares of the differences between expected_result and observed_result, and the best model is the one that gives the smallest value of the error function over many games.

This is exactly what is done in the paper, namely cross-validation tests, but I am sure there are better Bayesian model selection approaches. Knowing how hard it is to prove which model is best, it baffles me that some here 'clearly' see that the Davidson model is better. That has been my qualm: you can't really know until you run the test with all draw models. We did the standard 10-fold cross-validation (cross-10), i.e. partition the data set into 10 groups, train on 9 of them and test on the remaining one. DV showed a better match there and also for cross-2 and cross-4, on CCRL/CEGT blitz and standard tc ratings, for a total of 4 data sets.
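A rough sketch of the train-on-9, test-on-1 procedure, not the code used for the paper. The 'model' here is a toy that only estimates outcome frequencies; a real comparison would fit ratings under each draw model on the training folds and score the held-out games with that model's probabilities. All names and the sample data are hypothetical.

```python
import math
import random

def k_fold_splits(items, k, seed=0):
    """Yield (train, test) pairs so that each item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

def fit_outcome_frequencies(train):
    """Toy stand-in for a draw model: W/D/L frequencies with add-one smoothing."""
    counts = {"W": 1, "D": 1, "L": 1}
    for outcome in train:
        counts[outcome] += 1
    total = sum(counts.values())
    return {outcome: c / total for outcome, c in counts.items()}

def held_out_log_likelihood(outcomes, k, fit):
    """Sum of log-probabilities assigned to the held-out outcomes over all folds."""
    total = 0.0
    for train, test in k_fold_splits(outcomes, k):
        probs = fit(train)
        total += sum(math.log(probs[o]) for o in test)
    return total

# Hypothetical results: 20 wins, 70 draws, 10 losses.
outcomes = ["W"] * 20 + ["D"] * 70 + ["L"] * 10
ll_fitted = held_out_log_likelihood(outcomes, 10, fit_outcome_frequencies)
ll_uniform = held_out_log_likelihood(
    outcomes, 10, lambda train: {"W": 1 / 3, "D": 1 / 3, "L": 1 / 3})
print(ll_fitted, ll_uniform)  # the higher (less negative) value predicts held-out games better
```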

The models are called 'draw models', but they do have a parameter for home advantage (white/black). It may also be possible to add other modifiers, like game length, to change ratings, e.g. so that quick wins give you higher ratings. The point system is the same for all of them; only the ratings differ. When you say 1W+3D, you get 62.5%, and then you add other relevant values: the draw ratio is 3/4=75%, the game length for the win is 30, the win is with black against a 2300 Elo, but all draws are against a 2400 Elo, etc. All of these contribute to the predicted rating (strength) of the player, and the different 'draw' models obviously assign different strengths.

Uri Blass · Post by **Uri Blass** » Tue Sep 24, 2013 5:45 am

Daniel Shawul wrote:
Note that the right way to test the candidate models is to find the model that gives the smallest error function when predicting results.

Basically, you have a rating r that you calculate from the games played before the game in question, and you also have a function for the expected result:
expected_result(rating_white,rating_black)

You calculate an error function that is the sum of the squares of the differences between expected_result and observed_result, and the best model is the one that gives the smallest value of the error function over many games.
This is exactly what is done in the paper, namely cross-validation tests, but I am sure there are better Bayesian model selection approaches. Knowing how hard it is to prove which model is best, it baffles me that some here 'clearly' see that the Davidson model is better. That has been my qualm: you can't really know until you run the test with all draw models. We did the standard 10-fold cross-validation (cross-10), i.e. partition the data set into 10 groups, train on 9 of them and test on the remaining one. DV showed a better match there and also for cross-2 and cross-4, on CCRL/CEGT blitz and standard tc ratings, for a total of 4 data sets.

The models are called 'draw models', but they do have a parameter for home advantage (white/black). It may also be possible to add other modifiers, like game length, to change ratings, e.g. so that quick wins give you higher ratings. The point system is the same for all of them; only the ratings differ. When you say 1W+3D, you get 62.5%, and then you add other relevant values: the draw ratio is 3/4=75%, the game length for the win is 30, the win is with black against a 2300 Elo, but all draws are against a 2400 Elo, etc. All of these contribute to the predicted rating (strength) of the player, and the different 'draw' models obviously assign different strengths.

1) For chess program ratings:
Using game length may increase the rating of programs that never resign (or, if the interface adjudicates games based on evaluations, it will increase the rating of programs that never show a very bad evaluation).

If you want to use the PGN of the games and not only the results, then it is better to use computer analysis of the games to calculate ratings, so that both players can earn rating points if they played better than their rating, and both players can lose rating points if they played worse than their rating according to the computer analysis.

Note that I do not like this idea, because there is a problem in calculating the ratings of the strong programs this way. For example, if you use Houdini to analyze Houdini's games, it may inflate Houdini's rating, and if you want accurate results from computer analysis you may need significantly more time to analyze the games than was used to play them.

2) For human chess ratings:
I am against all these ideas in human-human games because they encourage cheating.
Two players can simply prepare their game at home and earn rating points from their draw if you use computer analysis to calculate ratings.

I am also against the idea that 2 draws do not count the same as a win and a loss for rating or ranking humans, because I think this idea also encourages cheating (if a win plus a loss is not equal to 2 draws, then players of equal strength have a motive to fix their result before the game so that they get more from their expected 50% result).
