Elo Increase per Doubling

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Elo Increase per Doubling

Post by Adam Hair »

hgm wrote:
Adam Hair wrote:For time odds, I use Winboard. If you use the latest Winboard with the built-in tournament manager, then set up the tournament with your base time control. Then, you have to modify winboard.ini (found at C:\Documents and Settings\Administrator\Application Data in Windows XP). For each engine in your tournament, add -firstTimeOdds=x, so that 1/x times the base time control equals the time control that particular engine will play with. If you use PSWBTM as the tournament manager, then you have to add /%sTimeOdds=x (where x is the same as above) for each engine's parameter in the PSWBTM Engine Manager.
Note that it is also easy to do 'depth odds' in WinBoard. To limit an engine to play at a given depth D, without affecting the depth of its opponent, you can add to its install line:

-firstInitString="new\nrandom\nsd D\n"

i.e. append an sd D line to the usual "new\nrandom\n". This will then be automatically sent to the engine at the beginning of every game. The same method can be used to set a parameter in only a single engine, where WinBoard does not provide separate controls for each engine. E.g. 'core odds', where you want to fix the number of CPUs on a per-engine basis, rather than having the engine use the GUI settings for this, can be achieved by

-firstInitString="cores 4\nnew\nrandom\n"

In this case the cores 4 has to precede the new, as the cores command is typically sent before new (to spare engines the agony of having to change the number of threads during a game or search). You could do the same for 'hash-odds' games, where the engines use different hash settings; just prefix the init string with a memory M command, where M is the desired hash size.
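Combining the two ideas above, the install line for an engine that should play with half the base time, limited to depth 12 on 4 cores, might look something like this (the odds factor, depth and core count are just example values):

-firstTimeOdds=2 -firstInitString="cores 4\nnew\nrandom\nsd 12\n"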
Should these be added to the Winboard.ini file or do they go in the command line? Sorry for the stupid question. I am still learning with Winboard.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Elo Increase per Doubling

Post by Don »

petero2 wrote:
Don wrote:
petero2 wrote:Yes, the logarithmic formula goes to infinity, so not a good approximation for large x.

Here is an exponential fit instead: Elo = 1495 + 2996 * (1 - exp(-level / 12.8))

I've added 2 more levels to the same test and I'm going to get a larger sample of games. This is interesting - even though I don't know if it's a very accurate way to estimate the highest rating, it's a lot of fun to try.

I wrote a quick program to find the 3 constants in this formula that minimize the "least squares" error. With the new data points I get this:

Elo = 1488.20 + 2692.70 * (1 - exp(-level / 11.09))

And the maximum achievable ELO is ... (drum roll please) ... 4180! I don't have large samples at these high levels and I will continue to run the test for at least a couple more days or longer - and recheck the estimate.
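For anyone who wants to reproduce this kind of fit, here is a minimal sketch in Python using scipy. The ratings below are synthetic placeholders generated from the fitted curve plus noise, standing in for the real fixed-nodes ladder results:

Code: Select all

import numpy as np
from scipy.optimize import curve_fit

# Model: Elo = a + b * (1 - exp(-level / c)); the asymptote is a + b.
def model(level, a, b, c):
    return a + b * (1.0 - np.exp(-level / c))

# Placeholder data: 13 levels, ratings taken from the fitted curve plus noise.
rng = np.random.default_rng(0)
levels = np.arange(1, 14, dtype=float)
elos = model(levels, 1488.2, 2692.7, 11.09) + rng.normal(0.0, 10.0, levels.size)

# Least-squares fit of the three constants.
(a, b, c), _ = curve_fit(model, levels, elos, p0=[1500.0, 3000.0, 12.0])
print("Elo = %.2f + %.2f * (1 - exp(-level / %.2f))" % (a, b, c))
print("maximum achievable Elo (asymptote): %.1f" % (a + b))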

It may be interesting to run the same test on a different program and see if the estimate is in the same ballpark. Preferably a program that has a solid fixed nodes testing level. This would make the "guess" more believable if they agreed.

Don
My parameters were also computed by minimizing the least squares error. I happened to have an old Octave implementation of the Gauss-Newton method that I used. To see how much the estimate is affected by measurement errors, I added normally distributed noise with standard deviation 10 to the rating values and computed the corresponding maximum ELO. I repeated this 100000 times and made a histogram:
Image
Average value: 4497
Standard deviation: 116

However, I believe an even bigger error source is the fact that the true rating curve is most likely not an exponential function, so the large extrapolation is probably unsound.
ELO itself of course is not perfectly sound for this and I agree there is too much we don't know. However, I'm pretty amazed at how well this exponential curve fits the data so far at these levels.

Also, I have believed for a long time that we have AT LEAST 1000 more ELO to achieve perfect play. Of course that is just my own intuition, a bigger source of error than any of these other things :-)

I am playing a 5m+5s match with Houdini 1.5 64 right now and I notice that 40% of the games are draws. That means each program lost about 30 games out of every 100, so it had about 30 chances per 100 games to convert a loss into a draw. It would take a LOT of ELO to do that.

But we are not even considering the lost opportunities to convert draws into wins that both programs almost certainly missed. So it's very easy for me to believe that another 1000 ELO is not that much and it could be much higher.

Code: Select all

Rank Name           Elo      +      -    games   score   oppo.   draws 
   1 4428.15      3006.4   43.4   43.4     228   50.9%  3000.0   39.5% 
   2 Houdini_1.5  3000.0   43.4   43.4     228   49.1%  3006.4   39.5% 
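As a quick sanity check on that table: under the standard Elo model the expected score for a rating difference D is 1/(1 + 10^(-D/400)), so the difference implied by a score s is 400 * log10(s / (1 - s)). Plugging in the 50.9% score from the table above:

Code: Select all

import math

score = 0.509  # the top engine's score vs Houdini_1.5 in the table above
diff = 400.0 * math.log10(score / (1.0 - score))
print("implied rating difference: %.1f Elo" % diff)  # prints about 6.3, close to the 6.4 Elo gap in the table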
Anyway, it's fun to speculate on this. I have some confidence in it because I don't believe any program is capable of playing perfect chess (no matter how many resources it is given) due to GHI and zugzwang issues. So any asymptote could be short of the mark. Maybe in 20 years we will look back and not be able to believe our progress, and computers will be playing 1000 ELO stronger and still winning and losing games!
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
lkaufman
Posts: 5970
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Elo Increase per Doubling

Post by lkaufman »

In my view the rating system gradually starts to break down when players are paired with rating differences greater than the value of the first move, say 50 elo or so. Theoretically Black should play for a draw and White should play to win, but once White becomes an underdog he can play for a draw instead. Then results depend on whether Black can avoid this by playing unsoundly but not too much so and on how good White is at playing to draw instead of trying to play properly. So when we talk about the rating of "God" it all depends on whether pairings are limited to close ones or not. The results of experiments will also depend on this. The 4500 estimate does seem reasonable to me if only close pairings are allowed, but not otherwise, unless of course we assume God is reading the minds of his opponents to see what they will miss, a sort of cheating. To put it simply, I think Carlsen could get many draws with White against "perfect" play, but not against a player who modified his play drastically based on how "weak" his opponent was. Ratings cease to measure objective strength of play when this becomes a factor. I see it all the time in actual human play. For example, there is one master against whom I have an overwhelming score (something like 12 wins, 4 draws, no losses), way better than the rating gap of about 150 would predict. The reason is that he plays super-sharp but dubious defenses as Black (mainly the Benoni) which avoid draws against lower-rated opponents and allow him to obtain a much higher rating than he would get against equals or superiors.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Elo Increase per Doubling

Post by Don »

lkaufman wrote:In my view the rating system gradually starts to break down when players are paired with rating differences greater than the value of the first move, say 50 elo or so. Theoretically Black should play for a draw and White should play to win, but once White becomes an underdog he can play for a draw instead. Then results depend on whether Black can avoid this by playing unsoundly but not too much so and on how good White is at playing to draw instead of trying to play properly. So when we talk about the rating of "God" it all depends on whether pairings are limited to close ones or not. The results of experiments will also depend on this. The 4500 estimate does seem reasonable to me if only close pairings are allowed, but not otherwise, unless of course we assume God is reading the minds of his opponents to see what they will miss, a sort of cheating. To put it simply, I think Carlsen could get many draws with White against "perfect" play, but not against a player who modified his play drastically based on how "weak" his opponent was. Ratings cease to measure objective strength of play when this becomes a factor. I see it all the time in actual human play. For example, there is one master against whom I have an overwhelming score (something like 12 wins, 4 draws, no losses), way better than the rating gap of about 150 would predict. The reason is that he plays super-sharp but dubious defenses as Black (mainly the Benoni) which avoid draws against lower-rated opponents and allow him to obtain a much higher rating than he would get against equals or superiors.
But essentially for a study like this we don't want the players making any assumptions. In other words they should play the board. I'm not sure the result would be valid if the players were playing for cheap-shots with unsound moves or giving up a chance to draw in a bad position (which is also unsound) even though it will give a better result against a weak opponent.

We could view this study as what would happen in a tournament where no one had information about their opponents. Here is something to ponder - if BOTH players know about their opponent, does the power of knowing cancel out? We know that any knowledge improves your chances but since both players have knowledge does one have more of an advantage than the other?

A typical problem is coming out of book with an inferior position as Black even though you are the stronger player. Without knowledge you take a draw; with knowledge you avoid it, so it would seem to favor the stronger player. However, why can't the weaker player in other positions use the threat of a draw as leverage? He WANTS a draw and he knows his opponent does NOT want a draw.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
lkaufman
Posts: 5970
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Elo Increase per Doubling

Post by lkaufman »

Don wrote:
lkaufman wrote:In my view the rating system gradually starts to break down when players are paired with rating differences greater than the value of the first move, say 50 elo or so. Theoretically Black should play for a draw and White should play to win, but once White becomes an underdog he can play for a draw instead. Then results depend on whether Black can avoid this by playing unsoundly but not too much so and on how good White is at playing to draw instead of trying to play properly. So when we talk about the rating of "God" it all depends on whether pairings are limited to close ones or not. The results of experiments will also depend on this. The 4500 estimate does seem reasonable to me if only close pairings are allowed, but not otherwise, unless of course we assume God is reading the minds of his opponents to see what they will miss, a sort of cheating. To put it simply, I think Carlsen could get many draws with White against "perfect" play, but not against a player who modified his play drastically based on how "weak" his opponent was. Ratings cease to measure objective strength of play when this becomes a factor. I see it all the time in actual human play. For example, there is one master against whom I have an overwhelming score (something like 12 wins, 4 draws, no losses), way better than the rating gap of about 150 would predict. The reason is that he plays super-sharp but dubious defenses as Black (mainly the Benoni) which avoid draws against lower-rated opponents and allow him to obtain a much higher rating than he would get against equals or superiors.
But essentially for a study like this we don't want the players making any assumptions. In other words they should play the board. I'm not sure the result would be valid if the players were playing for cheap-shots with unsound moves or giving up a chance to draw in a bad position (which is also unsound) even though it will give a better result against a weak opponent.

We could view this study as what would happen in a tournament where no one had information about their opponents. Here is something to ponder - if BOTH players know about their opponent, does the power of knowing cancel out? We know that any knowledge improves your chances but since both players have knowledge does one have more of an advantage than the other?

A typical problem is coming out of book with an inferior position as Black even though you are the stronger player. Without knowledge you take a draw; with knowledge you avoid it, so it would seem to favor the stronger player. However, why can't the weaker player in other positions use the threat of a draw as leverage? He WANTS a draw and he knows his opponent does NOT want a draw.
That is precisely the problem. It means that ratings are not well defined even in theory unless mismatches are avoided (or not rated). With mismatches the peak possible rating becomes some function of the frequency and magnitude of mismatches. Basically I think the weaker player has more power to force a draw as White than the stronger one has to avoid one.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Elo Increase per Doubling

Post by Don »

lkaufman wrote:
Don wrote:
lkaufman wrote:In my view the rating system gradually starts to break down when players are paired with rating differences greater than the value of the first move, say 50 elo or so. Theoretically Black should play for a draw and White should play to win, but once White becomes an underdog he can play for a draw instead. Then results depend on whether Black can avoid this by playing unsoundly but not too much so and on how good White is at playing to draw instead of trying to play properly. So when we talk about the rating of "God" it all depends on whether pairings are limited to close ones or not. The results of experiments will also depend on this. The 4500 estimate does seem reasonable to me if only close pairings are allowed, but not otherwise, unless of course we assume God is reading the minds of his opponents to see what they will miss, a sort of cheating. To put it simply, I think Carlsen could get many draws with White against "perfect" play, but not against a player who modified his play drastically based on how "weak" his opponent was. Ratings cease to measure objective strength of play when this becomes a factor. I see it all the time in actual human play. For example, there is one master against whom I have an overwhelming score (something like 12 wins, 4 draws, no losses), way better than the rating gap of about 150 would predict. The reason is that he plays super-sharp but dubious defenses as Black (mainly the Benoni) which avoid draws against lower-rated opponents and allow him to obtain a much higher rating than he would get against equals or superiors.
But essentially for a study like this we don't want the players making any assumptions. In other words they should play the board. I'm not sure the result would be valid if the players were playing for cheap-shots with unsound moves or giving up a chance to draw in a bad position (which is also unsound) even though it will give a better result against a weak opponent.

We could view this study as what would happen in a tournament where no one had information about their opponents. Here is something to ponder - if BOTH players know about their opponent, does the power of knowing cancel out? We know that any knowledge improves your chances but since both players have knowledge does one have more of an advantage than the other?

A typical problem is coming out of book with an inferior position as Black even though you are the stronger player. Without knowledge you take a draw; with knowledge you avoid it, so it would seem to favor the stronger player. However, why can't the weaker player in other positions use the threat of a draw as leverage? He WANTS a draw and he knows his opponent does NOT want a draw.
That is precisely the problem. It means that ratings are not well defined even in theory unless mismatches are avoided (or not rated). With mismatches the peak possible rating becomes some function of the frequency and magnitude of mismatches. Basically I think the weaker player has more power to force a draw as White than the stronger one has to avoid one.
The current asymptote is 4209.7 after more data collected.

Basically what you are saying is what we have always known and there are multiple issues here which is why I say this is just for fun. Here are just some of the issues:

1. True chess skill is not transitive.
2. Similar point - true chess skill is multi-dimensional.
3. ELO is not a perfect way to model chess skill.
4. No program can play perfect chess even with infinite resources due to GHI issues.
5. The exponential formula may not even be very appropriate.
6. The book could have some positions that are game theoretic wins.

Point 4 can be fixed by removing hash tables, or by re-engineering them in a way that is not nearly as efficient but does solve the GHI problem. I don't think zugzwang is an issue, as Komodo can detect that.

Another thing that bothers me about this is that the projected ELO for a level that takes 24 hours to search a move is about 4000, surprisingly close to the asymptote. That implies that we are pretty close to perfect play, and I just don't believe that. That makes me suspect the curve shape. On the other hand, perhaps we are not far from the point computer checkers has been at for a long time. We DO see a lot of draws at high levels, and perhaps at correspondence time controls we would get mostly draws?

It's unclear to me what to do about the contempt and opponent modeling issue in a game where it's not part of the rules to know anything about your opponent. But I do think a study like this should be done with a contempt factor that is high enough to cover the white advantage and a little more - i.e. all programs should play for a win (given that we don't know anything about the opponent) so that we don't see ridiculous upsets.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Elo Increase per Doubling

Post by Don »

petero2 wrote: My parameters were also computed by minimizing the least squares error. I happened to have an old Octave implementation of the Gauss-Newton method that I used. To see how much the estimate is affected by measurement errors, I added normally distributed noise with standard deviation 10 to the rating values and computed the corresponding maximum ELO. I repeated this 100000 times and made a histogram:
Image
Average value: 4497
Standard deviation: 116

However, I believe an even bigger error source is the fact that the true rating curve is most likely not an exponential function, so the large extrapolation is probably unsound.
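For reference, a rough sketch of this error-propagation experiment in Python (scipy/numpy standing in for the Octave Gauss-Newton code; the base ratings are synthetic values taken from the fitted curve, not the actual measurements):

Code: Select all

import numpy as np
from scipy.optimize import curve_fit

def model(level, a, b, c):
    return a + b * (1.0 - np.exp(-level / c))

rng = np.random.default_rng(1)
levels = np.arange(1, 14, dtype=float)
base = model(levels, 1488.2, 2692.7, 11.09)  # stand-in for the measured ratings

max_elos = []
for _ in range(10000):  # 100000 repetitions in the original experiment
    noisy = base + rng.normal(0.0, 10.0, levels.size)  # noise with standard deviation 10
    try:
        (a, b, c), _ = curve_fit(model, levels, noisy, p0=[1500.0, 3000.0, 12.0], maxfev=10000)
    except RuntimeError:  # skip the occasional fit that fails to converge
        continue
    max_elos.append(a + b)  # the asymptotic 'maximum ELO' for this noisy sample

print("mean maximum Elo: %.0f, standard deviation: %.0f" % (np.mean(max_elos), np.std(max_elos)))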
That is an interesting histogram. That gives me some confidence that this is not hugely affected by minor errors.

I'm repeating the study with Critter. It will be very interesting to see how the asymptote compares to Komodo's. The final value of the Komodo test was an asymptote of 4209.7.

I'm using only 12 levels instead of 13 (because this test takes too long to run), and even though each level has played only about 80 games so far, the numbers are looking very similar to Komodo's numbers. The current max Elo computed by the Critter run is 4129.6 - that is a pretty amazing similarity!
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
RoadWarrior
Posts: 73
Joined: Fri Jan 13, 2012 12:39 am
Location: London, England

Re: Elo Increase per Doubling

Post by RoadWarrior »

JuLieN wrote:
Adam Hair wrote:And I always seem to use my time running and examining experiments, as well as figuring out how to work around problems that arise due to my lack of programming knowledge, instead of learning how to write code :) . Well, I do know some programming, but it is in the "monkey see, monkey do" language :lol:
Learning a programming language is not difficult at all. What is difficult is to learn to program. What's the difference?
All programs are poems, it's just that not all programmers are poets. :lol:
There are two types of people in the world: Avoid them both.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Elo Increase per Doubling

Post by Don »

Ok, I'm doing the study with Critter now and I have some data. For calibration purposes I had to place one of the Komodo versions in the test as an ELO reference point - but it's not used in the calculations, of course. I picked the 05 Komodo and fixed its rating at 2461.6, which is what it came out as previously.

For reference the Komodo test indicated an asymptote of 4209.7 and the Critter test (currently) is showing 4174.3. I think it's remarkable how closely they agree.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
petero2
Posts: 697
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Elo Increase per Doubling

Post by petero2 »

Don wrote: The current asymptote is 4209.7 after more data collected.

Basically what you are saying is what we have always known and there are multiple issues here which is why I say this is just for fun. Here are just some of the issues:

1. True chess skill is not transitive.
2. Similar point - true chess skill is multi-dimensional.
3. ELO is not a perfect way to model chess skill.
4. No program can play perfect chess even with infinite resources due to GHI issues.
5. The exponential formula may not even be very appropriate.
6. The book could have some positions that are game theoretic wins.
Another issue is that there are multiple ways to implement a perfect player. Assuming chess is a draw and that it was possible to create a 32-man endgame tablebase, an engine that used that EGTB but just randomly picked an optimal move would probably draw a lot of games even against quite weak players. I imagine a game where the engine played black could start something like this:

1. e4 a6
2. d4 a5

Only when the engine risks losing would it play a "reasonable" move. If white later makes a positionally weak move, the engine will likely give up the advantage immediately. Very similar things happen if you play the weak side of a drawn KRB vs KR endgame against an engine that has tablebases but no "swindle" mode.
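A minimal sketch of such an 'indifferent' perfect player, assuming a hypothetical probe(pos) that returns the game-theoretic value from the side to move's point of view (win = 1, draw = 0, loss = -1); none of these functions correspond to a real tablebase API:

Code: Select all

import random

def indifferent_perfect_move(pos, probe, legal_moves, make_move):
    """Pick uniformly at random among all moves that preserve the
    game-theoretic value of the position (hypothetical 32-man EGTB)."""
    value = probe(pos)  # value for the side to move
    optimal = [m for m in legal_moves(pos)
               if -probe(make_move(pos, m)) == value]  # child values are from the opponent's view
    return random.choice(optimal)

Such a player never worsens its game-theoretic result, but it is completely indifferent between 2...a5 and a "reasonable" developing move as long as both still draw.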

On the other hand, a perfect player that actively tries to steer the game towards positions where the opponent must play very exactly to maintain the draw would probably win almost all games even against the best human players.