Adam Hair wrote: I do think I will revisit both of these types of measurements. Also, Miguel has suggested to me to measure the increase in Elo when the number of nodes is doubled.
Good idea, just one "but": time control is killed. But since that's true for both engines...
Is there no interface that allows each engine its own time control?
I'm running a test of pure node doubling now. I need my computer for other things, so this will be just a few hundred games. Here is what I have so far (I'm still running):
As Ed observes, you do lose time control, but it's the same with fixed depth. The average depth is interesting too: we get a little more than a ply for each doubling of nodes.
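A quick back-of-envelope on that observation: if each doubling of the node budget buys about Δd extra plies, the tree is effectively growing by a factor of 2^(1/Δd) per ply. A minimal sketch, where the 1.2-ply figure is illustrative rather than a measured value:

```python
import math

def effective_branching_factor(depth_gain_per_doubling: float) -> float:
    """If doubling the node budget buys this many extra plies, the tree
    grows by a factor of 2**(1/depth_gain) for each additional ply."""
    return 2.0 ** (1.0 / depth_gain_per_doubling)

# Exactly one ply per doubling implies an effective branching factor of 2.
print(effective_branching_factor(1.0))  # 2.0

# "A little more than a ply" per doubling implies a factor a bit under 2.
print(effective_branching_factor(1.2))  # ~1.78
```

So a gain of slightly more than one ply per doubling is consistent with an effective branching factor slightly below 2.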
For reference, version 00 is 512 nodes and each subsequent level doubles that.
I'll report again later with the deltas and a graph.
Don
P.S. I have bayeselo set to use a confidence of 98%, so my error margins will be larger than normal.
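For anyone wanting to reproduce the arithmetic, the setup above is easy to sketch: level k searches 512·2^k nodes, and a score fraction s between two players maps to an Elo difference via the standard logistic formula. This is just that conversion, not the bayeselo computation itself:

```python
import math

def nodes_for_level(level: int) -> int:
    # Level 00 searches 512 nodes; each subsequent level doubles that.
    return 512 << level

def elo_diff_from_score(score: float) -> float:
    # Standard logistic Elo model: a score fraction s corresponds to a
    # rating difference of -400 * log10(1/s - 1).
    return -400.0 * math.log10(1.0 / score - 1.0)

print(nodes_for_level(10))                 # 524288
print(round(elo_diff_from_score(0.5)))     # 0
print(round(elo_diff_from_score(0.75)))    # 191
```

A 75% score between adjacent levels, for instance, would put one doubling at roughly 191 Elo.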
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
What is "moves changed?" Do you mean the move is changed in the final iteration?
Hi Don,
It measures within an iteration if a best move has changed.
Yes, I realize now that was a stupid question; what else could it be?
That is pretty impressive, that it almost never changes once it's going so deep. But of course I'm sure the number of samples at those depths is way too low to measure with any precision, so the right percentages would asymptotically approach zero. Actually, given enough depth, it would probably reach zero at perfect play!
One note of care: move changes at low iterations are pretty meaningless, but a move change at ply 18+ (or so) is often significant for the outcome of the game. So as impressive as such a statistic seems at first glance, it can be misleading as well.
I plotted these Elo ratings starting at 2000 and also plotted a flat reference line (the green line), which shows a constant gain about equal to the first doubling, for comparison. It's pretty obvious that the gain curves away from linear.
Don
[Plot: red curve of Elo rating vs. doubling level, starting at 2000, with the flat green reference line for comparison]
My personal feeling is that this tends to exaggerate the rating gain. You are basically playing A vs A', where the ONLY difference is 2x speed: identical evals, identical searches, identical time controls. My opinion is that this won't hold true against a suite of opponents. It also tends to break some important things. For example, Crafty tries very hard to always finish the current ply-1 move it is searching before aborting on time, just in case that move is about to fail high (or low). With fixed nodes, it can't do that...
I hate nodes/depth as a search limit for trying to measure anything to do with rating (improvement).
I don't know any way to really get an accurate rating; how you test influences things. For example, testing only against other computers is not accurate either. My node testing has already shown that fixed nodes hurt the Elo substantially, as a good time-control algorithm is worth quite a bit. However, I don't think it has any effect on the validity of this test. I cannot imagine that if you played sudden-death games instead of fixed nodes I would find that programs improve MORE with depth, for example.
I have also found that playing programs in round-robin fashion where the Elo differences are this huge has a very strong compressing effect on the ratings. I have done similar experiments where I only let a player play up or down 2 or 3 levels, and the rating spread is significantly larger.
I think a lot of this has to do with contempt. I have seen games where Komodo should NEVER have had to suffer a draw, because the opponent is over 1000 Elo weaker, but Komodo is playing black and right out of book finds a way to force a draw!
In the test I just ran there was one draw of 10 vs 00, even though the difference is a whopping 1627 Elo! The game was very short: a draw by repetition on White's 16th move.
What is the value of the asymptote to your red curve? (This is of course the maximal Elo in your system).
Is that a joke? I wish I knew where it all ended!
I think if we gathered this data for several more levels and got enough data for each point we could try to estimate the value using some type of curve fitting.
If someone wants to take a crack at estimating this with curve fitting, here is the final result:
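One hedged way to take that crack: assume a saturating model rating(k) = A − B·r^k, where A is the asymptote, and fit it with a grid search over r plus linear least squares for A and B at each candidate r. The data below is synthetic (chosen to start at 2000, as in the plot), not the actual test results:

```python
def fit_saturating(levels, ratings):
    """Fit rating(k) = A - B * r**k (A is the asymptote) by grid-searching
    r and solving the 2x2 least-squares problem for (A, B) at each r."""
    n = len(levels)
    best = None  # (sum of squared residuals, A, B, r)
    for i in range(500, 1000):          # candidate r in [0.500, 0.999]
        r = i / 1000.0
        xs = [r ** k for k in levels]
        sx = sum(xs)
        sxx = sum(x * x for x in xs)
        sy = sum(ratings)
        sxy = sum(x * y for x, y in zip(xs, ratings))
        det = n * sxx - sx * sx
        if abs(det) < 1e-12:
            continue
        # Closed-form least squares for y = A - B*x.
        a = (sxx * sy - sx * sxy) / det
        b = (sx * sy - n * sxy) / det
        ss = sum((y - (a - b * x)) ** 2 for x, y in zip(xs, ratings))
        if best is None or ss < best[0]:
            best = (ss, a, b, r)
    return best[1], best[2], best[3]

# Synthetic example only: ratings start at 2000 and saturate at A = 4600.
true_a, true_b, true_r = 4600.0, 2600.0, 0.9
levels = list(range(11))
ratings = [true_a - true_b * true_r ** k for k in levels]

a, b, r = fit_saturating(levels, ratings)
print(round(a), round(r, 3))  # 4600 0.9
```

With real, noisy match data the estimated asymptote would come with large error bars, since the upper levels have few games, but the same fitting procedure applies.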
I think the highest achievable rating (i.e., "God's rating") must be very high. The Elo gain per doubling seems to decline only gradually, and that is with incorrect contempt factors set.
I'll bet the top programs are still 1000 Elo or more below perfect play at human-like time controls.