Adam Hair wrote: I do think I will revisit both of these types of measurements. Also, Miguel has suggested to me to measure the increase in Elo when the number of nodes is doubled.
Good idea, just one "but": time control is killed. But since that's true for both engines...
Is there no interface that allows each engine its own time control?
I'm running a test of pure node doubling now. I need my computer for other things, so this will be just a few hundred games. Here is what I have so far (I'm still running):
As Ed observes, you do lose time control, but it's the same with fixed depth. The average depth is interesting too: we get a little more than a ply for each doubling of nodes.
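A quick back-of-envelope on that observation: if each doubling of the node budget buys about Δd extra plies, the tree is effectively growing by a factor of 2^(1/Δd) per ply. A minimal sketch, where the 1.2-ply figure is illustrative rather than a measured value:

```python
import math

def effective_branching_factor(depth_gain_per_doubling: float) -> float:
    """If doubling the node budget buys this many extra plies, the tree
    grows by a factor of 2**(1/depth_gain) for each additional ply."""
    return 2.0 ** (1.0 / depth_gain_per_doubling)

# Exactly one ply per doubling implies an effective branching factor of 2.
print(effective_branching_factor(1.0))  # 2.0

# "A little more than a ply" per doubling implies a factor a bit under 2.
print(effective_branching_factor(1.2))  # ~1.78
```

So a gain of slightly more than one ply per doubling is consistent with an effective branching factor slightly below 2.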
For reference, version 00 is 512 nodes and each subsequent level doubles that.
I'll report again later with the deltas and a graph.
Don
P.S. I have bayeselo set to use a confidence of 98%, so my error margins will be larger than normal.
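For anyone wanting to reproduce the arithmetic, the setup above is easy to sketch: level k searches 512·2^k nodes, and a score fraction s between two players maps to an Elo difference via the standard logistic formula. This is just that conversion, not the bayeselo computation itself:

```python
import math

def nodes_for_level(level: int) -> int:
    # Level 00 searches 512 nodes; each subsequent level doubles that.
    return 512 << level

def elo_diff_from_score(score: float) -> float:
    # Standard logistic Elo model: a score fraction s corresponds to a
    # rating difference of -400 * log10(1/s - 1).
    return -400.0 * math.log10(1.0 / score - 1.0)

print(nodes_for_level(10))                 # 524288
print(round(elo_diff_from_score(0.5)))     # 0
print(round(elo_diff_from_score(0.75)))    # 191
```

A 75% score between adjacent levels, for instance, would put one doubling at roughly 191 Elo.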
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
What is "moves changed?" Do you mean the move is changed in the final iteration?
Hi Don,
It measures within an iteration if a best move has changed.
Yes, I realize now that was a stupid question; what else could it be?
That is pretty impressive, that it almost never changes once it's going so deep. But of course I'm sure the number of samples at those depths is way too low to measure with any precision, so the right percentages would asymptotically approach zero. Actually, given enough depth, it would probably reach zero at perfect play!
One note of care: move changes at low iterations are pretty meaningless, but a move change at ply 18+ (or so) is often significant for the outcome of the game. So as impressive as such a statistic seems at first glance, it can be misleading as well.
I plotted these Elo ratings starting at 2000 and also plotted a flat reference line (the green line), which shows a constant gain about equal to the first doubling, for comparison. It's pretty obvious that the gain curves away from linear.
Don
[Plot: red curve of Elo rating vs. doubling level, starting at 2000, with the flat green reference line for comparison]
My personal feeling is that this tends to exaggerate the rating gain. You are basically playing A vs A', where the ONLY difference is 2x speed: identical evals, identical searches, identical time controls. My opinion is that this won't hold true against a suite of opponents. It also tends to break some important things. For example, Crafty tries very hard to always finish the current ply-1 move it is searching before aborting on time, just in case that move is about to fail high (or low). With fixed nodes, it can't do that...
I hate nodes/depth as a search limit for trying to measure anything to do with rating (improvement).
I don't know any way to really get an accurate rating; how you test influences things. For example, testing only against other computers is not accurate either. My node testing has already shown that fixed nodes hurt the Elo substantially, as a good time-control algorithm is worth quite a bit. However, I don't think it has any effect on the validity of this test. I cannot imagine that if you played sudden-death games instead of fixed nodes I would find that programs improve MORE with depth, for example.
I have also found that playing programs in round-robin fashion where the Elo differences are this huge has a very strong compressing effect on the ratings. I have done similar experiments where I only let a player play up or down 2 or 3 levels, and the rating spread is significantly larger.
I think a lot of this has to do with contempt. I have seen games where Komodo should NEVER have had to suffer a draw, because the opponent is over 1000 Elo weaker, but Komodo is playing black and right out of book finds a way to force a draw!
In the test I just ran there was one draw of 10 vs 00, even though the difference is a whopping 1627 Elo! The game was very short: a draw by repetition on White's 16th move.
What is the value of the asymptote to your red curve? (This is of course the maximal Elo in your system).
Is that a joke? I wish I knew where it all ended!
I think if we gathered this data for several more levels and got enough data for each point we could try to estimate the value using some type of curve fitting.
If someone wants to take a crack at estimating this with curve fitting, here is the final result:
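One hedged way to take that crack: assume a saturating model rating(k) = A − B·r^k, where A is the asymptote, and fit it with a grid search over r plus linear least squares for A and B at each candidate r. The data below is synthetic (chosen to start at 2000, as in the plot), not the actual test results:

```python
def fit_saturating(levels, ratings):
    """Fit rating(k) = A - B * r**k (A is the asymptote) by grid-searching
    r and solving the 2x2 least-squares problem for (A, B) at each r."""
    n = len(levels)
    best = None  # (sum of squared residuals, A, B, r)
    for i in range(500, 1000):          # candidate r in [0.500, 0.999]
        r = i / 1000.0
        xs = [r ** k for k in levels]
        sx = sum(xs)
        sxx = sum(x * x for x in xs)
        sy = sum(ratings)
        sxy = sum(x * y for x, y in zip(xs, ratings))
        det = n * sxx - sx * sx
        if abs(det) < 1e-12:
            continue
        # Closed-form least squares for y = A - B*x.
        a = (sxx * sy - sx * sxy) / det
        b = (sx * sy - n * sxy) / det
        ss = sum((y - (a - b * x)) ** 2 for x, y in zip(xs, ratings))
        if best is None or ss < best[0]:
            best = (ss, a, b, r)
    return best[1], best[2], best[3]

# Synthetic example only: ratings start at 2000 and saturate at A = 4600.
true_a, true_b, true_r = 4600.0, 2600.0, 0.9
levels = list(range(11))
ratings = [true_a - true_b * true_r ** k for k in levels]

a, b, r = fit_saturating(levels, ratings)
print(round(a), round(r, 3))  # 4600 0.9
```

With real, noisy match data the estimated asymptote would come with large error bars, since the upper levels have few games, but the same fitting procedure applies.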
I think the highest achievable rating (i.e., "God's rating") must be very high. The Elo gain per doubling seems to decline only gradually, and that is with incorrect contempt factors set.
I'll bet the top programs are still 1000 Elo or more below perfect play at human-like time controls.