obvious/easy move

-
Don
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm

Re: obvious/easy move - final results

Adam Hair wrote: If you need x games to get +/- y Elo in self-testing, then you need x*√2 games to have the same error (+/- y Elo) when comparing gauntlet results.

hgm wrote: Actually x*sqr(2) = x*4 rather than x*sqrt(2).

Huh? I don't follow what you are saying. x*sqr(2) is not the same as x*4, so what are you saying? Are you saying that you need 4 times as many games in total?
-
Adam Hair
- Posts: 3226
- Joined: Wed May 06, 2009 10:31 pm
- Location: Fuquay-Varina, North Carolina
Re: obvious/easy move - final results
I screwed up. Sigma is proportional to the inverse square root of the number of games. What HGM is pointing out is that you have to play twice the number of games per version to make the error equal to that of self-testing.
If the error is y Elo after x games, then the error when comparing gauntlet results is y*√2. To reduce that to y, you have to play 2x games against the gauntlet for each version. So, when you are starting from scratch, you have to play 4x games to equal the error bars of x games of self-testing. For each new version after that, you have to play 2x games against the gauntlet.
The error after x games of self-testing is y.
The error when comparing two versions via a gauntlet, after x games each, is √(y² + y²) = y*√2.
To reduce this to y, the individual errors must be y/√2.
Since y is proportional to √(1/x), y/√2 is proportional to √(1/(2x)).
So each version has to play 2x games against the gauntlet to make the comparison error y Elo.
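For concreteness, here is a small C sketch of the arithmetic above. The game and Elo numbers are hypothetical placeholders; only the relations themselves (independent errors add in quadrature, error ∝ 1/√games) come from the post.

Code:
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 20000.0; /* self-test games played (hypothetical)           */
    double y = 5.0;     /* +/- Elo error those x games gave (hypothetical) */

    /* Comparing two gauntlet results, each carrying error y: the two
       independent errors add in quadrature.                               */
    double cmp_err = sqrt(y * y + y * y);   /* = y * sqrt(2)               */

    /* Since error ~ 1/sqrt(games), shrinking each error by sqrt(2) costs
       a factor of 2 in games per version.                                 */
    double per_version  = 2.0 * x;
    double from_scratch = 4.0 * x;          /* two versions from scratch   */

    printf("comparison error at x games each:   +/- %.2f Elo\n", cmp_err);
    printf("games per version for +/- %.1f Elo: %.0f\n", y, per_version);
    printf("total games starting from scratch:  %.0f\n", from_scratch);
    return 0;
}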
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: obvious/easy move - final results
bob wrote: I don't follow your "more efficient use of resources."
In self testing you play your latest (hopefully best) against the previous best. The CPU time used running the previous best is wasted.
In testing against others, the CPU time spent running those programs is wasted.
To get the same error bar, you need the same number of games. Hence you waste the same amount of time whether it is by running your own program or by running the others.
So exactly how do you see self-play as "more efficient"? I see absolutely no difference in efficiency, with the added risk that self-play distorts the improvement.

Don wrote: There was a discussion recently concerning this on this forum. Basically, in head-to-head matches the error margins can be taken at face value. When more than 2 programs are involved, such as in the fashion you are describing, the error margins don't mean what you think they do, because there are 2 sources of error.
I think the reason is that you can view one program's rating as having NO error and just treat it as the reference version. For all you know it was rated with a million games and the rating has no error. Then you are concerned only with the amount of error in the program of interest.
But let's say you are testing 2 versions against a single foreign program. You can consider that foreign program a fixed reference, but EACH of the 2 versions you are interested in has an error margin. You are basically extrapolating the results indirectly. Hence my stick analogy: if 2 sticks are very close to the same length, the easiest and most accurate way to determine which is longer is to hold them side by side, not to measure each against a third stick with marks on it (a yardstick).

However, you are measuring two different things. When you test against yourself, you are asking "how does this change perform against a program whose only difference is that it does not have this change?" When you test against others, you ask a different question: "How does this change influence my play against a variety of opponents?"
Those two questions are related, but they are absolutely NOT the same. One can prove this simply by playing against two different groups of opponents and noticing that the ratings are not identical between the two tests. Or, looking more carefully, that the numbers of wins, draws and losses change.
I am an absolutely firm believer in "test like you plan to run", whether it be in drag racing or chess. I will try to dig up my old testing data where I addressed this specific question. It was quite a few years ago, so it might take some digging, and it might not even be around...
-
Don
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: obvious/easy move - final results
bob wrote: However, you are measuring two different things. When you test against yourself, you are asking "how does this change perform against a program whose only difference is that it does not have this change?" When you test against others, you ask a different question: "How does this change influence my play against a variety of opponents?" Those two questions are related, but they are absolutely NOT the same.

However, I do not believe there is much difference in testing against one program vs another, even self. It had better be a really big deal to be worth the extra testing time, and I know for a fact that it isn't a big deal.

bob wrote: I am an absolutely firm believer in "test like you plan to run". Whether it be in drag racing, or chess.

You believe in testing as you plan to run? I have no idea which program Komodo will play against, so that is already a problem. Right now it is playing Nemo, a program I never considered testing against.
I used to run marathons and 10k races, but I never trained on the course I raced on. I felt that my training was similar enough. I think there is little difference in playing Critter or Komodo for these kinds of tests.
-
JBNielsen
- Posts: 267
- Joined: Thu Jul 07, 2011 10:31 pm
- Location: Denmark
Re: obvious/easy move - final results
Don wrote: The basic concept is that EVERY move is an easy move (if the program doesn't change its mind) and it's just a matter of degree. It can cause a search to return in between 5% and 100% of the time it would normally take, depending on the degree of easiness. The nature is such that most moves will abort at least a little earlier than normal when the program does not change its mind during the iteration. This is so much more sensible than a single, narrowly defined class of "easy move" which rarely kicks in.

JBNielsen wrote: Don, I don't know if you have read my earlier post. With my formula I wanted to spend less time on easy moves and more time on normal/hard moves - also if there are several good moves, and also if it is a minor gap. There is always a risk of stopping a search early, but I believe the time is better spent in positions where many moves are candidates for the best move. Humans have the same strategy, and I think it is wise. I have not made this in Dabbaba yet, but your results make me want to implement something like it. Here is my formula again:

Code:
Don't start a new iteration if more than x% of the time for this move is used.
x = 35 - 8 * z + 2 * y
y = the number of moves (max 3) that are z centipawns better than the rest.
z is the biggest gap among the 4 best moves. z is max 3.00.
x is always 45% if the biggest gap is less than 0.20 or we have a mate score.
x = 13% if 1 move recaptures a queen (z=3.0)
x = 15% if 2 moves recapture a bishop (z=3.0)
x = 31% if 2 moves recapture a pawn (z=1.0)
x = 33% if 1 move is z=0.50 better than the rest
x = 39% if 3 moves are z=0.25 better than the rest.

Notice I try to take advantage of small gaps too.... This is just a quick shot that must be refined - but the point is that there may be some Elo in a better time disposition. A similar calculation should be done to stop earlier in the middle of an iteration.
PS. Right now I am running a selftest. After 180 games, the version leading with 53% has this rule added: don't start a new iteration if more than 30% of the time for the move is used. Has anyone else tried to add this simple rule? An easy way to gain 20 Elo, if a longer test confirms the 53%. I have not made anything about easy moves yet, but I will soon do that, as well as a better ordering at the root.

Don wrote: I think anything you can do in this regard is low-hanging fruit for program improvement. What do you mean when you say "y = the number of moves (max 3) that are z centipawns better than the rest"? How do you know that a move is better than the rest without raising beta?

If I have 4 white moves with a score around 3.00, and the rest is zero or below, I have found a gap.
You are right that I can only be sure of the score for the best move. The 3 others might be as bad as, or worse than, the rest.
But the important thing is that a majority of moves are much worse than the best move. That should give a lot of cutoffs, and we reach a good depth in a short time. And we can spare the time for harder decisions.
I assume that is your philosophy...
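The quoted formula maps directly onto code. Below is a minimal C sketch of the rule; the function name is hypothetical, and how z and y are extracted from the root-move scores is left out. A search driver would then skip the next iteration once elapsed time exceeds cap/100 of the move's budget.

Code:
#include <stdio.h>

/* Sketch of the quoted rule: returns x, the percentage of this move's
   time budget after which no new iteration is started.
   z: biggest gap (in pawns) among the 4 best root moves, capped at 3.00;
   y: number of moves (at most 3) that are z better than the rest;
   mate: nonzero if we have a mate score. */
static double iteration_start_cap(double z, int y, int mate)
{
    if (z > 3.00) z = 3.00;
    if (y > 3)    y = 3;
    if (z < 0.20 || mate)
        return 45.0;            /* no clear gap: allow a late iteration */
    return 35.0 - 8.0 * z + 2.0 * y;
}

int main(void)
{
    /* The worked cases from the post: */
    printf("%.0f%%\n", iteration_start_cap(3.00, 1, 0)); /* queen recapture -> 13 */
    printf("%.0f%%\n", iteration_start_cap(3.00, 2, 0)); /* bishop, 2 moves -> 15 */
    printf("%.0f%%\n", iteration_start_cap(0.25, 3, 0)); /* 3 moves +0.25   -> 39 */
    return 0;
}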
- - -
I have stopped my test after 361 games.
The version with the "Don't start a new iteration if more than 30% of the time for the move is used" rule added won with only 51.5%.
I do not know what the LOS is.
But I will keep the rule until it is proven bad.
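For reference, LOS (likelihood of superiority) in a head-to-head match is usually computed from wins and losses alone; draws cancel out. A minimal C sketch using the standard normal approximation; the win/loss split below is hypothetical, since only the 51.5% score over 361 games is given above.

Code:
#include <math.h>
#include <stdio.h>

/* LOS = Phi((wins - losses) / sqrt(wins + losses)): the probability that
   the version with more wins is genuinely the stronger one. */
static double los(int wins, int losses)
{
    if (wins + losses == 0)
        return 0.5;
    return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))));
}

int main(void)
{
    /* Hypothetical split consistent with 51.5% over 361 games:
       120 wins, 109 losses, 132 draws -> (120 + 132/2) / 361 = 51.5% */
    printf("LOS = %.3f\n", los(120, 109));
    return 0;
}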
-
Don
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: obvious/easy move - final results
JBNielsen wrote: If I have 4 white moves with a score around 3.00, and the rest is zero or below, I have found a gap. [...] But the important thing is that a majority of moves are much worse than the best move. That should give a lot of cutoffs, and we reach a good depth in a short time. And we can spare the time for harder decisions. I assume that is your philosophy...

No.
-
JBNielsen
- Posts: 267
- Joined: Thu Jul 07, 2011 10:31 pm
- Location: Denmark
Re: obvious/easy move - final results
Don wrote: No.

Thanks for your quick and clear answer. I am quite sure it didn't reveal any of the secrets of Komodo.
I will try my own ideas about saving time / easy moves.
PS.
Many experienced people participate in this thread. I would have liked some input in the thread below about why a few checks by the opponent can make a position very hard for a computer, when humans easily see that the opponent runs out of checks and the combination is still winning:
http://www.talkchess.com/forum/viewtopic.php?t=46170
-
Don
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: obvious/easy move - final results
JBNielsen wrote: Thanks for your quick and clear answer. I am quite sure it didn't reveal any of the secrets of Komodo.

I hope I did not reveal too much!
-
lucasart
- Posts: 3243
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: obvious/easy move - final results
Don wrote: I'm going to go into a bit of a rant here so please forgive me. For almost 30 years of computer chess I have been getting the warning from people about avoiding self-testing, and although I consider it a well-meaning warning, nobody has once offered any evidence other than their own superstition. I'm very close to putting it in the category of "myth" or "conventional wisdom", which by definition is not "exceptional" - it is usually untested and believed based on blind credulity or gut instinct, which is notoriously unreliable.

In general I completely agree. This business of "self-testing = incest" is a load of BS. The fact is that self-testing has practical advantages:
1/ With equal testing time, you can play twice as many games.
2/ Often you need to measure something small, and to do so in a statistically significant way you have to increase the sensitivity of the measurement. Self-testing typically does that (you have often noticed that if your engine has gained 50 Elo in self-play, it may translate to only 25 Elo or so against varied opponents).
As for the assertion that self-testing results are meaningless, it is pure dogma, without any proven basis in reality. Typically, those who say that are old ranters who have been in computer chess too long and haven't picked up modern computer-chess techniques, or newcomers who just repeat what they've been told without understanding it.
*however*, I sometimes do testing against varied opponents:
- to measure an Elo improvement (version n+1 against version n of my engine). I know that self-testing can be trusted, in the sense that if DiscoCheck n+1 beats DiscoCheck n by 50 Elo in self-play, then it will beat it regardless of the testing methodology. But self-testing often magnifies the Elo difference, so perhaps it will be only 30 Elo against a pool of varied opponents.
- anything to do with time management should not be tested in self-play, in my experience (as opposed to opinion). For example, when I tested my time-management code, there was no measurable improvement in self-play, but +25 Elo against Fruit 05/11/03 (the patch was to use more time to resolve aspiration failures). The point is that in self-play you have a huge percentage of ponder hits. Since then I always test time-management changes against different opponents. So in the context of this thread (time management), self-testing is perhaps not a good idea indeed.
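The aspiration-failure patch described above is easy to picture. A hedged C sketch of that kind of time-management tweak; the names and the 1.8 factor are hypothetical, not DiscoCheck's actual code:

Code:
/* When the root search fails low on the aspiration window, grant extra
   time so the fail-low can be resolved before the move is played. */
static double budget_ms_for_next_iteration(double base_ms, double max_ms,
                                           int root_failed_low)
{
    double t = root_failed_low ? 1.8 * base_ms : base_ms; /* hypothetical factor */
    return t < max_ms ? t : max_ms;     /* never exceed the clock's hard cap */
}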
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
-
jd1
- Posts: 269
- Joined: Wed Oct 24, 2012 2:07 am
Re: obvious/easy move - final results
lucasart wrote: In general I completely agree. This business of "self-testing = incest" is a load of BS. The fact is that self-testing has practical advantages [...]

Agree completely about self testing, although my testing hasn't got the greatest track record.
Jerry