Interesting null-move test

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interesting null-move test

Post by bob »

jwes wrote:
bob wrote:
hgm wrote:Let me get this straight:

You use plain null move, no verification? So if you end up in KRK, it is now a draw. And if you end up in things like KRPKR it becomes almost impossible to win, as the opponent would simply sac his Rook for the remaining Pawn, and you would not know how to checkmate him with a Rook up.
why would that be? It finds KR vs K checkmates trivially. I just tried it in a simple position with kings and rook in center of the board. The evaluation knows to drive the losing king to the edge of the board, where it finds mate quick enough...
Since the standard method of checkmating with KR vs K involves repeated zugzwangs, it is reasonable (but apparently wrong) to assume that a search with null moves would not find these mates.
I believe it is an issue of the eval taking you in the right direction and then any move that checks turns off null in the next ply. So eventually you get to where you see the mate. However, KRK is a very rare ending overall...


bob wrote:

Most won Pawn endings would also be bungled; KPK is no longer a win if the passer is not out of reach.
You are focusing on positions; I am focusing on entire games. All I can say at the moment is that null-everywhere is no worse than restricting it to only being used when the side on move has a piece or more. I can run the match to 1/4 million games to get a more accurate Elo measurement, which I will start right now. But it certainly is within a couple of Elo either way.

Seems to me that this should cost you quite a bit more than 2 Elo. Rook endings are very common (> 10% ?), and not being able to win any of those should at least cost you a few percent in score, and each % corresponds to 7 Elo. So a 20-50 Elo drop would have to be expected.

The test must be flawed somehow...
The rook issue is not an issue apparently. Anyone can take current Crafty source, go to search.c, remove that single restriction on when null-move search is done (here is the code before/after):

/* before */
if (do_null && alpha == beta - 1 && depth > 1 && !tree->inchk[ply] &&
    TotalPieces(wtm, occupied)) {

/* after */
if (do_null && alpha == beta - 1 && depth > 1 && !tree->inchk[ply]) {

Pretty obvious what was removed. Nothing else was changed. Same positions, same hardware, etc. I ran the above test twice and got within 2 Elo the second time around (the second time was 1 Elo better than the first run, not significant)... Can't say much more other than this appears to be another episode of "myth-busters"...
It certainly could be that gains in the middlegame balance losses in the endgame. It would be interesting to run a test starting with endgame positions to see if the results are different.
They probably would be different. But then again, so would tests that start with a few pieces left. Or tests that are based on positions with a forced mate. Failing in one type of position doesn't mean failing in all. That's why testing is so important: it removes bias and prejudice and lets you see the truth without any subjective lights shining.
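
For reference, the null-move step being debated looks roughly like this; a minimal sketch modeled on the Crafty condition above, where Search, MakeNullMove, UnmakeNullMove and the reduction R are hypothetical stand-ins, not Crafty's actual code:

Code: Select all

/* Sketch of the null-move step under discussion; helper names are
   stand-ins, not Crafty's actual code. */
#define R 3                       /* typical null-move depth reduction */
int value;
if (do_null && alpha == beta - 1 && depth > 1 && !tree->inchk[ply] &&
    TotalPieces(wtm, occupied)) { /* the disputed zugzwang guard */
  MakeNullMove(tree, ply);        /* pass: give the opponent a free move */
  /* reduced-depth, null-window search; the final 0 forbids two
     null moves in a row */
  value = -Search(tree, -beta, -beta + 1, depth - 1 - R, ply + 1, 0);
  UnmakeNullMove(tree, ply);
  if (value >= beta)              /* still >= beta after passing: */
    return value;                 /* prune this subtree */
}
/* ... otherwise continue with the normal move loop ... */

In zugzwang positions passing is illegal in real chess but legal in this search, which is why the guard (or some replacement for it) is debated below.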
User avatar
hgm
Posts: 28405
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Interesting null-move test

Post by hgm »

bob wrote:why would that be? It finds KR vs K checkmates trivially. I just tried it in a simple position with kings and rook in center of the board. The evaluation knows to drive the losing king to the edge of the board, where it finds mate quick enough....
When I was building my new engine this month, I started testing it in 1+0 games before I had implemented the material table that would contain information on when to use null move, and what the game phase was. Under those conditions it was not able to win KRK; all cases where it occurred ended in a 50-move draw. I thought the null move was responsible. But for lack of game-phase info, it was also using the middle-game PST for the King, which has no built-in tendency to centralize Kings. So I guess that must have made the difference, then.

If KRK can be won anyway, then my estimate about the Elo cost should indeed be way off.
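
The usual cure for the game-phase PST problem described above is a tapered evaluation that blends a middlegame and an endgame king table by phase. A minimal sketch; the table names and the 0-24 phase scale are illustrative assumptions, not any engine's actual code:

Code: Select all

/* Tapered king PST: interpolate between a middlegame table (king
   hides in the corner) and an endgame table (king centralizes),
   weighted by game phase. All names are illustrative. */
extern const int mg_king_pst[64];  /* rewards castled corner squares */
extern const int eg_king_pst[64];  /* rewards central squares */

int king_pst_score(int sq, int phase) { /* 24 = opening, 0 = bare endgame */
  return (mg_king_pst[sq] * phase + eg_king_pst[sq] * (24 - phase)) / 24;
}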
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interesting null-move test

Post by bob »

hgm wrote:
bob wrote:why would that be? It finds KR vs K checkmates trivially. I just tried it in a simple position with kings and rook in center of the board. The evaluation knows to drive the losing king to the edge of the board, where it finds mate quick enough....
When I was building my new engine this month, I started testing it in 1+0 games before I had implemented the material table that would contain information on when to use null move, and what the game phase was. Under those conditions it was not able to win KRK; all cases where it occurred ended in a 50-move draw. I thought the null move was responsible. But for lack of game-phase info, it was also using the middle-game PST for the King, which has no built-in tendency to centralize Kings. So I guess that must have made the difference, then.

If KRK can be won anyway, then my estimate about the Elo cost should indeed be way off.
KPK is a problem, however; I verified that, even though Crafty has specific eval knowledge to win it when it is winnable. I've started the 250,000 game test to get a really accurate Elo estimate for this approach. Even -1 Elo is enough to prevent me from using it, but it is an interesting thing to look at that goes against past discussions.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Interesting null-move test

Post by Desperado »

bob wrote:
jwes wrote:
bob wrote:
hgm wrote:Let me get this straight:

You use plain null move, no verification? So if you end up in KRK, it is now a draw. And if you end up in things like KRPKR it becomes almost impossible to win, as the opponent would simply sac his Rook for the remaining Pawn, and you would not know how to checkmate him with a Rook up.
why would that be? It finds KR vs K checkmates trivially. I just tried it in a simple position with kings and rook in center of the board. The evaluation knows to drive the losing king to the edge of the board, where it finds mate quick enough...
Since the standard method of checkmating with KR vs K involves repeated zugzwangs, it is reasonable (but apparently wrong) to assume that a search with null moves would not find these mates.
I believe it is an issue of the eval taking you in the right direction and then any move that checks turns off null in the next ply. So eventually you get to where you see the mate. However, KRK is a very rare ending overall...


bob wrote:

Most won Pawn endings would also be bungled; KPK is no longer a win if the passer is not out of reach.
You are focusing on positions; I am focusing on entire games. All I can say at the moment is that null-everywhere is no worse than restricting it to only being used when the side on move has a piece or more. I can run the match to 1/4 million games to get a more accurate Elo measurement, which I will start right now. But it certainly is within a couple of Elo either way.

Seems to me that this should cost you quite a bit more than 2 Elo. Rook endings are very common (> 10% ?), and not being able to win any of those should at least cost you a few percent in score, and each % corresponds to 7 Elo. So a 20-50 Elo drop would have to be expected.

The test must be flawed somehow...
The rook issue is not an issue apparently. Anyone can take current Crafty source, go to search.c, remove that single restriction on when null-move search is done (here is the code before/after):

/* before */
if (do_null && alpha == beta - 1 && depth > 1 && !tree->inchk[ply] &&
    TotalPieces(wtm, occupied)) {

/* after */
if (do_null && alpha == beta - 1 && depth > 1 && !tree->inchk[ply]) {

Pretty obvious what was removed. Nothing else was changed. Same positions, same hardware, etc. I ran the above test twice and got within 2 Elo the second time around (the second time was 1 Elo better than the first run, not significant)... Can't say much more other than this appears to be another episode of "myth-busters"...
It certainly could be that gains in the middlegame balance losses in the endgame. It would be interesting to run a test starting with endgame positions to see if the results are different.
They probably would be different. But then again, so would tests that start with a few pieces left. Or tests that are based on positions with a forced mate. Failing in one type of position doesn't mean failing in all. That's why testing is so important: it removes bias and prejudice and lets you see the truth without any subjective lights shining.
maybe this can be interesting...

Code: Select all


/* allow null everywhere once remaining depth is large enough, even
   with no pieces left for the side to move */
if (do_null && alpha == beta - 1 && !tree->inchk[ply] &&
       ((depth > 1 && TotalPieces(wtm, occupied))
     || depth > alwaysNullDepth)) {

This may also speed up the search, while not being that sensitive to zugzwang?!
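
A related middle ground, raised earlier in the thread as "verification", is verified null-move pruning in the style of Tabibi and Netanyahu: trust fail-highs from the null search in normal positions, but re-check them with a reduced real search when material is low. A simplified sketch with hypothetical helper names, not anyone's actual code:

Code: Select all

/* Simplified verified-null-move sketch (after Tabibi & Netanyahu,
   2002); all identifiers here are hypothetical. */
int value;
if (do_null && alpha == beta - 1 && depth > 1 && !tree->inchk[ply]) {
  MakeNullMove(tree, ply);
  value = -Search(tree, -beta, -beta + 1, depth - 1 - R, ply + 1, 0);
  UnmakeNullMove(tree, ply);
  if (value >= beta) {
    if (TotalPieces(wtm, occupied) > VERIFY_THRESHOLD)
      return value;             /* enough material: trust the cutoff */
    /* low material: confirm with a reduced-depth real search before
       pruning, so zugzwangs are caught */
    value = Search(tree, beta - 1, beta, depth - 1 - R, ply, 0);
    if (value >= beta)
      return value;
  }
}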
jdart
Posts: 4410
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Interesting null-move test

Post by jdart »

> If you play 30,000 games and it plays _better_, why on earth would you toss an idea out just because there is an occasional game it might lose due to the change?

I can see this point of view, but the bad thing about null-move zugzwang positions is that if your program has such a problem it can be "blind" to the right move, even with a very deep search, and can go wrong in endings where a human player can see the right way. So it's ugly. If you intend the program to be used by humans for analysis or play, this will likely annoy them if it happens. I tend to think that makes it a priority to fix.

--Jon
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Interesting null-move test

Post by Don »

Bob,

At one point we were not doing null move when the stage of the game was less than 2, which means there could be knights and bishops on the board but nothing else. It was one of the few times I assumed something would be an improvement without testing, because I was looking at some positions with only knights, bishops, and pawns on the board and the program "solved" them better without null move - in practically every case.

This ended up costing us because later on we noticed a regression and I had to carefully back out changes to the various versions to determine which change was the culprit. It turns out that this was the mistake, costing us about 5-10 Elo, much more than I would have ever expected.

Right now we do not do null move when stage is 0, which means knights can be on the board (knights do not contribute to stage in Komodo.) We will at some point test a version that does it the way you do, where we do not do null move if the side to move has only pawns or less. A sketch of the difference between the two gates follows below.
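
To make the two gates concrete, here is an illustrative sketch; the stage weights are invented for illustration and are not Komodo's actual values:

Code: Select all

/* Illustrative comparison of the two zugzwang gates; weights are
   invented, not Komodo's actual values. Knights contribute nothing
   to stage, so stage == 0 can still have knights on the board. */
int game_stage(int queens, int rooks, int bishops) {
  return 4 * queens + 2 * rooks + bishops;   /* knights excluded */
}
/* Komodo-style gate (as described): allow null move while stage > 0.
   Crafty-style gate: allow null move while the side to move has any
   piece at all, knights included. */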





bob wrote:I recently tweaked around a bit with null-move in Crafty. Been on my to-do list for a while. Version 23.2R01 is the best so far and does null-move everywhere except when the side on move has no pieces at all. Even with a single knight, it is currently doing null-move.

Just for fun, I decided to try null-move _everywhere_, pieces or not. So far, this is the result (23.2R05 is the null-everywhere version, R01 is null whenever the side to move has a piece, and the others are a couple of other tests that showed no real change).

                   Elo    +    -  games score  oppo draws
Crafty-23.2R05    2646   17   17   1251   63%  2543   20%
Crafty-23.2R04-3  2633    4    4  30000   61%  2550   22%
Crafty-23.2R04-5  2633    4    4  30000   61%  2550   23%
Crafty-23.2R01-1  2632    4    4  30000   61%  2550   23%

Hard to imagine this actually being better, but after 1251 games it looks to be, although the error bar is large...

current results:

                   Elo    +    -  games score  oppo draws
Crafty-23.2R05    2637   12   12   2510   61%  2550   22%
Crafty-23.2R04-3  2634    3    3  30000   61%  2551   22%
Crafty-23.2R04-5  2633    3    3  30000   61%  2551   23%
Crafty-23.2R01-1  2633    3    3  30000   61%  2551   23%
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Interesting null-move test

Post by jwes »

bob wrote:
jwes wrote:
bob wrote:
hgm wrote:Let me get this straight:

You use plain null move, no verification? So if you end up in KRK, it is now a draw. And if you end up in things like KRPKR it becomes almost impossible to win, as the opponent would simply sac his Rook for the remaining Pawn, and you would not know how to checkmate him with a Rook up.
why would that be? It finds KR vs K checkmates trivially. I just tried it in a simple position with kings and rook in center of the board. The evaluation knows to drive the losing king to the edge of the board, where it finds mate quick enough...
Since the standard method of checkmating with KR vs K involves repeated zugzwangs, it is reasonable (but apparently wrong) to assume that a search with null moves would not find these mates.
I believe it is an issue of the eval taking you in the right direction and then any move that checks turns off null in the next ply. So eventually you get to where you see the mate. However, KRK is a very rare ending overall...


bob wrote:

Most won Pawn endings would also be bungled; KPK is no longer a win if the passer is not out of reach.
You are focusing on positions; I am focusing on entire games. All I can say at the moment is that null-everywhere is no worse than restricting it to only being used when the side on move has a piece or more. I can run the match to 1/4 million games to get a more accurate Elo measurement, which I will start right now. But it certainly is within a couple of Elo either way.

Seems to me that this should cost you quite a bit more than 2 Elo. Rook endings are very common (> 10% ?), and not being able to win any of those should at least cost you a few percent in score, and each % corresponds to 7 Elo. So a 20-50 Elo drop would have to be expected.

The test must be flawed somehow...
The rook issue is not an issue apparently. Anyone can take current Crafty source, go to search.c, remove that single restriction on when null-move search is done (here is the code before/after):

/* before */
if (do_null && alpha == beta - 1 && depth > 1 && !tree->inchk[ply] &&
    TotalPieces(wtm, occupied)) {

/* after */
if (do_null && alpha == beta - 1 && depth > 1 && !tree->inchk[ply]) {

Pretty obvious what was removed. Nothing else was changed. Same positions, same hardware, etc. I ran the above test twice and got within 2 Elo the second time around (the second time was 1 Elo better than the first run, not significant)... Can't say much more other than this appears to be another episode of "myth-busters"...
It certainly could be that gains in the middlegame balance losses in the endgame. It would be interesting to run a test starting with endgame positions to see if the results are different.
They probably would be different. But then again, so would tests that start with a few pieces left. Or tests that are based on positions with a forced mate. Failing in one type of position doesn't mean failing in all. That's why testing is so important: it removes bias and prejudice and lets you see the truth without any subjective lights shining.
The problem with your testing method is that if a change makes a significant difference in a small minority of positions while performing identically in most positions, your testing will not recognize it unless you play a very large number of games.
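
This point can be quantified. Near a 50% score, 1% of score is roughly 7 Elo (as noted earlier in the thread), and the standard error of a match result shrinks only as 1/sqrt(games), so resolving a 1-2 Elo effect takes on the order of 10^5 games. A back-of-envelope sketch; the 30% draw ratio is an assumption, not a figure from the thread:

Code: Select all

#include <stdio.h>

/* Back-of-envelope: games needed to resolve a small Elo difference
   at roughly 2 standard deviations. Near 50%, 1% of score ~ 7 Elo.
   The 30% draw ratio is an assumption. */
int main(void) {
  double draws = 0.30;                    /* assumed draw ratio */
  double var = 0.25 * (1.0 - draws);      /* per-game score variance */
  for (double elo = 1.0; elo <= 8.0; elo *= 2.0) {
    double ds = elo / 700.0;              /* Elo -> score fraction */
    double games = 4.0 * var / (ds * ds); /* 2-sigma requirement */
    printf("%4.0f Elo -> ~%.0f games\n", elo, games);
  }
  return 0;
}

This gives roughly 86,000 games for 2 Elo and 340,000 for 1 Elo, which lines up with the 100,000 and 250,000 game runs mentioned elsewhere in the thread.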
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interesting null-move test

Post by bob »

jdart wrote:> If you play 30,000 games and it plays _better_, why on earth would you toss an idea out just because there is an occasional game it might lose due to the change?

I can see this point of view, but the bad thing about null-move zugzwang positions is that if your program has such a problem it can be "blind" to the right move, even with a very deep search, and can go wrong in endings where a human player can see the right way. So it's ugly. If you intend the program to be used by humans for analysis or play, this will likely annoy them if it happens. I tend to think that makes it a priority to fix.

--Jon
What about all the _other_ positions where it provides _better_ analysis by going deeper? Everything we do is a little give and take.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Interesting null-move test

Post by Don »

bob wrote:
jdart wrote:> If you play 30,000 games and it plays _better_, why on earth would you toss an idea out just because there is an occasional game it might lose due to the change?

I can see this point of view, but the bad thing about null-move zugzwang positions is that if your program has such a problem it can be "blind" to the right move, even with a very deep search, and can go wrong in endings where a human player can see the right way. So it's ugly. If you intend the program to be used by humans for analysis or play, this will likely annoy them if it happens. I tend to think that makes it a priority to fix.

--Jon
What about all the _other_ positions where it provides _better_ analysis by going deeper? Everything we do is a little give and take.
Personally I usually hedge my bets under the theory that a chain is only as strong as its weakest link. I hate to have some ridiculous weakness in the program even if it only shows up 1 in a million games or has less than 1 Elo impact. So if some change makes the program more correct but actually weakens it by some amount that is difficult to measure, I labor over the decision more.

Some zugzwang issues can cause your program to NEVER see something it needs to see and thus affect the scalability of the program (it will never play perfect chess even with infinite hardware.) Of course that is not a practical issue, but it illustrates the point. It may make a difference with 10X more hardware for example.

I also have this feeling which I cannot prove that attention to some details may have a large effect on the ultimate scalability. I have a king and pawn vs king database compiled into the program and I have been unable to prove it helps, but it just seems like it should. My theory is that it probably helps more with depth because as you gradually approach perfect play these little things will come into play more often. (A lot fewer games decided by decisive tactics early in the game.) On the other hand with deeper searches the king and pawn endings will get played better anyway - so I'm not sure how all of this interacts.

All of this is pretty much theoretical however and programs over the years will automatically scale as needed as the best program authors will always tend to implement things that actually work on the hardware of the day.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interesting null-move test

Post by bob »

Don wrote:
bob wrote:
jdart wrote:> If you play 30,000 games and it plays _better_, why on earth would you toss an idea out just because there is an occasional game it might lose due to the change?

I can see this point of view, but the bad thing about null-move zugzwang positions is that if your program has such a problem it can be "blind" to the right move, even with a very deep search, and can go wrong in endings where a human player can see the right way. So it's ugly. If you intend the program to be used by humans for analysis or play, this will likely annoy them if it happens. I tend to think that makes it a priority to fix.

--Jon
What about all the _other_ positions where it provides _better_ analysis by going deeper? Everything we do is a little give and take.
Personally I usually hedge my bets under the theory that a chain is only as strong as its weakest link. I hate to have some ridiculous weakness in the program even if it only shows up 1 in a million games or has less than 1 Elo impact. So if some change makes the program more correct but actually weakens it by some amount that is difficult to measure, I labor over the decision more.
None of that makes any sense. I'll bet that for _every_ feature in your chess program, I can come up with a position where it works poorly or not at all. Such is the nature of the game. Yet I doubt you would toss everything out. This idea of not doing something because it hurts a certain type of position is nothing but witchcraft/voodoo/etc. If something causes a program to win more games when it is enabled than when it is disabled, and the sample size of games is sufficiently large to eliminate random effects, then that idea ought to be enabled. Nothing else makes any sense.

I am not talking about changes that look right and appear to have no effect on overall performance, I am talking about changes that improve Elo over tens of thousands of test games. I don't care how changes affect individual test positions. That idea is so badly flawed it is not worth discussing, except for programs whose sole purpose in life is to address a particular concept such as "chest" for checkmates.

Note that I am not proposing that this null-everywhere is a good one. Only that it is a very tiny bit worse, 1-2 Elo at worst (after about 100,000 games and still counting). And that was quite surprising (that it was so close to break-even).



Some zugzwang issues can cause your program to NEVER see something it needs to see and thus affect the scalability of the program (it will never play perfect chess even with infinite hardware.) Of course that is not a practical issue, but it illustrates the point. It may make a difference with 10X more hardware for example.
That's a testable hypothesis. If it makes a difference at 10x, it will make a smaller (but still measurable) difference at 5x, which can be easily tested.


I also have this feeling which I cannot prove that attention to some details may have a large effect on the ultimate scalability. I have a king and pawn vs king database compiled into the program and I have been unable to prove it helps, but it just seems like it should. My theory is that it probably helps more with depth because as you gradually approach perfect play these little things will come into play more often. (A lot fewer games decided by decisive tactics early in the game.) On the other hand with deeper searches the king and pawn endings will get played better anyway - so I'm not sure how all of this interacts.
I don't believe it helps at all. 3-4-5 piece endgame tables are a very small improvement in some positions, and hurt in others. I am one day going to precisely quantify the effect of tables on/off using the cluster. So far, testing has suggested no benefit at all unless you have monstrously fast disk drives. The slow-down hurts about as much as the increased accuracy helps.


All of this is pretty much theoretical however and programs over the years will automatically scale as needed as the best program authors will always tend to implement things that actually work on the hardware of the day.