undo move vs. Position Cloning

Don · Post by **Don** » Thu Sep 17, 2009 1:43 am

bob wrote:
BubbaTough wrote:I haven't looked at the code, but it sounded like what Marco was doing was just saving what a lot of people put in an undo structure in a move stack which he indexes. The amount of copying he is doing is then just the same as if he was saving his undo info, no more. And his undo does not have to copy back from the undo structure. Thus, compared to a stack of Undo's, his system is clearly better. I am not sure what crafty does, but at least compared to one common method this is a good idea.

-Sam
Copy/Make as I have implemented it in the past never copied "back" either. I simply used a structure called "pos" that has the bitboards, hash signature, pawn hash signature, etc in it. I then define pos position[maxply].

To do a make, I simply do position[ply+1]=position[ply] and then update position[ply+1] and use that at the next level in the search. When I return to this level (ply) the current position is the same as it was before we did the copy since I didn't change it, so no unmake or copy back is needed. The idea of copying the current state to a backup, and then copying it back to unmake is not a solution I would have ever thought of, much less implemented. It is so obviously wrong as to not even be considered.

I think there must be a misunderstanding somewhere. I don't think anyone would consider copying back state to UNMAKE a move, as the original position is never going to go away until it is not needed.

Evert · Post by **Evert** » Thu Sep 17, 2009 2:11 am

bob wrote:Compilers are not doing this, as it is quite risky to pre-fetch and then not use the data because of an early exit or whatever...

Right, you'd have to insert the instructions by hand (which is why I said "well placed"

), and only in places where you know you're going to need the data a couple of instructions later.

Also, we are not just talking about cache misses due to the data not being present, there is a big cost to fetch the data, and fetching the data pushes something else out of cache, resulting in additional misses later. All of which add up to a lot of bandwidth.

Yes. Trying to pre-fetch everything you might conceivably use is obviously a bad idea.

Well, my main reason for using copy/make is that it makes my code simpler, and I would be surprised if it turned out to be the bottle-neck in my program (which isn't very advanced, unsurprisingly since it's only a couple of months old), but I'm curious to know what sort of ELO difference it might make. If it's tens rather than hundreds there's something else I need to improve first.

Don · Post by **Don** » Thu Sep 17, 2009 2:16 am

bob wrote:
Don wrote:
gingell wrote:Here's what I do in Chesley:
Code: Select all
child_position = parent_position;  // Copy the parent
child_position.apply_move (move);  // Apply a move to the child.
search (child_position)            // Search the child
And then child_position is thrown away.

In Chesley a position is 168 bytes. The implementation is C++ and I've verified the compiler generates a memcpy as you'd expect. Experimenting with dummy fields to pad that out to 256 or 512 bytes makes very little difference in number of nodes per second I see.

The fact that virtually every other program I've looked at does move/undo worries me a little, since it's rather more likely I'm missing something everyone else sees than the other way round! But it's still hard for me to see a solution cheaper than doing that copy.
What you are seeing is just an example of heritage - make/unmake is how it's always been done. Your father did it that way, his father did it that way ...
Actually, this is exactly the _opposite_ of what has happened. Many of us did copy/make in the 70's and beyond. But when we moved to the PC with the first pentiums, we discovered that memory bandwidth was limited, and the copy operation was too expensive to tolerate, and it only got worse if you added a second or third or fourth cpu to the mix.

In YOUR world this is the opposite of what happened, but in the real world we live in it is different.

In your world only a very few people had access to hardware that was even big enough to store a few hundred position states. And even though you may have had access to machines like the Cray you couldn't afford one yourself and very few people could. And chess programs were by necessity university projects run with big Iron in the days of the dinosaur when memory bandwidth was not an issue because processors were so so stinking slow anyway.

In the real world of the computer chess hobbyist, life began with the first affordable microcomputers. My first computer had 1/4 K ram and my next one had a whopping 4K of RAM. How many position states can you store in 4K of RAM along with the program itself? I don't even think HG's program would run on this.

But even more to the point, almost any pseudo code reference to computer chess algorithms contains the undo() function in it. This is why most programs have undo() in them. In fact, it's just like you said, YOU are part of that heritage when you changed Crafty because a lot of progrmmers copied YOUR code. So it's really your fault -:)

Seriously, I think you will see it go back the other way - it's already happening and it's probably more natural for muli-processing anyway.

Code: Select all
 *    9.16   Significant changes to the way MakeMove() operates.  Rather than  *
 *           copying the large position[ply] structure around, the piece       *
 *           location bitmaps and the array of which piece is on which square  *
 *           are now simple global variables.  This means that there is now an *
 *           UnmakeMove() function that must be used to restore these after a  *
 *           a move has been made.  The benefit is that we avoid copying 10    *
 *           bitmaps for piece locations when typically only one is changed,   *
 *           Ditto for the which piece is on which square array which is 64    *
 *           bytes long.  The result is much less memory traffic, with the     *
 *           probability that the bitmaps now stay in cache all the time.
That is right out of crafty's main.c and shows about when I stopped the copy/make approach and explains why. Copy/Make is not new. It is quite old, Make/Unmake is newer because it better matches up with the poor PC memory bandwidth limits.

It had to be done that way back then and thus every pseduo-code example of how to do some chess algorithm has undo() in it somewhere.

I also believe that modern processors are optimized for memory copies - it is a ridiculously common operation.
The old Cray T932 could copy almost 400 gigabytes per second. Check out your friendly PC's memory bandwidth. A processor with a single memory unit (as opposed to supercomputer-level interleaved memory banks) is hardly optimized for memory copying operations.

Your position state is just about the same size as mine and I believe that with states as small as ours, it's just not an issue.

It takes about the same time to UNMAKE a move as to make a move. But for us that time is cut in half, because we never have to undo() a move - we get that for free.
You might _think_ it is for free. But the copy is far from free. Add 7 more threads on an 8-core box and it is even less "free".

In theory we pay for that with a bit extra make time - but even that is not so obvious to me. Our make routines have less logic as we do not have the extra bookeeping and logic of stashing away undo information in yet another array.

I don't really know which is faster for programs like ours, but it would not suprise me if what we are doing is the faster way - it's certainly simple and clean compared to the alternative and simple and clean almost always win out (except when it doesn't.)

Don · Post by **Don** » Thu Sep 17, 2009 2:45 am

You might _think_ it is for free. But the copy is far from free. Add 7 more threads on an 8-core box and it is even less "free".

I honestly don't know which is faster Bob, but I'm not making any assumptions either. You seem to be just assuming that copying 192 bytes is so expensive it is worth the extra logic, memory access and missed branch predictions that go with undo() which is clearly NOT free. With copy, undo is FREE.

If 8 processor machines are so limited that they cannot handle 4 or 5 K bytes of position states per processor, then they are probably not very practical machines.

On my quad I run 8 instances of my chess program, but only 4 are thinking at a time and I notice very little slowdown over running on 1 core.

Are 8 processor machines really this fragile? I would never buy an 8 processor machine if it just means that every instance of a program runs at half the speed because the memory is so crippled up. I would think they would design these things better.

bob · Post by **bob** » Thu Sep 17, 2009 2:47 am

Don wrote:
bob wrote:
BubbaTough wrote:I haven't looked at the code, but it sounded like what Marco was doing was just saving what a lot of people put in an undo structure in a move stack which he indexes. The amount of copying he is doing is then just the same as if he was saving his undo info, no more. And his undo does not have to copy back from the undo structure. Thus, compared to a stack of Undo's, his system is clearly better. I am not sure what crafty does, but at least compared to one common method this is a good idea.

-Sam
Copy/Make as I have implemented it in the past never copied "back" either. I simply used a structure called "pos" that has the bitboards, hash signature, pawn hash signature, etc in it. I then define pos position[maxply].

To do a make, I simply do position[ply+1]=position[ply] and then update position[ply+1] and use that at the next level in the search. When I return to this level (ply) the current position is the same as it was before we did the copy since I didn't change it, so no unmake or copy back is needed. The idea of copying the current state to a backup, and then copying it back to unmake is not a solution I would have ever thought of, much less implemented. It is so obviously wrong as to not even be considered.
I think there must be a misunderstanding somewhere. I don't think anyone would consider copying back state to UNMAKE a move, as the original position is never going to go away until it is not needed.

Read the second post in this thread. He specifically says to copy the current state to a "backup/save" location, then update things for the make operation, and then copy the saved state back to unmake. And then he says he doesn't do that. I don't believe anyone ever did it that way myself, it is beyond ugly.

bob · Post by **bob** » Thu Sep 17, 2009 2:51 am

Don wrote:
bob wrote:
Don wrote:
gingell wrote:Here's what I do in Chesley:
Code: Select all
child_position = parent_position;  // Copy the parent
child_position.apply_move (move);  // Apply a move to the child.
search (child_position)            // Search the child
And then child_position is thrown away.

In Chesley a position is 168 bytes. The implementation is C++ and I've verified the compiler generates a memcpy as you'd expect. Experimenting with dummy fields to pad that out to 256 or 512 bytes makes very little difference in number of nodes per second I see.

The fact that virtually every other program I've looked at does move/undo worries me a little, since it's rather more likely I'm missing something everyone else sees than the other way round! But it's still hard for me to see a solution cheaper than doing that copy.
What you are seeing is just an example of heritage - make/unmake is how it's always been done. Your father did it that way, his father did it that way ...
Actually, this is exactly the _opposite_ of what has happened. Many of us did copy/make in the 70's and beyond. But when we moved to the PC with the first pentiums, we discovered that memory bandwidth was limited, and the copy operation was too expensive to tolerate, and it only got worse if you added a second or third or fourth cpu to the mix.
In YOUR world this is the opposite of what happened, but in the real world we live in it is different.

In your world only a very few people had access to hardware that was even big enough to store a few hundred position states. And even though you may have had access to machines like the Cray you couldn't afford one yourself and very few people could. And chess programs were by necessity university projects run with big Iron in the days of the dinosaur when memory bandwidth was not an issue because processors were so so stinking slow anyway.

In the real world of the computer chess hobbyist, life began with the first affordable microcomputers. My first computer had 1/4 K ram and my next one had a whopping 4K of RAM. How many position states can you store in 4K of RAM along with the program itself? I don't even think HG's program would run on this.

But even more to the point, almost any pseudo code reference to computer chess algorithms contains the undo() function in it. This is why most programs have undo() in them. In fact, it's just like you said, YOU are part of that heritage when you changed Crafty because a lot of progrmmers copied YOUR code. So it's really your fault -:)

Seriously, I think you will see it go back the other way - it's already happening and it's probably more natural for muli-processing anyway.
Code: Select all
 *    9.16   Significant changes to the way MakeMove() operates.  Rather than  *
 *           copying the large position[ply] structure around, the piece       *
 *           location bitmaps and the array of which piece is on which square  *
 *           are now simple global variables.  This means that there is now an *
 *           UnmakeMove() function that must be used to restore these after a  *
 *           a move has been made.  The benefit is that we avoid copying 10    *
 *           bitmaps for piece locations when typically only one is changed,   *
 *           Ditto for the which piece is on which square array which is 64    *
 *           bytes long.  The result is much less memory traffic, with the     *
 *           probability that the bitmaps now stay in cache all the time.
That is right out of crafty's main.c and shows about when I stopped the copy/make approach and explains why. Copy/Make is not new. It is quite old, Make/Unmake is newer because it better matches up with the poor PC memory bandwidth limits.

It had to be done that way back then and thus every pseduo-code example of how to do some chess algorithm has undo() in it somewhere.

I also believe that modern processors are optimized for memory copies - it is a ridiculously common operation.
The old Cray T932 could copy almost 400 gigabytes per second. Check out your friendly PC's memory bandwidth. A processor with a single memory unit (as opposed to supercomputer-level interleaved memory banks) is hardly optimized for memory copying operations.

Your position state is just about the same size as mine and I believe that with states as small as ours, it's just not an issue.

It takes about the same time to UNMAKE a move as to make a move. But for us that time is cut in half, because we never have to undo() a move - we get that for free.
You might _think_ it is for free. But the copy is far from free. Add 7 more threads on an 8-core box and it is even less "free".

In theory we pay for that with a bit extra make time - but even that is not so obvious to me. Our make routines have less logic as we do not have the extra bookeeping and logic of stashing away undo information in yet another array.

I don't really know which is faster for programs like ours, but it would not suprise me if what we are doing is the faster way - it's certainly simple and clean compared to the alternative and simple and clean almost always win out (except when it doesn't.)

I don't "stash away" any undo information, with the only exception being the hash signature, and the 50-move counter.

The pentium-pro was not a bad machine. I had one with 4-way memory interleaving (a big server box). Getting rid of copy/make was a win on that machine as well. I suspect it is the same today as memory speeds have not changed much, yet processor speeds are way beyond that original 200mhz pentium pro. Much faster to make/unmake data in cache, rather than waiting hundreds (or thousands) of cycles for something to poke in from memory today.

bob · Post by **bob** » Thu Sep 17, 2009 2:59 am

Don wrote:
You might _think_ it is for free. But the copy is far from free. Add 7 more threads on an 8-core box and it is even less "free".
I honestly don't know which is faster Bob, but I'm not making any assumptions either. You seem to be just assuming that copying 192 bytes is so expensive it is worth the extra logic, memory access and missed branch predictions that go with undo() which is clearly NOT free. With copy, undo is FREE.

First, I am not assuming _anything_. This has been measured, _carefully_. I agree that Unmake() is free. And that Make() does no extra work at all either, except for the copy. But the copy is what I am looking at. In Crafty, when I was doing (originally) copy/make, I was copying 192 bytes of "stuff". My speed jumped _significantly_ when I dumped the copy and went to make/unmake. Several others had already discovered this. I was not the one that discovered that the copy part was bad. But when Bruce Moreland first mentioned the potential problem, I did some tests, and became convinced. It was a lot harder to go from copy/make to make/unmake, than it is to go the other direction. But I made the modifications and saw an instant significant performance jump.

As I said, I am going to go back to copy/make in the current Crafty (not that hard to do) and compare performance, in terms of NPS, with a single thread, and with 8 threads, to see what the issues are, if any. If it is as fast or faster, I will report that here and stick with it. If not, I am saving the make/unmake version and will revert to that. My Unmake() function is not very big, so eliminating that is not a huge deal from a simplicity point of view. I only worry about moving chunks of N bytes around (192 might be fairly accurate today, 12 piece boards, plus two for diagonals and ranks/files, by the time I am finished, 192 or less seems reasonable.

I hope to get this done tomorrow afternoon between classes, with any luck. More data once that is done. Then there won't be any speculation or guesswork at all.

If 8 processor machines are so limited that they cannot handle 4 or 5 K bytes of position states per processor, then they are probably not very practical machines.

On my quad I run 8 instances of my chess program, but only 4 are thinking at a time and I notice very little slowdown over running on 1 core.

Are 8 processor machines really this fragile? I would never buy an 8 processor machine if it just means that every instance of a program runs at half the speed because the memory is so crippled up. I would think they would design these things better.

Don · Post by **Don** » Thu Sep 17, 2009 3:37 am

bob wrote: I don't "stash away" any undo information, with the only exception being the hash signature, and the 50-move counter.

The pentium-pro was not a bad machine. I had one with 4-way memory interleaving (a big server box). Getting rid of copy/make was a win on that machine as well. I suspect it is the same today as memory speeds have not changed much, yet processor speeds are way beyond that original 200mhz pentium pro. Much faster to make/unmake data in cache, rather than waiting hundreds (or thousands) of cycles for something to poke in from memory today.

But I don't think that is happening. If it were as bad as you are saying my program would be crawling because I have hash tables. Due to the cache issue I tried not storing hash tables in the quies search and it had no impact on my nodes per second. This is random access all over several megabyes of data, reads and writes.

And yet you think a small stack of 30 or so commonly accessed position states is just trashing the whole cache of the machine. I really don't think you are thinking very clearly about this, because you seem to picking and choosing your assumptions. You assume that the extra logic has no appreciable cost but that copying 100 bytes is something modern processors just cannot handle.

You made the arugment that the problem is so bad they have to have level 1, level 2, level 3 cache and so on. So I'll make the argument that this caching technology actually works and you can actually expect 6K of position state to be in the cache.

BoldReceiver · Post by **BoldReceiver** » Thu Sep 17, 2009 4:21 am

Thanks Bob, I'd obviously overlooked this. I appreciate the help and the discussion that has followed.

Don · Post by **Don** » Thu Sep 17, 2009 5:30 am

bob wrote:
Don wrote:
You might _think_ it is for free. But the copy is far from free. Add 7 more threads on an 8-core box and it is even less "free".
I honestly don't know which is faster Bob, but I'm not making any assumptions either. You seem to be just assuming that copying 192 bytes is so expensive it is worth the extra logic, memory access and missed branch predictions that go with undo() which is clearly NOT free. With copy, undo is FREE.
First, I am not assuming _anything_. This has been measured, _carefully_. I agree that Unmake() is free. And that Make() does no extra work at all either, except for the copy. But the copy is what I am looking at. In Crafty, when I was doing (originally) copy/make, I was copying 192 bytes of "stuff". My speed jumped _significantly_ when I dumped the copy and went to make/unmake. Several others had already discovered this. I was not the one that discovered that the copy part was bad. But when Bruce Moreland first mentioned the potential problem, I did some tests, and became convinced. It was a lot harder to go from copy/make to make/unmake, than it is to go the other direction. But I made the modifications and saw an instant significant performance jump.

As I said, I am going to go back to copy/make in the current Crafty (not that hard to do) and compare performance, in terms of NPS, with a single thread, and with 8 threads, to see what the issues are, if any. If it is as fast or faster, I will report that here and stick with it. If not, I am saving the make/unmake version and will revert to that. My Unmake() function is not very big, so eliminating that is not a huge deal from a simplicity point of view. I only worry about moving chunks of N bytes around (192 might be fairly accurate today, 12 piece boards, plus two for diagonals and ranks/files, by the time I am finished, 192 or less seems reasonable.

I hope to get this done tomorrow afternoon between classes, with any luck. More data once that is done. Then there won't be any speculation or guesswork at all.

If 8 processor machines are so limited that they cannot handle 4 or 5 K bytes of position states per processor, then they are probably not very practical machines.

On my quad I run 8 instances of my chess program, but only 4 are thinking at a time and I notice very little slowdown over running on 1 core.

Are 8 processor machines really this fragile? I would never buy an 8 processor machine if it just means that every instance of a program runs at half the speed because the memory is so crippled up. I would think they would design these things better.

A test is certainly a better proposal than us arguing about it

A test I could do is to double my state. In my program state is 192 bytes, it's probably a little more in crafty since you carry extra bit boards around. I did a printf( "%d", sizeof(position)) to get that figure.

So I could add a 200 byte character array and see how it impacts performance. I'm pretty sure I would see almost no slowdown but I could be wrong of course. I'm pretty sure GCC is not smart enough to know that it could just copy the first 192 bytes

undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning