copy/make vs make/unmake test results

bob · Post by **bob** » Sat Sep 19, 2009 2:55 am

I didn't get a chance to make the changes until this afternoon. Took about 3.5 hours to make them all, and then do a little debugging where I made a typo or forgot about something here and there. Both versions now search exactly the same trees, tested over a couple of hundred positions, so they are identical, except that in each of the following cases, log.001 comes from the original make/unmake version of Crafty, and log.002 comes from the new copy/make with no unmake function whatsoever.

I picked a few test positions that anyone could try, just so we have a test anyone interested can reproduce. All of the results were about the same, so I will just give one. Every test I tried using just one CPU showed about a 5% loss in speed using copy/make. In looking at my very old notes, this was just over 10% slower back on the pentium/pentium-pro, perhaps assisted by the very large L2 caches of today (this is a core-2 box with 4mb of L2 cache).

log.001: time=30.67 mat=1 n=119294067 fh=96% nps=3.9M
log.002: time=31.51 mat=1 n=117529270 fh=96% nps=3.7M

The new version is about 5% slower using copy/make. Not _that_ bad. (note I am only looking at nps, both runs were told to search for 30 seconds and quit, but I only check the time once per second inside crafty at this search time limit so they run for slightly different times. But the NPS numbers are what we are talking about.

However, I then tested this on my dual quad-core box using 8 cores and 8 threads. The bottom fell out in a deep/narrow endgame:

log.001: time=30.49 mat=1 n=564874031 fh=97% nps=18.5M
log.002: time=30.26 mat=1 n=434702300 fh=83% nps=14.4M

Whereas in typical middlegame positions, things were not nearly as bad:

log.001: time=30.47 mat=0 n=492705674 fh=87% nps=16.2M
log.002: time=30.87 mat=0 n=481349471 fh=87% nps=15.6M

For these SMP runs. I used a dual quad-core core-2 box, 2.0ghz, 16gb of ram (I used a fairly small hash table) and 4mb of shared L2 per chip, two chips. This box also has 2-way memory interleaving. I did not have access to a more vanilla quad-core or dual quad-core with more normal memory, this is a server box that is pretty fast.

Bottom line is that copy/make costs 5% on single-thread programs, and can cost 25% on an 8-core box if you use all 8 cores. It would probably degrade much worse on 16 cores. I tried to test it on such a box, but ran into enough issues with the installed compiler and libs that I gave up for the moment.

For the record, my "position" structure is something like 220 bytes or so. There are 12 piece-type bitboards, plus occupied-white and occupied-black, plus the diagonal-movers and rank/file-moves, total of 16 bitboards of 8 bytes each. Throw in the 64 bytes for square contents, the two hash signatures, and a few bytes for castle rights, ep status, 50 move counter, and it ends up at a little over 200 bytes total. Or just over 3 cache lines on this box.

My conclusion is, as always, copy/make is not the way to go on the PC platform. YMMV of course. I'll tuck this version away somewhere, but am not going to use it for anything since it is slower.

mcostalba · Post by **mcostalba** » Sat Sep 19, 2009 8:34 am

bob wrote: For the record, my "position" structure is something like 220 bytes or so.

Thanks for the test.

How many bytes you backup in make move and then restore from backup storage in unmake move ?

Really hope my question is clear this time

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Sep 19, 2009 9:29 am

bob wrote: My conclusion is, as always, copy/make is not the way to go on the PC platform. YMMV of course. I'll tuck this version away somewhere, but am not going to use it for anything since it is slower.

So with your approach, make/unmake makes sense rather than to use ~4 additional 64-byte cachelines per ply to safe the unmake. In make/unmake you'll incrementally change only a small fraction of the whole board structure (one or two bitboards out of 12, two squares out of 64). In copy/make you would copy a lot of stuff not affected by makemove. Copy/make increase L1 pressure and you'll stall more often let say on magic (or rotated) lookups and elsewhere.

TIMTOWTDI. I have the information you have inside 12++ bitboards plus the redundant board array, in one quad-bb. So for me copy/make is far less than one additional cacheline per ply, quad-bb, pawn-hashkey (general hashkey has a separate array for repetition detection anyway), some piece counters. The other stuff I mentioned (~256 bytes) to keep recursively for staged movegen, some attack bitboards, 16 directionwise move target sets as unsorted movelist, is generated on the fly with register intensive SSE2 Kogge-Stone stuff but no magic lookups, and hides the latency of the TT-probe, which is done always.

bob · Post by **bob** » Sat Sep 19, 2009 2:37 pm

mcostalba wrote:
bob wrote: For the record, my "position" structure is something like 220 bytes or so.
Thanks for the test.

How many bytes you backup in make move and then restore from backup storage in unmake move ?

Really hope my question is clear this time

As I said, I copy under 256 bytes to make a move. I copy _nothing_ to unmake a move, that is the point of copy/make.

It goes something like this:

MakeMove(etc, ply):

position[ply+1] = position[ply];
update position[ply+1]
(position[ply] is completely unchanged.
pass position[ply+1] to the next level of search.
when search returns it does nothing since position[ply] is still correct, as only position[ply+1] was changed.

I'm also trying to be clear, also. In copy/make there is _NO_ unmake() whatsoever. Otherwise it would not be called copy/make.

mcostalba · Post by **mcostalba** » Sat Sep 19, 2009 2:49 pm

bob wrote:
mcostalba wrote:
bob wrote: For the record, my "position" structure is something like 220 bytes or so.
Thanks for the test.

How many bytes you backup in make move and then restore from backup storage in unmake move ?

Really hope my question is clear this time
As I said, I copy under 256 bytes to make a move. I copy _nothing_ to unmake a move, that is the point of copy/make.

It goes something like this:

MakeMove(etc, ply):

position[ply+1] = position[ply];
update position[ply+1]
(position[ply] is completely unchanged.
pass position[ply+1] to the next level of search.
when search returns it does nothing since position[ply] is still correct, as only position[ply+1] was changed.

I'm also trying to be clear, also. In copy/make there is _NO_ unmake() whatsoever. Otherwise it would not be called copy/make.

No I am not talking of copy/make.

Please forget copy/make. for my question do like copy/make does not exsist otherwise we get nowhere

Now consider we have _only_ make move + unmake move.

In make move you _save_ same info as example position key or something else, then you update the position.

In unmake move you update back the position _and_ retrieve some info, as example the key from the backup storage where you saved in make move.

Now my question is: how many bytes you save in make move and then retrieve in unmake ? IOW how many fields you don't update incrementally in unmake move but simply take back the old saved value ?

Hope is clear now.

bob · Post by **bob** » Sat Sep 19, 2009 3:14 pm

mcostalba wrote:
bob wrote:
mcostalba wrote:
bob wrote: For the record, my "position" structure is something like 220 bytes or so.
Thanks for the test.

How many bytes you backup in make move and then restore from backup storage in unmake move ?

Really hope my question is clear this time
As I said, I copy under 256 bytes to make a move. I copy _nothing_ to unmake a move, that is the point of copy/make.

It goes something like this:

MakeMove(etc, ply):

position[ply+1] = position[ply];
update position[ply+1]
(position[ply] is completely unchanged.
pass position[ply+1] to the next level of search.
when search returns it does nothing since position[ply] is still correct, as only position[ply+1] was changed.

I'm also trying to be clear, also. In copy/make there is _NO_ unmake() whatsoever. Otherwise it would not be called copy/make.
No I am not talking of copy/make.

Please forget copy/make. for my question do like copy/make does not exsist otherwise we get nowhere

Now consider we have _only_ make move + unmake move.

In make move you _save_ same info as example position key or something else, then you update the position.

In unmake move you update back the position _and_ retrieve some info, as example the key from the backup storage where you saved in make move.

Now my question is: how many bytes you save in make move and then retrieve in unmake ? IOW how many fields you don't update incrementally in unmake move but simply take back the old saved value ?

Hope is clear now.

OK. I save the hash signature and pawn hash signature (16 bytes total) plus 4 bytes (2 x castle rights for one side, en passant target square, 50 move counter). Total = 20 bytes of stuff that is copied then restored. I suspect it would be better to unmake the hash signature stuff now, it was very slightly better to copy it on 32 bit machines, but now that 64 bit registers are the norm, it is likely better to unmake the hash / pawn-hash signatures as well, which I will test later.

But in any case, this started as a "copy/make" discussion, which is where I have been coming from all along. But to answer your question, I copy 20 bytes at the front of MakeMove() and copy that 20 bytes back in UnmakeMove().

mcostalba · Post by **mcostalba** » Sat Sep 19, 2009 5:16 pm

bob wrote:But to answer your question, I copy 20 bytes at the front of MakeMove() and copy that 20 bytes back in UnmakeMove().

Ok. In SF I copy 56 bytes...they are never copied back:

Code: Select all

  // Copy some fields of old state to our new StateInfo object except the
  // ones which are recalculated from scratch anyway, then switch our state
  // pointer to point to the new, ready to be updated, state.
  struct ReducedStateInfo {
    Key key, pawnKey, materialKey;
    int castleRights, rule50;
    Square epSquare;
    Value mgValue, egValue;
    Value npMaterial[2];
  };

Don · Post by **Don** » Mon Sep 21, 2009 4:23 pm

bob wrote:I didn't get a chance to make the changes until this afternoon. Took about 3.5 hours to make them all, and then do a little debugging where I made a typo or forgot about something here and there. Both versions now search exactly the same trees, tested over a couple of hundred positions, so they are identical, except that in each of the following cases, log.001 comes from the original make/unmake version of Crafty, and log.002 comes from the new copy/make with no unmake function whatsoever.

I picked a few test positions that anyone could try, just so we have a test anyone interested can reproduce. All of the results were about the same, so I will just give one. Every test I tried using just one CPU showed about a 5% loss in speed using copy/make. In looking at my very old notes, this was just over 10% slower back on the pentium/pentium-pro, perhaps assisted by the very large L2 caches of today (this is a core-2 box with 4mb of L2 cache).

log.001: time=30.67 mat=1 n=119294067 fh=96% nps=3.9M
log.002: time=31.51 mat=1 n=117529270 fh=96% nps=3.7M

The new version is about 5% slower using copy/make. Not _that_ bad. (note I am only looking at nps, both runs were told to search for 30 seconds and quit, but I only check the time once per second inside crafty at this search time limit so they run for slightly different times. But the NPS numbers are what we are talking about.

However, I then tested this on my dual quad-core box using 8 cores and 8 threads. The bottom fell out in a deep/narrow endgame:

log.001: time=30.49 mat=1 n=564874031 fh=97% nps=18.5M
log.002: time=30.26 mat=1 n=434702300 fh=83% nps=14.4M

Whereas in typical middlegame positions, things were not nearly as bad:

log.001: time=30.47 mat=0 n=492705674 fh=87% nps=16.2M
log.002: time=30.87 mat=0 n=481349471 fh=87% nps=15.6M

For these SMP runs. I used a dual quad-core core-2 box, 2.0ghz, 16gb of ram (I used a fairly small hash table) and 4mb of shared L2 per chip, two chips. This box also has 2-way memory interleaving. I did not have access to a more vanilla quad-core or dual quad-core with more normal memory, this is a server box that is pretty fast.

Bottom line is that copy/make costs 5% on single-thread programs, and can cost 25% on an 8-core box if you use all 8 cores. It would probably degrade much worse on 16 cores. I tried to test it on such a box, but ran into enough issues with the installed compiler and libs that I gave up for the moment.

For the record, my "position" structure is something like 220 bytes or so. There are 12 piece-type bitboards, plus occupied-white and occupied-black, plus the diagonal-movers and rank/file-moves, total of 16 bitboards of 8 bytes each. Throw in the 64 bytes for square contents, the two hash signatures, and a few bytes for castle rights, ep status, 50 move counter, and it ends up at a little over 200 bytes total. Or just over 3 cache lines on this box.

My conclusion is, as always, copy/make is not the way to go on the PC platform. YMMV of course. I'll tuck this version away somewhere, but am not going to use it for anything since it is slower.

Does your position structures overlap cache lines or does the compiler fix that for you? I think your position structure should be padded to be exactly 256 bytes if you are running on more than one processor.

bob · Post by **bob** » Mon Sep 21, 2009 7:48 pm

Don wrote:
bob wrote:I didn't get a chance to make the changes until this afternoon. Took about 3.5 hours to make them all, and then do a little debugging where I made a typo or forgot about something here and there. Both versions now search exactly the same trees, tested over a couple of hundred positions, so they are identical, except that in each of the following cases, log.001 comes from the original make/unmake version of Crafty, and log.002 comes from the new copy/make with no unmake function whatsoever.

I picked a few test positions that anyone could try, just so we have a test anyone interested can reproduce. All of the results were about the same, so I will just give one. Every test I tried using just one CPU showed about a 5% loss in speed using copy/make. In looking at my very old notes, this was just over 10% slower back on the pentium/pentium-pro, perhaps assisted by the very large L2 caches of today (this is a core-2 box with 4mb of L2 cache).

log.001: time=30.67 mat=1 n=119294067 fh=96% nps=3.9M
log.002: time=31.51 mat=1 n=117529270 fh=96% nps=3.7M

The new version is about 5% slower using copy/make. Not _that_ bad. (note I am only looking at nps, both runs were told to search for 30 seconds and quit, but I only check the time once per second inside crafty at this search time limit so they run for slightly different times. But the NPS numbers are what we are talking about.

However, I then tested this on my dual quad-core box using 8 cores and 8 threads. The bottom fell out in a deep/narrow endgame:

log.001: time=30.49 mat=1 n=564874031 fh=97% nps=18.5M
log.002: time=30.26 mat=1 n=434702300 fh=83% nps=14.4M

Whereas in typical middlegame positions, things were not nearly as bad:

log.001: time=30.47 mat=0 n=492705674 fh=87% nps=16.2M
log.002: time=30.87 mat=0 n=481349471 fh=87% nps=15.6M

For these SMP runs. I used a dual quad-core core-2 box, 2.0ghz, 16gb of ram (I used a fairly small hash table) and 4mb of shared L2 per chip, two chips. This box also has 2-way memory interleaving. I did not have access to a more vanilla quad-core or dual quad-core with more normal memory, this is a server box that is pretty fast.

Bottom line is that copy/make costs 5% on single-thread programs, and can cost 25% on an 8-core box if you use all 8 cores. It would probably degrade much worse on 16 cores. I tried to test it on such a box, but ran into enough issues with the installed compiler and libs that I gave up for the moment.

For the record, my "position" structure is something like 220 bytes or so. There are 12 piece-type bitboards, plus occupied-white and occupied-black, plus the diagonal-movers and rank/file-moves, total of 16 bitboards of 8 bytes each. Throw in the 64 bytes for square contents, the two hash signatures, and a few bytes for castle rights, ep status, 50 move counter, and it ends up at a little over 200 bytes total. Or just over 3 cache lines on this box.

My conclusion is, as always, copy/make is not the way to go on the PC platform. YMMV of course. I'll tuck this version away somewhere, but am not going to use it for anything since it is slower.
Does your position structures overlap cache lines or does the compiler fix that for you? I think your position structure should be padded to be exactly 256 bytes if you are running on more than one processor.

It is padded to 256 bytes. And more importantly, the split blocks (which contain the per-cpu position structure arrays) are aligned on page boundaries so that on AMD boxes I don't share a page of memory across two CPUs which is bad on NUMA boxes.

My goal has always been to optimize memory accesses, hence aligning hash tables, and everything else that might be important.

BTW, I should add that the allignment of the individual structures for copy/make is not important here (although it is done since the structure is close enough to 256 for the compiler to round it up). This state stack is an array, and there is a unique instance of this array for each CPU anyway, so there is no possibility of two copy state instances overlapping to the same cache line, if they belong to two different processors/threads anyway.

copy/make vs make/unmake test results

copy/make vs make/unmake test results

Re: copy/make vs make/unmake test results

Re: copy/make vs make/unmake test results

Re: copy/make vs make/unmake test results

Re: copy/make vs make/unmake test results

Re: copy/make vs make/unmake test results

Re: copy/make vs make/unmake test results

Re: copy/make vs make/unmake test results

Re: copy/make vs make/unmake test results