I didn't get a chance to make the changes until this afternoon.  Took about 3.5 hours to make them all, and then do a little debugging where I made a typo or forgot about something here and there.  Both versions now search exactly the same trees, tested over a couple of hundred positions, so they are identical, except that in each of the following cases, log.001 comes from the original make/unmake version of Crafty, and log.002 comes from the new copy/make with no unmake function whatsoever.
I picked a few test positions that anyone could try, just so we have a test anyone interested can reproduce.  All of the results were about the same, so I will just give one.  Every test I tried using just one CPU showed about a 5% loss in speed using copy/make.  In looking at my very old notes, this was just over 10% slower back on the pentium/pentium-pro, perhaps assisted by the very large L2 caches of today (this is a core-2 box with 4mb of L2 cache).
log.001:              time=30.67  mat=1  n=119294067  fh=96%  nps=3.9M
log.002:              time=31.51  mat=1  n=117529270  fh=96%  nps=3.7M
The new version is about 5% slower using copy/make. Not _that_ bad.  (note I am only looking at nps, both runs were told to search for 30 seconds and quit, but I only check the time once per second inside crafty at this search time limit so they run for slightly different times.  But the NPS numbers are what we are talking about.
However, I then tested this on my dual quad-core box using 8 cores and 8 threads.  The bottom fell out in a deep/narrow endgame:
log.001:              time=30.49  mat=1  n=564874031  fh=97%  nps=18.5M
log.002:              time=30.26  mat=1  n=434702300  fh=83%  nps=14.4M
Whereas in typical middlegame positions, things were not nearly as bad:
log.001:              time=30.47  mat=0  n=492705674  fh=87%  nps=16.2M
log.002:              time=30.87  mat=0  n=481349471  fh=87%  nps=15.6M
For these SMP runs. I used a dual quad-core core-2 box, 2.0ghz,  16gb of ram (I used a fairly small hash table) and 4mb of shared L2 per chip, two chips.  This box also has 2-way memory interleaving.  I did not have access to a more vanilla quad-core or dual quad-core with more normal memory, this is a server box that is pretty fast.
Bottom line is that copy/make costs 5% on single-thread programs, and can cost 25% on an 8-core box if you use all 8 cores.  It would probably degrade much worse on 16 cores.  I tried to test it on such a box, but ran into enough issues with the installed compiler and libs that I gave up for the moment.
For the record, my "position" structure is something like 220 bytes or so.  There are 12 piece-type bitboards, plus occupied-white and occupied-black, plus the diagonal-movers and rank/file-moves, total of 16 bitboards of 8 bytes each.  Throw in the 64 bytes for square contents, the two hash signatures, and a few bytes for castle rights, ep status, 50 move counter, and it ends up at a little over 200 bytes total.  Or just over 3 cache lines on this box.
My conclusion is, as always, copy/make is not the way to go on the PC platform.  YMMV of course.  I'll tuck this version away somewhere, but am not going to use it for anything since it is slower.
			
			
									
						
										
						copy/make vs make/unmake test results
Moderator: Ras
- 
				bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
- 
				mcostalba
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: copy/make vs make/unmake test results
Thanks for the test.bob wrote: For the record, my "position" structure is something like 220 bytes or so.
How many bytes you backup in make move and then restore from backup storage in unmake move ?
Really hope my question is clear this time

- 
				Gerd Isenberg
- Posts: 2251
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: copy/make vs make/unmake test results
So with your approach, make/unmake makes sense rather than to use ~4 additional 64-byte cachelines per ply to safe the unmake. In make/unmake you'll incrementally change only a small fraction of the whole board structure (one or two bitboards out of 12, two squares out of 64). In copy/make you would copy a lot of stuff not affected by makemove. Copy/make increase L1 pressure and you'll stall more often let say on magic (or rotated) lookups and elsewhere.bob wrote: My conclusion is, as always, copy/make is not the way to go on the PC platform. YMMV of course. I'll tuck this version away somewhere, but am not going to use it for anything since it is slower.
TIMTOWTDI. I have the information you have inside 12++ bitboards plus the redundant board array, in one quad-bb. So for me copy/make is far less than one additional cacheline per ply, quad-bb, pawn-hashkey (general hashkey has a separate array for repetition detection anyway), some piece counters. The other stuff I mentioned (~256 bytes) to keep recursively for staged movegen, some attack bitboards, 16 directionwise move target sets as unsorted movelist, is generated on the fly with register intensive SSE2 Kogge-Stone stuff but no magic lookups, and hides the latency of the TT-probe, which is done always.
- 
				bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: copy/make vs make/unmake test results
As I said, I copy under 256 bytes to make a move. I copy _nothing_ to unmake a move, that is the point of copy/make.mcostalba wrote:Thanks for the test.bob wrote: For the record, my "position" structure is something like 220 bytes or so.
How many bytes you backup in make move and then restore from backup storage in unmake move ?
Really hope my question is clear this time
It goes something like this:
MakeMove(etc, ply):
position[ply+1] = position[ply];
update position[ply+1]
(position[ply] is completely unchanged.
pass position[ply+1] to the next level of search.
when search returns it does nothing since position[ply] is still correct, as only position[ply+1] was changed.
I'm also trying to be clear, also. In copy/make there is _NO_ unmake() whatsoever. Otherwise it would not be called copy/make.
- 
				mcostalba
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: copy/make vs make/unmake test results
No I am not talking of copy/make.bob wrote:As I said, I copy under 256 bytes to make a move. I copy _nothing_ to unmake a move, that is the point of copy/make.mcostalba wrote:Thanks for the test.bob wrote: For the record, my "position" structure is something like 220 bytes or so.
How many bytes you backup in make move and then restore from backup storage in unmake move ?
Really hope my question is clear this time
It goes something like this:
MakeMove(etc, ply):
position[ply+1] = position[ply];
update position[ply+1]
(position[ply] is completely unchanged.
pass position[ply+1] to the next level of search.
when search returns it does nothing since position[ply] is still correct, as only position[ply+1] was changed.
I'm also trying to be clear, also. In copy/make there is _NO_ unmake() whatsoever. Otherwise it would not be called copy/make.
Please forget copy/make. for my question do like copy/make does not exsist otherwise we get nowhere

Now consider we have _only_ make move + unmake move.
In make move you _save_ same info as example position key or something else, then you update the position.
In unmake move you update back the position _and_ retrieve some info, as example the key from the backup storage where you saved in make move.
Now my question is: how many bytes you save in make move and then retrieve in unmake ? IOW how many fields you don't update incrementally in unmake move but simply take back the old saved value ?
Hope is clear now.
- 
				bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: copy/make vs make/unmake test results
OK. I save the hash signature and pawn hash signature (16 bytes total) plus 4 bytes (2 x castle rights for one side, en passant target square, 50 move counter). Total = 20 bytes of stuff that is copied then restored. I suspect it would be better to unmake the hash signature stuff now, it was very slightly better to copy it on 32 bit machines, but now that 64 bit registers are the norm, it is likely better to unmake the hash / pawn-hash signatures as well, which I will test later.mcostalba wrote:No I am not talking of copy/make.bob wrote:As I said, I copy under 256 bytes to make a move. I copy _nothing_ to unmake a move, that is the point of copy/make.mcostalba wrote:Thanks for the test.bob wrote: For the record, my "position" structure is something like 220 bytes or so.
How many bytes you backup in make move and then restore from backup storage in unmake move ?
Really hope my question is clear this time
It goes something like this:
MakeMove(etc, ply):
position[ply+1] = position[ply];
update position[ply+1]
(position[ply] is completely unchanged.
pass position[ply+1] to the next level of search.
when search returns it does nothing since position[ply] is still correct, as only position[ply+1] was changed.
I'm also trying to be clear, also. In copy/make there is _NO_ unmake() whatsoever. Otherwise it would not be called copy/make.
Please forget copy/make. for my question do like copy/make does not exsist otherwise we get nowhere
Now consider we have _only_ make move + unmake move.
In make move you _save_ same info as example position key or something else, then you update the position.
In unmake move you update back the position _and_ retrieve some info, as example the key from the backup storage where you saved in make move.
Now my question is: how many bytes you save in make move and then retrieve in unmake ? IOW how many fields you don't update incrementally in unmake move but simply take back the old saved value ?
Hope is clear now.
But in any case, this started as a "copy/make" discussion, which is where I have been coming from all along. But to answer your question, I copy 20 bytes at the front of MakeMove() and copy that 20 bytes back in UnmakeMove().
- 
				mcostalba
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: copy/make vs make/unmake test results
Ok. In SF I copy 56 bytes...they are never copied back:bob wrote:But to answer your question, I copy 20 bytes at the front of MakeMove() and copy that 20 bytes back in UnmakeMove().
Code: Select all
  // Copy some fields of old state to our new StateInfo object except the
  // ones which are recalculated from scratch anyway, then switch our state
  // pointer to point to the new, ready to be updated, state.
  struct ReducedStateInfo {
    Key key, pawnKey, materialKey;
    int castleRights, rule50;
    Square epSquare;
    Value mgValue, egValue;
    Value npMaterial[2];
  };
- 
				Don  
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: copy/make vs make/unmake test results
Does your position structures overlap cache lines or does the compiler fix that for you? I think your position structure should be padded to be exactly 256 bytes if you are running on more than one processor.bob wrote:I didn't get a chance to make the changes until this afternoon. Took about 3.5 hours to make them all, and then do a little debugging where I made a typo or forgot about something here and there. Both versions now search exactly the same trees, tested over a couple of hundred positions, so they are identical, except that in each of the following cases, log.001 comes from the original make/unmake version of Crafty, and log.002 comes from the new copy/make with no unmake function whatsoever.
I picked a few test positions that anyone could try, just so we have a test anyone interested can reproduce. All of the results were about the same, so I will just give one. Every test I tried using just one CPU showed about a 5% loss in speed using copy/make. In looking at my very old notes, this was just over 10% slower back on the pentium/pentium-pro, perhaps assisted by the very large L2 caches of today (this is a core-2 box with 4mb of L2 cache).
log.001: time=30.67 mat=1 n=119294067 fh=96% nps=3.9M
log.002: time=31.51 mat=1 n=117529270 fh=96% nps=3.7M
The new version is about 5% slower using copy/make. Not _that_ bad. (note I am only looking at nps, both runs were told to search for 30 seconds and quit, but I only check the time once per second inside crafty at this search time limit so they run for slightly different times. But the NPS numbers are what we are talking about.
However, I then tested this on my dual quad-core box using 8 cores and 8 threads. The bottom fell out in a deep/narrow endgame:
log.001: time=30.49 mat=1 n=564874031 fh=97% nps=18.5M
log.002: time=30.26 mat=1 n=434702300 fh=83% nps=14.4M
Whereas in typical middlegame positions, things were not nearly as bad:
log.001: time=30.47 mat=0 n=492705674 fh=87% nps=16.2M
log.002: time=30.87 mat=0 n=481349471 fh=87% nps=15.6M
For these SMP runs. I used a dual quad-core core-2 box, 2.0ghz, 16gb of ram (I used a fairly small hash table) and 4mb of shared L2 per chip, two chips. This box also has 2-way memory interleaving. I did not have access to a more vanilla quad-core or dual quad-core with more normal memory, this is a server box that is pretty fast.
Bottom line is that copy/make costs 5% on single-thread programs, and can cost 25% on an 8-core box if you use all 8 cores. It would probably degrade much worse on 16 cores. I tried to test it on such a box, but ran into enough issues with the installed compiler and libs that I gave up for the moment.
For the record, my "position" structure is something like 220 bytes or so. There are 12 piece-type bitboards, plus occupied-white and occupied-black, plus the diagonal-movers and rank/file-moves, total of 16 bitboards of 8 bytes each. Throw in the 64 bytes for square contents, the two hash signatures, and a few bytes for castle rights, ep status, 50 move counter, and it ends up at a little over 200 bytes total. Or just over 3 cache lines on this box.
My conclusion is, as always, copy/make is not the way to go on the PC platform. YMMV of course. I'll tuck this version away somewhere, but am not going to use it for anything since it is slower.
- 
				bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: copy/make vs make/unmake test results
It is padded to 256 bytes. And more importantly, the split blocks (which contain the per-cpu position structure arrays) are aligned on page boundaries so that on AMD boxes I don't share a page of memory across two CPUs which is bad on NUMA boxes.Don wrote:Does your position structures overlap cache lines or does the compiler fix that for you? I think your position structure should be padded to be exactly 256 bytes if you are running on more than one processor.bob wrote:I didn't get a chance to make the changes until this afternoon. Took about 3.5 hours to make them all, and then do a little debugging where I made a typo or forgot about something here and there. Both versions now search exactly the same trees, tested over a couple of hundred positions, so they are identical, except that in each of the following cases, log.001 comes from the original make/unmake version of Crafty, and log.002 comes from the new copy/make with no unmake function whatsoever.
I picked a few test positions that anyone could try, just so we have a test anyone interested can reproduce. All of the results were about the same, so I will just give one. Every test I tried using just one CPU showed about a 5% loss in speed using copy/make. In looking at my very old notes, this was just over 10% slower back on the pentium/pentium-pro, perhaps assisted by the very large L2 caches of today (this is a core-2 box with 4mb of L2 cache).
log.001: time=30.67 mat=1 n=119294067 fh=96% nps=3.9M
log.002: time=31.51 mat=1 n=117529270 fh=96% nps=3.7M
The new version is about 5% slower using copy/make. Not _that_ bad. (note I am only looking at nps, both runs were told to search for 30 seconds and quit, but I only check the time once per second inside crafty at this search time limit so they run for slightly different times. But the NPS numbers are what we are talking about.
However, I then tested this on my dual quad-core box using 8 cores and 8 threads. The bottom fell out in a deep/narrow endgame:
log.001: time=30.49 mat=1 n=564874031 fh=97% nps=18.5M
log.002: time=30.26 mat=1 n=434702300 fh=83% nps=14.4M
Whereas in typical middlegame positions, things were not nearly as bad:
log.001: time=30.47 mat=0 n=492705674 fh=87% nps=16.2M
log.002: time=30.87 mat=0 n=481349471 fh=87% nps=15.6M
For these SMP runs. I used a dual quad-core core-2 box, 2.0ghz, 16gb of ram (I used a fairly small hash table) and 4mb of shared L2 per chip, two chips. This box also has 2-way memory interleaving. I did not have access to a more vanilla quad-core or dual quad-core with more normal memory, this is a server box that is pretty fast.
Bottom line is that copy/make costs 5% on single-thread programs, and can cost 25% on an 8-core box if you use all 8 cores. It would probably degrade much worse on 16 cores. I tried to test it on such a box, but ran into enough issues with the installed compiler and libs that I gave up for the moment.
For the record, my "position" structure is something like 220 bytes or so. There are 12 piece-type bitboards, plus occupied-white and occupied-black, plus the diagonal-movers and rank/file-moves, total of 16 bitboards of 8 bytes each. Throw in the 64 bytes for square contents, the two hash signatures, and a few bytes for castle rights, ep status, 50 move counter, and it ends up at a little over 200 bytes total. Or just over 3 cache lines on this box.
My conclusion is, as always, copy/make is not the way to go on the PC platform. YMMV of course. I'll tuck this version away somewhere, but am not going to use it for anything since it is slower.
My goal has always been to optimize memory accesses, hence aligning hash tables, and everything else that might be important.
BTW, I should add that the allignment of the individual structures for copy/make is not important here (although it is done since the structure is close enough to 256 for the compiler to round it up). This state stack is an array, and there is a unique instance of this array for each CPU anyway, so there is no possibility of two copy state instances overlapping to the same cache line, if they belong to two different processors/threads anyway.