Joost Buijs wrote:
Maybe the performance issue of copy-make is not that important at all because 90% of the time is spent in the evaluation function. I assume you will hardly notice the performance penalty of copy-make on the program as a whole.

rbarreira wrote:
Even if 90% of the time is spent in the evaluation function, that doesn't mean copy-make can't have a very negative impact. For example, if it fills up the cache with garbage it will make the evaluation run slower.

Rein Halbersma wrote:
For a 16 ply search and 256 bytes per position, you are at about 4K per core. That should comfortably fit into cache.

Sure it fits, but it pushes other things out of the cache in order to fit, especially in the L1 cache, which is quite small. This makes other functions in the program slower, and that cost might not show up in a profiler report.
Profiling code is very useful but it does not tell the whole story. It only shows where your program spends its time, not why it spends its time there.
performance of copy-make
Moderator: Ras
-
- Posts: 900
- Joined: Tue Apr 27, 2010 3:48 pm
Re: performance of copy-make
-
- Posts: 942
- Joined: Sun Nov 19, 2006 9:16 pm
- Location: Russia
- Full name: Aleks Peshkov
Re: performance of copy-make
1) It is possible to perform "read - modify - write_to_the_new_location" operations on some parts of the position representation without an explicit copy stage. Updating the Zobrist hash is an obvious example.
2) The latency of the position copy and the rest of the make-move work can be hidden behind the memory prefetch for the TT hash probe. I doubt it is possible to do anything useful in parallel during undo-move operations.
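Point 1) can be sketched in C as follows. This is a minimal illustration, not any engine's real code: the `State` struct and the toy `zobrist()` mixer are assumptions (a real engine indexes a precomputed random table by piece and square). The parent's hash is read, modified, and written directly into the child's slot; no separate copy of the hash field ever happens.

```c
#include <stdint.h>

typedef struct { uint64_t hash; /* ... rest of the position ... */ } State;

/* Toy Zobrist key; a real engine looks up a precomputed random table
   indexed by [piece][square] instead of mixing on the fly. */
static uint64_t zobrist(int piece, int sq)
{
    uint64_t x = ((uint64_t)piece * 64 + sq + 1) * 0x9E3779B97F4A7C15ULL;
    return x ^ (x >> 29);
}

/* Read-modify-write straight into the child: XOR the moving piece out
   of 'from' and in on 'to'.  No explicit copy stage for the hash. */
static void child_hash(const State *parent, State *child,
                       int piece, int from, int to)
{
    child->hash = parent->hash ^ zobrist(piece, from) ^ zobrist(piece, to);
}
```

A cheap sanity check falls out of the XOR structure: applying the same move in reverse from the child must reproduce the parent's hash exactly.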
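Point 2) might look like the sketch below, under stated assumptions: `TTEntry`, `State`, and the power-of-two table indexed with `tt_mask` are illustrative, and `__builtin_prefetch` is the GCC/Clang intrinsic. The idea is simply to issue the prefetch for the child's TT entry first, so the cache-line fetch proceeds while the state copy runs.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t key; int16_t score; uint8_t depth, flags; } TTEntry;
typedef struct { uint64_t hash; uint8_t board[64]; } State;  /* illustrative */

static void make_move_prefetched(const State *parent, State *child,
                                 uint64_t child_hash,
                                 TTEntry *tt, uint64_t tt_mask)
{
    /* Start the TT cache-line fetch before touching the child state... */
    __builtin_prefetch(&tt[child_hash & tt_mask]);
    /* ...then do the copy-make work; the memcpy latency overlaps the fetch. */
    memcpy(child, parent, sizeof *child);
    child->hash = child_hash;
    /* probe tt[child_hash & tt_mask] here; the line should now be warm */
}
```

This only pays off if the child's hash can be computed before the copy, which is exactly what the incremental Zobrist update in point 1) provides.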
-
- Posts: 348
- Joined: Sat Feb 27, 2010 12:21 am
Re: performance of copy-make
Rein Halbersma wrote:
Conclusion: it seems copy-make can be made competitive compared to the usual make-undo. Plus of course the simplification of the whole split point business for multicore engines.

I use copy-make for the attack tables plus some extra fields such as the material signature and hash, but not for the board (mailbox). I don't observe an obvious speed penalty: last time I checked, its raw 'perft' performance was better than Crafty's (and mine includes SEE, which is pointless in perft).
Some have concerns about cache thrashing, multiplying the search depth by the data block size and concluding that it doesn't fit. But that is a false alarm, because what matters for cycles/node is the last few ply only.
I'm not sure if it can be sped up by changing the design to make/unmake. It probably can: Joost's engine is really a lot faster than anybody else's. One reason I did this was to avoid complexity and the hours lost tracking down low-level bugs. For me nps is the last thing to squeeze, and it would be unwise to do so unless the engine is >3000 already, but it is not bad with copy-make (just below 1500 cycles/node on the Phenoms for average searches during games).
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: performance of copy-make
I don't see how it simplifies the split point code at all. All it does is get rid of Unmake() for a gain, but adds the copy during the make process for a loss. But affecting the split point code? I don't see it. You certainly have to replicate everything for each thread...
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: performance of copy-make
Joost Buijs wrote:
Maybe the performance issue of copy-make is not that important at all because 90% of the time is spent in the evaluation function. I assume you will hardly notice the performance penalty of copy-make on the program as a whole.

I have measured the performance of make in Komodo (which is pure copy-make) and it's something like 2% of the running time. Evaluation is by far the largest bottleneck for Komodo.
It's possible that this number does not tell the whole story, but I would not know how to prove that short of re-implementing Komodo without copy-make.
I have also checked this by calling "copy-make" twice in a row, using 2 different states, just to see how much difference it makes, and the extra time is very small. There is a small advantage to calling it twice, however, as the state being copied is guaranteed to be in cache, but I think the writes should dominate the time. So I am far from motivated to go back to the old way programs used to do make/unmake.
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: performance of copy-make
bob wrote:
I don't see how it simplifies the split point code at all. All it does is get rid of Unmake() for a gain, but adds the copy during the make process for a loss. But affecting the split point code? I don't see it. You certainly have to replicate everything for each thread...

The first time I started using copy-make it was for the idea of simplifying the code, getting rid of global variables and avoiding the complexities and extra conditional branches involved in unmaking a move. It's often, but not always, the case that simplifying the code comes with many advantages, including performance.
I usually try to determine, using logic and my own intuition, what should happen if I do things a certain way, but I have found that my intuition is a very poor guide. So I am very careful not to make any assumptions, no matter how obvious they seem to me or how much they fly in the face of logic. Of course one still has to use reason as a guide.
So my first gut feeling on this issue was that it was a horrible idea. At MIT I was convinced to try it after being given a lecture on cache sizes: copying a few hundred bytes on modern CPUs is nothing compared to things like missed branches, cache misses and other considerations. So when I stopped putting undue weight on the speed of copying state, the only thing left was win, win, win. But of course the only thing that really mattered to me was whether it really worked or not.
There are only 2 potential downsides to copy-make: possible cache blowout issues and the speed of copying. The upsides to copy-make are:
1. much reduced logic (which means fewer missed branches)
2. great simplification of the code (which can be somewhat mitigated with good design)
3. easier to make parallel code
4. no dealing with unmake and its complexities
5. fewer data structures to support unmaking a move
None of these things are major - with good design I don't think a program needs to have copy-make and can still be bug free and easy to work with - but I am a firm believer in keeping things as simple as possible (but no simpler.) Sometimes simple gets in the way and you need a little more complexity, but I have concluded that this is not one of those cases.
I once measured copy-make and I think the program spends 2% of its time there now. (At one time Doch spent about 3% of the time, but it's not clear to me how accurate these measurements are.) That does not make Komodo 2% slower than it has to be, because making and unmaking a move would also have overhead. I don't really know if make/unmake would give me a speedup, but the upper bound on the potential speedup is 2 percent and my estimate of the true difference is zero. It's possible that copy-make is faster, but I'm not claiming that. I don't really know.
So I did an experiment to test the theory that copying state is blowing out the cache and slowing the program to a crawl - which seems to be the big fear being put out there. Before getting to that, I want to mention that most good chess programs use a number of very large data structures, and that storing a few position states is a trivial amount of space in comparison.
Komodo has a position state that is 208 bytes. It used to be 192 bytes for alignment purposes but it has grown - I can get it back down to 192 easily enough, but that is beside the point.
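For concreteness, here is one hypothetical way a 192-byte, cache-line-aligned state could be laid out. The field names and sizes are guesses for illustration only, not Komodo's actual layout, and the `aligned` attribute is the GCC/Clang spelling; besides aligning the start address, it rounds `sizeof` up to a whole number of 64-byte lines.

```c
#include <stdint.h>

typedef struct __attribute__((aligned(64))) {
    uint64_t bitboards[12];   /* one per piece type and color:  96 bytes */
    uint64_t occupied[2];     /* per-side occupancy:            16 bytes */
    uint64_t hash;            /* Zobrist key:                    8 bytes */
    uint8_t  board[64];       /* mailbox:                       64 bytes */
    uint8_t  castling, ep_square, fifty, stm;  /*                4 bytes */
} State;                      /* 188 bytes of payload, padded to 192 */
```

Keeping the struct at an exact multiple of the line size means each copy dirties exactly 3 lines instead of the 4 that an unpadded 208-byte state occupies.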
So the experiment is to see how much slower the program gets if I double the size of the position state. I added a character array, also 208 bytes large, in the middle of the state struct definition. I did not bother to align it in any special way, but I doubt that matters in this case as it is on a 4 byte boundary anyway. Then I doubled the size of that array to get a state 3x larger than is normal in Komodo.
I ran 3 timings (50 positions to 13 ply) and took the median value as the time.
When I doubled the state size, the program slowed down by 0.7 percent - less than 1 percent, but non-trivial. When I tripled the state size it slowed down by 1.4 percent compared to the regular version. So the state size appears to matter and is worthy of consideration. Note that by doubling and tripling the state size I suffer the additional cache overhead as well as the state copy time.
It's difficult to draw any conclusions from this because I don't know how much unmake, and the associated logic and data structures to support it, would slow things down. I would point out that make is also a little more complicated without copy-make - so the question is whether make/unmake is a major win over copy-make. I seriously doubt that it is. I would guess that with a relatively small state size it's a non-issue. If I thought the difference was more than 2 percent I would go through the pain of changing the program, but I doubt it's a slowdown at all.
I know this is not proof of anything, but Komodo manages to be a very strong program using this scheme. As far as I know it's the only program that does this, so perhaps I am missing something big that would propel us over the level of Houdini if I change it.
It would not be too difficult to convert my program - I could keep everything the same but have a special routine that makes and unmakes a move to a target state. However, guess what? Now I would have to add more code and global variables to maintain the information needed to undo a move and do repetition testing. In other words you STILL have to maintain state, now it's just more complicated.
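That leftover state can be sketched as the small undo record that make/unmake schemes push once per ply. The field names and `MAX_PLY` are illustrative assumptions; the point is that everything a move destroys irreversibly has to be squirreled away somewhere before make() runs, so unmake() can restore it.

```c
#include <stdint.h>

#define MAX_PLY 128  /* assumed search-depth limit */

/* Everything a move destroys that cannot be recomputed: make() fills
   one of these in, unmake() restores the position from it. */
typedef struct {
    uint8_t  captured;   /* captured piece, if any */
    uint8_t  castling;   /* castling rights before the move */
    uint8_t  ep_square;  /* en passant square before the move */
    uint8_t  fifty;      /* halfmove clock before the move */
    uint64_t hash;       /* hash before the move, kept for repetition tests */
} Undo;

static Undo undo_stack[MAX_PLY];  /* the extra (often global) state */
```

With copy-make, all of these fields simply live inside the per-ply position state and never need a separate stack.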
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: performance of copy-make
Early Crafty's were copy/make, because that was highly efficient on the Cray with its ridiculous (even by the PCs of today) memory bandwidth. But my question to him was: how does copy-make make a parallel split easier? Whatever you do, you have to replicate the board and everything N times to use N threads, whether you have a single board (as I do) in a structure unique to each thread, or N boards.

Don wrote:
The first time I started using copy-make it was for the idea of simplifying the code, getting rid of global variables and avoiding the complexities and extra conditional branches involved in unmaking a move. It's often, but not always, the case that simplifying the code comes with many advantages, including performance.
The problem I would have with copy/make is that I allow _any_ thread to help any other thread, and then back up results. So you end up copying a lot of stuff at split points unless you choose to use a "master/slave" sort of arrangement where only the master can back anything up.
But ignoring that, since it can be dealt with, what is the benefit of copy/make with regard to doing a parallel split? I don't see any. And when I changed Crafty, it did not affect parallel search at all..
-
- Posts: 900
- Joined: Tue Apr 27, 2010 3:48 pm
Re: performance of copy-make
Don wrote:
I once measured copy-make and I think the program spends 2% of its time there now. (At one time Doch spent about 3% of the time, but it's not clear to me how accurate these measurements are.) That does not make Komodo 2% slower than it has to be, because making and unmaking a move would also have overhead. I don't really know if make/unmake would give me a speedup, but the upper bound on the potential speedup is 2 percent and my estimate of the true difference is zero. It's possible that copy-make is faster, but I'm not claiming that. I don't really know.

Actually the upper bound is bigger than 2 percent, due to the cache issue.
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: performance of copy-make
The primary benefit is that unmake goes away, and thus costs nothing and does not produce bugs. I think it may be faster because of this, since I don't believe make is twice as slow as it would normally be with copy-make. Also, there are fewer opportunities for conditional branches to miss.

bob wrote:
Early Crafty's were copy/make, because that was highly efficient on the Cray with its ridiculous (even by the PCs of today) memory bandwidth. But my question to him was: how does copy-make make a parallel split easier? Whatever you do, you have to replicate the board and everything N times to use N threads, whether you have a single board (as I do) in a structure unique to each thread, or N boards.
Like malloc/free and open/close, your code has to be peppered with unmake() calls and you cannot forget to match them up properly - and when modifying code, I am sure this will be an extra source of errors to annoy you.
So I guess the short answer is that there is no major benefit to either method that I can tell, other than what I laid out and any relative performance difference between the two - but I don't even know which is faster. Somehow I don't feel that 200 bytes per position is a major cache blowout issue when other data structures in chess programs are orders of magnitude larger than this.
I may take the time to implement unmake just to see for myself - if I do, I will report the results to you and the forum. I just don't look forward to building the unmake routine and getting it right for castling, en passant, promotions, etc.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: performance of copy-make
My make/unmake have _nothing_ to do with my parallel search effectiveness. I'm not talking about performance. I'm certain make/unmake is faster, because it always works on the same data, with likely cache hits, as opposed to copy/make, which always works on different data since it is copied first for every make operation. How significant? I don't know today. On the original Pentium it was a _huge_ performance hit - maybe 2-3x slower at a minimum. But that had just an L1 cache, and a very small one at that. Today? I have not tried it. With today's 4M+ L2 caches, and newer machines with L3 (or older but non-x86 machines), it might not be that much of a win, although I cannot imagine that it can't be measured.

Don wrote:
The primary benefit is that unmake goes away, and thus costs nothing and does not produce bugs. I think it may be faster because of this, since I don't believe make is twice as slow as it would normally be with copy-make. Also, there are fewer opportunities for conditional branches to miss.
But the poster said it made his parallel split stuff simpler, and I could not understand how it is even related to parallel splits at all... I don't use board information there, nor do I make moves while splitting. I just have to copy the board to each thread that is going to work together on the same position, and that's true for copy/make or make/unmake... unless someone can point out something I am missing...
IMO the advantage of make/unmake is that if you have (say) a 256 byte "board state", then that is 4 cache lines, which will likely fall in 4 different sets. Since they get referenced at every node, their LRU age will always be low and they will stick around. With copy/make, each time you copy you abandon the old state for a while, but even worse, you create a new one that dislodges 4 more cache blocks (again assuming your 208 number, which is <= 256 but > 192, so it takes 4 blocks - if you make sure the copy/make data is aligned to a 64 byte boundary, of course).
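The arithmetic behind that count can be made explicit with a one-liner (a sketch assuming a 64-byte line and a state aligned to a line boundary; an unaligned 208-byte copy could straddle a fifth line):

```c
#define CACHE_LINE 64

/* Cache lines touched by a line-aligned block of 'bytes' bytes. */
static unsigned lines_touched(unsigned bytes)
{
    return (bytes + CACHE_LINE - 1) / CACHE_LINE;
}
```

So 208 bytes rounds up to 4 lines per copied state, while trimming the state back to 192 bytes would save one line per ply.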
The only thing I have not measured since 1995/1996 is the cost of copy/make. I could probably convert Crafty to copy/make quite easily (it is a lot harder going the other way, been there, done that, got a half-dozen T-shirts from the effort required) and measure this, as it is something worth knowing accurately.
Being kind to cache is the single biggest performance improvement one can make...
Last edited by bob on Thu Aug 04, 2011 10:08 pm, edited 1 time in total.