Questions on SMP search

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

Houdini wrote:
bob wrote:Hand-waving does _not_ cut it here. "slightly more efficient code" is a crock. Why? The search and evaluation represents a ton of instructions and read-only data, in addition to the usual read/write data. Many Intel processors have shared L2 or shared L3 cache. Why, on a 4/8 core chip, do you want to have N copies of essentially the same code in a shared cache, when one will do and stress cache much less?

Several of us have actually tested these kinds of ideas extensively. Passing a pointer to local data around has no significant cost on Intel. When I modified Crafty to use parallel search (version 15.0, almost 15 years ago) I expected a 10% slow-down in execution speed. I didn't see any appreciable loss. And avoiding duplicate code in a shared cache is significant.

If that was the only way you could figure out how to do a parallel search, that's fine. But please don't try to pawn it off as "the most efficient way." It isn't.
LOL, one has to love your condescending tone ("If that was the only way you could figure out how to do a parallel search").
As I wrote earlier, I can disable this feature with a simple compilation switch.
This means I can compare the two versions in a matter of seconds. And as I said before, it generates a 3% to 5% speed-up for Houdini.

Robert
So with a simple compile switch, you can change from N instances of each block of code, to a single instance and passing a pointer around? Somehow I don't think so.

Comparing the speed of one cpu vs one cpu using per instance vs pointer doesn't exactly address the issue being discussed. What happens when you run on a 24 or 32 or 64 core box? That is more interesting as that is where you are far more likely to run into cache issues...

That was the reason for my "condescending tone" since you are implying you can do one thing, but really do another...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

Houdini wrote:
marcelk wrote:Indeed. Feel free to submit an abuse complaint to the moderators of the group. What is annoying are people who are claiming others' work as their own.
Even if you are annoyed by anything I or anyone else did or said, there still is no reason for you to jump into Ben-Hur Carlos' technical thread and hijack it for the sole purpose of expressing your personal frustrations.

People like Bob call that a "live adult discussion", I call that a lack of consideration for the original poster. It significantly reduces the quality of the exchanges on this forum. If that was your goal, you have succeeded.

Robert
There _is_ often a lack of quality here. Most of that is due to evasiveness. Ask me a question and I will give you absolute facts, if available, and an opinion or statement in reply. I won't shift the topic to something else or give evasive answers such as "why did you ask that?" or "why do you want to talk about that?", which reminds me of the old "doctor / Eliza" program that tried to carry on a natural conversation with someone, but failed miserably.
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Questions on SMP search

Post by Houdini »

bob wrote:So with a simple compile switch, you can change from N instances of each block of code, to a single instance and passing a pointer around? Somehow I don't think so.
Yes, I can.
Let me show you an actual code snippet of a function definition:

Code:

template<$$def_THREAD_ID$$ Kleur IK, bool MULTICORE> 
int Knoop_All($$def_STELLING$$ int BETA, int DIEPTE)
{
	const Kleur JIJ = (IK == WIT ? ZWART : WIT);
	rec_stelling_var *VAR_0 = VAR;
...
and of a call to this function:

Code:

	waarde = -Knoop_All<$$THREAD_ID$$ JIJ, MULTICORE>($$STELLING$$ 1 - BETA, diepte_nieuw);
All the $$ things are C++ macros that are modified through compiler switches; they can be empty or not. There's obviously a bit more to it than what you see here, but you get the idea. C++ is a *very* powerful language indeed...

If you're really interested I could send you the two compilations, but I sincerely hope that at this point you've somehow accepted that not everyone is following the Bob Hyatt cookbook :).
bob wrote:Comparing the speed of one cpu vs one cpu using per instance vs pointer doesn't exactly address the issue being discussed. What happens when you run on a 24 or 32 or 64 core box? That is more interesting as that is where you are far more likely to run into cache issues...
I've never tested beyond 16 cores, but up to 16 my thread-specific code version is consistently faster.
bob wrote:That was the reason for my "condescending tone" since you are implying you can do one thing, but really do another...
Your condescending tone was misplaced.

Robert
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Questions on SMP search

Post by wgarvin »

bob wrote:
bhlangonijr wrote:
bob wrote: What exactly do you mean by "atomic operations"??? Using a lock to protect an entry?
You don't need to use a lock to protect a history entry, since it is usually an integer type, so the cpu will use only a single instruction to update it. There is no concurrency issue as it is done atomically. I suppose that is what he means...
He's wrong. You have to read a value from memory, update it, and write it back. That is _not_ done atomically unless you do something like an add with a lock prefix. This is subject to the classic "interleaved update" problem. Two cpus, A and B. A and B simultaneously read a value for X. Each adds 1 to that value, then each writes X back to memory. When you are done, X was incremented by 1, not 2.

If a single instruction, such as add [history + 8 * edx], eax is done, the CPU has to do a read and a write, and they most definitely are _not_ done atomically unless you request it with a lock prefix. And a compiler doesn't do that normally.
While it's true that the whole operation (read-modify-write) is not atomic, I thought he just meant that the read is atomic and the write is atomic.
So at worst you'll miss an increment on the "interleaved update", as long as you don't do anything with the counters besides read them and increment them.
(I guess it might be different if there's an occasional operation triggered, like "divide all of the history counters by two" -- in that case, interleaved updates during the rescaling might be a small but real problem.)

Houdini wrote:
bob wrote:if you use only 2mb, you are faced with this:

log.001: time=3:00 mat=0 n=441934327 fh=91% nps=2.5M
log.002: time=3:00 mat=0 n=490668472 fh=91% nps=2.7M

Both searches are for 180 seconds. Which one would you rather toss out in a game? The first uses hashp=2m, the second uses hashp=128m. I know which one _I_ want. I'll take the almost 10% faster version myself. And when you have 24 of those, you are impacting cache significantly. And with no down-side to a shared cache, this is an idea that makes no sense to me.
You report results obtained with Crafty, I report Houdini results.
Houdini uses a 2 MB pawn hash per thread and typically spends less than 2 % of the time in the pawn hash code.
While your results are certainly interesting, they're not relevant for Houdini.
bob wrote:If this is a good idea, why not local transposition tables as well?
There is a fundamental difference between the pawn hash and the main transposition table.
The pawn hash is just for speed-up, it doesn't alter the actual search process.
The main hash modifies the search process very significantly, not sharing it would be counterproductive.
bob wrote:You won't see significant NUMA issues until you go beyond two physical processor chips (not cores). And the pawn hash trick I explained will eliminate about half of the pawn hash table accesses anyway. The main trans/ref table is _much_ worse because it is continually getting written to, roughly once per node searched. The pawn hash doesn't get modified much if it is big enough to prevent excessive overwrites.
My NUMA development and tests are on a 16 core box with 4 NUMA nodes.
Like Marco explained, your pawn hash trick is just a manual caching method that can be dealt with very efficiently by the normal memory caching mechanism including prefetching.

To each his own...

Robert
If your pawn hash was globally shared and only 2 MB, the entire thing could fit in an L3 cache, which is often 6-8 MB or larger. But if you have twelve of those, they'll total 24 MB and won't fit.

Regardless, bob is right that you're increasing the cache pressure by not sharing them. Writes are rare, so they shouldn't cause much contention. You get pawn hash hits >95% of the time, right? With a shared table, some of those hits will land in the same cache lines and be cheap, rather than each read evicting something and going out to a deeper level of cache or to main memory. With separate tables per thread, they will definitely land in different cache lines and increase contention in the smaller caches.

In fact... depending on how you allocated the 2 MB buffers, it's possible (and even rather likely) that when looking up the same pawn hash key in 2 different threads, there is a false collision due to aliasing, and one thread's copy of the entry for that key gets evicted from L1 or L2 to make room for the other thread's independent entry for the same key. I don't know how commonly this occurs, but it's surely an ironic outcome if the intention behind not sharing the tables was to reduce cache contention. :lol:
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Questions on SMP search

Post by Houdini »

wgarvin wrote:In fact... depending on how you allocated the 2 MB buffers, it's possible (and even rather likely) that when looking up the same pawn hash key in 2 different threads, there is a false collision due to aliasing, and one thread's copy of the entry for that key gets evicted from L1 or L2 to make room for the other thread's independent entry for the same key. I don't know how commonly this occurs, but it's surely an ironic outcome if the intention behind not sharing the tables was to reduce cache contention. :lol:
Your final comment about the memory allocation is spot on, it's important to make sure that the stride between the data structures of different threads is carefully chosen.

For the rest we could argue forever about cache contention and pollution, but ultimately the proof of the pudding is in the eating.
Houdini achieves very good results with a relatively small, thread-local pawn hash, and with thread-local code hardwired to a board representation. Other engines may require a different recipe, only tests can tell.

Robert
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

You will discover that when you continually make misleading statements, from "I modified Robo." to "Houdini is a completely new program." to "I have been working on computer chess for over 30 years" and such, many of us are going to take _everything_ you say with a huge block of salt.

Not much more I can say...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

wgarvin wrote:
bob wrote:
bhlangonijr wrote:
bob wrote: What exactly do you mean by "atomic operations"??? Using a lock to protect an entry?
You don't need to use a lock to protect a history entry, since it is usually an integer type, so the cpu will use only a single instruction to update it. There is no concurrency issue as it is done atomically. I suppose that is what he means...
He's wrong. You have to read a value from memory, update it, and write it back. That is _not_ done atomically unless you do something like an add with a lock prefix. This is subject to the classic "interleaved update" problem. Two cpus, A and B. A and B simultaneously read a value for X. Each adds 1 to that value, then each writes X back to memory. When you are done, X was incremented by 1, not 2.

If a single instruction, such as add [history + 8 * edx], eax is done, the CPU has to do a read and a write, and they most definitely are _not_ done atomically unless you request it with a lock prefix. And a compiler doesn't do that normally.
While it's true that the whole operation (read-modify-write) is not atomic, I thought he just meant that the read is atomic and the write is atomic.
So at worst you'll miss an increment on the "interleaved update", as long as you don't do anything with the counters besides read them and increment them.
(I guess it might be different if there's an occasional operation triggered, like "divide all of the history counters by two" -- in that case, interleaved updates during the rescaling might be a small but real problem.)
The only usage for atomic in a C statement like:

history[index] = history[index] + value;

would be to make certain that history[index] is properly updated each time a thread wants to do so.

Clearly that is not an "atomic operation", in that multiple threads can simultaneously load and then store the value, losing all but the last store that was done. I was not considering the issue of grabbing a value where some bits are correct and some have just been partially updated and are wrong, at least when talking about normal values. I suppose one can even have a value that overlaps two cache blocks and really get screwed up, unless the typical lock is acquired before the modify and released afterward.

No idea if he rescales the counters as I did in Crafty, and yep, that's another problem that greatly increases the chances of a lost update (or a few of them).


Houdini wrote:
bob wrote:if you use only 2mb, you are faced with this:

log.001: time=3:00 mat=0 n=441934327 fh=91% nps=2.5M
log.002: time=3:00 mat=0 n=490668472 fh=91% nps=2.7M

Both searches are for 180 seconds. Which one would you rather toss out in a game? The first uses hashp=2m, the second uses hashp=128m. I know which one _I_ want. I'll take the almost 10% faster version myself. And when you have 24 of those, you are impacting cache significantly. And with no down-side to a shared cache, this is an idea that makes no sense to me.
You report results obtained with Crafty, I report Houdini results.
Houdini uses a 2 MB pawn hash per thread and typically spends less than 2 % of the time in the pawn hash code.
While your results are certainly interesting, they're not relevant for Houdini.
bob wrote:If this is a good idea, why not local transposition tables as well?
There is a fundamental difference between the pawn hash and the main transposition table.
The pawn hash is just for speed-up, it doesn't alter the actual search process.
The main hash modifies the search process very significantly, not sharing it would be counterproductive.
bob wrote:You won't see significant NUMA issues until you go beyond two physical processor chips (not cores). And the pawn hash trick I explained will eliminate about half of the pawn hash table accesses anyway. The main trans/ref table is _much_ worse because it is continually getting written to, roughly once per node searched. The pawn hash doesn't get modified much if it is big enough to prevent excessive overwrites.
My NUMA development and tests are on a 16 core box with 4 NUMA nodes.
Like Marco explained, your pawn hash trick is just a manual caching method that can be dealt with very efficiently by the normal memory caching mechanism including prefetching.

To each his own...

Robert
If your pawn hash was globally shared and only 2 MB, the entire thing could fit in an L3 cache, which is often 6-8 MB or larger. But if you have twelve of those, they'll total 24 MB and won't fit.

Regardless, bob is right that you're increasing the cache pressure by not sharing them. Writes are rare, so they shouldn't cause much contention. You get pawn hash hits >95% of the time, right? With a shared table, some of those hits will land in the same cache lines and be cheap, rather than each read evicting something and going out to a deeper level of cache or to main memory. With separate tables per thread, they will definitely land in different cache lines and increase contention in the smaller caches.

In fact... depending on how you allocated the 2 MB buffers, it's possible (and even rather likely) that when looking up the same pawn hash key in 2 different threads, there is a false collision due to aliasing, and one thread's copy of the entry for that key gets evicted from L1 or L2 to make room for the other thread's independent entry for the same key. I don't know how commonly this occurs, but it's surely an ironic outcome if the intention behind not sharing the tables was to reduce cache contention. :lol:
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

Houdini wrote:
wgarvin wrote:In fact... depending on how you allocated the 2 MB buffers, it's possible (and even rather likely) that when looking up the same pawn hash key in 2 different threads, there is a false collision due to aliasing, and one thread's copy of the entry for that key gets evicted from L1 or L2 to make room for the other thread's independent entry for the same key. I don't know how commonly this occurs, but it's surely an ironic outcome if the intention behind not sharing the tables was to reduce cache contention. :lol:
Your final comment about the memory allocation is spot on, it's important to make sure that the stride between the data structures of different threads is carefully chosen.
I think you will find that the probability of hitting that accurately is either slim or none, and slim just left town... I know of no way to enforce this. It is not about "stride in virtual memory" (where you can do several tricks, such as malloc()'ing the data and dividing it up yourself); rather it is about "stride in physical memory", which you have zero control over.

For the rest we could argue forever about cache contention and pollution, but ultimately the proof of the pudding is in the eating.
Houdini achieves very good results with a relatively small, thread-local pawn hash, and with thread-local code hardwired to a board representation. Other engines may require a different recipe, only tests can tell.

Robert
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Questions on SMP search

Post by Houdini »

bob wrote:You will discover that when you continually make misleading statements, from "I modified Robo." to "Houdini is a completely new program." to "I have been working on computer chess for over 30 years" and such, many of us are going to take _everything_ you say with a huge block of salt.
Amazing, we've now come to the point where you're _falsely_ attributing quotes and "misleading statements" to me.

1) I have NEVER said that "I modified Robo". Can you please find the link to the post or web page where I allegedly have said that?

2) I've acknowledged from the start the strong influence of both Ippo and Stockfish, and in a minor part Crafty. I still do on my web site and in the readme file of Houdini. But Houdini also contains many original ideas and techniques, this discussion has been a small demonstration of that.

3) I've been involved in computer chess since around 1985. That makes about 25 years, not your "over 30 years".

Enough time wasted for now in another anti-Houdini hijacking of a technical thread, I'll let it rest here...

Cheers!
Robert
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

Houdini wrote:
bob wrote:You will discover that when you continually make misleading statements, from "I modified Robo." to "Houdini is a completely new program." to "I have been working on computer chess for over 30 years" and such, many of us are going to take _everything_ you say with a huge block of salt.
Amazing, we've now come to the point where you're _falsely_ attributing quotes and "misleading statements" to me.

1) I have NEVER said that "I modified Robo". Can you please find the link to the post or web page where I allegedly have said that?

2) I've acknowledged from the start the strong influence of both Ippo and Stockfish, and in a minor part Crafty. I still do on my web site and in the readme file of Houdini. But Houdini also contains many original ideas and techniques, this discussion has been a small demonstration of that.

3) I've been involved in computer chess since around 1985. That makes about 25 years, not your "over 30 years".

Enough time wasted for now in another anti-Houdini hijacking of a technical thread, I'll let it rest here...

Cheers!
Robert
I'm not going to debate this ad nauseam. You come out of the blue, with a new program, top of the rating chart, a supposedly original work. Pure bilgewater.

If you want to earn someone's respect, you have to _earn_ it. Not _copy_ it.

I don't think there's a person left on the planet that believes Houdini is anything other than a modified Robolito. For good reason. From comparing output. To other methods of comparison. So live in your fantasy world. I'm not part of it...

I have been involved in computer chess since 1968, and been playing in ACM/ICCA/ICGA/etc tournaments since 1976. Not once has your name come up until "along comes Houdini." So spare me the "I have been involved for 25 years" nonsense. 25 years or 30 years doesn't matter. Neither is factual...

Now please go back to whatever it was you were doing for the past 25 years...