thread affinity

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Joost Buijs
Posts: 1564
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: thread affinity

Post by Joost Buijs »

mar wrote:
Joost Buijs wrote:I've tried several times in the past to set the affinity mask on Win7 x64, and it didn't give me any speedup whatsoever. Actually, I have the impression that it hurts a little.
It is not very easy to detect small differences, because SMP has a tendency to be somewhat random in nature.
Since you are seeing a speedup of 7% with the affinity set, it must have something to do with differences in the architecture of the programs themselves.
That's interesting; I'd guess that it should be no worse unless other threads are running.
As I thought, it's a common idea to try, but what I measured seems like too much of a difference for my taste.
Yes, as I stated, it's not a chess program (in fact it's a simple kd-tree accelerated trimesh raytracer, so it should be trivial to parallelize).
I found it to be slightly worse, but the difference was so small that it could have been noise.
On my system there are usually many threads running; they don't take much CPU time, but that could explain small differences.
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: thread affinity

Post by jdart »

The issue is that when a thread is migrated to another core, its memory is not migrated. So now that thread is doing non-local memory access. This is true both under Windows and Linux. See https://software.intel.com/en-us/articl ... s-for-numa.

That is why pinning threads to cores may be a good idea.

--Jon
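For anyone who wants to experiment with this, here is a minimal sketch of pinning the calling thread to one core on Linux, assuming glibc/pthreads (the core index 2 is arbitrary). On Windows the rough equivalent is SetThreadAffinityMask(GetCurrentThread(), 1ULL << core):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static int pin_current_thread(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* returns 0 on success, an error number on failure */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    if (pin_current_thread(2) != 0)
        fprintf(stderr, "could not set affinity\n");
    /* ... the search / worker code runs here, staying on core 2 ... */
    return 0;
}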
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: thread affinity

Post by matthewlai »

jdart wrote:The issue is that when a thread is migrated to another core, its memory is not migrated. So now that thread is doing non-local memory access. This is true both under Windows and Linux. See https://software.intel.com/en-us/articl ... s-for-numa.

That is why pinning threads to cores may be a good idea.

--Jon
That only applies to NUMA though. I don't think he is working on NUMA hardware.

On recent Intel CPUs only the per-core L1 and L2 contents are lost, and that's less than 1MB (it can be reloaded from the shared L3 after the migration).
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
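Whether a box is NUMA at all is easy to check rather than guess. A minimal sketch, assuming Linux with libnuma installed (link with -lnuma); a single configured node means a uniform, non-NUMA memory system:

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("no NUMA support on this system\n");
        return 0;
    }
    /* one node means a uniform (non-NUMA) memory system */
    printf("NUMA nodes: %d\n", numa_max_node() + 1);
    return 0;
}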
Joost Buijs
Posts: 1564
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: thread affinity

Post by Joost Buijs »

Indeed, none of my computers has NUMA at the moment.
In the past I had several AMD multiprocessor boxes that were using NUMA, but I abandoned them a long time ago.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: thread affinity

Post by bob »

mar wrote:
bob wrote:However, I did pin to a specific core to extract as much as possible by always using the same L1, L2 and L3, rather than just the same L3, which is all you keep when threads bounce between cores within the same chip.
I knew that L1 is per core, but I was never sure about L2/L3. I thought L2 was shared by a pair of cores, but as you said this is probably not the case.
I was surprised that the OS decided to shuffle n active threads running on n cores that much.
Depends on the processor. Some share L2, some don't. I think AMD is the one that did, although I don't know if they still do. Intel has a separate L1/L2 per core, but then again the L2 is 256KB, which is pretty small.
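Which logical CPUs share which cache level can be checked rather than guessed. A minimal sketch that reads the Linux sysfs cache topology for cpu0, assuming the usual /sys/devices/system/cpu layout (index0..index3 are typically L1d, L1i, L2 and L3):

#include <stdio.h>

int main(void)
{
    char path[128], buf[128];
    for (int idx = 0; idx < 4; idx++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list", idx);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                 /* no more cache levels exposed */
        if (fgets(buf, sizeof(buf), f))
            printf("cache index%d shared by CPUs: %s", idx, buf);
        fclose(f);
    }
    return 0;
}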
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: thread affinity

Post by bob »

mar wrote:
Joost Buijs wrote:I've tried several times in the past to set the affinity mask on Win7 x64, and it didn't give me any speedup whatsoever. Actually, I have the impression that it hurts a little.
It is not very easy to detect small differences, because SMP has a tendency to be somewhat random in nature.
Since you are seeing a speedup of 7% with the affinity set, it must have something to do with differences in the architecture of the programs themselves.
That's interesting; I'd guess that it should be no worse unless other threads are running.
As I thought, it's a common idea to try, but what I measured seems like too much of a difference for my taste.
Yes, as I stated, it's not a chess program (in fact it's a simple kd-tree accelerated trimesh raytracer, so it should be trivial to parallelize).
The place it has occasionally killed me is when I run multiple tests at the same time, i.e. on my 20-core box I can run five 4-thread tests at once. When I forgot about that, they would ALL use processors 0 through 3, which was not what I needed.

In current Crafty I have an smpaffinity command. Set it to -1 and affinity is disabled; set it to any other number and this instance of Crafty starts binding at that CPU number and above. Now when I run those five tests I use smpaffinity=0, 4, 8, 12 or 16, which keeps everyone out of each other's way...

This is probably a "dedicated system" idea anyway, since if two people run on the same machine they will see the same problem I saw. That's why I left the option to disable it (smpaffinity=-1).
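To illustrate the scheme described above (this is not Crafty's actual code, just a sketch of the idea with Linux pthreads): each worker thread i binds itself to CPU base+i, and a negative base disables affinity, so separate test runs started with base 0, 4, 8, 12 and 16 never overlap:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* NOT Crafty's actual code, just an illustration: bind worker thread i to
 * CPU (base_cpu + i); a negative base disables affinity, in the spirit of
 * smpaffinity=-1. Each worker calls this once, right after it starts. */
static void bind_worker(int base_cpu, int thread_index)
{
    if (base_cpu < 0)
        return;                        /* affinity disabled */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(base_cpu + thread_index, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}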
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: thread affinity

Post by bob »

matthewlai wrote:
mar wrote:
matthewlai wrote:Well then that's clearly not your problem :).

Another common problem is threads writing into adjacent memory locations, causing cache-line invalidation on other cores.

I'm guessing that's not your problem either :).
I was thinking along the same lines. The threads read the same memory (this should be no problem), and the only writes go to a buffer that is different for each thread.
Only the final result might, in theory, write to adjacent locations, but the work is divided so that the probability of this happening is very close to zero (I tried various other batching schemes, but it made no difference).
I also tried writing the final result to per-thread buffers as well, but with no gain at all.

It's just that I was surprised that shuffling work around cores can cause a measurable slowdown (in fact, I was surprised that the OS does something like that at all, and relatively often...).
Maybe Linux doesn't suffer from this. (By the way, a 19+ speedup on 20 cores is excellent! :)
But honestly, I was hoping for an average speedup of 3.x closer to 4, and instead I get 3.x closer to 3.
Yeah reading should be no problem.

I find it strange that moving around would cause a significant slowdown in a non-NUMA system. For example, even if every thread gets moved 10 times a second, that's 100 ms of running time per migration. A modern machine has a DRAM bandwidth of about 30 GB/s, so it would still take less than 1 ms to get everything back into cache.

One would also think that the guy/girl who wrote the scheduler would have thought of this, and made it not happen that often.

Maybe your program is now bottlenecked by memory bandwidth? Depending on your CPU, it's also possible that it's bottlenecked by shared-cache bandwidth (on recent Intel CPUs the L3 is shared, and on recent AMD CPUs the L3 is shared by all cores while each pair of cores shares an L2).
It probably takes longer than 1 ms. Your L1/L2 data is mostly thread-local, and a lot of it is dirty, which means that when your thread bounces to another CPU, the old cache and the new cache suddenly suffer through a lot of forwarding transfers to move the data between them. It is more than a 2x hit as well, when you think about the complexities of core 0 reading values that core 1 has, core 1 reading values that core 0 has, and then both reading values that neither has, which causes additional dirty write-backs before the new data can be read...

I measured a significant speedup in Crafty when I started using this. I only wish OS X had such a feature, but like everything else OS X does, they have to be different or decide something is just not important, without having a clue about what is really going on inside.
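The "adjacent memory locations" point quoted above is false sharing. A minimal sketch of the usual remedy, padding or aligning per-thread accumulators to a cache line; the 64-byte line size and the thread count are assumptions for illustration:

#include <stdalign.h>

#define NUM_THREADS 4          /* illustrative */
#define CACHE_LINE  64         /* assumed cache-line size */

/* One cache-line-aligned accumulator per thread, so writes from different
 * threads never land in the same 64-byte line (no false sharing). */
struct padded_counter {
    alignas(CACHE_LINE) unsigned long value;
};

static struct padded_counter counters[NUM_THREADS];

/* Each worker updates only counters[thread_index].value; one thread sums
 * the NUM_THREADS values after all workers have finished. */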
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: thread affinity

Post by bob »

jdart wrote:The issue is that when a thread is migrated to another core, its memory is not migrated. So now that thread is doing non-local memory access. This is true both under Windows and Linux. See https://software.intel.com/en-us/articl ... s-for-numa.

That is why pinning threads to cores may be a good idea.

--Jon
Right. IF you have a multi-package (Linux terminology) or multi-chip motherboard where NUMA becomes an issue. It is even worse if you specifically tried to address this issue by carefully touching only your local data first, so that it faults into your local NUMA node's memory: after a migration, all of that carefully placed data is suddenly on the wrong node.
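A minimal sketch of the "touch your local data first" placement described here, assuming Linux's default first-touch policy and workers that are already pinned to cores on their node; the thread count and slice size are illustrative only:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NUM_THREADS 4
#define SLICE_BYTES (64UL * 1024 * 1024)

static char *buffer;                     /* NUM_THREADS * SLICE_BYTES bytes */

static void *worker(void *arg)
{
    long id = (long)arg;
    /* Assumes this thread is already pinned to a core on its NUMA node.
     * Writing the slice here faults its pages into this node's local memory. */
    memset(buffer + id * SLICE_BYTES, 0, SLICE_BYTES);
    /* ... the thread then works only on its own slice ... */
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    buffer = malloc(NUM_THREADS * SLICE_BYTES);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    free(buffer);
    return 0;
}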