Baffling multithreading scaling behavior

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Baffling multithreading scaling behavior

Post by TomKerrigan »

Good idea. I've never profiled multithreaded code before (or when I did, I just looked at the single thread data that I was interested in). Guess I should get used to that if I'm going to work on this parallel algorithm more. :)
sasachess
Posts: 24
Joined: Wed Nov 05, 2014 11:28 am
Location: Italy

Re: Baffling multithreading scaling behavior

Post by sasachess »

matthewlai wrote: vTune will probably help, but maybe try free options like gprof first? If you look at the profile of a single-threaded search vs multiple searches in a process, that may give you some hints (if some function takes a higher % of time in the multiple threads case).
gprof doesn't work well with multithreaded applications...

http://sam.zoy.org/writings/programming/gprof.html
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Baffling multithreading scaling behavior

Post by bob »

matthewlai wrote:
bob wrote:
matthewlai wrote:
jdart wrote:I have a box with the same CPU.

One thing to try is to disable hyperthreading in the BIOS. Then at least you can be sure threads are allocated to physical cores.

This is a NUMA machine so there is a penalty for memory access that is not local to a CPU. Without explicitly requesting it I don't think you have any guarantee that your four cores are allocated on the same CPU. So that is a factor although not typically a huge one.

--Jon
Threads should already be allocated evenly to physical cores. That was a problem when hyper-threading first came out more than 10 years ago, but all modern schedulers are very aware of it now, and don't treat all the virtual CPUs the same.

NUMA auto-balancing (moving threads closer to memory they use, or memory closer to threads, at run time) is also in Linux, but was only introduced 2 years ago - https://www.linux-kvm.org/images/7/75/0 ... ancing.pdf
This HAS been an issue within the last 3-4 years also. Intel has not always been consistent on how the logical processors are numbered. On some Intel chips, logical processors 0 and 1 share a physical core. On others, 0 and 8 share a physical core (assuming 8 physical cores). I ran afoul of this even though I use processor affinity in Crafty to lock each thread to a specific core. I have an "affinity" command that lets me specify how the physical cores are numbered to avoid this. And even worse, when Intel went to non-power-of-2 core counts, that broke some of the linux internal processor affinity stuff as well because it is more convenient to do an AND rather than a DIV instruction.
Wouldn't that be _because_ you use affinity, instead of _even though_ you use affinity? If you don't pin your threads, and the scheduler is aware of the new numbering (since presumably it looks at the core ID of each core instead of relying on a specific numbering), it should do the right thing?
No. I put the affinity code in to address this problem; the Linux scheduler itself was confused about which logical processors shared a physical core and was getting this wrong. We don't update kernels very frequently on our cluster boxes, as that can break or change things that were working earlier...

That particular kernel assumed that odd-numbered "processors" were the second logical processor on a core, i.e. 0 and 1 on the same core. But that wasn't the case, and this particular kernel simply didn't "grok" the idea of 10 physical cores where 0-9 were one per physical core and 10-19 shared those cores with the first 10.
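
For illustration only, here is a minimal sketch of pinning each search thread to an explicit logical CPU on Linux. The core_map table stands in for a user-configurable affinity mapping; it is not Crafty's actual affinity code.

Code:

// Sketch only: pin each search thread to an explicit logical CPU on Linux.
// core_map is a stand-in for a user-configurable "affinity" setting, letting
// you adapt to however the chip numbers its logical processors.
// (CPU_ZERO/CPU_SET need _GNU_SOURCE, which g++ defines by default on Linux.)
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Example mapping for a 2 x 10-core box where logical CPUs 10-19 are the
// hyperthreaded siblings of 0-9: pinning thread i to logical CPU i gives each
// search thread its own physical core.
static const int core_map[20] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                                 10, 11, 12, 13, 14, 15, 16, 17, 18, 19};

void pin_to_core(int thread_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_map[thread_id], &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        std::fprintf(stderr, "warning: could not set affinity for thread %d\n",
                     thread_id);
}

Each search thread would call pin_to_core() with its own index right after it starts; only the mapping table needs to change when a chip numbers its logical processors differently.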
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Baffling multithreading scaling behavior

Post by bob »

matthewlai wrote:
cdani wrote:
TomKerrigan wrote:Yeah, I only check every 1024 nodes (easy to test) ... still, I've been meaning to increase that number.
I calculate on the fly how many nodes I should wait before checking the clock, so that it is checked, say, every 0.05 seconds, or more often if the time control is really short. Of course this number varies a lot on different CPUs.
I don't check the time anymore. At the beginning of each search, I start a thread that sleeps for the allocated amount of time, and sets a flag when it wakes up. The search thread only checks the flag.

Well, ok, it's slightly more complicated than that, because searches can be interrupted. It actually waits on a condition variable with a timeout equal to the duration, so it can wake up early if we want to terminate the search early (by notify()ing on the condition variable).

I have 3 threads for a single-threaded search - the main I/O processing thread, a search thread, and a timer thread.
That doesn't work so well for me as my "target time" changes frequently depending on whether the search fails high, fails low or locks in to the same move... I dynamically calculate a "nodes_between_time_checks" variable that uses the target time. IE if the target time is 2-3 minutes, I only need to check the time once a second to be perfectly safe. If the target time is 0.1 seconds, I want to check it much more frequently...
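
For illustration, a minimal sketch of the "nodes between time checks" idea described above: recompute the check interval from the current time target, using a rough nodes-per-second estimate. All names and constants here are made up for the example, not taken from any particular engine.

Code:

// Sketch only: recompute how many nodes to search between clock checks,
// based on the current time target. Names and constants are illustrative.
#include <algorithm>
#include <cstdint>

constexpr double kEstimatedNps = 2'000'000.0;  // rough nodes/second estimate
constexpr double kChecksPerTarget = 20.0;      // aim for ~20 checks per target

// Check often enough that a 0.1-second target is still honored, but no more
// than about once per second of search when the target is minutes long.
std::int64_t nodes_between_time_checks(double target_seconds) {
    double seconds_per_check =
        std::min(1.0, target_seconds / kChecksPerTarget);
    return std::max<std::int64_t>(
        1024, static_cast<std::int64_t>(kEstimatedNps * seconds_per_check));
}

With these numbers, a 0.1-second target gives a check every few thousand nodes, while a 2-3 minute target settles at roughly one check per second of search.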
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Baffling multithreading scaling behavior

Post by TomKerrigan »

Can you turn off hyperthreading and avoid the whole problem?

The machine I just bought allows me to control this in the BIOS.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Baffling multithreading scaling behavior

Post by bob »

TomKerrigan wrote:Can you turn off hyperthreading and avoid the whole problem?

The machine I just bought allows me to control this in the BIOS.
So far as I know, yes. I do that on all the machines we use. IBM donated one of their new PowerPC boxes: 20 physical cores, but with their massive hyper-threading (8 threads per core) it was pretty ridiculous. However, I never had time to investigate whether it could be turned off, as several people were interested in testing on it since it also has one terabyte of DRAM.

The 2660 (2 x 10 cores) I use a lot has hyper-threading disabled... You also have to watch turbo boost if you want to do speedup measurements. IE my 2660 system is a 2.6 GHz box, but it will run 20 threads/cores at 2.9 GHz (with Crafty). Using 1 thread sees that go up another 0.3 GHz, which screws up one-cpu vs 20-cpu timing measurements.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Baffling multithreading scaling behavior

Post by TomKerrigan »

My engine shares several arrays/matrices between threads.

I assumed this would be fine because the arrays are never modified.

Something must be wrong with my understanding of memory/cache coherency systems because having multiple copies of these arrays (one per thread) sped up my engine from 70% to 90% of what I should be able to achieve with 16 cores.

(Ideally I want the same performance as running 16 separate processes, which is surprisingly and impressively just as fast as running 1 process. I checked.)

Not sure how I'm going to attack that last 10% but maybe there are some more global variables that are gumming up the works...
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Baffling multithreading scaling behavior

Post by matthewlai »

TomKerrigan wrote:My engine shares several arrays/matrices between threads.

I assumed this would be fine because the arrays are never modified.

Something must be wrong with my understanding of memory/cache coherency systems because having multiple copies of these arrays (one per thread) sped up my engine from 70% to 90% of what I should be able to achieve with 16 cores.

(Ideally I want the same performance as running 16 separate processes, which is surprisingly and impressively just as fast as running 1 process. I checked.)

Not sure how I'm going to attack that last 10% but maybe there are some more global variables that are gumming up the works...
Maybe those entries are getting kicked out of cache sometimes? It shouldn't make a difference otherwise.

The other possibility is that parts of those tables share the same cache lines as data that is modified. That would also result in cache invalidation even though the tables aren't modified.
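
For illustration, a minimal sketch of that second possibility: a read-only table whose tail can land in the same 64-byte cache line as a hot counter, and one way to separate them with alignas. The names and sizes are made up for the example.

Code:

// Sketch only: how a read-only table can be dragged into cache-coherency
// traffic by sharing a cache line with data that is written.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Problematic layout: the tail of piece_square[] and node_counter may share a
// 64-byte cache line, so every increment of node_counter invalidates that line
// in every other core's cache, even though piece_square[] is never written.
int piece_square[64];
std::atomic<std::uint64_t> node_counter{0};

// One fix: cache-line-align the read-only table (and the hot counter) so they
// can never end up in the same line.
alignas(64) int piece_square_aligned[64];
alignas(64) std::atomic<std::uint64_t> node_counter_aligned{0};

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([] {
            long sum = 0;
            for (int i = 0; i < 1'000'000; ++i) {
                sum += piece_square_aligned[i & 63];  // read-only accesses
                node_counter_aligned.fetch_add(1, std::memory_order_relaxed);
            }
            std::printf("thread sum: %ld\n", sum);
        });
    for (auto& th : threads) th.join();
}
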
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Baffling multithreading scaling behavior

Post by bob »

TomKerrigan wrote:My engine shares several arrays/matrices between threads.

I assumed this would be fine because the arrays are never modified.

Something must be wrong with my understanding of memory/cache coherency systems because having multiple copies of these arrays (one per thread) sped up my engine from 70% to 90% of what I should be able to achieve with 16 cores.

(Ideally I want the same performance as running 16 separate processes, which is surprisingly and impressively just as fast as running 1 process. I checked.)

Not sure how I'm going to attack that last 10% but maybe there are some more global variables that are gumming up the works...
Sharing things that are read-only is fine. They will end up in all caches, and they won't be swapped around. The problem hits when you have one cache block that gets modified in multiple threads. That results in a LOT of cache traffic with all the forwarding and invalidate requests being shipped around.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Baffling multithreading scaling behavior

Post by TomKerrigan »

bob wrote:
TomKerrigan wrote:My engine shares several arrays/matrices between threads.

I assumed this would be fine because the arrays are never modified.

Something must be wrong with my understanding of memory/cache coherency systems because having multiple copies of these arrays (one per thread) sped up my engine from 70% to 90% of what I should be able to achieve with 16 cores.

(Ideally I want the same performance as running 16 separate processes, which is surprisingly and impressively just as fast as running 1 process. I checked.)

Not sure how I'm going to attack that last 10% but maybe there are some more global variables that are gumming up the works...
Sharing things that are read-only is fine. They will end up in all caches, and they won't be swapped around. The problem hits when you have one cache block that gets modified in multiple threads. That results in a LOT of cache traffic with all the forwarding and invalidate requests being shipped around.
Okay, that was my understanding as well.

I think Matthew was on to something: maybe these arrays that I duplicated were laid out in memory next to something the threads WERE modifying.

Probably the best thing to do is put as much data as possible into the engine class to avoid this sort of possibility. Back when I was writing most of this code in 1997-1998, one of my goals was to conserve memory, but now that we have computers with many gigabytes of RAM, duplicating some small tables seems like a complete non-issue.
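
As an illustration of that last point, a minimal sketch of keeping per-thread tables inside a search object instead of in globals; all class and member names here are hypothetical (build with -std=c++17 -pthread).

Code:

// Sketch only: give each search thread its own copy of small tables by
// keeping them inside a per-thread search object rather than in globals.
#include <algorithm>
#include <array>
#include <cstdint>
#include <thread>
#include <vector>

struct alignas(64) SearchThread {
    // Small, read-mostly tables duplicated per thread: at a few KB each, the
    // extra memory is a non-issue, and no thread's tables can share a cache
    // line with data another thread writes.
    std::array<int, 64 * 12> piece_square_table{};
    std::array<std::int16_t, 64 * 64> history{};
    std::uint64_t nodes = 0;

    void search_root() {
        // ... iterative-deepening loop that touches only this object's data ...
    }
};

int main() {
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<SearchThread> workers(n);
    std::vector<std::thread> pool;
    for (auto& w : workers)
        pool.emplace_back([&w] { w.search_root(); });
    for (auto& t : pool) t.join();
}
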