Baffling multithreading scaling behavior

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Baffling multithreading scaling behavior

Post by matthewlai »

TomKerrigan wrote:
bob wrote:
TomKerrigan wrote:My engine shares several arrays/matrices between threads.

I assumed this would be fine because the arrays are never modified.

Something must be wrong with my understanding of memory/cache coherency systems because having multiple copies of these arrays (one per thread) sped up my engine from 70% to 90% of what I should be able to achieve with 16 cores.

(Ideally I want the same performance as running 16 separate processes, which is surprisingly and impressively just as fast as running 1 process. I checked.)

Not sure how I'm going to attack that last 10% but maybe there are some more global variables that are gumming up the works...
Sharing things that are read-only is fine. They will end up in all caches, and they won't be swapped around. The problem hits when you have one cache block that gets modified in multiple threads. That results in a LOT of cache traffic with all the forwarding and invalidate requests being shipped around.
Okay, that was my understanding as well.

I think Matthew was on to something: maybe these arrays that I duplicated were laid out in memory next to something the threads WERE modifying.

Probably the best thing to do is put as much data as possible into the engine class to avoid this sort of possibility. Back when I was writing most of this code in 1997-1998, one of my goals was to conserve memory, but now that we have computers with many gigabytes of RAM, duplicating some small tables seems like a complete non-issue.
An alternative would be to have a global struct that holds all the tables, with one cache line of padding before and after it.

Code: Select all

#include <cstdint>

struct {
    uint8_t padding_front[64]; // one cache line before the tables
    // table 1
    // table 2
    // ...
    uint8_t padding_back[64];  // one cache line after the tables
} globals;
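On C++11 and later, alignas can achieve the same isolation without hand-written padding. A minimal sketch, assuming a 64-byte cache line (the name Tables is purely illustrative): because the size of a type is always rounded up to a multiple of its alignment, an alignas(64) struct never shares its first or last cache line with a neighboring object.

Code: Select all

// alignas(64) places the struct on a cache-line boundary, and
// sizeof is a multiple of 64, so frequently-written variables
// that the linker places next to it can never land in its lines.
struct alignas(64) Tables {
    // table 1
    // table 2
    // ...
};

Tables globals;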
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Baffling multithreading scaling behavior

Post by bob »

TomKerrigan wrote:
bob wrote:
TomKerrigan wrote:My engine shares several arrays/matrices between threads.

I assumed this would be fine because the arrays are never modified.

Something must be wrong with my understanding of memory/cache coherency systems because having multiple copies of these arrays (one per thread) sped up my engine from 70% to 90% of what I should be able to achieve with 16 cores.

(Ideally I want the same performance as running 16 separate processes, which is surprisingly and impressively just as fast as running 1 process. I checked.)

Not sure how I'm going to attack that last 10% but maybe there are some more global variables that are gumming up the works...
Sharing things that are read-only is fine. They will end up in all caches, and they won't be swapped around. The problem hits when you have one cache block that gets modified in multiple threads. That results in a LOT of cache traffic with all the forwarding and invalidate requests being shipped around.
Okay, that was my understanding as well.

I think Matthew was on to something: maybe these arrays that I duplicated were laid out in memory next to something the threads WERE modifying.

Probably the best thing to do is put as much data as possible into the engine class to avoid this sort of possibility. Back when I was writing most of this code in 1997-1998, one of my goals was to conserve memory, but now that we have computers with many gigabytes of RAM, duplicating some small tables seems like a complete non-issue.
One optimization is to separate your data into three categories (a sketch follows the list below).

(1) Per-thread data that is modified frequently. Lump this together, per thread, so that there is no cache-line overlap with the corresponding data of another thread. This data usually does NOT require a lock for protection, except at a split point where things must be shared.

(2) Global data that is read-only after initialization, since this can be shared with no cache thrashing.

(3) Global data that is actually modified by multiple threads.
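A minimal sketch of that separation, assuming C++11 and a 64-byte cache line; all struct and member names here are hypothetical, just to illustrate the grouping:

Code: Select all

#include <atomic>
#include <cstdint>
#include <mutex>

// (1) Frequently modified per-thread state: alignas(64) gives each
// thread's block its own cache line(s), so these writes never
// invalidate another thread's lines (no false sharing).
struct alignas(64) ThreadLocals {
    uint64_t nodes_searched = 0;
    int      selective_depth = 0;
    // ... killer moves, history counters, etc.
};
static ThreadLocals per_thread[16];  // one slot per search thread

// (2) Read-only after initialization: freely shared; every core
// keeps its own clean cached copy and no invalidations occur.
struct ReadOnlyTables {
    int piece_values[6];
    // ... attack tables, distance tables, etc.
};
static ReadOnlyTables tables;

// (3) Genuinely shared, mutated state: keep it on its own cache
// line(s) and guard it with a lock or atomics.
struct alignas(64) SharedState {
    std::mutex        lock;
    std::atomic<bool> stop_search{false};
    // ... split-point bookkeeping, etc.
};
static SharedState shared;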

Once you have the data separated, you need to make another pass and look at WHEN a piece of data is accessed with respect to other data in the same category (1, 2 or 3 above). Variables with temporal locality (accessed close together in time) should be given spatial locality (close together in memory), so that when you access the first value and fetch that cache block, you also fetch the second value at the same time, which hides the latency for the second access.
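As a (hypothetical) concrete case: if the evaluation always reads a, b, c, and d together, declaring them adjacent means the cache-block fill triggered by the first read brings in the other three as well:

Code: Select all

// Hypothetical: four values with high temporal locality, given
// spatial locality too. The fetch of 'a' pulls in the whole
// 64-byte block, so the reads of 'b', 'c', and 'd' hit cache.
struct alignas(64) HotGroup {
    int a, b, c, d;   // always read together in the same code path
};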

Obviously the last step can turn into an infinite optimization task, since you will encounter cases where different variables have high temporal locality in different places (i.e., you access A, B, C, and D in one place, but A, E, F, and G somewhere else; they can't ALL be close together in memory). :) Fortunately these are generally edge cases that occur infrequently, but you will always have at least a few.