Read access contention

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Read access contention

Post by mcostalba »

I know that write-accessing a variable from different threads can lead to slowdowns due to contention, even without locking.

I don't know whether a variable that is heavily accessed by different CPUs in an SMP system, but only in read-only mode, can have the same problem.

Does anyone have experience with this?

Thanks
Marco
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Read access contention

Post by Gian-Carlo Pascutto »

Read-only access is not a problem, but be aware that there should not be a read-write variable within the same cacheline (usually 64 bytes) as the read-only data.
User avatar
hgm
Posts: 27793
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Read access contention

Post by hgm »

No experience, but in theory it should not. The variables get cached in the 'Shared' state, and as long as they remain cached there is no reason why other cores should be notified when they are accessed for reading. So it should be as fast as any cache read.

Only when you start writing to such Shared variables do the other cores have to be notified that the copy they hold is now Invalid, and that is what takes time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Read access contention

Post by bob »

mcostalba wrote:I know that write-accessing a variable from different threads can lead to slowdowns due to contention, even without locking.

I don't know whether a variable that is heavily accessed by different CPUs in an SMP system, but only in read-only mode, can have the same problem.

Does anyone have experience with this?

Thanks
Marco
Won't hurt a thing. Read-only values can appear in all caches at the same time and incur no overhead of any kind.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Read access contention

Post by bob »

hgm wrote:No experience, but in theory it should not. The variables get cached in the 'Shared' state, and as long as they remain cached there is no reason why other cores should be notified when they are accessed for reading. So it should be as fast as any cache read.

Only when you start writing to such Shared variables do the other cores have to be notified that the copy they hold is now Invalid, and that is what takes time.
And as GCP mentioned (as have I many times) a read/write variable in that same cache line would be a no-no, even if it is only read/written in one thread. Those read/writes will invalidate it in all other caches even though that variable is not used in the other threads...
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Read access contention

Post by michiguel »

bob wrote:
hgm wrote:No experience, but in theory it should not. The variables get cached in the 'Shared' state, and as long as they remain cached there is no reason why other cores should be notified when they are accessed for reading. So it should be as fast as any cache read.

Only when you start writing to such Shared variables do the other cores have to be notified that the copy they hold is now Invalid, and that is what takes time.
And as GCP mentioned (as have I many times) a read/write variable in that same cache line would be a no-no, even if it is only read/written in one thread. Those read/writes will invalidate it in all other caches even though that variable is not used in the other threads...
I am confused about your statement. When you say "a read/write variable in that same cache line", in the same cache line as what, exactly?

Are you referring to something like this?
http://www.ddj.com/embedded/212400410

Miguel
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Read access contention

Post by Gian-Carlo Pascutto »

michiguel wrote: I am confused about your statement. When you say "a read/write variable in that same cache line", in the same cache line as what, exactly?
In the same cacheline as the read-only variable.

You as a programmer may think in terms of variables, but the memory subsystem of the CPU thinks in terms of cachelines. So if you have a read-write variable in the same cacheline as a read-only variable, the entire cacheline is effectively read-write.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Read access contention

Post by diep »

Gian-Carlo Pascutto wrote:
michiguel wrote: I am confused about your statement. When you say "a read/write variable in that same cache line", in the same cache line as what, exactly?
In the same cacheline as the read-only variable.

You as a programmer may think in terms of variables, but the memory subsystem of the CPU thinks in terms of cachelines. So if you have a read-write variable in the same cacheline as a read-only variable, the entire cacheline is effectively read-write.
Exactly,

Note that things are getting more complex in the future, and on some more expensive systems they already are a lot more complex.

So it is best to avoid reads and writes to the same memory from different CPUs; that is better for the cache snoops. Maybe no big issue on 1 socket, and no big deal with Nehalem on 2 sockets, but already a major issue at 4 sockets, not to mention SSI-type systems.

Now this is of course a problem with multithreading: the OS likes to be efficient, yet in reality it is sometimes faster to do things totally locally.

On paper they may be the same, but in reality I see performance differences between multithreading and multiprocessing when you use them in a normal manner (some weirdo C++-defines trick, Nalimov style, that rewrites one into the other is of course baloney, as no one practically does it that way).

Windoze acts especially weird there: if you have processes in the same virtual address space, the shared RAM you can allocate is just 1/n-th of what you can allocate when the processes are not in the same virtual address space. Diep is no longer in the same virtual address space under Windows, so that solved a lot of issues there. It has a direct impact on memory.

Multiprocessing also gets more system time from Windows. It schedules processes BETTER than multithreaded applications: processes usually get a full core, while multithreaded applications get lobotomized a lot.

With Istanbul we now seem to be getting CPUs that reserve a part of the L3 for shared memory and treat the rest in a kind of local manner with respect to the RAM. We should figure out more details there. That seems not so efficient, but it is faster if you have threads or processes that run totally locally.

In principle you should ALWAYS avoid accessing the same cachelines whenever possible, as you never know which entity owns that cacheline and what that entity has done with it. With local cachelines there is no such problem.

Note that 'aligning' your data might have an impact on the CPU's L1 or L2, yet not be on par with the alignment of the memory controller. That used to be different some CPU generations ago. Nowadays, however, these things have become so complex and so important to performance that they are clocked at different clock speeds. That is a big issue now for worst-case performance; it wasn't that bad some years ago.

Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Read access contention

Post by bob »

michiguel wrote:
bob wrote:
hgm wrote:No experience, but in theory it should not. The variables get cached in the 'Shared' state, and as long as they remain cached there is no reason why other cores should be notified when they are accessed for reading. So it should be as fast as any cache read.

Only when you start writing to such Shared variables do the other cores have to be notified that the copy they hold is now Invalid, and that is what takes time.
And as GCP mentioned (as have I many times) a read/write variable in that same cache line would be a no-no, even if it is only read/written in one thread. Those read/writes will invalidate it in all other caches even though that variable is not used in the other threads...
I am confused about your statement. When you say "a read/write variable in that same cache line", in the same cache line as what, exactly?

Are you referring to something like this?
http://www.ddj.com/embedded/212400410

Miguel
Say you have a group of variables at consecutive memory addresses, aligned such that the entire group of 16 integer values maps into one physical cache block/line. If the data is mixed so that the group contains global read-only data that multiple threads (on multiple processors) read but do not modify, all works well. But if one of those values gets modified by a single thread, the entire cache block becomes invalid in the other processors' caches, and depending on the processor you use, either the cache with the newly modified value has to forward it to the other caches when they come back for another read-only access, or you go through an invalidate/write cycle, which is even worse.

When any byte of a cache block is modified, the other caches snoop this and invalidate their copy. If they then need that value, the cache with the correct (modified) data has to forward it to them. This can cause a huge slowdown. The fix is simple: no read-only shared data in a cache block with any data that gets modified. Data that gets modified needs to be grouped into cache blocks so that if processor N modifies a value in a cache block, no other processor will be accessing _any_ data in that block under any circumstances. You can't do this for the small bits of data that must be shared for a parallel search, but you can do it for most of the data and get around a huge cache-to-cache forwarding bottleneck.
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Read access contention

Post by Zach Wegner »

diep wrote:In principle you should ALWAYS avoid accessing the same cachelines whenever possible, as you never know which entity owns that cacheline and what that entity has done with it. With local cachelines there is no such problem.
Maybe in principle, but there are plenty of exceptions where you can read shared memory without fear of hitting a dirty cache line. For example, if I'm deciding whether to split or not based on the current iteration number, the chance that it is in our cache but has been modified is virtually nil. In many other instances it's simply unavoidable, short of doing without shared memory altogether. :)