Sync question

Gian-Carlo Pascutto · Mon Jul 27, 2009 10:38 pm

bob wrote: OK, some examples.

For each of the following, I want to execute the two following instructions:

mov A, eax
mov B, ebx

They appear exactly in the order in my program.

The _only_ way you can _guarantee_ that the the two stores are done _exactly_ in the order written is as follows:

mov A, eax
sfence
mov B, eax
sfence

The sfence is not needed on x86.

Out of order writes are a x86 characteristic. No way around them unless you want to intersperse sfence instructions throughout your code and slow it down by at least a couple of orders of magnitude or more.

This is false. x86/amd64 guarantees that stores are not reordered. Please see my post above, where this is literally stated in the manuals, including the multiprocessor case.

A load may still be reordered wrt to a store, but that is not is relevant for the code discussed.

bob · Post by **bob** » Mon Jul 27, 2009 10:40 pm

hgm wrote:
Gian-Carlo Pascutto wrote:So, the code needs both a sfence and lfence on weakly ordered CPUs (and will hence be slow), but it will work correctly and fast on x86 and derivates, due to the store/store and load/load ordering that is enforced.
Agreed. But an lfence does not need be very expensive on any machine, I think. If it would be unduly slow, I think the "home-made" lfence I proposed, through
Code: Select all
movl memA, %ecx
movl %ecx, %eax
andl $0, %ecx
movl memB&#40;%ecx&#41;, %ebx

What, exactly, is that supposed to accomplish? Did you forget about register renaming so that the and 0, %ecx will will somehow stall the indexed read for memB? It won't. The %ecx in instruction 3 will be a different register than the %ecx in instruction 1, so there is no name dependency to tie up and order the instruction stream, if I am understanding what you are intending. Unless I misunderstand what you are trying to do, this won't do anything at all about the order things are read. The second read can easily be done before the first one, since they will be register independent after fetching/decoding/renaming.

is an alternative to force reading of B to start only after the read for memA is complete, and is hardly more expensive than any normal load.

The original problem was formulated in such a way (reads more frequent than writes by at least a factor 10,000) that only the efficiency of reading matters. So it doesn't matter how expensive sfence is. (Although it is hard to imagine sfence is more expensive than a mutex based on xchg, as there the CPU also has to wait until the written data is "globally visible" before the read data can be made available to the CPU, or two CPUs could both grab the lock.)

Under these conditions, the locations to be read are likely to be almost always in cache for all CPUs, in the Shared state. So the normal loads would have quite low latency. Only on the occasional write all these entries would be invalidated, and the readers would have to acquire the line again from memory, or directly from their sibbling CPU. (Assuming they do not communicate through L2, as Core 2 Duo, or through L3 as Core i7.)

bob · Post by **bob** » Mon Jul 27, 2009 10:47 pm

Gian-Carlo Pascutto wrote:
bob wrote: OK, some examples.

For each of the following, I want to execute the two following instructions:

mov A, eax
mov B, ebx

They appear exactly in the order in my program.

The _only_ way you can _guarantee_ that the the two stores are done _exactly_ in the order written is as follows:

mov A, eax
sfence
mov B, eax
sfence
The sfence is not needed on x86.

Out of order writes are a x86 characteristic. No way around them unless you want to intersperse sfence instructions throughout your code and slow it down by at least a couple of orders of magnitude or more.
This is false. x86/amd64 guarantees that stores are not reordered. Please see my post above, where this is literally stated in the manuals, including the multiprocessor case.

A load may still be reordered wrt to a store, but that is not is relevant for the code discussed.

If that is true, we ran some tests on a broken I7 then...

I finally tracked my intel core docs down and it does say that stores are done in order. The I7 is another issue I can not find details on but a test program we ran on the prototype we had here early this year claimed it was detecting out of order stores, and it was a pretty simple program. I'm so used to out-of-order stores dating all the way back to the Crays, that I simply do not write code that depends on this so that I don't have to debug it when "the rules change".

whether the I7 was yet another bios bug that was setting something wrong, or it is a new feature, or it was simply a hardware bug (we had more than one) I do not know.

Gian-Carlo Pascutto · Mon Jul 27, 2009 10:54 pm

bob wrote:
If that is true, we ran some tests on a broken I7 then...

Either that, or the test code was buggy, or you saw load-store reordering (instead of store-store reordering) and confused them.

Intel gives the same guarantees as AMD:

Page 314, Section 8.2:

http://www.intel.com/Assets/PDF/manual/253668.pdf

Writes are not reordered with older reads.
•Writes to memory are not reordered with other writes, with the exception of
—writes executed with the CLFLUSH instruction,
—streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD),
—string operations (see Section 8.2.4.1).

In a multiple-processor system, the following ordering principles apply:
•Individual processors use the same ordering principles as in a single-processor system.
•Writes by a single processor are observed in the same order by all processors.

Note that if a single CPU ensures no store-store reordering, the last section guarantees everybody sees the store in program order.

Gian-Carlo Pascutto · Mon Jul 27, 2009 11:08 pm

The I7 is another issue I can not find details on but a test program we ran on the prototype we had here early this year claimed it was detecting out of order stores, and it was a pretty simple program.

Note, as I already said, that the compiler can still reorder things without barriers, and in fact, GCC will do exactly that.

bob wrote:I'm so used to out-of-order stores dating all the way back to the Crays, that I simply do not write code that depends on this so that I don't have to debug it when "the rules change".

Yes, I think that's good design. x86 is really the exception here. You do not want to have to guess afterwards which guarantees you were depending on, when you ever want to compile your program for ARM or Itanium. Or have to pray that the code still works after every compiler upgrade...

hgm · Post by **hgm** » Mon Jul 27, 2009 11:42 pm

bob wrote:
Code: Select all
movl memA, %ecx
movl %ecx, %eax
andl $0, %ecx
movl memB&#40;%ecx&#41;, %ebx
What, exactly, is that supposed to accomplish? Did you forget about register renaming so that the and 0, %ecx will will somehow stall the indexed read for memB? It won't. The %ecx in instruction 3 will be a different register than the %ecx in instruction 1, so there is no name dependency to tie up and order the instruction stream, if I am understanding what you are intending. Unless I misunderstand what you are trying to do, this won't do anything at all about the order things are read. The second read can easily be done before the first one, since they will be register independent after fetching/decoding/renaming.

Look again! This is a true data dependency, not a register dependency.

Yes, the source %ecx of intruction 3 will be renamed to another register than the destination %ecx of instruction 3 out of order. But there will be no valid data in it until instruction 3 completes. And execution of instruction 3 won't start until its source %ecx receives data from the load.

There is no way memB can be loaded before its address is known, and the address is calculated from the loaded A. The instruction scheduler is not smart enough to understand that 0&x is always 0, and will wait for x to become available. It does not even look at values of operands, that is left for the ALU. (A compiler would probably optimize this away, though.)

bob · Post by **bob** » Tue Jul 28, 2009 12:00 am

Gian-Carlo Pascutto wrote:
bob wrote:
If that is true, we ran some tests on a broken I7 then...
Either that, or the test code was buggy, or you saw load-store reordering (instead of store-store reordering) and confused them.

Code ran two threads on two physical chips (not on a pair of cores with shared cache). One thread did some calculations, and wrote to A and B consecutively, although the computations to compute the value to store at A were more complex. The second calculation computed B a much simpler way, but always came up with the same answer. And for any A or B value, the new value was always old+1 for both cases. So we expected to see 1,1, then 2, 2 (or 2,1 if A was stored but B had not been stored). If we saw a 2, 3 that meant B was written to first. We saw all the cases and the program counted them. The second thread just read A and B and looked for the case where A != B. If B was bigger, B was stored first, if A was bigger, it was stored first. If they were equal, they were either unchanged or both had been written to before we read. All was in asm so there was no compiler opimization tricks to deal with.

Can't imagine any load/store reordering since one thread only wrote, the other only read, so they had a lot of cache forwarding going on, with one chip being the Forwarder, the other continually being invalidated and having to ask for new values from the Forwarder.

Intel gives the same guarantees as AMD:

Page 314, Section 8.2:

http://www.intel.com/Assets/PDF/manual/253668.pdf

Writes are not reordered with older reads.
•Writes to memory are not reordered with other writes, with the exception of
—writes executed with the CLFLUSH instruction,
—streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD),
—string operations (see Section 8.2.4.1).

In a multiple-processor system, the following ordering principles apply:
•Individual processors use the same ordering principles as in a single-processor system.
•Writes by a single processor are observed in the same order by all processors.
Note that if a single CPU ensures no store-store reordering, the last section guarantees everybody sees the store in program order.

what is this for? Core and earlier or also I7?

core was the first (I believe) to allow loads to be moved ahead of stores on the same processor, using a pretty accurate predictor to determine if the load was to the same address (unknown at reorder time) as the store which would cause a RAW hazard if this was not detected or prevented.

bob · Post by **bob** » Tue Jul 28, 2009 12:03 am

Gian-Carlo Pascutto wrote:
The I7 is another issue I can not find details on but a test program we ran on the prototype we had here early this year claimed it was detecting out of order stores, and it was a pretty simple program.
Note, as I already said, that the compiler can still reorder things without barriers, and in fact, GCC will do exactly that.

bob wrote:I'm so used to out-of-order stores dating all the way back to the Crays, that I simply do not write code that depends on this so that I don't have to debug it when "the rules change".
Yes, I think that's good design. x86 is really the exception here. You do not want to have to guess afterwards which guarantees you were depending on, when you ever want to compile your program for ARM or Itanium. Or have to pray that the code still works after every compiler upgrade...

I agree. This is like the original BSF/BSR instructions where if the initial value was zero, the result was "undetermined" and the destination register was unchanged. "undetermined" never left me comfortable enough to write the ugly BSR/BSF functions that first set the value you want for zero in the destination, then do the BSF/BSR knowing the previous value would stay if there were no bits set. But Intel did not guarantee this and they were free to change this if they wanted. I didn't want an unexpected architectural change to blow up my code, so I try to write what will work for all existing architectures, which means not depending on any sort of ordering at all. And as a result, I have run on every processor ever made (parallel machines) with no problems, including the ugly Itanium and the ultra-liberal alpha, as well as PPCs, HPs, Sparcs, etc.

bob · Post by **bob** » Tue Jul 28, 2009 12:09 am

hgm wrote:
bob wrote:
Code: Select all
movl memA, %ecx
movl %ecx, %eax
andl $0, %ecx
movl memB&#40;%ecx&#41;, %ebx
What, exactly, is that supposed to accomplish? Did you forget about register renaming so that the and 0, %ecx will will somehow stall the indexed read for memB? It won't. The %ecx in instruction 3 will be a different register than the %ecx in instruction 1, so there is no name dependency to tie up and order the instruction stream, if I am understanding what you are intending. Unless I misunderstand what you are trying to do, this won't do anything at all about the order things are read. The second read can easily be done before the first one, since they will be register independent after fetching/decoding/renaming.
Look again! This is a true data dependency, not a register dependency.

Yes, the source %ecx of intruction 3 will be renamed to another register than the destination %ecx of instruction 3 out of order. But there will be no valid data in it until instruction 3 completes. And execution of instruction 3 won't start until its source %ecx receives data from the load.

There is no way memB can be loaded before its address is known, and the address is calculated from the loaded A. The instruction scheduler is not smart enough to understand that 0&x is always 0, and will wait for x to become available. It does not even look at values of operands, that is left for the ALU. (A compiler would probably optimize this away, though.)

I'm not convinced. Anding anything with 0 produces zero. and the difference between move 0, %eax and and 0, %eax is nil. Nothing says a processor can't deal with that. Just as surely as it can deal with this:

mov A, eax
xor eax, eax

Cray had lots of special cases for such instructions to do other things. I think relying on that kind of trick is risky. I don't want to debug that next year when the internal optimizations improve to the point that the processor starts to do what a compiler does today and toss useless code out.

Gian-Carlo Pascutto · Tue Jul 28, 2009 12:19 am

bob wrote: what is this for? Core and earlier or also I7?

Why not read it and answer the question yourself?

It's "Pentium Pro and later". They can't change this (in the sense of relaxing it), since changing it would mean Core i7 is not compatible (by any definition) with previous CPUs. After all, they're *guaranteeing* you're allowed to assume no store/store reordering exists.

Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question

Re: Sync question