Gian-Carlo Pascutto wrote:
bob wrote:
If that is true, we ran some tests on a broken I7 then...
Either that, or the test code was buggy, or you saw load-store reordering (instead of store-store reordering) and confused them.
The code ran two threads on two physical chips (not on a pair of cores with a shared cache). One thread did some calculations and wrote to A and B consecutively, although the computation of the value to store at A was more complex. The second calculation computed B a much simpler way, but always came up with the same answer. For both A and B, the new value was always old+1. So we expected to see 1,1, then 2,2 (or 2,1 if A was stored but B had not been stored yet). If we saw 2,3, that meant B was written first. We saw all the cases and the program counted them. The second thread just read A and B and looked for the case where A != B. If B was bigger, B was stored first; if A was bigger, A was stored first. If they were equal, they were either unchanged or both had been written before we read. It was all in asm, so there were no compiler optimization tricks to deal with.
I can't imagine any load/store reordering, since one thread only wrote and the other only read. They had a lot of cache forwarding going on, with one chip being the Forwarder and the other continually being invalidated and having to ask the Forwarder for new values.
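For anyone who wants to try this kind of test without writing asm, here is a rough C++ sketch of the experiment described above. It is not the original code: all names (Counts, run_store_order_test, a_ahead, b_ahead) are made up for illustration. It assumes release/acquire atomics, which on x86 compile to plain MOV instructions, so the compiler is kept from reordering while the hardware ordering is still what gets exercised. It also assumes the reader loads B before A; if the reader loads A first, the writer can complete a whole iteration between the two loads, and the reader then sees B > A even though the stores were never reordered.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical names throughout; the original test was hand-written asm.
struct Counts {
    std::uint64_t equal = 0;    // A == B: unchanged, or both stores seen
    std::uint64_t a_ahead = 0;  // A > B: A stored, B not yet visible (legal)
    std::uint64_t b_ahead = 0;  // B > A: B became visible first (reordering)
};

Counts run_store_order_test(std::uint64_t iterations) {
    std::atomic<std::uint64_t> a{0}, b{0};
    std::atomic<bool> done{false};
    Counts counts;

    std::thread writer([&] {
        for (std::uint64_t i = 1; i <= iterations; ++i) {
            a.store(i, std::memory_order_relaxed);  // first store
            b.store(i, std::memory_order_release);  // second store, kept after the first
        }
        done.store(true, std::memory_order_release);
    });

    std::thread reader([&] {
        bool last_pass = false;
        for (;;) {
            // Load in the opposite order to the stores: B first, then A.
            std::uint64_t seen_b = b.load(std::memory_order_acquire);
            std::uint64_t seen_a = a.load(std::memory_order_relaxed);
            if (seen_a == seen_b) ++counts.equal;
            else if (seen_a > seen_b) ++counts.a_ahead;
            else ++counts.b_ahead;
            if (last_pass) break;  // one final sample after the writer finished
            last_pass = done.load(std::memory_order_acquire);
        }
    });

    writer.join();
    reader.join();
    return counts;  // only the reader touched counts, and it has been joined
}
```

Calling run_store_order_test(200000) and printing the three counters gives the same kind of tally as the asm version. With this load order, a nonzero b_ahead would indicate that the second store became visible before the first.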
Intel gives the same guarantees as AMD:
Page 314, Section 8.2:
http://www.intel.com/Assets/PDF/manual/253668.pdf
•Writes are not reordered with older reads.
•Writes to memory are not reordered with other writes, with the exception of
—writes executed with the CLFLUSH instruction,
—streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD),
—string operations (see Section 8.2.4.1).
In a multiple-processor system, the following ordering principles apply:
•Individual processors use the same ordering principles as in a single-processor system.
•Writes by a single processor are observed in the same order by all processors.
Note that if a single CPU ensures no store-store reordering, the last point guarantees that every processor sees the stores in program order.
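The practical consequence of that guarantee is the usual "write the data, then set a flag" pattern. A minimal C++ sketch follows, with a hypothetical helper name (publish_and_read) and release/acquire atomics assumed so that the compiler preserves the store order that the hardware already keeps on x86:

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: passes `value` through a plain variable guarded by a flag.
int publish_and_read(int value) {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // flag written after the payload
    int observed = -1;

    std::thread producer([&] {
        payload = value;                               // store 1
        ready.store(true, std::memory_order_release);  // store 2, never moved before store 1
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();  // wait until the flag is visible
        }
        observed = payload;  // store 1 is visible once store 2 has been seen
    });

    producer.join();
    consumer.join();
    return observed;
}
```

If the two stores could become visible out of order, the consumer could see the flag set and still read the old payload.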
What is this for? Core and earlier, or also I7?
Core was the first (I believe) to allow loads to be moved ahead of stores on the same processor, using a pretty accurate predictor to determine whether the load was to the same address (unknown at reorder time) as the store, which would cause a RAW (read-after-write) hazard if it was not detected or prevented.