Sync question

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Sync question

Post by hgm »

bob wrote:For each of the following, I want to execute the two following instructions:

mov A, eax
mov B, ebx

They appear exactly in the order in my program.

The _only_ way you can _guarantee_ that the two stores are done _exactly_ in the order written is as follows:

mov A, eax
sfence
mov B, ebx
sfence

Which incurs far more overhead than the original atomic lock I mentioned.

You can't get away with this:

mov A, eax
mov B, ebx
sfence

Because all the sfence instruction guarantees is that when you execute it, the processor stalls until both of the preceding mov instructions have executed. But it does _not_ guarantee that B won't be written first. It just guarantees that before you proceed, _both_ will be written, in unknown order.
The only thing I need is to be sure that B is seen by other CPUs to be written later than A. This can be accomplished by

Code: Select all

mov A, eax
sfence
mov B1, ebx
mov B2, ecx
...
mov Bn, edx
I don't see the need for an sfence after the writing of B. I don't care if the writing of a whole lot of other stuff is interleaved with the writing of B. (In my code there were atomic locks around this whole writing sequence anyway, so I guess this is automatic.)
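(For reference, a minimal C sketch of the same idea, with one fence between the store of A and the stores of B. This assumes C11 atomics, which postdate this discussion; the variable and function names are hypothetical.)

Code: Select all

#include <stdatomic.h>

/* hypothetical shared variables, mirroring A and B1..Bn above */
atomic_int A, B1, B2;

void write_side(int a, int b1, int b2) {
    atomic_store_explicit(&A, a, memory_order_relaxed);
    /* the single fence: no Bx store may become visible before the A store */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&B1, b1, memory_order_relaxed);
    atomic_store_explicit(&B2, b2, memory_order_relaxed);
}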
I didn't notice your mentioning this.
4 pages ago hgm wrote:Uncertainty about the order in which things are done by different processors is a different matter from uncertainty about the order in which a single CPU does things. For the latter there are sfence and similar instructions. These do not lock the bus, as they pertain only to what happens in a single CPU.
Because I don't believe you understand what is going on at this level. Which is OK, but it is not OK to just say "sfence" solves this when it most certainly does not.

So, if you want to guarantee the order of two stores, you have to put an sfence after each.
Wrong again. You only have to put an sfence between them. And on x86 that isn't even needed.
And you completely stall the CPU while it is waiting on the write, and won't execute any other instructions out of order to fill in the idle cycles in the various pipelines. Lousy solution.
Not if reads are >10,000 times more frequent than writes. Then even a single cycle saved on the reads would earn back more than a 5,000-cycle penalty on a single write. So: good solution...
Otherwise you can use a lock where you write and a lock where you read, so that a reader can't access _either_ value until the lock is released. You still need an sfence (store fence) prior to clearing the lock or you can still get zapped, since the instruction to clear the lock is a simple mov lock, 0 and the preceding stores might not both be completed prior to that.

If you don't program in assembly language, you don't get access to the sfence stuff anyway, and unless you know you need it, you have subtle timing issues that are a royal pain to deal with, because they are very hard to detect.

Out of order writes are an x86 characteristic. No way around them unless you want to intersperse sfence instructions throughout your code and slow it down by a couple of orders of magnitude or more.
As Gian-Carlo pointed out, the manuals say differently.
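(On the "you don't get access to the sfence stuff" point: a sketch of the lock-around-both-values scheme described above, in C11 atomics, which postdate this thread. The release store that clears the lock plays the role of the "sfence before mov lock, 0": it forbids the protected stores from drifting past the unlock. Names are hypothetical.)

Code: Select all

#include <stdatomic.h>

static atomic_int lock;   /* 0 = free, 1 = held */
static int A, B;          /* the two values being protected */

void write_pair(int a, int b) {
    while (atomic_exchange_explicit(&lock, 1, memory_order_acquire))
        ;                 /* spin until we own the lock */
    A = a;
    B = b;
    /* release = "sfence + mov lock, 0": both stores complete first */
    atomic_store_explicit(&lock, 0, memory_order_release);
}

void read_pair(int *a, int *b) {
    while (atomic_exchange_explicit(&lock, 1, memory_order_acquire))
        ;
    *a = A;
    *b = B;
    atomic_store_explicit(&lock, 0, memory_order_release);
}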
That is exactly what I have been saying all along. Peterson's code has a problem on X86 without using sfence. In fact, any program that depends on A being written before B (in the above example) has a problem unless you are willing to slow things to a crawl with sfence.
This you will have to explain. For one, you needed the sfence to clear the atomic lock as well, so how does that make atomic locks better? (Suppose for the sake of argument that we are talking about other systems than x86, where sfences would be needed.)
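(To make the Peterson case concrete, a sketch in C11 atomics, which postdate this thread. The barrier Peterson's algorithm actually needs on x86 is a store-load fence, i.e. mfence rather than sfence: without it, the store to flag[me] can still sit in the store buffer when flag[other] is loaded, and both threads can enter the critical section.)

Code: Select all

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag[2];
static atomic_int  turn;

void peterson_lock(int me) {          /* me is 0 or 1 */
    int other = 1 - me;
    atomic_store_explicit(&flag[me], true, memory_order_relaxed);
    atomic_store_explicit(&turn, other, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* mfence on x86 */
    while (atomic_load_explicit(&flag[other], memory_order_acquire) &&
           atomic_load_explicit(&turn, memory_order_relaxed) == other)
        ;                             /* spin */
}

void peterson_unlock(int me) {
    atomic_store_explicit(&flag[me], false, memory_order_release);
}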
That is a simple explanation of the problem I have been hitting on from day one. I have _repeatedly_ said "this is not a cache issue" because it has _nothing_ to do with cache. If cache never sees the mov A, eax write, it can't do a thing to help us. And it won't always see that write before the write to B. Hence no order guarantee and cache can't do a thing about it.

Note also that it does not matter at all what the state of A and B is with respect to other caches or even the cache on the current processor. Neither might be cached anywhere. Both might be cached as shared everywhere. The problem still persists in exactly the same way, with no help from cache to solve it at all.


Again, you have the out-of-order problem unless you sfence _each_ write on the writer end. And that is worse, still, as I said. Locks don't stall the CPU, while sfence instructions halt everything.
As said before, write cost was not a problem, as writes were virtually never done.
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Sync question

Post by hgm »

bob wrote:I'm not convinced. Anding anything with 0 produces zero, and the difference between mov $0, %eax and and $0, %eax is nil. Nothing says a processor can't deal with that. Just as surely as it can deal with this:

mov A, eax
xor eax, eax

Cray had lots of special cases for such instructions to do other things. I think relying on that kind of trick is risky. I don't want to debug that next year when the internal optimizations improve to the point that the processor starts to do what a compiler does today and toss useless code out.
The opcode that would logically correspond to xor r,r is recognized by the instruction decoder as a clear instruction. The manual mentions this.

If you don't believe that this works, try it out. I _know_ that it works on every CPU I tried, because I always use this instruction to measure load latency, by looping

Code: Select all

movl mem(%ebx,%eax), %eax   # load; the result feeds the next address
andl $0, %eax               # zeroes %eax but keeps the data dependency
addl $128, %ebx             # next address
This really makes the loads wait for each other to complete.

If you want to be very paranoid, you can use a register set to zero, instead of an immediate constant. I for one think it is a _very_ safe bet they will never bother to build in a comparator that weeds out immediate operands that are zero. That would be a total waste, as no one would of course use such operands. Unless they really wanted to use this trick, in which case they would not be happy with it. And the extra check on immediate operands would probably slow the entire CPU pipeline down...
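(A C-level analogue of this measurement, as a sketch with made-up sizes: chase a chain of pointers so that every load's address depends on the previous load's result, which serializes the loads the same way the andl $0 dependency does in the loop above.)

Code: Select all

#include <stdio.h>
#include <time.h>

#define N      (1 << 20)
#define STRIDE 16                     /* 16 * sizeof(void*) = 128 bytes */

static void *chain[N];

int main(void) {
    for (long i = 0; i < N; i++)      /* build a simple strided chain */
        chain[i] = &chain[(i + STRIDE) % N];

    void *p = chain[0];
    clock_t t0 = clock();
    for (long i = 0; i < 100000000L; i++)
        p = *(void **)p;              /* each load depends on the last */
    clock_t t1 = clock();

    /* printing p keeps the compiler from deleting the dependent loads */
    printf("%p: %.2f s for 1e8 loads\n", p,
           (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}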
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Sync question

Post by Gian-Carlo Pascutto »

hgm wrote:That would be a total waste, as no one would of course use such operands.
Relocatable code that didn't need to be relocated?
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Sync question

Post by Gian-Carlo Pascutto »

hgm wrote: Look again! This is a true data dependency, not a register dependency.

[...]

There is no way memB can be loaded before its address is known, and the address is calculated from the loaded A.
Actually, the Alpha can reorder dependent loads. But I'm not sure the specific case in which it does that would apply to your example. It's just not something I'd want to think about.
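(For concreteness, a sketch of the publish/read pattern where the Alpha's dependent-load reordering bites, in C11 atomics, which postdate this thread. The reader's load of p->value is address-dependent on the load of the pointer, yet the Alpha may still return stale data for it unless a dependency barrier is issued; memory_order_consume is the standard's later name for exactly this ordering. Names are hypothetical.)

Code: Select all

#include <stdatomic.h>
#include <stddef.h>

struct node { int value; };
static _Atomic(struct node *) shared;

void publish(struct node *n) {
    n->value = 42;
    atomic_store_explicit(&shared, n, memory_order_release);
}

int consume(void) {
    struct node *p = atomic_load_explicit(&shared, memory_order_consume);
    /* dependent load: the address comes from p, yet the Alpha still
       needs the barrier implied by consume to see the published value */
    return p ? p->value : -1;
}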
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Sync question

Post by diep »

hi,

I see claims of "the manual says this or that". Just a small warning: manuals tell only part of the truth. In past years all those processors have gotten more and more transistor logic, especially for all kinds of possible 'tricks'. None of those are in the manual, as they are business secrets. The patent database, though, holds a lot on the kinds of things those processors can or cannot do.

Both AMD and Intel are quite advanced in prediction and in reading data ahead; I'd say they totally hammer the rest of the field there. A few years ago AMD patented quite a few things in that area which some here call *impossible*. Note that Intel was already doing all sorts of things there that make Core 2 such a great processor. I wouldn't call anything impossible: you can of course read anything ahead by guessing, and later simply check whether the guess was correct.

As these techniques are their biggest secret, nothing of all that is in any manual; only some vague descriptions exist in the patent database.

Looking at old processors like the Alpha 21264 is not really interesting IMHO with respect to prediction. First of all, such processors were eating 300+ watts or so, and compared to them, today's processors are generations newer, better, and more complicated.

What is interesting, though, is to look at why today's processors dominate so much even though they are optimized for only a few registers (as HGM commented a while ago as well). Old processors like the Alpha had a lot of accessible registers, which allowed a great deal.

So to speak: bitboards in assembler on a 21264 are a lot faster than on today's processors, in the number of cycles needed per instruction.

This while today's processors in practice achieve far superior IPC, maybe even a factor of 2 higher, on all kinds of integer code, if we also take the compiler into account and do not use SSE2+ code.

Isn't the real problem of producing a great processor the caches?

A processor is as good as its L1 cache is IMHO.

These old processors are simply too slow for pattern recognition, where you access table after table nonstop, as they all lose it on L1 cache bandwidth/latency/throughput (whatever name you want to give it).

This while the Alpha 21264 really was designed for exactly that. Memory tells me it could even execute 4 instructions a cycle, but I would need to look that up again.

I intend to optimize some code in Diep again, but I already know that the real problem will be the L1 cache, after I manage to optimize away some branches, that is.

If you're writing all the complex patterns like:

Code: Select all

int piecemine    = board[sq],
    piececloseby = board[sq+8],
    myclevervar  = mycleverdatastructure[side][dfile];

/* branch-free equivalent of:
   if( piecemine == rook && piececloseby == knight && (myclevervar & 2) ) */
if( !((piecemine-rook) | (piececloseby-knight) | !(myclevervar&2)) ) {
// note how ugly removing branches looks, and how it asks for more bugs
// any clever idea to make it look more like the original code,
// or at least more readable? (one possibility is sketched below)
   ...
}
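One possible answer to the readability question in the comments, as a sketch (eq and has_bit are made-up helpers): hide the bit tricks behind named inline functions and combine them with a non-short-circuiting &, so the call site reads almost like the original &&-chain while still compiling branch-free.

Code: Select all

static inline int eq(int a, int b)      { return !(a ^ b); }  /* a == b   */
static inline int has_bit(int x, int m) { return !!(x & m); } /* x&m != 0 */

if( eq(piecemine, rook) & eq(piececloseby, knight) & has_bit(myclevervar, 2) ) {
   ...
}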
We removed a few branches from Diep in this manner. Great.
But the hammering on the L1 is getting even bigger.

It was already doing nearly 1 load every cycle on average (though I must interpret the data better; you can never trust those profilers here).

It's all about how fast the L1 can deliver the goods, and about the crappy compilers that do not want to give AMD 2 loads a cycle.

Maybe after I optimize a lot, AMD will be faster than Intel once again in IPC; the compiler will still let Intel win, of course.
Doesn't AMD have 2 read ports versus Intel's 1, or did Intel change that with Nehalem?

Vincent