An example of intel behaviour in GCC

Discussion of chess software programming and technical issues.

Moderator: Ras

diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: An example of intel behaviour in GCC

Post by diep »

bob wrote:
diep wrote:In the summer of 2008 I rewrote some of Diep's hashtable code in order to speed it up. I was amazed to see Diep actually run SLOWER with the new code on AMD.

So for once I took a look at the assembler output.

Here is the C code of the fragment:

(Everything here is a 32-bit signed or unsigned integer)


Hashtable code:
    ..
    deltax = 0;
    if( bestscore >= MATEVALUE-1000 )
      deltax = realply;

    if( bestscore <= -MATEVALUE+1000 )
      deltax = -realply;

    (a few lines more code)
} // end of function

I was using the latest released GCC at the time, which I had downloaded just hours before doing this compile. It happened whatever flags I used:

gcc -O3 -march=k8 -mtune=k8 -S -DUNIXPII -c diepab.c

Other flags, like -O2, or tuning for k7 instead of k8 — nothing helped.
Each time it compiled the code fragment above the same way.

Next is what it generated. It regularly jumps away to label L837, executes 4 instructions there, and jumps back. I don't want it to jump around in the code. I want fall-through code; that's light-years faster on AMD. Where they got the bizarre idea to generate this type of code I do not know.
Vincent, any _good_ compiler person could explain this to you. The idea is to (a) optimize branch prediction and (b) optimize the prefetch that occurs in cache blocks.
Vincent, any _good_ compiler person could explain this to you. The idea is to (a) optimize branch prediction and (b) optimize the prefetch that occurs in cache blocks.

If you do this:

if (c) {
rarely executed code
}
You're making a statement that has no logical basis.

Where is your proof that the code is rarely executed, if the optimization happens without PGO?

What we do know is that AMD suffers a bunch of penalties if we jump around in the code, whereas 'fall through' works perfectly.
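As an aside for readers: the two guarded assignments diep shows can also be written with ternaries, which compilers of that era would typically lower to CMOV instead of a conditional jump. A minimal sketch — MATEVALUE here is an illustrative constant, not Diep's actual value:

```c
#define MATEVALUE 32000   /* illustrative; Diep's real constant may differ */

/* Branch-free version of the hash-store fragment: both conditional
 * assignments become ternaries, which GCC/ICC typically compile to CMOV. */
int mate_delta(int bestscore, int realply)
{
    int deltax = 0;
    deltax = (bestscore >=  MATEVALUE - 1000) ?  realply : deltax;
    deltax = (bestscore <= -MATEVALUE + 1000) ? -realply : deltax;
    return deltax;
}
```

Whether this actually beats the branchy version depends on how predictable the mate-score test is on the target CPU.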
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: An example of intel behaviour in GCC

Post by diep »

hgm wrote:I once tried to replace every branch over a simple movl instruction by a movcc, in the assembly produced by GCC with the -S option. On the Pentium M this produced no measurable speedup. Apparently most branches were too easy to predict...
For easily predictable branches on AMD,
CMOV gave at least a 50% speedup here.

CMOV is a DirectPath instruction.

Jumping through the code based on a JLE is really ugly.

Vincent
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: An example of intel behaviour in GCC

Post by diep »

hgm wrote:I am afraid I have to side with Bob on this one. If "fall-through code is light-years faster on AMD", then I would want the code to fall through in the common case, not the rare one. Which is exactly what the GCC-produced assembly shown does.

And I think that Intel actually has revealed how its static branch prediction works: the standard way is to assume forward branches are not taken and backward branches are taken. Around the time of Core Solo/Duo (IIRC) they published that they had found an improvement on this, changing the default prediction for branches with a very remote target. After some Googling I did find something about it in connection with the P-IV, but one might assume that this is something that should have the same optimum irrespective of processor family, and hence be used in all Intel CPUs ever since.

http://arstechnica.com/old/content/2004 ... um-2.ars/8
Except that Bob made one big mistaken assumption: the compiler in this case had no PGO information at all. Note I had not mentioned this, but the compiler flags shown indicated it.

You can verify this yourself with just about every 4.x GCC version; it generates this type of ugly code.

There is ANOTHER big reason everyone misses, which is the biggest problem in the long run: rewriting the code into such ugly form means that PGO cannot rewrite it either.

Visual C++ / Intel C++ get a speedup of 22% out of PGO;
GCC hardly gets any speedup out of PGO when it is applied.

The reason is that GCC rewrites branches into ugly code like this. After it has been turned into this spaghetti code, PGO can do nothing for you.

Intel's branch prediction in Core 2/Nehalem is NOT public information.
This is why cachegrind cannot gather statistics on Intel branch prediction and only can for AMD.

Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An example of intel behaviour in GCC

Post by bob »

mcostalba wrote:Thanks for the Intel link.

Regarding the jump-to-label optimization, I really think it is the wrong way to go; apart from being compiler/processor specific, it does not scale and it is not a general solution.
I hope you realize that I was showing what the compiler turned the if(c) block into, in order to make branch prediction / cache prefetch more efficient?

You should compare the hand-rewritten jump-label code with a PGO-optimized one and see whether the rewrite is still needed, or whether the compiler has, as it should, rewritten the assembly for you.
This was exactly my point. The compiler "knows what is best" and is quite good at transforming clean-looking code into "clean-running code".

In my experience, code should be written in the simplest and most straightforward way; that is what the compiler expects and what it is designed to optimize for.
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An example of intel behaviour in GCC

Post by hgm »

diep wrote:The reason is that GCC rewrites branches into ugly code like this. After it has been turned into this spaghetti code, PGO can do nothing for you.
Are you implying that PGO is totally stupid? If I were to write a profiler, it would not make the slightest difference how many jumps or branches there were in the code. You simply count the number of times each possible control path is traversed: either by incrementing a counter after every branch and at every branch target, or, less invasively, by just recording the return address on every timer interrupt.
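hgm's counting scheme can be sketched directly in C. This is a toy illustration of the edge-counting idea, not how GCC's actual PGO instrumentation is implemented; all names here are made up:

```c
/* Toy edge-count profiling in the spirit hgm describes: bump a counter
 * on every control arc taken, so a later pass can see which path is hot. */
enum { EDGE_TAKEN, EDGE_FALLTHROUGH, NUM_EDGES };

unsigned long edge_count[NUM_EDGES];

int counted_abs(int x)
{
    if (x < 0) {
        edge_count[EDGE_TAKEN]++;        /* arc into the then-block */
        return -x;
    }
    edge_count[EDGE_FALLTHROUGH]++;      /* arc falling past the branch */
    return x;
}
```

After a training run, the ratio of the two counters tells the compiler which side of the branch to lay out as fall-through.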
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An example of intel behaviour in GCC

Post by bob »

diep wrote:
hgm wrote:I am afraid I have to side with Bob on this one. If "fall-through code is light-years faster on AMD", then I would want the code to fall through in the common case, not the rare one. Which is exactly what the GCC-produced assembly shown does.

And I think that Intel actually has revealed how its static branch prediction works: the standard way is to assume forward branches are not taken and backward branches are taken. Around the time of Core Solo/Duo (IIRC) they published that they had found an improvement on this, changing the default prediction for branches with a very remote target. After some Googling I did find something about it in connection with the P-IV, but one might assume that this is something that should have the same optimum irrespective of processor family, and hence be used in all Intel CPUs ever since.

http://arstechnica.com/old/content/2004 ... um-2.ars/8
Except that Bob made one big mistaken assumption: the compiler in this case had no PGO information at all. Note I had not mentioned this, but the compiler flags shown indicated it.

You can verify this yourself with just about every 4.x GCC version; it generates this type of ugly code.

There is ANOTHER big reason everyone misses, which is the biggest problem in the long run: rewriting the code into such ugly form means that PGO cannot rewrite it either.

Visual C++ / Intel C++ get a speedup of 22% out of PGO;
GCC hardly gets any speedup out of PGO when it is applied.

The reason is that GCC rewrites branches into ugly code like this. After it has been turned into this spaghetti code, PGO can do nothing for you.

Intel's branch prediction in Core 2/Nehalem is NOT public information.
This is why cachegrind cannot gather statistics on Intel branch prediction and only can for AMD.

Vincent
Who cares? That is the _purpose_ of PGO compiling: to let the compiler discover how branches behave so that it can optimize the code for the specific architecture.

I do not understand your next-to-last statement. GCC doesn't mangle the code unless it believes the result is better. Without PGO it can't know branch behavior and can only guess, and yes, it can guess wrong.

Just use PGO (with GCC, compile with -fprofile-generate, run a representative workload, then recompile with -fprofile-use) and it will work.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An example of intel behaviour in GCC

Post by bob »

diep wrote:
bob wrote:
diep wrote:In the summer of 2008 I rewrote some of Diep's hashtable code in order to speed it up. I was amazed to see Diep actually run SLOWER with the new code on AMD.

So for once I took a look at the assembler output.

Here is the C code of the fragment:

(Everything here is a 32-bit signed or unsigned integer)

Hashtable code:
    ..
    deltax = 0;
    if( bestscore >= MATEVALUE-1000 )
      deltax = realply;

    if( bestscore <= -MATEVALUE+1000 )
      deltax = -realply;

    (a few lines more code)
} // end of function

I was using the latest released GCC at the time, which I had downloaded just hours before doing this compile. It happened whatever flags I used:

gcc -O3 -march=k8 -mtune=k8 -S -DUNIXPII -c diepab.c

Other flags, like -O2, or tuning for k7 instead of k8 — nothing helped.
Each time it compiled the code fragment above the same way.

Next is what it generated. It regularly jumps away to label L837, executes 4 instructions there, and jumps back. I don't want it to jump around in the code. I want fall-through code; that's light-years faster on AMD. Where they got the bizarre idea to generate this type of code I do not know.
Vincent, any _good_ compiler person could explain this to you. The idea is to (a) optimize branch prediction and (b) optimize the prefetch that occurs in cache blocks.

If you do this:

if (c) {
rarely executed code
}
You're making a statement that has no logical basis.

Where is your proof that the code is rarely executed, if the optimization happens without PGO?

What we do know is that AMD suffers a bunch of penalties if we jump around in the code, whereas 'fall through' works perfectly.
Why would you not use PGO, and then _complain_ when the compiler produces less-than-optimal code? The traditional static branch prediction approach in processors is to assume that backward branches are loops and are almost always taken, while forward branches are exceptions and are rarely taken — except when the branch target is a significant distance away, which suggests some static optimization decided this forward branch is more or less likely for some reason. The entire purpose of PGO is to observe branch behavior and shuffle the C code around so that it executes more efficiently, turning it into faster asm. I use PGO on everything I compile that is of any significance in terms of run-time. Why would you not use it on your chess program, and then complain that the code is sub-optimal when the compiler has to _guess_ what happens?

If your code is as well-thought-out as your post above, I could see how it might be tricked into producing ugly-looking code. :)
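When PGO is not an option, GCC and Clang also let the programmer hand-feed the static prediction with __builtin_expect. A hedged sketch applying that idea to diep's fragment; MATEVALUE is an illustrative constant, not Diep's actual value:

```c
/* GCC/Clang-specific hint: tell the compiler the mate-score branches are
 * rare, so it lays out the common path as straight fall-through code. */
#define unlikely(x) __builtin_expect(!!(x), 0)

#define MATEVALUE 32000   /* illustrative; Diep's real constant may differ */

int mate_delta_hinted(int bestscore, int realply)
{
    int deltax = 0;
    if (unlikely(bestscore >= MATEVALUE - 1000))
        deltax = realply;
    if (unlikely(bestscore <= -MATEVALUE + 1000))
        deltax = -realply;
    return deltax;
}
```

This only moves the guess from the compiler to the programmer; profile data, where available, is still the more reliable source.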
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An example of intel behaviour in GCC

Post by bob »

diep wrote:
hgm wrote:I once tried to replace every branch over a simple movl instruction by a movcc, in the assembly produced by GCC with the -S option. On the Pentium M this produced no measurable speedup. Apparently most branches were too easy to predict...
For easily predictable branches on AMD,
CMOV gave at least a 50% speedup here.

CMOV is a DirectPath instruction.

Jumping through the code based on a JLE is really ugly.

Vincent
Again, this statement is rarely true. It depends on the predictability of the branch, and almost every branch falls into the "highly predictable" category. Many can be handled by the usual "local branch prediction" approach, which looks at the previous N (N = 10-12 depending on processor) outcomes of that branch. Others fit better a "correlated branch prediction" approach, since branches of the form "if (wtm)" are all correlated in their behavior within one ply. And then the tournament predictor works out which of the two predictors best fits each specific branch and selects the one offering better prediction.

If you do this:

if (random() & 1) {
}

Then _no_ predictor will work well, and there's nothing that can be done. There a CMOV might well be a big win.
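For such 50/50 branches the conditional jump can also be removed by hand with a mask, independent of whether the compiler chooses to emit CMOV. A small sketch:

```c
/* Branch-free select: returns a when cond is nonzero, else b.
 * No conditional jump, so an unpredictable cond costs no mispredictions. */
int select_branchless(int cond, int a, int b)
{
    int mask = -(cond != 0);     /* all ones if cond, all zeros otherwise */
    return (a & mask) | (b & ~mask);
}
```

On predictable branches this trick can lose to a well-predicted jump, since both inputs must always be computed.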
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: An example of intel behaviour in GCC

Post by diep »

hgm wrote:
diep wrote:The reason is that GCC rewrites branches into ugly code like this. After it has been turned into this spaghetti code, PGO can do nothing for you.
Are you implying that PGO is totally stupid? If I were to write a profiler, it would not make the slightest difference how many jumps or branches there were in the code. You simply count the number of times each possible control path is traversed: either by incrementing a counter after every branch and at every branch target, or, less invasively, by just recording the return address on every timer interrupt.
In GCC, yes, the PGO is really bad; I am using very polite words.
"Beginner level" would be a compliment.

It is at least 4 times worse than what Intel C++ / Visual C++ do.

If PGO in GCC worked the way you describe, the GCC team would soon have to undo their code, as AMD would be faster IPC-wise than Nehalem.

That's why I showed the code sample mangled by GCC. You must be doing SOMETHING to make GCC's code slower.

Of course two simple CMOVs would be a lot faster than all the silly jumping, and should be generated by default. This is just one example; there is worse.

Both are American companies; that's the weird thing. Intel fabricates in Israel and then ships the wafers to Malaysia, where the hard work of testing them for weeks starts; then a stamp goes on them and they get sold. AMD fabricates its CPUs in Germany, then ships them to Indonesia, where the testing happens for weeks; stamp on them, and they get sold.

Usually Bob only defends American projects "till death". Maybe he made a mistake this time and assumed the branch would only seldom be taken.

However, the branch GCC generated here has a cost of around 30 cycles on AMD.

On Intel it is a lot cheaper; Intel has cleverer rewrite mechanisms and a bigger lookahead buffer. Probably Nehalem can already reorder the code.

Even then it is slower than doing a CMOV.
A CMOV costs 1 cycle at most, even on Intel.

I forgot to add that taking the branch costs another cycle of penalty on AMD, and the jump back another cycle on AMD.

So even if AMD predicted that branch 100% correctly, it would still lose 2 cycles unnecessarily.

That's why I want clean fall-through code.

Thanks,
Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An example of intel behaviour in GCC

Post by bob »

diep wrote:
hgm wrote:
diep wrote:The reason is that GCC rewrites branches into ugly code like this. After it has been turned into this spaghetti code, PGO can do nothing for you.
Are you implying that PGO is totally stupid? If I were to write a profiler, it would not make the slightest difference how many jumps or branches there were in the code. You simply count the number of times each possible control path is traversed: either by incrementing a counter after every branch and at every branch target, or, less invasively, by just recording the return address on every timer interrupt.
In GCC, yes, the PGO is really bad; I am using very polite words.
"Beginner level" would be a compliment.

It is at least 4 times worse than what Intel C++ / Visual C++ do.

If PGO in GCC worked the way you describe, the GCC team would soon have to undo their code, as AMD would be faster IPC-wise than Nehalem.

That's why I showed the code sample mangled by GCC. You must be doing SOMETHING to make GCC's code slower.

Of course two simple CMOVs would be a lot faster than all the silly jumping, and should be generated by default. This is just one example; there is worse.

Both are American companies; that's the weird thing. Intel fabricates in Israel and then ships the wafers to Malaysia, where the hard work of testing them for weeks starts; then a stamp goes on them and they get sold. AMD fabricates its CPUs in Germany, then ships them to Indonesia, where the testing happens for weeks; stamp on them, and they get sold.

Usually Bob only defends American projects "till death". Maybe he made a mistake this time and assumed the branch would only seldom be taken.
I don't know what you are talking about myself. Both AMD and Intel _are_ "American projects". I've used both, and had success with both. Intel currently leads the pack; AMD led it for quite a while. I use what is best, not what is politically correct or whatever.

If you do not use PGO, and you said you didn't, then the compiler has absolutely no way to know how the branch will behave, and yes, it will produce poor code when it makes the wrong assumption. The simple fix is to use PGO. Then the problem goes away.

Intel C is _not_ 4 times faster/better than GCC. It is better, and it might be 10% faster or so, as it has been in the past. I haven't run on AMD hardware recently, but when I have, I used GCC and it worked quite well. On occasion it would fail miserably and crash, particularly when I tried to profile a threaded program so that the thread-management code would also get optimized. However, I have also had Intel versions that would crash, so it happens.


However, the branch GCC generated here has a cost of around 30 cycles on AMD.

On Intel it is a lot cheaper; Intel has cleverer rewrite mechanisms and a bigger lookahead buffer. Probably Nehalem can already reorder the code.

Even then it is slower than doing a CMOV.
A CMOV costs 1 cycle at most, even on Intel.

I forgot to add that taking the branch costs another cycle of penalty on AMD, and the jump back another cycle on AMD.

So even if AMD predicted that branch 100% correctly, it would still lose 2 cycles unnecessarily.

That's why I want clean fall-through code.

Thanks,
Vincent
But you only want clean fall-through if it falls through _most_ of the time. Otherwise it would be slower: it would cause the i-cache to prefetch code that gets skipped, and it is just not as efficient overall.