An example of intel behaviour in GCC

Discussion of chess software programming and technical issues.

Moderator: Ras

diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: An example of intel behaviour in GCC

Post by diep »

bob wrote:
diep wrote:
hgm wrote:
diep wrote:The reason for that is that GCC rewrites branches into ugly code like this. After it has been rewritten into this spaghetti code, PGO can do nothing for you.
Are you implying that PGO is totally stupid? If I were to write a profiler, it would not make the slightest difference how many jumps or branches there were in the code. You simply count the number of times each possible control path is traversed. Either by incrementing a counter after every branch and at every branch target, or, less invasively, by just recording the return address on every timer interrupt.
In GCC, yes, the PGO is really ugly and bad, and I am using very polite words.
Beginner level would be a compliment.

At least 4 times worse than what Intel C++ / Visual C++ do.

Your way of doing PGO in GCC would mean that either we hear one clean shot and you're dead from a bullet fired by a hired criminal, or the GCC team would soon undo your code, as AMD would be faster IPC-wise than Nehalem.

That's why I showed the code sample butchered by GCC. You must do SOMETHING to make GCC's code slower.

Of course two simple CMOVs would be a lot faster than all the silly jumping and should get generated by default. It is just one example. There is worse.

Both are American companies; that's the weird thing. Intel fabricates in Israel and then ships the wafers to Malaysia, where the hard work starts: testing them for weeks, then putting a stamp on them, and then they get sold. AMD fabricates its CPUs in Germany and then ships them to Indonesia, where the testing happens for weeks; stamp on, and they get sold.

Usually Bob only defends American projects "till death". Maybe he made a mistake this time and assumed the code would take the branch only seldom.
I don't know what you are talking about. Both AMD and Intel _are_ "American projects". I've used both, and had success with both. Intel currently leads the pack. AMD led it for quite a while. I use what is best, not what is politically correct or whatever.

If you do not use PGO, and you said you didn't, then the compiler has absolutely no way to know how the branch will behave, and yes, it will produce poor code when it makes the wrong assumption. The simple fix is to use PGO. Then the problem goes away.

Intel C is _not_ 4 times faster/better than GCC. It is better, and it might be 10% faster or so, as it has been in the past. I haven't run on AMD hardware recently, but when I have, I used GCC and it worked quite well. On occasion it would fail miserably and crash, particularly when I tried to profile a threaded program so that the thread-management code would also get optimized. However, I have also had Intel versions that would crash, so it happens.


However, the branch generated by GCC here costs around 30 cycles on AMD.

On Intel it is a lot cheaper; Intel has more clever rewrite mechanisms and a bigger lookahead buffer. Probably Nehalem can already reorder the code.

Even then it is slower than doing a CMOV.
A CMOV costs 1 cycle at most, even on Intel.

I forgot to add that taking the branch costs another cycle of penalty on AMD, and then the jump back costs another cycle on AMD.

So even if that branch were predicted 100% correctly by AMD, it still loses 2 cycles unnecessarily.

That's why I want clean fall-through code.

Thanks,
Vincent
But you only want clean fall-through if it falls through _most_ of the time. Otherwise that would be slower, it would cause the i-cache to prefetch stuff that gets skipped, and is just not as efficient overall.
You're not listening. After it has mangled my neatly written C code into this spaghetti code, it doesn't know how to speed up something like that with PGO, because it isn't a branch anymore; it's just some sort of far jump forward and a jump back. Nothing more, nothing less. PGO doesn't help then. It stays spaghetti.

Now we can have a discussion about whether macaroni is a better form of spaghetti, as it is easier to eat with a spoon; the point is that generating clean, straight fall-through code with CMOVs (or, for my part, a set* instruction) means that the PGO pass still CAN turn the code into a branch when needed. GCC isn't that clever, though. GCC generates something that is slower on both Core 2 and AMD; it just hurts AMD *major league*. Not sure about Intel; I assume Nehalem isn't worse than Core 2 here.

There is no excuse to generate spaghetti code here, except if you want to keep GCC slower than other compilers and hurt AMD a tad more than Intel. When I say AMD, I also mean other manufacturers such as IBM (if you use GCC on that chip).

Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An example of intel behaviour in GCC

Post by bob »

diep wrote:
bob wrote:
diep wrote:
hgm wrote:
diep wrote:The reason for that is that GCC rewrites branches into ugly code like this. After it has been rewritten into this spaghetti code, PGO can do nothing for you.
Are you implying that PGO is totally stupid? If I were to write a profiler, it would not make the slightest difference how many jumps or branches there were in the code. You simply count the number of times each possible control path is traversed. Either by incrementing a counter after every branch and at every branch target, or, less invasively, by just recording the return address on every timer interrupt.
In GCC, yes, the PGO is really ugly and bad, and I am using very polite words.
Beginner level would be a compliment.

At least 4 times worse than what Intel C++ / Visual C++ do.

Your way of doing PGO in GCC would mean that either we hear one clean shot and you're dead from a bullet fired by a hired criminal, or the GCC team would soon undo your code, as AMD would be faster IPC-wise than Nehalem.

That's why I showed the code sample butchered by GCC. You must do SOMETHING to make GCC's code slower.

Of course two simple CMOVs would be a lot faster than all the silly jumping and should get generated by default. It is just one example. There is worse.

Both are American companies; that's the weird thing. Intel fabricates in Israel and then ships the wafers to Malaysia, where the hard work starts: testing them for weeks, then putting a stamp on them, and then they get sold. AMD fabricates its CPUs in Germany and then ships them to Indonesia, where the testing happens for weeks; stamp on, and they get sold.

Usually Bob only defends American projects "till death". Maybe he made a mistake this time and assumed the code would take the branch only seldom.
I don't know what you are talking about. Both AMD and Intel _are_ "American projects". I've used both, and had success with both. Intel currently leads the pack. AMD led it for quite a while. I use what is best, not what is politically correct or whatever.

If you do not use PGO, and you said you didn't, then the compiler has absolutely no way to know how the branch will behave, and yes, it will produce poor code when it makes the wrong assumption. The simple fix is to use PGO. Then the problem goes away.

Intel C is _not_ 4 times faster/better than GCC. It is better, and it might be 10% faster or so, as it has been in the past. I haven't run on AMD hardware recently, but when I have, I used GCC and it worked quite well. On occasion it would fail miserably and crash, particularly when I tried to profile a threaded program so that the thread-management code would also get optimized. However, I have also had Intel versions that would crash, so it happens.


However, the branch generated by GCC here costs around 30 cycles on AMD.

On Intel it is a lot cheaper; Intel has more clever rewrite mechanisms and a bigger lookahead buffer. Probably Nehalem can already reorder the code.

Even then it is slower than doing a CMOV.
A CMOV costs 1 cycle at most, even on Intel.

I forgot to add that taking the branch costs another cycle of penalty on AMD, and then the jump back costs another cycle on AMD.

So even if that branch were predicted 100% correctly by AMD, it still loses 2 cycles unnecessarily.

That's why I want clean fall-through code.

Thanks,
Vincent
But you only want clean fall-through if it falls through _most_ of the time. Otherwise that would be slower, it would cause the i-cache to prefetch stuff that gets skipped, and is just not as efficient overall.
You're not listening. After it has mangled my neatly written C code into this spaghetti code, it doesn't know how to speed up something like that with PGO, because it isn't a branch anymore; it's just some sort of far jump forward and a jump back. Nothing more, nothing less. PGO doesn't help then. It stays spaghetti.

Now we can have a discussion about whether macaroni is a better form of spaghetti, as it is easier to eat with a spoon; the point is that generating clean, straight fall-through code with CMOVs (or, for my part, a set* instruction) means that the PGO pass still CAN turn the code into a branch when needed. GCC isn't that clever, though. GCC generates something that is slower on both Core 2 and AMD; it just hurts AMD *major league*. Not sure about Intel; I assume Nehalem isn't worse than Core 2 here.

There is no excuse to generate spaghetti code here, except if you want to keep GCC slower than other compilers and hurt AMD a tad more than Intel. When I say AMD, I also mean other manufacturers such as IBM (if you use GCC on that chip).

Vincent
_You_ are not listening. The compiler doesn't "mangle" _anything_ when doing PGO. It just compiles the code into equivalent asm, instruments it so that branch behavior is measured, and after the recompile the code is rearranged to match the data from the PGO output.

I know a couple of the real contributors to GCC, and have for years. They do _not_ intentionally make the compiler slower than a commercial compiler, any more than I try to make Crafty slower than a commercial program. It doesn't make any sense...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An example of intel behaviour in GCC

Post by bob »

Here's a short comparison. I used the Intel compiler with PGO on my Core 2 Duo laptop and ran the Crafty bench command. I then did a plain-vanilla GCC compile, optimized but with no PGO, and ran the same (2.0 GHz Core 2 Duo using just one CPU). I then ran a 32-bit test as well, using the 2.8 GHz PIV Xeon in my office, same test.

Results are:

32 bit:

gcc:
Total nodes: 122093153
Raw nodes per second: 824953
icc:
Total nodes: 122093153
Raw nodes per second: 953852


64 bit:

gcc:
Total nodes: 122093153
Raw nodes per second: 2141985
icc:
Total nodes: 122093153
Raw nodes per second: 2219875


On my office box, Intel is about 10% faster, which is about what I expected, but on my 64-bit laptop the difference is very small.

GCC is not a bad compiler...

PGO would improve things, but I didn't feel like taking the 10 minutes or so to do it to see...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An example of intel behaviour in GCC

Post by bob »

mcostalba wrote:
bob wrote: Vincent, any _good_ compiler person could explain this to you. The idea is to (a) optimize branch prediction and (b) optimize the prefetch that occurs in cache blocks.

If you do this:

if (c) {
rarely executed code
}

It is more efficient to turn it into this:

if (!c) go to rare_executed;
continue_point:

somewhere out of the main instruction stream:

rarely_executed:
{rarely executed code}
go to continue_point;


The reason for that is twofold. First, we make the more common case for the branch the fall-through. Yes, it would get predicted correctly most of the time anyway. But note that now when we fetch the if (c) code, we fetch a block of 32/64/128 bytes depending on the CPU, and that code _won't_ contain the "rarely_executed" code, which does nothing more than waste cache and prefetch cycles.
Apart from the fact that probably the correct equivalent code of

if (c) {
rarely executed code
}


is

if (c) go to rare_executed;
continue_point:


I would say that this is a very slippery way of thinking about processor branch prediction. Also, with an empty branch target buffer we don't know what the processor will choose by default; for example, on MIPS the default is "not taken" branch prediction. Intel has never explained how its branch prediction actually works.

So, IMHO, your way of pseudo-optimizing (blindly) this very low-level stuff just messes up the code with no clear advantage.

A less naive approach would be to use a PGO compiler, which also shuffles the code for us instead of doing it manually with a dubious jump-and-label fest.

Just my two cents.
I am not sure I follow.

(a) If I wrote code as above, it would be for two reasons:

(1) I know that the branch is rarely taken and the remote target code is rarely executed as a result; and,

(2) I know that the compiler won't rearrange the code (for whatever reason), something that today would be highly uncommon, as most compilers are quite good at this, and those that do PGO would obviously rearrange things; otherwise there would be no reason to do PGO in the first place.

(b) I do not write code like my example, because I know that any compiler I use (which includes at least GCC and ICC) will rearrange the code as above for speed, so I can write the code in whatever way best serves clarity.

I was not suggesting that a "human" write the code in the example I gave. I was giving it as the natural way a compiler would optimize the code once it had PGO data showing that the "rarely executed code segment" above really is rarely executed. That makes the program faster and more cache-friendly at the same time.
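The out-of-line placement bob describes can also be requested explicitly in source with GCC/Clang function attributes; this saturating-add example is mine, not from the thread:

```c
#include <limits.h>
#include <stdio.h>

/* Marking the helper cold typically places it in a .text.unlikely
   section, so the rare path does not share i-cache lines with the
   hot path; noinline keeps it out of the caller's instruction
   stream entirely. */
__attribute__((cold, noinline))
static long overflow_result(long a) {
    fprintf(stderr, "saturating\n");
    return (a > 0) ? LONG_MAX : LONG_MIN;
}

/* Hot function: the common case falls straight through; the rare
   case is a call to the out-of-line cold helper. */
long saturating_add(long a, long b) {
    long s;
    if (__builtin_add_overflow(a, b, &s))   /* rarely true */
        return overflow_result(a);
    return s;
}
```

Comparing the `-O2 -S` output with and without the attributes shows the compiler moving the rare call out of the main instruction stream, which is exactly the layout sketched in the pseudocode above.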
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: An example of intel behaviour in GCC

Post by mcostalba »

bob wrote: I was not suggesting that a "human" write the code in the example I gave. I was giving that as a natural way the compiler would optimize the code once it had some PGO data that showed that the "rarely executed code segment" above is really rarely executed. That makes the program faster and more cache-friendly at the same time.
Ok. Then I have misunderstood you. Sorry for the noise.
flok

Re: An example of intel behaviour in GCC

Post by flok »

bob wrote: If you do this:

if (c) {
rarely executed code
}

It is more efficient to turn it into this:

if (!c) go to rare_executed;
continue_point:
In the linux kernel they use the unlikely() macro to help the compiler.
E.g.
if (unlikely(pos_winning_from_crafty == 1))
{
}

they are defined as:
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

http://zeta-puppis.com/2006/02/09/likel ... xtensions/
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An example of intel behaviour in GCC

Post by bob »

flok wrote:
bob wrote: If you do this:

if (c) {
rarely executed code
}

It is more efficient to turn it into this:

if (!c) go to rare_executed;
continue_point:
In the linux kernel they use the unlikely() macro to help the compiler.
E.g.
if (unlikely(pos_winning_from_crafty == 1))
{
}

they are defined as:
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

http://zeta-puppis.com/2006/02/09/likel ... xtensions/
There is no other choice. You can't PGO the kernel itself. :)
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: An example of intel behaviour in GCC

Post by mcostalba »

bob wrote:
flok wrote:
bob wrote: If you do this:

if (c) {
rarely executed code
}

It is more efficient to turn it into this:

if (!c) go to rare_executed;
continue_point:
In the linux kernel they use the unlikely() macro to help the compiler.
E.g.
if (unlikely(pos_winning_from_crafty == 1))
{
}

they are defined as:
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

http://zeta-puppis.com/2006/02/09/likel ... xtensions/
There is no other choice. You can't PGO the kernel itself. :)

GCC has historically failed to have a decent PGO; that is the reason the above defines are so widely used, and not only in the kernel, which in any case could be audited... and it is, BTW. :D

A nice side effect is that they are also useful for self-documenting the code.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: An example of intel behaviour in GCC

Post by wgarvin »

mcostalba wrote:GCC has historically failed to have a decent PGO; that is the reason the above defines are so widely used, and not only in the kernel, which in any case could be audited... and it is, BTW. :D

A nice side effect is that they are also useful for self-documenting the code.
There are other reasons not to use PGO. For large projects it can add some risk, which can be hard to quantify. We usually don't use it for console games because it's hard to collect reliable profile info until the codebase stops changing, which doesn't happen until right before we ship. The closer we get to shipping, the more dangerous it is for anything to be changing, including the compiled binary code! PGO builds are less reproducible than non-PGO builds, in the sense that changing the profile data can have drastic and unpredictable effects on the actual compiled code, because the output of the compiler depends in non-trivial ways on the collected profile data. So the cumulative testing performed on all earlier builds would give us less confidence about today's build if they were PGO builds than it does if they were all non-PGO builds. With a non-PGO build, if nearly all of the source code has not changed, then nearly all of the compiled binary code will be identical as well (the compiler will give the same output from the same input). So the weeks of testing effort that we put into those previous binary versions still give us a lot of confidence about the stability and quality of the final build.

Compiler bugs are another problem. In some compilers the PGO support itself is somewhat buggy. But PGO can also expose you to compiler bugs that are not directly related to PGO but to flow control or loop optimization or other things, bugs that might not affect you at all in a non-PGO build. And even though they're rare, we do trip over compiler bugs from time to time. On a previous project I got hit by a bug in the Microsoft compiler involving an optimization of a signed loop condition. The condition should have been false, skipping the loop; instead the compiler flubbed it and executed the loop repeatedly until it crashed. Anyway, with non-PGO builds the compiler will usually hit the same bugs each time we compile the code. With PGO builds, each new set of profile data can potentially change the compiler's output in significant ways, so who knows what could happen! Maybe the first 99 times it would not produce the buggy output, and then on the final build before shipping you could be really unlucky and get buggy output. Imagine shipping a million copies of a console game only to have users show up in droves complaining about an unavoidable crash bug in the game. We can't test everything in a 24-hour period; modern games have millions of lines of source code and tens of megabytes of compiled code. So worst-case scenarios like that make it easy to pass up the benefits of PGO. :P

Anyway, using PGO on a large project that keeps changing right up until the end can be both cumbersome and risky. For chess engines it's probably fine, though! YMMV.