An example of intel behaviour in GCC

Discussion of chess software programming and technical issues.

Moderator: Ras

diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

An example of intel behaviour in GCC

Post by diep »

In summer 2008 I rewrote some code in Diep's hashtable in order to speed it up. I was very surprised to see Diep actually run SLOWER with the new code on AMD.

So for once I took a look at the assembler output.

Here is the C code of the fragment:

(Everything here is 32-bit signed or unsigned integers)

Code:

Hashtable code:
   ..
   deltax = 0;
    if( bestscore >= MATEVALUE-1000 )
      deltax = realply;

    if( bestscore <= -MATEVALUE+1000 )
      deltax = -realply;

    (a few lines more code)
} // end of function
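
The two ifs above can also be expressed as a single branchless chain of conditionals, which a CMOV-willing compiler typically lowers to exactly the two conditional moves being asked for. A minimal sketch, with a hypothetical MATEVALUE since Diep's real value is not shown in the post:

```c
#define MATEVALUE 500000  /* hypothetical; the real Diep value is not shown */

/* Same logic as the hashtable fragment above, written as one expression.
   The two conditions are mutually exclusive (MATEVALUE-1000 is positive),
   so the ternary chain is equivalent to the two sequential ifs. */
static int mate_delta(int bestscore, int realply)
{
    return (bestscore >=  MATEVALUE - 1000) ?  realply
         : (bestscore <= -MATEVALUE + 1000) ? -realply
         : 0;
}
```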

I was using the latest released GCC at the time, which I had downloaded just hours before doing this compile. It made no difference which flags I used.

gcc -O3 -march=k8 -mtune=k8 -S -DUNIXPII -c diepab.c

Other flags such as -O2, or tuning for k7 instead of k8, didn't help either. Each time the above code fragment produced the same output.

Next is what it generated. It first regularly jumps away to label L837, executes 4 instructions there, and jumps back. I don't want it to jump around in the code. I want fall-through code; that is light-years faster on AMD. How they got the bizarre idea to generate this type of code I do not know.

On Intel such jumps are a lot cheaper than on AMD. Jumps over a bunch of instructions are painfully slow on k8, while for short jumps of a few instructions AMD is a lot faster. I had noticed that big-time when optimizing some code a few years ago.

Note that each time I take a look, over the past years, I see the same pattern repeat: I try to make code faster in C, and the compiler screws it up. It is just not funny.

What I do not know is whether 40 instructions still fit within the 72-byte lookahead buffer of the Core 2; I am sure they are out of range for AMD, which has a shorter lookahead than Intel.

I just wanted two CMOVs. In fact, the compiler could have reordered the CMOVs so they would not be within the same 16 bytes. Instead, it is generating the crap code below:

Code:

cmpl	$498999, 28(%ebp)
	movl	12(%ebp), %eax
	jle	L837
L822:
	movl	28(%ebp), %edx
        ...
        here are another 35 lines of assembler
        ...
	orl	%ecx, -68(%ebp)
	xorl	-68(%ebp), %edx
	movl	%edx, 16(%edi)
	addl	$68, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	leave
	ret
L836:
	xorl	%eax, %eax
	cmpl	44(%ebp), %edx
	setl	%al
	incl	%eax
	movl	%eax, -48(%ebp)
	jmp	L801
L837:
	negl	12(%ebp)
	xorl	%eax, %eax
	cmpl	$-498999, 28(%ebp)
	cmovl	12(%ebp), %eax
	jmp	L822
L814:
	movl	36(%ebp), %ecx
	cmpl	%ecx, %edx
	movl	%ecx, -16(%ebp)
	jbe	L817
	movl	-28(%ebp), %eax
	movl	%edx, -16(%ebp)
	andl	$-131072, %eax
	movl	%eax, -44(%ebp)
	jmp	L817
Note I was not the first to notice this. GCP found something on the net where Linus Torvalds said there was no excuse for GCC not to use CMOV. The answer posted to that (we are speaking about the year 2007 now) was that it would be slower on the P4 Prescott, and therefore GCC does not generate it like that.

Vincent
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: An example of intel behaviour in GCC

Post by wgarvin »

Are you sure? As of January 2007, it sounds like Linus was arguing that CMOVcc instructions are not very useful on modern processors:

http://ondioline.org/mail/cmov-a-bad-id ... order-cpus

http://linux.derkeiler.com/Mailing-List ... 00689.html
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An example of intel behaviour in GCC

Post by bob »

diep wrote:In summer 2008 I rewrote some code in Diep's hashtable in order to speed it up. I was very surprised to see Diep actually run SLOWER with the new code on AMD.

So for once I took a look at the assembler output.

Here is the C code of the fragment:

(Everything here is 32-bit signed or unsigned integers)

Code:

Hashtable code:
   ..
   deltax = 0;
    if( bestscore >= MATEVALUE-1000 )
      deltax = realply;

    if( bestscore <= -MATEVALUE+1000 )
      deltax = -realply;

    (a few lines more code)
} // end of function

I was using the latest released GCC at the time, which I had downloaded just hours before doing this compile. It made no difference which flags I used.

gcc -O3 -march=k8 -mtune=k8 -S -DUNIXPII -c diepab.c

Other flags such as -O2, or tuning for k7 instead of k8, didn't help either. Each time the above code fragment produced the same output.

Next is what it generated. It first regularly jumps away to label L837, executes 4 instructions there, and jumps back. I don't want it to jump around in the code. I want fall-through code; that is light-years faster on AMD. How they got the bizarre idea to generate this type of code I do not know.
Vincent, any _good_ compiler person could explain this to you. The idea is to (a) optimize branch prediction and (b) optimize the prefetch that occurs in cache blocks.

If you do this:

if (c) {
rarely executed code
}

It is more efficient to turn it into this:

if (!c) go to rarely_executed;
continue_point:

somewhere out of the main instruction stream:

rarely_executed:
{rarely executed code}
go to continue_point;


The reason for that is twofold. First, the more common case for the branch becomes the fall-through. Yes, it would get predicted correctly most of the time anyway. But second, note that when we fetch the if(c) code, we fetch a block of 32/64/128 bytes depending on the CPU, and that block _won't_ contain the "rarely_executed" code, which would do nothing more than waste cache space and prefetch cycles.
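
With GCC, this hot/cold layout can also be requested at the source level via the real builtin __builtin_expect; a minimal sketch using the kernel-style likely/unlikely macro convention (the function and its threshold are hypothetical):

```c
/* Kernel-style wrappers around GCC's __builtin_expect. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* GCC tends to move the unlikely branch out of the fall-through path,
   giving the hot/cold code layout described above. */
static int clamp_score(int score)
{
    if (unlikely(score > 30000))   /* hypothetical rare overflow case */
        return 30000;
    return score;                  /* common case: falls straight through */
}
```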



On Intel such jumps are a lot cheaper than on AMD. Jumps over a bunch of instructions are painfully slow on k8, while for short jumps of a few instructions AMD is a lot faster. I had noticed that big-time when optimizing some code a few years ago.

Note that each time I take a look, over the past years, I see the same pattern repeat: I try to make code faster in C, and the compiler screws it up. It is just not funny.

What I do not know is whether 40 instructions still fit within the 72-byte lookahead buffer of the Core 2; I am sure they are out of range for AMD, which has a shorter lookahead than Intel.

I just wanted two CMOVs. In fact, the compiler could have reordered the CMOVs so they would not be within the same 16 bytes. Instead, it is generating the crap code below:

Code:

cmpl	$498999, 28(%ebp)
	movl	12(%ebp), %eax
	jle	L837
L822:
	movl	28(%ebp), %edx
        ...
        here are another 35 lines of assembler
        ...
	orl	%ecx, -68(%ebp)
	xorl	-68(%ebp), %edx
	movl	%edx, 16(%edi)
	addl	$68, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	leave
	ret
L836:
	xorl	%eax, %eax
	cmpl	44(%ebp), %edx
	setl	%al
	incl	%eax
	movl	%eax, -48(%ebp)
	jmp	L801
L837:
	negl	12(%ebp)
	xorl	%eax, %eax
	cmpl	$-498999, 28(%ebp)
	cmovl	12(%ebp), %eax
	jmp	L822
L814:
	movl	36(%ebp), %ecx
	cmpl	%ecx, %edx
	movl	%ecx, -16(%ebp)
	jbe	L817
	movl	-28(%ebp), %eax
	movl	%edx, -16(%ebp)
	andl	$-131072, %eax
	movl	%eax, -44(%ebp)
	jmp	L817
Note I was not the first to notice this. GCP found something on the net where Linus Torvalds said there was no excuse for GCC not to use CMOV. The answer posted to that (we are speaking about the year 2007 now) was that it would be slower on the P4 Prescott, and therefore GCC does not generate it like that.

Vincent
You can make it do so, however, or at least you could. CMOV was never a "default" option when I was looking at this, but you could make it emit the instructions. But the above is not really about CMOV; it is about trying to optimize prefetching, and it should be the right way to do things on either AMD or Intel, assuming you ran a PGO compile so that the compiler actually knows which way the branch goes most of the time...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An example of intel behaviour in GCC

Post by bob »

wgarvin wrote:Are you sure? As of January 2007, it sounds like Linus was arguing that CMOVcc instructions are not very useful on modern processors:

http://ondioline.org/mail/cmov-a-bad-id ... order-cpus

http://linux.derkeiler.com/Mailing-List ... 00689.html
I think that most of that is "PIV sucks", and with predictable branches, cmov doesn't offer anything... No surprise at all except for chess evaluations where the often-used if (wtm) type construct causes some problems, at least for those that don't use a tournament branch predictor with one side using a correlated prediction mechanism...
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: An example of intel behaviour in GCC

Post by wgarvin »

bob wrote: I think that most of that is "PIV sucks", and with predictable branches, cmov doesn't offer anything... No surprise at all except for chess evaluations where the often-used if (wtm) type construct causes some problems, at least for those that don't use a tournament branch predictor with one side using a correlated prediction mechanism...
To me it read more like "CMOV sucks generally, though it sucks even more than usual on P4"...

However, if you consider just the case where you have a genuinely unpredictable branch and you are selecting between two constants (or at least two values that can be easily synthesized without memory accesses), then CMOV or some other branchless construction (e.g. using SBB) might be useful.
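
The branchless-select idea can be sketched in plain C; this mask trick is not from the thread, just a common illustration of the kind of value an SBB-based sequence computes:

```c
#include <stdint.h>

/* Branchless select: returns a when cond is nonzero, else b.
   -(cond != 0) is all-ones (-1) when true, all-zeros when false,
   so exactly one operand survives the masking. */
static int32_t select_branchless(int cond, int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}
```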

For chess, maybe some inline assembly macros could be used to get CMOV when you really want it. I've never tried that, though.
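
Such a macro might look like the following sketch for GCC on x86 (hypothetical, not from the thread; a portable fallback keeps it compilable elsewhere):

```c
/* Hypothetical CMOV wrapper: r = (a >= b) ? x : y.
   Uses GCC extended asm on x86; plain C on other targets. */
static inline int cmov_ge(int a, int b, int x, int y)
{
    int r = y;
#if defined(__x86_64__) || defined(__i386__)
    __asm__("cmpl %2, %1\n\t"      /* set flags from a - b     */
            "cmovge %3, %0"        /* r = x when a >= b (signed) */
            : "+r"(r)
            : "r"(a), "r"(b), "r"(x)
            : "cc");
#else
    if (a >= b)
        r = x;
#endif
    return r;
}
```
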
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An example of intel behaviour in GCC

Post by hgm »

I once tried to replace every branch over a simple movl instruction with a cmov, in the assembly produced by GCC with the -S option. On the Pentium M this produced no measurable speedup. Apparently most branches were too easy to predict...
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: An example of intel behaviour in GCC

Post by mcostalba »

bob wrote: Vincent, any _good_ compiler person could explain this to you. The idea is to (a) optimize branch prediction and (b) optimize the prefetch that occurs in cache blocks.

If you do this:

if (c) {
rarely executed code
}

It is more efficient to turn it into this:

if (!c) go to rarely_executed;
continue_point:

somewhere out of the main instruction stream:

rarely_executed:
{rarely executed code}
go to continue_point;


The reason for that is twofold. First, the more common case for the branch becomes the fall-through. Yes, it would get predicted correctly most of the time anyway. But second, note that when we fetch the if(c) code, we fetch a block of 32/64/128 bytes depending on the CPU, and that block _won't_ contain the "rarely_executed" code, which would do nothing more than waste cache space and prefetch cycles.
Apart from the fact that the correct equivalent of

if (c) {
rarely executed code
}


is

if (c) go to rarely_executed;
continue_point:


I would say this is a very slippery way of reasoning about processor branch prediction. Even with an empty branch target buffer we don't know what the processor will choose by default; for example, on MIPS the default is "not-taken" branch prediction. Intel has never explained how its branch prediction actually works.

So, IMHO, blindly pseudo-optimizing this very low-level stuff just messes up the code with no clear advantage.

A less naive approach would be to use a PGO build, which also shuffles the code for us instead of doing it manually with a dubious jump-label fest.

Just my two cents.
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An example of intel behaviour in GCC

Post by hgm »

I am afraid I have to side with Bob on this one. If "fall-through code is light-years faster on AMD", then I would want the code to fall through in the common case, not the rare one. Which is exactly what the shown GCC-produced assembly code does.

And I think that Intel actually has revealed how its static branch prediction works: the standard way is to assume forward branches are not taken and backward branches are taken. Around the time of Core Solo/Duo (IIRC) they published that they had found an improvement on this, changing the default prediction for branches with a very remote target. After some googling I did find something about it in connection with the P-IV, but one might assume this is something that should have the same optimum irrespective of processor family, and hence be used in all Intel CPUs ever since.

http://arstechnica.com/old/content/2004 ... um-2.ars/8
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: An example of intel behaviour in GCC

Post by mcostalba »

Thanks for the Intel link.

Regarding the jump-to-label optimization, I really think it is the wrong way to go: apart from being compiler/processor specific, it does not scale and is not a general solution.

You should compare the hand-rewritten jump-label code with a PGO-optimized build and see whether the rewrite is still needed, or whether the compiler has, as it should, rewritten the assembly for you.

In my experience, code should be written in the simplest and most straightforward way; this is what the compiler expects and what it is designed to optimize for.
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An example of intel behaviour in GCC

Post by hgm »

OK, I assume there is a misunderstanding here. I don't think Bob meant that one should program like this in the high-level language, just that an optimizing compiler would translate the original code the way a compiler that does not shuffle code around would translate the goto version.