optimization surprise

Discussion of chess software programming and technical issues.

Moderator: Ras

jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

optimization surprise

Post by jwes »

My makemove/unmakemove functions seemed too slow, so I looked at the code and found this. My move structure is 2 bits of flags, 6 bits for the from square, 2 more bits of flags, and 6 bits for the to square. Various combinations of the flag bits represent the special moves, and I had a switch to handle the special cases plus the default case:

Code: Select all

#define SPECIALMOVE 0xc0c0
#define MOVEEP 0xc000
#define MOVEKC 0xc040
#define MOVEQC 0xc080
#define MOVEP2 0xc0c0
#define MOVEPN 0x8000
#define MOVEPB 0x8040
#define MOVEPR 0x8080
#define MOVEPQ 0x80c0

switch (move & SPECIALMOVE)
{
    case MOVEEP:
        ... break;
    case MOVEKC:
        ... break;
    case MOVEQC:
        ... break;
    case MOVEP2:
        ... break;
    case MOVEPN:
        ... break;
    case MOVEPB:
        ... break;
    case MOVEPR:
        ... break;
    case MOVEPQ:
        ... break;
    default:
        ... break;
}
I changed the code to map the case values into the range 0 to 15, like this:

Code: Select all

#define CONVERT(a) (((a&0xc000)>>12) | ((a&0xc0)>>6))

switch (CONVERT(move))
{
    case CONVERT(MOVEEP):
        ... break;
    case CONVERT(MOVEKC):
        ... break;
    case CONVERT(MOVEQC):
        ... break;
    case CONVERT(MOVEP2):
        ... break;
    case CONVERT(MOVEPN):
        ... break;
    case CONVERT(MOVEPB):
        ... break;
    case CONVERT(MOVEPR):
        ... break;
    case CONVERT(MOVEPQ):
        ... break;
    case 0:
        ... break;
}
and makemove/unmakemove now runs about 20% faster. That seems like a lot for changing one statement; I'd guesstimate it saves ~50 ns each time the switch executes.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: optimization surprise

Post by Gerd Isenberg »

Hmm, that seems to be an optimization a "good" optimizing compiler should find by itself. I guess you scattered your special bits for better readability of the from-to coordinates during debugging. I would prefer to keep them adjacent.

Gerd
Harald Johnsen

Re: optimization surprise

Post by Harald Johnsen »

In the first switch the compiler generates a lot of tests and jumps; in the second there is a single branch table.
Gerd Isenberg wrote:Hmm, that seems to be an optimization a "good" optimizing compiler should find by itself. I guess you scattered your special bits for better readability of the from-to coordinates during debugging. I would prefer to keep them adjacent.
I'm wondering which compilers try to do perfect hashing for case values.
GCC and ICC?

HJ.
CThinker
Posts: 388
Joined: Wed Mar 08, 2006 10:08 pm

Re: optimization surprise

Post by CThinker »

This code is a lot faster:

Code: Select all

switch(...)
{
case 0:...
case 1:...
...
case 10:
}
than this code:

Code: Select all

switch(...)
{
case 100:
case 325:
case 401:
...
case 1230:
}
The reason is that for the first snippet the compiler will generate a jump table and use the 'case' values as the index, so the entire switch executes in effectively one indirect jump. In the second snippet, each 'case' value has to be tested individually.
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: optimization surprise

Post by jwes »

CThinker wrote:This code is a lot faster:

Code: Select all

switch(...)
{
case 0:...
case 1:...
...
case 10:
}
than this code:

Code: Select all

switch(...)
{
case 100:
case 325:
case 401:
...
case 1230:
}
The reason is that for the first snippet the compiler will generate a jump table and use the 'case' values as the index, so the entire switch executes in effectively one indirect jump. In the second snippet, each 'case' value has to be tested individually.
Not quite true. The compiler generates a tree of if statements, so it takes log2(n) conditional jumps. I knew the jump table would be faster, but not that it would significantly impact the performance of my program.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: optimization surprise

Post by wgarvin »

jwes wrote:
CThinker wrote:This code is a lot faster:
...
The reason is that for the first snippet the compiler will generate a jump table and use the 'case' values as the index, so the entire switch executes in effectively one indirect jump. In the second snippet, each 'case' value has to be tested individually.
Not quite true. The compiler generates a tree of if statements, so it takes log2(n) conditional jumps. I knew the jump table would be faster, but not that it would significantly impact the performance of my program.
Perhaps the branches are not very predictable, so the log2(n) conditional jumps mean a couple of mispredictions, whereas the table version has just one misprediction at the indirect branch.
User avatar
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: optimization surprise

Post by hgm »

It is not obvious to me that a jump table would be faster than a well-designed jump tree. It depends on the details of the misprediction logic, which heavily depends on CPU type. In general, the misprediction logic will be able to hold more information about a tree of branches than about a single indirect jump. This might help to get a lower misprediction rate on the tree.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: optimization surprise

Post by bob »

hgm wrote:It is not obvious to me that a jump table would be faster than a well-designed jump tree. It depends on the details of the misprediction logic, which heavily depends on CPU type. In general, the misprediction logic will be able to hold more information about a tree of branches than about a single indirect jump. This might help to get a lower misprediction rate on the tree.
Also, a jump through a jump table has to be predicted even though it is unconditional, because the target address is unknown at fetch time, and that adds an additional penalty when the unconditional jump is predicted to go to the wrong address...
User avatar
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: optimization surprise

Post by hgm »

Yes, but it seems that since the Pentium M they have implemented something for this. If I understood it correctly, they always predict the branch as taken, but they take the speculative target from the Branch Target Buffer indexed by the global branch predictor (the packed taken/not-taken bits of the last 12 branches).

So if the target address correlates in any way with the code flow just before the indirect jump, it can make a prediction about the target address. Seems to be only 75% efficient, though, while normal branches are typically predicted with 95-97% accuracy.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: optimization surprise

Post by wgarvin »

hgm wrote:It is not obvious to me that a jump table would be faster than a well-designed jump tree. It depends on the details of the misprediction logic, which heavily depends on CPU type. In general, the misprediction logic will be able to hold more information of a tree of branches than on a single indirect jump. This might help to get a lower misprediction rate on the tree.
Suppose the target is truly unpredictable. Then you have log2(n) branches that each have a 50% chance of being mispredicted, leading to on average log2(n)/2 mispredictions. Compare that with the indirect branch, which has 1 misprediction. All else being equal, the indirect branch will be faster for n > 4 or so. However, if the target is somewhat predictable, then the conditional branches will do better because they will sometimes predict correctly and the proper instructions will be fetched and speculatively executed. The indirect branch will still be mispredicted nearly all of the time though. If the branch is extremely predictable, both solutions should do well but again the indirect branch is likely to be better for n > 4.
bob wrote:Also a jump into a jump table has to be predicted even though it is unconditional, because the target address is unknown at fetch time, and that is going to add additional penalty when the unconditional jump is predicted to go to the wrong address...
So that's a good point which suggests that "all else is not equal" in my example above. Both potential targets of the conditional branch are known as it decodes the instruction, whereas with the indirect branch there may be some latency before the target is actually known (which could cause some delay in getting those instructions fetched). I'm guessing that for n >= 8 the indirect branch would still be better.

By the way, I recall in the days of the Pentium II/III that Intel advised one of my co-workers that if you load the target address into a register at least 40 cycles before you jump or call through the register, it will basically be free (because it will be predicted to go to that address). I've never noticed a compiler actually doing this, but I thought it was interesting.