Core Port Saturation

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

xmas79
Posts: 286
Joined: Mon Jun 03, 2013 7:05 pm
Location: Italy

Core Port Saturation

Post by xmas79 »

Hi all,
following HGM's post on processor caches, I have another question. I would like to know how to measure the impact of core port saturation... We know that a modern processor core can execute up to 4 instructions in parallel, so its throughput can be as high as 4 instructions per clock (0.25 CPI), provided the program being executed has a good instruction mix, e.g. interleaving a memory instruction, a multiplication, etc., so that the pipeline stays full and no stalls occur. But looking at my chess code I saw this (make move after a capture):

Code: Select all

TotalPiecesCount(opponent)--;
TotalPiecesCount(BOTH)--;
PieceCount(opponent, captured)--;
PieceCount(BOTH, captured)--;
and then I just thought it was not a good instruction pattern. Usually these statements are translated into "inc dword ptr [eax]", or "inc byte ptr [eax+ebx*4+0xeac0]", or something similar, and on modern out-of-order CPUs such instructions have practically zero impact on speed. They will usually occupy some empty slot in the pipeline, so they are pretty much free... But what if we have a bunch of these instructions in a row? I wrote this small program:

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void test()
{
	unsigned long long size = 512*1024*1024;
	unsigned char *memory = (unsigned char*)malloc(size);
	memset(memory, 0, size);	/* zero the whole buffer */

	time_t t = clock();
	for (int j = 0; j < 8; j++)
	{
		unsigned char *current = memory;
		for (unsigned long long i = 0; i < size; i += 2   /* 16 */)
		{
			 (*(current++))++;
			 (*(current++))++;
			 //(*(current++))++;
			 //(*(current++))++;
			 //(*(current++))++;
			 //(*(current++))++;
			 //(*(current++))++;
			 //(*(current++))++;
			 //(*(current++))++;
			 //(*(current++))++;
			 //(*(current++))++;
			 //(*(current++))++;
			 //(*(current++))++;
			 //(*(current++))++;
			 //(*(current++))++;
			 //(*(current++))++;
		}
	}

	t = clock() - t;
	printf("%6.3f sec\n", t * (1. / CLOCKS_PER_SEC));
	free(memory);
}

and the result is that the partially unrolled loop is always faster:

Code: Select all

	1.852 sec  ----> when the increment is 2 and there are 2 (*(current++))++; rows
	1.531 sec  ----> when the increment is 16 and there are 16 (*(current++))++; rows
The compiled code in the 2-increment case is:

Code: Select all

inc byte ptr [eax]
inc byte ptr [eax + 1]
while in the 16-increment case it is:

Code: Select all

inc byte ptr [eax]
inc byte ptr [eax + 1]
inc byte ptr [eax + 2]
...
inc byte ptr [eax + 15]
So I would expect a lot of pipeline stalls in the case with 16 INCs, while I would expect a lower CPI in the 2-INC case, as the CPU can execute other instructions belonging to a group different from "inc" (e.g. the loop-end comparison).

This clearly means I'm missing something about these architectures...

Another interesting question is what the impact of the instruction cache size is on a typical alpha-beta searcher. My CPU, for example, is an Intel i7-3630QM quad-core with HT, and it has 32 KB of instruction cache. Are the Eval+AB+Hash+...+printf+... instructions bigger than 32 KB? I guess yes, so keeping code compact should have very little impact on performance (please do not consider perft, where code locality is very high, so you would have to do something very wrong to get it wrong...). What do you think?

Best regards,
Natale.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Core Port Saturation

Post by bob »

xmas79 wrote:Hi all,
following HGM's post on processor caches, I have another question. I would like to know how to measure the impact of core port saturation... We know that a modern processor core can execute up to 4 instructions in parallel, so its throughput can be as high as 4 instructions per clock (0.25 CPI), provided the program being executed has a good instruction mix...

Another interesting question is what the impact of the instruction cache size is on a typical alpha-beta searcher. My CPU, for example, is an Intel i7-3630QM quad-core with HT, and it has 32 KB of instruction cache. Are the Eval+AB+Hash+...+printf+... instructions bigger than 32 KB?

First, you do NOT have to execute one add, one mul, etc. per cycle to get 4 micro-ops per cycle of throughput. While the specifics of the execution units are anything but constant across the existing Intel processors, you can generally execute 3 independent integer instructions in parallel. All you need is actual data independence between the three instructions. For example,

inc eax
inc ebx
inc ecx

is perfectly OK.

inc eax
inc eax
inc eax

has data dependencies that prevent parallel execution. Your code doesn't look that bad. And when you factor in the other instructions needed to access "x(i,j)" (i and j have to be collapsed to a single index, hopefully with a shift if the rightmost dimension is a power of 2, otherwise with an integer multiply), I don't think your code is particularly problematic with respect to minimizing CPI.
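
To make that index-collapsing point concrete, here is a small sketch (the array names and dimensions are made up for the example, not taken from anyone's engine):

Code: Select all

/* With a power-of-two rightmost dimension, &x[i][j] is base + i*8 + j, which
   the compiler can fold into a shift or a single LEA addressing mode.  With a
   non-power-of-two dimension it needs an integer multiply (or a chain of
   shifts and adds) first. */
int piece_count_pow2[2][8];   /* i*8 + j  ->  shift / LEA          */
int piece_count_odd [2][7];   /* i*7 + j  ->  imul (or shift+sub)  */

int probe(int i, int j)
{
    return piece_count_pow2[i][j] + piece_count_odd[i][j];
}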

Your second example clearly has nothing but a huge string of data dependencies, which means serial rather than parallel order. Yes, there is some staggered parallel stuff to do, but "current++" serves as a finite limiter of parallel execution, and it gives an example of where the idea of loop unrolling does not always work. You can just make the executable bigger, which means the cache footprint grows and you lose rather than win or break even. The limiting factor is the size of the reorder buffer in the CPU: how many instructions ahead can it fetch so that it can find those parallel operations? The Hennessy/Patterson architecture book addresses this issue, and even goes so far as to answer "what happens if the reorder buffer is infinite in size?" It turns out the gain over a 70-100 entry buffer is minimal.

What will fit in 32K? :) Depends on the programmer and how carefully he writes his code and how complex it actually ends up. One can reach some impressive NPS speeds when things are done correctly.

For example, my code, running on a machine that is 3-4 years old (an Intel Xeon 5650 at 2.67 GHz, 2 x 6 cores), is currently hitting around 50M nodes per second, and slowly climbing as I clean things up.
xmas79
Posts: 286
Joined: Mon Jun 03, 2013 7:05 pm
Location: Italy

Re: Core Port Saturation

Post by xmas79 »

Hi,
bob wrote:Your code doesn't look that bad. And when you factor in the other instructions needed to access "x(i,j)" (i and j have to be collapsed to a single index, hopefully with a shift if the rightmost dimension is a power of 2
I accounted for it, and making my array dimensions powers of two (e.g. pieces is a BITBOARD pieces[2][8]) makes it faster, as the pointer is often computed with a single LEA instruction like "LEA [EAX+EBX*8+structure_offset]". I don't have a big eval at all, but the code pattern for accessing these things has practically zero impact on search performance. A pretty clock trick (though useless except for perft).
bob wrote:...Your second example clearly has nothing but a huge string of data dependencies, which means serial rather than parallel order. Yes, there is some staggered parallel stuff to do, but "current++" serves as a finite limiter of parallel execution, and it gives an example of where the idea of loop unrolling does not always work.
Sorry, but I don't get it. My understanding of your statement (like my own thinking) is that the 16 version should be slower, but it's actually faster!

The smart compiler knows that putting a "current++" on each row is inefficient; that's why it translates the whole thing into a series of 16 INCs with increasing memory offsets and then increments the pointer only once. The equivalent C of the compiled code is:

Code: Select all

(*(current + 0))++;
(*(current + 1))++;
(*(current + 2))++;
...
(*(current + 15))++;
current += 16;
The thing compiles the same way in the "2" increment case (with 1 it gets optimized in a completely different way, so no comparison is possible). But the 16 case is always faster!! I also tried with 64, 128, etc... No way... The unrolled version is always faster. The translated code shows no dependency at all between instructions, so I expect 1 CPI in this code. But I would expect to hit port saturation, as the CPU can't execute 16 memory writes in parallel... The situation is different with only 2 INCs, as the CPU could execute instructions from other groups, and could therefore execute them in parallel more often than in the 16 (or 64, or 128!) case, lowering the CPI value and hence the execution time... It can't be that the final statement (current += x) is executed fewer times in the 16 case than in the 2 case; 0.3 seconds is too big a difference for that. And it seems that the processor "understands" all this and takes no penalty at all... So I can't figure out what's under the hood...


Thanks,
Natale.
Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: Core Port Saturation

Post by Karlo Bala »

xmas79 wrote:...The unrolled version is always faster. The translated code shows no dependency at all between instructions, so I expect 1 CPI in this code. But I would expect to hit port saturation, as the CPU can't execute 16 memory writes in parallel...

This is an example from Paul Hsieh.

Before unrolling:

Code: Select all

    float f[100], sum;

    ...

    /* Back to back dependent 
       adds force the throughput 
       to be equal to the total 
       FADD latency. */

    sum = 0;
    for (i = 0; i < 100; i++) {
        sum += f[i];
    }
After:

Code: Select all

    float f[100], sum0, sum1;
    float sum2, sum3, sum;

    ...

    /* Optimized for a 4-cycle 
       latency fully pipelined 
       FADD unit.  The throughput 
       is one add per clock. */

    sum0 = sum1 = sum2 = sum3 = 0;
    for (i = 0; i < 100; i += 4) {
        sum0 += f[i];
        sum1 += f[i+1];
        sum2 += f[i+2];
        sum3 += f[i+3];
    }
    sum = (sum0 + sum1) + (sum2 + sum3);
:wink:
Best Regards,
Karlo Balla Jr.
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Core Port Saturation

Post by hgm »

It seems natural to me that the 16-case should be faster than the 2-case: you are doing the same number of increments, but in fewer loop iterations. So the overhead in incrementing the pointer and the counter is smaller. The 'critical path' in all cases is incrementing the pointer and loop counter, but the optimizer was smart enough to do that only once per iteration. As the latency of inc is 1 clock (i.e., the result can already be used on the subsequent clock cycle), this does not really slow you down, not even in the case of the tightest loop.

You are doing 8 x 512M = 2^32 (about 4.3 billion) increments at a CPU clock of (probably) 3.4 GHz (the turbo-boost speed of the 3630QM). If you had done them at 1 increment per clock, it would have taken about 1.26 sec.

The basic operation requires one memory load, one memory store and an ALU operation (inc). I think the i7 can do 2 loads or 1 load + 1 store per clock, and 3 (simple) ALU operations (for the more complex operations such as shift and integer multiply, there is only 1 ALU that can do them). But the instruction pipeline itself (in particular the retirement stage) can only do 4 instructions per clock.

So your code is load/store limited to 1 increment per clock, which requires only 3 instructions. The 4th 'slot' can be used to (out of order) handle the overhead (which is all simple ALU operations) of incrementing the pointer, the loop counter, testing the latter and branching. That requires at least 4 operations, though. This should be no problem in the 16-case, where the loop body takes 16 clocks anyway, and 16 ALU instructions can be executed in parallel. But in the 2-case only 2 ALU instructions can be executed in parallel with the body. So you would need an extra cycle (sometimes) to execute the remaining instructions of the loop overhead.

If the loop overhead is really 4 instructions, and two of them can be done in parallel with the body, only two remain, which in principle is only half a cycle of work. So the loop could then take 2.5 clocks on average. I guess this should in theory be attainable, if in 5 clocks you did two loop iterations and scheduled the instructions as

1: load + 3 ALU
2-4: load + store + 2 ALU
5: store + 3 ALU

that would be 4 loads, 4 stores and 12 ALU operations in 5 clocks. The 12 ALU operations would be 4 increments for the body, and 2x4 operations for the increment-and-test overhead of two loop iterations. Whether the hardware is actually smart enough to settle on this pattern remains to be seen. Note that calculation of the addresses, although it formally uses addition too, does not need an ALU, but has its own Address Generation Units.

(This assumes a branch instruction is an ALU instruction, which I am not completely sure of. It is conceivable that branches have their own execution unit, or are handled by the retirement unit itself, which would make it a bit easier for the instruction scheduler, as it could then also do 3 ALU + branch in one clock.)
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Core Port Saturation

Post by wgarvin »

These sorts of synthetic benchmarks are not very indicative of actual performance on real code. About the only thing they're useful for is measuring the cost/penalty of specific types of bottlenecks, stalls, etc.


In general, x86/x64 is very forgiving, and you only have to do a few things to get good performance:

(1) Organize your data for good cache locality. For most general-purpose programs, the CPU will spend up to half of its time waiting for cache misses to be serviced. Reading data all the way from RAM can take hundreds of cycles, whereas reading it from L1 is so fast it's "almost" free. Sometimes you can organize data better, but in some cases (like the transposition table in chess programs) the accesses are essentially "random" and you can't do much about it, other than prefetching it and then doing some fixed amount of other calculations that you would have had to do anyway (see the sketch after this list).

(2) Avoid long dependence chains. Give it several independent little pieces of math that it can do in parallel. The compiler will intermix the instructions and the CPU will issue them as fast as it can and execute them out-of-order. But if later calculations have a dependency on the result of earlier calculations, they won't be able to execute until the earlier calculation has been done. If you have one long dependence chain and nothing else to do in parallel, a lot of cycles will be wasted.

(3) A special case of (2) to be aware of is the "loop-carried dependence". When you have a short loop where each iteration is independent (e.g. perhaps it stores to different addresses in an array, which were not written to by a previous iteration), out-of-order execution allows it to be executing 2 or more iterations' worth of instructions in parallel. Don't underestimate the power of out-of-order execution: it can have dozens of instructions in flight and can hide L1 and even some L2 data stalls very effectively; as long as it has other stuff to keep it busy, being forced to wait for something is not too bad. Anyway, a "loop-carried dependency" is where each iteration of the loop starts with a calculation that depends on the result from the previous iteration. It should be obvious why that's bad: it's just like one long dependence chain; instead of being able to do several iterations in parallel, it's going to end up finishing each one before starting the next, and lots of potential execution throughput can be wasted.

(4) Always always use a profiler to figure out what code is executed a lot, and what code is slower than you expected. Don't try to guess, you'll just waste a lot of time trying to optimize the wrong things. Code that looks like it should be just fine will sometimes have lousy performance because of a subtle thing that can easily be tweaked once you figure out what it is. More common: code that "looks expensive" will turn out to run blazing fast and trying to optimize it will do nothing or perhaps even make it slower.

(5) Cache misses, Cache misses, Cache misses! Fewer of these = everything much faster. Use smaller data types for large arrays of data (e.g. bytes or halfwords instead of 32-bit words). Organize big loops so they do all their processing on chunks small enough to fit in the L1 cache ("strip-mining"). Do simple calculations with arithmetic instructions instead of table lookups. Yes, this is all the same as point (1) ("Organize your data for good cache locality"), but it's so important that I'm saying it again here.
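
To make point (1) concrete for the transposition-table case, here is a minimal sketch of the "prefetch, then do work you had to do anyway" idea. It assumes the GCC/Clang __builtin_prefetch builtin, and HashEntry, table, table_mask and probe_later are invented names, not from any particular engine:

Code: Select all

#include <stdint.h>

typedef struct {
    uint64_t key;
    int32_t  score;
    int16_t  depth;
    int16_t  move;
} HashEntry;

extern HashEntry *table;        /* allocated elsewhere; entry count is a power of two */
extern uint64_t   table_mask;   /* entry count minus one                              */

static inline HashEntry *probe_later(uint64_t new_key)
{
    HashEntry *entry = &table[new_key & table_mask];
    __builtin_prefetch(entry, 0, 0);   /* start the (likely) cache miss right away */
    /* ... finish the make-move bookkeeping here; by the time the search actually
       reads *entry, the line has often already arrived ... */
    return entry;
}

The only point is the ordering: compute the entry address as soon as the new hash key is known, start the probable miss, and overlap it with work you had to do anyway.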


Some things that x86/x64 is really good at (amazingly good):

* They have incredibly good branch predictors. If the branches are mostly-predictable, it will do pretty well.
* Write-combining and good store-forwarding. You can use dumb patterns like copying smart pointers everywhere, and the extra inc/dec traffic to the reference counts won't hurt that much. Other platforms suffer from terrible things like "load-hit-store" penalties... not x86/x64 though!
* x86 is much kinder to misaligned integer access than most other platforms. Unless it splits across a cache line, there's virtually no penalty (and even then it's small enough to often be worth it).
* Out-of-order execution is really good, keeping it fed with lots of work is the only difficult part. It can usually issue 3-4 uops worth of insns per cycle, it can execute 3-4 independent uops per cycle, it can retire 3-4 insns per cycle. It can fuse and split some kinds of operations, so even things like stack manipulations can be super-cheap, because an insn like "push eax" can get split into multiple uops, so the stack pointer adjustment can be executed immediately and then other things depending on that can be executed, even if the write of eax to the stack memory hasn't been executed yet. For things like function calls, this can make them amazingly cheap compared to other platforms.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Core Port Saturation

Post by bob »

xmas79 wrote:Sorry, but I don't get it. My understanding of your statement (like my own thinking) is that the 16 version should be slower, but it's actually faster!
I didn't mean to imply "slower". Just not 16x faster. If you unroll, with x86 addressing, the idea of
inc xx
inc xx+1
inc xx+2

works pretty well, as long as at the end you add 16 to the pointer itself, assuming that is needed later...

The bottleneck here becomes memory loads and stores. But the cache works in 64-byte lines, and the hardware prefetcher follows a sequential access pattern like this one, so most of the memory latency is hidden anyway.

The intel CPU is incredibly hard to predict, in terms of performance. There are N+ variables that can influence overall speed. So long as you are careful in how you access memory to take advantage of cache (both temporal/spatial locality as well as size) and you avoid the long dependency chains that fill the reorder buffer with stuff that can't be executed in parallel, you have a pretty good chance of reaching a high performance level.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Core Port Saturation

Post by bob »

wgarvin wrote:...x86 is much kinder to misaligned integer access than most other platforms. Unless it splits across a cache line, there's virtually no penalty (and even then it's small enough to often be worth it).
One quibble. When you say "much kinder to integer alignment", it depends on what kind of access. I've been working on some ASM code of late, where I wrote a small library to be used in my 330 class. I wanted to give them some easy code to do basic input and output, and for some things I wanted to use the C library, for example read (or _read, depending on the system). Turns out that code in the library uses XMM registers, which are NOT forgiving of alignment errors, at least on my Mac. If the stack alignment is not exactly correct, _read crashes on an XMM load. Pain in the a$$. It was quite educational to figure out just how the stack had to be aligned for each library routine call.
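
For what it's worth, a tiny stand-alone sketch of that difference (not the library code in question, just an illustration of why the XMM instructions fault where ordinary integer accesses would not):

Code: Select all

#include <stdio.h>
#include <xmmintrin.h>

int main(void)
{
    _Alignas(16) float buf[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };

    __m128 a = _mm_load_ps(buf);            /* movaps: fine, address is 16-byte aligned */
    /* __m128 b = _mm_load_ps(buf + 1); */  /* movaps from a 4-byte-aligned address
                                               would fault, like the crash described
                                               above                                    */
    __m128 c = _mm_loadu_ps(buf + 1);       /* movups: misaligned access is allowed     */

    float out[4];
    _mm_storeu_ps(out, _mm_add_ps(a, c));
    printf("%g\n", out[0]);                 /* 0 + 1 = 1 */
    return 0;
}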
BeyondCritics
Posts: 396
Joined: Sat May 05, 2012 2:48 pm
Full name: Oliver Roese

Re: Core Port Saturation

Post by BeyondCritics »

Thank you for your hints.

I have one question: what happens if I call a function indirectly through a pointer, as often happens when virtual methods are in play? Is there some sort of penalty, and how big is it?

Oliver
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Core Port Saturation

Post by hgm »

If that pointer points to the same routine every time, branch prediction will still be 100% correct. If the target changes all the time, I am not sure it could do anything, and the branch will be mis-predicted most of the time (incurring the normal penalty for that). It is already hard to predict branches with only two possible outcomes (jump or fall through).
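
A small illustration of the two situations (the names below are invented for the example, not from anyone's engine): a call through a pointer that hits the same target every time is predicted essentially like a direct call, while a data-dependent target pays the misprediction penalty whenever the pattern is not learnable.

Code: Select all

#include <stddef.h>

typedef int (*score_fn)(int);

static int score_mg(int x) { return 2 * x; }   /* dummy "middlegame" scorer */
static int score_eg(int x) { return 3 * x; }   /* dummy "endgame" scorer    */

int sum_scores(const int *v, size_t n, score_fn f)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Same target on every iteration: the indirect-branch predictor
           locks onto it, so this costs about the same as a direct call.  */
        total += f(v[i]);

        /* Target chosen by the data: if v[i] & 1 is effectively random,
           roughly half of these calls are mispredicted, each paying the
           usual branch-misprediction penalty.                            */
        score_fn g = (v[i] & 1) ? score_mg : score_eg;
        total += g(v[i]);
    }
    return total;
}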