Function pointers hurt performance?

Cardoso · Post by **Cardoso** » Wed Sep 21, 2011 12:52 am

Hi everybody,

I need to implement some pattern searching.
I use a list of bitboards to indentify the position and then
I use an array of base scores for each pattern.
Now the base score is not enough for each pattern, I need some additional scoring depending on several board conditions.
So I will need some scoring functions to complement the base score in the array.
I was thinking using functions pointers. So I would need a second array of functions pointers.
And for each pattern I assign a function pointer to a complement scoring function.
So the pattern score = base_score + complement_scoring_func(...)

My question is this: will functions pointers hurt performance of the cpu predictors?
In the loop through the patterns if a pattern is found I read the array of functions pointers and call it.

I still don't have the code to show.

thanks in advance.
Alvaro

mcostalba · Post by **mcostalba** » Wed Sep 21, 2011 5:10 am

Cardoso wrote:Hi everybody,
I use a list of bitboards to indentify the position

This seems to me by far the slowest: to find a match looping across an array.

I'd guess you are looking definitely in the wrong direction regarding the source of possible slowdowns, anyhow I strongly suggest to use a profiler, even very experienced developers don't trust their intuition but use a profiler for optimizing.

wgarvin · Post by **wgarvin** » Wed Sep 21, 2011 9:14 am

Function pointers mean an indirect call, like calling a virtual method. x86 chips can only predict these if its a monomorphic call site (always calling the same function at that spot in the code). They predict it to the same address it went to last time. If you look up the function pointer in some kind of table and the values vary, its going to mispredict a lot. I've heard that some older chip designs like Pentium II and III would always predict it correctly if you loaded it into a register and left it there for ~40 cycles and then jumped or called thru that register. I would expect it to also be true of Pentium M-derived chips. I don't know if it applies to other modern chips. I have no idea about AMD chips either.

stegemma · Post by **stegemma** » Wed Sep 21, 2011 12:27 pm

This seems similar to the virtual functions calling in languages like C++. Maybe the overhead could be the same plus the search in your array.

wgarvin · Post by **wgarvin** » Wed Sep 21, 2011 3:40 pm

wgarvin wrote:Function pointers mean an indirect call, like calling a virtual method. x86 chips can only predict these if its a monomorphic call site (always calling the same function at that spot in the code). They predict it to the same address it went to last time. If you look up the function pointer in some kind of table and the values vary, its going to mispredict a lot. I've heard that some older chip designs like Pentium II and III would always predict it correctly if you loaded it into a register and left it there for ~40 cycles and then jumped or called thru that register. I would expect it to also be true of Pentium M-derived chips. I don't know if it applies to other modern chips. I have no idea about AMD chips either.

According to Agner Fog's document, Pentium M, Core2 and newer designs have something more sophisticated for predicting indirect branches, that can sometimes predict patterns with various different branch targets. Of course if the branches don't actually follow a pattern, it will mispredict most of the time.

Also the misprediction penalty of x86 chips is not as large as I thought. Agner Fog gives details on the penalty for all of Intel and AMD's chips, and they all seem to be between 13 and 17 cycles. e.g. Pentium M and Nehalem are 15 cycles.

micron · Post by **micron** » Thu Sep 22, 2011 3:20 am

If I call Foo()

Code: Select all

SInt64 gFoo;
void Foo&#40;) &#123; gFoo++; &#125;

directly and indirectly through a function pointer, the corresponding assembler is

Code: Select all

callq  _Foo  ; direct

callq  *%r14 ; function pointer

Experiments timing a gazillion calls show no difference between the two calling methods. Function pointers do not hurt performance.

<rant>
In the old days it was easy to arrange this kind of timing experiment, but modern compilers put up obstacles. For example, at high optimization level a compiler discovers that a gazillion invocations of Foo() is equivalent to

Code: Select all

gFoo += GAZILLION;

The timing run now occupies a few nanoseconds instead of the expected seconds (or whatever), causing open mouths and squeaks of surprise. Moreover, it doesn't test what you wanted, because Foo() isn't called even once. A workaround is to put Foo()'s definition in a separate compilation unit, away from the prying eyes of the inliner.
</rant>

Robert P.

wgarvin · Post by **wgarvin** » Thu Sep 22, 2011 3:42 am

That test is not very meaningful, its a monomorphic call site. All x86 chips will easily predict that. Try pseudoeandomly selecting the function pointer from an array of 8 pointers. At startup, use a command line option to decide to either (1) set them to 8 separate functions that add 8 different constants to a global variable, or (2) set all 8 pointers to point to the same function.

Then use RKISS or something to get a 3-bit index for each iteration. Run at least 10,000 iterations (maybe 100,000?) and compare timings from (1) and (2).

bob · Post by **bob** » Thu Sep 22, 2011 7:26 pm

micron wrote:If I call Foo()
Code: Select all
SInt64 gFoo;
void Foo&#40;) &#123; gFoo++; &#125;  
directly and indirectly through a function pointer, the corresponding assembler is
Code: Select all
callq  _Foo  ; direct

callq  *%r14 ; function pointer
Experiments timing a gazillion calls show no difference between the two calling methods. Function pointers do not hurt performance.

<rant>
In the old days it was easy to arrange this kind of timing experiment, but modern compilers put up obstacles. For example, at high optimization level a compiler discovers that a gazillion invocations of Foo() is equivalent to
Code: Select all
gFoo += GAZILLION;
The timing run now occupies a few nanoseconds instead of the expected seconds (or whatever), causing open mouths and squeaks of surprise. Moreover, it doesn't test what you wanted, because Foo() isn't called even once. A workaround is to put Foo()'s definition in a separate compilation unit, away from the prying eyes of the inliner.
</rant>

Robert P.

There are two parts to predicting a branch on x86. 1. Is the branch taken (for a call it is always "yes")? 2. Where is the branch going?

(2) is more interesting because when you fetch and then predict the branch, you don't have a clue where it is going since the register being used might not yet be ready for access. The solution is a "branch target buffer" which simply predicts the branch AND where it is going, based on the last time it was encountered. You can do a conditional jump to an indirect address and predict the jump correctly and miss the address (entire thing is then predicted wrong) or you can predict the address correctly and miss the jump (again, entire thing is wrong), or you can miss both. Only when you get both right do you have any success.

Your code always jumps to the same place, whether you use the explicit jump address, or the indirect address through a register. When you get into a call where the address changes, performance will drop. Your code really is not testing that at all...

micron · Post by **micron** » Fri Sep 23, 2011 5:35 am

Ah, right. A performance hit from scrambling the branch predictor's tiny mind is easily demonstrated.

Code: Select all

gFuncPtr&#91;0&#93; = &Foo;
gFuncPtr&#91;1&#93; = &Foo1;
...
gFuncPtr&#91;7&#93; = &Foo7;

  for ( i = 0; i < GAZILLION; i++ ) &#123;
	(*gFuncPtr&#91;i % 8&#93;)();
  &#125;

Timings based on the code above give 10 ns/call. With all gFuncPtr[] = &Foo, branch prediction never fails and the time is only 2.4 ns/call.

A 'direct call' alternative to variable function pointers corresponds test-code like this:

Code: Select all

  for ( i = 0; i < GAZILLION; i++ ) &#123;
	switch ( gFuncID&#91;j % 8&#93; ) &#123;
	  case 0&#58; Foo&#40;); break;
	  case 1&#58; Foo1&#40;); break;
  	  ...
	  case 7&#58; Foo7&#40;); break;
	&#125;
  &#125;

This clocks in at 3.6 ns/call. The timing is identical whether the values in gFuncID[] are 0--7, or all 0, suggesting that branch prediction is not involved.

It seems that in artificial but (I think) fair comparison, calls through a function pointer are not necessarily slower than direct calls via switch, and can even be faster. Note that the tests are deliberately designed to expose small differences. It's hard to believe that they could affect an engine's n.p.s. measurably.

Robert P.

wgarvin · Post by **wgarvin** » Fri Sep 23, 2011 4:59 pm

micron wrote:Ah, right. A performance hit from scrambling the branch predictor's tiny mind is easily demonstrated.
Code: Select all
gFuncPtr&#91;0&#93; = &Foo;
gFuncPtr&#91;1&#93; = &Foo1;
...
gFuncPtr&#91;7&#93; = &Foo7;

  for ( i = 0; i < GAZILLION; i++ ) &#123;
	(*gFuncPtr&#91;i % 8&#93;)();
  &#125;
Timings based on the code above give 10 ns/call. With all gFuncPtr[] = &Foo, branch prediction never fails and the time is only 2.4 ns/call.

Some modern chips can apparently predict more than 1 or 2 different targets. In this case, there is a very regular pattern that is 8 values long, but I guess it can't remember as many as 8 separate branch targets so it is unable to successfully predict them on each pass through the table. That extra 7.6 ns probably represents the full penalty (15 cycles or whatever) for every call. Notice that without it, the effective cost of the call/return and whatever other entry/exit instructions was only like 5 cycles.

micron wrote: A 'direct call' alternative to variable function pointers corresponds test-code like this:
Code: Select all
  for ( i = 0; i < GAZILLION; i++ ) &#123;
	switch ( gFuncID&#91;j % 8&#93; ) &#123;
	  case 0&#58; Foo&#40;); break;
	  case 1&#58; Foo1&#40;); break;
  	  ...
	  case 7&#58; Foo7&#40;); break;
	&#125;
  &#125;
This clocks in at 3.6 ns/call. The timing is identical whether the values in gFuncID[] are 0--7, or all 0, suggesting that branch prediction is not involved.

You should look at the disassembly... I think its more likely that the compiler turned that switch statement into a tree of ordinary conditional branches, and then successfully predicted most or all of them.

Switch statements can be compiled in a couple of different ways, depending on how many case values there are and whether they are dense or sparse, etc.

micron wrote: It seems that in artificial but (I think) fair comparison, calls through a function pointer are not necessarily slower than direct calls via switch, and can even be faster. Note that the tests are deliberately designed to expose small differences. It's hard to believe that they could affect an engine's n.p.s. measurably.

Robert P.

Function pointers hurt performance?

Function pointers hurt performance?

Re: Function pointers hurt performance?

Re: Function pointers hurt performance?

Re: Function pointers hurt performance?

Re: Function pointers hurt performance?

Re: Function pointers hurt performance?

Re: Function pointers hurt performance?

Re: Function pointers hurt performance?

Re: Function pointers hurt performance?

Re: Function pointers hurt performance?