Pointer to functions

michiguel · Post by **michiguel** » Sat May 03, 2008 5:52 am

They are supposed to be slower. However, in certain situations the alternative is to have a switch or a chain of "if else" that involves lots of branches. When would you choose one approach over the other?

I know I have to try, but having an educated guess at the beginning helps a lot.

Miguel

Dann Corbit · Post by **Dann Corbit** » Sat May 03, 2008 7:13 am

michiguel wrote:They are supposed to be slower. However, in certain situations the alternative is to have a switch or a chain of "if else" that involves lots of branches. When would you choose one approach over the other?

I know I have to try, but having an educated guess at the beginning helps a lot.

Miguel

Unless it is in a very time critical place, just write which is the most clear to you. On some compilers one technique will be faster than another, but it can change even with different compiler settings.

Most of the time in any program is spent in a few hot spots.

You can run this through your profiler to find out how things go on your compiler:

Code: Select all

#include <math.h>
#include <stdio.h>

typedef double  (*f_t) (double);
static f_t      f[] = {log, log10, sqrt, cos, cosh, exp, sin, sinh, tan, tanh, 0};

static double   accum0 = 0;
static double   accum1 = 0;
static double   accum2 = 0;


void            arr(void)
{
    int             i;
    double          d = 0;
    for (i = 0; f[i]; i++) {
        d += f[i] (0.5);
    }
    accum0 += d;
}

void            poi(void)
{
    f_t            *flist = f;
    double          d = 0;
    while (*flist) {
        f_t             ff = *flist;
        d += ff(0.5);
        flist++;
    }
    accum1 += d;
}

void            swi(void)
{
    int             i;
    double          d = 0;
    for (i = 0; f[i]; i++) {
        switch (i) {
        case 0:
            d += f[0] (0.5);
            break;
        case 1:
            d += f[1] (0.5);
            break;
        case 2:
            d += f[2] (0.5);
            break;
        case 3:
            d += f[3] (0.5);
            break;
        case 4:
            d += f[4] (0.5);
            break;
        case 5:
            d += f[5] (0.5);
            break;
        case 6:
            d += f[6] (0.5);
            break;
        case 7:
            d += f[7] (0.5);
            break;
        case 8:
            d += f[8] (0.5);
            break;
        case 9:
            d += f[9] (0.5);
            break;
        default:
            break;
        }
    }
    accum2 += d;
}

int             main(void)
{
    long            i;
    for (i = 0; i < 1000000; i++) {
        arr();
        poi();
        swi();
    }
    printf("%.20g, %.20g, %.20g\n", accum0, accum1, accum2);
    return 0;
}

Gerd Isenberg · Post by **Gerd Isenberg** » Sat May 03, 2008 9:24 am

michiguel wrote:They are supposed to be slower. However, in certain situations the alternative is to have a switch or a chain of "if else" that involves lots of branches. When would you choose one approach over the other?

I know I have to try, but having an educated guess at the beginning helps a lot.

Miguel

For some reason I favor to use switches - and let the compiler decide to internally use a jump-table or cmp/test jxx chains (or both, a jump-table with some leading conditions). There might be a better chance to avoid a miss-prediction up and then, if the most common case immediately follows the indirect jump. For msc I use assume(0) keyword in an default case, which should not occur (assert (false)).

Using function pointers, function pointers to member functions - or virtual functions is likely harder to predict - indirect calls as well as returns. Otherwise - Dann is right - use what is better to read for you. Huge switch monsters tend to be harder to maintain.

Gerd

wgarvin · Post by **wgarvin** » Sat May 03, 2008 8:03 pm

Edit: a quick cheat-sheet...

Code: Select all

if statement  -->  conditional branch
virtual method call  -->  table lookup + indirect branch
switch statement   -->  chained conditional branches and/or a table lookup + indirect branch
call through a function pointer  -->  indirect branch

----------

The performance is mostly dictated by whether the branch is predictable or not, because a correctly-predicted branch is super cheap and an incorrectly-predicted branch is usually expensive.

If you are actually branching in an unpredictable way, its going to cost something no matter what technique you use (though conditional branches at least have a better chance of speculatively executing the right thing, something that almost never happens with an unpredictable indirect branch). Sometimes you can rearrange your code so that the branches in it are more predictable. E.g. instead of looping over all pieces on the board in some random order and switching on their type, you could do separate loops for each type.

On some platforms the branch predictors are not very good at indirect branches (which is one reason for using a chain of ifs). A typical behaviour is to predict it going to the same address as the previous time this branch was executed. Note that this applies to virtual method calls just as much as calls through a function pointer (since they are basically the same thing); if the call site is monomorphic (i.e. the actual virtual method being invoked is the same nearly every time) then it will be highly predictable and thus pretty fast. If the call site is polymorphic (invoking different actual methods) then it is probably going to be mispredicted a lot on platforms with weak branch prediction.

I worry less about indirect branches on x86 than I do on other platforms though, because x86 chips have pretty amazing branch predictors. Intel has been leading the pack since the days of the first Pentium, turning out really clever, state-of-the-art branch predictor designs with every new generation of chip. I don't even try to follow how they work anymore; I just assume that as long as I give them something with a clear pattern in it (e.g. a loop where every fifth iteration the if test is true), then after an iteration or two they will predict it successfully.

Steelman · Post by **Steelman** » Sat May 03, 2008 8:17 pm

Recently I did some time studies on that very question. I set up a loop cycling 1,000,000 times and tried pointers to functions, pointer switch statement, cascading if/else statements, etc. I found all pointer calls to be faster, switch statements came in second but very close to pointers, and cascading else/if statements very slow. If the branch needed is at the end of the else/if "chain" then execution is slower. Which makes sense. However you really can't go wrong with switch statements.

I actually started this study on the abs function call which I see in many times in other programs in instead of a switch statements.

Example:
Part of "makemove" for my engine. This decides of my King has castled and which side it is castling. This turns out to run least twice as fast as any abs function call to find out if the king is moving 2 squares or one. Also note the pointers seen to increase speed as well.

switch(current->from - current->to)
{
case 2: /* castle queen side */
...rest of code...
break;

case -2: /* castle king side */
...rest of code...
break;

default:
...rest of code...
break;
}

But as Dan "points out" the compiler, the compiler settings, etc can change your results. Run your own time study and use whatever is faster and/or easier for you to live with.

wgarvin · Post by **wgarvin** » Sat May 03, 2008 9:49 pm

Dann Corbit wrote:Most of the time in any program is spent in a few hot spots.

For some types of program this turns out to not be true. For 95% of programs out there I guess it is true though, including chess engines.

I agree you should definitely use a profiling tool to find the hotspots before trying to optimize things, or you will most likely spend a lot of time optimizing something that's only 1% of the total runtime instead of the things that are 10 or 20%.

bob · Post by **bob** » Sun May 04, 2008 12:09 am

wgarvin wrote:Edit: a quick cheat-sheet...
Code: Select all
if statement  -->  conditional branch
virtual method call  -->  table lookup + indirect branch
switch statement   -->  chained conditional branches and/or a table lookup + indirect branch
call through a function pointer  -->  indirect branch
----------

The performance is mostly dictated by whether the branch is predictable or not, because a correctly-predicted branch is super cheap and an incorrectly-predicted branch is usually expensive.

If you are actually branching in an unpredictable way, its going to cost something no matter what technique you use (though conditional branches at least have a better chance of speculatively executing the right thing, something that almost never happens with an unpredictable indirect branch). Sometimes you can rearrange your code so that the branches in it are more predictable. E.g. instead of looping over all pieces on the board in some random order and switching on their type, you could do separate loops for each type.

On some platforms the branch predictors are not very good at indirect branches (which is one reason for using a chain of ifs). A typical behaviour is to predict it going to the same address as the previous time this branch was executed. Note that this applies to virtual method calls just as much as calls through a function pointer (since they are basically the same thing); if the call site is monomorphic (i.e. the actual virtual method being invoked is the same nearly every time) then it will be highly predictable and thus pretty fast. If the call site is polymorphic (invoking different actual methods) then it is probably going to be mispredicted a lot on platforms with weak branch prediction.

I worry less about indirect branches on x86 than I do on other platforms though, because x86 chips have pretty amazing branch predictors. Intel has been leading the pack since the days of the first Pentium, turning out really clever, state-of-the-art branch predictor designs with every new generation of chip. I don't even try to follow how they work anymore; I just assume that as long as I give them something with a clear pattern in it (e.g. a loop where every fifth iteration the if test is true), then after an iteration or two they will predict it successfully.

On the PC it is even more complicated when you have to (a) predict whether the branch is taken or not taken, and (b) predict where the branch is going to go to (when it is a pointer-type that uses a register to contain the address of the target). Those end up being almost 100% mis-predicted on one of the above.

I tried a cute very small move generator a few months back using pointers to functions for each piece type, and it was significantly slower than what I do at present.

Don · Post by **Don** » Sun May 04, 2008 5:48 am

I once wrote a chess program that used a large table of function pointers - 1 for each piece type of each color on each square for the move generator. It was a silly thing but it was extremely fast at the time. Of course I had to write code which wrote each move generator function because you could never write that much code without bugs. I think this comes to 736 separate functions to generate moves!

It allows you to avoid much conditional branching and use constants. I don't know how important that is these days but back when I did this it helped noticeably to use constants when you could. Also, you avoid most other calculations such as square offsets, almost everything was explicitly coded as a constant, not an offset calculation.

Unfortunately, this program was extremely fast on some processors and extremely slow on others. When it was slow, I believe it was due to blowing out the instruction cache.

I think this was especially good for pawn moves because you avoided all kinds of tests, like whether the pawn on the 7th, 5th or 2nd ranks. I think this particular program used a 64 element mailbox style board so I also avoided edge testing.

It's probably more trouble than it is worth, but I thought I would mention it since this thread reminded me of it. Has anyone else done anything similar?

Gerd Isenberg · Post by **Gerd Isenberg** » Sun May 04, 2008 10:51 am

Don wrote:I once wrote a chess program that used a large table of function pointers - 1 for each piece type of each color on each square for the move generator. It was a silly thing but it was extremely fast at the time. Of course I had to write code which wrote each move generator function because you could never write that much code without bugs. I think this comes to 736 separate functions to generate moves!

It allows you to avoid much conditional branching and use constants. I don't know how important that is these days but back when I did this it helped noticeably to use constants when you could. Also, you avoid most other calculations such as square offsets, almost everything was explicitly coded as a constant, not an offset calculation.

Unfortunately, this program was extremely fast on some processors and extremely slow on others. When it was slow, I believe it was due to blowing out the instruction cache.

I think this was especially good for pawn moves because you avoided all kinds of tests, like whether the pawn on the 7th, 5th or 2nd ranks. I think this particular program used a 64 element mailbox style board so I also avoided edge testing.

It's probably more trouble than it is worth, but I thought I would mention it since this thread reminded me of it. Has anyone else done anything similar?

I actually use a huge switch monster to generate move by move within a finite state machine with 700++ states. As "compensation", I actually make moves branchless by something like this:

Code: Select all

 hashkeys     ^= currentCastleEPState();
 epCastleCnt  &= move.fromMaskAspects()  & move.toMaskAspects();
 epCastleCnt  += move.fromIncAspects()   + move.toIncAspects();
 hashkeys     ^= move.fromKeyAspects()   ^ move.toKeyAspects();
 hashkeys     ^= currentCastleEPState();
 quadbitboard ^= move.fromBoardAspects() ^ move.toBoardAspects();

Until 386 (486?) using tables of function pointers was favorable over compare- conditional jump chains. Only one instruction. The branch-target was fetched, decodeded and executed after the target was determined in order. After introducing pipelining and out of order execution, where processors fetch 16 (32) bytes of instructions each cycle, it became worse.

http://en.wikipedia.org/wiki/X86_architecture
http://en.wikipedia.org/wiki/Branch_predictor
http://en.wikipedia.org/wiki/Branch_predication

Each indirect branch was almost a miss-predicator with 12+X cycles penalty, unless it was the code immediately following the jmp reg. P4, AMD Athlon (?) K8 took the same branch-target than previously, which was important for calling virtual functions of almost same objects via

Code: Select all

(*this->hiddenVirtualJumpTable[indexOfVirtualFunction]) (args).

Pentium M, core2 and AMD K10 branch-predication takes more care on indirect branches and recognizes simple pattern.

Software Optimization Guide for AMD Family 10h Processors, 40546 Rev. 3.02 May 2007
A.6 Branch-Prediction Table
...
AMD Family 10h processors implement a separate 512-entry target array used to predict indirect branches with multiple dynamic targets.

If using an indirect branch on a per move base inside a recursive search routine, we have near the horizon all-node parents of stand-pat cut-nodes (eval >= beta) which likely dominate the predication rate. But I guess those branches in recursive routines are still quite hard to predicate. Sometimes it might be better to take one penalty for sure, to likely save some others.

The advantage of using switch over table of function pointer - compiler or profile guided optimizers may have better options to try indirect jump versus some leading or complete cmp/test jxx chains for different and always changing target architectures.

bob · Post by **bob** » Sun May 04, 2008 6:24 pm

Don wrote:I once wrote a chess program that used a large table of function pointers - 1 for each piece type of each color on each square for the move generator. It was a silly thing but it was extremely fast at the time. Of course I had to write code which wrote each move generator function because you could never write that much code without bugs. I think this comes to 736 separate functions to generate moves!

It allows you to avoid much conditional branching and use constants. I don't know how important that is these days but back when I did this it helped noticeably to use constants when you could. Also, you avoid most other calculations such as square offsets, almost everything was explicitly coded as a constant, not an offset calculation.

Unfortunately, this program was extremely fast on some processors and extremely slow on others. When it was slow, I believe it was due to blowing out the instruction cache.

I think this was especially good for pawn moves because you avoided all kinds of tests, like whether the pawn on the 7th, 5th or 2nd ranks. I think this particular program used a 64 element mailbox style board so I also avoided edge testing.

It's probably more trouble than it is worth, but I thought I would mention it since this thread reminded me of it. Has anyone else done anything similar?

That was the basic idea behind Carl Ebeling's Ph.D. dissertation at CMU "All the right moves" in fact... It was how HiTech generated moves, although rather than functions they used a chip for each square, but with the same idea...

Pointer to functions

Pointer to functions

Re: Pointer to functions

Re: Pointer to functions

Re: Pointer to functions

Re: Pointer to functions

Re: Pointer to functions

Re: Pointer to functions

Re: Pointer to functions

Re: Pointer to functions

Re: Pointer to functions