more STL code for Stockfish

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

more STL code for Stockfish

Post by Rein Halbersma »

The sort_moves function from Stockfish header move.h

Code: Select all

// Our dedicated sort in range [firstMove, lastMove), first splits
// positive scores from ramining then order seaprately the two sets.
template<typename T>
inline void sort_moves&#40;T* firstMove, T* lastMove, T** lastPositive&#41;
&#123;
    T tmp;
    T *p, *d;

    d = lastMove;
    p = firstMove - 1;

    d->score = -1; // right guard

    // Split positives vs non-positives
    do &#123;
        while ((++p&#41;->score > 0&#41; &#123;&#125;

        if &#40;p != d&#41;
        &#123;
            while (--d != p && d->score <= 0&#41; &#123;&#125;

            tmp = *p;
            *p = *d;
            *d = tmp;
        &#125;

    &#125; while &#40;p != d&#41;;

    // Sort just positive scored moves, remaining only when we get there
    insertion_sort<T, T*>&#40;firstMove, p&#41;;
    *lastPositive = p;
&#125;

// in movepick.cpp &#40;lines 146-149&#41; 
  case PH_NONCAPTURES&#58;
      lastMove = generate<MV_NON_CAPTURE>&#40;pos, moves&#41;; 
      score_noncaptures&#40;);
      sort_moves&#40;moves, lastMove, &lastGoodNonCapture&#41;;
      return;
should be functionally equivalent to the following piece of STL code (barring any hidden side effects of the above code and barring the inevitable stability issues from partition and sort)

Code: Select all

#include <algorithm>
#include <functional>

// function object to test move scores &#40;can be inlined when passed to algorithms&#41;
template<typename T>
struct HasPositiveScore&#58; public std&#58;&#58;unary_function<T, bool>
&#123;
    bool operator&#40;)&#40;T* move&#41; const
    &#123;
        return move->score > 0;
    &#125;
&#125;

// Our dedicated sort in range &#91;firstMove, lastMove&#41;, first splits
// positive scores from ramining then order seaprately the two sets.
template<typename T>
inline T partition_and_sort_positive_moves&#40;T* firstMove, T* lastMove&#41;
&#123;
    // Split positives vs non-positives
    firstNonPositive = std&#58;&#58;partition&#40;firstMove, lastMove, HasPositiveScore&#41;; 
    
    // Sort just positive scored moves, remaining only when we get there
    if &#40;firstMove != firstNonPositive&#41;
        std&#58;&#58;sort&#40;firstMove, firstNonPositive&#41;; 
    
    return firstNonPositive;
&#125;

// in movepick.cpp &#40;lines 146-149&#41;
  case PH_NONCAPTURES&#58;
      lastMove = generate<MV_NON_CAPTURE>&#40;pos, moves&#41;;
      score_noncaptures&#40;);
      lastGoodNonCapture = partition_and_sort_positive_moves&#40;moves, lastMove&#41;;
      return;
Note that I renamed sort_moves to the more accurate partition_and_sort_positive_moves and that I also changed the signature to return the pointer to the first non-positive score (rather than doing this by reference). This is to make the calling convention more similar to what the STL does. Note that I also don't call the Stockfish home-made insertion_sort but use the std::sort instead.

Finally, if one insists on stability for equal scored moves, std::stable_partition and std::stable_sort can take care of that (but they are more expensive). Of course, for debugging drop-in replacements for algorithms, stable functions are essential because you can test functional equivalence.

Rein

P.S. It literally took me 5 minutes to write this code after seeing the comments specifying the functional behavior. I have made no attempt to properly check the current implementation of sort_moves and insertion_sort. I bet it took a lot longer to make sure those functions were 100% correct.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: more STL code for Stockfish

Post by mcostalba »

Hi Rein,

sorry for the delay, but I was off-line for a couple of days.

The reason why we retired std::sort() from SF (actually it was there for a version that now I don't remember), is the fact that the result is platform dependent. Because std::sort is not stable you can have, and do have, different searched nodes results in case you use MSVC, gcc or icc libraries or even using the 32 or 64 bit versions of the same libraries.

This was unacceptable because SF is released for 32 and 64 bits and compiled with different compilers, and we really want functionality of the different binaries to be always 100% identical.

Anyhow I like your code refactoring and I'd be taking some code style ideas.

Regarding the other open question: why current pick_best() seems faster than your proposed one. I have compiled both with MSVC and post here the corresponding assembly in case someone could give some insight:


Code: Select all

//////////////////////////////////////////
Rein's version&#58;
//////////////////////////////////////////

template<typename T>
inline T pick_best&#40;T* curMove, T* lastMove&#41;
&#123;
0040EA20  push        esi  
0040EA21  mov         esi,eax 
0040EA23  push        edi  
    T* bestMove = std&#58;&#58;max_element&#40;curMove, lastMove&#41;; // find the best move
0040EA24  mov         edx,ecx 
0040EA26  cmp         ecx,esi 
0040EA28  je          pick_best<MoveStack>+39h &#40;40EA59h&#41; 
0040EA2A  lea         eax,&#91;ecx+8&#93; 
0040EA2D  cmp         eax,esi 
0040EA2F  je          pick_best<MoveStack>+39h &#40;40EA59h&#41; 
0040EA31  mov         edi,dword ptr &#91;edx+4&#93; 
0040EA34  cmp         edi,dword ptr &#91;eax+4&#93; 
0040EA37  cmovl       edx,eax 
0040EA3A  add         eax,8 
0040EA3D  cmp         eax,esi 
0040EA3F  jne         pick_best<MoveStack>+11h &#40;40EA31h&#41; 
    std&#58;&#58;swap&#40;*curMove, *bestMove&#41;; // swap best move to the front
0040EA41  mov         eax,dword ptr &#91;ecx&#93; 
0040EA43  cmp         ecx,edx 
0040EA45  je          pick_best<MoveStack>+3Bh &#40;40EA5Bh&#41; 
0040EA47  mov         edi,dword ptr &#91;edx&#93; 
0040EA49  mov         esi,dword ptr &#91;ecx+4&#93; 
0040EA4C  mov         dword ptr &#91;ecx&#93;,edi 
0040EA4E  mov         edi,dword ptr &#91;edx+4&#93; 
0040EA51  mov         dword ptr &#91;ecx+4&#93;,edi 
0040EA54  mov         dword ptr &#91;edx&#93;,eax 
0040EA56  mov         dword ptr &#91;edx+4&#93;,esi 
    return *curMove; // return pointer to best move;
0040EA59  mov         eax,dword ptr &#91;ecx&#93; 
0040EA5B  mov         edx,dword ptr &#91;ecx+4&#93; 
&#125;
0040EA5E  pop         edi  
0040EA5F  pop         esi  
0040EA60  ret              




///////////////////////////////////////////
Original SF version&#58;
//////////////////////////////////////////


template<typename T>
inline T pick_best&#40;T* curMove, T* lastMove&#41;
&#123;
    T bestMove, tmp;

    bestMove = *curMove;
0040EA20  mov         eax,dword ptr &#91;ecx&#93; 
0040EA22  mov         edx,dword ptr &#91;ecx+4&#93; 
0040EA25  push        ebx  
0040EA26  mov         ebx,dword ptr &#91;esp+8&#93; 
    while (++curMove != lastMove&#41;
0040EA2A  add         ecx,8 
0040EA2D  push        esi  
0040EA2E  push        edi  
0040EA2F  cmp         ecx,ebx 
0040EA31  je          pick_best<MoveStack>+2Dh &#40;40EA4Dh&#41; 
    &#123;
        if &#40;bestMove < *curMove&#41;
0040EA33  cmp         edx,dword ptr &#91;ecx+4&#93; 
0040EA36  jge         pick_best<MoveStack>+26h &#40;40EA46h&#41; 
        &#123;
            tmp = *curMove;
0040EA38  mov         esi,dword ptr &#91;ecx&#93; 
0040EA3A  mov         edi,dword ptr &#91;ecx+4&#93; 
            *curMove = bestMove;
0040EA3D  mov         dword ptr &#91;ecx&#93;,eax 
0040EA3F  mov         dword ptr &#91;ecx+4&#93;,edx 
            bestMove = tmp;
0040EA42  mov         eax,esi 
0040EA44  mov         edx,edi 
0040EA46  add         ecx,8 
0040EA49  cmp         ecx,ebx 
0040EA4B  jne         pick_best<MoveStack>+13h &#40;40EA33h&#41; 
        &#125;
    &#125;
    return bestMove;
&#125;
0040EA4D  pop         edi  
0040EA4E  pop         esi  
0040EA4F  pop         ebx  
0040EA50  ret              

Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: more STL code for Stockfish

Post by Rein Halbersma »

mcostalba wrote:Hi Rein,

sorry for the delay, but I was off-line for a couple of days.

The reason why we retired std::sort() from SF (actually it was there for a version that now I don't remember), is the fact that the result is platform dependent. Because std::sort is not stable you can have, and do have, different searched nodes results in case you use MSVC, gcc or icc libraries or even using the 32 or 64 bit versions of the same libraries.

This was unacceptable because SF is released for 32 and 64 bits and compiled with different compilers, and we really want functionality of the different binaries to be always 100% identical.

Anyhow I like your code refactoring and I'd be taking some code style ideas.
Hi Marco,

Platform-independence is indeed very important. std::stable_partition and std::stable_sort would already guarantee this, and also guarantee that the generated move order is not changed for equally scored moves. However, these algorithms are more expensive because they allocate a temporary buffer rather than do things in place.

In that case it might make sense to write your own partition and sort algorithms as you currently do. In any case, it might be a good idea to let those algorithms have the exact same signature as the STL counterparts. This will let you experiment with std::stable_partition and std::stable_sort without changing too much code (e.g. if you make your own sf namespace, all you need to do is change std::->sf:: for those 2 lines of code).

Anyway, the whole std:find family is not plagued by such stability issues. In your TT code e.g. one could do some refactoring by using std::find_if in the probe and a sequence of std::find_first_of and std::min_element during the store (this will do the replacement scheme that you currently have). You only need to write one or two function objects that do the key and depth comparisons and that's it.

In my own draughts engine, I have templated the entire TT table on replacement scheme, so that I can easily experiment with different schemes.

Rein
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: more STL code for Stockfish

Post by wgarvin »

I skimmed through the two disassemblies, and here are my impressions. I can only guess about exactly why the original is faster; only a profiling tool would tell the whole story.

It looks like your move type T is a 64-bit struct, and comparison involves comparing just the second half of it. (I'm guessing its struct { U32 move; S32 score; } or something like that.)

Original SF version

The main loop looks like this. Notice that only one pointer (curMove, in ecx) is being dereferenced on most iterations of the loop:

Code: Select all

L1&#58;
0040EA33  cmp         edx,dword ptr &#91;ecx+4&#93;
0040EA36  jge         L2
...
L2&#58;
0040EA46  add         ecx,8
0040EA49  cmp         ecx,ebx
0040EA4B  jne         L1
The compiler is keeping bestMove in eax:edx, and the curMove pointer in ecx. The swap is done inside the loop, and might be done more than once. It looks like this:

Code: Select all

            tmp = *curMove;
0040EA38  mov         esi,dword ptr &#91;ecx&#93;
0040EA3A  mov         edi,dword ptr &#91;ecx+4&#93;
            *curMove = bestMove;
0040EA3D  mov         dword ptr &#91;ecx&#93;,eax
0040EA3F  mov         dword ptr &#91;ecx+4&#93;,edx
            bestMove = tmp;
0040EA42  mov         eax,esi
0040EA44  mov         edx,edi
That looks pretty optimal to me. Also, when it decides to return bestMove, it will return it in eax:edx (where it already is).


Rein's version (modified by Marco)

In this version, bestMove is being kept as a pointer until we decide to return it. The main loop (inside of std::max_element) looks like this:

Code: Select all

L1&#58;
0040EA31  mov         edi,dword ptr &#91;edx+4&#93;
0040EA34  cmp         edi,dword ptr &#91;eax+4&#93;
0040EA37  cmovl       edx,eax
0040EA3A  add         eax,8
0040EA3D  cmp         eax,esi
0040EA3F  jne         L1
Notice that it might or might not copy the bestMove ptr (edx) into the curMove ptr (eax) (the cmovl instruction). So the loop it ended up with, dereferences curMove->score on every iteration, rather than keeping it in a register. However, I'd be surprised if this was enough by itself to explain the slower results.

The swap is done only once, on exit from the loop (and only if bestMove < *curMove):

Code: Select all

    std&#58;&#58;swap&#40;*curMove, *bestMove&#41;; // swap best move to the front
0040EA41  mov         eax,dword ptr &#91;ecx&#93;
0040EA43  cmp         ecx,edx
0040EA45  je          Lexit;
0040EA47  mov         edi,dword ptr &#91;edx&#93;
0040EA49  mov         esi,dword ptr &#91;ecx+4&#93;
0040EA4C  mov         dword ptr &#91;ecx&#93;,edi
0040EA4E  mov         edi,dword ptr &#91;edx+4&#93;
0040EA51  mov         dword ptr &#91;ecx+4&#93;,edi
0040EA54  mov         dword ptr &#91;edx&#93;,eax
0040EA56  mov         dword ptr &#91;edx+4&#93;,esi
    return *curMove; // return pointer to best move;
0040EA59  mov         eax,dword ptr &#91;ecx&#93;
Lexit&#58;

0040EA5B  mov         edx,dword ptr &#91;ecx+4&#93;
Notice that neither half was already in registers. Its using eax:edx for *bestMove (which is the return value at the end) and its using esi:edi as the temporary for the swap. So it has to do 4 loads and 4 stores there. The routine also needed 3 temporary registers (edi,esi,ebx) instead of just two [Edit: my bad, both routines use 3 temporary registers plus the two for the return value.] So many load/store operations at the end of the loop might be hard for it to issue all at once.

I'm purely speculating here, but I think the original version is faster for a mix of reasons:
(1) not using a pointer for bestMove seems to be better, the compiler can keep this 64-bit type in two register temporaries.
(2) For this particular case, std::max_element's main loop is slightly more expensive than a hand-crafted one, and
(3) The original version does the swaps inside the loop, so it has much less work to do when exiting the loop.

Even if the original does do more than one swap, since bestMove was in registers, the swap involves only two stores and two loads. With the out-of-order execution, they might cost nothing!

By comparison, the modified version does all of its loading (4 read+3 pop) and storing (4) all at the end of the function, which does not seem ideal in this case. Maybe the load at 0040EA59 is blocked somehow by the preceeding stores??
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: more STL code for Stockfish

Post by wgarvin »

A couple further thoughts...

(1) Rein's version swaps two array elements, while the original version swaps a temporary (T bestMove) with an array element. So the original swap is cheaper (bestMove is in registers). That's also why the original trashes the first element, which oddly enough, seems to have a bit of performance benefit? You could fix the trashing by just storing bestMove back to the front of the list on exiting the function, and it might not slow things down much if at all.

(2) When I said the load and store instructions for the original's swap "might cost nothing", what I meant is that they are happening inside the loop and the loop does only one load and no stores. So if we assume that move(s) requiring a swap are found "somewhere in the middle" of the loop, the 2 loads and 2 stores to carry out the swap are going to be overlapped with the later executions of the loop. In comparison, Rein's version always does the swap at the end, so if anything, those loads and stores are going to overlap with other work done by the calling function after this one returns. The caller probably does more loading and storing, so the swap in Rein's version is probably competing for those resources where the original version did not.

All in all, it seems like a case where the "straightforward" hand-written code turned out to be easier for the compiler to optimize [edit: or intrinsically better suited to the 32-bit x86 platform with its few registers and great OOE]. Even if it has some quirks, it also has great performance which the "cleaner" STL solution is unlikely to be able to match. The compiler might do a better job on Rein's version if it was targeting x86-64 and had more registers available, I don't know.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: more STL code for Stockfish

Post by Rein Halbersma »

wgarvin wrote:A couple further thoughts...

(1) Rein's version swaps two array elements, while the original version swaps a temporary (T bestMove) with an array element. So the original swap is cheaper (bestMove is in registers). That's also why the original trashes the first element, which oddly enough, seems to have a bit of performance benefit? You could fix the trashing by just storing bestMove back to the front of the list on exiting the function, and it might not slow things down much if at all.

(2) When I said the load and store instructions for the original's swap "might cost nothing", what I meant is that they are happening inside the loop and the loop does only one load and no stores. So if we assume that move(s) requiring a swap are found "somewhere in the middle" of the loop, the 2 loads and 2 stores to carry out the swap are going to be overlapped with the later executions of the loop. In comparison, Rein's version always does the swap at the end, so if anything, those loads and stores are going to overlap with other work done by the calling function after this one returns. The caller probably does more loading and storing, so the swap in Rein's version is probably competing for those resources where the original version did not.

All in all, it seems like a case where the "straightforward" hand-written code turned out to be easier for the compiler to optimize [edit: or intrinsically better suited to the 32-bit x86 platform with its few registers and great OOE]. Even if it has some quirks, it also has great performance which the "cleaner" STL solution is unlikely to be able to match. The compiler might do a better job on Rein's version if it was targeting x86-64 and had more registers available, I don't know.
Hi Wylie,

Impressed as I am about your reasoning about this assembly code, I'm afraid it defeats the whole purpose of using the STL. So what if you save a few cycles here or there. The whole point was to reduce 13 lines of code that you have to a) first implement and check for correctness and b) maintain, to only 3 lines of code that much more clearly document what is going on.

If the SF code has some particular assumptions that are slightly more specific than the general assumptions that the STL implementers have to code for, there will always be a slight performance gain by coding an algorithm yourself. But the time saved on doing it the STL way can be spent on real chess things (such as reducing the number of nodes searched, rather than a few cycles per node searched).

Rein
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: more STL code for Stockfish

Post by mcostalba »

Rein Halbersma wrote: But the time saved on doing it the STL way can be spent on real chess things (such as reducing the number of nodes searched, rather than a few cycles per node searched).
Well, optimizing a fast path _is_ real chess thing.

I think both of you have a point, what Rein very well explains, not only with word but especially with his example code (that is what _I_ really value, not the words that we have plenty of them already here) is the correct approach to software development in general.

OTH we are not talking of general guidelines of software development, but of a concrete case of a chess engine where performance in fast path _do_ count.

If pick_best() was a normal function, rarely called, I am full with Rein, but it happens this is fast path and so the very high quality posts of Wylie are extremely value to me. Something to ponder on.

So please let me indulge here saying: try to stick to general good software development practice as long as you can...but not longer ;-) aka put your personal judgment always before any written how-to rule.

Marco

P.S: This pick_best() issue is still open for me because on one side I prefer Rein's version to mine, on the other hand I really would like the committed version would be the fastest.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: more STL code for Stockfish

Post by wgarvin »

mcostalba wrote:
Rein Halbersma wrote: But the time saved on doing it the STL way can be spent on real chess things (such as reducing the number of nodes searched, rather than a few cycles per node searched).
Well, optimizing a fast path _is_ real chess thing.

I think both of you have a point, what Rein very well explains, not only with word but especially with his example code (that is what _I_ really value, not the words that we have plenty of them already here) is the correct approach to software development in general.

OTH we are not talking of general guidelines of software development, but of a concrete case of a chess engine where performance in fast path _do_ count.

If pick_best() was a normal function, rarely called, I am full with Rein, but it happens this is fast path and so the very high quality posts of Wylie are extremely value to me. Something to ponder on.

So please let me indulge here saying: try to stick to general good software development practice as long as you can...but not longer ;-) aka put your personal judgment always before any written how-to rule.

Marco

P.S: This pick_best() issue is still open for me because on one side I prefer Rein's version to mine, on the other hand I really would like the committed version would be the fastest.
Everything you say makes good sense to me. I don't disagree with Rein either; if you can make this function "cleaner" without making it any slower, then that would obviously be a win all around.

At least you could probably replace this:

Code: Select all

    if &#40;bestMove < *curMove&#41;
    &#123;
        tmp = *curMove;
        *curMove = bestMove;
        bestMove = tmp;
    &#125;
with one of these:

Code: Select all

    if &#40;bestMove < *curMove&#41;
    &#123;
        std&#58;&#58;swap&#40;bestMove, *curMove&#41;;
    &#125;

Code: Select all

    if &#40;bestMove < *curMove&#41;
    &#123;
        std&#58;&#58;swap&#40;*curMove, bestMove&#41;;
    &#125;
I think with the version of std::swap included with Visual Studio 2008, the second versionshould be equivalent to your code. (It assigns the FIRST arg to a temporary, then assigns second arg to first arg, then assigns temporary to second arg).

Another weird thing that I didn't know about, although it does make sense: That version of std::swap compares the addresses of the two arguments and only performs the swap if they differ ! (Maybe the standard specifies it like this in case your assignment operator doesn't handle self-assignment and you try to swap an object with itself.)

In this case the compiler knows that you never took the address of bestMove and assigned it to a pointer, so it knows that &(*curMove) and &bestMove can't legally alias, and the comparison gets optimized out. It should still be able to keep bestMove and the temporary in registers.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: more STL code for Stockfish

Post by Rein Halbersma »

mcostalba wrote:
If pick_best() was a normal function, rarely called, I am full with Rein, but it happens this is fast path and so the very high quality posts of Wylie are extremely value to me. Something to ponder on.

P.S: This pick_best() issue is still open for me because on one side I prefer Rein's version to mine, on the other hand I really would like the committed version would be the fastest.
Let's do this systematically.

First, the measurement. IIRC, the number of nodes searched was 2% lower for my version compared to current. OTOH, the nps was 1% higher. In total, time to depth was 1% lower for mine. Of course, this makes this somewhat apples and oranges. Are you taking nps as the decisive speed measurement? And did you repeat these measurements? Because as Bob wrote, a 1% change on a 16s process could be due to some paging issues of the OS rather than any real performance difference. If you can spare the cycles, why not run a small cluster test?

Second, the current code a) does extra swaps and b) has a side effect. So if the performance gain (if any) is due to some out-of-order execution bonus on a particular processor and particular compiler, how can you be sure it will be like this forever? Relying on such idiosyncrasies simply does not make sense.

Third, why not test a third variant, namely the handwritten loop with the swap and the end (i.e. simply the STL version written by hand)? This would be the fairest test to see whether STL code has an abstraction penalty or not.

Fourth, I fully agree with the need for speed and optimizing the fast path. I have a lot of empathy for these cycle squeeze games. OTOH, my goal is to gain as many ELO per minute spent programming / testing. In my experience, one hour of handwritten loops and assembly analysis is equal to 5 minutes of STL coding + 55 minutes of thinking about other issues. For my own engine, I have gladly given up ~10% of speed for much clearer code. I think I can use these 55 minutes more effectively on other aspects (but this may differ from person to person of course).

Rein

PS just words, no code this time. If you want to see code, PM me and I can give you access to my repository for my draughts engine.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: more STL code for Stockfish

Post by wgarvin »

Hmm, looking at the std::swap part of Rein's version again:

Code: Select all

    std&#58;&#58;swap&#40;*curMove, *bestMove&#41;; // swap best move to the front
0040EA41  mov         eax,dword ptr &#91;ecx&#93;
0040EA43  cmp         ecx,edx
0040EA45  je          Lexit;
0040EA47  mov         edi,dword ptr &#91;edx&#93;
0040EA49  mov         esi,dword ptr &#91;ecx+4&#93;
0040EA4C  mov         dword ptr &#91;ecx&#93;,edi
0040EA4E  mov         edi,dword ptr &#91;edx+4&#93;
0040EA51  mov         dword ptr &#91;ecx+4&#93;,edi
0040EA54  mov         dword ptr &#91;edx&#93;,eax
0040EA56  mov         dword ptr &#91;edx+4&#93;,esi
    return *curMove; // return pointer to best move;
0040EA59  mov         eax,dword ptr &#91;ecx&#93;
Lexit&#58;

0040EA5B  mov         edx,dword ptr &#91;ecx+4&#93; 
Especially the first couple instructions: the CMP and JE...

Before, I thought those were coming from the source of pick_best, but actually I think they are coming from std::swap (it compares the addresses).

Microsoft's std::swap is found in their <utility> header, and looks like this (I fixed the indenting):

Code: Select all

// TEMPLATE FUNCTION swap &#40;from <algorithm>)
template<class _Ty> inline
void swap&#40;_Ty& _Left, _Ty& _Right&#41;
&#123;
    // exchange values stored at _Left and _Right
    if (&_Left != &_Right&#41;
    &#123;
        // different, worth swapping
        _Ty _Tmp = _Left;
        _Left = _Right;
        _Right = _Tmp;
    &#125;
&#125;
Edit: So unless the compiler is sure the addresses are different, it has to test it. And we see it doing that in the generated code for Rein's version.

None of these things seems very expensive, individually... I suppose the function is called millions of times so even small costs add up.