32 bit versions for bitscan64

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Oct 07, 2009 10:03 pm

bob wrote: Terrible idea. When a company (Intel) says "operation is undefined" it is foolish to try it out, and then _depend_ on the specific behaviour you notice. You can find a perfectly safe PopCnt() in Crafty that returns 64 when no bits are set, and which does not depend on undefined behaviour that can change in the next processor generation without anyone knowing.

Here is a 32 bit version that does not depend on undefined behaviour and which only has two extra instructions.
Code: Select all
int static __inline__ MSB(BITBOARD word)
{
  int dummy1, dummy2, dummy3;

asm("        bsr     %1, %0      " "\n\t"
    "        jnz     1f          " "\n\t"
    "        bsr     %2, %0      " "\n\t"
    "        jnz     2f          " "\n\t"
    "        movl    $64, %0     " "\n\t"
    "        jmp     2f          " "\n\t"
    "1:      addl    $32,%0      " "\n\t"
    "2:                          " "\n\t"
:   "=&q"(dummy1), "=&q"(dummy2), "=&q"(dummy3)
:    "1"((int) (word >> 32)), "2"((int) word)
:    "cc");
  return (dummy1);
}

Mike's "unsafe" version is a trailing zero count, while your MSB is no leading zero count. I would prefer to return -1 for empty sets, since 64 somehow implies leading zero count, which it isn't due to missing xor 63 for none empty sets

I'll vote for to use leading/trailing zero counts for all sets including the empty, but assume MSB/LSB aka bitscan reverse/forward for none empty bitboards like in typical while (bits) loops. Safes one conditional jump and Mike's version becomes safe and even shorter.

Code: Select all

__inline s32 bitscanForward(u64 bits)
{
    assert (bits != 0);

    __asm
    {
        bsf eax, dword ptr bits
        jnz done
        bsf eax, dword ptr bits+4
        add eax, 32
done:
    }
}

AMD defined target register unchanged for bitscanning empty words

Michael Sherwin · Post by **Michael Sherwin** » Wed Oct 07, 2009 10:17 pm

Hi Gerd,

The only thing is, is that using the edx register somehow makes it faster on all the processors that I have, Q6600 and earlier. It must prevent a processor stall or something.

bob · Post by **bob** » Wed Oct 07, 2009 11:24 pm

Michael Sherwin wrote:Hi Gerd,

The only thing is, is that using the edx register somehow makes it faster on all the processors that I have, Q6600 and earlier. It must prevent a processor stall or something.

That's surprising since internally there is no "edx" register anyway, thanks to the renaming mechanism.

Gerd Isenberg · Post by **Gerd Isenberg** » Thu Oct 08, 2009 12:08 am

Michael Sherwin wrote:Hi Gerd,

The only thing is, is that using the edx register somehow makes it faster on all the processors that I have, Q6600 and earlier. It must prevent a processor stall or something.

Applied chaos theory

. Surprising due to one or two byte longer code, which multiplies if heavily inlined. edx is explicitly used for no purpose other than to add a constant and "destroys" its content to improve register pressure. Otoh mov edx, 32 can be executed en passant with slower bsf, to make the final add somehow slightly faster. May be also related to how inlined ms inline assembly embeds in surrounding code.

Michael Sherwin · Post by **Michael Sherwin** » Thu Oct 08, 2009 1:19 am

Here are some numbers on a Q6600.

Original position:

just using add eax, 32 --
2,516,099 nps

using mov edx, 32 with add eax, edx --
2,576,068 nps

difference 59,969 nps

INCREASE OF 59,969/2,516,099 = .023834 or 2.38%

After e2e4:

2,497,449

2,554,567

57,118

INCREASE OF 2.29%

Since it speeds up the whole program by over 2% it can not be just a small increase for the function. There must be something else happening such as the before mentioned prevention of a pipeline stall.

bob · Post by **bob** » Thu Oct 08, 2009 10:29 pm

Gerd Isenberg wrote:
Michael Sherwin wrote:Hi Gerd,

The only thing is, is that using the edx register somehow makes it faster on all the processors that I have, Q6600 and earlier. It must prevent a processor stall or something.
Applied chaos theory . Surprising due to one or two byte longer code, which multiplies if heavily inlined. edx is explicitly used for no purpose other than to add a constant and "destroys" its content to improve register pressure. Otoh mov edx, 32 can be executed en passant with slower bsf, to make the final add somehow slightly faster. May be also related to how inlined ms inline assembly embeds in surrounding code.

This follows exactly from the observation that many make when they add or move a single line of code and things run faster or slower. The effect compounds over a long sequence of instructions, and changes cache misses/hits/fills/etc as well. Doesn't take much. I have removed code and seen speed go down slightly. I ignore such changes, since they are caused by something at a level where I have little if any control.

Michael Sherwin · Post by **Michael Sherwin** » Fri Oct 09, 2009 5:47 am

bob wrote:
Gerd Isenberg wrote:
Michael Sherwin wrote:Hi Gerd,

The only thing is, is that using the edx register somehow makes it faster on all the processors that I have, Q6600 and earlier. It must prevent a processor stall or something.
Applied chaos theory . Surprising due to one or two byte longer code, which multiplies if heavily inlined. edx is explicitly used for no purpose other than to add a constant and "destroys" its content to improve register pressure. Otoh mov edx, 32 can be executed en passant with slower bsf, to make the final add somehow slightly faster. May be also related to how inlined ms inline assembly embeds in surrounding code.
This follows exactly from the observation that many make when they add or move a single line of code and things run faster or slower. The effect compounds over a long sequence of instructions, and changes cache misses/hits/fills/etc as well. Doesn't take much. I have removed code and seen speed go down slightly. I ignore such changes, since they are caused by something at a level where I have little if any control.

I also know about that anomaly. And it does not apply here, because, I have revisited bsf many times over the years in many different versions of the program and using the extra edx register (that does not really exist) has ALWAYS been faster.

32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64