32 bit versions for bitscan64

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: 32 bit versions for bitscan64

Post by Gerd Isenberg »

bob wrote: Terrible idea. When a company (Intel) says "operation is undefined" it is foolish to try it out, and then _depend_ on the specific behaviour you notice. You can find a perfectly safe PopCnt() in Crafty that returns 64 when no bits are set, and which does not depend on undefined behaviour that can change in the next processor generation without anyone knowing.

Here is a 32 bit version that does not depend on undefined behaviour and which only has two extra instructions.

Code: Select all

int static __inline__ MSB(BITBOARD word)
{
  int dummy1, dummy2, dummy3;

asm("        bsr     %1, %0      " "\n\t"
    "        jnz     1f          " "\n\t"
    "        bsr     %2, %0      " "\n\t"
    "        jnz     2f          " "\n\t"
    "        movl    $64, %0     " "\n\t"
    "        jmp     2f          " "\n\t"
    "1:      addl    $32,%0      " "\n\t"
    "2:                          " "\n\t"
:   "=&q"(dummy1), "=&q"(dummy2), "=&q"(dummy3)
:    "1"((int) (word >> 32)), "2"((int) word)
:    "cc");
  return (dummy1);
}

Mike's "unsafe" version is a trailing zero count, while your MSB is no leading zero count. I would prefer to return -1 for empty sets, since 64 somehow implies leading zero count, which it isn't due to missing xor 63 for none empty sets ;-)

I'll vote for to use leading/trailing zero counts for all sets including the empty, but assume MSB/LSB aka bitscan reverse/forward for none empty bitboards like in typical while (bits) loops. Safes one conditional jump and Mike's version becomes safe and even shorter.

Code: Select all

__inline s32 bitscanForward(u64 bits)
{
    assert (bits != 0);

    __asm
    {
        bsf eax, dword ptr bits
        jnz done
        bsf eax, dword ptr bits+4
        add eax, 32
done:
    }
}
AMD defined target register unchanged for bitscanning empty words :!: :?:
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

Hi Gerd,

The only thing is, is that using the edx register somehow makes it faster on all the processors that I have, Q6600 and earlier. It must prevent a processor stall or something.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 32 bit versions for bitscan64

Post by bob »

Michael Sherwin wrote:Hi Gerd,

The only thing is, is that using the edx register somehow makes it faster on all the processors that I have, Q6600 and earlier. It must prevent a processor stall or something.
That's surprising since internally there is no "edx" register anyway, thanks to the renaming mechanism.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: 32 bit versions for bitscan64

Post by Gerd Isenberg »

Michael Sherwin wrote:Hi Gerd,

The only thing is, is that using the edx register somehow makes it faster on all the processors that I have, Q6600 and earlier. It must prevent a processor stall or something.
Applied chaos theory :roll:. Surprising due to one or two byte longer code, which multiplies if heavily inlined. edx is explicitly used for no purpose other than to add a constant and "destroys" its content to improve register pressure. Otoh mov edx, 32 can be executed en passant with slower bsf, to make the final add somehow slightly faster. May be also related to how inlined ms inline assembly embeds in surrounding code.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

Here are some numbers on a Q6600.

Original position:

just using add eax, 32 --
2,516,099 nps

using mov edx, 32 with add eax, edx --
2,576,068 nps

difference 59,969 nps

INCREASE OF 59,969/2,516,099 = .023834 or 2.38%


After e2e4:

2,497,449

2,554,567

57,118

INCREASE OF 2.29%

Since it speeds up the whole program by over 2% it can not be just a small increase for the function. There must be something else happening such as the before mentioned prevention of a pipeline stall.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 32 bit versions for bitscan64

Post by bob »

Gerd Isenberg wrote:
Michael Sherwin wrote:Hi Gerd,

The only thing is, is that using the edx register somehow makes it faster on all the processors that I have, Q6600 and earlier. It must prevent a processor stall or something.
Applied chaos theory :roll:. Surprising due to one or two byte longer code, which multiplies if heavily inlined. edx is explicitly used for no purpose other than to add a constant and "destroys" its content to improve register pressure. Otoh mov edx, 32 can be executed en passant with slower bsf, to make the final add somehow slightly faster. May be also related to how inlined ms inline assembly embeds in surrounding code.
This follows exactly from the observation that many make when they add or move a single line of code and things run faster or slower. The effect compounds over a long sequence of instructions, and changes cache misses/hits/fills/etc as well. Doesn't take much. I have removed code and seen speed go down slightly. I ignore such changes, since they are caused by something at a level where I have little if any control.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

bob wrote:
Gerd Isenberg wrote:
Michael Sherwin wrote:Hi Gerd,

The only thing is, is that using the edx register somehow makes it faster on all the processors that I have, Q6600 and earlier. It must prevent a processor stall or something.
Applied chaos theory :roll:. Surprising due to one or two byte longer code, which multiplies if heavily inlined. edx is explicitly used for no purpose other than to add a constant and "destroys" its content to improve register pressure. Otoh mov edx, 32 can be executed en passant with slower bsf, to make the final add somehow slightly faster. May be also related to how inlined ms inline assembly embeds in surrounding code.
This follows exactly from the observation that many make when they add or move a single line of code and things run faster or slower. The effect compounds over a long sequence of instructions, and changes cache misses/hits/fills/etc as well. Doesn't take much. I have removed code and seen speed go down slightly. I ignore such changes, since they are caused by something at a level where I have little if any control.
I also know about that anomaly. And it does not apply here, because, I have revisited bsf many times over the years in many different versions of the program and using the extra edx register (that does not really exist) has ALWAYS been faster.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through