32 bit versions for bitscan64

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

wgarvin wrote:
Michael Sherwin wrote:However, if the latest processors are faster at executing 32bit bsf/bsr, then I might be interested in testing something like the following:

Note: it's been a long time since I've written assembly, so I am not sure if the following code is correct. However, it (the correct code) did test slower when I tried it on older machines.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov ebx, 32 ; I guess that 32 is correct here
        mov ecx, 32
        bsf ebx, dword ptr bits
        bsf ecx, dword ptr bits+4
        shl ebx, 6
        mov eax, [bsfTbl + ebx + ecx]
    }
}
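For reference, the table idea can be sketched in portable C (my reconstruction, not the author's code; the 33x33 `bsfTbl` layout is an assumption inferred from the `ebx<<6` addressing, and the bsf scans are emulated with loops):

```c
#include <assert.h>
#include <stdint.h>

/* Reconstruction of the two-scan table idea: scan both halves independently
   (32 meaning "no bit found"), then combine the two results with one lookup.
   Entry = low index if the low half had a bit, else high index + 32, else 64. */
static int32_t tblFirstBit(uint64_t bits)
{
    static int32_t bsfTbl[33][33];
    static int init = 0;
    if (!init) {
        for (int lo = 0; lo <= 32; lo++)
            for (int hi = 0; hi <= 32; hi++)
                bsfTbl[lo][hi] = (lo < 32) ? lo : (hi < 32 ? hi + 32 : 64);
        init = 1;
    }
    int lo = 32, hi = 32;
    for (int i = 0; i < 32; i++)                 /* emulated bsf of low dword */
        if ((uint32_t)bits & (1u << i)) { lo = i; break; }
    for (int i = 0; i < 32; i++)                 /* emulated bsf of high dword */
        if ((uint32_t)(bits >> 32) & (1u << i)) { hi = i; break; }
    return bsfTbl[lo][hi];
}
```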

If you're willing to rely on bsf to not change the register contents when it fails, you could try something like this:

Code: Select all

__inline s32 FirstBit(u64 bits) 
{ 
    __asm 
    { 
        mov ebx, 32 
        bsf ebx, dword ptr bits+4 
        add ebx, 32
        bsf ebx, dword ptr bits
    } 
}
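A portable C paraphrase of that fall-through trick (untested sketch; `FirstBit_c` is my name, and the bsf scans are emulated with loops so the leave-unchanged-on-zero behavior is explicit):

```c
#include <assert.h>
#include <stdint.h>

/* Paraphrase of the fall-through trick: start from 32 (so an empty board
   yields 32 + 32 = 64), scan the high half first, then let a hit in the low
   half overwrite the result, just as back-to-back bsf instructions would
   when bsf leaves its destination alone on a zero source. */
static int32_t FirstBit_c(uint64_t bits)
{
    uint32_t lo = (uint32_t)bits;
    uint32_t hi = (uint32_t)(bits >> 32);
    int32_t r = 32;                  /* mov ebx, 32 */
    for (int i = 0; i < 32; i++)     /* bsf ebx, bits+4 (emulated) */
        if (hi & (1u << i)) { r = i; break; }
    r += 32;                         /* add ebx, 32 */
    for (int i = 0; i < 32; i++)     /* bsf ebx, bits (emulated) */
        if (lo & (1u << i)) { r = i; break; }
    return r;                        /* 64 when bits == 0 */
}
```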
If you don't trust bsf or don't like the length of that dependence chain, then you can try something like this instead:

Code: Select all

__inline s32 FirstBit(u64 bits) 
{ 
    __asm 
    { 
        bsf ecx, dword ptr bits+4 
        cmovnc ecx, 32
        bsf ebx, dword ptr bits
        lea ecx, [ecx+32]
        cmovnc ebx, ecx
    } 
}
I didn't compile those, YMMV... they both return 64 if the bitboard is empty; if you want -1 instead, you could just replace the first 32 with 0xFFFFFFDF.
Don't remember a cmovnc instruction. Is it a newer instruction? What does it do?

Your first method looks good to me as bsf should not change the register if there are no bits set.

I was aiming for a bit more parallelism. However, your method does not need a table lookup, so I will test it.

Edit: I do not have an i7 so hopefully someone else will test it.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: 32 bit versions for bitscan64

Post by Gerd Isenberg »

Michael Sherwin wrote: Don't remember a cmovnc instruction. Is it a newer instruction? What does it do?
cmovXX: conditional move on any flag or its negation, available since the Pentium Pro (P6), analogous to the conditional jumps jXX, for instance z (zero), nz (not zero), c (carry), nc (no carry). Thus

Code: Select all

cmovnc  rax, rcx
will move rcx to rax if carry is not set; otherwise (carry set) the instruction is a no-op.
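In C terms a conditional move is just a branchless select; a tiny sketch of what `cmovnc eax, ecx` computes (the explicit `carry` flag parameter and the name `cmovnc_c` are mine, for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* Branchless select modeling cmovnc dst, src: if carry is clear, dst takes
   src's value; if carry is set, dst keeps its old value. */
static uint32_t cmovnc_c(uint32_t dst, uint32_t src, int carry)
{
    return carry ? dst : src;
}
```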
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

Gerd Isenberg wrote:
Michael Sherwin wrote: Don't remember a cmovnc instruction. Is it a newer instruction? What does it do?
cmovXX: conditional move on any flag or its negation, available since the Pentium Pro (P6), analogous to the conditional jumps jXX, for instance z (zero), nz (not zero), c (carry), nc (no carry). Thus

Code: Select all

cmovnc  rax, rcx
will move rcx to rax if carry is not set; otherwise (carry set) the instruction is a no-op.
Thanks Gerd!

MSVC 6.0 does not like cmovnc eax, ebx. It says 'improper operand' about the following instruction. Maybe cmovnc is just a pseudo form of the intended instruction.

I do have some preliminary results to share that are instructive. In the starting position, on a Q6600.

Current code from RomiChess -- 2,496,860 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov edx, 0
        bsf eax, dword ptr bits
        jnz done
        mov edx, 32
        bsf eax, dword ptr bits+4
        jnz done
        mov eax, 32 
done:   add eax, edx
    }
}
Modified to just one branch instruction, but has more dependencies -- 2,463,186 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov eax, 32
        bsf eax, dword ptr bits
        jnz done
        bsf eax, dword ptr bits+4
        add eax, 32
done:   
    }
}
Using Wylie's code slightly modified -- 2,479,477 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits) 
{ 
    __asm 
    { 
        mov eax, 32 
        bsf eax, dword ptr bits+4 
        add eax, 32 
        bsf eax, dword ptr bits 
    } 
}
So, IMO this shows that parallelism matters more than a branch or two. I would like to get Wylie's cmovnc working, as I think it may prove to be the fastest on Core 2 and newer machines.
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: 32 bit versions for bitscan64

Post by Zach Wegner »

What about something like this? A few more instructions and a bit register-heavy, but branchless and only one bsf. I don't know how much MSVC optimizes the assembly code, if at all, wrt register usage, since bits is going to be in registers.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm
    {
        mov ebx, dword ptr bits
        mov edx, 32
        test ebx, ebx
        setne ecx
        dec ecx
        and edx, ecx
        and ecx, dword ptr bits+4
        or ecx, ebx
        bsf eax, ecx
        add eax, edx
    }
}
Equivalent C code:

Code: Select all

mask = (bits.low != 0) - 1;
add = 32 & mask;
bits.low |= bits.high & mask;
return bsf(bits.low) + add;
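Filled out into compilable C for reference (my sketch, not Zach's code; bsf is emulated with a loop, the low/high split is done with shifts, and like the original this is undefined for an empty bitboard):

```c
#include <assert.h>
#include <stdint.h>

/* Emulated bsf; caller must not pass 0. */
static int32_t bsf32(uint32_t x)
{
    int32_t i = 0;
    while (!(x & 1u)) { x >>= 1; i++; }
    return i;
}

/* The mask trick: if the low half is nonzero, mask = 0 and we scan the low
   half as-is; otherwise mask = all ones, we fold the high half in and add 32. */
static int32_t FirstBitMask(uint64_t bits)
{
    uint32_t lo = (uint32_t)bits;
    uint32_t hi = (uint32_t)(bits >> 32);
    uint32_t mask = (uint32_t)((lo != 0) - 1);  /* 0 if lo != 0, else 0xFFFFFFFF */
    int32_t add = (int32_t)(32 & mask);
    lo |= hi & mask;
    return bsf32(lo) + add;
}
```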
Anyways, this is a bit pointless. Just get a new processor. ;)
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: 32 bit versions for bitscan64

Post by wgarvin »

Michael Sherwin wrote:Thanks Gerd!

MSVC 6.0 does not like cmovnc eax, ebx. It says 'improper operand' about the following instruction. Maybe cmovnc is just a pseudo form of the intended instruction.
MSVC 6.0 is pretty old. Maybe the inline assembler didn't support it, or maybe you have to give it a command-line option for the target processor type (I think the conditional move instructions were added around the time of the Pentium Pro/Pentium II; I can't remember what the default target was for MSVC 6.0 but there were still chips around back then which didn't support the CMOVcc instructions).

A possible hack around it is to just emit the exact bytes that make up the instruction. If you have NASM or a similar assembler, just put one instruction in a file and assemble it in 32-bit raw mode into a couple of bytes, and open them with a hex editor (or disassemble them again with NASM, it prints out the bytes of the instruction as well as the disassembled text). You could also use NASM documentation or any similar source to figure out the bytes of the instruction. I'm not sure how you tell it to emit those exact bytes (does "db 0xEE,0xEF,0xF0" or whatever work?)
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

Zach Wegner wrote:What about something like this? A few more instructions and a bit register-heavy, but branchless and only one bsf. I don't know how much MSVC optimizes the assembly code, if at all, wrt register usage, since bits is going to be in registers.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm
    {
        mov ebx, dword ptr bits
        mov edx, 32
        test ebx, ebx
        setne ecx
        dec ecx
        and edx, ecx
        and ecx, dword ptr bits+4
        or ecx, ebx
        bsf eax, ecx
        add eax, edx
    }
}
Equivalent C code:

Code: Select all

mask = (bits.low != 0) - 1;
add = 32 & mask;
bits.low |= bits.high & mask;
return bsf(bits.low) + add;
Anyways, this is a bit pointless. Just get a new processor. ;)
My bitboards can be empty so I need a 64 returned--or I would test it now. So, when I have some time. Very nice concept though! :D
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

wgarvin wrote:
Michael Sherwin wrote:Thanks Gerd!

MSVC 6.0 does not like cmovnc eax, ebx. It says 'improper operand' about the following instruction. Maybe cmovnc is just a pseudo form of the intended instruction.
MSVC 6.0 is pretty old. Maybe the inline assembler didn't support it, or maybe you have to give it a command-line option for the target processor type (I think the conditional move instructions were added around the time of the Pentium Pro/Pentium II; I can't remember what the default target was for MSVC 6.0 but there were still chips around back then which didn't support the CMOVcc instructions).

A possible hack around it is to just emit the exact bytes that make up the instruction. If you have NASM or a similar assembler, just put one instruction in a file and assemble it in 32-bit raw mode into a couple of bytes, and open them with a hex editor (or disassemble them again with NASM, it prints out the bytes of the instruction as well as the disassembled text). You could also use NASM documentation or any similar source to figure out the bytes of the instruction. I'm not sure how you tell it to emit those exact bytes (does "db 0xEE,0xEF,0xF0" or whatever work?)
I can barely use the tools that I already have learned. Just do not have the time to learn NASM. Fasm looks simple and I have worked with masm32, so maybe I can use one of those--when I have some time.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

Michael Sherwin wrote:
Gerd Isenberg wrote:
Michael Sherwin wrote: Don't remember a cmovnc instruction. Is it a newer instruction? What does it do?
cmovXX: conditional move on any flag or its negation, available since the Pentium Pro (P6), analogous to the conditional jumps jXX, for instance z (zero), nz (not zero), c (carry), nc (no carry). Thus

Code: Select all

cmovnc  rax, rcx
will move rcx to rax if carry is not set; otherwise (carry set) the instruction is a no-op.
Thanks Gerd!

MSVC 6.0 does not like cmovnc eax, ebx. It says 'improper operand' about the following instruction. Maybe cmovnc is just a pseudo form of the intended instruction.

I do have some preliminary results to share that are instructive. In the starting position, on a Q6600.

Current code from RomiChess -- 2,496,860 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov edx, 0
        bsf eax, dword ptr bits
        jnz done
        mov edx, 32
        bsf eax, dword ptr bits+4
        jnz done
        mov eax, 32 
done:   add eax, edx
    }
}
Modified to just one branch instruction, but has more dependencies -- 2,463,186 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov eax, 32
        bsf eax, dword ptr bits
        jnz done
        bsf eax, dword ptr bits+4
        add eax, 32
done:   
    }
}
Using Wylie's code slightly modified -- 2,479,477 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits) 
{ 
    __asm 
    { 
        mov eax, 32 
        bsf eax, dword ptr bits+4 
        add eax, 32 
        bsf eax, dword ptr bits 
    } 
}
So, IMO this shows that parallelism matters more than a branch or two. I would like to get Wylie's cmovnc working, as I think it may prove to be the fastest on Core 2 and newer machines.
Got this working, but it was the slowest -- 2,441,656 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov eax, 32
        mov ebx, 32
        mov ecx, 0    // cmov cannot take an immediate operand, so load 0 into a register first
        bsf ebx, dword ptr bits+4
        bsf eax, dword ptr bits
        cmovnz ebx, ecx  // cmovnc would not assemble for me; cmovnz on the low scan works
        lea eax, [eax + ebx] 
    }
}
The cmovcc instruction might be faster than poorly predicted branches, but well-predicted branches seem hard to beat!

Edit: Well, what false conception was clouding my mind? :?

Changed my two-branch code to the following one-branch version -- 2,519,082 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov eax, 32 
        mov edx, 0
        bsf eax, dword ptr bits
        jnz done
        mov edx, 32
        bsf eax, dword ptr bits+4
done:   add eax, edx
    }
}
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: 32 bit versions for bitscan64

Post by Zach Wegner »

Michael Sherwin wrote: My bitboards can be empty so I need a 64 returned--or I would test it now. So, when I have some time. Very nice concept though! :D
That's fairly simple to do--you just need to add one instruction, assuming bsf on a zero argument doesn't change the destination. Set eax to 32 before the bsf; then, since the lower half is zero, the add of 32 afterwards takes it to 64.

Also, note that this is 100% untested, I'm only pretty sure it works.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm
    {
        mov ebx, dword ptr bits
        mov edx, 32
        mov eax, edx
        test ebx, ebx
        setne ecx
        dec ecx
        and edx, ecx
        and ecx, dword ptr bits+4
        or ecx, ebx
        bsf eax, ecx
        add eax, edx
    }
}
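The empty-board-safe version in portable C, for reference (my sketch, not Zach's code; `FirstBitSafe` is my name, and the bsf is emulated with a loop that leaves the result untouched on a zero source, matching the assumption above):

```c
#include <assert.h>
#include <stdint.h>

/* Mask trick plus the extra preset: r starts at 32 so that when the combined
   scan word is zero (empty board), the result is 32 + 32 = 64. */
static int32_t FirstBitSafe(uint64_t bits)
{
    uint32_t lo = (uint32_t)bits;
    uint32_t hi = (uint32_t)(bits >> 32);
    uint32_t mask = (uint32_t)((lo != 0) - 1);  /* 0 if lo != 0, else 0xFFFFFFFF */
    int32_t add = (int32_t)(32 & mask);
    uint32_t scan = lo | (hi & mask);
    int32_t r = 32;                  /* the "mov eax, edx" preset: survives a zero scan */
    for (int i = 0; i < 32; i++)     /* emulated bsf: no-op when scan == 0 */
        if (scan & (1u << i)) { r = i; break; }
    return r + add;
}
```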
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

My Assembler proposal somewhere in the thread
===============================================

Code: Select all


ULONG bsf64a(BTB_T bb) 
   {
    //_asm mov eax,32 here, to have 64 as return value for an empty board

    _asm bsf eax,DWORD PTR bb+4 
    _asm add eax,32 
    _asm bsf eax,DWORD PTR bb 

    //return eax 
   } 

which leads to the introduction of the thread...
==================================================

Code: Select all


ULONG bsf64(BTB_T bb) 
   { 
    UI_32 *const ptr = (UI_32*)&bb; 
    ULONG id=64; 

    _BitScanForward(&id,*(ptr+1)); 
    id+=32; 
    _BitScanForward(&id,*ptr); 

    return(id); 
   } 

which further leads to
=========================

Code: Select all

ULONG bsf64(BTB_T bb) 
   { 
    ULONG id; //=32 here, so an empty board returns 64 (32 + the add below)

    _BitScanForward(&id,(UI_32)(bb>>32)); 
    id+=32; 
    _BitScanForward(&id,(UI_32)bb); 

    return(id); 
   } 
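That intrinsic variant relies on _BitScanForward leaving *id untouched when the source is zero. A portable sketch of the same control flow (my code, not Desperado's; `BitScanForward_c` emulates the MSVC intrinsic's contract, with uint32_t in place of ULONG, and the init of 32 so an empty board returns 64):

```c
#include <assert.h>
#include <stdint.h>

/* Emulation of MSVC's _BitScanForward: returns nonzero and sets *id to the
   index of the lowest set bit, or returns 0 and leaves *id unchanged. */
static unsigned char BitScanForward_c(uint32_t *id, uint32_t mask)
{
    if (mask == 0) return 0;
    uint32_t i = 0;
    while (!(mask & 1u)) { mask >>= 1; i++; }
    *id = i;
    return 1;
}

/* High half first, then +32, then let a low-half hit overwrite the result. */
static uint32_t bsf64_c(uint64_t bb)
{
    uint32_t id = 32;  /* so an empty board falls through to 32 + 32 = 64 */
    BitScanForward_c(&id, (uint32_t)(bb >> 32));
    id += 32;
    BitScanForward_c(&id, (uint32_t)bb);
    return id;
}
```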

which becomes (to save a 2nd scan)
====================================

Code: Select all


ULONG bsf64(UI_64 bb)
	{
	 ULONG id; //=33 to have 64 as return value (33 + 31 + 0 when both scans fail)

	 _BitScanForward(&id,(UI_32)bb) ? 
	 id : 
	 id += 31 + _BitScanForward(&id,(UI_32)(bb>>32));

	 return(id);
	}
 

which is based on my 2nd Assembler proposal
============================================

Code: Select all


ULONG bsf64a(BTB_T bb) 
   { 
    //_asm mov eax,32 //to have 64 as return value

    _asm bsf eax,DWORD PTR bb 
    _asm jnz DONE 
    _asm bsf eax,DWORD PTR bb+4 
    _asm add eax,32 
DONE: ; 
    // return eax 
   }  

So, if intrinsics are available, I would definitely use them, because compared to hand-made inline assembler the compiler knows everything about the register status and should (and will) optimize their usage. Or the other way around: using inline assembler may _unoptimize_ the code, because it forces the use of registers that may already be in use somehow.

If you only used the inline-assembler function once or twice, you could optimize by hand; but in a chess engine you use it in hundreds of different situations, and that will decrease performance significantly.

Am I wrong?

Nice Sunday to everyone. :)