32 bit versions for bitscan64

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

wgarvin wrote:
Michael Sherwin wrote:However, if the latest processors are faster at executing 32bit bsf/bsr, then I might be interested in testing something like the following:

Note: it's been a long time since I've written assembly, so I am not sure if the following code is correct. However, it (the correct code) did test slower when I tried it on older machines.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov ebx, 32 ; I guess that 32 is correct here
        mov ecx, 32
        bsf ebx, dword ptr bits
        bsf ecx, dword ptr bits+4
        shl ebx, 6
        mov eax, [bsfTbl + ebx + ecx]
    }
}
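For reference, the table idea can be sketched in portable C (my reconstruction, not the author's code; the 33x33 `bsfTbl` layout is an assumption inferred from the `ebx<<6` addressing, and the bsf scans are emulated with loops):

```c
#include <assert.h>
#include <stdint.h>

/* Reconstruction of the two-scan table idea: scan both halves independently
   (32 meaning "no bit found"), then combine the two results with one lookup.
   Entry = low index if the low half had a bit, else high index + 32, else 64. */
static int32_t tblFirstBit(uint64_t bits)
{
    static int32_t bsfTbl[33][33];
    static int init = 0;
    if (!init) {
        for (int lo = 0; lo <= 32; lo++)
            for (int hi = 0; hi <= 32; hi++)
                bsfTbl[lo][hi] = (lo < 32) ? lo : (hi < 32 ? hi + 32 : 64);
        init = 1;
    }
    int lo = 32, hi = 32;
    for (int i = 0; i < 32; i++)                 /* emulated bsf of low dword */
        if ((uint32_t)bits & (1u << i)) { lo = i; break; }
    for (int i = 0; i < 32; i++)                 /* emulated bsf of high dword */
        if ((uint32_t)(bits >> 32) & (1u << i)) { hi = i; break; }
    return bsfTbl[lo][hi];
}
```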

If you're willing to rely on bsf to not change the register contents when it fails, you could try something like this:

Code: Select all

__inline s32 FirstBit(u64 bits) 
{ 
    __asm 
    { 
        mov ebx, 32 
        bsf ebx, dword ptr bits+4 
        add ebx, 32
        bsf ebx, dword ptr bits
    } 
}
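A portable C paraphrase of that fall-through trick (untested sketch; `FirstBit_c` is my name, and the bsf scans are emulated with loops so the leave-unchanged-on-zero behavior is explicit):

```c
#include <assert.h>
#include <stdint.h>

/* Paraphrase of the fall-through trick: start from 32 (so an empty board
   yields 32 + 32 = 64), scan the high half first, then let a hit in the low
   half overwrite the result, just as back-to-back bsf instructions would
   when bsf leaves its destination alone on a zero source. */
static int32_t FirstBit_c(uint64_t bits)
{
    uint32_t lo = (uint32_t)bits;
    uint32_t hi = (uint32_t)(bits >> 32);
    int32_t r = 32;                  /* mov ebx, 32 */
    for (int i = 0; i < 32; i++)     /* bsf ebx, bits+4 (emulated) */
        if (hi & (1u << i)) { r = i; break; }
    r += 32;                         /* add ebx, 32 */
    for (int i = 0; i < 32; i++)     /* bsf ebx, bits (emulated) */
        if (lo & (1u << i)) { r = i; break; }
    return r;                        /* 64 when bits == 0 */
}
```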
If you don't trust bsf or don't like the length of that dependence chain, then you can try something like this instead:

Code: Select all

__inline s32 FirstBit(u64 bits) 
{ 
    __asm 
    { 
        bsf ecx, dword ptr bits+4 
        cmovnc ecx, 32
        bsf ebx, dword ptr bits
        lea ecx, [ecx+32]
        cmovnc ebx, ecx
    } 
}
I didn't compile those, YMMV... they both return 64 if the bitboard is empty; if you want -1 instead, you could just replace the first 32 with 0xFFFFFFDF.
Don't remember a cmovnc instruction. Is it a newer instruction? What does it do?

Your first method looks good to me as bsf should not change the register if there are no bits set.

I was aiming for a bit more parallelism. However, your method does not need a table lookup, so I will test it.

Edit: I do not have an i7 so hopefully someone else will test it.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: 32 bit versions for bitscan64

Post by Gerd Isenberg »

Michael Sherwin wrote: Don't remember a cmovnc instruction. Is it a newer instruction? What does it do?
cmovXX: conditional move on any flag or its negation, available since the Pentium Pro (P6), analogous to the conditional jumps jXX, for instance z (zero), nz (not zero), c (carry), nc (no carry). Thus

Code: Select all

cmovnc  rax, rcx
will move rcx to rax if carry is not set; otherwise (carry set) the instruction is a no-op.
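In C terms a conditional move is just a branchless select; a tiny sketch of what `cmovnc eax, ecx` computes (the explicit `carry` flag parameter and the name `cmovnc_c` are mine, for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* Branchless select modeling cmovnc dst, src: if carry is clear, dst takes
   src's value; if carry is set, dst keeps its old value. */
static uint32_t cmovnc_c(uint32_t dst, uint32_t src, int carry)
{
    return carry ? dst : src;
}
```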
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

Gerd Isenberg wrote:
Michael Sherwin wrote: Don't remember a cmovnc instruction. Is it a newer instruction? What does it do?
cmovXX: conditional move on any flag or its negation, available since the Pentium Pro (P6), analogous to the conditional jumps jXX, for instance z (zero), nz (not zero), c (carry), nc (no carry). Thus

Code: Select all

cmovnc  rax, rcx
will move rcx to rax if carry is not set; otherwise (carry set) the instruction is a no-op.
Thanks Gerd!

MSVC 6.0 does not like cmovnc eax, ebx. It says 'improper operand' about the following instruction. Maybe cmovnc is just a pseudo form of the intended instruction.

I do have some preliminary results to share that are instructive. In the starting position, on a Q6600.

Current code from RomiChess -- 2,496,860 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov edx, 0
        bsf eax, dword ptr bits
        jnz done
        mov edx, 32
        bsf eax, dword ptr bits+4
        jnz done
        mov eax, 32 
done:   add eax, edx
    }
}
Modified to just one branch instruction, but has more dependencies -- 2,463,186 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov eax, 32
        bsf eax, dword ptr bits
        jnz done
        bsf eax, dword ptr bits+4
        add eax, 32
done:   
    }
}
Using Wylie's code slightly modified -- 2,479,477 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits) 
{ 
    __asm 
    { 
        mov eax, 32 
        bsf eax, dword ptr bits+4 
        add eax, 32 
        bsf eax, dword ptr bits 
    } 
}
So, IMO this shows that parallelism matters more than a branch or two. I would like to get Wylie's cmovnc working, as I think it may prove to be the fastest on Core 2 and newer machines.
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: 32 bit versions for bitscan64

Post by Zach Wegner »

What about something like this? A few more instructions and a bit register-heavy, but branchless and only one bsf. I don't know how much MSVC optimizes the assembly code, if at all, wrt register usage, since bits is going to be in registers.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm
    {
        mov ebx, dword ptr bits
        mov edx, 32
        test ebx, ebx
        setne ecx
        dec ecx
        and edx, ecx
        and ecx, dword ptr bits+4
        or ecx, ebx
        bsf eax, ecx
        add eax, edx
    }
}
Equivalent C code:

Code: Select all

mask = (bits.low != 0) - 1;
add = 32 & mask;
bits.low |= bits.high & mask;
return bsf(bits.low) + add;
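Filled out into compilable C for reference (my sketch, not Zach's code; bsf is emulated with a loop, the low/high split is done with shifts, and like the original this is undefined for an empty bitboard):

```c
#include <assert.h>
#include <stdint.h>

/* Emulated bsf; caller must not pass 0. */
static int32_t bsf32(uint32_t x)
{
    int32_t i = 0;
    while (!(x & 1u)) { x >>= 1; i++; }
    return i;
}

/* The mask trick: if the low half is nonzero, mask = 0 and we scan the low
   half as-is; otherwise mask = all ones, we fold the high half in and add 32. */
static int32_t FirstBitMask(uint64_t bits)
{
    uint32_t lo = (uint32_t)bits;
    uint32_t hi = (uint32_t)(bits >> 32);
    uint32_t mask = (uint32_t)((lo != 0) - 1);  /* 0 if lo != 0, else 0xFFFFFFFF */
    int32_t add = (int32_t)(32 & mask);
    lo |= hi & mask;
    return bsf32(lo) + add;
}
```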
Anyways, this is a bit pointless. Just get a new processor. ;)
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: 32 bit versions for bitscan64

Post by wgarvin »

Michael Sherwin wrote:Thanks Gerd!

MSVC 6.0 does not like cmovnc eax, ebx. It says 'improper operand' about the following instruction. Maybe cmovnc is just a pseudo form of the intended instruction.
MSVC 6.0 is pretty old. Maybe the inline assembler didn't support it, or maybe you have to give it a command-line option for the target processor type (I think the conditional move instructions were added around the time of the Pentium Pro/Pentium II; I can't remember what the default target was for MSVC 6.0 but there were still chips around back then which didn't support the CMOVcc instructions).

A possible hack around it is to just emit the exact bytes that make up the instruction. If you have NASM or a similar assembler, just put one instruction in a file and assemble it in 32-bit raw mode into a couple of bytes, and open them with a hex editor (or disassemble them again with NASM, it prints out the bytes of the instruction as well as the disassembled text). You could also use NASM documentation or any similar source to figure out the bytes of the instruction. I'm not sure how you tell it to emit those exact bytes (does "db 0xEE,0xEF,0xF0" or whatever work?)
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

Zach Wegner wrote:What about something like this? A few more instructions and a bit register-heavy, but branchless and only one bsf. I don't know how much MSVC optimizes the assembly code, if at all, wrt register usage, since bits is going to be in registers.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm
    {
        mov ebx, dword ptr bits
        mov edx, 32
        test ebx, ebx
        setne ecx
        dec ecx
        and edx, ecx
        and ecx, dword ptr bits+4
        or ecx, ebx
        bsf eax, ecx
        add eax, edx
    }
}
Equivalent C code:

Code: Select all

mask = (bits.low != 0) - 1;
add = 32 & mask;
bits.low |= bits.high & mask;
return bsf(bits.low) + add;
Anyways, this is a bit pointless. Just get a new processor. ;)
My bitboards can be empty so I need a 64 returned--or I would test it now. So, when I have some time. Very nice concept though! :D
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

wgarvin wrote:
Michael Sherwin wrote:Thanks Gerd!

MSVC 6.0 does not like cmovnc eax, ebx. It says 'improper operand' about the following instruction. Maybe cmovnc is just a pseudo form of the intended instruction.
MSVC 6.0 is pretty old. Maybe the inline assembler didn't support it, or maybe you have to give it a command-line option for the target processor type (I think the conditional move instructions were added around the time of the Pentium Pro/Pentium II; I can't remember what the default target was for MSVC 6.0 but there were still chips around back then which didn't support the CMOVcc instructions).

A possible hack around it is to just emit the exact bytes that make up the instruction. If you have NASM or a similar assembler, just put one instruction in a file and assemble it in 32-bit raw mode into a couple of bytes, and open them with a hex editor (or disassemble them again with NASM, it prints out the bytes of the instruction as well as the disassembled text). You could also use NASM documentation or any similar source to figure out the bytes of the instruction. I'm not sure how you tell it to emit those exact bytes (does "db 0xEE,0xEF,0xF0" or whatever work?)
I can barely use the tools that I already have learned. Just do not have the time to learn NASM. Fasm looks simple and I have worked with masm32, so maybe I can use one of those--when I have some time.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

Michael Sherwin wrote:
Gerd Isenberg wrote:
Michael Sherwin wrote: Don't remember a cmovnc instruction. Is it a newer instruction? What does it do?
cmovXX: conditional move on any flag or its negation, available since the Pentium Pro (P6), analogous to the conditional jumps jXX, for instance z (zero), nz (not zero), c (carry), nc (no carry). Thus

Code: Select all

cmovnc  rax, rcx
will move rcx to rax if carry is not set; otherwise (carry set) the instruction is a no-op.
Thanks Gerd!

MSVC 6.0 does not like cmovnc eax, ebx. It says 'improper operand' about the following instruction. Maybe cmovnc is just a pseudo form of the intended instruction.

I do have some preliminary results to share that are instructive. In the starting position, on a Q6600.

Current code from RomiChess -- 2,496,860 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov edx, 0
        bsf eax, dword ptr bits
        jnz done
        mov edx, 32
        bsf eax, dword ptr bits+4
        jnz done
        mov eax, 32 
done:   add eax, edx
    }
}
Modified to just one branch instruction, but has more dependencies -- 2,463,186 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov eax, 32
        bsf eax, dword ptr bits
        jnz done
        bsf eax, dword ptr bits+4
        add eax, 32
done:   
    }
}
Using Wylie's code slightly modified -- 2,479,477 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits) 
{ 
    __asm 
    { 
        mov eax, 32 
        bsf eax, dword ptr bits+4 
        add eax, 32 
        bsf eax, dword ptr bits 
    } 
}
So, IMO this shows that parallelism matters more than a branch or two. I would like to get Wylie's cmovnc working, as I think it may prove to be the fastest on Core 2 and newer machines.
Got this working, but it was the slowest -- 2,441,656 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov eax, 32
        mov ebx, 32
        mov ecx, 0    // cmov cannot take an immediate operand, so load 0 into a register first
        bsf ebx, dword ptr bits+4
        bsf eax, dword ptr bits
        cmovnz ebx, ecx  // cmovnc would not assemble for me; cmovnz on the low scan works
        lea eax, [eax + ebx] 
    }
}
The cmovcc instruction might be faster than poorly predicted branches, but well-predicted branches seem hard to beat!

Edit: Well, what false conception was clouding my mind? :?

Changed my two-branch code to the following one-branch version -- 2,519,082 nodes per second.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm 
    {
        mov eax, 32 
        mov edx, 0
        bsf eax, dword ptr bits
        jnz done
        mov edx, 32
        bsf eax, dword ptr bits+4
done:   add eax, edx
    }
}
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: 32 bit versions for bitscan64

Post by Zach Wegner »

Michael Sherwin wrote: My bitboards can be empty so I need a 64 returned--or I would test it now. So, when I have some time. Very nice concept though! :D
That's fairly simple to do--you just need to add one instruction, assuming bsf on a zero argument doesn't change the destination. Set eax to 32 before the bsf; then, since the lower half is zero, the add of 32 afterwards takes it to 64.

Also, note that this is 100% untested, I'm only pretty sure it works.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm
    {
        mov ebx, dword ptr bits
        mov edx, 32
        mov eax, edx
        test ebx, ebx
        setne ecx
        dec ecx
        and edx, ecx
        and ecx, dword ptr bits+4
        or ecx, ebx
        bsf eax, ecx
        add eax, edx
    }
}
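The empty-board-safe version in portable C, for reference (my sketch, not Zach's code; `FirstBitSafe` is my name, and the bsf is emulated with a loop that leaves the result untouched on a zero source, matching the assumption above):

```c
#include <assert.h>
#include <stdint.h>

/* Mask trick plus the extra preset: r starts at 32 so that when the combined
   scan word is zero (empty board), the result is 32 + 32 = 64. */
static int32_t FirstBitSafe(uint64_t bits)
{
    uint32_t lo = (uint32_t)bits;
    uint32_t hi = (uint32_t)(bits >> 32);
    uint32_t mask = (uint32_t)((lo != 0) - 1);  /* 0 if lo != 0, else 0xFFFFFFFF */
    int32_t add = (int32_t)(32 & mask);
    uint32_t scan = lo | (hi & mask);
    int32_t r = 32;                  /* the "mov eax, edx" preset: survives a zero scan */
    for (int i = 0; i < 32; i++)     /* emulated bsf: no-op when scan == 0 */
        if (scan & (1u << i)) { r = i; break; }
    return r + add;
}
```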
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

My Assembler proposal somewhere in the thread
===============================================

Code: Select all


ULONG bsf64a(BTB_T bb) 
   {
    //_asm mov eax,32 here, to have 64 as return value for an empty board

    _asm bsf eax,DWORD PTR bb+4 
    _asm add eax,32 
    _asm bsf eax,DWORD PTR bb 

    //return eax 
   } 

which leads to the introduction of the thread...
==================================================

Code: Select all


ULONG bsf64(BTB_T bb) 
   { 
    UI_32 *const ptr = (UI_32*)&bb; 
    ULONG id=64; 

    _BitScanForward(&id,*(ptr+1)); 
    id+=32; 
    _BitScanForward(&id,*ptr); 

    return(id); 
   } 

which further leads to
=========================

Code: Select all

ULONG bsf64(BTB_T bb) 
   { 
    ULONG id; //=32 here, so an empty board returns 64 (32 + the add below)

    _BitScanForward(&id,(UI_32)(bb>>32)); 
    id+=32; 
    _BitScanForward(&id,(UI_32)bb); 

    return(id); 
   } 
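That intrinsic variant relies on _BitScanForward leaving *id untouched when the source is zero. A portable sketch of the same control flow (my code, not Desperado's; `BitScanForward_c` emulates the MSVC intrinsic's contract, with uint32_t in place of ULONG, and the init of 32 so an empty board returns 64):

```c
#include <assert.h>
#include <stdint.h>

/* Emulation of MSVC's _BitScanForward: returns nonzero and sets *id to the
   index of the lowest set bit, or returns 0 and leaves *id unchanged. */
static unsigned char BitScanForward_c(uint32_t *id, uint32_t mask)
{
    if (mask == 0) return 0;
    uint32_t i = 0;
    while (!(mask & 1u)) { mask >>= 1; i++; }
    *id = i;
    return 1;
}

/* High half first, then +32, then let a low-half hit overwrite the result. */
static uint32_t bsf64_c(uint64_t bb)
{
    uint32_t id = 32;  /* so an empty board falls through to 32 + 32 = 64 */
    BitScanForward_c(&id, (uint32_t)(bb >> 32));
    id += 32;
    BitScanForward_c(&id, (uint32_t)bb);
    return id;
}
```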

which becomes (to save a 2nd scan)
====================================

Code: Select all


ULONG bsf64(UI_64 bb)
	{
	 ULONG id; //=33 to have 64 as return value (33 + 31 + 0 when both scans fail)

	 _BitScanForward(&id,(UI_32)bb) ? 
	 id : 
	 id += 31 + _BitScanForward(&id,(UI_32)(bb>>32));

	 return(id);
	}
 

which is based on my 2nd Assembler proposal
============================================

Code: Select all


ULONG bsf64a(BTB_T bb) 
   { 
    //_asm mov eax,32 //to have 64 as return value

    _asm bsf eax,DWORD PTR bb 
    _asm jnz DONE 
    _asm bsf eax,DWORD PTR bb+4 
    _asm add eax,32 
DONE: ; 
    // return eax 
   }  

So, if intrinsics are available, I would definitely use them, because compared to hand-made inline assembler the compiler knows everything about the register status and should (and will) optimize their usage. Or the other way around: using inline assembler may _unoptimize_ the code, because it forces the use of registers that may already be in use somehow.

If you only used the inline-assembler function once or twice, you could optimize by hand; but in a chess engine you use it in hundreds of different situations, and that will decrease performance significantly.

Am I wrong?

Nice Sunday to everyone. :)