P.S.: I am not very fond of assembler, because MSVC and gcc use different syntax, so you end up with different versions for different compilers. If you do that, there should be a very good reason, read at least a 5% increase in speed; but I bet with you that the above, with all the added ancillary stuff (the bb != 0 test and the bit reset), is not 5% faster than mine. I think it is not faster at all, because it needs two bsf instructions against my one multiply plus one branch.
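One way to avoid maintaining two assembler dialects is to hide the difference behind the compilers' own intrinsics. A minimal sketch of that idea (the name lsb_portable is made up for this example; a nonzero bb is assumed, since both bsf and __builtin_ctzll are undefined for zero input):

```c
#include <assert.h>

#if defined(_MSC_VER)
#  include <intrin.h>
/* MSVC path: two 32-bit _BitScanForward calls cover a 64-bit word. */
static int lsb_portable(unsigned long long bb)   /* bb != 0 */
{
    unsigned long idx;
    if (_BitScanForward(&idx, (unsigned long)bb))
        return (int)idx;                         /* bit found in low dword */
    _BitScanForward(&idx, (unsigned long)(bb >> 32));
    return (int)idx + 32;
}
#elif defined(__GNUC__)
/* gcc path: the builtin compiles to a single bsf/tzcnt where available. */
static int lsb_portable(unsigned long long bb)   /* bb != 0 */
{
    return __builtin_ctzll(bb);
}
#endif
```

The wrapper costs nothing extra over the hand-written versions but keeps a single source file compiling on both compilers.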
mcostalba wrote:
What I am thinking is that the branch is not so bad in the end.
If you use the bitscan in a tight loop to serialize a piece's moves, and you pass in a bitboard with all the attack squares of a given piece, then there is a good probability that if the first bit is in, say, the low word, the next one will be too, because the attack squares of sliding pieces, pawns and kings are normally close together.
70%
In 70% of cases, if the previous bit was in the low (or high) half of the 64-bit word, the next one is in the same half.
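That locality figure is exactly what a branching 32-bit scan exploits: test the low dword first, and the branch predicts well when consecutive bits fall in the same half. A sketch using the well-known De Bruijn lookup (constants from the standard 0x077CB531 sequence; a nonzero bb is assumed):

```c
#include <assert.h>

/* De Bruijn lookup for a 32-bit least-significant-bit scan
   (standard table for the multiplier 0x077CB531). */
static const int lsb_index32[32] = {
     0,  1, 28,  2, 29, 14, 24,  3,
    30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7,
    26, 12, 18,  6, 11,  5, 10,  9
};

static int lsb32(unsigned int b)                 /* b != 0 */
{
    /* b & -b isolates the lowest set bit; the multiply+shift hashes
       the 32 possible one-bit values to distinct table slots. */
    return lsb_index32[((b & (0u - b)) * 0x077CB531u) >> 27];
}

/* Branching 64-bit scan: try the low dword first.  With the ~70%
   same-half locality above, the branch is usually predicted. */
static int lsb64(unsigned long long bb)          /* bb != 0 */
{
    unsigned int lo = (unsigned int)bb;
    if (lo)
        return lsb32(lo);
    return 32 + lsb32((unsigned int)(bb >> 32));
}
```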
I just swapped the contents of the arrays and used bb >> 32 as the condition.
Anyway, these two functions are faster than my initial proposals in this thread, are easy to handle, and both look 32-bit friendly to me.
Maybe Zach's (I like it) is faster, maybe not; maybe it is machine dependent?
Additionally, both (forward and reverse) work the same way, which makes me happy.
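Since the swapped arrays themselves are not shown here, this is only a sketch of the shape described: a reverse (most-significant-bit) scan whose `bb >> 32` test mirrors the forward version, here built on the standard 0x07C4ACDD De Bruijn log2 trick rather than the original tables:

```c
#include <assert.h>

/* De Bruijn log2 table for a 32-bit most-significant-bit scan
   (standard table for the multiplier 0x07C4ACDD). */
static const int msb_index32[32] = {
     0,  9,  1, 10, 13, 21,  2, 29,
    11, 14, 16, 18, 22, 25,  3, 30,
     8, 12, 20, 28, 15, 17, 24,  7,
    19, 27, 23,  6, 26,  5,  4, 31
};

static int msb32(unsigned int v)                 /* v != 0 */
{
    v |= v >> 1;  v |= v >> 2;  v |= v >> 4;     /* smear the top bit down */
    v |= v >> 8;  v |= v >> 16;                  /* v is now 2^(k+1) - 1   */
    return msb_index32[(v * 0x07C4ACDDu) >> 27];
}

/* Reverse scan with the same shape as the forward one:
   the "bb >> 32" test picks which half to scan. */
static int msb64(unsigned long long bb)          /* bb != 0 */
{
    unsigned int hi = (unsigned int)(bb >> 32);
    if (hi)
        return 32 + msb32(hi);
    return msb32((unsigned int)bb);
}
```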
Measured outside of a complex piece of software (an engine):

intrinsic versions...
1. first proposals (with branching)
2. Zach's
3. last proposals
4. ...
What I was searching for is a
- competitive
- 32-bit friendly
- working in both directions (forward, reverse)
solution.
At least for me, my first proposals (after elimination of the _pointer traversals_) meet these points.
(Of course some other solutions do this too.)
So I found what I was searching for.
And it was nice to see some of the little tricks some of you are using!
A special thanks to you, Marco, for spending time on testing!
Michael Sherwin wrote:
My bitboards can be empty so I need a 64 returned--or I would test it now. So, when I have some time. Very nice concept though!
That's fairly simple to do--we just need to add one instruction, assuming bsf on a zero argument doesn't change the destination. Set eax to 32 before the bsf, and add 32 afterwards when the lower half is zero; for an empty argument the two together give 64.
Also, note that this is 100% untested--I'm only pretty sure it works.
__inline s32 FirstBit(u64 bits)
{
    __asm
    {
        mov   ebx, dword ptr bits     ; ebx = low dword
        mov   edx, 32
        mov   eax, edx                ; eax = 32, survives a bsf of zero
        xor   ecx, ecx                ; clear ecx so setne cl is enough
        test  ebx, ebx
        setne cl                      ; cl = 1 if low dword is nonzero
        dec   ecx                     ; ecx = 0 if low set, -1 if low empty
        and   edx, ecx                ; edx = 0 or 32 (offset to add later)
        and   ecx, dword ptr bits + 4 ; ecx = 0 or high dword
        or    ecx, ebx                ; ecx = low dword, else high dword
        bsf   eax, ecx                ; assumes eax unchanged when ecx = 0
        add   eax, edx                ; empty board: 32 + 32 = 64
    }
}
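The same select-and-offset trick can be written in plain C, which makes the data flow easier to follow (a sketch only; ctz_nonzero is a made-up portable stand-in for bsf):

```c
#include <assert.h>

/* Stand-in for bsf: index of the lowest set bit, caller must
   guarantee the argument is nonzero. */
static int ctz_nonzero(unsigned int b)           /* b != 0 */
{
    int n = 0;
    while (!(b & 1u)) { b >>= 1; ++n; }
    return n;
}

/* C rendering of the mask trick above: build an all-ones mask when
   the low dword is empty, use it to both select the high dword and
   add the +32 offset, and return 64 for an empty board. */
static int FirstBitC(unsigned long long bits)
{
    unsigned int lo   = (unsigned int)bits;
    unsigned int hi   = (unsigned int)(bits >> 32);
    unsigned int mask = (lo != 0) - 1u;          /* 0 if lo set, ~0 if empty */
    unsigned int word = lo | (hi & mask);        /* scan lo, else hi */
    int          base = (int)(32u & mask);       /* +32 when scanning hi */
    if (word == 0)
        return 64;                               /* empty board */
    return base + ctz_nonzero(word);
}
```

Unlike the asm, this version needs no assumption about what bsf leaves in its destination, because the zero case is tested explicitly.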
Well, in MSVC6 it only runs for a debug compile, so I ran my version also in debug. Maybe it is not a fair comparison--I don't know.
My newest (one branch)        -- 870,549 nps
Your no-branch, one-scan      -- 861,234 nps
I will try to find out why it locks up for a release compile.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Terrible idea. When a company (Intel) says "operation is undefined" it is foolish to try it out, and then _depend_ on the specific behaviour you notice. You can find a perfectly safe PopCnt() in Crafty that returns 64 when no bits are set, and which does not depend on undefined behaviour that can change in the next processor generation without anyone knowing.
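The safe shape being described--test for an empty argument explicitly and return 64, instead of relying on whatever bsf happens to leave behind--can be sketched in C like this (not Crafty's actual code; safe_ctz32 is a made-up portable stand-in for bsf):

```c
#include <assert.h>

/* Stand-in for bsf on a nonzero 32-bit word. */
static int safe_ctz32(unsigned int b)            /* b != 0 */
{
    int n = 0;
    while (!(b & 1u)) { b >>= 1; ++n; }
    return n;
}

/* Every path either scans a word proven nonzero or returns the
   defined sentinel 64, so no undefined bsf behaviour is involved. */
static int SafeFirstBit(unsigned long long bits)
{
    unsigned int lo = (unsigned int)bits;
    unsigned int hi = (unsigned int)(bits >> 32);
    if (lo)
        return safe_ctz32(lo);
    if (hi)
        return 32 + safe_ctz32(hi);
    return 64;                                   /* empty board, by design */
}
```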
Here is a 32 bit version that does not depend on undefined behaviour and which only has two extra instructions.