32 bit versions for bitscan64

Discussion of chess software programming and technical issues.


mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Desperado wrote: or did you already test it ?
No I didn't :-)

...and you ?

P.S.: I am not very fond of assembler, because MSVC and gcc have different syntax, so you end up with different versions for different compilers. If you do that there should be a very good reason, read at least a 5% increase in speed, but I bet that the above, with all the added ancillary stuff (the bb != 0 test and the bit reset), is not 5% faster than mine. I think it is not faster at all, because it trades one multiply for two bsf instructions plus a branch.
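For context, the multiply-based scan Marco is presumably comparing against is a De Bruijn bitscan. The sketch below is not his code; the constant is the widely published B(2,6) De Bruijn sequence, and the lookup table is built at startup so no hard-coded values need to be trusted:

```c
#include <stdint.h>

/* Multiplication-based (De Bruijn) forward bitscan: one multiply,
   one shift, one table lookup, no branch.  The isolated lowest bit
   (bb & -bb) equals 1 << i, so multiplying by the De Bruijn constant
   shifts a unique 6-bit pattern into the top bits. */
static const uint64_t debruijn64 = 0x03f79d71b4cb0a89ULL;
static int index64[64];

void init_bitscan(void)
{
    /* (debruijn64 << i) >> 58 is unique for each i in 0..63 */
    for (int i = 0; i < 64; i++)
        index64[(debruijn64 << i) >> 58] = i;
}

int bsf_debruijn(uint64_t bb)   /* assumes bb != 0 */
{
    return index64[((bb & (0 - bb)) * debruijn64) >> 58];
}
```

On 32-bit hardware the 64-bit multiply is the expensive part, which is exactly the trade-off being debated here.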
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

mcostalba wrote: What I am thinking is that the branch is not so bad in the end.

If you use the bitscan in a tight loop to serialize piece moves, and you pass in a bitboard with all the attacked squares of a given piece, then there is a good probability that if one bit is, say, in the low word, the next one will be too, because the attack squares of sliding pieces, pawns, and kings are normally close together.

In about 70% of cases, if the previous bit was in the low/high half of the 64-bit word, the next one is in the same half.
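A sketch of the tight serialization loop being described (the bsf64 stand-in below uses the GCC/Clang builtin, not any version posted in this thread). Because successive set bits of an attack bitboard tend to sit in the same 32-bit half, a low/high branch inside the scan is usually well predicted:

```c
#include <stdint.h>

/* Branchy 32-bit-friendly scan: test the low half first. */
static int bsf64(uint64_t bb)   /* assumes bb != 0 */
{
    if ((uint32_t)bb)                                /* bit in low half  */
        return __builtin_ctz((uint32_t)bb);
    return 32 + __builtin_ctz((uint32_t)(bb >> 32)); /* bit in high half */
}

/* Serialize all set bits of bb into sq[]; returns the count. */
int serialize(uint64_t bb, int sq[])
{
    int n = 0;
    while (bb) {
        sq[n++] = bsf64(bb);   /* index of lowest set bit */
        bb &= bb - 1;          /* clear the bit just found */
    }
    return n;
}
```

In a real engine the loop body would generate a move to each square instead of storing it.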
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

Hi,

just another one (not slow, but is it fast enough?)

Code: Select all


ULONG bsf64b(UI_64 bb)
	{
	 const UI_32 off[2]  = {0, 32};
	 UI_32 lohi[2] = {(UI_32)bb, (UI_32)(bb >> 32)};
	 ULONG  id;

	 /* if the low word is empty, scan the high word and add 32 */
	 _BitScanForward(&id, lohi[(UI_32)bb == 0]);

	 return (id + off[(UI_32)bb == 0]);
	}

Maybe someone wants to tune this one; I am not sure about the == comparison.

Perhaps (bool)(UI_32)bb instead,

or tmp = (bb == 0) and reuse it at the end, I don't know...

(but branchless :wink: )

:)
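For readers without MSVC, a portable sketch of the same array-select trick, with __builtin_ctz (GCC/Clang) standing in for _BitScanForward and stdint types in place of UI_32/UI_64 — an illustration, not Desperado's exact code:

```c
#include <stdint.h>

/* Branchless select between the two 32-bit halves: the comparison
   result (0 or 1) indexes both the word to scan and the offset.
   Assumes bb != 0, just like the intrinsic version. */
unsigned bsf64_portable(uint64_t bb)
{
    const unsigned off[2] = {0, 32};
    uint32_t lohi[2] = {(uint32_t)bb, (uint32_t)(bb >> 32)};
    int hi = ((uint32_t)bb == 0);   /* 1 -> low word empty, scan high */
    return __builtin_ctz(lohi[hi]) + off[hi];
}
```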
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

edit:

...

And a very simple reversal function (which I think can be tuned further).
Because there are a lot of shifts, it could be profitable to use tmp = bb >> 32 ...

Code: Select all


ULONG bsr64b(UI_64 bb)
	{
	 const UI_32 off[2]  = {32, 0};
	 UI_32 lohi[2] = {(UI_32)(bb >> 32), (UI_32)bb};
	 ULONG  id;

	 /* if the high word is empty, scan the low word (no +32) */
	 _BitScanReverse(&id, lohi[(UI_32)(bb >> 32) == 0]);

	 return (id + off[(UI_32)(bb >> 32) == 0]);
	}
I just swapped the contents of the arrays and used bb >> 32 as the condition.

Anyway, these two functions are faster than my initial proposals in this thread, are easy to handle, and both look 32-bit friendly to me.
Maybe Zach's (I like it) is faster, maybe not; maybe it is machine dependent?
Additionally, both (forward and reverse) work the same way, which makes me happy. :lol:


:)
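The reverse scan translates the same way. Here is a portable sketch with 31 - __builtin_clz (GCC/Clang) standing in for _BitScanReverse — again an illustration under those assumptions, not the posted code:

```c
#include <stdint.h>

/* Mirror of the forward version: scan the high word unless it is
   empty, in which case scan the low word with no +32 offset.
   Assumes bb != 0. */
unsigned bsr64_portable(uint64_t bb)
{
    const unsigned off[2] = {32, 0};
    uint32_t lohi[2] = {(uint32_t)(bb >> 32), (uint32_t)bb};
    int lo = ((uint32_t)(bb >> 32) == 0);  /* 1 -> high word empty */
    return (31 - __builtin_clz(lohi[lo])) + off[lo];
}
```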
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

Desperado wrote:edit:

...

And a very simple reversal function (which I think can be tuned further).
Because there are a lot of shifts, it could be profitable to use tmp = bb >> 32 ...

Code: Select all


ULONG bsr64b(UI_64 bb)
	{
	 const UI_32 off[2]  = {32, 0};
	 UI_32 lohi[2] = {(UI_32)(bb >> 32), (UI_32)bb};
	 ULONG  id;

	 /* if the high word is empty, scan the low word (no +32) */
	 _BitScanReverse(&id, lohi[(UI_32)(bb >> 32) == 0]);

	 return (id + off[(UI_32)(bb >> 32) == 0]);
	}
I just swapped the contents of the arrays and used bb >> 32 as the condition.

Anyway, these two functions are faster than my initial proposals in this thread, are easy to handle, and both look 32-bit friendly to me.
Maybe Zach's (I like it) is faster, maybe not; maybe it is machine dependent?
Additionally, both (forward and reverse) work the same way, which makes me happy. :lol:


:)
Measured outside of a complex piece of software (an engine), the intrinsic versions rank as follows:

1. first proposals (with branching)
2. Zach's
3. last proposals
4. ...

(enough scanning for me now)...

thx to all
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Desperado wrote: (enough scanning for me now)...

thx to all
I've made up my mind that we are struggling over something that in the very best case can give an extra 1%... but realistically even less.

There are at least 2-3 solutions that differ by a fraction of a point; this is too CPU/compiler/surrounding-engine dependent.

So I would say we are at a dead end: there is no single winner, just a bunch of solutions all at the same level, practically speaking.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

I think so too!

Well, I searched for a:

- competitive
- 32-bit friendly
- working in both directions (forward and reverse)

solution.

At least for me, my first proposals (after eliminating the pointer traversals) meet these points.
(Of course some other solutions do this too.)
So I found what I was searching for.
And it was nice to see some of the little tricks some of you are using!

A special thx to you Marco for spending time on testing!

:) regards
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

Zach Wegner wrote:
Michael Sherwin wrote: My bitboards can be empty so I need a 64 returned--or I would test it now. So, when I have some time. Very nice concept though! :D
That's fairly simple to do--just need to add one instruction, assuming bsf on a zero argument doesn't change the destination. Set eax to 32 before the bsf and then we add 32 after, since the lower half is zero.

Also, note that this is 100% untested, I'm only pretty sure it works.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm
    {
        mov ebx, dword ptr bits
        mov edx, 32
        mov eax, edx
        test ebx, ebx
        setne ecx
        dec ecx
        and edx, ecx
        and ecx, dword ptr bits+4
        or ecx, ebx
        bsf eax, ecx
        add eax, edx
    }
}
Well, in MSVC6 it only runs for a debug compile, so I ran my version also in debug. Maybe it is not a fair comparison--I don't know.

My newest one branch -- 870,549 nps
your no branch one scan -- 861,234 nps

I will try to find out why it locks up for a release compile.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: 32 bit versions for bitscan64

Post by Michael Sherwin »

mcostalba wrote:
Desperado wrote: (enough scanning for me now)...

thx to all
I've made up my mind that we are struggling over something that in the very best case can give an extra 1%... but realistically even less.

There are at least 2-3 solutions that differ by a fraction of a point; this is too CPU/compiler/surrounding-engine dependent.

So I would say we are at a dead end: there is no single winner, just a bunch of solutions all at the same level, practically speaking.
Maybe, maybe not!

I somehow lost my changes so I had to revisit this topic.

Fastest for me, by a substantial amount, on a Q6600 was my one branch code.

Tried another minor change.

Now this new way is 1.6% faster in search speed than my last one.

Code: Select all

__inline s32 FirstBit(u64 bits) 
{ 
    __asm 
    { 
        bsf eax, dword ptr bits 
        jnz done 
        mov eax, 32
        mov edx, 32 
        bsf eax, dword ptr bits+4 
        add eax, edx 
done:
    } 
}
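For comparison, a plain C sketch of the same low-word-first, early-out shape with fully defined behaviour on an empty bitboard (it returns 64, as Michael needs). This is not his code — __builtin_ctz (GCC/Clang) stands in for the asm, and nothing depends on what bsf does with a zero source:

```c
#include <stdint.h>

/* One-branch-per-half scan, defined for all inputs.  The low-word
   test is the one that is taken most of the time in move
   serialization, per the locality argument earlier in the thread. */
int first_bit_safe(uint64_t bits)
{
    uint32_t lo = (uint32_t)bits;
    if (lo)
        return __builtin_ctz(lo);        /* common case */
    uint32_t hi = (uint32_t)(bits >> 32);
    if (hi)
        return 32 + __builtin_ctz(hi);
    return 64;                           /* empty bitboard */
}
```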
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 32 bit versions for bitscan64

Post by bob »

Michael Sherwin wrote:
Zach Wegner wrote:
Michael Sherwin wrote: My bitboards can be empty so I need a 64 returned--or I would test it now. So, when I have some time. Very nice concept though! :D
That's fairly simple to do--just need to add one instruction, assuming bsf on a zero argument doesn't change the destination. Set eax to 32 before the bsf and then we add 32 after, since the lower half is zero.

Also, note that this is 100% untested, I'm only pretty sure it works.

Code: Select all

__inline s32 FirstBit(u64 bits)
{
    __asm
    {
        mov ebx, dword ptr bits
        mov edx, 32
        mov eax, edx
        test ebx, ebx
        setne ecx
        dec ecx
        and edx, ecx
        and ecx, dword ptr bits+4
        or ecx, ebx
        bsf eax, ecx
        add eax, edx
    }
}
Well, in MSVC6 it only runs for a debug compile, so I ran my version also in debug. Maybe it is not a fair comparison--I don't know.

My newest one branch -- 870,549 nps
your no branch one scan -- 861,234 nps

I will try to find out why it locks up for a release compile.
Terrible idea. When a company (Intel) says "operation is undefined" it is foolish to try it out, and then _depend_ on the specific behaviour you notice. You can find a perfectly safe PopCnt() in Crafty that returns 64 when no bits are set, and which does not depend on undefined behaviour that can change in the next processor generation without anyone knowing.

Here is a 32 bit version that does not depend on undefined behaviour and which only has two extra instructions.

Code: Select all

int static __inline__ MSB(BITBOARD word)
{
  int dummy1, dummy2, dummy3;

asm("        bsr     %1, %0      " "\n\t"
    "        jnz     1f          " "\n\t"
    "        bsr     %2, %0      " "\n\t"
    "        jnz     2f          " "\n\t"
    "        movl    $64, %0     " "\n\t"
    "        jmp     2f          " "\n\t"
    "1:      addl    $32,%0      " "\n\t"
    "2:                          " "\n\t"
:   "=&q"(dummy1), "=&q"(dummy2), "=&q"(dummy3)
:    "1"((int) (word >> 32)), "2"((int) word)
:    "cc");
  return (dummy1);
}
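A portable sketch with the same contract as the MSB routine above (index of the highest set bit, 64 when no bits are set), using the GCC/Clang __builtin_clz builtin instead of inline asm — an equivalent under those assumptions, not Crafty's code:

```c
#include <stdint.h>

/* Reverse scan: try the high word first, fall back to the low
   word, and return 64 for an empty board -- no reliance on
   undefined bsr behaviour. */
int msb64(uint64_t word)
{
    uint32_t hi = (uint32_t)(word >> 32);
    if (hi)
        return 32 + (31 - __builtin_clz(hi));
    uint32_t lo = (uint32_t)word;
    if (lo)
        return 31 - __builtin_clz(lo);
    return 64;  /* no bits set */
}
```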