32 bit versions for bitscan64

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

edit:

And especially to Marco,

because you have an Intel-Machine, the branchless version
may be significantly faster. Perhaps you want to have a try ?

Code: Select all

ULONG bsf64(BTB_T bb) 
   { 
    ULONG id; //=64 to have 64 as return value 

    _BitScanForward(&id,(UI_32)(bb>>32)); 
    id+=32; 
    _BitScanForward(&id,(UI_32)bb); 

    return(id); 
   } 

cheers
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: 32 bit versions for bitscan64

Post by stegemma »

I suppose that ~12 cycles for this assembly code is not true. The imul itself takes 11 cycles, if i remember well.

In my assembly programs i use PENTOPT to verify the number of cycles and i suggest to test your code with this tool (or any equivalent). It seems hard to find on the net... but there are a demo version that works fine for little pieces of code.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: 32 bit versions for bitscan64

Post by Gerd Isenberg »

stegemma wrote:I suppose that ~12 cycles for this assembly code is not true. The imul itself takes 11 cycles, if i remember well.

In my assembly programs i use PENTOPT to verify the number of cycles and i suggest to test your code with this tool (or any equivalent). It seems hard to find on the net... but there are a demo version that works fine for little pieces of code.
I implicitly referred AMD K8/K10 or Intel core2, but not Intel P4.
On these cpus imul reg, imm32 is 3 cycles.
source:
Software Optimization Guide for AMD64 Processors (pdf)
Software Optimization Guide for AMD Family 10h Processors (pdf) Appendix C Instruction Latencies
Intel 64 and IA32 Architectures Optimization Reference Manual (pdf) Appendix C Instruction Latencies
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Desperado wrote:edit:

And especially to Marco,

because you have an Intel-Machine, the branchless version
may be significantly faster. Perhaps you want to have a try ?

Code: Select all

ULONG bsf64(BTB_T bb) 
   { 
    ULONG id; //=64 to have 64 as return value 

    _BitScanForward(&id,(UI_32)(bb>>32)); 
    id+=32; 
    _BitScanForward(&id,(UI_32)bb); 

    return(id); 
   } 

cheers
Is this your last version :-) ? the super super super best version ? are you sure ? :-) confirmed :-) ??

Ok ! here we go ........

...to be continued :-)
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: 32 bit versions for bitscan64

Post by stegemma »

Thanks for the links, my informations seems to be very old.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: 32 bit versions for bitscan64

Post by Gerd Isenberg »

stegemma wrote:Thanks for the links, my informations seems to be very old.
You are welcome. The fast imul of these cpus is the reason, the whole mul hashing stuff is competitive, like De Bruijn mul, Matt's folded mul, kindergarten and magic bitboards.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

mcostalba wrote:
Desperado wrote:edit:

And especially to Marco,

because you have an Intel-Machine, the branchless version
may be significantly faster. Perhaps you want to have a try ?

Code: Select all

ULONG bsf64(BTB_T bb) 
   { 
    ULONG id; //=64 to have 64 as return value 

    _BitScanForward(&id,(UI_32)(bb>>32)); 
    id+=32; 
    _BitScanForward(&id,(UI_32)bb); 

    return(id); 
   } 

cheers
Is this your last version :-) ? the super super super best version ? are you sure ? :-) confirmed :-) ??

Ok ! here we go ........

...to be continued :-)


I am sorry but it doesn't work :-(


Are you sure you have already functionality tested the routine ?

I suspect that the last _BitScanForward() sets id to zero in case of empty half bitboard and so overwriting previous results :-(
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

Really, i am sorry for this ! :oops:

Well this isnt my _LAST_ (super super version), contrary
it was the _FIRST_ version if you look back to the beginning
of the Thread. But we choose the the branching version
because it runs faster for me, and somehow i forgot
to validate this one.

It really modifies the _id_ somehow, and i thougth it wouldnt
if no bit is found.( :shock: )
And that was somehow enforced, because i know that bsf
isnt modifying the register is there isnt a bit.

and because the first runs only found bits in the lo_part of value
id doesnt produced an error.

Well if someone has to test the return value of _BitScanForward,
this version doesnt pay of anymore (or it becomes like we used
it already with ?: operator)

well , a big big sorry...

:cry:
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Desperado wrote: well , a big big sorry...

:cry:
No problem.

Well in the mean time I have tested the real branchless version that of Zach Wegner:

Code: Select all

Square first_1(Bitboard b) {

  unsigned long id;
  uint32_t mask;

  mask = ((uint32_t)b != 0) - 1;
  _BitScanForward(&id, (((uint32_t)(b >> 32)) & mask) | (uint32_t)b);
  return Square(id + (32 & mask)); 
}
More or less we are always at the same speed of the current best (mine ;-) ), perhaps a bit slower but just a bit.


What I am thinking is that the branch is not so bad at the end.

If you use the bitscan in a tight loop to serialize the pieces moves and you pass in input a bitboard with all the attacksquares of a given piece then there is a good probability that if the first bit is say in low-word then also the next one will be because attacks squares for sliding pieces or pawns or kings are normally near, so that if you have a rook in A1 then the branch predictor for the bitboard of the squares attacked by the rook is not that bad after the first one....

Anyhow at the end of this long story the real real best (on my PC) seems the mine with a small modification to remove a temporary:

Code: Select all

// Use type-punning
union b_union {

    Bitboard b;
    struct {
        uint32_t l;
        uint32_t h;
    } dw;
};

// WARNING: Needs -fno-strict-aliasing compiler option
Square pop_1st_bit(Bitboard* bb) {

  b_union u;

  u.b = *bb;

  if (u.dw.l)
  {
      *((uint32_t*)bb) = u.dw.l & (u.dw.l - 1);
      return Square(BitTable[((u.dw.l ^ (u.dw.l - 1)) * 0x783a9b23) >> 26]);
  }

  *((uint32_t*)bb+1) = u.dw.h & (u.dw.h - 1); // Little endian only?
  return Square(BitTable[((~(u.dw.h ^ (u.dw.h - 1))) * 0x783a9b23) >> 26]);
}
Could anyone beat this ? ;-)

...but this time someone else do the tests :-)
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

Well with all pros/cons and for curiosity
the selfmade inline assembly-function may be interesting.

And this works for sure ( :D ) (not like my last proposal)

Code: Select all


ULONG bsf64a(UI_64 bb)
	{
	 _asm
		{
		 //mov eax,32 as free choice
		 bsf eax,DWORD PTR bb+4
		 add eax,32
		 bsf eax,DWORD PTR bb
		}

	 //return eax
	}
with a seperate pop1 (and bb!=0)

I expect the speed will drop although it looks very cheap.

or did you already test it ?