32 bit versions for bitscan64

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 2:15 pm

Ok I have tested with:

Square pop_1st_bit&#40;Bitboard* bb&#41;
&#123;
  unsigned long id;

  _BitScanForward&#40;&id, &#40;uint32_t&#41;(*bb&#41;) ? id &#58; id += 31 + _BitScanForward&#40;&id, &#40;uint32_t&#41;((*bb&#41; >> 32&#41;);
  *bb ^= *bb & -(*bb&#41;;
  return Square&#40;id&#41;;
&#125;

but no speed increase, perhaps just very slightly, anyhow below 1%

Then I have changed my move serialization loop from

Code: Select all

while &#40;b&#41;
    (*mlist++).move = make_move&#40;from, pop_1st_bit&#40;&b&#41;)

to

Code: Select all

while &#40;b&#41;
&#123;
    Square id = first_1&#40;b&#41;;
    (*mlist++).move = make_move&#40;from, id&#41;;
    b ^= &#40;1ULL << id&#41;;
&#125;

With first_1() defined as:

Code: Select all

Square first_1&#40;Bitboard b&#41; &#123;

  unsigned long id;

  _BitScanForward&#40;&id, &#40;uint32_t&#41;b&#41; ? id &#58; id += 31 + _BitScanForward&#40;&id, &#40;uint32_t&#41;&#40;b >> 32&#41;);

  return Square&#40;id&#41;; 
&#125;

And to my surprise I experienced a -11% speed decrease from 446.222 nodes/sec to 401.151 nodes/sec !!!!

Desperado · Post by **Desperado** » Sat Aug 22, 2009 2:40 pm

mcostalba wrote:Ok I have tested with:
Code: Select all
while &#40;b&#41;
&#123;
    Square id = first_1&#40;b&#41;;
    (*mlist++).move = make_move&#40;from, id&#41;;
    b ^= &#40;1ULL << id&#41;;
&#125;
And to my surprise I experienced a -11% speed decrease from 446.222 nodes/sec to 401.151 nodes/sec !!!!

i cannot explain, but try b^=bb&-b instead of b^ (1ULL << id)

using this makes a _big_ difference on my machine (btw: keeps the pop independent on the scan, whatever this may be good for...)

cheers

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 2:41 pm

OK, I have put the definition of id out of the loop:

Code: Select all

Square id;
while &#40;b&#41;
&#123;
    id = first_1&#40;b&#41;;
    (*mlist++).move = make_move&#40;from, id&#41;;
    b ^= &#40;1ULL << id&#41;;
&#125;

Now it is better at 438.560 nodes/sec but still -1.7% slower then the combined bitscan + pop

Desperado · Post by **Desperado** » Sat Aug 22, 2009 2:47 pm

perhaps this one...

Code: Select all


while &#40;b&#41; 
&#123; 
    (*mlist++).move = make_move&#40;from, first_1&#40;b&#41;); 
    b ^= b&-b;
&#125;

good luck

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 2:48 pm

Already under pgo compilation.....

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 2:58 pm

Ok with this.

Code: Select all

while &#40;b&#41;
&#123;
     (*mlist++).move = make_move&#40;from, first_1&#40;b&#41;);
     b ^= b & (-b&#41;;
&#125;

Now we are at 443.655 nodes/sec, very similar to combined bitscan+pop that is at 446.222 nodes/sec.

My impression is that the current version that I posted at the beginning, does not have 64 bits operations that are very costly on a 32 bit machine, while in this one we have 3 operations more on 64 bits (xor, neg, and):

Code: Select all

 b ^= b & (-b&#41;;

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Aug 22, 2009 3:01 pm

Desperado wrote:
mcostalba wrote:Ok I have tested with:
Code: Select all
while &#40;b&#41;
&#123;
    Square id = first_1&#40;b&#41;;
    (*mlist++).move = make_move&#40;from, id&#41;;
    b ^= &#40;1ULL << id&#41;;
&#125;
And to my surprise I experienced a -11% speed decrease from 446.222 nodes/sec to 401.151 nodes/sec !!!!
i cannot explain, but try b^=bb&-b instead of b^ (1ULL << id)

using this makes a _big_ difference on my machine (btw: keeps the pop independent on the scan, whatever this may be good for...)

cheers

Sure, 1ULL << id in 32-bit mode is not that cheap, and dependent from the bitscan (also in 64-bit mode). May be the generated assembly is the same, but b &= b - 1 to reset LS1B is still one instruction less...

Desperado · Post by **Desperado** » Sat Aug 22, 2009 3:17 pm

i cannot get it faster than that

:

Code: Select all


ULONG bsf64_and_pop1&#40;UI_64 &bb&#41;
	&#123;
	 ULONG id;

	 _BitScanForward&#40;&id,&#40;UI_32&#41;bb&#41; ? 
	 id &#58; 
	 id += 31 + _BitScanForward&#40;&id,&#40;UI_32&#41;&#40;bb>>32&#41;);

                 bb &= bb - 1;
	 //bb ^= bb&-bb;

	 return&#40;id&#41;;
	&#125;

So once again Marco, would you be so nice to give me an intel-bench
for MattsFT and the bitscan-only?

i would be very (very very very ) thankful!

thx

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 3:20 pm

Gerd Isenberg wrote:May be the generated assembly is the same, but b &= b - 1 to reset LS1B is still one instruction less...

It is a bit faster indeed at 445.761 nodes/sec very similar to the reference version.

I would think any 64 bit operation in 32 bit mode is to avoid, perhaps that's the reason the reference version still seems to be the best.

Anyhow using the intrinsic _BitScanForward() instead of the Matt multiply doesn't seem to give a good increase even on an Intel Core 2 Duo, on slower CPU it can go only worse and on faster 64 bit CPU in any case you have another version tuned for 64 bits.

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 3:24 pm

Desperado wrote: So once again Marco, would you be so nice to give me an intel-bench
for MattsFT and the bitscan-only?

i would be very (very very very ) thankful!

I think I have alreaady gave you, the separated bitscan + pop.

With the last version we have more or less the same node count of the MattsFT version and the last tested version is separated bitscan.

I cannot test _only_ bitscan without pop because otherwise engine doen't work anymore. What I have done is to _separate_ the two and I think we have verified that the bitscan intrinsic version is of same speed of MattsFT version.

Am I missing something ?

32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64