32 bit versions for bitscan64

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 2:15 pm

Ok I have tested with:

Square pop_1st_bit(Bitboard* bb)
{
  unsigned long id;

  _BitScanForward(&id, (uint32_t)(*bb)) ? id : id += 31 + _BitScanForward(&id, (uint32_t)((*bb) >> 32));
  *bb ^= *bb & -(*bb);
  return Square(id);
}

but no speed increase, perhaps just very slightly, anyhow below 1%

Then I have changed my move serialization loop from

Code: Select all

while (b)
    (*mlist++).move = make_move(from, pop_1st_bit(&b))

to

Code: Select all

while (b)
{
    Square id = first_1(b);
    (*mlist++).move = make_move(from, id);
    b ^= (1ULL << id);
}

With first_1() defined as:

Code: Select all

Square first_1(Bitboard b) {

  unsigned long id;

  _BitScanForward(&id, (uint32_t)b) ? id : id += 31 + _BitScanForward(&id, (uint32_t)(b >> 32));

  return Square(id); 
}

And to my surprise I experienced a -11% speed decrease from 446.222 nodes/sec to 401.151 nodes/sec !!!!

Desperado · Post by **Desperado** » Sat Aug 22, 2009 2:40 pm

mcostalba wrote:Ok I have tested with:
Code: Select all
while (b)
{
    Square id = first_1(b);
    (*mlist++).move = make_move(from, id);
    b ^= (1ULL << id);
}
And to my surprise I experienced a -11% speed decrease from 446.222 nodes/sec to 401.151 nodes/sec !!!!

i cannot explain, but try b^=bb&-b instead of b^ (1ULL << id)

using this makes a _big_ difference on my machine (btw: keeps the pop independent on the scan, whatever this may be good for...)

cheers

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 2:41 pm

OK, I have put the definition of id out of the loop:

Code: Select all

Square id;
while (b)
{
    id = first_1(b);
    (*mlist++).move = make_move(from, id);
    b ^= (1ULL << id);
}

Now it is better at 438.560 nodes/sec but still -1.7% slower then the combined bitscan + pop

Desperado · Post by **Desperado** » Sat Aug 22, 2009 2:47 pm

perhaps this one...

Code: Select all


while (b) 
{ 
    (*mlist++).move = make_move(from, first_1(b)); 
    b ^= b&-b;
}

good luck

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 2:48 pm

Already under pgo compilation.....

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 2:58 pm

Ok with this.

Code: Select all

while (b)
{
     (*mlist++).move = make_move(from, first_1(b));
     b ^= b & (-b);
}

Now we are at 443.655 nodes/sec, very similar to combined bitscan+pop that is at 446.222 nodes/sec.

My impression is that the current version that I posted at the beginning, does not have 64 bits operations that are very costly on a 32 bit machine, while in this one we have 3 operations more on 64 bits (xor, neg, and):

Code: Select all

 b ^= b & (-b);

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Aug 22, 2009 3:01 pm

Desperado wrote:
mcostalba wrote:Ok I have tested with:
Code: Select all
while (b)
{
    Square id = first_1(b);
    (*mlist++).move = make_move(from, id);
    b ^= (1ULL << id);
}
And to my surprise I experienced a -11% speed decrease from 446.222 nodes/sec to 401.151 nodes/sec !!!!
i cannot explain, but try b^=bb&-b instead of b^ (1ULL << id)

using this makes a _big_ difference on my machine (btw: keeps the pop independent on the scan, whatever this may be good for...)

cheers

Sure, 1ULL << id in 32-bit mode is not that cheap, and dependent from the bitscan (also in 64-bit mode). May be the generated assembly is the same, but b &= b - 1 to reset LS1B is still one instruction less...

Desperado · Post by **Desperado** » Sat Aug 22, 2009 3:17 pm

i cannot get it faster than that

:

Code: Select all


ULONG bsf64_and_pop1(UI_64 &bb)
	{
	 ULONG id;

	 _BitScanForward(&id,(UI_32)bb) ? 
	 id : 
	 id += 31 + _BitScanForward(&id,(UI_32)(bb>>32));

                 bb &= bb - 1;
	 //bb ^= bb&-bb;

	 return(id);
	}

So once again Marco, would you be so nice to give me an intel-bench
for MattsFT and the bitscan-only?

i would be very (very very very ) thankful!

thx

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 3:20 pm

Gerd Isenberg wrote:May be the generated assembly is the same, but b &= b - 1 to reset LS1B is still one instruction less...

It is a bit faster indeed at 445.761 nodes/sec very similar to the reference version.

I would think any 64 bit operation in 32 bit mode is to avoid, perhaps that's the reason the reference version still seems to be the best.

Anyhow using the intrinsic _BitScanForward() instead of the Matt multiply doesn't seem to give a good increase even on an Intel Core 2 Duo, on slower CPU it can go only worse and on faster 64 bit CPU in any case you have another version tuned for 64 bits.

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 3:24 pm

Desperado wrote: So once again Marco, would you be so nice to give me an intel-bench
for MattsFT and the bitscan-only?

i would be very (very very very ) thankful!

I think I have alreaady gave you, the separated bitscan + pop.

With the last version we have more or less the same node count of the MattsFT version and the last tested version is separated bitscan.

I cannot test _only_ bitscan without pop because otherwise engine doen't work anymore. What I have done is to _separate_ the two and I think we have verified that the bitscan intrinsic version is of same speed of MattsFT version.

Am I missing something ?

32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64