32 bit versions for bitscan64

Discussion of chess software programming and technical issues.

Moderator: Ras

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Ok I have tested with:

Code: Select all

Square pop_1st_bit(Bitboard* bb)
{
  unsigned long id;

  _BitScanForward(&id, (uint32_t)(*bb)) ? id : id += 31 + _BitScanForward(&id, (uint32_t)((*bb) >> 32));
  *bb ^= *bb & -(*bb);
  return Square(id);
}
but no speed increase, perhaps just very slightly, anyhow below 1%

Then I have changed my move serialization loop from

Code: Select all

while (b)
    (*mlist++).move = make_move(from, pop_1st_bit(&b))
to

Code: Select all

while (b)
{
    Square id = first_1(b);
    (*mlist++).move = make_move(from, id);
    b ^= (1ULL << id);
}
With first_1() defined as:

Code: Select all

Square first_1(Bitboard b) {

  unsigned long id;

  _BitScanForward(&id, (uint32_t)b) ? id : id += 31 + _BitScanForward(&id, (uint32_t)(b >> 32));

  return Square(id); 
}
And to my surprise I experienced a -11% speed decrease from 446.222 nodes/sec to 401.151 nodes/sec !!!! :shock:
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

mcostalba wrote:Ok I have tested with:

Code: Select all

while (b)
{
    Square id = first_1(b);
    (*mlist++).move = make_move(from, id);
    b ^= (1ULL << id);
}
And to my surprise I experienced a -11% speed decrease from 446.222 nodes/sec to 401.151 nodes/sec !!!! :shock:
i cannot explain, but try b^=bb&-b instead of b^ (1ULL << id)

using this makes a _big_ difference on my machine (btw: keeps the pop independent on the scan, whatever this may be good for...)

cheers
Last edited by Desperado on Sat Aug 22, 2009 2:42 pm, edited 1 time in total.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

OK, I have put the definition of id out of the loop:

Code: Select all

Square id;
while (b)
{
    id = first_1(b);
    (*mlist++).move = make_move(from, id);
    b ^= (1ULL << id);
} 
Now it is better at 438.560 nodes/sec but still -1.7% slower then the combined bitscan + pop
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

perhaps this one...

Code: Select all


while (b) 
{ 
    (*mlist++).move = make_move(from, first_1(b)); 
    b ^= b&-b;
} 

good luck :)
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Already under pgo compilation..... :-)
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Ok with this.

Code: Select all

while (b)
{
     (*mlist++).move = make_move(from, first_1(b));
     b ^= b & (-b);
}
Now we are at 443.655 nodes/sec, very similar to combined bitscan+pop that is at 446.222 nodes/sec.

My impression is that the current version that I posted at the beginning, does not have 64 bits operations that are very costly on a 32 bit machine, while in this one we have 3 operations more on 64 bits (xor, neg, and):

Code: Select all

 b ^= b & (-b);
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: 32 bit versions for bitscan64

Post by Gerd Isenberg »

Desperado wrote:
mcostalba wrote:Ok I have tested with:

Code: Select all

while (b)
{
    Square id = first_1(b);
    (*mlist++).move = make_move(from, id);
    b ^= (1ULL << id);
}
And to my surprise I experienced a -11% speed decrease from 446.222 nodes/sec to 401.151 nodes/sec !!!! :shock:
i cannot explain, but try b^=bb&-b instead of b^ (1ULL << id)

using this makes a _big_ difference on my machine (btw: keeps the pop independent on the scan, whatever this may be good for...)

cheers
Sure, 1ULL << id in 32-bit mode is not that cheap, and dependent from the bitscan (also in 64-bit mode). May be the generated assembly is the same, but b &= b - 1 to reset LS1B is still one instruction less...
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

i cannot get it faster than that 8-) :

Code: Select all


ULONG bsf64_and_pop1(UI_64 &bb)
	{
	 ULONG id;

	 _BitScanForward(&id,(UI_32)bb) ? 
	 id : 
	 id += 31 + _BitScanForward(&id,(UI_32)(bb>>32));

                 bb &= bb - 1;
	 //bb ^= bb&-bb;

	 return(id);
	}


So once again Marco, would you be so nice to give me an intel-bench
for MattsFT and the bitscan-only?

i would be very (very very very ) thankful!


thx
Last edited by Desperado on Sat Aug 22, 2009 3:22 pm, edited 1 time in total.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Gerd Isenberg wrote:May be the generated assembly is the same, but b &= b - 1 to reset LS1B is still one instruction less...
It is a bit faster indeed at 445.761 nodes/sec very similar to the reference version.

I would think any 64 bit operation in 32 bit mode is to avoid, perhaps that's the reason the reference version still seems to be the best.

Anyhow using the intrinsic _BitScanForward() instead of the Matt multiply doesn't seem to give a good increase even on an Intel Core 2 Duo, on slower CPU it can go only worse and on faster 64 bit CPU in any case you have another version tuned for 64 bits.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Desperado wrote: So once again Marco, would you be so nice to give me an intel-bench
for MattsFT and the bitscan-only?

i would be very (very very very ) thankful!
I think I have alreaady gave you, the separated bitscan + pop.

With the last version we have more or less the same node count of the MattsFT version and the last tested version is separated bitscan.

I cannot test _only_ bitscan without pop because otherwise engine doen't work anymore. What I have done is to _separate_ the two and I think we have verified that the bitscan intrinsic version is of same speed of MattsFT version.

Am I missing something ?