32 bit versions for bitscan64

Gerd Isenberg · Post by **Gerd Isenberg** » Fri Aug 21, 2009 9:10 pm

mcostalba wrote:I have tested the below version on my Intel Core 2 Duo but with no results, perhaps a very very slightly increase but well below 1%

Code: Select all

Square pop_1st_bit(Bitboard* bb) {

  uint32_t* ptr = (uint32_t*)bb;
  unsigned long id;

  if (*ptr)
  {
      _BitScanForward(&id, *ptr);
      *ptr ^= (1 << id);
  }
  else
  {
      ptr += 1;
      _BitScanForward(&id, *ptr);
      *ptr ^= (1 << id);
      id += 32;
  }
  return Square(id);
}

This is the disassembled code (MSVC):

Code: Select all

Square pop_1st_bit(Bitboard* bb) {
0040E760  push        ecx  

  uint32_t* ptr = (uint32_t*)bb;
  unsigned long id;

  if (*ptr)
0040E761  mov         edx,dword ptr [esi]  
0040E763  push        edi  
  {
      _BitScanForward(&id, *ptr);
      *ptr ^= (1 << id);
0040E764  mov         edi,1  
0040E769  test        edx,edx  
0040E76B  je          pop_1st_bit+1Bh (40E77Bh)  
0040E76D  bsf         eax,edx  
0040E770  mov         ecx,eax  
0040E772  shl         edi,cl  
0040E774  xor         edi,edx  
0040E776  mov         dword ptr [esi],edi  
0040E778  pop         edi  
  }
  return Square(id);
}
0040E779  pop         ecx  
0040E77A  ret  
  }
  else
  {
      ptr += 1;
      _BitScanForward(&id, *ptr);
0040E77B  mov         edx,dword ptr [esi+4]  
0040E77E  bsf         ecx,edx  
      *ptr ^= (1 << id);
0040E781  shl         edi,cl  
0040E783  mov         eax,ecx  
0040E785  xor         edi,edx  
0040E787  mov         dword ptr [esi+4],edi  
      id += 32;
0040E78A  add         eax,20h  
0040E78D  pop         edi  
  }
  return Square(id);
}
0040E78E  pop         ecx  
0040E78F  ret

It is a mess to traverse bitboards via pointer or reference aka memory in small traversal loops. Two distinct high-low loops in 32-bit mode may keep all in registers, but that is a mess with respect to keep source code as architecture independent as possible. Another movegen alternative is to hash up to eight bits of a sliding move target line, knight- and king-target set in one run, almost branchless.

Desperado · Post by **Desperado** » Fri Aug 21, 2009 11:14 pm

Code: Select all


ULONG bsf64(BTB_T bb)
	{
	 UI_32 *const ptr = (UI_32*)&bb;
	 ULONG id=64;

	 _BitScanForward(&id,*(ptr)) ? id : id += 31 + (bool)_BitScanForward(&id,*(ptr+1));

	 return(id);
	}

and its assembly

	push	ecx
	bsf	eax, DWORD PTR _bb$[esp]
	mov	DWORD PTR _id$[esp+4], 64		; 00000040H
	jne	SHORT $LN4@bsf64
	bsf	eax, DWORD PTR _bb$[esp+4]
	mov	ecx, 0
	setne	cl
	mov	edx, eax
	mov	DWORD PTR _id$[esp+4], eax
	lea	eax, DWORD PTR [edx+ecx+31]
$LN4@bsf64:
	pop	ecx
	ret	0

Matts Folding Trick

Code: Select all

	 const int lsb_64_table[64] =
		{
		 63, 30,  3, 32, 59, 14, 11, 33,
		 60, 24, 50,  9, 55, 19, 21, 34,
		 61, 29,  2, 53, 51, 23, 41, 18,
		 56, 28,  1, 43, 46, 27,  0, 35,
		 62, 31, 58,  4,  5, 49, 54,  6,
		 15, 52, 12, 40,  7, 42, 45, 16,
		 25, 57, 48, 13, 10, 39,  8, 44,
		 20, 47, 38, 22, 17, 37, 36, 26
		};

ULONG bsf64_MattTaylor(BTB_T bb)
	{

	 bb ^= (bb- 1);
     unsigned int folded = (int) bb ^ (bb >> 32);
   
	 return lsb_64_table[folded * 0x78291ACF >> 26];
	}

	mov	ecx, DWORD PTR _bb$[esp-4]
	mov	eax, DWORD PTR _bb$[esp]
	push	esi
	mov	esi, eax
	mov	edx, ecx
	add	edx, -1
	adc	esi, -1
	xor	eax, esi
	xor	ecx, edx
	xor	eax, ecx
	imul	eax, 2015959759				; 78291acfH
	shr	eax, 26					; 0000001aH
	mov	eax, DWORD PTR _lsb_64_table[eax*4]
	pop	esi
	ret	0

Hi, just came back and have a look what happend. Without any news
i only want to give you my assembly.Tomorrow i will analyse what s going
on. I am soooooo tired.

til tomorrow...

Desperado · Post by **Desperado** » Sat Aug 22, 2009 10:45 am

Hello Gerd,

Gerd Isenberg wrote: It is a mess to traverse bitboards via pointer or reference aka memory in small traversal loops. Two distinct high-low loops in 32-bit mode may keep all in registers, but that is a mess with respect to keep source code as architecture independent as possible. Another movegen alternative is to hash up to eight bits of a sliding move target line, knight- and king-target set in one run, almost branchless.

1.

i am a little bit suprised or maybe i dont really see the problem.
What is the matter using a pointer with adding an offset ?
I thought the assembler is doing the same memoryadress + offset, so why
shouldnt the programmer do that explicit?

2.

can it be, that if someone only wants to pop1 (64bit value), that
value64 &= -value64 is the most effectiv way, also in 32 bit mode?

best regards

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Aug 22, 2009 11:29 am

Desperado wrote: 1.

i am a little bit suprised or maybe i dont really see the problem.
What is the matter using a pointer with adding an offset ?
I thought the assembler is doing the same memoryadress + offset, so why
shouldnt the programmer do that explicit?

2.

can it be, that if someone only wants to pop1 (64bit value), that
value64 &= -value64 is the most effectiv way, also in 32 bit mode?

best regards

1. The point is to keep the bitboard to serialize inside a register and to don't read it and write it back each time from/to memory. With small loop bodies, i.e. writing generated moves into a move list, additional memory transfer becomes relative expensive. This is not a big deal, anyway.

2. Yes, I would prefer a loop with value64 &= -value64, which might be done in parallel with the bitscan forward (on core2, i7, not K8) to improve ipc.

Code: Select all

  ; rdx bitboard to serialize
  test rdx, rdx
  jz   over
loop:
  lea  rcx, [rdx-1]
  bsf  rax, rdx
  ...
  and  rdx, rcx ; reset ls1b, sets or resets zero-flag
  ... ; mov, stosq etc
  jnz  loop
over:

Cheers,
Gerd

Desperado · Post by **Desperado** » Sat Aug 22, 2009 11:36 am

Ok, now a version _WITHOUT_ any pointer tricks.

Code: Select all


//****************************************************************************
//* FUNCTION	: bsf64a													 *
//* DESCRIPTION	: 32 bit version											 *
//****************************************************************************

unsigned long bsf64a(unsigned __int64 bb)
	{
	 unsigned long id;

	 _BitScanForward(&id,(UI_32)bb) ? id : id+=31+_BitScanForward(&id,(UI_32)(bb>>32));

	 return(id);
	}


...


	push	ecx
	bsf	eax, DWORD PTR _bb$[esp]
	jne	SHORT $LN4@bsf64a
	bsf	eax, DWORD PTR _bb$[esp+4]
	setne	cl
	movzx	edx, cl
	mov	DWORD PTR _id$[esp+4], eax
	lea	eax, DWORD PTR [eax+edx+31]
$LN4@bsf64a:
	pop	ecx
	ret	0

This is doing the same performance as the pointer-version. (without _traverse bitboards via pointer_)

Marco (or anybody else), would you be so nice to give me some benchmarks compared to MattsFT ? i am very interested how intel machine is performing.

Additionally i would suggest to do "postoperation" with value64^=value64&-value64 to pop1.

cheers

edit: thx gerd for reply, was preparing my post at the moment...

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 11:39 am

Gerd Isenberg wrote: It is a mess to traverse bitboards via pointer or reference aka memory in small traversal loops.

I have tried also this version, with less pointers, but it seems no gain on the current one.

Code: Select all

// Use type-punning
union b_union {

    Bitboard b;
    struct {
        uint32_t l;
        uint32_t h;
    } dw;
};

// WARNING: Needs -fno-strict-aliasing compiler option
Square pop_1st_bit(Bitboard* bb) {

  unsigned long id;
  b_union u;

  u.b = *bb;

  if (u.dw.l)
  {
      _BitScanForward(&id, u.dw.l);
      *((uint32_t*)bb) ^= (1 << id);
  }
  else
  {
      _BitScanForward(&id, u.dw.h);
      *((uint32_t*)bb+1) ^= (1 << id); // Little endian only?
      id += 32;
  }
  return Square(id);
}

I have an Intel Core 2 Duo so bsf should be fast enough.

Below is the assembly of the above:

Code: Select all

0041E510  push        ecx  
0041E511  mov         eax,dword ptr [edx] 
0041E513  mov         ecx,dword ptr [edx+4] 
0041E516  push        esi  
0041E517  mov         esi,1 
0041E51C  test        eax,eax 
0041E51E  je          pop_1st_bit+20h (41E530h) 
0041E520  bsf         eax,eax 
0041E523  mov         ecx,eax 
0041E525  shl         esi,cl 
0041E527  mov         dword ptr [esp+4],eax 
0041E52B  xor         dword ptr [edx],esi 
0041E52D  pop         esi  
0041E52E  pop         ecx  
0041E52F  ret              
0041E530  bsf         ecx,ecx 
0041E533  shl         esi,cl 
0041E535  mov         eax,ecx 
0041E537  mov         dword ptr [esp+4],ecx 
0041E53B  xor         dword ptr [edx+4],esi 
0041E53E  add         eax,20h 
0041E541  pop         esi  
0041E542  pop         ecx  
0041E543  ret

mcostalba · Post by **mcostalba** » Sat Aug 22, 2009 11:44 am

Desperado wrote:Ok, now a version _WITHOUT_ any pointer tricks.
Code: Select all
//****************************************************************************
//* FUNCTION	: bsf64a													 *
//* DESCRIPTION	: 32 bit version											 *
//****************************************************************************

unsigned long bsf64a(unsigned __int64 bb)
	{
	 unsigned long id;

	 _BitScanForward(&id,(UI_32)bb) ? id : id+=31+_BitScanForward(&id,(UI_32)(bb>>32));

	 return(id);
	}
This is doing the same performance as the pointer-version. (without _traverse bitboards via pointer_)

Marco (or anybody else), would you be so nice to give me some benchmarks compared to MattsFT ? i am very interested how intel machine is performing.

Additionally i would suggest to do "postoperation" with value64^=value64&-value64 to pop1.

cheers

edit: thx gerd for reply, was preparing my post at the moment...

But does this version resets the bit ?

I think the two things are different. IMHO you cannot take a fast bitscan only routine insert in an already exsisting pop_1s_bit() and assume you will speed up also the second one.

IMHO to get a fast pop_1s_bit() it is better to _forget_ compeletly about bitscan alone and write from scratch as is the bitscan does not exsist.

I will be very happy to test a pop_1s_bit() that is what is used in chess engines but a bit scan alone IMHO just blurs the problem.

Desperado · Post by **Desperado** » Sat Aug 22, 2009 12:25 pm

ok, here it is.

but i first like to mention that my style seems to be different, because
i most often pop1 in the following way.

Code: Select all

//some code...

while(bitboard)
	{
	 id = bitscan //scan instruction

	 //code

	 bb^=bb&-bb; //pop instruction
	}

//some code...

I think if you only have one operation (without a loop, so to say),
there is no difference, beside i can use the unmodified or the modified bitboard in the _code between_. (There is no additional information gain popping immediatelly,or is there an additional profit ?)

I mean, i can _always_ put the pop-instruction behind (as next instruction)
to the scan-instruction, with the same result.

i would be happy if you can spend some words on that point.

either way, here is the SCAN_AND_POP code.

Code: Select all

unsigned long bsf64_and_pop1(unsigned __int64 &bb)
	{
	 unsigned long id;

	 _BitScanForward(&id,(UI_32)bb) ? id : id+=31+_BitScanForward(&id,(UI_32)(bb>>32));
	 
	 bb^=bb&-bb;

	 return(id);
	}

Desperado · Post by **Desperado** » Sat Aug 22, 2009 12:55 pm

Edit to my last post:

So, i am very interest in a comparison between MattFT an bitscan_only
version, because as i descriped above,i do _not_ always pop when i scan!

when i pop1 i _do not_ need _any_ scan before!

(beside i would pop1 with bb^= 1<<id for some reason).

And if it is necessary to scan_and_pop, i just put my instructions together (in a row).

Desperado · Post by **Desperado** » Sat Aug 22, 2009 1:29 pm

edit on edit:

Sorry Marco, the longer i think about what i described i disagree with your opinion. At least for me, and also for practise, it is _absolut relevant_
what the bscan alone is doing.
pop and scan are independent components, which can be simply put together if needed.
if there would be a way, where the combination of both become better(faster,different,profitable in some way) than their basic components itself,
ok then this combination becomes also an _autonomous component_.
(i dont think a scan_and_pop function meets this argument, so seperating
the pop and scan doesnt matter)

with most respect i say this, because i think i am not so experienced.

So if someone else has an opinion on that point, let me know or correct me. I am here to learn sth.

Thx

32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64

Re: 32 bit versions for bitscan64