It is a mess to traverse bitboards via pointer or reference aka memory in small traversal loops. Two distinct high-low loops in 32-bit mode may keep all in registers, but that is a mess with respect to keep source code as architecture independent as possible. Another movegen alternative is to hash up to eight bits of a sliding move target line, knight- and king-target set in one run, almost branchless.mcostalba wrote:I have tested the below version on my Intel Core 2 Duo but with no results, perhaps a very very slightly increase but well below 1%
Code: Select all
Square pop_1st_bit(Bitboard* bb) { uint32_t* ptr = (uint32_t*)bb; unsigned long id; if (*ptr) { _BitScanForward(&id, *ptr); *ptr ^= (1 << id); } else { ptr += 1; _BitScanForward(&id, *ptr); *ptr ^= (1 << id); id += 32; } return Square(id); }
This is the disassembled code (MSVC):Code: Select all
Square pop_1st_bit(Bitboard* bb) { 0040E760 push ecx uint32_t* ptr = (uint32_t*)bb; unsigned long id; if (*ptr) 0040E761 mov edx,dword ptr [esi] 0040E763 push edi { _BitScanForward(&id, *ptr); *ptr ^= (1 << id); 0040E764 mov edi,1 0040E769 test edx,edx 0040E76B je pop_1st_bit+1Bh (40E77Bh) 0040E76D bsf eax,edx 0040E770 mov ecx,eax 0040E772 shl edi,cl 0040E774 xor edi,edx 0040E776 mov dword ptr [esi],edi 0040E778 pop edi } return Square(id); } 0040E779 pop ecx 0040E77A ret } else { ptr += 1; _BitScanForward(&id, *ptr); 0040E77B mov edx,dword ptr [esi+4] 0040E77E bsf ecx,edx *ptr ^= (1 << id); 0040E781 shl edi,cl 0040E783 mov eax,ecx 0040E785 xor edi,edx 0040E787 mov dword ptr [esi+4],edi id += 32; 0040E78A add eax,20h 0040E78D pop edi } return Square(id); } 0040E78E pop ecx 0040E78F ret
32 bit versions for bitscan64
Moderators: hgm, Rebel, chrisw
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: 32 bit versions for bitscan64
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: 32 bit versions for bitscan64
Code: Select all
ULONG bsf64(BTB_T bb)
{
UI_32 *const ptr = (UI_32*)&bb;
ULONG id=64;
_BitScanForward(&id,*(ptr)) ? id : id += 31 + (bool)_BitScanForward(&id,*(ptr+1));
return(id);
}
and its assembly
push ecx
bsf eax, DWORD PTR _bb$[esp]
mov DWORD PTR _id$[esp+4], 64 ; 00000040H
jne SHORT $LN4@bsf64
bsf eax, DWORD PTR _bb$[esp+4]
mov ecx, 0
setne cl
mov edx, eax
mov DWORD PTR _id$[esp+4], eax
lea eax, DWORD PTR [edx+ecx+31]
$LN4@bsf64:
pop ecx
ret 0
Code: Select all
const int lsb_64_table[64] =
{
63, 30, 3, 32, 59, 14, 11, 33,
60, 24, 50, 9, 55, 19, 21, 34,
61, 29, 2, 53, 51, 23, 41, 18,
56, 28, 1, 43, 46, 27, 0, 35,
62, 31, 58, 4, 5, 49, 54, 6,
15, 52, 12, 40, 7, 42, 45, 16,
25, 57, 48, 13, 10, 39, 8, 44,
20, 47, 38, 22, 17, 37, 36, 26
};
ULONG bsf64_MattTaylor(BTB_T bb)
{
bb ^= (bb- 1);
unsigned int folded = (int) bb ^ (bb >> 32);
return lsb_64_table[folded * 0x78291ACF >> 26];
}
mov ecx, DWORD PTR _bb$[esp-4]
mov eax, DWORD PTR _bb$[esp]
push esi
mov esi, eax
mov edx, ecx
add edx, -1
adc esi, -1
xor eax, esi
xor ecx, edx
xor eax, ecx
imul eax, 2015959759 ; 78291acfH
shr eax, 26 ; 0000001aH
mov eax, DWORD PTR _lsb_64_table[eax*4]
pop esi
ret 0
i only want to give you my assembly.Tomorrow i will analyse what s going
on. I am soooooo tired.
til tomorrow...
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: 32 bit versions for bitscan64
Hello Gerd,
i am a little bit suprised or maybe i dont really see the problem.
What is the matter using a pointer with adding an offset ?
I thought the assembler is doing the same memoryadress + offset, so why
shouldnt the programmer do that explicit?
2.
can it be, that if someone only wants to pop1 (64bit value), that
value64 &= -value64 is the most effectiv way, also in 32 bit mode?
best regards
1.Gerd Isenberg wrote: It is a mess to traverse bitboards via pointer or reference aka memory in small traversal loops. Two distinct high-low loops in 32-bit mode may keep all in registers, but that is a mess with respect to keep source code as architecture independent as possible. Another movegen alternative is to hash up to eight bits of a sliding move target line, knight- and king-target set in one run, almost branchless.
i am a little bit suprised or maybe i dont really see the problem.
What is the matter using a pointer with adding an offset ?
I thought the assembler is doing the same memoryadress + offset, so why
shouldnt the programmer do that explicit?
2.
can it be, that if someone only wants to pop1 (64bit value), that
value64 &= -value64 is the most effectiv way, also in 32 bit mode?
best regards
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: 32 bit versions for bitscan64
1. The point is to keep the bitboard to serialize inside a register and to don't read it and write it back each time from/to memory. With small loop bodies, i.e. writing generated moves into a move list, additional memory transfer becomes relative expensive. This is not a big deal, anyway.Desperado wrote: 1.
i am a little bit suprised or maybe i dont really see the problem.
What is the matter using a pointer with adding an offset ?
I thought the assembler is doing the same memoryadress + offset, so why
shouldnt the programmer do that explicit?
2.
can it be, that if someone only wants to pop1 (64bit value), that
value64 &= -value64 is the most effectiv way, also in 32 bit mode?
best regards
2. Yes, I would prefer a loop with value64 &= -value64, which might be done in parallel with the bitscan forward (on core2, i7, not K8) to improve ipc.
Code: Select all
; rdx bitboard to serialize
test rdx, rdx
jz over
loop:
lea rcx, [rdx-1]
bsf rax, rdx
...
and rdx, rcx ; reset ls1b, sets or resets zero-flag
... ; mov, stosq etc
jnz loop
over:
Gerd
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: 32 bit versions for bitscan64
Ok, now a version _WITHOUT_ any pointer tricks.
This is doing the same performance as the pointer-version. (without _traverse bitboards via pointer_)
Marco (or anybody else), would you be so nice to give me some benchmarks compared to MattsFT ? i am very interested how intel machine is performing.
Additionally i would suggest to do "postoperation" with value64^=value64&-value64 to pop1.
cheers
edit: thx gerd for reply, was preparing my post at the moment...
Code: Select all
//****************************************************************************
//* FUNCTION : bsf64a *
//* DESCRIPTION : 32 bit version *
//****************************************************************************
unsigned long bsf64a(unsigned __int64 bb)
{
unsigned long id;
_BitScanForward(&id,(UI_32)bb) ? id : id+=31+_BitScanForward(&id,(UI_32)(bb>>32));
return(id);
}
...
push ecx
bsf eax, DWORD PTR _bb$[esp]
jne SHORT $LN4@bsf64a
bsf eax, DWORD PTR _bb$[esp+4]
setne cl
movzx edx, cl
mov DWORD PTR _id$[esp+4], eax
lea eax, DWORD PTR [eax+edx+31]
$LN4@bsf64a:
pop ecx
ret 0
Marco (or anybody else), would you be so nice to give me some benchmarks compared to MattsFT ? i am very interested how intel machine is performing.
Additionally i would suggest to do "postoperation" with value64^=value64&-value64 to pop1.
cheers
edit: thx gerd for reply, was preparing my post at the moment...
Last edited by Desperado on Sat Aug 22, 2009 11:41 am, edited 1 time in total.
-
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: 32 bit versions for bitscan64
I have tried also this version, with less pointers, but it seems no gain on the current one.Gerd Isenberg wrote: It is a mess to traverse bitboards via pointer or reference aka memory in small traversal loops.
Code: Select all
// Use type-punning
union b_union {
Bitboard b;
struct {
uint32_t l;
uint32_t h;
} dw;
};
// WARNING: Needs -fno-strict-aliasing compiler option
Square pop_1st_bit(Bitboard* bb) {
unsigned long id;
b_union u;
u.b = *bb;
if (u.dw.l)
{
_BitScanForward(&id, u.dw.l);
*((uint32_t*)bb) ^= (1 << id);
}
else
{
_BitScanForward(&id, u.dw.h);
*((uint32_t*)bb+1) ^= (1 << id); // Little endian only?
id += 32;
}
return Square(id);
}
I have an Intel Core 2 Duo so bsf should be fast enough.
Below is the assembly of the above:
Code: Select all
0041E510 push ecx
0041E511 mov eax,dword ptr [edx]
0041E513 mov ecx,dword ptr [edx+4]
0041E516 push esi
0041E517 mov esi,1
0041E51C test eax,eax
0041E51E je pop_1st_bit+20h (41E530h)
0041E520 bsf eax,eax
0041E523 mov ecx,eax
0041E525 shl esi,cl
0041E527 mov dword ptr [esp+4],eax
0041E52B xor dword ptr [edx],esi
0041E52D pop esi
0041E52E pop ecx
0041E52F ret
0041E530 bsf ecx,ecx
0041E533 shl esi,cl
0041E535 mov eax,ecx
0041E537 mov dword ptr [esp+4],ecx
0041E53B xor dword ptr [edx+4],esi
0041E53E add eax,20h
0041E541 pop esi
0041E542 pop ecx
0041E543 ret
-
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: 32 bit versions for bitscan64
But does this version resets the bit ?Desperado wrote:Ok, now a version _WITHOUT_ any pointer tricks.
This is doing the same performance as the pointer-version. (without _traverse bitboards via pointer_)Code: Select all
//**************************************************************************** //* FUNCTION : bsf64a * //* DESCRIPTION : 32 bit version * //**************************************************************************** unsigned long bsf64a(unsigned __int64 bb) { unsigned long id; _BitScanForward(&id,(UI_32)bb) ? id : id+=31+_BitScanForward(&id,(UI_32)(bb>>32)); return(id); }
Marco (or anybody else), would you be so nice to give me some benchmarks compared to MattsFT ? i am very interested how intel machine is performing.
Additionally i would suggest to do "postoperation" with value64^=value64&-value64 to pop1.
cheers
edit: thx gerd for reply, was preparing my post at the moment...
I think the two things are different. IMHO you cannot take a fast bitscan only routine insert in an already exsisting pop_1s_bit() and assume you will speed up also the second one.
IMHO to get a fast pop_1s_bit() it is better to _forget_ compeletly about bitscan alone and write from scratch as is the bitscan does not exsist.
I will be very happy to test a pop_1s_bit() that is what is used in chess engines but a bit scan alone IMHO just blurs the problem.
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: 32 bit versions for bitscan64
ok, here it is.
but i first like to mention that my style seems to be different, because
i most often pop1 in the following way.
I think if you only have one operation (without a loop, so to say),
there is no difference, beside i can use the unmodified or the modified bitboard in the _code between_. (There is no additional information gain popping immediatelly,or is there an additional profit ?)
I mean, i can _always_ put the pop-instruction behind (as next instruction)
to the scan-instruction, with the same result.
i would be happy if you can spend some words on that point.
either way, here is the SCAN_AND_POP code.
but i first like to mention that my style seems to be different, because
i most often pop1 in the following way.
Code: Select all
//some code...
while(bitboard)
{
id = bitscan //scan instruction
//code
bb^=bb&-bb; //pop instruction
}
//some code...
there is no difference, beside i can use the unmodified or the modified bitboard in the _code between_. (There is no additional information gain popping immediatelly,or is there an additional profit ?)
I mean, i can _always_ put the pop-instruction behind (as next instruction)
to the scan-instruction, with the same result.
i would be happy if you can spend some words on that point.
either way, here is the SCAN_AND_POP code.
Code: Select all
unsigned long bsf64_and_pop1(unsigned __int64 &bb)
{
unsigned long id;
_BitScanForward(&id,(UI_32)bb) ? id : id+=31+_BitScanForward(&id,(UI_32)(bb>>32));
bb^=bb&-bb;
return(id);
}
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: 32 bit versions for bitscan64
Edit to my last post:
So, i am very interest in a comparison between MattFT an bitscan_only
version, because as i descriped above,i do _not_ always pop when i scan!
when i pop1 i _do not_ need _any_ scan before!
(beside i would pop1 with bb^= 1<<id for some reason).
And if it is necessary to scan_and_pop, i just put my instructions together (in a row).
So, i am very interest in a comparison between MattFT an bitscan_only
version, because as i descriped above,i do _not_ always pop when i scan!
when i pop1 i _do not_ need _any_ scan before!
(beside i would pop1 with bb^= 1<<id for some reason).
And if it is necessary to scan_and_pop, i just put my instructions together (in a row).
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: 32 bit versions for bitscan64
edit on edit:
Sorry Marco, the longer i think about what i described i disagree with your opinion. At least for me, and also for practise, it is _absolut relevant_
what the bscan alone is doing.
pop and scan are independent components, which can be simply put together if needed.
if there would be a way, where the combination of both become better(faster,different,profitable in some way) than their basic components itself,
ok then this combination becomes also an _autonomous component_.
(i dont think a scan_and_pop function meets this argument, so seperating
the pop and scan doesnt matter)
with most respect i say this, because i think i am not so experienced.
So if someone else has an opinion on that point, let me know or correct me. I am here to learn sth.
Thx
Sorry Marco, the longer i think about what i described i disagree with your opinion. At least for me, and also for practise, it is _absolut relevant_
what the bscan alone is doing.
pop and scan are independent components, which can be simply put together if needed.
if there would be a way, where the combination of both become better(faster,different,profitable in some way) than their basic components itself,
ok then this combination becomes also an _autonomous component_.
(i dont think a scan_and_pop function meets this argument, so seperating
the pop and scan doesnt matter)
with most respect i say this, because i think i am not so experienced.
So if someone else has an opinion on that point, let me know or correct me. I am here to learn sth.
Thx