P.S.: I am not very fond of assembler, because MSVC and gcc use different syntax, so you end up with different versions for different compilers. If you do that, there should be a very good reason, read at least a 5% increase in speed; but I bet with you that the above, with all the added ancillary stuff (the bb != 0 test and the bit reset), is not 5% faster than mine. I think it is not faster at all, because it needs two bsf instructions against my one multiply plus one branch.
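One way to avoid maintaining two assembler dialects is to hide the difference behind the compilers' own intrinsics. A minimal sketch of that idea (the name lsb_portable is made up for this example; a nonzero bb is assumed, since both bsf and __builtin_ctzll are undefined for zero input):

```c
#include <assert.h>

#if defined(_MSC_VER)
#  include <intrin.h>
/* MSVC path: two 32-bit _BitScanForward calls cover a 64-bit word. */
static int lsb_portable(unsigned long long bb)   /* bb != 0 */
{
    unsigned long idx;
    if (_BitScanForward(&idx, (unsigned long)bb))
        return (int)idx;                         /* bit found in low dword */
    _BitScanForward(&idx, (unsigned long)(bb >> 32));
    return (int)idx + 32;
}
#elif defined(__GNUC__)
/* gcc path: the builtin compiles to a single bsf/tzcnt where available. */
static int lsb_portable(unsigned long long bb)   /* bb != 0 */
{
    return __builtin_ctzll(bb);
}
#endif
```

The wrapper costs nothing extra over the hand-written versions but keeps a single source file compiling on both compilers.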
mcostalba wrote:
What I am thinking is that the branch is not so bad in the end.
If you use the bitscan in a tight loop to serialize a piece's moves, and you pass in a bitboard with all the attack squares of a given piece, then there is a good probability that if the first bit is in, say, the low word, the next one will be too, because the attack squares of sliding pieces, pawns and kings are normally close together.
70%
In 70% of cases, if the previous bit was in the low (or high) half of the 64-bit word, the next one is in the same half.
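That locality figure is exactly what a branching 32-bit scan exploits: test the low dword first, and the branch predicts well when consecutive bits fall in the same half. A sketch using the well-known De Bruijn lookup (constants from the standard 0x077CB531 sequence; a nonzero bb is assumed):

```c
#include <assert.h>

/* De Bruijn lookup for a 32-bit least-significant-bit scan
   (standard table for the multiplier 0x077CB531). */
static const int lsb_index32[32] = {
     0,  1, 28,  2, 29, 14, 24,  3,
    30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7,
    26, 12, 18,  6, 11,  5, 10,  9
};

static int lsb32(unsigned int b)                 /* b != 0 */
{
    /* b & -b isolates the lowest set bit; the multiply+shift hashes
       the 32 possible one-bit values to distinct table slots. */
    return lsb_index32[((b & (0u - b)) * 0x077CB531u) >> 27];
}

/* Branching 64-bit scan: try the low dword first.  With the ~70%
   same-half locality above, the branch is usually predicted. */
static int lsb64(unsigned long long bb)          /* bb != 0 */
{
    unsigned int lo = (unsigned int)bb;
    if (lo)
        return lsb32(lo);
    return 32 + lsb32((unsigned int)(bb >> 32));
}
```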
I just swapped the contents of the arrays and used bb >> 32 as the condition.
Anyway, these two functions are faster than my initial proposals in this thread, are easy to handle, and both look 32-bit friendly to me.
Maybe Zach's (I like it) is faster, maybe not; maybe it is machine dependent?
Additionally, both (forward and reverse) work the same way, which makes me happy.
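Since the swapped arrays themselves are not shown here, this is only a sketch of the shape described: a reverse (most-significant-bit) scan whose `bb >> 32` test mirrors the forward version, here built on the standard 0x07C4ACDD De Bruijn log2 trick rather than the original tables:

```c
#include <assert.h>

/* De Bruijn log2 table for a 32-bit most-significant-bit scan
   (standard table for the multiplier 0x07C4ACDD). */
static const int msb_index32[32] = {
     0,  9,  1, 10, 13, 21,  2, 29,
    11, 14, 16, 18, 22, 25,  3, 30,
     8, 12, 20, 28, 15, 17, 24,  7,
    19, 27, 23,  6, 26,  5,  4, 31
};

static int msb32(unsigned int v)                 /* v != 0 */
{
    v |= v >> 1;  v |= v >> 2;  v |= v >> 4;     /* smear the top bit down */
    v |= v >> 8;  v |= v >> 16;                  /* v is now 2^(k+1) - 1   */
    return msb_index32[(v * 0x07C4ACDDu) >> 27];
}

/* Reverse scan with the same shape as the forward one:
   the "bb >> 32" test picks which half to scan. */
static int msb64(unsigned long long bb)          /* bb != 0 */
{
    unsigned int hi = (unsigned int)(bb >> 32);
    if (hi)
        return 32 + msb32(hi);
    return msb32((unsigned int)bb);
}
```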
Measured outside of a complex piece of software (an engine):

intrinsic versions...
1. first proposals (with branching)
2. Zach's
3. last proposals
4. ...
What I was searching for is a
- competitive
- 32-bit friendly
- working in both directions (forward, reverse)
solution.
At least for me, my first proposals (after elimination of the _pointer traversals_) meet these points.
(Of course some other solutions do this too.)
So I found what I was searching for.
And it was nice to see some of the little tricks some of you are using!
A special thanks to you, Marco, for spending time on testing!
Michael Sherwin wrote:
My bitboards can be empty so I need a 64 returned--or I would test it now. So, when I have some time. Very nice concept though!
That's fairly simple to do--we just need to add one instruction, assuming bsf on a zero argument doesn't change the destination. Set eax to 32 before the bsf, and add 32 afterwards when the lower half is zero; for an empty argument the two together give 64.
Also, note that this is 100% untested--I'm only pretty sure it works.
__inline s32 FirstBit(u64 bits)
{
    __asm
    {
        mov   ebx, dword ptr bits     ; ebx = low dword
        mov   edx, 32
        mov   eax, edx                ; eax = 32, survives a bsf of zero
        xor   ecx, ecx                ; clear ecx so setne cl is enough
        test  ebx, ebx
        setne cl                      ; cl = 1 if low dword is nonzero
        dec   ecx                     ; ecx = 0 if low set, -1 if low empty
        and   edx, ecx                ; edx = 0 or 32 (offset to add later)
        and   ecx, dword ptr bits + 4 ; ecx = 0 or high dword
        or    ecx, ebx                ; ecx = low dword, else high dword
        bsf   eax, ecx                ; assumes eax unchanged when ecx = 0
        add   eax, edx                ; empty board: 32 + 32 = 64
    }
}
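The same select-and-offset trick can be written in plain C, which makes the data flow easier to follow (a sketch only; ctz_nonzero is a made-up portable stand-in for bsf):

```c
#include <assert.h>

/* Stand-in for bsf: index of the lowest set bit, caller must
   guarantee the argument is nonzero. */
static int ctz_nonzero(unsigned int b)           /* b != 0 */
{
    int n = 0;
    while (!(b & 1u)) { b >>= 1; ++n; }
    return n;
}

/* C rendering of the mask trick above: build an all-ones mask when
   the low dword is empty, use it to both select the high dword and
   add the +32 offset, and return 64 for an empty board. */
static int FirstBitC(unsigned long long bits)
{
    unsigned int lo   = (unsigned int)bits;
    unsigned int hi   = (unsigned int)(bits >> 32);
    unsigned int mask = (lo != 0) - 1u;          /* 0 if lo set, ~0 if empty */
    unsigned int word = lo | (hi & mask);        /* scan lo, else hi */
    int          base = (int)(32u & mask);       /* +32 when scanning hi */
    if (word == 0)
        return 64;                               /* empty board */
    return base + ctz_nonzero(word);
}
```

Unlike the asm, this version needs no assumption about what bsf leaves in its destination, because the zero case is tested explicitly.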
Well, in MSVC6 it only runs for a debug compile, so I ran my version also in debug. Maybe it is not a fair comparison--I don't know.
My newest (one branch)        -- 870,549 nps
Your no-branch, one-scan      -- 861,234 nps
I will try to find out why it locks up for a release compile.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Terrible idea. When a company (Intel) says "operation is undefined" it is foolish to try it out, and then _depend_ on the specific behaviour you notice. You can find a perfectly safe PopCnt() in Crafty that returns 64 when no bits are set, and which does not depend on undefined behaviour that can change in the next processor generation without anyone knowing.
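The safe shape being described--test for an empty argument explicitly and return 64, instead of relying on whatever bsf happens to leave behind--can be sketched in C like this (not Crafty's actual code; safe_ctz32 is a made-up portable stand-in for bsf):

```c
#include <assert.h>

/* Stand-in for bsf on a nonzero 32-bit word. */
static int safe_ctz32(unsigned int b)            /* b != 0 */
{
    int n = 0;
    while (!(b & 1u)) { b >>= 1; ++n; }
    return n;
}

/* Every path either scans a word proven nonzero or returns the
   defined sentinel 64, so no undefined bsf behaviour is involved. */
static int SafeFirstBit(unsigned long long bits)
{
    unsigned int lo = (unsigned int)bits;
    unsigned int hi = (unsigned int)(bits >> 32);
    if (lo)
        return safe_ctz32(lo);
    if (hi)
        return 32 + safe_ctz32(hi);
    return 64;                                   /* empty board, by design */
}
```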
Here is a 32 bit version that does not depend on undefined behaviour and which only has two extra instructions.