32 bit versions for bitscan64
Moderators: hgm, Rebel, chrisw
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: 32 bit versions for bitscan64
Ok.
Well, as you pointed out it does not perform that much worse,
and we should not forget that it also works in the reverse direction (bsr).
Thanks a lot for your trials and time investment.
cheers
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: 32 bit versions for bitscan64
Just to have it complete: the bsr function would now look like this.
Code: Select all
//****************************************************************************
//* PROCEDURE : bsr64 *
//* DESCRIPTION : 32 bit version *
//****************************************************************************
ULONG bsr64(BTB_T bb)
{
ULONG id;
_BitScanReverse(&id,(UI_32)(bb>>32)) ?
id+=32 :
_BitScanReverse(&id,(UI_32)bb);
return(id);
}
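For reference, the same high/low split can be sketched in portable C without the MSVC intrinsic. `bsr32_soft` here is just an illustrative stand-in for `_BitScanReverse` (not engine code), and like the intrinsic version it assumes bb is nonzero:

```c
#include <stdint.h>

/* Plain-C stand-in for _BitScanReverse: writes the index of the highest
   set bit to *idx and returns nonzero iff v != 0. */
static int bsr32_soft(uint32_t v, unsigned *idx)
{
    if (v == 0) return 0;
    unsigned i = 31;
    while (!(v & 0x80000000u)) { v <<= 1; --i; }
    *idx = i;
    return 1;
}

/* Same structure as the bsr64 above: try the high dword first,
   add 32 on success, otherwise scan the low dword. Assumes bb != 0. */
unsigned bsr64_soft(uint64_t bb)
{
    unsigned id = 0;
    if (bsr32_soft((uint32_t)(bb >> 32), &id))
        id += 32;
    else
        bsr32_soft((uint32_t)bb, &id);
    return id;
}
```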
-
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
Re: 32 bit versions for bitscan64
Desperado wrote:
Just to have it complete: the bsr function would now look like this.
[bsr64 code snipped]

Just curious! Would it not be faster to make id global to avoid the return value?
(In MP versions (it would work in the SP version also), a pointer to a structure holding both bb and id could be passed to bsr64?)
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
-
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: 32 bit versions for bitscan64
I think the function is inlined anyway, at least for PGO compiles, and accessing a local variable is way faster than a global one.
Besides, I really don't like this kind of ugly, old-fashioned trick.
Anyhow, I don't think it is faster, but of course you can always _test it_ (this is the evil word, I know) and come back with some numbers...
Thanks
Marco
-
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
Re: 32 bit versions for bitscan64
mcostalba wrote:
I think the function is inlined anyway, at least for PGO compiles, and accessing a local variable is way faster than a global one. Besides, I really don't like this kind of ugly, old-fashioned trick. Anyhow, I don't think it is faster, but of course you can always _test it_ (this is the evil word, I know) and come back with some numbers...

I was just (slightly) curious. Not enough to do any testing of my own!
Here is what I do in RomiChess. And I am perfectly happy with it--given the limitations of the 64bit compiler.
asm.c:
Code: Select all
__inline s32 FirstBit(u64 bits)
{
__asm
{
mov edx, 0
bsf eax, dword ptr bits
jnz done
mov edx, 32
bsf eax, dword ptr bits+4
jnz done
mov eax, 32
done: add eax, edx
}
}
__inline s32 LastBit(u64 bits)
{
__asm
{
mov edx, 32
bsr eax, dword ptr bits+4
jnz done
mov edx, 0
bsr eax, dword ptr bits
jnz done
mov eax, 64
done: add eax, edx
}
}
x64.c:
Code: Select all
__inline s32 FirstBit(u64 bits) {
s32 index;
if(_BitScanForward64(&index, bits))
return index;
return 64;
}
__inline s32 LastBit(u64 bits) {
s32 index;
if(_BitScanReverse64(&index, bits))
return index;
return 64;
}
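The sentinel convention above (return 64 when no bit is set) can be sketched in portable C for readers without the MSVC 64-bit intrinsics. This uses `int32_t`/`uint64_t` in place of the engine's `s32`/`u64` typedefs, and a plain loop in place of the intrinsics; the names are illustrative:

```c
#include <stdint.h>

/* Index of the lowest set bit, or 64 if bits == 0. */
static int32_t FirstBit_soft(uint64_t bits)
{
    for (int32_t i = 0; i < 64; ++i)
        if (bits & (1ULL << i)) return i;
    return 64;
}

/* Index of the highest set bit, or 64 if bits == 0. */
static int32_t LastBit_soft(uint64_t bits)
{
    for (int32_t i = 63; i >= 0; --i)
        if (bits & (1ULL << i)) return i;
    return 64;
}
```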
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: 32 bit versions for bitscan64
Michael Sherwin wrote:
Just curious! Would it not be faster to make id global to avoid the return value?
[bsr64 code snipped]
(In MP versions (it would work in the SP version also), a pointer to a structure holding both bb and id could be passed to bsr64?)

Hi Michael.
I think hiding the implementation is a good choice, even at the cost of some speed.
The following is a snapshot of my engine-implementation.
Code: Select all
//****************************************************************************
//* BITSCAN CHOICE *
//* *
//* 1: SOFTSCANS, needed in debug mode; may work much faster than *
//* the intrinsics, depending on machine. *
//* a: further choice between MattsFoldingTrick and double conversion, *
//* where dc works significantly faster on my amd-k8 *
//* b: reverse scan only with double conversion *
//* *
//* 2: HARDSCANS, using intrinsics. Special routines for 32 bit compile. *
//* ONLY AVAILABLE IN RELEASE MODE! *
//* The 32 bit intrinsic versions may only perform equal or slightly *
//* better than SOFTSCANS, but they work in both directions. On *
//* fast-bitscan machines I expect the 32 bit intrinsic scans to work *
//* faster than the double conversions. *
//****************************************************************************
#if _DEBUG | _SOFTSCAN
//****************************************************************************
//* FUNCTION : bsf64 *
//* DESCRIPTION : 64 bit version (Matts Folding Trick) *
//****************************************************************************
#if _MATTSCAN
ULONG bsf64(BTB_T bb)
{
UI_32 fld;
ASSERT(bb!=0,"bsf error");
bb ^= bb-1;
fld =(UI_32) bb ^ (bb>>32);
return(index[fld*0x78291ACF >> 26]);
}
#else
//****************************************************************************
//* FUNCTION : bsf64 *
//* DESCRIPTION : 64 bit version (double conversion) *
//* works best for amd-k8 model *
//****************************************************************************
ULONG bsf64(BTB_T bb)
{
union {
double d;
struct
{
unsigned int mantissal : 32;
unsigned int mantissah : 20;
unsigned int exponent : 11;
unsigned int sign : 1;
};
} ud;
ud.d = (double)(bb & -bb); // isolated LS1B to double
return ud.exponent - 1023;
}
#endif //MATTSCAN
//****************************************************************************
//* FUNCTION : bsr64 *
//* DESCRIPTION : 64 bit version (double conversion) *
//****************************************************************************
ULONG bsr64(UI_64 bb)
{
union {
double d;
struct
{
unsigned int mantissal : 32;
unsigned int mantissah : 20;
unsigned int exponent : 11;
unsigned int sign : 1;
};
} ud;
ud.d = (double)(bb & ~(bb >> 32)); // avoid rounding error
return (ud.exponent - 1023);
}
#elif _COMPILE64
//****************************************************************************
//* FUNCTION : bsf64 *
//* DESCRIPTION : 64 bit version intrinsic *
//****************************************************************************
ULONG bsf64(BTB_T bb)
{ULONG id=64; _BitScanForward64(&id,bb);return(id);}
//****************************************************************************
//* FUNCTION : bsr64 *
//* DESCRIPTION : 64 bit version intrinsic *
//****************************************************************************
ULONG bsr64(BTB_T bb)
{ULONG id=64;_BitScanReverse64(&id,bb);return(id);}
#elif _COMPILE32
//****************************************************************************
//* FUNCTION : bsf64 *
//* DESCRIPTION : 32 bit version intrinsic *
//****************************************************************************
ULONG bsf64(UI_64 bb)
{
ULONG id;
if(!_BitScanForward(&id,(UI_32)bb)) /* low dword empty */
{
_BitScanForward(&id,(UI_32)(bb>>32)); /* scan high dword */
id += 32;
}
return(id);
}
//****************************************************************************
//* FUNCTION : bsr64 *
//* DESCRIPTION : 32 bit version intrinsic *
//****************************************************************************
ULONG bsr64(BTB_T bb)
{
ULONG id;
_BitScanReverse(&id,(UI_32)(bb>>32)) ?
id+=32 :
_BitScanReverse(&id,(UI_32)bb);
return(id);
}
#endif
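As a cross-check of the folding branch above: the 64-entry table it indexes (called `index` in the snippet) can be generated from the same multiply/shift scheme. This is a sketch assuming the 0x78291ACF multiplier and >>26 shift shown above; `idx_tbl` and `build_fold_table` are illustrative names, not from the engine:

```c
#include <stdint.h>

/* Table playing the role of `index` in the folding scan above. */
static unsigned char idx_tbl[64];

/* Fill the table: for each square, run the fold and record which slot
   that square's LS1B pattern hashes to. */
static void build_fold_table(void)
{
    for (int sq = 0; sq < 64; ++sq) {
        uint64_t bb = 1ULL << sq;
        bb ^= bb - 1;                                 /* LS1B and everything below */
        uint32_t fld = (uint32_t)(bb ^ (bb >> 32));   /* fold 64 bits into 32 */
        idx_tbl[(uint32_t)(fld * 0x78291ACFu) >> 26] = (unsigned char)sq;
    }
}

/* Same lookup as the bsf64 folding version; bb must be nonzero. */
static unsigned bsf64_fold(uint64_t bb)
{
    bb ^= bb - 1;
    uint32_t fld = (uint32_t)(bb ^ (bb >> 32));
    return idx_tbl[(uint32_t)(fld * 0x78291ACFu) >> 26];
}
```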
regards
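The double-conversion scan can also be written without the bit-field union, by reinterpreting the IEEE-754 bits with memcpy. A sketch, assuming IEEE-754 doubles and a nonzero argument (the function name is illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Isolate the LS1B, convert it to double (a single power of two is
   exact), and read the biased exponent out of the bit pattern.
   Assumes IEEE-754 doubles and bb != 0. */
static unsigned bsf64_double(uint64_t bb)
{
    double d = (double)(bb & (0 - bb));   /* isolated LS1B */
    uint64_t u;
    memcpy(&u, &d, sizeof u);             /* reinterpret the IEEE-754 bits */
    return (unsigned)(u >> 52) - 1023;    /* biased exponent -> bit index */
}
```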
-
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
Re: 32 bit versions for bitscan64
Desperado wrote:
Hi Michael.
I think hiding the implementation is a good choice, even at the cost of some speed. That's why I hide the code in a function. The following is a snapshot of my engine-implementation.
[bitscan code snipped]
regards

Hi Michael,
Yes, I need to start thinking in terms of portable, easy-to-maintain-and-debug code, rather than the 'hard wired' stuff that I tend to write!
-
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
Re: 32 bit versions for bitscan64
Michael Sherwin wrote:
I was just (slightly) curious. Not enough to do any testing of my own!
Here is what I do in RomiChess. And I am perfectly happy with it--given the limitations of the 64bit compiler.
[asm.c and x64.c code snipped]

However, if the latest processors are faster at executing 32-bit bsf/bsr, then I might be interested in testing something like the following:
Note: it's been a long time since I've written assembly, so I am not sure if the following code is correct. However, it (the correct code) did test slower when I tried it on older machines.
Code: Select all
__inline s32 FirstBit(u64 bits)
{
__asm
{
mov ebx, 0
mov ecx, 0
bsf ebx, dword ptr bits
bsf ecx, dword ptr bits+4
shl ebx, 6
mov eax, [bsfTbl + ebx + ecx]
}
}
-
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: 32 bit versions for bitscan64
Michael Sherwin wrote:
However, if the latest processors are faster at executing 32-bit bsf/bsr, then I might be interested in testing something like the following. Note: it's been a long time since I've written assembly, so I am not sure if the following code is correct.
[table-lookup asm snipped]
If you're willing to rely on bsf to not change the register contents when it fails, you could try something like this:
Code: Select all
__inline s32 FirstBit(u64 bits)
{
    __asm
    {
        mov eax, 32                ; default if the high dword is empty
        bsf eax, dword ptr bits+4  ; relies on bsf leaving eax unchanged on zero input
        add eax, 32                ; high index + 32 (64 if bits == 0)
        bsf eax, dword ptr bits    ; the low dword wins when nonzero; result in eax
    }
}
Code: Select all
__inline s32 FirstBit(u64 bits)
{
    __asm
    {
        mov edx, 32
        bsf ecx, dword ptr bits+4  ; ZF is set when the high dword is zero
        cmovz ecx, edx             ; cmov takes no immediate; ZF (not CF) is defined after bsf
        bsf eax, dword ptr bits    ; ZF is set when the low dword is zero
        lea ecx, [ecx+32]          ; high index + 32 (lea leaves the flags intact)
        cmovz eax, ecx             ; low dword empty -> take the high result
    }
}
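The same select-instead-of-branch idea can be sketched in portable C: scan each 32-bit half, then pick the low result only when the low half is nonzero, which a compiler can lower to a cmov-style select. `scan32` is an illustrative plain-C stand-in for bsf, not engine code:

```c
#include <stdint.h>

/* Index of the lowest set bit of a 32-bit value, or 32 if v == 0
   (matching the "preload 32" trick in the asm versions). */
static int32_t scan32(uint32_t v)
{
    if (v == 0) return 32;
    int32_t i = 0;
    while (!(v & 1u)) { v >>= 1; ++i; }
    return i;
}

/* Scan both halves unconditionally, then select: returns 64 when
   bits == 0, like the sentinel versions earlier in the thread. */
static int32_t FirstBit_select(uint64_t bits)
{
    int32_t lo = scan32((uint32_t)bits);
    int32_t hi = scan32((uint32_t)(bits >> 32)) + 32;  /* 64 if high empty */
    return lo != 32 ? lo : hi;
}
```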
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: 32 bit versions for bitscan64
Michael Sherwin wrote:
[table-lookup asm snipped]

my proposals:
Code: Select all
ULONG bsf64a(BTB_T bb)
{
    _asm bsf eax,DWORD PTR bb+4  ; relies on bsf leaving eax unchanged on zero input
    _asm add eax,32
    _asm bsf eax,DWORD PTR bb    ; the low dword overrides when nonzero
    // result is left in eax; assumes bb != 0
}
Code: Select all
ULONG bsf64a(BTB_T bb)
{
    _asm bsf eax,DWORD PTR bb    ; scan the low dword first
    _asm jnz DONE                ; found -> done
    _asm bsf eax,DWORD PTR bb+4  ; otherwise scan the high dword
    _asm add eax,32
DONE: ;
    // result is left in eax; assumes bb != 0
}
1. is branchless, but needs two scans (for fast-bitscan machines)
2. is branching, but may save a scan (for slow-bitscan machines)
(it has been a long time since I tried assembler )