I fear there is something seriously broken with your time measurement. Loop tests with small bodies are too chaotic, processor may internally "unroll" the loop. You should use Time Stamp Counter aka x86 RDTSC instruction with leading cpuid to measure how many cycles a routine takes. HGM once posted some nice code sample to index an array of counters indexed by cycle difference from RDTSC, couldn't find it yet. I will soon put some stuff in cpw.Desperado wrote:hi everyone again.
today i tested some ideas. Let me start up with what i have done
to compute the msb. i order from slow to fast.
1. fasttest() -> performs -97%
testing bits, starting from sq to msb, obstructed by end of line
(in simple loop)
2. divconq() -> performs -78%
check always upper half of line, splits until msb is found
3. BitScanReverse64 intrinsic -> performs -10%
4. bsr (double conversion, from Gerd). my original one +-0
5. msb-lookup +30%
So what conclusion has to be done here, with respect to Kindergarten BB ?Code: Select all
//****************************************************************************** //... for(BTB k=1;k<256;k++) { msk.msb_rank[k] = bsr64(k); //UI_08 msb_rank[256]; maybe 128 is enough msk.msb_file[k] = bsf64(k); // "" } //... //****************************************************************************** BTB msb_rank(BTB bb,SQR_ID sq64) { UI_08 shift = (sq64>>3) << 3; return((BTB) 1 << (msk.msb_rank[bb>>shift] + shift)); } //****************************************************************************** BTB msb_file(BTB bb,SQR_ID sq64) { //obstruction to rank 1 has to be set... BTB x; x = bb >> (sq64 & 7); x *= diag; x >>= 56; x = bR8 >> (msk.msb_file[x]*8); return(x & bb); } //******************************************************************************
The 2 msb lookups are faster than computing by bitscan-reverse !?
should someone now use KG-BB ? Where are the differences ?
Because i am not (of course not) expert for KG-BB. What s about the memory usage ?
Or which kind of bsr should be used ? (Gerd's double conversion is the fastest i used until now. it s faster than the intrinsic ? or am i doing sth totally wrong ?)
note:
- in the case how the msb is used for the obstruction difference, it
is enough to isolate the corresponding file _OR_ rank, to build the ms1b.
That may lead to some other ideas.
- maybe a fast magic bsr-function without lookup can be created, due
to the simpleness of a masked line ?? any idea for that ? (would be enough to have it for one line,all other lines maybe shifted...)
well, now with the kind of msb-lookup the performance comes closer to
the magic approach, some more 20 % speedup and it will beat it...
Your msb_file approach is fastest? Double conversion faster than bsr? Hmmm. What processor do you use? AMD K8? Can you eventually post the generated assembly of the routines?
Also compare your msb_file code with kindergerten files.