if (x) do {
    int idx = bitScanForward(x); // square index from 0..63
    *list++ = foo(idx, ...);
} while (x &= x - 1); // reset LS1B (least significant one bit)
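For anyone who wants to try the loop above as is, here is a minimal compilable sketch. It assumes GCC or Clang (`__builtin_ctzll`); on MSVC something like `_BitScanForward64` would take its place. `serializeSquares` is a made-up name for illustration, and it simply stores the square index instead of calling `foo`.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: GCC/Clang builtin, not part of the original post. */
static int bitScanForward(uint64_t bb) {
    assert(bb != 0);            /* the builtin is undefined for 0 */
    return __builtin_ctzll(bb); /* index of the least significant one bit */
}

/* Hypothetical helper: writes the index of every set bit of x into list,
   lowest square first, and returns how many entries were written. */
static int serializeSquares(uint64_t x, int *list) {
    int *start = list;
    if (x) do {
        *list++ = bitScanForward(x); /* square index from 0..63 */
    } while (x &= x - 1);            /* reset LS1B */
    return (int)(list - start);
}
```

The `if (x)` guard matters because the scan is never called with an empty set; the `do ... while` then tests the already updated `x`, so each set bit costs one scan and one reset.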
nthom wrote: I'm testing on a Phenom 9550 quad core.
BSF reg, reg    VectorPath    4 (latency)
BSF reg, mem    VectorPath    7 (latency)
So it is not that fast, since a VectorPath instruction also blocks other units, but it should still be faster than two 32-bit BSFs. So maybe on those AMD boxes the De Bruijn multiplication is faster...
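In case it helps with a comparison, this is roughly what a De Bruijn multiplication bitscan looks like. It is only a sketch: the constant is one commonly used 64-bit De Bruijn sequence, the function and table names are made up here, and the table is filled at startup instead of being hard-coded.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a commonly used 64-bit De Bruijn sequence; every 6-bit
   window of it is distinct, so the top 6 bits after a shift identify
   the shift amount. */
static const uint64_t DEBRUIJN64 = 0x03f79d71b4cb0a89ULL;

static int deBruijnIndex[64];
static int deBruijnReady = 0;

/* Fill the lookup table: shifting the constant left by i is the same as
   multiplying it by the single bit (1 << i). */
static void initDeBruijn(void) {
    for (int i = 0; i < 64; ++i)
        deBruijnIndex[(DEBRUIJN64 << i) >> 58] = i;
    deBruijnReady = 1;
}

/* Hypothetical name; returns the index 0..63 of the LS1B of bb. */
static int bitScanForwardDeBruijn(uint64_t bb) {
    assert(bb != 0);
    if (!deBruijnReady) initDeBruijn();
    uint64_t ls1b = bb & (0 - bb); /* isolate the least significant one bit */
    return deBruijnIndex[(ls1b * DEBRUIJN64) >> 58];
}
```

This costs one multiply, one shift and one table lookup, with no dependence on how the CPU decodes BSF, so it is a reasonable thing to time against the intrinsic on the same box.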
Or even better, leading zero count ...
LZCNT reg, reg    DirectPath Single    2, 1 (latency, reciprocal throughput)
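A leading zero count can stand in for bitscan forward if the LS1B is isolated first. A rough sketch, assuming GCC or Clang: `__builtin_clzll` only compiles to LZCNT when the target options allow it (for example `-mlzcnt`), otherwise the compiler falls back to BSR-based code. The function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name; returns the index 0..63 of the LS1B of bb. */
static int bitScanForwardViaLzcnt(uint64_t bb) {
    assert(bb != 0);               /* the builtin is undefined for 0 */
    uint64_t ls1b = bb & (0 - bb); /* isolate the least significant one bit */
    /* Assumption: GCC/Clang builtin. A single set bit at index i has
       63 - i leading zeros. */
    return 63 - __builtin_clzll(ls1b);
}
```

Whether the extra isolate-and-subtract steps still beat a VectorPath BSF would have to be measured on the actual hardware.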
FYI, I figured it out: it turned out to be a problem with my VS2008 optimization settings. I had Maximize Speed (/O2) on, but it wasn't getting through to the actual command line until I changed a heap of stuff and then changed it all back!
Now my 64-bit version is about 50% faster than the 32-bit one.
nthom wrote: FYI, I figured it out: it turned out to be a problem with my VS2008 optimization settings. I had Maximize Speed (/O2) on, but it wasn't getting through to the actual command line until I changed a heap of stuff and then changed it all back!
Now my 64-bit version is about 50% faster than the 32-bit one.
How did you know it wasn't getting through to the command line? I might be having the same problem...