Good lord. I was dead wrong on this. There are all sorts of integer intrinsics, even for 64-bit bitboards: __popcll(), __clzll() and what have you. If your integers fit in 24 bits then there is a faster multiplication, __mul24(), as well. I got some speedup with the popcount already. I had assumed that they wouldn't bother implementing integer intrinsics, since floats are faster and that is what GPUs are intended for. Anyway, I hope Gerd sees this and does his magic :) The 32-bit first-one and bit counts are OK but still very slow. I need to count pseudo-legal moves before generating them.
The bitboards are very sparse and maybe a simpler version of population count would help.
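To make the "simpler version" idea concrete, here is a minimal sketch of the usual sparse popcount, which clears the lowest set bit on each pass so the loop runs once per set bit. The name popcount_sparse and the HOST_DEVICE macro are my own for illustration, not from any library, and whether this actually beats __popcll() on the GPU is untested here.

```cpp
#include <cstdint>

// The qualifiers only exist under nvcc; on a plain host compiler they
// expand to nothing so the same function can be unit-tested on the CPU.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper (name is an assumption, not from the post).
// Counts set bits by clearing the lowest set bit each iteration,
// so the cost scales with the number of set bits, not the word width.
HOST_DEVICE inline int popcount_sparse(std::uint64_t bb) {
    int count = 0;
    while (bb) {
        bb &= bb - 1;  // clear the least significant set bit
        ++count;
    }
    return count;
}
```

On the device the result should match __popcll(bb) for every input, so the two can be swapped and timed against each other. Keep in mind the loop is data dependent, so threads in a warp with different bit counts will diverge.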
Edit: Wait. That was a fluke; the intrinsics are slower, especially __ffsll(). But I shaved off 5 registers and removed some tables by using the intrinsics. I will do unit testing to find out if they are actually slower.
The intrinsics are hardware implemented according to this page so they should be faster.
Btw, the bitmagic library has some SSE popcnt routines that can be used when there is no hardware support for it. Here is a straightforward translation of the well-known 32-bit popcount.