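while (b)
{
    // first_1(b) scans for the lowest set bit; b is still unchanged here
    (*mlist++).move = make_move(from, first_1(b));
    b ^= b & (-b); // b & -b isolates the lowest set bit, xor clears it
}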
Now we are at 443.655 nodes/sec, very similar to the combined bitscan+pop version, which is at 446.222 nodes/sec.
My impression is that the current version, the one I posted at the beginning, has no 64-bit operations, which are very costly on a 32-bit machine, while this one adds three more 64-bit operations (xor, neg, and):
while (b)
{
    Square id = first_1(b);
    (*mlist++).move = make_move(from, id);
    b ^= (1ULL << id);
}
And to my surprise I saw an 11% speed decrease, from 446.222 nodes/sec to 401.151 nodes/sec!
I cannot explain it, but try b ^= b & -b instead of b ^= (1ULL << id).
Using this makes a _big_ difference on my machine (btw: it keeps the pop independent of the scan, whatever that may be good for...).
cheers
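The two ways of clearing the bit just found can be put side by side. This is only a minimal sketch, not the engine's code: first_1_naive is a hypothetical stand-in for first_1() that simply loops, so the example stays portable. What it is meant to show is that b & -b isolates the lowest set bit from b alone, so the clear does not have to wait for the bitscan result, whereas 1ULL << id does.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for first_1(): index of the least significant
// set bit, found with a plain loop. Precondition: b != 0.
inline int first_1_naive(uint64_t b) {
    assert(b != 0);
    int id = 0;
    while ((b & 1) == 0) { b >>= 1; ++id; }
    return id;
}

// Clear the lowest set bit using the index returned by the bitscan.
// The shift cannot start until the scan has produced id.
inline uint64_t clear_with_shift(uint64_t b) {
    int id = first_1_naive(b);
    return b ^ (1ULL << id);
}

// Clear the lowest set bit from b alone: b & (0 - b), i.e. b & -b,
// isolates that bit (two's complement on an unsigned value), xor removes it.
inline uint64_t clear_with_neg(uint64_t b) {
    return b ^ (b & (0 - b));
}
```

Both functions return the same bitboard for any non-zero input; they differ only in whether the clear depends on the scan.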
Sure, 1ULL << id in 32-bit mode is not that cheap, and it depends on the bitscan result (also in 64-bit mode). Maybe the generated assembly is the same, but b &= b - 1 to reset the LS1B is still one instruction less...
Gerd Isenberg wrote: Maybe the generated assembly is the same, but b &= b - 1 to reset the LS1B is still one instruction less...
It is indeed a bit faster, at 445.761 nodes/sec, very similar to the reference version.
I would think any 64-bit operation in 32-bit mode is best avoided; perhaps that's why the reference version still seems to be the best.
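To make the cost argument concrete, here is a rough sketch of what b &= b - 1 amounts to when the bitboard has to live in two 32-bit words. The Split64 type and the helper names are illustrative assumptions, not anything from the engine, and a real compiler's output for a 32-bit target may well look different; the sketch only shows that one 64-bit subtract and one 64-bit and each become two 32-bit operations, with a borrow in between.

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit bitboard held as two 32-bit words, roughly as a 32-bit
// target has to keep it in registers. (Illustrative type only.)
struct Split64 {
    uint32_t lo;
    uint32_t hi;
};

inline Split64 split(uint64_t b) {
    return Split64{ static_cast<uint32_t>(b), static_cast<uint32_t>(b >> 32) };
}

inline uint64_t join(Split64 s) {
    return (static_cast<uint64_t>(s.hi) << 32) | s.lo;
}

// b &= b - 1 written out on the halves: a subtract, a subtract with
// borrow, and two ands, instead of one subtract and one and.
inline Split64 reset_ls1b_split(Split64 s) {
    uint32_t borrow = (s.lo == 0) ? 1u : 0u; // lo - 1 wraps only when lo == 0
    uint32_t lo_m1 = s.lo - 1u;
    uint32_t hi_m1 = s.hi - borrow;
    return Split64{ s.lo & lo_m1, s.hi & hi_m1 };
}
```

For example, join(reset_ls1b_split(split(b))) gives the same value as b & (b - 1) computed directly on a uint64_t.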
Anyhow, using the intrinsic _BitScanForward() instead of the Matt multiply doesn't seem to give a good speedup even on an Intel Core 2 Duo. On a slower CPU it can only get worse, and on a faster 64-bit CPU you have in any case another version tuned for 64 bits.
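For readers who have not seen it, the "Matt multiply" is presumably the folding trick attributed to Matt Taylor, which needs only one 32-bit multiply. The sketch below is written under that assumption and is not taken from the engine: the names fold_index, FoldTable and first_1_folded are made up for illustration, 0x78291ACF is the constant usually quoted for the trick, and the lookup table is filled at startup from the hash itself rather than typed in by hand.

```cpp
#include <cassert>
#include <cstdint>

// Hash used by the folding trick: bb ^ (bb - 1) sets every bit up to and
// including the lowest set bit, the two 32-bit halves are xor-ed together,
// and a single 32-bit multiply maps the result to a 6-bit table index.
inline unsigned fold_index(uint64_t bb) {
    bb ^= bb - 1;
    uint32_t folded = static_cast<uint32_t>(bb) ^ static_cast<uint32_t>(bb >> 32);
    return static_cast<uint32_t>(folded * 0x78291ACFu) >> 26;
}

// Lookup table built once from the hash: slot fold_index(1 << i) holds i.
struct FoldTable {
    int sq[64];
    FoldTable() {
        for (int i = 0; i < 64; ++i)
            sq[fold_index(1ULL << i)] = i;
    }
};

// Index of the least significant set bit. Precondition: bb != 0.
inline int first_1_folded(uint64_t bb) {
    static const FoldTable table;
    assert(bb != 0);
    return table.sq[fold_index(bb)];
}
```

The intrinsic replaces the xor, fold, multiply and table load with a hardware scan, but on a 32-bit build a 64-bit bitboard still has to be scanned in two halves, which may be part of why the gain measured in this thread is small.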
Desperado wrote:
So once again Marco, would you be so kind as to give me an Intel bench
for MattsFT and the bitscan-only version?
I would be very (very, very, very) thankful!
I think I have already given you that: the separated bitscan + pop.
With the last version we get more or less the same nodes/sec as the MattsFT version, and the last tested version is the separated bitscan.
I cannot test _only_ the bitscan without the pop, because otherwise the engine doesn't work anymore. What I have done is to _separate_ the two, and I think we have verified that the intrinsic bitscan version runs at the same speed as the MattsFT version.
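To spell out what "combined" and "separated" mean here, a small sketch follows. It is only an illustration under assumptions: first_1_naive stands in for whichever bitscan is being benchmarked, pop_first_1 is a hypothetical name for the combined scan-and-clear helper, and the squares are collected into a std::vector instead of a move list.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical portable bitscan standing in for first_1().
// Precondition: b != 0.
inline int first_1_naive(uint64_t b) {
    assert(b != 0);
    int id = 0;
    while ((b & 1) == 0) { b >>= 1; ++id; }
    return id;
}

// Combined: one call returns the index and removes the bit.
inline int pop_first_1(uint64_t* b) {
    int id = first_1_naive(*b);
    *b &= *b - 1;
    return id;
}

inline std::vector<int> squares_combined(uint64_t b) {
    std::vector<int> out;
    while (b)
        out.push_back(pop_first_1(&b));
    return out;
}

// Separated: the scan and the reset are two independent statements,
// so the bitscan can be swapped out without touching the pop.
inline std::vector<int> squares_separated(uint64_t b) {
    std::vector<int> out;
    while (b) {
        out.push_back(first_1_naive(b));
        b &= b - 1;
    }
    return out;
}
```

Both loops visit the same squares in the same order, which is why the engine keeps working in either form; dropping the pop entirely would leave b unchanged and the loop would never terminate.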