The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too
ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
The BLSR/BLSI instruction reset/extract the lowest set bit. I'm not aware of compiler intrinsics for those operations, so I hope that compilers will detect idioms like "x & (x-1)" and "x & (-x)" and convert them to those instructions.
The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too
ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Wonderful!
Time to write some new stuff in CPW
Looking for latencies ...
Intrinsics are already listed in the paper but possibly not yet available.
The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too
ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Wonderful!
Time to write some new stuff in CPW
Looking for latencies ...
Intrinsics are already listed in the paper but possibly not yet available.
The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too
ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Using PEXT for indices would give away the advantage of magic multiply where most squares have magics that give you a 1-bit-smaller index. Maybe it will be faster than a multiply, and still worth it?
I'm sure several of these instructions will find niche uses though for chessprogramming. Compressing attack bitboards and using PDEP to uncompress them does sound like an easy win.
Right... Unfortunately they're probably icc only, not gcc. But if gcc adds intrinsics for it (and I'm guessing they will), icc will probably add the same ones since it's always gcc compatible. So it's probably all good.
The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too
ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Using PEXT for indices would give away the advantage of magic multiply where most squares have magics that give you a 1-bit-smaller index. Maybe it will be faster than a multiply, and still worth it?
By not extracting the bits on the edges of the board and the bit of the moving piece, wouldn't you come up with 10 bits for rook attacks and 9 bits for bishop attacks, thus needing a 1K by 64 qword table for rooks and 512 by 64 qword table for bishops.
The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too
ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Using PEXT for indices would give away the advantage of magic multiply where most squares have magics that give you a 1-bit-smaller index. Maybe it will be faster than a multiply, and still worth it?
By not extracting the bits on the edges of the board and the bit of the moving piece, wouldn't you come up with 10 bits for rook attacks and 9 bits for bishop attacks, thus needing a 1K by 64 qword table for rooks and 512 by 64 qword table for bishops.
You'd extract exactly the same bits that are used as input to a variable-shift magic multiply. However, the index formed by magic mul, shift usually has 1 bit less. So the tables would approximately double in size.
[edit: it might still be worth it though, since the PEXT replaces both the multiply and the variable shift... in effect you get the variable index sizes "for free" with PEXT, which is pretty nice]
However, the PDEP at the end shrinks table size by a factor of four and requires no extra memory indirection, so that seems like a pure win.
I thought of another idea... If there wasn't already a dedicated POPCNT instruction, it could easily be implemented with a PEXT, TZCNT sequence [edit: or PEXT, NOT, LZCNT might be better, I guess]
If pext/pdep pairs well, minimal Kindergarten BBs will get a new kick as well, ignoring redundancy of the outer squares for same extract and deposit masks.