
I was thinking about PEXT/PDEP for rook or bishop attacks, and it occurred to me that you could use two different masks for the same square. Instead of one table lookup indexed by up to 12 bits, you'd have two table lookups, each taking up to 6 input bits and producing 8 bits of output (in other words, only one byte per table entry!). The tables would be small enough to virtually always hit in the cache. The two PEXTs before the lookups are independent of each other, as are the two PDEPs at the end, so each pair could be done in parallel. So it would only cost a few cycles more, which might be offset by fewer/cheaper cache misses.
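Here is a rough sketch of what I mean for rooks, not a tuned implementation. All the names (`pext64`, `pdep64`, `slide8`, `rank_tab`, `file_tab`, `init_tables`, `rook_attacks`) are made up for illustration, and I'm assuming the usual a1 = 0 ... h8 = 63 square numbering. The `pext64`/`pdep64` loops are portable stand-ins so the code runs anywhere; on BMI2 hardware you would presumably swap in `_pext_u64`/`_pdep_u64` from `<immintrin.h>`.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-ins for PEXT/PDEP (hypothetical helper names). */
static uint64_t pext64(uint64_t x, uint64_t mask) {
    uint64_t out = 0;
    for (uint64_t bit = 1; mask; bit <<= 1, mask &= mask - 1)
        if (x & mask & (~mask + 1))      /* test the lowest remaining mask bit */
            out |= bit;
    return out;
}

static uint64_t pdep64(uint64_t x, uint64_t mask) {
    uint64_t out = 0;
    for (uint64_t bit = 1; mask; bit <<= 1, mask &= mask - 1)
        if (x & bit)
            out |= mask & (~mask + 1);   /* deposit at the lowest remaining mask bit */
    return out;
}

/* Attack set (8 bits) of a slider at index pos on an 8-square line.
   occ6 is the occupancy of line squares 1..6; the two edge squares never block. */
static uint8_t slide8(int pos, unsigned occ6) {
    unsigned occ = occ6 << 1, att = 0;
    for (int i = pos + 1; i < 8; i++) { att |= 1u << i; if (occ & (1u << i)) break; }
    for (int i = pos - 1; i >= 0; i--) { att |= 1u << i; if (occ & (1u << i)) break; }
    return (uint8_t)att;
}

/* One byte per entry: [square][6-bit occupancy]. */
static uint8_t rank_tab[64][64];
static uint8_t file_tab[64][64];

static void init_tables(void) {
    for (int sq = 0; sq < 64; sq++)
        for (unsigned occ6 = 0; occ6 < 64; occ6++) {
            rank_tab[sq][occ6] = slide8(sq & 7, occ6);   /* position along the rank */
            file_tab[sq][occ6] = slide8(sq >> 3, occ6);  /* position along the file */
        }
}

static uint64_t rook_attacks(int sq, uint64_t occupied) {
    assert(sq >= 0 && sq < 64);
    int file = sq & 7, rank = sq >> 3;
    uint64_t rank_in  = 0x7EULL << (8 * rank);             /* 6 inner squares of the rank */
    uint64_t file_in  = 0x0001010101010100ULL << file;     /* 6 inner squares of the file */
    uint64_t rank_out = 0xFFULL << (8 * rank);             /* all 8 squares of the rank */
    uint64_t file_out = 0x0101010101010101ULL << file;     /* all 8 squares of the file */
    /* The two extract/lookup/deposit chains do not depend on each other. */
    uint64_t r = rank_tab[sq][pext64(occupied, rank_in)];
    uint64_t f = file_tab[sq][pext64(occupied, file_in)];
    return pdep64(r, rank_out) | pdep64(f, file_out);
}
```

Call `init_tables()` once at startup, then `rook_attacks(sq, occupied)` for each query. With this layout the two tables take 64 x 64 bytes each, 8 KiB in total. The rook's own square may fall inside the input mask; `slide8` never looks at that bit, so it does no harm.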
To take this idea a bit further, confining the input/output bits of each table to a single rank/file or a single diagonal would also make most of the tables shareable. (For example, for input squares in a given file, you would not need 8 separate rank tables, one for each of the 8 ranks; you'd only need 1 such table.) In fact the *rank* tables would also work just fine as *file* tables, saving another factor of two. [edit: and maybe some of the same tables could be used for bishop diagonals, if the upper index bits were zeroes?]
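A sketch of the shared-table version, again with made-up names (`line_attacks`, `line_through`, `line_attack`, etc.) and assuming a1 = 0 ... h8 = 63. A single 8 x 64 byte table, indexed by the piece's position along its line, serves ranks, files and both diagonal directions. The line masks are rebuilt on every call purely to keep the example short; a real engine would presumably precompute them per square. The `pext64`/`pdep64` loops are portable stand-ins for the BMI2 instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-ins for PEXT/PDEP (hypothetical helper names). */
static uint64_t pext64(uint64_t x, uint64_t mask) {
    uint64_t out = 0;
    for (uint64_t bit = 1; mask; bit <<= 1, mask &= mask - 1)
        if (x & mask & (~mask + 1)) out |= bit;
    return out;
}
static uint64_t pdep64(uint64_t x, uint64_t mask) {
    uint64_t out = 0;
    for (uint64_t bit = 1; mask; bit <<= 1, mask &= mask - 1)
        if (x & bit) out |= mask & (~mask + 1);
    return out;
}

/* The one shared table: [position along the line][occupancy of line squares 1..6]. */
static uint8_t line_attacks[8][64];

static void init_line_attacks(void) {
    for (int pos = 0; pos < 8; pos++)
        for (unsigned occ6 = 0; occ6 < 64; occ6++) {
            unsigned occ = occ6 << 1, att = 0;
            for (int i = pos + 1; i < 8; i++) { att |= 1u << i; if (occ & (1u << i)) break; }
            for (int i = pos - 1; i >= 0; i--) { att |= 1u << i; if (occ & (1u << i)) break; }
            line_attacks[pos][occ6] = (uint8_t)att;
        }
}

/* All squares on the line through sq in direction (df, dr), both ways, sq included. */
static uint64_t line_through(int sq, int df, int dr) {
    uint64_t line = 1ULL << sq;
    for (int sign = -1; sign <= 1; sign += 2) {
        int f = (sq & 7) + sign * df, r = (sq >> 3) + sign * dr;
        while (f >= 0 && f < 8 && r >= 0 && r < 8) {
            line |= 1ULL << (8 * r + f);
            f += sign * df;
            r += sign * dr;
        }
    }
    return line;
}

static uint64_t line_attack(int sq, uint64_t occupied, int df, int dr) {
    assert(sq >= 0 && sq < 64);
    uint64_t line = line_through(sq, df, dr);
    /* Position along the line = number of line squares below sq. */
    int pos = 0;
    for (uint64_t below = line & ((1ULL << sq) - 1); below; below &= below - 1) pos++;
    /* Inner squares: the line without its lowest and highest square. */
    uint64_t inner = line & (line - 1);
    uint64_t top = inner;
    while (top & (top - 1)) top &= top - 1;
    inner &= ~top;
    /* A short diagonal gives fewer index bits (upper bits zero); PDEP then
       drops any table output bits beyond the end of the line. */
    return pdep64(line_attacks[pos][pext64(occupied, inner)], line);
}

static uint64_t rook_attacks(int sq, uint64_t occupied) {
    return line_attack(sq, occupied, 1, 0) | line_attack(sq, occupied, 0, 1);
}
static uint64_t bishop_attacks(int sq, uint64_t occupied) {
    return line_attack(sq, occupied, 1, 1) | line_attack(sq, occupied, 1, -1);
}
```

At least in this sketch, the bishop case does work the way the edit above guesses: on a short diagonal the table "sees" an empty stretch past the real end of the line, and depositing through the shorter mask simply throws those extra bits away. The whole table is 512 bytes.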
I haven't worked out the details, but it wouldn't surprise me if the complete tables for rooks would fit easily in L1 cache.