Which will be faster when incrementally doing/undoing moves:
1.
void do_move(const int move) {
    int to_from;
    ...
    to_from = move & 0xFFF;                 /* combined 12-bit from/to index */
    key       ^= random_rook_key[to_from];  /* one lookup updates the hash   */
    bitboard1 ^= set_bits[to_from];         /* entry has both bits set       */
    bitboard2 ^= set_bits[to_from];
    ...
}
2.
void do_move(const int move) {
    int to;
    int from;
    ...
    to   = move & 0x3F;
    from = (move >> 6) & 0x3F;
    key ^= random_rook_key[to];
    key ^= random_rook_key[from];
    bitboard1 |= set_bit[to];      /* set the to-square bit     */
    bitboard1 &= clear_bit[from];  /* clear the from-square bit */
    bitboard2 |= set_bit[to];
    bitboard2 &= clear_bit[from];
    ...
}
In case 1 we have fewer bit operations, but we need much larger lookup tables (size = 4096 x 8 bytes = 32 KB).
In case 2 we have more bit operations, but we need smaller lookup tables (size = 64 x 8 bytes = 0.5 KB).
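For reference, a minimal sketch of how the combined 4096-entry tables of case 1 could be precomputed, assuming the same to/from packing as case 2 (to in bits 0-5, from in bits 6-11); the per-square tables zobrist_rook[] and set_bit[] and the helper name init_tables() are illustrative, not from the original:

#include <stdint.h>

extern uint64_t set_bit[64];       /* set_bit[sq] = 1ULL << sq          */
extern uint64_t zobrist_rook[64];  /* per-square Zobrist keys (assumed) */

uint64_t set_bits[4096];           /* both the from- and to-bit set     */
uint64_t random_rook_key[4096];    /* key for from xor key for to       */

void init_tables(void) {
    for (int from = 0; from < 64; ++from) {
        for (int to = 0; to < 64; ++to) {
            const int to_from = (from << 6) | to;
            set_bits[to_from]        = set_bit[from] | set_bit[to];
            random_rook_key[to_from] = zobrist_rook[from] ^ zobrist_rook[to];
        }
    }
}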
Actually I like your first one better, but doing it SIMD-wise with SSE2.
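A rough sketch of what that could look like, assuming bitboard1 and bitboard2 are packed into one 128-bit slot and the 4096-entry table is widened to __m128i entries (all names here are illustrative, not from the original):

#include <emmintrin.h>  /* SSE2 intrinsics */

/* 4096-entry table: each __m128i holds the to/from mask in both halves */
static __m128i set_bits128[4096];

/* bitboard1 and bitboard2 packed into a single 128-bit slot */
static __m128i boards12;

void do_move_sse2(const int move) {
    const int to_from = move & 0xFFF;
    /* one xor toggles the from- and to-bits in both bitboards at once */
    boards12 = _mm_xor_si128(boards12, set_bits128[to_from]);
}

The hash-key update stays scalar, but the two bitboard updates collapse into a single 128-bit xor, at the cost of a 64 KB instead of 32 KB table.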
For approach 2, you would be better off always using xor with set_bit.
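That would look roughly like the sketch below (same assumed names as in the question). It works because for a non-capture the from-bit is known to be set and the to-bit is known to be clear, so xor clears the one and sets the other, and the clear_bit[] table disappears entirely:

#include <stdint.h>

extern uint64_t key, bitboard1, bitboard2;
extern uint64_t random_rook_key[64];  /* per-square Zobrist keys   */
extern uint64_t set_bit[64];          /* set_bit[sq] = 1ULL << sq  */

void do_move(const int move) {
    const int to   = move & 0x3F;
    const int from = (move >> 6) & 0x3F;

    key ^= random_rook_key[to];
    key ^= random_rook_key[from];

    /* xor sets the (clear) to-bit and clears the (set) from-bit */
    bitboard1 ^= set_bit[to];
    bitboard1 ^= set_bit[from];
    bitboard2 ^= set_bit[to];
    bitboard2 ^= set_bit[from];
}

A nice side effect is that do_move and undo_move become symmetric: undoing can reuse exactly the same xor sequence.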
It depends on how many cache lines you can save on average with the denser tables but twice as many lookups. It does not necessarily depend on the distribution of those cache lines in memory (though there are issues there as well: pages, the hardware prefetcher).
Memory versus computation is often hard to predict and will largely depend on your other memory and cache requirements. As long as your L1 data cache is not saturated, using more memory is likely faster. But if it becomes polluted, it might be the other way around.
Another idea for recent x64 CPUs like the Core 2 Duo is to use bit-test intrinsics to set, reset, or complement bits by bit index with one instruction; btx reg64, reg64 has a latency of one cycle on a Core 2 Duo.
Hopefully the compiler will use btc reg64, reg64 here, since the btx version with a memory operand is much more expensive, because it allows processing large bit arrays and does not reduce the bit index mod 32/64...
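A sketch of that idea, assuming MSVC's _bittestandcomplement64 intrinsic on an x64 target (the function name and the idea of working on locals to coax the compiler toward the register form are illustrative; GCC/Clang would need their own patterns):

#include <intrin.h>   /* MSVC: _bittestandcomplement64 */

extern __int64 bitboard1, bitboard2;

void complement_squares(const int to, const int from) {
    /* work on locals so the compiler has a chance to emit
       btc reg64, reg64 instead of the slow memory-operand form */
    __int64 b1 = bitboard1;
    __int64 b2 = bitboard2;
    _bittestandcomplement64(&b1, to);
    _bittestandcomplement64(&b1, from);
    _bittestandcomplement64(&b2, to);
    _bittestandcomplement64(&b2, from);
    bitboard1 = b1;
    bitboard2 = b2;
}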
In a perft or "beancounter" framework, doMove/undoMove performance is more critical than in approaches where one usually spends much more time in leaf nodes; it is also a futility-pruning versus lazy-eval issue.
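To illustrate why doMove/undoMove dominates there, a minimal perft skeleton (generate_moves, do_move, undo_move are a hypothetical interface; a real engine's names and move list handling will differ):

#include <stdint.h>

extern int  generate_moves(int *moves);   /* fills moves[], returns count */
extern void do_move(const int move);
extern void undo_move(const int move);

/* at every node the only work besides move generation is
   do_move/undo_move, so their cost drives the node rate    */
uint64_t perft(const int depth) {
    int moves[256];
    uint64_t nodes = 0;

    if (depth == 0)
        return 1;

    const int n = generate_moves(moves);
    for (int i = 0; i < n; ++i) {
        do_move(moves[i]);
        nodes += perft(depth - 1);
        undo_move(moves[i]);
    }
    return nodes;
}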