vladstamate wrote:
Code: Select all
m_kHashKey ^= g_kPieceCodes[m_iSideToMove][kMove.m_piece][fromSq] ^ g_kPieceCodes[m_iSideToMove][kMove.m_piece][toSq];
This probably means that your array has dimensions [2][6][64] or something like that.
To index into it at g_kPieceCodes[side][piece][square], the compiler has to calculate the address, something like:
Code: Select all
address = g_kPieceCodes + sizeof(bitboard) * (side*6*64 + piece*64 + square);
Multiplying by sizeof(bitboard) is a shift left by 3, which can be done with the addressing mode on x86, so you can think of it as "free" here.
Multiplying by 64 is the same as a shift left by 6, but 6*64 = 384 is not a power of two, so multiplying by it is going to take several smaller math instructions (shifts and adds) or an actual multiply instruction, which is kind of slow.
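To make the address arithmetic above concrete, here is a minimal sketch that checks the flat-offset formula against what the compiler actually does for a [2][6][64] array. The names `Bitboard`, `g_codes3d`, `flatOffset`, `times384` and `byteOffset` are made up for illustration, and I'm assuming a 64-bit key type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bitboard = std::uint64_t;  // assumption: 64-bit hash keys

constexpr int kSides = 2, kPieces = 6, kSquares = 64;

// Hypothetical stand-in for the [2][6][64] table from the quoted code.
Bitboard g_codes3d[kSides][kPieces][kSquares];

// Element index needed for g_codes3d[side][piece][square].
constexpr std::size_t flatOffset(int side, int piece, int square) {
    return static_cast<std::size_t>(side) * (kPieces * kSquares)  // * 384
         + static_cast<std::size_t>(piece) * kSquares             // * 64, i.e. << 6
         + static_cast<std::size_t>(square);
}

// One way to multiply by 384 without a multiply: 384 = 256 + 128.
constexpr std::size_t times384(std::size_t x) { return (x << 8) + (x << 7); }

// Byte distance from the start of the table to one element, as the compiler lays it out.
std::size_t byteOffset(int side, int piece, int square) {
    assert(side >= 0 && side < kSides);
    assert(piece >= 0 && piece < kPieces);
    assert(square >= 0 && square < kSquares);
    const char* base = reinterpret_cast<const char*>(&g_codes3d);
    const char* elem = reinterpret_cast<const char*>(&g_codes3d[side][piece][square]);
    return static_cast<std::size_t>(elem - base);
}
```

For any valid index, `byteOffset(side, piece, square)` should equal `sizeof(Bitboard) * flatOffset(side, piece, square)`. Which instruction sequence the compiler really picks for the *384 depends on the compiler and target, so treat `times384` as one possibility only.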
It would help if you used a pieceAndSide value that had [side] and [piece] combined together into one. For example, the least-significant bit of each piece number could represent the side. Then it would be more like this:
Code: Select all
address = g_kPieceCodes + sizeof(bitboard) * (pieceAndSide*64 + square);
That is just a shift, an add and then the memory lookup. The expensive multiply should be gone.
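A rough sketch of what such a combined encoding could look like, with the side in the least-significant bit. The enum values and the helpers `makePieceAndSide`, `sideOf`, `typeOf` and `codeFor` are hypothetical; your engine's piece numbering will differ.

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = std::uint64_t;  // assumption: 64-bit hash keys

// Hypothetical piece types; any 0..5 numbering would work the same way.
enum PieceType { Pawn, Knight, Bishop, Rook, Queen, King };

// Side lives in bit 0, piece type in the bits above it.
constexpr int makePieceAndSide(int pieceType, int side) { return (pieceType << 1) | side; }
constexpr int sideOf(int pieceAndSide) { return pieceAndSide & 1; }
constexpr int typeOf(int pieceAndSide) { return pieceAndSide >> 1; }

// 12 combined piece/side values, 64 squares each.
Bitboard g_codes[12][64];

// Lookup is now pieceAndSide*64 + square: a shift and an add, no *384.
inline Bitboard& codeFor(int pieceAndSide, int square) {
    assert(pieceAndSide >= 0 && pieceAndSide < 12);
    assert(square >= 0 && square < 64);
    return g_codes[pieceAndSide][square];
}
```

With this layout a white rook and a black rook end up in adjacent rows (6 and 7 here), and getting the side or the type back out is a single mask or shift.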
If you don't want to change how you assigned the piece numbers, you could try just changing the array dimensions to [2*6][64] and the lookup code to something like:
Code: Select all
m_kHashKey ^= g_kPieceCodes[kMove.m_piece*2+m_iSideToMove][fromSq] ^ g_kPieceCodes[kMove.m_piece*2+m_iSideToMove][toSq];
The compiler will do something smart for the *2 (maybe add it to itself, or shift it left by one). Its optimizer will figure out that (kMove.m_piece*2+m_iSideToMove) only needs to be computed once, but if you want to be on the safe side you could do that computation into a local int variable first.
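Here is roughly what the "compute it into a local int first" version might look like as a standalone function. The `Move` struct layout, `updateHash`, `initPieceCodes` and the splitmix64 filler are all assumptions for the sake of a self-contained example, not your actual code; only `g_kPieceCodes` and `m_piece` come from the snippet above.

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = std::uint64_t;  // assumption: 64-bit hash keys

Bitboard g_kPieceCodes[2 * 6][64];

// Hypothetical move layout; the real kMove probably looks different.
struct Move {
    int m_piece;  // 0..5
    int m_from;   // 0..63
    int m_to;     // 0..63
};

// splitmix64, used here only as a placeholder generator to fill the table.
std::uint64_t splitmix64(std::uint64_t& state) {
    std::uint64_t z = (state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

void initPieceCodes(std::uint64_t seed) {
    for (int p = 0; p < 2 * 6; ++p)
        for (int sq = 0; sq < 64; ++sq)
            g_kPieceCodes[p][sq] = splitmix64(seed);
}

// XOR the moving piece out of its from-square and into its to-square.
Bitboard updateHash(Bitboard hash, const Move& move, int sideToMove) {
    const int index = move.m_piece * 2 + sideToMove;  // computed once
    assert(index >= 0 && index < 2 * 6);
    const Bitboard* row = g_kPieceCodes[index];       // one row lookup, reused twice
    return hash ^ row[move.m_from] ^ row[move.m_to];
}
```

Since XOR is its own inverse, applying the same move a second time (or the reversed move) gives back the original key, which is a cheap sanity check for this kind of incremental update.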
Note that there might be something else wrong here, because I don't see how that multiply could slow it down enough to be a 30% speed hit (unless maybe you're just testing perft or something).
[Edit: another thing.. make sure you're compiling with optimizations on! Most compilers produce pretty terrible code at -O0 or -O1. I assumed it was for x86-64, but if this is for Itanium or PPC or something like that, maybe the addressing mode stuff is not "for free" either. Also, optimizing compilers will mix together the instructions from lots of different statements to get faster code. So if you're looking at a disassembly of this function, make sure that the instructions you think are part of this xor calculation are really a part of it, and not something unrelated that the compiler happened to sneak in between.]
[Edit 2: what is kMove.m_piece? If it is a bitfield, be aware that that is probably slow on every compiler ever made. Most compilers generate kind of crappy code for bitfields. Even if it's a structure, some compilers will not handle it especially well.]
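If it does turn out to be a bitfield, one alternative to try is packing the move into a plain integer and extracting the fields with explicit shifts and masks. This is only a sketch: the 6/6/4 bit layout and the names `PackedMove`, `makeMove`, `fromSq`, `toSq` and `pieceOf` are invented for the example, and whether it is actually faster is something you would have to measure.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: bits 0-5 from-square, bits 6-11 to-square, bits 12-15 piece.
struct PackedMove {
    std::uint16_t bits;
};

inline PackedMove makeMove(int from, int to, int piece) {
    assert(from >= 0 && from < 64);
    assert(to >= 0 && to < 64);
    assert(piece >= 0 && piece < 16);
    return PackedMove{static_cast<std::uint16_t>(from | (to << 6) | (piece << 12))};
}

// Each accessor is one shift plus one mask on a plain integer.
inline int fromSq(PackedMove m)  { return m.bits & 0x3F; }
inline int toSq(PackedMove m)    { return (m.bits >> 6) & 0x3F; }
inline int pieceOf(PackedMove m) { return (m.bits >> 12) & 0xF; }
```

Reading `pieceOf(move)` into a local int once, before the two table lookups, also keeps the extraction out of the indexing expression.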