Peer Gynt wrote: I know, but do you need FPU in a chess program? What for? Try using integers instead.
MMX was wonderful back then (x years ago), I remember using it for software rasterization and software (audio) mixing, it was simply awesome.
Do you really need SIMD for a bitboard chess engine? Now that we have 64-bit cpus? I guess not.
Yes, for sure! Even more so with AVX2 and 256-bit ymm registers. Dot products in eval, and otherwise depending on your design. The usual pseudo-bitboarders with magic lookups gain not that much, but direction-wise fill approaches à la DirGolem do.
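To make the dot-product idea a bit more concrete, here is a rough sketch of how eval terms could be collected with SSE2. Everything specific in it is an assumption for illustration and not from this thread: the name eval_dot, 16-bit features and weights, and a length that is a multiple of 8. It uses pmaddwd (_mm_madd_epi16) where SSE2 is available and falls back to a plain loop elsewhere.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Hypothetical helper: sum of feat[i] * w[i] for i in [0, n).
   Assumes n is a multiple of 8 and the total fits in 32 bits. */
int32_t eval_dot(const int16_t *feat, const int16_t *w, int n) {
#ifdef HAVE_SSE2
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 8) {
        __m128i f = _mm_loadu_si128((const __m128i *)(feat + i));
        __m128i v = _mm_loadu_si128((const __m128i *)(w + i));
        /* multiply 8 int16 pairs, add adjacent products into 4 int32 lanes */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(f, v));
    }
    /* horizontal sum of the four 32-bit lanes */
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(acc);
#else
    /* scalar fallback for targets without SSE2 */
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += (int32_t)feat[i] * (int32_t)w[i];
    return sum;
#endif
}
```

Whether this pays off depends on how the eval is laid out. The features have to sit contiguously in memory for the loads to be cheap.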
mar wrote: The only interesting SSE instruction (for chess) is popcnt which is SSE4.x (not sure). MMX may only be interesting in 32-bit mode but AFAIK not many optimize their engines for that anyway. It may be interesting to do a performance comparison - MMX assembly/intrinsics vs 32-bit (assembly/C) code. I doubt MMX will have any advantage on a 64-bit CPU, but I'm just guessing here.
Also you won't be able to interleave MMX assembly with other (compiler-generated) instructions to remove dependencies/optimize the pipeline, so it's hard to say.
A performance comparison would be nice. Until then it's just guesswork.
8 MMX registers are still nice to have in 64-bit mode with sse-float, where some xmm regs need to be caller safe. There are no explicit calling conventions with MMX, only some early confusion for win64.
Gerd Isenberg wrote:
Yes, for sure! Even more so with AVX2 and 256-bit ymm registers. Dot products in eval, and otherwise depending on your design. The usual pseudo-bitboarders with magic lookups gain not that much, but direction-wise fill approaches à la DirGolem do.
Hmm, interesting, yes. Dot products for collecting eval terms, maybe if carefully designed. I guess I should dig into SSE/AVX stuff (do they have a lerp instruction yet? :), just out of curiosity. I admit I never got beyond MMX. As for DirGolem, thanks for the link, I will try to wrap my head around it (no promises).
8 MMX registers are still nice to have in 64-bit mode with sse-float, where some xmm regs need to be caller safe. There are no explicit calling conventions with MMX, only some early confusion for win64.
You mean that you have to preserve some xmm regs just like ebp/rbp for example (the compiler can omit the frame pointer AFAIK, at least in C)? I guess it's up to the compiler to preserve important registers across function calls (I mean within the function body); I can imagine it may want to store the this pointer in one of the GPRs, for example. Either way the compiler can temporarily save a register and restore it later when appropriate - it would be a waste to completely reserve one (or more) registers, especially considering inner loops.
Btw. I don't understand what context switch has to do with caller safety? It has to store all the registers anyway, in fact the more registers to store the more expensive context switch.
EDIT: Now that I think about it, I could even use the stack pointer register in hand-written assembly - interrupts require a context switch anyway, so it could work.
mar wrote: Btw. I don't understand what context switch has to do with caller safety? It has to store all the registers anyway, in fact the more registers to store the more expensive context switch.
The three sentences on that page simply summarise the important points about floating point / mmx registers on (Microsoft's) x86-64:
- preserved across context switches. So all processes and threads are free to use them and won't suddenly see their values changed under their feet.
- no explicit calling convention. So if you make any library calls or even calls to your own code, the content of those registers might change. In addition, your code is allowed to change the values of those registers however it wants.
- prohibited in kernel mode. If you write a device driver, don't touch those registers. Context switching to/from the kernel does not preserve those registers.
More registers do not necessarily mean that context switches become more expensive. If programs are known not to use floating point registers or MMX registers, the kernel does not have to save them across context switches to/from those programs. The kernel can detect this by disabling the FPU. Once a program attempts to execute an FPU instruction, it will trap to the kernel, which will then save the registers and remember that this process uses those registers.
You can, for instance, do eight sliding directions by Kogge-Stone or dumb7fill in SSE2 and, in passing, some pawn stuff in MMX for better scheduling. Thanks to the byte-wise add, the a-h or h-a wraps can be handled without an and-mask, while going west is a bit more expensive in LERF mapping:
#include <mmintrin.h> /* MMX: _mm_slli_si64, _mm_srli_si64, _mm_add_pi8 */
/* MMX aliases the x87 registers: call _mm_empty() before later x87 code. */
__m64 nortOne(__m64 b) {
    return _mm_slli_si64 (b, 8);
}
__m64 soutOne(__m64 b) {
    return _mm_srli_si64 (b, 8);
}
/* byte-wise b + b shifts each rank left by one; the h-file bit falls off */
__m64 eastOne(__m64 b) {
    return _mm_add_pi8 (b, b);
}
__m64 noEaOne (__m64 b) {
    b = _mm_add_pi8 (b, b);
    b = _mm_slli_si64 (b, 8);
    return b;
}
__m64 soEaOne (__m64 b) {
    b = _mm_add_pi8 (b, b);
    b = _mm_srli_si64 (b, 8);
    return b;
}
/* shift right, byte-wise add: clears the a-file, so the final shift cannot wrap */
__m64 westOne(__m64 b) {
    b = _mm_srli_si64 (b, 1);
    b = _mm_add_pi8 (b, b);
    b = _mm_srli_si64 (b, 1);
    return b;
}
__m64 soWeOne (__m64 b) {
    b = _mm_srli_si64 (b, 1);
    b = _mm_add_pi8 (b, b);
    b = _mm_srli_si64 (b, 9);
    return b;
}
__m64 noWeOne (__m64 b) {
    b = _mm_srli_si64 (b, 1);
    b = _mm_add_pi8 (b, b);
    b = _mm_slli_si64 (b, 7);
    return b;
}
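For comparison, the same one-step shifts can be sketched in plain 64-bit C, together with a dumb7fill for one direction. This is only a rough illustration: the names U64, eastOneMask, westOneMask, byteDouble, westOneByteAdd and eastAttacksDumb7 are made up here, and byteDouble merely emulates the byte-wise add in a general-purpose register.

```c
#include <stdint.h>

typedef uint64_t U64;

/* LERF mapping: bit 0 = a1, bit 7 = h1, bit 63 = h8. */
static const U64 NOT_A_FILE = 0xfefefefefefefefeULL;
static const U64 NOT_H_FILE = 0x7f7f7f7f7f7f7f7fULL;

/* Conventional shifts with an explicit and-mask against file wraps. */
U64 eastOneMask(U64 b) { return (b << 1) & NOT_A_FILE; }
U64 westOneMask(U64 b) { return (b >> 1) & NOT_H_FILE; }

/* Emulates the byte-wise b + b: drop the top bit of every byte first,
   so nothing carries from the h-file into the next rank. */
U64 byteDouble(U64 b) { return (b & NOT_H_FILE) << 1; }

/* Same three-step idea as the MMX westOne: shift, byte-wise double
   (which clears the a-file), shift again. */
U64 westOneByteAdd(U64 b) { return byteDouble(b >> 1) >> 1; }

/* dumb7fill to the east: flood through empty squares for up to six steps,
   then shift once more so the result also covers the first blocker. */
U64 eastAttacksDumb7(U64 sliders, U64 empty) {
    U64 flood = sliders;
    for (int i = 0; i < 6; i++) {
        sliders = eastOneMask(sliders) & empty;
        flood |= sliders;
    }
    return eastOneMask(flood);
}
```

With a rook on a1 and an otherwise empty board, eastAttacksDumb7(1, ~1ULL) covers b1 to h1. With a blocker on d1 the attack set stops at d1.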
Context switch has nothing to do with caller safety, but it is a precondition for using the instructions at all - which was initially unclear in Win64, see Agner Fog and the links from the cpw MMX page.