Page 2 of 4
Re: question about symmertic evaluation
Posted: Thu May 24, 2007 9:37 am
by Gerd Isenberg
bob wrote:
The thing is it interlaces nicely with other instructions around it... I have been using several divides in my code to scale parts of the evaluation, and it made absolutely no difference in my NPS...
K8 32-bit idiv is 42 cycles vector path. I can't imagine that this is noise in Crafty. As mentioned, division (and modulus) by constant is therefor target of heavy compiler optimizations, like division by power of two or other tricks based on reciprocal multiplication.
Code: Select all
x idiv (2**i) ::= ( x + ((x>>31) & ((2**i)-1)) ) >> i
Using cdq instead of mov, sar 31 here by the compiler is another reason to simply trust them. Otoh one should be aware that idiv is an expensive, complex instruction compared to add/sub/shift and even imul.
Re: question about symmertic evaluation
Posted: Thu May 24, 2007 10:06 am
by hgm
The latency of divide instructions is such that they almost always cause a hefty pipeline stall. The re-order buffer is simply not large enough to hide the latency, even if it is not on a critical path.
The divide cannot be retired before it has finished, and this means that it will stall the retire stage when all instructions before it in the control stream have been retired. This will create a traffic jam in the re-order buffer, which will fill up with the instructions after the div, that will complete execution, but cannot be retired.
I think nowadays the re-order buffer can hold about 75 uOps, but with an execution throughput of 3 uOps per cycle, it will only take about 25 cycles before all uOps in the re-order buffer have completed execution. The remaining 17 cycles the CPU is then completely stalled.
It is hard to believe that you would not notice this in execution speed. If you don't, it is more likely because the compiler is actually clever enough to eliminate the division, or do it in FP with the aid of a reciprocal. The reciprocal refinement algortithm also has a quite long latency (although faster than 42 clocks), but it has the advantage that it consists of several instructions, and thus completes and retires gradually. A smart compiler can then interleave the critical path of the division with other instructions, and avoid a pipeline stall completely. It even can pipeline several divisions that way, which for the div instructions is impossible.
Re: question about symmertic evaluation
Posted: Thu May 24, 2007 11:24 am
by Gerd Isenberg
hgm wrote:The latency of divide instructions is such that they almost always cause a hefty pipeline stall. The re-order buffer is simply not large enough to hide the latency, even if it is not on a critical path.
The divide cannot be retired before it has finished, and this means that it will stall the retire stage when all instructions before it in the control stream have been retired. This will create a traffic jam in the re-order buffer, which will fill up with the instructions after the div, that will complete execution, but cannot be retired.
I think nowadays the re-order buffer can hold about 75 uOps, but with an execution throughput of 3 uOps per cycle, it will only take about 25 cycles before all uOps in the re-order buffer have completed execution. The remaining 17 cycles the CPU is then completely stalled.
It is hard to believe that you would not notice this in execution speed. If you don't, it is more likely because the compiler is actually clever enough to eliminate the division, or do it in FP with the aid of a reciprocal. The reciprocal refinement algortithm also has a quite long latency (although faster than 42 clocks), but it has the advantage that it consists of several instructions, and thus completes and retires gradually. A smart compiler can then interleave the critical path of the division with other instructions, and avoid a pipeline stall completely. It even can pipeline several divisions that way, which for the div instructions is impossible.
Yes, here some assembly snippets generated by the good old msvc6 compiler, where it uses reciprocal fixed point multiplication as well as cdq,and,add,sar for power of two division:
Code: Select all
eax = ecx / 2
00000 8b c1 mov eax, ecx
00002 99 cdq
00003 2b c2 sub eax, edx
00005 d1 f8 sar eax, 1
eax = ecx / 3
00000 b8 56 55 55 55 mov eax, 55555556H
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 8b c8 mov ecx, eax
0000b c1 e9 1f shr ecx, 31
0000e 03 c1 add eax, ecx
eax = ecx / 4
00000 8b c1 mov eax, ecx
00002 99 cdq
00003 83 e2 03 and edx, 3
00006 03 c2 add eax, edx
00008 c1 f8 02 sar eax, 2
eax = ecx / 5
00000 b8 67 66 66 66 mov eax, 66666667H
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 d1 f8 sar eax, 1
0000b 8b c8 mov ecx, eax
0000d c1 e9 1f shr ecx, 31
00010 03 c1 add eax, ecx
eax = ecx / 6
00000 b8 ab aa aa 2a mov eax, 2aaaaaabH
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 8b c8 mov ecx, eax
0000b c1 e9 1f shr ecx, 31
0000e 03 c1 add eax, ecx
eax = ecx / 7
00000 b8 93 24 49 92 mov eax, 92492493H
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 03 c1 add eax, ecx
0000b c1 f8 02 sar eax, 2
0000e 8b c8 mov ecx, eax
00010 c1 e9 1f shr ecx, 31
00013 03 c1 add eax, ecx
eax = ecx / 8
00000 8b c1 mov eax, ecx
00002 99 cdq
00003 83 e2 07 and edx, 7
00006 03 c2 add eax, edx
00008 c1 f8 03 sar eax, 3
eax = ecx / 9
00000 b8 39 8e e3 38 mov eax, 38e38e39H
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 d1 f8 sar eax, 1
0000b 8b c8 mov ecx, eax
0000d c1 e9 1f shr ecx, 31
00010 03 c1 add eax, ecx
eax = ecx / 10
00000 b8 67 66 66 66 mov eax, 66666667H
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 c1 f8 02 sar eax, 2
0000c 8b c8 mov ecx, eax
0000e c1 e9 1f shr ecx, 31
00011 03 c1 add eax, ecx
eax = ecx / 100
00000 b8 1f 85 eb 51 mov eax, 51eb851fH
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 c1 f8 05 sar eax, 5
0000c 8b c8 mov ecx, eax
0000e c1 e9 1f shr ecx, 31
00011 03 c1 add eax, ecx
Re: question about symmertic evaluation
Posted: Thu May 24, 2007 12:39 pm
by hgm
Indeed, for division by a constant multiplication with the reciprocal seems the obvious choice. More tricky is the case where you have to divide through a variable unknown at compile time.
The best way to do that would be to make use of the SSE (or 3DNow!) reciprocal instruction, (a fast table lookup), followed by two or three iterations of x = x*(2-n*x) to obtain 1/n.
Re: question about symmertic evaluation
Posted: Thu May 24, 2007 1:36 pm
by Gerd Isenberg
signed modulo seemed a bit harder to optimize
Code: Select all
eax = ecx % 2
00000 8b c1 mov eax, ecx
00002 25 01 00 00 80 and eax, 80000001H
00007 79 05 jns SHORT $L807
00009 48 dec eax
0000a 83 c8 fe or eax, fffffffeH
0000d 40 inc eax
$L807:
eax = ecx % 3
00000 8b c1 mov eax, ecx
00002 99 cdq
00003 b9 03 00 00 00 mov ecx, 3
00008 f7 f9 idiv ecx
0000a 8b c2 mov eax, edx
Re: question about symmertic evaluation
Posted: Thu May 24, 2007 1:55 pm
by hgm
Yes, I try to avoid modulo wherever I can. Even unsigned (unless it is by a power of 2, of course).
For instance, to generate a number in the range 0-4 from the hash key, I don't use Key%5, but in stead
(Key&0xFFFFFFF)*5 >> 28.
(I need that to pick a primary entry to hash to within each bucket. The bucket number is then derived from the key as Key & (HSIZE-1), which makes it reasonably independent. ) This is not an alternative way to do modulo, but it is a good way to get a small integer with reasonably homogenous distribution from the hash key.
Re: question about symmertic evaluation
Posted: Thu May 24, 2007 9:46 pm
by Gerd Isenberg
for further references:
Software Optimization Guide for AMD64 Processors
8.1 Replacing Division with Multiplication
8.8 Derivation of Algorithm, Multiplier, and Shift
Factor for Integer Division by Constants
The mentioned algorithm is based on
http://swox.com/~tege/divcnst-pldi94.pdf
Re: question about symmertic evaluation
Posted: Thu May 24, 2007 10:13 pm
by bob
Gerd Isenberg wrote:bob wrote:
The thing is it interlaces nicely with other instructions around it... I have been using several divides in my code to scale parts of the evaluation, and it made absolutely no difference in my NPS...
K8 32-bit idiv is 42 cycles vector path. I can't imagine that this is noise in Crafty. As mentioned, division (and modulus) by constant is therefor target of heavy compiler optimizations, like division by power of two or other tricks based on reciprocal multiplication.
Code: Select all
x idiv (2**i) ::= ( x + ((x>>31) & ((2**i)-1)) ) >> i
Using cdq instead of mov, sar 31 here by the compiler is another reason to simply trust them. Otoh one should be aware that idiv is an expensive, complex instruction compared to add/sub/shift and even imul.
However, when a program spends > 50% of the total search time inside the evaluation, 2-3 divides don't even show up in the timing and don't affect my NPS at all. Verified by lots of testing of course...
Re: question about symmertic evaluation
Posted: Thu May 24, 2007 10:18 pm
by bob
hgm wrote:The latency of divide instructions is such that they almost always cause a hefty pipeline stall. The re-order buffer is simply not large enough to hide the latency, even if it is not on a critical path.
The divide cannot be retired before it has finished, and this means that it will stall the retire stage when all instructions before it in the control stream have been retired. This will create a traffic jam in the re-order buffer, which will fill up with the instructions after the div, that will complete execution, but cannot be retired.
I think nowadays the re-order buffer can hold about 75 uOps, but with an execution throughput of 3 uOps per cycle, it will only take about 25 cycles before all uOps in the re-order buffer have completed execution. The remaining 17 cycles the CPU is then completely stalled.
It is hard to believe that you would not notice this in execution speed. If you don't, it is more likely because the compiler is actually clever enough to eliminate the division, or do it in FP with the aid of a reciprocal. The reciprocal refinement algortithm also has a quite long latency (although faster than 42 clocks), but it has the advantage that it consists of several instructions, and thus completes and retires gradually. A smart compiler can then interleave the critical path of the division with other instructions, and avoid a pipeline stall completely. It even can pipeline several divisions that way, which for the div instructions is impossible.
I actually don't know how (nor do I care) the compiler handles a/b for ints. All I care about is symmetrical answers, which >> won't give directly. Whatever it does the cost inside Crafty is nil. And I am not dividing by a power of 2 either. I have this in a couple of key places:
((s) * (62 - Min(TotalWhitePieces + TotalBlackPieces, 42)) / 86)
which is not exactly a power of 2...
Re: question about symmertic evaluation
Posted: Thu May 24, 2007 10:20 pm
by bob
sje wrote:bob wrote:I'd be willing to bet that a plain divide will not noticably affect his overall search speed. And it will solve this correctly.
Agreed. Furthermore, a decent optimizing compiler will code a power of two division as a right shift where it is possible (and correct) to do so.
except mine are not power of 2 divisions. one divides by 86. Another by 62. Etc...