TalkChess.com

Posted: **Thu May 24, 2007 9:37 am**

bob wrote: The thing is it interlaces nicely with other instructions around it... I have been using several divides in my code to scale parts of the evaluation, and it made absolutely no difference in my NPS...

K8 32-bit idiv is 42 cycles vector path. I can't imagine that this is noise in Crafty. As mentioned, division (and modulus) by constant is therefor target of heavy compiler optimizations, like division by power of two or other tricks based on reciprocal multiplication.

Code: Select all

x idiv &#40;2**i&#41; &#58;&#58;= ( x + (&#40;x>>31&#41; & (&#40;2**i&#41;-1&#41;) ) >> i

Using cdq instead of mov, sar 31 here by the compiler is another reason to simply trust them. Otoh one should be aware that idiv is an expensive, complex instruction compared to add/sub/shift and even imul.

Posted: **Thu May 24, 2007 10:06 am**

The latency of divide instructions is such that they almost always cause a hefty pipeline stall. The re-order buffer is simply not large enough to hide the latency, even if it is not on a critical path.

The divide cannot be retired before it has finished, and this means that it will stall the retire stage when all instructions before it in the control stream have been retired. This will create a traffic jam in the re-order buffer, which will fill up with the instructions after the div, that will complete execution, but cannot be retired.

I think nowadays the re-order buffer can hold about 75 uOps, but with an execution throughput of 3 uOps per cycle, it will only take about 25 cycles before all uOps in the re-order buffer have completed execution. The remaining 17 cycles the CPU is then completely stalled.

It is hard to believe that you would not notice this in execution speed. If you don't, it is more likely because the compiler is actually clever enough to eliminate the division, or do it in FP with the aid of a reciprocal. The reciprocal refinement algortithm also has a quite long latency (although faster than 42 clocks), but it has the advantage that it consists of several instructions, and thus completes and retires gradually. A smart compiler can then interleave the critical path of the division with other instructions, and avoid a pipeline stall completely. It even can pipeline several divisions that way, which for the div instructions is impossible.

Posted: **Thu May 24, 2007 11:24 am**

hgm wrote:The latency of divide instructions is such that they almost always cause a hefty pipeline stall. The re-order buffer is simply not large enough to hide the latency, even if it is not on a critical path.

The divide cannot be retired before it has finished, and this means that it will stall the retire stage when all instructions before it in the control stream have been retired. This will create a traffic jam in the re-order buffer, which will fill up with the instructions after the div, that will complete execution, but cannot be retired.

I think nowadays the re-order buffer can hold about 75 uOps, but with an execution throughput of 3 uOps per cycle, it will only take about 25 cycles before all uOps in the re-order buffer have completed execution. The remaining 17 cycles the CPU is then completely stalled.

It is hard to believe that you would not notice this in execution speed. If you don't, it is more likely because the compiler is actually clever enough to eliminate the division, or do it in FP with the aid of a reciprocal. The reciprocal refinement algortithm also has a quite long latency (although faster than 42 clocks), but it has the advantage that it consists of several instructions, and thus completes and retires gradually. A smart compiler can then interleave the critical path of the division with other instructions, and avoid a pipeline stall completely. It even can pipeline several divisions that way, which for the div instructions is impossible.

Yes, here some assembly snippets generated by the good old msvc6 compiler, where it uses reciprocal fixed point multiplication as well as cdq,and,add,sar for power of two division:

Code: Select all

eax = ecx / 2
  00000	8b c1		 mov	 eax, ecx
  00002	99		 cdq
  00003	2b c2		 sub	 eax, edx
  00005	d1 f8		 sar	 eax, 1

eax = ecx / 3
  00000	b8 56 55 55 55	 mov	 eax, 55555556H
  00005	f7 e9		 imul	 ecx
  00007	8b c2		 mov	 eax, edx
  00009	8b c8		 mov	 ecx, eax
  0000b	c1 e9 1f	 shr	 ecx, 31
  0000e	03 c1		 add	 eax, ecx

eax = ecx / 4
  00000	8b c1		 mov	 eax, ecx
  00002	99		 cdq
  00003	83 e2 03	 and	 edx, 3
  00006	03 c2		 add	 eax, edx
  00008	c1 f8 02	 sar	 eax, 2

eax = ecx / 5
  00000	b8 67 66 66 66	 mov	 eax, 66666667H
  00005	f7 e9		 imul	 ecx
  00007	8b c2		 mov	 eax, edx
  00009	d1 f8		 sar	 eax, 1
  0000b	8b c8		 mov	 ecx, eax
  0000d	c1 e9 1f	 shr	 ecx, 31
  00010	03 c1		 add	 eax, ecx

eax = ecx / 6
  00000	b8 ab aa aa 2a	 mov	 eax, 2aaaaaabH
  00005	f7 e9		 imul	 ecx
  00007	8b c2		 mov	 eax, edx
  00009	8b c8		 mov	 ecx, eax
  0000b	c1 e9 1f	 shr	 ecx, 31
  0000e	03 c1		 add	 eax, ecx

eax = ecx / 7
  00000	b8 93 24 49 92	 mov	 eax, 92492493H
  00005	f7 e9		 imul	 ecx
  00007	8b c2		 mov	 eax, edx
  00009	03 c1		 add	 eax, ecx
  0000b	c1 f8 02	 sar	 eax, 2
  0000e	8b c8		 mov	 ecx, eax
  00010	c1 e9 1f	 shr	 ecx, 31
  00013	03 c1		 add	 eax, ecx

eax = ecx / 8
  00000	8b c1		 mov	 eax, ecx
  00002	99		 cdq
  00003	83 e2 07	 and	 edx, 7
  00006	03 c2		 add	 eax, edx
  00008	c1 f8 03	 sar	 eax, 3

eax = ecx / 9
  00000	b8 39 8e e3 38	 mov	 eax, 38e38e39H
  00005	f7 e9		 imul	 ecx
  00007	8b c2		 mov	 eax, edx
  00009	d1 f8		 sar	 eax, 1
  0000b	8b c8		 mov	 ecx, eax
  0000d	c1 e9 1f	 shr	 ecx, 31
  00010	03 c1		 add	 eax, ecx

eax = ecx / 10
  00000	b8 67 66 66 66	 mov	 eax, 66666667H
  00005	f7 e9		 imul	 ecx
  00007	8b c2		 mov	 eax, edx
  00009	c1 f8 02	 sar	 eax, 2
  0000c	8b c8		 mov	 ecx, eax
  0000e	c1 e9 1f	 shr	 ecx, 31
  00011	03 c1		 add	 eax, ecx

eax = ecx / 100
  00000	b8 1f 85 eb 51	 mov	 eax, 51eb851fH
  00005	f7 e9		 imul	 ecx
  00007	8b c2		 mov	 eax, edx
  00009	c1 f8 05	 sar	 eax, 5
  0000c	8b c8		 mov	 ecx, eax
  0000e	c1 e9 1f	 shr	 ecx, 31
  00011	03 c1		 add	 eax, ecx

Posted: **Thu May 24, 2007 12:39 pm**

Indeed, for division by a constant multiplication with the reciprocal seems the obvious choice. More tricky is the case where you have to divide through a variable unknown at compile time.

The best way to do that would be to make use of the SSE (or 3DNow!) reciprocal instruction, (a fast table lookup), followed by two or three iterations of x = x*(2-n*x) to obtain 1/n.

Posted: **Thu May 24, 2007 1:36 pm**

signed modulo seemed a bit harder to optimize

Code: Select all

eax = ecx % 2
  00000	8b c1		 mov	 eax, ecx
  00002	25 01 00 00 80	 and	 eax, 80000001H
  00007	79 05		 jns	 SHORT $L807
  00009	48		 dec	 eax
  0000a	83 c8 fe	 or	 eax, fffffffeH
  0000d	40		 inc	 eax
$L807&#58;

eax = ecx % 3
  00000	8b c1		 mov	 eax, ecx
  00002	99		 cdq
  00003	b9 03 00 00 00	 mov	 ecx, 3
  00008	f7 f9		 idiv	 ecx
  0000a	8b c2		 mov	 eax, edx

Posted: **Thu May 24, 2007 1:55 pm**

Yes, I try to avoid modulo wherever I can. Even unsigned (unless it is by a power of 2, of course).

For instance, to generate a number in the range 0-4 from the hash key, I don't use Key%5, but in stead

(Key&0xFFFFFFF)*5 >> 28.

(I need that to pick a primary entry to hash to within each bucket. The bucket number is then derived from the key as Key & (HSIZE-1), which makes it reasonably independent. ) This is not an alternative way to do modulo, but it is a good way to get a small integer with reasonably homogenous distribution from the hash key.

Posted: **Thu May 24, 2007 9:46 pm**

for further references:

Software Optimization Guide for AMD64 Processors

8.1 Replacing Division with Multiplication
8.8 Derivation of Algorithm, Multiplier, and Shift
Factor for Integer Division by Constants

The mentioned algorithm is based on
http://swox.com/~tege/divcnst-pldi94.pdf

Posted: **Thu May 24, 2007 10:13 pm**

Gerd Isenberg wrote:
bob wrote: The thing is it interlaces nicely with other instructions around it... I have been using several divides in my code to scale parts of the evaluation, and it made absolutely no difference in my NPS...
K8 32-bit idiv is 42 cycles vector path. I can't imagine that this is noise in Crafty. As mentioned, division (and modulus) by constant is therefor target of heavy compiler optimizations, like division by power of two or other tricks based on reciprocal multiplication.
Code: Select all
x idiv &#40;2**i&#41; &#58;&#58;= ( x + (&#40;x>>31&#41; & (&#40;2**i&#41;-1&#41;) ) >> i
Using cdq instead of mov, sar 31 here by the compiler is another reason to simply trust them. Otoh one should be aware that idiv is an expensive, complex instruction compared to add/sub/shift and even imul.

However, when a program spends > 50% of the total search time inside the evaluation, 2-3 divides don't even show up in the timing and don't affect my NPS at all. Verified by lots of testing of course...

Posted: **Thu May 24, 2007 10:18 pm**

hgm wrote:The latency of divide instructions is such that they almost always cause a hefty pipeline stall. The re-order buffer is simply not large enough to hide the latency, even if it is not on a critical path.

The divide cannot be retired before it has finished, and this means that it will stall the retire stage when all instructions before it in the control stream have been retired. This will create a traffic jam in the re-order buffer, which will fill up with the instructions after the div, that will complete execution, but cannot be retired.

I think nowadays the re-order buffer can hold about 75 uOps, but with an execution throughput of 3 uOps per cycle, it will only take about 25 cycles before all uOps in the re-order buffer have completed execution. The remaining 17 cycles the CPU is then completely stalled.

It is hard to believe that you would not notice this in execution speed. If you don't, it is more likely because the compiler is actually clever enough to eliminate the division, or do it in FP with the aid of a reciprocal. The reciprocal refinement algortithm also has a quite long latency (although faster than 42 clocks), but it has the advantage that it consists of several instructions, and thus completes and retires gradually. A smart compiler can then interleave the critical path of the division with other instructions, and avoid a pipeline stall completely. It even can pipeline several divisions that way, which for the div instructions is impossible.

I actually don't know how (nor do I care) the compiler handles a/b for ints. All I care about is symmetrical answers, which >> won't give directly. Whatever it does the cost inside Crafty is nil. And I am not dividing by a power of 2 either. I have this in a couple of key places:

((s) * (62 - Min(TotalWhitePieces + TotalBlackPieces, 42)) / 86)

which is not exactly a power of 2...

Posted: **Thu May 24, 2007 10:20 pm**

sje wrote:
bob wrote:I'd be willing to bet that a plain divide will not noticably affect his overall search speed. And it will solve this correctly.
Agreed. Furthermore, a decent optimizing compiler will code a power of two division as a right shift where it is possible (and correct) to do so.

except mine are not power of 2 divisions. one divides by 86. Another by 62. Etc...

TalkChess.com

question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation