hgm wrote:The latency of divide instructions is such that they almost always cause a hefty pipeline stall. The re-order buffer is simply not large enough to hide the latency, even if it is not on a critical path.
The divide cannot be retired before it has finished, and this means that it will stall the retire stage when all instructions before it in the control stream have been retired. This will create a traffic jam in the re-order buffer, which will fill up with the instructions after the div, that will complete execution, but cannot be retired.
I think nowadays the re-order buffer can hold about 75 uOps, but with an execution throughput of 3 uOps per cycle, it will only take about 25 cycles before all uOps in the re-order buffer have completed execution. The remaining 17 cycles the CPU is then completely stalled.
It is hard to believe that you would not notice this in execution speed. If you don't, it is more likely because the compiler is actually clever enough to eliminate the division, or do it in FP with the aid of a reciprocal. The reciprocal refinement algortithm also has a quite long latency (although faster than 42 clocks), but it has the advantage that it consists of several instructions, and thus completes and retires gradually. A smart compiler can then interleave the critical path of the division with other instructions, and avoid a pipeline stall completely. It even can pipeline several divisions that way, which for the div instructions is impossible.
Yes, here some assembly snippets generated by the good old msvc6 compiler, where it uses reciprocal fixed point multiplication as well as cdq,and,add,sar for power of two division:
Code: Select all
eax = ecx / 2
00000 8b c1 mov eax, ecx
00002 99 cdq
00003 2b c2 sub eax, edx
00005 d1 f8 sar eax, 1
eax = ecx / 3
00000 b8 56 55 55 55 mov eax, 55555556H
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 8b c8 mov ecx, eax
0000b c1 e9 1f shr ecx, 31
0000e 03 c1 add eax, ecx
eax = ecx / 4
00000 8b c1 mov eax, ecx
00002 99 cdq
00003 83 e2 03 and edx, 3
00006 03 c2 add eax, edx
00008 c1 f8 02 sar eax, 2
eax = ecx / 5
00000 b8 67 66 66 66 mov eax, 66666667H
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 d1 f8 sar eax, 1
0000b 8b c8 mov ecx, eax
0000d c1 e9 1f shr ecx, 31
00010 03 c1 add eax, ecx
eax = ecx / 6
00000 b8 ab aa aa 2a mov eax, 2aaaaaabH
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 8b c8 mov ecx, eax
0000b c1 e9 1f shr ecx, 31
0000e 03 c1 add eax, ecx
eax = ecx / 7
00000 b8 93 24 49 92 mov eax, 92492493H
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 03 c1 add eax, ecx
0000b c1 f8 02 sar eax, 2
0000e 8b c8 mov ecx, eax
00010 c1 e9 1f shr ecx, 31
00013 03 c1 add eax, ecx
eax = ecx / 8
00000 8b c1 mov eax, ecx
00002 99 cdq
00003 83 e2 07 and edx, 7
00006 03 c2 add eax, edx
00008 c1 f8 03 sar eax, 3
eax = ecx / 9
00000 b8 39 8e e3 38 mov eax, 38e38e39H
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 d1 f8 sar eax, 1
0000b 8b c8 mov ecx, eax
0000d c1 e9 1f shr ecx, 31
00010 03 c1 add eax, ecx
eax = ecx / 10
00000 b8 67 66 66 66 mov eax, 66666667H
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 c1 f8 02 sar eax, 2
0000c 8b c8 mov ecx, eax
0000e c1 e9 1f shr ecx, 31
00011 03 c1 add eax, ecx
eax = ecx / 100
00000 b8 1f 85 eb 51 mov eax, 51eb851fH
00005 f7 e9 imul ecx
00007 8b c2 mov eax, edx
00009 c1 f8 05 sar eax, 5
0000c 8b c8 mov ecx, eax
0000e c1 e9 1f shr ecx, 31
00011 03 c1 add eax, ecx