It seems that people are still using this code, because it is still amongst the fastest in the world for these type of calculations. As a result I am now invited to come to Berlin for a week, to teach them how to best utilize the program.
But of course Pentium I has long since been abandoned. It required very special optimization, as all FP calculations were done with these silly 8087 co-processor instructions, which had stack-based register addressing, and needed to be paired with fxch (floating exchange) instructions to dig the value out of the stack you wanted to operate on next. To modern out-of-order CPUs all these extra instructions are just slowing it down.
In fact this hole 8087 instruction set has been sort of abandoned; nowadays there are AVX instructions superceding the already obsolete SSEn instruction sets. These can do 4 double-precision multiplications or addistions per clock (throughput), and even 8/clock in single precision, while on Pentium I I could do only 1 fmul per 2 clocks. So it should be possible to do nearly an order of magnitude faster using AVX (e.g. on my i7 Sandy Bridge) as using those crappy 8087 instructions.
So I am attempting to re-write the code in AVX before I go to Berlin in April. I have no experience whatsoever using the AVX instruction set, however. (I just learned it existed this week...) Is there anyone here that has experience using it? (For things other than Chess, obviously; for Chess floating point is virtually useless.)
The code has to multiply complex numbers in many places, and each complex multiply requires 4 real multiplications and two additions / subtractions. I wondered if there is a standard way to do this in AVX. I was thinking of the following method:
Code: Select all
vbroadcastf128 mem128a, %ymm0 // load same (re, im) pair of doubles in two halves of YMM register
vbroadcastf128 mem128b, %ymm1 // second operand
vpermilpd $1, %ymm0, %ymm0 // swap re and im part of low pair
vmulpd %ymm0, %ymm1, %ymm2 // re*re, im*im, re*im, im*re
vhaddpd %ymm2, %ymm2, %ymm3 // pairs of re*re+im*im and re*im+im*re
?
A second problem is that the vhaddpd gives me two copies of the real part in the high elements, and two copies of the imaginary part in the low elements. While the 'input format' of the multiply had (re, im) pairs in each half. I guess this can be avoided by swapping the high and low pairs of the product in one of the two vhaddps operands:
Code: Select all
vperm2f128 $1, %ymm2, %ymm2, %ymm3 // create a {re*im, im*re, re*re, -im*im} copy
vhaddpd %ymm3, %ymm2, %ymm4 // {re*re-im*im, re*im+im*re, re*im+im*re, re*re-im*im}
Does this make any sense, or are there more efficient ways to do this? This scheme is a bit wasteful on additions, basically doing them in duplicat. This saves later duplication of them by vbroadcastf128 instructions, however. So there is something to be said for keeping complex numbers in the ymm registers as quadruples {re, im, re, im} or {re, im, im, re} in stead of pairs (re, im). As long as you do mostly multiplies.
