Off topic: Floating point number crunching

hgm · Post by **hgm** » Tue Mar 26, 2013 11:40 pm

Many years ago, when I still was a physicist, I wrote a program to calculate time-dependent atomic structure for atoms exposed to laser beams. People used Cray super-computers for such calculations in these days, and I was one of the first to try it on a PC. By using superior numeric algorithms, I could beat them. The code contained hand-optimized assembler for the time-critical loops, and it was optimized for Pentium I.

It seems that people are still using this code, because it is still amongst the fastest in the world for these type of calculations. As a result I am now invited to come to Berlin for a week, to teach them how to best utilize the program.

But of course Pentium I has long since been abandoned. It required very special optimization, as all FP calculations were done with these silly 8087 co-processor instructions, which had stack-based register addressing, and needed to be paired with fxch (floating exchange) instructions to dig the value out of the stack you wanted to operate on next. To modern out-of-order CPUs all these extra instructions are just slowing it down.

In fact this hole 8087 instruction set has been sort of abandoned; nowadays there are AVX instructions superceding the already obsolete SSEn instruction sets. These can do 4 double-precision multiplications or addistions per clock (throughput), and even 8/clock in single precision, while on Pentium I I could do only 1 fmul per 2 clocks. So it should be possible to do nearly an order of magnitude faster using AVX (e.g. on my i7 Sandy Bridge) as using those crappy 8087 instructions.

So I am attempting to re-write the code in AVX before I go to Berlin in April. I have no experience whatsoever using the AVX instruction set, however. (I just learned it existed this week...) Is there anyone here that has experience using it? (For things other than Chess, obviously; for Chess floating point is virtually useless.)

The code has to multiply complex numbers in many places, and each complex multiply requires 4 real multiplications and two additions / subtractions. I wondered if there is a standard way to do this in AVX. I was thinking of the following method:

Code: Select all

  vbroadcastf128 mem128a, %ymm0 // load same (re, im) pair of doubles in two halves of YMM register
  vbroadcastf128 mem128b, %ymm1 // second operand
  vpermilpd $1, %ymm0, %ymm0 // swap re and im part of low pair
  vmulpd %ymm0, %ymm1, %ymm2 // re*re, im*im, re*im, im*re
  vhaddpd %ymm2, %ymm2, %ymm3 // pairs of re*re+im*im and re*im+im*re
  ?

One problem is that I don't need re*re+im*im, but re*re-im*im. There are subadd instructions in the AVX set, but I could not find one for 'horizontal' addition. I can of course flip the sign bit of the im*im product with a vxorpd instruction. This seems a bit of a pity, to use an instruction that does nothing else than just flipping a single bit out of 256, though. And what is worse, I have to come up with a 256-bit operand vector {0., -0., 0., 0.}. It would be bad to waste one register for holding that permanently, but even more bad to waste cache bandwidth by using it directly from memory each time.

A second problem is that the vhaddpd gives me two copies of the real part in the high elements, and two copies of the imaginary part in the low elements. While the 'input format' of the multiply had (re, im) pairs in each half. I guess this can be avoided by swapping the high and low pairs of the product in one of the two vhaddps operands:

Code: Select all

  vperm2f128 $1, %ymm2, %ymm2, %ymm3 // create a {re*im, im*re, re*re, -im*im} copy
  vhaddpd %ymm3, %ymm2, %ymm4 // {re*re-im*im, re*im+im*re, re*im+im*re, re*re-im*im}

This produces an output vector that duplicates the product in a way that would be directly suitable for performing a new multiplication on it by the same scheme. One of the two factors would need its 3rd and 4th member swapped compared to the other, so it would not matter that much in what order they come out. The 1st and 2nd element are always a conventional (re, im) pair, and can be stored to memory as such.

Does this make any sense, or are there more efficient ways to do this? This scheme is a bit wasteful on additions, basically doing them in duplicat. This saves later duplication of them by vbroadcastf128 instructions, however. So there is something to be said for keeping complex numbers in the ymm registers as quadruples {re, im, re, im} or {re, im, im, re} in stead of pairs (re, im). As long as you do mostly multiplies.

Evert · Post by **Evert** » Wed Mar 27, 2013 6:41 am

I've never tried AVX assembly, so I'm no help there, but I would recommend seeing how a plain C (99) or FORTRAN (95) version performs. It may turn out that the performance is perfectly acceptable and safe you a lot of trouble (especially in the future or if people want to run the code on an older non-AVX machine).

Both C99 and FORTRAN have complex types and I'd expect the compiler to know how to generate efficient code for them (perhaps more so for a FORTRAN compiler than a C compiler), but again I haven't used this directly myself (not a lot of use for complex number in stellar structure calculations). For FORTRAN compilers, gfortran is decent these days (and free, obviously) but the Intel compiler is free (for personal use) on Linux. Or I could do some benchmarks for you, if you'd want to go that route.

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Mar 27, 2013 8:30 am

I have no experience with AVX float/double instructions. Have you tried gcc -arch:AVX to look at the generated code?

See answer 1 of:
http://stackoverflow.com/questions/7839 ... ut-archavx

"Every time you improperly switch back and forth between SSE and AVX instructions, you will pay an extremely high (~70) cycle penalty."

and
http://www.phoronix.com/scan.php?page=n ... px=MTA3NjE

hgm · Post by **hgm** » Wed Mar 27, 2013 10:13 am

Well, one of the problems is that this was all done so long ago that the original C-source of the optimized routines seems to be lost. (I probably have it somewhere on a floppy, but even I could find that amongst the ~300 other floppies I have stashed in a box somewhere, I no longer have a way to transfer files from floppies to modern machines.) So basically all that is left are the assembly sources with the hand-optimized loops.

In those days I could beat the compiler by an order of magnitude: my assembly code did 66 MFLOPS on a 200MHz Pentium-I, while the -O2 compiled code did only 6 MFLOPS. But this was with the DJGPP compiler running on a DOS extender, which did not even align doubles on an 8-byte boundary.

There are only 4 time-critiacl loops, and they are really small. Like:

Code: Select all

for(i=1; i<MANY; i++) {
  h = a[i]*y[i] + b[i]*h;
  s2 = s; s = h + x[i];
  d2 = d; d = h - x[i];
  ss[i] = (ss[i-1] - s) * r1 + s2;
  dd[i] = (dd[i-1] - d) * r2 + d2;
}

with all variables (except r1, r2) complex. It should be very doable to replace these few statements by some AVX assembly.

Evert · Post by **Evert** » Wed Mar 27, 2013 11:12 am

hgm wrote:Well, one of the problems is that this was all done so long ago that the original C-source of the optimized routines seems to be lost. (I probably have it somewhere on a floppy, but even I could find that amongst the ~300 other floppies I have stashed in a box somewhere, I no longer have a way to transfer files from floppies to modern machines.) So basically all that is left are the assembly sources with the hand-optimized loops.

In those days I could beat the compiler by an order of magnitude: my assembly code did 66 MFLOPS on a 200MHz Pentium-I, while the -O2 compiled code did only 6 MFLOPS. But this was with the DJGPP compiler running on a DOS extender, which did not even align doubles on an 8-byte boundary.

There are only 4 time-critiacl loops, and they are really small. Like:
Code: Select all
for(i=1; i<MANY; i++) {
  h = a[i]*y[i] + b[i]*h;
  s2 = s; s = h + x[i];
  d2 = d; d = h - x[i];
  ss[i] = (ss[i-1] - s) * r1 + s2;
  dd[i] = (dd[i-1] - d) * r2 + d2;
}
with all variables (except r1, r2) complex. It should be very doable to replace these few statements by some AVX assembly.

Is the entire program written in assembler or just the time-critical parts? If it's just the time-critical parts (which is the impression I had at first) then it shouldn't be too hard to redo in C, right?

No doubt it's possible to replace them by hand-written AVX assembly, but the question is whether it would beat what a modern compiler can do with AVX enabled (especially if you don't have a lot of experience writing AVX code)...

I have fond memories of DJGPP. I particularly remember it was much faster than MinGW back then (and I also remember programs on an UltraSPARC running faster than on a Pentium with double the clock speed).

hgm · Post by **hgm** » Wed Mar 27, 2013 12:13 pm

Evert wrote:Is the entire program written in assembler or just the time-critical parts? If it's just the time-critical parts (which is the impression I had at first) then it shouldn't be too hard to redo in C, right?

It is quite nasty, actually: The original C routines had much more code than just the loop. The first few iterations of the loop were taken out of it, because they needed to be done slightly different because of boundary conditions. And they also calculated the various constants that were used during the loop. And it is so long ago now that I forgot exactly what it did. (It had something to do with making an U.L decomposition of a tri-diagonal matrix with constant diagonals.)

What I did was compile those routines with the -S option to generate assembler. And then I hacked the loop part.

But now that the C sources are lost, I would have to decompile the assembly of the non-loop code to recover it. Which seems more work than just writing an AVX loop from scratch.

It would be interesting to compare my own handiwork with that of an AVX-enabled compiler, though. Perhaps a solution would be to replace the entire loop in the assembly code by a call to a subroutine conforming to the C calling convention (and passing all constants in memory variables, initialized 8087 FP registers and pointers as parameters). Then I can write a C routine for just the loop from scratch, and compile that.

I just hope my compiler supports AVX. The latest version of gcc does not seem to support the -mno-cygwin option. So I am still using gcc 3.4.4. This is a very undesirable situation, but I don't know how else I can compile WinBoard. I also cannot use 64-bit mode for this reason: the x64-mingw versions seem to automatically upgrade gcc when I install them under Cygwin, so that I can no longer use it to create Windows binaries.

jdart · Post by **jdart** » Wed Mar 27, 2013 3:11 pm

I think you should try doing this in FORTRAN or C99, which have native complex types. And let the compiler do the AVX generation.

--JOn

ZirconiumX · Post by **ZirconiumX** » Wed Mar 27, 2013 4:12 pm

hgm wrote:
Evert wrote:Is the entire program written in assembler or just the time-critical parts? If it's just the time-critical parts (which is the impression I had at first) then it shouldn't be too hard to redo in C, right?
It is quite nasty, actually: The original C routines had much more code than just the loop. The first few iterations of the loop were taken out of it, because they needed to be done slightly different because of boundary conditions. And they also calculated the various constants that were used during the loop. And it is so long ago now that I forgot exactly what it did. (It had something to do with making an U.L decomposition of a tri-diagonal matrix with constant diagonals.)

What I did was compile those routines with the -S option to generate assembler. And then I hacked the loop part.

But now that the C sources are lost, I would have to decompile the assembly of the non-loop code to recover it. Which seems more work than just writing an AVX loop from scratch.

It would be interesting to compare my own handiwork with that of an AVX-enabled compiler, though. Perhaps a solution would be to replace the entire loop in the assembly code by a call to a subroutine conforming to the C calling convention (and passing all constants in memory variables, initialized 8087 FP registers and pointers as parameters). Then I can write a C routine for just the loop from scratch, and compile that.

I just hope my compiler supports AVX. The latest version of gcc does not seem to support the -mno-cygwin option. So I am still using gcc 3.4.4. This is a very undesirable situation, but I don't know how else I can compile WinBoard. I also cannot use 64-bit mode for this reason: the x64-mingw versions seem to automatically upgrade gcc when I install them under Cygwin, so that I can no longer use it to create Windows binaries.

gcc -mno-cygwin was deprecated in gcc 4.0 IIRC, and removed in 4.7.

You really shouldn't need to use it.

Matthew:out

hgm · Post by **hgm** » Wed Mar 27, 2013 5:44 pm

What is the alternative, then, to get a working Windows binary?

Evert · Post by **Evert** » Wed Mar 27, 2013 6:17 pm

What is -mno-cygwin supposed to do? It's been years since I had Windows, but back then I used MinGW and never touched Cygwin. I had no problems making binaries, either GUI or console...

Off topic: Floating point number crunching

Off topic: Floating point number crunching

Re: Off topic: Floating point number crunching

Re: Off topic: Floating point number crunching

Re: Off topic: Floating point number crunching

Re: Off topic: Floating point number crunching

Re: Off topic: Floating point number crunching

Re: Off topic: Floating point number crunching

Re: Off topic: Floating point number crunching

Re: Off topic: Floating point number crunching

Re: Off topic: Floating point number crunching