Over the years I've read a good bit of discussion about utilizing whatever bells and whistles are available on top-of-the-line new CPUs: SSE, MMX, MMX2, or whatever it may be.
I'm curious, since most authors use little to no assembler in their code: how do you utilize these features from higher-level languages?
Things like writing your code to minimize cache misses seem interesting, but I'm not sure how to check for that, or what kind of source code yields better compiler output so that the resulting binary truly maximizes the architecture.
I have seen and tested gcc optimization flags like -O3, or manually specifying -march=i686 instead of plain i386. But I thought that while this lets the compiler emit somewhat better pre-optimized binary output, the real benefit comes from the original source code itself guiding it toward specific implementations. The only thing I can think of from a C point of view is using 64-bit ints for bit bases, since the resulting binary code could use real 64-bit values rather than having to split each one into two 32-bit ints and hack around it on 32-bit systems.
-Josh
Utilizing Architecture Specific Functions from a HL Language
Moderator: Ras
jshriver
- Posts: 1370
- Joined: Wed Mar 08, 2006 9:41 pm
- Location: Morgantown, WV, USA
wgarvin
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: Utilizing Architecture Specific Functions from a HL Language
For using specific instructions (BSF, POPCNT, etc.) you usually have two choices: either a compiler-supported intrinsic (something like _BitScanForward64) or some sort of "inline assembly" in a format the compiler recognizes. Most production-quality compilers (MSVC, Intel, GCC) support one or both of these, but the syntax varies. So you end up wrapping them in your own inline function and using #ifdefs around each version of the function: usually one for each compiler/platform combo you care about, plus a "default" version that calculates the result some other way without using the platform-specific functionality.
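A minimal sketch of that #ifdef-per-compiler pattern, using popcount as the example (the function name popcount32 is illustrative, not from any library):

```cpp
#include <cstdint>

#if defined(__GNUC__) || defined(__clang__)
// GCC/Clang intrinsic; maps to POPCNT when compiled with -mpopcnt.
static inline int popcount32(uint32_t x) {
    return __builtin_popcount(x);
}
#elif defined(_MSC_VER)
#include <intrin.h>
// MSVC intrinsic.
static inline int popcount32(uint32_t x) {
    return static_cast<int>(__popcnt(x));
}
#else
// "Default" version: parallel bit-count in plain C, compiles anywhere.
static inline int popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return static_cast<int>((x * 0x01010101u) >> 24);
}
#endif
```

Callers just use popcount32() everywhere; only this one header needs to know which compiler is building the code.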
For something like floating-point calculations, SSE2 math instructions will already be used by most compilers if you tell the compiler to target at least that instruction set. In theory some compilers can auto-vectorize, but if you really want to get the performance gains in real programs from doing SIMD on 4 floats at once, you probably have to write your code for that. Each compiler supports some kind of intrinsic types ("__vector4" or "vector float" or "vec_float4" or similar) and a set of intrinsic functions you can call to use the instruction set for the vector registers. SSE, Altivec and Cell SPE are the three most common. The intrinsics for each instruction set are completely different, and each has some functionality not provided by all of the instruction sets, but if you just want to do basic things like add, subtract, multiply, negate, min, max, then they all have those operations. So you can write your own "wrapper class" that wraps the intrinsic type and provides a set of operators (or methods) that do the operations you're interested in: MultiplyAdd, Dot3, Dot4, etc., and then implement a copy of this for each instruction set, plus a generic version that just uses regular floats and will compile on anything, and use #ifdef again to select which version to use when you compile. Some people prefer not to use an actual wrapper class, but just to have a typedef and a bunch of standalone functions that operate on arguments of that type.
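A stripped-down sketch of that typedef-plus-functions style, assuming SSE as the one "fast" path (the names Vec4, vadd, vmul, vload, vstore are illustrative):

```cpp
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
// SSE version: four floats in one XMM register.
struct Vec4 { __m128 v; };
static inline Vec4 vload(const float* p)      { return { _mm_loadu_ps(p) }; }
static inline void vstore(float* p, Vec4 a)   { _mm_storeu_ps(p, a.v); }
static inline Vec4 vadd(Vec4 a, Vec4 b)       { return { _mm_add_ps(a.v, b.v) }; }
static inline Vec4 vmul(Vec4 a, Vec4 b)       { return { _mm_mul_ps(a.v, b.v) }; }
#else
// Generic version: regular floats, compiles on anything.
struct Vec4 { float v[4]; };
static inline Vec4 vload(const float* p) {
    Vec4 r; for (int i = 0; i < 4; ++i) r.v[i] = p[i]; return r;
}
static inline void vstore(float* p, Vec4 a) {
    for (int i = 0; i < 4; ++i) p[i] = a.v[i];
}
static inline Vec4 vadd(Vec4 a, Vec4 b) {
    Vec4 r; for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i]; return r;
}
static inline Vec4 vmul(Vec4 a, Vec4 b) {
    Vec4 r; for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i]; return r;
}
#endif
```

An Altivec or SPE port would add one more #ifdef branch with the same five names; nothing that calls vadd/vmul would change.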
Anyway, the important thing with both of these examples is that C and C++ compilers sometimes provide non-portable facilities that are useful enough for your application that it's worth wrapping them up in your own portable wrapper. Your wrapper might need to be re-implemented for each new platform, but all of the code using your wrapper should work without changes. And if you always write a "generic" fallback version first, then you can compile for any platform and it will still work; the program will just run a little bit slower until the wrapper has been specialized for that platform.
IMO the "generic" version is always worth writing, for two reasons:
(1) There will always be some platform that doesn't have POPCNT, or doesn't have any kind of SIMD, etc. Just imagine you port your program to a cell phone. Even if the new platform does have some low-level equivalent you could take advantage of, you can just use the "generic" version until you have time to write the code for it.
(2) For debugging, it's very useful to be able to turn off the "fancy stuff" and see whether the program works with just the generic version of the functions. It's a fast way to discover whether the "fancy stuff" has something to do with your problem or not. For example: some vector operation sometimes produces NaN in the 4th component, and you don't know why. You think the problem is coming from your SIMD SquareRootEstimate function. By turning off the "fancy stuff", you can find out if this problem is some peculiarity of the SIMD instructions that you use for SquareRootEstimate on this platform, or whether the code still produces NaN when you run it with the "simple" version that just calls C library sqrt().
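That on/off switch can be a single macro. A sketch, using the SquareRootEstimate example from above (the macro name USE_FAST_MATH is illustrative; the fast path uses SSE's reciprocal-sqrt estimate, which trades precision for speed):

```cpp
#include <cmath>

// Comment this out to force the "simple" version everywhere while debugging.
// #define USE_FAST_MATH 1

#if defined(USE_FAST_MATH) && (defined(__SSE__) || defined(_M_X64))
#include <xmmintrin.h>
static inline float SquareRootEstimate(float x) {
    // "Fancy" version: sqrt(x) ~= x * rsqrt(x). Fast but only ~12 bits
    // of precision, and its NaN/zero behavior differs from sqrtf().
    __m128 v = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_mul_ss(v, _mm_rsqrt_ss(v)));
}
#else
static inline float SquareRootEstimate(float x) {
    return std::sqrt(x);   // generic version: exact, portable, slower
}
#endif
```

If the NaN disappears when USE_FAST_MATH is off, you've narrowed the bug to the SIMD path in a one-line change.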