Programming for superscalar/pipelined/out-of-order/speculative CPUs?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Programming for superscalar/pipelined/out-of-order/speculative CPUs?

Post by smatovic »

Either I ignored this topic, or there are not many posts on this.

I admit I am not into assembly, but what do you do in C/C++ with regard to modern
speculative CPUs with multiple execution units?

The only thing I can imagine is to unroll loops and avoid branches.

Are there any general rules to follow? Or do you assembly guys really know how
to order different instructions in the right way for each CPU architecture?

--
Srdja
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Programming for superscalar/pipelined/out-of-order/speculative CPUs?

Post by dragontamer5788 »

1. Profile -- You don't know whether you have a problem unless you measure, so measuring comes first and foremost. Don't necessarily think about superscalar / pipelined / whatever execution; at the end of the day, it's the execution time that matters. Use hardware performance counters to measure events (e.g. cache hits, cache misses, branch-prediction hit/miss ratios, and so on).

2. Get to L1 and L2 cache before anything else -- L1 cache would be ideal, but L2 cache is still within the core and is surprisingly capable. All of the superscalar / pipelined / out-of-order stuff doesn't really matter if your data is in L3 or beyond, because you'd be memory-constrained. L3 cache (and DDR4 RAM) is shared between multiple cores and can slow you down. (The second sketch after this list shows one way to keep a hot data structure cache-line sized.)

3. Create opportunities for instruction-level parallelism. Superscalar, pipelined, out-of-order, speculative... these are all the same thing: instruction-level parallelism. Each of those techniques is a way for a modern CPU core to execute the instruction stream in parallel instead of sequentially. Add SIMD vectorization as well, because autovectorizers are getting better. (The first sketch after this list shows the classic multiple-accumulator trick.)

4. Measure. Yes, profile again. After writing code, you can figure out whether the ILP you created was actually picked up by the compiler and the CPU by timing it. If it got faster, you know you're doing things correctly.

5. Hyperthread. If you can't discover enough parallelism in one thread, throw two threads onto a core. Make sure that both threads' working sets together fit in L1 (or L2) cache.
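
To tie points 1, 3, and 4 together, here is a minimal toy sketch of mine (not from any engine). The first loop is one serial dependency chain; the second splits the work into four independent chains. Build with -O2 and time it; how big the gap is depends on your compiler, since a good auto-vectorizer may rewrite the serial loop for you.

Code: Select all

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// One accumulator: every add depends on the previous one, a single
// serial dependency chain, so there is little ILP to find.
static uint64_t sum_serial(const std::vector<uint64_t>& v) {
    uint64_t s = 0;
    for (uint64_t x : v) s += x;
    return s;
}

// Four independent accumulators: four dependency chains that the
// out-of-order core (or the auto-vectorizer) can run in parallel.
static uint64_t sum_ilp(const std::vector<uint64_t>& v) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0, n = v.size() & ~size_t(3);
    for (; i < n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];
    return s0 + s1 + s2 + s3;
}

int main() {
    std::vector<uint64_t> v(1 << 12, 3);  // 32 KB of data: fits in L1 (point 2)
    for (auto* fn : {sum_serial, sum_ilp}) {
        auto t0 = std::chrono::steady_clock::now();
        uint64_t s = 0;
        for (int r = 0; r < 100000; ++r) s += fn(v);
        auto t1 = std::chrono::steady_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("sum=%llu time=%lld us\n", (unsigned long long)s, (long long)us);
    }
}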
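And a sketch for point 2: keep your hot data structures small and check it at compile time. The entry layout below is hypothetical (the field names are made up, not taken from any real engine); the idea is that a 16-byte entry lets a 64-byte cache line hold a 4-entry bucket, so one probe touches exactly one line.

Code: Select all

#include <cstdint>

// Hypothetical packed transposition-table entry: 16 bytes, no padding.
struct TTEntry {
    uint64_t key;    // zobrist hash of the position
    int16_t  score;  // evaluation in centipawns
    uint16_t move;   // packed from/to/promotion
    uint8_t  depth;  // remaining search depth
    uint8_t  flags;  // bound type: exact / lower / upper
    uint16_t age;    // for the replacement policy
};
static_assert(sizeof(TTEntry) == 16, "entry grew; bucket no longer fits a cache line");

// One bucket = one 64-byte cache line.
struct alignas(64) TTBucket {
    TTEntry entry[4];
};
static_assert(sizeof(TTBucket) == 64, "bucket must be exactly one cache line");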

-----------
> Or do you assembly guys really know how to order different instructions in the right way for each CPU architecture?
Clang and GCC apply the same optimization techniques to your code no matter which CPU it targets. All modern CPUs (even cell-phone CPUs) use variants of Tomasulo's algorithm to achieve superscalar, pipelined, out-of-order instruction-level parallelism, and all of them search for more parallelism through speculative execution.

If you learn one modern microarchitecture, you've learned all of them. ARM, x86, POWER9: they all have the same features for cutting dependencies while searching for ILP. Skylake may have 8 ports and Zen 10, or one instruction (e.g. division or modulo) may take 20 clock ticks on one CPU and 80 on another, but that doesn't matter in most cases. 20 clock ticks is still very slow: enough time for 60 additions or XOR instructions on 256-bit data on Skylake.

While Skylake's Port 5 is doing a division, you want Port 0 and Port 1 to have something to do in parallel with that 20-cycle division instruction. It's all instruction-level parallelism, and the general thought process is the same no matter what CPU you use today. If you want to know how modern CPUs work, study Tomasulo's algorithm and work it out by hand. All the superscalar / pipelined / out-of-order stuff comes down to variants of Tomasulo's algorithm reimplemented on different CPUs.
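
A toy illustration of that (my own sketch, not from any real codebase): in the first loop everything feeds the divider, so the core mostly waits; in the second, the adds form an independent chain that the ALU ports can retire while the divide is in flight.

Code: Select all

#include <cstddef>
#include <cstdint>

// Every operation feeds the next divide: the core spends most of its
// time stalled on the long-latency divider port.
uint64_t all_dependent(const uint64_t* a, const uint64_t* b, size_t n) {
    uint64_t acc = 1;
    for (size_t i = 0; i < n; ++i)
        acc = (acc + a[i]) / (b[i] | 1);  // "| 1" just avoids division by zero
    return acc;
}

// The divides still form a serial chain, but the adds are independent
// of it, so the scheduler issues them to the other ports while the
// divider is busy -- they come along almost for free.
uint64_t adds_overlap(const uint64_t* a, const uint64_t* b, size_t n) {
    uint64_t q = 1, s = 0;
    for (size_t i = 0; i < n; ++i) {
        q = (q + 1) / (b[i] | 1);  // long-latency chain on the divider port
        s += a[i];                 // independent chain on the ALU ports
    }
    return q + s;
}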

------------

Unfortunately, most simple examples are automatically detected by compilers and converted into parallel form. I don't have time right now to make a good example, but I'll try to post one using -O0 (optimizations disabled), so that you can get an idea of how and why the generated assembly gets faster with these compiler tricks.

HG Muller's recent thread (http://talkchess.com/forum3/viewtopic.php?f=7&t=72539) is an example of converting a loop into a more parallel form. The compiler / CPU can detect that the two-loop solution is more parallel than the one-loop solution, so more of that code executes in parallel.
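
In the meantime, here is a quick toy sketch of the general pattern (my own example; I don't remember whether HGM's thread used exactly this transformation). A prefix sum is one serial dependency chain of length n; splitting it into two passes shortens the chains to the block size and makes the fix-up adds independent.

Code: Select all

#include <cstddef>

// One loop: b[i] depends on b[i-1], a single serial chain of length n.
void prefix_one_loop(const int* a, int* b, size_t n) {
    int run = 0;
    for (size_t i = 0; i < n; ++i) { run += a[i]; b[i] = run; }
}

// Two loops: pass 1 computes a local prefix sum inside each block
// (blocks are independent of each other); pass 2 adds each block's
// starting offset (independent adds within a block). The serial
// chains are now only B elements long instead of n.
void prefix_two_loops(const int* a, int* b, size_t n) {
    const size_t B = 256;
    for (size_t lo = 0; lo < n; lo += B) {            // pass 1
        size_t hi = lo + B < n ? lo + B : n;
        int run = 0;
        for (size_t i = lo; i < hi; ++i) { run += a[i]; b[i] = run; }
    }
    int offset = 0;
    for (size_t lo = 0; lo < n; lo += B) {            // pass 2
        size_t hi = lo + B < n ? lo + B : n;
        for (size_t i = lo; i < hi; ++i) b[i] += offset;
        offset = b[hi - 1];
    }
}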
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Programming for superscalar/pipelined/out-of-order/speculative CPUs?

Post by mar »

I mostly trust the compiler.

Instruction scheduling, loop unrolling, branch avoidance where possible, auto-vectorization (to some extent; the Microsoft compiler still lags behind quite a bit)
- all of this you get out of the box from modern compilers.

SIMD intrinsics are one of the areas where it's still relatively easy to outperform the compiler.
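
A trivial sketch of what I mean (SSE2, x86-64 only; toy code of mine, n assumed to be a multiple of 8):

Code: Select all

#include <immintrin.h>  // SSE2 intrinsics, baseline on x86-64
#include <cstddef>

// c[i] = a[i] + b[i], eight 16-bit lanes per instruction.
void add_epi16(const short* a, const short* b, short* c, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(c + i), _mm_add_epi16(va, vb));
    }
}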

When it comes to multithreading, I'm using a custom job system in my current project rather than relying on what C++ has to offer (libraries/OpenMP).
I like being in charge of what's going on, and it was a good decision as far as I can tell.

Avoid false sharing, and keep in mind that waking a thread (one waiting on a condition variable) costs a lot (10-20 microseconds on my Win 10 machine).
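
False sharing in a nutshell (a minimal sketch, assuming the usual 64-byte cache lines):

Code: Select all

#include <atomic>
#include <cstdint>

// Bad: both counters share one cache line, so two threads writing
// them ping-pong the line between cores (false sharing).
struct CountersBad {
    std::atomic<uint64_t> a;
    std::atomic<uint64_t> b;
};

// Good: each counter gets its own 64-byte cache line.
struct CountersGood {
    alignas(64) std::atomic<uint64_t> a;
    alignas(64) std::atomic<uint64_t> b;
};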

Another important thing is probably memory management - the stack is fastest, thread-local where you can, pools for same-sized chunks, thread-local temporary LIFO allocators, and so on.
This is where truly low-level capable languages like C or C++ shine, rather than relying on garbage collection (which is not the way to manage memory anyway).
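
A minimal sketch of such a thread-local LIFO allocator (my own toy version, not Martin's code): allocation is a pointer bump, and "freeing" is rewinding to a previously saved mark.

Code: Select all

#include <cassert>
#include <cstddef>

class StackArena {
public:
    explicit StackArena(size_t bytes)
        : buf_(new unsigned char[bytes]), cap_(bytes), top_(0) {}
    ~StackArena() { delete[] buf_; }

    void* alloc(size_t n) {
        n = (n + 15) & ~size_t(15);        // keep 16-byte alignment
        assert(top_ + n <= cap_);          // no growth in this sketch
        void* p = buf_ + top_;
        top_ += n;
        return p;
    }
    size_t mark() const { return top_; }   // remember the current top
    void rewind(size_t m) { top_ = m; }    // LIFO "free": a single store

private:
    unsigned char* buf_;
    size_t cap_, top_;
};

Typical use in a search: take a mark() when entering a node, alloc() the scratch data (move lists and the like), and rewind() on the way out.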
Martin Sedlak