1. Profile -- You don't know if you have a problem unless you measure. So measuring comes #1 first and foremost. Don't necessarily think about superscalar / pipelined / whatever execution. At the end of the day, its the execution time that matters. Use hardware performance counters to measure events (ex: Cache hits, Cache misses, Branch Prediction hit/miss ratios, etc. etc).
2. Get to L1 and L2 cache before anything else -- L1 cache would be ideal, but L2 cache is still within the core and is surprisingly capable. All of the superscalar / pipelined / out-of-order stuff doesn't really matter if your data is in L3 or above, because you'd be memory constrained. L3 cache (and DDR4 RAM) is shared between multiple cores and can slow you down.
3.
Create opportunities for Instruction level Parallelism. Superscalar, Pipelined, Out-of-Order, Specular... these are
all the same thing: Instruction level parallelism. Each of the techniques described is a way for a modern CPU code to execute the instruction stream in parallel instead of sequentially. Add SIMD-vectorization as well, because autovectorizers are getting better.
4.
Measure. Yes, profile again. After writing code, you can figure out if it was ILP and detected by the compiler by timing it. If it got faster, you know you're doing things correctly.
5.
Hyperthread. If you can't discover enough parallelism in one thread, throw two threads onto a core. Make sure that both threads fit in L1 (or L2) cache.
-----------
Or do you assembly guys really know how
to order different instructions in the right way for each cpu architecture?
Clang and GCC perform the same techniques to optimize your code, no matter what CPU it is on. All modern CPUs (even cell-phone CPUs) use Tomasulo's algorithm to perform superscalar, pipelined, out-of-order instruction-level parallelism. All modern CPUs search for more parallelism through speculative execution.
If you learn one modern microarchitecture, you've learned all of them. ARM, x86, POWER9, they all have the same features to cut dependencies while searching for ILP. Skylake may have 8 ports or Zen may have 10 ports, or one instruction (Ex: Division or Modulo) may be 20-clock ticks on one CPU or maybe 80-clock ticks on another architecture, but
that doesn't matter in most cases. 20-clock ticks is still very slow: enough time to do 60x Additions or XOR instructions on 256-bit data on Skylake.
While Skylake Port5 is doing division, you want Port0 and Port1 to have something to do
in parallel to the 20-cycle division instruction. Its all instruction-level parallelism, and the general thought process is the same no matter what CPU you use today. If you want to know how modern CPUs work, study
Tomasulo's algorithm and work it out by hand. All the superscalar / pipelined / out-of-order stuff comes down to variants of Tomasulo's algorithm reimplemented on different CPUs.
------------
Unfortunately, most simple examples are automatically detected by compilers and converted into parallel form. I don't have time right now to make a good example, but I'll try to post an example using -O0 (optimizations disabled), so that you can get an idea of how and why assembly gets faster with the compiler tricks.
HG Muller's recent thread (
http://talkchess.com/forum3/viewtopic.php?f=7&t=72539) is an example of converting a loop into a more parallel form. The compiler / CPU can detect that the two-loop solution is more parallel than the one-loop solution, so more of that code executes in parallel.