Software Optimization Guide for AMD Family 10h Processors

Gerd Isenberg · Post by **Gerd Isenberg** » Fri May 25, 2007 12:18 pm

Software Optimization Guide for AMD Family 10h Processors
40546 Rev. 3.02 May 2007

http://www.amd.com/us-en/assets/content ... /40546.pdf

Some highlights: 128-bit ALUs for SSE*.
POPCNT and LZCNT are DirectPath Single instructions with a latency of two cycles!

Instruction      Decode type        Latency (cycles)
BSF    reg, reg  VectorPath         4
BSR    reg, reg  VectorPath         4
LZCNT  reg, reg  DirectPath Single  2 (note 6)
POPCNT reg, reg  DirectPath Single  2 (note 6)

Note 6: This operation is restricted to scheduling in pipe 2.
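
For readers who want to try these instructions from C, here is a minimal sketch. It assumes GCC or Clang, whose builtins `__builtin_popcountll` and `__builtin_clzll` can compile to POPCNT and LZCNT when the target supports them; the wrapper names `popcount64`, `lzcnt64` and `msb_index64` are made up for illustration and are not from the guide.

```c
#include <assert.h>
#include <stdint.h>

/* Number of set bits in a 64-bit bitboard.
   Assumption: GCC or Clang builtin; with a suitable target flag it may
   become a single POPCNT, otherwise the compiler emits a fallback. */
static inline int popcount64(uint64_t bb)
{
    return __builtin_popcountll(bb);
}

/* Leading-zero count with LZCNT semantics: 64 for an empty bitboard.
   The zero case is handled explicitly because __builtin_clzll(0)
   is undefined. */
static inline int lzcnt64(uint64_t bb)
{
    return bb ? __builtin_clzll(bb) : 64;
}

/* Index of the most significant set bit, i.e. what BSR would return.
   Like BSR, it is only meaningful for a non-empty bitboard. */
static inline int msb_index64(uint64_t bb)
{
    assert(bb != 0);
    return lzcnt64(bb) ^ 63;
}
```

For example, `msb_index64(1)` is 0 and `lzcnt64(1)` is 63, which is the "xor 63" relationship between BSR and LZCNT.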

Nid Hogge · Post by **Nid Hogge** » Fri May 25, 2007 9:00 pm

Hi Gerd,

Maybe this will be of interest for you:

Inside Barcelona: AMD's Next Generation

http://www.realworldtech.com/page.cfm?A ... 1607033728

A very technical description of K10 (well, for me).

And a discussion on the Optimization Guide with lots of experts:

http://www.realworldtech.com/forums/ind ... 3&roomid=2

This is well over my head..

PS If any of you guys can give any performance estimations based on all of those details, it would be great.
As of yet we barely know anything about how it will perform..

Gerd Isenberg · Post by **Gerd Isenberg** » Fri May 25, 2007 10:43 pm

Nid Hogge wrote: Hi Gerd,

This is well over my head..

Hi Nid,

this is over my head as well

I would really like a software simulation, to see all the units in parallel action while stepping interactively through the cycles.

Nid Hogge wrote: PS If any of you guys can give any performance estimations based on all of those details, it would be great.
As of yet we barely know anything about how it will perform..

I have no exact measure. There are a lot of issues, for instance faster push/pop versus mov via rsp to save registers. Shared L3, ccNUMA, the new huge memory pages (if supported by BIOS/OS/API), TLB and BTB sizes, and whatever else will likely require some research.

Most conventional (bitboards or not) chess programs have a quite low IPC, and I guess they will not profit much from a 32-byte fetch per cycle (was 16 a bottleneck?) or from more or wider ALUs.

Compiler makers must consider a lot of new stuff and new branch-prediction issues, e.g. to avoid branches to a one-byte ret 0. A lot of work that will take some time. C intrinsics are needed for the new instructions as well, if you don't want to use gcc inline assembly with some self-defined opcodes.

Popcnt and lzcnt will speed up bitboard programs. So I guess some will traverse bitboards the other way around in the future, maybe with an extra xor 63; bsf is improved but is still a 4-cycle VectorPath instruction.

if (bb) do
{
   sq = lzcnt(bb) ^ 63;
   ...
   bitTestAndReset(&bb,sq);
} while (bb);
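
A compilable sketch of the loop above, for anyone who wants to try it. It assumes GCC or Clang (`__builtin_clzll` stands in for an lzcnt intrinsic), and `bit_reset` is a hypothetical stand-in for `bitTestAndReset`, whose real definition is not shown in this thread. Instead of the elided loop body it simply records the squares in the order they are visited.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for bitTestAndReset: clears bit sq in *bb. */
static void bit_reset(uint64_t *bb, int sq)
{
    assert(sq >= 0 && sq < 64);
    *bb &= ~(UINT64_C(1) << sq);
}

/* Visits the set bits of bb from the most significant bit down and
   stores their square indices in squares[]. Returns the bit count. */
static int serialize_msb_first(uint64_t bb, int squares[64])
{
    int n = 0;
    while (bb) {
        /* bb is non-zero here, so __builtin_clzll is well defined;
           xor 63 turns the leading-zero count into a bit index. */
        int sq = __builtin_clzll(bb) ^ 63;
        squares[n++] = sq;
        bit_reset(&bb, sq);
    }
    return n;
}
```

With bits 63, 12 and 0 set, the squares come out in the order 63, 12, 0, which is the "other way around" compared with a bsf-based loop.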

Due to the 128-bit ALUs, SSE integer stuff becomes more and more interesting for hard-core bitboarders like me: fill stuff, byte[64]*bit[64] dot product, etc.
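
To make the byte[64]*bit[64] dot product concrete, here is a plain scalar reference in C. It is only a sketch of what the operation computes, not the SSE version hinted at above; the name `dot_product64` is made up, and `__builtin_ctzll` assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: sum of weights[i] over every set bit i of bb.
   An SSE2 version would do the same sum with packed byte operations;
   this loop only pins down the expected result. */
static int dot_product64(uint64_t bb, const uint8_t weights[64])
{
    assert(weights != 0);
    int sum = 0;
    while (bb) {
        int sq = __builtin_ctzll(bb); /* index of lowest set bit, bb != 0 */
        sum += weights[sq];
        bb &= bb - 1;                 /* clear that lowest set bit */
    }
    return sum;
}
```

A typical use would be a weighted popcount, e.g. summing per-square values for all squares attacked by a piece.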

Maybe some start to try float- or double types for their evaluation?

Cheers,
Gerd

Nid Hogge · Post by **Nid Hogge** » Sat May 26, 2007 1:10 am

Gerd Isenberg wrote: Hi Nid,

this is over my head as well

Huh, you gotta be kidding me :)

This sounds very neat, thanks for sharing. I do hope it's as fast as they claim. AMD needs it badly, and we as potential consumers must embrace competition.

A few days ago AMD demonstrated its Barcelona quad-core server chip, comparing its performance to one of AMD's dual-core Opteron processors, with both CPUs running at the same frequency. (You can watch it here.)

The demo measured the performance of the chips on an imaging benchmark called POV-Ray. POV-Ray scales almost perfectly with a large number of cores.

The 16 core K10 scored "above 4,000 pixels per second" and the rendering speed of the 16-core K10 system was just 1.87 times the speed of the 8-core K8 system.

Now the strange thing is that Intel's 8-core Clovertown (X5365) scores 4677 or better.

Intel were quick enough to run a demonstration of their own:

8 Intel cores faster than 16 AMD Barcelona cores!

http://www.uberpulse.com/us/2007/05/int ... u_ship.php

This was kinda worrying, although it's only one benchmark. A 16-core NG that can't beat an 8-core in a highly scalable MP environment never looks good. But we must remember it's still early and we don't know at what speed they were running.

What I really wanna know is why AMD chose PoV-Ray for the demonstration in the first place. They must have known it does not suit K10 well.

Some technical quotes I collected that could explain the results:

First, there is no actual usage of vectorized (or packed) instructions in PoV-Ray SSE. The only packed instructions I see from the binary are register conversions between x87 and SSE2 formats. PoV-Ray SSE basically treat the SSE2 as a faster [sic] x87 engine which can access xmm registers randomly (rather than stack-based in x87). For example, a simple double-precision division in PoV-Ray SSE is performed by the following instruction sequence:

Convert the divisor from single to double (CVTSS2SD)
Perform double-precision scalar division using DIVSD
Convert the result from two double values to two single values (CVTPD2PS).

This offers considerable advantage for Intel's Core 2, because SSE2 DIVSD (18 cycles) in Core 2 is much faster than x87 FDIV (36 cycles), and the conversion instructions are also quite fast (4 cycles). Overall, for Core 2, the above sequence will save ~30% number of cycles (4+18+4=26 vs. 36) from an x87 division. On the other hand, this sequence is very inefficient for K8, where SSE2 DIVSD is as fast as x87 FDIV (~20 cycles), but conversions are much slower (8 cycles). Overall, for K8, the sequence runs ~80% slower (8+20+8=36 vs. 20 cycles) than an x87 division.

Roughly estimating, about 1/4 to 1/3 of the numerical instructions in the PoV-Ray SSE undergo such convert-calculate-convert process, where you see CVTxx2yy instructions all over the places in these parts of the code. Now I'm not sure whether this is compiled by an Intel compiler, or with an Intel library, or whatever else, but this is simply not the good/right way to do vectorized acceleration. It gives Core 2 a performance boost only due to Core 2's design artifact where such conversions are cheap/fast. Still, PoV-Ray SSE manages to run slightly faster than PoV-Ray x87 on K8 probably due to the ability to access register randomly, which results in better superscalar and out-of-order executions.

Second, comparing the K10 instruction latency with the K8 instruction latency, we find that K10 has little, if any, improvement on scalar SSE instructions; worse yet, some CVTxx2yy instructions are even downgraded and have longer decode and higher latency. What this shows is that PoV-Ray SSE, being rather unfriendly to the K8 microarchitecture, appears even more hostile toward K10. Thus the fact that 16 cores of K10 can still almost double the speed of 8 cores of K8 actually implies there are some core improvements at work inside the K10 design.
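
The cycle arithmetic in the first quoted point can be checked with a few lines of C. This is only a sketch that adds up the latencies given in the quote (4/18/4 versus 36 for Core 2, 8/20/8 versus 20 for K8); the function names are made up, and simply summing latencies ignores any overlap between instructions.

```c
#include <assert.h>

/* Cycles for the convert -> divide -> convert sequence, assuming the
   three latencies simply add up as in the quoted estimate. */
static int convert_divide_convert(int cvt_cycles, int div_cycles)
{
    assert(cvt_cycles >= 0 && div_cycles >= 0);
    return cvt_cycles + div_cycles + cvt_cycles;
}

/* Percent change of the SSE2 sequence relative to a plain x87 divide.
   Negative means the sequence is faster, positive means slower. */
static double percent_vs_x87(int sequence_cycles, int x87_cycles)
{
    assert(x87_cycles > 0);
    return 100.0 * (sequence_cycles - x87_cycles) / x87_cycles;
}
```

With the quoted numbers, Core 2 needs 26 cycles instead of 36 (about 28% fewer, which the quote rounds to ~30%), while K8 needs 36 cycles instead of 20 (80% more).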

Strange .. Hopefully just a bad configuration or such.
I hope and believe Computex will bring better news for AMD..

Gerd Isenberg · Post by **Gerd Isenberg** » Sat May 26, 2007 9:39 am

I think one can construct enough benchmarks where either the AMD or the Intel processor looks much better. What matters is that compilers consider the target CPU when generating optimal code.

Gerd Isenberg · Post by **Gerd Isenberg** » Sat May 26, 2007 11:54 am

The Complex Multiplication of Streams of Complex Numbers on page 157ff nicely demonstrates improving ipc by loop unrolling.

-------------------------------------------------------------------------------

The move from memory (MOVAPS) requires 2 cycles (assuming that the data is available in L1 cache), MOVSHDUP, MOVSLDUP require 2 cycles each, the two MULPS instructions require 4 cycles, the SHUFPS requires 4 cycles, and ADDSUBPS requires 4 cycles. The instruction flow through the processor is illustrated on a clock-cycle basis, as follows:

Instruction 0     2     4     6     8     10    12    14    16
MOVAPS      xxxxxx
MOVAPS      xxxxxx
MOVAPS            xxxxxx
MOVSHDUP          xxxxxx
MOVSLDUP          xxxxxx
SHUFPS                  xxxxxxxxxxxx
MULPS                               xxxxxxxxxxxx
MULPS                   xxxxxxxxxxxx
ADDSUBPS                                        xxxxxxxxxxxx

These two complex multiplies take 15 cycles to finish. During these 15 cycles, the processor has the ability to perform 60 single-precision adds and 60 single-precision multiplies, but in this code sequence it only performs eight multiplies and four adds (the subtracts are performed on the ADD execution unit). This is only 10% utilization. The majority of the time is spent waiting for previous instructions to terminate so that arguments to future instructions are available. By unrolling the multiplication and working with four complex numbers per loop, there are more instructions that are not dependent on previous or presently executing operations. This allows the processor to mask the execution latency and keep itself busier, as illustrated below:

Instruction 0     2     4     6     8     10    12    14    16   18
MOVAPS      xxxxxx
MOVAPS      xxxxxx
MOVAPS         xxxxxx
MOVAPS         xxxxxx
MOVAPS            xxxxxx
MOVAPS            xxxxxx
MOVSHDUP          xxxxxx
MOVSHDUP             xxxxxx
MOVSLDUP             xxxxxx
MOVSLDUP             xxxxxx
SHUFPS                  xxxxxxxxxxxx
SHUFPS                  xxxxxxxxxxxx
MULPS                               xxxxxxxxxxxx
MULPS                      xxxxxxxxxxxx
MULPS                                  xxxxxxxxxxxx
MULPS                         xxxxxxxxxxxx
ADDSUBPS                                        xxxxxxxxxxxx
ADDSUBPS                                           xxxxxxxxxxxx

Multiplying four complex single-precision numbers only takes 17 cycles as opposed to 15 cycles to multiply two complex single-precision numbers. The floating-point pipes are kept busier by feeding new instructions into the floating-point pipeline each cycle. In the arrangement above, 16 multiplies and 8 additions are performed in 17 cycles, achieving a 1.8x increase in performance. Unrolling the loop one more time will improve efficiency even more, at the expense of requiring all 16 XMM registers at once.
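
The utilization and speedup figures in the quoted passage can be reproduced with a small C sketch. The peak rate of 4 adds plus 4 multiplies per cycle is an assumption derived from the quoted "60 adds and 60 multiplies in 15 cycles"; the function names are made up for illustration.

```c
#include <assert.h>

/* Fraction of peak floating-point throughput actually used.
   Assumed peak: 4 single-precision adds + 4 multiplies per cycle,
   which matches 60 + 60 operations in 15 cycles. */
static double fp_utilization(int multiplies, int adds, int cycles)
{
    assert(cycles > 0);
    const int peak_ops_per_cycle = 4 + 4;
    return (double)(multiplies + adds) / (peak_ops_per_cycle * cycles);
}

/* Throughput ratio of schedule B over schedule A, both measured in
   floating-point operations per cycle. */
static double speedup(int ops_a, int cycles_a, int ops_b, int cycles_b)
{
    assert(ops_a > 0 && cycles_a > 0 && cycles_b > 0);
    return ((double)ops_b / cycles_b) / ((double)ops_a / cycles_a);
}
```

The first schedule gives `fp_utilization(8, 4, 15)` = 12/120 = 10%, the unrolled one gives 24/136, roughly 17.6%, and `speedup(12, 15, 24, 17)` is about 1.76, which the guide rounds to 1.8x.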

diep · Post by **diep** » Sat May 26, 2007 9:03 pm

Hi,

Very disappointing, that K8L, at least as it looks from the specs;
let's hope they managed to improve other stuff than the 64 ==> 128 bits bus thing.

The rest looks nearly 100% the same.

Besides of course SSE4a support; additionally it has a hardware instruction for popcount.

That's what I had read a month or so ago in that manual.

So basically its integer IPC is still 3 instructions a cycle, which means it is very unlikely that it can outperform the Core 2 processor for us in computer chess type workloads.

Additionally Intel is going to scale their Core 2 stuff real high.

The race has been run. Intel has won it for the coming years, it seems. Very amazing, I had understood K8L would do 4 instructions a cycle.

So we have to wait for a next generation AMD processor for that.

Of course for multimedia AMD kicks butt with that improved SIMD bandwidth. Simply doubling bandwidth.

For multimedia type stuff, which is all number crunching software of course, it's supposed to be 10% faster than Core 2 in practical workloads.

Yet that's not our line of business. We are interested in our chessproggies and what AMD can do for them, and it can do really little extra there on top of previous K8s.

Additionally, once Intel's true quad core is released, they'll find another way to glue 2 together to create 8 cores, something AMD seems incapable of doing.

From my viewpoint K8L is a big disappointment for computer chess.

Now let's hope it is going to be a cheap processor, but I honestly doubt it will be.

In the high end of course things are different; scalability is a big issue there. There this K8L will of course kick major butt with its improved SIMD and its already superior on-die memory system and HyperTransport.

Intel doesn't have that.

So to use a Dutch saying: "the soup is never eaten as hot as it is served".

I'd say, big bummer.

From AMD's viewpoint they of course manage to double their number of cores from 2 to 4, so I imagine they are extremely happy with that.

Yet from our viewpoint, they didn't improve IPC with respect to integers, and we already knew that compilers sucked for AMD there, as its shorter pipeline already meant it suffered less there.

It is again the compilers that help Intel to its victory.

Not only does it have a higher IPC, but additionally we know that historically it always clocks higher.

Add to that that the existing C2Q is going to be very cheap; that means interesting times for us.

Vincent

Gerd Isenberg · Post by **Gerd Isenberg** » Sat May 26, 2007 9:55 pm

Hi Vincent,

I guess memory and ccNuma issues are more important for a parallel Diep than integer ipc > 3. Branch-prediction, btb-size, tlbs and huge pages are issues as well.

Short push/pop opcodes result in considerably shorter code when saving/restoring callee-saved registers, compared to using mov via [rbp].

AMD-manual wrote: Faster PUSH/POP with the Sideband Stack Optimizer.

What about hash-tables with a 1GB page?

AMD-manual wrote: The L1 data TLB now supports 1GB pages, a benefit to applications making large data-set random accesses.

The L1 instruction TLB, L1 data TLB and L2 data TLB have increased the number of entries for 2MB pages. This improves the performance of software that uses 2MB code or data or code mixed with data virtual pages.

The L1 data TLB has also increased the number of entries for 4KB pages.
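
To get a feeling for why page size matters for a big hash table, here is a small C sketch that counts how many pages (and therefore how many distinct TLB translations) a table spans at each of the three page sizes mentioned. The function name and the 1 GiB example table size are illustrative assumptions, not figures from the manual.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages needed to map table_bytes with a given page size,
   rounded up. Each page needs its own TLB translation, so fewer pages
   means fewer possible TLB misses on random probes. */
static uint64_t pages_needed(uint64_t table_bytes, uint64_t page_bytes)
{
    assert(page_bytes > 0);
    return (table_bytes + page_bytes - 1) / page_bytes;
}
```

A 1 GiB hash table spans 262144 pages of 4KB, 512 pages of 2MB, or a single 1GB page, so random probes into it would compete for far fewer TLB entries with the larger page sizes.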

Let's see how Diep and others will perform in the future on Intel and K8L or K10h or whatever. Both will have popcnt and lzcnt, which is nice to have, especially for bitboarders.

Gerd

diep · Post by **diep** » Sun May 27, 2007 5:02 pm

The only program that doesn't need to worry about scaling is Diep.

Well, we already know the result of course: Diep scales 3.8 out of 4 on C2Q (thanks to Sune Fischer for running tests). Because Diep can work with the very bad latencies of supercomputers, it will also scale really well on 2-socket or 4-socket Intel machines: no matter how far Intel is behind AMD there, those latencies are still a factor 10+ better than the latencies of the supercomputers that Diep was also designed to scale well on.

If the evaluation function is slow, then the time you spend on the hash table is, relatively seen, a lot less. So for Diep, Intel's slower latency is less of a problem.

Basically a well designed 3 instructions a cycle CPU can never outperform a well designed 4 instructions a cycle CPU. It is important to realize that AMD's biggest advantage over Intel for us is its small misprediction penalty.

Yet it might be the case that K8L is 'worse' there than previous K8's. At least not better. Secondly K8 wins another few % back because of a bigger L1 cache and faster RAM access.

As a countermeasure, we have PGO nowadays, which works really well for Intel. AMD gets 11.3% out of that for Diep; Intel Core 2 gets 20-22% out of that for Diep. That's just a 5-minute PGO run with Visual C++ 8.0; Intel C++ might do even better than that.

Core 2 is better at predicting loops than K8. Many loops get 100% predicted. Core 2 has a bigger lookahead. Wasn't it like 72 bytes for Core 2 versus 16 for K8 or so? Even if K8L improves that to 32, it is still factors less than Core 2.

Overall, that 33% faster integer speed at Intel is simply killing AMD big time.

Add to that that Intel will clock perhaps 1 GHz higher than AMD and have 8 cores done sooner; that's a killer blow in the direction of AMD.

An important thing to realize with respect to the C2Q memory controller is that an off-chip chipset, such as C2Q has, can do reads in parallel but not writes.

In computer chess we are doing a huge amount of reads, overwhelmingly more reads than writes. Many streaming data applications are nonstop writing as well. The latter really shows AMD scaling better than Intel's approach.

That said, if we just look at the quality of the chipset, Intel's is a lot better than AMD's and can handle more DIMMs than AMD. Just because it's on chip, AMD's is clocked higher.

So as soon as we're just doing nonstop reads, with Intel always having a 2-fold bigger L2 cache than AMD, there is really not a big problem for us on C2Q, objectively seen.

C2Q is totally outgunning AMD therefore overall for branchy integer codes.

That said, if AMD would increase its CPU from 3 instructions a cycle to 4, they would of course annihilate the C2Q in IPC. There were some rumours they were trying to achieve that, but seemingly they were in too much of a hurry to catch up with Intel's quad core and release a quad-core chip as soon as possible.

Can't blame AMD for going for a short-term victory, but I would have preferred that they released K8L a few months later doing 4 instructions a cycle, which would kill Intel IPC-wise until 2012.

It is very well possible that AMD isn't gonna sit idle and wait with a chip until Intel's 45 nm factory gets active, which is a quantum leap advantage for Intel over AMD's 65 nm factory.

Should Intel get that factory into production soon, then AMD is history of course, because of a huge clock advantage for Intel.

In the meantime Intel will of course improve the floating-point performance of Core 2 quite a bit. AMD taking a 10% lead there now (measured single core) is no real big deal if you consider which improvements Intel can still make there by just giving its designers a bit more budget.

Intel is and remains a manufacturer that really produces chips very cheaply. It's therefore very amazing that already now it is clear that they will get a higher IPC.

Additionally if you're a company buying servers, let's be very honest.

Why on planet earth would you need to buy a 4-socket single-core quad Xeon at 500 MHz nowadays?

You can already get 4 cores within 1 socket for $266 as of 22 July 2007.

If you look for example to a quad socket opteron, then the i/o is attached to socket 0.

So all cores need to communicate very slowly to socket 0.

There is really no difference there between a quad opteron and a C2Q, except that a quad opteron single core @ 800 watt will cost you perhaps $6000 and a C2Q @ 200 watt will cost you $1000.

So in server market, AMD in short term looks very good with K8L in the specfp charts, but intel is far cheaper and in long term clocks way higher with a bigger IPC.

Only for multimedia number crunchers in short term AMD is a nice option and for those who want to build machines that have a cost of millions.

For embarrassingly parallel number crunching AMD is already too expensive and Intel wins big time there, so even in that market AMD is gone.

There really are very few applications where AMD is far superior to Intel in the long run. You really need SIMD for that, and nonstop writes to the memory controller.

Besides a few dudes who are using Photoshop, which guys qualify for that?

Oh dear, Photoshoppers use a Macintosh anyway, with OS X 10.4.9, Xcode with a compiler that's 30% worse, namely gcc 4.0.1 (ouch), silly ways to resize windows (you can only do it from the bottom right corner), and the only processor you can choose from nowadays at Apple is.... ...Intel.

Vincent
