Gerd Isenberg wrote:
diep wrote:
Hi,
Very disappointing, that K8L, at least as it looks from the specs.
Let's hope they managed to improve other stuff than the 64 ==> 128 bit bus thing.
The rest looks nearly 100% the same.
Besides SSE4 support of course, it additionally has a hardware instruction for popcount.
That's what i had read a month or so ago in that manual.
So basically its integer IPC is still 3 instructions a cycle, which means it is very unlikely that it can outperform the core2 processor for us in computerchess type workloads.
Additionally intel is going to scale their core2 stuff real high.
The race has been run. Intel has won it for the coming years, it seems. Very amazing, i had understood K8L would do 4 instructions a cycle.
So we have to wait for a next generation AMD processor for that.
Of course for multimedia AMD kicks butt with that improved SIMD bandwidth. Simply doubling bandwidth.
For multimedia type stuff, which is all number crunching software of course, it's supposed to be 10% faster than core2 in practical workloads.
Yet that's not our line of business. We are interested in our chessproggies and what AMD can do for it, and it's real little what it can do there extra on top of previous K8's.
Additionally, once intel's true quad core is released, they'll find another way to glue 2 together, creating 8 cores, something AMD seems incapable of doing.
From my viewpoint K8L is a big disappointment for computerchess.
Now let's hope it is going to be a cheap processor, but i honestly doubt it will be.
In the high end of course things are different; scalability is a big issue there. There this K8L will of course kick major butt with its improved SIMD, its already superior on-die memory system, and hypertransport.
Intel doesn't have that.
So to use a Dutch saying: "the soup is never eaten as hot as it is served".
I'd say, big bummer.
From AMD's viewpoint they of course managed to double their number of cores from 2 to 4, so i imagine they are extremely happy with that.
Yet from our viewpoint, they didn't improve IPC with respect to integers and we already knew that compilers sucked for AMD there, as its shorter pipeline already meant it suffered less there.
It is again the compilers that help intel to its victory.
Not only does it have a higher IPC, but additionally we know that historically it always clocks higher.
Add to that, that the price of the existing c2q is going to be very cheap, that means interesting times for us.
Vincent
Hi Vincent,
I guess memory and ccNuma issues are more important for a parallel Diep than integer ipc > 3. Branch-prediction, btb-size, tlbs and huge pages are issues as well.
Short push/pop opcodes result in considerably shorter code when saving/restoring caller safe registers, rather than using mov via [rbp].
AMD-manual wrote:
Faster PUSH/POP with the Sideband Stack Optimizer.
What about hash-tables with a 1GB page?
AMD-manual wrote:
The L1 data TLB now supports 1GB pages, a benefit to applications making large data-set random accesses.
The L1 instruction TLB, L1 data TLB and L2 data TLB have increased the number of entries for 2MB pages. This improves the performance of software that uses 2MB code or data or code mixed with data virtual pages.
The L1 data TLB has also increased the number of entries for 4KB pages.
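To put a number on the 1GB page question, here is a rough sketch of how many pages (and so how many distinct translations the TLB has to cover) a hash table spans at each page size the manual mentions. The 1GB table size comes from the question above; the function name is made up for illustration, and the table is assumed to start on a page boundary.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

def pages_spanned(table_bytes, page_bytes):
    """Number of pages needed to map a page-aligned table (rounded up)."""
    return -(-table_bytes // page_bytes)  # ceiling division on integers

table = 1 * GB  # assumed hash table size, taken from the 1GB question

pages_4k = pages_spanned(table, 4 * KB)  # 4KB pages
pages_2m = pages_spanned(table, 2 * MB)  # 2MB pages
pages_1g = pages_spanned(table, 1 * GB)  # one 1GB page

print(pages_4k, pages_2m, pages_1g)
```

With random probes all over the table, fewer pages means fewer translations competing for TLB entries, which is the whole attraction of the bigger page sizes.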
Let's see how Diep and others will perform in the future on intel and K8L or K10h or whatever. Both will have popcnt and lzcnt, which is nice to have, especially for bitboarders.
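For those who don't work with bitboards, a minimal sketch of what those two instructions compute, written as plain Python rather than as the hardware instructions themselves. The function names, the a1 = bit 0 square numbering and the example board are assumptions for illustration only.

```python
MASK64 = (1 << 64) - 1  # keep values inside 64 bits, like a hardware register

def popcnt64(bb):
    """Number of set bits in a 64-bit bitboard (what popcnt returns)."""
    return bin(bb & MASK64).count("1")

def lzcnt64(bb):
    """Number of leading zero bits in a 64-bit value; 64 for an empty board."""
    return 64 - (bb & MASK64).bit_length()

def msb_square(bb):
    """Index 0..63 of the highest set bit; the board must not be empty."""
    return 63 - lzcnt64(bb)

# Hypothetical board: eight pawns on the second rank, with a1 as bit 0.
rank2_pawns = 0x000000000000FF00

pawn_count = popcnt64(rank2_pawns)
leading_zeros = lzcnt64(rank2_pawns)
highest_pawn = msb_square(rank2_pawns)
```

In an engine these would be single instructions or compiler intrinsics; the point here is only the semantics: popcnt for counting pieces or mobility squares, lzcnt for finding the highest set bit without a loop.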
Gerd
The only program that doesn't need to worry about scaling is Diep.
Well, we already know the result of course: Diep scales 3.8 out of 4 on C2Q (thanks to Sune Fischer for running the tests). Because Diep can work with the very bad latencies of supercomputers, it will also scale real well on 2 socket or 4 socket intel machines. No matter how far intel is behind AMD there, those latencies are still a factor 10+ better than the latencies of the supercomputers that Diep was also designed to scale well on.
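For reference, the 3.8 out of 4 figure translates into a parallel efficiency as sketched below. The second function backs out an Amdahl-style serial fraction; treating a parallel search that way is only an assumption for illustration, since search overhead is not really a fixed serial part. Both function names are made up.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores

def amdahl_serial_fraction(speedup, cores):
    """Serial fraction s that solves speedup = 1 / (s + (1 - s) / cores)."""
    return (cores / speedup - 1) / (cores - 1)

speedup, cores = 3.8, 4  # figures quoted above for Diep on C2Q

eff = parallel_efficiency(speedup, cores)
s = amdahl_serial_fraction(speedup, cores)

print(f"efficiency {eff:.1%}, implied serial fraction {s:.2%}")
```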
If the evaluation function is slow, then the time you spend on the hashtable is, relatively speaking, a lot less. So for Diep, intel's slower latency is less of a problem.
Basically a well designed 3 instructions a cycle cpu can never outperform a well designed 4 instructions a cycle cpu. It is important to realize that AMD's biggest advantage over intel for us is its small misprediction penalty.
Yet it might be the case that K8L is 'worse' there than previous K8's. At least not better. Secondly K8 wins another few % back because of a bigger L1 cache and faster RAM access.
As a countermeasure we have pgo nowadays, which works real well for intel. AMD gets 11.3% out of that for Diep, Intel core2 gets 20-22% out of that for Diep. That's with just a 5 minute pgo run in visual c++ 8.0; intel c++ might do even better than that.
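A quick sketch of what those pgo numbers mean relative to each other, assuming the percentages are plain speedups of the pgo build over the non-pgo build on the same machine (that reading is an assumption, as is the function name).

```python
def relative_gain(gain_a, gain_b):
    """Extra speedup A ends up with over B when both start from the same baseline."""
    return (1 + gain_a) / (1 + gain_b) - 1

amd_pgo = 0.113                      # AMD gain quoted above
intel_low, intel_high = 0.20, 0.22   # core2 range quoted above

edge_low = relative_gain(intel_low, amd_pgo)
edge_high = relative_gain(intel_high, amd_pgo)

print(f"pgo alone shifts the balance by {edge_low:.1%} to {edge_high:.1%}")
```

So under that reading pgo by itself moves things roughly 8 to 10 percent in core2's favour, before any IPC or clock difference is counted.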
Core2 is better at predicting loops than K8. Many loops get 100% predicted. Core2 has a bigger lookahead. Wasn't it like 72 bytes for core2 versus 16 for k8 or so? Even if K8L improves that to 32 it still is factors less than core2.
Overall, that 33% faster integer speed of intel (4 instructions a cycle versus 3) is simply killing AMD big time.
Add to that, that intel will clock perhaps 1GHz higher than AMD, and have 8 cores done sooner, that's a killer blow in direction of AMD.
An important thing to realize with respect to the C2Q memory controller is that an off-chip chipset, such as C2Q has, can do reads in parallel but not writes.
In computerchess we are doing a huge amount of reads, overwhelmingly more reads than writes. Many streaming data applications are nonstop writing as well. That last case is where AMD really shows better scaling than intel's approach.
That said, if we just look at the quality of the chipset, intel's is a lot better than AMD's and can handle more DIMMs. AMD's is higher clocked just because it's on chip.
So as long as we're just doing nonstop reads, with intel always having a 2 fold bigger L2 cache than AMD, there is objectively not a big problem for us on C2Q.
Therefore C2Q is totally outgunning AMD overall for branchy integer code.
That said, if AMD would increase its cpu from 3 instructions a cycle to 4, they would of course annihilate the C2Q in IPC. There were some rumours they were trying to achieve that, but seemingly they were in too much of a hurry to catch up with intel's quadcore and release a quadcore chip as soon as possible.
Can't blame AMD for going for a short term victory, but i would've preferred that they released K8L a few months later doing 4 instructions a cycle, which would kill intel IPC-wise till 2012.
It is very well possible that AMD isn't gonna sit idle and wait with a chip until intels 45 nm factory gets active, which is a quantum leap advantage for intel over AMD's 65 nm factory.
Should intel get that factory into production soon, then AMD is history of course, because of a huge clock advantage for intel.
In the meantime intel will of course improve the floating point performance of core2 quite a bit. AMD taking a 10% lead there now (single core measured) is no real big deal if you consider which improvements intel can still make there by just giving its designers a bit more budget.
Intel is and remains a manufacturer which produces chips really cheaply. It's therefore very amazing that it is already clear now that they will get a higher IPC.
Additionally if you're a company buying servers, let's be very honest.
Why on planet earth do you need to buy a quad xeon single core 500MHz with 4 sockets nowadays?
You can already get 4 cores within 1 socket now for $266, as of 22 July 2007.
If you look for example at a quad socket opteron, then the i/o is attached to socket 0.
So all cores need to communicate very slowly to socket 0.
There is really no difference there between a quad opteron and a C2Q, except that a quad opteron single core @ 800 watt will cost you perhaps $6000 and a C2Q @ 200 watt will cost you $1000.
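Worked out per core, using only the ballpark figures from the sentence above (4 single-core sockets versus one quad core; the dictionary layout and names are just for illustration):

```python
# Ballpark figures quoted above; both boxes have 4 cores in total.
quad_opteron = {"price_usd": 6000, "watts": 800, "cores": 4}
c2q_box = {"price_usd": 1000, "watts": 200, "cores": 4}

def per_core(machine):
    """Return (dollars per core, watts per core) for one machine."""
    return (machine["price_usd"] / machine["cores"],
            machine["watts"] / machine["cores"])

opteron_usd, opteron_w = per_core(quad_opteron)
c2q_usd, c2q_w = per_core(c2q_box)

price_ratio = opteron_usd / c2q_usd  # how many times more expensive per core
power_ratio = opteron_w / c2q_w      # how many times more power per core
```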
So in server market, AMD in short term looks very good with K8L in the specfp charts, but intel is far cheaper and in long term clocks way higher with a bigger IPC.
Only for multimedia number crunchers in short term AMD is a nice option and for those who want to build machines that have a cost of millions.
For embarrassingly parallel number crunching AMD is already too expensive and intel wins it there bigtime, so even in that market AMD is gone.
There really are very few applications where AMD is far superior to intel in the long run. You really need SIMD for that and nonstop writes to the memory controller.
Besides a few dudes who are using Photoshop, which guys qualify for that?
Oh dear, photoshoppers use a macintosh anyway with OS X 10.4.9, xcode with a compiler that's 30% worse, namely gcc 4.0.1 (ouch), silly ways to resize windows (you can only do it in the bottom right corner), and the only processor you can choose from nowadays at apple is.... ...intel.
Vincent