480 high speed CPUs for $500

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 480 high speed CPUs for $500

Post by bob »

diep wrote:
bob wrote:
diep wrote:
Dann Corbit wrote:http://www.nvidia.com/object/product_ge ... 95_us.html

Since there are so many gamers, I guess that cards like this are very common.

At some point, strong chess engines will take advantage of it.
Forget about letting several cards cooperate.
Let's assume 1 card now with 2 gpu's inside.

A) the hashtable problem
a1) the 2 gpu's have no shared memory between each other; BIG problem.
a2) you can hardly allocate RAM on nvidia; you might manage only a block of about 200MB, despite the device having more. On tesla cards you might be able to allocate more.
a3) as a result of the hashtables not being shared, you get a WORSE BRANCHING FACTOR.
a4) device RAM which you use for the hashtable doesn't get cached;
BIG SLOWDOWN.

b) you have to create 2-layer parallelism

b1) parallelism between thread blocks (2 x 8 blocks)
b2) parallelism within 1 thread block of 30 cores

Most engines don't even have parallelism that scales well, let alone a good speedup at b1, let alone that you are going to manage b2.

Note that b2 is deterministic.

What's the speedup you get out of b1 + b2?

Moreover, you need to be at Einstein level in parallel search to get it done without too much overhead,

AS YOU CANNOT LOSE A FACTOR 50 TO SCALING OVERHEAD LIKE CILKCHESS AND ZUGZWANG DID IN THE PAST.

c) the ipc of each core is a lot lower. Whereas nowadays core2/i7/phenom2 are approaching 2.0, multimedia code on nvidia is getting 0.4. Note that multimedia code is a lot easier than chess code; chess code has more branches.

d) the manycore overhead. All code gets evaluated, whether you need it or not: every core executes the same code within 1 block.

See above: every hashtable entry is uncached, so you have to wait every node for a hashtable entry, even if in the last few plies you don't want to probe.

Ok, let's compare speed now.

32-core Xeon MP beckton (release on 15 april):

32 * 2.9GHz * ipc 2.0 ==> 185.6 G instructions a second.
No parallel speedup penalty yet. I don't know the speedup of diep at this 32-core box.

I would estimate it at 20 out of 32.

So that makes it in fact effectively:

20 * 2.0 * 2.9GHz = 116 G instructions (Gi)

Now the 295.

You can use 240 cores of course. How to do the parallelism between the 2 gpu's of 240 cores I do not know. You tell me.

So basic speed: 240 cores * 1.2GHz * 0.4 ipc = 115 Gi.
To the hashtable you're going to lose a factor 2 for sure.
You also lose a factor 2 for sure to the manycore overhead (as it has to evaluate every line of code).

Note this is an optimistic guess. I know the opinion of some chess programmers who want to make it a factor 5 or so.

Now the speedup at 240 cores. A big problem with 2-layer parallelism is that it is really suboptimal. It is not only complicated, but I would expect you lose a factor 2 to scaling in advance. So the theoretical maximum would be a 50% speedup. In reality more like 25-30% of that is normal. 30% of 50% = 15%. That's really very good to achieve.

So we start with 115 * 0.15 / 4 ≈ 4 Gi.

So a 32-core Xeon MP box is roughly a factor 30 better.

That's still an optimistic guess.

We didn't even factor in the worse branching factor that you will get at a gpu yet.

The trick is of course that we do not compare 1 cpu with a gpu.
We compare 4 cpu's with 1 gpu,

as those 4 cpu's have shared memory and a gpu does NOT.

Forget parallelism via the pci-e.
The latency is way too slow.

Vincent
I agree with most of the above. The Beckton (and nehalem, et al. i7 processors) pose an interesting set of constraints, from the shared L3 (8MB on the machine we have, bigger on a machine I have tested on) to the "sorta-NUMA" way Intel connects to memory: with a quad-package/socket box you get local memory and remote memory, just two classes/speeds, rather than AMD's local memory, remote memory (two of four processors), and really remote memory (slowest of all). So some issues on AMD disappear so long as we talk about 4-socket machines. But 8-socket machines are also interesting, and I am not sure what is going to happen there with regard to Intel. We already know with the opterons...

Intel chose MESIF rather than MOESI, so it isn't quite as efficient to share stuff across L3s if the data gets modified a lot. All of this means that a little programming care is needed. Screw this up and throughput can drop to 50% of optimal in a heartbeat. We've already seen and measured this, and not just with regard to chess searching.

The 4x8 box looks very good, IMHO. And Linux is doing well with process scheduling to make them effective to use. But I suspect chess programs are going to produce widely differing levels of performance unless they are carefully tuned with respect to cache and memory access patterns.
Oh yes, the 32-core box will be interesting. Note that the AMD memory subsystem is quite superior to Intel's; Intel just copies AMD in a bad manner, avoiding patents.
It is superior to AMD in some ways. For example, 3 channels. And the interconnects between the memory controllers can use 3 pathways as well, so a 4-socket box doesn't have 1 local memory, 2 1-hop memories, and 1 2-hop memory. Intel is 1 0-hop and 3 1-hop accesses, which is better, although what they will do beyond 4 sockets is unknown. But 3 connections is clearly better than 2.


AMD already had L1, L2 and L3 at mainstream opterons (called barcelona, phenom and now phenom2) before intel managed to upgrade their core2 to i7 with L3.

Next year, so not that much later than intel sells beckton (of course having test hardware gives a big edge to intel here), AMD will release 12-core cpu's which are quad-socket capable. That's 48 cores.

I don't know how high beckton will be clocked. Probably they manage 1 version at 3.2GHz; I'm not sure.

The whole point is that the remote shared memory we've got at intel and AMD is such a huge advantage that it annihilates any GPU attempt in advance for game tree search.

GPU's can be fast if you have embarrassingly parallel software that doesn't need much RAM.

Let's suppose we have a 48-core AMD box in 2010, and that some magic GPU gets released which has shared memory and a higher ipc.

The question then is: how does the 40GB of hashtable at such a big box compare against the, say, 400MB hashtable for 240 cores or so?

Note there are, on paper, faster gpu's than nvidia's.
AMD gpu's are faster (they bought ATI).

Theirs has by now 1600 cores, as they call it.
I'd call that 320 stream processors with 5 execution units each,
quite a bit better than nvidia's 480 cores with 1 unit.

That is more powerful than nvidia, and by a lot.

However, writing code that effectively gets a much higher ipc than 0.4, using those 5 execution units, is quite a challenge.

Another advantage of the AMD approach is that you don't need to write some sort of stupid 2-layer parallelism. That saves you 50% of the effort.

Again, the problem at the AMD gpu is similar to nvidia with respect to the hashtable: the x2 versions with 2 gpu's do not have shared memory.

Well, you know there is a link between the gpu's that's quite fast, but programming for it really complicates the code once again.

Writing code for this is really complicated.

We don't have big teams of great low-level coders for our chess software.
Everyone is doing it himself.

It is very difficult to make software for a gpu, knowing that only you can use it, and that by the time you have it done, maybe nvidia is out of business.

So far cpu's were simply a lot higher clocked and faster.

I'm putting in some effort right now in diep to see how I can streamline its parallelism better for the futuristic 32- and 48-core machines.

Yet making a new chess engine specially for a gpu would really be a fulltime 2-year project. I would first of all go for AMD, as that has more potential for computer chess: a single processor of AMD is 900MHz and 5-way, versus nvidia at 1.2GHz and 1 instruction a cycle max.

So that's a potential of 1.2 Gi for nvidia vs 4.5 Gi for AMD.

You start at, say, a factor 3 difference in advantage of ATI/AMD for a single stream processor. There is more cache available per stream core at AMD than at nvidia. Nvidia really has little cache.

Note that cache size is of totally minor importance in the intel vs AMD battle, yet in gpu's the cache is so tiny for each stream core that it might be relevant.

No good low-level programmer is going to start a gpu chess programming project any time soon, simply because you can't sell it and a multisocket machine is going to be faster anyway. THOSE are the main 2 bottlenecks.

You really would need some sort of 4-layer parallelism to kick butt with a gpu.

Those skulltrails can have 5 ATI/AMD gpu cards.
Let's say you get from the government access to their ultra-classified skulltrail clusters with 5 gpu cards a box and a fast network.

The highest level of parallelism is the parallelism between nodes. Say you use 240 machines.

Then each machine has a second layer of parallelism over pci-e between 5 cards, which has really very ugly latency (the slowest latency of all the layers).

Then within 1 card you have 2 gpu's and a bad parallelism between the 2 gpu's. That's the 3rd layer.

Then there is a 4th layer of parallel search over the manycores.

In the case of nvidia cards you would need a 5th layer of parallelism between each thread block (actually that would be the 4th).

That requires an IQ 180+ guy, and making all those layers of parallelism is really complicated.

Secondly, you need a low-level implementation of a chess engine, really complicated, in order to get some sort of not-too-ugly ipc. It is way more important to get that right on a gpu than on a cpu.

CPU's have billions of transistors in order to increase your ipc in a clever manner; that takes all the problems away from you. That's not there in gpu's.
They're really low-power cores.

The real limit of gpu's is that you'll find no one who wants to do it unpaid.

Vincent
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: 480 high speed CPUs for $500

Post by Dann Corbit »

For local (on card) memory:
Memory Specs:
Memory Clock (MHz) 999 MHz
Standard Memory Config 1792 MB GDDR3 ( 896MB per GPU )
Memory Interface Width 896-bit ( 448-bit per GPU )
Memory Bandwidth (GB/sec) 223.8

Series of 11 articles on GPU programming:
http://www.ddj.com/article/printableArt ... architect/

The Intel response of Larrabee does not clearly define what kind of memory bandwidth can be achieved.

At about $1 per core, scalable to as many cards as you can add to your system, it seems to me that it may be a cost-effective way to do chess. Of course, I did not try it and it might suck truck exhaust.

I don't know enough about your calculations to even comment, but I do know you have studied the subject extensively and have even created physical implementations, so I do not doubt that you are correct. But I still wonder how feasible the GPU model can become.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: 480 high speed CPUs for $500

Post by diep »

Dann Corbit wrote:For local (on card) memory:
Memory Specs:
Memory Clock (MHz) 999 MHz
Standard Memory Config 1792 MB GDDR3 ( 896MB per GPU )
Memory Interface Width 896-bit ( 448-bit per GPU )
Memory Bandwidth (GB/sec) 223.8

Series of 11 articles on GPU programming:
http://www.ddj.com/article/printableArt ... architect/

The Intel response of Larrabee does not clearly define what kind of memory bandwidth can be achieved.

At about $1 per core, scalable to as many cards as you can add to your system, it seems to me that it may be a cost-effective way to do chess. Of course, I did not try it and it might suck truck exhaust.

I don't know enough about your calculations to even comment, but I do know you have studied the subject extensively and have even created physical implementations, so I do not doubt that you are correct. But I still wonder how feasible the GPU model can become.
Well, at least gpu's you can program for; larrabee is a vector processor model that is totally unusable, except for double precision calculations.

You can find the instruction set here:

http://software.intel.com/en-us/article ... ves-guide/

The most interesting instructions have something like 5-7 cycles latency though.

gpu's you CAN program for, but the problem is the memory. They can of course claim a terabyte of bandwidth or whatever to RAM; that is just paper.

At nvidia you can throw away all gpu's except the $2000 league of course, as only that league really has the bandwidth they claim. The normal gpu's of them have the bandwidth not to RAM for the gpu, but to pixel units bla bla bla; in short, for graphics, which is useless in gpgpu programming.

This is why the few who tried some experiments with graphics cards from nvidia are so disappointed, even for multimedia things on them; only the tesla cards are capable of gpgpu in a serious manner with nvidia.

That range starts at 2000 euro last time I checked, and around 300 watts.

Yet those tesla's are serious things, in contrast to larrabee. The definition of what larrabee is has already changed a few times, so let's not brag about something until it exists in the shops. When something is there, you can look at how it performs and how much faster its new generation is for you.

With gpu's, of course, terabytes of bandwidth are easy to get: let each few cores hammer onto a different DIMM. That is local bandwidth however, what we would call the L1 + L2 bandwidth together of a cpu. They're counting the bandwidth in rather creative manners, I'd say.

What we need is shared memory, and they just don't have that in a very efficient manner.

There is no question that gpu's have more bandwidth to RAM, as they've got GDDR ram and cacheline sizes of like 256 bytes, but it is not shared RAM.

For us, latency matters.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: 480 high speed CPUs for $500

Post by diep »

Bob,

A few points on RAM.

DDR3 @ 3 channels means you can get 192 bytes at once,
versus us poor souls with DDR2 @ 2 channels getting 64 bytes at once.

Now the question is how happy we are with that, as we aren't doing vector processing in chess, yet we care about latency of course.

The problem of the intel model is that CSI has 2 rings. So first the ring has to wake up out of its forever sleep; then, when that servant finally enters the university to serve bytes to us, it hands its 192 bytes to the next one. Then after a while, after everyone has touched it, it gets delivered to your office.

In short, that's not a quick latency for intel.

Vincent
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: 480 high speed CPUs for $500

Post by diep »

diep wrote:Bob,

A few points on RAM.

DDR3 @ 3 channels means you can get 192 bytes at once,
versus us poor souls with DDR2 @ 2 channels getting 64 bytes at once.

Now the question is how happy we are with that, as we aren't doing vector processing in chess, yet we care about latency of course.

The problem of the intel model is that CSI has 2 rings. So first the ring has to wake up out of its forever sleep; then, when that servant finally enters the university to serve bytes to us, it hands its 192 bytes to the next one. Then after a while, after everyone has touched it, it gets delivered to your office.

In short, that's not a quick latency for intel.

Vincent
I would want to go even further.

Bandwidth measurements on i7 show it gets something like 18GB/s.

That's with 192 bytes at a time. How many hashtable entries per second can it serve us, if we realize AMD gets 12GB/s with 64-byte reads?

A bigger pipe nearly always means bad news for latency.
Now we didn't discuss the price yet. ECC registered DDR ram is still really expensive. DDR2 is dirt cheap; DDR3 again is really expensive. How about DDR3 ECC-reg?

Of course that's not a worry for a university professor, but I'd like to have 64GB of ram one day; with DDR3 that isn't going to happen.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 480 high speed CPUs for $500

Post by bob »

diep wrote:
diep wrote:Bob,

A few points on RAM.

DDR3 @ 3 channels means you can get 192 bytes at once,
versus us poor souls with DDR2 @ 2 channels getting 64 bytes at once.

Now the question is how happy we are with that, as we aren't doing vector processing in chess, yet we care about latency of course.

The problem of the intel model is that CSI has 2 rings. So first the ring has to wake up out of its forever sleep; then, when that servant finally enters the university to serve bytes to us, it hands its 192 bytes to the next one. Then after a while, after everyone has touched it, it gets delivered to your office.

In short, that's not a quick latency for intel.

Vincent
I would want to go even further.

Bandwidth measurements on i7 show it gets something like 18GB/s.

That's with 192 bytes at a time. How many hashtable entries per second can it serve us, if we realize AMD gets 12GB/s with 64-byte reads?

A bigger pipe nearly always means bad news for latency.
Now we didn't discuss the price yet. ECC registered DDR ram is still really expensive. DDR2 is dirt cheap; DDR3 again is really expensive. How about DDR3 ECC-reg?

Of course that's not a worry for a university professor, but I'd like to have 64GB of ram one day; with DDR3 that isn't going to happen.
It doesn't have to be registered, however. Our prototype is not using reg memory in fact, although the production models we order might. But it does produce some pretty good NPS numbers. The bottom-end 5520 is hitting 22-24M nps right now.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: 480 high speed CPUs for $500

Post by diep »

bob wrote:
diep wrote:
diep wrote:Bob,

A few points on RAM.

DDR3 @ 3 channels means you can get 192 bytes at once,
versus us poor souls with DDR2 @ 2 channels getting 64 bytes at once.

Now the question is how happy we are with that, as we aren't doing vector processing in chess, yet we care about latency of course.

The problem of the intel model is that CSI has 2 rings. So first the ring has to wake up out of its forever sleep; then, when that servant finally enters the university to serve bytes to us, it hands its 192 bytes to the next one. Then after a while, after everyone has touched it, it gets delivered to your office.

In short, that's not a quick latency for intel.

Vincent
I would want to go even further.

Bandwidth measurements on i7 show it gets something like 18GB/s.

That's with 192 bytes at a time. How many hashtable entries per second can it serve us, if we realize AMD gets 12GB/s with 64-byte reads?

A bigger pipe nearly always means bad news for latency.
Now we didn't discuss the price yet. ECC registered DDR ram is still really expensive. DDR2 is dirt cheap; DDR3 again is really expensive. How about DDR3 ECC-reg?

Of course that's not a worry for a university professor, but I'd like to have 64GB of ram one day; with DDR3 that isn't going to happen.
It doesn't have to be registered, however. Our prototype is not using reg memory in fact, although the production models we order might. But it does produce some pretty good NPS numbers. The bottom-end 5520 is hitting 22-24M nps right now.
Might be the compiler for SPEC programs, together with the 2nd memory controller on chip. At SPEC, even on the oldie core2, the new compiler kicks total butt and is a lot faster than the previous compiler. I still have to test it myself carefully for Diep. Diep isn't in SPEC of course; it's usually a big boomer then.

2 memory controllers for 8 cores is a lot better than 1 for 8, of course.

Diep scaled around 7 out of 8 at the oldie core2 @ 8 cores; in short, that's simply missing nearly 15%.

Add the compiler to that, and it quickly adds up.

However, a core in itself isn't faster.

Vincent