480 high speed CPUs for $500

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

480 high speed CPUs for $500

Post by Dann Corbit »

http://www.nvidia.com/object/product_ge ... 95_us.html

Since there are so many gamers, I guess that cards like this are very common.

At some point, strong chess engines will take advantage of it.
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: 480 high speed CPUs for $500

Post by MattieShoes »

I've seen crafty running on a linksys router, but never on a graphics card :-) That'd be awesome!
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: 480 high speed CPUs for $500

Post by wgarvin »

Dann Corbit wrote:http://www.nvidia.com/object/product_ge ... 95_us.html

Since there are so many gamers, I guess that cards like this are very common.

At some point, strong chess engines will take advantage of it.
I'm looking forward to Intel's Larrabee stuff. They are basically stripped-down x86 cores, but lots of them. They will probably be a great fit for chess engines.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 480 high speed CPUs for $500

Post by bob »

Dann Corbit wrote:http://www.nvidia.com/object/product_ge ... 95_us.html

Since there are so many gamers, I guess that cards like this are very common.

At some point, strong chess engines will take advantage of it.
Yes, but it is a different world with very limited memory bandwidth between the video card memory and main computer memory. We have a project going on here in conjunction with Sandia lab to use a gpu card as a very high-performance RAID controller. And they have done quite well, but the issue is getting data in and out of the card itself.
CRoberson
Posts: 2055
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: 480 high speed CPUs for $500

Post by CRoberson »

bob wrote: [quoted in full above; snipped]
Agreed, but could this be an issue of changing dynamics? If I can get 480 cheap but fast processors and use them reasonably efficiently, then maybe I don't care so much about memory bandwidth. It seems to me the primary hog on memory bandwidth is the hash tables.

If hash tables give me a 2x to 4x time-to-ply improvement and the 480 processors give me a 300x speedup, then I can drop transposition tables and still come out ahead.
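The trade-off above can be put in one line of arithmetic. A minimal sketch, using the hypothetical figures from the post (2x-4x from hashing, 300x from 480 cores), not measurements:

```c
/* CRoberson's trade-off in code. The figures (2x-4x gain from hashing,
   300x raw speedup from 480 cores) are his hypothetical numbers. */
static double net_speedup(double hash_gain, double raw_speedup) {
    /* give up the hash-table gain, keep the many-core speedup */
    return raw_speedup / hash_gain;
}
```

Even at the worst-case 4x hash gain, a 300x raw speedup would still net 75x.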
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: 480 high speed CPUs for $500

Post by sje »

CRoberson wrote: [quoted in full above; snipped]
About ten years ago Apple introduced their Quartz imaging model to take advantage of the general increase in capabilities of newer video adaptor cards. (Quartz, which uses much floating point, replaced the 16-bit integer QuickDraw model from 1984.) A few years later, Apple added "Quartz Extreme", which started the offloading of all kinds of image processing to video adaptors. For those like me who deal almost entirely with simple text, the Quartz Extreme model is no big deal. But it does help Apple with its persistent pushing of users towards more frequent hardware purchases.

--------

My experiments with multi-threaded perft() on dual- and quad-core boxes have shown that a two-tier transposition system (each thread gets its own private table plus access to a shared, locking store of higher-value entries) works fairly well, but only if the right level bound is set for choosing which table to use.

I'll guess that for a many-core box, a transposition system with log(number_of_cores) tiers would be best. Each thread could measure how much time it wastes waiting for locks and use that data to self-throttle access to the upper locking tiers.
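The two-tier scheme described above could be sketched roughly as follows. All sizes, the tier bound, and the depth-preferred replacement policy are illustrative assumptions, not sje's actual code:

```c
/* Sketch of a two-tier transposition scheme (assumed layout): each thread
   owns a private table probed without locks; entries whose draft (remaining
   depth) meets a bound are also stored in a shared, lock-protected table. */
#include <pthread.h>
#include <stdint.h>

#define PRIVATE_BITS 16           /* 64K entries per thread (illustrative) */
#define SHARED_BITS  20           /* 1M shared entries (illustrative) */
#define TIER_BOUND   6            /* drafts >= this go to the shared tier */

typedef struct { uint64_t key; int16_t score; uint8_t draft; uint8_t flags; } Entry;

typedef struct {
    Entry priv[1u << PRIVATE_BITS];   /* per-thread, no locking needed */
} PrivateTable;

static Entry shared_tab[1u << SHARED_BITS];
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

void tt_store(PrivateTable *t, uint64_t key, int score, int draft, int flags) {
    Entry e = { key, (int16_t)score, (uint8_t)draft, (uint8_t)flags };
    t->priv[key & ((1u << PRIVATE_BITS) - 1)] = e;    /* always store locally */
    if (draft >= TIER_BOUND) {                        /* promote deep entries */
        pthread_mutex_lock(&shared_lock);
        Entry *s = &shared_tab[key & ((1u << SHARED_BITS) - 1)];
        if (draft >= s->draft) *s = e;                /* depth-preferred replace */
        pthread_mutex_unlock(&shared_lock);
    }
}

int tt_probe(PrivateTable *t, uint64_t key, Entry *out) {
    Entry *p = &t->priv[key & ((1u << PRIVATE_BITS) - 1)];
    if (p->key == key) { *out = *p; return 1; }       /* lock-free fast path */
    pthread_mutex_lock(&shared_lock);                 /* fall back to shared tier */
    Entry *s = &shared_tab[key & ((1u << SHARED_BITS) - 1)];
    int hit = (s->key == key);
    if (hit) *out = *s;
    pthread_mutex_unlock(&shared_lock);
    return hit;
}
```

The self-throttling idea would then amount to raising TIER_BOUND per thread when time spent waiting on shared_lock exceeds some budget.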
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 480 high speed CPUs for $500

Post by bob »

CRoberson wrote: [quoted in full above; snipped]
To do a parallel search, you have to have inter-processor communication that is quick, with low latency... That's the issue I was thinking of, not hashing at all. One could tolerate significant hash-probe latency with a little extra programming, but not delays in synchronization and such.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: 480 high speed CPUs for $500

Post by diep »

Dann Corbit wrote:http://www.nvidia.com/object/product_ge ... 95_us.html

Since there are so many gamers, I guess that cards like this are very common.

At some point, strong chess engines will take advantage of it.
Forget about letting several cards cooperate.
Let's assume one card for now, with 2 GPUs inside.

A) The hash-table problem
a1) The 2 GPUs have no shared memory between them. BIG problem.
a2) You can hardly allocate RAM on Nvidia; you might manage only a 200MB block, despite the device having more. On Tesla cards you might be able to allocate more.
a3) Because the hash tables are not shared, you get a WORSE BRANCHING FACTOR.
a4) Device RAM used for the hash table doesn't get cached. BIG SLOWDOWN.

b) You have to create two layers of parallelism:

b1) parallelism between thread blocks (2 x 8 blocks)
b2) parallelism within 1 thread block of 30 cores

Most engines don't even have parallelism that scales well, let alone a good speedup at b1, let alone that you are going to manage b2.

Note that b2 is deterministic.

What speedup do you get out of b1 + b2?

Moreover, you need to be at Einstein level in parallel search to get it done without too much overhead,

AS YOU CANNOT LOSE A FACTOR 50 TO SCALING OVERHEAD LIKE CILKCHESS AND ZUGZWANG DID IN THE PAST.

c) The IPC of each core is a lot lower. Whereas nowadays core2/i7/phenom2 are approaching 2.0, multimedia code on Nvidia gets around 0.4. Note that multimedia code is a lot easier than chess code; chess code has more branches.

d) The manycore overhead. All code gets executed, whether you need it or not: every core executes the same code within 1 block.

And see above: every hash-table entry is uncached, so you have to wait at each node for a hash-table entry, even if in the last few plies you don't want to probe.

OK, let's now compare speed.

32-core Xeon MP Beckton (release on 15 April):

32 cores * 2.9GHz * IPC of 2.0 ==> 185.6 G instructions per second.
No parallel-speedup penalty applied yet. I don't know the speedup of Diep on this 32-core box; I would estimate it at 20 out of 32.

So that makes it effectively:

20 * 2.0 * 2.9GHz = 40 * 2.9 = 116 Gi

Now the 295.

You can use 240 cores, of course. How to do parallelism between the 2 GPUs of 240 cores I do not know. You tell me.

So basic speed: 240 cores * 1.2GHz * 0.4 = 115 Gi.
To the hash table you're going to lose a factor 2 for sure.
You also lose a factor 2 for sure to the manycore overhead (as every line of code has to be executed).

Note this is an optimistic guess. I know the opinion of some chess programmers who want to make it a factor 5 or so.

Now the speedup at 240 cores. A big problem with two-layer parallelism is that it is really suboptimal. It is not only complicated; I would expect you lose a factor 2 to scaling in advance, so the theoretical maximum would be a 50% speedup. In reality more like 25-30% of that is normal. 30% of 50% = 15%. That's already very good to achieve.

So we end up with 115 * 0.15 / 4 = 4 Gi.

So a 32-core Xeon MP box is roughly a factor 30 better.

That's still an optimistic guess.
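The back-of-envelope estimate above can be transcribed directly into code. Every figure (clocks, IPC, derating factors, the 15% parallel speedup) is the post's assumption, not a measurement:

```c
/* diep's CPU-vs-GPU estimate, numbers taken as-is from the post above. */

/* effective throughput in G instructions/sec: cores * clock(GHz) * IPC */
static double gi(double cores, double ghz, double ipc) {
    return cores * ghz * ipc;
}

/* GPU derates: ~15% parallel speedup from two-layer parallelism, then a
   factor 2 for uncached hash probes and a factor 2 for manycore overhead */
static double gpu_effective(double raw_gi) {
    return raw_gi * 0.15 / 4.0;
}
```

Plugging in the post's numbers: the Xeon lands at 116 Gi effective, the GTX 295 at about 4.3 Gi, i.e. roughly a factor 27-30 apart.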

We didn't yet factor in the worse branching factor that you will get on a GPU.

The trick, of course, is that we do not compare 1 CPU with a GPU; we compare 4 CPUs with 1 GPU, as those 4 CPUs have shared memory and a GPU does NOT.

Forget parallelism over PCI-e. The latency is too high.

Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 480 high speed CPUs for $500

Post by bob »

diep wrote: [quoted in full above; snipped]
I agree with most of the above. The Beckton (and Nehalem et al. i7 processors) pose an interesting set of constraints, from the shared L3 (8MB on the machine we have, bigger on a machine I have tested on) to the "sorta-NUMA" way Intel connects to memory: with a quad-package/socket box you get local memory and remote memory, just two classes/speeds, rather than AMD's local memory, remote memory (two of four processors), and really remote memory (slowest of all). So some issues on AMD disappear so long as we talk about 4-socket machines. But 8-socket machines are also interesting, and I am not sure what is going to happen there with regard to Intel; we already know with the Opterons... Intel chose MESIF rather than MOESI, so it isn't quite as efficient to share data across L3s if it gets modified a lot. All of this means that a little programming care is needed: screw this up and throughput can drop to 50% of optimal in a heartbeat. We've already seen and measured this, and not just with regard to chess searching.

The 4x8 box looks very good, IMHO. And Linux is doing well with process scheduling to make them effective to use. But I suspect chess programs are going to produce widely differing levels of performance unless they are carefully tuned with respect to cache and memory access patterns.
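A classic instance of the "programming care" in question is false sharing: two counters updated by different threads landing on one cache line, so the line's ownership ping-pongs between sockets. A minimal sketch of the usual fix (padding to cache-line size; illustrative, not anyone's measured code):

```c
/* False sharing vs. cache-line padding. On MESIF/MOESI systems, writes to
   a line held by another socket's L3 force ownership transfers; padding
   each hot counter to its own 64-byte line avoids the throughput loss. */
#include <stdint.h>

#define CACHE_LINE 64

/* BAD: both counters share one cache line; writes from different threads
   invalidate each other's cached copy constantly. */
struct shared_bad { uint64_t a; uint64_t b; };

/* GOOD: each counter is padded out to a full line, so each thread's
   writes stay in a line no other thread touches. */
struct shared_good {
    uint64_t a; char pad_a[CACHE_LINE - sizeof(uint64_t)];
    uint64_t b; char pad_b[CACHE_LINE - sizeof(uint64_t)];
};
```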
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: 480 high speed CPUs for $500

Post by diep »

bob wrote: [diep's post and bob's reply quoted in full above; snipped]
Oh yes, the 32-core box will be interesting. Note that the AMD memory subsystem is quite superior to Intel's; Intel just copies AMD in a bad manner, avoiding patents.

AMD already had L1, L2 and L3 on mainstream Opterons (called Barcelona, Phenom and now Phenom2) before Intel managed to upgrade their core2 to the i7 with an L3.

Next year, so not much later than Intel ships Beckton (of course, having test hardware gives Intel a big edge here), AMD will release 12-core CPUs which are quad-socket capable. That's 48 cores.

I don't know how high Beckton will be clocked. Probably they manage one version at 3.2GHz; I'm not sure.

The whole point is that the remote shared memory we've got at Intel and AMD is such a huge advantage that it annihilates any GPU attempt at game-tree search in advance.

GPUs can be fast if you have embarrassingly parallel software that doesn't need much RAM.

Let's suppose we have a 48-core AMD box in 2010, and that some magic GPU gets released which has shared memory and a higher IPC.

The question then is: how does the 40GB of hash table on such a big box compare against the, say, 400MB of hash table for 240 cores or so?

Note that on paper there are faster GPUs than Nvidia's. AMD GPUs are faster (they bought ATI).

Theirs has by now 1600 "cores", as they call it. I'd call that 320 stream processors, each with 5 execution units, which is quite a bit better than Nvidia's 480 cores with 1 unit each.

That is more powerful than Nvidia, and by a lot.

However, writing code that can effectively get a much higher IPC than 0.4, using those 5 execution units, is quite a challenge.

Another advantage of the AMD approach is that you don't need to write some sort of stupid two-layer parallelism. That saves you 50% of the effort.

Again, the problem at AMD GPUs is similar to Nvidia's with respect to the hash table: the x2 versions with 2 GPUs do not have shared memory.

Well, you know, there is a link between the GPUs that's quite fast, but programming for it really complicates the code once again.

We don't have big teams of great low-level coders for our chess software; everyone is doing it himself.

It is very difficult to make software for a GPU, knowing that only you can use it, and that by the time you have it done, maybe Nvidia is out of business.

So far, CPUs have simply been a lot higher clocked and faster.

I'm putting some effort into Diep right now to see how I can streamline its parallelism better for the futuristic 32- and 48-core machines.

Yet making a new chess engine specially for a GPU would really be a full-time two-year project. I would first of all go for AMD, as that has more potential for computer chess: a single AMD stream processor is 900MHz and 5-way, versus Nvidia at 1.2GHz and 1 instruction a cycle max.

So that's a potential of 1.2 Gi for Nvidia vs 4.5 Gi for AMD.

You start with, say, a factor 3 difference in favour of ATI/AMD for a single stream processor. There is also more cache available per stream core at AMD than at Nvidia; Nvidia really has little cache.

Note that cache size is of very minor importance in the Intel vs AMD battle, yet in GPUs the cache per stream core is so tiny that it might be relevant.

No good low-level programmer is going to start a GPU chess programming project any time soon, simply because you can't sell it and a multi-socket machine is going to be faster anyway. Those are the two main bottlenecks.

You would really need some sort of 4-layer parallelism to kick butt with a GPU.

Those Skulltrails can have 5 ATI/AMD GPU cards. Let's say you get access from the government to their ultra-classified Skulltrail clusters with 5 GPU cards per box and a fast network.

The highest level of parallelism is the parallelism between nodes. Say you use 240 machines.

Then each machine has a second layer of parallelism over PCI-e between the 5 cards, which has really ugly latency (the slowest of all the layers).

Then within 1 card you have 2 GPUs, and bad parallelism between those 2 GPUs. That's the 3rd layer.

Then there is a 4th layer: the parallel search across the manycores themselves.

In the case of Nvidia cards you would need a 5th layer of parallelism (actually that would be the 4th) between the thread blocks.

That requires an IQ-180+ guy, and making all those layers of parallelism work together is really complicated.

Secondly, you need a low-level implementation of a chess engine, really complicated, in order to get a not-too-ugly IPC. It is way more important to do that right on a GPU than on a CPU.

CPUs have billions of transistors to increase your IPC in a clever manner, which takes all those problems away from you. That's not there in GPUs; they are real low-power cores.

The real limit of GPUs is that you'll find no one who wants to do this unpaid.

Vincent