It is superior to AMD in some ways. For example, 3 memory channels. And the interconnects between the memory controllers can use 3 pathways as well, so a 4-socket box doesn't have 1 local memory, 2 1-hop memories, and 1 2-hop memory. Intel is 1 0-hop and 3 1-hop accesses, which is better, although what they will do beyond 4 sockets is unknown. But 3 connections is clearly better than 2.

diep wrote:
Oh yes, the 32 core box will be interesting. Note that the AMD memory subsystem is quite superior to intel's. Intel just copies AMD in a bad manner, avoiding patents.

bob wrote:
I agree with most of the above. The Beckton (and nehalem, et al. i7 processors) pose an interesting set of constraints, from the shared L3 (8MB on the machine we have, bigger on a machine I have tested on) to the "sorta-NUMA" way Intel connects to memory: with a quad-package/socket box you get local memory and remote memory, just two classes/speeds, rather than AMD's local memory, remote memory (two of four processors) and really remote memory (slowest of all). So some issues on AMD disappear so long as we talk about 4-socket machines. But 8-socket machines are also interesting and I am not sure what is going to happen there with regard to intel. We already know with the opterons... Intel chose MESIF rather than MOESI, so it isn't quite as efficient to share data across L3s if it gets modified a lot. All of this means that a little programming care is needed. Screw this up and throughput can drop to 50% of optimal in a heartbeat. We've already seen and measured this, and not just with regard to chess searching.

diep wrote:
Forget about letting several cards cooperate.

Dann Corbit wrote:
http://www.nvidia.com/object/product_ge ... 95_us.html
Since there are so many gamers, I guess that cards like this are very common.
At some point, strong chess engines will take advantage of them.
Let's assume 1 card now with 2 gpu's inside.
A) the hashtable problem
a1) the 2 gpu's have no shared memory between each other, a BIG problem.
a2) you can hardly allocate RAM on nvidia; you might manage only a block of about 200MB, despite the device having more. On Tesla cards you might be able to allocate more.
a3) as a result of the hashtables not being shared, you get a WORSE BRANCHING FACTOR.
a4) device RAM, which you use for the hashtable, doesn't get cached: a BIG SLOWDOWN.
b) you have to create two-layer parallelism
b1) parallelism between thread blocks (2 x 8 blocks)
b2) parallelism within 1 thread block of 30 cores
Most engines don't even have parallelism that scales well, let alone a good speedup at b1, let alone that you would manage b2.
Note that b2 is deterministic.
What's the speedup you get out of b1 + b2?
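To put a rough number on that question: the efficiencies of the two layers multiply. A minimal sketch, with illustrative efficiencies that are assumptions for demonstration, not measurements:

```python
# Two-layer GPU parallelism: the efficiency of each layer multiplies.
# Both numbers are illustrative assumptions, not measurements.
eff_b1 = 0.50  # b1: parallel search between thread blocks
eff_b2 = 0.30  # b2: parallel search within one 30-core thread block

combined = eff_b1 * eff_b2
print(f"combined parallel efficiency: {combined:.0%}")  # → 15%
```

So even fairly generous per-layer numbers leave only a small fraction of the raw core count doing useful work.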
Moreover, you need to be at Einstein level in parallel search to get it done without too much overhead, as you cannot lose a factor 50 to scaling overhead like Cilkchess and Zugzwang did in the past.
c) the ipc of each core is a lot lower. Whereas core2/i7/phenom2 nowadays approach 2.0, multimedia code on nvidia gets about 0.4. Note that multimedia code is a lot easier than chess code; chess code has more branches.
d) the manycore overhead. All code gets executed, whether you need it or not; every core within 1 block executes the same code.
See above: every hashtable entry is uncached, so at every node you have to wait for a hashtable probe, even if in the last few plies you don't want to do one.
Ok, let's now compare speed.

32-core Xeon MP Beckton (release on 15 April):
32 cores * 2.9 GHz * 2.0 ipc ==> 185.6 G instructions a second (Gi)

That's with no parallel speedup penalty yet. I don't know Diep's speedup on this 32-core box; I would estimate it at 20 out of 32. So that makes it in fact effectively:
20 * 2.0 * 2.9 GHz = 116 Gi
Now the 295.
You can use 240 cores of course. How to do parallelism between the 2 gpu's of 240 cores each, I do not know. You tell me.
So basic speed: 240 cores * 1.2 GHz * 0.4 ipc = 115.2 Gi
To the hashtable problem you are going to lose a factor 2 for sure, and another factor 2 to the manycore overhead (as every line of code has to be evaluated).
Note this is an optimistic guess; I know chess programmers who would put it at a factor 5 or so.
Now the speedup at 240 cores. A big problem with two-layer parallelism is that it is really suboptimal. It is not only complicated; I would expect you to lose a factor 2 to scaling in advance, so the theoretical maximum would be a 50% speedup. In reality more like 25-30% of that is normal. 30% of 50% = 15%. Even that is very good to achieve.
So we end up with 115.2 * 0.15 / 4 ≈ 4.3 Gi
So a 32 core Xeon MP box is roughly factor 30 better.
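The whole back-of-the-envelope comparison can be collected in one place; every factor below is an estimate from the text above, not a benchmark:

```python
# Estimates from the post above (assumptions, not benchmarks).
xeon_gi = 20 * 2.0 * 2.9       # 20-of-32 effective cores * 2.0 ipc * 2.9 GHz
gpu_raw = 240 * 1.2 * 0.4      # 240 cores * 1.2 GHz * 0.4 ipc
gpu_gi = gpu_raw * 0.15 / 4    # 15% parallel efficiency, /2 hashtable, /2 manycore

print(f"Xeon: {xeon_gi:.1f} Gi, GPU: {gpu_gi:.2f} Gi, "
      f"ratio: {xeon_gi / gpu_gi:.0f}x")
# → Xeon: 116.0 Gi, GPU: 4.32 Gi, ratio: 27x
```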
That's still an optimistic guess.
And we haven't yet factored in the worse branching factor that you will get on a gpu.
The trick is of course that we are not comparing 1 cpu with a gpu; we are comparing 4 cpu's with 1 gpu, as those 4 cpu's have shared memory and a gpu does NOT.
Forget parallelism over pci-e; the latency is too high.
Vincent
The 4x8 box looks very good, IMHO. And Linux is doing well with process scheduling, making them effective to use. But I suspect chess programs will produce widely differing levels of performance unless they are carefully tuned with respect to cache and memory access patterns.
AMD already had L1, L2 and L3 on mainstream opterons (called barcelona, phenom and now phenom2) before intel managed to upgrade their core2 to the i7 with an L3.
Next year, so not that much later than intel starts selling beckton (of course, having test hardware early gives intel a big edge here), AMD will release 12-core cpu's which are quad-socket capable. That's 48 cores.
I don't know how high beckton will be clocked. Probably they'll manage one version at 3.2 GHz; I'm not sure.
The whole point is that the remote shared memory we've got at intel and AMD is such a huge advantage that it annihilates any GPU attempt in advance for game tree search.
GPU's can be fast if you have embarrassingly parallel software that doesn't need much RAM.
Let's suppose we have a 48-core AMD box in 2010, and that some magic GPU gets released which has shared memory and a higher ipc.
The question then is: how does the 40 GB of hashtable on such a big box compare against, say, a 400MB hashtable for 240 cores?
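As a rough per-core figure, using the hypothetical box sizes from the question above:

```python
# Hashtable memory per searching core (hypothetical sizes from the text).
cpu_mb = 40 * 1024 / 48   # 40 GB shared across 48 cores
gpu_mb = 400 / 240        # 400 MB across 240 cores

print(f"CPU: {cpu_mb:.0f} MB/core, GPU: {gpu_mb:.2f} MB/core")
# → CPU: 853 MB/core, GPU: 1.67 MB/core
```

Two and a half orders of magnitude less transposition-table space per core, before even considering that the GPU's table is split rather than shared.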
Note there are, on paper, faster gpu's than nvidia's. AMD's gpu's are faster (they bought ATI). The current one has what they call 1600 cores; I'd call that 320 stream processors with 5 execution units each, quite a bit better than nvidia's 480 cores with 1 unit each. That is more powerful than nvidia, and by a lot.
However, writing code that effectively gets a much higher ipc than 0.4 using those 5 execution units is quite a challenge.
Another advantage of the AMD approach is that you don't need to write some sort of stupid two-layer parallelism. That saves you 50% of the effort.
Again, the problem on AMD gpu's is similar to nvidia's with respect to the hashtable: the x2 versions with 2 gpu's do not have shared memory.
Well, you know there is a link between the gpu's that's quite fast, but programming for it really complicates the code once again.
We don't have big teams of great low-level coders for our chess software; everyone is doing it himself.
It is very difficult to make software for a gpu, knowing that only you can use it, and that by the time you have it done, maybe nvidia is out of business?
So far, cpu's have simply been a lot higher clocked and faster.
I'm putting some effort right now into streamlining Diep's parallelism for the upcoming 32 and 48 core machines.
Yet making a new chess engine specially for a gpu would really be a fulltime 2-year project. I would first of all go for AMD, as that has more potential for computer chess: a single AMD stream processor runs at 900 MHz and is 5-wide, versus nvidia at 1.2 GHz and at most 1 instruction a cycle.
So that's a potential of 1.2 Gi for nvidia vs 4.5 Gi for AMD.
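A quick check of those per-stream-processor ceilings (peak numbers quoted in the text, not measured throughput):

```python
# Peak instructions/second for a single stream processor (optimistic
# ceilings from the text, not measured throughput).
nvidia_gi = 1.2 * 1   # 1.2 GHz, at most 1 instruction per cycle
amd_gi = 0.9 * 5      # 900 MHz, 5 execution units (VLIW)

print(f"nvidia: {nvidia_gi} Gi, AMD: {amd_gi} Gi, "
      f"factor: {amd_gi / nvidia_gi:.2f}")
# → nvidia: 1.2 Gi, AMD: 4.5 Gi, factor: 3.75
```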
You start with, say, a factor 3 advantage for ATI/AMD per stream processor. There is also more cache available per stream core at AMD than at nvidia; nvidia really has little cache.
Note that cache size is of very minor importance in the intel vs AMD battle, yet in gpu's the cache per stream core is so tiny that it might be relevant.
No good low-level programmer is going to start a gpu chess programming project any time soon, simply because you can't sell it and a multisocket machine is going to be faster anyway. THOSE are the 2 main bottlenecks.
You would really need some sort of 4-layer parallelism to kick butt with a gpu.
Those skulltrails can take 5 ATI/AMD gpu cards.
Let's say you get access from the government to their ultra-classified skulltrail clusters with 5 gpu cards per box and a fast network.
The highest level of parallelism is the parallelism between nodes; say you use 240 machines.
Then each machine has a second layer of parallelism over pci-e between the 5 cards, which has really ugly latency (the slowest of all the layers).
Then within 1 card you have 2 gpu's and bad parallelism between those 2 gpu's. That's the 3rd layer.
Then there is a 4th layer: the parallel search of the manycores.
In the case of nvidia cards you would need a 5th layer of parallelism between the thread blocks (actually that one would come 4th, before the manycore layer).
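These stacked layers compound; a rough sketch of how per-layer efficiencies multiply (all numbers below are made-up illustrations, not measurements):

```python
# Hypothetical per-layer parallel efficiencies: each layer multiplies in
# its own loss, so the stack is punished geometrically.
layers = {
    "between cluster nodes": 0.8,
    "pci-e between 5 cards": 0.5,   # worst latency of all layers
    "between 2 gpu's per card": 0.5,
    "manycore within a gpu": 0.5,
    "between thread blocks": 0.5,
}

total = 1.0
for name, eff in layers.items():
    total *= eff

print(f"overall parallel efficiency: {total:.1%}")  # → 5.0%
```

Even with no single layer being catastrophic, the product leaves only a few percent of the theoretical node rate.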
That requires an IQ-180+ guy, and making all those layers of parallelism work together is really complicated.
Secondly, you need a low-level implementation of a chess engine, really complicated, in order to get a not-too-ugly ipc. Getting that right is way more important on a gpu than on a cpu.
CPU's have billions of transistors devoted to increasing your ipc in a clever manner, taking all those problems away from you. That's not there in gpu's; they're really low-power cores.
The real limit of gpu's is that you'll find no one who wants to do it unpaid.
Vincent