INTEL XEON PHI 1TERAFLOP PCIe!
Moderators: hgm, Rebel, chrisw
-
- Posts: 900
- Joined: Tue Apr 27, 2010 3:48 pm
Re: INTEL XEON PHI 1TERAFLOP PCIe!
Interesting. I found this article which has some more details:
http://www.anandtech.com/show/6017/inte ... es-retail/
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: INTEL XEON PHI 1TERAFLOP PCIe!
For us chess programmers, it is still 50 threads plus whatever you can do with 512-bit SIMD registers, such as vectorized loops and bitboard tricks. The x86 support is a major selling point of Xeon Phi for the HPC market, not teraflops, as it lags 5x behind AMD and NVIDIA there. For example, it runs its own operating system, so you can pass messages to it like you do in clusters. Many scientists would prefer to work with familiar languages and tools like these rather than code specifically for a GPU card.
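To make the "bitboard tricks" point concrete, here is a minimal sketch in Python that models a 512-bit register as eight 64-bit bitboards and applies the same operation to every lane. The names `shift_north`, `north_fill` and `simd_map`, and the a1 = bit 0 square mapping, are assumptions for illustration only; real code would use the card's vector instructions, which this does not attempt to reproduce.

```python
MASK64 = (1 << 64) - 1   # keep results inside one 64-bit bitboard
LANES = 8                # 512 bits / 64 bits per bitboard

def shift_north(bb):
    """Move every set square one rank up (assumes a1 = bit 0, h8 = bit 63)."""
    return (bb << 8) & MASK64

def north_fill(bb):
    """Smear set bits toward rank 8 using three shift-or steps."""
    bb |= (bb << 8) & MASK64
    bb |= (bb << 16) & MASK64
    bb |= (bb << 32) & MASK64
    return bb

def simd_map(op, lanes):
    """Apply one operation to all eight lanes, the way a single
    vector instruction would act on a 512-bit register."""
    if len(lanes) != LANES:
        raise ValueError("expected exactly 8 bitboards")
    return [op(bb) for bb in lanes]

# Eight positions' worth of pawn bitboards handled in one "instruction".
pawns = [1 << sq for sq in range(8, 16)]      # one pawn on each square of rank 2
pushed = simd_map(shift_north, pawns)         # every pawn moves to rank 3
```

The point is only that one vector operation covers eight boards at once, so the gain depends on having eight independent bitboards to feed it.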
-
- Posts: 900
- Joined: Tue Apr 27, 2010 3:48 pm
Re: INTEL XEON PHI 1TERAFLOP PCIe!
Daniel Shawul wrote: For us chess programmers, it is still 50 threads plus whatever you can do with 512-bit SIMD registers, such as vectorized loops and bitboard tricks. The x86 support is a major selling point of Xeon Phi for the HPC market, not teraflops, as it lags 5x behind AMD and NVIDIA there. For example, it runs its own operating system, so you can pass messages to it like you do in clusters. Many scientists would prefer to work with familiar languages and tools like these rather than code specifically for a GPU card.
Compared to GPUs I expect that this architecture will be much more flexible. GPUs trade off flexibility for raw speed (that's why they do so many TFlops).
By flexibility I mean things like being able to run threads that are truly independent from each other. On GPUs, if threads in the same computation unit take different sides of a conditional statement, the others stall, because they all have to run in sync with each other.
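A rough way to picture the stall is a toy cost model, sketched below in Python. It assumes a group of threads that must run in lockstep executes each side of a branch in turn whenever at least one thread needs it; the function names and cycle counts are made up for illustration and do not describe any particular GPU.

```python
def lockstep_cycles(takes_branch, then_cost, else_cost):
    """Cycles for a group of threads that must execute in lockstep.
    Each side of the branch is run once if any thread needs it."""
    cycles = 0
    if any(takes_branch):
        cycles += then_cost
    if not all(takes_branch):
        cycles += else_cost
    return cycles

def independent_cycles(takes_branch, then_cost, else_cost):
    """Per-thread cycles when every thread runs on its own core."""
    return [then_cost if taken else else_cost for taken in takes_branch]

uniform = [True] * 32                 # all threads agree on the branch
divergent = [True] + [False] * 31     # one thread disagrees

uniform_cost = lockstep_cycles(uniform, 10, 30)
divergent_cost = lockstep_cycles(divergent, 10, 30)
slowest_independent = max(independent_cycles(divergent, 10, 30))
```

In this model a single disagreeing thread makes the whole group pay for both paths, while independent cores only ever pay for the path each thread actually takes.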
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: INTEL XEON PHI 1TERAFLOP PCIe!
rbarreira wrote: Compared to GPUs I expect that this architecture will be much more flexible. GPUs trade off flexibility for raw speed (that's why they do so many TFlops).
Yes, but GPUs themselves are becoming more flexible with each release. The only thing I haven't seen so far is a cache-coherent L1 cache. A normal x86 program usually assumes coherence. It makes programming a lot easier, but it has additional hardware costs and incurs a performance loss, especially for HPC. The microarchitectures (not just the ISAs) are too different to say which is better. Xeon Phi inherits many features from Larrabee, with the exception of video gaming, which they dropped completely after their utter failure in 2009.
rbarreira wrote: By flexibility I mean things like being able to run threads that are truly independent from each other. On GPUs, if threads have conditional statements this stalls all the other cores in the same computation unit because they have to run in sync with each other.
You can think of it as having 50 SMs, compared with GPUs, which usually have 16 or so. And the 512-bit wide registers are equivalent to 16 cores (16 lanes of 32 bits each). I don't know how many threads you can launch on Xeon Phi cores, but it will be far fewer than on GPUs. So the difference is that GPUs launch a lot more threads, and the execution is SIMD only for 32 threads at a time. The SMs (even other warps in the same block) can execute completely independent code. For my MC thread search each thread does its own independent MC simulation (tens of thousands of them), but Xeon Phi would run only 50 cores times maybe 4 threads per core = 200 threads at a time. It can do the same teraflops in the end, but the way it is achieved is different.
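The counting above can be written out as a few lines of Python. The figures are the ones quoted in this thread (50 cores, "maybe 4" threads per core, 512-bit registers); the 32-bit lane width is an assumption corresponding to single-precision values, and the variable names are only for illustration.

```python
PHI_CORES = 50
PHI_THREADS_PER_CORE = 4    # "maybe 4 per core", as guessed in the post
REGISTER_BITS = 512
LANE_BITS = 32              # assumed: one single-precision value per lane

phi_threads = PHI_CORES * PHI_THREADS_PER_CORE     # hardware threads in flight
lanes_per_register = REGISTER_BITS // LANE_BITS    # "equivalent to 16 cores"
phi_lanes = PHI_CORES * lanes_per_register         # data elements per vector step

print(phi_threads, lanes_per_register, phi_lanes)
```

So the card offers a few hundred independent instruction streams, and the rest of the throughput has to come from filling the vector lanes.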
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: INTEL XEON PHI 1TERAFLOP PCIe!
Daniel Shawul wrote: not teraflops, as it lags 5x behind AMD and NVIDIA there
Wow, the 1 teraflop of Xeon Phi is double precision! I thought it was single precision. Well, then that would make it the same as the HD 7970, which has about the same double-precision throughput but 4x more single precision. OTOH NVIDIA's Tesla K10 has 4.6 teraflops single precision but a much lower 0.19 teraflops double precision. I think they have real competition on their hands. But there may be a catch somewhere, like being more expensive or consuming more power, etc.
-
- Posts: 2657
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: INTEL XEON PHI 1TERAFLOP PCIe! Integer Performance?
Does anybody know about the integer throughput?
Hope it will support OpenCL.
--
Srdja
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: INTEL XEON PHI 1TERAFLOP PCIe! Integer Performance?
I think the integer throughput will not be as impressive, since it has the same 16 64-bit registers and the cores are derived from an old Pentium processor. OTOH they have added 32 512-bit wide registers, so the single-precision flops could be a surprising figure when announced, since DP is already at 1 TFlops. Performance also depends on the number of ALU/FPU/DPU units it has, which I don't know ...
But then the microarchitecture is completely different from GPUs, and I am not convinced this 1 TFlop figure is relevant, since it all comes from issuing the same operation on multiple registers (data). I now believe the Tesla is more general purpose, since you have many threads actually running. NVIDIA understandably called it SIMT. Xeon Phi, on the other hand, gains its performance from pure SIMD operations, since it launches far fewer threads. It is good that we have 50 cores (albeit much less powerful ones) that can be used for general-purpose work, but I don't think it is more flexible in other aspects if your goal is performance... The 4 threads per core also seem to be used for hyper-threading, so there probably isn't much to be gained from there.
Everything I said should be taken with a big grain of salt.
P.S.: OpenCL, MPI, OpenMP and others should not be a problem.
-
- Posts: 2657
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: INTEL XEON PHI 1TERAFLOP PCIe! Integer Performance?
I think 50 cores are more suitable for a classic depth-first searcher like YBWC, where communication and synchronization are necessary.
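For readers who have not met YBWC (Young Brothers Wait Concept), here is a minimal sketch of the idea in Python, under strong simplifying assumptions: a toy game tree given as nested lists, splitting only at the root, and a thread pool standing in for the cores. The names `alphabeta` and `ybwc_root` are hypothetical, and because of Python's global interpreter lock this shows the control flow rather than any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

INF = float("inf")

def alphabeta(node, alpha, beta):
    """Sequential negamax alpha-beta on a toy tree.
    A leaf is a number (score for the side to move there);
    an inner node is a non-empty list of child nodes."""
    if not isinstance(node, list):
        return node
    best = -INF
    for child in node:
        score = -alphabeta(child, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break               # cutoff: the remaining siblings cannot matter
    return best

def ybwc_root(root, workers=4):
    """Search the eldest brother first, then the younger brothers in parallel,
    each using the bound established by the first move."""
    first, rest = root[0], root[1:]
    best = -alphabeta(first, -INF, INF)      # the younger brothers wait for this
    if not rest:
        return best
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker gets the window (best, +inf) seen from the root's side.
        scores = pool.map(lambda child: -alphabeta(child, -INF, -best), rest)
        return max(best, *scores)

tree = [[3, 5], [2, 9], [4, 1]]
value = ybwc_root(tree)
```

The only synchronization in this sketch is waiting for the first move and collecting the results; a real engine would also share improved bounds and abort siblings after a cutoff, which is where the communication cost comes in.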
--
Srdja