INTEL XEON PHI 1TERAFLOP PCIe!

hammerklavier · Post by **hammerklavier** » Thu Jul 05, 2012 2:25 am

http://blogs.intel.com/technology/2012/ ... nnovation/

rbarreira · Post by **rbarreira** » Thu Jul 05, 2012 10:33 am

Interesting. I found this article which has some more details:

http://www.anandtech.com/show/6017/inte ... es-retail/

Daniel Shawul · Post by **Daniel Shawul** » Thu Jul 05, 2012 1:21 pm

For us chess programmers, it is still 50 threads plus whatever you can do with 512-bit SIMD registers, such as vectorized loops and bitboard tricks. The x86 support is the major selling point of Xeon Phi for the HPC market, not teraflops, as it lags 5x behind AMD and NVIDIA there. For example, it runs its own operating system, so you can pass messages to it like you do in clusters. Many scientists would prefer to work with such familiar languages rather than code specifically for a GPU card.
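To make the "bitboard tricks" point concrete, here is a minimal sketch of what a 512-bit register buys you: eight 64-bit bitboards processed by one operation. It emulates the register with a plain Python integer, so it shows the idea only, not real vector instructions. The square mapping (a1 = bit 0, one rank per 8 bits) and all function names are assumptions made for this illustration.

```python
LANE_BITS = 64                      # one bitboard per lane
LANES = 8                           # 8 x 64 = 512 bits
LANE_MASK = (1 << LANE_BITS) - 1


def pack(boards):
    """Pack eight 64-bit bitboards into one emulated 512-bit word."""
    if len(boards) != LANES:
        raise ValueError("expected exactly 8 bitboards")
    word = 0
    for i, board in enumerate(boards):
        word |= (board & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split an emulated 512-bit word back into eight bitboards."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


# Clears the top rank of every lane so a shift cannot spill into the next lane.
NOT_TOP_RANK = pack([LANE_MASK >> 8] * LANES)


def shift_north_all(word):
    """Move every piece on all eight boards up one rank in a single step."""
    return (word & NOT_TOP_RANK) << 8
```

A real implementation would use the hardware's per-lane shift, which does not need the spill mask. The mask is only there because a Python integer has no lane boundaries.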

rbarreira · Post by **rbarreira** » Thu Jul 05, 2012 1:24 pm

Daniel Shawul wrote: For us chess programmers, it is still 50 threads plus whatever you can do with 512-bit SIMD registers, such as vectorized loops and bitboard tricks. The x86 support is the major selling point of Xeon Phi for the HPC market, not teraflops, as it lags 5x behind AMD and NVIDIA there. For example, it runs its own operating system, so you can pass messages to it like you do in clusters. Many scientists would prefer to work with such familiar languages rather than code specifically for a GPU card.

Compared to GPUs I expect that this architecture will be much more flexible. GPUs trade off flexibility for raw speed (that's why they do so many TFlops).

By flexibility I mean things like being able to run threads that are truly independent from each other. On GPUs, if threads take different paths through a conditional statement, the other cores in the same computation unit stall, because they all have to run in sync with each other.
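The cost of that lockstep execution can be illustrated with a toy model. This is a sketch under simplifying assumptions, not a measurement of any real GPU: a group of lanes pays for every branch path that at least one lane takes, while truly independent threads each pay only for their own path. The function names and cycle counts are hypothetical.

```python
def lockstep_cycles(takes_branch, cost_then, cost_else):
    """Cycles for one lockstep group: every path taken by any lane
    is executed by the whole group, one path after the other."""
    cycles = 0
    if any(takes_branch):
        cycles += cost_then
    if not all(takes_branch):
        cycles += cost_else
    return cycles


def independent_cycles(takes_branch, cost_then, cost_else):
    """Cycles when every thread runs on its own core:
    the wall time is that of the slowest single thread."""
    return max(cost_then if taken else cost_else for taken in takes_branch)


uniform = [True] * 32              # all 32 lanes agree on the branch
mixed = [True] + [False] * 31      # a single lane disagrees

print(lockstep_cycles(uniform, 10, 6))
print(lockstep_cycles(mixed, 10, 6))
print(independent_cycles(mixed, 10, 6))
```

In this model one disagreeing lane is enough to make the whole group pay for both paths, which is the stall described above.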

Daniel Shawul · Post by **Daniel Shawul** » Thu Jul 05, 2012 1:49 pm

rbarreira wrote:
Daniel Shawul wrote: For us chess programmers, it is still 50 threads plus whatever you can do with 512-bit SIMD registers, such as vectorized loops and bitboard tricks. The x86 support is the major selling point of Xeon Phi for the HPC market, not teraflops, as it lags 5x behind AMD and NVIDIA there. For example, it runs its own operating system, so you can pass messages to it like you do in clusters. Many scientists would prefer to work with such familiar languages rather than code specifically for a GPU card.
Compared to GPUs I expect that this architecture will be much more flexible. GPUs trade off flexibility for raw speed (that's why they do so many TFlops).

Yes, but GPUs themselves are becoming more flexible with each release. The only thing I haven't seen so far is a cache-coherent L1 cache. For a normal x86 program this is usually assumed to be the case. It makes programming a lot easier, but it does have additional hardware costs and incurs a performance loss, especially for HPC. The microarchitectures (besides the ISA) are too different to say which is better. Xeon Phi has inherited many features from Larrabee, with the exception of video gaming, which they have now dropped completely after their utter failure in 2009.

By flexibility I mean things like being able to run threads that are truly independent from each other. On GPUs, if threads take different paths through a conditional statement, the other cores in the same computation unit stall, because they all have to run in sync with each other.

You can think of it as having 50 SMs, compared with GPUs, which usually have 16 or so. And the 512-bit wide registers are equivalent to 16 cores (16 lanes of 32 bits). I don't know how many threads you can launch on the Xeon Phi cores, but it will be far fewer than on GPUs. So the difference is that GPUs launch a lot more threads, and the execution is SIMD only for 32 threads at a time. The SMs (even other warps in the same block) can execute completely independent code. For my MC thread search each thread does its own independent MC simulation (tens of thousands of them), but using Xeon Phi it would be only 50 cores times maybe 4 threads per core = 200 threads at a time. It can reach the same teraflops in the end, but the way it is achieved is different.
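The numbers in this comparison can be spelled out as a quick calculation. The core count, threads per core, and register width are the figures quoted in this thread. The simulation count of 20,000 is a hypothetical stand-in for "tens of thousands".

```python
import math

CORES = 50
THREADS_PER_CORE = 4
REGISTER_BITS = 512
LANE_BITS = 32                     # one single-precision value per lane

lanes_per_register = REGISTER_BITS // LANE_BITS    # 16, the "16 cores" above
hardware_threads = CORES * THREADS_PER_CORE        # 200 threads at a time
total_simd_lanes = CORES * lanes_per_register      # 800 lanes across the chip


def batches_needed(simulations, threads):
    """Rounds required if each thread runs one simulation per round."""
    return math.ceil(simulations / threads)


# Hypothetical workload: 20,000 independent Monte Carlo simulations.
print(batches_needed(20_000, hardware_threads))
```

With one simulation per thread, the 200 hardware threads would need many rounds to cover a workload that a GPU launches all at once, unless each simulation is itself vectorized across the 16 lanes.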

Daniel Shawul · Post by **Daniel Shawul** » Thu Jul 05, 2012 3:46 pm

not teraflops, as it lags 5x behind AMD and NVIDIA there

Wow, the 1 teraflop of Xeon Phi is double precision! I thought it was single precision. Well, then that would make it the same as the HD 7970, which has about the same double-precision throughput but 4x more single precision. OTOH, NVIDIA's Tesla K10 has 4.6 teraflops single precision but a much lower 0.19 teraflops double precision. I think they have real competition on their hands. But there may be a catch somewhere, like being more expensive or consuming more power, etc.

smatovic · Post by **smatovic** » Thu Jul 05, 2012 8:29 pm

Does anybody know about the integer throughput?

Hope it will support OpenCL.

--
Srdja

Daniel Shawul · Post by **Daniel Shawul** » Thu Jul 05, 2012 11:24 pm

I think the integer throughput will not be as impressive, since it has the same 16 64-bit registers and the cores are derived from an old Pentium processor. OTOH, they have added 32 512-bit wide registers, so single-precision flops could be a surprise figure when announced, since DP is already at 1 TFlops. Performance also depends on the number of ALU/FPU/DPU units it has, which I don't know ...
But then the microarchitecture is completely different from GPUs, and I am not convinced that this 1 TFlop figure is relevant, since it all comes from issuing the same operation on multiple registers (data). I now believe the Tesla is more general purpose, since you have many threads actually running. NVIDIA understandably called it SIMT. Xeon Phi, on the other hand, gains its performance from pure SIMD operations, since it launches far fewer threads. It is good that we have 50 cores (albeit with much less power) that can be used for general purpose, but I don't think it is more flexible in other aspects if your goal is performance... The 4 threads per core also seem to be used for hyper-threading, so there probably isn't much to be gained from there.
Everything I said should be taken with a big grain of salt

P.S.: OpenCL, MPI, OpenMP and others should not be a problem.
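To see why the 1 TFlop figure depends so heavily on SIMD width, here is a back-of-the-envelope sketch of peak throughput. The 50 cores and 512-bit registers come from this thread. The clock of 1.25 GHz and the 2 flops per lane per cycle (one fused multiply-add) are assumptions picked so the numbers come out round, not published specifications.

```python
def peak_flops(cores, register_bits, element_bits, clock_hz,
               flops_per_lane_per_cycle=2):
    """Theoretical peak = cores x SIMD lanes x flops per lane per cycle x clock."""
    lanes = register_bits // element_bits
    return cores * lanes * flops_per_lane_per_cycle * clock_hz


ASSUMED_CLOCK_HZ = 1.25e9          # hypothetical, not an announced figure

dp_peak = peak_flops(50, 512, 64, ASSUMED_CLOCK_HZ)   # 8 double-precision lanes
sp_peak = peak_flops(50, 512, 32, ASSUMED_CLOCK_HZ)   # 16 single-precision lanes

# Scalar code uses one lane out of eight, so it sees 1/8 of the DP peak.
scalar_dp_peak = dp_peak / (512 // 64)

print(dp_peak / 1e12, sp_peak / 1e12, scalar_dp_peak / 1e12)
```

Under these assumptions the double-precision peak lands at about 1 TFlop and single precision at twice that, but code that does not fill the vector lanes only ever sees a fraction of it.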

smatovic · Post by **smatovic** » Thu Jul 05, 2012 11:42 pm

I think 50 cores are more suitable for a classic depth-first searcher like YBWC, where communication and synchronization are necessary.

--
Srdja
