Hyperthreading

dark_wizzie · Post by **dark_wizzie** » Wed Oct 29, 2014 10:36 pm

Hi,

I started a thread like this in the Rybka Forum quite a while back, but Mark says it might be interesting to post it here. Has anybody done any serious testing of hyperthreading and chess engines? Namely for something like H4, K8, SF5 (Houdini 4, Komodo 8, Stockfish 5).

My understanding is this: HT with 2 cores is not the same at all as HT with 8 cores. If HT gives 2 + 2 cores and is somehow a gain in strength, 8 + 8 cores won't necessarily be a gain in strength over 8 real cores. Some people say that these ideas can be tested with engine matches (as reading kn/s is misleading). We need to set core affinity... odd cores or something?
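On the affinity question, a minimal sketch of one way to restrict a process to one logical CPU per physical core, assuming Linux. Logical CPU numbering differs between machines and operating systems, so this reads the sibling groups from sysfs instead of assuming "odd" or "even" CPUs. The helper names (`parse_cpu_list`, `one_cpu_per_core`, `pin_to_one_thread_per_core`) are made up for illustration; `os.sched_setaffinity` is the standard-library call and is only available on Linux.

```python
import os
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Parse a Linux CPU list such as '0,4' or '0-1' into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists: list[str]) -> list[int]:
    """Keep the lowest-numbered logical CPU from each group of HT siblings."""
    return sorted({parse_cpu_list(s)[0] for s in sibling_lists if s.strip()})


def read_sibling_lists() -> list[str]:
    """Read thread_siblings_list for every CPU (Linux sysfs only)."""
    base = Path("/sys/devices/system/cpu")
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    return [p.read_text() for p in base.glob(pattern)]


def pin_to_one_thread_per_core() -> list[int]:
    """Restrict the current process to one logical CPU per physical core."""
    cpus = one_cpu_per_core(read_sibling_lists())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus
```

Child processes inherit the affinity mask on Linux, so calling `pin_to_one_thread_per_core()` in a launcher script before starting the engine should confine the engine as well. The "HT on" side of a match would then run without the restriction and with twice the search threads.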

I've done a test like this on my Atom netbook, but it is the first-gen Atom, the really crappy one, with 1 + 1 cores (one real core plus one hyperthread). Over about 500 games it showed a 35 Elo improvement.
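To judge how much weight a result like "35 Elo over about 500 games" can bear, here is a rough sketch that converts a match score into an Elo difference with an approximate 95% interval. It uses the standard logistic Elo formula and a normal approximation on the per-game score. The win/draw/loss split in the usage note is hypothetical, since the post does not give one.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_interval(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, low, high) using a normal approximation on the score."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Variance of a single game result, which is 1, 0.5 or 0.
    var = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    return (
        elo_from_score(score),
        elo_from_score(score - z * se),
        elo_from_score(score + z * se),
    )
```

With a hypothetical split of 200 wins, 150 draws and 150 losses (a 55% score), `elo_with_interval(200, 150, 150)` gives about +35 Elo with an interval on the order of 25 Elo either side. Under those assumptions, 500 games are enough to suggest a gain but not to pin down its size.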

hgm · Post by **hgm** » Wed Oct 29, 2014 11:01 pm

Indeed, the old Atoms are expected to benefit from HT: being in-order CPUs, they stall so often that the two hyperthreads hardly slow each other down.
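The stall argument can be illustrated with a deliberately crude toy model; it is not a description of any real CPU. Assume each hardware thread is stalled, independently of the other, for a fraction `s` of cycles, and that the core does useful work whenever at least one thread is ready. Contention for execution units and cache is ignored, so the result is an upper bound under these assumptions.

```python
def ht_throughput_gain(stall_fraction: float) -> float:
    """Toy estimate of two-thread vs one-thread throughput on one core.

    Assumes independent stalls and no contention between the threads.
    """
    s = stall_fraction
    if not 0.0 <= s < 1.0:
        raise ValueError("stall_fraction must be in [0, 1)")
    busy_one_thread = 1.0 - s        # core is busy only when the thread is ready
    busy_two_threads = 1.0 - s * s   # core is idle only when both are stalled
    return busy_two_threads / busy_one_thread
```

In this model the gain works out to `1 + s`: a core that stalls half the time gets up to 1.5x from the second thread, while one that stalls 10% of the time gets at most about 1.1x. That matches the direction of the argument, namely that the more a core stalls, the more a second thread can help.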

And yes, engines gain more Elo going from 2 to 4 search threads than going from 8 to 16 search threads. In both cases, though, each thread loses at least the same amount of nps when the threads run on hyperthreads instead of on cores of their own (and even more in going from 8 to 16 if memory contention plays a role). So it will definitely be much harder to get a gain from HT in the 8->16 case.

syzygy · Post by **syzygy** » Wed Oct 29, 2014 11:14 pm

dark_wizzie wrote:My understanding is this: HT with 2 cores is not the same at all as HT with 8 cores.

It is more correct to say that Atom gains much more from HT than regular Intel chips. This is because Atom does not do out-of-order execution. It is somewhat interesting though that Atom is capable of HT, as HT, or rather SMT, was originally designed for superscalar processors (superscalar is not identical to out-of-order, but closely related).
http://www.agner.org/optimize/blog/read.php?i=6

Agner Fog wrote:The Intel Atom is a small low-power processor which is used in small netbook computers and embedded applications. It has two cores capable of running two threads each. The execution units of the Atom are much smaller than the i7. It sounds like a weird idea to share the already meager execution units between two threads. The rationale is that the Atom lacks the out-of-order capabilities of the bigger processors. When the execution unit is waiting for an uncached memory operand or some other long-latency event, it would have nothing else to do in the meantime unless there was a second thread it could work on.

bob · Post by **bob** » Wed Oct 29, 2014 11:32 pm

syzygy wrote:
dark_wizzie wrote:My understanding is this: HT with 2 cores is not the same at all as HT with 8 cores.
It is more correct to say that Atom gains much more from HT than regular Intel chips. This is because Atom does not do out-of-order execution. It is somewhat interesting though that Atom is capable of HT, as HT, or rather SMT, was originally designed for superscalar processors (superscalar is not identical to out-of-order, but closely related).
http://www.agner.org/optimize/blog/read.php?i=6
Agner Fog wrote:The Intel Atom is a small low-power processor which is used in small netbook computers and embedded applications. It has two cores capable of running two threads each. The execution units of the Atom are much smaller than the i7. It sounds like a weird idea to share the already meager execution units between two threads. The rationale is that the Atom lacks the out-of-order capabilities of the bigger processors. When the execution unit is waiting for an uncached memory operand or some other long-latency event, it would have nothing else to do in the meantime unless there was a second thread it could work on.

I think there is too much guesswork in the link you posted. The author didn't know what SMT was? That is exactly what Intel used to refer to hyper-threading. They were one and the same, yet the author doesn't seem to know this. Additionally, cache IS a hyper threading issue. If one thread misses in L1 and has to go to L2 or L3, there is a definite delay. The second thread can use those idle cycles if/when the first thread cannot, and see a performance gain. It doesn't need to be a "long-latency" event in the sense of "real long latency". A few clocks is more than enough since switching between threads is essentially instantaneous.

For chess it is a combination of two issues only.

(1) How much of the time is one thread blocked, so that the other thread can offer useful work to keep the processor core busy?
(2) What is the cost? In chess the answer is search overhead, caused by searching extra nodes since move ordering is not perfect.

If you gain more with (1) than you lose with (2), it is a good idea. Otherwise it is not.
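The trade-off between (1) and (2) can be written down as a time-to-depth comparison. This is only a sketch: it assumes the time to reach a given depth is nodes searched divided by nodes per second, and it treats time-to-depth as a proxy for strength, which an engine match measures more directly. The function name and the numbers in the usage note are made up for illustration.

```python
def time_to_depth_speedup(nps_gain: float, node_overhead: float) -> float:
    """Speedup in time-to-depth from enabling HT.

    nps_gain:      total nps with HT divided by total nps without HT.
    node_overhead: nodes needed to reach the same depth with HT (more
                   search threads) divided by nodes needed without HT.
    A result above 1.0 means HT reaches the depth faster.
    """
    if nps_gain <= 0 or node_overhead <= 0:
        raise ValueError("both ratios must be positive")
    return nps_gain / node_overhead
```

For example, a hypothetical 30% nps gain against 25% extra nodes gives `time_to_depth_speedup(1.30, 1.25)`, about 1.04, a small win. A 15% nps gain against the same overhead gives about 0.92, a loss. Both ratios have to be measured per engine and per machine.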

Unfortunately this is not consistent across all chess engines. If an author spends a LOT of time trying to optimize cache accesses, and tries to minimize the use of spin locks (xchg instruction and such), then hyper threading is usually a losing proposition, since the two threads interfere rather than overlap when neither is blocked much of the time. If the engine is not highly tuned/optimized, hyper threading will generally offer enough of a speed gain that it might actually gain more than it loses via the search overhead issue.

However, I have yet to find a case where my program runs better with SMT enabled, nor does any other program we regularly run on our cluster nodes. We therefore have it disabled on all of our machines.

syzygy · Post by **syzygy** » Thu Oct 30, 2014 12:15 am

bob wrote:
syzygy wrote:http://www.agner.org/optimize/blog/read.php?i=6
Agner Fog wrote:The Intel Atom is a small low-power processor which is used in small netbook computers and embedded applications. It has two cores capable of running two threads each. The execution units of the Atom are much smaller than the i7. It sounds like a weird idea to share the already meager execution units between two threads. The rationale is that the Atom lacks the out-of-order capabilities of the bigger processors. When the execution unit is waiting for an uncached memory operand or some other long-latency event, it would have nothing else to do in the meantime unless there was a second thread it could work on.
I think there is too much guesswork in the link you posted. The author didn't know what SMT was? That is exactly what Intel used to refer to hyper-threading. They were one in the same, yet the author doesn't seem to know this.

In fact HT and SMT are not identical concepts, but Fog is not correct in saying (in the comments) that HT is a subset of SMT. It is the other way around: SMT is a subset of HT.

HT is the term Intel uses both for "switch-on-event multithreading" (SOEMT) as implemented in Itanium and for SMT as implemented in the P4 and later x86-based CPUs (including Atom).

bob wrote:Additionally, cache IS a hyper threading issue. If one thread misses in L1 and has to go to L2 or L3, there is a definite delay. The second thread can use those idle cycles if/when the first thread cannot, and see a performance gain. It doesn't need to be a "long-latency" in the concept of "real long latency". A few clocks is more than enough since switching between threads is essentially instantaneous.

I agree. What Fog describes is SOEMT: execution switches to the other thread upon a long-latency event. SMT is more fine-grained. Still, his explanation for why Atom benefits more from HT than say an i7 is correct.
