Differences of running Dual x 1 processor

Discussion of chess software programming and technical issues.

Moderator: Ras

Ponti
Posts: 507
Joined: Wed Mar 15, 2006 6:13 am
Location: Curitiba - PR - BRAZIL
Full name: Aloisio Ponti Lopes

Post by Ponti »

1) Efficiency ?

2) HT x no HT ?
(I guess HT doesn't improve an engine's performance, am I wrong?)

3) Does memory impact the performance when running Dual-Xeons ?

Please explain in simple terms what the differences are, because I'm not an engineer! :lol:
A. Ponti
AMD Ryzen 1800x, Windows 10.
FIDE current ratings: standard 1913, rapid 1931
jdart
Posts: 4428
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Differences of running Dual x 1 processor

Post by jdart »

This is a very complex topic.

Generally speaking most engines scale reasonably well on relatively small numbers of cores (4-8).

One of the main bottlenecks is the hash table, which is a shared resource. Multiple threads read and write it, and it is typically too large to fit in a processor cache. When a core has to access memory that is not in its local cache, a cache miss occurs and the access takes much longer. There are actually multiple levels of caching. On modern multi-socket systems there is also a penalty for accessing memory that is attached to the other socket rather than the local one (NUMA).

All kinds of other global memory access can cause cache misses and possibly non-local memory access but the hash table is the big culprit, generally.

Hyperthreading is often of limited benefit because the two hardware threads on a core contend for the resources they share; but some have reported here that it can increase performance somewhat, for some engines.

--Jon
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Differences of running Dual x 1 processor

Post by Milos »

Ponti wrote:1) Efficiency ?

2) HT x no HT ?
(I guess HT doesn't improve an engine's performance, am I wrong?)

3) Does memory impact the performance when running Dual-Xeons ?

Please explain in simple terms what the differences are, because I'm not an engineer! :lol:
A dual-CPU system has to reach part of its memory over the NUMA interconnect (usually each CPU has direct access to 4 memory modules in parallel, i.e. quad-channel access), which means you will see a slowdown when one CPU accesses memory that is directly controlled by the other CPU.
However, the raw bandwidth of today's memory is enormous (>50GB/s for DDR4 with quad-channel access), meaning that even with 8 cores you will not see any slowdown due to memory access in any chess application on a single CPU. Even with dual CPUs the slowdown won't be significant. You can see that nicely if you check benchmarks of single- and dual-CPU configurations: with LazySMP, scaling in terms of NPS is almost perfect, and the actual drop in NPS comes mainly from memory latency over NUMA.
Here are numbers from RH for H5:
http://www.cruxis.com/chess/manual/inde ... ersion.htm

Regarding HT, forget about it, it produces almost exclusively useless NPS.
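
To put a rough number on the bandwidth point in the post above, here is a back-of-envelope sketch in Python. Everything in it is an assumption for illustration, not a measurement: a 64-byte cache line, one hash probe per node that always misses the cache, and a hypothetical engine speed of 20 million nodes per second. The function names are made up for this example.

```python
# Back-of-envelope estimate of hash-table memory traffic.
# All constants are illustrative assumptions, not measurements.

CACHE_LINE_BYTES = 64  # assumed cache line size (typical on x86)


def hash_traffic_gb_per_s(nodes_per_second: float,
                          probes_per_node: float = 1.0) -> float:
    """Traffic in GB/s if every hash probe misses cache and fetches one line."""
    return nodes_per_second * probes_per_node * CACHE_LINE_BYTES / 1e9


def bandwidth_share(nodes_per_second: float,
                    available_gb_per_s: float = 50.0) -> float:
    """Fraction of the available memory bandwidth used by hash probes."""
    return hash_traffic_gb_per_s(nodes_per_second) / available_gb_per_s


if __name__ == "__main__":
    nps = 20e6  # hypothetical engine speed: 20 million nodes per second
    print(f"hash traffic: {hash_traffic_gb_per_s(nps):.2f} GB/s")
    print(f"share of 50 GB/s: {bandwidth_share(nps):.1%}")
```

Under those assumptions the hash traffic is a small fraction of a 50 GB/s budget, which fits the claim that latency, not bandwidth, is what costs NPS. The estimate says nothing about latency itself.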
APassionForCriminalJustic
Posts: 417
Joined: Sat May 24, 2014 9:16 am

Re: Differences of running Dual x 1 processor

Post by APassionForCriminalJustic »

Milos wrote:
Regarding HT, forget about it, it produces almost exclusively useless NPS.
Useless NPS? Why don't you prove that? Hyperthreading actually allows for more work to get done. It's all about offsetting the inefficiency of the search when you double the threads. It's pretty stupid to just say forget about it... I get a minimal 30 percent increase in nodes with hyperthreading ON; I'll take that. Since you mentioned Robert Houdart then why don't you read his part about hyperthreading. It is right in the manual.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Differences of running Dual x 1 processor

Post by Milos »

APassionForCriminalJustic wrote:Useless NPS? Why don't you prove that? Hyperthreading actually allows for more work to get done. It's all about offsetting the inefficiency of the search when you double the threads. It's pretty stupid to just say forget about it... I get a minimal 30 percent increase in nodes with hyperthreading ON; I'll take that. Since you mentioned Robert Houdart then why don't you read his part about hyperthreading. It is right in the manual.
You even come here, to a programming forum, to troll, despite not understanding even the basics of parallel search???
Gee, you've really got some nerve.
Here is some parallel search math that demonstrates why HT is useless. If you don't get it, sorry, try googling Amdahl's Law first; if you still don't understand after that, you are obviously in the wrong forum, so just go play with your toys and spare us your usual trolling.
As has already been demonstrated and confirmed, SF's LazySMP implementation is at most (up to 8 cores) 95.5% efficient, which reflects a really good implementation.
Let's take the simple case of 10 cores ("total efficiency" here is the speedup over a single thread):
1) No HT: total efficiency = 1/(0.045+0.955/10) = 7.12
2) With HT (30% more nodes as you say, so speed per thread is 1.3/2 with 20 threads in total; we even assume the same 95.5% efficiency, which is totally unrealistic, i.e. real efficiency is much lower in the HT case):
total efficiency = 1.3/2 * 1/(0.045+0.955/20) = 7.01

In reality efficiency with HT is 90% or less, so HT only makes sense in the case of a dual-core machine, and even then only barely, i.e.
1) no HT: total efficiency = 1/(0.045+0.955/2) = 1.91,
2) with HT: total efficiency = 1.3/2 * 1/(0.1+0.9/4) = 2
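
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the model as stated in the post: Amdahl's Law with a parallel fraction of 0.955 (or 0.90 for the HT dual-core case), and HT modelled as twice the threads, each running at 1.3/2 of full speed. The function names are hypothetical, and the 30% figure is the one quoted in the thread, not something measured here.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Amdahl's Law: speedup over one thread when only
    `parallel_fraction` of the work scales with thread count."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


def ht_speedup(parallel_fraction: float, cores: int,
               ht_node_gain: float = 1.3) -> float:
    """Model from the post: HT doubles the thread count, and each
    thread runs at ht_node_gain / 2 of a full core's speed."""
    per_thread_speed = ht_node_gain / 2.0
    return per_thread_speed * amdahl_speedup(parallel_fraction, 2 * cores)


if __name__ == "__main__":
    # 10 cores, same 95.5% parallel fraction with and without HT
    print(f"10 cores, no HT: {amdahl_speedup(0.955, 10):.2f}")
    print(f"10 cores, HT:    {ht_speedup(0.955, 10):.2f}")
    # 2 cores, with the parallel fraction lowered to 90% under HT
    print(f" 2 cores, no HT: {amdahl_speedup(0.955, 2):.2f}")
    print(f" 2 cores, HT:    {ht_speedup(0.90, 2):.2f}")
```

Changing `ht_node_gain` or the parallel fraction shows how sensitive the conclusion is to those two inputs, which is where the disagreement in this thread lies.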