Some tests with hyper threading

bob · Post by **bob** » Fri Mar 27, 2009 6:46 am

Howard E wrote:The last two tests do this:
8Threads for ht on only 4.6m nps for Arasan 11.3
8Threads for ht on only 7.7m nps for bright 0.4a

There are only 4 physical processors on my machine,
so the other 4 are virtual.
But it looks like the nps count is not accurate for comparing the performance of MP-capable engines. I am coming out of the single-processor dark ages and just found this out.

I have been testing on a dual-socket quad-core Nehalem, and turning on SMT actually slows the NPS for Crafty. Using mt=16 to use the logical processors hurts further because of the SMP search overhead. I'm not so sure SMT is that great on dual-socket machines when the program is pretty well tuned with respect to cache usage, etc...

bob · Post by **bob** » Fri Mar 27, 2009 6:51 am

Allard Siemelink wrote:
Howard E wrote:Computer:
corei7-920 (modest overclock from 133 to 150)
so 2660 mhz to 3000 mhz (20 * clock)
for single core apps something in mother board called
turbo enabled yields 21 * speed so 3150 mhz

8gb ram 512hash allotted for chess programs

Test:
nps count from new game starting position
ht is hyper threading
T is threads
nps is million except rybka's count

1. Rybka 2.2mpox64
ht=on 862.784
ht=off 744.450

2. Arasan 11.3
      ht=off  ht=on
1T    1.7     1.8
2T    3.2     2.9
4T    4.3     3.7
8T    -       4.6   (8T with ht on only)

3. Bright 0.4a
      ht=off  ht=on
1T    1.6     1.6
2T    3.1     3.0
4T    6.1     5.1
8T    -       7.7   (8T with ht on only)

Thanks Howard, this is useful information.
bright's scaling without hyperthreading seems pretty good: 1.6, 3.1, 6.1.
(as I only have a dual core computer, I was not sure of the 4cpu nps)
But if hyperthreading is enabled, the 4cpu nps only reaches 5.1.

It seems that bright (and arasan too?!) needs to set the processor affinity to cope with hyperthreading (to make sure each thread gets its own cpu)
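As a rough illustration of the affinity idea, here is a minimal Python sketch that picks one logical CPU per physical core, so that four search threads would land on four different cores. It is not how bright or Arasan do it; the function names and the example topology are made up, and `os.sched_setaffinity` is only available on Linux (an engine would use its platform's native call instead).

```python
import os


def one_cpu_per_core(topology):
    """Pick the lowest-numbered logical CPU on each physical core.

    `topology` maps logical CPU id -> (package id, core id).
    """
    chosen = {}
    for cpu in sorted(topology):
        # Keep only the first logical CPU seen for each physical core.
        chosen.setdefault(topology[cpu], cpu)
    return sorted(chosen.values())


def pin_calling_thread(cpu):
    """Restrict the calling thread to one logical CPU (Linux only)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})


# Hypothetical layout for a 4-core CPU with HT on:
# logical CPUs n and n + 4 are assumed to share physical core n.
EXAMPLE_TOPOLOGY = {n: (0, n % 4) for n in range(8)}

search_cpus = one_cpu_per_core(EXAMPLE_TOPOLOGY)
print(search_cpus)
```

Each search thread would then call `pin_calling_thread` with its own entry from `search_cpus`. The real numbering of logical CPUs differs between operating systems and BIOSes, so the topology has to be read from the system rather than assumed.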

I haven't read the other discussion yet, but it seems to me that larger nps is better, so yes, bright would perform best (although just marginally better than with 4 threads and no hyperthreading) with 8 threads and hyperthreading enabled.
Since the 8-thread (HT) nps is only about 25% better than the 4-thread (no HT) nps, you'd need to play a large number (1000s) of games to actually prove it.
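The ratios mentioned here can be recomputed from the NPS numbers Howard posted for bright 0.4a. This is only a sketch; the variable and function names are made up for illustration.

```python
# NPS in millions for bright 0.4a, copied from Howard's table.
BRIGHT_NPS = {
    "ht_off": {1: 1.6, 2: 3.1, 4: 6.1},
    "ht_on": {1: 1.6, 2: 3.0, 4: 5.1, 8: 7.7},
}


def scaling(nps_by_threads):
    """NPS relative to the 1-thread figure, per thread count."""
    base = nps_by_threads[1]
    return {threads: nps / base for threads, nps in nps_by_threads.items()}


def relative_gain(new, old):
    """Fractional improvement of `new` over `old`."""
    return new / old - 1.0


ht_off_scaling = scaling(BRIGHT_NPS["ht_off"])
gain_8t_ht = relative_gain(BRIGHT_NPS["ht_on"][8], BRIGHT_NPS["ht_off"][4])

print(f"4T scaling without HT: {ht_off_scaling[4]:.2f}x")
print(f"8T (HT) over 4T (no HT): {gain_8t_ht:.0%}")
```

These are NPS ratios only; they say nothing about how much larger the tree gets when more threads search it.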

NPS is irrelevant. All that counts is time to depth. And I know of no program that, given the choice of 4 cpus at 2M nodes per second per CPU or 8 cpus at 1M nodes per second per CPU, would produce faster time-to-depth on the 8 cpus. Both would search at the same total NPS. But the search overhead would make the 8 cpu version slower, and weaker.

It takes at _least_ a 30% improvement in NPS to make hyper-threading worthwhile for chess. And I have not seen that kind of improvement, making it a losing proposition.

Easy enough to test. Just run using no SMT and 4 physical processors and search a group of positions to a fixed depth. Then turn SMT on and run the same positions with 8 processors. The latter will take longer to complete, making the point quite clear.
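The arithmetic behind this argument is simply time = nodes / NPS. The sketch below assumes, purely for illustration, that the 8-thread search has to visit 30% more nodes to reach the same depth (a number chosen to match the break-even figure quoted above, not a measurement), and plugs in the bright 0.4a NPS figures from Howard's table. The tree sizes and function names are hypothetical.

```python
def time_to_depth(nodes, nps):
    """Seconds needed to search `nodes` nodes at `nps` nodes per second."""
    return nodes / nps


def break_even_nps_gain(nodes_few_threads, nodes_many_threads):
    """Fractional NPS gain the larger thread count needs just to tie."""
    return nodes_many_threads / nodes_few_threads - 1.0


# Hypothetical tree sizes for the same fixed depth (assumption, not data):
NODES_4T = 100e6   # 4 threads, SMT off
NODES_8T = 130e6   # 8 threads, SMT on, 30% extra search overhead

# NPS taken from the bright 0.4a figures: 6.1M (4T, ht off), 7.7M (8T, ht on).
t4 = time_to_depth(NODES_4T, 6.1e6)
t8 = time_to_depth(NODES_8T, 7.7e6)
needed = break_even_nps_gain(NODES_4T, NODES_8T)

print(f"4 threads, SMT off: {t4:.1f} s")
print(f"8 threads, SMT on:  {t8:.1f} s")
print(f"NPS gain needed to break even: {needed:.0%}")
```

Under these assumptions the 8-thread run takes about 16.9 s against about 16.4 s for 4 threads, because a 26% NPS gain does not cover a 30% larger tree. The real overhead has to be measured with the fixed-depth test described above.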

BBauer · Post by **BBauer** » Fri Mar 27, 2009 11:44 am

I do not think that all that counts is nps, or time to a fixed depth.
All that counts is time to find the best move.
For the following position I ran Crafty 2 times.
The results differ pretty much.
[D] r1b1N2k/1pBn2p1/p3Q2p/5n2/8/2q5/2PRB1PP/7K w

1. run
               13->   3.55  -0.93   1. Qd5 Qe3 2. Ba5 Ne5 3. Qd8 Kh7 4.
               14->  11.91  -0.61   1. Rd1 Nc5 2. Qe5 Qxe5 3. Bxe5 Be6
               15->  20.10  -0.54   1. Rd1 Nc5 2. Qf7 Ne3 3. Qf8+ Kh7 4.
               16->  38.23  -0.60   1. Rd1 Nc5 2. Qf7 Be6 3. Qf8+ Bg8 4.
               17->   1:04  -0.68   1. Rd1 Nc5 2. Qe5 Qxe5 3. Bxe5 Be6
              time=1:04  mat=-1  n=265747700  fh=90%  nps=4.1M
              ext-> check=10.6M qcheck=11.1M reduce=136.9M/26.3M
              predicted=0  evals=209.1M  50move=0  EGTBprobes=0  hits=0
              SMP->  splits=690  aborts=89  data=7/128  elap=1:04

2. run
               13->   3.06  -0.83   1. Qd5 Qe3 2. Ba5 Qf2 3. Qf3 Qe1+ 4.
               14->   9.51  -0.61   1. Rd1 Nc5 2. Qe5 Qxe5 3. Bxe5 Be6
               15->  17.72  -0.57   1. Rd1 Nc5 2. Qe5 Qxe5 3. Bxe5 Be6
               16->  36.13  -0.64   1. Bf4 Nc5 2. Qe5 Qxe5 3. Bxe5 Be6
               17->   1:09  -0.64   1. Bf4 Nc5 2. Qe5 Qxe5 3. Bxe5 Be6
              time=1:09  mat=-1  n=282069754  fh=90%  nps=4.1M
              ext-> check=11.3M qcheck=11.7M reduce=145.3M/28.1M
              predicted=0  evals=221.1M  50move=0  EGTBprobes=0  hits=0
              SMP->  splits=766  aborts=104  data=6/128  elap=1:09

kind regards
Bernhard

Gian-Carlo Pascutto · Fri Mar 27, 2009 11:51 am

You are right that time to solution is what matters. But time to depth is usually a good predictor of that (especially averaged over a lot of positions). NPS is not.

ernest · Post by **ernest** » Fri Mar 27, 2009 6:12 pm

BBauer wrote:All that counts is time to find the best move.
For the following position I ran Crafty 2 times.
The results differ pretty much.

I don't understand...
Isn't Qxf5 the best move here? (Rybka gives +3.50 to that)

And what processor did you use? If you are using multiprocessor (or multicore) then of course you get non-reproducibility!

bob · Post by **bob** » Fri Mar 27, 2009 6:31 pm

BBauer wrote: I do not think that all that counts is nps, or time to a fixed depth.
All that counts is time to find the best move.
For the following position I ran Crafty 2 times.
The results differ pretty much.
[D] r1b1N2k/1pBn2p1/p3Q2p/5n2/8/2q5/2PRB1PP/7K w

1. run
               13->   3.55  -0.93   1. Qd5 Qe3 2. Ba5 Ne5 3. Qd8 Kh7 4.
               14->  11.91  -0.61   1. Rd1 Nc5 2. Qe5 Qxe5 3. Bxe5 Be6
               15->  20.10  -0.54   1. Rd1 Nc5 2. Qf7 Ne3 3. Qf8+ Kh7 4.
               16->  38.23  -0.60   1. Rd1 Nc5 2. Qf7 Be6 3. Qf8+ Bg8 4.
               17->   1:04  -0.68   1. Rd1 Nc5 2. Qe5 Qxe5 3. Bxe5 Be6
              time=1:04  mat=-1  n=265747700  fh=90%  nps=4.1M
              ext-> check=10.6M qcheck=11.1M reduce=136.9M/26.3M
              predicted=0  evals=209.1M  50move=0  EGTBprobes=0  hits=0
              SMP->  splits=690  aborts=89  data=7/128  elap=1:04

2. run
               13->   3.06  -0.83   1. Qd5 Qe3 2. Ba5 Qf2 3. Qf3 Qe1+ 4.
               14->   9.51  -0.61   1. Rd1 Nc5 2. Qe5 Qxe5 3. Bxe5 Be6
               15->  17.72  -0.57   1. Rd1 Nc5 2. Qe5 Qxe5 3. Bxe5 Be6
               16->  36.13  -0.64   1. Bf4 Nc5 2. Qe5 Qxe5 3. Bxe5 Be6
               17->   1:09  -0.64   1. Bf4 Nc5 2. Qe5 Qxe5 3. Bxe5 Be6
              time=1:09  mat=-1  n=282069754  fh=90%  nps=4.1M
              ext-> check=11.3M qcheck=11.7M reduce=145.3M/28.1M
              predicted=0  evals=221.1M  50move=0  EGTBprobes=0  hits=0
              SMP->  splits=766  aborts=104  data=6/128  elap=1:09

kind regards
Bernhard

Time to depth is all that matters. I'd hope everyone knows that for SMP testing, you need to run a position several times and average the results rather than just making one run...
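For what it's worth, averaging the two runs quoted above is a few lines of Python. This is only a sketch: the helper name is made up, the depth-17 times and node counts are copied from the two Crafty outputs, and two runs are far too few for a real SMP measurement.

```python
from statistics import mean


def to_seconds(stamp):
    """Convert a Crafty-style time such as '38.23' or '1:04' to seconds."""
    if ":" in stamp:
        minutes, seconds = stamp.split(":")
        return int(minutes) * 60 + float(seconds)
    return float(stamp)


# (time to finish depth 17, total nodes) for run 1 and run 2.
RUNS = [("1:04", 265747700), ("1:09", 282069754)]

times = [to_seconds(stamp) for stamp, _ in RUNS]
nodes = [count for _, count in RUNS]

avg_time = mean(times)
avg_nodes = mean(nodes)
spread = (max(times) - min(times)) / avg_time

print(f"mean time to depth 17: {avg_time:.1f} s")
print(f"mean nodes: {avg_nodes:,.0f}")
print(f"spread between runs: {spread:.1%}")
```

The same loop extends to many runs per position and many positions; the point is to compare averages, not single runs.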
