Stockfish "Use Sleeping Threads" Test


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish "Use Sleeping Threads" Test

Post by bob »

mcostalba wrote:
zullil wrote: Hi Marco,

When I use "bench 1024 [8,16] 3 default time", I get only a 3% gain in nps when going from 8 to 16 threads. With "bench 1024 [8,16] 5 default time", the gain is still just 7%.

Thus, as Robert Houdart and Bob Hyatt have suggested, test games at short time controls are not likely to demonstrate anything---and that's also assuming that gains in nps result in better chess performance.

Both Tord and Bob have convinced me that increasing nps is not likely to increase chess performance in any meaningful way, unless time/depth is also being decreased (which doesn't seem to be happening).

Since testing at anything but the fastest time controls would completely occupy my (work :shock:) computer for an extended period, I won't be able to run this test. Perhaps someone else with more hardware can run the test for you.

Thanks again for your work with SF.

Louis

Hi Louis,

thanks to you for your very interesting tests. You are the first to report an nps improvement from switching on HT. Of course this does not mean SF is stronger, but it is still something that until yesterday was not even thinkable for SF, because the way threads were managed did not allow it.

Thanks
Marco
That's a strange comment. Every SMP program I have tried shows a modest NPS improvement, because the two logical cores can hide memory latency to an extent. But not a single program I have ever tested has shown a speed (time to depth) improvement, because the search overhead (extra nodes) far exceeds the NPS gains... I've posted quite a few HT on/off nps numbers over recent years, dating back to the Pentium 4, which was the first to offer HT.
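
To make the arithmetic behind this concrete, here is a tiny sketch with made-up numbers (nothing measured, only the ratio argument): if HT raises NPS by 20% but the parallel search burns 30% more nodes to reach the same depth, time to depth still gets worse.

Code: Select all

#include <cstdio>

// Made-up numbers; only the ratio argument matters: HT raises NPS,
// but the parallel search needs more nodes to reach the same depth.
int main() {
    double base_nodes = 100e6;  // nodes to reach some fixed depth, no HT
    double base_nps   = 6.0e6;  // nodes per second, no HT

    double ht_nps   = base_nps * 1.20;    // suppose +20% NPS with HT
    double ht_nodes = base_nodes * 1.30;  // but +30% search overhead

    std::printf("time to depth, no HT: %.1f s\n", base_nodes / base_nps);  // 16.7
    std::printf("time to depth, HT:    %.1f s\n", ht_nodes / ht_nps);      // 18.1
    // NPS went up, yet time to depth got worse, because 1.30 / 1.20 > 1.
    return 0;
}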
PawnStormZ
Posts: 880
Joined: Mon Feb 15, 2010 6:43 am

Stockfish "Use Sleeping Threads" Test

Post by PawnStormZ »

   I was thinking that when HT was on it changed the way the processor worked in such a way that allocating 4 of 8 threads to the engine would not be exactly the same as running it on 4 threads with the HT off.
 
   Obviously I will defer to your slightly superior knowledge on the subject.   :)
 
MikeB
Posts: 4889
Joined: Thu Mar 09, 2006 6:34 am
Location: Pen Argyl, Pennsylvania

Re: Stockfish "Use Sleeping Threads" Test

Post by MikeB »

bob wrote:
MikeB wrote:
zullil wrote: ...
while having both hyperthreading and Use Sleeping Threads enabled gives a speedup of about 10% compared to having no hyperthreading. (I have checked that my machine runs the 8 threads on 8 distinct physical cores. i.e., no hyperthreading.)

...
that's what I saw too - it works best with both enabled, ~10% gain on an i3: two physical cores, 4 logical cores, Windows 7, 64-bit.

Mike
The only problem is that NPS is not important; time to a fixed depth is how chess is actually played. If your NPS goes up _and_ your time to fixed depth also goes up, you are not gaining anything, and the program is weaker.
My comment was referring to fixed depth: for me, the time to reach a fixed depth is 10% or more faster. The NPS is about 30-40% higher with HT on my Windows 7 laptop. HT is much different from how it was 5 years ago.

===========================
MT=2 on a 2 Physical Core Machine

Total time (ms) : 53352
Nodes searched : 59118660
Nodes/second : 1108087

==============================
Same machine, same program, same bench test now using all 4 Logical Cores

Total time (ms) : 41542
Nodes searched : 61921860
Nodes/second : 1490584

I have tried both long (60 second) and short (4-5 second) searches. With HT enabled and used, my laptop consistently reaches a fixed depth faster than it does without HT.
IQ
Posts: 162
Joined: Thu Dec 17, 2009 10:46 am

Re: Stockfish "Use Sleeping Threads" Test (Crafty

Post by IQ »

But if we agree in principle, wouldn't time to solution then be a much better test than time to depth? I understand that time to depth is probably still OK, but in your reasoning you mention examples where it might fail and emphasize the need for a reasonable number of different positions to test. Why not, then, use time to solution in the first place? In both cases you have to choose positions carefully and use multiple runs, but time to solution should be less error prone. Maybe somebody can do an experiment analyzing the SD of both approaches?
bob wrote:
IQ wrote:
bob wrote:
zullil wrote:Is the following statement reasonable?

Suppose that for each position and each fixed amount of search time, 16 threads reaches a higher depth than 8 threads. Then 16 threads is likely to perform at least as well as 8 threads, as measured by winning chess games.
Absolutely. But incredibly unlikely. To the point of "winning-the-lottery" type probability. :)

And the same rule still applies. Not just one run but several. But if you can improve the depth, then the thing will be stronger. Just watch out for flying pigs while doing this test. :)
I disagree here. Even a higher displayed depth in a fixed time means nothing. It could very well be that, through the non-deterministic nature of SMP, hash-table interaction and the high selectivity of modern programs, a higher depth is reached without playing stronger. The best test in my mind would be the TIME to SOLUTION of positions with known best moves (or, as an approximation, the MOVE that a reasonably large sample of engines agrees on as depth goes to infinity). If you average time to solution over a reasonable number of positions (whose estimates should themselves be averages of multiple runs) you should be fine. Don't let yourself be fooled by the depth and node counts programs display; in a parallel world, and with modern selective programs, their informative value is relative.
While I agree in principle, I don't agree in practice. If we were diddling around with extensions and reductions and modifying them, I would not use time-to-depth for anything. But we are not modifying the search or pruning/reduction rules. There are not very many cases where you find the answer at a different depth when using threads vs single search. There are a few, which is why I always advocate using a significant number of positions, and then weeding the oddballs out. I have several positions where 2 threads is way more than 2x faster to get the right answer. I try to make sure that I don't depend on such positions to compute speedups. If you pick a set of positions, some will show super-linear speedup, and those should count. But those should not be the _only_ ones that count...

Time to fixed depth is a good SMP test. There will be occasional oddities. You just have to repeat the tests enough times that they don't skew the results...

At least 95% of the time, time to depth and time to solution will be comparable when computing speedups.
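
Either way, the bookkeeping is the same. Here is a minimal sketch of the per-position, multi-run averaging described above; the positions and times are placeholders, not real measurements.

Code: Select all

#include <algorithm>
#include <cstdio>
#include <vector>

// Placeholder times in ms: for each position, several single-thread runs
// and several multi-thread runs, recording either time-to-depth or
// time-to-solution. The median over runs damps the run-to-run variance.
static double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    size_t n = v.size();
    return n % 2 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

int main() {
    std::vector<std::vector<double>> one_thread = {{9000, 9400, 8800},
                                                   {15000, 14500, 15800}};
    std::vector<std::vector<double>> n_threads  = {{5200, 5600, 5100},
                                                   {2300, 9000, 8600}};

    for (size_t i = 0; i < one_thread.size(); ++i) {
        double speedup = median(one_thread[i]) / median(n_threads[i]);
        std::printf("position %zu: speedup %.2f\n", i + 1, speedup);
    }
    // Position 2 contains one super-linear run (2300 ms); taking the median
    // over runs, and averaging over many positions, keeps such oddballs from
    // dominating the result.
    return 0;
}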
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Stockfish "Use Sleeping Threads" Test

Post by zullil »

MikeB wrote: The NPS is about 30-40% higher with HT on my laptop Windows 7 laptop.

HT enabled and used on my laptop consistly reaches a fixed depth faster than it does without enabled.
I don't think I've ever seen more than 15% gain in nps using hyperthreading, and even that gain occurs only in bench tests with large depth. 30-40% seems very unusual.

As I've (re)discovered recently, time to depth measurements are very variable. If you are consistently getting shorter time to depth numbers using hyperthreading that would be interesting indeed (or so I imagine).

This is a core i3 processor, BTW?
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Stockfish "Use Sleeping Threads" Test

Post by zullil »

MikeB wrote: ===========================
MT=2 on a 2 Physical Core Machine

Total time (ms) : 53352
Nodes searched : 59118660
Nodes/second : 1108087

==============================
Same machine, same program, same bench test now using all 4 Logical Cores

Total time (ms) : 41542
Nodes searched : 61921860
Nodes/second : 1490584

This is representative of what I'm seeing:

Code: Select all

Using bench 1024 8 24 default depth
===============================
Total time (ms) : 527925
Nodes searched  : 3367481098
Nodes/second    : 6378711

Using bench 1024 16 24 default depth
===============================
Total time (ms) : 558527
Nodes searched  : 4173317854
Nodes/second    : 7472007
So with HT I got a 17% (wow, that's high) gain in nps, but at the cost of a 6% rise in total time. Based on what Bob and Tord have said, this suggests that HT is actually deleterious to chess performance on my system.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Stockfish "Use Sleeping Threads" Test

Post by mcostalba »

bob wrote: That's a strange comment. Every SMP program I have tried shows a modest NPS improvement
In SF, when a thread has finished searching it keeps busy, polling for new work to be done. This could be acceptable if each thread runs on a different CPU, but in the case of HT, a thread polling in a tight loop drains resources from the sibling logical thread on the same CPU for no reason.

With the new "Sleeping thread" feature, the threads that have finished searching and return to the split point root are put to sleep. For this to work you need a fast locking and condition-variable scheme, such as the newly introduced SRWLocks and Condition Variables under Windows, so that a sleeping thread can be signaled to wake up very quickly when there is new work to do.

To properly verify that the newly introduced feature works, it is important to rely on modern hardware and on fast locking primitives (so not the ones delivered in the official JA builds, which instead aim at backward compatibility).
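
For anyone curious what that looks like in code, here is a minimal sketch of the sleep/wake pattern using the Windows SRWLock and condition-variable primitives mentioned above. This is only an illustration of the idea, not the actual Stockfish code; idle_loop, assign_work and workAvailable are made-up names.

Code: Select all

#include <windows.h>

// Sketch only, not the actual SF code: an idle helper thread sleeps on a
// condition variable instead of spinning, and is woken when work arrives.
static SRWLOCK            lock = SRWLOCK_INIT;
static CONDITION_VARIABLE cv   = CONDITION_VARIABLE_INIT;
static bool workAvailable = false;  // hypothetical "there is a split point" flag

DWORD WINAPI idle_loop(LPVOID) {
    AcquireSRWLockExclusive(&lock);
    while (!workAvailable)                                   // no busy polling
        SleepConditionVariableSRW(&cv, &lock, INFINITE, 0);  // sleep until signaled
    workAvailable = false;
    ReleaseSRWLockExclusive(&lock);
    // ... search the assigned split point here ...
    return 0;
}

void assign_work() {                 // called by the thread that splits
    AcquireSRWLockExclusive(&lock);
    workAvailable = true;
    ReleaseSRWLockExclusive(&lock);
    WakeConditionVariable(&cv);      // wakes one sleeping helper almost immediately
}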
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Stockfish "Use Sleeping Threads" Test

Post by zullil »

zullil wrote:
MikeB wrote: ===========================
MT=2 on a 2 Physical Core Machine

Total time (ms) : 53352
Nodes searched : 59118660
Nodes/second : 1108087

==============================
Same machine, same program, same bench test now using all 4 Logical Cores

Total time (ms) : 41542
Nodes searched : 61921860
Nodes/second : 1490584

This is representative of what I'm seeing:

Code: Select all

Using bench 1024 8 24 default depth
===============================
Total time (ms) : 527925
Nodes searched  : 3367481098
Nodes/second    : 6378711

Using bench 1024 16 24 default depth
===============================
Total time (ms) : 558527
Nodes searched  : 4173317854
Nodes/second    : 7472007
So with HT I got a 17% (wow, that's high) gain in nps, but at the cost of a 6% rise in total time. Based on what Bob and Tord have said, this suggests that HT is actually deleterious to chess performance on my system.
And this is what I see on my i3 iMac using the icc compiler:

Code: Select all

Using ./stockfish bench 128 2 20
===============================
Total time (ms) : 170558
Nodes searched  : 350598936
Nodes/second    : 2055599

Using ./stockfish bench 128 4 20
===============================
Total time (ms) : 182437
Nodes searched  : 468851716
Nodes/second    : 2569937
A 25% gain in nps (wow!) but also a 7% gain in total time.

[EDIT] I repeated the test:

Code: Select all

Using bench 128 2 20
===============================
Total time (ms) : 203558
Nodes searched  : 441985063
Nodes/second    : 2171297

Using bench 128 4 20

===============================
Total time (ms) : 189368
Nodes searched  : 488148168
Nodes/second    : 2577775
This time a 19% gain in nps and a 7% drop in total time.

The variance of total time is very large. One would need to repeat this test many times before a meaningful measurement could be established.
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Stockfish "Use Sleeping Threads" Test

Post by zullil »

Here are the results of ten searches to depth 24 of the second of Stockfish's sixteen default positions. This was using 8 physical cores on my Mac Pro. Note the variability of the total time (in ms) taken to complete each search. I calculate a mean of 90066.6 and a sample standard deviation of 26983.9. Assuming a normal distribution, these numbers suggest that about 1 in 3 searches will take either less than 63083 ms or more than 117051 ms (the arithmetic is sketched after the output below). Perhaps you could try this experiment on your i3 machine using your SF binary. I think this illustrates why it will be difficult to demonstrate that HT yields a better-performing engine.

Code: Select all

Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 640777362
Total time (ms): 97647
Nodes/second: 6562181
Best move: Bxa6
Ponder move: bxc3

Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 393386958
Total time (ms): 60933
Nodes/second: 6456057
Best move: Bxa6
Ponder move: bxc3

Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 416546724
Total time (ms): 64487
Nodes/second: 6459390
Best move: Bxa6
Ponder move: bxc3

Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 822556975
Total time (ms): 122239
Nodes/second: 6729087
Best move: Bxa6
Ponder move: bxc3

Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 395129719
Total time (ms): 60075
Nodes/second: 6577273
Best move: Bxa6
Ponder move: bxc3

Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 746878001
Total time (ms): 113729
Nodes/second: 6567172
Best move: dxe6
Ponder move: Bxe2

Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 560845985
Total time (ms): 84954
Nodes/second: 6601760
Best move: Bxa6
Ponder move: bxc3

Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 565951159
Total time (ms): 85320
Nodes/second: 6633276
Best move: Bxa6
Ponder move: bxc3

Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 486849305
Total time (ms): 74301
Nodes/second: 6552392
Best move: Bxa6
Ponder move: bxc3

Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 938499802
Total time (ms): 136981
Nodes/second: 6851313
Best move: Bxa6
Ponder move: bxc3
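
For reference, the mean, sample standard deviation and one-sigma cutoffs quoted above can be reproduced from the ten "Total time (ms)" values like this (plain sample statistics, nothing SF-specific):

Code: Select all

#include <cmath>
#include <cstdio>
#include <vector>

// Sample statistics of the ten "Total time (ms)" values above.
int main() {
    std::vector<double> t = {97647, 60933, 64487, 122239, 60075,
                             113729, 84954, 85320, 74301, 136981};
    double sum = 0;
    for (double x : t) sum += x;
    double mean = sum / t.size();                // 90066.6

    double ss = 0;
    for (double x : t) ss += (x - mean) * (x - mean);
    double sd = std::sqrt(ss / (t.size() - 1));  // sample SD, about 26983.9

    std::printf("mean %.1f, sd %.1f\n", mean, sd);
    std::printf("mean - sd = %.0f, mean + sd = %.0f\n", mean - sd, mean + sd);
    // about 63083 and 117051; under a normal model roughly a third of the
    // searches fall outside this interval.
    return 0;
}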
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish "Use Sleeping Threads" Test

Post by bob »

PawnStormZ wrote:    I was thinking that when HT was on it changed the way the processor worked in such a way that allocating 4 of 8 threads to the engine would not be exactly the same as running it on 4 threads with the HT off.
 
   Obviously I will defer to your slightly superior knowledge on the subject.   :)
 
It will be somewhat different if, say, you have 4 physical cores, 8 logical cores, and you run 4 chess threads + another application. But that is actually more efficient (at least under Linux) than running with 4 physical/logical cores and letting the operating system context switch among the 5 runnable processes...

But if you really have only 4 threads running, the "logical" processors will only be involved in running minute bits of code (interrupt handlers and such), and they can do this without causing context switches, which is a tiny bit more efficient...