Hyperthreading and Computer Chess: Intel i5-3210M

bob · Post by **bob** » Tue Apr 30, 2013 8:24 pm

Mike S. wrote:If you have an alive Windows, ALWAYS some other processes run in the background. But as far as I am aware, none took more than 1% CPU.

Yeah, I am used to it that my numbers look "suspicious". But they are real. And that ends this particular discussion for me on this point.

Bad approach. that > 2x for chess is a red flag. Accepting data that looks beyond questionable doesn't help the discussion.

bob · Post by **bob** » Tue Apr 30, 2013 8:30 pm

syzygy wrote:
bob wrote:To answer the HT question will be more complicated. A HT run is going to be somewhat faster. No doubt about it. How much faster really does depend on the quality of the internal engine programming and how much it has been optimized. The better optimized the code (better cache locality for memory accesses, less dependencies and pipeline stalls, the less HT is going to help. When I first turned on HT on my PIV, it improved Crafty very slightly. But after a few months of optimizing for AMD cache stuff, I went back to test again on my PIV and the thing was not helping near as much (HT on). And, in fact, it was a net loss after factoring in the overhead. Something on the order of 15% loss or so overall.
What has not yet been done, as far as I know, is optimise a parallel search specifically for HT.

With normal cores, it is important to minimise idle time of all processors. As a rule it is always better for a core to do something than to do nothing (of course I'm aware there is splitting overhead).

With HT, this is different. If one logical core is idle (and not spinning), its resources go to the other logical core of the same physical core. So if the other logical core is doing useful work, there is no need to join a split node that is of dubious quality. Only if a split node is available that is almost certain not to result in wasted work should the logical core join it.

Why would this not be the goal for ANY parallel search split?

So if a thread becomes idle, it should check whether its partner thread on the same physical core is busy. If it is, the thread should block instead of spin (the pause instruction in a spin loop helps only to some extent, and monitor/mwait are unfortunately not available in user space). If the other thread is also idle it should make sure that one of them spins, the other blocks. Spinning threads join any split node that is available. Blocked threads are woken up once a "very good" split node becomes available.

Again, I don't see why the goals for HT or non-HT would be any different, except for the spin-lock usage. But spin-lock usage is an issue anytime one can have more processes than cores, such as on a shared machine...

In a first approximation, "very good" could be defined as "high remaining depth". Of course the above requires per thread affinity settings.

It might make more work than this, but I would be surprised if HT could not be made into a definite win.

Of course none of this applies to current engines, as far as I know.

From my perspective, the more one optimizes an engine, the less HT helps. HT tries to hide latencies that can be addressed with good programming practices, at least to a significant level.

syzygy · Post by **syzygy** » Tue Apr 30, 2013 8:31 pm

bob wrote:
syzygy wrote:
Mike S. wrote:
Code: Select all
Engine              2T    4T  factor
------------------------------------
Critter 1.6a       2667  3930  1,47
Houdini 1.5a       3408  3964  1,16
Rybka 2.3.2a        168   283  1,68
Shredder Cl. 2012   626  1404  2,24
Spark 1.0          4304  5644  1,31
                              ------
                            Ø  1,57
If you got more than double the nps with 4 threads on a dual core with HT, then something went wrong in your testing. The other numbers except for maybe Houdini also look suspicious. (Well, Rybka anyway reports an imaginary nps, so we can safely ignore those numbers.) Maybe some other processes were running in the background?
There are programs that cause issues that HT can help. I have seen 2x in the past, although not with chess engines.

And there is the 32/64 bit issue. HT has helped some 32 bit apps more than expected...

HT certainly can help, but two threads running on a HT core are not both going to be faster than when only one thread is running on that same core.

syzygy · Post by **syzygy** » Tue Apr 30, 2013 9:25 pm

bob wrote:
syzygy wrote:
bob wrote:To answer the HT question will be more complicated. A HT run is going to be somewhat faster. No doubt about it. How much faster really does depend on the quality of the internal engine programming and how much it has been optimized. The better optimized the code (better cache locality for memory accesses, less dependencies and pipeline stalls, the less HT is going to help. When I first turned on HT on my PIV, it improved Crafty very slightly. But after a few months of optimizing for AMD cache stuff, I went back to test again on my PIV and the thing was not helping near as much (HT on). And, in fact, it was a net loss after factoring in the overhead. Something on the order of 15% loss or so overall.
What has not yet been done, as far as I know, is optimise a parallel search specifically for HT.

With normal cores, it is important to minimise idle time of all processors. As a rule it is always better for a core to do something than to do nothing (of course I'm aware there is splitting overhead).

With HT, this is different. If one logical core is idle (and not spinning), its resources go to the other logical core of the same physical core. So if the other logical core is doing useful work, there is no need to join a split node that is of dubious quality. Only if a split node is available that is almost certain not to result in wasted work should the logical core join it.
Why would this not be the goal for ANY parallel search split?

It is always the GOAL not to do wasted work. But if you have nothing better to do with a physical core, there is no harm in letting it search some branches even if there is a significant chance that the effort turns out to be wasted. (Of course the splitting overhead must still be taken into account.)

With HT, a logical core that is doing useless work badly hurts the other logical core of the same physical core. So if there is a significant chance that doing a split is not going to contribute, it is better not to do the split and to let the logical thread relinquish all resources of its physical core to the other logical thread. On the other hand, if you're quite sure a particular node is not going to fail high, it is safe to do a split.

For example, after the first root move has been searched, it is probably beneficial to let all other root moves be searched in parallel by as many logical cores as possible. One could stop there: once you run out of root moves, the secondary threads are put to sleep and only the primary threads help out with the root moves that take longer. This would not make sense at all with N threads on N physical cores, but it makes sense with N primary and N secondary threads on N physical cores with HT.

From my perspective, the more one optimizes an engine, the less HT helps. HT tries to hide latencies that can be addressed with good programming practices, at least to a significant level.

I'm talking about optimising specifically for HT. It should be possible to do much better than just blindly doubling the number of threads and see where that gets you.

bob · Post by **bob** » Tue Apr 30, 2013 11:23 pm

syzygy wrote:
bob wrote:
syzygy wrote:
Mike S. wrote:
Code: Select all
Engine              2T    4T  factor
------------------------------------
Critter 1.6a       2667  3930  1,47
Houdini 1.5a       3408  3964  1,16
Rybka 2.3.2a        168   283  1,68
Shredder Cl. 2012   626  1404  2,24
Spark 1.0          4304  5644  1,31
                              ------
                            Ø  1,57
If you got more than double the nps with 4 threads on a dual core with HT, then something went wrong in your testing. The other numbers except for maybe Houdini also look suspicious. (Well, Rybka anyway reports an imaginary nps, so we can safely ignore those numbers.) Maybe some other processes were running in the background?
There are programs that cause issues that HT can help. I have seen 2x in the past, although not with chess engines.

And there is the 32/64 bit issue. HT has helped some 32 bit apps more than expected...
HT certainly can help, but two threads running on a HT core are not both going to be faster than when only one thread is running on that same core.

I completely agree. And even getting to 2x is not a given for a very poorly behaved program. I think the 2x case I saw published was running two separate programs, as opposed to one threaded application.

bob · Post by **bob** » Tue Apr 30, 2013 11:37 pm

BTW to avoid further derailing, I ran a couple of tests and counted the total nodes search as always, and also counted the number of PV nodes, defined as any node where alpha != beta - 1.

nodes=108,809,688 PV nodes = 26,576
nodes=136,217,915 PV nodes = 77,215

That pattern was repeated, worst case was 0.05% (0.0005) of the nodes can produce different/updated alpha or beta values. So the A/B values used for whatever pruning/reduction decisions are insignificant...

FlavusSnow · Post by **FlavusSnow** » Tue Apr 30, 2013 11:48 pm

Bob, I appreciate your comments and certainly respect your research. I generally agree with your assessment of hyperthreading. I want to note a simple example though, citing published benchmarks of Fritz from Tom's Hardware's 2012 CPU charts:

Intel Core i7-2600K 3.4 Ghz 4c/8t = 12,986 nps
We have to adjust this for speedup, based on your DTS speedups from your publications, Bob, I'm assuming a speedup of 6.6 times - so to normalize these nps I will multiply by 6.6/8 = 10,713 'effective' nodes per second.

Intel Core i5-2500K 3.3 ghz 4c/4t = 10,146 nps
We have to adjust this for clock speed so I will multiply by 3.4/3.3 = 10,453 nps
We now have to get 'effective' nodes by applying a true speedup, so I will multiply by 3.7/4 = 9,669.

As 10,713 > 9,669 I would have to conclude that HT does indeed increase the search speed for Fritz. Now this is a 11% increase which means maybe a ~5 ELO increase in strength, but an increase nonetheless.

This can't be conclusive for all engines. Its interesting to note that activating HT for Fritz only increases the raw nps by about 20-25%, but this seems to be enough to make it worth while despite the increased inefficiencies introduced by adding threads.

bob · Post by **bob** » Wed May 01, 2013 12:50 am

FlavusSnow wrote:Bob, I appreciate your comments and certainly respect your research. I generally agree with your assessment of hyperthreading. I want to note a simple example though, citing published benchmarks of Fritz from Tom's Hardware's 2012 CPU charts:

Intel Core i7-2600K 3.4 Ghz 4c/8t = 12,986 nps
We have to adjust this for speedup, based on your DTS speedups from your publications, Bob, I'm assuming a speedup of 6.6 times - so to normalize these nps I will multiply by 6.6/8 = 10,713 'effective' nodes per second.

Intel Core i5-2500K 3.3 ghz 4c/4t = 10,146 nps
We have to adjust this for clock speed so I will multiply by 3.4/3.3 = 10,453 nps
We now have to get 'effective' nodes by applying a true speedup, so I will multiply by 3.7/4 = 9,669.

As 10,713 > 9,669 I would have to conclude that HT does indeed increase the search speed for Fritz. Now this is a 11% increase which means maybe a ~5 ELO increase in strength, but an increase nonetheless.

This can't be conclusive for all engines. Its interesting to note that activating HT for Fritz only increases the raw nps by about 20-25%, but this seems to be enough to make it worth while despite the increased inefficiencies introduced by adding threads.

The DTS speedup numbers really have nothing to do with what Fritz, crafty and such do today, because (a) they don't use DTS and (b) the trees are far different (more variable) than the trees from back then.

For example, the last time I ran an 8 cpu test and posted the results (I think the data is still on my ftp box for those interested, but it is a ton of Crafty log files so it takes some time to filter through) the speedup was maybe 6 (I would have to locate the results myself.) The data did pretty well fit my linear approximation of

speedup = 1 + .7 * (NCPUS - 1)

which would give 5.9...

I am going to post some data concerning search overhead, by running a ton of positions to fixed depth, using 1, 2, 4 and 8 threads. That will pretty well show the search overhead, which I have typically measured at 30%. Which means if HT doesn't give you a 30% NPS increase, it will lose. At least for Crafty. I've seen nothing to suggest that Fritz's parallel search is any better, and I don't even know if it is as good as Crafty's search. Based on your numbers (25%) it would appear to be a losing idea.

You can certainly compute Fritz's search overhead by doing the same test I am doing... Search a significant number of test positions (NOT tactical positions, my opening positions for cluster testing would be OK although I would like to see some later middle-game and even endgame positions thrown in) to a fixed depth with one cpu and record the total nodes searched for all positions. Repeat using 2 cores. the 2 core run will search more nodes. The last time I tested this on Crafty, admittedly a few years ago, it was 30%. If it averages less than your 25% speedup, then HT is a gain. However, I will point out that if you gain just 1 elo, there is a risk, because doubling the number of cores certainly increases the variability on the search times, which can hurt at the wrong time.

Let me know if you run this, because I am doing it myself, and it would be interesting to see how the two compare. My numbers will follow in a day or two...

BTW, I am almost certain you will not see anything near 3.7x speedup using 4 cores. My numbers are closer to 3.2 or 3.2 (or they were the last time I ran this kind of test, which was several years ago). I doubt my numbers are any better, and just hope they haven't gone down with the more selective trees we search.

bob · Post by **bob** » Wed May 01, 2013 7:26 am

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:To answer the HT question will be more complicated. A HT run is going to be somewhat faster. No doubt about it. How much faster really does depend on the quality of the internal engine programming and how much it has been optimized. The better optimized the code (better cache locality for memory accesses, less dependencies and pipeline stalls, the less HT is going to help. When I first turned on HT on my PIV, it improved Crafty very slightly. But after a few months of optimizing for AMD cache stuff, I went back to test again on my PIV and the thing was not helping near as much (HT on). And, in fact, it was a net loss after factoring in the overhead. Something on the order of 15% loss or so overall.
What has not yet been done, as far as I know, is optimise a parallel search specifically for HT.

With normal cores, it is important to minimise idle time of all processors. As a rule it is always better for a core to do something than to do nothing (of course I'm aware there is splitting overhead).

With HT, this is different. If one logical core is idle (and not spinning), its resources go to the other logical core of the same physical core. So if the other logical core is doing useful work, there is no need to join a split node that is of dubious quality. Only if a split node is available that is almost certain not to result in wasted work should the logical core join it.
Why would this not be the goal for ANY parallel search split?
It is always the GOAL not to do wasted work. But if you have nothing better to do with a physical core, there is no harm in letting it search some branches even if there is a significant chance that the effort turns out to be wasted. (Of course the splitting overhead must still be taken into account.)

With HT, a logical core that is doing useless work badly hurts the other logical core of the same physical core. So if there is a significant chance that doing a split is not going to contribute, it is better not to do the split and to let the logical thread relinquish all resources of its physical core to the other logical thread. On the other hand, if you're quite sure a particular node is not going to fail high, it is safe to do a split.

For example, after the first root move has been searched, it is probably beneficial to let all other root moves be searched in parallel by as many logical cores as possible. One could stop there: once you run out of root moves, the secondary threads are put to sleep and only the primary threads help out with the root moves that take longer. This would not make sense at all with N threads on N physical cores, but it makes sense with N primary and N secondary threads on N physical cores with HT.

We certainly don't agree there. I have always "split at the root" because that is the one place where you are guaranteed to have to search every move, no matter what... I don't follow your reasoning about the HT vs not-HT search at the end of your explanation. If it works (whatever search approach you use) for normal cores, it ought to be good for HT cores, assuming HT provides any real gain. I make the basic assumption that I am going to search what is necessary, and no more, wherever I choose to do a parallel search...

From my perspective, the more one optimizes an engine, the less HT helps. HT tries to hide latencies that can be addressed with good programming practices, at least to a significant level.
I'm talking about optimising specifically for HT. It should be possible to do much better than just blindly doubling the number of threads and see where that gets you.

Since you can't control what is really going on, I'm not sure this is worth the effort. Again, I only want to search things that are needed, no speculative work in my design. That makes it hard to decide when to turn HT on and off. I still believe "off" is the right answer, but we'll see...

syzygy · Post by **syzygy** » Wed May 01, 2013 12:37 pm

bob wrote:
syzygy wrote:For example, after the first root move has been searched, it is probably beneficial to let all other root moves be searched in parallel by as many logical cores as possible. One could stop there: once you run out of root moves, the secondary threads are put to sleep and only the primary threads help out with the root moves that take longer. This would not make sense at all with N threads on N physical cores, but it makes sense with N primary and N secondary threads on N physical cores with HT.
We certainly don't agree there. I have always "split at the root" because that is the one place where you are guaranteed to have to search every move, no matter what...

I don't see any disagreement at all.
I assume an N-core processor with HT. The first root move has been searched.
With HT off, we would use N processors to search the remaining root moves. If some of those processors are finished and we run out of root moves, they will join the search of the other moves.

What I propose is this: as long as we have root moves to search, search them with 2N threads. Once we run out of root moves and threads start to finish searching their root move, reduce the number of searching threads to N (one per physical core).

The point is: there will be little parallel overhead when searching root moves in parallel (each root move being searched serially). So all the extra nps that HT delivers is a win.

Once threads join other threads to finish searching the last root moves, parallel overhead will probably increase. Therefore don't use HT there, because the nps increase might not offset the increased parallel overhead.

I don't follow your reasoning about the HT vs not-HT search at the end of your explanation. If it works (whatever search approach you use) for normal cores, it ought to be good for HT cores, assuming HT provides any real gain.

I am only assuming here that HT provides a real gain in terms of nps. I do not assume here that in general this increase offsets the parallel overhead. Therefore I propose to restrict application of HT to part of the search where the parallel overhead should be small: "the one place where you are guaranteed to have to search every move, no matter what..."

Hyperthreading and Computer Chess: Intel i5-3210M

Re: Hyperthreading and Computer Chess: update

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Re: Hyperthreading and Computer Chess: update

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Re: Hyperthreading and Computer Chess: update

Re: Hyperthreading and Computer Chess: update - final result

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Re: Hyperthreading and Computer Chess: Intel i5-3210M