Hyperthreading and Computer Chess: Intel i5-3210M


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
bob wrote:
The base idea is to use the traditional Knuth/Moore form: N = W^ceil(d/2) + W^floor(d/2) - 1, which reduces to 2 * W^(d/2) - 1 when d is even. If you analyze that recursively, the first branch at the root searches many more nodes than the remainder, and this can be spelled out mathematically using Knuth's formula. The idea is that when you screw up move ordering, so that the second move is better than the first, you almost double the size of the sub-tree for that branch...
If you screw up move ordering, you are screwed whether you search in parallel or not. When searching serially the size of the subtree for that branch doubles just as much as when searching in parallel.
Nope, not at all.
Ehm, it very clearly does... I am sorry, but I am responding to what you wrote above and not to something you did not write.
If you search serially from move 1 to move 40, and move 5 causes a fail-high, you searched only moves 1-4 unnecessarily, due to bad ordering. With a parallel search, you could easily search all 40 moves before you get that fail high. I thought this was obvious. The "overhead" is not a comparison to an "optimal tree"; it is a comparison of the parallel search space to the sequential search space. Those two comparisons are not the same thing at all.
Yes, and I am not denying that parallel overhead exists. But your argument from your previous post was very clearly comparing to the optimal tree. And the paper you were referencing is not analyzing this parallel overhead.
I compare to the "optimal sequential tree", which is NOT the same as "the optimal tree". The paper I mentioned (I only assume you are talking about the parallel computing journal paper) very carefully analyzes this "search overhead". So perhaps we are on a different topic altogether. I have a bunch of reprints the journal sent me, but I don't have an electronic copy. The font is small enough, and uses enough formula symbols, that the last time I tried to OCR the thing it was too messy to bother with.
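To pin down what is being compared, a rough restatement (my wording, not a formula from the paper): for the same position searched to the same fixed depth d,

    search overhead = nodes_parallel(d) / nodes_serial(d) - 1

i.e. the extra nodes the parallel search visits relative to the same engine searching sequentially to that depth.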
DTS also gives you far more flexibility on choosing a split point than YBW does. With YBW you are always faced with the decision "split HERE, after one move has been searched, or do nothing." With DTS I could split ANYWHERE giving me a better chance to pick a better place to split.
This inflexibility of many YBW implementations is not part of YBW. In my implementation of YBW, idle threads actively look for a potential split point as close to the root as possible.
So how do they know "when" to start looking? Surely they don't just look at all tree state repeatedly until there is a split point possible? Most trigger a parallel search AFTER at least one move has been searched at any node. If you use an iterated search, it is certainly doable, but not with recursion.

I do not agree it had no test results. In fact, it has test results for 24 Bratko-Kopec positions on 5-ply searches, both for PVS and for EPVS.

Let me summarise the paper:
Section 1:
minimax searches a lot of nodes, alpha/beta much less. Knuth formula, node types.
Section 2:
PVS searches no extra nodes if the tree is perfectly ordered. However, to make this work in practice the search must be able to split at any depth without penalty and (what I believe is only mentioned in section 3) the tree must be perfectly balanced (all subtrees of all moves after the 1st have the same size) because otherwise processors become idle.
Section 3:
It is mentioned that things are different with imperfect move ordering, and that imperfect move ordering cannot be avoided. This is a big problem for PVS, because it makes the tree very imbalanced and therefore leaves lots of processors idle. Enter enhanced PVS, or EPVS: if one processor becomes idle, all other searches are aborted and a new split point is made two plies higher in the tree. The TT is relied on not to lose the work that has already been done.
Section 4:
PVS and EPVS were implemented in Cray Blitz, tests were run on a Sequent Balance 21000, results are shown. (It seems 1 processor searched about 63 nodes/second.)
Section 5:
Conclusion. We need a better algorithm.

I don't see anything in this paper on which one could base a conclusion that HT can't possibly be of benefit with YBW-based parallel searches.
Read between the lines. HT provides a minimal speed boost. Additional threads without perfect move ordering produce search overhead. The overhead outweighs the NPS increase.
Between the lines? Right...
Overhead is CLEARLY defined in terms of Knuth's formula for alpha/beta. That gives the overhead. HT merely has to overcome that to be useful. From my perspective, this is a simple thing to measure. I'm not into the "overhead can help" without some type of actual evidence to support it...
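For concreteness, a throwaway evaluation of the Knuth/Moore minimal-tree formula quoted at the top of this exchange (not code from Crafty or from the paper; the branching factor of 38 and the depths are purely illustrative):

Code:

/* Minimal alpha/beta tree size per Knuth/Moore:
   N(W, d) = W^ceil(d/2) + W^floor(d/2) - 1
   (2 * W^(d/2) - 1 when d is even). */
#include <stdio.h>

static double ipow(double w, int e) {
    double r = 1.0;
    while (e-- > 0)
        r *= w;
    return r;
}

static double minimal_tree(double W, int d) {
    return ipow(W, (d + 1) / 2) + ipow(W, d / 2) - 1.0;
}

int main(void) {
    for (int d = 8; d <= 12; d++)            /* illustrative depths only */
        printf("W = 38, d = %2d  ->  N = %.3e\n", d, minimal_tree(38.0, d));
    return 0;
}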
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

bob wrote:This:
Your conclusion b) is false. You could not search those nodes in any obvious way in a non-HT environment, because those nodes were searched using the 15% nps increase.
So explain how to search extra nodes in the same time on the same hardware. Emulating two threads by a single thread through interleaving isn't going to do that.
Do you REALLY think that by reducing the pruning or reductions, one can gain 5%? If so, why not just modify the reduction/pruning code to reduce or prune less in those same places where you would normally do a parallel split.
Because then you lose nodes (= time) by reducing/pruning less...

The argument is that with the highly selective searches of today, a parallel search will be somewhat less selective. Some reductions won't be triggered. This would also explain why parallel search overhead today is higher than in the past, as you insist it is.

Of course you can try to simulate this in a serial search by not triggering some reductions, but the result will just be that you search more nodes, so take more time. If the programmer has tested his reductions well, turning them off will be a bad trade off. The reductions were not put in for nothing. But if HT turns them off "for free", or at least at lesser cost, then this is a factor that may help.
I don't like this "guesswork" approach about "the extra width or nodes might help..."
I am just offering a possible explanation for why it is not completely unthinkable that HT could work even if the nps speed increase might not completely outweigh the extra nodes searched due to parallel overhead.

One way to test it is as follows. Let an engine play itself using fixed depth searches, one side searching with 4 threads and the other side searching with 8 threads. (This should be done on a machine with at least 4 cores and HT on, because 8 threads with just 4 hardware threads might give quite different trees.) If the engine using 8 threads clearly outperforms the same engine using 4 threads, that proves that the extra nodes of the 8-thread searches add to the quality of the search.

Of course, even if this test shows that 8-thread searches to the same depth are somehow of higher quality, this does not mean that HT is a win. But it would show there is some effect that should not be overlooked.

I am not convinced that the effect exists, but I am also not convinced that it does not exist.
One could always test this easily enough. Take a program that does not use spin locks, and play it against a gauntlet, using N cores and N threads, and then again using N cores and 2N threads. That will precisely define what the effect of the search overhead is (note that by 2N threads I mean NO HT, just doubled threads, 2 per physical core). Now we know how much the search overhead hurts. Then one can carefully measure how much HT affects search speed (gaining some back). The net will show whether it is a gain or loss.
Running 8 threads on 4 physical cores without HT seems a very poor simulation of using HT.
Play a series of matches to fixed depth using 1 cpu, then 2 cpus, then 4. Since the depth is fixed, the time to complete is irrelevant, and the only change will be the search overhead. If it does help, the 4 cpu fixed depth program should perform better than the 1 cpu fixed depth program.
This test I agree with. (I wrote my test proposal before I read this.)

Still it is beyond me why you would think that if the latter test shows a clear win for the 4 cpu fixed depth program, this could be of any help for improving a serial search.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

bob wrote:The paper I mentioned (I only assume you are talking about the parallel computing journal paper) very carefully analyzes this "search overhead". So perhaps we are on a different topic altogether.
The paper I have here is "A parallel alpha/beta tree searching algorithm" published in Parallel Computing 10 (1989), 299-308. It starts out with Knuth's formula. Paper number 12 at http://www.cis.uab.edu/hyatt/ (which contains a typo: "Al" should be "Algorithm"). This paper does not analyze parallel search overhead and certainly not in a way that would be applicable to DTS or YBW. The paper is actually mostly concerned with the problem of idle threads, which is a problem of PVS and EPVS and not of DTS and YBW.

That parallel search overhead exists is clear and undisputed and does not require Knuth's formula or any paper. But a statement that doubling the number of threads gives 30% overhead cannot be based on this paper, nor on anything "between the lines".

Anyway, I don't necessarily disagree that the 30% number is accurate, but I found it interesting to note that the DTS paper suggested otherwise. Ok, so that DTS paper was from a time with very different search trees. Well, the same applies to that 1989 paper... on which you seem to base the 30%... Stack overflow.
DTS also gives you far more flexibility on choosing a split point than YBW does. With YBW you are always faced with the decision "split HERE, after one move has been searched, or do nothing." With DTS I could split ANYWHERE giving me a better chance to pick a better place to split.
This inflexibility of many YBW implementations is not part of YBW. In my implementation of YBW, idle threads actively look for a potential split point as close to the root as possible.
So how do they know "when" to start looking? Surely they don't just look at all tree state repeatedly until there is a split point possible? Most trigger a parallel search AFTER at least one move has been searched at any node. If you use an iterated search, it is certainly doable, but not with recursion.
Threads that are searching set a flag for a node that is ready to be split (e.g. after the first node was searched, or after at least captures and killers were searched), thereby marking them as potential split points. Idle threads look for those flags and pick the potential split node that is closest to the root or that is searched with the highest remaining depth. Making the split is a matter of setting a flag in that node and copying the moves leading from the root position to that node. (So it is important to store those moves in a globally accessible array and not just on the stack.)

The only thing you can't do with recursion, or at least not easily, is let a master thread that has become idle help at other nodes than a node searched by one of its slaves. (This limitation is actually quite helpful for the implementation, because it means that a helpful master does not have to switch to a different search tree.)
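Roughly, in code, the "idle thread picks a flagged node" step above might look something like this toy sketch (self-contained and deliberately simplified; the struct, the global array and the selection rule are made-up stand-ins, and the locking/atomics a real implementation needs are omitted):

Code:

/* Toy sketch of late-join split-point selection: searching threads flag
   nodes that are ready to be split; an idle thread scans the flags and
   picks the candidate with the greatest remaining depth. */
#include <stdio.h>

#define MAX_NODES 128

typedef struct {
    int ready;            /* set by the searching thread once the node may be split */
    int remaining_depth;  /* depth left to search below this node */
    int ply;              /* distance from the root */
} NodeInfo;

static NodeInfo tree[MAX_NODES];   /* shared, globally visible search state */

/* Return the index of the best split candidate, or -1 if none is flagged. */
static int pick_split_point(int n) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!tree[i].ready)
            continue;
        if (best < 0 || tree[i].remaining_depth > tree[best].remaining_depth)
            best = i;
    }
    return best;
}

int main(void) {
    /* Fake a few in-progress nodes. */
    tree[0] = (NodeInfo){ .ready = 0, .remaining_depth = 14, .ply = 1 };
    tree[1] = (NodeInfo){ .ready = 1, .remaining_depth = 11, .ply = 3 };
    tree[2] = (NodeInfo){ .ready = 1, .remaining_depth =  7, .ply = 6 };
    int s = pick_split_point(3);
    if (s >= 0)
        printf("idle thread joins node %d (ply %d, depth %d)\n",
               s, tree[s].ply, tree[s].remaining_depth);
    return 0;
}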
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
bob wrote:This:
Your conclusion b) is false. You could not search those nodes in any obvious way in a non-HT environment, because those nodes were searched using the 15% nps increase.
So explain how to search extra nodes in the same time on the same hardware. Emulating two threads by a single thread through interleaving isn't going to do that.
If that extra 15% search space is worth anything, I will simply increase the target time per move to reach the SAME benefit. I simply do not believe it is there.
Do you REALLY think that by reducing the pruning or reductions, one can gain 5%? If so, why not just modify the reduction/pruning code to reduce or prune less in those same places where you would normally do a parallel split.
Because then you lose nodes (= time) by reducing/pruning less...

The argument is that with the highly selective searches of today, a parallel search will be somewhat less selective. Some reductions won't be triggered. This would also explain why parallel search overhead today is higher than in the past, as you insist it is.

Of course you can try to simulate this in a serial search by not triggering some reductions, but the result will just be that you search more nodes, so take more time. If the programmer has tested his reductions well, turning them off will be a bad trade off. The reductions were not put in for nothing. But if HT turns them off "for free", or at least at lesser cost, then this is a factor that may help.
I don't like this "guesswork" approach about "the extra width or nodes might help..."
I am just offering a possible explanation for why it is not completely unthinkable that HT could work even if the nps speed increase might not completely outweigh the extra nodes searched due to parallel overhead.

One way to test it is as follows. Let an engine play itself using fixed depth searches, one side searching with 4 threads and the other side searching with 8 threads. (This should be done on a machine with at least 4 cores and HT on, because 8 threads with just 4 hardware threads might give quite different trees.) If the engine using 8 threads clearly outperforms the same engine using 4 threads, that proves that the extra nodes of the 8-thread searches add to the quality of the search.

Of course, even if this test shows that 8-thread searches to the same depth are somehow of higher quality, this does not mean that HT is a win. But it would show there is some effect that should not be overlooked.

I am not convinced that the effect exists, but I am also not convinced that it does not exist.
One could always test this easily enough. Take a program that does not use spin locks, and play it against a gauntlet, using N cores and N threads, and then again using N cores and 2N threads. That will precisely define what the effect of the search overhead is (note that by 2N threads I mean NO HT, just doubled threads, 2 per physical core). Now we know how much the search overhead hurts. Then one can carefully measure how much HT affects search speed (gaining some back). The net will show whether it is a gain or loss.
Running 8 threads on 4 physical cores without HT seems a very poor simulation of using HT.
Play a series of matches to fixed depth using 1 cpu, then 2 cpus, then 4. Since the depth is fixed, the time to complete is irrelevant, and the only change will be the search overhead. If it does help, the 4 cpu fixed depth program should perform better than the 1 cpu fixed depth program.
This test I agree with. (I wrote my test proposal before I read this.)

Still it is beyond me why you would think that if the latter test shows a clear win for the 4 cpu fixed depth program, this could be of any help for improving a serial search.
The ONLY difference between the 1, 2 and 4 cpu programs would be the search overhead. Nothing else. If the search overhead gains anything at all, this test would expose that. I am not 100% sure myself, but I am certain enough, as I have mentioned. I've only run a zillion parallel tests over the years, trying out everything imaginable.

However, this one is easy enough for me to do, I will just have to figure out how to set up the opponents in my test to use a fixed depth as well, and then find a "reasonable depth setting" for each to produce similar results as those in my normal testing.

I'll report back once I get the results. However, something tells me I am going to need a TON of games to even have a chance on measuring any possible gain...
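A rough feel for how many, assuming something like 40% draws: the per-game standard deviation of the score is then about 0.39, so over 30,000 games the standard error of the match score is roughly 0.39 / sqrt(30000) ≈ 0.0022, which near a 50% score is about 1.5 Elo (at roughly 7 Elo per percentage point), i.e. a 95% interval on the order of +/- 3 Elo. Any quality gain much smaller than that disappears into the noise.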

More in a few days...
IQ
Posts: 162
Joined: Thu Dec 17, 2009 10:46 am

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by IQ »

hgm wrote:
IQ wrote:It can if it is not the "extra" computational power which leads to better performance, but the shape of the tree searched. Think about it like this.

1) Assume that the 12%-20% gain through hyperthreading is not enough to offset the additional parallel search overhead
2) If the HT version with 2x the number of threads then still outperforms the non-hyperthreaded version, it is most likely due to the different shape (and somewhat larger size) of the tree searched
3) Nowadays, with all sorts of reductions/extensions/pruning going on and the additional splitting in the HT-on version, it could very well be that some of the reductions which would normally kick in no longer do so in the additionally searched tree space. This would also explain the faster solution of tactical tests, as "key" moves are usually NOT at the top of the previous move ordering.
4) But if the hypothetical gain is due to the different shape of the tree, through mechanisms such as 3), then this behaviour could be replicated in software by adjusting the reductions/extensions/pruning, or even the alpha/beta pruning, or by introducing a probabilistic element.

This can all easily be put to rest if some of the kind testers run the same test with hyperthreading turned off in the BIOS (and all other variables like turbo-boost controlled), playing against an identical machine using hyperthreading.
Earlier you were talking about an 8-HT machine with an 8-thread Houdini beating a 4-full-core machine with a 4-threaded Houdini. I don't see how you get from there to this new assumption (1), the nps increase not offsetting the 'parallel-search overhead'. I would say if the HT machine wins, it is obvious that the speedup is more than offsetting the search overhead.

The distinction you make between shape of the tree and search overhead seems completely arbitrary. The different tree shape is the search overhead. If the tree shapes were the same, and the nps were the same, there wouldn't be any search overhead. More threads means a more bushy tree compared to the alpha-beta minimum, and that is exactly what we call 'overhead'.

I think we (and everyone else) agrees that using 8 threads will result in a less efficient tree than using 4 threads. The only questions are: "how much less efficient", and "how many extra nps from HT".
I never talked about "an 8-thread Houdini beating a 4-full-core machine with a 4-threaded Houdini". Some members performed tests and claimed such. I performed a thought experiment. This experiment assumed that this claim was true, but also that the parallel overhead is greater than the HT gain. I then took an explanation provided by another member here, that "the extra nodes" are all somewhat useful and increase strength, and concluded that under these hypothetical circumstances the same gain in strength could be reached through software. Please see my other posts where I tried to restate this thought experiment in maybe less confusing ways. For the record, I also want to add that I personally am sceptical about the claimed test results, because there might be some bugs in Houdini when using affinities to simulate HT off (a proper test would have to turn HT off in the BIOS). Again, I probably should have separated the discussion about the Houdini bug from the thought experiment, as it turned out to be a rather convoluted post.
IQ
Posts: 162
Joined: Thu Dec 17, 2009 10:46 am

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by IQ »

bob wrote:
syzygy wrote:
IQ wrote:Assumptions:
1) HT gain does not outweigh parallel overhead. Let's say HT gains 15% and the parallel overhead of going from, say, 6 to 12 threads is 30%. Usually this would lead to the NON-HT version performing somewhat better than the HT version. I am not arguing specific numbers here, but this is, for example, what Bob says.
What do you mean by "HT gains 15%"?
What do you mean by "parallel overhead is 30%"?

Define your terms...

I suppose the following definitions make sense:
- HT leads to 15% higher nps.
- doubling the number of threads leads to 30% extra nodes searched to reach the same depth.

Doubling the number of threads, everything else staying the same, is ALWAYS bad. All programmers agree on this. Only the 15% higher nps could possibly offset it. Now the argument is already over, because removing the HT hardware removes the 15% higher nps.

What is not clear about this?

Your "conclusion a)" is more or less copied from what I wrote much earlier in this thread. If 1 and 2 are both true, then a) is the explanation.

Your conclusion b) is false. You could not search those nodes in any obvious way in a non-HT environment, because those nodes were searched using the 15% nps increase.

That 15% HT gain might offset 30% parallel overhead does NOT mean that 0% HT gain might offset 30% parallel overhead.

Of course if you are a very good programmer you can take H3, completely redesign the extension/reduction scheme, and release a stronger H4, but nothing in the way HT works will point you in the right direction. HT is nothing else than "double the number of threads, some increase in nps". Somehow pretending that one could simulate the effects of doubling the number of threads without parallel overhead is not going to help.
You are wrong, as this is a basic discussion that comes up in parallel search. IF, as you want to claim, the extra nodes provide extra knowledge that improves the result of a search to the same depth, one CAN implement that in a single thread. Just set up your data structures so that you pick one move at ply N and make it, giving a new sub-tree. Then pick another move at the SAME ply and make it, giving a second sub-tree. Now multiplex between these two trees, one node at a time. You are now searching the IDENTICAL tree the parallel search would grow, and you would then see the SAME benefit, without the extra thread.
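Just to make the interleaving mechanics concrete, here is a toy, self-contained sketch of multiplexing two sibling sub-trees from one thread, one node at a time (no alpha/beta bookkeeping, a fixed uniform tree, and every name here is made up):

Code:

#include <stdio.h>

#define MAX_STACK 64
#define BRANCH 3
#define DEPTH 4

typedef struct { int depth; int next_child; } Frame;

typedef struct {
    Frame stack[MAX_STACK];
    int top;      /* index of the current frame; -1 once the sub-tree is done */
    long nodes;   /* nodes visited in this sub-tree */
} Walk;

static void walk_init(Walk *w) {
    w->top = 0;
    w->stack[0].depth = 0;
    w->stack[0].next_child = 0;
    w->nodes = 1;                     /* count the sub-tree root itself */
}

/* Visit exactly one more node (or retreat one level); return 0 when finished. */
static int walk_step(Walk *w) {
    Frame *f = &w->stack[w->top];
    if (f->depth >= DEPTH || f->next_child >= BRANCH) {
        w->top--;                     /* everything below this node is done */
        return w->top >= 0;
    }
    f->next_child++;                  /* descend into the next child */
    w->top++;
    w->stack[w->top].depth = f->depth + 1;
    w->stack[w->top].next_child = 0;
    w->nodes++;
    return 1;
}

int main(void) {
    Walk a, b;
    walk_init(&a);
    walk_init(&b);
    int more_a = 1, more_b = 1;
    while (more_a || more_b) {        /* multiplex: one node from each sub-tree per turn */
        if (more_a) more_a = walk_step(&a);
        if (more_b) more_b = walk_step(&b);
    }
    printf("sub-tree A: %ld nodes, sub-tree B: %ld nodes\n", a.nodes, b.nodes);
    return 0;
}

A real engine would of course carry alpha/beta bounds, hash probes and move ordering along with each walk; only the scheduling itself is shown here.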

This assumption that the search overhead is good is simply wrong. It is overhead. There is absolutely zero evidence to show that the overhead helps in any way at all. One could easily measure this, given test positions searched to the SAME depth, once with 1 cpu and once with more than one. If your speculation is true, the second result should be better. I've run a zillion such tests and have only seen a VERY few cases where it happens, and those are quite often not repeatable. Something that happens once every 100 or 1000 moves is not very helpful.
Thank you! This is what my thought experiment was about: if those nodes "magically" help, one could achieve the same with search changes. Also, it seems reasonable that even if some of the nodes help, the vast majority are still plain parallel overhead, which would mean that the HT gain would have to be much, much higher than the parallel overhead.

I am not 100% sure whether this software change could be achieved through interleaving, though. This would then have the same parallel overhead, only serialized (and without the HT gain). As I have not thought this low-level simulation completely through, I proposed some higher-level possibilities for searching the "magic" nodes while avoiding this loss. But this is an interesting discussion...
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:This:
Your conclusion b) is false. You could not search those nodes in any obvious way in a non-HT environment, because those nodes were searched using the 15% nps increase.
So explain how to search extra nodes in the same time on the same hardware. Emulating two threads by a single thread through interleaving isn't going to do that.
If that extra 15% search space is worth anything, I will simply increase the target time per move to reach the SAME benefit. I simply do not believe it is there.
If I am allowed to increase the target time, I have a simple way to increase the strength of any engine...

I take it that in all seriousness we agree that any improved quality of say a "4 cpu fixed depth" search over a "1 cpu fixed depth" search, at same depth, cannot be used to improve a serial search. Increasing the target time of course does not count.
The ONLY difference between the 1, 2 and 4 cpu programs would be the search overhead. Nothing else. If the search overhead gains anything at all, this test would expose that. I am not 100% sure myself, but I am certain enough, as I have mentioned. I've only run a zillion parallel tests over the years, trying out everything imaginable.

However, this one is easy enough for me to do, I will just have to figure out how to set up the opponents in my test to use a fixed depth as well, and then find a "reasonable depth setting" for each to produce similar results as those in my normal testing.

I'll report back once I get the results. However, something tells me I am going to need a TON of games to even have a chance on measuring any possible gain...

More in a few days...
I agree that this test should settle the matter (at least for crafty ;-)). If it turns out there is no increase or decrease in quality, then only time-to-depth counts. That would be a useful result.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:This:
Your conclusion b) is false. You could not search those nodes in any obvious way in a non-HT environment, because those nodes were searched using the 15% nps increase.
So explain how to search extra nodes in the same time on the same hardware. Emulating two threads by a single thread through interleaving isn't going to do that.
If that extra 15% search space is worth anything, I will simply increase the target time per move to reach the SAME benefit. I simply do not believe it is there.
If I am allowed to increase the target time, I have a simple way to increase the strength of any engine...

I take it that in all seriousness we agree that any improved quality of say a "4 cpu fixed depth" search over a "1 cpu fixed depth" search, at same depth, cannot be used to improve a serial search. Increasing the target time of course does not count.
The ONLY difference between the 1, 2 and 4 cpu programs would be the search overhead. Nothing else. If the search overhead gains anything at all, this test would expose that. I am not 100% sure myself, but I am certain enough, as I have mentioned. I've only run a zillion parallel tests over the years, trying out everything imaginable.

However, this one is easy enough for me to do, I will just have to figure out how to set up the opponents in my test to use a fixed depth as well, and then find a "reasonable depth setting" for each to produce similar results as those in my normal testing.

I'll report back once I get the results. However, something tells me I am going to need a TON of games to even have a chance on measuring any possible gain...

More in a few days...
I agree that this test should settle the matter (at least for crafty ;-)). If it turns out there is no increase or decrease in quality, then only time-to-depth counts. That would be a useful result.
I can't agree or disagree yet. This is going to only answer one question, "Is the parallel search overhead all wasted effort or is there some accuracy/quality gained from it?"

To answer the HT question will be more complicated. An HT run is going to be somewhat faster, no doubt about it. How much faster really does depend on the quality of the internal engine programming and how much it has been optimized. The better optimized the code (better cache locality for memory accesses, fewer dependencies and pipeline stalls), the less HT is going to help. When I first turned on HT on my PIV, it improved Crafty very slightly. But after a few months of optimizing for AMD cache stuff, I went back to test again on my PIV and the thing was not helping nearly as much (HT on). And, in fact, it was a net loss after factoring in the overhead. Something on the order of a 15% loss or so overall.

Crafty's been around for a LONG time. It has been optimized repeatedly; Eugene Nalimov and I even looked at it carefully when first migrating to an 8-way Opteron box that scaled very poorly NPS-wise. So it is, quite simply, extremely fast and efficient. How a program derived from reverse-engineering will compare in that regard I can't say (Robolito is absolutely a reverse-engineered something, which makes Houdini more of the same). I've seen some (non-chess) applications that run 2x faster on HT boxes. I've seen some, such as Crafty, that are somewhat faster (I just ran a test and saw a 20% improvement in NPS, which is double what I was seeing on the older PIV box). And I have seen some that run with absolutely no speed improvement at all.

For Crafty, the gain is 20% and the search overhead is about 30%, although I will try to run a bunch of test positions to fixed depth and average the numbers to get an accurate value here. That leaves me with roughly a 10% loss, which is measurable. Unless there is an Elo gain from that overhead that offsets it, for me it's a loser.
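Put in time-to-depth terms (just arithmetic on those two figures): reaching a fixed depth with HT on costs about 1.30x the nodes at about 1.20x the NPS, so

    time-to-depth (HT on) / time-to-depth (HT off) ≈ 1.30 / 1.20 ≈ 1.08

i.e. a net slowdown in the 8-10% range, consistent with the loss figure above.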

More once I get the data and results...

I have to do a little work to modify my test script, since it assumes 8 or 12 games per node, using one core per game. I have to adjust that to reserve enough cores per game to make this work.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
bob wrote:The paper I mentioned (I only assume you are talking about the parallel computing journal paper) very carefully analyzes this "search overhead". So perhaps we are on a different topic altogether.
The paper I have here is "A parallel alpha/beta tree searching algorithm" published in Parallel Computing 10 (1989), 299-308. It starts out with Knuth's formula. Paper number 12 at http://www.cis.uab.edu/hyatt/ (which contains a typo: "Al" should be "Algorithm"). This paper does not analyze parallel search overhead and certainly not in a way that would be applicable to DTS or YBW. The paper is actually mostly concerned with the problem of idle threads, which is a problem of PVS and EPVS and not of DTS and YBW.

That parallel search overhead exists is clear and undisputed and does not require Knuth's formula or any paper. But a statement that doubling the number of threads gives 30% overhead cannot be based on this paper, nor on anything "between the lines".

Anyway, I don't necessarily disagree that the 30% number is accurate, but I found it interesting to note that the DTS paper suggested otherwise. Ok, so that DTS paper was from a time with very different search trees. Well, the same applies to that 1989 paper... on which you seem to base the 30%... Stack overflow.
DTS also gives you far more flexibility on choosing a split point than YBW does. With YBW you are always faced with the decision "split HERE, after one move has been searched, or do nothing." With DTS I could split ANYWHERE giving me a better chance to pick a better place to split.
This inflexibility of many YBW implementations is not part of YBW. In my implementation of YBW, idle threads actively look for a potential split point as close to the root as possible.
So how do they know "when" to start looking? Surely they don't just look at all tree state repeatedly until there is a split point possible? Most trigger a parallel search AFTER at least one move has been searched at any node. If you use an iterated search, it is certainly doable, but not with recursion.
Threads that are searching set a flag for a node that is ready to be split (e.g. after the first node was searched, or after at least captures and killers were searched), thereby marking them as potential split points. Idle threads look for those flags and pick the potential split node that is closest to the root or that is searched with the highest remaining depth. Making the split is a matter of setting a flag in that node and copying the moves leading from the root position to that node. (So it is important to store those moves in a globally accessible array and not just on the stack.)

The only thing you can't do with recursion, or at least not easily, is let a master thread that has become idle help at other nodes than a node searched by one of its slaves. (This limitation is actually quite helpful for the implementation, because it means that a helpful master does not have to switch to a different search tree.)
For the record, the 30% comes from thousands of test runs, where I ran the same set of positions to fixed depth using 1 cpu, then 2, 4 and 8, and then compared the nodes searched for each run. 2 cpus search a tree about 30% larger than 1 cpu, 3 cpus about 60% larger, etc. That's where my speedup = 1 + (NCPUS - 1) * 0.7 formula came from, in fact. That was from Crafty, maybe 8-10 years ago. It might be worse now for all I know, as we were not doing LMR reductions or forward pruning as aggressively then as we do today. I'll get an accurate number for Crafty, since it is not hard.
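Evaluated at a few processor counts, that rule of thumb gives (nothing here beyond the quoted formula):

    speedup(2) = 1 + 1 * 0.7 = 1.7
    speedup(4) = 1 + 3 * 0.7 = 3.1
    speedup(8) = 1 + 7 * 0.7 = 5.9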

I've got a shell script that will produce that data automatically, just have to find it.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

hgm wrote:
IQ wrote:It can if it is not the "extra" computational power which leads to better performance, but the shape of the tree searched. Think about it like this.

1) Assume that the 12%-20% gain through hyperthreading is not enough to offset the additional parallel search overhead
2) If the HT version with 2x the number of threads then still outperforms the non-hyperthreaded version, it is most likely due to the different shape (and somewhat larger size) of the tree searched
3) Nowadays, with all sorts of reductions/extensions/pruning going on and the additional splitting in the HT-on version, it could very well be that some of the reductions which would normally kick in no longer do so in the additionally searched tree space. This would also explain the faster solution of tactical tests, as "key" moves are usually NOT at the top of the previous move ordering.
4) But if the hypothetical gain is due to the different shape of the tree, through mechanisms such as 3), then this behaviour could be replicated in software by adjusting the reductions/extensions/pruning, or even the alpha/beta pruning, or by introducing a probabilistic element.

This can all easily be put to rest if some of the kind testers run the same test with hyperthreading turned off in the BIOS (and all other variables like turbo-boost controlled), playing against an identical machine using hyperthreading.
Earlier you were talking about an 8-HT machine with an 8-thread Houdini beating a 4-full-core machine with a 4-threaded Houdini. I don't see how you get from there to this new assumption (1), the nps increase not offsetting the 'parallel-search overhead'. I would say if the HT machine wins, it is obvious that the speedup is more than offsetting the search overhead.

The distinction you make between shape of the tree and search overhead seems completely arbitrary. The different tree shape is the search overhead. If the tree shapes were the same, and the nps were the same, there wouldn't be any search overhead. More threads means a more bushy tree compared to the alpha-beta minimum, and that is exactly what we call 'overhead'.

I think we (and everyone else) agrees that using 8 threads will result in a less efficient tree than using 4 threads. The only questions are: "how much less efficient", and "how many extra nps from HT".
I am in the process of producing some data to determine if the "extra width / overhead" improves the quality of the search or not. I am currently running two identical matches, using 1 thread, to fixed depth. When this finishes, I am going to run two more matches (30K each as always) with everything the same except Crafty uses two threads, which will make it search much faster, and search more nodes as well. I will repeat for 4 threads as well. Whether I go to 8 or not depends on whether 2 and 4 are showing any sort of improvement at all, as 8 runs 8x slower than normal since I have to use all cores and play just one game at a time per node...

First data should be available later this afternoon, after the first 120,000 games are done.