Hyperthreading and Computer Chess: Intel i5-3210M

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

IQ wrote:
syzygy wrote:
IQ wrote:Good to now have you on the same page here too, only that you misstate my assumption about speed gain/loss. I specifically said ".. AND the HT speed gain does NOT offset additional parallel overhead", which was supposed to mean that the parallel loss is greater than the HT gain.
Whatever assumptions you are making, they cannot hold. I have explained this twice, but the explanation might be too technical.
It's good to see you finally understood that the conclusions are valid and are now arguing about the assumptions.
You have a peculiar way of arguing.

I'm not interested in figuring out exactly where your reasoning fails. If you show me the design for a perpetuum mobile, it can take a lot of time to explain where the design fails. It is much easier to point to the laws of thermodynamics: it can't be done.

HT is a hardware measure. For a chess engine, HT equals "double the number of threads, some increase in nps". If you take away the hardware, you can't have the increase in nps. The doubling of threads is just bad. Ergo nothing to gain from HT without hardware HT. Now I leave it to you to figure out where your reasoning fails.
I still fail to see how anybody with any parallel programming experience would take an assumption like (1) "HT gain does not offset parallel overhead" as offensive.
Are you sure you know what you mean by "HT gain" and "parallel overhead"? Try to define the terms.
IQ
Posts: 162
Joined: Thu Dec 17, 2009 10:46 am

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by IQ »

You continue to miss the point. Your bringing "perpetuum mobiles" into the discussion just shows that you are out of your depth. But your self-assessment that it is difficult for you to find out where my reasoning fails is true, mainly because it does not fail. As you like to shift focus, change assumptions and misstate conclusions, it is difficult to argue with you. I will try one last time, maybe in simpler terms so that you can follow. Please note that the assumptions rest on what people in this thread reported. If you want to dispute them, fine - take it up with them. I am interested in the conclusions based on these assumptions, which I find kind of neat.

Assumptions:
1) HT gain does not outweigh parallel overhead. Let's say HT gains 15% and the parallel overhead of going from, say, 6 to 12 threads is 30%. Usually this would lead to the NON-HT version performing somewhat better than the HT version. Not arguing specific numbers here, but this is, for example, what Bob says.

2) Some members here report that their testing shows this is not so; in fact they state that in their testing HT performs better than non-HT.

Now here begins my thought experiment: what if 1 and 2 are TRUE at the same time? This leads me to the following conclusions:

a) As modern programs are not pure alpha/beta searchers but use pruning/extensions heavily, the additional nodes searched by the HT version (which normally would be just parallel overhead) somewhat increase playing strength, maybe due to some extension in an otherwise unvisited part of the tree, or because some reductions are avoided through different split points. If assumption 2 is true, I can hardly imagine another mechanism.
b) But then the following seems logical. As the non-HT version, taking parallel overhead into account, is still faster (30% - 15%) than the HT version, you could also search these nodes in a non-HT environment. Only you would trigger the search of these nodes not through the arbitrary splitting regime (which was only induced because the HT version uses twice as many threads) but by modifying the search in software. If one cannot come up with a reasonable scheme to reduce reductions or extend more, one could simply replicate the random nature of parallel splitting.


syzygy wrote:
IQ wrote:
syzygy wrote:
IQ wrote:Good to now have you on the same page here too, only that you misstate my assumption about speed gain/loss. I specifically said ".. AND the HT speed gain does NOT offset additional parallel overhead", which was supposed to mean that the parallel loss is greater than the HT gain.
Whatever assumptions you are making, they cannot hold. I have explained this twice, but the explanation might be too technical.
It's good to see you finally understood that the conclusions are valid and are now arguing about the assumptions.
You have a peculiar way of arguing.

I'm not interested in figuring out exactly where your reasoning fails. If you show me the design for a perpetuum mobile, it can take a lot of time to explain where the design fails. It is much easier to point to the laws of thermodynamics: it can't be done.

HT is a hardware measure. For a chess engine, HT equals "double the number of threads, some increase in nps". If you take away the hardware, you can't have the increase in nps. The doubling of threads is just bad. Ergo nothing to gain from HT without hardware HT. Now I leave it to you to figure out where your reasoning fails.
I still fail to see how anybody with any parallel programming experience would take an assumption like (1) "HT gain does not offset parallel overhead" as offensive.
Are you sure you know what you mean by "HT gain" and "parallel overhead"? Try to define the terms.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

IQ wrote:Assumptions:
1) HT gain does not outweigh parallel overhead. Let's say HT gains 15% and the parallel overhead of going from, say, 6 to 12 threads is 30%. Usually this would lead to the NON-HT version performing somewhat better than the HT version. Not arguing specific numbers here, but this is, for example, what Bob says.
What do you mean by "HT gains 15%"?
What do you mean by "parallel overhead is 30%"?

Define your terms...

I suppose the following definitions make sense:
- HT leads to 15% higher nps.
- doubling the number of threads leads to 30% extra nodes searched to reach the same depth.

Doubling the number of threads, everything else staying the same, is ALWAYS bad. All programmers agree on this. Only the 15% higher nps could possibly offset it. Now the argument is already over, because removing the HT hardware removes the 15% higher nps.

What is not clear about this?

Your "conclusion a)" is more or less copied from what I wrote much earlier in this thread. If 1 and 2 are both true, then a) is the explanation.

Your conclusion b) is false. You could not search those nodes in any obvious way in a non-HT environment, because those nodes were searched using the 15% nps increase.

That 15% HT gain might offset 30% parallel overhead does NOT mean that 0% HT gain might offset 30% parallel overhead.
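To make the arithmetic concrete, here is a minimal sketch using the assumed 15%/30% figures from this discussion (illustrative numbers, not measurements):

```python
# Time to reach a fixed depth = nodes searched / nodes per second.
# Baseline: 6 threads on 6 physical cores, HT off.
base_nodes = 1.00   # relative node count to reach depth d
base_nps   = 1.00   # relative speed

# HT case: 12 threads on the same 6 cores.
ht_nodes = base_nodes * 1.30   # assumed 30% extra nodes from doubling the threads
ht_nps   = base_nps   * 1.15   # assumed 15% nps gain from HT

base_time = base_nodes / base_nps   # 1.00
ht_time   = ht_nodes   / ht_nps     # 1.30 / 1.15 ~ 1.13

print(f"HT time-to-depth relative to non-HT: {ht_time / base_time:.2f}")
# ~1.13, i.e. about 13% slower to the same depth under these assumed numbers.
# Remove the HT hardware and the 1.15 factor disappears, leaving only the
# 1.30 node overhead -- which is why doubling the threads alone is always a loss.
```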

Of course if you are a very good programmer you can take H3, completely redesign the extension/reduction scheme, and release a stronger H4, but nothing in the way HT works will point you in the right direction. HT is nothing else than "double the number of threads, some increase in nps". Somehow pretending that one could simulate the effects of doubling the number of threads without parallel overhead is not going to help.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
bob wrote:Let's pick a single topic? DTS is a good algorithm. Complex to program compared to what I (and most others) do today. My point about DTS however, was that the trees searched back then were easier to search than those today. They were MUCH more uniform, with no forward pruning of any kind, and no reductions except for null-move. More uniform trees are easier to search in parallel. So DTS provides a strong upper bound on what can be done. But when discussing Houdini and such, DTS doesn't belong in the conversation and is completely irrelevant.
We were discussing potential benefits of HT in general. Some tests suggest that Houdini 3 profits from it, at least under certain conditions. I have pointed out that a hypothetical Cray Blitz on x86 hardware would have benefitted as well (given that a 12% or so nps increase from HT is quite reasonable).
My reference, btw, was NOT to the DTS paper. It was to a paper on "parallel search of alpha/beta game trees" published in the Journal of Parallel Computing. That was a straightforward mathematical analysis of the effect of move ordering on the size of alpha/beta trees, following up on the analysis Knuth/Moore did for serial alpha/beta searching.
I have the paper here. It is about PVS and EPVS, two algorithms that I don't think anybody is using anymore. The conclusion is that "[f]uture algorithms must allow processors to work at different nodes within the tree in order to be able to effectively use large number of processors". Indeed. This paper does not seem relevant for YBW (nor for DTS, for that matter).

Re-read it more carefully. It is about alpha/beta in general, FIRST. And it explains the math of the tree search, showing what happens when the first move is not the best move. That was the key point in that paper.
I thought I was clear about the article I was referencing.
Yes, you were clear. But my point was and is that it is dangerous to rely on the "conventional knowledge" that is mostly based on those old papers from a time when algorithms were tested on 24 positions searched to 5 or so plies. In fact you are now arguing that the DTS paper has lost its relevance because today's search algorithms are incomparable.
Lots of things are no longer relevant when talking about specific details. But we DO have data from current programs about how well they order moves, and that quite clearly shows that the 30% number is FAR closer to reality than the 12% number you were referencing.

I suppose, to end the debate, I could simply run several thousand positions to fixed depth using 1 cpu and then 2 and 4 and compute the average overhead. I certainly have enough positions to do this. But I have already done it multiple times over the past 15 years. And each time we got trickier in the search, the overhead climbed. To the point where it is today.

The base idea is to use the traditional Knuth/Moore form, N = W^ceil(d/2) + W^floor(d/2) - 1 (which reduces to 2*W^(d/2) - 1 for even d). If you analyze that recursively, the first branch at the root searches many more nodes than the remainder. And this can be spelled out mathematically using Knuth's formula. The idea is that when you screw up the move ordering, so that the second move is better than the first, you almost double the size of the sub-tree for that branch...
If you screw up move ordering, you are screwed whether you search in parallel or not. When searching serially the size of the subtree for that branch doubles just as much as when searching in parallel.
Nope, not at all. If you search serially from move 1 to move 40, and move 5 causes a fail-high, you only searched moves 1-4 unnecessarily, due to bad ordering. With a parallel search, you could easily search all 40 moves before you get that fail high. I thought this was obvious. The "overhead" is not comparing to an "optimal tree". It is comparing the parallel search space to the sequential search space. Those two comparisons are not the same thing at all.
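For reference, a small sketch of the Knuth/Moore counting behind this argument; the uniform branching factor and the PV/CUT/ALL recursion are textbook simplifications, not a model of any particular engine:

```python
# Leaf counts of the minimal alpha-beta tree (Knuth/Moore), assuming a uniform
# branching factor W and perfect move ordering.  Counting arithmetic only.
from functools import lru_cache

W = 38   # assumed uniform branching factor (a typical middlegame value)

@lru_cache(maxsize=None)
def pv(d):       # PV node: first child is PV, the other W-1 children are CUT nodes
    return 1 if d == 0 else pv(d - 1) + (W - 1) * cut(d - 1)

@lru_cache(maxsize=None)
def cut(d):      # CUT node: one move refutes, its child is an ALL node
    return 1 if d == 0 else allnode(d - 1)

@lru_cache(maxsize=None)
def allnode(d):  # ALL node: every move must be searched, each child is a CUT node
    return 1 if d == 0 else W * cut(d - 1)

for d in (8, 9, 10):
    closed = W ** ((d + 1) // 2) + W ** (d // 2) - 1   # W^ceil(d/2) + W^floor(d/2) - 1
    assert pv(d) == closed
    # A CUT subtree costs ~W^floor(d/2) leaves; if bad ordering forces it to be
    # searched (or re-searched) with an open window, it grows toward ALL/PV size,
    # ~W^ceil(d/2) -- the kind of blow-up a mis-ordered branch causes.
    print(d, pv(d), cut(d), allnode(d))
```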



BTW that paper had absolutely nothing to do with DTS and had no test results at all, it was a mathematical analysis of the issue only.
I agree it has nothing to do with DTS. It also has nothing to do with YBW. But if it "proves" the 30% parallel overhead (which I don't think it does) independent of the search algorithm used, can you explain why DTS on Cray Blitz had much lower parallel overhead? (And if it doesn't mean anything for DTS, then why should it mean anything for YBW?)
It is easier to search a fairly uniform tree, as opposed to the deep, heavily pruned/reduced trees we search today. The more aggressive reductions/pruning decisions become, the more unstable the tree becomes, and that causes problems with parallel search when you split at a bad point and introduce overhead. DTS also gives you far more flexibility in choosing a split point than YBW does. With YBW you are always faced with the decision "split HERE, after one move has been searched, or do nothing." With DTS I could split ANYWHERE, giving me a better chance to pick a better place to split. For example, in a 25-ply search I am at depth=18 and have searched just one move, satisfying the YBW criterion. I can split here, which means there are 18 plies where I could have screwed up move ordering, or I can look back and notice that at ply=3 I have searched 4 moves with no fail-high, making that a safer place to split for several reasons: it is closer to the root, so each branch represents more work; closer to the root, so the likelihood of screwing up move ordering anywhere between the root and this node is reduced; and more moves have already been searched there, suggesting that a fail-high is less likely than after searching just one move, etc. A sketch of this kind of split-point scoring follows below.
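A minimal sketch of what such split-point scoring might look like; the node fields and the weights are hypothetical illustrations, not Cray Blitz's or any engine's actual criteria:

```python
from dataclasses import dataclass

@dataclass
class Node:
    ply: int              # distance from the root
    moves_searched: int   # moves already completed here without a fail-high
    moves_remaining: int  # moves still to be searched here

def split_score(node, max_ply):
    """Heuristic desirability of splitting at this node (higher is better).
    Hypothetical weights, for illustration only."""
    if node.moves_searched < 1:          # YBW condition: search one move first
        return float("-inf")
    score = 0.0
    score += (max_ply - node.ply) * 2.0  # closer to the root: bigger chunks of work,
                                         # fewer plies in which ordering could be wrong
    score += node.moves_searched * 1.0   # several moves searched without a fail-high:
                                         # a fail-high (wasted parallel work) is less likely
    score += min(node.moves_remaining, 10) * 0.5  # enough moves left to keep helpers busy
    return score

def pick_split_point(current_path, max_ply):
    # DTS-style: consider every node on the current search path,
    # not just the deepest one that happens to satisfy the YBW condition.
    return max(current_path, key=lambda n: split_score(n, max_ply))
```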


I do not agree it had no test results. In fact, it has test results for 24 Bratko-Kopec positions on 5-ply searches, both for PVS and for EPVS.

Let me summarise the paper:
Section 1:
minimax searches a lot of nodes, alpha/beta much less. Knuth formula, node types.
Section 2:
PVS searches no extra nodes if the tree is perfectly ordered. However, to make this work in practice the search must be able to split at any depth without penalty and (what I believe is only mentioned in section 3) the tree must be perfectly balanced (all subtrees of all moves after the 1st have the same size) because otherwise processors become idle.
Section 3:
It is mentioned that things are different with imperfect move ordering, and that imperfect move ordering cannot be avoided. This is a big problem for PVS, because it makes the tree very imbalanced and therefore leaves lots of processors idle. Enter enhanced PVS, or EPVS: if one processor becomes idle, all other searches are aborted and a new split point is made two plies higher in the tree. The TT is relied on not to lose the work that has already been done.
Section 4:
PVS and EPVS were implemented in Cray Blitz, tests were run on a Sequent Balance 21000, results are shown. (It seems 1 processor searched about 63 nodes/second.)
Section 5:
Conclusion. We need a better algorithm.

I don't see anything in this paper on which one could base a conclusion that HT can't possibly be of benefit with YBW-based parallel searches.
Read between the lines. HT provides a minimal speed boost. Additional threads without perfect move ordering produce search overhead. The overhead outweighs the NPS increase.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
IQ wrote:Assumptions:
1) HT gain does not outweigh parallel overhead. Let's say HT gains 15% and the parallel overhead of going from, say, 6 to 12 threads is 30%. Usually this would lead to the NON-HT version performing somewhat better than the HT version. Not arguing specific numbers here, but this is, for example, what Bob says.
What do you mean by "HT gains 15%"?
What do you mean by "parallel overhead is 30%"?

Define your terms...

I suppose the following definitions make sense:
- HT leads to 15% higher nps.
- doubling the number of threads leads to 30% extra nodes searched to reach the same depth.

Doubling the number of threads, everything else staying the same, is ALWAYS bad. All programmers agree on this. Only the 15% higher nps could possibly offset it. Now the argument is already over, because removing the HT hardware removes the 15% higher nps.

What is not clear about this?

Your "conclusion a)" is more or less copied from what I wrote much earlier in this thread. If 1 and 2 are both true, then a) is the explanation.

Your conclusion b) is false. You could not search those nodes in any obvious way in a non-HT environment, because those nodes were searched using the 15% nps increase.

That 15% HT gain might offset 30% parallel overhead does NOT mean that 0% HT gain might offset 30% parallel overhead.

Of course if you are a very good programmer you can take H3, completely redesign the extension/reduction scheme, and release a stronger H4, but nothing in the way HT works will point you in the right direction. HT is nothing else than "double the number of threads, some increase in nps". Somehow pretending that one could simulate the effects of doubling the number of threads without parallel overhead is not going to help.
You are wrong as this is a basic discussion that comes up in parallel search. IF, as you want to claim, the extra nodes provide extra knowledge that improves the results from a search to the same depth, one CAN implement that in a single thread. Just set up your data structures so that you pick one move at ply N, and make it giving a new sub-tree. Then pick another move at the SAME ply and make it giving a second sub-tree. Now multiplex between these two trees, one node at a time. You are now searching the IDENTICAL tree the parallel search would grow, and you would then see the SAME benefit, without the extra thread.
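A toy sketch of the interleaving described above, only to show that a single thread can visit two sibling subtrees in the same alternating order a two-thread split would produce; the tree, the children() callback and the visit bookkeeping are stand-ins, not an actual chess search:

```python
def expand(root, children):
    """Depth-first node generator over one subtree (placeholder for a search)."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node                      # "visit" one node, then hand control back
        stack.extend(reversed(children(node)))

def multiplexed_search(subtree_a, subtree_b, children):
    """Interleave two subtree traversals in a single thread, one node per turn."""
    visited = []
    gens = [expand(subtree_a, children), expand(subtree_b, children)]
    while gens:
        for g in list(gens):
            try:
                visited.append(next(g))
            except StopIteration:
                gens.remove(g)
    return visited

# Example: two tiny subtrees, two plies deep, two children per node.
children = lambda node: [node + (i,) for i in (0, 1)] if len(node) < 3 else []
order = multiplexed_search(("a",), ("b",), children)
print(order)   # nodes of the two subtrees, alternating -- no second thread needed
```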

This assumption that the search overhead is good is simply wrong. It is overhead. There is absolutely zero evidence to show that the overhead helps in any way at all. One could easily measure this given test positions searched to the SAME depth, once with 1 cpu and once with more than one. If your speculation is true, the second result should be better. I've run a zillion such tests and have only seen a VERY few cases where it happens, and those are quite often not repeatable. Something that happens once every 100 or 1000 moves is not very helpful.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by Laskos »

bob wrote:
Read between the lines. HT provides a minimal speed boost. Additional threads without perfect move ordering produce search overhead. The overhead outweighs the NPS increase.
HT in my case provides a ~30% speed boost for Houdini 3, not a "minimal" one as you stated. Several others reported similar numbers. Besides that, time to depth is lower with HT on. Not to mention head-to-head matches, which show a 10-20 point improvement from HT.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

bob wrote:
The base idea is to use the traditional Knuth/Moore form, N = W^ceil(d/2) + W^floor(d/2) - 1 (which reduces to 2*W^(d/2) - 1 for even d). If you analyze that recursively, the first branch at the root searches many more nodes than the remainder. And this can be spelled out mathematically using Knuth's formula. The idea is that when you screw up the move ordering, so that the second move is better than the first, you almost double the size of the sub-tree for that branch...
If you screw up move ordering, you are screwed whether you search in parallel or not. When searching serially the size of the subtree for that branch doubles just as much as when searching in parallel.
Nope, not at all.
Ehm, very clearly it does... I am sorry, but I respond to what you wrote above and not to something you did not write.
If you search serially from move 1 to move 40, and move 5 causes a fail-high, you only searched moves 1-4 unnecessarily, due to bad ordering. With a parallel search, you could easily search all 40 moves before you get that fail high. I thought this was obvious. The "overhead" is not comparing to an "optimal tree". It is comparing the parallel search space to the sequential search space. Those two comparisons are not the same thing at all.
Yes, and I am not denying that parallel overhead exists. But your argument from your previous post was very clearly comparing to the optimal tree. And the paper you were referencing is not analyzing this parallel overhead.
DTS also gives you far more flexibility on choosing a split point than YBW does. With YBW you are always faced with the decision "split HERE, after one move has been searched, or do nothing." With DTS I could split ANYWHERE giving me a better chance to pick a better place to split.
This inflexibility of many YBW implementations is not part of YBW. In my implementation of YBW, idle threads actively look for a potential split point as close to the root as possible.
I do not agree it had no test results. In fact, it has test results for 24 Bratko-Kopec positions on 5-ply searches, both for PVS and for EPVS.

Let me summarise the paper:
Section 1:
minimax searches a lot of nodes, alpha/beta much less. Knuth formula, node types.
Section 2:
PVS searches no extra nodes if the tree is perfectly ordered. However, to make this work in practice the search must be able to split at any depth without penalty and (what I believe is only mentioned in section 3) the tree must be perfectly balanced (all subtrees of all moves after the 1st have the same size) because otherwise processors become idle.
Section 3:
It is mentioned that things are different with imperfect move ordering, and that imperfect move ordering cannot be avoided. This is a big problem for PVS, because it makes the tree very imbalanced and therefore leaves lots of processors idle. Enter enhanced PVS, or EPVS: if one processor becomes idle, all other searches are aborted and a new split point is made two plies higher in the tree. The TT is relied on not to lose the work that has already been done.
Section 4:
PVS and EPVS were implemented in Cray Blitz, tests were run on a Sequent Balance 21000, results are shown. (It seems 1 processor searched about 63 nodes/second.)
Section 5:
Conclusion. We need a better algorithm.

I don't see anything in this paper on which one could base a conclusion that HT can't possibly be of benefit with YBW-based parallel searches.
Read between the lines. HT provides a minimal speed boost. Additional threads without perfect move ordering produce search overhead. The overhead outweighs the NPS increase.
Between the lines? Right...
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

bob wrote:
syzygy wrote:
IQ wrote:Assumptions:
1) HT gain does not outweigh parallel overhead. Let's say HT gains 15% and the parallel overhead of going from, say, 6 to 12 threads is 30%. Usually this would lead to the NON-HT version performing somewhat better than the HT version. Not arguing specific numbers here, but this is, for example, what Bob says.
What do you mean by "HT gains 15%"?
What do you mean by "parallel overhead is 30%"?

Define your terms...

I suppose the following definitions make sense:
- HT leads to 15% higher nps.
- doubling the number of threads leads to 30% extra nodes searched to reach the same depth.

Doubling the number of threads, everything else staying the same, is ALWAYS bad. All programmers agree on this. Only the 15% higher nps could possibly offset it. Now the argument is already over, because removing the HT hardware removes the 15% higher nps.

What is not clear about this?

Your "conclusion a)" is more or less copied from what I wrote much earlier in this thread. If 1 and 2 are both true, then a) is the explanation.

Your conclusion b) is false. You could not search those nodes in any obvious way in a non-HT environment, because those nodes were searched using the 15% nps increase.

That 15% HT gain might offset 30% parallel overhead does NOT mean that 0% HT gain might offset 30% parallel overhead.

Of course if you are a very good programmer you can take H3, completely redesign the extension/reduction scheme, and release a stronger H4, but nothing in the way HT works will point you in the right direction. HT is nothing else than "double the number of threads, some increase in nps". Somehow pretending that one could simulate the effects of doubling the number of threads without parallel overhead is not going to help.
You are wrong as this is a basic discussion that comes up in parallel search.
What exactly is wrong. Please point to a specific statement.

I'm pretty sure we agree on what I wrote above...

In case you mean this:
syzygy wrote:Your conclusion b) is false. You could not search those nodes in any obvious way in a non-HT environment, because those nodes were searched using the 15% nps increase.
Clearly I meant you can't search those extra nodes in a non-HT environment without spending more time...
User avatar
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by hgm »

IQ wrote:It can if it is not the "extra" computational power which leads to better performance, but the shape of the tree searched. Think about it like this.

1) Assume that the 12%-20% gain through hyperthreading is not enough to offset the additional parallel search overhead
2) If the HT version with 2x the number of threads then still outperforms the non-hyperthreaded version, it is most likely due to the different shape of the (somewhat larger) tree searched
3) Nowadays, with all sorts of reductions/extensions/pruning going on and the additional splitting in the HT-ON version, it could very well be that some reductions which would normally kick in no longer do so in the additionally searched tree space. This would also explain the faster solution of tactical tests, as "key" moves are usually NOT at the top of the previous move ordering.
4) But if the hypothetical gain is due to the different shape of the tree, through mechanisms such as (3), then this behaviour could be replicated in software by adjusting the reductions/extensions/pruning, or even the alpha/beta pruning, or by introducing a probabilistic element.

This can all easily be put to rest if some of the kind testers do the same test with hyperthreading turned off in the BIOS (and all other variables like turbo-boost controlled), playing against an identical machine using hyperthreading.
Earlier you were talking about an 8-HT machine with an 8-thread Houdini beating a 4-full-core machine with a 4-threaded Houdini. I don't see how you get from there to this new assumption (1), the nps increase not offsetting the 'parallel-search overhead'. I would say that if the HT machine wins, it is obvious that the speedup more than offsets the search overhead.

The distinction you make between shape of the tree and search overhead seems completely arbitrary. The different tree shape is the search overhead. If the tree shapes were the same, and the nps were the same, there wouldn't be any search overhead. More threads means a more bushy tree compared to the alpha-beta minimum, and that is exactly what we call 'overhead'.

I think we (and everyone else) agree that using 8 threads will result in a less efficient tree than using 4 threads. The only questions are: "how much less efficient?" and "how many extra nps from HT?"
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
IQ wrote:Assumptions:
1) HT gain does not outweigh parallel overhead. Let's say HT gains 15% and the parallel overhead of going from, say, 6 to 12 threads is 30%. Usually this would lead to the NON-HT version performing somewhat better than the HT version. Not arguing specific numbers here, but this is, for example, what Bob says.
What do you mean by "HT gains 15%"?
What do you mean by "parallel overhead is 30%"?

Define your terms...

I suppose the following definitions make sense:
- HT leads to 15% higher nps.
- doubling the number of threads leads to 30% extra nodes searched to reach the same depth.

Doubling the number of threads, everything else staying the same, is ALWAYS bad. All programmers agree on this. Only the 15% higher nps could possibly offset it. Now the argument is already over, because removing the HT hardware removes the 15% higher nps.

What is not clear about this?

Your "conclusion a)" is more or less copied from what I wrote much earlier in this thread. If 1 and 2 are both true, then a) is the explanation.

Your conclusion b) is false. You could not search those nodes in any obvious way in a non-HT environment, because those nodes were searched using the 15% nps increase.

That 15% HT gain might offset 30% parallel overhead does NOT mean that 0% HT gain might offset 30% parallel overhead.

Of course if you are a very good programmer you can take H3, completely redesign the extension/reduction scheme, and release a stronger H4, but nothing in the way HT works will point you in the right direction. HT is nothing else than "double the number of threads, some increase in nps". Somehow pretending that one could simulate the effects of doubling the number of threads without parallel overhead is not going to help.
You are wrong as this is a basic discussion that comes up in parallel search.
What exactly is wrong. Please point to a specific statement.

I'm pretty sure we agree on what I wrote above...

In case you mean this:
syzygy wrote:Your conclusion b) is false. You could not search those nodes in any obvious way in a non-HT environment, because those nodes were searched using the 15% nps increase.
Clearly I meant you can't search those extra nodes in a non-HT environment without spending more time...
This:
Your conclusion b) is false. You could not search those nodes in any obvious way in a non-HT environment, because those nodes were searched using the 15% nps increase.
If you believe those extra nodes make a qualitative improvement in play, then they have to offset the net loss between the HT parallel search and the sequential search. If I avoid that 30% overhead, I gain whatever Elo running 30% faster provides. That's non-trivial. I'd make a quick guess of at least 20 Elo, but maybe 25 or a little more.

With HT on, you lose overall in terms of time, as far as I have measured. Which is a net loss of speed that is the difference between the 30% overhead and whatever you can recover from HT. Let's say you get 25% of that back, if the search has a few issues that HT helps with. You are still down 5%. Do you REALLY think that by reducing the pruning or reductions one can gain that 5% back? If so, why not just modify the reduction/pruning code to reduce or prune less in those same places where you would normally do a parallel split?
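As a back-of-the-envelope check of those Elo figures, a speed ratio can be converted to Elo using the often-quoted rough figure of 50-70 Elo per doubling of effective speed (an assumed rule of thumb, not a measurement):

```python
import math

def elo_from_speed_ratio(ratio, elo_per_doubling=70):
    # Assumed rule of thumb: each doubling of effective speed is worth roughly
    # 50-70 Elo; 70 is used here, so treat the output as an optimistic guess.
    return elo_per_doubling * math.log2(ratio)

print(round(elo_from_speed_ratio(1.30)))         # ~27 Elo for a 30% speedup
print(round(elo_from_speed_ratio(1.30 / 1.25)))  # ~4 Elo for the residual ~5% deficit
```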

I simply do not agree that there is a measurable improvement from searching a tree that is made larger in a somewhat random and unpredictable way. I've made that mistake way too many times: finding cases where something works well and assuming it is a win, when in reality it is worse in all those unrelated cases. I've already reported a couple of those ideas previously. My "easy move" code is a zero-Elo improvement. I have used it since the 70's, as has most everyone else. Yet careful testing/measurement has proven that it doesn't do a thing, good or bad.

I don't like this "guesswork" approach about "the extra width or nodes might help..." One could always test this easily enough. Take a program that does not use spin locks, and play it against a gauntlet using N cores and N threads, and then again using N cores and 2N threads. That will precisely define the effect of the search overhead (note that for 2N threads I mean NO HT, just doubled threads, two per physical/logical CPU). Now we know how much the search overhead hurts. Then one can carefully measure how much HT affects search speed (gaining some back). The net will show whether it is a gain or a loss.

To measure this "the extra search nodes can help" claim, one can do a very simple test.

Play a series of matches to fixed depth using 1 cpu, then 2 cpus, then 4. Since the depth is fixed, the time to complete is irrelevant, and the only change will be the search overhead. If it does help, the 4 cpu fixed depth program should perform better than the 1 cpu fixed depth program.

I can run that test, although I am already almost 100% certain of the outcome, thanks to past testing...
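A minimal sketch of such a fixed-depth overhead measurement, assuming the python-chess library and a UCI engine; the engine path, position file, search depth and thread counts are placeholders to adapt:

```python
# pip install python-chess
import chess
import chess.engine

ENGINE = "./stockfish"        # hypothetical path to any UCI engine
FENS   = "positions.fen"      # hypothetical file: one FEN per line
DEPTH  = 16                   # fixed depth, so only the node count matters

def avg_nodes(threads):
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE)
    engine.configure({"Threads": threads})   # "Threads" is the usual UCI option name
    total, count = 0, 0
    with open(FENS) as f:
        for line in f:
            board = chess.Board(line.strip())
            info = engine.analyse(board, chess.engine.Limit(depth=DEPTH))
            total += info.get("nodes", 0)
            count += 1
    engine.quit()
    # Parallel search is nondeterministic, so averaging over many positions
    # (and ideally several runs) is what makes the number meaningful.
    return total / count

base = avg_nodes(1)
for t in (2, 4):
    overhead = avg_nodes(t) / base - 1
    print(f"{t} threads: ~{overhead:.0%} extra nodes to reach depth {DEPTH}")
```

This measures only the node overhead to a fixed depth; the Elo question would still need the fixed-depth matches described above.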