Hyperthreading and Computer Chess: Intel i5-3210M

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

syzygy
Posts: 5696
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

IQ wrote:
syzygy wrote:
IQ wrote:But also I must say it is still open whether this is a scheduling problem of Windows or a Houdini race condition bug. Calling it the latter was premature.
It has been indicated several times that not respecting affinities is not an engine problem but an OS problem.
Please reread the links given - the affinities are respected in all cases.
Huh? What you wrote there was this:
When you set affinities in Houdini, sometimes the threads are not created on different cores (using 6 threads on a 6-core, 12-thread machine). Instead two worker threads are stuck on the same physical core (with one core idling); this slows down Houdini tremendously (sometimes 40% and sometimes even more).
The "behaviour" might be due to an OS problem, but it has nothing to do with affinities not being obeyed.
I don't understand you. The point of setting affinities is to force the OS to schedule each thread on a different physical core. If affinities are set like that, but two threads end up on the same physical core, then affinities are not respected.
Again please reread
Really, I know what I read, and I have answered. Your argument fails, because the different shape of the tree should be understood as "having far more nodes, not all of them useless". Without more nodes you will definitely be worse off with more threads. In my previous post I have tried to be clear. If I have failed, then I cannot help it.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
bob wrote:#1. I have not seen ANYONE get that kind of speedup today. Back then the trees were pretty much fixed-depth, with no forward pruning, reductions, or any of the other things that make today's trees much more irregular and unstable. So until you find ANY current program that can produce those kinds of speedups, we have to use today's numbers. Crafty is nowhere NEAR the DTS numbers. And I have not seen any program yet that has beaten Crafty's numbers, since everyone is basically using the SAME parallel search approach that is used in Crafty. Ergo, 12% is meaningless.
Are you saying DTS would be less efficient on x86? I was referring to a hypothetical Cray Blitz that would run on a 4-core CPU with HT.

It's interesting that just a few posts ago you were referring me to a 1988 paper, while now the DTS results are suddenly meaningless for today's programs. Cherry picking is not exactly scientific.
#3. I generally quote Crafty results because I know what it does, and how it does it. I let others report their parallel search results (few do, however) since it is their code. Crafty's behavior in a parallel search is not going to be that different from what anyone else gets, for obvious reasons (everyone is using YBW as the basic idea, and many are using a near-identical approach to that used in Crafty, since the source, including the parallel search code, has been available for almost 20 years now).
So when someone reports results that don't suit you, you dismiss them with the very scientific hint: "ONE PROGRAM"?
Let's pick a single topic? DTS is a good algorithm. Complex to program compared to what I (and most others) do today. My point about DTS however, was that the trees searched back then were easier to search than those today. They were MUCH more uniform, with no forward pruning of any kind, and no reductions except for null-move. More uniform trees are easier to search in parallel. So DTS provides a strong upper bound on what can be done. But when discussing Houdini and such, DTS doesn't belong in the conversation and is completely irrelevant.

Therefore, it is difficult to imagine a case where hyper-threading would actually help. Not impossible, for sure, but difficult. That 30% overhead looms large. And I'd expect everyone who gets the fail-high-on-first-move percentage around 90% or so to be stuck with that; at least I have neither been able to beat it myself, nor have I seen anyone else do so. Going beyond 90% is effectively impossible, which means getting below 30% overhead is just as impossible.

My reference, btw was NOT to the DTS paper. It was to a paper on "parallel search of alpha/beta game trees" published in the Journal of Parallel Computing. That was a straightforward mathematical analysis of the effect of move ordering on the size of alpha/beta trees, following up on the analysis Knuth/Moore did for serial alpha/beta searching.

I thought I was clear about the article I was referencing. If I wasn't, that was my fault. DTS was not published in the Journal of Parallel Computing; it appeared (first) in my dissertation and was (later) simplified/shortened for the ICGA Journal.

The basic idea is to use the traditional Knuth/Moore formula N = W^floor(d/2) + W^ceil(d/2) - 1, which for even d reduces to N = 2 * W^(d/2) - 1. If you analyze that recursively, the first branch at the root searches many more nodes than the remainder, and this can be spelled out mathematically using Knuth's formula. The idea is that when you screw up the move ordering, so that the second move is better than the first, you almost double the size of the sub-tree for that branch...

It gives a mathematical explanation for the search overhead we all are aware of and can do little to reduce.
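Purely as an illustration of that recursion (not something from the paper; W and d are made-up example values), here is a small C sketch of the Knuth/Moore minimal-tree count and of what a first-vs-second-move ordering failure at the root roughly costs:

Code: Select all

#include <stdio.h>
#include <math.h>

/* Leaf count of a perfectly ordered alpha/beta tree (Knuth/Moore):
   N = W^floor(d/2) + W^ceil(d/2) - 1                               */
static double minimal_tree(double W, int d)
{
    return pow(W, d / 2) + pow(W, (d + 1) / 2) - 1.0;
}

int main(void)
{
    double W = 35.0;  /* assumed branching factor */
    int    d = 12;    /* assumed search depth     */

    /* At the root the first move gets a full "PV" subtree, while each
       of the remaining W-1 moves only needs a refutation (cut) subtree. */
    double pv_subtree  = minimal_tree(W, d - 1);
    double cut_subtree = pow(W, (d - 1) / 2);

    printf("pure minimax leaves     : %.3g\n", pow(W, d));
    printf("minimal tree leaves     : %.3g\n", minimal_tree(W, d));
    printf("first (PV) root subtree : %.3g leaves\n", pv_subtree);
    printf("later (cut) root subtree: %.3g leaves each\n", cut_subtree);

    /* If ordering fails at the root and, say, the second move turns out
       best, its subtree must be searched with an open window as well,
       i.e. it grows from cut-subtree size towards PV-subtree size.     */
    printf("cost of a root ordering failure: about %.3g extra leaves\n",
           pv_subtree - cut_subtree);
    return 0;
}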

BTW, that paper had absolutely nothing to do with DTS and had no test results at all; it was a mathematical analysis of the issue only.
User avatar
Rebel
Posts: 7312
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by Rebel »

hgm wrote:Sorry, that is nonsense. If A and B would always play the same game against each other, and A happened to win it, it would not prove that A is stronger at all. It could very well be that starting from every position that is not in the game B would win. (In practice this could even occur, e.g. because the opening book of the far stronger engine B contains an error that allows a book win.)

Ricardo is right, with the caveat that one should not go to extremes: if the randomness would be so high that it starts to randomly decide the result (e.g. by randomly starving processes for CPU so they would always lose on time before they could complete 20 moves), that would qualify as "too much randomness". But in typical testing conditions we are very far from this limit.
It's one of those topics where I agree with Bob. Hard to picture that! But it is true :wink:
IQ
Posts: 162
Joined: Thu Dec 17, 2009 10:46 am

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by IQ »

syzygy wrote:I don't understand you. The point of setting affinities is to force the OS to schedule each thread on a different physical core. If affinities are set like that, but two threads end up on the same physical core, then affinities are not respected.
I was under the impression everybody knew how affinities worked, but I see that you do not, so let me explain. Setting affinities does NOT force the OS to schedule each thread on a different physical core. It is just a mask that restricts the logical cores that are available for scheduling. So if you set affinities to 0, 2, 4, 6 on a HT system, that does not prevent the OS from scheduling two or more threads on logical core 2; it ONLY prevents a thread from being scheduled on the excluded logical cores 1, 3, 5, 7. Usually, under normal conditions and if the threads to be scheduled are busy and not idle waiting, the OS should distribute the threads efficiently over the logical cores allowed by the affinity mask. This often leads to the threads ending up on different logical cores, but that is not guaranteed just through affinities. Imagine a chess program creating a worker thread that idles and waits for some signal from another thread, and imagine that this other thread is at that moment also idle waiting. No affinity setting will prevent the OS from scheduling the idle thread onto the same logical core. Normally, as soon as the signals arrive and the threads become busy, they get rescheduled, but somehow that does not happen here, due to some OS bug and/or race condition. From the links I provided you can see that this behaviour was confirmed by at least two other people here. It's a rare thing, so I wouldn't sweat too much over it, as Houdini also recovers after the next move or two. But it seems real, and it happens with affinities set.
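Purely as an illustration (the mask bits are made-up example values, not Houdini's code), this is all an affinity mask expresses on Windows - a restriction, not an assignment:

Code: Select all

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Bits 0, 2, 4, 6 set: threads of this process may only run on
       logical cores 0, 2, 4 and 6 (one logical core per physical core
       on a typical HT layout).  Nothing in this call says "one thread
       per core"; the scheduler is still free to place two threads on
       the same allowed logical core.                                  */
    DWORD_PTR mask = (1 << 0) | (1 << 2) | (1 << 4) | (1 << 6);

    if (!SetProcessAffinityMask(GetCurrentProcess(), mask))
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                (unsigned long)GetLastError());

    /* ... create and run the engine's search threads here ... */
    return 0;
}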

Your argument fails, because the different shape of the tree should be understood as "having far more nodes, not all of them useless". Without more nodes you will definitely be worse off with more threads. In my previous post I have tried to be clear. If I have failed, then I cannot help it.
The argument holds. Let me rephrase in your language and ask you a question: if there are far more nodes and not all of them useless, AND the HT speed gain does NOT offset the additional parallel overhead, THEN what prevents you from changing the program to search those additional "not useless" nodes in software, without relying on the splitting regime due to the 12 HT threads? Without HT you should have less parallel overhead, which would allow you to search these nodes at the same cost - under the assumptions given!
syzygy
Posts: 5696
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

IQ wrote:
syzygy wrote:I don't understand you. The point of setting affinities is to force the OS to schedule each thread on a different physical core. If affinities are set like that, but two threads end up on the same physical core, then affinities are not respected.
I was under the impression everybody knew how affinities worked, but I see that you do not, so let me explain.
Sure, I don't know how computers work.
Setting affinities does NOT force the OS to schedule each thread on a different physical core.
It is completely possible, even on Windows, to force one particular thread to one logical core and thereby to one particular physical core.

If instead the whole process is restricted to a subset of the logical cores, then yes, affinities are strictly speaking respected if two threads are run on the same logical core so that one of the physical cores remains idle. However, you should then see the same problem occurring on a non-HT system with no affinities set: some threads being scheduled on the same logical=physical core, so that one of the physical cores remains idle. Unless the problem is specific to affinities and Windows cannot properly handle them...
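For what it's worth, a minimal sketch (not Houdini's actual code; the even-core numbering is just an assumed 4-core/8-thread layout) of forcing each thread onto one particular logical core on Windows:

Code: Select all

#include <windows.h>
#include <process.h>
#include <stdint.h>
#include <stdio.h>

/* Pin the calling thread to exactly one logical core.  With a single
   bit set in the mask, the scheduler has no choice left, so threads
   pinned to different bits can never end up sharing a logical core. */
static unsigned __stdcall worker(void *arg)
{
    int logical_core = (int)(intptr_t)arg;            /* e.g. 0, 2, 4, 6 */
    DWORD_PTR mask = (DWORD_PTR)1 << logical_core;

    if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0)
        fprintf(stderr, "pinning to logical core %d failed\n", logical_core);

    /* ... run this thread's part of the search here ... */
    return 0;
}

int main(void)
{
    /* Assumed layout: even logical cores 0, 2, 4, 6 map to the four
       physical cores of a 4-core/8-thread CPU.                      */
    HANDLE h[4];
    for (int i = 0; i < 4; i++)
        h[i] = (HANDLE)_beginthreadex(NULL, 0, worker,
                                      (void *)(intptr_t)(2 * i), 0, NULL);

    WaitForMultipleObjects(4, h, TRUE, INFINITE);
    for (int i = 0; i < 4; i++)
        CloseHandle(h[i]);
    return 0;
}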
Your argument fails, because the different shape of the tree should be understood as "having far more nodes, not all of them useless". Without more nodes you will definitely be worse off with more threads. In my previous post I have tried to be clear. If I have failed, then I cannot help it.
The argument holds. Let me rephrase in your language and ask you a question: if there are far more nodes and not all of them useless, AND the HT speed gain does NOT offset the additional parallel overhead, THEN what prevents you from changing the program to search those additional "not useless" nodes in software, without relying on the splitting regime due to the 12 HT threads?
Because it takes TIME to search those "not useless" nodes. It *is* the HT speed gain that searches those "not useless" nodes without taking more time.
Without HT you should have less parallel overhead, which would allow you to search these nodes at the same cost - under the assumptions given!
You're mixing things up, probably because you don't have a fixed idea of what "parallel overhead" is.

Let's define parallel overhead as the number of extra nodes to search to the same depth.

Now turn HT on and double the number of threads. Let's assume for the sake of argument that parallel overhead equals the nps speed gain. That means we reach the same depth in the same time. At first sight this looks like nothing lost, nothing gained.

However, more nodes were searched and some of them might have increased the quality of the search. So *maybe* we have some gain from HT after all. But this increase in quality, if it exists at all, came from the additional amount of nodes being searched. It came from the nps speed gain.

Now turn off HT. Then you turn off the nps speed gain. No extra nodes.

If you somehow wanted to simulate searching those extra nodes (without losing depth), you would need to find the extra computational power to search them.
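As a toy illustration of this bookkeeping (all figures are made up): if HT raises nps by the same factor by which it inflates the node count to a given depth, time-to-depth is unchanged, and any benefit has to come from the extra nodes themselves.

Code: Select all

#include <stdio.h>

int main(void)
{
    /* Made-up example figures, not measurements. */
    double base_nps  = 4.0e6;   /* nps with HT off                     */
    double nodes     = 2.0e9;   /* nodes needed to reach some depth    */

    double nps_gain  = 1.12;    /* HT on: +12% nps                     */
    double overhead  = 1.12;    /* HT on: +12% nodes to the same depth */

    double t_off = nodes / base_nps;
    double t_on  = nodes * overhead / (base_nps * nps_gain);

    printf("time to depth, HT off: %.0f s\n", t_off);
    printf("time to depth, HT on : %.0f s\n", t_on);

    /* Equal times: whatever HT buys here must come from the value (if
       any) of the 12% of extra nodes, which exist only because of the
       nps increase.  Turn HT off and both the extra nps and the extra
       nodes are gone.                                                 */
    return 0;
}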
IQ
Posts: 162
Joined: Thu Dec 17, 2009 10:46 am

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by IQ »

syzygy wrote:If instead the whole process is restricted to a subset of the logical cores, then yes, affinities are strictly speaking respected if two threads are run on the same logical core so that one of the physical cores remains idle. However, you should then see the same problem occurring on a non-HT system with no affinities set: some threads being scheduled on the same logical=physical core, so that one of the physical cores remains idle. Unless the problem is specific to affinities and Windows cannot properly handle them...
Good to have you on the same page now. It is not so much that Windows does not handle affinities correctly, but rather that rescheduling of previously idle threads sometimes fails (perhaps more so if affinities are set). So it might be an OS problem, but as other chess programs do not have this problem, I might also suspect Houdini. The problem also exists on non-HT systems, but much more rarely. As Houdini 2 and 3 are both affected, we can only hope that someday Robert reproduces this. At the moment he just recommends turning HT off in the BIOS.
Now turn HT on and double the number of threads. Let's assume for the sake of argument that parallel overhead equals the nps speed gain. That means we reach the same depth in the same time. At first sight this looks like nothing lost, nothing gained. However, more nodes were searched and some of them might have increased the quality of the search. So *maybe* we have some gain from HT after all. But this increase in quality, if it exists at all, came from the additional amount of nodes being searched. It came from the nps speed gain. Now turn off HT. Then you turn off the nps speed gain. No extra nodes. If you somehow wanted to simulate searching those extra nodes (without losing depth), you would need to find the extra computational power to search them.
Good to now have you on the same page here too, except that you misstate my assumption about speed gain/loss. I specifically said ".. AND the HT speed gain does NOT offset the additional parallel overhead", which was supposed to mean that the parallel loss is greater than the HT gain. Maybe that was unfortunately phrased. Then I suppose you agree now that if the difference between parallel loss and HT gain is large enough, we have the computational power to search the "extra useful" nodes right there.

But I want to go even further, though I admit it needs more assumptions. Even if HT gain and parallel overhead are a wash, your argument would imply that all of the newly searched "extra" nodes are useful, which is clearly an overoptimistic position. In fact, I would argue that from a theoretical alpha/beta standpoint all extra nodes are not useful. This implies that any usefulness can only come from a part of the tree that was otherwise subject to some sort of reduction, or that only appeared due to an additional extension. That's why I believe changing the pruning/extension regime in the program would yield better results than relying on search inefficiencies magically producing only "useful" extra nodes. Playing devil's advocate, even if that were somewhat true, one could try a naive probabilistic approach of randomly searching additional tree space. In my view it might be possible to search only a fraction of your "extra" nodes with such a probabilistic regime, swaying the balance of HT gain vs. parallel loss strongly in favour of parallel loss. BUT this all rests on the assumption that there really is a gain through HT which is not due to the aforementioned bugs or other testing circumstances. And that I highly doubt.
syzygy
Posts: 5696
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

IQ wrote:Good to now have you on the same page here too, except that you misstate my assumption about speed gain/loss. I specifically said ".. AND the HT speed gain does NOT offset the additional parallel overhead", which was supposed to mean that the parallel loss is greater than the HT gain.
Whatever assumptions you are making, they cannot hold. I have explained this twice, but the explanation might be too technical.
syzygy
Posts: 5696
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

syzygy wrote:
IQ wrote:Good to now have you on the same page here too, except that you misstate my assumption about speed gain/loss. I specifically said ".. AND the HT speed gain does NOT offset the additional parallel overhead", which was supposed to mean that the parallel loss is greater than the HT gain.
Whatever assumptions you are making, they cannot hold. I have explained this twice, but the explanation might be too technical.
I probably should not try to convince you, but see it like this: if you double the number of threads, but keep nps the same, the total quality of the search will degrade. Everybody who understands how parallel tree search works agrees on this.

To increase nps, you need extra hardware.

Of course there are other ways to improve an engine than through hardware. But there is no "easy" way to improve an engine through algorithmic changes somewhere hidden in how HT works, because any beneficial effect of HT, if it exists at all, is based fully on the nps increase. HT is nothing other than "double the number of threads + slightly higher nps". The former is bad, the latter is good. Remove the latter and it's all bad. Remove HT hardware and you can't have the latter.
syzygy
Posts: 5696
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

bob wrote:Let's pick a single topic? DTS is a good algorithm. Complex to program compared to what I (and most others) do today. My point about DTS however, was that the trees searched back then were easier to search than those today. They were MUCH more uniform, with no forward pruning of any kind, and no reductions except for null-move. More uniform trees are easier to search in parallel. So DTS provides a strong upper bound on what can be done. But when discussing Houdini and such, DTS doesn't belong in the conversation and is completely irrelevant.
We were discussing potential benefits of HT in general. Some tests suggest that Houdini 3 profits from it, at least under certain conditions. I have pointed out that a hypothetical Cray Blitz on x86 hardware would have benefited as well (given that a 12% nps increase from HT is quite reasonable).
My reference, btw was NOT to the DTS paper. It was to a paper on "parallel search of alpha/beta game trees" published in the Journal of Parallel Computing. That was a straightforward mathematical analysis of the effect of move ordering on the size of alpha/beta trees, following up on the analysis Knuth/Moore did for serial alpha/beta searching.
I have the paper here. It is about PVS and EPVS, two algorithms that I don't think anybody is using anymore. The conclusion is that "[f]uture algorithms must allow processors to work at different nodes within the tree in order to be able to effectively use large number of processors". Indeed. This paper does not seem relevant for YBW (nor for DTS, for that matter).
I thought I was clear about the article I was referencing.
Yes, you were clear. But my point was and is that it is dangerous to rely on the "conventional knowledge" that is mostly based on those old papers from a time when algorithms were tested on 24 positions searched to a ply or 5. In fact you are now arguing that the DTS paper has lost its relevance because today's search algorithms are incomparable.
The basic idea is to use the traditional Knuth/Moore formula N = W^floor(d/2) + W^ceil(d/2) - 1, which for even d reduces to N = 2 * W^(d/2) - 1. If you analyze that recursively, the first branch at the root searches many more nodes than the remainder, and this can be spelled out mathematically using Knuth's formula. The idea is that when you screw up the move ordering, so that the second move is better than the first, you almost double the size of the sub-tree for that branch...
If you screw up move ordering, you are screwed whether you search in parallel or not. When searching serially the size of the subtree for that branch doubles just as much as when searching in parallel.
BTW that paper had absolutely nothing to do with DTS and had no test results at all, it was a mathematical analysis of the issue only.
I agree it has nothing to do with DTS. It also has nothing to do with YBW. But if it "proves" the 30% parallel overhead (which I don't think it does) independent of the search algorithm used, can you explain why DTS on Cray Blitz had much lower parallel overhead? (And if it doesn't mean anything for DTS, then why should it mean anything for YBW?)

I do not agree it had no test results. In fact, it has test results for 24 Bratko-Kopec positions on 5-ply searches, both for PVS and for EPVS.

Let me summarise the paper:
Section 1:
minimax searches a lot of nodes, alpha/beta much less. Knuth formula, node types.
Section 2:
PVS searches no extra nodes if the tree is perfectly ordered. However, to make this work in practice the search must be able to split at any depth without penalty and (what I believe is only mentioned in section 3) the tree must be perfectly balanced (all subtrees of all moves after the 1st have the same size) because otherwise processors become idle.
Section 3:
It is mentioned that things are different with imperfect move ordering, and that imperfect move ordering cannot be avoided. This is a big problem for PVS, because it makes the tree very imbalanced and therefore leaves lots of processors idle. Enter enhanced PVS, or EPVS: if one processor becomes idle, all other searches are aborted and a new split point is made two plies higher in the tree. The TT is relied on to not lose the work that has already been done.
Section 4:
PVS and EPVS were implemented in Cray Blitz, tests were run on a Sequent Balance 21000, results are shown. (It seems 1 processor searched about 63 nodes/second.)
Section 5:
Conclusion. We need a better algorithm.

I don't see anything in this paper on which one could base a conclusion that HT can't possibly be of benefit with YBW-based parallel searches.
IQ
Posts: 162
Joined: Thu Dec 17, 2009 10:46 am

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by IQ »

syzygy wrote:
IQ wrote:Good to now have you on the same page here too, except that you misstate my assumption about speed gain/loss. I specifically said ".. AND the HT speed gain does NOT offset the additional parallel overhead", which was supposed to mean that the parallel loss is greater than the HT gain.
Whatever assumptions you are making, they cannot hold. I have explained this twice, but the explanation might be too technical.
It's good to see you finally understood that the conclusions are valid and now try to argue about the assumptions. I still fail to see how anybody with any parallel programming experience would take an assumption like (1) "HT gain does not offset parallel overhead" as offensive. At the least it is debatable. So your quip "they cannot hold" is clearly false. The only other assumption was based on observations by members of this forum that (2) HT brought a strength increase for Houdini. So again, this is at least debatable and far from "they cannot hold". Funnily, your own conclusion was that the extra nodes magically all matter - I only argued against that. Even more so as I believe that assumption 2 might be based on a bug in Houdini. Maybe some tester with two machines will clear that up, maybe not. Personally I find it sad that instead of arguments you resort to phrases like "too technical". Obviously you are speaking for yourself.