Hyperthreading and Computer Chess: Intel i5-3210M

bob · Post by **bob** » Wed May 01, 2013 7:26 pm

syzygy wrote:
bob wrote:
syzygy wrote:For example, after the first root move has been searched, it is probably beneficial to let all other root moves be searched in parallel by as many logical cores as possible. One could stop there: once you run out of root moves, the secondary threads are put to sleep and only the primary threads help out with the root moves that take longer. This would not make sense at all with N threads on N physical cores, but it makes sense with N primary and N secondary threads on N physical cores with HT.
We certainly don't agree there. I have always "split at the root" because that is the one place where you are guaranteed to have to search every move, no matter what...
I don't see any disagreement at all.
I assume an N-core processor with HT. The first root move has been searched.
With HT off, we would use N processors to search the remaining root moves. If some of those processors are finished and we run out of root moves, they will join the search of the other moves.

What I propose is this: as long as we have root moves to search, search them with 2N threads. Once we run out of root moves and threads start to finish searching their root move, reduce the number of searching threads to N (one per physical core).

The point is: there will be little parallel overhead when searching root moves in parallel (each root move being searched serially). So all the extra nps that HT delivers is a win.

Once threads join other threads to finish searching the last root moves, parallel overhead will probably increase. Therefore don't use HT there, because the nps increase might not offset the increased parallel overhead.

I don't follow your reasoning about the HT vs not-HT search at the end of your explanation. If it works (whatever search approach you use) for normal cores, it ought to be good for HT cores, assuming HT provides any real gain.
I am only assuming here that HT provides a real gain in terms of nps. I do not assume here that in general this increase offsets the parallel overhead. Therefore I propose to restrict application of HT to part of the search where the parallel overhead should be small: "the one place where you are guaranteed to have to search every move, no matter what..."

Ok, in that regard I agree with you. However, there is still a risk of overhead. Namely when you change your mind at the root. So you still suffer in about 1/6 of the cases (programs seem to change their mind in something over 15% of the time. So you still will end up eating some overhead...

syzygy · Post by **syzygy** » Wed May 01, 2013 9:06 pm

bob wrote:Ok, in that regard I agree with you. However, there is still a risk of overhead. Namely when you change your mind at the root. So you still suffer in about 1/6 of the cases (programs seem to change their mind in something over 15% of the time. So you still will end up eating some overhead...

Some overhead 15% of the time should easily be offset by a 15-20% speed increase. So if practically all parallel overhead indeed arises from imperfect move ordering (and not e.g. from missed transpositions), then it should work. The gain might not be much (since HT would be used only part of the time), but it should be real.

I might try this, but first I'll have to modify my search to split at the root.

Maybe this (untested) idea would work just as well at other PV nodes. After the first move of a PV node has been searched all threads but one are normally idle, at which point the remaining PV nodes may be distributed over the 2N hyperthreads. As one runs out of PV node moves, only N threads are allowed to search on (helping out with the remaining moves).

bob · Post by **bob** » Thu May 02, 2013 1:33 am

syzygy wrote:
bob wrote:Ok, in that regard I agree with you. However, there is still a risk of overhead. Namely when you change your mind at the root. So you still suffer in about 1/6 of the cases (programs seem to change their mind in something over 15% of the time. So you still will end up eating some overhead...
Some overhead 15% of the time should easily be offset by a 15-20% speed increase. So if practically all parallel overhead indeed arises from imperfect move ordering (and not e.g. from missed transpositions), then it should work. The gain might not be much (since HT would be used only part of the time), but it should be real.

I might try this, but first I'll have to modify my search to split at the root.

Maybe this (untested) idea would work just as well at other PV nodes. After the first move of a PV node has been searched all threads but one are normally idle, at which point the remaining PV nodes may be distributed over the 2N hyperthreads. As one runs out of PV node moves, only N threads are allowed to search on (helping out with the remaining moves).

I am having to take a break on gathering data here. I ran the DTS positions and am getting some (most likely) impossible/improbable results. 8 cpu speedup of 7.25? I don't think the shape of the tree has changed that dramatically, although I can certainly hope for a miracle.

In any case, I am now running 23.0 on the same test to see if it shows a similar speedup, which I doubt. I'll work my way through to the version where the speedup goes bananas and see if I can figure out what I broke, and when.

Thing that is catching my eye is that the total search overhead seems to be negative on this test, although it looked perfectly normal on the tests I ran the other day. So for the moment, I'm taking everything with a grain of salt until I see exactly what is up. I'd love to see a 7.25x speedup on 8 cores, but it is highly unrealistic. I've made quite a few SMP changes over the past couple of years, most likely one of 'em is broken.

Laskos · Post by **Laskos** » Sat Jun 08, 2013 12:02 pm

Robert Hyatt wrote

bob wrote:
Regardless of urban legend, I have NEVER seen one example where using hyper threading improves the performance of a chess engine. Not a single one.

What I am citing is NOT "urban legend". It is something that is well-known and well-understood by those that have actually spent time developing a parallel search and testing it for improvements.

BTW, searching tactical positions is not a valid way of testing parallel search. The key there is that the best move is often ordered later in the list by the very nature of the position (the best move is usually some sort of 'surprise'. This plays right into the hands of a parallel search that by its very nature tends to do better when move ordering is sub-optimal

Time to depth is the ONLY valid way to measure parallel search improvement.

You objected to my test with tactical positions, where I measured the number of solutions and the time to solutions, comparing 8 threads with 4 threads.
I managed to get with Houdini 3 time-to-depth for a set of 100 positions repeated 3 times with HT OFF in BIOS 4 threads on 4 CPU and HT ON 8 threads on 4 CPU. I selected 100 quiet positions from your set of 4,000 neutral positions, and fed this EPD file to Arena for solving to depth 20. Obviously no position is solved, as EPD file has no bm or am tags, the engine just searches to depth 20. The PC is i7 2600, 4 physical cores.

HT OFF in BIOS, 4 threads, 100 positions to depth 20
3 runs:
Rated time: 19:47 = 1187 Seconds
Rated time: 19:45 = 1185 Seconds
Rated time: 19:30 = 1170 Seconds

Average: 19:41 = 1181 Seconds

HT ON, 8 threads, 100 positions to depth 20
3 runs:
Rated time: 17:05 = 1025 Seconds
Rated time: 17:42 = 1062 Seconds
Rated time: 18:05 = 1085 Seconds

Average: 17:37 = 1057 Seconds

8 threads HT ON vs 4 threads HT OFF is faster by 12% regarding time-to-depth. If you insist that time to depth is the only valid way to measure parallel search improvement, then hyperthreading is beneficial even in this perspective, at least for Houdini 3.

Mike S. · Post by **Mike S.** » Sat Jun 08, 2013 1:54 pm

Not clear to me why the gain is only 12%, see examples above...

syzygy · Post by **syzygy** » Sat Jun 08, 2013 2:03 pm

Mike S. wrote:Not clear to me why the gain is only 12%, see examples above...

1181 / 1057 = 1.117 gives 11.7%

Hyperthreading and Computer Chess: Intel i5-3210M

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Re: Hyperthreading and Computer Chess: Intel i5-3210M