Hyperthreading and Computer Chess: Intel i5-3210M

bob · Post by **bob** » Mon Apr 29, 2013 9:31 pm

Rebel wrote:
hgm wrote:Sorry, that is nonsense. If A and B would always play the same game against each other, and A happened to win it, it would not prove that A is stronger at all. It could very well be that starting from every position that is not in the game B would win. (In practice this could even occur, e.g. because the opening book of the far stronger engine B contains an error that allows a book win.)

Ricardo is right, with the caveat that one should not go to extremes: if the randomness would be so high that it starts to randomly decide the result (e.g. by randomly starving processes for CPU so they would always lose on time before they could complete 20 moves), that would qualify as "too much randomness". But in typical testing conditions we are very far from this limit.
It's one of those topics where I agree with Bob. Hard to picture that, but it is true.

This is one of those things that is based on a false assumption, namely that random changes in the time one program can use affect both programs equally. Evidence shows this is not true, so the randomness just introduces more jitter into the final results.

syzygy · Post by **syzygy** » Mon Apr 29, 2013 10:40 pm

bob wrote:To answer the HT question will be more complicated. A HT run is going to be somewhat faster. No doubt about it. How much faster really does depend on the quality of the internal engine programming and how much it has been optimized. The better optimized the code (better cache locality for memory accesses, fewer dependencies and pipeline stalls), the less HT is going to help. When I first turned on HT on my PIV, it improved Crafty very slightly. But after a few months of optimizing for AMD cache stuff, I went back to test again on my PIV and the thing was not helping nearly as much (HT on). And, in fact, it was a net loss after factoring in the overhead. Something on the order of 15% loss or so overall.

What has not yet been done, as far as I know, is optimise a parallel search specifically for HT.

With normal cores, it is important to minimise idle time of all processors. As a rule it is always better for a core to do something than to do nothing (of course I'm aware there is splitting overhead).

With HT, this is different. If one logical core is idle (and not spinning), its resources go to the other logical core of the same physical core. So if the other logical core is doing useful work, there is no need to join a split node that is of dubious quality. Only if a split node is available that is almost certain not to result in wasted work should the logical core join it.

So if a thread becomes idle, it should check whether its partner thread on the same physical core is busy. If it is, the thread should block instead of spin (the pause instruction in a spin loop helps only to some extent, and monitor/mwait are unfortunately not available in user space). If the other thread is also idle it should make sure that one of them spins, the other blocks. Spinning threads join any split node that is available. Blocked threads are woken up once a "very good" split node becomes available.

In a first approximation, "very good" could be defined as "high remaining depth". Of course the above requires per thread affinity settings.

It might take more work than this, but I would be surprised if HT could not be made into a definite win.

Of course none of this applies to current engines, as far as I know.
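
The idle-thread policy described above can be sketched roughly as follows. This is only an illustration of the decision rule, not code from any engine: the names `SiblingState`, `SplitNode`, `idle_action` and the threshold `VERY_GOOD_DEPTH` are hypothetical, and "very good" is approximated as "high remaining depth" as suggested above. It also assumes each thread is pinned to a logical core so that it knows who its sibling is.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SiblingState(Enum):
    """State of the other logical core on the same physical core."""
    BUSY = 1      # searching, doing useful work
    SPINNING = 2  # idle, polling for split nodes
    BLOCKED = 3   # idle, sleeping until woken


class Action(Enum):
    JOIN = 1   # help at the offered split node
    SPIN = 2   # stay idle but keep polling
    BLOCK = 3  # sleep and give the core's resources to the sibling


@dataclass
class SplitNode:
    remaining_depth: int


# Hypothetical threshold for a "very good" split node (an assumption).
VERY_GOOD_DEPTH = 12


def idle_action(sibling: SiblingState, node: Optional[SplitNode]) -> Action:
    """Decide what an idle logical core should do."""
    if sibling is SiblingState.BLOCKED:
        # This thread is the designated spinner of the pair:
        # it joins any split node that is available.
        return Action.JOIN if node is not None else Action.SPIN
    # Sibling is busy or already spinning: stay out of the way unless
    # the split node is very unlikely to result in wasted work.
    if node is not None and node.remaining_depth >= VERY_GOOD_DEPTH:
        return Action.JOIN
    return Action.BLOCK
```

For example, `idle_action(SiblingState.BUSY, SplitNode(4))` gives `Action.BLOCK`, while the same shallow node is joined when the sibling is blocked. A real engine would still need the wake-up mechanism and the affinity setup, which are left out here.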

bob · Post by **bob** » Tue Apr 30, 2013 4:34 am

Will be in the morning before I have results. I overlooked one line in my shell script that runs these matches, and it ran 'em all with mt=1, which was not exactly what I wanted.

Started over. Might have some prelim results in an hour or two, but the mt=4 run will take about 4x longer than the one thread version since I can only run 2 games per node rather than 8.
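
The 4x estimate follows from how many games fit on a node at once. A tiny sketch of the arithmetic, assuming 8 cores per node (implied by the 8 single-thread games per node, but not stated outright); `CORES_PER_NODE` and `concurrent_games` are names made up for this illustration:

```python
CORES_PER_NODE = 8  # assumption: one core per single-thread game


def concurrent_games(threads_per_game: int) -> int:
    """Games that can run side by side on one node without oversubscribing it."""
    return CORES_PER_NODE // threads_per_game


# With the same number of games to play, run time scales inversely
# with the number of games that can run at once.
slowdown_mt4 = concurrent_games(1) / concurrent_games(4)
print(concurrent_games(1), concurrent_games(4), slowdown_mt4)
```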

bob · Post by **bob** » Tue Apr 30, 2013 7:52 am

bob wrote:Will be in the morning before I have results. I overlooked one line in my shell script that runs these matches, and it ran 'em all with mt=1, which was not exactly what I wanted.

Started over. Might have some prelim results in an hour or two, but the mt=4 run will take about 4x longer than the one thread version since I can only run 2 games per node rather than 8.

With the first two-thread match almost complete, there is zero change in Elo from the two 1-thread runs. Got another 2-thread run for verification queued up, and then the 4-thread runs. Should be able to answer this subject definitively tomorrow if nothing goes wrong overnight.

Mike S. · Post by **Mike S.** » Tue Apr 30, 2013 6:33 pm

As a sidenote, I just want to say that I was surprised that this topic got so much attention. I was not aware that it is a "hot" one.

I hope you have enjoyed, and will continue to enjoy a fruitful discussion.

Some sample node rates on the i5-3210M dual-core CPU:

[D]r1bq1rk1/2ppbppp/p1n2n2/1p2p3/4P3/1B3N2/PPPP1PPP/RNBQR1K1 w - -
Node rate comparison (kN/s), exactly after 30 seconds each:

Engine              2T    4T  factor
------------------------------------
Critter 1.6a       2667  3930  1,47
Houdini 1.5a       3408  3964  1,16
Rybka 2.3.2a        168   283  1,68
Shredder Cl. 2012   626  1404  2,24
Spark 1.0          4304  5644  1,31
                              ------
                            Ø  1,57

syzygy · Post by **syzygy** » Tue Apr 30, 2013 6:58 pm

Mike S. wrote:

Engine              2T    4T  factor
------------------------------------
Critter 1.6a       2667  3930  1,47
Houdini 1.5a       3408  3964  1,16
Rybka 2.3.2a        168   283  1,68
Shredder Cl. 2012   626  1404  2,24
Spark 1.0          4304  5644  1,31
                              ------
                            Ø  1,57

If you got more than double the nps with 4 threads on a dual core with HT, then something went wrong in your testing. The other numbers except for maybe Houdini also look suspicious. (Well, Rybka anyway reports an imaginary nps, so we can safely ignore those numbers.) Maybe some other processes were running in the background?
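
As a rough check of the quoted table, the factors can be recomputed and anything above 2x flagged. This is just a sketch: the 2.0 limit encodes the argument above (going from 2 to 4 threads on a dual core with HT should not more than double the node rate), and the names `rates`, `factors` and `MAX_PLAUSIBLE_FACTOR` are made up for the illustration.

```python
# (engine, kN/s with 2 threads, kN/s with 4 threads), copied from the table
rates = [
    ("Critter 1.6a", 2667, 3930),
    ("Houdini 1.5a", 3408, 3964),
    ("Rybka 2.3.2a", 168, 283),
    ("Shredder Cl. 2012", 626, 1404),
    ("Spark 1.0", 4304, 5644),
]

# Assumed plausibility limit for 2T -> 4T on 2 physical cores with HT.
MAX_PLAUSIBLE_FACTOR = 2.0

# Speedup factor per engine, rounded to two decimals as in the table.
factors = {name: round(four_t / two_t, 2) for name, two_t, four_t in rates}

# Averaging the rounded factors reproduces the table's 1,57.
mean_factor = round(sum(factors.values()) / len(factors), 2)

suspicious = [name for name, f in factors.items() if f > MAX_PLAUSIBLE_FACTOR]

for name, f in factors.items():
    print(f"{name:<18} {f:.2f}")
print("mean", mean_factor, "suspicious:", suspicious)
```

Only Shredder exceeds the 2x limit; whether the other numbers are also too high is a judgment call that this check cannot settle.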

bob · Post by **bob** » Tue Apr 30, 2013 8:07 pm

Here are the final results. Versions x1 and x2 are two separate runs, where the x indicates the number of threads used. I played these matches to a fixed depth that was constant, and I turned off the time forfeit code in my match manager to avoid irrelevant losses caused by the fixed depth.

Versions -11 and -12 are normal Crafty, no threads. -21 and -22 use 2 threads, but the same depth limit. -41 and -42 do exactly the same but with 4 threads.

My only conclusion is that the extra breadth does nothing. But I go to quite a bit of trouble inside the search to try to search the SAME tree as without threads, that is, if a move gets reduced using sequential search, it will be reduced in the parallel search as well. Ditto for pruning and such...

Name                  Elo    +    - games score oppo. draws
Crafty-23.6-12       2612    3    3 30000   53%  2590   25%
Crafty-23.6-21       2611    3    3 30000   53%  2590   25%
Crafty-23.6-41       2611    3    3 30000   53%  2590   25%
Crafty-23.6-22       2610    3    3 30000   52%  2590   25%
Crafty-23.6-42       2609    3    3 30000   52%  2590   25%
Crafty-23.6-11       2609    3    3 30000   52%  2590   25%
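
To make the "no difference" reading concrete, here is a rough sketch that checks whether the rating intervals overlap. It assumes the second, third and fourth columns of the table are the rating and its plus/minus error margins; overlapping intervals are only an informal indication, not a proper significance test. The names `results`, `interval` and `overlaps` are made up for this example.

```python
from itertools import combinations

# (version, rating, plus margin, minus margin), copied from the table above
results = [
    ("Crafty-23.6-12", 2612, 3, 3),
    ("Crafty-23.6-21", 2611, 3, 3),
    ("Crafty-23.6-41", 2611, 3, 3),
    ("Crafty-23.6-22", 2610, 3, 3),
    ("Crafty-23.6-42", 2609, 3, 3),
    ("Crafty-23.6-11", 2609, 3, 3),
]


def interval(rating, plus, minus):
    """Lower and upper bound of the rating given its error margins."""
    return (rating - minus, rating + plus)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


intervals = [interval(r, p, m) for _, r, p, m in results]
all_overlap = all(overlaps(a, b) for a, b in combinations(intervals, 2))
spread = max(r for _, r, _, _ in results) - min(r for _, r, _, _ in results)

print("spread:", spread, "all intervals overlap:", all_overlap)
```

The whole spread across 1, 2 and 4 threads is 3 points, the same size as a single error margin, and the two 1-thread runs (-11 and -12) sit at opposite ends of it.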

Mike S. · Post by **Mike S.** » Tue Apr 30, 2013 8:10 pm

On a live Windows system, some other processes are ALWAYS running in the background. But as far as I am aware, none took more than 1% CPU.

Yeah, I am used to my numbers looking "suspicious". But they are real. And that ends this particular discussion for me on this point.

bob · Post by **bob** » Tue Apr 30, 2013 8:22 pm

syzygy wrote:
Mike S. wrote:
Engine              2T    4T  factor
------------------------------------
Critter 1.6a       2667  3930  1,47
Houdini 1.5a       3408  3964  1,16
Rybka 2.3.2a        168   283  1,68
Shredder Cl. 2012   626  1404  2,24
Spark 1.0          4304  5644  1,31
                              ------
                            Ø  1,57
If you got more than double the nps with 4 threads on a dual core with HT, then something went wrong in your testing. The other numbers except for maybe Houdini also look suspicious. (Well, Rybka anyway reports an imaginary nps, so we can safely ignore those numbers.) Maybe some other processes were running in the background?

There are programs with issues that HT can help with. I have seen 2x in the past, although not with chess engines.

And there is the 32/64 bit issue. HT has helped some 32 bit apps more than expected...
