Hyperthreading and Computer Chess: Intel i5-3210M

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by rbarreira »

hgm wrote:Sorry, that is nonsense. If A and B always played the same game against each other, and A happened to win it, that would not prove that A is stronger at all. It could very well be that, starting from every position not in the game, B would win. (In practice this could even occur, e.g. because the opening book of the far stronger engine B contains an error that allows a book win.)

Ricardo is right, with the caveat that one should not go to extremes: if the randomness were so high that it started to randomly decide the result (e.g. by randomly starving processes of CPU so they would always lose on time before they could complete 20 moves), that would qualify as "too much randomness". But in typical testing conditions we are very far from this limit.
Yeah, this.

I guess I should have been even clearer and said that all randomness is good, as long as you have fixed the conditions you want to fix - the programs, the time control and the available hardware power (plus perhaps a few other implicit things I'm forgetting now). Most other things should be left to randomness, which will help your statistical measurements. Non-determinism coming from the parallel search definitely helps testing.
syzygy
Posts: 5721
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

Modern Times wrote:
syzygy wrote:If it doesn't work for Crafty, it can't work for any other engine? (Well, except for "poorly implemented" engines like Houdini, I guess.)
Indeed. Mark Uniacke (Hiarcs) clearly considers that Hiarcs benefits from hyperthreading, because they used it in tournament play a couple of years back. Not sure if it was WCCC or another tournament, but I'm positive they used it.
So that makes two poorly implemented engines. ;-)

If Cray Blitz had run on a 4-processor engine with hyperthreading, it would also have benefited, provided that the nps speedup from HT was at least 12%. This is based on the speedup numbers for 4 and 8 processors mentioned in the DTS paper, which I admit are backed up only by the statistically insignificant number of just 24 test positions. (Btw, a bit less than 12% nps speedup should be fine too, because most likely the nps speedup for Cray Blitz when going from 4 to 8 processors was less than 2x due to increased locking overhead.)
syzygy
Posts: 5721
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

syzygy wrote:If Cray Blitz had run on a 4-processor engine with hyperthreading
Oops, on a 4-core processor is what I meant.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
bob wrote:And maybe ONE day a little science will work its way into these discussions.

(hint: "ONE PROGRAM"? :)

And based on 1K games, with ONE program, HT is a good idea for all? :)
Let's see:
bob wrote:Regardless of urban legend, I have NEVER seen one example where using hyper threading improves the performance of a chess engine. Not a single one.
bob wrote:There's something badly wrong with your testing. I can post a ton of data relative to Crafty and SMT (hyper-threading). And it has ALWAYS been worse on than off. Including the recent test on my macbook dual i7 with SMT enabled (I can't turn it off).
If it doesn't work for Crafty, it can't work for any other engine? (Well, except for "poorly implemented" engines like Houdini, I guess.)

Btw, I do agree that Mike's results are not statistically significant because of the small sample size. But is it much different for the old papers on which most "conventional knowledge" is based? Just an example: 24 test positions?
You DO realize that I run multi-threaded programs regularly? I have posted results of cluster testing with threads, using things like Stockfish, to measure Elo improvement.

I chose the "24 test positions" because that was what everyone else was using at the time, and it made our results directly comparable, even if not as accurate as might be desired. In my DTS paper I didn't use those positions; instead I followed a real game, move by move, one that Cray Blitz had actually played...

My point was, and is, that there is ALWAYS overhead in a parallel search. The 30% number I used has some mathematical basis, as I mentioned. One can measure the typical "fail high on first move" percentage to determine how accurate the move ordering is, and from that one can directly estimate the expected overhead caused by splitting at an ALL node that sits right below one of those "didn't fail high on first move" nodes, which turns that ALL node into a CUT node.

One can't escape alpha/beta's limitations. One can't improve move ordering beyond what we see today, given the techniques currently used for ordering, which means the search will always have overhead. And to date, no SMT-enabled processor I have seen offers a 30% (or more) improvement for a program that is reasonable in its memory accesses and data dependencies...

And in the critical places, it is worse. I worked on a complex tactical position earlier this week where the fail-high-on-first-move percentage dropped to 70%. That drives parallel search overhead WAY beyond 30%. When you need it the most, it helps the least.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
Modern Times wrote:
syzygy wrote:If it doesn't work for Crafty, it can't work for any other engine? (Well, except for "poorly implemented" engines like Houdini, I guess.)
Indeed. Mark Uniacke (Hiarcs) clearly considers that Hiarcs benefits from hyperthreading, because they used it in tournament play a couple of years back. Not sure if it was WCCC or another tournament, but I'm positive they used it.
So that makes two poorly implemented engines. ;-)

If Cray Blitz had run on a 4-processor engine with hyperthreading, it would also have benefited, provided that the nps speedup from HT was at least 12%. This is based on the speedup numbers for 4 and 8 processors mentioned in the DTS paper, which I admit are backed up only by the statistically insignificant number of just 24 test positions. (Btw, a bit less than 12% nps speedup should be fine too, because most likely the nps speedup for Cray Blitz when going from 4 to 8 processors was less than 2x due to increased locking overhead.)
There was no significant locking overhead in Cray Blitz. NPS scaled perfectly up to the 32-CPU T90 we used. A 12% increase in NPS will NOT offset a 30% increase in nodes searched, however...
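To put rough numbers on that trade-off, here is a minimal sketch of the arithmetic behind the claim, using only the 12% NPS gain and 30% node growth figures quoted in this thread (assumed values for illustration, not new measurements):

/* Sketch of the trade-off described above: HT raises NPS by 12% but the
   extra search threads inflate the tree by 30%.  Effective speed is the
   rate of useful work, i.e. the nps gain divided by the tree growth. */
#include <stdio.h>

int main(void) {
    double nps_gain    = 1.12;   /* assumed NPS increase from HT      */
    double tree_growth = 1.30;   /* assumed growth in nodes searched  */
    double effective   = nps_gain / tree_growth;

    printf("effective speedup with HT: %.2f\n", effective);  /* ~0.86, a net loss */
    return 0;
}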
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

hgm wrote:Can you explain why a 'race-condition bug' would care whether the cores are full cores or hyperthreads? Or are you claiming that this bug would also strike when Houdini was running on a true 8-core machine (with HT off) if it only used 4 threads there? If so, would this apply to any situation where the number of threads is smaller than the number of cores (e.g. using 3 threads on a 4-core HT-off machine)?

I don't follow the logic of your point III. What you say there sounds nonsensical. Software can never emulate extra computational power delivered by the hardware (in this case through HT).
One example: you are doing a spin-lock in thread 1, waiting for thread 2 to complete the critical section and release the lock. The spinning thread accomplishes nothing, yet if you have insufficient cores it will still be given a time quantum and burn it doing no useful work.
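A minimal sketch of that kind of spin-wait, assuming pthreads and C11 atomics (illustrative code only, not Crafty's actual lock implementation):

/* Thread 1 spins on a flag until thread 2 releases it.  If both threads
   are scheduled onto the same logical core (more threads than cores), the
   spinning thread still consumes its whole time quantum doing nothing useful. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int locked = 1;          /* "held" by the worker initially */

static void *waiter(void *arg) {
    (void)arg;
    while (atomic_load(&locked))       /* spin: burns CPU cycles */
        ;                              /* no yield, no sleep */
    printf("waiter: lock released, proceeding\n");
    return NULL;
}

static void *worker(void *arg) {
    (void)arg;
    /* ... critical section work would happen here ... */
    atomic_store(&locked, 0);          /* release the "lock" */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, waiter, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}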

One can cause this trivially in Crafty. Here's a quick example: 5 tests, one position, using a 2-core SMT-enabled MacBook i7. The first run uses 4 threads, the second uses 5, and the last uses 8. Watch the NPS. And of course, notice the super-linear nonsense on the last run even though its NPS is the slowest of all. Not repeatable, however...

time=8.61 mat=0 n=85069658 fh=95% nps=9.9M
time=3.96 mat=0 n=33860195 fh=95% nps=8.6M
time=19.10 mat=0 n=144994050 fh=95% nps=7.6M
time=26.93 mat=0 n=141080173 fh=95% nps=5.2M
time=10.59 mat=0 n=31553118 fh=95% nps=3.0M

Same test run again, under identical conditions:

time=15.70 mat=0 n=149446886 fh=94% nps=9.5M
time=4.89 mat=0 n=41987463 fh=95% nps=8.6M
time=17.04 mat=0 n=120357149 fh=94% nps=7.1M
time=25.54 mat=0 n=148366512 fh=94% nps=5.8M
time=28.74 mat=0 n=92600155 fh=95% nps=3.2M

The only thing consistent is that as threads go beyond cores (logical cores = 4 here, physical cores = 2) the NPS drops, because lock overhead becomes an issue.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

hgm wrote:I don't see how Houdini could perceive a situation where it is only allowed to interact with 4 of the 8 hyperthreads as different from a situation where there are only 4 unsplit cores in total, no matter what bugs it might have. What goes on in the remaining hyperthreads would just be outside its 'field of view'. So if there are any problems, it seems that setting an affinity mask for the 4-thread instance to logical cores 0, 2, 4 and 6 should cure them.
Could be an issue with older operating systems that are not SMT-aware. The O/S needs to ensure that you get one thread per physical core until the number of threads exceeds the number of physical cores. That has not always been the case. In fact, the 6-core boxes tripped up at least a Linux version or two in that regard, since the number of cores was not a power of two.
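For what it's worth, a minimal sketch of the affinity idea mentioned above, assuming Linux and assuming that logical CPUs 0, 2, 4 and 6 map to distinct physical cores (the mapping is machine-specific; verify it with lscpu before relying on it):

/* Pin the current process to logical CPUs 0, 2, 4 and 6 so that at most
   one software thread lands on each physical core (assuming the usual
   even/odd sibling numbering, which must be checked per machine). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu += 2)   /* logical CPUs 0, 2, 4, 6 */
        CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to logical CPUs 0, 2, 4, 6\n");
    return 0;
}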
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

hgm wrote:Sorry, that is nonsense. If A and B always played the same game against each other, and A happened to win it, that would not prove that A is stronger at all. It could very well be that, starting from every position not in the game, B would win. (In practice this could even occur, e.g. because the opening book of the far stronger engine B contains an error that allows a book win.)

Ricardo is right, with the caveat that one should not go to extremes: if the randomness were so high that it started to randomly decide the result (e.g. by randomly starving processes of CPU so they would always lose on time before they could complete 20 moves), that would qualify as "too much randomness". But in typical testing conditions we are very far from this limit.
You want randomness in the GAMES. We provide that by choosing a large number of different opening positions to start playing from. You do NOT want randomness introduced by varying the hardware speed, or by doing silly things like EGTB probes to a remote file system where network latency introduces randomness, etc. There are plenty of examples where program A plays relatively worse than program B at a faster (or slower) time control. If you slow down the hardware randomly, you essentially make one or both programs play at a faster effective time control, which skews the results in random ways. Not wanted.

If you have hardware randomness, you have to play enough games to be sure that the randomness affects both sides equally, but with that kind of randomness, simply playing more games doesn't help.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

rbarreira wrote:
hgm wrote:Sorry, that is nonsense. If A and B always played the same game against each other, and A happened to win it, that would not prove that A is stronger at all. It could very well be that, starting from every position not in the game, B would win. (In practice this could even occur, e.g. because the opening book of the far stronger engine B contains an error that allows a book win.)

Ricardo is right, with the caveat that one should not go to extremes: if the randomness were so high that it started to randomly decide the result (e.g. by randomly starving processes of CPU so they would always lose on time before they could complete 20 moves), that would qualify as "too much randomness". But in typical testing conditions we are very far from this limit.
Yeah, this.

I guess I should have been even clearer and said that all randomness is good, as long as you have fixed the conditions you want to fix - the programs, the time control and the available hardware power (plus perhaps a few other implicit things I'm forgetting now). Most other things should be left to randomness, which will help your statistical measurements. Non-determinism coming from the parallel search definitely helps testing.
I disagree. The reason for randomness is to avoid correlated tests. Playing the same position many times is a bad idea, even though the games won't be identical. This was discussed at length back when I started cluster testing, and it showed that one needs many starting positions in order to have "independent trials". Starting from the same position and playing different games due to randomness is far from the "independent trials" that Elo measurements depend on.
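As an illustration of why correlated games hurt, here is a small sketch using the textbook design-effect formula (a standard statistics result, not a figure from the cluster study above); the game counts and correlation value are purely assumed:

/* Effective sample size when games are grouped by opening.
   Design effect: n_eff = n / (1 + (k - 1) * rho), where k is the number
   of games replayed from each opening and rho the intra-cluster
   correlation.  All numbers below are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    int    n   = 10000;   /* total games played                         */
    int    k   = 100;     /* games replayed from each starting position */
    double rho = 0.2;     /* assumed correlation within one opening     */

    double n_eff = n / (1.0 + (k - 1) * rho);
    printf("nominal games: %d, effective independent games: %.0f\n", n, n_eff);
    /* ~481 out of 10000: far fewer independent trials than it appears,
       which is why many distinct starting positions are needed. */
    return 0;
}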
syzygy
Posts: 5721
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:And maybe ONE day a little science will work its way into these discussions.

(hint: "ONE PROGRAM"? :)

And based on 1K games, with ONE program, HT is a good idea for all? :)
Let's see:
bob wrote:Regardless of urban legend, I have NEVER seen one example where using hyper threading improves the performance of a chess engine. Not a single one.
bob wrote:There's something badly wrong with your testing. I can post a ton of data relative to Crafty and SMT (hyper-threading). And it has ALWAYS been worse on than off. Including the recent test on my macbook dual i7 with SMT enabled (I can't turn it off).
If it doesn't work for Crafty, it can't work for any other engine? (Well, except for "poorly implemented" engines like Houdini, I guess.)

Btw, I do agree that Mike's results are not statistically significant because of the small sample size. But is it much different for the old papers on which most "conventional knowledge" is based? Just an example: 24 test positions?
You DO realize that I run multi-threaded programs regularly? I have posted results of cluster testing with threads, using things like Stockfish, to measure Elo improvement.
I'm looking at what you wrote, and what you wrote is that it doesn't work for Crafty, with a strong suggestion that it therefore can't work for other engines.
bob wrote:I chose the "24 test positions" because that was what everyone else was using at the time, and it made our results directly comparable, even if not as accurate as might be desired. In my DTS paper I didn't use those positions; instead I followed a real game, move by move, one that Cray Blitz had actually played...
That everyone else was doing the same doesn't change the fact that results from 24 positions are statistically insignificant. Let today be the day that some science gets into this discussion...
bob wrote:There was no significant locking overhead in Cray Blitz. NPS scaled perfectly up to the 32-CPU T90 we used. A 12% increase in NPS will NOT offset a 30% increase in nodes searched, however...
Let's do the math. Directly from your DTS paper: 4 processors give a speedup of 3.7, 8 processors give a speedup of 6.6.

This means that on a 4-core machine with HT off, the speedup is 3.7. This is easy.

With 8 processors the speedup is 6.6, but that includes the 2x nps speedup compared to 4 processors. So on a 4-core machine with HT on, assuming 12% higher nps, the speedup is (6.6 / 2) x 1.12 = 3.7. This is still elementary.

So with more than 12% higher nps due to HT, HT is a win.

Again, this is based directly on your DTS paper. The math is not difficult.
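The same arithmetic as a minimal sketch, using only the DTS-paper speedups quoted above; the 12% HT figure is the assumption under discussion, not a measurement:

/* Speedup of 8 threads on 4 HT cores, derived from the DTS-paper numbers. */
#include <stdio.h>

int main(void) {
    double speedup_4cpu = 3.7;    /* DTS paper, 4 processors           */
    double speedup_8cpu = 6.6;    /* DTS paper, 8 processors           */
    double ht_nps_gain  = 1.12;   /* assumed: HT gives 12% more NPS    */

    /* 8 real processors deliver 2x the NPS of 4; with HT on 4 cores we
       only get ht_nps_gain times the 4-core NPS, so scale accordingly. */
    double speedup_4cpu_ht = speedup_8cpu / 2.0 * ht_nps_gain;

    printf("4 cores, HT off: %.2f\n", speedup_4cpu);       /* 3.70  */
    printf("4 cores, HT on : %.2f\n", speedup_4cpu_ht);    /* ~3.70 */
    return 0;
}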