Two points you keep missing:
(1) Someone mentioned running _two_ games at a time in a tournament, on a single-CPU machine with SMT enabled. Does that sound like a good idea? Or is it really just like running two games on a CPU that is 1/2 as fast, which means each program effectively gets 1/2 the time, so the time control is twice as fast? And then one of those games ends, and now the time control is effectively doubled, because the remaining game gets the whole processor. _That_ is a problem. Since there is no telling when game 1 will finish, one program in game 2 suddenly gets a 2x speedup bonus, either for the rest of the game, or for a move or two if another game is started quickly (while both programs are still in book).
(2) I've done the measurements ad nauseam on Crafty. I have a dual PIV/2.8ghz in my office. I currently have SMT disabled. When I first got it, I ran with SMT on and saw a _small_ performance increase. But as I made cache accesses more efficient, that increase turned into a penalty. The last time I tested, I got about a 10-15% nps improvement, but each CPU pays about a 30% search-overhead penalty: a net loss of 15-20% for each extra "logical processor".
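To see how those two figures combine, here is a minimal back-of-the-envelope sketch (Python; the 10-15% nps gain and 30% overhead are taken from the numbers above as assumptions, not re-measured):

```python
# Net effect of SMT on effective search speed: the raw nps gain is
# discounted by the extra nodes the parallel search must visit.
def effective_speed(nps_gain, search_overhead):
    """Speed relative to SMT-off (1.0 = no change)."""
    return (1.0 + nps_gain) * (1.0 - search_overhead)

for gain in (0.10, 0.15):
    net = effective_speed(gain, 0.30) - 1.0
    print(f"nps gain {gain:.0%}, overhead 30% -> net {net:+.1%}")
```

This comes out to roughly a 20% net loss at the high end of the nps gain, in the same ballpark as the 15-20% figure quoted above.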
What is slowing me down is _not_ a "software implementation issue". It is simply an issue produced by the alpha/beta algorithm being inherently serial. Since we can't possibly write perfect move-ordering code, a parallel search will _always_ have overhead. Even with a "perfect" algorithm, because of alpha/beta.
If you do some research you will find _plenty_ of non-chess applications that also see no benefit whatsoever from SMT. There was a long discussion of this on several hardware message boards when hyper-threading was first announced. Hyper-threading depends on poor program tuning to work. If a program is not doing lots of memory traffic, or suffering lots of pipeline stalls due to dependencies, then SMT will always hurt it, because a good program can keep the physical processor busy almost 100% of the time. And running two such programs on two logical processors offers exactly no gain over running them sequentially on one physical processor...
Dr. Hyatt, you are waayyy too smart and knowledgeable to be making these arguments.
I wasn't trying to address point number 1). And please, the problem you describe there would be a problem on any CPU: a single CPU, an SMT-enabled CPU, dual CPUs, etc. This is a stupid question.
In point number 2) you basically make my case for me. You admit that your software runs faster when hyperthreaded ("10-15% nps improvement") but suffers from a larger increase in overhead imposed by the way an alpha-beta search works. This is a SOFTWARE issue, and it IS due to the nature of the ALGORITHM. I'm really sorry that you are unable to make an alpha-beta search function in a multithreaded environment without incurring this overhead. But just stop, for one second, and admit that hyperthreading is working. Your software is running faster because the CPU is FASTER. You just don't benefit because of the nature of your SOFTWARE.
Please. This is my only point. This is about hyperthreading. Hyperthreading is NOT slower hardware. In many cases it is faster and generates faster performance for real-world applications. The world of chess engines is not the only benchmark for any given technology. Off the top of my head, apps that benefit from HT: video encoders, Quake 4 (the multithreaded version that runs the sound on a separate thread), archivers, general-purpose OS task switching between processes...
Just answer this one question, and I will shut up: does your software get more work done per unit of time with HT enabled? To rephrase: doesn't your software compute more NPS? If you answer yes, then you are admitting that HT is running your code faster. That's it. I don't need to hear "but my algorithm is inherently serial, so it has more overhead when multithreaded". I don't care that your software screws the pooch by bloating the search tree. That just means your code needs MORE of a speedup than HT provides.
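For what it's worth, the two sides here are measuring different things: raw node throughput versus the time to finish a search of a given size. A tiny sketch (Python; the 12% and 30% figures are illustrative assumptions, not anyone's measurements) shows how both claims can be true at once:

```python
# nps can go UP while time-to-depth gets WORSE, if the parallel
# search must visit a larger tree to reach the same depth.
def time_to_depth(nodes_needed, nps):
    """Seconds to complete a search that must visit nodes_needed nodes."""
    return nodes_needed / nps

base_nodes, base_nps = 1_000_000, 1_000_000   # 1.0 s without SMT
smt_nps = base_nps * 1.12                     # assumed ~12% raw nps gain
smt_nodes = base_nodes * 1.30                 # assumed ~30% larger tree

print(time_to_depth(base_nodes, base_nps))    # without SMT
print(time_to_depth(smt_nodes, smt_nps))      # with SMT: more nps, yet slower to depth
```

With these assumed numbers the SMT search produces more nodes per second but still takes longer to reach the same depth, which is exactly the distinction the two posts keep talking past.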
And I actually followed the endless, tedious thread long ago in which you and another CCC member (Tom somebody, I think), argued this to the edge of insanity.
Just as there are apps that don't gain a net benefit from this, there are plenty of apps that do. Furthermore, it's not just about how well a single application is tuned; it's also about the fact that an OS is multitasking many PROCESSES, and that, compared to a SINGLE NON-HT CPU, an HT CPU is better able to handle branch-prediction misses in the long P4 pipeline, better able to task-switch with its extra register state, etc.
"The foundation of morality is to have done, once for all, with lying; to give up pretending to believe that for which there is no evidence, and repeating unintelligible propositions about things beyond the possibilities of knowledge." - T. H. Huxley