Two points you keep missing:
(1) Someone mentioned running _two_ games at a time in a tournament, on a single-CPU machine with SMT enabled. Does that sound like a good idea? Or is it really just like running two games on a CPU that is 1/2 as fast, which means each program effectively gets 1/2 the time, so the time control is twice as fast? And then one of those games ends, and now the time control is effectively doubled, because the remaining game gets the whole processor. _That_ is a problem. Since there is no telling when game 1 will finish, one program in game 2 suddenly gets a 2x speedup bonus, either for the rest of the game, or for a move or two if another game is started quickly (while both programs are still in book).
(2) I've done the measurements ad nauseam on Crafty. I have a dual PIV/2.8ghz in my office. I currently have SMT disabled. When I first got it, I ran with SMT on and saw a _small_ performance increase. But as I made cache accesses more efficient, that increase turned into a penalty. The last time I tested, I got about a 10-15% nps improvement, but each CPU pays about a 30% search-overhead penalty: a net loss of 15-20% for each extra "logical processor".
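To see how those two figures combine, here is a minimal back-of-the-envelope sketch (Python; the 10-15% nps gain and 30% overhead are taken from the numbers above as assumptions, not re-measured):

```python
# Net effect of SMT on effective search speed: the raw nps gain is
# discounted by the extra nodes the parallel search must visit.
def effective_speed(nps_gain, search_overhead):
    """Speed relative to SMT-off (1.0 = no change)."""
    return (1.0 + nps_gain) * (1.0 - search_overhead)

for gain in (0.10, 0.15):
    net = effective_speed(gain, 0.30) - 1.0
    print(f"nps gain {gain:.0%}, overhead 30% -> net {net:+.1%}")
```

This comes out to roughly a 20% net loss at the high end of the nps gain, in the same ballpark as the 15-20% figure quoted above.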
What is slowing me down is _not_ a "software implementation issue". It is simply an issue produced by the alpha/beta algorithm being inherently serial. Since we can't possibly write perfect move-ordering code, a parallel search will _always_ have overhead. Even with a "perfect" algorithm, because of alpha/beta.
If you do some research you will find _plenty_ of non-chess applications that also see no benefit whatsoever from SMT. There was a long discussion of this on several hardware message boards when hyper-threading was first announced. Hyper-threading depends on poor program tuning to work. If a program is not doing lots of memory traffic, or suffering lots of pipeline stalls due to dependencies, then SMT will always hurt it, because a good program can keep the physical processor busy almost 100% of the time. And running two such programs on two logical processors offers exactly no gain over running them sequentially on one physical processor...
Dr. Hyatt, you are waayyy too smart and knowledgeable to be making these arguments.
I wasn't trying to address point number 1). And please, the problem you describe there would be a problem on any CPU: a single CPU, an SMT-enabled CPU, dual CPUs, etc. This is a stupid question.
In point number 2) you basically make my case for me. You admit that your software runs faster when hyperthreaded ("10-15% nps improvement") but suffers from a larger increase in overhead imposed by the way an alpha-beta search works. This is a SOFTWARE issue, and it IS due to the nature of the ALGORITHM. I'm really sorry that you are unable to make an alpha-beta search function in a multithreaded environment without incurring this overhead. But just stop, for one second, and admit that hyperthreading is working. Your software is running faster because the CPU is FASTER. You just don't benefit because of the nature of your SOFTWARE.
Please. This is my only point. This is about hyperthreading. Hyperthreading is NOT slower hardware. In many cases it is faster and generates faster performance for real-world applications. The world of chess engines is not the only benchmark for any given technology. Off the top of my head, apps that benefit from HT: video encoders, Quake 4 (the multithreaded version that runs the sound on a separate thread), archivers, general-purpose OS task switching between processes...
Just answer this one question, and I will shut up: does your software get more work done per unit of time with HT enabled? To rephrase: doesn't your software compute more NPS? If you answer yes, then you are admitting that HT is running your code faster. That's it. I don't need to hear "but my algorithm is inherently serial, so it has more overhead when multithreaded". I don't care that your software screws the pooch by bloating the search tree. That just means your code needs MORE of a speedup than HT provides.
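For what it's worth, the two sides here are measuring different things: raw node throughput versus the time to finish a search of a given size. A tiny sketch (Python; the 12% and 30% figures are illustrative assumptions, not anyone's measurements) shows how both claims can be true at once:

```python
# nps can go UP while time-to-depth gets WORSE, if the parallel
# search must visit a larger tree to reach the same depth.
def time_to_depth(nodes_needed, nps):
    """Seconds to complete a search that must visit nodes_needed nodes."""
    return nodes_needed / nps

base_nodes, base_nps = 1_000_000, 1_000_000   # 1.0 s without SMT
smt_nps = base_nps * 1.12                     # assumed ~12% raw nps gain
smt_nodes = base_nodes * 1.30                 # assumed ~30% larger tree

print(time_to_depth(base_nodes, base_nps))    # without SMT
print(time_to_depth(smt_nodes, smt_nps))      # with SMT: more nps, yet slower to depth
```

With these assumed numbers the SMT search produces more nodes per second but still takes longer to reach the same depth, which is exactly the distinction the two posts keep talking past.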
And I actually followed the endless, tedious thread long ago in which you and another CCC member (Tom somebody, I think), argued this to the edge of insanity.
Just as there are apps that don't gain a net benefit from this, there are plenty of apps that do. Furthermore, it's not just about how well a single application is tuned; it's also about the fact that an OS is multitasking many PROCESSES, and that, compared to a SINGLE NON-HT CPU, an HT CPU is better able to handle branch-prediction misses in the long P4 pipeline, better able to task-switch with its extra register state, etc.
"The foundation of morality is to have done, once for all, with lying; to give up pretending to believe that for which there is no evidence, and repeating unintelligible propositions about things beyond the possibilities of knowledge." - T. H. Huxley