Stockfish 020114 - Houdini 4 x64A Testing 39 of 100 played.

mwyoung · Post by **mwyoung** » Tue Jan 07, 2014 10:16 pm

bnculp wrote:
ouachita wrote:
ouachita wrote: Logical processors are the key. 16 real cores with HT on means 32 logical processors.
This issue has turned out to be a challenge to describe in writing. Nevertheless, since my 16-core CPU lets me enable or disable HT, my assumptions would be:

A. HT enabled:

1. Both SF and H can run on 1 thru 32 threads;

2. Both SF and H are stronger as core usage increases from 2 to 16;

3. SF may or may not be stronger while using threads 17 thru 32 (ongoing testing indicates HT to 32 does not help SF);

4. H will not benefit and will likely be adversely affected by using threads 17 thru 32 (per RH);

B. HT disabled:

1. Both SF and H can run on only 1 thru 16 cores/threads;

Larry can tell me how much of this applies to Kr.

I agree with most of it, although in my testing with HT enabled on a 4-core system, SF with 8 threads did edge out SF with 4 threads by 7 Elo. One must also be careful about drawing firm conclusions from small sample sizes. Even a 1000-game match, which is what I have been using, has error bars of about ±12 Elo. Another variable in your case is the 16 real cores and how each engine handles that kind of powerful hardware. There has been some speculation and some evidence (TCEC and Clemens tours, both with no HT or HT off) that Stockfish and Komodo perform better than Houdini on higher-powered machines. YMMV.
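As a rough check on that ±12 figure, here is a minimal Python sketch of the usual normal approximation for a match's error bar. It assumes an evenly matched pair (true score 50%), a draw rate you choose, and a 95% interval (1.96 standard errors); the function name `elo_error_bar` is hypothetical.

```python
import math

def elo_error_bar(games, draw_rate, z=1.96):
    """Approximate +/- Elo half-width of a confidence interval for an even match.

    Assumes the true score is 50%, with wins and losses equally likely.
    Wins and losses each deviate 0.5 from the mean; draws deviate 0,
    so the per-game variance is (1 - draw_rate) * 0.25.
    """
    s = 0.5
    per_game_var = (1.0 - draw_rate) * 0.25
    se = math.sqrt(per_game_var / games)  # standard error of the mean score
    # Slope of the logistic Elo curve at s = 0.5: 400 / (ln 10 * s * (1 - s))
    slope = 400.0 / (math.log(10) * s * (1.0 - s))
    return z * se * slope

for draws in (0.0, 0.5, 0.7):
    print(f"1000 games, {draws:.0%} draws: +/- {elo_error_bar(1000, draws):.1f} Elo")
```

With no draws, 1000 games gives roughly ±21.5 Elo; at a draw rate around 70% the interval shrinks to about ±12 Elo, in line with the figure quoted above. Quadrupling the number of games halves the error bar.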

It is possible, even likely, that using HT to go to 32 threads on a 16-core machine does nothing for Stockfish, and that adding threads stops helping at a count far below 32.

Chess programs lose efficiency very quickly as you add more cores, logical cores, or threads, whatever term you wish to use.

The point is that you lose further gains as you increase the threads, whether they are real cores or logical cores. I would expect real cores to perform better, because you are not splitting a real core into 2 logical cores, but the scaling problem still exists for both. It is the nature of chess engines; it has not been solved, and it may not be possible to solve on today's systems.

The scaling problem exists because the threads are often looking at and evaluating the exact same position, and no matter how many CPUs you have looking at the same position in the search tree, it gains you nothing; you are just spinning your wheels, so to speak. And you can only split the workload effectively so many times.

That is why you gain much more going from 1 core to 8 cores, but much, much less going from 8 cores to 16 cores on a 16-core system.

The effect can be seen any time you use MP and add threads: 1 to 2, 2 to 4, 4 to 8, and so on.
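The diminishing returns described above can be illustrated with Amdahl's law, a standard model in which a fraction p of the work parallelizes perfectly and the rest stays serial. Chess engines do not follow it exactly (their losses come mostly from duplicated or wasted search, as described above), so this is an analogy only, and p = 0.9 below is a hypothetical value chosen to show the shape of the curve, not a measurement for any engine.

```python
def amdahl_speedup(threads, parallel_fraction):
    """Ideal speedup over 1 thread when only parallel_fraction of the work scales."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / threads)

p = 0.9  # hypothetical parallel fraction, not measured for any engine
for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} threads: {amdahl_speedup(n, p):4.2f}x")
```

Under this assumption, going from 1 to 8 threads gives about 4.7x, doubling to 16 adds only about 36% more (6.4x), and 32 threads reach roughly 7.8x. The model also treats every thread as a full core; hyper-threads share one core's execution resources, which this sketch does not capture.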

arjuntemurnikar · Post by **arjuntemurnikar** » Tue Jan 07, 2014 11:42 pm

mwyoung wrote:
I have 7 computers in my house, some AMD, some Intel, some with a BIOS HT option. On my engine-vs-engine computer I keep nothing but chess testing, though I can test on any of them. I was referring to the computer I run my tests on, which I keep that way to minimize overhead when testing engine vs engine. This is why my CPU % at rest is almost 0%.

I can do many things on my other computers, like type this response.

Other testers also have many computers at home and are posting their results here on your HT question.

I have run many tests on this, including test positions. I get the same NPS and the same time per depth, apart from the normal variation one would expect when testing MP. That is why I tested this many times to get a good average.

Larry, I don't know what more I can do or say, but let others post their results.

You need to explain your theory: how can a CPU core processing only 1 pipe per core with HT on cripple an engine, compared with having HT off?

The CPU is the same whether HT is on or off in the BIOS. All the BIOS HT ON/OFF option does is keep programs restricted to 1 pipe on the core. But the 2 pipes are still there regardless of the HT setting in the BIOS.

My results could be incorrect; anything is possible, and that is why we test. But if HT on really crippled performance, Intel would be upsetting many customers running single-thread apps on their Intel HT CPUs, because HT chips are mostly what Intel makes.

And if it were somehow true, I am sure we would have read about it from AMD Inc.

Sorry, I have to disagree based on my testing. But others are also testing this and posting their results.

There is a difference between using the BIOS option to switch off HT and using 4/8 cores with HT=ON. You are right, the number of pipes stays the same, but the difference is in how information gets piped through them.

For example, consider a quad-core system with 4 real cores (1, 2, 3 & 4), each divided into 2 logical cores (A & B).

With HT=ON, an 8-thread engine will fire all 8 logical cores (1A, 1B, 2A, 2B, 3A, 3B, 4A & 4B).

With HT=ON, a 4-thread engine will fire 4 of the 8 logical cores, e.g. 1A, 2A, 3A, 4A, or it could even be 1A, 1B, 2A, 3A, depending on how the CPU controller distributes the load internally. The engine has no control over which cores it is using, so effectively it is using 4/8 logical cores, or 50% of the total load.

Now, with HT=OFF at the BIOS level, the two logical cores in each real core effectively act as ONE real core. So a 4-thread engine will fire all 4 real cores (i.e. 1A+1B, 2A+2B, 3A+3B & 4A+4B), utilizing all cores for a full CPU load of 100%. This is what Task Manager shows you.
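The 50% vs 100% figures follow from how an overall CPU percentage such as Task Manager's is computed: busy logical processors divided by the logical processors the OS can see. The sketch below assumes 2 logical processors per core when HT is on and that each engine thread keeps one logical processor fully busy; the function name is hypothetical. It reproduces only the displayed percentage and says nothing about how much work each physical core actually gets done.

```python
import os

def displayed_cpu_load(engine_threads, physical_cores, ht_enabled):
    """Overall CPU % averaged over logical processors (assumes 2-way HT, 1 busy thread each)."""
    logical = physical_cores * (2 if ht_enabled else 1)
    busy = min(engine_threads, logical)  # extra threads cannot push the display past 100%
    return 100.0 * busy / logical

print(displayed_cpu_load(4, 4, ht_enabled=True))   # 50.0
print(displayed_cpu_load(8, 4, ht_enabled=True))   # 100.0
print(displayed_cpu_load(4, 4, ht_enabled=False))  # 100.0

# os.cpu_count() reports logical processors, so it doubles when HT is on.
print("logical processors on this machine:", os.cpu_count())
```

The same 4 engine threads show as 50% or 100% depending only on the BIOS HT setting.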

So, in conclusion, the only way to test HT vs no-HT is by using two identical systems. Tests on single systems are simply flawed to begin with.

Now, that said, all these tests were done with small sample sizes anyway, so the mystery of HT vs no-HT remains unsolved.

ouachita · Post by **ouachita** » Tue Jan 07, 2014 11:50 pm

arjuntemurnikar wrote: so the mystery of HT vs no-HT remains unsolved.

Touché . . . as do many other topics and questions regarding chess engine testing, matches and tournaments. There is little mathematical precision in these endeavors. That is simply a fact and must be accepted.

I'm simply seeking to win more chess games.

arjuntemurnikar · Post by **arjuntemurnikar** » Wed Jan 08, 2014 12:07 am

Just a few more things I want to point out before lashers lash out at me:

I have over-simplified the examples in the hope of conveying my point more clearly. The gain of 8/8 vs 4/8 logical cores is not a straightforward 100% vs 50%, and 8/8 logical cores vs 4 real cores is not an equal 100% vs 100% either. Other factors related to CPU architecture come into play and may reduce the gains somewhat. So there might be a small difference here and there in performance, but my final conclusion remains the same:

The only way to test HT vs no-HT is by using two identical systems, one with HT=ON and one with HT=OFF (at BIOS level) i.e. a raw match between 4 real cores and 8 logical cores.

bnculp · Post by **bnculp** » Wed Jan 08, 2014 12:17 am

arjuntemurnikar wrote: The only way to test HT vs no-HT is by using two identical systems, one with HT=ON and one with HT=OFF (at BIOS level) i.e. a raw match between 4 real cores and 8 logical cores.

I totally agree

arjuntemurnikar wrote: With HT=ON, a 4-thread engine will fire 4 of the 8 logical cores, e.g. 1A, 2A, 3A, 4A, or it could even be 1A, 1B, 2A, 3A, depending on how the CPU controller distributes the load internally. The engine has no control over which cores it is using, so effectively it is using 4/8 logical cores, or 50% of the total load.

Are you saying that on a 4 core system with Hyperthreading enabled, Houdini with 8 threads should be better than Houdini with 4 threads ??

ouachita · Post by **ouachita** » Wed Jan 08, 2014 12:24 am

arjuntemurnikar wrote: The only way to test HT vs no-HT is by using two identical systems, one with HT=ON and one with HT=OFF (at BIOS level) i.e. a raw match between 4 real cores and 8 logical cores.

. . . which of course will never happen

arjuntemurnikar · Post by **arjuntemurnikar** » Wed Jan 08, 2014 12:38 am

bnculp wrote: Are you saying that on a 4 core system with Hyperthreading enabled, Houdini with 8 threads should be better than Houdini with 4 threads ??

I am not saying anything about a specific engine's performance. The gains from doubling the logical cores (Intel claims about 30% better performance with hyperthreading ON vs OFF, but this of course depends completely on the program that is running) may or may not cancel out the decreased efficiency of the alpha-beta algorithm. So it may be worse, it may be better, or it may be the same. As far as I can tell, there is no sample of games between these two system configurations large enough to say anything conclusive. I welcome testers to test it, though.

If you insist on my best guess, I would say Houdini should be slightly better with 8 threads than with 4 threads on an HT=ON Intel system. However, I think Houdini will be worse with 8/8 threads on an HT=ON system than with 4/4 threads on an HT=OFF system. This is what Mr. Houdart means when he says Houdini will perform better without hyperthreading. Of course, these are only my guesses, so I am waiting for someone to do some real conclusive testing.

bnculp · Post by **bnculp** » Wed Jan 08, 2014 12:44 am

```
System     Hyperthreading  Engine        Threads   Elo

i7-3720QM  Yes             SF-311213     8         +19
i7-3720QM  Yes             Houdini-4Pro  8         -19

i7-3720QM  Yes             SF-311213     8          +7
i7-3720QM  Yes             Houdini-4Pro  4          -7

i7-3720QM  Yes             SF-311213     8          +7
i7-3720QM  Yes             SF-311213     4          -7

i7-3720QM  Yes             Houdini-4Pro  8         -14
i7-3720QM  Yes             Houdini-4Pro  4         +14

i7-2600k   Yes             SF-311213     4         +12
i7-2600k   Yes             Houdini-4Pro  4         -12

i7-2600k   No              SF-311213     4         -16
i7-2600k   No              Houdini-4Pro  4         +16

i7-2600k   Yes             SF-311213     2         -19
i7-2600k   Yes             Houdini-4Pro  2         +19

i7-2600k   Yes             SF-311213     1         -63
i7-2600k   Yes             Houdini-4Pro  1         +63

i7-2600k   No              SF-311213     1         -53
i7-2600k   No              Houdini-4Pro  1         +53
```

Match conditions: 9 matches in total, identical except for thread count and hyperthreading.

GUI - Cutechess
Games - 1000
Time control - 15 s + 0.05 s
Hash - 128 MB
Houdini contempt - 0
Ponder - off
Openings - 8moves_v3.pgn, repeated with colors reversed
EGTBs - none

Summary:

On a 4-core system with hyperthreading enabled, Stockfish performed best with 8 threads, while Houdini's best setting was 4 threads.
Houdini has a large edge in single-thread testing regardless of hyperthreading. Stockfish's performance grew as the number of threads increased.

For those who think Houdini 8-threads must be better than Houdini 4-threads on a 4-core system with HT enabled, I ask that you look carefully at the data I have listed above. I also once believed that 8 threads had to be better than 4 threads. I now know that is wrong. I would also ask that the 8 thread believers read the following quote from the author of Houdini :

"The architecture of Houdini (and of chess engines in general) is not very well suited for hyper-threading; using more threads than physical cores will usually degrade the performance of the engine. Although the hyper-threads often produce a slightly higher node speed, the increased inefficiency of the parallel alpha-beta search more than offsets the speed gain obtained with the additional hyper-threads. To give a practical example, it's more efficient to use 4 threads running at 2,000 kN/s each than 8 threads running at 1,100 kN/s each, although the latter situation produces a higher total node speed. For this reason it's best to set the number of threads not higher than the number of physical cores of your hardware".

Finally, be advised that the only way to test HT vs no HT is by using two identical systems, one with HT enabled and one with HT disabled. This post and the tests I ran do not try to answer that question.
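The arithmetic behind Mr. Houdart's example can be made explicit: 4 × 2,000 kN/s = 8,000 kN/s and 8 × 1,100 kN/s = 8,800 kN/s, so the 8-thread setup is 10% faster in raw nodes, and it only reaches a given depth sooner if its extra search overhead costs less than that 10%. In the minimal sketch below, the tree size and the 20% overhead figure are hypothetical placeholders, not measurements from Houdini; only the node speeds come from the quote.

```python
def search_time(nodes_kn, threads, knps_per_thread):
    """Seconds to search nodes_kn thousand nodes at the given per-thread speed."""
    return nodes_kn / (threads * knps_per_thread)

# Node speeds from the quote: 4 x 2,000 kN/s versus 8 x 1,100 kN/s.
TREE_4T = 80_000    # hypothetical: kN the 4-thread search needs to reach some depth
OVERHEAD_8T = 1.20  # hypothetical: 8 threads search 20% more nodes for the same depth

t4 = search_time(TREE_4T, threads=4, knps_per_thread=2000)
t8 = search_time(TREE_4T * OVERHEAD_8T, threads=8, knps_per_thread=1100)
print(f"4 threads: {t4:.2f} s, 8 threads: {t8:.2f} s")

# 8 threads only come out ahead if their extra nodes stay below the raw speed gain.
break_even = (8 * 1100) / (4 * 2000)
print(f"break-even overhead: {break_even:.2f}x")
```

With these placeholder numbers the 8-thread search takes about 10.91 s against 10.00 s for 4 threads, despite the higher total node speed; any overhead below 1.10x would flip the result.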

ouachita · Post by **ouachita** » Wed Jan 08, 2014 1:29 am

arjuntemurnikar wrote: some real conclusive testing.

The Quest for the Holy Grail ends here:

```
SF010214-16 v. H4B-16, Blitz 1m+1s  0

1   Houdini 4 Pro x64B           +28  +29/=50/-21  54.00%   54.0/100
2   Stockfish 020114 64 SSE4.2   -28  +21/=50/-29  46.00%   46.0/100

SF010214-32T v. H4B-16C, 1+1  0

1   Houdini 4 Pro x64B            +7  +21/=60/-19  51.00%   51.0/100
2   Stockfish 020114 64 SSE4.2    -7  +19/=60/-21  49.00%   49.0/100
```

The single difference between these two test bases, aside from kPa and RH, was changing SF to 32 threads.

As the narrator said so often in The Wonder Years, "there you have it."

arjuntemurnikar · Post by **arjuntemurnikar** » Wed Jan 08, 2014 1:52 am

When I said conclusive testing, I meant a statistically significant number of games. By a statistically significant number of games, I mean enough games that the error bars are smaller than the Elo difference.

So the quest for the "Holy Grail" remains incomplete.
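To see how far 100 games are from that bar, here is a minimal sketch that turns a win/draw/loss record into an Elo estimate with an approximate 95% interval (normal approximation on the mean score, then the logistic Elo formula). The W/D/L figures are Houdini's results from the two 100-game matches above; the function names are hypothetical.

```python
import math

def elo_from_score(score):
    """Logistic Elo difference corresponding to a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def match_elo_interval(wins, draws, losses, z=1.96):
    """Elo estimate and approximate 95% interval from a W/D/L result."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    # Per-game variance of the score around its mean (win = 1, draw = 0.5, loss = 0)
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    margin = z * math.sqrt(var / n)  # half-width of the interval on the score scale
    return elo_from_score(s), elo_from_score(s - margin), elo_from_score(s + margin)

matches = (("H4 vs SF 16 threads", (29, 50, 21)), ("H4 vs SF 32 threads", (21, 60, 19)))
for label, wdl in matches:
    elo, lo, hi = match_elo_interval(*wdl)
    print(f"{label}: {elo:+.0f} Elo, 95% interval {lo:+.0f} to {hi:+.0f}")
```

The estimates come out at +28 and +7 Elo, matching the tables, but the intervals run from about -20 to +77 and from about -36 to +50. Both include 0 and overlap almost entirely, so with 100 games the error bars are far wider than the Elo difference being measured.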
