"Correct" benchmarks of modern CPUs for chess

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Thu May 21, 2009 1:57 pm

Hi,

Shortly before the WCCC I was looking to upgrade some of my machines here, but I ran into a problem: most benchmarks you see make several mistakes that leave them unrepresentative for chess, even when they run Fritzmarks. Particularly with Core i7, many sites benchmark with Turbo mode on, Hyperthreading on, and mostly 32-bit software.

For chess, this is not very useful: Hyperthreading is a loss for almost all programs, Turbo mode won't be active if you're running all cores full throttle, and 64-bit programs are stronger.

So I spent some time benchmarking the latest Deep Sjeng and Crafty 23.0 on some systems that interested me (HT and Turbo mode off where applicable, and always 64-bit). Results below:

Deep Sjeng PP2E x64  (64M hash)

Phe2 955  3.80Ghz DDR2-1066 2chu NPS: 1268395
Core i7   2.93Ghz DDR3-1066 3ch  NPS: 1109525
P9500     2.53Ghz DDR2-800  2ch  NPS:  879489
Q6600     2.4Ghz  DDR2-800  2ch  NPS:  814383
Phe1 9650 2.33Ghz DDR2-800  2chu NPS:  725799

Crafty 23.0 MSVC x64 (384M hash)

Phe2 955  3.80Ghz DDR2-1066 2chu NPS: 4040253
Core i7   2.93Ghz DDR3-1066 3ch  NPS: 3497355
P9500     2.53Ghz DDR2-800  2ch  NPS: 2676681
Q6600     2.4Ghz  DDR2-800  2ch  NPS: 2401037
Phe1 9650 2.33Ghz DDR2-800  2chu NPS: 2045978

RASML 250000000 x cores

Phe2 955  3.80Ghz DDR2-1066 2chu ns:     122
Core i7   2.93Ghz DDR3-1066 3ch  ns:      73
P9500     2.53Ghz DDR2-800  2ch  ns:     126
Q6600     2.4Ghz  DDR2-800  2ch  ns:     158 
Phe1 9650 2.33Ghz DDR2-800  2chu ns:     179

Legend:
Phe2 = Phenom II X4
P9500 = Core 2 mobile
Q6600 = Core 2 Quad
Phe1 = Phenom X4
RASML = random access shared memory latency (time for a hashtable probe)
2ch / 3ch = dual / triple channel
2chu = dual channel unganged

Note that you have to take clock speeds into account. The Phenom II system was overclocked. The Core i7 was not, and it usually has headroom for overclocking.

Clock for clock (Deep Sjeng):

Core i7 3.0Ghz = 1136k
Core 2  3.0Ghz = 1042k
Phe2    3.0Ghz = 1001k
Phe1    3.0Ghz =  934k
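
The clock-for-clock figures can be reproduced with a minimal sketch like the one below. It assumes NPS scales linearly with core clock, which is only an approximation since memory speed does not scale with it. The names `RESULTS`, `scale_to_clock` and `normalized_knps` are made up for illustration; the input numbers are the Deep Sjeng results from the first table.

```python
# Deep Sjeng results from the table above: (clock in GHz, measured NPS).
RESULTS = {
    "Phe2 955": (3.80, 1268395),
    "Core i7": (2.93, 1109525),
    "P9500": (2.53, 879489),
    "Q6600": (2.40, 814383),
    "Phe1 9650": (2.33, 725799),
}


def scale_to_clock(nps, clock_ghz, target_ghz=3.0):
    """Rescale a measured NPS to a target clock.

    Assumption: NPS is proportional to core clock (memory is ignored).
    """
    return nps * target_ghz / clock_ghz


def normalized_knps(results, target_ghz=3.0):
    """Return {system: thousands of nodes per second at target_ghz}."""
    return {
        name: scale_to_clock(nps, clock, target_ghz) / 1000.0
        for name, (clock, nps) in results.items()
    }


if __name__ == "__main__":
    table = normalized_knps(RESULTS)
    # Print fastest first, one system per line.
    for name, knps in sorted(table.items(), key=lambda kv: -kv[1]):
        print(f"{name:10s} {knps:7.0f}k")
```

Under that linear-scaling assumption, the "Core 2" row of the clock-for-clock table corresponds to the P9500 result rather than the Q6600.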

Most interesting for me:

- Phenom II is very close to Core2 for chess. Core i7 is about 14% faster than Phenom II clock for clock (about 9% faster than Core 2).
- The memory latency of Phenom I is HUGE. Despite on-die controller!

I'm curious if anyone knows the reason for this last result.

The Crafty binary used is here: http://www.sjeng.org/ftp/crafty_MSVC.exe

Spock · Post by **Spock** » Thu May 21, 2009 2:20 pm

I have an HP ML115-G5 Server, which has a 2.1GHz Quad core Opteron 1352 in it, basically a Phenom 1 re-branded for the server market as I understand it. (It does not have the TLB bug). Anyway, I can confirm that it is indeed quite slow for chess, but of course that isn't the purpose of the box so it isn't an issue for me. It is no quicker than my older dual socket Opteron box (2 x dual cores) and I think clock for clock it may even be slower. OK the HP BIOS almost certainly has stability in mind rather than speed, yet I was surprised.

Another point: Crafty on the older dual-socket Opteron seems to perform better than expected. It detects the NUMA configuration and seems to like it.

Matthias Gemuh · Post by **Matthias Gemuh** » Thu May 21, 2009 3:06 pm

Gian-Carlo Pascutto wrote: The Crafty binary used is here: http://www.sjeng.org/ftp/crafty_MSVC.exe

You forgot to post the link to the Deep Sjeng that was used.

Matthias.

M ANSARI · Post by **M ANSARI** » Thu May 21, 2009 3:59 pm

Phenom is using DDR II while Core i7 is using DDR III. Also the Core i7 has quite a bit more cache per core. You can overclock the Core i7 memory from 1066 Mhz to 1600 Mhz and get some quite incredible memory bandwidth. Core i7 is a generation ahead of AMD Phenom in this.

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Thu May 21, 2009 5:04 pm

M ANSARI wrote: Phenom is using DDR II while Core i7 is using DDR III.

Speed of memory itself was the same in both systems (533Mhz).

M ANSARI wrote: Also the Core i7 has quite a bit more cache per core.

It's the opposite.

Phe2: 64+64K L1, 512K L2, 6M L3
i7:   32+32K L1, 256K L2, 8M L3

Shared cache of i7 is bigger but there is less cache per core.

M ANSARI wrote: You can overclock the Core i7 memory from 1066 Mhz to 1600 Mhz and get some quite incredible memory bandwidth. Core i7 is a generation ahead of AMD Phenom in this.

In latency, which is more important in chess, Core i7 wins easily. But you can put DDR3 in Phenom 2 systems. I just don't have access to any such machine. I don't think the bandwidth actually matters for chess.

bob · Post by **bob** » Thu May 21, 2009 7:21 pm

One note.

You might be able to find a Dell with a BIOS bug that makes turbo-boost actually valuable. I spent a week trying to figure out the performance issues on an i7 box, and finally discovered/proved that the BIOS was doing this backward. It would ramp up the clock when all cores were busy, not the other way around. So with 8 cores running on this box, it was more than 25% faster than when using just 6 (3 on each socket).

But in general, after that was fixed, I got better results with turbo-boost disabled. This let me compare the 1-core, 2-core, 4-core and 8-core speeds without the distortion of the one/two-core tests getting slightly overclocked while the 4-core test (on a single socket) runs at normal clock speed. This box was a 2.37ghz i7, and could overclock in two steps of 133mhz, which means that the one/two-core speeds were based on a 2.64ghz clock, while the four-core speeds were running at only 2.37ghz. Not good for benchmarking.
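
To put a number on that distortion, here is a rough sketch. It assumes NPS scales linearly with clock, and the two NPS values in the example are invented purely for illustration; only the 2.37ghz base clock and the 133mhz step come from the post. `turbo_clock` and `clock_corrected_speedup` are hypothetical helper names.

```python
BASE_GHZ = 2.37         # base clock quoted in the post
TURBO_STEP_GHZ = 0.133  # one turbo step, as quoted in the post


def turbo_clock(base_ghz, steps, step_ghz=TURBO_STEP_GHZ):
    """Effective clock after a number of turbo steps."""
    return base_ghz + steps * step_ghz


def clock_corrected_speedup(nps_one, clock_one_ghz, nps_many, clock_many_ghz):
    """Parallel speedup with the clock difference between the runs removed.

    Assumption: NPS is proportional to core clock.
    """
    return (nps_many / clock_many_ghz) / (nps_one / clock_one_ghz)


if __name__ == "__main__":
    one_core_clock = turbo_clock(BASE_GHZ, 2)  # 1-core run gets two steps
    four_core_clock = BASE_GHZ                 # 4-core run gets none

    # Invented NPS numbers, only to show the size of the effect.
    nps_1, nps_4 = 1_000_000, 3_400_000

    raw = nps_4 / nps_1
    fixed = clock_corrected_speedup(nps_1, one_core_clock, nps_4, four_core_clock)
    print(f"raw speedup {raw:.2f}, clock-corrected speedup {fixed:.2f}")
```

With turbo left on, the measured 1-to-4 core speedup understates the real parallel speedup by roughly the clock ratio, about 11% here.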

However, I ran that box for 24 hours non-stop with the bios bug, and both processors ran at the max overclocking for that long with no problems at all. With that bug, turbo-boost is actually worth something.

I also tested with HT on and off, but never used more than 8 threads since I had 8 physical cores. The linux scheduler was quite good and I could not measure any speed difference using 8 threads with HT on or HT off. In the past schedulers would make mistakes and run two threads on one physical core which is not good for performance, but current schedulers do not fall for this.

bob · Post by **bob** » Thu May 21, 2009 7:23 pm

Gian-Carlo Pascutto wrote: The memory latency of Phenom I is HUGE. Despite on-die controller! I'm curious if anyone knows the reason for this last result.

One issue is the 4-level virtual-to-real addressing tables AMD uses. A single memory access that results in a TLB miss turns into 5 memory accesses, 4 for the page tables, then one to actually fetch the data. You are probably measuring some of the latter depending on what memory test you used and how much memory you actually requested for the test run.
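
A back-of-the-envelope model of that cost might look like the sketch below. It is a worst case under stated assumptions: uniformly random probes, 4 KB pages, every page-table level fetched from memory on a miss (real CPUs usually cache parts of the walk), and a 512-entry TLB that is purely a hypothetical figure, not the size of any CPU in this thread. All function names are illustrative.

```python
def pages_covered(table_bytes, page_bytes=4096):
    """Number of pages spanned by the table (ceiling division)."""
    return -(-table_bytes // page_bytes)


def tlb_miss_rate(table_bytes, tlb_entries, page_bytes=4096):
    """Approximate miss rate for uniformly random probes.

    Assumption: the TLB holds tlb_entries of the table's pages at any time.
    """
    pages = pages_covered(table_bytes, page_bytes)
    return max(0.0, 1.0 - tlb_entries / pages)


def expected_accesses(miss_rate, walk_levels=4):
    """Memory accesses per probe: 1 for the data plus the page walk on a miss.

    Worst case: each of the walk_levels lookups goes to memory.
    """
    return 1.0 + miss_rate * walk_levels


if __name__ == "__main__":
    table = 250 * 1024 * 1024  # 250M table, as in the test discussed here
    miss = tlb_miss_rate(table, tlb_entries=512)  # 512 is hypothetical
    print(f"pages: {pages_covered(table)}, miss rate: {miss:.3f}, "
          f"accesses per probe: {expected_accesses(miss):.2f}")
```

The point of the model is only that with a table this large nearly every probe misses the TLB, so the measured latency includes page-walk traffic, not just the final data fetch.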

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Thu May 21, 2009 7:39 pm

bob wrote: One issue is the 4-level virtual-to-real addressing tables AMD uses. A single memory access that results in a TLB miss turns into 5 memory accesses, 4 for the page tables, then one to actually fetch the data. You are probably measuring some of the latter depending on what memory test you used and how much memory you actually requested for the test run.

Allocating 250M shared memory and then reading it randomly from 4 threads in parallel (=hash table access in a chessprogram). I don't think the 4 level tables are enabled by default in Windows?

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Thu May 21, 2009 7:46 pm

bob wrote: However, I ran that box for 24 hours non-stop with the bios bug, and both processors ran at the max overclocking for that long with no problems at all. With that bug, turbo-boost is actually worth something.

Turbo-boost is a bit poor man's overclocking. Since I'd overclock the CPUs to max stable anyway, Turbo mode won't provide any further advantage. But it's tricky in benchmarks because you think you're benchmarking a 2.93Ghz CPU, but it will actually run faster.

bob wrote: I also tested with HT on and off, but never used more than 8 threads since I had 8 physical cores. The linux scheduler was quite good and I could not measure any speed difference using 8 threads with HT on or HT off. In the past schedulers would make mistakes and run two threads on one physical core which is not good for performance, but current schedulers do not fall for this.

Was the performance difference really 0? With HT disabled, some on-chip buffers are no longer split in 2 and could help performance.

bob · Post by **bob** » Thu May 21, 2009 8:14 pm

Gian-Carlo Pascutto wrote: Turbo-boost is a bit poor man's overclocking. Since I'd overclock the CPUs to max stable anyway, Turbo mode won't provide any further advantage. But it's tricky in benchmarks because you think you're benchmarking a 2.93Ghz CPU, but it will actually run faster.

Gian-Carlo Pascutto wrote: Was the performance difference really 0? With HT disabled, some on-chip buffers are no longer split in 2 and could help performance.

I saw absolutely no measurable difference. With only 8 threads, linux is perfect in running one thread per physical core. I tried it with HT on and off and could find no speed difference at all. In years past this was not true and I generally always had HT off on any box that supported it (my dual PIV in my office has HT). I went back and re-tested that box with the new process scheduler and found the same results there as well, HT on doesn't hurt a thing unless you actually run enough threads to use the logical processors and then things start to go bad.

The only issue that does come up is that with the dual-socket box I had, 4 cores per socket, linux will run two threads on two physical sockets (Linux calls these "packages") which means each thread gets its own core, L1D/L1I, L2 and L3 caches, which is not so good. Two threads sharing data, using separate L3 caches is problematic since the two caches generate a lot of "forwarding" traffic back and forth as shared values are modified/accessed.

For chess I would rather run two threads on the same "package" and the same for four, only going to the second "package" when adding the 5th (and up) threads. But that was the only issue I encountered and it is irrelevant on my office box which has two "packages" with one core per package.
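
The placement preferred here can be sketched as a small core-ordering helper. The topology below (two packages of four cores, numbered 0-3 and 4-7) is an assumed example; real core numbering has to be read from the system, and the function names are made up for illustration.

```python
def package_first_order(packages):
    """Core IDs ordered so one package is filled before the next is used."""
    return [core for package in packages for core in package]


def spread_order(packages):
    """Round-robin across packages: the placement described as problematic."""
    order = []
    widest = max((len(p) for p in packages), default=0)
    for i in range(widest):
        for package in packages:
            if i < len(package):
                order.append(package[i])
    return order


def cores_for_threads(n_threads, packages):
    """Pick cores for n_threads search threads, filling packages in turn."""
    order = package_first_order(packages)
    if n_threads > len(order):
        raise ValueError("more threads than physical cores")
    return order[:n_threads]


if __name__ == "__main__":
    # Assumed topology: 2 packages x 4 cores.
    topology = [[0, 1, 2, 3], [4, 5, 6, 7]]
    for n in (2, 4, 5):
        print(n, "threads ->", cores_for_threads(n, topology))
```

On Linux the resulting core list could then be applied with something like `os.sched_setaffinity`, so that the first four threads share one L3 and the second package is only touched from the fifth thread on.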

The box I had did not support overclocking. There were no BIOS options to tweak clock, memory, voltages, etc. I think the "extreme" versions provide that level of control, but not the box I had. I could enable SMT and TB, or disable either or both, but that was all the control it allowed. Of course that was a 2u rack-mount server box where one would likely not want to overclock anyway.