Ivy Bridge vs Sandy Bridge for computer chess

lkaufman
Posts: 5942
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Ivy Bridge vs Sandy Bridge for computer chess

Post by lkaufman »

syzygy wrote:
lkaufman wrote:When we did this, the ratios were pretty constant. In fact Komodo actually ran slightly BETTER on the 12 core relative to the other engines when tested this way. What does that suggest?
It might be that Komodo is more memory-bandwidth hungry, as Rein speculates. Turboboost kicking or not kicking in can also make a difference.
It sounds like the memory-bandwidth issue is the key one, as neither the old i7 nor the 12 core had Turboboost, so this can't explain the different performance on those two machines.
Is there any reason to think that Turboboost would not be fully effective for all chess engines? As far as we can tell, when Komodo is running the turboboost is in full force on the new machine.
syzygy
Posts: 5554
Joined: Tue Feb 28, 2012 11:56 pm

Re: Ivy Bridge vs Sandy Bridge for computer chess

Post by syzygy »

lkaufman wrote:
syzygy wrote:
lkaufman wrote:When we did this, the ratios were pretty constant. In fact Komodo actually ran slightly BETTER on the 12 core relative to the other engines when tested this way. What does that suggest?
It might be that Komodo is more memory-bandwidth hungry, as Rein speculates. Turboboost kicking or not kicking in can also make a difference.
It sounds like the memory-bandwidth issue is the key one, as neither the old i7 nor the 12 core had Turboboost, so this can't explain the different performance on those two machines.
As far as I can tell, all 6-core Westmere-EP Xeons have turboboost (link). I'm reasonably sure that your 4-core i7 also has turboboost.
Is there any reason to think that Turboboost would not be fully effective for all chess engines? As far as we can tell, when Komodo is running the turboboost is in full force on the new machine.
If an engine for some reason causes the CPU to consume more power than another engine, it will benefit less from turbo mode. See Intel Turbo Boost for more information. Both turboboost and memory bandwidth limitations are reasons why the performance of multicore CPUs does not scale perfectly with the number of threads being run. Another reason is the shared L3 cache.

If turboboost is fully effective on the new machine and not on the older machine when using many threads, that could be the explanation.

To eliminate Turboboost from the equation, you can (temporarily) disable it in the BIOS.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Ivy Bridge vs Sandy Bridge for computer chess

Post by diep »

lkaufman wrote:
diep wrote: Are they all running the same operating system, so all Linux or all Windows? And in case both are Windows, which version runs on which?

How many processes do you run at the same time? A match is 2 processes, so 1 program playing another?

So 12 matches run 24 processes?

How many physical cores does each machine have and how many logical cores are enabled?

You can turn off hyperthreading - some HPC centers turn off hyperthreading on newer machines with many cores.

So are we comparing the same things here?

Maybe you ran a machine with hyperthreading, so 12 cores @ 12 logical cores and compared with 16 physical cores @ 32 logical cores.

Figuring this out is important.
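
One way to check the physical versus logical core count on Linux is to count the distinct (physical id, core id) pairs in /proc/cpuinfo. The sketch below is only an illustration: the function name count_cores is made up here, and it assumes the x86 Linux layout of /proc/cpuinfo, where each "processor" entry lists "physical id" before "core id".

```python
def count_cores(cpuinfo_text):
    """Return (physical_cores, logical_cores) parsed from /proc/cpuinfo text.

    Assumes the x86 Linux layout, where every "processor" entry carries a
    "physical id" (socket) line followed later by a "core id" line.
    """
    logical = 0
    physical = set()
    socket_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            socket_id = value
        elif key == "core id":
            # Two hyperthreads of one core share the same (socket, core) pair.
            physical.add((socket_id, value))
    return len(physical), logical


def read_core_counts(path="/proc/cpuinfo"):
    """Convenience wrapper that reads the live file on a Linux machine."""
    with open(path) as f:
        return count_cores(f.read())
```

If the two numbers are equal, hyperthreading is off (or not supported); if the logical count is double the physical count, it is on.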

Also important is the RAM. I see so many companies that sell hardware put total junk RAM into machines, usually clocked at the minimum the machine can handle.

For chess this is a big difference if you run that many matches at the same time.

Add to that, most profilers do not factor in the RAM and even get the time spent in functions wrong, because they don't account for which function takes a cache miss and when. Intel's VTune suffers relatively little from this phenomenon.

Which types of RAM does each machine have?

A simple way to find out is to run a test program that benchmarks the RAM in parallel, so on all cores at the same time.

The only test on the planet I know of that does this, as 99.9% of them run on only 1 single core, is one I wrote. If you give me an email address I can email it to you.

I wrote it to benchmark on the supercomputer.

Now another big problem is when the machine is located in a HPC center.

What happens there is that a machine has huge RAM and with the chess program you eat relatively little of it.

So they also give some other user a big part of the RAM. That will screw your bandwidth. The above RAM test would notice this directly (at the moment that this happens - not if it happens later on).

Today's buzzword for screwing you that way on a machine is virtualization. Commercial parties are world champions at doing this.

(Did the name Amazon echo in the corridor?)

In case it's a government HPC center I assume you don't have admin access rights but Don with some ps type commands might be able to see this when it happens. You can easily hack all those linux HPC machines. They run always stable solid kernels with everything enabled including selfhacking.

The problem in HPC centers is that you usually do not have the guarantee that you run exclusively on a given node. So as soon as a few cores seem idle, it will schedule other jobs there as well.
The computers are my own, in my own home, no other users or uses. The new one cost just over $5,000.
Both the 12 core and the 16 core have two Xeon processors, each with six and eight cores respectively. Both were bought from the same company, one that uses quality parts. Both used the best RAM available at the time (excluding any hyper-expensive RAM). The 16 core is 1.5 years newer so presumably has somewhat better RAM. They all run Ubuntu Linux, each with the version that was out when the machine was made (so not the same). Normally we run one test per thread, so 24 tests on the 12 core and 32 on the 16 core. We keep hyperthreading on; it may not be helpful for an MP program but it is clearly helpful for multiple tests of a single core engine. The new machine is Ivy Bridge, the old one Sandy Bridge. Somehow, the new machine "likes" Komodo and the older one "likes" Houdini, Critter, Ivanhoe, and Stockfish more. My standard off-the-shelf i7 acts more like the new machine, i.e. is more friendly to Komodo.
Can you think of anything that would account for this?
One 'test' involves 1 engine of the opponent and 1 komodo and pondering is turned on?

So it is 2 processes running at the same time. Is that correct?

You run 64 processes on 32 logical cores, is that correct?
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Ivy Bridge vs Sandy Bridge for computer chess

Post by diep »

Modern Times wrote:Well, there have been significant changes to the Linux kernel in 18 months, so that could be a factor.

Leaving hyperthreading on also introduces an element of unpredictability.

Just my thoughts, no firm theories or conclusions.
Everyone agrees with you though :)

If I tested Diep in this manner, sometimes a process would get nailed by being scheduled on a core that hosts another process, and other times a process would have a full core to itself with nothing else scheduled on it. That gives huge unpredictability in every match.

You would need 10 million+ games per match, at around 10 minutes a game, to say anything significant about one tiny modification, just because of the added unpredictability :)

Only using a constant number of nodes per search would remove a lot of that noise, yet that has other drawbacks :)
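
To put rough numbers on how the number of games relates to what you can measure, here is a back-of-the-envelope sketch. It is not anyone's actual testing tool: the function names are made up, it assumes independent games with a score near 50%, and it uses the standard logistic Elo formula with a worst-case per-game standard deviation of 0.5 (draws only make this smaller). Extra scheduling noise as described above is not modelled.

```python
import math

# Slope of the logistic Elo curve at a 50% score: d(Elo)/d(score).
# Elo(p) = 400 * log10(p / (1 - p)), so the slope at p = 0.5 is 400 / (ln(10) * 0.25).
ELO_PER_SCORE_AT_50 = 400.0 / (math.log(10) * 0.25)


def elo_margin(n_games, per_game_sd=0.5, z=1.96):
    """Approximate 95% error margin in Elo after n_games (score near 50%)."""
    score_se = per_game_sd / math.sqrt(n_games)
    return z * score_se * ELO_PER_SCORE_AT_50


def games_needed(target_elo, per_game_sd=0.5, z=1.96):
    """Approximate number of games for a 95% margin of target_elo."""
    return math.ceil((z * per_game_sd * ELO_PER_SCORE_AT_50 / target_elo) ** 2)


if __name__ == "__main__":
    for n in (1000, 10000, 100000):
        print(n, "games: +/-", round(elo_margin(n), 2), "Elo")
```

The margin shrinks only with the square root of the number of games, so halving it takes four times as many games.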
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Ivy Bridge vs Sandy Bridge for computer chess

Post by diep »

lkaufman wrote:
syzygy wrote:
lkaufman wrote:When we did this, the ratios were pretty constant. In fact Komodo actually ran slightly BETTER on the 12 core relative to the other engines when tested this way. What does that suggest?
It might be that Komodo is more memory-bandwidth hungry, as Rein speculates. Turboboost kicking or not kicking in can also make a difference.
It sounds like the memory-bandwidth issue is the key one, as neither the old i7 nor the 12 core had Turboboost, so this can't explain the different performance on those two machines.
Is there any reason to think that Turboboost would not be fully effective for all chess engines? As far as we can tell, when Komodo is running the turboboost is in full force on the new machine.
In general, 2-socket machines will NEVER turboboost if you use more than 2 cores from each physical CPU.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Ivy Bridge vs Sandy Bridge for computer chess

Post by diep »

syzygy wrote:
lkaufman wrote:
syzygy wrote:
lkaufman wrote:When we did this, the ratios were pretty constant. In fact Komodo actually ran slightly BETTER on the 12 core relative to the other engines when tested this way. What does that suggest?
It might be that Komodo is more memory-bandwidth hungry, as Rein speculates. Turboboost kicking or not kicking in can also make a difference.
It sounds like the memory-bandwidth issue is the key one, as neither the old i7 nor the 12 core had Turboboost, so this can't explain the different performance on those two machines.
As far as I can tell, all 6-core Westmere-EP Xeons have turboboost (link). I'm reasonably sure that your 4-core i7 also has turboboost.
Is there any reason to think that Turboboost would not be fully effective for all chess engines? As far as we can tell, when Komodo is running the turboboost is in full force on the new machine.
If an engine for some reason causes the CPU to consume more power than another engine, it will benefit less from turbo mode. See Intel Turbo Boost for more information. Both turboboost and memory bandwidth limitations are reasons why the performance of multicore CPUs does not scale perfectly with the number of threads being run. Another reason is the shared L3 cache.

If turboboost is fully effective on the new machine and not on the older machine when using many threads, that could be the explanation.

To eliminate Turboboost from the equation, you can (temporarily) disable it in the BIOS.
Intel is very fuzzy about turboboost in general and how it works for n cores.

This is because they 'turboboost' their test machines in a manner that user CPUs do not.

For example, on paper the i7-965 turboboosts to 3.46 GHz, yet back then Intel's test machines turboboosted ALL cores to 3.6 GHz, so Intel was very reluctant to give much information about this.

What they write on paper doesn't reflect reality.

Then they spread the rumour that for 1 core it could boost by around 400 MHz and for more than 1 core it could boost by around 200 MHz.

The dual-socket machines back then didn't turboboost at all in production environments, though on paper they can turboboost.

Yet when such CPUs get tested, they usually use a single-socket motherboard, and then suddenly it boosts to the maximum they can boost to without getting big court cases.

We see that in newer CPUs they use turboboost for benchmarks more and more.

They just push it each time, and if you buy something that is going to crunch, you simply won't benefit from it.

Vincent
lkaufman
Posts: 5942
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Ivy Bridge vs Sandy Bridge for computer chess

Post by lkaufman »

diep wrote: One 'test' involves 1 engine of the opponent and 1 komodo and pondering is turned on?

So it is 2 processes running at the same time. Is that correct?

You run 64 processes on 32 logical cores, is that correct?
No, we test with pondering off.
lkaufman
Posts: 5942
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Ivy Bridge vs Sandy Bridge for computer chess

Post by lkaufman »

diep wrote:
lkaufman wrote:
syzygy wrote:
lkaufman wrote:When we did this, the ratios were pretty constant. In fact Komodo actually ran slightly BETTER on the 12 core relative to the other engines when tested this way. What does that suggest?
It might be that Komodo is more memory-bandwidth hungry, as Rein speculates. Turboboost kicking or not kicking in can also make a difference.
It sounds like the memory-bandwidth issue is the key one, as neither the old i7 nor the 12 core had Turboboost, so this can't explain the different performance on those two machines.
Is there any reason to think that Turboboost would not be fully effective for all chess engines? As far as we can tell, when Komodo is running the turboboost is in full force on the new machine.
In general, 2-socket machines will NEVER turboboost if you use more than 2 cores from each physical CPU.
I checked this out with a special program installed for me by the manufacturer to check the speed; when Komodo is being tested the speed goes up from the nominal 2.6 GHz to 3.0 GHz, which is the max turbo speed when all cores are in use. If only 2 cores per processor are in use it would be 3.3 GHz.
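
For anyone without such a vendor tool, a rough cross-check on Linux is to sample the "cpu MHz" lines of /proc/cpuinfo while the engines are running. This is only a sketch under assumptions: the function names are made up, and depending on the kernel version and frequency driver the reported value may not reflect turbo frequencies accurately, so treat it as a hint rather than a measurement.

```python
import re


def core_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value reported for each logical CPU, in file order."""
    pattern = r"^cpu MHz\s*:\s*([\d.]+)"
    return [float(v) for v in re.findall(pattern, cpuinfo_text, re.MULTILINE)]


def cores_above(cpuinfo_text, nominal_mhz):
    """Count the logical CPUs currently reported above the nominal clock."""
    return sum(1 for mhz in core_mhz(cpuinfo_text) if mhz > nominal_mhz)


def read_core_mhz(path="/proc/cpuinfo"):
    """Convenience wrapper that reads the live file on a Linux machine."""
    with open(path) as f:
        return core_mhz(f.read())
```

On the machine described above, one would call cores_above(text, 2600.0) while a test is running and see how many logical CPUs report more than the nominal 2.6 GHz.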
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Ivy Bridge vs Sandy Bridge for computer chess

Post by bob »

sje wrote:As I've read, the main delta from Sandy to Ivy is a better calculation/watt efficiency and better integrated graphics. I don't see much, if any, difference in chess playing strength.
The only real issue is that as the # of cores goes up, accessing memory becomes more and more problematic. Programs that have a small memory footprint for their primary "kernel" will do better since they can run out of local cache and stay off the path to memory where the conflicts really hurt.

The one thing I ALWAYS test is to play a few games, Crafty VS Crafty, to establish speed. I play one game at a time, no pondering or parallel search used. I then re-run the test but play two games at a time. Then 4, then 8, etc, up to one game per core. It is not uncommon to see the 8 vs 8 or 16 vs 16 games show a speed reduction which is a direct result of memory access bottlenecks. I do that so that I can understand the theoretical max speedup for a parallel search. If the 16 vs 16 shows significant degradation (adding up total NPS should show the same for 12 vs 12 as it does for 16 vs 16 if everything is well-balanced with no bottlenecks) then I can at least see that a speedup beyond 12x is not possible, rather than trying to figure out later what is not working in my parallel search very efficiently... when it is not an issue of my parallel search at all.

When I run one process and get an NPS of X, I want to run 16 processes and see EACH one get an NPS of X. That indicates there are no hardware bottlenecks and any NPS drop in the parallel Crafty is a parallel search issue.
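
The bookkeeping for that test can be sketched as below. This is an illustration only, not Crafty's own tooling: the function name and the dictionary keys are made up, and the NPS figures have to be collected by hand from the engine logs of each run.

```python
def scaling_report(single_nps, parallel_nps):
    """Compare per-process NPS under load with the single-process baseline.

    single_nps   -- NPS measured with one process running alone
    parallel_nps -- list of NPS values, one per simultaneously running process
    """
    n = len(parallel_nps)
    total = sum(parallel_nps)
    return {
        "processes": n,
        "total_nps": total,
        # 1.0 means every process still runs at the single-process speed.
        "efficiency": total / (n * single_nps),
        # Upper bound on the speedup a parallel search could reach on n cores.
        "max_speedup": total / single_nps,
        # The slowest process relative to the baseline.
        "worst_ratio": min(parallel_nps) / single_nps,
    }


if __name__ == "__main__":
    # Hypothetical numbers: 16 processes that each drop to 75% of baseline speed.
    print(scaling_report(1_000_000, [750_000] * 16))
```

With those made-up numbers the total NPS is the same as 12 unhindered processes would give, so a parallel search on that box could not be expected to exceed a 12x speedup no matter how good the search is.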
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Ivy Bridge vs Sandy Bridge for computer chess

Post by diep »

lkaufman wrote:
diep wrote: One 'test' involves 1 engine of the opponent and 1 komodo and pondering is turned on?

So it is 2 processes running at the same time. Is that correct?

You run 64 processes on 32 logical cores, is that correct?
No, we test with pondering off.
That's gonna be a busy kernel scheduling 64 processes at 32 logical cores.

What time control do you test at?