Strongest MPI-capable (cluster) engine?

Discussion of chess software programming and technical issues.

Moderator: Ras

diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Strongest MPI-capable (cluster) engine?

Post by diep »

abulmo wrote:
diep wrote:It is the same CPU core from a game tree search viewpoint. So your code should be equally fast. The only difference is a built-in memory controller. That should be a few percent for you, not a factor 2 difference in speed.


When reading technical articles, there are more differences, some coming from the intermediate CPU architecture.
* µop cache (faster instruction bandwidth)
* better branch prediction unit
* new instructions (popcount makes my program 5% faster).
* faster memory (DDR3 vs DDR2).
* built-in system agent
* hyperthreading
* etc.
diep wrote:Of course, run it at 1 core, as you are comparing an 8-thread CPU with a 4-thread one now.

Of course not. A fair comparison is not to disable half of the capabilities of the Sandy Bridge. Both CPUs are 4 cores. One can support 8 threads, the other not. The fair comparison is 8 threads against 4 threads. That said, the HT acceleration is only 20% (vs 75% if using 8 real cores).

There are many small improvements that add up to make the CPU run my program 50% faster.
If just the hardware popcount improves your program by 5% in speed, you're doing something wrong. Doing it incrementally would then probably speed you up 20% or so.

The RAM shouldn't be a big issue; it should only give a few % or so. If it gives more, then you're obviously doing something wrong.

Saying a processor is stronger because of hyperthreading is IMHO a wrong argument. First of all, very few engines benefit from hyperthreading.

So far we have only had some guys who work at Intel or Microsoft claiming serious advantages from hyperthreading. Your name I never saw before. Where do you work?

Even some who brag publicly that it works for them don't use it on their tournament machine.

The 'bigger bandwidth' for instructions is not true for chess, of course. It brings you 0 extra instructions a cycle. Saying something about better branch prediction is not serious if you're using bitboards. That will also bring you 0 benefit.

If the i7 really were a better processor for integer code, then obviously Diep would also be a lot faster on it, yet it isn't. Even the tiniest improvement in branch prediction I would notice, yet I don't.

Same for other codes.

The 2 huge differences are only the hyperthreading and the built-in memory controller (DDR3 versus DDR2). DDR3 can already abort a read prematurely after reading 32 bytes. So we already realize here that your hashtables must be implemented pretty weakly, as a single read must be under 32 bytes, otherwise the DDR3 wouldn't show its maximum potential. Even then you should just optimize your program for the CPU, not vice versa.

Hyperthreading I discussed above.

Probably you have a weak form of SMP.

In the first place you should optimize your engine, not look for hardware that can run your crapcode faster.
abulmo
Posts: 151
Joined: Thu Nov 12, 2009 6:31 pm

Re: Strongest MPI-capable (cluster) engine?

Post by abulmo »

diep wrote:If just the hardware popcount improves your program by 5% in speed, you're doing something wrong. Doing it incrementally would then probably speed you up 20% or so.

The RAM shouldn't be a big issue; it should only give a few % or so. If it gives more, then you're obviously doing something wrong.
The difference between an Othello program and a chess program probably lies here. I have a lot of things to count: discs (near the leaves), moves (mobility is great for move sorting), stable discs, etc. Popcount is thus called very often. In the past I implemented incremental disc counting, but it was slower, even against a software popcount. Memory is also used more than in chess, as the game knowledge is contained in a huge array (>100 MB) accessed at the leaves (in the midgame). I do not think I am doing things wrong, nor are you. My point of view is just that we have different experiences, and that there is not a single truth.
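
For illustration, this is roughly what such popcount-heavy bitboard code looks like: a disc-difference and a mobility term, using the hardware popcount where the compiler enables it and a software fallback otherwise. The board layout and names below are hypothetical, not taken from any particular engine.

Code: Select all

#include <stdint.h>

/* Hypothetical Othello position: one 64-bit bitboard per side. */
typedef struct {
    uint64_t own;       /* discs of the side to move */
    uint64_t opponent;  /* discs of the other side   */
} Board;

/* Hardware popcount if the compiler targets it, software fallback otherwise. */
static inline int popcount64(uint64_t b)
{
#if defined(__POPCNT__) || defined(__SSE4_2__)
    return (int)__builtin_popcountll(b);    /* emits the POPCNT instruction */
#else
    /* classic SWAR bit-counting fallback */
    b = b - ((b >> 1) & 0x5555555555555555ULL);
    b = (b & 0x3333333333333333ULL) + ((b >> 2) & 0x3333333333333333ULL);
    b = (b + (b >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (int)((b * 0x0101010101010101ULL) >> 56);
#endif
}

/* Disc-difference term of a leaf evaluation: two popcounts per call. */
static int disc_difference(const Board *pos)
{
    return popcount64(pos->own) - popcount64(pos->opponent);
}

/* Mobility term, given precomputed legal-move bitboards for both sides:
   another two popcounts, typically evaluated at every node. */
static int mobility(uint64_t own_moves, uint64_t opp_moves)
{
    return popcount64(own_moves) - popcount64(opp_moves);
}

With terms like these computed at or near every leaf, it is plausible that a single-instruction popcount adds up to a measurable fraction of total time.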
diep wrote:The 'bigger bandwidth' for instructions is not true for chess, of course. It brings you 0 extra instructions a cycle. Saying something about better branch prediction is not serious if you're using bitboards. That will also bring you 0 benefit.

This is not obvious to me. Alpha-beta-like algorithms are full of branches.
diep wrote:If the i7 really were a better processor for integer code, then obviously Diep would also be a lot faster on it, yet it isn't. Even the tiniest improvement in branch prediction I would notice, yet I don't.
One behaviour with one program is not a universal truth. The proof is that my program runs faster on a Sandy Bridge than on a Core2 CPU.
diep wrote:The 2 huge differences are only the hyperthreading and the built-in memory controller (DDR3 versus DDR2). DDR3 can already abort a read prematurely after reading 32 bytes. So we already realize here that your hashtables must be implemented pretty weakly, as a single read must be under 32 bytes, otherwise the DDR3 wouldn't show its maximum potential. Even then you should just optimize your program for the CPU, not vice versa.
I hope that my hashtables are not so weakly implemented. An entry is 16 bytes long. But as the hashtable is multiway, I do indeed have to read several of them.
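
To make this concrete, here is a minimal sketch of what a 16-byte entry and a multiway probe can look like, assuming 4 entries are packed into one 64-byte cache line so that a multiway probe still touches a single line. The field names and layout are hypothetical, not the actual format of either program.

Code: Select all

#include <stddef.h>
#include <stdint.h>
#include <stdalign.h>

/* Hypothetical 16-byte transposition table entry. */
typedef struct {
    uint64_t key;     /* hash signature                    */
    int16_t  score;   /* stored evaluation                 */
    uint8_t  depth;   /* search depth of the stored result */
    uint8_t  flags;   /* bound type, age, ...              */
    uint32_t move;    /* best move found                   */
} Entry;              /* 8 + 2 + 1 + 1 + 4 = 16 bytes */

/* 4-way bucket aligned to a 64-byte cache line: one probe, one line. */
typedef struct {
    alignas(64) Entry slot[4];
} Bucket;

/* Probe: a single cache line is loaded, then all 4 ways are scanned. */
static const Entry *probe(const Bucket *table, uint64_t mask, uint64_t key)
{
    const Bucket *b = &table[key & mask];   /* mask = number_of_buckets - 1 */
    for (int i = 0; i < 4; i++)
        if (b->slot[i].key == key)
            return &b->slot[i];
    return NULL;
}

Whatever the memory controller does, the whole 64-byte cache line is transferred anyway, so packing the ways into one line keeps a multiway probe at a single memory access.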

diep wrote:Hyperthreading I discussed above. Probably you have a weak form of SMP.
I am not especially happy with this part of the code (it probably needs a rework to be cleaner and easier to understand), but it looks efficient and seems to scale correctly.
diep wrote:In the first place you should optimize your engine, not look for hardware that can run your crapcode faster.
There is probably some room to make my program faster. Nevertheless, my program is among the fastest, if not the fastest, of the available Othello programs I know about (and I know many of them), even on slow hardware.
Richard
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Strongest MPI-capable (cluster) engine?

Post by diep »

[snip]

It's obvious the i7 doesn't have bigger internal bandwidth than Core2 for integer instructions working on 64 bits of data.

Very few things can get vectorized.

From practical tests, if you aren't making big mistakes as a programmer, you see the code, especially branchy code, executes at the same speed on Core2 as on i7.

0% difference there.

What we do know is that floating-point multiplication has better *throughput*.

I assume you're not using floating point nor are you interested in throughput of the multiplier...

So there is 0 difference there between the individual cores of an L5420 and an i7.

It's just paper BS from Intel that the i7 has a higher IPC.

What is a big difference is the on-die memory controller, but in game tree search you can optimize your software for having it or not having it.

If you look at accurate tests at Lostcircuits you see clearly that the IPC of Diep isn't higher for the same number of processes.

Reducing the hyperthreading, as you can see in earlier posts here, proves that clearly.
abulmo
Posts: 151
Joined: Thu Nov 12, 2009 6:31 pm

Re: Strongest MPI-capable (cluster) engine?

Post by abulmo »

diep wrote: If you look at accurate tests at Lostcircuits you see clearly that the IPC of Diep isn't higher for the same number of processes.

Reducing the hyperthreading, as you can see in earlier posts here, proves that clearly.
I looked at the Diep benchmark at Lostcircuits. I did some computations in a spreadsheet to see what Diep is sensitive to, using the first 21 CPUs.
Taking into account only #cores & GHz explains only 43% of Diep's NPS.
I then took the Intel Core2 as a reference, and looked at the presence of the following features, to see what acceleration they provide:

Code: Select all

 - being an Intel CPU              0% (reference)
 - being an AMD                   -8.19%
 - being a Core2                   0% (reference)
 - being a Bulldozer             -38.15%
 - being a Sandy/Ivy Bridge       +0.45%
 - being a Nehalem                -5.6%
 - having HT                     +28.17%
Taking all these into account makes it possible to explain 99% of Diep's NPS.

Some precautions should be taken in analyzing the data, as some features are correlated, for example having hyperthreading and being a Nehalem or a Sandy Bridge. However, it is clear to me that Diep is more efficient on a Sandy Bridge with HT available than on any other configuration.
It is also crystal clear that the number of cores and the GHz are not enough to explain Diep's speed.
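
For illustration, a rough sketch of the kind of fit behind these numbers, assuming a simple multiplicative model NPS ~ k * cores * GHz * (1 + feature adjustment) and a least-squares estimate of the scale factor k. The coefficients and sample data below are placeholders, not the actual Lostcircuits measurements.

Code: Select all

#include <stdio.h>

/* One benchmarked CPU: measured NPS plus the predictors used in the fit. */
typedef struct {
    const char *name;
    double nps;       /* measured nodes per second                                 */
    int    cores;
    double ghz;       /* clock with all cores active                               */
    double feature;   /* summed adjustment, e.g. +0.28 for HT, -0.38 for Bulldozer */
} Sample;

int main(void)
{
    /* Placeholder data for the sake of the example. */
    Sample cpus[] = {
        { "core2-like",     1000000.0, 4, 3.0,  0.00         },
        { "nehalem-like",   1150000.0, 4, 3.2,  0.28 - 0.056 },  /* HT + Nehalem */
        { "bulldozer-like",  900000.0, 8, 3.6, -0.38         },
    };
    int n = (int)(sizeof cpus / sizeof cpus[0]);

    /* Least-squares fit of the single scale factor k in nps ~ k * x. */
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; i++) {
        double x = cpus[i].cores * cpus[i].ghz * (1.0 + cpus[i].feature);
        num += x * cpus[i].nps;
        den += x * x;
    }
    double k = num / den;

    for (int i = 0; i < n; i++) {
        double pred = k * cpus[i].cores * cpus[i].ghz * (1.0 + cpus[i].feature);
        printf("%-15s measured %8.0f  predicted %8.0f\n",
               cpus[i].name, cpus[i].nps, pred);
    }
    return 0;
}

The "explains 43%" and "explains 99%" figures above would then correspond to how much of the variance such a model captures without and with the feature terms.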
Richard
syzygy
Posts: 5911
Joined: Tue Feb 28, 2012 11:56 pm

Re: Strongest MPI-capable (cluster) engine?

Post by syzygy »

abulmo wrote:One behaviour with one program is not a universal truth. The proof is that my program runs faster on a Sandy Bridge than on a Core2 CPU.
It is the same for my single-threaded (bitboard) chess engine. The speed increase is very substantial.
diep wrote:The 2 huge differences are only the hyperthreading and the built-in memory controller (DDR3 versus DDR2). DDR3 can already abort a read prematurely after reading 32 bytes. So we already realize here that your hashtables must be implemented pretty weakly, as a single read must be under 32 bytes, otherwise the DDR3 wouldn't show its maximum potential. Even then you should just optimize your program for the CPU, not vice versa.
I hope that my hashtables are not so weakly implemented. An entry is 16 bytes long. But as the hashtable is multiway, I do indeed have to read several of them.
The cache line must be filled, so 64 bytes are read?
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Strongest MPI-capable (cluster) engine?

Post by diep »

abulmo wrote:
diep wrote: If you look at accurate tests at Lostcircuits you see clearly that the IPC of Diep isn't higher for the same number of processes.

Reducing the hyperthreading, as you can see in earlier posts here, proves that clearly.
I looked at the Diep benchmark at Lostcircuits. I did some computations in a spreadsheet to see what Diep is sensitive to, using the first 21 CPUs.
Taking into account only #cores & GHz explains only 43% of Diep's NPS.
I then took the Intel Core2 as a reference, and looked at the presence of the following features, to see what acceleration they provide:

Code: Select all

 - being an Intel CPU              0% (reference)
 - being an AMD                   -8.19%
 - being a Core2                   0% (reference)
 - being a Bulldozer             -38.15%
 - being a Sandy/Ivy Bridge       +0.45%
 - being a Nehalem                -5.6%
 - having HT                     +28.17%
Taking all these into account makes it possible to explain 99% of Diep's NPS.

Some precautions should be taken in analyzing the data, as some features are correlated, for example having hyperthreading and being a Nehalem or a Sandy Bridge. However, it is clear to me that Diep is more efficient on a Sandy Bridge with HT available than on any other configuration.
It is also crystal clear that the number of cores and the GHz are not enough to explain Diep's speed.
For your calculation you can adjust for the following.

AMD CPUs do not profit from AMD's turbo: the boost is turned on, but it doesn't deliver when Diep runs on all cores.

So a 3.x GHz Bulldozer CPU is also actually clocked at 3.x GHz.

You will see the IPC of 2 Bulldozer mini-cores is nearly identical to that of 2 logical i7 cores.

The really confusing thing is that at Intel's higher frequencies, hyperthreading seems to give more than at lower frequencies.

We see the same problem in SPECint.

So for Intel it's crucial to get tested at higher frequencies than the CPU's official clock.

With turboboost they enforce that. The testers usually simply force the motherboard to 3.9 GHz or something, even if the chip officially is 3.3 GHz. And if you run that CPU at home and use all cores, then turboboost simply won't work, as using all cores consumes all the power.

Normal users cannot force this; you can on the test machines.

If you want to force a normal chip to 3.9 GHz or whatever its maximum turbo frequency is, you need an unlocked CPU. The CPUs you can get in the shops aren't 100% unlocked, however.

Note that this doesn't happen with all Intel CPUs, so you have to do some calculating to figure out which frequency it was running at with all cores active.

Turboboost in the initial i7s worked in stages. So the results initially were for all cores with hyperthreading at 200 MHz or so above the normal clock, whereas the turboboost figure displayed on Intel's ARK website is just for 1 core, which is a higher frequency.

Take the i7-965: it is 3.2 GHz. For Diep at 8 logical cores it was 3.33 GHz or so, so it didn't get tested forced to 3.46 GHz.

With later CPUs this all got weirder and weirder. Nowadays it's a 600 MHz difference.

Most of the CPU performance numbers there are totally artificial nowadays, as this turboboost is getting completely out of hand.

And in order to show the next CPU to be better than this one, they will move from 3.9 GHz to 4.0 GHz for the next one, even if they sell it as 3.0 GHz in the shops.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Strongest MPI-capable (cluster) engine?

Post by diep »

abulmo wrote:
diep wrote: If you look at accurate tests at Lostcircuits you see clearly that the IPC of Diep isn't higher for the same number of processes.

Reducing the hyperthreading, as you can see in earlier posts here, proves that clearly.
I looked at the Diep benchmark at Lostcircuits. I did some computations in a spreadsheet to see what Diep is sensitive to, using the first 21 CPUs.
Taking into account only #cores & GHz explains only 43% of Diep's NPS.
I then took the Intel Core2 as a reference, and looked at the presence of the following features, to see what acceleration they provide:

Code: Select all

 - being an Intel CPU              0% (reference)
 - being an AMD                   -8.19%
 - being a Core2                   0% (reference)
 - being a Bulldozer             -38.15%
 - being a Sandy/Ivy Bridge       +0.45%
 - being a Nehalem                -5.6%
 - having HT                     +28.17%
Taking all these into account makes it possible to explain 99% of Diep's NPS.

Some precautions should be taken in analyzing the data, as some features are correlated, for example having hyperthreading and being a Nehalem or a Sandy Bridge. However, it is clear to me that Diep is more efficient on a Sandy Bridge with HT available than on any other configuration.
It is also crystal clear that the number of cores and the GHz are not enough to explain Diep's speed.
The Sandy Bridges boost with turboboost hundreds of megahertz higher each time, and they have 1 more memory channel: 4 for Sandy Bridge versus 3 for Nehalem.

The Diep version benchmarked isn't prefetching the hashtable, so it feels the latency quite a bit. Also, each process has its own local evaluation table of 20MB or so, with a pawn table inside it, also a local table.

So locally each core uses 20MB or so in tables; add on top of that a 1GB hashtable, which is global.
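
For what it's worth, a minimal sketch of the kind of hashtable prefetch meant here: issue the prefetch as soon as the child's hash key is known, do some other work, and probe once the line has had time to arrive. The helper names are hypothetical; __builtin_prefetch is the GCC/Clang builtin (MSVC would use _mm_prefetch).

Code: Select all

#include <stdint.h>

/* Hypothetical 64-byte hashtable bucket, as in the earlier sketch. */
typedef struct { uint64_t data[8]; } Bucket;

extern Bucket  *hash_table;   /* global shared hashtable */
extern uint64_t hash_mask;    /* number_of_buckets - 1   */

/* Start loading the bucket we will probe a little later. */
static inline void hash_prefetch(uint64_t key)
{
    __builtin_prefetch(&hash_table[key & hash_mask],
                       0 /* for reading */, 3 /* keep in all cache levels */);
}

/* Typical usage pattern inside the search (hypothetical helpers):
 *
 *     uint64_t child_key = hash_after_move(pos, move);
 *     hash_prefetch(child_key);    // start the memory access now
 *     make_move(pos, move);        // useful work hides the latency ...
 *     probe_hash(child_key, ...);  // ... so the line is likely cached here
 */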

If you do the comparison objectively you will see the Sandy Bridge CPU core isn't any faster at all than Core2. Just the memory subsystem is faster and it has a much higher clock.

Of course Core2 doesn't scale well because of the off-die memory controller. Clocking Core2 at 3 GHz or more is nonsense; it doesn't scale then.

The CPU core itself doesn't execute more instructions a second if we do not allow vectorization.

All this 'bigger bandwidth' is just based upon using AVX.

If you execute a simple integer chess program that isn't vectorizing, then of course it isn't executing more instructions a cycle.

Still, it can issue more 32-bit instructions a cycle than 64-bit instructions.

So 32-bit is faster.

To show proof, look at this:

980x versus i7-3960x.

The 3960x has 4 memory channels versus just 3 for the 980x.

If we divide the NPS by the turboboost clock frequency:

980x: 1852234 nps / 3.6 GHz = 514,509
3960x: 2032736 nps / 3.9 GHz = 521,214

That's a difference of just 1.3%.

So you see these CPUs simply run at their top frequency when getting tested nowadays. It has slowly changed over time.

Initially the i7s boosted just a little with all cores active; now on the test machines they boost 100% with all cores active.

Which improvements in bandwidth? Do they mean more bandwidth by means of overclocking?

No, nothing of that; 1.3% is very little.

I didn't see this 1.3% in your table.

I know from my own experience with overclocked i7s that this 1.3% can already be explained by improved hyperthreading when moving from the low 3.x GHz range to nearly 4 GHz, where it somehow gives a few percent more.

What explains it, I don't know. I just measure that Diep gets higher scaling out of hyperthreading at higher frequencies. At 2.x GHz hyperthreading is not even worth turning on: it gives extra, but you lose it again in the speedup.

At 4.x GHz (and 3.9 GHz is close to that) it goes up to over 30% for hyperthreading. That's measured on an i7-990x, OK? It's not some Sandy Bridge feature.

All i7s have it.

Now compare the output with the i7-965, which was run at 3.33 GHz according to the statement I got. Let's compensate linearly for the 2 additional cores, just for fun:

It's 1247903 * 1.5 / 3.33 = 562,118

So the old i7 has a BIGGER BANDWIDTH per core than the Sandy Bridge has. Right?

So the whole Intel story on bigger bandwidth is total nonsense if you compare it with the old i7s, as the 3 memory channels there serve just 8 logical cores, so on a per-core basis that's higher bandwidth.

It totally beats the Sandy Bridge there.

So that's why I say: don't look at this marketing nonsense from Intel.
Core2 = i7, as far as the execution core goes. Just the memory improved, not the execution core. It's still the same.

Adding a few instructions to the SIMD, if that makes the kids happy, then so be it. It's nothing substantial.
abulmo
Posts: 151
Joined: Thu Nov 12, 2009 6:31 pm

Re: Strongest MPI-capable (cluster) engine?

Post by abulmo »

diep wrote:Turboboost in the initial i7s worked in stages. So the results initially were for all cores with hyperthreading at 200 MHz or so above the normal clock, whereas the turboboost figure displayed on Intel's ARK website is just for 1 core, which is a higher frequency.
I adjusted the base frequency from data obtained here:
http://en.wikipedia.org/wiki/List_of_In ... processors

I assumed no overclocking was applied to them.
Richard
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Strongest MPI-capable (cluster) engine?

Post by diep »

abulmo wrote:
diep wrote:Turboboost in the initial i7s worked in stages. So the results initially were for all cores with hyperthreading at 200 MHz or so above the normal clock, whereas the turboboost figure displayed on Intel's ARK website is just for 1 core, which is a higher frequency.
I adjusted the base frequency from data obtained here:
http://en.wikipedia.org/wiki/List_of_In ... processors

I assumed no overclocking was applied to them.
Please investigate the SPECint results on sjeng for the i7-965, using the same compiler.

First what Intel showed up with, and also some other companies that overclocked it to 3.7 GHz.

If you calculate your way back from that 3.7 GHz how many GHz the i7-965 must have been running at, you're never again gonna believe benchmarks where turboboost exists :)

Benchmarked at 'base' it had 7% extra performance that's not explainable by any turboboost table from Intel.

Intel has many reasons not to be crystal clear on how turboboost works...

This i7-965 benchmark is a very important one in Intel's history. It was their first i7 release, so it had to be good.

I guess AMD CPUs eat too much power to overclock the test machines as much as Intel is doing...
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Strongest MPI-capable (cluster) engine?

Post by diep »

Let's suppose the following scenario.

This is a fictional example; no conclusions about reality apply.

Suppose someone proves that in order to get specific results, a new Intel CPU that's clocked on paper at 3.3 GHz had to be clocked at 4.3 GHz to get the result.

A big, long court case starts and some competing manufacturer, say AMD, shows up.

Clever Intel lawyer: "this chip turboboosted to 4.3 GHz".

End of court case.

Then some stuttering about how none of the CPUs that were bought and tested showed the same boost to 4.3 GHz with turboboost.

Intel: "turboboost can never be guaranteed and is different from CPU to CPU".

That's the problem with turboboost.

It gives a legal disclaimer and opens the door to cheating.

Ever seen a 100 billion dollar company that didn't cheat when they knew 100% sure they could talk their way out in court?

p.s.
If you CORRECTLY calculate back from the Diep benchmarks of the i7-965, you will figure out, and it makes sense if you compare it with other data, that the i7-965 turboboosted all cores to something near 3.6 GHz.

And your chip at home, if it's an i7-965, will NOT turboboost all cores to 3.6 GHz. As simple as that.

Intel had also spread some rumours back then that 1 core could boost to 3.6 GHz for the i7-965 and that usually not all cores would be able to reach this.

What ark.intel has now is totally IRRELEVANT.

Just extrapolate it in the same manner as I did.

From overclockers who only used air cooling and ran chess programs, we know that 3.6 GHz is close to the limit. Most managed 3.7 GHz in a stable manner, yet if you want 100% reliability, then 3.6 GHz was the limit.

Of course with watercooled overclocking you can get to the cache limit of 4.5 GHz, but I wouldn't advise crunching prime numbers on it then :)

So turboboosting it to 3.6 GHz really was touching the limits of that first series of i7s they released. So they boosted to it...

We see now that they boost to nearly or just over 4 GHz for benchmarks...