Searching with many cores question

Cardoso
Posts: 363
Joined: Thu Mar 16, 2006 7:39 pm
Location: Portugal
Full name: Alvaro Cardoso

Searching with many cores question

Post by Cardoso »

Intel has announced an eight-core Xeon, and AMD plans to glue two six-core chips together to make a 12-core Opteron. With a multi-socket system using several of these chips, I guess we will run into problems with RAM being the bottleneck. I mean, with current DDR3 technology, how will chess programs benefit from a many-core system if RAM is too slow to access the transposition table?
Also, I haven't yet seen any announcement of a new RAM technology that will address this problem.
Those of you who have access to these kinds of systems, what is your experience with chess tree searching?

Thanks in advance,
Alvaro
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Searching with many cores question

Post by bob »

It is an issue, but so far it has not been a serious issue. The main point is that one needs to be very efficient in using cache, which is the thing that makes these machines work.
Cardoso
Posts: 363
Joined: Thu Mar 16, 2006 7:39 pm
Location: Portugal
Full name: Alvaro Cardoso

Re: Searching with many cores question

Post by Cardoso »

Thanks for your reply.
I do have Crafty in mind as a model for parallel search.
Let me give an example. A four-socket system with eight-core chips gives 32 real cores. How would Crafty scale on such a machine?

Alvaro
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Searching with many cores question

Post by bob »

On the one prototype I have run on (4 x 8-core), it scaled almost perfectly. It was running about 25M nodes per second on one socket, and reached 100M on four. This was on Nehalem-level hardware (obviously not Nehalem chips, which go to 4 cores at present). On the dual quad-core box I run on all the time I see the same results: 2.5-3M nps on one core, 20-24M on 8. This is the box that is running on ICC all the time.
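
As a rough check of what "almost perfectly" means for the figures quoted above, the speedup and parallel efficiency can be computed directly from the node rates. This is only a sketch: the function name is made up for illustration, and the inputs are the rounded nps numbers from the post, not fresh measurements.

```python
def scaling_efficiency(nps_base, units_base, nps_scaled, units_scaled):
    """Return (speedup, efficiency) when going from units_base to units_scaled.

    speedup    = measured nps ratio
    efficiency = speedup divided by the ideal (linear) speedup
    """
    speedup = nps_scaled / nps_base
    ideal = units_scaled / units_base
    return speedup, speedup / ideal


# One socket -> four sockets on the 4 x 8-core prototype (rounded figures).
speedup, eff = scaling_efficiency(25e6, 1, 100e6, 4)
print(f"4-socket prototype: {speedup:.1f}x speedup, {eff:.0%} efficiency")

# One core -> eight cores on the dual quad-core box, low and high ends
# of the quoted ranges (2.5-3M nps on one core, 20-24M on eight).
for one_core, eight_cores in ((2.5e6, 20e6), (3e6, 24e6)):
    speedup, eff = scaling_efficiency(one_core, 1, eight_cores, 8)
    print(f"dual quad-core: {speedup:.1f}x speedup, {eff:.0%} efficiency")
```

Note that nps scaling only shows that the hardware keeps all cores fed; it does not by itself measure how much faster the parallel search reaches a given depth.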
Cardoso
Posts: 363
Joined: Thu Mar 16, 2006 7:39 pm
Location: Portugal
Full name: Alvaro Cardoso

Many thanks (NT)

Post by Cardoso »

nt
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Searching with many cores question

Post by diep »

hi,

As Bob already explained, PCs by definition have low latency to RAM.

A node costs Crafty roughly 1000 cycles.
Crafty doesn't use the hash table in qsearch, which is, say, 70% to 80% of all nodes, so only 1 in every 4 to 5 nodes needs access.

So that means you're doing a probe and write action roughly every 4000 cycles in the case of Crafty.

A random RAM access on a single-socket Nehalem takes 60 to 70 ns (if you use superior RAM) with DDR3, which is a lot faster than DDR2 for reading, thanks to being able to drop the last 32 bytes of each 64-byte cache line if you don't need them (DDR2 cannot do this, so it spends extra cycles on them).

2-socket Nehalem I haven't tested yet; I would estimate it at 100 to 110 ns.
2-socket Core 2 Xeon Bob tested with the same program, and it is 206 ns.
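
To put these latencies next to the "one probe and write per roughly 4000 cycles" figure, one can convert nanoseconds to clock cycles. The sketch below makes several assumptions that are not in the post: a 2.66 GHz clock, midpoints of the quoted ranges (65 ns and 105 ns), and one full memory stall per probe-and-write pair. The names are hypothetical.

```python
CYCLES_PER_PROBE = 4000  # from the post: one probe+write per ~4000 cycles
CLOCK_GHZ = 2.66         # assumption, not a figure measured in the post


def latency_in_cycles(latency_ns, clock_ghz):
    """Convert a memory latency in nanoseconds to CPU clock cycles."""
    return latency_ns * clock_ghz


# Latencies from the post; 65 and 105 are midpoints of the quoted ranges.
systems = (
    ("1-socket Nehalem", 65),
    ("2-socket Nehalem (estimate)", 105),
    ("2-socket Core 2 Xeon", 206),
)

for label, ns in systems:
    cycles = latency_in_cycles(ns, CLOCK_GHZ)
    share = cycles / CYCLES_PER_PROBE
    print(f"{label}: ~{cycles:.0f} cycles per miss, {share:.1%} of the probe budget")
```

Under these assumptions even the slowest box spends well under a fifth of its per-probe cycle budget waiting on RAM, which is the sense in which the latency is not a serious problem here.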

4 sockets is always less clear, yet both AMD and Intel are still fast there. Note that Intel has other problems there, causing it to be delayed until 2010 or so (let's sit and wait). AMD's 16-core box has existed for a long time, and a 24-core version at 2.6 GHz has now been released as well, which kicks butt. For production boxes, rumours right now say Intel will release at 2.66 GHz (and if things get delayed further it will clock higher), yet for benchmarks I bet they will also release a 'special' version that eats 170+ watts per CPU (that's actual CPU TDP, not counting the losses in the mainboard and PSU), and it might run at 3.2 GHz, I guess.

Yet all this will be expensive. The AMDs are still considered 'cheap' at $2300.

The optimal die size for CPUs seems to be around 300 mm^2. The 6-core AMDs are well above that. If they glue 2 together to get 12 cores, that won't be cheaper of course, yet it can still be produced relatively cheaply (as you glue 2 working dies together, the production problem is that of a 350 mm^2 CPU). Gluing dies together is not new: Intel did it with the Q6600, arguably the most successful quad-core CPU in history.

It kicked butt and it was cheap.

Those Intel 8-core Xeon MP (Beckton) CPUs are like 700 mm^2. That's huge.
If you add a zero to that you probably still can't buy one for that price (provided AMD doesn't have something faster, at which point Intel has to compete).

That Intel high-end junk is simply not realistic for production work.

Larrabee at 700 mm^2 won't be cheap either, obviously.

AMD's 4-socket machines are right now leading the PC world by a large margin in practice (ignore test sets that run embarrassingly parallel software, which is simply not reality, as buying 16 Q6600 machines is going to be 2 times faster and 10 times cheaper for embarrassingly parallel stuff).

AMD has had its Opteron platform using HyperTransport at 4 sockets since 2004; Intel has yet to release a 4-socket platform based on its equivalent of HyperTransport (QPI). So AMD has a 5-year bug-fixing advantage over Intel there, which shows in practical software.

As you can see from the calculation above, none of this is a problem for computer chess; there is room to work with latencies a lot worse than this hardware has right now. Even the 4-socket Intel Dunnington with 1 memory controller works fine for some chess programs.

The reason this stuff is pricey and just 4 to 6 cores right now is simply the low latency that PC software requires.

Obviously it would not be so hard to design hardware with a lot more cores and a much higher latency to remote RAM using NUMA.

NUMA is cheap.

You then very soon move to latencies of 1 us to 5 us.

Amazingly, there is 1 chess program that still works with those latencies:
it's called Diep.
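
To see what latencies in the microsecond range would mean for a program with Crafty-like numbers (about 1000 cycles per node, a hash probe on 1 node in 4), here is a deliberately crude model. It assumes every probe stalls for the full latency with no prefetching or overlap, and a 2.66 GHz clock; both are assumptions for illustration, and the function name is hypothetical. It says nothing about how Diep itself copes with such hardware.

```python
def nps_per_core(clock_ghz, cycles_per_node, nodes_per_probe, latency_ns):
    """Estimate nodes per second for one core under a simple stall model.

    Each node costs cycles_per_node of work; one node in nodes_per_probe
    additionally waits a full, non-overlapped memory latency.
    """
    stall_cycles_per_node = latency_ns * clock_ghz / nodes_per_probe
    return clock_ghz * 1e9 / (cycles_per_node + stall_cycles_per_node)


CLOCK_GHZ = 2.66  # assumption

# 65 ns: middle of the single-socket Nehalem range; 1000 and 5000 ns:
# the ends of the 1 us to 5 us NUMA range.
for latency_ns in (65, 1000, 5000):
    nps = nps_per_core(CLOCK_GHZ, 1000, 4, latency_ns)
    print(f"{latency_ns:>5} ns latency: ~{nps / 1e6:.2f}M nps per core")
```

In this model the per-core node rate drops sharply once a single miss costs as much as several nodes of work, so a program running at such latencies has to probe remote memory less often or overlap the wait with other work.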

Vincent

P.S. To be honest, I find it stupid of AMD to price their 6-core Istanbuls for 4 sockets at $2300. If they priced them at $1000, they might be able to drop Intel from many projects, as clusters right now get built from 2-socket machines simply because those CPUs are a lot cheaper. With 4-socket hardware becoming affordable (and a mainboard is just $800 or so for AMD with 4 Socket F, which is quite cheap if we remember Intel's pricing over the past 10 years), many will switch to 4-socket machines as they come into price range. AMD could then sell 2 times more CPUs to start with, as the machines eat 2 times more CPUs.