Best engine for greater than 8-core SMP system

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best engine for greater than 8-core SMP system

Post by bob »

M ANSARI wrote:Rybka node counts with more than 8 cores are undervalued. I remember that on a 48-core AMD machine the node counts were unusually low, but the strength scaling of the 48 cores (ELO strength) seemed to follow traditional scaling gains when strength-tested. I don't think Rybka has a proper method of giving an accurate guesstimate of what knps it should show for more than 8 cores, but it scales quite well at higher core counts, and I would disregard the knps shown as a pointer to performance.

One more thing: although Rybka seems to be the best-scaling engine on more than 8 cores today, Zappa Mexico II also has excellent scaling, even though its search and evaluation might be outdated. Rybka only managed to reach the scaling of ZM II with R3, and it lagged quite a bit in scaling before that.
How are you measuring "scaling"? We have run lots of tests comparing Rybka and Crafty on 8 cores. I have not seen Rybka scale better, unless you cherry-pick one position out of 10 or whatever... I do not believe that Rybka represents a "silver bullet" with regard to parallel search.
User avatar
M ANSARI
Posts: 3707
Joined: Thu Mar 16, 2006 7:10 pm

Re: Best engine for greater than 8-core SMP system

Post by M ANSARI »

Actually I have never tested against Crafty, but when testing Rybka against Zappa, by "scaling" I meant, for example, that the score of Rybka 2.3.2a against ZMII on a single core was say 80 ELO in Rybka's favor, on 2 cores it would be something like 60 ELO, and so on. This was quite linear until, at 8 cores and 5_0 matches, ZMII was pulling very close, and pulling ahead when the cores were pushed to 5 GHz. That led me to think that ZMII was scaling better than R 2.3.2a. This was not the case with R3, where scores against ZMII stayed pretty much the same as you increased the cores.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best engine for greater than 8-core SMP system

Post by bob »

M ANSARI wrote:Actually I have never tested against Crafty, but when testing Rybka against Zappa, by "scaling" I meant, for example, that the score of Rybka 2.3.2a against ZMII on a single core was say 80 ELO in Rybka's favor, on 2 cores it would be something like 60 ELO, and so on. This was quite linear until, at 8 cores and 5_0 matches, ZMII was pulling very close, and pulling ahead when the cores were pushed to 5 GHz. That led me to think that ZMII was scaling better than R 2.3.2a. This was not the case with R3, where scores against ZMII stayed pretty much the same as you increased the cores.
That's a very imprecise way of measuring "scaling" and could be a way of measuring "parallel search bugs" in fact. :)
User avatar
M ANSARI
Posts: 3707
Joined: Thu Mar 16, 2006 7:10 pm

Re: Best engine for greater than 8-core SMP system

Post by M ANSARI »

bob wrote:
M ANSARI wrote:Actually I have never tested against Crafty, but when testing Rybka against Zappa, by "scaling" I meant, for example, that the score of Rybka 2.3.2a against ZMII on a single core was say 80 ELO in Rybka's favor, on 2 cores it would be something like 60 ELO, and so on. This was quite linear until, at 8 cores and 5_0 matches, ZMII was pulling very close, and pulling ahead when the cores were pushed to 5 GHz. That led me to think that ZMII was scaling better than R 2.3.2a. This was not the case with R3, where scores against ZMII stayed pretty much the same as you increased the cores.
That's a very imprecise way of measuring "scaling" and could be a way of measuring "parallel search bugs" in fact. :)
Probably, but it is the only way I can think of to measure scaling across platforms, as I have yet to see two original engines produce similar knps profiles. I also think that "parallel search bugs" are part of the equation when seeing how an engine performs on multiple cores. I guess the more cores you have, the more squeaky clean your code has to be, because the chance of a rare bug hitting becomes less rare. You can see that in the latest R3 derivatives ... they seem to be very stable on a single core and even dual cores, but on 8 cores they are most definitely not stable, and it is hard to play a 100-game tourney without an exception fault.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Best engine for greater than 8-core SMP system

Post by lucasart »

FlavusSnow wrote:I've done a fair share of research trying to find engines that can use more than 8 cores. Crafty and a couple of Crafty's offspring are the only world-class engines I can find for such hardware.
And what do you want to do with more than 8 cores?

You should also note that StockFish is probably stronger on 1 CPU, while Crafty would be on 8 CPUs (sorry Bob, don't take it personally). Plus it's free software, and its authors are intelligent and open-minded people, unlike some authors of commercial software that I will not name...

If you're only interested in ELO strength, the strongest you can get is probably Houdini on 8 cores. It's not open source, but it's free. However, it's only available on Windows.
User avatar
George Tsavdaris
Posts: 1627
Joined: Thu Mar 09, 2006 12:35 pm

Re: Best engine for greater than 8-core SMP system

Post by George Tsavdaris »

M ANSARI wrote:
bob wrote:
M ANSARI wrote:Actually I have never tested against Crafty, but when testing Rybka against Zappa, by "scaling" I meant, for example, that the score of Rybka 2.3.2a against ZMII on a single core was say 80 ELO in Rybka's favor, on 2 cores it would be something like 60 ELO, and so on. This was quite linear until, at 8 cores and 5_0 matches, ZMII was pulling very close, and pulling ahead when the cores were pushed to 5 GHz. That led me to think that ZMII was scaling better than R 2.3.2a. This was not the case with R3, where scores against ZMII stayed pretty much the same as you increased the cores.
That's a very imprecise way of measuring "scaling" and could be a way of measuring "parallel search bugs" in fact. :)
Probably, but it is the only way I can think of to measure scaling across platforms, as I have yet to see two original engines produce similar knps profiles.
The only general and valid way I see for measuring scaling is a simple procedure: take, say, 10 different positions and measure the time an engine needs to reach a certain depth in each position (the bigger the depth the better) on each piece of hardware, then divide the times from the different hardware to get the speedup for each engine, and finally take the average to get a good value for the actual speedup.


For example, for Rybka and Crafty on a Quad and an Octal, with positions named P1, P2, ..., P10, depths to reach in the corresponding positions D1 (e.g. 25 plies), D2, ..., D10, and times to reach each depth in each position t1, t2, ..., t10:

You first let Rybka run each of the positions on the Quad and you keep the time:

Quad_P1: time to reach D1 -> tq1
Quad_P2: time to reach D2 -> tq2
...............................
Quad_P10: time to reach D10 -> tq10

Then you let Rybka run each of the positions on the Octal. And you keep the time again:
Octal_P1: time to reach D1 -> to1
Octal_P2: time to reach D2 -> to2
...............................
Octal_P10: time to reach D10 -> to10


And the speedup, the scaling from Quad to Octal for Rybka, is the Quad time divided by the Octal time:
Position-1: tq1/to1
Position-2: tq2/to2
...........................
Position-10: tq10/to10

So average speedup = the average of the above.

The same for Crafty.

This seems a legitimate method of measuring the actual speedup, i.e. the scaling.
Is there a better one, or does this have any flaw?
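The averaging described above is straightforward to script. Below is a minimal sketch in Python; the position names and timings are made-up placeholders, not real measurements, and the ratio is taken as Quad time over Octal time, so a value above 1 means the Octal reached the target depth faster:

```python
from statistics import mean, stdev

# Hypothetical times (seconds) to reach the target depth in each test position.
quad_times  = {"P1": 120.0, "P2": 95.0, "P3": 210.0}   # 4-core runs
octal_times = {"P1":  70.0, "P2": 55.0, "P3": 130.0}   # 8-core runs

# Per-position speedup going from the Quad to the Octal: tq / to
# (a value above 1 means the Octal was faster on that position).
speedups = [quad_times[p] / octal_times[p] for p in quad_times]

avg_speedup = mean(speedups)    # the average speedup over all positions
spread      = stdev(speedups)   # how much the speedup varies by position
```

With real data you would of course use far more than three positions; the standard deviation across positions tells you how trustworthy the average is.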
After his son's birth they've asked him:
"Is it a boy or girl?"
YES! He replied.....
FlavusSnow
Posts: 89
Joined: Thu Apr 01, 2010 5:28 am
Location: Omaha, NE

Re: Best engine for greater than 8-core SMP system

Post by FlavusSnow »

I most often run Stockfish on 4 cores at 3.4 GHz, but I've had that machine for over a year now and my budget is annual (I'm looking to upgrade). Unfortunately, for less than $2,000, it seems there isn't much of a hardware upgrade that would do noticeably better than what I already have.

I don't have a particular reason to run 10k-game tests the way the authors do, so I don't see any other benefit to getting another system. I have volunteered CPU time to a handful of chess projects, but I've gotten no responses. So for now I think the quad-core machine will just stay what it is, playing on FICS most days.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Best engine for greater than 8-core SMP system

Post by michiguel »

George Tsavdaris wrote:
M ANSARI wrote:
bob wrote:
M ANSARI wrote:Actually I have never tested against Crafty, but when testing Rybka against Zappa, by "scaling" I meant, for example, that the score of Rybka 2.3.2a against ZMII on a single core was say 80 ELO in Rybka's favor, on 2 cores it would be something like 60 ELO, and so on. This was quite linear until, at 8 cores and 5_0 matches, ZMII was pulling very close, and pulling ahead when the cores were pushed to 5 GHz. That led me to think that ZMII was scaling better than R 2.3.2a. This was not the case with R3, where scores against ZMII stayed pretty much the same as you increased the cores.
That's a very imprecise way of measuring "scaling" and could be a way of measuring "parallel search bugs" in fact. :)
Probably, but it is the only way I can think of to measure scaling across platforms, as I have yet to see two original engines produce similar knps profiles.
The only general and valid way I see for measuring scaling is a simple procedure: take, say, 10 different positions and measure the time an engine needs to reach a certain depth in each position (the bigger the depth the better) on each piece of hardware, then divide the times from the different hardware to get the speedup for each engine, and finally take the average to get a good value for the actual speedup.


For example, for Rybka and Crafty on a Quad and an Octal, with positions named P1, P2, ..., P10, depths to reach in the corresponding positions D1 (e.g. 25 plies), D2, ..., D10, and times to reach each depth in each position t1, t2, ..., t10:

You first let Rybka run each of the positions on the Quad and you keep the time:

Quad_P1: time to reach D1 -> tq1
Quad_P2: time to reach D2 -> tq2
...............................
Quad_P10: time to reach D10 -> tq10

Then you let Rybka run each of the positions on the Octal. And you keep the time again:
Octal_P1: time to reach D1 -> to1
Octal_P2: time to reach D2 -> to2
...............................
Octal_P10: time to reach D10 -> to10


And the speedup, the scaling from Quad to Octal for Rybka, is the Quad time divided by the Octal time:
Position-1: tq1/to1
Position-2: tq2/to2
...........................
Position-10: tq10/to10

So average speedup = the average of the above.

The same for Crafty.

This seems a legitimate method of measuring the actual speedup, i.e. the scaling.
Is there a better one, or does this have any flaw?
10 positions is way too few, because the speedup is very position-dependent. I would choose enough positions that the SD (standard deviation) becomes small enough.

The alternative is to measure time-to-solution for positions that are relatively quiet, like the STS.

I think that the only method that takes into account all the potential issues and parameters is to measure Delta ELO vs the number of CPUs, but of course that is time-consuming.
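The Delta ELO idea can be made concrete with the standard logistic Elo model, which converts a match score fraction into an Elo difference. A minimal sketch; the 60% and 66% scores are made-up illustrative numbers, not measurements:

```python
import math

def elo_diff(score_fraction):
    """Elo difference implied by a match score fraction s (0 < s < 1),
    from the logistic model s = 1 / (1 + 10^(-d/400))."""
    return -400.0 * math.log10(1.0 / score_fraction - 1.0)

# Hypothetical: an engine scores 60% against a fixed opponent on 4 cores
# and 66% on 8 cores; the Elo gained going 4 -> 8 cores is the delta.
gain = elo_diff(0.66) - elo_diff(0.60)   # roughly +45 Elo
```

The catch, as noted above, is that pinning each of those score fractions down to a few Elo requires thousands of games per core count.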

Miguel
User avatar
George Tsavdaris
Posts: 1627
Joined: Thu Mar 09, 2006 12:35 pm

Re: Best engine for greater than 8-core SMP system

Post by George Tsavdaris »

michiguel wrote:
George Tsavdaris wrote: The only general and valid way I see for measuring scaling is a simple procedure: take, say, 10 different positions and measure the time an engine needs to reach a certain depth in each position (the bigger the depth the better) on each piece of hardware, then divide the times from the different hardware to get the speedup for each engine, and finally take the average to get a good value for the actual speedup.


For example, for Rybka and Crafty on a Quad and an Octal, with positions named P1, P2, ..., P10, depths to reach in the corresponding positions D1 (e.g. 25 plies), D2, ..., D10, and times to reach each depth in each position t1, t2, ..., t10:

You first let Rybka run each of the positions on the Quad and you keep the time:

Quad_P1: time to reach D1 -> tq1
Quad_P2: time to reach D2 -> tq2
...............................
Quad_P10: time to reach D10 -> tq10

Then you let Rybka run each of the positions on the Octal. And you keep the time again:
Octal_P1: time to reach D1 -> to1
Octal_P2: time to reach D2 -> to2
...............................
Octal_P10: time to reach D10 -> to10


And the speedup, the scaling from Quad to Octal for Rybka, is the Quad time divided by the Octal time:
Position-1: tq1/to1
Position-2: tq2/to2
...........................
Position-10: tq10/to10

So average speedup = the average of the above.

The same for Crafty.

This seems a legitimate method of measuring the actual speedup, i.e. the scaling.
Is there a better one, or does this have any flaw?
10 positions is way too few, because the speedup is very position-dependent. I would choose enough positions that the SD (standard deviation) becomes small enough.
If the speedup is heavily dependent on the positions, then this method is not so good after all.
But in fact I can't understand why the speedups would depend on the positions. :?
I mean, a very small difference is normal, but a big one as you suggest puzzles me. I would have expected only tiny differences, and that even 10 positions would be many and actually a bit of a waste of time.

Does such a high deviation of the speedups really occur between various positions?

I think that the only method that takes into account all the potential issues and parameters is to measure Delta ELO vs the number of CPUs, but of course that is time-consuming.
What is Delta ELO?
After his son's birth they've asked him:
"Is it a boy or girl?"
YES! He replied.....
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Best engine for greater than 8-core SMP system

Post by Milos »

George Tsavdaris wrote:But in fact I can't understand why the speedups would depend on the positions. :?
I mean, a very small difference is normal, but a big one as you suggest puzzles me. I would have expected only tiny differences, and that even 10 positions would be many and actually a bit of a waste of time.

Does such a high deviation of the speedups really occur between various positions?
Before posting ridiculous stuff, try to at least read Bob's paper on parallel search...