Best engine for greater than 8-core SMP system

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Best engine for greater than 8-core SMP system

Post by Milos »

George Tsavdaris wrote:What is Delta ELO?
Nothing but the Elo gain when increasing the number of CPUs.
In other words, what Miguel is saying is that you have to play some games, a lot of games actually...
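For illustration, here is a minimal sketch (not from the original post) of what that measurement amounts to: play a match between the N-core and the 1-core configuration of the same engine and convert the score into an Elo difference with the standard logistic model. The game counts and the helper name below are made up for the example.

Code: Select all

import math

def elo_from_score(score: float) -> float:
    """Elo difference implied by a match score in (0, 1), standard logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))

# Hypothetical match: 8-core build vs. 1-core build of the same engine.
wins, draws, losses = 450, 300, 250              # made-up results for illustration
score = (wins + 0.5 * draws) / (wins + draws + losses)
print(f"score = {score:.3f}  ->  delta Elo = {elo_from_score(score):+.1f}")

A 60% score works out to roughly +70 Elo; the error bars shrink only with the square root of the number of games, which is why "a lot of games" is the operative phrase.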
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best engine for greater than 8-core SMP system

Post by bob »

George Tsavdaris wrote:
M ANSARI wrote:
bob wrote:
M ANSARI wrote:Actually I have never tested against Crafty, but when testing Rybka against Zappa, the "scaling" I mean was that, for example, the scores of Rybka 2.3.2a against ZMII on a single core were say 80 ELO for Rybka, on 2 cores it would be something like 60 ELO, and so on. This was quite linear until, at 8 cores and 5_0 matches, ZMII was pulling very close and pulling ahead when the cores were pushed to 5 GHz. That led me to think that ZMII was scaling better than R 2.3.2a. This was not the case with R3, where scores against ZMII were pretty much the same as you increased the cores.
That's a very imprecise way of measuring "scaling" and could be a way of measuring "parallel search bugs" in fact. :)
Probably, but it is the only way I can think of to measure scaling cross-platform, as I have yet to see two original engines create similar kNPS profiles.
The only general and valid way I see for measuring scaling is the simple procedure of, let's say, taking 10 different positions and measuring the time an engine takes to reach a certain depth in each position (the bigger the depth the better) on each machine, then dividing the times from the different machines to see the speedup for each engine, and then taking the average or something similar to get a good value for the actual speedup.


For example, for Rybka and Crafty on a Quad and an Octal, with positions named P1, P2, ..., P10, depths to reach in the corresponding positions D1 (e.g. 25 plies), D2, ..., D10, and times to reach each depth in each position t1, t2, ..., t10:

You first let Rybka run each of the positions on the Quad and you keep the time:

Quad_P1: time to reach D1 -> tq1
Quad_P2: time to reach D2 -> tq2
...............................
Quad_P10: time to reach D10 -> tq10

Then you let Rybka run each of the positions on the Octal. And you keep the time again:
Octal_P1: time to reach D1 -> to1
Octal_P2: time to reach D2 -> to2
...............................
Octal_P10: time to reach D10 -> to10


And the speedup, i.e. the scaling from the Quad to the Octal, for Rybka is:
Position-1: tq1/to1
Position-2: tq2/to2
...........................
Position-10: tq10/to10

So average speedup = the average of the above.

The same for Crafty.

This seems a legitimate method of measuring the actual speedup, i.e. the scaling.
Is there a better one, or does this have any flaw?
That's about as good as you can do, with one addition. You will probably want to "normalize" the times so that each position using one cpu takes the same amount of time. Since you can't do this just by setting depth, you can come up with a multiplier so that for one cpu each position takes the same "adjusted" time.

If you don't do this, the positions with the longest times will be weighted more heavily if you add all the times together and divide by N. If you instead compute each speedup individually, then add and average, you treat them all "equally" and let an oddball result over-balance things. For example, you will occasionally see a "super-linear speedup" on a position, where it takes far less time to solve with 2 cpus than with 1. I have seen a 2-cpu speedup of 10+. Simple averaging will cause that one to raise the overall average unrealistically.

One approach is this:

1. Take N positions. For each position, using a single CPU, adjust the depth individually so that the search takes 10 minutes, or as close as you can get without going over.

2. For each position, compute a multiplier to adjust the time to exactly 10 minutes. If you can't get closer than 8.0 minutes for one, then use a multiplier of 1.25 for that one. Now all N positions take exactly 10 minutes of "adjusted" time.

3. For each position, run with 2, 4, 8, 16, ... cpus. For each resulting time, adjust by multiplying with the individual multipliers you computed, which causes each position to represent 1/nth of the total computation.

Now you can safely add all the 2 cpu times together, divide the total 1 cpu time by that sum, and get a 2 cpu speedup where each position is equally weighted.

Technically, no matter how you compute speedup, you introduce bias. You can just add all the times and average; the longest searches get weighted disproportionately. You can compute each speedup (for each position) and then average; this can unfairly bias the results if you have unusual positions here and there. You can do as I suggested. The problem with any of the above is that "an average of an average" is an issue. There are more accurate ways, including running each position N times and using some sort of smoothing formula to get rid of the variability. The bottom line is a lot of positions, run a lot of times (more times as you use more CPUs, because variability will go through the roof).
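For anyone who wants to see the normalization arithmetic spelled out, here is a minimal sketch of steps 2 and 3 (my own illustration, not code from any engine; the position names, the timings and the 600-second target are made-up assumptions):

Code: Select all

# Normalized, equally weighted speedup, following the three steps above.
TARGET = 600.0  # 10 minutes of "adjusted" single-CPU time per position

# Hypothetical measurements: {position: {number of cpus: seconds to reach its depth}}
times = {
    "P1": {1: 600.0, 2: 340.0, 4: 190.0, 8: 110.0},
    "P2": {1: 480.0, 2: 250.0, 4: 140.0, 8:  95.0},
    "P3": {1: 552.0, 2: 300.0, 4: 160.0, 8:  80.0},
}

# Step 2: one multiplier per position so every adjusted 1-cpu time is exactly 600 s
# (e.g. 480 s -> multiplier 1.25).
mult = {p: TARGET / t[1] for p, t in times.items()}

# Step 3: adjust the parallel times with the same multipliers, then compare totals,
# so each position contributes an equal share of the total computation.
for ncpus in (2, 4, 8):
    adjusted_total = sum(mult[p] * times[p][ncpus] for p in times)
    speedup = (TARGET * len(times)) / adjusted_total
    print(f"{ncpus} cpus: equally weighted speedup = {speedup:.2f}")

The speedup here is the total adjusted 1-cpu time divided by the total adjusted n-cpu time, so a single super-linear outlier can no longer drag the result around the way it can with a plain average of per-position ratios.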
George Tsavdaris
Posts: 1627
Joined: Thu Mar 09, 2006 12:35 pm

Re: Best engine for greater than 8-core SMP system

Post by George Tsavdaris »

Milos wrote:
George Tsavdaris wrote:But in fact I can't understand why they are dependent on the positions. :?
I mean, a very small difference is normal, but a big one such as you propose is a puzzle for me. I would have expected only tiny differences, and that even 10 positions would be many and actually a bit of a waste of time.

Do such high deviations of the speedups really occur between various positions?
Before posting ridiculous stuff, try to at least read Bob's paper on parallel search...
OK, thanks, although you didn't forgive my ignorance. But at least you helped.

Remember that in order to go from low to high you have to be at the low level first. :D And being there, you have to ask questions that would seem silly to people who are at the high level. :D

I guess Bob's paper is the following, right?
http://www.cis.uab.edu/hyatt/search.html
After his son's birth they asked him:
"Is it a boy or a girl?"
"YES!" he replied.....
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best engine for greater than 8-core SMP system

Post by bob »

George Tsavdaris wrote:
michiguel wrote:
George Tsavdaris wrote: The only general and valid way I see for measuring scaling is the simple procedure of, let's say, taking 10 different positions and measuring the time an engine takes to reach a certain depth in each position (the bigger the depth the better) on each machine, then dividing the times from the different machines to see the speedup for each engine, and then taking the average or something similar to get a good value for the actual speedup.


For example, for Rybka and Crafty on a Quad and an Octal, with positions named P1, P2, ..., P10, depths to reach in the corresponding positions D1 (e.g. 25 plies), D2, ..., D10, and times to reach each depth in each position t1, t2, ..., t10:

You first let Rybka run each of the positions on the Quad and you keep the time:

Quad_P1: time to reach D1 -> tq1
Quad_P2: time to reach D2 -> tq2
...............................
Quad_P10: time to reach D10 -> tq10

Then you let Rybka run each of the positions on the Octal. And you keep the time again:
Octal_P1: time to reach D1 -> to1
Octal_P2: time to reach D2 -> to2
...............................
Octal_P10: time to reach D10 -> to10


And the speedup, i.e. the scaling from the Quad to the Octal, for Rybka is:
Position-1: tq1/to1
Position-2: tq2/to2
...........................
Position-10: tq10/to10

So average speedup = the average of the above.

The same for Crafty.

This seems a legitimate method of measuring the actual speedup, i.e. the scaling.
Is there a better one, or does this have any flaw?
10 positions is way too few, because the speedup is very dependent on the positions. I would choose enough so that the SD becomes small enough.
If the speedup is heavily dependent on the positions, then this method is not so good after all.
But in fact I can't understand why they are dependent on the positions. :?
I mean, a very small difference is normal, but a big one such as you propose is a puzzle for me. I would have expected only tiny differences, and that even 10 positions would be many and actually a bit of a waste of time.

Do such high deviations of the speedups really occur between various positions?
I've posted a ton of data on this subject, here and in several papers on parallel search. Some positions produce almost exactly the same speedup each time they are run with the same number of CPUs. Some produce wildly variable speedups under the same circumstances. The majority produce less wild variability, but when I say "less wild" I mean _exactly_ that. It is quite common to see a run with 8 cpus that looks like this: 1.5m, 2.8m, 1.1m, 2.3m, 5.8m. One "outlier point" is common. And that 3x+ variability is not uncommon either.

And you will, on occasion, see a position that takes longer to reach depth D with 8 cpus than it does using just one. Why? Lots of reasons. Split points get chosen pretty much at random. You can pick a bad one or a good one without knowing beforehand which is which. The hash table allows different threads to influence each other via shared scores, in ways that are unpredictable. It is quite possible to get a _better_ 15 ply score using parallel search than you get with 17 plies using just one CPU. When I say "better" I don't mean larger, I mean more accurate.

There's a ton of "serendipity" in this stuff that boggles the imagination.
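To put rough numbers on that variability, here is a small sketch (my own, reusing the five 8-cpu times from the example above and an assumed single-CPU time of 10 minutes) of tracking the mean and SD of repeated runs, in the spirit of running each position N times:

Code: Select all

from statistics import mean, stdev

single_cpu_time = 600.0                          # assumed 1-cpu time for the position, in seconds
runs_8cpu = [90.0, 168.0, 66.0, 138.0, 348.0]    # 1.5m, 2.8m, 1.1m, 2.3m, 5.8m from the example

speedups = [single_cpu_time / t for t in runs_8cpu]
print("per-run speedups:", [round(s, 2) for s in speedups])
print(f"mean = {mean(speedups):.2f}, SD = {stdev(speedups):.2f}")

With a spread like that, the SD is more than half of the mean, which is why a handful of positions run once each tells you very little.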



I think that the only method that takes into account all the potential issues and parameters is to measure Delta ELO vs. the number of CPUs, but of course it is time-consuming.
What is Delta ELO?
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Best engine for greater than 8-core SMP system

Post by Milos »

George Tsavdaris wrote:I guess Bob's paper is the following right?
http://www.cis.uab.edu/hyatt/search.html
Yes, check the tables at the end, and bear in mind that in those times the variability in node count was noticeably smaller than it is today, when search is far more complex.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Best engine for greater than 8-core SMP system

Post by Milos »

bob wrote:Technically, no matter how you compute speedup, you introduce bias. You can just add all the times and average; the longest searches get weighted disproportionately. You can compute each speedup (for each position) and then average; this can unfairly bias the results if you have unusual positions here and there. You can do as I suggested. The problem with any of the above is that "an average of an average" is an issue. There are more accurate ways, including running each position N times and using some sort of smoothing formula to get rid of the variability. The bottom line is a lot of positions, run a lot of times (more times as you use more CPUs, because variability will go through the roof).
If you have a cluster ;) you can always play games to accurately determine the speedup in Elo.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best engine for greater than 8-core SMP system

Post by bob »

Milos wrote:
bob wrote:Technically, no matter how you compute speedup, you introduce bias. You can just add all the times and average; the longest searches get weighted disproportionately. You can compute each speedup (for each position) and then average; this can unfairly bias the results if you have unusual positions here and there. You can do as I suggested. The problem with any of the above is that "an average of an average" is an issue. There are more accurate ways, including running each position N times and using some sort of smoothing formula to get rid of the variability. The bottom line is a lot of positions, run a lot of times (more times as you use more CPUs, because variability will go through the roof).
If you have a cluster ;) you can always play games to accurately determine the speedup in Elo.
That's still problematic. The cluster I use has dual cpus per node. Not much help. Our other cluster only has 70 nodes, 8 cores per node. But it is much busier. And SMP improves with depth so that fast games are worse than slower games. And longer games become much more problematic in terms of time needed...
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Best engine for greater than 8-core SMP system

Post by Milos »

bob wrote:That's still problematic. The cluster I use has dual cpus per node. Not much help. Our other cluster only has 70 nodes, 8 cores per node. But it is much busier. And SMP improves with depth so that fast games are worse than slower games. And longer games become much more problematic in terms of time needed...
I know, I was joking a bit; playing games to determine the speedup is extremely impractical.
However, a serious question now: how is it that in the times of Cray Blitz, depths in the range of 9-12 (9 for a single CPU, 12 for 16 CPUs, opening and middlegame positions) were sufficient to claim accuracy, while today depths of 15-18 (for very fast games) are not sufficient, with much larger node counts?
Yes, there are different prunings, LMRs, things that change the tree shape a lot, but still, would that mean that if you repeated the Cray Blitz parallel search tests on today's hardware you would get much better performance?
My understanding is that SMP improves with depth only up to a certain point (which is in the depth range of 15-20 for today's hardware and engines) and then you don't see any significant improvement any more.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best engine for greater than 8-core SMP system

Post by bob »

Milos wrote:
bob wrote:That's still problematic. The cluster I use has dual cpus per node. Not much help. Our other cluster only has 70 nodes, 8 cores per node. But it is much busier. And SMP improves with depth so that fast games are worse than slower games. And longer games become much more problematic in terms of time needed...
I know, I was joking a bit; playing games to determine the speedup is extremely impractical.
However, a serious question now: how is it that in the times of Cray Blitz, depths in the range of 9-12 (9 for a single CPU, 12 for 16 CPUs, opening and middlegame positions) were sufficient to claim accuracy, while today depths of 15-18 (for very fast games) are not sufficient, with much larger node counts?
Yes, there are different prunings, LMRs, things that change the tree shape a lot, but still, would that mean that if you repeated the Cray Blitz parallel search tests on today's hardware you would get much better performance?
My understanding is that SMP improves with depth only up to a certain point (which is in the depth range of 15-20 for today's hardware and engines) and then you don't see any significant improvement any more.
That covers a lot of ground.

1. When you talk about accuracy, we had almost _none_ in the days of Cray Blitz. Cray provided me with a dedicated machine to play in the annual ACM or WCCC events, and occasionally a chance to play in a Labor Day weekend tournament if we could find one. But as far as playing and tuning? We could possibly beg or borrow a half-hour or an hour during the wee hours of the night, once a week if we were lucky (the machines are in incredibly high demand, or they were back then). Except for the occasional Labor Day tournament, we were never able to play a full game until we got to the ACM event, which makes our successes more than remarkable. Yes, we could play games running on a Vax, without parallel search and at 1/1000th the speed of the Cray, so that kind of testing revealed nothing at all.

2. With respect to depth, if you crank up Crafty on (say) a 16 cpu system, in a middlegame position, and tell it to "go", it will start to search and display the NPS frequently. It will start off at about 1/2 the speed it will be searching at after 30 seconds or so. I've not noticed any "stabilization" of parallel performance after a certain depth. The deeper the search, the less overhead we seem to see, in general, because the further from the tips you split, the better the move ordering is and the less likely you are to split at a CUT node, which kills performance. Is 30 minutes per search better than 1? Yes. A lot? Probably not. But there is a steady gain, at least as far as I have measured. I have not compared 1 day searches to 1 hour searches, for obvious reasons.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Best engine for greater than 8-core SMP system

Post by Milos »

bob wrote:With respect to depth, if you crank up Crafty on (say) a 16 cpu system, in a middlegame position, and tell it to "go", it will start to search and display the NPS frequently. It will start off at about 1/2 the speed it will be searching at after 30 seconds or so. I've not noticed any "stabilization" of parallel performance after a certain depth. The deeper the search, the less overhead we seem to see, in general, because the further from the tips you split, the better the move ordering is and the less likely you are to split at a CUT node, which kills performance. Is 30 minutes per search better than 1? Yes. A lot? Probably not. But there is a steady gain, at least as far as I have measured. I have not compared 1 day searches to 1 hour searches, for obvious reasons.
So then the logical question follows: if you leave it running on a 16 cpu system for days, would 16 be the asymptotic speedup for most of the positions, or would there still be a majority of positions that can never reach linear speedup, no matter how much time they are left to run?
And what would be the asymptotic average value (13, or 14, or maybe even 15 for a 16 cpu system)?