SMP speed up

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: SMP speed up

Post by rbarreira »

In your DTS paper you give a measured speedup of 11.1 for 16 CPUs, which is less than your formula says (11.5), despite the fact that DTS is supposed to be a better parallel algorithm than what you're using today.

So which one is it: is the formula wrong, or is your current parallel algorithm actually about as good as, or even better than, DTS for 16 CPUs and possibly above?

To me the results in the DTS paper make more sense, i.e. efficiency keeps going down when adding CPUs instead of hovering around a minimum. Unfortunately the paper doesn't have results for more than 16 CPUs.
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: SMP speed up

Post by Dann Corbit »

rbarreira wrote:In your DTS paper you give a measured speedup of 11.1 for 16 CPUs, which is less than your formula says (11.5), despite the fact that DTS is supposed to be a better parallel algorithm than what you're using today.

So which one is it: is the formula wrong, or is your current parallel algorithm actually about as good as, or even better than, DTS for 16 CPUs and possibly above?

To me the results in the DTS paper make more sense, i.e. efficiency keeps going down when adding CPUs instead of hovering around a minimum. Unfortunately the paper doesn't have results for more than 16 CPUs.
11.1 agrees very neatly with the 11.5 predicted by the formula.
You were expecting exact agreement? Of course, that would be an utterly ridiculous expectation (in fact, we would rightly view data that matched exactly with suspicion).

If, for instance, a student dropped a penny, derived the gravitational constant from the mass of the penny and the mass of the earth, and came up with exactly 6.673 x 10^(-11) m^3/(kg*s^2), it would seem a little far-fetched.
Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: SMP speed up

Post by Karlo Bala »

michiguel wrote:This is a spin off of another thread.

I never gave it a lot of thought, but the well-known formula for Crafty

speedup = 1 + (NCPUS - 1) * 0.7

may indicate that the inefficiency is not related to Amdahl's law, even if that does apply at a low number of CPUs. What is the cause of the parallel inefficiency? The shape of the tree? Still, it looks like it should either saturate more quickly, or the speedup with 2 cores should be higher than 1.7.

Was this investigated?

Miguel
I think this is a realistic formula:

Speedup(1) = 1
Speedup(NCPUS) = 1.7*Speedup(NCPUS/2)
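
For a rough side-by-side comparison (a sketch only, not measured data; the helper names are made up), the recursive rule works out to 1.7^log2(NCPUS), and it falls away from the linear rule fairly quickly:

Code:
#include <stdio.h>

/* Bob's linear approximation, as quoted above. */
static double linear_speedup(int ncpus) {
    return 1.0 + (ncpus - 1) * 0.7;
}

/* Karlo's recursive rule; for ncpus = 2^k this is 1.7^k. */
static double geometric_speedup(int ncpus) {
    if (ncpus <= 1)
        return 1.0;
    return 1.7 * geometric_speedup(ncpus / 2);
}

int main(void) {
    for (int n = 1; n <= 64; n *= 2)
        printf("%2d cpus: linear %6.2f   geometric %6.2f\n",
               n, linear_speedup(n), geometric_speedup(n));
    return 0;
}

At 16 CPUs the two rules give 11.5 and about 8.4 respectively, which is roughly the gap being argued about elsewhere in this thread.
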
Best Regards,
Karlo Balla Jr.
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: SMP speed up

Post by rbarreira »

Dann Corbit wrote:
rbarreira wrote:In your DTS paper you give a measured speedup of 11.1 for 16 CPUs, which is less than your formula says (11.5), despite the fact that DTS is supposed to be a better parallel algorithm than what you're using today.

So which one is it: is the formula wrong, or is your current parallel algorithm actually about as good as, or even better than, DTS for 16 CPUs and possibly above?

To me the results in the DTS paper make more sense, i.e. efficiency keeps going down when adding CPUs instead of hovering around a minimum. Unfortunately the paper doesn't have results for more than 16 CPUs.
11.1 agrees very neatly with the 11.5 predicted by the formula.
You were expecting exact agreement? Of course, that would be an utterly ridiculous expectation (in fact, we would rightly view data that matched exactly with suspicion).

If, for instance, a student dropped a penny, derived the gravitational constant from the mass of the penny and the mass of the earth, and came up with exactly 6.673 x 10^(-11) m^3/(kg*s^2), it would seem a little far-fetched.
What part of my post made you think that I was complaining about 11.1 not being equal to 11.5?

Did you actually read and understand my post, and the fact that the formula is predicting something different from DTS? According to Robert, the formula is supposed to predict Crafty's scaling, which is supposed to be worse than DTS scaling.

The fact that the formula predicts a scaling which is better than DTS (or about equal, if you want to ignore the small difference) is what I find strange given that Robert always says that DTS is quite a lot better than the other parallel algorithms.

Does DTS only pay off with more than 16 CPUs? Does the formula wrongly predict Crafty's scaling above 8 CPUs? Or maybe Crafty's parallel algorithm is already as good as DTS? All are possibilities, I'm just trying to find out the reason for the discrepancy.
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: SMP speed up

Post by mhull »

bob wrote:
michiguel wrote:This is a spin off of another thread.

I never gave it a lot of thought, but the well-known formula for Crafty

speedup = 1 + (NCPUS - 1) * 0.7

may indicate that the inefficiency is not related to Amdahl's law, even if that does apply at a low number of CPUs. What is the cause of the parallel inefficiency? The shape of the tree? Still, it looks like it should either saturate more quickly, or the speedup with 2 cores should be higher than 1.7.

Was this investigated?

Miguel
I beat it to death for 1-8 cores. About all that can be said is that after the first CPU, every processor added makes the tree grow. If you run the old CB positions, which are all the positions from a single real game that everyone has seen, it is pretty consistent. There is lots of variation until you average over all the moves, and then that 30% extra nodes per CPU starts to settle down. You could drive this down by improving move ordering to get a higher percentage of fail highs on the first move. But I have been stuck at 90-92% for years, which pretty well fixes the overhead since one of every 10 splits (when I split right after the first move) is going to be at a bad point that adds overhead...

There is likely a mathematical model that considers fh % as computed in Crafty, and predicts the speedup, but I have never tried to quantify that at all since it would not be of any benefit. We all know that YBW depends on the first move being the one to cause a cutoff, else we think it is an ALL node.

In my dissertation I tackled this by searching perfectly ordered trees and got perfect linearity in the speedup. But I faked the eval to produce a monotonically decreasing value to simulate perfect ordering. I also tackled worst-first, which is pure minimax, and also got perfect speedups. But it was slow with no cutoffs at all. It is the very good trees that cause the problem. :)
This chart compares Crafty's estimated speedup function with actual workload benchmarks from an IBM mainframe (z-series 2097). It shows that Crafty's estimate is not unreasonable up to about 42 CPUs. That would make sense, since the function was derived from measurements with far fewer CPUs. Only one Crafty test at more than 32 CPUs (64) is known, and the details of that test are sketchy since Bob did not run it.

[Image: chart comparing Crafty's estimated speedup curve with z/OS workload benchmark curves]

Source data is from Cheryl Watson z/OS CPU Chart, October 2008, Watson & Walker, Inc.
Matthew Hull
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: SMP speed up

Post by rbarreira »

Matthew, what is that graphic supposed to prove? If you know anything about programming, you will know that parallel speedups depend heavily on the problem at hand.

Not to mention the fact that the graphic can be used to prove almost anything, given the number of different curves it has.
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: SMP speed up

Post by mhull »

rbarreira wrote:Matthew, what is that graphic supposed to prove? If you know anything about programming, you will know that parallel speedups depend heavily on the problem at hand.

Not to mention the fact that the graphic can be used to prove almost anything, given the number of different curves it has.
It proves that the Crafty estimate function is not physically impossible (as Milos implied) up to 32 CPUs, with some room to spare. You will note that it is actually lower than several of the benchmark curves up to 40 CPUs.

I suspect that if Crafty could be measured at each number of CPUs on such a box, its curve would end up similar in shape to the other curves.
Matthew Hull
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP speed up

Post by bob »

mhull wrote:
bob wrote:
michiguel wrote:This is a spin off of another thread.

I never gave it a lot of thought, but the well-known formula for Crafty

speedup = 1 + (NCPUS - 1) * 0.7

may indicate that the inefficiency is not related to Amdahl's law, even if that does apply at a low number of CPUs. What is the cause of the parallel inefficiency? The shape of the tree? Still, it looks like it should either saturate more quickly, or the speedup with 2 cores should be higher than 1.7.

Was this investigated?

Miguel
I beat it to death for 1-8 cores. About all that can be said is that after the first CPU, every processor added makes the tree grow. If you run the old CB positions, which are all the positions from a single real game that everyone has seen, it is pretty consistent. There is lots of variation until you average over all the moves, and then that 30% extra nodes per CPU starts to settle down. You could drive this down by improving move ordering to get a higher percentage of fail highs on the first move. But I have been stuck at 90-92% for years, which pretty well fixes the overhead since one of every 10 splits (when I split right after the first move) is going to be at a bad point that adds overhead...

There is likely a mathematical model that considers fh % as computed in Crafty, and predicts the speedup, but I have never tried to quantify that at all since it would not be of any benefit. We all know that YBW depends on the first move being the one to cause a cutoff, else we think it is an ALL node.

In my dissertation I tackled this by searching perfectly ordered trees and got perfect linearity in the speedup. But I faked the eval to produce a monotonically decreasing value to simulate perfect ordering. I also tackled worst-first, which is pure minimax, and also got perfect speedups. But it was slow with no cutoffs at all. It is the very good trees that cause the problem. :)
This chart compares Crafty's estimated speedup function with actual workload benchmarks from an IBM mainframe (z-series 2097). It shows that Crafty's estimate is not unreasonable up to about 42 CPUs. That would make sense, since the function was derived from measurements with far fewer CPUs. Only one Crafty test at more than 32 CPUs (64) is known, and the details of that test are sketchy since Bob did not run it.

[Image: chart comparing Crafty's estimated speedup curve with z/OS workload benchmark curves]

Source data is from Cheryl Watson z/OS CPU Chart, October 2008, Watson & Walker, Inc.
I've run on two 64-CPU boxes. The first was a pain in the a**. Eugene Nalimov had a 64-processor Itanium box that his group at Microsoft was using. NUMA all the way. It was very difficult to get NPS scaling to work well at all. The second was more usable, but again, testing time was limited. I have run extensively on 2/4/8/16-CPU boxes, both Intel and AMD. AMD has had quad-chip machines in the development lab for years, and 4x4 and 4x6 are doable. There have been 8-chip boxes scattered around, but they are always pricey and not always reliable.

Intel keeps delaying their 8-core chip; we'd like to buy a 4x8 box like that, as it would not be excessively expensive. But it is always "3 more months," although they have announced the 7500. And AMD has announced a 12-core chip, although again I have not (yet) run on one.

Once either of those is available, I will be able to produce a lot more 1-2-4-8-16-32 data, and can probably begin to find non-prototype-non-NDA machines with 48-64 (and perhaps even beyond) numbers of CPUs.

Another note is that not all N-CPU boxes are created equal. Years ago I bought an ALR box with 4 chips; the thing had 4-way interleaved memory. We bought a cheapo 4-way box later, and it performed significantly worse. It turned out it used a standard memory controller design, which meant 4 CPUs fighting over memory and starving for both instructions and data. The "server" boxes are generally better, but also more pricey; they typically offer a better memory subsystem to keep all the CPUs busy.

My linear speedup approximation formula is based on the reasonable machines I have run on. It is an approximation: certainly a bit low for 2x, and it may well be a little high for 64x, since I have not produced a lot of good data there. When NUMA enters the picture, there is yet another issue to deal with, since memory access slows down as it gets farther from a particular CPU. The formula won't be as accurate there, although for a 4-chip AMD it works well, since that was where much of the testing was done, starting with the old 4x2 Opteron boxes I used in some early CCT events. 4 nodes is not so bad for NUMA; 32 nodes (using the AMD HyperTransport model) becomes quite problematic.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP speed up

Post by bob »

rbarreira wrote:In your DTS paper you give a measured speedup of 11.1 for 16 CPUs, which is less than your formula says (11.5), despite the fact that DTS is supposed to be a better parallel algorithm than what you're using today.

So which one is it: is the formula wrong, or is your current parallel algorithm actually about as good as, or even better than, DTS for 16 CPUs and possibly above?

To me the results in the DTS paper make more sense, i.e. efficiency keeps going down when adding CPUs instead of hovering around a minimum. Unfortunately the paper doesn't have results for more than 16 CPUs.
How many times do I have to say this before it sinks in?

The formula is not exact. It is not intended to be accurate to decimal places. It is simply a linear approximation to suggest where a particular data point will fall, within a reasonable margin of error. To answer your question, it is _neither_. DTS is better: it has less overhead because it is able to choose better split points. But once you get a lot of processors involved, it actually could get worse than the current algorithm. I've since spent quite a bit of time on other ideas (things like the "min-thread-group" idea that everyone seems to have copied), which addressed something we did not have time to test and tune in DTS. With that idea, the DTS numbers in the JICCA paper might (emphasis on might) have been a little better at 16 or 32. We ran on the 32-CPU box for one tournament, and the data we got there was not nearly as good as we had hoped. But time marches on, and new ideas were discovered to address specific weaknesses that were difficult to anticipate until the actual hardware became available.

So please stop trying to make this formula about accuracy. It was intended to answer the question that always came up in the days when nobody else was doing parallel search, namely: "Bob, what kind of speedup should I see if I buy a dual-CPU box? A quad-CPU box? I'm looking at a 4-socket AMD with dual cores, how much faster will that run?"

That was all it was ever intended for. Now a couple of idiots want to take a linear approximation to a very much non-linear (and small) set of data points (nothing about chess is linear) and quibble about its accuracy.

Feel free to run a few thousand positions on 1, 2, 4, 8, and 16/32/64 CPUs if you can get access, and produce a more accurate formula. It would not bother me at all. When that was written, there was no T932 available, so we could not go to 32. If we had had access, I'd bet we could have pushed that flattening up a bit. The simplest example is that we always split somewhere with all the processors that were available. Why split at a node with 8 CPUs when it has only 8 moves, or even fewer (sometimes we did not even know how many moves we would have, since we didn't generate them all at once at every position)? In Crafty, I limit the number of threads working at any single split point to address this problem; plus, it is better not to put all your eggs in one basket (or all your CPUs at one node) in those cases where the supposed ALL node turns into a CUT node. With only 4 processors working there, you get less overhead.
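
A minimal sketch of that thread-per-split-point cap (hypothetical names and constants, not Crafty's actual code), just to show the idea of refusing to pile every idle CPU onto one split point:

Code:
#include <stdio.h>

#define MAX_PER_SPLIT 4     /* cap on threads at any single split point */

typedef struct {
    int nthreads;           /* threads currently working at this split point */
    int moves_remaining;    /* moves still to be searched here */
} split_point_t;

/* Return 1 if an idle thread may join this split point, 0 otherwise. */
static int may_join(const split_point_t *sp) {
    if (sp->nthreads >= MAX_PER_SPLIT)
        return 0;           /* don't put all the eggs in one basket */
    if (sp->nthreads >= sp->moves_remaining)
        return 0;           /* more helpers than moves is pure waste */
    return 1;
}

int main(void) {
    split_point_t sp = { 3, 10 };
    printf("join allowed: %d\n", may_join(&sp));  /* 1: still under the cap */
    sp.nthreads = 4;
    printf("join allowed: %d\n", may_join(&sp));  /* 0: cap reached */
    return 0;
}

If the node later turns out to be a CUT node, at most MAX_PER_SPLIT CPUs have been wasted on it rather than all of them.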

So there are ideas to improve things, but for smallish numbers of processors, Cray Blitz could simply choose better split points than the simple YBW algorithm I use today.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP speed up

Post by bob »

Dann Corbit wrote:
rbarreira wrote:In your DTS paper you give a measured speedup of 11.1 for 16 CPUs, which is less than your formula says (11.5), despite the fact that DTS is supposed to be a better parallel algorithm than what you're using today.

So which one is it: is the formula wrong, or is your current parallel algorithm actually about as good as, or even better than, DTS for 16 CPUs and possibly above?

To me the results in the DTS paper make more sense, i.e. efficiency keeps going down when adding CPUs instead of hovering around a minimum. Unfortunately the paper doesn't have results for more than 16 CPUs.
11.1 agrees very neatly with the 11.5 predicted by the formula.
You were expecting exact agreement? Of course, that would be an utterly ridiculous expectation (in fact, we would rightly view data that matched exactly with suspicion).
Forget that point. Remember, this is a linear approximation (emphasis on approximation) to non-linear data points, yet apparently it _must_ be accurate to 4 decimal places for some people to be happy. For the purposes it was intended for, being within 25% would be just fine. How fast using 4 CPUs? 2.0-4.0 depending on the position, and occasionally not even 2.0. The average for 2 CPUs? 1.7-1.9 depending on the positions; over lots of positions, close to 1.8. The formula predicts 1.7. Less than 10% off? Not good enough, it seems. I never knew "linear approximations" had to have that kind of accuracy, or I would never have even mentioned the rough guess... (A quick error check against the numbers quoted in this thread is sketched below the quote.)


If, for instance, a student dropped a penny, derived the gravitational constant from the mass of the penny and the mass of the earth, and came up with exactly 6.673 x 10^(-11) m^3/(kg*s^2), it would seem a little far-fetched.
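
For concreteness, a quick error check (a sketch, using only numbers quoted in this thread: roughly 1.8 observed on average at 2 CPUs, and the 11.1 measured for DTS at 16 CPUs) shows the linear approximation landing well inside that 25% band:

Code:
#include <math.h>
#include <stdio.h>

/* The linear approximation discussed throughout this thread. */
static double formula(int ncpus) {
    return 1.0 + (ncpus - 1) * 0.7;
}

int main(void) {
    const struct { int ncpus; double measured; } obs[] = {
        {  2,  1.8 },   /* average 2-CPU speedup cited above     */
        { 16, 11.1 },   /* 16-CPU DTS measurement from the paper */
    };
    for (int i = 0; i < 2; i++) {
        double pred = formula(obs[i].ncpus);
        double err  = 100.0 * fabs(pred - obs[i].measured) / obs[i].measured;
        printf("%2d cpus: predicted %5.2f, measured %5.2f, off by %4.1f%%\n",
               obs[i].ncpus, pred, obs[i].measured, err);
    }
    return 0;
}

That works out to roughly 6% off at 2 CPUs and 4% off at 16 CPUs.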