Measure of SMP scalability

Discussion of chess software programming and technical issues.

Moderator: Ras

mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: Measure of SMP scalability

Post by mvk »

syzygy wrote: I will now stop feeding, because this has been going on long enough. Everybody else is in complete agreement. You of course not. Good for you.
Guess who will get/take the last word in this thread.
kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 4:48 am

Re: Measure of SMP scalability

Post by kbhearn »

Given that

1) the heuristic pruning/reduction methods present in engines these days a) are a tradeoff and b) could be tweaked a bit in either direction without massive damage to the engine's playing strength.

2) a particular parallel search implementation has splitting overhead (not wasted nodes, just the actual act of splitting) that's non-negligible.

Is it really a surprise that with thorough testing it might be found that the heuristics in 1 should be pulled a tiny bit bushier than the tuned single processor values in order to be 'optimal'?
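The tradeoff kbhearn describes can be sketched as a small adjustment to a reduction formula. This is a hypothetical illustration, not code from any actual engine: the formula shape, the constants `BASE` and `SMP_WIDEN`, and the function name are all assumptions made for the sake of the example.

```python
# Hypothetical sketch: a late-move-reduction formula with a small
# "bushiness" knob for parallel search. Constants are illustrative,
# not taken from any particular engine.
import math

BASE = 0.75        # assumed tuned single-threaded constant
SMP_WIDEN = 0.10   # slightly smaller reductions when searching in parallel

def lmr_reduction(depth, move_number, threads=1):
    """Return the depth reduction (in plies) for a late move."""
    r = BASE + math.log(depth) * math.log(move_number) / 2.25
    if threads > 1:
        # Widen the tree a hair: each split point already pays a
        # synchronization cost, so pruning a touch less can be a net win.
        r -= SMP_WIDEN
    return max(0, int(r))

# The parallel value never reduces more than the serial one:
print(lmr_reduction(12, 10, threads=1), lmr_reduction(12, 10, threads=8))
```

The point of the sketch is only that the single-thread optimum and the N-thread optimum of such a knob need not coincide, which is what thorough testing would have to establish either way.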
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

syzygy wrote:
bob wrote:That doesn't show it is not the CORRECT way to measure performance and evaluate the design, however.
We're not talking about "evaluating the design" (which you obviously would like to equate with "if time-to-depth is not great, the design is wrong").

We are talking about measuring the performance of the smp implementation. We are talking about measuring effective smp speedup. Effective in the sense of "how well does it play chess", which in the end is the only thing that matters.

The one and only point of the whole discussion is this: time-to-depth is not a complete measure of effective smp speedup.

There is no other point being made here. In particular, we are not discussing what Crafty should do in its next release.

I will now stop feeding, because this has been going on long enough. Everybody else is in complete agreement. You of course not. Good for you.
Fine. But what information do you get from your measurement? Is the parallel search efficient, compared to a normal program? Could it be MORE efficient if it reached greater depth?

And the critical issue, can you measure what you are talking about? Practically? Much easier to test a significant group of positions and measure time-to-depth when working on an SMP search, as opposed to having to play enough games to get an acceptably low error bar when playing games. The computational requirement is orders of magnitude larger for the latter. So large, the average user probably can't use it at all...

And, as has been my point, nothing so far has shown that going wider is more effective than (or even as effective as) going deeper. Lots of experimental results (from many different people) suggest it is not.

BTW no one has said it is a "complete" measurement. Just "it is the BEST measurement."

Just count the number of programs where it works well, and compare that to the exceptions. And decide which measure will be the best predictor for 99% of programs. And which method has the most manageable compute requirement.
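The time-to-depth measurement bob is advocating can be summarized in a few lines. This is a minimal sketch with made-up timings, assuming you already have per-position wall-clock times to reach the same fixed depth with 1 thread and with N threads:

```python
# Minimal sketch of a time-to-depth speedup summary. The timings
# (seconds to reach a fixed depth) are fabricated for illustration.
from statistics import geometric_mean

serial_times   = [181.0, 240.0, 205.0, 412.0]   # 1 thread
parallel_times = [ 52.0,  71.0,  60.0, 118.0]   # e.g. 4 threads

# Per-position speedups; the geometric mean keeps one outlier position
# from dominating the summary the way an arithmetic mean would.
speedups = [s / p for s, p in zip(serial_times, parallel_times)]
ttd_speedup = geometric_mean(speedups)

print(f"time-to-depth speedup: {ttd_speedup:.2f}x")
```

This is also why the computational argument favors time-to-depth: a few hundred position searches suffice, where measuring an Elo difference of a few points to a tight error bar takes tens of thousands of games.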
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

kbhearn wrote:Given that

1) the heuristic pruning/reduction methods present in engines these days a) are a tradeoff and b) could be tweaked a bit in either direction without massive damage to the engine's playing strength.

2) a particular parallel search implementation has splitting overhead (not wasted nodes, just the actual act of splitting) that's non-negligible.

Is it really a surprise that with thorough testing it might be found that the heuristics in 1 should be pulled a tiny bit bushier than the tuned single processor values in order to be 'optimal'?
Simply stated, "it is a stretch". To say the least. Otherwise, would you not make the serial tree a tiny bit bushier as well? This extra bushiness is multiplied by the parallel search overhead, don't forget...
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Measure of SMP scalability

Post by Adam Hair »

I have a couple of questions in regard to measuring time to depth for 1 cpu and 4 cpus:

1) What sort of positions should be used? A random selection from games? Should they be "quiet" positions (non-tactical)? Should they all come from the midgame phase or from all phases? I have collected some data already for various engines, but it is clear that the nature of the positions has some effect on time to depth.

2) What is a reasonable depth to use? Some engines reach a fixed depth much faster than other engines (on average). More depth (up to a point) should give more reliable data, but the measurements need to use a large number of positions. I am seeking a reasonable tradeoff.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Adam Hair wrote:I have a couple of questions in regard to measuring time to depth for 1 cpu and 4 cpus:

1) What sort of positions should be used? A random selection from games? Should they be "quiet" positions (non-tactical)? Should they all come from the midgame phase or from all phases? I have collected some data already for various engines, but it is clear that the nature of the positions has some effect on time to depth.

2) What is a reasonable depth to use? Some engines reach a fixed depth much faster than other engines (on average). More depth (up to a point) should give more reliable data, but the measurements need to use a large number of positions. I am seeking a reasonable tradeoff.
Couple of answers.

(1) as many positions as you can stand, but at least representative of the complete game from opening to endgame is preferred. How many? The more the merrier. I usually run about 50 positions for a quick test, or about 300 for a longer test.

(2) depth is tough. I generally set it to whatever is reasonable for a single CPU, aiming for at least 3 minutes, so that 4, 8, 12, etc core runs will take enough time to get rid of the timing jitter one sees when trying to measure short intervals.

What I have done in the past, which is a bit painful, is to take my test set and search each for 10 minutes. Then look at the log and for each position, set the max depth to whatever the single CPU run completed. Fixed depth overall is a pain. And you have to beware of setting the depth low enough that some runs take almost zero time, as they won't really count when you average everything together.

When I used the Cray it was worse since I had 32 cpus on the last box I used, which REALLY required long single-cpu runs. I think I used 1 hour per position to set the depths, then ran with 2 to 32 threads to measure speedup...

Also, for a multiple-cpu test, you should run each twice, and look at the variance. The greater it is, the more runs you need to get a good average for that number of threads...
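The "run each twice and look at the variance" advice amounts to a simple loop: repeat the multi-threaded measurement and keep adding runs until the relative spread is acceptable. A sketch with fabricated timings; the 10% threshold is an assumption for illustration, not a figure from the thread:

```python
# Sketch of checking run-to-run variance for a multi-threaded search.
# Timings (seconds to a fixed depth, same position) are made up.
from statistics import mean, stdev

runs = [61.2, 58.7, 64.9, 60.1]   # e.g. four 8-thread runs

avg = mean(runs)
rel_spread = stdev(runs) / avg    # coefficient of variation

# Crude rule of thumb (an assumption): if the runs vary by more than
# ~10% of the mean, collect more runs before trusting the average.
need_more_runs = rel_spread > 0.10
print(f"mean {avg:.1f}s, spread {rel_spread:.1%}, more runs: {need_more_runs}")
```

Nondeterministic split timing and hash-table interleaving make SMP searches inherently noisy, so the spread itself is useful data: the larger it is, the more repetitions a trustworthy average needs.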
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Laszlo Gaspar wrote:I can accept that it should work that way, and it might explain why my SMP is not as good as the SMP in Crafty. But the question is what algorithm is implemented in Komodo, and this was my idea. It looks like it is not DTS, and I agree with Edsel in his opinion.
Note that Crafty is NOT DTS either. I chose to avoid rewriting the search and getting rid of the recursion, which is really nice for things like IID and re-searching to verify things... DTS is a loop-based approach so that one can see ALL possible split points simultaneously, rather than just having the option to "split here or wait to split elsewhere"...
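The structural difference bob describes can be shown with a toy sketch. This is not Crafty or Cray Blitz code; the node list and the scoring heuristic are invented purely to illustrate "split here or wait" versus ranking every candidate split point:

```python
# Toy illustration of recursive vs DTS-style split-point selection.
# Hypothetical (node_id, remaining_depth, moves_left) tuples for every
# node currently on a thread's search path:
active_nodes = [
    ("root",  14, 20),
    ("n1",    10,  7),
    ("n2",     6, 12),
]

def split_score(node):
    """Prefer deep nodes with many unsearched moves (assumed heuristic)."""
    _, depth, moves_left = node
    return depth * moves_left

# Recursive engine: only the current (deepest) node is an option, because
# the rest of the path is buried in the call stack.
recursive_choice = active_nodes[-1]

# DTS-style iterated engine: the whole active tree is visible, so a
# helper thread can rank ALL candidate split points and pick the best.
dts_choice = max(active_nodes, key=split_score)

print("recursive splits at:", recursive_choice[0])
print("DTS splits at:", dts_choice[0])
```

Under this invented scoring, the DTS-style search jumps to a shallower node near the root with more work remaining, while the recursive search is stuck choosing at the node it happens to be in, which is exactly the flexibility that giving up recursion buys.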