Measure of SMP scalability


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Laskos wrote:
bob wrote:There you go! Answers the question I posed before I saw your reply. Seems like it is quite easy to jump to conclusions here. And then reality crashes in. :)

Bugs happen. Regularly when one is dealing with parallel search.
What "there you go"? The gain from width is real even in buggy Hannibal, it's not simply overhead, it's 77 Elo points gain at fixed depth. The problem is that Hannibal almost doesn't go deeper 1->4 cores, time-to-depth being only 1.33. The overall effective speedup also suffers at 2.27, compared to 2.7-3.1 of most other engines. Yes, it is a bug, but not "there you go". Hannibal still has 2.27 effective speedup, coming mostly from non-trivial (non-overhead) widening, which gains points.

And as Edsel points out, Komodo's implementation is probably different and not buggy.
As a programmer, you should know that sometimes a bug hurts performance, sometimes it helps. He explicitly stated that this "increase" was unintentional and was the direct result of a bug that he has now fixed.

Ergo, no conclusions can be drawn from it at all. Particularly in light of the fact that the bug was fixed and the behavior will change...

The "there you go" was simply another way of saying "OK, there's the explanation for the increased node count, a program bug, exactly as I had speculated as a possible explanation." I had speculated that might be the case before seeing his post.
syzygy
Posts: 5892
Joined: Tue Feb 28, 2012 11:56 pm

Re: Measure of SMP scalability

Post by syzygy »

bob wrote:
Laskos wrote:
bob wrote:There you go! Answers the question I posed before I saw your reply. Seems like it is quite easy to jump to conclusions here. And then reality crashes in. :)

Bugs happen. Regularly when one is dealing with parallel search.
What "there you go"? The gain from width is real even in buggy Hannibal, it's not simply overhead, it's 77 Elo points gain at fixed depth. The problem is that Hannibal almost doesn't go deeper 1->4 cores, time-to-depth being only 1.33. The overall effective speedup also suffers at 2.27, compared to 2.7-3.1 of most other engines. Yes, it is a bug, but not "there you go". Hannibal still has 2.27 effective speedup, coming mostly from non-trivial (non-overhead) widening, which gains points.

And as Edsel points out, Komodo's implementation is probably different and not buggy.
As a programmer, you should know that sometimes a bug hurts performance, sometimes it helps. He explicitly stated that this "increase" was unintentional and was the direct result of a bug that he has now fixed.

Ergo, no conclusions can be drawn from it at all.
So what would you call the speed up of Hannibal?

As measured by Kai, when going from 1 to 4 cores time-to-depth improves by a factor of 1.33. But the 4-core search has the strength of a 2.27x faster 1-core search.

You can call the speed up 1.33, but that's just a useless number.

The number that measures the complete SMP implementation is 2.27.
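The arithmetic behind these two numbers can be laid out explicitly. A back-of-the-envelope sketch, assuming (as a simplification) that the depth gain and the widening gain multiply:

```python
# Decomposition of Hannibal's 1->4 core figures (Kai's measurements above).
# Assumes the depth gain and the widening gain multiply -- a simplification.
time_to_depth = 1.33    # 4 cores reach the same depth 1.33x faster
effective = 2.27        # 4-core play equals a 2.27x faster 1-core search

widening = effective / time_to_depth
print(f"implied widening factor: {widening:.2f}")  # ~1.71
```

On this reading, most of Hannibal's effective speedup comes from the widening factor rather than from reaching greater depth.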
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

syzygy wrote:
bob wrote:
Laskos wrote:
bob wrote:There you go! Answers the question I posed before I saw your reply. Seems like it is quite easy to jump to conclusions here. And then reality crashes in. :)

Bugs happen. Regularly when one is dealing with parallel search.
What "there you go"? The gain from width is real even in buggy Hannibal, it's not simply overhead, it's 77 Elo points gain at fixed depth. The problem is that Hannibal almost doesn't go deeper 1->4 cores, time-to-depth being only 1.33. The overall effective speedup also suffers at 2.27, compared to 2.7-3.1 of most other engines. Yes, it is a bug, but not "there you go". Hannibal still has 2.27 effective speedup, coming mostly from non-trivial (non-overhead) widening, which gains points.

And as Edsel points out, Komodo's implementation is probably different and not buggy.
As a programmer, you should know that sometimes a bug hurts performance, sometimes it helps. He explicitly stated that this "increase" was unintentional and was the direct result of a bug that he has now fixed.

Ergo, no conclusions can be drawn from it at all.
So what would you call the speed up of Hannibal?

As measured by Kai, when going from 1 to 4 cores time-to-depth improves by a factor of 1.33. But the 4-core search has the strength of a 2.27x faster 1-core search.

You can call the speed up 1.33, but that's just a useless number.

The number that measures the complete SMP implementation is 2.27.
The 2.27 is a "buggy Elo". Did you read the entire thread? 1.33 shows there is work left to be done, nothing more, nothing less...

BTW, a little math, just for fun.

The Elo gain presumably shows a number that suggests the thing is 2.27x faster, part of which comes from extra depth (how much is unknown) and part of which comes from a bug that makes the search wider. Since we have no ACTUAL speedup numbers (the 1 to 4 cpu number is meaningless with the aforementioned bug included), we might guess.

Last time I ran the test, 1 ply = 70 Elo for Crafty. Without an accurate number for Hannibal, I can only assume something similar. Which means he is using about 1/2 the available computing power. If, instead of 1.33x faster, he was hitting 3.3x (Crafty's actual number, for reference), that is a ply, a full 2x improvement over the 1 cpu version.

Out of his current 2.27, what do you assume comes from additional depth and what comes from additional width? Half each? Then the "widening" is not as effective as going for the full 3x+ speedup, since that should give +70 Elo by itself, which is as good as the entire improvement he sees.

As I said, the idea is there, but trading width for depth is a tradeoff that requires some serious analysis. In this case, I'd be after the full 3x+ speedup, losing the extra width, which is likely exactly what is going to happen when the author fixes the bug...

It is a serious decision to give up a ply. Serious enough to make one look carefully at why "wider" is worth more than the ply. At first look, it would seem that issue is worth addressing before taking a hammer to the problem and giving up the ply. I've always found positions in chess where another ply would have been just what was needed to spot something critical.
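The math above can be written out explicitly. The ~70 Elo per ply figure is the Crafty measurement quoted in the post; the effective branching factor of ~2 (one extra ply per doubling of speed) is an assumed round figure for illustration:

```python
import math

# Rough conversion from speedup to Elo. The ~70 Elo per ply figure is
# Crafty's measured number quoted above; the effective branching factor
# of ~2 (one extra ply per doubling of speed) is an assumed round figure.
ELO_PER_PLY = 70
EBF = 2.0

def speedup_to_elo(speedup):
    """Elo equivalent of a given effective speedup."""
    extra_plies = math.log(speedup, EBF)
    return extra_plies * ELO_PER_PLY

print(f"1.33x (time-to-depth): {speedup_to_elo(1.33):.0f} Elo")  # ~29
print(f"2.27x (effective):     {speedup_to_elo(2.27):.0f} Elo")  # ~83
print(f"3.30x (target):        {speedup_to_elo(3.3):.0f} Elo")   # ~121
```

Under these assumptions, getting the time-to-depth speedup from 1.33x up toward 3.3x would by itself be worth more than the widening currently delivers.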
Last edited by bob on Sat Jul 06, 2013 11:24 pm, edited 1 time in total.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Measure of SMP scalability

Post by Sven »

bob wrote:
michiguel wrote:
bob wrote:Amdahl's law can be applied to ANY algorithm that has a specific output for a specific input.
No, the work done needs to be the same.
work done = search to fixed depth. Seems clear enough to me. Doesn't matter whether the serial program uses alpha/beta and the parallel program uses pure minimax. Time to depth is "the same task"
Differences are much beyond that "alpha/beta vs. minimax" level. The latter are two different "perfect" search algorithms leading to exactly the same result but with very different effort. But we are no longer at this trivial level today. E.g. searching to fixed depth N with
a) very aggressive pruning and reductions,
b) medium-level pruning and reductions, or
c) no pruning or reductions at all (=basic alpha/beta)
should be seen as three different tasks since three different trees are searched with different node counts, different accuracy, and most probably different strength. As long as all three would always select the same root move you could say that a) is best, and it is clearly also best in time-to-depth. But you know well that this is not the case in reality. More accuracy *may* improve strength by selecting a better root move (not necessarily). So time-to-depth does not always apply, in sequential as well as parallel search. Understanding this does not even require the "1->4 cores" step.

Now you keep asking: why shouldn't this increased accuracy also help in sequential search? I can't tell since I don't know the implementation of Komodo, Hiarcs or Rybka (I leave out Hannibal even though it belongs here, too, simply to avoid the "bug" discussion), but instead of saying that there must be something wrong, or that these programs must have a fundamental problem, I'd say that there could simply be some detail in the SMP algorithms of these programs that inherently increases the accuracy, i.e. causes the widening, so it is not necessarily something that you can apply to the sequential search by tuning some search parameters.

You would certainly know more if you would know the implementation details of these three or four programs ...

Sven
syzygy
Posts: 5892
Joined: Tue Feb 28, 2012 11:56 pm

Re: Measure of SMP scalability

Post by syzygy »

bob wrote:
The number that measures the complete SMP implementation is 2.27.
The 2.27 is a "buggy Elo". Did you read the entire thread? 1.33 shows there is work left to be done, nothing more, nothing less...
The Elo is not real because the program has a bug?

So if Houdini has a bug, its Elo is not real? Are you sure Crafty has no bug?

The number that measures Hannibal's complete SMP implementation, including all its bugs, is 2.27. Denying this is just silly.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

syzygy wrote:
bob wrote:
The number that measures the complete SMP implementation is 2.27.
The 2.27 is a "buggy Elo". Did you read the entire thread? 1.33 shows there is work left to be done, nothing more, nothing less...
The Elo is not real because the program has a bug?

So if Houdini has a bug, its Elo is not real? Are you sure Crafty has no bug?
When the bug fix is going to change the results, of course it is not real. Let's wait to see the REAL numbers, eh? Rather than arguing about hypotheticals.

Would be interesting to get some better data, 8 cores, 16 cores for example. Widening can't go on forever, and doesn't scale very well...

Then you come back to the only thing left, a more efficient search that can go deeper. If you had started there... you'd already be there...
syzygy
Posts: 5892
Joined: Tue Feb 28, 2012 11:56 pm

Re: Measure of SMP scalability

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:
The number that measures the complete SMP implementation is 2.27.
The 2.27 is a "buggy Elo". Did you read the entire thread? 1.33 shows there is work left to be done, nothing more, nothing less...
The Elo is not real because the program has a bug?

So if Houdini has a bug, its Elo is not real? Are you sure Crafty has no bug?
When the bug fix is going to change the results, of course it is not real. Let's wait to see the REAL numbers, eh? Rather than arguing about hypotheticals.
What is hypothetical about Kai's numbers? Are they not real? Crafty does not have any bugs, or is it a hypothetical program as well? Are we living on the same planet?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Sven Schüle wrote:
bob wrote:
michiguel wrote:
bob wrote:Amdahl's law can be applied to ANY algorithm that has a specific output for a specific input.
No, the work done needs to be the same.
work done = search to fixed depth. Seems clear enough to me. Doesn't matter whether the serial program uses alpha/beta and the parallel program uses pure minimax. Time to depth is "the same task"
Differences are much beyond that "alpha/beta vs. minimax" level. The latter are two different "perfect" search algorithms leading to exactly the same result but with very different effort. But we are no longer at this trivial level today. E.g. searching to fixed depth N with
a) very aggressive pruning and reductions,
b) medium-level pruning and reductions, or
c) no pruning or reductions at all (=basic alpha/beta)
should be seen as three different tasks since three different trees are searched with different node counts, different accuracy, and most probably different strength. As long as all three would always select the same root move you could say that a) is best, and it is clearly also best in time-to-depth. But you know well that this is not the case in reality. More accuracy *may* improve strength by selecting a better root move (not necessarily). So time-to-depth does not always apply, in sequential as well as parallel search. Understanding this does not even require the "1->4 cores" step.

Now you keep asking: why shouldn't this increased accuracy also help in sequential search? I can't tell since I don't know the implementation of Komodo, Hiarcs or Rybka (I leave out Hannibal even though it belongs here, too, simply to avoid the "bug" discussion), but instead of saying that there must be something wrong, or that these programs must have a fundamental problem, I'd say that there could simply be some detail in the SMP algorithms of these programs that inherently increases the accuracy, i.e. causes the widening, so it is not necessarily something that you can apply to the sequential search by tuning some search parameters.

You would certainly know more if you would know the implementation details of these three or four programs ...

Sven
I would agree. That's why I tried to start a discussion on "how can this be beneficial." Clearly, for chess, widening is limited, because eventually you can reach the point where you reach the max width. For endgames, widening is also less effective because the trees are narrow due to material considerations.

So we have two effects.

1. Going deeper produces a pretty consistent Elo gain.

2. Going wider should hurt the serial program, otherwise the selectiveness is not well-tuned and needs work. Given, then, a pretty optimal selectivity in the serial search, you can go wider to reduce the errors the selectivity makes. How bad is the selectivity to make this worthwhile as opposed to focusing on going deeper using the normal selectivity rules?

That is a tradeoff that seems hard to get anyone to discuss with specific details. I suppose I could take Crafty as an example, and run a fixed-depth match against my usual gauntlet, and successively widen the search by tweaking back the reductions and pruning, and normalizing to tree size, to see what a 0.1 increase in EBF does to the Elo, then a 0.2, and so forth. At the same time, I will have the node counts, so I can accurately predict what the SMP slowdown will be due to the extra nodes.
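The EBF bookkeeping in such an experiment could be sketched like this (the node counts below are invented, purely for illustration, not real Crafty measurements):

```python
# Estimating effective branching factor (EBF) from per-iteration node
# counts in a fixed-depth experiment. The counts here are invented,
# purely to illustrate the bookkeeping -- not real measurements.
nodes_per_depth = {18: 1.0e7, 19: 2.1e7, 20: 4.4e7}

def ebf(nodes, depth):
    """Total nodes at this depth divided by total nodes one ply shallower."""
    return nodes[depth] / nodes[depth - 1]

for d in (19, 20):
    print(f"depth {d}: EBF = {ebf(nodes_per_depth, d):.2f}")
# A widening tweak that raises the EBF by 0.1 or 0.2 shows up directly in
# these ratios, and the raw node counts predict the SMP cost of the extra work.
```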

Now for a discussion on how to widen. Reduce reductions? Reduce null move R? Reduce forward pruning deep in the tree? I suppose I could try each of those separately, and produce a large matrix of results that might show where the most effective widening happened.

My only problem with that is that I have carefully tuned Crafty's selectiveness by playing hundreds of millions of games, adjusting each parameter to produce the best Elo. I have trouble imagining enough gain from detuning those values (assuming the cost of doing so is taken from the SMP search itself) to offset the loss in depth.

There's food for thought however...
syzygy
Posts: 5892
Joined: Tue Feb 28, 2012 11:56 pm

Re: Measure of SMP scalability

Post by syzygy »

bob wrote:That is a tradeoff that seems hard to get anyone to discuss with specific details.
Because the details of Komodo's implementation are unknown.

As I have already explained, it might be that because of the way Komodo is doing smp, it is pretty much forced to accept a wider tree. It might also be that tweaking the selectivity differently for multithreaded search just turned out to be stronger. It might also be that Komodo's current implementation is simply the result of a decision to release something that, for now, is good enough. And so far all tests indicate that it is indeed "good enough" in that its effective speedup in terms of Elo is competitive.

The analysis is really simple. Figuring out what it is exactly that Komodo is doing is not. Whether Komodo's approach is "the future" of smp is another thing. Personally I do not think so, but I may be proven wrong.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
The number that measures the complete SMP implementation is 2.27.
The 2.27 is a "buggy Elo". Did you read the entire thread? 1.33 shows there is work left to be done, nothing more, nothing less...
The Elo is not real because the program has a bug?

So if Houdini has a bug, its Elo is not real? Are you sure Crafty has no bug?
When the bug fix is going to change the results, of course it is not real. Let's wait to see the REAL numbers, eh? Rather than arguing about hypotheticals.
What is hypothetical about Kai's numbers? Are they not real? Crafty does not have any bugs, or is it a hypothetical program as well? Are we living on the same planet?
The program has a bug that the author has fixed but not released. It eliminates much of that "widening". So yes, his numbers are hypothetical since the REAL program won't be searching like that. Are we reading the same posts?

I certainly would not post data from a version that had a known bug, particularly had I already fixed it in a recent version. Such data would not be very useful, now would it? That was my point.