One assumption that leads to one question: what could possibly cause two different programs to have a limited parallel search improvement? The tree search problem is well-known and well-understood. Both programs are playing the same game, with the same rules, and there's nothing inherent in the tree search that serves as a limit to SMP speedup. The current problem is that alpha/beta is a sequential algorithm by its very definition, which makes parallel search a real challenge. But I don't believe that there is something in Komodo that limits parallel search potential compared to other chess engines, because they ALL use the same basic search approach, minimax + alpha/beta...

Laskos wrote:
bob wrote: ...
Suppose one has a single-core, highly optimized engine which is optimally wide and deep. Now you want to implement SMP for 4 cores. Suppose, for simplicity, that as far as speed goes, NPS on 4 cores is almost exactly 4 times larger than NPS on 1 core, so in the same amount of time the 4-core version reaches 4 times more nodes than the single core. Let's in fact have 2 such engines: Crafty and Komodo. Each author tries hard to make his engine copy on 4 cores the optimal 1-core tree shape, being 4 times faster. If the copy is perfect, then the effective speedup is 4.
Working on Crafty, one discovers that no matter how hard he tries, only 80% of the 4-core engine's nodes perfectly copy the 1-core nodes; these are fully able nodes. The other 20% of nodes are completely wasted; these are dead nodes. The effective speedup of Crafty is 4*0.8 = 3.2, and the engine on 4 cores mimics the 1-core engine perfectly (the same tree), being 3.2 times faster. The only variable to measure in this engine is the time-to-depth speedup (TTD speedup), which is 3.2 for 4 cores, the same 3.2 as the total effective speedup.
Working on Komodo, one has a nasty surprise: the engine, optimized on 1 core, is harder to parallelize. No matter how hard he tries, only 50% of the 4-core nodes mimic the 1-core tree perfectly, and apparently 50% of nodes are wasted. Left in this shape, the effective speedup on 4 cores is 4*0.5 = 2.0, equal to the TTD speedup, much worse than Crafty's 3.2. Analyzing the remaining 50% of wasted nodes, he discovers that, as in Crafty, 20% are completely dead nodes, but 30% of nodes have some sort of flexibility: they are not fully free to mimic the original tree shape, but they are not completely dead either. Suppose they can be used to deal with the lines that are heavily reduced in the single-core program. The depth reached does not change; we have achieved the maximum depth and a TTD speedup of 2.0 using the 50% fully able nodes. But those 30% partly able nodes do contribute some strength increase by dealing with severely reduced lines in the single-core tree. This increase in strength without a change in TTD speedup we call "widening". One probably admits that partly able nodes used for widening are not as efficient at increasing strength as fully able nodes in the original tree are purely via TTD. Say their efficiency is 50%, and, as already said, these partly able nodes are 30% of the total nodes searched on 4 cores. So this search will look as follows:
50% of nodes used fully to mimic the 1-core tree shape, giving a TTD speedup of 2.0 on 4 cores and a contribution of 2.0 to the effective speedup.
20% of completely wasted nodes, exactly as in Crafty.
30% of partly able nodes used for whatever their limited freedom allows, say to deal with severe reductions in the 1-core-shaped tree; these do not change depth or TTD, but they add a fixed-depth widening contribution of 4*0.5*0.3 = 0.6 to the effective speedup.
The total effective speedup of Komodo on 4 cores this way is 2.0 (from TTD speedup) + 0.6 (from widening) = 2.6, with 80% of nodes used in some more or less efficient way.
It is certainly desirable to have Crafty's pure, non-widening TTD effective speedup of 3.2, if we keep the example numbers. But applied to Komodo's peculiarity in parallelization, a non-widening Komodo would have an effective speedup of 2.0, and the widening one 2.6, which is surely better than 2.0.
Now, aspiring to improve with the new Komodo, Mark and Larry find that in fact more 4-core nodes can be used to copy the 1-core tree. Say that this way more nodes are wasted, 30% instead of 20%, and one is not tempted to waste nodes. But they see that 70% of nodes are now fully able to copy the optimized 1-core shape. Although the 70% of nodes used now are fewer than the 80% before, they are more efficient, as all of them go toward TTD and effective speedup, which are both equal to the non-widening 4*0.7 = 2.8. Better than the old 2.6. So they improved the effective speedup from 2.6 to 2.8 by going from widening to non-widening. But that is not a question of principle. Suppose that among the 30% wasted nodes they find 10% which can be used to widen. They would then get (say, in the future) an effective speedup of 2.8 + 4*0.1*0.5 = 3.0 on 4 cores with a widening engine, better than the present-day non-widening Komodo.
Then, adding more cores beyond 4 will most probably again reduce the proportion of the "mimicking" part of the tree (pure TTD speedup), and widening may gain in importance.
This may be the peculiarity of Komodo's tree shape: it is prone to widen when going multi-core, while Crafty is prone to a non-widening tree. To me there is no theoretical argument that "widening is always bad" when the scaling is clearly below the ideal (4 on 4 cores). The state of chess engines is such that on average the scaling is pretty far from perfect. The multiplexing argument doesn't work for non-superlinear SMP. Superlinearity is an exception, and when it is systematic it IS a bug, as the multiplexing thought experiment shows.
In fact, the only argument against widening is that when the effective speedup gets very close to ideal (say above 3.9 on 4 cores), the margin of maneuver for widening is reduced; one could improve only by going a little bit wider. We are not there yet, especially on, say, 16 cores and more.
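The effective-speedup arithmetic in the example above can be written down in a few lines. This is only a sketch of the made-up numbers in the thought experiment (the node fractions and the assumed 50% efficiency of partly able nodes), not measurements from any engine:

Code:
def effective_speedup(cores, fully_able, partly_able, partly_efficiency=0.5):
    # Nodes that copy the 1-core tree push depth (TTD); partly able nodes only widen.
    ttd = cores * fully_able
    widening = cores * partly_able * partly_efficiency
    return ttd, widening, ttd + widening

print(effective_speedup(4, 0.80, 0.00))  # Crafty-like:        (3.2, 0.0, 3.2)
print(effective_speedup(4, 0.50, 0.30))  # old Komodo-like:    (2.0, 0.6, 2.6)
print(effective_speedup(4, 0.70, 0.00))  # new Komodo-like:    (2.8, 0.0, 2.8)
print(effective_speedup(4, 0.70, 0.10))  # plus 10% widening:  (2.8, 0.2, 3.0)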
Significant modification of Komodo SMP implementation
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Significant modification of Komodo SMP implementation
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Significant modification of Komodo SMP implementation
Some good engines using Lazy SMP, like Cheng and Andscacs, widen even more and have fairly good SMP results. I guess it's often just easier to get a 1.7*1.7 = 2.9 speedup on 4 cores from 1.7 deepening and 1.7 widening than to get 2.9 from pure deepening (having the same search tree as the optimized 1-core tree). With more effort, one would slowly move toward more deepening and less widening. The ideal speedup of 4 on 4 cores comes only through pure deepening, without widening, but we are still far from it, and widening will be present in many engines for some time to come. Even more so on 16 cores and up. I guess many details of optimal parallelization may hide in search particularities, not just minimax and alpha-beta, but I have no competence in this.

bob wrote: One assumption that leads to one question: what could possibly cause two different programs to have a limited parallel search improvement? The tree search problem is well-known and well-understood. Both programs are playing the same game, with the same rules, and there's nothing inherent in the tree search that serves as a limit to SMP speedup. The current problem is that alpha/beta is a sequential algorithm by its very definition, which makes parallel search a real challenge. But I don't believe that there is something in Komodo that limits parallel search potential compared to other chess engines, because they ALL use the same basic search approach, minimax + alpha/beta...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Significant modification of Komodo SMP implementation
However, things like Lazy SMP don't intentionally "widen". It happens because of an inefficient search algorithm. Any time you add nodes to the serial tree, you are widening the search, albeit almost certainly as an unintentional (and undesirable) side effect of an inefficient parallel search. And what is much more important in today's world is 8 cores and 16 cores. 4 is common in phones already, with more coming. That's where most of the "sorta-SMP searches" break down. They often work OK for 2, and barely OK for 4, but beyond that, forget about it...

Laskos wrote: Some good engines using Lazy SMP, like Cheng and Andscacs, widen even more and have fairly good SMP results. I guess it's often just easier to get a 1.7*1.7 = 2.9 speedup on 4 cores from 1.7 deepening and 1.7 widening than to get 2.9 from pure deepening (having the same search tree as the optimized 1-core tree). With more effort, one would slowly move toward more deepening and less widening. The ideal speedup of 4 on 4 cores comes only through pure deepening, without widening, but we are still far from it, and widening will be present in many engines for some time to come. Even more so on 16 cores and up. I guess many details of optimal parallelization may hide in search particularities, not just minimax and alpha-beta, but I have no competence in this.

bob wrote: One assumption that leads to one question: what could possibly cause two different programs to have a limited parallel search improvement? The tree search problem is well-known and well-understood. Both programs are playing the same game, with the same rules, and there's nothing inherent in the tree search that serves as a limit to SMP speedup. The current problem is that alpha/beta is a sequential algorithm by its very definition, which makes parallel search a real challenge. But I don't believe that there is something in Komodo that limits parallel search potential compared to other chess engines, because they ALL use the same basic search approach, minimax + alpha/beta...
Clearly getting something is better than ignoring multiple cores and getting nothing, so there's nothing wrong with that. But some will want it all (me included). There, searching extra nodes is the antithesis of our intentions.
-
- Posts: 2204
- Joined: Sat Jan 18, 2014 10:24 am
- Location: Andorra
Re: Significant modification of Komodo SMP implementation
One argument that seems to at least partly explain the efficiency of lazy SMP compared with other approaches is that we are far from having a perfect algorithm for selecting which moves to reduce more. So lazy SMP incidentally helps with this, simply by searching moves that most of the time were reduced too much.
Of course this can be replicated in a single-threaded search, but which algorithm would you follow? So lazy SMP partly solves this without an explicit algorithm, generating the behaviour on the fly.
At some point someone will be able to create a better algorithm to select which moves to reduce, and lazy SMP will lose part of its advantage over other SMP algorithms.
Daniel José -
http://www.andscacs.com

-
- Posts: 2204
- Joined: Sat Jan 18, 2014 10:24 am
- Location: Andorra
Re: Significant modification of Komodo SMP implementation
A better way of putting this: the more efficient the move ordering, the less sense lazy SMP makes.

cdani wrote: At some point someone will be able to create a better algorithm to select which moves to reduce, and lazy SMP will lose part of its advantage over other SMP algorithms.
Daniel José -
http://www.andscacs.com

-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Significant modification of Komodo SMP implementation
While I understand what you mean, your phrasing is not very good. Lazy SMP doesn't have an advantage over other SMP algorithms. It is worse. Significantly worse.

cdani wrote: One argument that seems to at least partly explain the efficiency of lazy SMP compared with other approaches is that we are far from having a perfect algorithm for selecting which moves to reduce more. So lazy SMP incidentally helps with this, simply by searching moves that most of the time were reduced too much.
Of course this can be replicated in a single-threaded search, but which algorithm would you follow? So lazy SMP partly solves this without an explicit algorithm, generating the behaviour on the fly.
At some point someone will be able to create a better algorithm to select which moves to reduce, and lazy SMP will lose part of its advantage over other SMP algorithms.
The idea of using parallel search to offset excessive reductions is not a good argument for the approach. It just suggests that the reduction code needs work as well. SMP search should be about searching the tree as quickly as possible, not trying to compensate for algorithmic issues that are completely unrelated to SMP search.
-
- Posts: 22
- Joined: Fri Jul 15, 2011 9:53 pm
Re: Significant modification of Komodo SMP implementation
I have read this discussion previously, and I reached the point where I don't think I understand either side. So let me recap.
So we have a game where we need to create a search tree. Given that, the engine will choose a move. It may be strong, or it may be weak depending on the search tree attained in the search phase of the computational process. So the natural problem is: Which is the best search tree (given an evaluation function)?
Each engine will have its own best tree, but whatever it is, it stands to reason that the single processor implementation will go for that tree. I believe this is the point Bob makes.
So now, let's assume that we have two very good search trees. The first one is the best of the two: it searches a little bit farther and is less wide. The second one, although wider, does a decent job.
As far as I understand, the point of the multiprocessor is to search faster, so we can search more, so the correct comparison between the search of one core and the search of multiple cores would be to give the single-core engine 4x the time of the 4-core engine. And those are the trees we ought to compare; otherwise we are not comparing the same thing.
When doing a multicore implementation we have to change things because the alpha-beta algorithm is a serial one. Normally you would like to have the best tree possible, but you can't have that as there is no perfect parallel search, hence you have to settle for one sub-optimal tree.
I think what the Komodo programmers and fans stated was that maybe Komodo's peculiar implementation made the second tree, the one that was not as good as the best one but was not that bad, the easier one for the parallel implementation.
The choice was: go for the best tree, where we will waste something like 30% of the nodes searched (I'm making up the numbers), or go for the second-best tree, which is already wider in the SMP implementation and only wastes 28% of the nodes? Maybe this small 2% is enough to offset the sub-optimal nature of the second tree... If that is the case, then obviously for them this was no bug; it was a conscious decision. They knew perfectly well that the wider tree they were searching was not perfect. Indeed, they knew that they would be wasting 28% of the nodes of that tree, compared to the original sub-optimal tree, but going for the optimal tree would waste 30% of the nodes instead!
Now they have improved the parallelization algorithm, and maybe the 30% of wasted nodes on the optimal tree has gone down to 29% (it is still more difficult to do than the other one, but it is closer), and that in turn is enough to make it the better choice, as in the single-core implementation.
I would like to hear from all the parties here. Am I getting it right? Because it may well be that we are arguing about different things: Bob claims that a wider tree is a worse tree than the narrow one (otherwise you should go for the same tree in SMP, as you found that tree to be optimal), while the Komodo programmers defend that, while the optimal tree is narrower, they had no chance of attaining it (it is impossible), and found that by going for a sub-optimal tree they got a better parallel implementation which wasted fewer nodes.
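To make the trade-off in the question concrete, here is a tiny sketch with invented numbers (the 30%/28% waste figures from the post plus a made-up "tree quality" penalty for the wider tree); it only illustrates that a couple of percent less waste can offset a slightly worse tree, nothing more:

Code:
nps_speedup = 4.0
trees = {
    "narrow (best serial tree)": {"waste": 0.30, "quality": 1.00},
    "wider (easier to split)":   {"waste": 0.28, "quality": 0.97},  # quality penalty is invented
}
for name, t in trees.items():
    useful_work = nps_speedup * (1 - t["waste"]) * t["quality"]
    print(f"{name}: {useful_work:.2f}")
# narrow: 4 * 0.70 * 1.00 = 2.80
# wider:  4 * 0.72 * 0.97 ~ 2.79  -> essentially a wash; small shifts tip it either way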
-
- Posts: 2204
- Joined: Sat Jan 18, 2014 10:24 am
- Location: Andorra
Re: Significant modification of Komodo SMP implementation
Yes, I did not explain it quite well.

bob wrote: While I understand what you mean, your phrasing is not very good. Lazy SMP doesn't have an advantage over other SMP algorithms. It is worse. Significantly worse.
The idea of using parallel search to offset excessive reductions is not a good argument for the approach. It just suggests that the reduction code needs work as well. SMP search should be about searching the tree as quickly as possible, not trying to compensate for algorithmic issues that are completely unrelated to SMP search.

On the algorithm part you are also right, but sometimes it is difficult to deduce an algorithm, and one ends up using something else. See neural networks, for example, just to give an idea.
Daniel José -
http://www.andscacs.com

-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Significant modification of Komodo SMP implementation
Pretty good summary. The basic idea is that a parallel search will almost always search a larger tree than the equivalent-depth serial search, because of imperfect move ordering. Sometimes those extra nodes actually help, although this is purely accidental in nature. Sometimes those extra nodes just slow the parallel search down, since it is wasting time searching unnecessary stuff. The goal is to search the same tree as the serial search. But without perfect move ordering, this is not possible.

facil wrote: I have read this discussion previously, and I reached the point where I don't think I understand either side. So let me recap.
So we have a game where we need to create a search tree. Given that, the engine will choose a move. It may be strong, or it may be weak depending on the search tree attained in the search phase of the computational process. So the natural problem is: Which is the best search tree (given an evaluation function)?
Each engine will have its own best tree, but whatever it is, it stands to reason that the single processor implementation will go for that tree. I believe this is the point Bob makes.
So now, let's assume that we have two very good search trees. The first one is the best of the two: it searches a little bit farther and is less wide. The second one, although wider, does a decent job.
As far as I understand, the point of the multiprocessor is to search faster, so we can search more, so the correct comparison between the search of one core and the search of multiple cores would be to give the single-core engine 4x the time of the 4-core engine. And those are the trees we ought to compare; otherwise we are not comparing the same thing.
When doing a multicore implementation we have to change things because the alpha-beta algorithm is a serial one. Normally you would like to have the best tree possible, but you can't have that as there is no perfect parallel search, hence you have to settle for one sub-optimal tree.
I think what the Komodo programmers and fans stated was that maybe Komodo's peculiar implementation made the second tree, the one that was not as good as the best one but was not that bad, the easier one for the parallel implementation.
The choice was: go for the best tree, where we will waste something like 30% of the nodes searched (I'm making up the numbers), or go for the second-best tree, which is already wider in the SMP implementation and only wastes 28% of the nodes? Maybe this small 2% is enough to offset the sub-optimal nature of the second tree... If that is the case, then obviously for them this was no bug; it was a conscious decision. They knew perfectly well that the wider tree they were searching was not perfect. Indeed, they knew that they would be wasting 28% of the nodes of that tree, compared to the original sub-optimal tree, but going for the optimal tree would waste 30% of the nodes instead!
Now they have improved the parallelization algorithm, and maybe the 30% of wasted nodes on the optimal tree has gone down to 29% (it is still more difficult to do than the other one, but it is closer), and that in turn is enough to make it the better choice, as in the single-core implementation.
I would like to hear from all the parties here. Am I getting it right? Because it may well be that we are arguing about different things: Bob claims that a wider tree is a worse tree than the narrow one (otherwise you should go for the same tree in SMP, as you found that tree to be optimal), while the Komodo programmers defend that, while the optimal tree is narrower, they had no chance of attaining it (it is impossible), and found that by going for a sub-optimal tree they got a better parallel implementation which wasted fewer nodes.
In my dissertation I actually studied this and showed that DTS could provide perfectly linear speedups when searching a tree that is ordered perfectly (I did this by having a contrived evaluation that returned steadily decreasing numbers rather than actual evaluations). I did the same for worst-first ordering (which reverts alpha/beta back to pure minimax). It is the "in-between" area that is problematic, where the ordering is good but not perfect. The curve was bell-shaped. The closer you get to perfect ordering, the closer you get to a linear speedup. As you get further away from perfect, the overhead climbs until you reach the center of the range; then the speedup starts to improve again, since you are getting closer to worst-first ordering, which is also easy to search with a linear speedup.
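A toy illustration of the move-ordering point (not from the dissertation; the tree sizes and the "oracle" orderings are made up for the example): plain alpha/beta on the same random tree, with children ordered best-first, worst-first, or not at all. Perfect ordering searches close to the minimal tree, worst-first ordering degenerates toward full minimax, and the unordered case lands in between, which is the region where the parallel overhead discussed above shows up:

Code:
import random

def build_tree(branching, depth, rng):
    # A uniform random game tree: leaves are values, interior nodes are lists of children.
    if depth == 0:
        return rng.random()
    return [build_tree(branching, depth - 1, rng) for _ in range(branching)]

def minimax(node, maximize):
    if not isinstance(node, list):
        return node
    vals = [minimax(c, not maximize) for c in node]
    return max(vals) if maximize else min(vals)

def alphabeta(node, order, counter, alpha=float("-inf"), beta=float("inf"), maximize=True):
    counter[0] += 1
    if not isinstance(node, list):
        return node
    best = float("-inf") if maximize else float("inf")
    for child in order(node, maximize):
        v = alphabeta(child, order, counter, alpha, beta, not maximize)
        if maximize:
            best = max(best, v)
            alpha = max(alpha, best)
        else:
            best = min(best, v)
            beta = min(beta, best)
        if alpha >= beta:
            break
    return best

# "Oracle" orderings that peek at the true subtree values -- something a real engine cannot do.
def best_first(node, maximize):
    return sorted(node, key=lambda c: minimax(c, not maximize), reverse=maximize)

def worst_first(node, maximize):
    return sorted(node, key=lambda c: minimax(c, not maximize), reverse=not maximize)

def unordered(node, maximize):
    return node

rng = random.Random(1)
tree = build_tree(branching=5, depth=6, rng=rng)
for name, order in [("perfect", best_first), ("unordered", unordered), ("worst-first", worst_first)]:
    visited = [0]
    alphabeta(tree, order, visited)
    print(f"{name:>11}: {visited[0]} nodes visited")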
Parallel search is certainly not easy. Good parallel search is even harder. I suppose one might give up on improving the parallel search and somehow decide to limit reductions and such somewhat. But that would appear to make the original parallel search even worse.
BTW it would appear that Mark decided to address this and has made some more "normal" improvements to the search, trying to limit the overhead nodes. We've had some email discussion about other things he can do to improve the search.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Significant modification of Komodo SMP implementation
I don't know how I overlooked that, but only today I tested Crafty 23.6 at fixed depth to detect possible widening. Well, Crafty 23.6 (the last version I have) widens!

bob wrote: However, things like Lazy SMP don't intentionally "widen". It happens because of an inefficient search algorithm. Any time you add nodes to the serial tree, you are widening the search, albeit almost certainly as an unintentional (and undesirable) side effect of an inefficient parallel search. And what is much more important in today's world is 8 cores and 16 cores. 4 is common in phones already, with more coming. That's where most of the "sorta-SMP searches" break down. They often work OK for 2, and barely OK for 4, but beyond that, forget about it... Clearly getting something is better than ignoring multiple cores and getting nothing, so there's nothing wrong with that. But some will want it all (me included). There, searching extra nodes is the antithesis of our intentions.

Laskos wrote: Some good engines using Lazy SMP, like Cheng and Andscacs, widen even more and have fairly good SMP results. I guess it's often just easier to get a 1.7*1.7 = 2.9 speedup on 4 cores from 1.7 deepening and 1.7 widening than to get 2.9 from pure deepening (having the same search tree as the optimized 1-core tree). With more effort, one would slowly move toward more deepening and less widening. The ideal speedup of 4 on 4 cores comes only through pure deepening, without widening, but we are still far from it, and widening will be present in many engines for some time to come. Even more so on 16 cores and up. I guess many details of optimal parallelization may hide in search particularities, not just minimax and alpha-beta, but I have no competence in this.

bob wrote: One assumption that leads to one question: what could possibly cause two different programs to have a limited parallel search improvement? The tree search problem is well-known and well-understood. Both programs are playing the same game, with the same rules, and there's nothing inherent in the tree search that serves as a limit to SMP speedup. The current problem is that alpha/beta is a sequential algorithm by its very definition, which makes parallel search a real challenge. But I don't believe that there is something in Komodo that limits parallel search potential compared to other chess engines, because they ALL use the same basic search approach, minimax + alpha/beta...
Fixed depth=10, Crafty 23.6 on 4 threads against Crafty 23.6 on 1 thread:
Code:
Score of Crafty 4 threads vs Crafty 1 thread: 680 - 567 - 753 [0.53] 2000
ELO difference: 20
Finished match
That's why I am getting only a 2.7 time-to-depth (TTD) speedup for Crafty 23.6 on 4 cores: TTD is not the whole contribution to the effective speedup, which also receives a contribution from widening, by a factor close to 1.15.
So, the effective speedup of Crafty 23.6 on 4 cores is:
2.7 (TTD) * 1.15 (widening) = 3.1
A more reasonable effective speedup now, and closer to the NPS speedup of 3.2-3.3.
So even Crafty 23.6 widens, although much less, and without any intention on your part to widen.
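For reference, the step from "+20 Elo at fixed depth" to a widening factor of about 1.15 can be sketched with the usual rule of thumb that Elo grows roughly linearly with the logarithm of search effort. The Elo-per-doubling values below are assumptions for illustration, not something measured in this test:

Code:
def widening_factor(elo_gain, elo_per_doubling):
    # Equivalent speed/time factor implied by a fixed-depth Elo gain,
    # assuming Elo is roughly linear in log2(search effort).
    return 2 ** (elo_gain / elo_per_doubling)

for epd in (60, 80, 100):
    print(epd, round(widening_factor(20, epd), 2))   # 1.26, 1.19, 1.15

# A factor of ~1.15 for +20 Elo corresponds to roughly 100 Elo per doubling.
# Effective speedup = TTD speedup * widening factor:
print(round(2.7 * widening_factor(20, 100), 2))      # ~3.1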