Significant modification of Komodo SMP implementation

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Significant modification of Komodo SMP implementation

Post by bob »

Dann Corbit wrote:
lkaufman wrote:
bob wrote: I had told you "this is a bug and will probably be fixed." :)

It was and it was...

Nobody wants to "widen" intentionally; it just makes the tree larger than necessary.
Although we certainly find and fix bugs from time to time, this was not one of them. The widening got us plenty of elo, it was not at all wasted, but we made a deliberate decision to trade widening for depth, which our testing showed was a small net gain, and Kai has confirmed this. Good detective work, Kai!
Consider an engine that searches only the first move selected by the move ordering at every node. It might be pretty good most of the time, and the engine would get absurdly deep, since it has a branching factor of 1.

But widening this search will certainly help the strength (unless the move ordering is a perfect oracle for the first move in the list).

Echoing Einstein's "Make things as simple as possible, but no simpler," we can also say: "Make the search as narrow as possible, but no narrower."
Does that not apply to BOTH serial and parallel searches, however?
Modern Times
Posts: 3748
Joined: Thu Jun 07, 2012 11:02 pm

Re: Significant modification of Komodo SMP implementation

Post by Modern Times »

bob wrote:
Searching the SAME tree, just faster, is the goal for parallel search. Always has been.
The goal of SMP is to improve Elo by whatever means the programmer chooses, imho.

Komodo going wider was not a bug; it was by design.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Significant modification of Komodo SMP implementation

Post by Laskos »

bob wrote:
I consider anything that is a detriment to performance "a bug". This seems to fit that perfectly. If searching extra nodes is good (so-called "widening") then it should be just as good on a single-core search. The only way it will work on multi-core searches is if the search is too selective and the "widening" compensates for that a bit. But you can compensate in the same way in a serial search also.

Searching the SAME tree, just faster, is the goal for parallel search. Always has been. Widening is only practical if the parallel search just doesn't work very well and doesn't search any deeper with more cores. A point I don't think we have come anywhere near reaching yet, based on the test results I posted here a couple of weeks back.

Using the new Intel compiler has improved those results significantly. In any case, glad you guys are making progress. I've got a PM to send to Mark for something he should try that will be another small gain, since he has worked on hashing recently. There's another step he can take that works well in my testing.
The multiplexing argument is not consistent with the fixed-time problem; it is consistent with the fixed-depth problem. The time required on 1 core to mimic the parallel 4-core search exactly is exactly 4 times larger. But say the effective speedup of the parallel 4-core search is 3.0. Then keeping the original 1-core search and giving it 4x more time gives an effective speedup of exactly 4, while mimicking the parallel search at 4x time gives only a 3.0 effective speedup. Therefore mimicking the parallel 4-core behavior on one core is suboptimal for 1-core performance at 4x time.
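That arithmetic can be laid out explicitly (a toy illustration; the 3.0 figure is the effective speedup assumed in the paragraph above, not a measurement):

```python
# Toy illustration of the fixed-time argument above.  "Effective speedup"
# means the single-core time multiplier that would buy the same strength.
# The 3.0 figure is an assumed 4-core effective speedup, not a measurement.

cores = 4
parallel_effective_speedup = 3.0

# Option A: keep the ordinary 1-core search and simply give it 4x more time.
# By definition, its effective speedup at 4x time is exactly 4.
plain_1core_at_4x = float(cores)

# Option B: mimic the 4-core search on 1 core.  Reproducing that tree takes
# 4x time, and that tree is only worth a 3.0 effective speedup.
mimic_1core_at_4x = parallel_effective_speedup

# So at equal (4x) time, mimicking the parallel tree on one core loses.
assert plain_1core_at_4x > mimic_1core_at_4x
print(plain_1core_at_4x, mimic_1core_at_4x)  # 4.0 3.0
```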
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Significant modification of Komodo SMP implementation

Post by bob »

Laskos wrote:...
I don't see what this has to do with anything I wrote. The goal of a parallel search is to search the SAME tree faster, thereby going deeper in a fixed time limit. 4x with 4 cores is the goal. Not so easy to reach however. So most will take whatever they can get, and continue to work to get even more out of it.

Even 4x longer on one thread does NOT mimic the parallel search very well. Parallel means exactly that. On occasion you will find something better because of a different tree traversal order from what the serial search does.

If you are talking about the multiplexing argument I always give when someone talks some sort of parallel search nonsense, you are missing the point. IF a program gains unexpectedly from a parallel search, for example by somehow consistently producing a super-linear speedup, one can do that same thing with the multiplexed search and produce the result faster than the serial search would. That's the key to that argument. If the odd ordering of a parallel search consistently helps (and of course it does NOT), then one can multiplex to get that same order, and that same benefit, in a serial search.

But in any case, speeding up the search by a factor of N is clearly better than speeding it up by a factor of N/2 and making the search less selective to eat up the other N/2 units of time. It would seem that Mark simply fixed the things Don had not had time to get around to. More power to him. You've seen my speedup data already (and it will be better soon, as the Intel compiler NPS scaling is improved significantly). It would be interesting to see a comparison with Komodo, since you have the numbers for both...
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Significant modification of Komodo SMP implementation

Post by Laskos »

bob wrote:...
The Crafty 25.0 you are testing has a better 1 -> 4 core speedup than Komodo, although this speedup is time dependent: I used close to 10 seconds/position on 4 cores for Komodo, while you used close to 10 minutes/position on 4 cores for Crafty, so it's not a very fair comparison. You seem to have improved drastically since the publicly available Crafty 23.6, which I tested a bit.

About multiplexing: the argument works only for super-linear parallel behavior. As I have never seen super-linear 1 -> 4 core behavior (effective speedup larger than 4), the argument that one can optimize the single-core engine by mimicking the 4-core behavior (if the engine widens in parallel search) fails, even if one could reproduce the 4-core behavior on 1 core. And if one cannot reproduce it, the argument fails from the start. So it may well be that widening is the best behavior both on 1 core and on 4 cores, as long as the effective speedup on 4 cores is smaller than 4. I see no reason why the parallel search should ideally follow the same tree when the effective speedup is smaller than 4 (which is always the case for chess engines going 1 -> 4 cores).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Significant modification of Komodo SMP implementation

Post by bob »

Laskos wrote:...
Let's pursue this a bit. I agree that multiplexing addresses any move-ordering related issue only. So let's put that aside in favor of discussing this idea of "widening" the search.

I can only think of the following ways to make the search "wider":

(1) slightly reduce the null-move search R value. The smaller this number, the less is pruned, so the tree becomes wider (and, at fixed time, the search shallower); EBF will definitely go up, and pretty quickly at that.

(2) slightly tweak the LMR reduction matrix in both directions, so that you reduce slightly less as the number of moves searched increases, and you reduce slightly less as the remaining depth drops. This will be a real EBF killer depending on how far it is taken.

(3) slightly alter forward pruning decisions by either increasing the safety margin for each remaining depth, or in the case of LMP, slightly delay the onset of tossing out late moves.

(4) slightly reduce any search extensions so that every branch is searched closer to the same depth, without the deeper (and narrower) extended branches.

Any or all of those will increase EBF... And you can do any or all in either a serial or parallel search.
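As a sketch of what such knobs could look like in code (every parameter name and number here is hypothetical, invented purely to make the four directions concrete; real engines differ in detail):

```python
import math

# Hypothetical selectivity knobs for the four "widening" directions above.
# widen=0.0 is the baseline; larger values back each knob off ("search more").
def search_params(widen=0.0):
    return {
        # (1) null-move reduction: a smaller R prunes less -> wider tree.
        "null_move_R": max(2, round(3 - widen)),
        # (2) LMR: reduce later moves slightly less as widen grows.
        "lmr_reduction": lambda depth, move_no: max(
            0, int(0.5 * math.log(max(depth, 1)) * math.log(max(move_no, 1)) - widen)
        ),
        # (3a) forward pruning: a larger futility margin prunes fewer nodes.
        "futility_margin_per_ply": 100 + int(50 * widen),
        # (3b) LMP: start tossing out late moves a bit later in the list.
        "lmp_move_limit": lambda depth: 3 + depth * depth // 2 + int(2 * widen),
        # (4) extensions: scale back check extensions so all branches are
        #     searched closer to the same depth.
        "check_extension_plies": max(0, 1 - int(widen)),
    }

narrow = search_params(widen=0.0)
wide = search_params(widen=1.0)
# Every knob moves in the "search more nodes" direction as widen grows:
print(narrow["null_move_R"], wide["null_move_R"])                          # 3 2
print(narrow["futility_margin_per_ply"], wide["futility_margin_per_ply"])  # 100 150
print(narrow["lmr_reduction"](8, 10), wide["lmr_reduction"](8, 10))        # 2 1
```

The point of the sketch is only that each of the four directions is a continuous dial available to a serial search just as much as to a parallel one.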

Assumption 1. We have tested and tuned until we have reached optimal values for all of the above selective search parameters. Changing any one to a larger or smaller value hurts Elo. So we are "optimal" for the current program.

Assumption 2. We have reached some sort of bottleneck in our parallel search so that we can't push the speedup beyond what it currently is. Let's take 2.0 on 4 cores as a point for discussion.

Given these assumptions, what do you do in the parallel search to "widen" the search? The assumption here is that if you can't do anything else with the extra processing power, you can at least back off on the pruning stuff that has an error rate that is measurable (but acceptable for one core).

Note this is assuming you have extracted all the "depth" you can get with 4 cores over 1, and that is only a ply (2x faster, EBF ~= 2.0, so doubling speed gains one more ply). What do you do at this point? You have two choices:

(1) give up on more parallel search performance, and make the search wider. But in doing this, you further hurt your parallel search at the same time since it now has to search a bigger tree, and since the speedup is limited to 2.0 for whatever internal reasons, as you widen, you lose depth at the same time.

(2) give up on widening and spend effort trying to get the speedup closer to 4, which will give you almost another ply of depth. A ply is worth 50-70 Elo, so we KNOW that will help.
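The depth side of that trade-off is simple arithmetic; a sketch (assuming, as above, that each extra ply multiplies the search time by the EBF):

```python
import math

def extra_plies(speedup, ebf):
    """Depth gained from a time speedup when each extra ply multiplies the
    search time by the effective branching factor `ebf`."""
    return math.log(speedup) / math.log(ebf)

ebf = 2.0  # the EBF assumed in the discussion above
print(f"speedup 2.0 -> +{extra_plies(2.0, ebf):.2f} plies")  # +1.00
print(f"speedup 4.0 -> +{extra_plies(4.0, ebf):.2f} plies")  # +2.00
# At the post's 50-70 Elo per ply, pushing the speedup from 2.0 toward 4.0
# is worth roughly one extra ply, i.e. a sizable Elo gain.
```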

What I believe is happening in your tests is that the search is widened, for whatever reason, and it does help the program in fixed-depth searches. But remember, if you fix the depth, you can't widen the search without extending the time. So that fixed depth result is not exactly comparable to the serial search to the same depth. A fairer comparison would be fixed depth to N, vs parallel search to depth N-x where x is the cost of the widening.

I don't think there is any doubt that searching extra nodes will make a program somewhat stronger, assuming the cost of those nodes is ignored. But if you really want to search a wider tree, why would you not want to search a wider tree in the serial search as well?

I agree you have measured the effect. I don't agree with the concept that this is a good intentional choice, however, as that suggests that the serial search could ALSO be improved with a little widening to get rid of some of the risky speculation.

My data showed some positions searching a bigger tree, some searching a smaller tree. On average it was generally just slightly larger, which I like. Exactly why that is happening is unknown at the moment. It is probably related to all the zero-width window searches that make up 99.999% of the search, along with the reductions and such that lower the EBF, so that when something goes wrong it is found very quickly, limiting the damage to a modest increase in node counts...

But my question remains, how would one decide what/when/where to widen the search given a sub-optimal parallel search algorithm? Given a poor parallel search algorithm, how is that same programmer going to somehow figure out where to widen the search effectively without just blowing the tree size out of the water and killing all performance completely?

Parallel search is not new. Some of us were doing this in the 70's, and we have ALWAYS seen that "widening" effect. Many of us noticed that fixed-depth searches were better with parallel search, because of the extra stuff that doesn't hurt when time doesn't matter. But we have ALL uniformly cursed it extensively, as it is just another limiting factor on parallel speedup that we did/do not like. I.e., in Cray Blitz we generally saw each added CPU search about 30% extra/unnecessary nodes, which means it spent its computational power searching only 70% of what it could have searched. A pretty significant limit (as DTS showed) when the number of processors goes beyond 4-8.

If you look at a current search, 90% of the time it fails high on the first move, which means 10% of the time it fails high on the second or a later move. That violates the YBW assumption that, once the first move has been searched without a cutoff, all the rest will have to be searched. So we get overhead there by splitting and searching several moves when the second might be the one that is going to fail high. And we lose (overhead). But on occasion, because of the parallel search, we search moves 2, 3, 4 and 5 in parallel, and 5 fails high before 2, 3 and 4 are done. That overhead is not quite as bad as it would have been had move 2 failed high while 3, 4 and 5 were still being searched, since the latter case is 100% overhead, while the former might be a super-linear event.
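That trade-off can be caricatured with a toy Monte-Carlo model (the 90% first-move fail-high rate is the figure from the paragraph above; the split width and the assumption that the cutoff move is always among the split siblings are mine, purely for illustration):

```python
import random

random.seed(1)  # deterministic toy run

def split_overhead(p_first=0.90, split_width=4, trials=100_000):
    """Crude YBW model: move 1 is searched serially; if it does not fail high
    (probability 1 - p_first) we split and search the next `split_width`
    moves in parallel.  Assume (a simplification) the cutoff move is one of
    them, so the other split_width - 1 sibling searches are pure overhead.
    Returns the average number of wasted sibling searches per fail-high node."""
    wasted = 0
    for _ in range(trials):
        if random.random() < p_first:
            continue  # first move failed high: YBW held, no split waste
        wasted += split_width - 1
    return wasted / trials

print(f"avg wasted sibling searches per fail-high node: {split_overhead():.2f}")
```

This only bounds the 100%-overhead case; it deliberately ignores the mitigating case described above, where a later sibling fails high before its brothers finish and some of that "waste" turns into useful, possibly super-linear, work.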

I am still running tests, although I now have to re-run everything since the Intel compiler has significantly improved the NPS scaling over the 14x I was getting in the last big batch of tests. The only good thing is that the numbers are variable, but consistently variable, so the overall average is not changing wildly. The more consistent the results, the more trustworthy the search is.

I'd like to release this one but I keep finding things to change. Like a shorter hash table entry, or a null-move change, or whatever.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Significant modification of Komodo SMP implementation

Post by Laskos »

bob wrote:...
Suppose one has a highly optimized single-core engine, which is optimally wide and deep. Now you want to implement SMP on 4 cores. Suppose for simplicity that, as far as speed goes, NPS on 4 cores is almost exactly 4 times the NPS on 1 core, so in the same amount of time the 4-core version searches 4 times more nodes than the single core. Let's in fact have 2 such engines: Crafty and Komodo. One tries hard to make his engine copy on 4 cores the optimal 1-core tree shape, just 4 times faster. If the copy is perfect, the effective speedup is 4.

Working on Crafty, one discovers that no matter how hard he tries, only 80% of the 4-core engine's nodes copy the 1-core nodes perfectly; these are fully able nodes. 20% of the nodes are completely wasted; they are dead nodes. The effective speedup of Crafty is 4*0.8 = 3.2, and the 4-core engine mimics the 1-core engine perfectly (the same tree), just 3.2 times faster. The only variable to measure in this engine is the time-to-depth speedup (TTD speedup), which is 3.2 for 4 cores, the same 3.2 as the total effective speedup.

Working on Komodo, one has a nasty surprise: his engine, optimized on 1 core, is harder to parallelize. No matter how hard he tries, only 50% of the 4-core nodes mimic the 1-core tree perfectly, and apparently 50% of the nodes are wasted. Left in this shape, the effective speedup on 4 cores is 4*0.5 = 2.0, equal to the TTD speedup and much worse than Crafty's 3.2. Analyzing the remaining 50% of wasted nodes, he discovers that, as in Crafty, 20% are completely dead nodes, but 30% of the nodes have some sort of flexibility: not fully free to mimic the original tree shape, but not completely dead either. Suppose they can be used to deal with the lines that are highly reduced in the single-core program. The depth reached does not change; we have already achieved the maximum depth and a TTD speedup of 2.0 using the 50% fully able nodes. But those 30% partly able nodes do contribute some strength increase by dealing with severely reduced lines of the single-core tree. This increase in strength without any change in TTD speedup is what we call "widening". One probably admits that partly able nodes used for widening are not as efficient at increasing strength as fully able nodes in the original tree are via pure TTD. Say their efficiency is 50%, and, as already said, they make up 30% of the total nodes searched on 4 cores. So this search will look as follows:

50% of nodes used fully to mimic the 1-core tree shape, giving a TTD speedup of 2.0 on 4 cores and a contribution of 2.0 to the effective speedup.
20% of nodes completely wasted, exactly as in Crafty.
30% partly able nodes used for what their limited freedom allows, say dealing with severe reductions in the 1-core-shaped tree; these do not change depth or TTD, but add a fixed-depth widening contribution of 4*0.5*0.3 = 0.6 to the effective speedup.

The total effective speedup of Komodo on 4 cores this way is 2.0 (from TTD speedup) + 0.6 (from widening) = 2.6, with 80% of the nodes used in some more or less efficient way.

It is surely desirable to have Crafty's pure, non-widening TTD effective speedup of 3.2, if we keep the exemplified numbers. But given Komodo's peculiarity in parallelization, a non-widening Komodo would have an effective speedup of 2.0, and the widening one 2.6, which is surely better than 2.0.

Now, aspiring to improve the new Komodo, Mark and Larry find that in fact more of the 4-core nodes can be used to copy the 1-core tree. Say that this way more nodes are wasted, 30% instead of 20%, which at first looks undesirable. But they see that 70% of the nodes are now fully able to copy the optimized 1-core shape. Although the 70% of used nodes is fewer than the 80% before, they are more efficient, as all of them go toward TTD and effective speedup, which are both equal to the non-widening 4*0.7 = 2.8. Better than the old 2.6. So they improved the effective speedup from 2.6 to 2.8 by going from widening to non-widening. But that is not a matter of principle. Suppose that among the 30% wasted nodes they find 10% which can be used to widen. Then they would get (say, in the future) an effective speedup of 2.8 + 4*0.1*0.5 = 3.0 on 4 cores with a widening engine, better than the present-day non-widening Komodo.
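The bookkeeping in these scenarios reduces to one small formula; here is a sketch, with the node fractions and the 50% partial-node efficiency taken straight from the assumptions above:

```python
def effective_speedup(cores, f_able, f_partial=0.0, partial_eff=0.5):
    """Effective speedup = cores * (fraction of fully able nodes, which give
    pure TTD speedup) + cores * (fraction of partly able nodes) * (their
    efficiency relative to fully able nodes).  Dead nodes contribute zero."""
    return cores * f_able + cores * f_partial * partial_eff

# Crafty-like: 80% fully able, 20% dead.
print(effective_speedup(4, f_able=0.80))                   # 3.2
# Komodo-like, widening: 50% able + 30% partly able at 50% efficiency.
print(effective_speedup(4, f_able=0.50, f_partial=0.30))   # 2.6
# New non-widening Komodo: 70% able, 30% dead.
print(effective_speedup(4, f_able=0.70))                   # 2.8
# Hypothetical future: 70% able + 10% of the wasted nodes recovered for widening.
print(effective_speedup(4, f_able=0.70, f_partial=0.10))   # 3.0
```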

Then, adding more cores beyond 4 will most probably reduce the proportion of the "mimicking" part of the tree (the pure TTD speedup) even further, and widening may gain in importance.

This may be a peculiarity of Komodo's tree shape: it is prone to widen when going multi-core, while Crafty is prone to a non-widening tree. To me there is no theoretical argument that "widening is always bad" when the scaling is clearly below ideal (4 on 4 cores). The state of chess engines is such that on average the scaling is pretty far from perfect. The multiplexing argument doesn't work for non-super-linear SMP. Super-linearity is an exception, and when systematic it IS a bug, as the multiplexing thought experiment shows.

In fact, the only argument against widening is that when the effective speedup gets very close to ideal (say above 3.9 on 4 cores), the margin for maneuver with widening shrinks, and one could improve only by going just a bit wider. We are not there yet, especially on, say, 16 cores and higher.
Last edited by Laskos on Thu Jun 25, 2015 12:03 pm, edited 2 times in total.
Vinvin
Posts: 5298
Joined: Thu Mar 09, 2006 9:40 am
Full name: Vincent Lejeune

Re: Significant modification of Komodo SMP implementation

Post by Vinvin »

Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
lkaufman wrote:
bob wrote:I had told you "this is a bug and will probably be fixed." :)

It was and it was...

Nobody wants to "widen" intentionally, it is just making the tree larger than necessary.
Although we certainly find and fix bugs from time to time, this was not one of them. The widening got us plenty of elo, it was not at all wasted, but we made a deliberate decision to trade widening for depth, which our testing showed was a small net gain, and Kai has confirmed this. Good detective work, Kai!
I consider anything that is a detriment to performance "a bug". This seems to fit that perfectly. If searching extra nodes is good (so-called "widening") then it should be just as good on a single-core search. The only way it will work on multi-core searches is if the search is too selective and the "widening" compensates for that a bit. But you can compensate in the same way in a serial search also.

Searching the SAME tree, just faster, is the goal for parallel search. Always has been. Widening is only practical if the parallel search just doesn't work very well and doesn't search any deeper with more cores. A point I don't think we have come anywhere near reaching yet, based on the test results I posted here a couple of weeks back.

Using the new Intel compiler has improved those results significantly. In any case, glad you guys are making progress. I've got a PM to send to Mark for something he should try that will be another small gain, since he has worked on hashing recently. There's another step he can take that works well in my testing.
The multiplexing argument is not consistent with the fixed-time problem; it is consistent with the fixed-depth problem. The time required on 1 core to mimic the parallel 4-core search exactly is exactly 4 times larger. But say the effective speedup of the parallel 4-core search is 3.0. Then keeping the original 1-core search at 4x longer time gives an effective speedup of exactly 4, while mimicking the parallel search at 4x longer time gives only a 3.0x effective speedup. Therefore mimicking the parallel 4-core behavior on one core is suboptimal for 1-core performance at 4x time.
I don't see what this has to do with anything I wrote. The goal of a parallel search is to search the SAME tree faster, thereby going deeper in a fixed time limit. 4x with 4 cores is the goal. Not so easy to reach however. So most will take whatever they can get, and continue to work to get even more out of it.

Even 4x longer on one thread does NOT mimic the parallel search very well. Parallel means exactly that. On occasion you will find something better because of a different tree traversal order from what the serial search does.

If you are talking about the multiplexing argument I always give when someone talks some sort of parallel search nonsense, you are missing the point. IF a program gains unexpectedly from a parallel search (for example, by somehow consistently producing a super-linear speedup), one can do that same thing with the multiplexed search and produce the result faster than the serial search would. That's the key to that argument. If the odd ordering of a parallel search consistently helps (and it does NOT, of course), then one can multiplex to get that same order in a serial search, and get that same benefit.

But in any case, speeding up the search by a factor of N is clearly better than speeding up the search by a factor of N/2 and making the search less selective to eat up the other N/2 units of time. It would seem that Mark simply fixed the things Don had not had time to get around to. More power to him. You've seen my speedup data already (and it will be better soon, as the Intel compiler NPS scaling is improved significantly). Be interesting to see a comparison with Komodo since you have the numbers for both...
The Crafty 25.0 you are testing has a better 1 -> 4 core speedup than Komodo, although this speedup is time dependent: I used close to 10 seconds/position on 4 cores for Komodo, while you used close to 10 minutes/position on 4 cores for Crafty, so it's not a very fair comparison. You seem to have improved drastically since the publicly available Crafty 23.6, which I tested a bit.

About multiplexing: the argument would work only for superlinear parallel behavior. As I have never seen superlinear 1 -> 4 core behavior (an effective speedup larger than 4), the argument that one can optimize the single-core engine by mimicking the 4-core behavior (if the engine widens in parallel search) falls, even if one can reproduce the 4-core behavior on 1 core. If one cannot reproduce it, the argument fails to start with. So it might happen that widening is the best behavior both on 1 core and on 4 cores, if the effective speedup on 4 cores is smaller than 4. I don't see a reason why a parallel search should ideally follow the same tree when the effective speedup is smaller than 4 (which is always the case for chess engines going 1 -> 4 cores).
I follow this discussion, but not very closely.
I have never seen numbers for an experiment that could be interesting: what are the results for an engine (Crafty, SF, or Komodo) running 1 thread on 1 core vs. the same engine running 4 threads on _1_ core? 2000 to 5000 games preferably ...
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Significant modification of Komodo SMP implementation

Post by Laskos »

Vinvin wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
lkaufman wrote:
bob wrote:I had told you "this is a bug and will probably be fixed." :)

It was and it was...

Nobody wants to "widen" intentionally, it is just making the tree larger than necessary.
Although we certainly find and fix bugs from time to time, this was not one of them. The widening got us plenty of elo, it was not at all wasted, but we made a deliberate decision to trade widening for depth, which our testing showed was a small net gain, and Kai has confirmed this. Good detective work, Kai!
I consider anything that is a detriment to performance "a bug". This seems to fit that perfectly. If searching extra nodes is good (so-called "widening") then it should be just as good on a single-core search. The only way it will work on multi-core searches is if the search is too selective and the "widening" compensates for that a bit. But you can compensate in the same way in a serial search also.

Searching the SAME tree, just faster, is the goal for parallel search. Always has been. Widening is only practical if the parallel search just doesn't work very well and doesn't search any deeper with more cores. A point I don't think we have come anywhere near reaching yet, based on the test results I posted here a couple of weeks back.

Using the new Intel compiler has improved those results significantly. In any case, glad you guys are making progress. I've got a PM to send to Mark for something he should try that will be another small gain, since he has worked on hashing recently. There's another step he can take that works well in my testing.
The multiplexing argument is not consistent with the fixed-time problem; it is consistent with the fixed-depth problem. The time required on 1 core to mimic the parallel 4-core search exactly is exactly 4 times larger. But say the effective speedup of the parallel 4-core search is 3.0. Then keeping the original 1-core search at 4x longer time gives an effective speedup of exactly 4, while mimicking the parallel search at 4x longer time gives only a 3.0x effective speedup. Therefore mimicking the parallel 4-core behavior on one core is suboptimal for 1-core performance at 4x time.
I don't see what this has to do with anything I wrote. The goal of a parallel search is to search the SAME tree faster, thereby going deeper in a fixed time limit. 4x with 4 cores is the goal. Not so easy to reach however. So most will take whatever they can get, and continue to work to get even more out of it.

Even 4x longer on one thread does NOT mimic the parallel search very well. Parallel means exactly that. On occasion you will find something better because of a different tree traversal order from what the serial search does.

If you are talking about the multiplexing argument I always give when someone talks some sort of parallel search nonsense, you are missing the point. IF a program gains unexpectedly from a parallel search (for example, by somehow consistently producing a super-linear speedup), one can do that same thing with the multiplexed search and produce the result faster than the serial search would. That's the key to that argument. If the odd ordering of a parallel search consistently helps (and it does NOT, of course), then one can multiplex to get that same order in a serial search, and get that same benefit.

But in any case, speeding up the search by a factor of N is clearly better than speeding up the search by a factor of N/2 and making the search less selective to eat up the other N/2 units of time. It would seem that Mark simply fixed the things Don had not had time to get around to. More power to him. You've seen my speedup data already (and it will be better soon, as the Intel compiler NPS scaling is improved significantly). Be interesting to see a comparison with Komodo since you have the numbers for both...
The Crafty 25.0 you are testing has a better 1 -> 4 core speedup than Komodo, although this speedup is time dependent: I used close to 10 seconds/position on 4 cores for Komodo, while you used close to 10 minutes/position on 4 cores for Crafty, so it's not a very fair comparison. You seem to have improved drastically since the publicly available Crafty 23.6, which I tested a bit.

About multiplexing: the argument would work only for superlinear parallel behavior. As I have never seen superlinear 1 -> 4 core behavior (an effective speedup larger than 4), the argument that one can optimize the single-core engine by mimicking the 4-core behavior (if the engine widens in parallel search) falls, even if one can reproduce the 4-core behavior on 1 core. If one cannot reproduce it, the argument fails to start with. So it might happen that widening is the best behavior both on 1 core and on 4 cores, if the effective speedup on 4 cores is smaller than 4. I don't see a reason why a parallel search should ideally follow the same tree when the effective speedup is smaller than 4 (which is always the case for chess engines going 1 -> 4 cores).
I follow this discussion, but not very closely.
I have never seen numbers for an experiment that could be interesting: what are the results for an engine (Crafty, SF, or Komodo) running 1 thread on 1 core vs. the same engine running 4 threads on _1_ core? 2000 to 5000 games preferably ...
I haven't actually played with multiplexing of chess engines, and your test would produce nonsense that depends on the random scheduling of instructions on the core, which is not actual multiplexing. To make a single core able to run multiple threads, time-division multiplexing, for example, can be used, and it has to go through the operating system: a timer interrupts the system at a fixed interval, and every time the interrupt occurs the system runs the scheduling routine, which picks the next thread due to be executed. The context of the core is then switched from the currently running thread to the new thread, and execution continues, sequentially but at very small intervals, mimicking the parallel search over larger swaths of time. And the threads must communicate just as they do during the parallel search. That would be multiplexing.
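As a toy illustration of the time-division idea (not a real OS scheduler: here the "timer interrupt" is simulated by a fixed budget of work units per slice, and the "threads" are plain generators; all names are made up):

```python
# Toy time-division multiplexing on a single core. A real scheduler is
# driven by a timer interrupt; here each "thread" is a generator that
# yields after every unit of work, and the "time slice" is a fixed
# number of units before a simulated context switch.

def worker(name, units):
    """A stand-in for a search thread: does `units` units of work."""
    for i in range(units):
        yield f"{name}: unit {i}"

def schedule(threads, slice_len):
    """Round-robin: run each thread for `slice_len` units, then
    context-switch to the next, until every thread has finished."""
    log = []
    while threads:
        thread = threads.pop(0)          # "dispatch" the next thread
        for _ in range(slice_len):
            try:
                log.append(next(thread))
            except StopIteration:
                break                    # thread finished; drop it
        else:
            threads.append(thread)       # slice expired; requeue it
    return log

log = schedule([worker("A", 4), worker("B", 4)], slice_len=2)
print(log)
# Work interleaves in slices: A:0, A:1, B:0, B:1, A:2, A:3, B:2, B:3
```

The shared-state communication a real parallel search needs (hash table, split points) is exactly what this toy omits, which is why the real thing has to go through the OS.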
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Significant modification of Komodo SMP implementation

Post by bob »

Laskos wrote:
Vinvin wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
lkaufman wrote:
bob wrote:I had told you "this is a bug and will probably be fixed." :)

It was and it was...

Nobody wants to "widen" intentionally, it is just making the tree larger than necessary.
Although we certainly find and fix bugs from time to time, this was not one of them. The widening got us plenty of elo, it was not at all wasted, but we made a deliberate decision to trade widening for depth, which our testing showed was a small net gain, and Kai has confirmed this. Good detective work, Kai!
I consider anything that is a detriment to performance "a bug". This seems to fit that perfectly. If searching extra nodes is good (so-called "widening") then it should be just as good on a single-core search. The only way it will work on multi-core searches is if the search is too selective and the "widening" compensates for that a bit. But you can compensate in the same way in a serial search also.

Searching the SAME tree, just faster, is the goal for parallel search. Always has been. Widening is only practical if the parallel search just doesn't work very well and doesn't search any deeper with more cores. A point I don't think we have come anywhere near reaching yet, based on the test results I posted here a couple of weeks back.

Using the new Intel compiler has improved those results significantly. In any case, glad you guys are making progress. I've got a PM to send to Mark for something he should try that will be another small gain, since he has worked on hashing recently. There's another step he can take that works well in my testing.
The multiplexing argument is not consistent with the fixed-time problem; it is consistent with the fixed-depth problem. The time required on 1 core to mimic the parallel 4-core search exactly is exactly 4 times larger. But say the effective speedup of the parallel 4-core search is 3.0. Then keeping the original 1-core search at 4x longer time gives an effective speedup of exactly 4, while mimicking the parallel search at 4x longer time gives only a 3.0x effective speedup. Therefore mimicking the parallel 4-core behavior on one core is suboptimal for 1-core performance at 4x time.
I don't see what this has to do with anything I wrote. The goal of a parallel search is to search the SAME tree faster, thereby going deeper in a fixed time limit. 4x with 4 cores is the goal. Not so easy to reach however. So most will take whatever they can get, and continue to work to get even more out of it.

Even 4x longer on one thread does NOT mimic the parallel search very well. Parallel means exactly that. On occasion you will find something better because of a different tree traversal order from what the serial search does.

If you are talking about the multiplexing argument I always give when someone talks some sort of parallel search nonsense, you are missing the point. IF a program gains unexpectedly from a parallel search (for example, by somehow consistently producing a super-linear speedup), one can do that same thing with the multiplexed search and produce the result faster than the serial search would. That's the key to that argument. If the odd ordering of a parallel search consistently helps (and it does NOT, of course), then one can multiplex to get that same order in a serial search, and get that same benefit.

But in any case, speeding up the search by a factor of N is clearly better than speeding up the search by a factor of N/2 and making the search less selective to eat up the other N/2 units of time. It would seem that Mark simply fixed the things Don had not had time to get around to. More power to him. You've seen my speedup data already (and it will be better soon, as the Intel compiler NPS scaling is improved significantly). Be interesting to see a comparison with Komodo since you have the numbers for both...
The Crafty 25.0 you are testing has a better 1 -> 4 core speedup than Komodo, although this speedup is time dependent: I used close to 10 seconds/position on 4 cores for Komodo, while you used close to 10 minutes/position on 4 cores for Crafty, so it's not a very fair comparison. You seem to have improved drastically since the publicly available Crafty 23.6, which I tested a bit.

About multiplexing: the argument would work only for superlinear parallel behavior. As I have never seen superlinear 1 -> 4 core behavior (an effective speedup larger than 4), the argument that one can optimize the single-core engine by mimicking the 4-core behavior (if the engine widens in parallel search) falls, even if one can reproduce the 4-core behavior on 1 core. If one cannot reproduce it, the argument fails to start with. So it might happen that widening is the best behavior both on 1 core and on 4 cores, if the effective speedup on 4 cores is smaller than 4. I don't see a reason why a parallel search should ideally follow the same tree when the effective speedup is smaller than 4 (which is always the case for chess engines going 1 -> 4 cores).
I follow this discussion, but not very closely.
I have never seen numbers for an experiment that could be interesting: what are the results for an engine (Crafty, SF, or Komodo) running 1 thread on 1 core vs. the same engine running 4 threads on _1_ core? 2000 to 5000 games preferably ...
I haven't actually played with multiplexing of chess engines, and your test would produce nonsense that depends on the random scheduling of instructions on the core, which is not actual multiplexing. To make a single core able to run multiple threads, time-division multiplexing, for example, can be used, and it has to go through the operating system: a timer interrupts the system at a fixed interval, and every time the interrupt occurs the system runs the scheduling routine, which picks the next thread due to be executed. The context of the core is then switched from the currently running thread to the new thread, and execution continues, sequentially but at very small intervals, mimicking the parallel search over larger swaths of time. And the threads must communicate just as they do during the parallel search. That would be multiplexing.
You can do this internally as well. In early versions of Cray Blitz I had code that would internally multiplex: one node for thread A, one node for thread B, alternating. I had to do this since the machine I tested on early on didn't support threads, while the Cray did...
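A minimal sketch of that node-at-a-time internal multiplexing (the "searches" here are trivial stand-ins that just count nodes of a uniform tree; the strict one-node-each alternation is the point, and all names are illustrative, not from Cray Blitz):

```python
# Two logical search "threads" advanced strictly one node at a time,
# alternating, inside a single real thread of execution.

def count_nodes(branching, depth):
    """Stand-in search: DFS over a uniform tree, yielding once per node
    visited. One yield == one node == one multiplexing step."""
    stack = [depth]
    nodes = 0
    while stack:
        d = stack.pop()
        nodes += 1
        if d > 0:
            stack.extend([d - 1] * branching)
        yield nodes

def multiplexed(search_a, search_b):
    """Alternate one node for A, one node for B, until both finish."""
    results = {}
    live = [("A", search_a), ("B", search_b)]
    while live:
        name, search = live.pop(0)
        try:
            results[name] = next(search)     # advance by exactly one node
            live.append((name, search))      # still running: requeue
        except StopIteration:
            pass                             # finished: stop scheduling it
    return results

totals = multiplexed(count_nodes(2, 2), count_nodes(3, 1))
print(totals)  # final node counts per search: {'A': 7, 'B': 4}
```

Once one search runs out of nodes, the survivor simply gets every step, just as a real scheduler would give a finished thread's time to the remaining one.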