Peculiarity of Komodo 5.1MP

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

lkaufman
Posts: 6008
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Peculiarity of Komodo 5.1MP

Post by lkaufman »

bob wrote: That's all well and good. But are you now saying that with just 4 cores, you have already reached this boundary and that it is better to waste part of those cores searching a wider tree? I don't believe it for a minute. 4 core speedups are pretty good for any reasonable implementation... Even 8 and 16 are good...
"pretty good" is a matter of opinion. Generally speaking, I think the typical top program gets about 3 to 1 speedup for four cores. Maybe you get a lot more on Crafty, but it is not a top program because it is much less selective (I think), and if you do get a lot more, it would prove my point that highly selective programs get less speedup from parallelization than do more traditional ones. Let's suppose that given a highly selective engine like Komodo, Houdini, Critter, Stockfish, or Ippo, we decided to use four cores to get NO speedup, but used them by doing a much fuller search. This would clearly not give us as much benefit as 4x as much time would (else we wouldn't be so selective in the single core engine), but I think it might give us as much benefit as 3x as much time, which would make it as good as standard parallel search FOR THOSE ENGINES. It seems pretty clear to me that the optimum use of four cores would be to use some of it to make the search less selective, and some to make it faster. This is what we did with Komodo, and it at least doesn't appear to be inferior in any way to the standard approach. By the way, Rybka also used some of the resources of MP to reduce selectivity (i.e. improve fixed depth strength) at the expense of some speed of course, and Rybka MP was no slouch. We just carried the idea much further than Rybka did.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Peculiarity of Komodo 5.1MP

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
Daniel Shawul wrote:This is misleading because komodo probably compensates for poor parallel implementation by searching wider, otherwise it shouldn't be getting any elos for fixed depth test. You should do the test with time and see how much each gain.
I already wrote that it probably searches wider with the number of threads. But it's different from other top engines, which to given depth are pretty much the same strength on different number of threads. And the rule (as stated by some) that time-to-depth is determining MP efficiency does not apply to Komodo.
I will leave to test groups and individuals to test with time, there are plenty of volunteers. I was just curious about this aspect.
Time-to-depth IS the correct measure. If Komodo searches wider, then it searches less efficiently. One wants to measure the COMPLETE SMP implementation, not just how the tree is split.

I can not imagine wanting to search wider on a parallel search. If it is a good idea, why not search wider on the sequential implementation as well?
Time-to-depth won't work for Komodo. It benefits from a wider tree in the parallel search to the same depth, and time-to-depth would understate its SMP performance Elo-wise. Comparing depths of the 1 threaded Komodo with the depths of 4-threaded Komodo is meaningless given the different tree shapes. Elo is all that matters in SMP efficiency, and I don't know why it has a wider tree in the parallel search, the fact is that 4 vs 1 thread Komodo benefits from two sides: deeper search and wider tree, both contributing substantially. The total contribution is probably comparable to other engines going from 1 to 4 threads, only that Komodo has this peculiar behavior.
Your statement is simply wrong. It doesn't "benefit" from a wider tree. It "suffers from" a wider tree.
It benefits from a wider tree and suffers from larger time to depth. What's so hard to understand?
Not a thing for me. But to say that the Komodo parallel search is somehow better is simply wrong. If the parallel search intentionally makes the tree bigger for the same depth, that is a loss, not a gain.

Quite simple, yet some do not seem to get it.

the comparison in terms of Elo is simply invalid, period.




Otherwise a wider tree would be better in a non-parallel search as well.

Note that when you see an elo improvement going from 1 to 4 in fixed depth, it is imaginary. Just compare the times. The 4 cpu test is taking longer than it should, getting a time handicap that is imaginary in real chess.
All you say is trivial, and was mentioned in my first post. Komodo has lower time-to-depth ratio going from 1 threads to 4 threads, say 2.0 instead of 3.2, but an additional 80 Elo points benefit from a wider tree. If you measure only time-to-depth, get 2.0 instead of 3.2, and say it's scaling badly, you are wrong, as you do not include additional 80 points from wider tree.
There is NO "additional 80 points from the wider tree." It is ALL imaginary and caused by a flawed testing. You made the program slower, yet are not accounting for that at all in the measurement.

As I said, it is a flawed comparison from the get-go.

Using your logic, you could turn on more extensions, turn off pruning, and make the program better in this particular test (only). Pretty much pointless, and obviously so... THAT was my point. The "80 Elo" (or whatever it actually was) is purely imaginary. Not a single elo of that will be seen in a real game using time limits rather than depth limits.

Better to do a few searches with st=10, and use both 1 and 4 cpus and see what happens there. I'd suspect, based on your results, that the 4 cpu is not going to perform as well as one would expect, because part of the extra processing power is wasted searching extra nodes...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Peculiarity of Komodo 5.1MP

Post by bob »

Don wrote:
bob wrote:
Don wrote:
Daniel Shawul wrote:
Some have implied that we search wider because we don't get quite the same increase in depth and that this is a horrible thing. The intuition is that if that if it works you could do it on 1 thread too. However I do not think that follows. The problem is that 4 threads is just not 4x better than 1 thread no matter how good your MP implementation, so it always comes down to how to best utilize what you have. Anything goes in my opinion if it gets more out of extra cores in terms of ELO.
It doesn't matter if 4 threads is not as efficient as as 4x time search. The following statement is provable.
"If your sequential search is as good as it can be, then your parallel implementation that searches wider than sequential search is poorly done".
I will give it a shot as follows:
Assume sequential search is at its best. Give it 3.6x more time (say same as 4 threads search) but change the algorithm so that it can widen its search exactly the same way as the parallel search does. If indeed searching wider is better for parallel search as claimed, then the wide sequential search should show similar improvement over its deeper search counterpart. So we improved the sequential search that we assumed un-improvable, contradiction! The question of difficulty of implementation is irrelevant.
That was a good try but your proof logic is flawed. What is not considered is utilization of resources.

We know that that when when multiple cores try to cooperate to search a position you lose in 3 primary ways with chess programs: synchronization overhead, communication overhead and redundant calculations. If you can reduce 1 or more of those things and apply it to something else instead, you can get a gain that you would not have had otherwise - even if the gain itself would not ordinarily be the best way to utilize resources in a single core program. Your proof assumes that it's impossible for that to be the case.

In other words if you run out of constructive things to do with the extra cores - then you need to start exploiting things that may be less beneficial in ordinary circumstances but now we have new circumstances. In my "doctor" illustration you just wouldn't waste a doctors time in a hospital making phone calls or being a gopher, but in a specialized situation where there were more doctors than you could fully utilize you would start putting them to work on more mundane tasks.
That's all well and good. But are you now saying that with just 4 cores, you have already reached this boundary and that it is better to waste part of those cores searching a wider tree? I don't believe it for a minute. 4 core speedups are pretty good for any reasonable implementation... Even 8 and 16 are good...
No, I didn't say that we reached the boundary.

Here is another way of looking at it. I can tell you from experience that we can make Komodo a ply faster or a ply slower (either way) by changing some things that will have only a very minor impact on the ELO of the program. So it's not as if there is a "one true" serial implementation that is the one and only right and ultimate way to do things.

So what we need to do is to stop thinking of a chess program as these thing that has totally independent parts that do not interact. It doesn't work that way in computer chess. What works the very best in a serial program with respect to LMR and forward pruning and other things do not necessarily work best in a parallel program.

I reject Daniels idea by the way that there is no difference in a parallel algorithm and a serial algorithm and that they are exactly equivalent. That just reflects a lack of understanding.
Daniel's argument is a basic tenet of parallel programming and is well-founded in theory, however. If making the tree wider with 4 cores would work, would not making it wider with one work also? One can even do the EXACT same search (parallel) using a multiplexed search on a single CPU, as another way of proving this concept...

Anything you can do in a parallel algorithm, I can do in a serial algorithm, and vice-versa. So the only way this argument can even begin to stand up to a strong analysis is if the 4 cpu search is so inefficient that it is actually better to do less searching in parallel and search a more complete tree instead. For 4 cores, I don't believe that is the case for any program today...
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Peculiarity of Komodo 5.1MP

Post by michiguel »

Daniel Shawul wrote:Ok I have a better example. You can use two threads on _one_ physical processor which is same as a sequential search. If you want you can keep two search stacks in your program and update them sequentially like two threads are handled with time slicing, then the two are the same. So now if wider parallel search done on _two_ physical processors is better than its deep search counter part done on the same two processors, then the same holds for the case when it is run on one physical processor with two threads. The wider search counter part should perform better in both cases. This is the same argument as "is hyper-threading good or not?", with only difference that HT-enabled processor has 5% or more die area. I am sure this 'Reductio ad absurdum' argument have come up there too, to the effect that you can improve the search on one core by adding more threads (not cores) ad infinitum. Enjoyed some botched up latin ? :)
Yes, that would do it provided the inefficiency of running two threads in one core is acceptable. It is theoretically possible that hardware could provide a minimum overhead. However, this will indicate a superlinear performance of the algorithm and preliminary results in this forum show no evidence of that (at least what I saw).

I think the scenario I mentioned in the second paragraph is possible. That would still give a worse performance for the two threads in one core and explain the results observed.

Miguel
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Peculiarity of Komodo 5.1MP

Post by Don »

bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
Daniel Shawul wrote:This is misleading because komodo probably compensates for poor parallel implementation by searching wider, otherwise it shouldn't be getting any elos for fixed depth test. You should do the test with time and see how much each gain.
I already wrote that it probably searches wider with the number of threads. But it's different from other top engines, which to given depth are pretty much the same strength on different number of threads. And the rule (as stated by some) that time-to-depth is determining MP efficiency does not apply to Komodo.
I will leave to test groups and individuals to test with time, there are plenty of volunteers. I was just curious about this aspect.
Time-to-depth IS the correct measure. If Komodo searches wider, then it searches less efficiently. One wants to measure the COMPLETE SMP implementation, not just how the tree is split.

I can not imagine wanting to search wider on a parallel search. If it is a good idea, why not search wider on the sequential implementation as well?
Time-to-depth won't work for Komodo. It benefits from a wider tree in the parallel search to the same depth, and time-to-depth would understate its SMP performance Elo-wise. Comparing depths of the 1 threaded Komodo with the depths of 4-threaded Komodo is meaningless given the different tree shapes. Elo is all that matters in SMP efficiency, and I don't know why it has a wider tree in the parallel search, the fact is that 4 vs 1 thread Komodo benefits from two sides: deeper search and wider tree, both contributing substantially. The total contribution is probably comparable to other engines going from 1 to 4 threads, only that Komodo has this peculiar behavior.
Your statement is simply wrong. It doesn't "benefit" from a wider tree. It "suffers from" a wider tree.
It benefits from a wider tree and suffers from larger time to depth. What's so hard to understand?
Not a thing for me. But to say that the Komodo parallel search is somehow better is simply wrong. If the parallel search intentionally makes the tree bigger for the same depth, that is a loss, not a gain.
I think your argument is merely one of semantics. If you are just saying that we didn't get the classic pure speedup MP should give, then you are correct.

But if you are saying the way to implement MP can only be a specific way and that the goal should not be to improve the ELO but to give a 4x speedup only, then I disagree.

To me the ONLY issue with MP is how to squeeze more playing strength out of the extra processors when they cannot fully contribute to a speedup. You know better than anyone else here that the MP search is not the same as an SP search - you cannot simply use the serial algorithm unmodified as if 4 cores makes your program exactly the same as 1 core sped up 4 times.

At MIT this was a major discussion among us because we were getting a 1824 process machine ready for Hong Kong one year. Maybe I'm stupid, but those guys are not, and their intuition after seeing that Cilkchess was not 1824 times faster was that there must be a better way to utilize those processors that went beyond just more of the same.

In a few years something will come along in computer chess, some new big thing like null move pruning was and like LMR was. You can be a follower or you can be an innovator. I am not in this just to be a good engineer - I would like to stretch the boundaries a bit. I don't think my MP implementation qualifies for that and I'm not saying that it does, but for me anything goes and I am not afraid to stray a little off the beaten path.

It could be, as some are saying, that because it seems to work that our serial search is seriously flawed and we could simply apply the same algorithm to our serial search. I would most ecstatic if that were true.

Quite simple, yet some do not seem to get it.

the comparison in terms of Elo is simply invalid, period.


Otherwise a wider tree would be better in a non-parallel search as well.

Note that when you see an elo improvement going from 1 to 4 in fixed depth, it is imaginary. Just compare the times. The 4 cpu test is taking longer than it should, getting a time handicap that is imaginary in real chess.
All you say is trivial, and was mentioned in my first post. Komodo has lower time-to-depth ratio going from 1 threads to 4 threads, say 2.0 instead of 3.2, but an additional 80 Elo points benefit from a wider tree. If you measure only time-to-depth, get 2.0 instead of 3.2, and say it's scaling badly, you are wrong, as you do not include additional 80 points from wider tree.
There is NO "additional 80 points from the wider tree." It is ALL imaginary and caused by a flawed testing. You made the program slower, yet are not accounting for that at all in the measurement.

As I said, it is a flawed comparison from the get-go.

Using your logic, you could turn on more extensions, turn off pruning, and make the program better in this particular test (only). Pretty much pointless, and obviously so... THAT was my point. The "80 Elo" (or whatever it actually was) is purely imaginary. Not a single elo of that will be seen in a real game using time limits rather than depth limits.

Better to do a few searches with st=10, and use both 1 and 4 cpus and see what happens there. I'd suspect, based on your results, that the 4 cpu is not going to perform as well as one would expect, because part of the extra processing power is wasted searching extra nodes...
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Peculiarity of Komodo 5.1MP

Post by Laskos »

bob wrote:
Laskos wrote: It benefits from a wider tree and suffers from larger time to depth. What's so hard to understand?
Not a thing for me. But to say that the Komodo parallel search is somehow better is simply wrong. If the parallel search intentionally makes the tree bigger for the same depth, that is a loss, not a gain.
I am not saying that overall Komodo MP implementation is better. What I am saying is the following: a typical engine like Houdini, Stockfish, probably Crafty, has a MP efficiency of say 3.2 going from 1 to 4 threads. In this typical engine, _all_ this 3.2 is going to _depth_, so time-to-depth is determining the MP efficiency. In Komodo, depth is still higher in parallel search 4 to 1 threads at fixed time, but not by much. MP efficiency of Komodo could be separated in two: 1.8 to depth and 1.8 to width, for the same total of 3.2. So, say, Crafty from 1 to 4 threads gaines 2 plies at fixed time, while Komodo going from 1 to 4 threads gains only one ply, but widens the tree. Inferring MP efficiency of Komodo by pure time-to-depth is wrong.

And again, overall, probably the MP efficiency is comparable of Komodo and the typical engines which do not change the width.
Quite simple, yet some do not seem to get it.

the comparison in terms of Elo is simply invalid, period.




Otherwise a wider tree would be better in a non-parallel search as well.

Note that when you see an elo improvement going from 1 to 4 in fixed depth, it is imaginary. Just compare the times. The 4 cpu test is taking longer than it should, getting a time handicap that is imaginary in real chess.
All you say is trivial, and was mentioned in my first post. Komodo has lower time-to-depth ratio going from 1 threads to 4 threads, say 2.0 instead of 3.2, but an additional 80 Elo points benefit from a wider tree. If you measure only time-to-depth, get 2.0 instead of 3.2, and say it's scaling badly, you are wrong, as you do not include additional 80 points from wider tree.
There is NO "additional 80 points from the wider tree." It is ALL imaginary and caused by a flawed testing. You made the program slower, yet are not accounting for that at all in the measurement.
It's not imaginary. Say Houdini at fixed time 4 threads goes 2 plies deeper than 1-threaded, Houdini does not widen, and gains 160 points from 1 to 4 threads. Komodo is doing that by going only one ply deeper (80 points) and widening the tree (another 80 points), for a total of same 160 points of typical engine going from 1 to 4 threads.

The numbers are rough, but I hope you can follow.
As I said, it is a flawed comparison from the get-go.

Using your logic, you could turn on more extensions, turn off pruning, and make the program better in this particular test (only). Pretty much pointless, and obviously so... THAT was my point. The "80 Elo" (or whatever it actually was) is purely imaginary. Not a single elo of that will be seen in a real game using time limits rather than depth limits.
Again, the depth of 4 threaded Komodo is by 1 ply larger than 1 threaded to fixed time, so 80 from widening is not imaginary, the gain from MP on 4 threads is 80 from widening + benefit from _one_ additional ply Elo points. For non-widening engines it's the benefit from _two_ additional plies. Overall, MP gain is comparable.
Joerg Oster
Posts: 948
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany
Full name: Jörg Oster

Re: Peculiarity of Komodo 5.1MP

Post by Joerg Oster »

Don wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
Daniel Shawul wrote:This is misleading because komodo probably compensates for poor parallel implementation by searching wider, otherwise it shouldn't be getting any elos for fixed depth test. You should do the test with time and see how much each gain.
I already wrote that it probably searches wider with the number of threads. But it's different from other top engines, which to given depth are pretty much the same strength on different number of threads. And the rule (as stated by some) that time-to-depth is determining MP efficiency does not apply to Komodo.
I will leave to test groups and individuals to test with time, there are plenty of volunteers. I was just curious about this aspect.
Time-to-depth IS the correct measure. If Komodo searches wider, then it searches less efficiently. One wants to measure the COMPLETE SMP implementation, not just how the tree is split.

I can not imagine wanting to search wider on a parallel search. If it is a good idea, why not search wider on the sequential implementation as well?
Time-to-depth won't work for Komodo. It benefits from a wider tree in the parallel search to the same depth, and time-to-depth would understate its SMP performance Elo-wise. Comparing depths of the 1 threaded Komodo with the depths of 4-threaded Komodo is meaningless given the different tree shapes. Elo is all that matters in SMP efficiency, and I don't know why it has a wider tree in the parallel search, the fact is that 4 vs 1 thread Komodo benefits from two sides: deeper search and wider tree, both contributing substantially. The total contribution is probably comparable to other engines going from 1 to 4 threads, only that Komodo has this peculiar behavior.
Your statement is simply wrong. It doesn't "benefit" from a wider tree. It "suffers from" a wider tree.
It benefits from a wider tree and suffers from larger time to depth. What's so hard to understand?
Not a thing for me. But to say that the Komodo parallel search is somehow better is simply wrong. If the parallel search intentionally makes the tree bigger for the same depth, that is a loss, not a gain.
I think your argument is merely one of semantics. If you are just saying that we didn't get the classic pure speedup MP should give, then you are correct.

But if you are saying the way to implement MP can only be a specific way and that the goal should not be to improve the ELO but to give a 4x speedup only, then I disagree.
Me too!
There was a similar discussion in the "Lazy SMP part2" Thread here: http://talkchess.com/forum/viewtopic.ph ... at&start=0
Jörg Oster
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Peculiarity of Komodo 5.1MP

Post by bob »

Don wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
Daniel Shawul wrote:This is misleading because komodo probably compensates for poor parallel implementation by searching wider, otherwise it shouldn't be getting any elos for fixed depth test. You should do the test with time and see how much each gain.
I already wrote that it probably searches wider with the number of threads. But it's different from other top engines, which to given depth are pretty much the same strength on different number of threads. And the rule (as stated by some) that time-to-depth is determining MP efficiency does not apply to Komodo.
I will leave to test groups and individuals to test with time, there are plenty of volunteers. I was just curious about this aspect.
Time-to-depth IS the correct measure. If Komodo searches wider, then it searches less efficiently. One wants to measure the COMPLETE SMP implementation, not just how the tree is split.

I can not imagine wanting to search wider on a parallel search. If it is a good idea, why not search wider on the sequential implementation as well?
Time-to-depth won't work for Komodo. It benefits from a wider tree in the parallel search to the same depth, and time-to-depth would understate its SMP performance Elo-wise. Comparing depths of the 1 threaded Komodo with the depths of 4-threaded Komodo is meaningless given the different tree shapes. Elo is all that matters in SMP efficiency, and I don't know why it has a wider tree in the parallel search, the fact is that 4 vs 1 thread Komodo benefits from two sides: deeper search and wider tree, both contributing substantially. The total contribution is probably comparable to other engines going from 1 to 4 threads, only that Komodo has this peculiar behavior.
Your statement is simply wrong. It doesn't "benefit" from a wider tree. It "suffers from" a wider tree.
It benefits from a wider tree and suffers from larger time to depth. What's so hard to understand?
Not a thing for me. But to say that the Komodo parallel search is somehow better is simply wrong. If the parallel search intentionally makes the tree bigger for the same depth, that is a loss, not a gain.
I think your argument is merely one of semantics. If you are just saying that we didn't get the classic pure speedup MP should give, then you are correct.

But if you are saying the way to implement MP can only be a specific way and that the goal should not be to improve the ELO but to give a 4x speedup only, then I disagree.

To me the ONLY issue with MP is how to squeeze more playing strength out of the extra processors when they cannot fully contribute to a speedup. You know better than anyone else here that the MP search is not the same as an SP search - you cannot simply use the serial algorithm unmodified as if 4 cores makes your program exactly the same as 1 core sped up 4 times.

At MIT this was a major discussion among us because we were getting a 1824 process machine ready for Hong Kong one year. Maybe I'm stupid, but those guys are not, and their intuition after seeing that Cilkchess was not 1824 times faster was that there must be a better way to utilize those processors that went beyond just more of the same.

In a few years something will come along in computer chess, some new big thing like null move pruning was and like LMR was. You can be a follower or you can be an innovator. I am not in this just to be a good engineer - I would like to stretch the boundaries a bit. I don't think my MP implementation qualifies for that and I'm not saying that it does, but for me anything goes and I am not afraid to stray a little off the beaten path.

It could be, as some are saying, that because it seems to work that our serial search is seriously flawed and we could simply apply the same algorithm to our serial search. I would most ecstatic if that were true.

Quite simple, yet some do not seem to get it.

the comparison in terms of Elo is simply invalid, period.


Otherwise a wider tree would be better in a non-parallel search as well.

Note that when you see an elo improvement going from 1 to 4 in fixed depth, it is imaginary. Just compare the times. The 4 cpu test is taking longer than it should, getting a time handicap that is imaginary in real chess.
All you say is trivial, and was mentioned in my first post. Komodo has lower time-to-depth ratio going from 1 threads to 4 threads, say 2.0 instead of 3.2, but an additional 80 Elo points benefit from a wider tree. If you measure only time-to-depth, get 2.0 instead of 3.2, and say it's scaling badly, you are wrong, as you do not include additional 80 points from wider tree.
There is NO "additional 80 points from the wider tree." It is ALL imaginary and caused by a flawed testing. You made the program slower, yet are not accounting for that at all in the measurement.

As I said, it is a flawed comparison from the get-go.

Using your logic, you could turn on more extensions, turn off pruning, and make the program better in this particular test (only). Pretty much pointless, and obviously so... THAT was my point. The "80 Elo" (or whatever it actually was) is purely imaginary. Not a single elo of that will be seen in a real game using time limits rather than depth limits.

Better to do a few searches with st=10, and use both 1 and 4 cpus and see what happens there. I'd suspect, based on your results, that the 4 cpu is not going to perform as well as one would expect, because part of the extra processing power is wasted searching extra nodes...
Let me clarify my meaning.

1. SMP search is all about gaining Elo, nothing else.

2. The fastest way to gain Elo is to search deeper. Hence all of our efforts to write parallel searches that split the work and drive the tree deeper.

3. A parallel search, based on the definition of alpha/beta, will ALWAYS incur some overhead due to the fact we can't get move ordering perfect.

So the discussion boils down to two points:

(a) Is our parallel speedup so bad that we fell it is better to expend some of that effort (which is wasted on parallel search overhead) on something that is possibly more productive?

(b) can we actually redirect the search so that there is a gain in Elo, by searching "more stuff" rather than "deeper stuff" in a way that does NOT translate to a single-thread program,

My answers to both are straightforward and based on years of experience.

(a) the answer is "possibly yes". But not for 4 cores. Or 8 cores. Or 16 cores. For 4 cores, it is not hard to get a >3.0 speedup, losing just a fraction of one CPU. If you try to apply some of that effort elsewhere, you can't just "grab the overhead effort" you end up grabbing a part of the useful effort AND a part of the overhead effort. This hurts unless the parallel speedup is really bad.

(b) I don't see any way to limit our changes so that they just use the "wasted effort" to search wider. You simply end up taking some fraction of the total search time, which has some fraction of useful time and some fraction of wasted time, and expend this entire effort on reshaping the tree. Unfortunately, taking that useful time away hurts even more.

So, simply stated, my opinion, and it is just my opinion, is that this discussion has gone off into theoretically unsound areas that are meaningless.

If you can search 2x faster, you get a ply. It is going to be difficult to take the overhead from that 2x faster search and leave everything else alone so that you still go 2x faster, but use that overhead effort elsewhere. It's a nonsensical idea.

IMHO the ONLY reason for an SMP search is to improve Elo. What other point is there? The only practical way to do this is to search deeper. Unless you believe you can reach some sort of "tactically sufficient depth" where going deeper gives you zero improvement. There, I might buy into this discussion. But I have yet to ever reach such a point where another ply doesn't help, until you see a forced mate or draw.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Peculiarity of Komodo 5.1MP

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote: It benefits from a wider tree and suffers from larger time to depth. What's so hard to understand?
Not a thing for me. But to say that the Komodo parallel search is somehow better is simply wrong. If the parallel search intentionally makes the tree bigger for the same depth, that is a loss, not a gain.
I am not saying that overall Komodo MP implementation is better. What I am saying is the following: a typical engine like Houdini, Stockfish, probably Crafty, has a MP efficiency of say 3.2 going from 1 to 4 threads. In this typical engine, _all_ this 3.2 is going to _depth_, so time-to-depth is determining the MP efficiency. In Komodo, depth is still higher in parallel search 4 to 1 threads at fixed time, but not by much. MP efficiency of Komodo could be separated in two: 1.8 to depth and 1.8 to width, for the same total of 3.2. So, say, Crafty from 1 to 4 threads gaines 2 plies at fixed time, while Komodo going from 1 to 4 threads gains only one ply, but widens the tree. Inferring MP efficiency of Komodo by pure time-to-depth is wrong.

And again, overall, probably the MP efficiency is comparable of Komodo and the typical engines which do not change the width.
Quite simple, yet some do not seem to get it.

the comparison in terms of Elo is simply invalid, period.




Otherwise a wider tree would be better in a non-parallel search as well.

Note that when you see an elo improvement going from 1 to 4 in fixed depth, it is imaginary. Just compare the times. The 4 cpu test is taking longer than it should, getting a time handicap that is imaginary in real chess.
All you say is trivial, and was mentioned in my first post. Komodo has lower time-to-depth ratio going from 1 threads to 4 threads, say 2.0 instead of 3.2, but an additional 80 Elo points benefit from a wider tree. If you measure only time-to-depth, get 2.0 instead of 3.2, and say it's scaling badly, you are wrong, as you do not include additional 80 points from wider tree.
There is NO "additional 80 points from the wider tree." It is ALL imaginary and caused by a flawed testing. You made the program slower, yet are not accounting for that at all in the measurement.
It's not imaginary. Say Houdini at fixed time 4 threads goes 2 plies deeper than 1-threaded, Houdini does not widen, and gains 160 points from 1 to 4 threads. Komodo is doing that by going only one ply deeper (80 points) and widening the tree (another 80 points), for a total of same 160 points of typical engine going from 1 to 4 threads.

The numbers are rough, but I hope you can follow.
As I said, it is a flawed comparison from the get-go.

Using your logic, you could turn on more extensions, turn off pruning, and make the program better in this particular test (only). Pretty much pointless, and obviously so... THAT was my point. The "80 Elo" (or whatever it actually was) is purely imaginary. Not a single elo of that will be seen in a real game using time limits rather than depth limits.
Again, the depth of 4 threaded Komodo is by 1 ply larger than 1 threaded to fixed time, so 80 from widening is not imaginary, the gain from MP on 4 threads is 80 from widening + benefit from _one_ additional ply Elo points. For non-widening engines it's the benefit from _two_ additional plies. Overall, MP gain is comparable.
The test did not address your speculation.

The correct test is to take a group of parallel programs and play 'em in a large match, using 1 core per engine. Then take each engine, one at a time, and run 'em at 4 cores. Measure the Elo improvement. If Komodo gains more than the others, I'll stand ready to eat my hat. the test, as run, doesn't show a single thing about this topic, however...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Peculiarity of Komodo 5.1MP

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
Daniel Shawul wrote:
Uri Blass wrote:
Daniel Shawul wrote:This is misleading because komodo probably compensates for poor parallel implementation by searching wider, otherwise it shouldn't be getting any elos for fixed depth test. You should do the test with time and see how much each gain.
"poor parallel implementation?"

The target of parallel implementation is not to get bigger depth but to play better.

If the implementation is good in helping komodo to play better than I think that it is wrong to call it poor.

Maybe it is the opposite and komodo compensates for poor pruning of lines that it should not prune by parallel implementation that prevent it to prune good moves in the relevant lines.
Uri, I don't think you understood the test well. It was a _fixed depth_ test that is meaningless unless both parallel and serial versions search more or less similar tree.
If i decided to use alpha-beta only for the parallel search while using lmr+nullmove+other pruning for the sequential search who would you think would win for a fixed depth test. Get it?
I understood the test well.
fixed depth is meaningless if the target is to find strength improvement but it was clearly not the target of the test.

The target was simply to compare the behaviour of Komodo with the behaviour of other programs(not to compare strength).


My point was that the conclusion that the parallel implementation is poor is simply wrong.

Note that I did not claim nothing based on the specific test.
I believe that the implementation of komodo of paralllel search is good
based on what I read about results of it against other programs and I read that with many cores(more than 4 cores) it did better against other programs relative to one core tests.
This is easy enough to explain or grasp, depending on which side of the discussion one is on. If you make the tree wider, you do better in fixed depth search comparisons, because your wider tree does not hurt your time (it is not being measured) and the larger tree tends to miss less tactically. BUT, that wider tree has a definite cost. And in timed matches, that hurts. It does not help. All this test has shown is that Komodo has some odd thing in its search that changes the shape of the tree. One could, for example, double every search extension in SMP mode. A fixed-depth search would play stronger, even though the program is much slower to reach that 12 ply depth now.

This shows absolutely nothing other than that the parallel search looks at stuff the sequential search does not, and as a general rule, that is detrimental to overall Elo, NOT beneficial.

I agree only with the first part of the last sentence.

fixed depth show nothing about the question if it is good to have parallel search looks at wider tree.

I disagree with the general rule that it is detrimental to overall Elo.
It may be detrimental and may be beneficial and you need tests not at fixed depth in order to find out.

I think that in the specific case of komodo it is beneficial not based on fixed depth results.
The reason for my opinion is that
I read that komodo performed better with 12 cores relative to 1 core against other programs when the test was at fixed time and not fixed depth.
If a wider tree is NOT detrimental to overall Elo, then why do we not just search wider trees by limiting pruning and reductions? The answer is obvious.