Peculiarity of Komodo 5.1MP

bob · Post by **bob** » Fri Jun 21, 2013 4:09 pm

Joerg Oster wrote:
bob wrote:
Joerg Oster wrote:What settings for 'Min Split Depth' did you use for Houdini and Stockfish? Same for both?

Depending on the setting you compared a 1-core to an almost 1-core engine. More or less.

Interestingly enough, Komodo5.1MP doesn't have such a parameter ...
Note that this is a TINY tuning parameter. It might take at least tens of thousands of games to measure the change after altering min split depth, unless you go way too low. Within +/- 2 of the default for any program, there is little gain/loss.
Yes, in general you are right.
But if you run fixed depth matches it may have a big influence... Let's say I set it to 10 and run some games at fixed depth of 11. I would not expect a big difference between the 1-core and 4-core versions. Right?

I would not expect any significant difference between 1 and 4 cores if searching to the same depth, period...

bob · Post by **bob** » Fri Jun 21, 2013 7:38 pm

michiguel wrote:
Laskos wrote:
yanquis1972 wrote:interesting...is this a unique (in the literal sense) implementation of MP in a chess engine?
It's the first time I've encountered it, but I have only tested very few engines. Some folks like Bob Hyatt even stated bluntly that time-to-depth is the only way to measure MP efficiency. Not for Komodo.
I remember the thread and I did not agree with that. What matters, for any engine, is the Elo gained. Time to depth for a selected number of representative (are they?) positions may or may not be an accurate way to predict how much strength is gained, for several reasons. There are too many assumptions in the process.

Miguel

Feel free to show me ANY parallel search that gains Elo from ANYTHING other than increased depth. Any example will do. And note that the current test does NOT qualify because it doesn't show a thing about Elo.

Elo is purely based on time, because all chess games are based on time. Depth is irrelevant as a search constraint, as if we are going to have a match at fixed depth, I am going to disable ALL pruning and reductions, and ramp the extensions up to the max. Might take weeks to make a single move, but it will move after completing the depth you specify. And it proves absolutely nothing.

Again, parallel search is about depth. Nothing more. Once you can't squeeze more depth, you might start to burn a few nodes by restricting some of the speculative stuff like pruning and reductions. One certainly won't reach that point with just 4 or 8 or 16 cores...

To show just how silly this argument is getting, if you believe that doing something other than improving depth works with 2 cores or 4, what about just making a single CPU 2x or 4x faster. Would you STILL want to go for a wider search there? If not, this discussion is completely ridiculous. If so, then why would wider be better on hardware 2x or 4x faster, but not on current cpu speeds?

Do we have any practical data on Komodo's SMP Elo improvement under REAL conditions? IE same time control, same opponents, but komodo uses 1, then 2, then 4, etc cores while the others stick with 1. That is how I test my parallel search stuff. I have not seen such data to date. If I could run Komodo I would test it in a heartbeat, but I can only run linux stuff on my cluster and have to compile to run with our lightweight kernels...
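A minimal sketch of how results from the test described above (same time control, same opponents, Komodo on 1, then 2, then 4 cores) could be turned into Elo differences. The function names `elo_diff` and `smp_gain` are hypothetical; the formula is the standard logistic Elo model, and no actual match data is assumed.

```python
import math


def elo_diff(score):
    """Elo difference implied by a match score fraction (0 < score < 1),
    using the standard logistic Elo model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def smp_gain(score_1core, score_ncores):
    """Elo gained going from 1 core to N cores, when both configurations
    played the same opponents at the same time control."""
    return elo_diff(score_ncores) - elo_diff(score_1core)
```

Each score would be the fraction of points the engine took against the fixed opponent pool; the error bars on those scores still need to be considered before reading anything into the difference.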

bob · Post by **bob** » Fri Jun 21, 2013 7:52 pm

lkaufman wrote:
Daniel Shawul wrote:
lkaufman wrote:I am quite sure that neither our parallel nor single-core search is even close to optimal, just as I am sure this is true for every other engine now in existence. But you haven't convinced me that given an optimal single-core engine, the parallel search must just be faster, not wider. Why can't a wider search get a bigger speedup from parallelism?
When I say optimal it doesn't mean searching the minimal tree but the best sequential search that you have now i.e. un-improvable with the current knowledge that you have (not absolutely un-improvable). Now if you get better search i.e. ELO by searching wider when doing parallel search, it means you can do the same for the sequential search as well. Parallel search is not much different from giving the engine more time. So that means for the sequential search, you can decide ,say to widen with time or something like that, for the same effect. I don't know how that can be implemented in a real engine but that is not my point. Better results with wide SMP=>Better results with wide SEQUENTIAL search given more time=> Sequential search improved if we know how to do it exactly like the parallel search does.
Don acknowledges that but he was arguing that 4 threads is not the same as a 4x search, but I don't see how it would be different even if you use 3.6x time.
Don's point is that if you could get four to one speedup by a known parallel algorithm (for 4 cores), there is no way this could be improved upon by changing the sequential search. But what evidence is there that if you get 3 to 1 for 4 cores for your best sequential engine, you can't get 3.2 to 1 for 4 cores by a wider algorithm? This is what we believe, and I think this explains our results.

Here's the problem: if you get 3-1, you are doing 25% extra stuff to burn that 4th cpu and get nothing from it. How, exactly, can you identify (in a normal parallel search) which part of the search is overhead, and then re-direct that effort to something else? Yet that is what must be done if you are going to go beyond your basic 3.0x you quoted. If we knew which nodes were overhead, we would not search 'em in the first place, so exactly how do we identify them now and then use that effort to do something else, instead?

I am a big believer in parallel search, been doing it since 1978. I am not a big believer in vague ideas that sound good until the idea is looked at carefully.

Parallel search is actually pretty well understood, as it is not new and nothing new has happened other than a few evolutionary ideas as done in DTS, or Newborn's PVS, etc..
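The arithmetic behind the 3-to-1 example above can be written out as a small sketch. The function names are hypothetical, and "efficiency" here simply means speedup divided by core count.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of the total hardware that turns into useful speedup."""
    return speedup / cores


def wasted_fraction(speedup, cores):
    """Fraction of the total effort that is overhead (extra nodes,
    idle time, and so on) rather than useful work."""
    return 1.0 - parallel_efficiency(speedup, cores)
```

With a 3.0x speedup on 4 cores, `wasted_fraction(3.0, 4)` is 0.25, which is the "25% extra stuff" mentioned above. Getting to 3.2x would mean recovering part of that overhead, which requires knowing which nodes are overhead in the first place.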

bob · Post by **bob** » Fri Jun 21, 2013 7:57 pm

michiguel wrote:
Daniel Shawul wrote:Ok I have a better example. You can use two threads on _one_ physical processor which is the same as a sequential search. If you want you can keep two search stacks in your program and update them sequentially like two threads are handled with time slicing, then the two are the same. So now if wider parallel search done on _two_ physical processors is better than its deep search counterpart done on the same two processors, then the same holds for the case when it is run on one physical processor with two threads. The wider search counterpart should perform better in both cases. This is the same argument as "is hyper-threading good or not?", with the only difference that an HT-enabled processor has 5% or more die area. I am sure this 'Reductio ad absurdum' argument has come up there too, to the effect that you can improve the search on one core by adding more threads (not cores) ad infinitum. Enjoyed some botched-up Latin?
Yes, that would do it provided the inefficiency of running two threads in one core is acceptable. It is theoretically possible that hardware could provide a minimum overhead. However, this will indicate a superlinear performance of the algorithm and preliminary results in this forum show no evidence of that (at least what I saw).

I think the scenario I mentioned in the second paragraph is possible. That would still give a worse performance for the two threads in one core and explain the results observed.

Miguel

That "explain the results observed" keeps coming up. I don't see why. The ONLY thing "observed" is that if you fix the depth, and change something else, the program that searches the largest tree will win more games, because it is not getting penalized since the extra time it uses is not measured under the given experimental conditions. Hence my continual reference to just taking ANY program and disabling all pruning and reductions. It will now show a marked Elo improvement over its unchanged version. Yet it will be far weaker in real games where the bigger tree greatly reduces the effective depth when using time as a constraint.

The "results observed" are meaningless. Completely meaningless. I run this very test from time to time, but only to prove I have not broken my parallel search. If the parallel fixed-depth program plays worse, something's wrong. I've never had a case where it played better, taking the error bar into consideration. If it did show an improvement, I would instantly run a timed match which would certainly show an Elo drop, rather than a gain.
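A toy model of why fixed-depth results hide the cost of a bigger tree. It assumes a uniform tree whose size is `ebf ** depth`, which is a deliberate simplification, and all names here are hypothetical. It does not model playing strength at all, only how much work a given depth costs and how much depth a given amount of work buys.

```python
import math


def nodes_to_depth(ebf, depth):
    """Nodes needed to finish `depth` plies in a uniform tree with
    effective branching factor `ebf` (simplified model)."""
    return ebf ** depth


def depth_in_budget(ebf, node_budget):
    """Depth reachable with a fixed node budget (a stand-in for a fixed
    amount of time) in the same simplified model."""
    return math.log(node_budget) / math.log(ebf)
```

At fixed depth 11, the wider search (say `ebf=3.0` against `ebf=2.0`) simply visits more nodes and the extra time is never charged to it. With a fixed node budget, the same wider search reaches fewer plies, which is the effect a timed match measures.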

Laskos · Post by **Laskos** » Fri Jun 21, 2013 9:20 pm

bob wrote:
The "results observed" are meaningless. Completely meaningless. I run this very test from time to time, but only to prove I have not broken my parallel search. If the parallel fixed-depth program plays worse, something's wrong. I've never had a case where it played better, taking the error bar into consideration. If it did show an improvement, I would instantly run a timed match which would certainly show an Elo drop, rather than a gain.

The Elo drop is against the optimally implemented parallel search on the same 4 threads. But overall, the strength on 4 threads, with this non-optimal SMP, could still be higher than on 1 thread. You seem to be confused. Are you denying the fact that an SMP implementation which doesn't change the depth can gain Elo points going from 1 to 4 threads? This is trivial, but you seem not to read this thread.

michiguel · Post by **michiguel** » Fri Jun 21, 2013 9:35 pm

bob wrote:
michiguel wrote:
Laskos wrote:
yanquis1972 wrote:interesting...is this a unique (in the literal sense) implementation of MP in a chess engine?
It's the first time I've encountered it, but I have only tested very few engines. Some folks like Bob Hyatt even stated bluntly that time-to-depth is the only way to measure MP efficiency. Not for Komodo.
I remember the thread and I did not agree with that. What matters, for any engine, is the Elo gained. Time to depth for a selected number of representative (are they?) positions may or may not be an accurate way to predict how much strength is gained, for several reasons. There are too many assumptions in the process.

Miguel
Feel free to show me ANY parallel search that gains Elo from ANYTHING other than increased depth. Any example will do. And note that the current test does NOT qualify because it doesn't show a thing about Elo.

Elo is purely based on time, because all chess games are based on time. Depth is irrelevant as a search constraint, as if we are going to have a match at fixed depth, I am going to disable ALL pruning and reductions, and ramp the extensions up to the max. Might take weeks to make a single move, but it will move after completing the depth you specify. And it proves absolutely nothing.

Again, parallel search is about depth. Nothing more. Once you can't squeeze more depth, you might start to burn a few nodes by restricting some of the speculative stuff like pruning and reductions. One certainly won't reach that point with just 4 or 8 or 16 cores...

To show just how silly this argument is getting, if you believe that doing something other than improving depth works with 2 cores or 4, what about just making a single CPU 2x or 4x faster. Would you STILL want to go for a wider search there? If not, this discussion is completely ridiculous. If so, then why would wider be better on hardware 2x or 4x faster, but not on current cpu speeds?

Do we have any practical data on Komodo's SMP Elo improvement under REAL conditions? IE same time control, same opponents, but komodo uses 1, then 2, then 4, etc cores while the others stick with 1. That is how I test my parallel search stuff. I have not seen such data to date. If I could run Komodo I would test it in a heartbeat, but I can only run linux stuff on my cluster and have to compile to run with our lightweight kernels...

The issue is that
1) Komodo seems to have a reasonable (i.e. at least not obviously worse) increase in Elo with 4 and 12 cores.
2) At the same time, as you have seen in Kai's experiment, Komodo's SMP implementation is NOT only about pure depth. It searches a wider tree.

Assuming no experimental artifacts, it seems Komodo's approach uses other things besides depth to improve its strength when it goes parallel.

You are ignoring #1, which remains to be proven with a lower error bar, but it does not look like it may change much. Point #2 seems to be clear.

Miguel

syzygy · Post by **syzygy** » Fri Jun 21, 2013 10:03 pm

bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:Time-to-depth IS the correct measure.
Give one reason why decreased time-to-depth is more important than elo gain from using more cores.
Absolutely trivial to do. When you change the selectivity of a program, but continue to search to the SAME depth, you take more time. That program, at that depth, searches a larger tree, and will play better. A full-width 12 ply search will whip the snot out of a current 12 ply search with LMR, pruning and such enabled. It will also take FAR longer to complete that 12 ply search.

This really is simple to understand I would think. If you make the tree wider, you gain less depth with 4 cores. If the wider search is more important than the extra depth 4x speed should give, then why not gain that extra width in the one-core search.

Time to get serious here and not start an argument that is totally pointless and founded on flawed reasoning.
Let me get this straight. You say that time-to-depth is more important than engine strength?
I said "Time to depth is the MOST IMPORTANT thing influencing engine strength." Which is NOT what you are suggesting.

I don't think my original question was so complicated:

syzygy wrote:
bob wrote:Time-to-depth IS the correct measure.
Give one reason why decreased time-to-depth is more important than elo gain from using more cores.

The correct measure clearly is elo gain from using more cores. This can't be seriously disputed.

Now, it could be that for certain engines this is practically equivalent to measuring the decrease in time-to-depth. The point is: this is not the case for all engines. This is what Kai's experiment shows. Crafty is not the measure of all engines.

Btw, where did you say "Time to depth is the MOST IMPORTANT thing influencing engine strength." ? You did not say that. What you said:

bob wrote:Time-to-depth IS the correct measure. If Komodo searches wider, then it searches less efficiently. One wants to measure the COMPLETE SMP implementation, not just how the tree is split.

bob wrote:1. SMP efficiency? time-to-depth is THE way to do this, Nothing else works.

Almost everybody else has been agreeing from the start of this thread that elo gain = increased engine strength per added core is THE measure of SMP efficiency.

Time-to-depth does NOT measure the COMPLETE SMP implementation, at least not for Komodo. If it is not true for Komodo, it is not true as a general statement.

bob · Post by **bob** » Sat Jun 22, 2013 5:54 pm

Laskos wrote:
bob wrote:
The "results observed" are meaningless. Completely meaningless. I run this very test from time to time, but only to prove I have not broken my parallel search. If the parallel fixed-depth program plays worse, something's wrong. I've never had a case where it played better, taking the error bar into consideration. If it did show an improvement, I would instantly run a timed match which would certainly show an Elo drop, rather than a gain.
The Elo drop is against the optimally implemented parallel search on the same 4 threads. But overall, the strength on 4 threads, with this non-optimal SMP, could still be higher than on 1 thread. You seem to be confused. Are you denying the fact that an SMP implementation which doesn't change the depth can gain Elo points going from 1 to 4 threads? This is trivial, but you seem not to read this thread.

I'm not confused about anything at all here. Re-read the FIRST post, which is what I responded to. This statement, specifically:

As it can be seen Houdini and Stockfish are within error margins equal to constant depth on 4 and 1 thread. However Komodo 5.1MP shows +80 points increase for 4 threads compared to 1 thread to fixed depth 11. So, time to depth is an incorrect way of calculating Komodo's MP efficiency. It seems that it increases the width of the tree as much as it increases the depth with number of threads.

To answer your question, NO, I do not believe it possible to produce more Elo by tinkering with search width or anything else, as opposed to driving the search deeper with a traditional parallel search. If this were true, one could do the SAME thing by multiplexing a parallel search on a single CPU.

Also, in the statement I quoted, the last sentence is nonsense, because the data does NOT say a single thing about Komodo's parallel search. Absolutely nothing. EXCEPT that the parallel search is losing significant efficiency by searching extra nodes.

So, re-read the quote above. Look at the data that was given. And think about it for a bit. All that test shows is a PROBLEM in the Komodo search, not some clever Elo gain. One does NOT want to search wider in what should already be an optimally-tuned search.

Yes, Komodo is likely stronger on 4 cores than on 1. But that is a direct result of going deeper. Going wider is preventing it from going still deeper and getting a bigger Elo gain...

I have no idea why this thread has gone into the land of theoretical irrationality. But it has. The question was not "Is it remotely possible one might use part of the parallel search to do something other than going deeper?" (answer = highly unlikely but remotely possible.) The question is, quite simply, "does the data suggest this has happened?" The answer is a simple and resounding "No it has not." If my program produced such data, I would be busy debugging.

bob · Post by **bob** » Sat Jun 22, 2013 6:01 pm

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:Time-to-depth IS the correct measure.
Give one reason why decreased time-to-depth is more important than elo gain from using more cores.
Absolutely trivial to do. When you change the selectivity of a program, but continue to search to the SAME depth, you take more time. That program, at that depth, searches a larger tree, and will play better. A full-width 12 ply search will whip the snot out of a current 12 ply search with LMR, pruning and such enabled. It will also take FAR longer to complete that 12 ply search.

This really is simple to understand I would think. If you make the tree wider, you gain less depth with 4 cores. If the wider search is more important than the extra depth 4x speed should give, then why not gain that extra width in the one-core search.

Time to get serious here and not start an argument that is totally pointless and founded on flawed reasoning.
Let me get this straight. You say that time-to-depth is more important than engine strength?
I said "Time to depth is the MOST IMPORTANT thing influencing engine strength." Which is NOT what you are suggesting.
I don't think my original question was so complicated:
syzygy wrote:
bob wrote:Time-to-depth IS the correct measure.
Give one reason why decreased time-to-depth is more important than elo gain from using more cores.
The correct measure clearly is elo gain from using more cores. This can't be seriously disputed.

Now, it could be that for certain engines this is practically equivalent to measuring the decrease in time-to-depth. The point is: this is not the case for all engines. This is what Kai's experiment shows. Crafty is not the measure of all engines.

1. Kai's data does NOT show anything relative to parallel search Elo gain. Suppose the program, running on 4 cores, searches a tree 5x larger than the tree on 1 core. Any elo gain there? Of course not, it would be a loss. Yet for this experiment, the latter would be favored since it searches a larger tree and makes fewer mistakes. Taking longer to search to a given depth than the 1-core version doesn't get penalized.

2. There has been ZERO evidence to show that such a "wider search" is stronger. If this happened, does it not DIRECTLY show that the sequential search could be improved? Multiplexing works. Except when you search a larger tree so that there is a net loss, as in this case.

Btw, where did you say "Time to depth is the MOST IMPORTANT thing influencing engine strength." ? You did not say that. What you said:
bob wrote:Time-to-depth IS the correct measure. If Komodo searches wider, then it searches less efficiently. One wants to measure the COMPLETE SMP implementation, not just how the tree is split.

bob wrote:1. SMP efficiency? time-to-depth is THE way to do this, Nothing else works.
Almost everybody else has been agreeing from the start of this thread that elo gain = increased engine strength per added core is THE measure of SMP efficiency.

Time-to-depth does NOT measure the COMPLETE SMP implementation, at least not for Komodo. If it is not true for Komodo, it is not true as a general statement.

It is absolutely true as a general statement. All you have to do is give just ONE example to the contrary. And the current data does NOT meet that standard. I gave the experiment that solves this: test the 1 and 4 core versions against the same set of opponents. Then all of this witchcraft and superstition will dissipate. But regardless, this test data shows absolutely nothing other than the simple "yes, if someone searches a larger tree, with absolutely no regard to how it increases the search time required, that larger tree will likely produce an Elo gain."

It is intuitively obvious, and completely useless as a concept. Apparently this is just another case of "Let's argue about something that is nonsensical..."
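The numbers in point 1 above (4 cores, a tree 5x larger) can be checked with a short sketch. The function name and the `nps_scaling` parameter are hypothetical, and the default assumes, optimistically, that node rate scales perfectly with the number of cores.

```python
def time_to_depth_speedup(cores, tree_growth, nps_scaling=1.0):
    """Speedup in time-to-depth relative to the 1-core search.

    cores       -- number of cores used by the parallel search
    tree_growth -- parallel tree size divided by 1-core tree size
                   at the same nominal depth
    nps_scaling -- fraction of ideal nodes-per-second scaling achieved
                   (1.0 means N cores search exactly N times the nodes/sec)
    """
    return cores * nps_scaling / tree_growth
```

`time_to_depth_speedup(4, 5.0)` gives 0.8, so the 4-core search would reach a given depth more slowly than the 1-core search, yet a fixed-depth match never charges it for the extra time.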

bob · Post by **bob** » Sat Jun 22, 2013 6:08 pm

michiguel wrote:
bob wrote:
michiguel wrote:
Laskos wrote:
yanquis1972 wrote:interesting...is this a unique (in the literal sense) implementation of MP in a chess engine?
It's the first time I've encountered it, but I have only tested very few engines. Some folks like Bob Hyatt even stated bluntly that time-to-depth is the only way to measure MP efficiency. Not for Komodo.
I remember the thread and I did not agree with that. What matters, for any engine, is the Elo gained. Time to depth for a selected number of representative (are they?) positions may or may not be an accurate way to predict how much strength is gained, for several reasons. There are too many assumptions in the process.

Miguel
Feel free to show me ANY parallel search that gains Elo from ANYTHING other than increased depth. Any example will do. And note that the current test does NOT qualify because it doesn't show a thing about Elo.

Elo is purely based on time, because all chess games are based on time. Depth is irrelevant as a search constraint, as if we are going to have a match at fixed depth, I am going to disable ALL pruning and reductions, and ramp the extensions up to the max. Might take weeks to make a single move, but it will move after completing the depth you specify. And it proves absolutely nothing.

Again, parallel search is about depth. Nothing more. Once you can't squeeze more depth, you might start to burn a few nodes by restricting some of the speculative stuff like pruning and reductions. One certainly won't reach that point with just 4 or 8 or 16 cores...

To show just how silly this argument is getting, if you believe that doing something other than improving depth works with 2 cores or 4, what about just making a single CPU 2x or 4x faster. Would you STILL want to go for a wider search there? If not, this discussion is completely ridiculous. If so, then why would wider be better on hardware 2x or 4x faster, but not on current cpu speeds?

Do we have any practical data on Komodo's SMP Elo improvement under REAL conditions? IE same time control, same opponents, but komodo uses 1, then 2, then 4, etc cores while the others stick with 1. That is how I test my parallel search stuff. I have not seen such data to date. If I could run Komodo I would test it in a heartbeat, but I can only run linux stuff on my cluster and have to compile to run with our lightweight kernels...
The issue is that
1) Komodo seems to have a reasonable (i.e. at least not obviously worse) increase in Elo with 4 and 12 cores.
2) At the same time, as you have seen in Kai's experiment, Komodo's SMP implementation is NOT only about pure depth. It searches a wider tree.

Assuming no experimental artifacts, it seems Komodo's approach uses other things besides depth to improve its strength when it goes parallel.

You are ignoring #1, which remains to be proven with a lower error bar, but it does not look like it may change much. Point #2 seems to be clear.

Miguel

(1) data to support this is where, exactly? NOT in the first post of this thread;

(2) I've searched wider trees many times. Unintentionally. And I fixed it. The entire thrust of today's programs is driving the EBF DOWN. Not UP. If UP is better, it was driven down incorrectly.

I'm currently rewriting my reduction code. The first version did better in fixed depth testing on tactical positions, just to verify it wasn't obviously broken. It was also 30+ Elo WEAKER on a single CPU test in real games using a time limit. Where at fixed depth it would have been STRONGER because it was (incorrectly) not reducing everywhere I intended and searching a larger tree than the previous version. Fixed depth completely hid this. Normal testing exposed it quickly.

That's the point. Offer some data that supports this stuff. The original data in this thread doesn't support anything other than that Komodo searches a larger tree in parallel mode than in sequential. Larger than what is explained by the usual parallel search overhead everyone knows about. Whether this is good or bad is currently unknown with no data to help draw conclusions. However, experience shows that as the tree grows, due solely to parallel search, Elo DROPS. Not to say the parallel speedup gives nothing, but if the tree is 25% bigger, that is a 25% Elo gain you will never get due to parallel search since it is overhead compared to the sequential search.
