Peculiarity of Komodo 5.1MP

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Peculiarity of Komodo 5.1MP

Post by michiguel »

bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
Laskos wrote:
yanquis1972 wrote: Interesting... is this a unique (in the literal sense) implementation of MP in a chess engine?
This is the first time I have encountered it, but I have only tested very few engines. Some folks like Bob Hyatt even stated bluntly that time-to-depth is the only way to measure MP efficiency. Not for Komodo.
I remember the thread and I did not agree with that. What matters, for any engine, is the Elo gained. Time to depth for a selected number of representative (are they?) positions may or may not be an accurate way to predict how much strength is gained, for several reasons. There are too many assumptions in the process.

Miguel
Feel free to show me ANY parallel search that gains Elo from ANYTHING other than increased depth. Any example will do. And note that the current test does NOT qualify because it doesn't show a thing about Elo.

Elo is purely based on time, because all chess games are based on time. Depth is irrelevant as a search constraint: if we are going to have a match at fixed depth, I am going to disable ALL pruning and reductions and ramp the extensions up to the max. It might take weeks to make a single move, but it will move after completing the depth you specify. And it proves absolutely nothing.

Again, parallel search is about depth. Nothing more. Once you can't squeeze more depth, you might start to burn a few nodes by restricting some of the speculative stuff like pruning and reductions. One certainly won't reach that point with just 4 or 8 or 16 cores...

To show just how silly this argument is getting: if you believe that doing something other than improving depth works with 2 or 4 cores, what about just making a single CPU 2x or 4x faster? Would you STILL want to go for a wider search there? If not, this discussion is completely ridiculous. If so, then why would wider be better on hardware 2x or 4x faster, but not at current CPU speeds?

Do we have any practical data on Komodo's SMP Elo improvement under REAL conditions? I.e., same time control, same opponents, but Komodo uses 1, then 2, then 4, etc. cores while the others stick with 1. That is how I test my parallel search stuff. I have not seen such data to date. If I could run Komodo I would test it in a heartbeat, but I can only run Linux stuff on my cluster and have to compile to run with our lightweight kernels...
The issue is that
1) Komodo seems to have a reasonable (i.e. at least not obviously worse) increase in Elo with 4 and 12 cores.
2) At the same time, as you have seen from Kai's experiment, Komodo's SMP implementation is NOT only about pure depth. It searches a wider tree.

Assuming no experimental artifacts, it seems Komodo's approach uses other things besides depth to improve its strength when it goes parallel.

You are ignoring #1, which remains to be proven with a lower error bar, but it does not look like it will change much. Point #2 seems to be clear.

Miguel
(1) data to support this is where, exactly? NOT in the first post of this thread;
It is in the T&M sub-forum in more than one post.
This could be useful:
http://cegt.siteboard.eu/f6t824-testing ... -4cpu.html
110-120 Elo for 4 CPUs is not bad.
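For a rough sense of scale, here is a back-of-the-envelope sketch (assuming the usual rule of thumb of roughly 60-70 Elo per doubling of effective speed; that constant is an assumption, not something measured in these tests):

Code:
import math

def implied_speedup(elo_gain, elo_per_doubling=65.0):
    # Effective speedup implied by an Elo gain, under an assumed
    # (time-control dependent) Elo-per-doubling constant.
    return 2.0 ** (elo_gain / elo_per_doubling)

# 110-120 Elo on 4 CPUs would correspond to roughly a 3.2x-3.6x effective speedup
print(implied_speedup(110), implied_speedup(120))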

Miguel

(2) I've searched wider trees many times. Unintentionally. And I fixed it. The entire thrust of today's programs is driving the EBF DOWN. Not UP. If UP is better, it was driven down incorrectly.

I'm currently rewriting my reduction code. The first version did better in fixed-depth testing on tactical positions, which I ran just to verify it wasn't obviously broken. It was also 30+ Elo WEAKER in a single-CPU test in real games using a time limit, whereas at fixed depth it would have been STRONGER, because it was (incorrectly) not reducing everywhere I intended and was searching a larger tree than the previous version. Fixed depth completely hid this. Normal testing exposed it quickly.

That's the point. Offer some data that supports this stuff. The original data in this thread doesn't support anything other than that Komodo searches a larger tree in parallel mode than in sequential mode. Larger than what is explained by the usual parallel search overhead everyone knows about. Whether this is good or bad is currently unknown, with no data to help draw conclusions. However, experience shows that as the tree grows, due solely to parallel search, Elo DROPS. That is not to say the parallel speedup gives nothing, but if the tree is 25% bigger, that is an Elo gain you will never get from the parallel search, since that 25% is overhead compared to the sequential search.
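To make the arithmetic in that last point concrete, here is a small sketch (all numbers are illustrative assumptions, including the NPS speedup, the overhead fraction, and the Elo-per-doubling constant):

Code:
import math

def effective_speedup(nps_speedup, extra_nodes_fraction):
    # If the parallel tree is X% larger, only speedup/(1+X) buys extra depth.
    return nps_speedup / (1.0 + extra_nodes_fraction)

def elo_from_speedup(speedup, elo_per_doubling=65.0):
    # Rule-of-thumb conversion; the constant is an assumption.
    return elo_per_doubling * math.log2(speedup)

s = effective_speedup(3.8, 0.25)   # e.g. 3.8x NPS on 4 cores, 25% larger tree -> ~3.0x
print(round(s, 2), round(elo_from_speedup(3.8) - elo_from_speedup(s), 1))  # Elo lost to the overhead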
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Peculiarity of Komodo 5.1MP

Post by Don »

bob wrote:
Laskos wrote:
bob wrote:
The "results observed" are meaningless. Completely meaningless. I run this very test from time to time, but only to prove I have not broken my parallel search. If the parallel fixed-depth program plays worse, something's wrong. I've never had a case where it played better, taking error bar into consideration. If It did show an improvement, I would instantly run a timed match which would certainly show an elo drop, rather than gain.
The Elo drop is against the optimally implemented parallel search on the same 4 threads. But overall, the strength on 4 threads, with this non-optimal SMP, could still be higher than on 1 thread. You seem to be confused. Are you denying the fact that an SMP implementation which doesn't change the depth can gain Elo points going from 1 to 4 threads? This is trivial, but you seem not to have read this thread.
I'm not confused about anything at all here. Re-read the FIRST post, which is what I responded to. This statement, specifically:
As can be seen, Houdini and Stockfish are equal within error margins at constant depth on 4 and 1 threads. However, Komodo 5.1MP shows a +80 point increase for 4 threads compared to 1 thread at fixed depth 11. So time to depth is an incorrect way of calculating Komodo's MP efficiency. It seems that it increases the width of the tree as much as it increases the depth with the number of threads.
To answer your question, NO, I do not believe it possible to produce more Elo by tinkering with search width or anything else, as opposed to driving the search deeper with a traditional parallel search. If this were true, one could do the SAME thing by multiplexing a parallel search on a single CPU.
THIS is the definitive issue and where we disagree. Everything else is just a lot of noise.

What you are not considering is that multiplexing a parallel search onto a single CPU is easy (just use threads) and relatively efficient, but you cannot go the other way without massive losses. In other words you cannot easily and efficiently build a system that takes 64 processors and allows you to write single processor programs that run 64 times faster. It's a one way trip.
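A trivial toy illustration of the first half of that (not a real search, just to show that "multiplexing" is nothing more than starting threads and letting the OS time-slice them on one core):

Code:
import threading, time

def worker(counts, stop, i):
    n = 0
    while not stop.is_set():
        n += 1                      # stand-in for "search one node"
    counts[i] = n

stop = threading.Event()
counts = [0] * 4
threads = [threading.Thread(target=worker, args=(counts, stop, i)) for i in range(4)]
for t in threads:
    t.start()
time.sleep(1.0)                     # on a single core the four threads simply time-slice
stop.set()
for t in threads:
    t.join()
print("nodes per thread:", counts)

Going the other way - making one thread's work run 4x faster by spreading it over 4 cores - has no equally cheap trick, which is the point.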

So you have to write a parallel program differently. For chess it's impossible to take advantage of all the processors you have - so it's reasonable to explore better ways to do that.

I think what you are basically saying is that you should always use the same exact algorithm for ANY application whether you must run it on multi-core or single core - and that is just plain wrong. For multi-core algorithms you pick and choose algorithms that lend themselves better to parallelism.

At MIT, when testing Cilk, they wrote some sorting routines, compared performance, and found that the best serial sort was not the best for a parallel program. I don't remember the details, but they picked a sort that was not ideal for a single-processor machine but was almost as good and far better for SMP. I think this is a really simple concept and I do not see why you cannot understand it.


Also, in the statement I quoted, the last sentence is nonsense, because the data does NOT say a single thing about Komodo's parallel search. Absolutely nothing. EXCEPT that the parallel search is losing significant efficiency by searching extra nodes.

So, re-read the quote above. Look at the data that was given. And think about it for a bit. All that test shows is a PROBLEM in the Komodo search, not some clever Elo gain. One does NOT want to search wider in what should already be an optimally-tuned search.

Yes, Komodo is likely stronger on 4 cores than on 1. But that is a direct result of going deeper. Going wider is preventing it from going still deeper and getting a bigger Elo gain...

I have no idea why this thread has gone into the land of theoretical irrationality. But it has. The question was not "Is it remotely possible one might use part of the parallel search to do something other than going deeper?" (answer = highly unlikely but remotely possible.) The question is, quite simply, "does the data suggest this has happened?" The answer is a simple and resounding "No it has not." If my program produced such data, I would be busy debugging.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Peculiarity of Komodo 5.1MP

Post by bob »

michiguel wrote:
bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
Laskos wrote:
yanquis1972 wrote: Interesting... is this a unique (in the literal sense) implementation of MP in a chess engine?
This is the first time I have encountered it, but I have only tested very few engines. Some folks like Bob Hyatt even stated bluntly that time-to-depth is the only way to measure MP efficiency. Not for Komodo.
I remember the thread and I did not agree with that. What matters, for any engine, is the Elo gained. Time to depth for a selected number of representative (are they?) positions may or may not be an accurate way to predict how much strength is gained, for several reasons. There are too many assumptions in the process.

Miguel
Feel free to show me ANY parallel search that gains Elo from ANYTHING other than increased depth. Any example will do. And note that the current test does NOT qualify because it doesn't show a thing about Elo.

Elo is purely based on time, because all chess games are based on time. Depth is irrelevant as a search constraint: if we are going to have a match at fixed depth, I am going to disable ALL pruning and reductions and ramp the extensions up to the max. It might take weeks to make a single move, but it will move after completing the depth you specify. And it proves absolutely nothing.

Again, parallel search is about depth. Nothing more. Once you can't squeeze more depth, you might start to burn a few nodes by restricting some of the speculative stuff like pruning and reductions. One certainly won't reach that point with just 4 or 8 or 16 cores...

To show just how silly this argument is getting: if you believe that doing something other than improving depth works with 2 or 4 cores, what about just making a single CPU 2x or 4x faster? Would you STILL want to go for a wider search there? If not, this discussion is completely ridiculous. If so, then why would wider be better on hardware 2x or 4x faster, but not at current CPU speeds?

Do we have any practical data on Komodo's SMP Elo improvement under REAL conditions? I.e., same time control, same opponents, but Komodo uses 1, then 2, then 4, etc. cores while the others stick with 1. That is how I test my parallel search stuff. I have not seen such data to date. If I could run Komodo I would test it in a heartbeat, but I can only run Linux stuff on my cluster and have to compile to run with our lightweight kernels...
The issue is that
1) Komodo seems to have a reasonable (i.e. at least not obviously worse) increase in Elo with 4 and 12 cores.
2) At the same time, as you have seen from Kai's experiment, Komodo's SMP implementation is NOT only about pure depth. It searches a wider tree.

Assuming no experimental artifacts, it seems Komodo's approach uses other things besides depth to improve its strength when it goes parallel.

You are ignoring #1, which remains to be proven with a lower error bar, but it does not look like it will change much. Point #2 seems to be clear.

Miguel
(1) data to support this is where, exactly? NOT in the first post of this thread;
It is in the T&M sub-forum in more than one post.
This could be useful:
http://cegt.siteboard.eu/f6t824-testing ... -4cpu.html
110-120 Elo for 4 CPUs is not bad.

Miguel

(2) I've searched wider trees many times. Unintentionally. And I fixed it. The entire thrust of today's programs is driving the EBF DOWN. Not UP. If UP is better, it was driven down incorrectly.

I'm currently rewriting my reduction code. The first version did better in fixed-depth testing on tactical positions, which I ran just to verify it wasn't obviously broken. It was also 30+ Elo WEAKER in a single-CPU test in real games using a time limit, whereas at fixed depth it would have been STRONGER, because it was (incorrectly) not reducing everywhere I intended and was searching a larger tree than the previous version. Fixed depth completely hid this. Normal testing exposed it quickly.

That's the point. Offer some data that supports this stuff. The original data in this thread doesn't support anything other than that Komodo searches a larger tree in parallel mode than in sequential mode. Larger than what is explained by the usual parallel search overhead everyone knows about. Whether this is good or bad is currently unknown, with no data to help draw conclusions. However, experience shows that as the tree grows, due solely to parallel search, Elo DROPS. That is not to say the parallel speedup gives nothing, but if the tree is 25% bigger, that is an Elo gain you will never get from the parallel search, since that 25% is overhead compared to the sequential search.
I couldn't interpret that. I don't know the internal differences between Komodo CCT and Komodo 5.1, or Komodo 5.0 and Komodo 5.1, for example. The 1 CPU tests are NOT with Komodo 5.1, which makes interpreting the data difficult at best.

Also, the 4 CPU test is based on just a few games. In watching my cluster tests, the Elo changes dramatically over the first 1000 games before it begins to settle down. My just-finished reduction test (phase 2 of N) showed +30 Elo after 500 games, but was even when the test ended at 30K games...
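For reference, the standard way to put an error bar on such a result (a sketch using the usual logistic Elo model and a normal approximation; the W/D/L numbers below are made up):

Code:
import math

def elo_and_error(wins, draws, losses):
    # Elo difference and ~95% error margin from a single match result.
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n                    # score fraction
    elo = -400.0 * math.log10(1.0 / s - 1.0)
    var = (wins + 0.25 * draws) / n - s * s         # per-game score variance
    se = math.sqrt(var / n)                         # standard error of the mean score
    margin = 1.96 * se * 400.0 / (math.log(10) * s * (1.0 - s))
    return round(elo, 1), round(margin, 1)

print(elo_and_error(60, 10, 30))      # 100 games:  (107.5, ~67)
print(elo_and_error(600, 100, 300))   # 1000 games: (107.5, ~21)

The same score fraction gives the same Elo, but the error bar only shrinks with the square root of the number of games, which is why a few hundred games can swing so much.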

Of course, any data is better than what this thread started with....
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Peculiarity of Komodo 5.1MP

Post by bob »

Don wrote:
bob wrote:
Laskos wrote:
bob wrote:
The "results observed" are meaningless. Completely meaningless. I run this very test from time to time, but only to prove I have not broken my parallel search. If the parallel fixed-depth program plays worse, something's wrong. I've never had a case where it played better, taking error bar into consideration. If It did show an improvement, I would instantly run a timed match which would certainly show an elo drop, rather than gain.
The Elo drop is against the optimally implemented parallel search on the same 4 threads. But overall, the strength on 4 threads, with this non-optimal SMP, could still be higher than on 1 thread. You seem to be confused. Are you denying the fact that an SMP implementation which doesn't change the depth can gain Elo points going from 1 to 4 threads? This is trivial, but you seem not to have read this thread.
I'm not confused about anything at all here. Re-read the FIRST post, which is what I responded to. This statement, specifically:
As can be seen, Houdini and Stockfish are equal within error margins at constant depth on 4 and 1 threads. However, Komodo 5.1MP shows a +80 point increase for 4 threads compared to 1 thread at fixed depth 11. So time to depth is an incorrect way of calculating Komodo's MP efficiency. It seems that it increases the width of the tree as much as it increases the depth with the number of threads.
To answer your question, NO, I do not believe it possible to produce more Elo by tinkering with search width or anything else, as opposed to driving the search deeper with a traditional parallel search. If this were true, one could do the SAME thing by multiplexing a parallel search on a single CPU.
THIS is the definitive issue and where we disagree. Everything else is just a lot of noise.

What you are not considering is that multiplexing a parallel search onto a single CPU is easy (just use threads) and relatively efficient, but you cannot go the other way without massive losses. In other words you cannot easily and efficiently build a system that takes 64 processors and allows you to write single processor programs that run 64 times faster. It's a one way trip.
No disagreement ONCE we reach that point. But here's the KEY issue. Once you give up on increasing the depth, you have given up on the PRIMARY gain one gets from parallel search. Yes, you might then resort to wider to get something rather than nothing. But I don't believe this point is reached with just 4 cores. Or even 8 or 16. 64 is a different issue, to be sure, but we have not been discussing massive systems here, just 4 cores. I don't believe there is any greater gain possible than to search deeper, otherwise we are admitting that our basic search really has issues with selectivity. There's a lot of room for improvement there, if wider works better.


So you have to write a parallel program differently. For chess it's impossible to take advantage of all the processors you have - so it's reasonable to explore better ways to do that.
I've run on a 64-core Itanium box, as well as on a 32-core Cray back in 1994 (T932). Those still produce speedups in a traditional parallel search. Not optimal of course, but going from 32 to 64 is not a "zero" either.



I think what you are basically saying is that you should always use the same exact algorithm for ANY application whether you must run it on multi-core or single core - and that is just plain wrong. For multi-core algorithms you pick and choose algorithms that lend themselves better to parallelism.
No, I am not saying that, and never have. But what I have said is that until the primary benefit (deeper search) runs dry, it makes no sense to decrease search efficiency by throttling back reductions or pruning. What does reducing your LMR reduction numbers do to the sequential search? It hurts me. If you have extra cores to absorb the extra overhead that produces, and IF you are getting nothing from your normal parallel search when you ramp up the threads, then I agree this would be of benefit. But are you REALLY saying that you get very little going from 1 to 4 cores? I get a ply and a half or so. That's more than I would get by just throttling the reductions and pruning back and absorbing the increased cost with some of the cores.

At MIT when testing cilk they wrote some sort routines and compared performance and found that best serial sort was not best for a parallel program. I don't remember the details but they picked a sort that was not ideal for a single processor machine but was almost as good and far better for SMP. I think this is a really simple concept and I do not see why you cannot understand it.
I buy the concept. I just have great difficulty with the requisite condition that going from 1 to 4 cores is so bad in terms of traditional parallel speedup that you gain more by relaxing the pruning rules to search a bigger tree, and absorb that cost with the additional cores, rather than using them to go deeper.

Again, the discussion here has been about 4 cores, not 64. That's a different discussion, completely.
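For readers not familiar with LMR: the "reduction numbers" typically come from a log-log table along these lines (a generic illustration, not Crafty's or Komodo's actual formula), and scaling them back is exactly what makes the tree wider at a given nominal depth:

Code:
import math

def lmr_reduction(depth, move_number, scale=1.0):
    # Generic log-log LMR reduction; scale < 1.0 means "throttle the reductions back".
    if depth < 3 or move_number < 4:
        return 0
    return int(scale * (0.5 + math.log(depth) * math.log(move_number) / 2.25))

# Same nominal depth, late move number 20: smaller reductions -> more nodes -> a wider tree
for d in (8, 16, 24):
    print(d, lmr_reduction(d, 20), lmr_reduction(d, 20, scale=0.7))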


Also, in the statement I quoted, the last sentence is nonsense, because the data does NOT say a single thing about Komodo's parallel search. Absolutely nothing. EXCEPT that the parallel search is losing significant efficiency by searching extra nodes.

So, re-read the quote above. Look at the data that was given. And think about it for a bit. All that test shows is a PROBLEM in the Komodo search, not some clever Elo gain. One does NOT want to search wider in what should already be an optimally-tuned search.

Yes, Komodo is likely stronger on 4 cores than on 1. But that is a direct result of going deeper. Going wider is preventing it from going still deeper and getting a bigger Elo gain...

I have no idea why this thread has gone into the land of theoretical irrationality. But it has. The question was not "Is it remotely possible one might use part of the parallel search to do something other than going deeper?" (answer = highly unlikely but remotely possible.) The question is, quite simply, "does the data suggest this has happened?" The answer is a simple and resounding "No it has not." If my program produced such data, I would be busy debugging.
Joerg Oster
Posts: 982
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany
Full name: Jörg Oster

Re: Peculiarity of Komodo 5.1MP

Post by Joerg Oster »

bob wrote:
Joerg Oster wrote:
bob wrote:
Joerg Oster wrote: What settings for 'Min Split Depth' did you use for Houdini and Stockfish? Same for both?

Depending on the setting you compared a 1-core to an almost 1-core engine. More or less.

Interestingly enough, Komodo 5.1MP doesn't have such a parameter ... :)
Note that this is a TINY tuning parameter. It might take at least tens of thousands of games to measure the change after altering min split depth, unless you go way too low. Within +/- 2 of the default for any program, there is little gain/loss.
Yes, in general you are right.
But if you run fixed-depth matches, it may have a big influence... Let's say I set it to 10 and run some games at a fixed depth of 11. I would not expect a big difference between the 1-core and 4-core versions. Right?
I would not expect any significant difference between 1 and 4 cores if searching to the same depth, period...
Yes, you're right. I did a small test with Stockfish, 1 and 4 cores up to depth 18. No significant difference. :(
How disappointing! Three extra cores, and all you get is ... NOTHING!
Do you really think that's the best way of using multiple cores?

Again, wouldn't it be better not to gain quantity (speedup) only, but also quality?
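On the 'Min Split Depth' point above, a rough uniform-tree estimate (the EBF and the split rule are illustrative assumptions, not any engine's actual code) shows why a setting of 10 at fixed depth 11 makes the 4-core engine behave almost like the 1-core one:

Code:
def fraction_splittable(depth, min_split_depth, ebf=2.0):
    # Fraction of nodes whose remaining depth still allows a split,
    # in a uniform tree with the given effective branching factor.
    splittable = sum(ebf ** k for k in range(depth - min_split_depth + 1))
    total = sum(ebf ** k for k in range(depth + 1))
    return splittable / total

print(fraction_splittable(11, 10))   # ~0.0007: almost nothing can be split
print(fraction_splittable(11, 4))    # ~0.06 with a more typical setting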
Jörg Oster
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Peculiarity of Komodo 5.1MP

Post by Don »

bob wrote:
Don wrote:
bob wrote:
Laskos wrote:
bob wrote:
The "results observed" are meaningless. Completely meaningless. I run this very test from time to time, but only to prove I have not broken my parallel search. If the parallel fixed-depth program plays worse, something's wrong. I've never had a case where it played better, taking error bar into consideration. If It did show an improvement, I would instantly run a timed match which would certainly show an elo drop, rather than gain.
The Elo drop is against the optimally implemented parallel search on the same 4 threads. But overall, the strength on 4 threads, with this non-optimal SMP, could still be higher than on 1 thread. You seem to be confused. Are you denying the fact that an SMP implementation which doesn't change the depth can gain Elo points going from 1 to 4 threads? This is trivial, but you seem not to have read this thread.
I'm not confused about anything at all here. Re-read the FIRST post, which is what I responded to. This statement, specifically:
As can be seen, Houdini and Stockfish are equal within error margins at constant depth on 4 and 1 threads. However, Komodo 5.1MP shows a +80 point increase for 4 threads compared to 1 thread at fixed depth 11. So time to depth is an incorrect way of calculating Komodo's MP efficiency. It seems that it increases the width of the tree as much as it increases the depth with the number of threads.
To answer your question, NO, I do not believe it possible to produce more Elo by tinkering with search width or anything else, as opposed to driving the search deeper with a traditional parallel search. If this were true, one could do the SAME thing by multiplexing a parallel search on a single CPU.
THIS is the definitive issue and where we disagree. Everything else is just a lot of noise.

What you are not considering is that multiplexing a parallel search onto a single CPU is easy (just use threads) and relatively efficient, but you cannot go the other way without massive losses. In other words you cannot easily and efficiently build a system that takes 64 processors and allows you to write single processor programs that run 64 times faster. It's a one way trip.
No disagreement ONCE we reach that point. But here's the KEY issue. Once you give up on increasing the depth, you have given up on the PRIMARY gain one gets from parallel search. Yes, you might then resort to wider to get something rather than nothing. But I don't believe this point is reached with just 4 cores. Or even 8 or 16. 64 is a different issue, to be sure, but we have not been discussing massive systems here, just 4 cores. I don't believe there is any greater gain possible than to search deeper, otherwise we are admitting that our basic search really has issues with selectivity. There's a lot of room for improvement there, if wider works better.
I would agree with that if it were not for one thing. Komodo is relatively flexible and robust with respect to how aggressive we are with selectivity and LMR - so we can slice it in either direction without having more than a minor impact on the ELO. As I said earlier, we could search a ply deeper by being more selective, or a ply less while being more conservative, and the impact on the ELO would be minor.

So there is a lot of room for play here, a "lot of knobs to turn" like the old television sets. So what you want is to find a happy combination that works better for MP than it hurts SP.


So you have to write a parallel program differently. For chess it's impossible to take advantage of all the processors you have - so it's reasonable to explore better ways to do that.
I've run on a 64-core Itanium box, as well as on a 32-core Cray back in 1994 (T932). Those still produce speedups in a traditional parallel search. Not optimal of course, but going from 32 to 64 is not a "zero" either.



I think what you are basically saying is that you should always use the same exact algorithm for ANY application whether you must run it on multi-core or single core - and that is just plain wrong. For multi-core algorithms you pick and choose algorithms that lend themselves better to parallelism.
No, I am not saying that, and never have. But what I have said is that until the primary benefit (deeper search) runs dry, it makes no sense to decrease search efficiency by throttling back reductions or pruning. What does reducing your LMR reduction numbers do to the sequential search? It hurts me. If you have extra cores to absorb the extra overhead that produces, and IF you are getting nothing from your normal parallel search when you ramp up the threads, then I agree this would be of benefit. But are you REALLY saying that you get very little going from 1 to 4 cores? I get a ply and a half or so. That's more than I would get by just throttling the reductions and pruning back and absorbing the increased cost with some of the cores.

At MIT, when testing Cilk, they wrote some sorting routines, compared performance, and found that the best serial sort was not the best for a parallel program. I don't remember the details, but they picked a sort that was not ideal for a single-processor machine but was almost as good and far better for SMP. I think this is a really simple concept and I do not see why you cannot understand it.
I buy the concept. I just have great difficulty with the requisite condition that going from 1 to 4 cores is so bad in terms of traditional parallel speedup that you gain more by relaxing the pruning rules to search a bigger tree, and absorb that cost with the additional cores, rather than using them to go deeper.

Again, the discussion here has been about 4 cores, not 64. That's a different discussion, completely.


Also, in the statement I quoted, the last sentence is nonsense, because the data does NOT say a single thing about Komodo's parallel search. Absolutely nothing. EXCEPT that the parallel search is losing significant efficiency by searching extra nodes.
I think this works for Komodo for the reasons I have stated: Komodo is robust with respect to how we make our tradeoffs. So having a wider search is just not a big deal despite the slowdown - it's just a different tradeoff that is "almost" the same. So all that is required is that it be at least slightly better for MP to make it worth doing.

So, re-read the quote above. Look at the data that was given. And think about it for a bit. All that test shows is a PROBLEM in the Komodo search, not some clever Elo gain. One does NOT want to search wider in what should already be an optimally-tuned search.

I don't know if you realize what you are saying. Komodo is an incredibly strong program and you are implying that the search is all screwed up. I think almost anyone would want to have a program as screwed up as ours. Do you actually believe we should get 50 more ELO just from a few search tweaks on the single core program? No, I think you have to see that something here doesn't add up.

Yes, Komodo is likely stronger on 4 cores than on 1. But that is a direct result of going deeper. Going wider is preventing it from going still deeper and getting a bigger Elo gain...

I have no idea why this thread has gone into the land of theoretical irrationality. But it has. The question was not "Is it remotely possible one might use part of the parallel search to do something other than going deeper?" (answer = highly unlikely but remotely possible.) The question is, quite simply, "does the data suggest this has happened?" The answer is a simple and resounding "No it has not." If my program produced such data, I would be busy debugging.
So your statement is that Komodo has serious problems with the search when running on 1 processor. The nice ELO gain on 4 cores is because Komodo's single core search is seriously crippled so in actuality this gain is bogus. So we potentially could improve the program by over 50 ELO by fixing this. If only that were true I would be a happy man because I would love to squeeze 50 ELO out of Komodo by making a few simple adjustments.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Peculiarity of Komodo 5.1MP

Post by Laskos »

bob wrote:
Laskos wrote:
bob wrote:
The "results observed" are meaningless. Completely meaningless. I run this very test from time to time, but only to prove I have not broken my parallel search. If the parallel fixed-depth program plays worse, something's wrong. I've never had a case where it played better, taking error bar into consideration. If It did show an improvement, I would instantly run a timed match which would certainly show an elo drop, rather than gain.
The Elo drop is against the optimally implemented parallel search on the same 4 threads. But overall, the strength on 4 threads, with this non-optimal SMP, could still be higher than on 1 thread. You seem to be confused. Are you denying the fact that an SMP implementation which doesn't change the depth can gain Elo points going from 1 to 4 threads? This is trivial, but you seem not to have read this thread.
I'm not confused about anything at all here. Re-read the FIRST post, which is what I responded to. This statement, specifically:
As can be seen, Houdini and Stockfish are equal within error margins at constant depth on 4 and 1 threads. However, Komodo 5.1MP shows a +80 point increase for 4 threads compared to 1 thread at fixed depth 11. So time to depth is an incorrect way of calculating Komodo's MP efficiency. It seems that it increases the width of the tree as much as it increases the depth with the number of threads.
To answer your question, NO, I do not believe it possible to produce more Elo by tinkering with search width or anything else, as opposed to driving the search deeper with a traditional parallel search. If this were true, one could do the SAME thing by multiplexing a parallel search on a single CPU.
So, it's "NO" that one cannot gain Elos from 1 to 4 threads not changing depth, or it's "NO" that the gain is theoretically sub-optimal? The issue with parallel and sequential search is that sequentially on one core you can do everything you do parallel on 4 cores given 4x time. Not 1x time. 1x time is non-issue.

Also, in the statement I quoted, the last sentence is nonsense, because the data does NOT say a single thing about Komodo's parallel search. Absolutely nothing. EXCEPT that the parallel search is losing significant efficiency by searching extra nodes.

So, re-read the quote above. Look at the data that was given. And think about it for a bit. All that test shows is a PROBLEM in the Komodo search, not some clever Elo gain. One does NOT want to search wider in what should already be an optimally-tuned search.

Yes, Komodo is likely stronger on 4 cores than on 1. But that is a direct result of going deeper.
So, you deny again that going from 1 core to 4 cores, Komodo is stronger not only by going deeper, but by pruning less (or going wider in a general sense)? Again, you seem to be confused by apparent "non-optimality" of this procedure. But as Don pointed out, Komodo can be modified to be a ply less or more without changing strength, so this "non-optimality" is only theoretical.
Going wider is preventing it from going still deeper and getting a bigger Elo gain...

I have no idea why this thread has gone into the land of theoretical irrationality. But it has. The question was not "Is it remotely possible one might use part of the parallel search to do something other than going deeper?" (answer = highly unlikely but remotely possible.) The question is, quite simply, "does the data suggest this has happened?" The answer is a simple and resounding "No it has not." If my program produced such data, I would be busy debugging.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Peculiarity of Komodo 5.1MP

Post by bob »

Don wrote:
bob wrote:
Don wrote:
bob wrote:
Laskos wrote:
bob wrote:
The "results observed" are meaningless. Completely meaningless. I run this very test from time to time, but only to prove I have not broken my parallel search. If the parallel fixed-depth program plays worse, something's wrong. I've never had a case where it played better, taking error bar into consideration. If It did show an improvement, I would instantly run a timed match which would certainly show an elo drop, rather than gain.
The Elo drop is against the optimally implemented parallel search on the same 4 threads. But overall, the strength on 4 threads, with this non-optimal SMP, could still be higher than on 1 thread. You seem to be confused. Are you denying the fact that an SMP implementation which doesn't change the depth can gain Elo points going from 1 to 4 threads? This is trivial, but you seem not to have read this thread.
I'm not confused about anything at all here. Re-read the FIRST post, which is what I responded to. This statement, specifically:
As can be seen, Houdini and Stockfish are equal within error margins at constant depth on 4 and 1 threads. However, Komodo 5.1MP shows a +80 point increase for 4 threads compared to 1 thread at fixed depth 11. So time to depth is an incorrect way of calculating Komodo's MP efficiency. It seems that it increases the width of the tree as much as it increases the depth with the number of threads.
To answer your question, NO, I do not believe it possible to produce more Elo by tinkering with search width or anything else, as opposed to driving the search deeper with a traditional parallel search. If this were true, one could do the SAME thing by multiplexing a parallel search on a single CPU.
THIS is the definitive issue and where we disagree. Everything else is just a lot of noise.

What you are not considering is that multiplexing a parallel search onto a single CPU is easy (just use threads) and relatively efficient, but you cannot go the other way without massive losses. In other words you cannot easily and efficiently build a system that takes 64 processors and allows you to write single processor programs that run 64 times faster. It's a one way trip.
No disagreement ONCE we reach that point. But here's the KEY issue. Once you give up on increasing the depth, you have given up on the PRIMARY gain one gets from parallel search. Yes, you might then resort to wider to get something rather than nothing. But I don't believe this point is reached with just 4 cores. Or even 8 or 16. 64 is a different issue, to be sure, but we have not been discussing massive systems here, just 4 cores. I don't believe there is any greater gain possible than to search deeper, otherwise we are admitting that our basic search really has issues with selectivity. There's a lot of room for improvement there, if wider works better.
I would agree with that if it were not for one thing. Komodo is relatively flexible and robust with respect to how aggressive we are with selectivity and LMR - so we can slice it in either direction without having more than a minor impact on the ELO. As I said earlier, we could search a ply deeper by being more selective, or a ply less while being more conservative, and the impact on the ELO would be minor.

So there is a lot of room for play here, a "lot of knobs to turn" like the old television sets. So what you want is to find a happy combination that works better for MP than it hurts SP.


So you have to write a parallel program differently. For chess it's impossible to take advantage of all the processors you have - so it's reasonable to explore better ways to do that.
I've run on a 64-core Itanium box, as well as on a 32-core Cray back in 1994 (T932). Those still produce speedups in a traditional parallel search. Not optimal of course, but going from 32 to 64 is not a "zero" either.



I think what you are basically saying is that you should always use the same exact algorithm for ANY application whether you must run it on multi-core or single core - and that is just plain wrong. For multi-core algorithms you pick and choose algorithms that lend themselves better to parallelism.
No, I am not saying that, and never have. But what I have said is that until the primary benefit (deeper search) runs dry, it makes no sense to decrease search efficiency by throttling back reductions or pruning. What does reducing your LMR reduction numbers do to the sequential search? It hurts me. If you have extra cores to absorb the extra overhead that produces, and IF you are getting nothing from your normal parallel search when you ramp up the threads, then I agree this would be of benefit. But are you REALLY saying that you get very little going from 1 to 4 cores? I get a ply and a half or so. That's more than I would get by just throttling the reductions and pruning back and absorbing the increased cost with some of the cores.

At MIT, when testing Cilk, they wrote some sorting routines, compared performance, and found that the best serial sort was not the best for a parallel program. I don't remember the details, but they picked a sort that was not ideal for a single-processor machine but was almost as good and far better for SMP. I think this is a really simple concept and I do not see why you cannot understand it.
I buy the concept. I just have great difficulty with the requisite condition that going from 1 to 4 cores is so bad in terms of traditional parallel speedup that you gain more by relaxing the pruning rules to search a bigger tree, and absorb that cost with the additional cores, rather than using them to go deeper.

Again, the discussion here has been about 4 cores, not 64. That's a different discussion, completely.


Also, in the statement I quoted, the last sentence is nonsense, because the data does NOT say a single thing about Komodo's parallel search. Absolutely nothing. EXCEPT that the parallel search is losing significant efficiency by searching extra nodes.
I think this works for Komodo for the reasons I have stated: Komodo is robust with respect to how we make our tradeoffs. So having a wider search is just not a big deal despite the slowdown - it's just a different tradeoff that is "almost" the same. So all that is required is that it be at least slightly better for MP to make it worth doing.

So, re-read the quote above. Look at the data that was given. And think about it for a bit. All that test shows is a PROBLEM in the Komodo search, not some clever Elo gain. One does NOT want to search wider in what should already be an optimally-tuned search.

I don't know if you realize what you are saying. Komodo is an incredibly strong program and you are implying that the search is all screwed up. I think almost anyone would want to have a program as screwed up as ours. Do you actually believe we should get 50 more ELO just from a few search tweaks on the single core program? No, I think you have to see that something here doesn't add up.

Yes, Komodo is likely stronger on 4 cores than on 1. But that is a direct result of going deeper. Going wider is preventing it from going still deeper and getting a bigger Elo gain...

I have no idea why this thread has gone into the land of theoretical irrationality. But it has. The question was not "Is it remotely possible one might use part of the parallel search to do something other than going deeper?" (answer = highly unlikely but remotely possible.) The question is, quite simply, "does the data suggest this has happened?" The answer is a simple and resounding "No it has not." If my program produced such data, I would be busy debugging.
So your statement is that Komodo has serious problems with the search when running on 1 processor. The nice ELO gain on 4 cores is because Komodo's single core search is seriously crippled so in actuality this gain is bogus. So we potentially could improve the program by over 50 ELO by fixing this. If only that were true I would be a happy man because I would love to squeeze 50 ELO out of Komodo by making a few simple adjustments.
What I have said, repeatedly, is that going from 1->4 cores, the most gain is in depth. Unless you REALLY believe your search is tuned so aggressively it is making lots of pruning errors, and you want to use the extra horsepower to hide the pruning errors with a wider search, as opposed to going deeper to expose more tactics.

For the 1->4 change, I simply don't believe that happens, unless the parallel search really does have some significant performance issues...

Hardly anyone other than myself actually posts parallel performance data, so I can't speak to this issue with any sort of data that supports any conclusions. But if giving up 1.5 plies with parallel search, and perhaps just sticking with a .5 ply gain, really does make your program stronger, that is certainly one of the biggest outlier data points I have seen in working on parallel search for 35+ years.

If you get that kind of improvement from a wider tree, why not search like that in sequential mode as well? The SAME benefit should accrue...
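The depth/width trade in that argument can be put in plies with a little log arithmetic (a sketch; the EBF, speedup, and widening factor below are all illustrative assumptions):

Code:
import math

def plies_bought(speedup, ebf=2.0):
    # Extra depth an effective speedup buys, for a uniform effective branching factor.
    return math.log(speedup, ebf)

def plies_spent_widening(width_factor, ebf=2.0):
    # Depth given up if the same nodes are instead spent making the tree wider.
    return math.log(width_factor, ebf)

# e.g. a ~3x effective 4-core speedup buys ~1.6 plies at EBF 2;
# widening the tree by 2x spends ~1.0 of those plies.
print(round(plies_bought(3.0), 2), round(plies_spent_widening(2.0), 2))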
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Peculiarity of Komodo 5.1MP

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
The "results observed" are meaningless. Completely meaningless. I run this very test from time to time, but only to prove I have not broken my parallel search. If the parallel fixed-depth program plays worse, something's wrong. I've never had a case where it played better, taking error bar into consideration. If It did show an improvement, I would instantly run a timed match which would certainly show an elo drop, rather than gain.
The Elo drop is against the optimally implemented parallel search on the same 4 threads. But overall, the strength on 4 threads, with this non-optimal SMP, could still be higher than on 1 thread. You seem to be confused. Are you denying the fact that an SMP implementation which doesn't change the depth can gain Elo points going from 1 to 4 threads? This is trivial, but you seem not to have read this thread.
I'm not confused about anything at all here. Re-read the FIRST post, which is what I responded to. This statement, specifically:
As can be seen, Houdini and Stockfish are equal within error margins at constant depth on 4 and 1 threads. However, Komodo 5.1MP shows a +80 point increase for 4 threads compared to 1 thread at fixed depth 11. So time to depth is an incorrect way of calculating Komodo's MP efficiency. It seems that it increases the width of the tree as much as it increases the depth with the number of threads.
To answer your question, NO, I do not believe it possible to produce more Elo by tinkering with search width or anything else, as opposed to driving the search deeper with a traditional parallel search. If this were true, one could do the SAME thing by multiplexing a parallel search on a single CPU.
So, it's "NO" that one cannot gain Elos from 1 to 4 threads not changing depth, or it's "NO" that the gain is theoretically sub-optimal? The issue with parallel and sequential search is that sequentially on one core you can do everything you do parallel on 4 cores given 4x time. Not 1x time. 1x time is non-issue.

Also, in the statement I quoted, the last sentence is nonsense, because the data does NOT say a single thing about Komodo's parallel search. Absolutely nothing. EXCEPT that the parallel search is losing significant efficiency by searching extra nodes.

So, re-read the quote above. Look at the data that was given. And think about it for a bit. All that test shows is a PROBLEM in the Komodo search, not some clever Elo gain. One does NOT want to search wider in what should already be an optimally-tuned search.

Yes, Komodo is likely stronger on 4 cores than on 1. But that is a direct result of going deeper.
So, you deny again that going from 1 core to 4 cores, Komodo is stronger not only by going deeper, but by pruning less (or going wider in a general sense)? Again, you seem to be confused by apparent "non-optimality" of this procedure. But as Don pointed out, Komodo can be modified to be a ply less or more without changing strength, so this "non-optimality" is only theoretical.
Going wider is preventing it from going still deeper and getting a bigger Elo gain...

I have no idea why this thread has gone into the land of theoretical irrationality. But it has. The question was not "Is it remotely possible one might use part of the parallel search to do something other than going deeper?" (answer = highly unlikely but remotely possible.) The question is, quite simply, "does the data suggest this has happened?" The answer is a simple and resounding "No it has not." If my program produced such data, I would be busy debugging.
I have not denied anything.

I simply said (a) the data at the beginning of this thread has absolutely no useful information content in it, and says nothing whatsoever about parallel performance of any kind. Suppose that Elo gain came at the expense of running 8x slower? The given test would not show that. So concluding anything at all about Elo given the way that test was run is completely pointless. (b) for modest numbers of processors, the most significant gain is going to be depth-related...

I'd hope that is clear enough. If someone gets more from widening (as opposed to deepening) the search, that suggests an issue. Nothing more, nothing less. But it is an indicator of an issue.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Peculiarity of Komodo 5.1MP

Post by bob »

Joerg Oster wrote:
bob wrote:
Joerg Oster wrote:
bob wrote:
Joerg Oster wrote: What settings for 'Min Split Depth' did you use for Houdini and Stockfish? Same for both?

Depending on the setting you compared a 1-core to an almost 1-core engine. More or less.

Interestingly enough, Komodo 5.1MP doesn't have such a parameter ... :)
Note that this is a TINY tuning parameter. It might take at least tens of thousands of games to measure the change after altering min split depth, unless you go way too low. Within +/- 2 of the default for any program, there is little gain/loss.
Yes, in general you are right.
But if you run fixed-depth matches, it may have a big influence... Let's say I set it to 10 and run some games at a fixed depth of 11. I would not expect a big difference between the 1-core and 4-core versions. Right?
I would not expect any significant difference between 1 and 4 cores if searching to the same depth, period...
Yes, you're right. I did a small test with Stockfish, 1 and 4 cores up to depth 18. No significant difference. :(
How disappointing! Three extra cores, and all you get is ... NOTHING!
Do you really think that's the best way of using multiple cores?

Again, wouldn't it be better not to gain quantity (speedup) only, but also quality?
I don't understand your question. More cores gain more depth. Just like going to a faster processor gains more depth. And depth certainly improves quality...
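For completeness, the time-to-depth measurement everyone is arguing about is usually computed like this (a sketch; the timings below are placeholders, and a geometric mean is used so that no single position dominates):

Code:
import math

def time_to_depth_speedup(times_1cpu, times_ncpu):
    # Geometric mean of per-position ratios: how much faster the N-core
    # search reaches the same nominal depth as the 1-core search.
    ratios = [t1 / tn for t1, tn in zip(times_1cpu, times_ncpu)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

t1 = [38.2, 51.0, 44.7, 60.3]        # seconds to a fixed depth, 1 core (placeholders)
t4 = [12.9, 16.5, 15.8, 19.1]        # same positions, 4 cores (placeholders)
print(round(time_to_depth_speedup(t1, t4), 2))   # ~3.0x for these made-up numbers

The thread's disagreement is whether that number alone predicts the Elo gain when an engine also widens its tree as threads are added.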