Peculiarity of Komodo 5.1MP

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Peculiarity of Komodo 5.1MP

Post by bob »

syzygy wrote:
Daniel Shawul wrote: Assume the sequential search is at its best. Give it 3.6x more time (say, the same as the 4-thread search) but change the algorithm so that it can widen its search exactly the same way as the parallel search does.
Prove that you can "change the algorithm" without loss of speed...
We already KNOW he "lost speed". The tree must be bigger to gain an Elo increase. How hard is this to understand?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Peculiarity of Komodo 5.1MP

Post by bob »

Don wrote:
Daniel Shawul wrote:
Some have implied that we search wider because we don't get quite the same increase in depth, and that this is a horrible thing. The intuition is that if it works, you could do it on 1 thread too. However, I do not think that follows. The problem is that 4 threads are just not 4x better than 1 thread, no matter how good your MP implementation is, so it always comes down to how best to utilize what you have. Anything goes, in my opinion, if it gets more out of the extra cores in terms of Elo.
It doesn't matter if 4 threads are not as efficient as a 4x time search. The following statement is provable:
"If your sequential search is as good as it can be, then your parallel implementation that searches wider than sequential search is poorly done".
I will give it a shot as follows:
Assume the sequential search is at its best. Give it 3.6x more time (say, the same as the 4-thread search) but change the algorithm so that it can widen its search exactly the same way as the parallel search does. If searching wider really is better for the parallel search, as claimed, then the wide sequential search should show a similar improvement over its deeper-searching counterpart. So we improved the sequential search that we assumed was un-improvable: a contradiction! The question of difficulty of implementation is irrelevant.
That was a good try but your proof logic is flawed. What is not considered is utilization of resources.

We know that when multiple cores try to cooperate to search a position, you lose in three primary ways with chess programs: synchronization overhead, communication overhead, and redundant calculations. If you can reduce one or more of those things and apply the savings to something else instead, you can get a gain that you would not have had otherwise - even if the gain itself would not ordinarily be the best way to utilize resources in a single-core program. Your proof assumes that it's impossible for that to be the case.

In other words, if you run out of constructive things to do with the extra cores, then you need to start exploiting things that may be less beneficial in ordinary circumstances - but now we have new circumstances. In my "doctor" illustration, you just wouldn't waste a doctor's time in a hospital making phone calls or being a gofer, but in a specialized situation where there were more doctors than you could fully utilize, you would start putting them to work on more mundane tasks.
That's all well and good. But are you now saying that with just 4 cores you have already reached this boundary, and that it is better to waste part of those cores searching a wider tree? I don't believe it for a minute. 4-core speedups are pretty good for any reasonable implementation... Even 8 and 16 are good...
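To put some numbers on the overheads Don lists above (synchronization, communication, redundant work) and on the "pretty good" 4-core speedups bob cites, here is a toy back-of-the-envelope model. It is not code from Komodo, Crafty, or any other engine, and every fraction in it is an invented assumption.

Code:

// Toy model of the three SMP loss sources named above. All fractions are
// assumed for illustration, not measured from any engine.
#include <cstdio>

int main() {
    const double cores      = 4.0;
    const double sync_loss  = 0.05;  // assumed fraction of time spent waiting at split points
    const double comm_loss  = 0.05;  // assumed fraction of time spent sharing hash/killer data
    const double extra_work = 0.30;  // assumed fraction of extra (redundant or widened) nodes

    // Useful node rate scales with the core count minus the two time overheads;
    // the extra nodes inflate the tree that must be searched to reach a given depth.
    double useful_rate = cores * (1.0 - sync_loss - comm_loss);
    double speedup     = useful_rate / (1.0 + extra_work);
    std::printf("time-to-depth speedup ~%.2f on %.0f cores\n", speedup, cores);
    return 0;
}

With these made-up inputs the model lands around 2.8x, in the range the thread keeps circling; the only point is how the three losses eat into the ideal 4x.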
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Peculiarity of Komodo 5.1MP

Post by bob »

Uri Blass wrote:
Daniel Shawul wrote:
Uri Blass wrote:
Daniel Shawul wrote: This is misleading, because Komodo probably compensates for a poor parallel implementation by searching wider; otherwise it shouldn't be gaining any Elo in a fixed-depth test. You should do the test with time and see how much each gains.
"poor parallel implementation?"

The target of parallel implementation is not to get bigger depth but to play better.

If the implementation is good at helping Komodo to play better, then I think that it is wrong to call it poor.

Maybe it is the opposite, and Komodo compensates for poorly pruning lines that it should not prune with a parallel implementation that prevents it from pruning good moves in the relevant lines.
Uri, I don't think you understood the test well. It was a _fixed depth_ test, which is meaningless unless both the parallel and serial versions search a more or less similar tree.
If I decided to use plain alpha-beta for the parallel search while using LMR + null move + other pruning for the sequential search, which do you think would win a fixed-depth test? Get it?
I understood the test well.
Fixed depth is meaningless if the target is to measure a strength improvement, but that was clearly not the target of the test.

The target was simply to compare the behaviour of Komodo with the behaviour of other programs (not to compare strength).


My point was that the conclusion that the parallel implementation is poor is simply wrong.

Note that I did not claim anything based on the specific test.
I believe that Komodo's implementation of parallel search is good, based on what I read about its results against other programs: with many cores (more than 4) it did better against other programs relative to the one-core tests.
This is easy enough to explain or grasp, depending on which side of the discussion one is on. If you make the tree wider, you do better in fixed-depth comparisons, because your wider tree does not hurt your time (it is not being measured) and the larger tree tends to miss less tactically. BUT, that wider tree has a definite cost, and in timed matches that hurts; it does not help. All this test has shown is that Komodo has something odd in its search that changes the shape of the tree. One could, for example, double every search extension in SMP mode. A fixed-depth search would play stronger, even though the program is now much slower to reach that 12-ply depth.

This shows absolutely nothing other than that the parallel search looks at stuff the sequential search does not, and as a general rule, that is detrimental to overall Elo, NOT beneficial.
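The fixed-depth argument above can be reduced to arithmetic. The sketch below is not a claim about Komodo's actual numbers; the branching factor, widening factor, and node budget are assumptions chosen only to show why a wider tree is free in a fixed-depth comparison but costs plies in a timed game.

Code:

// With an effective branching factor b, widening the tree by a factor w
// (w times more nodes to the same depth) costs about log_b(w) plies for a
// fixed node budget. b, w, and the budget are illustrative assumptions.
#include <cmath>
#include <cstdio>

int main() {
    const double b      = 1.9;    // assumed effective branching factor
    const double w      = 2.5;    // assumed node ratio to equal depth (wide vs narrow)
    const double budget = 2.0e8;  // assumed nodes available for one timed move

    double depth_narrow = std::log(budget) / std::log(b);
    double depth_wide   = std::log(budget / w) / std::log(b);
    std::printf("narrow: ~%.1f plies, wide: ~%.1f plies, cost of widening: ~%.2f plies\n",
                depth_narrow, depth_wide, std::log(w) / std::log(b));
    return 0;
}

At fixed depth both versions stop at the same ply, so that roughly 1.4-ply cost never shows up in the measurement; in a timed game it does.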
Uri Blass
Posts: 10407
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Peculiarity of Komodo 5.1MP

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
Daniel Shawul wrote:
Uri Blass wrote:
Daniel Shawul wrote: This is misleading, because Komodo probably compensates for a poor parallel implementation by searching wider; otherwise it shouldn't be gaining any Elo in a fixed-depth test. You should do the test with time and see how much each gains.
"poor parallel implementation?"

The target of parallel implementation is not to get bigger depth but to play better.

If the implementation is good at helping Komodo to play better, then I think that it is wrong to call it poor.

Maybe it is the opposite, and Komodo compensates for poorly pruning lines that it should not prune with a parallel implementation that prevents it from pruning good moves in the relevant lines.
Uri, I don't think you understood the test well. It was a _fixed depth_ test, which is meaningless unless both the parallel and serial versions search a more or less similar tree.
If I decided to use plain alpha-beta for the parallel search while using LMR + null move + other pruning for the sequential search, which do you think would win a fixed-depth test? Get it?
I understood the test well.
Fixed depth is meaningless if the target is to measure a strength improvement, but that was clearly not the target of the test.

The target was simply to compare the behaviour of Komodo with the behaviour of other programs (not to compare strength).


My point was that the conclusion that the parallel implementation is poor is simply wrong.

Note that I did not claim anything based on the specific test.
I believe that Komodo's implementation of parallel search is good, based on what I read about its results against other programs: with many cores (more than 4) it did better against other programs relative to the one-core tests.
This is easy enough to explain or grasp, depending on which side of the discussion one is on. If you make the tree wider, you do better in fixed-depth comparisons, because your wider tree does not hurt your time (it is not being measured) and the larger tree tends to miss less tactically. BUT, that wider tree has a definite cost, and in timed matches that hurts; it does not help. All this test has shown is that Komodo has something odd in its search that changes the shape of the tree. One could, for example, double every search extension in SMP mode. A fixed-depth search would play stronger, even though the program is now much slower to reach that 12-ply depth.

This shows absolutely nothing other than that the parallel search looks at stuff the sequential search does not, and as a general rule, that is detrimental to overall Elo, NOT beneficial.

I agree only with the first part of the last sentence.

Fixed depth shows nothing about the question of whether it is good to have the parallel search look at a wider tree.

I disagree with the general rule that it is detrimental to overall Elo.
It may be detrimental or it may be beneficial, and you need tests that are not at fixed depth in order to find out.

I think that in the specific case of Komodo it is beneficial, and not because of the fixed-depth results. The reason for my opinion is that I read that Komodo performed better with 12 cores relative to 1 core against other programs when the test was at fixed time and not fixed depth.
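Uri's point that only tests not at fixed depth can settle "detrimental vs beneficial" comes down to playing fixed-time matches and converting the score into Elo with an error bar. A minimal version of that conversion is sketched below; the game counts are made-up placeholders, not results from any real test.

Code:

// Convert a match score into an Elo estimate with a 95% confidence interval.
// The win/draw/loss counts below are hypothetical placeholders.
#include <cmath>
#include <cstdio>

int main() {
    double wins = 420, draws = 380, losses = 200;   // hypothetical results
    double n = wins + draws + losses;
    double score = (wins + 0.5 * draws) / n;

    double elo = -400.0 * std::log10(1.0 / score - 1.0);
    // Standard error of the score, then map the interval endpoints to Elo.
    double var = (wins   * std::pow(1.0 - score, 2) +
                  draws  * std::pow(0.5 - score, 2) +
                  losses * std::pow(0.0 - score, 2)) / n;
    double se = std::sqrt(var / n);
    double elo_lo = -400.0 * std::log10(1.0 / (score - 1.96 * se) - 1.0);
    double elo_hi = -400.0 * std::log10(1.0 / (score + 1.96 * se) - 1.0);
    std::printf("score %.3f -> %+.1f Elo (95%% CI %+.1f .. %+.1f)\n",
                score, elo, elo_lo, elo_hi);
    return 0;
}

Whether the widening helped can then be judged by comparing that interval with what the time-to-depth factor alone would predict.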
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Peculiarity of Komodo 5.1MP

Post by Laskos »

bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
Daniel Shawul wrote: This is misleading, because Komodo probably compensates for a poor parallel implementation by searching wider; otherwise it shouldn't be gaining any Elo in a fixed-depth test. You should do the test with time and see how much each gains.
I already wrote that it probably searches wider as the number of threads grows. But it's different from other top engines, which at a given depth are pretty much the same strength with different numbers of threads. And the rule (as stated by some) that time-to-depth determines MP efficiency does not apply to Komodo.
I will leave it to test groups and individuals to test with time; there are plenty of volunteers. I was just curious about this aspect.
Time-to-depth IS the correct measure. If Komodo searches wider, then it searches less efficiently. One wants to measure the COMPLETE SMP implementation, not just how the tree is split.

I cannot imagine wanting to search wider in a parallel search. If it is a good idea, why not search wider in the sequential implementation as well?
Time-to-depth won't work for Komodo. It benefits from a wider tree in the parallel search to the same depth, and time-to-depth would understate its SMP performance Elo-wise. Comparing the depths of 1-threaded Komodo with the depths of 4-threaded Komodo is meaningless given the different tree shapes. Elo is all that matters in SMP efficiency. I don't know why it has a wider tree in the parallel search; the fact is that 4-thread vs 1-thread Komodo benefits in two ways, deeper search and a wider tree, both contributing substantially. The total contribution is probably comparable to other engines going from 1 to 4 threads, only Komodo has this peculiar behavior.
Your statement is simply wrong. It doesn't "benefit" from a wider tree. It "suffers from" a wider tree.
It benefits from a wider tree and suffers from larger time to depth. What's so hard to understand?

Otherwise a wider tree would be better in a non-parallel search as well.

Note that when you see an Elo improvement going from 1 to 4 in fixed depth, it is imaginary. Just compare the times. The 4-CPU test is taking longer than it should, getting a time handicap that it would not get in real chess.
All you say is trivial, and was mentioned in my first post. Komodo has a lower time-to-depth speedup going from 1 thread to 4 threads, say 2.0 instead of 3.2, but an additional 80 Elo points of benefit from the wider tree. If you measure only time-to-depth, get 2.0 instead of 3.2, and say it's scaling badly, you are wrong, because you do not include the additional 80 points from the wider tree.

Better to do a few searches with st=10, using both 1 and 4 CPUs, and see what happens there. I'd suspect, based on your results, that the 4-CPU version is not going to perform as well as one would expect, because part of the extra processing power is wasted searching extra nodes...
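Laskos's accounting above (a 2.0x time-to-depth factor plus roughly 80 Elo from the wider tree) can be folded into a single effective multiplier, but only by assuming a figure for Elo per doubling of thinking time. That constant is the shaky part; the 60 used below is a common rule of thumb, not a measured Komodo value.

Code:

// Fold a time-to-depth factor and an Elo gain from widening into one
// "effective speedup". The 1 -> 4 thread inputs are the examples from the
// post above; the Elo-per-doubling constant is an assumption.
#include <cmath>
#include <cstdio>

int main() {
    const double ttd_factor       = 2.0;   // time-to-depth speedup, 1 -> 4 threads (example above)
    const double widening_elo     = 80.0;  // Elo attributed to the wider tree (example above)
    const double elo_per_doubling = 60.0;  // assumed Elo gained per doubling of time

    double widening_as_time = std::pow(2.0, widening_elo / elo_per_doubling);
    double effective        = ttd_factor * widening_as_time;
    // With these rough round numbers the product can even exceed the 4x ideal,
    // which says more about the inputs than about any engine.
    std::printf("widening worth ~%.2fx time, effective SMP factor ~%.2fx\n",
                widening_as_time, effective);
    return 0;
}

The same conversion run in reverse is what bob's position amounts to: measure only the 2.0x and you implicitly assign the widening zero Elo.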
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Peculiarity of Komodo 5.1MP

Post by Don »

bob wrote:
Don wrote:
Daniel Shawul wrote:
Some have implied that we search wider because we don't get quite the same increase in depth, and that this is a horrible thing. The intuition is that if it works, you could do it on 1 thread too. However, I do not think that follows. The problem is that 4 threads are just not 4x better than 1 thread, no matter how good your MP implementation is, so it always comes down to how best to utilize what you have. Anything goes, in my opinion, if it gets more out of the extra cores in terms of Elo.
It doesn't matter if 4 threads are not as efficient as a 4x time search. The following statement is provable:
"If your sequential search is as good as it can be, then your parallel implementation that searches wider than sequential search is poorly done".
I will give it a shot as follows:
Assume the sequential search is at its best. Give it 3.6x more time (say, the same as the 4-thread search) but change the algorithm so that it can widen its search exactly the same way as the parallel search does. If searching wider really is better for the parallel search, as claimed, then the wide sequential search should show a similar improvement over its deeper-searching counterpart. So we improved the sequential search that we assumed was un-improvable: a contradiction! The question of difficulty of implementation is irrelevant.
That was a good try but your proof logic is flawed. What is not considered is utilization of resources.

We know that when multiple cores try to cooperate to search a position, you lose in three primary ways with chess programs: synchronization overhead, communication overhead, and redundant calculations. If you can reduce one or more of those things and apply the savings to something else instead, you can get a gain that you would not have had otherwise - even if the gain itself would not ordinarily be the best way to utilize resources in a single-core program. Your proof assumes that it's impossible for that to be the case.

In other words, if you run out of constructive things to do with the extra cores, then you need to start exploiting things that may be less beneficial in ordinary circumstances - but now we have new circumstances. In my "doctor" illustration, you just wouldn't waste a doctor's time in a hospital making phone calls or being a gofer, but in a specialized situation where there were more doctors than you could fully utilize, you would start putting them to work on more mundane tasks.
That's all well and good. But are you now saying that with just 4 cores you have already reached this boundary, and that it is better to waste part of those cores searching a wider tree? I don't believe it for a minute. 4-core speedups are pretty good for any reasonable implementation... Even 8 and 16 are good...
No, I didn't say that we reached the boundary.

Here is another way of looking at it. I can tell you from experience that we can make Komodo a ply faster or a ply slower (either way) by changing some things that will have only a very minor impact on the Elo of the program. So it's not as if there is a "one true" serial implementation that is the one and only right and ultimate way to do things.

So what we need to do is stop thinking of a chess program as a thing with totally independent parts that do not interact. It doesn't work that way in computer chess. What works best in a serial program with respect to LMR, forward pruning, and other things does not necessarily work best in a parallel program.

By the way, I reject Daniel's idea that there is no difference between a parallel algorithm and a serial algorithm and that they are exactly equivalent. That just reflects a lack of understanding.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
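One concrete, and purely hypothetical, way the serial optimum could differ from the parallel optimum is the late-move-reduction formula itself: reduce a little less as threads are added and the SMP tree widens in exactly the way this thread is describing. The function below is not Komodo's code; every constant in it is invented for illustration.

Code:

// Hypothetical thread-aware LMR, only to illustrate that the best serial
// settings need not be the best parallel settings. All constants invented.
#include <algorithm>
#include <cmath>
#include <cstdio>

int lmr_reduction(int depth, int move_number, int threads) {
    // A typical log*log reduction shape (illustrative only).
    double r = 0.5 + std::log(double(depth)) * std::log(double(move_number)) / 2.25;
    // Widen with thread count: trim up to one ply of reduction as threads grow,
    // trading time-to-depth for a bushier tree.
    r -= std::min(1.0, std::log2(double(threads)) / 2.0);
    return std::max(0, int(r));
}

int main() {
    // A late move at moderate depth is reduced less on 4 threads than on 1.
    std::printf("depth 12, move 20: reduce %d plies on 1 thread, %d plies on 4 threads\n",
                lmr_reduction(12, 20, 1), lmr_reduction(12, 20, 4));
    return 0;
}

A mechanism like this would produce both symptoms seen in this thread at once: a worse time-to-depth factor and an Elo gain at equal depth.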
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Peculiarity of Komodo 5.1MP

Post by Laskos »

Maybe this experiment could shed some light:

Code:

4 threads
  n/s: 4.426.797  
  TotTime: 51:18m    SolTime: 51:18m
  Ply: 0   Positions: 30   Avg Nodes:       0   Branching = 0.00
  Ply: 1   Positions: 30   Avg Nodes:     658   Branching = 0.00
  Ply: 2   Positions: 30   Avg Nodes:    1454   Branching = 2.21
  Ply: 3   Positions: 30   Avg Nodes:    3023   Branching = 2.08
  Ply: 4   Positions: 30   Avg Nodes:    5522   Branching = 1.83
  Ply: 5   Positions: 30   Avg Nodes:   10040   Branching = 1.82
  Ply: 6   Positions: 30   Avg Nodes:   19588   Branching = 1.95
  Ply: 7   Positions: 30   Avg Nodes:   37536   Branching = 1.92
  Ply: 8   Positions: 30   Avg Nodes:   70835   Branching = 1.89
  Ply: 9   Positions: 30   Avg Nodes:  150130   Branching = 2.12
  Ply:10   Positions: 30   Avg Nodes:  308354   Branching = 2.05
  Ply:11   Positions: 30   Avg Nodes:  569434   Branching = 1.85
  Ply:12   Positions: 30   Avg Nodes: 1048122   Branching = 1.84
  Ply:13   Positions: 30   Avg Nodes: 2170028   Branching = 2.07
  Ply:14   Positions: 30   Avg Nodes: 3712354   Branching = 1.71
  Ply:15   Positions: 30   Avg Nodes: 6877139   Branching = 1.85
  Ply:16   Positions: 30   Avg Nodes:13106835   Branching = 1.91
  Ply:17   Positions: 30   Avg Nodes:23709744   Branching = 1.81
  Ply:18   Positions: 30   Avg Nodes:39712826   Branching = 1.67
  Ply:19   Positions: 30   Avg Nodes:65730096   Branching = 1.66
  Ply:20   Positions: 30   Avg Nodes:118838030   Branching = 1.81
  Ply:21   Positions: 30   Avg Nodes:207054782   Branching = 1.74
Engine: Komodo 5.1 64-bit (512 MB)
by Don Dailey, Larry Kaufman


1 thread
  n/s: 1.329.027  
  TotTime: 71:30m    SolTime: 71:30m
  Ply: 0   Positions: 30   Avg Nodes:       0   Branching = 0.00
  Ply: 1   Positions: 30   Avg Nodes:     208   Branching = 0.00
  Ply: 2   Positions: 30   Avg Nodes:     475   Branching = 2.28
  Ply: 3   Positions: 30   Avg Nodes:    1162   Branching = 2.45
  Ply: 4   Positions: 30   Avg Nodes:    2369   Branching = 2.04
  Ply: 5   Positions: 30   Avg Nodes:    4022   Branching = 1.70
  Ply: 6   Positions: 30   Avg Nodes:    7420   Branching = 1.84
  Ply: 7   Positions: 30   Avg Nodes:   15120   Branching = 2.04
  Ply: 8   Positions: 30   Avg Nodes:   31341   Branching = 2.07
  Ply: 9   Positions: 30   Avg Nodes:   65919   Branching = 2.10
  Ply:10   Positions: 30   Avg Nodes:  138817   Branching = 2.11
  Ply:11   Positions: 30   Avg Nodes:  249815   Branching = 1.80
  Ply:12   Positions: 30   Avg Nodes:  486700   Branching = 1.95
  Ply:13   Positions: 30   Avg Nodes: 1025584   Branching = 2.11
  Ply:14   Positions: 30   Avg Nodes: 1761740   Branching = 1.72
  Ply:15   Positions: 30   Avg Nodes: 3037256   Branching = 1.72
  Ply:16   Positions: 30   Avg Nodes: 5514478   Branching = 1.82
  Ply:17   Positions: 30   Avg Nodes: 9323714   Branching = 1.69
  Ply:18   Positions: 30   Avg Nodes:18533095   Branching = 1.99
  Ply:19   Positions: 30   Avg Nodes:33487106   Branching = 1.81
  Ply:20   Positions: 30   Avg Nodes:53449385   Branching = 1.60
  Ply:21   Positions: 30   Avg Nodes:94696000   Branching = 1.77
Engine: Komodo 5.1 64-bit (512 MB)
by Don Dailey, Larry Kaufman
The time-to-depth (depth=21) factor from 1 thread to 4 threads is, on average over the 30 positions, very low: 1.4. The NPS factor is 3.3. From the first ply to the last, 4-threaded Komodo searches ~2-3 times more nodes to the same depth compared to 1-threaded. This factor does not increase gradually; it is there from the very first plies, so maybe folks could make up their minds about what sort of pruning is avoided in the parallel search.
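For anyone wanting to check the quoted factors against the raw output above, the arithmetic is just ratios of the printed numbers; the small gap between the node-based and time-based estimates comes from averaging over 30 positions in different ways.

Code:

// Recompute the factors from the averages printed above (Komodo 5.1, depth 21).
#include <cstdio>

int main() {
    const double nps_1t     = 1329027.0,   nps_4t     = 4426797.0;
    const double nodes_1t   = 94696000.0,  nodes_4t   = 207054782.0;  // avg nodes at ply 21
    const double minutes_1t = 71.5,        minutes_4t = 51.3;         // TotTime 71:30m vs 51:18m

    double nps_ratio  = nps_4t / nps_1t;          // ~3.3
    double node_ratio = nodes_4t / nodes_1t;      // ~2.2x more nodes to the same depth
    double ttd_nodes  = nps_ratio / node_ratio;   // ~1.5, from the depth-21 averages
    double ttd_time   = minutes_1t / minutes_4t;  // ~1.4, from total solution time

    std::printf("NPS ratio %.2f, node ratio at depth 21 %.2f\n", nps_ratio, node_ratio);
    std::printf("time-to-depth factor ~%.2f (from nodes), ~%.2f (from time)\n",
                ttd_nodes, ttd_time);
    return 0;
}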
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Peculiarity of Komodo 5.1MP

Post by Daniel Shawul »

OK, I have a better example. You can use two threads on _one_ physical processor, which is the same as a sequential search. If you want, you can keep two search stacks in your program and update them sequentially, the way two threads are handled with time slicing; then the two are the same. So now, if a wider parallel search done on _two_ physical processors is better than its deep-search counterpart done on the same two processors, then the same holds when it is run on one physical processor with two threads. The wider-search counterpart should perform better in both cases. This is the same argument as "is hyper-threading good or not?", with the only difference being that an HT-enabled processor has roughly 5% more die area. I am sure this reductio ad absurdum argument has come up there too, to the effect that you can improve the search on one core by adding more threads (not cores) ad infinitum. Enjoyed some botched-up Latin? :)
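Daniel's two-threads-on-one-core construction can be made concrete with two explicit search stacks stepped round-robin, which is all that OS time slicing does at a coarser grain. The toy searcher below walks a uniform tree with no alpha-beta and no evaluation; it is only meant to show that "two logical searchers on one core" is an ordinary sequential program, so any widening policy available to it is also available to a sequential search.

Code:

// Two "threads" interleaved on one core by cooperative time slicing:
// two explicit search stacks stepped round-robin over a toy uniform tree.
// The tree shape and depth are invented for illustration only.
#include <cstdio>
#include <vector>

struct Frame { int depth; int next_child; };

struct Searcher {
    std::vector<Frame> stack;
    long nodes = 0;
    explicit Searcher(int root_depth) { stack.push_back({root_depth, 0}); }

    // Expand one node, depth-first, on a uniform tree of width 3.
    bool step() {
        if (stack.empty()) return false;          // this searcher is finished
        Frame& f = stack.back();
        if (f.depth == 0 || f.next_child == 3) {  // leaf, or all children visited
            stack.pop_back();
            return !stack.empty();
        }
        int child_depth = f.depth - 1;
        ++f.next_child;
        stack.push_back({child_depth, 0});        // descend into the next child
        ++nodes;
        return true;
    }
};

int main() {
    Searcher a(6), b(6);                          // two logical searchers, one core
    bool run_a = true, run_b = true;
    while (run_a || run_b) {                      // round-robin "time slicing"
        if (run_a) run_a = a.step();
        if (run_b) run_b = b.step();
    }
    std::printf("searcher A expanded %ld nodes, searcher B expanded %ld nodes\n",
                a.nodes, b.nodes);
    return 0;
}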
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Peculiarity of Komodo 5.1MP

Post by Daniel Shawul »

By the way, I reject Daniel's idea that there is no difference between a parallel algorithm and a serial algorithm and that they are exactly equivalent. That just reflects a lack of understanding.
I didn't say that; in fact, I took a cut of 0.4x to acknowledge the fact that they search differently. The argument holds even if you get 3x from 4 processors. If you can just find a sequential search that performs equivalently to the parallel search on 4 threads, then whatever you say about the latter goes for the former. Hence, if the wider parallel search is better, so is the wider sequential search.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Peculiarity of Komodo 5.1MP

Post by Don »

Daniel Shawul wrote:
By the way, I reject Daniel's idea that there is no difference between a parallel algorithm and a serial algorithm and that they are exactly equivalent. That just reflects a lack of understanding.
I didn't say that; in fact, I took a cut of 0.4x to acknowledge the fact that they search differently. The argument holds even if you get 3x from 4 processors. If you can just find a sequential search that performs equivalently to the parallel search on 4 threads, then whatever you say about the latter goes for the former. Hence, if the wider parallel search is better, so is the wider sequential search.
I wish you were correct - because it would imply that we could improve the serial program to get something substantially better. If it were that easy I would surely do it.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.