We already KNOW he "lost speed". The tree must be bigger to gain an Elo increase. How hard is this to understand?syzygy wrote:Prove that you can "change the algorithm" without loss of speed...Daniel Shawul wrote:Assume sequential search is at its best. Give it 3.6x more time (say same as 4 threads search) but change the algorithm so that it can widen its search exactly the same way as the parallel search does.
Peculiarity of Komodo 5.1MP
Moderators: hgm, Rebel, chrisw
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Peculiarity of Komodo 5.1MP
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Peculiarity of Komodo 5.1MP
That's all well and good. But are you now saying that with just 4 cores, you have already reached this boundary and that it is better to waste part of those cores searching a wider tree? I don't believe it for a minute. 4 core speedups are pretty good for any reasonable implementation... Even 8 and 16 are good...Don wrote:That was a good try but your proof logic is flawed. What is not considered is utilization of resources.Daniel Shawul wrote:It doesn't matter if 4 threads is not as efficient as as 4x time search. The following statement is provable.Some have implied that we search wider because we don't get quite the same increase in depth and that this is a horrible thing. The intuition is that if that if it works you could do it on 1 thread too. However I do not think that follows. The problem is that 4 threads is just not 4x better than 1 thread no matter how good your MP implementation, so it always comes down to how to best utilize what you have. Anything goes in my opinion if it gets more out of extra cores in terms of ELO.
"If your sequential search is as good as it can be, then your parallel implementation that searches wider than sequential search is poorly done".
I will give it a shot as follows:
Assume sequential search is at its best. Give it 3.6x more time (say same as 4 threads search) but change the algorithm so that it can widen its search exactly the same way as the parallel search does. If indeed searching wider is better for parallel search as claimed, then the wide sequential search should show similar improvement over its deeper search counterpart. So we improved the sequential search that we assumed un-improvable, contradiction! The question of difficulty of implementation is irrelevant.
We know that that when when multiple cores try to cooperate to search a position you lose in 3 primary ways with chess programs: synchronization overhead, communication overhead and redundant calculations. If you can reduce 1 or more of those things and apply it to something else instead, you can get a gain that you would not have had otherwise - even if the gain itself would not ordinarily be the best way to utilize resources in a single core program. Your proof assumes that it's impossible for that to be the case.
In other words if you run out of constructive things to do with the extra cores - then you need to start exploiting things that may be less beneficial in ordinary circumstances but now we have new circumstances. In my "doctor" illustration you just wouldn't waste a doctors time in a hospital making phone calls or being a gopher, but in a specialized situation where there were more doctors than you could fully utilize you would start putting them to work on more mundane tasks.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Peculiarity of Komodo 5.1MP
This is easy enough to explain or grasp, depending on which side of the discussion one is on. If you make the tree wider, you do better in fixed depth search comparisons, because your wider tree does not hurt your time (it is not being measured) and the larger tree tends to miss less tactically. BUT, that wider tree has a definite cost. And in timed matches, that hurts. It does not help. All this test has shown is that Komodo has some odd thing in its search that changes the shape of the tree. One could, for example, double every search extension in SMP mode. A fixed-depth search would play stronger, even though the program is much slower to reach that 12 ply depth now.Uri Blass wrote:I understood the test well.Daniel Shawul wrote:Uri, I don't think you understood the test well. It was a _fixed depth_ test that is meaningless unless both parallel and serial versions search more or less similar tree.Uri Blass wrote:"poor parallel implementation?"Daniel Shawul wrote:This is misleading because komodo probably compensates for poor parallel implementation by searching wider, otherwise it shouldn't be getting any elos for fixed depth test. You should do the test with time and see how much each gain.
The target of parallel implementation is not to get bigger depth but to play better.
If the implementation is good in helping komodo to play better than I think that it is wrong to call it poor.
Maybe it is the opposite and komodo compensates for poor pruning of lines that it should not prune by parallel implementation that prevent it to prune good moves in the relevant lines.
If i decided to use alpha-beta only for the parallel search while using lmr+nullmove+other pruning for the sequential search who would you think would win for a fixed depth test. Get it?
fixed depth is meaningless if the target is to find strength improvement but it was clearly not the target of the test.
The target was simply to compare the behaviour of Komodo with the behaviour of other programs(not to compare strength).
My point was that the conclusion that the parallel implementation is poor is simply wrong.
Note that I did not claim nothing based on the specific test.
I believe that the implementation of komodo of paralllel search is good
based on what I read about results of it against other programs and I read that with many cores(more than 4 cores) it did better against other programs relative to one core tests.
This shows absolutely nothing other than that the parallel search looks at stuff the sequential search does not, and as a general rule, that is detrimental to overall Elo, NOT beneficial.
-
- Posts: 10460
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Peculiarity of Komodo 5.1MP
bob wrote:This is easy enough to explain or grasp, depending on which side of the discussion one is on. If you make the tree wider, you do better in fixed depth search comparisons, because your wider tree does not hurt your time (it is not being measured) and the larger tree tends to miss less tactically. BUT, that wider tree has a definite cost. And in timed matches, that hurts. It does not help. All this test has shown is that Komodo has some odd thing in its search that changes the shape of the tree. One could, for example, double every search extension in SMP mode. A fixed-depth search would play stronger, even though the program is much slower to reach that 12 ply depth now.Uri Blass wrote:I understood the test well.Daniel Shawul wrote:Uri, I don't think you understood the test well. It was a _fixed depth_ test that is meaningless unless both parallel and serial versions search more or less similar tree.Uri Blass wrote:"poor parallel implementation?"Daniel Shawul wrote:This is misleading because komodo probably compensates for poor parallel implementation by searching wider, otherwise it shouldn't be getting any elos for fixed depth test. You should do the test with time and see how much each gain.
The target of parallel implementation is not to get bigger depth but to play better.
If the implementation is good in helping komodo to play better than I think that it is wrong to call it poor.
Maybe it is the opposite and komodo compensates for poor pruning of lines that it should not prune by parallel implementation that prevent it to prune good moves in the relevant lines.
If i decided to use alpha-beta only for the parallel search while using lmr+nullmove+other pruning for the sequential search who would you think would win for a fixed depth test. Get it?
fixed depth is meaningless if the target is to find strength improvement but it was clearly not the target of the test.
The target was simply to compare the behaviour of Komodo with the behaviour of other programs(not to compare strength).
My point was that the conclusion that the parallel implementation is poor is simply wrong.
Note that I did not claim nothing based on the specific test.
I believe that the implementation of komodo of paralllel search is good
based on what I read about results of it against other programs and I read that with many cores(more than 4 cores) it did better against other programs relative to one core tests.
This shows absolutely nothing other than that the parallel search looks at stuff the sequential search does not, and as a general rule, that is detrimental to overall Elo, NOT beneficial.
I agree only with the first part of the last sentence.
fixed depth show nothing about the question if it is good to have parallel search looks at wider tree.
I disagree with the general rule that it is detrimental to overall Elo.
It may be detrimental and may be beneficial and you need tests not at fixed depth in order to find out.
I think that in the specific case of komodo it is beneficial not based on fixed depth results.
The reason for my opinion is that
I read that komodo performed better with 12 cores relative to 1 core against other programs when the test was at fixed time and not fixed depth.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Peculiarity of Komodo 5.1MP
It benefits from a wider tree and suffers from larger time to depth. What's so hard to understand?bob wrote:Your statement is simply wrong. It doesn't "benefit" from a wider tree. It "suffers from" a wider tree.Laskos wrote:Time-to-depth won't work for Komodo. It benefits from a wider tree in the parallel search to the same depth, and time-to-depth would understate its SMP performance Elo-wise. Comparing depths of the 1 threaded Komodo with the depths of 4-threaded Komodo is meaningless given the different tree shapes. Elo is all that matters in SMP efficiency, and I don't know why it has a wider tree in the parallel search, the fact is that 4 vs 1 thread Komodo benefits from two sides: deeper search and wider tree, both contributing substantially. The total contribution is probably comparable to other engines going from 1 to 4 threads, only that Komodo has this peculiar behavior.bob wrote:Time-to-depth IS the correct measure. If Komodo searches wider, then it searches less efficiently. One wants to measure the COMPLETE SMP implementation, not just how the tree is split.Laskos wrote:I already wrote that it probably searches wider with the number of threads. But it's different from other top engines, which to given depth are pretty much the same strength on different number of threads. And the rule (as stated by some) that time-to-depth is determining MP efficiency does not apply to Komodo.Daniel Shawul wrote:This is misleading because komodo probably compensates for poor parallel implementation by searching wider, otherwise it shouldn't be getting any elos for fixed depth test. You should do the test with time and see how much each gain.
I will leave to test groups and individuals to test with time, there are plenty of volunteers. I was just curious about this aspect.
I can not imagine wanting to search wider on a parallel search. If it is a good idea, why not search wider on the sequential implementation as well?
All you say is trivial, and was mentioned in my first post. Komodo has lower time-to-depth ratio going from 1 threads to 4 threads, say 2.0 instead of 3.2, but an additional 80 Elo points benefit from a wider tree. If you measure only time-to-depth, get 2.0 instead of 3.2, and say it's scaling badly, you are wrong, as you do not include additional 80 points from wider tree.
Otherwise a wider tree would be better in a non-parallel search as well.
Note that when you see an elo improvement going from 1 to 4 in fixed depth, it is imaginary. Just compare the times. The 4 cpu test is taking longer than it should, getting a time handicap that is imaginary in real chess.
Better to do a few searches with st=10, and use both 1 and 4 cpus and see what happens there. I'd suspect, based on your results, that the 4 cpu is not going to perform as well as one would expect, because part of the extra processing power is wasted searching extra nodes...
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Peculiarity of Komodo 5.1MP
No, I didn't say that we reached the boundary.bob wrote:That's all well and good. But are you now saying that with just 4 cores, you have already reached this boundary and that it is better to waste part of those cores searching a wider tree? I don't believe it for a minute. 4 core speedups are pretty good for any reasonable implementation... Even 8 and 16 are good...Don wrote:That was a good try but your proof logic is flawed. What is not considered is utilization of resources.Daniel Shawul wrote:It doesn't matter if 4 threads is not as efficient as as 4x time search. The following statement is provable.Some have implied that we search wider because we don't get quite the same increase in depth and that this is a horrible thing. The intuition is that if that if it works you could do it on 1 thread too. However I do not think that follows. The problem is that 4 threads is just not 4x better than 1 thread no matter how good your MP implementation, so it always comes down to how to best utilize what you have. Anything goes in my opinion if it gets more out of extra cores in terms of ELO.
"If your sequential search is as good as it can be, then your parallel implementation that searches wider than sequential search is poorly done".
I will give it a shot as follows:
Assume sequential search is at its best. Give it 3.6x more time (say same as 4 threads search) but change the algorithm so that it can widen its search exactly the same way as the parallel search does. If indeed searching wider is better for parallel search as claimed, then the wide sequential search should show similar improvement over its deeper search counterpart. So we improved the sequential search that we assumed un-improvable, contradiction! The question of difficulty of implementation is irrelevant.
We know that that when when multiple cores try to cooperate to search a position you lose in 3 primary ways with chess programs: synchronization overhead, communication overhead and redundant calculations. If you can reduce 1 or more of those things and apply it to something else instead, you can get a gain that you would not have had otherwise - even if the gain itself would not ordinarily be the best way to utilize resources in a single core program. Your proof assumes that it's impossible for that to be the case.
In other words if you run out of constructive things to do with the extra cores - then you need to start exploiting things that may be less beneficial in ordinary circumstances but now we have new circumstances. In my "doctor" illustration you just wouldn't waste a doctors time in a hospital making phone calls or being a gopher, but in a specialized situation where there were more doctors than you could fully utilize you would start putting them to work on more mundane tasks.
Here is another way of looking at it. I can tell you from experience that we can make Komodo a ply faster or a ply slower (either way) by changing some things that will have only a very minor impact on the ELO of the program. So it's not as if there is a "one true" serial implementation that is the one and only right and ultimate way to do things.
So what we need to do is to stop thinking of a chess program as these thing that has totally independent parts that do not interact. It doesn't work that way in computer chess. What works the very best in a serial program with respect to LMR and forward pruning and other things do not necessarily work best in a parallel program.
I reject Daniels idea by the way that there is no difference in a parallel algorithm and a serial algorithm and that they are exactly equivalent. That just reflects a lack of understanding.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Peculiarity of Komodo 5.1MP
Maybe this experiment could shed some light
Time-to-depth (depth=21) factor from 1 thread to 4 threads is on average (30 positions) very low, 1.4. NPS factor is 3.3. From the first ply to the end, 4-threaded Komodo searches ~2-3 times more nodes to the same depth compared to 1-threaded. It's not that this factor increases gradually, so maybe folks could make their mind what sort of pruning is avoided in the parallel search.
Code: Select all
4 threads
n/s: 4.426.797
TotTime: 51:18m SolTime: 51:18m
Ply: 0 Positions: 30 Avg Nodes: 0 Branching = 0.00
Ply: 1 Positions: 30 Avg Nodes: 658 Branching = 0.00
Ply: 2 Positions: 30 Avg Nodes: 1454 Branching = 2.21
Ply: 3 Positions: 30 Avg Nodes: 3023 Branching = 2.08
Ply: 4 Positions: 30 Avg Nodes: 5522 Branching = 1.83
Ply: 5 Positions: 30 Avg Nodes: 10040 Branching = 1.82
Ply: 6 Positions: 30 Avg Nodes: 19588 Branching = 1.95
Ply: 7 Positions: 30 Avg Nodes: 37536 Branching = 1.92
Ply: 8 Positions: 30 Avg Nodes: 70835 Branching = 1.89
Ply: 9 Positions: 30 Avg Nodes: 150130 Branching = 2.12
Ply:10 Positions: 30 Avg Nodes: 308354 Branching = 2.05
Ply:11 Positions: 30 Avg Nodes: 569434 Branching = 1.85
Ply:12 Positions: 30 Avg Nodes: 1048122 Branching = 1.84
Ply:13 Positions: 30 Avg Nodes: 2170028 Branching = 2.07
Ply:14 Positions: 30 Avg Nodes: 3712354 Branching = 1.71
Ply:15 Positions: 30 Avg Nodes: 6877139 Branching = 1.85
Ply:16 Positions: 30 Avg Nodes:13106835 Branching = 1.91
Ply:17 Positions: 30 Avg Nodes:23709744 Branching = 1.81
Ply:18 Positions: 30 Avg Nodes:39712826 Branching = 1.67
Ply:19 Positions: 30 Avg Nodes:65730096 Branching = 1.66
Ply:20 Positions: 30 Avg Nodes:118838030 Branching = 1.81
Ply:21 Positions: 30 Avg Nodes:207054782 Branching = 1.74
Engine: Komodo 5.1 64-bit (512 MB)
by Don Dailey, Larry Kaufman
1 thread
n/s: 1.329.027
TotTime: 71:30m SolTime: 71:30m
Ply: 0 Positions: 30 Avg Nodes: 0 Branching = 0.00
Ply: 1 Positions: 30 Avg Nodes: 208 Branching = 0.00
Ply: 2 Positions: 30 Avg Nodes: 475 Branching = 2.28
Ply: 3 Positions: 30 Avg Nodes: 1162 Branching = 2.45
Ply: 4 Positions: 30 Avg Nodes: 2369 Branching = 2.04
Ply: 5 Positions: 30 Avg Nodes: 4022 Branching = 1.70
Ply: 6 Positions: 30 Avg Nodes: 7420 Branching = 1.84
Ply: 7 Positions: 30 Avg Nodes: 15120 Branching = 2.04
Ply: 8 Positions: 30 Avg Nodes: 31341 Branching = 2.07
Ply: 9 Positions: 30 Avg Nodes: 65919 Branching = 2.10
Ply:10 Positions: 30 Avg Nodes: 138817 Branching = 2.11
Ply:11 Positions: 30 Avg Nodes: 249815 Branching = 1.80
Ply:12 Positions: 30 Avg Nodes: 486700 Branching = 1.95
Ply:13 Positions: 30 Avg Nodes: 1025584 Branching = 2.11
Ply:14 Positions: 30 Avg Nodes: 1761740 Branching = 1.72
Ply:15 Positions: 30 Avg Nodes: 3037256 Branching = 1.72
Ply:16 Positions: 30 Avg Nodes: 5514478 Branching = 1.82
Ply:17 Positions: 30 Avg Nodes: 9323714 Branching = 1.69
Ply:18 Positions: 30 Avg Nodes:18533095 Branching = 1.99
Ply:19 Positions: 30 Avg Nodes:33487106 Branching = 1.81
Ply:20 Positions: 30 Avg Nodes:53449385 Branching = 1.60
Ply:21 Positions: 30 Avg Nodes:94696000 Branching = 1.77
Engine: Komodo 5.1 64-bit (512 MB)
by Don Dailey, Larry Kaufman
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Peculiarity of Komodo 5.1MP
Ok I have a better example. You can use two threads on _one_ physical processor which is same as a sequential search. If you want you can keep two search stacks in your program and update them sequentially like two threads are handled with time slicing, then the two are the same. So now if wider parallel search done on _two_ physical processors is better than its deep search counter part done on the same two processors, then the same holds for the case when it is run on one physical processor with two threads. The wider search counter part should perform better in both cases. This is the same argument as "is hyper-threading good or not?", with only difference that HT-enabled processor has 5% or more die area. I am sure this 'Reductio ad absurdum' argument have come up there too, to the effect that you can improve the search on one core by adding more threads (not cores) ad infinitum. Enjoyed some botched up latin ?
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Peculiarity of Komodo 5.1MP
I didn't say that infact I took a cut of 0.4x to acknowledge for the fact that they search differently. The argument holds even if you get 3x from 4 processors. If you can just find an equivalent performing sequential search to the parallel search on 4 threads, then whatever you say about the latter goes for the former. Hence if wider parallel search is better , so is wider sequential search.I reject Daniels idea by the way that there is no difference in a parallel algorithm and a serial algorithm and that they are exactly equivalent. That just reflects a lack of understanding.
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Peculiarity of Komodo 5.1MP
I wish you were correct - because it would imply that we could improve the serial program to get something substantially better. If it were that easy I would surely do it.Daniel Shawul wrote:I didn't say that infact I took a cut of 0.4x to acknowledge for the fact that they search differently. The argument holds even if you get 3x from 4 processors. If you can just find an equivalent performing sequential search to the parallel search on 4 threads, then whatever you say about the latter goes for the former. Hence if wider parallel search is better , so is wider sequential search.I reject Daniels idea by the way that there is no difference in a parallel algorithm and a serial algorithm and that they are exactly equivalent. That just reflects a lack of understanding.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.