Significant modification of Komodo SMP implementation

Re: Significant modification of Komodo SMP implementation

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote: One assumption that leads to one question: what could possibly cause two different programs to have a limited parallel search improvement? The tree search problem is well-known and well-understood. Both programs are playing the same game, with the same rules, and there's nothing inherent in the tree search that serves as a limit to SMP speedup. The current problem is that alpha/beta is a sequential algorithm by its very definition, which makes parallel search a real challenge. But I don't believe that there is something in Komodo that limits parallel search potential compared to other chess engines because they ALL use the same basic search approach, minimax + alpha/beta...
Some good engines using Lazy SMP like Cheng and Andscacs widen even more, having fairly good SMP results. I guess it's often just easier to have 1.7*1.7=2.9 speedup on 4 cores coming from 1.7 deepening and 1.7 widening, than 2.9 from purely deepening (having the same search tree as the optimized 1 core tree). With more effort, one would slowly go to more deepening and less widening. The ideal speedup of 4 on 4 cores is done only through pure deepening, without widening, but we are still far from it, and widening will be present in many engines for some time from now. Even more on 16 cores and such. I guess many details of optimal parallelization may hide in search particularities, not just minimax and alpha-beta, but I have no competency on this.
However, things like Lazy SMP don't intentionally "widen". It happens because of an inefficient search algorithm. Any time you add nodes to the serial tree, you are widening the search, albeit almost certainly an unintentional (and undesirable) side effect of an inefficient parallel search. And what is much more important in today's world is 8 cores, and 16 cores. 4 is common in phones already with more coming. That's where most of the "sorta-SMP searches" break down. They often work OK for 2, and barely OK for 4, but beyond that, forget about it...

Clearly getting something is better than ignoring multiple cores and getting nothing, so there's nothing wrong with that. But some will want it all (me included). There, searching extra nodes is the antithesis of our intentions.
I don't know how I overlooked that, but only today I tested Crafty 23.6 at fixed depth to detect a possible widening. Well, Crafty 23.6 (the last version I have) widens!

Fixed depth=10, Crafty 23.6 on 4 threads against Crafty 23.6 on 1 thread:

Code:

Score of Crafty 4 threads vs Crafty 1 thread: 680 - 567 - 753  [0.53] 2000
ELO difference: 20
Finished match
The widening amounts to 20 +/- 12 Elo points.

That's why I am getting only a 2.7 time-to-depth (TTD) speedup for Crafty 23.6 on 4 cores: TTD is not the only contribution to the effective speedup, which also receives a contribution from widening, by a factor close to 1.15.

So, the effective speedup of Crafty 23.6 on 4 cores is:
2.7 (TTD) * 1.15 (widening) = 3.1

More reasonable effective speedup now, and closer to the NPS speedup of 3.2-3.3.

So, even Crafty 23.6 widens, although much less, and without any intention on your part to widen.
Right. It is _really_ difficult to not add nodes. Every time you split you run the risk of splitting at a node that ought to be an ALL node but is in reality a CUT node due to improper move ordering. I've been studying why the "widening" is not as pronounced as what I saw in DTS, and it appears to be at least partially related to LMR. If you unexpectedly fail high during the search at a node, the nodes following the fail-high move would normally not have been searched -> search overhead. But today, they are searched less carefully (LMR reductions). So at least in many of the cases, this seems to help. This is going to take months (or years) to really understand however. The trees are so damned large, it is very difficult to figure out what is going on in a small part of a huge forest.
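
A minimal sketch of the arithmetic behind the "2.7 (TTD) * 1.15 (widening) = 3.1" figure above: the fixed-depth Elo gain is converted into an equivalent time factor of 2^(Elo gain / Elo per doubling) and multiplied with the time-to-depth speedup. The 100 Elo-per-doubling value is an assumption for illustration only; the posts do not state the conversion actually used.

Code:

/* Sketch of the widening arithmetic: a fixed-depth Elo gain is treated as an
 * equivalent time factor of 2^(Elo / Elo-per-doubling), then multiplied with
 * the time-to-depth speedup.  The 100 Elo-per-doubling figure is an assumed
 * value for illustration; the thread does not state the figure used. */
#include <math.h>
#include <stdio.h>

int main(void) {
  double elo_per_doubling = 100.0; /* assumed self-play gain per doubling of time */
  double widening_elo = 20.0;      /* measured fixed-depth gain, 4 threads vs 1 */
  double ttd_speedup = 2.7;        /* measured time-to-depth speedup */

  double widening_factor = pow(2.0, widening_elo / elo_per_doubling);
  printf("widening factor    %.2f\n", widening_factor);               /* ~1.15 */
  printf("effective speedup  %.2f\n", ttd_speedup * widening_factor); /* ~3.1  */
  return 0;
}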

Re: Significant modification of Komodo SMP implementation

Post by Laskos »

bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote: One assumption that leads to one question: what could possibly cause two different programs to have a limited parallel search improvement? The tree search problem is well-known and well-understood. Both programs are playing the same game, with the same rules, and there's nothing inherent in the tree search that serves as a limit to SMP speedup. The current problem is that alpha/beta is a sequential algorithm by its very definition, which makes parallel search a real challenge. But I don't believe that there is something in Komodo that limits parallel search potential compared to other chess engines because they ALL use the same basic search approach, minimax + alpha/beta...
Some good engines using Lazy SMP like Cheng and Andscacs widen even more, having fairly good SMP results. I guess it's often just easier to have 1.7*1.7=2.9 speedup on 4 cores coming from 1.7 deepening and 1.7 widening, than 2.9 from purely deepening (having the same search tree as the optimized 1 core tree). With more effort, one would slowly go to more deepening and less widening. The ideal speedup of 4 on 4 cores is done only through pure deepening, without widening, but we are still far from it, and widening will be present in many engines for some time from now. Even more on 16 cores and such. I guess many details of optimal parallelization may hide in search particularities, not just minimax and alpha-beta, but I have no competency on this.
However, things like Lazy SMP don't intentionally "widen". It happens because of an inefficient search algorithm. Any time you add nodes to the serial tree, you are widening the search, albeit almost certainly an unintentional (and undesirable) side effect of an inefficient parallel search. And what is much more important in today's world is 8 cores, and 16 cores. 4 is common in phones already with more coming. That's where most of the "sorta-SMP searches" break down. They often work OK for 2, and barely OK for 4, but beyond that, forget about it...

Clearly getting something is better than ignoring multiple cores and getting nothing, so there's nothing wrong with that. But some will want it all (me included). There, searching extra nodes is the antithesis of our intentions.
I don't know how I overlooked that, but only today I tested Crafty 23.6 at fixed depth to detect a possible widening. Well, Crafty 23.6 (the last version I have) widens!

Fixed depth=10, Crafty 23.6 on 4 threads against Crafty 23.6 on 1 thread:

Code:

Score of Crafty 4 threads vs Crafty 1 thread: 680 - 567 - 753  [0.53] 2000
ELO difference: 20
Finished match
The widening amounts to 20 +/- 12 Elo points.

That's why I am getting only a 2.7 time-to-depth (TTD) speedup for Crafty 23.6 on 4 cores: TTD is not the only contribution to the effective speedup, which also receives a contribution from widening, by a factor close to 1.15.

So, the effective speedup of Crafty 23.6 on 4 cores is:
2.7 (TTD) * 1.15 (widening) = 3.1

More reasonable effective speedup now, and closer to the NPS speedup of 3.2-3.3.

So, even Crafty 23.6 widens, although much less, and without any intention on your part to widen.
Right. It is _really_ difficult to not add nodes. Every time you split you run the risk of splitting at a node that ought to be an ALL node but is in reality a CUT node due to improper move ordering. I've been studying why the "widening" is not as pronounced as what I saw in DTS, and it appears to be at least partially related to LMR. If you unexpectedly fail high during the search at a node, the nodes following the fail-high move would normally not have been searched -> search overhead. But today, they are searched less carefully (LMR reductions). So at least in many of the cases, this seems to help. This is going to take months (or years) to really understand however. The trees are so damned large, it is very difficult to figure out what is going on in a small part of a huge forest.
Some engines do not widen at all, although the overhead is still present there as a complete waste. The multicore tree almost always looks a bit thicker than the single-core one to the same depth, even without widening. I understand that the search overhead, instead of being a total waste, may unintentionally contribute to a gain in Elo at fixed depth due, for example, to LMR reductions. I tested Crafty 18.12, and it indeed doesn't widen:

Fixed depth=7

Code:

Score of Crafty 18.12 4 threads vs Crafty 18.12 1 thread: 1467 - 1436 - 2097  [0.50] 5000
ELO difference: 2
Finished match
But I got a surprising result with the last public Crafty 24.1, which is intuitively hard to understand. What is happening here? Crafty 24.1 seems to NARROW going to multicore:

Fixed depth=10

Code:

Score of Crafty 24.1 4 threads vs Crafty 24.1 1 thread: 1407 - 1766 - 1827  [0.46] 5000
ELO difference: -25
Finished match
In this case, it means that the effective speedup of Crafty 24.1 is smaller than its TTD speedup.

If your DTS implies widening, then the time-to-depth (TTD) speedup is not equal to the effective speedup. The effective speedup would be larger than the TTD speedup, so does that mean your DTS numbers are underestimated?

I would be curious whether your private Crafty 25.0, with such high TTD speedups, is correctly described by TTD alone.

Re: Significant modification of Komodo SMP implementation

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote: One assumption that leads to one question: what could possibly cause two different programs to have a limited parallel search improvement? The tree search problem is well-known and well-understood. Both programs are playing the same game, with the same rules, and there's nothing inherent in the tree search that serves as a limit to SMP speedup. The current problem is that alpha/beta is a sequential algorithm by its very definition, which makes parallel search a real challenge. But I don't believe that there is something in Komodo that limits parallel search potential compared to other chess engines because they ALL use the same basic search approach, minimax + alpha/beta...
Some good engines using Lazy SMP like Cheng and Andscacs widen even more, having fairly good SMP results. I guess it's often just easier to have 1.7*1.7=2.9 speedup on 4 cores coming from 1.7 deepening and 1.7 widening, than 2.9 from purely deepening (having the same search tree as the optimized 1 core tree). With more effort, one would slowly go to more deepening and less widening. The ideal speedup of 4 on 4 cores is done only through pure deepening, without widening, but we are still far from it, and widening will be present in many engines for some time from now. Even more on 16 cores and such. I guess many details of optimal parallelization may hide in search particularities, not just minimax and alpha-beta, but I have no competency on this.
However, things like Lazy SMP don't intentionally "widen". It happens because of an inefficient search algorithm. Any time you add nodes to the serial tree, you are widening the search, albeit almost certainly an unintentional (and undesirable) side effect of an inefficient parallel search. And what is much more important in today's world is 8 cores, and 16 cores. 4 is common in phones already with more coming. That's where most of the "sorta-SMP searches" break down. They often work OK for 2, and barely OK for 4, but beyond that, forget about it...

Clearly getting something is better than ignoring multiple cores and getting nothing, so there's nothing wrong with that. But some will want it all (me included). There, searching extra nodes is the antithesis of our intentions.
I don't know how I overlooked that, but only today I tested Crafty 23.6 at fixed depth to detect a possible widening. Well, Crafty 23.6 (the last version I have) widens!

Fixed depth=10, Crafty 23.6 on 4 threads against Crafty 23.6 on 1 thread:

Code:

Score of Crafty 4 threads vs Crafty 1 thread: 680 - 567 - 753  [0.53] 2000
ELO difference: 20
Finished match
The widening amounts to 20 +/- 12 Elo points.

That's why I am getting only a 2.7 time-to-depth (TTD) speedup for Crafty 23.6 on 4 cores: TTD is not the only contribution to the effective speedup, which also receives a contribution from widening, by a factor close to 1.15.

So, the effective speedup of Crafty 23.6 on 4 cores is:
2.7 (TTD) * 1.15 (widening) = 3.1

More reasonable effective speedup now, and closer to the NPS speedup of 3.2-3.3.

So, even Crafty 23.6 widens, although much less, and without any intention on your part to widen.
Right. It is _really_ difficult to not add nodes. Every time you split you run the risk of splitting at a node that ought to be an ALL node but is in reality a CUT node due to improper move ordering. I've been studying why the "widening" is not as pronounced as what I saw in DTS, and it appears to be at least partially related to LMR. If you unexpectedly fail high during the search at a node, the nodes following the fail-high move would normally not have been searched -> search overhead. But today, they are searched less carefully (LMR reductions). So at least in many of the cases, this seems to help. This is going to take months (or years) to really understand however. The trees are so damned large, it is very difficult to figure out what is going on in a small part of a huge forest.
Some engines do not widen at all, although the overhead is still present there as a complete waste. The multicore tree almost always looks a bit thicker than the single-core one to the same depth, even without widening. I understand that the search overhead, instead of being a total waste, may unintentionally contribute to a gain in Elo at fixed depth due, for example, to LMR reductions. I tested Crafty 18.12, and it indeed doesn't widen:

Fixed depth=7

Code:

Score of Crafty 18.12 4 threads vs Crafty 18.12 1 thread: 1467 - 1436 - 2097  [0.50] 5000
ELO difference: 2
Finished match
But I got a surprising result with the last public Crafty 24.1, which is intuitively hard to understand. What is happening here? Crafty 24.1 seems to NARROW going to multicore:

Fixed depth=10

Code:

Score of Crafty 24.1 4 threads vs Crafty 24.1 1 thread: 1407 - 1766 - 1827  [0.46] 5000
ELO difference: -25
Finished match
In this case, it means that the effective speedup of Crafty 24.1 is smaller than its TTD speedup.

If your DTS implies widening, then the time-to-depth (TTD) speedup is not equal to the effective speedup. The effective speedup would be larger than the TTD speedup, so does that mean your DTS numbers are underestimated?

I would be curious whether your private Crafty 25.0, with such high TTD speedups, is correctly described by TTD alone.

I can try some fixed depth games for fun. I'm still not sure how it would "narrow" the search at all since I only have one search procedure that works for parallel and serial search. 24.1 might well have a bug for all I know, but it is a bare "distant cousin" to 25.0. I'm going to start such a match right now running 8 cores vs 1 core to a fixed depth of 10. Shouldn't take too long, more to follow.

Re: Significant modification of Komodo SMP implementation

Post by bob »

I have this running right now, 25.0/mt=8 vs 25.0/mt=0. I am only playing 5 games at a time using 5 8-core machines. Once this runs long enough to produce a credible result, I'll try the same with 23.whatever and 24.1. The 25.0 search is definitely based on 24.1 search, but not the parallel stuff. I'd expect if there is anything going on here, it is an error I introduced when I got rid of all the duplicated Search() codes (there was a search, a search root, a search_parallel, and so forth, now there is only a Search()). Since most of the cluster testing is done with one thread, something might well have slipped through a crack.

Re: Significant modification of Komodo SMP implementation

Post by bob »

Early results are not very stable yet. Currently this is the result:

1 Crafty-25.0 2601 10 10 672 50% 2599 45%
2 Crafty-25.0x 2599 10 10 672 50% 2601 45%

25.0 is using mt=8; 25.0x was compiled with threading disabled so that I could use a common .craftyrc file with mt=8, but 25.0x ignores the mt=8 since there is no thread support compiled in.

Test started off looking like this:

1 Crafty-25.0x 2611 22 22 132 53% 2589 42%
2 Crafty-25.0 2589 22 22 132 47% 2611 42%

then

1 Crafty-25.0x 2603 11 11 531 51% 2597 44%
2 Crafty-25.0 2597 11 11 531 49% 2603 44%

and as of right now

1 Crafty-25.0x 2600 9 9 816 50% 2600 45%
2 Crafty-25.0 2600 9 9 816 50% 2600 45%


I can say there is not a lot of splitting going on, since the min_split_depth default is 5. That means that with a 10 ply search and no extensions, it will only be able to split in the first 5 plies of the search. I'm going to let this run for 30K games since I have an outside project at my house; I'll check in later and post the results from time to time.
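
A minimal illustration of the min_split_depth constraint just described. The depth-counting convention used here (remaining depth at ply p of a fixed depth-10 search is 10 - p) is an assumption for illustration; Crafty's exact test may differ in detail.

Code:

/* Illustration of the min_split_depth point above: in a fixed 10-ply search,
 * a split is only considered where the remaining depth is still at least 5,
 * which under this convention means the first 5 plies. */
#include <stdio.h>

int main(void) {
  int search_depth = 10, min_split_depth = 5;
  for (int ply = 1; ply <= search_depth; ply++) {
    int remaining = search_depth - ply;   /* plies still to search below this node */
    printf("ply %2d  remaining %2d  split %s\n", ply, remaining,
           remaining >= min_split_depth ? "possible" : "not possible");
  }
  return 0;
}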

most recent is now

1 Crafty-25.0x 2600 8 8 911 50% 2600 47%
2 Crafty-25.0 2600 8 8 911 50% 2600 47%

This is running at an average speed of 1 game per second running 5 at a time. So 30,000 seconds to finish although I am not going to let it run that long.

current:

1 Crafty-25.0x 2605 8 8 1077 51% 2595 47%
2 Crafty-25.0 2595 8 8 1077 49% 2605 47%

So it is still not quite stable.

Re: Significant modification of Komodo SMP implementation

Post by Laskos »

bob wrote:Early results are not very stable yet. Currently this is the result:

1 Crafty-25.0 2601 10 10 672 50% 2599 45%
2 Crafty-25.0x 2599 10 10 672 50% 2601 45%

25.0 is using mt=8; 25.0x was compiled with threading disabled so that I could use a common .craftyrc file with mt=8, but 25.0x ignores the mt=8 since there is no thread support compiled in.

Test started off looking like this:

1 Crafty-25.0x 2611 22 22 132 53% 2589 42%
2 Crafty-25.0 2589 22 22 132 47% 2611 42%

then

1 Crafty-25.0x 2603 11 11 531 51% 2597 44%
2 Crafty-25.0 2597 11 11 531 49% 2603 44%

and as of right now

1 Crafty-25.0x 2600 9 9 816 50% 2600 45%
2 Crafty-25.0 2600 9 9 816 50% 2600 45%


I can say there is not a lot of splitting going on, since the min_split_depth default is 5. That means that with a 10 ply search and no extensions, it will only be able to split in the first 5 plies of the search. I'm going to let this run for 30K games since I have an outside project at my house; I'll check in later and post the results from time to time.

most recent is now

1 Crafty-25.0x 2600 8 8 911 50% 2600 47%
2 Crafty-25.0 2600 8 8 911 50% 2600 47%

This is running at an average speed of 1 game per second running 5 at a time. So 30,000 seconds to finish although I am not going to let it run that long.

current:

1 Crafty-25.0x 2605 8 8 1077 51% 2595 47%
2 Crafty-25.0 2595 8 8 1077 49% 2605 47%

So it is still not quite stable.
I think after 5,000 games it will be clearer. The error margins shown here seem to be 1 standard deviation, so multiply them by 2 to get 95% confidence. This is Bayeselo, right?
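
As a rough cross-check of such margins, here is a sketch of the simple trinomial error estimate (not Bayeselo's own model), applied to the 680 - 567 - 753 result over 2000 games quoted earlier in the thread; it comes out at about +20 Elo with a 2 SD margin of roughly 12.

Code:

/* Sketch: Elo difference and a ~95% (2 SD) margin computed directly from a
 * win-loss-draw result, using a plain trinomial model rather than Bayeselo's
 * own method.  Numbers are the 680-567-753 / 2000-game result quoted above. */
#include <math.h>
#include <stdio.h>

int main(void) {
  double wins = 680, losses = 567, draws = 753;
  double n = wins + losses + draws;
  double s = (wins + 0.5 * draws) / n;                    /* score fraction */
  double elo = -400.0 * log10(1.0 / s - 1.0);

  double var = wins / n + 0.25 * draws / n - s * s;       /* per-game score variance */
  double se = sqrt(var / n);                              /* standard error of the mean */
  double delo_ds = (400.0 / log(10.0)) / (s * (1.0 - s)); /* slope of Elo vs score */

  printf("Elo %+.1f  +/- %.1f (2 SD)\n", elo, 2.0 * se * delo_ds); /* ~ +20 +/- 12 */
  return 0;
}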

I have computed the effective speedups and NPS speedups on 4 cores for two Crafties separated by 10 years:

Crafty 18.12:
Effective Speedup: 2.6
NPS Speedup: 3.2

Crafty 23.6:
Effective Speedup: 3.1
NPS Speedup: 3.3

Re: Significant modification of Komodo SMP implementation

Post by bob »

Laskos wrote:
bob wrote:Early results are not very stable yet. Currently this is the result:

1 Crafty-25.0 2601 10 10 672 50% 2599 45%
2 Crafty-25.0x 2599 10 10 672 50% 2601 45%

25.0 is using mt=8; 25.0x was compiled with threading disabled so that I could use a common .craftyrc file with mt=8, but 25.0x ignores the mt=8 since there is no thread support compiled in.

Test started off looking like this:

1 Crafty-25.0x 2611 22 22 132 53% 2589 42%
2 Crafty-25.0 2589 22 22 132 47% 2611 42%

then

1 Crafty-25.0x 2603 11 11 531 51% 2597 44%
2 Crafty-25.0 2597 11 11 531 49% 2603 44%

and as of right now

1 Crafty-25.0x 2600 9 9 816 50% 2600 45%
2 Crafty-25.0 2600 9 9 816 50% 2600 45%


I can say there is not a lot of splitting going on, since the min_split_depth default is 5. That means that with a 10 ply search and no extensions, it will only be able to split in the first 5 plies of the search. I'm going to let this run for 30K games since I have an outside project at my house; I'll check in later and post the results from time to time.

most recent is now

1 Crafty-25.0x 2600 8 8 911 50% 2600 47%
2 Crafty-25.0 2600 8 8 911 50% 2600 47%

This is running at an average speed of 1 game per second running 5 at a time. So 30,000 seconds to finish although I am not going to let it run that long.

current:

1 Crafty-25.0x 2605 8 8 1077 51% 2595 47%
2 Crafty-25.0 2595 8 8 1077 49% 2605 47%

So it is still not quite stable.
I think after 5,000 games it will be clearer. The error margins shown here seem to be 1 standard deviation, so multiply them by 2 to get 95% confidence. This is Bayeselo, right?

I have computed the effective speedups and NPS speedups on 4 cores for two Crafties separated by 10 years:

Crafty 18.12:
Effective Speedup: 2.6
NPS Speedup: 3.2

Crafty 23.6:
Effective Speedup: 3.1
NPS Speedup: 3.3
Yes, bayeselo.

I re-started with smpmin=0 to let it split anywhere, since time doesn't matter. If there is a difference, this should magnify it...

Re: Significant modification of Komodo SMP implementation

Post by bob »

Stopped it at this point:

1 Crafty-25.0x 2604 2 2 21686 51% 2596 48%
2 Crafty-25.0 2596 2 2 21686 49% 2604 48%


Going to re-run overnight just to be sure. This 8 Elo difference probably turns into a 4 Elo difference if not using self-play. But it is still interesting and I want to make sure it is not my time to get one of those 2-sigma random results...

Re: Significant modification of Komodo SMP implementation

Post by Laskos »

bob wrote:Stopped it at this point:

1 Crafty-25.0x 2604 2 2 21686 51% 2596 48%
2 Crafty-25.0 2596 2 2 21686 49% 2604 48%


Going to re-run overnight just to be sure. This 8 Elo difference probably turns into a 4 Elo difference if not using self-play. But it is still interesting and I want to make sure it is not my time to get one of those 2-sigma random results...
So, Crafty 25.0 seems to be narrowing just a bit, by 8 +/- 4 Elo points (2 SD). The correction to the effective speedup results you got earlier is very small, maybe 5% or so. I would be curious whether you can confirm that Crafty 24.1 is narrowing substantially, by some 25 Elo points as in my test.

Re: Significant modification of Komodo SMP implementation

Post by bob »

Laskos wrote:
bob wrote:Stopped it at this point:

1 Crafty-25.0x 2604 2 2 21686 51% 2596 48%
2 Crafty-25.0 2596 2 2 21686 49% 2604 48%


Going to re-run overnight just to be sure. This 8 Elo difference probably turns into a 4 Elo difference if not using self-play. But it is still interesting and I want to make sure it is not my time to get one of those 2-sigma random results...
So, Crafty 25.0 seems to be narrowing just a bit, by 8 +/- 4 Elo points (2 SD). The correction to the effective speedup results you got earlier is very small, maybe 5% or so. I would be curious whether you can confirm that Crafty 24.1 is narrowing substantially, by some 25 Elo points as in my test.

I am going to study this a bit. I ran it overnight again, and this time around it is a 6 Elo difference:

1 Crafty-25.0x 2603 3 3 6237 51% 2597 48%
2 Crafty-25.0 2597 3 3 6237 49% 2603 48%

Which points to some sort of problem. There is no reason for any sort of minor degradation, even if it is almost nothing. I'd always expect some minor gain, not loss, since there is always search overhead.

I will try the 24.1 test, since that big a change ought to be easier to find, and might give me a hint about 25.0...

I'll start that right now...

I was thinking about what is different between parallel and serial versions, and the only thing I can think of is the history counters and the trans/ref table. Crafty has a single shared history counter array, and that might slightly hurt overall since 8 threads are updating it everywhere. I have debated that many times, but it doesn't hurt speed whatsoever so I never really thought about it. But now I might do a per-thread set of history counters and see if that makes a difference, once the 24.1 vs 24.1 test runs... The new version uses an 8-byte hash entry and I generally use 256 MB for the hash table, so in a sd=10 match, I'd think 32M hash entries with a bucket size of 8 would probably be enough to avoid excessive overwriting. I might test with a much larger hash table also, to see if this is involved. Those are the only two things that come to mind as potential game-changers when comparing parallel to serial search.
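
A hypothetical sketch (not Crafty's actual code or data layout) of the per-thread history counter idea mentioned above: each search thread keeps and updates its own array, so the move-ordering information one thread relies on is never modified concurrently by the others. The from/to-square indexing and the depth-squared bonus are illustrative assumptions.

Code:

/* Hypothetical sketch of per-thread history counters (not Crafty's actual
 * data structures): one private array per search thread, indexed here by
 * from/to square and bumped with a depth-weighted bonus on a fail high. */
#include <stdio.h>

#define MAX_THREADS   8
#define HISTORY_SLOTS (64 * 64)

static int history[MAX_THREADS][HISTORY_SLOTS];

static void HistoryGood(int tid, int from, int to, int depth) {
  history[tid][from * 64 + to] += depth * depth;  /* reward the move that failed high */
}

static int HistoryScore(int tid, int from, int to) {
  return history[tid][from * 64 + to];            /* consulted when ordering quiet moves */
}

int main(void) {
  HistoryGood(0, 12, 28, 8);                      /* thread 0 records a good quiet move */
  HistoryGood(1, 12, 28, 6);                      /* thread 1 keeps its own, separate count */
  printf("thread 0: %d  thread 1: %d\n", HistoryScore(0, 12, 28), HistoryScore(1, 12, 28));
  return 0;
}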