Re: RMO - Randomized Move Order - yet another Lazy SMP derivate
Posted: Sat Jan 04, 2020 6:12 pm
I would like to add that I have been using a variant of DTS for a long time. The primary difference between Crafty's parallel search and Cray Blitz's search (the original DTS search):
In Cray Blitz, when a thread became idle, it could look at the complete tree data for all other threads and find the best place to help. It could use an existing split point, or it could create a brand-new one without the original thread knowing until it backed up to that node. Very efficient, but it required the mentioned iterative (rather than recursive) search, since a recursive search cannot back up or jump forward in the tree independently of the call stack. So the search had to be iterative (and since FORTRAN back then did not support recursive/reentrant code anyway, there was no choice).
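To make the call-stack point concrete, here is a toy sketch (my own illustration, not Cray Blitz code) of an iterative negamax whose per-ply state lives in a plain array instead of on the call stack. Because the entire search path sits in `stack[]`, another thread could in principle inspect any ply and attach there, which is the property DTS depends on. The tiny fixed game tree and all names are hypothetical.

```c
/* Toy iterative negamax: per-ply state in a globally visible array,
   so the whole current search path can be examined from outside,
   which a recursive search's call stack does not allow. */

#define MAX_PLY 8

/* A tiny fixed game tree: every node has 2 children; 'path' encodes
   the moves taken from the root, and leaves get arbitrary scores. */
static int leaf_value(int path)
{
    return (path * 37) % 11 - 5;   /* deterministic dummy scores */
}

struct ply_state {
    int move;    /* next child to try (0 or 1)   */
    int best;    /* best score found at this ply */
    int path;    /* move path down to this node  */
};

static struct ply_state stack[MAX_PLY + 1];  /* visible tree state */

static int iterative_negamax(int depth)
{
    int ply = 0;
    stack[0] = (struct ply_state){ .move = 0, .best = -1000, .path = 0 };

    for (;;) {
        int done_score;
        int have_score = 0;

        if (ply == depth) {                 /* leaf: score it */
            done_score = leaf_value(stack[ply].path);
            have_score = 1;
        } else if (stack[ply].move < 2) {   /* descend to next child */
            int mv = stack[ply].move++;
            stack[ply + 1] = (struct ply_state){
                .move = 0, .best = -1000,
                .path = stack[ply].path * 2 + mv
            };
            ply++;
        } else {                            /* all children searched */
            done_score = stack[ply].best;
            have_score = 1;
        }

        if (have_score) {                   /* back the score up by hand */
            if (ply == 0)
                return done_score;
            ply--;
            if (-done_score > stack[ply].best)
                stack[ply].best = -done_score;
        }
    }
}
```

The manual "back the score up" step is exactly what the call stack does implicitly in a recursive search; doing it by hand is what makes the messiness, but also the flexibility, of the iterative form.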
In Crafty, since it is not possible to jump around in the tree for the call-stack reason cited above, I took a different angle on the same problem. At "good" split points in a serial search thread, I have that thread emit what I call a "gratuitous split point": a split point with no helpers. But it is globally visible, and any thread can attach to it without needing the call-stack data for plies prior to that split point. So any thread can jump in at any of these split points, or, if it is not happy with any of them, it can request that each active thread emit such a split point "right now" and then choose one of those. I limit how many of these gratuitous split points are emitted to save memory and wasted execution cycles (creating one is not free, but it is also not CPU-intensive, since I spent a lot of time minimizing the split-point data).
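A minimal sketch of the idea (the names and layout are mine, not Crafty's actual code): a gratuitous split point is just a globally visible block holding everything a helper needs at that node, so an idle thread can attach and pull moves from the shared move list without any of the emitting thread's call-stack state.

```c
#include <pthread.h>
#include <string.h>

/* Hypothetical gratuitous-split-point table, kept deliberately small
   to bound the memory and emit cost, as described above. */

#define MAX_SPLIT_POINTS 4
#define MAX_MOVES 32

struct split_point {
    pthread_mutex_t lock;
    int in_use;              /* slot occupied?                    */
    int helpers;             /* threads attached (0 when emitted) */
    int moves[MAX_MOVES];    /* remaining moves at this node      */
    int n_moves, next_move;  /* shared move-list cursor           */
};

static struct split_point sp_table[MAX_SPLIT_POINTS];

/* The searching thread emits a split point with no helpers yet. */
static struct split_point *emit_split_point(const int *moves, int n)
{
    for (int i = 0; i < MAX_SPLIT_POINTS; i++) {
        if (!sp_table[i].in_use) {
            struct split_point *sp = &sp_table[i];
            pthread_mutex_init(&sp->lock, NULL);
            memcpy(sp->moves, moves, n * sizeof(int));
            sp->n_moves = n;
            sp->next_move = 0;
            sp->helpers = 0;
            sp->in_use = 1;
            return sp;
        }
    }
    return NULL;   /* table full: just keep searching serially */
}

/* Any idle thread can attach itself as a helper... */
static void attach_helper(struct split_point *sp)
{
    pthread_mutex_lock(&sp->lock);
    sp->helpers++;
    pthread_mutex_unlock(&sp->lock);
}

/* ...and grab the next untried move; -1 means nothing left. */
static int next_move(struct split_point *sp)
{
    int mv = -1;
    pthread_mutex_lock(&sp->lock);
    if (sp->next_move < sp->n_moves)
        mv = sp->moves[sp->next_move++];
    pthread_mutex_unlock(&sp->lock);
    return mv;
}
```

The key design point is that the emitting thread pays the setup cost once, up front, and helpers need nothing beyond this block to contribute.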
In performance tests, I cannot tell much difference between the two approaches (DTS and almost-DTS). I tested this for tens of thousands of games on the cluster, using up to 8 threads/cores for each pair of programs, and found no significant Elo difference. That is why you don't see an iterative Crafty: it was MUCH messier than a purely recursive search, for no measurable gain.
I would not begin to say this is the ultimate parallel search, but several people (including a couple here) tested it through 16 cores (and my own testing went to 32 cores) and found the parallel speedup to be really good. You will have to look back through old posts here to see the exact numbers; I really don't remember.
The only downside is that locks and memory contention do have an effect. During the last year I worked on the code heavily, I spent a lot of time removing as much serial (lock-protected) code as possible, which is why we saw the pretty efficient speedups through 32 cores. Beyond 32 it will definitely begin to show the effects of memory and lock contention. There are some options that would probably take it beyond 32, at least close to 64. I did not take the time, since I had no real way to measure the performance to see what tweaks were needed...
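One well-known example of removing a lock entirely is the "lockless hashing" trick Hyatt and Mann published for the transposition table: store the key XORed with the data instead of the key itself. A torn (half-written) entry then fails the signature check and is simply treated as a miss, so no mutex is needed around the table at all.

```c
#include <stdint.h>

/* Lockless transposition-table entry: instead of (key, data) we store
   (key ^ data, data).  If a concurrent writer tears the entry, the
   XOR check fails and the probe reports a miss rather than returning
   corrupt data -- no lock required on the hash table. */

struct tt_entry {
    uint64_t key_xor_data;   /* key ^ data, written without a lock */
    uint64_t data;
};

static void tt_store(struct tt_entry *e, uint64_t key, uint64_t data)
{
    e->data = data;
    e->key_xor_data = key ^ data;
}

/* Returns 1 on a valid hit and fills *data; 0 on miss or torn entry. */
static int tt_probe(const struct tt_entry *e, uint64_t key, uint64_t *data)
{
    uint64_t d = e->data;
    if ((e->key_xor_data ^ d) != key)
        return 0;            /* foreign or half-written entry: miss */
    *data = d;
    return 1;
}
```

An occasional false miss from a torn entry costs a tiny amount of search effort; a lock on every probe costs far more at high thread counts, which is the whole trade-off behind stripping out serial code.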
Clearly Lazy SMP has less lock contention, and the only real memory contention seems to be on the global hash table. This implies that it ought to continue to scale far beyond 64 cores, easily. But as I mentioned, at some point there is a crossover where Lazy will outperform DTS/pseudo-DTS, although I don't think the exact point is known. I would be willing to bet it is not below 32 cores, and probably not below 64, but it is there. A classic case of brute force winning out over cleverness, exactly what happened in the 70s when clever forward pruning gave way to full-width search: that change reduced my code size by 90% and played better. Lazy reduces the code size for the parallel part of the program significantly, which is a definite advantage for debugging.
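The structural simplicity of Lazy SMP can be sketched in a few lines (a hedged illustration, not any particular engine's code): every thread runs the same root search independently, helper threads differ only in their nominal depth, and the only shared state is the hash table (modelled here by a single counter).

```c
#include <pthread.h>

#define MAX_THREADS 8

static long shared_tt_accesses;   /* stand-in for the global hash table */
static pthread_mutex_t tt_lock = PTHREAD_MUTEX_INITIALIZER;

struct worker { int depth; int score; };

static int search_root(int depth)
{
    /* Placeholder "search": a real engine runs iterative deepening
       here, probing and storing the shared hash table as it goes. */
    pthread_mutex_lock(&tt_lock);
    shared_tt_accesses++;
    pthread_mutex_unlock(&tt_lock);
    return depth * 10;            /* dummy deterministic score */
}

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    w->score = search_root(w->depth);
    return NULL;
}

/* Launch n identical searchers; nothing is coordinated except the
   (implicit) sharing of the hash table. */
static int lazy_smp_search(int n_threads, int base_depth)
{
    pthread_t tid[MAX_THREADS];
    struct worker w[MAX_THREADS];

    if (n_threads > MAX_THREADS)
        n_threads = MAX_THREADS;
    for (int t = 0; t < n_threads; t++) {
        w[t].depth = base_depth + (t & 1);  /* helpers offset one ply */
        pthread_create(&tid[t], NULL, worker_main, &w[t]);
    }
    for (int t = 0; t < n_threads; t++)
        pthread_join(tid[t], NULL);
    return w[0].score;   /* helpers contribute only via the table */
}
```

Compare this with the split-point machinery above: there are no split points, no helper bookkeeping, and no per-node locks to debug, which is exactly the code-size and debugging advantage being described.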