RMO - Randomized Move Order - yet another Lazy SMP derivate

dragontamer5788 · Post by **dragontamer5788** » Thu Jan 09, 2020 9:53 pm

Dann Corbit wrote: ↑Wed Jan 08, 2020 6:29 pm Doesn't this indicate that the reductions are flawed? Otherwise, where does the Elo increase come from?

If a fatter tree is better then maybe the reductions are incorrect or too extreme.

On the other hand, it seems possible that the small extra randomness introduced by the thread local storage might indicate a.phenomenon where k threads in n time are actually superior to one thread in k"n time, which is really interesting. But it's clearly not enough to compensate for SMP loss. I guess the next question is, can this effect be enhanced?

The most common source of "super-linear scaling" is that L1 and L2 caches increase for each additional hardware thread you put on a task.

Lets take the AMD Threadripper 3970x (32-core) as an example. 1x thread will only have 512kB of L2 cache. But max out the system with 64-threads, and they will effectively be searching across 16MBs of L2 cache.

EDIT: Here's a good discussion on superlinear speedups: https://www.drdobbs.com/cpp/going-super ... 542?pgno=1

There are two main ways to enter that rarefied realm:
* Do disproportionately less work.
* Harness disproportionately more resources.

In particular, page 2:

But something interesting happens when the distribution of solutions is not uniform, because then averages no longer apply the same way. To illustrate what happens when matches are clustered, consider Figure 4: We still have M matches in the collection, as before, but let's assume that they are distributed, not evenly throughout, but with higher probability in some regions.

For simple sequential search, this is no big news: It fares neither better nor worse, and on average still takes as long to find a match as it would if the matches were distributed uniformly throughout the collection. Even if the sequential algorithm writer knows in advance that matches tend to be clustered, it's hard to see how to take advantage of that without also knowing in advance where the clusters are.

For the simple parallel search, on the other hand, clustering is great news, because a parallel search directly takes advantage of any locality there may be in the matches. Consider: Any clustering effect means that some subrange workers will end up discovering that they are impoverished, having relatively fewer matches, or none at all, in their assigned search areas. But other workers will enjoy a correspondingly higher number of matches in their areas, which means—and this is the important part—they will have a higher probability of finding a match each time they test an element, which in turn means that on average they will need to visit fewer elements to find a match. Like Forty-Niners panning and digging for nonuniformly distributed gold in California (near Sutter's Mill; no relation), many workers will find little or nothing, while a lucky few, or just one, will hit the Motherlode. And "just one" is all it takes; any clustering at all is immediately helpful because the time to perform the total search is the minimum of the workers' runtimes—we're done as soon as any worker finds a match.

[Superlinear Speedup from clustered results] isn't limited to simple array-like collections. Many kinds of collections naturally enjoy clustered matches. For example, when we traverse a tree or graph using depth-first search, the parallel version of the algorithm typically has each parallel worker searching a localized subtree, and in many problem domains the solutions tend to be clustered in subtrees. When that happens, one worker will have a disproportionately better chance of being assigned an area with more matches, and we can expect superlinear improvement. Example applications range from searching game trees (for example, a bad move at one node in the tree can cause the entire subtree below it containing subsequent moves to have no good solutions), to CAD and VLSI circuit layout algorithms (see [2]).

dragontamer5788 · Post by **dragontamer5788** » Thu Jan 09, 2020 10:39 pm

The above Dr. Dobbs journal article references this paper: https://www.researchgate.net/publicatio ... ace_Search

When N processors perform depth-first search on disjoint parts of a state space tree to find a solution, the speedup can be superlinear (i.e., > N) or sublinear (i.e., N) depending upon when a solution is first encountered in the space by one of the processors. It may appear that on the average, the speedup would be either linear or sublinear. Using an analytical model, we show that if the search space has more than one solution and if these solutions are randomly distributed in a relatively small region of the search space, then the average speedup in parallel depth-first search can be superlinear. If all the solutions (one or more) are uniformly distributed over the whole search space, then the average speedup is linear. This model is validated by our experiments on synthetic state-space trees and the 15-puzzle problem. The same model predicts average superlinear speedup in parallel best-first branch-and-bound algorithms on suitable problems.

It appears that average speedup can be superlinear in the average case, due to the clustering effect.

Michel · Post by **Michel** » Fri Jan 10, 2020 8:51 am

dragontamer5788 wrote: ↑Thu Jan 09, 2020 10:39 pm The above Dr. Dobbs journal article references this paper: https://www.researchgate.net/publicatio ... ace_Search

When N processors perform depth-first search on disjoint parts of a state space tree to find a solution, the speedup can be superlinear (i.e., > N) or sublinear (i.e., N) depending upon when a solution is first encountered in the space by one of the processors. It may appear that on the average, the speedup would be either linear or sublinear. Using an analytical model, we show that if the search space has more than one solution and if these solutions are randomly distributed in a relatively small region of the search space, then the average speedup in parallel depth-first search can be superlinear. If all the solutions (one or more) are uniformly distributed over the whole search space, then the average speedup is linear. This model is validated by our experiments on synthetic state-space trees and the 15-puzzle problem. The same model predicts average superlinear speedup in parallel best-first branch-and-bound algorithms on suitable problems.
It appears that average speedup can be superlinear in the average case, due to the clustering effect.

I am not so sure the article actually shows what it promises in the beginning. If I understand correctly, it does not invalidate the serialization argument which "proves" that super linear scaling is impossible (at first I thought that would be the point). It simply argues that the serialized algorithm will typically suffer from a bunch of inefficiencies.

dragontamer5788 · Post by **dragontamer5788** » Fri Jan 10, 2020 7:45 pm

Michel wrote: ↑Fri Jan 10, 2020 8:51 am
dragontamer5788 wrote: ↑Thu Jan 09, 2020 10:39 pm The above Dr. Dobbs journal article references this paper: https://www.researchgate.net/publicatio ... ace_Search

When N processors perform depth-first search on disjoint parts of a state space tree to find a solution, the speedup can be superlinear (i.e., > N) or sublinear (i.e., N) depending upon when a solution is first encountered in the space by one of the processors. It may appear that on the average, the speedup would be either linear or sublinear. Using an analytical model, we show that if the search space has more than one solution and if these solutions are randomly distributed in a relatively small region of the search space, then the average speedup in parallel depth-first search can be superlinear. If all the solutions (one or more) are uniformly distributed over the whole search space, then the average speedup is linear. This model is validated by our experiments on synthetic state-space trees and the 15-puzzle problem. The same model predicts average superlinear speedup in parallel best-first branch-and-bound algorithms on suitable problems.
It appears that average speedup can be superlinear in the average case, due to the clustering effect.
I am not so sure the article actually shows what it promises in the beginning. If I understand correctly, it does not invalidate the serialization argument which "proves" that super linear scaling is impossible (at first I thought that would be the point). It simply argues that the serialized algorithm will typically suffer from a bunch of inefficiencies.

Superlinear is impossible from the perspective of "Take the best sequential algorithm, and compare it against the best parallel algorithm".

But from what they do in the paper, which is:

1. Code a sequential DFS-search on the 15-puzzle problem.
2. Code a naive parallel version of the search (highly based on the original DFS search)

#2 will scale superlinearly compared to #1. Which seems to be hugely similar to the discussion going on in this thread.

smatovic · Post by **smatovic** » Wed Jun 24, 2020 7:28 am

Just a further remark...

RMO will probably not profit from SMT, Hyperthreading, with an overall NPS
speedup of ~1.5x, my experiments with Zeta on gpu showed at least 1.8x is in
need.

--
Srdja

RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate