Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

I've finished "SIMDWorkBuffer". This is a fundamental SIMD data structure that I need for the rest of my code, so I spent the last week getting it working. It's... very short, but I learned a lot writing it; my first few attempts were completely broken. Without further ado...

Code: Select all

template <class T>
class SIMDWorkBuffer{
    public:
    static constexpr uint32_t CAPACITY = 128;
    T buffer[CAPACITY];   // ring buffer of pending work items
    uint32_t lock;        // spinlock guarding head/size updates
    uint32_t head;        // index of the oldest buffered item
    uint32_t size;        // number of items currently buffered

    __device__ void Init();

    // If addP is true, bank toAdd into the buffer; whenever 64+ items
    // have accumulated, run thunk across all 64 lanes at once.
    template <class Functor>
    __device__ void addAndWork(bool addP, T toAdd, Functor thunk);

    // Drain the remaining (fewer than 64) items through thunk.
    template <class Functor>
    __device__ void finishWork(Functor thunk);
};

template <class T>
__device__ void SIMDWorkBuffer<T>::Init(){
    if(hipThreadIdx_x == 0){
        head = 0;
        size = 0;
        lock = 0;
    }
    __syncthreads();
}

template <class T>
template <class Functor>
__device__ void SIMDWorkBuffer<T>::addAndWork(bool addP, T toAdd, Functor thunk)
{
    Spinlock_Lock_shared(lock);
    if(addP){
        // __ballotToIdx gives this lane's slot among the lanes adding this call
        buffer[(head + size + __ballotToIdx(addP)) % CAPACITY] = toAdd;
    }
    uint32_t numAdding = __popcll(__ballot64(addP));
    if(__ockl_activelane_u32() == 0){
        size += numAdding;   // one lane publishes the new count
    }
    __threadfence_block();

    if(size >= 64){
        // A full wave's worth of work is buffered: all 64 lanes run the thunk
        thunk(buffer[(head+__ockl_activelane_u32())%CAPACITY]);
        if(__ockl_activelane_u32() == 0){
            head = (head + 64) % CAPACITY;
            size -= 64;
        }
    }
    Spinlock_Unlock_shared(lock);
}

template <class T>
template <class Functor>
__device__ void SIMDWorkBuffer<T>::finishWork(Functor thunk)
{
    uint32_t myIdx = __ockl_activelane_u32();
    if(myIdx < size){   // only the first "size" lanes have leftover work
        thunk(buffer[(head + myIdx) % CAPACITY]);
    }
}

I've got some AMD-specific code here implementing a highly optimized "ballotToIdx" primitive:

Code: Select all

__device__ uint32_t __ballotToIdx(bool p){
    // Exclusive prefix-count of the ballot: how many lower-numbered lanes
    // have p set, computed with GCN's mbcnt_lo/mbcnt_hi instructions.
    uint64_t val = __ballot64(p);
    return __mbcnt_hi((uint32_t)(val>>32), __mbcnt_lo((uint32_t)val, 0));
}
The "ballotToIdx" function compiles down into 2 assembly instructions in practice (!!), so its super efficient on GCN. I expect it to also be efficient on NVidia (probably less than 5 instructions), but I'm not as good with NVidia PTX (nor do I have an NVidia GPU, so no can do there).

As for sample code of how to use it...

Code: Select all

__global__ void bufferTest(uint32_t* returns, uint32_t* head){
    __shared__ SIMDWorkBuffer<uint32_t> buffer;   // workgroup-shared instance
    buffer.Init();

    auto job =
        [=](uint32_t val){
            returns[*head + __ockl_activelane_u32()] = (hipThreadIdx_x << 16) | val;
            uint32_t numAdded = __popcll(__activemask());
            if(__ockl_activelane_u32() == 0){
                *head += numAdded;
            }
            __threadfence_block();
        };


    for(int i=0; i<128; i++){
        uint32_t myvalue = (i * 64) + hipThreadIdx_x; // Iterating over [0 to 8192>
        buffer.addAndWork(
            (((myvalue % 3) == 0) || ((myvalue % 5) == 0)),
            myvalue,
            job);
    }
    buffer.finishWork(job);
}
I'm frankly surprised that the lambda function works. But yeah, it's super nice.

This might be a bit weird for some people to read, but what's going on is FizzBuzz: I'm adding every number that satisfies (num % 3 == 0) OR (num % 5 == 0). The famous FizzBuzz pattern: https://en.wikipedia.org/wiki/Fizz_buzz

I am storing the value "(hipThreadIdx_x << 16) | val", which packs the current thread index together with the FizzBuzz value "val". By printing the thread index, I can prove how much GPU utilization is happening. FizzBuzz would normally have ~47% utilization (7 of every 15 numbers match), but because of the magic of the SIMDWorkBuffer, I've achieved near 100% utilization here.

The function which wants to run with 100% utilization is the lambda functor (declared as "auto job = [=] ...."). I take advantage of the [=] capture to implicitly pass the returns[] array and the head pointer into the closure. In any case, "job" is the task that I want to execute at 100% utilization.

buffer.addAndWork(predicate, value, Functor) is the key to this whole interface. In effect, I'm running the pseudocode "if(predicate) Functor(value)". However, that naive implementation would only get ~47% utilization on FizzBuzz (and less on other code). By buffering items until a full wave's worth is available, I can cheat my way to 100% utilization.
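
To make the contrast concrete, here's the naive pattern next to the buffered one (a sketch using the names from the kernel above):

Code: Select all

// Naive: lanes whose predicate is false sit idle while the others run the
// job, so on FizzBuzz only ~47% of lanes do useful work.
if (predicate) job(myvalue);

// Buffered: each lane banks its value (if any) into the SIMDWorkBuffer; the
// job only runs once 64 values accumulate, so all 64 lanes execute together.
buffer.addAndWork(predicate, myvalue, job);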

All in all, output items 0 through 3775 were produced at 100% utilization. The last batch, 3776 through 3822, only had 47/64 utilization, but that happened just once, in the final finishWork call. I expect the rest of my code to be highly efficient thanks to this tool.

To prove this works as I intend, I have some code which prints the data...

Code: Select all

// Ignoring setup: 
    hipLaunchKernelGGL(bufferTest, dim3(1), dim3(64), 0, 0, rets, head);

    hipMemcpy(localRets, rets, sizeof(uint32_t) * 8192, hipMemcpyDefault);
    hipMemcpy(&localHead, head, sizeof(uint32_t), hipMemcpyDefault);

    using namespace std;
    for(int i=0; i<localHead; i++){
        cout << i << "  " << (localRets[i] >> 16) << "  " << (localRets[i]&0xFFFF) << endl;
    }
The output is:

Code: Select all

0  0  0
1  1  3
2  2  5
3  3  6
4  4  9
5  5  10
6  6  12
7  7  15
8  8  18
9  9  20
10  10  21
11  11  24
12  12  25
13  13  27
14  14  30
15  15  33
16  16  35
17  17  36
18  18  39
19  19  40
20  20  42
21  21  45
22  22  48
23  23  50
24  24  51
25  25  54
26  26  55
27  27  57
28  28  60
29  29  63
30  30  65
31  31  66
32  32  69
33  33  70
34  34  72
35  35  75
36  36  78
37  37  80
38  38  81
39  39  84
40  40  85
41  41  87
42  42  90
43  43  93
44  44  95
45  45  96
46  46  99
47  47  100
48  48  102
49  49  105
50  50  108
51  51  110
52  52  111
53  53  114
54  54  115
55  55  117
56  56  120
57  57  123
58  58  125
59  59  126
60  60  129
61  61  130
62  62  132
63  63  135
64  0  138
65  1  140
66  2  141
67  3  144
68  4  145
69  5  147
70  6  150
71  7  153
72  8  155
73  9  156
74  10  159
75  11  160
76  12  162
77  13  165
78  14  168
79  15  170
80  16  171
81  17  174
82  18  175
83  19  177
84  20  180
85  21  183
86  22  185
87  23  186
88  24  189
89  25  190
90  26  192
91  27  195
92  28  198
93  29  200
94  30  201
95  31  204
96  32  205
97  33  207
98  34  210
99  35  213
100  36  215
101  37  216
102  38  219
103  39  220
104  40  222
105  41  225
106  42  228
107  43  230
108  44  231
109  45  234
110  46  235
111  47  237
112  48  240
113  49  243
114  50  245
115  51  246
116  52  249
117  53  250
118  54  252
119  55  255
120  56  258
121  57  260
122  58  261
123  59  264
124  60  265
125  61  267
126  62  270
127  63  273

...

Etc. etc. 

...

3775  63  8090
3776  0  8091
3777  1  8094
3778  2  8095
3779  3  8097
3780  4  8100
3781  5  8103
3782  6  8105
3783  7  8106
3784  8  8109
3785  9  8110
3786  10  8112
3787  11  8115
3788  12  8118
3789  13  8120
3790  14  8121
3791  15  8124
3792  16  8125
3793  17  8127
3794  18  8130
3795  19  8133
3796  20  8135
3797  21  8136
3798  22  8139
3799  23  8140
3800  24  8142
3801  25  8145
3802  26  8148
3803  27  8150
3804  28  8151
3805  29  8154
3806  30  8155
3807  31  8157
3808  32  8160
3809  33  8163
3810  34  8165
3811  35  8166
3812  36  8169
3813  37  8170
3814  38  8172
3815  39  8175
3816  40  8178
3817  41  8180
3818  42  8181
3819  43  8184
3820  44  8185
3821  45  8187
3822  46  8190
The above printout proves the 100% utilization, since all 64 threads are cooperating on everything (except the final batch, where only 47 threads finish off the FizzBuzz).

Keeping the GPU at 100% utilization is one of the hardest parts of this research project. I've been mulling over this data structure for the past week, because the pattern keeps coming up again and again in my code. I'm glad to have finally gotten it down to its most fundamental form. A buffer of size 128 is all I need: there can never be more than 63 elements left in the buffer between calls (at 64 or more, the job/thunk executes and consumes 64 tasks), and at most 64 elements can be added per call (waveSize == 64 on Vega64), for a worst case of 63 + 64 = 127 elements.
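
As a sanity check on that arithmetic, here's the capacity invariant spelled out (the static_assert is my own sketch, not part of the real code):

Code: Select all

// At most 63 leftovers (64+ triggers the thunk) plus at most 64 new items
// (one per lane on a 64-wide wave): 63 + 64 = 127, which fits in 128.
static_assert(63 + 64 < SIMDWorkBuffer<uint32_t>::CAPACITY,
              "buffer must hold one leftover batch plus one full wave");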
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

https://dl.acm.org/citation.cfm?id=2807651

Too many ideas, not enough time. The above paper suggests a "frontier splitting" methodology for work-efficient depth-first searches. If my current idea doesn't work out, then maybe I'll try that paper's methodology instead.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

The topic until now has been about how to explore Alpha-Beta nodes in parallel through the "FutureExpression" data structure. However, I have yet to describe how to traverse the AB-tree (aside from vague "scheduler" talk). So let's talk scheduling.

The overall goal is to build a SIMD depth-first search capable of running on a GPU. I've figured out a pattern which looks like it fits the bill. Let's start with alpha-beta(P, Alpha=-inf, Beta=+inf, depth=4). The root node starts off lonely.

Image

With regards to the machine: let's assume a SIMD4 machine, which is small enough to calculate by hand. SIMD4 in practice matches 128-bit SSE instructions, but GPUs are much wider: NVidia's native width is SIMD32, while AMD GCN's is SIMD64. In practice, I'll use __syncthreads() and a workgroup (or CUDA block) of size 1024 to emulate SIMD1024.
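
To sketch what I mean by "emulating SIMD1024" (a hypothetical kernel of my own; Node and expand() are stand-ins, not my real code):

Code: Select all

#include <hip/hip_runtime.h>

struct Node { int position; };                      // hypothetical node payload
__device__ void expand(Node* n){ n->position += 1; } // stand-in for real work

// A 1024-thread workgroup acting as one SIMD1024 machine: each thread is one
// lane, and each __syncthreads() is one lock-step "instruction" boundary.
__global__ void simd1024Step(Node** lanes){
    Node* myNode = lanes[hipThreadIdx_x];   // this lane's slot (may be NULL)
    if (myNode) expand(myNode);             // empty lanes simply predicate off
    __syncthreads();                        // all 1024 lanes advance together
}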

The SIMD4 machine will be loaded as (P, Null, Null, Null). There's not enough work in the first step to fully load a SIMD4 machine, but don't worry, more nodes will be coming in short order.

Image

Remember that alpha and beta are complex data structures called FutureExpressions. They are the key to traversing the nodes in parallel, even when we don't yet know the specific values being computed. Any value starting with the letter "F", such as "F1", will be filled in later.

P3-beta is [inf - F1 max 2 max -], which reads as -max(2, max(F1, -infinity)).
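
To spell out that notation, here's how the postfix program evaluates once F1 binds to a concrete value (an illustrative host-side sketch of mine, not the engine's actual FutureExpression structure):

Code: Select all

#include <algorithm>
#include <cmath>
#include <vector>

// Evaluate P3-beta's program [inf - F1 max 2 max -] for a bound F1.
double evalP3Beta(double f1){
    std::vector<double> st;
    st.push_back(INFINITY);              // inf
    st.back() = -st.back();              // -    : stack is { -inf }
    st.push_back(f1);                    // F1
    double a = st.back(); st.pop_back();
    st.back() = std::max(st.back(), a);  // max  : { max(F1, -inf) }
    st.push_back(2.0);                   // 2
    double b = st.back(); st.pop_back();
    st.back() = std::max(st.back(), b);  // max  : { max(2, max(F1, -inf)) }
    st.back() = -st.back();              // -
    return st.back();                    // == -max(2, max(F1, -inf))
}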

P1, P2, and P3 will then be expanded in parallel for step 2. The SIMD4 machine loads (P1, P2, P3, NULL) to execute and expand.

Image

Things start to get interesting here. P22 may not be visited in a sequential Alpha-Beta search. It will only be visited if alpha < beta (or in this case: F1 < -F21). Because F1 and F21 are still unknown, I've labeled the P22 node as "Speculative".

We finally have 9 children to evaluate, which means all SIMD4 lanes can be kept active. The left-most nodes (P11, P12, P13, and P21) are explored first.

Image
Image

I don't think there's anything "special" to learn from the step above. Moving on to Step 4:

Image
Image

Here, we've reached the target depth of 4 and have to start evaluating the heuristic. The heuristic "game" I use for these hand simulations is

hash(Position) = ((Position * 31415) / 1000) % 100

implemented in Python:

Code: Select all

def hash(x):
	return ((x * 31415) // 1000) % 100
This returns a 2-digit number: 0-49 are positive, while 50-99 represent negative numbers in 10's complement (the base-10 analogue of 2's complement). Ex: 99 is equivalent to -1, while 51 is equivalent to -49.
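
Put another way, decoding the 2-digit score into a signed value looks like this (a tiny helper of my own, just for clarity):

Code: Select all

// 50-99 wrap around to negatives: decode(99) == -1, decode(51) == -49.
int decodeScore(int h){
    return (h >= 50) ? (h - 100) : h;
}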

Since we've reached the bottom, we can evaluate these nodes, as well as propagate their values upwards.

Image
Image

Note that hash(1113) == 64, which in this 10's-complement encoding stands for -36.
These futures can now be bound to the next level above.

Image
Image

Let's start with P112. Only one of P112's children was completed, so only one of the futures (F1121) gets bound at this step.

P111, on the other hand, had all 3 children evaluated, so we can complete its evaluation function and continue propagating upwards.

Note that because P1111, P1112, P1113, and P1121 all finished execution, their associated resources (memory, futures, etc.) can safely be freed at this point. While not necessary for understanding the algorithm, this step is important for memory management and optimization.

Image
Image

Remember that P111's evaluation function was [2 33 -36 max -], which is reverse-polish notation for -max(2, 33, -36) == -33. As such, P111 reports to P11 that its value is -33.

But P11 is not the only node that needs updating. P111's younger brothers, P112 and P113, also need updating: P112-Beta and P113-Beta both contain F111. Once all of these nodes are updated, P111 is complete and can be deleted from the tree (and its associated resources reclaimed).

--------------

Optional Step: lower bounds and upper bounds can be propagated up and across the tree. This doesn't help a SIMD system like a GPU, but it would probably help a classic CPU system or a cluster.

Consider P11's current evaluation after Step 7: [-33 F112 F113 max -], which stands for the expression -max(-33, F112, F113). Since max(-33, F112, F113) >= -33, negating gives the bound F11 <= 33. The only reason to propagate lower or upper bounds is to evaluate Beta-cutoff (the situation where Alpha >= Beta) more quickly.

If LowerBound(Alpha) >= UpperBound(Beta) for some node, you can perform Beta-cutoff. Instead of evaluating the whole node (and its children), you only have to wait for Beta to be fully bound, and then immediately return Beta.
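
A minimal sketch of that interval propagation (my own illustration; the names are hypothetical):

Code: Select all

#include <algorithm>
#include <cmath>
#include <vector>

struct Bounds { double lo, hi; };   // lo <= value <= hi

// Bounds of -max(children): take the max of the child intervals,
// then negation flips and swaps the endpoints.
Bounds negMaxBounds(const std::vector<Bounds>& kids){
    double lo = -INFINITY, hi = -INFINITY;
    for (const Bounds& k : kids){
        lo = std::max(lo, k.lo);
        hi = std::max(hi, k.hi);
    }
    return { -hi, -lo };
}

// Beta-cutoff: safe as soon as Alpha's lower bound meets Beta's upper bound.
bool canCutoff(const Bounds& alpha, const Bounds& beta){
    return alpha.lo >= beta.hi;
}

// For P11 after Step 7: children { {-33,-33}, {-inf,+inf}, {-inf,+inf} }
// give negMaxBounds == { -inf, 33 }, i.e. exactly "F11 <= 33".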

--------------

That's all for now. Maybe if people are interested, I can talk more about the specific data-structure I'm using to perform these operations. But for now, the conceptual parallel-search methodology is what's important to get across.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

dragontamer5788 wrote: Wed Nov 06, 2019 6:14 am Things start to get interesting here. P22 may not be visited in a sequential Alpha-Beta search. It will only be visited if alpha < beta (or in this case: F1 < -F21). Because F1 and F21 are still unknown, I've labeled the P22 node as "Speculative".
Your web-browser may not display the image here very well. I've manually cropped and zoomed into the P22 and P23 area of the graph here:

Image

P22-Alpha is: max(F1, -infinity) == F1.

P22-Beta is: -max(F21, -infinity) == -F21.

P22-Alpha >= P22-Beta when F1 >= -F21.

-------

Note: earlier in this thread I called P22 "Beta-blocked". After some simulations and more thought, I've determined that it is wholly unnecessary to block the execution of P22!! Instead, you should simply prioritize the execution of the left-most nodes of the tree, which is quite natural with a stack-like data structure (see the sketch below).
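
A quick sketch of why a stack gives that prioritization for free (hypothetical host-side illustration; Node is a stand-in):

Code: Select all

#include <vector>

struct Node { Node* child[3]; int nChildren; };   // hypothetical node

// LIFO ordering: push children right-to-left so the left-most child sits on
// top and gets expanded first; speculative right-side nodes sink to the bottom.
void pushChildren(std::vector<Node*>& stack, Node* n){
    for (int i = n->nChildren - 1; i >= 0; --i)
        stack.push_back(n->child[i]);
}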

Consider if P22 were speculatively executed. Its children would be the following:

Image

While alpha and beta have flipped for P221, it's still equivalent: P221 will beta-cutoff on alpha >= beta, i.e. on the condition F21 >= -F1.

P222 has beta-cutoff on F21 >= -max(F221, F1)

P223 has beta-cutoff on F21 >= -max(F222, F221, F1)

If F21 >= -F1, then all three nodes (P221, P222, and P223) would enter beta-cutoff, and further work would be avoided. Even if F221 and F222 were still unknown, the value -max(F222, F221, F1) has -F1 as its upper bound, since max(F222, F221, F1) >= F1 and negating flips the inequality.