Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Discussion of chess software programming and technical issues.

dragontamer5788
Posts: 98
Joined: Thu Jun 06, 2019 6:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 » Fri Aug 23, 2019 5:13 am

I've finished "SIMDWorkBuffer". This is a fundamental SIMD data-structure that I need for the rest of my code, so I finally spent the last week trying to get this thing working. It's... very short, but I learned a lot writing it. My first few attempts were actually completely broken. But without further ado...

Code:

template <class T>
class SIMDWorkBuffer{
    public:
    static constexpr uint32_t CAPACITY = 128;
    T buffer[CAPACITY];   // circular buffer of pending work items
    uint32_t lock;        // spinlock guarding the buffer
    uint32_t head;        // index of the oldest pending item
    uint32_t size;        // number of pending items

    __device__ void Init();

    // If addP is true, park toAdd in the buffer. Once a full wave's worth (64)
    // of items has accumulated, run thunk on 64 of them with every lane active.
    template <class Functor>
    __device__ void addAndWork(bool addP, T toAdd, Functor thunk);

    // Drain the leftovers: run thunk on the (fewer than 64) remaining items.
    template <class Functor>
    __device__ void finishWork(Functor thunk);
};

template <class T>
__device__ void SIMDWorkBuffer<T>::Init(){
    if(hipThreadIdx_x == 0){
        head = 0;
        size = 0;
        lock = 0;
    }
    __syncthreads();
}

template <class T>
template <class Functor>
__device__ void SIMDWorkBuffer<T>::addAndWork(bool addP, T toAdd, Functor thunk)
{
    Spinlock_Lock_shared(lock);
    if(addP){
        // each adding lane writes to its own slot, offset by its rank among the adders
        buffer[(head + size + __ballotToIdx(addP)) % CAPACITY] = toAdd;
    }
    uint32_t numAdding = __popcll(__ballot64(addP));
    if(__ockl_activelane_u32() == 0){
        size += numAdding;      // one lane bumps the shared count
    }
    __threadfence_block();

    if(size >= 64){
        // a full wave's worth of work is pending: every lane takes exactly one item
        thunk(buffer[(head+__ockl_activelane_u32())%CAPACITY]);
        if(__ockl_activelane_u32() == 0){
            head = (head + 64) % CAPACITY;
            size -= 64;
        }
    }
    Spinlock_Unlock_shared(lock);
}

template <class T>
template <class Functor>
__device__ void SIMDWorkBuffer<T>::finishWork(Functor thunk)
{
    uint32_t myIdx = __ockl_activelane_u32();
    if(myIdx < size){
        thunk(buffer[(head + myIdx) % CAPACITY]);
    }
}

I've got some AMD-specific code here for a highly-optimized "ballotToIdx" primitive:

Code:

__device__ uint32_t __ballotToIdx(bool p){
    uint64_t val = __ballot64(p);
    // mbcnt_lo / mbcnt_hi count the set ballot bits belonging to lanes below this one,
    // which is exactly this lane's index among the lanes where p is true
    return __mbcnt_hi((uint32_t)(val>>32), __mbcnt_lo((uint32_t)val, 0));
}
The "ballotToIdx" function compiles down into 2 assembly instructions in practice (!!), so its super efficient on GCN. I expect it to also be efficient on NVidia (probably less than 5 instructions), but I'm not as good with NVidia PTX (nor do I have an NVidia GPU, so no can do there).

As for sample code of how to use it...

Code:

__global__ void bufferTest(uint32_t* returns, uint32_t* head){
    __shared__ SIMDWorkBuffer<uint32_t> buffer;   // assumed declaration: one buffer per block, in shared memory
    buffer.Init();

    auto job =
        [=](uint32_t val){
            // pack the lane that executed this job together with the FizzBuzz value
            returns[*head + __ockl_activelane_u32()] = (hipThreadIdx_x << 16) | val;
            uint32_t numAdded = __popcll(__activemask());
            if(__ockl_activelane_u32() == 0){
                *head += numAdded;
            }
            __threadfence_block();
        };


    for(int i=0; i<128; i++){
        uint32_t myvalue = (i * 64) + hipThreadIdx_x; // iterating over [0, 8192)
        buffer.addAndWork(
            (((myvalue % 3) == 0) || ((myvalue % 5) == 0)),   // the FizzBuzz predicate
            myvalue,
            job);
    }
    buffer.finishWork(job);   // drain the leftovers (fewer than 64 items)
}
I'm frankly surprised that the lambda-function works. But yeah, it's super nice.

This might be a bit weird for some people to read, but what is going on is FizzBuzz: I'm adding every number that matches the property (num % 3 == 0) OR (num % 5 == 0). The famous FizzBuzz pattern: https://en.wikipedia.org/wiki/Fizz_buzz

I am storing the value "(hipThreadIdx_x << 16) | val", which is just the FizzBuzz value "val" packed together with the current thread index. By printing the thread index, I can show how much GPU utilization is actually happening. A naive FizzBuzz would run at roughly 47% utilization (7/15 of all numbers are divisible by 3 or 5), but because of the magic of the SIMDWorkBuffer, I've achieved near 100% utilization here.
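To turn that packed thread index into an actual utilization number, one could tally the distinct lane IDs in each group of 64 consecutive outputs. This is my own host-side sketch, assuming the localRets / localHead buffers from the verification code further down:

Code:

#include <cstdint>
#include <iostream>

// 64 distinct lanes in a group means that drain ran at 100% utilization
void tallyLanes(const uint32_t* localRets, uint32_t localHead){
    for(uint32_t g = 0; g * 64 < localHead; g++){
        uint64_t seen = 0;
        uint32_t count = (localHead - g * 64 < 64) ? (localHead - g * 64) : 64;
        for(uint32_t i = 0; i < count; i++){
            seen |= 1ull << (localRets[g * 64 + i] >> 16);   // lane ID lives in the high 16 bits
        }
        std::cout << "group " << g << ": " << __builtin_popcountll(seen) << "/64 lanes" << std::endl;
    }
}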

The function which wants to run at 100% utilization is the lambda functor (the "auto job = [=] ..." above). I take advantage of the [=] capture to implicitly pass the returns[] pointer and the head pointer into the closure. In any case, "job" is the hypothetical task that I want to execute at 100% utilization.

buffer.addAndWork(predicate, value, Functor) is the key to this whole interface. In effect, I'm running the pseudocode "if(predicate) Functor(value)". However, that naive implementation would only reach ~47% utilization on FizzBuzz (and less on other code). By buffering the matching values up, I can cheat my way to 100% utilization.
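To illustrate the difference (my own fragment, reusing myvalue, job, and buffer from the kernel above), the two forms of the loop body would look like:

Code:

// naive form: only the lanes where the predicate holds execute job(),
// so on FizzBuzz roughly 47% of the wave is busy inside job()
if(((myvalue % 3) == 0) || ((myvalue % 5) == 0)){
    job(myvalue);
}

// buffered form: matching values are parked in the SIMDWorkBuffer, and job()
// only runs once a full wave's worth (64) of them has accumulated, so all 64
// lanes execute job() together
buffer.addAndWork((((myvalue % 3) == 0) || ((myvalue % 5) == 0)), myvalue, job);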

All in all, output entries 0 through 3775 were produced at 100% utilization. The final entries, 3776 through 3822, only had 47/64 lanes active, but that partially-filled drain happens exactly once, in finishWork. I expect the rest of my code to be highly efficient thanks to this tool.
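Those counts line up with the arithmetic. A quick compile-time check of my own:

Code:

// counting the FizzBuzz hits in [0, 8192):
constexpr int mult3  = (8191 / 3)  + 1;   // 2731 multiples of 3 (including 0)
constexpr int mult5  = (8191 / 5)  + 1;   // 1639 multiples of 5
constexpr int mult15 = (8191 / 15) + 1;   //  547 multiples of 15 (counted twice above)
constexpr int hits   = mult3 + mult5 - mult15;   // 3823 total hits
static_assert(hits == 59 * 64 + 47, "59 full waves at 64/64, then one final group of 47");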

To prove this works as I intend, I have some code which prints the data...

Code:

// Ignoring setup: 
    hipLaunchKernelGGL(bufferTest, dim3(1), dim3(64), 0, 0, rets, head);

    hipMemcpy(localRets, rets, sizeof(uint32_t) * 8192, hipMemcpyDefault);
    hipMemcpy(&localHead, head, sizeof(uint32_t), hipMemcpyDefault);

    using namespace std;
    for(int i=0; i<localHead; i++){
        cout << i << "  " << (localRets[i] >> 16) << "  " << (localRets[i]&0xFFFF) << endl;
    }
The output is:

Code:

0  0  0
1  1  3
2  2  5
3  3  6
4  4  9
5  5  10
6  6  12
7  7  15
8  8  18
9  9  20
10  10  21
11  11  24
12  12  25
13  13  27
14  14  30
15  15  33
16  16  35
17  17  36
18  18  39
19  19  40
20  20  42
21  21  45
22  22  48
23  23  50
24  24  51
25  25  54
26  26  55
27  27  57
28  28  60
29  29  63
30  30  65
31  31  66
32  32  69
33  33  70
34  34  72
35  35  75
36  36  78
37  37  80
38  38  81
39  39  84
40  40  85
41  41  87
42  42  90
43  43  93
44  44  95
45  45  96
46  46  99
47  47  100
48  48  102
49  49  105
50  50  108
51  51  110
52  52  111
53  53  114
54  54  115
55  55  117
56  56  120
57  57  123
58  58  125
59  59  126
60  60  129
61  61  130
62  62  132
63  63  135
64  0  138
65  1  140
66  2  141
67  3  144
68  4  145
69  5  147
70  6  150
71  7  153
72  8  155
73  9  156
74  10  159
75  11  160
76  12  162
77  13  165
78  14  168
79  15  170
80  16  171
81  17  174
82  18  175
83  19  177
84  20  180
85  21  183
86  22  185
87  23  186
88  24  189
89  25  190
90  26  192
91  27  195
92  28  198
93  29  200
94  30  201
95  31  204
96  32  205
97  33  207
98  34  210
99  35  213
100  36  215
101  37  216
102  38  219
103  39  220
104  40  222
105  41  225
106  42  228
107  43  230
108  44  231
109  45  234
110  46  235
111  47  237
112  48  240
113  49  243
114  50  245
115  51  246
116  52  249
117  53  250
118  54  252
119  55  255
120  56  258
121  57  260
122  58  261
123  59  264
124  60  265
125  61  267
126  62  270
127  63  273

...

Etc. etc. 

...

3775  63  8090
3776  0  8091
3777  1  8094
3778  2  8095
3779  3  8097
3780  4  8100
3781  5  8103
3782  6  8105
3783  7  8106
3784  8  8109
3785  9  8110
3786  10  8112
3787  11  8115
3788  12  8118
3789  13  8120
3790  14  8121
3791  15  8124
3792  16  8125
3793  17  8127
3794  18  8130
3795  19  8133
3796  20  8135
3797  21  8136
3798  22  8139
3799  23  8140
3800  24  8142
3801  25  8145
3802  26  8148
3803  27  8150
3804  28  8151
3805  29  8154
3806  30  8155
3807  31  8157
3808  32  8160
3809  33  8163
3810  34  8165
3811  35  8166
3812  36  8169
3813  37  8170
3814  38  8172
3815  39  8175
3816  40  8178
3817  41  8180
3818  42  8181
3819  43  8184
3820  44  8185
3821  45  8187
3822  46  8190
The above printout proves the 100% utilization, since all 64 threads are actually cooperating on everything (except the final drain, where only 47 threads finish off the FizzBuzz).

Keeping the GPU at 100% utilization is one of the hardest tasks in this research project. I've been mulling over this data-structure for the past week, because this pattern keeps coming up again and again in my code. I'm glad to have finally gotten it down to its most fundamental form. A buffer of size 128 is the largest I need: there can never be more than 63 elements sitting in the buffer (otherwise the job / thunk would already have been executed, which consumes 64 tasks), and at most 64 elements can be added per call in the worst case (waveSize == 64 on Vega64), so the peak is 127.
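That capacity argument can be restated as a compile-time check (my own sketch):

Code:

// before a call to addAndWork, size <= 63 (any time size reaches 64, the thunk
// drains a full wave's worth).  One call adds at most 64 new entries, so the
// peak occupancy is 63 + 64 = 127, which must fit in CAPACITY.
static_assert(SIMDWorkBuffer<uint32_t>::CAPACITY >= 2 * 64 - 1,
              "buffer must hold a nearly-full wave plus one full wave of new entries");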

dragontamer5788
Posts: 98
Joined: Thu Jun 06, 2019 6:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 » Mon Sep 02, 2019 10:54 pm

https://dl.acm.org/citation.cfm?id=2807651

Too many ideas, not enough time. The paper above suggests a "frontier splitting" methodology for work-efficient depth-first searches. If my current idea doesn't work out, maybe I'll try that paper's methodology instead.
