Code: Select all
// A per-block ring buffer that batches sparse work items until a full
// wavefront's worth (64) is available, then runs them at full utilization.
template <class T>
class SIMDWorkBuffer {
public:
    static constexpr uint32_t CAPACITY = 128;
    T buffer[CAPACITY];   // ring buffer of pending work items
    uint32_t lock;        // spinlock guarding head/size
    uint32_t head;        // index of the oldest pending item
    uint32_t size;        // number of pending items

    __device__ void Init();

    // If addP is true, enqueue toAdd; whenever 64 items have
    // accumulated, run thunk on a full wavefront's worth of them.
    template <class Functor>
    __device__ void addAndWork(bool addP, T toAdd, Functor thunk);

    // Drain the (fewer than 64) leftover items.
    template <class Functor>
    __device__ void finishWork(Functor thunk);
};
template <class T>
__device__ void SIMDWorkBuffer<T>::Init() {
    // One thread resets the buffer state; everyone waits for it.
    if (hipThreadIdx_x == 0) {
        head = 0;
        size = 0;
        lock = 0;
    }
    __syncthreads();
}
template <class T>
template <class Functor>
__device__ void SIMDWorkBuffer<T>::addAndWork(bool addP, T toAdd, Functor thunk)
{
    Spinlock_Lock_shared(lock);
    // Each adding lane writes at its rank among the adding lanes,
    // just past the current contents of the ring buffer.
    if (addP) {
        buffer[(head + size + __ballotToIdx(addP)) % CAPACITY] = toAdd;
    }
    uint32_t numAdding = __popcll(__ballot64(addP));
    if (__ockl_activelane_u32() == 0) {
        size += numAdding;
    }
    __threadfence_block();
    // Once a full wavefront's worth of work is buffered, drain 64
    // items with every lane doing useful work.
    if (size >= 64) {
        thunk(buffer[(head + __ockl_activelane_u32()) % CAPACITY]);
        if (__ockl_activelane_u32() == 0) {
            head = (head + 64) % CAPACITY;
            size -= 64;
        }
    }
    Spinlock_Unlock_shared(lock);
}
template <class T>
template <class Functor>
__device__ void SIMDWorkBuffer<T>::finishWork(Functor thunk)
{
    // Drain the leftovers; only the first `size` lanes get an item.
    uint32_t myIdx = __ockl_activelane_u32();
    if (myIdx < size) {
        thunk(buffer[(head + myIdx) % CAPACITY]);
    }
}
Code: Select all
// Rank of this lane among all lanes where p is true: the number of
// lower-numbered lanes with p set (an mbcnt-based prefix popcount).
__device__ uint32_t __ballotToIdx(bool p){
    uint64_t val = __ballot64(p);
    return __mbcnt_hi((uint32_t)(val >> 32), __mbcnt_lo((uint32_t)val, 0));
}
As for sample code of how to use it...
Code: Select all
__global__ void bufferTest(uint32_t* returns, uint32_t* head){
    // The buffer declaration was omitted from this snippet; it lives
    // in LDS so the whole block shares it.
    __shared__ SIMDWorkBuffer<uint32_t> buffer;
    buffer.Init();

    auto job = [=](uint32_t val){
        // Compact the active lanes' results into returns[], recording
        // which lane processed each value in the top 16 bits.
        returns[*head + __ockl_activelane_u32()] = (hipThreadIdx_x << 16) | val;
        uint32_t numAdded = __popcll(__activemask());
        if (__ockl_activelane_u32() == 0) {
            *head += numAdded;
        }
        __threadfence_block();
    };

    for (int i = 0; i < 128; i++) {
        uint32_t myvalue = (i * 64) + hipThreadIdx_x; // iterates over [0, 8192)
        buffer.addAndWork(
            ((myvalue % 3) == 0) || ((myvalue % 5) == 0),
            myvalue,
            job);
    }
    buffer.finishWork(job);
}
This might be a bit weird to read at first, but what is going on is FizzBuzz: I add every number that satisfies ((num % 3) == 0) || ((num % 5) == 0). The famous FizzBuzz pattern: https://en.wikipedia.org/wiki/Fizz_buzz
Each stored value is "(hipThreadIdx_x << 16) | val", which packs the FizzBuzz value "val" together with the index of the thread that processed it. Printing the thread index lets me demonstrate how much GPU utilization is happening. A naive FizzBuzz would only keep about 47% of lanes busy (7/15 of integers are divisible by 3 or 5), but thanks to the SIMDWorkBuffer I achieve near-100% utilization here.
The function I want to run at 100% utilization is the lambda functor (auto job = [=] ...). I take advantage of [=] capture to implicitly close over the returns[] array and the *head pointer. In any case, "job" stands in for whatever task I want to execute at full utilization.
buffer.addAndWork(predicate, value, functor) is the key to this whole interface. In effect it runs the pseudocode "if(predicate) functor(value)", but that naive implementation would only reach ~47% utilization on FizzBuzz (and less on other code). By buffering items until a full wavefront's worth is available, I get 100% utilization instead.
All in all, iterations 0 through 3775 ran fully utilized. The last batch, 3776 through 3822, only had 47/64 utilization, but that low-utilization pass happens exactly once, in finishWork. I expect the rest of my code to be highly efficient thanks to this tool.
To prove this works as I intend, I have some code which prints the data...
Code: Select all
// Ignoring setup:
hipLaunchKernelGGL(bufferTest, dim3(1), dim3(64), 0, 0, rets, head);
hipMemcpy(localRets, rets, sizeof(uint32_t) * 8192, hipMemcpyDefault);
hipMemcpy(&localHead, head, sizeof(uint32_t), hipMemcpyDefault);

using namespace std;
for (uint32_t i = 0; i < localHead; i++) {
    cout << i << " " << (localRets[i] >> 16) << " " << (localRets[i] & 0xFFFF) << endl;
}
Code: Select all
0 0 0
1 1 3
2 2 5
3 3 6
4 4 9
5 5 10
6 6 12
7 7 15
8 8 18
9 9 20
10 10 21
11 11 24
12 12 25
13 13 27
14 14 30
15 15 33
16 16 35
17 17 36
18 18 39
19 19 40
20 20 42
21 21 45
22 22 48
23 23 50
24 24 51
25 25 54
26 26 55
27 27 57
28 28 60
29 29 63
30 30 65
31 31 66
32 32 69
33 33 70
34 34 72
35 35 75
36 36 78
37 37 80
38 38 81
39 39 84
40 40 85
41 41 87
42 42 90
43 43 93
44 44 95
45 45 96
46 46 99
47 47 100
48 48 102
49 49 105
50 50 108
51 51 110
52 52 111
53 53 114
54 54 115
55 55 117
56 56 120
57 57 123
58 58 125
59 59 126
60 60 129
61 61 130
62 62 132
63 63 135
64 0 138
65 1 140
66 2 141
67 3 144
68 4 145
69 5 147
70 6 150
71 7 153
72 8 155
73 9 156
74 10 159
75 11 160
76 12 162
77 13 165
78 14 168
79 15 170
80 16 171
81 17 174
82 18 175
83 19 177
84 20 180
85 21 183
86 22 185
87 23 186
88 24 189
89 25 190
90 26 192
91 27 195
92 28 198
93 29 200
94 30 201
95 31 204
96 32 205
97 33 207
98 34 210
99 35 213
100 36 215
101 37 216
102 38 219
103 39 220
104 40 222
105 41 225
106 42 228
107 43 230
108 44 231
109 45 234
110 46 235
111 47 237
112 48 240
113 49 243
114 50 245
115 51 246
116 52 249
117 53 250
118 54 252
119 55 255
120 56 258
121 57 260
122 58 261
123 59 264
124 60 265
125 61 267
126 62 270
127 63 273
...
Etc. etc.
...
3775 63 8090
3776 0 8091
3777 1 8094
3778 2 8095
3779 3 8097
3780 4 8100
3781 5 8103
3782 6 8105
3783 7 8106
3784 8 8109
3785 9 8110
3786 10 8112
3787 11 8115
3788 12 8118
3789 13 8120
3790 14 8121
3791 15 8124
3792 16 8125
3793 17 8127
3794 18 8130
3795 19 8133
3796 20 8135
3797 21 8136
3798 22 8139
3799 23 8140
3800 24 8142
3801 25 8145
3802 26 8148
3803 27 8150
3804 28 8151
3805 29 8154
3806 30 8155
3807 31 8157
3808 32 8160
3809 33 8163
3810 34 8165
3811 35 8166
3812 36 8169
3813 37 8170
3814 38 8172
3815 39 8175
3816 40 8178
3817 41 8180
3818 42 8181
3819 43 8184
3820 44 8185
3821 45 8187
3822 46 8190
Keeping the GPU at 100% utilization is one of the hardest parts of this research project. I've been mulling over this data structure for the past week because this pattern keeps coming up again and again in my code, and I'm glad to have finally gotten it down to its most fundamental form. A buffer of size 128 is the largest I need: between calls there can never be more than 63 elements left in the buffer (once 64 accumulate, the job/thunk executes and consumes 64 of them), and a single call can add at most 64 more (waveSize == 64 on Vega64), so occupancy peaks at 63 + 64 = 127 < 128.