Page 3 of 3

Re: Another take on DTS?

Posted: Mon Oct 28, 2019 10:16 pm
by dragontamer5788
mar wrote:
Mon Oct 28, 2019 9:51 pm
dragontamer5788 wrote:
Mon Oct 28, 2019 9:32 pm
A singly-linked list seems like it can easily implement a stack. I'm pretty sure a singly-linked list can implement a queue (but it takes a bit of thought). A dequeue however... that is push-front, push-back, pop-front, and pop-back... seems to require a doubly-linked list. And the "XOR-pointer" trick will be sufficient for a dequeue (as you are only pulling data from the front or back, and never in the middle).

Otherwise, you'd have to double your pointer space by creating a classic doubly linked list, which has the (useless) feature of being able to pull out of the middle.


A dequeue is very useful for load-balancing tree traversal searches. Your local thread push_back and pop_back (aka: Depth-first search locally), but "other" threads can invade and pop_front whenever they need to work-steal some work from your thread (aka: Breadth-first search while load-balancing). It takes some effort to mitigate the race-conditions through locks or compare-and-swap operations... but a dequeue is definitely useful.
Ok, thanks. But I'm still not sold on the linked list idea to implement a double-ended queue.
My implementation uses a (dynamic) circular buffer with two indices, seems more compact to me (only expensive operation is growing as needed, but still amortized O(1) push and you can always reserve in advance) => elements are tightly packed together, no need to store any kind of pointer/index per element.
EDIT: as a bonus, I get random access to queue elements for free
Like you, I usually prefer vectors as well. A static circular buffer would normally be my preferred methodology.

But when I started this GPU project, I needed a good memory allocator on the GPU that would work across workgroups. There might be one for CUDA, but I'm not aware of one for ROCm or OpenCL. So I wrote my own.

It would be too much effort to write a truly general purpose "malloc", so I made a bunch of simplifications. In particular, a "heap" in my case can only allocate a fixed size (512 bytes). All the data-structures in my code use these 512-byte chunks in different ways: a task can fit inside of 512 bytes for instance. Linked lists can create stacks, queues, and dequeues. And I "pack" the elements, fitting 128x 4-byte integers per 512-byte node. Linked-lists make it very easy to write a memory allocator. Vectors and circular buffers (especially self-expanding / contracting ones) are very difficult to write a memory-allocator for.

Since all 512-byte chunks are the same size and uniform, alloc is simply atomicAnd into a bitmask, while free is simply atomicOr into the bitmask. Its trivially parallelizable (even across the 16384 threads I expect to run) and works as a global, lock-free memory allocator. In practice, atomic operations are still extremely costly because GPUs have non-cohesive caches. So Vega64 has to... erm... flush the entire L1 cache (At least, that's what the compiled GPU-assembly seems to indicate). I definitely miss cache-coherence that from the CPU-world.

So there's a significant amount of buffering I do to minimize the atomic operations and "coalesce" allocs into bigger chunks to amortize the costs. But the fundamental structure is just atomicAnd / atomicOr into a globally shared bitmask. But in return, I have to use linked-lists instead of vectors, because I can't change the size of my allocations.


It probably should be noted that I haven't actually written the XOR-linked list yet. I've always just found it amusing. But after this discussion, maybe I should...