lucario6607 wrote: ↑Fri Mar 27, 2026 11:25 am
    Dann Corbit wrote: ↑Fri Mar 27, 2026 8:04 am
        The AMD architecture allows transparent access to the video RAM and system RAM to both the CPUs and the GPUs. They have already implemented it for their AI workhorse type GPUs. I think Nvidia is doing something similar. Maybe Srdja can comment. The big problem with commodity GPUs is that they spend all their time copying work to and from video RAM.
    Leela isn't memory bandwidth bottlenecked. It does matter, but not as much as you think it does. Leela sends work to each GPU, so NVLink really doesn't do much.

Yes, true, Leela runs batches of parallel threads to utilize the GPU, and the neural network for evaluation is stored in the GPU's VRAM. But I guess Dann had in mind some kind of heterogeneous engine, utilizing CPU/GPU/NPU all together on the same game tree stored in unified memory. Idk how such an approach could look, but Apple's M-series already offers the needed hardware.
Lc0's batch-based parallel approach is needed because of the host-device latency of kernel launches, not because of memory copies.
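The amortization argument can be made concrete with a toy throughput model. The overhead and per-position numbers below are illustrative assumptions, not measured Lc0 or hardware figures; the point is only that a fixed per-launch cost is shared across the batch, so evals/s grows with batch size.

```python
# Toy model: why batching NN evaluations pays off when each kernel
# launch carries a fixed host-device overhead.
# Both constants are assumed for illustration, not measured values.

LAUNCH_OVERHEAD_US = 20.0  # assumed fixed cost per kernel launch (us)
PER_POSITION_US = 2.0      # assumed GPU compute cost per position (us)

def evals_per_second(batch_size: int) -> float:
    """Positions evaluated per second when `batch_size` positions
    share a single kernel launch."""
    batch_time_us = LAUNCH_OVERHEAD_US + batch_size * PER_POSITION_US
    return batch_size / batch_time_us * 1_000_000

for b in (1, 8, 64, 512):
    print(f"batch {b:4d}: {evals_per_second(b):>9.0f} evals/s")
```

With these assumed numbers, batch size 1 spends over 90% of its time on launch overhead, while large batches approach the compute-bound limit of 1/PER_POSITION_US evaluations per second.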
Host-Device Latencies
https://www.chessprogramming.org/GPU#Ho ... _Latencies
Idk what the kernel-launch overhead looks like on Apple M-series hardware.
--
Srdja