GPU ANN, how to deal with host-device latencies?


smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

GPU ANN, how to deal with host-device latencies?

Post by smatovic »

GPGPU host-device latencies are, afaik, tens of microseconds,
so you can lose >100K CPU clock cycles for each GPU function call.
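(For a rough sense of scale: a 30 microsecond round trip on a 3.5 GHz CPU is 30e-6 * 3.5e9 = 105,000 clock cycles.)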

How to implement GPU based ANN evaluation for CPU based tree search with such
limitations?

--
Srdja
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: GPU ANN, how to deal with host-device latencies?

Post by Rémi Coulom »

Double buffering: transfer the next batch to the device while the previous batch is being computed.

The biggest problem is not host-device latencies, but the inefficiency of small batches. With cuDNN, it is impossible to get good performance with batch_size=1, especially on big GPUs with tensor cores. For my network, I measured that a batch of 8 is faster than a batch of 1 (because a batch of 1 cannot use the tensor cores), and a batch of 16 is almost as fast as a batch of 8 on the Titan V.

So the search has to be very parallel.
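As a sketch of the double-buffering idea, in OpenCL terms since that is what Srdja uses (evalKernel, hostBatches, BATCH_BYTES and the work size are placeholder assumptions, result readback is omitted, and how much the two queues actually overlap depends on the driver):

Code:

    #include <CL/cl.h>

    // Two in-order queues: the upload of batch b+1 on one queue can
    // overlap with the kernel working on batch b on the other queue.
    void run_batches(cl_context ctx, cl_device_id dev, cl_kernel evalKernel,
                     void **hostBatches, size_t nBatches, size_t BATCH_BYTES)
    {
        cl_command_queue q[2];
        cl_mem buf[2];
        for (int i = 0; i < 2; ++i) {
            q[i]   = clCreateCommandQueue(ctx, dev, 0, NULL);
            buf[i] = clCreateBuffer(ctx, CL_MEM_READ_ONLY, BATCH_BYTES, NULL, NULL);
        }

        // upload batch 0 before the loop starts (non-blocking write, so
        // the host memory must stay valid until the queue is finished)
        clEnqueueWriteBuffer(q[0], buf[0], CL_FALSE, 0, BATCH_BYTES,
                             hostBatches[0], 0, NULL, NULL);

        for (size_t b = 0; b < nBatches; ++b) {
            int cur = (int)(b & 1), nxt = cur ^ 1;

            // start uploading the NEXT batch on the other queue
            if (b + 1 < nBatches)
                clEnqueueWriteBuffer(q[nxt], buf[nxt], CL_FALSE, 0, BATCH_BYTES,
                                     hostBatches[b + 1], 0, NULL, NULL);

            // run the network on the CURRENT batch; the in-order queue
            // guarantees its own upload has completed first
            size_t global = 1024; // placeholder work size
            clSetKernelArg(evalKernel, 0, sizeof(cl_mem), &buf[cur]);
            clEnqueueNDRangeKernel(q[cur], evalKernel, 1, NULL, &global, NULL,
                                   0, NULL, NULL);
            clFinish(q[cur]); // batch b is done here (readback omitted)
        }
        for (int i = 0; i < 2; ++i) {
            clReleaseMemObject(buf[i]);
            clReleaseCommandQueue(q[i]);
        }
    }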
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: GPU ANN, how to deal with host-device latencies?

Post by Milos »

smatovic wrote:GPGPU host-device latencies are, afaik, tens of microseconds,
so you can lose >100K CPU clock cycles for each GPU function call.

How to implement GPU based ANN evaluation for CPU based tree search with such
limitations?

--
Srdja
As Rémi mentioned, increase the batch size; with cuDNN (as can be seen from the LC0 project), the real speed gain only starts when the batch size reaches 128.
Regarding buffering and efficient sending of data, have a look at the paper by my colleagues: https://arxiv.org/abs/1803.06333.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: GPU ANN, how to deal with host-device latencies?

Post by Daniel Shawul »

smatovic wrote:GPGPU host-device latencies are, afaik, tens of microseconds,
so you can lose >100K CPU clock cycles for each GPU function call.

How to implement GPU based ANN evaluation for CPU based tree search with such
limitations?

--
Srdja
You need to use a highly parallel asynchronous search like MCTS. I expect alpha-beta rollouts to work equally well. Since I am not using a policy network, I am going to have to evaluate each child with a value network during expansion. This gives me on average 40 positions to evaluate simultaneously, and maybe I could batch those with requests from other threads (along the lines of the sketch below), so latency won't be a problem. I am still running a convnet on the CPU, so I haven't actually dealt with the problem yet.
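A sketch of that kind of request batching (hypothetical names throughout; gpuEvalBatch stands in for the single batched network call, which is not shown): search threads push positions into a shared queue and block on a future; one driver thread flushes a full batch, or whatever has accumulated after a short timeout, in a single GPU call.

Code:

    #include <chrono>
    #include <condition_variable>
    #include <future>
    #include <mutex>
    #include <vector>

    // One evaluation request from a search thread (names are made up).
    struct EvalRequest {
        std::vector<float> planes;   // encoded position
        std::promise<float> value;   // filled in once the GPU batch returns
    };

    // Placeholder for the single batched forward pass on the device.
    std::vector<float> gpuEvalBatch(const std::vector<EvalRequest*>& batch);

    class BatchCollector {
        std::mutex m;
        std::condition_variable cv;
        std::vector<EvalRequest*> pending;
        const size_t batchSize;
    public:
        explicit BatchCollector(size_t n) : batchSize(n) {}

        // Called by search threads; blocks until the position is evaluated.
        float evaluate(EvalRequest* r) {
            std::future<float> f = r->value.get_future();
            {
                std::lock_guard<std::mutex> lk(m);
                pending.push_back(r);
            }
            cv.notify_one();
            return f.get();
        }

        // Called in a loop by one dedicated GPU driver thread.
        void flushOnce() {
            std::vector<EvalRequest*> batch;
            {
                std::unique_lock<std::mutex> lk(m);
                // Wait for a full batch, but give up after 1 ms so the
                // search does not stall on a partially filled batch.
                cv.wait_for(lk, std::chrono::milliseconds(1),
                            [&] { return pending.size() >= batchSize; });
                batch.swap(pending);
            }
            if (batch.empty()) return;
            std::vector<float> values = gpuEvalBatch(batch); // one GPU call
            for (size_t i = 0; i < batch.size(); ++i)
                batch[i]->value.set_value(values[i]);
        }
    };

The timeout is the usual trade-off: larger batches amortize the launch latency, but waiting too long for stragglers stalls the search threads.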

Daniel
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU ANN, how to deal with host-device latencies?

Post by smatovic »

Rémi Coulom wrote: Sun May 06, 2018 12:13 pm Double buffering: transfer the next batch to the device while the previous batch is being computed.

The biggest problem is not host-device latencies, but the inefficiency of small batches. With cuDNN, it is impossible to get good performance with batch_size=1, especially on big GPUs with tensor cores. For my network, I measured that a batch of 8 is faster than a batch of 1 (because a batch of 1 cannot use the tensor cores), and a batch of 16 is almost as fast as a batch of 8 on the Titan V.

So the search has to be very parallel.
Got it, buffering and batch size, thx.

--
Srdja
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU ANN, how to deal with host-device latencies?

Post by smatovic »

Milos wrote: Sun May 06, 2018 12:24 pm
smatovic wrote:GPGPU host-device latencies are, afaik, tens of microseconds,
so you can lose >100K CPU clock cycles for each GPU function call.

How to implement GPU based ANN evaluation for CPU based tree search with such
limitations?

--
Srdja
As Rémi mentioned, increase the batch size; with cuDNN (as can be seen from the LC0 project), the real speed gain only starts when the batch size reaches 128.
Regarding buffering and efficient sending of data, have a look at the paper by my colleagues: https://arxiv.org/abs/1803.06333.
Thx, I will take a look at the paper.

--
Srdja
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU ANN, how to deal with host-device latencies?

Post by smatovic »

Daniel Shawul wrote: Sun May 06, 2018 1:24 pm
smatovic wrote:GPGPU host-device latencies are, afaik, tens of microseconds,
so you can lose >100K CPU clock cycles for each GPU function call.

How to implement GPU based ANN evaluation for CPU based tree search with such
limitations?

--
Srdja
You need to use a highly parallel asynchronous search like MCTS.
Yes, that was the point, I was thinking in terms of serial alpha-beta... thx.

--
Srdja
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU ANN, how to deal with host-device latencies?

Post by smatovic »

smatovic wrote: Sun May 06, 2018 10:42 am GPGPU host-device latencies are, afaik, tens of microseconds,
so you can lose >100K CPU clock cycles for each GPU function call.

How to implement GPU based ANN evaluation for CPU based tree search with such
limitations?

--
Srdja
I am back at the drawing board, and still struggling with OpenCL latencies;
maybe someone can comment on whether my numbers look correct?

OS: Ubuntu 18.04 x86-64
Device: Nvidia GTX 750, 1 GHz, 512 cores, 1 TFLOPs

OpenCL GPU kernel calls (terminated with clFinish), 1 million threads, no memory buffer transfer, empty kernel:

~35K calls per second

OpenCL GPU kernel calls (terminated with clFinish), 1 million threads, with an 8 KB write and a 4 KB read buffer transfer, empty kernel:

~10K calls per second
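For reference, the measurement loop looks roughly like this (a reconstruction, not the literal benchmark; the kernel name and iteration count are placeholders):

Code:

    #include <CL/cl.h>
    #include <time.h>

    // Times back-to-back launches of an empty kernel; "empty" is assumed
    // to be built from a trivial source like:  __kernel void empty() {}
    double calls_per_second(cl_command_queue q, cl_kernel empty, int iters)
    {
        size_t global = 1000000; // 1 million work items, as above
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; ++i) {
            clEnqueueNDRangeKernel(q, empty, 1, NULL, &global, NULL,
                                   0, NULL, NULL);
            clFinish(q); // force a full host-device round trip per call
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        return iters / s;
    }

For the second number, a clEnqueueWriteBuffer of 8 KB and a clEnqueueReadBuffer of 4 KB go inside the loop as well.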

Note that my machine is a bit outdated:

- PCIe via Northbridge
- PCIe 2.0
- only 8 lanes per slot

Maybe on newer systems the latencies do not hurt at all?

--
Srdja
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU ANN, how to deal with host-device latencies?

Post by smatovic »

Got this answer on the Nvidia developer forum,
maybe it is of interest to others:
I have no idea what you are measuring, and I have had zero exposure to OpenCL. Under CUDA, the minimal observed kernel launch time is 5 microseconds for null kernels, meaning that there can be at most 200,000 kernel invocations per second. That minimal launch overhead has basically not changed much in about a decade, and the limiter appears to be the basic latency of the PCIe link. It is generally a good idea to design for minimal kernel execution time > 1 millisecond.

PCIe version and width impact primarily PCIe throughput, with little impact on PCIe latency. For minimum software overhead in the host-side driver stack, a CPU with high single-thread performance is recommended. At this time I would recommend a CPU with > 3.5 GHz base frequency as optimal.
https://devtalk.nvidia.com/default/topi ... atencies-/
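(At 5 microseconds of launch overhead, a kernel that runs for 1 millisecond loses only about 0.5% of its time to the launch, which is presumably the reasoning behind that rule of thumb.)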

--
Srdja