GPU ANN, how to deal with host-device latencies?

Discussion of chess software programming and technical issues.

Moderators: hgm, Harvey Williamson, bob

smatovic
Posts: 784
Joined: Wed Mar 10, 2010 9:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic
Contact:

GPU ANN, how to deal with host-device latencies?

Post by smatovic » Sun May 06, 2018 8:42 am

GPGPU host-device latencies are, afaik, tens of microseconds,
so you can lose >100K CPU clock cycles for each GPU function call.

How to implement GPU based ANN evaluation for CPU based tree search with such
limitations?

--
Srdja

Rémi Coulom
Posts: 429
Joined: Mon Apr 24, 2006 6:06 pm
Contact:

Re: GPU ANN, how to deal with host-device latencies?

Post by Rémi Coulom » Sun May 06, 2018 10:13 am

Double buffering: transfer the next batch to the device while the previous batch is being computed.

The biggest problem is not host-device latencies, but the inefficiency of small batches. With cuDNN, it is impossible to get good performance with batch_size=1, especially on big GPUs with tensor cores. For my network, I measured that a batch of 8 is faster than a batch of 1 (because a batch of 1 cannot use the tensor cores), and a batch of 16 is almost as fast as a batch of 8 on the Titan V.

So the search has to be very parallel.
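The double-buffering idea can be sketched as follows. This is a minimal illustration, not real GPU code: `upload` and `compute` are hypothetical stand-ins for the host-to-device transfer and kernel execution, simulated here with sleeps; the point is only that the upload of batch i+1 overlaps the compute of batch i.

```python
# Double-buffering sketch: overlap the transfer of the next batch with
# the computation of the current one. `upload` and `compute` are
# placeholders for the real transfer and kernel launch.
import threading
import time

def upload(batch):
    """Simulated host-to-device transfer."""
    time.sleep(0.001)
    return list(batch)

def compute(device_batch):
    """Simulated kernel execution on the device."""
    time.sleep(0.002)
    return [x * 2 for x in device_batch]

def pipeline(batches):
    """Overlap the upload of batch i+1 with the compute of batch i."""
    results = []
    next_buf = upload(batches[0])
    for i, _ in enumerate(batches):
        current = next_buf
        uploader = None
        holder = {}
        if i + 1 < len(batches):
            uploader = threading.Thread(
                target=lambda b=batches[i + 1]: holder.update(buf=upload(b)))
            uploader.start()
        results.append(compute(current))  # runs while the next upload proceeds
        if uploader is not None:
            uploader.join()
            next_buf = holder["buf"]
    return results

print(pipeline([[1, 2], [3, 4], [5]]))  # [[2, 4], [6, 8], [10]]
```

In CUDA this would be done with two streams and `cudaMemcpyAsync`; the thread here plays the role of the copy engine.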

Milos
Posts: 3383
Joined: Wed Nov 25, 2009 12:47 am

Re: GPU ANN, how to deal with host-device latencies?

Post by Milos » Sun May 06, 2018 10:24 am

smatovic wrote:GPGPU host-device latencies are, afaik, tens of microseconds,
so you can loose >100K CPU clock cycles for each GPU function call.

How to implement GPU based ANN evaluation for CPU based tree search with such
limitations?

--
Srdja
As Rémi mentioned, increase the batch size: with cuDNN (as can be seen from the LC0 project), the real speed gain only starts once the batch size reaches about 128.
Regarding buffering and efficient transfer of data, have a look at the paper by my colleagues: https://arxiv.org/abs/1803.06333.
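The batch-size effect follows from a simple amortization model: a fixed per-call overhead is divided over the batch. The constants below are made up for illustration, not measurements from any of the systems discussed here.

```python
# Illustrative amortization model: per-position cost of a GPU call with a
# fixed launch/transfer overhead. Constants are invented for illustration.
LAUNCH_OVERHEAD_US = 30.0   # fixed cost per GPU call (host-device latency)
PER_POSITION_US = 2.0       # marginal cost of one more position in the batch

def per_position_cost(batch_size):
    """Average microseconds per position for a given batch size."""
    return LAUNCH_OVERHEAD_US / batch_size + PER_POSITION_US

for b in (1, 8, 16, 128):
    print(b, per_position_cost(b))
```

With these (invented) numbers a batch of 1 costs 32 µs per position while a batch of 128 costs just over 2 µs, which is why the gain keeps growing up to large batches.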

Daniel Shawul
Posts: 3724
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia
Contact:

Re: GPU ANN, how to deal with host-device latencies?

Post by Daniel Shawul » Sun May 06, 2018 11:24 am

smatovic wrote:GPGPU host-device latencies are, afaik, tens of microseconds,
so you can lose >100K CPU clock cycles for each GPU function call.

How to implement GPU based ANN evaluation for CPU based tree search with such
limitations?

--
Srdja
You need to use a highly parallel asynchronous search like MCTS. I expect alpha-beta rollouts to work equally well. Since I am not using a policy network, I am going to have to evaluate each child with a value network during expansion. This gives me on average 40 positions to evaluate simultaneously, and maybe I can batch those with requests from other threads so that latency won't be a problem. I am still running a convnet on the CPU, so I haven't actually dealt with the problem yet.

Daniel
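The "batch requests from other threads" idea can be sketched with a shared queue: search threads post positions, and a collector drains up to one batch and makes a single evaluation call for all of them. This is a hedged sketch, not Daniel's actual code; `evaluate_batch` is a placeholder for the real value-network call.

```python
# Sketch of batching evaluation requests from several search threads into
# one GPU call. `evaluate_batch` is a placeholder for the real network.
import queue

def evaluate_batch(positions):
    """Placeholder for one GPU call on the whole batch of positions."""
    return [float(len(p)) for p in positions]

def batcher(requests, batch_size):
    """Drain up to batch_size pending requests, evaluate them in one call,
    and deliver each result to the queue the requester is waiting on."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(requests.get(timeout=0.01))
        except queue.Empty:
            break  # evaluate whatever we have rather than wait forever
    if not batch:
        return
    values = evaluate_batch([pos for pos, _ in batch])
    for (_, reply), value in zip(batch, values):
        reply.put(value)

# Usage: two simulated requests sharing one reply queue.
requests = queue.Queue()
reply = queue.Queue()
requests.put(("e2e4", reply))
requests.put(("d7d5", reply))
batcher(requests, batch_size=8)
print(reply.qsize())  # 2
```

A real implementation would run the batcher in its own thread and have each search thread block on its reply queue until the value arrives.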

smatovic
Posts: 784
Joined: Wed Mar 10, 2010 9:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic
Contact:

Re: GPU ANN, how to deal with host-device latencies?

Post by smatovic » Sun May 06, 2018 7:52 pm

Rémi Coulom wrote:
Sun May 06, 2018 10:13 am
Double buffering: transfer the next batch to the device while the previous batch is being computed.

The biggest problem is not host-device latencies, but the inefficiency of small batches. With cuDNN, it is impossible to get good performance with batch_size=1, especially on big GPUs with tensor cores. For my network, I measured that a batch of 8 is faster than a batch of 1 (because a batch of 1 cannot use the tensor cores), and a batch of 16 is almost as fast as a batch of 8 on the Titan V.

So the search has to be very parallel.
Got it, buffering and batch size, thx.

--
Srdja

smatovic
Posts: 784
Joined: Wed Mar 10, 2010 9:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic
Contact:

Re: GPU ANN, how to deal with host-device latencies?

Post by smatovic » Sun May 06, 2018 7:53 pm

Milos wrote:
Sun May 06, 2018 10:24 am
smatovic wrote:GPGPU host-device latencies are, afaik, tens of microseconds,
so you can lose >100K CPU clock cycles for each GPU function call.

How to implement GPU based ANN evaluation for CPU based tree search with such
limitations?

--
Srdja
As Rémi mentioned, increase the batch size: with cuDNN (as can be seen from the LC0 project), the real speed gain only starts once the batch size reaches about 128.
Regarding buffering and efficient transfer of data, have a look at the paper by my colleagues: https://arxiv.org/abs/1803.06333.
Thx, I will take a look at the paper.

--
Srdja

smatovic
Posts: 784
Joined: Wed Mar 10, 2010 9:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic
Contact:

Re: GPU ANN, how to deal with host-device latencies?

Post by smatovic » Sun May 06, 2018 7:55 pm

Daniel Shawul wrote:
Sun May 06, 2018 11:24 am
smatovic wrote:GPGPU host-device latencies are, afaik, tens of microseconds,
so you can lose >100K CPU clock cycles for each GPU function call.

How to implement GPU based ANN evaluation for CPU based tree search with such
limitations?

--
Srdja
You need to use a highly parallel asynchronous search like MCTS.
Yes, that was the point, I was thinking in terms of serial alpha-beta. Thx.

--
Srdja

smatovic
Posts: 784
Joined: Wed Mar 10, 2010 9:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic
Contact:

Re: GPU ANN, how to deal with host-device latencies?

Post by smatovic » Thu Feb 28, 2019 12:28 pm

smatovic wrote:
Sun May 06, 2018 8:42 am
GPGPU host-device latencies are, afaik, tens of microseconds,
so you can lose >100K CPU clock cycles for each GPU function call.

How to implement GPU based ANN evaluation for CPU based tree search with such
limitations?

--
Srdja
I am back at the drawing board, and still struggle with OpenCL latencies.
Maybe someone can comment on whether my numbers look correct?

OS: Ubuntu 18.04 x86-64
Device: Nvidia GTX 750, 1 GHz, 512 cores, ~1 TFLOPS

OpenCL GPU kernel calls (terminated with clFinish), 1 million threads, no memory buffer transfer and an empty kernel:

~35K calls per second

OpenCL GPU kernel calls (terminated with clFinish), 1 million threads, with an 8 KB memory write and a 4 KB memory read transfer and an empty kernel:

~10K calls per second
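For reference, these call rates convert directly into average per-call latencies (simple arithmetic on the numbers above, no further assumptions):

```python
# Convert the measured call rates into average per-call latency.
def per_call_latency_us(calls_per_second):
    """Average microseconds spent per kernel call."""
    return 1e6 / calls_per_second

empty = per_call_latency_us(35_000)      # empty kernel, no transfers
with_xfer = per_call_latency_us(10_000)  # with 8 KB write + 4 KB read

print(f"{empty:.1f} us empty, {with_xfer:.1f} us with transfers")
```

So an empty call costs about 28.6 µs and a call with the small transfers about 100 µs, i.e. the transfers add roughly 70 µs per call on this machine.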

Note that my machine is a bit outdated:

- PCIe via Northbridge
- PCIe 2.0
- only 8 lanes per slot

Maybe on newer systems the latencies do not hurt at all?

--
Srdja

smatovic
Posts: 784
Joined: Wed Mar 10, 2010 9:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic
Contact:

Re: GPU ANN, how to deal with host-device latencies?

Post by smatovic » Fri Mar 01, 2019 6:05 am

Got this answer on the Nvidia developer forum;
maybe it is of interest for others...
I have no idea what you are measuring, and I have had zero exposure to OpenCL. Under CUDA, the minimal observed kernel launch time is 5 microseconds for null kernels, meaning that there can be at most 200,000 kernel invocations per second. That minimal launch overhead has basically not changed much in about a decade, and the limiter appears to be the basic latency of the PCIe link. It is generally a good idea to design for minimal kernel execution time > 1 millisecond.

PCIe version and width impact primarily PCIe throughput, with little impact on PCIe latency. For minimum software overhead in the host-side driver stack, a CPU with high single-thread performance is recommended. At this time I would recommend a CPU with > 3.5 GHz base frequency as optimal.
https://devtalk.nvidia.com/default/topi ... atencies-/
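The "design for > 1 millisecond" guideline in the quote follows directly from the ~5 µs launch overhead: the fraction of time wasted on launches shrinks as kernels get longer. A quick sketch of that arithmetic:

```python
# Fraction of total time spent on kernel launch overhead, using the ~5 us
# minimal launch cost quoted in the answer above.
LAUNCH_US = 5.0

def overhead_fraction(kernel_us):
    """Share of wall time lost to launch overhead for a kernel of
    the given duration in microseconds."""
    return LAUNCH_US / (LAUNCH_US + kernel_us)

# A 10 us kernel loses a third of its time to the launch;
# a 1 ms kernel loses under half a percent.
print(overhead_fraction(10), overhead_fraction(1000))
```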

--
Srdja
