
Re: I stumbled upon this article on the new Nvidia RTX GPUs

Posted: Fri Sep 11, 2020 6:33 am
by mmt
mwyoung wrote: Thu Sep 10, 2020 10:48 pm
I do not think Lc0 is using this tech when running on 2 or more cards. Does anyone know for sure?
It's definitely not needed.

Re: I stumbled upon this article on the new Nvidia RTX GPUs

Posted: Fri Sep 11, 2020 6:35 am
by mmt
I believe one of the posters here worked at Nvidia. Maybe he can clear it all up once the embargo ends in a few days.

Re: I stumbled upon this article on the new Nvidia RTX GPUs

Posted: Fri Sep 11, 2020 8:58 am
by Werewolf
Another question I'd like answered: if (if!) Lc0 uses Tensor cores, why does the $10,000 A100 barely outperform the Titan? I saw benchmarks specifically for Lc0 yesterday confirming this.

I understand there's some kind of efficiency issue with the tensor cores (Milos' argument?), but the A100 is crammed full of them and should have won easily.

Re: I stumbled upon this article on the new Nvidia RTX GPUs

Posted: Fri Sep 11, 2020 11:02 am
by smatovic
Werewolf wrote: Fri Sep 11, 2020 8:58 am Another question I'd like answered: if (if!) Lc0 uses Tensor cores, why does the $10,000 A100 barely outperform the Titan? I saw benchmarks specifically for Lc0 yesterday confirming this.

I understand there's some kind of efficiency issue with the tensor cores (Milos' argument?), but the A100 is crammed full of them and should have won easily.
Then I suggest rereading Milos' argument:

http://talkchess.com/forum3/viewtopic.p ... 20#p846617

Lc0, like AlphaGo and Leela, uses 3x3 convolutions in its CNN design; I guess this is a descendant of the game of Go.
Others have pointed out that 4x4 convolutions would make more sense for the game of Chess.

Milos already pointed out in an older thread that Lc0 with 3x3 CNNs uses only ~30% of the 4x4 TensorCores present in Volta and Turing (RTX 20xx):

http://www.talkchess.com/forum3/viewtop ... 1&start=40

Now we seem to have 8x8 TensorCores in the Ampere A100 (not sure about the RTX 30xx series), and Lc0 cannot utilize these unless they change their CNN design?

--
Srdja

Re: I stumbled upon this article on the new Nvidia RTX GPUs

Posted: Fri Sep 11, 2020 12:52 pm
by Werewolf
So what was Lc0 running on in the A100?

It was still faster than the Titan.

Re: I stumbled upon this article on the new Nvidia RTX GPUs

Posted: Fri Sep 11, 2020 1:09 pm
by smatovic
Werewolf wrote: Fri Sep 11, 2020 12:52 pm So what was Lc0 running on in the A100?

It was still faster than the Titan.
I am not aware of your benchmark; whether it used the CUDA, CUDNN, OpenCL or DX12 backend, the boost frequency, the cooling, all of this makes a difference.

But, the A100 has 6912 CUDA-Cores, the Titan RTX has 4608 CUDA-Cores, at ~1410 MHz vs. ~1770 MHz; if we assume that TensorCore performance per SM stays the same, you get a factor of ~1.2x for the A100.
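As a back-of-the-envelope check (cores times clock only, ignoring memory bandwidth, SM count and TensorCore differences), the ~1.2x factor can be reproduced like this:

```python
# Rough throughput ratio A100 vs. Titan RTX, assuming performance
# scales only with CUDA cores x boost clock. This is a deliberate
# simplification; real Lc0 throughput depends on backend, cooling, etc.
a100_cores, a100_mhz = 6912, 1410
titan_cores, titan_mhz = 4608, 1770

factor = (a100_cores * a100_mhz) / (titan_cores * titan_mhz)
print(f"A100 / Titan RTX ~ {factor:.2f}x")  # prints ~1.19x
```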


https://en.wikipedia.org/wiki/Ampere_(m ... d_DGX_A100

https://en.wikipedia.org/wiki/List_of_N ... _20_series

--
Srdja

Re: I stumbled upon this article on the new Nvidia RTX GPUs

Posted: Fri Sep 11, 2020 4:45 pm
by Werewolf
I only got about 30 seconds to look at the benchmark, but the difference between the two cards seemed to be within 20%, so that does fit. Unfortunately I don’t know backend details.

Which cores are you claiming Lc0 is running on with each card?
Tensor for both or just the Titan?

Re-reading the Milos post above, he does point out a small gain in the CUDA cores. I'm wondering if this is relevant again: since the 3090 has more CUDA cores than previously expected - more than the A100, even - would it be faster to use them?

Re: I stumbled upon this article on the new Nvidia RTX GPUs

Posted: Fri Sep 11, 2020 6:04 pm
by smatovic
Werewolf wrote: Fri Sep 11, 2020 4:45 pm I only got about 30 seconds to look at the benchmark, but the difference between the two cards seemed to be within 20%, so that does fit. Unfortunately I don’t know backend details.

Which cores are you claiming Lc0 is running on with each card?
Tensor for both or just the Titan?

Re-reading the Milos post above, he does point out a small gain in the CUDA cores. I'm wondering if this is relevant again: since the 3090 has more CUDA cores than previously expected - more than the A100, even - would it be faster to use them?
Sorry, I am not into the concrete implementation of LC0's CNNs, and I have no Nvidia intel, anyway... with a pinch of salt...

- GPU-cores run the dot-product part of the CNN
- TensorCores can boost the tensor-math part of the CNN, the convolutions
- there are different implementations of the convolution-math in different backends
- you can run tensor-math on CPU, GPU-cores, TensorCores
- LC0 uses 3x3 convolutions
- RTX 20xx has 4x4 TensorCores
- A100 seems to have 8x8 TensorCores
- RTX 30xx has ?x? TensorCores?
- RTX 20xx has 4x16 FP32 cores per SM plus 4x16 INT32 cores per SM
- RTX 30xx has 4x16 FP32 cores per SM plus 4x16 FP32/INT32 cores per SM
- Volta, Turing, Ampere have one NxN TensorCore per SM
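A toy sketch (not actual Lc0 or cuDNN code, just plain NumPy) of the convolution point in the list above: the standard im2col trick rewrites a 3x3 convolution over the 8x8 board as a matrix multiply, which is the shape of work that TensorCores accelerate:

```python
import numpy as np

# One 3x3 convolution over an 8x8 board, done two ways.
board = np.random.rand(8, 8).astype(np.float32)
kernel = np.random.rand(3, 3).astype(np.float32)

# Zero-pad so the output stays 8x8, as with board-sized feature planes.
padded = np.pad(board, 1)

# im2col: each output square becomes one row holding its 3x3 neighborhood.
patches = np.array([
    padded[r:r + 3, c:c + 3].ravel()
    for r in range(8) for c in range(8)
])                                            # shape (64, 9)

# The convolution is now a single matrix product.
out = (patches @ kernel.ravel()).reshape(8, 8)

# Cross-check against the direct sliding-window definition.
direct = np.zeros((8, 8), dtype=np.float32)
for r in range(8):
    for c in range(8):
        direct[r, c] = np.sum(padded[r:r + 3, c:c + 3] * kernel)

assert np.allclose(out, direct, atol=1e-5)
```

Real backends do this far more efficiently (and batched over channels and filters), but the mapping is the same.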

In short, it is not that easy; you have to wait for the LC0 benchmarks for the 30xx series. At least I cannot tell you how the new Nvidia series will perform without changes to the current LC0 (whether backend or CNN design).

--
Srdja

Re: I stumbled upon this article on the new Nvidia RTX GPUs

Posted: Fri Sep 11, 2020 6:45 pm
by Milos
smatovic wrote: Fri Sep 11, 2020 11:02 am Then I suggest rereading Milos' argument:

http://talkchess.com/forum3/viewtopic.p ... 20#p846617

Lc0, like AlphaGo and Leela, uses 3x3 convolutions in its CNN design; I guess this is a descendant of the game of Go.
Others have pointed out that 4x4 convolutions would make more sense for the game of Chess.

Milos already pointed out in an older thread that Lc0 with 3x3 CNNs uses only ~30% of the 4x4 TensorCores present in Volta and Turing (RTX 20xx):

http://www.talkchess.com/forum3/viewtop ... 1&start=40

Now we seem to have 8x8 TensorCores in the Ampere A100 (not sure about the RTX 30xx series), and Lc0 cannot utilize these unless they change their CNN design?

--
Srdja
3x3 convolutions are kind of a relic of image processing. Since state-of-the-art DNNs at the time of AlphaGo's development were mainly ResNets used for image processing, they naturally had 3x3 convolutions in the input layer that took care of the RGB pixels in images.
There is absolutely no reason why DNNs for chess would need 3x3 convolutions, and the current A0/Lc0 net architecture is quite obsolete, huge and inefficient, but changing it to something completely different is a big effort, mainly because of efficient mapping (cudnn backend) but also because of training.

Regarding the CUDA cores vs. Tensor cores discussion: CUDA cores are used for dot-product, i.e. MAC, operations, while Tensor cores are more efficiently used for matrix-matrix multiplies. In ResNets, with the exception of the first layer with its 3x3 convolutions, the other layers use 1x1 convolutions, which map more efficiently onto CUDA cores because they are essentially performing MAC operations.
If one switched to some other architecture like a transformer/BERT that extensively uses matrix multiplies, then Tensor cores could be used much more efficiently.
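A toy NumPy sketch (not Lc0 code; channel counts are just illustrative) of the 1x1-convolution point: a 1x1 convolution only mixes channels per board square, so it collapses to a single (out_ch, in_ch) x (in_ch, 64) matrix multiply, i.e. plain MAC work:

```python
import numpy as np

in_ch, out_ch = 256, 256                       # illustrative filter counts
planes = np.random.rand(in_ch, 8, 8).astype(np.float32)
weights = np.random.rand(out_ch, in_ch).astype(np.float32)

# 1x1 convolution as one matmul over the 64 flattened board squares.
out = (weights @ planes.reshape(in_ch, 64)).reshape(out_ch, 8, 8)

# Same result via the per-square definition, checked at one square.
r, c = 3, 4
direct = weights @ planes[:, r, c]
assert np.allclose(out[:, r, c], direct, atol=1e-3)
```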

Re: I stumbled upon this article on the new Nvidia RTX GPUs

Posted: Wed Sep 16, 2020 11:17 pm
by mmt
A lot of reviews of the 3080 are out, but they focus almost exclusively on games. The closest I've seen is this https://babeltechreviews.com/rtx-3080-a ... hmarked/4/ with a 27% increase in half-float performance over the 2080 Ti.