https://github.com/LeelaChessZero/lc0/pull/1428
Next-Gen GPUs for LC0
Moderators: hgm, Rebel, chrisw
-
- Posts: 671
- Joined: Sun Jan 26, 2020 10:38 pm
- Location: Turkey
- Full name: Mehmet Karaman
-
- Posts: 2658
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Next-Gen GPUs for LC0
Are you sure Lc0 does not use 3x3 convolutions in its CNN filters?

Milos wrote: ↑Mon Sep 28, 2020 4:41 pm
Leela uses mainly FP16 multipliers from CUDA cores. I am really not aware that this definition changed. Tensor cores are only used for 3x3 convolutions in the input layer (rather inefficiently). You can't use tensor cores for 1x1 convolutions (which are the great majority of operations in Lc0 DNN inference), i.e. you can, but it is grossly inefficient.

Alayan wrote: ↑Mon Sep 28, 2020 4:17 pm
Nvidia changed the definition of CUDA cores. You need a workload that fully saturates the FP32 units to get close (but not quite) to the effect the same number of CUDA cores would have had in Turing.
1 CUDA core in Turing: 1x FP32 unit + 1x INT32 unit, able to execute concurrently
2 CUDA cores in Ampere: 1x FP32 unit + 1x (INT32 OR FP32) unit, able to execute concurrently
But more importantly, isn't Leela supposed to use FP16 operations, with most of the relevant FP16 compute on RTX cards coming from tensor cores rather than from the 2x FP16 mode of the FP32 units?
Further, it seems Nvidia switched from defining fixed matrix sizes for tensor cores to FMA throughput per tensor core, see page 22:
https://www.nvidia.com/content/dam/en-z ... per-V1.pdf
If the tensor cores are not bound to a specific matrix size but can perform arbitrary FMA operations, that could explain the difference between the A100 and RTX 3080 numbers Werewolf posted, and give a hint about the close performance of the RTX 2080 Ti and RTX 3080. All just my speculation; I guess only Ankan knows for sure.
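Alayan's two-line CUDA-core definition can be turned into a toy throughput model. This is only an illustrative sketch with made-up scheduling assumptions (64 FP32 and 64 INT32 lanes per SM in Turing; 64 dedicated FP32 plus 64 shared FP32/INT32 lanes in Ampere, perfect issue, no other bottlenecks); real occupancy and instruction mixes are messier:

```python
def fp32_per_clock(f_fp32, arch):
    """Sustained FP32 ops per SM per clock for an instruction stream
    that is a fraction f_fp32 FP32 and (1 - f_fp32) INT32."""
    f_int = 1.0 - f_fp32
    inf = float("inf")
    if arch == "turing":
        # 64 FP32 lanes plus 64 separate INT32 lanes (dual issue):
        # each type is capped independently
        total = min(64 / f_fp32 if f_fp32 else inf,
                    64 / f_int if f_int else inf)
    else:  # "ampere"
        # 64 dedicated FP32 lanes plus 64 lanes doing FP32 OR INT32:
        # at most 128 ops/clock in total, INT32 capped at 64
        total = min(128.0, 64 / f_int if f_int else inf)
    return total * f_fp32

print(fp32_per_clock(1.0, "turing"))   # pure FP32 on Turing
print(fp32_per_clock(1.0, "ampere"))   # pure FP32 on Ampere: doubled
print(fp32_per_clock(0.64, "ampere"))  # mixed 64% FP32 / 36% INT32 stream
```

Under these assumptions a pure-FP32 stream really does run 2x faster per SM on Ampere, but a mixed stream (e.g. 64% FP32 / 36% INT32) only gains about 1.28x, which is one way to see why the doubled "CUDA core" count does not translate directly into doubled performance.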
--
Srdja
-
- Posts: 1796
- Joined: Thu Sep 18, 2008 10:24 pm
Re: Next-Gen GPUs for LC0
One thing which was weird was that the previous results I saw on the A100 were only a little better than the Titan RTX, about 10% better IIRC.

smatovic wrote: ↑Tue Sep 29, 2020 10:01 am
Are you sure Lc0 does not use 3x3 convolutions in its CNN filters? [...] If the tensor cores are not bound to a specific matrix size but can perform actual FMA operations, that could explain the diff between the A100 and RTX 3080 numbers Werewolf posted... all just my speculation, I guess only Ankan knows for sure.
--
Srdja
These numbers are much better.
However, Tilips confirmed last night it gets pretty pointless with multiple A100 cards as it's hard for the CPU to keep up.
(not that many of us will buy even one A100, let alone 3 of them...)
-
- Posts: 1796
- Joined: Thu Sep 18, 2008 10:24 pm
Re: Next-Gen GPUs for LC0
By the way, there are rumours online that Nvidia could still release an Ampere Titan, depending on how fast Big Navi turns out to be. A pinch-of-salt rumour, of course...
-
- Posts: 2658
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Next-Gen GPUs for LC0
Yes, Big Navi with RDNA 2 is supposed to catch up with Nvidia's high-end line, at least for gaming. We already saw with the RTX 20xx Super series a second launch of the same architecture at a better performance/price; it remains open whether such a thing will happen after AMD's launch of its high-end series... sad that Intel does not launch its Xe-HPG this year... would have been fun.
--
Srdja
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: Next-Gen GPUs for LC0
Both OpenCL and ROCm are crap compared to CUDA and cuDNN, so I see little point in mentioning Big Navi in the context of DL.

smatovic wrote: ↑Tue Sep 29, 2020 12:21 pm
Yes, Big Navi with RDNA 2 is supposed to catch up with Nvidia's high-end line, at least for gaming [...]

One needs an RDNA2 card that is 2x faster in terms of TFLOPS to match the performance of an RTX card.
-
- Posts: 2658
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Next-Gen GPUs for LC0
Milos wrote: ↑Tue Sep 29, 2020 12:39 pm
Both OpenCL and ROCm are crap compared to CUDA and cudnn, so I see little point in mentioning Big Navi in the context of DL. [...] One needs 2x faster RDNA2 card in terms of TFLOPS to match performance of RTX card.

- some people prefer AMD over Nvidia
- the DX12 backend of Lc0 runs on AMD too?
- you miss the point: competition is good for us end users; if we have three gaming vendors competing, we profit from the performance/price competition
--
Srdja
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: Next-Gen GPUs for LC0
At least according to the original AlphaZero implementation (not sure if Lc0 changed anything really besides the size), the bulk of the 3x3 convolutions is in the input layer (convolutional block). 3x3 convolutions are also present in the filters of each residual block, but not further on in the policy and value heads, which only have 1x1 convolutions.

smatovic wrote:
Further it seems Nvidia switched from defining fixed matrix sizes for tensor cores to FMA throughput per tensor core, see page 22:
https://www.nvidia.com/content/dam/en-z ... per-V1.pdf

That is just PR. Nothing really changed in terms of how the FP16 FLOPS throughput for tensor cores is calculated, or regarding the fact that those FMAs can only be used in tensor operations (otherwise you waste a full tensor core to perform a single 1x1 convolution).
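The 3x3-vs-1x1 split can be checked with a back-of-the-envelope multiply count. The layer sizes below (112 input planes, 256 filters, 20 residual blocks, AlphaGo-Zero-style 1x1 policy/value convolutions) are illustrative guesses at an Lc0-like network, not the actual configuration:

```python
BOARD = 8 * 8  # squares on a chess board

def conv_mults(c_in, c_out, k):
    """Multiplies for one k x k convolution over the whole board
    (c_in input channels, c_out output channels, stride 1, padded)."""
    return BOARD * c_in * c_out * k * k

filters, blocks = 256, 20
input_block = conv_mults(112, filters, 3)             # 3x3 input convolution
tower = blocks * 2 * conv_mults(filters, filters, 3)  # two 3x3 convs per residual block
heads = conv_mults(filters, 2, 1) + conv_mults(filters, 1, 1)  # 1x1 policy/value convs

total = input_block + tower + heads
print(f"3x3 share of multiplies: {(input_block + tower) / total:.5f}")
```

Under these assumed sizes, well over 99% of the multiplies sit in the 3x3 convolutions of the input block and residual tower, with the 1x1 head convolutions contributing almost nothing; so where the bulk of the work lands depends on how the backend maps the 3x3 convolutions, not on the heads.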
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: Next-Gen GPUs for LC0
Gamers profit for sure; ML scientists not at all. Who wants to buy an AMD card that costs $1200 and has worse performance for ML than an Nvidia card that costs $500?

smatovic wrote: ↑Tue Sep 29, 2020 12:48 pm
- some people prefer AMD over Nvidia
- the DX12 backend of Lc0 runs on AMD too?
- you miss the point, competition is good for us end users, if we have three gaming vendors competing, we profit by the performance/price competition
-
- Posts: 2658
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Next-Gen GPUs for LC0
Hmm, why did the DOE choose Intel (Aurora), AMD (Frontier) and AMD (El Capitan) for its upcoming exa-FLOP systems, and not IBM/Nvidia?

Milos wrote: ↑Tue Sep 29, 2020 1:01 pm
Gamers profit for sure, ML scientist not at all. Who wants to buy AMD card that costs 1200$ and has worse performance for ML than NVIDIA card that costs 500$??? [...]
--
Srdja