2080 Ti

Werewolf · Post by **Werewolf** » Sat Sep 15, 2018 10:57 pm

Oh horrors.

I was hoping given the 2080 Ti's vast CUDA core count and impressive Tensor core numbers (almost as good as the Titan V) this thing was going to be a monster for Lc0 with some very impressive FP16 performance.

However, it seems that although it will be fast at FP32, its FP16 performance is deliberately crippled if this link is correct:

https://en.wikipedia.org/wiki/List_of_N ... _20_series

Can someone confirm: is the FP16 performance here that of the overall CARD (in which case bad news) or just the performance of the CUDA cores? If it's the latter the 2080 Ti could still be redeemed by its Tensor cores.

Milos · Post by **Milos** » Sun Sep 16, 2018 12:27 am

Werewolf wrote: ↑Sat Sep 15, 2018 10:57 pm Oh horrors.

I was hoping given the 2080 Ti's vast CUDA core count and impressive Tensor core numbers (almost as good as the Titan V) this thing was going to be a monster for Lc0 with some very impressive FP16 performance.

However, it seems that although it will be fast at FP32, its FP16 performance is deliberately crippled if this link is correct:

https://en.wikipedia.org/wiki/List_of_N ... _20_series

Can someone confirm: is the FP16 performance here that of the overall CARD (in which case bad news) or just the performance of the CUDA cores? If it's the latter the 2080 Ti could still be redeemed by its Tensor cores.

It is the FP16 performance of the CUDA cores, as I have already written many times. Tensor cores are mostly a marketing gimmick, and the scientific community has not demonstrated much real speedup for inference so far. Lc0 on a V100 gets no benefit from the Tensor cores, only from FP16 in the CUDA cores. Why you are hyping Tensor cores so much, I really don't get.

Werewolf · Post by **Werewolf** » Sun Sep 16, 2018 12:40 am

Nvidia are claiming tensor can process FP16, which is quite a big deal. If it can’t be used it’s useless.

Different sites are claiming very different numbers. I just want to find out how fast the card really is for Lc0.

Milos · Post by **Milos** » Sun Sep 16, 2018 1:24 am

Werewolf wrote: ↑Sun Sep 16, 2018 12:40 am Nvidia are claiming tensor can process FP16, which is quite a big deal. If it can’t be used it’s useless.

Different sites are claiming very different numbers. I just want to find out how fast the card really is for Lc0.

First you have to understand what tensor cores are. They compute in mixed precision, i.e. in a 4x4 matrix multiplication the operands are FP16 and the result is accumulated in FP32.
Tensor cores in RTX cards are the same ones as in the Volta architecture. The problem is that they help little, if at all, for inference with regular 3x3 convolutions. Tensor cores don't help Lc0 on Volta, therefore they won't help on RTX either. Since FP16 is not enabled in 20xx cards, just as it wasn't enabled in 10xx cards, the only gain is the 15% from extra CUDA cores and higher frequency. Therefore the 2080Ti will be faster than the 1080Ti by exactly those 15%. Anyone who believes in some other magical speedup is, frankly speaking, just daydreaming.
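
As an illustration of the mixed-precision operation described above, here is a minimal pure-Python sketch. It only emulates the numerics (FP16 operands, FP32 accumulator) with the standard `struct` module; the helper names `to_fp16`, `to_fp32` and `mma_4x4` are made up for this example, and the rounding order inside real hardware is an assumption, not something stated in this thread.

```python
import struct

def to_fp16(x):
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mma_4x4(a, b, c):
    """Return D = A*B + C for 4x4 lists: FP16 operands, FP32 accumulator."""
    d = []
    for i in range(4):
        row = []
        for j in range(4):
            acc = to_fp32(c[i][j])
            for k in range(4):
                # Operands are rounded to FP16; the running sum stays in FP32.
                acc = to_fp32(acc + to_fp16(a[i][k]) * to_fp16(b[k][j]))
            row.append(acc)
        d.append(row)
    return d

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]

# Small integers are exact in FP16, so multiplying by the identity returns b.
print(mma_4x4(identity, b, zeros) == b)
```

The point of the split is that only the inputs lose precision; the sum of products is kept in the wider format.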

ankan · Post by **ankan** » Sun Sep 16, 2018 6:09 am

Milos wrote: ↑Sun Sep 16, 2018 1:24 am
Werewolf wrote: ↑Sun Sep 16, 2018 12:40 am Nvidia are claiming tensor can process FP16, which is quite a big deal. If it can’t be used it’s useless.

Different sites are claiming very different numbers. I just want to find out how fast the card really is for Lc0.
First you have to understand what tensor cores are. They compute in mixed precision, i.e. in a 4x4 matrix multiplication the operands are FP16 and the result is accumulated in FP32.
Tensor cores in RTX cards are the same ones as in the Volta architecture. The problem is that they help little, if at all, for inference with regular 3x3 convolutions. Tensor cores don't help Lc0 on Volta, therefore they won't help on RTX either. Since FP16 is not enabled in 20xx cards, just as it wasn't enabled in 10xx cards, the only gain is the 15% from extra CUDA cores and higher frequency. Therefore the 2080Ti will be faster than the 1080Ti by exactly those 15%. Anyone who believes in some other magical speedup is, frankly speaking, just daydreaming.

This is definitely not true. The fp16 path of lc0 uses tensor cores on Volta and they do help 3x3 convolutions. The reason you see only about 3x speedup at best (compared to 8x if you compare the peak fp16 tensor math vs regular fp32 throughput) is that the fp32 path uses the Winograd algorithm, which is 2-3x faster than the regular implicit GEMM algorithm used by the fp16 path. As you said, tensor cores just give you 4x4 matrix multiplications, and making them work with the Winograd algorithm is hard.

2080Ti should be almost as fast as a TitanV for lc0 (or ~3X faster than 1080Ti when using fp16 mode).
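
The reasoning above can be sketched as simple arithmetic. This assumes, purely for illustration, that the fp16 path reaches the quoted 8x peak ratio and that the Winograd advantage of the fp32 path divides it cleanly; `net_speedup` is a hypothetical helper, not part of lc0.

```python
def net_speedup(peak_ratio, winograd_gain):
    """Speedup of a tensor-core fp16 path over a Winograd fp32 path.

    peak_ratio:    peak fp16 tensor throughput / regular fp32 throughput
    winograd_gain: how much faster Winograd is than implicit GEMM
    """
    return peak_ratio / winograd_gain

# The post quotes an 8x peak ratio and a 2-3x Winograd advantage.
for gain in (2.0, 3.0):
    print(f"Winograd gain {gain}x -> net speedup {net_speedup(8.0, gain):.2f}x")
```

Under these assumptions the result falls between roughly 2.7x and 4x, the same ballpark as the ~3x figure quoted above.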

Milos · Post by **Milos** » Sun Sep 16, 2018 7:22 am

ankan wrote: ↑Sun Sep 16, 2018 6:09 am
Milos wrote: ↑Sun Sep 16, 2018 1:24 am
Werewolf wrote: ↑Sun Sep 16, 2018 12:40 am Nvidia are claiming tensor can process FP16, which is quite a big deal. If it can’t be used it’s useless.

Different sites are claiming very different numbers. I just want to find out how fast the card really is for Lc0.
First you have to understand what tensor cores are. They compute in mixed precision, i.e. in a 4x4 matrix multiplication the operands are FP16 and the result is accumulated in FP32.
Tensor cores in RTX cards are the same ones as in the Volta architecture. The problem is that they help little, if at all, for inference with regular 3x3 convolutions. Tensor cores don't help Lc0 on Volta, therefore they won't help on RTX either. Since FP16 is not enabled in 20xx cards, just as it wasn't enabled in 10xx cards, the only gain is the 15% from extra CUDA cores and higher frequency. Therefore the 2080Ti will be faster than the 1080Ti by exactly those 15%. Anyone who believes in some other magical speedup is, frankly speaking, just daydreaming.
This is definitely not true. The fp16 path of lc0 uses tensor cores on Volta and they do help 3x3 convolutions. The reason you see only about 3x speedup at best (compared to 8x if you compare the peak fp16 tensor math vs regular fp32 throughput) is that the fp32 path uses the Winograd algorithm, which is 2-3x faster than the regular implicit GEMM algorithm used by the fp16 path. As you said, tensor cores just give you 4x4 matrix multiplications, and making them work with the Winograd algorithm is hard.

2080Ti should be almost as fast as a TitanV for lc0 (or ~3X faster than 1080Ti when using fp16 mode).

Well, you might be one of thousands of other Indian guys writing drivers for Nvidia, but what you are writing is definitely false.
Titan V has FP16 working in its CUDA cores, 2080Ti doesn't. FP16 in CUDA gives Lc0 exactly a 2x speedup compared to FP32 on the 1080Ti. In addition, Titan V has 20% more CUDA cores than 1080Ti. In total this gives a 2.4x speedup thanks to the CUDA cores alone. Lc0 on Titan V is around 2.5x faster than on 1080Ti (not 3x as you claim). That means the Tensor cores of Titan V contribute at best 5% additional speedup on top of that from the CUDA cores.
Since 2080Ti doesn't have FP16 working in its CUDA cores, the additional speedup of 2080Ti can be only 5% thanks to Tensor cores (plus around 15% thanks to more CUDA cores).
Your stories about a 3x speedup for 2080Ti compared to 1080Ti are nothing but marketing for your company. You are simply biased since you have a vested interest.
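
The arithmetic in this post can be replayed as a quick check. The sketch below takes the post's own figures at face value (they are disputed elsewhere in the thread) and assumes the individual gains simply multiply; the variable names are made up for illustration.

```python
# Figures as claimed in the post above, not measurements.
fp16_gain = 2.0      # FP16 vs FP32 on CUDA cores
core_gain = 1.20     # Titan V has 20% more CUDA cores than 1080Ti
measured = 2.5       # claimed Titan V vs 1080Ti speedup in Lc0

# Speedup attributed to the CUDA cores alone.
cuda_only = fp16_gain * core_gain

# Whatever is left over is what the post attributes to Tensor cores.
residual = measured / cuda_only

print(f"CUDA-only speedup: {cuda_only:.2f}x")
print(f"Residual factor:   {residual:.3f}x")
```

The residual comes out at a little over 4%, consistent with the "at best 5%" stated in the post, given its premises.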

Lion · Post by **Lion** » Sun Sep 16, 2018 9:58 am

I have no clue.... but isn’t the new 2080TI supposed to come out this week?
If this is the case, we should know very soon....

Rgds

Werewolf · Post by **Werewolf** » Sun Sep 16, 2018 10:20 am

Milos wrote: ↑Sun Sep 16, 2018 1:24 am Since FP16 is not enabled in 20xx cards, just as it wasn't enabled in 10xx cards, the only gain is the 15% from extra CUDA cores and higher frequency. Therefore the 2080Ti will be faster than the 1080Ti by exactly those 15%. Anyone who believes in some other magical speedup is, frankly speaking, just daydreaming.

OK, I get your argument.

In that case an old Quadro P5000 (8.8 TFLOPS in FP32) - only a bit more expensive than a 2080 Ti - should be hitting 17.6 TFLOPS FP16, outperforming the 2080 Ti. And the new RTX 5000 should be about 21 TFLOPS FP16. The top end RTX 6000 should be about 32 TFLOPS FP16.

There are a lot of rumours the Titan line will be axed soon because it's eating into Quadro.

megamau · Post by **megamau** » Fri Sep 21, 2018 2:25 pm

Milos wrote: ↑Sun Sep 16, 2018 7:22 am
ankan wrote: ↑Sun Sep 16, 2018 6:09 am
Milos wrote: ↑Sun Sep 16, 2018 1:24 am Since FP16 is not enabled in 20xx cards, just as it wasn't enabled in 10xx cards, the only gain is the 15% from extra CUDA cores and higher frequency. Therefore the 2080Ti will be faster than the 1080Ti by exactly those 15%. Anyone who believes in some other magical speedup is, frankly speaking, just daydreaming.
This is definitely not true. The fp16 path of lc0 uses tensor cores on Volta and they do help 3x3 convolutions. The reason you see only about 3x speedup at best (compared to 8x if you compare the peak fp16 tensor math vs regular fp32 throughput) is that the fp32 path uses the Winograd algorithm, which is 2-3x faster than the regular implicit GEMM algorithm used by the fp16 path. As you said, tensor cores just give you 4x4 matrix multiplications, and making them work with the Winograd algorithm is hard.

2080Ti should be almost as fast as a TitanV for lc0 (or ~3X faster than 1080Ti when using fp16 mode).
Well, you might be one of thousands of other Indian guys writing drivers for Nvidia, but what you are writing is definitely false.
Titan V has FP16 working in its CUDA cores, 2080Ti doesn't. ....
Since 2080Ti doesn't have FP16 working in its CUDA cores, the additional speedup of 2080Ti can be only 5% thanks to Tensor cores (plus around 15% thanks to more CUDA cores). Your stories about a 3x speedup for 2080Ti compared to 1080Ti are nothing but marketing for your company. You are simply biased since you have a vested interest.

So Milos, now that the cards and the benchmarks are available, is it time to admit you were wrong?

frankp · Post by **frankp** » Fri Sep 21, 2018 3:33 pm

Someone posted nps numbers for fp16 and fp32 on the Leela Discord.
Very impressive for the 2080Ti: about the same as a Titan V.
The fp32 numbers are not as good, but fp32 is not used by Leela, as I understand it.
Working from memory, I recall around 32 knps from the start position for fixed nodes (1M maybe). A vague recollection, but in the right ballpark.
