Next-Gen GPUs for LC0

Milos · Post by **Milos** » Mon Sep 28, 2020 10:25 am

Laskos wrote: ↑Mon Sep 28, 2020 9:47 am
Werewolf wrote: ↑Sun Sep 27, 2020 8:24 pm Seems like there is actual data appearing now

Very early results by people quoting ankan suggest if we take the 2080 Ti as a reference point:

3080 is about 1.4x faster
3090 is about 1.6x faster
A100 is about 2.7x faster

Seems like the batch size makes a big difference

That last result contradicts what I saw earlier. Anyway, if this forum had a means to insert an image I'd post the screenshot from the discord...
That's good news. It means that 3080 is about 190% of 2080 as Lc0 speed goes, so represents a very good deal. Do you know the 3070 estimate? To me 3080 is a bit too hungry Watt wise for my PC.

The 3070 hasn't appeared on the market yet.

Laskos · Post by **Laskos** » Mon Sep 28, 2020 11:29 am

Milos wrote: ↑Mon Sep 28, 2020 10:25 am
Laskos wrote: ↑Mon Sep 28, 2020 9:47 am
Werewolf wrote: ↑Sun Sep 27, 2020 8:24 pm Seems like there is actual data appearing now

Very early results by people quoting ankan suggest if we take the 2080 Ti as a reference point:

3080 is about 1.4x faster
3090 is about 1.6x faster
A100 is about 2.7x faster

Seems like the batch size makes a big difference

That last result contradicts what I saw earlier. Anyway, if this forum had a means to insert an image I'd post the screenshot from the discord...
That's good news. It means that 3080 is about 190% of 2080 as Lc0 speed goes, so represents a very good deal. Do you know the 3070 estimate? To me 3080 is a bit too hungry Watt wise for my PC.
The 3070 hasn't appeared on the market yet.

Do you have an estimate of how 3070 compares to 3080 in Lc0 case and similar? Nominally, 3080 should be about 45% faster, but practically this varies with application, from 20% faster to 50% faster.

Milos · Post by **Milos** » Mon Sep 28, 2020 3:29 pm

Laskos wrote: ↑Mon Sep 28, 2020 11:29 am
Milos wrote: ↑Mon Sep 28, 2020 10:25 am
Laskos wrote: ↑Mon Sep 28, 2020 9:47 am
Werewolf wrote: ↑Sun Sep 27, 2020 8:24 pm Seems like there is actual data appearing now

Very early results by people quoting ankan suggest if we take the 2080 Ti as a reference point:

3080 is about 1.4x faster
3090 is about 1.6x faster
A100 is about 2.7x faster

Seems like the batch size makes a big difference

That last result contradicts what I saw earlier. Anyway, if this forum had a means to insert an image I'd post the screenshot from the discord...
That's good news. It means that 3080 is about 190% of 2080 as Lc0 speed goes, so represents a very good deal. Do you know the 3070 estimate? To me 3080 is a bit too hungry Watt wise for my PC.
The 3070 hasn't appeared on the market yet.
Do you have an estimate of how 3070 compares to 3080 in Lc0 case and similar? Nominally, 3080 should be about 45% faster, but practically this varies with application, from 20% faster to 50% faster.

Comparing just CUDA cores IMO gives very reliable estimate for Lc0 since it mainly uses CUDA cores. So 8704/5888 = 1.48, i.e. 3080 would be 45-50% faster.
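A minimal sketch of that estimate, assuming (as the post does) that Lc0 throughput scales linearly with the advertised CUDA core count. The function and constant names are only illustrative.

```python
def core_ratio(cores_a: int, cores_b: int) -> float:
    """Ratio of advertised CUDA core counts, used as a crude speed estimate."""
    return cores_a / cores_b


# Core counts quoted in the post above.
RTX_3080_CORES = 8704
RTX_3070_CORES = 5888

ratio = core_ratio(RTX_3080_CORES, RTX_3070_CORES)
print(f"3080 / 3070 core ratio: {ratio:.2f}")
```

This only reproduces the arithmetic; whether core count really predicts Lc0 speed is exactly what the rest of the thread argues about.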

Alayan · Post by **Alayan** » Mon Sep 28, 2020 4:17 pm

Nvidia changed the definition of CUDA cores. You need a workload that fully saturates the FP32 units to get close (but not quite) to the effect the same number of CUDA cores would have had in Turing.

1 CUDA in Turing : 1xFP32 unit + 1xINT32 unit able to execute concurrently
2 CUDA in Ampere : 1xFP32 unit + 1x(INT32 OR FP32) unit able to execute concurrently

But more importantly, isn't Leela supposed to use FP16 operations with most of the relevant FP16 compute from RTX cards coming from tensor cores and not from the 2xFP16 mode of FP32 units ?
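The two core definitions above can be turned into a toy model. This is only a sketch under strong assumptions: each datapath issues one instruction per cycle, scheduling is perfect, and memory, FP16 and tensor cores are ignored. The function names and the 100/36 instruction mix are made up for illustration.

```python
def turing_cycles(n_fp: float, n_int: float) -> float:
    """One FP32 lane plus one INT32 lane running concurrently."""
    return max(n_fp, n_int)


def ampere_cycles(n_fp: float, n_int: float) -> float:
    """One FP32-only lane plus one lane that takes INT32 or FP32.

    The shared lane must run all INT32 work; FP32 work is split across
    both lanes to balance the load as well as possible.
    """
    return max(n_int, (n_fp + n_int) / 2)


def per_core_efficiency(n_fp: float, n_int: float) -> float:
    """Ampere speed per advertised core relative to Turing.

    The same pair of datapaths counts as 1 CUDA core in Turing and as
    2 CUDA cores in Ampere, so the raw speedup is divided by 2.
    """
    return turing_cycles(n_fp, n_int) / ampere_cycles(n_fp, n_int) / 2


for n_fp, n_int in [(100, 0), (100, 36), (100, 100)]:
    eff = per_core_efficiency(n_fp, n_int)
    print(f"FP32={n_fp:3d} INT32={n_int:3d} -> per-core efficiency {eff:.2f}")
```

In this model an Ampere core only matches a Turing core when the workload is pure FP32, and falls behind as soon as INT32 work competes for the shared lane.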

Milos · Post by **Milos** » Mon Sep 28, 2020 4:41 pm

Alayan wrote: ↑Mon Sep 28, 2020 4:17 pm Nvidia changed the definition of CUDA cores. You need a workload that fully saturates the FP32 units to get close (but not quite) to the effect the same number of CUDA cores would have had in Turing.

1 CUDA in Turing : 1xFP32 unit + 1xINT32 unit able to execute concurrently
2 CUDA in Ampere : 1xFP32 unit + 1x(INT32 OR FP32) unit able to execute concurrently

But more importantly, isn't Leela supposed to use FP16 operations with most of the relevant FP16 compute from RTX cards coming from tensor cores and not from the 2xFP16 mode of FP32 units ?

Leela mainly uses the FP16 multipliers in the CUDA cores. I am really not aware that this definition changed. Tensor cores are only used for the 3x3 convolutions in the input layer (rather inefficiently). You can't sensibly use Tensor cores for 1x1 convolutions (which are the great majority of operations in Lc0 DNN inference), i.e. you can, but it is grossly inefficient.

Laskos · Post by **Laskos** » Mon Sep 28, 2020 5:42 pm

Milos wrote: ↑Mon Sep 28, 2020 3:29 pm
Laskos wrote: ↑Mon Sep 28, 2020 11:29 am
Milos wrote: ↑Mon Sep 28, 2020 10:25 am
Laskos wrote: ↑Mon Sep 28, 2020 9:47 am
Werewolf wrote: ↑Sun Sep 27, 2020 8:24 pm Seems like there is actual data appearing now

Very early results by people quoting ankan suggest if we take the 2080 Ti as a reference point:

3080 is about 1.4x faster
3090 is about 1.6x faster
A100 is about 2.7x faster

Seems like the batch size makes a big difference

That last result contradicts what I saw earlier. Anyway, if this forum had a means to insert an image I'd post the screenshot from the discord...
That's good news. It means that 3080 is about 190% of 2080 as Lc0 speed goes, so represents a very good deal. Do you know the 3070 estimate? To me 3080 is a bit too hungry Watt wise for my PC.
The 3070 hasn't appeared on the market yet.
Do you have an estimate of how 3070 compares to 3080 in Lc0 case and similar? Nominally, 3080 should be about 45% faster, but practically this varies with application, from 20% faster to 50% faster.
Comparing just CUDA cores IMO gives very reliable estimate for Lc0 since it mainly uses CUDA cores. So 8704/5888 = 1.48, i.e. 3080 would be 45-50% faster.

In that case the 3070 is not that great a deal. It will be weaker than the 2080 Ti with Chess and Go, and I already see second-hand 2080 Ti cards as cheap as $600. They might go even lower, because the 3080 rocks.

mehmet123 · Post by **mehmet123** » Mon Sep 28, 2020 5:49 pm

Laskos wrote: ↑Mon Sep 28, 2020 5:42 pm
In that case the 3070 is not that great a deal. It will be weaker than the 2080 Ti with Chess and Go, and I already see second-hand 2080 Ti cards as cheap as $600. They might go even lower, because the 3080 rocks.

Why will the RTX 3070 be weaker than the RTX 2080 Ti?
The RTX 3070 has 5888 CUDA cores but the RTX 2080 Ti has 4352 CUDA cores.

Laskos · Post by **Laskos** » Mon Sep 28, 2020 5:54 pm

mehmet123 wrote: ↑Mon Sep 28, 2020 5:49 pm
Laskos wrote: ↑Mon Sep 28, 2020 5:42 pm
In that case the 3070 is not that great a deal. It will be weaker than the 2080 Ti with Chess and Go, and I already see second-hand 2080 Ti cards as cheap as $600. They might go even lower, because the 3080 rocks.
Why will the RTX 3070 be weaker than the RTX 2080 Ti?
The RTX 3070 has 5888 CUDA cores but the RTX 2080 Ti has 4352 CUDA cores.

It follows from the data Werewolf and Milos provided: the 3080 is 1.4x as fast as the 2080 Ti and 1.48x as fast as the 3070. So the 3070 comes out weaker than the 2080 Ti with Lc0 (1.4 / 1.48 ≈ 0.95).
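The two estimates in play can be put side by side. This is a rough sketch that uses only numbers quoted in this thread: the early 1.4x figure for the 3080 versus the 2080 Ti, and the advertised core counts. The variable names are illustrative, and both results are estimates, not measurements of a 3070.

```python
# Advertised CUDA core counts quoted in the thread.
CORES = {"RTX 2080 Ti": 4352, "RTX 3070": 5888, "RTX 3080": 8704}

# Early reported Lc0 speed of the 3080 relative to the 2080 Ti.
SPEED_3080_VS_2080TI = 1.4

# Estimate A: compare core counts across generations directly.
naive_3070_vs_2080ti = CORES["RTX 3070"] / CORES["RTX 2080 Ti"]

# Estimate B: chain the reported 3080 result with the 3080/3070 core
# ratio, which stays within one architecture.
speed_3080_vs_3070 = CORES["RTX 3080"] / CORES["RTX 3070"]
chained_3070_vs_2080ti = SPEED_3080_VS_2080TI / speed_3080_vs_3070

print(f"core-count estimate: 3070 = {naive_3070_vs_2080ti:.2f}x 2080 Ti")
print(f"chained estimate:    3070 = {chained_3070_vs_2080ti:.2f}x 2080 Ti")
```

The gap between the two estimates is what you would expect if an Ampere CUDA core is not worth as much as a Turing one, as Alayan pointed out.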

mehmet123 · Post by **mehmet123** » Mon Sep 28, 2020 5:58 pm

Lc0 benchmarks with SV-3010 network (384x30)

Default settings (minibatch-size=256)
---------------------------------------------
GPU          baseline  optimized  perf gain (%)
---------------------------------------------
Titan RTX       17443      20084           15.1
RTX 3090        26820      29767           11.0
A100            41785      48815           16.8

Minibatch-size=1024, all other settings default:
---------------------------------------------
GPU          baseline  optimized  perf gain (%)
---------------------------------------------
Titan RTX       20211      23003           13.8
RTX 3090        33032      36924           11.8
A100            52732      59134           12.1

(From Lc0 Discord)
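As a quick check of the tables, here is a small sketch that recomputes the "perf gain (%)" column from the baseline and optimized figures and expresses each card relative to the Titan RTX. The figures are copied from the tables above; the helper names are only illustrative.

```python
# (baseline, optimized) per GPU, keyed by minibatch size.
RESULTS = {
    256: {
        "Titan RTX": (17443, 20084),
        "RTX 3090": (26820, 29767),
        "A100": (41785, 48815),
    },
    1024: {
        "Titan RTX": (20211, 23003),
        "RTX 3090": (33032, 36924),
        "A100": (52732, 59134),
    },
}


def perf_gain_percent(baseline: float, optimized: float) -> float:
    """Percentage improvement of the optimized figure over the baseline."""
    return (optimized / baseline - 1.0) * 100.0


def relative_to(table: dict, reference: str) -> dict:
    """Optimized figure of each GPU divided by that of the reference GPU."""
    ref = table[reference][1]
    return {gpu: opt / ref for gpu, (_, opt) in table.items()}


for batch, table in RESULTS.items():
    print(f"minibatch-size={batch}")
    rel = relative_to(table, "Titan RTX")
    for gpu, (base, opt) in table.items():
        gain = perf_gain_percent(base, opt)
        print(f"  {gpu:<10} gain {gain:5.1f}%   {rel[gpu]:.2f}x Titan RTX")
```

The recomputed gains match the posted column, and the ratios make it easy to see how the gap to the Titan RTX shifts between the two minibatch sizes.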

Laskos · Post by **Laskos** » Mon Sep 28, 2020 6:18 pm

mehmet123 wrote: ↑Mon Sep 28, 2020 5:58 pm Lc0 benchmarks with SV-3010 network (384x30)

Default settings (minibatch-size=256)
---------------------------------------------
GPU          baseline  optimized  perf gain (%)
---------------------------------------------
Titan RTX       17443      20084           15.1
RTX 3090        26820      29767           11.0
A100            41785      48815           16.8

Minibatch-size=1024, all other settings default:
---------------------------------------------
GPU          baseline  optimized  perf gain (%)
---------------------------------------------
Titan RTX       20211      23003           13.8
RTX 3090        33032      36924           11.8
A100            52732      59134           12.1

(From Lc0 Discord)

What is "baseline" and "optimized"?
