how good is a GeForce GTX 1060 6GB for Leela ?

Albert Silver
Posts: 3019
Joined: Wed Mar 08, 2006 9:57 pm
Location: Rio de Janeiro, Brazil

Re: how good is a GeForce GTX 1060 6GB for Leela ?

Post by Albert Silver »

Milos wrote:
Albert Silver wrote:
Milos wrote:
Albert Silver wrote:
Milos wrote:
Werewolf wrote:
Milos wrote:
Judging by the current proper benchmark (LC0 on cuDNN), the Titan V is only 3.5 times faster than a GTX 960, and 2.2 times faster than a 1080 Ti. And a GTX 960 is at least 15 times cheaper than a Titan V.
How do you work that out? The Titan V isn't 2.2x faster than a 1080 ti according to either the GFLOPS (12300 vs 10600 respectively) or the benchmarks here

https://docs.google.com/spreadsheets/d/ ... edit#gid=0

Unless I'm misreading them.
You should look at the Windows version of LC0-cudnn on the Titan V, since the CUDA libs for Windows are obviously better for the tensor cores (Titan V) than the Linux ones.
Is it enough to download and install CUDA, or do I need something else, such as a special command line or executable? If so, can you point the way?
You need to compile a proper executable; however, precompiled binaries are already available.
Look here:
https://github.com/mooskagh/leela-chess/tree/master/lc0
OK, that worked, thanks. I had downloaded CUDA 9.1, so I renamed the files to 90 where needed. How do I do a full tune? Or is that no longer done?
Download 9.0 directly; it is available from NVIDIA, just search the other releases.
Tuning is not required, since cuDNN is already tuned to your specific GPU.
No need, as the renamed DLLs worked fine. I am downloading it anyway just to compare, but either way, thanks. The speed-up is insane.
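For anyone who wants to try the same workaround, here is a minimal sketch of what the renaming amounts to. The paths and the exact DLL list are assumptions on my part; check which names your lc0 build actually asks for before copying anything.

Code: Select all

import shutil
from pathlib import Path

# Hypothetical paths: adjust to your CUDA 9.1 install and your lc0 folder.
cuda_bin = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.1\bin")
lc0_dir = Path(r"C:\lc0")

# Assumed mapping: copies of the 9.1 DLLs stored under the 9.0 names the binary expects.
renames = {
    "cudart64_91.dll": "cudart64_90.dll",
    "cublas64_91.dll": "cublas64_90.dll",
}

for src_name, dst_name in renames.items():
    src = cuda_bin / src_name
    if src.exists():
        shutil.copy2(src, lc0_dir / dst_name)  # copy rather than rename the originals
        print(f"copied {src_name} -> {dst_name}")
    else:
        print(f"not found: {src_name}")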
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: how good is a GeForce GTX 1060 6GB for Leela ?

Post by Milos »

Albert Silver wrote:
Milos wrote:
Albert Silver wrote:
Milos wrote:
Albert Silver wrote:
Milos wrote:
Werewolf wrote:
Milos wrote:
Judging by the current proper benchmark (LC0 on cuDNN), the Titan V is only 3.5 times faster than a GTX 960, and 2.2 times faster than a 1080 Ti. And a GTX 960 is at least 15 times cheaper than a Titan V.
How do you work that out? The Titan V isn't 2.2x faster than a 1080 ti according to either the GFLOPS (12300 vs 10600 respectively) or the benchmarks here

https://docs.google.com/spreadsheets/d/ ... edit#gid=0

Unless I'm misreading them.
You should look at the Windows version of LC0-cudnn on the Titan V, since the CUDA libs for Windows are obviously better for the tensor cores (Titan V) than the Linux ones.
Is it enough to download and install CUDA, or do I need something else, such as a special command line or executable? If so, can you point the way?
You need to compile a proper executable; however, precompiled binaries are already available.
Look here:
https://github.com/mooskagh/leela-chess/tree/master/lc0
OK, that worked, thanks. I had downloaded CUDA 9.1, so I renamed the files to 90 where needed. How do I do a full tune? Or is that no longer done?
Download 9.0 directly; it is available from NVIDIA, just search the other releases.
Tuning is not required, since cuDNN is already tuned to your specific GPU.
No need, as the renamed DLLs worked fine. I am downloading it anyway just to compare, but either way, thanks. The speed-up is insane.
You might actually gain something from the renaming, since there are some speedups in 9.1 compared to 9.0, though mainly on the Tesla V; I'm not sure about "normal" GPUs.
The library name is hardcoded, but the actual dependency on that version is not, which is why renaming works.
We will get more performance once NVIDIA releases 9.2, since they claim a 2-5x speed-up in HPC deep-learning convolutions.
OTOH, more speed can still be gained when the implementation is changed from cuDNN to TensorRT. Hopefully Alexander Lyashuk does it soon. So far he has done a great job of completely rewriting LC0 for CUDA.
Gian-Carlo's hand-written OpenCL implementation is really no match for NVIDIA's specialized libraries; that's why there is so much speed-up.
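Since the library names are baked into the executable, you can list exactly which DLL names a given lc0 build asks for (and therefore what the renamed copies must be called). A minimal sketch using the third-party pefile package; the executable path is only an example.

Code: Select all

import pefile  # third-party: pip install pefile

# Hypothetical path to the lc0 cudnn build.
pe = pefile.PE(r"C:\lc0\lc0.exe")

# Print the imported DLL names; the CUDA/cuDNN entries show whether the binary
# wants the _90 or the _91 suffixed runtime libraries.
for entry in pe.DIRECTORY_ENTRY_IMPORT:
    print(entry.dll.decode())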
Albert Silver
Posts: 3019
Joined: Wed Mar 08, 2006 9:57 pm
Location: Rio de Janeiro, Brazil

Re: how good is a GeForce GTX 1060 6GB for Leela ?

Post by Albert Silver »

Milos wrote:
Albert Silver wrote:
Milos wrote:
Albert Silver wrote:
Milos wrote:
Albert Silver wrote:
Milos wrote:
Werewolf wrote:
Milos wrote:
Judging by the current proper benchmark (LC0 on cuDNN), the Titan V is only 3.5 times faster than a GTX 960, and 2.2 times faster than a 1080 Ti. And a GTX 960 is at least 15 times cheaper than a Titan V.
How do you work that out? The Titan V isn't 2.2x faster than a 1080 ti according to either the GFLOPS (12300 vs 10600 respectively) or the benchmarks here

https://docs.google.com/spreadsheets/d/ ... edit#gid=0

Unless I'm misreading them.
You should look at the Windows version of LC0-cudnn on the Titan V, since the CUDA libs for Windows are obviously better for the tensor cores (Titan V) than the Linux ones.
Is it enough to download and install CUDA, or do I need something else, such as a special command line or executable? If so, can you point the way?
You need to compile a proper executable; however, precompiled binaries are already available.
Look here:
https://github.com/mooskagh/leela-chess/tree/master/lc0
OK, that worked, thanks. I had downloaded CUDA 9.1, so I renamed the files to 90 where needed. How do I do a full tune? Or is that no longer done?
Download 9.0 directly; it is available from NVIDIA, just search the other releases.
Tuning is not required, since cuDNN is already tuned to your specific GPU.
No need, as the renamed DLLs worked fine. I am downloading it anyway just to compare, but either way, thanks. The speed-up is insane.
You might actually gain something from the renaming, since there are some speedups in 9.1 compared to 9.0, though mainly on the Tesla V; I'm not sure about "normal" GPUs.
The library name is hardcoded, but the actual dependency on that version is not, which is why renaming works.
We will get more performance once NVIDIA releases 9.2, since they claim a 2-5x speed-up in HPC deep-learning convolutions.
OTOH, more speed can still be gained when the implementation is changed from cuDNN to TensorRT. Hopefully Alexander Lyashuk does it soon. So far he has done a great job of completely rewriting LC0 for CUDA.
Gian-Carlo's hand-written OpenCL implementation is really no match for NVIDIA's specialized libraries; that's why there is so much speed-up.
I'm really shocked at how big the speedup is. If I run it for 30 seconds (the 26-ply limit stops too early now), it shows nearly 12.5 kNPS on my GTX 1060, and with just the 26-ply limit it is 9.5 kNPS. I'll run some tests to see whether this pans out in results.
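For anyone who wants to repeat this kind of kNPS comparison, a minimal sketch of driving lc0 over UCI and reading the nps figure from the search info. The path is only an example, and it assumes the network file is picked up automatically next to the binary.

Code: Select all

import subprocess

# Hypothetical path; assumes lc0 finds its network file on its own,
# so no extra command-line options are passed.
LC0 = r"C:\lc0\lc0.exe"

def bench_nps(movetime_ms=30000):
    p = subprocess.Popen([LC0], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                         universal_newlines=True, bufsize=1)

    def send(cmd):
        p.stdin.write(cmd + "\n")
        p.stdin.flush()

    send("uci")
    for line in p.stdout:          # wait until the engine has identified itself
        if line.strip() == "uciok":
            break

    send("position startpos")
    send("go movetime %d" % movetime_ms)

    last_nps = None
    for line in p.stdout:          # "info ... nps N ..." lines, then "bestmove ..."
        tokens = line.split()
        if "nps" in tokens:
            last_nps = int(tokens[tokens.index("nps") + 1])
        if tokens and tokens[0] == "bestmove":
            break

    send("quit")
    p.wait()
    return last_nps

print(bench_nps())                 # e.g. roughly 12500 on the GTX 1060 above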
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: how good is a GeForce GTX 1060 6GB for Leela ?

Post by Dann Corbit »

Milos wrote:
Dann Corbit wrote:
Werewolf wrote:
Dann Corbit wrote:Titan V has tensor cores.
But they are fiddly to use and you have to program specifically for them.
The Titan V also "only" has 640 of them. I suspect its successor will cram in many more.
They can multiply a small matrix in a single cycle (the tensor cores).
A 4x4 one, to be precise. Since the LC0 kernel is 3x3, there is only roughly 1/3 efficiency ((27+9*2)/(81+16*3) operations) when running LC0 purely on the tensor cores, assuming of course that they are fully loaded and that the cuDNN libs are efficient for them, which is a big question mark at the moment.
I guess that you have to program specifically for the tensor cores.

If they were going to run on that hardware, it would certainly make sense to change to a 4x4 kernel. That is a very good point.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
jkiliani
Posts: 143
Joined: Wed Jan 17, 2018 1:26 pm

Re: how good is a GeForce GTX 1060 6GB for Leela ?

Post by jkiliani »

Dann Corbit wrote:
Milos wrote:
Dann Corbit wrote:
Werewolf wrote:
Dann Corbit wrote:Titan V has tensor cores.
But they are fiddly to use and you have to program specifically for them.
The Titan V also "only" has 640 of them. I suspect its successor will cram in many more.
They can multiply a small matrix in a single cycle (the tensor cores).
A 4x4 one, to be precise. Since the LC0 kernel is 3x3, there is only roughly 1/3 efficiency ((27+9*2)/(81+16*3) operations) when running LC0 purely on the tensor cores, assuming of course that they are fully loaded and that the cuDNN libs are efficient for them, which is a big question mark at the moment.
I guess that you have to program specifically for the tensor cores.

If they were going to run on that hardware, it would certainly make sense to change to a 4x4 kernel. That is a very good point.
3x3 kernels are extremely common in machine learning for good reason, so I think both the design engineers and the driver programmers at Nvidia are way ahead of Milos there. They probably use Winograd transforms to compute 3x3 kernels, just like the Leela Zero OpenCL implementation (which is not nearly as bad as Milos claims, I might add).
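For reference, the F(2x2, 3x3) Winograd transform mentioned above can be checked in a few lines of numpy. This is only an illustration of the transform with the standard matrices from Lavin's paper, not the Leela Zero code.

Code: Select all

import numpy as np

# Winograd F(2x2, 3x3): computes a 2x2 output tile from a 4x4 input tile and a
# 3x3 filter using 16 elementwise multiplies instead of the 36 a direct
# computation needs (transform matrices from Lavin & Gray).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 filter -> 2x2 output tile."""
    U = G @ g @ G.T               # transformed filter (4x4)
    V = B_T @ d @ B_T.T           # transformed input tile (4x4)
    return A_T @ (U * V) @ A_T.T  # the elementwise product is the 16 multiplies

# Sanity check against a direct 3x3 convolution (cross-correlation convention).
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
print(np.allclose(winograd_2x2_3x3(d, g), direct))  # True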
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: how good is a GeForce GTX 1060 6GB for Leela ?

Post by Milos »

jkiliani wrote:3x3 kernels are extremely common in machine learning for good reason, so I think both the design engineers and the driver programmers at Nvidia are way ahead of Milos there. They probably use Winograd transforms to compute 3x3 kernels, just like the Leela Zero OpenCL implementation (which is not nearly as bad as Milos claims, I might add).
Haha, sure, you know what they use in CUDA; you probably had a glimpse into its source? :lol: Well, no.
Your knowledge of kernel arithmetic doesn't get further than Winograd transforms, does it? ;)
You probably read that one paper by Lavin; I guess that makes you an expert on ML hardware acceleration.

Regarding Gian-Carlo's hand-written OpenCL implementation: sure, it is not that bad, it is only 8x slower than the cuDNN implementation with an appropriate batch size. Well, if 8x is not bad, I wonder what you would call bad?
That's roughly the speed-up you get going from Intel clDNN on 6xx-series integrated graphics to a 1080 Ti on OpenCL.
Werewolf
Posts: 1795
Joined: Thu Sep 18, 2008 10:24 pm

Re: how good is a GeForce GTX 1060 6GB for Leela ?

Post by Werewolf »

Albert, what are we saying here?

You replace the included DLLs with ones from Nvidia and you get an 8x speedup?

Just like that? Why don't they change the GPU download package then to reflect this discovery?
Matthias Gemuh
Posts: 3245
Joined: Thu Mar 09, 2006 9:10 am

Re: how good is a GeForce GTX 1060 6GB for Leela ?

Post by Matthias Gemuh »

Werewolf wrote:Albert, what are we saying here?

You replace the included DLLs with ones from Nvidia and you get an 8x speedup?

Just like that? Why don't they change the GPU download package then to reflect this discovery?
Are you saying an Nvidia DLL would be suitable for all graphics cards out there?
My engine was quite strong till I added knowledge to it.
http://www.chess.hylogic.de
Ras
Posts: 2487
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: how good is a GeForce GTX 1060 6GB for Leela ?

Post by Ras »

Milos wrote:Gian-Carlo's hand-written OpenCL implementation is really no match for NVIDIA's specialized libraries
I guess the OpenCL version also works with AMD GPUs while CUDA does not. Since the project depends on voluntary contribution, it may make sense not to shut out a considerable number of potential volunteers.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: how good is a GeForce GTX 1060 6GB for Leela ?

Post by Dann Corbit »

jkiliani wrote:
Dann Corbit wrote:
Milos wrote:
Dann Corbit wrote:
Werewolf wrote:
Dann Corbit wrote:Titan V has tensor cores.
But they are fiddly to use and you have to program specifically for them.
The Titan V also "only" has 640 of them. I suspect its successor will cram in many more.
They can multiply a small matrix in a single cycle (the tensor cores).
A 4x4 one, to be precise. Since the LC0 kernel is 3x3, there is only roughly 1/3 efficiency ((27+9*2)/(81+16*3) operations) when running LC0 purely on the tensor cores, assuming of course that they are fully loaded and that the cuDNN libs are efficient for them, which is a big question mark at the moment.
I guess that you have to program specifically for the tensor cores.

If they were going to run on that hardware, it would certainly make sense to change to a 4x4 kernel. That is a very good point.
3x3 kernels are extremely common in machine learning for good reason, so I think both the design engineers and the driver programmers at Nvidia are way ahead of Milos there. They probably use Winograd transforms to compute 3x3 kernels, just like the Leela Zero OpenCL implementation (which is not nearly as bad as Milos claims, I might add).
I think that the point Milos was making is that the tensor cores do not perform a 3x3 multiply. They perform a 4x4 multiply.

Now, a 3x3 matrix fits into a 4x4 matrix, so you can still multiply it in one cycle. But the catch is that with a 4x4 kernel you would get:
2 * (4 * 4 * 4) - 4 * 4 = 112 operations in one cycle
versus
2 * (3 * 3 * 3) - 3 * 3 = 45 operations in one cycle

So you are getting 45/112 ≈ 40% of the compute power available.

This assumes that a square matrix multiply takes 2N^3 - N^2 operations (the typical count, and I doubt you can do better on such a small matrix).
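The same counting in a small snippet, assuming the usual schoolbook multiply of N^3 multiplications plus N^2*(N-1) additions:

Code: Select all

# Naive NxN matrix multiply: N^3 multiplies + N^2*(N-1) additions = 2*N^3 - N^2 ops.
def matmul_ops(n):
    return 2 * n**3 - n**2

ops_3x3 = matmul_ops(3)  # 45
ops_4x4 = matmul_ops(4)  # 112
print(ops_3x3, ops_4x4, round(ops_3x3 / ops_4x4, 2))  # 45 112 0.4 -> about 40%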
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.