Milos wrote: Gian-Carlo's hand-written OpenCL implementation is really not a match for NVIDIA's specialized libraries.

Ras wrote: I guess the OpenCL version also works with AMD GPUs while CUDA does not. Since the project depends on voluntary contribution, it may make sense not to shut out a considerable number of potential volunteers.

Except that if we assume a 50:50 distribution of discrete AMD vs. Nvidia GPUs, even excluding AMD and speeding up Nvidia 4-8x would speed up the learning process considerably. I think in one of the GitHub threads they were discussing licensing issues with cuDNN; I am not sure what the status of that is. Otherwise I don't see a reason why there shouldn't be an Nvidia-specific binary for people who want to contribute with Nvidia hardware and an OpenCL binary for everyone else, in the official download section of lc0, and preferably without requiring the user to install the Nvidia developer tools (if that's possible).
how good is a GeForce GTX 1060 6GB for Leela ?
Moderators: hgm, Rebel, chrisw
-
- Posts: 52
- Joined: Sat Mar 24, 2018 4:18 pm
Re: how good is a GeForce GTX 1060 6GB for Leela ?
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: how good is a GeForce GTX 1060 6GB for Leela ?
Milos wrote: Gian-Carlo's hand-written OpenCL implementation is really not a match for NVIDIA's specialized libraries.

Ras wrote: I guess the OpenCL version also works with AMD GPUs while CUDA does not. Since the project depends on voluntary contribution, it may make sense not to shut out a considerable number of potential volunteers.

If efficiency is your goal, then of course. First, I believe there are at least twice as many NVIDIA GPU users as AMD GPU users, but even if the numbers were equal, with cuDNN you would typically get an 8x speed-up compared to OpenCL. So even if you cut off half the users and the other half all used the cuDNN version, you would get an overall speed-up of 4x, meaning you could reach 44 million games in a month.
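That 4x figure is simple throughput arithmetic: drop one half of the contributors, make the other half 8x faster, and compare against everyone running OpenCL at 1x. A minimal sketch (the function name is mine, and it assumes equal per-user game throughput and the 8x cuDNN speed-up quoted above):

```python
def overall_speedup(kept_fraction, per_user_speedup):
    """Total game-throughput multiplier relative to a baseline
    where every contributor runs the 1x (OpenCL) version."""
    return kept_fraction * per_user_speedup

# Half the users kept, each running cuDNN at 8x:
print(overall_speedup(0.5, 8))  # 4.0
# Everyone kept on OpenCL, no speed-up:
print(overall_speedup(1.0, 1))  # 1.0
```

Under these toy assumptions, the break-even point is a per-user speed-up of 2x: anything above that outweighs losing half the contributor pool.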
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: how good is a GeForce GTX 1060 6GB for Leela ?
mirek wrote: I think in one of the GitHub threads they were discussing licensing issues with cuDNN. I am not sure what the status of that is.

There are no licensing issues, just Gian-Carlo's paranoia and hurt pride:
http://www.talkchess.com/forum/viewtopi ... 944#759944
And the cuDNN library clearly falls under the System Library exception of GPL v3, so it is totally fine to include in any GPLed project.
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: how good is a GeForce GTX 1060 6GB for Leela ?
Dann Corbit wrote: Titan V has tensor cores.

Werewolf wrote: The Titan V also "only" has 640 of them. I suspect its successor will cram in many more.

Dann Corbit wrote: They can multiply a small matrix in a single cycle (the tensor cores).

Milos wrote: A 4x4 one, to be precise. Since the LC0 kernel is 3x3, there is roughly only 1/3 efficiency ((27+9*2)/(81+16*3) operations) when running LC0 only on tensor cores, assuming of course that they are fully loaded and that the cuDNN libs are efficient for them, which is a big question mark at the moment.

Dann Corbit wrote: I guess that you have to program specifically for the tensor cores.

jkiliani wrote: 3x3 kernels are extremely common in machine learning for good reason, so I think both the design engineers and the driver programmers at Nvidia are way ahead of Milos there. They probably use Winograd transforms to compute 3x3 kernels, just like the Leela Zero OpenCL implementation (which is not nearly as bad as Milos claims, I might add).

Dann Corbit wrote: I think that the point Milos was making is that the tensor cores do not perform a 3x3 multiply. They perform a 4x4 multiply.

Correct; most probably the tensor core matrix multiply implementation is direct, i.e. not using any FFT, because latency minimization is the goal, not area savings.
But they are fiddly to use and you have to program specifically for them.
If they were going to run on that hardware, it certainly makes sense to change to a 4x4 kernel. That is a very good point.
Now, a 3x3 matrix fits into a 4x4 matrix, so you can still multiply it in one cycle. But the thing that is missing is that with a 4x4 kernel you would get:

2 * (4 * 4 * 4) - 4 * 4 = 112 operations in one cycle

versus

2 * (3 * 3 * 3) - 3 * 3 = 45 operations in one cycle

So you are getting 45/112 = 40% of the compute power available.

This assumes that square matrix multiply is 2N^3 - N^2 operations (the typical count, and I doubt you can do better on such a small matrix).
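The count above can be checked quickly (the function name is mine): an NxN multiply computes N^2 output entries, each requiring N multiplications and N-1 additions, giving N^2 * (2N - 1) = 2N^3 - N^2 operations.

```python
def matmul_ops(n):
    """Operation count for an n x n matrix multiply:
    each of the n*n outputs needs n multiplies and n-1 adds,
    i.e. n^2 * (2n - 1) = 2n^3 - n^2 operations total."""
    return n * n * (2 * n - 1)

print(matmul_ops(4))  # 112 operations for a 4x4 multiply
print(matmul_ops(3))  # 45 operations for a 3x3 multiply

# Utilization when a 3x3 kernel is zero-padded into a
# 4x4 tensor-core multiply:
print(matmul_ops(3) / matmul_ops(4))  # ~0.40
```

So a padded 3x3 kernel leaves roughly 60% of the 4x4 unit's arithmetic doing useless work, which is the point being made about tensor-core efficiency for LC0.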
-
- Posts: 1346
- Joined: Sat Apr 19, 2014 1:47 pm
Re: how good is a GeForce GTX 1060 6GB for Leela ?
I have the same NPS as Nay here:
https://docs.google.com/spreadsheets/d/ ... =857482380
NPS: 1167 with GTX 1060 3GB, 3 GHz, 4 cores, i5 7400
So I guess it's ok for my set up.
I have a GTX 1060 6GB, using 2 cores of an i5-3570, using ID 227.
-
- Posts: 219
- Joined: Thu May 29, 2014 5:58 pm
What does LC0 use?
"Even with an old i5-2500K and GTX1060 I get about 2250 NPS in the benchmark I described."

Doesn't LC0 use either the GPU or the CPU, but not both?
In which case, should one test both versions of LC0, or is that a waste of time, since for comparably priced processing units the GPU will be faster than the CPU?
-
- Posts: 219
- Joined: Thu May 29, 2014 5:58 pm
Re: how good is a GeForce GTX 1060 6GB for Leela ?
How is lczero7.exe different from lczero.exe?
-
- Posts: 3019
- Joined: Wed Mar 08, 2006 9:57 pm
- Location: Rio de Janeiro, Brazil
Re: What does LC0 use?
cma6 wrote: "Even with an old i5-2500K and GTX1060 I get about 2250 NPS in the benchmark I described." Doesn't LC0 use either the GPU or CPU, but not both? In which case, should one test both versions of LC0, or is that a waste of time, since for comparably priced processing units the GPU will be faster than the CPU?

Can't speak for everyone, since different setups may vary, but on the desktop I described it is about 60% CPU usage and 45-50% GPU, as per the task manager.
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."
-
- Posts: 219
- Joined: Thu May 29, 2014 5:58 pm
Re: What does LC0 use?
Albert:
I wasn't asking about processor usage, but about whether, given comparably priced CPUs and GPUs, one should even bother testing the CPU version of lc0.
-
- Posts: 3019
- Joined: Wed Mar 08, 2006 9:57 pm
- Location: Rio de Janeiro, Brazil
Re: What does LC0 use?
cma6 wrote: Albert: I wasn't asking about processor usage, but about whether one should even bother testing the CPU version of lc0 based on prices of comparably priced CPUs vs. GPUs.

You wrote, "Doesn't LC0 use either the GPU or CPU, but not both?" and I answered no, it uses both.