NN Evaluation on CPU or GPU

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson

User avatar
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

NN Evaluation on CPU or GPU

Post by mhull »

I have a basic question: what is the average instruction path length of a decent evaluation function, and would it be longer or shorter than that of:

1) A neural network trained on the output of that function?

2) The same trained NN running on a GPU?

Is this something that has been tried?
Matthew Hull
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: NN Evaluation on CPU or GPU

Post by rbarreira »

A typical chess program can evaluate a position in around 1000 CPU cycles, way faster than it takes to evaluate any non-trivial neural network.

AFAIK neural networks are several orders of magnitude slower than typical chess evaluation functions. This means the NN would have to be much smarter in order to compensate.

Sending the position to the GPU to be evaluated also looks like a bad idea due to the communication latency between CPU and GPU.
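To put rough numbers on this, here is a back-of-the-envelope sketch (all network sizes are illustrative assumptions, not any particular engine's design) comparing the ~1000-cycle hand-written eval against the multiply-accumulate count of a small fully connected network over a one-hot board encoding:

```python
# Back-of-the-envelope sketch (all sizes are illustrative assumptions):
# compare the ~1000-cycle hand-written eval quoted above against the
# multiply-accumulate count of a small fully connected network.

def mac_count(layer_sizes):
    """Multiply-accumulates for one forward pass through dense layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# 12 piece types x 64 squares = 768 inputs; two modest hidden layers.
layers = [768, 256, 32, 1]
macs = mac_count(layers)      # 768*256 + 256*32 + 32*1 = 204,832

hand_eval_cycles = 1000       # figure quoted in the post above
# Even at an optimistic 1 MAC per cycle, the NN needs ~200x more cycles.
print(macs, macs / hand_eval_cycles)
```

Even this modest network lands two orders of magnitude above the hand-written eval, which is the "several orders of magnitude" gap in concrete terms.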
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: NN Evaluation on CPU or GPU

Post by diep »

rbarreira wrote:A typical chess program can evaluate a position in around 1000 CPU cycles, way faster than it takes to evaluate any non-trivial neural network.

AFAIK neural networks are several orders of magnitude slower than typical chess evaluation functions. This means the NN would have to be much smarter in order to compensate.

Sending the position to the GPU to be evaluated also looks like a bad idea due to the communication latency between CPU and GPU.
Diep's evaluation function is rather slow because of its massive chess knowledge: around 100k instructions, roughly.

For a few years now we have seen a boost from the automatically tuned beancounters with near-zero chess knowledge. When the old-school programmers manage to tune their engines better, the knowledgeable engines should be on top again, though most lack the financial motivation that made them strong in the past; so we'll see.

It is unclear how the material evaluation function of Rybka 1.0 was tuned. It has a massive number of parameters that were tuned automatically.

Rybka 3.0 (maybe also 2.3.2, I'm not sure about that) and the clones seem to have been tuned with a neural network.

Please realize the huge difference between a hand-written evaluation function, a neural network that does the evaluation itself, an evaluation tuned by a neural network, and generic automatic tuning. Those are four totally different things.

The last serious attempt I know of at using a neural network to automatically recognize chess patterns was somewhere in the late 90s, by a Danish guy; Dan Thies, if I recall well.

The search was that of a normal program, yet the evaluation was totally self-learned, so no cheating at all. He used special hardware for the neural net; back then it could deliver 100k evaluations a second. I'm not sure what his hardware cost, but you could have bought an expensive house for it at the time.

AFAIK it wasn't a big success.

If you have a neural network replace the evaluation function, it needs a massive number of multiplications, and CPUs are very bad at this; whatever cycle count you have, you can easily multiply it by a factor of 10.

So you will immediately notice a slowdown of a factor of 10, if not more.

In itself a GPU is pretty fast at small integer multiplications, provided you keep them under 16 x 16 bits on AMD. Nvidia allows more there: it does 32 x 32 multiplications fast (32 x 32 is SLOW on AMD), though it still takes 2 cycles on Nvidia to get the full 64-bit result. This is for the Teslas and the 5xx series from Nvidia. On AMD it is, so to speak, 8 cycles of throughput latency to get it on modern AMD CPUs.

So a GPU has far more resources for neural-net simulation than a CPU does.

Yet combining GPU <==> CPU is going to fail, as someone already posted. The bandwidth between GPU and CPU is huge: easily 20 GB/s for Nvidia on the latest PCIe 3.0 motherboards.

Yet the absolute latency for small messages is going to be near a millisecond. If you can do more than 3000 messages a second, that would be a lot.
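A quick amortization sketch makes the point (the round-trip figure is from this post; the target nps is my own illustrative assumption): with only a few thousand round trips per second, per-node GPU evaluation is hopeless unless each trip carries a large batch of positions.

```python
# Rough amortization sketch (round-trip figure from the post above;
# target nps is an illustrative assumption).

round_trips_per_sec = 3000        # generous message rate, CPU <-> GPU
target_nps = 1_000_000            # typical engine speed assumed here

# Evals that must share one round trip to keep the search fed:
batch_size = target_nps / round_trips_per_sec
print(batch_size)                 # ~333 positions per message
```

In other words, the GPU only makes sense if the search itself supplies hundreds of positions per message, or lives entirely on the GPU.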

So whatever you do on a GPU will be really fast, provided it stays inside the GPU.

For simulating neural nets, GPUs are great hardware, I'd say...

I would argue neural networks have failed to deliver on the big promise they made. For every area of science where a neural network can have success, there are specialized automatic tuning methods that solve things even better.

If you know how to tune with a neural net, you don't need all those expensive multiplications in the first place and can tune far faster with fewer multiplications.

Also, brain research has meanwhile shown that the way ANNs have been modelled is not very close to the modern understanding of how the human brain works.

The easy public access to neural networks, and the big secrecy around generic tuning methods, is probably the reason they still get used in some areas.

At least for ANNs you CAN find some information on how to train and tune them.

Their biggest application seems to be espionage and the automatic scanning of all phone calls. So they are not interesting there: those phone calls are already being scanned, and there is nothing new left for ANNs to achieve.

On the other hand, in self-learning there is a lot to achieve. I wouldn't go with ANNs for that, though.
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: NN Evaluation on CPU or GPU

Post by rbarreira »

diep wrote: Diep's evaluation function is rather slow because of its massive chess knowledge: around 100k instructions, roughly.
Are you saying that if you run Diep on a modern 3.0 GHz CPU you get about 30,000 nps per thread? Or do you mean 100k instructions in total, not all of which are executed on every evaluation?
User avatar
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: NN Evaluation on CPU or GPU

Post by mhull »

rbarreira wrote:A typical chess program can evaluate a position in around 1000 CPU cycles, way faster than it takes to evaluate any non-trivial neural network.

AFAIK neural networks are several orders of magnitude slower than typical chess evaluation functions. This means the NN would have to be much smarter in order to compensate.
This reminds me of Kittinger's method in his Constellation programs: a relatively complex and expensive evaluation was performed once at the root, prior to the search, which served to guide it. The expense was justified because it only had to run once per move. But what if it could have been run at every required node at less than its normal expense?

The question would be where the break-even point lies in coding toward an "ideal" evaluation function: the point where a trained NN (or NNs) would require the same or fewer cycles than the hand-coded algorithm(s).
Matthew Hull
User avatar
Rebel
Posts: 6946
Joined: Thu Aug 18, 2011 12:04 pm

Re: NN Evaluation on CPU or GPU

Post by Rebel »

mhull wrote:
rbarreira wrote:A typical chess program can evaluate a position in around 1000 CPU cycles, way faster than it takes to evaluate any non-trivial neural network.

AFAIK neural networks are several orders of magnitude slower than typical chess evaluation functions. This means the NN would have to be much smarter in order to compensate.
This reminds me of Kittinger's method in his Constellation programs: a relatively complex and expensive evaluation was performed once at the root, prior to the search, which served to guide it. The expense was justified because it only had to run once per move. But what if it could have been run at every required node at less than its normal expense?
Yes, and Dave's Bxh7+ sacrifice is the most famous example of pre-processing, as it is called. Another great example of working pre-processing is the Trojan horse.

Pre-processing was fashionable in the early days of chess programming because it works at low depths; at deeper depths it can become counterproductive instead, and nowadays engines have abandoned the concept.

Nevertheless, if someone comes up with a way (a structure) of integrating massive pre-processed knowledge that can be applied in each evaluation cheaply (say in 50-100 cycles), that will guarantee a new breakthrough in computer chess.
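For concreteness, classic pre-processing can be sketched like this (all names and table values are illustrative, not any engine's actual knowledge): the expensive analysis runs once at the root and is baked into piece-square tables, so the in-search evaluation is just cheap table lookups.

```python
# Minimal sketch of classic pre-processing (names/values illustrative):
# expensive knowledge is computed once at the root and baked into
# piece-square tables; the per-node eval is then just table lookups.

def build_tables(root_position):
    """Run the expensive root analysis once. Here just a stub that
    rewards central squares; a real engine would use root knowledge."""
    bonus = [min(f, 7 - f) + min(r, 7 - r)
             for r in range(8) for f in range(8)]
    return {piece: bonus for piece in "PNBRQK"}

def cheap_eval(tables, pieces):
    """Per-node eval: a handful of adds, on the order of the
    50-100 cycles mentioned above. pieces = [(piece, square), ...]"""
    return sum(tables[p][sq] for p, sq in pieces)

tables = build_tables(root_position=None)   # root analysis stubbed out
print(cheap_eval(tables, [("K", 0), ("Q", 27)]))  # → 6
```

The weakness Rebel describes falls out of the structure: the tables are frozen at the root, so knowledge that depends on how the position changes during the search cannot be expressed this way.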

Imagine if the Bxh7+ knowledge for root positions could be applied at any node in the search; that would be something.
User avatar
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: NN Evaluation on CPU or GPU

Post by mhull »

Rebel wrote:Imagine if the Bxh7+ knowledge for root positions could be applied at any node in the search; that would be something.
Yes, that's exactly what I meant to emphasize! If only an arbitrarily complex static evaluation of Constellation proportions could be done at each node. That is too expensive with normal algorithms. But would ANNs at each node necessarily be too expensive as well? The answers so far indicate they are too slow. But perhaps ANNs would still be less expensive than a notional ideal static evaluation, a la Kittinger.

That's why I wondered whether a GPU would speed it up enough, or whether it would just break even (or worse) due to interface latencies.
Matthew Hull
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: NN Evaluation on CPU or GPU

Post by diep »

rbarreira wrote:
diep wrote: Diep's evaluation function is rather slow because of its massive chess knowledge: around 100k instructions, roughly.
Are you saying that if you run Diep on a modern 3.0 GHz CPU you get about 30,000 nps per thread? Or do you mean 100k instructions in total, not all of which are executed on every evaluation?
I said roughly 100k instructions per full eval, yes, in the opening/middlegame. Sometimes more in the worst case; 100k is about average.

How many instructions per second can your 3 GHz CPU execute? How many full evals per node are you doing?

Diep's IPC on an i7 is roughly 1.73.
Further, I hash evaluations in two spots:
Diep has a special eval table, AND the transposition table also stores full evals. This in itself already means that 50% of all full evals come from one of its hash tables. Note that Diep's evaluation is symmetric, a deliberate choice.

Diep doesn't evaluate at inner nodes, only at leaf nodes, and Diep doesn't evaluate when one side is in check. So it fully evaluates only about half of all positions.

Do not forget you get a lot of nodes 'for free'. If the transposition table gives a cutoff, you get a node for free; that's 5%. The same with being in check, etc.

Nullmove succeeds in roughly 70% of the nodes, so that's one more node for free.

You can also see this in benchmarks. Diep gets roughly 1.3 million nps on an i7-965, effectively clocked at about 3.6 GHz, I guess, if you compare with your hardware at home.

That's for 8 threads.

1.3 mln / 8 = roughly 162k nps per logical/mini core

Correct for the higher clock and for the free nodes, and you'll see it really is about 100k instructions per full eval on average.
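The arithmetic above can be reconstructed roughly as follows (the clock, IPC, nps, and hashing figures are from this post; the exact accounting of the "free" nodes is my own simplification, so the result only lands in the same ballpark as the quoted 100k):

```python
# Rough reconstruction of the estimate (figures from the post above;
# the "free node" accounting is a simplification).

clock_hz = 3.6e9                  # effective i7-965 clock quoted
ipc = 1.73                        # Diep's IPC on the i7
nps_total = 1.3e6                 # benchmark figure, 8 threads
nps_per_core = nps_total / 4      # two threads share each physical core

instr_per_node = clock_hz * ipc / nps_per_core   # ~19k per node

# Only leaf nodes not in check get a full eval (~half of all nodes),
# and half of those are served from the eval/transposition tables,
# so roughly one node in four pays for a full evaluation.
full_eval_fraction = 0.5 * 0.5
instr_per_full_eval = instr_per_node / full_eval_fraction
print(round(instr_per_full_eval))  # on the order of 75-80k
```

With the additional "free" nodes from transposition cutoffs and nullmove counted in, the estimate moves further toward the quoted ~100k instructions per full eval.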

It's possible that within a few weeks it gets a lot slower than this in the opening, as I'm busy adding some strategic code which is very CPU-intensive.

So if you compare Diep with the beancounter engines, it is a lot slower in nps and always was.