Petaflop Binary Neural Networks? - on your home PC

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Petaflop Binary Neural Networks? - on your home PC

Post by dangi12012 »

Has anyone here implemented one of the smaller binary neural networks?
Not Int8 or Float16, but Int1. That's extremely important for chess, because it can be used directly with the 12 * 64-bit bitboards these algorithms already work on. Matrix multiplication works the way you would expect once everything is reduced to one bit: the multiply-accumulate becomes XNOR + popcount (sketched below).

https://towardsdatascience.com/binary-n ... c926888f3f
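To make that concrete, here is a rough CPU-side sketch of the arithmetic - nothing GPU-specific, and the names are my own:

Code: Select all

#include <bit>       // std::popcount (C++20)
#include <cstdint>

// Binary "dot product" of two bit vectors packed into 64-bit words.
// Convention: bit 1 encodes +1, bit 0 encodes -1. Then
//   matches  = popcount(~(a XOR b))        (i.e. XNOR + popcount)
//   dot(a,b) = 2 * matches - n_bits
int binary_dot(const std::uint64_t* a, const std::uint64_t* b, int n_words)
{
    int matches = 0;
    for (int i = 0; i < n_words; ++i)
        matches += std::popcount(~(a[i] ^ b[i]));
    return 2 * matches - 64 * n_words;
}

// One "neuron" over the 12 piece bitboards (12 * 64 = 768 binary inputs):
//   int activation = binary_dot(position_bitboards, weight_row, 12);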

The coolest part is that these are natively supported by Tensor cores. With modern hardware you get 136 Tflops of float16 compute, but there is a whopping 1.8 Petaflops in your off-the-shelf 3080 GPU. This would make a good first layer, since the input is binary and the output is forced to be an int32 matrix/vector.

cuBLAS GEMM support is not there yet, but Nvidia CUTLASS (the template abstraction layer with different backends) supports 8, 4 and 1 bit types:
https://github.com/NVIDIA/cutlass/blob/ ... l_types.md

Of course this is only for chess engines that already run natively on the GPU, since transfer latency is a huge cost.
Does anyone here have experience with this, or can anyone point me to a good existing git repo with a 1-bit network implementation?
It would also be interesting to see a backpropagation learning approach, since I don't know how a gradient is defined on a single-bit output.
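To make the question concrete: the BNN papers I have seen keep real-valued "shadow" weights and binarise them on the forward pass, and that binarisation step is exactly where I don't see how the gradient should be defined (the code below is just my illustration of the problem, not a solution):

Code: Select all

// Forward pass binarises a real-valued "shadow" weight before use.
// sign() is flat almost everywhere, so a naive chain rule through it
// gives a zero gradient for the shadow weight - that is the open question.
inline float binarize(float w)
{
    return w >= 0.0f ? 1.0f : -1.0f;   // derivative is 0 wherever it exists
}

// y = binarize(w) * x;   dy/dw through binarize() is 0 almost everywhere.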
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Petaflop Binary Neural Networks? - on your home PC

Post by Daniel Shawul »

Not INT1, but I do have INT8, which works pretty well on many GPUs that do not support FLOAT16.
The Elo loss for INT8 is roughly 40 Elo against FLOAT16 at the same node count, but the speed gain does compensate for this
loss. I was planning to test INT4 when and if TensorRT supports it, but it looks difficult to go beyond that.
Big neural network search with MCTS relies so heavily on the "knowledge in the net" that speed is less of an issue.
Losing that knowledge to quantization may not be the best approach unless the extra speed compensates for it tactically.
For speed-sensitive neural networks (NNUE), on the other hand, quantization seems to be the right choice.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Petaflop Binary Neural Networks? - on your home PC

Post by dangi12012 »

Well, it's useful only for the first layer anyway, but my understanding is that it could expand the input into a more optimal layout.
Who says that HalfKP is the perfect input for the other layers?

A binary neural network could expand any bitboard directly into a good board representation for the other layers, and in int32 format, which is optimal for GPUs; downsampling from int32 to fp16 is natively supported.

IMO int1 is natively the format of bitboards, which is why the difference between 8 bits and 1 bit is so big here. With 8 bits you need to maintain an input feature list like NNUE does. With bitboards and int1, that list is the bitboard itself!
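A rough sketch of the input I have in mind (the exact layout of the state bits is just an example):

Code: Select all

#include <array>
#include <cstdint>

// The whole per-position input in packed binary form: 12 piece bitboards
// (768 bits) plus a handful of state bits. No feature-index list is needed -
// the bitboards themselves are the first-layer input.
struct BinaryInput
{
    std::array<std::uint64_t, 12> pieces;  // one board per piece type and colour
    std::uint16_t state;                   // e.g. side to move, castling, ep file
};
// 12 * 64 + 16 = 784 bits per position, instead of a dense uint8 feature array.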
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Petaflop Binary Neural Networks? - on your home PC

Post by Daniel Shawul »

There is really no computation in the first layer, because the input is a sparse matrix with 1s in the few places where pieces are.
The first-layer output is simply the weights associated with the occupied piece-squares, no multiplication needed (the input is 1).
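Roughly like this - a sketch, not any engine's actual code:

Code: Select all

#include <cstddef>
#include <cstdint>
#include <vector>

// First layer with a sparse 0/1 input: no multiplications at all, just sum
// the weight rows of the active features (the occupied piece-squares).
// weights[f] is the weight row of feature f.
std::vector<std::int32_t> first_layer(const std::vector<std::vector<std::int16_t>>& weights,
                                      const std::vector<int>& active_features)
{
    std::vector<std::int32_t> acc(weights.empty() ? 0 : weights[0].size(), 0);
    for (int f : active_features)               // at most 32 pieces on the board
        for (std::size_t j = 0; j < acc.size(); ++j)
            acc[j] += weights[f][j];            // the input is 1, so "multiply" = add
    return acc;
}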
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Petaflop Binary Neural Networks? - on your home PC

Post by dangi12012 »

Daniel Shawul wrote: Mon Oct 18, 2021 12:58 pm There is really no computation in the first layer, because the input is a sparse matrix with 1s in the few places where pieces are.
The first-layer output is simply the weights associated with the occupied piece-squares, no multiplication needed (the input is 1).
Nope, that's wrong. There is real computation, because it's not plain XNOR, it's XNOR + popcnt: you get the dot product over all weighted input bits at once, just as you would for fp16 or int8, except that in binary the "sum" is simply the popcount.

If you want exactly the same uint8 representation as the NNUE input layer, a trained int1 model can generate it (int32 -> int8).
But the HalfKP input may not be the best possible representation for a network; with a binary first layer the input transformation itself becomes part of the training, and that is the cool part.

The second cool part is that instead of transferring a huge uint8 array per position, you only need 12 * 64 bits + 16 state bits, while keeping the full 1.8 Petaflops for the first layer and 119 Teraflops for every other fp16 layer. That would be better than NNUE in terms of size, speed and accuracy.
Only latency remains a problem.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
odomobo
Posts: 96
Joined: Fri Jul 06, 2018 1:09 am
Location: Chicago, IL
Full name: Josh Odom

Re: Petaflop Binary Neural Networks? - on your home PC

Post by odomobo »

A bit pedantic, but it wouldn't be petaflops, because it's not floating-point operations. Maybe binary neural networks are worth exploring, but "faster" doesn't necessarily mean better.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Petaflop Binary Neural Networks? - on your home PC

Post by Daniel Shawul »

dangi12012 wrote: Mon Oct 18, 2021 4:09 pm
Daniel Shawul wrote: Mon Oct 18, 2021 12:58 pm There is really no computation in the first layer, because the input is a sparse matrix with 1s in the few places where pieces are.
The first-layer output is simply the weights associated with the occupied piece-squares, no multiplication needed (the input is 1).
Nope, that's wrong. There is real computation, because it's not plain XNOR, it's XNOR + popcnt: you get the dot product over all weighted input bits at once, just as you would for fp16 or int8, except that in binary the "sum" is simply the popcount.

If you want exactly the same uint8 representation as the NNUE input layer, a trained int1 model can generate it (int32 -> int8).
But the HalfKP input may not be the best possible representation for a network; with a binary first layer the input transformation itself becomes part of the training, and that is the cool part.

The second cool part is that instead of transferring a huge uint8 array per position, you only need 12 * 64 bits + 16 state bits, while keeping the full 1.8 Petaflops for the first layer and 119 Teraflops for every other fp16 layer. That would be better than NNUE in terms of size, speed and accuracy.
Only latency remains a problem.
Oops, I forgot that in a binary network the weights will be binary too. Yes, in that case there is computation.
For the current NNUE, the first-layer computation is vectorized additions, further optimized with "accumulation" carried over from previous moves --
which is where the incremental part of NNUE comes from.
So when a piece moves, only the weights of the changed features (from/to squares) are added or subtracted.
It is hard to see the binary network approach improving on that first-layer computation ...
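In sketch form, the incremental part is just this (sizes and types made up for illustration):

Code: Select all

#include <array>
#include <cstdint>

constexpr int OUT = 256;   // accumulator width - a made-up size for illustration

// Incremental accumulator update when a piece moves: subtract the weight row
// of the vacated feature (piece on the from-square) and add the row of the new
// feature (piece on the to-square). Captures and promotions just add more
// remove/add pairs; only the changed features are touched.
void update_accumulator(std::array<std::int32_t, OUT>& acc,
                        const std::int16_t* removed_row,
                        const std::int16_t* added_row)
{
    for (int j = 0; j < OUT; ++j)
        acc[j] += std::int32_t(added_row[j]) - std::int32_t(removed_row[j]);
}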
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Petaflop Binary Neural Networks? - on your home PC

Post by Sopel »

dangi12012 wrote: Mon Oct 18, 2021 4:09 pm The second cool part is that instead of transferring a huge uint8 array per position, you only need 12 * 64 bits + 16 state bits, while keeping the full 1.8 Petaflops for the first layer and 119 Teraflops for every other fp16 layer. That would be better than NNUE in terms of size, speed and accuracy.
Only latency remains a problem.
You have a fundamental misunderstanding of how NNUE relies on sparsity and on the board-modification constraints between two consecutive positions to allow incremental updates of the first layer. See https://github.com/glinscott/nnue-pytor ... ccumulator. There is also the aspect of overparametrization, which is a key part of NNUE and which your "bitboard" input disallows.
dangi12012 wrote: No one wants to touch anything you have posted. That proves you now have a negative reputation, since everyone already knows you are a forum troll.

Maybe you copied your Stockfish commits from someone else too?
I will look into that.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Petaflop Binary Neural Networks? - on your home PC

Post by dangi12012 »

Sopel wrote: Tue Oct 19, 2021 12:32 pm
dangi12012 wrote: Mon Oct 18, 2021 4:09 pm The second cool part is that instead of transferring a huge uint8 array per position, you only need 12 * 64 bits + 16 state bits, while keeping the full 1.8 Petaflops for the first layer and 119 Teraflops for every other fp16 layer. That would be better than NNUE in terms of size, speed and accuracy.
Only latency remains a problem.
You have a fundamental misunderstanding of how NNUE relies on sparsity and on the board-modification constraints between two consecutive positions to allow incremental updates of the first layer. See https://github.com/glinscott/nnue-pytor ... ccumulator. There is also the aspect of overparametrization, which is a key part of NNUE and which your "bitboard" input disallows.
Nope. I think programmers need a more precise language than English; maybe Neuralink will change that.
Overparametrisation can be done inherently by a binary network, since you choose how many parameters you generate from the raw position by setting the number of columns of the second matrix. Also, NNUE incrementally maintains the first layer, not the input; that HalfKP transform is what I was talking about. It is a very good idea, and no wonder it ended up in chess.
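To illustrate the "number of columns" point, a rough sketch: the layer width is simply the number of weight rows you train, nothing else constrains it.

Code: Select all

#include <array>
#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

// Binary first layer over the packed 768-bit position. Each weight row is one
// column of the "2nd matrix", so the output width - and with it the degree of
// overparametrisation - is a free choice at training time.
std::vector<std::int32_t> binary_layer(const std::array<std::uint64_t, 12>& input,
                                       const std::vector<std::array<std::uint64_t, 12>>& weight_rows)
{
    std::vector<std::int32_t> out(weight_rows.size());
    for (std::size_t n = 0; n < weight_rows.size(); ++n) {
        int matches = 0;
        for (int i = 0; i < 12; ++i)
            matches += std::popcount(~(input[i] ^ weight_rows[n][i]));  // XNOR + popcount
        out[n] = 2 * matches - 768;   // signed +/-1 dot product, int32 output
    }
    return out;
}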

My question was whether anyone here has experience with binary networks, but it seems not.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Petaflop Binary Neural Networks? - on your home PC

Post by dangi12012 »

So if anyone is interested in the topic, here is a recent research paper with a GitHub repo:

https://arxiv.org/pdf/2006.16578.pdf
https://github.com/pnnl/TCBNN
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer