Booot progress

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Booot progress

Post by booot »

Good day friends!

I started my own project to implement a NN in my engine's evaluation. Booot is written in Pascal, so I have to walk this whole way from zero :-). So I decided to make this topic to show the full process and to hear some advice and critical opinions :-).

My starting point:
1. Intel Core i9-10920X + 128 GB RAM + Nvidia RTX 3090
2. Delphi 10.3 + Python 3.8 + Keras 2.4 + TensorFlow + CUDA 11 + cuDNN 8
3. Some (small) experience in all of this stuff.
4. Big enthusiasm.

What I would like to do:
1. Construct and train a model in Keras with Booot-eval data.
2. Implement SIMD in Delphi (somehow) to get a fast forward pass of the NN.
3. The feature schema will never be HalfKP :-). 'Everyone already has it' - it's boring.

Some 'motivating' promo:
- The Delphi compiler still does not have any AVX2 support! So instead of 'vpmaddubsw zmm0, zmm1, zmm0' (for AVX512), my assembly code now looks like 'db 62h, 0feh, 48h .....'

But anyway it's very interesting!

My current plan:
1. Create a simple Keras model with the same NN structure as the final one and train it with simple non-chess data (but with a similar, smaller 1,0,1... input array and a similar integer NN output).
2. Make quantization (maybe during training) to get the final int8-int16-int32 weights and biases.
3. Prepare quasi-asm hexadecimal codes in Delphi to consume the Keras data with AVX512, AVX2 and SSSE3 support.
4. Get the same results in Keras and in Delphi for every input from the training set with the quantized weights and biases, and measure the accuracy of the quantized model.
5. Generate a 9-digit number of positions with Booot.
6. Train the full model on my home monster.
7. Test and release the engine.

Best regards,
Alex.
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

First impression of SIMD: it ... works :-). It's really fantastic technology! I started with AVX512 support (my CPU supports everything, ha-ha!) - the speed of "multiply two 512-bit arrays of 8-bit integers (one signed, one unsigned) + add neighboring pairs in the result + saturate the final result to int16" is really impressive! A relatively short procedure multiplying two matrices like [1,256]*[256,32]=[1,32], written completely in asm (hexadecimal codes, to be more correct), shows me a speed of 10 million such multiplications per second on my CPU! Not bad at all.
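
For illustration, here is a rough numpy sketch of what that instruction (vpmaddubsw) computes on one vector - unsigned bytes times signed bytes, neighbouring products added, result saturated to int16. The function name is made up for the sketch:

import numpy as np

def pmaddubsw(a_u8, b_i8):
    # unsigned bytes * signed bytes, widened so the products do not overflow
    a = a_u8.astype(np.int32)
    b = b_i8.astype(np.int32)
    prod = a * b
    # add neighbouring pairs horizontally, then saturate to int16
    pairs = prod[0::2] + prod[1::2]
    return np.clip(pairs, -32768, 32767).astype(np.int16)

# one 512-bit register worth of data: 64 bytes in, 32 int16 values out
a = np.random.randint(0, 256, 64, dtype=np.uint8)
b = np.random.randint(-128, 128, 64, dtype=np.int8)
print(pmaddubsw(a, b))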
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Booot progress

Post by mvanthoor »

Cool :) Pascal was my first love with regard to programming languages. I've worked _a lot_ with Turbo/Borland Pascal in the 90's; my university used Delphi in year 1, and C++ Builder in the years after. (This was before .NET even existed.)

How did you get the SIMD-instructions implemented if the compiler doesn't support them; did you write the assembly yourself?
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

mvanthoor wrote: Tue May 04, 2021 6:56 pm Cool :) Pascal was my first love with regard to programming languages. I've worked _a lot_ with Turbo/Borland Pascal in the 90's; my university used Delphi in year 1, and C++ Builder in the years after. (This was before .NET even existed.)

How did you get the SIMD-instructions implemented if the compiler doesn't support them; did you write the assembly yourself?
I also like Pascal :-). But in my case the reason is simple: I am not a programmer, so Pascal is the only language I used at my university 25 years ago :-)
I did not write my own asm! I just found a cloud assembler+disassembler and now can quickly get the hexadecimal opcodes of the needed instructions. Then I simply insert these hexadecimal constants into Delphi asm with DB directives: the compiler just inserts them 'as is'.
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

So, in the next step I will try to implement the full forward propagation of the neural net with SIMD in Delphi, taking one row of data (the feature layer output) as input. I will update this output incrementally in the make/unmake procedures in the future... Somehow. I have no idea yet what feature schema I will use, but for today it is more important what output this schema will produce :-). I have chosen this net configuration to start with (a small Keras sketch of these layer sizes follows after the list):
1. Input (feature layer) - I will think about it later (but not HalfKP).
2. 256x32: [1,256] * [256,32] = [1,32] + biases + ReLU + dequantization (think later) + pack [1,32] int32 -> [1,32] unsigned int8
3. 32x32: [1,32] * [32,32] = [1,32] + biases + ReLU + dequantization (think later) + pack [1,32] int32 -> [1,32] unsigned int8
4. 32x1: [1,32] * [32,1] = [1,1] + bias + something for scaling.
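
For reference, a minimal float32 Keras sketch of just these layer sizes - the real feature layer, quantization and output scaling are not modelled, all names are illustrative:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(256,)),  # [1,256]*[256,32] + biases + ReLU
    layers.Dense(32, activation='relu'),                      # [1,32]*[32,32]  + biases + ReLU
    layers.Dense(1),                                          # [1,32]*[32,1]   + bias, raw eval
])
model.compile(optimizer='adam', loss='mse')
model.summary()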

Let's go! :-)
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Booot progress

Post by mvanthoor »

So if you have only a little experience with this and are not a programmer, by your own statement.... where did you get all this information so quickly?

I constantly feel as if I'm missing some secret resource or something.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

It is not rocket science. I read something about how a NN works (mathematically), and matrix multiplications for dense layers are not such difficult things. Anyway, the Keras framework will do all the dirty work for me. The really interesting (and difficult) things I see now are:
1. Quantization (float32 -> int8). With such a short int8 range I think I will spend a beautiful time controlling the weights, neuron sums and ReLU outputs in the 'live' NN while trying to optimize the loss function with all of this stuff (a small sketch of the idea follows after this list).
2. Metrics and a loss function that take quantization into account.
3. Possibly scaling the NN input and output. The reason is to keep all this int8 data > 0 and < 255 from layer to layer.
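
A naive sketch of the float32 -> int8 idea for one weights matrix, assuming a single symmetric scale per layer (purely illustrative, not the final scheme):

import numpy as np

def quantize_weights(w_float, bits=8):
    # naive symmetric quantization: one scale per layer, clip into the int8 range
    qmax = 2 ** (bits - 1) - 1                       # 127 for int8
    scale = np.max(np.abs(w_float)) / qmax
    w_int = np.clip(np.round(w_float / scale), -qmax - 1, qmax).astype(np.int8)
    return w_int, scale

w = np.random.randn(256, 32).astype(np.float32)
w_q, s = quantize_weights(w)
print(np.max(np.abs(w - w_q.astype(np.float32) * s)))   # worst-case rounding error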

SIMD looks beautiful after some mental preparation for it :-)

https://software.intel.com/sites/landin ... sicsGuide/

This is my secret resource now :-)
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

First surprises (unpleasant).

It seems SIMD does not have fast 'hardware' instructions for a 'horizontal sum' (summing all elements of a vector). 'Vertically' it works fast as hell, but during matrix multiplication I need a final int32 sum for every single column of the resulting matrix. It's a pity. But anyway - the first procedure, doing the fast weights*data matrix multiplication for the first hidden layer of my net, now works in Delphi with AVX512 support!
I have made the very first small step on this long way. The procedure works fine, multiplying [1,256](uint8) * [256,32](int8) = [1,32](int32) matrices. This is the most 'heavy' part of the future SIMD code, as it involves the biggest weights matrix (256x32). The current speed is a little bit less than 10 million procedure calls per second.
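
As a scalar reference, this is the computation that procedure performs, written in numpy; the sum down each column is exactly the 'horizontal sum' that has no single SIMD instruction:

import numpy as np

# [1,256] uint8 activations times [256,32] int8 weights, accumulated in int32
x = np.random.randint(0, 256, (1, 256), dtype=np.uint8)
w = np.random.randint(-128, 128, (256, 32), dtype=np.int8)

acc = x.astype(np.int32) @ w.astype(np.int32)   # -> [1,32] int32 accumulators
print(acc.shape, acc.dtype)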

The next step: a fast activation function for this layer + packing the results back to uint8 format. And do not forget to add the biases first!
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

First stage done!

All the needed asm (hex code) SIMD procedures in Delphi for the NN forward pass, from the first hidden layer to the NN output, are written! It was a surprise for me, but sometimes AVX512 code works a little bit slower than AVX2! For the [256,32] matrix AVX512 is the king, but for the 'short' matrices [32,32] and [32,1] there is no advantage in having 512-bit vectors! I tried to process 2 columns at the same time during the [32,32] matrix multiplication, but the 'broadcast-split-permute' code works slowly and I did not get a speed improvement.
So, I already have fast NN probing code in Delphi! The final speed is about 5 million NN probes per second with AVX512+AVX2 support. There were lots of very interesting nuances while I implemented ReLU, dequantization (signed shift right) and packing the result back to int8! It's so easy to lose the sign or the int8 range during SIMD manipulations!
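
A small numpy sketch of that per-layer post-processing - ReLU on the int32 accumulators, dequantization by an arithmetic shift right, clamp into the uint8 range for the next layer - with an illustrative shift value, not the real one:

import numpy as np

def clipped_relu_pack(acc_i32, shift=6):
    # ReLU: negatives become 0, then shift right to dequantize, then clamp to a byte
    x = np.maximum(acc_i32, 0)
    x = x >> shift
    return np.clip(x, 0, 255).astype(np.uint8)

acc = np.random.randint(-50000, 50000, 32, dtype=np.int32)
print(clipped_relu_pack(acc))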

25% of total project completed!

The next stage is 'Back to reality'. I will try to train a 'live' Keras model with the same layer structure, fit it with synthetic non-chess data that has a similar input (an array of 0s and 1s) and a similar output (an integer in a range like [-500,500]). Then I will try to make the quantization during training (float32 -> int16, int8, int32) and bring the final model to Delphi. The end of the stage: a working model in Delphi with exactly the same probing results as in Python-Keras-TensorFlow, and a measured accuracy for the quantized model.
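
A toy sketch of that final check, assuming a single layer and a single symmetric scale: run the same weights once in float32 and once through the integer pipeline, then measure the drift (all values here are made up):

import numpy as np

rng = np.random.default_rng(1)
x = rng.integers(0, 2, 256).astype(np.float32)          # one 0/1 input row
w = rng.normal(0.0, 0.5, (256, 32)).astype(np.float32)

y_float = np.maximum(x @ w, 0.0)                         # float32 reference

scale = 127.0 / np.max(np.abs(w))                        # one symmetric scale
w_q = np.round(w * scale).astype(np.int8)
acc = x.astype(np.int32) @ w_q.astype(np.int32)          # integer accumulation
y_int = np.maximum(acc, 0) / scale                       # back to float to compare

print(np.max(np.abs(y_float - y_int)))                   # quantization drift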
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

BTW, I looked at how the Stockfish team implemented the activation function in their 'clipped relu' file. Does anyone know why they use the _mm256_packs_epi16 function? Its final result is int8 with range [0..127]. It looks a little bit stupid to me - they lose half of the valuable [0..255] range! Less range means less accuracy and slower model training.
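
A tiny numpy illustration of the range difference, using clipping to model the two pack instructions (signed saturation as in packs_epi16 vs unsigned saturation as in packus_epi16):

import numpy as np

v = np.arange(0, 400, 25, dtype=np.int16)               # non-negative values, as after ReLU
signed_pack   = np.clip(v, -128, 127).astype(np.int8)   # packs_epi16: signed saturation
unsigned_pack = np.clip(v, 0, 255).astype(np.uint8)     # packus_epi16: unsigned saturation
print(signed_pack)     # everything above 127 is flattened to 127
print(unsigned_pack)   # the full [0..255] byte range is kept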