Hacking around CFish NNUE

mar · Post by **mar** » Thu Oct 15, 2020 7:35 pm

Daniel Shawul wrote: ↑Thu Oct 15, 2020 7:07 pm I wonder why auto-vectorization is not used instead of the manual SIMD code NNUE currently has. There is separate code for AVX2, SSE3,SSE2,SSE etc which is kind of ugly. Your code above can be easily auto-vectorized by the compiler, so I wonder why this approach is not taken. I don't see any operation preventing auto-vectorization in a simple dense network. The NNUE code either doesn't have easily vectorizable "default code" or compilers do a really bad job at it as it seems it is 3x slower without vectorization.

true, auto-vectorization should work fine for integers, no problem and fully portable.
for floating point, however, you'd have to use fast float to make clang/msc vectorize the dot product, becuse floating point ops are not commutative plus the order of ops matters so the compiler has to be very careful to not introduce errors that could break carefully designed algorithms.

there's a simple cure, though:

Code: Select all

#pragma float_control(precise, off, push)

float my_float_dot_product(const float *a, const float *b)
{
    ...
}

#pragma float_control(pop)

this pragma works in all compilers I'm concerned about (gcc, clang, msc)

maksimKorzh · Post by **maksimKorzh** » Thu Oct 15, 2020 7:40 pm

Daniel Shawul wrote: ↑Thu Oct 15, 2020 7:27 pm Think of Position*, containing your FEN and (Accumulator and DirtyPiece) structures.
NNUE populate these structures using the function.
Code: Select all
void half_kp_append_active_indices
Modify that to be based on your FEN rather than the bitboards code that it assumes the engine uses exactly like Stockfish does.
Also don't forget to comment out the incremental update as I mentioned, otherwise it will touch parts of the code that are corrupt.

If you are frustrated, you can wait for me to add NNUE it to my library that already does EGTB and NN probe

Thank you Daniel. I will try it while waiting for you to add NNUE into your library.
Let me explain what am I trying to do and why:
I want to make a youtube series for idiots like me to obtain a program that would be yeah, probing is the right word - probing NNUE weights to get a score.
The perfect application of this scli tool would be like this:

./nnue_eval r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1

// output
score: 0.20

And that's it. Completely modular. I will then be able to embed it into my engine.
1. First and most slowest way to go, but a proof of concept already: convert position to FEN -> feed to nnue_eval() -> get score back
2. feed position to nnue_eval() directly

maksimKorzh · Post by **maksimKorzh** » Thu Oct 15, 2020 7:41 pm

mar wrote: ↑Thu Oct 15, 2020 7:35 pm
Daniel Shawul wrote: ↑Thu Oct 15, 2020 7:07 pm I wonder why auto-vectorization is not used instead of the manual SIMD code NNUE currently has. There is separate code for AVX2, SSE3,SSE2,SSE etc which is kind of ugly. Your code above can be easily auto-vectorized by the compiler, so I wonder why this approach is not taken. I don't see any operation preventing auto-vectorization in a simple dense network. The NNUE code either doesn't have easily vectorizable "default code" or compilers do a really bad job at it as it seems it is 3x slower without vectorization.
true, auto-vectorization should work fine for integers, no problem and fully portable.
for floating point, however, you'd have to use fast float to make clang/msc vectorize the dot product, becuse floating point ops are not commutative plus the order of ops matters so the compiler has to be very careful to not introduce errors that could break carefully designed algorithms.

there's a simple cure, though:
Code: Select all
#pragma float_control(precise, off, push)

float my_float_dot_product(const float *a, const float *b)
{
    ...
}

#pragma float_control(pop)
this pragma works in all compilers I'm concerned about (gcc, clang, msc)

As far as I'm aware pragma is specific to microsoft compiler only

mar · Post by **mar** » Thu Oct 15, 2020 7:53 pm

maksimKorzh wrote: ↑Thu Oct 15, 2020 7:41 pm As far as I'm aware pragma is specific to microsoft compiler only

you're wrong, #pragma once would be a typical example

see for yourself, try this in godbolt.org, select clang, -O3 -march=native

Code: Select all

#include <stdio.h>
#include <stdlib.h>

extern float fast_dot(int count, float * a, float *b);

#pragma float_control(precise, off, push)

inline float fast_dot(int count, float * a, float *b)
{
	float res = 0;

	for (int i=0; i<count; i++)
		res += a[i] * b[i];

	return res;
}

#pragma float_control(pop)

int main()
{
    float a[4], b[4];
    printf("%f\n", fast_dot(rand(), a, b));
}

now try to compile with and without the pragmas and tell me what you see

hgm · Post by **hgm** » Thu Oct 15, 2020 7:54 pm

maksimKorzh wrote: ↑Thu Oct 15, 2020 6:59 pmCan you please clarify the code:

Well, with comments (and expanded to do all layers) it would roughly look like this

Code: Select all

// in MakeMove():
piece = board[from];
victim = board[to];
board[from] = EMPTY;
board[to] = piece;

for(i=0; i<256; i++) {                 // for all cells in layer 1
  if(IsKing(piece)) {
    int sum=0;
    king[stm] = to;                    // New king location; stm = 0 (white) or 1 (black)
    for(s=0; s<64; s++) {              // for all board squares
      int p = board[s];                // get the piece on it
      sum += KPST[i][to][p][s];        // add its KPST score to the total
    }
    layer1[i + 256*stm] = sum;         // this is the board's PST eval for the new King location
  } else {                             // other than King, update incrementally
    for(j=0; j<2; j++) {               // for both players
      int k = king[j];                 // get their King's location   
      layer1[i+256*j] -= KPST[i][k][piece][from]; // update the PST eval incrementally
      layer1[i+256*j] += KPST[i][k][piece][to];
      layer1[i+256*j] -= KPST[i][k][victim][to];
    }
  }
}

for(i=0; i<32; i++) {                  // for all cells in layer 2
  int sum = 0;                         // prepare calculating their input for cell i
  for(j=0; j<512; j++) {               // for each of the cells in layer 1
    sum += weights1[i][j] * layer1[j]; // add its weighted output to the input
  }                                    // we now have the total input of cell i in layer 2
  layer2[i] = max(0, sum);             // calculate its output
}

for(i=0; i<32; i++) {                  // for all cells in layer 3
  int sum = 0;                         // prepare calculating their input for cell i
  for(j=0; j<32; j++) {                // for each of the cells in layer 2
    sum += weights2[i][j] * layer2[j]; // add its weighted output to the input
  }                                    // we now have the total input of cell i in layer 3
  layer3[i] = max(0, sum);             // calculate its output
}

int eval = 0;                          // prepare calculating the evaluation score
for(j=0; j<32; j++) {                  // for each of the cells in layer 3
  sum += weights3[j] * layer3[j];      // add its weighted output to the input
}                                      // we now have the total input the final cell
                                       // which just outputs it unmodified

This glosses over some hairy details in the first (KPST) layer, such as that the KPST for white and black really need to be each other's mirror images, but should give you roughly an idea of what is going on. BTW, KPST stands for King-Piece-Square Table, because it is indexed by king location, piece type and square location. They are similar to PST, but you use a different one for each King location. So that you can award larger bonus for pieces close to a King. In Shogi this is very important, as Shogi games are usually a race to mate in an all-out King attack. Defense is a losing strategy there, instead you counter-attack and hope to be faster at it.

questions:
1. How can I initialize weights1?

By loading it with the net you want to run. Presumably reading that from a file, and writing the numbers you read in the corresponding elements. Of course if you want to create (train) your own network, this is an entirely different matter. The code I gave is just for running an engine that uses a given network.

2. weights1 is 2 dimensional array here, what values I need in 1st and 2nd indices when I define array? // e.g. weights1[?][?]

The weights represent the 'strength' with which the output of a cell in layer 1 contributes to the input of a cell in layer 2. The indices are the cell numbers.

3. same question for layer1 amd and layer2 (I only understand that NNUE has 4 layers but that's rocket science to me)

The layer arrays hold the output of all the cells in that layer. (To be used as input for the next layer.)

Could you please provide the code the would be doing following(or give a link on implementation):
1. Init everything needed from "*.nnue" file with weights

I have no idea. For that you would have to know in what order Stockfish stores the weights of the net in the file. This is no doubt documented somewhere, but I am not interested in looking it up.

2. then I guess the code you've already provided

Indeed; that code was for calculating layer 2 from layer 1.

3. And then somehow magically obtain a score

The new code shows this for all the layers. The magical score is obtained by the last loop in that code, just the weighted sum of all the outputs of the cells in layer3.

maksimKorzh · Post by **maksimKorzh** » Thu Oct 15, 2020 8:01 pm

mar wrote: ↑Thu Oct 15, 2020 7:53 pm
maksimKorzh wrote: ↑Thu Oct 15, 2020 7:41 pm As far as I'm aware pragma is specific to microsoft compiler only
you're wrong, #pragma once would be a typical example

see for yourself, try this in godbolt.org, select clang, -O3 -march=native
Code: Select all
#include <stdio.h>
#include <stdlib.h>

extern float fast_dot(int count, float * a, float *b);

#pragma float_control(precise, off, push)

inline float fast_dot(int count, float * a, float *b)
{
	float res = 0;

	for (int i=0; i<count; i++)
		res += a[i] * b[i];

	return res;
}

#pragma float_control(pop)

int main()
{
    float a[4], b[4];
    printf("%f\n", fast_dot(rand(), a, b));
}
now try to compile with and without the pragmas and tell me what you see

Oh... I'm sorry. I just tried to compile with pragma keyword - it turned out to be microsoft compiler specific, hence my assumption. But obviously in this case it's different. Sorry for misinform.

Daniel Shawul · Post by **Daniel Shawul** » Thu Oct 15, 2020 8:06 pm

mar wrote: ↑Thu Oct 15, 2020 7:35 pm
Daniel Shawul wrote: ↑Thu Oct 15, 2020 7:07 pm I wonder why auto-vectorization is not used instead of the manual SIMD code NNUE currently has. There is separate code for AVX2, SSE3,SSE2,SSE etc which is kind of ugly. Your code above can be easily auto-vectorized by the compiler, so I wonder why this approach is not taken. I don't see any operation preventing auto-vectorization in a simple dense network. The NNUE code either doesn't have easily vectorizable "default code" or compilers do a really bad job at it as it seems it is 3x slower without vectorization.
true, auto-vectorization should work fine for integers, no problem and fully portable.
for floating point, however, you'd have to use fast float to make clang/msc vectorize the dot product, becuse floating point ops are not commutative plus the order of ops matters so the compiler has to be very careful to not introduce errors that could break carefully designed algorithms.

there's a simple cure, though:
Code: Select all
#pragma float_control(precise, off, push)

float my_float_dot_product(const float *a, const float *b)
{
    ...
}

#pragma float_control(pop)
this pragma works in all compilers I'm concerned about (gcc, clang, msc)

Note that NNUE weights are integers, INT16 and INT8 to be precise.
This is one of the reasons, I guess general AI libraries such as Tensorflow, can not be used for training the net.
I am not sure about the training though which could use FLOAT precision, unless they wanted to do the training with quantization as well.

maksimKorzh · Post by **maksimKorzh** » Thu Oct 15, 2020 8:07 pm

hgm wrote: ↑Thu Oct 15, 2020 7:54 pm
maksimKorzh wrote: ↑Thu Oct 15, 2020 6:59 pmCan you please clarify the code:
Well, with comments (and expanded to do all layers) it would roughly look like this
Code: Select all
// in MakeMove():
piece = board[from];
victim = board[to];
board[from] = EMPTY;
board[to] = piece;

for(i=0; i<256; i++) {                 // for all cells in layer 1
  if(IsKing(piece)) {
    int sum=0;
    king[stm] = to;                    // New king location; stm = 0 (white) or 1 (black)
    for(s=0; s<64; s++) {              // for all board squares
      int p = board[s];                // get the piece on it
      sum += KPST[to][p][s];           // add its KPST score to the total
    }
    layer1[i + 256*stm] = sum;         // this is the board's PST eval for the new King location
  } else {                             // other than King, update incrementally
    for(j=0; j<2; j++) {               // for both players
      int k = king[j];                 // get their King's location   
      layer1[i+256*j] -= KPST[k][piece][from]; // update the PST eval incrementally
      layer1[i+256*j] += KPST[k][piece][to];
      layer1[i+256*j] -= KPST[k][victim][to];
    }
  }
}

for(i=0; i<32; i++) {                  // for all cells in layer 2
  int sum = 0;                         // prepare calculating their input for cell i
  for(j=0; j<512; j++) {               // for each of the cells in layer 1
    sum += weights1[i][j] * layer1[j]; // add its weighted output to the input
  }                                    // we now have the total input of cell i in layer 2
  layer2[i] = max(0, sum);             // calculate its output
}

for(i=0; i<32; i++) {                  // for all cells in layer 3
  int sum = 0;                         // prepare calculating their input for cell i
  for(j=0; j<32; j++) {                // for each of the cells in layer 2
    sum += weights1[i][j] * layer2[j]; // add its weighted output to the input
  }                                    // we now have the total input of cell i in layer 3
  layer3[i] = max(0, sum);             // calculate its output
}

int eval = 0;                          // prepare calculating the evaluation score
for(j=0; j<32; j++) {                  // for each of the cells in layer 3
  sum += weights1[i][j] * layer3[j];   // add its weighted output to the input
}                                      // we now have the total input the final cell
                                       // which just outputs it unmodified
This glosses over some hairy details in the first (KPST) layer, such as that the KPST for white and black really need to be each other's mirror images, but should give you roughly an idea of what is going on. BTW, KPST stands for King-Piece-Square Table, because it is indexed by king location, piece type and square location. They are similar to PST, but you use a different one for each King location. So that you can award larger bonus for pieces close to a King. In Shogi this is very important, as Shogi games are usually a race to mate in an all-out King attack. Defense is a losing strategy there, instead you counter-attack and hope to be faster at it.

questions:
1. How can I initialize weights1?
By loading it with the net you want to run. Presumably reading that from a file, and writing the numbers you read in the corresponding elements. Of course if you want to create (train) your own network, this is an entirely different matter. The code I gave is just for running an engine that uses a given network.

2. weights1 is 2 dimensional array here, what values I need in 1st and 2nd indices when I define array? // e.g. weights1[?][?]
The weights represent the 'strength' with which the output of a cell in layer 1 contributes to the input of a cell in layer 2. The indices are the cell numbers.

3. same question for layer1 amd and layer2 (I only understand that NNUE has 4 layers but that's rocket science to me)
The layer arrays hold the output of all the cells in that layer. (To be used as input for the next layer.)

Could you please provide the code the would be doing following(or give a link on implementation):
1. Init everything needed from "*.nnue" file with weights
I have no idea. For that you would have to know in what order Stockfish stores the weights of the net in the file. This is no doubt documented somewhere, but I am not interested in looking it up.

2. then I guess the code you've already provided
Indeed; that code was for calculating layer 2 from layer 1.

3. And then somehow magically obtain a score
The new code shows this for all the layers. The magical score is obtained by the last loop in that code, just the weighted sum of all the outputs of the cells in layer3.

Thank you SO SO MUCH.
I feel like would spend this night working out your code.
THANK YOU, HGM!

mar · Post by **mar** » Thu Oct 15, 2020 8:19 pm

Daniel Shawul wrote: ↑Thu Oct 15, 2020 8:06 pm I am not sure about the training though which could use FLOAT precision, unless they wanted to do the training with quantization as well.

I'm not sure whether training with quantization is a good idea - tried training a tiny net with quantization and the result was worse than without quantization - I quess quantization after training is ok though, but I'm not an expert by any means

hgm · Post by **hgm** » Thu Oct 15, 2020 8:20 pm

Note that I just fixed a few typos in the code (well, actually copy-paste errors, where I forgot to make the intended modifications to the copied code). The weights of all layers of course had to be different, and the last layer only needs a 1-d array of weights, as there is only a single output.

Note that the most complex part of the code I gave is the calculation of the KPST evaluations, because I wanted to show how that is done in an efficient way. What you really want to do is calculate the sum of all KPST[k][board[s]][s], looping over all board squares s and Kings k. For normal PST this would just be the sum of all PST[board[s]][s]. And that 256 times (with different KPST). If a King moves you have to do that from scratch, because all terms change. When another piece moves, you just have to correct the previous value for the changes at the involved s.

Hacking around CFish NNUE

Re: Hacking around CFish NNUE

Re: Hacking around CFish NNUE

Re: Hacking around CFish NNUE

Re: Hacking around CFish NNUE

Re: Hacking around CFish NNUE

Re: Hacking around CFish NNUE

Re: Hacking around CFish NNUE

Re: Hacking around CFish NNUE

Re: Hacking around CFish NNUE

Re: Hacking around CFish NNUE