Booot progress

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

connor_mcmonigle
Posts: 530
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Booot progress

Post by connor_mcmonigle »

booot wrote: Sun Jun 06, 2021 5:43 pm "My networks for Seer weren't trained on CP evals and instead predict WDL probabilities starting initially from EGTB. I used cross entropy loss (effectively, MLE)."

Then, you have to 'decode' the probability from the NNUE back to a CP eval every time you call NNUE_eval()?
Yep. My networks are also floating point, so Seer is quite slow anyway. However, this 'decoding back' is not actually as expensive as you might expect and takes up only ~1% of the total run time.
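For illustration, converting a win probability back to a centipawn-style eval can be done by inverting the usual logistic mapping. A minimal sketch (the constant k, the clamping and the function name are my own illustration, not necessarily what Seer actually does):

Code: Select all

import math

def prob_to_cp(p, k=400.0, eps=1e-6):
    # Inverse of p = 1 / (1 + exp(-cp / k)):  cp = k * ln(p / (1 - p)).
    p = min(max(p, eps), 1.0 - eps)   # clamp away from 0 and 1
    return k * math.log(p / (1.0 - p))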

In any case, were you to use cross entropy loss to train on search evals, you'd use binary cross entropy and have something like:

Code: Select all

loss = -(sigmoid(eval/k)*logsigmoid(pred) + sigmoid(-eval/k)*logsigmoid(-pred))
Where k is the performance constant mentioned in the linked document.
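A minimal runnable sketch of that loss in PyTorch, assuming eval_cp is the search score in centipawns and pred is the network's raw (pre-sigmoid) output; the function name and the default k=400 are my own choices, not Seer's:

Code: Select all

import torch
import torch.nn.functional as F

def wdl_bce_loss(eval_cp, pred, k=400.0):
    # Soft target: map the search eval to a win probability.
    target = torch.sigmoid(eval_cp / k)
    # Binary cross entropy against the raw output (a logit);
    # sigmoid(-x) = 1 - sigmoid(x), so this matches the formula above.
    return -(target * F.logsigmoid(pred)
             + (1.0 - target) * F.logsigmoid(-pred)).mean()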
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

I read this document, thanks. I like the idea of having probability labels instead of centipawns - in any case I can easily compute them from my HCE. I will try to train the net with these [0..1] labels (I will multiply them by a constant to get a bigger gradient).
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Booot progress

Post by Pio »

booot wrote: Sun Jun 06, 2021 6:21 pm I read this document, thanks. I like the idea of having probability labels instead of centipawns - in any case I can easily compute them from my HCE. I will try to train the net with these [0..1] labels (I will multiply them by a constant to get a bigger gradient).
Why will multiplying do any good? Of course the gradient will be bigger but the objective will be equally bigger so they should cancel each other out. Or am I missing something?

What I mean is that when you make the hill three times bigger it won’t help if the rate of climbing increases threefold. You will reach the top at the same time.
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

Pio wrote: Sun Jun 06, 2021 9:51 pm
booot wrote: Sun Jun 06, 2021 6:21 pm I read this document, thanks. I like the idea of having probability labels instead of centipawns - in any case I can easily compute them from my HCE. I will try to train the net with these [0..1] labels (I will multiply them by a constant to get a bigger gradient).
Why will multiplying do any good? Of course the gradient will be bigger but the objective will be equally bigger so they should cancel each other out. Or am I missing something?

What I mean is that when you make the hill three times bigger it won’t help if the rate of climbing increases threefold. You will reach the top at the same time.
Of course label multiplying does not give any advantage. I used the wrong word, sorry. My point is that the NN produces integers at its output, so I have to do something with the [0..1] range during quantization and training :-)
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Booot progress

Post by Pio »

booot wrote: Sun Jun 06, 2021 10:52 pm
Pio wrote: Sun Jun 06, 2021 9:51 pm
booot wrote: Sun Jun 06, 2021 6:21 pm I read this document, thanks. I like the idea of having probability labels instead of centipawns - in any case I can easily compute them from my HCE. I will try to train the net with these [0..1] labels (I will multiply them by a constant to get a bigger gradient).
Why will multiplying do any good? Of course the gradient will be bigger but the objective will be equally bigger so they should cancel each other out. Or am I missing something?

What I mean is that when you make the hill three times bigger it won’t help if the rate of climbing increases threefold. You will reach the top at the same time.
Of course label multiplying does not give any advantage. I used the wrong word, sorry. My point is that the NN produces integers at its output, so I have to do something with the [0..1] range during quantization and training :-)
Yes, I thought it might have to do with the quantisation. I have never done any quantisation for the networks I have trained (and I have never used a NN for chess), but I would probably just multiply the [0..1] range by 256 and floor the result to get it into the [0..255] range so it fits into a byte. I guess it might be smart not to quantise the weights at the beginning of training and, in the later stages, start quantising the weights closest to the input while leaving the others alone. Once the first layer has been quantised, the quantisation has introduced some error that the next layer can compensate for. Then you might want to freeze the first quantised layer and repeat the training and quantisation for the second layer, and so on. I guess doing so could minimise the error introduced by the quantisation and also reduce the risk of overfitting.
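A small NumPy sketch of that byte quantisation idea (just an illustration of the paragraph above, not anyone's actual training code; note the clamp so that an input of exactly 1.0 still fits in a byte):

Code: Select all

import numpy as np

def quantise_to_byte(x):
    # Map values in [0..1] to integers in [0..255].
    q = np.floor(np.asarray(x) * 256.0)
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantise(q):
    # Approximate inverse, back to [0..1].
    return q.astype(np.float32) / 256.0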
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Booot progress

Post by chrisw »

booot wrote: Sun Jun 06, 2021 1:31 pm
connor_mcmonigle wrote: Sat Jun 05, 2021 9:50 pm
Very interesting. I'd recommend using the standard 768 features as a first step, as a sort of ablation study, to ensure the extra features are actually having a positive impact. Additionally, a dataset of "just" 50 million positions may be inadequate for the number of features you've proposed. Dropout might be a partial solution: training with dropout enabled seems to have helped a great deal when training on smaller datasets in my experience.

Could you elaborate on the input features you plan to try? It is important that, for large feature sets, the features for any given position are sparse and that the average number of updated features between adjacent positions is also low. For HalfKA/HalfKP the average number of updated features between adjacent positions is roughly ~6. Much higher than 6 and it is likely incremental updates will prove too slow.
Yes - you are right. I added a 'tiny' model (2*770 features - I also use castle-rights features) to compare with.

Did you try to train an NNUE with HCE labels? Could you give me the 'mae' metric (mean absolute error) for those NNs? I am training my model with my learner and the process looks normal (the loss function decreases, and the training and validation losses and 'mae' metrics stay close for the first several epochs before overfitting on this small dataset). But I do not know what 'mae' value is OK - I have nothing to compare against. I got an 'mae' of about 32-35 centipawns on the validation dataset. Is that a lot or a little? :-)
I'm not sure there is a consistent reporting metric to compare against. Gary Linscott mentioned (some months ago) that he was getting 0.07 loss, but I don't think that came with the number of epochs, nor whether it was absolute MAE or squared MSE. I'm not sure that an epoch is well defined either.
If you use the winrate (0 to 1) scale, then a NN set to output a constant 0.5 will be exactly right about 1/3 of the time and off by 0.5 the other 2/3 of the time, so I always took that to mean a random NN has an MAE of about 0.33 and an MSE of about 0.17.
Anyway, how long is an epoch, and what exactly is being reported error-wise?
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

I decided to increase the training dataset to 250M positions to make it more reasonable for relatively big feature schemes. So the generator is working now. After that I will continue my investigations.
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

The training dataset of 250M positions is ready! I have also completed and tested the Python learner code. So, the plan for this stage, 'The Choice', is as follows:

1. The full training set (250M positions) with Booot's HCE evaluations is shuffled (for training purposes) and converted from centipawns to winning probability via the sigmoid function f(x)=1/(1+exp(-x/K)), where I took K=400. I also scaled this [0..1] range to [0...1000] by multiplying by 1000 (just my preference); see the sketch after this list.
2. The learner divides the full set into 1250 subsets (200k positions each), decodes them from FEN to feature input data and feeds them subset by subset to the Keras model.fit() function with batch size 256.
3. The loss function is mean squared error. I also chose the mean absolute error metric to compare the different models.
4. I have 5 feature models to compare. I decided to use a 'pure' float32 NN without any quantization-aware restrictions on relu() or weight constraints. I want to compare the real 'capacity' of the different feature models (how much useful chess knowledge each model can extract from the dataset), and those restrictions could 'spoil' the final results. Quantization-aware training is difficult and tricky and does not even guarantee a good final result :-). So I will think about quantization-aware training later, at the next stage, after I have chosen the model.
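A small sketch of the label conversion from step 1 (the function name is mine; K=400 and the scaling to [0...1000] are exactly as stated above):

Code: Select all

import numpy as np

def cp_to_label(cp, k=400.0, scale=1000.0):
    # Centipawn eval -> win probability via sigmoid, then scaled to [0...1000].
    return scale / (1.0 + np.exp(-np.asarray(cp, dtype=np.float64) / k))

# e.g. cp_to_label(0) == 500.0, cp_to_label(100) is roughly 562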


A few words about the feature models. I have my own terms to describe them. Of course some of them are already known, but for me it is easier to picture the data as follows:
1. The data in the input layer describes the chess position and has 2 equal-sized 'frames': one for each opponent's point of view. All data in a frame is encoded from White's point of view (I flip the board by sq xor 56 for the black side). So every piece on the board appears in both frames: once from the white king's point of view and once from the black king's point of view. For example: a white pawn on c4 is a white pawn on c4 from the white king's point of view and becomes a black pawn on c5 when we flip the board to the black king's point of view. The first frame in the data belongs to the side to move.
2. A frame consists of 'blocks'. A block is the minimum data portion describing the position on the board as '0's and '1's in a piece-square array. The size of a block is 12 (piece types) * 64 (squares) + 2 (castle-rights features for the frame owner's king) = 770. Depending on the feature model, each frame can have from 1 to 104 blocks.

Chosen models to compare:

1. 'Tiny' or 'P-Sq' model. Has only 1 block per frame. The full size of the model is 2*770=1540 features.
2. 'Small' or 'file-P-Sq' model. Each frame has 8 blocks; 7 blocks are inactive (all '0') and 1 block is active depending on the file ('a'-'h') where the frame owner's king stands. The idea is obvious: the king is the most important piece on the board, and it is important to capture the relation between the king's location and the other pieces' locations. My hypothesis: the file-P-Sq model is 8 times smaller than Half-KP, yet from a former chess player's point of view it seems more natural for chess - let's see. The full size of the model is 2*8*770=12320 features. (A sketch of how a feature index might be computed for this model follows the list.)
3. 'Medium' or 'Q-file-P-Sq' model. Twice as big as the previous model (16 blocks per frame). The additional factor is the presence of the frame owner's queen (true/false). The idea is to split the position into the old-fashioned middlegame/endgame evaluation: positions with and without queens usually need completely different knowledge to understand, so it may be better to have different features and weights for them.
4. 'Full KP' or 'K-P-Sq' model. I decided to also include a model quite similar to Half-KP for comparison. The difference is that I do not split the first hidden layer's neurons between the frames (like 2*256-32-32-1), so I call this model FullKP :-). The full size of the model is 2*64*770=98560 features.
5. 'Maximum' or 'phase-file-P-Sq' model. The phase is the well-known [0-12] number depending on the material left on the board. It is quite a big model and I am not sure whether the 250M training set will be enough to saturate it, but let's see. The full size of the model is 2*13*8*770=160160 features.

So, after some final preparations the show will start! Wish me luck :-). I measured the training speed on my machine: it is about 3-5 hours per epoch (the full 250M dataset), so most probably I will train 5 epochs for each model and compare the metrics of the last saved model. For those who have trained Half-KP models for Stockfish or their own engine: how many epochs do you usually need for the final model?
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

I forgot to say: I use 5% of the dataset as validation data. I still do not understand why the Stockfish team needs different-depth data for it. Do we need to fit the model on 'apples' and validate it on 'oranges'? Can anyone explain the idea? I really cannot understand it.
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

First results. While 'calibrating' the whole framework - generator, preprocessing, trainer - I found some interesting nuances and changed my plans a little. So, I received the first result of NN training on Booot's data. I call it the 'zero point': the simplest 12*64=768-feature model (even without castle-rights features), 1 data block with pieces coded from White's point of view.

Net structure : [768 features]-256-32-32-1
206177 trainable params.
Labels - win probability score from the side to move's point of view, scaled to [0...1000].
loss - mse, metric - mae, optimizer - Adam, 5% of data - for validation
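For reference, a minimal Keras sketch matching that structure (my own code; the ReLU hidden activations and linear output are assumptions, but the layer sizes reproduce the 206177 trainable parameters quoted above):

Code: Select all

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(768,)),
    layers.Dense(256, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(1),                      # scaled win probability in [0...1000]
])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.summary()                           # Total params: 206,177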

Last 3 epochs

Epoch 8 ended. Loss-527.2194232407043 mae-16.42082664892256 val_loss-527.6986280116532 val_mae-16.426774829983426
Epoch 9 ended. Loss-524.9331651866007 mae-16.38547597133475 val_loss-525.7456881174747 val_mae-16.397830796946916
Epoch 10 ended. Loss-522.9572118916195 mae-16.35402000665093 val_loss-523.4776344695728 val_mae-16.36083416363223

As far as I can tell, the model is still trainable. I wonder how precise it is - the mean absolute error on the output win probability score is 1.636% (16.36 on the [0...1000] scale). Not bad for such a small model! So this is my zero point, against which all my future models will be compared.