The training dataset of 250M positions is ready! I have also completed and tested the Python learner code. So, the plan for this stage, 'The Choice', is as follows:
1. I shuffle the full training set (250M positions with Booot's HCE evaluations) for training purposes and convert the scores from centipawns to winning probability via the sigmoid function f(x) = 1/(1+exp(-x/K)), where I took K = 400. I also scaled this [0..1] range to [0..1000] by multiplying by 1000 (just my preference). See the sketch after this list.
2. The learner divides the full training set into 1250 subsets (200k positions each), decodes each subset from FEN to the feature input data, and feeds them subset by subset to the Keras model.fit() function with batch size 256.
3. The loss function is Mean Squared Error. I also chose Mean Absolute Error as the metric for comparing the different models.
4. I have 5 feature models to compare. I decided to use a 'pure' float32 NN without any specific 'quantization-aware' restrictions on relu() or weight constraints. I would like to compare the real 'capacity' of the different feature models (how much useful chess knowledge a model can extract from the dataset), and those restrictions could 'spoil' the final results. Quantization-aware training is difficult and tricky and gives no guarantee of a good final result, so I will think about it later, at the next stage, after I have chosen the model.
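To make the plan concrete, here is a minimal sketch of steps 1-3 in Python/Keras. It is only an illustration under some assumptions, not my actual learner code: `model` is an already-built Keras model, the `adam` optimizer is just a placeholder, and `load_subset()` / `fens_to_features()` are hypothetical helpers standing in for the real subset loader and FEN-to-features decoder.

```python
import numpy as np

K = 400.0  # sigmoid scale in centipawns

def cp_to_target(cp):
    # Centipawns -> winning probability in [0..1], scaled to [0..1000].
    return 1000.0 / (1.0 + np.exp(-np.asarray(cp, dtype=np.float32) / K))

# MSE loss, MAE as the comparison metric (step 3); the optimizer is a placeholder.
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

# One epoch: stream the 1250 subsets (200k positions each) through fit() (step 2).
for subset_id in range(1250):
    fens, cp_scores = load_subset(subset_id)   # hypothetical loader
    x = fens_to_features(fens)                 # hypothetical FEN -> features decoder
    y = cp_to_target(cp_scores)
    model.fit(x, y, batch_size=256, epochs=1, verbose=0)
```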
Some words about the feature models. I have my own terms to describe them. Of course, some of them are already known, but for me it is more convenient to imagine the data as follows:
1. The data in the input layer describes a chess position and has 2 equal-sized 'frames': one for each opponent's point of view. All data in a frame is stored from White's point of view (I flip the board by sq xor 56 for the black side). So every piece on the board appears in both frames: once from the white king's point of view and once from the black king's point of view. For example: a white pawn on c4 is a white pawn on c4 from the white king's point of view and becomes a black pawn on c5 when we flip the board to the black king's point of view. The first frame in the data belongs to the side to move.
2. A frame consists of 'blocks'. A block is the minimum data portion describing the position on the board as '0'/'1' piece-square features in an array. The size of a block is 12 (piece types) * 64 (squares) + 2 (castling-rights features for the frame owner's king) = 770. Depending on the feature model, each frame can have from 1 to 104 blocks (see the index sketch below).
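As an illustration, this is roughly how the index of one active feature inside a frame can be computed under this layout. Only the 770-slot block layout and the sq xor 56 flip come from the description above; the piece ordering (white pieces 0-5, black pieces 6-11) and a1 = 0 square numbering are assumptions for the sketch.

```python
WHITE, BLACK = 0, 1
BLOCK_SIZE = 12 * 64 + 2  # 768 piece-square slots + 2 castling-rights slots = 770

def feature_index(block, piece_color, piece_type, square, frame_color):
    # Index of one '1' feature inside a frame. piece_type is 0..5
    # (pawn..king), square is 0..63 with a1 = 0 (assumed numbering).
    if frame_color == BLACK:
        square ^= 56       # flip the board vertically for the black frame
        piece_color ^= 1   # swap ownership: the opponent's pieces become 'ours'
    piece = piece_color * 6 + piece_type    # 0..11 (assumed ordering)
    return block * BLOCK_SIZE + piece * 64 + square
```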
Chosen models to compare (a block-selection sketch follows the list):
1. 'Tiny' or 'P-Sq' model. Has only 1 block per frame. The full size of the model is 2*770 = 1540 features.
2. 'Small' or 'file-P-Sq' model. Each frame has 8 blocks: 7 blocks are inactive (all '0') and 1 block is active, depending on the file ('a'-'h') where the frame owner's king stands. The idea is obvious: the king is the most important piece on the board, and it is important to understand the relation between the king's location and the locations of the other pieces. My hypothesis: the file-P-Sq model is 8 times smaller than Half-KP, but it seems more natural for chess to me, from a former chessplayer's point of view; let's see. The full size of the model is 2*8*770 = 12320 features.
3. 'Medium' or 'Q-file-P-Sq' model. Twice as big as the previous model (16 blocks per frame). The additional factor is own-queen presence (true/false). The idea is to split the position into the old-fashioned middlegame/endgame evaluation: positions with and without queens usually need completely different knowledge to understand, so it may be better to have separate features and weights for them. The full size of the model is 2*16*770 = 24640 features.
4. 'Full KP' or 'K-P-Sq' model. I decided to also include a model quite similar to Half-KP for comparison. The difference is that I do not split the first hidden layer's neurons between the frames (like 2*256-32-32-1), so I call this model FullKP. The full size of the model is 2*64*770 = 98560 features.
5. "Maximum' or "phase-file-P-Sq" model. Phase is well-known [0-12] number depending on material left on the chess board. It is quite bug model and i am not sure if 250M training set will be ok to saturate it. But lets see. The full size of model is 2*13*8*770=160160 features.
So, after some last preparations, the show will start! Wish me good luck. I measured the training speed on my machine: it is about 3-5 hours per epoch (the full 250M dataset), so most probably I will train 5 epochs for each model and compare the metrics of the last saved model. Has anyone here trained Half-KP models for Stockfish or their own engine? How many epochs do you usually need for the final model?