I have been going through the code of Senpai 2.0 recently. I noticed that its evaluation function is completely parameterized: each evaluation component's score is a feature value times a weight, and the component scores are summed to get the final evaluation. Fabien mentioned that the evaluation weights were trained with logistic regression. I am not sure exactly how he did this, so I tried a fairly brute-force approach: run a linear regression directly on some training data and see how it performs. The training data is generated automatically by the NNUE data-generation algorithm. I use two training sets, one with 100 million examples and the other with 1 billion.
Featurization
To compute a feature vector from a given position in Senpai, I had to modify the original evaluation code.
First, to extract the feature values, I have to postpone the dot product between the feature vector and the weight vector. The evaluation interpolates between middlegame and endgame scores according to the material phase, like this:
Code: Select all
eval = \alpha * (\phi \dot w_mg) + (1 - \alpha) * (\phi \dot w_eg)
Since the dot product is linear, this can be rearranged as:
Code: Select all
eval = ((\alpha * \phi) \dot w_mg) + (((1 - \alpha) * \phi) \dot w_eg)
= ((\alpha * \phi) \concat ((1 - \alpha) * \phi)) \dot (w_mg \concat w_eg)
If we set x = (\alpha * \phi) \concat ((1 - \alpha) * \phi) and w = w_mg \concat w_eg, then we obtain eval = x \dot w, which is a standard linear model. The original weights in Senpai form a 759 * 2 (N_dimension * N_game_phase) matrix; I simply flatten it into a one-dimensional vector of length 1518.
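To make the rearrangement concrete, here is a small numpy sketch. The dimension 759 is from the post; the random \phi, weights, and \alpha value are placeholders, not Senpai's actual values:

```python
import numpy as np

N_FEATURES = 759  # per-phase feature dimension mentioned in the post

def build_input(phi: np.ndarray, alpha: float) -> np.ndarray:
    """x = (alpha * phi) ++ ((1 - alpha) * phi), length 2 * 759 = 1518."""
    return np.concatenate([alpha * phi, (1.0 - alpha) * phi])

def flatten_weights(w_mg: np.ndarray, w_eg: np.ndarray) -> np.ndarray:
    """Flatten the 759 x 2 weight table into one vector w = w_mg ++ w_eg."""
    return np.concatenate([w_mg, w_eg])

# The phase-interpolated eval collapses into a single dot product:
# alpha * (phi . w_mg) + (1 - alpha) * (phi . w_eg) == x . w
rng = np.random.default_rng(0)
phi = rng.random(N_FEATURES)
w_mg, w_eg = rng.random(N_FEATURES), rng.random(N_FEATURES)
alpha = 0.7
x, w = build_input(phi, alpha), flatten_weights(w_mg, w_eg)
assert np.isclose(x @ w, alpha * (phi @ w_mg) + (1 - alpha) * (phi @ w_eg))
```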
Second, Senpai applies a reducing factor at the end of the evaluation (after the dot product between features and weights) to handle drawish positions. Since this is a purely post-processing step after the dot product, I ignored it in my feature vector. I think it could be integrated into the feature computation, but because it involves some division and may introduce additional floating-point rounding errors, I have left it out for now.
Training
The learning algorithm is simple: minimize the MSE with mini-batch gradient descent. I set the mini-batch size to 40,000, the learning rate to 0.1, and the number of epochs to 200. I also shuffle the data before the start of each epoch. No regularization is applied yet. So far, one epoch of training on the 100m set takes around 3 minutes, but one epoch on the 1b set takes around 30 minutes; the full 200 epochs on the 1b set took almost 4 days.
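The loop described above can be sketched as follows (a minimal numpy version with the post's hyperparameters as defaults; the function name and data layout are my own, and this makes no claim to match the exact update rule used):

```python
import numpy as np

def train(X, y, batch_size=40_000, lr=0.1, epochs=200, shuffle=True, seed=0):
    """Mini-batch gradient descent on the mean squared error.
    X: (n_samples, 1518) feature matrix, y: target scores."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        # reshuffle at the start of each epoch (the post's setup)
        idx = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            err = X[b] @ w - y[b]                # predictions minus targets
            grad = 2.0 * X[b].T @ err / len(b)   # gradient of the MSE
            w -= lr * grad
    return w
```

Whether a fixed learning rate of 0.1 is stable depends on the scale of the features, so in practice the inputs may need normalization.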
The following training issues bothered me most:
- Training speed. 30 minutes for a single epoch is too slow in my opinion. Most of the time is spent on inference, which involves 3 steps: (1) decode the NNUE sfen structure into a Senpai position; (2) compute the feature vector; (3) take the dot product with the weights to get the predicted score. Obviously, recomputing the feature vector for every sfen in every epoch is wasteful, but dumping the features to a file takes a huge amount of disk space. I created a custom compressed file format to store the feature file, but it still takes more than 70GB to store the dumped features for the 100m training set. You can imagine that dumping the 1b set would take nearly 1TB.
- Data shuffling. Shuffling a huge training set is also painful. The original 1b training set is 40GB. To shuffle this file without consuming too much memory, I designed a somewhat odd shuffling algorithm: partition the original file into N chunks, shuffle each chunk first, then merge the chunks by repeatedly taking the front record of a randomly chosen chunk.
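The chunked shuffle can be sketched like this (in-memory lists stand in for the on-disk chunk files; note the merge is not a perfectly uniform shuffle, since two records from the same chunk keep their post-shuffle order relative to each other):

```python
import random

def external_shuffle(records, n_chunks=8, seed=0):
    """Chunked shuffle: split into N chunks, shuffle each chunk,
    then merge by popping the front of a randomly chosen chunk."""
    rng = random.Random(seed)
    chunks = [records[i::n_chunks] for i in range(n_chunks)]
    for c in chunks:
        rng.shuffle(c)
    # reverse each chunk so list.pop() removes its front in O(1)
    live = [c[::-1] for c in chunks if c]
    out = []
    while live:
        i = rng.randrange(len(live))  # pick a random non-empty chunk
        out.append(live[i].pop())
        if not live[i]:
            del live[i]
    return out
```

On disk the same idea works by streaming each chunk file sequentially, so only one record per chunk needs to be buffered in memory.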
The weights I trained are still not comparable to Senpai's original ones. I tested at a 40 moves/5 min time control, with 800 games each. There are three setups: the two training sets, with and without data shuffling between epochs (I did not shuffle the 1b set because that is too slow).
Code: Select all
Name                       Elo    +    -   games
senpai20                     0   10   10     800
senpai21_100m_unshuffled  -100   10   10     800
senpai21_100m_shuffled     -70   10   10     800
senpai21_1b_unshuffled     -90   10   10     800
Things I plan to try next:
- Add L1/L2 regularizers
- Tune the other hyperparameters
- Try other optimizers instead of plain mini-batch gradient descent
- Add data shuffling to the 1b training run
- Run a qsearch during inference (the NNUE trainer does this, though I am not sure why)