Tactics in training data

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: Tactics in training data

Post by xr_a_y »

chrisw wrote: Thu Jun 17, 2021 10:58 am
xr_a_y wrote: Thu Jun 17, 2021 8:02 am
niel5946 wrote: Wed Jun 16, 2021 11:01 pm
[...]

For the real dataset with 50M-70M positions, I am using lichess positions for exactly that: variety. I think non-master human play is good at early training phases since such games are usually a lot more unbalanced than engine-v-engine. As a side note, I have actually thought about using positions from random self-play (like Minic I think?) since they will also be very unbalanced. My thought process is that if the net doesn't know the difference between clearly losing positions and winning ones, balanced positions won't even matter.

[...]
For NNUE training data generation, I tried many things ... :
- from pv (real game at fixed depth, or even just using short TC)
- from random positions (self-play random mover)
- from the search tree (not taking all positions of the search tree, of course, but sampling)

For each of these solutions, one must probably filter out non-quiet positions (in a sense still to be defined ...).

What works best currently for me is using fixed small-depth self-play games with a random factor added to the score for the first 10 moves, and for each position reached, go to the quiet leaf using a qsearch and then evaluate this leaf position at a reasonable depth (currently for Minic, 8 to 12 depending on game phase).

On 16 threads, I'm able to generate 3M such positions per hour. So 2 weeks for 1B positions ...
How much depth are you using for the game play?
Currently depth 5
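
A rough sketch of the generation loop described above. The types and functions here (Position, Move, game_over, do_move, qsearch_pv_leaf, search_score, pick_move) are placeholders standing in for whatever the engine provides, not Minic's actual API:
[code]
// Sketch only: Position, Move and the functions below are assumed placeholders.
#include <vector>

struct Position { /* engine-specific */ };
struct Move     { /* engine-specific */ };
bool     game_over(const Position&);               // assumed
void     do_move(Position&, const Move&);          // assumed
Position qsearch_pv_leaf(const Position&);         // quiet leaf reached by qsearch (assumed)
int      search_score(const Position&, int depth); // fixed-depth search score (assumed)
Move     pick_move(const Position&, int depth,
                   int noise_cp);                  // best move with +/- noise_cp random
                                                   // centipawns added to the score (assumed)

struct Sample { Position pos; int score_cp; };     // one training position + label

// One self-play game: moves chosen at a small fixed depth (with score noise on
// the first few moves); every position reached is resolved to its quiet qsearch
// leaf, and that leaf is labelled with a deeper search.
std::vector<Sample> play_one_game(Position pos,
                                  int game_depth  = 5,   // depth used to pick moves
                                  int label_depth = 10,  // depth used to label quiet leaves
                                  int noisy_moves = 10,  // plies with randomized scores
                                  int noise_cp    = 50)  // illustrative noise amplitude
{
    std::vector<Sample> out;
    for (int ply = 0; !game_over(pos); ++ply) {
        Position quiet = qsearch_pv_leaf(pos);
        out.push_back({quiet, search_score(quiet, label_depth)});
        Move m = pick_move(pos, game_depth, ply < noisy_moves ? noise_cp : 0);
        do_move(pos, m);
    }
    return out;
}
[/code]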
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Tactics in training data

Post by chrisw »

xr_a_y wrote: Thu Jun 17, 2021 11:06 am
chrisw wrote: Thu Jun 17, 2021 10:58 am
xr_a_y wrote: Thu Jun 17, 2021 8:02 am
niel5946 wrote: Wed Jun 16, 2021 11:01 pm
[...]

For the real dataset with 50M-70M positions, I am using lichess positions for exactly that: variety. I think non-master human play is good at early training phases since such games are usually a lot more unbalanced than engine-v-engine. As a side note, I have actually thought about using positions from random self-play (like Minic I think?) since they will also be very unbalanced. My thought process is that if the net doesn't know the difference between clearly losing positions and winning ones, balanced positions won't even matter.

[...]
For NNUE training data generation, I tried many things ... :
- from pv (real game at fixed depth, or even just using short TC)
- from random positions (self-play random mover)
- from the search tree (not taking all positions of the search tree, of course, but sampling)

For each of these solutions, one must probably filter out non-quiet positions (in a sense still to be defined ...).

What works best currently for me is using fixed small-depth self-play games with a random factor added to the score for the first 10 moves, and for each position reached, go to the quiet leaf using a qsearch and then evaluate this leaf position at a reasonable depth (currently for Minic, 8 to 12 depending on game phase).

On 16 threads, I'm able to generate 3M such positions per hour. So 2 weeks for 1B positions ...
How much depth are you using for the game play?
Currently depth 5
Why not just use the eval for each move from the test game PGNs? (With the option to junk any positions where the bm continuation is not quiet.) That would give you N quiet positions with a d5 eval.
What your method seems to be adding is P quiet positions at d8-12. So your N is cheap at d5, but your P is expensive at d8/12. Why add expensive P when you could have more of the cheaper N in the same overall time? Or am I not getting something?
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: Tactics in training data

Post by xr_a_y »

chrisw wrote: Thu Jun 17, 2021 11:17 am
xr_a_y wrote: Thu Jun 17, 2021 11:06 am
chrisw wrote: Thu Jun 17, 2021 10:58 am
xr_a_y wrote: Thu Jun 17, 2021 8:02 am
niel5946 wrote: Wed Jun 16, 2021 11:01 pm
[...]

For the real dataset with 50M-70M positions, I am using lichess positions for exactly that: variety. I think non-master human play is good at early training phases since such games are usually a lot more unbalanced than engine-v-engine. As a side note, I have actually thought about using positions from random self-play (like Minic I think?) since they will also be very unbalanced. My thought process is that if the net doesn't know the difference between clearly losing positions and winning ones, balanced positions won't even matter.

[...]
For NNUE training data generation, I tried many things ... :
- from pv (real game at fixed depth, or even just using short TC)
- from random positions (self-play random mover)
- from the search tree (not taking all positions of the search tree, of course, but sampling)

For each of these solutions, one must probably filter out non-quiet positions (in a sense still to be defined ...).

What works best currently for me is using fixed small-depth self-play games with a random factor added to the score for the first 10 moves, and for each position reached, go to the quiet leaf using a qsearch and then evaluate this leaf position at a reasonable depth (currently for Minic, 8 to 12 depending on game phase).

On 16 threads, I'm able to generate 3M such positions per hour. So 2 weeks for 1B positions ...
How much depth are you using for the game play?
Currently depth 5
Why not just use the eval for each move from the test game PGNs? (With the option to junk any positions where the bm continuation is not quiet.) That would give you N quiet positions with a d5 eval.
What your method seems to be adding is P quiet positions at d8-12. So your N is cheap at d5, but your P is expensive at d8/12. Why add expensive P when you could have more of the cheaper N in the same overall time? Or am I not getting something?
There are multiple factors here. First, d5 data is not as good as d8 data for net training in Minic.
Then, I tried both generation from the pv and from the qsearch leaf (as said, I also tried from random places in the search tree). Qsearch leaf positions appear to be more interesting, probably because they are what search and qsearch actually see during search (and not just positions from the game).

But for sure, I'm very open to comments and hints as I already know that my data for training NNUE are not optimal at all.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Tactics in training data

Post by chrisw »

xr_a_y wrote: Thu Jun 17, 2021 12:04 pm
chrisw wrote: Thu Jun 17, 2021 11:17 am
xr_a_y wrote: Thu Jun 17, 2021 11:06 am
chrisw wrote: Thu Jun 17, 2021 10:58 am
xr_a_y wrote: Thu Jun 17, 2021 8:02 am
niel5946 wrote: Wed Jun 16, 2021 11:01 pm
[...]

For the real dataset with 50M-70M positions, I am using lichess positions for exactly that: variety. I think non-master human play is good at early training phases since such games are usually a lot more unbalanced than engine-v-engine. As a side note, I have actually thought about using positions from random self-play (like Minic I think?) since they will also be very unbalanced. My thought process is that if the net doesn't know the difference between clearly losing positions and winning ones, balanced positions won't even matter.

[...]
For NNUE training data generation, I tried many things ... :
- from pv (real game at fixed depth, or even just using short TC)
- from random positions (self-play random mover)
- from the search tree (not taking all positions of the search tree, of course, but sampling)

For each of these solutions, one must probably filter out non-quiet positions (in a sense still to be defined ...).

What works best currently for me is using fixed small-depth self-play games with a random factor added to the score for the first 10 moves, and for each position reached, go to the quiet leaf using a qsearch and then evaluate this leaf position at a reasonable depth (currently for Minic, 8 to 12 depending on game phase).

On 16 threads, I'm able to generate 3M such positions per hour. So 2 weeks for 1B positions ...
How much depth are you using for the game play?
Currently depth 5
Why not just use the eval for each move from the test game PGNs? (With the option to junk any positions where the bm continuation is not quiet.) That would give you N quiet positions with a d5 eval.
What your method seems to be adding is P quiet positions at d8-12. So your N is cheap at d5, but your P is expensive at d8/12. Why add expensive P when you could have more of the cheaper N in the same overall time? Or am I not getting something?
There are multiple factors here. First, d5 data is not as good as d8 data for net training in Minic.
Then, I tried both generation from the pv and from the qsearch leaf (as said, I also tried from random places in the search tree). Qsearch leaf positions appear to be more interesting, probably because they are what search and qsearch actually see during search (and not just positions from the game).

But for sure, I'm very open to comments and hints as I already know that my data for training NNUE are not optimal at all.

Yes, I agree, game lines are probably not giving optimal training positions. There are possibly several answers to this:

1. Have so many training positions that it no longer matters much which subset of possible training positions one uses. You use a “superset”. Possibly this is why SF net devs talk about 20 billion training positions.

2. Abandon the AB qsearch paradigm and no longer cull non-quiet training positions.

3. Play test games to generate test positions, but get the test positions by having your Search() build them: a list of just plain random positions generated/evaluated during the search. Dunno, maybe one in 10,000 evaluations or something.
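
A bare-bones illustration of option 3: keep roughly one evaluated position in 10,000 as a training sample. Position and append_training_sample() are placeholders, not any engine's real API:
[code]
// Sketch only: Position and append_training_sample() stand in for the
// engine's own types and output routine.
#include <random>

struct Position { /* engine-specific */ };
void append_training_sample(const Position&, int score_cp);  // assumed: writes pos + label

namespace sampler {
    inline std::mt19937_64 rng{20210617};
    inline std::uniform_int_distribution<int> one_in_10k{0, 9999};

    // Called from evaluate() (or wherever a position is scored): keeps roughly
    // one evaluated position in 10,000.
    inline void maybe_keep(const Position& pos, int static_eval_cp) {
        if (one_in_10k(rng) == 0)
            append_training_sample(pos, static_eval_cp);
    }
}
[/code]
A random draw is used here instead of a plain counter modulo 10,000, simply to avoid sampling positions at a fixed, correlated interval of the search; either variant matches the idea above.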
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: Tactics in training data

Post by xr_a_y »

chrisw wrote: Thu Jun 17, 2021 2:12 pm
xr_a_y wrote: Thu Jun 17, 2021 12:04 pm
chrisw wrote: Thu Jun 17, 2021 11:17 am
xr_a_y wrote: Thu Jun 17, 2021 11:06 am
chrisw wrote: Thu Jun 17, 2021 10:58 am
xr_a_y wrote: Thu Jun 17, 2021 8:02 am
niel5946 wrote: Wed Jun 16, 2021 11:01 pm
[...]

For the real dataset with 50M-70M positions, I am using lichess positions for exactly that: variety. I think non-master human play is good at early training phases since such games are usually a lot more unbalanced than engine-v-engine. As a side note, I have actually thought about using positions from random self-play (like Minic I think?) since they will also be very unbalanced. My thought process is that if the net doesn't know the difference between clearly losing positions and winning ones, balanced positions won't even matter.

[...]
For NNUE training data generation, I tried many things ... :
- from pv (real game at fixed depth, or even just using short TC)
- from random positions (self-play random mover)
- from the search tree (not taking all positions of the search tree, of course, but sampling)

For each of these solutions, one must probably filter out non-quiet positions (in a sense still to be defined ...).

What works best currently for me is using fixed small-depth self-play games with a random factor added to the score for the first 10 moves, and for each position reached, go to the quiet leaf using a qsearch and then evaluate this leaf position at a reasonable depth (currently for Minic, 8 to 12 depending on game phase).

On 16 threads, I'm able to generate 3M such positions per hour. So 2 weeks for 1B positions ...
How much depth are you using for the game play?
Currently depth 5
Why not just use the eval for each move from the test game PGNs? (With the option to junk any positions where the bm continuation is not quiet.) That would give you N quiet positions with a d5 eval.
What your method seems to be adding is P quiet positions at d8-12. So your N is cheap at d5, but your P is expensive at d8/12. Why add expensive P when you could have more of the cheaper N in the same overall time? Or am I not getting something?
There are multiple factors here. First, d5 data is not as good as d8 data for net training in Minic.
Then, I tried both generation from the pv and from the qsearch leaf (as said, I also tried from random places in the search tree). Qsearch leaf positions appear to be more interesting, probably because they are what search and qsearch actually see during search (and not just positions from the game).

But for sure, I'm very open to comments and hints as I already know that my data for training NNUE are not optimal at all.

Yes, I agree, game lines are probably not giving optimal training positions. There are possibly several answers to this:

1. Have so many training positions that it no longer matters much which subset of possible training positions one uses. You use a “superset”. Possibly this is why SF net devs talk about 20 billion training positions.

2. Abandon the AB qsearch paradigm and no longer cull non-quiet training positions.

3. Play test games to generate test positions, but get the test positions by having your Search() build them: a list of just plain random positions generated/evaluated during the search. Dunno, maybe one in 10,000 evaluations or something.
Indeed, at least 500M positions are needed to train an NNUE, and it seems best to use 1B, 2B or even more.
I have tried extracting random positions from the search tree before (using a "modulo" as you suggest), but without real success ... but I can try again for sure (I removed that code 6 months ago ...) ;)
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Tactics in training data

Post by chrisw »

xr_a_y wrote: Thu Jun 17, 2021 3:25 pm
chrisw wrote: Thu Jun 17, 2021 2:12 pm
xr_a_y wrote: Thu Jun 17, 2021 12:04 pm
chrisw wrote: Thu Jun 17, 2021 11:17 am
xr_a_y wrote: Thu Jun 17, 2021 11:06 am
chrisw wrote: Thu Jun 17, 2021 10:58 am
xr_a_y wrote: Thu Jun 17, 2021 8:02 am
niel5946 wrote: Wed Jun 16, 2021 11:01 pm
[...]

For the real dataset with 50M-70M positions, I am using lichess positions for exactly that: variety. I think non-master human play is good at early training phases since such games are usually a lot more unbalanced than engine-v-engine. As a side note, I have actually thought about using positions from random self-play (like Minic I think?) since they will also be very unbalanced. My thought process is that if the net doesn't know the difference between clearly losing positions and winning ones, balanced positions won't even matter.

[...]
For NNUE training data generation, I tried many things ... :
- from pv (real game at fixed depth, or even just using short TC)
- from random positions (self-play random mover)
- from the search tree (not taking all positions of the search tree, of course, but sampling)

For each of these solutions, one must probably filter out non-quiet positions (in a sense still to be defined ...).

What works best currently for me is using fixed small-depth self-play games with a random factor added to the score for the first 10 moves, and for each position reached, go to the quiet leaf using a qsearch and then evaluate this leaf position at a reasonable depth (currently for Minic, 8 to 12 depending on game phase).

On 16 threads, I'm able to generate 3M such positions per hour. So 2 weeks for 1B positions ...
How much depth are you using for the game play?
Currently depth 5
Why not just use the eval for each move from the test game PGNs? (With the option to junk any positions where the bm continuation is not quiet.) That would give you N quiet positions with a d5 eval.
What your method seems to be adding is P quiet positions at d8-12. So your N is cheap at d5, but your P is expensive at d8/12. Why add expensive P when you could have more of the cheaper N in the same overall time? Or am I not getting something?
There are multiple factors here. First, d5 data is not as good as d8 data for net training in Minic.
Then, I tried both generation from the pv and from the qsearch leaf (as said, I also tried from random places in the search tree). Qsearch leaf positions appear to be more interesting, probably because they are what search and qsearch actually see during search (and not just positions from the game).

But for sure, I'm very open to comments and hints as I already know that my data for training NNUE are not optimal at all.

Yes, I agree, game lines are probably not giving optimal training positions. There are possibly several answers to this:

1. Have so many training positions that it no longer matters much which subset of possible training positions one uses. You use a “superset”. Possibly this is why SF net devs talk about 20 billion training positions.

2. Abandon the AB qsearch paradigm and no longer cull non-quiet training positions.

3. Play test games to generate test positions, but get the test positions by having your Search() build them: a list of just plain random positions generated/evaluated during the search. Dunno, maybe one in 10,000 evaluations or something.
Indeed, at least 500M positions are needed to train an NNUE, and it seems best to use 1B, 2B or even more.
I have tried extracting random positions from the search tree before (using a "modulo" as you suggest), but without real success ... but I can try again for sure (I removed that code 6 months ago ...) ;)
I'd be interested to know how that pans out. I'll have 128 cores this time next week and was intending a mass evaluation of lichess positions instead of comp game positions. The idea being that mass is quality. We will see!
jonkr
Posts: 178
Joined: Wed Nov 13, 2019 1:36 am
Full name: Jonathan Kreuzer

Re: Tactics in training data

Post by jonkr »

niel5946 wrote: Wed Jun 16, 2021 11:01 pm I just noticed that I had the same problem with the score's sign. Before, I flipped the score depending on the STM at the root and not at the leaf. I tried to train the net again with this, but it still doesn't seem to work...
How did you change to int16? Did you just cast the floating point values to integers? I am asking because I have no clue about how quantization works yet. I have tried reading some articles about the subject, but it doesn't really make sense to me yet.
Flipped scores can definitely give nonsense play; unfortunately it sounds like something else isn't working too. I was unclear whether you said it worked with static eval or some other method; if it did, I'd suggest comparing the target values for some of the positions between the search-scored data and a set that works, to see if anything seems off.
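
For anyone hitting the same sign issue, a minimal sketch of the usual convention: the stored label is oriented to the side to move of the position being written out, not to the root. Names here are purely illustrative:
[code]
// Illustrative only: orient the training label to the side to move of the
// stored position, not to the root of the game or search.
enum class Color { White, Black };

// root_score_cp is a score expressed from White's point of view;
// stm is the side to move in the position being written out.
inline int label_for_stm(int root_score_cp, Color stm) {
    return stm == Color::White ? root_score_cp : -root_score_cp;
}
[/code]
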
niel5946 wrote: Wed Jun 16, 2021 11:01 pm I am pretty confident that the problem lies in the training data or scoring of said data. Otherwise, the plain HCE score training would be equally bad.
One more problem: Hyperparameters. I am afraid that I'm also at a loss here :( . Until now, I have used a batch size of 50k, a learning rate of 0.001, and relatively few epochs (< 10). What values do you use? I have implemented learning rate decay, but I don't use it at the moment.
I wouldn't worry about hyperparameters until it seems to be playing properly. I've usually been using around 20 to 25 epochs with the TensorFlow Adam optimizer (I tried some different optimizer settings, but I think that part of my training is just back to the defaults right now). How much you need to worry about overfitting depends on how much training data you have, its quality, and what size net you have. For batch size I've tried around 8k to 20k; I think I'm at 8k now, but I'm not sure it matters.
niel5946
Posts: 174
Joined: Thu Nov 26, 2020 10:06 am
Full name: Niels Abildskov

Re: Tactics in training data

Post by niel5946 »

jonkr wrote: Sat Jun 19, 2021 5:14 am Flipped scores can definitely give nonsense play; unfortunately it sounds like something else isn't working too. I was unclear whether you said it worked with static eval or some other method; if it did, I'd suggest comparing the target values for some of the positions between the search-scored data and a set that works, to see if anything seems off.
Actually, it seems to work better now that I fixed the problem with flipped scores and trained a network for 5 epochs on the lichess set. If I remember correctly, the data was only scored with quiescence and then resolved for immediate tactics (just to see if the latter part was working).
When analyzing the starting position, the PV is:
[pgn]1. c4 Nf6 2. Nc3 e5 3. e3 Bd6 4. d4 c5 5. d5[/pgn]
As you can see, the net is still sloppy (hopefully only because of the rather small amount of training though...) but it seems to learn. I think the discrepancy between static evaluation and search scores was because of correct side-switching with the former.
The small number of moves is because of the terribly slow implementation of my network. Are there any standard optimizations besides incremental updates that can help relieve this problem?

Regarding other issues, I don't know if this is one, but I don't use biases for the input layer and output layer. Could this cause a problem?
Author of Loki, a C++ work in progress.
Code | Releases | Progress Log |
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Tactics in training data

Post by Joost Buijs »

Assuming that your network is trained with biases, you can't just omit them during inference. Especially removing the bias from the input layer will totally mess up the network, because there is a non-linear activation behind it.

If your network consists of float32, it could help to vectorize summing the accumulator and the dot products by using SIMD code. You could also try Clang, which is pretty good at vectorizing these things automatically.
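
As an illustration of the kind of float32 loops a compiler will auto-vectorize (sizes and layout here are illustrative, not anyone's actual code; the reduction in dot() generally needs -O3 -ffast-math, or explicit SIMD, to vectorize):
[code]
// Illustrative float32 loops; weights are stored per feature, hidden-dimension
// contiguous, so the inner loops are simple contiguous passes.
#include <cstddef>

float dot(const float* __restrict a, const float* __restrict b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

// Full accumulator refresh: sum the weight columns of all active (0/1) inputs.
void refresh_accumulator(float* __restrict acc,
                         const float* __restrict weights,   // [feature][hidden]
                         const int* active, int num_active,
                         std::size_t hidden) {
    for (std::size_t j = 0; j < hidden; ++j)
        acc[j] = 0.0f;
    for (int k = 0; k < num_active; ++k) {
        const float* col = weights + static_cast<std::size_t>(active[k]) * hidden;
        for (std::size_t j = 0; j < hidden; ++j)
            acc[j] += col[j];
    }
}
[/code]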

It's not very difficult to quantize the network to int16, which gives you an immediate 2-fold speed increase. Switching from float32 to int16 didn't give me any problems; I only had to make sure that the weights weren't getting too large during training. Ideally you want the weights between -1.0 and +1.0, but in practice they get bigger. It helps to keep the weights low by using weight decay during training, which is essentially the same as L2 regularization. Another option could be clipping the weights between -1.0 and +1.0 during training; this is something I haven't tried.
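
A minimal sketch of what such an int16 quantization could look like; the fixed-point scale of 64 is an arbitrary illustrative choice:
[code]
// Rough sketch of float32 -> int16 quantization for one layer's weights:
// round, clamp, and remember the scale for inference.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<int16_t> quantize(const std::vector<float>& w, float scale = 64.0f) {
    std::vector<int16_t> q(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        float v = std::round(w[i] * scale);
        // Clamp so an outlier can never wrap around; with weight decay or
        // clipping during training this should rarely, if ever, trigger.
        v = std::clamp(v, -32768.0f, 32767.0f);
        q[i] = static_cast<int16_t>(v);
    }
    return q;
}
// At inference, int16 products are accumulated in int32 and the result is
// divided back by the scale (or by scale*scale where two quantized values
// multiply) to return to the original units.
[/code]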

Of course, using incremental updates for the accumulator will help, but it won't make a day-and-night difference. With 16-bit SIMD code and incremental updates disabled, my engine still does ~0.95 Mnps on a single core, which is fast enough to play at a reasonable level and to check whether the network works like it should.
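
And a toy sketch of incremental accumulator updates, with an illustrative 256-wide int16 accumulator; feature indices and weight layout are assumptions, not any particular engine's:
[code]
// Toy sketch: when a move changes only a few 0/1 input features, add/subtract
// just those weight columns instead of refreshing the whole accumulator.
#include <cstdint>

constexpr int HIDDEN = 256;

struct Accumulator {
    int16_t v[HIDDEN];
};

// weights: [feature][HIDDEN], quantized the same way as the accumulator.
inline void add_feature(Accumulator& acc, const int16_t* weights, int feature) {
    const int16_t* col = weights + feature * HIDDEN;
    for (int j = 0; j < HIDDEN; ++j)
        acc.v[j] = static_cast<int16_t>(acc.v[j] + col[j]);
}

inline void remove_feature(Accumulator& acc, const int16_t* weights, int feature) {
    const int16_t* col = weights + feature * HIDDEN;
    for (int j = 0; j < HIDDEN; ++j)
        acc.v[j] = static_cast<int16_t>(acc.v[j] - col[j]);
}

// A quiet move that turns feature f_from off and feature f_to on becomes:
//   remove_feature(acc, weights, f_from);
//   add_feature(acc, weights, f_to);
[/code]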
connor_mcmonigle
Posts: 530
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Tactics in training data

Post by connor_mcmonigle »

niel5946 wrote: Sat Jun 19, 2021 11:56 am
jonkr wrote: Sat Jun 19, 2021 5:14 am Flipped scores can definitely give nonsense play; unfortunately it sounds like something else isn't working too. I was unclear whether you said it worked with static eval or some other method; if it did, I'd suggest comparing the target values for some of the positions between the search-scored data and a set that works, to see if anything seems off.
Actually, it seems to work better now that I fixed the problem with flipped scores and trained a network for 5 epochs on the lichess set. If I remember correctly, the data was only scored with quiescence and then resolved for immediate tactics (just to see if the latter part was working).
When analyzing the starting position, the PV is:
[pgn]1. c4 Nf6 2. Nc3 e5 3. e3 Bd6 4. d4 c5 5. d5[/pgn]
As you can see, the net is still sloppy (hopefully only because of the rather small amount of training though...) but it seems to learn. I think the discrepancy between static evaluation and search scores was because of correct side-switching with the former.
The small number of moves is because of the terribly slow implementation of my network. Are there any standard optimizations besides incremental updates that can help relieve this problem?

Regarding other issues, I don't know if this is one, but I don't use biases for the input layer and output layer. Could this cause a problem?
If your implementation is terribly slow and you're still mostly in the debugging phase, I would highly recommend shrinking the network (at least temporarily). This will enable training to converge faster, reduce overfitting on your small dataset, and greatly improve nps, which should enable you to beat your existing evaluation function more quickly. I would recommend a 2-layer 768->256->1 network (with a ReLU after the first layer and biases for all layers).
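
A minimal sketch of the forward pass for such a 768->256->1 shape (plain float32; layout and names are purely illustrative, not anyone's actual implementation):
[code]
// Minimal float32 forward pass: ReLU after the first layer, biases on both layers.
#include <algorithm>
#include <vector>

constexpr int INPUTS = 768;
constexpr int HIDDEN = 256;

struct SmallNet {
    std::vector<float> w1 = std::vector<float>(INPUTS * HIDDEN); // [input][hidden]
    std::vector<float> b1 = std::vector<float>(HIDDEN);
    std::vector<float> w2 = std::vector<float>(HIDDEN);          // hidden -> output
    float b2 = 0.0f;

    // `active` lists the indices of the inputs that are 1 (piece-on-square
    // features are 0/1, so layer 1 reduces to summing weight columns).
    float forward(const int* active, int num_active) const {
        std::vector<float> h(HIDDEN, 0.0f);
        for (int k = 0; k < num_active; ++k)
            for (int j = 0; j < HIDDEN; ++j)
                h[j] += w1[active[k] * HIDDEN + j];
        float out = b2;
        for (int j = 0; j < HIDDEN; ++j)
            out += w2[j] * std::max(0.0f, h[j] + b1[j]);          // ReLU(hidden + bias)
        return out;
    }
};
[/code]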

Later, quantization + SIMD intrinsics should help performance and enable experimentation with larger networks.

Using a bias on every layer is free performance-wise, so there's no reason not to, really. Given the size of your current network, I would guess the lack of biases on the input and output layers isn't impactful. For the first layer, the network can even learn a bias anyway, as both the black and white king positions are always one-hot.