Training data

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
maksimKorzh
Posts: 771
Joined: Sat Sep 08, 2018 5:37 pm
Location: Ukraine
Full name: Maksim Korzh

Re: Training data

Post by maksimKorzh »

Desperado, here are my MSE test results for two sets of material + PST params:

My own PST: 0.13688435023553586
rofChade PST (3000+ engine): 0.13744401297556394

I calculated the MSE on 30,000 quiet positions from my gm2600.pgn.
Now despite the fact that rofChade's error is slightly bigger, its PSTs score around 70 Elo stronger.
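For reference, the error measure used here is the usual Texel-style MSE between the game result and a sigmoid of the static evaluation. A minimal Python sketch; the scaling constant k and the sample data are made up for illustration:

Code: Select all

```python
def win_prob(score_cp, k=1.13):
    # Map a centipawn score to an expected result in [0, 1] with the
    # usual logistic curve; k is a hypothetical scaling constant that
    # would normally be fitted to the data set first.
    return 1.0 / (1.0 + 10.0 ** (-k * score_cp / 400.0))

def mse(positions):
    # positions: (static_eval_cp, game_result) pairs, with the result
    # encoded from White's view as 1.0 win, 0.5 draw, 0.0 loss.
    return sum((r - win_prob(s)) ** 2 for s, r in positions) / len(positions)

# Tiny illustrative sample, not real data:
sample = [(120, 1.0), (-80, 0.0), (10, 0.5)]
print(mse(sample))
```

Two parameter sets can then be compared simply by running `mse` over the same position file with each set's evaluation.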
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Training data

Post by Desperado »

maksimKorzh wrote: Wed Jan 13, 2021 12:17 am Desperado, here are my MSE test results for two sets of material + PST params:

My own PST: 0.13688435023553586
rofChade PST (3000+ engine): 0.13744401297556394

I calculated the MSE on 30,000 quiet positions from my gm2600.pgn.
Now despite the fact that rofChade's error is slightly bigger, its PSTs score around 70 Elo stronger.
Hello Maksim,

Is there a question involved? Sorry, I don't understand what you want to tell me.

In general it is not unusual that a smaller error does not produce better gameplay.
There is nothing wrong with that. There can be many reasons for such an observation.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Training data

Post by Desperado »

Here is what I do with success now:

1. I generated 111,661,993 positions from the CCRL database, with both players rated over 2800 Elo.
2. I shuffled the file and picked 4M at random.

That is what I did before too. Now the improvement...

3. I did a 3-ply search for each position.
4. I played out the PV to move 3 and created the resulting EPD entry with result and score (I just kept the result ?!)

Now validating...

5. I did a training session for material on the new dataset. The result was stable now. No diverging numbers anymore.

Perfect! This process allows me to use noisy input and create useful training data out of it,
especially because it statistically reflects the distribution of position types in games.
The new feature that the data correlates well with the engine is of course a special bonus.

There is also something to play with, because the value of the position is also output and can be used in the error
calculation in some way.
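The selection steps above (shuffle, sample, shallow search, keep the result) can be sketched like this. Note this is only a sketch: `shallow_search` is a hypothetical stand-in for the real 3-ply engine search plus PV playout, and the EPD opcodes (`c9`, `c10`) are made up for illustration:

Code: Select all

```python
import random

def shallow_search(fen):
    # Stand-in for a real 3-ply engine search (hypothetical):
    # returns (score_cp, pv_moves). A real implementation would
    # call the engine here and play out the PV afterwards.
    return 0, []

def build_training_set(positions, sample_size, seed=42):
    # positions: (fen, game_result) pairs from the raw database.
    # Shuffle, sample, search each position, and keep the result
    # label (and optionally the score) for the EPD entry.
    rng = random.Random(seed)
    sample = rng.sample(positions, min(sample_size, len(positions)))
    entries = []
    for fen, result in sample:
        score, pv = shallow_search(fen)
        # In the real process the PV is played out first and the
        # resulting position is stored instead of `fen` itself.
        entries.append(f'{fen} c9 "{result}"; c10 "{score}";')
    return entries

demo = [("8/8/8/8/8/8/8/8 w - -", "1-0"), ("8/8/8/8/8/8/8/8 b - -", "1/2-1/2")]
print(build_training_set(demo, 2))
```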
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Training data

Post by Desperado »

Desperado wrote: Wed Jan 13, 2021 10:26 am Here is what I do with success now:

1. I generated 111,661,993 positions from the CCRL database, with both players rated over 2800 Elo.
2. I shuffled the file and picked 4M at random.

That is what I did before too. Now the improvement...

3. I did a 3-ply search for each position.
4. I played out the PV to move 3 and created the resulting EPD entry with result and score (I just kept the result ?!)

Now validating...

5. I did a training session for material on the new dataset. The result was stable now. No diverging numbers anymore.

Perfect! This process allows me to use noisy input and create useful training data out of it,
especially because it statistically reflects the distribution of position types in games.
The new feature that the data correlates well with the engine is of course a special bonus.

There is also something to play with, because the value of the position is also output and can be used in the error
calculation in some way.
Something went wrong ... what I wrote is not true. Sorry for the noise!
User avatar
maksimKorzh
Posts: 771
Joined: Sat Sep 08, 2018 5:37 pm
Location: Ukraine
Full name: Maksim Korzh

Re: Training data

Post by maksimKorzh »

Desperado wrote: Wed Jan 13, 2021 8:54 am
maksimKorzh wrote: Wed Jan 13, 2021 12:17 am Desperado, here are my MSE test results for two sets of material + PST params:

My own PST: 0.13688435023553586
rofChade PST (3000+ engine): 0.13744401297556394

I calculated the MSE on 30,000 quiet positions from my gm2600.pgn.
Now despite the fact that rofChade's error is slightly bigger, its PSTs score around 70 Elo stronger.
Hello Maksim,

Is there a question involved? Sorry, I don't understand what you want to tell me.

In general it is not unusual that a smaller error does not produce better gameplay.
There is nothing wrong with that. There can be many reasons for such an observation.
Just shared my experiment.
No question here)
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Training data

Post by Desperado »

I would like to reactivate this old thread.
The reason is a current thread in the context of SGD, which I would like to continue here in order to work on specific technical issues.

Apart from the optimization algorithm and its implementation, it is still essential which data source is used.
Today I want to start with determining the quality of data sources for the selection of training data.

Useful sources

Rebel Data collection
Lichess databases
CCRL Blitz
CCRL Rapid

NNs

Measuring data quality

The well-known quiet-labeled.epd is a collection of training positions that several people have used to generate improved parameters for their engines. Therefore, I think it is very interesting to study the characteristics of the data and the process behind it. Fortunately, the process could be clarified here a few days ago. Of course, this was already known by some people, but I think it is useful to point it out again.

Well, that doesn't stop me from examining the data more closely anyway. My first idea for comparing raw data looks like this.

1. Choose a reference evaluation

I decided to use a trained NN as the reference static evaluation function. At a later stage, I would like to look at the general uses of NNs in the optimization process. Already in the selection of training data I see various possible applications, but they also apply to other classical evaluation functions.

Let's go

Code: Select all


       3400_8.epd / CNT =   256400 / K = 0.31087369  / MSE = 0.0771616689121431 / DR = 0.647 / APHS = 18.23
       3300_8.epd / CNT =   520624 / K = 0.366286730 / MSE = 0.0898464623713863 / DR = 0.582 / APHS = 18.40
       3200_8.epd / CNT =  1149832 / K = 0.416795321 / MSE = 0.1022347681140613 / DR = 0.515 / APHS = 18.52
       3100_8.epd / CNT =  2102367 / K = 0.445591220 / MSE = 0.1104224481755210 / DR = 0.469 / APHS = 18.54
       3000_8.epd / CNT =  3229357 / K = 0.450422900 / MSE = 0.1154083821281762 / DR = 0.445 / APHS = 18.52
       2800_8.epd / CNT =  7872226 / K = 0.497679208 / MSE = 0.1025738023441199 / DR = 0.434 / APHS = 16.40
       2500_8.epd / CNT = 12746017 / K = 0.488025865 / MSE = 0.1109184451157332 / DR = 0.388 / APHS = 16.35
        64K_5.epd / CNT =   750000 / K = 0.376087857 / MSE = 0.1307579005698916 / DR = 0.232 / APHS = 15.98
  
       corr_8.epd / CNT = 16082330 / K = 0.462039370 / MSE = 0.1223174055758201 / DR = 0.404 / APHS = 20.21
       human8.epd / CNT = 30205714 / K = 0.307684900 / MSE = 0.2005787651347260 / DR = 0.076 / APHS = 19.46
       
       quiet-labeled.epd / CNT =   750000 / K = 0.643410690 / MSE = 0.0465274694383763 / DR = 0.275 / APHS =  9.76
From the raw data (PGN files) I picked 8 random positions per game. CNT shows how many positions finally ended up in the corresponding EPD collection. Then I used the NN (nn-62ef826d1a6d.nnue) to measure K (the scaling factor used for the measurement), the MSE, the draw ratio and the average phase (MG = 24 / EG = 0). I only used 750K positions out of each collection, or fewer if the collection did not contain that many.

Most collections above are limited to the mentioned Elo level. The 64K_5.epd was used as a learning reference from self-played games when doing experiments with my tuner. Its signal ratio is a little different because only 5 positions per game were used. The corr.epd and the human.epd include the human factor, but I think that the correspondence chess data is strongly influenced by computers. Finally, I measured the reference data, the quiet-labeled.epd.
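For clarity, the four measured quantities (K, MSE, DR, APHS) can be sketched as follows. This is a simplified sketch: `win_prob` uses the sigmoid(score, 400) shape mentioned below, and `fit_k` is only a coarse grid scan rather than a proper optimizer; the demo data is made up:

Code: Select all

```python
def win_prob(score_cp, k):
    # sigmoid(score, 400), with K as the free scaling factor
    return 1.0 / (1.0 + 10.0 ** (-k * score_cp / 400.0))

def mse(data, k):
    # data: (eval_cp, result, phase) triples
    return sum((r - win_prob(s, k)) ** 2 for s, r, _ in data) / len(data)

def fit_k(data, lo=0.1, hi=2.0, steps=50):
    # Coarse 1-D scan for the K that minimises the MSE; a real tuner
    # would refine this further (e.g. with a golden-section search).
    return min((lo + (hi - lo) * i / steps for i in range(steps + 1)),
               key=lambda k: mse(data, k))

def stats(data):
    # Returns (K, MSE, draw ratio, average phase); phase 24 = MG, 0 = EG.
    k = fit_k(data)
    dr = sum(1 for _, r, _ in data if r == 0.5) / len(data)
    aphs = sum(p for _, _, p in data) / len(data)
    return k, mse(data, k), dr, aphs

demo = [(100, 1.0, 20), (0, 0.5, 12), (-150, 0.0, 8), (30, 0.5, 18)]
print(stats(demo))
```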

1. observation
Now, the first impression gives some interesting, but not really surprising, information. The stronger the chess entities are, the lower the measured error is. Going down from 3400 to 3000 we can see the error growing. Something happens below 3000. At the same time the draw ratio drops.
The average phase seems to be stable.

1. interpretation
The smaller the error is, the stronger the relation between the training signal (result) and the evaluation is. Based on this relation,
we can also argue that the training signal is less random: an element picked from the raw data with its given result makes a usable target signal.

2. observation
The draw ratio drops at the same time: the stronger the entities are, the higher the draw ratio is.

2. interpretation
The effect on the selection is unclear without looking at the reference data at this point.

3. observation
The collection 2800_8.epd is the first collection that produced a lower error than the higher-ranked EPD collections.

3. interpretation
If you look at the average phase, it also dropped significantly in the endgame direction. That makes sense too.
With a lot of material (we are talking about the material phase) on the board, there is much more noise in the result prediction.
Of course, predicting the result of an equal position in an early game phase is much more uncertain. Picking random positions from early stages will raise the uncertainty. What we measure is the delta between the winning probability sigmoid(score, 400) and the result. So it is no surprise that the error shrinks due to a significantly lower average game phase. But that value can be controlled when selecting data.

4. observation
The human game collections vary somewhat in their properties.

4. interpretation
Humans are hopelessly overwhelmed :-)

It is too late for me today, but maybe there will be an interesting discussion on this topic. The next point is to compare these properties with the properties of quiet-labeled.epd.

There will be other questions following like:

Training Signals (Signal weights, Producing instead of reading results, ...)
Utilization of NNs
Technical selection process (e.g.: SEE vs QS and others)
Quantity of training elements

Further technical discussions, observations and interpretations.

Example: Feature Delta for Knight PST (MG). Rank symmetric / File asymmetric

A1...H1
A2...H2
A8...H8

A number like 0.3106 means that every third element in the data set has a delta for that square.
So if a white knight is on c3 and a black knight is not on RankFlipped(c3), the value has an impact on the total score.
You can easily see that the weights of some squares are relevant for the MSE while others are nearly useless.
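The counting behind the tables below can be sketched roughly like this, assuming (as the text suggests) that the delta for a square is nonzero whenever white occupancy and rank-flipped black occupancy differ; the position encoding here is made up for illustration:

Code: Select all

```python
def rank_flip(sq):
    # Mirror a 0..63 square index across the horizontal axis
    # (a1 <-> a8, c3 <-> c6, ...); the file stays, the rank flips.
    return sq ^ 56

def delta_frequency(positions):
    # positions: list of (white_knight_squares, black_knight_squares)
    # as sets of 0..63 indices. Returns, per square, the fraction of
    # positions in which the PST weight for that square has a nonzero
    # feature delta, i.e. white and rank-flipped black occupancy differ.
    counts = [0] * 64
    for wn, bn in positions:
        for sq in range(64):
            w = 1 if sq in wn else 0
            b = 1 if rank_flip(sq) in bn else 0
            if w != b:
                counts[sq] += 1
    n = len(positions)
    return [c / n for c in counts]

demo = [({18}, set()), ({18}, {42})]  # 18 = c3, 42 = c6 (its rank flip)
print(delta_frequency(demo)[18])
```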

Code: Select all


quiet-labeled.epd
0.0011 0.0691 0.0050 0.0076 0.0089 0.0071 0.0554 0.0010
0.0035 0.0037 0.0129 0.0548 0.0407 0.0106 0.0041 0.0049
0.0197 0.0164 0.1056 0.0208 0.0192 0.1244 0.0169 0.0180
0.0129 0.0071 0.0228 0.0344 0.0293 0.0220 0.0067 0.0116
0.0068 0.0224 0.0177 0.0212 0.0312 0.0134 0.0193 0.0069
0.0038 0.0077 0.0110 0.0128 0.0092 0.0069 0.0056 0.0035
0.0033 0.0052 0.0063 0.0057 0.0053 0.0056 0.0033 0.0021
0.0020 0.0015 0.0024 0.0028 0.0026 0.0021 0.0010 0.0011

3200_8.epd
0.0007 0.1449 0.0047 0.0064 0.0143 0.0140 0.0840 0.0004
0.0030 0.0040 0.0145 0.1492 0.0577 0.0086 0.0057 0.0079
0.0162 0.0501 0.2785 0.0223 0.0260 0.3106 0.0291 0.0068
0.0279 0.0049 0.0365 0.0896 0.0438 0.0214 0.0063 0.0217
0.0061 0.0246 0.0174 0.0224 0.0473 0.0127 0.0184 0.0047
0.0014 0.0051 0.0062 0.0079 0.0036 0.0024 0.0022 0.0015
0.0012 0.0020 0.0024 0.0017 0.0013 0.0011 0.0006 0.0005
0.0004 0.0003 0.0005 0.0005 0.0005 0.0003 0.0003 0.0003

2800_8.epd
0.0008 0.1155 0.0053 0.0073 0.0138 0.0122 0.0717 0.0005
0.0034 0.0043 0.0146 0.1208 0.0539 0.0092 0.0060 0.0070
0.0159 0.0425 0.2351 0.0230 0.0274 0.2592 0.0285 0.0074
0.0238 0.0061 0.0367 0.0796 0.0421 0.0224 0.0073 0.0204
0.0066 0.0248 0.0181 0.0239 0.0416 0.0142 0.0191 0.0052
0.0020 0.0061 0.0078 0.0096 0.0050 0.0035 0.0031 0.0019
0.0015 0.0023 0.0030 0.0024 0.0020 0.0016 0.0009 0.0008
0.0006 0.0004 0.0008 0.0008 0.0007 0.0005 0.0003 0.0003

human_8.epd
0.0004 0.2008 0.0028 0.0047 0.0100 0.0113 0.1531 0.0004
0.0014 0.0017 0.0101 0.1427 0.0700 0.0080 0.0036 0.0057
0.0123 0.0333 0.3149 0.0109 0.0146 0.3444 0.0317 0.0090
0.0144 0.0022 0.0271 0.0637 0.0398 0.0173 0.0041 0.0145
0.0029 0.0173 0.0107 0.0210 0.0403 0.0092 0.0219 0.0028
0.0008 0.0028 0.0044 0.0069 0.0035 0.0021 0.0016 0.0008
0.0009 0.0015 0.0022 0.0012 0.0012 0.0014 0.0005 0.0005
0.0009 0.0002 0.0004 0.0004 0.0003 0.0002 0.0001 0.0004
Enough for today, see you soon.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Training data

Post by Desperado »

Code: Select all

       3400_8.epd / CNT =   256400 / K = 0.31087369  / MSE = 0.0771616689121431 / DR = 0.647 / APHS = 18.23
       3300_8.epd / CNT =   520624 / K = 0.366286730 / MSE = 0.0898464623713863 / DR = 0.582 / APHS = 18.40
       3200_8.epd / CNT =  1149832 / K = 0.416795321 / MSE = 0.1022347681140613 / DR = 0.515 / APHS = 18.52
       3100_8.epd / CNT =  2102367 / K = 0.445591220 / MSE = 0.1104224481755210 / DR = 0.469 / APHS = 18.54
       3000_8.epd / CNT =  3229357 / K = 0.450422900 / MSE = 0.1154083821281762 / DR = 0.445 / APHS = 18.52
       2800_8.epd / CNT =  7872226 / K = 0.497679208 / MSE = 0.1025738023441199 / DR = 0.434 / APHS = 16.40
       2500_8.epd / CNT = 12746017 / K = 0.488025865 / MSE = 0.1109184451157332 / DR = 0.388 / APHS = 16.35
        64K_5.epd / CNT =   750000 / K = 0.376087857 / MSE = 0.1307579005698916 / DR = 0.232 / APHS = 15.98
  
       corr_8.epd / CNT = 16082330 / K = 0.462039370 / MSE = 0.1223174055758201 / DR = 0.404 / APHS = 20.21
       human8.epd / CNT = 30205714 / K = 0.307684900 / MSE = 0.2005787651347260 / DR = 0.076 / APHS = 19.46
       
       quiet-labeled.epd / CNT =   750000 / K = 0.643410690 / MSE = 0.0465274694383763 / DR = 0.275 / APHS =  9.76
As you can see, the quiet-labeled.epd is very different in the average phase value and has a very low draw ratio.

As already mentioned, a lower phase score will shrink the MSE,
in general because there is less uncertainty in the result prediction.
The draw ratio is very low compared with the collections of the strong entities.

Code: Select all

3200_8_DP.epd      / CNT =   415968  / K = 0.683935950 / MSE = 0.0869939938770631 / DR = 0.275 / APHS = 11.06
3000_8_DP.epd      / CNT =  1321544 / K = 0.645552504 / MSE = 0.0876008663539680 / DR = 0.275 / APHS = 10.88
quiet-labeled.epd / CNT =    750000 / K = 0.643410690 / MSE = 0.0465274694383763 / DR = 0.275 / APHS =  9.76
I managed to produce properties that are close to the properties of quiet-labeled.epd. The MSE drops as expected,
so the prediction gets better. There is still a big gap in the MSE compared with the reference data.
It is also interesting to compare the modified collections directly to each other.

How to continue?

As I pointed out before, the "delta distribution" looks very different.
Another factor might be that the process for the quiet-labeled.epd did playouts from arbitrary positions to produce the game results,
not only from opening positions. That might lead to an "artificial" distribution compared with standard games.

1.
The next concrete step will be to replace the native evaluation with the reference evaluation (NNUE) in the selection process.
I just want to figure out what happens when the selection entity is stronger.

2.
Then follows the analysis of how the errors are distributed across the game phases.
Especially the signals (game results) that correspond to early game phases seem to produce the big gap. I want to have a closer look at it.

Interim summary

For now, the quiet-labeled.epd gave the best input for the tuner and the resulting values with the biggest Elo gain afterwards.
Second best was the low-level self-played games result over the complete process (producing training data and using the tuner results afterwards).
User avatar
j.t.
Posts: 239
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Training data

Post by j.t. »

I think the reason why the error of the quiet-labeled set gets comparatively low is that it contains many positions that are very clear, like being a piece up.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Training data

Post by Desperado »

j.t. wrote: Sun Feb 20, 2022 1:32 pm I think the reason why the error of the quiet-labeled set gets comparatively low is that it contains many positions that are very clear, like being a piece up.
We will figure out what the difference with "clear" positions is. There are many positions included with a material advantage but the side losing.
If you look at some of these positions, this is because of a king attack or other threats which are not caught by a simple qs() and a static evaluation.

Facts so far are that the average phase in the data set significantly affects the MSE.
I put in another collection that shows that a higher draw rate (but lower average phase) also affects the MSE (not with the same impact, but clearly enough to see). The distance (error) between the evaluation score and a draw score (0.5) is on average smaller than to 1.0/0.0.
Above you can compare the original stats, i.e. unmodified draw rates and average phases.
It is still a random selection process, but while sampling I control the draw rate and the phase selection.
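One plausible way to do such controlled random selection is rejection sampling with a draw quota and a phase cap. The post does not state the exact mechanism, so this Python sketch is only an illustration under those assumptions:

Code: Select all

```python
import random

def sample_with_targets(pool, n, target_dr, max_phase, seed=1):
    # pool: (eval_cp, result, phase) triples; shuffled in place.
    # Pick n positions at random, but accept draws only while the
    # draw quota (target_dr * n) allows, and skip positions whose
    # game phase exceeds max_phase.
    rng = random.Random(seed)
    rng.shuffle(pool)
    picked, draws = [], 0
    for eval_cp, result, phase in pool:
        if len(picked) == n:
            break
        if phase > max_phase:
            continue
        if result == 0.5 and draws >= target_dr * n:
            continue
        if result == 0.5:
            draws += 1
        picked.append((eval_cp, result, phase))
    return picked

# Synthetic pool: 50 draws and 50 decisive results at mixed phases.
pool = [(0, 0.5, i % 24) for i in range(50)] + [(100, 1.0, i % 24) for i in range(50)]
picked = sample_with_targets(pool, 20, 0.25, 12)
print(len(picked))
```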

Code: Select all

3200_8_DP.epd      / CNT =  415968 / K = 0.683935950 / MSE = 0.0869939938770631 / DR = 0.275 / APHS = 11.06
3000_8_DP.epd      / CNT = 1321544 / K = 0.645552504 / MSE = 0.0876008663539680 / DR = 0.275 / APHS = 10.88
3000_8_P.epd       / CNT = 1856599 / K = 0.436934197 / MSE = 0.0744345800020777 / DR = 0.476 / APHS = 10.44
quiet-labeled.epd  / CNT =  750000 / K = 0.643410690 / MSE = 0.0465274694383763 / DR = 0.275 / APHS =  9.76
As pointed out, I looked deeper into the phases and their corresponding errors (0 = EG / 24 = MG).
My intuition guides me well through the data at the moment. The observation holds for other collections too.
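Per-phase error tables like the ones below can be produced by bucketing the squared errors by game phase. A minimal sketch (`win_prob` as the sigmoid(score, 400) mapping from earlier, with K assumed to be already fitted; the demo data is made up):

Code: Select all

```python
def win_prob(score_cp, k):
    # sigmoid(score, 400) with scaling factor K
    return 1.0 / (1.0 + 10.0 ** (-k * score_cp / 400.0))

def error_by_phase(data, k):
    # data: (eval_cp, result, phase) triples with phase 0 (EG) .. 24 (MG).
    # Returns one mean squared error per phase bucket, matching the
    # 25-value tables; empty buckets give None.
    sums, counts = [0.0] * 25, [0] * 25
    for s, r, p in data:
        sums[p] += (r - win_prob(s, k)) ** 2
        counts[p] += 1
    return [sums[p] / counts[p] if counts[p] else None for p in range(25)]

demo = [(0, 0.5, 0), (200, 1.0, 24)]
print(error_by_phase(demo, 0.64))
```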

Code: Select all

quiet-labeled.epd
0.0994 0.0925 0.0492 0.0493 0.0486 0.0470 0.0559 0.0457
0.0604 0.0516 0.0707 0.0559 0.0829 0.0625 0.0912 0.0679
0.0890 0.0697 0.0941 0.0593 0.0993 0.0557 0.1220 0.0300
0.1247

3000_8_DP.epd
0.0936 0.0774 0.0753 0.0728 0.0744 0.0666 0.0947 0.0935
0.1053 0.1065 0.1245 0.1256 0.1399 0.1378 0.1499 0.1561
0.1549 0.1937 0.1734 0.2138 0.1884 0.2245 0.1934 0.2467
0.1957
Besides properties like Elo strength / phase distribution / draw ratio, the biggest difference can be seen in the error corresponding to the MG phases. This is a challenge because I do not see a special process step in the generation of the quiet-labeled.epd that would produce MG values with such a low average error. I am going to think about it.
User avatar
j.t.
Posts: 239
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Training data

Post by j.t. »

An idea why the quiet-labeled set has a lower average phase: because the positions are sampled from all the positions an engine evaluation function encountered during play, many positions will be from the end of a quiescence search, which has likely made a number of captures and thus reduced the phase value.

The lower number of positions in the opening/midgame phases could maybe partly explain why the error for these phases is lower: the opening parameters have fewer positions that they need to predict, and can thus specialize better on them.