Dangerous Positions

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Dangerous Positions

Post by D Sceviour »

I am in the middle of creating new tuning sets from SF-NNUE games and wondering what to do with dangerous positions. I have collected about 5000 of these as FEN positions. Dangerous positions for chess engines occur when the static evaluator and the quiescence search both give a very large advantage to one side, but the game ends in a draw. Many a game has been robbed of victory by this type of miscalculation. One might opine that the only way to solve dangerous positions is with depth of search. If so, should they be included in a tuning set? Or do they confuse the tuning values? Perhaps they are very important and should be included. Here are two examples:

[d]2r1k2r/pp2p1b1/3p4/2q5/P4PB1/5Qp1/1P4K1/1N2BR2 b k - 0 1
[d]R7/P1K3k1/6P1/6P1/8/8/r7/8 w - - 0 1
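For concreteness, a minimal sketch of the kind of collector I have in mind, assuming each record has already been flattened to "fen;static_cp;qsearch_cp;result" (the record layout and file names are illustrative only, not Schooner's actual format):

Code: Select all

# Hedged sketch: collect "dangerous" positions, i.e. records where both the
# static evaluation and the quiescence score show a large advantage yet the
# game was drawn. The flattened record layout "fen;static_cp;qsearch_cp;result"
# and the file names are assumptions for illustration only.

DANGER_CP = 600  # both |scores| must exceed this while the game ended in a draw

def is_dangerous(static_cp, qsearch_cp, result, threshold=DANGER_CP):
    return (result == '1/2-1/2'
            and abs(static_cp) > threshold
            and abs(qsearch_cp) > threshold)

with open('labelled_positions.txt') as src, open('dangerous.fen', 'w') as out:
    for line in src:
        fen, static_cp, qsearch_cp, result = line.rstrip('\n').split(';')
        if is_dangerous(int(static_cp), int(qsearch_cp), result):
            out.write(fen + '\n')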
Tony P.
Posts: 216
Joined: Sun Jan 22, 2017 8:30 pm
Location: Russia

Re: Dangerous Positions

Post by Tony P. »

My 2c :P It doesn't seem efficient to teach a static centipawn/WDL estimator to detect double-edged positions like #1. It would make more sense to me to train a QS controller to extend in them and not extend in truly quiet ones*. To this end, it needs the attack tables, the king safety info and possibly some cheap info about 2+ move threats (such as uninterposable checks) as inputs.
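A rough sketch of the kind of cheap inputs I have in mind, written with the python-chess library purely for readability; no engine I know of exposes exactly this, and a real engine would compute these from its own attack tables:

Code: Select all

# Hedged illustration only: cheap inputs a trainable "extend vs. stand pat"
# classifier could consume.
import chess

def quiescence_features(board: chess.Board):
    feats = {}
    feats['in_check'] = int(board.is_check())

    # Enemy pieces attacked by the side to move and left undefended.
    hanging = 0
    for sq, piece in board.piece_map().items():
        if piece.color != board.turn:
            if board.attackers(board.turn, sq) and not board.attackers(not board.turn, sq):
                hanging += 1
    feats['hanging_enemy_pieces'] = hanging

    # Crude king-safety proxy: attacks on the squares around the enemy king.
    enemy_king = board.king(not board.turn)
    ring = chess.SquareSet(chess.BB_KING_ATTACKS[enemy_king]) if enemy_king is not None else []
    feats['attacks_on_king_ring'] = sum(len(board.attackers(board.turn, sq)) for sq in ring)

    # Immediate checking moves, a cheap stand-in for short-range threats.
    feats['checking_moves'] = sum(1 for m in board.legal_moves if board.gives_check(m))
    return feats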

As for endgame fortress positions like #2, a possibility that comes to mind is a rule-based recognizer that would trigger formal verification of fortresses seldom enough not to slash the amortized speed. The static eval doesn't help much here either, as endgame mastery requires looking for concrete longterm plans (both offensive and defensive) instead of relying on optics.

* There are trainable move ordering controllers like the policy networks in the Alpha0 family or Winter's move classifier. However, I'm unaware of engines with trainable static classifiers of position quiescence. That's an area for future research :P
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Dangerous Positions

Post by D Sceviour »

Tony P. wrote: Fri Oct 02, 2020 8:33 am My 2c :P It doesn't seem efficient to teach a static centipawn/WDL estimator to detect double-edged positions like #1. It would make more sense to me to train a QS controller to extend in them and not extend in truly quiet ones*. To this end, it needs the attack tables, the king safety info and possibly some cheap info about 2+ move threats (such as uninterposable checks) as inputs.

As for endgame fortress positions like #2, a possibility that comes to mind is a rule-based recognizer that would trigger formal verification of fortresses seldom enough not to slash the amortized speed. The static eval doesn't help much here either, as endgame mastery requires looking for concrete longterm plans (both offensive and defensive) instead of relying on optics.

* There are trainable move ordering controllers like the policy networks in the Alpha0 family or Winter's move classifier. However, I'm unaware of engines with trainable static classifiers of position quiescence. That's an area for future research :P
Okay. So, we exclude all positions where the quiescent/static scores returned are greater than 600 centipawns but the game ended in a draw. Can the same be said for greater than 500 centipawns? At what minimum value can a position be called quiescent and accurate?
My thinking is to include them all anyway, and to continue to devise new methods of evaluation. The tuner is then forced to produce less reactive results.
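Either way, one crude way to see what each cutoff would cost is to count how many drawn-game positions it would throw away (same flattened record layout as above, purely illustrative):

Code: Select all

# Hedged sketch: for each candidate centipawn cutoff, count how many drawn-game
# positions would be excluded. The record layout "fen;static_cp;qsearch_cp;result"
# is the same illustrative assumption as before.
CUTOFFS = [25, 50, 100, 125, 150, 300, 500, 600]

excluded = {c: 0 for c in CUTOFFS}
total_draws = 0
with open('labelled_positions.txt') as src:
    for line in src:
        fen, static_cp, _qsearch_cp, result = line.rstrip('\n').split(';')
        if result != '1/2-1/2':
            continue
        total_draws += 1
        for c in CUTOFFS:
            if abs(int(static_cp)) > c:
                excluded[c] += 1

for c in CUTOFFS:
    print(f'cutoff {c:>4} cp: would exclude {excluded[c]} of {total_draws} drawn positions')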
Tony P.
Posts: 216
Joined: Sun Jan 22, 2017 8:30 pm
Location: Russia

Re: Dangerous Positions

Post by Tony P. »

Sorry, I'm not qualified to advise on the centipawn cutoff. On this forum, I'm filling the niche vacated by Lyudmil Tsvetkov: I have very controversial views on how a chess engine should operate in general, not backed by code yet :mrgreen:

In particular, I'm not a fan of incorporating tactical insight into static eval. I think it only makes sense to call the full static eval (as opposed to a lazy approximation) in positions that are so quiet that an extension is expected not to give enough extra insight. How much is enough depends on the alpha and beta, the estimated probability of the node's appearance on the board (in the spirit of the Giraffe paper), but also on the clock situation or the time left until the user-specified deadline in analysis (I'm clearly not a fan of iterative deepening with more than a few iterations per root :mrgreen: ), and even on the expected computational complexity of an extension, e.g. how many sliders are left, which are harder to process than leapers.

As for what positions to train the eval on, I'd rather use a 'natural' dataset mimicking engine game scenarios in terms of the probability distribution of positional and tactical nuances, so I wouldn't deliberately inject positions that are expected to occur seldom in practice, as I think such edge cases would be handled better by a more advanced and less heuristic search routine.

From the Bayesian viewpoint, the static value of a node is a prior expected value which should later be marginalized over (refined by) the insight from the node's children to yield a more precise posterior expected value if the game continuation makes that node more likely to appear on the board.
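In symbols, one back-of-the-envelope way to write that down (my own formalization, nothing implemented):

Code: Select all

v(s) = (1 - w(s)) * v_static(s)  +  w(s) * max over moves a of [ -v(child(s, a)) ]

where w(s) in [0, 1] grows with the amount of search spent below s, so a node that is unlikely to be reached keeps a value close to its cheap prior.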
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Dangerous Positions

Post by D Sceviour »

Tony P. wrote: Fri Oct 02, 2020 5:38 pm As for what positions to train the eval on, I'd rather use a 'natural' dataset mimicking engine game scenarios in terms of the probability distribution of positional and tactical nuances, so I wouldn't deliberately inject positions that are expected to occur seldom in practice, as I think such edge cases would be handled better by a more advanced and less heuristic search routine.
This is also my intuitive feeling on the problem. Currently, I have about 6 million unique EPD positions, but the file needs to be reduced in size on some basis. Other programmers claim to use a time-consuming quiescent search for tuning. It may be assumed they are trying to produce better results. So far I have pruned the positions to include only those where the quiescent function could not improve on the static evaluation; that is, the quiescent score is equal to the static evaluation. This should remove the need for a quiescent search during tuning, so that only the faster static evaluation is used. However, was this correct, or is the "quiet" method already corrupting the tuning procedure?
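In outline the filter looks like this, where static_eval() and qsearch() stand for the engine's own routines (hypothetical bindings, not Schooner's real interface):

Code: Select all

# Hedged outline of the "quiet" filter: keep only positions where the
# quiescence search cannot improve on the static evaluation. static_eval()
# and qsearch() are hypothetical bindings to the engine's own routines.

def filter_quiet(in_path, out_path, static_eval, qsearch):
    with open(in_path) as src, open(out_path, 'w') as out:
        for line in src:
            fields = line.split()
            if len(fields) < 4:
                continue
            epd = ' '.join(fields[:4])  # board, side to move, castling, en passant
            if static_eval(epd) == qsearch(epd):
                out.write(line)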

On the other hand, one could produce a truly quiescent file where the static scores are all less than 125 centipawns for drawn games. This would greatly reduce the size of the epd file.

The centipawn description is based on the approximate values of material:

pawn = 100
knight = 300
bishop = 325
rook = 500
queen = 1000
Tony P.
Posts: 216
Joined: Sun Jan 22, 2017 8:30 pm
Location: Russia

Re: Dangerous Positions

Post by Tony P. »

I wouldn't try to make a static eval training set deliberately quiet either, as that would be an unnatural move in the other direction :D

A training set for a quiescence (/ eval uncertainty) model would need dangerous positions because the model's very purpose would be to discriminate between them and quiet ones.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Dangerous Positions

Post by D Sceviour »

Tony P. wrote: Fri Oct 02, 2020 6:37 pm I wouldn't try to make a static eval training set deliberately quiet either, as that would be an unnatural move in the other direction :D

A training set for a quiescence (/ eval uncertainty) model would need dangerous positions because the model's very purpose would be to discriminate between them and quiet ones.
Rare dangerous positions where the result does not match the evaluation can reduce the quality of the training data. The idea is to create a quality training set, and maybe even create guidelines on how to make training sets. So far:

(1) Use the best engine available for the games, currently Stockfish-NNUE. I also included a few Schooner games; its evaluator is very different.
(2) Collect a minimum of 60,000 games, preferably at a longer time control. About 1/3 of the games collected were taken from Mehmet Karaman's game postings for download.
(3) Delete all duplicate positions. This was first done using the "pgn-extract" -D switch, and then verified a second time by comparing all positions with my own hash table generation (see the sketch after this list).
(4) Filter the resulting EPD file by comparing the quiescent score with the static evaluation. If the scores are equal, then quiescence cannot improve on the position and the position can be considered quiet. Only the static evaluator need be used for further testing, which saves a lot of time by not waiting for quiescence. [This was actually quite difficult to prepare in Schooner because my makemove() has threaded positional migration, which takes a lot of time to start up at the root. That is, I could not call quiescence directly but had to go through an iteration startup routine.]
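For illustration, a minimal sketch of the duplicate-removal pass in step (3), using a plain Python set of position keys in place of the real hash-table comparison:

Code: Select all

# Hedged sketch of step (3): drop duplicate positions. A set of normalized
# position keys (the first four FEN fields) stands in for the hash-table pass
# described above; move counters are ignored when comparing positions.

def dedupe_epd(in_path, out_path):
    seen = set()
    kept = dropped = 0
    with open(in_path) as src, open(out_path, 'w') as out:
        for line in src:
            key = ' '.join(line.split()[:4])
            if not key or key in seen:
                dropped += 1
                continue
            seen.add(key)
            out.write(line)
            kept += 1
    return kept, dropped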

Another extreme measure may be to include only games that have a definite conclusion - win or loss. All draws can be excluded; thus the tuning parameters are augmented. I already experimented with this without success. I do want to create several different types of tuning sets and then brute-force test them with large tournaments.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Dangerous Positions

Post by D Sceviour »

Tuning set tests are underway with some interesting results. My intuitive idea to create a truly quiescent file, where the static scores are all less than 125 centipawns for drawn games, turned out to be correct! These results are from a programmer's test that uses Elo scores to estimate how much damage a bad training set can do. It measures adjustments to the PST_eg only.

Code: Select all

Rank Name                          ELO     +/-   Games   Score   Draws
   1 Schooner2.23-sse               81      17     808     61%     52%
   2 Schooner2.25-sse               72      16     810     60%     53%
   3 Schooner1_125_end              41      16     808     56%     53%
   4 Schooner1_150_end               6      16     811     51%     53%
   5 Schooner1_100_end              -6      16     811     49%     54%
   6 Schooner1_300_end             -11      16     808     48%     53%
   7 Schooner1_50_end              -14      16     808     48%     55%
   8 Schooner1M_end                -17      17     810     48%     52%
   9 Schooner1_25_end             -158      19     808     29%     41%
This means that dangerous positions that end in draws should be excluded from the training set.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Dangerous Positions

Post by jdart »

Tuning datasets are inherently noisy. That is, there are going to be positions where objectively one side is winning but that side doesn't win the game, frequently because of a blunder later on. Or the position really should be drawn but the game isn't a draw. You could reduce the noise in a number of ways: play longer time-control games, play multiple games and average the result, etc. But the simplest thing to do is just accept that there is noise in the result labels; if you use a large enough data set the noise will not matter, because most games will have the expected outcome.

I exclude positions with very large static scores but otherwise I don't try to filter. It is probably better to have the dataset include a wide variety of positions.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Dangerous Positions

Post by D Sceviour »

jdart wrote: Mon Oct 05, 2020 9:32 pm I exclude positions with very large static scores but otherwise I don't try to filter. It is probably better to have the dataset include a wide variety of positions.
In the last test, all positions with abs(score) > 125 centipawns that ended in a draw were filtered out. This was the most successful setting, compared with cutoffs ranging from 25 to 300 centipawns and with no filter at all. Wins and losses were ignored and not subject to a scoring filter, other than removing duplicate positions.

Before this, the first filter was to remove all positions where the static evaluation did not equal the quiescent score. That is, quiescence was not able to improve on the position.