Stockfish has included WDL stats in engine output

Pio · Post by **Pio** » Mon Jul 06, 2020 7:12 pm

Alayan wrote: ↑Mon Jul 06, 2020 6:31 pm
Pio wrote: ↑Sun Jul 05, 2020 10:13 pm That was not my point however. My point is that it is simple to convert an existing alpha beta engine to work with win probabilities (or should I say win or draw probabilities to satisfy you). Using my way of probabilities in alpha beta has many obvious advantages as I mentioned in my previous post. An additional gain is that you can compress the storage space for the transposition table since a size of 10 bits should be more than enough and 8 bits might be sufficient. With 8 bits you could let 1 bit represent if the score is a special score or not and with the rest of the 7 bits tell either the “win or draw”-probability with a granularity of less than 1 % or the distance to mate.
The main point of evaluation is to produce position ordering. If position A has a better eval than position B, prefer position A.

The secondary point of evaluation is to guide search. Some feature might not be worth the weight it is assigned and would be incorrect in a leaf node that's backed up to the root, but it will push search to consider more the positions with this feature, and descendant leaf nodes where the feature isn't there anymore will tell if there was actually something or not there.

Internal winpct gives no advantage over standard internal units in either case. Increased granularity close to 0.00 isn't an advantage if your evaluation is too inaccurate anyway to take meaningful advantage of it. Meanwhile, you just made serial computation of the position's evaluation a total headache, as you can't just add winpct the way one can add cp. If you use conversion functions to go back and forth from an additive model (or something equivalent), then you're just wasting a lot of energy on useless computations. If you don't go with a linear model, there is nothing "simple" about converting an existing engine. It will be very complex, which also means hard to tune and improve, and you'll lose elo because even if the model allows good enough values to stay on par with the linear one, you won't find them.

Besides, centipawn output makes no false promise. It gives an estimated advantage, but if someone with a clue makes the mental effort to think of it in winning probabilities, contextual information will be used - type of position, engine depth, eval trends...

Meanwhile, with raw WDL output, many people make the gross mistake of forgetting context. The actual WDL values will be off for almost all situations. They are tuned for training conditions, but even so, they are only a guess. WDL may look more serious, but if the engine is missing something important, its number will be way off. And for any games played in different conditions, the WDL predictions are not applicable. If Stockfish proclaims +2 in a human blitz game position that white goes on to lose, the interpretation "White blundered a +2 position" still stands. If Stockfish were to proclaim 95% win instead, the interpretation "White blundered a 95% win position" would be completely wrong because in the context of that game, there never was a 95% win for white position.

Milos wrote: ↑Mon Jul 06, 2020 2:01 pm
syzygy wrote: ↑Sun Jul 05, 2020 10:47 pm
But I hope you do realise that you will have to rewrite Stockfish's evaluation almost completely. You can't just convert SF's current (usually additive) scoring components into probabilities or "probability components". (And this thread has "Stockfish" in the title.)
No need, NN eval for SF is already there and for fixed nodes is already noticeably stronger than original SF eval.
That's completely irrelevant to the discussion at hand. The NN takes input features, then outputs an eval in SF-internal units. It doesn't need winpct to work at all, because winpct gives nothing for position ordering and search exploration - and to perform well with SF's search without heavy modifications, the eval needs similarities with SF's original eval.

Converting SF's hand-written eval to use winpct instead of cp is what was suggested and it's unfeasible without severe elo loss.

I don’t think I said someone should use probabilities within the evaluation function since I also realised it would probably be too costly to justify. If I were writing the evaluation function I might go for probabilities though (just for the fun of it) and not consider the huge extra cost. I would probably just define a couple of operator overrides to make it simple for me. What I said or wanted to say is that it is no problem to use probabilities in a NegaMax alpha beta framework and that it could make the storage space for an entry in the TT smaller, which might make it possible to squeeze an extra entry into every bucket. That should probably give a couple of extra Elos. The other advantage I realised was that it might be better to use probabilities in pruning or reduction thresholds for example.
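To make the 8-bit idea concrete, here is a rough sketch of the packing described above: one flag bit marks a special (mate) score and the remaining 7 bits hold either a quantised probability or a distance to mate. The names (`SPECIAL_FLAG`, `pack_probability`, `pack_mate_distance`, `unpack`) are made up for illustration and are not taken from any engine. The sketch also ignores which side delivers the mate, which a real entry would have to encode somehow.

```python
SPECIAL_FLAG = 0x80  # top bit set: payload is a distance to mate (assumed layout)
PAYLOAD_MASK = 0x7F  # low 7 bits: values 0..127


def pack_probability(p: float) -> int:
    """Quantise a probability in [0, 1] to 7 bits (steps of 1/127, about 0.8 %)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return round(p * PAYLOAD_MASK)


def pack_mate_distance(plies: int) -> int:
    """Store a distance to mate (0..127 plies) with the special flag set."""
    if not 0 <= plies <= PAYLOAD_MASK:
        raise ValueError("mate distance does not fit in 7 bits")
    return SPECIAL_FLAG | plies


def unpack(byte: int):
    """Return ("mate", plies) or ("prob", probability) for a packed byte."""
    payload = byte & PAYLOAD_MASK
    if byte & SPECIAL_FLAG:
        return ("mate", payload)
    return ("prob", payload / PAYLOAD_MASK)
```

With 7 bits the step between neighbouring probabilities is 1/127, so the "less than 1 %" granularity holds, but only uniformly across the whole 0..1 range.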

What I wanted to say with my posts was/is that there is nothing in an alpha beta framework that prevents you from using probabilities, in the same way as there is nothing that prevents a neural network from using alpha beta instead of an MCTS-like algorithm.
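A minimal sketch of what I mean, on a toy tree rather than a chess position. I assume here that the score is an expected score in [0, 1] from the point of view of the side to move (win plus half a draw), so that changing sides is `1 - p` instead of `-score`, and the window (alpha, beta) flips to (1 - beta, 1 - alpha). The tree format (nested lists with float leaves) is only for illustration.

```python
from typing import List, Union

# A node is either a leaf (expected score for the side to move there)
# or a list of child nodes. Purely illustrative structure.
Tree = Union[float, List["Tree"]]


def negamax(node: Tree, alpha: float = 0.0, beta: float = 1.0) -> float:
    """Alpha-beta in negamax form on scores in [0, 1]."""
    if not isinstance(node, list):
        return node
    best = 0.0
    for child in node:
        # Flip the child's score and the search window to our point of view.
        score = 1.0 - negamax(child, 1.0 - beta, 1.0 - alpha)
        best = max(best, score)
        alpha = max(alpha, best)
        if alpha >= beta:
            break  # cutoff: the opponent will avoid this line
    return best


# Root has two moves; after each the opponent has two replies.
tree = [[0.3, 0.6], [0.8, 0.9]]
root_score = negamax(tree)
```

Nothing else in the search loop changes. The cutoff test is the same comparison as with centipawns, only on a different scale.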

Maybe I will make a neural network engine in the future. I have an idea of how to do it. I would not use many of the techniques that are used for image classification and I think I could make a network that is very small but still okay. The big problem would be to come up with a good way to train it. Maybe it is best to start training it from known table base positions and mate in n problems and after learning some basic knowledge of mating material and mating patterns it could learn from playing itself and weighing the positions close to the end more.

If you want some ideas that have the potential to give big Elo gains I could share those.

chrisw · Post by **chrisw** » Tue Jul 07, 2020 12:34 am

syzygy wrote: ↑Sun Jul 05, 2020 10:47 pm
Pio wrote: ↑Sun Jul 05, 2020 10:13 pm My point is that it is simple to convert an existing alpha beta engine to work with win probabilities (or should I say win or draw probabilities to satisfy you). Using my way of probabilities in alpha beta has many obvious advantages as I mentioned in my previous post. An additional gain is that you can compress the storage space for the transposition table since a size of 10 bits should be more than enough and 8 bits might be sufficient. With 8 bits you could let 1 bit represent if the score is a special score or not and with the rest of the 7 bits tell either the “win or draw”-probability with a granularity of less than 1 % or the distance to mate.
Feel free to prepare a patch for Stockfish. I don't expect your patch will pass, but perhaps I am wrong.

But I hope you do realise that you will have to rewrite Stockfish's evaluation almost completely. You can't just convert SF's current (usually additive) scoring components into probabilities or "probability components". (And this thread has "Stockfish" in the title.)

Well, but, except. SF, and probably most AB programs, has actually “generated” additive centipawn scoring components by essentially reversing win rate back into additive weights. Isn’t that how the linear tuning works? Those magic numbers 100, 300, 500, 900 aren’t actually quite right, but they are what appears out of, say, a Texel tuner for “material” weights, given a mass of training positions and game result.
Win rate is the primary tangible data, centipawn material weights are secondary constructs. We turned a probability function into a polynomial and that’s fundamentally unsound, even if it works in practice, well, until it met AZ, when the compromises and forced corner cutting of the additive linear polynomial got terribly exposed. There isn’t really such a thing as “material”, it’s a heuristic construct, which works, until it doesn’t.

Or are you only proposing to map SF's evaluation onto some kind of logarithmic scale to save a few bits in the transposition table? That would not have much to do with probabilities, I would think. But again, if it makes SF stronger, why not.

Well, SF’s evaluation is reverse mapped off a logarithmic scale in the first place. Imperfectly, but negation then negation again, so to speak, is not entirely illogical, in the sense that a circle mapped imperfectly to a square can get imperfectly mapped back again, no?

On a side note I have thought that it would be really nice to have a different metric for endgame table bases. Would it not be nice to have an EGTB that for each position gives one of the moves that minimises the proof tree of the position? I understand it will take more space but it could be very useful for people wanting to train a neural network on endgame positions as well as for people wanting natural play of the engine in the endgame. Just an idea (since I am not able to do it by myself without a lot of work).
It is very easy to come up with "you should think outside the box!" ideas if you don't intend to implement them yourself. No need to think about practicalities, like, could it ever work at all, does it even make sense.

syzygy · Post by **syzygy** » Tue Jul 07, 2020 12:44 am

chrisw wrote: ↑Tue Jul 07, 2020 12:34 am Well, but, except. SF, and probably most AB programs, has actually “generated” additive centipawn scoring components by essentially reversing win rate back into additive weights. Isn’t that how the linear tuning works?

Good luck plugging test run win rates into your evaluation function. We'll continue the conversation when you're done.

Ovyron · Post by **Ovyron** » Tue Jul 07, 2020 8:18 am

99% of users that know what WDL is but not how Stockfish's WDL works will be misled into thinking they're getting something they're not. If the feature is here to stay, I think this problem can be avoided by changing its name. Adding a word like "Fake", "Illusory", "Made up", "Fictional", "Fantasy", "Dummy", "Knock-off" or "Phoney" before the name would solve all problems, especially if it makes people switch back to real centipawn scoring.

chrisw · Post by **chrisw** » Tue Jul 07, 2020 9:55 am

syzygy wrote: ↑Tue Jul 07, 2020 12:44 am
chrisw wrote: ↑Tue Jul 07, 2020 12:34 am Well, but, except. SF, and probably most AB programs, has actually “generated” additive centipawn scoring components by essentially reversing win rate back into additive weights. Isn’t that how the linear tuning works?
Good luck plugging test run win rates into your evaluation function. We'll continue the conversation when you're done.

Obviously I didn’t explain myself very well since test run win rates have been “plugged into” evaluation functions ever since Texel tuning, with reported ELO gains over hand-crafting of 200 or 300 Elo. Raw input is game positions plus WDL result, output are adjusted “centipawn” weights for additive evaluation functions. But you know this already.
To re-iterate, material weights are a secondary construct, a heuristic, based on “what works”, like all heuristics, and essentially imaginary. Win rate is primary data. Centipawn weights are derived from win rate, imperfectly, for purposes of giving beginners some sort of tool to read chess maps, and to provide programmers with imperfect data to weight imperfect evaluation functions. Since the weights are derived from WDL data, there seems every reason to reverse the displayed output from the artificial centipawns construct, back to where it came from, win rate. Sorry if that disturbs your paradigm.
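For anyone who hasn't seen it, a toy sketch of the tuning loop I'm describing. Positions are reduced to a feature vector (here just pawn and knight differences), each has a game result (1, 0.5 or 0), and a local search nudges the additive weights to minimise the squared error between the result and a sigmoid of the evaluation. The data set is invented, the scaling constant `k` is left at an assumed 1.0, and real tuners use far more positions and features.

```python
def sigmoid(cp: float, k: float = 1.0) -> float:
    """Map a centipawn evaluation to an expected score in (0, 1)."""
    return 1.0 / (1.0 + 10.0 ** (-k * cp / 400.0))


def evaluate(weights, features) -> float:
    """Additive evaluation: weighted sum of feature counts, in centipawns."""
    return sum(w * f for w, f in zip(weights, features))


def mean_squared_error(weights, data) -> float:
    return sum((result - sigmoid(evaluate(weights, feats))) ** 2
               for feats, result in data) / len(data)


def tune(weights, data, step=1, max_passes=1000):
    """Local search: try +/- step on each weight, keep changes that reduce error."""
    weights = list(weights)
    best = mean_squared_error(weights, data)
    for _ in range(max_passes):
        improved = False
        for i in range(len(weights)):
            for delta in (step, -step):
                weights[i] += delta
                err = mean_squared_error(weights, data)
                if err < best:
                    best, improved = err, True
                    break
                weights[i] -= delta  # undo a change that did not help
        if not improved:
            break
    return weights, best


# Invented data: ((pawn difference, knight difference), game result).
DATA = [
    ((1, 0), 1.0), ((1, 0), 0.5), ((1, 0), 0.5), ((1, 0), 0.5),
    ((0, 1), 1.0), ((0, 1), 1.0), ((0, 1), 1.0), ((0, 1), 0.5),
]

tuned, err = tune([0, 0], DATA)
```

The only inputs are positions and results. The "centipawn" numbers are whatever comes out of the fit.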

Ovyron · Post by **Ovyron** » Tue Jul 07, 2020 10:28 am

chrisw wrote: ↑Tue Jul 07, 2020 9:55 am Since the weights are derived from WDL data, there seems every reason to reverse the displayed output from the artificial centipawns construct

This is irreversible: you can't convert centipawns back to WDL, because a 1.00 in a position with very few draws is very different from a 1.00 in one with lots of draws, even if white's performance is the same in both. That means you can't get an accurate win percentage for white, because black's percentage (the chance that the tables are turned) is unknown.
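A tiny numeric illustration, with made-up numbers: two positions with the same expected score for white (win plus half a draw) but completely different W/D/L splits. Any single number, centipawns or otherwise, that tracks the expected score cannot tell them apart.

```python
import math


def expected_score(win: float, draw: float, loss: float) -> float:
    """White's performance: a win counts 1, a draw counts 1/2."""
    return win + 0.5 * draw


# Hypothetical (win, draw, loss) probabilities for white.
sharp = (0.64, 0.00, 0.36)  # no draws, black wins more than a third
quiet = (0.28, 0.72, 0.00)  # mostly draws, black never wins

same_performance = math.isclose(expected_score(*sharp), expected_score(*quiet))
```

Here `same_performance` is true even though white's win chance is 64 % in one position and 28 % in the other.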

syzygy · Post by **syzygy** » Tue Jul 07, 2020 11:26 am

chrisw wrote: ↑Tue Jul 07, 2020 9:55 am
syzygy wrote: ↑Tue Jul 07, 2020 12:44 am
chrisw wrote: ↑Tue Jul 07, 2020 12:34 am Well, but, except. SF, and probably most AB programs, has actually “generated” additive centipawn scoring components by essentially reversing win rate back into additive weights. Isn’t that how the linear tuning works?
Good luck plugging test run win rates into your evaluation function. We'll continue the conversation when you're done.
Obviously I didn’t explain myself very well since test run win rates have been “plugged into” evaluation functions ever since Texel tuning, with reported ELO gains over hand-crafting of 200 or 300 Elo. Raw input is game positions plus WDL result, output are adjusted “centipawn” weights for additive evaluation functions. But you know this already.

Indeed. Probably it was me who didn't explain himself very well.
I skipped ahead a few steps and interpreted your sentence as implying that, since win rates have been "reversed" into additive weights to tune conventional engines, it would be easier to skip the "reversing" and to use them directly in the evaluation function (or so). Which doesn't make all that much sense to me.

To re-iterate, material weights are a secondary construct, a heuristic, based on “what works”, like all heuristics, and essentially imaginary. Win rate is primary data.

Hmmm, maybe I did interpret your sentence correctly?

Centipawn weights are derived from win rate, imperfectly, for purposes of giving beginners some sort of tool to read chess maps, and to provide programmers with imperfect data to weight imperfect evaluation functions. Since the weights are derived from WDL data, there seems every reason to reverse the displayed output from the artificial centipawns construct, back to where it came from, win rate. Sorry if that disturbs your paradigm.

OK, so you do not actually suggest to put back the win rates into the evaluation function (which wouldn't make any sense, at least not to me, but I can be easily convinced of the contrary by a counterexample in the form of a working implementation). You merely say it makes sense to convert the final score output by SF back into win rate.

I don't see the connection at all. Dressing up SF's score as win/draw/loss probabilities is just cosmetics. If you want to end up with real win rates, just run a long match with SF against some other engine or against itself and collect the stats.
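Collecting the stats is the easy part. A minimal sketch, assuming the match results are available as PGN-style result strings (the function name and the sample counts are invented):

```python
import math
from collections import Counter


def match_stats(results):
    """Win/draw/loss frequencies and score (with standard error) from white's side."""
    counts = Counter(results)
    n = len(results)
    win = counts["1-0"] / n
    draw = counts["1/2-1/2"] / n
    loss = counts["0-1"] / n
    score = win + 0.5 * draw
    # Per-game variance of the result around the mean score.
    variance = (win * (1.0 - score) ** 2
                + draw * (0.5 - score) ** 2
                + loss * (0.0 - score) ** 2)
    return {"win": win, "draw": draw, "loss": loss,
            "score": score, "stderr": math.sqrt(variance / n)}


# Invented 100-game sample.
stats = match_stats(["1-0"] * 30 + ["1/2-1/2"] * 60 + ["0-1"] * 10)
```

The resulting numbers describe that particular match (engines, time control, openings) and nothing more.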

Michel · Post by **Michel** » Tue Jul 07, 2020 12:12 pm

If you want to end up with real win rates, just run a long match with SF against some other engine or against itself and collect the stats.

It's not so clear how to objectively measure the win rate of a particular position. One must have some method for introducing variety. The method one uses may influence the results.

syzygy · Post by **syzygy** » Tue Jul 07, 2020 11:22 pm

Michel wrote: ↑Tue Jul 07, 2020 12:12 pm
If you want to end up with real win rates, just run a long match with SF against some other engine or against itself and collect the stats.
It's not so clear how to objectively measure the win rate of a particular position. One must have some method for introducing variety. The method one uses may influence the results.

Indeed. I don't see the connection between the win rate of a match between two engines and the components of an evaluation function. (But please don't expect me to explain the (in my view rather unspecific) ideas about how important or intuitive win probabilities are that others have been coming up with in this thread.)

It is true that engine parameters are often tuned by running lots of games, but those games don't form matches and don't produce win rates, because the parameters are being changed all the time in an attempt to converge to optimal values.

chrisw · Post by **chrisw** » Wed Jul 08, 2020 12:54 am

syzygy wrote: ↑Tue Jul 07, 2020 11:26 am
chrisw wrote: ↑Tue Jul 07, 2020 9:55 am
syzygy wrote: ↑Tue Jul 07, 2020 12:44 am
chrisw wrote: ↑Tue Jul 07, 2020 12:34 am Well, but, except. SF, and probably most AB programs, has actually “generated” additive centipawn scoring components by essentially reversing win rate back into additive weights. Isn’t that how the linear tuning works?
Good luck plugging test run win rates into your evaluation function. We'll continue the conversation when you're done.
Obviously I didn’t explain myself very well since test run win rates have been “plugged into” evaluation functions ever since Texel tuning, with reported ELO gains over hand-crafting of 200 or 300 Elo. Raw input is game positions plus WDL result, output are adjusted “centipawn” weights for additive evaluation functions. But you know this already.
Indeed. Probably it was me who didn't explain himself very well.
I skipped ahead a few steps and interpreted your sentence as implying that, since win rates have been "reversed" into additive weights to tune conventional engines, it would be easier to skip the "reversing" and to use them directly in the evaluation function (or so). Which doesn't make all that much sense to me.

I wouldn’t have been arguing that because probabilities are not additive. But you know that already. What’s being added in a conventional engine is a kind of transmogrified bastard child of probability called “centipawns”, which, up to a point, one can get away with adding together, although we get reminded every now and again when not.
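To spell out the non-additivity with numbers: below, two advantages are combined once in centipawn space and once by naively adding their probability gains over 0.5. The logistic mapping with a 400 cp scale is an assumption for illustration, not any engine's actual conversion.

```python
import math


def cp_to_expected(cp: float, scale: float = 400.0) -> float:
    """Assumed logistic mapping from centipawns to expected score."""
    return 1.0 / (1.0 + 10.0 ** (-cp / scale))


def expected_to_cp(p: float, scale: float = 400.0) -> float:
    """Inverse of cp_to_expected, valid for 0 < p < 1."""
    return scale * math.log10(p / (1.0 - p))


rook_up, queen_up = 500.0, 900.0

# Adding in centipawn space, then converting once: stays below 1.
combined = cp_to_expected(rook_up + queen_up)

# Adding the probability gains directly: overshoots 1, so it is not a probability.
naive = 0.5 + (cp_to_expected(rook_up) - 0.5) + (cp_to_expected(queen_up) - 0.5)
```

With these inputs `combined` is just under 1 while `naive` comes out around 1.44.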

To re-iterate, material weights are a secondary construct, a heuristic, based on “what works”, like all heuristics, and essentially imaginary. Win rate is primary data.
Hmmm, maybe I did interpret your sentence correctly?

Centipawn weights are derived from win rate, imperfectly, for purposes of giving beginners some sort of tool to read chess maps, and to provide programmers with imperfect data to weight imperfect evaluation functions. Since the weights are derived from WDL data, there seems every reason to reverse the displayed output from the artificial centipawns construct, back to where it came from, win rate. Sorry if that disturbs your paradigm.
OK, so you do not actually suggest to put back the win rates into the evaluation function (which wouldn't make any sense, at least not to me, but I can be easily convinced of the contrary by a counterexample in the form of a working implementation). You merely say it makes sense to convert the final score output by SF back into win rate.

I don't see the connection at all. Dressing up SF's score as win/draw/loss probabilities is just cosmetics.

Probably we are arguing about phantoms. Obviously one can’t reconstruct a 2D object (WDL) from a 1D object (winrate/centipawns whatever). That’s so base that I assumed you were arguing against expressing centipawns on a 0-1 scale. But I see now you were arguing about trying to regenerate W D and L.

If you want to end up with real win rates, just run a long match with SF against some other engine or against itself and collect the stats.

We don’t know any real win rate. Too early to say.
