Sure, go ahead Joerg. I don't think there is much difference between the versions? I looked at your Stockfish branch but I did not find any code there yet.
Eelco
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan
syzygy wrote: Quite a few KBBvKN positions need more than 50 moves for capturing the knight.
More accurately: the vast majority (after you weed out the positions where the Knight is tactically lost within 1 or 2 ply from the very beginning, because it is hanging, the victim of a skewer, etc.). If the engine cannot see the win of the Knight within its horizon, it had better assume KBBKN is a draw.
Stockfish seems to be pretty backward in its end-game knowledge. I don't think Fruit 2.1 would make the mis-evaluation of the original post. White has no Pawns, and that fact alone deserves a 50% reduction of its naive evaluation advantage.
Fruit would group this material combination in the 'minor ahead, no Pawns' class. Which is severely discounted (a factor 8?). That the defending side has one or two Pawns doesn't make it any easier, and is only taken into account in the sense that it reduces the naive advantage even before the discount is applied.
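To make that arithmetic concrete, here is a minimal Python sketch of such a discount. It is not Fruit's code: the piece values, the "at most one minor ahead in pieces" test, the function names and the divisor 8 (the guess from the post above) are all assumptions for illustration.

```python
# Illustrative values in pawn units; NOT taken from Fruit's source.
PIECE_VALUE = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}
MINOR = 3      # assumed value of one minor piece
DISCOUNT = 8   # assumed divisor (the post only guesses "a factor 8?")


def material(counts, with_pawns=True):
    """Sum the material of one side given a {piece letter: count} dict."""
    return sum(PIECE_VALUE[p] * n for p, n in counts.items()
               if with_pawns or p != "P")


def scaled_advantage(strong, weak):
    """Material edge of the strong side, discounted in the pawnless case."""
    # The defender's pawns already reduce the naive edge here,
    # i.e. before any discount is applied.
    naive = material(strong) - material(weak)
    piece_edge = (material(strong, with_pawns=False)
                  - material(weak, with_pawns=False))
    if strong.get("P", 0) == 0 and piece_edge <= MINOR:
        return naive / DISCOUNT
    return float(naive)


# KBBKBP: two bishops against bishop plus pawn.
print(scaled_advantage({"B": 2}, {"B": 1, "P": 1}))
```

For KBBKBP the naive edge is 6 - 4 = 2 pawns, which the discount cuts to 0.25. Give the strong side a single pawn and nothing is discounted.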
Joerg Oster wrote:
Edit: Oh, I just realized you did test it, but no draw score
I just left in the bonus for trying to capture the Knight, to give the engine something to do. This is a Swindle mode of sorts, but it does not really help. With or without the knight there is no way to win without help from the other side, and even with a knight on the board I don't think I can construct a position where a blunder leads to mate in a corner. If capturing the knight generally takes more than 50 moves, though, I think Marco had better move back to Tord's original version, which correctly scores a draw with same coloured bishops... Very subtle! I find it a bit cheap of Harm to accuse Stockfish of having poor endgame rules on the basis of one bug found more or less by accident.
Eelco
Uh? I was not talking about the like-colored-Bishops bug here, but about the KBBKBP evaluation in the original post. I don't think this can be traced to any bug. The white Bishops there are a regular pair. It seems a plain omission.
This seems rather a symptom of a very general problem, namely that it does not know that when KXYKZ is a dead draw, KXYKZP and KXYKZPP are even worse for the side trying to win. Have you tried this with KRKBP, KRKNP, KBNKNP, KBNKBP, KRBKRP, KRNKRP, KQBKQP, KQNKQP? Fruit (a 10-year-old engine!) would recognize all these material combinations as heavily drawish.
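A rough way to express that rule over material signatures, as a Python sketch. The rule used here (strong side has no pawns and is ahead by at most one minor in pieces) is a coarse stand-in chosen so that it covers the list above; the names and piece values are hypothetical and this is not how Fruit or Stockfish implement it.

```python
PIECE_VALUE = {"N": 3, "B": 3, "R": 5, "Q": 9}  # illustrative values only
MINOR = 3


def split_signature(sig):
    """Split e.g. 'KRBKRP' into ('RB', 'RP'); the first side is the strong one."""
    second_king = sig.index("K", 1)
    return sig[1:second_king], sig[second_king + 1:]


def pieces_value(side):
    """Value of the pieces only; pawns are deliberately counted as 0."""
    return sum(PIECE_VALUE.get(c, 0) for c in side)


def is_pawnless_drawish(sig):
    """True if the strong side has no pawns and at most a minor's edge in pieces."""
    strong, weak = split_signature(sig)
    edge = pieces_value(strong) - pieces_value(weak)
    return "P" not in strong and 0 <= edge <= MINOR


LISTED = ["KRKBP", "KRKNP", "KBNKNP", "KBNKBP",
          "KRBKRP", "KRNKRP", "KQBKQP", "KQNKQP"]

for sig in LISTED:
    print(sig, is_pawnless_drawish(sig))
```

Because defender pawns are ignored in the test, adding P or PP to a signature never turns a flagged combination into an unflagged one, which is the point made above. A real evaluation would still need exceptions for specific endings.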
I don't think there is anything 'cheap' in concluding that an engine completely unaware of such elementary facts has 'poor end-game knowledge'.
Well, I sometimes have my doubts how real 'real' really is. It is rather fashionable nowadays to cull all knowledge out of Chess engines. That entails the risk that they now almost all make the same silly mistakes, so that it doesn't hurt much when you make them yourself too. The term 'incestuous testing' has been coined for this.
It is hard for me to believe that not knowing something as elementary as the fact that being a minor ahead without Pawns is still a dead draw could not be efficiently exploited by an opponent that does know it. Especially by an opponent that knows you are naive in this respect. Just like Pablo exploits that engines are naive towards closing the position. But if you only test against opponents that would never sucker you for a draw, you wouldn't see the difference. But could you still call such testing 'real game-play'? We might well be creating our own virtual reality here.
The funny thing with the under-promotion is not only that it doesn't recognize the like Bishops, but that it thinks KBBKN is better than KQBKN in the first place. Even with unlike Bishops KBBKN is almost always a 50-move draw against best defense, while KQBKN of course always wins. Recognizing KBBKN as a 'certified win' seems an example of 'wrong knowledge'.
Would there be some alternative testing track that could uncover these kinds of problems and clean them up? Something like randomly generated positions with playouts? Maybe take a randomly generated position and compare a self-playout with a playout against another engine. Of course "randomly generated" is an extremely wide net; there might be some ways to narrow that down without missing the interesting cases we want to detect.
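One way to narrow the net would be to sample positions per material signature rather than fully at random. The Python sketch below only does the sampling step and prints a FEN. Everything in it is an assumption for illustration: the function names are made up, it only rejects adjacent kings and pawns on the first or eighth rank, and it does not check whether the side not to move is in check, so a real harness would still have to filter out illegal positions before handing them to the engines for playouts.

```python
import random


def random_position(signature, rng):
    """Place the pieces of e.g. 'KBBKN' on random squares and return a FEN.

    The first side becomes White, and White is to move.
    """
    second_king = signature.index("K", 1)
    pieces = (list(signature[:second_king])
              + [c.lower() for c in signature[second_king:]])
    while True:
        squares = rng.sample(range(64), len(pieces))
        board = dict(zip(squares, pieces))
        if _plausible(board):
            return _to_fen(board)


def _plausible(board):
    """Reject adjacent kings and pawns on the back ranks (no check test)."""
    kings = [sq for sq, p in board.items() if p in "Kk"]
    (f1, r1), (f2, r2) = [(sq % 8, sq // 8) for sq in kings]
    if max(abs(f1 - f2), abs(r1 - r2)) <= 1:
        return False
    return all(sq // 8 not in (0, 7) for sq, p in board.items() if p in "Pp")


def _to_fen(board):
    """Build the FEN string; square 0 is a1, square 63 is h8."""
    rows = []
    for rank in range(7, -1, -1):
        row, empty = "", 0
        for file in range(8):
            piece = board.get(rank * 8 + file)
            if piece is None:
                empty += 1
                continue
            if empty:
                row += str(empty)
                empty = 0
            row += piece
        if empty:
            row += str(empty)
        rows.append(row)
    return "/".join(rows) + " w - - 0 1"


rng = random.Random(2024)  # fixed seed so a test run can be repeated
for _ in range(3):
    print(random_position("KBBKNP", rng))
```

Each FEN could then be played out twice, once in self-play and once against a second engine, and the positions where the two results disagree are the ones worth looking at.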