lkaufman wrote: ↑Tue Dec 10, 2019 7:07 am
lkaufman wrote: ↑Mon Dec 09, 2019 5:37 pm
Javier Ros wrote: ↑Mon Dec 09, 2019 11:11 am
Nordlandia wrote: ↑Mon Dec 09, 2019 5:10 am
Engines need to know that they're playing armageddon. So they need to be taught playing that mode.
I agree. In the same way that AlphaZero has been trained to play No Castling Chess, Lc0 must learn No Castling Chess or NBC with draw advantage, while alpha-beta programs must also be modified. A new opening repertoire will be created for each variation of chess. The contempt factor for each side is not enough.
The experiments of Laskos and Larry Kaufman are very interesting, but once the programs take the draw advantage into account the results will vary.
Anyway the proposal seems balanced and acceptable.
Although knowing the Armageddon rule is ideal, I'm pretty sure that using a White Contempt of 75 in Komodo comes close enough to this for most practical purposes. I think that the results indicate that the exact value doesn't matter much, because as the value goes higher, both sides modify their play more towards the Armageddon rule, and the effects cancel out.
I added Fat Fritz to the experiment, although it doesn't have Contempt so you may consider the result less reliable than the Komodo results. It ran on an RTX 2080 at 1' + 1". Result: 177 White wins, 8 Black wins, 185 draws, so 177/370 points = 47.8%. So this brings the overall results for all engines tested down to between 50% and 51% (depending on how you weight them). It is really amazing how fair this variant appears to be, at least between engines!
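For readers checking the arithmetic, here is a minimal sketch of the two scoring rules used in this thread, assuming that under Armageddon scoring a draw simply counts as a Black win. The function names are made up for illustration and do not come from any engine or tournament tool.

```python
def armageddon_white_score(white_wins, black_wins, draws):
    """White's share of points when every draw is scored as a Black win."""
    games = white_wins + black_wins + draws
    return white_wins / games


def normal_white_score(white_wins, black_wins, draws):
    """White's share of points under normal scoring (a draw is half a point)."""
    games = white_wins + black_wins + draws
    return (white_wins + 0.5 * draws) / games


# Fat Fritz self-play figures quoted above: 177 White wins, 8 Black wins, 185 draws.
w, b, d = 177, 8, 185
print(f"Armageddon White score: {armageddon_white_score(w, b, d):.1%}")  # 47.8%
print(f"Normal White score:     {normal_white_score(w, b, d):.1%}")      # 72.8%
```

The same two functions can be applied to any of the White win / Black win / draw totals posted in this thread.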
I realized that it may be a flaw in the testing of this idea to run only identical or very similar engines against each other. In the real world, engines and humans don't generally play against clones, so this is not a proper test. So I'm running some unrelated engine matches. They don't have to be of equal strength; as long as they are within a hundred Elo or so of each other this should still work, since each side gets half White and half Black. Of course, if the engines were a thousand Elo apart the result would come out 50% for each color, since the stronger engine would win all the games, half with each color, but with moderate Elo gaps any White-Black bias should show up.
My first test was Stockfish 10 on 6 fast cpu cores vs. Fat Fritz on RTX 2080, at 1' +0.6". Stockfish won the match 60 to 40, but that's not what matters. White won 58 games, with 42 draws and not a single Black win for either engine! This is a bit worrisome, as 58 to 42 is rather significant. We'll have to see how other unrelated pairings come out. It may turn out that NBC Armageddon isn't as fair as we thought, in which case we can fall back on NBSC, no Black short castling, which would obviously raise Black's prospects. But let's wait for results first.
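One rough way to gauge how far 58 White points out of 100 is from a fair 50% is a two-sided binomial test. The sketch below assumes the games are independent and that a perfectly fair variant gives White exactly 50% under Armageddon scoring. Both are simplifications, and the helper names are illustrative only.

```python
from math import comb, sqrt


def z_score(white_points, games):
    """Normal-approximation z-score of White's result against a fair 50%."""
    return (white_points - games / 2) / sqrt(games * 0.25)


def two_sided_binomial_p(white_points, games):
    """Exact two-sided p-value against p = 0.5 (symmetric, so the tail is doubled)."""
    k = max(white_points, games - white_points)
    upper_tail = sum(comb(games, i) for i in range(k, games + 1)) / 2 ** games
    return min(1.0, 2 * upper_tail)


# Stockfish 10 vs Fat Fritz: White scored 58 of 100 under Armageddon scoring.
print(f"z = {z_score(58, 100):.2f}")  # z = 1.60
print(f"two-sided p = {two_sided_binomial_p(58, 100):.3f}")
```

With only 100 games the uncertainty is still fairly wide, which fits the plan of waiting for more unrelated pairings before drawing conclusions.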
I ran this in the morning, an RR of 600 games with the top 3 engines at 60 + 0.6:
# PLAYER        :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(next)
1 SF_9          :   46.58  19.48   232.5     400  58.1         95
2 Houdini_6     :   19.35  18.85   213.5     400  53.4        100
3 Komodo_131    :  -65.93  19.52   154.0     400  38.5        ---
White advantage = 157.98 +/- 11.35
Draw rate (equal opponents) = 44.33 % +/- 2.40
Komodo surprisingly performs quite poorly, although I set White Contempt = 75 for it and left the default small Contempt for the other two, as they have no Colored Contempt. I hope the high Contempt in Komodo doesn't harm its performance and doesn't skew the overall result.
Here is the important aspect:
Games : 600 (finished)
White Wins : 314 (52.3 %)
Black Wins : 70 (11.7 %)
Draws : 216 (36.0 %)
Unfinished : 0
White Score : 70.3 %
Black Score : 29.7 %
White wins are 52.3%, which is not that bad. My theory was that for determining the borderline at longer TC, both the White win rate (52.3%, where the border is 50%) and the White performance in normal scoring (70.3%, where the border is 75%) should be considered. At shorter TC more "accidents" happen, and a Black win (there are many here) is often an accident between engines of similar strength. But "accidents" happen the other way around too: some of the White wins are also "accidents". At the same time, the performance at 60 + 0.6 in normal scoring is significantly below 75%, which, combined with the 52.3% White wins, suggests that it is debatable what sort of opening this is at much longer TC.
I am now testing the same 600-game RR at 240 + 2.4. It will probably take almost a day, but if my theory stands, White might score even below 52.3% at longer TC, while White performance in normal scoring might be above 70.3% (though maybe not above the 75% threshold). That is due to fewer Black wins at longer TC and, possibly, fewer White wins. Let's see, but the statistical fluctuations are not that small even in 600 games, so one has to be cautious about inferring too much.
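To put a rough number on those fluctuations, here is a small sketch of the 95% margin of error on the White win rate, assuming independent games and a normal approximation to the binomial. The function name is illustrative only.

```python
from math import sqrt


def proportion_margin(successes, trials, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion."""
    p = successes / trials
    return z * sqrt(p * (1 - p) / trials)


# 314 White wins in the 600-game RR at 60 + 0.6.
white_wins, games = 314, 600
rate = white_wins / games
margin = proportion_margin(white_wins, games)
print(f"White win rate: {rate:.1%} +/- {margin:.1%}")  # 52.3% +/- 4.0%
```

Under these assumptions the interval still straddles the 50% border, so a shift of a point or two between time controls would be hard to tell apart from noise at this sample size.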