Armageddon scoring doesn't enhance the resolution power of the test suite

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Post by Laskos »

The only advantages of Armageddon scoring seem to be that it eliminates draws and there is no necessity to pair games side and reversed. For normal scoring to be competitive as resolution power goes, one needs to have pairs of side and reversed games and to use pentanomial variance.

The statements can be derived from the Elo model (which includes a draw model) having Elo_difference, Elo_draw and Elo_bias. For now I am giving empirical evidence for these statements. The examples given are NBSC chess (No Black Short Castling) and the unbalanced 2moves_80_100 openings of regular chess. It is also shown that, using the correct variances, NBSC chess doesn't offer any resolution power advantage over other unbalanced opening suites of regular chess like 2moves_80_100. Thus NBSC chess, not being regular chess, offers only disadvantages compared to some regular chess opening suites.

I built the NBSC opening set (395 different positions, moves 1-4) using 4-threaded Komodo 14 with the options "White Must Win" and "Contempt = 0" combined with "White Contempt". The NBSC openings currently in use are, AFAIK, regular chess openings that happen to obey the NBSC rules, but such openings are not optimal for NBSC chess.

I let SF12 play SF12 at 70% time control (SF_12x07 in the output below) from 3 opening suites: NBSC chess, 2moves_80_100 regular chess, 2moves_v1 regular chess. Note that SF12 doesn't understand Armageddon scoring.
1000 games each match. For "Normal Scoring", the trinomial and the correct pentanomial variances are used. For "Armageddon Scoring", the binomial and the correct trinomial variances are used. The important thing is to look at the appropriate t-value (tvalue5 for normal scoring, tvalue3 for Armageddon scoring). This t-value equals the Elo difference over sigma, or "signal to noise ratio", and gives the resolution power of the test suite (a higher t-value is better).
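
As a minimal sketch of the trinomial computation described above (not the script actually used for this post), the following Python function derives the Elo difference, its standard error and a t-value from a win/loss/draw count, treating the games as independent. The name `trinomial_stats` is hypothetical. Two assumptions are made: sigma is converted to Elo by first-order error propagation through the logistic Elo curve, and the t-value is taken in score space as (score - 0.5) / sigma_score, which appears to match the tvalue3 figures quoted for normal scoring.

```python
import math

def trinomial_stats(wins, losses, draws):
    """Return (elo, sigma_elo, t_value) for independent games scored 1 / 0.5 / 0."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result: E[x^2] - E[x]^2 with x in {1, 0.5, 0}.
    var = (wins * 1.0 + draws * 0.25) / n - score ** 2
    sigma_score = math.sqrt(var / n)
    elo = 400.0 * math.log10(score / (1.0 - score))
    # First-order propagation of the score error through the Elo curve (assumption).
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    sigma_elo = slope * sigma_score
    # t-value taken in score space (assumption, see the lead-in).
    t_value = (score - 0.5) / sigma_score
    return elo, sigma_elo, t_value

# W - L - D of the NBSC match reported in this post: 336 - 178 - 486
elo, sigma3, tvalue3 = trinomial_stats(336, 178, 486)
print(f"delta = {elo:.1f} Elo; sigma3 = {sigma3:.1f}; tvalue3 = {tvalue3:.1f}")
```

For the NBSC match this should give values close to the reported delta = 55.4, sigma3 = 7.9 and tvalue3 = 7.1.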

1/
NBSC chess openings

Score of SF_12 vs SF_12x07: 336 - 178 - 486 [0.579] 1000
... SF_12 playing White: 308 - 5 - 187 [0.803] 500
... SF_12 playing Black: 28 - 173 - 299 [0.355] 500
... White vs Black: 481 - 33 - 486 [0.724] 1000
Elo difference: 55.4 +/- 15.4, LOS: 100.0 %, DrawRatio: 48.6 %
Finished match

Normal scoring: delta = 55.4 Elo points
sigma3 = 7.9; tvalue3 = 7.1
sigma5 = 6.0; tvalue5 = 9.3

Armageddon scoring: delta = 96.2 Elo points
sigma2 = 11.4; tvalue2 = 8.9
sigma3 = 11.3; tvalue3 = 9.0

It is already quite obvious that Armageddon scoring brings no more sensitivity than using paired side and reversed games with normal scoring and pentanomial errors.
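
The Armageddon figures can be approximated from the per-colour lines of the match output. The sketch below assumes that Armageddon scoring simply counts every draw as a win for Black, and it uses the plain binomial variance (the sigma2 line). The function name `armageddon_stats` is hypothetical. The trinomial variance over game pairs (the sigma3 line under Armageddon scoring) needs per-pair results that are not listed in the output, so it is not reproduced here.

```python
import math

def armageddon_stats(white_wld, black_wld):
    """white_wld / black_wld: (wins, losses, draws) of one engine with each colour."""
    ww, wl, wd = white_wld
    bw, bl, bd = black_wld
    n = ww + wl + wd + bw + bl + bd
    # Assumed Armageddon rule: a draw counts as a win for Black,
    # so the engine scores its White wins plus its Black wins and Black draws.
    points = ww + bw + bd
    p = points / n
    sigma_p = math.sqrt(p * (1.0 - p) / n)  # binomial standard error
    elo = 400.0 * math.log10(p / (1.0 - p))
    # First-order conversion of the score error to Elo (assumption).
    sigma_elo = 400.0 / (math.log(10) * p * (1.0 - p)) * sigma_p
    return elo, sigma_elo, (p - 0.5) / sigma_p

# SF_12 in the NBSC match: White 308 - 5 - 187, Black 28 - 173 - 299
elo_a, sigma2, tvalue2 = armageddon_stats((308, 5, 187), (28, 173, 299))
print(f"delta = {elo_a:.1f} Elo; sigma2 = {sigma2:.1f}; tvalue2 = {tvalue2:.1f}")
```

Under these assumptions the output should be close to the reported delta = 96.2, sigma2 = 11.4 and tvalue2 = 8.9.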


2/
2moves_80_100 regular chess

Score of SF_12 vs SF_12x07: 339 - 178 - 483 [0.581] 1000
... SF_12 playing White: 315 - 1 - 184 [0.814] 500
... SF_12 playing Black: 24 - 177 - 299 [0.347] 500
... White vs Black: 492 - 25 - 483 [0.734] 1000
Elo difference: 56.4 +/- 15.5, LOS: 100.0 %, DrawRatio: 48.3 %
Finished match

Normal scoring: delta = 56.4 Elo points
sigma3 = 7.9; tvalue3 = 7.3
sigma5 = 5.7; tvalue5 = 10.0

Armageddon scoring: delta = 98.4 Elo points
sigma2 = 11.4; tvalue2 = 9.1
sigma3 = 10.9; tvalue3 = 9.5

And it is obvious not only that Armageddon scoring is pretty useless, but also that regular chess openings with normal scoring and the correct pentanomial variance are not any worse in sensitivity compared to NBSC chess openings.
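
For the pentanomial variance, each side-and-reversed pair of games is treated as one trial with five possible pair scores (0, 0.5, 1, 1.5 or 2 points). A rough sketch follows. The function name `pentanomial_stats` is hypothetical, and the pair counts are invented purely for illustration, because the match output in this post does not list per-pair results. The Elo conversion and the score-space t-value are the same assumptions as for the trinomial case.

```python
import math

def pentanomial_stats(pair_counts):
    """pair_counts[k]: number of game pairs in which the engine scored k/2 points (k = 0..4)."""
    n_pairs = sum(pair_counts)
    # Per-game score represented by each pair category: 0, 0.25, 0.5, 0.75, 1.
    values = [k / 4.0 for k in range(5)]
    score = sum(c * v for c, v in zip(pair_counts, values)) / n_pairs
    # Variance of the per-game score across pairs; the pair is the independent unit.
    var = sum(c * (v - score) ** 2 for c, v in zip(pair_counts, values)) / n_pairs
    sigma_score = math.sqrt(var / n_pairs)
    elo = 400.0 * math.log10(score / (1.0 - score))
    # First-order conversion of the score error to Elo (assumption).
    sigma_elo = 400.0 / (math.log(10) * score * (1.0 - score)) * sigma_score
    return elo, sigma_elo, (score - 0.5) / sigma_score

# HYPOTHETICAL counts for 500 pairs, NOT data from this thread.
hypothetical_pairs = [5, 60, 250, 160, 25]
elo5, sigma5, tvalue5 = pentanomial_stats(hypothetical_pairs)
print(f"delta = {elo5:.1f} Elo; sigma5 = {sigma5:.1f}; tvalue5 = {tvalue5:.1f}")
```

With strongly biased openings, many pairs end as one win and one loss, that is, a pair score of 1. Those pairs sit close to the mean and add little variance, which is one way to see why sigma5 comes out smaller than sigma3 in the results above.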


3/
2moves_v1 regular chess

Score of SF_12 vs SF_12x07: 212 - 66 - 722 [0.573] 1000
... SF_12 playing White: 140 - 20 - 340 [0.620] 500
... SF_12 playing Black: 72 - 46 - 382 [0.526] 500
... White vs Black: 186 - 92 - 722 [0.547] 1000
Elo difference: 51.1 +/- 11.1, LOS: 100.0 %, DrawRatio: 72.2 %
Finished match

Normal scoring: delta = 51.1 Elo points
sigma3 = 5.7; tvalue3 = 9.1
sigma5 = 5.0; tvalue5 = 10.3

And here it is shown that at this not yet extremely high draw rate (72.2%) the use of unbalanced openings is not yet necessary. The sensitivity of the regular 2moves_v1 suite of the Stockfish testing framework here is shown to be as good as that from fancy unbalanced openings. Unbalanced openings are very useful only at very high draw rates, above 70-75%.

===========

It remains for me or someone else to explain these results more theoretically, starting with the Elo model, a draw model and parameters like Elo_difference, Elo_draw and Elo_bias.
mwyoung
Posts: 2727
Joined: Wed May 12, 2010 10:00 pm

Re: Armageddon scoring doesn't enhance the resolution power of the test suite

Post by mwyoung »

Laskos wrote: Fri Sep 11, 2020 2:32 pm The only advantages of Armageddon scoring seem to be that it eliminates draws and there is no necessity to pair games side and reversed. For normal scoring to be competitive as resolution power goes, one needs to have pairs of side and reversed games and to use pentanomial variance.

I concur mostly. My testing showed the same results. Moving on with normal testing!
mmt
Posts: 343
Joined: Sun Aug 25, 2019 8:33 am
Full name: .

Re: Armageddon scoring doesn't enhance the resolution power of the test suite

Post by mmt »

Laskos wrote: Fri Sep 11, 2020 2:32 pm The only advantages of Armageddon scoring seem to be that it eliminates draws and there is no necessity to pair games side and reversed. For normal scoring to be competitive as resolution power goes, one needs to have pairs of side and reversed games and to use pentanomial variance.
Kind of important advantages there.
Laskos wrote: Fri Sep 11, 2020 2:32 pm The examples given are NBSC chess (No Black Short Castling) and unbalanced 2moves_80_100 openings of the regular chess. It is also shown that using correct variances, NBSC chess doesn't offer any resolution power advantage over other unbalanced opening suites of the regular chess like 2moves_80_100. Thus NBSC chess, not being regular chess, offers only disadvantages compared to some regular chess opening suites.
A major difference: you don't play openings in your test suites. It's a poor version of chess with an important part of the game completely erased.
Laskos wrote: Fri Sep 11, 2020 2:32 pm And here it is shown that at this not yet extremely high draw rate (72.2%) the use of unbalanced openings is not yet necessary. The sensitivity of the regular 2moves_v1 suite of the Stockfish testing framework here is shown to be as good as that from fancy unbalanced openings. Unbalanced openings are very useful only at very high draw rates, above 70-75%.
Another important part of chess you're erasing with unbalanced openings is playing from even positions. This is getting far from regular chess.

Do you think the draw rate will go up or down as we progress in computing power? Do different strengths and time controls have different ratios? There is no official version of openings that everyone uses and they can change with time. What if we find out that 1. d4 is losing? The ratings will suddenly change without any program change.

Using forced openings for testing is a dirty and temporary bandage to make computer chess not be a complete bore.
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Armageddon scoring doesn't enhance the resolution power of the test suite

Post by Dann Corbit »

mmt wrote: Sat Sep 12, 2020 7:02 am Using forced openings for testing is a dirty and temporary bandage to make computer chess not be a complete bore.
Aye, and there's the rub. Who wants to watch 400 English and French openings in a row?
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Armageddon scoring doesn't enhance the resolution power of the test suite

Post by Laskos »

mmt wrote: Sat Sep 12, 2020 7:02 am
Laskos wrote: Fri Sep 11, 2020 2:32 pm The only advantages of Armageddon scoring seem to be that it eliminates draws and there is no necessity to pair games side and reversed. For normal scoring to be competitive as resolution power goes, one needs to have pairs of side and reversed games and to use pentanomial variance.
Kind of important advantages there.
Laskos wrote: Fri Sep 11, 2020 2:32 pm The examples given are NBSC chess (No Black Short Castling) and unbalanced 2moves_80_100 openings of the regular chess. It is also shown that using correct variances, NBSC chess doesn't offer any resolution power advantage over other unbalanced opening suites of the regular chess like 2moves_80_100. Thus NBSC chess, not being regular chess, offers only disadvantages compared to some regular chess opening suites.
A major difference: you don't play openings in your test suites. It's a poor version of chess with an important part of the game completely erased.
I don't quite understand. Do you mean that all opening suites make the game "a poor version of chess"? Or specifically these unbalanced ones and the random one, 2moves_v1? One cannot let engines play even several games from the single standard opening position without redundancy and bias. Maybe you mean using opening books? Well, then it's engine + book playing, not just the engine. I prefer to keep them separate for ease of study.
Laskos wrote: Fri Sep 11, 2020 2:32 pm And here it is shown that at this not yet extremely high draw rate (72.2%) the use of unbalanced openings is not yet necessary. The sensitivity of the regular 2moves_v1 suite of the Stockfish testing framework here is shown to be as good as that from fancy unbalanced openings. Unbalanced openings are very useful only at very high draw rates, above 70-75%.
Another important part of chess you're erasing with unbalanced openings is playing from even positions. This is getting far from regular chess.

Do you think the draw rate will go up or down as we progress in computing power? Do different strengths and time controls have different ratios? There is no official version of openings that everyone uses and they can change with time. What if we find out that 1. d4 is losing? The ratings will suddenly change without any program change.

Using forced openings for testing is a dirty and temporary bandage to make computer chess not be a complete bore.
It is very unlikely that e4 or d4 are anything other than draws according to 32 WDL tables. Yes, chess from balanced and sound openings will be even more drawish in 10 years than it is today. The draw rates of top engines (Contempt is regular) from this sort of opening, on strong hardware in rapid, are above 90% even now (check the matches here of mwyoung and corres). In 10 years they will be above 95%, with the situation approaching that of computer checkers 15-20 years ago.

Again, I don't quite understand your "Using forced openings for testing is a dirty and temporary bandage to make computer chess not be a complete bore". How should the testing of engines proceed?
Cornfed
Posts: 511
Joined: Sun Apr 26, 2020 11:40 pm
Full name: Brian D. Smith

Re: Armageddon scoring doesn't enhance the resolution power of the test suite

Post by Cornfed »

Let's face it, people who let engines play their games for them will always be looking for a 'solution' to something their approach has created.
This is why it's really only discussed on 'computer chess' forums. Seriously.

It's kind of like Christianity positing a concept called 'Original Sin'...so that it can offer - indeed be the only solution for the invented problem. If you can be convinced this invented thing is a problem, they have started the process of reeling you in. Kind of like an infomercial trying to sell you something you probably do not even need.
mmt
Posts: 343
Joined: Sun Aug 25, 2019 8:33 am
Full name: .

Re: Armageddon scoring doesn't enhance the resolution power of the test suite

Post by mmt »

Laskos wrote: Sat Sep 12, 2020 12:54 pm I don't quite understand. Do you mean that all opening suites make the game "a poor version of chess"? Or specifically these unbalanced ones and the random one, 2moves_v1? One cannot let engines play even several games from the single standard opening position without redundancy and bias. Maybe you mean using opening books? Well, then it's engine + book playing, not just the engine. I prefer to keep them separate for ease of study.
Of course having these longer opening books for classical chess is the best way to go within the current rules, and I do the same thing when testing. I also like unbalanced books. But it would be nice not to need these books to have programs start in an interesting position, and to have programs play from earlier in the game (like after 3 moves).
Laskos wrote: Sat Sep 12, 2020 12:54 pm It is very unlikely that e4 or d4 are anything other than draws according to 32 WDL tables. Yes, chess from balanced and sound openings will be even more drawish in 10 years than it is today. The draw rates of top engines (Contempt is regular) from this sort of opening, on strong hardware in rapid, are above 90% even now (check the matches here of mwyoung and corres). In 10 years they will be above 95%, with the situation approaching that of computer checkers 15-20 years ago.

Again, I don't quite understand your "Using forced openings for testing is a dirty and temporary bandage to make computer chess not be a complete bore". How should the testing of engines proceed?
The 1. d4 thing was just an example; I don't think it's a win/loss either. But if we find out that a longer opening is a draw, that's not a great thing either. And we'll find more and more of them.

I don't yet know how things would look in no-black (short) castling chess; it needs to be researched before I would actually promote it. Right now I'm just saying it's worth testing out. Seeing the relatively small differences in the number of draws that other variants produced in the research paper made this Armageddon scoring more interesting to me. No draws, a more balanced white vs black position at the start, no cutting out of openings, classical rules after the start and before scoring. Hard to do better.
mmt
Posts: 343
Joined: Sun Aug 25, 2019 8:33 am
Full name: .

Re: Armageddon scoring doesn't enhance the resolution power of the test suite

Post by mmt »

Cornfed wrote: Sat Sep 12, 2020 5:26 pm Let's face it, people who let engines play their games for them will always be looking for a 'solution' to something their approach has created.
This is why it's really only discussed on 'computer chess' forums. Seriously.
It's not though. It's more of an issue with computer chess of course but many grandmasters have talked about it. E.g. out of all Kasparov-Kramnik matches, 82% ended up in a draw.
Cornfed
Posts: 511
Joined: Sun Apr 26, 2020 11:40 pm
Full name: Brian D. Smith

Re: Armageddon scoring doesn't enhance the resolution power of the test suite

Post by Cornfed »

THAT is just a factor of:

1. TOO long a time control. See the recent online 'Carlsen Tour' series for how negatively all that time affects 'Classical' chess (ie: the draw percentage, for those who have a problem with draws) in a day and age of theory going really deep.
2. There is a lot of $$ on the line (and REALLY long time controls), which influences the 'draw vs win/loss' percentage.
3. $$ is paid out only according to where the players end up in a tourney. Having the bulk of the prize fund go to actual 'wins' would take care of that....you can always give more money to the top finisher(s).

Basically, it's not the regular game of chess itself that is a problem. It's all that revolves around it.

And this apparent 'draw problem' is only at the top level (Super GM) anyway...I don't see it in my games or any tourney I've ever played in.
mmt
Posts: 343
Joined: Sun Aug 25, 2019 8:33 am
Full name: .

Re: Armageddon scoring doesn't enhance the resolution power of the test suite

Post by mmt »

The highest level of chess is where the spectators are, though, so each draw there matters much, much more, especially if it's a boring draw. Assuming you're a good player since you participate in tournaments, can you try playing a couple of games with no black castling or no short black castling and let us know your impressions?