ELO calculations can be confusing

Discussion of chess software programming and technical issues.

Moderator: Ras

OliverBr
Posts: 865
Joined: Tue Dec 18, 2007 9:38 pm
Location: Munich, Germany
Full name: Dr. Oliver Brausch

ELO calculations can be confusing

Post by OliverBr »

Sometimes the ELO can be deceptive.

1) From 5.6.9 to 5.6.9a there is a gain of +8 ELO:

Code: Select all

   # PLAYER             :  RATING  ERROR  POINTS  PLAYED   (%)     W     D     L  D(%)  CFS(%)
   1 OliThink 5.6.9a    :       8      8  1946.0    3809  51.1  1169  1554  1086  40.8      97
   2 OliThink 5.6.9     :       0   ----  1863.0    3809  48.9  1086  1554  1169  40.8     ---
2) From 5.6.9a to 5.7.0 there is a gain of +12 ELO:

Code: Select all

   # PLAYER             :  RATING  ERROR  POINTS  PLAYED   (%)    W    D    L  D(%)  CFS(%)
   1 OliThink 5.7.0     :      12     10  1463.5    2835  51.6  981  965  889  34.0      99
   2 OliThink 5.6.9a    :       0   ----  1371.5    2835  48.4  889  965  981  34.0     ---
3) So we would guess the total gain from 5.6.9 to 5.7.0 is +20 ELO?
Wrong! And true at the same time.

In their head-to-head match it looks as if the gain is only +8 ELO (after being +36 after 900 games, btw, see here: http://talkchess.com/forum3/viewtopic.p ... 85#p858691)

Code: Select all

    # PLAYER            :  RATING  ERROR  POINTS  PLAYED   (%)     W     D     L  D(%)  CFS(%)
   1 OliThink 5.7.0    :       8      9  1944.5    3810  51.0  1261  1367  1182  35.9      96
   2 OliThink 5.6.9    :       0   ----  1865.5    3810  49.0  1182  1367  1261  35.9     ---
This is strange, isn't it?
But even stranger is the fact that against other engines or completely different versions, the ELO gain of 5.7.0 over 5.6.9 is indeed almost 20.
So 5.7.0 is 20 ELO stronger, but 5.6.9 is especially good against 5.7.0? Or vice versa?
OliThink GitHub: https://github.com/olithink
Nice article about OliThink: https://www.chessengeria.eu/post/olithink-oldie-goldie
Chess Engine OliThink Homepage: http://brausch.org/home/chess
Terje
Posts: 347
Joined: Tue Nov 19, 2019 4:34 am
Location: https://github.com/TerjeKir/weiss
Full name: Terje Kirstihagen

Re: ELO calculations can be confusing

Post by Terje »

The error column gives you a range that the 'true' value lies in (with 95% likelihood, I assume). 0 is inside that range for your first test, and 2 is the lower end of your second one, so the patches could have practically no gain without the test results being extreme outliers. In the other direction, 17 is still inside the range for your third test, so both of the previous gains could be 8 each, 16 combined, without being too unrealistic. Your tests are too short to say much about the exact Elo values of the versions.
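
For illustration (this is not the exact algorithm a rating tool like Ordo uses), here is a minimal C sketch of how such a 95% range can be derived from the W/D/L counts of a single match, using a normal approximation of the average game score:

Code: Select all

/* Hedged sketch: Elo estimate plus a ~95% error margin from W/D/L
 * counts, via a normal approximation of the mean score per game.
 * The counts below are the 5.6.9a vs 5.6.9 match from this thread. */
#include <math.h>
#include <stdio.h>

static double elo_from_score(double s) {
    return -400.0 * log10(1.0 / s - 1.0);
}

int main(void) {
    double W = 1169, D = 1554, L = 1086;
    double N = W + D + L;
    double mu  = (W + 0.5 * D) / N;                 /* mean score per game    */
    double var = (W * pow(1.0 - mu, 2) + D * pow(0.5 - mu, 2) + L * mu * mu) / N;
    double se  = sqrt(var / N);                     /* std. error of the mean */
    double lo  = mu - 1.96 * se, hi = mu + 1.96 * se;

    printf("Elo %+.1f  (95%% range %+.1f .. %+.1f)\n",
           elo_from_score(mu), elo_from_score(lo), elo_from_score(hi));
    /* Prints roughly: Elo +7.6  (95% range -0.9 .. +16.1),
     * i.e. the "8 +/- 8" line in the table above. */
    return 0;
}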
OliverBr
Posts: 865
Joined: Tue Dec 18, 2007 9:38 pm
Location: Munich, Germany
Full name: Dr. Oliver Brausch

Re: ELO calculations can be confusing

Post by OliverBr »

Terje wrote: Tue Sep 01, 2020 5:37 pm Your tests are too short to say much about the exact Elo values of the versions.
So you are saying that 4000 games are too short? How many games does your engine play in order to get an exact result?

BTW, this "ERROR" column doesn't look correct if you consider this result after 900 games:

Code: Select all

   # PLAYER            :  RATING  ERROR  POINTS  PLAYED   (%)    W    D    L  D(%)  CFS(%)
   1 OliThink 5.7.0    :      36     18   497.0     908  54.7  349  296  263  32.6     100
   2 OliThink 5.6.9    :       0   ----   411.0     908  45.3  263  296  349  32.6     ---
It's saying it could be between 18 and 54, right? Wrong. In the end it was 8:

Code: Select all

# PLAYER            :  RATING  ERROR  POINTS  PLAYED   (%)     W     D     L  D(%)  CFS(%)
   1 OliThink 5.7.0    :       8      9  1944.5    3810  51.0  1261  1367  1182  35.9      96
   2 OliThink 5.6.9    :       0   ----  1865.5    3810  49.0  1182  1367  1261  35.9     ---
Which means between -1 and 17. So the ERROR column itself has an error?
OliThink GitHub: https://github.com/olithink
Nice article about OliThink: https://www.chessengeria.eu/post/olithink-oldie-goldie
Chess Engine OliThink Homepage: http://brausch.org/home/chess
Terje
Posts: 347
Joined: Tue Nov 19, 2019 4:34 am
Location: https://github.com/TerjeKir/weiss
Full name: Terje Kirstihagen

Re: ELO calculations can be confusing

Post by Terje »

My most recent patch did 23k games at 10+0.1 and 13k at 60+0.6. And even then the 95% confidence range is about 1-7 for the first and 2-9 for the second. They passed an SPRT test.

It's a 95% confidence interval: in 1 out of 20 tests the true value will fall outside the range (on either end).
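
For reference, here is a rough sketch of how an SPRT decision can be computed from W/D/L counts, using the normal ("GSPRT") approximation of the log-likelihood ratio. Real testing frameworks differ in the details (draw model, game pairs), and the H0/H1 Elo values of 0 and 5 below are purely illustrative choices, not anyone's actual setup:

Code: Select all

/* Sketch of an SPRT stopping rule on W/D/L counts via the normal
 * (GSPRT) approximation of the log-likelihood ratio.  The counts are
 * the 5.7.0 vs 5.6.9a match from earlier in this thread; H0/H1 Elo
 * values and alpha/beta are illustrative assumptions. */
#include <math.h>
#include <stdio.h>

static double score_from_elo(double elo) {
    return 1.0 / (1.0 + pow(10.0, -elo / 400.0));
}

int main(void) {
    double W = 981, D = 965, L = 889;
    double N = W + D + L;
    double mu  = (W + 0.5 * D) / N;                 /* observed mean score */
    double var = (W * pow(1.0 - mu, 2) + D * pow(0.5 - mu, 2) + L * mu * mu) / N;

    double s0 = score_from_elo(0.0);                /* H0: no gain         */
    double s1 = score_from_elo(5.0);                /* H1: +5 Elo gain     */
    double llr = N * (s1 - s0) * (2.0 * mu - s0 - s1) / (2.0 * var);

    /* alpha = beta = 0.05: accept H1 above ~+2.94, accept H0 below ~-2.94 */
    double lower = log(0.05 / 0.95), upper = log(0.95 / 0.05);
    printf("LLR = %.2f  (bounds %.2f .. %.2f)\n", llr, lower, upper);
    /* Here the LLR is about 1.6: inside the bounds, so the test would
     * keep playing games rather than declare a pass or a fail. */
    return 0;
}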
OliverBr
Posts: 865
Joined: Tue Dec 18, 2007 9:38 pm
Location: Munich, Germany
Full name: Dr. Oliver Brausch

Re: ELO calculations can be confusing

Post by OliverBr »

23k is something. My biggest tourneys are 20k. It's really crazy how many games are needed to get stable numbers.

Coming back to my example: in my experience, after some 4000 to 5000 games the ELO will still change, but not by much. Surely not enough that those numbers (8 + 12 =? 8) will ever add up linearly.

This is not the first time that small ELO differences behave unpredictably. Furthermore, I have had examples where A > B and B > C, but then C > A. It's like "Rock, Paper, Scissors".
OliThink GitHub: https://github.com/olithink
Nice article about OliThink: https://www.chessengeria.eu/post/olithink-oldie-goldie
Chess Engine OliThink Homepage: http://brausch.org/home/chess
AndrewGrant
Posts: 1963
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: ELO calculations can be confusing

Post by AndrewGrant »

OliverBr wrote: Tue Sep 01, 2020 9:38 pm 23k is something. My biggest tourneys are 20k. It's really crazy how many games are needed to get stable numbers.

Coming back to my example: in my experience, after some 4000 to 5000 games the ELO will still change, but not by much. Surely not enough that those numbers (8 + 12 =? 8) will ever add up linearly.

This is not the first time that small ELO differences behave unpredictably. Furthermore, I have had examples where A > B and B > C, but then C > A. It's like "Rock, Paper, Scissors".
If you start adding up elo results from multiple tests, you start adding up error bars. Quickly.
This is why the Leela "Self-Elo" graph from years ago (still a thing?) was always an absolute cluster f*$# and presented no useful knowledge, leaving less informed people to believe insane results. Note that you probably only commit good changes, so you only take Elo results from passed tests, which already introduces a bias towards positive results into your summation.
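
As a side note (assuming the individual measurements are independent and roughly normal, and ignoring the selection bias just mentioned), the point estimates of chained tests add up, but their error bars add in quadrature, so the summed figure is much less certain than any single head-to-head result:

Code: Select all

/* Sketch: combining the +8 +/- 8 and +12 +/- 10 results from the
 * start of this thread.  Errors of independent measurements add in
 * quadrature, not linearly. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double gain1 = 8.0,  err1 = 8.0;    /* 5.6.9  -> 5.6.9a */
    double gain2 = 12.0, err2 = 10.0;   /* 5.6.9a -> 5.7.0  */

    double sum_gain = gain1 + gain2;
    double sum_err  = sqrt(err1 * err1 + err2 * err2);

    printf("chained estimate: %+.0f +/- %.0f Elo\n", sum_gain, sum_err);
    /* ~ +20 +/- 13, which overlaps comfortably with the +8 +/- 9
     * measured directly between 5.7.0 and 5.6.9. */
    return 0;
}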

If you want to know whether A > B, then play games between A and B until the result exceeds the error bars by some predetermined margin.
If you want to know how much better Z is than A, don't do Z vs Y + Y vs X + ... + B vs A. Just do Z vs A, and state that the Elo value you are presenting is derived from comparing those two versions.

Also ask yourself: do you need to know, or even care, how much better B is than A? Or is the fact that B > A enough? If you come to the right conclusion on that, then you are on the same path as Stockfish's and Ethereal's testing frameworks.

Anything else is subjective. Elo numbers mean nothing. Elo differences between two opponents mean something*.
OliverBr
Posts: 865
Joined: Tue Dec 18, 2007 9:38 pm
Location: Munich, Germany
Full name: Dr. Oliver Brausch

Re: ELO calculations can be confusing

Post by OliverBr »

Thank you very much for your contribution. Very interesting and informative.

I have a little jewel for everyone:
Actually, 5.7.0 has a little bug, a tiny one. I accidentally removed a line from the code after 5.6.9, so the move generator no longer works absolutely correctly:
When a pinned pawn promotes with a capture, it won't generate any under-promotions. Queen promotions are still OK, because those happen in "generateCaps".

Yes, this is very, very rare; it happens only when a pinned pawn captures the pinning foe on the 8th rank. Lol. Anyway, I repaired it in 5.7.0a and gave the two a 10,000-game battle:

Code: Select all

   # PLAYER             :  RATING  ERROR  POINTS  PLAYED   (%)     W     D     L  D(%)  CFS(%)
   1 OliThink 5.7.0     :       0   ----  5035.0   10000  50.4  2858  4354  2788  43.5      82
   2 OliThink 5.7.0a    :      -2      5  4965.0   10000  49.6  2788  4354  2858  43.5     ---

White advantage = -0.42 +/- 2.68
Draw rate (equal opponents) = 43.54 % +/- 0.51
The result is that the buggy version has 2 more ELO points?! Of course, the ERROR is still 5, so there is no conclusive information in this.

PS: The fix:

Code: Select all

@@ -754,6 +754,7 @@ int generateNonCaps(u64 ch, int c, int f, u64 pin, int *ml, int *mn) {
 		}
 		if (RANK(f, c ? 0x08 : 0x30)) {
 			u64 a = (t & 32) ? PCA3(f, c) : ((t & 64) ? PCA4(f, c) : 0LL);
+			if (a) regPromotions(f, c, a, ml, mn, 1, 0);
 			regPromotions(f, c, m, ml, mn, 0, 0);
 		} else {
 			regMoves(PREMOVE(f, PAWN), m, ml, mn, 0);
OliThink GitHub: https://github.com/olithink
Nice article about OliThink: https://www.chessengeria.eu/post/olithink-oldie-goldie
Chess Engine OliThink Homepage: http://brausch.org/home/chess
Dann Corbit
Posts: 12808
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: ELO calculations can be confusing

Post by Dann Corbit »

The gain in Elo for the bug is not surprising.
We see the same for refusing to underpromote to bishop and rook.
There were some big programs that would only underpromote to knight, or would not underpromote to bishop, because it is very rare.
I found this very annoying because I had studies that then became impossible for those commercial engines to solve.
Now, I don't care about the -2 Elo. I want a chess engine to play legal chess.
If it can't do that, it's really not a chess engine. It's an annoying thing that wins games.

Something that tends to be surprising to some who are new to writing a chess engine is that the ordering of underpromotions can add Elo.
The correct order for promotions is: QNRB.
If Knight is last, it will cost you Elo.
By far, Knight is the most important underpromotion (which is, I suppose, why even the balky engines kept it).
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
OliverBr
Posts: 865
Joined: Tue Dec 18, 2007 9:38 pm
Location: Munich, Germany
Full name: Dr. Oliver Brausch

Re: ELO calculations can be confusing

Post by OliverBr »

Dann Corbit wrote: Wed Sep 02, 2020 8:23 pm The gain in Elo for the bug is not surprising.
Actually, in this case it shouldn't affect ELO at all. Only promotions of pinned pawns are affected, which is very, very rare.
E.g. this position:

[d]3b4/2R1P3/8/6K1/8/8/1kp1R3/8 w - - 0 1

Code: Select all

Perft 7: 
792856141 (OliThink 5.6.9, Stockfish 11, and again the upcoming 5.7.1)
696646565 (OliThink 5.7.0)
And of course it is a bug and can lead to a crash when the other side under-promotes a pinned pawn.
Btw, pinned pawns cannot promote without a capture, so I could remove a line of code.
Dann Corbit wrote: Wed Sep 02, 2020 8:23 pm Something that tends to be surprising to some who are new to writing a chess engine is that the ordering of underpromotions can add Elo.
The correct order for promotions is: QNRB.
If Knight is last, it will cost you Elo.
By far, Knight is the most important underpromotion (which is, I suppose, why even the balky engines kept it).
Of course, it's not that difficult to see:

Code: Select all

void regPromotions(int f, int c, u64 bt, int* mlist, int* mn, int cap, int queen) {
	while (bt) {
		int t = pullLsb(&bt);
		Move m = f | _ONMV(c) | _PIECE(PAWN) | _TO(t) | (cap ? _CAP(identPiece(t)) : 0);
		if (queen) mlist[(*mn)++] = m | _PROM(QUEEN);
		mlist[(*mn)++] = m | _PROM(KNIGHT);
		mlist[(*mn)++] = m | _PROM(ROOK);
		mlist[(*mn)++] = m | _PROM(BISHOP);
	}
}
OliThink GitHub: https://github.com/olithink
Nice article about OliThink: https://www.chessengeria.eu/post/olithink-oldie-goldie
Chess Engine OliThink Homepage: http://brausch.org/home/chess
AndrewGrant
Posts: 1963
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: ELO calculations can be confusing

Post by AndrewGrant »

OliverBr wrote: Wed Sep 02, 2020 9:45 pm Of course, it's not that difficult to see:

Code: Select all

void regPromotions(int f, int c, u64 bt, int* mlist, int* mn, int cap, int queen) {
	while (bt) {
		int t = pullLsb(&bt);
		Move m = f | _ONMV(c) | _PIECE(PAWN) | _TO(t) | (cap ? _CAP(identPiece(t)) : 0);
		if (queen) mlist[(*mn)++] = m | _PROM(QUEEN);
		mlist[(*mn)++] = m | _PROM(KNIGHT);
		mlist[(*mn)++] = m | _PROM(ROOK);
		mlist[(*mn)++] = m | _PROM(BISHOP);
	}
}
It's actually quite hard to see. I'd really suggest you standardize your code before your repo gets any bigger. Otherwise you'll be posting about strange bugfixes for the next half a decade.