ChessWar XI Promotion : list of participants

Discussion of computer chess matches and engine tournaments.

Moderator: Ras

hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Rating-scale problem

Post by hgm »

pijl wrote:I don't think the Elo calculations are incorrect. I think the bigger problem is that the game collections on which the rating lists are based are not ideal. Usually there are only a few games with large Elo differences, so I guess you cannot expect the scaling of the list to be correct based on those few games.
Richard.
In most cases you would be right, but this Promo does have an unusually large Elo range among its participants. And the Elo ratings most likely come from the previous edition of this Promo, as most of the engines do not participate in any other event. In that earlier Promo, the first round involved games between just as widely spaced opponents as are being played now. That is about 9% of the games, which should be enough to fix the scale properly.

But I guess I am starting to understand the cause of this: the ratings were most likely obtained by applying BayesElo to the results of the previous Promo, and BayesElo uses a prior that assumes all participants are equally strong. That assumption doesn't seem to be valid here at all.

So I guess the initial ratings are indeed wrong, because they were calculated using an improper prior setting in BayesElo.
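
As a toy illustration of how such a prior compresses the scale (assuming here, purely for the sketch, that the prior acts like a few virtual draws inserted between the opponents; BayesElo's actual implementation differs in its details):

Code:

import math

def expected_score(diff):
    # Elo expected score of the player rated 'diff' points below the opponent
    return 1.0 / (1.0 + 10.0 ** (diff / 400.0))

def log_likelihood(diff, wins, losses, virtual_draws=0):
    # Each virtual draw counts as half a win plus half a loss for the weak side
    w = wins + 0.5 * virtual_draws
    l = losses + 0.5 * virtual_draws
    p = expected_score(diff)
    return w * math.log(p) + l * math.log(1.0 - p)

def ml_gap(wins, losses, virtual_draws=0):
    # Grid search for the rating gap that maximizes the likelihood
    return max(range(0, 2001), key=lambda d: log_likelihood(d, wins, losses, virtual_draws))

print(ml_gap(1, 39))                   # no prior: a gap of ~640 Elo
print(ml_gap(1, 39, virtual_draws=4))  # with the draw prior: only ~450 Elo

Even a handful of virtual draws pulls widely separated ratings strongly together.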
Olivier Deville
Posts: 937
Joined: Wed Mar 08, 2006 9:13 pm
Location: Aurec, France

Re: Rating-scale problem

Post by Olivier Deville »

Hi hgm

I would happily give a rating under 1000 to some of the weakest engines, but my pairing software won't accept a rating lower than 1000 points.

Anyway, I don't care too much about Elo ratings. They are useful for doing the pairings, and that's all that counts for me.

I should add that I am very lousy at maths :)

Olivier
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Rating-scale problem

Post by hgm »

OK, I noticed that you bottomed out all ratings at 1000, and now I understand why. Note that this is not the source of the problem, though: I only considered the first 58 games (the others were not played yet :lol: ), and in those none of the participants rated 1000 played.

It really is something fundamental in BayesElo. I have become very interested in the low end of the rating scale, and your tournaments are the most extensive source of games between very weak engines. But it means I should not take your ratings at face value (although they are certainly good enough for pairing purposes), but really re-calculate them myself, after figuring out and fixing the problem with the BayesElo prior.
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Rating-scale problem

Post by hgm »

Well, I tried the following:

Based on the results from the ChessWar X Promo, I extracted ratings with BayesElo using prior = 0. The ratings then spread out over about twice as large a range as with the default setting, namely from -1400 to +1000. I threw away all engines (versions) with 100% scores, as with prior = 0 their ratings are indeterminate.

Then I used those ratings for the current Promo pairings. I took only the games between engines that had both received a rating from the previous Promo. This left 42 games. The (new) rating difference in these games ranged from 506 to 1407 points, 916 on average. Now with the score formula used by BayesElo (expected score = 100% / (1 + 10^(RatingDifference/400))), calculating the number of points the weak group is expected to salvage (on a per-game basis, summed over all games), we get 0.476 points out of 42 games. So if the prior-less ratings are any good, we should expect about one draw.
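
The calculation is easy to reproduce; a minimal sketch (the gaps below are placeholders, not the actual 42 pairings):

Code:

def expected_score(diff):
    # Expected score of the player rated 'diff' points below the opponent
    return 1.0 / (1.0 + 10.0 ** (diff / 400.0))

# Placeholder rating gaps; the real calculation sums over all 42 games
rating_gaps = [506, 916, 1407]
total = sum(expected_score(d) for d in rating_gaps)
print(f"weak side expected to salvage {total:.3f} points in {len(rating_gaps)} games")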

In fact, we see one win. (Of the 3 'surprises' in the first Promo round, only Mooboo-BaChess had both participants rated; Trynyty-Vicki and Ananke-Akiba were dropped because Ananke and Vicki are new.) This is not really a significant difference, and if you look at the game, even that win is suspicious: BaChess won on time in a totally lost position, because neither engine claims or recognizes repetition draws, and they were just endlessly repeating the same position...

In fact the other 'surprises' were just as suspicious: Akiba managed to win on time because Ananke crashed after 9 moves, while Ananke was already a Knight ahead. Such wins really don't tell you anything about the rating difference: Ananke would have lost this game no matter how poor the opponent was, because even a random mover would not let itself be checkmated (by Ananke) within 9 moves. (Note that Eden2 did manage to get itself checkmated by Blikskottel in 4 moves, though!)

So in conclusion, it seems that the rating differences are really much larger than even the results of this Promo round (2.5 out of 73 = 3.4%) suggest, as the score of the weaker group seems to be due more to problems with the stronger engines ('defeating themselves' by crashing, or by opponent-independent idiocy) than to any skill of the opponents.

If we use the rating of TSCP 1.81c given in Olivier's list (1638) as a calibration point, it means the bottom of the list ends somewhere around an Elo of 0:

Code:

Rating  Engine
1804	Storm 0.6
1756	Mooboo 0.2b
1740	Milady 2.15
1739	MiniMax
1738	Damas 7c
1712	ChessRikus 1.4.66
1707	Clarabit 0.18
1688	Atlanchess 3.3
1675	Simple 0048
1657	Pooky 2.7
1638	TSCP 1.81c
1626	PolarChess 1.3
1614	Jester 0.83
1609	Golem 0.4
1591	SharpChess2 2.52
1587	Dimitri 1.34e
1587	Milady 2.1
1576	Simon 1.2
1552	Beaches 2.2
1546	JChess 1.0
1531	Hokus Pokus 0.6.3
1522	Hoplite 2.1.1
1515	LarsenVB 0.05.01
1514	Bace 0.45
1504	Roque 1.1
1470	Lovelace 1.0r1
1462	Jupiter 001
1458	Gedeone 1620
1431	MSCP 1.6g
1422	Piranha 0.5
1413	Rainman 0.7.5
1408	Pentagon 1.2
1364	Alice 0.3.5
1360	Cefap 0.72
1359	Nero 6.1
1318	The Lightning 2.04
1306	Skaki 1.22
1294	Braincrack
1272	Yawce 0.16
1263	APILchess 1.05r1b
1261	StAndersen 3.1
1247	Blikskottel 0.7
1245	Murderhole 1.0.10
1241	MiniMardi 1.3
1219	Eden 0.0.11_server
1218	Exacto 0.d
1209	Ozwald 0.43
1206	Sinapse 1.1
1175	MiniChessAI 1.19
1171	Zephyr 0.61
1162	Excelsior 2.32b
1161	Trueno 1.0
1132	SCP 1.0b
1104	Raffaela 0.14
1092	T.rex 1.9b
1073	Stan's Chess 1.42
1030	KillerQueen 2b3
1024	Carnivor
 987	Pierre 1.7
 987	BaChess 1.3
 955	Tikov 0.6.3
 949	BremboCE 0.4
 944	Gringo 1.4.7
 940	DarkFusch 0.9
 904	SharpChess 0.0.6
 901	O'Chess
 897	Brama 051204
 888	Matilde 2.6.1
 887	Turing
 881	Cassandre 0.24
 876	Blitzter 2.0
 876	Crux 5.0m
 875	RoboKewlper 0.047a
 855	Marquis 0.1.5
 838	ZChess2 2004
 836	Joana
 806	Kace 0.8.1
 787	Youk 1.05
 777	Mystery 2.1
 769	LTK 2.0
 768	Dimitri 1.35e
 749	Geko 0.4.3
 730	MFChess 1.1
 723	JaksaH 0.17
 685	BabyChess 11.1
 679	Pyotr Amateur 0.6
 660	Tiffanys 0.2
 655	Talvmenni 0.1
 652	BigBook 3.1
 650	Xadreco 5.0
 642	StrategicDeep 1.31
 619	Chad's Chess 0.15
 616	NSVChess 0.14
 582	Trynyty 1.0
 569	Neophyte 0.1
 547	Testina 2.2
 512	Dreamer 0.1.0
 490	Koenig Schwarz
 487	Fianchetto
 419	Cheops 1.1
 381	Belofte 0.2.8
 363	Protej 0.5.3
 353	CS4210
 346	Usurper 0.5
 322	Akiba 0.0.20031118
 317	Sachy 0.2
 293	GiuChess 1.01b1
 290	PreChess 0.7.8
 194	LaMoSca 0.10
  37	POS 1.10
 -66	Etabeta 7.21
 -78	RattateChess 0.666a
 -94	Omar 3.1
-102	ECE 0.1
-156	CPP1 0.1038
-213	Eden2
-236	Gray Matter
The second Promo round will provide a nice test of whether these huge rating differences are real, as now engines will be paired that are much closer in rating. The scores between the weaker and stronger groups should then stay far enough from 100% that meaningful conclusions about the rating differences can be drawn, rather than all scoring being due to flukes.
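
For completeness: the calibration itself is only a uniform shift, since BayesElo ratings are defined up to an additive constant anyway. A sketch (the 'raw' values are hypothetical BayesElo outputs, chosen only so that the shift reproduces three entries of the list above):

Code:

# BayesElo ratings float freely, so pin them to a known engine's rating
raw = {"TSCP 1.81c": -250, "Storm 0.6": -84, "Gray Matter": -2124}
offset = 1638 - raw["TSCP 1.81c"]   # anchor TSCP at its rating in Olivier's list
calibrated = {name: r + offset for name, r in raw.items()}
print(calibrated)  # {'TSCP 1.81c': 1638, 'Storm 0.6': 1804, 'Gray Matter': -236}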
Olivier Deville
Posts: 937
Joined: Wed Mar 08, 2006 9:13 pm
Location: Aurec, France

Re: Rating-scale problem

Post by Olivier Deville »

Hi hgm

Many thanks for the very interesting data !

Next time somebody complains that my ratings are too low (this happens from time to time), I'll give a link to this thread :wink:

Olivier
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Rating-scale problem

Post by Michael Sherwin »

Yea! Carnivor still has 4 digits!!! :D
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Rating-scale problem

Post by hgm »

And even a 'magic number'
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Rating-scale problem

Post by hgm »

OK, second-round results are in. In the first round the average opponent distance was so large, and the results so close to 100%, that a meaningful estimate of the rating difference couldn't be given.

The second round matched more closely spaced engines. There were 39 games between engines that I had rated through the previous Promo (the list above). According to the rating formula, the group of weakest engines should have scored 3.04 pts (out of 39). Similarly, with the ratings used for the ranking, the weaker engines should have scored 18.5 out of 73.

In practice, the weaker engines scored 9 draws and 8 wins (12.5 points) out of the 73 round-2 games. This suggests that the rating differences in Olivier's list are a factor 1.5 too small (see the sketch at the end of this post). The ratings from my list predict a much larger rout of the weak engines, though: in the 39 games where both opponents were rated by me, the weak engines scored 6 wins and 2 draws (7 points, instead of the predicted 3). Is this evidence that the ratings I extracted from the previous Promo without using a prior are spread out much too widely?

Well, perhaps a little. Looking at the 'surprise' games shows that some of them should be ignored, as they contain no information on the quality of the weaker engine. Belofte 'beat' KillerQueen because the latter crashed after 6 moves; that doesn't mean Belofte is stronger than we thought. Cheops vs Crux, same thing: the higher-rated Crux crashed after 10 moves. Tiffanys beat Zephyr, but in fact it was totally demolished by it. Zephyr, however, does not correctly claim checkmate, so it forfeits on time in every game where it checkmates the opponent. So although Zephyr has a substantial rating, because it can score draws against quite strong opponents, it always 'loses' to very weak opponents, as these get quickly checkmated by it. A win against Zephyr is thus no evidence of strength.

GiuChess was awarded a win against the stronger Mystery, while at +7 in a repetition-draw situation; this seems to be an adjudication mistake. SharpChess forfeited on time against Talvmenni in a totally won position. Such wins cannot be totally dismissed, as they are evidence that Talvmenni is at least strong enough to last until the time control. But they cannot be seen as evidence that Talvmenni can sometimes outplay SharpChess on a good day (as ascribing Talvmenni a higher rating would suggest).

So many of the points scored by the lower-rated engines can in fact be explained away as anomalies. To be fair, there were some wins by the lower-rated engines on merit (Testina-ChessCraft, Vicki-Mooboo, and ApilChess-Jester, although the latter was a time forfeit on move 40 in an equal position). All draws were on merit.

Now if I ignore the 3 meaningless wins, and correct the mis-adjudicated game to a draw, the weak engines scored only 3.5 points. And that is with Talvmenni still getting the benefit of the doubt! Predicted was 3 (actually 2.9 after removing the 3 games).

So the bottom line seems to be: the ratings I listed above give a quite realistic prediction of the win probability, even for engines somewhat closer to each other on the list. But the win probability should be increased somewhat compared to what the rating model predicts, to account for opponent crashes (which make the win probability bottom out at a non-zero value, no matter how weak you are).
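
As an aside, the 'factor 1.5' above can be obtained numerically: scale all the listed rating differences by a common factor until the predicted total of the weak group matches what was observed. A sketch with made-up placeholder numbers (the actual round-2 pairings would go in 'gaps'):

Code:

def expected_score(diff):
    return 1.0 / (1.0 + 10.0 ** (diff / 400.0))

def predicted_total(gaps, scale):
    # Total expected score of the weak side with all gaps scaled
    return sum(expected_score(scale * d) for d in gaps)

def fit_scale(gaps, observed, lo=0.1, hi=10.0):
    # predicted_total falls as the scale grows, so bisection works
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if predicted_total(gaps, mid) > observed:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

gaps = [200, 300, 400, 500]  # placeholder listed rating differences
observed = 0.3               # placeholder observed weak-side total
print(f"listed differences understated by a factor {fit_scale(gaps, observed):.2f}")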
Olivier Deville
Posts: 937
Joined: Wed Mar 08, 2006 9:13 pm
Location: Aurec, France

Re: Rating-scale problem

Post by Olivier Deville »

Hi hg !

Thanks again for the very interesting analysis.

When you choose to take every engine in, even the crappiest and buggiest ones, you must expect some crashes and losses on time :) Two side notes:

- Zephyr has many bugs (losses on time at move 40, promotion to a King), but not the one you describe. I checked the debug output and there was no move that would have given mate in Zephyr's PV. It simply stopped thinking and lost on time.

- Mystery crashed vs GiuChess, which is why the game was awarded to the opponent. The engines did indeed repeat moves, so adjudicating the game as a draw would have been fine too.

Olivier
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Rating-scale problem

Post by hgm »

OK, thanks for the clarifications. I guess there will always be a certain arbitrariness in cases like this.

Omar definitely has that bug: it resigns when it can checkmate in one. (This is hardly a problem, because it almost never can. :lol: )

Anyway, playing such that you can be checkmated in one, only to win because the opponent crashes (for whatever reason), doesn't tell anything about your strength, as even infinitely weak engines can do that. So the game is a dud as far as rating determination goes. The crash is still likely to have been caused by the fact that checkmate was possible, even if the mating move was not in the PV.

Wins due to opponent crashes simply do not tell anything about the strength of an engine. (They do of course reduce the rating of the engine that crashes.) To improve on BayesElo, the rating model should really allow for a finite crash probability. This would be fully accommodated by not letting the win probability go asymptotically to zero for infinite rating difference, so that the rating of infinitely weak engines is not perturbed so much by an occasional win against a somewhat less weak engine.
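
A sketch of what I mean, assuming for illustration a fixed per-game probability c that the stronger engine crashes (or otherwise throws the game away) independently of the opponent:

Code:

def elo_expected(diff):
    # Standard Elo expectation: goes to zero as the gap grows
    return 1.0 / (1.0 + 10.0 ** (diff / 400.0))

def expected_with_crash_floor(diff, c=0.02):
    # With probability c the opponent self-destructs and the weak side
    # scores a full point; otherwise the normal Elo expectation applies
    return c + (1.0 - c) * elo_expected(diff)

for d in (400, 800, 1600):
    print(d, round(elo_expected(d), 4), round(expected_with_crash_floor(d), 4))

No matter how large the rating difference gets, the expected score of the weak side now bottoms out near c instead of at zero.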

The other shortcoming of BayesElo is the application of the prior. This is fine for round robins, but totally wrong for sparsely connected clusters, like those that arise in a Swiss tournament. I am still working out the math for a fundamentally correct way to apply the prior. But the Promo seems to contain enough games that setting the prior to zero does not lead to a large exaggeration of the rating differences.