Old dedicated computers - calculating ratings using endgame test positions

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

gordonr
Posts: 203
Joined: Thu Aug 06, 2009 8:04 pm
Location: UK

Old dedicated computers - calculating ratings using endgame test positions

Post by gordonr »

I know this site is mainly focused on modern PC engines, but I'm hoping someone can still give their thoughts.

I'm using Franz Huber's excellent emulators of old dedicated chess computers (https://fhub.jimdofree.com/). This allows me to automate the processing of test sets via Arena.

I'm initially interested in how these old machines compare in terms of their endgame ability. Here's what I've tried:

- I made a test set of 380 endgame positions, aiming for a broad range of difficulty and endgame categories
- I verified the solutions with Stockfish using 6-man EGTBs, etc.
- I ran the automated tests and generated a PGN file, recording a "win" when the position was solved and a "loss" otherwise, e.g.

Code: Select all

[White "Mephisto TM London 68030"]
[Black "endgame - 3 hxg5+"]
[Result "1-0"]

1. *

[White "Mephisto TM London 68030"]
[Black "endgame - 4 e5"]
[Result "0-1"]

1. *
So each test position has a unique number, e.g. the "3" and "4" above, which makes every position a distinct "player" in the PGN. A sketch of generating these records follows.
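For anyone wanting to reproduce this, here's a minimal sketch of how such a PGN could be written from a table of pass/fail results (the results data and file name are hypothetical; in practice they'd come from Arena's output):

Code: Select all

# Minimal sketch: turn per-position pass/fail results into the
# "one PGN game per test" format shown above. The results data is
# hypothetical; in practice it would come from Arena's output.
results = {
    ("Mephisto TM London 68030", "endgame - 3 hxg5+"): True,   # solved
    ("Mephisto TM London 68030", "endgame - 4 e5"): False,     # failed
}

with open("endgame_results.pgn", "w") as pgn:
    for (computer, test), solved in results.items():
        result = "1-0" if solved else "0-1"
        pgn.write(f'[White "{computer}"]\n')
        pgn.write(f'[Black "{test}"]\n')
        pgn.write(f'[Result "{result}"]\n\n')
        pgn.write("1. *\n\n")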

- I then use PGN Stat to run BayesElo/EloStat/Ordo and generate a rating list (a sketch of scripting bayeselo follows the output snippet below). I suspect this setup doesn't suit Ordo so well because of the grouping?!

So I'm hoping each position also gets a "rating" based on how often it is solved by the different computers. I believe this is how chess.com works out its puzzle ratings?!

A snippet from bayeselo output:

Code: Select all

Rank Name                                            Elo    +    -    games score oppo. draws 
64 endgame - 343 Kc5                                 2208  237  180    15   80%  1935    0% 
65 endgame - 188 Bh7                                 2208  237  180    15   80%  1935    0% 
66 Mephisto TM London 68030                          2161   49   47   380   81%  1842    0% 
67 endgame - 31 d6                                   2153  213  178    15   73%  1935    0% 
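In case it's useful to anyone, bayeselo can also be driven non-interactively; here's a sketch using its documented interactive commands (the PGN file name is hypothetical, and I'm assuming the bayeselo binary is on the PATH):

Code: Select all

# Sketch: feed Remi Coulom's bayeselo its interactive commands from Python.
import subprocess

commands = "\n".join([
    "readpgn endgame_results.pgn",  # load the win/loss "games"
    "elo",                          # enter rating-estimation mode
    "mm",                           # fit ratings by minorization-maximization
    "exactdist",                    # compute the +/- error bounds
    "ratings",                      # print the list, as in the snippet above
    "x",                            # leave elo mode
    "x",                            # quit
])
output = subprocess.run(["bayeselo"], input=commands, text=True,
                        capture_output=True).stdout
print(output)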
- I then try to summarise the results in a table along with some other relevant data. The endgame ratings column uses bayeselo at the moment. I scaled the ratings so that the "Selective Search" and endgame columns have the same average (hence the matching totals at the bottom of each column; see the sketch after the table).

(Is it possible to put an image here based on a screenshot of Excel?)

Code: Select all

            Computer            Selective Endgame  Delta  Score   Ave Time              Level
Mephisto London 68030              2298     2164    -134     307       3.2    NORML 8 = 3 min/move
Saitek RISC 2500                   2232     2134    -98      299       2.8    180s/move
Saitek Sparc                       2208     2092    -116     287       3.7    e7 = 3 min/move
Novag Star Diamond                 2173     2075    -98      282       2.7    b8 = 3 min/move
Mephisto Portorose 68020           2135     2025    -110     267        2     BLITZ 9 = 60 min/game
Fidelity Designer Mach IV 2325     2075     2048    -27      274       2.5    a7 = 40 moves in 2 hrs (3 min/move)
Novag Diablo                       2002     1960    -42      246       3.4    d8 = 3 min/move
Mephisto Amsterdam                 1946     1883    -63      220       3.4    6 = 40 moves in 2 hrs
CXG Sphinx Galaxy                  1866     1817    -49      196       3.1    a8 = 3 min/move
Conchess Plymate Victoria          1865     1858     -7      211       3.7    8 = 40 moves in 2 hrs (3 min/move)
Fidelity Par Excellence            1829     1863     34      213       5.9    11 =  40 moves in 2 hrs
Saitek Turbostar 432               1760     1763     3       177       2.4    a6 = 3 min/move
Novag Super Constellation          1728     1810     82      194       2.6    7 = 1-10 min/move (40 moves in 2 hrs)
Novag VIP                          1631     1724     93      163        3     FT 8 = 3 min/move
ARB Sargon                         1320     1847    527      207       1.9    5 = 3 min/move
Totals                               29068   29063
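The rescaling mentioned above is just a constant shift so that the two columns share the same average; a minimal sketch (abbreviated lists, hypothetical values for the raw ratings):

Code: Select all

# Sketch of the rescaling: shift the raw bayeselo endgame ratings by a
# constant offset so their average matches the Selective Search average.
selective   = [2298, 2232, 2208, 2173]  # abbreviated; full column above
endgame_raw = [2310, 2280, 2238, 2221]  # hypothetical raw bayeselo output

offset = (sum(selective) - sum(endgame_raw)) / len(endgame_raw)
endgame = [round(r + offset) for r in endgame_raw]
# After the shift the column totals match (bar rounding), as in the table.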
I've just started experimenting with this approach and I'm wondering how reliable it is. I'm pondering things like: do I have enough test positions; are they varied enough; does it work OK for a small group of computers; etc. I'm aware of the large error bars.
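On the error bars: with only 15 results per position they're bound to be wide. A back-of-envelope check using the standard logistic Elo curve (a rough sanity check, not bayeselo's actual method):

Code: Select all

# Rough check: expected error bar for a puzzle with an 80% score over 15 games.
import math

p, n = 0.80, 15                      # observed score and number of games
se_p = math.sqrt(p * (1 - p) / n)    # binomial standard error of the score
# Elo curve: E(D) = 1 / (1 + 10**(-D/400)); its slope at score p is
# ln(10)/400 * p * (1 - p), so one standard error of score converts to:
se_elo = se_p / (math.log(10) / 400 * p * (1 - p))
print(f"~{se_elo:.0f} Elo per standard error")  # roughly 112

Two standard errors is around +/-225 Elo, the same order of magnitude as the +237/-180 bounds in the bayeselo snippet above.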

One thing that looks suspicious to me is how all of the negative deltas are at the top of the table?! I was expecting to see some computers performing better or worse in the endgame (compared to their overall rating) regardless of their position in the table. There is also a chance that the emulated version I chose isn't the same as the version rated in Selective Search. ARB Sargon looks suspicious in that respect.

Any thoughts on this experiment are appreciated :)
CRoberson
Posts: 2065
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Old dedicated computers - calculating ratings using endgame test positions

Post by CRoberson »

You could compare your results to this rating list https://www.schach-computer.info/wiki/i ... -Elo-Liste

Charles
User avatar
towforce
Posts: 11751
Joined: Thu Mar 09, 2006 12:57 am
Location: Birmingham UK

Re: Old dedicated computers - calculating ratings using endgame test positions

Post by towforce »

Maybe the other way around? The ratings of the dedicated computers are known - so use the dedicated computers to calculate the elo ratings of the puzzles! :)
The simple reveals itself after the complex has been exhausted.
gordonr
Posts: 203
Joined: Thu Aug 06, 2009 8:04 pm
Location: UK

Re: Old dedicated computers - calculating ratings using endgame test positions

Post by gordonr »

CRoberson wrote: Wed Jun 05, 2024 7:18 pm You could compare your results to this rating list https://www.schach-computer.info/wiki/i ... -Elo-Liste

Charles
The list you refer to is indeed very good in terms of the computers included and its reliability. However, the "Selective Search" ratings (by Eric Hallsworth) are often based on even more games, at tournament time controls. So I'm comparing my endgame ratings against the "Selective Search" ratings.
gordonr
Posts: 203
Joined: Thu Aug 06, 2009 8:04 pm
Location: UK

Re: Old dedicated computers - calculating ratings using endgame test positions

Post by gordonr »

towforce wrote: Thu Jun 06, 2024 12:04 am Maybe the other way around? The ratings of the dedicated computers are known - so use the dedicated computers to calculate the elo ratings of the puzzles! :)
:) The rating of the puzzles is a side effect of the method I'm using.

If I just score the computers based on how many puzzles they solve, I'd need a good way to translate that into a rating. And some puzzles are harder than others, so I can't make them all worth the same points.

Chess.com allows players to have a "puzzle rating". Again, this depends on how many puzzles you solve and how difficult they are. But in order to work out how difficult a puzzle is, they use the results of which players solve the puzzle and which don't. The players and puzzles are rated as part of a single ratings pool in which neither was initially known. I think this is the way it works, but I've never seen it documented.

It's the same with my endgame ratings. I don't know how good each computer is in terms of its endgame ability, and I don't know what rating to place against each endgame puzzle. But just like chess.com, the rating system should work it all out if there is enough quality data, and it's this latter part I'm unsure about in my tests (chess.com has a huge number of players solving a huge number of different puzzles).
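To illustrate the principle, here's a toy sketch of rating both sides in one pool with simple Elo-style updates (this is just an illustration, not chess.com's or bayeselo's actual algorithm, and the data is hypothetical):

Code: Select all

# Toy sketch: rate computers and puzzles in one shared pool. Each entry is
# (computer, puzzle, solved). Repeated sweeps of small Elo-style updates
# settle toward consistent ratings for both sides.
results = [
    ("Mephisto TM London 68030", "endgame - 3 hxg5+", True),
    ("Novag VIP", "endgame - 3 hxg5+", False),
    # ... one entry per computer/puzzle attempt
]

def expected(d):
    """Standard logistic Elo expected-score curve."""
    return 1 / (1 + 10 ** (-d / 400))

ratings = {}        # computers and puzzles share this single pool
K = 16              # small update step; rely on many sweeps instead

for _ in range(200):
    for computer, puzzle, solved in results:
        rc = ratings.setdefault(computer, 1800)
        rp = ratings.setdefault(puzzle, 1800)
        delta = K * ((1.0 if solved else 0.0) - expected(rc - rp))
        ratings[computer] = rc + delta
        ratings[puzzle] = rp - delta   # the puzzle "wins" when unsolved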
gordonr
Posts: 203
Joined: Thu Aug 06, 2009 8:04 pm
Location: UK

Re: Old dedicated computers - calculating ratings using endgame test positions

Post by gordonr »

Here is a PDF of an issue of Selective Search magazine; the rating list is shown on the last page. My apologies if people aren't familiar with Selective Search.

http://www.chesscomputeruk.com/SS_117.pdf