CCRL update (14th July 2007)

Graham Banks
Posts: 41432
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

CCRL update (14th July 2007)

Post by Graham Banks »

The July 14th update of the CCRL Rating Lists and Statistics is now available for viewing at:
http://www.computerchess.org.uk/ccrl/4040/

The links to the various rating lists can be found just beneath the default Best Versions list.
For example, there is a 32-bit Single CPU list.

Our standard testing is at 40 moves in 40 minutes repeating, while our current blitz testing is at both 40 moves in 4 minutes repeating and 40 moves in 12 minutes repeating, all adjusted to the AMD64 X2 4600+ (2.4GHz).

Currently active testers in our team are:
Graham Banks, Ray Banks, Shaun Brewer, Kirill Kryukov, Dom Leste, Tom Logan, Andreas Schwartmann, Charles Smith, George Speight, Chris Taylor, Chuck Wilson, Gabor Szots and Martin Thoresen.

A big thanks to all testers as usual for their efforts this week.


40/40 Notes

There are currently 65,377 games in our 40/40 database.
Many engines on our list have few games and in many cases their ratings are likely to fluctuate (markedly for some) until a lot more games are played. Therefore no conclusions should be drawn about their strength yet.
To illustrate this point, when an engine has 200 games played, the error margin is still approximately +/-40 ELO; after 500 games it is +/-25 ELO, after 1000 games +/-17 ELO, and even after 2000 games there is a +/-13 ELO error margin!
This of course highlights the importance of looking at other rating lists that are also available in order to draw comparisons and get a more accurate overall picture.
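
The error margins quoted above shrink roughly with the square root of the number of games. Here is a minimal Python sketch of that rule of thumb. It assumes the +/-40 ELO figure at 200 games as a reference point and pure 1/sqrt(N) scaling; the function name and the scaling rule are illustrative assumptions, not the calculation behind the published lists.

```python
import math

def scaled_margin(games, ref_games=200, ref_margin=40.0):
    """Approximate error margin after `games` games.

    Assumption: the margin shrinks as 1/sqrt(games), anchored at
    ref_margin ELO for ref_games games (the 200-game figure quoted above).
    """
    if games <= 0:
        raise ValueError("games must be positive")
    return ref_margin * math.sqrt(ref_games / games)

# Print the approximate margin for the game counts mentioned in the post.
for n in (200, 500, 1000, 2000):
    print(n, round(scaled_margin(n), 1))
```

Under these assumptions the results land within about one ELO point of the figures quoted for 500, 1000 and 2000 games, which is why halving the margin takes roughly four times as many games.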


Multi CPU Engines
Rybka 2.3.2 64-bit 4CPU currently has a tiny lead over Rybka 2.2 64-bit 4CPU.
Zap!Chess Zanzibar 64-bit 4CPU is clearly number 2 ahead of Hiarcs 11.1 4CPU, Naum 2.1 64-bit 4CPU and Loop M1-T 64-bit 4CPU.
Deep Shredder 10 64-bit 4CPU, Deep Fritz 10 4CPU and Deep Junior 10 4CPU are off the pace.


Single CPU Engines
Rybka 2.3.2 leads the ratings here as well.
A large distance back, Zap!Chess Zanzibar, Hiarcs 11.1 and Loop 13.6 are fairly closely grouped ahead of Fritz 10, Shredder 10, Toga II 1.2.1a and Strelka 1.0b.
Spike 1.2 Turin, Junior 10, Naum 2.1, Fruit 2.2.1 and Deep Sjeng 2.5 are the next closely matched group of engines.
Ktulu 8.0 and Chess Tiger 2007.1 are further adrift.


Amateur News:
Strelka 1.0b is currently level with Toga II 1.2.1a, slightly ahead of Spike 1.2 Turin.
Our current stance on the Strelka controversy is as follows: "We offer no opinion on Strelka-Rybka-Fruit controversy. Each individual must take their own view." This stance is subject to review dependent on future developments.
Glaurung 2 epsilon/5 is stronger than Glaurung 1.2.1, but needs further games to stabilise its rating.
Scorpio 1.91, Alaric 704, Delfi 5.1 and SlowChess Blitz WV2.1 are the next group of strong amateurs.
Further down the list, early indications are that the latest versions of Booot, DanaSah, Natwarlal, Buzz and Feuerstein seem to have made good gains over previous versions.
Micro-Max 4.8 now has over 200 games and maintains a rating over 2000 ELO, amazing for an engine that has fewer than 2000 characters of code!
We test a very extensive range of amateur engines through our Amateur Championship divisions (32-bit 1CPU) plus other tournaments, all of which can be followed in our public forum.
Our aim is of course to ensure that all engines lower on our lists get at least 200 games.


Blitz Notes

There are currently 150,139 games in our 40/4 database.
The 40/4 update is usually done separately to our 40/40 update. The most recent update can always be viewed here:
http://computerchess.org.uk/ccrl/404.live/


FRC Notes

Ray tests only those engines that can play FRC through the Shredder Classic GUI.
If engine authors have a new and stable version of their engine that will run under this GUI, they should contact Ray to get it tested.

For FRC the best list to look at is the pure list.
http://www.computerchess.org.uk/ccrl/404FRC/


Stats/Presentation Notes

The LOS stats on the right-hand side of each rating list are "likelihood of superiority" stats. They tell you the likelihood, in percentage terms, of each engine being superior to the engine directly below it.
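
To give a feel for what a likelihood-of-superiority figure means, here is a small Python sketch of one common approximation for a head-to-head match, based only on wins and losses (draws are ignored). The function name is hypothetical, and this is not necessarily how the LOS figures on the rating lists are calculated, since those compare engines that may never have met directly.

```python
import math

def los_head_to_head(wins, losses):
    """Approximate likelihood (0..1) that the first engine is the stronger one.

    Assumption: normal approximation on the win/loss difference of a
    single head-to-head match; draws carry no information here.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games, so no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

# Example: 60 wins against 40 losses.
print(round(100 * los_head_to_head(60, 40), 1), "%")
```

An even score gives 50%, and the figure moves towards 100% only as the win/loss gap grows relative to the number of decisive games.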

A list of games played this week per engine can be found in the update thread in the CCRL public forum, accessible through the link given at the top of this post.

All games are available for download through the link given at the top of this post. They can be downloaded by engine or by month.
ELO ratings are now saved in all game databases for those engines that have 200 games or more.

Clicking on an engine name will give details as to opponents played plus homepage links where applicable.

Custom list selections now have the option of including or excluding betas, private engines, settings and others.

An openings report page (link at bottom of index page) lists the number of games played by ECO codes with draw percentage and White win percentage. Clicking on a column heading will sort the list by that column.
The aim is to soon have games downloadable by ECO code.
Norm Pollock
Posts: 1056
Joined: Thu Mar 09, 2006 4:15 pm
Location: Long Island, NY, USA

Re: CCRL update (14th July 2007)

Post by Norm Pollock »

Hi Graham,

Is the list of "killed engines" something new? I didn't see it before, but that doesn't mean anything. Anyway, I think it is a good idea to prune games that cause statistical "noise", such as these. But isn't "kill" a bit harsh?

http://www.computerchess.org.uk/ccrl/40 ... _list.html

But how many games were "killed"? Totaling the numbers in the chart won't tell me because many of the games were internal, among the group, and many were not.

-Norm
Dariusz Orzechowski

Re: CCRL update (14th July 2007)

Post by Dariusz Orzechowski »

Norm Pollock wrote:But isn't "kill" a bit harsh?
Yes, it is. I suggest calling them "zombie engines" instead. :D
Spock

Re: CCRL update (14th July 2007)

Post by Spock »

Norm Pollock wrote:Hi Graham,

Is the list of "killed engines" something new? I didn't see it before, but that doesn't mean anything. Anyway, I think it is a good idea to prune games that cause statistical "noise", such as these. But isn't "kill" a bit harsh?

http://www.computerchess.org.uk/ccrl/40 ... _list.html

But how many games were "killed"? Totaling the numbers in the chart won't tell me because many of the games were internal, among the group, and many were not.

-Norm
The list of killed engines was introduced about a month ago :)

The total number of games killed is currently about 2,200

We try to ensure all engines on the list get at least 200 games. If a new engine version comes out quickly, and the old version only has a small number of games, then either we commit to getting it up to 200 games as well as the new version, or "kill" the old one.... As you say, the list can quickly get out of control if we don't take steps to tidy it up
Spock

Re: CCRL update (14th July 2007)

Post by Spock »

I should add, killed engines can be miraculously brought back to life with the flick of a switch :)
We have done this occasionally too, where a particular tester has wanted to resurrect an engine
rdan1987

Re: CCRL update (14th July 2007)

Post by rdan1987 »

I don't understand why Toga II 1.3X4 was put on the "killed" list. It's still the latest version of Toga, especially since Thomas Gaksch said he would not improve this engine anymore. Keeping it would be some kind of tribute to him, right?
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: CCRL update (14th July 2007)

Post by hgm »

Spock wrote:The list of killed engines was introduced about a month ago :)

The total number of games killed is currently about 2,200

We try to ensure all engines on the list get at least 200 games. If a new engine version comes out quickly, and the old version only has a small number of games, then either we commit to getting it up to 200 games as well as the new version, or "kill" the old one.... As you say, the list can quickly get out of control if we don't take steps to tidy it up
I can understand why you don't want the list to grow to unwieldy proportions, including all kinds of obsolete engine versions with poorly known ratings.

But I seriously question the statistical wisdom of removing their games from the database. These games do contain information that is still useful for narrowing down the ratings of other engines that have played them, that BayesElo would extract.

Example:
Say I have engines A and B and I play them two games each against the engines C1, C2, ... C200. Say A and B both score 50% from these gauntlets.

A and B then each have 400 games, and there is good evidence that they are equally strong. Statistically about as good as when they had played 200 games against each other, but without the systematic error that would result from playing against the same opponent too often. All the engines C1, ... C200 would have only played 4 games, though, and their ratings are hardly known at all.

But 'killing' these C engines would leave the relative strength of A and B totally undefined. It would be equivalent in terms of accuracy loss to removing 200 games between the two of them, without need or reason.

An extreme example, perhaps, to make it very obvious. But the effect will always be there, no matter how small the fraction of games thrown away is, compared to the total. These games still contain about 25% as much information as the games between 'alive' engines.
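
The 25% figure can be checked with a rough variance calculation. The Python sketch below assumes every game contributes the same independent variance to a strength estimate and that A and B face the same pool of opponents, so the pool's own strength cancels out of the A-minus-B difference. The function names and the unit per-game variance are illustrative assumptions, not anything BayesElo itself exposes.

```python
def variance_direct(games, per_game_var=1.0):
    """Variance of the A-minus-B estimate from `games` direct A-vs-B games."""
    return per_game_var / games

def variance_via_gauntlet(games_each, per_game_var=1.0):
    """Variance of A-minus-B when A and B each play `games_each` games
    against the same pool: two independent estimates, so variances add."""
    return per_game_var / games_each + per_game_var / games_each

def information_per_game(variance, total_games):
    """Information (inverse variance) gained per game played."""
    return 1.0 / (variance * total_games)

gauntlet_var = variance_via_gauntlet(400)   # 800 games in total
direct_var = variance_direct(200)           # 200 games in total
ratio = information_per_game(gauntlet_var, 800) / information_per_game(direct_var, 200)
print(gauntlet_var, direct_var, ratio)
```

Under these assumptions the two 400-game gauntlets pin down the A-B difference as tightly as 200 direct games, so each gauntlet game carries a quarter of the information of a direct game.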
Mike S.
Posts: 1480
Joined: Thu Mar 09, 2006 5:33 am

Re: CCRL update (14th July 2007)

Post by Mike S. »

There is a chance that Toga 1.3X4 is a little bit better than 1.2.1a. If we take a look at the CEGT blitz ratings:

37 Toga II 1.3x4          2813 24 24  500 42.4 % 2866 39.6 % 
(...)
44 Toga II 1.2.1a         2802 10 10 3364 50.6 % 2798 32.2 % 
45 Toga II 1.2 Beta2a e26 2800 18 17 1074 57.5 % 2748 30.2 % 
(For me, it would not be important to have results from tests with all the 300+ MB bitbases, or without them if this is easier for the testers considering available RAM.)
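
As a rough check on how much the CEGT figures quoted above really say, the following Python sketch compares two ratings using their error margins. It assumes the two numbers after each rating are symmetric +/- margins and that the two ratings are independent, so the margins combine in quadrature; both are simplifying assumptions, and the function name is hypothetical.

```python
import math

def difference_with_margin(rating_a, margin_a, rating_b, margin_b):
    """Return the rating difference and a combined margin for it.

    Assumption: independent, symmetric margins, combined as sqrt(a^2 + b^2).
    """
    diff = rating_a - rating_b
    combined = math.hypot(margin_a, margin_b)
    return diff, combined

# Toga II 1.3x4 (2813, margin 24) against Toga II 1.2.1a (2802, margin 10).
diff, combined = difference_with_margin(2813, 24, 2802, 10)
print(diff, combined, abs(diff) > combined)
```

With these inputs the difference is well inside the combined margin, which fits the cautious wording "there is a chance" rather than a firm conclusion.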

Furthermore, it is surprising that Deep Junior 10.1 2 CPU is on the kill list. Isn't that the latest version? Why no dual tests? And if I'm not wrong, Deep Fritz 10 2 CPU is missing completely. Don't get me wrong, of course I am much obliged for so many test ratings and the great data selection and comparison features. It's just not always easy to compare engines and (other) ratings if, e.g., dual results of major engines are missing, or 32-bit, etc. Just an observation. Thanks anyway.
Regards, Mike
Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: CCRL update (14th July 2007)

Post by Uri Blass »

Mike S. wrote:There is a chance that Toga 1.3X4 is a little bit better than 1.2.1a. If we take a look at the CEGT blitz ratings:

37 Toga II 1.3x4          2813 24 24  500 42.4 % 2866 39.6 % 
(...)
44 Toga II 1.2.1a         2802 10 10 3364 50.6 % 2798 32.2 % 
45 Toga II 1.2 Beta2a e26 2800 18 17 1074 57.5 % 2748 30.2 % 
(For me, it would not be important to have results from tests with all the 300+ MB bitbases, or without them if this is easier for the testers considering available RAM.)

Furthermore, it is surprising that Deep Junior 10.1 2 CPU is on the kill list. Isn't that the latest version? Why no dual tests? And if I'm not wrong, Deep Fritz 10 2 CPU is missing completely. Don't get me wrong, of course I am much obliged for so many test ratings and the great data selection and comparison features. It's just not always easy to compare engines and (other) ratings if, e.g., dual results of major engines are missing, or 32-bit, etc. Just an observation. Thanks anyway.
As far as I know, Toga II 1.3x4 is relatively better at blitz.
There was a beta version (Toga II 1.3 Beta1 32-bit) that was relatively better at long time controls, based on CCRL tests, but it was not released.

From the CEGT 40/20 list

59 Toga II 1.3x4 egbb 2804 25 26 394 42.4 % 2857 45.2 %
62 Toga II 1.2.1a 2800 8 8 5071 53.2 % 2778 35.5 %
66 Toga II 1.3x4 2795 27 28 383 41.0 % 2858 38.6 %

From the CCRL 40/12 list

2 Toga II 1.3 Beta1 32-bit (this version is not public) 2860 +18 −18 60.0% −69.9 35.5% 1061
61.5%
Toga II 1.3x4 HT70 2845 +23 −22 61.8% −84.2 31.2% 686
67.3%
Toga II 1.2.1a 32-bit 2838 +18 −18 57.4% −50.2 35.2% 1032
47.8%
Toga II 1.3x4 2838 +23 −22 60.9% −78.8 31.1% 666
51.8%
Norm Pollock
Posts: 1056
Joined: Thu Mar 09, 2006 4:15 pm
Location: Long Island, NY, USA

Re: CCRL update (14th July 2007)

Post by Norm Pollock »

hgm wrote:
Spock wrote:The list of killed engines was introduced about a month ago :)

The total number of games killed is currently about 2,200

We try to ensure all engines on the list get at least 200 games. If a new engine version comes out quickly, and the old version only has a small number of games, then either we commit to getting it up to 200 games as well as the new version, or "kill" the old one.... As you say, the list can quickly get out of control if we don't take steps to tidy it up
I can understand why you don't want the list to grow to unwieldy proportions, including all kinds of obsolete engine versions with poorly known ratings.

But I seriously question the statistical wisdom of removing their games from the database. These games do contain information that is still useful for narrowing down the ratings of other engines that have played them, that BayesElo would extract.

Example:
Say I have engines A and B and I play them two games each against the engines C1, C2, ... C200. Say A and B both score 50% from these gauntlets.

A and B then each have 400 games, and there is good evidence that they are equally strong. Statistically about as good as when they had played 200 games against each other, but without the systematic error that would result from playing against the same opponent too often. All the engines C1, ... C200 would have only played 4 games, though, and their ratings are hardly known at all.

But 'killing' these C engines would leave the relative strength of A and B totally undefined. It would be equivalent in terms of accuracy loss to removing 200 games between the two of them, without need or reason.

An extreme example, perhaps, to make it very obvious. But the effect will always be there, no matter how small the fraction of games thrown away is, compared to the total. These games still contain about 25% as much information as the games between 'alive' engines.
I think you answered it yourself when you said:
"All the engines C1, ... C200 would have only played 4 games, though, and their ratings are hardly known at all."

Their tentative elo ratings will be based on only 4 games plus the standard initial elo value that all engines start from, which is input by the user. Those ratings may well be very inaccurate. They will then influence A and B's elo ratings, with a ripple effect until all engines in the cluster are affected.

I would not have confidence in such ratings. A chain is only as strong as its weakest link, and in this case, having 200 weak elo ratings (weak in terms of reliability) is like having 200 weak links. Not good.