The "ultimate" engine rating list

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

ChickenLogic
Posts: 154
Joined: Sun Jan 20, 2019 11:23 am
Full name: kek w

The "ultimate" engine rating list

Post by ChickenLogic »

As it stands now, we have many individuals testing engines however they feel like. Some barely play 100 games and call it a day. Other people, like ipmanchess, invest much more CPU time to get more accurate results, at the cost of only being able to test a few selected engines. CCRL, on the other hand, tests almost every engine under the sun but in doing so sacrifices accuracy. A single loss by a strong engine, of the kind that may occur only once in 2 million games, can impact its rating quite a lot. The obvious issue is that not enough games are played for these statistical anomalies to be ironed out.
Either way, it is obvious that the problem every tester faces is a lack of hardware for the number of games they would need. Some believe that longer time controls reduce noise, but I think the opposite happens: with longer time controls the differences get smaller, and we may need even more games to conclude anything.
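To give a rough idea of the scale involved, here is a back-of-the-envelope estimate of how many games it takes to resolve a given Elo difference (the draw ratio and the 2-sigma confidence level are just illustrative assumptions):

Code: Select all

import math

def elo_to_score(elo_diff):
    # Expected score of the stronger engine for a given Elo gap (logistic model).
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, draw_ratio=0.6, sigmas=2.0):
    # Rough number of games to separate two engines elo_diff apart at ~2 sigma.
    # The draw ratio of 0.6 is an illustrative guess; draws shrink the variance.
    p = elo_to_score(elo_diff)
    var = p * (1.0 - p) * (1.0 - draw_ratio)   # approximate per-game score variance
    margin = abs(p - 0.5) / sigmas             # required standard error of the mean
    return math.ceil(var / margin ** 2)

for elo in (20, 10, 5, 2):
    print(f"{elo:>2} Elo -> about {games_needed(elo)} games")

A 5 Elo gap already needs thousands of games under these assumptions, and halving the gap roughly quadruples the requirement.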

Now, wouldn't a rating list of every¹ engine, with a decently long time control and high accuracy, be awesome to have? YES! Obviously.

Instead of waiting for somebody to provide multiple dual-Epyc servers, we could build our own distributed rating list. Such a list is obviously only possible for freeware and free and open-source software, since I doubt the authors of commercial engines would be happy to provide every tester with a copy.
The other problem we face is one of trust. Just because something is freeware doesn't mean it can't be injected with malicious code. A FOSS engine, on the other hand, can be checked for such things before it enters the list. From there it would be distributed and compiled on many different machines, and the time control scaled according to NPS and a chosen baseline (see the sketch below). This way every engine effectively plays at the same time control.
Unlike Fishtest, we wouldn't have the luxury of SPRT. We'd need a fixed number of games per pairing to properly build the rating list.
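As a sketch of the NPS-based scaling mentioned above (the baseline figures are placeholders, not a proposal for actual values):

Code: Select all

# Scale a time control by measured speed, so a fast and a slow worker give
# each engine roughly the same search effort per game.
BASELINE_NPS = 1_000_000      # hypothetical reference machine: 1 Mnps on a benchmark
BASELINE_TC  = (60.0, 0.6)    # 60s + 0.6s increment at baseline speed (made up)

def scaled_time_control(measured_nps, baseline_nps=BASELINE_NPS, baseline_tc=BASELINE_TC):
    # Slower machine -> factor > 1 -> more wall-clock time, same effort.
    factor = baseline_nps / measured_nps
    base, inc = baseline_tc
    return (base * factor, inc * factor)

# Example: a worker that benchmarks at 2.5 Mnps gets a proportionally shorter TC.
print(scaled_time_control(2_500_000))   # -> (24.0, 0.24)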

Another issue that would be resolved (one that every rating list faces right now) is the trust issue that centralised power always brings. With a transparent framework and clear rules applied across the machines of many different users, the possibility of manipulation is reduced to a minimum.

TL;DR: Fishtest, OpenBench, Leela and BOINC have proven that open, distributed frameworks work like a charm. It would be a gain for everybody to have a rating list with the accuracy such a framework provides.

Cheers.

¹ For simplicity I would recommend that, at first, we only try to test FOSS engines.
Guenther
Posts: 4718
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: The "ultimate" engine rating list

Post by Guenther »

ChickenLogic wrote: Tue Jan 04, 2022 2:21 pm …
You forgot one thing completely: the 'trust' factor for the testers. Even though I think the overwhelming majority don't 'fake' their results/games, they do unintentionally skew them - there are a lot of things that can be done wrong, and everything that can be done wrong has already been done wrong; there are dozens of examples in this forum alone, even from people one would have thought knew how to test correctly...
Well, one could say those errors would drown in the sheer number of games, but I don't buy it.
(It might work for improving one single engine, but it won't for a rating list with hundreds/thousands of different engines/versions.)

It would need an enormous sanity check of all delivered game files, and it would need much more transparency than has ever been achieved by most rating lists.
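For illustration, a minimal sketch of the kind of per-game check this would need, using python-chess (the required tag list is just an example) - and even this only scratches the surface:

Code: Select all

# Minimal sanity check over delivered PGN files: every move must be legal and
# a few tags must be present. Real validation would need far more than this.
import chess.pgn

REQUIRED_TAGS = ["White", "Black", "Result", "TimeControl"]   # illustrative choice

def check_pgn(path):
    problems = []
    with open(path, encoding="utf-8", errors="replace") as f:
        game_no = 0
        while True:
            game = chess.pgn.read_game(f)
            if game is None:
                break
            game_no += 1
            for tag in REQUIRED_TAGS:
                if tag not in game.headers:
                    problems.append(f"game {game_no}: missing tag {tag}")
            board = game.board()
            for move in game.mainline_moves():
                if move not in board.legal_moves:
                    problems.append(f"game {game_no}: illegal move {move}")
                    break
                board.push(move)
    return problems

# Example: print(check_pgn("delivered_games.pgn"))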

(BTW Ipman tests at sudden death it seems, which is already a no-go IMHO)
https://rwbc-chess.de

[Trolls don't exist...]
amanjpro
Posts: 883
Joined: Sat Mar 13, 2021 1:47 am
Full name: Amanj Sherwany

Re: The "ultimate" engine rating list

Post by amanjpro »

Guenther wrote: Tue Jan 04, 2022 2:42 pm …
That, and most of the rating lists are run because the people behind them enjoy maintaining them. Moving them to a boring distributed and centralized system takes the joy away completely.
ChickenLogic
Posts: 154
Joined: Sun Jan 20, 2019 11:23 am
Full name: kek w

Re: The "ultimate" engine rating list

Post by ChickenLogic »

amanjpro wrote: Tue Jan 04, 2022 2:56 pm …
Just because a distributed framework exists doesn't mean that others can't do their own thing. I don't understand your point at all. If distributed frameworks had the effect you say they have, we'd only have Stockfish and maybe Leela right now. But in fact many new engines have been popping up lately, so I don't think such frameworks extinguish the work of other people.
In addition, a lot of people do enjoy contributing to these kinds of frameworks. Saying those people can't create such a list because others would stop having fun is kind of narrow-minded. These things can co-exist. Please don't see ghosts where there are none.
amanjpro
Posts: 883
Joined: Sat Mar 13, 2021 1:47 am
Full name: Amanj Sherwany

Re: The "ultimate" engine rating list

Post by amanjpro »

ChickenLogic wrote: Tue Jan 04, 2022 3:05 pm …
I thought you were suggesting combining all the lists into one using a distributed system, not introducing yet another rating list. In this case, yeah you are right, my bad :)
xr_a_y
Posts: 1872
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: The "ultimate" engine rating list

Post by xr_a_y »

The project is interesting at least!

Indeed, we know how to distribute games (see OB for instance), but that uses a centralized system (there is one "server").
It would be fun to come up with a fully distributed system (P2P-like).
Another fun thing would indeed be the ability to verify things. This could come from a "validation hash" (like the node count in OB; see the sketch below), but we would also need to check the expected performance on various systems (from the engine point of view). Other things would have to be validated from the "client" point of view (for instance, not trying to use more threads than are currently available given the machine load).
"Post" validation could also come from including or excluding particular output results (PGNs) from the distributed database.
Engine authors will probably have to conform to a build process, or give a clear recipe for how the engine should be built.
In any case, this would probably increase "standardisation" in the community, at least around the build process.
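A sketch of what such an OpenBench-style "validation hash" could look like: run the engine's fixed bench and compare the reported node count against the value published for that exact revision (the expected count and the output parsing below are only assumptions; the format differs per engine):

Code: Select all

# Build check: run './engine bench' and compare its node count against the
# value published for this exact source revision.
import subprocess

EXPECTED_BENCH_NODES = 4_735_124   # hypothetical value for this revision

def bench_nodes(engine_path):
    # Output parsing is engine-dependent; this expects a "Nodes searched" line.
    out = subprocess.run([engine_path, "bench"], capture_output=True, text=True, timeout=300)
    text = out.stdout + out.stderr
    for line in reversed(text.splitlines()):
        if "nodes searched" in line.lower():
            digits = [t for t in line.replace(":", " ").split() if t.isdigit()]
            if digits:
                return int(digits[-1])
    raise RuntimeError("no node count found in bench output")

def build_is_valid(engine_path):
    return bench_nodes(engine_path) == EXPECTED_BENCH_NODES

# Example: print(build_is_valid("./stockfish"))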

The ability to broadcast games would also be very fun to have. Just imagine a web page where you can watch any running (or past) game on any remote machine.

All this should of course be done while preserving the security of the running machine. It would be too easy to compile code with bad intentions on any machine, so these things must be "isolated" (maybe using a container, for instance).
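One possible way to get that isolation, as a sketch (the image name and paths are made up; the Docker flags themselves are standard options):

Code: Select all

# Run an untrusted engine inside a locked-down container: no network,
# capped CPU/RAM, read-only filesystem. Any sandbox (firejail, bwrap, ...)
# would do as well.
import subprocess

def run_engine_sandboxed(engine_dir, cores=1, mem="512m"):
    cmd = [
        "docker", "run", "--rm", "-i",
        "--network", "none",            # no network access
        "--cpus", str(cores),           # cap CPU usage
        "--memory", mem,                # cap RAM
        "--read-only",                  # engine cannot modify the image
        "-v", f"{engine_dir}:/engine:ro",
        "chess-worker:latest",          # hypothetical worker image
        "/engine/engine",               # engine binary inside the mount
    ]
    # Returns a process whose stdin/stdout speak UCI, same as a local engine.
    return subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)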

It seems to me like a beautiful collaborative project with a very good existing basis (OB, Fishtest, the existing rating lists, ...).
ChickenLogic
Posts: 154
Joined: Sun Jan 20, 2019 11:23 am
Full name: kek w

Re: The "ultimate" engine rating list

Post by ChickenLogic »

Guenther wrote: Tue Jan 04, 2022 2:42 pm …
I guess the problem is that, at the end of the day, there is somewhere a thing we, as the "end users", just have to trust. Let's take CCRL: how would I validate that the games are legitimate? Engines like Stockfish are deterministic as long as they only use one core. So if one core is used and the number of nodes searched, the hash size and the version are known, a game could in theory be reproduced.
However, now take a game played by Xiphos. If I remember correctly, it is not deterministic even with only one core. So the only possible way would be to recreate CCRL's settings and see whether the resulting Elo of Xiphos lines up closely enough. That, however, is not feasible for the average user of such lists, and not every list even provides PGNs, let alone details of the hardware used.
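As a sketch of what such a replay check could look like for a deterministic engine, using python-chess (this assumes per-move node counts were recorded, e.g. in PGN comments, which most lists don't provide):

Code: Select all

# Replay a single-core game to verify it: re-search each of the verified
# engine's own positions with one thread, the recorded hash size and the
# recorded node count, then compare the chosen moves.
import chess
import chess.engine

def verify_game(engine_path, moves, own_nodes, engine_is_white=True, hash_mb=256):
    # moves: all chess.Move objects of the game; own_nodes: node counts for
    # the verified engine's own moves, in order.
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    engine.configure({"Threads": 1, "Hash": hash_mb})
    board = chess.Board()
    nodes_iter = iter(own_nodes)
    try:
        for i, played in enumerate(moves):
            our_turn = (i % 2 == 0) == engine_is_white
            if our_turn:
                result = engine.play(board, chess.engine.Limit(nodes=next(nodes_iter)))
                if result.move != played:
                    return False   # replay diverged from the delivered game
            board.push(played)
        return True
    finally:
        engine.quit()

# Example (hypothetical data): verify_game("./stockfish", game_moves, game_nodes)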

I think the easiest way to validate is to detect whether a worker in the framework reports a result much worse or much better than all the other workers. That worker may have a problem such as faulty hardware or a faulty compile, or it could be a malicious actor. Either way, its results would be discarded. The trust that this system works would rely on the fact that many different users with no connection to each other come to a similar conclusion. But of course, similar to blockchains, if one user controls more than 50% of the hardware used by the framework, that trust is gone.
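A sketch of that kind of cross-worker check (the data layout and the 4-sigma threshold are just illustrative):

Code: Select all

# Flag suspicious workers: compare each worker's score for one engine pairing
# against the pooled result and flag anything several standard errors away.
import math

def flag_outlier_workers(results, threshold_sd=4.0):
    # results: {worker_id: (points_scored, games_played)} for a single pairing.
    total_points = sum(p for p, g in results.values())
    total_games = sum(g for p, g in results.values())
    pooled = total_points / total_games
    flagged = []
    for worker, (points, games) in results.items():
        if games == 0:
            continue
        score = points / games
        # Binomial-style standard error of this worker's score around the pooled mean.
        se = math.sqrt(max(pooled * (1 - pooled), 1e-9) / games)
        if abs(score - pooled) > threshold_sd * se:
            flagged.append(worker)
    return flagged

# Example with made-up numbers:
print(flag_outlier_workers({"w1": (610, 1000), "w2": (620, 1000), "w3": (760, 1000)}))
# -> ['w3']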