Do we have a benchmark engine with a fixed Elo? Elo is only calculated within a population, but I didn't find a fixed-Elo engine, something like what the kilogram is for weight.
It would have to fit these criteria:
Deterministic
No time control (important to not depend on hardware)
Fixed elo
Then you would just have to download it and run it against other engines with different settings.
Conclusions like "this engine has 3200 Elo on that hardware in 1-minute games" would then become meaningful. Statements like "doubling time gives +100" and "doubling threads gives +90" would also become more scientific.
Is Stockfish 10.0, single-threaded, no hash table, fixed 10,000,000 nodes, such a benchmark? Or am I overlooking something?
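For concreteness, the setup asked about here could be expressed as a fixed UCI command sequence. The sketch below is illustrative only; `reference_commands` is a hypothetical helper, and the specific option values are my assumptions (the option names follow Stockfish's standard UCI options; note Stockfish cannot disable the hash table entirely, only minimize it):

```python
# Sketch: pin the reference configuration down as a fixed UCI command
# sequence: one thread, minimal hash, and a node budget instead of a
# clock, so the search is independent of hardware speed.

FIXED_NODES = 10_000_000

def reference_commands(fen: str) -> list[str]:
    return [
        "uci",
        "setoption name Threads value 1",  # single-threaded: deterministic search order
        "setoption name Hash value 1",     # minimal hash (cannot be disabled entirely)
        "ucinewgame",                      # reset search state between games
        "isready",
        f"position fen {fen}",
        f"go nodes {FIXED_NODES}",         # fixed node budget, no time control
    ]

cmds = reference_commands(
    "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1")
print(cmds[-1])  # -> go nodes 10000000
```

Sending this same sequence to the same binary should then produce the same move on any machine, which is the property the post is after.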
Reference Engines?
Moderator: Ras
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Reference Engines?
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Reference Engines?
Probably the people from CCRL can shed some light on how they handle this. IMO the best solution would be to have a pool of reference engines, each having an above-average number of games, and keep the average of their Elo at a fixed value. I don't see why determinism would be an advantage.
Rating lists are usually compiled exclusively from games played at the same effective TC (i.e. corrected for machine speed). I agree it would be interesting to also include time-odds games, so that one could determine absolute ratings, rather than relative ratings for a given TC. In the past I sometimes did this when compiling rating lists for variant engines, when there was only a handful of engines able to play the variant, with wildly different strength. Using time odds of up to a factor 100 for the stronger ones then gave a much better population of the rating scale, making it possible to calculate meaningful ratings. There is not much you can conclude if nearly every match between different engines gets a >95% score.
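That last point can be made quantitative with the logistic relation behind Elo; a small sketch (the function name is mine, not from the post):

```python
import math

def implied_elo_gap(score: float) -> float:
    """Elo difference implied by an expected score (0 < score < 1),
    under the standard logistic Elo model."""
    return -400 * math.log10(1 / score - 1)

# Past ~95% the curve is nearly flat, so a lopsided score pins the
# actual gap down very poorly -- hence the use of time odds to pull
# scores back toward the informative middle of the curve:
print(round(implied_elo_gap(0.95)))  # roughly 510
print(round(implied_elo_gap(0.99)))  # roughly 800
```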
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Reference Engines?
hgm wrote: ↑Sat Nov 20, 2021 9:57 am Probably the people from CCRL can shed some light on how they handle this. IMO the best solution would be to have a pool of reference engines, each having an above-average number of games, and keep the average of their Elo at a fixed value. I don't see why determinism would be an advantage. […]
These guys maintain their own engine populations, so a rating is a useless metric outside of that population.
Determinism and timeless state (the reference engine must not have time control applied) are an absolute must to get the same result on every machine. Then everyone could download the engine, play their own engine against it, and have an absolute number to compare to.
You are right, optimally there would be different versions from 500 to 3500 Elo.
It's a more scientific approach to Elo, because the number actually compares against a universal comparison engine across all Elo populations, independent of hardware.
My guess is that a fixed version of SF, single-threaded with a fixed node count per move, would provide that.
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Reference Engines?
I don't see the point of having a deterministic referee engine if you use it to test non-deterministic engines. No two games will be the same, even when they start from the same opening line, because the opponent will quickly deviate. That the reference engine would in theory have repeated all its moves had the opponent done so, rather than playing something else, is irrelevant, because it will never be put to the test. In principle every engine can be made deterministic by having it add every game it ever played to a kind of book, and using the book move instead of thinking when it has one. But the difference between doing that and letting it think non-deterministically will never be tested.
Also note that the statistical noise in game results is huge. The whole Elo model is based on the idea that a player's performance differs from game to game with a standard deviation of 200 Elo, so the difference between two players' performances has a standard deviation of about 280 Elo (200·sqrt(2)). This is why a player that is 280 Elo weaker still scores about 16%. If you would intentionally perturb the average Elo of an engine by a random amount with standard deviation delta, the standard deviation of its performance is affected only as sqrt(280*280 + delta*delta), i.e. quite negligibly if delta <~ 30.
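These figures check out numerically; a quick verification sketch (my own, reading the 280 as the standard deviation of the rating difference, roughly 200·sqrt(2)):

```python
import math

def expected_score(deficit_elo: float) -> float:
    """Expected score for a player `deficit_elo` below the opponent
    (logistic Elo model)."""
    return 1 / (1 + 10 ** (deficit_elo / 400))

# 280 Elo weaker -> about a 16-17% expected score, as stated above:
s = expected_score(280)            # ~0.166

# An intentional perturbation with SD `delta` widens the per-game
# spread only in quadrature, so delta <= 30 is nearly invisible:
sd = math.hypot(280, 30)           # sqrt(280^2 + 30^2) ~ 281.6

print(f"score={s:.3f}, sd={sd:.1f}")
```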
- Posts: 391
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Reference Engines?
A deterministic engine like you described is easily exploitable and may perform worse than a random mover. It would be a terrible reference point. Stockfish on fixed nodes, single thread, is deterministic, yes.
dangi12012 wrote: No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Reference Engines?
hgm wrote: ↑Sat Nov 20, 2021 1:20 pm I don't see what is the point of having a deterministic referee engine, if you use it to test non-deterministic engines. […]
Well, whether it really needs to be deterministic is a point to argue about; you both had good arguments, so I think this point is droppable.
It should "perform" the same on every hardware. Yes, it's about the standard deviation, but as it stands today Elo is not a metric to compare engines with each other across hardware, pools etc.
There is no Elo standard, so to speak, to compare against.
With a standardized, population-independent Elo you could take a random build of a random engine and know exactly where it would land on the CCRL ranking, with a defined uncertainty after a defined number of games.
As it stands today you need to introduce it into the CCRL population, and only then do you get a score, which is only valid in that pool.
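The "defined uncertainty after a defined number of games" part can be sketched with standard error propagation on the logistic curve (my own illustration, not from the thread; it ignores draws, which in practice shrink the variance somewhat):

```python
import math

def elo_uncertainty(score: float, games: int) -> float:
    """Approximate 1-sigma Elo error bar for a measured score over
    `games` independent games (delta method on the logistic Elo curve)."""
    se_score = math.sqrt(score * (1 - score) / games)
    # derivative of the implied Elo difference w.r.t. the score:
    delo_dscore = 400 / (math.log(10) * score * (1 - score))
    return delo_dscore * se_score

# After 1000 games at a 50% score the rating is pinned down to
# roughly +/- 11 Elo; quadrupling the games halves the error bar:
print(round(elo_uncertainty(0.5, 1000)))
print(round(elo_uncertainty(0.5, 4000)))
```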
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Reference Engines?
dangi12012 wrote: ↑Fri Nov 19, 2021 10:52 pm Do we have a benchmark engine with a fixed elo? […] Is Stockfish 10.0 singlethreaded no hashtabe fixed nodes 10000000 such a benchmark? Or do i overlook something?
If you run without time control then you are not testing engines but evaluation functions. Every engine will run until the set depth or node count has been reached, so the only difference between the two would be the evaluation function. Time control is an essential part of playing a game of chess. If you take a look at the FIDE list, you will see that even the strongest grandmasters have three different ratings for Blitz, Rapid and Classical, and sometimes there may be a difference of around 100 points.
If there's one reference engine I could name, then it would have to be TSCP. It seems to be everybody's first target, especially when implementing a hash table. TSCP is a pawn trickster, and is (almost) impossible to defeat in a match unless you use the same tricks, OR implement a hash table. The engine has played close to 5000 games on CCRL blitz, and it has been 1725 Elo for ages.
Most people (including me) feel that if you can defeat TSCP _without_ using anything else but PST's and a transposition table, you've got your basic stuff together.
Another engine I'd call a reference would be Vice 1.1. This engine and its video tutorial on Youtube were the basis of so many engines, or the tutorial was the kickstart of a new original engine (including mine), that many people consider the engine and its tutorial to be their first "teacher" in chess programming. After TSCP, Vice is often the second target to beat (Elo 2045).
Other engines I often see mentioned as targets are Fairy-Max and Micro-Max. They are strong for their feature set and code size.
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Reference Engines?
mvanthoor wrote: ↑Sun Nov 21, 2021 12:40 pm If you run without time control then you are not testing engines but evaluation functions. […]
I have to disagree. I didn't say no time control; I said no time control for the reference engine only: it will search a fixed node count N. This is to achieve a reference strength on every hardware, regardless of all variables like RAM, cores, time etc.
Reference Engine - Nodecount mode - 3000 ELO
Tested Engine - Normal Time control mode - x ELO
Then you can definitely say how your own engine performs with which time controls and which hardware. So "my engine does 2500 Elo in 1-minute games" becomes an accurate statement.
As it stands now, Elo is a comparison within a population. You can only say who is on top relative to each other, but what is really missing is a benchmark or reference point to compare across all populations.
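If such a fixed-strength reference existed, turning a match score against it into an absolute number would be the ordinary performance-rating calculation. A sketch (the 3000-Elo figure is the hypothetical reference value from the lines above):

```python
import math

def performance_elo(reference_elo: float, score: float) -> float:
    """Absolute rating implied by a match score (0 < score < 1) against
    a reference engine of known, fixed strength (logistic Elo model)."""
    return reference_elo - 400 * math.log10(1 / score - 1)

# E.g. scoring 24% against the 3000-Elo fixed-nodes reference at some
# time control would place the tested engine near 2800 at that time
# control, independent of any particular rating pool:
print(round(performance_elo(3000, 0.24)))  # -> 2800
```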
- Posts: 512
- Joined: Tue Sep 29, 2020 4:29 pm
- Location: Dublin, Ireland
- Full name: Madeleine Birchfield
Re: Reference Engines?
mvanthoor wrote: ↑Sun Nov 21, 2021 12:40 pm If there's one reference engine I could name, then it would have to be TSCP. […]
Another one I would add is BBC 1.0, which like VICE is part of a video tutorial on Youtube, and is also named after a media company for some reason.
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Reference Engines?
I think the discussion is about relative and absolute rating systems. Both types are very different and have their right to exist.
Now, just some thoughts without any preference. Everybody is free to use a reference engine (in any form of its attributes). The usage can be for developing, testing or building a list for the public. Even with non-determinism this will result in an absolute performance, because every engine is compared to the strength of the reference engine. Such a feature is not limited to one reference engine; it can be a pool of engines too. The point is that the test conditions in total need to be the same, so the performance can be compared directly. The test conditions can include the same statistical noise for every test, so it will balance out.
The relative measurement has different attributes. In general the test conditions for two entities can vary. To make them comparable you need at least an overlap of the test conditions, like a subset of opponents and the same time control (assuming the same hardware) and things like that.
In the long run you can observe that the relative ratings/rankings will correlate to a certain degree. Compare the strength of the top 10 engines today and the top 10 engines ten years ago. I am absolutely sure, although the test conditions are very different (different hardware, different pool of engines), that today's engines perform better in absolute comparison too, even if they never played against each other or did the same test.
While you might have a little more noise in relative testing in the short term, it will balance out over time.
Absolute rating systems can have disadvantages, like optimizing the performance for the test conditions rather than for general behaviour, or simply becoming outdated at some point.
And finally, there are approaches that already measure strength and are absolute: think of position test suites, where the results are directly comparable. It has turned out over many years that game play (including noise, e.g. from time-based test conditions) is good enough to make improvements in the short and long run. I don't say it is a smart approach, but it works! The statistical noise is wiped away with countless test games.