Standard candles

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Uri Blass
Posts: 10410
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Standard candles

Post by Uri Blass »

hgm wrote:I once made an attempt to accurately measure the ratings of the very weakest engines in the ChessWar Promo division. They did end up below zero. And I had the feeling these were even over-estimated, and would go down several hundred Elo more if there just had been enough intermediate engines to push them further down. The gap between engines like Ccp, Pos and N.E.G. and the weakest, most buggy alpha-beta searchers is truly enormous.

Another problem is that the standard rating models are not valid for these engines. There is always a very sizable probability that they score points against other extremely weak engines that are 500, 1000 or even 3000 Elo stronger, because these weak engines are so buggy that they often hang before the opponent is checkmated, and thus forfeit on time (or play an illegal move, etc.). Of course the standard rating models then refuse to believe that they are really 3000 Elo weaker, when they score points. But it would be easy to construct a sequence of engines all differing by 100 Elo (where the occasional forfeit of the stronger one would not completely corrupt the measurements), and then you would see that the sequence would have to be very, very long before you reach the strength of Pos or Brutus random.

Another problem is that engines weakened by randomizing their eval still might recognize very deep mates. This makes them very unbalanced. They play like complete idiots, but (when put up against other complete idiots) then suddenly announce "mate in 11", and play it out perfectly. That is not what you expect of a weak player.
I do not believe that the standard rating model is valid in general.
Rating always depends on the pool of opponents that you play.

I think it may be interesting to compute the rating advantage that some engine has over the random mover by using the following players.

1) a strong program X (call it X(0))
2) the same program, except playing a random move in 1% of the cases (call it X(1))
3) the same program, except playing a random move in 2% of the cases (call it X(2))

This way you have 101 programs X(0), X(1), ..., X(100), where X(100) is the random mover.

Now play a match of 1000 games between X(i) and X(i+1) for every 0<=i<=99.

After 100,000 games you get, based on these matches, a rating for every X(i), and you can calculate the difference between X(0) and the random mover X(100).

You can also play 50 matches of 1000 games between X(i) and X(i+2) for i=0,2,4,...,98, and it may be interesting to see the gap between X(0) and X(100) under these conditions as well.
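The arithmetic of the last step can be sketched as follows (a hedged illustration; the 55% per-step score is a made-up placeholder, not a measured number). Each match score is converted to an Elo difference with the standard logistic model, and the per-step differences are summed along the chain:

```python
import math

def elo_diff(score):
    """Elo difference implied by a score fraction (0..1) under the
    standard logistic rating model; clamped to avoid infinities."""
    score = min(max(score, 0.001), 0.999)
    return -400 * math.log10(1 / score - 1)

# Placeholder assumption: each X(i) scores 55% against X(i+1).
match_scores = [0.55] * 100

# Implied total gap between X(0) and the random mover X(100).
total_gap = sum(elo_diff(s) for s in match_scores)
```

The X(i) vs. X(i+2) experiment would sum 50 differences instead; comparing the two totals would show how well the model's transitivity assumption holds along the chain.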
flok

Re: Standard candles

Post by flok »

Adam Hair wrote:Unlike ChessWar (if I remember correctly), I throw out all time forfeits and games adjudicated for illegal moves. Because of this, I exclude engines that tend to hang or perform an illegal move when they are losing. Still, there is a major source of rating distortion in my list. Some engines cannot apply mate with an overwhelming majority of material (I have forgotten the reason for this and would not mind if someone explained it again). It is these engines that give up half points to Brutus random, POS, etc. in my testing.
I find it hilarious that POS is actually good for something: serving as a marker for the worst chess engine (and proof that negative Elo is possible) :-)
Well, let's hope that DBP gets better.
Michel
Posts: 2273
Joined: Mon Sep 29, 2008 1:50 am

Re: Standard candles

Post by Michel »

I have forgotten the reason for this and would not mind if someone explained it again
For example, if you score a mate simply as 10000 rather than as 10000 minus the ply, the engine will have no incentive at all to actually work towards the mate. So it will keep playing random moves that keep the mate within the horizon without ever actually delivering it, except by pure accident.
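A toy illustration of the fix Michel describes (the names are mine, not from any particular engine): score a mate as a large constant minus the ply at which it occurs, so shorter mates score strictly higher and the search has an incentive to make progress:

```python
MATE = 10000

def mate_score(ply_to_mate):
    """Score for a mate delivered ply_to_mate plies from the root.
    With the ply subtracted, a mate in 3 outscores a mate in 11,
    so the engine converges on the mate instead of shuffling."""
    return MATE - ply_to_mate
```

With a flat score of 10000, a mate in 3 and a mate in 11 are indistinguishable and the engine can shuffle forever; with the subtraction, every move that shortens the mate strictly improves the score.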
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Standard candles

Post by Don »

sje wrote:In the realm of astrophysics, standard candles are classes of astronomical objects which allow measurement of cosmic distances. Without these candles, we would not know with any certainty the distance to a faraway object unless somehow physical travel were possible.

http://en.wikipedia.org/wiki/Cosmic_distance_ladder

It occurs to me that we might construct standard candles on the Elo strength scale. For example, no one knows the actual Elo measurement of a totally random player. But that measurement could be determined by artificially constructing progressively weaker machine players. A program would be made with no randomness and earn a (say) 2800 Elo rating. From that point, a very slight amount of randomness would be added to produce a program which performed 100 Elo points lower after an extended (> 1000 game) match against the prior version. The downward revision iteration would be continued as necessary.
I had at one time considered producing a generic chess program set up as a standard benchmark. It would be a specification, so that anyone could reproduce the program from the specification. The idea was to use fairly modern techniques but not go to extreme lengths. For example, there are all sorts of variations on how one might do mobility, but this particular program would have the most basic square-counting method, described in an unambiguous way. The design document would be formal.

I decided against it because it would be a major project, and I seriously doubt everyone would faithfully follow the spec. I did such a thing with Go - a simple Monte Carlo standard for benchmarking - and people implemented it with their own minor twists and enhancements, which completely defeats the purpose. And at some point you get down to how much design you want to impose - for example, do you have to generate moves in the same order? I couldn't decide whether the program should match node for node or just be strength-equivalent; for example, you would have to have the exact same Zobrist hash table if you want deterministically identical implementations. That's not necessary for equivalent playing strength, but it would be necessary for proving that it was implemented correctly.

Anyway, I thought it would be pretty cool to do such a thing but there are not enough hours in the day as it is ....
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Candlemaking

Post by sje »

A detailed approach:

1. Take a source snapshot of a strong, open source program like Stockfish. Call the program made from this snapshot "Candle-0".

2. Modify the program source in only one way: after a search completes, generate a random number on the unit (0..1) interval. If this random number is less than a given percentage (zero percent for Candle-0), then this triggers the generation of a second random number. This second number is an integer used to select a move from all the available ply-zero moves.

3. For Candle-0, the first random number never has an effect, so the second number is never generated.

4. Successively make Candle-1, Candle-2, up to Candle-100. Each program performs the same as its predecessor except for a one percentage point addition to its random trigger.

5. Run all the programs against each other in a huge round robin tournament with game/hour sudden death controls for about a year.

6. At the end of all this, you'll have a very good idea as to the Elo of a random program.
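Step 2 of this recipe might look like the following sketch (the helper names and move strings are hypothetical stand-ins, not actual Stockfish code):

```python
import random

def candle_move(level, best_move, root_moves):
    """level is the candle's trigger percentage: 0 for Candle-0,
    up to 100 for Candle-100 (the pure random mover)."""
    # first random number: on the unit (0..1) interval
    if random.random() < level / 100.0:
        # second random number: uniform pick among all root moves
        return random.choice(root_moves)
    # trigger not fired: play the searched best move as usual
    return best_move
```

For Candle-0 the comparison is against zero, so the trigger never fires and the second number is never generated, exactly as point 3 requires.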
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Candlemaking

Post by Don »

sje wrote:A detailed approach:

1. Take a source snapshot of a strong, open source program like Stockfish. Call the program made from this snapshot "Candle-0".

2. Modify the program source in only one way: after a search completes, generate a random number on the unit (0..1) interval. If this random number is less than a given percentage (zero percent for Candle-0), then this triggers the generation of a second random number. This second number is an integer used to select a move from all the available ply-zero moves.

3. For Candle-0, the first random number never has an effect, so the second number is never generated.

4. Successively make Candle-1, Candle-2, up to Candle-100. Each program performs the same as its predecessor except for a one percentage point addition to its random trigger.

5. Run all the programs against each other in a huge round robin tournament with game/hour sudden death controls for about a year.

6. At the end of all this, you'll have a very good idea as to the Elo of a random program.
This would produce a lot of versions of different strengths, but it's too bizarre and unbalanced. Why not just use a node level? Something like:

candle(n) = 100 * 2^n nodes

candle(0) = 100 nodes
candle(1) = 200 nodes
candle(2) = 400 nodes

etc ...

Having a program play like a super Grandmaster but blundering horribly a percentage of the time is really ugly.

There is a more natural way to do your idea, though. Run the program in multipv mode with INF lines - in other words, RANK all the moves from best to worst.

You then choose a random number between 0 and the number of legal moves. That is the weakest play and is completely random. To get to the next strongest candle you generate a second random number between zero and the FIRST random number. You now have a curve where the best move is favored a little more than the second best and so on. The worst move is still possible but less likely than any other move. You can continue this process with additional random numbers.

After 20 random numbers you will almost always pick the top Stockfish move, but with some very slight chance of picking the 2nd best. The odds of picking the 3rd or 4th best are exceedingly remote, and it's still possible to pick the worst move if you are willing to wait a few trillion years. But with more random numbers you will asymptotically approach the full strength of Stockfish. And I think the quality (even if low) will seem a lot more consistent from move to move.

Of course you will take a hit for multipv mode, of perhaps 100 Elo - I don't know for sure.
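The repeated-draw scheme described above can be sketched like this (assuming the engine already returns its root moves ranked best to worst, which is what multipv mode provides):

```python
import random

def pick_ranked_move(num_moves, draws):
    """Pick an index into a best-to-worst ranked move list.
    draws == 1 is a uniformly random mover; each additional draw
    re-rolls uniformly below the previous result, biasing the
    choice further toward index 0 (the engine's best move)."""
    r = float(num_moves)
    for _ in range(draws):
        r = random.uniform(0.0, r)
    # guard against the (rare) endpoint case r == num_moves
    return min(int(r), num_moves - 1)
```

Note that every index stays reachable at any strength setting; the worst move just becomes astronomically unlikely, which matches the "weak but consistent" behaviour being aimed for here.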
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Candlemaking

Post by mcostalba »

sje wrote:A detailed approach:
With all the due respect this is a mental fart.

A reference program is a good idea, but IMO it should have these features:

1) Be Elo-independent of PC hardware and of TC (that's why an unmodified normal engine, which thinks more when it has more time or when it runs on faster hardware, is not suitable).

2) Be unambiguous to implement, straightforward and simple.

3) Play weak chess, but still chess: a random engine is just a mental abstraction (I am being a bit more politically correct here :-) )


So here is my suggestion.

Take an engine and modify it so that:

1) Generates all the moves and for each one calls SEE

2) The result of the "search" of each move is simply the eval of the position after the move + SEE score

3) The eval of the position is simply given by the psq score.

4) The engine chooses as best move the one with the highest eval. If two or more moves have the same eval, the one that starts from the lowest square is chosen (this is just a convention to make the moves reproducible across different move generators).
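As a sketch, the selection rule in points 1-4 reduces to a single pass over the move list. The `Candidate` fields below stand in for values the host engine would supply (the psq eval after the move and the SEE score); they are assumptions for illustration, not part of any existing engine:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    from_square: int  # 0..63, origin square of the move
    psq_after: int    # psq eval of the position after the move
    see_score: int    # SEE score of the move

def choose_move(candidates):
    """Highest psq+SEE wins; ties go to the lowest from-square.
    Iterating in from-square order with a strict '>' comparison
    implements the tie-break rule for free."""
    best_score, best = None, None
    for c in sorted(candidates, key=lambda c: c.from_square):
        score = c.psq_after + c.see_score
        if best is None or score > best_score:
            best_score, best = score, c
    return best
```

Because there is no search tree at all, the move chosen depends only on the static terms, which is what makes the level hardware- and TC-independent.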
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Candlemaking

Post by Don »

I agree 100% with the hardware-independence issue - but it can still have a normal time control if the nodes per second are strictly controlled.

In fact, a really good reference program would be some open source program such as Stockfish, with the only modification being that you can SET the nodes per second - AND/OR have an NIT level (nodes-is-time).

The advantage of being able to set the nodes per second is that you can run on ANY hardware and play matches with humans and get a REAL rating if you choose to. You could play a human time control - the only requirement being that the machine is fast enough to support the nodes per second you choose.

Of course, for a static level you can support a fixed-nodes level with no other change.
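A settable nodes-per-second cap is straightforward to sketch: count nodes and sleep whenever the search is running ahead of its allowed rate. This is a hedged toy (a real engine would do this inside its search loop, and `do_node` is a hypothetical stand-in for visiting one node):

```python
import time

def run_throttled(do_node, max_nps, node_budget):
    """Call do_node() node_budget times, sleeping as needed so
    the average rate never exceeds max_nps nodes per second.
    On any hardware fast enough to reach max_nps, the same
    wall-clock search behaviour then falls out."""
    start = time.monotonic()
    for nodes in range(1, node_budget + 1):
        do_node()
        allowed_time = start + nodes / max_nps  # earliest legal finish
        delay = allowed_time - time.monotonic()
        if delay > 0:
            time.sleep(delay)
```

The cap, not the host CPU, then determines strength, which is what makes a real-time-control rating against humans meaningful.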
mcostalba wrote:
sje wrote:A detailed approach:
With all the due respect this is a mental fart.

A reference program is a good idea, but IMO it should have these features:

1) Be Elo-independent of PC hardware and of TC (that's why an unmodified normal engine, which thinks more when it has more time or when it runs on faster hardware, is not suitable).

2) Be unambiguous to implement, straightforward and simple.

3) Play weak chess, but still chess: a random engine is just a mental abstraction (I am being a bit more politically correct here :-) )


So here is my suggestion.

Take an engine and modify it so that:

1) Generates all the moves and for each one calls SEE

2) The result of the "search" of each move is simply the eval of the position after the move + SEE score

3) The eval of the position is simply given by the psq score.

4) The engine chooses as best move the one with the highest eval. If two or more moves have the same eval, the one that starts from the lowest square is chosen (this is just a convention to make the moves reproducible across different move generators).
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Standard candles

Post by Daniel Shawul »

Am I the only one who is struggling to understand the merit of this exercise? If you want a standard metric, why don't we just pick an arbitrary player and assign it an arbitrary Elo? Since Elo is relative, this is allowed. If you want to compare against FIDE Elo, then it means we are using their standard 'candle', which I don't recall what it is. Btw, let's avoid advanced English idioms ('candle' etc.) so that we don't confuse the masses. There is no way to escape the fact that eventually you have to assign a number (Elo) to some arbitrary player (FIDE - some human player; CEGT - Shredder/Fruit = 2600; etc.). So in this regard selecting the random player has no significance; we might as well choose an engine that thinks it is playing losers chess. That player is certainly weaker than a random mover. We could also have a random move sorter at all internal plies, which can be pretty strong: if we go by branching factors, it reduces b^d to b^(0.75d), which is not bad at all. The reason why this player may be better is that measuring the Elo of an extremely weak player, such as a random move selector at the root, is more difficult because of the lack of similar-strength opponents. Unless of course you have random players that choose moves with a uniform distribution, normal distribution, etc. :) which is pretty meaningless anyway. I hope I have conveyed my thought well, or lack thereof. In any case, it is good to state the goal of the project in unambiguous terms a layman could understand.

P.S: Just last week a new 'atomic clock' was invented that measures to one part in 10^18 or so. The problem is how to compare its accuracy, because it is better than anything out there. So they are going to compare it to the best version of the older clock, but they might as well define 'one second' with the new clock, because eventually it is all arbitrary. So in this regard, why worry at all about having a reference? Pick one with scales comparable to the old ones, so that we don't have too many complaints about the new second being longer than the old. Ok, that is enough from me :)
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Standard candles

Post by Don »

Daniel Shawul wrote:Am I the only one who is struggling to understand the merit of this exercise?
We have had arguments on this forum about how strong 20- and 30-year-old programs are. Someday we may want to compare today's programs with tomorrow's programs - and for that matter, we might want to have some benchmark to compare the top players of yesterday to the top players of tomorrow. Elo is not reliable, as the rating pool is like the economy: subject to deflation and inflation.

So this could provide a hard benchmark - something not subject to change. The key point being that the source could live in perpetuity, and effort would be made to preserve this benchmark even decades into the future. Personally I prefer the benchmark to be a hard formal spec instead of an existing reference program, but I cannot argue with the convenience of having something right away and avoiding the pain of a formal spec. A formal spec provides a means to build the reference program in any future programming language that might come along - or even on future hardware that is incompatible with today's. But that is thinking much farther ahead - and a reference program can also be viewed as a hard spec.

To be truly useful, however, it would need to be USED - so if the idea was aggressively promoted, the first step would be to "benchmark the benchmark": pick an arbitrary level that is roughly on par with the average FIDE player and get a lot of games in - as many as possible. This makes it possible to define EXACTLY what 2500 Elo means, without the fluctuations of inflation and deflation in the rating pool.

You probably could not get a lot of top players to play such matches with a benchmark player, but you probably could get a lot of FIDE masters to participate - perhaps even entering one of these programs in FIDE tournaments or paying players to play against them.

I think it's a really cool idea.

