obvious/easy move

Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: obvious/easy move - final results

Post by Sven »

hgm wrote:
Sven Schüle wrote:
Sven Schüle wrote:One final question to you: why are the ratings of A1 and A2 independent in "reality" but dependent when using a rating program?
I know that you wrote about this twice in this thread, mentioning as the reason that rating programs move the ratings around to create an average of zero. I just want to know why and how this leads to dependence.
If, for instance, you play A1-B and A2-B, then A1 doing better against B would push down not only B's rating, but also A2's rating, because the program will keep the A2-B rating difference fixed, based on their mutual result. So there is always this 'recoil' effect: you can only gain rating by pushing the average of all other ratings down, and if your opponent and the others are connected, it will not be just your opponent that suffers from this. This is especially bad if the pairing network has a star topology, where all engines connect well to the star center, and not directly to you.

But the effect tends to disappear when there is a large number of (well-connected) players, because then their combined 'mass' gets so large that there is very little recoil.
But isn't this inherent in the rating system rather than in a rating program? In our case "rating system" means how we calculate performance ratings from one set of games and nothing else, based on the Elo system.

Sven
hgm
Posts: 27702
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: obvious/easy move - final results

Post by hgm »

The requirement that the average of ratings should be fixed is not part of any rating system. If the rating of engine B had been taken as the zero point of the scale, the ratings of A1 and A2 would be independent, and the entire problem would go away (for this pairing topology). The program would then have to quote larger error bars on the A1 and A2 ratings, of course, and the error bar on the rating of B would shrink to zero.
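To make that concrete, here is a minimal sketch (an illustration assuming the usual logistic curve with the 400-point scale, not code from any rating program): with B anchored at zero, each A version's rating follows from its own score against B alone, so neither rating can move the other.

import math

def perf_rating_vs_anchor(score, anchor=0.0):
    # Performance rating from a match score in (0, 1) against a single
    # opponent whose rating is fixed at 'anchor' (logistic Elo curve,
    # 400-point scale).
    return anchor + 400.0 * math.log10(score / (1.0 - score))

print(perf_rating_vs_anchor(0.55))   # A1 scored 55% against B -> about +35
print(perf_rating_vs_anchor(0.60))   # A2 scored 60% against B -> about +70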
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: obvious/easy move - final results

Post by Sven »

hgm wrote:The requirement that the average of ratings should be fixed is not part of any rating system. If the rating of engine B had been taken as the zero point of the scale, the ratings of A1 and A2 would be independent, and the entire problem would go away (for this pairing topology). The program would then have to quote larger error bars on the A1 and A2 ratings, of course, and the error bar on the rating of B would shrink to zero.
But how do we calculate performance ratings "from scratch" without defining a fixed, arbitrary average?

In his book about the Elo system, Prof. Elo described a procedure for "rating a group of unrated players", which is exactly our situation when rating chess programs based on one set of games. The procedure assigns an initial rating to each player (which might be the same for all, or something else), then goes through as many iterations as necessary until convergence is reached within a given accuracy.
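Something like the following sketch shows the idea (details assumed: Prof. Elo's book uses the normal curve and other refinements; the logistic curve is used here for simplicity). Each iteration recomputes every player's performance rating against the current opponent ratings, then renormalizes to the arbitrary fixed average:

import math

def dp(score):
    # Rating difference corresponding to a fractional score (logistic curve).
    score = min(max(score, 0.01), 0.99)   # clamp 0%/100% scores
    return 400.0 * math.log10(score / (1.0 - score))

def rate_unrated_group(games, iterations=100):
    # games: list of (a, b, score_of_a) with score_of_a in {0, 0.5, 1}.
    # Assumes every player has at least one game.
    # Returns ratings normalized to an average of zero.
    players = {p for a, b, _ in games for p in (a, b)}
    r = {p: 0.0 for p in players}         # the same initial rating for everyone
    for _ in range(iterations):
        new_r = {}
        for p in players:
            opps   = [r[b] if a == p else r[a] for a, b, _ in games if p in (a, b)]
            scores = [s if a == p else 1.0 - s for a, b, s in games if p in (a, b)]
            # performance rating = average opponent rating + D(percentage score)
            new_r[p] = sum(opps) / len(opps) + dp(sum(scores) / len(scores))
        avg = sum(new_r.values()) / len(new_r)
        r = {p: v - avg for p, v in new_r.items()}   # the arbitrary "fixed average"
    return r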

Isn't that the base for our computer chess rating system as well?

Sven
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: obvious/easy move - final results

Post by AlvaroBegue »

Sven Schüle wrote:
hgm wrote:The requirement that the average of ratings should be fixed is not part of any rating system. If the rating of engine B had been taken as the zero point of the scale, the ratings of A1 and A2 would be independent, and the entire problem would go away (for this pairing topology). The program would then have to quote larger error bars on the A1 and A2 ratings, of course, and the error bar on the rating of B would shrink to zero.
But how do we calculate performance ratings "from scratch" without defining a fixed, arbitrary average?

In his book about the Elo system, Prof. Elo described a procedure for "rating a group of unrated players", which is exactly our situation when rating chess programs based on one set of games. The procedure assigns an initial rating to each player (which might be the same for all, or something else), then goes through as many iterations as necessary until convergence is reached within a given accuracy.

Isn't that the base for our computer chess rating system as well?

Sven
I am not sure what Mr. Elo did, but the way the rating system is used is as a predictor of future results. Ignoring draws, it's something like "the probability of A beating B is 1/(1+exp(K*(ELO(B)-ELO(A))))", for some fixed positive constant K that sets the scale of what we mean by "1 Elo point". Notice that adding a constant to everybody's Elo doesn't change any future predictions. You can say that the result of an Elo assignment is determined only up to an arbitrary additive constant, similarly to how the indefinite integral of 1/x is log(x) + C, for any constant C.
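A small sketch of that formula (my K below is the assumed choice that reproduces the conventional 400-point logistic curve; any positive K works):

import math

K = math.log(10) / 400.0   # sets "1 Elo point" to match the usual 400-point curve

def p_win(elo_a, elo_b):
    # Probability that A beats B, draws ignored.
    return 1.0 / (1.0 + math.exp(K * (elo_b - elo_a)))

# Adding the same constant to every rating changes no prediction:
print(p_win(2800, 2700))              # about 0.64
print(p_win(2800 + 123, 2700 + 123))  # exactly the same value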
hgm
Posts: 27702
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: obvious/easy move - final results

Post by hgm »

Whatever method you use to determine ratings, there will always be one completely undetermined parameter: the zero point of the rating scale. Rating programs normally fix this by a requirement on the average rating, which causes correlations between the ratings that can be quite unexpected and makes the interpretation of the reported error bars tricky.

Now very often there is no better alternative to this. But in some simple and quite common cases there are better choices, which remove the correlation problem and make the reported error bars intuitive. One such case is the gauntlet: if I test various versions A1, A2, ... An of my engine against a fixed gauntlet B1, B2, ... Bm, it is much more convenient to keep the average of the B engines fixed. Then testing a new engine An, even if it is so much stronger that it completely crushes the gauntlet, will not cause a downward revision of all the previously determined ratings (which would otherwise happen, because the new results push the ratings of the B engines down). Assigning an infinite 'mass' to the B engines prevents any recoil, and completely decouples the A engines from everything but their own gauntlet.
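A minimal sketch of that anchoring (an assumed illustration, not how any particular rating program implements it): with the gauntlet ratings frozen, one A engine's rating is found by matching its predicted total score against the gauntlet to its actual total score, and nothing else can move.

def expected(d):
    # Expected score at rating difference d (logistic curve, 400-point scale).
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def rate_vs_fixed_gauntlet(results, iterations=60):
    # results: list of (gauntlet_rating, score, games) for one A engine.
    # Bisect for the A rating whose predicted total score equals the
    # actual total score; the gauntlet ratings never change.
    actual = sum(s * n for _, s, n in results)
    lo, hi = -1000.0, 4000.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        predicted = sum(expected(mid - rb) * n for rb, _, n in results)
        lo, hi = (mid, hi) if predicted < actual else (lo, mid)
    return (lo + hi) / 2.0

# A1: 55% over 200 games vs a 2600 engine, 48% over 200 games vs a 2700 one:
print(rate_vs_fixed_gauntlet([(2600, 0.55, 200), (2700, 0.48, 200)]))  # ~2660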
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: obvious/easy move - final results

Post by Sven »

AlvaroBegue wrote:
Sven Schüle wrote:
hgm wrote:The requirement that the average of ratings should be fixed is not part of any rating system. If the rating of engine B had been taken as the zero point of the scale, the ratings of A1 and A2 would be independent, and the entire problem would go away (for this pairing topology). The program would then have to quote larger error bars on the A1 and A2 ratings, of course, and the error bar on the rating of B would shrink to zero.
But how do we calculate performance ratings "from scratch" without defining a fixed, arbitrary average?

In his book about the Elo system, Prof. Elo described a procedure for "rating a group of unrated players", which is exactly our situation when rating chess programs based on one set of games. The procedure assigns an initial rating to each player (which might be the same for all, or something else), then goes through as many iterations as necessary until convergence is reached within a given accuracy.

Isn't that the base for our computer chess rating system as well?

Sven
I am not sure what Mr. Elo did, but the way the rating system is used is as a predictor of future results. Ignoring draws, it's something like "the probability of A beating B is 1/(1+exp(K*(ELO(B)-ELO(A))))", for some fixed positive constant K that sets the scale of what we mean by "1 Elo point". Notice that adding a constant to everybody's Elo doesn't change any future predictions. You can say that the result of an Elo assignment is determined only up to an arbitrary additive constant, similarly to how the indefinite integral of 1/x is log(x) + C, for any constant C.
I know that. But the point here was a different one. HGM stated that the "requirement that the average of ratings should be fixed is not part of any rating system", and even if that is true, I believe that our computer chess rating system, independently of any particular rating program, basically works in a way very similar to the procedure described by Prof. Elo: calculate Elo differences based on the given game results, and turn these differences into absolute values by defining a fixed, arbitrary average.

Note also that our context is the question of whether the ratings of A1 and A2 in the gauntlet case (A1-B and A2-B games) are independent, implying that the "factor 4" confirmed by many people in this thread would be correct, or whether they are in fact dependent, which would imply a different and slightly more complex calculation of that factor. HGM basically says: "in theory, yes, they are independent, but not with rating programs, since these enforce a fixed rating average", and I am now asking what that "theory" is all about.

Sven
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: obvious/easy move - final results

Post by AlvaroBegue »

This whole thread is great for me, because I am trying to design my testing setup. My current plan involves a few different types of tests: correctness, speed, branching factor, evaluation accuracy and, finally, playing many quick games. I plan to run matches primarily against the previous version, but I also want to be able to run games against a variety of engines, mainly so I can measure progress over long periods of time.

Similarly to what hgm proposes, I intend to get a bunch of reference engines, run a long round-robin tournament between them, and fix their ratings forever. Then each version of my engine will get an Elo score that maximizes the likelihood of the game results given the Elo formula, in the style of BayesElo. That won't suffer from the correlation issues.
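A toy version of such a fit (assumptions flagged: BayesElo itself models draws, priors and error bars much more carefully; draws are simply ignored here):

import math

def win_prob(d):
    # Logistic win probability at rating difference d (400-point scale).
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def log_likelihood(elo, games):
    # games: list of (reference_rating, won) with won True/False.
    return sum(math.log(win_prob(elo - ref) if won else 1.0 - win_prob(elo - ref))
               for ref, won in games)

def fit_elo(games):
    # The log-likelihood is concave in elo, so a plain 1-point grid scan
    # is good enough for a sketch.
    return max(range(1000, 3601), key=lambda e: log_likelihood(e, games))

games = ([(2500, True)] * 60 + [(2500, False)] * 40 +
         [(2600, True)] * 45 + [(2600, False)] * 55)
print(fit_elo(games))   # around 2570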
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: obvious/easy move - final results

Post by Sven »

hgm wrote:Whatever method you use to determine ratings, there will always be one completely undetermined parameter: the zero point of the rating scale. Rating programs normally fix this by a requirement on the average rating, which causes correlations between the ratings that can be quite unexpected and makes the interpretation of the reported error bars tricky.

Now very often there is no better alternative to this. But in some simple and quite common cases there are better choices, which remove the correlation problem and make the reported error bars intuitive. One such case is the gauntlet: if I test various versions A1, A2, ... An of my engine against a fixed gauntlet B1, B2, ... Bm, it is much more convenient to keep the average of the B engines fixed. Then testing a new engine An, even if it is so much stronger that it completely crushes the gauntlet, will not cause a downward revision of all the previously determined ratings (which would otherwise happen, because the new results push the ratings of the B engines down). Assigning an infinite 'mass' to the B engines prevents any recoil, and completely decouples the A engines from everything but their own gauntlet.
Three questions on that:

1) Why should the error bars be affected by a different average rating at all? (Or do you mean the error of the rating difference?)

2) Do you propose that a rating program should detect whether the set of games matches that "gauntlet pattern" and handle it differently?

3) Wouldn't it be possible for rating programs to do all their internal calculations of ratings and error margins without the "fixed average" requirement, and only at the very end, prior to displaying the results, shift the ratings to match the "fixed average"? I would not be surprised if that were already the case today, and I think that in that case you could still regard the A1, A2 ratings as independent. The "amount of shifting" needed to achieve a certain average could just as well be applied by the user in a post-calculation step, after the user has already seen the ratings.
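To illustrate question 3 (a hypothetical workflow, not a claim about what any existing rating program actually does), the final shift would be nothing more than:

def shift_to_average(ratings, target_avg=0.0):
    # Translate all ratings so their mean equals target_avg; every rating
    # difference, and hence every prediction, is left unchanged.
    offset = target_avg - sum(ratings.values()) / len(ratings)
    return {name: r + offset for name, r in ratings.items()}

print(shift_to_average({"A1": 35.0, "A2": 70.0, "B": 0.0}))
# {'A1': 0.0, 'A2': 35.0, 'B': -35.0}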

Sven
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: obvious/easy move - final results

Post by Pio »

Hi Robert!

I think both you and Don are right. I would primarily test the same way Don does, because:

1) You need fewer games to be sure of an improvement of your new version A' relative to your old version A.

2) You (Robert) make an extra assumption, namely that your opponents will not change. The more the opponents move towards your version A in the future, the worse that assumption turns out to be. I guess this is more dangerous if you test against opponents that are weaker than yourself (in terms of evaluation and search algorithms, but not speed), since they will probably converge towards your engine in the future.

3) You will ruthlessly exploit the weaknesses of your pool of opponents, but they might not be as representative of all chess engines of today and tomorrow as you might think. Let's say you have 20 opponents and you test 100 different ideas, and say each idea (a simplified and not quite correct assumption) has a 50 percent chance of being better against any given opponent and a 50 percent chance of being worse. For one of those 100 ideas you will probably get really lucky: it will work against maybe 17 opponents in your pool and look like a brilliant idea, even though it was not a general improvement at all. Of course the same can happen in reverse for some ideas that are great in general but unfortunately do not work against your pool of opponents.
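A quick sanity check of point 3 under its own simplifying assumptions (each of 100 independent ideas is pure noise, with a 50 percent chance of helping against each opponent):

from math import comb

def p_single(k, n=20):
    # P(an idea that is pure noise helps against at least k of n opponents).
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

for k in (15, 16, 17):
    p_any = 1.0 - (1.0 - p_single(k)) ** 100   # across 100 independent ideas
    print(f"helps vs >= {k}/20 opponents: one idea {p_single(k):.4f}, "
          f"at least one of 100 ideas {p_any:.2f}")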

What I think is interesting is to manually study the games of your own engine against other engines, since that will show the flaws in your own engine.

If you really want to beat the opponents in your pool, you could use asymmetrical evaluation/search. Let's say you have made a small change "a" to your engine version A, and your new engine A' = A + a does badly against A but really well against your pool of opponents. Why not change your engine's evaluation/search to A + a, but model the opponents' evaluation/search as A - a, where by -a I mean the inverse of a? If that does not work, do the opposite: model your own engine's evaluation/search as A - a and the opponents' as A + a. Just an idea.

Good luck with your testing!!!
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: obvious/easy move - final results

Post by AlvaroBegue »

Pio wrote: What I think is interesting is to manually study the games of your own engine against other engines, since that will show the flaws in your own engine.
That's how we did basically all development of Ruy-López in the 90s. And we'll probably continue to do that as a source of ideas. But you need intensive testing to filter which modifications to accept.