Testing results interpretation.

Discussion of chess software programming and technical issues.

Moderator: Ras

mathmoi
Posts: 290
Joined: Mon Mar 13, 2006 5:23 pm
Location: Québec
Full name: Mathieu Pagé

Re: Testing results interpretation.

Post by mathmoi »

Daniel Shawul wrote:Are you sure? If one of the MatMoi versions (v1) improves its score against Matheus by 1% and the other (v2) scores 1% better against Matant, the current scheme will assign a higher Elo to v1, because Matheus is ranked higher than Matant. But Matant could actually be much stronger than Matheus if they were to be matched. What we have here is only their strength against MatMoi, which is why I suggested an initial round-robin tournament to be played only _once_. For all I care, initial ratings from CEGT could be assigned, but that would be a very rough approximation. Thereafter, you add your engines one by one with a gauntlet match, so that _all engines_ have played the same number of games at all times.
I know that Robert does not make the opponents play against each other while he tests Crafty. I'm not sure why.

Maybe he can tell us.
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: Testing results interpretation.

Post by Dirt »

Daniel Shawul wrote:Are you sure? If one of the MatMoi versions (v1) improves its score against Matheus by 1% and the other (v2) scores 1% better against Matant, the current scheme will assign a higher Elo to v1, because Matheus is ranked higher than Matant.
Why would you think so? You have v1 performing 1% better against a strong engine, but 1% worse against a weak engine. I would think that balances out.

Is it better to beat Rybka every game and lose every game against TSCP, or the reverse? I don't see a difference, and I don't think the Elo calculations do either. I do think that one of those shows much more promise for improvement, but I doubt BayesElo can pick up on that.
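Under a per-game logistic (Elo) likelihood, this symmetry can actually be checked numerically. A minimal sketch, with made-up ratings for Rybka and TSCP: the maximum-likelihood performance rating comes out the same whichever opponent you beat.

```python
import math

def p_win(diff):
    """Logistic Elo expectation: win probability given a rating difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def log_lik(r, results):
    """Log-likelihood of rating r given (opponent_elo, score) pairs (1=win, 0=loss)."""
    total = 0.0
    for opp, score in results:
        p = p_win(r - opp)
        total += math.log(p) if score == 1 else math.log(1.0 - p)
    return total

def mle_rating(results):
    """Grid-search the maximum-likelihood performance rating."""
    return max(range(0, 4001), key=lambda r: log_lik(r, results))

RYBKA, TSCP = 3000, 1700  # illustrative ratings, not official ones

beat_rybka = mle_rating([(RYBKA, 1), (TSCP, 0)])  # beat Rybka, lost to TSCP
beat_tscp  = mle_rating([(RYBKA, 0), (TSCP, 1)])  # lost to Rybka, beat TSCP
print(beat_rybka, beat_tscp)  # both land at the midpoint, 2350
```

Either result set maximizes the likelihood exactly at the midpoint of the two opponents' ratings, matching the intuition that the two scenarios are indistinguishable to the rating calculation.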
User avatar
hgm
Posts: 28387
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing results interpretation.

Post by hgm »

I don't think this matters much, indeed. BayesElo has the peculiar property, though, that draws are considered more significant than wins. (This is a result of the underlying Elo model.) So if you score two draws against opponent X, and a win plus a loss against opponent Y, it will estimate your rating closer to that of X than that of Y. In fact, every drawn game counts double, as if it were one win plus one loss.
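This pull toward the drawn-against opponent can be illustrated with a small sketch (hypothetical opponents at +100 and -100 Elo; each draw against X is entered as one win plus one loss, per the draws-count-double model described above):

```python
import math

def p(diff):
    # logistic Elo expectation
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def g(diff):
    # log-likelihood contribution of one win plus one loss at this rating gap
    return math.log(p(diff)) + math.log(1.0 - p(diff))

X, Y = 100, -100  # hypothetical opponent ratings

def log_lik(r):
    # two draws vs X, each treated as one win plus one loss (double weight),
    # plus one real win and one real loss vs Y
    return 2 * g(r - X) + g(r - Y)

mle = max(range(-200, 201), key=log_lik)
print(mle)  # lands above the midpoint of 0, i.e. closer to X
```

Even though the raw score against both opponents is 50%, the maximum-likelihood estimate is pulled toward X, because the draws carry twice the weight of the decisive pair.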
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Testing results interpretation.

Post by Daniel Shawul »

I don't know the details of Elo calculators, but I expected wins against different opponents to be weighted differently. If I win against Rybka and lose against an equally ranked opponent, or lose against Rybka and win against the equally ranked opponent, I see both get the same Elo, but I think the former should be ranked higher. Somehow it doesn't feel right to use the linear _averaged_ opponent Elo and then add an offset to it based on the _averaged_ score. If BayesElo differentiates between draws and wins, why not use similar logic for the opponent Elo too?
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Testing results interpretation.

Post by Daniel Shawul »

Why not count a win against an opponent ranked +400 Elo higher as if it were 2 wins and 1 loss? The net average score would be the same, but with a smaller SE.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Testing results interpretation.

Post by Daniel Shawul »

In fact, it looks like BayesElo does something similar, based on my rough reading of its web page:
http://remi.coulom.free.fr/Bayesian-Elo/
Elostat approach

The fundamental Elo formula can be reversed to obtain an estimation of the rating difference between two players, as a function of the average score. This is the basis of the Elostat approach, that works in two steps:

* Iterative method to solve a fixed-point equation, so that the rating of every player is in accordance to the reverse Elo formula, assuming an expected result equal to the average score, against an opponent whose Elo is equal to the average opponent. This is done under a constraint of a given average Elo over all the players.
* Measure of variance of score to estimate uncertainty.

The main flaw of this approach is that the estimation of uncertainty does as if a player had played against one opponent, whose Elo is equal to the mean Elo of the opponents. This assumption has bad consequences for the estimation of ratings and uncertainties:

* The expected result against two players is not equal to the expected result against one single player whose rating is the average of the two players.
* Estimation of uncertainty is wrong, because 10 wins and 10 losses against a 1500-Elo opponent should result in less uncertainty than 10 wins against a 500-Elo opponent and 10 losses against a 2500-Elo opponent.

Also, another problem is that the estimation of uncertainty in Elostat does as if the rating of opponents are their true ratings. But those ratings also have some uncertainty that should be taken into consideration.
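The first bullet above (that the expected result against two opponents differs from the expected result against their average) is easy to verify with the standard Elo expectation and the reverse formula; the ratings here are arbitrary, deliberately asymmetric examples:

```python
import math

def expected(diff):
    # standard Elo expectation for a rating difference
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def reverse_elo(score):
    # Elostat-style reverse formula: rating gap implied by an average score
    return -400.0 * math.log10(1.0 / score - 1.0)

me = 1500
opponents = [1900, 1600]  # hypothetical opponent ratings

# expected score computed per game, then averaged
per_game = sum(expected(me - o) for o in opponents) / len(opponents)
# expected score against a single "average opponent", as Elostat assumes
vs_average = expected(me - sum(opponents) / len(opponents))

print(per_game, vs_average)  # ~0.225 vs ~0.192: not the same
print(reverse_elo(0.5))      # sanity check: a 50% score implies a 0 Elo gap
```

The two numbers differ because the logistic expectation curve is nonlinear, so averaging opponents before applying it is not the same as averaging afterwards; this is precisely the Elostat flaw the quoted page describes.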
User avatar
Eelco de Groot
Posts: 4669
Joined: Sun Mar 12, 2006 2:40 am
Full name:   Eelco de Groot

Re: Testing results interpretation.

Post by Eelco de Groot »

Edmund wrote:
Eelco de Groot wrote:If it is really stronger running more games, add them to the total pgn for Bayeselo and the uncertainty range should come down. (...)
You would have to re-run the whole tournament, as you may not alter the sample size depending on the current state.

E.g., let's say 100 games were run between two equal-strength engines,
and in the end one engine is ahead by 11 wins (this would happen in <10% of all cases).
So this would suggest an LOS of 90%.

Let's now say you are not happy with the result (hoping for 95%) and want to run another 100 games.
For an LOS of 95% after 200 games you need to be 20 wins ahead.

So you only need another 9 wins in the next 100 games to reach your objective. The likelihood of this is 13.81%.


Let's now say you happen to test 10 changes (all of them of equal strength): 9 of them have LOS < 90%; 1 of them has LOS > 90%.
If you now use this condition, keep only the one > 90%, and run another test, you only need to reach an LOS of 86.19% when in fact you really wanted 95% (which is also what BayesElo would display).

regards,
Edmund
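Edmund's percentages can be reproduced approximately with the usual normal-approximation formula for the likelihood of superiority; this is a sketch that ignores draws, and his figures appear to be rounded, so the numbers will not match exactly:

```python
import math

def los(wins, losses):
    """Likelihood of superiority: P(stronger) under a normal approximation."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

print(los(50, 50))            # evenly matched: exactly 0.5
print(round(los(60, 40), 3))  # 20 wins ahead after 100 decisive games
```

With this formula, being 11 wins ahead after roughly 100 decisive games lands a little below 90%, and 20 ahead after 200 games a little below 95%, so Edmund's round figures are in the right ballpark.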
Sure, the confidence intervals cannot be strictly valid anymore. I believe there was a thread about this very subject recently, though I did not read all of it. The results of the added games, as I understand it, are no longer truly independent if there are conditions on whether they should be played or not, dependent on previous results. But there is also no reason why you would have to throw away the results of the first tournament: it is just as valid as a possible second one. Statistics is just a tool. I am sure it must be possible to devise some sort of test with which you could statistically handle such a progression of results, as in a medical trial of a new treatment: if the results are very good, it is no longer ethical to withhold the advantages of the treatment from the patients receiving the placebo.

I'm sure somebody must have developed statistics for determining confidence intervals for cases where a test is terminated prematurely 8-)

Maybe these 'double-blind medical trial termination and progression statistics' 8-) exist only in the form of patented software from some powerful medical company, but I am sure they would have covered themselves by being able to provide the mathematical proof if needed; if the problem can be solved mathematically, I'm sure they would have done it. That would be yet another reason why patenting software is wrong, in my opinion. But this patented medical software is pure speculation on my part, and maybe the problems can at present only be simulated numerically, though I don't really think so; I'm no mathematician, nor do I know enough about the statistics.

I only happened on your post just now, Edmund; sorry I missed it before. I have not done any research into the matter.

Regards, Eelco
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Testing results interpretation.

Post by Edmund »

Eelco de Groot wrote:Sure, the confidence intervals cannot be strictly valid anymore. I believe there was a thread about this very subject recently, though I did not read all of it. The results of the added games, as I understand it, are no longer truly independent if there are conditions on whether they should be played or not, dependent on previous results. But there is also no reason why you would have to throw away the results of the first tournament: it is just as valid as a possible second one. Statistics is just a tool. I am sure it must be possible to devise some sort of test with which you could statistically handle such a progression of results, as in a medical trial of a new treatment: if the results are very good, it is no longer ethical to withhold the advantages of the treatment from the patients receiving the placebo.

I'm sure somebody must have developed statistics for determining confidence intervals for cases where a test is terminated prematurely 8-)

Maybe these 'double-blind medical trial termination and progression statistics' 8-) exist only in the form of patented software from some powerful medical company, but I am sure they would have covered themselves by being able to provide the mathematical proof if needed; if the problem can be solved mathematically, I'm sure they would have done it. That would be yet another reason why patenting software is wrong, in my opinion. But this patented medical software is pure speculation on my part, and maybe the problems can at present only be simulated numerically, though I don't really think so; I'm no mathematician, nor do I know enough about the statistics.

I only happened on your post just now, Edmund; sorry I missed it before. I have not done any research into the matter.

Regards, Eelco
Hello Eelco,

I think the significance of the prior test depends very much on the conditions used to decide whether or not to conduct a follow-up test.

Should you use a 95% confidence interval to decide whether or not to run a second test, and, using this condition, you find yourself running an additional test in 5% of all cases, the significance of the first test would be zero. At the other extreme, if the condition were true all the time, the significance of the first test would be as high as that of the second.

Concerning the early termination of tests, I put some ideas on how to calculate it in the other thread. The problem I see after further research, though, is that it would only hold for a non-directional test, i.e. the probability that one engine does not have the same strength as the other, not the probability of it being stronger.

regards,
Edmund
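The inflation Edmund describes (continuing or stopping based on the observed LOS) can be illustrated by simulation. A sketch, assuming two exactly equal engines, no draws, and a hypothetical peek-every-50-games stopping rule:

```python
import math
import random

def los(wins, losses):
    # normal-approximation likelihood of superiority
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

def sequential_false_positive(trials=2000, max_games=400, check_every=50, threshold=0.95):
    """Fraction of equal-strength matches declared 'stronger' when the LOS is
    peeked at every check_every games and the match stops early on threshold."""
    random.seed(1)  # fixed seed so the experiment is reproducible
    hits = 0
    for _ in range(trials):
        wins = losses = 0
        for game in range(1, max_games + 1):
            if random.random() < 0.5:  # truly equal engines
                wins += 1
            else:
                losses += 1
            if game % check_every == 0 and los(wins, losses) > threshold:
                hits += 1  # false positive: an equal engine passed the test
                break
    return hits / trials

rate = sequential_false_positive()
print(rate)  # noticeably more than the nominal 5%
```

Even though each individual peek uses a 95% threshold, the fraction of equal-strength matches declared superior ends up well above the nominal 5%, which is why confidence figures reported after an early stop cannot be taken at face value.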