Eval Dilemma


Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Eval Dilemma

Post by Edsel Apostol »

bob wrote:
Edsel Apostol wrote:
michiguel wrote:
Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:I guess some of you may have encountered this. It's somewhat annoying. I'm currently in the process of trying out some new things on my eval function. Let's say I have an old eval feature I'm going to denote as F1 and a new implementation of this eval feature as F2.

I have tested F1 against a set of opponents using a set of test positions in a blitz tournament.

I then replaced F1 with F2, but by some twist of fate I accidentally enabled F2 for the white side only and F1 for the black side. I tested it and it scored way higher compared to F1 under the same test conditions. I said to myself, the new implementation works well, but then when I reviewed the code I found out that it was not implemented as I intended.

I then fixed the asymmetry bug and went on to implement the correct F2 feature. To my surprise it scored only between F1 and the F1/F2 combination. Note that I have not tried F2 for white and F1 for black to see if it still performs well.

Now here's my dilemma: if you were in my place, would you keep the bug that performs well, or implement the correct feature that doesn't perform as well?
I never accept bugs just because they are better. The idea is to understand what is going on, and _why_ the bug is making it play better (this is assuming it really is, which may well require a ton of games to verify) and go from there. Once you understand the "why" then you can probably come up with an implementation that is symmetric and still works well.
Since I do lack the resources to test them thoroughly I mostly rely on intuition. Since this one is so counter intuitive, I don't know what to decide. Well I guess I will just have to choose the right implementation even if it seems to be weaker in my limited tests.
You said that you knew it was too few games. But I do not think you knew the magnitude of games needed to come up with a conclusion. What Giancarlo was pointing out can be translated to: "Both versions do not look any weaker or stronger than the other". So, your test does not look counter intuitive.

To make a decision based only on the number of wins you had in your tests is almost like basing it on flipping coins. The difference you got was ~10 wins in 240 games. You had a performance of ~33%. This is not the same (because you have draws), but just to get an idea, throw a die 240 times and count how many times you get 1 or 2 (33% chance). Do it again and again. The number will oscillate around 80, but getting close to 70 or 90 is not that unlikely. This is pretty well established. The fact that you are using only 20 positions and 4 engines makes differences even less significant (statistically speaking).

Miguel
I'm using 30 positions, each played from both colors, so 60 games per opponent; multiplied by four opponents, that equals 240.

I don't think that basing a decision on just 240 games is like basing it on flipping coins. What I know is that there is a certain difference in win percentage at which you could declare one version better than the other if their error bars don't overlap.

For example, I have a version with a performance of 2400 +-40 and another version with a performance of 2600 +-40. The upper limit of the first version is 2440 and the lower limit of the second version is 2560; they don't overlap, so in this case I could say that the second version is better than the first even if I only have a few hundred games.
It is just as random, in fact. I have a script I run on the cluster when I am testing. It grabs all the completed games and runs them through bayeselo. It is almost a given that after 1000 games, the Elo will be 20-30 above or below where the 32,000-game Elo will end up. Many times a new change starts off looking like it will be a winner, only to sink down and be a no-change or worse....

If you play 10,000 games and look at the result as a series of W/L/D characters, you can find all sorts of "strings" inside that 10,000-game result that will produce results significantly different from the total.

1,000 games is worthless for 99% of the changes you will make.
I don't know how you proved that it is random. If you say that at least 32,000 games are needed to determine whether an engine or version is better, then I think we could not trust rating lists like CCRL and CEGT, as they only have a few hundred to a couple of thousand games per engine or version.

If, for example, you pit Twisted Logic against Rybka over just 100 games and the result is 100% for Rybka, would you say that there are not enough games to conclude that Rybka is much better than Twisted Logic?

Have you seen in your tests that, for example, after 1000 games the performance is 2700 +-20, but after 32000 games the performance ends up outside that error bar, for example at 2750? I'm asking this because of what I've said above: even with just a few games you could trust the result if the performances of the two versions, with their error bars taken into account, don't overlap. You seem to dismiss this as random.
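
The overlap rule described here is easy to state in code. Below is a minimal sketch using the hypothetical 2400 +-40 and 2600 +-40 figures from the example; whether the error bars themselves can be trusted at a given number of games is a separate question.

Code:

/* Sketch of the "error bars don't overlap" rule described above.  The
   ratings and error bars are the hypothetical 2400 +-40 and 2600 +-40
   figures from the example, not measurements. */
#include <stdio.h>

int main(void) {
    double rating_a = 2400.0, error_a = 40.0;   /* version A: performance +- error bar */
    double rating_b = 2600.0, error_b = 40.0;   /* version B: performance +- error bar */

    double upper_a = rating_a + error_a;        /* 2440 */
    double lower_b = rating_b - error_b;        /* 2560 */

    if (lower_b > upper_a)
        printf("No overlap: B (%.0f..%.0f) looks better than A (%.0f..%.0f)\n",
               lower_b, rating_b + error_b, rating_a - error_a, upper_a);
    else
        printf("Overlap: this test does not separate A and B\n");
    return 0;
}
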
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Eval Dilemma

Post by Edsel Apostol »

diep wrote:
Edsel Apostol wrote:
diep wrote:
michiguel wrote:
diep wrote:
michiguel wrote:
diep wrote:
hgm wrote:
MattieShoes wrote:Can you or anybody point me to how the error bars are calculated?
I think the rule-of-thumb Error = 40%/sqrt(numberOfGames) is accurate enough in practice, for scores in the 65%-35% range. (This is for the 1-sigma or 84% confidence level; for 95% confidence, double it.) For very unbalanced scores, you would have to take into account the fact that the 40% goes down; the exact formula for this is

100%*sqrt(score*(1-score) - 0.25*drawFraction)

where the score is given as a fraction. The 40% is based on 35% draws and a score of 0.5. In the case mentioned (score around 0.25, presumably 15% wins and 20% draws), you would get 100%*sqrt(0.25*0.75-0.25*0.2) = 37%. So in 240 games you would have a 2.4% error bar (1 sigma).

When comparing the results from two independent gauntlets, the error bar in the difference is the Pythagorean sum of the individual error bars (i.e. sqrt(error1^2 + error2^2)). For results that had equal numbers of games, this means multiplying the individual error bars by sqrt(2).
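
As a small sketch of this arithmetic (error_bar is just an illustrative helper name; the score, draw fraction, and game count are the example numbers from the quote):

Code:

/* Sketch of the formulas quoted above:
     per-game sigma          = sqrt(score*(1-score) - 0.25*drawFraction)
     error bar for N games   = sigma / sqrt(N)            (1 sigma)
     error bar of difference = sqrt(error1^2 + error2^2)  */
#include <math.h>
#include <stdio.h>

/* one-sigma error bar on the score of a gauntlet, as a fraction */
static double error_bar(double score, double draw_fraction, int games) {
    double sigma = sqrt(score * (1.0 - score) - 0.25 * draw_fraction);
    return sigma / sqrt((double)games);
}

int main(void) {
    /* the example from the quote: score ~0.25, 20% draws, 240 games */
    double e = error_bar(0.25, 0.20, 240);
    printf("1-sigma error bar: %.1f%%\n", 100.0 * e);    /* about 2.4% */

    /* difference between two equal-sized independent gauntlets:
       sqrt(e^2 + e^2) = e * sqrt(2) */
    printf("1-sigma error bar of the difference: %.1f%%\n",
           100.0 * sqrt(2.0) * e);                        /* about 3.4% */
    return 0;
}
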
A score difference of 30% (35% ==> 65%) is so big that if adding just one pattern to your evaluation has such a huge impact, we can only hope that it is not the 'average pattern' you add.

More likely you will see a score difference of 1 point over 200 games when adding just one tiny pattern.

Vincent
He was not talking about increases from 35% to 65%. The formula is valid if both score A and score B (from versions A and B) are within 35% to 65%. In other words, if score A is 48% and score B is 52%, you can apply the formula. If score A is 8% and score B is 12%, you cannot.

Miguel
If A scores 48% and B scores 52%, that's basically blowing 2 games with maybe just 2 very bad moves, as that can give a 4-point swing in total.

First of all, the odds that these 2 bad moves were caused by the specific pattern are tiny. It could be some fluctuation, book learning, or some other effect.

So you will soon conclude that you need THOUSANDS of games for really good statistical significance. I usually go for 95%.

Vincent
Going from 50-50 to 52-48 is an increase of ~15 Elo points. Yes, you need thousands of games to make sure it is real with a good level of confidence.

Miguel
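
That figure can be checked with the standard logistic score-to-Elo conversion. A small sketch (elo_from_score is just an illustrative helper name):

Code:

/* Standard logistic conversion from an expected score to an Elo difference:
   elo = 400 * log10(score / (1 - score)).  A 52% score works out to about
   14 Elo, in the same ballpark as the ~15 mentioned above. */
#include <math.h>
#include <stdio.h>

static double elo_from_score(double score) {
    return 400.0 * log10(score / (1.0 - score));
}

int main(void) {
    printf("52%% score -> %+.1f Elo\n", elo_from_score(0.52));
    printf("65%% score -> %+.1f Elo\n", elo_from_score(0.65));
    return 0;
}
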
Additionally, they should preferably be played on the hardware and at the time control you want to play tournaments at.

Jonathan Schaeffer: "You have to test with what you play".
There seems to be a positive correlation of engine strength between short and longer time controls, based on some results published here and on what you can also notice in the established rating lists.

So if you lack the time and resources to test with what you play, for example a tournament time control of 40/40, you can compromise by playing blitz games. There is a high probability that the result at blitz will correspond to the result at longer time controls, though a few engines are exceptions.
Well, you get lured by the evaluation function.

For search, several algorithms that work well at short time controls no longer work very well at longer time controls, and vice versa.

Vincent
Okay, I get your point here and I agree that you're right, but I think you can avoid this. Please see my reply to Bob on a similar quote. My suggestion there seems logical to me, but it still needs to be proven and is just my opinion.
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Eval Dilemma

Post by Edsel Apostol »

michiguel wrote:
Edsel Apostol wrote:
michiguel wrote:
Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:I guess some of you may have encountered this. It's somewhat annoying. I'm currently in the process of trying out some new things on my eval function. Let's say I have an old eval feature I'm going to denote as F1 and a new implementation of this eval feature as F2.

I have tested F1 against a set of opponents using a set of test positions in a blitz tournament.

I then replaced F1 with F2, but by some twist of fate I accidentally enabled F2 for the white side only and F1 for the black side. I tested it and it scored way higher compared to F1 under the same test conditions. I said to myself, the new implementation works well, but then when I reviewed the code I found out that it was not implemented as I intended.

I then fixed the asymmetry bug and went on to implement the correct F2 feature. To my surprise it scored only between F1 and the F1/F2 combination. Note that I have not tried F2 for white and F1 for black to see if it still performs well.

Now here's my dilemma: if you were in my place, would you keep the bug that performs well, or implement the correct feature that doesn't perform as well?
I never accept bugs just because they are better. The idea is to understand what is going on, and _why_ the bug is making it play better (this is assuming it really is, which may well require a ton of games to verify) and go from there. Once you understand the "why" then you can probably come up with an implementation that is symmetric and still works well.
Since I do lack the resources to test them thoroughly I mostly rely on intuition. Since this one is so counter intuitive, I don't know what to decide. Well I guess I will just have to choose the right implementation even if it seems to be weaker in my limited tests.
You said that you knew it was too few games. But I do not think you knew the magnitude of games needed to come up with a conclusion. What Giancarlo was pointing out can be translated to: "Both versions do not look any weaker or stronger than the other". So, your test does not look counter intuitive.

To make a decision based only on the number of wins you had in your tests is almost like basing it on flipping coins. The difference you got was ~10 wins in 240 games. You had a performance of ~33%. This is not the same (because you have draws), but just to get an idea, throw a die 240 times and count how many times you get 1 or 2 (33% chance). Do it again and again. The number will oscillate around 80, but getting close to 70 or 90 is not that unlikely. This is pretty well established. The fact that you are using only 20 positions and 4 engines makes differences even less significant (statistically speaking).

Miguel
I'm using 30 positions, each played from both colors, so 60 games per opponent; multiplied by four opponents, that equals 240.

I don't think that basing a decision on just 240 games is like basing it on flipping coins. What I know is that there is a certain difference in win percentage at which you could declare one version better than the other if their error bars don't overlap.
If the confidence value you obtain is not far from 50%, it is not much different from flipping coins. That is what you get with a difference of a handful of wins in 240 games. You mentioned in another post a difference of 3 wins. That was a hint that I had better warn you not to put any weight on this type of result. That is exactly like flipping coins.
For example, I have a version with a performance of 2400 +-40 and another version with a performance of 2600 +-40. The upper limit of the first version is 2440 and the lower limit of the second version is 2560; they don't overlap, so in this case I could say that the second version is better than the first even if I only have a few hundred games.
That is not the example you brought!

Miguel
Okay, thanks for the hint.

If the results are within the error bars of each version's performance, then the results are indeed not reliable.

Testing seems very tedious and is an art in itself, but I'm starting to learn.
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: Eval Dilemma

Post by MattieShoes »

Hmm, I guess so. Nullmove + simple pawn structure + futility pruning gave me 200+. I guess now it gets difficult :-(

I suppose for very situational eval changes, it could be easier to test at least. My engine knows to push kings to the corners in endgames but for KNB vs K, it doesn't know two of the corners are no good and can't search deep enough to figure it out. That's not much more than an if statement in eval but would be nearly impossible to test for us mortals since 99.9% of games never get to KNB vs K endgames...
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: Eval Dilemma

Post by MattieShoes »

Err, what I was attempting to say was one could test the situation directly with a variety of KNB vs K positions in that case rather than play a berjillion games, so even though the Elo gain is probably less than 1, it's still possible to test it...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Eval Dilemma

Post by bob »

MattieShoes wrote:Err, what I was attempting to say was one could test the situation directly with a variety of KNB vs K positions in that case rather than play a berjillion games, so even though the Elo gain is probably less than 1, it's still possible to test it...
No you can't. What if the change makes it play better in KNB vs K positions, but wrecks things in KNBR vs KNBR positions? It is quite easy to design a change to improve performance on a particular position. But you have to make sure it doesn't hurt more in other places before you can accept it.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Eval Dilemma

Post by bob »

Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:
michiguel wrote:
Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:I guess some of you may have encountered this. It's somewhat annoying. I'm currently in the process of trying out some new things on my eval function. Let's say I have an old eval feature I'm going to denote as F1 and a new implementation of this eval feature as F2.

I have tested F1 against a set of opponents using a set of test positions in a blitz tournament.

I then replaced F1 with F2, but by some twist of fate I accidentally enabled F2 for the white side only and F1 for the black side. I tested it and it scored way higher compared to F1 under the same test conditions. I said to myself, the new implementation works well, but then when I reviewed the code I found out that it was not implemented as I intended.

I then fixed the asymmetry bug and went on to implement the correct F2 feature. To my surprise it scored only between F1 and the F1/F2 combination. Note that I have not tried F2 for white and F1 for black to see if it still performs well.

Now here's my dilemma: if you were in my place, would you keep the bug that performs well, or implement the correct feature that doesn't perform as well?
I never accept bugs just because they are better. The idea is to understand what is going on, and _why_ the bug is making it play better (this is assuming it really is, which may well require a ton of games to verify) and go from there. Once you understand the "why" then you can probably come up with an implementation that is symmetric and still works well.
Since I do lack the resources to test them thoroughly I mostly rely on intuition. Since this one is so counter intuitive, I don't know what to decide. Well I guess I will just have to choose the right implementation even if it seems to be weaker in my limited tests.
You said that you knew it was too few games. But I do not think you knew the magnitude of games needed to come up with a conclusion. What Giancarlo was pointing out can be translated to: "Both versions do not look any weaker or stronger than the other". So, your test does not look counter intuitive.

To make a decision based only on the number of wins you had in your tests is almost like basing it on flipping coins. The difference you got was ~10 wins in 240 games. You had a performance of ~33%. This is not the same (because you have draws), but just to get an idea, throw a die 240 times and count how many times you get 1 or 2 (33% chance). Do it again and again. The number will oscillate around 80, but getting close to 70 or 90 is not that unlikely. This is pretty well established. The fact that you are using only 20 positions and 4 engines makes differences even less significant (statistically speaking).

Miguel
I'm using 30 positions, each played from both colors, so 60 games per opponent; multiplied by four opponents, that equals 240.

I don't think that basing a decision on just 240 games is like basing it on flipping coins. What I know is that there is a certain difference in win percentage at which you could declare one version better than the other if their error bars don't overlap.

For example, I have a version with a performance of 2400 +-40 and another version with a performance of 2600 +-40. The upper limit of the first version is 2440 and the lower limit of the second version is 2560; they don't overlap, so in this case I could say that the second version is better than the first even if I only have a few hundred games.
It is just as random, in fact. I have a script I run on the cluster when I am testing. It grabs all the completed games and runs them through bayeselo. It is almost a given that after 1000 games, the Elo will be 20-30 above or below where the 32,000-game Elo will end up. Many times a new change starts off looking like it will be a winner, only to sink down and be a no-change or worse....

If you play 10,000 games and look at the result as a series of W/L/D characters, you can find all sorts of "strings" inside that 10,000-game result that will produce results significantly different from the total.

1,000 games is worthless for 99% of the changes you will make.
I don't know how you proved that it is random. If you say that at least 32,000 games are needed to determine whether an engine or version is better, then I think we could not trust rating lists like CCRL and CEGT, as they only have a few hundred to a couple of thousand games per engine or version.

If, for example, you pit Twisted Logic against Rybka over just 100 games and the result is 100% for Rybka, would you say that there are not enough games to conclude that Rybka is much better than Twisted Logic?
NO, because I never said _that_. I said "for two engines that are close to each other in strength, it takes a _ton_ of games to accurately assess which one is stronger." That's a lot different from your example. 100-0 is a clear superiority, although that can still happen between two programs of equal strength given enough games.

If you don't believe this is a problem, there's little I can do to convince you. Those that know, know. Those that don't will eventually one day figure it out.

Have you seen in your tests that, for example, after 1000 games the performance is 2700 +-20, but after 32000 games the performance ends up outside that error bar, for example at 2750? I'm asking this because of what I've said above: even with just a few games you could trust the result if the performances of the two versions, with their error bars taken into account, don't overlap. You seem to dismiss this as random.
I don't see many outside the error bar, although it definitely happens. But I have seen results where after 100 games the rating was 2700+ while after 32000 games it was 2600. Again, for your case to happen, the two programs would have to be _hundreds_ of Elo apart, as the error bar for a hundred games or so is huge. You are not going to have that kind of huge difference when you are trying to test A' against A to see if A' (modified A) is better or worse than the original A.

You are letting yourself be convinced that for a patzer vs a GM, 100 games are enough to recognize with high confidence that the GM is stronger, and then trying to extrapolate that to two nearly equal opponents and use the same testing approach, which just doesn't work.
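
A rough sketch of how wide that error bar is at a few game counts, using the 40%/sqrt(N) rule of thumb quoted earlier in the thread and the standard logistic score-to-Elo conversion (back-of-the-envelope numbers, not output from any cluster run):

Code:

/* How wide the rule-of-thumb error bar (40%/sqrt(N)) is at a few game
   counts, converted to Elo with the standard logistic formula.  These are
   approximations from the rule of thumb, nothing more. */
#include <math.h>
#include <stdio.h>

static double elo_from_score(double score) {
    return 400.0 * log10(score / (1.0 - score));
}

int main(void) {
    int games[] = { 100, 240, 1000, 10000, 32000 };
    int count = (int)(sizeof games / sizeof games[0]);

    for (int i = 0; i < count; i++) {
        double bar = 0.40 / sqrt((double)games[i]);   /* 1-sigma, as a score fraction */
        printf("%6d games: 1-sigma ~ %4.1f%% ~ %5.1f Elo, 2-sigma ~ %5.1f Elo\n",
               games[i], 100.0 * bar,
               elo_from_score(0.5 + bar),             /* the bar in Elo, near a 50% score */
               elo_from_score(0.5 + 2.0 * bar));
    }
    return 0;
}

At 100 games even the one-sigma bar comes out close to 30 Elo, which is why a small A vs A' difference cannot be resolved there.
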
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: Eval Dilemma

Post by MattieShoes »

I was assuming a more specific if statement that is ONLY true for KNB vs K (or the inverse). All non-KNB vs K evals would take whatever hit is associated with determining that the if statement is false.

Or actually, I was considering writing an endgame-specific eval that kicks in when material is very low and scores several endgame situations more correctly, since the typical material eval kind of breaks down with pawn races, KNN vs K, etc. And within that, have an if statement that tests specifically for KNB vs K endgames.
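
Such a term might look roughly like the sketch below. This is illustrative only: the square numbering, helper names and centipawn scale are made up, and no particular engine's code is implied. The idea is that in KBN vs K mate can only be forced in a corner of the bishop's colour, so the bonus grows as the bare king is driven toward one of those two corners.

Code:

/* Sketch of a KBN vs K scoring term: reward pushing the bare king toward a
   corner of the bishop's colour (the only corners where mate can be forced).
   Square numbering (0..63, a1 = 0, h8 = 63) and the centipawn scale are
   illustrative conventions only. */
#include <stdio.h>
#include <stdlib.h>

static int file_of(int sq) { return sq & 7; }
static int rank_of(int sq) { return sq >> 3; }

/* Chebyshev (king-move) distance between two squares */
static int dist(int a, int b) {
    int df = abs(file_of(a) - file_of(b));
    int dr = abs(rank_of(a) - rank_of(b));
    return df > dr ? df : dr;
}

/* Bonus (centipawns) for the strong side in a KBN vs K ending.
   loser_ksq: square of the bare king; bishop_on_dark: colour of our bishop. */
static int kbn_vs_k_bonus(int loser_ksq, int bishop_on_dark) {
    /* the two corners matching the bishop's colour */
    int c1 = bishop_on_dark ? 0  : 7;    /* a1 (dark)  or h1 (light) */
    int c2 = bishop_on_dark ? 63 : 56;   /* h8 (dark)  or a8 (light) */

    int d1 = dist(loser_ksq, c1);
    int d2 = dist(loser_ksq, c2);
    int d  = d1 < d2 ? d1 : d2;          /* distance to the nearer "right" corner */

    return 50 * (7 - d);                 /* closer to the right corner = bigger bonus */
}

int main(void) {
    /* bare king on h1 (square 7), dark-squared bishop: wrong corner, no bonus */
    printf("king on h1, dark bishop: %d\n", kbn_vs_k_bonus(7, 1));
    /* bare king on a1 (square 0), dark-squared bishop: right corner, full bonus */
    printf("king on a1, dark bishop: %d\n", kbn_vs_k_bonus(0, 1));
    return 0;
}

In a real eval this would of course sit behind the low-material check described above, so only genuine KNB vs K positions ever reach it.
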
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Eval Dilemma

Post by Edsel Apostol »

bob wrote:
Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:
michiguel wrote:
Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:I guess some of you may have encountered this. It's somewhat annoying. I'm currently in the process of trying out some new things on my eval function. Let's say I have an old eval feature I'm going to denote as F1 and a new implementation of this eval feature as F2.

I have tested F1 against a set of opponents using a set of test positions in a blitz tournament.

I then replaced F1 with F2, but by some twist of fate I accidentally enabled F2 for the white side only and F1 for the black side. I tested it and it scored way higher compared to F1 under the same test conditions. I said to myself, the new implementation works well, but then when I reviewed the code I found out that it was not implemented as I intended.

I then fixed the asymmetry bug and went on to implement the correct F2 feature. To my surprise it scored only between F1 and the F1/F2 combination. Note that I have not tried F2 for white and F1 for black to see if it still performs well.

Now here's my dilemma: if you were in my place, would you keep the bug that performs well, or implement the correct feature that doesn't perform as well?
I never accept bugs just because they are better. The idea is to understand what is going on, and _why_ the bug is making it play better (this is assuming it really is, which may well require a ton of games to verify) and go from there. Once you understand the "why" then you can probably come up with an implementation that is symmetric and still works well.
Since I do lack the resources to test them thoroughly I mostly rely on intuition. Since this one is so counter intuitive, I don't know what to decide. Well I guess I will just have to choose the right implementation even if it seems to be weaker in my limited tests.
You said that you knew it was too few games. But I do not think you knew the magnitude of games needed to come up with a conclusion. What Giancarlo was pointing out can be translated to: "Both versions do not look any weaker or stronger than the other". So, your test does not look counter intuitive.

To make a decision based only on the number of wins you had in your tests is almost like basing it on flipping coins. The difference you got was ~10 wins in 240 games. You had a performance of ~33%. This is not the same (because you have draws), but just to get an idea, throw a die 240 times and count how many times you get 1 or 2 (33% chance). Do it again and again. The number will oscillate around 80, but getting close to 70 or 90 is not that unlikely. This is pretty well established. The fact that you are using only 20 positions and 4 engines makes differences even less significant (statistically speaking).

Miguel
I'm using 30 positions, each played from both colors, so 60 games per opponent; multiplied by four opponents, that equals 240.

I don't think that basing a decision on just 240 games is like basing it on flipping coins. What I know is that there is a certain difference in win percentage at which you could declare one version better than the other if their error bars don't overlap.

For example, I have a version with a performance of 2400 +-40 and another version with a performance of 2600 +-40. The upper limit of the first version is 2440 and the lower limit of the second version is 2560; they don't overlap, so in this case I could say that the second version is better than the first even if I only have a few hundred games.
It is just as random, in fact. I have a script I run on the cluster when I am testing. It grabs all the completed games and runs them through bayeselo. It is almost a given that after 1000 games, the Elo will be 20-30 above or below where the 32,000-game Elo will end up. Many times a new change starts off looking like it will be a winner, only to sink down and be a no-change or worse....

If you play 10,000 games and look at the result as a series of W/L/D characters, you can find all sorts of "strings" inside that 10,000-game result that will produce results significantly different from the total.

1,000 games is worthless for 99% of the changes you will make.
I don't know how you proved that it is random. If you say that at least 32,000 games are needed to determine whether an engine or version is better, then I think we could not trust rating lists like CCRL and CEGT, as they only have a few hundred to a couple of thousand games per engine or version.

If, for example, you pit Twisted Logic against Rybka over just 100 games and the result is 100% for Rybka, would you say that there are not enough games to conclude that Rybka is much better than Twisted Logic?
NO, because I never said _that_. I said "for two engines that are close to each other in strength, it takes a _ton_ of games to accurately assess which one is stronger." That's a lot different from your example. 100-0 is a clear superiority, although that can still happen between two programs of equal strength given enough games.

If you don't believe this is a problem, there's little I can do to convince you. Those that know, know. Those that don't will eventually one day figure it out.

Have you seen in your tests that, for example, after 1000 games the performance is 2700 +-20, but after 32000 games the performance ends up outside that error bar, for example at 2750? I'm asking this because of what I've said above: even with just a few games you could trust the result if the performances of the two versions, with their error bars taken into account, don't overlap. You seem to dismiss this as random.
I don't see many outside the error bar, although it definitely happens. But I have seen results where after 100 games the rating was 2700+ while after 32000 games it was 2600. Again, for your case to happen, the two programs would have to be _hundreds_ of Elo apart, as the error bar for a hundred games or so is huge. You are not going to have that kind of huge difference when you are trying to test A' against A to see if A' (modified A) is better or worse than the original A.

You are letting yourself be convinced that for a patzer vs a GM, 100 games are enough to recognize with high confidence that the GM is stronger, and then trying to extrapolate that to two nearly equal opponents and use the same testing approach, which just doesn't work.
Okay, I understand. For two versions that are nearly equal in strength, you need a lot of games to make sure that the error bars don't overlap before you can trust the results. For example, performances of 2700 +-20 and 2710 +-20 after a thousand games are not yet enough to say that one version is better. The difference in performance should be greater than 40 if the error bar on each side is 20.

What I'm trying to point out is that it is also possible to trust the result of even just a few games.

If you can see results outside of the error bars after more games, then there's definitely a flaw in how the rating is being calculated.
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: Eval Dilemma

Post by MattieShoes »

Assuming I understand confidence intervals correctly, you'd expect the true value to lie outside the error bars 5% of the time with a 95% confidence interval. So you'd expect values outside the error bars occasionally after more games.

After playing further with my spreadsheet, I'm noticing something weird.

Situation A:
X scores 0.20 in a gauntlet with 20% draws
Y scores 0.25 in a gauntlet with 20% draws
That'd take somewhere around 800 games to get error bars small enough.

Situation B:
X scores 0.475 in a gauntlet with 33% draws
Y scores 0.525 in a gauntlet with 33% draws
The difference in scores is equal but this situation would take closer to 1300 games to get error bars small enough.

That seems backwards to me, though. Is that how it's "supposed to be", or did I screw something up?
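
For what it's worth, plugging the two situations into the per-game sigma formula quoted earlier in the thread gives the sketch below. It assumes "error bars small enough" means the two 2-sigma (~95%) bars just stop overlapping; a different criterion shifts the exact counts, but with these sigmas situation B always needs more games than situation A.

Code:

/* Plug the two situations above into the per-game sigma formula quoted
   earlier in the thread, sigma = sqrt(score*(1-score) - 0.25*drawFraction),
   and ask how many games per gauntlet are needed before the two 2-sigma
   (~95%) error bars stop overlapping.  The non-overlap criterion is an
   assumption; helper names are illustrative. */
#include <math.h>
#include <stdio.h>

static double sigma_per_game(double score, double draw_fraction) {
    return sqrt(score * (1.0 - score) - 0.25 * draw_fraction);
}

/* smallest N such that 2*sx/sqrt(N) + 2*sy/sqrt(N) < gap */
static int games_needed(double sx, double sy, double gap) {
    double root_n = 2.0 * (sx + sy) / gap;
    return (int)ceil(root_n * root_n);
}

int main(void) {
    /* Situation A: X scores 0.20, Y scores 0.25, 20% draws each */
    double ax = sigma_per_game(0.20, 0.20), ay = sigma_per_game(0.25, 0.20);
    /* Situation B: X scores 0.475, Y scores 0.525, 33% draws each */
    double bx = sigma_per_game(0.475, 0.33), by = sigma_per_game(0.525, 0.33);

    printf("A: per-game sigma %.3f and %.3f -> about %d games each\n",
           ax, ay, games_needed(ax, ay, 0.05));
    printf("B: per-game sigma %.3f and %.3f -> about %d games each\n",
           bx, by, games_needed(bx, by, 0.05));
    return 0;
}

That is how it is supposed to be: score*(1-score) peaks at a 50% score and outweighs the larger draw fraction, so the per-game sigma is bigger in situation B and, for the same 5% gap, the error bars shrink more slowly.
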