I don't know how you proved that it is random. If you would say that at least 32,000 games are needed to determine whether an engine or version is better, then I think we could not trust rating lists like CCRL and CEGT, as they only have a few hundred to a couple of thousand games per engine or version.

bob wrote: It is just as random, in fact. I have a script I run on the cluster when I am testing. It grabs all the completed games and runs them through bayeselo. It is almost a given that after 1000 games, the Elo will be 20-30 above or below where the 32,000-game Elo will end up. Many times a new change starts off looking like it will be a winner, only to sink down and be a no-change or worse...

Edsel Apostol wrote: I'm using 30 positions played for both colors, so 60 positions per opponent multiplied by four opponents equals 240.

michiguel wrote: You said that you knew it was too few games. But I do not think you knew the magnitude of the number of games needed to come to a conclusion. What Giancarlo was pointing out can be translated to: "Both versions do not look any weaker or stronger than the other." So, your test does not look counterintuitive.

Edsel Apostol wrote: Since I lack the resources to test them thoroughly, I mostly rely on intuition. Since this one is so counterintuitive, I don't know what to decide. Well, I guess I will just have to choose the right implementation even if it seems to be weaker in my limited tests.

bob wrote: I never accept bugs just because they are better. The idea is to understand what is going on, and _why_ the bug is making it play better (this is assuming it really is, which may well require a ton of games to verify) and go from there. Once you understand the "why", then you can probably come up with an implementation that is symmetric and still works well.

Edsel Apostol wrote: I guess some of you may have encountered this. It's somewhat annoying. I'm currently in the process of trying out some new things in my eval function. Let's say I have an old eval feature, which I'll denote F1, and a new implementation of this eval feature, F2.
I have tested F1 against a set of opponents using a set of test positions in a blitz tournament.
I then replaced F1 with F2, but by some twist of fate I accidentally enabled F2 for the white side only and F1 for the black side. I tested it and it scored way higher than F1 under the same test conditions. I said to myself, the new implementation works well, but when I reviewed the code I found out that it was not implemented as I intended.
I then fixed the asymmetry bug and went on to implement the correct F2 feature. To my surprise, it scored only between F1 and the F1/F2 combination. Note that I have not tried F2 for white and F1 for black to see if it still performs well.
Now here's my dilemma: if you were in my place, would you keep the bug that performs well, or implement the correct feature that doesn't perform as well?
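Before going further, it helps to put rough numbers on how much a 240-game match can actually tell you. Here is a minimal Python sketch using the standard logistic Elo formula; the W/D/L counts are hypothetical, chosen only to match the approximate figures (about a 33% score and a ~10-win gap) quoted later in the thread:

Code:

import math

def elo_and_error(wins, draws, losses, z=1.96):
    # Rough Elo estimate and ~95% error bar from one match result.
    # Uses the logistic model elo = -400*log10(1/score - 1) and a normal
    # approximation on the mean score; it ignores opponent strength and
    # opening/colour pairing, so it is a back-of-the-envelope check,
    # not a bayeselo replacement.
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)

    def to_elo(p):
        p = min(max(p, 1e-6), 1.0 - 1e-6)   # keep the log finite
        return -400.0 * math.log10(1.0 / p - 1.0)

    return to_elo(score), to_elo(score - z * se), to_elo(score + z * se)

# Hypothetical 240-game results (the real W/D/L split is not given):
results = {"F1": (55, 50, 135), "F2 (asymmetric)": (65, 50, 125)}
for name, (w, d, l) in results.items():
    elo, lo, hi = elo_and_error(w, d, l)
    print(f"{name:16s} Elo {elo:+.0f}   95% range [{lo:+.0f}, {hi:+.0f}]")

With counts like these, the two 95% ranges overlap heavily, which is the crux of the disagreement that follows.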
michiguel wrote: To make a decision based only on the number of wins you had in your tests is almost the same as basing it on flipping coins. The difference you got was ~10 wins in 240 games, and you had a performance of ~33%. It is not quite the same thing (because you have draws), but just to get an idea, throw a die 240 times and count how many times you get a 1 or a 2 (a 33% chance). Do it again and again. The count will oscillate around 80, but getting close to 70 or 90 is not that unlikely. This is pretty well established. The fact that you are using only 20 positions and 4 engines makes the differences even less significant (statistically speaking).
Miguel
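Miguel's die experiment is easy to reproduce directly. A small sketch (the 240 trials and the one-in-three probability come from his post; the seed and the number of repetitions are arbitrary):

Code:

import random

random.seed(1)                      # fixed seed so the runs are repeatable
trials, p, repeats = 240, 1.0 / 3.0, 20

# Each repeat: 240 independent trials that succeed with probability 1/3
# (like rolling a 1 or a 2 on a die), counting the successes.
counts = sorted(sum(1 for _ in range(trials) if random.random() < p)
                for _ in range(repeats))

print("counts:", counts)
print("min", counts[0], "max", counts[-1], "expected", round(trials * p))
# The totals are centred on 80 but routinely drift into the low 70s and
# high 80s, i.e. swings of the same size as a ~10-win difference.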
I don't think that basing a decision on just 240 games is like basing it on flipping coins. What I know is that there is a certain difference in win percentage at which you can declare one version better than the other: when their error bars don't overlap.
For example, say I have a version with a performance of 2400 +-40 and another version with a performance of 2600 +-40. The upper limit of the first version is 2440 and the lower limit of the second version is 2560; they don't overlap, so in this case I could say that the second version is better than the first even if I only have a few hundred games.
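How wide that error bar actually is depends mostly on the number of games. A rough sketch under simplifying assumptions (independent games, an assumed 30% draw rate, a normal approximation on the score rather than bayeselo's model):

Code:

import math

def elo_error_bar(games, draw_rate=0.30, score=0.50, z=1.96):
    # Approximate 95% error bar (in Elo) on the result of `games`
    # independent games. The draw rate is an assumption; error bars from
    # bayeselo or a rating list will differ somewhat.
    win = score - 0.5 * draw_rate
    loss = 1.0 - win - draw_rate
    var = (win * (1.0 - score) ** 2
           + draw_rate * (0.5 - score) ** 2
           + loss * score ** 2)
    se_score = math.sqrt(var / games)
    # Slope of the logistic Elo curve at `score` converts score error
    # into Elo error.
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return z * se_score * slope

for n in (240, 1000, 4000, 32000):
    print(f"{n:>6} games: roughly +/- {elo_error_bar(n):.0f} Elo")

Under these assumptions, +-40 is roughly what a couple of hundred games buy you, while shrinking the bar to a handful of Elo takes tens of thousands of games.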
bob wrote: If you play 10,000 games and you look at the result as a series of w/l/d characters, you can find all sorts of "strings" inside that 10,000-game result that will produce results significantly different from the total.
1,000 games is worthless for 99% of the changes you will make.
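The "strings inside a long run" effect is also easy to show by simulation: generate 10,000 games from one fixed, known strength and chop them into 1,000-game slices. The true score and draw rate below are made up for the sketch:

Code:

import math
import random

random.seed(7)
true_score, draw_rate, total, chunk = 0.52, 0.30, 10_000, 1_000
win_p = true_score - 0.5 * draw_rate           # implied win probability

def to_elo(p):
    return -400.0 * math.log10(1.0 / p - 1.0)

def play_one():
    # One game drawn from the fixed distribution above.
    r = random.random()
    return 1.0 if r < win_p else (0.5 if r < win_p + draw_rate else 0.0)

games = [play_one() for _ in range(total)]
overall = sum(games) / total
print(f"overall          score {overall:.3f}  Elo {to_elo(overall):+6.1f}")
for i in range(0, total, chunk):
    s = sum(games[i:i + chunk]) / chunk
    print(f"games {i + 1:>5}-{i + chunk:<5} score {s:.3f}  Elo {to_elo(s):+6.1f}")

The 1,000-game slices scatter visibly around the overall figure even though nothing about the "engine" changed between them.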
If, for example, you pit Twisted Logic against Rybka with just 100 games and the result is 100% for Rybka, would you say that there are not enough games to conclude that Rybka is much better than Twisted Logic?
Have you experienced in your tests that, for example, after 1000 games the performance is 2700 +-20, but after 32,000 games the performance ends up outside that error bar, say at 2750? I'm asking this because of what I've said above: even with just a few games you could trust the result if the performances of the two versions, with their error bars considered, don't overlap. You seem to dismiss this as random.
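As for the last question: under the idealised assumption that every game is an independent draw from one fixed distribution, the error bar behaves as advertised, and a 1,000-game estimate should land outside it only rarely. A simulation sketch (the score, draw rate and repetition count are invented for illustration):

Code:

import math
import random

random.seed(3)
true_score, draw_rate, n, reps = 0.50, 0.30, 1_000, 2_000
win_p = true_score - 0.5 * draw_rate

def match_score(games):
    # Mean score of `games` independent games with fixed win/draw/loss
    # probabilities.
    total = 0.0
    for _ in range(games):
        r = random.random()
        total += 1.0 if r < win_p else (0.5 if r < win_p + draw_rate else 0.0)
    return total / games

# Half-width of a ~95% error bar on the mean score of n games.
var = (win_p * (1.0 - true_score) ** 2
       + draw_rate * (0.5 - true_score) ** 2
       + (1.0 - win_p - draw_rate) * true_score ** 2)
half = 1.96 * math.sqrt(var / n)

misses = sum(abs(match_score(n) - true_score) > half for _ in range(reps))
print(f"the 1,000-game estimate missed the long-run value in "
      f"{100.0 * misses / reps:.1f}% of {reps} trials (about 5% expected)")

Whether real engine games behave like independent draws, given a small opponent pool and repeated openings, is really what the disagreement here is about.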