Eval Dilemma

Discussion of chess software programming and technical issues.

Moderator: Ras

MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: Eval Dilemma

Post by MattieShoes »

As I understand it, when comparing two numbers with error bars, you use the Pythagorean sum of the error bars rather than just adding them.

So say the error bars are 5% on each engine. If the scores are 8% apart, the error bars overlap, but we're only looking for 95% confidence, so you do

Code: Select all

sqrt(.05^2 + .05^2)
which is about 7%, instead of the 10% you'd get from adding them. So an 8% difference is enough to satisfy a 95% confidence interval when comparing the two in this case.
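A quick sketch of that quadrature sum in plain Python (the function name `combined_error` is just for illustration, not from any engine code):

```python
import math

def combined_error(e1, e2):
    """Pythagorean (quadrature) sum of two independent error bars."""
    return math.sqrt(e1 ** 2 + e2 ** 2)

# Two engines, each measured with a 5% error bar:
print(round(combined_error(0.05, 0.05), 4))  # 0.0707, i.e. about 7%
```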

Visualized, it'd be something like this (I think):
[image: two overlapping bell curves, A and B]
The two curves represent the strengths of the two engines. They're not points because we're not totally sure what the strengths are, but we "know" they're somewhere inside each curve. No matter how many games are played, the curves overlap to some extent. For 95% confidence, we want a randomly chosen point under curve A to be greater than a randomly chosen point under curve B 95% of the time.

And I pray somebody will correct me if I'm totally off here. :-)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Eval Dilemma

Post by bob »

Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:
diep wrote:
michiguel wrote:
diep wrote:
michiguel wrote:
diep wrote:
hgm wrote:
MattieShoes wrote:Can you or anybody point me to how the error bars are calculated?
I think the rule-of-thumb Error = 40%/sqrt(numberOfGames) is accurate enough in practice, for scores in the 65%-35% range. (This is for the 1-sigma or 84% confidence level; for 95% confidence, double it.) For very unbalanced scores, you would have to take into account the fact that the 40% goes down; the exact formula for this is

100%*sqrt(score*(1-score) - 0.25*drawFraction)

where the score is given as a fraction. The 40% is based on 35% draws and a score of 0.5. In the case mentioned (score around 0.25, presumably 15% wins and 20% draws), you would get 100%*sqrt(0.25*0.75-0.25*0.2) = 37%. So in 240 games you would have a 2.4% error bar (1 sigma).

When comparing the results from two independent gauntlets, the error bar on the difference is the Pythagorean sum of the individual error bars (i.e. sqrt(error1^2 + error2^2) ). For results with equal numbers of games, this means multiplying the individual error bars by sqrt(2).
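The exact formula above is easy to sketch in Python (names are mine, purely illustrative); it reproduces the 2.4% figure from the example:

```python
import math

def error_bar(score, draw_fraction, num_games):
    """1-sigma error bar (as a score fraction) for a match result,
    using the exact per-game standard deviation from the post:
    sqrt(score*(1-score) - 0.25*drawFraction)."""
    per_game_sd = math.sqrt(score * (1 - score) - 0.25 * draw_fraction)
    return per_game_sd / math.sqrt(num_games)

# The example from the post: score 0.25, 20% draws, 240 games:
print(round(error_bar(0.25, 0.20, 240), 3))  # 0.024, i.e. a 2.4% error bar
```

With score 0.5 and 35% draws the per-game SD comes out near 0.40, which is where the 40%/sqrt(numberOfGames) rule of thumb comes from.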
A score difference of 30% (35% ==> 65%) is so big that if you get it from adding just 1 pattern to your evaluation, we might hope that's not the 'average pattern' you add.

More likely you will see a score difference of 1 point over 200 games when adding just 1 tiny pattern.

Vincent
He was not talking about increases from 35% to 65%. The formula is valid if both score A and score B (from versions A and B) are within 35% to 65%. In other words, if score A is 48% and score B is 52%, you can apply the formula. If score A is 8% and score B is 12%, you cannot.

Miguel
If A scores 48% and B scores 52%, that's basically blowing 2 games, with maybe just 2 very bad moves in total, as that can give a 4 point swing overall.

First of all, the odds that these 2 bad moves were caused by the specific pattern are tiny. It could be some fluctuation, or book learning, or whatever effect.

So you really soon will conclude you need THOUSANDS of games for real statistical significance. I usually go for 95%.

Vincent
Going from 50-50 to 52-48 is an increase of ~15 Elo points. Yes, you need thousands of games to make sure it is real with a good level of confidence.

Miguel
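Miguel's ~15 Elo figure follows from the standard logistic Elo model; a small sketch, assuming the usual 400-point formula (the function name is mine):

```python
import math

def elo_diff(score):
    """Elo difference implied by an expected score, per the standard
    logistic model: diff = -400 * log10(1/score - 1)."""
    return -400 * math.log10(1 / score - 1)

print(round(elo_diff(0.52)))  # 14 Elo for a 52% score
```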
Additionally, the games should preferably be on the hardware and at the time control you want to play the tournament in.

Jonathan Schaeffer: "You have to test with what you play".
There seems to be a positive correlation of engine strength between short and longer time controls, based on some results published here and on what you can see in the established rating lists.

So if you lack the time and resources to test with what you play, for example a tournament time control of 40/40, you can compromise by playing blitz games. There is a high probability that the blitz result will correspond with the result at longer time controls, though a few engines are exceptions.
That's not the issue. You are trying to decide whether A' is better than A, and that absolutely does not correlate well across time controls for many kinds of changes, particularly things that significantly alter the shape of the tree, such as search extensions, reductions, etc. I have found some ideas that work better at fast time controls (a small but significant improvement) but then fall flat on their faces at longer time controls. Eval changes are less likely to do this, but I have seen examples where it happens there as well.
In my opinion, when you test things that significantly alter the shape of the tree, you can make the results consistent with longer time controls as long as the blitz time control gives enough depth for the reductions and extensions to take effect. I mean, if you only test at an ultra-fast time control where the engine's average depth over the game is 5, then I don't think you can trust it to judge search improvements; it must be at least depth 12. A solution is to play blitz, but to adjust the blitz time control so that the engine averages at least depth 12 over the game.

If only the eval is being tuned, I think results at short time controls would correlate with longer time controls, though there might be some rare exceptions. I'm just curious what those rare exceptions are. Can you cite some examples? It would help amateurs like me and others to be aware of them.
I've seen several. One I've mentioned many times: in 1985 I was getting ready to move to Birmingham to work on my Ph.D., and Bert Gower and I spent the summer tuning Blitz/Cray Blitz for the 1985 ACM chess tournament. We tested on the Vax and played a couple of dedicated chess machines, including the Super Constellation. The only games we lost were in endings where our pawns were a bit "scattered", leaving holes the opposing king could penetrate through. We modified the evaluation to penalize "holes" in our pawn structure, and we did not lose another game to the superconnie. However, long games on the Vax were shallower than speed chess on the Cray. In the ACM event we played passively and lost 2 of 5 rounds. At the 1986 WCCC we won the first round (same source program) and lost the second, again with a very passive style of play. Cray Blitz was pretty aggressive, so we started to look to see why. On a whim, we removed the two lines of code (I believe that is the correct number) added for the pawn-hole code, and it started to play like its old self, winning the next three games and the 1986 WCCC event as well. The Vax could not search deeply enough to understand the pawn "holes", so the new eval term helped. But the Cray went much deeper: not only did it get penalized for the "holes", it could also see the danger through search, and because of the eval terms it refused to make any holes at all...

We've had other such cases. Eval is intended to replace the search when the search can't go deep enough. When you do have a deep search, some of those "things" no longer work well. In fast games you can ramp up king safety and beat many opponents by speculative attacks. In long games this fails and you just throw games away to speculative attacks that don't work.

I've also found some LMR tricks that work better at fast games, but as the time control gets longer they cause Crafty to play worse. I don't keep such changes as I want uniform play through all time controls, rather than N versions for N different time controls.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Eval Dilemma

Post by bob »

Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:
michiguel wrote:
Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:I guess some of you may have encountered this. It's somewhat annoying. I'm currently in the process of trying out some new things on my eval function. Let's say I have an old eval feature I'm going to denote as F1 and a new implementation of this eval feature as F2.

I have tested F1 against a set of opponents using a set of test positions in a blitz tournament.

I then replaced F1 by F2, but by some twist of fate I accidentally enabled F2 for the white side only and F1 for the black side. I tested it, and it scored way higher than F1 under the same test conditions. I said to myself, the new implementation works well, but when I reviewed the code I found out that it was not implemented as I wanted.

I then fixed the asymmetry bug and went on to implement the correct F2 feature. To my surprise it scored between F1 and the F1/F2 combination. Note that I have not tried F2 for white and F1 for black to see if it still performs well.

Now here's my dilemma: if you were in my place, would you keep the bug that performs well, or implement the correct feature that doesn't perform as well?
I never accept bugs just because they are better. The idea is to understand what is going on, and _why_ the bug is making it play better (this is assuming it really is, which may well require a ton of games to verify) and go from there. Once you understand the "why" then you can probably come up with an implementation that is symmetric and still works well.
Since I lack the resources to test them thoroughly, I mostly rely on intuition. Since this one is so counter-intuitive, I don't know what to decide. Well, I guess I will just have to choose the right implementation even if it seems weaker in my limited tests.
You said that you knew it was too few games. But I do not think you knew the magnitude of the number of games needed to reach a conclusion. What Giancarlo was pointing out can be translated to: "Neither version looks any weaker or stronger than the other". So your test does not look counter-intuitive.

Making a decision based only on the number of wins you had in your tests is almost like basing it on flipping coins. The difference you got was ~10 wins over 240 games, with a performance of ~33%. This is not exactly the same (because you have draws), but just to get an idea, throw a die 240 times and count how many times you get 1 or 2 (a 33% chance). Do it again and again. The number will oscillate around 80, but getting close to 70 or 90 is not that unlikely. This is pretty well established. The fact that you are using only 20 positions and 4 engines makes differences even less significant (statistically speaking).

Miguel
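Miguel's die experiment is easy to simulate (a sketch using Python's `random`; the function and its parameters are my own illustration):

```python
import math
import random

def dice_counts(throws=240, runs=5000, seed=1):
    """Repeat the experiment many times: throw a die `throws` times
    and count how often it shows 1 or 2 (a 1/3 chance per throw)."""
    rng = random.Random(seed)
    return [sum(1 for _ in range(throws) if rng.randint(1, 6) <= 2)
            for _ in range(runs)]

counts = dice_counts()
sd = math.sqrt(240 * (1 / 3) * (2 / 3))  # theoretical SD, about 7.3
# Fraction of runs drifting below 70 or above 90 (more than ~1.4 SD out):
outside = sum(1 for c in counts if c < 70 or c > 90) / len(counts)
print(round(sd, 1), round(outside, 2))
```

Around 15% of runs land outside 70-90, so a swing of ~10 wins in 240 trials is indeed unremarkable.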
I'm using 30 positions played from both colors, so 60 games per opponent multiplied by four opponents equals 240.

I don't think that basing a decision on 240 games is like basing it on flipping coins. What I know is that there is a certain difference in winning percentage at which you can declare one version better than the other: when their error bars don't overlap.

For example, suppose I have a version with a performance of 2400 +-40 and another version with a performance of 2600 +-40. The upper limit of the first version is 2440 and the lower limit of the second version is 2560. They don't overlap, so in this case I could say that the second version is better than the first even with only a few hundred games.
It is just as random, in fact. I have a script I run on the cluster when I am testing. It grabs all the completed games and runs them through BayesElo. It is almost a given that after 1000 games, the Elo will be 20-30 above or below where the 32,000-game Elo will end up. Many times a new change starts off looking like a winner, only to sink down to a no-change or worse...

If you play 10,000 games and look at the result as a series of w/l/d characters, you can find all sorts of "strings" inside that 10,000-game result that produce results significantly different from the total.

1,000 games is worthless for 99% of the changes you will make.
I don't know how you proved it is random. If you are saying that at least 32,000 games are needed to determine whether one engine or version is better, then I think we could not trust rating lists like CCRL and CEGT, as they only have a few hundred to a couple thousand games per engine or version.

If, for example, you pit Twisted Logic against Rybka with just 100 games and the result is 100% for Rybka, would you say that there are not enough games to conclude that Rybka is much better than Twisted Logic?
NO, because I never said _that_. I said "for two engines that are close to each other in strength, it takes a _ton_ of games to accurately assess which one is stronger." That's a lot different from your example. 100-0 is clear superiority, although even that can happen between two programs of equal strength given enough games.

If you don't believe this is a problem, there's little I can do to convince you. Those that know, know. Those that don't will eventually one day figure it out.

Have you seen in your tests that, for example, after 1000 games the performance is 2700 +-20, but after 32,000 games the performance lies outside the error bar, for example at 2750? I'm asking because of what I said above: even with just a few games you can trust the result if the performances of the two versions, with their error bars considered, don't overlap. You seem to dismiss this as random.
I don't see many outside the error bar, although it definitely happens. But I have seen results where after 100 games the rating was 2700+, while after 32,000 games it was 2600. Again, for your case to happen, the two programs would have to be _hundreds_ of Elo apart, as the error bar for a hundred games or so is huge. You are not going to have that kind of huge difference when you are testing A' against A to see if A' (modified A) is better or worse than the original A.

You are letting yourself be convinced that for a patzer vs. a GM, 100 games is enough to recognize with high confidence that the GM is stronger, and then trying to extrapolate to two nearly equal opponents and use the same testing approach, which just doesn't work.
Okay, I understand. For two versions that are nearly equal in strength, you need a lot of games to make sure the error bars don't overlap before trusting the results. For example, two versions with performances of 2700 +-20 and 2710 +-20 after a thousand games are not yet enough to say one is better. The difference in performance should be greater than 40 if the error bars on both sides are 20.

What I'm trying to point out is that it is also possible to trust the results of even just a few games.

If you see results outside the error bars after more games, then there's definitely a flaw in how the rating is being calculated.
Unfortunately that is not true. One standard deviation means 2/3 of the results should be "in the proper range". But what about the other 1/3? Getting results 2 SD away is not common, but also not rare. Even at 2 SD, you are still going to see 1 out of 20 results that are "out there"... You compress the SD by playing a sufficient number of games. If you can add a bunch of changes so that you expect a +40 Elo boost, that will be easier to measure than a +4 Elo boost. But this leads to other problems: you test the modifications as a whole, not knowing whether some are good and some are bad. If you could eliminate the bad ones, you might go up more than +40. That's why we try to test one change at a time...

32,000 games gives a +/-4 error bar. Lots of changes won't give you +4 Elo. For those you need more games. It is _very_ difficult to measure very small changes this way.
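For what it's worth, the +/-4 bar at 32,000 games is roughly what the per-game variance predicts. A back-of-the-envelope sketch (my own approximation, assuming ~22% draws and the logistic Elo slope at a 50% score; BayesElo's actual computation differs):

```python
import math

def elo_error_bar(num_games, draw_fraction=0.22, sigmas=2):
    """Approximate Elo error bar for a ~50% score: per-game score SD
    scaled by 1/sqrt(N), converted to Elo with the logistic slope at
    50% (400 / (ln(10) * 0.25), about 695 Elo per unit of score)."""
    per_game_sd = math.sqrt(0.25 - 0.25 * draw_fraction)
    score_error = sigmas * per_game_sd / math.sqrt(num_games)
    return score_error * 400 / (math.log(10) * 0.25)

print(round(elo_error_bar(32000)))  # 3, close to the +/-4 BayesElo reports
print(round(elo_error_bar(1000)))   # 19
```

The ~19 Elo bar at 1000 games also matches the 20-30 point swings mentioned above.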
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Eval Dilemma

Post by bob »

MattieShoes wrote:I was assuming a more specific if statement that is ONLY true for KNB vs K (or the inverse). All non-KNB vs K evals would have whatever hit is associated with determining the if statement is false.

Or actually I was considering writing an endgame-specific eval that takes place when material is very low that scores several endgame situations more correctly, since typical material eval kind of breaks down with pawn races, KNN vs K, etc. And within that, have an if statement that tests specifically for KNB vs K only endgames.
That's OK, but now you are talking about an almost zero-point Elo improvement, since such an ending is _very_ rare.
User avatar
hgm
Posts: 28443
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Eval Dilemma

Post by hgm »

Indeed. Even under-promotion is much more common.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Eval Dilemma -- some quick data

Post by bob »

Below, I am pasting the output from a script that I run while a test is in progress. The initial data has 3 sets of 32,000 games against Glaurung 1 & 2, fruit 2 and toga 2. The three versions of Crafty (Crafty-23.0fast1, fast2 and fast3) are the same program, same everything, just run 3 times in a row. Pretty consistent.

I then start sampling this data every 30 seconds, while playing a new match with Crafty-23.1R05. Watch how the rating moves around as it settles to the right area (more about what this version is at the bottom)

Code: Select all

Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2666    4    4 23346   58%  2607   22% 
   2 Toga2              2657    4    4 23346   57%  2607   23% 
   3 Crafty-23.0-fast3  2608    4    4 31128   52%  2595   22% 
   4 Crafty-23.0-fast2  2607    4    4 31128   52%  2595   22% 
   5 Crafty-23.0-fast1  2606    4    4 31128   51%  2595   22% 
   6 Fruit 2.1          2560    5    4 23346   44%  2607   24% 
   7 Glaurung 1.1 SMP   2496    4    4 23346   35%  2607   20% 
-----------------------  currently using 83 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2664    5    5 23350   58%  2606   22% 
   2 Toga2              2656    5    5 23349   57%  2606   23% 
   3 Crafty-23.1R05     2609  116  116    17   50%  2571   41% 
   4 Crafty-23.0-fast3  2606    4    4 31128   52%  2594   22% 
   5 Crafty-23.0-fast2  2605    4    5 31128   52%  2594   22% 
   6 Crafty-23.0-fast1  2605    4    5 31128   51%  2594   22% 
   7 Fruit 2.1          2559    4    5 23348   44%  2606   24% 
   8 Glaurung 1.1 SMP   2495    5    5 23354   35%  2606   20% 
-----------------------  currently using 84 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2660    5    4 23420   58%  2602   22% 
   2 Toga2              2652    4    4 23415   57%  2602   23% 
   3 Crafty-23.1R05     2638   31   31   293   56%  2587   20% 
   4 Crafty-23.0-fast3  2602    4    5 31128   52%  2589   22% 
   5 Crafty-23.0-fast2  2601    4    4 31128   52%  2589   22% 
   6 Crafty-23.0-fast1  2601    4    4 31128   51%  2589   22% 
   7 Fruit 2.1          2555    5    5 23415   44%  2602   24% 
   8 Glaurung 1.1 SMP   2491    5    4 23427   35%  2602   20% 
-----------------------  currently using 84 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2662    4    4 23498   58%  2604   22% 
   2 Toga2              2653    4    5 23496   57%  2604   23% 
   3 Crafty-23.1R05     2624   21   21   626   54%  2588   22% 
   4 Crafty-23.0-fast3  2604    4    4 31128   52%  2591   22% 
   5 Crafty-23.0-fast2  2603    5    4 31128   52%  2591   22% 
   6 Crafty-23.0-fast1  2603    5    5 31128   51%  2591   22% 
   7 Fruit 2.1          2557    4    4 23497   44%  2604   24% 
   8 Glaurung 1.1 SMP   2493    4    4 23519   35%  2604   20% 
-----------------------  currently using 86 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2664    5    5 23564   58%  2605   22% 
   2 Toga2              2655    5    5 23564   57%  2605   23% 
   3 Crafty-23.1R05     2611   18   18   894   53%  2590   20% 
   4 Crafty-23.0-fast3  2606    4    4 31128   52%  2593   22% 
   5 Crafty-23.0-fast2  2605    4    4 31128   52%  2593   22% 
   6 Crafty-23.0-fast1  2605    4    4 31128   51%  2593   22% 
   7 Fruit 2.1          2559    5    5 23558   44%  2605   24% 
   8 Glaurung 1.1 SMP   2495    5    5 23592   35%  2605   20% 
-----------------------  currently using 87 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2665    4    4 23630   58%  2606   22% 
   2 Toga2              2656    5    5 23632   57%  2606   23% 
   3 Crafty-23.1R05     2607   16   16  1166   52%  2591   21% 
   4 Crafty-23.0-fast3  2607    4    4 31128   52%  2594   22% 
   5 Crafty-23.0-fast2  2606    5    5 31128   52%  2594   22% 
   6 Crafty-23.0-fast1  2605    4    5 31128   51%  2594   22% 
   7 Fruit 2.1          2559    4    4 23624   44%  2606   24% 
   8 Glaurung 1.1 SMP   2495    5    5 23664   35%  2606   20% 
-----------------------  currently using 88 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2665    4    4 23690   58%  2606   22% 
   2 Toga2              2656    5    5 23696   57%  2606   23% 
   3 Crafty-23.1R05     2608   14   14  1424   52%  2591   20% 
   4 Crafty-23.0-fast3  2607    4    4 31128   52%  2594   22% 
   5 Crafty-23.0-fast2  2605    5    5 31128   52%  2594   22% 
   6 Crafty-23.0-fast1  2605    4    5 31128   51%  2594   22% 
   7 Fruit 2.1          2559    4    4 23683   44%  2606   24% 
   8 Glaurung 1.1 SMP   2495    5    5 23739   35%  2606   20% 
-----------------------  currently using 88 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2666    4    4 23760   58%  2607   22% 
   2 Toga2              2657    4    4 23762   57%  2607   23% 
   3 Crafty-23.0-fast3  2607    4    4 31128   52%  2595   22% 
   4 Crafty-23.0-fast2  2606    4    4 31128   52%  2595   22% 
   5 Crafty-23.0-fast1  2606    4    4 31128   51%  2595   22% 
   6 Crafty-23.1R05     2601   13   13  1698   51%  2592   21% 
   7 Fruit 2.1          2560    4    4 23747   44%  2607   24% 
   8 Glaurung 1.1 SMP   2496    4    4 23813   35%  2607   20% 
-----------------------  currently using 89 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2666    4    4 23827   58%  2607   22% 
   2 Toga2              2657    4    4 23831   57%  2607   23% 
   3 Crafty-23.0-fast3  2608    4    4 31128   52%  2595   22% 
   4 Crafty-23.0-fast2  2606    4    4 31128   52%  2595   22% 
   5 Crafty-23.0-fast1  2606    4    4 31128   51%  2595   22% 
   6 Crafty-23.1R05     2601   12   12  1975   51%  2592   21% 
   7 Fruit 2.1          2560    4    4 23812   44%  2607   24% 
   8 Glaurung 1.1 SMP   2496    4    4 23889   35%  2607   20% 
-----------------------  currently using 90 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2665    4    4 23888   58%  2606   22% 
   2 Toga2              2657    4    4 23897   57%  2606   23% 
   3 Crafty-23.0-fast3  2607    4    4 31128   52%  2594   22% 
   4 Crafty-23.0-fast2  2606    4    4 31128   52%  2594   22% 
   5 Crafty-23.0-fast1  2606    5    4 31128   51%  2594   22% 
   6 Crafty-23.1R05     2603   12   11  2246   52%  2591   21% 
   7 Fruit 2.1          2560    4    4 23879   44%  2606   24% 
   8 Glaurung 1.1 SMP   2495    5    5 23966   35%  2606   20% 
-----------------------  currently using 83 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2666    4    4 23952   58%  2607   22% 
   2 Toga2              2657    4    4 23961   57%  2607   23% 
   3 Crafty-23.0-fast3  2607    4    4 31128   52%  2595   22% 
   4 Crafty-23.0-fast2  2606    4    4 31128   52%  2595   22% 
   5 Crafty-23.0-fast1  2606    4    4 31128   51%  2595   22% 
   6 Crafty-23.1R05     2602   11   11  2489   51%  2592   21% 
   7 Fruit 2.1          2560    5    5 23941   44%  2607   24% 
   8 Glaurung 1.1 SMP   2495    5    5 24019   35%  2607   20% 
-----------------------  currently using 81 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2665    4    4 24021   58%  2606   22% 
   2 Toga2              2657    4    4 24018   57%  2606   23% 
   3 Crafty-23.0-fast3  2607    4    4 31128   52%  2594   22% 
   4 Crafty-23.0-fast2  2606    4    4 31128   52%  2594   22% 
   5 Crafty-23.0-fast1  2606    5    4 31128   51%  2594   22% 
   6 Crafty-23.1R05     2603   11   11  2716   51%  2593   21% 
   7 Fruit 2.1          2561    5    5 24010   44%  2606   24% 
   8 Glaurung 1.1 SMP   2495    5    5 24051   35%  2606   20% 
-----------------------  currently using 81 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2665    4    4 24076   58%  2606   22% 
   2 Toga2              2656    4    4 24072   57%  2606   23% 
   3 Crafty-23.0-fast3  2607    4    4 31128   52%  2594   22% 
   4 Crafty-23.0-fast2  2606    5    4 31128   52%  2594   22% 
   5 Crafty-23.0-fast1  2606    5    5 31128   51%  2594   22% 
   6 Crafty-23.1R05     2605   10   10  2924   51%  2594   21% 
   7 Fruit 2.1          2560    4    4 24072   44%  2606   24% 
   8 Glaurung 1.1 SMP   2495    5    5 24088   35%  2606   20% 
-----------------------  currently using 94 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2665    4    4 24135   58%  2606   22% 
   2 Toga2              2656    4    4 24131   57%  2606   23% 
   3 Crafty-23.0-fast3  2607    4    4 31128   52%  2594   22% 
   4 Crafty-23.1R05     2606   10   10  3172   52%  2593   21% 
   5 Crafty-23.0-fast2  2606    5    5 31128   52%  2594   22% 
   6 Crafty-23.0-fast1  2605    5    5 31128   51%  2594   22% 
   7 Fruit 2.1          2560    4    4 24136   44%  2606   24% 
   8 Glaurung 1.1 SMP   2495    5    5 24154   35%  2606   20% 
-----------------------  currently using 94 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2665    4    4 24203   58%  2606   22% 
   2 Toga2              2657    4    4 24200   57%  2606   23% 
   3 Crafty-23.0-fast3  2607    4    4 31128   52%  2594   22% 
   4 Crafty-23.0-fast2  2606    5    4 31128   52%  2594   22% 
   5 Crafty-23.0-fast1  2606    5    5 31128   51%  2594   22% 
   6 Crafty-23.1R05     2605    9   10  3439   51%  2594   21% 
   7 Fruit 2.1          2560    4    4 24196   44%  2606   24% 
   8 Glaurung 1.1 SMP   2495    5    5 24224   35%  2606   20% 
-----------------------  currently using 94 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2665    4    4 24275   58%  2606   22% 
   2 Toga2              2657    4    4 24266   57%  2606   23% 
   3 Crafty-23.0-fast3  2607    4    4 31128   52%  2594   22% 
   4 Crafty-23.0-fast2  2606    5    4 31128   52%  2594   22% 
   5 Crafty-23.0-fast1  2606    5    5 31128   51%  2594   22% 
   6 Crafty-23.1R05     2605    9    9  3730   51%  2593   22% 
   7 Fruit 2.1          2560    4    4 24272   44%  2606   24% 
   8 Glaurung 1.1 SMP   2495    5    5 24301   35%  2606   20% 
-----------------------  currently using 94 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2665    4    4 24339   58%  2606   22% 
   2 Toga2              2656    4    4 24336   57%  2606   23% 
   3 Crafty-23.1R05     2607    9    9  4010   52%  2593   22% 
   4 Crafty-23.0-fast3  2607    4    4 31128   52%  2594   22% 
   5 Crafty-23.0-fast2  2606    5    5 31128   52%  2594   22% 
   6 Crafty-23.0-fast1  2605    4    5 31128   51%  2594   22% 
   7 Fruit 2.1          2560    4    4 24341   44%  2606   24% 
   8 Glaurung 1.1 SMP   2495    5    5 24378   35%  2606   20% 
-----------------------  currently using 92 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2665    4    4 24411   58%  2606   22% 
   2 Toga2              2657    4    4 24397   57%  2606   23% 
   3 Crafty-23.0-fast3  2607    4    4 31128   52%  2594   22% 
   4 Crafty-23.0-fast2  2606    5    4 31128   52%  2594   22% 
   5 Crafty-23.0-fast1  2606    5    5 31128   51%  2594   22% 
   6 Crafty-23.1R05     2605    8    8  4279   52%  2593   22% 
   7 Fruit 2.1          2560    4    4 24409   44%  2606   24% 
   8 Glaurung 1.1 SMP   2495    5    5 24446   35%  2606   20% 
-----------------------  currently using 94 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2665    4    4 24482   58%  2606   22% 
   2 Toga2              2657    4    4 24467   57%  2606   23% 
   3 Crafty-23.0-fast3  2607    4    4 31128   52%  2594   22% 
   4 Crafty-23.1R05     2606    8    8  4562   52%  2593   22% 
   5 Crafty-23.0-fast2  2606    5    4 31128   52%  2594   22% 
   6 Crafty-23.0-fast1  2606    5    5 31128   51%  2594   22% 
   7 Fruit 2.1          2560    4    4 24475   44%  2606   24% 
   8 Glaurung 1.1 SMP   2495    5    5 24522   35%  2606   20% 
-----------------------  currently using 94 nodes.
Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.2       2665    4    4 24549   58%  2606   22% 
   2 Toga2              2656    4    4 24535   57%  2606   23% 
   3 Crafty-23.1R05     2607    8    8  4838   52%  2593   22% 
   4 Crafty-23.0-fast3  2607    4    4 31128   52%  2594   22% 
   5 Crafty-23.0-fast2  2606    5    5 31128   52%  2594   22% 
   6 Crafty-23.0-fast1  2605    5    5 31128   51%  2594   22% 
   7 Fruit 2.1          2560    4    4 24539   44%  2606   24% 
   8 Glaurung 1.1 SMP   2495    5    5 24599   35%  2606   20% 
After 17 games, R05 is 2609. After 293 it is up to 2638. If you stopped the test after 300 games, you would be highly tempted to say R05 is significantly better. After 626 games it is down a bit to 2624, but still 20 Elo better than the original. And of course, 626 games is quite a bit of computation. By the time we have played 1698 games, it looks like it is a little _worse_ than the original 3 tests.

And in case you are interested, it ended up at 2606. R05 is identical to 23.0 as this is the version with the LMR offset search window, but this first run uses an offset of zero (0) to verify that it produces the same result as the original 23.0 version.

This is a "quick test" version to see how things look. The entire 32K game match normally takes about an hour or a little less. However, I am only using about 3/4 of the cluster as another user is running on 32 nodes or so. If you notice the error bars, this is all staying within 1SD, and as the error bar narrows, the score gets closer to "the truth"...
User avatar
hgm
Posts: 28443
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Eval Dilemma -- some quick data

Post by hgm »

Do you have a filter to count the number of KBNK endings in those games?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Eval Dilemma -- some quick data

Post by bob »

hgm wrote:Do you have a filter to count the number of KBNK endings in those games?
Not directly, but I have all the PGN, so I suppose someone with ChessBase (or whatever database program) could search for them. If there's no rush, I could simply have Crafty log each time a KBNK ending is reached and play a match to see what happens...
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Eval Dilemma

Post by michiguel »

hgm wrote:Indeed. Even under-promotion is much more common.
I disagree. I think KNBK is more common than underpromotion. I have seen it quite a few times.

Miguel
PS: I have seen underpromotions, but most of them were not needed; i.e., promoting to a Q was the best move, but the engine decided to promote to a R.
Dann Corbit
Posts: 12814
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Eval Dilemma

Post by Dann Corbit »

michiguel wrote:
hgm wrote:Indeed. Even under-promotion is much more common.
I disagree. I think KNBK is more common than underpromotion. I have seen it quite a few times.

Miguel
PS: I have seen underpromotions, but most of them were not needed. i.e., promoting a Q was the best move but the engine decides to promote a R.
Usually, these are pieces about to be captured. I think the engine is saying:
"Haw, Haw! You only got a rook and not a queen."

Knight underpromotions are the ones that are most commonly actually of value.