An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote:
bob wrote: But every program I have personally tested and use in my tests certainly does do this. So for _most_ of us, the number of games needed to predict progress with high confidence is quite large. And all the arguing in the world is not going to change that.
Let me try to follow your logic here:

"every program I have tested does this, therefore for most of us..."

For me there is a huge jump in reasoning that I don't understand.

It would mean that you are somehow equivalent to most of us. Just looking at the members active in this discussion would show you that we are "most of us", not you.
Why? I know and have conversed with many program authors over the years. We _all_ time our searches in the same way, with different tweaks. But nearly every one I have _seen_ does not use the "complete the iteration, then stop" approach. You could go back through years of r.g.c.c posts to see the discussions there. And eventually you realize that most do it as I do it, not as you/hgm do it. Most have read the various journal articles on using time, and understand why the ideas work. So concluding that your approach is the center of the universe is wrong. How many different programmers have you discussed timing issues with? I count hundreds in my past... so there is some weight to my statement, and I am not making an unsupported assumption.
I never talked about fixed anything, I never talked about time controls. In fact, I specifically asked for that subject to be discussed in a separate thread.

If I misquoted you on something that related to the timing issue and took it to mean the original subject we talked about by mistake, I apologize.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: An objective test process for the rest of us?

Post by Gerd Isenberg »

Uri Blass wrote: Gerd,

You do not understand the point of H.G.Muller.
Yes, true. I understand almost nothing of H.G.'s reasoning and found Bob's the most plausible to me. I got a bit emotional over H.G.'s subtle innuendos against Bob, which made me a bit biased. This thread is a bit bulky and, thanks to the ergonomics of this forum software, somewhat hard to follow over time.
Uri Blass wrote: His claim is that the data Bob gave earlier, from the first 4 matches against the same opponent, is something you would expect to be very rare even if the result of every game were decided by pure luck and the matches used random positions, so that the positions of one match do not repeat in another match.


He did not claim that it is impossible for the programs to be non-deterministic in the same position.

The main question is whether the variance of the 80-game match result is bigger than the sum of the variances of the individual games.

In theory they should be equal, but the results made me suspect that this is not the case for some reason, and maybe Bob believes that the games are independent when they are not.

If you see a coin fall on heads 70 times in the first 100 tries, then it is logical to suspect that the coin is simply not fair (of course it is possible that the coin is fair and you just had bad luck).

Uri
I don't get that - maybe you can teach me some stat basics?
Can you explain to me what you think is wrong with Bob's result?
Can you explain to me how to calculate the variance (sigma?) per line and in total? (A small sketch following the data below shows one way.)
What about the variance per column, a.k.a. sample position?
Why do you expect those games not to be independent of each other?

Code: Select all

1: =--===++--=-=-++=-=-++++---+=+-+--=+-=-+-+-=-++-+++--=+=++=+---+-+++-==+---++=-- (-2)
2: =---+---+-=+=-+++---++------+=-+===+---=--=---+--+=-=-=-++-+--+--+---=-=+-+++++- (-18)
3: +---=-=-+++-+=+===--+++----=-+-=-+++-+----=-=+==+-=+--+--+=+--+=+=+-+++-+-=+--=- (-6)
4: =-=-==--+---+-=+=----+=---+===---=-=---=--====+------=---+-+--+--=+--++=+--+--=- (-31)
As far as I understand, you, H.G. and Nicolai claim it is safe to play fewer, but more "deterministic", games to draw significant conclusions. E.g. only one run per opponent with a carefully chosen, independent set of starting positions. And if the new version wins, let's say, x% more games, it is safe to claim it is stronger (or weaker) with some probability. Games are made more deterministic, for instance, by clearing hash tables between moves and finishing iterations after reaching a certain number of nodes. Deterministic matches imply that quite a small number of fixed positions occur and are searched.

I vote for Bob's arguments. Some randomness is inherent and even desirable, to cover rarely triggered conditional eval terms (e.g. some endgame knowledge) so that those terms have some significance in the final result. For a bean counter with piece-square tables, a few "deterministic" games might be sufficient to "prove" that some piece-square values are better than others. For a decent eval and search with zillions of conditional terms, one certainly needs many more games - and even more, the more randomness you introduce by keeping hash between moves and terminating on time.

Looks like a dilemma - the greater the (slight) desired randomness needed to cover all combinations of eval terms, the more games you need to play. What about keeping positional and "critical" tactical and endgame positions from those matches, with some bm or am, as test positions, and determining progress on those sets?

Thanks,
Gerd
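
As a rough illustration of why the required number of games grows so quickly, here is a sketch assuming independent games, the same ~0.4-points-per-game standard deviation, and old and new versions compared via two separate matches of N games each; the Elo-to-score conversion is the standard logistic formula, and the numbers are only indicative:

Code: Select all

import math

SIGMA_PER_GAME = 0.4   # assumed per-game standard deviation, in points (0 / 0.5 / 1 scoring)

def expected_score(elo_diff):
    """Expected score fraction for a player rated elo_diff above its opponent."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, z=1.96):
    """Games per match so that an elo_diff improvement amounts to z standard
    errors of the score difference between the old and the new match."""
    gain = expected_score(elo_diff) - 0.5           # extra points per game
    sigma_diff = SIGMA_PER_GAME * math.sqrt(2.0)    # per-game sigma of (new - old)
    return math.ceil((z * sigma_diff / gain) ** 2)

for elo in (50, 20, 10, 5):
    print("%3d Elo improvement: roughly %6d games per match" % (elo, games_needed(elo)))
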
User avatar
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: An objective test process for the rest of us?

Post by mhull »

Gerd Isenberg wrote:What about the variance per column aka sample position?
Here is Bob's data transposed. The first two rows here are the first Silver position, where 01 is Crafty playing white and 02 is Crafty playing black.

Code: Select all

01: ------------------=-----=--==--- -26
02: ++++++++++++++++=+++++++++++++++ 31.5
03: -=-==----=-=---=-=--=----=---==- -15.5
04: =+----++++-++=+++=+--==+-++-+++= 11
05: -===-------=---=--=-=---------=- -20
06: +++++=-==+=+-+++=++++=++++++++=- 22.5
07: +++==+++++=+++++++-++++-=++-++++ 24
08: =-=--=----=-=--=--=---===------- -17
09: -==-------=-=------=-=------==-= -18.5
10: +++=++++++++++++++=++=++++=+=+++ 29.5
11: =---=-=--------=-=-==-------=--- -20
12: +-++-+-=++++-+=++==-+-+++=+++=++ 17
13: +++++=+==+++==+++=++=+=+=+=+-++- 23
14: -----==-=-=-======---=--===---== -8
15: ==+++===-=-+++=+==++++-+++=++++- 19
16: -----=---=-----=---==----==-==-= -17
17: ++++-+=+++==+=+-++-++==+=+++++++ 22.5
18: -------=--=------=-------------- -27.5
19: ++++=++++++++++++++=++=-+==++=++ 27
20: ==-=---=-----=----------------=- -23
21: +=++++++++-+=+++++++++++++++++++ 29
22: -------=-------=-----=---=------ -26
23: ++=++++++=++++++++++++-+++=-=+++ 26
24: --=------=-===--------=-=-=---=- -18.5
25: -----=------------------+------- -28.5
26: =-++-+++=+++++++++++--+++++=++++ 22.5
27: ------=-=----=-==-----=---=-=--- -20
28: ++++++-=-+++++++=+++=++=++-+++++ 24
29: ==+-+===-=--====+++=-=--=-+=--+= 4.5
30: ---===-----=----------=--=-=--=- -20
31: -=-=--==-=------=---===----=---- -17
32: +++++=++++--++--++++++=-+=-+++++ 18.5
33: ++=++++=++++-+++++++++-+++++++++ 27
34: =--==---=---==---====-=--------= -14
35: ------=-==-----------=---=------ -24.5
36: ++-+++++++++++++++++=+++=+++++++ 29
37: +=+=-+++++==+++++=+=-+=+++=+=+== 22.5
38: ---=--=-=------------=---------- -26
39: ----=----==-=-=---==-=-----===-= -14
40: =++++=+-++++-+++++=++==++==+=+++ 24
41: ----==--==-=-=-==----==-=--=-==- -11
42: +-+--+=++==-+=+-=++++++==+==++-+ 15.5
43: =+++++++++++++++++++++++++++++++ 31.5
44: ---=-------=---=----------=----= -24.5
45: =----=-----=---==-=-------=-==-- -18.5
46: =-+++++++++++=-+++++-++=++=++=+= 23
47: ==-----==-=-====-=-==-=-=------= -9.5
48: +=+++-+-+=+==-+==++++-+==++--++- 14
49: --=---------==--=-+=--==-----==- -16.5
50: ++=+++++=+++-=+++++++++=++=++--+ 23.5
51: =-======---=-==-===-====-=--=--- -3.5
52: ==+++++-==+-++=+-+--+=++=-++++++ 16.5
53: ======---=--=--=--==-=====--==== -2
54: +++-+++++++++++++++++++==++++++- 27
55: ==-------=--==---=--===------=-- -17
56: ++=++++=+===+-++++++++++++++-+++ 25.5
57: +-+-=-++=+=+-+-+++++-++++++=++-+ 16
58: ----=------------=-------------- -29
59: =-=-===----=-=------=------==-=- -15.5
60: ++==+++-++==+-=-++=+-++++-==++++ 18
61: +++=+++-++++++=+++++++++++++++++ 29
62: --=---------------------=------- -29
63: --=---===--=======----==-=-=---- -9.5
64: +++++++++++++++++++++++-+++++-++ 28
65: -----=---==------=-------------- -26
66: +--++--+-+++=--=-=+=---===+-++-+ 2.5
67: -+-+-+=+=+++++=+++++++++=++++=++ 23.5
68: --=-=--=-=-------=--=------==--- -20
69: +=--====+=+=-=+=-==++----====+=+ 8
70: --------=------=--=----==------- -24.5
71: -------------=-------=--=--=---= -24.5
72: ++++++++-+++-+++++++=+-++=++++++ 25
73: =-++-=++--++++=++-+++++++++=+-++ 18
74: -=-----==-=----=--------===---== -17
75: ---=-----------=-=----------=--- -26
76: +-++----+-+--++=-+++---+-++-++-+ 1.5
77: +++++-++-++++=++++++=+-+++++++++ 25
78: --=---=------------------=------ -27.5
79: --+-+=+-++=--++++-=+-=+-+=++---+ 5.5
80: -------=-==---=-----------=-==-- -21.5
Matthew Hull
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:Statistics is not the end-all when choosing how to test a program. Experience helps a _whole_ lot.
Okay, finally we have come to the conclusion that your experience is worth more than Statistics, so whatever you say will always be right. I am just wasting my time, and probably yours. I give up.

Give up whenever you want. But don't put words in my mouth when "signing off". I didn't say a _word_ about experience being worth more than statistics. I said "experience helps a whole lot" because sometimes the statistics just lead you down a wrong path because of the characteristics of the data. How you go from my statement to what you claimed I said is beyond me...


I will try and discuss my ideas in a forum where people understand me and seem to have no problem using standard Statistics methodology to help me solve my problem.

Since you don't have a problem to solve, everybody can be happy.

Your point that people should be more careful in their statements regarding engines is taken (not that I have ever disputed it), and I will continue to point out such mistakes when I see them, and when they are supported by standard methodology.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

I actually look at the data like that from time to time. What I really look for are cases where I lose most games from both sides. That indicates a basic hole in the evaluation or search that my opponent is exploiting. I don't look at this kind of data to make a go/no-go decision, however. But occasionally I want to answer the question "OK, what to do next", and looking at the worst cases often leads to the biggest overall improvement, as fixing that issue usually has a positive effect on other positions in the test as well.

But evaluating a change is a different issue from evaluating what needs to be changed.
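
A small sketch of that kind of worst-case scan over data in the transposed format posted above, assuming the rows come in white/black pairs per starting position and that the '+'/'='/'-' symbols are from Crafty's point of view (if they are from White's point of view, the even rows would have to be flipped); only the first three positions are included here for brevity:

Code: Select all

# Flag starting positions that score poorly with BOTH colors.
data = """\
01: ------------------=-----=--==--- -26
02: ++++++++++++++++=+++++++++++++++ 31.5
03: -=-==----=-=---=-=--=----=---==- -15.5
04: =+----++++-++=+++=+--==+-++-+++= 11
05: -===-------=---=--=-=---------=- -20
06: +++++=-==+=+-+++=++++=++++++++=- 22.5"""

def points(trace):
    """Game points (win = 1, draw = 0.5, loss = 0)."""
    return sum(1.0 if c == '+' else 0.5 if c == '=' else 0.0 for c in trace)

rows = [line.split()[1] for line in data.splitlines()]

for pos, (white, black) in enumerate(zip(rows[0::2], rows[1::2]), start=1):
    total = points(white) + points(black)
    games = len(white) + len(black)
    # 0.35 of the available points is an arbitrary cut-off for "bad from both sides"
    flag = "  <-- poor with both colors" if total < 0.35 * games else ""
    print("position %2d: %5.1f / %d%s" % (pos, total, games, flag))
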
User avatar
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Gerd,

this discussion is not about randomness in playing games. This point was already solved in a previous thread, by my theoretical prediction of how large these effects would be in engines with various time-management strategies, and by the later tests of Eden and uMax that followed. These were in total agreement with the predictions, and as such that problem can be considered 100% understood. Engines with simple time management and little memory, like Eden and uMax, are about 97% deterministic, and repeat most games. This was of course well known to us; it is so obvious that you cannot miss it, and very annoying when you want to run gauntlets. Especially since many of the opponents at this level do not support the GUI commands to start in positions other than the opening.

What the current discussion is about is this:

Nicolai asked me how much Eden 0.0.14 would have to score in his 26-game RR before it could be considered better than Eden 0.0.13 (which had scored 9 out of 26). Here it turned out that I (applying standard statistical analysis) and Bob have a fundamental difference:
bob wrote:
hgm wrote:Standard error on 26 games is 2 pts, so for a difference it is 2.8 pts. For 95% confidence this is about twice that, or 5.5 pts. (Or was that 97.5%, because this is a one-sided test? I would have to calculate that to be sure.) So an engine as strong as Eden 0.0.12 would make 15 points in this gauntlet only once in 20 times. That means Cefap and those above it are significantly stronger than Eden 0.0.12, and you could add Zotron to that for Eden 0.0.13.

If you want to be 95% (97.5%) sure that Eden 0.0.14 is better than 0.0.13, it would have to make at least 14 points out of 26, on the first try. For 84% confidence you would have to be only 1 sigma better, i.e. 3 points. I guess I would be happy with that, if it was achieved on the first try.

The main trap is that you are going to keep trying and trying with marginal improvements until you find one that passes. That is cheating. Out of 7 tries to pass the 84% test, you would expect one version that is merely equal to pass. So after a failed test you really should increase your standards.
the "standard error" might be 2 points. The variance in such a match is _far_ higher. Just play 26 game matches several times. You will _not_ get just a 2 game variance. I've already posted results from several programs including fruit, glaurung, arasan, gnuchess and crafty. So talking about standard error is really meaningless here, as is the +/- elostat output. It just is not applicable based on a _huge_ number of small matches...
Bob claims here that the variance of 26-game match results is much larger than the 2 points I calculated (according to 0.4*sqrt(26)), with as a consequence that you would err much more often than in 5% of the cases if you accepted the score threshold I calculated (15 out of 26).

As this is theoretically impossible if the games in a match are independent (i.e. totally random), we were skeptical of this claim. But Bob persistently kept claiming that this is experimental fact, that he sees it happen all the time, and that he can show us experimental data to prove his extraordinary claim. He then shows us the by now famous 80-game match results, where one of the traces contains a deviation of 29 points from the average.

Now if the games within the mini-match are independent, such a result _cannot_ occur more than once every 15,000 times, and there was a second not-so-likely (though not so extreme) value next to it, together making it something that should not occur more than once in a million times. Well, if once in every 15,000 times I would accept a change as better because of such a fluke, that would have no impact on my confidence calculation at all, as such 1-in-15,000 events are all included in the 5% error probability we wanted to risk. So either Bob is showing us very untypical data to "prove" his point, thereby suggesting that it is typical data (which would be rather unethical, and therefore unlikely), or most of his data really has such large deviations, in which case there _must_ be something wrong with his measurement setup, as such extreme fluctuations are only possible if there is a large and significant correlation between the results of games within a mini-match (which is also extremely unlikely, as they are intended to be independent).

So I know something stinks here, and I ask Bob if this is perhaps a hypothetical case. For some reason Bob didn't like this, and saw fit to respond with rude ad hominems. As it turns out now that these 4 traces were indeed _very_ atypical, you can draw your own conclusions...
Last edited by hgm on Wed Sep 19, 2007 10:34 pm, edited 1 time in total.
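
For reference, a minimal sketch of the 26-game calculation quoted above, assuming independent games and a per-game standard deviation of 0.4 points; it lands close to the 12 and 14-15 point thresholds mentioned in the post, and whether the one- or two-sided value is appropriate is left open here, as it is there:

Code: Select all

import math

SIGMA_PER_GAME = 0.4          # assumed, in game points
N = 26                        # games in the gauntlet
old_score = 9.0               # Eden 0.0.13's result

se_match = SIGMA_PER_GAME * math.sqrt(N)   # standard error of one match score (~2.0 pts)
se_diff = se_match * math.sqrt(2.0)        # standard error of (new - old) match score (~2.9 pts)

for z, label in ((1.0, "84% confidence (1 sigma)"),
                 (1.96, "95%/97.5% confidence (1.96 sigma)")):
    print("%-36s new version needs at least %.1f / %d"
          % (label + ":", old_score + z * se_diff, N))
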
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:
nczempin wrote:
bob wrote: But every program I have personally tested and use in my tests certainly does do this. So for _most_ of us, the number of games needed to predict progress with high confidence is quite large. And all the arguing in the world is not going to change that.
Let me try to follow your logic here:

"every program I have tested does this, therefore for most of us..."

For me there is a huge jump in reasoning that I don't understand.

It would mean that you are somehow equivalent to most of us. Just looking at the members active in this discussion would show you that we are "most of us", not you.
Why? I know and have conversed with many program authors over the years. We _all_ time our searches in the same way, with different tweaks. But nearly every one I have _seen_ does not use the "complete the iteration, then stop" approach. You could go back through years of r.g.c.c posts to see the discussions there. And eventually you realize that most do it as I do it, not as you/hgm do it. Most have read the various journal articles on using time, and understand why the ideas work. So concluding that your approach is the center of the universe is wrong. How many different programmers have you discussed timing issues with? I count hundreds in my past... so there is some weight to my statement, and I am not making an unsupported assumption.
I never talked about fixed anything, I never talked about time controls. In fact, I specifically asked for that subject to be discussed in a separate thread.

If I misquoted you on something that related to the timing issue and took it to mean the original subject we talked about by mistake, I apologize.
In that case, my point was that _most_ programs (most -> nearly all programs I have looked at or discussed with their authors) use the classic stop-when-time-is-up approach, leading to the _same_ level of non-deterministic behavior I am seeing in Crafty, Fruit, and the rest of the group I am using. That would make my comments applicable to all but a very small group (which seems to include you and hgm). You seem to think I am suggesting that everybody play long matches. I'm not. I am suggesting that _most_ need to play long matches because they all have the non-determinism problem, one that they most likely don't realize they have (because I didn't realize it either until I started trying to interpret the results I am producing).

So, do we agree that my time management is the most common? My experience says so. I use some very good opponents in my testing here and they all do it this way. And do we agree that your "finish an iteration" approach is far less common, since very few seem to do things that way? If so, why do you find fault with my talking to the _majority_ about how they test???
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

BTW, for the record, I have tried a test similar to yours. My original hypothesis (wrong) was that Crafty was causing most of the non-deterministic behavior, and I was convinced I must have introduced some sort of bug. I quickly modified arasan, fruit, crafty, glaurung, and one other I don't recall to limit the search based on node counts. And now the matches were identical each and every time: same moves, same everything; the only differences were slightly different times per move here and there, as expected, since the timing code on any O/S has jitter.

I then removed the node limit for just crafty and the non-determinism came back with a vengeance. I put it back in and removed it from Fruit. That was even worse, but then fruit uses history values for reductions (I don't), which means any extra nodes searched will adjust history counters that will drastically influence the search later.

I am not even sure that there is that much more variance with both programs depending on time, than there is with just one using time and one using a fixed number of nodes. But I was not trying to determine that, I just wanted to make sure I understood where the variance was coming from, having eliminated all the things I thought were wrong (book, pondering, SMP, etc.)
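
A toy sketch (hypothetical code, not taken from any engine mentioned in this thread) of the difference such an experiment exploits: a node-count budget gives exactly the same search on every run, while a wall-clock budget depends on timer jitter and machine load and therefore does not:

Code: Select all

import time

class ToySearch:
    def __init__(self, node_limit=None, time_limit=None):
        self.node_limit = node_limit
        self.time_limit = time_limit
        self.nodes = 0
        self.start = time.time()

    def out_of_budget(self):
        if self.node_limit is not None and self.nodes >= self.node_limit:
            return True   # deterministic: depends only on the node count
        if self.time_limit is not None and time.time() - self.start >= self.time_limit:
            return True   # non-deterministic: depends on the clock reading
        return False

    def search(self, depth):
        """Dummy recursive search standing in for a real alpha-beta search."""
        self.nodes += 1
        if depth == 0 or self.out_of_budget():
            return self.nodes % 100          # fake leaf evaluation
        best = -10**9
        for _ in range(3):                   # fake branching factor of 3
            best = max(best, -self.search(depth - 1))
        return best

# Two runs with a node budget visit exactly the same tree; with only a
# time budget the node counts (and hence the chosen moves) can differ.
for run in (1, 2):
    s = ToySearch(node_limit=10000)
    s.search(12)
    print("run %d searched %d nodes" % (run, s.nodes))
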
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: An objective test process for the rest of us?

Post by Gerd Isenberg »

hgm wrote:Gerd,

this discussion is not about randomness in playing games. This point was already solved in a previous thread, by my theoretical prediction of how large these effects would be in engines with various time-management strategies, and by the later tests of Eden and uMax that followed. These were in total agreement with the predictions, and as such that problem can be considered 100% understood. Engines with simple time management and little memory, like Eden and uMax, are about 97% deterministic, and repeat most games. This was of course well known to us; it is so obvious that you cannot miss it, and very annoying when you want to run gauntlets. Especially since many of the opponents at this level do not support the GUI commands to start in positions other than the opening.

What the current discussion is about is this:

Nicolai asked me how much Eden 0.0.14 would have to score in his 26-game RR before it could be considered better than Eden 0.0.13 (which had scored 9 out of 26). Here it turned out that I (applying standard statistical analysis) and Bob have a fundamental difference:
bob wrote:
hgm wrote:Standard error on 26 games is 2 pts, so for a difference it is 2.8 pts. For 95% confidence this is about twice that, or 5.5 pts. (Or was that 97.5%, because this is a one-sided test? I would have to calculate that to be sure.) So an engine as strong as Eden 0.0.12 would make 15 points in this gauntlet only once in 20 times. That means Cefap and those above it are significantly stronger than Eden 0.0.12, and you could add Zotron to that for Eden 0.0.13.

If you want to be 95% (97.5%) sure that Eden 0.0.14 is better than 0.0.13, it would have to make at least 14 points out of 26, on the first try. For 84% confidence you would have to be only 1 sigma better, i.e. 3 points. I guess I would be happy with that, if it was achieved on the first try.

The main trap is that you are going to keep trying and trying with marginal improvements until you find one that passes. That is cheating. Out of 7 tries to pass the 84% test, you would expect one version that is merely equal to pass. So after a failed test you really should increase your standards.
the "standard error" might be 2 points. The variance in such a match is _far_ higher. Just play 26 game matches several times. You will _not_ get just a 2 game variance. I've already posted results from several programs including fruit, glaurung, arasan, gnuchess and crafty. So talking about standard error is really meaningless here, as is the +/- elostat output. It just is not applicable based on a _huge_ number of small matches...
Bob claims here that the variance of 26-game match results is much larger than the 2 points I calculated (according to 0.4*sqrt(26)), with as a consequence that you would err much more often than in 5% of the cases if you accepted the score threshold I calculated (15 out of 26).

As this is theoretically impossible if the games in a match are independent (i.e. totally random), we were skeptical of this claim. But Bob persistently kept claiming that this is experimental fact, that he sees it happen all the time, and that he can show us experimental data to prove his extraordinary claim. He then shows us the by now famous 80-game match results, where one of the traces contains a deviation of 29 points from the average.

Now if the games within the mini-match are independent, such a result _cannot_ occur more than once every 15,000 times, and there was a second not-so-likely (though not so extreme) value next to it, together making it something that should not occur more than once in a million times. Well, if once in every 15,000 times I would accept a change as better because of such a fluke, that would have no impact on my confidence calculation at all, as such 1-in-15,000 events are all included in the 5% error probability we wanted to risk. So either Bob is showing us very untypical data to "prove" his point, thereby suggesting that it is typical data (which would be rather unethical, and therefore unlikely), or most of his data really has such large deviations, in which case there _must_ be something wrong with his measurement setup, as such extreme fluctuations are only possible if there is a large and significant correlation between the results of games within a mini-match (which is also extremely unlikely, as they are intended to be independent).

So I know something stinks here, and I ask Bob if this is perhaps a hypothetical case. For some reason Bob didn't like this, and saw fit to respond with rude ad hominems. As it turns out now that these 4 traces were indeed _very_ atypical, you can draw your own conclusions...
Thanks for the summary, and sorry for playing the wisenheimer ;-)

Obviously the starting positions are not random, but arbitrarily chosen from a wide range of early middle-game positions reached after common opening lines - to cover some domain-specific aspects of the game of chess.

Wild tactical Traxler or Najdorf lines, where you sacrifice material, even a piece or more, for a long-term attack? Positions with a lot of tactical tricks versus quiet positions? Closed positions, where "strategic" long-term knowledge is an issue, e.g. the idea of attacking on the kingside while you would otherwise lose on the queenside or in the ending due to the pawn structure? Minority attacks? Lame drawing lines with symmetrical pawn structures, where some programs tend to exchange rooks and simplify into the ending while others may maneuver to keep their pieces?

The dynamically balanced positions tend to be more volatile and chaotic under minor search deviations than others, I guess. Chess is like tossing dice, weighted by the programs' heuristics.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:Gerd,

this discussion is not about randomness in playing games. This point was already solved in a previous thread, by my theoretical prediction of how large these effects would be in engines with various time-management strategies, and by the later tests of Eden and uMax that followed. These were in total agreement with the predictions, and as such that problem can be considered 100% understood. Engines with simple time management and little memory, like Eden and uMax, are about 97% deterministic, and repeat most games. This was of course well known to us; it is so obvious that you cannot miss it, and very annoying when you want to run gauntlets. Especially since many of the opponents at this level do not support the GUI commands to start in positions other than the opening.

What the current discussion is about is this:

Nicolai asked me how much Eden 0.0.14 would have to score in his 26-game RR before it could be considered better than Eden 0.0.13 (which had scored 9 out of 26). Here it turned out that I (applying standard statistical analysis) and Bob have a fundamental difference:
bob wrote:
hgm wrote:Standard error on 26 games is 2 pts, so for a difference it is 2.8 pts. For 95% confidence this is about twice that, or 5.5 pts. (Or was that 97.5%, because this is a one-sided test? I would have to calculate that to be sure.) So an engine as strong as Eden 0.0.12 would make 15 points in this gauntlet only once in 20 times. That means Cefap and those above it are significantly stronger than Eden 0.0.12, and you could add Zotron to that for Eden 0.0.13.

If you want to be 95% (97.5%) sure that Eden 0.0.14 is better than 0.0.13, it would have to make at least 14 points out of 26, on the first try. For 84% confidence you would have to be only 1 sigma better, i.e. 3 points. I guess I would be happy with that, if it was achieved on the first try.

The main trap is that you are going to keep trying and trying with marginal improvements until you find one that passes. That is cheating. Out of 7 tries to pass the 84% test, you would expect one version that is merely equal to pass. So after a failed test you really should increase your standards.
the "standard error" might be 2 points. The variance in such a match is _far_ higher. Just play 26 game matches several times. You will _not_ get just a 2 game variance. I've already posted results from several programs including fruit, glaurung, arasan, gnuchess and crafty. So talking about standard error is really meaningless here, as is the +/- elostat output. It just is not applicable based on a _huge_ number of small matches...
Bob claims here that the variance of 26-game match results is much larger than the 2 points I calculated (according to 0.4*sqrt(26)), with as a consequence that you would err much more often than in 5% of the cases if you accepted the score threshold I calculated (15 out of 26).

As this is theoretically impossible if the games in a match are independent (i.e. totally random), we were skeptical of this claim. But Bob persistently kept claiming that this is experimental fact, that he sees it happen all the time, and that he can show us experimental data to prove his extraordinary claim. He then shows us the by now famous 80-game match results, where one of the traces contains a deviation of 29 points from the average.

Now if the games within the mini-match are independent, such a result _cannot_ occur more than once every 15,000 times, and there was a second not-so-likely (though not so extreme) value next to it, together making it something that should not occur more than once in a million times. Well, if once in every 15,000 times I would accept a change as better because of such a fluke, that would have no impact on my confidence calculation at all, as such 1-in-15,000 events are all included in the 5% error probability we wanted to risk. So either Bob is showing us very untypical data to "prove" his point, thereby suggesting that it is typical data (which would be rather unethical, and therefore unlikely), or most of his data really has such large deviations, in which case there _must_ be something wrong with his measurement setup, as such extreme fluctuations are only possible if there is a large and significant correlation between the results of games within a mini-match (which is also extremely unlikely, as they are intended to be independent).

So I know something stinks here, and I ask Bob if this is perhaps a hypothetical case. For some reason Bob didn't like this, and saw fit to respond with rude ad hominems. As it turns out now that these 4 traces were indeed _very_ atypical, you can draw your own conclusions...
I responded because you suggested that the data was made up. The data was _not_ "atypical". I just posted some more data that started off pretty wildly and then settled down. Whether it will "wilden back up" or not is unknown. Seeing drastic differences is not that uncommon. Doesn't happen every time. Doesn't happen once per millennium. Sort of the nature of random observations...