New testing thread

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Correlated data discussion

Post by Zach Wegner »

bob wrote:The only difference I can see between the two approaches is the distribution of node counts. If you use 3M +/- 500K, you would probably get a uniform distribution evenly scattered over the interval unless you modified the PRNG to produce a normal rather than uniform distribution (easy enough to do to be sure). If we set the time so that the average search covers about 3M nodes, and assuming some sort of upper/lower bound of say 500K nodes, we get a normal distribution centered on 3M. Now are you going to tell me that the uniform distribution somehow is better? Or that it somehow more accurately simulates the real world?

So, again, I don't see any possible advantage other than repeatability, which is completely worthless in this context since we already know how to get perfect repeatability, but it gives too many duplicate games to provide useful information...

If you run the same test twice, using random numbers to produce a uniform distribution from 2.5M to 3.5M, that would seem to be _more_ random than the current normal distribution centered on 3M. Why would we want to go _more_ random when we know the results change with each different node count?
Yes, it would be more random. The point is to make the games random in the first place. Randomness is very important in running these tests. The whole point that your mathematician friend made in an email is that you can't draw statistical conclusions from results that aren't random. You don't know what sort of distribution testing by time gives at all. In fact, for one given search and search time, I'd expect a typical engine that checks the timer every X nodes to have very few possible node counts, likely only two or three in a controlled environment like yours, depending on X. The picture gets much more complicated because there is an unpredictable (not random) element added at every move, which of course leads to a large tree of possible games, but not one that has much statistical meaning. Thinking about this point, I would say that a random number of nodes for each move instead of per game would be better, of course giving each engine the same number of nodes to make it fair.

When you talk about a normal distribution as coming from the RNG, then there is probably not much difference. I think, and this is merely philosophical, not a point I'm trying to make, that testing with a uniform distribution would give better results: if an engine can produce a good move across a wide range of possible search times, it uses a better, more robust and "universal" algorithm, not one that merely plays well given "around 3M nodes". But search times do not give a normal distribution. I would guess that your setup reduces variance quite a lot, so that the distribution is much narrower and much less random.

An additional point is that most engines only check for timeout every X nodes, so having a window of +/-500K only guarantees 1M/X different possible samples. Mine, for instance, checks every 10,000 nodes, so that's 100 different possible games. Ideally they should be modified to check every node. Since the games are based on node count rather than time, this wouldn't hurt, but only give more robustness.
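To make the arithmetic concrete, here is a minimal sketch of a node-limit check that is only polled every X nodes (the 10,000-node interval, the helper names and the use of rand() are assumptions for the example, not anyone's actual engine code):

Code: Select all

/* A node limit drawn uniformly from [2.5M, 3.5M] but only examined every
   CHECK_INTERVAL nodes can stop the search at no more than
   1,000,000 / CHECK_INTERVAL distinct node counts -- 100 for 10,000. */
#include <stdlib.h>

#define CHECK_INTERVAL 10000

static long node_limit;

void set_random_node_limit(void)
{
    /* uniform over [2,500,000 .. 3,500,000]; rand() is merely illustrative */
    node_limit = 2500000L + rand() % 1000001L;
}

int out_of_nodes(long nodes_searched)
{
    if (nodes_searched % CHECK_INTERVAL != 0)
        return 0;   /* the limit is not even looked at between polls */
    return nodes_searched >= node_limit;
}

Checking every node instead would make all one million possible limits reachable, which is the modification suggested above.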

A good analogy to this is Monte Carlo analysis, which this could be considered to be, in a way. If you don't have a good random number source, then the MC samples have bias, and you can't draw reasonable conclusions from them, as your first post in the other thread shows quite clearly.
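As a toy illustration of that analogy (my own example, not from the testing setup under discussion): a Monte Carlo estimate made with a defective sampler that can only produce a few distinct values, the way a near-deterministic tester can only produce a few distinct games, converges to the wrong answer no matter how many samples are piled up:

Code: Select all

/* Estimate E[X] = 0.5 for X uniform on [0,1).  The "bad" sampler can only
   return 0, 0.25 or 0.5, so averaging a million draws just repeats the
   same limited information and the estimate settles on 0.25, not 0.5. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void)
{
    double good = 0.0, bad = 0.0;
    for (int i = 0; i < N; i++) {
        good += (double)rand() / ((double)RAND_MAX + 1.0); /* many distinct values */
        bad  += (rand() % 3) / 4.0;                        /* only three possible values */
    }
    printf("good estimate: %.4f (true value 0.5)\n", good / N);
    printf("bad  estimate: %.4f (stuck near 0.25)\n", bad / N);
    return 0;
}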
User avatar
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: ugh ugh ugh

Post by xsadar »

bob wrote:So far as I know, and I can only state back to 1995, coordinate notation with respect to xboard has not insisted on Kg1 to indicate O-O. Crafty played its first games on ICC in December of that year and worked flawlessly with xboard, using the normal O-O and O-O-O. And O-O will _always_ work with xboard or winboard. That's all I have ever used and nobody has ever told me it was failing... So I am not quite sure where you are coming from with that.
Where I was coming from was this excerpt from the xboard documentation (emphasis mine):
move MOVE

Your engine is making the move MOVE. Do not echo moves from xboard with this command; send only new moves made by the engine.

For the actual move text from your chess engine (in place of MOVE above), your move should be either

* in coordinate notation (e.g., e2e4, e7e8q) with castling indicated by the King's two-square move (e.g., e1g1), or
* in Standard Algebraic Notation (SAN) as defined in the Portable Game Notation standard (e.g., e4, Nf3, O-O, cxb5, Nxe4, e8=Q), with the extension piece@square (e.g., P@f7) to handle piece placement in bughouse and crazyhouse.

xboard itself also accepts some variants of SAN, but for compatibility with non-xboard interfaces, it is best not to rely on this behavior.

Warning: Even though all versions of this protocol specification have indicated that xboard accepts SAN moves, some non-xboard interfaces are known to accept only coordinate notation. See the Idioms section for more information on the known limitations of some non-xboard interfaces. It should be safe to send SAN moves if you receive a "protover 2" (or later) command from the interface, but otherwise it is best to stick to coordinate notation for maximum compatibility. An even more conservative approach would be for your engine to send SAN to the interface only if you have set feature san=1 (which causes the interface to send SAN to you) and have received "accepted san" in reply.
However, what it says isn't really as pessimistic about SAN as I remembered it being. Also, it turns out that the Idioms section refers to only one engine with problems, and that engine had a patch to fix it. This is the relevant portion of the Idioms section:
ChessMaster 8000 also implements version 1 of the xboard/winboard protocol and can use WinBoard-compatible engines. The original release of CM8000 also has one additional restriction: only pure coordinate notation (e.g., e2e4) is accepted in the move command. A patch to correct this should be available from The Learning Company (makers of CM8000) in February 2001.
So, considering that you've never had problems, perhaps O-O and SAN aren't really a problem. But you can't really blame people for following the protocol's recommendations.
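For what it's worth, here is a minimal sketch of the conservative approach the documentation describes (the variable and function names are made up for the example; this is not any particular engine's code):

Code: Select all

/* Send SAN only if the GUI announced "protover 2" (or later) and replied
   "accepted san" to our "feature san=1"; otherwise fall back to pure
   coordinate notation with castling as the king's two-square move. */
#include <stdio.h>

static int protover = 1;      /* updated when a "protover N" command arrives */
static int san_accepted = 0;  /* set when "accepted san" is received */

void send_move(const char *san, const char *coord)
{
    /* e.g. san = "O-O", coord = "e1g1" */
    if (protover >= 2 && san_accepted)
        printf("move %s\n", san);
    else
        printf("move %s\n", coord);
    fflush(stdout);
}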
plattyaj

Re: ugh ugh ugh

Post by plattyaj »

bob wrote:So far as I know, and I can only state back to 1995, coordinate notation with respect to xboard has not insisted on Kg1 to indicate O-O. Crafty played its first games on ICC in December of that year and worked flawlessly with xboard, using the normal O-O and O-O-O. And O-O will _always_ work with xboard or winboard. That's all I have ever used and nobody has ever told me it was failing... So I am not quite sure where you are coming from with that.
Some nasty implementations of the protocol (Chessmaster) fail to allow O-O, so I changed Schola to use the horrible syntax when not in protocol 2 mode. I can't remember why I wanted to run it under Chessmaster, now that I come to think about it.

Andy.
Uri Blass
Posts: 10895
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Correlated data discussion

Post by Uri Blass »

bob wrote:Since I am not sure whether or not the person who contacted me via email will join in here, I thought I would provide excerpts that we could perhaps use for a sane discourse on the issues. Excerpt 1:

===================================================
The central point of miscommunication seems to have been confusion between the everyday meaning of dependent (causally connected) and the
mathematical meaning of dependent (correlated). I am astonished that self-styled mathematical experts at talkchess.com who were
criticizing you didn't make this distinction. The difference in the two meanings is stark if one considers two engines playing each other
twice from a given position with fixed node counts, because the results of the two playouts will surely be the same. Neither playout
affects the other causally, so they are not dependent at all in the everyday sense, but the winner is always the same, which is to say
the outputs are perfectly correlated, and therefore as mathematically dependent as it gets.
====================================================

This is a topic I had already mentioned. In a perfect world, if A plays B, and A is better than B, then A will always win. And there will be perfect correlation between games, since any game can be used to predict the outcome of any other, even though there is none of the "causal dependency" present, since the games do not affect each other in any fashion.

Here's the next segment which again is a rehash of what has been explained previously:

=====================================================
Let's consider a series of hypothetical trial runs. I assume that you are as capable as anyone in the industry of preventing any causal
dependence between various games in the trials, so causal dependence will not factor in my calculations at all. I believe you when you
say that you have solved that problem.

Trial A: Crafty plays forty positions against each of five opponents with colors each way for a total of 400 games. The engines are each
limited to a node count of 10,000,000. Crafty wins 198 games.

Trial B: Same as Trial A, except the node count limit is changed to 10,010,000. Crafty wins 190 games.

Now we compare these two results to see if anything extraordinary has happened. In 400 games, the standard deviation is 10, and the
difference in results was only 8, so we are well within expected bounds. There's nothing to get excited about, and we move on to the
next experiment.

Trial C: Same as Trial A, except that each position-opponent-color combination is played out 64 times. Yes, this is a silly experiment,
because we know that repeated playouts with a fixed node count give identical results, but bear with me. Crafty wins (as expected)
exactly 12672 games.

Trial D: Same as Trial B, except that each position-opponent-color combination is played out 64 times. Crafty wins 12160, as we knew it
would.

Now we compare the latter two trials. In 25,600 games the standard deviation is 80, and our difference in result was 512, so we are more
than six sigmas out. Holy cow! Run out and buy lottery tickets! :)

In this deterministic case it is easy to see what happened. The perfect correlation of the sixty-four repeats of each combination meant
that we were gaining no new information by expanding the trial. The calculation of standard deviation, however, assumes no correlation
whatsoever, i.e. perfect mathematical independence. Since the statistical assumption was not met, the statistical result is absurd.
=====================================================
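For reference, the standard deviations quoted above are just the binomial figures, treating each game as an independent win/loss with p near 1/2 and ignoring draws for the moment:

Code: Select all

sigma = sqrt(N * p * (1 - p))
  400 games:    sqrt(400   * 0.25) = 10
  25,600 games: sqrt(25600 * 0.25) = 80

Both figures assume every game is independent, which is exactly the assumption the perfect correlation of the repeats violates.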

But at this point, we return to "stampee foot, impossible, stampee foot, test is flawed, stampee foot, etc..."

Now before I go farther, I will stop here and see if anyone wants to contest, add to, contradict, etc the above.

So just perhaps, this explains the so-called "six-sigma" event of my first post. And it explains why so many runs have been producing odd results. And it does once again explain exactly how there is correlation, simply because of the opponents and positions and semi-deterministic behavior of programs... Of course, it also suggests quite a bit more than that, since so many are doing this same exact test...

His next suggestion to help is one that will take me a bit of time to think about, as it is _completely_ counter-intuitive to me on first analysis. I'll save that until after the discussion on the above.

your turn...
1) I disagree about the first part.

http://en.wikipedia.org/wiki/Correlation

Correlation between two variables that each take the value 1 with probability 1 is not defined, because
"The correlation is defined only if both of the standard deviations are finite and both of them are nonzero"

If you always get the same result, the standard deviation is 0, so by the exact mathematical definition you cannot say that they are correlated.

You can only say that my claim that they are not correlated is not correct because the correlation is undefined; but this is an extreme case that usually does not happen, and it does not increase the variance of the result, it only reduces the variance.

To be more precise, it is better to claim the following:

There is a correlation in the results that causes V(A1+A2) to be bigger than V(A1)+V(A2).

In other words, if you define A1 as the result of the game Crafty-Fruit from position 1 and A2 as the result of the game Crafty-Fruit from position 2,
then
V(A1+A2) > V(A1)+V(A2)

If Crafty always wins position 1 with white and always loses position 2 with white, then we have 0 = 0+0 and there is no problem here.
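For reference, the standard identities behind this (nothing beyond the usual definitions):

Code: Select all

corr(A1,A2) = Cov(A1,A2) / (sigma_A1 * sigma_A2)   -- undefined if either sigma is 0
V(A1 + A2)  = V(A1) + V(A2) + 2 * Cov(A1,A2)

So a positive correlation between the game results is what makes V(A1+A2) larger than V(A1)+V(A2), and in the fully deterministic case V(A1) = V(A2) = Cov(A1,A2) = 0, which is the 0 = 0+0 above.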

2) Trial C and trial D mean that you have different conditions in the experiment and that there is a correlation.

The same effect can happen if the machine becomes 0.1% slower in the second experiment; that may have a similar effect to going from 10,000,000 to 10,010,000 nodes.

If you get an average node count of 10,010,000 with a standard error of 1,000 in games 1-1000, and an average node count of 10,000,000 with a standard error of 1,000 in games 25,000-26,000, then it is clear that the node count in game X and the node count in game X+1 are correlated, and the same may be true for the results of the games.

I do not believe that changing the speed by 0.1% can have a significant effect on rating, but when you have a small number of positions it may happen. (I did not consider this possibility earlier in the discussion because I thought that the randomness from timing decisions was bigger, especially when people reported that they do not get the same game twice when they repeat a game with no opening book.)

What you can learn from this is that it is better to increase the set of positions that you test from if you want to test small changes.


Uri
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: New testing thread

Post by Michael Sherwin »

bob wrote:Thought I would move this since the other thread has gotten pretty long, as well as getting off-topic a bit. I have now fixed my referee program so that it can properly adjudicate games as won/lost/drawn. It maintains a board state (borrowed code from Crafty, but rewrote the move generator to avoid the magic stuff to keep the code short). Since speed is not an issue, it just directly generates moves in bitboards rather than doing rotated lookups or magic stuff. I had to do this because not all programs provide accurate "Result" commands and can't be trusted. It can now ignore them and play anybody vs anybody. The only thing that is not allowed is draw offers. I had too many problems with programs not handling that correctly and decided to just disable the "offer draw" stuff and let the game continue. It is a forced draw if the 3-fold repetition is hit, or the 50-move rule, or insufficient material, regardless of whether one side claims a draw or not, which keeps the test games to a reasonable length.

I have run a partial test with the 6 programs so far. I am going to make 4 runs overnight, each opponent plays 160 games against every other opponent. And then I will do that 4 times to see how things look. I'm going to show the data two different ways, as the preliminary results are interesting. I will produce 4 sets of output from BayesElo where everybody plays everybody, and then 4 sets of output using only the Crafty vs everybody else PGN files, to see what happens. Remember that there are a couple of issues. One is what is the Elo spread from best to worst, and then what is the stability like. I should be able to post all of this in the morning if my shell script doesn't have a glaring error I missed.

More to follow...

Here are some partial results just for information: First batch is everybody vs everybody, second batch is just crafty vs each opponent...

Code: Select all

2291 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    73   19   19   763   61%   -15   18% 
   2 opponent-21.7           26   18   18   759   54%    -5   23% 
   3 Fruit 2.1                5   19   19   754   51%    -2   15% 
   4 Glaurung 1.1 SMP       -15   19   19   764   47%     2   16% 
   5 Crafty-22.2            -33   18   18   757   45%     6   22% 
   6 Arasan 10.0            -56   19   19   785   42%    11   10% 

760 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   115   44   42   153   69%   -31   17% 
   2 Fruit 2.1               64   42   41   149   63%   -31   18% 
   3 Glaurung 1.1 SMP        48   42   41   152   61%   -31   16% 
   4 opponent-21.7           20   38   37   152   58%   -31   41% 
   5 Crafty-22.2            -31   19   19   760   45%     6   22% 
   6 Arasan 10.0           -216   43   45   154   26%   -31   16% 
The two samples were made at the same time, for reference. Has a ways to go until it is done. I will then run this again, but with the "big" match. And will eventually stop running the all vs all, since once each non-crafty has played all the others, those versions do not change and there is little use in re-running them over and over. I will just take those PGN files and add them to the crafty vs everyone so that I just have to run crafty vs everyone after the first run to get all the PGN.

opponent 21.7 is crafty 21.7 for reference, we wanted to keep a 21.7 version in to see how the new version is doing. 22.2 is pretty incomplete with pieces of the eval chopped out...
Just a note:

When you do the final rating calculations for the Crafty versions, the results should be based only on the world vs. world games and a single Crafty-vs-the-world set of games. If Crafty is given more games than the rest of the engines, the rest of the engines' ratings are going to be affected more by how they do against Crafty than by how they do against each other. So, if program A does poorly (relative to the other engines) against Crafty and there is a higher number of games with that matchup, then Crafty will not gain as many rating points from that matchup, because program A's rating will be lower than it should be. Also, one Crafty version should not be allowed to affect the other Crafty version's rating.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: ugh ugh ugh

Post by bob »

xsadar wrote:
bob wrote:So far as I know, and I can only state back to 1995, coordinate notation with respect to xboard has not insisted on Kg1 to indicate O-O. Crafty played its first games on ICC in December of that year and worked flawlessly with xboard, using the normal O-O and O-O-O. And O-O will _always_ work with xboard or winboard. That's all I have ever used and nobody has ever told me it was failing... So I am not quite sure where you are coming from with that.
Where I was coming from was this excerpt from the xboard documentation (emphasis mine):
move MOVE

Your engine is making the move MOVE. Do not echo moves from xboard with this command; send only new moves made by the engine.

For the actual move text from your chess engine (in place of MOVE above), your move should be either

* in coordinate notation (e.g., e2e4, e7e8q) with castling indicated by the King's two-square move (e.g., e1g1), or
* in Standard Algebraic Notation (SAN) as defined in the Portable Game Notation standard (e.g., e4, Nf3, O-O, cxb5, Nxe4, e8=Q), with the extension piece@square (e.g., P@f7) to handle piece placement in bughouse and crazyhouse.

xboard itself also accepts some variants of SAN, but for compatibility with non-xboard interfaces, it is best not to rely on this behavior.

Warning: Even though all versions of this protocol specification have indicated that xboard accepts SAN moves, some non-xboard interfaces are known to accept only coordinate notation. See the Idioms section for more information on the known limitations of some non-xboard interfaces. It should be safe to send SAN moves if you receive a "protover 2" (or later) command from the interface, but otherwise it is best to stick to coordinate notation for maximum compatibility. An even more conservative approach would be for your engine to send SAN to the interface only if you have set feature san=1 (which causes the interface to send SAN to you) and have received "accepted san" in reply.
However, what it says isn't really as pessimistic about SAN as I remembered it being. Also, it turns out that the Idioms section refers to only one engine with problems, and that engine had a patch to fix it. This is the relevant portion of the Idioms section:
ChessMaster 8000 also implements version 1 of the xboard/winboard protocol and can use WinBoard-compatible engines. The original release of CM8000 also has one additional restriction: only pure coordinate notation (e.g., e2e4) is accepted in the move command. A patch to correct this should be available from The Learning Company (makers of CM8000) in February 2001.
So, considering that you've never had problems, perhaps O-O and SAN aren't really a problem. But you can't really blame people for following the protocol's recommendations.
My complaint is that most programs that use that form of castling use it _everywhere_, which includes exported PGN files. Crafty will happily accept that kind of input, because I got tired of getting complaints from users who could not read certain PGN collections. I could list a dozen ways people violate the PGN standard all the time, including, at times, ChessBase as well...

And this case was another classic example of two programs unable to communicate with each other, but which could communicate with Crafty with no problems, because I chose to support all forms of input.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

Zach Wegner wrote:
bob wrote:The only difference I can see between the two approaches is the distribution of node counts. If you use 3M +/- 500K, you would probably get a uniform distribution evenly scattered over the interval unless you modified the PRNG to produce a normal rather than uniform distribution (easy enough to do to be sure). If we set the time so that the average search covers about 3M nodes, and assuming some sort of upper/lower bound of say 500K nodes, we get a normal distribution centered on 3M. Now are you going to tell me that the uniform distribution somehow is better? Or that it somehow more accurately simulates the real world?

So, again, I don't see any possible advantage other than repeatability, which is completely worthless in this context since we already know how to get perfect repeatability, but it gives too many duplicate games to provide useful information...

If you run the same test twice, using random numbers to produce a uniform distribution from 2.5M to 3.5M, that would seem to be _more_ random than the current normal distribution centered on 3M. Why would we want to go _more_ random when we know the results change with each different node count?
Yes, it would be more random. The point is to make the games random in the first place. Randomness is very important in running these tests. The whole point that your mathematician friend made in an email is that you can't draw statistical conclusions from results that aren't random. You don't know what sort of distribution testing by time gives at all. In fact, for one given search and search time, I'd expect a typical engine that checks the timer every X nodes to have very few possible node counts, likely only two or three in a controlled environment like yours, depending on X.

Not even close. Here are some samples from Crafty. Same position, target time of 3 seconds, using just one CPU to minimize the node variation. These results are run in a "conservative mode" where it checks the time more frequently than in real games:

log.001: time=3.00 mat=0 n=6814217 fh=94% nps=2.3M
log.002: time=3.01 mat=0 n=6786713 fh=94% nps=2.3M
log.003: time=3.01 mat=0 n=6768568 fh=94% nps=2.2M
log.004: time=3.00 mat=0 n=6814217 fh=94% nps=2.3M
log.005: time=3.08 mat=0 n=6948342 fh=94% nps=2.3M
log.006: time=3.06 mat=0 n=6948342 fh=94% nps=2.3M

The problem is that the clock advances in "spurts" and the sample instant can occur anywhere along that spurt. I use the NPS to help control how often I sample the time, and the NPS is not constant either, which introduces more randomness into when the time is checked, which in turn controls how much time can be used in a search.
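Something along these lines, in other words (a rough sketch of the idea only, not Crafty's actual code; the once-per-10ms target is an assumption for the example):

Code: Select all

/* Poll the wall clock roughly every 10 ms of search, using the current NPS
   estimate to decide how many nodes to search between checks.  Because the
   measured NPS drifts, the node count at which the clock is next examined
   drifts too, adding to the variation in where a search actually stops. */
static long nodes_next_check;

void schedule_time_check(long nodes_now, long nps_estimate)
{
    nodes_next_check = nodes_now + nps_estimate / 100;   /* ~10 ms worth of nodes */
}

int time_check_due(long nodes_now)
{
    return nodes_now >= nodes_next_check;
}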
The picture gets much more complicated because there is an unpredictable (not random) element added at every move, which of course leads to a large tree of possible games, but not one that has much statistical meaning. Thinking about this point, I would say that a random number of nodes for each move instead of per game would be better, of course giving each engine the same number of nodes to make it fair.

When you talk about a normal distribution as coming from the RNG, then there is probably not much difference. I think, and this is merely philosophical, not a point I'm trying to make, that testing with a uniform distribution would give better results: if an engine can produce a good move across a wide range of possible search times, it uses a better, more robust and "universal" algorithm, not one that merely plays well given "around 3M nodes". But search times do not give a normal distribution. I would guess that your setup reduces variance quite a lot, so that the distribution is much narrower and much less random.

An additional point is that most engines only check for timeout every X nodes, so having a window of +/-500K only guarantees 1M/X different possible samples. Mine, for instance, checks every 10,000 nodes, so that's 100 different possible games. Ideally they should be modified to check every node. Since the games are based on node count rather than time, this wouldn't hurt, but only give more robustness.

A good analogy to this is Monte Carlo analysis, which this could be considered to be, in a way. If you don't have a good random number source, then the MC samples have bias, and you can't draw reasonable conclusions from them, as your first post in the other thread shows quite clearly.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New testing thread

Post by bob »

Michael Sherwin wrote:
bob wrote:Thought I would move this since the other thread has gotten pretty long, as well as getting off-topic a bit. I have now fixed my referee program so that it can properly adjudicate games as won/lost/drawn. It maintains a board state (borrowed code from Crafty, but rewrote the move generator to avoid the magic stuff to keep the code short). Since speed is not an issue, it just directly generates moves in bitboards rather than doing rotated lookups or magic stuff. I had to do this because not all programs provide accurate "Result" commands and can't be trusted. It can now ignore them and play anybody vs anybody. The only thing that is not allowed is draw offers. I had too many problems with programs not handling that correctly and decided to just disable the "offer draw" stuff and let the game continue. It is a forced draw if the 3-fold repetition is hit, or the 50-move rule, or insufficient material, regardless of whether one side claims a draw or not, which keeps the test games to a reasonable length.

I have run a partial test with the 6 programs so far. I am going to make 4 runs overnight, each opponent plays 160 games against every other opponent. And then I will do that 4 times to see how things look. I'm going to show the data two different ways, as the preliminary results are interesting. I will produce 4 sets of output from BayesElo where everybody plays everybody, and then 4 sets of output using only the Crafty vs everybody else PGN files, to see what happens. Remember that there are a couple of issues. One is what is the Elo spread from best to worst, and then what is the stability like. I should be able to post all of this in the morning if my shell script doesn't have a glaring error I missed.

More to follow...

Here are some partial results just for information: First batch is everybody vs everybody, second batch is just crafty vs each opponent...

Code: Select all

2291 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    73   19   19   763   61%   -15   18% 
   2 opponent-21.7           26   18   18   759   54%    -5   23% 
   3 Fruit 2.1                5   19   19   754   51%    -2   15% 
   4 Glaurung 1.1 SMP       -15   19   19   764   47%     2   16% 
   5 Crafty-22.2            -33   18   18   757   45%     6   22% 
   6 Arasan 10.0            -56   19   19   785   42%    11   10% 

760 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   115   44   42   153   69%   -31   17% 
   2 Fruit 2.1               64   42   41   149   63%   -31   18% 
   3 Glaurung 1.1 SMP        48   42   41   152   61%   -31   16% 
   4 opponent-21.7           20   38   37   152   58%   -31   41% 
   5 Crafty-22.2            -31   19   19   760   45%     6   22% 
   6 Arasan 10.0           -216   43   45   154   26%   -31   16% 
The two samples were made at the same time, for reference. Has a ways to go until it is done. I will then run this again, but with the "big" match. And will eventually stop running the all vs all, since once each non-crafty has played all the others, those versions do not change and there is little use in re-running them over and over. I will just take those PGN files and add them to the crafty vs everyone so that I just have to run crafty vs everyone after the first run to get all the PGN.

opponent 21.7 is crafty 21.7 for reference, we wanted to keep a 21.7 version in to see how the new version is doing. 22.2 is pretty incomplete with pieces of the eval chopped out...
Just a note:

When you do the final rating calculations for the Crafty versions, the results should be based only on the world vs. world games and a single Crafty-vs-the-world set of games. If Crafty is given more games than the rest of the engines, the rest of the engines' ratings are going to be affected more by how they do against Crafty than by how they do against each other. So, if program A does poorly (relative to the other engines) against Crafty and there is a higher number of games with that matchup, then Crafty will not gain as many rating points from that matchup, because program A's rating will be lower than it should be. Also, one Crafty version should not be allowed to affect the other Crafty version's rating.
This match is simply everybody vs everybody, with an equal number of games for everyone. Only in the last output did I exclude everything except Crafty vs the world. The first output has an equal number for everyone. I was simply showing how the old test would have looked, which is the kind of testing everyone is doing when they try to evaluate changes. Nobody is playing everybody against everybody, but someone suggested it might help stabilize the ratings. It is not clear that it did. I am now running the same test with no check extension. With any luck, each of these 4 runs will look worse than the original 4, because I know from lots of testing that removing the check extension hurts overall play.

Waiting on the test to complete right now.
User avatar
hgm
Posts: 28391
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: ugh ugh ugh

Post by hgm »

bob wrote:To solve this in Crafty, I accept o-o (small oh's), O-O (capital Oh's) and even 0-0 (zero-zero). This appears to be the only sane way to deal with castling and avoid user complaints about being unable to read some PGN files. Of course, there are also the brain-dead e1g1, Ke1-g1 and similar approaches, where the author ought to be flogged, but Crafty will handle them all. I'm going to simply take a move from an opponent, parse it into Crafty's internal format for the referee, then use Crafty's OutputMove() function to convert it back to a _normal_ SAN move to get rid of this crap.
Well, this is what WinBoard does (parsing and converting), and what any serious referee program should do. Engines are never directly exposed to PGN files. It is the GUI that parses those, and then sends the moves in protocol format to the engine. So there is no reason for engines to understand things like o-o or 0-0 (unless they play FRC, then they should understand 0-0).

By the way, WB protocol very clearly recommends that castling should be sent as e1g1, or it might not work under non-WB WB-compatible GUIs. So it is actually the authors who send castling as O-O that should be flogged! :lol:
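For illustration, a tiny sketch of the kind of normalization a referee or GUI can do before a move ever reaches an engine (a made-up helper, not WinBoard's or Crafty's actual code):

Code: Select all

/* Map the various castling spellings mentioned above (O-O, o-o, 0-0,
   Ke1-g1, ...) onto the king's two-square coordinate move before passing
   the move on, so the engine only ever sees one form. */
#include <string.h>

const char *normalize_castle(const char *move, int white_to_move)
{
    if (!strcmp(move, "O-O") || !strcmp(move, "o-o") || !strcmp(move, "0-0"))
        return white_to_move ? "e1g1" : "e8g8";
    if (!strcmp(move, "O-O-O") || !strcmp(move, "o-o-o") || !strcmp(move, "0-0-0"))
        return white_to_move ? "e1c1" : "e8c8";
    if (!strcmp(move, "Ke1-g1")) return "e1g1";   /* long-form coordinate variants */
    if (!strcmp(move, "Ke1-c1")) return "e1c1";
    if (!strcmp(move, "Ke8-g8")) return "e8g8";
    if (!strcmp(move, "Ke8-c8")) return "e8c8";
    return move;   /* anything else passes through unchanged */
}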
User avatar
hgm
Posts: 28391
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Correlated data discussion

Post by hgm »

bob wrote:Since I am not sure whether or not the person who contacted me via email will join in here, I thought I would provide excerpts that we could perhaps use for a sane discourse on the issues. Excerpt 1:

===================================================
The central point of miscommunication seems to have been confusion between the everyday meaning of dependent (causally connected) and the
mathematical meaning of dependent (correlated). I am astonished that self-styled mathematical experts at talkchess.com who were
criticizing you didn't make this distinction. The difference in the two meanings is stark if one considers two engines playing each other
twice from a given position with fixed node counts, because the results of the two playouts will surely be the same. Neither playout
affects the other causally, so they are not dependent at all in the everyday sense, but the winner is always the same, which is to say
the outputs are perfectly correlated, and therefore as mathematically dependent as it gets.
====================================================
Note that statistically speaking, such games are not dependent or correlated at all. A sampling that always returns exactly the same value obeys all the laws for independent sampling, with respect to the standard deviation and the central limit theorem. Having engines that play very reproducibly, and can only play 2 or 3 different games from a position that they have to play a hundred times, would still produce perfectly independent sampling, with a statistical error < 0.5/sqrt(N), provided the choice of which of the few possible games they play is not dependent on the choice they made in the previous game.
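The 0.5/sqrt(N) bound follows from the fact that a single game result, scored 0, 1/2 or 1, can never have a variance above 1/4:

Code: Select all

Var(one game) <= 1/4    (maximal for an even win/loss split)
SD(mean score of N independent games) = SD(one game) / sqrt(N) <= 0.5 / sqrt(N)

This holds however few distinct games are actually possible, as long as successive games are drawn independently.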
This is a topic I had already mentioned. In a perfect world, if A plays B, and A is better than B, then A will always win. And there will be perfect correlation between games, since any game can be used to predict the outcome of any other, even though there is none of the "causal dependency" present, since the games do not affect each other in any fashion.

Here's the next segment which again is a rehash of what has been explained previously:

=====================================================
Let's consider a series of hypothetical trial runs. I assume that you are as capable as anyone in the industry of preventing any causal
dependence between various games in the trials, so causal dependence will not factor in my calculations at all. I believe you when you
say that you have solved that problem.

Trial A: Crafty plays forty positions against each of five opponents with colors each way for a total of 400 games. The engines are each
limited to a node count of 10,000,000. Crafty wins 198 games.

Trial B: Same as Trial A, except the node count limit is changed to 10,010,000. Crafty wins 190 games.

Now we compare these two results to see if anything extraordinary has happened. In 400 games, the standard deviation is 10, and the
difference in results was only 8, so we are well within expected bounds. There's nothing to get excited about, and we move on to the
next experiment.

Trial C: Same as Trial A, except that each position-opponent-color combination is played out 64 times. Yes, this is a silly experiment,
because we know that repeated playouts with a fixed node count give identical results, but bear with me. Crafty wins (as expected)
exactly 12672 games.

Trial D: Same as Trial B, except that each position-opponent-color combination is played out 64 times. Crafty wins 12160, as we knew it
would.

Now we compare the latter two trials. In 25,600 games the standard deviation is 80, and our difference in result was 512, so we are more
than six sigmas out. Holy cow! Run out and buy lottery tickets! :)

In this deterministic case it is easy to see what happened. The perfect correlation of the sixty-four repeats of each combination meant
that we were gaining no new information by expanding the trial. The calculation of standard deviation, however, assumes no correlation
whatsoever, i.e. perfect mathematical independence. Since the statistical assumption was not met, the statistical result is absurd.
=====================================================
Seems to me the point made above is not relevant. Playing at 10,000,000 nodes is not the same experiment as playing at 10,010,000 nodes. Each match is a sample of a different process, each process having its own standard deviation. Which is actually zero for totally deterministic node-count-based TC games. So the results actually lie an infinite number of SDs apart, making it 100% certain that Crafty has a better performance against these opponents, with these starting positions, at the slightly different node count. Great! But of course meaningless, as the sample of opponents and positions was too small for this result to have any correlation with the performance on all positions against all possible opponents.

But none of it has any bearing on the reported 6-sigma deviation between the 2x25,000 games: there the conditions were supposed to be identical.
But at this point, we return to "stampee foot, impossible, stampee foot, test is flawed, stampee foot, etc..."

Now before I go farther, I will stop here and see if anyone wants to contest, add to, contradict, etc the above.

So just perhaps, this explains the so-called "six-sigma" event of my first post. And it explains why so many runs have been producing odd results. And it does once again explain exactly how there is correlation, simply because of the opponents and positions and semi-deterministic behavior of programs... Of course, it also suggests quite a bit more than that, since so many are doing this same exact test...
Bullshit. That two different experiments give a different answer can never be used to explain why repeating the same experiment gives a different answer.
His next suggestion to help is one that will take me a bit of time to think about, as it is _completely_ counter-intuitive to me on first analysis. I'll save that until after the discussion on the above.

your turn...