New testing thread

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

Uri Blass wrote:
hgm wrote: Note that statistically speaking, such games are not dependent or correlated at all. A sampling that always returns exactly the same value obeys all the laws for independent sampling, with respect to the standard deviation and the central limit theorem. Having engines that play very reproducibly, and can only play 2 or 3 different games from a position that they have to play a hundred times, would still produce perfectly independent sampling, with a statistical error < 0.5/sqrt(N), provided the choice of which of the few possible games they played was not dependent on the choice they made in the previous game.
I think that it depends on how you define correlation.
You and I consider X and Y to be uncorrelated if and only if
cov(X,Y)=0

It seems that the mathematician Hyatt talked with
considers X and Y to be uncorrelated only if the correlation coefficient is 0, and in the case of X=Y=constant the correlation coefficient is undefined.

Uri
This person has agreed to join the discussion, but registering has not quite worked out yet (he has registered but has not gotten a confirmation, he's working on figuring out what happened).
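A minimal sketch of the two definitions contrasted in the quoted exchange, together with the 0.5/sqrt(N) bound hgm mentions. It assumes game results are scored 1, 0.5, or 0; the function names and the simulated results are illustrative only and do not come from any data posted in this thread.

```python
import math
import random
import statistics

def score_std_error(results):
    """Standard error of the mean score for a list of results (1, 0.5, 0)."""
    return statistics.pstdev(results) / math.sqrt(len(results))

def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Pearson correlation, or None when either sequence has zero variance."""
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    if sx == 0 or sy == 0:
        return None  # 0/0: the coefficient is undefined for constant data
    return covariance(xs, ys) / (sx * sy)

rng = random.Random(1)          # fixed seed so the sketch is repeatable
n = 100
games = [rng.choice([0.0, 0.5, 1.0]) for _ in range(n)]  # simulated results
bound = 0.5 / math.sqrt(n)      # results lie in [0, 1], so stdev <= 0.5
constant = [1.0] * n            # an engine pair that always repeats one win

print(score_std_error(games), "<=", bound)
print(covariance(constant, constant), correlation(constant, constant))
```

For the constant sequence the covariance is 0 while the correlation coefficient comes out undefined, which is exactly where the two definitions part ways.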
bob

Re: 4 sets of data

Post by bob »

hgm wrote:
bob wrote:the data with round-robin matches is generally ordered the same, which is good and unlike the Crafty vs world games. But the Elo is still bouncing around enough that it would be very difficult to make a modest change and then successfully measure the change.
Yes, of course. BayesElo gives an error margin of 18 Elo. You know what '18' means, don't you? If the error margin is 18, you cannot use it to reliably measure differences of less than 18...
So while there are possible improvements in deciding which of the programs I am using is the best, measuring the difference between two Crafty versions seems harder. I am going to make a run with the check extension set to zero to see how that goes, another 4 runs, and I will post the results along with these again...
Got anything useful to offer? That's not what this was about. We were testing to see if round-robin offered more reliable ratings than just C vs world.
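For a sense of scale on the error margins being argued about, here is a rough back-of-the-envelope sketch. It is not how BayesElo computes its intervals; it assumes independent games between roughly equal opponents, the logistic Elo curve, and a normal approximation. The names `elo_margin` and `games_needed` and the `draw_ratio` parameter are made up for illustration.

```python
import math

# Slope of the logistic Elo curve at a 50% score, in Elo per unit of score.
ELO_PER_SCORE = 400.0 / (math.log(10) * 0.25)

def elo_margin(n_games, draw_ratio=0.0, z=1.96):
    """Approximate error margin in Elo (z standard errors) after n_games."""
    sigma = 0.5 * math.sqrt(1.0 - draw_ratio)   # per-game stdev near 50%
    return z * sigma / math.sqrt(n_games) * ELO_PER_SCORE

def games_needed(margin_elo, draw_ratio=0.0, z=1.96):
    """Smallest game count whose approximate margin is at most margin_elo."""
    sigma = 0.5 * math.sqrt(1.0 - draw_ratio)
    return math.ceil((z * sigma * ELO_PER_SCORE / margin_elo) ** 2)

for n in (800, 25000):
    print(n, "games -> about +/-", round(elo_margin(n), 1), "Elo")
print("games for +/-5 Elo:", games_needed(5.0))
```

The margin shrinks only with the square root of the number of games, so halving it takes four times as many games.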
krazyken

Re: Correlated data discussion

Post by krazyken »

bob wrote: All conditions except for time measurements which are _never_ identical.
What reason is there to believe that the two runs of 25000 don't both experience the same amount of "time jitter"? Why shouldn't I believe the amount of jitter in a game is random, and would even out with a larger sample?
hgm
Posts: 28393
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: 4 sets of data

Post by hgm »

bob wrote:Got anything useful to offer? That's not what this was about. We were testing to see if round-robin offered more reliable ratings than just C vs world.
Why would you want to test such a thing, which trivially follows from first principles? Would you also play a game of Chess twice, and then use BayesElo to 'test' if 1+1 equals 2???
Uri Blass
Posts: 10900
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Correlated data discussion

Post by Uri Blass »

The reason for your result is not that time measurements are never identical but that their distribution changes after many games.

Note that even if you could get the same distribution, the test is simply not suited to measuring small changes because the number of positions is too small.

The main problem is that some change may give +5 Elo on the Silver suite at the specific time control that you use but -10 Elo at a slightly different time control.

The best solution is simply to have more positions in your suite.

If you want to evaluate small changes, it is better to test 13000 different positions once with white and once with black than to test the same positions again and again.

I remember reading that the Rybka team uses a big set of positions, not a small set, for their tests at 1 second per game.

In order to get 26000 positions you can simply take some big PGN file and use the first 26000 different positions (after removing duplicates) after move 10 as starting points.

Uri
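The position-collection idea above can be sketched as follows. This is only an illustration under stated assumptions: it presumes some PGN reader (not shown) has already turned each game into a list of position keys such as FEN or EPD strings, with index 0 being the start position, and it reads "after move 10" as ply 20 onward. The function name and parameters are hypothetical.

```python
from typing import Iterable, List, Sequence

def collect_unique_positions(
    games: Iterable[Sequence[str]],
    first_ply: int = 20,     # assumption: "after move 10" = after 20 plies
    limit: int = 26000,
) -> List[str]:
    """Return the first `limit` distinct positions found at or after
    `first_ply`, scanning the games in order and skipping duplicates."""
    seen = set()
    picked = []
    for positions in games:
        for key in positions[first_ply:]:
            if key in seen:
                continue             # duplicate position, already collected
            seen.add(key)
            picked.append(key)
            if len(picked) == limit:
                return picked
    return picked                    # fewer than `limit` were available

# Toy data: short strings stand in for real FEN/EPD position keys.
toy_games = [
    ["start", "a", "b", "c"],
    ["start", "a", "b", "d"],
    ["start", "x", "b", "c"],
]
print(collect_unique_positions(toy_games, first_ply=2, limit=10))
```

Taking only one position per game instead would be a different reading of the suggestion and would spread the suite over more games.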
Uri Blass

Re: 4 sets of data

Post by Uri Blass »

hgm wrote:
bob wrote:Got anything useful to offer? That's not what this was about. We were testing to see if round-robin offered more reliable ratings than just C vs world.
Why would you want to test such a thing, which trivially follows from first principles? Would you also play a game of Chess twice, and then use BayesElo to 'test' if 1+1 equals 2???

I suggest not to insult Bob.

It is obvious to me and to you that games between non-Crafty versions are not going to help give Crafty a more accurate rating,
but it seems this is not obvious to Bob and the people who suggested that he play games between non-Crafty versions.

I noticed that other people posted nonsense, and as far as I remember I posted something against the test, but I understood that I had no chance of convincing other people unless they saw the results, so I stopped commenting on it.

Note that I did not study exactly how the rating program works, but it is obvious that a smaller error for Crafty after games between non-Crafty versions suggests some bug in the rating program, because I see no way this information can help produce a better estimate of Crafty's rating.

Uri
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: ugh ugh ugh

Post by xsadar »

bob wrote:
xsadar wrote:
bob wrote:So far as I know, and I can only state back to 1995, coordinate notation with respect to xboard has not insisted on Kg1 to indicate O-O. Crafty played its first games on ICC in December of that year and worked flawlessly with xboard, using the normal O-O and O-O-O. And O-O will _always_ work with xboard or winboard. That's all I have ever used and nobody has ever told me it was failing... So I am not quite sure where you are coming from with that.
Where I was coming from was this excerpt from the xboard documentation (emphasis mine):
move MOVE

Your engine is making the move MOVE. Do not echo moves from xboard with this command; send only new moves made by the engine.

For the actual move text from your chess engine (in place of MOVE above), your move should be either

* in coordinate notation (e.g., e2e4, e7e8q) with castling indicated by the King's two-square move (e.g., e1g1), or
* in Standard Algebraic Notation (SAN) as defined in the Portable Game Notation standard (e.g, e4, Nf3, O-O, cxb5, Nxe4, e8=Q), with the extension piece@square (e.g., P@f7) to handle piece placement in bughouse and crazyhouse.

xboard itself also accepts some variants of SAN, but for compatibility with non-xboard interfaces, it is best not to rely on this behavior.

Warning: Even though all versions of this protocol specification have indicated that xboard accepts SAN moves, some non-xboard interfaces are known to accept only coordinate notation. See the Idioms section for more information on the known limitations of some non-xboard interfaces. It should be safe to send SAN moves if you receive a "protover 2" (or later) command from the interface, but otherwise it is best to stick to coordinate notation for maximum compatibility. An even more conservative approach would be for your engine to send SAN to the interface only if you have set feature san=1 (which causes the interface to send SAN to you) and have received "accepted san" in reply.
What it says isn't really as pessimistic about SAN as I remembered it being, however. Also, it turns out that the Idioms section refers only to one engine with problems, and it had a patch to fix it. This is the relevant portion of the Idioms section:
ChessMaster 8000 also implements version 1 of the xboard/winboard protocol and can use WinBoard-compatible engines. The original release of CM8000 also has one additional restriction: only pure coordinate notation (e.g., e2e4) is accepted in the move command. A patch to correct this should be available from The Learning Company (makers of CM8000) in February 2001.
So, considering that you've never had problems, perhaps O-O and SAN aren't really a problem. But you can't really blame people for following the protocol's recommendations.
My complaint is that most that use that form of castling use it _everywhere_ which includes exporting PGN files. Crafty will happily accept that kind of input, because I got tired of getting complaints from users who could not read certain PGN collections. I could list a dozen ways people violate the PGN standard all the time, including at times ChessBase as well...

And this case was another classic example of two programs unable to communicate with each other, but which could communicate with crafty with no problems because I chose to support all forms of input to avoid problems.
Yes, I'll certainly agree with you that e1g1 (or Ke1g1 or any other weird variation on that) should not be put in PGN files. Anyone who does that really SHOULD be flogged.
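A small sketch of the two points in this exchange: accepting the common castling spellings on input while always writing standard O-O / O-O-O, and deciding when SAN is safe to send, following the protocol text quoted above. The function names are hypothetical, the `0-0` spelling is just one example of a non-standard variant, and the caller is assumed to know whether the moving piece is a king (a rook or queen can also legally go from e1 to g1).

```python
# King two-square moves in coordinate notation and their SAN equivalents.
COORD_TO_SAN = {
    "e1g1": "O-O", "e1c1": "O-O-O",
    "e8g8": "O-O", "e8c8": "O-O-O",
}

def normalize_castling(move_text, mover_is_king):
    """Return 'O-O' or 'O-O-O' for any recognised castling spelling;
    return every other move unchanged."""
    text = move_text.strip()
    as_letters = text.replace("0", "O")      # tolerate zeros: 0-0, 0-0-0
    if as_letters in ("O-O", "O-O-O"):
        return as_letters
    # Tolerate a leading piece letter as in "Ke1g1".
    coord = text[1:] if len(text) == 5 and text[0] == "K" else text
    if mover_is_king and coord.lower() in COORD_TO_SAN:
        return COORD_TO_SAN[coord.lower()]
    return text

def may_send_san(protover, san_accepted, conservative=False):
    """SAN is treated as safe after 'protover 2' or later; in conservative
    mode, only after the interface replied 'accepted san'."""
    if conservative:
        return san_accepted
    return protover is not None and protover >= 2

print(normalize_castling("e1g1", mover_is_king=True))
print(normalize_castling("e1g1", mover_is_king=False))
print(may_send_san(protover=None, san_accepted=False))
```

Normalizing on output keeps exported PGN standard even when the input side is lenient.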
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: ugh ugh ugh

Post by Tord Romstad »

Zach Wegner wrote:
Tord Romstad wrote:Glaurung does not yet support the XBoard protocol (the next version probably will)...
Awesome! :D

You might like to know that work is being done to back-port the Winboard_X/F enhancements back to xboard for us UNIX-based folk. There are also some interesting discussions about extending the protocol over at the Winboard Forum.
Don't get excited: Honestly, I am not very interested in XBoard or in the XBoard protocol. The UCI protocol satisfies my needs much better. The XBoard support in Glaurung will probably be rather basic. For instance, I doubt that I will bother to implement pondering.

My main motivation for adding XBoard support is as a preparation for the iPhone version of my program. Because Apple doesn't allow third-party applications to spawn new processes, I can't separate my program into a GUI and an engine, like I do in the desktop version of Glaurung. I have to write it all as a single, monolithic executable, and therefore I am forced to add messy stuff like keeping track of the state of the game and the clocks to the engine itself. Trying to make my engine run in XBoard without an adapter is just a way to test that everything works.

Tord
Tord Romstad

Re: ugh ugh ugh

Post by Tord Romstad »

bob wrote:
Tord Romstad wrote:By the way, I strongly recommend upgrading from Glaurung 2-ε/5 to Glaurung 2.1. As the name indicates, 2-ε/5 was just a beta version. The current version is less buggy, and much stronger.
I will do that once I get the kinks worked out on the Elo testing. I don't particularly care whether I use the strongest or not, just that all the opponents are perfectly consistent for the various test runs so the results are comparable.
I also frequently use old versions. The reason for my message was that you were not just using an old version, but an unstable and incomplete version. If you had used version 2.0 (which is also somewhat oldish, but complete and stable), I wouldn't have mentioned it.

Tord
bob

Re: Correlated data discussion

Post by bob »

krazyken wrote:
bob wrote: All conditions except for time measurements which are _never_ identical.
What reason is there to believe that the two runs of 25000 don't both experience the same amount of "time jitter"? Why shouldn't I believe the amount of jitter in a game is random, and would even out with a larger sample?
At least a couple of reasons. First, I posted two 25,000-game runs that had different Elo bounds for Crafty, where the only thing that differed between the two runs was timing: both used time per move, and time is not constant when it is measured over intervals that are fairly long. But I happen to agree that the timing jitter is a part of the test scenario that is inescapable, and it leads to artifacts that are causing statistical anomalies...
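One way to put a number on whether two runs differ by more than sampling noise is a simple z-score on the score difference. This is a sketch only: it assumes the games within each run are independent, and the win/draw/loss counts below are hypothetical placeholders, not the results of the two 25,000-game runs discussed here.

```python
import math

def run_stats(wins, draws, losses):
    """Mean score and its standard error for one run of games."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    return mean, math.sqrt(var / n)

def z_between_runs(run_a, run_b):
    """Difference in mean score divided by its combined standard error."""
    mean_a, se_a = run_stats(*run_a)
    mean_b, se_b = run_stats(*run_b)
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)

# Hypothetical (wins, draws, losses) for two 25,000-game runs.
run_1 = (10000, 5000, 10000)
run_2 = (10300, 5000, 9700)
print(round(z_between_runs(run_2, run_1), 2))
```

If jitter were purely random and the games independent, large |z| values between repeated runs should be rare; seeing them often would point to a violated assumption rather than to bad luck.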