Towards a standard analysis output format

Don · Post by **Don** » Sat Mar 26, 2011 1:49 pm

I also wanted to see how the endgame compared to the opening, so using the same data I plotted just those points AFTER the first 40 moves had been played. I compared this to the "opening" graph where ONLY the first 40 moves were considered. The results are startling: they indicate that (for my program) being a pawn up in the endgame is not as good as being a pawn up in the early part of the game.

The red "z7" line is the opening phase, and the blue "e7" line is the ending. I define the ending as all positions after move 40, and the opening as everything up to and including move 40.

Another thing I noticed is that the endgame curve has a serious glitch at the midpoint. I'm not sure what to make of that, but it seems that being 1/4 pawn down is not much worse than being 1/4 pawn up. It could mean that Komodo measures things that are not relevant in the endgame.

Don · Post by **Don** » Sat Mar 26, 2011 1:56 pm

bob wrote:
hgm wrote:
bob wrote:I wonder if I ought to actually do this formally as an experiment? I could certainly take Crafty and play it against the gauntlet, normally, then at -1.00, and then -2.00 to see just what effect removing one or two pawns does in terms of Elo...
I think such 'direct' piece-value measurements are quite interesting. I played many tens of thousands of such material-imbalance games, not only with Pawn odds, but also deleting Bishop vs Knight, Bishop vs Knight + Pawn, Queen vs 2 Bishops + 1 Knight, etc. (Mostly on 10x8 boards.) In principle you could measure the score of any advantage that way. E.g. if you want to know how much castling rights are worth, play one side without castling rights, and see by how much it loses, in comparison to the Pawn-odds score.

To make it work you need an engine that randomizes well (as there are no books for such imbalanced positions), or shuffle the initial positions (e.g. in Chess960 fashion). And deleting multiple Pawns sometimes gave inconsistent results (the second Pawn having negative effective value), presumably because you give that side a tremendous advantage in development, which can be very dangerous in 10x8 Chess.
I was thinking of taking my test set of starting positions, and simply removing a pawn in each, since I play without a book anyway.

But even more importantly, I might try to take the evaluation of Crafty, and "discretize" the numbers into blocks of say .25, and then count wins and losses to see how +0, +.25, +.5 and so forth compare to the final result...

The only issue I see is that a +.5 in the opening might be meaningless, while in an endgame it is possibly winning.

I just posted a graph that seems to indicate just the opposite, at least for Komodo. If you have a 0.5 advantage in the ending (for Komodo) you don't have much of an advantage, but in the opening it is a bigger advantage. The difference is alarmingly significant.

However, this does not mean you are wrong - Komodo may just be broken with respect to how these phases are scaled - and in fact I consider it an error that they don't scale the same. I believe my program would be stronger if a pawn up meant the same thing in the opening and endgame.

Don

Don · Post by **Don** » Sat Mar 26, 2011 4:29 pm

bob wrote:
hgm wrote:
bob wrote:I wonder if I ought to actually do this formally as an experiment? I could certainly take Crafty and play it against the gauntlet, normally, then at -1.00, and then -2.00 to see just what effect removing one or two pawns does in terms of Elo...
I think such 'direct' piece-value measurements are quite interesting. I played many tens of thousands of such material-imbalance games, not only with Pawn odds, but also deleting Bishop vs Knight, Bishop vs Knight + Pawn, Queen vs 2 Bishops + 1 Knight, etc. (Mostly on 10x8 boards.) In principle you could measure the score of any advantage that way. E.g. if you want to know how much castling rights are worth, play one side without castling rights, and see by how much it loses, in comparison to the Pawn-odds score.

To make it work you need an engine that randomizes well (as there are no books for such imbalanced positions), or shuffle the initial positions (e.g. in Chess960 fashion). And deleting multiple Pawns sometimes gave inconsistent results (the second Pawn having negative effective value), presumably because you give that side a tremendous advantage in development, which can be very dangerous in 10x8 Chess.
I was thinking of taking my test set of starting positions, and simply removing a pawn in each, since I play without a book anyway.

But even more importantly, I might try to take the evaluation of Crafty, and "discretize" the numbers into blocks of say .25, and then count wins and losses to see how +0, +.25, +.5 and so forth compare to the final result...

The only issue I see is that a +.5 in the opening might be meaningless, while in an endgame it is possibly winning.

My previous opening/endgame comparison was flawed because I went by move number. The correct way is to go by stage of game. Komodo uses 24 game stages, where I consider 12 and higher to be opening and less than 12 to be endgame. When I make two plots based on that, I get two smooth curves without the anomalies of the previous graph.

For reference, we add up non-pawn material to get a game phase where N=0, B=1, R=2, Q=6
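As a rough illustration of the phase rule just described, here is a minimal Python sketch. It assumes the position is available as a FEN string and that both sides' pieces are counted; the function names and the FEN input are illustrative assumptions, not Komodo's actual code. Only the weights and the threshold of 12 come from the post.

```python
# Phase weights as stated in the post; pawns and kings contribute nothing.
PHASE_WEIGHTS = {"n": 0, "b": 1, "r": 2, "q": 6}


def game_phase(fen: str) -> int:
    """Sum the phase weights of all non-pawn pieces of both sides."""
    placement = fen.split()[0]  # first FEN field: piece placement only
    return sum(PHASE_WEIGHTS.get(ch.lower(), 0) for ch in placement)


def is_opening(fen: str, threshold: int = 12) -> bool:
    """Phase 12 and higher counts as opening, lower as endgame."""
    return game_phase(fen) >= threshold


START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(game_phase(START))  # 24: four bishops, four rooks, two queens
```

With these weights the full starting material sums to 24, which matches the 24 stages mentioned above.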

Here is the improved graph:

It still looks like a given score advantage is better in the opening, but not by as much.

sje · Post by **sje** » Sun Mar 27, 2011 9:10 pm

As I've written, I doubt if it's possible for all or even most to agree on a single line format for reporting a search analysis. Perhaps we might be able to accomplish something a bit less ambitious but still useful.

Consider an XML or a Lisp format data structure mapped to text with slots for ALL potentially interesting analysis data. A searching program would fill in the slots according to its author's preferences and an analysis receiving program (possibly part of the same program) would map the input filled slots, again according to preferences and not to some absolute standard. A lot of code could be re-used among different authors.

Example: A ScorePOV slot could have values of "WhitePOV", "BlackPOV", "SideToMovePOV", "SideNotToMovePOV", and gosh maybe some others. This would be filled by the originating program and then used by the receiving program to be remapped as desired.

Example: A ScoreInterpret slot could be filled with "PawnUnits", "CentipawnUnits", "MillipawnUnits", or whatever. Special scores such as "MateInN", "MatN", and others could be handled similarly.

No user would ever have to see the interchange structure, so it doesn't have to be pretty; it only has to be clear.
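To make the remapping idea concrete, here is a small Python sketch of what a receiving program might do with the ScorePOV and ScoreInterpret slots. The slot values are the ones named above; the function name, the choice of White-relative centipawns as the target, and the `white_to_move` parameter are assumptions made for illustration only.

```python
# Units per pawn for each ScoreInterpret value named in the post.
UNITS_PER_PAWN = {"PawnUnits": 1, "CentipawnUnits": 100, "MillipawnUnits": 1000}


def to_white_centipawns(score, score_pov, score_interpret, white_to_move):
    """Remap a numeric score to centipawns from White's point of view."""
    centipawns = score * 100 / UNITS_PER_PAWN[score_interpret]
    if score_pov == "WhitePOV":
        sign = 1
    elif score_pov == "BlackPOV":
        sign = -1
    elif score_pov == "SideToMovePOV":
        sign = 1 if white_to_move else -1
    elif score_pov == "SideNotToMovePOV":
        sign = -1 if white_to_move else 1
    else:
        raise ValueError(f"unknown ScorePOV value: {score_pov}")
    return sign * centipawns
```

A receiver with different preferences would simply pick a different target; the point is that the sender never has to know what that target is.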

UncombedCoconut · Post by **UncombedCoconut** » Mon Mar 28, 2011 2:02 am

Score units would be great: e.g., DTM, DTC, Pawns (may be fractional).
A structured format could easily allow programs to return a predicted tree (or DAG??) rather than line.
What goes in the tree depends on engine settings.
A widely implemented example is Multi-PV output. Protocol support for even this sucks. (CECP doesn't standardize it, although at least Fairy-Max is capable. UCI is worse, and standardizes something inflexible: e.g., Spark adjusts the number of lines based on a score margin. I'm 99% sure this makes it non-compliant.)

I'm imagining much cooler future engine features, aimed at beginners: "Why is this combination sound?" "Why is this move unsafe?" Instead of showing a PV that declines a sacrifice (or rejects another move that looks good at low draft) in the middle, show a refutation. Perhaps, when a move creates a strong threat, include a subsequent variation that starts with a nullmove.

sje · Post by **sje** » Mon Mar 28, 2011 2:48 am

What is needed is an association list format. See: http://en.wikipedia.org/wiki/Association_list

In ASCII (or Unicode), this could look something like:

Code: Select all

alist ::= ( tag-pair* )

tag-pair ::= ( tag-name  tag-value )

tag-name ::= <symbol>

tag-value ::= <symbol> | <number> | <string> | <list>

For the new CIL toolkit, an example might look like:

Code: Select all

(
  (ScorePOV SideToMovePOV)  ; alt: WhitePOV, etc.
  (ScoreUnits millipawns)   ; alt: pawns, decimalpawns, etc.
  (TimeUnits milliseconds)  ; alt: seconds, decimalseconds, etc.
  (SpecialScores Symbolic)  ; alt: Crafty, UCI, etc.
  (Notation SAN)            ; alt: Coordinate, FAN, etc.
  (Score -1423)
  (PV (e4 e5 Nf3 Nc6))
  (Draft 4)
  (CpuUsage 14532)
  (Elapsed 7266)
  (BookProbes 2)
  (TablebaseProbes 0)
  (NodesAll 234)
  (NodesInterior 134)
  ...
)
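For what it's worth, a reader for this kind of association list is short. The following Python sketch follows the grammar given above and treats `;` as a comment to end of line, as in the example. It is not the CIL toolkit's code; all names are illustrative, and it assumes no `;` appears inside a quoted string and that the input is complete (no `...` placeholder).

```python
import re

# Parentheses, double-quoted strings, or bare symbols/numbers.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()";]+')


def tokenize(text):
    """Strip ';' comments line by line, then split into tokens."""
    lines = [line.split(";", 1)[0] for line in text.splitlines()]
    return TOKEN.findall("\n".join(lines))


def atom(token):
    """Map a token to a string, a number, or a symbol (kept as str)."""
    if token.startswith('"'):
        return token[1:-1]
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token


def parse(tokens, pos=0):
    """Parse one value at tokens[pos]; return (value, next position)."""
    if tokens[pos] == "(":
        items, pos = [], pos + 1
        while tokens[pos] != ")":
            value, pos = parse(tokens, pos)
            items.append(value)
        return items, pos + 1
    return atom(tokens[pos]), pos + 1


def parse_alist(text):
    """Return the tag pairs of an alist as a dict."""
    pairs, _ = parse(tokenize(text))
    return {name: value for name, value in pairs}


EXAMPLE = """
(
  (ScorePOV SideToMovePOV)  ; alt: WhitePOV, etc.
  (ScoreUnits millipawns)
  (Score -1423)
  (PV (e4 e5 Nf3 Nc6))
  (Draft 4)
)
"""
slots = parse_alist(EXAMPLE)
print(slots["Score"], slots["PV"])  # -1423 ['e4', 'e5', 'Nf3', 'Nc6']
```

A receiving program would look up only the slots it cares about and ignore the rest, which is what makes the format tolerant of differing author preferences.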

jwes · Post by **jwes** » Mon Mar 28, 2011 8:14 am

bob wrote: Logically there must be _some_ correlation between CE and winning expectation, else the evaluation is broken, and badly. How strong the correlation is is a question. I think I am going to tackle this when a student load ends on the cluster...

One way to do this is to just annotate each move in the PGN with the score and save a few million games from your testing. Then the data could be sliced and diced any which way, e.g. one class of positions might have a different relation between score and winning percentage than another, which would suggest that one or the other is mis-evaluated.
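A minimal Python sketch of the counting step, combining this with the idea of discretizing scores into blocks of 0.25 pawns mentioned earlier in the thread. It assumes the (score, result) pairs have already been extracted from the annotated games; how scores are stored in the PGN is not specified here, so that part is left out. The function names and the sample data are made up for illustration.

```python
from collections import defaultdict


def bucket(score, width=0.25):
    """Round a score in pawns to the nearest multiple of `width`."""
    return round(score / width) * width


def win_rate_by_score(samples, width=0.25):
    """Average game result per score bucket.

    `samples` holds (score_in_pawns, result) pairs, where result is
    1, 0.5 or 0 from the same point of view as the score.
    """
    totals = defaultdict(lambda: [0.0, 0])  # bucket -> [result sum, count]
    for score, result in samples:
        cell = totals[bucket(score, width)]
        cell[0] += result
        cell[1] += 1
    return {b: total / count for b, (total, count) in sorted(totals.items())}


# Invented sample data, only to show the shape of the input.
samples = [(0.30, 1), (0.20, 0.5), (0.26, 1), (-0.10, 0), (0.05, 0.5), (1.02, 1)]
rates = win_rate_by_score(samples)
```

Running the same function separately on different classes of positions (for example, opening versus endgame) would give the per-class curves to compare.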

marcelk · Post by **marcelk** » Fri May 13, 2011 7:22 pm

sje wrote:When reporting a search result in character formatted output, it can be convenient to have the reported analysis appear on a single line. This is certainly the case when a result is presented as a kibitz in server play.

Wouldn't it also be convenient if program authors were to adopt a standard for analysis reporting? This would better support program parsing including parsing to map the report into a graphical interface.

A sample position and its analysis as reported by some of my code:
Code: Select all
[MateIn4/7/17.269/111,573/0] 42... Rg2+ 43 Be2 Bxe4+ 44 Kc1 Qxe1+ 45 Bd1 Bg5#
Key:

Inside the brackets, in order:
1) Expectation (decimal pawns or a special symbol like MateIn7, LoseIn2, Even, Checkmated)
2) Integer ply draft
3) Decimal seconds of CPU usage
4) Node count (commas inserted for human readability)
5) Tablebase probe count

After the bracket set, the predicted variation (if any) appears with move number labeling.
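To show how a program might read such a line, here is a hedged Python sketch based only on the key above. It assumes exactly five slash-separated fields inside the brackets, and it assumes that a CPU time containing colons (the kibitz samples in this thread include a value like 1:18.299) means minutes:seconds. The class and function names are illustrative.

```python
import re
from typing import NamedTuple, Union


class Analysis(NamedTuple):
    expectation: Union[float, str]  # decimal pawns, or a symbol like "MateIn4"
    draft: int
    cpu_seconds: float
    nodes: int
    tablebase_probes: int
    pv: str


LINE = re.compile(r"^\[([^\]]*)\]\s*(.*)$")


def parse_clock(text):
    """Convert 'ss.sss' or colon-separated 'm:ss.sss' to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_analysis(line):
    """Split one report line into its bracketed fields and the variation."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError("not a bracketed analysis line")
    fields = match.group(1).split("/")
    if len(fields) != 5:
        raise ValueError("expected five fields inside the brackets")
    expectation, draft, cpu, nodes, probes = fields
    try:
        expectation = float(expectation)
    except ValueError:
        pass  # special symbols such as MateIn4 or Even stay as text
    return Analysis(
        expectation,
        int(draft),
        parse_clock(cpu),
        int(nodes.replace(",", "")),  # commas are for human readers only
        int(probes.replace(",", "")),
        match.group(2),
    )


sample = "[MateIn4/7/17.269/111,573/0] 42... Rg2+ 43 Be2 Bxe4+ 44 Kc1 Qxe1+ 45 Bd1 Bg5#"
report = parse_analysis(sample)
print(report.expectation, report.draft, report.nodes)  # MateIn4 7 111573
```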

Seeing that Symbolic(C) has made its re-appearance on ICC (thanks a lot for that, it is really a difficult engine to checkmate!), I have to say I'm not yet enthusiastic about the readability of its kibitzing output. But take no offense, I have had worse experiences with the output of some other engines and interfaces. Most try to dump programming language notation to the channel. It would be better if they were designed more with human readability in mind.

Code: Select all

Symbolic(C) kibitzes: [+0.276/12/1:18.299/72,073,603/0] 16 gxh4 Bxh4 17 c3 Be7 18 h4 Be6 19
\   Nbd2 O-O 20 cxd4 exd4 21 Qc2 Bd6
aics% 
Rookie(C) kibitzes: 16... Bxh4, +0.802, 14 ply, 4.1 Mnps, 51.4 s
aics% 
Symbolic(C) kibitzes: [+0.267/10/7.562/7,282,927/0] 17 c3 Be7 18 h4 Be6 19 Nbd2 Rc8 20 Nc4
\   Bxc4 21 dxc4 Bxh4 22 cxd4 exd4
aics% 
Rookie(C) kibitzes: 17... Be7, +1.088, 15 ply, 4.7 Mnps, 44.7 s
aics% 
Symbolic(C) kibitzes: [+0.189/11/28.736/26,452,153/0] 18 Rd1 Be6 19 Nbd2 Rc8 20 Nf1 dxc3 21
\   Qxc3 Nd4 22 Qd2 Bb3 23 Re1
aics% 
Rookie(C) kibitzes: 18... Bh3, +1.395, 14 ply, 3.7 Mnps, 18.9 s
aics% 
Symbolic(C) kibitzes: [+0.268/10/6.558/6,355,893/0] 19 Bxh3 Qxh3 20 cxd4 exd4 21 Nbd2 Rc8 22
\   Qc4 Bd6 23 Nf1 O-O

For me, the commas in the node counter don't add readability, maybe because the slashes already make the part between brackets harder to parse. The missing periods after the move numbers in the PV are not compliant with PGN export format, and I find that harder to scan. I think adding a space here and there will help a lot.

PS: I find it strange that it sometimes kibitzes an "Even" score when making a single-legal move. I would make that "None" at best...
