PGN Parser Interface in C

mathmoi · Post by **mathmoi** » Fri Aug 05, 2011 3:23 am

Hi,

Just for fun I'm about to write a PGN parser in C. The reason I do want to do that is that I never wrote a parser from scratch. For example, the PGN parser in my engine is written with boost.spirit a parser framework that abuse the C++ templates (and operator overloading) system in order to create a grammar language within the C++ language.

So, I want to write a parser from scratch. I choose C, a language I don't normaly use, so that if it's efficient and the proud enough of the code I may open source it and it could maybe be usefull to someone else. I remember when I wrote my first PGN parser I wished I could have just used someone else library.

I'm at the stage where I need to define the public interface of a PGN Parser in C. I think I'll simply have one function to parse a single game and one function to parse all the games from a file (or an null terminated string). But what should theses function returns?

As a programmer used to OO languages I'm thinking about returning a PgnGame struct, that would contain a tree of PgnNode and each node would have a specific type (PgnMoveNode, PgnNagNode, PgnCommentNode, PgnVariationNode, etc.). I know I can easily implement some pseudo inheritance in C, but I feel like I'm trying to write C code with a C++ mindset.

So before I go further I though I'd ask here : How do you think I should represent a PGN game in memory, in C? It's important to note that I wish to implement the whole PGN Standard, so I need to handle moves, comments, variations, etc. Even the never used Numeric Annotation Glyph.

Thanks

mp

hgm · Post by **hgm** » Fri Aug 05, 2011 9:10 am

As it happens I just did a lot of work on the XBoard parser, completely rewriting the move parser / lexical scanner (wich originally was an awful lex-generated thing) from scratch in C, and writing a back-end for fast reconstruction of the main line of games.

The representation would be most useful if the different information 'channels' (moves, comments, variations) are well separated, so that the user does not need to disentangle them when he wants to follow the game. The game would most conveniently be represented by a list of moves (given as pairs of square numbers). Fromeachmove you couldhang a linked list of other elements present after the move. Applications not interested in that info can then simply skip it.

It would perhaps be useful if it was already possible to specify that the parser should ignore this information (just skipping over it) to save time and memory space, e.g. when you are merely interested in searching a position in the game. Automatic processing of a batch of games will almost never be interested in the comments.

Remember that the practical use of a parser is as good as its error recovery. ThePGN standard is very strict, and a parser that merely returns a message "illegal game" when there is a tiny violation of the standard would reject too many games to be useful.

mcostalba · Post by **mcostalba** » Fri Aug 05, 2011 10:36 am

mathmoi wrote: So before I go further I though I'd ask here : How do you think I should represent a PGN game in memory, in C? It's important to note that I wish to implement the whole PGN Standard, so I need to handle moves, comments, variations, etc. Even the never used Numeric Annotation Glyph.

I am not an expert of PGN databases, but I stumbled across 2 new technologies that seems to be extremely fast, scalable and, most important, seems fun to hack on:

http://code.google.com/p/leveldb/

http://code.google.com/p/indexeddb/

As I said I am not a PGN expert, but as a general rule for a database parsing library (that I have wrote regarding git repositories), I have learned that the key to a fast and scalable software is to delay as much as possible _any_ computation / manipulation of the source data. For instance I'd think to even do not load in memory the PGN file, and do real the minimum parsing: just store references to the position in the PGN file of the single games, and parse the selected games only on demand when user needs them. This should make fast the scanning of very big PGN files and keep low the allocated memory.

hgm · Post by **hgm** » Fri Aug 05, 2011 12:53 pm

Indeed, I can confirm that. I just equiped XBoard with a position search function. For this it is obviously necessary to reconstruct every board position of every game (which is needed to parse the PGN anyway). XBoard already had code to load the game file, in which case it skips over moves without really interpreting them, and just remembers the location of game starts in the file, and extracts the PGN tag info for display in the list.

One of the things I changed was not testing the moves for legality unless it would be ambiguous in terms of pseudo-legal moves, to see if the check rule resolves the ambiguity Even when completely redoing the disambiguation from scratch with check-testing when it is ambiguous without, rather than remembering the pseudo-legal moves that satisfied the SAN form, and only test those, gave a speedup of factor 2.5. (And before that, disambiguation in XBoard was really stupid, generating all moves and testing them for legality before even attempt to match them with the SAN, rather than matching first and subject only valid candidates for check testing. But I had already fixed that earlier, for the purpose of displaying engine PVs as SAN.)

Of course SAN is about the last format you would want in a database application; long-algebraic with special codes for castling and e.p. is enormously faster to process.

mathmoi · Post by **mathmoi** » Fri Aug 05, 2011 2:13 pm

hgm wrote:The representation would be most useful if the different information 'channels' (moves, comments, variations) are well separated, so that the user does not need to disentangle them when he wants to follow the game. The game would most conveniently be represented by a list of moves (given as pairs of square numbers). Fromeachmove you couldhang a linked list of other elements present after the move. Applications not interested in that info can then simply skip it.

Hi,

This is a good advice and I though about this when I wrote my first parser (in C++). The reason I did not go with it a the time is that there can be some elements (comments for examples) before the first move or after the last move. If a game is represented by a list of moves with a list of other elements for each of them I have to handle a special case for elements that appears before any moves. Any idea about this?

hgm wrote:It would perhaps be useful if it was already possible to specify that the parser should ignore this information (just skipping over it) to save time and memory space, e.g. when you are merely interested in searching a position in the game. Automatic processing of a batch of games will almost never be interested in the comments.

That's a good idea, I'll probably implement it.

hgm wrote:Remember that the practical use of a parser is as good as its error recovery. ThePGN standard is very strict, and a parser that merely returns a message "illegal game" when there is a tiny violation of the standard would reject too many games to be useful.

I plan to be as permissive as possible about what you feed the parser.

Thanks for your input,

mp

Don · Post by **Don** » Fri Aug 05, 2011 3:48 pm

hgm wrote:Indeed, I can confirm that. I just equiped XBoard with a position search function. For this it is obviously necessary to reconstruct every board position of every game (which is needed to parse the PGN anyway). XBoard already had code to load the game file, in which case it skips over moves without really interpreting them, and just remembers the location of game starts in the file, and extracts the PGN tag info for display in the list.

One of the things I changed was not testing the moves for legality unless it would be ambiguous in terms of pseudo-legal moves, to see if the check rule resolves the ambiguity Even when completely redoing the disambiguation from scratch with check-testing when it is ambiguous without, rather than remembering the pseudo-legal moves that satisfied the SAN form, and only test those, gave a speedup of factor 2.5. (And before that, disambiguation in XBoard was really stupid, generating all moves and testing them for legality before even attempt to match them with the SAN, rather than matching first and subject only valid candidates for check testing. But I had already fixed that earlier, for the purpose of displaying engine PVs as SAN.)

Of course SAN is about the last format you would want in a database application; long-algebraic with special codes for castling and e.p. is enormously faster to process.

My first high quality autotester was written in TCL and the event loop of tcl made it very simple to build a tester that would handle many running instances of matches.

However, what KILLED the use of TCL was outputting correct PGN. Everything else was very fast and I could even play a couple of games per second on a quad - but only if I did not want to produce proper PGN with san notation.

So it came down to a decision about whether to try to super-optimize TCL or rewrite it in C. I had no doubt that I could double or triple the speed of the TCL code, but the whole idea of doing this in TCL was that it would be easy and convenient. So I ended up writing it in C. but that turned out to be difficult for portability reasons - Larry needed to be able to run in Windows and the event library and forking processes is different in Windows. So I ended up with a tester that works in Java and is cross platform and it outputs PGN with proper SAN and is capable of many games per second even on 1 core if you want to test that fast.

Don

hgm · Post by **hgm** » Fri Aug 05, 2011 5:05 pm

But why save the games in PGN, and not in a more compact database format with 2-byte moves? That would be much more compact for storage and transmission, and you could convert it to PGN off-line with a simple console application when needed.

Don · Post by **Don** » Fri Aug 05, 2011 5:34 pm

hgm wrote:But why save the games in PGN, and not in a more compact database format with 2-byte moves? That would be much more compact for storage and transmission, and you could convert it to PGN off-line with a simple console application when needed.

Indeed, this is how it worked. The tester produced long algebraic SAN which is good enough.

But even the legal move checking was slow enough to be noticeable. We can get a good sense of the speed by running 1 ply games and if you cannot get several games a second with komodo there is significant overhead.

But I generally try to build things that can serve many purposes and I envisioned this TCL package being used for other thing too. The mantra you hear is that high level languages like this are good enough for 90% of the things you do, but I keep getting bit, it's just not true.

So I started to build a Chess program user interface in TCL using my "chess" package and I like the PV's to be in SAN notation. But when processing output from the engine there would be a small but noticeable delay every time the PV got updated, especially when the PV was quite long. One of the reasons I like xboard is that it is very crisp and responsive.

hgm · Post by **hgm** » Fri Aug 05, 2011 5:57 pm

Yes, well, I have now also built options -fSAN, -sSAN into XBoard to force conversion of an engine PV to SAN before it is displayed, and indeed I was very worried about efficiency. Legality testing, SAN generation and SAN disambiguation was all extremely inefficient in XBoard. E.g. legaity testing was done by first generating all pseudo-legal moves of all pieces, then checking each of them for legality by generating all opponent legal moves to count how many of them capture a King, and only then compare the legal moves found to the given move, to see if the latter was amongst them.

I did speed that up by at least generating pseudo-legal moves of pieces that match the piece type of the given move, and test for legality only moves that match the given move, before I dared to add this feature. The test forKing capture is still horribly inefficient, though, but now, at least, it is done only once (and sometimes twice) per move, in stead of 40 times. An engine can spew out an awful number of moves during a search...

Don · Post by **Don** » Fri Aug 05, 2011 6:13 pm

hgm wrote:Yes, well, I have now also built options -fSAN, -sSAN into XBoard to force conversion of an engine PV to SAN before it is displayed, and indeed I was very worried about efficiency. Legality testing, SAN generation and SAN disambiguation was all extremely inefficient in XBoard. E.g. legaity testing was done by first generating all pseudo-legal moves of all pieces, then checking each of them for legality by generating all opponent legal moves to count how many of them capture a King, and only then compare the legal moves found to the given move, to see if the latter was amongst them.

I did speed that up by at least generating pseudo-legal moves of pieces that match the piece type of the given move, and test for legality only moves that match the given move, before I dared to add this feature. The test forKing capture is still horribly inefficient, though, but now, at least, it is done only once (and sometimes twice) per move, in stead of 40 times. An engine can spew out an awful number of moves during a search...

I think if you build a move generator specifically for fast SAN creation and legal testing it could be pretty fast if you pulled out all the stops, but I certainly did not want to spend a lot of time doing that in TCL and even in C it would require more effort that I might be willing to put into it.

My Java tester produces output in pure SAN and it's very fast - but I should check to see if there is a noticable overhead with my 1 ply test. I seriously doubt my Java implementation is as fast as it could be as I did not do any work to make it fast unless it was really obvious stuff that did not require much additional coding.

This was at least 2 or 3 years ago, but I think I did improve the tcl code at one point and saw that I could probably get another 2 to 1 with other things, but that was still not enough - and life is too short ....

PGN Parser Interface in C

PGN Parser Interface in C

Re: PGN Parser Interface in C

Re: PGN Parser Interface in C

Re: PGN Parser Interface in C

Re: PGN Parser Interface in C

Re: PGN Parser Interface in C

Re: PGN Parser Interface in C

Re: PGN Parser Interface in C

Re: PGN Parser Interface in C

Re: PGN Parser Interface in C