JGN: A PGN Replacement

Ras · Post by **Ras** » Thu Nov 11, 2021 3:47 pm

Obligatory xkcd:

Fulvio · Post by **Fulvio** » Thu Nov 11, 2021 3:56 pm

hgm wrote: ↑Thu Nov 11, 2021 3:16 pm Banning the use of non-ascii is also problematic. OTOH, it avoids a real problem, namely that there is no generally accepted standard for encoding such characters, so that they might change when files get exchanged between different locales.

Am I the only one who has read the standard?


4.1: Character codes
PGN data is represented using a subset of the eight bit ISO 8859/1 (Latin 1) character set. ("ISO" is an acronym for the International Standards Organization.) This set is also known as ECMA-94 and is similar to other ISO Latin character sets. ISO 8859/1 includes the standard seven bit ASCII character set for the 32 control character code values from zero to 31. The 95 printing character code values from 32 to 126 are also equivalent to seven bit ASCII usage. (Code value 127, the ASCII DEL control character, is a graphic character in ISO 8859/1; it is not used for PGN data representation.)

The 32 ISO 8859/1 code values from 128 to 159 are non-printing control characters. They are not used for PGN data representation. The 32 code values from 160 to 191 are mostly non-alphabetic printing characters and their use for PGN data is discouraged as their graphic representation varies considerably among other ISO Latin sets. Finally, the 64 code values from 192 to 255 are mostly alphabetic printing characters with various diacritical marks; their use is encouraged for those languages that require such characters. The graphic representations of this last set of 64 characters is fairly constant for the ISO Latin family.

Printing character codes outside of the seven bit ASCII range may only appear in string data and in commentary. They are not permitted for use in symbol construction.

However utf-8 is clearly superior and I believe most programs accept that for comments and tag values.
(SCID converts latin-1 chars to utf-8).
Putting the BOM at the beginning of the file (like Chessbase) or allowing non-ASCII characters outside those fields would in my opinion be a mistake.
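
For readers who want the ranges from section 4.1 in executable form, here is a minimal sketch that classifies eight-bit code values the way the quoted text describes them. The function names and the class labels are made up for illustration; they do not come from the standard or from any PGN library.

```python
def pgn_code_class(code: int) -> str:
    """Classify an 8-bit code value following the ranges in PGN section 4.1."""
    if not 0 <= code <= 255:
        raise ValueError("section 4.1 only covers eight-bit code values")
    if code <= 31:
        return "ascii-control"      # 0..31: ASCII control characters
    if code <= 126:
        return "ascii-printing"     # 32..126: same as seven-bit ASCII
    if code <= 159:
        return "not-used"           # 127 and 128..159: not used for PGN data
    if code <= 191:
        return "discouraged"        # 160..191: varies among ISO Latin sets
    return "encouraged"             # 192..255: letters with diacritics


def non_ascii_bytes(data: bytes) -> list[tuple[int, int, str]]:
    """Return (offset, code, class) for every byte above code value 126."""
    return [(i, b, pgn_code_class(b)) for i, b in enumerate(data) if b > 126]
```

Running `non_ascii_bytes` over a Latin-1 encoded tag value shows which bytes fall into the "encouraged" range and which ones the standard would rather not see.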

dkl · Post by **dkl** » Sat Nov 13, 2021 3:49 pm

A few thoughts and explanations on "why".

First, is there a need for a text-based format? Or is a binary format better?

This is an easy answer I think. If there were no need for a text-based format, there would be no PGN. If you go online, play on a server or upload your game there, it's copy'n'paste with PGN. If you transfer games from one program to another, it's PGN. If you copy a game from your cellphone to your computer, it's PGN, maybe even via e-mail. This argument is no different from asking why we need JSON or XML when we have ASN.1.

And there is only one standard for a text-based chess format (not 14), and that is PGN. And the question is: Does PGN work, and is it good enough? A lot of folks have implemented it and their standpoint is: "works for me". My argument is that it works, but it's not working out really well. And of course I do not expect to convince those who say "works for me", but I am convinced that some see the issues:

Tord Romstad summed it up pretty well here. The problems with PGN can be put into three categories:

a) some things are not defined in the standard. There are conventions which are sometimes used, and sometimes not
b) ambiguities in the standard
c) it's difficult to implement

Some examples for a)

- Officially, UTF-8 is not supported, only ASCII. This leaves out a vast majority of earth's population, who cannot annotate games in their language properly. As hgm emphasized "why" - and this is not to badmouth Winboard, which is an excellent piece of software - a quick test on _my_ system: a UTF-8 encoded PGN (without BOM) containing umlauts did not display properly in Winboard. Everyone ignores the standard here: some PGNs are UTF-8, some are UTF-8 with BOM, some are ISO-Latin or other derivatives. To handle arbitrary PGNs you need to have a charset detector built into your software. And we know from browsers how well that works. You also can't write player names in the correct spelling but always have to use transliterations.
- PGN does not support nullmoves (forum3/viewtopic.php?t=39896)
- a lot of annotation markers are not officially defined, especially those w.r.t. engine analysis. The convention is to put things with % into comments, but it is not clear what must/should be supported.
- Comments must not have newlines
- It's difficult to put a chess book into PGN format. One reason is that variations can only appear after the main move in a position and not before. However, often the author wants to discuss some variations _before_ a move was played in the game.
- Quoting from Tord's reddit thread: You can't write something like "threatening 31. Nx7 Kxg7 32 ... with a strong attack" except when writing this out as pure text
- Newlines should be \r\n (or was it the other way round) which is often ignored

b) Import vs export format. I think that was discussed enough

c) Building a PGN parser is non-trivial. At least from my perspective: https://buildingjerry.wordpress.com/202 ... n-parsing/
Now there are indeed very fast implementations, e.g. SCID, which is probably the fastest of all GUIs out there, but: Suppose you simply want to build a minimal javascript chessboard that displays a game from a PGN. You will need to implement full chess logic, i.e. move generation. So you will need close to a full engine, just to display moves and resolve the ambiguities due to SAN.

It's almost impossible to implement a fast PGN parser in an interpreted language, i.e. Ruby/Python/Perl. However it's precisely these languages which are ideally suited to quickly hack something to bulk-process game data

You need to always implement your own parser from scratch. With JSON you can use well-tested libs which are available in almost any language, parse everything into a JSON tree object, and then just walk through the tree to turn this into your internal representation. Is your parser secure, btw? Have you applied fuzzing on it? Are you sure there is no potential exploit hiding deep down, especially if you implemented the parser in C/C++?

Human readability: I challenge that notion. Anything except simple games without annotations is _not_ human readable, and very few users use a text editor to create PGNs. They use a GUI. Maybe a text editor is used to edit/fix something, but that can be done equally well with JSON. NAGs are also not human readable, and they make no sense anymore when using UTF-8.

The approach here probably needs some work on the details, but it generally addresses all these issues. Due to nd-json, bulk-processing a JGN is as simple as:

    for line in lines:
        tree = parseJSON(line)
        internal_game_tree = walkThroughJsonTree(tree)

You don't even need a JSON parser that supports streaming, any simple implementation will do. Not using SAN makes parsing moves easy and does not require any chess-logic. \r\n vs \n is also of no concern, since you just need to scan for \n and throw the rest at the JSON parser.
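
To make the line-by-line idea concrete, here is a minimal Python sketch that uses only the standard `json` module. The function name `read_games` and the record keys in the usage note are assumptions for illustration; they are not taken from the JGN proposal or its schema.

```python
import json


def read_games(stream):
    """Yield one decoded JSON value per non-empty line of an ND-JSON stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()  # removes the "\n" and a stray "\r" if present
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Report which record is broken instead of failing anonymously.
            raise ValueError(f"line {line_number}: invalid JSON ({err.msg})") from err
```

With a hypothetical record layout such as `{"white": "...", "moves": ["e2e4", "e7e5"]}`, a bulk job is then `for game in read_games(open(path, encoding="utf-8")): ...`. Turning each decoded object into an internal game tree is still up to the caller.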

Variations can start at any point due to using an offset value.

And there are a lot of other quirks of PGN fixed.

I've attached a (rough) schema-validator btw. It's not perfect but validates the general structure.

hgm · Post by **hgm** » Sat Nov 13, 2021 4:15 pm

Seems to me you just moved all the real work to walkThroughJsonTree(). Processing a PGN file can be done by the single call processPgnGame()...

I see no argument for using long algebraic vs using SAN. Both move formats are trivial to parse, and both can potentially encode invalid and illegal moves.

Increasing the number of competing standards from 1 to 2 seems a lot worse than increasing it from 14 to 15. (100% increase of chaos, instead of 7%.)

What is not defined is a standard for computer annotations. This is outside the scope of PGN, to which these are just comments. If you want to have a standard for that, it will have to be defined. Whether you then implement that standard in a PGN context or some other context doesn't make any difference at all. This can never be used as an argument for preferring one context over another.

UTF-8 is not a native Windows encoding. In countries where it really matters (China, Japan, Taiwan) it isn't even very popular. WinBoard is written to work with Windows code pages. If you have a PGN file that violates the standard by using ShiftJIS-encoded Japanese player names and comments, my guess is that they would display correctly when you had the locale set correctly. That the PGN standard might require them to display as question marks I can happily ignore. Because no one requires me to display PGN. I can display game info in any format I want, and it seems more useful to display the non-ascii as is.

WinBoard does not support recognition of the BOM. That means that the user is responsible for specifying the charset to be used for display.

Harald · Post by **Harald** » Sun Nov 14, 2021 12:19 am

HGM wrote:

I see no argument for using long algebraic vs using SAN. Both move formats are trivial to parse, ...

Sorry, no. I think SAN is a pain in the ass when you not just scan through it but want to get the move information from it to update a chess board representation in your program. For all the obvious reasons that were mentioned here in this thread. Even reading the SAN text and finding the right move can be annoying for untrained human readers. It is just a waste of time for software developers and a waste of time for the computer software that has to read, write and apply it.

UTF-8 is not a native Windows encoding. In countries where it really matters (China, Japan, Taiwan) it isn't even very popular. WinBoard is written to work with Windows code pages. If you have a PGN file that violates the standard by using ShiftJIS-encoded Japanese player names and comments, my guess is that they would display correctly when you had the locale set correctly.

Sorry, no. I think UTF-8 has long ago won the battle for the best international text encoding format. Especially in the open source world. It won against ASCII, Latin-1, ..., Latin-10, ISO-8859-1, ..., ISO-8859-16, UTF-16, UTF-32, UCS-2, UCS-4, Windows codepages and some local encodings like ShiftJIS. Even Chinese companies use UTF-8 in international software products and databases that are shared with other companies.

dangi12012 · Post by **dangi12012** » Sun Nov 14, 2021 1:07 am

Harald wrote: ↑Sun Nov 14, 2021 12:19 am HGM wrote:
I see no argument for using long algebraic vs using SAN. Both move formats are trivial to parse, ...
Sorry, no. I think SAN is a pain in the ass when you not just scan through it but want to get the move information from it to update a chess board representation in your program. For all the obvious reasons that were mentioned here in this thread. Even reading the SAN text and finding the right move can be annoying for untrained human readers. It is just a waste of time for software developers and a waste of time for the computer software that has to read, write and apply it.

True words - A complex inner state is needed to understand what the tokens mean - just to save a single char here and there. Would be much easier if it just says from-to squares and promotion.
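
A rough sketch of what "from-to squares and promotion" parsing could look like, assuming a coordinate format such as "e2e4" or "e7e8q" (an assumption here, not necessarily the exact JGN move syntax). All names are hypothetical. Note that this only checks syntax and extracts squares; it says nothing about whether the move is legal in the position.

```python
import re
from typing import NamedTuple, Optional

# From-square, to-square, optional lowercase promotion piece.
MOVE_RE = re.compile(r"([a-h][1-8])([a-h][1-8])([qrbn])?")


class Move(NamedTuple):
    from_square: int          # 0 = a1, 7 = h1, 63 = h8
    to_square: int
    promotion: Optional[str]  # "q", "r", "b", "n" or None


def square_index(name: str) -> int:
    """Map a square name like "e4" to an index in the range 0..63."""
    return (ord(name[0]) - ord("a")) + 8 * (int(name[1]) - 1)


def parse_coordinate_move(text: str) -> Move:
    """Parse a coordinate move without consulting any board state."""
    match = MOVE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a coordinate move: {text!r}")
    src, dst, promo = match.groups()
    return Move(square_index(src), square_index(dst), promo)
```

By contrast, a SAN token such as "Nbd7" names only the destination square and a file hint, so finding the from-square requires knowing where the knights currently stand.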

mar · Post by **mar** » Sun Nov 14, 2021 10:17 am

hgm wrote: ↑Sat Nov 13, 2021 4:15 pm UTF-8 is not a native Windows encoding. In countries where it really matters (China, Japan, Taiwan) it isn't even very popular. WinBoard is written to work with Windows code pages. If you have a PGN file that violates the standard by using ShiftJIS-encoded Japanese player names and comments, my guess is that they would display correctly when you had the locale set correctly. That the PGN standard might require them to display as question marks I can happily ignore. Because no one requires me to display PGN. I can display game info in any format I want, and it seems more useful to display the non-ascii as is.

that's unfortunate - that's why we can't have nice things

WinAPI supports "wide character" versions of everything and this works with Unicode (UTF-16 actually), so nothing prevents you from using these and voila - you can suddenly support utf-8 on Windows. also MultiByteToWideChar does support utf-8. of course it's trivial to decode utf-8, so there's even no need for that. NTFS also uses utf-16 encoding, so you can even have filenames (say copied from another computer) with characters outside the "native" encoding that you won't be able to open without the wide API versions anyway (e.g. fopen will fail here as well, unfortunately).
on Linux there's no such problem, because filenames are just a sequence of bytes if I'm not mistaken - and since there's no case insensitive filesystem (=a disaster), everything just works. since the OS uses utf-8 natively, this sequence of bytes is actually utf-8 and everything just works out of the box.

doesn't the P in PGN stand for portable? codepages are anything but portable.

when you move a pgn encoded using a random codepage to a different computer with different default encoding then you display random garbage.
also thanks to utf-8's clever encoding it's trivial to detect (even without BOM), unlike random codepages, especially if you have like 1 non-ascii character in the whole file.
happy detecting in that case. surely no one wants to go down the dreaded path of bundling encoding detectors like browsers do.
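
A small sketch of the "try utf-8 first" idea in Python: strict UTF-8 decoding fails on most byte sequences produced by single-byte codepages, so a failed decode is a usable signal. The function name and the Latin-1 fallback are assumptions for illustration; a codepage file can in rare cases happen to be valid UTF-8, so this is a heuristic, not a proof.

```python
import codecs


def decode_pgn_bytes(data: bytes, fallback: str = "latin-1") -> tuple[str, str]:
    """Return (text, encoding_used) for raw PGN bytes."""
    # Drop a UTF-8 byte order mark if one is present.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    try:
        # Strict decoding raises on byte sequences that are not valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback: treat the bytes as a single-byte codepage.
        return data.decode(fallback), fallback
```

Pure ASCII input decodes as UTF-8 unchanged, so files without any non-ascii character need no special handling.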

both unicode and utf-8 are awesome, unlike ancient codepages that have long outlived their purpose

since people violate the standard anyway by using non-ascii characters, I see no problem (and in fact I'd encourage) for modern PGN-processing software to simply use utf-8 and replace random encodings outside ascii with ? and be done with it.

Kotlov · Post by **Kotlov** » Sun Nov 14, 2021 11:14 am

Ras wrote: ↑Thu Nov 11, 2021 3:47 pm Obligatory xkcd:

joke is funny but the situation is scary

hgm · Post by **hgm** » Sun Nov 14, 2021 11:15 am

dangi12012 wrote: ↑Sun Nov 14, 2021 1:07 am True words - A complex inner state is needed to understand what the tokens mean - just to save a single char here and there. Would be much easier if it just says from-to squares and promotion.

The point is that this 'complex inner state' is the thing you are after in the first place. Checking whether the move strings have the proper syntax is always trivial.

hgm · Post by **hgm** » Sun Nov 14, 2021 11:45 am

mar wrote: ↑Sun Nov 14, 2021 10:17 am
hgm wrote: ↑Sat Nov 13, 2021 4:15 pm UTF-8 is not a native Windows encoding. In countries where it really matters (China, Japan, Taiwan) it isn't even very popular. WinBoard is written to work with Windows code pages. If you have a PGN file that violates the standard by using ShiftJIS-encoded Japanese player names and comments, my guess is that they would display correctly when you had the locale set correctly. That the PGN standard might require them to display as question marks I can happily ignore. Because no one requires me to display PGN. I can display game info in any format I want, and it seems more useful to display the non-ascii as is.
that's unfortunate - that's why we can't have nice things

WinAPI supports "wide character" versions of everything and this works with Unicode (UTF-16 actually), so nothing prevents you from using these and voila - you can suddenly support utf-8 on Windows. also MultiByteToWideChar does support utf-8. of course it's trivial to decode utf-8, so there's even no need for that. NTFS also uses utf-16 encoding, so you can even have filenames (say copied from another computer) with characters outside the "native" encoding that you won't be able to open without the wide API versions anyway (e.g. fopen will fail here as well, unfortunately).
on Linux there's no such problem, because filenames are just a sequence of bytes if I'm not mistaken - and since there's no case insensitive filesystem (=a disaster), everything just works. since the OS uses utf-8 natively, this sequence of bytes is actually utf-8 and everything just works out of the box.

The problem is that it would pretty much require a complete rewrite of the WinBoard front-end to make it use wide characters. If the back-end is to remain using UTF-8 and normal characters (as would be required for XBoard), you would have to do back and forth conversions at any point where these interact.

doesn't the P in PGN stand for portable? codepages are anything but portable.

It does, and this portability is achieved 'with the aid of a blunt axe': it simply forbids the use of any non-ascii character anywhere. Of course in an environment where such characters are almost guaranteed to cause portability problems, this is the only thing you can do. But now that we have UTF to cover every printable symbol in the world, this greater problem seems solved. So it would be logical to extend the PGN standard to use UTF rather than ascii as the underlying character set.

Whether this should be UTF-8 or UTF-16, and whether this should be announced through a BOM, is really outside the scope of a standard for game notation: it is an OS property. It is unfortunate that different encodings still exist, but as long as they do, one can expect there will be file-conversion tools for these formats.

since people violate the standard anyway by using non-ascii characters, I see no problem (and in fact I'd encourage) for modern PGN-processing software to simply use utf-8 and replace random encodings outside ascii with ? and be done with it.

I have not looked into this lately, but the problem used to be that Windows API supported UTF-16, and not UTF-8. So to properly display the non-ascii characters in dialogs, or allow their entry in text edits there, you would have to use the wchar versions of the API calls.
