Advice on .pgn format issues for my chess GUI

Dann Corbit
Posts: 12482
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Advice on .pgn format issues for my chess GUI

Post by Dann Corbit »

Bill Forster wrote: The issue here is not the implementation details, but the specification. I don't have any problem with implementation.
So my suggestion is to write text files in the native format on whatever platform you are on.

On Windows, PGN and EPD files should have cr/lf, for instance.
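
A minimal Python sketch of the choice being discussed, assuming the program keeps PGN text in memory with LF line breaks. The helper name `write_pgn` and its `eol` parameter are hypothetical; the only library behaviour relied on is the `newline` argument of Python's built-in `open`.

```python
import os
import tempfile

def write_pgn(path, text, eol="native"):
    """Write PGN text, given with LF line breaks, using the chosen line ending.

    eol is "native" (platform convention), "lf" or "crlf".
    write_pgn is a hypothetical helper, not part of any chess library.
    """
    # newline=None makes Python translate each LF to os.linesep on write;
    # "\n" writes LF untouched, and "\r\n" writes CRLF on every platform.
    newline = {"native": None, "lf": "\n", "crlf": "\r\n"}[eol]
    with open(path, "w", encoding="utf-8", newline=newline) as f:
        f.write(text)

if __name__ == "__main__":
    game = '[Event "Example"]\n[Result "*"]\n\n1. e4 e5 *\n'
    path = os.path.join(tempfile.mkdtemp(), "example.pgn")
    write_pgn(path, game, eol="native")
    with open(path, "rb") as f:
        print(f.read())  # shows which terminator actually reached the disk
```

With `eol="native"` the same call produces CR,LF on Windows and LF on Linux, which is the behaviour suggested above.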
Bill Forster
Posts: 76
Joined: Mon Sep 21, 2015 7:47 am
Location: New Zealand

Re: Advice on .pgn format issues for my chess GUI

Post by Bill Forster »

Everybody seems to agree with you. Can you cite any specific advantages, beyond those already mentioned?

Arguments for using CR,LF on Windows (so far)
1) It feels right to respect the native text file convention
2) All Windows tools will understand your file as a text file
3) SCID should fix their bug

Arguments for using LF on Windows (so far)
A) The .pgn spec recommends using LF only (ref [1] below), so maybe SCID's bug is not a bug.
B) Most Windows tools will understand your file as a text file
C) All Chess tools, in particular SCID, will work with your .pgn file

I am not married to any particular solution, but at the moment A), B) and C) seem more compelling [to me] than 1), 2) and 3).

[1] The PGN specification says "The archival representation of a newline is the ASCII control character LF (line feed, decimal value 10, hexadecimal value 0x0a)....Some systems may just not be able to handle an archival PGN text file with native text editors. In these cases, an indulgence of sorts is granted to use the local newline convention in non-archival PGN files for those text editors."
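
Whichever convention is chosen for writing, a reader can sidestep the question by accepting both. A small Python sketch, assuming the file has been read as raw bytes; `pgn_lines` is a hypothetical name, and it relies on `bytes.splitlines()`, which breaks at LF, CR,LF and bare CR alike.

```python
def pgn_lines(data):
    """Split raw PGN bytes into lines, accepting LF, CR,LF or bare CR.

    pgn_lines is a hypothetical helper used only for illustration.
    """
    # bytes.splitlines() treats "\n", "\r\n" and "\r" as line boundaries
    # and drops the terminators from the returned lines.
    return data.splitlines()

# The same game in archival (LF) form and in Windows (CR,LF) form.
archival = b'[Event "Example"]\n[Result "*"]\n\n1. e4 e5 *\n'
windows = archival.replace(b"\n", b"\r\n")

# Both forms yield identical lines once the terminators are stripped.
print(pgn_lines(archival) == pgn_lines(windows))
```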
hgm
Posts: 27701
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Advice on .pgn format issues for my chess GUI

Post by hgm »

It is actually quite shocking that the PGN spec contains something like this. It clearly exceeds the authority of such a spec: it meddles with the underlying level of storage technology. They might as well have required that PGN must only be stored on magnetic media (in particular 1.44MB floppy disks), and that burning them on a CD-ROM is a violation of the standard. How to encode a text file (e.g. ASCII, UTF-8, code pages, line endings) should be left to the platform, and conversion of one type of text encoding to the native one is expected to occur automatically when transferring data from one platform to another.

As the case at hand shows, such requirements are a severe hindrance to interchangeability of the format, although the person(s) adding this specification were no doubt under the misguided belief that it would facilitate portability from one platform to another.

I guess it is no exaggeration to say that compliance with this requirement renders PGN unsuitable as a standard format for encoding Chess games, as it fails to meet the requirement of portability...
mar
Posts: 2552
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Advice on .pgn format issues for my chess GUI

Post by mar »

hgm wrote: How to encode a text file (e.g. ASCII, UTF-8, code pages, line endings) should be left to the platform, and conversion of one type of text encoding to the native one is expected to occur automatically when transferring data from one platform to another.
Encoding is quite interesting. I would go further and make UTF-8 mandatory.

While converting line endings is trivial (if you know it's a text file you're transferring), guessing text encoding is complicated and unreliable.

If I send someone a pgn encoded in win-1250, containing some extended Latin characters used in Czech (player names like "Láznička" etc.),
it's very likely that some junk will show up instead of those letters (assuming he uses a different encoding).

Now imagine UTF-8 was the mandatory encoding (and various pgn encoders would honor the standard, which seems unlikely);
in this case no matter whom I send the pgn to, he will be able to decode special characters properly and see the names as expected
(assuming he has fonts installed that can display those characters).
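
The effect is easy to reproduce. A short Python sketch, assuming the receiver's machine defaults to the Western European code page cp1252 (that choice is an assumption for illustration; any encoding other than win-1250 would do):

```python
name = "Láznička"

# Sender saves the name in win-1250 (Python calls this codec "cp1250").
raw_1250 = name.encode("cp1250")

# Receiver decodes the same bytes with a different single-byte code page.
garbled = raw_1250.decode("cp1252")

# With UTF-8 on both ends the round trip is lossless.
restored = name.encode("utf-8").decode("utf-8")

print(garbled)   # the "č" comes out as a different letter
print(restored)  # identical to the original name
```

Note that the wrong decode raises no error: every byte is valid in both code pages, so nothing in the file itself reveals that the name was damaged.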
Dann Corbit
Posts: 12482
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Advice on .pgn format issues for my chess GUI

Post by Dann Corbit »

If you use normal eol on platform x, then you can edit the file with a text editor. You can import the file with other tools expecting PGN in text. Since it is trivial to make it use the normal encoding, it seems natural to me.

If you used Mac endings on Linux, you would get similar push-back.

Personally, it won't matter much to me, but as soon as a user is frustrated with your tool they will toss it. They won't be frustrated by standard eol, and they will be frustrated by non-standard eol.

Just my guesses, but I think they are pretty good.
hgm
Posts: 27701
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Advice on .pgn format issues for my chess GUI

Post by hgm »

mar wrote: Encoding is quite interesting. I would go further and make UTF-8 mandatory.

While converting line endings is trivial (if you know it's a text file you're transferring), guessing text encoding is complicated and unreliable.

If I send someone a pgn encoded in win-1250, containing some extended Latin characters used in Czech (player names like "Láznička" etc.),
it's very likely that some junk will show up instead of those letters (assuming he uses a different encoding).

Now imagine UTF-8 was the mandatory encoding (and various pgn encoders would honor the standard, which seems unlikely);
in this case no matter whom I send the pgn to, he will be able to decode special characters properly and see the names as expected
(assuming he has fonts installed that can display those characters).
Indeed, UTF-8 is very suitable as a standard. But unfortunately the world is not like that yet, and most text you can find will use encodings like BIG5, Shift-JIS or GB2312.

But the point is that it is the task of the platforms to iron out these incompatibilities in a general way. If I access web pages through my browser, the text should appear as intended, and not as gibberish, no matter how the web page was encoded. If I copy-paste that text from the browser window, it should get on the clipboard in the encoding used by my locale. When I transfer text files by ftp, it adapts the line endings to the receiving platform. This is how people want it to work, and this is how it must work if many things hundreds of times more important than anything having to do with Chess are to work.

In such an environment, things will only be portable when they respect the native encoding of the platform and locale they are on. Insisting on using a Linux-specific text encoding on non-Linux platforms will break all common practices for text-file transfer, and in fact makes PGN a binary format rather than a text format. It totally wrecks portability as a text format.

So it is OK for a format specification to require that it satisfy the abstract notion of a text file. But it is extremely counter-productive to attempt redefining the notion of a text file, to solve for Chess only a problem that exists in general, and for which you can be sure a general solution will (sooner, rather than later) be commonly provided.
mar
Posts: 2552
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Advice on .pgn format issues for my chess GUI

Post by mar »

hgm wrote: Indeed, UTF-8 is very suitable as a standard. But unfortunately the world is not like that yet, and most text you can find will use encodings like BIG5, Shift-JIS or GB2312.

But the point is that it is the task of the platforms to iron out these incompatibilities in a general way. If I access web pages through my browser, the text should appear as intended, and not as gibberish, no matter how the web page was encoded. If I copy-paste that text from the browser window, it should get on the clipboard in the encoding used by my locale. When I transfer text files by ftp, it adapts the line endings to the receiving platform. This is how people want it to work, and this is how it must work if many things hundreds of times more important than anything having to do with Chess are to work.

In such an environment, things will only be portable when they respect the native encoding of the platform and locale they are on. Insisting on using a Linux-specific text encoding on non-Linux platforms will break all common practices for text-file transfer, and in fact makes PGN a binary format rather than a text format. It totally wrecks portability as a text format.

So it is OK for a format specification to require that it satisfy the abstract notion of a text file. But it is extremely counter-productive to attempt redefining the notion of a text file, to solve for Chess only a problem that exists in general, and for which you can be sure a general solution will (sooner, rather than later) be commonly provided.
I assume BIG5 and others are encodings used for Chinese?

HTML provides a header tag to specify document encoding, so browsers have to support various encodings internally, which is complicated.

If I'm not mistaken, browsers fall back to automatic detection if no such tag is present, which may not work reliably.
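
For a file format with no encoding tag at all, such as PGN, a reader has to guess in much the same way. A rough Python sketch of one common heuristic: try strict UTF-8 first and fall back to a single legacy code page. This is not how browsers detect encodings; `decode_pgn` and its `fallback` default are assumptions made for illustration.

```python
def decode_pgn(data, fallback="cp1252"):
    """Decode raw PGN bytes, preferring UTF-8 over a legacy code page.

    decode_pgn is a hypothetical helper; the fallback code page is a guess
    that the caller must supply, since the bytes do not identify it.
    """
    try:
        # "utf-8-sig" is strict UTF-8 that also strips a leading BOM if present.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the legacy code page, replacing any byte
        # that code page leaves undefined instead of raising.
        return data.decode(fallback, errors="replace")

utf8_bytes = "Láznička".encode("utf-8")
czech_bytes = "Láznička".encode("cp1250")

print(decode_pgn(utf8_bytes))                      # decoded as UTF-8
print(decode_pgn(czech_bytes, fallback="cp1250"))  # right only because the guess is right
print(decode_pgn(czech_bytes))                     # wrong guess, wrong letters
```

The first step is fairly dependable because text in a legacy code page rarely happens to be valid UTF-8. The second step is where the unreliability sits: one single-byte code page cannot be told apart from another by inspecting the bytes.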

The problem with scripts containing lots of characters is that UTF-8 would require 3 bytes per character. However, since web pages contain a lot of control tags in Latin,
it might still be better than using 16-bit representation (depending on the amount of text)

I still think that non-standard local encodings make it difficult to interchange text files, since such files are only usable on machines which use the same encoding as the machine where the file was created
(or one has to use encoding converters on the target machine).

There's a reason for Unicode and Linux certainly went in the right direction (and I believe other modern Unixes/BSDs as well, including OSX which I consider to be BSD-family).

So the only mainstream OS that remains is Windows,
even though all its APIs internally support UCS-2 (or is it UTF-16?) and there are API functions that can convert to/from UTF-8.
Even NTFS filenames are physically stored as UCS-2
(I think Linux has no notion of filename encoding and simply stores them as byte sequences, but since high level APIs work with UTF-8, it should work as expected).

So I believe the only real problem is backward compatibility because of multibyte encodings, but it would be nice
to see Microsoft say ok, from now on we switch to UTF-8, forget about ancient non-portable encodings.
hgm
Posts: 27701
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Advice on .pgn format issues for my chess GUI

Post by hgm »

mar wrote: I assume BIG5 and others are encodings used for Chinese?
Sort of. BIG5 is Taiwanese (trad. Chinese), GB2312 is the standard in mainland China (simplified Chinese), Shift-JIS is for Japanese kanji.
mar wrote: HTML provides a header tag to specify document encoding, so browsers have to support various encodings internally, which is complicated.

If I'm not mistaken, browsers fall back to automatic detection if no such tag is present, which may not work reliably.
Yes, it is complicated and a pain, and the world would be a much better place if this problem did not exist. But the point is that people are working on a general solution, and that web material that will not render correctly on all platforms is slowly being selected away and replaced by pages that allow compatible methods (be it HTML tags or anything else). Standards for a very limited application should not try to interfere with that process; it will just make matters worse.

PGN is a very nice format for Chess games, given the restriction that it be text. As a binary encoding it utterly sucks, and it would be much better to just use a sequence of square numbers. So it seems a bad mistake to turn it into a binary format by putting conditions on the encoding.
mar wrote: The problem with scripts containing lots of characters is that UTF-8 would require 3 bytes per character. However, since web pages contain a lot of control tags in Latin,
it might still be better than using 16-bit representation (depending on the amount of text)
Actually the 3 encodings I mentioned are all upward compatible with ASCII (like UTF-8 is): codes 0-127 are single-byte codes. They use either pairs of codes in the range 128-255 to indicate kanji (GB2312), or a single such code followed by a general byte (Shift-JIS). Some of the codes > 127 are escapes, which indicate that the kanji is encoded by 4 bytes. This should not be confused with UTF-16, which is Unicode in 'wide-char' format, and needs 2 bytes even for ASCII Latin.

So the compensatory savings you mention will not occur; these encodings are really optimized for the languages they intend to encode. (Funny thing is that they usually contain a duplicate of the entire ASCII set, as well as Greek and Cyrillic glyphs as two-byte encodings, as if these were just some peculiar kanji.)
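
The byte counts can be checked with Python's standard codecs. A small sketch, where `encoded_sizes` is a hypothetical helper and the sample character "中" is just one ideograph that all three legacy encodings happen to contain:

```python
def encoded_sizes(text, encodings=("utf-8", "utf-16-le", "big5", "gb2312", "shift_jis")):
    """Return the number of bytes `text` occupies in each encoding.

    encoded_sizes is a hypothetical helper; the codec names are the ones
    Python's standard library uses for these encodings.
    """
    return {enc: len(text.encode(enc)) for enc in encodings}

# One ideograph: 3 bytes in UTF-8, 2 bytes in UTF-16 and in the legacy encodings.
print(encoded_sizes("中"))

# One ASCII letter: 1 byte everywhere except UTF-16, which needs 2.
print(encoded_sizes("a"))
```

"utf-16-le" is used rather than "utf-16" so that the count does not include the 2-byte byte-order mark Python would otherwise prepend.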
mar wrote: I still think that non-standard local encodings make it difficult to interchange text files, since such files are only usable on machines which use the same encoding as the machine where the file was created
(or one has to use encoding converters on the target machine).
True, but the problem is of course that this is what people do, 99.9% of the time. Chinese pages are only viewed by other Chinese. It is very difficult to make something a success, when you have to say: "Wow, look how much easier this 0.1% of what you typically do gets. And the other 99.9% of what you do gets only 50% more difficult!"
mar wrote: There's a reason for Unicode and Linux certainly went in the right direction (and I believe other modern Unixes/BSDs as well, including OSX which I consider to be BSD-family).

So the only mainstream OS that remains is Windows,
even though all its APIs internally support UCS-2 (or is it UTF-16?) and there are API functions that can convert to/from UTF-8.
Even NTFS filenames are physically stored as UCS-2
(I think Linux has no notion of filename encoding and simply stores them as byte sequences, but since high level APIs work with UTF-8, it should work as expected).
I think even Windows supports or even recommends UTF-8 now.
mar wrote: So I believe the only real problem is backward compatibility because of multibyte encodings, but it would be nice
to see Microsoft say ok, from now on we switch to UTF-8, forget about ancient non-portable encodings.
And lose a substantial fraction of their client base...

Backward compatibility is something that should always be taken very seriously. Suppose Microsoft decided that text encoding would be more efficient if it was just done in chunks of 7 bits, as capitals and punctuation marks are not that common after all, and can easily be assigned 14-bit codes. A lot more compact, but alas, new codes get assigned to all ASCII characters. So they make UTF-7 the standard, and in their release of Windows 13, you would not be able to properly display or edit any source code or other text that you have ever written, without running it through a converter first. (Including all the stuff you have zipped, or in git.) Try to imagine how you would feel about that. Then you get an idea how a Japanese user would feel towards the move you propose...
Bill Forster
Posts: 76
Joined: Mon Sep 21, 2015 7:47 am
Location: New Zealand

Re: Advice on .pgn format issues for my chess GUI

Post by Bill Forster »

Nobody who has commented here thinks it is a good idea to use the archive format for .pgn (with LF only, as per the .pgn spec) on Windows. This surprised me and made me pause. Additionally, an important reason for me to consider changing to LF only was the Scid vs PC behaviour I reported earlier. This morning I have found out that I was wrong about that behaviour (see below) and now I am sure you are all right and it is better to continue to use the Windows text convention on Windows.

A few days ago a user reported that Tarrasch .pgn files couldn't be opened on Scid vs PC and sent me an example .pgn. The .pgn looked fine to me. On a hunch I did a Dos2Unix conversion on the file and sent it back to him. He reported all was now well. I jumped to the obvious conclusion. This morning he sent me another .pgn, created with Scid vs PC, remarking that it was strange that Notepad had no problem with this .pgn. Sure enough, on examination it was a normal Windows CR,LF text file. Very strange.
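
Both steps of that diagnosis, examining which convention a file uses and converting CR,LF to LF, fit in a few lines of Python. This is a sketch working on raw bytes; `classify_eol` and `to_lf` are hypothetical helpers, not the actual Dos2Unix tool.

```python
def classify_eol(data):
    """Report which line-ending convention raw file bytes use.

    Returns "crlf", "lf", "cr", "mixed" or "none". Hypothetical helper.
    """
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF bytes not preceded by CR
    cr = data.count(b"\r") - crlf   # CR bytes not followed by LF
    kinds = [name for name, n in (("crlf", crlf), ("lf", lf), ("cr", cr)) if n]
    if not kinds:
        return "none"
    return kinds[0] if len(kinds) == 1 else "mixed"

def to_lf(data):
    """Convert CR,LF pairs to LF and leave every other byte untouched."""
    return data.replace(b"\r\n", b"\n")

windows_pgn = b'[Event "Example"]\r\n[Result "*"]\r\n\r\n1. e4 e5 *\r\n'
print(classify_eol(windows_pgn))         # crlf
print(classify_eol(to_lf(windows_pgn)))  # lf
```

Checking the bytes this way before and after a conversion shows what actually changed in the file, which helps separate a line-ending problem from some other difference between two program versions.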

At this stage I decided to download Scid vs PC myself. No doubt I should have done this earlier. Scid vs PC V4.14 has no problems with the original Tarrasch .pgn. My user has multiple versions of Scid and Scid vs PC. Maybe there is a version of Scid that runs on Windows that cannot read CR,LF .pgn files, but at this stage I cannot be sure of that.

Thanks for all the helpful advice in this thread.