PGN for dummies

hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PGN for dummies

Post by hgm »

So encoding PGN as UTF-8 is a violation of the standard, right?
Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: PGN for dummies

Post by Fulvio »

hgm wrote: Mon Nov 15, 2021 10:57 pm So encoding PGN as UTF-8 is a violation of the standard, right?
UTF-8 is recommended for all text files, including PGNs.
Sure, you could still encode them using the old Latin-1, even though there is no valid reason to.
But in both cases it is required, and always has been, that a PGN is interpreted correctly regardless of the operating system on which it was created.
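
One common way for software to cope with both encodings is to try UTF-8 first and fall back to Latin-1. The Python sketch below only illustrates that idea; the function name `decode_pgn_bytes` is made up for this example, and nothing here is taken from SCID or any other program mentioned in this thread.

```python
from typing import Tuple


def decode_pgn_bytes(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for the raw bytes of a PGN file.

    Hypothetical helper: tries UTF-8 first, then falls back to Latin-1.
    """
    try:
        # "utf-8-sig" also strips a leading byte order mark if one is present.
        return raw.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Every possible byte value is defined in Python's latin-1 codec,
        # so this fallback cannot raise.
        return raw.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    tag = '[White "Szabó"]'
    for encoding in ("utf-8", "latin-1"):
        text, used = decode_pgn_bytes(tag.encode(encoding))
        print(encoding, "->", used, text == tag)
```

Pure ASCII files decode identically either way, so the fallback only matters for names and comments with accented characters. The heuristic is not watertight: a Latin-1 file whose high bytes happen to form valid UTF-8 sequences would be misread, although that is rare in real text.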
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PGN for dummies

Post by hgm »

Now you lost me completely. In your second-to-last posting you quote (from the PGN standard, right?) that PGN should use Latin-1 encoding. And now you say that UTF-8 is recommended for all text files. Recommended by whom? Apparently not by the PGN standard, which defines it as a violation. Or are you advocating that PGN is not text? Even then, a binary format for chess games based on UTF-8 doesn't qualify as PGN.
Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: PGN for dummies

Post by Fulvio »

hgm wrote: Tue Nov 16, 2021 10:17 am Now you lost me completely. In your second-to-last posting you quote (from the PGN standard, right?) that PGN should use Latin-1 encoding. And now you say that UTF-8 is recommended for all text files. Recommended by whom?
Time passes and things evolve.
I have already mentioned:
- locales, which are now almost all UTF-8
- HTML5, which switched to UTF-8
Also coming to mind:
- Qt 5, which switched to UTF-8
- Python 3, which switched to UTF-8
... and I could make an endless list by searching the internet.

Maybe you have a better solution to portably exchange text files between users of different nationalities and with different computers.
Publish a paper.
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PGN for dummies

Post by hgm »

Umm, that reminds me of something I read earlier in this thread:
Your argument is basically "I don't like the standard so I'll do whatever I think is right and I don't care what others think"
So you no longer recognize the PGN standard, because you think in the contemporary environment it is counter-productive to do so, and made your own 'standard' by amending it.

Actually that is not so much different from my take on this: I do not recognize an encoding restriction as part of a specification of a text format in the first place, because it predictably will lead to problems like the ones we are facing now. Standards for text encoding vary between operating systems and over time, and sticking to obsolete encodings that are no longer standard is very counter-productive.

The PGN specs must tell how a chess game must be represented as a string of characters. Anything else they try to enforce should be considered void, in particular rules for how characters should be represented as bit patterns, or what software would have to do on syntax violations.
Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: PGN for dummies

Post by Fulvio »

Cool, I'd say it's time to end this.
I, like most of the world, prefer to be able to view, edit and share my text files between different computers without any problems.
But everyone has their own tastes.
I guess there are those who prefer a bit of chaos and surprise. Who knows if opening the file on my laptop will show the same way? And if I send it to someone else: will he see it correctly?
Everyone has what he wants and we are all happy.
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PGN for dummies

Post by hgm »

Nice speech. But whether it is very truthful...

When I save a PGN file on Windows, I would like to be able to view it uncorrupted in WordPad and NotePad. It seems to me that the approach you advocate, and presumably have implemented in SCID, would not satisfy that requirement.

So when you talk about people who like chaos and surprise, you seem to include yourself in that group.
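
To make "corrupted" concrete, here is a small Python sketch of what happens when the writer and the viewer disagree about the encoding. It assumes a viewer that falls back to Windows-1252 (the Western European "ANSI" code page) when it has no other hint; whether a particular version of NotePad or WordPad actually does that is not something this sketch claims. The helper name `view_as` and the sample tag are invented for the example.

```python
def view_as(text: str, written_as: str, read_as: str) -> str:
    """Encode text the way the writer would, then decode it the way the viewer would."""
    # errors="replace" mimics a viewer that shows a placeholder for undecodable bytes.
    return text.encode(written_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    tag = '[White "Szabó"]'
    # Written as UTF-8, viewed as Windows-1252: the accented letter turns into two characters.
    print(view_as(tag, "utf-8", "cp1252"))
    # Written as Latin-1, viewed as UTF-8: the accented letter becomes a replacement mark.
    print(view_as(tag, "latin-1", "utf-8"))
    # Writer and viewer agree: the text survives unchanged.
    print(view_as(tag, "utf-8", "utf-8"))
```

Both mismatches damage only the non-ASCII characters; the movetext itself, being pure ASCII, reads the same under all three encodings.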
Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: PGN for dummies

Post by Fulvio »

hgm wrote: Tue Nov 16, 2021 2:40 pm When I save a PGN file on Windows
Again :shock:
The OS doesn't matter.
When you save a PGN file using recent chess software (SCID, ChessBase, lichess, chess.com, etc.)
hgm wrote: Tue Nov 16, 2021 2:40 pm I would like to be able to view it uncorrupted in WordPad and NotePad.
you are able to view it uncorrupted in any recent text editor (Notepad, gedit, Visual Studio, Sublime Text, etc.).

You can also edit it and then load it back in your preferred recent chess software, on any OS you like.
R. Tomasi
Posts: 307
Joined: Wed Sep 01, 2021 4:08 pm
Location: Germany
Full name: Roland Tomasi

Re: PGN for dummies

Post by R. Tomasi »

I'm sorry to drop in late on this thread, but it kind of coincides with what I am currently doing with Pygmalion: I am currently implementing parsing and generation of PGN files. I use http://www.saremba.de/chessgml/standard ... e.htm#c3.1 as reference for the PGN standard (if someone knows a more complete description of the standard I'd be thrilled to hear about that).

From my understanding of the documentation that I linked to, the PGN standard is actually quite explicit on which encoding should be used. Especially this section
4.1: Character codes
PGN data is represented using a subset of the eight bit ISO 8859/1 (Latin 1) character set. ("ISO" is an acronym for the International Standards Organization.) This set is also known as ECMA-94 and is similar to other ISO Latin character sets. ISO 8859/1 includes the standard seven bit ASCII character set for the 32 control character code values from zero to 31. The 95 printing character code values from 32 to 126 are also equivalent to seven bit ASCII usage. (Code value 127, the ASCII DEL control character, is a graphic character in ISO 8859/1; it is not used for PGN data representation.)

The 32 ISO 8859/1 code values from 128 to 159 are non-printing control characters. They are not used for PGN data representation. The 32 code values from 160 to 191 are mostly non-alphabetic printing characters and their use for PGN data is discouraged as their graphic representation varies considerably among other ISO Latin sets. Finally, the 64 code values from 192 to 255 are mostly alphabetic printing characters with various diacritical marks; their use is encouraged for those languages that require such characters. The graphic representations of this last set of 64 characters is fairly constant for the ISO Latin family.

Printing character codes outside of the seven bit ASCII range may only appear in string data and in commentary. They are not permitted for use in symbol construction.

Because some PGN users' environments may not support presentation of non-ASCII characters, PGN game authors should refrain from using such characters in critical commentary or string values in game data that may be referenced in such environments. PGN software authors should have their programs handle such environments by displaying a question mark ("?") for non-ASCII character codes. This is an important point because there are many computing systems that can display eight bit character data, but the display graphics may differ among machines and operating systems from different manufacturers.

Only four of the ASCII control characters are permitted in PGN import format; these are the horizontal and vertical tabs along with the linefeed and carriage return codes.

The external representation of the newline character may differ among platforms; this is an acceptable variation as long as the details of the implementation are hidden from software implementors and users. When a choice is practical, the Unix "newline is linefeed" convention is preferred.
And this section
3.2.2: Archival storage and the newline character
Export format should also be used for archival storage. Here, "archival" storage is defined as storage that may be accessed by a variety of computing systems. The only extra requirement for archival storage is that the newline character have a specific representation that is independent of its value for a particular computing system's text file usage. The archival representation of a newline is the ASCII control character LF (line feed, decimal value 10, hexadecimal value 0x0a).

Sadly, there are some accidents of history that survive to this day that have baroque representations for a newline: multicharacter sequences, end-of-line record markers, start-of-line byte counts, fixed length records, and so forth. It is well beyond the scope of the PGN project to reconcile all of these to the unified world of ANSI C and the those enjoying the bliss of a single '\n' convention. Some systems may just not be able to handle an archival PGN text file with native text editors. In these cases, an indulgence of sorts is granted to use the local newline convention in non-archival PGN files for those text editors.
seem to be relevant for the discussion of this thread. My understanding is that the correct encoding for PGN is ISO 8859/1, but especially for the "export" format, generators are discouraged from using certain characters from that codepage. Also the newline problem seems to be addressed: for "export"/"archival" format you should only use LF.
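
As a rough illustration of the byte ranges in the quoted section 4.1 and the LF rule in section 3.2.2, here is a minimal Python sketch. The function name `check_pgn_bytes` and the wording of the messages are invented for this example, and the check is deliberately incomplete: it does not verify that non-ASCII codes appear only in string data and commentary, since that would require actually parsing the PGN.

```python
from typing import List

# Section 4.1 permits only these four ASCII control characters in import format.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0B, 0x0D}  # HT, LF, VT, CR


def check_pgn_bytes(raw: bytes, archival: bool = False) -> List[str]:
    """Return a list of complaints about raw bytes, read as ISO 8859/1 code values."""
    problems = []
    for offset, code in enumerate(raw):
        if archival and code == 0x0D:
            # Section 3.2.2: the archival newline is a single LF.
            problems.append(f"offset {offset}: CR found, archival newline must be LF")
        elif code < 32 and code not in ALLOWED_CONTROLS:
            problems.append(f"offset {offset}: control code {code} not permitted")
        elif code == 127 or 128 <= code <= 159:
            problems.append(f"offset {offset}: code {code} not used for PGN data")
        elif 160 <= code <= 191:
            problems.append(f"offset {offset}: code {code} is discouraged")
        # Codes 32..126 (ASCII) and 192..255 (accented letters) pass.
    return problems


if __name__ == "__main__":
    sample = '[White "Szabó"]\r\n\r\n1. e4 e5 *\r\n'.encode("latin-1")
    for line in check_pgn_bytes(sample, archival=True):
        print(line)
```

With `archival=False` the same sample passes cleanly, which matches the reading above that the local newline convention is tolerated outside archival storage.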

Personally I think that much of the confusion related to PGN comes from the fact that there is a more relaxed version of the standard ("import") which is intended for humans.
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PGN for dummies

Post by hgm »

True, but I consider such requirements w.r.t. encoding 'void', and not part of the specifications of PGN syntax. They are specifications of the underlying text format. Which contradict modern standards for text files, and thus are extremely counter-productive.

To have smoothly working software it is important to conform to the standards of the computer system it is embedded in. E.g. it is annoying to use just LF for a newline in WinBoard, because then the file won't display correctly in NotePad when you want to edit it. Fulvio ignores the encoding standard, and always uses UTF-8, because he (unjustly, as we have seen) thinks that this works perfectly even on Windows.
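
The two newline conventions discussed here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pgn_to_bytes` and its parameters are hypothetical, it is not how WinBoard or SCID write files, and the encoding is left as a parameter precisely because that is the point of contention in this thread.

```python
import os


def pgn_to_bytes(text: str, archival: bool, encoding: str = "latin-1",
                 local_newline: str = os.linesep) -> bytes:
    """Serialize PGN text with LF newlines (archival) or the local convention."""
    # First normalize whatever newline style the text arrived with to plain LF.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    if not archival:
        # Non-archival files may use the platform's own newline, e.g. CR LF on Windows.
        normalized = normalized.replace("\n", local_newline)
    return normalized.encode(encoding)


if __name__ == "__main__":
    game = '[Event "Test"]\n\n1. e4 e5 *\n'
    print(pgn_to_bytes(game, archival=True))
    print(pgn_to_bytes(game, archival=False, local_newline="\r\n"))
```

Passing `local_newline` explicitly makes the output independent of the machine the code runs on; leaving it at its default picks up whatever the current operating system uses.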