PGN for dummies

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28396
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PGN for dummies

Post by hgm »

Fulvio wrote: Sun Nov 14, 2021 5:21 pm This shows that you do not know the subject at all and explains all the nonsense statements.
Create a UTF-8 text file (without a BOM, which is used to identify endianness, and is neither needed nor recommended for UTF-8) and then open it with Windows Notepad and you will see that it displays correctly.

Generally speaking, when you are convinced that you are the only one who is right, and the rest of the world is wrong, it is worth taking some time to think about it better.
Well, I was talking from past experience, and testing it was not trivial, as I had no easy way to create a UTF-8 file in the first place. It appears you are right as far as NotePad is concerned. But not with WordPad, which is what I normally have to use on files imported from Linux, as NotePad would merge everything into a single line there.

When I copy-paste some text with non-ascii symbols (I used "het Kröller-Müller Museum") into NotePad and save, it creates this file:

Code: Select all

0000000 150 145 164 040 113 162 366 154 154 145 162 055 115 374 154 154
0000020 145 162 040 115 165 163 145 165 155 015 012
0000033
That is, it encodes the ö as 0366, which is the Latin-1 encoding, not UTF-8 (in which every non-ascii character takes at least two bytes, all >= 0200). When I read that back into NotePad, it shows me the text I started with.

When I upload this file to my website and access it with FireFox from Linux, it also displays correctly. When I copy-paste it from FireFox into gedit, it still displays correctly. When I save it from gedit, though, I get

Code: Select all

0000000 150 145 164 040 113 162 303 266 154 154 145 162 055 115 303 274
0000020 154 154 145 162 040 115 165 163 145 165 155 012 012
0000035
The encoding of ö is now 0303 0266. Apparently the text was converted to UTF-8; I suspect FireFox did this when I copied it to the clipboard. The end-of-line has been changed from CR+LF to LF-only. So far so good.
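
The conversion itself is mechanical, by the way; all Latin-1 characters fall in the two-byte range of UTF-8. A sketch (just an illustration, not what FireFox or gedit literally do):

Code: Select all

#include <string>

// Sketch: recode Latin-1 text as UTF-8. Bytes below 0200 pass through
// unchanged; a byte b >= 0200 becomes the two-byte sequence
// 0xC0|(b>>6), 0x80|(b&0x3F). That is exactly why the single 0366
// above turns into the pair 0303 0266.
std::string latin1ToUtf8(const std::string &in) {
    std::string out;
    for (unsigned char b : in) {
        if (b < 0x80) {
            out += (char)b;                    // plain ascii, unchanged
        } else {
            out += (char)(0xC0 | (b >> 6));    // lead byte
            out += (char)(0x80 | (b & 0x3F));  // continuation byte
        }
    }
    return out;
}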

Now I open this UTF-8 file with WordPad, and get to see this:

[Screenshot: WordPad displays the file as "het KrÃ¶ller-MÃ¼ller Museum"]

This doesn't look so good anymore... It shows exactly the effect I described: the UTF-8 bytes are interpreted as if they were Latin-1 characters.

Now the amazing thing is that when I save this file from WordPad, and load it into NotePad, I do get the ö and ü back. Which of course should count as an error: it was not what WordPad displayed, and NotePad could not possibly know I did not actually mean to write what WordPad displayed: "het KrÃ¶ller-MÃ¼ller Museum". Which is what I get on TalkChess when I copy-paste it from the WordPad display. Saving it and reading it back into NotePad should have given me the same thing.

So it seems these newer versions of NotePad are cheating to make it appear they do something that is logically impossible: judge if a file should be interpreted as UTF-8 or Latin-1. I have no idea what algorithm it uses for that. Perhaps when all non-ascii sequences are valid UTF-8 it guesses that the file is UTF-8 encoded. This can be a wrong guess, especially when there are very few non-ascii characters in the file.
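
If I had to guess, it would be something like this (a sketch of such a heuristic; I have no idea whether NotePad actually does this, and a real validator would also reject overlong sequences):

Code: Select all

#include <cstddef>

// Heuristic sketch: does every non-ascii byte form a well-shaped
// UTF-8 sequence? Pure ascii passes trivially; Latin-1 text with
// isolated accented letters will normally fail.
bool looksLikeUtf8(const unsigned char *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        int extra;
        if (b < 0x80) { i++; continue; }     // plain ascii
        else if (b < 0xC0) return false;     // stray continuation byte
        else if (b < 0xE0) extra = 1;        // 2-byte sequence
        else if (b < 0xF0) extra = 2;        // 3-byte sequence
        else if (b < 0xF8) extra = 3;        // 4-byte sequence
        else return false;                   // invalid lead byte
        if (i + extra >= len) return false;  // truncated sequence
        for (int k = 1; k <= extra; k++)     // continuations: 10xxxxxx
            if ((buf[i + k] & 0xC0) != 0x80) return false;
        i += 1 + extra;
    }
    return true;
}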
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: PGN for dummies

Post by Sopel »

sublime text 3, notepad++, total commander have no issues with utf-8

to be clear, there's nothing that prevents you from having utf-8 encoded text files on windows
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: PGN for dummies

Post by dangi12012 »

hgm wrote: Sun Nov 14, 2021 8:43 pm So it seems these newer versions of NotePad are cheating to make it appear they do something that is logically impossible: judge if a file should be interpreted as UTF-8 or Latin-1. I have no idea what algorithm it uses for that. Perhaps when all non-ascii sequences are valid UTF-8 it guesses that the file is UTF-8 encoded. This can be a wrong guess, especially when there are very few non-ascii characters in the file.
That is exactly right. The encoding is either read from a BOM, or it is inferred from the byte patterns in the first page of the file.
https://docs.microsoft.com/en-us/dotnet ... ew=net-5.0
Either way, the default behaviour is to infer the encoding on the first read call on the file.
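
The BOM part is easy to sketch (the byte values are the standard BOMs; the function itself is just an illustration, not the actual .NET code behind that link):

Code: Select all

#include <cstddef>

// Sketch: step 1 of the inference. Returns the encoding a BOM
// announces, or nullptr when there is no BOM and a content
// heuristic has to take over.
const char *bomEncoding(const unsigned char *buf, size_t len) {
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return "UTF-8";
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return "UTF-16 little-endian";
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return "UTF-16 big-endian";
    return nullptr;  // no BOM: fall back to guessing from content
}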
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
hgm
Posts: 28396
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PGN for dummies

Post by hgm »

Sopel wrote: Sun Nov 14, 2021 9:55 pm sublime text 3, notepad++, total commander have no issues with utf-8

to be clear, there's nothing that prevents you from having utf-8 encoded text files on windows
That should be considered a bug, right? When I create a Latin-1 encoded file, and notepad++ interprets it as UTF-8, it alters the text accordingly.
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: PGN for dummies

Post by Sopel »

hgm wrote: Mon Nov 15, 2021 2:04 pm
Sopel wrote: Sun Nov 14, 2021 9:55 pm sublime text 3, notepad++, total commander have no issues with utf-8

to be clear, there's nothing that prevents you from having utf-8 encoded text files on windows
That should be considered a bug, right? When I create a Latin-1 encoded file, and notepad++ interprets it as UTF-8, it alters the text accordingly.
The software has no idea what the encoding of the file is. Only the user knows. You can choose the encoding in any serious text editor.
hgm
Posts: 28396
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PGN for dummies

Post by hgm »

Indeed! I never noticed that. NotePad has a combo box ANSI / Unicode / Unicode big endian / UTF-8 in the Open File dialog. (I never use that dialog, as I always double-click the text file to start NotePad on it.)

Only... When ANSI is selected, and I load the file, NotePad still interprets it as UTF-8. That is a bug, for sure. I haven't tried NotePad++, which did not come with my Windows version. WordPad doesn't appear to have an option to select the encoding when opening a file.

Pretty bad situation anyway, that a noob user is forced to select the encoding. It would have been much better if the file contained that information, either in its content (e.g. a BOM) or through its name (e.g. by using the extension .txt8 instead of .txt).
Harald
Posts: 318
Joined: Thu Mar 09, 2006 1:07 am

Re: PGN for dummies

Post by Harald »

Technically there is no such thing as a text file. All files are binary files.
They may have different file extensions like .txt, .xml, .doc, ...
These extensions can tell the operating system what to do with a double click.
And they can help a program to find out what to do with the file when it is opened.

A text file is just a convenient human way of saying that you can more or less read
the content like real human language when you open the file in a program like a text editor.

Once upon a time, when computers were only used in the USA, the only language was English
reduced to ASCII letters, and even a command window could type out a file's content, the expression
"this *.txt file is a text file" made some sense. That was long ago. :-)

Is an *.xml file a text file? Is a *.cpp file a text file? Is a Word document *.doc a text file?
The latter is a binary monstrosity but it has the nicest text layout in it.

To read a binary file as a text file we have to know its encoding. Unfortunately this
is very often not part of the file itself. Therefore programs that open such a file need
either a very good heuristic, a self-describing encoding for their own formats,
or the help of the user. A byte order mark (BOM) at the beginning of a file
can support the heuristic somewhat, but it can only suggest very few encodings, like
UTF-16 (big/little endian) or UTF-8 (where it is not recommended). Also, a binary file full
of random bytes may happen to start with a BOM.

In a typical text viewer, like the one in TotalCommander, you can choose and change the encoding
after any file is loaded, to see the displayed text. This immediately changes the displayed output
without any change to the loaded file data. Even in *.exe files you can find many embedded
text phrases when you load such files in your viewer or text editor. The same goes for
editors like NotePad++, where your choice also changes how the editor encodes the text that
you insert from the keyboard, or how it converts the encoding internally and when it
saves the data.

Unfortunately, sometimes different text encodings are used together in one file.
I have seen source code files with the C++ part and the strings in UTF-8 and some comments
in Latin-1. That is bad, but cannot be fixed easily.

Also, the line endings often differ from operating system to operating system, or
from program to program; but that is just a convention whose default depends on the OS.
Typically Windows uses \r\n and Linux uses \n. That is not a big problem, though, and many
editors can handle this, or change it if you let them.
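
Converting between them is a simple one-pass job, for example (just a sketch, not what any particular editor does):

Code: Select all

#include <string>

// Sketch: normalize CR+LF (and stray CR) line endings to LF.
std::string normalizeToLf(const std::string &in) {
    std::string out;
    out.reserve(in.size());
    for (size_t i = 0; i < in.size(); i++) {
        if (in[i] == '\r') {
            out.push_back('\n');                       // CR or CR+LF -> LF
            if (i + 1 < in.size() && in[i + 1] == '\n')
                i++;                                   // swallow the LF of a CR+LF pair
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}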
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: PGN for dummies

Post by dangi12012 »

Harald wrote: Mon Nov 15, 2021 4:38 pm Also, the line endings often differ from operating system to operating system, or
from program to program; but that is just a convention whose default depends on the OS.
Typically Windows uses \r\n and Linux uses \n. That is not a big problem, though, and many
editors can handle this, or change it if you let them.
Also, do you know where the \r comes from?
Once upon a time in DOS you could print a document to the screen with TYPE FILE.TXT
If you had a printer, you could print a document to paper with TYPE FILE.TXT > LPT1

The first generation of printers were literally electrical typewriters, like IBM's, which had a newline and a carriage return as separate operations.
So a newline means: return the carriage to column 0 + go down a line. That's why you could pipe files directly to a printer device in DOS and it would just work, without any software between files and printers.

It's the same reason why in Windows 11, exactly 40 years later, you cannot name a file COM1 or LPT1 etc.
hgm
Posts: 28396
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PGN for dummies

Post by hgm »

Harald wrote: Mon Nov 15, 2021 4:38 pm Technically there is no such thing as a text file. All files are binary files.
Indeed. And memory bits are just collections of atoms. But what matters is how such collections of more elementary concepts should be interpreted. 'Text file' is a higher-level concept, and means that the bit patterns in the file should be interpreted as a sequence of characters, without any markup information. There can be different types of text files, each requiring a different mapping of bit patterns to characters. XML or JSON files are also text files, but have a still higher-level structure built on top of text (basically representing trees rather than a linear sequence). A .doc file is not a text file, even though at some higher level it contains text. Just like a .png file that happens to show an image of a book page is not a text file. A .rtf file is a text file, though.

It is not possible to interpret the bit patterns in a file without knowing what type of file it is. In particular it is not possible to recover the represented text if you do not know the encoding that defines the type of text file (Latin-1, UTF-8, ShiftJIS, ...).
Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: PGN for dummies

Post by Fulvio »

hgm wrote: Mon Nov 15, 2021 8:06 pm It is not possible to interpret the bit patterns in a file without knowing what type of file it is. In particular it is not possible to recover the represented text if you do not know the encoding that defines the type of text file (Latin-1, UTF-8, ShiftJIS, ...).
Good. Now, let's go back to the topic.
Have you ever used clang? Did you know that it requires your source code to be UTF-8 encoded?
Or what about the HTML <meta charset=""> ?
How do you read that information if you do not know the width of a char? (8-bit? 16-bit? or even 32-bit?)
The HTML standard also defines a default charset: for HTML4 it was ISO-8859-1 (Latin-1), and for HTML5 it is UTF-8. And if there is no BOM, it must be an ASCII-compatible charset (like UTF-8 or Latin-1). Only if you want to encode your HTML file in UTF-16 must you include a BOM, because otherwise there is no way to know that the chars are 16 bits wide. By the way, the name is very, very clear: Byte Order Mark. It does not detect bytes; it marks their order. UTF-16 was so bad that if you encoded the same text on computers with different endianness it produced different files.
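
That ASCII compatibility is exactly what makes reading the declaration possible before you know the charset; a sketch of the idea (the real HTML sniffing algorithm is more involved, e.g. it also accepts unquoted values):

Code: Select all

#include <string>

// Sketch: in a BOM-less (hence ASCII-compatible) HTML file the bytes
// of 'charset="..."' can be matched as plain ascii before the
// encoding is known.
std::string sniffMetaCharset(const std::string &bytes) {
    const std::string key = "charset=\"";
    size_t pos = bytes.find(key);
    if (pos == std::string::npos) return "";  // no declaration found
    size_t start = pos + key.size();
    size_t end = bytes.find('"', start);
    if (end == std::string::npos) return "";
    return bytes.substr(start, end - start);  // e.g. "utf-8"
}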

With PGN, a subset of Latin-1 was cleverly chosen:
"PGN data is represented using a subset of the eight bit ISO 8859/1 (Latin 1) character set."
"The 32 ISO 8859/1 code values from 128 to 159 are non-printing control characters. They are not used for PGN data representation."
"The 32 code values from 160 to 191 are mostly non-alphabetic printing characters and their use for PGN data is discouraged as their graphic representation varies considerably among other ISO Latin sets."
"Finally, the 64 code values from 192 to 255 are mostly alphabetic printing characters with various diacritical marks; their use is encouraged for those languages that require such characters."

Why was that clever?
Because UTF-8 has similar properties: the first byte of a character is always less than 128 or greater than 191.
And if it is greater than 191, it is a multi-byte char, and the following bytes are always between 128 and 191.

So, if you decide that bad PGNs with discouraged Latin-1 chars can be happily doomed, it is very easy to infer the encoding:
When you encounter a byte with value > 191,
if the next byte is between 128 and 191 --> UTF-8
else --> Latin-1
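
In code the whole inference is just a few lines, for example:

Code: Select all

#include <cstddef>

// Sketch of the rule above: find the first byte > 191; if a
// continuation byte (128..191) follows, guess UTF-8, else Latin-1.
const char *guessPgnEncoding(const unsigned char *buf, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 191) {
            if (i + 1 < len && buf[i + 1] >= 128 && buf[i + 1] <= 191)
                return "UTF-8";
            return "Latin-1";
        }
    }
    return "ASCII";  // no byte above 191 found
}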