mar wrote: I assume BIG5 and others are encodings used for Chinese?
Sort of. BIG5 is Taiwanese (traditional Chinese), GB2312 is the standard in mainland China (simplified Chinese), and Shift-JIS is for Japanese kanji.
HTML provides a tag in the document head to specify its encoding, so browsers have to support various encodings internally, which is complicated.
If I'm not mistaken, browsers fall back to automatic detection if no such tag is present, which may not work reliably.
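For illustration, here is a minimal sketch of what such a declaration looks like, written as a tiny CGI-style C program that announces its encoding both in the HTTP Content-Type header and in the document head (the page content is just a placeholder):

    #include <stdio.h>

    /* Sketch: a page that declares its encoding explicitly, both in the
       HTTP Content-Type header and in a <meta charset> tag in the head,
       so the browser never has to fall back to guessing. */
    int main(void)
    {
        printf("Content-Type: text/html; charset=UTF-8\r\n\r\n");
        printf("<!DOCTYPE html>\n<html><head>\n");
        printf("<meta charset=\"UTF-8\">\n");      /* in-document declaration */
        printf("<title>encoding demo</title>\n");
        printf("</head><body>Hello</body></html>\n");
        return 0;
    }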
Yes, it is complicated and a pain, and the world would be a much better place if this problem did not exist. But the point is that people are working on a general solution, and that web material that will not render correctly on all platforms is slowly being selected away and replaced by pages that use compatible methods (be it HTML tags or anything else). Standards for a very limited application should not try to interfere with that process; it will just make matters worse.
PGN is a very nice format for Chess games, given the restriction that it be text. As a binary encoding it utterly sucks, and it would be much better to just use a sequence of square numbers. So it seems a bad mistake to turn it into a binary format by putting conditions on the encoding.
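Just to make concrete what "a sequence of square numbers" could look like, here is a minimal sketch; the two-bytes-per-move layout, the square numbering and the file name are my own assumptions for illustration, not an existing format:

    #include <stdio.h>

    /* Sketch of a purpose-built binary game record: each move is stored
       as two bytes, the from-square and the to-square, both numbered
       0-63 (a1 = 0, b1 = 1, ..., h8 = 63). Promotions, results and tags
       are ignored here; the point is only how compact a binary format
       can be compared to re-encoding PGN text. */
    static void put_move(FILE *f, int from, int to)
    {
        fputc(from & 63, f);   /* 0..63 fits comfortably in one byte */
        fputc(to   & 63, f);
    }

    int main(void)
    {
        FILE *f = fopen("game.bin", "wb");
        if (!f) return 1;
        put_move(f, 12, 28);   /* e2-e4 */
        put_move(f, 52, 36);   /* e7-e5 */
        fclose(f);
        return 0;
    }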
The problem with scripts containing lots of characters is that UTF-8 would require 3 bytes per character. However, since web pages contain a lot of markup in Latin script, it might still be better than using a 16-bit representation (depending on the amount of text).
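The 3-byte figure follows directly from the standard UTF-8 ranges; a small sketch:

    /* Number of bytes UTF-8 needs for a given Unicode code point.
       ASCII (all the markup) stays at 1 byte; the CJK ideograph
       blocks fall in the 3-byte range. */
    int utf8_length(unsigned long cp)
    {
        if (cp < 0x80)    return 1;   /* U+0000..U+007F : ASCII                      */
        if (cp < 0x800)   return 2;   /* U+0080..U+07FF : Latin suppl., Greek, Cyrillic */
        if (cp < 0x10000) return 3;   /* U+0800..U+FFFF : includes CJK ideographs     */
        return 4;                     /* supplementary planes                         */
    }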
Actually the 3 encodings I mentioned are all upward compatible with ASCII (like UTF-8 is): codes 0-127 are single-byte codes. They use either pairs of codes in the range 128-255 to indicate a kanji (GB2312), or a single such code followed by a general byte (Shift-JIS). Some of the codes > 127 are escapes, which indicate that the kanji is encoded by 4 bytes. This should not be confused with UTF-16, which is Unicode in 'wide-char' format, and needs 2 bytes even for ASCII Latin.
So the compensatory savings you mention will not occur; these encodings are really optimized for the languages they intend to encode. (The funny thing is that they usually contain a duplicate of the entire ASCII set, as well as Greek and Cyrillic glyphs, as two-byte encodings, as if these were just some peculiar kanji.)
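To illustrate the "ASCII plus pairs of high bytes" structure: a rough sketch of stepping through a GB2312 stream in its usual EUC-CN byte layout (the byte ranges below are those of that layout; Shift-JIS works along the same lines with different lead-byte ranges):

    #include <stddef.h>

    /* Step through a GB2312 (EUC-CN) byte stream and count characters.
       Bytes 0x00-0x7F are plain ASCII; a byte in 0xA1-0xF7 acts as a
       lead byte and pairs with the following byte (0xA1-0xFE) to form
       one hanzi. Handling of malformed input is deliberately crude. */
    size_t count_chars_gb2312(const unsigned char *s, size_t len)
    {
        size_t i = 0, chars = 0;
        while (i < len) {
            if (s[i] < 0x80)
                i += 1;                          /* single-byte ASCII            */
            else if (s[i] >= 0xA1 && s[i] <= 0xF7 && i + 1 < len)
                i += 2;                          /* lead + trail byte = one hanzi */
            else
                i += 1;                          /* stray byte: just skip it      */
            chars++;
        }
        return chars;
    }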
I still think that non-standard local encodings make it difficult to interchange text files; they are therefore only usable on machines that use the same encoding as the one on which the file was created (or one has to run an encoding converter on the target machine).
True, but the problem is of course that this is what people do, 99.9% of the time. Chinese pages are only viewed by other Chinese. It is very difficult to make something a success, when you have to say: "Wow, look how much easier this 0.1% of what you typically do gets. And the other 99.9% of what you do gets only 50% more difficult!"
There's a reason for Unicode, and Linux certainly went in the right direction (and I believe other modern Unixes/BSDs did as well, including OS X, which I consider to be in the BSD family).
So the only mainstream OS that remains is Windows, even though its APIs internally use UTF-16 (originally UCS-2) throughout, and there are API functions that can convert to/from UTF-8. Even NTFS filenames are physically stored as 16-bit code units. (I think Linux has no notion of filename encoding and simply stores filenames as byte sequences, but since the high-level APIs work with UTF-8, it should work as expected.)
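Those conversion functions are MultiByteToWideChar / WideCharToMultiByte; a minimal sketch of the usual round trip from UTF-8 to the wide-character ("W") APIs (the helper name is mine, and error handling is kept to a minimum):

    #include <windows.h>
    #include <stdlib.h>

    /* Convert a UTF-8 string to UTF-16 and hand it to a wide-char API.
       The first call asks for the required buffer size, the second converts. */
    int show_utf8_message(const char *utf8)
    {
        int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        if (n <= 0) return -1;

        wchar_t *wide = malloc(n * sizeof *wide);
        if (!wide) return -1;

        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, n);
        MessageBoxW(NULL, wide, L"UTF-8 in, UTF-16 out", MB_OK);

        free(wide);
        return 0;
    }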
I think even Windows supports, and perhaps even recommends, UTF-8 now.
So I believe the only real problem is backward compatibility because of the multibyte encodings, but it would be nice to see Microsoft say: OK, from now on we switch to UTF-8; forget about ancient non-portable encodings.
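For what it is worth, a program can already opt in on its own, at least for console output; a minimal sketch (assuming a reasonably recent Windows console host):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Tell the console to interpret our output as UTF-8 (code page 65001)
           instead of the legacy ANSI/OEM code page. */
        SetConsoleOutputCP(CP_UTF8);
        printf("UTF-8 output: \xC3\xA9 \xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E\n");
        return 0;
    }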
And lose a substantial fraction of their client base...
Backward compatibility is something that should always be taken very seriously. Suppose Micro-Soft decided that text encoding would be more efficient if it were just done in chunks of 7 bits, since capitals and punctuation marks are not that common after all, and can easily be assigned 14-bit codes. A lot more compact, but alas, new codes get assigned to all ASCII characters. So they make this UTF-7 the standard, and in their release of Windows 13 you would not be able to properly display or edit any source code or other text that you have ever written, without running it through a converter first. (Including all the stuff you have zipped, or have in git.) Try to imagine how you would feel about that. Then you get an idea of how a Japanese user would feel about the move you propose...