Unfortunately, strings get used for a lot of things besides displaying characters. Being able to find the beginning and end of a multi-byte "character" is extremely important for some application domains (e.g. substring matching), so all of the Unicode encodings make sure that it's possible to do. I believe that many algorithms that use random access into strings can be adapted to work on a variable-size encoding like UTF-8, as long as they can scan backwards or forwards to find the boundaries between characters. It also helps with error recovery in streaming applications: if you had a long string of 2-byte symbols in your encoding and one of their high bits got flipped by an error, the entire string would decode wrong after that; but if it was in UTF-8 format, it could resynchronize within a character or two.

hgm wrote:
To get back to the coding problem:
On second thought, UTF-8 does not strike me as a terribly clever system. Languages like Chinese take a terrible hit in coding density. It seems UTF-8 puts an exaggerated importance on the ability to jump to a random point in a text and immediately recognize where the next character starts. To this end it sacrifices many bits as control bits, to distinguish leading bytes from extension bytes and from ASCII.
A system along the lines of GB2312 seems much more powerful. Here you would encode non-ASCII characters as two bytes, each starting with a 1 bit (where all ASCII codes start with a 0 bit). This would give you 14 coding bits, much better than the 11 you have in 2-byte UTF-8 characters. As this is apparently more than you need for Chinese, GB2312 even spares the codes 80-9F, which in some single-byte encoding systems can act as control codes and could possibly confuse software not aware of the higher coding level.
The only disadvantage I see of such a system is that in a very long sequence of double-byte encodings, you would not easily get back in sync with the character boundaries once you get out of sync (e.g. by jumping randomly into the text). But any single-byte code (such as linefeed...) would re-sync you, and the code can be read backwards without losing sync. If needed, a separate byte code could be set apart (like 0x80, or even 0xA0) to act as sync punctuation, and you could require that it occurs at least once every 100 bytes. This would cause an overhead of only 1% in pure Chinese text, and the overhead of hunting through 100 bytes for the sync after a random jump into the text to determine the character boundaries seems acceptable as well.
Even though 16K codes is not enough to encode all of Unicode, it usually does a very satisfactory job of encoding virtually all characters of a single language, even Chinese. Occasional out-of-range codes can be embedded by the trick of surrogate pairs. In practice, you tend to stick to the same language for quite a long time. So a system where you would set apart some of the 2-byte codes as control characters to switch to a different language (i.e. map a different part of the Unicode range to the 2-byte encodings, while the rest of the range would have to rely on surrogate pairs) seems much more practical.
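To make the bit budget in the quoted post concrete, here is a minimal sketch. Note this is my own hypothetical pair scheme, not actual GB2312 (which restricts both bytes to the 0xA1-0xFE range): a 2-byte UTF-8 sequence 110xxxxx 10xxxxxx carries 5 + 6 = 11 payload bits (2048 codes), while a pair of bytes that each merely set the high bit carries 7 + 7 = 14 bits (16384 codes).

```c
#include <assert.h>

/* Hypothetical two-byte scheme from the quoted post: any non-ASCII
 * character is a pair of bytes with the high bit set, giving
 * 7 + 7 = 14 payload bits.  (Real GB2312 is more restrictive;
 * this only illustrates the bit budget.) */
static unsigned decode_pair(unsigned char hi, unsigned char lo)
{
    return ((hi & 0x7F) << 7) | (lo & 0x7F);
}

/* 2-byte UTF-8 sequence 110xxxxx 10xxxxxx: 5 + 6 = 11 payload bits. */
static unsigned decode_utf8_2(unsigned char b0, unsigned char b1)
{
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F);
}
```

So the pair scheme tops out at 16383 (2^14 - 1), while 2-byte UTF-8 tops out at 0x7FF (2^11 - 1) — e.g. 0xC3 0xA9 decodes to U+00E9.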
[Edit: You made all the same points in your original post. Sorry for being redundant!]
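To illustrate the resynchronization point from the top of my post: in UTF-8, every continuation byte matches the bit pattern 10xxxxxx, and no lead byte or ASCII byte does. So from any byte offset you can find the start of the current character by skipping backwards over at most three continuation bytes. A minimal sketch:

```c
#include <assert.h>
#include <stddef.h>

/* Step back from an arbitrary offset to the first byte of the UTF-8
 * character containing it.  Continuation bytes look like 10xxxxxx
 * (i.e. (byte & 0xC0) == 0x80); lead bytes and ASCII bytes do not,
 * so at most 3 bytes need to be skipped. */
static size_t utf8_char_start(const unsigned char *s, size_t pos)
{
    while (pos > 0 && (s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

For example, in the string "a\xC3\xA9b" ("aéb"), offsets 1 and 2 both map back to offset 1, the lead byte of the 2-byte sequence for U+00E9.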
I'll also point out that if you used UTF-16 instead of UTF-8, the Chinese characters would all fit in 2 bytes. So would Japanese, so would Korean, so would virtually everything that anyone uses. The downside is that ASCII characters, and letters with diacritics, would also take 2 bytes.
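A sketch of that trade-off: in UTF-16, every code point up to U+FFFF (the Basic Multilingual Plane, which covers ASCII, Latin letters with diacritics, and the common CJK characters) takes one 2-byte code unit, while anything above U+FFFF takes a surrogate pair (4 bytes).

```c
#include <assert.h>

/* Number of 16-bit code units needed to encode a code point in
 * UTF-16: 1 inside the Basic Multilingual Plane (<= U+FFFF),
 * 2 (a surrogate pair) above it. */
static int utf16_units(unsigned codepoint)
{
    return codepoint <= 0xFFFF ? 1 : 2;
}
```

So 'A' (U+0041) and 中 (U+4E2D) each cost 2 bytes in UTF-16, where UTF-8 would spend 1 and 3 bytes respectively.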
Unicode was created by committees full of smart people. To handle input and layout for all of the world's languages, it has to support all kinds of bizarre stuff. It has a bunch of control characters for things like bi-directional text. It has combining characters and tone marks, shaping characters and nonspacing diacritics. It has extensive rules about how certain characters should or shouldn't be combined or decomposed into other characters (for characters with accents, text layout/pagination, etc.). For compatibility reasons, it has multiple code points that are really the same character. Most applications can ignore 95% of this stuff, unless they need to implement their own low-level input widgets or text rendering or typesetting.
Nowadays, any localized or international applications should be designed to use Unicode. Sure, Chinese-language Windows has some other encoding (based on code pages) that you could use instead. But you'll limit your program to one language and one market by doing that. If you start with Unicode, your app will be much easier to port to new languages and new regions later. [But this is general advice that doesn't necessarily apply to Winboard, since it's a pre-existing app.]