WinBoard translations

Discussion of chess software programming and technical issues.

Moderator: Ras

wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: WinBoard translations

Post by wgarvin »

hgm wrote:To get back to the coding problem:

On second thought UTF-8 does not strike me as a terribly clever system. Languages like Chinese take a terrible hit in coding density. UTF-8 seems to place exaggerated importance on being able to jump into a text at a random point and immediately recognize where the next character starts. To this end it sacrifices many bits as control bits, to distinguish leading bytes from extension bytes and from ASCII.

A system along the lines of GB2312 seems much more powerful. Here you would encode non-ASCII characters as two bytes, each starting with a 1 bit (where all ASCII codes start with a 0 bit). This would give you 14 coding bits, much better than the 11 you have in 2-byte UTF-8 characters. As this is apparently more than you need for Chinese, GB2312 even spares the codes 0x80-0x9F, which in some single-byte encoding systems can act as control codes, and could otherwise confuse software not aware of the higher coding level.

The only disadvantage I see of such a system is that in a very long sequence of double-byte encodings, you would not easily get back in sync with the character boundaries once you get out of sync (e.g. by jumping randomly into the text). But any single-byte code (such as a linefeed) would re-sync you, and the code can be read backwards without losing sync. If needed, a separate byte code could be set apart (like 0x80, or even 0xA0) to act as sync punctuation, with the requirement that it occur at least once every 100 bytes. This would cause an overhead of only 1% in pure Chinese text, and the overhead of hunting through at most 100 bytes for the sync mark after a random jump into the text to determine the character boundaries seems acceptable as well.

Even though 16K codes is not enough to encode all Unicode code points, it usually does a very satisfactory job of encoding virtually all characters of a single language, even Chinese. Occasional embedded out-of-range codes can be encoded by the trick of surrogate pairs. In practice, you tend to stick to the same language for quite a long time. So a system where you would set apart some of the 2-byte codes as control characters to switch to a different language (i.e. map a different part of the Unicode range to the 2-byte encodings, while the rest of the range would have to rely on surrogate pairs) seems much more practical.
Unfortunately, strings get used for a lot of things besides displaying characters. Being able to find the beginning and end of a multi-byte "character" is extremely important in some application domains (e.g. something like substring matching), so all of the Unicode encodings make sure that it's possible to do. I believe that many algorithms that use random access into strings can be adapted to work on a variable-size encoding like UTF-8, as long as they can scan backwards or forwards to find the boundaries between characters. It also helps with error recovery in streaming applications (e.g. if you had a long string of 2-byte symbols in your encoding, and one of the high bits got flipped by an error, the entire string would decode wrong after that; but if it was in UTF-8 format it could resynchronize within a character or two).
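The self-synchronization property is easy to exploit: every UTF-8 continuation byte matches the bit pattern 10xxxxxx, so from any byte offset you can scan back at most three bytes to find the start of the character. A minimal sketch in Python (the function name is mine):

```python
def char_start(buf: bytes, i: int) -> int:
    """Scan backwards from byte offset i to the start of the UTF-8
    character that contains it. Continuation bytes have the form
    0b10xxxxxx, i.e. (byte & 0xC0) == 0x80; lead bytes do not."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


buf = "a中b".encode("utf-8")   # b'a\xe4\xb8\xadb': '中' occupies bytes 1..3
print(char_start(buf, 2))      # lands mid-character, backs up to offset 1
print(char_start(buf, 4))      # already on a lead byte, stays at offset 4
```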

[Edit: You made all the same points in your original post. Sorry for being redundant!]

I'll also point out that if you used UTF-16 instead of UTF-8, the Chinese characters would all fit in 2 bytes. So would Japanese, so would Korean, so would virtually everything that anyone uses. The downside is that ASCII characters, and letters with diacritics, would also take 2 bytes.
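For a rough feel for that trade-off, Python's codecs make it easy to compare the per-character cost of the two encodings (UTF-16 shown in its little-endian form, without a BOM):

```python
# Byte cost per character: ASCII, an accented Latin letter, and a CJK character.
for ch in "Aé中":
    utf8 = len(ch.encode("utf-8"))
    utf16 = len(ch.encode("utf-16-le"))
    print(f"U+{ord(ch):04X} {ch!r}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")

# Output:
# U+0041 'A': UTF-8 = 1 bytes, UTF-16 = 2 bytes
# U+00E9 'é': UTF-8 = 2 bytes, UTF-16 = 2 bytes
# U+4E2D '中': UTF-8 = 3 bytes, UTF-16 = 2 bytes
```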

Unicode was created by committees full of smart people. To handle input and layout for all of the world's languages, it has to support all kinds of bizarre stuff. It has a bunch of control characters for things like bi-directional text. It has combining characters and tone marks, shaping characters and nonspacing diacritics. It has extensive rules about how certain characters should or shouldn't be combined or decomposed into other characters (for characters with accents, text layout/pagination, etc.) Due to compatibility, it has multiple code points that are really for the same character. Most applications can ignore 95% of this stuff, unless they need to implement their own low-level input widgets or text rendering or typesetting.

Nowadays, any localized or international application should be designed to use Unicode. Sure, Chinese-language Windows has some other encoding (based on code pages) that you could use instead. But by doing that, you limit your program to one language and one market. If you start with Unicode, your app will be much easier to port to new languages and new regions later. [But this is general advice that doesn't necessarily apply to WinBoard, since it's a pre-existing app.]
hgm
Posts: 28405
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: WinBoard translations

Post by hgm »

Unicode is OK, but as I understand it, the Unicode specs say nothing about the way it should be encoded as bits and bytes. UTF-8 and UTF-16 are two ways to do it, but they are by no means the only ways. I can understand why the Chinese would like neither UTF-8 (50% size overhead) nor UTF-16 (not backward compatible with ASCII).

I could encode unicode in the following way:
0-0x7F: single-byte ASCII codes
0x80-0xFF: occurring in pairs, together containing 14 data bits, which would hold the lowest 14 Unicode bits. The 6 most-significant Unicode bits would not normally be encoded on a per-character basis, but kept as a prefix that would usually be the same for all characters of a particular language.

The prefix would be set by two-byte codes where the first byte was 0x80, and the second byte 0x20-0x5F (a printable ASCII code), which is 64 possibilities.

The combination (0x80,0x60) could be defined as a sync mark, and periodically injected into long stretches of 2-byte codes. This means it should be forbidden that this combination occurs accidentally in a switch to ASCII, which can be prevented by injecting an extra sync mark in the rare cases where it would occur.

This seems a very satisfactory way to cover the entire Unicode range. You could refine the system by adding one-time escapes, which affect the prefix of the following character only, but do not affect the permanent prefix, e.g. starting with 0x81 instead of 0x80.
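For concreteness, here is a hedged sketch of a decoder for this hypothetical scheme. The byte values are taken from the description above; the function name is mine, the one-time 0x81 escape is omitted, and the input is assumed well-formed (no truncated pairs):

```python
def decode(data: bytes) -> str:
    """Hypothetical decoder for the prefix-based scheme sketched above."""
    out = []
    prefix = 0          # 6 most-significant Unicode bits; 0 covers ASCII/Latin
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # single-byte ASCII passes through
            out.append(chr(b))
            i += 1
        elif b == 0x80:                   # control pair: prefix switch or sync
            c = data[i + 1]
            if 0x20 <= c <= 0x5F:         # set a new 6-bit prefix (64 choices)
                prefix = c - 0x20
            elif c == 0x60:               # sync mark: consumed, no output
                pass
            i += 2
        else:                             # data pair: 7 + 7 = 14 low bits
            lo = ((b & 0x7F) << 7) | (data[i + 1] & 0x7F)
            out.append(chr((prefix << 14) | lo))
            i += 2
    return "".join(out)


# '中' is U+4E2D = (prefix 1) << 14 | 0x0E2D, so it encodes as a prefix
# switch (0x80, 0x21) followed by the data pair (0x9C, 0xAD):
print(decode(bytes([0x80, 0x21, 0x9C, 0xAD])))   # 中
```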
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: WinBoard translations

Post by wgarvin »

Of course you can use whatever encoding you want, or invent your own.

The hard part is convincing anyone else to also use your encoding, so you can inter-operate with them.
hgm
Posts: 28405
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: WinBoard translations

Post by hgm »

The point is that many encodings are possible, and in fact in use, and UTF-8 does not seem superior. In fact, to the users of those encodings it seems so inferior that I don't see it ever conquering the world. (The Chinese hate Linux because of UTF-8...)

So I think enforcing one particular encoding in WinBoard would be a bad idea, and the most practical solution will be to specify that it will follow the OS standard.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: WinBoard translations

Post by Don »

hgm wrote:The point is that many encodings are possible, and in fact in use, and UTF-8 does not seem superior. In fact, to the users of those encodings it seems so inferior that I don't see it ever conquering the world. (The Chinese hate Linux because of UTF-8...)

So I think enforcing one particular encoding in WinBoard would be a bad idea, and the most practical solution will be to specify that it will follow the OS standard.
I don't know what you mean by "the OS standard", as there are a bunch of different operating systems. Are you saying that each language should have a different encoding based on which OS it is running on? That's a lot of combinations.

I have to say that I think you have an unusual bias towards compactness (as witnessed by your amazingly compact chess program), and your statements about the imagined inferiority of UTF-8 seem to be based on how compactly a single language encodes. I think what you want is something that works reliably. Nobody is going to care one bit (no pun intended) that menu text takes a few more bits to encode, but they will care if it doesn't work in most languages and on most platforms because it's too complicated to understand.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: WinBoard translations

Post by Don »

hgm wrote:The point is that many encodings are possible, and in fact in use, and UTF-8 does not seem superior. In fact, to the users of those encodings it seems so inferior that I don't see it ever conquering the world. (The Chinese hate Linux because of UTF-8...)

So I think enforcing one particular encoding in WinBoard would be a bad idea, and the most practical solution will be to specify that it will follow the OS standard.
HG, I stumbled upon this on the web - it says it better than I can. Having said that, I don't care what you use as long as it works and somebody supports it (as I am an Xboard user on Linux):

The reference is here:

http://htmlpurifier.org/docs/enduser-utf8.html#whyutf8

But here is an important excerpt which makes me think you are going down the wrong path - but I'm not a complete expert on this either so my opinion can be taken with a grain of salt. This is a bit web-centric but I think the principles still apply:


Why UTF-8?

So, you've gone through all the trouble of ensuring that your server and embedded characters all line up properly and are present. Good job: at this point, you could quit and rest easy knowing that your pages are not vulnerable to character encoding style XSS attacks. However, just as having a character encoding is better than having no character encoding at all, having UTF-8 as your character encoding is better than having some other random character encoding, and the next step is to convert to UTF-8. But why?

Internationalization

Many software projects, at one point or another, suddenly realize that they should be supporting more than one language. Even regular usage in one language sometimes requires the occasional special character that, without surprise, is not available in your character set. Sometimes developers get around this by adding support for multiple encodings: when using Chinese, use Big5, when using Japanese, use Shift-JIS, when using Greek, etc. Other times, they use character references with great zeal.

UTF-8, however, obviates the need for any of these complicated measures. After getting the system to use UTF-8 and adjusting for sources that are outside the hand of the browser (more on this later), UTF-8 just works. You can use it for any language, even many languages at once, you don't have to worry about managing multiple encodings, you don't have to use those user-unfriendly entities.
hgm
Posts: 28405
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: WinBoard translations

Post by hgm »

Don wrote:I don't know what you mean the "the OS standard" as there are a bunch of different Operating Systems. Are you saying that each language should have a different encoding based on which OS it is running on? That's a lot of combinations.
The OS standard is whatever the OS you are currently running under uses for interpreting (char*). It is the encoding the OS will use when handing strings typed by the user into dialog controls to the application, and the scheme it will use to translate strings printed by printf() into glyphs.

This might be a zillion combinations, but the nice thing is that WinBoard would not have to know about any of them. It would just pass along the strings that the OS hands it, without any conversion, and would similarly pass strings sent to it by the engine on to the OS. If I insisted on UTF-8, it would mean that I would have to make WinBoard figure out what encoding the Windows it is running on is using, and then translate back and forth to UTF-8.

I realize that this shifts the problem to the engine, and that it is less than desirable to make engines responsible for defining their options in multiple languages, as well as multiple encodings. (Although my assumption is that each language will have a natural system for encoding it, so making the engine responsible for the encoding doesn't make the problem much worse than it already is.) I don't have a good solution to this.

I suppose that internally, engines that want to support multiple languages for their options could store all option names in their preferred Unicode format, and then draw on an API call to convert that to whatever the OS uses before sending it.
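As a sketch of that last idea (Python stands in for the engine here, and the function name is mine; on Windows the analogous native call would be WideCharToMultiByte):

```python
import locale


def option_name_for_os(name: str) -> bytes:
    """Encode an internally-Unicode option name in whatever byte encoding
    the current locale (i.e. the OS environment) prefers, just before
    sending it over the pipe. Unmappable characters are replaced rather
    than raising, since the GUI must get *something* to display."""
    return name.encode(locale.getpreferredencoding(False), errors="replace")


data = option_name_for_os("Hash")
print(type(data))          # bytes, in the locale's encoding
```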
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: WinBoard translations

Post by Michel »

Communication protocols should be OS-neutral. Remember that the engine and xboard may very well be running on different OSes (e.g. the engine on a Linux server and xboard on an iPhone).
hgm
Posts: 28405
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: WinBoard translations

Post by hgm »

By the same argument they should also be language neutral. Having the engine directly communicate with the user through engine-defined options causes a lot of problems. The encoding is only a tiny part of that.

Suppose WinBoard can be obtained in 32 national versions in a few years. Can you imagine a system by which the options of each engine could be presented to the user in each of those 32 languages? If not, wouldn't the logical solution be to accept that it cannot be done, and to specify that all option names should be in English with ASCII encoding?
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: WinBoard translations

Post by Don »

hgm wrote:By the same argument they should also be language neutral. Having the engine directly communicate with the user through engine-defined options causes a lot of problems. The encoding is only a tiny part of that.

Suppose WinBoard can be obtained in 32 national versions in a few years. Can you imagine a system by which the options of each engine could be presented to the user in each of those 32 languages? If not, wouldn't the logical solution be to accept that it cannot be done, and to specify that all option names should be in English with ASCII encoding?
The protocol itself needs to use only one encoding, such as UTF-8, but the language is not particularly relevant. UCI uses something that looks like English, but it's not really English, because "setoption", "ucinewgame" and so on are not words in the English language or any other that I know of. Of course it's obvious that they are derived from English words, because English-speaking people immediately recognize and can remember what they are supposed to do.

It's not reasonable for the protocol itself to be multi-language, since by definition the protocol IS the language. But the options can be defined any way you want, and I don't consider them "language" words, as I said. The primary restriction is the encoding, and within that you can call your options anything you want. I could use random character sequences to define options, and nobody could identify them as being in any particular language.

The use of English, however, should be strongly encouraged, because as a chess program author you want as many people as possible to understand what an option does. English may not be the most widely spoken first language, but it's almost universally taught as a second language.

If it were up to me, the entire world would switch over to Esperanto, and that would be encouraged as the international language. Of course I realize that will never happen, but this is a language designed from the very beginning to be very easy to learn, because its rules are very consistent and regular. It is said that it can be learned much more quickly than any other human language.

So is xboard going to have an Esperanto version? UTF-8 handles Esperanto too!