WinBoard translations

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28405
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: WinBoard translations

Post by hgm »

OK, I read up a bit on Unicode and UTF-8, and UTF-8 seems a good idea. So let us define all WB protocol traffic to be UTF-8. WB protocol does not refer to the length of its strings, so there is no problem there. Multi-byte codes in (correctly used) WB protocol can only occur in feature commands, in the engine name or in option names.

As WB moves around these strings mostly without any attempt to interpret them, I don't foresee many problems. They could be compared to values of command-line string options, but we could define these to be UTF-8 too.

The only real problem occurs when we want to display them. The Windows API calls might not understand UTF-8 and instead require wide chars. So there should be a conversion UTF-8 -> UCS-2 whenever such a user-supplied string is printed in a dialog item. So far this only occurs in the engine-settings dialogs and the about-box (for WinBoard + engine.name), and soon perhaps in the menu for the language names.
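
Something like this should do the conversion on the Windows side (a rough, untested sketch; the dialog-item ID in the usage comment is made up):

#include <windows.h>
#include <stdlib.h>

/* Convert a UTF-8 string to wide chars (UTF-16 / UCS-2) for display in a dialog item.
   Returns a malloc'ed buffer the caller must free, or NULL on failure. */
wchar_t *Utf8ToWide(const char *utf8)
{
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0); /* required length, incl. terminator */
    wchar_t *wide = n > 0 ? malloc(n * sizeof(wchar_t)) : NULL;
    if (wide) MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, n);
    return wide;
}

/* usage (made-up item ID): SetDlgItemTextW(hDlg, IDC_ENGINE_NAME, Utf8ToWide(engineName)); */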

Another problem could be when we export the strings. Do we want our PGN files to be UTF-8 too?

[edit] Perhaps having UTF-8 filenames could be a problem, if the standard Windows API calls for accessing the files would insist on a different format?
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: WinBoard translations

Post by Don »

wgarvin wrote:I strongly suggest you store the translated strings for each language in a UTF-8 or UCS-2 text file per language. All the Windows API functions that deal with strings have wide-char versions, which will work correctly with any language as long as the user has appropriate fonts/language packs installed. Using Unicode will make it much easier to switch between languages, etc.

If Winboard uses a lot of char*, then passing around UTF-8 internally might be easier than changing to wide chars... but then you'd have to wrap the API functions to convert the strings. Using wchar_t internally is easier.
I believe UTF-8 is the only really sane choice, as H.G. has concluded. It's a foregone conclusion that on the web UTF-8 is the way to go, and that probably applies elsewhere too. Browsers already have excellent support for UTF-8.

Most encodings have some advantages and disadvantages, but UTF-8 addresses almost all the problems in a sane way and already has wide support.

Winboard is actually cross-platform, so Windows-specific considerations should not be blown out of proportion; this needs to work everywhere.

I'm not going to list all the advantages of UTF-8 here, but everything should be done in UTF-8 first, and only if needed should there be a conversion for display purposes. I believe the world is gradually moving toward UTF-8 for everything and is halfway there now.
hgm
Posts: 28405
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: WinBoard translations

Post by hgm »

Indeed! And the problem of encoding filenames with non-ASCII characters when sending them to the engine is really not related to the naming of the engine options. It can occur anywhere, such as in the egtpath command.

The current protocol leaves it open how characters like é or ç in filenames should be encoded, and I guess there is no way an engine could handle both UTF-8 and single-byte codes >= 128. So we must specify one of the two, and the single-byte codes really seem a dead end.
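
Just to illustrate the ambiguity: the same (made-up) path sent in an egtpath command arrives as different bytes, depending on which encoding was assumed (hex values shown):

egtpath nalimov /egt/café
   Latin-1 : 2F 65 67 74 2F 63 61 66 E9       (é = single byte E9)
   UTF-8   : 2F 65 67 74 2F 63 61 66 C3 A9    (é = two bytes C3 A9)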

If an engine author insists on using non-ASCII in his option names, I guess it is a fair demand that he worries about encoding those, because the WB protocol stream is definitely not wide char. Having a label field in the option feature would not solve any of these encoding problems.
hgm
Posts: 28405
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: WinBoard translations

Post by hgm »

And to get back to the original topic:

I uploaded a new version to http://hgm.nubati.net/winbo_int.zip .
The previous one forgot to call Translate() in the Options -> General dialog. I also made an attempt to allow switching language through the menus, but this does not work yet, as the menus won't translate a second time once the English is replaced.

Note that .lng is now automatically appended to the name of any given translation file, if it contains no period.

Also note that it is really important to have (single) spaces around the === in the translation files. I might have goofed there in some places in the template.
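
To be explicit, lines in a .lng file are meant to look like this (made-up Spanish entries, only to show the single spaces around the ===):

"New Game" === "Nueva partida"
"Load Game..." === "Cargar partida..."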
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: WinBoard translations

Post by wgarvin »

Don wrote:
wgarvin wrote:I strongly suggest you store the translated strings for each language in a UTF-8 or UCS-2 text file per language. All the Windows API functions that deal with strings have wide-char versions, which will work correctly with any language as long as the user has appropriate fonts/language packs installed. Using Unicode will make it much easier to switch between languages, etc.

If Winboard uses a lot of char*, then passing around UTF-8 internally might be easier than changing to wide chars... but then you'd have to wrap the API functions to convert the strings. Using wchar_t internally is easier.
I believe UTF-8 is the only really sane choice, as H.G. has concluded. It's a foregone conclusion that on the web UTF-8 is the way to go, and that probably applies elsewhere too. Browsers already have excellent support for UTF-8.

Most encodings have some advantages and disadvantages, but UTF-8 addresses almost all the problems in a sane way and already has wide support.

Winboard is actually cross-platform, so Windows-specific considerations should not be blown out of proportion; this needs to work everywhere.

I'm not going to list all the advantages of UTF-8 here, but everything should be done in UTF-8 first, and only if needed should there be a conversion for display purposes. I believe the world is gradually moving toward UTF-8 for everything and is halfway there now.
The world is certainly moving towards Unicode, but UTF-8 is not the only encoding that is widely used. UTF-8 is probably best when most of your data will be ASCII, because ASCII chars only take up one byte in that encoding. However, other BMP chars (such as Chinese or Japanese) will often take 3 bytes, and some will even take 4 bytes. [Edit: actually I think only non-BMP chars can take 4 bytes, so that's not too bad.] In UTF-16 and UCS-2, all of the BMP can be represented with just 2 bytes. They cleverly reserved a range of code points for encoding 20-bit values as "surrogate pairs" (to represent U+10000 through U+10FFFF), so I guess everything that is representable as UCS-2 (i.e. everything in the BMP) has the same 2-byte representation in UTF-16.

UTF-8 also has the advantage that you don't have to worry about byte ordering: the wider encodings often come in two flavors (e.g. UTF-16LE and UTF-16BE). The other tradeoffs concern the size of the string data (UTF-8 is best for ASCII data, wider chars are often better for non-ASCII data) and convenience of implementation (UTF-8 is more interoperable with char* libraries, but you have to convert it before you can pass it to Windows wide-char API functions like SetWindowTextW).

When working with UCS-2 or UTF-16 you can use the type wchar_t, but its size is implementation-dependent and might be more than 2 bytes on some platforms. On every Windows compiler I'm aware of, it's 2 bytes. Microsoft's compilers come with wide versions of the popular string functions, named with a "wcs" or "w" prefix (wcscpy, wcscat, wprintf, ...). I imagine Intel and gcc provide the same ones for compatibility, but I don't actually know.
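
A trivial example of what that looks like (standard <wchar.h> functions, assuming a compiler where wchar_t is 16 bits):

#include <wchar.h>

int main(void)
{
    wchar_t name[64];
    wcscpy(name, L"WinBoard ");      /* wide-char copy */
    wcscat(name, L"4.4");            /* wide-char concatenation */
    wprintf(L"%ls\n", name);         /* %ls prints a wide-char string */
    return 0;
}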

As for converting from UTF-8 to a 16-bit encoding for passing to Windows... You could write your own conversion function, but there are probably robust open-source converters out there if you look around. For example, there's International Components for Unicode, which is a "mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications". You could also use the Win32 API directly; MultiByteToWideChar with CP_UTF8 appears to do exactly this conversion. (At work we use our own in-house libraries for this sort of thing.)
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: WinBoard translations

Post by Don »

wgarvin wrote:
Don wrote:
wgarvin wrote:I strongly suggest you store the translated strings for each language in a UTF-8 or UCS-2 text file per language. All the Windows API functions that deal with strings have wide-char versions, which will work correctly with any language as long as the user has appropriate fonts/language packs installed. Using Unicode will make it much easier to switch between languages, etc.

If Winboard uses a lot of char*, then passing around UTF-8 internally might be easier than changing to wide chars... but then you'd have to wrap the API functions to convert the strings. Using wchar_t internally is easier.
I believe UTF-8 is the only really sane choice, as H.G. has concluded. It's a foregone conclusion that on the web UTF-8 is the way to go, and that probably applies elsewhere too. Browsers already have excellent support for UTF-8.

Most encodings have some advantages and disadvantages, but UTF-8 addresses almost all the problems in a sane way and already has wide support.

Winboard is actually cross-platform, so Windows-specific considerations should not be blown out of proportion; this needs to work everywhere.

I'm not going to list all the advantages of UTF-8 here, but everything should be done in UTF-8 first, and only if needed should there be a conversion for display purposes. I believe the world is gradually moving toward UTF-8 for everything and is halfway there now.
The world is certainly moving towards Unicode, but UTF-8 is not the only encoding that is widely used. UTF-8 is probably best when most of your data will be ASCII, because ASCII chars only take up one byte in that encoding. However, other BMP chars (such as Chinese or Japanese) will often take 3 bytes, and some will even take 4 bytes. [Edit: actually I think only non-BMP chars can take 4 bytes, so that's not too bad.] In UTF-16 and UCS-2, all of the BMP can be represented with just 2 bytes. They cleverly reserved a range of code points for encoding 20-bit values as "surrogate pairs" (to represent U+10000 through U+10FFFF), so I guess everything that is representable as UCS-2 (i.e. everything in the BMP) has the same 2-byte representation in UTF-16.

UTF-8 also has the advantage that you don't have to worry about byte ordering: the wider encodings often come in two flavors (e.g. UTF-16LE and UTF-16BE). The other tradeoffs concern the size of the string data (UTF-8 is best for ASCII data, wider chars are often better for non-ASCII data) and convenience of implementation (UTF-8 is more interoperable with char* libraries, but you have to convert it before you can pass it to Windows wide-char API functions like SetWindowTextW).

When working with UCS-2 or UTF-16 you can use the type wchar_t, but its size is implementation-dependent and might be more than 2 bytes on some platforms. On every Windows compiler I'm aware of, it's 2 bytes. Microsoft's compilers come with wide versions of the popular string functions, named with a "wcs" or "w" prefix (wcscpy, wcscat, wprintf, ...). I imagine Intel and gcc provide the same ones for compatibility, but I don't actually know.

As for converting from UTF-8 to a 16-bit encoding for passing to Windows... You could write your own conversion function, but there are probably robust open-source converters out there if you look around. For example, there's International Components for Unicode, which is a "mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications". You could also use the Win32 API directly; MultiByteToWideChar with CP_UTF8 appears to do exactly this conversion. (At work we use our own in-house libraries for this sort of thing.)
You just did what I wanted to avoid: listing all the pros and cons of the various formats.

I would encourage everyone to just always use UTF-8 and try to standardize on that, instead of choosing a different encoding for each individual project. Avoid formats that only work on a single platform, even if you have to convert. The world is on the way to standardizing on UTF-8, but we can hasten the end of the insanity by ignoring the myriad other confusing encodings as much as possible.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: WinBoard translations

Post by wgarvin »

Don wrote:I would encourage everyone to just always use UTF-8 and try to standardize on that, instead of choosing a different encoding for each individual project. Avoid formats that only work on a single platform, even if you have to convert. The world is on the way to standardizing on UTF-8, but we can hasten the end of the insanity by ignoring the myriad other confusing encodings as much as possible.
Code pages were insanity; the various Unicode encodings are just the necessary complexity when you want to handle all of the world's languages at once.

I don't think "always use UTF-8" is good advice. For interoperating with other programs it's reasonable, but internally each application should use whatever makes the most sense. Exchanging strings with the operating system or with 3rd-party libraries will be easier if you use the same encoding they use. Some operations on strings are easier if you use a fixed-size encoding like UCS-2 or UCS-4. Apps that load and manipulate text files (such as text editors, comparison tools, compilers, etc.) should be able to recognize byte-order marks, read any of the Unicode encodings, and convert them to their preferred internal encoding. Apps that write any Unicode encoding out to a text file should put the appropriate byte-order mark at the start of the file. In the localized apps I've worked on, we were mostly concerned with displaying translated strings to the user, and most of them used UTF-16 for that purpose rather than UTF-8.
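
Recognizing the byte-order mark takes only a few lines; a sketch (the byte values are the standard BOMs, UTF-32 left out):

#include <stdio.h>

/* Peek at the first bytes of an open file and guess the Unicode encoding from its BOM. */
const char *DetectBom(FILE *f)
{
    unsigned char b[3] = { 0, 0, 0 };
    size_t n = fread(b, 1, 3, f);
    rewind(f);                        /* put the file position back at the start */
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE";
    return "no BOM (assume 8-bit text)";
}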

For Winboard, using UTF-8 does sound like the best choice, since all those char* can remain unchanged and the strings will pass through strcpy/strcat/sprintf %s with no problems.
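
For example, this kind of code keeps working unchanged, because the multi-byte sequences are just opaque non-zero bytes to the char* functions (é written out as its UTF-8 bytes here; the name is made up):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char name[64] = "R\xC3\xA9ti";              /* "Réti" in UTF-8 */
    char line[128];
    strcat(name, " 1.0");                       /* ordinary char* concatenation */
    sprintf(line, "engine: %s", name);          /* %s copies the bytes verbatim */
    printf("%s (%u bytes)\n", line, (unsigned) strlen(line));
    return 0;
}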
hgm
Posts: 28405
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: WinBoard translations

Post by hgm »

OK, I have asked around a bit, and what I learned actually makes me doubt whether we should put any definite specs in WB protocol about the encoding of non-ASCII characters.

It turns out that for languages not based on ASCII, such as Chinese, there exist national encoding systems. These encode the language so that it can be represented in ordinary (char*) strings, compatible with plain ASCII text, very much like UTF-8 but with a different encoding. Chinese, for instance, uses GB2312.

The point is that the programs involved really do not have to know anything about this. It is all handled by the OS. When you type something into a text edit, the OS hands you a (char[]) in the encoding that is native to the OS. Programs can handle these strings in the normal way, without ever having to know what the codes >128 embedded in them mean. When they pass these strings back to request OS actions, say in fopen(), the OS will automatically convert the string to wchar, using its native encoding system.

So I think we should simply specify that any non-ASCII characters that have to be sent through WB protocol (e.g. in file names) are sent in the encoding native to the OS, because that means they will automatically be handled in a compatible way by engine and GUI. On Linux this would of course mean UTF-8, but on Windows it would depend on the national version.

Now this might be less convenient for engines that want to use option names containing such non-ASCII symbols. They could not use hard-coded string constants for printing them, as the encoding in which they have to be sent to WinBoard would depend on the OS they are running on. Of course they could make assumptions, e.g. that printing Chinese option names only has to work on a native Chinese OS, and hard-code them accordingly. People running the engine on a non-Chinese OS would then see garbage if they configured the engine to use the Chinese option names. But because WinBoard simply gives back (when setting the option) what the engine sent it in the first place, the protocol would still work; only the display of the option names would not. I think this is mainly a symptom of using a language on an OS that was not set up for it, that it will in practice not be a very big problem, and that it is likely a problem people are very much used to.
hgm
Posts: 28405
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: WinBoard translations

Post by hgm »

I uploaded a new version, together with the Spanish translation finished by Óscar Toledo. (Still http://hgm.nubati.net/winbo_int.zip .)

It now also changes the menus in response to interactive language switching! I still have not found out how to do the about-box...
hgm
Posts: 28405
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: WinBoard translations

Post by hgm »

To get back to the coding problem:

On second thought, UTF-8 does not strike me as a terribly clever system. Languages like Chinese take a terrible hit in coding density. It seems UTF-8 attaches exaggerated importance to being able to jump into a text at a random point and immediately recognize where the next character starts. To this end it sacrifices many bits as control bits, to distinguish leading bytes from extension bytes and from ASCII.

A system along the lines of GB2312 seems much more powerful. Here you would encode non-ASCII characters as two bytes, each starting with a 1 bit (where all ASCII codes start with a 0 bit). This would give you 14 coding bits, much better than the 11 you have in 2-byte UTF-8 characters. As this is apparently more than you need for Chinese, GB2312 even spares the codes 80-9F, which in some single-byte encoding systems act as control codes and could confuse software not aware of the higher coding level.
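
To make the comparison concrete, the payload bits (x) per encoded character are:

ASCII (in both systems) : 0xxxxxxx                    ->  7 bits
UTF-8, 2-byte form      : 110xxxxx 10xxxxxx           -> 11 bits
UTF-8, 3-byte form      : 1110xxxx 10xxxxxx 10xxxxxx  -> 16 bits (where Chinese ends up)
2-byte form as above    : 1xxxxxxx 1xxxxxxx           -> 14 bits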

The only disadvantage I see of such a system is that in a very long sequence of double-byte encodings you would not easily get back in sync with the character boundaries once you are out of sync (e.g. after jumping randomly into the text). But any single-byte code (such as a linefeed...) would re-sync you, and the code can be read backwards without losing sync. If needed, a separate byte code could be set apart (like 0x80, or even 0xA0) to act as sync punctuation, and you could require that it occurs at least once every 100 bytes. That would cause an overhead of only 1% in pure Chinese text, and the overhead of hunting through 100 bytes for the sync after a random jump into the text, to determine the character boundaries, seems acceptable as well.

Even though 16K codes are not enough to encode all of Unicode, they usually do a very satisfactory job of encoding virtually all characters of a single language, even Chinese. The occasional embedded out-of-range code can be encoded by the trick of surrogate pairs. In practice you tend to stick to the same language for quite a long time. So a system where you would set apart some of the 2-byte codes as control characters to switch to a different language (i.e. map a different part of the Unicode range to the 2-byte encodings, while the rest of the range would have to rely on surrogate pairs) seems much more practical.