Q: FICS code and 64-bit (va_list)

hgm
Posts: 27807
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Q: FICS code and 64-bit (va_list)

Post by hgm »

I don't get it. ASCII is simply a subset of Unicode as encoded in UTF-8, right? The first 128 code points of Unicode are the ASCII set, and in UTF-8 they map onto the single bytes 0-127.

So how can [a-z] mean anything different in ASCII than in UTF-8?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Post by bob »

UncombedCoconut wrote:
bob wrote:
UncombedCoconut wrote:
bob wrote:
hgm wrote:Problem is that I normally get thousands of compiler warnings, when doing "make". :cry: I will try to see if the -m32 helps, though.

This CentOS is a bit fishy anyway: make install did crash on it not being case sensitive in file names, so that the command "cp [a-z]* ..." also tried to copy the directory CVS, which produced a fatal error.
Something is broken. Unix has always been case-sensitive. The regular expression "[a-z]*" should NEVER match a file name that does not start with a lowercase alphabetic character, and doesn't on my Fedora systems. We did run CentOS on our lab machines that use Unix and I never noticed that being broken, so something is definitely wrong.
Not true. The behavior is locale-dependent.

Code:

[justinb@coconut ~]$ LANG=C sh -c 'ls -d [a-c]*'
barndiagonal.mov  caz  chess  code  convert.sh
[justinb@coconut ~]$ LANG=en_US.utf-8 sh -c 'ls -d [a-c]*'
barndiagonal.mov  BlocksThatMatterUserDatas  BlocksThatMatterUserDatas.7z  BlocksThatMatterUserDatas-justin  BotaniculaSaves  caz  chess  code  convert.sh
(Note that the changes are in how sorting and character ranges work. For instance, "ls C*" will never turn up chess, and more importantly the filenames "chess" and "Chess" are still different.)

Doesn't explain how his copy [a-z]* would copy ANY file that started with C. Obviously that RE matches any file that starts with lowercase a-z, but I am not aware of any locale changes that would make it case insensitive, as Unix has always been completely case sensitive.
Sure it does. The shell is interpreting the pattern (just as it did on my box when deciding what args to pass to ls). Shell glob patterns aren't exactly REs; they are constructs interpreted according to the shell's documentation. Relevantly (for bash):

Code:

              [...]  Matches any one of the enclosed characters.   A  pair  of
                     characters  separated by a hyphen denotes a range expres‐
                     sion; any character that sorts between those two  charac‐
                     ters,  inclusive,  using  the  current locale's collating
                     sequence and character set, is  matched.   If  the  first
                     character following the [ is a !  or a ^ then any charac‐
                     ter not enclosed is matched.  The sorting order of  char‐
                     acters  in range expressions is determined by the current
                     locale and the value of the LC_COLLATE shell variable, if
                     set.   A - may be matched by including it as the first or
                     last character in the set.  A ] may be matched by includ‐
                     ing it as the first character in the set.
So the ordering that says what the range [a-z] is doesn't have to be ASCII, and in common cases it isn't.
Can you cite some example where the letters a-z have a "C" embedded in the middle? Note that we ARE talking English here, not Vulcan or Klingon. :)

Even something as archaic as EBCDIC doesn't fail here.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Post by bob »

syzygy wrote:
UncombedCoconut wrote:
bob wrote:
UncombedCoconut wrote:
bob wrote:
hgm wrote:Problem is that I normally get thousands of compiler warnings, when doing "make". :cry: I will try to see if the -m32 helps, though.

This CentOS is a bit fishy anyway: make install did crash on it not being case sensitive in file names, so that the command "cp [a-z]* ..." also tried to copy the directory CVS, which produced a fatal error.
Something is broken. Unix has always been case-sensitive. The regular expression "[a-z]*" should NEVER match a file name that does not start with a lowercase alphabetic character, and doesn't on my Fedora systems. We did run CentOS on our lab machines that use Unix and I never noticed that being broken, so something is definitely wrong.
Not true. The behavior is locale-dependent.

Code:

[justinb@coconut ~]$ LANG=C sh -c 'ls -d [a-c]*'
barndiagonal.mov  caz  chess  code  convert.sh
[justinb@coconut ~]$ LANG=en_US.utf-8 sh -c 'ls -d [a-c]*'
barndiagonal.mov  BlocksThatMatterUserDatas  BlocksThatMatterUserDatas.7z  BlocksThatMatterUserDatas-justin  BotaniculaSaves  caz  chess  code  convert.sh
(Note that the changes are in how sorting and character ranges work. For instance, "ls C*" will never turn up chess, and more importantly the filenames "chess" and "Chess" are still different.)
Doesn't explain how his copy [a-z]* would copy ANY file that started with C. Obviously that RE matches any file that starts with lowercase a-z, but I am not aware of any locale changes that would make it case insensitive, as Unix has always been completely case sensitive.
Sure it does. The shell is interpreting the pattern (just as it did on my box when deciding what args to pass to ls). Shell glob patterns aren't exactly REs; they are constructs interpreted according to the shell's documentation. Relevantly (for bash):

Code:

              [...]  Matches any one of the enclosed characters.   A  pair  of
                     characters  separated by a hyphen denotes a range expres‐
                     sion; any character that sorts between those two  charac‐
                     ters,  inclusive,  using  the  current locale's collating
                     sequence and character set, is  matched.   If  the  first
                     character following the [ is a !  or a ^ then any charac‐
                     ter not enclosed is matched.  The sorting order of  char‐
                     acters  in range expressions is determined by the current
                     locale and the value of the LC_COLLATE shell variable, if
                     set.   A - may be matched by including it as the first or
                     last character in the set.  A ] may be matched by includ‐
                     ing it as the first character in the set.
So the ordering that says what the range [a-z] is doesn't have to be ASCII, and in common cases it isn't.
Amazing, even on my computer [a-z]* matches CVS... :shock:
Indeed LANG is set to en_US.UTF-8.
In my UTF-8 chart, lowercase letters are separate from uppercase. Uppercase C does not appear in between lowercase a and lowercase z. Something is broken.
UncombedCoconut
Posts: 319
Joined: Fri Dec 18, 2009 11:40 am
Location: Naperville, IL

Post by UncombedCoconut »

bob wrote:Can you cite some example where the letters a-z have a "C" embedded in the middle? Note that we ARE talking English here, not Vulcan or Klingon. :)

Even something as archaic as EBCDIC doesn't fail here.
Perhaps you're thinking too much like a computer. :) The key phrase above is "using the current locale's collating sequence". Characters are not compared as integers; their order in the collating sequence doesn't necessarily have anything to do with their order in the charset.

The POSIX locale, a.k.a. the C locale, defines collation order to be the same as ASCII's order. This is why I used LANG=C in some of my examples.

The other locale I've mentioned is en_US.utf-8. The utf-8 part isn't what's important; it's that in English the symbols 'a' and 'A' are considered to be earlier in the alphabet than 'b' and 'B'. Most of the time, this is nice: for instance, I may download a file and remember its name inexactly. It sucks to guess whether to start looking in the capital or lower-case part of the directory listing. Aççèñtéd characters, while rare, can make matters worse. (On the other hand, it sucks when you untar a source folder and the COPYING, INSTALL, README files don't appear before everything else. We don't have to agree which inconvenience is worse; my point is that these sorting orders aren't totally crazy.)

Further reading: the strcoll man page, maybe this page.
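
The difference between charset order and collation order can be seen without any files at all. The following is only a minimal sketch, assuming a POSIX shell with `sort` and `paste`; the variable names are made up for illustration, and en_US.UTF-8 may not be installed on every machine (in which case the second line will typically just repeat the C ordering).

```shell
# Four one-letter names, deliberately mixed in case.
names='b
A
a
B'

# C/POSIX locale: characters compare by byte value, so capitals come first.
c_sorted=$(printf '%s\n' "$names" | LC_ALL=C sort | paste -s -d ' ' -)
echo "C locale:    $c_sorted"    # A B a b

# en_US.UTF-8 (assumed to be installed): the result depends on that locale's
# collation rules, so no expected output is given here.
en_sorted=$(printf '%s\n' "$names" | LC_ALL=en_US.UTF-8 sort | paste -s -d ' ' -)
echo "en_US.UTF-8: $en_sorted"
```

If the second line interleaves upper and lower case, the same ordering is what a range such as [a-c] walks through in bash under that locale.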
stevenaaus
Posts: 608
Joined: Wed Oct 13, 2010 9:44 am
Location: Australia

Post by stevenaaus »

UncombedCoconut wrote:
bob wrote:Can you cite some example where the letters a-z have a "C" embedded in the middle? Note that we ARE talking English here, not Vulcan or Klingon. :)

Even something as archaic as EBCDIC doesn't fail here.
Perhaps you're thinking too much like a computer. :) The key phrase above is "using the current locale's collating sequence". Characters are not compared as integers; their order in the collating sequence doesn't necessarily have anything to do with their order in the charset.

The POSIX locale, a.k.a. the C locale, defines collation order to be the same as ASCII's order. This is why I used LANG=C in some of my examples.

The other locale I've mentioned is en_US.utf-8. The utf-8 part isn't what's important; it's that in English the symbols 'a' and 'A' are considered to be earlier in the alphabet than 'b' and 'B'. Most of the time, this is nice: for instance, I may download a file and remember its name inexactly. It sucks to guess whether to start looking in the capital or lower-case part of the directory listing. Aççèñtéd characters, while rare, can make matters worse. (On the other hand, it sucks when you untar a source folder and the COPYING, INSTALL, README files don't appear before everything else. We don't have to agree which inconvenience is worse; my point is that these sorting orders aren't totally crazy.)

Further reading: the strcoll man page, maybe this page.
Thanks for this explanation. I'd noticed this behaviour in bash before and was a bit confounded.
I notice that in tcsh and zsh, "[a-z]*" matches only lowercase files.
mar
Posts: 2559
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Post by mar »

bob wrote:In my UTF-8 chart, lowercase letters are separate from uppercase. Uppercase C does not appear in between lowercase a and lowercase z. Something is broken.
You mean in your Unicode chart :)
Yes it seems like a bug somewhere.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Post by bob »

UncombedCoconut wrote:
bob wrote:Can you cite some example where the letters a-z have a "C" embedded in the middle? Note that we ARE talking English here, not Vulcan or Klingon. :)

Even something as archaic as EBCDIC doesn't fail here.
Perhaps you're thinking too much like a computer. :) The key phrase above is "using the current locale's collating sequence". Characters are not compared as integers; their order in the collating sequence doesn't necessarily have anything to do with their order in the charset.

The POSIX locale, a.k.a. the C locale, defines collation order to be the same as ASCII's order. This is why I used LANG=C in some of my examples.

The other locale I've mentioned is en_US.utf-8. The utf-8 part isn't what's important; it's that in English the symbols 'a' and 'A' are considered to be earlier in the alphabet than 'b' and 'B'. Most of the time, this is nice: for instance, I may download a file and remember its name inexactly. It sucks to guess whether to start looking in the capital or lower-case part of the directory listing. Aççèñtéd characters, while rare, can make matters worse. (On the other hand, it sucks when you untar a source folder and the COPYING, INSTALL, README files don't appear before everything else. We don't have to agree which inconvenience is worse; my point is that these sorting orders aren't totally crazy.)

Further reading: the strcoll man page, maybe this page.
You still are not answering my question. HGM clearly mentioned [a-z] and the character "C". What locale could possibly alter the upper-case C and make it appear in between lowercase-a and lowercase-z?

My bet would be a system installed on an archaic filesystem so that there are no lowercase letters. But if you do a ls -l and see both upper and lower case filenames, that can't be the explanation, and HGM seemed to have eliminated that possibility as I read his post.

So, where does such an "uppercase in the middle of lowercase" circumstance actually show up?

Your example will break so many programs it is not funny. If that is actually done, I'd consider it a horrible concept... And would that mean that it would NOT match an uppercase Z since that would apparently fall after a lowercase z?

That really is broken.
hgm
Posts: 27807
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Post by hgm »

I agree. This is absolutely awful.

But it seems he is right. What I get on this machine is:

Code:

$ echo [a-c]*
bots copying
$ echo [a-d]*
bots copying COPYING CVS data doc
$ echo [a-C]*
bots copying COPYING CVS
It is not due to case insensitivity of the file system, as I was able to make distinct files copying and COPYING. C just maps in between c and d, in the way this shell expands [a-z]. :shock:
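
For anyone who wants a glob that behaves the same regardless of the login locale, here is a rough sketch of two possible workarounds, not a fix taken from the FICS makefile. It assumes a POSIX `sh` and `mktemp -d`; the scratch directory and variable names are hypothetical, and the file names simply mirror the listing above. The glob runs inside `sh -c` so that it is expanded by a shell that already starts in the requested locale.

```shell
# Scratch directory holding the same names as in the transcript above.
dir=$(mktemp -d) || exit 1
(cd "$dir" && touch bots copying COPYING data doc && mkdir CVS)

# Workaround 1: run the glob in the C locale, where [a-z] follows byte order.
c_range=$(cd "$dir" && LC_ALL=C sh -c 'echo [a-z]*')
echo "$c_range"        # bots copying data doc

# Workaround 2: ask for lowercase letters by class instead of by range.
lower_class=$(cd "$dir" && sh -c 'echo [[:lower:]]*')
echo "$lower_class"    # bots copying data doc

rm -rf "$dir"
```

Neither variant picks up COPYING or CVS, which is what the "cp [a-z]* ..." line in make install presumably intended.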
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Post by syzygy »

bob wrote:You still are not answering my question. HGM clearly mentioned [a-z] and the character "C". What locale could possibly alter the upper-case C and make it appear in between lowercase-a and lowercase-z?

My bet would be a system installed on an archaic filesystem so that there are no lowercase letters.
Your system does just the same, just try it. Set LANG to en_US.utf-8 (if it's not already set!) and do ls -l. Now observe that files are listed in alphabetical order without regard to case. Then type ls [a-z]* and observe that it catches files starting with a capital C.