I don't get it. ASCII is simply a subset of Unicode as encoded in UTF-8, right? The first 128 code points of Unicode are the ASCII set, and in UTF-8 they map onto the single bytes 0-127.
So how can [a-z] mean anything different in ASCII than in UTF-8?
Q: FICS code and 64-bit (va_list)
-
- Posts: 27807
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Q: FICS code and 64-bit (va_list)
hgm wrote:
Problem is that I normally get thousands of compiler warnings, when doing "make". I will try to see if the -m32 helps, though.
This CentOS is a bit fishy anyway: make install did crash on it not being case sensitive in file names, so that the command "cp [a-z]* ..." also tried to copy the directory CVS, which produced a fatal error.

bob wrote:
Something is broken. Unix has always been case-sensitive. The regular expression "[a-z]*" should NEVER match a file name that does not start with a lowercase alphabetic character, and doesn't on my fedora systems. We did run centos on our lab machines that use unix and I never noticed that being broken, so something is definitely wrong.

UncombedCoconut wrote:
Not true. The behavior is locale-dependent.
Code: Select all
[justinb@coconut ~]$ LANG=C sh -c 'ls -d [a-c]*'
barndiagonal.mov caz chess code convert.sh
[justinb@coconut ~]$ LANG=en_US.utf-8 sh -c 'ls -d [a-c]*'
barndiagonal.mov BlocksThatMatterUserDatas BlocksThatMatterUserDatas.7z BlocksThatMatterUserDatas-justin BotaniculaSaves caz chess code convert.sh
(Note that the changes are in how sorting and character ranges work. For instance, "ls C*" will never turn up chess, and more importantly the filenames "chess" and "Chess" are still different.)

bob wrote:
Doesn't explain how his copy [a-z]* would copy ANY file that started with C. Obviously that RE matches any file that starts with lowercase a-z, but I am not aware of any locale changes that would make it case insensitive, as Unix has always been completely case sensitive.

UncombedCoconut wrote:
Sure it does. The shell is interpreting the pattern (just as it did on my box when deciding what args to pass to ls). Shell glob patterns aren't exactly REs; they are constructs interpreted according to the shell's documentation. Relevantly (for bash):
Code: Select all
[...] Matches any one of the enclosed characters. A pair of characters separated by a hyphen denotes a range expression; any character that sorts between those two characters, inclusive, using the current locale's collating sequence and character set, is matched. If the first character following the [ is a ! or a ^ then any character not enclosed is matched. The sorting order of characters in range expressions is determined by the current locale and the value of the LC_COLLATE shell variable, if set. A - may be matched by including it as the first or last character in the set. A ] may be matched by including it as the first character in the set.
So the ordering that says what the range [a-z] is doesn't have to be ASCII, and in common cases it isn't.

Can you cite some example where the letters a-z have a "C" embedded in the middle? Note that we ARE talking English here, not Vulcan or Klingon.
Even something as archaic as EBCDIC doesn't fail here.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Q: FICS code and 64-bit (va_list)
syzygy wrote:
Amazing, even on my computer [a-z]* matches CVS...
Indeed LANG is set to en_US.UTF-8.

In my UTF-8 chart, lowercase letters are separate from uppercase. Uppercase C does not appear in between lowercase a and lowercase z. Something is broken.
-
- Posts: 319
- Joined: Fri Dec 18, 2009 11:40 am
- Location: Naperville, IL
Re: Q: FICS code and 64-bit (va_list)
bob wrote:
Can you cite some example where the letters a-z have a "C" embedded in the middle? Note that we ARE talking English here, not Vulcan or Klingon.
Even something as archaic as EBCDIC doesn't fail here.

Perhaps you're thinking too much like a computer. The key phrase above is "using the current locale's collating sequence". Characters are not compared as integers; their order in the collating sequence doesn't necessarily have anything to do with their order in the charset.
The POSIX locale, a.k.a. the C locale, defines collation order to be the same as ASCII's order. This is why I used LANG=C in some of my examples.
The other locale I've mentioned is en_US.utf-8. The utf-8 part isn't what's important; it's that in English the symbols 'a' and 'A' are considered to be earlier in the alphabet than 'b' and 'B'. Most of the time, this is nice: for instance, I may download a file and remember its name inexactly. It sucks to guess whether to start looking in the capital or lower-case part of the directory listing. Aççèñtéd characters, while rare, can make matters worse. (On the other hand, it sucks when you untar a source folder and the COPYING, INSTALL, README files don't appear before everything else. We don't have to agree which inconvenience is worse; my point is that these sorting orders aren't totally crazy.)
Further reading: the strcoll man page, maybe this page.
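
A minimal sketch of the point about the C locale, not taken from any post in this thread: it assumes a POSIX sh that supports bracket character classes, and the scratch directory and file names are made up for illustration. Under LC_ALL=C the range [a-z] follows ASCII order, and [[:lower:]] names the lowercase class directly rather than relying on a range.

```shell
#!/bin/sh
# Hypothetical scratch directory and file names, chosen for illustration.
tmpdir=$(mktemp -d)
touch "$tmpdir/bots" "$tmpdir/copying" "$tmpdir/COPYING" "$tmpdir/CVS" "$tmpdir/data"

# In the C (POSIX) locale, ranges follow ASCII order, so [a-z] is lowercase only.
c_matches=$(cd "$tmpdir" && LC_ALL=C sh -c 'echo [a-z]*')

# [[:lower:]] asks for the lowercase class instead of a collation-based range.
class_matches=$(cd "$tmpdir" && LC_ALL=C sh -c 'echo [[:lower:]]*')

echo "range in C locale: $c_matches"
echo "character class:   $class_matches"
rm -rf "$tmpdir"
```

Running the same range pattern under a locale such as en_US.UTF-8 may give a different list, depending on the shell and the system's collation tables, which is the behaviour being discussed here.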
-
- Posts: 608
- Joined: Wed Oct 13, 2010 9:44 am
- Location: Australia
Re: Q: FICS code and 64-bit (va_list)
UncombedCoconut wrote:
Perhaps you're thinking too much like a computer. The key phrase above is "using the current locale's collating sequence". Characters are not compared as integers; their order in the collating sequence doesn't necessarily have anything to do with their order in the charset.

Thanks for this explanation. I'd noticed this behaviour in bash before and was a bit confounded.
I notice that in tcsh and zsh, "[a-z]*" matches only lowercase files.
-
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Q: FICS code and 64-bit (va_list)
bob wrote:
In my UTF-8 chart, lowercase letters are separate from uppercase. Uppercase C does not appear in between lowercase a and lowercase z. Something is broken.

You mean in your Unicode chart
Yes it seems like a bug somewhere.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Q: FICS code and 64-bit (va_list)
UncombedCoconut wrote:
Perhaps you're thinking too much like a computer. The key phrase above is "using the current locale's collating sequence". Characters are not compared as integers; their order in the collating sequence doesn't necessarily have anything to do with their order in the charset.

You still are not answering my question. HGM clearly mentioned [a-z] and the character "C". What locale could possibly alter the upper-case C and make it appear in between lowercase-a and lowercase-z?
My bet would be a system installed on an archaic filesystem so that there are no lowercase letters. But if you do a ls -l and see both upper and lower case filenames, that can't be the explanation, and HGM seemed to have eliminated that possibility as I read his post.
So, where does such an "uppercase in the middle of lowercase" circumstance actually show up?
Your example will break so many programs it is not funny. If that is actually done, I'd consider it a horrible concept... And would that mean that it would NOT match an uppercase Z since that would apparently fall after a lowercase z?
That really is broken.
-
- Posts: 27807
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Q: FICS code and 64-bit (va_list)
I agree. This is absolutely awful.
But it seems he is right. What I get on this machine is:
Code: Select all
$ echo [a-c]*
bots copying
$ echo [a-d]*
bots copying COPYING CVS data doc
$ echo [a-C]*
bots copying COPYING CVS
It is not due to case insensitivity of the file system, as I was able to make distinct files copying and COPYING. C just maps in between c and d, in the way this shell expands [a-z].
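
One possible workaround for the "cp [a-z]* ..." step, sketched under assumptions: the directory layout and names below are hypothetical, not the actual FICS tree. The locale is set inside a subshell so that the pattern itself is expanded under C collation. Prefixing only the command (LC_ALL=C cp [a-z]* ...) would not help, because the calling shell expands the pattern before cp ever runs.

```shell
#!/bin/sh
# Hypothetical scratch layout: a source tree that contains a CVS directory.
src=$(mktemp -d)
dest=$(mktemp -d)
touch "$src/bots" "$src/copying" "$src/COPYING"
mkdir "$src/CVS"

# Set the locale inside a subshell so the glob is expanded under C
# collation; the setting does not leak into the calling shell.
(
  LC_ALL=C
  export LC_ALL
  cd "$src" && cp [a-z]* "$dest"/
)

# List what actually arrived in the destination.
copied=$(cd "$dest" && LC_ALL=C sh -c 'echo *')
echo "copied: $copied"
rm -rf "$src" "$dest"
```

With the range expanded in the C locale, the CVS directory is never handed to cp, so the copy does not fail on it.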
-
- Posts: 5566
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Q: FICS code and 64-bit (va_list)
bob wrote:
You still are not answering my question. HGM clearly mentioned [a-z] and the character "C". What locale could possibly alter the upper-case C and make it appear in between lowercase-a and lowercase-z?
My bet would be a system installed on an archaic filesystem so that there are no lowercase letters.

Your system does just the same, just try it. Set LANG to en_US.utf-8 (if it's not already set!) and do ls -l. Now observe that files are listed in alphabetical order without regard to case. Then type ls [a-z]* and observe that it catches files starting with a capital C.
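
The sorting half of that experiment can also be tried without touching any files. This is only a sketch: the names are borrowed from earlier posts for illustration, and the availability of en_US.UTF-8 is an assumption, so that part is skipped when locale -a does not list it.

```shell
#!/bin/sh
# Hypothetical file names, echoing the ones mentioned in the thread.
names='chess
CVS
bots
COPYING'

# C locale: plain byte order, so every uppercase name sorts before lowercase.
c_order=$(printf '%s\n' "$names" | LC_ALL=C sort | paste -s -d ' ' -)
echo "C locale:    $c_order"

# en_US.UTF-8 is an assumption; it may not be installed on every system.
if locale -a 2>/dev/null | grep -qi '^en_US\.utf-\{0,1\}8$'; then
  en_order=$(printf '%s\n' "$names" | LC_ALL=en_US.UTF-8 sort | paste -s -d ' ' -)
  echo "en_US.UTF-8: $en_order"
fi
```

If the second line prints with upper and lower case interleaved, that is the same collating sequence the shell consults when it expands a range such as [a-z].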