===========================================================
Collate Order and Character Set - GLOB patterns and accents
===========================================================
-Ian! D. Allen - idallen@idallen.ca - www.idallen.com

This file should help you understand Unix/Linux scripts in a world of
increasing internationalization (i18n).

I used to say that a shell script only needed to set two things to behave
properly no matter what nonsense was set in the parent: PATH and umask

    #!/bin/sh -u
    PATH=/bin:/usr/bin ; export PATH
    umask 022

I've discovered a third and fourth necessity: setting character collation
order, and setting the acceptable input character set.  Without these
additions, scripts may behave differently depending on environment
variables set (or not set) in the parent process.  You will find these
variables used in Unix/Linux start-up scripts for network services.

-------------
Collate Order
-------------

Here is an example of the expected, intuitive strict numeric collation
order we've all come to expect over the past three decades:

    $ LC_COLLATE=C ; export LC_COLLATE       # collate in strict numeric order
    $ touch a A b B c C x X y Y z Z
    $ ls
    A B C X Y Z a b c x y z                  # expected sorted output
    $ ls | sort | fmt
    A B C X Y Z a b c x y z
    $ echo [a-z]
    a b c x y z
    $ echo [A-Z]
    A B C X Y Z

Below is the non-intuitive output that appears if you don't set the
character collation order to strict numeric, and you try to use ranges
with dashes in them:

    $ LC_COLLATE=en_US ; export LC_COLLATE   # many Linux distros set this!
    $ ls
    a A b B c C x X y Y z Z                  # note the new collate order!
    $ ls | sort | fmt
    a A b B c C x X y Y z Z
    $ echo [a-z]
    a A b B c C x X y Y z                    # note how 'Z' is outside the range!
    $ echo [A-Z]
    A b B c C x X y Y z Z                    # note how 'a' is outside the range!
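
The effect of the collating order can also be checked without creating
any files.  The sketch below is only an illustration: the function name
c_sort_words is made up for this example, and it assumes nothing more than
the standard sort and paste commands.  It pins the locale to "C" for the
one sort command, so the result should not depend on what the parent set:

```shell
#!/bin/sh -u
# Hypothetical helper (not part of the original notes): print the
# arguments on one line, sorted in strict numeric ("C") order.
c_sort_words () {
    # LC_ALL=C applies to this one sort command only.  LC_ALL over-rides
    # LC_COLLATE and LANG, so an inherited setting cannot change the order.
    printf '%s\n' "$@" | LC_ALL=C sort | paste -s -d ' ' -
}

c_sort_words a A b B z Z
```

Under strict numeric order the upper-case letters come out first, whatever
LC_COLLATE happens to be in the calling shell.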
With many modern Linux locale settings, such as en_US, en_CA, or even
en_CA.utf8, characters are not collated in strict numeric order; the
collating order places upper and lower case together, in this order:

    a A b B c C .... x X y Y z Z

and so the GLOB pattern [a-z] (which we expect to match only lower-case
letters) actually matches all the lower-case and all but one of the
upper-case letters (everything from 'a' to 'z') which means

    a A b B c C .... x X y Y z       (and not 'Z')!

The GLOB pattern [A-Z] (which we expect to match only upper-case letters)
actually matches all the upper-case letters and all but one of the
lower-case letters (everything from 'A' to 'Z') which means

    A b B c C .... x X y Y z Z       (and not 'a')!

The environment variables LC_* determine your "locale" and affect how
programs behave:

    LC_ADDRESS=en_US
    LC_COLLATE=C
    LC_CTYPE=en_US
    LC_IDENTIFICATION=en_US
    LC_MEASUREMENT=en_US
    LC_MESSAGES=en_US
    LC_MONETARY=en_US
    LC_NAME=en_US
    LC_NUMERIC=en_US
    LC_PAPER=en_US
    LC_SOURCED=1
    LC_TELEPHONE=en_US
    LC_TIME=en_US

The master variable "LC_ALL" over-rides them all, if set.

Of particular concern to shell scripts are LC_CTYPE (the type of
characters allowed, e.g. 7-bit ASCII or full 8-bit iso-latin-1 with
accents) and LC_COLLATE (the order of the characters in the alphabet).

If you use character ranges containing dashes (e.g. [a-z]), you must set
and export the LC_COLLATE "C" locale at the top of your script, to make
sure your ranges match the characters in strict numeric order.  Because
LC_ALL over-rides LC_COLLATE, also unset any LC_ALL inherited from the
parent:

    #!/bin/sh -u
    PATH=/bin:/usr/bin ; export PATH
    umask 022
    unset LC_ALL                             # LC_ALL would over-ride LC_COLLATE
    LC_COLLATE=C ; export LC_COLLATE         # collate in strict numeric order

------------------------------------------------
Internationalization and POSIX character classes
------------------------------------------------

In an international (non-English) world where characters include accents,
dashed ranges such as [a-z] and [A-Z] are wrong.
These ranges may not match accented characters at all, either upper- or
lower-case, and they can mis-handle alphabets with upper-case and
lower-case collated together.

If the LC_COLLATE order is set to strict numeric order ("C"), dashed
ranges behave predictably:

    $ unset LC_ALL
    $ LC_CTYPE=en_US ; export LC_CTYPE       # accept iso-latin-1 characters
    $ LC_COLLATE=C ; export LC_COLLATE       # collate in strict numeric order
    $ touch a A b B c C x X y Y z Z
    $ touch á Á é É                          # four latin-1 accented characters
    $ ls
    A B C X Y Z a b c x y z Á É á é
    $ ls | sort | fmt
    A B C X Y Z a b c x y z Á É á é
    $ echo [a-z]
    a b c x y z
    $ echo [A-Z]
    A B C X Y Z

The above shows that the latin-1 characters sort to the end (they are
high-value 8-bit characters) and are not matched by the GLOB ranges in a
strict numeric order collating sequence such as LC_COLLATE=C.

If we change the collating sequence away from strict numeric "C", the
GLOB ranges match a somewhat non-intuitive set of characters:

    $ unset LC_ALL
    $ LC_CTYPE=en_US ; export LC_CTYPE       # accept iso-latin-1 characters
    $ LC_COLLATE=en_US ; export LC_COLLATE   # collate together
    $ ls
    a A á Á b B c C é É x X y Y z Z
    $ ls | sort | fmt
    a A á Á b B c C é É x X y Y z Z
    $ echo [a-z]
    a A á Á b B c C é É x X y Y z            # note missing 'Z'
    $ echo [A-Z]
    A á Á b B c C é É x X y Y z Z            # note missing 'a'

Instead of using dashed character ranges (which misbehave, as you can
see above), many matching systems let you specify a POSIX standard
"class" of characters to match by name (e.g. "lower" and "upper"), and
these *do* work correctly to match even accented characters:

    $ unset LC_ALL
    $ LC_CTYPE=en_US ; export LC_CTYPE       # accept iso-latin-1 characters
    $ LC_COLLATE=C ; export LC_COLLATE       # collate in strict numeric order
    $ echo [[:lower:]]
    a b c x y z á é                          # all lower-case, nothing missing
    $ echo [[:upper:]]
    A B C X Y Z Á É                          # all upper-case, nothing missing

    $ LC_COLLATE=en_US ; export LC_COLLATE   # collate together
    $ echo [[:lower:]]
    a á b c é x y z                          # all lower-case, nothing missing
    $ echo [[:upper:]]
    A Á B C É X Y Z                          # all upper-case, nothing missing

    $ LC_CTYPE=C ; export LC_CTYPE           # accept only plain ASCII
    $ echo [[:lower:]]
    a b c x y z                              # only lower-case ASCII now
    $ echo [[:upper:]]
    A B C X Y Z                              # only upper-case ASCII now

While the order of the characters in the POSIX class changes with the
collating order, the list of characters matched does not - it is always
the correct list for the given CTYPE locale.  Contrast this with the
dashed [a-z] range used above, where the list of characters matched
changed non-intuitively depending on the collating order selected.

In multi-lingual countries such as Canada, pathnames will often contain
accents.  Your programs need to handle them correctly.  Avoid character
ranges containing dashes, and use the POSIX character classes that aren't
affected by the character collating sequence being used:

    $ rm [a-z]*           # WRONG - dependent on collating order
    $ rm [[:lower:]]*     # RIGHT - use the POSIX class that always works

To be safe, always start your scripts with a correct setting of
LC_COLLATE, unsetting LC_ALL first because LC_ALL over-rides LC_COLLATE:

    #!/bin/sh -u
    PATH=/bin:/usr/bin ; export PATH
    umask 022
    unset LC_ALL                             # LC_ALL would over-ride LC_COLLATE
    LC_COLLATE=C ; export LC_COLLATE         # collate in strict numeric order

-------------
Character Set
-------------

Many non-English languages have characters that don't fit into 8-bit
bytes.  The world has adopted the Unicode standard and its encodings
(such as UTF-8 and UTF-16) to allow for multi-byte characters, and many
(but not all) Unix/Linux programs know how to process files with
multi-byte characters.

What happens in a script when a program such as wc (word count) counts
the words and characters in a file?  If the file contains multi-byte
characters, should wc treat the multi-bytes as single characters, or
should wc count each byte as a separate character?  Should wc treat
non-ASCII bytes as word separators, or as parts of multi-byte characters?

Usually, there is no indication of which multi-byte encoding is in use
in a text file - one might find UTF-8 and UTF-16 files in the same
directory, and wc is sure to do the wrong thing with one or the other of
the files.

The LC_* and LANG environment variables affect how programs such as
wc interpret "characters" in files.  If they are set to anything other
than the "C" setting, you may find that some programs misbehave when
processing files that appear to have multi-byte characters in them.

Unless you are certain of your character set, your scripts must first
pre-emptively set the LANG and/or LC_COLLATE and/or LC_ALL variables to
"C" to prevent undefined or inconsistent behaviour.  The variables have
a precedence: LC_ALL over-rides the individual LC_* variables, which in
turn over-ride LANG.  A script that sets only LANG and LC_COLLATE must
therefore also unset any LC_ALL and LC_CTYPE inherited from the parent:

    #!/bin/sh -u
    PATH=/bin:/usr/bin ; export PATH
    umask 022
    unset LC_ALL LC_CTYPE                    # else these over-ride the lines below
    LC_COLLATE=C ; export LC_COLLATE         # collate in strict numeric order
    LANG=C ; export LANG                     # don't process multi-byte chars

-- 
| Ian! D. Allen  -  idallen@idallen.ca  -  Ottawa, Ontario, Canada
| Home Page: http://idallen.com/   Contact Improv: http://contactimprov.ca/
| College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
| Defend digital freedom:  http://eff.org/  and have fun:  http://fools.ca/