=====================================================
Collate order - making GLOB patterns and sorting work
=====================================================
-IAN! idallen@idallen.ca

(This file is not a required part of NET2003 - it is optional.)

I used to say that a shell script needed to set only two things to behave
properly no matter what nonsense was set in the parent: PATH and umask

    #!/bin/sh -u
    PATH=/bin:/usr/bin ; export PATH
    umask 022

I've discovered a third necessity: setting character collation order.

Here is an example of the intuitive, strict numeric collation order
we've all come to expect over the past decades:

    $ LC_COLLATE=C ; export LC_COLLATE       # collate in strict numeric order
    $ touch a A b B c C x X y Y z Z
    $ ls
    A  B  C  X  Y  Z  a  b  c  x  y  z       # expected sorted output
    $ ls | sort | fmt
    A B C X Y Z a b c x y z
    $ echo [a-z]
    a b c x y z
    $ echo [A-Z]
    A B C X Y Z

Below is the non-intuitive output that appears if you don't set the
character collation order to strict numeric, and you try to use ranges
with dashes in them:

    $ LC_COLLATE=en_US ; export LC_COLLATE   # many Linux distros set this!
    $ ls
    a  A  b  B  c  C  x  X  y  Y  z  Z       # note the new collate order!
    $ ls | sort | fmt
    a A b B c C x X y Y z Z
    $ echo [a-z]
    a A b B c C x X y Y z       # note how 'Z' is outside the range!
    $ echo [A-Z]
    A b B c C x X y Y z Z       # note how 'a' is outside the range!

With many modern Linux locale settings, such as en_US, the character set
is not laid out in strict numeric order; the collating order places upper
and lower case together, in this order:

    a A b B c C .... x X y Y z Z

and so the GLOB pattern [a-z] (which we expect to match only lower-case
letters) actually matches all the lower-case and all but one of the
upper-case letters (everything from 'a' to 'z') which means

    a A b B c C .... x X y Y z     (and not 'Z')!

The GLOB pattern [A-Z] (which we expect to match only upper-case letters)
actually matches all the upper-case letters and all but one of the
lower-case letters (everything from 'A' to 'Z') which means

    A b B c C .... x X y Y z Z     (and not 'a')!

The environment variables LC_* determine your "locale" and affect how
programs behave:

    LC_ADDRESS=en_US
    LC_COLLATE=C
    LC_CTYPE=en_US
    LC_IDENTIFICATION=en_US
    LC_MEASUREMENT=en_US
    LC_MESSAGES=en_US
    LC_MONETARY=en_US
    LC_NAME=en_US
    LC_NUMERIC=en_US
    LC_PAPER=en_US
    LC_SOURCED=1
    LC_TELEPHONE=en_US
    LC_TIME=en_US

The master variable "LC_ALL" overrides them all, if set, so a script that
relies on LC_COLLATE must make sure LC_ALL is unset.

Of particular concern to shell scripts are LC_CTYPE (the type of
characters allowed, e.g. 7-bit ASCII or full 8-bit iso-latin-1 with
accents) and LC_COLLATE (the order of the characters in the alphabet).

If you use character ranges containing dashes (e.g. [a-z]), you must set
and export the LC_COLLATE "C" locale at the top of your script, to make
sure your ranges match the characters in strict numeric order:

    #!/bin/sh -u
    PATH=/bin:/usr/bin ; export PATH
    umask 022
    LC_COLLATE=C ; export LC_COLLATE    # collate in strict numeric order

------------------------------------------------
Internationalization and POSIX character classes
------------------------------------------------

In an international (non-English) world where characters include accents,
dashed ranges such as [a-z] and [A-Z] are wrong. These ranges may not
match accented characters at all, either upper- or lower-case, and they
can mis-handle alphabets with upper-case and lower-case collated together.

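The first claim (a dashed range missing accented characters entirely) can
be checked without creating any files. What follows is only a minimal
sketch: the helper name in_az is hypothetical and not part of these notes,
and the sketch assumes the script text itself is saved as UTF-8 or
iso-latin-1. It forces the "C" collating order so the result should not
depend on which other locales happen to be installed.

```shell
unset LC_ALL                          # LC_ALL, if set, would override LC_COLLATE
LC_COLLATE=C ; export LC_COLLATE      # collate in strict numeric order

# in_az: hypothetical helper, prints "yes" if its argument is exactly one
# character inside the dashed range [a-z], otherwise prints "no"
in_az () {
    case $1 in
        [a-z]) echo yes ;;
        *)     echo no ;;
    esac
}

for ch in a Z é ; do
    echo "$ch: $(in_az "$ch")"
done
# prints:
#   a: yes
#   Z: no
#   é: no
```

The lower-case 'é' is rejected along with the upper-case 'Z', which is the
reason the dashed range is the wrong tool once accents are involved.
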
If the LC_COLLATE order is set to strict numeric order ("C"), dashed
ranges behave predictably:

    $ unset LC_ALL
    $ LC_CTYPE=en_US ; export LC_CTYPE       # accept iso-latin-1 characters
    $ LC_COLLATE=C ; export LC_COLLATE       # collate in strict numeric order
    $ touch a A b B c C x X y Y z Z
    $ touch á Á é É                          # four latin-1 accented characters
    $ ls
    A  B  C  X  Y  Z  a  b  c  x  y  z  Á  É  á  é
    $ ls | sort | fmt
    A B C X Y Z a b c x y z Á É á é
    $ echo [a-z]
    a b c x y z
    $ echo [A-Z]
    A B C X Y Z

The above shows that the latin-1 characters sort to the end (they are
high-value 8-bit characters) and are not matched by the GLOB ranges in a
strict numeric order collating sequence such as LC_COLLATE=C.

If we change the collating sequence away from strict numeric "C", the GLOB
ranges match a somewhat non-intuitive set of characters:

    $ unset LC_ALL
    $ LC_CTYPE=en_US ; export LC_CTYPE       # accept iso-latin-1 characters
    $ LC_COLLATE=en_US ; export LC_COLLATE   # collate together
    $ ls
    a  A  á  Á  b  B  c  C  é  É  x  X  y  Y  z  Z
    $ ls | sort | fmt
    a A á Á b B c C é É x X y Y z Z
    $ echo [a-z]
    a A á Á b B c C é É x X y Y z        # note missing 'Z'
    $ echo [A-Z]
    A á Á b B c C é É x X y Y z Z        # note missing 'a'

Instead of using dashed character ranges (which misbehave, as you can see
above), many matching systems let you specify a POSIX standard "class" of
characters to match by name (e.g. "lower" and "upper"), and these *do*
work correctly to match even accented characters:

    $ unset LC_ALL
    $ LC_CTYPE=en_US ; export LC_CTYPE       # accept iso-latin-1 characters
    $ LC_COLLATE=C ; export LC_COLLATE       # collate in strict numeric order
    $ echo [[:lower:]]
    a b c x y z á é             # all lower-case, nothing missing
    $ echo [[:upper:]]
    A B C X Y Z Á É             # all upper-case, nothing missing

    $ LC_COLLATE=en_US ; export LC_COLLATE   # collate together
    $ echo [[:lower:]]
    a á b c é x y z             # all lower-case, nothing missing
    $ echo [[:upper:]]
    A Á B C É X Y Z             # all upper-case, nothing missing

    $ LC_CTYPE=C ; export LC_CTYPE           # accept only plain ASCII
    $ echo [[:lower:]]
    a b c x y z                 # only lower-case ASCII now
    $ echo [[:upper:]]
    A B C X Y Z                 # only upper-case ASCII now

While the order of the characters in the POSIX class changes with the
collating order, the list of characters matched does not - it is always
the correct list for the given CTYPE locale. Contrast this with the
dashed [a-z] range used above, where the list of characters matched
changed non-intuitively depending on the collating order selected.

In multi-lingual countries such as Canada, pathnames will often contain
accents. Your programs need to handle them correctly. Avoid character
ranges containing dashes, and use the POSIX character classes that aren't
affected by the character collating sequence being used:

    $ rm [a-z]*           # WRONG - dependent on collating order
    $ rm [[:lower:]]*     # RIGHT - use the POSIX class that always works

To be safe, always start your scripts with a correct setting of LC_COLLATE:

    #!/bin/sh -u
    PATH=/bin:/usr/bin ; export PATH
    umask 022
    LC_COLLATE=C ; export LC_COLLATE    # collate in strict numeric order
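
Putting the pieces together, here is a minimal sketch of a script that
starts with this header and then checks its own GLOB ranges. The
"unset LC_ALL" line is an addition, based on the note above that LC_ALL
over-rides the other LC_* variables. The scratch directory and the
variable names tmpdir, lower and upper are made up for illustration, and
the sketch assumes a case-sensitive file system with mktemp available.

```shell
#!/bin/sh -u
PATH=/bin:/usr/bin ; export PATH
umask 022
unset LC_ALL                            # addition: LC_ALL would override LC_COLLATE
LC_COLLATE=C ; export LC_COLLATE        # collate in strict numeric order

tmpdir=$(mktemp -d) || exit 1           # hypothetical scratch directory
cd "$tmpdir" || exit 1
touch a A b B z Z                       # needs a case-sensitive file system

lower=$(echo [a-z])                     # ranges now follow strict numeric order
upper=$(echo [A-Z])
echo "lower: $lower"
echo "upper: $upper"
# prints:
#   lower: a b z
#   upper: A B Z

cd / && rm -rf "$tmpdir"                # clean up the scratch directory
```

If the two lines printed ever mix upper and lower case, the collating
order is still being taken from somewhere other than LC_COLLATE=C.
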