Internationalization (i18n) – Collate (Sort) Order, Character Set, Accents, GLOB patterns

Ian! D. Allen - www.idallen.com

Winter 2014 - January to April 2014 - Updated 2018-04-09 08:30 EDT

1 Internationalization – “i18n”

This file should help you understand Unix/Linux scripts in a world of increasing internationalization (i18n).

I used to say that a shell script needed to set only two things, PATH and umask, to behave properly no matter what nonsense was set in the parent process that invokes the script:

#!/bin/sh -u
PATH=/bin:/usr/bin ; export PATH
umask 022

Internationalization imposes a third and fourth consideration: character collation order and the input character set.

The spirit of i18n is that we write our script once, and the behavior of that script is automatically tailored to suit specific users, depending on their locale, at the time they run it. For example, a script may display file names in a certain order for one user, and the same script might display those same file names in a different order to another user with a different locale from the first user.

Scripts may behave differently depending on the values of a family of locale-related environment variables set (or not set) in the parent process. This is the nature of i18n, but the differences can have serious consequences. The following sections explain the differences, and how a script writer can avoid consequences due to those differences.
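
As a starting point, a script can list the locale-related variables it inherited from its parent. The sketch below is only illustrative: the function name show_locale_vars is made up for this example, and it assumes that env, grep -E, and sort are available.

```shell
# show_locale_vars is a hypothetical helper name, not a standard tool.
# It prints every exported LANG, LANGUAGE, or LC_* variable, sorted by name.
show_locale_vars() {
    # The POSIX class [[:upper:]_] is used here instead of a dashed range.
    env | grep -E '^(LANG|LANGUAGE|LC_[[:upper:]_]+)=' | LC_ALL=C sort
}

show_locale_vars
```

If nothing is printed, none of these variables is exported, and programs fall back to the C locale.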

The examples in this document were run using the Bourne-Again Shell (bash), which understands locales. The system shell /bin/sh is often a link to /bin/dash, a shell that (as of April 2012) does not understand locales, so the examples will not work in scripts that use that locale-free shell.

2 LC_* – Locales, LC_COLLATE, LC_CTYPE

The environment variables LC_* determine your “locale” and affect how programs behave (see man locale for details):

LC_ADDRESS
LC_COLLATE
LC_CTYPE
LC_IDENTIFICATION
LC_MEASUREMENT
LC_MESSAGES
LC_MONETARY
LC_NAME
LC_NUMERIC
LC_PAPER
LC_TELEPHONE
LC_TIME

The master variable LC_ALL over-rides all the above, if set, so you have to make sure you unset LC_ALL if you want to set any of the other variables to different values.
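
The precedence of LC_ALL can be checked with a small experiment. This is a minimal sketch: the function name sort_under_lc_all_c is invented for the example, and en_US.utf8 is only a sample locale name that does not need to be installed for the result to hold.

```shell
# sort_under_lc_all_c is a hypothetical name used only for this demonstration.
# LC_ALL=C takes precedence, so the LC_COLLATE setting beside it is ignored.
sort_under_lc_all_c() {
    LC_ALL=C LC_COLLATE=en_US.utf8 sort
}

result=$(printf 'b\nA\na\nB\n' | sort_under_lc_all_c | paste -s -d ' ' -)
echo "$result"      # prints: A B a b
```

The output is in strict numeric (ASCII) order, which shows that the LC_COLLATE value was not used while LC_ALL was set.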

Of particular concern to shell scripts (and programs in general) are:

LC_CTYPE

The type of characters allowed, e.g. 7-bit single-byte ASCII or full 8-bit multi-byte UTF-8 that can express the full Unicode character set, including accents.

The LC_CTYPE variable determines whether a bit pattern (and how long a bit pattern) is considered a character. If the variable is unset (and neither LC_ALL nor LANG supplies a value), or is set to the old C value, only ASCII single-byte values are considered as characters and the rest are treated as non-characters.

LC_COLLATE

The sorting order of the characters in the character set.

If you use character ranges that use dashes, such as [a-z] or [A-Z], and you are expecting it to match only lower-case or upper-case English letters, the way it has for the first 30 years of Unix, your script will be broken in the age of i18n. In the i18n world, the character class ranges formed with a dash do not behave predictably across different locales. Stop using them. Use POSIX character classes (below) instead.

3 Character Set Collation Order – LC_COLLATE

Here is an example of the traditional strict numeric collation (sorting) order that English-speaking people have come to expect over the past three decades of English-only (ASCII) Unix scripting, where all the upper-case characters sort before all the lower-case characters:

$ unset LC_ALL                         # unset the over-ride variable
$ LC_COLLATE=C ; export LC_COLLATE     # collate in strict numeric order
$ touch a A b B c C x X y Y z Z
$ ls
A  B  C  X  Y  Z  a  b  c  x  y  z     # traditional ASCII sorted output
$ ls | sort | fmt
A B C X Y Z a b c x y z
$ echo [a-z]
a b c x y z
$ echo [A-Z]
A B C X Y Z

Below is the output that appears when the character collation order is not strictly numeric, and you display sorted strings or use character GLOB patterns or regular expression bracket expressions with dashes (ranges) in them:

$ unset LC_ALL                         # unset the over-ride variable
$ LC_COLLATE=en_US.utf8 ; export LC_COLLATE  # many Linux distros set this!
$ touch a A b B c C x X y Y z Z
$ ls
a  A  b  B  c  C  x  X  y  Y  z  Z     # note the new collate order!
$ ls | sort | fmt
a A b B c C x X y Y z Z                # note the new collate order!
$ echo [a-z]
a A b B c C x X y Y z                  # note how 'Z' is outside the range!
$ echo [A-Z]
A b B c C x X y Y z Z                  # note how 'a' is outside the range!

With many modern Linux locale settings, such as en_US.utf8, en_CA.utf8, etc., characters are not collated in the old ASCII numeric order; the collating order places upper and lower case together, in this order:

a A b B c C ... x X y Y z Z

In many locales in an i18n world, the GLOB pattern or regular expression range [a-z] that English speakers for the past 30 years have expected to match only lower-case letters, actually matches all the lower-case letters and all but one of the upper-case letters, which means a A b B c C ... x X y Y z (and not ‘Z’)!

In many locales in an i18n world, the GLOB pattern or regular expression range [A-Z] that English speakers for the past 30 years have expected to match only upper-case letters, actually matches all the upper-case letters and all but one of the lower-case letters, which means A b B c C ... x X y Y z Z (and not ‘a’)!

In an i18n world, the GLOB patterns and regular expression bracket expressions that use dashes (ranges) do not match what they used to match. They are now obsolete. Stop using them. Always use POSIX character classes, described below, instead.

3.1 Bracket expressions using dashed ranges are unpredictable

In an international (not-English-only) world, the old, predictable ASCII dashed ranges such as [a-z] and [A-Z] are wrong. These ranges may not match accented characters, either upper- or lower-case, and they can mis-handle locales with alphabets having upper-case and lower-case collated together.

If the LC_COLLATE sort order happens to be set to the old strict numeric order (“C”), dashed ranges behave the way they have for the first 30 years of Unix, and accented characters (and anything non-ASCII) sort after all the ASCII letters:

$ unset LC_ALL                       # unset the over-ride variable
$ LC_CTYPE=en_US.utf8 ; export LC_CTYPE   # handle UTF-8 characters
$ LC_COLLATE=C ; export LC_COLLATE   # collate in strict numeric order
$ touch a A b B c C x X y Y z Z
$ touch á Á é É                      # four utf8 accented characters
$ ls
A  B  C  X  Y  Z  a  b  c  x  y  z  Á  É  á  é
$ ls | sort | fmt
A B C X Y Z a b c x y z Á É á é
$ echo [a-z]
a b c x y z
$ echo [A-Z]
A B C X Y Z

The above shows that the non-ASCII UTF-8 characters sort to the end (their bytes all have values above the ASCII range) and are not matched by the GLOB or regular expression ranges in a strict numeric order collating sequence such as LC_COLLATE=C.

If we change the collating sequence away from the old strict numeric “C” sort order, the character ranges match a somewhat non-intuitive (to English users) set of characters:

$ unset LC_ALL                          # unset the over-ride variable
$ LC_CTYPE=en_US.utf8 ; export LC_CTYPE # handle UTF-8 characters
$ LC_COLLATE=en_US.utf8 ; export LC_COLLATE  # collate together
$ ls
a  A  á  Á  b  B  c  C  é  É  x  X  y  Y  z  Z
$ ls | sort | fmt
a A á Á b B c C é É x X y Y z Z
$ echo [a-z]
a A á Á b B c C é É x X y Y z           # note missing 'Z'
$ echo [A-Z]
A á Á b B c C é É x X y Y z Z           # note missing 'a'

When you write a script that uses a dash range such as [a-z], depending on locale, your script could match either the old ASCII range or the new i18n range that includes most upper-case characters, but not Z!

This unpredictability is not good. The solution is to never use dashed character ranges. When you need to match a class of characters such as “all lower-case letters”, use the POSIX standard named character classes.
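
To find out which behaviour a particular shell and locale give, a script can probe the range directly. The sketch below is an illustration only; range_matches_upper is a made-up name. Note that bash has a globasciiranges shell option that, when set, makes ranges behave as in the C locale; whether it is on by default depends on the bash version and build, so treat the result as something to check rather than assume.

```shell
# range_matches_upper is a hypothetical helper: it succeeds if the dashed
# range [a-z] matches the upper-case letter B in this shell and locale.
range_matches_upper() {
    case B in
        [a-z]) return 0 ;;
        *)     return 1 ;;
    esac
}

if range_matches_upper; then
    echo "[a-z] also matches upper-case letters here"
else
    echo "[a-z] matches only lower-case ASCII letters here"
fi
```

Either answer is possible depending on the environment, which is exactly why the dashed range should not be used.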

3.2 Using POSIX character classes instead of dashed ranges

Instead of using dashed character ranges that misbehave when applied to international character sets, many matching systems let you specify a POSIX standard “class” of characters to match by name (e.g. “lower” and “upper”), and these do work correctly in all locales to match even accented and other non-ASCII characters:

$ unset LC_ALL
$ LC_CTYPE=en_US.utf8 ; export LC_CTYPE # handle UTF-8 characters

$ LC_COLLATE=C ; export LC_COLLATE      # collate in strict numeric order
$ echo [[:lower:]]
a b c x y z á é                         # all lower-case, nothing missing
$ echo [[:upper:]]
A B C X Y Z Á É                         # all upper-case, nothing missing

$ LC_COLLATE=en_US.utf8 ; export LC_COLLATE  # collate together
$ echo [[:lower:]]
a á b c é x y z                         # all lower-case, nothing missing
$ echo [[:upper:]]
A Á B C É X Y Z                         # all upper-case, nothing missing

As mentioned above, the LC_CTYPE environment variable tells commands what types of bit patterns are considered characters. Continuing the above example, a change in LC_CTYPE can reject the UTF-8 characters:

$ LC_CTYPE=en_US.utf8 ; export LC_CTYPE # handle UTF-8 characters
$ echo [[:lower:]]
a b c x y z á é                         # all lower-case, including UTF-8

$ LC_CTYPE=C ; export LC_CTYPE          # accept only plain ASCII
$ echo [[:lower:]]
a b c x y z                             # only lower-case ASCII now

While the order of the characters in the POSIX class changes with the collating order, the list of characters matched does not – it is always the correct list for the given LC_CTYPE locale. Contrast this with the dashed [a-z] range used above, where the list of characters matched changed non-intuitively depending on the collating order selected.
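
The same named classes work inside regular expressions. As a small sketch (count_capitalized is a name invented for this example), grep can count the input lines that begin with an upper-case letter without using [A-Z]:

```shell
# count_capitalized is a hypothetical helper: count the lines on standard
# input that start with an upper-case letter, as defined by LC_CTYPE.
count_capitalized() {
    grep -c '^[[:upper:]]'
}

printf 'Alpha\nbeta\nGamma\n' | count_capitalized     # prints: 2
```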

In multi-lingual countries such as Canada, pathnames will often contain accented letters and other non-ASCII characters. Your programs need to handle these pathnames correctly. Stop using character ranges containing dashes, and use the POSIX character classes that aren’t affected by the character collating sequence being used:

$ rm [a-z]*          # WRONG - dependent on collating order
$ rm [[:lower:]]*    # RIGHT - use the POSIX class that always works
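
POSIX classes can also be used in the patterns of a case statement. The following is a minimal sketch, with classify_first_char being a hypothetical function name, that sorts names into groups by their first character:

```shell
# classify_first_char is a hypothetical helper: report the class of the
# first character of its argument, using POSIX named classes only.
classify_first_char() {
    case $1 in
        [[:lower:]]*) echo lower ;;
        [[:upper:]]*) echo upper ;;
        [[:digit:]]*) echo digit ;;
        *)            echo other ;;
    esac
}

classify_first_char report.txt      # prints: lower
classify_first_char README          # prints: upper
classify_first_char 2014-notes      # prints: digit
classify_first_char _private        # prints: other
```

Which non-ASCII letters count as lower-case or upper-case still depends on LC_CTYPE, but the result no longer depends on the collating order.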

3.3 Forcing the old ASCII collating sequence

This section is advanced material not intended for new scripts. Techniques in this section are not recommended, but may be necessary in some special circumstances, or to fix an old script that cannot be easily upgraded for i18n or changed away from using dash ranges.

If your script must perform an operation where a difference in sort order can cause incorrect behavior in your script, then you might set the LC_COLLATE environment variable to select a fixed collating order, either for the whole script or just for that one operation. The syntax below is for a Bourne-style shell:

LC_COLLATE=C ; export LC_COLLATE   # set for whole script, OR
LC_COLLATE=C command               # set for one command only

For example, suppose my_command is a command that relies on ls producing output in the order given by LC_COLLATE=C. In that case, we would write this pipeline in the script to force the required sort order from ls before feeding it to my_command:

LC_COLLATE=C ls | my_command

The disadvantage with this approach is that the C locale must be supported on all machines where the script is run. It would really be better to find a different way to do the operation such that it didn’t depend on locale settings at all.
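
Because LC_ALL, if set, takes precedence over LC_COLLATE, a one-command setting only works when LC_ALL is not set. One possible way to handle that is a small wrapper that runs the command in a subshell; the name with_c_collate is made up for this sketch.

```shell
# with_c_collate is a hypothetical wrapper: run one command with C collation
# without changing the locale settings of the rest of the script.
with_c_collate() {
    (
        unset LC_ALL            # LC_ALL would otherwise hide LC_COLLATE
        LC_COLLATE=C
        export LC_COLLATE
        "$@"
    )
}

printf 'b\nA\na\nB\n' | with_c_collate sort | paste -s -d ' ' -   # prints: A B a b
```

Unsetting LC_ALL inside the subshell means the other locale categories fall back to their own LC_* or LANG values for that one command, which may or may not be what a given script wants.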

4 Character Set – ASCII, UTF-8, and LC_CTYPE and/or LANG

Many non-English languages have characters that don’t fit into the single 8-bit bytes used by most computer hardware. The world has adopted standards such as Unicode and its UTF-8 encoding to allow for multi-byte characters, and many (but not all) Unix/Linux programs know how to process files with multi-byte characters.

What happens in a script when a program such as wc (word count) counts the words and characters in a file? If the file contains multi-byte characters, should wc treat the multi-byte characters as single characters, or should wc count each byte as a separate character? Should wc treat non-ASCII bytes as word separators, or as parts of multi-byte characters? The LC_CTYPE and LANG variables affect this. (LANG supplies the default for any LC_* variable that is not set, so it is used here only if LC_CTYPE is not set.)

The LC_* and LANG environment variables affect how programs such as wc interpret “characters” in files:

$ LC_CTYPE=C ; export LC_CTYPE          # don't process multi-byte chars
$ echo 'àéïöüÿç' | wc -m                # 7 two-byte UTF-8 characters + newline
15                                      # 15 individual bytes counted
$ LC_CTYPE=en_US.utf8 ; export LC_CTYPE # handle UTF-8 multi-byte characters
$ echo 'àéïöüÿç' | wc -m                # same 7 characters + newline
8                                       # now counts 8 characters, not 15 bytes

Unfortunately, there is no indication of which character encoding is in use inside a text file – one might find ASCII, UTF-8, Latin-1, and other Unicode encodings such as UTF-16 in the same directory, and counting “characters” using wc -m is sure to do the wrong thing any time the current LC_CTYPE setting doesn’t match the encoding used inside a file.

When counting characters using wc -m, one must ensure that the locale matches the type of character encoding used in each file.
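
One rough way to see whether the current locale is treating input as multi-byte is to compare the byte count with the character count. The sketch below assumes a wc that supports both -c and -m; the helper names count_bytes, count_chars, and sample are invented for the example.

```shell
# count_bytes, count_chars, and sample are hypothetical helper names.
count_bytes() { wc -c | tr -d '[:space:]'; }   # bytes
count_chars() { wc -m | tr -d '[:space:]'; }   # characters, per LC_CTYPE

# Two accented letters (à é) written as octal UTF-8 byte escapes, plus a newline.
sample() { printf '\303\240\303\251\n'; }

bytes=$(sample | count_bytes)
chars=$(sample | count_chars)
if [ "$bytes" -ne "$chars" ]; then
    echo "multi-byte characters seen: $bytes bytes, $chars characters"
else
    echo "one byte per character under this locale: $bytes bytes"
fi
```

The tr command strips the padding that some versions of wc put around the number. The comparison says nothing about whether the file's real encoding matches the locale; it only shows how the current LC_CTYPE setting interprets the bytes.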

4.1 Forcing the old ASCII character type

This section is advanced material not intended for new scripts. Techniques in this section are not recommended, but may be necessary in some special circumstances, or to fix an old script that cannot be easily upgraded for i18n or changed to handle non-ASCII character sets.

Modern versions of wc distinguish between always counting separate bytes (wc -c) and counting characters (wc -m). Some older versions of wc don’t have the newer -m option and they unfortunately treat -c as if it were the locale-dependent -m. These old versions of wc don’t have any way of counting only bytes unless you set LC_CTYPE=C to force the byte-counting behaviour:

LC_CTYPE=C wc -c file_with_bytes_to_count.bin     # for legacy wc commands

The disadvantage with that approach is that it relies on the C locale being supported on the machine where the script is eventually run. It would be better to find a way to do it that does not depend on locale.

Newer versions of wc have been modified so that wc -c always counts single bytes, no matter what the locale.

5 Summary of i18n

Shell scripts and programs must NEVER use the old legacy character ranges such as [a-z] and must ALWAYS use the POSIX character classes such as [[:alpha:]]. This applies to both GLOB patterns and regular expressions.

Our goal is to write our scripts so that they are not broken by the user’s specific locale environment variables, and in most cases avoiding legacy character ranges will suffice.

Advanced: In situations where the program logic (that is, proper functioning) of the script relies on a certain LC_CTYPE or LC_COLLATE setting, it is best if the script writer finds a different way to do the operation such that LC_CTYPE and LC_COLLATE don’t matter. Failing that, the script might set those variables for the duration of that command as shown in the examples above. However, the disadvantage of that approach is that in general the script writer cannot know what locales are supported on every machine where the script might eventually be run.
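
If a script does have to set a specific locale, it can at least check first whether that locale is installed. This is a sketch under assumptions: it relies on the locale -a command being present, the function name locale_available is made up, and the match is on the exact spelling printed by locale -a (some systems list en_US.utf8, others en_US.UTF-8).

```shell
# locale_available is a hypothetical helper: succeed if the named locale
# appears in the output of "locale -a" exactly as spelled.
locale_available() {
    command -v locale >/dev/null 2>&1 || return 1
    locale -a 2>/dev/null | grep -q -x -F -- "$1"
}

if locale_available en_US.utf8; then
    echo "en_US.utf8 is installed"
else
    echo "en_US.utf8 not found"
fi
```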

Author: 
| Ian! D. Allen, BA, MMath  -  idallen@idallen.ca - Ottawa, Ontario, Canada
| Home Page: http://idallen.com/   Contact Improv: http://contactimprov.ca/
| College professor (Free/Libre GNU+Linux) at: http://teaching.idallen.com/
| Defend digital freedom:  http://eff.org/  and have fun:  http://fools.ca/
