Character Sets

A number of the tools that we are using provide several ways of referring to sets of characters. These are:

enumerated - each character is explicitly listed, e.g. [0123456789]
ranges - the set of characters between two limits, e.g. [0-9]
named - the characters belonging to a named set, e.g. [[:digit:]]
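
As a rough illustration of the three notations, the sketch below tries each one in Python's re module. Python is used here only as a convenient example of a tool with character sets; it accepts enumerated sets and ranges but does not implement named sets, so the third pattern compiles yet does not mean "any digit".

```python
import re
import warnings

enumerated = re.compile(r"[0123456789]")
ranged = re.compile(r"[0-9]")

# Python's re does not implement POSIX named sets. It reads [[:digit:]] as
# the set of characters "[", ":", "d", "i", "g", "t" followed by a literal
# "]", and recent versions emit a FutureWarning about the nested-looking set.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    named = re.compile(r"[[:digit:]]")

results = {
    "enumerated": bool(enumerated.fullmatch("7")),
    "range": bool(ranged.fullmatch("7")),
    "named": bool(named.fullmatch("7")),
}
print(results)
```

The enumerated and range patterns match the digit 7, while the named pattern does not; it matches strings such as "d]" instead.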

Explicitly listing the characters in the set is by far the safest technique. It is the most portable approach, since all tools that provide character sets allow this notation, while some do not support character ranges or named sets. Furthermore, it is the only notation that allows you to be sure what characters are in the set.

The interpretation of the character range notation depends on the character set you are using. This is because a range means "the characters represented by the codes beginning with the code for the first character in the range and ending with the code for the last character in the range". Thus, if your character set is ASCII, [a-z] refers to the lowercase letters because the lowercase letters occupy the consecutive codes from 141 through 172 (octal). However, in EBCDIC (still in use on IBM mainframes), the lowercase letters are broken into three groups with the following octal codes:

a-i 201-211
j-r 221-231
s-z 242-251

The result is that [a-z] means "the characters ranging from 201 through 251". This includes a number of empty slots plus the tilde (~) at 241. Similarly, the range [A-Z] includes empty slots plus the right curly bracket (}) at 320 and the backslash (\) at 340.
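
The code ranges described above can be checked with a short Python sketch. It assumes that Python's cp037 codec is an acceptable stand-in for EBCDIC; cp037 is only one of several EBCDIC variants, and it assigns Latin-1 characters to slots that older EBCDIC tables leave empty, so it reports more stray characters than the ones named here. The helper names are invented for this illustration.

```python
def chars_in_code_range(first, last, encoding):
    """Decode every code from first through last in a single-byte encoding."""
    lo = first.encode(encoding)[0]
    hi = last.encode(encoding)[0]
    return bytes(range(lo, hi + 1)).decode(encoding, errors="replace")


def non_letters(chars):
    """Keep only the characters that are not ASCII letters."""
    return [c for c in chars if not (c.isascii() and c.isalpha())]


ascii_lower = chars_in_code_range("a", "z", "ascii")
ebcdic_lower = chars_in_code_range("a", "z", "cp037")
ebcdic_upper = chars_in_code_range("A", "Z", "cp037")

# Codes for a and z in octal, first in ASCII and then in cp037.
print(oct(ord("a")), oct(ord("z")))
print(oct("a".encode("cp037")[0]), oct("z".encode("cp037")[0]))

print(ascii_lower)                 # only the 26 lowercase letters
print(non_letters(ebcdic_lower))   # strays inside a-z, including the tilde
print(non_letters(ebcdic_upper))   # strays inside A-Z, including } and \
```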

Generally speaking, it is hard to know exactly what a range will include in a particular character encoding. In a Latin-1 environment, does [a-z] include á?
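
For the Latin-1 question, a minimal Python check of the raw code values looks like this. It only answers the question for a tool that compares codes directly; a locale-aware tool may decide differently.

```python
# "\u00e1" is the letter a with an acute accent.
a_acute = "\u00e1".encode("latin-1")[0]
lo = "a".encode("latin-1")[0]
hi = "z".encode("latin-1")[0]

# True only if the accented letter's code lies between the codes of a and z.
in_code_range = lo <= a_acute <= hi
print(hex(lo), hex(hi), hex(a_acute), in_code_range)
```

By code value the accented letter lies well above z, so a purely code-based [a-z] excludes it.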

The situation is further complicated by the fact that in the most recent "internationalized" computer systems, the meaning of ranges depends on the "locale", that is, the information in your environment about what language and writing system you are using. The locale determines not only what character set to use and things like the conventions for writing dates and numbers, but also the sort order to use. For example, even in a character set in which the letters a through z are contiguous, such as ASCII or Latin-1, different sort orders are possible. With ASCII, for example, [a-z] is equivalent to [abcdefghijklmnopqrstuvwxyz] when the sort order is "machine collating order", that is, the sequence of character codes, but not when it is "dictionary order", in which upper and lower case characters are interspersed. In such locales, [a-z] is roughly equivalent to [aAbBcCdD...zZ]. Both upper and lower case letters will match this, which is not what you want when testing whether a character is upper or lower case.
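
The effect of sort order on a range can be simulated without depending on which locales are installed. In the Python sketch below, range_members is a hypothetical helper that returns everything from the first limit through the last in a given ordering, and the "dictionary" ordering is an assumed one in which each lowercase letter is followed by its uppercase partner. Real locales differ in the details.

```python
import string

letters = string.ascii_uppercase + string.ascii_lowercase


def range_members(first, last, ordering):
    """Return the characters from first through last in the given ordering."""
    start = ordering.index(first)
    end = ordering.index(last)
    return ordering[start:end + 1]


# Machine collating order: sort by character code (A-Z, then a-z).
machine_order = "".join(sorted(letters))

# Assumed dictionary order: aAbBcC...zZ.
dictionary_order = "".join(
    sorted(letters, key=lambda c: (c.lower(), c.isupper()))
)

machine_range = range_members("a", "z", machine_order)
dictionary_range = range_members("a", "z", dictionary_order)
print(machine_range)
print(dictionary_range)
```

In machine order the range holds exactly the 26 lowercase letters. In the simulated dictionary order it also picks up the uppercase letters that sort between a and z. Whether the uppercase partner of an endpoint is included depends on which case the ordering puts first; here Z sorts after z and so falls just outside the range.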

On a Unix system that supports locales, you can force the ASCII machine collating sequence by setting the locale. In csh or tcsh, the command is:

setenv LC_ALL C

In sh-compatible shells such as bash, the equivalent is:

export LC_ALL=C

However, this is not a good general solution since you may well want the other features of the appropriate locale for your language and country.
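
One way to soften that drawback, offered here as a suggestion rather than something prescribed above, is to override only the collation category, LC_COLLATE, and leave the other categories alone. The Python sketch below does this inside a program and shows the resulting sort order. It demonstrates collation only; Python's own re module does not consult the locale when interpreting ranges.

```python
import functools
import locale

words = ["b", "A", "a", "B"]

# Change only the collation category; dates, numbers and the other
# categories keep whatever the environment selected.
locale.setlocale(locale.LC_COLLATE, "C")

# Sort using the locale's comparison function, which in the C locale
# follows the order of the character codes.
c_sorted = sorted(words, key=functools.cmp_to_key(locale.strcoll))
print(c_sorted)
```

In the C locale the uppercase letters sort before the lowercase ones, which is the machine collating order.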

Thus, it is safest and most portable always to enumerate the members of character sets explicitly. For a one-shot command typed directly on the command line in a familiar environment, abbreviations like character ranges and named sets are fairly safe, but if you are working in an unfamiliar environment, or writing a program that is likely to be reused, it is better to avoid them.
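
In a reusable program, enumerating need not mean typing the set out by hand each time. The Python sketch below builds explicit sets from the constants in the standard string module; the names LOWER, DIGIT and identifier are invented for this example.

```python
import re
import string

# string.ascii_lowercase is the literal "abcdefghijklmnopqrstuvwxyz", so the
# resulting pattern lists every member and never relies on range semantics.
LOWER = "[" + re.escape(string.ascii_lowercase) + "]"
DIGIT = "[" + re.escape(string.digits) + "]"

# A lowercase letter followed by any number of lowercase letters or digits.
identifier = re.compile(LOWER + "(?:" + LOWER + "|" + DIGIT + ")*")

print(LOWER)
print(bool(identifier.fullmatch("x25")), bool(identifier.fullmatch("X25")))
```

Because every member is spelled out in the pattern, its meaning does not change with the character encoding or the locale in which the program runs.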