A character set is a set of characters used to find a pattern match. Because character sets are placed in brackets, they are sometimes referred to as bracket expressions. A character set usually matches any one of the characters described by the expression; however, some metacharacters have different meanings inside a bracket expression.
Character sets are useful, but frequently misunderstood. POSIX extends them to include collating sequences and equivalence classes. POSIX also includes several predefined character sets.
Collating sequences and equivalence classes were introduced to handle eight-bit and multibyte locales. The term "collation" refers to the ordering of a character set. For example, the characters a–z are guaranteed to sort with a before b, no matter what representation is used within the computer. Collation is performed based on collating elements. A collating element can be a single character, a multicharacter sequence that collates as if it were a single character, or a collating-sequence name. Users in the C and POSIX locales usually have little need for collating sequences and equivalence classes.
An equivalence class defines a set of characters with the same value for the purposes of collation. For example, if the locale supports accented characters, all accented a characters might be given the same weight for collation; therefore, à, á, â and ã sort identically.
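The idea of an equivalence class can be approximated outside a regular expression engine. The following is a minimal sketch, not how a POSIX implementation works: real equivalence classes come from the collation weights in the locale definition, whereas this sketch assumes that Unicode canonical decomposition (NFD) is a good enough stand-in and treats characters with the same base letter as equivalent. The helper names `base_letter` and `same_equivalence_class` are hypothetical.

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Return ch with combining marks removed after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def same_equivalence_class(a: str, b: str) -> bool:
    """Hypothetical helper: characters with the same base letter count as equivalent."""
    return base_letter(a) == base_letter(b)


print([base_letter(c) for c in "àáâã"])   # ['a', 'a', 'a', 'a']
print(same_equivalence_class("à", "ã"))   # True
print(same_equivalence_class("à", "e"))   # False
```

In a locale that defines such a class, the bracket expression [[=a=]] would be expected to match every member of the group that the sketch maps to the base letter a.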
The following table describes characters that can be put inside a bracket expression:
Sample pattern | Description |
---|---|
c | Any character within brackets matches itself. For example, [abc] matches a or b or c. You can also represent a range of characters with a minus sign (-). For example, [a-c] is identical to [abc]. To include - in the set, make it the first or last character. To include ] in the set, make it the first character (or the first character after a leading ^). The character set to find [, ], or - is [][-]. |
* \ . [ | These characters lose their special meaning when they are placed inside bracket expressions; therefore, they match themselves. |
^ | If this is the first character in a bracket expression, it inverts the meaning of the expression. That is, the expression matches any one character that is not between the brackets. |
[.c.] | Represents a collating element. The element indicated by c can be one or more characters, as long as it represents a single collating element. In the C and POSIX locales, all characters are single bytes. The regular expression [[.a.]-[.z.]] is identical to [a-z]. |
[=c=] | Represents an equivalence class. The element c can represent one or more characters, which are all treated equally for collating purposes. |
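The rules for literal characters, ranges, and negation can be checked with a short script. This is a sketch that uses Python's `re` module, which is an assumption of convenience: `re` is not a POSIX engine, does not implement [.c.] or [=c=], and still treats a backslash as an escape inside brackets. For the patterns shown here its behavior agrees with the table. The helper `matching_chars` is hypothetical.

```python
import re


def matching_chars(bracket_expr: str, candidates: str) -> list:
    """Return the characters in candidates that bracket_expr matches on its own."""
    return [c for c in candidates if re.fullmatch(bracket_expr, c)]


# A list of characters and the equivalent range select the same characters.
print(matching_chars("[abc]", "abcd"))   # ['a', 'b', 'c']
print(matching_chars("[a-c]", "abcd"))   # ['a', 'b', 'c']

# ']' placed first and '-' placed last are literal.
print(matching_chars("[][-]", "[]-a"))   # ['[', ']', '-']

# '*', '.', and '[' lose their special meaning inside brackets.
print(matching_chars("[*.[]", "*.[a"))   # ['*', '.', '[']

# A leading '^' inverts the set.
print(matching_chars("[^abc]", "abcx"))  # ['x']
```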
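The following character classes are defined. A class name is recognized only inside a bracket expression; for example, the pattern that matches one decimal digit is [[:digit:]], not [:digit:].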
Character class | Description |
---|---|
[:alnum:] | Matches any alphabetic character or a digit from 0 to 9. |
[:alpha:] | Matches any alphabetic character. |
[:blank:] | Matches the space and tab characters. |
[:cntrl:] | Matches any control character. |
[:digit:] | Matches any decimal digit, from 0 to 9. |
[:graph:] | Matches any printing character except space. |
[:lower:] | Matches any lowercase letter. |
[:print:] | Matches any printable character, including the space character. |
[:punct:] | Matches any punctuation character: any printing character that is not a space, a letter, or a digit. |
[:space:] | Matches any of the white-space characters space, form-feed, newline, carriage return, horizontal tab, and vertical tab. |
[:upper:] | Matches any uppercase letter. |
[:xdigit:] | Matches any letter or digit that could be part of a hexadecimal number; that is, the digits 0 to 9, the letters A through F, and the letters a through f. |
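The relationships between the classes are easier to see when each class is written out as an explicit set. The following sketch assumes the C/POSIX locale and ASCII only; in other locales the classes can contain additional characters. Python's `re` module does not recognize the [:name:] syntax, so the sketch uses plain sets, and the names `POSIX_CLASSES` and `in_class` are hypothetical.

```python
import string

# ASCII-only membership of each class, as assumed for the C/POSIX locale.
ALPHA = set(string.ascii_letters)
DIGIT = set(string.digits)
ALNUM = ALPHA | DIGIT
PUNCT = set(string.punctuation)
GRAPH = ALNUM | PUNCT            # every visible character
PRINT = GRAPH | {" "}            # graph plus the space character

POSIX_CLASSES = {
    "alnum": ALNUM,
    "alpha": ALPHA,
    "blank": set(" \t"),
    "cntrl": {chr(c) for c in range(0x20)} | {chr(0x7F)},
    "digit": DIGIT,
    "graph": GRAPH,
    "lower": set(string.ascii_lowercase),
    "print": PRINT,
    "punct": PUNCT,
    "space": set(" \t\n\v\f\r"),
    "upper": set(string.ascii_uppercase),
    "xdigit": set(string.hexdigits),
}


def in_class(name: str, ch: str) -> bool:
    """Return True if ch belongs to the named class in this ASCII sketch."""
    return ch in POSIX_CLASSES[name]


print(in_class("print", " "), in_class("graph", " "))   # True False
print(in_class("space", "\n"))                          # True
print(in_class("punct", "_"), in_class("alnum", "_"))   # True False
```

Under these assumptions, [:print:] and [:cntrl:] together cover all 128 ASCII characters without overlapping, and [:graph:] is exactly [:alnum:] plus [:punct:].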