Character lists [a-z] and classes \s

Character lists and classes serve the same purpose: to match one character of the specified kind. With a list, you explicitly specify the characters you want to match. With a class, you specify only the type (letter, digit, etc.).

Character lists

A character list matches one character from the specified set. You can list all allowable characters (e.g., [abcdef]) or use a range (e.g., [a-f]):

[aeiouy] Any English vowel
[a-f] or [abcdef] Any letter from a to f
[0-9a-f] A hexadecimal digit (from 0 to 9, or from A to F)
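
The three lists above can be tried in any engine that uses bracket syntax. The sketch below uses Python's re module purely as a stand-in (it is not Aba's engine), and the sample strings are made up for illustration.

```python
import re

# Each list matches exactly one character at a time,
# so findall returns single-character matches.
vowels = re.findall(r"[aeiouy]", "regular expression")
a_to_f = re.findall(r"[a-f]", "deadbeef cafe")
hex_digits = re.findall(r"[0-9a-f]", "color: #1e90ff;")

print(vowels)      # ['e', 'u', 'a', 'e', 'e', 'i', 'o']
print(a_to_f)      # every letter of "deadbeef" and "cafe", but not the space
print(hex_digits)  # ['c', '1', 'e', '9', '0', 'f', 'f']
```

Note that the c of "color" is also reported: a list looks at one character only and knows nothing about the surrounding word.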

The letters in a range are ordered by their Unicode code points, even if you search in ANSI files. You can find the order of letters in the Unicode charts or in the Windows Character Map (charmap) utility. For example, to search for an extended ANSI character, use [¡-ÿ], because U+00A1 ¡ is the first printable character in the Latin-1 block, and U+00FF ÿ is the last one.

[µÞ-öø-ÿ] A letter from Latin-1 charset
[Þ-ÿ] A Latin-1 letter (if you don't care about µ and ÷)
[a-zÞ-ÿ] A Latin-1 letter, including basic Latin alphabet
[À-ž] A letter from “Latin-1” or “Latin Extended A” blocks (for Western European languages)
[α-ω] A Greek letter (from alpha to omega)
[а-яё] A Russian letter
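
To see why the endpoints are chosen this way, it can help to print their code points. The sketch below uses Python's ord() and re module as a stand-in (Python also orders ranges by code point). It shows, for instance, why ё is listed separately in the Russian example.

```python
import re

# Code points of the range endpoints used in the examples above.
endpoints = {ch: f"U+{ord(ch):04X}" for ch in "¡ÿαωаяё"}
print(endpoints)

# ё (U+0451) lies just past я (U+044F), so the range а-я alone misses it.
without_yo = re.findall(r"[а-я]", "ёж")
with_yo = re.findall(r"[а-яё]", "ёж")
print(without_yo)  # ['ж']
print(with_yo)     # ['ё', 'ж']
```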

This behavior differs from grep and other Unix regex tools. For example, when using grep with a French locale, the range [a-z] includes the letters with diacritics (àéè). In Aba, [a-z] matches only letters from the basic Latin alphabet. To include the letters with diacritical marks in the search, use [a-zÞ-ÿ].
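
Python's re module happens to order ranges by code point as well, so it can serve as a rough illustration of the difference between the two lists (a stand-in only, not Aba itself). The sample word is written with escapes so that é is guaranteed to be the single code point U+00E9.

```python
import re

text = "\u00e9t\u00e9"  # the French word "été"

basic_only = re.findall(r"[a-z]", text)
with_latin1 = re.findall(r"[a-z\u00de-\u00ff]", text)  # same list as [a-zÞ-ÿ]

print(basic_only)   # ['t']
print(with_latin1)  # ['é', 't', 'é']
```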

A range cannot be larger than 256 characters. For example, [Z-Ж] is wrong (here, Z is a Latin letter and Ж is a Cyrillic letter).
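
A quick way to check a range before using it is to count the code points it spans. The helper below, range_size, is a hypothetical name introduced for this sketch. It assumes the 256-character limit counts both endpoints, which the text does not specify.

```python
def range_size(first: str, last: str) -> int:
    """Number of code points covered by the range first-last, inclusive."""
    return ord(last) - ord(first) + 1


LIMIT = 256  # limit quoted in the text; inclusive counting is an assumption

for first, last in [("a", "z"), ("À", "ž"), ("Z", "Ж")]:
    size = range_size(first, last)
    verdict = "ok" if size <= LIMIT else "too large"
    print(f"[{first}-{last}] spans {size} code points: {verdict}")
```

Here [a-z] spans 26 code points and [À-ž] spans 191, while [Z-Ж] spans 957 and exceeds the limit.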

If you need to include the characters ] or - in the list, put them right after the opening bracket:

[][] Closing bracket ] or opening bracket [
[-a-z] Latin letters and dash -
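
Python's re module follows the same convention for a leading ] or -, so the two lists can be tried there as a stand-in. The sample strings are invented for the example, and the + quantifier in the second pattern is only there to group the matches into words.

```python
import re

# A ] placed right after the opening bracket is taken literally.
brackets = re.findall(r"[][]", "items[3] = x")

# A - placed right after the opening bracket is taken literally too.
dashed = re.findall(r"[-a-z]+", "well-known fact")

print(brackets)  # ['[', ']']
print(dashed)    # ['well-known', 'fact']
```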

Using the ^ metacharacter right after the opening bracket, you can find any character except the specified ones:

[^aeiouy] Any character except vowels
[^a-z] Any character except Latin letters
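
A negated list matches everything outside the list, including digits, punctuation, and spaces. This is easy to overlook. A small sketch, using Python's re module as a stand-in with made-up input:

```python
import re

# Every character of "regex" that is not a vowel.
not_vowels = re.findall(r"[^aeiouy]", "regex")

# Deleting everything that is not a lowercase Latin letter.
letters_only = re.sub(r"[^a-z]", "", "a1-b2_c3")

print(not_vowels)    # ['r', 'g', 'x']
print(letters_only)  # abc
```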

Character classes

A character class matches one character of the specified type (letter, digit, etc.). Here is the full list of supported classes:

\d A digit or a numeric character (e.g., 4 or ½)
\D Anything but a digit
\w A word character (a letter, a digit, or an underscore _)
\W Any character except word characters
\s Space, tab, newline character, or other separator
\S Anything but a separator
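
The classes can be tried with Python's re module, which uses the same names. This is only an approximation of the behavior described above. In particular, Python's \d matches decimal digits only, so it does not match ½. The sample string is invented for the example.

```python
import re

sample = "id_42 = 3.14\tok\n"

digits = re.findall(r"\d", sample)
words = re.findall(r"\w+", sample)      # runs of word characters
separators = re.findall(r"\s", sample)

print(digits)      # ['4', '2', '3', '1', '4']
print(words)       # ['id_42', '3', '14', 'ok']
print(separators)  # [' ', ' ', '\t', '\n']

# Difference from the list above: Python's \d skips the fraction character.
print(re.findall(r"\d", "4 ½"))  # ['4']
```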

Note that the classes include international characters. For example, \w matches not only English letters, but also German umlauts and French letters with diacritics (as well as Greek and Russian letters, Chinese ideographs, etc.). This differs from Perl regular expressions, which include only English letters in the \w class and require the special notation \p{IsAlpha} for other languages.
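
The two behaviors can be compared side by side in Python, whose \w is Unicode-aware by default and can be restricted to ASCII with the re.ASCII flag. This is an illustration with a stand-in engine and invented sample text, not a description of Aba's internals.

```python
import re

text = "Straße, слово, 中文"

unicode_words = re.findall(r"\w+", text)                # international letters count
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)  # only a-z, A-Z, 0-9 and _

print(unicode_words)  # ['Straße', 'слово', '中文']
print(ascii_words)    # ['Stra', 'e']
```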

If you need to find only Latin letters (say, you are looking for programming language identifiers), use [a-z].

Also, note that \s matches newlines. If you want to match only spaces and tabs, use ( |\t).
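
The difference is easy to see on a string that contains all three kinds of whitespace. The sketch uses Python's re module as a stand-in. Because ( |\t) is a group, findall returns the text captured by that group.

```python
import re

sample = "a b\tc\nd"

all_separators = re.findall(r"\s", sample)
spaces_and_tabs = re.findall(r"( |\t)", sample)

print(all_separators)   # [' ', '\t', '\n']
print(spaces_and_tabs)  # [' ', '\t']
```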

Escape sequences

You can use the C-like escape sequences \t, \r, and \n to match a tab, a carriage return, and a line feed. Only these three escape sequences are supported.

\t Tab
\r Carriage return
\n Line feed (new line)

However, it's recommended to press Enter instead of typing \r\n, because the former will match both Windows (CR+LF) and Unix-style (LF) line terminators.
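
The reason a literal \r\n is fragile can be shown with two copies of the same text that differ only in their line terminators. The sketch below uses Python's re module. The pattern \r?\n is a common idiom in other engines for matching both terminator styles and is shown only as an analogy to a line break entered with Enter.

```python
import re

windows_text = "one\r\ntwo\r\n"  # CR+LF terminators
unix_text = "one\ntwo\n"         # LF terminators

literal_crlf = r"\r\n"  # matches CR+LF only
either_style = r"\r?\n"  # optional CR followed by LF

crlf_hits = [len(re.findall(literal_crlf, t)) for t in (windows_text, unix_text)]
either_hits = [len(re.findall(either_style, t)) for t in (windows_text, unix_text)]

print(crlf_hits)    # [2, 0]
print(either_hits)  # [2, 2]
```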

Any character

Use a dot (.) to match any character. Note that a dot matches newlines, too. (In Perl terms, Aba always has the /s modifier on.) If you want to match any character except a newline, use \N (borrowed from Perl 6).

. Any character
\N Any character except CR and LF
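
For readers coming from other engines, both behaviors can be reproduced in Python as a rough analogy. There the dot stops at newlines unless re.DOTALL (Python's counterpart of /s) is set. Python's re has no bare \N, so the list [^\r\n] is used in its place here.

```python
import re

text = "first\nsecond"

dot_default = re.findall(r".+", text)                  # Python default: dot skips newlines
dot_all = re.findall(r".+", text, flags=re.DOTALL)     # dot matches newlines, like Aba's dot
no_newline = re.findall(r"[^\r\n]+", text)             # any character except CR and LF

print(dot_default)  # ['first', 'second']
print(dot_all)      # ['first\nsecond']
print(no_newline)   # ['first', 'second']
```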
