Regular Expressions 101

28 Jan 2024

With regular expressions, you can describe a whole set of similar strings with a single pattern. For example, suppose you have multiple <img> tags and want to move all these images to the images folder:

<img src="9.png">   →  <img src="images/9.png">
<img src="10.png">  →  <img src="images/10.png">
and so on

You can easily write a regular expression that matches all file names that are numbers, then replace all such tags at once.

Basic syntax

If you need to match one of the alternatives, use an alternation (vertical bar). For example:

Regex        Meaning
a|img|h1|h2  either a, or img, or h1, or h2

When using alternation, you often need to group characters together; you can do this with parentheses. For example, if you want to match an HTML tag, this approach won't work:

Regex        Meaning
<h1|h2|b|i>  <h1 or h2 (without the angle brackets) or b or i>

because < applies to the first alternative only and > applies to the last one only. To apply the angle brackets to all alternatives, you need to group the alternatives together:

<(h1|h2|b|i)>
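A quick way to see the difference is to try both patterns in code. The sketch below uses Python's re module (the choice of Python is an assumption made for illustration; the syntax is the same in most regex engines). re.fullmatch succeeds only when the whole string matches the pattern.

```python
import re

ungrouped = re.compile(r"<h1|h2|b|i>")
grouped = re.compile(r"<(h1|h2|b|i)>")

# Without parentheses, the alternatives are "<h1", "h2", "b", and "i>".
print(bool(ungrouped.fullmatch("<h1")))   # True
print(bool(ungrouped.fullmatch("<h1>")))  # False

# With parentheses, the angle brackets surround every alternative.
print(bool(grouped.fullmatch("<h1>")))    # True
print(bool(grouped.fullmatch("<b>")))     # True
print(bool(grouped.fullmatch("h2")))      # False
```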

The last primitive (star) allows you to repeat anything zero or more times. You can apply it to one character, for example:

Regex  Meaning
a*     an empty string, a, aa, aaa, aaaa, etc.

You also can apply it to multiple characters in parentheses:

Regex  Meaning
(ab)*  an empty string, ab, abab, ababab, abababab, etc.

Note that if you remove the parentheses, the star will apply to the last character only:

Regex  Meaning
ab*    a, ab, abb, abbb, abbbb, etc. (the a is required; only the b is repeated)

[Figure: portrait of Stephen Kleene (1909-1994), the regular expression inventor. Author: Konrad Jacobs. Source: Archives of the Mathematisches Forschungsinstitut Oberwolfach.]

The star is called the Kleene star after the American mathematician Stephen Kleene, who invented regular expressions in the 1950s. It can match an empty string as well as any number of repetitions.
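To check what the star applies to, you can test a few strings. This is a minimal sketch in Python; the helper name matches is made up for this example.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# a* : zero or more "a"
print(matches(r"a*", ""), matches(r"a*", "aaa"))             # True True
# (ab)* : zero or more repetitions of "ab"
print(matches(r"(ab)*", "abab"), matches(r"(ab)*", "abb"))   # True False
# ab* : one "a" followed by zero or more "b"
print(matches(r"ab*", "abb"), matches(r"ab*", "a"), matches(r"ab*", ""))  # True True False
```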

These three primitives (alternation, parentheses, and the star for repetition) are enough to write any regular expression, but the syntax may be verbose. For example, you can now write a regex that matches the file names that are numbers in an <img> tag:

Regex                                        Meaning
(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*  one or more digits
(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*    a positive integer number (don't allow zero as the first character)

The parentheses may be nested without a limit, for example:

Regex                                                                                   Meaning
(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*(,(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*)*  one or more positive integer numbers, separated with commas

Convenient shortcuts for character classes

You can write any regex with the three primitives, but it quickly becomes hard to read, so a few shortcuts were invented. When you need to match any one of the listed characters, put them in square brackets:

Regex                                                Shorter regex  Meaning
a|e|i|o|u|y                                          [aeiouy]       a vowel
0|1|2|3|4|5|6|7|8|9                                  [0123456789]   a digit
0|1|2|3|4|5|6|7|8|9                                  [0-9]          a digit
a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z  [a-z]          a lowercase letter

As you can see, it's possible to specify only the first and the last allowed character if you put a dash between them. There may be several such ranges inside square brackets:

Regex      Meaning
[a-z0-9]   a lowercase letter or a digit
[a-z0-9_]  a lowercase letter, a digit, or the underscore character
[a-f0-9]   a hexadecimal digit (with lowercase a-f)
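Here is a small Python sketch of these ranges in action. The variable names and the sample string are chosen only for illustration.

```python
import re

hex_digit = re.compile(r"[a-f0-9]")
identifier_char = re.compile(r"[a-z0-9_]")

# Keep only the characters that match the class.
print([c for c in "c0ffee-G" if hex_digit.fullmatch(c)])
# ['c', '0', 'f', 'f', 'e', 'e']

print(bool(identifier_char.fullmatch("_")))  # True
print(bool(identifier_char.fullmatch("A")))  # False: the range a-z covers lowercase letters only
```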

There are some predefined character classes that are even shorter to write:

Regex  Meaning
\s     a space character: the space, the tab character, the new line, or the carriage return
\d     a digit
\w     a word character (a letter, a digit, or the underscore character)
.      any character (in many regex engines, any character except a line break by default)

In Aba Search and Replace, these character classes include Unicode characters such as accented letters or Unicode line breaks. In some other regex dialects, they include ASCII characters only (at least by default), so there \d is the same as [0-9] and \w is the same as [a-zA-Z0-9_].
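Whether the predefined classes are Unicode-aware depends on the engine and its options, so it is worth testing. As one illustration, in Python 3 (used here only as an example engine) \d and \w match Unicode characters in text patterns by default, and the re.ASCII flag restricts them to ASCII:

```python
import re

arabic_indic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
e_acute = "\u00e9"             # LATIN SMALL LETTER E WITH ACUTE

print(bool(re.fullmatch(r"\d", arabic_indic_three)))            # True
print(bool(re.fullmatch(r"\d", arabic_indic_three, re.ASCII)))  # False
print(bool(re.fullmatch(r"\w", e_acute)))                       # True
print(bool(re.fullmatch(r"\w", e_acute, re.ASCII)))             # False
```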

The character classes don't add any new capabilities to the regular expressions; you can just list all allowed characters with an alternation, but a character class is much shorter to write. We can now write a shorter version of the regex mentioned before:

Regex                       Meaning
[1-9][0-9]*(,[1-9][0-9]*)*  one or more positive integer numbers, separated with commas
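You can check that the character-class form accepts the same strings as the verbose one. The Python sketch below builds the verbose regex from the three primitives and compares it with the short form, written here as [1-9][0-9]*(,[1-9][0-9]*)* so that every number in the list may have any number of digits. The sample strings and the helper name accepts are assumptions for this example, and agreeing on a few samples is a sanity check, not a proof.

```python
import re

digit = "(0|1|2|3|4|5|6|7|8|9)"
nonzero = "(1|2|3|4|5|6|7|8|9)"
number = nonzero + digit + "*"
verbose = number + "(," + number + ")*"
short = r"[1-9][0-9]*(,[1-9][0-9]*)*"

def accepts(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ["7", "10,20,300", "1,2,3", "", "0", "01", "1,,2", "1,2,", "12a"]
for s in samples:
    # Both columns should agree for every sample.
    print(repr(s), accepts(verbose, s), accepts(short, s))
```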

Repetitions

A Kleene star means "repeating zero or more times", but you often need another number of repetitions. As shown before, you can just copy-and-paste a regex to repeat it twice or three times, but there is a shorter notation for that:

Regex                               Shorter regex  Meaning
\d\d*                               \d+            one or more digits
(0|1)(0|1)*                         [01]+          any binary number (consisting of zeros and ones)
(\s|)                               \s?            either a space character or nothing
http(s|)                            https?         either http or https
(-|\+|)                             [-+]?          the minus sign, the plus sign, or nothing
[a-z][a-z]                          [a-z]{2}       two lowercase letters
[a-z][a-z]((([a-z]|)[a-z]|)[a-z]|)  [a-z]{2,5}     from two to five lowercase letters
[a-z][a-z][a-z]*                    [a-z]{2,}      two or more lowercase letters
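As a sanity check, the long and short forms from the table above can be compared on a handful of strings. This is a Python sketch; the sample list and the function name same_on_samples are made up for the example, and the check only covers the listed samples.

```python
import re

# (long form, short form) pairs taken from the table above
pairs = [
    (r"\d\d*", r"\d+"),
    (r"(0|1)(0|1)*", r"[01]+"),
    (r"http(s|)", r"https?"),
    (r"(-|\+|)", r"[-+]?"),
    (r"[a-z][a-z]((([a-z]|)[a-z]|)[a-z]|)", r"[a-z]{2,5}"),
    (r"[a-z][a-z][a-z]*", r"[a-z]{2,}"),
]
samples = ["", "7", "2024", "0110", "http", "https", "httpss",
           "-", "+", "a", "ab", "abcde", "abcdef"]

def same_on_samples(long_form, short_form):
    """True if both patterns accept and reject exactly the same samples."""
    return all(
        (re.fullmatch(long_form, s) is None) == (re.fullmatch(short_form, s) is None)
        for s in samples
    )

for long_form, short_form in pairs:
    print(short_form, same_on_samples(long_form, short_form))
```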

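So there are the following repetition operators:

Operator  Meaning
*         zero or more times
+         one or more times
?         zero or one time
{n}       exactly n times
{n,m}     from n to m times
{n,}      n or more times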

Note that you can express any repetition with the curly brackets, so these operators partially duplicate each other. For example:

Regex    Shorter regex  Meaning
\d{0,}   \d*            nothing or some digits
\d{1,}   \d+            one or more digits
\s{0,1}  \s?            either a space character or nothing

Just like the Kleene star, the other repetition operators can apply to parentheses, so you can nest them indefinitely.
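For instance, a counted repetition can be applied to a parenthesized group. The pattern below is a hypothetical example (it is not taken from the text above): six pairs of lowercase hexadecimal digits separated by colons, checked in Python.

```python
import re

# Hypothetical example: one pair of hex digits, then five more pairs,
# each preceded by a colon. {5} applies to the whole group in parentheses.
mac = re.compile(r"[0-9a-f]{2}(:[0-9a-f]{2}){5}")

print(bool(mac.fullmatch("00:1a:2b:3c:4d:5e")))     # True
print(bool(mac.fullmatch("00:1a:2b:3c:4d")))        # False: only five pairs
print(bool(mac.fullmatch("00:1a:2b:3c:4d:5e:6f")))  # False: seven pairs
```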

Escaping

If you need to match any of the special characters like parentheses, vertical bar, plus, or star, you must escape them by adding a backslash \ before them. For example, to find a number in parentheses, use \(\d+\).
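Here is how escaping looks in Python code. Raw strings (r"...") keep the backslashes intact, and the standard function re.escape adds the backslashes for you when the text to find is literal. The sample strings are made up for the example.

```python
import re

in_parens = re.compile(r"\(\d+\)")
print(in_parens.findall("see (12), (x), and (345)"))  # ['(12)', '(345)']

# re.escape puts a backslash before every special character.
literal = re.escape("a+b")
print(bool(re.fullmatch(literal, "a+b")))  # True
print(bool(re.fullmatch(r"a+b", "a+b")))   # False: an unescaped + means "one or more a"
print(bool(re.fullmatch(r"a+b", "aab")))   # True
```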

A common mistake is to forget a backslash before a dot. Note that a dot means any character, so if you write example.com in a regular expression, it will match examplexcom or something similar, which may even cause a security issue in your program. Now we can write a regex to match the <img> tags:

<img src="\d+\.png">

This matches any file name that consists of digits followed by .png, and the dot is correctly escaped.
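To actually move the images, you also need to reuse the matched file name in the replacement. The Python sketch below makes two assumptions that go beyond the syntax covered so far: the parentheses are used to capture the file name, and \1 in the replacement string inserts the captured text (other tools may write this differently, for example as $1). The sample HTML is invented for the example.

```python
import re

html = '''<img src="9.png">
<img src="10.png">
<img src="logo.png">
<img src="9xpng">'''

# The parentheses capture the file name so that \1 can reuse it.
pattern = re.compile(r'<img src="(\d+\.png)">')
result = pattern.sub(r'<img src="images/\1">', html)
print(result)
```

Only the first two tags change: logo.png is not a number, and 9xpng is rejected because the escaped dot matches a literal dot only.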

Other features

Modern regex engines add more features such as backreferences or conditional subpatterns. Mathematically speaking, these features don't belong to the regular expressions; they describe a non-regular language, so you cannot replace them with the three primitives.
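As a small illustration of a backreference, the pattern below requires the same run of a's on both sides of a b. This is a Python sketch; the pattern and its name are chosen for this example. The set of strings it matches is a textbook non-regular language, because the pattern has to remember how many a's came first.

```python
import re

# \1 must repeat exactly the text that the first group matched.
same_on_both_sides = re.compile(r"(a+)b\1")

print(bool(same_on_both_sides.fullmatch("aabaa")))    # True
print(bool(same_on_both_sides.fullmatch("aaabaaa")))  # True
print(bool(same_on_both_sides.fullmatch("aabaaa")))   # False
```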

Next time, we will discuss anchors and zero-width assertions.

This is a blog about Aba Search and Replace, a tool for replacing text in multiple files.