Regular Expressions 101
28 Jan 2024
With regular expressions, you can describe strings that follow a similar pattern. For example, you have multiple <img> tags, and you want to move all these images to the images folder:
<img src="9.png"> → <img src="images/9.png">
<img src="10.png"> → <img src="images/10.png">
and so on
You can easily write a regular expression that matches all file names that are numbers, then replace all such tags at once.
Basic syntax
If you need to match one of the alternatives, use an alternation (vertical bar). For example:
Regex | Meaning |
a|img|h1|h2 | either a , or img , or h1 , or h2 |
When using alternation, you often need to group characters together; you can do this with parentheses. For example, if you want to match an HTML tag, this approach won't work:
Regex | Meaning |
<h1|h2|b|i> | <h1 or h2 (without the angle brackets) or b or i> |
because < applies to the first alternative only and > applies to the last one only. To apply the angle brackets to all alternatives, you need to group the alternatives together:
<(h1|h2|b|i)>
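The difference between the two patterns can be checked with a short script. This is a minimal sketch that assumes Python's built-in re module rather than Aba Search and Replace; the variable names are only illustrative.

```python
import re

# Without parentheses the alternatives are "<h1", "h2", "b", and "i>".
ungrouped = re.compile(r"<h1|h2|b|i>")
# With parentheses the angle brackets apply to every alternative.
grouped = re.compile(r"<(h1|h2|b|i)>")

print(bool(ungrouped.fullmatch("<b>")))  # False: "<b>" is none of the four alternatives
print(bool(ungrouped.fullmatch("b")))    # True: a bare "b" is accepted
print(bool(grouped.fullmatch("<b>")))    # True
print(bool(grouped.fullmatch("b")))      # False
```

fullmatch is used here so that the whole string has to match, which makes the scope of each alternative easy to see.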
The last primitive (star) allows you to repeat anything zero or more times. You can apply it to one character, for example:
Regex | Meaning |
a* | an empty string, a , aa , aaa , aaaa , etc. |
You can also apply it to multiple characters in parentheses:
Regex | Meaning |
(ab)* | an empty string, ab , abab , ababab , abababab , etc. |
Note that if you remove the parentheses, the star will apply to the last character only:
Regex | Meaning |
ab* | a , ab , abb , abbb , abbbb , etc. |
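To see what the star applies to in each case, you can test the three patterns against the same strings. The sketch below assumes Python's re module; the helper name matching and the sample list are made up for this illustration.

```python
import re

samples = ["", "a", "ab", "abb", "abab"]

def matching(pattern, texts):
    """Return the strings that the pattern matches in full."""
    return [t for t in texts if re.fullmatch(pattern, t)]

print(matching(r"a*", samples))     # ['', 'a']
print(matching(r"(ab)*", samples))  # ['', 'ab', 'abab']
# The star binds to "b" only, so the leading "a" is always required.
print(matching(r"ab*", samples))    # ['a', 'ab', 'abb']
```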
The star is called the Kleene star after the American mathematician Stephen Kleene, who invented regular expressions in the 1950s. It can match an empty string as well as any number of repetitions.
These three primitives (alternation, parentheses, and the star for repetition) are enough to write any regular expression, but the syntax may be verbose. For example, you can now write a regex for matching the file names that are numbers in an <img> tag:
Regex | Meaning |
(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)* | one or more digits |
(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)* | a positive integer number (don't allow zero as the first character) |
The parentheses may be nested without a limit, for example:
Regex | Meaning |
(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*(,(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*)* | one or more positive integer numbers, separated with commas |
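A long pattern like this is easier to follow when it is assembled from smaller pieces. The following sketch assumes Python's re module; the constant names DIGIT, NONZERO, INTEGER, and INTEGER_LIST are invented for this example.

```python
import re

# Only alternation, parentheses, and the star are used.
DIGIT = "(0|1|2|3|4|5|6|7|8|9)"
NONZERO = "(1|2|3|4|5|6|7|8|9)"
INTEGER = NONZERO + DIGIT + "*"                  # no leading zero
INTEGER_LIST = INTEGER + "(," + INTEGER + ")*"   # nested group with its own star

for text in ["7", "10,200,3", "007", "1,,2", ""]:
    print(repr(text), bool(re.fullmatch(INTEGER_LIST, text)))
# '7' True
# '10,200,3' True
# '007' False
# '1,,2' False
# '' False
```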
Convenient shortcuts for character classes
You can write any regex with the three primitives, but it quickly becomes hard to read, so a few shortcuts were invented. When you need to match any of the listed characters, please put them into square brackets:
Regex | Shorter regex | Meaning |
a|e|i|o|u|y | [aeiouy] | a vowel |
0|1|2|3|4|5|6|7|8|9 | [0123456789] | a digit |
0|1|2|3|4|5|6|7|8|9 | [0-9] | a digit |
a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z | [a-z] | a lowercase letter |
As you can see, it's possible to specify only the first and the last allowed character if you put a dash between them. There may be several such ranges inside square brackets:
Regex | Meaning |
[a-z0-9] | a lowercase letter or a digit |
[a-z0-9_] | a lowercase letter, a digit, or the underscore character |
[a-f0-9] | a hexadecimal digit |
There are some predefined character classes that are even shorter to write:
Regex | Meaning |
\s | a space character: the space, the tab character, the new line, or the carriage return |
\d | a digit |
\w | a word character (a letter, a digit, or the underscore character) |
. | any character (in many engines, any character except a line break by default) |
In Aba Search and Replace, these character classes include Unicode characters such as accented letters or Unicode line breaks. In other regex dialects, they usually include ASCII characters only, so \d is typically the same as [0-9] and \w is the same as [a-zA-Z0-9_].
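Whether the predefined classes cover Unicode depends on the engine and its flags, so it is worth checking in the tool you use. As one illustration (an assumption about Python only, not about Aba Search and Replace), Python's re module is Unicode-aware for text patterns by default, and its re.ASCII flag switches to the ASCII-only behavior:

```python
import re

ACCENTED = "\u00e9"      # é, LATIN SMALL LETTER E WITH ACUTE
ARABIC_THREE = "\u0663"  # ARABIC-INDIC DIGIT THREE

print(bool(re.fullmatch(r"\w", ACCENTED)))                # True
print(bool(re.fullmatch(r"\w", ACCENTED, re.ASCII)))      # False
print(bool(re.fullmatch(r"\d", ARABIC_THREE)))            # True
print(bool(re.fullmatch(r"\d", ARABIC_THREE, re.ASCII)))  # False
```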
The character classes don't add any new capabilities to regular expressions; you could just list all allowed characters with an alternation, but a character class is much shorter to write. We can now write a shorter version of the regex mentioned before:
Regex | Meaning |
[1-9][0-9]*(,[1-9][0-9]*)* | one or more positive integer numbers, separated with commas |
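A quick way to gain confidence that a shortened regex still means the same thing is to run both versions over the same inputs. The sketch below assumes Python's re module; LONG is the pattern built from the three primitives, SHORT is the character-class form with a star after each [0-9], and the sample strings are arbitrary.

```python
import re

LONG = (r"(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*"
        r"(,(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*)*")
SHORT = r"[1-9][0-9]*(,[1-9][0-9]*)*"

samples = ["7", "10,200,3", "12,345", "007", "1,,2", "5,", ""]
for text in samples:
    long_ok = bool(re.fullmatch(LONG, text))
    short_ok = bool(re.fullmatch(SHORT, text))
    print(repr(text), long_ok, short_ok)  # the two columns agree on every sample
```

Agreement on a handful of samples is not a proof of equivalence, but it catches slips such as a forgotten star inside the group.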
Repetitions
A Kleene star means "repeating zero or more times", but you often need another number of repetitions. As shown before, you can just copy-and-paste a regex to repeat it twice or three times, but there is a shorter notation for that:
Regex | Shorter regex | Meaning |
\d\d* | \d+ | one or more digits |
(0|1)(0|1)* | [01]+ | any binary number (consisting of zeros and ones) |
(\s|) | \s? | either a space character or nothing |
http(s|) | https? | either http or https |
(-|\+|) | [-+]? | the minus sign, the plus sign, or nothing |
[a-z][a-z] | [a-z]{2} | two small letters |
[a-z][a-z]((([a-z]|)[a-z]|)[a-z]|) | [a-z]{2,5} | from two to five small letters |
[a-z][a-z][a-z]* | [a-z]{2,} | two or more small letters |
So there are the following repetition operators:
- a Kleene star * means repeating zero or more times, so it may not match at all, or it can match once, twice, three times, etc.;
- a plus sign + means repeating one or more times, so it must match at least once;
- an optional part ? means zero times or once;
- curly brackets {m,n} mean repeating from m to n times.
Note that you can express any repetition with the curly brackets, so these operators partially duplicate each other. For example:
Regex | Shorter regex | Meaning |
\d{0,} | \d* | nothing or some digits |
\d{1,} | \d+ | one or more digits |
\s{0,1} | \s? | either a space character or nothing |
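The equivalences in the tables above can be spot-checked by comparing each long form with its shorthand on a few strings. This is a sketch assuming Python's re module; the helper agree and the sample list are hypothetical names chosen for this example.

```python
import re

EQUIVALENT = [
    (r"\d{0,}", r"\d*"),
    (r"\d{1,}", r"\d+"),
    (r"\s{0,1}", r"\s?"),
    (r"[a-z][a-z]((([a-z]|)[a-z]|)[a-z]|)", r"[a-z]{2,5}"),
]
samples = ["", " ", "7", "42", "a", "ab", "abcde", "abcdef"]

def agree(long_form, short_form, texts):
    """True if both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(long_form, t)) == bool(re.fullmatch(short_form, t))
        for t in texts
    )

for long_form, short_form in EQUIVALENT:
    print(short_form, agree(long_form, short_form, samples))  # True for each pair
```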
Just like the Kleene star, the other repetition operators can apply to parentheses, so you can nest them indefinitely.
Escaping
If you need to match any of the special characters like parentheses, vertical bar, plus, or star, you must escape them by adding a backslash \ before them. For example, to find a number in parentheses, use \(\d+\).
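Here is how the escaped pattern behaves in practice. The sketch assumes Python's re module; re.escape is a Python helper that adds the backslashes for you when a piece of literal text has to be matched as is, and the sample strings are made up.

```python
import re

text = "see (12) and (345), not 67"
print(re.findall(r"\(\d+\)", text))  # ['(12)', '(345)']

# Literal text that contains special characters.
literal = "1+1=2 (approx.)"
escaped = re.escape(literal)
print(bool(re.fullmatch(escaped, literal)))  # True: every special character is escaped
print(bool(re.fullmatch(literal, literal)))  # False: "+", "(", ")" and "." act as operators
```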
A common mistake is to forget a backslash before a dot. Note that a dot means any character, so if you write example.com in a regular expression, it will match examplexcom or something similar, which may even cause a security issue in your program. Now we can write a regex to match the <img> tags:
<img src="\d+\.png">
This matches any file name that consists of digits followed by .png, with the dot correctly escaped.
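To carry out the replacement from the beginning of the article, the file name has to be reused in the new tag. The sketch below assumes Python's re.sub; the extra parentheses around the file name and the \1 in the replacement string are Python conventions for reusing the matched text, and the logo.png tag is a made-up example of something that should stay untouched.

```python
import re

html = '<img src="9.png"> <img src="10.png"> <img src="logo.png">'

# The parentheses capture the file name so the replacement can reuse it as \1.
pattern = re.compile(r'<img src="(\d+\.png)">')
result = pattern.sub(r'<img src="images/\1">', html)
print(result)
# <img src="images/9.png"> <img src="images/10.png"> <img src="logo.png">

# With an unescaped dot, any character is accepted in place of the dot.
unescaped = re.compile(r'<img src="\d+.png">')
print(bool(unescaped.fullmatch('<img src="9xpng">')))  # True
print(bool(pattern.fullmatch('<img src="9xpng">')))    # False
```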
Other features
Modern regex engines add more features such as backreferences or conditional subpatterns. Mathematically speaking, these features don't belong to the regular expressions; they describe a non-regular language, so you cannot replace them with the three primitives.
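As an illustration of a backreference, the pattern below requires the same run of a characters on both sides of b, which is a classic example of a non-regular language. The sketch assumes Python's re module, where \1 refers to the text matched by the first group; the pattern itself is chosen for this example.

```python
import re

# \1 must repeat exactly what the first group matched.
pattern = re.compile(r"(a*)b\1")

for text in ["b", "aba", "aabaa", "aab", "abaa"]:
    print(text, bool(pattern.fullmatch(text)))
# b True
# aba True
# aabaa True
# aab False
# abaa False
```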
Next time, we will discuss anchors and zero-width assertions.