
Which special characters must be escaped in regular expressions?

8 Jan 2022

In most regular expression engines (PCRE, JavaScript, Python, Go, and Java), these special characters must be escaped outside of character classes:

[ * + ? { . ( ) ^ $ | \

To match one of these metacharacters literally, put a backslash \ before it. For example, to find the text $100, use \$100. To match the backslash itself, double it: \\.
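
As a quick check of the rule above, here is a minimal sketch in Python, one of the engines listed. The sample strings and variable names are made up for illustration.

```python
import re

# Unescaped, $ is an end-of-string anchor, so this pattern cannot match.
unescaped = re.search(r'$100', 'Price: $100')    # None

# Escaped, \$ matches a literal dollar sign.
escaped = re.search(r'\$100', 'Price: $100')     # matches '$100'

# A doubled backslash in the pattern matches one literal backslash.
text = 'C:\\temp'                                # contains a single backslash
backslash = re.search(r'C:\\temp', text)         # matches the whole text
```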

Inside character classes [square brackets], you must escape the following characters:

\ ] -

For example, to find an opening or a closing bracket, use [[\]].

If you need to include the dash in a character class, you can make it the first or the last character instead of escaping it. Use [a-z-] or [a-z\-] to find a Latin letter or a dash.

If you need to include the caret ^ in a character class, it must not be the first character; otherwise, it negates the class, which then matches any character except the listed ones. For example, [^aeiouy] means "any character except a vowel", while [a^eiouy] means "any vowel or a caret". Alternatively, you can escape the caret: [\^aeiouy]
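
The dash and caret rules can be tried out with a short Python sketch. The test string 'a^b' and the variable names are arbitrary choices for illustration.

```python
import re

# Both spellings put a literal dash into the class.
dash_last = re.findall(r'[a-z-]', 'a-b')       # ['a', '-', 'b']
dash_escaped = re.findall(r'[a-z\-]', 'a-b')   # ['a', '-', 'b']

# A leading ^ negates the class.
negated = re.findall(r'[^aeiouy]', 'a^b')      # ['^', 'b']

# A caret that is not first, or that is escaped, is a literal character.
literal = re.findall(r'[a^eiouy]', 'a^b')      # ['a', '^']
escaped = re.findall(r'[\^aeiouy]', 'a^b')     # ['a', '^']
```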

JavaScript

In JavaScript, you also need to escape the slash / in regular expression literals:

/AC\/DC/.test('AC/DC')

Lone closing brackets ] and } are allowed by default, but with the 'u' flag they must be escaped; otherwise, the pattern is rejected with a SyntaxError:

/]}/.test(']}') // true
/]}/u.test(']}') // SyntaxError: lone ] and } are not allowed with the u flag

This restriction is specific to JavaScript; lone closing brackets are allowed in other languages.

If you create a regular expression on the fly from a user-supplied string, you can use the following function to properly escape the special characters:

function escapeRe(str) {
    return str.replace(/[[\]*+?{}.()^$|\\-]/g, '\\$&');
}

var re = new RegExp(escapeRe(start) + '.*?' + escapeRe(end));

PHP

In PHP, the preg_quote function escapes a user-supplied string so that it can be inserted into a regular expression pattern. In addition to the characters listed above, it also escapes # (in PHP 7.3.0 and higher), the NUL character, and the characters = ! < > : -, which are not metacharacters on their own in PCRE but are sometimes used as delimiters. Closing brackets ] and } are escaped, too, which is unnecessary:

preg_match('/]}/', ']}'); // returns 1

Just like in JavaScript, you also need to escape the delimiter, which is usually /, but you can use another special character such as # or = if the slash appears inside your pattern:

if (preg_match('/\/posts\/([0-9]+)/', $path, $matches)) {
}

// Can be simplified to:
if (preg_match('#/posts/([0-9]+)#', $path, $matches)) {
}

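Note that preg_quote does not escape the tilde ~ and the slash / by default. If you construct regexes from strings, either avoid these characters as delimiters or pass your delimiter as the second argument to preg_quote so that it is escaped as well.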

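In double-quoted strings, PHP itself interprets \1 and $ before the pattern reaches the regular expression engine, so the best practice is to write patterns in single quotes and to repeat the backslash 4 times when you need to match a literal backslash: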

$text = 'C:\\Program files\\';
echo $text;
if (preg_match('/C:\\\\Program files\\\\/', $text, $matches)) {
   print_r($matches);
}

Python

Python has a raw string syntax (r''), which conveniently avoids the backslash escaping idiosyncrasies of PHP:

import re
re.match(r'C:\\Program files/Tools', 'C:\\Program files/Tools')

You only need to escape the quote in raw strings:

re.match(r'\'', "'")
re.match(r"'", "'")  # or just use double quotes if you have a regex with a single quote

re.match(r"\"", '"')
re.match(r'"', '"')  # or use single quotes if you have a regex with a double quote

re.match(r'"\'', '"\'')  # multiple quote types; cannot avoid escaping them

A raw string literal cannot end with a single backslash, but this is not a problem for a valid regular expression.
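
The following Python sketch illustrates both halves of that statement. The use of the built-in compile() to parse a source fragment and the variable names are choices made for this example only.

```python
import re

# The source text r'abc\' is not a complete literal: the backslash keeps
# the closing quote from ending the string.
try:
    compile("r'abc\\'", '<example>', 'eval')
    raw_literal_ok = True
except SyntaxError:
    raw_literal_ok = False

# A pattern that ends with a single backslash is rejected by re anyway.
try:
    re.compile('abc\\')
    pattern_ok = True
except re.error:
    pattern_ok = False

# A literal trailing backslash is written as an escaped pair, which is fine.
match = re.search(r'files\\', 'C:\\Program files\\')
```

Here both raw_literal_ok and pattern_ok end up False, while the last pattern matches 'files' followed by one backslash.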

To match a literal ] inside a character class, you can make it the first character: [][] matches a closing or an opening bracket. Aba Search & Replace and Python support this syntax, but not every engine does: in JavaScript, for example, [] is an empty character class that matches nothing. You can also quote the ] character with a backslash, which works in all languages: [\][] or [[\]].
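
Python's re module accepts a leading ] as a literal, so both spellings can be compared directly. This is a small sketch with an arbitrary test string.

```python
import re

# ] placed first is taken literally, so this class holds ] and [.
first = re.findall(r'[][]', 'a[0]')       # ['[', ']']

# The escaped form is the portable spelling of the same class.
escaped = re.findall(r'[\][]', 'a[0]')    # ['[', ']']
```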

For inserting a string into a regular expression, Python offers the re.escape method. Unlike JavaScript with the u flag, Python tolerates escaping non-special punctuation characters, so this function also escapes -, #, &, and ~:

print(re.escape(r'-#&~'))  # prints \-\#\&\~
re.match(r'\@\~', '@~')  # matches
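
A typical use of re.escape is building a pattern from strings you do not control. The sketch below assembles a lazy "from start to end" pattern; the helper name between and the sample text are hypothetical.

```python
import re

def between(start, end):
    """Build a lazy pattern matching text from `start` through `end`."""
    return re.compile(re.escape(start) + '.*?' + re.escape(end))

# The delimiters contain metacharacters, which re.escape neutralizes.
pattern = between('(*', '*)')
found = pattern.search('x (* a comment *) y').group()   # '(* a comment *)'
```

Without re.escape, the string '(*' would not even compile as a pattern.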

Java

Java allows escaping non-special punctuation characters, too:

Assert.assertTrue(Pattern.matches("\\@\\}\\] }]", "@}] }]"));

Similarly to PHP, you need to repeat the backslash character 4 times to match a literal backslash, but in Java, you also must double the backslash when escaping other characters:

Assert.assertTrue(Pattern.matches("C:\\\\Program files \\(x86\\)\\\\", "C:\\Program files (x86)\\"));

This is because the backslash must be escaped in a Java string literal, so if you want to pass \\ \[ to the regular expression engine, you need to double each backslash: "\\\\ \\[". There are no raw string literals in Java, so regular expressions are written as ordinary strings.

The Pattern.quote method prepares a string for insertion into a regular expression. It surrounds the string with \Q and \E, which quote everything between them in Java regexes (a syntax borrowed from Perl). If the string itself contains \E, the quoted span is closed, the \E is written as an escaped backslash followed by E, and a new quoted span is opened:

Assert.assertEquals("\\Q()\\E",
      Pattern.quote("()"));

Assert.assertEquals("\\Q\\E\\\\E\\Q\\E",
      Pattern.quote("\\E"));

Assert.assertEquals("\\Q(\\E\\\\E\\Q)\\E",
      Pattern.quote("(\\E)"));

The \Q...\E syntax is another way to escape several special characters at once. Besides Java, it is supported in PHP/PCRE and Go regular expressions, but not in Python or JavaScript.
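
For comparison, this Python sketch shows what happens in an engine without \Q...\E and how re.escape covers the same need. The variable names are illustrative.

```python
import re

# Python's re does not know \Q and reports it as a bad escape.
try:
    re.compile(r'\Q||\E')
    supports_quote_span = True
except re.error:
    supports_quote_span = False

# re.escape gets the same effect by escaping each special character.
escaped = re.escape('||')                     # \|\|
found = re.search(escaped, 'a||b').group()    # '||'
```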

Go

Go raw string literals are characters between back quotes: `\(`. It's preferable to use them for regular expressions because you don't need to double-escape the backslash:

r := regexp.MustCompile(`\(text\)`)
fmt.Println(r.FindString("(text)"))

A back quote cannot appear in a raw string literal, so you have to fall back on an ordinary double-quoted string such as "`" for it. Fortunately, this character is rare in patterns.

The \Q...\E syntax is supported, too:

r := regexp.MustCompile(`\Q||\E`)
fmt.Println(r.FindString("||"))

The regexp.QuoteMeta function escapes a string for insertion into a regular expression. In addition to the characters listed above, it also escapes closing brackets ] and }.

