Using zero-width assertions in regular expressions

30 Jun 2024

Anchors ^ $ \b \A \Z

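Anchors in regular expressions let you specify where in a string your pattern should match. There are several types of anchors:

^ matches the beginning of the string (or of a line in multiline mode);
$ matches the end of the string (or of a line in multiline mode);
\b matches a word boundary;
\A matches the beginning of the string regardless of multiline mode;
\Z matches the end of the string regardless of multiline mode.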

These anchors are supported in Java, PHP, Python, Ruby, and C#. Go supports all of them except \Z; use \z there instead. In JavaScript, \A and \Z are not supported, but you can use ^ and $ instead; just remember to keep multiline mode disabled. Aba Search and Replace always runs in multiline mode, so you can use \A and \Z to match the beginning or the end of a file.

For example, the regular expression ^abc matches the letters "abc" only at the start of a string. In multiline mode, the same regex also matches these letters at the beginning of any line. You can combine anchors with other regular expression elements to create more complex matches. For example, in multiline mode ^From: (.*) matches a line starting with From: and captures the rest of that line.
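The behaviour described above can be tried out with a minimal sketch in Python's re module. The sample text and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text: "abc" sits on the second line, not at the start.
text = "xyz\nabc\nFrom: alice@example.com\n"

# Without re.MULTILINE, ^ matches only at the very start of the string.
start_only = re.search(r"^abc", text)

# With re.MULTILINE, ^ also matches right after each newline.
line_start = re.search(r"^abc", text, re.MULTILINE)

# Combine the anchor with a group to pull out the rest of a header line.
sender = re.search(r"^From: (.*)", text, re.MULTILINE)

print(start_only)          # None
print(line_start.start())  # 4
print(sender.group(1))     # alice@example.com
```

The dot does not match a newline by default, so the group stops at the end of the From: line.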

The difference between \Z and \z is that \Z matches at the end of the string or just before a final newline character. In contrast, \z is stricter and matches only at the very end of the string.
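One caveat when trying this in Python: the re module names these anchors differently. Python's \Z is the strict one (it behaves like \z in PCRE or Java), while $ without re.MULTILINE plays the lenient role and also matches just before a final newline. The sketch below, with a made-up sample string, shows the lenient and the strict behaviour side by side.

```python
import re

text = "last line\n"

# $ (without re.MULTILINE) matches at the end of the string
# or just before one trailing newline.
loose = re.search(r"line$", text)

# Python's \Z matches only at the very end of the string,
# so the trailing newline prevents a match here.
strict = re.search(r"line\Z", text)

# Once the newline is part of the pattern, the strict anchor succeeds.
strict_with_newline = re.search(r"line\n\Z", text)

print(loose is not None)                # True
print(strict is not None)               # False
print(strict_with_newline is not None)  # True
```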

If you have read the previous part of this article, you may wonder whether anchors add any capabilities beyond the three primitives (alternation, parentheses, and the star for repetition). The answer is: they do not, but they change which characters the regular expression consumes. You can match a line starting with abc by explicitly adding the newline character: \nabc, but in this case, the newline character becomes part of the match. When you use ^abc, the newline character is not consumed.
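A short Python sketch makes the difference visible by printing what each pattern actually matches. The sample text is invented for the example.

```python
import re

text = "abc\nabcdef"

# The explicit newline becomes part of the matched text.
with_newline = [m.group() for m in re.finditer(r"\nabc", text)]

# The anchor only checks the position; the newline is not consumed.
with_anchor = [m.group() for m in re.finditer(r"^abc", text, re.MULTILINE)]

print(with_newline)  # ['\nabc']
print(with_anchor)   # ['abc', 'abc']
```

The \nabc version also misses the first line, because no newline precedes it; a full replacement for the anchor would need an extra alternative for the start of the string.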

In a similar way, ing\b matches ing at the end of a word. You can replace the anchor with a character class matching non-word characters (such as spaces or punctuation): ing\W, but in this case, the regular expression will also consume the space or punctuation character, and it will not match at the very end of the string, where no character follows.
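The two patterns can be compared on a sample sentence; the sentence below is an arbitrary choice for illustration.

```python
import re

text = "sing a song, bringing"

# \b only asserts a position, so the match is exactly "ing".
boundary = re.findall(r"ing\b", text)

# \W has to consume a real character, so the space ends up in the match
# and the final "ing" (followed by nothing) is not found at all.
consuming = re.findall(r"ing\W", text)

print(boundary)   # ['ing', 'ing']
print(consuming)  # ['ing ']
```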

If the regular expression starts with ^ so that it only matches at the start of the string, it's called anchored. In some programming languages, you can do an anchored match instead of the non-anchored search without using ^. For example, in PHP (PCRE), you can use the A modifier.
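Python offers a comparable facility: re.match (or the match method of a compiled pattern) tries the pattern only at the start of the string, while re.search scans for it anywhere. The following sketch uses a made-up pattern and input to contrast the two.

```python
import re

pattern = re.compile(r"abc")  # note: no ^ in the pattern itself

# search() looks for the pattern anywhere in the string.
found_anywhere = pattern.search("xyz abc")

# match() is anchored: it succeeds only if the pattern matches at position 0.
anchored_miss = pattern.match("xyz abc")
anchored_hit = pattern.match("abcdef")

print(found_anywhere.start())  # 4
print(anchored_miss)           # None
print(anchored_hit.group())    # abc
```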

So anchors don't add any new capabilities to regular expressions, but they let you control which characters are included in the match, or restrict the match to the beginning or end of the string. The matched language is still regular.

Zero-width assertions (?= ) (?! ) (?<= ) (?<! )

Zero-width assertions (also called lookahead and lookbehind assertions, or lookaround) allow you to check that a pattern occurs in the subject string without consuming any of its characters. This is useful when you want to test for a pattern without moving the match position forward. For example, you can test that the next characters are abc without consuming them: (?=abc).
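As a small illustration in Python (the input strings are arbitrary), the lookahead below checks for abc but leaves those characters out of the match, so they remain available to the rest of the pattern.

```python
import re

text = "xabcabc"

# The lookahead checks for "abc" but the match itself is only "x".
consumed = re.search(r"x(?=abc)", text)

# Because nothing was consumed, the same characters can still be matched.
followed = re.search(r"x(?=abc)abc", text)

# On its own, (?=abc) produces empty matches at every position before "abc".
positions = [m.start() for m in re.finditer(r"(?=abc)", "abcabc")]

print(consumed.group(), consumed.end())  # x 1
print(followed.group())                  # xabc
print(positions)                         # [0, 3]
```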

Zero-width assertions are generalized anchors. Just like anchors, they don't consume any characters from the input string. Unlike anchors, they can check for any pattern, not only line boundaries or word boundaries. So you can replace an anchor with a zero-width assertion, but not vice versa. For example, ing\b could be rewritten as ing(?=\W|$).
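The equivalence of the two patterns can be spot-checked in Python. The list of sample strings is an arbitrary selection, so this is a sanity check rather than a proof.

```python
import re

samples = ["sing", "sing along", "singer", "ping!", "ring\n"]

def span_or_none(pattern, s):
    """Return the (start, end) of the first match, or None if there is none."""
    m = re.search(pattern, s)
    return m.span() if m else None

with_anchor = [span_or_none(r"ing\b", s) for s in samples]
with_lookahead = [span_or_none(r"ing(?=\W|$)", s) for s in samples]

print(with_anchor)
# [(1, 4), (1, 4), None, (1, 4), (1, 4)]
print(with_anchor == with_lookahead)  # True
```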

Aba documentation includes a detailed article on zero-width assertions (lookaround) and their typical usage, so we won't repeat it here. Zero-width lookahead and lookbehind are supported in PHP, JavaScript, Python, Java, and Ruby. Unfortunately, they are not supported by Go's standard regexp package, which follows the RE2 syntax.

Just like anchors, zero-width assertions still match a regular language, so from a theoretical point of view, they don't add anything new to the capabilities of regular expressions. They just let you exclude certain characters from the matched text: you check for their presence but don't consume them.
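A brief Python sketch of this idea follows; the sample text and the variable names are invented for illustration. Each pattern requires some surrounding context but returns only the digits.

```python
import re

text = "Price: $42, width: 100px, height: 7em"

# Lookbehind: digits that follow a dollar sign; the "$" is not in the match.
dollars = re.findall(r"(?<=\$)\d+", text)

# Lookahead: digits that are followed by "px"; the unit is not in the match.
pixels = re.findall(r"\d+(?=px)", text)

# Negative lookbehind: numbers that do NOT follow a dollar sign.
# \b keeps the match from starting in the middle of a number.
not_dollars = re.findall(r"(?<!\$)\b\d+", text)

print(dollars)      # ['42']
print(pixels)       # ['100']
print(not_dollars)  # ['100', '7']
```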

