Lookaround (?= ) (?< )

A lookaround is available through the extension syntax:

(?=abc) the next characters are “abc” (a positive lookahead)
(?!abc) the next characters are not “abc” (a negative lookahead)
(?<=abc) the previous characters are “abc” (a positive lookbehind)
(?<!abc) the previous characters are not “abc” (a negative lookbehind)

Checking strings after and before the expression

A positive lookahead checks that a subexpression matches right after the current position. For example, suppose you need to find all div selectors with the footer ID and remove the div part:

Search for | Replace to | Explanation
div(?=#footer) | (empty) | “div” followed by “#footer”

(?=#footer) checks that there is the #footer string here, but does not consume it. In div#footer, only div will match. A lookahead is zero-width, just like the anchors.

In div#header, nothing will match, because the lookahead assertion fails.
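The same search can be tried in any engine that supports lookahead. Below is a minimal sketch using Python's re module as a stand-in for Aba; the sample selectors are made up for illustration.

```python
import re

# Hypothetical sample selectors, for illustration only.
css = "div#footer { margin: 0 } div#header { margin: 0 }"

# The lookahead checks for "#footer" but does not consume it,
# so only "div" is replaced with an empty string.
result = re.sub(r"div(?=#footer)", "", css)
print(result)
```

The div in div#header is left alone because the lookahead assertion fails there.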

Of course, this can be solved without any lookahead:

Search for | Replace to | Explanation
div#footer | #footer | A simpler equivalent

Generally, any lookahead after the expression can be rewritten by copying the lookahead text into the replacement or by using backreferences.

In a similar way, positive lookbehind checks that there is a subexpression before the current position:

Search for | Replace to | Explanation
(?<=<a href=")news/ | blog/ | Replace “news/” preceded by “<a href="” with “blog/”
<a href="news/ | <a href="blog/ | The same replacement without lookbehind

The positive lookahead and lookbehind lead to a shorter regex, but you can do without them in this case. However, these were just basic examples. In some of the following regular expressions, the lookaround will be indispensable.
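To see that the two news/ replacements above are equivalent, here is a small sketch in Python's re module (used as a stand-in for Aba). The HTML fragment is a hypothetical sample.

```python
import re

# Hypothetical HTML fragment: one link to news/ and one plain mention.
html = '<a href="news/today.html">Today</a> See news/ for more.'

# With lookbehind: only "news/" is matched and replaced.
with_lookbehind = re.sub(r'(?<=<a href=")news/', "blog/", html)

# Without lookbehind: the prefix is matched too, so it is repeated
# in the replacement text.
without_lookbehind = re.sub(r'<a href="news/', '<a href="blog/', html)

print(with_lookbehind)
print(with_lookbehind == without_lookbehind)
```

The plain mention of news/ is untouched in both cases because it is not preceded by the link prefix.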

Testing the same characters for multiple conditions

Sometimes you need to test a string for several conditions.

For example, you want to find a consonant without listing all of them. It may seem simple at first: [^aeiouy]. However, this regular expression also finds spaces and punctuation marks, because it matches anything except a vowel. What you want is any letter except a vowel, so you also need to check that the character is a letter.

(?=[a-z])[^aeiouy] A consonant
[bcdfghjklmnpqrstvwxz] Without lookahead

Two conditions are applied to the same character here: (?=[a-z]) requires it to be a letter, and [^aeiouy] requires it not to be a vowel.

After (?=[a-z]) is checked, the current position is moved back because a lookahead has a width of zero: it does not consume characters, but only checks them. Then, [^aeiouy] matches (and consumes) one character that is not a vowel.

The order is important: the regex [^aeiouy](?=[a-z]) will match a character that is not a vowel, followed by any letter. Clearly it's not what is needed.
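The difference between the three variants can be checked with a short sketch in Python's re module (a stand-in for Aba; the sample text is arbitrary).

```python
import re

text = "big cat, hm?"

# Anything except a vowel: spaces and punctuation are matched too.
naive = re.findall(r"[^aeiouy]", text)

# A letter (checked by the lookahead) that is not a vowel.
consonants = re.findall(r"(?=[a-z])[^aeiouy]", text)

# Wrong order: a non-vowel that is followed by a letter.
wrong_order = re.findall(r"[^aeiouy](?=[a-z])", text)

print(naive)
print(consonants)
print(wrong_order)
```

Only the second pattern returns the consonants and nothing else.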

This technique is not limited to testing one character for two conditions; there can be any number of conditions of different lengths:

border:(?=[^;}]*\<solid\>)(?=[^;}]*\<red\>)(?=[^;}]*\<1px\>)[^;}]* Find a CSS declaration that contains the words solid, red, and 1px in any order.

This regex has three lookahead conditions. In each of them, [^;}]* skips any number of any characters except ; and } before the word. After the first lookahead, the current position is moved back and the second word is checked, etc.

The anchors \< and \> check that the whole word matches. Without them, 1px would match in 21px.

The last [^;}]* consumes the CSS declaration (the previous lookaheads only checked the presence of words, but didn't consume anything).

This regular expression matches {border: 1px solid red}, {border: red 1px solid;}, and {border:solid green 1px red} (different order of words; green is inserted), but doesn't match {border:red solid} (1px is missing).
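These cases can be verified with a sketch in Python's re module. Python does not support the \< and \> anchors, so this version assumes that \b (a word boundary) is an acceptable substitute; the last sample is an extra case added to show the 21px behaviour.

```python
import re

# \b replaces \< and \> here (an assumption for Python's syntax).
pattern = re.compile(
    r"border:(?=[^;}]*\bsolid\b)(?=[^;}]*\bred\b)(?=[^;}]*\b1px\b)[^;}]*"
)

samples = [
    "{border: 1px solid red}",
    "{border: red 1px solid;}",
    "{border:solid green 1px red}",
    "{border:red solid}",        # 1px is missing
    "{border: 21px solid red}",  # 1px must be a whole word
]

matched = [bool(pattern.search(s)) for s in samples]
for sample, ok in zip(samples, matched):
    print(sample, ok)
```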

Simulating overlapped matches

If you need to remove repeating words (e.g., replace the the with just the), you can do it in two ways, with and without lookahead:

Search for | Replace to | Explanation
\<(\w+)\s+(?=\1\>) | (empty) | Replace the first of repeating words with an empty string
\<(\w+)\s+\1\> | \1 | Replace two repeating words with the first word

The regex with lookahead works like this: the first parentheses capture the first word; the lookahead checks that the next word is the same as the first one.

The two regular expressions look similar, but there is an important difference. When replacing 3 or more repeating words, only the regex with lookahead works correctly. The regex without lookahead replaces every two words. After replacing the first two words, it moves to the next two words because the matches cannot overlap. For example, “the the the” becomes “the the”: the first two words are replaced with one, and the third word has no partner left to match.

However, you can simulate overlapped matches with lookaround. The lookahead will check that the second word is the same as the first one. Then, the second word will be matched against the third one, etc. Every word that has the same word after it will be replaced with an empty string, so “the the the” becomes “the”.

The correct regex without lookahead is \<(\w+)(\s+\1)+\>. It matches any number of repeating words (not just two of them).
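The three approaches can be compared on a triple repetition. This is a sketch in Python's re module, which is assumed here as a stand-in for Aba; \b is used in place of \< and \> because Python lacks those anchors.

```python
import re

text = "the the the cat"

# Without lookahead: matches cannot overlap, so one duplicate survives.
pairwise = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)

# With lookahead: every word followed by the same word is removed.
with_lookahead = re.sub(r"\b(\w+)\s+(?=\1\b)", "", text)

# Without lookahead, but matching the whole run of repeated words.
whole_run = re.sub(r"\b(\w+)(\s+\1)+\b", r"\1", text)

print(pairwise)
print(with_lookahead)
print(whole_run)
```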

Checking negative conditions

The negative lookahead checks that the next characters do NOT match the expression in parentheses. Just like a positive lookahead, it does not consume the characters. For example, (?!toves) checks that the next characters are not “toves” without including them in the match.

<\?(?!php) “<?” without “php” after it

This pattern will match <? in <?echo 'text'?> or in <?xml.
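A quick check of this pattern, sketched with Python's re module (the third sample is added for contrast):

```python
import re

pattern = re.compile(r"<\?(?!php)")

samples = [
    "<?echo 'text'?>",
    '<?xml version="1.0"?>',
    "<?php echo 'text'; ?>",  # "<?" is followed by "php": no match
]

results = [bool(pattern.search(s)) for s in samples]
print(results)
```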

Another example is anagram search. To find anagrams for “mate”, check that the first character is one of M, A, T, or E. Then, check that the second character is one of these letters and is not equal to the first character. After that, check the third character, which has to be different from the first and the second one, etc.

\<([mate])(?!\1)([mate])(?!\1)(?!\2)([mate])(?!\1)(?!\2)(?!\3)([mate])\> Anagram for “mate”

The sequence (?!\1)(?!\2) checks that the next character is not equal to the first subexpression and is not equal to the second subexpression.

The anagrams for “mate” are: meat, team, and tame. Certainly, there are special tools for anagram search, which are faster and easier to use.
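The anagram regex can be tried on a small word list. The sketch below uses Python's re module with \b in place of \< and \> (an assumption for Python's syntax); the word list is made up. Note that the pattern also matches the word “mate” itself, since it accepts any ordering of the four letters.

```python
import re

pattern = re.compile(
    r"\b([mate])(?!\1)([mate])(?!\1)(?!\2)([mate])"
    r"(?!\1)(?!\2)(?!\3)([mate])\b"
)

# Hypothetical word list: "matt" and "meet" repeat a letter, "tea" is too short.
words = "meat team tame mate matt meet tea"

found = [m.group(0) for m in pattern.finditer(words)]
print(found)
```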

A lookbehind can be negative, too, so it's possible to check that the previous characters do NOT match some expression:

\w+(?<!ing)\b A word that does not end with “ing” (the negative lookbehind)
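The negative lookbehind above has a fixed length, so it also works in engines with that restriction. A minimal sketch in Python's re module, with an arbitrary sample sentence:

```python
import re

text = "singing a song during spring"

# \w+ consumes a whole word; the lookbehind then rejects it
# if the last three characters are "ing".
not_ing = re.findall(r"\w+(?<!ing)\b", text)
print(not_ing)
```

Shorter matches inside a rejected word are not possible, because \b only holds at the end of the word, where the lookbehind fails.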

In most regex engines, a lookbehind must have a fixed length: you can use character lists and classes ([a-z] or \w), but not repetitions such as * or +. Aba is free from this limitation. You can go back by any number of characters; for example, you can find files not containing a word and insert some text at the end of such files.

Search for | Replace to | Explanation
(?<!Table of contents.*)$$ | <a href="/toc">Contents</a> | Insert the link to the end of each file not containing the words “Table of contents”
^^(?!.*Table of contents) | <a href="/toc">Contents</a> | Insert it to the beginning of each file not containing the words

However, you should be careful with this feature because an unlimited-length lookbehind can be slow.
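For comparison, here is how the lookahead variant might look in an engine with fixed-length lookbehind. This sketch uses Python's re module and rests on several assumptions: the whole file is held in one string, \A stands in for ^^, re.DOTALL lets .* cross line breaks, and add_toc_link is a hypothetical helper name.

```python
import re

LINK = '<a href="/toc">Contents</a>\n'

def add_toc_link(text: str) -> str:
    """Insert LINK at the start of text unless it contains the words."""
    # \A anchors at the very start of the string; the negative lookahead
    # scans the rest of the text without consuming it.
    return re.sub(r"\A(?!.*Table of contents)", LINK, text, flags=re.DOTALL)

without_toc = "Chapter 1\nSome text\n"
with_toc = "Table of contents\nChapter 1\n"

print(add_toc_link(without_toc))
print(add_toc_link(with_toc))

# Python is one of the engines that reject an unlimited-length lookbehind.
try:
    re.compile(r"(?<!Table of contents.*)\Z")
    variable_lookbehind_ok = True
except re.error:
    variable_lookbehind_ok = False
print(variable_lookbehind_ok)
```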

Controlling backtracking

A lookahead and a lookbehind do not backtrack; that is, when they have found a match and another part of the regular expression fails, they don't try to find another match. It's usually not important, because lookaround expressions are zero-width. They consume nothing and don't move the current position, so you cannot see which part of the string they match.

However, you can extract the matching text if you use a subexpression inside the lookaround. For example:

Search for | Replace to | Explanation
(?=\<(\w+)) | \1 | Repeat each word
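The effect of capturing inside a lookahead can be seen in a short sketch. It uses Python's re module with \b in place of \< (an assumption for Python's syntax, which also requires Python 3.7 or later for this handling of empty matches).

```python
import re

# The match itself is empty, but group 1 holds the word that follows,
# so the replacement inserts a copy of each word in front of it.
doubled = re.sub(r"(?=\b(\w+))", r"\1", "hello world")
print(doubled)
```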

Since lookarounds don't backtrack, this regular expression never matches:

(?=(\N*))\1\N A regex that doesn't backtrack and always fails
\N*\N A regex that backtracks and succeeds on non-empty lines

The subexpression (\N*) matches the whole line. \1 consumes the previously matched subexpression and \N tries to match the next character. It always fails because the next character is a newline.

A similar regex without lookahead succeeds because when the engine finds that the next character is a newline, \N* backtracks. At first, it has consumed the whole line (“greedy” match), but now it tries to match fewer characters. And it succeeds when \N* matches all but the last character of the line and \N matches the last character.
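Both regexes can be compared in a sketch with Python's re module. Python has no \N, so [^\n] is assumed here as an equivalent for “any character except a newline”.

```python
import re

text = "some text\nmore"

# The lookahead captures the rest of the line and is never re-entered,
# so \1 always consumes the whole line and [^\n] has nothing left to match.
no_backtracking = re.search(r"(?=([^\n]*))\1[^\n]", text)

# Here [^\n]* gives back one character, which the final [^\n] then matches.
backtracking = re.search(r"[^\n]*[^\n]", text)

print(no_backtracking)
print(backtracking.group(0))
```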

It's possible to prevent excessive backtracking with a lookaround, but it's easier to use atomic groups for that.

In a negative lookaround, subexpressions are meaningless because if a regex succeeds, negative lookarounds in it must fail. So, the subexpressions are always equal to an empty string. It's recommended to use a non-capturing group instead of the usual parentheses in a negative lookaround.

(?!(a))\1 A regex that always fails: (not A) and A
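The always-failing regex can be checked with a sketch in Python's re module, used here as a stand-in for Aba. One difference is worth flagging: Python reports a group that took no part in the match as None rather than as an empty string.

```python
import re

# (not "a") followed by a backreference to a group that never matched:
# the pattern cannot succeed at any position.
always_fails = re.search(r"(?!(a))\1", "abc")

# After a successful match, a group inside a negative lookahead holds no text.
m = re.search(r"(?!(a))b", "abc")
unused_group = m.group(1)

# The recommended form uses a non-capturing group instead.
recommended = re.search(r"(?!(?:a))b", "abc")

print(always_fails, unused_group, recommended.group(0))
```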

This is a page from the Aba Search and Replace help file.