How to replace HTML tags using regular expressions
3 Nov 2011
Strictly speaking, you cannot parse HTML using only regular expressions. The reason is explained in any computer science curriculum: HTML is a context-free language, but regular expressions can describe only regular languages. So, you cannot match arbitrarily nested tags with them.
However, regexes are really useful for quick search and replace in your web pages. Full parsing is unnecessary, because you know the HTML code that you wrote. Approaches that are “impure” from a theoretical point of view work extremely well in this setting. You can even simplify the regexes shown below: say, if you never insert newlines between <a and href, then you need not allow for them in your regular expression.
Match an HTML tag
<a\s(.*?)>(.*?)</a>
This regex matches an <a> tag with any attributes. If you break it into parts:
- \s matches a space or a newline after a;
- .*? matches any text to the next closing angle bracket >;
- another .*? matches any text inside the tag.
Parentheses are used to capture the attributes and the text inside the tag. You can then refer to them using \1 and \2 in the replacement. For example, you can remove all links by replacing each match with \2, which keeps only the link text.
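As a rough sketch, here is that replacement written with Python's re module (an assumption; the article itself is tool-agnostic). The function name remove_links and the sample string are made up for illustration. The re.DOTALL flag is added so that .*? can also cross line breaks, which may or may not match your tool's default.

```python
import re

# The regex from this section. re.DOTALL lets "." match newlines too, and
# re.IGNORECASE also accepts <A ...>; both flags are assumptions.
LINK_RE = re.compile(r'<a\s(.*?)>(.*?)</a>', re.DOTALL | re.IGNORECASE)


def remove_links(html):
    """Replace every <a ...>text</a> with just its text (group 2)."""
    return LINK_RE.sub(r'\2', html)


sample = 'See <a href="x.html">this page</a> and <a\nhref="y.html" class="c">that one</a>.'
print(remove_links(sample))  # See this page and that one.
```

Because the pattern requires whitespace right after <a, tags such as <abbr> are left alone.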
As mentioned above, the regex will not correctly match nested <a> tags; it just finds the next closing tag of the same type. But in this case, it's not important, because nested <a> tags make little sense :)
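To see the nesting limitation in action, here is a small Python sketch that applies a pattern of the same shape to <div>, a tag that is commonly nested. The <div> pattern and the sample string are illustrative assumptions, not part of the original article.

```python
import re

# Same shape as the <a> regex above, but for <div>, which is often nested.
DIV_RE = re.compile(r'<div>(.*?)</div>', re.DOTALL)

nested = '<div>outer <div>inner</div> tail</div>'
match = DIV_RE.search(nested)

# The match stops at the first closing tag, not at the one that
# actually closes the outer <div>.
print(match.group(0))  # <div>outer <div>inner</div>
print(match.group(1))  # outer <div>inner
```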
Match an opening HTML tag with some attribute
<a\s([^>]*\s)?href="(.*?)"(.*?)>
This regex matches an opening <a> tag with an href attribute. The differences from the previous example are:
- [^>]* matches anything except the closing angle bracket > (so it skips any attributes before href);
- the question mark ? makes the other attributes optional, so href can immediately follow a.
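The pieces described above can be tried out with a short Python sketch that collects the href values from a snippet. The helper name find_hrefs and the sample markup are assumptions made for this illustration.

```python
import re

# The regex from this section: optional attributes before href (group 1),
# the href value (group 2), and whatever follows it in the tag (group 3).
HREF_RE = re.compile(r'<a\s([^>]*\s)?href="(.*?)"(.*?)>')


def find_hrefs(html):
    """Return the href value (group 2) of every matching opening <a> tag."""
    return [m.group(2) for m in HREF_RE.finditer(html)]


sample = ('<a href="a.html">A</a> '
          '<a class="x" href="b.html" title="t">B</a> '
          '<a name="n">C</a>')
print(find_hrefs(sample))  # ['a.html', 'b.html']
```

The third tag has no href, so it is skipped: [^>]* cannot run past the closing angle bracket to borrow an href from a later tag.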
This regex is simple enough and works in most cases. The HTML standard allows spaces around = and single quotes instead of double quotes in attribute values. If you need to match such tags, you need a more complicated regex:
<a\s([^>]*\s)?href\s*=\s*(["'])(.*?)\2(.*?)>
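A quick way to check the quote-aware pattern is the following Python sketch. Note that the quote character now occupies group 2, so the href value moves to group 3. The helper name find_hrefs_any_quotes and the sample tags are assumptions for illustration.

```python
import re

# Group 2 captures the opening quote; \2 then requires the same quote
# character to close the value, so the value itself is group 3.
QUOTED_HREF_RE = re.compile(r'''<a\s([^>]*\s)?href\s*=\s*(["'])(.*?)\2(.*?)>''')


def find_hrefs_any_quotes(html):
    """Return href values written with single or double quotes."""
    return [m.group(3) for m in QUOTED_HREF_RE.finditer(html)]


sample = ("<a href = 'a.html'>A</a> "
          '<a class="x" href="b.html">B</a>')
print(find_hrefs_any_quotes(sample))  # ['a.html', 'b.html']
```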
But simpler regexes usually suffice. Here is how you can replace absolute links with relative ones:
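A minimal sketch of such a replacement in Python, under stated assumptions: the site lives at the hypothetical address http://www.example.com/, and root-relative links (starting with /) are what you want. The names SITE and make_links_relative are invented for this example.

```python
import re

# Hypothetical site prefix; replace it with your own domain.
SITE = "http://www.example.com/"

# The simple href regex from above, restricted to links that start with SITE.
ABS_LINK_RE = re.compile(
    r'<a\s([^>]*\s)?href="' + re.escape(SITE) + r'(.*?)"(.*?)>'
)


def make_links_relative(html):
    """Rewrite href="http://www.example.com/path" as href="/path"."""
    def repl(m):
        before = m.group(1) or ""  # group 1 is None when href comes first
        # Whitespace right after "<a" is normalized to a single space.
        return '<a {}href="/{}"{}>'.format(before, m.group(2), m.group(3))
    return ABS_LINK_RE.sub(repl, html)


page = ('<a class="nav" href="http://www.example.com/docs/index.html" title="Docs">Docs</a> '
        '<a href="http://other.example.org/x">X</a>')
print(make_links_relative(page))
```

Links to other sites do not start with the SITE prefix, so they are left untouched. In a search-and-replace tool, the equivalent replacement string would be something like <a \1href="/\2"\3>, assuming the tool treats an unmatched \1 as empty.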
I hope that this short tutorial convinced you of the power of regular expressions :)
See also: Regular expression reference