Replace only the Nth match

10 May 2025

Here is another practical task. We need to replace only the second and any subsequent matches in each file, leaving the first match intact.

For example, we want to insert <a name="..."> tags before the second and any subsequent headers in each HTML file. There are multiple headers in each file:

Matching <h2> headers

And we want to insert a name to be able to refer to each header. The final result should look like this, with the <a name="..."> tag added before the header:

<a name="international_search_test"><h2>International search test</h2>

We can solve this task with a lookbehind:

Matching the second, the third, etc. tags
(?<=<h2>.*?)<h2>

This regular expression matches <h2>, but only if there is another <h2> before it.

With Aba Search and Replace, there is a more flexible way to do this. You can match all <h2> tags, but change only the second, the third, etc. tags leaving the first tag intact. The pattern is simple:

<h2>(.*?)</h2>

But in the replacement, we check if the match number is equal to one:

\( if Aba.matchNoInFile() == 1 {
   \0
} else {
   '<a name="' \1.replace(' ', '_').toLower() '">' \0
} )

If yes, we return the whole match \0 without any change. If not, we add <a name="..."> before it. We also use the toLower function to convert the name to lowercase and the replace function to replace spaces with underscores.

Replacing the second, the third, etc. tags

You can easily modify this one-liner to replace the first three tags in each file only, or replace every second tag, which is more complicated with a lookbehind.
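Outside Aba, the same idea can be sketched in Python by counting matches inside a replacement callback. This is only an illustration, not how Aba works internally: the function name add_anchors is hypothetical, and the counter plays the role of Aba.matchNoInFile() for the contents of a single file.

```python
import re
from itertools import count

H2 = re.compile(r'<h2>(.*?)</h2>')

def add_anchors(html):
    """Insert <a name="..."> before every <h2> header except the first one."""
    match_no = count(1)  # numbers the matches 1, 2, 3, ... within this text

    def replace(m):
        if next(match_no) == 1:
            return m.group(0)  # first match: return it unchanged
        # Build the name from the header text: spaces to underscores, lowercase
        name = m.group(1).replace(' ', '_').lower()
        return f'<a name="{name}">{m.group(0)}'

    return H2.sub(replace, html)

page = '<h2>Intro</h2>\n<h2>International search test</h2>'
result = add_anchors(page)
```

Changing the condition to something like next(match_no) <= 3 or an even/odd check gives the "first three tags" or "every second tag" variants.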

 

Aba 2.8 released

9 Mar 2025

The new version goes beyond just search and replace; it allows you to convert text and images to and from Base64, encode and decode HTML entities like &lt;, encode and decode percent-encoding (also known as URL encoding), decode JSON Web Tokens, and convert Unix/JavaScript timestamps to dates and vice versa.

Decode image from Base64
Decode Unix timestamp

Other new features and fixes include:

Just as always, the upgrade is free for the registered users.

 

Anonymizing a dataset by replacing names with counters

11 Jan 2025

Sometimes, you need to remove personal data from a dataset, such as when preparing examples or unit tests. With Aba Search and Replace, you can mask names, addresses, and other personally identifiable information by replacing them with counters.

Let's use the following CSV file with information about Alice in Wonderland characters as an example:

Name,Address,Favorite Color
Alice,Near the Rabbit Hole,Blue
Mad Hatter,Tea Party Garden,Orange
White Rabbit,Rabbit Hole,White
Queen of Hearts,Hearts Castle,Red
Cheshire Cat,Forest Tree Hollow,Purple
Caterpillar,Mushroom Grove,Green
Tweedledee,Looking Glass Land,Yellow
Tweedledum,Looking Glass Land,Yellow
March Hare,Mad Tea Party Estate,Brown
Dormouse,Tea Party Garden,Gray

You want to remove real names and addresses from this file. A common approach would be to write a script that opens the file, reads each line, replaces the first two fields with counters, and then prints the result. However, it's easier to do the same task with Aba Search and Replace. You don't have to write boilerplate code for file reading, and you can immediately preview the replacement results.

We'll use the following regular expression to match the first two columns in the CSV file while skipping the headers:

(?<=\n)(\N+?),(\N+?),

Here's how it works: first, we check that a newline \n is found before the match using a lookbehind assertion, which allows us to skip the headers (the first line). Next, we match two fields separated with commas.

We would like to replace the names (Alice, Mad Hatter, White Rabbit, etc.) with a counter like person1, person2, person3, etc. Aba provides functions for inserting counters; Aba.matchNo works well for this case:

Aba window

For the address field, we don't want to use the same sequence (1, 2, 3), so let's do some math with the counter in order to start from 77 and decrement each street number by 3. The replacement expression becomes:

person\{ Aba.matchNo() },\{ 80 - Aba.matchNo() * 3 } Wonderland Drive,
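For comparison, here is a minimal Python sketch of the same replacement. It assumes the whole file is already in a string; the function name anonymize is made up for this example, and [^\n] is used in place of \N because Python's re module does not support that escape in this role.

```python
import re
from itertools import count

# First two CSV fields of every line except the first (header) line
ROW_START = re.compile(r'(?<=\n)([^\n]+?),([^\n]+?),')

def anonymize(csv_text):
    match_no = count(1)  # plays the role of Aba.matchNo()

    def replace(m):
        n = next(match_no)
        # Street numbers start at 77 and decrease by 3 for each row
        return f'person{n},{80 - n * 3} Wonderland Drive,'

    return ROW_START.sub(replace, csv_text)

sample = (
    'Name,Address,Favorite Color\n'
    'Alice,Near the Rabbit Hole,Blue\n'
    'Mad Hatter,Tea Party Garden,Orange\n'
)
masked = anonymize(sample)
```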

Note that proper anonymization is more complex than this. In our example, it's still possible to identify some characters after the replacement. For example, White Rabbit predictably likes white, Queen of Hearts likes red ❤️, and the twins (Tweedledee and Tweedledum) share the same favorite color, yellow. So the anonymization process won't meet GDPR requirements and you need further manual edits to remove or randomize such cases, but the replacement is a good first step for removing sensitive information.

 

Automatically add width and height to img tags

14 Jul 2024

If you set the width and height attributes for your img tags, the browser can allocate the correct amount of space for the image before loading it. This prevents content below the image from shifting around as the page loads. The layout becomes stable, which means that:

That’s why Google recommends setting the width and height attributes in your HTML code.

If you have a lot of images, it may take some time to specify their dimensions. With Aba Search and Replace, you can do it automatically.

The typical case

Adding width and height to HTML images

Please use this search pattern to capture the image file name in the first subexpression:

<img src="([^"]+)"

The [^"]+ regex matches everything up to the closing quotation mark, and the parentheses mark the first subexpression.

If you have absolute paths like <img src="/images/someImage.png"> in your HTML code, use the following replacement:

\0 \{ File(Aba.searchPath() \1).meta('ImgTag') }

Here, we insert the whole match \0, which is the img tag and its src attribute. Then, we insert width and height via the meta function. The Aba.searchPath() function returns the directory that you selected for the search, then the image filename \1 is added to it.

Relative paths

Adding width and height to HTML images with relative paths

If your paths are relative to the HTML files (e.g., <img src="someImage.png"> or <img src="../banner.png">), then use a simpler replacement:

\0 \{ File(\1).meta('ImgTag') }

Replacing existing width and height attributes

If you have existing width and height attributes and you want to replace them, the regex becomes more complex. For example, if the width and height always follow the src attribute:

<img src="([^"]+)" width="\d+" height="\d+"

And the replacement should be:

<img src="\1" \{File(Aba.searchPath() \1).meta('ImgTag')}

Matching the existing width and height attributes

Matching tags without existing width and height attributes

More often, you need to skip the tags that already have the width or the height attribute. Our previous regular expression also has these disadvantages:

The following regular expression fixes these problems:

<img\s+(?:alt="[^"]*?"\s+)?src="([^"]+)"(?!\s+width|\s+height)

And the replacement should be:

\0 \{ File( if \1[0] == '/' { Aba.searchPath() } else {''} \1.decodeUrl()).meta('ImgTag') }

We use alt="[^"]*?" to match an optional alt attribute. If you use other attributes, you can add them here. Instead of spaces, we use \s+ to match any number of spaces or line breaks. The regular expression includes a negative lookahead (?!\s+width|\s+height), so it skips the tags that already have width or height attributes.

The replacement checks if the first character of the file name is a slash /; if yes, it uses the absolute path. Finally, the decodeUrl function replaces %20 with spaces.
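To check which tags the pattern selects before running it on real files, you can try it on a few samples. The following Python sketch is an illustration only: resolve is a hypothetical helper that mirrors the path logic of the replacement (prefix the search directory for absolute paths, then decode percent-encoding), and 'C:/site' is an assumed search directory.

```python
import re
from urllib.parse import unquote

IMG = re.compile(r'<img\s+(?:alt="[^"]*?"\s+)?src="([^"]+)"(?!\s+width|\s+height)')

def resolve(src, search_root):
    """Hypothetical helper: map a src attribute to a file path."""
    path = unquote(src)  # %20 becomes a space, etc.
    return search_root + path if src.startswith('/') else path

html = '''<img src="a.png">
<img alt="Logo" src="/images/my%20logo.png">
<img src="b.png" width="10" height="20">
<img src="c.png"
     height="20">'''

# Only the first two tags lack width/height, so only they are matched
matched = IMG.findall(html)
paths = [resolve(src, 'C:/site') for src in matched]
```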

This regular expression works in most cases and is included in the favorites by default. Note that regex matching is textual, so the program does not really understand HTML. You may need to modify the regular expression to match your specific case.

Conclusion

You can preview the replacements and check that the img tags are matched correctly. If Aba cannot find an image file, it will display an error message with the src attribute and the HTML filename. Then, just press the Replace button and test the result in your browser. If anything goes wrong, you can always undo the replacement.

Aba can help you to ensure that all of your pages use width and height attributes, which improves performance, prevents layout shifts, and makes your website more visually appealing for the users.

 

Using zero-width assertions in regular expressions

30 Jun 2024

Anchors ^ $ \b \A \Z

Anchors in regular expressions allow you to specify context in a string where your pattern should be matched. There are several types of anchors:

These anchors are supported in Java, PHP, Python, Ruby, C#, and Go. In JavaScript, \A and \Z are not supported, but you can use ^ and $ instead of them; just remember to keep the multiline mode disabled. Aba Search and Replace always runs in multiline mode, so you can use \A and \Z to match the beginning or the end of a file.

For example, the regular expression ^abc matches the letters abc only at the start of a string. In multiline mode, the same regex also matches these letters at the beginning of any line. You can use anchors in combination with other regular expression elements to create more complex matches. For example, ^From: (.*) matches a line starting with From: and captures the rest of that line.

The difference between \Z and \z is that \Z matches at the end of the string but also skips a possible newline character at the end. In contrast, \z is more strict and matches only at the end of the string.
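Dialects differ in how they spell these anchors, so it is worth checking your engine. As a small sketch in Python (used here only as one example of a dialect): Python's \Z is the strict variant and matches only at the very end of the string, while $ without multiline mode is the lenient one that also matches just before a final newline.

```python
import re

text = 'abc\n'

# `$` (without re.MULTILINE) also matches just before a trailing newline
lenient = re.search(r'abc$', text)

# Python's \Z matches only at the very end of the string
strict = re.search(r'abc\Z', text)

# Including the newline in the pattern makes the strict anchor succeed
strict_with_newline = re.search(r'abc\n\Z', text)
```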

If you have read the previous part of this article, you may wonder if the anchors add any additional capabilities that are not supported by the three primitives (alternation, parentheses, and the star for repetition). The answer is: they do not, but they change what is captured by the regular expression. You can match a line starting with abc by explicitly adding the newline character: \nabc, but in this case, you will also match the newline character itself. When you use ^abc, the newline character is not consumed.

In a similar way, ing\b matches all words ending with ing. You can replace the anchor with a character class containing non-letter characters (such as spaces or punctuation): ing\W, but in this case, the regular expression will also consume the space or punctuation character.
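The difference in what gets consumed is easy to see in a short experiment. This Python sketch compares the anchored and the explicit forms of both examples; the sample strings are made up for illustration.

```python
import re

text = 'xyz\nabc'

# Explicit newline: the newline becomes part of the match
with_newline = re.search(r'\nabc', text).group(0)

# Anchor in multiline mode: only the letters are matched
with_anchor = re.search(r'^abc', text, re.MULTILINE).group(0)

words = 'sing, ring'

# \b consumes nothing, so both words are found
with_boundary = re.findall(r'ing\b', words)

# \W consumes the comma and finds nothing at the end of the string
with_class = re.findall(r'ing\W', words)
```

Note that ing\W misses the last word because there is no character left to consume after it, which is one more practical difference from the anchor.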

If the regular expression starts with ^ so that it only matches at the start of the string, it's called anchored. In some programming languages, you can do an anchored match instead of the non-anchored search without using ^. For example, in PHP (PCRE), you can use the A modifier.

So the anchors don't add any new capabilities to the regular expressions, but they allow you to manage which characters will be included into the match or to match only at the beginning or end of the string. The matched language is still regular.

Zero-width assertions (?= ) (?! ) (?<= ) (?<! )

Zero-width assertions (also called lookahead and lookbehind assertions) allow you to check that a pattern occurs in the subject string without consuming any of the characters. This can be useful when you want to check for a pattern without moving the match pointer forward. For example, you can test that the next characters are abc without consuming them: (?=abc).

Zero-width assertions are generalized anchors. Just like anchors, they don't consume any character from the input string. Unlike anchors, they allow you to check anything, not only line boundaries or word boundaries. So you can replace an anchor with a zero-width assertion, but not vice versa. For example, ing\b could be rewritten as ing(?=\W|$).
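A short Python sketch can illustrate both points: the lookahead form finds the same matches as ing\b on this sample, and an assertion on its own matches an empty string without moving forward. The sample strings and the price example are made up for illustration.

```python
import re

words = 'sing, ring'

# Lookahead version of ing\b: a non-word character or the end must follow
endings = re.findall(r'ing(?=\W|$)', words)

# A lookahead alone succeeds but consumes nothing
probe = re.match(r'(?=abc)', 'abcdef')
consumed = probe.end()

# Lookbehind: digits that are preceded by a dollar sign
prices = re.findall(r'(?<=\$)\d+', 'cost $15 or 20')
```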

Aba documentation includes a detailed article on zero-width assertions (lookaround) and their typical usage, so we won't repeat it here. Zero-width lookahead and lookbehind are supported in PHP, JavaScript, Python, Java, and Ruby. Unfortunately, they are not supported in Go.

Just like anchors, zero-width assertions still match a regular language, so from a theoretical point of view, they don't add anything new to the capabilities of regular expressions. They just make it possible to skip certain things from the captured string, so you only check for their presence but don't consume them.

 

Aba 2.7 released

12 May 2024

In the new version, Aba got a UI facelift and dark mode. Several critical bugs were fixed in this release, so it's recommended for everyone to install. The changes are:

Dark mode

Just as always, the upgrade is free for the registered users.

 

Regular Expressions 101

28 Jan 2024

With regular expressions, you can describe the patterns that are similar to each other. For example, you have multiple <img> tags, and you want to move all these images to the images folder:

<img src="9.png">   →  <img src="images/9.png">
<img src="10.png">  →  <img src="images/10.png">
and so on

You can easily write a regular expression that matches all file names that are numbers, then replace all such tags at once.

Basic syntax

If you need to match one of the alternatives, use an alternation (vertical bar). For example:

Regex          Meaning
a|img|h1|h2    either a, or img, or h1, or h2

When using alternation, you often need to group characters together; you can do this with parentheses. For example, if you want to match an HTML tag, this approach won't work:

Regex          Meaning
<h1|h2|b|i>    <h1 or h2 (without the angle brackets) or b or i>

because < applies to the first alternative only and > applies to the last one only. To apply the angle brackets to all alternatives, you need to group the alternatives together:

<(h1|h2|b|i)>
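You can confirm how the alternation binds with a quick test. This Python sketch uses re.fullmatch, which requires the whole string to match, to compare the ungrouped and the grouped patterns.

```python
import re

ungrouped = re.compile(r'<h1|h2|b|i>')
grouped = re.compile(r'<(h1|h2|b|i)>')

# Without parentheses, the alternatives are "<h1", "h2", "b", and "i>"
ungrouped_accepts = [s for s in ['<h1', 'h2', 'b', 'i>', '<h2>', '<b>']
                     if ungrouped.fullmatch(s)]

# With parentheses, the angle brackets apply to every alternative
grouped_accepts = [s for s in ['<h1>', '<h2>', '<b>', '<i>', 'h2', '<h1']
                   if grouped.fullmatch(s)]
```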

The last primitive (star) allows you to repeat anything zero or more times. You can apply it to one character, for example:

Regex    Meaning
a*       an empty string, a, aa, aaa, aaaa, etc.

You also can apply it to multiple characters in parentheses:

Regex    Meaning
(ab)*    an empty string, ab, abab, ababab, abababab, etc.

Note that if you remove the parentheses, the star will apply to the last character only:

Regex    Meaning
ab*      a, ab, abb, abbb, abbbb, etc.

Stephen Kleene (1909-1994), the regular expression inventor.
Author: Konrad Jacobs. Source: Archives of the Mathematisches Forschungsinstitut Oberwolfach.

The star is named the Kleene star after the American mathematician Stephen Kleene, who invented regular expressions in the 1950s. It can match an empty string as well as any number of repetitions.
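The three star examples can be checked mechanically. In this Python sketch, matches is a small hypothetical helper around re.fullmatch, so a pattern is accepted only if it describes the entire string.

```python
import re

def matches(pattern, s):
    """True if the pattern describes the whole string s."""
    return re.fullmatch(pattern, s) is not None

# a* : zero or more letters a
star_single = [matches(r'a*', s) for s in ['', 'a', 'aaa', 'ab']]

# (ab)* : zero or more repetitions of the pair ab
star_group = [matches(r'(ab)*', s) for s in ['', 'ab', 'abab', 'aba']]

# ab* : one a followed by zero or more letters b
star_last = [matches(r'ab*', s) for s in ['', 'a', 'abbb', 'abab']]
```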

These three primitives (alternation, parentheses, and the star for repetition) are enough to write any regular expression, but the syntax may be verbose. For example, you can now write a regex for matching the file names that are numbers in an <img> tag:

Regex                                          Meaning
(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*    one or more digits
(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*      a positive integer number (don't allow zero as the first character)

The parentheses may be nested without a limit, for example:

Regex                                                                                     Meaning
(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*(,(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*)*    one or more positive integer numbers, separated with commas

Convenient shortcuts for character classes

You can write any regex with the three primitives, but it quickly becomes hard to read, so a few shortcuts were invented. When you need to match any of the listed characters, please put them into square brackets:

Regex                                                  Shorter regex    Meaning
a|e|i|o|u|y                                            [aeiouy]         a vowel
0|1|2|3|4|5|6|7|8|9                                    [0123456789]     a digit
0|1|2|3|4|5|6|7|8|9                                    [0-9]            a digit
a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z    [a-z]            a letter

As you can see, it's possible to specify only the first and the last allowed character if you put a dash between them. There may be several such ranges inside square brackets:

Regex        Meaning
[a-z0-9]     a letter or a digit
[a-z0-9_]    a letter, a digit, or the underscore character
[a-f0-9]     a hexadecimal digit

There are some predefined character classes that are even shorter to write:

Regex    Meaning
\s       a space character: the space, the tab character, the new line, or the carriage return
\d       a digit
\w       a word character (a letter, a digit, or the underscore character)
.        any character

In Aba Search and Replace, these character classes include Unicode characters such as accented letters or Unicode line breaks. In other regex dialects, they usually include ASCII characters only, so \d is typically the same as [0-9] and \w is the same as [a-zA-Z0-9_].

The character classes don't add any new capabilities to the regular expressions; you can just list all allowed characters with an alternation, but a character class is much shorter to write. We now can write a shorter version of the regex mentioned before:

Regex                         Meaning
[1-9][0-9]*(,[1-9][0-9]*)*    one or more positive integer numbers, separated with commas
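As a sanity check, the verbose form built from the three primitives and the character-class form (written here as [1-9][0-9]*(,[1-9][0-9]*)*) should accept exactly the same strings. The following Python sketch compares them on a handful of samples; the sample list is arbitrary and only illustrates the idea, it does not prove equivalence.

```python
import re

# Verbose form: only alternation, parentheses, and the star
NONZERO = '(1|2|3|4|5|6|7|8|9)'
DIGIT = '(0|1|2|3|4|5|6|7|8|9)'
INTEGER = NONZERO + DIGIT + '*'
verbose = re.compile(INTEGER + '(,' + INTEGER + ')*')

# Same language written with character classes
short = re.compile(r'[1-9][0-9]*(,[1-9][0-9]*)*')

samples = ['7', '10,200,3', '1,2', '', '0', '01', '1,,2', '12,', '5,07']

# Both patterns must agree on every sample
agree = all(
    (verbose.fullmatch(s) is None) == (short.fullmatch(s) is None)
    for s in samples
)
accepted = [s for s in samples if short.fullmatch(s)]
```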

Repetitions

A Kleene star means "repeating zero or more times", but you often need another number of repetitions. As shown before, you can just copy-and-paste a regex to repeat it twice or three times, but there is a shorter notation for that:

Regex                                 Shorter regex    Meaning
\d\d*                                 \d+              one or more digits
(0|1)(0|1)*                           [01]+            any binary number (consisting of zeros and ones)
(\s|)                                 \s?              either a space character or nothing
http(s|)                              https?           either http or https
(-|\+|)                               [-+]?            the minus sign, the plus sign, or nothing
[a-z][a-z]                            [a-z]{2}         two small letters
[a-z][a-z]((([a-z]|)[a-z]|)[a-z]|)    [a-z]{2,5}       from two to five small letters
[a-z][a-z][a-z]*                      [a-z]{2,}        two or more small letters

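So there are the following repetition operators: * (zero or more times), + (one or more times), ? (zero or one time), {n} (exactly n times), {n,m} (from n to m times), and {n,} (n or more times).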

Note that you can express any repetition with the curly brackets, so these operators partially duplicate each other. For example:

Regex      Shorter regex    Meaning
\d{0,}     \d*              nothing or some digits
\d{1,}     \d+              one or more digits
\s{0,1}    \s?              either a space character or nothing

Just like the Kleene star, the other repetition operators can apply to parentheses, so you can nest them indefinitely.
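The long and short notations describe the same strings, which you can spot-check in code. This Python sketch tests several of the pairs from the table against a small, arbitrary set of samples; same_language is a hypothetical helper, and agreeing on the samples is an illustration rather than a proof.

```python
import re

pairs = [
    (r'\d\d*', r'\d+'),
    (r'(0|1)(0|1)*', r'[01]+'),
    (r'http(s|)', r'https?'),
    (r'(-|\+|)', r'[-+]?'),
    (r'[a-z][a-z]((([a-z]|)[a-z]|)[a-z]|)', r'[a-z]{2,5}'),
    (r'[a-z][a-z][a-z]*', r'[a-z]{2,}'),
]

samples = ['', '0', '101', '2024', 'http', 'https', 'httpss',
           '-', '+', '+-', 'a', 'ab', 'abcde', 'abcdef']

def same_language(long_form, short_form):
    """True if both patterns accept or reject each sample together."""
    return all(
        (re.fullmatch(long_form, s) is None) == (re.fullmatch(short_form, s) is None)
        for s in samples
    )

results = [same_language(long_form, short_form) for long_form, short_form in pairs]
```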

Escaping

If you need to match any of the special characters like parentheses, vertical bar, plus, or star, you must escape them by adding a backslash \ before them. For example, to find a number in parentheses, use \(\d+\).

A common mistake is to forget a backslash before a dot. Note that a dot means any character, so if you write example.com in a regular expression, it will match examplexcom or something similar, which may even cause a security issue in your program. Now we can write a regex to match the <img> tags:

<img src="\d+\.png">

This matches any filename consisting of digits and we correctly escaped the dot.
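Here is a brief Python sketch of the two points above: the unescaped dot matching an unintended string, and the <img> replacement from the beginning of this post. The sample HTML is made up; re.escape is the standard library function for escaping a literal string.

```python
import re

# An unescaped dot matches any character, including "x"
unescaped = re.fullmatch(r'example.com', 'examplexcom')
escaped = re.fullmatch(r'example\.com', 'examplexcom')

# re.escape adds the backslash for you when the text is a literal
literal = re.escape('example.com')

# Move numbered images to the images folder; other images are untouched
html = '<img src="9.png"> <img src="10.png"> <img src="logo.png">'
moved = re.sub(r'<img src="(\d+\.png)">', r'<img src="images/\1">', html)
```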

Other features

Modern regex engines add more features such as backreferences or conditional subpatterns. Mathematically speaking, these features don't belong to the regular expressions; they describe a non-regular language, so you cannot replace them with the three primitives.

Next time, we will discuss anchors and zero-width assertions.

 

2023 in review

14 Jan 2024

In 2023, I continued to support Ukraine and donated more than 50% of the revenue from Aba Search and Replace to the charities helping Ukrainians in need. I will keep donating this year.

Released in December, Aba 2.6 is the first version that requires Windows Vista. The previous versions were tested on Windows XP, which remained popular for a long time after its release. Unfortunately, it became increasingly hard to maintain the Windows XP compatibility code and it limited the further development, so I had to say goodbye to Windows 2000/XP. Please contact me if it creates any problem for you; I always listen to your feedback and can send you the previous version.

In January 2023, Microsoft certified Aba Search and Replace for publication to the Microsoft Store. The new version 2.6 was also approved a few days ago, so you can download it from the Microsoft Store as well as from this website.

Thanks to Richard, Aba is also available in French. If you are a native speaker of Spanish, German, or Italian and you can translate the 17 messages that were added in the recent version, please contact me. Feel free to use Google Translate or ChatGPT, then review and edit the automatic translation. Thank you so much.

The blog post about escaping in regular expressions is still the most popular on this blog. In April, I wrote a followup about empty character classes, which was also well-received.

The new Aba version remains lean and fast. No huge runtime libraries, no cluttered UIs or bloatware. Stay tuned for the next versions!

 

Regular expression for numbers

30 Dec 2023

It's easy to find a positive integer number with regular expressions:

[0-9]+

This regex means digits from 0 to 9, repeated one or more times. However, numbers starting with zero are treated as octal in many programming languages, so you may wish to avoid matching them:

[1-9][0-9]*

This regular expression matches any positive integer number starting with a non-zero digit. If you also need to match zero, you can include it as another branch:

[1-9][0-9]*|0

To also accommodate negative integer numbers, you can allow a minus sign before the digits:

-?[1-9][0-9]*|0

Sometimes it's necessary to allow a plus sign as well:

[-+]?[1-9][0-9]*|0

The previous regexes searched the input string for a number. If you need to check that the whole string is a number and reject anything else, you can add the ^ anchor to match the beginning of the string and the $ anchor to match the end:

^(-?[1-9][0-9]*|0)$

Parentheses are necessary here; without them, the ^ anchor would apply only to the first branch. Another variation of the same regex avoids finding numbers that are part of words, such as 600px or x64:

\b(-?[1-9][0-9]*|0)\b
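The effect of the anchors and of the parentheses can be tried out in Python. This sketch uses made-up sample strings; the variable names are only for illustration.

```python
import re

whole = re.compile(r'^(-?[1-9][0-9]*|0)$')
no_parens = re.compile(r'^-?[1-9][0-9]*|0$')
in_text = re.compile(r'\b(-?[1-9][0-9]*|0)\b')

# The anchored regex accepts a string only if all of it is a number
whole_results = {s: whole.search(s) is not None
                 for s in ['0', '42', '-17', '007', '12a']}

# Without parentheses, ^ binds to the first branch only, so "12a" slips through
slips_through = no_parens.search('12a') is not None

# Word boundaries skip numbers that are glued to letters
found = in_text.findall('600px x64 42 0')

# A minus sign after a space is not preceded by a word boundary
after_space = in_text.findall('x -5')
```

One thing this experiment makes visible: \b needs a word character on one side, so when a minus sign follows a space, only the digits are matched. If you need the sign in such cases, the boundary before it may have to be replaced with a lookbehind.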

Things get more complicated if you need to match a fractional number:

\b-?(?:[1-9][0-9]*(?:\.[0-9]+)?|\.[0-9]+|0)\b

Let's break down this regular expression:

For floating-point numbers with an exponent, such as 5.2777e+231, please use:

\b-?(?:[1-9][0-9]*(?:\.[0-9]+)?|\.[0-9]+|0)(?:[eE][+-]?[0-9]+)?\b

Many programming languages support hexadecimal numbers starting with 0x. Here is a regular expression to match them:

0x[0-9a-fA-F]+

Finally, here is a comprehensive regular expression to match floating-point, integer decimal, or hexadecimal numbers:

\b-?(?:[1-9][0-9]*(?:\.[0-9]+)?|\.[0-9]+|0(?:x[0-9a-fA-F]+)?)(?:[eE][+-]?[0-9]+)?\b
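To see what the comprehensive pattern picks up, you can run it over a line of text. This is a small Python sketch with a made-up sample; since the pattern uses only non-capturing groups, findall returns the whole matches.

```python
import re

NUMBER = re.compile(
    r'\b-?(?:[1-9][0-9]*(?:\.[0-9]+)?|\.[0-9]+|0(?:x[0-9a-fA-F]+)?)'
    r'(?:[eE][+-]?[0-9]+)?\b'
)

text = 'values: 42, 3.14, 5.2777e+231, 0x1F, 0'

# Integer, fraction, exponent form, hexadecimal, and zero
found = NUMBER.findall(text)
```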

 

Aba 2.6 released

25 Dec 2023

This version adds the following features:

Just as always, the upgrade is free for the registered users; your settings and search history will be preserved when you run the installer.

If you have any suggestions for new features, please contact me. I will be happy to implement your ideas.

 

This is a blog about Aba Search and Replace, a tool for replacing text in multiple files.