Aba Search and Replace blog

Automatically add width and height to img tags

Sun, 14 Jul 2024 15:30:03 +0200

If you set the width and height attributes for your img tags, the browser can allocate the correct amount of space for the image before loading it. This prevents content below the image from shifting around as the page loads. The layout becomes stable, which means that:

your users won’t accidentally click the wrong button because of a layout shift;
the performance is better because the browser doesn’t have to recalculate the layout as the images load;
page load feels smoother and faster.

That’s why Google recommends setting the width and height attributes in your HTML code.

If you have a lot of images, it may take some time to specify their dimensions. With Aba Search and Replace, you can do it automatically.

The typical case

Please use this search pattern to capture the image file name in the first subexpression:

<img src="([^"]+)"

The [^"]+ regex matches everything up to the closing quotation mark, and the parentheses mark the first subexpression.

If you have absolute paths like <img src="/images/someImage.png"> in your HTML code, use the following replacement:

\0 \{ File(Aba.searchPath() \1).meta('ImgTag') }

Here, we insert the whole match \0, which is the img tag and its src attribute. Then, we insert width and height via the meta function. The Aba.searchPath() function returns the directory that you selected for the search, then the image filename \1 is added to it.
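To see what \0 and \1 contain, the same search pattern can be tried outside Aba. The following is a minimal sketch that uses Python's re module purely as a stand-in; Aba's own engine and replacement syntax are not involved, and the sample HTML line is made up for illustration.

```python
import re

# The search pattern from the article; group 1 is the image file name.
IMG_SRC = re.compile(r'<img src="([^"]+)"')

# A made-up line of HTML used only for this illustration.
html = '<p><img src="/images/someImage.png" alt="Logo"></p>'
match = IMG_SRC.search(html)

whole_match = match.group(0)  # corresponds to \0 in the replacement
file_name = match.group(1)    # corresponds to \1 in the replacement

print(whole_match)  # <img src="/images/someImage.png"
print(file_name)    # /images/someImage.png
```

Because the match stops right after the closing quotation mark of src, anything appended after \0 lands before the remaining attributes of the tag.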

Relative paths

If your paths are relative to the html files (e.g., <img src="someImage.png"> or <img src="../banner.png">), then use a simpler replacement:

\0 \{ File(\1).meta('ImgTag') }

Replacing existing width and height attributes

If you have existing width and height attributes and you want to replace them, the regex becomes more complex. For example, if the width and height always follow the src attribute:

<img src="([^"]+)" width="\d+" height="\d+"

And the replacement should be:

<img src="\1" \{File(Aba.searchPath() \1).meta('ImgTag')}

Conclusion

You can preview the replacements and check that the img tags are matched correctly. If Aba cannot find an image file, it will display an error message with the src attribute and the HTML filename. Then, just press the Replace button and test the result in your browser. If anything goes wrong, you can always undo the replacement.

Aba can help you to ensure that all of your pages use width and height attributes, which improves performance, prevents layout shifts, and makes your website more visually appealing for the users.

Using zero-width assertions in regular expressions

Sun, 30 Jun 2024 22:27:35 +0200

Anchors ^ $ \b \A \Z

Anchors in regular expressions allow you to specify context in a string where your pattern should be matched. There are several types of anchors:

^ matches the start of a line (in multiline mode) or the start of the string (by default).
$ matches the end of a line (in multiline mode) or the end of the string (by default).
\A matches the start of the string.
\Z or \z matches the end of the string.
\b matches a word boundary (before the first letter of a word or after the last letter of a word).
\B matches a position that is not a word boundary (between two letters or between two non-letter characters).

These anchors are supported in Java, PHP, Python, Ruby, C#, and Go, with small differences: Go supports \z but not \Z, and Python's \Z is strict, so it behaves like \z in the other languages. In JavaScript, \A and \Z are not supported, but you can use ^ and $ instead of them; just remember to keep the multiline mode disabled. Aba Search and Replace always runs in multiline mode, so you can use \A and \Z to match the beginning or the end of a file.

For example, the regular expression ^abc matches the letters abc only at the start of a string. In multiline mode, the same regex also matches these letters at the beginning of any line. You can use anchors in combination with other regular expression elements to create more complex matches. For example, ^From: (.*) matches a line starting with From:
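The effect of multiline mode on ^ can be checked with a short Python sketch. The sample text is invented for the illustration, and re.MULTILINE is Python's name for the multiline flag.

```python
import re

# Invented sample: the From: line is the second line, not the first.
text = "To: alice@example.com\nFrom: bob@example.com\nSubject: hello"

# Default mode: ^ matches only at the start of the whole string.
default_mode = re.findall(r'^From: (.*)', text)

# Multiline mode: ^ also matches right after each line break.
multiline_mode = re.findall(r'^From: (.*)', text, flags=re.MULTILINE)

print(default_mode)    # []
print(multiline_mode)  # ['bob@example.com']
```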

The difference between \Z and \z is that \Z matches at the end of the string but also skips a possible newline character at the end. In contrast, \z is more strict and matches only at the end of the string.

If you have read the previous part of this article, you may wonder if the anchors add any additional capabilities that are not supported by the three primitives (alternation, parentheses, and the star for repetition). The answer is: they do not, but they change what is captured by the regular expression. You can match a line starting with abc by explicitly adding the newline character: \nabc, but in this case, you will also match the newline character itself. When you use ^abc, the newline character is not consumed.

In a similar way, ing\b matches all words ending with ing. You can replace the anchor with a character class containing non-letter characters (such as spaces or punctuation): ing\W, but in this case, the regular expression will also consume the space or punctuation character.
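A small Python sketch makes the difference visible: the anchor leaves the following character out of the match, while the character class includes it. The sample sentence is made up.

```python
import re

text = "sing, ring and bringing."

# \b only checks the position, so the match is just "ing".
with_anchor = re.findall(r'ing\b', text)

# \W consumes the character after "ing", so it becomes part of the match.
with_class = re.findall(r'ing\W', text)

print(with_anchor)  # ['ing', 'ing', 'ing']
print(with_class)   # ['ing,', 'ing ', 'ing.']

# At the very end of a string there is no character left for \W to consume.
at_end_anchor = re.findall(r'ing\b', "running")  # ['ing']
at_end_class = re.findall(r'ing\W', "running")   # []
```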

If the regular expression starts with ^ so that it only matches at the start of the string, it's called anchored. In some programming languages, you can do an anchored match instead of the non-anchored search without using ^. For example, in PHP (PCRE), you can use the A modifier.

So the anchors don't add any new capabilities to the regular expressions, but they allow you to manage which characters will be included into the match or to match only at the beginning or end of the string. The matched language is still regular.

Zero-width assertions (?= ) (?! ) (?<= ) (?<! )

Zero-width assertions (also called lookahead and lookbehind assertions) allow you to check that a pattern occurs in the subject string without capturing any of the characters. This can be useful when you want to check for a pattern without moving the match pointer forward. For example, you can test that the next characters are abc without consuming them: (?=abc).

Zero-width assertions are generalized anchors. Just like anchors, they don't consume any character from the input string. Unlike anchors, they allow you to check anything, not only line boundaries or word boundaries. So you can replace an anchor with a zero-width assertion, but not vice versa. For example, ing\b could be rewritten as ing(?=\W|$).
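As a rough check, the following Python sketch compares the anchor and the lookahead on a made-up sample and confirms that a lookahead on its own matches an empty string.

```python
import re

text = "sing, ring and bringing"

# Positions of the matches found with the word-boundary anchor.
anchor_spans = [m.span() for m in re.finditer(r'ing\b', text)]

# Positions of the matches found with the equivalent lookahead.
lookahead_spans = [m.span() for m in re.finditer(r'ing(?=\W|$)', text)]

print(anchor_spans)     # [(1, 4), (7, 10), (20, 23)]
print(lookahead_spans)  # [(1, 4), (7, 10), (20, 23)]

# The assertion checks that "abc" follows, but consumes nothing.
peek = re.match(r'(?=abc)', "abcdef")
print(peek.span())  # (0, 0)
```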

Aba documentation includes a detailed article on zero-width assertions (lookaround) and their typical usage, so we won't repeat it here. Zero-width lookahead and lookbehind are supported in PHP, JavaScript, Python, Java, and Ruby. Unfortunately, they are not supported in Go.

Just like anchors, zero-width assertions still match a regular language, so from a theoretical point of view, they don't add anything new to the capabilities of regular expressions. They just make it possible to skip certain things from the captured string, so you only check for their presence but don't consume them.

Aba 2.7 released

Sun, 12 May 2024 17:54:00 +0200

In the new version, Aba got a UI facelift and dark mode. Several critical bugs were fixed in this release, so it's recommended for everyone to install. The changes are:

Dark mode.
A larger, more modern UI font (Segoe UI).
Syntax highlight for Java, C#, SQL, and Pascal.
Drag and drop into the main window.
Autocomplete in the path combobox.
Allow using a file name in double quotes.
Fixed 13 bugs including 6 critical ones.

Just as always, the upgrade is free for the registered users.

Regular Expressions 101

Sun, 28 Jan 2024 15:10:17 +0100

With regular expressions, you can describe the patterns that are similar to each other. For example, you have multiple <img> tags, and you want to move all these images to the images folder:

<img src="9.png">   →  <img src="images/9.png">
<img src="10.png">  →  <img src="images/10.png">
and so on

You can easily write a regular expression that matches all file names that are numbers, then replace all such tags at once.

Basic syntax

If you need to match one of the alternatives, use an alternation (vertical bar). For example:

Regex	Meaning
`a|img|h1|h2`	either `a`, or `img`, or `h1`, or `h2`

When using alternation, you often need to group characters together; you can do this with parentheses. For example, if you want to match an HTML tag, this approach won't work:

Regex	Meaning
`<h1|h2|b|i>`	`<h1` or `h2` (without the angle brackets) or `b` or `i>`

because < applies to the first alternative only and > applies to the last one only. To apply the angle brackets to all alternatives, you need to group the alternatives together:

<(h1|h2|b|i)>
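The effect of the grouping can be seen in a short Python sketch (the sample string is invented):

```python
import re

tags = "<h1> <h2> <b> <i>"

# Without parentheses, the angle brackets belong to the first and last branch only.
ungrouped = re.findall(r'<h1|h2|b|i>', tags)

# With parentheses, both brackets apply to every alternative.
grouped = [m.group(0) for m in re.finditer(r'<(h1|h2|b|i)>', tags)]

print(ungrouped)  # ['<h1', 'h2', 'b', 'i>']
print(grouped)    # ['<h1>', '<h2>', '<b>', '<i>']
```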

The last primitive (star) allows you to repeat anything zero or more times. You can apply it to one character, for example:

Regex	Meaning
`a*`	an empty string, `a`, `aa`, `aaa`, `aaaa`, etc.

You also can apply it to multiple characters in parentheses:

Regex	Meaning
`(ab)*`	an empty string, `ab`, `abab`, `ababab`, `abababab`, etc.

Note that if you remove the parentheses, the star will apply to the last character only:

Regex	Meaning
`ab*`	`a`, `ab`, `abb`, `abbb`, `abbbb`, etc.
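A quick way to see what the star applies to is to test a few strings against both patterns. This is a small Python sketch; re.fullmatch requires the whole string to match.

```python
import re

samples = ["", "a", "ab", "abab", "abb"]

# The star repeats the whole group "ab".
grouped_star = [s for s in samples if re.fullmatch(r'(ab)*', s)]

# The star repeats only the last character "b"; the "a" is always required.
last_char_star = [s for s in samples if re.fullmatch(r'ab*', s)]

print(grouped_star)    # ['', 'ab', 'abab']
print(last_char_star)  # ['a', 'ab', 'abb']
```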

Stephen Kleene (1909-1994), the regular expression inventor.
Author: Konrad Jacobs. Source: Archives of the Mathematisches Forschungsinstitut Oberwolfach.

The star is named Kleene star after an American mathematician Stephen Kleene who invented regular expressions in the 1950s. It can match an empty string as well as any number of repetitions.

These three primitives (alternation, parentheses, and the star for repetition) are enough to write any regular expression, but the syntax may be verbose. For example, you now can write a regex for matching the file names that are numbers in an <img> tag:

Regex	Meaning
`(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*`	one or more digits
`(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*`	a positive integer number (don't allow zero as the first character)

The parentheses may be nested without a limit, for example:

Regex	Meaning
`(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*(,(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*)*`	one or more positive integer numbers, separated with commas

Convenient shortcuts for character classes

You can write any regex with the three primitives, but it quickly becomes hard to read, so a few shortcuts were invented. When you need to match any of the listed characters, please put them into square brackets:

Regex	Shorter regex	Meaning
`a|e|i|o|u|y`	`[aeiouy]`	a vowel
`0|1|2|3|4|5|6|7|8|9`	`[0123456789]`	a digit
`0|1|2|3|4|5|6|7|8|9`	`[0-9]`	a digit
`a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z`	`[a-z]`	a letter

As you can see, it's possible to specify only the first and the last allowed character if you put a dash between them. There may be several such ranges inside square brackets:

Regex	Meaning
`[a-z0-9]`	a letter or a digit
`[a-z0-9_]`	a letter, a digit, or the underscore character
`[a-f0-9]`	a hexadecimal digit

There are some predefined character classes that are even shorter to write:

Regex	Meaning
`\s`	a space character: the space, the tab character, the new line, or the carriage return
`\d`	a digit
`\w`	a word character (a letter, a digit, or the underscore character)
`.`	any character

In Aba Search and Replace, these character classes include Unicode characters such as accented letters or Unicode line breaks. In other regex dialects, they usually include ASCII characters only, so \d is typically the same as [0-9] and \w is the same as [a-zA-Z0-9_].

The character classes don't add any new capabilities to the regular expressions; you can just list all allowed characters with an alternation, but a character class is much shorter to write. We now can write a shorter version of the regex mentioned before:

Regex	Meaning
`[1-9][0-9]*(,[1-9][0-9]*)*`	one or more positive integer numbers, separated with commas
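As a sanity check, the verbose notation and the character-class notation should accept exactly the same strings. The Python sketch below assumes the intended patterns, with a star after each digit group; the helper names DIGIT and NONZERO are introduced only for this illustration.

```python
import re

# Building blocks written with the three primitives only.
DIGIT = "(0|1|2|3|4|5|6|7|8|9)"
NONZERO = "(1|2|3|4|5|6|7|8|9)"

verbose = f"{NONZERO}{DIGIT}*(,{NONZERO}{DIGIT}*)*"
short = r"[1-9][0-9]*(,[1-9][0-9]*)*"

samples = ["7", "10,25,300", "007", "1,,2", "12,"]

verbose_ok = [s for s in samples if re.fullmatch(verbose, s)]
short_ok = [s for s in samples if re.fullmatch(short, s)]

print(verbose_ok)  # ['7', '10,25,300']
print(short_ok)    # ['7', '10,25,300']
```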

Repetitions

A Kleene star means "repeating zero or more times", but you often need another number of repetitions. As shown before, you can just copy-and-paste a regex to repeat it twice or three times, but there is a shorter notation for that:

Regex	Shorter regex	Meaning
`\d\d*`	`\d+`	one or more digits
`(0|1)(0|1)*`	`[01]+`	any binary number (consisting of zeros and ones)
`(\s|)`	`\s?`	either a space character or nothing
`http(s|)`	`https?`	either `http` or `https`
`(-|\+|)`	`[-+]?`	the minus sign, the plus sign, or nothing
`[a-z][a-z]`	`[a-z]{2}`	two small letters
`[a-z][a-z]((([a-z]|)[a-z]|)[a-z]|)`	`[a-z]{2,5}`	from two to five small letters
`[a-z][a-z][a-z]*`	`[a-z]{2,}`	two or more small letters

So there are the following repetition operators:

a Kleene star * means repeating zero or more times, so it can match nothing at all, or match once, twice, three times, etc.;
a plus sign + means repeating one or more times, so it must match at least once;
an optional part ? means zero times or once;
curly brackets {m,n} mean repeating from m to n times.

Note that you can express any repetition with the curly brackets, so these operators partially duplicate each other. For example:

Regex	Shorter regex	Meaning
`\d{0,}`	`\d*`	nothing or some digits
`\d{1,}`	`\d+`	one or more digits
`\s{0,1}`	`\s?`	either a space character or nothing

Just like the Kleene star, the other repetition operators can apply to parentheses, so you can nest them indefinitely.
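The equivalence between a shortcut and its long form can be verified on a handful of strings. Here is a small Python sketch for the {2,5} case from the table above; the sample strings are arbitrary.

```python
import re

# Long form built from alternation and grouping only: two letters,
# then up to three optional letters.
long_form = r'[a-z][a-z]((([a-z]|)[a-z]|)[a-z]|)'
short_form = r'[a-z]{2,5}'

samples = ["a", "ab", "abc", "abcd", "abcde", "abcdef"]

long_ok = [s for s in samples if re.fullmatch(long_form, s)]
short_ok = [s for s in samples if re.fullmatch(short_form, s)]

print(long_ok)   # ['ab', 'abc', 'abcd', 'abcde']
print(short_ok)  # ['ab', 'abc', 'abcd', 'abcde']
```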

Escaping

If you need to match any of the special characters like parentheses, vertical bar, plus, or star, you must escape them by adding a backslash \ before them. For example, to find a number in parentheses, use \(\d+\).

A common mistake is to forget a backslash before a dot. Note that a dot means any character, so if you write example.com in a regular expression, it will match examplexcom or something similar, which may even cause a security issue in your program. Now we can write a regex to match the <img> tags:

<img src="\d+\.png">

This matches any file name consisting of digits followed by .png, with the dot correctly escaped.
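The unescaped dot is easy to demonstrate. The following Python sketch uses invented sample strings; re.escape is Python's helper for escaping a literal string, and other languages offer similar functions under different names.

```python
import re

# Unescaped dot: matches any character in that position.
unescaped_hit = re.fullmatch(r'example.com', "examplexcom") is not None

# Escaped dot: matches only a literal dot.
escaped_hit = re.fullmatch(r'example\.com', "examplexcom") is not None

print(unescaped_hit)  # True
print(escaped_hit)    # False

# Let the library escape a literal string for you.
print(re.escape("example.com"))  # example\.com

# Escaped parentheses match literal parentheses around a number.
in_parens = re.findall(r'\(\d+\)', "see (42) and 17")
print(in_parens)  # ['(42)']
```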

Other features

Modern regex engines add more features such as backreferences or conditional subpatterns. Mathematically speaking, these features don't belong to the regular expressions; they describe a non-regular language, so you cannot replace them with the three primitives.

Next time, we will discuss anchors and zero-width assertions.

2023 in review

Sun, 14 Jan 2024 12:27:50 +0100

In 2023, I continued to support Ukraine and donated more than 50% of the revenue from Aba Search and Replace to the charities helping Ukrainians in need. I will keep donating this year.

Released in December, Aba 2.6 is the first version that requires Windows Vista. The previous versions were tested on Windows XP, which remained popular for a long time after its release. Unfortunately, it became increasingly hard to maintain the Windows XP compatibility code and it limited the further development, so I had to say goodbye to Windows 2000/XP. Please contact me if it creates any problem for you; I always listen to your feedback and can send you the previous version.

In January 2023, Microsoft certified Aba Search and Replace for publication to the Microsoft Store. The new version 2.6 was also approved a few days ago, so you can download it from the Microsoft Store as well as from this website.

Thanks to Richard, Aba is also available in French. If you are a native speaker of Spanish, German, or Italian and you can translate the 17 messages that were added in the recent version, please contact me. Feel free to use Google Translate or ChatGPT, then review and edit the automatic translation. Thank you so much.

The blog post about escaping in regular expressions is still the most popular on this blog. In April, I wrote a followup about empty character classes, which was also well-received.

The new Aba version remains lean and fast. No huge runtime libraries, no cluttered UIs or bloatware. Stay tuned for the next versions!

Regular expression for numbers

Sat, 30 Dec 2023 18:13:28 +0100

It's easy to find a positive integer number with regular expressions:

[0-9]+

This regex means digits from 0 to 9, repeated one or more times. However, numbers starting with zero are treated as octal in many programming languages, so you may wish to avoid matching them:

[1-9][0-9]*

This regular expression matches any positive integer number starting with a non-zero digit. If you also need to match zero, you can include it as another branch:

[1-9][0-9]*|0

To also accommodate negative integer numbers, you can allow a minus sign before the digits:

-?[1-9][0-9]*|0

Sometimes it's necessary to allow a plus sign as well:

[-+]?[1-9][0-9]*|0

The previous regexes searched the input string for a number. If you need to match a number only, rejecting any input that contains anything else, you can add the ^ anchor to match the beginning of the string and the $ anchor to match the end:

^(-?[1-9][0-9]*|0)$

Parentheses are necessary here; without them, the ^ anchor would apply only to the first branch. Another variation of the same regex avoids finding numbers that are part of words, such as 600px or x64:

\b(-?[1-9][0-9]*|0)\b
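Both variations can be tried on a few inputs. This is a minimal Python sketch; the sample strings are invented for the illustration.

```python
import re

# Anchored variant: the whole string must be a number.
ANCHORED = re.compile(r'^(-?[1-9][0-9]*|0)$')
candidates = ["-17", "0", "007", "12abc"]
valid = [s for s in candidates if ANCHORED.search(s)]
print(valid)  # ['-17', '0']

# Word-boundary variant: find numbers that are not glued to letters.
BOUNDED = re.compile(r'\b(-?[1-9][0-9]*|0)\b')
text = "width 600px, x64 build, 42 items, 0 errors"
found = BOUNDED.findall(text)
print(found)  # ['42', '0']
```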

Things get more complicated if you need to match a fractional number:

\b-?(?:[1-9][0-9]*(?:\.[0-9]+)?|\.[0-9]+|0)\b

Let's break down this regular expression:

The first branch [1-9][0-9]*(?:\.[0-9]+)? matches an integer number starting with a non-zero digit, then an optional fractional part.
The second branch \.[0-9]+ matches fractional numbers starting with a dot, for example, .5 is another way to write 0.5.
The third branch matches zero. Note that both positive and negative zeros are possible in floating-point numbers.

For floating-point numbers with an exponent, such as 5.2777e+231, please use:

\b-?(?:[1-9][0-9]*(?:\.[0-9]+)?|\.[0-9]+|0)(?:[eE][+-]?[0-9]+)?\b

Many programming languages support hexadecimal numbers starting with 0x. Here is a regular expression to match them:

0x[0-9a-fA-F]+

Finally, here is a comprehensive regular expression to match floating-point, integer decimal, or hexadecimal numbers:

\b-?(?:[1-9][0-9]*(?:\.[0-9]+)?|\.[0-9]+|0(?:x[0-9a-fA-F]+)?)(?:[eE][+-]?[0-9]+)?\b
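The comprehensive pattern can be exercised with a short Python sketch. The sample strings are arbitrary, and Python's re module stands in for whichever regex engine you actually use.

```python
import re

NUMBER = re.compile(
    r'\b-?(?:[1-9][0-9]*(?:\.[0-9]+)?|\.[0-9]+|0(?:x[0-9a-fA-F]+)?)'
    r'(?:[eE][+-]?[0-9]+)?\b'
)

samples = ["42", "3.14", "5.2777e+231", "0x1F", "0", "007", "12px"]
accepted = [s for s in samples if NUMBER.fullmatch(s)]
print(accepted)  # ['42', '3.14', '5.2777e+231', '0x1F', '0']

# \b needs a word character on one side, so a minus sign that follows
# a space is left out of the match in this engine.
print(NUMBER.search("t = -5").group(0))  # 5
```

If the sign matters in your data, check how your engine treats \b before a leading minus sign or dot.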

Aba 2.6 released

Mon, 25 Dec 2023 03:06:00 +0100

This version adds the following features:

complex replacements including converting the matching text to lowercase, inserting the file name, or adding width/height attributes to <img> tags (now you can use a simple scripting language in the replacements);
a 64-bit version (if needed, you still can choose a 32-bit version during installation);
a new hotkey: the left/right arrow key to quickly jump to the next/previous file (when the results pane is focused);
the taskbar button now flashes when a long operation is complete;
basic support for emojis (ZWJ sequences and skin tones are displayed as separate characters).

Just as always, the upgrade is free for the registered users; your settings and search history will be preserved when you run the installer.

If you have any suggestions for new features, please contact me. I will be happy to implement your ideas.

Search from the Windows command prompt

Sun, 21 May 2023 14:07:58 +0200

When you need to search within text files from Windows batch files, you can use either the find or findstr command. Findstr supports a limited version of regular expressions. You can also automate certain tasks based on the search results.

The find command

To search for text in multiple files from the Windows command prompt or batch files, you can use the FIND command, which has been present since the days of MS DOS and is still available in Windows 11. It's similar to the Unix grep command, but does not support regular expressions. If you want to search for the word borogoves in the current directory, please follow this syntax:

find "borogoves" *

Note that the double quotes around the pattern are mandatory. If you are using PowerShell, you will need to include single quotes as well:

find '"borogoves"' *

Instead of the asterisk (*), you can specify a file mask such as *.htm?. The find command displays the names of the files it scans, even if it doesn't find any matches within these files.

The search is case-sensitive by default, so you typically need to add the /I switch to treat uppercase and lowercase letters as equivalent:

find /I "<a href=" *.htm

If you don't specify the file to search in, find will wait for the text input from stdin, so that you can pipe output from another command. For example, you can list all copy commands supported in Windows:

help | find /i "copy"

Another switch, /V, allows you to find all lines not containing the pattern, similar to the grep -v command.

In batch files, you can use the fact that the find command sets the exit code (errorlevel) to 1 if the pattern is not found. For instance, you can check if the machine is running a 64-bit or 32-bit version of Windows:

@echo off

rem Based on KB556009 with some corrections
reg Query "HKLM\Hardware\Description\System\CentralProcessor\0" /v "Identifier" | find /i "x86 Family" > nul
if errorlevel 1 goto win64

echo 32-bit Windows
goto :eof

:win64
rem Could be AMD64 or ARM64
echo 64-bit Windows

The findstr command: regular expression search

If you need to find a regular expression, try the FINDSTR command, which was introduced in Windows XP. For historical reasons, findstr supports a limited subset of regular expressions, so you can only use these regex features:

The dot . matches any character except for newline and extended ASCII characters.
Character lists [abc] match any of the specified characters (a, b, or c).
Character list ranges [a-z] match any letter from a to z.
The asterisk (*) indicates that the previous character can be repeated zero or more times.
The \< and \> symbols mark the beginning and the end of a word.
The caret (^) and the dollar sign ($) denote the beginning and the end of a line.
The backslash (\) escapes any metacharacter, allowing you to find literal characters. For example, \$ finds the dollar sign itself.

Findstr does not support character classes (\d), alternation (|), or other repetitions (+ or {5}).

The basic syntax is the same as for the FIND command:

findstr "\<20[0-9][0-9]\>" *.htm

This command finds all years from 2000 to 2099 in the .htm files of the current directory. Just like with find, use the /I switch for a case-insensitive search.

Findstr limitations and quirks

Character lists [a-z] are always case-insensitive, so echo ABC | findstr "[a-z]" matches.

The space character works as the alternation metacharacter in findstr, so a search query like findstr "new shoes" * will find all lines containing either new or shoes. Unfortunately, there is no way to escape the space and use it as a literal character in a regular expression. For example, you cannot find lines starting with a space.

Syntax errors in regular expressions are ignored. For instance, findstr "[" * will match all lines that contain the [ character.

If the file contains Unix line breaks (LF), the $ metacharacter does not work correctly. If the last line of a file lacks a line terminator, findstr will be unable to find it. For example, findstr "</html>$" * won't work if there is no CR+LF after </html>.

Early Windows versions had limitations on line length for find and findstr, as well as other commands. The recent versions lifted these limits, so you don't have to worry about them anymore. See this StackOverflow question for findstr limitations and bugs, especially in early Windows versions.

The findstr command operates in the OEM (MS-DOS) code page; the dot metacharacter does not match any of the extended ASCII characters. As a result, the command is not very useful for non-English text. Besides that, you cannot search for Unicode characters (UTF-8 or UTF-16).

Conclusion

You can learn about other switches by typing findstr /? or find /?. For example, the additional switches allow you to search in subdirectories or print line numbers. You can also refer to the official documentation.

In general, the find and findstr commands are outdated and come with various quirks and limitations. Shameless plug: Aba Search and Replace supports command-line options as well, allowing you to search from the command prompt and replace text from Windows batch files.

Empty character class in JavaScript regexes

Mon, 10 Apr 2023 17:44:12 +0200

I contributed to PCRE and wrote two smaller regular expression engines, but I still regularly learn something new about this topic. This time, it's about a regex that never matches.

When using character classes, you can specify the allowed characters in brackets, such as [a-z] or [aeiouy]. But what happens if the character class is empty?

Popular regex engines treat the empty brackets [] differently. In JavaScript, they never match. This is valid JavaScript code, and it always prints false regardless of the value of str:

const str = 'a';
console.log(/[]/.test(str));

However, in Java, PHP (PCRE), Go, and Python, the same regex is rejected as a syntax error:

// Java
@Test
void testRegex1() {
    PatternSyntaxException e = assertThrows(PatternSyntaxException.class,
        () -> Pattern.compile("[]"));
    assertEquals("Unclosed character class", e.getDescription());
}

<?php
ini_set('display_errors', 1);
error_reporting(E_ALL);

// Emits a warning: preg_match(): Compilation failed: missing terminating ] for character class
echo preg_match('/[]/', ']') ? 'Match ' : 'No match';

# Python
import re
re.compile('[]') # throws "unterminated character set"

In these languages, you can put the closing bracket right after the opening bracket to avoid escaping the former:

// Java
@Test
void testRegex2() {
    Pattern p = Pattern.compile("[]]");
    Matcher m = p.matcher("]");
    assertTrue(m.matches());
}

<?php
echo preg_match('/[]]/', ']', $m) ? 'Match ' : 'No match'; // Outputs 'Match'
print_r($m);

# Python
import re
print(re.match('[]]', ']')) # outputs the Match object

// Go
package main

import (
    "fmt"
    "regexp"
)

func main() {
    matched, err := regexp.MatchString(`[]]`, "]")
    fmt.Println(matched, err)
}

This won't work in JavaScript because the first ] is interpreted as the end of the character class there, so the same regular expression in JavaScript means an empty character class that never matches, followed by a closing bracket. As a result, the regular expression never finds the closing bracket:

// JavaScript
console.log(/[]]/.test(']')); // outputs false

If you negate the empty character class with ^ in JavaScript, it will match any character including newlines:

console.log(/[^]/.test('')); // outputs false
console.log(/[^]/.test('a')); // outputs true
console.log(/[^]/.test('\n')); // outputs true

Again, this is an invalid regex in other languages. PCRE2 can emulate the JavaScript behavior if you pass the PCRE2_ALLOW_EMPTY_CLASS option to pcre2_compile. PHP never passes this flag.

If you want to match an opening or a closing bracket, this somewhat cryptic regular expression will help you in Java, PHP, Python, or Go: [][]. The first opening bracket starts the character class, which includes the literal closing bracket and the literal opening bracket, and finally, the last closing bracket ends the class.
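For instance, the following Python sketch finds every bracket in a made-up line of code:

```python
import re

# "[" opens the class, "]" and "[" are literal members, the last "]" closes it.
BRACKET = re.compile(r'[][]')

found = BRACKET.findall("a[0] = b[1]")
print(found)  # ['[', ']', '[', ']']
```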

In JavaScript, you need to escape the closing bracket like this: [\][]

console.log(/[\][]/.test('[')); // outputs true
console.log(/[\][]/.test(']')); // outputs true

In Aba Search and Replace, I chose to support the syntax used in Java/PHP/Python/Go. There are many other ways to construct a regular expression that always fails, in case you need it. So it makes sense to use this syntax for a literal closing bracket.

Privacy Policy Update - December 2022

Sun, 25 Dec 2022 21:17:32 +0100

Updated our privacy policy:

clarified your rights under GDPR (you can object to processing of your personal data or restrict the processing, etc.);
added that we don't do any profiling for marketing purposes, but PayPro Global may do risk scoring in order to prevent potential credit card fraud;
added that we can notify you by email about new software versions (you can leave this checkbox empty or unsubscribe at any time);
listed what happens if you don't provide your personal data (e.g., if you don't provide your email address, we cannot reply to you);
changed the refund policy from 30 to 14 days, added a reference to the relevant Czech law;
stated that we do full-disk encryption and encrypt all backups, so your personal data are safe with us.

Note that we are required by law to notify you of any changes in the privacy policy. Thank you and have a nice holiday season!