Aba Search and Replace bloghttps://www.abareplace.com/blog/Search tips and tricks, regular expression tutorials, announcements about new versions of Aba Search and Replace.1440Regular Expressions 101<p>With regular expressions, you can describe the patterns that are similar to each other. For example, you have multiple <code>&lt;img&gt;</code> tags, and you want to move all these images to the <code>images</code> folder:</p> <pre> &lt;img src="9.png"&gt; &#x2192; &lt;img src="images/9.png"&gt; &lt;img src="10.png"&gt; &#x2192; &lt;img src="images/10.png"&gt; and so on </pre> <p>You can easily write a regular expression that matches all file names that are numbers, then replace all such tags at once.</p> <h3>Basic syntax</h3> <p>If you need to match <b>one of the alternatives,</b> use an alternation (vertical bar). For example:</p> <table class="example"> <thead><tr><td>Regex</td><td>Meaning</td></tr></thead> <tr><td class="r"><code>a|img|h1|h2</code></td><td class="c">either <code>a</code>, or <code>img</code>, or <code>h1</code>, or <code>h2</code></td></tr> </table> <p>When using alternation, you often need to <b>group</b> characters together; you can do this with parentheses. For example, if you want to match an HTML tag, this approach won't work:</p> <table class="example"> <thead><tr><td>Regex</td><td>Meaning</td></tr></thead> <tr><td class="r"><code>&lt;h1|h2|b|i&gt;</code></td><td class="c"><code>&lt;h1</code> or <code>h2</code> (without the angle brackets) or <code>b</code> or <code>i&gt;</code></td></tr> </table> <p>because <code>&lt;</code> applies to the first alternative only and <code>&gt;</code> applies to the last one only. To apply the angle brackets to all alternatives, you need to group the alternatives together:</p> <pre> &lt;(h1|h2|b|i)&gt; </pre> <p>The last primitive (star) allows you to <b>repeat</b> anything zero or more times. 
You can apply it to one character, for example:</p> <table class="example"> <thead><tr><td>Regex</td><td>Meaning</td></tr></thead> <tr><td class="r"><code>a*</code></td><td class="c">an empty string, <code>a</code>, <code>aa</code>, <code>aaa</code>, <code>aaaa</code>, etc.</td></tr> </table> <p>You can also apply it to multiple characters in parentheses:</p> <table class="example"> <thead><tr><td>Regex</td><td>Meaning</td></tr></thead> <tr><td class="r"><code>(ab)*</code></td><td class="c">an empty string, <code>ab</code>, <code>abab</code>, <code>ababab</code>, <code>abababab</code>, etc.</td></tr> </table> <p>Note that if you remove the parentheses, the star will apply to the last character only, so the first character is always required:</p> <table class="example"> <thead><tr><td>Regex</td><td>Meaning</td></tr></thead> <tr><td class="r"><code>ab*</code></td><td class="c"><code>a</code>, <code>ab</code>, <code>abb</code>, <code>abbb</code>, <code>abbbb</code>, etc.</td></tr> </table> <figure> <img src="/Stephen_Kleene.jpg" width="281" height="400" alt="A portrait of Stephen Cole Kleene, the regular expression inventor" title=""> <figcaption>Stephen Kleene (1909-1994), the regular expression inventor.<br/>Author: Konrad Jacobs. Source: Archives of the Mathematisches Forschungsinstitut Oberwolfach.</figcaption> </figure> <p>The star is called the <b>Kleene star</b> after the American mathematician <a href="https://en.wikipedia.org/wiki/Stephen_Cole_Kleene">Stephen Kleene</a>, who invented regular expressions in the 1950s. It can match an empty string as well as any number of repetitions.</p> <p>These <b>three primitives</b> (alternation, parentheses, and the star for repetition) are enough to write any regular expression, but the syntax may be verbose. 
For example, you now can write a regex for matching the file names that are numbers in an <code>&lt;img&gt;</code> tag:</p> <table class="example"> <thead><tr><td>Regex</td><td>Meaning</td></tr></thead> <tr><td class="r"><code>(0|1|2|3|4&#x200b;|5|6|7|8|9)(0|1|2|3|4&#x200b;|5|6|7|8|9)*</code></td><td class="c">one or more digits</td></tr> <tr><td class="r"><code>(1|2|3|4|5&#x200b;|6|7|8|9)(0|1|2|3|4&#x200b;|5|6|7|8|9)*</code></td><td class="c">a positive integer number (don't allow zero as the first character)</td></tr> </table> <p>The parentheses may be nested without a limit, for example:</p> <table class="example"> <thead><tr><td>Regex</td><td>Meaning</td></tr></thead> <tr><td class="r"><code>(1|2|3|4&#x200b;|5|6|7|8|9)(0|1|2|3|4&#x200b;|5|6|7|8|9)*(,(1|2|3|4&#x200b;|5|6|7|8|9)(0|1|2|3|4&#x200b;|5|6|7|8|9)*)*</code></td><td class="c">one or more positive integer numbers, separated with commas</td></tr> </table> <h3>Convenient shortcuts for character classes</h3> <p>You can write any regex with the three primitives, but it quickly becomes hard to read, so a few shortcuts were invented. 
When you need to match <b>any of the listed characters,</b> put them in square brackets:</p> <table class="example"> <thead><tr><td>Regex</td><td>Shorter regex</td><td>Meaning</td></tr></thead> <tr><td class="r"><code>a|e|i|o|u|y</code></td><td class="r"><code>[aeiouy]</code></td><td class="c">a vowel</td></tr> <tr><td class="r"><code>0|1|2|3|4&#x200b;|5|6|7|8|9</code></td><td class="r"><code>[0123456789]</code></td><td class="c">a digit</td></tr> <tr><td class="r"><code>0|1|2|3|4&#x200b;|5|6|7|8|9</code></td><td class="r"><code>[0-9]</code></td><td class="c">a digit</td></tr> <tr><td class="r"><code>a|b|c|d|e&#x200b;|f|g|h|i|j&#x200b;|k|l|m|n&#x200b;|o|p|q|r&#x200b;|s|t|u|v&#x200b;|w|x|y|z</code></td><td class="r"><code>[a-z]</code></td><td class="c">a lowercase letter</td></tr> </table> <p>As you can see, it's possible to specify only the first and the last allowed character if you put <b>a dash</b> between them. There may be several such ranges inside square brackets:</p> <table class="example"> <thead><tr><td>Regex</td><td>Meaning</td></tr></thead> <tr><td class="r"><code>[a-z0-9]</code></td><td class="c">a letter or a digit</td></tr> <tr><td class="r"><code>[a-z0-9_]</code></td><td class="c">a letter, a digit, or the underscore character</td></tr> <tr><td class="r"><code>[a-f0-9]</code></td><td class="c">a hexadecimal digit</td></tr> </table> <p>There are some <b>predefined character classes</b> that are even shorter to write:</p> <table class="example"> <thead><tr><td>Regex</td><td>Meaning</td></tr></thead> <tr><td class="r"><code>\s</code></td><td class="c">a whitespace character: the space, the tab character, the new line, or the carriage return</td></tr> <tr><td class="r"><code>\d</code></td><td class="c">a digit</td></tr> <tr><td class="r"><code>\w</code></td><td class="c">a word character (a letter, a digit, or the underscore character)</td></tr> <tr><td class="r"><code>.</code></td><td class="c">any character</td></tr> </table> <p>In Aba Search and Replace, 
these <a href="/docs/charListClass.php">character classes</a> include Unicode characters such as accented letters or Unicode line breaks. In other regex dialects, they usually include ASCII characters only, so <code>\d</code> is typically the same as <code>[0-9]</code> and <code>\w</code> is the same as <code>[a-zA-Z0-9_]</code>.</p> <p>The character classes don't add any new capabilities to the regular expressions; you can just list all allowed characters with an alternation, but a character class is much shorter to write. We can now write a shorter version of the regex mentioned before:</p> <table class="example"> <thead><tr><td>Regex</td><td>Meaning</td></tr></thead> <tr><td class="r"><code>[1-9][0-9]*(,[1-9][0-9]*)*</code></td><td class="c">one or more positive integer numbers, separated with commas</td></tr> </table> <h3>Repetitions</h3> <p>A Kleene star means "repeating zero or more times", but you often need another number of repetitions. As shown before, you can just copy-and-paste a regex to repeat it twice or three times, but there is a shorter notation for that:</p> <table class="example"> <thead><tr><td>Regex</td><td>Shorter regex</td><td>Meaning</td></tr></thead> <tr><td class="r"><code>\d\d*</code></td><td class="r"><code>\d+</code></td><td class="c">one or more digits</td></tr> <tr><td class="r"><code>(0|1)(0|1)*</code></td><td class="r"><code>[01]+</code></td><td class="c">any binary number (consisting of zeros and ones)</td></tr> <tr><td class="r"><code>(\s|)</code></td><td class="r"><code>\s?</code></td><td class="c">either a space character or nothing</td></tr> <tr><td class="r"><code>http(s|)</code></td><td class="r"><code>https?</code></td><td class="c">either <code>http</code> or <code>https</code></td></tr> <tr><td class="r"><code>(-|\+|)</code></td><td class="r"><code>[-+]?</code></td><td class="c">the minus sign, the plus sign, or nothing</td></tr> <tr><td class="r"><code>[a-z][a-z]</code></td><td class="r"><code>[a-z]{2}</code></td><td 
class="c">two small letters</td></tr> <tr><td class="r"><code>[a-z][a-z]((([a-z]|)[a-z]|)[a-z]|)</code></td><td class="r"><code>[a-z]{2,5}</code></td><td class="c">from two to five small letters</td></tr> <tr><td class="r"><code>[a-z][a-z][a-z]*</code></td><td class="r"><code>[a-z]{2,}</code></td><td class="c">two or more small letters</td></tr> </table> <p>So there are the following <a href="/docs/repetitions.php">repetition operators:</a></p> <ul> <li>a Kleene star <code>*</code> means repeating <b>zero or more times,</b> so it can match nothing at all, or match once, twice, three times, etc.;</li> <li>a plus sign <code>+</code> means repeating <b>one or more times,</b> so it must match at least once;</li> <li>an optional part <code>?</code> means <b>zero times or once</b>;</li> <li>curly brackets <code>{m,n}</code> mean repeating <b>from m to n times</b>.</li> </ul> <p>Note that you can express any repetition with the curly brackets, so these operators partially duplicate each other. For example:</p> <table class="example"> <thead><tr><td>Regex</td><td>Shorter regex</td><td>Meaning</td></tr></thead> <tr><td class="r"><code>\d{0,}</code></td><td class="r"><code>\d*</code></td><td class="c">nothing or some digits</td></tr> <tr><td class="r"><code>\d{1,}</code></td><td class="r"><code>\d+</code></td><td class="c">one or more digits</td></tr> <tr><td class="r"><code>\s{0,1}</code></td><td class="r"><code>\s?</code></td><td class="c">either a space character or nothing</td></tr> </table> <p>Just like the Kleene star, the other repetition operators can apply to parentheses, so you can nest them indefinitely.</p> <h3>Escaping</h3> <p>If you need to match any of <b>the special characters</b> like parentheses, vertical bar, plus, or star, you must <a href="/blog/escape-regexp/">escape them</a> by adding a backslash <code>\</code> before them. For example, to find a number in parentheses, use <code>\(\d+\)</code>.</p> <p>A common mistake is to forget a backslash before a dot. 
Note that a dot means any character, so if you write <code>example.com</code> in a regular expression, it will match <code>examplexcom</code> or something similar, which may even cause a security issue in your program. Now we can write a regex to match the <code>&lt;img&gt;</code> tags:</p> <pre> &lt;img src="\d+\.png"&gt; </pre> <p>This matches any filename consisting of digits followed by <code>.png</code>; note that the dot is escaped.</p> <h3>Other features</h3> <p>Modern regex engines add more features such as <b>backreferences</b> or conditional subpatterns. Mathematically speaking, these features don't belong to the regular expressions; they describe a non-regular language, so you cannot replace them with the three primitives.</p> <p>Next time, we will discuss anchors and zero-width assertions.</p> Sun, 28 Jan 2024 15:10:17 +0100https://www.abareplace.com/blog/regex101/Regular expression for numbers<p>It's easy to find a positive integer number with regular expressions:</p> <pre>[0-9]+</pre> <p>This regex means digits from 0 to 9, repeated one or more times. However, <b>numbers starting with zero</b> are treated as octal in many programming languages, so you may wish to avoid matching them:</p> <pre>[1-9][0-9]*</pre> <p>This regular expression matches any positive integer number starting with a non-zero digit. If you also need to match zero, you can include it as another branch:</p> <pre>[1-9][0-9]*|0</pre> <p>To also accommodate <b>negative integer numbers,</b> you can allow a minus sign before the digits:</p> <pre>-?[1-9][0-9]*|0</pre> <p>Sometimes it's necessary to allow a plus sign as well:</p> <pre>[-+]?[1-9][0-9]*|0</pre> <p>The previous regexes searched the input string for a number. 
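</p> <p>As a rough illustration of such a search, here is a minimal Python sketch that applies the last pattern above with <code>findall</code>. The helper name <code>find_integers</code> and the sample strings are made up for this example; they are assumptions, not part of Aba Search and Replace.</p>

```python
import re

# Signed integers without leading zeros, or a lone zero
# (the last pattern from the text above).
INTEGER = re.compile(r'[-+]?[1-9][0-9]*|0')

def find_integers(text):
    """Return every substring of text that the pattern matches."""
    return INTEGER.findall(text)

print(find_integers('from -15 to +7, then 0 and 42'))
print(find_integers('600px'))
```

<p>Note that the second call still reports <code>600</code> from <code>600px</code>, because nothing in the pattern says where a number may start or end.</p> <p>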
If you need to match <b>a number only</b> discarding anything else, you can add the <code>^</code> anchor to match the beginning of the string and the <code>$</code> anchor to match the end:</p> <pre>^(-?[1-9][0-9]*|0)$</pre> <p>Parentheses are necessary here; without them, the <code>^</code> anchor would apply only to the first branch. Another variation of the same regex avoids finding numbers that are part of words, such as <code>600px</code> or <code>x64</code>:</p> <pre>\b(-?[1-9][0-9]*|0)\b</pre> <p>Things get more complicated if you need to match <b>a fractional number</b>:</p> <pre>\b-?(?:[1-9][0-9]*(?:\.[0-9]+)?|\.[0-9]+|0)\b</pre> <p>Let's break down this regular expression:</p> <ul> <li>The first branch <code>[1-9][0-9]*(?:\.[0-9]+)?</code> matches an integer number starting with a non-zero digit, then an optional fractional part.</li> <li>The second branch <code>\.[0-9]+</code> matches fractional numbers starting with a dot, for example, <code>.5</code> is another way to write <code>0.5</code>.</li> <li>The third branch matches zero. Note that both positive and negative zeros are possible in floating-point numbers.</li> </ul> <p>For floating-point numbers with an exponent, such as <code>5.2777e+231</code>, please use:</p> <pre>\b-?(?:[1-9][0-9]*(?:\.[0-9]+)?|\.[0-9]+|0)(?:[eE][+-]?[0-9]+)?\b</pre> <p>Many programming languages support <b>hexadecimal numbers</b> starting with <code>0x</code>. 
Here is a regular expression to match them:</p> <pre>0x[0-9a-fA-F]+</pre> <p>Finally, here is a comprehensive regular expression to match floating-point, integer decimal, or hexadecimal numbers:</p> <pre>\b-?(?:[1-9][0-9]*(?:\.[0-9]+)?|\.[0-9]+|0(?:x[0-9a-fA-F]+)?)(?:[eE][+-]?[0-9]+)?\b</pre> Sat, 30 Dec 2023 18:13:28 +0100https://www.abareplace.com/blog/regex_numbers/Aba 2.6 released<p>This version adds the following features:</p> <ul> <li><a href="https://www.abareplace.com/docs/baoOverview.php">complex replacements</a> including converting the matching text to lowercase, inserting the file name, or adding width/height attributes to &lt;img&gt; tags (now you can use a simple scripting language in the replacements); </li> <li>a 64-bit version (if needed, you still can choose a 32-bit version during installation);</li> <li>a new <a href="https://www.abareplace.com/docs/hotkeys.php">hotkey:</a> the left/right arrow key to quickly jump to the next/previous file (when <a href="https://www.abareplace.com/docs/searchResults.php">the results pane</a> is focused);</li> <li>the taskbar button now flashes when a long operation is complete;</li> <li>basic support for emojis (ZWJ sequences and skin tones are displayed as separate characters).</li> </ul> <p>Just as always, <b>the upgrade is free</b> for the registered users; your settings and search history will be preserved when you run the installer.</p> <p>If you have any suggestions for new features, please <a href="/support/">contact me.</a> I will be happy to implement your ideas.</p>Mon, 25 Dec 2023 03:06:00 +0100https://www.abareplace.com/blog/aba26/Search from the Windows command prompt<p>When you need to search within text files from Windows batch files, you can use either the find or findstr command. Findstr supports a limited version of regular expressions. 
You can also automate certain tasks based on the search results.</p> <h3>The find command</h3> <p>To search for text in multiple files from the Windows command prompt or batch files, you can use the <b>FIND</b> command, which has been present since the days of MS DOS and is still available in Windows 11. It's similar to the Unix <code>grep</code> command, but does not support regular expressions. If you want to search for the word <code>borogoves</code> in the current directory, please follow this syntax:</p> <pre> find "borogoves" * </pre> <p>Note that the double quotes around the pattern are mandatory. If you are using PowerShell, you will need to include single quotes as well:</p> <pre> find '"borogoves"' * </pre> <p>Instead of the asterisk (<code>*</code>), you can specify a file mask such as <code>*.htm?</code>. The <code>find</code> command displays the names of the files it scans, even if it doesn't find any matches within these files:</p> <img src="/FindStr1.png" alt="The FIND command in Windows 11" title="" width="652" height="262"> <p>The search is <b>case-sensitive</b> by default, so you typically need to add the <code>/I</code> switch to treat uppercase and lowercase letters as equivalent:</p> <pre> find /I "&lt;a href=" *.htm </pre> <p>If you don't specify the file to search in, <code>find</code> will wait for the text input <b>from stdin,</b> so that you can pipe output from another command. For example, you can list all copy commands supported in Windows:</p> <pre> help | find /i "copy" </pre> <p>Another switch, <code>/V</code>, allows you to find all lines not containing the pattern, similar to the <code>grep -v</code> command.</p> <p>In <b>batch files,</b> you can use the fact that the <code>find</code> command sets the exit code (<b>errorlevel</b>) to 1 if the pattern is not found. 
For instance, you can check if the machine is running a 64-bit or 32-bit version of Windows:</p> <pre>
@echo off
rem Based on KB556009 with some corrections
reg Query "HKLM\Hardware\Description\System\CentralProcessor\0" /v "Identifier" | find /i "x86 Family" &gt; nul
if errorlevel 1 goto win64
echo 32-bit Windows
goto :eof

:win64
rem Could be AMD64 or ARM64
echo 64-bit Windows
</pre> <h3>The findstr command: regular expression search</h3> <p>If you need to find <b>a regular expression,</b> try the <code>FINDSTR</code> command, which was introduced in Windows XP. <a href="https://devblogs.microsoft.com/oldnewthing/20151209-00/?p=92361">For historical reasons,</a> <code>findstr</code> supports a limited subset of regular expressions, so you can only use these <a href="https://www.abareplace.com/docs/regExprElements.php">regex features:</a></p> <ul> <li>The dot <code>.</code> matches any character except for newline and extended ASCII characters.</li> <li>Character lists <code>[abc]</code> match any of the specified characters (<code>a</code>, <code>b</code>, or <code>c</code>).</li> <li>Character list ranges <code>[a-z]</code> match any letter from <code>a</code> to <code>z</code>.</li> <li>The asterisk (<code>*</code>) indicates that the previous character can be repeated zero or more times.</li> <li>The <code>\&lt;</code> and <code>\&gt;</code> symbols mark the beginning and the end of a word.</li> <li>The caret (<code>^</code>) and the dollar sign (<code>$</code>) denote the beginning and the end of a line.</li> <li>The backslash (<code>\</code>) escapes any metacharacter, allowing you to find literal characters. 
For example, <code>\$</code> finds the dollar sign itself.</li> </ul> <p><b>Findstr</b> does not support character classes (<code>\d</code>), alternation (<code>|</code>), or other repetitions (<code>+</code> or <code>{5}</code>).</p> <p>The basic syntax is the same as for the <code>FIND</code> command:</p> <pre> findstr "\&lt;20[0-9][0-9]\&gt;" *.htm </pre> <p>This command finds all four-digit years from 2000 to 2099 in the <code>.htm</code> files of the current directory. Just like with <code>find</code>, use the <code>/I</code> switch for <b>a case-insensitive</b> search:</p> <img src="/FindStr2.png" alt="The FINDSTR command in Windows 11" title="" width="652" height="115"> <h3>Findstr limitations and quirks</h3> <p>Character lists <code>[a-z]</code> are always case-insensitive, so <code>echo ABC | findstr "[a-z]"</code> matches.</p> <p><b>The space character</b> works as the alternation metacharacter in <code>findstr</code>, so a search query like <code>findstr "new shoes" *</code> will find all lines containing either <code>new</code> or <code>shoes</code>. Unfortunately, there is no way to escape the space and use it as a literal character in a regular expression. For example, you cannot find lines starting with a space.</p> <p><b>Syntax errors</b> in regular expressions are ignored. For instance, <code>findstr "[" *</code> will match all lines that contain the <code>[</code> character.</p> <p>If the file contains <b>Unix line breaks</b> (LF), the <code>$</code> metacharacter does not work correctly. If <b>the last line of a file</b> lacks a line terminator, <code>findstr</code> will be unable to find it. For example, <code>findstr "&lt;/html&gt;$" *</code> won't work if there is no CR+LF after &lt;/html&gt;.</p> <p>Early Windows versions had <b>limitations on line length</b> for <code>find</code> and <code>findstr</code>, as well as other commands. Recent versions lifted these limits, so you don't have to worry about them anymore. 
See <a href="https://stackoverflow.com/questions/8844868/what-are-the-undocumented-features-and-limitations-of-the-windows-findstr-comman/20159191#20159191">this StackOverflow question</a> for <code>findstr</code> limitations and bugs, especially in early Windows versions.</p> <p>The findstr command operates in <b>the OEM (MS DOS) code page;</b> the dot metacharacter does not match any of the extended ASCII characters. As the result, the command is not very useful for non-English text. Besides that, you cannot search for Unicode characters (UTF-8 or UTF-16).</p> <h3>Conclusion</h3> <p>You can learn about other switches by typing <code>findstr /?</code> or <code>find /?</code>. For example, the additional switches allow you to search in subdirectories or print line numbers. You can also refer to <a href="https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/findstr">the official documentation.</a></p> <p>In general, the <code>find</code> and <code>findstr</code> commands are outdated and come with various quirks and limitations. Shameless plug: <b>Aba Search and Replace</b> supports <a href="/docs/cmdLine.php">command-line options as well,</a> allowing you to search from the command prompt and replace text from Windows batch files.</p> Sun, 21 May 2023 14:07:58 +0200https://www.abareplace.com/blog/findstr/Empty character class in JavaScript regexes<p>I <a href="https://github.com/PCRE2Project/pcre2/blob/master/maint/GenerateUcd.py">contributed to PCRE</a> and wrote two smaller regular expression engines, but I still regularly learn something new about this topic. This time, it's about <b>a regex that never matches.</b></p> <p>When using <a href="https://www.abareplace.com/docs/charListClass.php">character classes,</a> you can specify the allowed characters in brackets, such as <code>[a-z]</code> or <code>[aeiouy]</code>. 
But what happens if the character class is empty?</p> <p>Popular <b>regex engines</b> treat the empty brackets <code>[]</code> differently. In JavaScript, they never match. This is valid JavaScript code, and it always prints false regardless of the value of <code>str</code>:</p> <pre>
const str = 'a';
console.log(/[]/.test(str));
</pre> <p>However, in Java, PHP (PCRE), Go, and Python, the same regex is rejected as a syntax error (an exception, a warning, or an error value, depending on the language):</p> <pre>
// Java
@Test
void testRegex1() {
    PatternSyntaxException e = assertThrows(PatternSyntaxException.class,
        () -> Pattern.compile("[]"));
    assertEquals("Unclosed character class", e.getDescription());
}
</pre> <pre>
&lt;?php
ini_set('display_errors', 1);
error_reporting(E_ALL);

// Emits a warning: preg_match(): Compilation failed: missing terminating ] for character class
echo preg_match('/[]/', ']') ? 'Match ' : 'No match';
</pre> <pre>
# Python
import re
re.compile('[]') # throws "unterminated character set"
</pre> <p>In these languages, you can <b>put the closing bracket right after the opening bracket</b> to avoid <a href="https://www.abareplace.com/blog/escape-regexp/">escaping the former</a>:</p> <pre>
// Java
@Test
void testRegex2() {
    Pattern p = Pattern.compile("[]]");
    Matcher m = p.matcher("]");
    assertTrue(m.matches());
}
</pre> <pre>
&lt;?php
echo preg_match('/[]]/', ']', $m) ? 'Match ' : 'No match'; // Outputs 'Match'
print_r($m);
</pre> <pre>
# Python
import re
print(re.match('[]]', ']')) # outputs the Match object
</pre> <pre>
// Go
package main

import (
    "fmt"
    "regexp"
)

func main() {
    matched, err := regexp.MatchString(`[]]`, "]")
    fmt.Println(matched, err)
}
</pre> <p>This won't work in JavaScript because the first <code>]</code> is interpreted as the end of the character class there, so the same regular expression in JavaScript means <a href="https://262.ecma-international.org/13.0/#sec-compiletocharset">an empty character class</a> that never matches, followed by a closing bracket. 
As a result, the regular expression never finds the closing bracket:</p> <pre>
// JavaScript
console.log(/[]]/.test(']')); // outputs false
</pre> <p>If you <b>negate the empty character class</b> with <code>^</code> in JavaScript, it will match any character including newlines:</p> <pre>
console.log(/[^]/.test(''));   // outputs false
console.log(/[^]/.test('a'));  // outputs true
console.log(/[^]/.test('\n')); // outputs true
</pre> <p>Again, this is an invalid regex in other languages. PCRE can emulate the JavaScript behavior if you pass the PCRE2_ALLOW_EMPTY_CLASS option to <a href="https://pcre.org/current/doc/html/pcre2api.html#SEC20">pcre2_compile.</a> PHP never passes this flag.</p> <p>If you want to match <b>an opening or a closing bracket,</b> this somewhat cryptic regular expression will help you in Java, PHP, Python, or Go: <code><b>[</b>][<b>]</b></code>. The first opening bracket starts the character class, which includes the literal closing bracket and the literal opening bracket, and finally, the last closing bracket ends the class.</p> <p>In JavaScript, you need to escape the closing bracket like this: <code><b>[</b>\][<b>]</b></code></p> <pre>
console.log(/[\][]/.test('[')); // outputs true
console.log(/[\][]/.test(']')); // outputs true
</pre> <p>In Aba Search and Replace, I chose to support the syntax used in Java/PHP/Python/Go. There are <a href="https://stackoverflow.com/questions/1723182/a-regex-that-will-never-be-matched-by-anything">many other ways</a> to construct a regular expression that always fails, in case you need it. 
So it makes sense to use this syntax for a literal closing bracket.</p> Mon, 10 Apr 2023 17:44:12 +0200https://www.abareplace.com/blog/emptybrackets/Privacy Policy Update - December 2022<p>Updated <a href="/order/#privacy">our privacy policy:</a></p> <ul> <li>clarified your rights under GDPR (you can object to processing of your personal data or restrict the processing, etc.);</li> <li>added that we don't do any profiling for marketing purposes, but PayPro Global may do risk scoring in order to prevent a potential credit card fraud;</li> <li>added that we can notify you by email about new software versions (you can leave this checkbox empty or unsubscribe at any time);</li> <li>listed what happens if you don't provide your personal data (e.g., if you don't provide your email address, we cannot reply to you);</li> <li>changed the refund policy from 30 to 14 days, added a reference to the relevant Czech law;</li> <li>stated that we do full-disk encryption and encrypt all backups, so your personal data are safe with us.</li> </ul> <p>Note that we are required by law to notify you of any changes in the privacy policy. Thank you and have a nice holiday season!</p> Sun, 25 Dec 2022 21:17:32 +0100https://www.abareplace.com/blog/privacy2022-12/Aba 2.5 released<p>The new features in this version include:</p> <ul> <li>Search and replace <a href="/docs/cmdLine.php">from the command line</a></li> <li><a href="/docs/searchParams.php#browseForFiles">Skip subdirectories</a> when searching (click the <i>Browse</i> button and uncheck <i>Include subdirectories</i>)</li> <li><a href="/docs/searchResults.php#sorting">Sorting</a> the search results by path, filename, extension, modification date, or file size.</li> <li>Escape sequences and character classes inside the character lists, e.g. 
<code>[\d\s]</code> to find a digit, a space, or a newline.</li> <li>Fixed multiple bugs including encoding detection in very short files and searching for the replacement character U+FFFD (many thanks to Joe). Also fixed incorrect search in files slightly larger than 4 GB.</li> <li>Now relative paths are displayed instead of absolute ones in the search results.</li> </ul> <p>The upgrade is free for the registered users. Just <a href="/download/">download</a> the installer and run it; your settings and search history will be preserved.</p> <img src="/blog_aba25.png" width="688" height="436" alt="Aba 2.5 window" title=""> Sun, 11 Dec 2022 20:03:51 +0100https://www.abareplace.com/blog/aba25/Our response to the war in Ukraine<p>In response to the Russian invasion of Ukraine, I blocked all orders from Russia starting from March 2022. I fully support Ukraine in this terrible war and donate money to help Ukrainian refugees in the Czech Republic.</p> <p>Many of you are in a tough situation now due to the high inflation and the rising energy prices. So I am introducing <b>a 10% discount</b> for all new Aba Search and Replace users, but especially for freelancers and small businesses who pay for the software from their own pocket.</p> <p>Please use this coupon code at <a href="/buy/">checkout:</a></p> <p><code><b>GloryToUkraine</b></code></p> <p>The coupon code is valid until the end of 2022. I plan to release a new version within several weeks; the upgrade will be free for all registered users. Please stay tuned.</p> <p>Thank you for your continued support. 
Wishing you peace and good fortune.</p> <p><i>Peter Kankowski,</i><br><i>Aba Search and Replace developer</i></p>Sat, 01 Oct 2022 12:37:28 +0200https://www.abareplace.com/blog/ukraine/Check VAT ID with regular expressions and VIES<p>In the European Union, each business registered for VAT has a unique <a href="https://taxation-customs.ec.europa.eu/vat-identification-numbers_en">identification number</a> like IE9825613N or LU20260743. When selling to EU companies, you need to ask for their VAT ID, validate it, and include it in the invoice. The tax rate depends on <a href="https://taxation-customs.ec.europa.eu/where-tax_en">the place of taxation</a> and the client type (a person or a company). Some customers may provide a wrong VAT ID, either by mistake or in an attempt to avoid paying the tax. So it's important to check the VAT number.</p> <p>The EU provides the <a href="https://ec.europa.eu/taxation_customs/vies/vatRequest.html">VIES page</a> (VAT Information Exchange System) and <a href="https://ec.europa.eu/taxation_customs/vies/faq.html#item_18">a free SOAP API</a> for VAT ID validation. 
Here is how you can query the API:</p> <pre>
Linux / macOS:
curl -d '&lt;soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:urn="urn:ec.europa.eu:taxud:vies:services:checkVat:types"&gt;&lt;soapenv:Body&gt;&lt;urn:checkVat&gt;&lt;urn:countryCode&gt;IE&lt;/urn:countryCode&gt;&lt;urn:vatNumber&gt;9825613N&lt;/urn:vatNumber&gt;&lt;/urn:checkVat&gt;&lt;/soapenv:Body&gt;&lt;/soapenv:Envelope&gt;' 'https://ec.europa.eu/taxation_customs/vies/services/checkVatService'

Windows:
(iwr 'https://ec.europa.eu/taxation_customs/vies/services/checkVatService' -method post -body '&lt;soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:urn="urn:ec.europa.eu:taxud:vies:services:checkVat:types"&gt;&lt;soapenv:Body&gt;&lt;urn:checkVat&gt;&lt;urn:countryCode&gt;IE&lt;/urn:countryCode&gt;&lt;urn:vatNumber&gt;9825613N&lt;/urn:vatNumber&gt;&lt;/urn:checkVat&gt;&lt;/soapenv:Body&gt;&lt;/soapenv:Envelope&gt;').content
</pre> <p>If the VAT number is invalid, you will get <code>&lt;valid&gt;false&lt;/valid&gt;</code> in the response. You can use a SOAP library or just concatenate the XML string with the VAT identification number. In the latter case, you should first check the VAT number with a regular expression; otherwise, an attacker can inject arbitrary XML into the request. 
The VIES <a href="https://ec.europa.eu/taxation_customs/vies/checkVatService.wsdl">WSDL file</a> provides these regular expressions:</p> <pre>
Country code: [A-Z]{2}
VAT ID without the country code: [0-9A-Za-z\+\*\.]{2,12}
</pre> <p>The country code consists of two capital letters; the VAT ID itself consists of 2 to 12 letters, digits, or these characters: <code>+ * .</code></p> <p>So the finished code for VAT ID validation could look like this:</p> <pre>
import re, urllib.request, xml.etree.ElementTree as XmlElementTree

# Return a dictionary with some information about the company, or False if the vat_id is invalid
def check_vat_id(vat_id):
    # A raw string keeps the backslashes intact; fullmatch requires the whole string to match,
    # so nothing outside the allowed characters can reach the XML below
    m = re.fullmatch(r'([A-Z]{2})([0-9A-Za-z\+\*\.]{2,12})', vat_id.replace(' ', ''))
    if not m:
        return False

    data = '&lt;soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" ' + \
        'xmlns:urn="urn:ec.europa.eu:taxud:vies:services:checkVat:types"&gt;' + \
        '&lt;soapenv:Body&gt;&lt;urn:checkVat&gt;&lt;urn:countryCode&gt;' + m.group(1) + '&lt;/urn:countryCode&gt;' + \
        '&lt;urn:vatNumber&gt;' + m.group(2) + '&lt;/urn:vatNumber&gt;&lt;/urn:checkVat&gt;&lt;/soapenv:Body&gt;&lt;/soapenv:Envelope&gt;'

    # Passing a request body makes urlopen send a POST request
    with urllib.request.urlopen('https://ec.europa.eu/taxation_customs/vies/services/checkVatService', data.encode('ascii')) as response:
        resp = response.read().decode('utf-8')

    ns = {
        'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
        'checkVat': 'urn:ec.europa.eu:taxud:vies:services:checkVat:types',
    }
    checkVatResponse = XmlElementTree.fromstring(resp).find('./soap:Body/checkVat:checkVatResponse', ns)
    if checkVatResponse.find('./checkVat:valid', ns).text != 'true':
        return False

    # Copy the response fields into a dictionary, stripping the XML namespace from the tag names
    res = {}
    for child in checkVatResponse:
        res[child.tag.replace('{urn:ec.europa.eu:taxud:vies:services:checkVat:types}', '')] = child.text
    return res

print(check_vat_id('IE9825613N'))
</pre> <p>Each EU country also has <a href="https://ec.europa.eu/taxation_customs/vies/faq.html#item_11">its own rules</a> for a VAT identification number, so you can do a
stricter pre-check with a complex regular expression, but VIES already covers this for you. Also note that some payment processors (e.g. <a href="https://stripe.com/docs/billing/customer/tax-ids#eu-vat">Stripe</a>) already do a VIES query under the hood.</p> Sun, 31 Jul 2022 16:50:00 +0200https://www.abareplace.com/blog/vat_id/Which special characters must be escaped in regular expressions?<p>In most regular expression engines (PCRE, JavaScript, Python, Go, and Java), these special characters <b>must</b> be escaped outside of character classes:</p> <pre> [ * + ? { . ( ) ^ $ | \ </pre> <p>If you want to find one of these metacharacters literally, add <code>\</code> before it. For example, to find the text <code>$100</code>, use <code>\$100</code>. If you want to find the backslash itself, double it: <code>\\</code>.</p> <p><b>Inside character classes</b> [square brackets], you must escape the following characters:</p> <pre> \ ] - </pre> <p>For example, to find an opening or a closing bracket, use <code>[[\]]</code>.</p> <p>If you need to include <b>the dash in a character class,</b> you can make it the first or the last character instead of escaping it. Use <code>[a-z-]</code> or <code>[a-z\-]</code> to find a Latin letter or a dash.</p> <p>If you need to include <b>the caret ^ in a character class,</b> it cannot be the first character; otherwise, it negates the class, which then matches any character except the listed ones. For example: <code>[^aeiouy]</code> means "any character except vowels", while <code>[a^eiouy]</code> means "any vowel or a caret".
Alternatively, you can escape the caret: <code>[\^aeiouy]</code></p> <h3>JavaScript</h3> <p>In JavaScript, you also need to escape <b>the slash</b> <code>/</code> in regular expression literals:</p> <pre> /AC\/DC/.test('AC/DC') </pre> <p><b>Lone closing brackets</b> <code>]</code> and <code>}</code> <a href="https://262.ecma-international.org/11.0/#prod-annexB-ExtendedPatternCharacter">are allowed by default,</a> but if you <a href="https://eslint.org/docs/rules/require-unicode-regexp">use the 'u' flag,</a> then you <a href="https://262.ecma-international.org/11.0/#prod-PatternCharacter">must escape them:</a></p> <pre>
/]}/.test(']}') // true
/]}/u.test(']}') // throws an exception
</pre> <p>This restriction is specific to JavaScript; lone closing brackets are allowed in other languages.</p> <p>If you create a regular expression on the fly <b>from a user-supplied string,</b> you can use the following function to properly escape the special characters:</p> <pre>
function escapeRe(str) {
    return str.replace(/[[\]*+?{}.()^$|\\-]/g, '\\$&amp;');
}

var re = new RegExp(escapeRe(start) + '.*?' + escapeRe(end));
</pre> <h3>PHP</h3> <p>In PHP, you have the <a href="https://www.php.net/manual/en/function.preg-quote.php">preg_quote</a> function to <b>insert a user-supplied string</b> into a regular expression pattern. In addition to the characters listed above, it also escapes <code>#</code> (in 7.3.0 and higher), the NUL character, and the following characters: <code>= ! &lt; &gt; : -</code>, which are not metacharacters on their own in PCRE regular expressions but are sometimes used as delimiters.
Closing brackets <code>]</code> and <code>}</code> are escaped, too, which is unnecessary:</p> <pre> preg_match('/]}/', ']}'); // returns 1 </pre> <p>Just like in JavaScript, you also need to <b>escape the delimiter,</b> which is usually <code>/</code>, but <a href="https://www.php.net/manual/en/regexp.reference.delimiters.php">you can use another special character</a> such as <code>#</code> or <code>=</code> if the slash appears inside your pattern:</p> <pre>
if (preg_match('/\/posts\/([0-9]+)/', $path, $matches)) {
}

// Can be simplified to:
if (preg_match('#/posts/([0-9]+)#', $path, $matches)) {
}
</pre> <p>Note that preg_quote does not escape the tilde <code>~</code> and the slash <code>/</code> by default. If you construct regexes from strings and use one of them as the delimiter, pass it as the second argument of preg_quote or choose another delimiter.</p> <p><a href="https://www.php.net/manual/en/language.types.string.php#language.types.string.syntax.double"><b>In double quotes,</b></a> <code>\1</code> and <code>$</code> are interpreted differently than in regular expressions, so the best practice is:</p> <ul> <li>to use single quotes with preg_match, preg_replace, etc.;</li> <li><a href="https://www.php.net/manual/en/regexp.reference.escape.php">to repeat the backslash 4 times</a> if you need to match a literal backslash.
This is because you need to escape the backslash in the regular expression, but you also need to escape it <a href="https://www.php.net/manual/en/language.types.string.php#language.types.string.syntax.single">in the single-quoted string.</a> So it's escaped twice:</li> </ul> <pre>
$text = 'C:\\Program files\\';
echo $text;
if (preg_match('/C:\\\\Program files\\\\/', $text, $matches)) {
    print_r($matches);
}
</pre> <h3>Python</h3> <p>Python has <b>a raw string syntax</b> (<code>r''</code>), which conveniently avoids the backslash escaping idiosyncrasies of PHP:</p> <pre>
import re
re.match(r'C:\\Program files/Tools', 'C:\\Program files/Tools')
</pre> <p>You only need to escape the quote in raw strings:</p> <pre>
re.match(r'\'', "'")
re.match(r"'", "'") # or just use double quotes if you have a regex with a single quote

re.match(r"\"", '"')
re.match(r'"', '"') # or use single quotes if you have a regex with a double quote

re.match(r'"\'', '"\'') # multiple quote types; cannot avoid escaping them
</pre> <p><a href="https://docs.python.org/3/reference/lexical_analysis.html#literals">A raw string literal</a> cannot end with a single backslash, but this is not a problem for a valid regular expression.</p> <p>To match a literal <code>]</code> <b>inside a character class</b>, you can make it the first character: <code>[][]</code> matches a closing or an opening bracket. Aba Search &amp; Replace <a href="https://www.abareplace.com/docs/charListClass.php">supports this syntax,</a> and so do Python and PCRE, but it is not portable: in JavaScript, <code>[]</code> is an empty character class. You can also escape the <code>]</code> character with a backslash, which works in most languages: <code>[\][]</code> or <code>[[\]]</code>. Java treats an unescaped <code>[</code> inside a class as the start of a nested class, so there you need to escape both brackets: <code>[\[\]]</code>.</p> <p>For <b>inserting a string</b> into a regular expression, Python offers the <a href="https://docs.python.org/3/library/re.html#re.escape">re.escape</a> function.
Unlike JavaScript with the <code>u</code> flag, Python tolerates escaping non-special punctuation characters, so this function also escapes <code>-</code>, <code>#</code>, <code>&amp;</code>, and <code>~</code>:</p> <pre>
print(re.escape(r'-#&amp;~')) # prints \-\#\&amp;\~
re.match(r'\@\~', '@~') # matches
</pre> <h3>Java</h3> <p>Java <a href="https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html">allows escaping non-special punctuation characters,</a> too:</p> <pre> Assert.assertTrue(Pattern.matches("\\@\\}\\] }]", "@}] }]")); </pre> <p>Similarly to PHP, you need to repeat the backslash character 4 times to match a literal backslash, but in Java, you also must <b>double the backslash character</b> when escaping other characters:</p> <pre> Assert.assertTrue(Pattern.matches("C:\\\\Program files \\(x86\\)\\\\", "C:\\Program files (x86)\\")); </pre> <p>This is because the backslash must be escaped in a Java string literal, so if you want to pass <code>\\ \[</code> to the regular expression engine, you need to double each backslash: <code>"\\\\ \\["</code>. There are no raw string literals in Java, so regular expressions are just ordinary strings.</p> <p>There is the <a href="https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#quote-java.lang.String-">Pattern.quote method</a> for <b>inserting a string</b> into a regular expression. It surrounds the string with <code>\Q</code> and <code>\E</code>, which quote everything between them in Java regexes (<a href="https://perldoc.perl.org/functions/quotemeta">borrowed from Perl</a>). If the string contains <code>\E</code>, the quoting is interrupted there and the <code>\E</code> is escaped with the backslash <code>\</code>:</p> <pre>
Assert.assertEquals("\\Q()\\E", Pattern.quote("()"));
Assert.assertEquals("\\Q\\E\\\\E\\Q\\E", Pattern.quote("\\E"));
Assert.assertEquals("\\Q(\\E\\\\E\\Q)\\E", Pattern.quote("(\\E)"));
</pre> <p>The <code>\Q...\E</code> syntax is <b>another way</b> to escape multiple special characters at once.
Besides Java, it's supported in PHP/PCRE and Go regular expressions, but not in Python or JavaScript.</p> <h3>Go</h3> <p>Go <a href="https://go.dev/ref/spec#String_literals">raw string literals</a> are characters between back quotes: <code>`\(`</code>. It's preferable to use them for regular expressions because <b>you don't need to double-escape the backslash:</b></p> <pre>
r := regexp.MustCompile(`\(text\)`)
fmt.Println(r.FindString("(text)"))
</pre> <p><b>A back quote</b> cannot be used in a raw string literal, so you have to resort to the usual <code>"`"</code> string syntax for it. But this is a rare character.</p> <p>The <b><code>\Q...\E</code> syntax</b> is supported, too:</p> <pre>
r := regexp.MustCompile(`\Q||\E`)
fmt.Println(r.FindString("||"))
</pre> <p>There is a <a href="https://golang.google.cn/pkg/regexp/#QuoteMeta">regexp.QuoteMeta</a> function for <b>inserting strings</b> into a regular expression. In addition to the characters listed above, it also escapes closing brackets <code>]</code> and <code>}</code>.</p> Sat, 08 Jan 2022 12:08:02 +0100https://www.abareplace.com/blog/escape-regexp/