Aba Search and Replace blog
https://www.abareplace.com/blog/
Search tips and tricks, regular expression tutorials, announcements about new versions of Aba Search and Replace.

Search from the Windows command prompt
<p>When you need to search within text files from Windows batch files, you can use either the find or findstr command. Findstr supports a limited version of regular expressions. You can also automate certain tasks based on the search results.</p>
<h3>The find command</h3>
<p>To search for text in multiple files from the Windows command prompt or batch files, you can use the <b>FIND</b> command, which has been present since the days of MS DOS and is still available in Windows 11. It's similar to the Unix <code>grep</code> command, but does not support regular expressions. If you want to search for the word <code>borogoves</code> in the current directory, please follow this syntax:</p>
<pre>
find "borogoves" *
</pre>
<p>Note that the double quotes around the pattern are mandatory. If you are using PowerShell, you will need to include single quotes as well:</p>
<pre>
find '"borogoves"' *
</pre>
<p>Instead of the asterisk (<code>*</code>), you can specify a file mask such as <code>*.htm?</code>. The <code>find</code> command displays the names of the files it scans, even if it doesn't find any matches within these files:</p>
<img src="/FindStr1.png" alt="The FIND command in Windows 11" title="" width="652" height="262">
<p>The search is <b>case-sensitive</b> by default, so you typically need to add the <code>/I</code> switch to treat uppercase and lowercase letters as equivalent:</p>
<pre>
find /I "&lt;a href=" *.htm
</pre>
<p>If you don't specify the file to search in, <code>find</code> will wait for the text input <b>from stdin,</b> so that you can pipe output from another command.
For example, you can list all copy commands supported in Windows:</p>
<pre>
help | find /i "copy"
</pre>
<p>Another switch, <code>/V</code>, allows you to find all lines not containing the pattern, similar to the <code>grep -v</code> command.</p>
<p>In <b>batch files,</b> you can use the fact that the <code>find</code> command sets the exit code (<b>errorlevel</b>) to 1 if the pattern is not found. For instance, you can check if the machine is running a 64-bit or 32-bit version of Windows:</p>
<pre>
@echo off
rem Based on KB556009 with some corrections
reg Query "HKLM\Hardware\Description\System\CentralProcessor\0" /v "Identifier" | find /i "x86 Family" &gt; nul
if errorlevel 1 goto win64
echo 32-bit Windows
goto :eof
:win64
rem Could be AMD64 or ARM64
echo 64-bit Windows
</pre>
<h3>The findstr command: regular expression search</h3>
<p>If you need to search with <b>a regular expression,</b> try the <code>FINDSTR</code> command, which was introduced in Windows XP. <a href="https://devblogs.microsoft.com/oldnewthing/20151209-00/?p=92361">For historical reasons,</a> <code>findstr</code> supports a limited subset of regular expressions, so you can only use these <a href="https://www.abareplace.com/docs/regExprElements.php">regex features:</a></p>
<ul>
<li>The dot <code>.</code> matches any character except for newline and extended ASCII characters.</li>
<li>Character lists <code>[abc]</code> match any of the specified characters (<code>a</code>, <code>b</code>, or <code>c</code>).</li>
<li>Character list ranges <code>[a-z]</code> match any letter from <code>a</code> to <code>z</code>.</li>
<li>The asterisk (<code>*</code>) indicates that the previous character can be repeated zero or more times.</li>
<li>The <code>\&lt;</code> and <code>\&gt;</code> symbols mark the beginning and the end of a word.</li>
<li>The caret (<code>^</code>) and the dollar sign (<code>$</code>) denote the beginning and the end of a line.</li>
<li>The backslash (<code>\</code>) escapes any
metacharacter, allowing you to find literal characters. For example, <code>\$</code> finds the dollar sign itself.</li>
</ul>
<p><b>Findstr</b> does not support character classes (<code>\d</code>), alternation (<code>|</code>), or other repetitions (<code>+</code> or <code>{5}</code>).</p>
<p>The basic syntax is the same as for the <code>FIND</code> command:</p>
<pre>
findstr "\&lt;20[0-9][0-9]\&gt;" *.htm
</pre>
<p>This command finds all four-digit years from 2000 to 2099 in the <code>.htm</code> files of the current directory. Just like with <code>find</code>, use the <code>/I</code> switch for <b>a case-insensitive</b> search:</p>
<img src="/FindStr2.png" alt="The FINDSTR command in Windows 11" title="" width="652" height="115">
<h3>Findstr limitations and quirks</h3>
<p>Character lists <code>[a-z]</code> are always case-insensitive, so <code>echo ABC | findstr "[a-z]"</code> matches.</p>
<p><b>The space character</b> works as the alternation metacharacter in <code>findstr</code>, so a search query like <code>findstr "new shoes" *</code> will find all lines containing either <code>new</code> or <code>shoes</code>. Unfortunately, there is no way to escape the space and use it as a literal character in a regular expression. For example, you cannot find lines starting with a space.</p>
<p><b>Syntax errors</b> in regular expressions are ignored. For instance, <code>findstr "[" *</code> will match all lines that contain the <code>[</code> character.</p>
<p>If the file contains <b>Unix line breaks</b> (LF), the <code>$</code> metacharacter does not work correctly. If <b>the last line of a file</b> lacks a line terminator, <code>findstr</code> will be unable to find it. For example, <code>findstr "&lt;/html&gt;$" *</code> won't work if there is no CR+LF after &lt;/html&gt;.</p>
<p>Early Windows versions had <b>limitations on line length</b> for <code>find</code> and <code>findstr</code>, as well as other commands.
The recent versions lifted these limits, so you don't have to worry about them anymore. See <a href="https://stackoverflow.com/questions/8844868/what-are-the-undocumented-features-and-limitations-of-the-windows-findstr-comman/20159191#20159191">this StackOverflow question</a> for <code>findstr</code> limitations and bugs, especially in early Windows versions.</p>
<p>The findstr command operates in <b>the OEM (MS DOS) code page;</b> the dot metacharacter does not match any of the extended ASCII characters. As a result, the command is not very useful for non-English text. Besides that, you cannot search for Unicode characters (UTF-8 or UTF-16).</p>
<h3>Conclusion</h3>
<p>You can learn about other switches by typing <code>findstr /?</code> or <code>find /?</code>. For example, the additional switches allow you to search in subdirectories or print line numbers. You can also refer to <a href="https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/findstr">the official documentation.</a></p>
<p>In general, the <code>find</code> and <code>findstr</code> commands are outdated and come with various quirks and limitations. Shameless plug: <b>Aba Search and Replace</b> supports <a href="/docs/cmdLine.php">command-line options as well,</a> allowing you to search from the command prompt and replace text from Windows batch files.</p>
Sun, 21 May 2023 14:07:58 +0200
https://www.abareplace.com/blog/findstr/

Empty character class in JavaScript regexes
<p>I <a href="https://github.com/PCRE2Project/pcre2/blob/master/maint/GenerateUcd.py">contributed to PCRE</a> and wrote two smaller regular expression engines, but I still regularly learn something new about this topic. This time, it's about <b>a regex that never matches.</b></p>
<p>When using <a href="https://www.abareplace.com/docs/charListClass.php">character classes,</a> you can specify the allowed characters in brackets, such as <code>[a-z]</code> or <code>[aeiouy]</code>.
But what happens if the character class is empty?</p>
<p>Popular <b>regex engines</b> treat the empty brackets <code>[]</code> differently. In JavaScript, they never match. This is valid JavaScript code, and it always prints false regardless of the value of <code>str</code>:</p>
<pre>
const str = 'a';
console.log(/[]/.test(str));
</pre>
<p>However, in Java, PHP (PCRE), Go, and Python, the same regex fails to compile:</p>
<pre>
// Java
@Test
void testRegex1() {
    PatternSyntaxException e = assertThrows(PatternSyntaxException.class,
        () -> Pattern.compile("[]"));
    assertEquals("Unclosed character class", e.getDescription());
}
</pre>
<pre>
&lt;?php
ini_set('display_errors', 1);
error_reporting(E_ALL);

// Emits a warning: preg_match(): Compilation failed: missing terminating ] for character class
echo preg_match('/[]/', ']') ? 'Match ' : 'No match';
</pre>
<pre>
# Python
import re
re.compile('[]') # throws "unterminated character set"
</pre>
<p>In these languages, you can <b>put the closing bracket right after the opening bracket</b> to avoid <a href="https://www.abareplace.com/blog/escape-regexp/">escaping the former</a>:</p>
<pre>
// Java
@Test
void testRegex2() {
    Pattern p = Pattern.compile("[]]");
    Matcher m = p.matcher("]");
    assertTrue(m.matches());
}
</pre>
<pre>
&lt;?php
echo preg_match('/[]]/', ']', $m) ? 'Match ' : 'No match'; // Outputs 'Match'
print_r($m);
</pre>
<pre>
# Python
import re
print(re.match('[]]', ']')) # outputs the Match object
</pre>
<pre>
// Go
package main

import (
    "fmt"
    "regexp"
)

func main() {
    matched, err := regexp.MatchString(`[]]`, "]")
    fmt.Println(matched, err)
}
</pre>
<p>This won't work in JavaScript because the first <code>]</code> is interpreted as the end of the character class there, so the same regular expression in JavaScript means <a href="https://262.ecma-international.org/13.0/#sec-compiletocharset">an empty character class</a> that never matches, followed by a closing bracket.
As a result, the regular expression never finds the closing bracket:</p>
<pre>
// JavaScript
console.log(/[]]/.test(']')); // outputs false
</pre>
<p>If you <b>negate the empty character class</b> with <code>^</code> in JavaScript, it will match any character including newlines:</p>
<pre>
console.log(/[^]/.test('')); // outputs false
console.log(/[^]/.test('a')); // outputs true
console.log(/[^]/.test('\n')); // outputs true
</pre>
<p>Again, this is an invalid regex in other languages. PCRE can emulate the JavaScript behavior if you pass the PCRE2_ALLOW_EMPTY_CLASS option to <a href="https://pcre.org/current/doc/html/pcre2api.html#SEC20">pcre2_compile.</a> PHP never passes this flag.</p>
<p>If you want to match <b>an opening or a closing bracket,</b> this somewhat cryptic regular expression will help you in Java, PHP, Python, or Go: <code><b>[</b>][<b>]</b></code>. The first opening bracket starts the character class, which includes the literal closing bracket and the literal opening bracket, and finally, the last closing bracket ends the class.</p>
<p>In JavaScript, you need to escape the closing bracket like this: <code><b>[</b>\][<b>]</b></code></p>
<pre>
console.log(/[\][]/.test('[')); // outputs true
console.log(/[\][]/.test(']')); // outputs true
</pre>
<p>In Aba Search and Replace, I chose to support the syntax used in Java/PHP/Python/Go. There are <a href="https://stackoverflow.com/questions/1723182/a-regex-that-will-never-be-matched-by-anything">many other ways</a> to construct a regular expression that always fails, in case you need it.
So it makes sense to use this syntax for a literal closing bracket.</p> Mon, 10 Apr 2023 17:44:12 +0200https://www.abareplace.com/blog/emptybrackets/Privacy Policy Update - December 2022<p>Updated <a href="/order/#privacy">our privacy policy:</a></p> <ul> <li>clarified your rights under GDPR (you can object to processing of your personal data or restrict the processing, etc.);</li> <li>added that we don't do any profiling for marketing purposes, but PayPro Global may do risk scoring in order to prevent a potential credit card fraud;</li> <li>added that we can notify you by email about new software versions (you can leave this checkbox empty or unsubscribe at any time);</li> <li>listed what happens if you don't provide your personal data (e.g., if you don't provide your email address, we cannot reply to you);</li> <li>changed the refund policy from 30 to 14 days, added a reference to the relevant Czech law;</li> <li>stated that we do full-disk encryption and encrypt all backups, so your personal data are safe with us.</li> </ul> <p>Note that we are required by law to notify you of any changes in the privacy policy. Thank you and have a nice holiday season!</p> Sun, 25 Dec 2022 21:17:32 +0100https://www.abareplace.com/blog/privacy2022-12/Aba 2.5 released<p>The new features in this version include:</p> <ul> <li>Search and replace <a href="/docs/cmdLine.php">from the command line</a></li> <li><a href="/docs/searchParams.php#browseForFiles">Skip subdirectories</a> when searching (click the <i>Browse</i> button and uncheck <i>Include subdirectories</i>)</li> <li><a href="/docs/searchResults.php#sorting">Sorting</a> the search results by path, filename, extension, modification date, or file size.</li> <li>Escape sequences and character classes inside the character lists, e.g. 
<code>[\d\s]</code> to find a digit, a space, or a newline.</li>
<li>Fixed multiple bugs including encoding detection in very short files and searching for the replacement character U+FFFD (many thanks to Joe). Also fixed incorrect search in files slightly larger than 4 GB.</li>
<li>Now relative paths are displayed instead of absolute ones in the search results.</li>
</ul>
<p>The upgrade is free for the registered users. Just <a href="/download/">download</a> the installer and run it; your settings and search history will be preserved.</p>
<img src="/blog_aba25.png" width="688" height="436" alt="Aba 2.5 window" title="">
Sun, 11 Dec 2022 20:03:51 +0100
https://www.abareplace.com/blog/aba25/

Our response to the war in Ukraine
<p>In response to the Russian invasion of Ukraine, I blocked all orders from Russia starting from March 2022. I fully support Ukraine in this terrible war and donate money to help Ukrainian refugees in the Czech Republic.</p>
<p>Many of you are in a tough situation now due to the high inflation and the rising energy prices. So I am introducing <b>a 10% discount</b> for all new Aba Search and Replace users, but especially for freelancers and small businesses who pay for the software from their own pocket.</p>
<p>Please use this coupon code at <a href="/buy/">checkout:</a></p>
<p><code><b>GloryToUkraine</b></code></p>
<p>The coupon code is valid until the end of 2022. I plan to release a new version within several weeks; the upgrade will be free for all registered users. Please stay tuned.</p>
<p>Thank you for your continued support.
Wishing you peace and good fortune.</p>
<p><i>Peter Kankowski,</i><br><i>Aba Search and Replace developer</i></p>
Sat, 01 Oct 2022 12:37:28 +0200
https://www.abareplace.com/blog/ukraine/

Check VAT ID with regular expressions and VIES
<p>In the European Union, each business registered for VAT has a unique <a href="https://taxation-customs.ec.europa.eu/vat-identification-numbers_en">identification number</a> like IE9825613N or LU20260743. When selling to EU companies, you need to ask for their VAT ID, validate it, and include it in the invoice. The tax rate depends on <a href="https://taxation-customs.ec.europa.eu/where-tax_en">the place of taxation</a> and the client type (a person or a company). Some customers may provide a wrong VAT ID, either by mistake or in an attempt to avoid paying the tax. So it's important to check the VAT number.</p>
<p>The EU provides the <a href="https://ec.europa.eu/taxation_customs/vies/vatRequest.html">VIES page</a> (VAT Information Exchange System) and <a href="https://ec.europa.eu/taxation_customs/vies/faq.html#item_18">a free SOAP API</a> for VAT ID validation.
Here is how you can query the API:</p>
<pre>
Linux / macOS:
curl -d '&lt;soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:urn="urn:ec.europa.eu:taxud:vies:services:checkVat:types"&gt;&lt;soapenv:Body&gt;&lt;urn:checkVat&gt;&lt;urn:countryCode&gt;IE&lt;/urn:countryCode&gt;&lt;urn:vatNumber&gt;9825613N&lt;/urn:vatNumber&gt;&lt;/urn:checkVat&gt;&lt;/soapenv:Body&gt;&lt;/soapenv:Envelope&gt;' 'https://ec.europa.eu/taxation_customs/vies/services/checkVatService'

Windows:
(iwr 'https://ec.europa.eu/taxation_customs/vies/services/checkVatService' -method post -body '&lt;soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:urn="urn:ec.europa.eu:taxud:vies:services:checkVat:types"&gt;&lt;soapenv:Body&gt;&lt;urn:checkVat&gt;&lt;urn:countryCode&gt;IE&lt;/urn:countryCode&gt;&lt;urn:vatNumber&gt;9825613N&lt;/urn:vatNumber&gt;&lt;/urn:checkVat&gt;&lt;/soapenv:Body&gt;&lt;/soapenv:Envelope&gt;').content
</pre>
<p>If the VAT number is invalid, you will get <code>&lt;valid&gt;false&lt;/valid&gt;</code> in the response. You can use a SOAP library or just concatenate the XML string with the VAT identification number. In the latter case, you should first check the VAT number with a regular expression; otherwise, an attacker can inject arbitrary XML into the request.
The VIES <a href="https://ec.europa.eu/taxation_customs/vies/checkVatService.wsdl">WSDL file</a> provides these regular expressions:</p>
<pre>
Country code: [A-Z]{2}
VAT ID without the country code: [0-9A-Za-z\+\*\.]{2,12}
</pre>
<p>The country code consists of two capital letters; the VAT ID itself is from 2 to 12 letters, digits, or these characters: <code>+ * .</code></p>
<p>So the finished code for VAT ID validation could look like this:</p>
<pre>
import re, urllib.request, xml.etree.ElementTree as XmlElementTree

# Return a dictionary with some information about the company, or False if the vat_id is invalid
def check_vat_id(vat_id):
    # Raw string, so that the backslashes reach the regex engine unchanged
    m = re.match(r'^([A-Z]{2})([0-9A-Za-z\+\*\.]{2,12})$', vat_id.replace(' ', ''))
    if not m:
        return False

    data = '&lt;soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" ' + \
        'xmlns:urn="urn:ec.europa.eu:taxud:vies:services:checkVat:types"&gt;' + \
        '&lt;soapenv:Body&gt;&lt;urn:checkVat&gt;&lt;urn:countryCode&gt;' + m.group(1) + '&lt;/urn:countryCode&gt;' + \
        '&lt;urn:vatNumber&gt;' + m.group(2) + '&lt;/urn:vatNumber&gt;&lt;/urn:checkVat&gt;&lt;/soapenv:Body&gt;&lt;/soapenv:Envelope&gt;'

    with urllib.request.urlopen('https://ec.europa.eu/taxation_customs/vies/services/checkVatService', data.encode('ascii')) as response:
        resp = response.read().decode('utf-8')

    ns = {
        'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
        'checkVat': 'urn:ec.europa.eu:taxud:vies:services:checkVat:types',
    }
    checkVatResponse = XmlElementTree.fromstring(resp).find('./soap:Body/checkVat:checkVatResponse', ns)
    if checkVatResponse.find('./checkVat:valid', ns).text != 'true':
        return False

    # Strip the namespace prefix from each tag name
    res = {}
    for child in checkVatResponse:
        res[child.tag.replace('{urn:ec.europa.eu:taxud:vies:services:checkVat:types}', '')] = child.text
    return res

print(check_vat_id('IE9825613N'))
</pre>
<p>Each EU country also has <a href="https://ec.europa.eu/taxation_customs/vies/faq.html#item_11">its own rules</a> for a VAT identification number, so you can do a
stricter pre-check with a complex regular expression, but VIES already covers this for you. Also note that some payment processors (e.g. <a href="https://stripe.com/docs/billing/customer/tax-ids#eu-vat">Stripe</a>) already do a VIES query under the hood.</p> Sun, 31 Jul 2022 16:50:00 +0200https://www.abareplace.com/blog/vat_id/Which special characters must be escaped in regular expressions?<p>In most regular expression engines (PCRE, JavaScript, Python, Go, and Java), these special characters <b>must</b> be escaped outside of character classes:</p> <pre> [ * + ? { . ( ) ^ $ | \ </pre> <p>If you want to find one of these metacharacters literally, please add <code>\</code> before it. For example, to find the text <code>$100</code>, use <code>\$100</code>. If you want to find the backslash itself, double it: <code>\\</code>.</p> <p><b>Inside character classes</b> [square brackets], you must escape the following characters:</p> <pre> \ ] - </pre> <p>For example, to find an opening or a closing bracket, use <code>[[\]]</code>.</p> <p>If you need to include <b>the dash into a character class,</b> you can make it the first or the last character instead of escaping it. Use <code>[a-z-]</code> or <code>[a-z\-]</code> to find a Latin letter or a dash.</p> <p>If you need to include <b>the caret ^ into a character class,</b> it cannot be the first character; otherwise, it will be interpreted as any character except the specified ones. For example: <code>[^aeiouy]</code> means "any character except vowels", while <code>[a^eiouy]</code> means "any vowel or a caret". 
Alternatively, you can escape the caret: <code>[\^aeiouy]</code></p> <h3>JavaScript</h3> <p>In JavaScript, you also need to escape <b>the slash</b> <code>/</code> in regular expression literals:</p> <pre> /AC\/DC/.test('AC/DC') </pre> <p><b>Lone closing brackets</b> <code>]</code> and <code>}</code> <a href="https://262.ecma-international.org/11.0/#prod-annexB-ExtendedPatternCharacter">are allowed by default,</a> but if you <a href="https://eslint.org/docs/rules/require-unicode-regexp">use the 'u' flag,</a> then you <a href="https://262.ecma-international.org/11.0/#prod-PatternCharacter">must escape them:</a></p> <pre> /]}/.test(']}') // true /]}/u.test(']}') // throws an exception </pre> <p>This feature is specific for JavaScript; lone closing brackets are allowed in other languages.</p> <p>If you create a regular expression on the fly <b>from a user-supplied string,</b> you can use the following function to properly escape the special characters:</p> <pre> function escapeRe(str) { return str.replace(/[[\]*+?{}.()^$|\\-]/g, '\\$&amp;'); } var re = new RegExp(escapeRe(start) + '.*?' + escapeRe(end)); </pre> <h3>PHP</h3> <p>In PHP, you have the <a href="https://www.php.net/manual/en/function.preg-quote.php">preg_quote</a> function to <b>insert a user-supplied string</b> into a regular expression pattern. In addition to the characters listed above, it also escapes <code>#</code> (in 7.3.0 and higher), the null terminator, and the following characters: <code>= ! &lt; &gt; : -</code>, which do not have a special meaning in PCRE regular expressions but are sometimes used as delimiters. 
Closing brackets <code>]</code> and <code>}</code> are escaped, too, which is unnecessary:</p> <pre> preg_match('/]}/', ']}'); // returns 1 </pre> <p>Just like in JavaScript, you also need to <b>escape the delimiter,</b> which is usually <code>/</code>, but <a href="https://www.php.net/manual/en/regexp.reference.delimiters.php">you can use another special character</a> such as <code>#</code> or <code>=</code> if the slash appears inside your pattern:</p> <pre> if (preg_match('/\/posts\/([0-9]+)/', $path, $matches)) { } // Can be simplified to: if (preg_match('#/posts/([0-9]+)#', $path, $matches)) { } </pre> <p>Note that preg_quote does not escape the tilde <code>~</code> and the slash <code>/</code>, so you should not use them as delimiters if you construct regexes from strings.</p> <p><a href="https://www.php.net/manual/en/language.types.string.php#language.types.string.syntax.double"><b>In double quotes,</b></a> <code>\1</code> and <code>$</code> are interpreted differently than in regular expressions, so the best practice is:</p> <ul> <li>to use single quotes with preg_match, preg_replace, etc.;</li> <li><a href="https://www.php.net/manual/en/regexp.reference.escape.php">to repeat backslash 4 times</a> if you need to match a literal backslash. 
This is because you need to escape the backslash in the regular expression, but you also need to escape it <a href="https://www.php.net/manual/en/language.types.string.php#language.types.string.syntax.single">in the single-quoted string.</a> So it's escaped twice:</li>
</ul>
<pre>
$text = 'C:\\Program files\\';
echo $text;
if (preg_match('/C:\\\\Program files\\\\/', $text, $matches)) {
    print_r($matches);
}
</pre>
<h3>Python</h3>
<p>Python has <b>a raw string syntax</b> (<code>r''</code>), which conveniently avoids the backslash escaping idiosyncrasies of PHP:</p>
<pre>
import re
re.match(r'C:\\Program files/Tools', 'C:\\Program files/Tools')
</pre>
<p>You only need to escape the quote in raw strings:</p>
<pre>
re.match(r'\'', "'")
re.match(r"'", "'") # or just use double quotes if you have a regex with a single quote

re.match(r"\"", '"')
re.match(r'"', '"') # or use single quotes if you have a regex with a double quote

re.match(r'"\'', '"\'') # multiple quote types; cannot avoid escaping them
</pre>
<p><a href="https://docs.python.org/3/reference/lexical_analysis.html#literals">A raw string literal</a> cannot end with a single backslash, but this is not a problem for a valid regular expression.</p>
<p>To match a literal <code>]</code> <b>inside a character class</b>, you can make it the first character: <code>[][]</code> matches a closing or an opening bracket. Aba Search &amp; Replace <a href="https://www.abareplace.com/docs/charListClass.php">supports this syntax,</a> too, but JavaScript does not. You can also quote the <code>]</code> character with a backslash, which works in all languages: <code>[\][]</code> or <code>[[\]]</code>.</p>
<p>For <b>inserting a string</b> into a regular expression, Python offers the <a href="https://docs.python.org/3/library/re.html#re.escape">re.escape</a> method.
Unlike JavaScript with the <code>u</code> flag, Python tolerates escaping non-special punctuation characters, so this function also escapes <code>-</code>, <code>#</code>, <code>&amp;</code>, and <code>~</code>:</p>
<pre>
print(re.escape(r'-#&amp;~')) # prints \-\#\&amp;\~
re.match(r'\@\~', '@~') # matches
</pre>
<h3>Java</h3>
<p>Java <a href="https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html">allows escaping non-special punctuation characters,</a> too:</p>
<pre>
Assert.assertTrue(Pattern.matches("\\@\\}\\] }]", "@}] }]"));
</pre>
<p>Similarly to PHP, you need to repeat the backslash character 4 times to match a literal backslash, but in Java, you also must <b>double the backslash character</b> when escaping other characters:</p>
<pre>
Assert.assertTrue(Pattern.matches("C:\\\\Program files \\(x86\\)\\\\", "C:\\Program files (x86)\\"));
</pre>
<p>This is because the backslash must be escaped in a Java string literal, so if you want to pass <code>\\ \[</code> to the regular expression engine, you need to double each backslash: <code>"\\\\ \\["</code>. There are no raw string literals in Java, so regular expressions are just usual strings.</p>
<p>There is the <a href="https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#quote-java.lang.String-">Pattern.quote method</a> for <b>inserting a string</b> into a regular expression. It surrounds the string with <code>\Q</code> and <code>\E</code>, which escapes multiple characters in Java regexes (<a href="https://perldoc.perl.org/functions/quotemeta">borrowed from Perl</a>). If the string contains <code>\E</code>, it will be escaped with the backslash <code>\</code>:</p>
<pre>
Assert.assertEquals("\\Q()\\E", Pattern.quote("()"));
Assert.assertEquals("\\Q\\E\\\\E\\Q\\E", Pattern.quote("\\E"));
Assert.assertEquals("\\Q(\\E\\\\E\\Q)\\E", Pattern.quote("(\\E)"));
</pre>
<p>The <code>\Q...\E</code> syntax is <b>another way</b> to escape multiple special characters at once.
Besides Java, it's supported in PHP/PCRE and Go regular expressions, but not in Python nor in JavaScript.</p> <h3>Go</h3> <p>Go <a href="https://go.dev/ref/spec#String_literals">raw string literals</a> are characters between back quotes: <code>`\(`</code>. It's preferable to use them for regular expressions because <b>you don't need to double-escape the backslash:</b></p> <pre> r := regexp.MustCompile(`\(text\)`) fmt.Println(r.FindString("(text)")) </pre> <p><b>A back quote</b> cannot be used in a raw string literal, so you have to resort to the usual <code>"`"</code> string syntax for it. But this is a rare character.</p> <p>The <b><code>\Q...\E</code> syntax</b> is supported, too:</p> <pre> r := regexp.MustCompile(`\Q||\E`) fmt.Println(r.FindString("||")) </pre> <p>There is a <a href="https://golang.google.cn/pkg/regexp/#QuoteMeta">regexp.QuoteMeta</a> method for <b>inserting strings</b> into a regular expression. In addition to the characters listed above, it also escapes closing brackets <code>]</code> and <code>}</code>.</p> Sat, 08 Jan 2022 12:08:02 +0100https://www.abareplace.com/blog/escape-regexp/Aba 2.4 releasedThe new version adds support for 4K monitors and very long paths (longer than 260 characters). I also fixed the annoying sound when typing a regular expression. Thanks to Pouemes, we now have a French translation. 
The upgrade is free for the registered users; your settings and previous searches are fully preserved.Sun, 05 Sep 2021 02:00:00 +0200https://www.abareplace.com/blog/aba24/Privacy Policy Update - April 2021<p>Updated our <a href="/order/#privacy">privacy policy</a>:</p> <ul> <li>stated that we now use a web/mail hosting in Germany (Hetzner);</li> <li>clarified which data we collect via the contact form;</li> <li>re-iterated that you don't have to use Do Not Track or Global Privacy Control headers because we don't track you by default.</li> </ul> <p>Future updates to the privacy policy will be published in this blog; we are required by law to inform you in case of any changes. Thank you and have a nice day!</p> Sat, 17 Apr 2021 20:39:02 +0200https://www.abareplace.com/blog/privacy2021-04/Review of Aba Search and Replace with video<p>FindMySoft, a software download directory, published a Quick Look Video showcasing Aba Search and Replace 2.2 interface and features. You can <a href="http://aba-search-and-replace.findmysoft.com/">watch the video and read a review</a> of my program at their site.</p> <div style="text-align:center"><a href="http://aba-search-and-replace.findmysoft.com/"><img src="/blog_findmysoft.png" alt="Find My Soft review and video" width="764" height="346" style="border:0"></a></div> <p>Some clarifications to their review. Aba cannot search and replace in MS Word documents (yet!). About being “too simple for advanced users”: Aba’s support for regular expressions includes <a href="/docs/lookaround.php">variable-length lookbehind</a> and <a href="/docs/charListClass.php">Unicode character classes</a>; most search-and-replace tools lack these features. 
Generally, I try to keep the interface clean and less cluttered than competitors while adding advanced features for power users.</p>Fri, 20 Apr 2012 10:00:00 +0200https://www.abareplace.com/blog/findmysoft/Aba 2.2 released<p>The new version adds <a href="/docs/lookaround.php">lookaround</a> and braces in regular expressions. I also implemented \b anchor and <a href="/docs/backref.php#noncapturing">non-capturing groups</a>.</p> <p>Several bugs were fixed including incorrect PHP syntax highlight, crash when processing invalid UTF-8 or when changing a long replacement.</p> <p>Many thanks to Stefan Schuck for updating the German translation. I'm looking for people who can translate Aba into other languages (especially, French and Spanish).</p> <p><a href="http://www.abareplace.com/setup.exe">Download the new version</a></p>Tue, 03 Jan 2012 21:00:00 +0100https://www.abareplace.com/blog/aba22/Discount on Aba Search and Replace<p>I would like to offer everybody <b>15% discount</b> on Aba Search and Replace until the end of January. Please use the coupon code:</p> <pre>Happy2012</pre> <p>Future upgrades will be free for registered users. Thank you for using Aba. Happy Holidays and best wishes for the New Year!</p> Tue, 20 Dec 2011 09:00:00 +0100https://www.abareplace.com/blog/happy2012discount/Using search and replace to rename a method<p>It's easy to rename a method using <a href="/">Aba Search and Replace:</a></p> <ul> <li>enter the current and the new names,</li> <li>turn on the <i>Match whole word</i> and <i>Match case</i> modes,</li> <li>review the found occurrences;</li> <li>press the <i>Replace</i> button.</li> </ul> <img src="/blog_rename_method.png" alt="Replace GetFileSize with GetSize, addslashes with sqlite_escape_string." 
width="488" height="65"> <p>The name of a method, <i>GetFileSize</i>, collided with the name of <a href="http://msdn.microsoft.com/en-us/library/aa364955">a Win32 API function,</a> so I wanted to replace it with <i>GetSize.</i> The latter is also shorter and avoids the tautology: <i>file</i>.Get<i>File</i>Size.</p> <p>To find the references to my method, not to the Win32 API function, I added a dot (or <code>-&gt;</code> in C++) before the name: <code>.GetFileSize</code></p> <p>In PHP code, I replaced all calls to <code>addslashes</code> with <code>sqlite_escape_string</code> when porting my site to SQLite. The two functions escape quotes <a href="http://php.net/sqlite_escape_string">differently;</a> <i>addslashes</i> should never be used with SQLite.</p>Mon, 05 Dec 2011 09:00:00 +0100https://www.abareplace.com/blog/blog_rename_method/Cleaning the output of a converter<p>When I worked at a small web design company, we often had clients bringing us <b>an MS Word, Excel, or PDF file</b> that had to be published on the web. Not as a downloadable file, but as a web page integrated into their site.</p> <p>Microsoft Word can certainly save files as HTML, but the resulting code was bloated and different from our design. What we needed was simple HTML that our designer could edit and style. 
How could Aba S&amp;R help us?</p> <p>Here is a DOC file saved in HTML:</p> <p><code>&lt;h3 align=center style='text-align:center;'&gt;&lt;b&gt;&lt;span style='font-size:10.0pt;font-family:"Arial";'&gt;Lorem ipsum&lt;/span&gt;&lt;/b&gt;&lt;/h3&gt;</code></p> <p><code>&lt;p class=Normal align=justify style='text-indent:14.0pt;text-align:justify;'&gt;&lt;span style='font-family:"Times New Roman";'&gt;Lorem ipsum dolor sit &lt;i&gt;amet,&lt;/i&gt; consectetur adipisicing elit.&lt;/span&gt;&lt;/p&gt;</code></p> <p>We need to <b>remove all attributes and &lt;span&gt; tags:</b></p> <p><code>&lt;h3&gt;&lt;b&gt;Lorem ipsum&lt;/b&gt;&lt;/h3&gt;</code></p> <p><code>&lt;p&gt;Lorem ipsum dolor sit &lt;i&gt;amet,&lt;/i&gt; consectetur adipisicing elit.&lt;/p&gt;</code></p> <p>The following <b>replacements</b> can be used:</p> <pre>Search for: <b>&lt;</b><span style="color:blue">(</span><b>p</b><span style="color:blue">|</span><b>h1</b><span style="color:blue">|</span><b>h2</b><span style="color:blue">|</span><b>h3</b><span style="color:blue">)</span> <span style="color:olive">[^&gt;]</span><span style="color:purple">*</span><b>&gt;</b>
Replace with: &lt;<span style="color:purple">\1</span>&gt;

Search for: <b>&lt;span </b><span style="color:olive">[^&gt;]</span><span style="color:purple">*</span><b>&gt;</b>
Replace with: (nothing)

Search for: <b>&lt;/span&gt;</b>
Replace with: (nothing)</pre> <p><code><span style="color:olive">[^&gt;]</span><span style="color:purple">*</span></code> matches everything up to the next closing angle bracket &gt;, and <span style="color:purple">\1</span> refers to the text matched by the first pair of parentheses (in our case, the tag name).</p> <div style="text-align:center"><img src="/blog_html_convertor.png" alt="Remove HTML attributes with regular expressions" width="492" height="342"></div> <p>I often used Aba to <b>clean the output of a converter.</b> For one client, I had to convert dozens of PDF files with technical specifications to HTML. 
There was a lot of formatting (subscripts, superscripts, tables), so I could not simply copy-and-paste it. There were also errors, such as the letter O instead of zero in subscripts. Without Aba, I could not have cleaned up this mess.</p> <a name="bad-practice"></a> <h3>Is it bad practice?</h3> <p><a href="http://www.reddit.com/r/webdev/comments/mcwh5/how_to_replace_html_tags_using_regular_expressions/">Two redditors criticized</a> my previous post about <a href="/blog/html_tags/">using regular expressions to replace HTML tags.</a></p> <p>I fully agree that regexes should never be used to parse <b>arbitrary HTML code,</b> for example, HTML entered by a user. Never do this in your scripts; it's unreliable and insecure.</p> <p>But what if you need to replace all relative links (/blog/) <b>in your own code</b> with absolute links (http://www.example.com/blog/), because you are moving some parts of your site to a subdomain (http://myproduct.example.com)? <b>Would you craft a script</b> that parses your HTML code (carefully skipping &lt;?php tags, which Python's HTMLParser cannot do), searches for all <code>&lt;a&gt;</code> tags with the <code>href</code> attribute, replaces the links, and saves the result to a file?</p> <p><b>Or would you toss off a regex</b> in <a href="/">a search-and-replace tool?</a></p> <div style="text-align:center"><img src="/blog_html_convertor2.png" alt="Would you write 43 lines of Python code or one-line regex for an ad-hoc replacement?" title="Would you write 43 lines of Python code or one-line regex for an ad-hoc replacement?" 
width="652" height="466" style="padding-top:10px"></div>Fri, 18 Nov 2011 09:00:00 +0100https://www.abareplace.com/blog/html_convertor/Aba 2.1 released<p>The new version fixes some bugs, such as incorrectly displayed date/time, and adds the <i>File</i> menu for viewing/editing a file or copying the results list to the clipboard.</p> <img src="/docs/fileMenu.png" width="314" height="163" alt="File menu"> <p>As always, <b>the upgrade is free</b> for registered users.</p> <p><a href="http://www.abareplace.com/">Download the new version</a></p>Fri, 11 Nov 2011 09:00:00 +0100https://www.abareplace.com/blog/aba21/How to replace HTML tags using regular expressions<p>Strictly speaking, you cannot parse HTML using only regular expressions. The reason is explained in any computer science curriculum: HTML is <a href="http://en.wikipedia.org/wiki/Context-free_language">a context-free language</a>, but regular expressions can describe only <a href="http://en.wikipedia.org/wiki/Regular_language">regular languages</a>. So, <b>you cannot match nested tags</b> with them.</p> <p>However, regexes are really useful for quick search and replace in your web pages. Full parsing is unnecessary, because you know the HTML code that you wrote. Approaches that are “impure” from a theoretical point of view work extremely well in this setting. You can even simplify the regexes shown below: say, if you never insert newlines between <code>a</code> and <code>href</code>, then you need not allow for them in your regular expression.</p> <h3>Match an HTML tag</h3> <pre><b>&lt;a</b><span style="color:#D2691E">\s</span><span style="color:blue">(</span><span style="color:olive">.*?</span><span style="color:blue">)</span><b>&gt;</b><span style="color:blue">(</span><span style="color:olive">.*?</span><span style="color:blue">)</span><b>&lt;/a&gt;</b></pre> <p>This regex matches an &lt;a&gt; tag with any attributes. 
If you break it into parts:</p> <ul> <li><code>\s</code> matches a space or a newline after <code>a</code>;</li> <li><code>.*?</code> matches any text to the next closing angle bracket <code>&gt;</code>;</li> <li>another <code>.*?</code> matches any text inside the tag.</li> </ul> <p>Parentheses are used to capture the attributes and the text inside the tag. You can then refer to them using <code>\1</code> and <code>\2</code> in the replacement. For example, you can <b>remove all links:</b></p> <div style="text-align:center"><img src="/blog_html_tags1.png" alt="Search for &lt;a\s(.*?)&gt;(.*?)&lt;/a&gt; and replace with \2" width="658" height="409"></div> <p>As mentioned above, the regex will not correctly match nested <code>&lt;a&gt;</code> tags; it just finds the next closing tag of the same type. But in this case, it's not important, because nested <code>&lt;a&gt;</code> tags make little sense :)</p> <h3>Match an opening HTML tag with some attribute</h3> <pre><b>&lt;a</b><span style="color:#D2691E">\s</span><span style="color:blue">(</span><span style="color:#D2691E">[^&gt;]</span><span style="color:olive">*</span><span style="color:#D2691E">\s</span><span style="color:blue">)</span><span style="color:olive">?</span><b>href=&quot;</b><span style="color:blue">(</span><span style="color:olive">.*?</span><span style="color:blue">)</span><b>&quot;</b><span style="color:blue">(</span><span style="color:olive">.*?</span><span style="color:blue">)</span><b>&gt;</b></pre> <p>This regex matches an opening &lt;a&gt; tag with an <code>href</code> attribute. The differences from the previous example are:</p> <ul> <li><code>[^&gt;]*</code> matches anything except the closing angle bracket <code>&gt;</code> (so it skips any attributes before <code>href</code>);</li> <li>the question mark <code>?</code> makes the other attributes optional, so <code>href</code> can immediately follow <code>a</code>.</li> </ul> <p>This regex is simple enough and works in most cases. 
The HTML standard allows spaces around <code>=</code> and single quotes instead of double quotes in attribute values. If you need to match such tags, you need a more complicated regex:</p> <pre><b>&lt;a</b><span style="color:#D2691E">\s</span><span style="color:blue">(</span><span style="color:#D2691E">[^&gt;]</span><span style="color:olive">*</span><span style="color:#D2691E">\s</span><span style="color:blue">)</span><span style="color:olive">?</span><b>href</b><span style="color:#D2691E">\s</span><span style="color:olive">*</span><b>=</b><span style="color:#D2691E">\s</span><span style="color:olive">*</span><span style="color:blue">(</span><span style="color:#D2691E">[&quot;']</span><span style="color:blue">)</span><span style="color:blue">(</span><span style="color:olive">.*?</span><span style="color:blue">)</span><span style="color:blue">\2</span><span style="color:blue">(</span><span style="color:olive">.*?</span><span style="color:blue">)</span><b>&gt;</b></pre> <p>But simpler regexes usually suffice. Here is how you can <b>replace absolute links with relative ones:</b></p> <div style="text-align:center"><img src="/blog_html_tags2.png" alt="Search for &lt;a\s([^&gt;]*\s)?href=&quot;http://www.abareplace.com(.*?)&quot; and replace with &lt;a \1 href=&quot;\2&quot;" width="658" height="370"></div> <p>I hope that this short tutorial convinced you of the power of regular expressions :)</p> <p>See also: <a href="/docs/regExprElements.php">Regular expression reference</a></p>Thu, 03 Nov 2011 09:00:00 +0100https://www.abareplace.com/blog/html_tags/Video trailer for Aba<p>Softoxi, an independent software site, published <a href="http://www.softoxi.com/aba-search--replace.html">an original review</a> of Aba Search and Replace. 
They even shot <b>a video showing major features.</b></p> <div style="text-align:center"><a href="http://www.softoxi.com/aba-search--replace-video-trailer-screenshots.html"><img src="/blog_softoxi.jpg" width="502" height="378" alt="Aba Search and Replace video review" style="border:0"></a></div> <p>Another popular site, Softpedia, <a href="http://www.softpedia.com/progClean/Aba-Search-and-Replace-Clean-90327.html">granted the “100% clean” award to Aba,</a> which means it does not contain any form of spyware or viruses.</p>Thu, 27 Oct 2011 10:00:00 +0200https://www.abareplace.com/blog/softoxi/Aba 2.0 released<div style="float:left; padding: 10px 40px 0 0"><img src="/aba2.png" width="375" height="318" alt="Aba 2.0 screenshot"></div> <p>After a month of beta testing, I released Aba 2.0. <b>The new features</b> in this version include:</p> <ul><li>Added syntax highlighting for the context viewer and for regular expressions.</li> <li>Implemented search history and favorites.</li> <li>Now you can undo a replacement even if you have started another search or closed the program (undo information is saved in a dedicated folder).</li> <li>An editor or a viewer can be called for the selected file.</li> <li>When a search or replacement is finished, the program notifies you by playing a sound.</li> <li>Visual styles and the Windows 7 taskbar are now supported.</li> <li>When you edit the <i>Replace with</i> field, the search is not restarted (except when needed).</li> <li>Non-greedy matches are faster than they were in the previous version.</li> </ul> <p>Many <b>thanks to the beta testers:</b> Kyle Alons, Massimiliano Tiraboschi, and JJS. Without your help, I would never have found some tricky bugs :)</p> <p>Unfortunately, the German and Italian translations are still unfinished, but I'm waiting for a response from our translators.</p> <div style="clear:both">&nbsp;</div>Tue, 25 Oct 2011 10:00:00 +0200https://www.abareplace.com/blog/aba20/