
Check VAT ID with regular expressions and VIES

31 Jul 2022

In the European Union, each business registered for VAT has a unique identification number like IE9825613N or LU20260743. When selling to EU companies, you need to ask for their VAT ID, validate it, and include it in the invoice. The tax rate depends on the place of taxation and the client type (a person or a company). Some customers may provide a wrong VAT ID, either by mistake or in an attempt to avoid paying the tax. So it's important to check the VAT number.

The EU provides the VIES page (VAT Information Exchange System) and a free SOAP API for VAT ID validation. Here is how you can query the API:

Linux / macOS:

curl -d '<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:urn="urn:ec.europa.eu:taxud:vies:services:checkVat:types"><soapenv:Body><urn:checkVat><urn:countryCode>IE</urn:countryCode><urn:vatNumber>9825613N</urn:vatNumber></urn:checkVat></soapenv:Body></soapenv:Envelope>' 'https://ec.europa.eu/taxation_customs/vies/services/checkVatService'

Windows:

(iwr 'https://ec.europa.eu/taxation_customs/vies/services/checkVatService' -method post -body '<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:urn="urn:ec.europa.eu:taxud:vies:services:checkVat:types"><soapenv:Body><urn:checkVat><urn:countryCode>IE</urn:countryCode><urn:vatNumber>9825613N</urn:vatNumber></urn:checkVat></soapenv:Body></soapenv:Envelope>').content

If the VAT number is invalid, you will get <valid>false</valid> in the response. You can use a SOAP library or just concatenate the XML string with the VAT identification number. In the latter case, you should first check the VAT number with a regular expression; otherwise, an attacker can inject arbitrary XML code into it. The VIES WSDL file provides these regular expressions:

Country code: [A-Z]{2}
VAT ID without the country code: [0-9A-Za-z\+\*\.]{2,12}

The country code consists of two capital letters; the VAT ID itself is from 2 to 12 letters, digits, or these characters: + * .
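Here is a minimal sketch of that pre-check on its own, without the network call. The function name split_vat_id is made up for this illustration; the pattern is simply the two VIES expressions joined together.

```python
import re

# The two VIES regular expressions quoted above, joined into one pattern.
VAT_ID_RE = re.compile(r'([A-Z]{2})([0-9A-Za-z+*.]{2,12})')

def split_vat_id(vat_id):
    """Return (country_code, number), or None if the format is not acceptable.

    split_vat_id is a hypothetical helper name used only in this sketch.
    """
    m = VAT_ID_RE.fullmatch(vat_id.replace(' ', ''))
    return (m.group(1), m.group(2)) if m else None

print(split_vat_id('IE 9825613N'))          # ('IE', '9825613N')
print(split_vat_id('IE1</urn:vatNumber>'))  # None
```

Anything containing angle brackets, quotes, or slashes fails the check, so it never reaches the XML string.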

So the finished code for VAT ID validation could look like this:

import re, urllib.request, xml.etree.ElementTree as XmlElementTree

# Return a dictionary with some information about the company, or False if the vat_id is invalid
def check_vat_id(vat_id):
  m = re.match(r'^([A-Z]{2})([0-9A-Za-z\+\*\.]{2,12})$', vat_id.replace(' ', ''))
  if not m:
    return False

  data = '<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" ' + \
         'xmlns:urn="urn:ec.europa.eu:taxud:vies:services:checkVat:types">' + \
         '<soapenv:Body><urn:checkVat><urn:countryCode>' + m.group(1) + '</urn:countryCode>' + \
         '<urn:vatNumber>' + m.group(2) + '</urn:vatNumber></urn:checkVat></soapenv:Body></soapenv:Envelope>'
         
  with urllib.request.urlopen('https://ec.europa.eu/taxation_customs/vies/services/checkVatService', data.encode('ascii')) as response:
    resp = response.read().decode('utf-8')
        
    ns = {
       'soap': 'http://schemas.xmlsoap.org/soap/envelope/',
       'checkVat': 'urn:ec.europa.eu:taxud:vies:services:checkVat:types',
    }
    
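    checkVatResponse = XmlElementTree.fromstring(resp).find('./soap:Body/checkVat:checkVatResponse', ns)
    # If the service answers with something else (e.g. a SOAP fault), the element is missing
    if checkVatResponse is None:
       raise RuntimeError('Unexpected VIES response: ' + resp)
    if checkVatResponse.find('./checkVat:valid', ns).text != 'true':
       return False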
    
    res = {}
    for child in checkVatResponse:
       res[child.tag.replace('{urn:ec.europa.eu:taxud:vies:services:checkVat:types}', '')] = child.text
    return res
    

print(check_vat_id('IE9825613N'))

Each EU country also has its own rules for a VAT identification number, so you can do a stricter pre-check with a complex regular expression, but VIES already covers this for you. Also note that some payment processors (e.g. Stripe) already do a VIES query under the hood.

 

Which special characters must be escaped in regular expressions?

8 Jan 2022

In most regular expression engines (PCRE, JavaScript, Python, Go, and Java), these special characters must be escaped outside of character classes:

[ * + ? { . ( ) ^ $ | \

If you want to find one of these metacharacters literally, please add \ before it. For example, to find the text $100, use \$100. If you want to find the backslash itself, double it: \\.
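A quick way to see the difference is to try both forms. This is a small Python sketch; the sample strings are made up for the illustration.

```python
import re

text = 'Total: $100'

# Unescaped, $ is an end-of-string anchor, so this pattern cannot match.
unescaped = re.search(r'$100', text)
print(unescaped)                 # None

# Escaped, \$ matches a literal dollar sign.
escaped = re.search(r'\$100', text)
print(escaped.group())           # $100

# A doubled backslash in the pattern matches one literal backslash.
backslash = re.search(r'C:\\Temp', r'C:\Temp')
print(backslash.group())         # C:\Temp
```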

Inside character classes [square brackets], you must escape the following characters:

\ ] -

For example, to find an opening or a closing bracket, use [[\]].

If you need to include the dash into a character class, you can make it the first or the last character instead of escaping it. Use [a-z-] or [a-z\-] to find a Latin letter or a dash.

If you need to include the caret ^ into a character class, it cannot be the first character; otherwise, it will be interpreted as any character except the specified ones. For example: [^aeiouy] means "any character except vowels", while [a^eiouy] means "any vowel or a caret". Alternatively, you can escape the caret: [\^aeiouy]
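The character class rules above can be checked with a few one-liners. This is a Python sketch with made-up sample strings; the bracket example uses the [\][] spelling, which matches the same two characters as [[\]].

```python
import re

# A closing or an opening square bracket: ] is escaped inside the class.
brackets = re.findall(r'[\][]', 'a[0]')
print(brackets)          # ['[', ']']

# A dash is literal when it is the last character of the class.
dash = re.findall(r'[a-z-]', 'x-1')
print(dash)              # ['x', '-']

# A leading caret negates the class; elsewhere (or escaped) it is literal.
negated = re.findall(r'[^aeiouy]', 'bee^')
literal = re.findall(r'[a^eiouy]', 'bee^')
escaped = re.findall(r'[\^aeiouy]', 'bee^')
print(negated)           # ['b', '^']
print(literal)           # ['e', 'e', '^']
print(escaped)           # ['e', 'e', '^']
```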

JavaScript

In JavaScript, you also need to escape the slash / in regular expression literals:

/AC\/DC/.test('AC/DC')

Lone closing brackets ] and } are allowed by default, but if you use the 'u' flag, then you must escape them:

/]}/.test(']}') // true
/]}/u.test(']}') // throws an exception

This behavior is specific to JavaScript; lone closing brackets are allowed in other languages.

If you create a regular expression on the fly from a user-supplied string, you can use the following function to properly escape the special characters:

function escapeRe(str) {
    return str.replace(/[[\]*+?{}.()^$|\\-]/g, '\\$&');
}

var re = new RegExp(escapeRe(start) + '.*?' + escapeRe(end));

PHP

In PHP, you have the preg_quote function to insert a user-supplied string into a regular expression pattern. In addition to the characters listed above, it also escapes # (in 7.3.0 and higher), the NUL character, and the following characters: = ! < > : -, which do not have a special meaning in PCRE regular expressions but are sometimes used as delimiters. Closing brackets ] and } are escaped, too, which is unnecessary:

preg_match('/]}/', ']}'); // returns 1

Just like in JavaScript, you also need to escape the delimiter, which is usually /, but you can use another special character such as # or = if the slash appears inside your pattern:

if (preg_match('/\/posts\/([0-9]+)/', $path, $matches)) {
}

// Can be simplified to:
if (preg_match('#/posts/([0-9]+)#', $path, $matches)) {
}

Note that preg_quote does not escape the tilde ~ and the slash / by default. If you construct regexes from strings and use one of them as the delimiter, pass it as the second argument of preg_quote or choose another delimiter.

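In double-quoted strings, \1 and $ are interpreted differently than in regular expressions, so the best practice is to write patterns in single quotes. Even there, you need four backslashes to match one literal backslash: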

$text = 'C:\\Program files\\';
echo $text;
if (preg_match('/C:\\\\Program files\\\\/', $text, $matches)) {
   print_r($matches);
}

Python

Python has a raw string syntax (r''), which conveniently avoids the backslash escaping idiosyncrasies of PHP:

import re
re.match(r'C:\\Program files/Tools', 'C:\\Program files/Tools')

You only need to escape the quote in raw strings:

re.match(r'\'', "'")
re.match(r"'", "'") # or just use double quotes if you have a regex with a single quote

re.match(r"\"", '"')
re.match(r'"', '"') # or use single quotes if you have a regex with a double quote

re.match(r'"\'', '"\'') # multiple quote types; cannot avoid escaping them

A raw string literal cannot end with a single backslash, but this is not a problem for a valid regular expression.

To match a literal ] inside a character class, you can make it the first character: [][] matches a closing or an opening bracket. Python and Aba Search & Replace support this syntax, but some other engines do not; in JavaScript, for example, [] is an empty character class. You can also quote the ] character with a backslash, which works in all languages: [\][] or [[\]].

For inserting a string into a regular expression, Python offers the re.escape method. Unlike JavaScript with the u flag, Python tolerates escaping non-special punctuation characters, so this function also escapes -, #, &, and ~:

print(re.escape(r'-#&~')) # prints \-\#\&\~
re.match(r'\@\~', '@~') # matches
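As a sketch of how re.escape is typically used, here is a Python counterpart of building a regex from two user-supplied delimiters. The helper name between and the sample text are made up for this example.

```python
import re

def between(start, end, text):
    """Return the shortest text between two literal delimiters, or None.

    between is a hypothetical helper; both delimiters are escaped so that
    characters like ( * [ lose their special meaning.
    """
    pattern = re.escape(start) + '(.*?)' + re.escape(end)
    m = re.search(pattern, text)
    return m.group(1) if m else None

print(repr(between('(*', '*)', 'x = 1 (* a comment *) y = 2')))   # ' a comment '
```

Without re.escape, the delimiter (* would be an invalid pattern, because * has nothing to repeat after the opening parenthesis.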

Java

Java allows escaping non-special punctuation characters, too:

Assert.assertTrue(Pattern.matches("\\@\\}\\] }]", "@}] }]"));

Similarly to PHP, you need to repeat the backslash character 4 times, but in Java, you also must double the backslash character when escaping other characters:

Assert.assertTrue(Pattern.matches("C:\\\\Program files \\(x86\\)\\\\", "C:\\Program files (x86)\\"));

This is because the backslash must be escaped in a Java string literal, so if you want to pass \\ \[ to the regular expression engine, you need to double each backslash: "\\\\ \\[". There are no raw string literals in Java, so regular expressions are just usual strings.

There is the Pattern.quote method for inserting a string into a regular expression. It surrounds the string with \Q and \E, which escapes multiple characters in Java regexes (borrowed from Perl). If the string contains \E, it will be escaped with the backslash \:

Assert.assertEquals("\\Q()\\E",
      Pattern.quote("()"));

Assert.assertEquals("\\Q\\E\\\\E\\Q\\E",
      Pattern.quote("\\E"));

Assert.assertEquals("\\Q(\\E\\\\E\\Q)\\E",
      Pattern.quote("(\\E)"));

The \Q...\E syntax is another way to escape multiple special characters at once. Besides Java, it's supported in PHP/PCRE and Go regular expressions, but not in Python or JavaScript.
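To illustrate the Python side, here is a small sketch, assuming Python 3.7 or later: the re module rejects \Q as an unknown escape, and re.escape does the same job one character at a time. The sample strings are made up.

```python
import re

literal = '(x86)|'

# The re module does not know \Q...\E and raises an error for it.
try:
    re.compile(r'\Q' + literal + r'\E')
    supported = True
except re.error:
    supported = False
print(supported)   # False

# re.escape quotes every special character of the literal part instead.
pattern = re.compile(re.escape(literal))
found = pattern.search('Program files (x86)|')
print(found.group())   # (x86)|
```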

Go

Go raw string literals are characters between back quotes: `\(`. It's preferable to use them for regular expressions because you don't need to double-escape the backslash:

r := regexp.MustCompile(`\(text\)`)
fmt.Println(r.FindString("(text)"))

A back quote cannot be used in a raw string literal, so you have to resort to the usual "`" string syntax for it. But this is a rare character.

The \Q...\E syntax is supported, too:

r := regexp.MustCompile(`\Q||\E`)
fmt.Println(r.FindString("||"))

There is a regexp.QuoteMeta function for inserting strings into a regular expression. In addition to the characters listed above, it also escapes closing brackets ] and }.

 

Aba 2.4 released

5 Sep 2021

The new version adds support for 4K monitors and very long paths (longer than 260 characters). I also fixed the annoying sound when typing a regular expression. Thanks to Pouemes, we now have a French translation. The upgrade is free for the registered users; your settings and previous searches are fully preserved.

 

Privacy Policy Update - April 2021

17 Apr 2021

Updated our privacy policy:

Future updates to the privacy policy will be published in this blog; we are required by law to inform you in case of any changes. Thank you and have a nice day!

 

Review of Aba Search and Replace with video

20 Apr 2012

FindMySoft, a software download directory, published a Quick Look Video showcasing the interface and features of Aba Search and Replace 2.2. You can watch the video and read a review of my program on their site.

Find My Soft review and video

Some clarifications to their review. Aba cannot search and replace in MS Word documents (yet!). About being “too simple for advanced users”: Aba’s support for regular expressions includes variable-length lookbehind and Unicode character classes; most search-and-replace tools lack these features. Generally, I try to keep the interface clean and less cluttered than competitors while adding advanced features for power users.

 

Aba 2.2 released

3 Jan 2012

The new version adds lookaround and braces in regular expressions. I also implemented the \b anchor and non-capturing groups.

Several bugs were fixed, including incorrect PHP syntax highlighting and crashes when processing invalid UTF-8 or when changing a long replacement.

Many thanks to Stefan Schuck for updating the German translation. I'm looking for people who can translate Aba into other languages (especially French and Spanish).

Download the new version

 

Discount on Aba Search and Replace

20 Dec 2011

I would like to offer everybody a 15% discount on Aba Search and Replace until the end of January. Please use the coupon code:

Happy2012

Future upgrades will be free for registered users. Thank you for using Aba. Happy Holidays and best wishes for the New Year!

 

Using search and replace to rename a method

5 Dec 2011

It's easy to rename a method using Aba Search and Replace:

Replace GetFileSize with GetSize, addslashes with sqlite_escape_string.

The name of a method, GetFileSize, collided with the name of a Win32 API function, so I wanted to replace it with GetSize. The latter is also shorter and avoids the tautology: file.GetFileSize.

To find the references to my method, not to the Win32 API function, I added a dot (and -> in C++): .GetFileSize

In PHP code, I replaced all calls to addslashes with sqlite_escape_string when porting my site to SQLite. The two functions escape quotes differently; addslashes should never be used with SQLite.

 

Cleaning the output of a converter

18 Nov 2011

When I worked at a small web design company, we often had clients bringing us an MS Word, Excel, or PDF file that had to be published on the web. Not as a downloadable file, but as a web page integrated into their site.

Microsoft Word certainly can save files in HTML, but the resulting code was bloated and different from our design. What we needed was simple HTML that our designer could edit and style. How could Aba S&R help us?

Here is a DOC file saved in HTML:

<h3 align=center style='text-align:center;'><b><span style='font-size:10.0pt;font-family:"Arial";'>Lorem ipsum</span></b></h3>

<p class=Normal align=justify style='text-indent:14.0pt;text-align:justify;'><span style='font-family:"Times New Roman";'>Lorem ipsum dolor sit <i>amet,</i> consectetur adipisicing elit.</span></p>

We need to remove all attributes and <span> tags:

<h3><b>Lorem ipsum</b></h3>

<p>Lorem ipsum dolor sit <i>amet,</i> consectetur adipisicing elit.</p>

The following replacements can be used:

Search for: <(p|h1|h2|h3) [^>]*>
Replace with: <\1>

Search for: <span [^>]*>
Replace with: (nothing)

Search for: </span>
Replace with: (nothing)

[^>]* matches everything up to the next closing angle bracket >, and \1 means the text inside the first parentheses (in our case, the tag name).
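For readers who want to try the same replacements outside Aba, here is a sketch in Python. The function name clean_word_html is made up for this example; the three patterns are the ones listed above, applied in the same order.

```python
import re

def clean_word_html(html):
    """Apply the three replacements from the text, in order.

    clean_word_html is a hypothetical name used only in this sketch.
    """
    html = re.sub(r'<(p|h1|h2|h3) [^>]*>', r'<\1>', html)  # drop attributes
    html = re.sub(r'<span [^>]*>', '', html)               # drop opening <span>
    return re.sub(r'</span>', '', html)                    # drop closing </span>

sample = ("<h3 align=center style='text-align:center;'><b>"
          "<span style='font-size:10.0pt;font-family:\"Arial\";'>"
          "Lorem ipsum</span></b></h3>")
print(clean_word_html(sample))   # <h3><b>Lorem ipsum</b></h3>
```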

Remove HTML attributes with regular expressions

I often used Aba to clean the output of a converter. For one client, I had to convert dozens of PDF files with technical specifications to HTML. There was a lot of formatting (subscripts, superscripts, tables), so I could not simply copy and paste it. There were also errors, for example, the letter O instead of zero in subscripts. Without Aba, I would not have been able to clean up this mess.

Is it a bad practice?

Two redditors criticized my previous post about using regular expressions to replace HTML tags.

I fully agree that regexes should never be used to parse arbitrary HTML code, for example, HTML code entered by a user. Never do this in your scripts; it's unreliable and insecure.

But what if you need to replace all relative links (/blog/) in your own code with absolute links (http://www.example.com/blog/) because you are moving some parts of your site to a subdomain (http://myproduct.example.com)? Would you craft a script that parses your HTML code (carefully skipping <?php tags; Python's HTMLParser cannot do that), searches for all <a> tags with the href attribute, replaces the links, and saves the result to a file?

Or would you toss off a regex in a search-and-replace tool?

Would you write 43 lines of Python code or a one-line regex for an ad-hoc replacement?

 

Aba 2.1 released

11 Nov 2011

The new version fixes some bugs like incorrectly displayed date/time and adds the File menu for viewing/editing a file or copying the results list into clipboard.

File menu

Just as always, the upgrade is free for registered users.

Download the new version

 

How to replace HTML tags using regular expressions

3 Nov 2011

Strictly speaking, you cannot parse HTML using only regular expressions. The reason is explained in any computer science curriculum: HTML is a context-free language, but regular expressions can describe only regular languages. So, you cannot match nested tags with them.

However, regexes are really useful for quick search and replace in your web pages. Full parsing is unnecessary, because you know the HTML code that you wrote. Approaches that are “impure” from a theoretical point of view work extremely well in this setting. You can even simplify the regexes shown below: say, if you never insert newlines between a and href, then you need not allow for them in your regular expression.

Match an HTML tag

<a\s(.*?)>(.*?)</a>

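This regex matches an <a> tag with any attributes. If you break it into parts:

<a\s matches the start of the tag followed by a whitespace character;
(.*?) matches the attributes, taking as few characters as possible;
> matches the end of the opening tag;
(.*?) matches the text inside the tag;
</a> matches the closing tag.

Parentheses are used to capture the attributes and the text inside the tag. You can then refer to them using \1 and \2 in the replacement. For example, you can remove all links: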

Search for <a\s(.*?)>(.*?)</a> and replace with \2
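The same replacement can be tried in Python. This is a sketch; the function name strip_links and the sample HTML are made up for the illustration.

```python
import re

def strip_links(html):
    """Replace every <a ...>text</a> with just its text (group 2)."""
    return re.sub(r'<a\s(.*?)>(.*?)</a>', r'\2', html)

html = ('See <a href="/docs/">the docs</a> and '
        '<a class="ext" href="http://example.com/">this site</a>.')
print(strip_links(html))   # See the docs and this site.
```

Note that in Python the dot does not match a newline unless you pass re.DOTALL, so a link whose text spans several lines is left untouched by this sketch.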

As mentioned above, the regex will not correctly match nested <a> tags; it just finds the next closing tag of the same type. But in this case, it's not important, because the nested <a> tags make little sense :)

Match an opening HTML tag with some attribute

<a\s([^>]*\s)?href="(.*?)"(.*?)>

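This regex matches an opening <a> tag with the href attribute. The differences from the previous example are:

([^>]*\s)? matches optional attributes before href: any characters except >, ending with whitespace;
href="(.*?)" matches the href attribute and captures the link;
(.*?) captures any attributes after href.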

This regex is simple enough and works in most cases. The HTML standard allows spaces around = and single quotes instead of double quotes in attribute values. If you need to match such tags, you need a more complicated regex:

<a\s([^>]*\s)?href\s*=\s*(["'])(.*?)\2(.*?)>
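Here is a Python sketch of the more complicated regex in action. The sample tags are made up; with this pattern, the link itself ends up in group 3, because group 2 holds the quote character.

```python
import re

HREF_RE = re.compile(r'''<a\s([^>]*\s)?href\s*=\s*(["'])(.*?)\2(.*?)>''')

tags = ['<a href="/blog/">', "<a class='x' href = '/about/' id=y>"]
links = [HREF_RE.search(tag).group(3) for tag in tags]
print(links)   # ['/blog/', '/about/']
```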

But simpler regexes usually suffice. Here is how you can replace absolute links with relative ones:

Search for <a\s([^>]*\s)?href="http://www.abareplace.com(.*?)" and replace with <a \1 href="\2"
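In Python, a similar replacement might look like the sketch below. It is a variation rather than a literal copy: the dots in the host name are escaped, and the replacement is written as \g<1>href because group 1 already ends with whitespace. The function name make_relative is made up for this example.

```python
import re

def make_relative(html):
    """Turn absolute links to www.abareplace.com into relative ones."""
    return re.sub(r'<a\s([^>]*\s)?href="http://www\.abareplace\.com(.*?)"',
                  r'<a \g<1>href="\2"', html)

html = '<a class="nav" href="http://www.abareplace.com/blog/">Blog</a>'
print(make_relative(html))   # <a class="nav" href="/blog/">Blog</a>
```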

I hope that this short tutorial convinced you of the power of regular expressions :)

See also: Regular expression reference

 

Video trailer for Aba

27 Oct 2011

Softoxi, an independent software site, published an original review of Aba Search and Replace. They even shot a video showing the major features.

Aba Search and Replace video review

Another popular site, Softpedia, granted the “100% clean” award to Aba, which means it does not contain any form of spyware or viruses.

 

Aba 2.0 released

25 Oct 2011

Aba 2.0 screenshot

After a month of beta testing, I released Aba 2.0. The new features in this version include:

Many thanks to the beta testers: Kyle Alons, Massimiliano Tiraboschi, and JJS. Without your help, I would never have found some tricky bugs :)

Unfortunately, the German and Italian translations are still unfinished, but I'm waiting for a response from our translators.

 

 

This is a blog about Aba Search and Replace, a tool for replacing text in multiple files.