Anonymizing a dataset by replacing names with counters

11 Jan 2025

Sometimes, you need to remove personal data from a dataset, such as when preparing examples or unit tests. With Aba Search and Replace, you can mask names, addresses, and other personally identifiable information by replacing them with counters.

Let's use the following CSV file with information about Alice in Wonderland characters as an example:

Name,Address,Favorite Color
Alice,Near the Rabbit Hole,Blue
Mad Hatter,Tea Party Garden,Orange
White Rabbit,Rabbit Hole,White
Queen of Hearts,Hearts Castle,Red
Cheshire Cat,Forest Tree Hollow,Purple
Caterpillar,Mushroom Grove,Green
Tweedledee,Looking Glass Land,Yellow
Tweedledum,Looking Glass Land,Yellow
March Hare,Mad Tea Party Estate,Brown
Dormouse,Tea Party Garden,Gray

You want to remove real names and addresses from this file. A common approach would be to write a script that opens the file, reads each line, replaces the first two fields with counters, and then prints the result. However, it's easier to do the same task with Aba Search and Replace. You don't have to write boilerplate code for file reading, and you can immediately preview the replacement results.
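For comparison, here is a minimal sketch of what such a script might look like in Python. It is only an illustration of the boilerplate involved, not part of Aba: the `anonymize` function, the shortened `SAMPLE` data, and the `address1`, `address2` placeholders are all made up for this example.

```python
import csv
import io

# A shortened copy of the example file (first three characters only).
SAMPLE = """Name,Address,Favorite Color
Alice,Near the Rabbit Hole,Blue
Mad Hatter,Tea Party Garden,Orange
White Rabbit,Rabbit Hole,White
"""


def anonymize(text: str) -> str:
    """Replace the first two fields of every data row with counters."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")

    # Copy the header row unchanged.
    writer.writerow(next(reader))

    # Number the data rows starting from 1.
    for n, row in enumerate(reader, start=1):
        row[0] = f"person{n}"
        row[1] = f"address{n}"  # hypothetical placeholder for the address
        writer.writerow(row)
    return out.getvalue()


print(anonymize(SAMPLE), end="")
```

Even for this small task, most of the lines deal with reading and writing rather than with the replacement itself.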

We'll use the following regular expression to match the first two columns in the CSV file while skipping the headers:

(?<=\n)(\N+?),(\N+?),

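Here's how it works: first, the lookbehind assertion (?<=\n) checks that a newline precedes the match, which skips the headers because nothing precedes the first line. Next, we match two fields, each followed by a comma. \N matches any character except a newline, and the lazy quantifier +? makes each group stop at the first comma it meets.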
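If you want to try the pattern outside Aba, the sketch below runs an equivalent one in Python. One assumption to note: Python's `re` module does not support `\N` as "any character except a newline", so `[^\n]` stands in for it here. The `SAMPLE` string is a shortened copy of the example file.

```python
import re

SAMPLE = (
    "Name,Address,Favorite Color\n"
    "Alice,Near the Rabbit Hole,Blue\n"
    "Mad Hatter,Tea Party Garden,Orange\n"
)

# (?<=\n)    the match must start right after a newline, so line 1 is skipped
# ([^\n]+?), first field: as few non-newline characters as possible, then a comma
# ([^\n]+?), second field, matched the same way
PATTERN = re.compile(r"(?<=\n)([^\n]+?),([^\n]+?),")

matches = PATTERN.findall(SAMPLE)
for name, address in matches:
    print(f"name={name!r} address={address!r}")
```

The header line never shows up in the results, because no newline comes before it.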

We would like to replace the names (Alice, Mad Hatter, White Rabbit, etc.) with counters such as person1, person2, and person3. Aba provides functions for inserting counters; Aba.matchNo works well for this case.

For the address field, we don't want to use the same sequence (1, 2, 3), so let's do some math with the counter: the street numbers start at 77 and decrease by 3 with each match (77, 74, 71, and so on). The replacement expression becomes:

person\{ Aba.matchNo() },\{ 80 - Aba.matchNo() * 3 } Wonderland Drive,
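To check the arithmetic, here is a rough Python imitation of this replacement. It is a sketch and does not show how Aba works internally: the `anonymize` function and the `itertools.count` counter are stand-ins for Aba.matchNo(), which is assumed to start at 1 (as the person1 and 77 values above imply), and `[^\n]` replaces `\N` because Python's `re` module does not support it.

```python
import itertools
import re

SAMPLE = (
    "Name,Address,Favorite Color\n"
    "Alice,Near the Rabbit Hole,Blue\n"
    "Mad Hatter,Tea Party Garden,Orange\n"
    "White Rabbit,Rabbit Hole,White\n"
)

PATTERN = re.compile(r"(?<=\n)([^\n]+?),([^\n]+?),")


def anonymize(text: str) -> str:
    # Stand-in for Aba.matchNo(): 1 for the first match, 2 for the second, ...
    counter = itertools.count(1)

    def replacement(match: re.Match) -> str:
        n = next(counter)
        # Same arithmetic as the Aba expression: 80 - n * 3 gives 77, 74, 71, ...
        return f"person{n},{80 - n * 3} Wonderland Drive,"

    return PATTERN.sub(replacement, text)


print(anonymize(SAMPLE), end="")
```

With the full ten-row file, the last street number would be 80 - 10 * 3 = 50, so the numbers stay positive.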

Note that proper anonymization is more complex than this. In our example, it's still possible to identify some characters after the replacement. For example, White Rabbit predictably likes white, Queen of Hearts likes red ❤️, and the twins (Tweedledee and Tweedledum) share both an address and a favorite color, yellow. This replacement alone therefore won't meet GDPR requirements, and you need further manual edits to remove or randomize such cases. Still, it is a good first step for removing sensitive information.
