This week at work I had to find a way to replace all values of certain columns in a PostgreSQL database with randomly selected values. I knew I had to do that for several columns of different natures (names, city names, phone numbers, etc), so each column would have a different list of possible values. It would also be great if I could easily switch between lists depending on the value of another column in the same table - for example, to use a list of city names from the country stored in another column of the same row.
As you may be guessing, the idea is to anonymise all columns containing personal information, replacing them with realistic-looking but fake values. Since that's potentially a lot of data, doing it outside the database wasn't a great option - I wanted this anonymisation to be a very quick process. Here are my findings about how to do that.
From now on we use, as an example, a list of values represented as an array containing the strings "1" and "2". In a real scenario such an array could contain hundreds of items or more.
This was my first try:
SELECT unnest(ARRAY['1', '2']) ORDER BY RANDOM() LIMIT 1
In the query above, the unnest function expands the array into a set of rows, which are then shuffled with ORDER BY RANDOM(). Finally we pick the first row with LIMIT 1.
One important attribute of this approach is that it is a self-contained expression; that is, we can put it between parentheses and use it in different contexts. This aspect makes it easier to write code that generates this kind of SQL.
Even though the query above works, it can be very slow for large arrays. The problem is the ORDER BY RANDOM() clause, which has to sort the whole set of rows. So after some thought I realised I could use OFFSET + RANDOM:
SELECT unnest(ARRAY['1', '2']) OFFSET FLOOR(RANDOM() * $number_of_items) LIMIT 1
($number_of_items should be interpolated when you generate the query - we have $number_of_items = 2 in the query above)
It turns out that this query is much more efficient since it doesn't have to sort the items before picking the first.
Then I asked myself: can I get rid of the unnest? Yes! Items of PostgreSQL arrays can be accessed individually via a subscript (like in most languages, except that PostgreSQL arrays are 1-based - hence the + 1 below). Here is the "final" version:
SELECT (ARRAY['1', '2'])[floor(RANDOM() * $number_of_items) + 1]
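Since this is still a self-contained expression, it also composes nicely with the "different list per row" idea from the introduction. As a sketch (the country column, table name and city lists below are hypothetical), a CASE expression can pick the list based on another column of the same row:

SELECT CASE country
  WHEN 'PT' THEN (ARRAY['Lisboa', 'Porto'])[floor(RANDOM() * 2) + 1]
  WHEN 'BR' THEN (ARRAY['São Paulo', 'Recife'])[floor(RANDOM() * 2) + 1]
END
FROM some_table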
I did some simple benchmarking and this last version seems about 25% faster than using unnest + OFFSET. I ran this benchmark on a crappy computer without much rigour, so please do your own testing if you care about it.
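If you want to reproduce a rough benchmark yourself, one simple way (a sketch, not necessarily what I originally ran) is to evaluate the expression many times with generate_series and compare the timings reported by psql's \timing:

\timing on
SELECT (ARRAY['1', '2'])[floor(RANDOM() * 2) + 1] FROM generate_series(1, 1000000);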
Problem: Subquery caching
Remember the need to update a column of an entire table? This is how I first tried to use the query above:
UPDATE table SET column = (SELECT (ARRAY['1', '2'])[floor(RANDOM() * 2) + 1])
It turns out that PostgreSQL evaluates that uncorrelated subquery just once and caches the result, so all rows ended up with the same value, randomly picked a single time.
After some digging I found this excellent blog post with a fix to the problem:
UPDATE table SET column = ( SELECT (ARRAY['1', '2'])[floor(RANDOM() * 2) + 1] WHERE table = table )
The reference to the outer table makes the subquery correlated with each row, so its result isn't cached and the UPDATE works as expected 😊
Don't forget to write an automated test to make sure the UPDATE works and continues to work as time passes. This is particularly important here because of the WHERE clause we had to add to the subquery - it looks a bit off at first glance, so someone without the context of this code may feel tempted to remove it in the future. Such a change won't make the UPDATE crash; it will fail silently, and that's the worst kind of bug.
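As a sketch of such a test (table and column names below are hypothetical), a query like this should report true after anonymising a reasonably sized table - and if the subquery caching ever comes back, it will report false, since every row will hold the same value:

SELECT COUNT(DISTINCT anonymised_column) > 1 FROM some_table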
Bonus: UPDATE is too slow
For very large tables, or medium-sized tables with a lot of indexes, running an UPDATE without a WHERE clause can take a long time. In such cases, the fact that the random selection of values is a self-contained expression saves the day. We can easily rewrite the UPDATE as the following procedure:
- Clone the table structure as a new table.
- Insert into the new table from a SELECT on the original table, replacing the column you want to anonymise with the anonymisation expression.
- Rename both tables so the new table sits in place of the original table.
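As a sketch of those three steps (with a hypothetical users table whose name column is being anonymised):

-- 1. Clone the table structure (you may prefer to create indexes after the insert)
CREATE TABLE users_copy (LIKE users INCLUDING ALL);
-- 2. Copy the data, anonymising the column on the way
INSERT INTO users_copy (id, name)
  SELECT id, (ARRAY['1', '2'])[floor(RANDOM() * 2) + 1]
  FROM users;
-- 3. Swap the tables
ALTER TABLE users RENAME TO users_old;
ALTER TABLE users_copy RENAME TO users;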
In my experience, this procedure can drastically reduce the amount of time it takes to update a column in certain scenarios. It is also very useful for very large data migrations. In a future post I intend to share helpers that simplify doing such operations.