I've been familiarizing myself at work with the new regulation for handling personal data of European citizens - the General Data Protection Regulation (GDPR).
At this point I strongly believe this new regulation will change for the better the way companies handle personal data. However, it's not only business practices that will have to change: GDPR will also change how we, software engineers, handle personal data. In this post I'll try to highlight some of the ways GDPR will affect our profession and give some practical advice for engineers.
Important note: this post does not constitute legal advice.
GDPR requires the implementation of the privacy by design principles by law. Some of those principles are:
- Privacy must be a proactive effort, not reactive.
- Your defaults should respect the privacy of your user.
- Privacy must be user centric (users should have the right to restrict processing of their data, for example).
- You should protect and respect the privacy of your users during the whole lifecycle of their relationship with your business (including the right to be "forgotten").
The first three of those principles (and most of them listed here) are strongly connected to UX and business practices, such as considering privacy concerns at the early stages of software projects, not having pre-selected checkboxes granting consent to data sharing, and not making it hard or impossible for users to revoke consent.
The last two items of the list, however, have a great impact on the daily work of software engineers. Some of their practical effects are:
- Data cannot be stored indefinitely.
- Data collection and storage should be minimized to what is strictly necessary.
- Data should be secured in transit and at rest.
Now I'll cover some real world consequences of those effects.
How much effort do you put into reasoning about what kind of data you can put into a log message? It's not unusual for an engineer debugging a production issue to be extra zealous and add entire objects to their log messages. After all, the size of those log messages is hardly a problem, right?
The actual problem with that thinking is the contents of the data being logged and the append-only nature of log messages.
If those objects include personal data like names, emails, etc., then you'll be putting your users' personal data into storage that probably won't be updated and can take years to be erased (sometimes there isn't even a policy set up to erase logs, since storage is so cheap).
What if your user ended their relationship with the business? Or maybe they asked to restrict the processing of their data? Will you rewrite your logs to remove their data? That's not practical in most cases.
We need to respect our users' privacy and restrict the kind of data that we put into log messages. In most scenarios you can log some kind of identifier instead of personal data. If you really need this kind of data in a log message, then consider using some kind of encryption. Later, when you're done with the issue that made you log personal data, you can erase the encryption key, making the encrypted data unusable.
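As a rough illustration of logging an identifier instead of personal data, here is a minimal Python sketch built on the standard `logging` module. The field names in `PERSONAL_FIELDS`, the `loggable` helper and the `RedactEmails` filter are all hypothetical names made up for this example, and the email pattern is deliberately simple, so treat it as a safety net rather than a guarantee.

```python
import logging
import re

# Hypothetical set of keys this sketch treats as personal data.
PERSONAL_FIELDS = {"name", "email", "phone"}

# Deliberately simple pattern; it will not catch every valid address.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def loggable(user: dict) -> dict:
    """Return a copy of `user` without the personal fields."""
    return {k: v for k, v in user.items() if k not in PERSONAL_FIELDS}


class RedactEmails(logging.Filter):
    """Mask anything that looks like an email in the final log message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so emails passed as arguments are caught too.
        record.msg = EMAIL_RE.sub("[redacted]", record.getMessage())
        record.args = None
        return True  # keep the record, only its text was changed


logger = logging.getLogger("app")
logger.addFilter(RedactEmails())

user = {"id": 42, "name": "Ana", "email": "ana@example.com"}
logger.warning("payment failed for user %s", loggable(user))
```

The idea is that the identifier is enough to look the user up while they still exist in your database, and it means nothing once their record is gone.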
Another important aspect to consider about logging is how safe your logging infrastructure is. Are you using secure communication? Are your logs encrypted at rest? Those are important requirements now, and if you don't know the answer, you should plan some time to figure that out.
Databases and secure communication (the case of Redis)
Continuing the topic of using secure communication, are you 100% sure your apps are using some kind of secure communication to talk to your databases?
Heroku recently started a migration process to enforce SSL on all the PostgreSQL databases they manage. The "brownout" tests they are performing suggest that some customers are still connecting to their databases without SSL.
If you use Redis, are you aware that Redis versions before 6.0 do not support encryption natively? Due to its nature, I believe most Redis servers are running on trusted networks and properly firewalled away from the internet. However, if you're using a third-party service to host and manage your Redis server, are you using secure tunnels like spiped or stunnel to talk to your servers?
We are usually faced with hard problems to solve on tight deadlines, so not noticing that the database configuration we're using does not enforce or support some kind of secure connection is an honest mistake. However, with GDPR every team needs a security checklist that should include making sure all communication with databases is secure. As with log storage, this security checklist should also ensure the database files are encrypted at rest.
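One item of such a checklist could be automated with something like the sketch below, which inspects connection URLs and flags the ones that do not ask for an encrypted connection. It assumes your apps are configured through URLs, that PostgreSQL URLs carry libpq's `sslmode` parameter, and that your Redis client understands the `rediss://` scheme for TLS (redis-py does, for example). The function name is hypothetical, and a check like this only looks at client configuration; it says nothing about what the server enforces.

```python
from urllib.parse import parse_qs, urlsplit

# libpq sslmode values that refuse to fall back to an unencrypted connection.
ENCRYPTED_SSLMODES = {"require", "verify-ca", "verify-full"}


def uses_encrypted_transport(url: str) -> bool:
    """Best-effort check that a connection URL requests encryption."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme in ("postgres", "postgresql"):
        # libpq defaults to "prefer", which may silently connect without SSL.
        sslmode = parse_qs(parts.query).get("sslmode", ["prefer"])[0]
        return sslmode in ENCRYPTED_SSLMODES
    if scheme == "rediss":
        # The extra "s" asks the client to use TLS.
        return True
    # Anything else (including plain redis://) is treated as unencrypted.
    return False


for url in (
    "postgres://app:secret@db.example.com:5432/app?sslmode=require",
    "postgres://app:secret@db.example.com:5432/app",
    "redis://cache.example.com:6379/0",
):
    print(urlsplit(url).hostname, uses_encrypted_transport(url))
```

A Redis connection that goes through a local spiped or stunnel tunnel would be reported as unencrypted here, so that case needs its own entry in the checklist.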
Another important aspect of databases is the retention of their backups. How useful are they after a few weeks? Probably not very. In those cases, keeping those backups means you're keeping data from users who may have asked to be "forgotten" or to have the processing of their data limited. Also, in case of a security breach, you may end up leaking data from people who are not even users anymore.
You need to ensure you're not leaving old backups behind by implementing some kind of retention policy. Having backups expire after some days means you won't keep data from users who asked to have their data deleted. It's also important to have a plan in case you restore a backup that includes data from users who asked to have their data deleted.
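A retention policy can be as small as the sketch below. It assumes you can list your backups with their creation times and that you keep a separate record of user ids with deletion requests that survives a restore. The 14-day default and both function names are placeholders for this example, not recommendations.

```python
from datetime import datetime, timedelta, timezone


def expired_backups(backups: dict, now: datetime, retention_days: int = 14) -> list:
    """Return the names of backups created before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, created_at in backups.items() if created_at < cutoff)


def ids_to_erase_again(restored_user_ids: set, erasure_requests: set) -> set:
    """After a restore, find users who came back but had asked to be deleted."""
    return restored_user_ids & erasure_requests


now = datetime(2018, 5, 25, tzinfo=timezone.utc)
backups = {
    "db-2018-05-24.dump": now - timedelta(days=1),
    "db-2018-04-01.dump": now - timedelta(days=54),
}
print(expired_backups(backups, now))           # candidates for deletion
print(ids_to_erase_again({1, 2, 3}, {2, 9}))   # deletions to re-apply
```

Many managed database and object storage services can expire old backups for you, in which case only the second function is something you would need to own.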
Personal data on third parties like Trello, Slack, Google Docs
It's quite common to copy content or take screenshots from your apps and upload them to some kind of task management app or cloud service like Google Docs or Microsoft Office. That's a great way of describing tasks, sharing current status, building evidence that a task is done, etc.
With GDPR, the content you put into those services shouldn't contain any kind of personal data, so we must make sure we don't copy names, emails, dates of birth, etc.
My guess is that this kind of behavior will take some effort to change, so it's important that companies run training sessions to familiarize people with the new regulation and make it clear that this kind of behavior is not OK anymore.
Production data in non-production environments (staging, QA, development)
Having production data in non-production environments like staging, QA and especially development should be avoided.
Most teams won't allow production data into development environments given how easily that can become an issue, but you should also anonymize or pseudonymize personal data in staging and QA environments.
It's quite common to have some kind of staging environment as a clone of production with databases that are restored periodically from production snapshots. During those restores, changing email addresses is usually the main (and sometimes only) anonymization performed, and that's due to the risk of such an environment sending emails to actual users. With GDPR you should either use seed data or consider replacing all personal data with randomly generated data.
I recently wrote about how to efficiently update columns with randomly selected values from a list. That's a useful technique for scrubbing away personal data from medium-sized or small databases.
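To give an idea of what scrubbing can look like, here is a small sketch using Python's built-in `sqlite3` module. It is not the technique from the post mentioned above, just a plain row-by-row version of the same idea. The `users` table, its columns and the `FAKE_NAMES` list are assumptions made up for this example.

```python
import random
import sqlite3

# Hypothetical replacement values; a real list would be much longer.
FAKE_NAMES = ["Alex Doe", "Sam Roe", "Kim Poe", "Jo Moe"]


def scrub_users(conn: sqlite3.Connection, rng: random.Random) -> int:
    """Overwrite name and email of every row in `users`; return the row count."""
    ids = [row[0] for row in conn.execute("SELECT id FROM users ORDER BY id")]
    conn.executemany(
        "UPDATE users SET name = ?, email = ? WHERE id = ?",
        # .invalid is a reserved TLD, so these addresses can never reach anyone.
        [(rng.choice(FAKE_NAMES), f"user{uid}@example.invalid", uid) for uid in ids],
    )
    conn.commit()
    return len(ids)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users (name, email) VALUES ('Ana Silva', 'ana@real-mail.com')")
scrub_users(conn, random.Random())
print(conn.execute("SELECT name, email FROM users").fetchall())
```

Building the email from the primary key keeps addresses unique, which matters if the column has a unique constraint. For large tables a single set-based UPDATE in SQL will be much faster than updating row by row.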
Those are some real-world examples I noticed of how GDPR will affect software engineers. Can you think of others?
I tried to keep this post very short and practical for engineers. If you would like to read more, GDPR for web developers is a great article and has a lot more information.