Lessons learned writing data pipelines

I spent quite a large part of my time at FutureLearn building custom data pipelines to make data available for the rest of the company. Since I've been experimenting with different and better ways of doing the same things, I decided to write some of the lessons I've learned during this time. This is not a comprehensive list but instead just some tips I wish I knew before I started.

Focus on your final dataset

Why are you building your data pipeline? What value does it provide to your users? Ask yourself that often. You're building a pipeline so your users can consume its output, not as an engineering challenge. With so many tools, languages, architectures, etc, it's very easy to lose sight of your goal. When altering your pipeline to make it more reliable, faster, or more scalable, be pragmatic: make your objectives clear and stick to them.

This really applies to any kind of software development or even any kind of work, but in my experience this kind of engineering can feel more disconnected from users than other kinds like web development, so for me keeping focus is particularly important when building pipelines.

If you're building a new dataset, first define your initial schema and work towards making it available as soon as possible. You can always do more later if the dataset proves to be valuable.

Use compression everywhere

IO is probably your biggest constraint, especially at the beginning. Using compression from the start will make your pipelines faster and save you a lot of time for very little extra work. Also, if you design your components to use compressed files from the moment they are created, you won't have to change code to deal with compression later.
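To get a feel for the savings, here's a small sketch in plain Ruby comparing raw and gzipped sizes in memory (the log-like sample data is made up for illustration):

```ruby
require "zlib"
require "stringio"

# Repetitive, log-like data compresses extremely well
data = "2021-01-01T00:00:00Z,page_view,user_123\n" * 100_000

io = StringIO.new
gz = Zlib::GzipWriter.new(io)
gz.write(data)
gz.finish # flush the gzip stream without closing the underlying IO

puts "raw: #{data.bytesize} bytes, gzipped: #{io.string.bytesize} bytes"
```

Real event data won't compress quite as well as a repeated line, but text data routinely shrinks several-fold, and that translates directly into IO saved.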

Most languages offer decorators for IO objects in their standard library, so using compression from the beginning shouldn't be an issue. For instance, in Ruby, instead of doing

io = File.open(path, "w")

use the Zlib::GzipWriter decorator:

io = File.open(path, "w")
io = Zlib::GzipWriter.new(io)

You can do the same when reading the data back. For inspecting those files manually, embrace the command line: tools like zcat and pipes make working with compressed files as easy as working with uncompressed ones.
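Reading follows the same decorator pattern. A minimal round-trip sketch (the file name is just an example):

```ruby
require "zlib"

# Write a compressed file, mirroring the writer example above
Zlib::GzipWriter.open("events.txt.gz") do |gz|
  gz.puts "first event"
  gz.puts "second event"
end

# Read it back line by line with the matching decorator
Zlib::GzipReader.open("events.txt.gz") do |gz|
  gz.each_line { |line| print line }
end
```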

Use some kind of distributed storage

Writing data to normal disks mounted on a single computer can cause you a lot of pain. Try to use distributed storage platforms or distributed file systems to store files, even those intermediate files used only by your pipeline.

You get a lot of benefits from storing your data on a distributed storage:

  • "Unlimited" disk space.
  • "Unlimited" IO.
  • No need to manually "share" files between computing nodes.

Those factors make scaling to several computers processing large amounts of data a lot easier.

Some common examples of such storage are AWS S3, AWS EFS, HDFS, etc.

Start with a (managed) distributed database

Even if you've never used a distributed database and you are familiar with PostgreSQL, MySQL, etc, it's really worth the effort to get out of your comfort zone and use a distributed, horizontally scalable database from the start, or as soon as possible.

Why? Well, first because they allow you to do ELT instead of just ETL. You can load a lot of raw(-ish) data and use the distributed nature of those databases to perform transformations and aggregations very quickly without having to worry about indexes, changing disks, etc.
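As a sketch of what ELT looks like in practice, assume a hypothetical `warehouse` connection object with an `execute` method (whatever your database adapter provides); the table names and the Redshift-style COPY syntax are illustrative, not any specific library's API:

```ruby
# Load: copy raw events into the warehouse as-is, no pre-aggregation
warehouse.execute(<<~SQL)
  COPY raw_events FROM 's3://my-bucket/events/' FORMAT CSV GZIP
SQL

# Transform: let the distributed database do the heavy lifting
warehouse.execute(<<~SQL)
  CREATE TABLE daily_signups AS
  SELECT DATE(created_at) AS day, COUNT(*) AS signups
  FROM raw_events
  WHERE event_type = 'signup'
  GROUP BY DATE(created_at)
SQL
```

The point is that the transformation step is plain SQL running on the warehouse's own compute, so the loading code stays thin no matter how large the dataset grows.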

Another advantage of using this kind of database is that you won't have to care about "loading too much data". I had to use a PostgreSQL database as a data warehouse in the past and I remember having to write a lot of complex ETL code just to deal with the fact that we couldn't load all the data to the same database.

Also, relying on those databases allows you to use whatever language you feel most comfortable with for data preparation and integration, since you can always push the heavy lifting to your database. Being able to use your favorite language, or the language your team is most comfortable with, can be a major factor in your success.

Some examples of distributed databases are AWS Redshift, Snowflake, Hive/Presto/Impala, etc.

What do you think of those points? Is there something you wished you knew before you started building your own pipelines?
