I recently decided to build a Spark-based ETL pipeline for preparing datasets for dimensional models using star schemas.
Previously I used Ruby and some sprinkles of Python (for scheduling/job execution via Airflow) to build such pipelines. Even though Ruby is not very performant, it lets you build those pipelines as many small, isolated scripts, and Ruby is very good at that. Also, Ruby's performance is not a big issue unless you have to deal with enormous amounts of data, which isn't the case for most small companies.
Anyway, I started studying Spark and learned that even though Python (which I'm comfortable with) is a supported language, writing Spark user-defined functions (UDFs) in it is not a great idea due to the overhead of "moving data over" from the JVM to Python.
I tried to find out whether I could use Spark via JRuby, but that doesn't seem to be supported, so I had to choose between Java and Scala.
Java is relatively easy for me due to my background in C#, but when I used it before, it always felt so unproductive. I had also used Scala before (I even finished Odersky's course on Coursera), but I always perceived it as an excessively complicated (although interesting) language, so I kind of gave up on learning more about it.
After thinking a bit about it, I decided to give Scala another chance. So now my challenges are:
- How to get Scala running in production
- Remember the basics of the language without getting bored
- Get up and running and familiar with the tooling
1. How to get Scala running in production
Initially I'll probably use Heroku to orchestrate Scala ETL jobs. Thankfully, running Scala apps on Heroku seems very easy. Migrating to an Airflow-based workflow should also be easy in the future, since running Airflow on Heroku is not very complicated either.
2. Remember the basics of the language without getting bored
I definitely don't want to read an 800-page book again. I don't know yet exactly how I'm going to pick up the language one more time, but I think I'm going to read Underscore's free Essential Scala e-book.
3. Get up and running and familiar with the tooling
Tooling is usually a vast topic in most languages. It seems you have to choose between using an IDE and using the command line. I assume you can use both when you know what you're doing, but it's probably best to pick one of those approaches and focus on it to start with. I decided to use the CLI since that's a saner approach in my opinion (for people who are already used to CLI-based tools).
Since I don't use any JVM-based language, I didn't have the JDK installed and that was the first issue: Oracle provides an installer for OS X instead of just giving you a gzipped file. Wat?
After some Googling, I found out that you can install the JDK using Homebrew. The basic steps to install the JDK 8 are
brew tap caskroom/versions
brew cask install java8
Spark needs Java 8, but if you just need the latest Java you can go with just
brew install java. Those recipes install the JDK files on
Also, if for whatever reason you need to manage multiple versions of the JDK on the same computer, jenv seems to work very well. I was very happy to find
jenv since I'm very used to
Now that the JDK was ready, I could finally install
sbt (Scala build tool). I used Homebrew with the command
brew install sbt@1
sbt can be very complicated, but you can go very far with a few basic commands like
# Create a new project based on a simple template (it will ask you for the project name).
sbt new scala/scala-seed.g8

# Compile, test, run when inside a project directory
sbt compile
sbt test
sbt run

# Run Scala's REPL
sbt console
You can also run just
sbt, which will give you a project-specific sbt-console that you can use to call
run, etc, without having to wait for sbt to boot every time. It's probably a good idea to have
sbt running all the time while you are coding in Scala.
According to Spark's docs, you need Scala 2.11 in order to run it, but you probably got a newer version by default from
sbt. You can configure
sbt to use a specific version of Scala on your project by editing the
build.sbt file. It should have a line with something like
scalaVersion := "2.XX.XX", which you can change to be whatever version you want.
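As a rough sketch, a minimal build.sbt pinned to Scala 2.11 might look something like the following. The project name and the exact version numbers are placeholders chosen for illustration, and the Spark dependency line is an assumption about what you would probably add next, so check the Spark docs for the versions that match your setup.

```scala
// build.sbt (sketch): the name and version numbers below are illustrative placeholders
name := "spark-etl"
version := "0.1.0"

// Pin the project to a Scala 2.11.x release instead of sbt's default
scalaVersion := "2.11.12"

// Assumption: a Spark 2.x dependency. "%%" appends the Scala binary
// version (here _2.11) to the artifact name, so it must match scalaVersion.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.8"
```

If sbt is already running in interactive mode, the reload command should pick up the change without restarting it.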
That's all for today. I'll probably write later about how to do some basic stuff in Spark with Scala.