We recently had an incident at work on our main application. This is the story of why it happened and what we learned from it. The initial part of this post applies mostly to Ruby applications but the rest of it is language agnostic, so you can keep on reading even if you don't use Ruby.
After a deployment, all our background workers were gone. Jobs piled up, error notifications started to pop up and our support team started to receive calls and emails with complaints.
After checking that the instances running the workers were alive, we then looked at the logs. The processes that run the background workers were crashing during boot.
The processes running the workers were failing with the following error:
TypeError: superclass mismatch for class Country
The stack trace told us that the error was happening inside a gem that we have used for ages. We initially thought that it was somehow a mismatch with the
Country class we have on the global scope (that's common on Rails projects), but after I looked at the source code of the gem the problem became clear.
The gem had the following line of code nested on a class:
class Country < Struct.new(:request, :ip, :country_code, :country_code2, :country_code3, :country_name, :continent_code)
This code works fine under normal conditions, but if you for any reason load the file containing it twice, Ruby will raise the superclass mismatch error from above. Want to check it out? Put this into a file named
class Country < Struct.new(:request, :ip, :country_code, :country_code2, :country_code3, :country_name, :continent_code); end
And then run this on
20:20:43 ~$ irb irb(main):001:0> load "boom.rb" => true irb(main):002:0> load "boom.rb" Traceback (most recent call last): 4: from /usr/local/bin/irb:11:in `<main>' 3: from (irb):2 2: from (irb):2:in `load' 1: from boom.rb:1:in `<top (required)>' TypeError (superclass mismatch for class Country)
This happens because each instance of
Struct.new is different and in Ruby, even though you can re-open classes, you can't change their inheritance chain.
The error was being triggered by a
require_dependency 'GEM_NAME_HERE' call that was added by mistake and ended up loading the gem twice under certain conditions that were present only when running the background jobs.
Fixing the issue
When we realized what the issue was, we quickly pushed a fix removing the extra
require_dependency call and waited for deployment.
This issue, however, could have been avoided if
Struct wasn't being used as the superclass of
Country. Inheriting from
Struct is a bad idea for several reasons.
Back in 2013 my friend Henrik already wrote why you shouldn't inherit from
Struct. He listed several reasons on his blog post, like the fact that
Struct arguments are not required and the fact that it always adds public write accessors. He even mentioned that a superclass mismatch could happen with Rails' class reloading but he couldn't replicate that. Oh well :-)
On complex systems, if something can potentially go wrong, that will happen at some point.
Another reason not to inherit from
Struct is clarity.
Struct doesn't support named parameters and having a long list of values being passed as parameters without knowing the parameter name makes your code harder to read.
What should you use instead? As suggested by Henrik, you should either just write the boilerplate code or use a gem like attr_extras that does that for you. I strongly recommend attr_extras. It not only removes the boilerplate for situations where you would want to inherit from
Struct (or even use it on non-inheritance scenarios) but also simplifies writing method objects, value objects, static facades, etc.
But wait! How did this code end up being deployed to production?
For us, after running our comprehensive test suite, deploying to staging is the best way of knowing if a new version of our application can be safely deployed to production. The challenge though is knowing if the application running in staging is actually working, since there are always some integrations that we can't test in an automated way. That's why we have smoke tests.
Smoke tests are a non-exhaustive test suite that checks if the most important features of an application work. They are normally used to test new versions of applications in automated deployment pipelines. The literature about tests say that smoke tests should be used early, before more expensive tests are executed. For us, running it after the rest of our test suite is the most pragmatic approach to make sure a recently deployed version is running.
Since this application is a public web app, our smoke tests consist mostly of making HTTP requests to important pages for which we know would end up exercising large parts of the codebase.
Fixing the pipeline
The problem with our pipeline is that it had no smoke tests for our background workers. The deployment pipeline ignored that the worker processes were dying during boot.
The fix to prevent this kind of issue from happening again was to extend our smoke tests to make sure they cover the background workers. Our smoke tests require some creativity since they have to run from outside the application and they can't be destructive like normal automated tests (since they run against a live environment).
There are several ways of accomplishing this, like dead man's switches, heart-beat jobs, monitoring whether and for how long processes are running, poking debugging HTTP endpoints, etc.
The most important lesson to be learned here is to make sure you run smoke tests (or whatever you call them) on all parts of your application, outside test environments. Without them, any change has the potential to take down entire parts of your system.
Do you do continuous deployment? How do you make sure a version deployed to staging (or any pre-production environment) actually works? Please leave a comment if you implement this kind of pipeline differently.