Mike Sisk

The Bug We Couldn't (immediately) Fix

Clod DNS

Years ago at Rackspace, I found myself running the Cloud DNS product.

DNS is one of those things you never think about — until it breaks. In IT support circles, we have a saying: “It’s always DNS.”

At its core, DNS sounds simple: type “www.mikesisk.com” into your browser, and DNS translates that name into an IP address such as 134.199.187.56. You rarely see IP addresses, but they are the plumbing of the internet.
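To make that name-to-address step concrete, here is a minimal sketch in Python that asks the operating system's resolver for a hostname's addresses. The helper name `resolve` is my own for illustration and has nothing to do with the Cloud DNS API. The address you get back depends on whatever the DNS record holds at the time.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the unique IP addresses the system resolver reports for hostname.

    `resolve` is a hypothetical helper for illustration only.
    """
    # getaddrinfo returns tuples of (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address as a string.
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})
```

Calling `resolve("www.mikesisk.com")` on a machine with network access should return the address or addresses currently published for that name.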

Even more impressive: Paul Mockapetris designed DNS in 1983, at Jon Postel's request, and the rules that govern it have barely changed, even though the internet has grown far beyond anything anyone could have imagined back then.


The Bug That Froze Everything

In 2014, not long after I arrived at Rackspace, I inherited Cloud DNS. This wasn’t what I was hired for, but when you find a big security flaw in something on your first day, you sometimes get “volunteered” for new responsibilities.

A year later, support rackers (our term for Rackspace staff) started reporting that some customers’ DNS API calls were hanging before they even got to the DNS part — right at authentication. The deeper we dug, the weirder it got: a hash collision buried in a third-party Java auth library.

The only real fix was to rewrite the auth code, which was a long job. Meanwhile, about two dozen customers couldn’t use the DNS API. Their authentication calls worked fine with other services, so the problem had to be in our code.


Workarounds and Gift Packs

Most affected customers barely touched the API. For them, I could manually bypass the auth code and make changes one at a time by hand.

But three customers were hammering the API dozens of times per second for things like hooking up mobile devices as game controllers. For them, I cloned entire accounts into new ones to dodge the bug — thousands of servers, databases, and DNS records.

Rackspace prided itself on being customer-centric, so I did it myself. I even mailed those three customers a gift pack: a note, a flashlight, and a toy Hexbug, thanking them “for shining a light on our bugs.” Big thanks to my wife and kids for coming up with that idea while I moved code around.

HEX bug

The Fix

After a couple of months, we rolled out the rewritten code, tested every account, and declared victory — until the next bug, anyway. At scale, you run into problems you didn’t think were possible.

Here’s the email I sent to the affected customers:

About your Rackspace Cloud Account...

Hello!

I'm Mike Sisk, the operations engineer at Rackspace responsible for the 
Cloud DNS API. 

Your Rackspace Cloud account is one of around two-dozen that we identified 
as having encountered a defect in our Cloud DNS API product. The external
behavior of the defect as you would see it is a "hang" in any request
sent to the API.

The cause was rather difficult to diagnose, identify and fix. We narrowed 
it down to a hash-collision in a third-party software library and it took 
us a while to work with those folks to get a fix, get it tested, and get 
it released. 

But we finally did it and yesterday we released a new version of the API 
with this defect fixed. I tested your account to make sure it works and 
you should be good to go. 

I apologize that you ran into this bug and that it took us a while to 
get it fixed. 

If you run into further issues with our DNS API feel free to open a 
support ticket anytime, or feel free to contact me. A support ticket 
is usually faster, since I have to sleep occasionally, but our support 
folks are always up and ready to help you out. 

Thanks!

Mike Sisk
Sr. Systems Engineer
Platform Operations

Why the Cake Said “Clod”

If you’re wondering about the cake at the top of this post: when we launched the first Cloud DNS release, our product manager ordered a cake. His handwriting wasn’t great, so it came back with “Clod” instead of “Cloud.” From then on, every release cake was “Clod.” Once something becomes an inside joke, it sticks.
