How To Properly Handle an Outage

Last week, Instapaper experienced a big outage that left users without access to the service for more than a day. They were able to get the service back up and running within 31 hours, with a full recovery taking place just this morning.

I, like many other Instapaper users, was worried about this. The service has changed hands a few times, and I wondered if it was time to look for another read-it-later service.

I didn’t want to change services. I’ve used Instapaper since the very early days of the App Store, and it’s been in the same spot on my iPhone home screen for years and years. Plus, no other solution out there works the way I want to work.

This morning, Brian Donohue, the lead developer on Instapaper, wrote an in-depth post-mortem on the outage. In it, he explains what went wrong, but it’s the way he closed the post that jumped out at me:

I take full responsibility for the incident and the downtime. While the information about the 2TB limitation wasn’t directly available to me, it’s my responsibility to understand the limitations of the technologies I’m using in my day-to-day operations, even if those technologies are hosted by another company. Additionally, I take responsibility for the lack of an appropriate disaster recovery plan and will be working closely with Pinterest’s Site Reliability Engineering team to ensure we have a process in place to recover from this type of failure in the event it ever happens again.

It’s encouraging to see someone take such responsibility for their work, especially when things go wrong. I will continue to use Instapaper, knowing it’s in good hands.