Report on Launchpad down-time of 4th Feb 2010

If you visited Launchpad between 13.30 and 15.30 UTC yesterday (4th Feb), you’ll have seen that Launchpad was largely unavailable.

Since then, I’ve spoken to quite a few people who use Launchpad regularly and I want to say thanks to everyone for your patience while we fixed the problem. As we all use Launchpad for our own development, we know just how painful unplanned down-time is and we’re sorry for the disruption to your work.

I’d like to explain what happened, how we fixed the problem and what we’re doing to avoid a similar situation in future.

As you’d probably expect, we run more than one database server for Launchpad. There are two master databases, plus a number of slaves, which are copies of the masters. The master databases replicate constantly to the slaves.

When Launchpad makes a read-only request, such as fetching the title and description of a bug report, we can reduce the load on the master databases by fetching that data from one of the slaves. However, to ensure the data you see is up to date, each time Launchpad is about to fetch data from the slave database, it checks how long it has been since the last replication from the relevant master database. If, for whatever reason, the replication wasn’t recent enough, Launchpad will instead grab the data from the master database.
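To make that routing decision concrete, here is a minimal sketch in Python. It is an illustration only, not Launchpad's actual code: the `Store` class, the `choose_store` function and the 120-second threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass

# Assumed threshold for illustration; the real figure is not given here.
MAX_LAG_SECONDS = 120.0


@dataclass
class Store:
    """Hypothetical stand-in for a connection to one database server."""
    name: str
    # Seconds since this server last received replicated data from its master.
    replication_lag: float = 0.0


def choose_store(master, slave, read_only, max_lag=MAX_LAG_SECONDS):
    """Pick the database that should serve a request."""
    if not read_only:
        # Writes always go to the master.
        return master
    if slave.replication_lag <= max_lag:
        # The slave is fresh enough, so take the load off the master.
        return slave
    # The slave is too far behind, so fall back to the master.
    return master
```

In this sketch the lag is just a stored number. In the situation described above, measuring it means asking the database, and that happens before every read.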

Yesterday, it was this check that was taking far longer than expected and so causing the problems that you may have seen. We were able to implement a temporary fix, to bring Launchpad back online, by directing all database queries straight to the correct master.

In the longer term, we’re going to overhaul the way that Launchpad checks the freshness of the data in the slave databases. Rather than checking each time a query is made, Launchpad will check periodically and cache the result, so that a slow check can no longer hold up every request and this problem shouldn’t arise again.
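One way such a cached check could look is sketched below in Python. This is a guess at the general shape rather than the planned implementation: `CachedLagCheck`, `measure_lag` and the 30-second lifetime are hypothetical, and a real version would also need to consider thread safety.

```python
import time


class CachedLagCheck:
    """Remember the last replication-lag measurement for a short time."""

    def __init__(self, measure_lag, ttl=30.0, clock=time.monotonic):
        self._measure_lag = measure_lag  # callable that does the expensive check
        self._ttl = ttl                  # how long a measurement stays valid (assumed)
        self._clock = clock              # injectable so the class is easy to test
        self._value = None
        self._checked_at = None

    def lag(self):
        """Return the cached lag, re-measuring only if the cache has expired."""
        now = self._clock()
        if self._checked_at is None or now - self._checked_at >= self._ttl:
            self._value = self._measure_lag()
            self._checked_at = now
        return self._value
```

With something like this in place, most read requests would reuse a recent measurement instead of running the check themselves. The trade-off is that the answer can be out of date by up to the cache lifetime.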

Thanks again for your patience.
