I’m thrilled to be writing this blog post just over a year after starting as Launchpad’s technical architect. During that year we have been steadily improving our ability to deploy changes to Launchpad without causing downtime (of any or all services). Our ability to do this directly impacts our ability to deliver bug fixes and new functionality – our users are very sensitive to downtime.
There has been one particularly tricky holdout though – our monthly 90 minute downtime window where we apply schema changes, do DB server maintenance and so forth.
Starting very soon we will instead have very short windows – approximately 60 seconds long – where we perform schema changes, database server failover (in order to permit DB maintenance on the master server) and so forth.
We expect to do these about 6 times a month based on our historical rate of schema patches, and we are – for now – planning on doing these at 0800 UTC consistently.
This will deliver much less total downtime – 6 minutes a month rather than 90 – at the cost of more frequent interruptions.
If you have API scripts running against Launchpad, you may want to build in a retry mechanism to deal with up to a few minutes of downtime.
We cannot remove downtime entirely for purely technical reasons: Our primary database (postgresql) blocks new readers (or writers) when a schema change is being executed, and the schema change blocks on existing readers (or writers) to complete – it needs an exclusive lock on each relation being altered.
What we can do is automate the process of disconnecting and interrupting existing database connections to let the schema change execute rapidly, and make our schema changes as minimal as possible. Previously, we shut down all the application servers (via a script, but shutting down gracefully takes time), and then ran schema changes which did data migration and so forth. In this new process we will leave the appservers running and just interrupt their connections for the time it take to apply the schema change. That, combined with moving data migration to a background job rather than doing it during the schema change, gives us the short downtimes we’re about to start doing.