Today we reached a significant milestone, we completed our first fast down-time deployment. Two obvious reasons for doing this were already mentioned in the announcement and our technical architect”s post describing the change:
- We’ll have less downtime per month (at the cost of more frequent but short interruption).
- We’ll be able to deploy fixes and changes involving DB schema more frequently.
But from my perspective, the most important benefit I think we’ll get from this is a speed up in our rate of development, particularly, in terms of completing feature projects. It’s not a secret, our feature squads spends a lot of time to complete their projects. There are multiple reasons for this, but in the end, there usually fall under two broad cateries: the time it takes to actually make the change, and the delays in getting feedback on the change itself.
To help with the first category, you’ll want better and more powerful libraries, better architecture, developers’ training, etc. Think the time difference between developping a database web application in Django vs as a CGI C application using only the standard C library. Launchpad isn’t using the most modern libraries and toolkits, and we could still make a lot of improvements there. But the costs of making changes in this space are compounded by the problems of the other category.
Once you wrote the change, you are far from done. There are lots of hoops you still have to jump through before saying “done-done“: you’ll need to make sure the tests pass, to get your changes reviewed, merged, QAed and then deployed. And finally, you’ll probably want to make sure that it matches the user’s expectations, but until it’s in production, this is hard to assess reliably. All of these steps takes time and introduce delays, some bigger than others. The Launchpad team is always on the look-out to cut in these delays and the new “fast down-time deployments” cut on of the biggest one we had.
Since a picture is worth at least a thousand words, have a look at the chart above to have a better idea of what I’m talking about. It shows the distribution of the time it takes to complete a “change”. (What this plots is the cycle time from coding to deployment of our Kanban cards which roughly map to one logical change.) You’ll see that 50% of our changes are deployed to production in about a week. And the next 45% takes between 1 and 5 weeks. Now, our feature projects are composed of many many of these smaller changes. If those are all relatively small changes, why do they take so long?
One of the big bottleneck was the batch size of our DB deployment. If a change required a DB schema it waited until the next downtime deployment which happened once every month. In theory, that means that on average a change involving the database would wait 2 weeks in the queue before deployment. In practice, it’s more complex than that, because squad leads would often plan around these. So a database change would be hold off onto because it was deemed that it couldn’t be safely completely to be part of the next downtime deployment. So it might be put on hold in favour of other work, and delayed to the next downtime deployment. It’s also frequent to have other changes building on the first also queued up waiting for the next deployment window. Add on top of that, that it’s common for the completion of a feature to require several iterations of DB change based on feedback and you quickly understand how you can be working months on a feature project!
But this major bottleneck is now gone! We’ll be able to land and deploy DB changes reliably within days, giving us much more rapid feedback. I’m looking forward to the change in the cycle time distribution in the coming months. The whole distribution should move toward the left. I’ll write a follow-up in two months to see if this prediction comes true.
Photo by Nathan E Photography. Licence: CC BY 2.0.