Speeding up development
Published by Francis J. Lacoste September 9, 2011 in General
Today we reached a significant milestone: we completed our first fast down-time deployment. Two obvious reasons for doing this were already mentioned in the announcement and our technical architect’s post describing the change:
- We’ll have less downtime per month (at the cost of more frequent but short interruptions).
- We’ll be able to deploy fixes and changes involving DB schema more frequently.
But from my perspective, the most important benefit we’ll get from this is a speed-up in our rate of development, particularly in terms of completing feature projects. It’s no secret that our feature squads spend a lot of time completing their projects. There are multiple reasons for this, but in the end they usually fall under two broad categories: the time it takes to actually make the change, and the delays in getting feedback on the change itself.
To help with the first category, you’ll want better and more powerful libraries, better architecture, developer training, etc. Think of the time difference between developing a database web application in Django versus as a CGI C application using only the standard C library. Launchpad isn’t using the most modern libraries and toolkits, and we could still make a lot of improvements there. But the costs of making changes in this space are compounded by the problems of the other category.
Once you’ve written the change, you are far from done. There are lots of hoops you still have to jump through before saying “done-done”: you’ll need to make sure the tests pass, and get your changes reviewed, merged, QAed and then deployed. And finally, you’ll probably want to make sure that it matches the user’s expectations, but until it’s in production, this is hard to assess reliably. All of these steps take time and introduce delays, some bigger than others. The Launchpad team is always on the look-out for ways to cut these delays, and the new “fast down-time deployments” cut one of the biggest ones we had.
Since a picture is worth at least a thousand words, have a look at the chart above to get a better idea of what I’m talking about. It shows the distribution of the time it takes to complete a “change”. (What this plots is the cycle time from coding to deployment of our Kanban cards, which roughly map to one logical change.) You’ll see that 50% of our changes are deployed to production in about a week. And the next 45% take between 1 and 5 weeks. Now, our feature projects are composed of many of these smaller changes. If those are all relatively small changes, why do they take so long?
One of the big bottlenecks was the batch size of our DB deployment. If a change required a DB schema change, it waited until the next downtime deployment, which happened once every month. In theory, that means that on average a change involving the database would wait 2 weeks in the queue before deployment. In practice, it’s more complex than that, because squad leads would often plan around these. So a database change would be held off because it was deemed that it couldn’t be safely completed in time to be part of the next downtime deployment. It might then be put on hold in favour of other work, and delayed to the following downtime deployment. It’s also frequent to have other changes building on the first also queued up waiting for the next deployment window. Add to that the fact that completing a feature commonly requires several iterations of DB changes based on feedback, and you quickly understand how you can be working for months on a feature project!
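The “2 weeks on average” figure can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a change is equally likely to become ready at any point between two deployments, and it treats a month as 28 days. The function name is made up for this example.

```python
def average_wait_days(period_days, samples=1000):
    """Average queue time when deployments happen every `period_days`.

    Assumption: changes become ready at evenly spread moments
    within the deployment period.
    """
    waits = []
    for i in range(samples):
        # Moment (in days since the last deployment) the change is ready.
        ready_at = (i + 0.5) * period_days / samples
        # It then waits until the next deployment.
        waits.append(period_days - ready_at)
    return sum(waits) / samples


# Monthly downtime deployment, with a month taken as 28 days.
print(average_wait_days(28))
```

Under these assumptions the average wait is half the deployment period, so shortening the period shortens the wait in direct proportion. This ignores the planning effects described above, which made the real delays longer.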
But this major bottleneck is now gone! We’ll be able to land and deploy DB changes reliably within days, giving us much more rapid feedback. I’m looking forward to the change in the cycle time distribution in the coming months. The whole distribution should move toward the left. I’ll write a follow-up in two months to see if this prediction comes true.
Photo by Nathan E Photography. Licence: CC BY 2.0.
08.30 UTC is super fast down-time … time
Published by Matthew Revell September 8, 2011 in Notifications
Tomorrow, you may notice a blip in Launchpad’s availability around 08.30 UTC. Believe it or not, this is good news 🙂
Until tomorrow, we’d been rolling out database changes — schema updates, database server maintenance, etc — once a month, with a 90 minute period where Launchpad was read-only.
Now, we’re doing things a little differently: two or three times a week, we’ll be doing a fast database update at 08.30 UTC (weekdays only). To start with, it won’t quite be “blink and you’ll miss it”. We’re talking around two minutes but we’ve already identified ways to cut this time. During the update, Launchpad will be effectively unavailable. But it’ll be quick and at a predictable time each day that we do it.
So, other than the obvious bonus of not having Launchpad go read-only for a big 90 minute block every month, why’s this good news? As it’s always at the same time and only for a short while, we think it’ll be easier to work around. The down-time won’t even be long enough to make a decent cup of tea or coffee. Importantly, it also means you’ll get new Launchpad code faster: if a new feature or a bug fix requires a database schema change, we can now roll it out pretty much within 24 hours rather than waiting up to a month for the next big 90 minute read-only time.
There’s a bug we need to fix: right now, during the fast down-time you’ll get an OOPS when Launchpad tries to access the database. Once we’ve fixed the bug you’ll get a somewhat friendlier and more appropriate 503 error.
While we’re all getting used to it, we’ll still announce these fast database updates on the status feed. We’re hopeful, though, that they’ll be quick enough and predictable enough (08.30 UTC weekdays, two or three times a week) that eventually you’ll have to try hard to notice them.
We’re hiring: a Software Engineer and a Usability and Communications Specialist
Published by Matthew Revell September 7, 2011 in We're hiring!
We’re looking for a couple of smart, motivated and experienced people to join us on the Launchpad team at Canonical.
First up is a Software Engineer, to join one of the Launchpad development squads working on both new Launchpad features and maintenance of existing functionality.
There’s also an opening for a Usability and Communications Specialist. This is to join Launchpad’s Product Team, where we’re looking for someone who can run a usability research programme and produce documentation, blog posts and so on.
If you’ve got any questions about either role, feel free to grab me (mrevell) on FreeNode.
Matthew Revell is the new Launchpad Product Manager
Published by Francis J. Lacoste August 26, 2011 in General
It’s my pleasure to announce that as of today, Matthew Revell is the new Launchpad Product Manager, replacing Jonathan Lange.
We were seduced by his bold vision for Launchpad, along with the data-centric approach that he intends to bring to the role. He also already has extensive experience interacting with Launchpad developers and users. If you read this blog, you have probably read something written by him! Or you might have interacted with him in one of the many user-research sessions he ran. The introduction of user research helped us release better-designed features. Building on this experience, we hope that his leadership will bring Launchpad to the next level.
Matt will communicate more about his plans for Launchpad shortly. In the meantime, let’s give him our warmest congratulations!
Beta testers: try the new person picker
Published by Matthew Revell August 22, 2011 in Beta, Coming changes
When you want to assign a bug to someone, subscribe them to a blueprint and so on, you see Launchpad’s person picker. It’s where you search for someone or a team, get a list of possible matches and then select the right one.
Fairly recently, we’ve made a couple of improvements to the person picker, such as adding the person/team’s unique Launchpad ID after their display name, so you stand a better chance of choosing the right person.
The trouble is, how many of us know the Launchpad ID of each person or team we’re likely to deal with?
I know I think more in terms of someone’s IRC nick or the various associations they might have, rather than what they chose as their Launchpad ID.
That’s why we’re changing the person picker: soon, everyone will get a new version that shows what I think is a much more useful set of information for finding the right person.
Here’s what it might look like:
If you’re in the Launchpad beta testers team you might have seen it already.
If you’ve used it, let us know what you think: does it give you the information you need? Have you come across any bugs to report?
Either email feedback@launchpad.net or comment directly on this post to help us shape the new person picker!
Users can now move bugs between projects and distros
Published by Curtis Hovey August 17, 2011 in General
Users can use the affects form on the bug page to change which project or distribution the bug affects. You can also select the affected package. Launchpad API users can assign a project, distribution, or package to the BugTask target property to change the affected bug target. The behaviour is similar to the way questions can be retargeted between projects and distributions. Affected series cannot be changed, though the affected series package can be.
Previously, users had to mark a bug affecting a project or distribution as invalid, then add a new affected project or distribution. This cluttered the UI, caused excessive emails, and made pages slower.
Fast JSONCache updates now active for improved responsiveness
Published by Aaron Bentley August 5, 2011 in API, Code
I recently posted about Initializing page JavaScript from the JSONCache. Now I’m pleased to announce that you can also get updated copies of the IJSONRequestCache, to make it easier to update your page.
Brad Crittenden and I started work on this at the Dublin Thunderdome, and it’s finally been deployed. What this means is that for basically any page on Launchpad, you can append /++model++ to the URL, to get a fresh copy of the IJSONRequestCache. With ++model++, a change will typically require only two roundtrips; one to make a change, and one to retrieve an updated model. Future work may reduce this to a single roundtrip.
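As a rough sketch of what “append /++model++ to the URL” means for a script, the helper below builds the model URL for a given page URL. The function name and the example URLs are made up for illustration, and keeping the query string while dropping the fragment is an assumption, not documented Launchpad behaviour.

```python
from urllib.parse import urlsplit, urlunsplit


def model_url(page_url):
    """Return page_url with /++model++ appended to its path.

    Assumptions: any query string is kept, and the fragment is
    dropped because it is never sent to the server anyway.
    """
    parts = urlsplit(page_url)
    # Avoid a double slash if the page URL ends with one.
    path = parts.path.rstrip("/") + "/++model++"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


print(model_url("https://launchpad.net/launchpad"))
```

Page JavaScript would typically fetch this URL after making a change, which is the second of the two roundtrips mentioned above.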
Why ++model++, not ++cache++? Cache is a really poor name for what the IJSONRequestCache is. Rather than providing fast access to whatever data has been previously retrieved, it is a complete collection of all the relevant data.
In Launchpad, the IJSONRequestCache is associated with the view, so we’re trying to rebrand it as the “view model”. This may seem strange from an MVC (Model, View, Controller) perspective, but MVC can be recursive. A view may use a model to render itself.
Approve your own translation imports
Published by Matthew Revell July 29, 2011 in Translations
Good news if you run a project’s translation effort in Launchpad!
Until today, when you imported a template or translation file into Launchpad for the first time, you’d have to wait for a member of the Canonical Launchpad team to review and then approve that file before your project’s translation community could make use of it.
Now, if you’re a project maintainer, you can manage your project’s translations import queue yourself. All you need do is follow the “import queue” link on your project’s translations overview page and you’ll see something like this:
Once you’ve approved a file, and it has been imported, subsequent changes will go through Launchpad’s automatic approval process.
Take a look at our guide to importing templates for more detail.
Road sign photo by Spixey. Licence: CC BY.
Less mail for mailing list admins
Published by Robert Collins July 27, 2011 in Notifications
Many mailing lists in Launchpad are open teams – that is, anyone is welcome to join, or leave, as they choose.
Until today, all the list admins were mailed every time someone joined or left their team, even though there is no action to take: in an open team, you cannot kick someone out.
We’ve fixed this – now for open teams (and only open teams) when someone joins or leaves the team, the team admins will not be notified.
In future we will have a subscription facility for team admins that do want these emails, and at that point we will make them optional for all team types.
No more monthly 90 minute downtime
Published by Robert Collins July 26, 2011 in Coming changes
I’m thrilled to be writing this blog post just over a year after starting as Launchpad’s technical architect. During that year we have been steadily improving our ability to deploy changes to Launchpad without causing downtime (of any or all services). Our ability to do this directly impacts our ability to deliver bug fixes and new functionality – our users are very sensitive to downtime.
There has been one particularly tricky holdout though – our monthly 90 minute downtime window where we apply schema changes, do DB server maintenance and so forth.
Starting very soon we will instead have very short windows – approximately 60 seconds long – where we perform schema changes, database server failover (in order to permit DB maintenance on the master server) and so forth.
We expect to do these about 6 times a month based on our historical rate of schema patches, and we are – for now – planning on doing these at 0800 UTC consistently.
This will deliver much less total downtime – 6 minutes a month rather than 90 – at the cost of more frequent interruptions.
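The comparison is simple arithmetic, spelled out here as a small sketch. The function name is made up for this example, and the figures are the estimates given above (one 90 minute window versus about six windows of roughly 60 seconds each).

```python
def monthly_downtime_minutes(windows_per_month, seconds_per_window):
    """Total downtime per month, in minutes."""
    return windows_per_month * seconds_per_window / 60


old_process = monthly_downtime_minutes(1, 90 * 60)  # one 90 minute window
new_process = monthly_downtime_minutes(6, 60)       # six ~60 second windows

print(old_process, new_process)
```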
If you have API scripts running against Launchpad, you may want to build in a retry mechanism to deal with up to a few minutes of downtime.
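A retry mechanism along these lines might look like the sketch below. It is not tied to any particular Launchpad client library: `ServiceUnavailable` is a hypothetical exception standing in for whatever error your script sees while Launchpad is briefly down, and the delay values are arbitrary choices.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error: the API call failed because the service is down."""


def call_with_retry(operation, max_wait_seconds=300, initial_delay=5,
                    sleep=time.sleep):
    """Call operation(), retrying while it raises ServiceUnavailable.

    The delay doubles after each failure (capped at 60 seconds). Once
    the total time slept reaches max_wait_seconds, the error is re-raised.
    """
    delay = initial_delay
    waited = 0
    while True:
        try:
            return operation()
        except ServiceUnavailable:
            if waited >= max_wait_seconds:
                raise  # Down for longer than expected: give up.
            sleep(delay)
            waited += delay
            delay = min(delay * 2, 60)
```

A script would wrap each API call, for example `call_with_retry(lambda: fetch_bug(12345))`, where `fetch_bug` is whatever function your script already uses. The default of five minutes is meant to comfortably cover “a few minutes of downtime”.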
We cannot remove downtime entirely for purely technical reasons: our primary database (PostgreSQL) blocks new readers (or writers) when a schema change is being executed, and the schema change blocks on existing readers (or writers) to complete, because it needs an exclusive lock on each relation being altered.
What we can do is automate the process of disconnecting and interrupting existing database connections to let the schema change execute rapidly, and make our schema changes as minimal as possible. Previously, we shut down all the application servers (via a script, but shutting down gracefully takes time), and then ran schema changes which did data migration and so forth. In this new process we will leave the appservers running and just interrupt their connections for the time it takes to apply the schema change. That, combined with moving data migration to a background job rather than doing it during the schema change, gives us the short downtimes we’re about to start doing.
More information is available in the LEP and my mailing list post about the project starting.