Burning down critical bugs

I have been analysing Launchpad’s critical bugs to track the Purple squad’s progress while on Launchpad maintenance duty. In January of 2011, the Cloud Engineering team, né Launchpad Engineering team, was reorganised into squads, where one or more squads would maintain Launchpad while other squads worked on features. This change also aligned with a newfound effort to enforce the zero-oops policy. The two maintenance squads had more than 332 critical bugs to close before we could consider adding features that the stakeholders and community wanted. By July 2011, the count had dropped to its lowest point, 250 known critical bugs. Why did the count stop falling for fifteen months? Why is the count falling again?

Charting and analysing critical bugs

Chart of Launchpad's critical bugs since the formation of Launchpad squads and maintenance duties
The chart above needs some explanation to understand what is happening in Launchpad’s critical bugs over time. (You may want to open the image in a separate window to see everything in detail.) Each iteration is one week. The backlog represents the open critical bugs in Launchpad at the start of the iteration. The future bugs are either bugs that are not discovered, not introduced, or reported and fixed within the iteration. The last group is crucial to understanding the lines plotting the number of bugs fixed and added during the iteration. We strive to close critical bugs immediately. Most critical bugs are reported and fixed in a few days, so most bugs were not open long enough to show up in the backlog. The number of bugs fixed must exceed the number added to make the backlog count fall. You can see that the maintenance squads have always been burning down the critical bugs, but if you are just watching the number of open bugs in Launchpad, you get the sense that the squads are running just to stand still.

I use the lp-milestone YUI widget to chart the bugs and analyse our progress through the critical bugs. It allows me to summarise a set of bugs, or analyse a subset by bug tag.

Launchpad maintenance analysis -- driving critical bugs to zero

Though 22 bugs were fixed this past week, 14 were added, thus the critical count dropped by 8. The last eight iterations are used to calculate the average bugs closed and opened per iteration. The relative velocity (velocity – flux) is used to estimate the remaining number of days to drive the count to zero. When the Purple squad started maintenance on September 10th of 2012, the estimated days of effort was more than 1,200. In just three weeks, the number has fallen dramatically. The principal reason the backlog of critical bugs has fallen is that the Purple squad is now giving those bugs their full attention, but that generalisation is unsatisfactory.
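The estimate described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the lp-milestone widget. The function name, the 7-day iteration length, and the backlog of 200 in the example are assumptions made for the sketch; only the 22 fixed and 14 added bugs come from the week described above.

```python
from statistics import mean


def estimate_days_to_zero(backlog, fixed_per_iteration, added_per_iteration,
                          window=8, days_per_iteration=7):
    """Estimate the days needed to drive the backlog to zero.

    velocity: average bugs fixed per iteration over the last `window` iterations.
    flux: average bugs added per iteration over the same window.
    days_per_iteration=7 is an assumption (one iteration is one week).
    """
    velocity = mean(fixed_per_iteration[-window:])
    flux = mean(added_per_iteration[-window:])
    relative_velocity = velocity - flux
    if relative_velocity <= 0:
        # The backlog is not shrinking, so there is no finite estimate.
        return None
    iterations_left = backlog / relative_velocity
    return iterations_left * days_per_iteration


# Hypothetical example: eight weeks that all look like the week described
# above (22 fixed, 14 added) and a made-up backlog of 200 bugs.
example_days = estimate_days_to_zero(200, [22] * 8, [14] * 8)
```

When the squads fix no more bugs than are added, the relative velocity is zero or negative and the sketch returns `None`, which matches the feeling of running just to stand still.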

Why is the Purple squad so good at closing bugs in the critical backlog?

I do not know the answer to my question. The critical backlog reached its all-time low of 250 bugs with the release of the Purple squad’s maintenance work in July 2011. There was supposition that Purple fixed the easy bugs, or that the fixes did not address the root cause, so another critical bug was opened. I disagree. The squad had no trouble finding easy bugs, and it too would have been fixing secondary bugs if the first fix was incomplete. I can tell you how the squad works on critical bugs, but not why it is successful.

I was surprised to see the Purple squad were still the top critical bug closers when they returned to maintenance after 15 months of feature work. How could that be? The squad fixed a lot of old timeout and JavaScript bugs in the last few months through systemic changes, enough to significantly affect the statistics. About 600 critical bugs were closed while Purple squad were on feature work. The squad closed 210 of those bugs. 60 were regressions that were fixed within the iteration, so they never showed up in the backlog. 70 critical bugs were fixed because they blocked the feature, and 80 critical bugs were fixed because Purple was the only squad awake when the issue was reported. The 4 other squads fixed an average of 98 bugs each when they were on maintenance. The Purple squad fixed more bugs than maintenance squads on average even when they were not officially doing maintenance work. The data, charts, and analysis always include the Purple squad.
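The figures above can be checked with a little arithmetic. The sketch below assumes, as my reading of the numbers, that the roughly 600 closed bugs are the Purple squad's 210 plus the fixes made by the 4 other squads; the variable names are made up for the illustration.

```python
# Purple squad's critical fixes while on feature work, by reason.
purple_fixes = {
    "regressions fixed within the iteration": 60,
    "blocked the feature": 70,
    "only squad awake when reported": 80,
}
purple_total = sum(purple_fixes.values())  # 210

# The 4 other squads averaged 98 fixes each while on maintenance.
other_squads = 4
other_average = 98
other_total = other_squads * other_average  # 392

# Assumption: together these account for the "about 600" closed bugs.
combined_total = purple_total + other_total  # 602

# How Purple compares with an average maintenance squad.
ratio_to_average = purple_total / other_average
```

Under that assumption, the three categories add up to the 210 bugs the squad closed, the combined total of 602 agrees with "about 600", and Purple closed a little more than twice as many bugs as the average maintenance squad.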

I suspect the Purple squad has more familiarity with bugs in the critical backlog. They never stopped reading the critical bugs when they were on feature work. They saw opportunities to fix critical bugs while solving feature problems. I know some of the squad members are subscribed to all critical bugs and re-read them often. They triage and re-triage Launchpad bugs. This familiarity means that many bugs are ready to code — they know where the problem is and how to fix it before the work is assigned to them. They fixed many bugs in less than a day, often doing exactly what was suggested in the bug comments.

During the first week of their return to maintenance, about 30 critical bugs were discovered to be dupes of other bugs. Though this change does make the backlog count fall, it also revised all the data, so the chart is not showing these 30 bugs at all now. The decline of backlog bugs does not include dupes. While the squad was familiar enough to find many bugs that they could close in a single day, they were not so familiar as to have known that there were 30 duplicate bugs in the backlog when they started.

Most squads have only one person with DB access, but the Purple squad is blessed with 3 people who can test queries against production-level data. This could be a significant factor. It is nigh impossible to fix a timeout bug without proper database testing. Only 13 of the recent bugs closed were timeouts though. The access also helps plan proper fixes for other bugs, so maybe 20% of the fixed bugs can be attributed to database access.

Maybe the Purple squad are better maintenance engineers than other squads who work on maintenance. For 28 months, I was the leading bug closer working on Launchpad. I closed 3 times more bugs than the average Launchpad engineer. I am not a great engineer though. My “winning” streak came to a close shortly after William Grant started working on Launchpad full time; he soundly trounced me over several months. Then he and I were put on the same squad and asked to fix critical bugs. Purple also had Jon Sackett, who was closing almost twice as many bugs as the other engineers. I don’t think I need to be humble on this matter. To use the vulgar, we rocked! Ian was the odd man on the Purple squad. He was the slowest bug closer, often going beyond our intended scope to fix an issue. Then Purple switched to feature work, and Ian leapt to the first rank while the rest of the squad struggled. Ian fixed almost double the number of Disclosure bugs that other squad members did. The leading critical bug closer on the squad at the moment though is Steve Kowalik. This is his first time working on maintenance. His productivity has jumped since transitioning to maintenance.

I can only speculate as to why some engineers are better at maintenance, or can just close more bugs than others. A maintenance engineer must be familiar with the code and the rules that guide it. Feature engineers need to analyse issues and create new rules to guide code. I did not gradually become a leading bug closer. It happened in a single day, when I realised while solving one issue that the code I was looking at was flawed. It certainly was causing a bug, I knew how to fix it, and with a few hours of extra effort, I could close two bugs in a single day. Closing bugs has always been easy since that moment.

I believe the Purple squad values certainty over severity and small scope over large scope when choosing which critical backlog bugs to fix. I created several charts that break the critical bugs into smaller categories. I suggested the squad burn down sub-categories of bugs like regressions, or 404s. The squad members are instead fixing bugs from the entire backlog. They are choosing bugs that they are certain they can fix in a few hours. I think the squad has tacitly agreed to fix bugs that are less than a day of effort. When this group is exhausted, they will fix issues that require days of effort but that also fix several bugs at once. The last bugs to be fixed will be those that require many days to fix a single bug. Fixing the bugs with the highest certainty reduced our churn through the critical bugs: there are fewer to triage, to dupe, to get ready to code.

The Purple squad avoids doing feature-level design and effort to fix critical bugs. Feature-level efforts entail more risk, more planning, and much more time. There is often no guarantee, low certainty, that a feature will fix the issue. A faster change with higher certainty can fix the issue, but leaves cruft in the code that the engineers do not like. Choosing to do feature-level fixes when a more certain fix is available indicates there is tension between the Launchpad users who have a “critical” issue that stops them from using Launchpad, and the engineers who have a “high” issue maintaining mediocre code. I contend it is easier to do feature-level work when you are not interrupted with maintenance issues. When the Purple squad does choose to do feature-level work to fix a critical, they have a list of the bugs they expect to fix, and they cut scope when fixing a single bug delays the fix of the others. The Launchpad Answers email subsystem was re-written when other options were not viable; there were about 20 leading timeouts represented by 5 specific bugs to justify 10 days of effort to fix them.

The Purple squad is not unique

Nothing that I have written explains why the Purple squad are better at closing critical bugs. All squads have roughly the same skills and make decisions like Purple. Maybe the issue is just a matter of degree. If the maintenance squad is not closing enough bugs to burn down the backlog, their time is consumed by triaging and duping new critical bug reports. Familiarity with Launchpad’s thousands of bugs is an advantage when triaging bugs and getting a bug ready to code. Being able to test queries yourself on a production-level database takes hours or days off the time needed to fix an issue. Familiarity with the code and the reasoning that guided it increases the certainty of success. The only domain that Purple is not comfortable working with is lp.translations; the squad is comfortable changing 90% of Launchpad’s code. There may be a correlation between familiarity with code, and the facts that the squad members participated in the apocalypse that re-organised the code base, and that some have a LoC credit count in the thousands.
