“There is a production problem!! All hands on deck!!”
Really? Do we actually need to involve every member of the team? I say, probably not.
On a team with myriad legacy systems, production problems are often a significant burden. In my experience, without a strategy for managing the team’s approach to tackling these production ‘fires’, the team’s yield for new value creation will be far below its potential.
Without compromising the core tenet of a cross-trained team, how does production support work in an agile environment? How is a “production support” role best framed? While the first step may be acknowledging the need for a strategy, the real “magic” in making your strategy work comes from the team’s ownership of the need.
Fire!
Who doesn’t love a Fire Chief? My little nephew is obsessed with fire trucks, and would probably trade cookies for a trip down to the fire station! And isn’t the Fire Chief the bravest and smartest firefighter of them all?
With few exceptions, it generally doesn’t take an entire team to fight a fire. The “all hands on deck” approach to production issues is often born from a misguided – albeit well-intentioned – desire to resolve an issue quickly in order to get back to work on the “new stuff”. Best case, this approach may optimize a single engineer’s time-to-resolution at the cost of lowering the overall yield of the team. Worst case, “drop everything” is a trained response designed as much to create the appearance of motion as any real progress.
Our approach has been to define a new role – the Engineering Fire Chief. Simply put, our Fire Chief is an engineer (or two) who actively accepts the role of providing distraction-free “cover” for their team. (Yes, we actually bought a Fire Chief hat.)
While the team is working away on creating net-new value for our business, the Fire Chief’s duty is to put out the production fires, so that the team can focus fully on their stories. After a couple of iterations’ worth of assessing and adjusting the Fire Chief position, here is what we’ve learned:
Rotation
Whether or not to rotate the Fire Chief role amongst all team members was an easy decision. It’s something I’m passionate about, but more importantly, it’s something the team is passionate about. After all, our software belongs to all of us. We started a rotation from the moment the Fire Chief role was coined (and we had acquired the hat). The hat, while a bit silly (fun?), creates an informal but important “hand-off” of the responsibility of running our technology from Fire Chief to Fire Chief. It truly is a relief to hand it off, but we all have a laugh about it too, which makes our little ceremony fun.
Rotation Frequency
We experimented with several rotation lengths: 1 week, 2 weeks, and 1 month. Our trial-and-error approach was really interesting, and over time it has taught us a lot about what the Fire Chief role is really all about. Ultimately we settled on 2 weeks (the same length as our Iteration). The Iteration start and end is a natural breaking point for the Fire Chief rotation, and we’ve created a recurring forum with our business partners that takes place when the new Fire Chief is handed the hat. In this forum we focus on lingering issues and/or minor features or fixes which need to be applied to the legacy world. The new and retiring Fire Chiefs also engage in a hand-off discussion.
Cover Man Means Cover Man
Somehow the team had been ingrained with the mentality that some problems required the attention of the entire team (see the quote that opens this article). During one particular iteration, our feed processing pipeline was experiencing a slowdown. After several days of slowdown with no solution in sight, the risk became large enough that the entire team was ‘required’ to dive in. All of our new product development (the stories in our iteration) ground to a halt. When we eventually found the problem and fixed it, we determined as a team that finding the root cause hadn’t actually required the entire team’s attention. This was perhaps one of our more valuable retrospectives. To summarize, while it took us a while to notice cause and effect, we eventually noticed that when the entire team jumped on a production problem, our velocity for the iteration dropped drastically.
More broadly, this helped us refine the Fire Chief framing as a ‘cover man’. Prior to this, and with the Fire Chief accounted for in our normal iteration capacity, Fire Chiefs were scrambling to resolve problems, and in doing so, were involving other team members. A Fire Chief would pull in another team member in order to solve the production problem as quickly as possible and get back to the stories in the iteration. In involving other team members, the Fire Chief was unintentionally impacting the broader team’s ability to deliver. The Fire Chief was accidentally hurting the team’s velocity.
In short, we simply no longer count the Fire Chief’s capacity in the iteration. As a result, the person in this role is under less pressure and is free to discover root causes and develop improvements without impacting or distracting other team members. The Fire Chief is truly a cover man.
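The capacity adjustment above is simple arithmetic, and a small sketch may make it concrete. Everything here is hypothetical and exists only for illustration: the engineer names, the day counts, and the `plannableDays` helper. The idea is just that the current Fire Chief's days are left out when the iteration is planned.

```java
import java.util.Map;

final class IterationCapacity {
    /**
     * Sums the available days of everyone except the current Fire Chief.
     * Assumption: capacity is tracked as whole days per engineer.
     */
    static int plannableDays(Map<String, Integer> availableDaysByEngineer, String fireChief) {
        return availableDaysByEngineer.entrySet().stream()
                .filter(entry -> !entry.getKey().equals(fireChief))
                .mapToInt(Map.Entry::getValue)
                .sum();
    }

    public static void main(String[] args) {
        // Hypothetical four-person team in a two-week iteration.
        Map<String, Integer> team = Map.of("Ana", 10, "Ben", 10, "Chi", 8, "Dev", 10);

        // Ben wears the hat this iteration, so his 10 days are not planned.
        System.out.println(plannableDays(team, "Ben")); // prints 28
    }
}
```

Planning against 28 days rather than 38 means the stories in the iteration never compete with production fires for the Fire Chief's time.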
Make the Production World Better for the Next Fire Chief
What good is a rotation of Fire Chiefs who are focused on band-aids, where no improvement of our legacy world occurs? It took some time, but by creating the right mix of rotation, rotation frequency, and cover-man-means-cover-man mentality, we’re now able to improve the legacy world with each Fire Chief’s tenure. Not counting the Fire Chief’s capacity in the iteration means that the Fire Chief can spend time finding and fixing problems at their root cause. One recent Fire Chief exposed new and interesting real-time performance metrics via JMX, and added visualizations of the metrics to our internal dashboards. This dashboard acts as a real-time window into performance problems, and improves every subsequent Fire Chief’s ability to see what is happening and solve problems.
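For readers who haven't exposed metrics over JMX before, here is a minimal sketch of the general technique using the standard `javax.management` API. It is not the code our Fire Chief wrote: the `FeedMetrics` class, its two attributes, and the `com.example.feeds` object name are all assumptions made up for the example.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class FeedMetricsDemo {
    /** Management interface; the "MXBean" suffix marks it as an MXBean. It must be public. */
    public interface FeedMetricsMXBean {
        long getRecordsProcessed();
        long getLastBatchMillis();
    }

    /** Hypothetical metrics holder that a feed pipeline would update as it runs. */
    public static final class FeedMetrics implements FeedMetricsMXBean {
        private final AtomicLong recordsProcessed = new AtomicLong();
        private final AtomicLong lastBatchMillis = new AtomicLong();

        void recordBatch(long records, long elapsedMillis) {
            recordsProcessed.addAndGet(records);
            lastBatchMillis.set(elapsedMillis);
        }

        @Override public long getRecordsProcessed() { return recordsProcessed.get(); }
        @Override public long getLastBatchMillis() { return lastBatchMillis.get(); }
    }

    /** Registers the bean with the JVM's platform MBean server under the given name. */
    static ObjectName register(FeedMetrics metrics, String name) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName objectName = new ObjectName(name);
        server.registerMBean(metrics, objectName);
        return objectName;
    }

    public static void main(String[] args) throws Exception {
        FeedMetrics metrics = new FeedMetrics();
        // The object name is an assumption; pick a domain that matches your application.
        ObjectName name = register(metrics, "com.example.feeds:type=FeedMetrics");

        metrics.recordBatch(500, 1200);

        // Read the attribute back the way a JMX client or dashboard collector would.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        System.out.println(server.getAttribute(name, "RecordsProcessed")); // prints 500
    }
}
```

Once a bean like this is registered, any JMX-aware tool (JConsole, for example) can read the attributes, and a collector can poll them to feed a dashboard.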
~
How do you deal with “production support”? How are you able to evolve your systems and services while maintaining uptime and SLAs for existing legacy systems?