Ensure Reliability Without Heroes: Building a Strong System
In today’s fast-paced world, heroes often emerge when problems arise, but relying on heroics should be the exception, not the rule. When individuals constantly step in to save the day, it’s a sign that the underlying system may be lacking. Teams supporting products and services play a critical role in maintaining reliability, but unsustainable efforts can lead to burnout.
Don’t get me wrong. We need heroes. When something unusual happens or an irascible bug rears its head, we need heroes to step in and step up to resolve the issue. Going above and beyond is appreciated, but this should be the exception. If heroics are the status quo and something that is relied upon, people burn out. When this happens (or, even better, before it happens), it merits a deeper look at the underlying system. The system needs to be built for the desired reliability and not rely on an individual hero.
Teams that support products and services are on the front lines of issues. They are the reliability engineers who keep a website operating or a service online without interruption. However, when an individual takes on responsibilities that the system does not intrinsically support, the result can be unsustainable effort. When someone makes up for a system’s shortcomings, they become a hero, often at the expense of long hours and working tirelessly through weekends. These Herculean efforts may mask underlying issues and prevent long-term fixes from ever being identified, developed, and implemented.
For example, imagine a Service Level Agreement (SLA) with a twenty-four-hour response time. Sometimes the response time is not even explicitly committed to customers; it is simply expected. Tickets are assigned to individuals who must triage and resolve them. If the system doesn’t inherently support that response time, then when ticket rates are high the only way to meet it is for individuals to work longer hours.
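To make the ticket arithmetic concrete, here is a minimal sketch of the workload calculation. The function name and every number in it (ticket volume, handling time, team size) are hypothetical and chosen only for illustration; real figures would come from your own ticketing data.

```python
def required_hours_per_person(tickets_per_day: float,
                              minutes_per_ticket: float,
                              team_size: int) -> float:
    """Hours each person must work per day just to keep up with incoming tickets.

    All inputs are illustrative assumptions, not measurements.
    """
    if team_size <= 0:
        raise ValueError("team_size must be positive")
    total_hours = tickets_per_day * minutes_per_ticket / 60
    return total_hours / team_size


# Hypothetical quiet day: 30 tickets, 45 minutes each, team of 4.
quiet_day = required_hours_per_person(30, 45, 4)

# Hypothetical busy day: the ticket rate doubles, the team does not.
busy_day = required_hours_per_person(60, 45, 4)

print(f"Quiet day: {quiet_day} hours per person")
print(f"Busy day:  {busy_day} hours per person")
```

Under these assumed numbers, the busy day asks each person for well over a standard workday. Nothing in the system absorbs the extra load, so people do.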
Uptime is another good example. Services often come with advertised availability such as 99.999% uptime. It is important for some services to be that reliable, but if the system doesn’t inherently support this level of availability, for example through proper production rollouts and automated testing, then it falls back on individuals to monitor for issues and keep systems up. This becomes a constant effort to ceaselessly monitor metrics and actively anticipate issues, which is not very sustainable even for a hero.
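For a sense of scale, the downtime that an availability target allows can be worked out directly. The sketch below is a simple illustration; the function name is hypothetical, and it assumes a 365-day year and counts all downtime against the target.

```python
def downtime_budget_minutes(availability_percent: float,
                            period_days: float = 365.0) -> float:
    """Minutes of downtime allowed over a period for a given availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per year")
```

At 99.999%, the yearly budget is a little over five minutes. A person watching dashboards cannot reliably detect, diagnose, and fix an outage inside that window, which is why this level of availability has to come from the system itself.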
The system itself should be built to support the desired reliability: automated testing that kicks off for every new release, safeguards for rolling out updates with the ability to roll back a problematic release, fallback redundancy for critical systems, a well-defined incident response protocol, and automated monitoring, reporting, and alerts. All of these can be built into a system to provide robustness and confidence. Be clear about reliability metrics: not everything that can be measured is important, and not every metric provides equal value. If you have SLAs, whether internal or customer-facing, be sure they provide clear value. Then map them to the underlying systems and behaviors. In this way, work can be focused on improving the reliability of the system itself rather than burning out individuals who are firefighting to meet unsustainable demands.
By prioritizing system improvements over heroics, you can achieve consistent reliability without burning out your team. Say goodbye to heroes and hello to a robust, dependable system. In short, no heroes please.