Reliability or Resilience?
Reliability. Resilience.
These terms and concepts overlap. I admit, I usually do not have a specific one in mind when I apply them to operations, teams and organizations. I certainly make no claim to being true to the research on the subject.
But I find it helps to broadly categorize them like this:
- Reliability informs external actors of how to tune their response to your system, service or application. That is, what strategy must they adopt given the cost of using your system?
- Resilience informs you about your internal capacity to keep a healthy operational position. That is, your capability to keep that reliability afloat and ensure your users remain confident.
So, which one to use? I think either works, but we must acknowledge that each has a particular facet that colors the discussion.
If you say “we are building our reliability”, say by implementing SRE, I would argue that the discussion is framed from the perspective of your users’ experience of your application or product.
It says nothing about the internal struggles, challenges and capacity deployed to build this reliability.
Taking reliability without resilience means being reactive, not proactive. In a world that is continuously transforming itself, being reactive is in direct conflict with leading or being innovative.
So, should we focus only on resilience? No. Resilience on its own doesn’t respond to your users’ expectations. They can only act on how reliable you are. Your resilience has limited impact on them if you don’t back it with concrete reliability responses. Beyond that, resilience is not something that is easily built directly. It grows organically through events that can be personal or happen at the team and organizational level.
For instance, think of being on call during an incident, introducing a regression that hurt users, layoffs, changing teams, projects being canned, a competitor’s innovation, an important customer switching to a different strategy…
These events may be used as opportunities for a person, a team or an organization to reflect and increase their capacity to deal with the future.
Right, this is about being adaptable then? Well, yes and no.
A healthy system continuously adapts to new conditions. Adapting is a voluntary action to change. In effect, a system capable of resilience may support a more appropriate adaptation, which in turn leads to greater reliability. The operative word is “appropriate”. For instance, fixing defects isn’t always the right response to a given behavior. Perhaps a more appropriate adaptation is a change in architecture. Resilience helps teams and organizations trust themselves enough to challenge their own responses.
Resilience informs our capacity to make appropriate changes.
We adapt to lots of conditions without thinking too much about the process. Still, there are times when the resilience we’ve built along the way should help improve the quality and efficiency of that adaptation.
Where does this leave us?
I believe reliability is concrete and objective. It is something you actively build, measure and report on, and something your users work with. Reliability varies as system and operational characteristics adapt to continuous stressors.
Resilience is at the heart of this system, though. It tells us about the capacity of the organization, team or person to orchestrate these evolutions.
Forgetting to care about resilience will quickly lead to burnout, turnover or disengagement.
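To make "build, measure and report on" a little more tangible, here is a minimal Python sketch of one common way to put a number on reliability: an availability ratio compared against a target, with the share of the error budget that remains. The function names, the request counts and the 99.9% target are all my own assumptions for illustration. Your measure of reliability may look quite different.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded (1.0 when there was no traffic)."""
    if total_requests < 0 or not 0 <= good_requests <= total_requests:
        raise ValueError("expected 0 <= good_requests <= total_requests")
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int, target: float) -> float:
    """Share of the error budget still unspent; negative once it is overspent.

    `target` is a hypothetical availability objective, e.g. 0.999.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    # Failures we could tolerate over this window while still meeting the target.
    allowed_failures = (1.0 - target) * total_requests
    if allowed_failures == 0:
        return 1.0  # no traffic, so nothing has been spent
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures
```

With 999,500 successful requests out of 1,000,000 and a 99.9% target, roughly half of the budget would be left. The number only describes what your users see. It says nothing about what it cost the team to get there.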
We have several approaches to work on both: retrospectives, incident management, game days, chaos engineering. They all focus on getting better.
Reliability and Resilience are not opposed. They work together in a virtuous circle that benefits your organization, teams and customers.