As digital products and services become more deeply embedded in critical industries and infrastructure, and the consequences of failure grow in scope, engineering organizations are renewing their focus on building platforms that are scalable, reliable and secure. More importantly, they are recognizing that this isn’t the responsibility of an SRE or CSO alone; it’s an ethos that every engineer on the team must carry every day. This is an important shift: from teams and cultures that valued speed of delivery above all else to teams and cultures that make high-quality systems the foundation of everything they do.
This isn’t easy, and big cultural changes don’t happen overnight. Many things can go wrong when a team operates in the legacy mode of prioritizing quick time-to-market, but there are also strategies and best practices for beginning the process of cultural change across your engineering organization.
The Old Way of Doing Things
You’re the engineer on call, and you get woken up at 3:00 a.m. by an alert that tells you an app is down. Groggy and annoyed, you look at the message and have no idea what service it’s referring to or where to begin.
After nearly an hour, you can’t figure out who owns the service or how to get it up and running. So, you call that senior engineer on your team—you know the one—the person who gets called every time there’s a tricky question or problem. Thankfully, they answer your call in the middle of the night and help you bring the app back up. But in a month, the same scenario happens again; only this time, when you go to call that senior engineer, they’re no longer there. They quit to go work at a company that cares about reliability.
Best Practices for Today
How can you avoid this vicious cycle? Here are some best practices you can implement to begin shifting focus away from speed and toward scalability, reliability and security.
1. Evangelize the Narrative That the Systems You Work on Matter
At the end of the day, your systems are there to serve your customers and users. Leadership and teammates who embody these values are incredibly important. People should worry about letting their customers down, not about being yelled at by their manager or receiving a bad performance review.
2. Set Explicit Expectations about Uptime
Saying that your app needs 99.99% uptime doesn’t have the same impact as saying it can be down for no more than 52.6 minutes a year. Translate uptime numbers so they’re tangible, and on people’s minds every time there’s an outage.
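To make that translation concrete, here is a minimal Python sketch of the arithmetic. The function name `allowed_downtime_minutes` is made up for illustration, and the calculation assumes a 365-day year with no allowance for planned maintenance windows.

```python
# Minutes in a 365-day year (leap years ignored for simplicity).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(uptime_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget, in minutes, implied by an uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_minutes * (100 - uptime_percent) / 100


# Print the yearly downtime budget for a few common targets.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime = {minutes:.1f} minutes of downtime per year")
```

For 99.99% this works out to about 52.6 minutes a year, the figure quoted above. Dropping a single nine, to 99.9%, raises the budget to roughly 525.6 minutes, or almost nine hours.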
3. Don’t Sweep Problems Under the Rug
After the dust settles from an incident, it’s important to get together and talk about it. Be transparent about what happened, which person or team made the error and what the implications for users were. The idea isn’t to point fingers or assign blame but to bring the team together so that you can learn, evolve and figure out how to prevent this from happening in the future.
Here are some questions you can use:
- Who wrote the code?
- How was it tested?
- Should our automated testing have caught this?
- Who reviewed this code?
- Were they the right people to review the code?
- Why wasn’t it caught in the staging environment?
- Who released it and who went through the release validation checklist?
Formalizing this process lets everyone on the team know that mistakes are part of the job and that there’s a standard way that problems are handled. This is a great way to help engineers feel free to do their best work without fear or avoidance.
4. Define Individual Owners for Every Piece of the System
If there’s any confusion about who owns what within your engineering org, you need to address that right away; it’s a massive problem. Not only do you need an owner for every part of the system, but they must also be empowered to fix issues and know that they’re the ones on the hook.
5. Invest in Documentation, Runbooks and Logs
The owner of each piece of the application is also responsible for making sure that documentation is up to date, runbooks are clear and easy to follow and logs are readily accessible. When you’re getting paged at 3:00 a.m. about a service you’ve never heard of before, these resources are key.
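One way to keep this honest is to treat ownership and runbook links as data that can be checked automatically. The sketch below is only an illustration under assumptions: the `ServiceEntry` fields, the service names and the example.com URLs are all hypothetical, and a real team would more likely pull this information from whatever service catalog it already uses.

```python
from dataclasses import dataclass

# Fields every service is expected to fill in (an assumed team standard).
REQUIRED_FIELDS = ("owner", "docs_url", "runbook_url", "logs_url")


@dataclass(frozen=True)
class ServiceEntry:
    """One row in a hypothetical service catalog."""
    name: str
    owner: str = ""
    docs_url: str = ""
    runbook_url: str = ""
    logs_url: str = ""


def missing_fields(entry: ServiceEntry) -> list[str]:
    """List the required fields that are empty for this service."""
    return [field for field in REQUIRED_FIELDS
            if not getattr(entry, field).strip()]


def audit(catalog: list[ServiceEntry]) -> dict[str, list[str]]:
    """Map each incomplete service to the fields it is missing."""
    gaps = {}
    for entry in catalog:
        missing = missing_fields(entry)
        if missing:
            gaps[entry.name] = missing
    return gaps


# Hypothetical catalog: one complete entry, one with gaps.
CATALOG = [
    ServiceEntry(
        name="checkout-api",
        owner="payments-team",
        docs_url="https://wiki.example.com/checkout-api",
        runbook_url="https://wiki.example.com/checkout-api/runbook",
        logs_url="https://logs.example.com/checkout-api",
    ),
    ServiceEntry(
        name="legacy-reports",
        docs_url="https://wiki.example.com/legacy-reports",
    ),
]

for name, missing in audit(CATALOG).items():
    print(f"{name}: missing {', '.join(missing)}")
```

Run as part of a regular check, a report like this surfaces the service nobody owns before it pages someone at 3:00 a.m.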
6. Prioritize the Delivery Pipeline Like It’s a Key Feature
You should always be thinking about how to build safeguards around your workflows, like automated testing or new tools. This may mean you need to go toe-to-toe with product managers to make sure the work required to build a robust delivery system isn’t getting kicked down to the backlog.
7. Get Business Buy-In by Using Terms They Understand
Unreliable code is technical debt. And debt is something all leaders understand, regardless of their technical abilities. Stripe published a survey in 2018 that found the average developer spends more than 17 hours a week dealing with maintenance issues, costing $300 billion per year globally. If you’re pushing to invest in SRE practices, those numbers might help you make your case.
Culture comes first, but the right tools can reinforce the culture and drive accountability over time. Tools that help teams keep track of critical information such as service owners, documentation and runbooks can help tremendously. Additionally, having easily available dependency maps of the architecture is key to quickly understanding complex systems. Lastly, setting organization-wide standards for best practices and providing templates for spinning up new services that follow them can help ensure that these things are not forgotten.
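As a rough illustration of why a dependency map helps during an incident, the Python sketch below answers one question: if this service fails, what else is affected? The service names and the `DEPENDS_ON` mapping are invented for the example, and the map is assumed to be maintained somewhere the on-call engineer can reach it.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["database"],
    "inventory": ["database"],
    "database": [],
}


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map so each dependency points at its dependents.
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk upward from the failed service.
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - {failed}


print(sorted(impacted_by("database", DEPENDS_ON)))
```

With this sample map, a database failure reaches checkout, inventory and payments, while a checkout failure affects nothing upstream of it.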