Hey friend.
There’s a saying in London that if you stand in one place in the city for long enough, you’ll get a parking ticket. If you stand there a bit longer, you’ll get demolished and replaced with an office sky-rise or an upscale apartment block with a concierge that looks like a gargoyle.
The city is in a constant state of becoming; it’s a place where the past and the future are locked in an everlasting passive-aggressive argument.
And right now, our tech world is experiencing a very similar moment. Many of us are frantically building the future, a future powered by AI. And just like a new apartment block, it looks impressive on the outside, but someone needs to make sure that the foundation is solid.
This is where SREs come in.
Site Reliability Engineers have been the experts on building resilient backend infrastructure for many years.
However, AI does not quite work the same way. You cannot run it at scale on good old commodity hardware and keep a bunch of empty machines on standby for redundancy.
Instead, we’re going back to the supercomputer era, when a large number of tightly interconnected units all work on a single computational task. In such a system, the failure of any one unit is a failure of the whole process, which is often long and costly. Which gives you an error budget of zero.
So, how do we even begin to get a handle on this?
We need a new way of thinking. SRE needs to evolve.
One thing that can help us with this is STPA, or System-Theoretic Process Analysis. It sounds a bit dry, but there’s a certain elegance to it. STPA forces you to see the system as a whole; it’s a tool for finding the subtle, hidden flaws, so that a self-aware toaster doesn’t decide to take over your kitchen.
Of course, STPA is not a replacement for trusted SRE best practices. You still need your SLOs, your monitoring, your on-call rotations, and all that other good stuff. But STPA can be a powerful tool for taking your reliability expertise to the next level.
So that’s our challenge now: to evolve as the AI world evolves around us.