Site Reliability Engineering was invented at Google, where teams have the headcount for dedicated SREs, toil-reduction programmes, and extensive capacity planning functions. Most companies don't. That doesn't make the principles inapplicable; it means the implementation needs to be lighter.
SLOs: pick two metrics and measure them honestly
Don't start with a comprehensive SLO programme. Start with two metrics that matter to your users: probably availability and latency at a meaningful percentile (p95 or p99, not p50). Define what "good" looks like. Measure it. Make it visible. That's it for now.
The value of an SLO isn't the number. It's the discipline of having agreed, in advance, what good looks like — so that when something degrades, you have a shared language for how bad it is.
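As a rough illustration of "pick two metrics and measure them honestly", the sketch below computes availability and p95 latency from a batch of request records and compares them to targets. The `Request` record, the rule that any 5xx response counts as a failure, and both target values are assumptions made for the example; substitute whatever your own logs and users suggest.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


# Hypothetical targets, chosen only for this example.
AVAILABILITY_TARGET = 0.995
P95_LATENCY_TARGET_MS = 300.0


def availability(requests):
    """Fraction of requests that did not end in a server error (5xx)."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct% of all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def slo_report(requests):
    """Measure both metrics and compare each against its target."""
    avail = availability(requests)
    p95 = percentile([r.latency_ms for r in requests], 95)
    return {
        "availability": avail,
        "availability_ok": avail >= AVAILABILITY_TARGET,
        "p95_latency_ms": p95,
        "latency_ok": p95 <= P95_LATENCY_TARGET_MS,
    }


sample = [
    Request(200, 120.0),
    Request(200, 180.0),
    Request(503, 950.0),
    Request(200, 95.0),
]
print(slo_report(sample))
```

Running something like this on a schedule and posting the result where the team can see it covers the "make it visible" step without any dedicated tooling.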
Error budgets as release gates
The practical version for small teams: if your availability SLO is 99.5% and you've used 80% of your error budget this month, that's a signal to slow down feature work and focus on reliability. You don't need a formal process. You need one person who owns the number and has the authority to slow down releases when it's trending badly.
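To make the arithmetic concrete: a 99.5% availability SLO over a 30-day month leaves a budget of about 216 minutes of downtime, so 80% of the budget is roughly 173 minutes. The sketch below turns that into a simple check. The function names, the 30-day window, and the use of downtime minutes as the unit are assumptions made for illustration, not a prescribed method.

```python
def error_budget_minutes(slo, window_days=30):
    """Downtime allowed by the SLO over the window, in minutes."""
    return (1 - slo) * window_days * 24 * 60


def budget_used(downtime_minutes, slo, window_days=30):
    """Fraction of the error budget consumed so far (can exceed 1.0)."""
    return downtime_minutes / error_budget_minutes(slo, window_days)


def should_slow_down(downtime_minutes, slo=0.995, threshold=0.8):
    """True once the consumed share of the budget reaches the threshold."""
    return budget_used(downtime_minutes, slo) >= threshold


# Hypothetical month so far: 180 minutes of user-facing downtime.
downtime = 180
print(f"budget: {error_budget_minutes(0.995):.0f} min")
print(f"used:   {budget_used(downtime, 0.995):.0%}")
print(f"slow down releases: {should_slow_down(downtime)}")
```

The person who owns the number can run this against the month's incident log. The threshold is a team decision, and 0.8 here simply mirrors the 80% figure above.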
Blameless postmortems, lightweight version
For every incident that causes user impact: write down what happened, what the timeline was, what the contributing factors were, and what would prevent recurrence. No names in the contributing factors section — only systems, processes, and decisions. Share it with the team. This is the entire practice, stripped to its core.
"A blameless postmortem is not about making people feel better. It's about making the system better."
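One way to keep the practice lightweight is to treat the four parts of the write-up as a fixed template. The sketch below is one possible shape for that template, rendered as plain text for sharing. The `Postmortem` class, its field names, and the sample incident are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    title: str
    what_happened: str
    # (time, event) pairs in the order they occurred
    timeline: list = field(default_factory=list)
    # Systems, processes, and decisions only; no names.
    contributing_factors: list = field(default_factory=list)
    # Changes that would prevent recurrence
    prevention: list = field(default_factory=list)

    def render(self):
        """Return the postmortem as plain text, ready to share."""
        lines = [self.title, "", "What happened", self.what_happened]
        lines += ["", "Timeline"]
        lines += [f"  {when}  {event}" for when, event in self.timeline]
        lines += ["", "Contributing factors"]
        lines += [f"  - {item}" for item in self.contributing_factors]
        lines += ["", "Preventing recurrence"]
        lines += [f"  - {item}" for item in self.prevention]
        return "\n".join(lines)


# Invented incident, for illustration only.
pm = Postmortem(
    title="Checkout errors after config deploy",
    what_happened="Checkout returned errors for about 25 minutes.",
    timeline=[
        ("14:02", "Config change deployed"),
        ("14:05", "Error-rate alert fired"),
        ("14:27", "Change rolled back, errors stopped"),
    ],
    contributing_factors=[
        "Config changes skip the staging environment",
        "Rollback requires a manual step that is not documented",
    ],
    prevention=[
        "Route config changes through staging",
        "Document and script the rollback",
    ],
)
print(pm.render())
```

Note that the contributing factors in the sample describe systems and processes rather than who did what, which is the property the template is meant to protect.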
The full SRE book is worth reading. Apply it selectively, starting with the parts that address your most frequent pain points.