SLO Playbook: Setting Objectives That Actually Drive Engineering
How to set service level objectives that your team will actually use — the process, the templates, and the error budget thinking that makes reliability engineering practical.
Most teams set SLOs and then ignore them. They live in a dashboard nobody opens until there's an incident. The SLO number is a guess, the error budget is theoretical, and reliability discussions continue to be driven by whoever shouts loudest.
This guide is about making SLOs useful — not just correct on paper, but genuinely driving how your team makes decisions about what to ship and when.
The Purpose of an SLO
An SLO is not primarily about measuring reliability. It's about enabling a conversation between product and engineering about how reliable is reliable enough.
Without SLOs:
- "We should be more reliable" — but what does that mean?
- "This deploy is risky" — but risky compared to what?
- "We're spending too much time on reliability" — says who?
With SLOs:
- "We have 35% of our error budget remaining this month"
- "This deploy historically adds 0.5% error rate. We can afford 2% before we breach the objective"
- "We've breached the SLO twice this quarter — reliability work is justified"
The SLO turns reliability into an engineering problem with clear numbers.
The Process: From Customer Journey to Alert
Step 1: Start With the User, Not the System
The most common mistake: starting with "what can we easily measure?" instead of "what does the user actually experience?"
Start here:
| Customer journey | What success looks like |
|---|---|
| User signs in | Login completes in < 2 seconds |
| User runs a workflow | Workflow completes without error |
| AI agent processes a request | Response within 5 seconds, correct output |
| User uploads a document | Upload succeeds, processing completes within 30 seconds |
Then ask: what measurement (SLI) tells us whether this journey is succeeding?
Step 2: Choose Your SLI
The indicator is the measurable proxy for user experience.
Latency SLI:
p95 of [endpoint] response time, measured at the load balancer,
excluding requests that resulted in 4xx client errors
Availability SLI:
Proportion of requests to [service] that result in a successful response
(HTTP 2xx or expected 3xx), measured over a rolling 28-day window
Quality SLI (for AI features):
Proportion of AI-generated responses rated "helpful" by users
(thumbs up / total rated), measured weekly
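To make these definitions concrete, here is a minimal Python sketch of the availability SLI above, evaluated over a rolling window. The Request record is a stand-in for however you actually store request logs; treat it as an illustration, not a reference implementation.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Request:
    timestamp: datetime
    status: int  # HTTP response status code

def availability_sli(log: list[Request], now: datetime, window_days: int = 28) -> float:
    """Proportion of requests in the rolling window that succeeded (2xx/3xx)."""
    cutoff = now - timedelta(days=window_days)
    in_window = [r for r in log if r.timestamp >= cutoff]
    if not in_window:
        return 1.0  # no traffic in the window, so nothing has failed
    ok = sum(1 for r in in_window if 200 <= r.status < 400)
    return ok / len(in_window)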
Key principle: measure what the user experiences, not what's easy to instrument. It doesn't help users that your database is up if the API layer is returning timeouts.
Step 3: Set the Objective
The format:
[SLI measure] [comparator] [threshold] over [rolling window]
Real examples:
- p95 API latency < 500ms over a 28-day rolling window
- Error rate < 0.5% over a 7-day rolling window
- Workflow completion rate > 98% over 28 days
- AI response quality rating > 4.0/5.0 over a 7-day rolling average
How to pick the threshold:
Don't guess. Use your historical data.
-- Find your current p95 latency over the past 90 days
SELECT
date_trunc('day', created_at) as day,
percentile_cont(0.95) WITHIN GROUP (ORDER BY response_time_ms) as p95
FROM request_logs
WHERE created_at > NOW() - INTERVAL '90 days'
GROUP BY 1
ORDER BY 1;
Set the initial SLO at roughly your 75th percentile performance. If your daily p95 is under 400ms on 85% of days, set the SLO at 500ms. This gives you room to work without constant breaches.
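If you export those daily p95 values, picking the starting target is one line of statistics. A small Python sketch, with illustrative numbers rather than real data:
import statistics

# Daily p95 latencies (ms) from the query above; illustrative values only.
daily_p95 = [380, 410, 395, 520, 405, 390, 465, 400, 385, 430]

# 75th percentile of daily p95s, then round up to a number you can defend in review.
q3 = statistics.quantiles(daily_p95, n=4)[2]
print(f"75th percentile of daily p95: {q3:.0f}ms -> propose a 500ms objective")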
How to pick the window:
| Window | Use when |
|---|---|
| 7-day rolling | Fast-moving products, frequent deploys |
| 28-day rolling | Stable products, monthly business cycles |
| Calendar month | Financial reporting alignment |
28-day rolling is my default. It smooths out weekly patterns (traffic on Monday vs Friday) without letting problems accumulate too long.
Error Budget: The Decision Tool
The error budget is what makes SLOs operational. It answers: "how much can we fail this month?"
Error budget = 1 - SLO
| SLO | Monthly budget (minutes) | Weekly budget (minutes) |
|---|---|---|
| 99.9% | 43.2 min | 10.1 min |
| 99.5% | 216 min | 50.4 min |
| 99.0% | 432 min | 100.8 min |
| 95.0% | 2,160 min | 504 min |
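The table is just arithmetic on the window length. A minimal Python sketch that reproduces it, useful if you need budgets for other targets or windows:
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of total downtime the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60

for slo in (0.999, 0.995, 0.99, 0.95):
    print(f"{slo:.1%}: {error_budget_minutes(slo, 30):7.1f} min/month, "
          f"{error_budget_minutes(slo, 7):6.1f} min/week")
# 99.9%: 43.2 min/month, 10.1 min/week (matches the table above)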
The question that makes budgets real: "If we deploy this risky change and it causes a 2-hour incident, can we afford that this month?"
If the answer is no, don't deploy. If yes, ship it and monitor.
Error Budget Decision Framework
flowchart TD
A[Want to ship something risky?] --> B{Check error budget}
B -->|>50% remaining| C[Ship it — you have room]
B -->|20-50% remaining| D[Ship with feature flag<br/>and quick rollback plan]
B -->|<20% remaining| E[Defer until next period<br/>or reduce risk first]
B -->|Exhausted| F[Freeze new features<br/>reliability work only]
This is the conversation the SLO enables. Not "should we be reliable?" but "given our remaining budget, which option do we take?"
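If you want the check to be mechanical rather than a judgment call in the moment, the flowchart reduces to a few lines. A Python sketch using the same thresholds shown above:
def deploy_decision(budget_remaining: float) -> str:
    """Map remaining error budget (fraction, 0.0-1.0) to the flowchart branch."""
    if budget_remaining <= 0.0:
        return "Freeze new features: reliability work only"
    if budget_remaining < 0.20:
        return "Defer until next period, or reduce risk first"
    if budget_remaining < 0.50:
        return "Ship with a feature flag and a quick rollback plan"
    return "Ship it: you have room"

print(deploy_decision(0.35))  # Ship with a feature flag and a quick rollback plan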
Templates
The SLO Card
For each service, maintain a one-page SLO card:
## SLO Card: [Service Name]
**Owner:** [Team or person]
**Last reviewed:** YYYY-MM-DD
### Objectives
| SLI | Target | Window | Alert threshold |
|---|---|---|---|
| API p95 latency | < 500ms | 28-day rolling | Burn rate > 5x |
| Error rate | < 0.5% | 28-day rolling | Burn rate > 5x |
| Workflow completion | > 98% | 28-day rolling | Drop below 96% |
### Error Budgets
| SLO | Monthly budget | Current consumption |
|---|---|---|
| API latency | 43 min/month | [live link] |
| Error rate | 216 min/month | [live link] |
### Runbooks
- [High error rate runbook](link)
- [Latency regression runbook](link)
### Review notes
[Notes from most recent SLO review — what changed, what we learned]
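The alert thresholds on the card reference burn rate: how fast you're consuming error budget relative to the pace the SLO allows. A minimal sketch of the calculation in Python (the helper name and example numbers are illustrative):
def burn_rate(errors: int, total: int, slo: float) -> float:
    """Ratio of observed error rate to the error rate the SLO allows.

    1.0 means on pace to exhaust the budget exactly at the end of the window;
    5.0 means burning five times too fast (the page threshold on the card).
    """
    allowed = 1 - slo  # e.g. 0.005 for a 99.5% objective
    return (errors / total) / allowed

# 120 failures out of 10,000 requests in the last hour, against a 99.5% SLO:
print(burn_rate(120, 10_000, 0.995))  # 2.4: elevated, but below the 5x page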
The Prometheus SLO Config (with Sloth)
# sloth.yaml — generates Prometheus recording rules and alerts
version: "prometheus/v1"
service: "order-service"
slos:
  - name: "requests-availability"
    objective: 99.5
    description: "Order service requests should succeed 99.5% of the time"
    sli:
      events:
        error_query: sum(rate(http_requests_total{job="order-service",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="order-service"}[{{.window}}]))
    alerting:
      name: OrderServiceAvailabilityAlert
      labels:
        severity: "critical"
      annotations:
        runbook: "https://runbooks.internal/order-service/availability"
# Generate Prometheus rules from Sloth config
sloth generate -i sloth.yaml -o prometheus-rules.yaml
Common Mistakes
Setting aspirational targets, not honest ones. An SLO set at 99.99% when you currently achieve 99.5% will be permanently breached. Start at your current performance, then tighten over time as the system improves.
Not connecting SLOs to user impact. "p99 database query time < 10ms" isn't a user SLO — users don't experience database queries directly. Connect it to user experience: "page load time < 2 seconds at p95."
Measuring availability without measuring quality. A feature that responds successfully but gives wrong answers is worse than one that's temporarily unavailable. Add quality SLIs for AI features.
Ignoring the error budget. If nobody checks the error budget before deploying, the SLO is decoration. Add "check error budget" to your deployment checklist.
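One way to make that checklist item stick is to automate it. A hypothetical pre-deploy gate in Python against the Prometheus HTTP API; the recording-rule name below is an assumption, not something Sloth emits by default, so adapt the URL and query to your setup:
# deploy_gate.py: fail the pipeline when error budget is nearly spent.
import sys

import requests

PROM_URL = "http://prometheus.internal:9090/api/v1/query"  # assumed endpoint
QUERY = 'slo:error_budget_remaining:ratio{service="order-service"}'  # assumed rule

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
remaining = float(result[0]["value"][1]) if result else 0.0

if remaining < 0.20:  # mirrors the "<20% remaining" branch of the flowchart
    print(f"Error budget at {remaining:.0%}: defer risky deploys")
    sys.exit(1)
print(f"Error budget at {remaining:.0%}: OK to ship")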
Annual reviews. Review SLOs quarterly at minimum. User expectations change, product evolves, and an SLO that was ambitious a year ago might be embarrassingly low now.
The Monthly SLO Review
30 minutes, once a month, with the engineering lead and product manager:
- Status: Which SLOs are we meeting? Which did we breach?
- Budget: How much error budget did we consume? On what?
- Trends: Are we getting more or less reliable over time?
- Target review: Are current targets still the right targets?
- Action items: What reliability investments are justified by the data?
The review creates accountability without heroics. When a breach happens, you discuss what system change would prevent recurrence. When you're comfortably within budget, you confirm you're not over-engineering reliability at the expense of features.