SLO Playbook: Setting Objectives That Actually Drive Engineering
How to set service level objectives that your team will actually use — the process, the templates, and the error budget thinking that makes reliability engineering practical.
Most teams set SLOs and then ignore them. They live in a dashboard nobody opens until there's an incident. The SLO number is a guess, the error budget is theoretical, and reliability discussions continue to be driven by whoever shouts loudest.
This guide is about making SLOs useful — not just correct on paper, but genuinely driving how your team makes decisions about what to ship and when.
The Purpose of an SLO
An SLO is not primarily about measuring reliability. It's about enabling a conversation between product and engineering about how reliable is reliable enough.
Without SLOs:
- "We should be more reliable" — but what does that mean?
- "This deploy is risky" — but risky compared to what?
- "We're spending too much time on reliability" — says who?
With SLOs:
- "We have 35% of our error budget remaining this month"
- "This deploy historically adds 0.5% error rate. We can afford 2% before we breach the objective"
- "We've breached the SLO twice this quarter — reliability work is justified"
The SLO turns reliability into an engineering problem with clear numbers.
The Process: From Customer Journey to Alert
Step 1: Start With the User, Not the System
The most common mistake: starting with "what can we easily measure?" instead of "what does the user actually experience?"
Start here:
| Customer journey | What success looks like |
|---|---|
| User signs in | Login completes in < 2 seconds |
| User runs a workflow | Workflow completes without error |
| AI agent processes a request | Response within 5 seconds, correct output |
| User uploads a document | Upload succeeds, processing completes within 30 seconds |
Then ask: what measurement (SLI) tells us whether this journey is succeeding?
Step 2: Choose Your SLI
The indicator is the measurable proxy for user experience.
Latency SLI:
p95 of [endpoint] response time, measured at the load balancer,
excluding requests that resulted in 4xx client errors
Availability SLI:
Proportion of requests to [service] that result in a successful response
(HTTP 2xx or expected 3xx), measured over a rolling 28-day window
Quality SLI (for AI features):
Proportion of AI-generated responses rated "helpful" by users
(thumbs up / total rated), measured weekly
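To make these definitions concrete, here is a minimal Python sketch of the availability SLI above, evaluated over a rolling window. The Request record is a stand-in for however you actually store request logs; treat it as an illustration, not a reference implementation.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Request:
    timestamp: datetime
    status: int  # HTTP response status code

def availability_sli(log: list[Request], now: datetime, window_days: int = 28) -> float:
    """Proportion of requests in the rolling window that succeeded (2xx/3xx)."""
    cutoff = now - timedelta(days=window_days)
    in_window = [r for r in log if r.timestamp >= cutoff]
    if not in_window:
        return 1.0  # no traffic in the window, so nothing has failed
    ok = sum(1 for r in in_window if 200 <= r.status < 400)
    return ok / len(in_window)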
Key principle: measure what the user experiences, not what's easy to instrument. It doesn't help users that your database is up if the API layer is returning timeouts.
Step 3: Set the Objective
The format:
[SLI measure] [comparator] [threshold] over [rolling window]
Real examples:
- p95 API latency < 500ms over a 28-day rolling window
- Error rate < 0.5% over a 7-day rolling window
- Workflow completion rate > 98% over 28 days
- AI response quality rating > 4.0/5.0 over a 7-day rolling average
How to pick the threshold:
Don't guess. Use your historical data.
-- Find your current p95 latency over the past 90 days
SELECT
date_trunc('day', created_at) as day,
percentile_cont(0.95) WITHIN GROUP (ORDER BY response_time_ms) as p95
FROM request_logs
WHERE created_at > NOW() - INTERVAL '90 days'
GROUP BY 1
ORDER BY 1;
Set the initial SLO at roughly your 75th percentile performance. If your daily p95 is under 400ms on 85% of days, set the SLO at 500ms. This gives you room to work without constant breaches.
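If you export those daily p95 values, picking the starting target is one line of statistics. A small Python sketch, with illustrative numbers rather than real data:
import statistics

# Daily p95 latencies (ms) from the query above; illustrative values only.
daily_p95 = [380, 410, 395, 520, 405, 390, 465, 400, 385, 430]

# 75th percentile of daily p95s, then round up to a number you can defend in review.
q3 = statistics.quantiles(daily_p95, n=4)[2]
print(f"75th percentile of daily p95: {q3:.0f}ms -> propose a 500ms objective")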
How to pick the window:
| Window | Use when |
|---|---|
| 7-day rolling | Fast-moving products, frequent deploys |
| 28-day rolling | Stable products, monthly business cycles |
| Calendar month | Financial reporting alignment |
28-day rolling is my default. It smooths out weekly patterns (traffic on Monday vs Friday) without letting problems accumulate too long.
Error Budget: The Decision Tool
The error budget is what makes SLOs operational. It answers: "how much can we fail this month?"
Error budget = 1 - SLO
| SLO | Monthly budget (minutes) | Weekly budget (minutes) |
|---|---|---|
| 99.9% | 43.2 min | 10.1 min |
| 99.5% | 216 min | 50.4 min |
| 99.0% | 432 min | 100.8 min |
| 95.0% | 2,160 min | 504 min |
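The table is just arithmetic on the window length. A minimal Python sketch that reproduces it, useful if you need budgets for other targets or windows:
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of total downtime the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60

for slo in (0.999, 0.995, 0.99, 0.95):
    print(f"{slo:.1%}: {error_budget_minutes(slo, 30):7.1f} min/month, "
          f"{error_budget_minutes(slo, 7):6.1f} min/week")
# 99.9%: 43.2 min/month, 10.1 min/week (matches the table above)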
The question that makes budgets real: "If we deploy this risky change and it causes a 2-hour incident, can we afford that this month?"
If the answer is no, don't deploy. If yes, ship it and monitor.
Error Budget Decision Framework
flowchart TD
A[Want to ship something risky?] --> B{Check error budget}
B -->|>50% remaining| C[Ship it — you have room]
B -->|20-50% remaining| D[Ship with feature flag<br/>and quick rollback plan]
B -->|<20% remaining| E[Defer until next period<br/>or reduce risk first]
B -->|Exhausted| F[Freeze new features<br/>reliability work only]
This is the conversation the SLO enables. Not "should we be reliable?" but "given our remaining budget, which option do we take?"
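If you want the check to be mechanical rather than a judgment call in the moment, the flowchart reduces to a few lines. A Python sketch using the same thresholds shown above:
def deploy_decision(budget_remaining: float) -> str:
    """Map remaining error budget (fraction, 0.0-1.0) to the flowchart branch."""
    if budget_remaining <= 0.0:
        return "Freeze new features: reliability work only"
    if budget_remaining < 0.20:
        return "Defer until next period, or reduce risk first"
    if budget_remaining < 0.50:
        return "Ship with a feature flag and a quick rollback plan"
    return "Ship it: you have room"

print(deploy_decision(0.35))  # Ship with a feature flag and a quick rollback plan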
Templates
The SLO Card
For each service, maintain a one-page SLO card:
## SLO Card: [Service Name]
**Owner:** [Team or person]
**Last reviewed:** YYYY-MM-DD
### Objectives
| SLI | Target | Window | Alert threshold |
|---|---|---|---|
| API p95 latency | < 500ms | 28-day rolling | Burn rate > 5x |
| Error rate | < 0.5% | 28-day rolling | Burn rate > 5x |
| Workflow completion | > 98% | 28-day rolling | Drop below 96% |
### Error Budgets
| SLO | Monthly budget | Current consumption |
|---|---|---|
| API latency | 43 min/month | [live link] |
| Error rate | 216 min/month | [live link] |
### Runbooks
- [High error rate runbook](link)
- [Latency regression runbook](link)
### Review notes
[Notes from most recent SLO review — what changed, what we learned]
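The alert thresholds on the card reference burn rate: how fast you're consuming error budget relative to the pace the SLO allows. A minimal sketch of the calculation in Python (the helper name and example numbers are illustrative):
def burn_rate(errors: int, total: int, slo: float) -> float:
    """Ratio of observed error rate to the error rate the SLO allows.

    1.0 means on pace to exhaust the budget exactly at the end of the window;
    5.0 means burning five times too fast (the page threshold on the card).
    """
    allowed = 1 - slo  # e.g. 0.005 for a 99.5% objective
    return (errors / total) / allowed

# 120 failures out of 10,000 requests in the last hour, against a 99.5% SLO:
print(burn_rate(120, 10_000, 0.995))  # 2.4: elevated, but below the 5x page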
The Prometheus SLO Config (with Sloth)
# sloth.yaml — generates Prometheus recording rules and alerts
version: "prometheus/v1"
service: "order-service"
slos:
  - name: "requests-availability"
    objective: 99.5
    description: "Order service requests should succeed 99.5% of the time"
    sli:
      events:
        error_query: sum(rate(http_requests_total{job="order-service",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{job="order-service"}[{{.window}}]))
    alerting:
      name: OrderServiceAvailabilityAlert
      labels:
        severity: "critical"
      annotations:
        runbook: "https://runbooks.internal/order-service/availability"
# Generate Prometheus rules from Sloth config
sloth generate -i sloth.yaml -o prometheus-rules.yaml
Common Mistakes
Setting aspirational targets, not honest ones. An SLO set at 99.99% when you currently achieve 99.5% will be permanently breached. Start at your current performance, then tighten over time as the system improves.
Not connecting SLOs to user impact. "p99 database query time < 10ms" isn't a user SLO — users don't experience database queries directly. Connect it to user experience: "page load time < 2 seconds at p95."
Measuring availability without measuring quality. A feature that responds successfully but gives wrong answers is worse than one that's temporarily unavailable. Add quality SLIs for AI features.
Ignoring the error budget. If nobody checks the error budget before deploying, the SLO is decoration. Add "check error budget" to your deployment checklist.
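One way to make that checklist item stick is to automate it. A hypothetical pre-deploy gate in Python against the Prometheus HTTP API; the recording-rule name below is an assumption, not something Sloth emits by default, so adapt the URL and query to your setup:
# deploy_gate.py: fail the pipeline when error budget is nearly spent.
import sys

import requests

PROM_URL = "http://prometheus.internal:9090/api/v1/query"  # assumed endpoint
QUERY = 'slo:error_budget_remaining:ratio{service="order-service"}'  # assumed rule

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
remaining = float(result[0]["value"][1]) if result else 0.0

if remaining < 0.20:  # mirrors the "<20% remaining" branch of the flowchart
    print(f"Error budget at {remaining:.0%}: defer risky deploys")
    sys.exit(1)
print(f"Error budget at {remaining:.0%}: OK to ship")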
Annual reviews. Review SLOs quarterly at minimum. User expectations change, product evolves, and an SLO that was ambitious a year ago might be embarrassingly low now.
The Monthly SLO Review
30 minutes, once a month, with the engineering lead and product manager:
- Status: Which SLOs are we meeting? Which did we breach?
- Budget: How much error budget did we consume? On what?
- Trends: Are we getting more or less reliable over time?
- Target review: Are current targets still the right targets?
- Action items: What reliability investments are justified by the data?
The review creates accountability without heroics. When a breach happens, you discuss what system change would prevent recurrence. When you're comfortably within budget, you confirm you're not over-engineering reliability at the expense of features.