Operations

Runbooks & On-Call

Foundational

When something breaks at 3 a.m., the person responding should not have to work out how the system behaves while under pressure. Runbooks are short, practical guides for operating and recovering services. Paired with a clear, humane on-call setup, they turn a panicked scramble into a calm, repeatable response.

A runbook answers "what do I do when X happens?": how to tell if the service is healthy, common failures and their fixes, how to restart, roll back, or scale, what it depends on, and how to escalate. On-call is the human side: who is responsible right now, how they are alerted, and how the rota stays sustainable. Together they make operations a shared, documented capability rather than knowledge stuck in one person's head.
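As a rough illustration, the runbook contents listed above can be treated as a structure that is checked for gaps, and "who is responsible right now" can be computed from a simple weekly rota. This is a minimal sketch, not a prescribed format: `Runbook`, `missing_sections`, `on_call_for`, the field names, and the seven-day handover are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Runbook:
    """Hypothetical structure mirroring the questions a runbook should answer."""
    service: str
    health_checks: list[str] = field(default_factory=list)          # how to tell it is healthy
    common_failures: dict[str, str] = field(default_factory=dict)   # symptom -> fix
    procedures: dict[str, str] = field(default_factory=dict)        # "restart", "rollback", "scale"
    dependencies: list[str] = field(default_factory=list)           # what the service relies on
    escalation: list[str] = field(default_factory=list)             # who to contact, in order


# Assumed for this sketch: every runbook documents these three procedures.
REQUIRED_PROCEDURES = ("restart", "rollback", "scale")


def missing_sections(runbook: Runbook) -> list[str]:
    """Return the names of sections that are still empty."""
    missing = []
    if not runbook.health_checks:
        missing.append("health_checks")
    if not runbook.common_failures:
        missing.append("common_failures")
    for name in REQUIRED_PROCEDURES:
        if not runbook.procedures.get(name):
            missing.append(f"procedures.{name}")
    # An empty list is flagged even if the service truly has no dependencies,
    # so that "none" gets written down rather than left implicit.
    if not runbook.dependencies:
        missing.append("dependencies")
    if not runbook.escalation:
        missing.append("escalation")
    return missing


def on_call_for(day: date, rota: list[str], start: date) -> str:
    """Return who holds the pager on `day`, assuming a weekly handover from `start`."""
    if not rota:
        raise ValueError("rota must contain at least one person")
    if day < start:
        raise ValueError("day is before the rota start date")
    weeks_elapsed = (day - start).days // 7
    return rota[weeks_elapsed % len(rota)]
```

A check like `missing_sections` could run in review or CI so that an incomplete runbook is caught before an incident rather than during one. The rota function is deliberately simple; a real schedule would also need to handle swaps, holidays, and time zones.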

Runbooks and on-call support Incident Readiness (the response process), Observability (the signals), and Ownership & Accountability (you operate what you build). For a largely junior team, good runbooks are also how knowledge spreads and how a newer engineer can safely hold the pager.

Write runbooks people can actually use

Run on-call humanely

Self-review checklist

Why it matters: Incidents are won or lost on preparation. Clear runbooks and a humane, well-defined on-call turn an outage into a calm, fast recovery and stop reliability from depending on one heroic person. They also spread operational knowledge across the team, which is essential when most engineers are still building that experience.