Runbook Template
When to use this template
Create a new runbook when onboarding a system into Anchor-managed production. Every managed system must have a runbook. Copy this template and fill in every section.
Template
# [System Name] Runbook
**Anchor Operator:** [Name]
**Development Team Contact:** [Name, Slack handle, phone]
**Last updated:** [Date]
## Architecture
[2-3 sentences describing the system. What does it do? What are its main components? Describe the data flow: where does data come in, how is it processed, and where does it go?]
## Dependencies
| Dependency | Type | Impact if Unavailable |
|-----------|------|----------------------|
| [e.g., PostgreSQL] | Database | Service cannot start |
| [e.g., Redis] | Cache | Degraded performance |
| [e.g., Stripe API] | External API | Payments fail |
## Common Operations
### Start the service
[Exact command or procedure]
### Stop the service
[Exact command or procedure]
### Restart the service
[Exact command or procedure]
### Scale the service
[How to scale up/down, if applicable]
## Monitoring
| What to Check | Where |
|--------------|-------|
| Uptime | Uptime Kuma → [check name] |
| Metrics | Grafana → [dashboard name] |
| Logs | Grafana Explore → Loki → [label filter] |
| Alerts | Alertmanager → [alert group] |
## Troubleshooting
### [Failure Mode 1: e.g., "Service returns 502"]
**Symptoms:** [What you see in monitoring/logs]
**Likely cause:** [What usually causes this]
**Fix:** [Step-by-step resolution]
### [Failure Mode 2: e.g., "Database connection pool exhausted"]
**Symptoms:** [What you see in monitoring/logs]
**Likely cause:** [What usually causes this]
**Fix:** [Step-by-step resolution]
## Escalation Contacts
| Role | Name | Slack | Phone |
|------|------|-------|-------|
| Anchor Operator | [Name] | @handle | +1-xxx-xxx-xxxx |
| Dev Team Lead | [Name] | @handle | +1-xxx-xxx-xxxx |
| Dev Team Backend | [Name] | @handle | +1-xxx-xxx-xxxx |
Example
# Acme Payments API Runbook
**Anchor Operator:** Jamie Chen
**Development Team Contact:** Alex Rivera, @alex.rivera, +1-555-0142
**Last updated:** 2026-04-04
## Architecture
Acme Payments API is a Node.js REST API that processes payment transactions. It receives requests from the Acme web frontend, validates them, interacts with the Stripe API for payment processing, and stores transaction records in PostgreSQL. Redis is used for rate limiting and session caching.
## Dependencies
| Dependency | Type | Impact if Unavailable |
|-----------|------|----------------------|
| PostgreSQL 15 | Database | Service cannot start; all transactions fail |
| Redis 7 | Cache | Rate limiting disabled and sessions lost; service runs in degraded mode |
| Stripe API | External API | Payment processing fails; transactions queue |
## Common Operations
### Start the service
docker compose up -d acme-payments
### Stop the service
docker compose stop acme-payments
### Restart the service
docker compose restart acme-payments
### Scale the service
docker compose up -d --scale acme-payments=3
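After any of the operations above, it can help to confirm that the service actually came back. This is a minimal sketch, assuming the API exposes an HTTP health endpoint. The URL `http://localhost:3000/health` and the helper name `wait_healthy` are hypothetical; substitute whatever the Uptime Kuma check probes.

```shell
# Hypothetical helper: poll a health URL until it succeeds or attempts run out.
# HEALTH_URL is an assumption; this runbook does not state the real endpoint.
HEALTH_URL="${HEALTH_URL:-http://localhost:3000/health}"

wait_healthy() {
  attempts="${1:-10}"   # number of polls before giving up
  delay="${2:-3}"       # seconds to wait between polls
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error statuses such as 502
    if curl -fsS -o /dev/null "$HEALTH_URL"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}

# Usage (not run here):
#   docker compose restart acme-payments && wait_healthy 10 3
```

A non-zero exit from `wait_healthy` is a reasonable cue to move on to the Troubleshooting section.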
## Monitoring
| What to Check | Where |
|--------------|-------|
| Uptime | Uptime Kuma → acme-payments-health |
| Metrics | Grafana → Acme Payments Dashboard |
| Logs | Grafana Explore → Loki → {app="acme-payments"} |
| Alerts | Alertmanager → acme-payments group |
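The Loki label filter in the table can also be queried from a terminal. This is a sketch, assuming Grafana's `logcli` client is installed and `LOKI_ADDR` points at the Loki instance (neither is stated in this runbook). `recent_errors` is a hypothetical helper name.

```shell
# Hypothetical helper: show recent log lines containing "error" for one app label.
# Assumes logcli is installed and LOKI_ADDR is exported in the environment.
recent_errors() {
  app="${1:-acme-payments}"
  since="${2:-1h}"
  # {app="..."} is the label selector; |= "error" keeps lines containing "error"
  logcli query --since="$since" --limit=100 "{app=\"${app}\"} |= \"error\""
}

# Usage (not run here):
#   recent_errors acme-payments 30m
```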
## Troubleshooting
### Service returns 502
**Symptoms:** Uptime Kuma reports DOWN. Grafana shows a spike in 5xx responses.
**Likely cause:** Application crashed or ran out of memory.
**Fix:**
1. Check logs: Grafana Explore → Loki → {app="acme-payments"} for error messages.
2. Check resource usage: Grafana → Acme Payments Dashboard → Memory/CPU panel.
3. If the service was OOM-killed: restart it and consider increasing its memory limit.
4. If the service is crash-looping: check recent deploys and roll back if needed.
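The OOM-versus-crash decision in the steps above can be checked directly against Docker. This is a sketch, assuming the service runs as a single Compose replica on the host where the commands are run. `triage_502` is a hypothetical helper, not part of any existing tooling.

```shell
# Hypothetical helper: report whether the service's container was OOM-killed.
# Assumes one replica; `docker compose ps -q` prints one container ID per replica.
triage_502() {
  service="${1:-acme-payments}"
  # -a includes stopped containers, so a crashed container is still found
  cid="$(docker compose ps -a -q "$service" | head -n 1)"
  if [ -z "$cid" ]; then
    echo "no container found for $service"
    return 1
  fi
  # .State.OOMKilled is true when the kernel OOM killer ended the container
  oom="$(docker inspect --format '{{.State.OOMKilled}}' "$cid")"
  if [ "$oom" = "true" ]; then
    echo "OOM-killed: restart and review memory limits"
  else
    echo "not OOM-killed: check recent deploys and logs"
  fi
}

# Usage (not run here):
#   triage_502 acme-payments
```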
### Database connection pool exhausted
**Symptoms:** Logs show "connection pool timeout" errors. Response times spike.
**Likely cause:** Long-running queries or a connection leak.
**Fix:**
1. Check active connections in psql: SELECT count(*) FROM pg_stat_activity WHERE application_name = 'acme-payments';
2. Terminate connections that have been idle in transaction for more than 5 minutes.
3. Restart the service to reset the connection pool.
4. Escalate to the dev team if the issue recurs. The likely cause is a long-running query or a connection leak in application code.
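The step that clears stale idle-in-transaction connections can be sketched as a small helper that prints the SQL to run. It assumes the application sets `application_name` to `acme-payments`, as the connection-count query does. `idle_tx_sql` and `DATABASE_URL` are hypothetical names. The arguments are interpolated into the SQL unescaped, so pass only trusted values.

```shell
# Hypothetical helper: print SQL that terminates stale idle-in-transaction sessions.
idle_tx_sql() {
  app="${1:-acme-payments}"   # application_name to match
  minutes="${2:-5}"           # minimum time spent idle in transaction
  cat <<SQL
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE application_name = '${app}'
  AND state = 'idle in transaction'
  AND state_change < now() - interval '${minutes} minutes';
SQL
}

# Usage (not run here; DATABASE_URL is an assumed connection string):
#   idle_tx_sql acme-payments 5 | psql "$DATABASE_URL"
```

Printing the statement first lets the operator review it before piping it to `psql`. Terminating a backend rolls back its open transaction.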
## Escalation Contacts
| Role | Name | Slack | Phone |
|------|------|-------|-------|
| Anchor Operator | Jamie Chen | @jamie.chen | +1-555-0198 |
| Dev Team Lead | Alex Rivera | @alex.rivera | +1-555-0142 |
| Dev Team Backend | Sam Okafor | @sam.okafor | +1-555-0167 |