Anchor SOP Site — Design Spec
Date: 2026-04-04 Status: Draft
Overview
Docusaurus 3 SOP site for Anchor MSP. Internal-only field manual for Anchor operators covering production ownership, monitoring, security, incident response, and all operational processes.
Anchor MSP's responsibility begins when a system is accepted into managed production operations. Anchor is the authority for production monitoring, logging, alerting, backups, secrets, security, and incident response. EGI and Mast are separate development companies that hand systems off to Anchor when ready.
Decisions
- Framework: Docusaurus 3 (latest)
- Hosting: Vercel from git repo
- Audience: Internal Anchor team only
- Sidebar structure: Flat — 13 top-level collapsible categories, autogenerated from folder structure
- Tone: Operational and direct. Short declarative sentences, imperative voice. Field manual style.
- Stack truth: The operational stack is the source of truth for runtime state, not CI/CD metadata or repo configs
Operational Stack
| Tool | Role | Hosting |
|---|---|---|
| Grafana | Dashboards, visualization | Self-managed |
| Prometheus | Metrics collection | Self-managed |
| Loki | Log aggregation | Self-managed |
| Alertmanager | Alert routing | Self-managed |
| Uptime Kuma | Uptime monitoring | Self-managed |
| Wazuh | Host IDS, file integrity | Self-managed |
| CrowdSec | Threat intelligence, auto-blocking | Hosted console |
| PostHog | Product analytics | Hosted |
| Restic | Backup orchestration | Self-managed |
| Vault | Secrets management | Self-managed |
| Slack | Alert channels, team comms | SaaS |
| Twilio | SMS escalation | SaaS |
Repo Structure
anchor-sop/
├── docs/
│   ├── intro.md
│   ├── production-ownership/
│   │   ├── _category_.json
│   │   └── production-ownership-policy.md
│   ├── client-onboarding/
│   │   ├── _category_.json
│   │   └── new-managed-system-onboarding-checklist.md
│   ├── system-handoff-acceptance/
│   │   ├── _category_.json
│   │   ├── managed-production-acceptance-criteria.md
│   │   └── handoff-acceptance-checklist.md
│   ├── monitoring-and-alerting/
│   │   ├── _category_.json
│   │   ├── alert-severity-matrix.md
│   │   ├── slack-alert-routing-policy.md
│   │   └── sms-escalation-policy.md
│   ├── logging-and-metrics/
│   │   └── _category_.json
│   ├── backup-and-restore/
│   │   ├── _category_.json
│   │   ├── backup-policy.md
│   │   └── restore-testing-policy.md
│   ├── secrets-management/
│   │   ├── _category_.json
│   │   └── secrets-management-policy.md
│   ├── security-operations/
│   │   ├── _category_.json
│   │   ├── security-monitoring-policy.md
│   │   └── cloudflare-ownership-admin-policy.md
│   ├── incident-response/
│   │   └── _category_.json
│   ├── change-management/
│   │   ├── _category_.json
│   │   └── production-change-policy.md
│   ├── access-control/
│   │   └── _category_.json
│   ├── runbooks/
│   │   ├── _category_.json
│   │   └── runbook-template.md
│   └── decision-logs/
│       ├── _category_.json
│       └── decision-log-template.md
├── src/
│   └── css/
│       └── custom.css
├── static/
│   └── img/
├── sidebars.js
├── docusaurus.config.js
├── package.json
└── README.md
Sidebar Order
Autogenerated from `_category_.json` position values:
1. Production Ownership
2. Client Onboarding
3. System Handoff Acceptance
4. Monitoring and Alerting
5. Logging and Metrics
6. Backup and Restore
7. Secrets Management
8. Security Operations
9. Incident Response
10. Change Management
11. Access Control
12. Runbooks
13. Decision Logs
Ordering rationale: Follows the lifecycle of a system entering Anchor's care — ownership definition → onboarding → day-to-day operations → reactive processes → reference material.
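The position values can be generated instead of hand-edited. A minimal sketch, assuming each folder name is the lowercased, hyphenated section label (this matches the repo structure in this spec) and that each `_category_.json` carries `label`, `position`, and `collapsible`. The helper names `slug` and `category_files` are hypothetical and not part of Docusaurus.

```python
import json

# Sidebar sections in lifecycle order, as listed in this spec.
SECTIONS = [
    "Production Ownership",
    "Client Onboarding",
    "System Handoff Acceptance",
    "Monitoring and Alerting",
    "Logging and Metrics",
    "Backup and Restore",
    "Secrets Management",
    "Security Operations",
    "Incident Response",
    "Change Management",
    "Access Control",
    "Runbooks",
    "Decision Logs",
]


def slug(label):
    """Folder name for a label, e.g. 'Backup and Restore' -> 'backup-and-restore'."""
    return label.lower().replace(" ", "-")


def category_files(sections):
    """Map each _category_.json path to its JSON body. Positions start at 1."""
    return {
        f"docs/{slug(label)}/_category_.json": json.dumps(
            {"label": label, "position": position, "collapsible": True},
            indent=2,
        )
        for position, label in enumerate(sections, start=1)
    }


files = category_files(SECTIONS)
print(files["docs/runbooks/_category_.json"])
```

Reordering the sidebar then means reordering `SECTIONS` and regenerating, so positions never collide or leave gaps.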
Navigation
- Top navbar: Site title ("Anchor SOP") linking to docs landing page. No blog, no extra navbar items.
- Landing page: `docs/intro.md`. Brief overview of what Anchor is, what this site contains, and the ownership boundary summary.
- Sidebar: Single docs sidebar with all 13 sections. Autogenerated.
Page Structure Templates
Policies
# [Policy Name]
**Owner:** [Role/team responsible]
**Last reviewed:** [Date]
## Purpose
One or two sentences.
## Scope
What systems/teams/situations this covers.
## Policy
Numbered or bulleted directives.
## Exceptions
How to request an exception and who approves.
Checklists
# [Checklist Name]
**Owner:** [Role/team responsible]
**Last reviewed:** [Date]
## Purpose
When to use this checklist.
## Checklist
- [ ] Item with clear acceptance criteria
Templates
# [Template Name]
## When to use this template
Brief context.
## Template
Content with placeholders.
## Example
Filled-in example.
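Pages can be checked against these templates before merge. A minimal sketch, assuming every required section appears as a level-2 Markdown heading with the exact names used in the templates above. The function `missing_sections` and the page-type keys are hypothetical.

```python
import re

# Required level-2 headings per page type, taken from the templates above.
REQUIRED_SECTIONS = {
    "policy": ["Purpose", "Scope", "Policy", "Exceptions"],
    "checklist": ["Purpose", "Checklist"],
    "template": ["When to use this template", "Template", "Example"],
}


def missing_sections(markdown, page_type):
    """Return required headings that the page does not contain, in template order."""
    found = set(re.findall(r"^## (.+?)\s*$", markdown, flags=re.MULTILINE))
    return [name for name in REQUIRED_SECTIONS[page_type] if name not in found]


draft = "# Backup Policy\n## Purpose\nText.\n## Policy\n1. Directive.\n"
print(missing_sections(draft, "policy"))  # ['Scope', 'Exceptions']
```

The check covers headings only. The **Owner** and **Last reviewed** fields would need a separate check.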
Starter Page Content
1. Production Ownership Policy
- Anchor owns all production infrastructure from handoff acceptance onward
- Responsibility matrix table: rows for each operational domain (monitoring, alerting, backups, secrets, security, incident response, DNS/Cloudflare, logging), columns for Anchor/EGI/Mast
- Anchor owns: monitoring, alerting, backups, secrets, security, incident response, DNS/Cloudflare admin, logging infrastructure
- EGI/Mast own: application code, feature development, pre-production environments, application-level testing
- Deploy authority: application teams deploy their own code; Anchor owns the production environment those deploys target; Anchor can halt deploys if production stability is at risk
- Runtime truth lives in the operational stack (Grafana, Prometheus, Wazuh, etc.), not in CI/CD metadata or repo configs
2. Managed Production Acceptance Criteria
Minimum requirements before Anchor accepts a system:
- Exposes a health check endpoint
- Produces structured logs to stdout
- Has defined resource limits (CPU, memory)
- All secrets documented and ready for Vault migration
- Backup-eligible data identified and classified
- Monitoring hooks available (metrics endpoint or log patterns)
- Deployment process documented
- Rollback procedure documented
Systems not meeting criteria go back to the development team with a gap list.
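The gap list is the set of unmet criteria. A minimal sketch, assuming the handoff request arrives as a mapping of criterion to pass/fail. The criterion keys and the function `gap_list` are hypothetical names chosen for illustration.

```python
# One key per acceptance criterion above. Key names are illustrative.
ACCEPTANCE_CRITERIA = {
    "health_check_endpoint": "Exposes a health check endpoint",
    "structured_logs_stdout": "Produces structured logs to stdout",
    "resource_limits": "Has defined resource limits (CPU, memory)",
    "secrets_documented": "All secrets documented and ready for Vault migration",
    "backup_data_classified": "Backup-eligible data identified and classified",
    "monitoring_hooks": "Monitoring hooks available",
    "deployment_documented": "Deployment process documented",
    "rollback_documented": "Rollback procedure documented",
}


def gap_list(submission):
    """Return descriptions of criteria that are unmet or missing from the submission."""
    return [
        description
        for key, description in ACCEPTANCE_CRITERIA.items()
        if not submission.get(key, False)
    ]


submission = {key: True for key in ACCEPTANCE_CRITERIA}
submission["rollback_documented"] = False
print(gap_list(submission))  # ['Rollback procedure documented']
```

A criterion that is absent from the submission counts as unmet, so an incomplete request never passes by omission.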
3. New Managed System Onboarding Checklist
Ordered steps:
1. Receive handoff request from development team
2. Validate acceptance criteria (reference acceptance criteria page)
3. Provision uptime monitoring (Uptime Kuma)
4. Configure metrics scraping (Prometheus)
5. Set up dashboards (Grafana)
6. Configure log aggregation (Loki)
7. Set up alerting rules (Alertmanager → Slack channels)
8. Configure SMS escalation for critical alerts (Twilio)
9. Configure backups (Restic orchestration)
10. Onboard secrets to Vault
11. Register in PostHog
12. Add to CrowdSec protection
13. Configure Wazuh monitoring
14. Create system runbook (reference runbook template)
15. Confirm completion with development team
4. Handoff Acceptance Checklist
Gate between development and Anchor. All items verified and signed off:
- Acceptance criteria met (all items from acceptance criteria page)
- Monitoring configured and producing data
- Alerts routing to correct Slack channels
- Critical alert SMS escalation verified
- Backups configured and first backup completed
- Secrets migrated to Vault
- Access provisioned for Anchor operators
- Runbook created and reviewed
- Escalation contacts confirmed with development team
- Development team acknowledges handoff complete
5. Alert Severity Matrix
| Severity | Criteria | Channel | Response Time | Escalation |
|---|---|---|---|---|
| Critical | Service down, data loss risk | #alerts-critical + SMS | 15 minutes | Immediate |
| High | Degraded service, partial outage | #alerts-high | 1 hour | After 1 hour |
| Medium | Anomaly, threshold breach | #alerts-medium | Next business day | None |
| Low | Informational, trends | #alerts-low | Weekly review | None |
Maps to Alertmanager severity label values: critical, high, medium, low.
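The matrix can be held as data so routing and documentation stay in sync. A minimal sketch of the lookup, not the Alertmanager configuration itself. `SeverityRule` and `route` are hypothetical names. Rejecting unknown severities is an assumption made here so that no alert falls through to an unmonitored default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityRule:
    channel: str        # Slack channel from the matrix
    sms: bool           # True when the alert also triggers SMS
    response_time: str  # Response time from the matrix


SEVERITY_MATRIX = {
    "critical": SeverityRule("#alerts-critical", True, "15 minutes"),
    "high": SeverityRule("#alerts-high", False, "1 hour"),
    "medium": SeverityRule("#alerts-medium", False, "Next business day"),
    "low": SeverityRule("#alerts-low", False, "Weekly review"),
}


def route(labels):
    """Look up the rule for an alert's `severity` label. Unknown values are an error."""
    severity = labels.get("severity")
    if severity not in SEVERITY_MATRIX:
        raise ValueError(f"unknown or missing severity label: {severity!r}")
    return SEVERITY_MATRIX[severity]


print(route({"severity": "critical"}).channel)  # #alerts-critical
```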
6. Slack Alert Routing Policy
- `#alerts-critical`: Critical severity. Monitored 24/7. Also triggers SMS.
- `#alerts-high`: High severity. Monitored during business hours + on-call.
- `#alerts-medium`: Medium severity. Reviewed next business day.
- `#alerts-low`: Low severity. Reviewed weekly.
- Alertmanager routes by `severity` label to the matching channel via webhook.
- No alert routes to a general or unmonitored channel.
- Every managed system must have its alerts routing to these channels before handoff is complete.
7. SMS Escalation Policy
- Triggered by: Critical severity alerts only
- Delivery: Twilio SMS to on-call operator's phone
- Escalation chain: Primary on-call → Secondary on-call → Team lead
- Timeout between escalation steps: 10 minutes
- Acknowledgment: Operator must acknowledge in `#alerts-critical` within 15 minutes
- On-call rotation schedule: defined separately (future page)
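The escalation chain and timeout reduce to a small function. A minimal sketch, assuming the primary is paged at minute 0 and each further step fires exactly when the 10-minute timeout elapses without acknowledgment. It does not send SMS. The function name `paged_so_far` is hypothetical.

```python
ESCALATION_CHAIN = ["primary on-call", "secondary on-call", "team lead"]
STEP_TIMEOUT_MIN = 10  # timeout between escalation steps, from the policy


def paged_so_far(minutes_unacknowledged):
    """Everyone paged once a critical alert has gone unacknowledged this long."""
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    steps = int(minutes_unacknowledged // STEP_TIMEOUT_MIN) + 1
    # Slicing caps the result at the end of the chain.
    return ESCALATION_CHAIN[:steps]


print(paged_so_far(0))   # ['primary on-call']
print(paged_so_far(12))  # ['primary on-call', 'secondary on-call']
print(paged_so_far(45))  # ['primary on-call', 'secondary on-call', 'team lead']
```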
8. Cloudflare Ownership/Admin Policy
- Anchor owns all production Cloudflare accounts and DNS zones
- Development teams do not have direct Cloudflare access for production domains
- Changes to DNS records, WAF rules, page rules, or TLS certificates go through Anchor via the change management process
- Anchor maintains Cloudflare API tokens scoped per service, stored in Vault
- Emergency DNS changes follow the emergency change process (reference change management)
9. Backup Policy
- All production data backed up via Restic orchestration
- Frequency by data classification:
- Databases: daily
- Configuration: on change
- Media/uploads: daily
- Retention: 7 daily, 4 weekly, 6 monthly
- Backups stored off-host. Location and encryption managed by Anchor.
- Backup success/failure tracked via Prometheus metrics
- Failed backups generate a High severity alert
- Backup configuration documented per system in the system's runbook
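The retention rule maps directly onto restic's `forget` flags. A minimal sketch, assuming retention is enforced with `restic forget` (this spec does not say how) and that the repository location and password come from the environment. The function `forget_args` is hypothetical. It only builds the argument list and runs nothing.

```python
# Retention from the policy: 7 daily, 4 weekly, 6 monthly.
RETENTION = {"daily": 7, "weekly": 4, "monthly": 6}


def forget_args(retention):
    """Build the `restic forget` argument list for the given retention counts."""
    args = ["restic", "forget", "--prune"]
    for period, count in retention.items():
        args += [f"--keep-{period}", str(count)]
    return args


print(" ".join(forget_args(RETENTION)))
# restic forget --prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6
```

Keeping the counts in one mapping lets each system's runbook and the scheduled job read the same values.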
10. Restore Testing Policy
- Restore tests run quarterly for every managed system
- Procedure: select a recent backup → restore to isolated environment → verify data integrity → document result
- Restore test results logged and tracked
- Failed restore test is treated as an incident and triggers immediate investigation
- Systems with no successful restore test in the past quarter are flagged for review
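Flagging overdue systems is a date comparison. A minimal sketch, assuming "the past quarter" means the last 90 days and that the date of each system's last successful restore test is already recorded. The system names and the function `overdue_systems` are hypothetical.

```python
from datetime import date, timedelta

QUARTER = timedelta(days=90)  # assumption: a quarter is treated as 90 days


def overdue_systems(last_success, today):
    """Names of systems with no successful restore test within the past quarter.

    `last_success` maps system name to the date of its last successful
    restore test, or None if the system has never passed one.
    """
    return sorted(
        name
        for name, tested in last_success.items()
        if tested is None or today - tested > QUARTER
    )


results = {
    "billing-api": date(2026, 3, 1),  # hypothetical systems
    "wiki": date(2025, 11, 1),
    "crm": None,
}
print(overdue_systems(results, date(2026, 4, 4)))  # ['crm', 'wiki']
```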
11. Secrets Management Policy
- All production secrets stored in Vault. No exceptions.
- Secrets never stored in: environment files, git repos, CI/CD variables at rest, config files on disk
- Secrets injected at runtime by Vault using service identity
- Rotation schedule:
- API keys: quarterly
- Database credentials: quarterly
- TLS certificates: auto-renewed
- Access to secrets scoped by service identity, not by individual person
- Secret access audited via Vault audit log
- New secrets for a system are onboarded during the system onboarding process
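The rotation schedule can be checked the same way. A minimal sketch, assuming "quarterly" means every 90 days and that each secret's last rotation date is available from an inventory. It does not call Vault. The kind names and the function `rotation_due` are hypothetical.

```python
from datetime import date, timedelta

# Assumption: quarterly rotation is treated as a 90-day interval.
ROTATION_INTERVAL = {
    "api_key": timedelta(days=90),
    "database_credential": timedelta(days=90),
}
AUTO_RENEWED = {"tls_certificate"}  # renewed automatically, never flagged here


def rotation_due(kind, last_rotated, today):
    """True when a secret of this kind has reached its rotation interval."""
    if kind in AUTO_RENEWED:
        return False
    # Unknown kinds raise KeyError so that unclassified secrets are noticed.
    return today - last_rotated >= ROTATION_INTERVAL[kind]


today = date(2026, 4, 4)
print(rotation_due("api_key", date(2026, 1, 1), today))              # True
print(rotation_due("database_credential", date(2026, 3, 1), today))  # False
```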
12. Security Monitoring Policy
- Wazuh: host-level intrusion detection and file integrity monitoring on all managed hosts
- CrowdSec: crowd-sourced threat intelligence, automated IP blocking at edge
- Security alerts route through Alertmanager into the standard severity/Slack/SMS pipeline
- Weekly review of security events via Grafana security dashboard
- CrowdSec ban lists and Wazuh rule updates applied on Anchor's schedule
- Security incidents escalate per the incident response process (future page)
13. Runbook Template
Sections:
- Service name
- Owner (Anchor operator + development team contact)
- Architecture summary (components, dependencies, data flow)
- Dependencies (upstream/downstream services, external APIs)
- Common operations (start, stop, restart, scale)
- Monitoring (which Grafana dashboards, key metrics to watch)
- Troubleshooting (known failure modes with diagnosis and fix steps)
- Escalation contacts
Includes a filled example for a generic web application backed by PostgreSQL.
14. Production Change Policy
- All production changes go through change management
- Change types:
- Standard: Pre-approved, low risk. Documented, executed, logged.
- Normal: Submitted → reviewed → approved → scheduled → executed → verified.
- Emergency: Executed immediately, documented retroactively within 24 hours.
- Every change record includes: who, what, when, why, rollback plan
- Changes logged in a central location (future: decision log or change log)
- Failed changes trigger rollback and incident review
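A change record can be validated before it is logged. A minimal sketch, assuming records are plain structured data. `ChangeRecord`, `missing_fields`, and `emergency_doc_deadline` are hypothetical names. The 24-hour window comes from the emergency change rule above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

CHANGE_TYPES = {"standard", "normal", "emergency"}


@dataclass
class ChangeRecord:
    who: str
    what: str
    when: datetime
    why: str
    rollback_plan: str
    change_type: str = "normal"

    def __post_init__(self):
        if self.change_type not in CHANGE_TYPES:
            raise ValueError(f"unknown change type: {self.change_type!r}")

    def missing_fields(self):
        """Names of required text fields that are empty or whitespace only."""
        required = ("who", "what", "why", "rollback_plan")
        return [name for name in required if not getattr(self, name).strip()]


def emergency_doc_deadline(executed_at):
    """Latest time an emergency change may be documented retroactively."""
    return executed_at + timedelta(hours=24)


record = ChangeRecord(
    who="operator-1",  # hypothetical values
    what="Raise memory limit",
    when=datetime(2026, 4, 4, 9, 0),
    why="Repeated OOM kills",
    rollback_plan="",
)
print(record.missing_fields())  # ['rollback_plan']
```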
15. Decision Log Template
Sections:
- Date
- Decision (one sentence)
- Context (why this came up)
- Options considered (brief list)
- Decision rationale (why this option was chosen)
- Consequences accepted (known trade-offs)
- Revisit date (optional)
Includes a filled example.
Future Pages (Suggested)
- On-call rotation schedule
- Incident response playbook
- Post-incident review template
- Access request procedure
- Offboarding checklist (removing a system from Anchor management)
- SLA definitions per client
- Logging standards (structured log format requirements)
- Metrics naming conventions
- Network security policy
- Vulnerability management policy
- Capacity planning process
- Disaster recovery plan