Anchor SOP Site — Design Spec

Date: 2026-04-04
Status: Draft

Overview

Docusaurus 3 SOP site for Anchor MSP. Internal-only field manual for Anchor operators covering production ownership, monitoring, security, incident response, and all other operational processes.

Anchor MSP's responsibility begins when a system is accepted into managed production operations. Anchor is the authority for production monitoring, logging, alerting, backups, secrets, security, and incident response. EGI and Mast are separate development companies that hand systems off to Anchor when ready.

Decisions

Framework: Docusaurus 3 (latest)
Hosting: Vercel from git repo
Audience: Internal Anchor team only
Sidebar structure: Flat — 13 top-level collapsible categories, autogenerated from folder structure
Tone: Operational and direct. Short declarative sentences, imperative voice. Field manual style.
Stack truth: Runtime truth lives in the operational stack, not in CI/CD metadata

Operational Stack

Tool	Role	Hosting
Grafana	Dashboards, visualization	Self-managed
Prometheus	Metrics collection	Self-managed
Loki	Log aggregation	Self-managed
Alertmanager	Alert routing	Self-managed
Uptime Kuma	Uptime monitoring	Self-managed
Wazuh	Host IDS, file integrity	Self-managed
CrowdSec	Threat intelligence, auto-blocking	Hosted console
PostHog	Product analytics	Hosted
Restic	Backup orchestration	Self-managed
Vault	Secrets management	Self-managed
Slack	Alert channels, team comms	SaaS
Twilio	SMS escalation	SaaS

Repo Structure

anchor-sop/
├── docs/
│   ├── intro.md
│   ├── production-ownership/
│   │   ├── _category_.json
│   │   └── production-ownership-policy.md
│   ├── client-onboarding/
│   │   ├── _category_.json
│   │   └── new-managed-system-onboarding-checklist.md
│   ├── system-handoff-acceptance/
│   │   ├── _category_.json
│   │   ├── managed-production-acceptance-criteria.md
│   │   └── handoff-acceptance-checklist.md
│   ├── monitoring-and-alerting/
│   │   ├── _category_.json
│   │   ├── alert-severity-matrix.md
│   │   ├── slack-alert-routing-policy.md
│   │   └── sms-escalation-policy.md
│   ├── logging-and-metrics/
│   │   └── _category_.json
│   ├── backup-and-restore/
│   │   ├── _category_.json
│   │   ├── backup-policy.md
│   │   └── restore-testing-policy.md
│   ├── secrets-management/
│   │   ├── _category_.json
│   │   └── secrets-management-policy.md
│   ├── security-operations/
│   │   ├── _category_.json
│   │   ├── security-monitoring-policy.md
│   │   └── cloudflare-ownership-admin-policy.md
│   ├── incident-response/
│   │   └── _category_.json
│   ├── change-management/
│   │   ├── _category_.json
│   │   └── production-change-policy.md
│   ├── access-control/
│   │   └── _category_.json
│   ├── runbooks/
│   │   ├── _category_.json
│   │   └── runbook-template.md
│   └── decision-logs/
│       ├── _category_.json
│       └── decision-log-template.md
├── src/
│   └── css/
│       └── custom.css
├── static/
│   └── img/
├── sidebars.js
├── docusaurus.config.js
├── package.json
└── README.md

Sidebar Order

Autogenerated from _category_.json position values:

Production Ownership
Client Onboarding
System Handoff Acceptance
Monitoring and Alerting
Logging and Metrics
Backup and Restore
Secrets Management
Security Operations
Incident Response
Change Management
Access Control
Runbooks
Decision Logs

Ordering rationale: Follows the lifecycle of a system entering Anchor's care — ownership definition → onboarding → day-to-day operations → reactive processes → reference material.
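
The position values could be scaffolded with a short script rather than typed by hand. The sketch below is one possible approach, not part of the spec: it pairs each folder name from the repo structure with its sidebar label and emits a `_category_.json` body containing only `label` and `position`. The function name `category_files` is hypothetical.

```python
import json

# Folder names come from the repo structure; labels and order from the sidebar list.
SECTIONS = [
    ("production-ownership", "Production Ownership"),
    ("client-onboarding", "Client Onboarding"),
    ("system-handoff-acceptance", "System Handoff Acceptance"),
    ("monitoring-and-alerting", "Monitoring and Alerting"),
    ("logging-and-metrics", "Logging and Metrics"),
    ("backup-and-restore", "Backup and Restore"),
    ("secrets-management", "Secrets Management"),
    ("security-operations", "Security Operations"),
    ("incident-response", "Incident Response"),
    ("change-management", "Change Management"),
    ("access-control", "Access Control"),
    ("runbooks", "Runbooks"),
    ("decision-logs", "Decision Logs"),
]


def category_files(sections):
    """Return {relative path: JSON text} for each docs category folder."""
    files = {}
    for position, (folder, label) in enumerate(sections, start=1):
        body = {"label": label, "position": position}
        files[f"docs/{folder}/_category_.json"] = json.dumps(body, indent=2) + "\n"
    return files
```

Writing each returned value to its path would produce the 13 category files. Reordering the sidebar then means reordering `SECTIONS` and regenerating.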

Navigation

Top navbar: Site title ("Anchor SOP") linking to docs landing page. No blog, no extra navbar items.
Landing page: docs/intro.md — brief overview of what Anchor is, what this site contains, and the ownership boundary summary.
Sidebar: Single docs sidebar with all 13 sections. Autogenerated.

Page Structure Templates

Policies

# [Policy Name]

**Owner:** [Role/team responsible]
**Last reviewed:** [Date]

## Purpose
One or two sentences.

## Scope
What systems/teams/situations this covers.

## Policy
Numbered or bulleted directives.

## Exceptions
How to request an exception and who approves.

Checklists

# [Checklist Name]

**Owner:** [Role/team responsible]
**Last reviewed:** [Date]

## Purpose
When to use this checklist.

## Checklist
- [ ] Item with clear acceptance criteria

Templates

# [Template Name]

## When to use this template
Brief context.

## Template
Content with placeholders.

## Example
Filled-in example.

Starter Page Content

1. Production Ownership Policy

Anchor owns all production infrastructure from handoff acceptance onward
Responsibility matrix table: rows for each operational domain (monitoring, alerting, backups, secrets, security, incident response, DNS/Cloudflare, logging), columns for Anchor/EGI/Mast
Anchor owns: monitoring, alerting, backups, secrets, security, incident response, DNS/Cloudflare admin, logging infrastructure
EGI/Mast own: application code, feature development, pre-production environments, application-level testing
Deploy authority: application teams deploy their own code; Anchor owns the production environment those deploys target; Anchor can halt deploys if production stability is at risk
Runtime truth lives in the operational stack (Grafana, Prometheus, Wazuh, etc.), not in CI/CD metadata or repo configs

2. Managed Production Acceptance Criteria

Minimum requirements before Anchor accepts a system:

Exposes a health check endpoint
Produces structured logs to stdout
Has defined resource limits (CPU, memory)
All secrets documented and ready for Vault migration
Backup-eligible data identified and classified
Monitoring hooks available (metrics endpoint or log patterns)
Deployment process documented
Rollback procedure documented

Systems not meeting criteria go back to the development team with a gap list.
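
The gap list can be produced mechanically from the criteria above. This is a minimal sketch under assumptions: the short keys (`health_check`, `structured_logs`, and so on) and the function names are hypothetical, and how evidence is gathered for each criterion is left open.

```python
# Keys are hypothetical identifiers; descriptions are the criteria from this page.
ACCEPTANCE_CRITERIA = {
    "health_check": "Exposes a health check endpoint",
    "structured_logs": "Produces structured logs to stdout",
    "resource_limits": "Has defined resource limits (CPU, memory)",
    "secrets_documented": "All secrets documented and ready for Vault migration",
    "backup_data_classified": "Backup-eligible data identified and classified",
    "monitoring_hooks": "Monitoring hooks available (metrics endpoint or log patterns)",
    "deployment_documented": "Deployment process documented",
    "rollback_documented": "Rollback procedure documented",
}


def gap_list(evidence):
    """Return descriptions of every criterion not explicitly marked True."""
    return [
        description
        for key, description in ACCEPTANCE_CRITERIA.items()
        if evidence.get(key) is not True
    ]


def is_accepted(evidence):
    """A system is accepted only when the gap list is empty."""
    return not gap_list(evidence)
```

A criterion with no evidence counts as a gap, so an incomplete submission cannot pass by omission.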

3. New Managed System Onboarding Checklist

Ordered steps:

Receive handoff request from development team
Validate acceptance criteria (reference acceptance criteria page)
Provision uptime monitoring (Uptime Kuma)
Configure metrics scraping (Prometheus)
Set up dashboards (Grafana)
Configure log aggregation (Loki)
Set up alerting rules (Alertmanager → Slack channels)
Configure SMS escalation for critical alerts (Twilio)
Configure backups (Restic orchestration)
Onboard secrets to Vault
Register in PostHog
Add to CrowdSec protection
Configure Wazuh monitoring
Create system runbook (reference runbook template)
Confirm completion with development team

4. Handoff Acceptance Checklist

Gate between development and Anchor. All items verified and signed off:

5. Alert Severity Matrix

Severity	Criteria	Channel	Response Time	Escalation
Critical	Service down, data loss risk	`#alerts-critical` + SMS	15 minutes	Immediate
High	Degraded service, partial outage	`#alerts-high`	1 hour	After 1 hour
Medium	Anomaly, threshold breach	`#alerts-medium`	Next business day	None
Low	Informational, trends	`#alerts-low`	Weekly review	None

Maps to Alertmanager severity label values: critical, high, medium, low.
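
The matrix can be read as a lookup keyed by the `severity` label. The sketch below models that lookup in Python for illustration only. It is not Alertmanager configuration, and the names `SeverityRule` and `route` are hypothetical. Rejecting unknown label values is a choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityRule:
    channel: str
    sms: bool
    response_time: str
    escalation: str


# One entry per row of the severity matrix.
SEVERITY_MATRIX = {
    "critical": SeverityRule("#alerts-critical", True, "15 minutes", "Immediate"),
    "high": SeverityRule("#alerts-high", False, "1 hour", "After 1 hour"),
    "medium": SeverityRule("#alerts-medium", False, "Next business day", "None"),
    "low": SeverityRule("#alerts-low", False, "Weekly review", "None"),
}


def route(labels):
    """Look up the rule for an alert's severity label; reject unknown values."""
    severity = labels.get("severity")
    if severity not in SEVERITY_MATRIX:
        raise ValueError(f"unknown or missing severity label: {severity!r}")
    return SEVERITY_MATRIX[severity]
```

For example, `route({"severity": "critical"})` returns the rule with channel `#alerts-critical` and `sms=True`.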

6. Slack Alert Routing Policy

#alerts-critical — Critical severity. Monitored 24/7. Also triggers SMS.
#alerts-high — High severity. Monitored during business hours + on-call.
#alerts-medium — Medium severity. Reviewed next business day.
#alerts-low — Low severity. Reviewed weekly.
Alertmanager routes by severity label to the matching channel via webhook.
No alert routes to a general or unmonitored channel.
Every managed system must have its alerts routing to these channels before handoff is complete.

7. SMS Escalation Policy

Triggered by: Critical severity alerts only
Delivery: Twilio SMS to on-call operator's phone
Escalation chain: Primary on-call → Secondary on-call → Team lead
Timeout between escalation steps: 10 minutes
Acknowledgment: Operator must acknowledge in #alerts-critical within 15 minutes
On-call rotation schedule: defined separately (future page)
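
The escalation timing can be illustrated with a small calculation. This sketch assumes the next step fires at exactly the 10-minute mark and that earlier contacts stay paged. The policy does not state either detail, and `paged_contacts` is a hypothetical name.

```python
ESCALATION_CHAIN = ["primary on-call", "secondary on-call", "team lead"]
STEP_TIMEOUT_MINUTES = 10


def paged_contacts(minutes_unacknowledged):
    """Return everyone paged so far for a critical alert that is still unacknowledged."""
    if minutes_unacknowledged < 0:
        raise ValueError("elapsed minutes cannot be negative")
    # One contact at minute 0, then one more per elapsed timeout, capped at the chain length.
    steps = int(minutes_unacknowledged // STEP_TIMEOUT_MINUTES) + 1
    return ESCALATION_CHAIN[: min(steps, len(ESCALATION_CHAIN))]
```

Under these assumptions, an alert unacknowledged for 12 minutes has paged the primary and secondary on-call. After 20 minutes the team lead has been paged as well.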

8. Cloudflare Ownership/Admin Policy

Anchor owns all production Cloudflare accounts and DNS zones
Development teams do not have direct Cloudflare access for production domains
Changes to DNS records, WAF rules, page rules, or TLS certificates go through Anchor via the change management process
Anchor maintains Cloudflare API tokens scoped per service, stored in Vault
Emergency DNS changes follow the emergency change process (reference change management)

9. Backup Policy

All production data backed up via Restic orchestration
Frequency by data classification:
- Databases: daily
- Configuration: on change
- Media/uploads: daily
Retention: 7 daily, 4 weekly, 6 monthly
Backups stored off-host. Location and encryption managed by Anchor.
Backup success/failure tracked via Prometheus metrics
Failed backups generate a High severity alert
Backup configuration documented per system in the system's runbook
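
One way to enforce the retention counts is `restic forget` with its `--keep-daily`, `--keep-weekly`, and `--keep-monthly` flags. The sketch below only builds the argument list. It assumes retention is enforced this way, which the policy does not state. The function name and the repository string in the usage note are hypothetical.

```python
# Keep counts from the retention line of this policy.
RETENTION = {"daily": 7, "weekly": 4, "monthly": 6}


def restic_forget_args(repo, retention=RETENTION, prune=True):
    """Build the argv list for `restic forget` with the policy's keep counts."""
    args = ["restic", "--repo", repo, "forget"]
    for period in ("daily", "weekly", "monthly"):
        args += [f"--keep-{period}", str(retention[period])]
    if prune:
        # --prune also removes data no longer referenced by any remaining snapshot.
        args.append("--prune")
    return args
```

Calling `restic_forget_args("/srv/restic/example-app")` yields a list that can be passed to `subprocess.run`. The repository password and storage credentials would come from Vault, not from the script.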

10. Restore Testing Policy

Restore tests run quarterly for every managed system
Procedure: select a recent backup → restore to isolated environment → verify data integrity → document result
Restore test results logged and tracked
Failed restore test is treated as an incident and triggers immediate investigation
Systems with no successful restore test in the past quarter are flagged for review
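
The review flag can be computed from the logged results. This is a minimal sketch that approximates "past quarter" as 90 days, an assumption not made by the policy. The system names in the usage note and the function name are hypothetical.

```python
from datetime import date, timedelta

QUARTER = timedelta(days=90)  # assumption: "past quarter" approximated as 90 days


def systems_needing_review(last_successful_restore, today):
    """Return systems with no successful restore test within the last quarter.

    `last_successful_restore` maps system name to the date of its last
    successful restore test, or None if it has never had one.
    """
    cutoff = today - QUARTER
    return sorted(
        name
        for name, last in last_successful_restore.items()
        if last is None or last < cutoff
    )
```

With `today = date(2026, 4, 4)`, a system last tested on 2025-12-01 is flagged, one tested on 2026-03-01 is not, and a system with no recorded success is always flagged.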

11. Secrets Management Policy

All production secrets stored in Vault. No exceptions.
Secrets never stored in: environment files, git repos, CI/CD variables at rest, config files on disk
Secrets injected at runtime by Vault using service identity
Rotation schedule:
- API keys: quarterly
- Database credentials: quarterly
- TLS certificates: auto-renewed
Access to secrets scoped by service identity, not by individual person
Secret access audited via Vault audit log
New secrets for a system are onboarded during the system onboarding process
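
The rotation schedule lends itself to a simple due-date check. The sketch below assumes "quarterly" means every 90 days and treats auto-renewed TLS certificates as never manually due. The kind names and the function name are hypothetical, and the check is independent of how Vault itself tracks rotation.

```python
from datetime import date, timedelta

# Assumption: "quarterly" approximated as 90 days. None means auto-renewed.
ROTATION_INTERVALS = {
    "api_key": timedelta(days=90),
    "database_credential": timedelta(days=90),
    "tls_certificate": None,
}


def rotation_due(kind, last_rotated, today):
    """Return True when a secret of this kind is at or past its rotation interval."""
    interval = ROTATION_INTERVALS[kind]
    if interval is None:
        return False
    return today - last_rotated >= interval
```

For example, an API key last rotated on 2026-01-01 is due on 2026-04-04, 93 days later.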

12. Security Monitoring Policy

Wazuh: host-level intrusion detection and file integrity monitoring on all managed hosts
CrowdSec: crowd-sourced threat intelligence, automated IP blocking at edge
Security alerts route through Alertmanager into the standard severity/Slack/SMS pipeline
Weekly review of security events via Grafana security dashboard
CrowdSec ban lists and Wazuh rule updates applied on Anchor's schedule
Security incidents escalate per the incident response process (future page)

13. Runbook Template

Sections:

Service name
Owner (Anchor operator + development team contact)
Architecture summary (components, dependencies, data flow)
Dependencies (upstream/downstream services, external APIs)
Common operations (start, stop, restart, scale)
Monitoring (which Grafana dashboards, key metrics to watch)
Troubleshooting (known failure modes with diagnosis and fix steps)
Escalation contacts

Includes a filled example for a generic web application backed by PostgreSQL.

14. Production Change Policy

All production changes go through change management
Change types:
- Standard: Pre-approved, low risk. Documented, executed, logged.
- Normal: Submitted → reviewed → approved → scheduled → executed → verified.
- Emergency: Executed immediately, documented retroactively within 24 hours.
Every change record includes: who, what, when, why, rollback plan
Changes logged in a central location (future: decision log or change log)
Failed changes trigger rollback and incident review
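
The record requirements can be checked before a change is logged. This is a sketch under assumptions: the field names (`rollback_plan`, `documented_at`, `type`) and the function name are hypothetical, and the storage location for records is still undecided in this spec.

```python
from datetime import datetime, timedelta

REQUIRED_FIELDS = ("who", "what", "when", "why", "rollback_plan")
CHANGE_TYPES = ("standard", "normal", "emergency")
EMERGENCY_DOC_WINDOW = timedelta(hours=24)


def validate_change(record):
    """Return a list of problems with a change record; an empty list means it passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not record.get(name)]
    if record.get("type") not in CHANGE_TYPES:
        problems.append("type must be standard, normal, or emergency")
    if record.get("type") == "emergency":
        executed, documented = record.get("when"), record.get("documented_at")
        if executed is None or documented is None:
            problems.append("emergency change needs execution and documentation times")
        elif documented - executed > EMERGENCY_DOC_WINDOW:
            problems.append("emergency change documented more than 24 hours after execution")
    return problems
```

An emergency change executed at 10:00 and documented at 20:00 the same day passes. The same change documented two days later is reported as late.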

15. Decision Log Template

Sections:

Date
Decision (one sentence)
Context (why this came up)
Options considered (brief list)
Decision rationale (why this option was chosen)
Consequences accepted (known trade-offs)
Revisit date (optional)

Includes a filled example.

Future Pages (Suggested)

On-call rotation schedule
Incident response playbook
Post-incident review template
Access request procedure
Offboarding checklist (removing a system from Anchor management)
SLA definitions per client
Logging standards (structured log format requirements)
Metrics naming conventions
Network security policy
Vulnerability management policy
Capacity planning process
Disaster recovery plan
