Anchor SOP Site Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Create a Docusaurus 3 SOP site for Anchor MSP with 13 sidebar sections and 15 starter documentation pages.

Architecture: Static Docusaurus 3 site with a sidebar autogenerated from the folder structure. Each of the 13 sections is a folder under docs/ with a _category_.json for ordering. Content is operational, field-manual-style markdown. Deployed to Vercel from git.

Tech Stack: Docusaurus 3, Node.js 18+, Markdown, Vercel


File Map

Docusaurus scaffold (Task 1)

  • Create: package.json
  • Create: docusaurus.config.js
  • Create: sidebars.js
  • Create: src/css/custom.css
  • Create: static/.gitkeep
  • Create: .gitignore

Landing page (Task 2)

  • Create: docs/intro.md

Category scaffolds (Task 3)

  • Create: docs/production-ownership/_category_.json
  • Create: docs/client-onboarding/_category_.json
  • Create: docs/system-handoff-acceptance/_category_.json
  • Create: docs/monitoring-and-alerting/_category_.json
  • Create: docs/logging-and-metrics/_category_.json
  • Create: docs/backup-and-restore/_category_.json
  • Create: docs/secrets-management/_category_.json
  • Create: docs/security-operations/_category_.json
  • Create: docs/incident-response/_category_.json
  • Create: docs/change-management/_category_.json
  • Create: docs/access-control/_category_.json
  • Create: docs/runbooks/_category_.json
  • Create: docs/decision-logs/_category_.json

Content pages — Governance (Task 4)

  • Create: docs/production-ownership/production-ownership-policy.md

Content pages — Onboarding & Handoff (Task 5)

  • Create: docs/system-handoff-acceptance/managed-production-acceptance-criteria.md
  • Create: docs/system-handoff-acceptance/handoff-acceptance-checklist.md
  • Create: docs/client-onboarding/new-managed-system-onboarding-checklist.md

Content pages — Monitoring & Alerting (Task 6)

  • Create: docs/monitoring-and-alerting/alert-severity-matrix.md
  • Create: docs/monitoring-and-alerting/slack-alert-routing-policy.md
  • Create: docs/monitoring-and-alerting/sms-escalation-policy.md

Content pages — Infrastructure Ops (Task 7)

  • Create: docs/backup-and-restore/backup-policy.md
  • Create: docs/backup-and-restore/restore-testing-policy.md
  • Create: docs/secrets-management/secrets-management-policy.md

Content pages — Security (Task 8)

  • Create: docs/security-operations/security-monitoring-policy.md
  • Create: docs/security-operations/cloudflare-ownership-admin-policy.md

Content pages — Process & Templates (Task 9)

  • Create: docs/change-management/production-change-policy.md
  • Create: docs/runbooks/runbook-template.md
  • Create: docs/decision-logs/decision-log-template.md

Verification (Task 10)

  • No new files. Build and verify site renders correctly.

Task 1: Docusaurus Scaffold

Files:

  • Create: package.json

  • Create: docusaurus.config.js

  • Create: sidebars.js

  • Create: src/css/custom.css

  • Create: static/.gitkeep

  • Create: .gitignore

  • Step 1: Initialize Docusaurus project

Run from the project root /Users/elliottgodwin/Developer/anchor-sop:

npx create-docusaurus@latest temp-docusaurus classic --javascript

This creates a temp-docusaurus/ directory with the full scaffold.

  • Step 2: Move scaffold files to project root

cp temp-docusaurus/package.json .
cp temp-docusaurus/docusaurus.config.js .
cp temp-docusaurus/sidebars.js .
cp temp-docusaurus/babel.config.js .
cp -r temp-docusaurus/src .
cp -r temp-docusaurus/static .
cp temp-docusaurus/.gitignore .
rm -rf temp-docusaurus

  • Step 3: Configure docusaurus.config.js

Replace the entire contents of docusaurus.config.js with:

// @ts-check

/** @type {import('@docusaurus/types').Config} */
const config = {
  title: 'Anchor SOP',
  tagline: 'Standard Operating Procedures for Anchor MSP',
  favicon: 'img/favicon.ico',

  url: 'https://anchor-sop.vercel.app',
  baseUrl: '/',

  onBrokenLinks: 'throw',
  onBrokenMarkdownLinks: 'warn',

  i18n: {
    defaultLocale: 'en',
    locales: ['en'],
  },

  presets: [
    [
      'classic',
      /** @type {import('@docusaurus/preset-classic').Options} */
      ({
        docs: {
          routeBasePath: '/',
          sidebarPath: './sidebars.js',
        },
        blog: false,
        theme: {
          customCss: './src/css/custom.css',
        },
      }),
    ],
  ],

  themeConfig:
    /** @type {import('@docusaurus/preset-classic').ThemeConfig} */
    ({
      navbar: {
        title: 'Anchor SOP',
      },
      footer: {
        style: 'dark',
        copyright: `Anchor MSP — Internal Use Only`,
      },
    }),
};

export default config;

Key decisions:

  • routeBasePath: '/' makes docs the root (no /docs/ prefix in URLs)

  • blog: false removes the blog feature entirely

  • No extra navbar items — just the title

  • Footer marks it as internal use only

  • Step 4: Configure sidebars.js

Replace the entire contents of sidebars.js with:

/** @type {import('@docusaurus/plugin-content-docs').SidebarsConfig} */
const sidebars = {
  sopSidebar: [
    {
      type: 'autogenerated',
      dirName: '.',
    },
  ],
};

export default sidebars;

This autogenerates the sidebar from the folder structure. Category ordering is controlled by _category_.json files in each folder.
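
For comparison, a hand-maintained sidebar covering part of the same structure might look like the sketch below. It is illustrative only and not part of this plan: the variable name `explicitSidebars` is hypothetical, and the doc IDs are assumed to follow the file paths in the File Map.

```javascript
// Hypothetical hand-written alternative to the autogenerated sidebar.
// Doc IDs are assumed from the file paths listed in the File Map.
const explicitSidebars = {
  sopSidebar: [
    'intro',
    {
      type: 'category',
      label: 'Production Ownership',
      items: ['production-ownership/production-ownership-policy'],
    },
    {
      type: 'category',
      label: 'Monitoring and Alerting',
      items: [
        'monitoring-and-alerting/alert-severity-matrix',
        'monitoring-and-alerting/slack-alert-routing-policy',
        'monitoring-and-alerting/sms-escalation-policy',
      ],
    },
  ],
};
```

Every new page would need a matching entry in a file like this, which is the maintenance cost the autogenerated sidebar avoids.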

  • Step 5: Clean up src/css/custom.css

Replace the entire contents of src/css/custom.css with:

:root {
  --ifm-color-primary: #1a1a2e;
  --ifm-color-primary-dark: #16162a;
  --ifm-color-primary-darker: #121226;
  --ifm-color-primary-darkest: #0a0a1a;
  --ifm-color-primary-light: #1e1e32;
  --ifm-color-primary-lighter: #222236;
  --ifm-color-primary-lightest: #2a2a42;
  --ifm-font-family-base: system-ui, -apple-system, sans-serif;
}

[data-theme='dark'] {
  --ifm-color-primary: #a0a0c0;
  --ifm-color-primary-dark: #8e8eb0;
  --ifm-color-primary-darker: #8282a8;
  --ifm-color-primary-darkest: #64648c;
  --ifm-color-primary-light: #b2b2d0;
  --ifm-color-primary-lighter: #bebede;
  --ifm-color-primary-lightest: #d8d8f0;
}

  • Step 6: Remove default docs and pages

rm -rf docs/tutorial-basics docs/tutorial-extras docs/intro.md
rm -rf src/components src/pages

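This removes the default landing page and its components (docs are served from the site root). Step 2 did not copy the template's docs/ folder, so the first command is only a safeguard against leftover tutorial content.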

  • Step 7: Install dependencies

npm install

Expected: Clean install with no errors.

  • Step 8: Verify build scaffolding works

npx docusaurus build 2>&1 | tail -20

Note: This will fail because there are no docs yet. That's expected. We just want to confirm the config parses without syntax errors. If you see a config parse error, fix it. If you see "no docs found" or similar, that's fine — we'll add docs next.

  • Step 9: Commit

git add package.json package-lock.json docusaurus.config.js sidebars.js babel.config.js src/ static/ .gitignore
git commit -m "feat: scaffold Docusaurus 3 project for Anchor SOP site"

Task 2: Landing Page

Files:

  • Create: docs/intro.md

  • Step 1: Create the landing page

Create docs/intro.md with:

---
sidebar_position: 0
slug: /
title: Anchor SOP
---

# Anchor MSP — Standard Operating Procedures

This is the internal field manual for Anchor MSP operations. It covers every system and process Anchor owns once a system enters managed production.

## What Anchor Owns

Anchor's responsibility begins at **handoff acceptance**. When a development team (EGI or Mast) declares a system ready for production, Anchor takes ownership of:

- **Monitoring and alerting** — Uptime Kuma, Prometheus, Alertmanager, Grafana
- **Logging** — Loki, structured log aggregation
- **Backups** — Restic orchestration, restore testing
- **Secrets** — Vault, rotation, access control
- **Security** — Wazuh, CrowdSec, Cloudflare administration
- **Incident response** — detection, triage, resolution, post-incident review
- **Change management** — production change approval and logging

## What Anchor Does Not Own

- Application code and feature development (EGI, Mast)
- Pre-production environments (EGI, Mast)
- Application-level testing (EGI, Mast)

Development teams deploy their own code. Anchor owns the production environment those deploys target. Anchor can halt deploys if production stability is at risk.

## Operational Stack

| Tool | Role |
|------|------|
| Grafana | Dashboards and visualization |
| Prometheus | Metrics collection |
| Loki | Log aggregation |
| Alertmanager | Alert routing |
| Uptime Kuma | Uptime monitoring |
| Wazuh | Host intrusion detection, file integrity |
| CrowdSec | Threat intelligence, automated blocking |
| PostHog | Product analytics |
| Restic | Backup orchestration |
| Vault | Secrets management |
| Slack | Alert channels, team communication |
| Twilio | SMS escalation |

## Principle

Runtime truth lives in the operational stack — Grafana dashboards, Prometheus metrics, Wazuh alerts, Vault audit logs. Not in CI/CD metadata. Not in repo configs. If the stack says it's down, it's down.

  • Step 2: Verify the landing page renders

npx docusaurus build 2>&1 | tail -5

Expected: Build succeeds (or succeeds with warnings about empty categories, which is fine).

  • Step 3: Commit

git add docs/intro.md
git commit -m "feat: add landing page with ownership boundaries and stack overview"

Task 3: Category Scaffolds

Files:

  • Create: 13 _category_.json files, one per section folder

  • Step 1: Create all category directories and _category_.json files

Create the following 13 files. Each file controls the sidebar label and position for its section.

docs/production-ownership/_category_.json:

{
  "label": "Production Ownership",
  "position": 1
}

docs/client-onboarding/_category_.json:

{
  "label": "Client Onboarding",
  "position": 2
}

docs/system-handoff-acceptance/_category_.json:

{
  "label": "System Handoff Acceptance",
  "position": 3
}

docs/monitoring-and-alerting/_category_.json:

{
  "label": "Monitoring and Alerting",
  "position": 4
}

docs/logging-and-metrics/_category_.json:

{
  "label": "Logging and Metrics",
  "position": 5
}

docs/backup-and-restore/_category_.json:

{
  "label": "Backup and Restore",
  "position": 6
}

docs/secrets-management/_category_.json:

{
  "label": "Secrets Management",
  "position": 7
}

docs/security-operations/_category_.json:

{
  "label": "Security Operations",
  "position": 8
}

docs/incident-response/_category_.json:

{
  "label": "Incident Response",
  "position": 9
}

docs/change-management/_category_.json:

{
  "label": "Change Management",
  "position": 10
}

docs/access-control/_category_.json:

{
  "label": "Access Control",
  "position": 11
}

docs/runbooks/_category_.json:

{
  "label": "Runbooks",
  "position": 12
}

docs/decision-logs/_category_.json:

{
  "label": "Decision Logs",
  "position": 13
}
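
Because the 13 files differ only in label and position, they could also be generated rather than typed by hand. The following is a minimal Node.js sketch under that assumption; the function names `buildCategoryFiles` and `writeCategoryFiles` are hypothetical and not part of the plan.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Section folders in sidebar order, with the labels used in the files above.
const sections = [
  ['production-ownership', 'Production Ownership'],
  ['client-onboarding', 'Client Onboarding'],
  ['system-handoff-acceptance', 'System Handoff Acceptance'],
  ['monitoring-and-alerting', 'Monitoring and Alerting'],
  ['logging-and-metrics', 'Logging and Metrics'],
  ['backup-and-restore', 'Backup and Restore'],
  ['secrets-management', 'Secrets Management'],
  ['security-operations', 'Security Operations'],
  ['incident-response', 'Incident Response'],
  ['change-management', 'Change Management'],
  ['access-control', 'Access Control'],
  ['runbooks', 'Runbooks'],
  ['decision-logs', 'Decision Logs'],
];

// Build a map of relative file path -> file contents; position is 1-based.
function buildCategoryFiles(list) {
  const files = {};
  list.forEach(([dir, label], index) => {
    const body = JSON.stringify({ label, position: index + 1 }, null, 2);
    files[`docs/${dir}/_category_.json`] = body + '\n';
  });
  return files;
}

// Write each file under `root`, creating section folders as needed.
function writeCategoryFiles(files, root) {
  for (const [relPath, contents] of Object.entries(files)) {
    const target = path.join(root, relPath);
    fs.mkdirSync(path.dirname(target), { recursive: true });
    fs.writeFileSync(target, contents);
  }
}

// Dry run: list the paths that would be written.
console.log(Object.keys(buildCategoryFiles(sections)).join('\n'));
```

Calling `writeCategoryFiles(buildCategoryFiles(sections), '.')` from the project root would create the folders and files; the dry run above only prints the paths.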

  • Step 2: Commit

git add docs/
git commit -m "feat: add 13 section category scaffolds with sidebar ordering"

Task 4: Production Ownership Policy

Files:

  • Create: docs/production-ownership/production-ownership-policy.md

  • Step 1: Create the production ownership policy page

Create docs/production-ownership/production-ownership-policy.md with:

---
sidebar_position: 1
title: Production Ownership Policy
---

# Production Ownership Policy

**Owner:** Anchor MSP Operations Lead
**Last reviewed:** 2026-04-04

## Purpose

Define what Anchor owns, what development teams own, and where the boundary sits.

## Scope

All systems accepted into Anchor managed production. All Anchor operators. All development teams (EGI, Mast) handing off systems.

## Policy

1. Anchor owns all production infrastructure from the moment of handoff acceptance.
2. Anchor is the sole authority for production monitoring, logging, alerting, backups, secrets management, security operations, and incident response.
3. Development teams (EGI, Mast) own application code, feature development, pre-production environments, and application-level testing.
4. Application teams deploy their own code to production. Anchor owns the production environment those deploys target.
5. Anchor can halt any deploy if production stability is at risk.
6. Runtime truth lives in the operational stack (Grafana, Prometheus, Loki, Wazuh, Vault). CI/CD metadata and repo configs are not authoritative for production state.

## Responsibility Matrix

| Domain | Anchor | EGI / Mast |
|--------|--------|------------|
| Production monitoring | Owns | |
| Alerting and escalation | Owns | Receives alerts for app-level issues |
| Log aggregation | Owns infrastructure | Produces structured logs |
| Backups and restore | Owns | Identifies backup-eligible data |
| Secrets management | Owns Vault, rotation, access | Declares secrets during handoff |
| Security (host, network, edge) | Owns | |
| Incident response | Owns triage, resolution | Provides app-level expertise when escalated |
| DNS and Cloudflare | Owns | Requests changes via Anchor |
| Deploys to production | Provides the environment | Executes deploys |
| Application code | | Owns |
| Pre-production environments | | Owns |
| Application testing | | Owns |

## Exceptions

No exceptions to production ownership boundaries without written approval from Anchor Operations Lead. Temporary exceptions (e.g., granting a developer direct production access for debugging) must be time-boxed, logged, and revoked when complete.

  • Step 2: Commit

git add docs/production-ownership/production-ownership-policy.md
git commit -m "feat: add production ownership policy with responsibility matrix"

Task 5: Onboarding & Handoff Pages

Files:

  • Create: docs/system-handoff-acceptance/managed-production-acceptance-criteria.md

  • Create: docs/system-handoff-acceptance/handoff-acceptance-checklist.md

  • Create: docs/client-onboarding/new-managed-system-onboarding-checklist.md

  • Step 1: Create managed production acceptance criteria page

Create docs/system-handoff-acceptance/managed-production-acceptance-criteria.md with:

---
sidebar_position: 1
title: Managed Production Acceptance Criteria
---

# Managed Production Acceptance Criteria

**Owner:** Anchor MSP Operations Lead
**Last reviewed:** 2026-04-04

## Purpose

Define the minimum requirements a system must meet before Anchor accepts it into managed production.

## Scope

Every system being handed off from a development team (EGI or Mast) to Anchor.

## Acceptance Criteria

A system must satisfy all of the following before Anchor accepts ownership:

1. **Health check endpoint.** The system exposes an HTTP endpoint that returns its health status. Anchor uses this for uptime monitoring (Uptime Kuma) and liveness checks.
2. **Structured logging to stdout.** Application logs are written to stdout in a structured format (JSON preferred). Anchor aggregates these via Loki.
3. **Defined resource limits.** CPU and memory limits are documented and configured. Anchor monitors resource usage via Prometheus.
4. **Secrets documented.** All secrets the system uses are listed, with purpose and rotation requirements. Anchor migrates these to Vault during onboarding.
5. **Backup-eligible data identified.** The development team identifies which data stores need backups and classifies them (database, config, media/uploads).
6. **Monitoring hooks available.** The system either exposes a Prometheus metrics endpoint or produces log patterns that Anchor can build alerts from.
7. **Deployment process documented.** How to deploy the system, including any pre/post-deploy steps.
8. **Rollback procedure documented.** How to roll back a failed deploy, including any data migration considerations.

## What Happens If Criteria Are Not Met

The system goes back to the development team with a gap list. Anchor does not accept partial handoffs. Every item above must be satisfied before Anchor signs off on the handoff acceptance checklist.
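
For plan readers, and not part of the page content above: criteria 1 and 2 can be illustrated with a small Node.js service. This is a sketch under stated assumptions, not a required implementation. The `/healthz` path, the response body, and the log field names (`ts`, `level`, `message`) are hypothetical; the criteria only ask for an HTTP health endpoint and structured logs on stdout.

```javascript
const http = require('node:http');

// Write one structured (JSON) log line to stdout and return the entry.
function logJson(level, message, fields = {}) {
  const entry = { ts: new Date().toISOString(), level, message, ...fields };
  process.stdout.write(JSON.stringify(entry) + '\n');
  return entry;
}

// Respond 200 with a small JSON body on the health path, 404 otherwise.
// The '/healthz' path is an assumption; any agreed path would do.
function handleRequest(req, res) {
  if (req.method === 'GET' && req.url === '/healthz') {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ status: 'ok' }));
    return;
  }
  res.writeHead(404, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ error: 'not found' }));
  logJson('warn', 'unknown route', { method: req.method, url: req.url });
}

// Create the server without starting it; the caller chooses the port.
function createHealthServer() {
  return http.createServer(handleRequest);
}
```

A service would start it with something like `createHealthServer().listen(3000)`, and the resulting URL is what an operator would add to Uptime Kuma during onboarding.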

  • Step 2: Create handoff acceptance checklist page

Create docs/system-handoff-acceptance/handoff-acceptance-checklist.md with:

---
sidebar_position: 2
title: Handoff Acceptance Checklist
---

# Handoff Acceptance Checklist

**Owner:** Anchor MSP Operations Lead
**Last reviewed:** 2026-04-04

## Purpose

Use this checklist when a development team (EGI or Mast) hands a system off to Anchor for managed production. Every item must be verified and signed off before Anchor accepts ownership.

## Checklist

- [ ] **Acceptance criteria met.** All items from the [Managed Production Acceptance Criteria](./managed-production-acceptance-criteria.md) are satisfied.
- [ ] **Monitoring configured and producing data.** Uptime Kuma checks are live. Prometheus is scraping metrics. Grafana dashboards are set up.
- [ ] **Alerts routing to correct Slack channels.** Alertmanager routes are configured. Test alerts have been fired and confirmed in `#alerts-critical`, `#alerts-high`, `#alerts-medium`, and `#alerts-low`.
- [ ] **Critical alert SMS escalation verified.** A test critical alert has triggered SMS delivery via Twilio to the on-call operator.
- [ ] **Backups configured and first backup completed.** Restic jobs are running. First backup has completed successfully. Backup metrics are reporting to Prometheus.
- [ ] **Secrets migrated to Vault.** All secrets listed during acceptance criteria are stored in Vault. Application is configured to read secrets from Vault at runtime.
- [ ] **Access provisioned for Anchor operators.** Anchor team has the access needed to operate, monitor, and troubleshoot the system.
- [ ] **Runbook created and reviewed.** A system runbook exists using the [Runbook Template](/runbooks/runbook-template). It has been reviewed by an Anchor operator.
- [ ] **Escalation contacts confirmed.** Development team has provided contacts for application-level escalation (name, role, Slack handle, phone).
- [ ] **Development team acknowledges handoff complete.** The development team confirms they understand Anchor now owns production operations for this system.

## After Completion

Once all items are checked, the system is officially under Anchor management. The handoff date is recorded and the system's runbook becomes the operational reference.

  • Step 3: Create new managed system onboarding checklist page

Create docs/client-onboarding/new-managed-system-onboarding-checklist.md with:

---
sidebar_position: 1
title: New Managed System Onboarding Checklist
---

# New Managed System Onboarding Checklist

**Owner:** Anchor MSP Operations Lead
**Last reviewed:** 2026-04-04

## Purpose

Step-by-step operational checklist for Anchor operators when bringing a new system into managed production. Use this after the development team has submitted a handoff request.

## Checklist

Complete these steps in order:

- [ ] **1. Receive handoff request.** Development team (EGI or Mast) submits a request to hand off a system to Anchor.
- [ ] **2. Validate acceptance criteria.** Review the system against the [Managed Production Acceptance Criteria](/system-handoff-acceptance/managed-production-acceptance-criteria). If any criteria are not met, return a gap list to the development team. Do not proceed until all criteria are satisfied.
- [ ] **3. Provision uptime monitoring.** Add the system's health check endpoint to Uptime Kuma. Confirm checks are returning healthy.
- [ ] **4. Configure metrics scraping.** Add the system's Prometheus metrics endpoint to the scrape configuration. Verify metrics are appearing in Prometheus.
- [ ] **5. Set up dashboards.** Create a Grafana dashboard for the system covering key metrics: uptime, response time, error rate, resource usage.
- [ ] **6. Configure log aggregation.** Ensure the system's stdout logs are being collected by Loki. Verify logs appear in Grafana Explore.
- [ ] **7. Set up alerting rules.** Create Prometheus alerting rules for the system; Alertmanager routes the resulting alerts by severity. At minimum: health check failure (critical), high error rate (high), resource threshold breach (medium).
- [ ] **8. Configure SMS escalation.** Verify that critical alerts for this system trigger Twilio SMS delivery to the on-call operator.
- [ ] **9. Configure backups.** Set up Restic backup jobs for the system's identified data stores. Run the first backup manually and confirm completion. Add backup success/failure metrics to Prometheus.
- [ ] **10. Onboard secrets to Vault.** Migrate all secrets listed in the acceptance criteria to Vault. Configure the application to read from Vault at runtime. Verify the application starts and functions correctly with Vault-sourced secrets.
- [ ] **11. Register in PostHog.** Set up PostHog tracking for the system if applicable. Confirm events are flowing.
- [ ] **12. Add to CrowdSec protection.** Register the system's public-facing endpoints with CrowdSec. Verify the bouncer is active.
- [ ] **13. Configure Wazuh monitoring.** Enroll the system's host(s) in Wazuh. Confirm the agent is reporting and file integrity monitoring is active.
- [ ] **14. Create system runbook.** Write a runbook for this system using the [Runbook Template](/runbooks/runbook-template). Include architecture, dependencies, common operations, troubleshooting steps, and escalation contacts.
- [ ] **15. Complete handoff acceptance.** Walk through the [Handoff Acceptance Checklist](/system-handoff-acceptance/handoff-acceptance-checklist) with the development team. Get sign-off from both sides.

## After Completion

The system is now under Anchor management. All future production operations follow Anchor SOPs.

  • Step 4: Commit

git add docs/system-handoff-acceptance/ docs/client-onboarding/
git commit -m "feat: add acceptance criteria, handoff checklist, and onboarding checklist"

Task 6: Monitoring & Alerting Pages

Files:

  • Create: docs/monitoring-and-alerting/alert-severity-matrix.md

  • Create: docs/monitoring-and-alerting/slack-alert-routing-policy.md

  • Create: docs/monitoring-and-alerting/sms-escalation-policy.md

  • Step 1: Create alert severity matrix page

Create docs/monitoring-and-alerting/alert-severity-matrix.md with:

---
sidebar_position: 1
title: Alert Severity Matrix
---

# Alert Severity Matrix

**Owner:** Anchor MSP Operations Lead
**Last reviewed:** 2026-04-04

## Purpose

Define alert severity levels, response expectations, and escalation rules. All alerting rules routed through Alertmanager must use these severity levels.

## Scope

All alerts generated by any monitored system under Anchor management.

## Severity Levels

| Severity | Alertmanager Label | Criteria | Slack Channel | Response Time | Escalation |
|----------|-------------------|----------|---------------|---------------|------------|
| Critical | `severity: critical` | Service down. Data loss risk. Complete outage. | `#alerts-critical` + SMS | 15 minutes | Immediate — SMS to on-call, escalation chain starts |
| High | `severity: high` | Degraded service. Partial outage. Major performance issue. | `#alerts-high` | 1 hour | After 1 hour unacknowledged — escalate to secondary |
| Medium | `severity: medium` | Anomaly detected. Threshold breach. Non-urgent degradation. | `#alerts-medium` | Next business day | None |
| Low | `severity: low` | Informational. Trend changes. Capacity warnings. | `#alerts-low` | Weekly review | None |

## Rules

1. Every alerting rule must include a `severity` label with one of the four values above.
2. A new managed system must have at minimum: one critical rule (health check failure), one high rule (error rate spike), and one medium rule (resource threshold).
3. Do not create alerts without a defined severity. Unclassified alerts are a policy violation.
4. Severity levels are reviewed quarterly. If an alert consistently fires without action, downgrade or remove it.

## Exceptions

Severity overrides require approval from the Operations Lead. Document the override reason in the alert rule comments.
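
For plan readers, and not part of the page content above: rules 1 to 3 lend themselves to an automated check. The sketch below assumes each alert rule has already been loaded into an object shaped like `{ alert, labels: { severity } }`; that shape and the function name `checkAlertRules` are hypothetical. It checks only severity labels and counts, not what each rule actually measures.

```javascript
const SEVERITIES = ['critical', 'high', 'medium', 'low'];

// Return a list of policy problems for one system's alert rules.
// Each rule is assumed to look like { alert: 'Name', labels: { severity: '...' } }.
function checkAlertRules(rules) {
  const problems = [];

  // Rules 1 and 3: every rule carries one of the four severity values.
  for (const rule of rules) {
    const severity = rule.labels && rule.labels.severity;
    if (!SEVERITIES.includes(severity)) {
      problems.push(`${rule.alert}: missing or unknown severity`);
    }
  }

  // Rule 2: at least one critical, one high, and one medium rule.
  for (const required of ['critical', 'high', 'medium']) {
    const found = rules.some((r) => r.labels && r.labels.severity === required);
    if (!found) problems.push(`no ${required} rule defined`);
  }

  return problems;
}
```

An empty result means the rule set passes these checks; any entries could be returned to the rule author as a gap list.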

  • Step 2: Create Slack alert routing policy page

Create docs/monitoring-and-alerting/slack-alert-routing-policy.md with:

---
sidebar_position: 2
title: Slack Alert Routing Policy
---

# Slack Alert Routing Policy

**Owner:** Anchor MSP Operations Lead
**Last reviewed:** 2026-04-04

## Purpose

Define how alerts route from Alertmanager to Slack channels. Every alert must land in a monitored channel.

## Scope

All alerts from all managed systems.

## Channel Structure

| Channel | Severity | Monitoring Expectation |
|---------|----------|----------------------|
| `#alerts-critical` | Critical | Monitored 24/7. Also triggers SMS. |
| `#alerts-high` | High | Monitored during business hours and by on-call. |
| `#alerts-medium` | Medium | Reviewed next business day. |
| `#alerts-low` | Low | Reviewed weekly. |

## Policy

1. Alertmanager routes alerts by the `severity` label to the matching Slack channel via webhook integration.
2. No alert routes to a general channel, a personal DM, or an unmonitored destination.
3. Every managed system must have its alerts routing to these four channels before the handoff acceptance checklist is complete.
4. Critical alerts post to `#alerts-critical` **and** trigger SMS escalation. Both must fire.
5. Alert messages must include: system name, alert name, severity, description, and a link to the relevant Grafana dashboard.
6. Silencing an alert in Alertmanager must be logged with a reason and an expiration time. Open-ended silences are not permitted.

## Exceptions

Routing a specific alert to an additional channel (e.g., a system-specific channel) is allowed as long as it also routes to the standard severity channel. The standard channel is never bypassed.
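
For plan readers, and not part of the page content above: the routing policy can be expressed as a small lookup plus a completeness check. This sketch models the policy in plain JavaScript and is not Alertmanager configuration; the field names (`system`, `alertName`, `dashboardUrl`) and the function `routeAlert` are hypothetical.

```javascript
// Standard channel for each severity, from the Channel Structure table.
const CHANNEL_BY_SEVERITY = {
  critical: '#alerts-critical',
  high: '#alerts-high',
  medium: '#alerts-medium',
  low: '#alerts-low',
};

// Fields every alert message must carry (policy item 5).
const REQUIRED_FIELDS = ['system', 'alertName', 'severity', 'description', 'dashboardUrl'];

// Resolve destinations for an alert. Throws instead of falling back to an
// unmonitored destination; extra channels never replace the standard one.
function routeAlert(alert, extraChannels = []) {
  const missing = REQUIRED_FIELDS.filter((field) => !alert[field]);
  if (missing.length > 0) {
    throw new Error(`alert is missing: ${missing.join(', ')}`);
  }
  const standard = CHANNEL_BY_SEVERITY[alert.severity];
  if (!standard) {
    throw new Error(`unknown severity: ${alert.severity}`);
  }
  return {
    channels: [standard, ...extraChannels],
    sms: alert.severity === 'critical', // policy item 4
  };
}
```

The standard channel is always first in the result, which mirrors the exception rule: additional channels are allowed, bypassing the standard one is not.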

  • Step 3: Create SMS escalation policy page

Create docs/monitoring-and-alerting/sms-escalation-policy.md with:

---
sidebar_position: 3
title: SMS Escalation Policy
---

# SMS Escalation Policy

**Owner:** Anchor MSP Operations Lead
**Last reviewed:** 2026-04-04

## Purpose

Define when and how SMS escalation is triggered for production alerts.

## Scope

Critical severity alerts from all managed systems.

## Policy

1. SMS escalation is triggered only by **critical severity** alerts.
2. SMS is delivered via Twilio to the on-call operator's registered phone number.
3. Escalation chain:
   - **0 minutes:** SMS sent to **primary on-call operator**.
   - **10 minutes:** If unacknowledged, SMS sent to **secondary on-call operator**.
   - **20 minutes:** If still unacknowledged, SMS sent to **team lead**.
4. Acknowledgment: The on-call operator must post in `#alerts-critical` within 15 minutes of receiving the SMS. Posting confirms they are investigating.
5. If no one acknowledges within 30 minutes, the incident is automatically escalated to the Operations Lead.

## On-Call Rotation

On-call rotation schedule is maintained separately. Each operator's phone number is registered in the escalation system. Operators are responsible for keeping their contact information current.

## Testing

SMS escalation is tested monthly. A test critical alert is fired and the full escalation chain is verified. Test results are logged.

## Exceptions

No exceptions. Critical alerts always trigger SMS. If an alert is incorrectly classified as critical, fix the classification — do not disable SMS escalation.
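
For plan readers, and not part of the page content above: the escalation chain is a time-ordered list, so it can be modeled as data. The sketch below assumes time is measured in minutes since the alert fired; the function name `contactsReached` is hypothetical, and the 30-minute step records the escalation to the Operations Lead without assuming which channel delivers it.

```javascript
// Escalation steps from the policy: minutes since the alert fired -> contact.
const ESCALATION_STEPS = [
  { afterMinutes: 0, contact: 'primary on-call operator' },
  { afterMinutes: 10, contact: 'secondary on-call operator' },
  { afterMinutes: 20, contact: 'team lead' },
  { afterMinutes: 30, contact: 'Operations Lead' },
];

// List who has been contacted after `minutesElapsed` minutes.
// A step fires only if the alert is still unacknowledged at that time,
// so pass `acknowledgedAtMinute` to stop the chain at the acknowledgment.
function contactsReached(minutesElapsed, acknowledgedAtMinute = null) {
  return ESCALATION_STEPS
    .filter((step) => step.afterMinutes <= minutesElapsed)
    .filter(
      (step) => acknowledgedAtMinute === null || acknowledgedAtMinute > step.afterMinutes
    )
    .map((step) => step.contact);
}
```

For example, `contactsReached(45, 12)` returns only the primary and secondary operators, because the acknowledgment at minute 12 stops the chain before the team lead is contacted.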

  • Step 4: Commit

git add docs/monitoring-and-alerting/
git commit -m "feat: add alert severity matrix, Slack routing, and SMS escalation policies"

Task 7: Infrastructure Ops Pages

Files:

  • Create: docs/backup-and-restore/backup-policy.md

  • Create: docs/backup-and-restore/restore-testing-policy.md

  • Create: docs/secrets-management/secrets-management-policy.md

  • Step 1: Create backup policy page

Create docs/backup-and-restore/backup-policy.md with:

---
sidebar_position: 1
title: Backup Policy
---

# Backup Policy

**Owner:** Anchor MSP Operations Lead
**Last reviewed:** 2026-04-04

## Purpose

Define how production data is backed up, how often, and how backup health is monitored.

## Scope

All data stores in all systems under Anchor managed production.

## Policy

1. All production data is backed up using Restic orchestration. No exceptions.
2. Backup frequency by data classification:
   - **Databases:** Daily.
   - **Configuration files:** On change.
   - **Media and uploads:** Daily.
3. Retention schedule:
   - **7 daily** snapshots retained.
   - **4 weekly** snapshots retained.
   - **6 monthly** snapshots retained.
4. Backups are stored off-host. Backup storage location and encryption are managed by Anchor. Development teams do not have direct access to backup storage.
5. Backup success and failure are tracked via Prometheus metrics. Every Restic job reports completion status.
6. A failed backup generates a **High** severity alert (see [Alert Severity Matrix](/monitoring-and-alerting/alert-severity-matrix)).
7. Backup configuration for each system is documented in that system's runbook.
8. The first backup for a new system is run manually during onboarding and verified before handoff acceptance is complete.

## Exceptions

Systems with no persistent data (stateless services) may be exempt from backups. The exemption must be documented in the system's runbook with justification.
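
For plan readers, and not part of the page content above: the retention schedule maps directly onto the `--keep-daily`, `--keep-weekly`, and `--keep-monthly` options of `restic forget`. The sketch below only builds the argument list. The function name `resticForgetArgs` and the example repository path are hypothetical, and adding `--prune` is an assumption the policy does not state.

```javascript
// Retention counts from policy item 3.
const RETENTION = { daily: 7, weekly: 4, monthly: 6 };

// Build the argument list for a `restic forget` run against one repository.
// `--prune` (assumed here) also deletes data no longer referenced by any snapshot.
function resticForgetArgs(repository, retention = RETENTION) {
  return [
    '--repo', repository,
    'forget',
    '--keep-daily', String(retention.daily),
    '--keep-weekly', String(retention.weekly),
    '--keep-monthly', String(retention.monthly),
    '--prune',
  ];
}

// Show the equivalent command line for a hypothetical repository path.
console.log('restic ' + resticForgetArgs('/srv/restic-repo').join(' '));
```

Keeping the counts in one object means a retention change is made in a single place and applies to every job that builds its arguments this way.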

  • Step 2: Create restore testing policy page

Create docs/backup-and-restore/restore-testing-policy.md with:

---
sidebar_position: 2
title: Restore Testing Policy
---

# Restore Testing Policy

**Owner:** Anchor MSP Operations Lead
**Last reviewed:** 2026-04-04

## Purpose

Ensure backups are actually restorable. A backup that cannot be restored is not a backup.

## Scope

All systems under Anchor managed production that have backups configured.

## Policy

1. Restore tests are run **quarterly** for every managed system with backups.
2. Restore test procedure:
   - Select the most recent daily backup.
   - Restore it to an isolated environment (never to production).
   - Verify data integrity: row counts, checksums, or application-level validation as appropriate.
   - Document the result: pass or fail, time taken, any issues encountered.
3. Restore test results are logged and tracked. Each system's restore test history is maintained.
4. A **failed restore test is treated as an incident.** It triggers immediate investigation. The backup configuration is reviewed, the issue is resolved, and the restore test is re-run until it passes.
5. Any system with no successful restore test in the past quarter is flagged for review by the Operations Lead.

## Exceptions

None. If a system has backups, it has restore tests. No exceptions.
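
For plan readers, and not part of the page content above: policy item 5 can be checked mechanically if restore test results are kept as records. The sketch below assumes a record shape of `{ system, date, passed }` with ISO dates, and approximates "the past quarter" as 90 days; both are assumptions, as is the function name `systemsNeedingReview`.

```javascript
// Assumption: "the past quarter" is approximated as 90 days.
const QUARTER_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Return the systems with no passing restore test in the last quarter.
// `history` entries are assumed to look like
// { system: 'name', date: 'YYYY-MM-DD', passed: true }.
function systemsNeedingReview(systems, history, today) {
  const cutoff = new Date(today).getTime() - QUARTER_DAYS * MS_PER_DAY;
  return systems.filter((system) => {
    const recentPass = history.some(
      (test) =>
        test.system === system &&
        test.passed &&
        new Date(test.date).getTime() >= cutoff
    );
    return !recentPass;
  });
}
```

A failed test does not count toward compliance, so a system whose only recent test failed is still flagged for review.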

  • Step 3: Create secrets management policy page

Create docs/secrets-management/secrets-management-policy.md with:

---
sidebar_position: 1
title: Secrets Management Policy
---

# Secrets Management Policy

**Owner:** Anchor MSP Operations Lead
**Last reviewed:** 2026-04-04

## Purpose

Define how production secrets are stored, accessed, and rotated.

## Scope

All secrets used by all systems under Anchor managed production. This includes API keys, database credentials, TLS certificates, encryption keys, and any other sensitive configuration.

## Policy

1. All production secrets are stored in **Vault**. No exceptions.
2. Secrets are **never** stored in:
   - Environment files (`.env`)
   - Git repositories
   - CI/CD variables at rest
   - Config files on disk
   - Slack messages, emails, or documents
3. Secrets are injected at runtime by Vault using **service identity**. Applications authenticate to Vault as a service, not as a person.
4. Rotation schedule:
   - **API keys:** Quarterly.
   - **Database credentials:** Quarterly.
   - **TLS certificates:** Auto-renewed.
5. Access to secrets is scoped by service identity. A service can only read the secrets it needs. No service has broad access.
6. All secret access is audited via the **Vault audit log**. The audit log is retained and reviewed.
7. New secrets for a system are onboarded to Vault during the [system onboarding process](/client-onboarding/new-managed-system-onboarding-checklist).

## Exceptions

Emergency break-glass access to secrets is available to the Operations Lead. Break-glass access is logged, reviewed, and triggers a follow-up to determine why normal access was insufficient.

  • Step 4: Commit

git add docs/backup-and-restore/ docs/secrets-management/
git commit -m "feat: add backup, restore testing, and secrets management policies"

Task 8: Security Pages

Files:

  • Create: docs/security-operations/security-monitoring-policy.md

  • Create: docs/security-operations/cloudflare-ownership-admin-policy.md

  • Step 1: Create security monitoring policy page

Create docs/security-operations/security-monitoring-policy.md with:

---
sidebar_position: 1
title: Security Monitoring Policy
---

# Security Monitoring Policy

**Owner:** Anchor MSP Operations Lead
**Last reviewed:** 2026-04-04

## Purpose

Define how Anchor monitors managed systems for security threats.

## Scope

All hosts and public-facing endpoints under Anchor managed production.

## Policy

1. **Wazuh** runs on all managed hosts. It provides:
   - Host-level intrusion detection (rootkit checks, anomaly detection).
   - File integrity monitoring (FIM) on critical system files and application configs.
   - Log-based threat detection from system and application logs.
2. **CrowdSec** protects all public-facing endpoints. It provides:
   - Crowd-sourced threat intelligence: known malicious IPs are blocked automatically.
   - Automated IP blocking at the edge via bouncers.
   - Behavioral detection for brute force, scanning, and other attack patterns.
3. Security alerts from both Wazuh and CrowdSec route through **Alertmanager** into the standard alert pipeline:
   - Critical security events (active intrusion, confirmed compromise) → `#alerts-critical` + SMS.
   - High security events (suspicious activity, repeated blocked attempts) → `#alerts-high`.
   - Medium/low events (informational, trend changes) → `#alerts-medium` or `#alerts-low`.
4. A **weekly security review** is conducted using the Grafana security dashboard. The review covers:
   - Wazuh alerts and FIM changes from the past week.
   - CrowdSec block statistics and new threat patterns.
   - Any anomalies in Vault audit logs.
5. CrowdSec ban lists and Wazuh detection rules are updated on Anchor's schedule. Updates are tested before deployment.
6. Security incidents escalate per the incident response process.

## Exceptions

None. Every managed host runs Wazuh. Every public endpoint is protected by CrowdSec.
  • Step 2: Create Cloudflare ownership/admin policy page

Create docs/security-operations/cloudflare-ownership-admin-policy.md with:

---
sidebar_position: 2
title: Cloudflare Ownership and Administration Policy
---

# Cloudflare Ownership and Administration Policy

**Owner:** Anchor MSP Operations Lead
**Last reviewed:** 2026-04-04

## Purpose

Define who owns and administers Cloudflare for production domains.

## Scope

All Cloudflare accounts, DNS zones, and configurations for systems under Anchor managed production.

## Policy

1. Anchor owns all production Cloudflare accounts and DNS zones.
2. Development teams (EGI, Mast) **do not have direct Cloudflare access** for production domains.
3. Changes to the following require a request through Anchor's [change management process](/change-management/production-change-policy):
   - DNS records
   - WAF rules
   - Page rules
   - TLS certificate settings
   - Cache configuration
   - Access policies
4. Anchor maintains **Cloudflare API tokens scoped per service**. Tokens are stored in Vault. No broad-access tokens exist.
5. Emergency DNS changes (e.g., failover during an outage) follow the **emergency change process** — executed immediately by Anchor, documented retroactively within 24 hours.
6. Cloudflare account credentials and API tokens are rotated per the [Secrets Management Policy](/secrets-management/secrets-management-policy).

## Exceptions

Temporary read-only Cloudflare access may be granted to a development team for debugging. Access must be time-boxed, logged, and revoked when complete. Approval from the Operations Lead is required.
  • Step 3: Commit
git add docs/security-operations/
git commit -m "feat: add security monitoring and Cloudflare administration policies"
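
The severity routing in the Security Monitoring Policy can be pictured as a lookup table. This is a minimal sketch for illustration only: it assumes four severity labels (`critical`, `high`, `medium`, `low`) with medium and low each mapped to their own channel, and the `route_alert` function and the `"sms"` target label are hypothetical. Real routing would live in Alertmanager configuration rather than in application code.

```python
# Channel names come from the Security Monitoring Policy. The "sms" label is
# a hypothetical stand-in for whatever SMS receiver is configured.
ROUTES = {
    "critical": ["#alerts-critical", "sms"],
    "high": ["#alerts-high"],
    "medium": ["#alerts-medium"],
    "low": ["#alerts-low"],
}


def route_alert(severity):
    """Return the notification targets for a severity label."""
    key = severity.strip().lower()
    if key not in ROUTES:
        # Fail loudly so an unmapped severity is never silently dropped.
        raise ValueError(f"unknown severity: {severity!r}")
    return list(ROUTES[key])
```

Raising on an unknown label, rather than defaulting to a low-priority channel, is a design choice in this sketch; the policy itself does not say how unmapped severities should be handled.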

Task 9: Process & Template Pages

Files:

  • Create: docs/change-management/production-change-policy.md

  • Create: docs/runbooks/runbook-template.md

  • Create: docs/decision-logs/decision-log-template.md

  • Step 1: Create production change policy page

Create docs/change-management/production-change-policy.md with:

---
sidebar_position: 1
title: Production Change Policy
---

# Production Change Policy

**Owner:** Anchor MSP Operations Lead
**Last reviewed:** 2026-04-04

## Purpose

Define how changes to production systems are proposed, approved, executed, and recorded.

## Scope

All changes to production infrastructure, configuration, monitoring, alerting, backups, secrets, DNS, and security settings for systems under Anchor management. Application deploys initiated by development teams are not covered by this policy (but Anchor can halt them per the [Production Ownership Policy](/production-ownership/production-ownership-policy)).

## Change Types

### Standard Changes

Pre-approved, low-risk, routine changes. Examples: adding a new Grafana dashboard, updating an alert threshold, adding a new Uptime Kuma check.

- Documented before execution.
- Executed by any Anchor operator.
- Logged after completion.
- No approval step required.

### Normal Changes

Changes with meaningful impact that require review. Examples: modifying Alertmanager routing, changing backup schedules, updating Cloudflare WAF rules, rotating Vault credentials.

1. **Submitted:** Operator describes the change, reason, and rollback plan.
2. **Reviewed:** A second Anchor operator reviews the change.
3. **Approved:** Operations Lead approves (or the reviewer, for routine normal changes).
4. **Scheduled:** Change is assigned a time window.
5. **Executed:** Change is made.
6. **Verified:** Operator confirms the change works as expected. Monitoring is checked.

### Emergency Changes

Changes required immediately to restore service or prevent data loss. Examples: DNS failover during outage, emergency secret rotation after a leak, blocking a malicious IP.

- Executed immediately by the on-call operator.
- Documented retroactively **within 24 hours**.
- Reviewed by the Operations Lead after the fact.

## Change Records

Every change record includes:

- **Who:** Operator who made the change.
- **What:** Specific change made.
- **When:** Date and time of execution.
- **Why:** Reason for the change.
- **Rollback plan:** How to undo the change if it causes problems.

## Failed Changes

If a change causes unexpected problems:

1. Execute the rollback plan immediately.
2. Alert the team in Slack.
3. Treat the failed change as an incident — investigate root cause and document findings.

## Exceptions

No changes bypass this policy. If a change cannot wait for the normal process, use the emergency change process. Emergency changes still get documented.
  • Step 2: Create runbook template page

Create docs/runbooks/runbook-template.md. Important: This file contains code fences inside code fences. Wrap the Template and Example sections in 4-backtick fences (`` ```` ``) so that any inner triple-backtick content renders correctly in Docusaurus. Write the file with this exact content:

---
sidebar_position: 1
title: Runbook Template
---

# Runbook Template

## When to use this template

Create a new runbook when onboarding a system into Anchor managed production. Every managed system must have a runbook. Copy this template and fill in all sections.

## Template

````markdown
# [System Name] Runbook

**Anchor Operator:** [Name]
**Development Team Contact:** [Name, Slack handle, phone]
**Last updated:** [Date]

## Architecture

[2-3 sentences describing the system. What does it do? What are its main components? Describe the data flow: where does data come in, how is it processed, and where does it go?]

## Dependencies

| Dependency | Type | Impact if Unavailable |
|-----------|------|----------------------|
| [e.g., PostgreSQL] | Database | Service cannot start |
| [e.g., Redis] | Cache | Degraded performance |
| [e.g., Stripe API] | External API | Payments fail |

## Common Operations

### Start the service
[Exact command or procedure]

### Stop the service
[Exact command or procedure]

### Restart the service
[Exact command or procedure]

### Scale the service
[How to scale up/down, if applicable]

## Monitoring

| What to Check | Where |
|--------------|-------|
| Uptime | Uptime Kuma → [check name] |
| Metrics | Grafana → [dashboard name] |
| Logs | Grafana Explore → Loki → [label filter] |
| Alerts | Alertmanager → [alert group] |

## Troubleshooting

### [Failure Mode 1: e.g., "Service returns 502"]

**Symptoms:** [What you see in monitoring/logs]
**Likely cause:** [What usually causes this]
**Fix:** [Step-by-step resolution]

### [Failure Mode 2: e.g., "Database connection pool exhausted"]

**Symptoms:** [What you see in monitoring/logs]
**Likely cause:** [What usually causes this]
**Fix:** [Step-by-step resolution]

## Escalation Contacts

| Role | Name | Slack | Phone |
|------|------|-------|-------|
| Anchor Operator | [Name] | @handle | +1-xxx-xxx-xxxx |
| Dev Team Lead | [Name] | @handle | +1-xxx-xxx-xxxx |
| Dev Team Backend | [Name] | @handle | +1-xxx-xxx-xxxx |
````

## Example

````markdown
# Acme Payments API Runbook

**Anchor Operator:** Jamie Chen
**Development Team Contact:** Alex Rivera, @alex.rivera, +1-555-0142
**Last updated:** 2026-04-04

## Architecture

Acme Payments API is a Node.js REST API that processes payment transactions. It receives requests from the Acme web frontend, validates them, interacts with the Stripe API for payment processing, and stores transaction records in PostgreSQL. Redis is used for rate limiting and session caching.

## Dependencies

| Dependency | Type | Impact if Unavailable |
|-----------|------|----------------------|
| PostgreSQL 15 | Database | Service cannot start, all transactions fail |
| Redis 7 | Cache | Rate limiting disabled, sessions lost, degraded mode |
| Stripe API | External API | Payment processing fails, transactions queue |

## Common Operations

### Start the service
docker compose up -d acme-payments

### Stop the service
docker compose stop acme-payments

### Restart the service
docker compose restart acme-payments

### Scale the service
docker compose up -d --scale acme-payments=3

## Monitoring

| What to Check | Where |
|--------------|-------|
| Uptime | Uptime Kuma → acme-payments-health |
| Metrics | Grafana → Acme Payments Dashboard |
| Logs | Grafana Explore → Loki → {app="acme-payments"} |
| Alerts | Alertmanager → acme-payments group |

## Troubleshooting

### Service returns 502

**Symptoms:** Uptime Kuma reports DOWN. Grafana shows spike in 5xx responses.
**Likely cause:** Application crashed or ran out of memory.
**Fix:**
1. Check logs: Grafana Explore → Loki → {app="acme-payments"} for error messages.
2. Check resource usage: Grafana → Acme Payments Dashboard → Memory/CPU panel.
3. If OOM: restart the service and consider increasing memory limits.
4. If crash loop: check recent deploys, roll back if needed.

### Database connection pool exhausted

**Symptoms:** Logs show "connection pool timeout" errors. Response times spike.
**Likely cause:** Long-running queries or connection leak.
**Fix:**
1. Check active connections: psql → SELECT count(*) FROM pg_stat_activity WHERE application_name = 'acme-payments';
2. Kill idle-in-transaction connections older than 5 minutes.
3. Restart the service to reset the connection pool.
4. Escalate to dev team if the issue recurs; recurrence likely points to a query or connection leak in application code.

## Escalation Contacts

| Role | Name | Slack | Phone |
|------|------|-------|-------|
| Anchor Operator | Jamie Chen | @jamie.chen | +1-555-0198 |
| Dev Team Lead | Alex Rivera | @alex.rivera | +1-555-0142 |
| Dev Team Backend | Sam Okafor | @sam.okafor | +1-555-0167 |
````
  • Step 3: Create decision log template page

Create docs/decision-logs/decision-log-template.md. Important: Same nested code fence approach as the runbook template — use 4-backtick fences. Write the file with this exact content:

---
sidebar_position: 1
title: Decision Log Template
---

# Decision Log Template

## When to use this template

Record any significant operational decision: tool choices, policy changes, architecture decisions, process changes, or anything that might prompt future operators to ask, "Why did we do it this way?"

## Template

````markdown
## [YYYY-MM-DD] [Decision Title]

**Decision:** [One sentence stating what was decided.]

**Context:** [Why this decision came up. What problem or question triggered it.]

**Options considered:**
1. [Option A]: [Brief description]
2. [Option B]: [Brief description]
3. [Option C]: [Brief description, if applicable]

**Rationale:** [Why the chosen option was selected over the others.]

**Consequences accepted:** [Known trade-offs or downsides of this decision.]

**Revisit date:** [Optional. Date to re-evaluate this decision, if applicable.]
````

## Example

````markdown
## 2026-03-15 Use Restic for Backup Orchestration

**Decision:** Use Restic as the backup tool for all managed production systems.

**Context:** Anchor needed a backup solution for managed systems. Several options were evaluated during initial stack selection.

**Options considered:**
1. Restic: Deduplicating, encrypted, supports multiple backends, active development.
2. Borg: Similar to Restic but requires SSH access to backup server, less flexible backend support.
3. Cloud-native snapshots: Provider-specific (EBS snapshots, managed DB backups). Simple but locks us into a provider.

**Rationale:** Restic is platform-agnostic, supports encryption by default, handles deduplication well, and works with multiple storage backends (S3, local, SFTP). This aligns with Anchor's principle of staying platform-agnostic while keeping backups under our direct control.

**Consequences accepted:** Restic requires us to manage the backup infrastructure ourselves (storage, monitoring, orchestration). More operational overhead than cloud-native snapshots, but we control the entire pipeline.

**Revisit date:** 2027-03-15
````
  • Step 4: Commit
git add docs/change-management/ docs/runbooks/ docs/decision-logs/
git commit -m "feat: add production change policy, runbook template, and decision log template"
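
The nested-fence rule used in Steps 2 and 3 generalizes: an outer fence needs more backticks than the longest backtick run inside the content it wraps. The sketch below illustrates that rule. `wrap_in_fence` is a hypothetical helper, not a Docusaurus API, and it is deliberately conservative: it counts every backtick run, not only runs at the start of a line.

```python
import re


def wrap_in_fence(content, info=""):
    """Wrap content in a backtick fence longer than any backtick run inside it."""
    runs = re.findall(r"`+", content)
    longest = max((len(run) for run in runs), default=0)
    # A fence needs at least three backticks, and more than any inner run.
    fence = "`" * max(3, longest + 1)
    return f"{fence}{info}\n{content}\n{fence}"
```

For content that already contains a triple-backtick block, this yields a 4-backtick outer fence, which matches the approach the runbook template step asks for.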

Task 10: Build Verification

Files:

  • No new files. Verification only.

  • Step 1: Run Docusaurus build

cd /Users/elliottgodwin/Developer/anchor-sop && npx docusaurus build

Expected: Build succeeds. Warnings about empty categories (Logging and Metrics, Incident Response, Access Control) are acceptable — those categories have no pages yet.
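
To confirm which categories are actually empty before reading the build warnings, the docs tree can be inspected directly. This is a minimal sketch under two assumptions: each section is a folder under `docs/` that holds a `_category_.json` (as set up in Task 3), and a category counts as empty when it contains no `.md` or `.mdx` file. `empty_categories` is a hypothetical helper.

```python
from pathlib import Path


def empty_categories(docs_dir):
    """Return sorted names of category folders that contain no doc pages."""
    empty = []
    for folder in sorted(Path(docs_dir).iterdir()):
        if not folder.is_dir():
            continue  # top-level pages such as intro.md are not categories
        if not (folder / "_category_.json").is_file():
            continue  # not a category folder
        has_pages = any(
            path.is_file() and path.suffix in {".md", ".mdx"}
            for path in folder.rglob("*")
        )
        if not has_pages:
            empty.append(folder.name)
    return empty
```

Run from the repository root as `empty_categories("docs")`. If the earlier tasks completed as written, the result should correspond to the three sections named above; any other folder in the list would suggest a page was missed.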

  • Step 2: Start dev server and verify navigation
npx docusaurus start --port 3333

Open http://localhost:3333 in a browser. Verify:

  1. Landing page loads with the Anchor SOP title and ownership overview.
  2. Sidebar shows all 13 sections in the correct order.
  3. Each section with content pages shows its pages when expanded.
  4. Empty sections (Logging and Metrics, Incident Response, Access Control) appear in the sidebar.
  5. All internal links between pages work (e.g., the onboarding checklist links to the acceptance criteria page).
  6. The navbar shows "Anchor SOP" and nothing else.
  7. The footer shows "Anchor MSP — Internal Use Only".

Stop the dev server when done.
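
Item 5 can be partly automated before opening a browser. The sketch below assumes docs are served from the site root, so that a link such as `/secrets-management/secrets-management-policy` maps to `docs/secrets-management/secrets-management-policy.md`. That matches the link style used in the policy pages, but the routing configuration is an assumption here. `broken_internal_links` is a hypothetical helper; it ignores external links, skips anchors, and does not replace the manual check.

```python
import re
from pathlib import Path

# Matches Markdown links whose target starts with "/", e.g. [text](/a/b).
LINK_PATTERN = re.compile(r"\[[^\]]*\]\((/[^)\s#]*)")


def broken_internal_links(docs_dir):
    """Return (source file, link target) pairs that match no doc file."""
    root = Path(docs_dir)
    broken = []
    for page in sorted(root.rglob("*.md")):
        text = page.read_text(encoding="utf-8")
        for target in LINK_PATTERN.findall(text):
            relative = target.strip("/")
            if not relative:
                continue  # a bare "/" points at the landing page
            candidates = [root / f"{relative}.md", root / f"{relative}.mdx"]
            if not any(candidate.is_file() for candidate in candidates):
                broken.append((page.relative_to(root).as_posix(), target))
    return broken
```

An empty list from `broken_internal_links("docs")` only shows that every root-relative link has a matching file; the Docusaurus build remains the authoritative check for broken links.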

  • Step 3: Final commit
git add -A && git commit -m "chore: verify build and clean up any remaining artifacts"

Only commit if there are uncommitted changes from build artifacts or fixes. If git status shows nothing to commit, skip this step.