BRB Monitoring and Maintenance
Owner: Anchor MSP Operations Lead
Last reviewed: 2026-05-24
Purpose
Define the ongoing monitoring checks and maintenance schedule for BRB Protocol components. A BRB agent that is unhealthy at the moment of a critical incident provides zero protection. Continuous monitoring and regular maintenance ensure the system is ready when needed.
Scope
All BRB agents, the BRB controller, Redis infrastructure, and R2 forensics storage used by Anchor managed production systems.
Monitoring
Agent Health Monitoring
Every BRB agent exposes a health endpoint at http://host:9090/health. This endpoint is monitored by two independent systems:
- Prometheus: scrapes the agent metrics endpoint every 15 seconds. A Prometheus alerting rule, routed through Alertmanager, fires if the agent is unreachable for more than 2 minutes. Severity: Critical.
- Uptime Kuma: performs an HTTP health check every 60 seconds. This provides a secondary, independent verification of agent availability.
If either monitoring system reports an agent as down, an operator must investigate within 15 minutes. A down BRB agent means the system cannot be locked down in an emergency.
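The two-minute unreachability rule can be illustrated with a small function over a history of probe results. This is a sketch only, not the production alert rule: the `Probe` type and the function names are hypothetical, and in practice Prometheus evaluates this condition itself.

```python
from dataclasses import dataclass

DOWN_THRESHOLD_SECONDS = 120  # "unreachable for more than 2 minutes"


@dataclass(frozen=True)
class Probe:
    """One health probe result (hypothetical type for illustration)."""
    timestamp: float  # seconds since epoch
    healthy: bool


def unreachable_for(probes: list[Probe], now: float) -> float:
    """Return how long the agent has been continuously unreachable, in seconds.

    Counts from the first failed probe after the most recent success.
    Returns 0.0 if the latest probe succeeded or there are no probes.
    """
    first_failure = None
    for probe in sorted(probes, key=lambda p: p.timestamp):
        if probe.healthy:
            first_failure = None  # a success resets the outage window
        elif first_failure is None:
            first_failure = probe.timestamp
    return 0.0 if first_failure is None else max(0.0, now - first_failure)


def should_alert(probes: list[Probe], now: float) -> bool:
    """True once the continuous outage exceeds the 2-minute threshold."""
    return unreachable_for(probes, now) > DOWN_THRESHOLD_SECONDS
```

A single failed probe therefore does not page anyone; only a run of failures lasting longer than the threshold does.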
Controller Health Monitoring
The BRB controller is monitored for:
| Check | Method | Alert Threshold |
|---|---|---|
| Controller API health | Prometheus + Uptime Kuma | Unreachable for 2 minutes: Critical |
| Redis connectivity | Controller health endpoint reports Redis status | Redis disconnected for 1 minute: Critical |
| API response time | Prometheus http_request_duration_seconds | P95 > 5 seconds for 5 minutes: High |
| Active agent count | Controller metrics | Agent count drops below expected: High |
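The thresholds in the table above can be expressed as a single evaluation function. The sketch below assumes a hypothetical `ControllerSnapshot` summary that some collector has already populated; the field names are illustrative and do not correspond to actual controller metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControllerSnapshot:
    """Hypothetical summary of controller state at evaluation time."""
    api_unreachable_seconds: float
    redis_disconnected_seconds: float
    p95_over_limit_seconds: float  # how long P95 latency has exceeded 5 s
    active_agents: int
    expected_agents: int


def evaluate(snapshot: ControllerSnapshot) -> list[tuple[str, str]]:
    """Return (check, severity) pairs for every threshold that is breached."""
    alerts = []
    if snapshot.api_unreachable_seconds >= 120:      # 2 minutes
        alerts.append(("Controller API health", "Critical"))
    if snapshot.redis_disconnected_seconds >= 60:    # 1 minute
        alerts.append(("Redis connectivity", "Critical"))
    if snapshot.p95_over_limit_seconds >= 300:       # 5 minutes
        alerts.append(("API response time", "High"))
    if snapshot.active_agents < snapshot.expected_agents:
        alerts.append(("Active agent count", "High"))
    return alerts
```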
R2 Storage Monitoring
| Check | Method | Alert Threshold |
|---|---|---|
| Upload test | Scheduled canary upload (1 KB test file) | Upload fails: High |
| Bucket accessibility | API head request to bucket | Bucket unreachable: High |
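A canary upload can be sketched independently of any storage client. In the example below the `put`, `get`, and `delete` callables and the object key are assumptions supplied by the caller (for instance, thin wrappers around whatever S3-compatible client is in use); nothing here reflects the actual canary job.

```python
import os
from typing import Callable


def run_canary(
    put: Callable[[str, bytes], None],
    get: Callable[[str], bytes],
    delete: Callable[[str], None],
    key: str = "canary/brb-upload-test",  # hypothetical object key
) -> bool:
    """Upload a 1 KB payload, read it back, and clean up.

    Returns True only if the stored bytes match what was uploaded.
    Any exception from the storage callables counts as a failed check.
    """
    payload = os.urandom(1024)
    try:
        put(key, payload)
        try:
            return get(key) == payload
        finally:
            delete(key)  # always remove the test object
    except Exception:
        return False
```

Passing the storage operations in as callables keeps the check testable against an in-memory dictionary before it is pointed at the real bucket.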
Dashboard
A dedicated Grafana dashboard (anchor-brb-overview) provides a single view of all BRB components:
- Agent health status for all systems (up/down with duration)
- Controller health and API latency
- Redis connectivity status
- Recent lockdown events and their status
- Recovery stage progress for active lockdowns
- Forensic package upload status
Maintenance Schedule
Daily Checks
Performed by the on-call operator as part of the daily monitoring review:
- Agent health status. Verify all BRB agents are healthy on the Grafana dashboard. Confirm no agents have been in a degraded state since the previous daily check.
- Controller status. Verify the BRB controller is healthy and API latency is normal.
- Alert review. Check for any BRB-related alerts that fired in the past 24 hours.
Weekly Checks
Performed every Monday by the designated operator:
- Redis connectivity test. From each agent host, verify Redis connectivity: `redis-cli -h CONTROLLER_HOST -p CONTROLLER_PORT ping`. Confirm the response is `PONG`.
- R2 upload test. Upload a small test file from one agent host to the forensics bucket. Verify the upload succeeds and the file is accessible. Delete the test file after verification.
- Agent log review. Review BRB agent logs (`journalctl -u brb-agent --since "7 days ago"`) for warnings or errors. Investigate any anomalies.
- Controller log review. Review BRB controller logs for errors, failed commands, or unusual activity.
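The agent log review can be made less manual by summarizing the journal rather than reading it line by line. The sketch below assumes the logs are exported with journalctl's JSON output (`-o json`, one object per line) and that each entry carries the standard `PRIORITY` and `MESSAGE` fields; the `summarize` function is illustrative, not an existing tool.

```python
import json
from collections import Counter

# syslog priorities as used by journald: lower number = more severe
# (3 = err, 4 = warning, 6 = info)
WARNING = 4


def summarize(journal_lines: list[str]) -> Counter:
    """Count warning-or-worse journal entries, grouped by message text."""
    counts: Counter = Counter()
    for line in journal_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        priority = int(entry.get("PRIORITY", 6))  # treat missing as info
        if priority <= WARNING:
            counts[str(entry.get("MESSAGE", ""))] += 1
    return counts
```

Feeding it the lines produced by `journalctl -u brb-agent --since "7 days ago" -o json` yields a count per distinct message, which makes recurring warnings easy to spot.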
Monthly Checks
Performed on the first Monday of each month:
- Agent version audit. Verify all agents are running the current approved version. Agents running outdated versions must be upgraded.
- Configuration drift check. Compare the active agent configuration (`/etc/brb/agent.yaml`) against the expected configuration in the infrastructure repository. Investigate and correct any drift.
- Emergency SSH key test. Verify the emergency SSH user can authenticate to each BRB-protected system. Test from the Anchor management network.
- Forensic package retention review. Review forensic packages in R2. Packages older than the retention period must be archived or deleted as the retention policy specifies.
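For the configuration drift check, a textual diff is often enough to surface changes. The following sketch uses Python's standard `difflib`; the `config_drift` name and the `repo/agent.yaml` label are hypothetical, and the caller is assumed to have already read both files into strings.

```python
import difflib


def config_drift(expected: str, active: str) -> list[str]:
    """Return unified-diff lines between expected and active config text.

    An empty list means no drift. The comparison is textual, so whitespace
    and comment changes are reported as well as value changes.
    """
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            active.splitlines(),
            fromfile="repo/agent.yaml",   # hypothetical repository path
            tofile="/etc/brb/agent.yaml",
            lineterm="",
        )
    )
```

Because the diff is textual rather than semantic, a reordered but equivalent file will show up as drift; that is usually acceptable for a monthly review, where any unexplained change deserves a look.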
Quarterly Checks
Performed alongside the quarterly access review:
- Tabletop lockdown exercise. Walk through a simulated lockdown scenario with the operations team. Verify operators know how to trigger lockdown, access the emergency SSH session, and submit recovery requests.
- Staging lockdown test. Execute a full lockdown and recovery cycle on a staging system. This verifies end-to-end functionality including forensic collection and dual-approval recovery. Follow the BRB Protocol Testing Standards.
- BRB controller access review. Verify all operator accounts on the BRB controller are current. Remove accounts for departed operators. Confirm MFA is enabled for all accounts.
- Redis security review. Verify Redis authentication is configured, access is restricted to the controller and agents only, and TLS is enabled for Redis connections.
- R2 credential rotation. Rotate the R2 API credentials used by BRB agents for forensic uploads. Update the credentials in Vault and on all agents.
Maintenance Windows
BRB maintenance that requires agent restarts or controller downtime must be scheduled during maintenance windows:
- Maintenance windows are coordinated with the Operations Lead.
- No BRB maintenance is performed during known high-risk periods (client launches, peak traffic, active incidents).
- Maintenance is performed on one system at a time. At no point should all BRB agents be simultaneously unavailable.
- The on-call operator is notified before and after maintenance.
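The one-system-at-a-time rule can be sketched as a rollout loop that halts as soon as a host fails to come back healthy. The `maintain` and `is_healthy` callables are assumptions supplied by the caller; a real procedure would also wait and retry the health check before giving up.

```python
from typing import Callable, Iterable, Optional


def rolling_maintenance(
    hosts: Iterable[str],
    maintain: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> tuple[list[str], Optional[str]]:
    """Maintain hosts strictly one at a time.

    Stops at the first host that is not healthy after maintenance, so that
    no further agents are taken offline while one is already down.
    Returns (completed_hosts, failed_host); failed_host is None on success.
    """
    completed: list[str] = []
    for host in hosts:
        maintain(host)
        if not is_healthy(host):
            return completed, host  # halt: remaining hosts are untouched
        completed.append(host)
    return completed, None
```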
Incident: BRB Component Failure
If a BRB component fails outside of maintenance:
| Component | Severity | Response |
|---|---|---|
| Single agent down | Critical | Investigate and restore within 15 minutes. System is unprotected. |
| Multiple agents down | Critical | Investigate immediately. Check for systemic issue (Redis, network). |
| Controller down | Critical | All systems lose lockdown capability. Investigate immediately. |
| Redis down | Critical | Agents cannot receive commands. Controller cannot dispatch lockdowns. |
| R2 unreachable | High | Forensic collection will fail. Lockdown still functions but evidence is not preserved. |
Exceptions
None. The BRB monitoring and maintenance schedule applies to all BRB-protected systems. Skipping a scheduled check requires approval from the Operations Lead, with a documented justification and a rescheduled date.