BRB Monitoring and Maintenance

Owner: Anchor MSP Operations Lead
Last reviewed: 2026-05-24

Purpose

Define the ongoing monitoring checks and maintenance schedule for BRB Protocol components. A BRB agent that is unhealthy at the moment of a critical incident provides zero protection. Continuous monitoring and regular maintenance ensure the system is ready when needed.

Scope

All BRB agents, the BRB controller, Redis infrastructure, and R2 forensics storage used by Anchor-managed production systems.

Monitoring

Agent Health Monitoring

Every BRB agent exposes a health endpoint at http://host:9090/health. This endpoint is monitored by two independent systems:

Prometheus — Scrapes the agent metrics endpoint every 15 seconds. An Alertmanager rule fires if the agent is unreachable for more than 2 minutes. Severity: Critical.
Uptime Kuma — Performs an HTTP health check every 60 seconds. Provides a secondary, independent verification of agent availability.

If either monitoring system reports an agent as down, an operator must investigate within 15 minutes. A down BRB agent means the system cannot be locked down in an emergency.
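
As a rough illustration of the HTTP health check described above, the following Python sketch probes an agent health endpoint and treats any connection error, timeout, or non-200 status as down. The function name, the default timeout, and the assumption that a healthy agent answers with HTTP 200 are illustrative only; this document specifies just the endpoint URL.

```python
import http.client
import urllib.error
import urllib.request


def agent_is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the health endpoint answers with HTTP 200.

    Assumption: a healthy agent responds with status 200. Any network
    error, timeout, malformed URL, or other status is treated as down.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # HTTPError (4xx/5xx) is a subclass of URLError, so it lands here too.
        return False
```

A caller might run `agent_is_healthy("http://host:9090/health")` for each agent host and page the operator on a False result. This does not replace the Prometheus or Uptime Kuma checks; it only shows the shape of the probe.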

Controller Health Monitoring

The BRB controller is monitored for:

Check	Method	Alert Threshold
Controller API health	Prometheus + Uptime Kuma	Unreachable for 2 minutes: Critical
Redis connectivity	Controller health endpoint reports Redis status	Redis disconnected for 1 minute: Critical
API response time	Prometheus `http_request_duration_seconds`	P95 > 5 seconds for 5 minutes: High
Active agent count	Controller metrics	Agent count drops below expected: High
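
The thresholds in this table pair a condition with a duration (for example, P95 above 5 seconds for 5 minutes). The sketch below shows one way to evaluate such a sustained condition over timestamped samples, similar in spirit to the `for` clause of a Prometheus alerting rule. The function name and sample format are assumptions made for illustration, not part of the BRB controller.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, observed value)


def breached_for(samples: Sequence[Sample], threshold: float, duration_s: float) -> bool:
    """Return True if the trailing run of samples above `threshold`
    spans at least `duration_s` seconds.

    `samples` must be ordered by timestamp, oldest first.
    """
    if not samples:
        return False
    newest_ts = samples[-1][0]
    breach_start: Optional[float] = None
    # Walk backwards from the newest sample until the condition stops holding.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        breach_start = ts
    if breach_start is None:
        return False
    return newest_ts - breach_start >= duration_s
```

For the API response time row, `breached_for(p95_samples, threshold=5.0, duration_s=300)` would return True only once latency has stayed above 5 seconds for the full 5 minutes, so a single slow scrape does not raise an alert.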

R2 Storage Monitoring

Check	Method	Alert Threshold
Upload test	Scheduled canary upload (1KB test file)	Upload fails: High
Bucket accessibility	API head request to bucket	Bucket unreachable: High
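
A canary upload of the kind listed above could look roughly like the following Python sketch. It assumes a client object exposing `put_object`, `head_object`, and `delete_object` with boto3-style keyword arguments (an S3-compatible client pointed at the R2 endpoint would fit). The function name, the key prefix, and the bucket name in the usage note are hypothetical.

```python
import os
import uuid


def run_canary(client, bucket: str, prefix: str = "canary/") -> bool:
    """Upload a 1 KB object, confirm it is readable, then delete it.

    `client` is assumed to offer put_object, head_object and delete_object
    taking Bucket, Key and Body keyword arguments (boto3 style).
    Returns True if upload and verification succeed, False otherwise.
    """
    key = f"{prefix}{uuid.uuid4().hex}.bin"
    payload = os.urandom(1024)  # 1 KB test file
    try:
        client.put_object(Bucket=bucket, Key=key, Body=payload)
    except Exception:
        return False  # upload failed: nothing to clean up
    try:
        head = client.head_object(Bucket=bucket, Key=key)
        return head.get("ContentLength") == len(payload)
    except Exception:
        return False
    finally:
        # Best-effort cleanup so canary objects do not accumulate.
        try:
            client.delete_object(Bucket=bucket, Key=key)
        except Exception:
            pass
```

A scheduler could call `run_canary(client, "forensics-bucket")` and raise a High alert when it returns False. Using a random key per run avoids collisions between overlapping canary runs.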

Dashboard

A dedicated Grafana dashboard (anchor-brb-overview) provides a single view of all BRB components:

Agent health status for all systems (up/down with duration)
Controller health and API latency
Redis connectivity status
Recent lockdown events and their status
Recovery stage progress for active lockdowns
Forensic package upload status

Maintenance Schedule

Daily Checks

Performed by the on-call operator as part of the daily monitoring review:

Agent health status. Verify all BRB agents are healthy on the Grafana dashboard. Confirm no agents have been in a degraded state in the past 24 hours.
Controller status. Verify the BRB controller is healthy and API latency is normal.
Alert review. Check for any BRB-related alerts that fired in the past 24 hours.

Weekly Checks

Performed every Monday by the designated operator:

Redis connectivity test. From each agent host, verify Redis connectivity: redis-cli -h CONTROLLER_HOST -p CONTROLLER_PORT ping. Confirm response is PONG.
R2 upload test. Upload a small test file from one agent host to the forensics bucket. Verify the upload succeeds and the file is accessible. Delete the test file after verification.
Agent log review. Review BRB agent logs (journalctl -u brb-agent --since "7 days ago") for warnings or errors. Investigate any anomalies.
Controller log review. Review BRB controller logs for errors, failed commands, or unusual activity.
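
Where `redis-cli` is not installed on an agent host, the weekly Redis connectivity test can be approximated with a few lines of Python that send a PING in the Redis wire protocol and check for a PONG reply. This is a minimal sketch that assumes a plain TCP connection without authentication. Because the quarterly Redis security review expects authentication and TLS to be enabled, a real check would likely need to wrap the socket with TLS and authenticate first (with `redis-cli`, the `--tls` option and credentials). The function name is hypothetical.

```python
import socket


def redis_ping(host: str, port: int, timeout: float = 3.0) -> bool:
    """Send a RESP-encoded PING and return True only on a +PONG reply.

    Assumption: plain TCP without AUTH. A server that requires
    authentication answers with an error line, which yields False.
    """
    request = b"*1\r\n$4\r\nPING\r\n"  # RESP array holding one bulk string
    reply = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(request)
            # Read until the reply line is complete or the peer closes.
            while not reply.endswith(b"\r\n") and len(reply) < 128:
                chunk = conn.recv(64)
                if not chunk:
                    break
                reply += chunk
    except OSError:
        return False  # refused, unreachable, or timed out
    return reply.startswith(b"+PONG")
```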

Monthly Checks

Performed on the first Monday of each month:

Agent version audit. Verify all agents are running the current approved version. Agents running outdated versions must be upgraded.
Configuration drift check. Compare the active agent configuration (/etc/brb/agent.yaml) against the expected configuration in the infrastructure repository. Investigate and correct any drift.
Emergency SSH key test. Verify the emergency SSH user can authenticate to each BRB-protected system. Test from the Anchor management network.
Forensic package retention review. Review forensic packages in R2. Packages older than the retention policy should be archived or deleted per policy.
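
The configuration drift check above amounts to diffing the active file against the copy in the infrastructure repository. A minimal Python sketch using the standard `difflib` module follows. The function name and the `repo/agent.yaml` label are assumptions, and the comparison is purely textual, so whitespace or comment changes are reported as drift.

```python
import difflib
from typing import List


def config_drift(expected: str, actual: str) -> List[str]:
    """Return unified-diff lines between expected and active config text.

    An empty list means the two texts are identical (no drift).
    """
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="repo/agent.yaml",      # label only; hypothetical repository path
            tofile="/etc/brb/agent.yaml",    # label only; path named in this document
            lineterm="",
        )
    )
```

An operator could read both files into strings, call `config_drift(expected, actual)`, and attach the resulting diff lines to the investigation record whenever the list is non-empty.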

Quarterly Checks

Performed alongside the quarterly access review:

Tabletop lockdown exercise. Walk through a simulated lockdown scenario with the operations team. Verify operators know how to trigger lockdown, access the emergency SSH session, and submit recovery requests.
Staging lockdown test. Execute a full lockdown and recovery cycle on a staging system. This verifies end-to-end functionality including forensic collection and dual-approval recovery. Follow the BRB Protocol Testing Standards.
BRB controller access review. Verify all operator accounts on the BRB controller are current. Remove accounts for departed operators. Confirm MFA is enabled for all accounts.
Redis security review. Verify Redis authentication is configured, access is restricted to the controller and agents only, and TLS is enabled for Redis connections.
R2 credential rotation. Rotate the R2 API credentials used by BRB agents for forensic uploads. Update the credentials in Vault and on all agents.

Maintenance Windows

BRB maintenance that requires agent restarts or controller downtime must be scheduled during maintenance windows:

Maintenance windows are coordinated with the Operations Lead.
No BRB maintenance is performed during known high-risk periods (client launches, peak traffic, active incidents).
Maintenance is performed on one system at a time. At no point should all BRB agents be simultaneously unavailable.
The on-call operator is notified before and after maintenance.
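
The one-system-at-a-time rule can be enforced in tooling rather than by convention. The following Python sketch performs maintenance on a single host, waits for it to report healthy, and halts at the first host that does not recover, leaving the remaining hosts untouched. The function name, the callables passed in, and the wait and poll intervals are all assumptions for illustration.

```python
import time
from typing import Callable, Iterable, List


def rolling_maintenance(
    hosts: Iterable[str],
    maintain: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    wait_s: float = 120.0,
    poll_s: float = 5.0,
) -> List[str]:
    """Run `maintain` on one host at a time.

    Stops at the first host that is not healthy within `wait_s` seconds.
    Returns the hosts that completed maintenance and came back healthy.
    """
    done: List[str] = []
    for host in hosts:
        maintain(host)
        deadline = time.monotonic() + wait_s
        while not is_healthy(host):
            if time.monotonic() >= deadline:
                return done  # halt: later hosts keep their running agents
            time.sleep(poll_s)
        done.append(host)
    return done
```

Stopping on the first failure limits the blast radius to one unprotected system, which then falls under the single-agent-down response.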

Incident: BRB Component Failure

If a BRB component fails outside of maintenance:

Component	Severity	Response
Single agent down	Critical	Investigate and restore within 15 minutes. System is unprotected.
Multiple agents down	Critical	Investigate immediately. Check for systemic issue (Redis, network).
Controller down	Critical	All systems lose lockdown capability. Investigate immediately.
Redis down	Critical	Agents cannot receive commands. Controller cannot dispatch lockdowns.
R2 unreachable	High	Forensic collection will fail. Lockdown still functions but evidence is not preserved.

Exceptions

None. The BRB monitoring and maintenance schedule applies to all BRB-protected systems. Skipping a scheduled check requires approval from the Operations Lead, with a documented justification and a rescheduled date.
