BRB Protocol Testing Standards
Owner: Anchor MSP Operations Lead
Last reviewed: 2026-05-24
Purpose
Define the minimum test requirements for the BRB (Big Red Button) Protocol before a system is accepted into Anchor-managed production during the handoff phase. No system requiring BRB protection may be accepted into managed production without passing these tests.
Scope
All systems under Anchor-managed production that have BRB agents deployed or planned for deployment.
Prerequisites
Before testing begins, the following must be in place:
- Staging clone of the target production system, isolated from production traffic.
- BRB agent deployed on staging and reporting healthy.
- 2+ operators with active BRB accounts (different users required for recovery approval testing).
- Redis connectivity verified: agent can subscribe to and receive commands via the BRB controller's Redis instance.
- R2 connectivity verified: agent can upload forensic packages to the brb-forensics bucket.
- Slack integration active: #anchor-incidents-critical channel is receiving notifications from the BRB controller.
Mandatory Test Scenarios
All mandatory tests must pass before handoff acceptance is complete. Execute each scenario on the staging clone.
1. Agent Health Check
Test: Confirm the BRB agent is running and responsive.
curl -s http://STAGING_IP:9090/health | jq
Expected result: Response contains "status":"healthy" with system ID matching the staging system.
Pass criteria: HTTP 200 with healthy status. Agent uptime is greater than 0.
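The pass criteria above can be expressed as a small check. This is a minimal sketch, not part of the BRB tooling: `health_passes` is a hypothetical helper, and the response field names `system_id` and `uptime` (assumed to be whole seconds) are assumptions. Only `status` is named in this standard.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 only when every Scenario 1 criterion holds.
# Usage: health_passes HTTP_CODE STATUS SYSTEM_ID EXPECTED_SYSTEM_ID UPTIME
health_passes() {
  local http_code=$1 status=$2 system_id=$3 expected_id=$4 uptime=$5
  [ "$http_code" = "200" ] || return 1          # HTTP 200 required
  [ "$status" = "healthy" ] || return 1         # "status":"healthy"
  [ "$system_id" = "$expected_id" ] || return 1 # must match the staging system
  case $uptime in
    ''|*[!0-9]*) return 1 ;;                    # reject empty or non-numeric uptime
  esac
  [ "$uptime" -gt 0 ]                           # uptime must be greater than 0
}

# Example wiring (assumed field names; requires curl and jq):
#   code=$(curl -s -o /tmp/brb-health.json -w '%{http_code}' "http://STAGING_IP:9090/health")
#   health_passes "$code" \
#     "$(jq -r '.status' /tmp/brb-health.json)" \
#     "$(jq -r '.system_id' /tmp/brb-health.json)" \
#     "EXPECTED_SYSTEM_ID" \
#     "$(jq -r '.uptime' /tmp/brb-health.json)" && echo PASS || echo FAIL
```

Keeping the decision in a pure function separates the criteria from the transport, so the same check can be reused in Scenario 8.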
2. Full Lockdown Trigger
Test: Trigger a full lockdown on the staging system via the BRB controller API or Glance dashboard.
Expected result: All four lockdown actions execute:
- Network isolation — all traffic blocked except emergency SSH
- Service shutdown — all configured services stopped
- User account locking — all accounts locked except emergency user
- Session termination — all active sessions killed
Verification:
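# From emergency SSH session on staging
iptables -L -v -n # Verify restrictive rules
systemctl status docker nginx postgresql # Verify services stopped
passwd -S USERNAME # Verify a non-emergency account reports locked (run as root; USERNAME is a placeholder; assumes locking via passwd/usermod)
who # Verify no active sessions besides emergency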
Pass criteria: All four lockdown actions confirmed via verification commands.
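As a sketch of how the service check could be scripted: `systemctl is-active` prints one state per unit, so a small filter can fail the scenario if any configured service is not yet stopped. `all_stopped` is a hypothetical helper name, and the service list is taken from the verification commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads one unit state per line (the output format of
# `systemctl is-active UNIT...`) and succeeds only if every unit reports
# "inactive" or "failed".
all_stopped() {
  local state seen=0
  while read -r state || [ -n "$state" ]; do
    [ -n "$state" ] || continue
    seen=1
    case $state in
      inactive|failed) ;;   # stopped
      *) return 1 ;;        # active, activating, reloading, deactivating, ...
    esac
  done
  [ "$seen" -eq 1 ]         # empty input proves nothing, so treat it as a failure
}

# Example wiring on the staging host:
#   systemctl is-active docker nginx postgresql | all_stopped && echo PASS || echo FAIL
```

The whitelist is deliberately strict: a unit that is still deactivating counts as a failure, which matches the "no partial states" intent of the lockdown.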
3. Forensic Collection
Test: Verify forensic package is collected, checksummed, and uploaded during the lockdown triggered in Scenario 2.
Expected result:
- forensics.tar.gz created on the staging system
- SHA256 checksum file generated alongside the archive
- Package uploaded to R2 at the expected path:
s3://brb-forensics/<client_id>/<system_id>/event-<timestamp>/
Verification:
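# List events, then download the archive and its checksum file
aws s3 ls s3://brb-forensics/{client_id}/{system_id}/ \
--endpoint-url https://23b4ba8d8f996dfbc2eb473cb3b32582.r2.cloudflarestorage.com
aws s3 cp s3://brb-forensics/{client_id}/{system_id}/event-YYYYMMDD-HHMMSS/forensics.tar.gz . \
--endpoint-url https://23b4ba8d8f996dfbc2eb473cb3b32582.r2.cloudflarestorage.com
# Assumes the checksum file is uploaded next to the archive under this name
aws s3 cp s3://brb-forensics/{client_id}/{system_id}/event-YYYYMMDD-HHMMSS/forensics.tar.gz.sha256 . \
--endpoint-url https://23b4ba8d8f996dfbc2eb473cb3b32582.r2.cloudflarestorage.com
sha256sum -c forensics.tar.gz.sha256 # Verify checksum against the downloaded archive
tar -tzf forensics.tar.gz # List contents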
Pass criteria: Archive exists in R2, SHA256 matches, contents include system state, logs, and network info.
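To find the event directory created by the test lockdown, the listing can be filtered for the newest `event-YYYYMMDD-HHMMSS` prefix. The sketch below relies on `aws s3 ls` printing sub-prefixes as `PRE name/` lines. `latest_event` is a hypothetical helper, and `CLIENT_ID`, `SYSTEM_ID`, and `R2_ENDPOINT` in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `aws s3 ls` output on stdin and prints the newest
# event-YYYYMMDD-HHMMSS prefix (without the trailing slash).
# Returns 1 if no matching prefix is found.
latest_event() {
  local kind name latest=""
  while read -r kind name || [ -n "$kind" ]; do
    [ "$kind" = "PRE" ] || continue
    case $name in
      event-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]/) ;;
      *) continue ;;
    esac
    name=${name%/}
    # Fixed-width timestamps sort chronologically as plain strings.
    if [[ -z $latest || $name > $latest ]]; then
      latest=$name
    fi
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Example wiring:
#   event=$(aws s3 ls "s3://brb-forensics/${CLIENT_ID}/${SYSTEM_ID}/" \
#             --endpoint-url "$R2_ENDPOINT" | latest_event) || echo "FAIL: no event found"
```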
4. Slack Notification Delivery
Test: Confirm that the lockdown event triggered a notification in #anchor-incidents-critical.
Expected result: Slack message received in the channel with:
- System ID
- Client ID
- Lockdown reason
- Timestamp
- Link to BRB controller event details
Pass criteria: Notification received within 60 seconds of lockdown trigger.
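The 60-second criterion can be checked by comparing the lockdown trigger time with the Slack message timestamp. This is a sketch under two assumptions: the trigger time was recorded with `date +%s` when the lockdown was sent, and the message `ts` value (Unix epoch seconds with a fractional suffix) was copied from Slack. `notified_in_time` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the Slack message was posted no more than
# LIMIT seconds (default 60) after the lockdown trigger.
# Usage: notified_in_time TRIGGER_EPOCH SLACK_TS [LIMIT]
notified_in_time() {
  local trigger=$1 slack_ts=$2 limit=${3:-60}
  local posted=${slack_ts%%.*}   # drop the fractional part (whole-second resolution)
  local delay=$(( posted - trigger ))
  # A negative delay means the message predates the trigger, so it is not ours.
  [ "$delay" -ge 0 ] && [ "$delay" -le "$limit" ]
}

# Example:
#   notified_in_time 1716543200 1716543210.000200 && echo PASS || echo FAIL
```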
5. Network Recovery (Single Approval)
Test: Submit one recovery approval for the "network" stage.
Expected result:
- Network access is restored (firewall rules reverted)
- Services remain stopped
- User accounts remain locked
Verification:
# From emergency SSH
ping -c 4 8.8.8.8 # Should succeed (-c stops after 4 packets)
curl https://google.com # Should succeed
systemctl status docker # Should still be stopped
Pass criteria: Network restored, services and accounts still locked.
6. Full Recovery (Two Approvals, Different Users)
Test: Submit two recovery approvals for the "full" stage from two different operator accounts.
Expected result:
- All services restored and running
- All user accounts unlocked
- All network access fully restored
Verification:
systemctl status docker nginx postgresql # All running
passwd -S user1 # Account should no longer report locked (run as root; assumes locking via passwd/usermod)
passwd -S user2
curl http://localhost:APP_PORT/health # Application responding
Pass criteria: System fully operational. All services, accounts, and network access restored.
7. Duplicate Approval Rejection
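Test: Attempt to approve the same recovery stage twice with the same operator account. Run this while a recovery is still awaiting approval (for example, after the first approval in Scenario 6 and before the second operator approves), because a completed recovery has no stage left to approve.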
Expected result: Second approval is rejected. The BRB controller returns an error indicating the same user cannot approve twice.
Pass criteria: API returns a rejection response. Recovery does not advance on duplicate approval.
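The approval rule exercised by Scenarios 5 to 7 can be summarized in a few lines. The sketch below is an illustrative model only, not the BRB controller's implementation; `approvals_met` is a hypothetical name, and the stage names `network` and `full` are taken from the scenarios above.

```shell
#!/usr/bin/env bash
# Illustrative model: succeeds when the approvers listed satisfy the stage.
# Usage: approvals_met STAGE APPROVER...
approvals_met() {
  local stage=$1 required distinct
  shift
  case $stage in
    network) required=1 ;;   # Scenario 5: single approval
    full)    required=2 ;;   # Scenario 6: two approvals from different users
    *) return 2 ;;           # unknown stage
  esac
  [ "$#" -gt 0 ] || return 1
  # Count distinct approvers so a repeated user adds nothing (Scenario 7).
  distinct=$(( $(printf '%s\n' "$@" | sort -u | wc -l) ))
  [ "$distinct" -ge "$required" ]
}

# Examples:
#   approvals_met full alice bob    # succeeds
#   approvals_met full alice alice  # fails: same user twice
```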
8. Post-Recovery Validation
Test: After full recovery, verify the system is healthy and operational.
Expected result:
- Application health check passing
- BRB agent health check passing (:9090/health → healthy)
- Logs flowing to Loki/promtail (if configured)
- Monitoring checks passing in Uptime Kuma
Pass criteria: All health checks green. No residual lockdown artifacts.
Recommended Tests
These tests are not required for handoff acceptance but are strongly recommended and should be documented if performed.
R1. Agent Reconnection After Redis Restart
Test: Restart the Redis instance on the BRB controller while the agent is connected. Verify the agent automatically reconnects and resumes listening for commands.
Pass criteria: Agent reconnects within 60 seconds and responds to a subsequent health check.
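A bounded polling loop is one way to time the 60-second window. In this sketch `wait_until` is a hypothetical helper, and `agent_healthy` in the usage comment stands in for whatever check the tester uses. Whether the health endpoint reflects the state of the Redis subscription is an assumption to confirm against the agent.

```shell
#!/usr/bin/env bash
# Hypothetical helper: re-run a command until it succeeds or TIMEOUT seconds pass.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))   # SECONDS is bash's elapsed-time counter
  while true; do
    "$@" && return 0
    [ "$SECONDS" -lt "$deadline" ] || return 1
    sleep "$interval"
  done
}

# Example wiring (agent_healthy is a hypothetical function wrapping the
# Scenario 1 health check):
#   wait_until 60 5 agent_healthy && echo "reconnected" || echo "FAIL: no reconnect within 60s"
```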
R2. Lockdown Under Simulated Load
Test: Generate simulated traffic on the staging system (HTTP requests, database queries) and trigger a lockdown during active load.
Pass criteria: Lockdown executes cleanly. No partial states. All connections terminated.
R3. RTO Measurement
Test: Measure the recovery time that will inform the Recovery Time Objective (RTO): the elapsed time from lockdown trigger to full recovery completion.
Documentation: Record the following timestamps:
- Lockdown command sent
- All lockdown actions confirmed
- First recovery approval submitted
- Second recovery approval submitted
- Full recovery confirmed
- Application health check passing
Pass criteria: RTO documented. No specific target required for handoff, but the measurement informs SLA commitments.
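The recorded timestamps can be turned into phase durations with simple arithmetic. This sketch assumes each timestamp was captured as Unix epoch seconds (for example with `date +%s`) in the order listed above. `fmt_duration` and `rto_report` are hypothetical helpers, and the phase labels are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: format a duration in seconds as MM:SS.
fmt_duration() {
  printf '%02d:%02d\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# Hypothetical helper: print phase durations from the six recorded timestamps.
# Usage: rto_report SENT LOCKED APPROVAL1 APPROVAL2 RECOVERED APP_HEALTHY
rto_report() {
  local sent=$1 locked=$2 approval1=$3 approval2=$4 recovered=$5 healthy=$6
  echo "lockdown execution:        $(fmt_duration $(( locked - sent )))"
  echo "lockdown to 1st approval:  $(fmt_duration $(( approval1 - locked )))"
  echo "1st to 2nd approval:       $(fmt_duration $(( approval2 - approval1 )))"
  echo "recovery execution:        $(fmt_duration $(( recovered - approval2 )))"
  echo "trigger to full recovery:  $(fmt_duration $(( recovered - sent )))"
  echo "trigger to app healthy:    $(fmt_duration $(( healthy - sent )))"
}

# Example:
#   rto_report 1000 1030 1100 1160 1250 1300
```

Splitting the total into phases shows how much of the elapsed time was spent waiting on operators rather than on the recovery itself, which is useful context when the measurement feeds SLA commitments.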
Pass/Fail Criteria
A system passes BRB Protocol testing when:
- All 8 mandatory test scenarios pass.
- RTO is documented (even if only from mandatory test execution).
- Forensic package is downloadable and verifiable (SHA256 matches, contents are complete).
- At least 2 different operators have successfully participated in the test (approved recoveries).
A single mandatory test failure results in an overall fail. The issue must be resolved and the failed test re-run before acceptance.
Documentation Requirements
After testing is complete, the following must be documented and retained:
| Item | Details |
|---|---|
| Test date | Date testing was performed |
| Tester names | Names and roles of all operators involved |
| Staging system ID | The system_id of the staging clone used |
| Pass/fail per scenario | Result for each of the 8 mandatory scenarios, with notes on any issues encountered |
| RTO measurement | Timestamps and elapsed time from lockdown to full recovery |
| Evidence | Screenshots, terminal output, or log excerpts for each scenario |
| Forensic package location | R2 path to the test forensic package |
Store test documentation alongside the system's handoff acceptance records.
Sign-Off
BRB Protocol testing requires sign-off from:
- Development team lead — confirms the staging clone accurately represents production and that application-level recovery is validated.
- Anchor operator — confirms all mandatory tests passed and documentation is complete.
Both signatures must be recorded before the BRB-related item on the Handoff Acceptance Checklist can be checked off.