BRB Protocol Testing Standards

Owner: Anchor MSP Operations Lead
Last reviewed: 2026-05-24

Purpose

Define the minimum test requirements for the BRB (Big Red Button) Protocol before a system is accepted into Anchor-managed production during the handoff phase. No system requiring BRB protection may be accepted into managed production without passing these tests.

Scope

All systems in Anchor-managed production that have BRB agents deployed or planned for deployment.

Prerequisites

Before testing begins, the following must be in place:

  1. Staging clone of the target production system, isolated from production traffic.
  2. BRB agent deployed on staging and reporting healthy.
  3. At least two operators with active BRB accounts (recovery approval testing requires different users).
  4. Redis connectivity verified — agent can subscribe to and receive commands via the BRB controller's Redis instance.
  5. R2 connectivity verified — agent can upload forensic packages to the brb-forensics bucket.
  6. Slack integration active: the #anchor-incidents-critical channel is receiving notifications from the BRB controller.

Mandatory Test Scenarios

All mandatory tests must pass before handoff acceptance is complete. Execute each scenario on the staging clone.

1. Agent Health Check

Test: Confirm the BRB agent is running and responsive.

curl -s http://STAGING_IP:9090/health | jq

Expected result: Response contains "status":"healthy" with system ID matching the staging system.

Pass criteria: HTTP 200 with healthy status. Agent uptime is greater than 0.
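
The check above can also be scripted so that the pass criteria are evaluated rather than read by eye. The following is a minimal sketch, assuming the health response is a JSON object with `status` and `system_id` keys. Only the `status` value is confirmed by this document; the `system_id` key name and the `check_agent_health` helper are hypothetical. It does not check agent uptime, because the field that carries it is not specified here.

```shell
# Sketch: evaluate the Scenario 1 pass criteria from a script.
# ASSUMPTION: the health response is JSON containing "status" and
# "system_id" keys; the "system_id" key name is a guess, adjust as needed.
check_agent_health() (
  url=$1          # e.g. http://STAGING_IP:9090/health
  expected_id=$2  # system ID of the staging clone

  # -f makes curl exit non-zero on HTTP errors (4xx/5xx).
  body=$(curl -sf "$url") || { echo "FAIL: no successful response from $url"; exit 1; }

  printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"' \
    || { echo "FAIL: status is not healthy"; exit 1; }

  printf '%s' "$body" | grep -Eq "\"system_id\"[[:space:]]*:[[:space:]]*\"$expected_id\"" \
    || { echo "FAIL: system ID does not match $expected_id"; exit 1; }

  echo "PASS: agent healthy on $expected_id"
)

# Usage (placeholder values):
#   check_agent_health http://STAGING_IP:9090/health staging-system-id
```

The function body runs in a subshell, so a failed check ends the function with a non-zero status without closing the operator's shell.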

2. Full Lockdown Trigger

Test: Trigger a full lockdown on the staging system via the BRB controller API or Glance dashboard.

Expected result: All four lockdown actions execute:

  • Network isolation — all traffic blocked except emergency SSH
  • Service shutdown — all configured services stopped
  • User account locking — all accounts locked except emergency user
  • Session termination — all active sessions killed

Verification:

# From emergency SSH session on staging
iptables -L -v -n # Verify restrictive rules
systemctl status docker nginx postgresql # Verify services stopped
who # Verify no active sessions besides emergency

Pass criteria: All four lockdown actions confirmed via verification commands.
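
Two of the four checks (services stopped, sessions terminated) lend themselves to a scripted pass/fail result. The sketch below assumes it runs from the emergency SSH session; the `verify_lockdown` helper, the service list, and the emergency account name are placeholders rather than values defined by the BRB Protocol. Network isolation and account locking still need to be confirmed separately.

```shell
# Sketch: turn two of the manual lockdown checks into pass/fail output.
# ASSUMPTIONS: the service names and the emergency account name passed in
# are placeholders, not values defined by the BRB Protocol.
verify_lockdown() (
  emergency_user=$1; shift   # remaining arguments are service names
  rc=0

  for svc in "$@"; do
    # is-active exits 0 only when the unit is in the active state.
    if systemctl is-active --quiet "$svc"; then
      echo "FAIL: $svc is still active"; rc=1
    else
      echo "ok: $svc is stopped"
    fi
  done

  # Any login session not owned by the emergency user is a failure.
  others=$(who | awk -v u="$emergency_user" '$1 != u {n++} END {print n+0}')
  if [ "$others" -gt 0 ]; then
    echo "FAIL: $others non-emergency session(s) remain"; rc=1
  else
    echo "ok: only emergency sessions remain"
  fi

  exit "$rc"
)

# Usage (placeholder account name):
#   verify_lockdown brb-emergency docker nginx postgresql
```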

3. Forensic Collection

Test: Verify forensic package is collected, checksummed, and uploaded during the lockdown triggered in Scenario 2.

Expected result:

  • forensics.tar.gz created on the staging system
  • SHA256 checksum file generated alongside the archive
  • Package uploaded to R2 at the expected path: s3://brb-forensics/<client_id>/<system_id>/event-<timestamp>/

Verification:

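# List the uploaded events for the system
aws s3 ls s3://brb-forensics/{client_id}/{system_id}/ \
--endpoint-url https://23b4ba8d8f996dfbc2eb473cb3b32582.r2.cloudflarestorage.com

# Download the archive
aws s3 cp s3://brb-forensics/{client_id}/{system_id}/event-YYYYMMDD-HHMMSS/forensics.tar.gz . \
--endpoint-url https://23b4ba8d8f996dfbc2eb473cb3b32582.r2.cloudflarestorage.com

# Download the checksum file (assumes it is uploaded next to the archive)
aws s3 cp s3://brb-forensics/{client_id}/{system_id}/event-YYYYMMDD-HHMMSS/forensics.tar.gz.sha256 . \
--endpoint-url https://23b4ba8d8f996dfbc2eb473cb3b32582.r2.cloudflarestorage.com

sha256sum -c forensics.tar.gz.sha256 # Verify the archive against its checksum
tar -tzf forensics.tar.gz # List contents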

Pass criteria: Archive exists in R2, SHA256 matches, contents include system state, logs, and network info.
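
Once the archive and its checksum file are in a local directory, the checksum and contents checks can be combined into one step. This is a sketch only: the `verify_forensics` helper is hypothetical, and the entry names passed to it are illustrative, since this document does not define the archive's internal layout.

```shell
# Sketch: verify a downloaded forensic package in one step.
# ASSUMPTION: the entry names passed as arguments are hypothetical; the
# document does not define the archive layout.
verify_forensics() (
  dir=$1; shift              # directory holding the archive and checksum
  cd "$dir" || exit 1

  # sha256sum -c re-hashes the file named in the checksum file.
  sha256sum -c forensics.tar.gz.sha256 >/dev/null 2>&1 \
    || { echo "FAIL: checksum mismatch"; exit 1; }

  listing=$(tar -tzf forensics.tar.gz) || { echo "FAIL: unreadable archive"; exit 1; }

  rc=0
  for entry in "$@"; do
    # Match the entry as a path prefix at the top level of the archive.
    if printf '%s\n' "$listing" | grep -Eq "^(\./)?$entry"; then
      echo "ok: $entry present"
    else
      echo "FAIL: $entry missing"; rc=1
    fi
  done
  exit "$rc"
)

# Usage (entry names are examples only):
#   verify_forensics . system-state logs network
```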

4. Slack Notification Delivery

Test: Confirm that the lockdown event triggered a notification in #anchor-incidents-critical.

Expected result: Slack message received in the channel with:

  • System ID
  • Client ID
  • Lockdown reason
  • Timestamp
  • Link to BRB controller event details

Pass criteria: Notification received within 60 seconds of lockdown trigger.
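
The 60-second window can be checked from two recorded timestamps. The sketch below assumes both are Unix epoch seconds: the trigger time as recorded by the operator (for example with `date +%s`) and the Slack message timestamp, whose fractional part is truncated. The `notification_on_time` helper is hypothetical.

```shell
# Sketch: check the 60-second delivery window from two timestamps.
# ASSUMPTION: both values are Unix epoch seconds; a fractional part
# (as found in Slack message timestamps) is truncated.
notification_on_time() (
  trigger=${1%%.*}     # lockdown trigger time
  received=${2%%.*}    # Slack message time
  limit=${3:-60}       # allowed delay in seconds

  delay=$((received - trigger))
  if [ "$delay" -ge 0 ] && [ "$delay" -le "$limit" ]; then
    echo "PASS: delivered after ${delay}s"
  else
    echo "FAIL: delay ${delay}s outside 0-${limit}s"; exit 1
  fi
)
```

A negative delay is treated as a failure because it usually indicates clock skew or a message from an earlier event.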

5. Network Recovery (Single Approval)

Test: Submit one recovery approval for the "network" stage.

Expected result:

  • Network access is restored (firewall rules reverted)
  • Services remain stopped
  • User accounts remain locked

Verification:

# From emergency SSH
ping 8.8.8.8 # Should succeed
curl https://google.com # Should succeed
systemctl status docker # Should still be stopped

Pass criteria: Network restored, services and accounts still locked.

6. Full Recovery (Two Approvals, Different Users)

Test: Submit two recovery approvals for the "full" stage from two different operator accounts.

Expected result:

  • All services restored and running
  • All user accounts unlocked
  • All network access fully restored

Verification:

systemctl status docker nginx postgresql # All running
id user1 user2 # Accounts exist
passwd -S user1 # Status field must not be "L" (locked); run as root, repeat per account
curl http://localhost:APP_PORT/health # Application responding

Pass criteria: System fully operational. All services, accounts, and network access restored.

7. Duplicate Approval Rejection

Test: Attempt to approve the same recovery stage twice with the same operator account.

Expected result: Second approval is rejected. The BRB controller returns an error indicating the same user cannot approve twice.

Pass criteria: API returns a rejection response. Recovery does not advance on duplicate approval.
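
The rule under test can be stated as a check on the list of approvers: a stage advances only when enough distinct operators have approved it. The sketch below illustrates that rule under the assumption that approvals are available as one operator name per line. It is not the BRB controller's implementation, and both helper names are hypothetical.

```shell
# Sketch: the approval rule expressed as a check on an approval list.
# ASSUMPTION: approvals are available as one operator name per line; this
# is an illustration, not the BRB controller's implementation.
distinct_approvers() {
  # Count unique non-empty names read from stdin.
  sort -u | awk 'NF {n++} END {print n+0}'
}

recovery_may_advance() (
  required=$1   # approvals needed for the stage (1 for network, 2 for full)
  have=$(distinct_approvers)
  [ "$have" -ge "$required" ]
)

# Usage: the same operator approving twice does not satisfy the "full" stage.
#   printf 'alice\nalice\n' | recovery_may_advance 2 || echo "recovery must not advance"
```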

8. Post-Recovery Validation

Test: After full recovery, verify the system is healthy and operational.

Expected result:

  • Application health check passing
  • BRB agent health check passing (:9090/health → healthy)
  • Logs flowing to Loki via Promtail (if configured)
  • Monitoring checks passing in Uptime Kuma

Pass criteria: All health checks green. No residual lockdown artifacts.

Recommended Test Scenarios

These tests are not required for handoff acceptance but are strongly recommended and should be documented if performed.

R1. Agent Reconnection After Redis Restart

Test: Restart the Redis instance on the BRB controller while the agent is connected. Verify the agent automatically reconnects and resumes listening for commands.

Pass criteria: Agent reconnects within 60 seconds and responds to a subsequent health check.
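
A bounded polling loop is one way to time the reconnection. The sketch below is generic: the `wait_until` helper is hypothetical, and the command passed to it is whatever check the operator uses to confirm the agent is back. Note that a passing health request alone may not prove the Redis subscription has resumed, so a follow-up command to the agent is still advisable.

```shell
# Sketch: poll a check until it succeeds or a deadline passes.
# ASSUMPTION: the command passed in is a placeholder for whatever check
# shows the agent has reconnected (for example a health request).
wait_until() (
  timeout=$1; interval=$2; shift 2   # remaining arguments: command to run
  start=$(date +%s)
  while :; do
    if "$@"; then
      echo "PASS: succeeded after $(( $(date +%s) - start ))s"
      exit 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      echo "FAIL: not successful within ${timeout}s"
      exit 1
    fi
    sleep "$interval"
  done
)

# Usage: allow 60 seconds, checking every 2 seconds.
#   wait_until 60 2 curl -sf -o /dev/null http://STAGING_IP:9090/health
```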

R2. Lockdown Under Simulated Load

Test: Generate simulated traffic on the staging system (HTTP requests, database queries) and trigger a lockdown during active load.

Pass criteria: Lockdown executes cleanly. No partial states. All connections terminated.

R3. RTO Measurement

Test: Measure the Recovery Time Objective — the elapsed time from lockdown trigger to full recovery completion.

Documentation: Record the following timestamps:

  • Lockdown command sent
  • All lockdown actions confirmed
  • First recovery approval submitted
  • Second recovery approval submitted
  • Full recovery confirmed
  • Application health check passing

Pass criteria: RTO documented. No specific target required for handoff, but the measurement informs SLA commitments.
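
The elapsed times follow directly from the six timestamps. The sketch below assumes each timestamp was recorded as Unix epoch seconds (for example with `date +%s`) and is passed in the order listed above. The `rto_report` helper and its output labels are illustrative, not a required format.

```shell
# Sketch: derive elapsed times from the six recorded timestamps.
# ASSUMPTION: timestamps are Unix epoch seconds, given in the order listed
# in the Documentation step; the output labels are illustrative.
rto_report() (
  [ $# -eq 6 ] || { echo "usage: rto_report T1 T2 T3 T4 T5 T6" >&2; exit 2; }
  sent=$1 locked=$2 appr1=$3 appr2=$4 recovered=$5 healthy=$6

  echo "lockdown_execution_s=$((locked - sent))"
  echo "wait_for_first_approval_s=$((appr1 - locked))"
  echo "between_approvals_s=$((appr2 - appr1))"
  echo "recovery_execution_s=$((recovered - appr2))"
  echo "app_health_after_recovery_s=$((healthy - recovered))"
  # RTO as defined above: lockdown trigger to full recovery completion.
  echo "rto_s=$((recovered - sent))"
)

# Usage (example values):
#   rto_report 100 112 400 460 520 535
```

Breaking the total into phases shows how much of the measured time is automated execution and how much is operator response, which is useful when the figure feeds into SLA discussions.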

Pass/Fail Criteria

A system passes BRB Protocol testing when:

  1. All 8 mandatory test scenarios pass.
  2. RTO is documented (even if only from mandatory test execution).
  3. Forensic package is downloadable and verifiable (SHA256 matches, contents are complete).
  4. At least 2 different operators have successfully participated in the test (approved recoveries).

A single mandatory test failure results in an overall fail. The issue must be resolved and the failed test re-run before acceptance.
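
The first criterion and the single-failure rule can be applied mechanically to a results log. The sketch below assumes results are recorded one per line as a scenario number followed by `pass` or `fail`; that format and the `overall_result` helper are hypothetical. A later line for the same scenario overwrites the earlier one, which models a re-run after a fix. The other three criteria (RTO, forensic package, operator count) are not covered.

```shell
# Sketch: apply the "any mandatory failure fails the run" rule.
# ASSUMPTION: results are recorded one per line as "<scenario> <pass|fail>";
# this file format is hypothetical.
overall_result() {
  awk '
    { result[$1] = $2 }          # a re-run overwrites the earlier result
    END {
      for (i = 1; i <= 8; i++)
        if (result[i] != "pass") bad = bad " " i
      if (bad != "") { print "FAIL: scenarios not passed:" bad; exit 1 }
      print "PASS: all 8 mandatory scenarios passed"
    }'
}

# Usage:
#   overall_result < results.txt
```

A scenario with no recorded result is reported as not passed, so an incomplete run cannot be accepted by omission.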

Documentation Requirements

After testing is complete, the following must be documented and retained:

Item                       Details
Test date                  Date testing was performed
Tester names               Names and roles of all operators involved
Staging system ID          The system_id of the staging clone used
Pass/fail per scenario     Result for each of the 8 mandatory scenarios, with notes on any issues encountered
RTO measurement            Timestamps and elapsed time from lockdown to full recovery
Evidence                   Screenshots, terminal output, or log excerpts for each scenario
Forensic package location  R2 path to the test forensic package

Store test documentation alongside the system's handoff acceptance records.

Sign-Off

BRB Protocol testing requires sign-off from:

  1. Development team lead — confirms the staging clone accurately represents production and that application-level recovery is validated.
  2. Anchor operator — confirms all mandatory tests passed and documentation is complete.

Both signatures must be recorded before the BRB-related item on the Handoff Acceptance Checklist can be checked off.