Metrics Standards
Owner: Anchor MSP Operations Lead
Last reviewed: 2026-05-24
Purpose
Define metrics collection standards for all production systems managed by Anchor. Consistent metrics enable effective monitoring, alerting, capacity planning, and performance analysis.
Scope
All hosts, applications, and infrastructure components managed by Anchor MSP. This covers system-level metrics (CPU, memory, disk, network) and application-level metrics (request rates, error rates, latency).
Policy
Prometheus as the Metrics Backend
- Prometheus is the standard metrics collection and storage backend for all managed systems.
- All metrics are exposed via HTTP endpoints in Prometheus exposition format.
- Prometheus scrapes metrics from targets. Targets do not push metrics.
- Prometheus data is retained for 30 days locally. Long-term storage uses remote write to Thanos or equivalent.
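As a minimal sketch of what a target serves on its metrics endpoint, the following Python uses only the standard library to render one counter in the Prometheus text exposition format. The metric name, label values, and the `render_counter` helper are illustrative assumptions, not part of this standard; a production service would normally use an official Prometheus client library instead.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family: a HELP line, a TYPE line, one line per label set.

    `samples` maps a tuple of (label_name, label_value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data, for illustration only.
EXAMPLE = render_counter(
    "myapp_http_requests_total",
    "Total HTTP requests.",
    {
        (("method", "get"), ("status", "200")): 1027,
        (("method", "post"), ("status", "500")): 3,
    },
)
print(EXAMPLE)
```

Prometheus would scrape this text over HTTP on its configured interval; the target never initiates the transfer.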
Required Metrics Per System
Every managed system must expose the following baseline metrics. These are non-negotiable for handoff acceptance.
System Metrics (via node_exporter)
| Metric | Description | Alert Threshold (typical) |
|---|---|---|
| up | Target reachability (1 = up, 0 = down) | == 0 for 2 minutes triggers Critical |
| node_cpu_seconds_total | CPU usage by mode | > 90% sustained for 10 minutes triggers High |
| node_memory_MemAvailable_bytes | Available memory | < 10% of total triggers High |
| node_filesystem_avail_bytes | Available disk space | < 15% of total triggers High, < 5% triggers Critical |
| node_network_receive_bytes_total | Network bytes received | Anomaly-based alerting |
| node_network_transmit_bytes_total | Network bytes transmitted | Anomaly-based alerting |
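The thresholds above are stated as percentages, while the underlying metrics are raw counters and byte gauges. The Python sketch below shows one plausible way the percentages could be derived. It assumes CPU usage is computed as one minus the idle fraction and that disk percentage is available bytes divided by filesystem size; the function names and severity strings are hypothetical, and the actual alert rules may differ.

```python
from typing import Optional


def cpu_utilization(idle_start: float, idle_end: float,
                    elapsed_seconds: float, cpu_count: int) -> float:
    """Estimate CPU utilization (0 to 1) from two samples of idle CPU-seconds.

    The inputs are assumed to be the idle-mode counter summed over all CPUs.
    """
    idle_per_second = (idle_end - idle_start) / elapsed_seconds
    return 1.0 - idle_per_second / cpu_count


def disk_severity(avail_bytes: float, size_bytes: float) -> Optional[str]:
    """Map available disk space to the severities in the table above."""
    fraction_free = avail_bytes / size_bytes
    if fraction_free < 0.05:
        return "critical"
    if fraction_free < 0.15:
        return "high"
    return None


# One CPU that was idle for 60 of the last 600 seconds is 90% utilized.
print(cpu_utilization(1000.0, 1060.0, 600.0, 1))
print(disk_severity(4e9, 100e9))   # 4% free
print(disk_severity(10e9, 100e9))  # 10% free
```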
Application Metrics
| Metric | Description | Alert Threshold (typical) |
|---|---|---|
| http_requests_total | Total HTTP requests by method, status code | Error rate > 5% for 5 minutes triggers High |
| http_request_duration_seconds | Request latency histogram | P95 > 2s for 5 minutes triggers High |
| app_up | Application health check | == 0 for 1 minute triggers Critical |
Applications should expose additional metrics specific to their domain (e.g., queue depth, active connections, cache hit rate).
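To illustrate how the error-rate and P95 thresholds relate to the raw metrics, here is a hedged Python sketch. It assumes that "error" means a 5xx status code (this standard does not define it) and estimates the quantile by linear interpolation within cumulative histogram buckets, which is similar in spirit to how Prometheus estimates quantiles from histograms. The function names and sample numbers are hypothetical.

```python
import math


def error_rate(requests_by_status: dict) -> float:
    """Fraction of requests with a 5xx status (assumed definition of 'error')."""
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in requests_by_status.items()
                 if status.startswith("5"))
    return errors / total


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by bound and end with a math.inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # beyond the highest finite bound
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical five-minute window of data.
statuses = {"200": 940, "404": 20, "500": 40}
latency = [(0.5, 80), (1.0, 90), (2.0, 96), (5.0, 100), (math.inf, 100)]
print(error_rate(statuses) > 0.05)             # would the error alert fire?
print(estimate_quantile(0.95, latency) > 2.0)  # would the latency alert fire?
```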
node_exporter Setup
- node_exporter runs on every managed host. It exposes system metrics on port 9100.
- node_exporter is installed as a systemd service with automatic restart on failure.
- Default collectors are enabled. Additional collectors are enabled as needed per system requirements.
- node_exporter must be accessible only from the Prometheus server. Firewall rules restrict port 9100 access.
Naming Conventions
- Metric names use snake_case. No camelCase, no kebab-case.
- Metric names include a unit suffix describing the unit of measurement:
  - _seconds for durations
  - _bytes for sizes
  - _total for counters
  - _ratio for ratios (0 to 1)
  - _info for informational metrics (always value 1)
- Metric names are prefixed with the service or component name: myapp_http_requests_total, not http_requests_total.
- Labels use snake_case. Label values are lowercase where possible.
- Avoid high-cardinality labels. Labels like user_id, request_id, or ip_address are prohibited in metrics (use logs for these).
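The naming rules above lend themselves to an automated check. The following Python sketch is one possible linter for custom application metrics, written under the assumption that the suffix list and prohibited labels given here are exhaustive; the `check_metric` function and its messages are hypothetical. It is not intended for third-party exporter metrics, which follow their own upstream naming.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
ALLOWED_SUFFIXES = ("_seconds", "_bytes", "_total", "_ratio", "_info")
PROHIBITED_LABELS = {"user_id", "request_id", "ip_address"}


def check_metric(name: str, labels: list, service_prefix: str) -> list:
    """Return a list of naming problems; an empty list means the metric passes."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append(f"{name}: metric name is not snake_case")
    if not name.startswith(service_prefix + "_"):
        problems.append(f"{name}: missing service prefix '{service_prefix}_'")
    if not name.endswith(ALLOWED_SUFFIXES):
        problems.append(f"{name}: missing unit suffix")
    for label in labels:
        if not SNAKE_CASE.match(label):
            problems.append(f"{name}: label '{label}' is not snake_case")
        if label in PROHIBITED_LABELS:
            problems.append(f"{name}: label '{label}' is high-cardinality")
    return problems


print(check_metric("myapp_http_requests_total", ["method", "status"], "myapp"))
for problem in check_metric("httpRequests", ["user_id"], "myapp"):
    print(problem)
```

A check like this could run in CI against a service's metrics endpoint output, so violations are caught before handoff rather than during a cardinality review.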
Scrape Intervals
| Target Type | Scrape Interval | Justification |
|---|---|---|
| Default (node_exporter, app metrics) | 15 seconds | Provides sufficient resolution for alerting and dashboards. |
| Expensive metrics (custom collectors, database stats) | 60 seconds | Reduces load on the target system when metric collection is resource-intensive. |
| Blackbox probes (HTTP checks, TCP checks) | 30 seconds | Balances detection speed with probe frequency. |
Scrape intervals are configured in the Prometheus scrape config. Do not configure scrape intervals shorter than 15 seconds without approval from the Operations Lead.
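The 15-second floor can be checked mechanically. The Python sketch below parses Prometheus-style duration strings such as `15s` or `1m30s` and lists jobs whose interval is shorter than 15 seconds. The job names and the `intervals_needing_approval` helper are hypothetical, and the job-to-interval mapping is assumed to have already been extracted from the scrape config.

```python
import re

_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                 "d": 86400, "w": 604800, "y": 31536000}
_TOKEN = re.compile(r"(\d+)(ms|[smhdwy])")

MIN_INTERVAL_SECONDS = 15  # floor set by this standard


def parse_duration(text: str) -> float:
    """Convert a duration string like '15s' or '1m30s' to seconds."""
    if not text:
        raise ValueError("empty duration")
    pos, total = 0, 0.0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    return total


def intervals_needing_approval(jobs: dict) -> list:
    """Return the names of jobs that scrape more often than the floor allows."""
    return sorted(name for name, interval in jobs.items()
                  if parse_duration(interval) < MIN_INTERVAL_SECONDS)


# Hypothetical job names and intervals.
JOBS = {"node_exporter": "15s", "postgres_stats": "60s",
        "blackbox_http": "30s", "debug_fast": "5s"}
print(intervals_needing_approval(JOBS))
```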
Metric Retention and Storage
- Local Prometheus retention: 30 days at full resolution.
- Long-term storage: metrics are downsampled and stored for 1 year via remote write.
- Dashboard queries for periods longer than 30 days use the long-term storage backend.
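The routing rule in the last bullet can be expressed as a small Python sketch. It assumes the decision depends only on how far back the query starts relative to the 30-day local retention; the backend names `prometheus` and `long_term` are hypothetical labels, not real data source identifiers.

```python
from datetime import datetime, timedelta, timezone

LOCAL_RETENTION = timedelta(days=30)  # local Prometheus retention per this standard


def pick_backend(query_start: datetime, now: datetime) -> str:
    """Choose which backend should serve a query that begins at query_start."""
    if now - query_start > LOCAL_RETENTION:
        return "long_term"
    return "prometheus"


now = datetime(2026, 5, 24, tzinfo=timezone.utc)
print(pick_backend(now - timedelta(days=7), now))   # prometheus
print(pick_backend(now - timedelta(days=90), now))  # long_term
```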
Metric Hygiene
- Remove metrics that are no longer used or monitored. Stale metrics consume storage and cause confusion.
- Review metric cardinality quarterly. High-cardinality metrics (metrics with many unique label combinations) are a common source of Prometheus performance issues.
- Document custom metrics in the system's runbook, including their purpose, labels, and expected values.
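For the quarterly cardinality review, a rough first pass is to count how many series each metric name contributes in a target's exposition output. The Python sketch below does this for a text sample. The per-metric budget of 2 used in the example is an arbitrary illustrative value, not a limit defined by this standard, and the sample text is hypothetical.

```python
import re
from collections import Counter


def series_per_metric(exposition_text: str) -> Counter:
    """Count sample lines per metric name, skipping comments and blank lines.

    Each sample line is one unique name-plus-labels combination.
    """
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name = re.split(r"[{\s]", line, maxsplit=1)[0]
        counts[name] += 1
    return counts


def over_budget(counts: Counter, budget: int) -> list:
    """Return metric names whose series count exceeds the given budget."""
    return sorted(name for name, n in counts.items() if n > budget)


SAMPLE = """\
# HELP myapp_http_requests_total Total HTTP requests.
# TYPE myapp_http_requests_total counter
myapp_http_requests_total{method="get",status="200"} 1027
myapp_http_requests_total{method="get",status="500"} 3
myapp_http_requests_total{method="post",status="200"} 88
myapp_queue_depth 4
"""

counts = series_per_metric(SAMPLE)
print(counts.most_common())
print(over_budget(counts, budget=2))
```

Counting per target understates the total, since the same metric is multiplied across every instance that exposes it; the Prometheus server itself remains the authoritative place to measure overall series counts.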
Exceptions
Systems that cannot expose Prometheus-format metrics must use an exporter or adapter to translate their native metrics format. The exporter configuration must be documented in the system's runbook.