Metrics Standards

Owner: Anchor MSP Operations Lead
Last reviewed: 2026-05-24

Purpose

Define metrics collection standards for all production systems managed by Anchor. Consistent metrics enable effective monitoring, alerting, capacity planning, and performance analysis.

Scope

This standard applies to all hosts, applications, and infrastructure components managed by Anchor MSP. It covers system-level metrics (CPU, memory, disk, network) and application-level metrics (request rates, error rates, latency).

Policy

Prometheus as the Metrics Backend

Prometheus is the standard metrics collection and storage backend for all managed systems.
All metrics are exposed via HTTP endpoints in Prometheus exposition format.
Prometheus scrapes metrics from targets. Targets do not push metrics.
Prometheus data is retained for 30 days locally. Long-term storage uses remote write to Thanos or equivalent.
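To make the exposition format concrete, the following is a minimal Python sketch of how a counter family could be rendered as the text a scrape endpoint returns. The function names (`escape_label_value`, `render_counter`) and the sample metric are illustrative assumptions, not part of this standard; a real application would normally use an official Prometheus client library rather than hand-written rendering.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to a number.
    Hypothetical helper for illustration only; HELP text is not escaped here.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example family; the metric name and label set are assumptions.
    print(render_counter(
        "myapp_http_requests_total",
        "Total HTTP requests.",
        {(("method", "get"), ("code", "200")): 1027},
    ), end="")
```

Prometheus pulls this text over HTTP on each scrape; the target only has to serve it.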

Required Metrics Per System

Every managed system must expose the following baseline metrics. These are non-negotiable for handoff acceptance.

System Metrics (via node_exporter)

Metric	Description	Alert Threshold (typical)
`up`	Target reachability, recorded by Prometheus for each scrape (1 = up, 0 = down)	`== 0` for 2 minutes triggers Critical
`node_cpu_seconds_total`	Cumulative CPU time by mode; usage is derived from its rate	`> 90%` usage sustained for 10 minutes triggers High
`node_memory_MemAvailable_bytes`	Available memory	`< 10%` of total triggers High
`node_filesystem_avail_bytes`	Available disk space	`< 15%` of total triggers High, `< 5%` triggers Critical
`node_network_receive_bytes_total`	Network bytes received	Anomaly-based alerting
`node_network_transmit_bytes_total`	Network bytes transmitted	Anomaly-based alerting
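The thresholds in the table above are percentages, while the underlying metrics are raw counters and byte gauges. The sketch below shows one way the arithmetic could work, in Python rather than in alert-rule syntax. It assumes the totals come from the node_exporter metrics `node_memory_MemTotal_bytes` and `node_filesystem_size_bytes`; the function names and severity strings are hypothetical.

```python
from typing import Optional


def available_fraction(avail_bytes: float, total_bytes: float) -> float:
    """Fraction of a resource that is still available (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return avail_bytes / total_bytes


def disk_severity(avail_bytes: float, size_bytes: float) -> Optional[str]:
    """Map available disk space to the typical thresholds in the table."""
    frac = available_fraction(avail_bytes, size_bytes)
    if frac < 0.05:
        return "critical"
    if frac < 0.15:
        return "high"
    return None


def memory_severity(avail_bytes: float, total_bytes: float) -> Optional[str]:
    """Available memory below 10% of total maps to High."""
    return "high" if available_fraction(avail_bytes, total_bytes) < 0.10 else None


def cpu_busy_percent(idle_start: float, idle_end: float,
                     elapsed_seconds: float, cpu_count: int) -> float:
    """Derive CPU usage from two samples of the idle-mode counter.

    idle_start and idle_end are the idle seconds summed across all CPUs.
    """
    idle_fraction = (idle_end - idle_start) / (elapsed_seconds * cpu_count)
    return 100.0 * (1.0 - idle_fraction)
```

The "sustained for N minutes" part of each threshold is not modelled here; in practice that duration is handled by the alerting rule rather than by the calculation itself.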

Application Metrics

Metric	Description	Alert Threshold (typical)
`http_requests_total`	Total HTTP requests by method and status code	Error rate `> 5%` for 5 minutes triggers High
`http_request_duration_seconds`	Request latency histogram	P95 `> 2s` for 5 minutes triggers High
`app_up`	Application health check	`== 0` for 1 minute triggers Critical
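The error-rate and P95 thresholds in the table above are both derived values. As a rough illustration, the Python sketch below computes an error ratio from counter increases and estimates a quantile from cumulative histogram buckets by linear interpolation, which is the general approach Prometheus takes for classic histograms. Treating only some status codes (for example 5xx) as errors is an assumption; this standard does not define which codes count. Function names are hypothetical.

```python
import math


def error_ratio(error_delta: float, total_delta: float) -> float:
    """Share of requests that failed over a window (0.0 when idle)."""
    return 0.0 if total_delta == 0 else error_delta / total_delta


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets ending with +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # The quantile lies beyond the last finite bucket.
                return prev_bound
            span = bound - prev_bound
            return prev_bound + span * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Illustrative latency buckets in seconds (made-up data).
    latency = [(0.5, 80), (1.0, 90), (2.0, 96), (math.inf, 100)]
    p95 = histogram_quantile(0.95, latency)
    print("P95 breaches 2s threshold:", p95 > 2.0)
```

Because the estimate is interpolated within a bucket, its accuracy depends on bucket boundaries; a bucket edge at or near the 2s threshold keeps the P95 alert meaningful.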

Applications should expose additional metrics specific to their domain (e.g., queue depth, active connections, cache hit rate).

node_exporter Setup

node_exporter runs on every managed host. It exposes system metrics on port 9100.
node_exporter is installed as a systemd service with automatic restart on failure.
Default collectors are enabled. Additional collectors are enabled as needed per system requirements.
node_exporter must be accessible only from the Prometheus server. Firewall rules restrict port 9100 access.

Naming Conventions

Metric names use snake_case. No camelCase, no kebab-case.
Metric names end with a suffix that identifies the unit or metric type:
- _seconds for durations
- _bytes for sizes
- _total for counters
- _ratio for ratios (0 to 1)
- _info for informational metrics (always value 1)
Metric names are prefixed with the service or component name: myapp_http_requests_total, not http_requests_total.
Labels use snake_case. Label values are lowercase where possible.
Avoid high-cardinality labels. Labels like user_id, request_id, or ip_address are prohibited in metrics (use logs for these).
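These conventions lend themselves to an automated check, for example in CI or during handoff review. The Python sketch below is one possible shape for such a check. It assumes the suffix list above is exhaustive and that the check is applied to custom application metrics only (exporter-provided names such as `node_memory_MemAvailable_bytes` would not pass the snake_case rule). The `check_metric` function and its messages are hypothetical.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
ALLOWED_SUFFIXES = ("_seconds", "_bytes", "_total", "_ratio", "_info")
PROHIBITED_LABELS = {"user_id", "request_id", "ip_address"}


def check_metric(name: str, labels: list, service_prefix: str) -> list:
    """Return a list of convention violations for one metric (empty if clean)."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append(f"{name}: name is not snake_case")
    if not name.startswith(service_prefix + "_"):
        problems.append(f"{name}: missing '{service_prefix}_' prefix")
    if not name.endswith(ALLOWED_SUFFIXES):
        problems.append(f"{name}: missing unit or type suffix")
    for label in labels:
        if not SNAKE_CASE.match(label):
            problems.append(f"{name}: label '{label}' is not snake_case")
        if label in PROHIBITED_LABELS:
            problems.append(f"{name}: label '{label}' is high-cardinality")
    return problems


if __name__ == "__main__":
    for line in check_metric("myappHttpRequests", ["user_id"], "myapp"):
        print(line)
```

A metric such as `myapp_http_requests_total` with labels `method` and `status_code` passes with no findings.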

Scrape Intervals

Target Type	Scrape Interval	Justification
Default (node_exporter, app metrics)	15 seconds	Provides sufficient resolution for alerting and dashboards.
Expensive metrics (custom collectors, database stats)	60 seconds	Reduces load on the target system when metric collection is resource-intensive.
Blackbox probes (HTTP checks, TCP checks)	30 seconds	Balances detection speed with probe frequency.

Scrape intervals are configured in the Prometheus scrape config. Do not configure scrape intervals shorter than 15 seconds without approval from the Operations Lead.
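A review script could flag scrape jobs whose interval falls below the 15-second floor. The Python sketch below parses a simple single-unit duration string such as `15s` or `1m` and reports whether approval is needed. It deliberately handles only a subset of the duration syntax Prometheus accepts, and the function names are assumptions made for illustration.

```python
import re

# Seconds per unit for the subset of duration units handled here.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
_DURATION = re.compile(r"^(\d+)(ms|s|m|h)$")

MIN_INTERVAL_SECONDS = 15.0  # floor stated in this standard


def parse_duration_seconds(text: str) -> float:
    """Convert a single-unit duration like '15s' or '1m' to seconds."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


def needs_approval(scrape_interval: str) -> bool:
    """True when the interval is shorter than the 15-second minimum."""
    return parse_duration_seconds(scrape_interval) < MIN_INTERVAL_SECONDS


if __name__ == "__main__":
    for interval in ("15s", "30s", "60s", "10s"):
        print(interval, "needs approval" if needs_approval(interval) else "ok")
```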

Metric Retention and Storage

Local Prometheus retention: 30 days at full resolution.
Long-term storage: metrics are shipped via remote write, downsampled, and retained for 1 year.
Dashboard queries for periods longer than 30 days use the long-term storage backend.
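The routing rule for dashboard queries can be expressed in a few lines. The Python sketch below picks a backend based on how far back a query reaches; the backend labels `"prometheus"` and `"long_term"` and the function name are placeholders, not names defined by this standard.

```python
from datetime import datetime, timedelta, timezone

LOCAL_RETENTION = timedelta(days=30)  # local Prometheus retention


def backend_for(query_start: datetime, now: datetime) -> str:
    """Choose the storage backend for a query starting at `query_start`."""
    if now - query_start > LOCAL_RETENTION:
        return "long_term"
    return "prometheus"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(backend_for(now - timedelta(days=7), now))
    print(backend_for(now - timedelta(days=90), now))
```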

Metric Hygiene

Remove metrics that are no longer used or monitored. Stale metrics consume storage and cause confusion.
Review metric cardinality quarterly. High-cardinality metrics (i.e., metrics with many unique label combinations) are a common source of Prometheus performance issues.
Document custom metrics in the system's runbook, including their purpose, labels, and expected values.
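For the quarterly cardinality review, a quick first pass is to count how many series each metric name contributes in a single scrape. The Python sketch below does this over exposition-format text. The `series_per_metric` and `over_budget` helpers and the per-metric limit are hypothetical; this standard does not set a numeric cardinality budget.

```python
from collections import Counter


def series_per_metric(exposition_text: str) -> Counter:
    """Count sample lines per metric name in one scrape's text output.

    Each sample line in a scrape is a distinct series, so the count per
    name approximates that metric's cardinality on this target.
    """
    counts = Counter()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[: line.index("{")]
        else:
            name = line.split()[0]
        counts[name] += 1
    return counts


def over_budget(counts: Counter, limit: int) -> list:
    """Metric names whose series count exceeds an assumed limit."""
    return sorted(name for name, n in counts.items() if n > limit)
```

Counting per target understates fleet-wide cardinality, so the result is best treated as a pointer to metrics worth a closer look rather than a final figure.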

Exceptions

Systems that cannot expose Prometheus-format metrics must use an exporter or adapter to translate their native metrics format. The exporter configuration must be documented in the system's runbook.
