Metrics Standards

Owner: Anchor MSP Operations Lead
Last reviewed: 2026-05-24

Purpose

Define metrics collection standards for all production systems managed by Anchor MSP. Consistent metrics enable effective monitoring, alerting, capacity planning, and performance analysis.

Scope

All hosts, applications, and infrastructure components managed by Anchor MSP. This covers system-level metrics (CPU, memory, disk, network) and application-level metrics (request rates, error rates, latency).

Policy

Prometheus as the Metrics Backend

  1. Prometheus is the standard metrics collection and storage backend for all managed systems.
  2. All metrics are exposed via HTTP endpoints in Prometheus exposition format.
  3. Prometheus scrapes metrics from targets. Targets do not push metrics.
  4. Prometheus data is retained for 30 days locally. Long-term storage uses remote write to Thanos or equivalent.
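
As an illustration of items 2 and 3, the following is a minimal sketch of a pull-style endpoint that serves metrics in the Prometheus text exposition format using only the Python standard library. The sample values, the `render_metrics` and `serve` helpers, and port 8000 are assumptions made for this example. In practice an official Prometheus client library would normally be used, and this sketch does not escape special characters in label values.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus text exposition lines."""
    lines = []
    for name, labels, value in samples:
        if labels:
            # Sort labels so the output is stable between scrapes.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample data; a real service would read live counters.
SAMPLES = [
    ("myapp_up", {}, 1),
    ("myapp_http_requests_total", {"method": "get", "status": "200"}, 1027),
]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus pulls from this path; the target never pushes.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Start the endpoint; host and port are arbitrary choices for this sketch."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```

Calling `serve()` and requesting `/metrics` returns one line per sample, for example `myapp_up 1`.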

Required Metrics Per System

Every managed system must expose the following baseline metrics. These are non-negotiable for handoff acceptance.

System Metrics (via node_exporter)

Metric | Description | Alert Threshold (typical)
up | Target reachability (1 = up, 0 = down) | == 0 for 2 minutes triggers Critical
node_cpu_seconds_total | CPU usage by mode | > 90% sustained for 10 minutes triggers High
node_memory_MemAvailable_bytes | Available memory | < 10% of total triggers High
node_filesystem_avail_bytes | Available disk space | < 15% of total triggers High, < 5% triggers Critical
node_network_receive_bytes_total | Network bytes received | Anomaly-based alerting
node_network_transmit_bytes_total | Network bytes transmitted | Anomaly-based alerting
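
The thresholds in the table above are normally expressed as Prometheus alerting rules. The arithmetic behind two of them can be sketched in Python as follows. The function names are hypothetical, and counting every non-idle mode (including iowait) as busy CPU time is an assumption of this example, not something this policy specifies.

```python
def cpu_busy_percent(prev, curr):
    """Percent of CPU time spent outside 'idle' between two samples.

    prev and curr map CPU mode -> cumulative seconds, as reported by the
    node_cpu_seconds_total counter at two points in time.
    """
    deltas = {mode: curr[mode] - prev[mode] for mode in curr}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("counter samples must span a positive time window")
    return 100.0 * (total - deltas.get("idle", 0.0)) / total


def disk_severity(avail_bytes, size_bytes):
    """Map available disk space to the severity levels in the table above."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    avail_percent = 100.0 * avail_bytes / size_bytes
    if avail_percent < 5:
        return "critical"
    if avail_percent < 15:
        return "high"
    return None
```

Because node_cpu_seconds_total is a counter, utilization is always derived from the difference between two samples, never from a single reading.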

Application Metrics

Metric | Description | Alert Threshold (typical)
http_requests_total | Total HTTP requests by method, status code | Error rate > 5% for 5 minutes triggers High
http_request_duration_seconds | Request latency histogram | P95 > 2s for 5 minutes triggers High
app_up | Application health check | == 0 for 1 minute triggers Critical

Applications should expose additional metrics specific to their domain (e.g., queue depth, active connections, cache hit rate).
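
To show how the error-rate and P95 thresholds relate to the underlying counters and histogram buckets, here is a simplified sketch. It assumes that only 5xx responses count as errors and that latency is estimated by linear interpolation inside the matching bucket, which is similar in spirit to how Prometheus estimates quantiles from histograms but does not reproduce every edge case. The function names and the input shapes are illustrative.

```python
import math


def error_ratio(requests_by_status):
    """Fraction of requests with a 5xx status.

    requests_by_status maps status code (string) -> request count increase
    over the evaluation window. Treating only 5xx as errors is an assumption.
    """
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in requests_by_status.items() if status.startswith("5"))
    return errors / total


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound and ending with the +Inf bucket (math.inf).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf or count == prev_count:
                # No finite upper bound to interpolate towards.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound
```

The estimate can never be more precise than the bucket boundaries allow, so a bucket boundary at or near 2 seconds makes the P95 threshold more meaningful.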

node_exporter Setup

  1. node_exporter runs on every managed host. It exposes system metrics on port 9100.
  2. node_exporter is installed as a systemd service with automatic restart on failure.
  3. Default collectors are enabled. Additional collectors are enabled as needed per system requirements.
  4. node_exporter must be accessible only from the Prometheus server. Firewall rules restrict port 9100 access.
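
A handoff check for a new host could confirm that the node_exporter endpoint actually serves the baseline system metrics. The sketch below parses exposition text and reports which required names are missing. The helper names are hypothetical, and the parsing is deliberately simple. The `up` metric is not in the list because Prometheus generates it for each scrape target; the exporter does not expose it.

```python
REQUIRED_NODE_METRICS = {
    "node_cpu_seconds_total",
    "node_memory_MemAvailable_bytes",
    "node_filesystem_avail_bytes",
    "node_network_receive_bytes_total",
    "node_network_transmit_bytes_total",
}


def metric_names(exposition_text):
    """Collect the metric names that appear on sample lines."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is either 'name{labels} value' or 'name value'.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_baseline_metrics(exposition_text):
    """Return the required node metrics absent from the scraped text."""
    return sorted(REQUIRED_NODE_METRICS - metric_names(exposition_text))
```

The input text would typically come from an HTTP GET against the host's `/metrics` path on port 9100, issued from the Prometheus server since other sources are blocked by the firewall rules.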

Naming Conventions

  1. Metric names use snake_case. No camelCase, no kebab-case.
  2. Metric names end with a suffix describing the unit of measurement or the metric type:
    • _seconds for durations
    • _bytes for sizes
    • _total for counters
    • _ratio for ratios (0 to 1)
    • _info for informational metrics (always value 1)
  3. Metric names are prefixed with the service or component name: myapp_http_requests_total, not http_requests_total.
  4. Labels use snake_case. Label values are lowercase where possible.
  5. Avoid high-cardinality labels. Labels like user_id, request_id, or ip_address are prohibited in metrics (use logs for these).
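
These conventions lend themselves to an automated check, for example in code review or CI. The following is a minimal sketch. The `check_metric` function is hypothetical, and treating the suffix list above as exhaustive is an assumption: health gauges such as `up` carry no suffix and would need an explicit allowlist.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
ALLOWED_SUFFIXES = ("_seconds", "_bytes", "_total", "_ratio", "_info")
PROHIBITED_LABELS = {"user_id", "request_id", "ip_address"}


def check_metric(name, labels, service_prefix):
    """Return a list of naming problems for one metric (empty list = compliant)."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append(f"{name}: name is not snake_case")
    if not name.startswith(service_prefix + "_"):
        problems.append(f"{name}: missing service prefix '{service_prefix}_'")
    if not name.endswith(ALLOWED_SUFFIXES):
        problems.append(f"{name}: missing unit or type suffix")
    for label in labels:
        if not SNAKE_CASE.match(label):
            problems.append(f"{name}: label '{label}' is not snake_case")
        if label in PROHIBITED_LABELS:
            problems.append(f"{name}: label '{label}' is high-cardinality")
    return problems
```

For instance, `check_metric("myapp_http_requests_total", ["method", "status"], "myapp")` returns an empty list, while the unprefixed `http_requests_total` is reported as missing the service prefix.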

Scrape Intervals

Target Type | Scrape Interval | Justification
Default (node_exporter, app metrics) | 15 seconds | Provides sufficient resolution for alerting and dashboards.
Expensive metrics (custom collectors, database stats) | 60 seconds | Reduces load on the target system when metric collection is resource-intensive.
Blackbox probes (HTTP checks, TCP checks) | 30 seconds | Balances detection speed with probe frequency.

Scrape intervals are configured in the Prometheus scrape config. Do not configure scrape intervals shorter than 15 seconds without approval from the Operations Lead.
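
The 15-second floor can be enforced with a small lint over the scrape configuration once it has been loaded into Python data structures. This sketch assumes each job is a dict with the `job_name` and `scrape_interval` keys used in the Prometheus configuration file, and it understands only single-unit durations such as `15s` or `1m`. The full Prometheus duration syntax accepts more forms. The function names are illustrative.

```python
import re

MIN_INTERVAL_SECONDS = 15
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_interval(text):
    """Convert a simple duration such as '15s' or '1m' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def jobs_needing_approval(scrape_configs):
    """Return names of jobs whose interval is shorter than the 15-second floor."""
    return [
        job["job_name"]
        for job in scrape_configs
        if parse_interval(job["scrape_interval"]) < MIN_INTERVAL_SECONDS
    ]
```

Any job returned by `jobs_needing_approval` would require sign-off from the Operations Lead before deployment.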

Metric Retention and Storage

  1. Local Prometheus retention: 30 days at full resolution.
  2. Long-term storage: metrics are downsampled and stored for 1 year via remote write.
  3. Dashboard queries for periods longer than 30 days use the long-term storage backend.
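
The routing rule in item 3 amounts to comparing the start of the query window with the local retention period. A minimal sketch, in which the backend labels `prometheus` and `long_term` are placeholder names chosen for this example:

```python
from datetime import datetime, timedelta, timezone

LOCAL_RETENTION = timedelta(days=30)


def pick_backend(query_start, now=None):
    """Choose the query backend based on how far back the window starts.

    query_start and now are timezone-aware datetimes; now defaults to the
    current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if now - query_start > LOCAL_RETENTION:
        return "long_term"   # data older than local retention
    return "prometheus"      # still available at full resolution
```

Queries served from long-term storage return downsampled data, so short spikes visible in a recent dashboard may be smoothed out in a 90-day view.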

Metric Hygiene

  1. Remove metrics that are no longer used or monitored. Stale metrics consume storage and cause confusion.
  2. Review metric cardinality quarterly. High-cardinality metrics (i.e., metrics with many unique label combinations) are a common source of Prometheus performance issues.
  3. Document custom metrics in the system's runbook, including their purpose, labels, and expected values.
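
For the quarterly cardinality review, counting distinct label combinations per metric name gives a quick picture of where series growth comes from. The sketch below works on a list of (metric name, labels) pairs. How that list is exported from Prometheus and the per-metric limit are assumptions left to the reviewer.

```python
from collections import Counter


def series_per_metric(series):
    """Count unique label combinations for each metric name.

    series is an iterable of (metric_name, labels_dict) pairs.
    """
    unique = {(name, tuple(sorted(labels.items()))) for name, labels in series}
    return Counter(name for name, _ in unique)


def over_budget(series, limit):
    """Return metric names whose series count exceeds a chosen limit."""
    counts = series_per_metric(series)
    return sorted(name for name, count in counts.items() if count > limit)
```

Metrics flagged by `over_budget` are candidates for dropping a label or moving the detail into logs.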

Exceptions

Systems that cannot expose Prometheus-format metrics must use an exporter or adapter to translate their native metrics format. The exporter configuration must be documented in the system's runbook.