Metrics Basics
Why metrics are different from logs
Logs tell you what happened in a specific request. Metrics tell you what is happening across all requests. In production, metrics are how you detect incidents before users report them. They power dashboards, alerts, and SLO tracking. A good metrics setup is small, stable, and intentionally labeled.
Production goals at this level
- Measure throughput: how many requests per route and status code.
- Measure errors: how many failures by type.
- Measure latency: how long requests and key operations take.
- Keep labels safe: avoid high-cardinality fields like request id.
What to measure first
Start with three core metric types:
- Counter: total number of requests and errors.
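- Histogram: latency distribution. Pick one unit and keep it; Prometheus convention, and the examples here, use seconds.
- Gauge: a value that can go up and down. Use it sparingly, for things like in-flight requests or pool usage.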
You do not need dozens of metrics. You need a small, reliable baseline.
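To make the gauge idea concrete, here is a minimal sketch of an in-flight request gauge built only on the standard library. The names `IN_FLIGHT`, `InFlightGuard`, and `in_flight` are hypothetical and not part of any crate; in a real service you would more likely use the gauge type of your metrics library.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Hypothetical process-wide gauge: requests currently being handled.
static IN_FLIGHT: AtomicI64 = AtomicI64::new(0);

/// RAII guard: increments the gauge when created and decrements it when
/// dropped, so the value is restored even if the handler returns early.
pub struct InFlightGuard;

impl InFlightGuard {
    pub fn new() -> Self {
        IN_FLIGHT.fetch_add(1, Ordering::Relaxed);
        InFlightGuard
    }
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        IN_FLIGHT.fetch_sub(1, Ordering::Relaxed);
    }
}

/// Current gauge value.
pub fn in_flight() -> i64 {
    IN_FLIGHT.load(Ordering::Relaxed)
}
```

Creating `let _guard = InFlightGuard::new();` at the top of a handler keeps the gauge accurate for the lifetime of that request. Unlike a counter, the value goes back down, which is why a gauge fits "how many right now" questions.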
Cardinality: the hidden production risk
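Cardinality is the number of unique label-value combinations a metric produces. Each combination becomes its own time series, so high-cardinality metrics inflate memory usage both in your process and in the monitoring system. Never use these as labels: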
- request_id
- user_id
- email or any unique identifier
- raw error messages
Good label examples:
- route
- method
- status_code
- error_type (small fixed set)
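The arithmetic behind the risk is simple enough to sketch. The function below is a hypothetical helper, not part of any library: it computes the upper bound on time series for one metric as the product of the distinct values each label can take.

```rust
/// Upper bound on the number of time series one metric can produce:
/// the product of the number of distinct values of each label.
/// An empty slice yields 1, since a metric without labels is one series.
pub fn max_series(distinct_values_per_label: &[u64]) -> u64 {
    distinct_values_per_label.iter().product()
}
```

With illustrative numbers of 5 methods, 20 routes, and 10 status codes, `max_series(&[5, 20, 10])` is 1,000. Adding a user_id label for 100,000 users raises the bound to 100,000,000, which is why a single unbounded label is enough to cause trouble.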
Minimal Prometheus setup
One common approach is using the prometheus crate and exposing a /metrics endpoint. Keep it simple and explicit.
Dependencies
# Cargo.toml
[dependencies]
prometheus = "0.13"
once_cell = "1"
Define metrics statically
Define a few global metrics with stable names. Names are part of your contract. Changing them breaks dashboards.
use once_cell::sync::Lazy;
use prometheus::{
    register_histogram_vec, register_int_counter_vec, Encoder, HistogramVec, IntCounterVec,
    TextEncoder,
};
// Request counter, labeled by method, route pattern, and status code.
pub static HTTP_REQUESTS_TOTAL: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "http_requests_total",
        "Total number of HTTP requests",
        &["method", "route", "status"]
    )
    .unwrap()
});
// Latency histogram in seconds. With no buckets given, the crate's defaults are used.
pub static HTTP_REQUEST_DURATION: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "http_request_duration_seconds",
        "HTTP request latency in seconds",
        &["method", "route"]
    )
    .unwrap()
});
Instrumenting a handler
Record count and latency at the request boundary. Keep label values bounded and normalized (use route pattern, not full path).
use std::time::Instant;
async fn hello() -> String {
    "hello".to_string()
}
// Wraps `hello` and records one count and one latency observation per call.
pub async fn hello_instrumented() -> String {
    let start = Instant::now();
    let response = hello().await;
    let elapsed = start.elapsed().as_secs_f64();
    // Labels are hardcoded because this handler only serves GET /hello and always succeeds.
    HTTP_REQUESTS_TOTAL
        .with_label_values(&["GET", "/hello", "200"])
        .inc();
    HTTP_REQUEST_DURATION
        .with_label_values(&["GET", "/hello"])
        .observe(elapsed);
    response
}
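For routes with path parameters, the "route pattern, not full path" rule needs a normalization step. The following is a minimal sketch with a hypothetical helper, `normalize_route`, that assumes any path segment made only of ASCII digits is an identifier. Where the framework exposes the matched route pattern directly (axum has a `MatchedPath` extractor, for example), that is usually the better source.

```rust
/// Collapse variable path segments into a placeholder so the `route` label
/// stays bounded. Assumption: a segment made only of ASCII digits is an id.
pub fn normalize_route(path: &str) -> String {
    path.split('/')
        .map(|segment| {
            if !segment.is_empty() && segment.chars().all(|c| c.is_ascii_digit()) {
                ":id"
            } else {
                segment
            }
        })
        .collect::<Vec<_>>()
        .join("/")
}
```

With this helper, `/users/42/orders/7` and `/users/43/orders/8` both map to `/users/:id/orders/:id`, so they share one time series instead of creating a new one per id.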
Expose metrics endpoint
Prometheus scrapes metrics from an HTTP endpoint. This endpoint should not depend on your main business logic.
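use axum::{response::IntoResponse, routing::get, Router};
// Encodes every metric in the default registry using the Prometheus text format.
// `Encoder` and `TextEncoder` come from the `prometheus` import shown earlier.
async fn metrics_handler() -> impl IntoResponse {
    let encoder = TextEncoder::new();
    let metric_families = prometheus::gather();
    let mut buffer = Vec::new();
    encoder.encode(&metric_families, &mut buffer).unwrap();
    String::from_utf8(buffer).unwrap()
}
// Adds the /metrics route to an existing router.
pub fn with_metrics(app: Router) -> Router {
    app.route("/metrics", get(metrics_handler))
}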
Error metrics
Track error counts separately using a small error_type label set. This allows dashboards to show trends in validation errors versus internal errors.
pub static HTTP_ERRORS_TOTAL: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "http_errors_total",
        "Total number of HTTP errors",
        &["route", "error_type"]
    )
    .unwrap()
});
// Example usage
HTTP_ERRORS_TOTAL
    .with_label_values(&["/users", "internal"])
    .inc();
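One way to keep the error_type label set small and fixed is to route every label value through an enum. This is a sketch under assumptions: the `ErrorType` variants and the status-code mapping in `from_status` are hypothetical and should be adapted to your API.

```rust
/// Hypothetical closed set of error categories used as the `error_type` label.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorType {
    Validation,
    NotFound,
    Timeout,
    Internal,
}

impl ErrorType {
    /// Stable label value; never derived from the raw error message.
    pub fn as_label(self) -> &'static str {
        match self {
            ErrorType::Validation => "validation",
            ErrorType::NotFound => "not_found",
            ErrorType::Timeout => "timeout",
            ErrorType::Internal => "internal",
        }
    }

    /// Classify by HTTP status code (assumed mapping, adjust to your API).
    /// Returns None for statuses that should not count as errors.
    pub fn from_status(status: u16) -> Option<Self> {
        match status {
            400 | 422 => Some(ErrorType::Validation),
            404 => Some(ErrorType::NotFound),
            408 | 504 => Some(ErrorType::Timeout),
            500..=599 => Some(ErrorType::Internal),
            _ => None,
        }
    }
}
```

Passing `ErrorType::Internal.as_label()` as the second value to `with_label_values` means a new label value can only appear when someone adds a variant, which makes cardinality growth visible in code review.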
Latency histograms and SLO thinking
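Histograms let you estimate percentiles such as p95 or p99 from bucket counts. In production, you often care more about tail latency than average latency, because an average can look healthy while a meaningful share of requests is slow. The estimate is only as precise as the bucket boundaries around it, so even at this stage, design your histogram buckets thoughtfully and keep them consistent over time.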
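To show why bucket boundaries matter, here is a sketch of how a percentile can be estimated from cumulative bucket counts using linear interpolation, similar in spirit to what Prometheus-style query engines do. The function `estimate_quantile` is a hypothetical helper. It assumes the buckets are sorted by upper bound, that counts are cumulative, and that the last bucket has a finite bound (it does not special-case a `+Inf` bucket).

```rust
/// Estimate a quantile from cumulative histogram buckets by linear
/// interpolation inside the bucket that contains the target rank.
/// `buckets` holds (upper_bound_seconds, cumulative_count), sorted by bound.
/// Returns None for empty input, zero observations, or q outside [0, 1].
pub fn estimate_quantile(q: f64, buckets: &[(f64, u64)]) -> Option<f64> {
    if !(0.0..=1.0).contains(&q) {
        return None;
    }
    let total = buckets.last()?.1;
    if total == 0 {
        return None;
    }
    let rank = q * total as f64;
    let mut lower_bound = 0.0;
    let mut lower_count = 0u64;
    for &(upper_bound, cumulative) in buckets {
        if cumulative as f64 >= rank {
            let in_bucket = cumulative.saturating_sub(lower_count) as f64;
            if in_bucket == 0.0 {
                // Empty bucket (only reachable when rank is 0): nothing to interpolate.
                return Some(lower_bound);
            }
            let fraction = (rank - lower_count as f64) / in_bucket;
            return Some(lower_bound + (upper_bound - lower_bound) * fraction);
        }
        lower_bound = upper_bound;
        lower_count = cumulative;
    }
    None
}
```

The result is always a point inside one bucket, so a wide bucket around your SLO threshold gives a coarse answer. Placing a bucket boundary at or near the threshold itself keeps the estimate meaningful where it matters.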
Operational checklist
- /metrics endpoint responds and can be scraped.
- Request counter increases for every request.
- Error counter increases only for failed responses.
- Latency histogram shows realistic distribution under load.
- No high-cardinality labels exist.
How metrics and tracing complement each other
Metrics show that error rate increased from 1 percent to 5 percent. Tracing shows which specific requests failed and why. Logs show the exact error details. Production observability requires all three working together.
What comes next
Next we will formalize health checks and readiness signals in a more operational way, separating liveness from dependency health and integrating them with observability signals.