- Overview
- Metrics Naming Convention
- Application Metrics
- Database Metrics
- Background Job Metrics
- Business Metrics
- System Metrics
- Custom Metrics Guide
- SLIs and SLOs
- Best Practices
This document catalogs all available metrics in the Ampel system, their meanings, and recommended usage for monitoring and alerting.
Metrics Endpoint: http://localhost:8080/metrics
Format: Prometheus text exposition format
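To make the exposition format concrete, the following is a minimal sketch of parsing one sample line into its name, labels, and value. The `Sample` type and `parse_sample` function are hypothetical names used for illustration and are not part of Ampel. The parser deliberately ignores escaped quotes, commas inside label values, and trailing timestamps.

```rust
use std::collections::BTreeMap;

/// One parsed sample line from the text exposition format.
#[derive(Debug, PartialEq)]
pub struct Sample {
    pub name: String,
    pub labels: BTreeMap<String, String>,
    pub value: f64,
}

/// Parses a single sample line such as
/// `ampel_http_requests_total{method="GET",status="200"} 1234`.
/// Simplifications (assumptions of this sketch): label values contain no
/// commas or escaped quotes, and the line carries no trailing timestamp.
pub fn parse_sample(line: &str) -> Option<Sample> {
    let line = line.trim();
    // Skip blank lines and `# HELP` / `# TYPE` comment lines.
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // The value is the last space-separated token.
    let (selector, value) = line.rsplit_once(' ')?;
    let value: f64 = value.parse().ok()?;
    let selector = selector.trim();

    // Split `name{labels}` into the name and the text between the braces.
    let (name, label_part) = match selector.split_once('{') {
        Some((name, rest)) => (name, rest.strip_suffix('}')?),
        None => (selector, ""),
    };

    let mut labels = BTreeMap::new();
    for pair in label_part.split(',').filter(|p| !p.trim().is_empty()) {
        let (key, raw) = pair.split_once('=')?;
        let val = raw.trim().strip_prefix('"')?.strip_suffix('"')?;
        labels.insert(key.trim().to_string(), val.to_string());
    }
    Some(Sample { name: name.to_string(), labels, value })
}
```

For production use, a maintained parser is a safer choice than this sketch, since the full format also covers escaping, timestamps, and `# TYPE` metadata.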
Ampel follows Prometheus naming conventions:
<namespace>_<subsystem>_<name>_<unit>_<suffix>
Examples:
- ampel_http_requests_total: Counter
- ampel_http_request_duration_seconds: Histogram
- ampel_db_connections_active: Gauge
Suffixes:
- _total: Counter (cumulative)
- _seconds: Time duration
- _bytes: Size
- _ratio: Ratio between 0 and 1 (not a 0-100 percentage)
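The convention above can be checked mechanically. The sketch below is one possible lint for custom metric names. `Kind` and `check_metric_name` are hypothetical helpers, and the rules it enforces (an `ampel_` prefix, lowercase snake_case, `_total` on counters) are an interpretation of this section rather than an existing Ampel tool.

```rust
/// Metric kinds used in this catalog.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Counter,
    Gauge,
    Histogram,
}

/// Returns the list of convention violations for `name`.
/// An empty list means the name passes this (assumed) set of rules.
pub fn check_metric_name(name: &str, kind: Kind) -> Vec<String> {
    let mut problems = Vec::new();

    // Rule 1: custom metrics live in the `ampel` namespace.
    if !name.starts_with("ampel_") {
        problems.push("missing `ampel_` namespace prefix".to_string());
    }

    // Rule 2: lowercase snake_case only.
    let snake_case = !name.is_empty()
        && name
            .chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_');
    if !snake_case {
        problems.push("name must be lowercase snake_case".to_string());
    }

    // Rule 3: counters carry the `_total` suffix.
    if kind == Kind::Counter && !name.ends_with("_total") {
        problems.push("counter names must end in `_total`".to_string());
    }

    problems
}
```

Calling `check_metric_name("jobsProcessed", Kind::Counter)` reports all three violations, while `ampel_http_requests_total` passes cleanly.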
Labels:
Use labels for dimensions, not separate metrics:
# ✅ Good: One metric with labels
http_requests_total{method="GET", status="200", path="/api/prs"}
# ❌ Bad: Separate metrics per status
http_requests_200_total
http_requests_404_total
http_requests_500_total
Type: Counter
Description: Total number of HTTP requests received
Labels:
- method: HTTP method (GET, POST, PUT, DELETE, PATCH)
- path: Request path (e.g., /api/prs, /api/auth/login)
- status: HTTP status code (200, 404, 500, etc.)
Example:
ampel_http_requests_total{method="GET",path="/api/prs",status="200"} 1234
ampel_http_requests_total{method="POST",path="/api/auth/login",status="401"} 15
Usage:
# Request rate per second
rate(ampel_http_requests_total[5m])
# Success rate
sum(rate(ampel_http_requests_total{status=~"2.."}[5m]))
/ sum(rate(ampel_http_requests_total[5m]))
# Error rate
sum(rate(ampel_http_requests_total{status=~"5.."}[5m]))
/ sum(rate(ampel_http_requests_total[5m]))
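The queries above rely on `rate()` turning a cumulative counter into a per-second value. As a rough illustration of that idea, the sketch below derives a rate from two counter readings and treats a drop in value as a counter reset. `CounterSample`, `per_second_rate`, and `ratio` are hypothetical names. Prometheus itself works over all samples in the window and extrapolates to the window edges, which this sketch does not attempt.

```rust
/// A counter reading taken at `timestamp_secs` (Unix seconds).
#[derive(Debug, Clone, Copy)]
pub struct CounterSample {
    pub timestamp_secs: f64,
    pub value: f64,
}

/// Average per-second increase between two readings.
/// A lower `later` value is treated as a counter reset (for example a
/// process restart), so the new value counts as the whole increase.
/// Returns `None` when no time has elapsed between the readings.
pub fn per_second_rate(earlier: CounterSample, later: CounterSample) -> Option<f64> {
    let elapsed = later.timestamp_secs - earlier.timestamp_secs;
    if elapsed <= 0.0 {
        return None;
    }
    let increase = if later.value >= earlier.value {
        later.value - earlier.value
    } else {
        later.value
    };
    Some(increase / elapsed)
}

/// Ratio of two rates, such as successful requests over all requests.
/// Returns `None` when there was no traffic at all.
pub fn ratio(part: f64, total: f64) -> Option<f64> {
    if total > 0.0 {
        Some(part / total)
    } else {
        None
    }
}
```

Returning `None` for zero traffic mirrors what the PromQL division does: with no requests in the window the ratio is undefined rather than 0 or 1, which matters when alerting on success rate.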
Type: Histogram
Description: HTTP request duration in seconds
Labels:
- method: HTTP method
- path: Request path
- status: HTTP status code
Buckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
Example:
ampel_http_request_duration_seconds_bucket{method="GET",path="/api/prs",le="0.1"} 500
ampel_http_request_duration_seconds_bucket{method="GET",path="/api/prs",le="0.5"} 800
ampel_http_request_duration_seconds_sum{method="GET",path="/api/prs"} 245.5
ampel_http_request_duration_seconds_count{method="GET",path="/api/prs"} 1000
Usage:
# P95 latency
histogram_quantile(0.95,
rate(ampel_http_request_duration_seconds_bucket[5m])
)
# P99 latency
histogram_quantile(0.99,
rate(ampel_http_request_duration_seconds_bucket[5m])
)
# Average latency
rate(ampel_http_request_duration_seconds_sum[5m])
/ rate(ampel_http_request_duration_seconds_count[5m])
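Percentiles from a histogram are estimates, not exact values. The sketch below shows the usual approach of interpolating linearly inside the bucket that contains the requested rank, which is also how `histogram_quantile` behaves for classic histograms. `Bucket` and `estimate_quantile` are hypothetical names, and the function assumes cumulative buckets sorted by upper bound with a final `+Inf` bucket.

```rust
/// A cumulative histogram bucket: the number of observations that were
/// less than or equal to `upper_bound`.
pub struct Bucket {
    pub upper_bound: f64,
    pub cumulative_count: f64,
}

/// Estimates quantile `q` (0.0 to 1.0) from cumulative buckets.
/// Assumes `buckets` is sorted by upper bound and ends with a +Inf bucket.
pub fn estimate_quantile(q: f64, buckets: &[Bucket]) -> Option<f64> {
    if !(0.0..=1.0).contains(&q) || buckets.len() < 2 {
        return None;
    }
    let last = buckets.last()?;
    if !last.upper_bound.is_infinite() || last.cumulative_count <= 0.0 {
        return None;
    }
    let rank = q * last.cumulative_count;

    // First bucket whose cumulative count reaches the requested rank.
    let idx = buckets.iter().position(|b| b.cumulative_count >= rank)?;
    if idx == buckets.len() - 1 {
        // The rank lies beyond the highest finite boundary, so the best
        // available answer is that boundary itself.
        return Some(buckets[idx - 1].upper_bound);
    }

    // Lower edge of the bucket; the first bucket is assumed to start at 0.
    let (lower_bound, lower_count) = if idx == 0 {
        (0.0, 0.0)
    } else {
        (buckets[idx - 1].upper_bound, buckets[idx - 1].cumulative_count)
    };
    let in_bucket = buckets[idx].cumulative_count - lower_count;
    if in_bucket <= 0.0 {
        return Some(buckets[idx].upper_bound);
    }
    let fraction = (rank - lower_count) / in_bucket;
    Some(lower_bound + (buckets[idx].upper_bound - lower_bound) * fraction)
}
```

With only the two finite buckets from the example above (0.1 and 0.5) plus `+Inf`, any quantile above 0.8 collapses to 0.5, because 200 of the 1000 observations fall past the last finite boundary. This is why bucket boundaries should bracket the latency targets being measured.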
Type: Histogram
Description: HTTP request payload size in bytes
Labels:
- method: HTTP method
- path: Request path
Buckets: 100, 1000, 10000, 100000, 1000000, 10000000
Type: Histogram
Description: HTTP response payload size in bytes
Labels:
- method: HTTP method
- path: Request path
- status: HTTP status code
Type: Gauge
Description: Number of currently active HTTP connections
Example:
ampel_http_connections_active 45
Type: Counter
Description: Total number of login attempts
Labels:
- success: true/false
- provider: github/gitlab/bitbucket/local
Example:
ampel_auth_login_attempts_total{success="true",provider="github"} 1500
ampel_auth_login_attempts_total{success="false",provider="local"} 23
Type: Counter
Description: Total number of JWT token refreshes
Labels:
- success: true/false
Type: Gauge
Description: Number of currently active user sessions
Type: Gauge
Description: Number of active database connections
Example:
ampel_db_connections_active{database="ampel"} 8
Type: Gauge
Description: Number of idle database connections in pool
Type: Gauge
Description: Maximum number of database connections allowed
Example:
ampel_db_connections_max{database="ampel"} 20
Usage:
# Connection pool utilization percentage
(ampel_db_connections_active / ampel_db_connections_max) * 100
# Alert when >90% utilized
(ampel_db_connections_active / ampel_db_connections_max) > 0.9
Type: Histogram
Description: Time spent waiting for a database connection
Buckets: 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1
Type: Histogram
Description: Database query execution duration
Labels:
- operation: Query operation (select, insert, update, delete)
- table: Primary table being queried
Example:
ampel_db_query_duration_seconds_bucket{operation="select",table="pull_requests",le="0.1"} 950
ampel_db_query_duration_seconds_sum{operation="select",table="pull_requests"} 87.5
Usage:
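# Slow queries (>1s) per second: all queries minus those that finished within 1s
# (assumes the histogram has a bucket boundary at le="1")
sum(rate(ampel_db_query_duration_seconds_count[5m]))
- sum(rate(ampel_db_query_duration_seconds_bucket{le="1"}[5m]))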
# P95 query latency by table
histogram_quantile(0.95,
sum(rate(ampel_db_query_duration_seconds_bucket[5m])) by (le, table)
)
Type: Counter
Description: Total number of database queries executed
Labels:
- operation: Query operation
- table: Primary table
- status: success/error
Type: Histogram
Description: Number of rows returned by SELECT queries
Labels:
- table: Table being queried
Buckets: 1, 10, 100, 1000, 10000, 100000
Type: Counter
Description: Total number of database transactions
Labels:
- status: committed/rolled_back
Type: Histogram
Description: Transaction duration from BEGIN to COMMIT/ROLLBACK
Type: Gauge
Description: Number of jobs currently in queue
Labels:
- job_type: Type of job (pr_sync, metrics_collection, webhook_delivery)
- priority: Job priority (low, normal, high)
Example:
ampel_jobs_queued{job_type="pr_sync",priority="normal"} 42
ampel_jobs_queued{job_type="metrics_collection",priority="low"} 5
Alert:
# Queue backlog alert
ampel_jobs_queued > 1000
Type: Counter
Description: Total number of jobs processed
Labels:
- job_type: Type of job
- status: success/failure/retry
Example:
ampel_jobs_processed_total{job_type="pr_sync",status="success"} 5420
ampel_jobs_processed_total{job_type="pr_sync",status="failure"} 12
ampel_jobs_processed_total{job_type="pr_sync",status="retry"} 8
Usage:
# Job success rate
sum(rate(ampel_jobs_processed_total{status="success"}[5m]))
/ sum(rate(ampel_jobs_processed_total[5m]))
# Job failure rate
sum(rate(ampel_jobs_processed_total{status="failure"}[5m]))
/ sum(rate(ampel_jobs_processed_total[5m]))
Type: Histogram
Description: Job execution duration
Labels:
- job_type: Type of job
- status: success/failure
Buckets: 0.1, 0.5, 1, 5, 10, 30, 60, 120, 300
Type: Histogram
Description: Number of retries before job succeeded or permanently failed
Labels:
- job_type: Type of job
Buckets: 0, 1, 2, 3, 5, 10
Type: Gauge
Description: Total number of pull requests by status
Labels:
- status: Ampel status (green, yellow, red)
- state: PR state (open, merged, closed)
- provider: github/gitlab/bitbucket
Example:
ampel_prs_total{status="green",state="open",provider="github"} 150
ampel_prs_total{status="yellow",state="open",provider="gitlab"} 45
ampel_prs_total{status="red",state="open",provider="bitbucket"} 10
Type: Histogram
Description: Time from PR creation to merge
Labels:
- provider: github/gitlab/bitbucket
- repository: Repository name
Buckets: 3600, 7200, 14400, 28800, 86400, 172800, 604800 (1h, 2h, 4h, 8h, 1d, 2d, 1w)
Example:
ampel_pr_time_to_merge_seconds_sum{provider="github"} 432000
ampel_pr_time_to_merge_seconds_count{provider="github"} 50
Usage:
# Average time to merge (in hours)
(
rate(ampel_pr_time_to_merge_seconds_sum[24h])
/ rate(ampel_pr_time_to_merge_seconds_count[24h])
) / 3600
# P95 time to merge
histogram_quantile(0.95,
rate(ampel_pr_time_to_merge_seconds_bucket[24h])
)
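The arithmetic behind the average-time-to-merge query, and the readable forms of the bucket boundaries, can be checked with a few lines of code. In this sketch `average_hours` and `humanize_seconds` are hypothetical helpers written for illustration. The sample values (a sum of 432000 seconds over 50 merges) are taken from the example above.

```rust
/// Average duration in hours from a histogram's `_sum` (seconds) and
/// `_count`. Returns `None` when nothing has been observed yet.
pub fn average_hours(sum_seconds: f64, count: f64) -> Option<f64> {
    if count > 0.0 {
        Some(sum_seconds / count / 3600.0)
    } else {
        None
    }
}

/// Renders a bucket boundary given in seconds using the largest unit
/// that divides it evenly, e.g. 86400 becomes "1d".
pub fn humanize_seconds(secs: u64) -> String {
    const UNITS: [(u64, &str); 4] = [
        (604_800, "w"),
        (86_400, "d"),
        (3_600, "h"),
        (60, "m"),
    ];
    for &(size, suffix) in UNITS.iter() {
        if secs >= size && secs % size == 0 {
            return format!("{}{}", secs / size, suffix);
        }
    }
    format!("{}s", secs)
}
```

For the example series, `average_hours(432000.0, 50.0)` gives an average of 2.4 hours per merged PR.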
Type: Histogram
Description: Time from PR creation to first review
Labels:
- provider: github/gitlab/bitbucket
Buckets: 600, 1800, 3600, 7200, 14400, 28800, 86400 (10m, 30m, 1h, 2h, 4h, 8h, 1d)
Type: Histogram
Description: Number of review rounds before approval
Buckets: 0, 1, 2, 3, 5, 10
Type: Histogram
Description: Number of comments on a PR
Buckets: 0, 5, 10, 20, 50, 100
Type: Gauge
Description: Total number of repositories tracked
Labels:
- provider: github/gitlab/bitbucket
- organization: Organization name
Type: Counter
Description: Total number of successful repository syncs
Labels:
- provider: github/gitlab/bitbucket
Type: Counter
Description: Total number of repository sync errors
Labels:
- provider: github/gitlab/bitbucket
- error_type: Error category (rate limit, authentication, network, etc.)
Type: Gauge
Description: Unix timestamp of last successful repository sync
Labels:
- repository: Repository ID
- provider: github/gitlab/bitbucket
Usage:
# Repositories not synced in >1 hour
(time() - ampel_repo_last_sync_timestamp) > 3600
# Time since last sync (in minutes)
(time() - ampel_repo_last_sync_timestamp) / 60
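The same staleness check can be done in application code, for example in a health endpoint. The sketch below is a minimal version under stated assumptions: `seconds_since_sync` and `is_stale` are hypothetical helpers, and the current time is passed in explicitly so the logic is easy to test.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Seconds elapsed between `last_sync_unix` (Unix seconds, as exported by
/// the gauge) and `now`. A timestamp in the future is clamped to 0.
pub fn seconds_since_sync(now: SystemTime, last_sync_unix: f64) -> f64 {
    let now_secs = now
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs_f64())
        .unwrap_or(0.0);
    (now_secs - last_sync_unix).max(0.0)
}

/// True when the last successful sync is older than `max_age`.
pub fn is_stale(now: SystemTime, last_sync_unix: f64, max_age: Duration) -> bool {
    seconds_since_sync(now, last_sync_unix) > max_age.as_secs_f64()
}
```

In real use `now` would be `SystemTime::now()` and `max_age` would be `Duration::from_secs(3600)` to match the one-hour threshold in the query above.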
Type: Counter
Description: Total API requests to Git providers
Labels:
- provider: github/gitlab/bitbucket
- endpoint: API endpoint
- status: HTTP status code
Type: Histogram
Description: API request duration to Git providers
Labels:
- provider: github/gitlab/bitbucket
- endpoint: API endpoint
Type: Gauge
Description: Remaining API rate limit for provider
Labels:
- provider: github/gitlab/bitbucket
- token_id: Token identifier (hashed)
Example:
ampel_provider_rate_limit_remaining{provider="github",token_id="abc123"} 4500
Alert:
# Rate limit warning
ampel_provider_rate_limit_remaining < 1000
Type: Gauge
Description: Resident memory size in bytes (RSS)
Type: Gauge
Description: Virtual memory size in bytes
Type: Counter
Description: Total CPU time consumed in seconds
Usage:
# CPU usage as a percentage of one core (can exceed 100 on multi-core hosts)
rate(process_cpu_seconds_total[1m]) * 100
Type: Gauge
Description: Number of open file descriptors
Type: Gauge
Description: Maximum number of file descriptors
Alert:
# File descriptor exhaustion
process_open_fds / process_max_fds > 0.8
Type: Gauge
Description: Number of active async tasks
Type: Counter
Description: Total number of tasks spawned
Type: Gauge
Description: Number of tasks waiting in blocking thread pool queue
use once_cell::sync::Lazy;
use prometheus::{Counter, Gauge, Histogram, HistogramOpts};

// Counter example
static PR_MERGED_COUNTER: Lazy<Counter> = Lazy::new(|| {
    Counter::new(
        "ampel_prs_merged_total",
        "Total number of PRs merged",
    )
    .expect("metric creation failed")
});

// Histogram example
static PR_MERGE_TIME: Lazy<Histogram> = Lazy::new(|| {
    Histogram::with_opts(
        HistogramOpts::new(
            "ampel_pr_merge_time_seconds",
            "Time from PR creation to merge",
        )
        .buckets(vec![3600.0, 7200.0, 14400.0, 86400.0, 604800.0]),
    )
    .expect("metric creation failed")
});

// Gauge example
static ACTIVE_USERS: Lazy<Gauge> = Lazy::new(|| {
    Gauge::new(
        "ampel_active_users",
        "Number of currently active users",
    )
    .expect("metric creation failed")
});

use prometheus::Registry;

// Register each metric once at startup; registering the same metric twice returns an error.
pub fn register_metrics(registry: &Registry) -> prometheus::Result<()> {
    registry.register(Box::new(PR_MERGED_COUNTER.clone()))?;
    registry.register(Box::new(PR_MERGE_TIME.clone()))?;
    registry.register(Box::new(ACTIVE_USERS.clone()))?;
    Ok(())
}

// Increment counter
PR_MERGED_COUNTER.inc();

// Observe histogram value (seconds); merged_at and created_at are chrono timestamps
let duration = (merged_at - created_at).num_seconds() as f64;
PR_MERGE_TIME.observe(duration);

// Set gauge value
ACTIVE_USERS.set(current_user_count as f64);

use prometheus::{IntCounterVec, Opts};

static PR_COUNTER: Lazy<IntCounterVec> = Lazy::new(|| {
    IntCounterVec::new(
        Opts::new("ampel_prs_total", "Total PRs by status"),
        &["status", "provider"],
    )
    .expect("metric creation failed")
});

// Use with labels
PR_COUNTER.with_label_values(&["green", "github"]).inc();
PR_COUNTER.with_label_values(&["yellow", "gitlab"]).inc();

# Uptime percentage
100 * (
sum(up{job="ampel-api"})
/ count(up{job="ampel-api"})
)
Target: 99.9% (3 nines)
# Success rate percentage
100 * (
sum(rate(ampel_http_requests_total{status=~"2.."}[5m]))
/ sum(rate(ampel_http_requests_total[5m]))
)
Target: 99.5%
# P95 latency
histogram_quantile(0.95,
rate(ampel_http_request_duration_seconds_bucket[5m])
)
Target: <500ms for 95% of requests
| SLI | SLO | Error Budget (30 days) |
|---|---|---|
| Availability | 99.9% | 43.2 minutes |
| Success Rate | 99.5% | 1,296,000 failed requests (at 100 req/s) |
| P95 Latency | <500ms | 12,960,000 slow requests (5% of traffic at 100 req/s) |
Error Budget Calculation:
# Availability error budget remaining, in minutes over a 30-day window
(
(1 - 0.999) * 30 * 24 * 60 # Total allowed downtime
- (1 - avg(avg_over_time(up{job="ampel-api"}[30d]))) * 30 * 24 * 60 # Actual downtime
)
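The budget arithmetic is simple enough to verify in code. The sketch below reproduces it under the assumption of a fixed window and a steady request rate. All three function names are hypothetical and exist only for this illustration.

```rust
/// Allowed downtime in minutes for an availability SLO (e.g. 0.999)
/// over a window of `window_days` days.
pub fn allowed_downtime_minutes(slo: f64, window_days: f64) -> f64 {
    (1.0 - slo) * window_days * 24.0 * 60.0
}

/// Number of requests allowed to miss the SLO over the window,
/// assuming a steady `requests_per_second`.
pub fn allowed_bad_requests(slo: f64, requests_per_second: f64, window_days: f64) -> f64 {
    (1.0 - slo) * requests_per_second * window_days * 86_400.0
}

/// Fraction of the error budget still unspent, given the availability
/// observed so far in the window. Negative means the budget is exhausted.
/// Returns `None` for a 100% SLO, which leaves no budget to divide by.
pub fn budget_remaining_fraction(slo: f64, observed: f64) -> Option<f64> {
    let budget = 1.0 - slo;
    if budget <= 0.0 {
        return None;
    }
    Some((budget - (1.0 - observed)) / budget)
}
```

For the availability target, `allowed_downtime_minutes(0.999, 30.0)` yields roughly 43.2 minutes, matching the table. An observed availability of 99.95% against that target leaves about half the budget unspent.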
# ✅ Good: Clear, consistent naming
ampel_http_requests_total
ampel_db_query_duration_seconds
ampel_jobs_processed_total
# ❌ Bad: Inconsistent, unclear
requests
query_time
jobsProcessed
# ✅ Counter for cumulative values
ampel_requests_total
# ✅ Gauge for current value
ampel_connections_active
# ✅ Histogram for distributions
ampel_request_duration_seconds
# ✅ Good: Low cardinality labels
{method="GET", status="200"}
# ❌ Bad: High cardinality (unbounded)
{user_id="550e8400-...", email="user@example.com"}
Rule: Keep unique label combinations <10,000 per metric
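A quick way to apply this rule before shipping a metric is to multiply the number of distinct values each label can take. The sketch below does that. `worst_case_series` and `within_budget` are hypothetical helpers, and the label counts in the usage note are illustrative assumptions, not measurements.

```rust
/// Worst-case number of time series for one metric: the product of the
/// number of distinct values each label can take.
/// Saturates instead of overflowing for very large inputs.
pub fn worst_case_series(distinct_values_per_label: &[u64]) -> u64 {
    distinct_values_per_label
        .iter()
        .fold(1u64, |acc, &n| acc.saturating_mul(n))
}

/// True when the worst case stays under `limit` label combinations.
pub fn within_budget(distinct_values_per_label: &[u64], limit: u64) -> bool {
    worst_case_series(distinct_values_per_label) < limit
}
```

With 5 methods, an assumed 40 route templates, and an assumed 12 status codes, the worst case is 2,400 combinations, which fits the rule. Adding a `user_id` label for 50,000 users would multiply that to 120,000,000. For histograms each combination is further multiplied by the number of buckets.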
# ✅ Good: Histogram for latency percentiles
histogram_quantile(0.95,
rate(ampel_request_duration_seconds_bucket[5m])
)
# ❌ Bad: Gauge with average (loses distribution)
ampel_request_duration_seconds_avg
Every custom metric should have:
- Clear description
- Unit in metric name
- Example usage in PromQL
- Alert threshold recommendations
- Prometheus Best Practices
- Prometheus Metric Types
- PromQL Basics
- Ampel Monitoring Guide
- Ampel Observability Guide
Last Updated: 2025-12-22
Maintainer: Ampel Metrics Team