Metrics Integration
Monitor Site Availability with Prometheus metrics and integrate with observability platforms.
Prometheus Metrics Endpoint
The application exposes metrics at /metrics in Prometheus format.
Default Metrics
Application Metrics
# Application availability status
site_availability_up{app="frontend",location="New York"} 1
site_availability_up{app="backend",location="London"} 0
# Scrape duration in seconds
site_availability_scrape_duration_seconds{target="frontend"} 0.142
# Scrape requests total
site_availability_scrape_requests_total{target="frontend",status="success"} 1245
site_availability_scrape_requests_total{target="backend",status="error"} 3
# Last successful scrape timestamp
site_availability_last_scrape_timestamp{target="frontend"} 1638360000
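To consume these lines outside Prometheus (for example in a quick smoke test), each sample can be split into a name, a label set, and a value. The following is a simplified sketch using only the Go standard library; the `Sample` type and `parseSample` function are hypothetical names, and the parser deliberately ignores escaped quotes, commas inside label values, and trailing timestamps.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Sample is one parsed line of Prometheus text exposition output.
type Sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// parseSample parses a single sample line such as
//
//	site_availability_up{app="frontend",location="New York"} 1
//
// Simplified: no escaped quotes, no commas inside label values,
// and no trailing timestamp.
func parseSample(line string) (Sample, error) {
	line = strings.TrimSpace(line)
	// The value is everything after the last space.
	idx := strings.LastIndex(line, " ")
	if idx < 0 {
		return Sample{}, fmt.Errorf("no value in %q", line)
	}
	value, err := strconv.ParseFloat(line[idx+1:], 64)
	if err != nil {
		return Sample{}, fmt.Errorf("bad value in %q: %w", line, err)
	}
	head := strings.TrimSpace(line[:idx])
	s := Sample{Name: head, Labels: map[string]string{}, Value: value}

	open := strings.Index(head, "{")
	if open < 0 {
		return s, nil // metric without labels
	}
	if !strings.HasSuffix(head, "}") {
		return Sample{}, fmt.Errorf("unterminated label set in %q", line)
	}
	s.Name = head[:open]
	body := head[open+1 : len(head)-1]
	if body == "" {
		return s, nil
	}
	for _, pair := range strings.Split(body, ",") {
		k, v, ok := strings.Cut(pair, "=")
		if !ok {
			return Sample{}, fmt.Errorf("bad label %q", pair)
		}
		s.Labels[strings.TrimSpace(k)] = strings.Trim(strings.TrimSpace(v), `"`)
	}
	return s, nil
}
```

For production use, a maintained exposition-format parser is a safer choice than this hand-rolled one.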
HTTP Metrics
# HTTP requests total
http_requests_total{method="GET",path="/api/apps",status="200"} 1245
http_requests_total{method="POST",path="/api/scrape-interval",status="200"} 12
# HTTP request duration
http_request_duration_seconds{method="GET",path="/api/apps"} 0.045
# Active HTTP connections
http_connections_active 5
System Metrics
# Go runtime metrics
go_goroutines 25
go_memstats_alloc_bytes 2.5e+06
go_memstats_gc_duration_seconds 0.001
# Process metrics
process_cpu_seconds_total 12.5
process_resident_memory_bytes 2.5e+07
process_uptime_seconds 3600
Custom Metrics
Business Metrics
# Overall system availability
site_availability_system_uptime_percentage 98.7
# Applications by status
site_availability_apps_by_status{status="up"} 4
site_availability_apps_by_status{status="down"} 1
# Response time percentiles
site_availability_response_time_p50 0.125
site_availability_response_time_p95 0.245
site_availability_response_time_p99 0.389
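The source does not say how the application derives these percentile gauges. One common approach is the nearest-rank method over a window of recent response times, sketched here in Go. The `percentile` function is a hypothetical helper, not part of the application's API.

```go
package main

import (
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 < p <= 100) of samples using the
// nearest-rank method. It returns 0 for an empty slice.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	// Copy so the caller's slice is not reordered.
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	// Nearest rank: the smallest rank covering at least p percent of samples.
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}
```

Precomputed percentile gauges cannot be aggregated across instances; a histogram combined with `histogram_quantile` in PromQL avoids that limitation.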
Location Metrics
# Applications per location
site_availability_location_apps{location="New York"} 3
site_availability_location_apps{location="London"} 2
# Location availability
site_availability_location_uptime{location="New York"} 0.667
site_availability_location_uptime{location="London"} 1.0
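The location uptime values read as the fraction of applications in a location that are currently up: 2 of 3 gives 0.667, and 2 of 2 gives 1.0. A minimal Go sketch of that calculation follows. The `AppStatus` type and `locationUptime` function are hypothetical and only illustrate the arithmetic.

```go
package main

// AppStatus is a hypothetical record of one monitored application.
type AppStatus struct {
	App      string
	Location string
	Up       bool
}

// locationUptime returns, per location, the fraction of applications
// that are currently up.
func locationUptime(apps []AppStatus) map[string]float64 {
	up := map[string]int{}
	total := map[string]int{}
	for _, a := range apps {
		total[a.Location]++
		if a.Up {
			up[a.Location]++
		}
	}
	ratios := make(map[string]float64, len(total))
	for loc, n := range total {
		ratios[loc] = float64(up[loc]) / float64(n)
	}
	return ratios
}
```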
Prometheus Configuration
Scrape Configuration
Add Site Availability Monitoring to your prometheus.yml:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "site-availability"
    static_configs:
      - targets: ["site-availability:8080"]
    scrape_interval: 30s
    metrics_path: "/metrics"
    scheme: "http"
    # Optional: Add labels
    relabel_configs:
      - target_label: "service"
        replacement: "site-availability"
      - target_label: "environment"
        replacement: "production"
Service Discovery
For Kubernetes deployments:
scrape_configs:
  - job_name: "site-availability-k8s"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ["monitoring"]
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: site-availability-backend
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
Recording Rules
Create recording rules for common queries:
# recording-rules.yml
groups:
  - name: site_availability_rules
    interval: 30s
    rules:
      # Overall system availability (averaged across all applications)
      - record: site_availability:system:uptime_5m
        expr: avg(avg_over_time(site_availability_up[5m]))
      # Application availability by location
      - record: site_availability:location:uptime_5m
        expr: avg by (location) (avg_over_time(site_availability_up[5m]))
      # Response time moving average
      - record: site_availability:response_time:avg_5m
        expr: avg_over_time(site_availability_scrape_duration_seconds[5m])
      # Error rate (failed scrapes per second)
      - record: site_availability:error_rate_5m
        expr: rate(site_availability_scrape_requests_total{status="error"}[5m])
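For intuition, `avg_over_time` averages the samples of each series separately over the window, so a single system-wide figure needs a second average across series. The Go sketch below mirrors that two-step calculation. The function names are hypothetical, and returning 0 for empty input is a simplification (PromQL returns no sample in that case).

```go
package main

// avgOverTime mirrors avg_over_time for one series: the mean of the samples
// that fall inside the window. Empty windows yield 0 here for simplicity.
func avgOverTime(window []float64) float64 {
	if len(window) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range window {
		sum += v
	}
	return sum / float64(len(window))
}

// overallAvailability averages the per-series results, which is what an
// outer avg(...) aggregation does across series.
func overallAvailability(series map[string][]float64) float64 {
	if len(series) == 0 {
		return 0
	}
	sum := 0.0
	for _, window := range series {
		sum += avgOverTime(window)
	}
	return sum / float64(len(series))
}
```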
Alerting Rules
Set up alerts for critical conditions:
# alerting-rules.yml
groups:
  - name: site_availability_alerts
    rules:
      # Application down alert
      - alert: ApplicationDown
        expr: site_availability_up == 0
        for: 1m
        labels:
          severity: critical
          service: site-availability
        annotations:
          summary: "Application {{ $labels.app }} is down"
          description: "Application {{ $labels.app }} in {{ $labels.location }} has been down for more than 1 minute"
          runbook_url: "https://docs.example.com/runbooks/app-down"
      # High error rate alert (more than 10% of scrapes failing per target)
      - alert: HighErrorRate
        expr: |
          sum by (target) (rate(site_availability_scrape_requests_total{status="error"}[5m]))
            / sum by (target) (rate(site_availability_scrape_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
          service: site-availability
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for target {{ $labels.target }}"
      # System availability alert
      - alert: LowSystemAvailability
        expr: site_availability:system:uptime_5m < 0.95
        for: 10m
        labels:
          severity: warning
          service: site-availability
        annotations:
          summary: "System availability below threshold"
          description: "System availability is {{ $value | humanizePercentage }}, below 95% threshold"
      # Scraping issues
      - alert: ScrapingDown
        expr: up{job="site-availability"} == 0
        for: 2m
        labels:
          severity: critical
          service: site-availability
        annotations:
          summary: "Site Availability Monitoring scraping is down"
          description: "Prometheus cannot scrape Site Availability Monitoring metrics"
Grafana Integration
Dashboard Configuration
Import the provided dashboard from chart/grafana-dashboards/dashboard.json or create custom dashboards:
Overview Dashboard
{
  "dashboard": {
    "title": "Site Availability Monitoring",
    "panels": [
      {
        "title": "System Overview",
        "type": "stat",
        "targets": [
          {
            "expr": "site_availability:system:uptime_5m * 100",
            "legendFormat": "Availability %"
          }
        ]
      },
      {
        "title": "Applications Status",
        "type": "piechart",
        "targets": [
          {
            "expr": "site_availability_apps_by_status",
            "legendFormat": "{{ status }}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "site_availability_scrape_duration_seconds",
            "legendFormat": "{{ target }}"
          }
        ]
      }
    ]
  }
}
Application Details Dashboard
{
  "dashboard": {
    "title": "Application Details",
    "templating": {
      "list": [
        {
          "name": "app",
          "type": "query",
          "query": "label_values(site_availability_up, app)"
        }
      ]
    },
    "panels": [
      {
        "title": "Availability - $app",
        "type": "graph",
        "targets": [
          {
            "expr": "site_availability_up{app=\"$app\"}",
            "legendFormat": "{{ location }}"
          }
        ]
      }
    ]
  }
}
Grafana Provisioning
Automatically provision dashboards:
# grafana/provisioning/dashboards/site-availability.yml
apiVersion: 1
providers:
  - name: "site-availability"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards/site-availability
Custom Metrics Implementation
Adding New Metrics
// metrics/custom.go
package metrics

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Custom business metric
	ApplicationResponseTime = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "site_availability_app_response_time_seconds",
			Help:    "Application response time in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"app", "location", "status_code"},
	)
	// Configuration change metric
	ConfigurationChanges = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "site_availability_config_changes_total",
			Help: "Total configuration changes",
		},
		[]string{"type", "user"},
	)
	// Cache hit ratio
	CacheHitRatio = promauto.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "site_availability_cache_hit_ratio",
			Help: "Cache hit ratio",
		},
		[]string{"cache_type"},
	)
)

// Usage in application code
func recordMetrics(app, location string, responseTime float64, statusCode int) {
	ApplicationResponseTime.WithLabelValues(
		app,
		location,
		strconv.Itoa(statusCode), // label values must be strings
	).Observe(responseTime)
}
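A histogram such as `ApplicationResponseTime` stores observations as cumulative bucket counts: each observation increments every bucket whose upper bound is greater than or equal to the value, plus the implicit `+Inf` bucket. The sketch below reproduces that counting with the standard library only. `defBuckets` copies the values of `prometheus.DefBuckets`, and `bucketCounts` is a hypothetical helper for illustration, not the library's implementation.

```go
package main

// defBuckets copies the upper bounds (in seconds) of prometheus.DefBuckets.
var defBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

// bucketCounts returns cumulative counts, one per upper bound in bounds,
// followed by the count for the implicit +Inf bucket.
func bucketCounts(observations, bounds []float64) []uint64 {
	counts := make([]uint64, len(bounds)+1)
	for _, v := range observations {
		for i, le := range bounds {
			if v <= le {
				counts[i]++ // cumulative: every bucket at or above v is incremented
			}
		}
		counts[len(bounds)]++ // +Inf always matches
	}
	return counts
}
```

Because the default bounds top out at 10 seconds, slower responses land only in `+Inf`; custom `Buckets` may fit better if probe timeouts are longer.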
Instrumenting Code
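// Example: Instrument HTTP handlers
// Requires net/http and the client_golang prometheus, promauto, and promhttp packages.
// The histogram is created and registered once; a new, unregistered vector
// per handler would never be exported.
var httpRequestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "http_request_duration_seconds",
		Help: "HTTP request duration in seconds",
	},
	// promhttp only accepts "method" and "code" as variable labels,
	// so "path" is curried in per handler below.
	[]string{"method", "path", "code"},
)

func instrumentHandler(path string, handler http.HandlerFunc) http.HandlerFunc {
	return promhttp.InstrumentHandlerDuration(
		httpRequestDuration.MustCurryWith(prometheus.Labels{"path": path}),
		handler,
	)
}

// Example: Instrument scraping operations
// ScrapeDuration (a HistogramVec) and ScrapeRequests (a CounterVec) are
// assumed to be defined elsewhere in the package.
func (s *Scraper) instrumentedScrape(target string) error {
	// The timer observes the elapsed time into the histogram when deferred.
	timer := prometheus.NewTimer(ScrapeDuration.WithLabelValues(target))
	defer timer.ObserveDuration()

	err := s.scrape(target)
	status := "success"
	if err != nil {
		status = "error"
	}
	ScrapeRequests.WithLabelValues(target, status).Inc()
	return err
}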
Observability Best Practices
Metric Naming
Follow Prometheus naming conventions:
- Use _total suffix for counters
- Use _seconds suffix for time durations
- Use descriptive names with units
- Group related metrics with common prefixes
Label Best Practices
- Keep cardinality low (< 1000 unique combinations)
- Use meaningful label names
- Avoid high-cardinality labels (user IDs, request IDs)
- Be consistent across metrics
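The worst-case number of time series for a metric is the product of the distinct values of each label, which is why one unbounded label can overwhelm an otherwise modest metric. A small Go sketch of that estimate follows. `seriesUpperBound` is a hypothetical helper, and the counts in the usage note are illustrative.

```go
package main

// seriesUpperBound returns the worst-case number of series for one metric:
// the product of the number of distinct values per label.
func seriesUpperBound(distinctValues map[string]int) int {
	total := 1
	for _, n := range distinctValues {
		total *= n
	}
	return total
}
```

For example, 5 apps, 2 locations, and 10 status codes give at most 100 series. Adding a `user_id` label with 5,000 values raises the bound to 500,000.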
Performance Considerations
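// Declare the vector once; its options must define exactly the
// two labels used below ("method" and "path").
var httpRequestsCounter = prometheus.NewCounterVec(...)

// knownPaths lists the routes to pre-register (example values).
var knownPaths = []string{"/api/apps", "/api/scrape-interval"}

// Pre-create metric instances for known label combinations so they are
// exported with a zero value from startup.
func init() {
	for _, method := range []string{"GET", "POST", "PUT", "DELETE"} {
		for _, path := range knownPaths {
			httpRequestsCounter.WithLabelValues(method, path)
		}
	}
}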
Monitoring the Monitor
Monitor Site Availability Monitoring itself:
Self-Monitoring Metrics
# Monitor scraping health
up{job="site-availability"} 1
# Monitor rule group evaluation time
prometheus_rule_group_last_duration_seconds{rule_group=~".*site_availability_rules"}
# Monitor TSDB disk usage
prometheus_tsdb_storage_blocks_bytes
Health Checks
# Kubernetes liveness probe
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
# Kubernetes readiness probe
readinessProbe:
  httpGet:
    path: /metrics
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
Together, these metrics, rules, dashboards, and health checks provide observability for both Site Availability Monitoring and the applications it monitors.