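Introduction
Prometheus is arguably the most popular monitoring system of the cloud-native era and was the second project to graduate from the CNCF, after Kubernetes. This article walks through its architecture, metric model, the PromQL query language, alerting, and high-availability options.
Prometheus Architecture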
graph TB
subgraph Targets["Monitored targets"]
App["Application<br>/metrics"]
Node["Node Exporter"]
MySQL["MySQL Exporter"]
K8s["kube-state-metrics"]
end
subgraph Prometheus["Prometheus Server"]
Retrieval["Scraping<br>(Pull)"]
TSDB["Time-series database<br>(TSDB)"]
HTTP["HTTP Server<br>(PromQL API)"]
Rules["Rule engine"]
end
subgraph AlertStack["Alerting"]
AM["Alertmanager"]
Slack["Slack"]
PD["PagerDuty"]
Email["Email"]
end
subgraph Discovery["Service discovery"]
K8sSD["Kubernetes SD"]
ConsulSD["Consul SD"]
FileSD["File SD"]
end
Discovery --> Retrieval
Retrieval --> |"scrape"| Targets
Retrieval --> TSDB
TSDB --> HTTP
Rules --> |"evaluate"| TSDB
Rules --> |"alert"| AM
AM --> Slack
AM --> PD
AM --> Email
Grafana["Grafana"] --> HTTP
Retrieval --> |"scrape"| PushGW["Pushgateway"]
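Core Features
Multi-dimensional data model: every time series is uniquely identified by a metric name plus a set of labels
Pull model: the server actively scrapes metrics from its targets
PromQL: a powerful, flexible query language
Autonomy: no dependency on external storage; a single node is enough to run it
Service discovery: monitoring targets are discovered automatically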
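To make the first bullet concrete, here is a minimal sketch of how a metric name plus a label set identifies one series. The `seriesKey` helper is hypothetical and uses only the standard library; it is not how Prometheus stores series internally, only an illustration of the identity rule.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey (hypothetical helper) renders a metric name and its labels in a
// canonical form. Labels are sorted by name so that the same label set always
// produces the same key, regardless of map iteration order.
func seriesKey(name string, labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for k := range labels {
		names = append(names, k)
	}
	sort.Strings(names)

	pairs := make([]string, 0, len(names))
	for _, k := range names {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	a := seriesKey("http_requests_total", map[string]string{"method": "GET", "status": "200"})
	b := seriesKey("http_requests_total", map[string]string{"status": "200", "method": "GET"})
	c := seriesKey("http_requests_total", map[string]string{"method": "GET", "status": "500"})
	fmt.Println(a)
	fmt.Println("same series:", a == b)      // identical label sets
	fmt.Println("different series:", a != c) // one label value differs
}
```

Every distinct label value creates a new series, which is why label cardinality matters so much for storage and query cost.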
Metric Types
Prometheus defines four core metric types:
graph TB
subgraph MetricTypes["Four metric types"]
Counter["Counter<br>Only increases (resets on restart)<br>e.g. http_requests_total"]
Gauge["Gauge<br>Can go up or down<br>e.g. temperature, memory_usage"]
Histogram["Histogram<br>Distribution counted in buckets<br>e.g. request_duration_seconds"]
Summary["Summary<br>Quantiles computed on the client<br>e.g. request_duration_quantile"]
end
Counter
    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
    )

    // Counter partitioned by HTTP method, path, and status code.
    var httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "path", "status"},
    )

    func init() {
        prometheus.MustRegister(httpRequestsTotal)
    }

    func handleRequest(w http.ResponseWriter, r *http.Request) {
        // In real code, prefer a route template over the raw path so that
        // label cardinality does not grow with every distinct URL.
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
    }
Gauge
    import (
        "runtime"

        "github.com/prometheus/client_golang/prometheus"
    )

    // Gauge without labels: current heap allocation.
    var memoryUsageBytes = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "app_memory_usage_bytes",
            Help: "Current memory usage in bytes",
        },
    )

    // Gauge with a "pool" label: one series per connection pool.
    var activeConnections = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "app_active_connections",
            Help: "Number of active connections",
        },
        []string{"pool"},
    )

    func updateMetrics() {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        memoryUsageBytes.Set(float64(m.Alloc))
        activeConnections.WithLabelValues("db").Set(42) // placeholder value
    }
Histogram
    import (
        "net/http"
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    var requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
        },
        []string{"method", "path"},
    )

    func handleRequest(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        // ... handle the request here ...
        duration := time.Since(start).Seconds()
        requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
    }
A Histogram automatically exposes three kinds of time series:
    # Cumulative count per bucket (each bucket includes all smaller ones)
    http_request_duration_seconds_bucket{le="0.01"} 24054
    http_request_duration_seconds_bucket{le="0.025"} 35782
    http_request_duration_seconds_bucket{le="0.05"} 48923
    http_request_duration_seconds_bucket{le="0.1"} 52109
    http_request_duration_seconds_bucket{le="+Inf"} 53271

    # Total number of observations
    http_request_duration_seconds_count 53271

    # Sum of all observed values
    http_request_duration_seconds_sum 2876.43
PromQL Queries
Basic Queries
    # Instant vector selector
    http_requests_total{method="GET", status="200"}

    # Range vector selector (last 5 minutes)
    http_requests_total{method="GET"}[5m]

    # Regex match
    http_requests_total{status=~"5.."}

    # Negative regex match
    http_requests_total{path!~"/health.*"}
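One detail that is easy to miss: PromQL regex matchers are fully anchored, so `status=~"5.."` matches `500` but not `5000`. Prometheus uses RE2 syntax, the same syntax as Go's regexp package, so the behavior can be sketched with the standard library. `matchesLabel` is a hypothetical helper, not a Prometheus API.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesLabel (hypothetical helper) reports whether value satisfies a
// PromQL-style =~ matcher. The pattern is wrapped in ^(?:...)$ because the
// match has to cover the whole label value.
func matchesLabel(pattern, value string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(value)
}

func main() {
	for _, status := range []string{"500", "503", "404", "5000"} {
		fmt.Printf("status=%q matches 5..: %v\n", status, matchesLabel("5..", status))
	}
	// A negative matcher (!~) is the inverse of the same check.
	fmt.Println("/healthz kept by path!~\"/health.*\":", !matchesLabel("/health.*", "/healthz"))
}
```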
Aggregation Functions
    # Sum, grouped by status code
    sum by (status) (rate(http_requests_total[5m]))

    # Average, grouped by instance (take rate() first: the metric is a counter)
    avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

    # Top 5 by memory usage
    topk(5, container_memory_usage_bytes)

    # Percentiles (Histogram)
    histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
    histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
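As a mental model for `sum by (status) (...)`: the samples are grouped by the listed label, every other label is dropped, and the values in each group are added up. A rough sketch with hypothetical `sample` and `sumBy` names:

```go
package main

import "fmt"

// sample is one element of an instant vector: a label set and a value.
type sample struct {
	labels map[string]string
	value  float64
}

// sumBy groups samples by the value of a single label and sums each group.
// All other labels disappear from the result, as with `sum by (label)`.
func sumBy(label string, samples []sample) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		out[s.labels[label]] += s.value
	}
	return out
}

func main() {
	// Made-up per-series request rates.
	vector := []sample{
		{map[string]string{"method": "GET", "status": "200"}, 1.5},
		{map[string]string{"method": "POST", "status": "200"}, 2.5},
		{map[string]string{"method": "GET", "status": "500"}, 0.25},
	}
	for status, total := range sumBy("status", vector) {
		fmt.Printf("status=%s total=%v\n", status, total)
	}
}
```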
Common Expressions
    # QPS (requests per second)
    rate(http_requests_total[5m])

    # Error rate (%)
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
    * 100

    # CPU utilization (%)
    100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

    # Memory utilization (%)
    (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

    # Disk utilization (%)
    (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

    # P99 request latency
    histogram_quantile(0.99,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
    )

    # Will the disk be full within 24 hours? (linear extrapolation from the last 6h)
    predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
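Most of these expressions rest on rate(), which has to cope with counters that restart from zero. The sketch below shows the reset handling only. It is deliberately simplified: the real rate() also extrapolates to the edges of the range window, which this version skips. The `point`, `increase`, and `rate` names are my own.

```go
package main

import "fmt"

// point is one raw counter sample.
type point struct {
	t float64 // timestamp in seconds
	v float64 // counter value
}

// increase sums the growth between consecutive samples. When the value drops,
// the counter is assumed to have restarted from zero, so the new value itself
// counts as the growth since the reset.
func increase(points []point) float64 {
	total := 0.0
	for i := 1; i < len(points); i++ {
		delta := points[i].v - points[i-1].v
		if delta < 0 {
			delta = points[i].v
		}
		total += delta
	}
	return total
}

// rate is the per-second average of increase over the sampled interval.
func rate(points []point) float64 {
	if len(points) < 2 {
		return 0
	}
	elapsed := points[len(points)-1].t - points[0].t
	if elapsed <= 0 {
		return 0
	}
	return increase(points) / elapsed
}

func main() {
	// The process restarts between t=15 and t=30, so the counter drops.
	window := []point{{0, 100}, {15, 130}, {30, 10}, {45, 40}}
	fmt.Println("increase:", increase(window))
	fmt.Printf("rate: %.3f/s\n", rate(window))
}
```

This is also why rate() should be applied before sum(): summing raw counters first hides individual resets from the reset detection.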
PromQL Operator Precedence and Types
graph LR
IV["Instant vector"] --> |"aggregation"| IV2["Instant vector"]
RV["Range vector"] --> |"rate/increase"| IV3["Instant vector"]
IV --> |"[5m]"| RV
S["Scalar"] --> |"arithmetic"| IV
IV --> |"arithmetic"| IV
Service Discovery
Kubernetes Service Discovery
    scrape_configs:
      # Scrape the kubelet on every node
      - job_name: 'kubelet'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)

      # Scrape pods that opt in through annotations
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only pods annotated with prometheus.io/scrape: "true"
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          # Custom metrics path
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          # Custom port
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod

      # Scrape service endpoints that opt in through annotations
      - job_name: 'kubernetes-services'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
Example Pod annotations:
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8080"
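The `__address__` rewrite in the kubernetes-pods job is the least obvious relabel rule, so here is a sketch of what it does, using Go's regexp package (the same RE2 syntax). The source label values are joined with `;`, the regex is matched against the whole string, and the replacement is expanded on a match. `rewriteAddress` is a hypothetical helper and the IP addresses are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern and replacement are copied from the kubernetes-pods job. Relabel
// regexes are anchored, so the sketch wraps the pattern in ^(?:...)$.
var addressRe = regexp.MustCompile(`^(?:([^:]+)(?::\d+)?;(\d+))$`)

const replacement = "$1:$2"

// rewriteAddress (hypothetical helper) mimics the replace action on
// __address__. If the regex does not match, the address stays untouched.
func rewriteAddress(address, portAnnotation string) string {
	joined := address + ";" + portAnnotation
	m := addressRe.FindStringSubmatchIndex(joined)
	if m == nil {
		return address
	}
	return string(addressRe.ExpandString(nil, replacement, joined, m))
}

func main() {
	fmt.Println(rewriteAddress("10.0.0.5:9090", "8080")) // annotation overrides the port
	fmt.Println(rewriteAddress("10.0.0.5", "8080"))      // annotation adds a port
	fmt.Println(rewriteAddress("10.0.0.5:9090", ""))     // no annotation: unchanged
}
```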
Alerting with Alertmanager
Alerting Rules
    groups:
      - name: application
        rules:
          - alert: HighErrorRate
            expr: |
              sum(rate(http_requests_total{status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total[5m]))
              > 0.05
            for: 5m
            labels:
              severity: critical
              team: backend
            annotations:
              summary: "High error rate: {{ $value | humanizePercentage }}"
              description: "HTTP 5xx error rate over the last 5 minutes is above 5%, current value: {{ $value | humanizePercentage }}"
              runbook_url: "https://wiki.example.com/runbook/high-error-rate"

          - alert: HighLatency
            expr: |
              histogram_quantile(0.99,
                sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
              ) > 1
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "P99 latency too high: {{ $value }}s"

      - name: infrastructure
        rules:
          - alert: NodeMemoryHigh
            expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Node {{ $labels.instance }} memory usage is above 90%"

          - alert: DiskSpaceRunningOut
            expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "Node {{ $labels.instance }} disk is predicted to fill up within 24 hours"

          - alert: PodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
Alertmanager Configuration
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'

    route:
      receiver: default
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        # Note: newer Alertmanager releases deprecate match/source_match/target_match
        # in favor of matchers; the older keys are kept here as in the original config.
        - match:
            severity: critical
          receiver: critical-alerts
          group_wait: 10s
          repeat_interval: 1h
        - match:
            severity: warning
          receiver: warning-alerts
          repeat_interval: 4h

    receivers:
      - name: default
        slack_configs:
          - channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.summary }} {{ end }}'

      - name: critical-alerts
        pagerduty_configs:
          - service_key: '<pagerduty-service-key>'
            severity: critical
        slack_configs:
          - channel: '#critical-alerts'
            color: '#ff0000'
            title: '[CRITICAL] {{ .GroupLabels.alertname }}'
            text: |
              {{ range .Alerts }}
              *Summary:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              *Runbook:* {{ .Annotations.runbook_url }}
              {{ end }}

      - name: warning-alerts
        slack_configs:
          - channel: '#warnings'
            color: '#ff9800'
            title: '[WARNING] {{ .GroupLabels.alertname }}'

    # A firing critical alert mutes warnings with the same alertname and namespace
    inhibit_rules:
      - source_match:
          severity: critical
        target_match:
          severity: warning
        equal: ['alertname', 'namespace']
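The inhibit rule at the end of the config is easy to misread, so here is a small sketch of its logic as I understand it: a warning is muted only while a critical alert with the same alertname and namespace is firing. The `alert` type, the `inhibited` function, and the namespace values are hypothetical, and the example assumes the same alert name exists at both severities.

```go
package main

import "fmt"

// alert is a flat label set, e.g. {"alertname": "...", "severity": "..."}.
type alert map[string]string

// inhibited applies the single rule from the config: a firing critical alert
// mutes warning alerts that share the same alertname and namespace.
func inhibited(target alert, firing []alert) bool {
	if target["severity"] != "warning" {
		return false
	}
	for _, src := range firing {
		if src["severity"] == "critical" &&
			src["alertname"] == target["alertname"] &&
			src["namespace"] == target["namespace"] {
			return true
		}
	}
	return false
}

func main() {
	firing := []alert{
		{"alertname": "HighErrorRate", "namespace": "shop", "severity": "critical"},
	}
	sameScope := alert{"alertname": "HighErrorRate", "namespace": "shop", "severity": "warning"}
	otherScope := alert{"alertname": "HighErrorRate", "namespace": "billing", "severity": "warning"}
	fmt.Println("same namespace muted:", inhibited(sameScope, firing))
	fmt.Println("other namespace muted:", inhibited(otherScope, firing))
}
```

Because `equal` includes alertname, this rule does not mute unrelated warnings such as NodeMemoryHigh while HighErrorRate is firing. Whether that is the intent depends on how the alerts are named.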
sequenceDiagram
participant Prom as Prometheus
participant AM as Alertmanager
participant Route as Routing tree
participant Recv as Receiver
Prom->>AM: Send alerts
AM->>AM: Deduplicate
AM->>AM: Group
AM->>Route: Match routes
Route->>AM: Check inhibition rules
Route->>AM: Check silences
AM->>Recv: Send notification
Recv->>Recv: Slack/PagerDuty/Email
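The routing and grouping steps can be sketched in a few lines. This is a rough model of the config shown earlier, not Alertmanager's implementation: child routes are tried in order, the first match wins (there is no `continue: true` in that config), and alerts that share the `group_by` label values end up in one notification. `pickReceiver` and `groupKey` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// route is one child of the root route: an equality match and a receiver.
type route struct {
	match    map[string]string
	receiver string
}

// pickReceiver returns the receiver of the first child route whose match
// labels all equal the alert's labels, or the root receiver otherwise.
func pickReceiver(labels map[string]string, rootReceiver string, children []route) string {
	for _, r := range children {
		matched := true
		for k, v := range r.match {
			if labels[k] != v {
				matched = false
				break
			}
		}
		if matched {
			return r.receiver
		}
	}
	return rootReceiver
}

// groupKey builds the key that decides which alerts are batched together.
func groupKey(labels map[string]string, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, name := range groupBy {
		parts = append(parts, name+"="+labels[name])
	}
	return strings.Join(parts, ",")
}

func main() {
	children := []route{
		{map[string]string{"severity": "critical"}, "critical-alerts"},
		{map[string]string{"severity": "warning"}, "warning-alerts"},
	}
	a := map[string]string{"alertname": "HighErrorRate", "namespace": "shop", "severity": "critical"}
	fmt.Println(pickReceiver(a, "default", children))
	fmt.Println(groupKey(a, []string{"alertname", "namespace"}))
}
```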
Recording Rules
Recording rules precompute frequently used queries to improve query performance:
    groups:
      - name: recording_rules
        interval: 30s
        rules:
          # Request rate per job
          - record: job:http_requests:rate5m
            expr: sum by (job) (rate(http_requests_total[5m]))

          # Error ratio per job
          - record: job:http_error_rate:ratio5m
            expr: |
              sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
              /
              sum by (job) (rate(http_requests_total[5m]))

          # P99 latency per job
          - record: job:http_request_duration:p99_5m
            expr: |
              histogram_quantile(0.99,
                sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
              )

          # CPU utilization per instance (%)
          - record: instance:node_cpu_utilisation:rate5m
            expr: |
              100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
High Availability
Federation
graph TB
subgraph DC1["Data center 1"]
P1["Prometheus (Local)"]
T1["Targets"]
P1 --> T1
end
subgraph DC2["Data center 2"]
P2["Prometheus (Local)"]
T2["Targets"]
P2 --> T2
end
GlobalP["Global Prometheus"] --> |"federation"| P1
GlobalP --> |"federation"| P2
Grafana["Grafana"] --> GlobalP
Thanos
Thanos adds a global query view, long-term storage, and high availability on top of Prometheus:
graph TB
subgraph Cluster1["Cluster 1"]
P1["Prometheus"]
S1["Thanos Sidecar"]
P1 --> S1
end
subgraph Cluster2["Cluster 2"]
P2["Prometheus"]
S2["Thanos Sidecar"]
P2 --> S2
end
S1 --> |"upload"| ObjStore["Object storage<br>(S3/GCS)"]
S2 --> |"upload"| ObjStore
Query["Thanos Query"] --> S1
Query --> S2
Query --> Store["Thanos Store Gateway"]
Store --> ObjStore
Compact["Thanos Compactor"] --> ObjStore
Grafana["Grafana"] --> Query
An abridged StatefulSet that runs the Thanos sidecar next to Prometheus:
    # Abridged: selector, serviceName, and volume definitions are omitted
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: prometheus
    spec:
      template:
        spec:
          containers:
            - name: prometheus
              image: prom/prometheus:v2.50.0
              args:
                - --config.file=/etc/prometheus/prometheus.yml
                - --storage.tsdb.path=/prometheus
                # Equal min/max block duration disables local compaction,
                # so the sidecar can upload finished 2h blocks
                - --storage.tsdb.min-block-duration=2h
                - --storage.tsdb.max-block-duration=2h
                - --web.enable-lifecycle
            - name: thanos-sidecar
              image: thanosio/thanos:v0.34.0
              args:
                - sidecar
                - --tsdb.path=/prometheus
                - --prometheus.url=http://localhost:9090
                - --objstore.config-file=/etc/thanos/objstore.yml
              volumeMounts:
                - name: prometheus-data
                  mountPath: /prometheus
                - name: thanos-config
                  mountPath: /etc/thanos
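Summary
Key takeaways for a Prometheus-based monitoring stack:
Metric design: follow the naming conventions, pick the right metric type, and keep label cardinality under control
PromQL: master core functions such as rate, increase, and histogram_quantile
Service discovery: use Kubernetes SD to discover monitoring targets automatically
Alert severity: split alerts into critical and warning, and use inhibition and silences to cut alert noise
Recording rules: precompute high-frequency queries to speed up dashboards
High availability: use Thanos or VictoriaMetrics for a global view and long-term storage
Observability: combine with Grafana, Loki, and Tempo to build a complete observability platform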
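Lessons from Production
After we grew to more than 80 microservices, we were receiving about 200 alert emails a day, and the on-call engineers had set up a filter that archived them automatically. One real outage lasted 40 minutes before anyone noticed, and we only found out because users complained.
Root cause: every alert rule was a resource threshold such as "CPU > 80%" or "memory > 85%". These rules carried no business meaning and had a high false-positive rate. CPU at 82% is not necessarily a problem, but an API error rate above 1% means something has gone wrong.
We redesigned the alerting scheme: only SLO indicators (API error rate > 1%, P99 latency > 500 ms, availability < 99.9%) trigger critical alerts. All resource alerts were downgraded to warning; they are shown on dashboards only and send no notifications. Every alert also got a runbook_url, so the on-call engineer knows what to do when it fires.
In the first month after the change, daily alert emails dropped from 200 to fewer than 8, and the on-call engineers finally started reading alerts carefully.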
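Measured Results
| Metric | Before | After |
| --- | --- | --- |
| Daily alert volume | 200+ emails | < 10 emails |
| Alert false-positive rate | ~85% | < 5% |
| MTTD (mean time to detect) | 40 minutes (via user complaints) | < 3 minutes |
| SLO coverage of critical paths | 0% | 100% |
| On-call response rate | ~20% (most alerts ignored) | ~95% |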
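My Take
The point of monitoring is to help people make decisions, not to collect every piece of data. Many teams run monitoring systems with huge data volumes that turn out to be useless during an incident, because the useful signal cannot be found: every metric is moving and nobody knows which one to look at.
My advice now is to define SLOs first and work backwards to the metrics you need, rather than collecting everything first and figuring out how to use it later. Only what directly affects users is critical; everything else gets downgraded. Alert fatigue is more dangerous than having no alerts at all, because it buries real failures in noise.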