
Prometheus Monitoring: From Deployment to Alerting

2024-01-22 · 8 min read · 3.1k words

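Introduction

Prometheus is one of the most widely used monitoring systems in the cloud-native ecosystem and was the second project to graduate from the CNCF, after Kubernetes. This article walks through its architecture, metric model, the PromQL query language, alerting, and high-availability options.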

Prometheus Architecture

graph TB
    subgraph Targets["Monitored targets"]
        App["Application<br>/metrics"]
        Node["Node Exporter"]
        MySQL["MySQL Exporter"]
        K8s["kube-state-metrics"]
    end

    subgraph Prometheus["Prometheus Server"]
        Retrieval["Retrieval<br>(Pull)"]
        TSDB["Time series database<br>(TSDB)"]
        HTTP["HTTP Server<br>(PromQL API)"]
        Rules["Rule engine"]
    end

    subgraph AlertStack["Alerting"]
        AM["Alertmanager"]
        Slack["Slack"]
        PD["PagerDuty"]
        Email["Email"]
    end

    subgraph Discovery["Service discovery"]
        K8sSD["Kubernetes SD"]
        ConsulSD["Consul SD"]
        FileSD["File SD"]
    end

    Discovery --> Retrieval
    Retrieval --> |"scrape"| Targets
    Retrieval --> TSDB
    TSDB --> HTTP
    Rules --> |"evaluate"| TSDB
    Rules --> |"alert"| AM
    AM --> Slack
    AM --> PD
    AM --> Email

    Grafana["Grafana"] --> HTTP
    PushGW["Pushgateway"] --> Retrieval

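Core Features

  • Multi-dimensional data model: each time series is uniquely identified by a metric name and a set of labels
  • Pull model: the server actively scrapes metrics from its targets
  • PromQL: a powerful, flexible query language
  • Autonomy: no dependency on external storage; a single node is enough to run
  • Service discovery: monitoring targets are discovered automatically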
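
To make the multi-dimensional data model concrete, here is a minimal Go sketch using only the standard library. The `seriesKey` helper is hypothetical and is not how Prometheus stores series internally; it only illustrates that a metric name plus a label set identifies one series, that label order is irrelevant, and that any differing label value yields a distinct series.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey is a hypothetical helper: it renders a metric name and label set
// in a canonical form so that equal label sets always map to the same key.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // label order must not affect identity

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	a := seriesKey("http_requests_total", map[string]string{"method": "GET", "status": "200"})
	b := seriesKey("http_requests_total", map[string]string{"status": "200", "method": "GET"})
	c := seriesKey("http_requests_total", map[string]string{"method": "GET", "status": "500"})

	fmt.Println(a)
	fmt.Println("a and b are the same series:", a == b)
	fmt.Println("a and c are the same series:", a == c)
}
```

Because every distinct label combination is its own series, a label with unbounded values (user IDs, raw URLs) multiplies the number of series the server has to track.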

Metric Types

Prometheus defines four core metric types:

graph TB
    subgraph MetricTypes["Four metric types"]
        Counter["Counter<br>Only increases (resets to zero on restart)<br>e.g. http_requests_total"]
        Gauge["Gauge<br>Can go up or down<br>e.g. temperature, memory_usage"]
        Histogram["Histogram<br>Bucketed distribution<br>e.g. request_duration_seconds"]
        Summary["Summary<br>Quantiles computed on the client<br>e.g. request_duration_quantile"]
    end

Counter

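// Define a Counter in a Go application
import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// httpRequestsTotal counts handled HTTP requests by method, path, and status.
var httpRequestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total number of HTTP requests",
	},
	[]string{"method", "path", "status"},
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
	// Business logic...
	// The status is hard-coded for brevity; record the real response code in practice.
	// Raw URL paths can inflate label cardinality; prefer a route template where possible.
	httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
}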

Gauge

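import (
	"runtime"

	"github.com/prometheus/client_golang/prometheus"
)

// memoryUsageBytes reports the bytes of allocated heap objects.
var memoryUsageBytes = prometheus.NewGauge(
	prometheus.GaugeOpts{
		Name: "app_memory_usage_bytes",
		Help: "Current memory usage in bytes",
	},
)

// activeConnections reports open connections per pool.
var activeConnections = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "app_active_connections",
		Help: "Number of active connections",
	},
	[]string{"pool"},
)

func init() {
	prometheus.MustRegister(memoryUsageBytes, activeConnections)
}

// updateMetrics refreshes both gauges; call it periodically.
func updateMetrics() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	memoryUsageBytes.Set(float64(m.Alloc))
	activeConnections.WithLabelValues("db").Set(42) // placeholder value
}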

Histogram

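import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// requestDuration records request latency in seconds by method and path.
var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "http_request_duration_seconds",
		Help: "HTTP request duration in seconds",
		// Same boundaries as prometheus.DefBuckets, spelled out for clarity.
		Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
	},
	[]string{"method", "path"},
)

func init() {
	prometheus.MustRegister(requestDuration)
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	// Business logic...
	duration := time.Since(start).Seconds()
	requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
}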

A Histogram automatically exposes three kinds of time series:

# Cumulative count per bucket
http_request_duration_seconds_bucket{le="0.01"} 24054
http_request_duration_seconds_bucket{le="0.025"} 35782
http_request_duration_seconds_bucket{le="0.05"} 48923
http_request_duration_seconds_bucket{le="0.1"} 52109
http_request_duration_seconds_bucket{le="+Inf"} 53271

# Total number of observations
http_request_duration_seconds_count 53271

# Sum of all observed values
http_request_duration_seconds_sum 2876.43
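
The cumulative nature of the `_bucket` series is easy to miss, so here is a small Go sketch of the bookkeeping. The `histogram` type is a simplified, hypothetical stand-in written for this article; the real client library may store counts differently internally and only needs to expose them in this cumulative form.

```go
package main

import "fmt"

// histogram is a simplified, hypothetical stand-in for a client-side histogram.
type histogram struct {
	upperBounds []float64 // sorted finite bucket bounds (the "le" values)
	buckets     []uint64  // cumulative counts; the last entry is the +Inf bucket
	count       uint64    // exposed as <name>_count
	sum         float64   // exposed as <name>_sum
}

func newHistogram(upperBounds []float64) *histogram {
	return &histogram{
		upperBounds: upperBounds,
		buckets:     make([]uint64, len(upperBounds)+1),
	}
}

// observe increments every bucket whose upper bound is >= v,
// which is what makes the bucket counts cumulative.
func (h *histogram) observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.buckets[i]++
		}
	}
	h.buckets[len(h.upperBounds)]++ // the +Inf bucket counts every observation
	h.count++
	h.sum += v
}

func main() {
	h := newHistogram([]float64{0.01, 0.05, 0.1})
	for _, v := range []float64{0.004, 0.03, 0.03, 0.2} {
		h.observe(v)
	}
	for i, ub := range h.upperBounds {
		fmt.Printf("bucket{le=\"%g\"} %d\n", ub, h.buckets[i])
	}
	fmt.Printf("bucket{le=\"+Inf\"} %d\n", h.buckets[len(h.upperBounds)])
	fmt.Printf("count %d\n", h.count)
	fmt.Printf("sum %g\n", h.sum)
}
```

Note that the `+Inf` bucket always equals `_count`, which matches the sample output above (both are 53271).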

PromQL Queries

Basic Queries

# Instant vector selector
http_requests_total{method="GET", status="200"}

# Range vector selector (last 5 minutes)
http_requests_total{method="GET"}[5m]

# Regex match
http_requests_total{status=~"5.."}

# Negative regex match
http_requests_total{path!~"/health.*"}

Aggregation Functions

# Sum, grouped by status code
sum by (status) (rate(http_requests_total[5m]))

# Average idle CPU fraction per instance (apply rate() first: the metric is a counter)
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Top 5 by memory usage
topk(5, container_memory_usage_bytes)

# Quantiles from a Histogram (one result per label combination)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
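
Since `histogram_quantile` only sees bucket counts, its result is an estimate. The Go sketch below shows the general idea of linear interpolation inside the bucket that contains the target rank. It is a simplified illustration written for this article, not Prometheus's source code; `bucket`, `bucketQuantile`, and `inf` are hypothetical names, and the sketch assumes the input includes a `+Inf` bucket and that the lowest bucket starts at 0.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// inf stands for the le="+Inf" bucket bound.
var inf = math.Inf(1)

type bucket struct {
	upperBound float64 // the "le" label value
	count      float64 // cumulative count (or per-second rate) up to upperBound
}

// bucketQuantile estimates the q-quantile (0 <= q <= 1) from cumulative buckets
// by interpolating linearly inside the bucket that contains the target rank.
func bucketQuantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	sort.Slice(buckets, func(i, j int) bool {
		return buckets[i].upperBound < buckets[j].upperBound
	})
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total

	// Find the first finite bucket whose cumulative count reaches the rank.
	last := len(buckets) - 1
	i := sort.Search(last, func(i int) bool { return buckets[i].count >= rank })
	if i == last {
		// The rank falls into the +Inf bucket: the best available answer
		// is the highest finite bound.
		return buckets[last-1].upperBound
	}

	lower, prev := 0.0, 0.0 // assumes the lowest bucket starts at 0
	if i > 0 {
		lower = buckets[i-1].upperBound
		prev = buckets[i-1].count
	}
	upper := buckets[i].upperBound
	return lower + (upper-lower)*(rank-prev)/(buckets[i].count-prev)
}

func main() {
	// Bucket values taken from the sample exposition earlier in this article.
	sample := []bucket{
		{0.01, 24054}, {0.025, 35782}, {0.05, 48923}, {0.1, 52109}, {inf, 53271},
	}
	fmt.Printf("p50 ~ %.4fs\n", bucketQuantile(0.50, sample))
	fmt.Printf("p99 ~ %.4fs\n", bucketQuantile(0.99, sample))
}
```

With the sample data, the 99th percentile rank lands in the `+Inf` bucket, so the estimate is capped at the highest finite bound (0.1). This is why bucket boundaries should extend past the latency you want to alert on.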

Common Expressions

# QPS (requests per second)
rate(http_requests_total[5m])

# Error rate (percent)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# CPU utilization (percent)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory utilization (percent)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk utilization (percent)
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

# P99 request latency
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Will the disk fill up within 24 hours? (linear prediction over the last 6 hours)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
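
The disk prediction relies on a least-squares line fitted through the samples in the window. As a rough Go sketch of that idea (the `sample` type and `predictLinear` function are hypothetical names for this article, and the sketch extrapolates from the last sample's timestamp rather than from the query evaluation time):

```go
package main

import "fmt"

type sample struct {
	t float64 // timestamp in seconds
	v float64 // observed value
}

// predictLinear fits v = intercept + slope*t by ordinary least squares and
// extrapolates `ahead` seconds past the last sample. It returns false when
// a line cannot be fitted (fewer than two samples or identical timestamps).
func predictLinear(samples []sample, ahead float64) (float64, bool) {
	n := float64(len(samples))
	if n < 2 {
		return 0, false
	}
	t0 := samples[len(samples)-1].t // shift times so the last sample is at 0
	var sumT, sumV, sumTT, sumTV float64
	for _, s := range samples {
		x := s.t - t0
		sumT += x
		sumV += s.v
		sumTT += x * x
		sumTV += x * s.v
	}
	denom := n*sumTT - sumT*sumT
	if denom == 0 {
		return 0, false
	}
	slope := (n*sumTV - sumT*sumV) / denom
	intercept := (sumV - slope*sumT) / n
	return intercept + slope*ahead, true
}

func main() {
	const gib = 1 << 30
	// Hypothetical data: free space shrinks by 1 GiB per hour over 6 hours.
	var samples []sample
	for h := 0; h <= 6; h++ {
		samples = append(samples, sample{t: float64(h * 3600), v: float64((10 - h) * gib)})
	}
	predicted, ok := predictLinear(samples, 24*3600)
	fmt.Println("fit succeeded:", ok)
	fmt.Println("predicted to run out within 24h:", predicted < 0)
}
```

A linear fit is only as good as the window it sees: a one-off log burst inside the 6-hour window can trigger a false prediction, which is one reason to pair the expression with a `for` duration in the alerting rule.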

PromQL Data Types and Conversions

graph LR
    IV["Instant Vector"] --> |"aggregation"| IV2["Instant Vector"]
    RV["Range Vector"] --> |"rate/increase"| IV3["Instant Vector"]
    IV --> |"[5m]"| RV
    S["Scalar"] --> |"arithmetic"| IV
    IV --> |"arithmetic"| IV
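
The step from a range vector to an instant vector is what `rate` performs for each series: it collapses the samples in the window into one per-second value. The Go sketch below is a simplified illustration with hypothetical names (`point`, `simpleRate`). It handles counter resets but omits the extrapolation to the window boundaries that the real function applies, so its numbers will differ slightly from Prometheus's.

```go
package main

import "fmt"

type point struct {
	t float64 // timestamp in seconds
	v float64 // counter value
}

// simpleRate returns the average per-second increase across the points,
// treating any decrease as a counter reset (for example a process restart).
func simpleRate(points []point) (float64, bool) {
	if len(points) < 2 {
		return 0, false // a rate needs at least two samples in the window
	}
	var increase float64
	for i := 1; i < len(points); i++ {
		delta := points[i].v - points[i-1].v
		if delta < 0 {
			// The counter restarted from zero, so the whole new value is increase.
			delta = points[i].v
		}
		increase += delta
	}
	elapsed := points[len(points)-1].t - points[0].t
	if elapsed <= 0 {
		return 0, false
	}
	return increase / elapsed, true
}

func main() {
	// Hypothetical samples: the counter resets between t=30 and t=60.
	window := []point{{0, 100}, {30, 160}, {60, 60}, {90, 120}}
	r, ok := simpleRate(window)
	fmt.Printf("rate=%.2f/s ok=%v\n", r, ok)
}
```

The two-sample minimum is why a range such as `[5m]` should cover several scrape intervals; a window shorter than two scrapes yields no result.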

Service Discovery

Kubernetes Service Discovery

# prometheus.yml
scrape_configs:
  # Scrape kubelet metrics
  - job_name: 'kubelet'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  # Scrape Pod metrics (discovered automatically via annotations)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only Pods annotated with prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Custom metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Custom scrape port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add a namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      # Add a pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Scrape Service metrics
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Example Pod annotations:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8080"
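
The port-rewriting relabel rule is the least obvious part of the scrape configuration, so here is a Go sketch of what it does to a target address. The `relabelReplace` helper is hypothetical and only mimics the `replace` action under stated assumptions: source label values are joined with `;` (the default separator), and the regex must match the whole joined string.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// relabelReplace is a hypothetical helper that mimics a `replace` relabel step.
// It returns false when the regex does not match, in which case the target
// label would be left unchanged.
func relabelReplace(sourceValues []string, pattern, replacement string) (string, bool) {
	joined := strings.Join(sourceValues, ";")
	re, err := regexp.Compile("^(?:" + pattern + ")$") // anchor at both ends
	if err != nil {
		return "", false
	}
	m := re.FindStringSubmatchIndex(joined)
	if m == nil {
		return "", false
	}
	return string(re.ExpandString(nil, replacement, joined, m)), true
}

func main() {
	const pattern = `([^:]+)(?::\d+)?;(\d+)`

	// __address__ as discovered, plus the prometheus.io/port annotation value.
	out, ok := relabelReplace([]string{"10.0.0.5:9090", "8080"}, pattern, "$1:$2")
	fmt.Println(out, ok)

	// No port annotation: the regex does not match, so nothing is rewritten.
	out, ok = relabelReplace([]string{"10.0.0.5:9090", ""}, pattern, "$1:$2")
	fmt.Printf("%q %v\n", out, ok)
}
```

The first capture group keeps the host, the optional non-capturing group discards whatever port discovery found, and the second capture group supplies the port from the annotation.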

Alerting with Alertmanager

Alerting Rules

# alert-rules.yaml
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
          description: "HTTP 5xx error rate over the last 5 minutes is above 5%; current value: {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.example.com/runbook/high-error-rate"

      # High response latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency too high: {{ $value }}s"

  - name: infrastructure
    rules:
      # Node running low on memory
      - alert: NodeMemoryHigh
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} memory usage is above 90%"

      # Disk about to fill up
      - alert: DiskSpaceRunningOut
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} disk is predicted to fill up within 24 hours"

      # Pod restarting frequently
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
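
The `for` field in these rules means the expression has to stay true across evaluations for that long before the alert fires. A minimal Go sketch of that state machine follows; `ruleInstance` and its methods are hypothetical names for this article, and time is modeled as plain seconds to keep the example small.

```go
package main

import "fmt"

type alertState string

const (
	inactive alertState = "inactive"
	pending  alertState = "pending"
	firing   alertState = "firing"
)

// ruleInstance tracks one alert produced by an alerting rule.
type ruleInstance struct {
	holdFor     float64 // the rule's `for` value, in seconds
	state       alertState
	activeSince float64 // time (seconds) at which the expression first became true
}

func newRuleInstance(holdFor float64) *ruleInstance {
	return &ruleInstance{holdFor: holdFor, state: inactive}
}

// evaluate is called once per rule evaluation with the current time and
// whether the rule expression returned a result.
func (r *ruleInstance) evaluate(now float64, exprTrue bool) alertState {
	if !exprTrue {
		r.state = inactive // the condition cleared, so the pending timer resets
		return r.state
	}
	if r.state == inactive {
		r.state = pending
		r.activeSince = now
	}
	if r.state == pending && now-r.activeSince >= r.holdFor {
		r.state = firing
	}
	return r.state
}

func main() {
	r := newRuleInstance(300) // for: 5m
	// One evaluation per minute; the condition clears briefly at t=180s.
	truth := []bool{true, true, true, false, true, true, true, true, true, true}
	for i, ok := range truth {
		now := float64(i * 60)
		fmt.Printf("t=%4.0fs expr=%-5v state=%s\n", now, ok, r.evaluate(now, ok))
	}
}
```

A single evaluation where the expression is false resets the timer, so a flapping condition can stay pending indefinitely. That is usually the intent, but it is worth keeping in mind when choosing the `for` value.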

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'

# Alert routing
# Note: `match` still works, but newer Alertmanager releases deprecate it in favor of `matchers`.
route:
  receiver: default
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # critical alerts go to PagerDuty + Slack
    - match:
        severity: critical
      receiver: critical-alerts
      group_wait: 10s
      repeat_interval: 1h
    # warning alerts go to Slack
    - match:
        severity: warning
      receiver: warning-alerts
      repeat_interval: 4h

# Receivers
receivers:
  - name: default
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: critical-alerts
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'
        severity: critical
    slack_configs:
      - channel: '#critical-alerts'
        color: '#ff0000'
        title: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Summary:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}

  - name: warning-alerts
    slack_configs:
      - channel: '#warnings'
        color: '#ff9800'
        title: '[WARNING] {{ .GroupLabels.alertname }}'

# Inhibition rules
inhibit_rules:
  # While a critical alert is firing, suppress warning alerts with the same alertname and namespace
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'namespace']

sequenceDiagram
    participant Prom as Prometheus
    participant AM as Alertmanager
    participant Route as Routing tree
    participant Recv as Receiver

    Prom->>AM: Send alerts
    AM->>AM: Deduplicate
    AM->>AM: Group
    AM->>Route: Match routes
    Route->>AM: Check inhibition rules
    Route->>AM: Check silences
    AM->>Recv: Send notification
    Recv->>Recv: Slack/PagerDuty/Email
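
To show how the inhibition rule in the configuration decides what to mute, here is a small Go sketch. The `inhibitRule` type and its `inhibited` method are hypothetical names that mirror the three fields of the rule (source match, target match, `equal` labels); the sketch ignores regex matchers and other details of the real implementation.

```go
package main

import "fmt"

type labels map[string]string

// inhibitRule mirrors the shape of one entry under inhibit_rules.
type inhibitRule struct {
	sourceMatch labels   // labels a firing alert needs in order to act as a source
	targetMatch labels   // labels an alert needs in order to be muted
	equal       []string // label names whose values must agree between the two
}

// matches reports whether l carries every label/value pair in want.
func matches(l, want labels) bool {
	for k, v := range want {
		if l[k] != v {
			return false
		}
	}
	return true
}

// inhibited reports whether target is muted by any of the firing alerts.
func (r inhibitRule) inhibited(target labels, firing []labels) bool {
	if !matches(target, r.targetMatch) {
		return false
	}
	for _, src := range firing {
		if !matches(src, r.sourceMatch) {
			continue
		}
		same := true
		for _, name := range r.equal {
			if src[name] != target[name] {
				same = false
				break
			}
		}
		if same {
			return true
		}
	}
	return false
}

func main() {
	rule := inhibitRule{
		sourceMatch: labels{"severity": "critical"},
		targetMatch: labels{"severity": "warning"},
		equal:       []string{"alertname", "namespace"},
	}
	firing := []labels{
		{"alertname": "HighErrorRate", "namespace": "prod", "severity": "critical"},
	}
	prodWarning := labels{"alertname": "HighErrorRate", "namespace": "prod", "severity": "warning"}
	stagingWarning := labels{"alertname": "HighErrorRate", "namespace": "staging", "severity": "warning"}

	fmt.Println("prod warning muted:", rule.inhibited(prodWarning, firing))
	fmt.Println("staging warning muted:", rule.inhibited(stagingWarning, firing))
}
```

The `equal` list is what keeps inhibition scoped: without it, one critical alert anywhere would mute every warning.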

Recording Rules

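Recording rules precompute frequently used queries so that dashboards and alerts read a stored series instead of re-evaluating the full expression: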

groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Precompute QPS
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Precompute error ratio
      - record: job:http_error_rate:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # Precompute P99 latency
      - record: job:http_request_duration:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      # Precompute node CPU utilization (percent)
      - record: instance:node_cpu_utilisation:rate5m
        expr: |
          100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

High Availability

Federation

graph TB
    subgraph DC1["Data center 1"]
        P1["Prometheus (Local)"]
        T1["Targets"]
        P1 --> T1
    end

    subgraph DC2["Data center 2"]
        P2["Prometheus (Local)"]
        T2["Targets"]
        P2 --> T2
    end

    GlobalP["Global Prometheus"] --> |"federation"| P1
    GlobalP --> |"federation"| P2
    Grafana["Grafana"] --> GlobalP

Thanos

Thanos adds a global query view, long-term storage, and high availability on top of Prometheus:

graph TB
    subgraph Cluster1["Cluster 1"]
        P1["Prometheus"]
        S1["Thanos Sidecar"]
        P1 --> S1
    end

    subgraph Cluster2["Cluster 2"]
        P2["Prometheus"]
        S2["Thanos Sidecar"]
        P2 --> S2
    end

    S1 --> |"upload"| ObjStore["Object storage<br>(S3/GCS)"]
    S2 --> |"upload"| ObjStore

    Query["Thanos Query"] --> S1
    Query --> S2
    Query --> Store["Thanos Store Gateway"]
    Store --> ObjStore

    Compact["Thanos Compactor"] --> ObjStore
    Grafana["Grafana"] --> Query

# Thanos Sidecar configuration (runs as a sidecar next to the Prometheus container)
# Fragment only: selector, serviceName, and volume definitions are omitted.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  template:
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.50.0
          args:
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.path=/prometheus
            - --storage.tsdb.min-block-duration=2h
            - --storage.tsdb.max-block-duration=2h
            - --web.enable-lifecycle
        - name: thanos-sidecar
          image: thanosio/thanos:v0.34.0
          args:
            - sidecar
            - --tsdb.path=/prometheus
            - --prometheus.url=http://localhost:9090
            - --objstore.config-file=/etc/thanos/objstore.yml
          volumeMounts:
            - name: prometheus-data
              mountPath: /prometheus
            - name: thanos-config
              mountPath: /etc/thanos

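Summary

Key takeaways for a Prometheus monitoring stack:

  1. Metric design: follow naming conventions, choose the right metric type, and keep label cardinality under control
  2. PromQL: master core functions such as rate, increase, and histogram_quantile
  3. Service discovery: use Kubernetes SD to discover monitoring targets automatically
  4. Alert severity: split alerts into critical and warning, and use inhibition and silences to cut noise
  5. Recording rules: precompute hot queries to speed up dashboards
  6. High availability: use Thanos or VictoriaMetrics for a global view and long-term storage
  7. Observability: pair Prometheus with Grafana, Loki, and Tempo to build a complete observability platform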

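Lessons from the Field

After our microservices grew to more than 80 services, we were receiving about 200 alert emails a day, and the on-call engineers set up a rule to archive them automatically. One real outage then went unnoticed for 40 minutes; we only found out when users complained.

Root cause: every alerting rule was a resource threshold such as "CPU > 80%" or "memory > 85%", which carries no business meaning and produces many false positives. CPU at 82% is not necessarily a problem, but an API error rate above 1% almost always means something is wrong.

We redesigned the alerting: only SLO indicators (API error rate > 1%, P99 > 500 ms, availability < 99.9%) trigger critical alerts. All resource alerts were downgraded to warning and are shown only on dashboards, with no notifications. We also gave every alert a runbook_url so the on-call engineer knows what to do when it fires.

In the first month after the change, daily alerts dropped from 200 to fewer than 8, and the on-call engineers finally started reading them carefully.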

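Measured Results

Metric | Before | After
Daily alert volume | 200+ | < 10
Alert false-positive rate | ~85% | < 5%
MTTD (mean time to detect) | 40 min (via user complaints) | < 3 min
SLO coverage of critical paths | 0% | 100%
On-call response rate | ~20% (most alerts ignored) | ~95%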

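My Take

The point of monitoring is to let people make decisions, not to collect every possible data point. Many teams run monitoring systems with huge volumes of data that turn out to be useless during an incident, because nobody can find the relevant signal: every metric is moving and it is unclear which one to look at.

My current advice: define SLOs first, then work backward to the metrics you need, rather than collecting everything and figuring out how to use it later. Only what directly affects users should be critical; downgrade the rest. Alert fatigue is more dangerous than having no alerts, because it buries real failures in noise.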

Author: zt
License: CC BY-SA 4.0