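Prometheus Monitoring: From Deployment to Alerting

Preface

Prometheus is one of the most widely adopted monitoring systems in the cloud-native ecosystem and was the second project to graduate from the CNCF, after Kubernetes. This article covers Prometheus's architecture, metric model, the PromQL query language, alerting, and high-availability options.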

Prometheus Architecture

graph TB
    subgraph Targets["Monitoring targets"]
        App["Application<br>/metrics"]
        Node["Node Exporter"]
        MySQL["MySQL Exporter"]
        K8s["kube-state-metrics"]
    end

    subgraph Prometheus["Prometheus Server"]
        Retrieval["Scraping<br>(Pull)"]
        TSDB["Time-series database<br>(TSDB)"]
        HTTP["HTTP Server<br>(PromQL API)"]
        Rules["Rule engine"]
    end

    subgraph AlertStack["Alerting"]
        AM["Alertmanager"]
        Slack["Slack"]
        PD["PagerDuty"]
        Email["Email"]
    end

    subgraph Discovery["Service discovery"]
        K8sSD["Kubernetes SD"]
        ConsulSD["Consul SD"]
        FileSD["File SD"]
    end

    Discovery --> Retrieval
    Retrieval --> |"scrape"| Targets
    Retrieval --> TSDB
    TSDB --> HTTP
    Rules --> |"evaluate"| TSDB
    Rules --> |"alert"| AM
    AM --> Slack
    AM --> PD
    AM --> Email

    Grafana["Grafana"] --> HTTP
    PushGW["Pushgateway"] --> Retrieval

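Core Features

Multi-dimensional data model: each time series is uniquely identified by its metric name and a set of labels
Pull model: the server actively scrapes metrics from its targets over HTTP
PromQL: a flexible query language built for time-series data
Autonomy: no dependency on external storage; a single node can run on its own
Service discovery: monitoring targets are discovered automatically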
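
To make the multi-dimensional data model concrete, here is a minimal sketch in Go using only the standard library. The helper `seriesKey` is hypothetical (it is not part of any Prometheus library); it simply renders a metric name plus its labels the way they appear in the text exposition format, to show that changing any label value yields a different time series.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey is a hypothetical helper: it renders a metric name and its labels
// as name{k1="v1",k2="v2"} with label names sorted, so the same label set
// always produces the same identity.
func seriesKey(name string, labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for k := range labels {
		names = append(names, k)
	}
	sort.Strings(names)

	pairs := make([]string, 0, len(names))
	for _, k := range names {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	ok := seriesKey("http_requests_total", map[string]string{"method": "GET", "status": "200"})
	failed := seriesKey("http_requests_total", map[string]string{"method": "GET", "status": "500"})
	fmt.Println(ok)
	fmt.Println(failed) // same metric name, different label value: a separate series
}
```

Because every distinct label combination is its own series, the number of series grows with the product of label cardinalities, which is why label design matters.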

Metric Types

Prometheus defines four core metric types:

graph TB
    subgraph MetricTypes["Four metric types"]
        Counter["Counter<br>Only increases (resets to zero on restart)<br>e.g. http_requests_total"]
        Gauge["Gauge<br>Can go up or down<br>e.g. temperature, memory_usage"]
        Histogram["Histogram<br>Distribution counted in buckets<br>e.g. request_duration_seconds"]
        Summary["Summary<br>Quantiles computed on the client<br>e.g. request_duration_quantile"]
    end

Counter

// Defining a Counter in a Go application
import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
)

var httpRequestsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests",
    },
    []string{"method", "path", "status"},
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
}

func handleRequest(w http.ResponseWriter, r *http.Request) {
    // Business logic...
    // Note: raw URL paths can inflate label cardinality; prefer a route template.
    httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
}
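
Because a counter only ever increases, a drop between two consecutive samples can only mean the process restarted and the counter began again from zero. The sketch below illustrates how a per-second rate can be derived with that rule. It is a simplified illustration, not Prometheus's implementation: the `sample` type and `simpleRate` function are hypothetical, and the extrapolation to window boundaries that PromQL's `rate()` performs is omitted.

```go
package main

import "fmt"

// sample is one scraped counter value at a timestamp in seconds.
type sample struct {
	ts    float64
	value float64
}

// simpleRate returns the average per-second increase across the samples.
// A drop in value is treated as a counter reset, so the value seen after the
// reset counts in full as new increase.
func simpleRate(samples []sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(samples); i++ {
		prev, cur := samples[i-1].value, samples[i].value
		if cur < prev {
			increase += cur // reset detected: the counter restarted from zero
		} else {
			increase += cur - prev
		}
	}
	elapsed := samples[len(samples)-1].ts - samples[0].ts
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}

func main() {
	// The counter resets between t=30 and t=45.
	samples := []sample{{0, 100}, {15, 130}, {30, 160}, {45, 20}, {60, 50}}
	fmt.Printf("%.2f req/s\n", simpleRate(samples))
}
```

This is also why graphing or alerting on a raw counter value is rarely useful: the meaningful signal is its rate of change.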

Gauge

var memoryUsageBytes = prometheus.NewGauge(
    prometheus.GaugeOpts{
        Name: "app_memory_usage_bytes",
        Help: "Current memory usage in bytes",
    },
)

var activeConnections = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "app_active_connections",
        Help: "Number of active connections",
    },
    []string{"pool"},
)

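// updateMetrics needs the "runtime" import; register both gauges with
// prometheus.MustRegister before the first scrape.
func updateMetrics() {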
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    memoryUsageBytes.Set(float64(m.Alloc))
    activeConnections.WithLabelValues("db").Set(42)
}

Histogram

var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request duration in seconds",
        Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
    },
    []string{"method", "path"},
)

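// Needs the "time" import. In a real program this would replace the earlier
// handleRequest rather than sit beside it, since the names collide.
func handleRequest(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    // Business logic...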
    duration := time.Since(start).Seconds()
    requestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
}

A Histogram automatically produces three kinds of time series:

# Cumulative count per bucket
http_request_duration_seconds_bucket{le="0.01"} 24054
http_request_duration_seconds_bucket{le="0.025"} 35782
http_request_duration_seconds_bucket{le="0.05"} 48923
http_request_duration_seconds_bucket{le="0.1"} 52109
http_request_duration_seconds_bucket{le="+Inf"} 53271

# Total number of observations
http_request_duration_seconds_count 53271

# Sum of all observed values
http_request_duration_seconds_sum 2876.43
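
The bucket counts are cumulative: each `le` bucket counts every observation less than or equal to its upper bound, so the `+Inf` bucket always equals `_count`. The following sketch shows that behavior with a hypothetical `histogram` type written for illustration; the real client library may store counts differently internally and only accumulate them at exposition time.

```go
package main

import (
	"fmt"
	"math"
)

// histogram is a simplified, hypothetical stand-in for a client-side histogram.
type histogram struct {
	upperBounds []float64 // sorted bucket upper bounds; the last one is +Inf
	buckets     []uint64  // cumulative counts, one per upper bound
	count       uint64
	sum         float64
}

func newHistogram(bounds []float64) *histogram {
	ub := append(append([]float64{}, bounds...), math.Inf(1))
	return &histogram{upperBounds: ub, buckets: make([]uint64, len(ub))}
}

// observe increments every bucket whose upper bound is >= v, which is what
// makes the exposed _bucket series cumulative.
func (h *histogram) observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.buckets[i]++
		}
	}
	h.count++
	h.sum += v
}

func main() {
	h := newHistogram([]float64{0.01, 0.025, 0.05, 0.1})
	for _, v := range []float64{0.004, 0.02, 0.03, 0.2} {
		h.observe(v)
	}
	for i, ub := range h.upperBounds {
		fmt.Printf("le=%v %d\n", ub, h.buckets[i])
	}
	fmt.Printf("count=%d sum=%.3f\n", h.count, h.sum)
}
```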

PromQL Queries

Basic Queries

# Instant vector selector
http_requests_total{method="GET", status="200"}

# Range vector selector (last 5 minutes)
http_requests_total{method="GET"}[5m]

# Regex match
http_requests_total{status=~"5.."}

# Negative regex match
http_requests_total{path!~"/health.*"}

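Aggregation Functions

# Sum, grouped by status code
sum by (status) (rate(http_requests_total[5m]))

# Average idle CPU fraction per instance (rate() first, because this metric is a counter)
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Top 5 series by memory usage
topk(5, container_memory_usage_bytes)

# Quantiles from a Histogram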
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
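
To see why `histogram_quantile` returns an estimate rather than an exact value, the sketch below reimplements the core idea in Go: locate the bucket that contains the requested rank, then interpolate linearly inside it. This is an illustrative approximation of the documented behavior, not the Prometheus source; the `bucket` type, `quantile`, and `sampleBuckets` are hypothetical names, and the input reuses the raw bucket counts from the sample histogram output shown earlier (in PromQL the inputs would normally be per-second rates, but the math is the same).

```go
package main

import (
	"fmt"
	"math"
)

// bucket mirrors one _bucket series: an upper bound (le) and a cumulative count.
type bucket struct {
	le    float64
	count float64
}

// quantile estimates the q-quantile from cumulative buckets sorted by le,
// with +Inf as the last bucket. It assumes observations are spread evenly
// inside each bucket.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total

	// Find the first bucket whose cumulative count reaches the rank.
	i := 0
	for i < len(buckets)-1 && buckets[i].count < rank {
		i++
	}
	if i == len(buckets)-1 {
		// The rank falls in the +Inf bucket: report the highest finite bound.
		return buckets[len(buckets)-2].le
	}

	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = buckets[i-1].le, buckets[i-1].count
	}
	inBucket := buckets[i].count - below
	if inBucket == 0 {
		return lower
	}
	return lower + (buckets[i].le-lower)*(rank-below)/inBucket
}

// sampleBuckets returns the cumulative counts from the example output above.
func sampleBuckets() []bucket {
	return []bucket{
		{0.01, 24054}, {0.025, 35782}, {0.05, 48923}, {0.1, 52109}, {math.Inf(1), 53271},
	}
}

// approx reports whether a and b differ by at most tol.
func approx(a, b, tol float64) bool { return math.Abs(a-b) <= tol }

func main() {
	b := sampleBuckets()
	fmt.Printf("p50 ~ %.4fs, p99 ~ %.4fs\n", quantile(0.5, b), quantile(0.99, b))
}
```

With this data the 99th percentile lands in the `+Inf` bucket, so the estimate is capped at the largest finite bound (0.1). That is a hint that the bucket layout is too coarse at the tail for the latencies actually observed.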

Common Expressions

# QPS (requests per second)
rate(http_requests_total[5m])

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# CPU utilization (%)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory utilization (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk utilization (%)
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

# P99 request latency
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Predict whether the disk will run out of space within 24 hours (linear extrapolation)
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
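
The `predict_linear` expression above fits a straight line through the samples in the range and extrapolates it forward. A minimal sketch of that idea using ordinary least squares follows. The `point` type and `predictLinear` function are hypothetical, and the sketch extrapolates from the last sample's timestamp, whereas PromQL works relative to the evaluation time.

```go
package main

import "fmt"

// point is one sample: a timestamp in seconds and a value (e.g. free bytes).
type point struct {
	t float64
	v float64
}

// predictLinear fits v = intercept + slope*t by ordinary least squares and
// returns the fitted value `ahead` seconds after the last sample.
// The boolean is false when there are too few samples to fit a line.
func predictLinear(pts []point, ahead float64) (float64, bool) {
	n := float64(len(pts))
	if len(pts) < 2 {
		return 0, false
	}
	var sumT, sumV, sumTV, sumTT float64
	for _, p := range pts {
		sumT += p.t
		sumV += p.v
		sumTV += p.t * p.v
		sumTT += p.t * p.t
	}
	denom := n*sumTT - sumT*sumT
	if denom == 0 {
		return 0, false // all samples share one timestamp
	}
	slope := (n*sumTV - sumT*sumV) / denom
	intercept := (sumV - slope*sumT) / n
	return intercept + slope*(pts[len(pts)-1].t+ahead), true
}

func main() {
	// Free space shrinking by 2 units per second (toy numbers).
	pts := []point{{0, 100}, {1, 98}, {2, 96}, {3, 94}}
	if v, ok := predictLinear(pts, 50); ok {
		fmt.Printf("predicted free space: %.1f (alert if < 0)\n", v)
	}
}
```

A negative prediction means the fitted trend crosses zero before the horizon, which is exactly the condition the `< 0` comparison tests.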

PromQL Data Types and Conversions

graph LR
    IV["Instant Vector"] --> |"aggregation"| IV2["Instant Vector"]
    RV["Range Vector"] --> |"rate/increase"| IV3["Instant Vector"]
    IV --> |"[5m]"| RV
    S["Scalar"] --> |"arithmetic"| IV
    IV --> |"arithmetic"| IV

Service Discovery

Kubernetes Service Discovery

# prometheus.yml
scrape_configs:
  # Scrape kubelet metrics
  - job_name: 'kubelet'
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

  # Scrape Pod metrics (discovered automatically via annotations)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only Pods annotated with prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Custom metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Custom scrape port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add a namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      # Add a Pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Scrape Service metrics
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Example Pod annotations:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8080"
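
The custom-port relabel rule is the least obvious part of the configuration, so here is a small Go sketch of what it does. It assumes, as documented for relabeling, that the source label values are joined with `;` and that the regex is anchored at both ends; the function name `rewriteAddress` and the sample addresses are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// The same pattern as in the relabel rule, anchored at both ends.
var addrRule = regexp.MustCompile(`^(?:([^:]+)(?::\d+)?;(\d+))$`)

// rewriteAddress joins the two source labels with ";" and, on a match,
// returns "$1:$2" (host plus annotated port). Without a match the address is
// returned unchanged, mirroring how the replace action skips non-matches.
func rewriteAddress(address, portAnnotation string) string {
	joined := address + ";" + portAnnotation
	m := addrRule.FindStringSubmatch(joined)
	if m == nil {
		return address
	}
	return m[1] + ":" + m[2]
}

func main() {
	fmt.Println(rewriteAddress("10.0.0.5:9102", "8080")) // discovered port replaced
	fmt.Println(rewriteAddress("10.0.0.5", "8080"))      // port appended
	fmt.Println(rewriteAddress("10.0.0.5:9102", ""))     // no annotation: unchanged
}
```

Note that the `[^:]+` host pattern does not handle IPv6 literals; the sketch inherits that limitation from the pattern it copies.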

Alerting with Alertmanager

Alerting Rules

# alert-rules.yaml
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate: {{ $value | humanizePercentage }}"
          description: "HTTP 5xx error rate over the last 5 minutes exceeded 5%; current value: {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.example.com/runbook/high-error-rate"

      # High response latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency too high: {{ $value }}s"

  - name: infrastructure
    rules:
      # Node running low on memory
      - alert: NodeMemoryHigh
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage on node {{ $labels.instance }} is above 90%"

      # Disk about to fill up
      - alert: DiskSpaceRunningOut
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on node {{ $labels.instance }} is predicted to fill within 24 hours"

      # Pod restarting frequently
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
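
Each rule's `for` clause means the expression must stay true across consecutive evaluations for that long before the alert fires; until then it is only pending. The Go sketch below models that lifecycle for a single alert. The `tracker` type and its fields are hypothetical, and timestamps are plain seconds to keep the example dependency-free.

```go
package main

import "fmt"

type alertState string

const (
	inactive alertState = "inactive"
	pending  alertState = "pending"
	firing   alertState = "firing"
)

// tracker follows one alert across rule evaluations.
type tracker struct {
	forSeconds  int64 // the rule's `for` value, in seconds
	activeSince int64 // when the expression first became true
	state       alertState
}

// evaluate is called once per evaluation cycle with the expression result.
func (t *tracker) evaluate(nowSec int64, exprTrue bool) alertState {
	switch {
	case !exprTrue:
		t.state = inactive // any false evaluation resets the timer
	case t.state != pending && t.state != firing:
		t.activeSince = nowSec
		t.state = pending
	}
	if t.state == pending && nowSec-t.activeSince >= t.forSeconds {
		t.state = firing
	}
	return t.state
}

func main() {
	tr := &tracker{forSeconds: 300} // for: 5m
	for _, step := range []struct {
		at   int64
		cond bool
	}{{0, true}, {60, true}, {300, true}, {360, false}, {420, true}} {
		fmt.Printf("t=%ds expr=%v -> %s\n", step.at, step.cond, tr.evaluate(step.at, step.cond))
	}
}
```

A short dip below the threshold resets the timer, which is how `for` filters out brief spikes at the cost of slower detection.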

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'

# Alert routing
route:
  receiver: default
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # critical alerts go to PagerDuty + Slack
    - match:
        severity: critical
      receiver: critical-alerts
      group_wait: 10s
      repeat_interval: 1h
    # warning alerts go to Slack
    - match:
        severity: warning
      receiver: warning-alerts
      repeat_interval: 4h

# Alert receivers
receivers:
  - name: default
    slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: critical-alerts
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'
        severity: critical
    slack_configs:
      - channel: '#critical-alerts'
        color: '#ff0000'
        title: '[CRITICAL] {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Summary:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Runbook:* {{ .Annotations.runbook_url }}
          {{ end }}

  - name: warning-alerts
    slack_configs:
      - channel: '#warnings'
        color: '#ff9800'
        title: '[WARNING] {{ .GroupLabels.alertname }}'

# Inhibition rules
inhibit_rules:
  # While a critical alert is firing, suppress the warning alert with the same alertname and namespace
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'namespace']
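
The inhibition rule above can be read as a three-part check: the target alert matches `target_match`, some firing alert matches `source_match`, and both carry the same values for every label listed in `equal`. A minimal Go sketch of that logic follows; the `inhibitRule` type and helper names are hypothetical and only exact-match label comparison is modeled.

```go
package main

import "fmt"

type labels map[string]string

// inhibitRule mirrors the shape of the rule above: a source matcher, a target
// matcher, and the label names that must be equal on both alerts.
type inhibitRule struct {
	sourceMatch labels
	targetMatch labels
	equal       []string
}

// matches reports whether the alert carries every label/value pair in want.
func matches(alert, want labels) bool {
	for k, v := range want {
		if alert[k] != v {
			return false
		}
	}
	return true
}

// inhibited reports whether target is muted by any currently firing alert.
func (r inhibitRule) inhibited(target labels, firing []labels) bool {
	if !matches(target, r.targetMatch) {
		return false
	}
	for _, src := range firing {
		if !matches(src, r.sourceMatch) {
			continue
		}
		same := true
		for _, name := range r.equal {
			if src[name] != target[name] {
				same = false
				break
			}
		}
		if same {
			return true
		}
	}
	return false
}

func main() {
	rule := inhibitRule{
		sourceMatch: labels{"severity": "critical"},
		targetMatch: labels{"severity": "warning"},
		equal:       []string{"alertname", "namespace"},
	}
	firing := []labels{{"alertname": "HighLatency", "namespace": "prod", "severity": "critical"}}
	warning := labels{"alertname": "HighLatency", "namespace": "prod", "severity": "warning"}
	fmt.Println("inhibited:", rule.inhibited(warning, firing))
}
```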

sequenceDiagram
    participant Prom as Prometheus
    participant AM as Alertmanager
    participant Route as Routing tree
    participant Recv as Receiver

    Prom->>AM: Send alerts
    AM->>AM: Deduplicate
    AM->>AM: Group
    AM->>Route: Match routes
    Route->>AM: Check inhibition rules
    Route->>AM: Check silences
    AM->>Recv: Send notification
    Recv->>Recv: Slack/PagerDuty/Email
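
The grouping step is what turns many related alerts into one notification. As a rough illustration, the sketch below buckets alerts by the values of the `group_by` labels from the routing configuration. The function names are hypothetical and the sketch ignores the timing controls (`group_wait`, `group_interval`).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// groupKey builds a stable key from the values of the group_by labels.
func groupKey(alert map[string]string, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, name := range groupBy {
		parts = append(parts, name+"="+alert[name])
	}
	return strings.Join(parts, ",")
}

// groupAlerts buckets alerts so that each bucket becomes one notification.
func groupAlerts(alerts []map[string]string, groupBy []string) map[string][]map[string]string {
	groups := make(map[string][]map[string]string)
	for _, a := range alerts {
		k := groupKey(a, groupBy)
		groups[k] = append(groups[k], a)
	}
	return groups
}

func main() {
	alerts := []map[string]string{
		{"alertname": "HighLatency", "namespace": "prod", "pod": "api-1"},
		{"alertname": "HighLatency", "namespace": "prod", "pod": "api-2"},
		{"alertname": "HighLatency", "namespace": "staging", "pod": "api-1"},
	}
	groups := groupAlerts(alerts, []string{"alertname", "namespace"})

	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output order
	for _, k := range keys {
		fmt.Printf("%s -> %d alert(s)\n", k, len(groups[k]))
	}
}
```

Grouping on fewer labels produces fewer, larger notifications; grouping on a high-cardinality label such as `pod` would bring back the flood that grouping is meant to prevent.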

Recording Rules

Recording rules precompute frequently used queries and store the results as new time series, which speeds up later queries:

groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Precompute QPS
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Precompute error ratio
      - record: job:http_error_rate:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # Precompute P99 latency
      - record: job:http_request_duration:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      # Precompute node CPU utilization
      - record: instance:node_cpu_utilisation:rate5m
        expr: |
          100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

High-Availability Options

Federation

graph TB
    subgraph DC1["Data center 1"]
        P1["Prometheus (Local)"]
        T1["Targets"]
        P1 --> T1
    end

    subgraph DC2["Data center 2"]
        P2["Prometheus (Local)"]
        T2["Targets"]
        P2 --> T2
    end

    GlobalP["Global Prometheus"] --> |"federation"| P1
    GlobalP --> |"federation"| P2
    Grafana["Grafana"] --> GlobalP

Thanos

Thanos adds a global query view, long-term storage, and high availability on top of Prometheus:

graph TB
    subgraph Cluster1["Cluster 1"]
        P1["Prometheus"]
        S1["Thanos Sidecar"]
        P1 --> S1
    end

    subgraph Cluster2["Cluster 2"]
        P2["Prometheus"]
        S2["Thanos Sidecar"]
        P2 --> S2
    end

    S1 --> |"upload"| ObjStore["Object storage<br>(S3/GCS)"]
    S2 --> |"upload"| ObjStore

    Query["Thanos Query"] --> S1
    Query --> S2
    Query --> Store["Thanos Store Gateway"]
    Store --> ObjStore

    Compact["Thanos Compactor"] --> ObjStore
    Grafana["Grafana"] --> Query

# Thanos Sidecar configuration (runs as a sidecar next to the Prometheus container)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
spec:
  template:
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.50.0
          args:
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.path=/prometheus
            - --storage.tsdb.min-block-duration=2h
            - --storage.tsdb.max-block-duration=2h
            - --web.enable-lifecycle
        - name: thanos-sidecar
          image: thanosio/thanos:v0.34.0
          args:
            - sidecar
            - --tsdb.path=/prometheus
            - --prometheus.url=http://localhost:9090
            - --objstore.config-file=/etc/thanos/objstore.yml
          volumeMounts:
            - name: prometheus-data
              mountPath: /prometheus
            - name: thanos-config
              mountPath: /etc/thanos

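Summary

Key takeaways for a Prometheus monitoring stack:

Metric design: follow naming conventions, choose the right metric type, and keep label cardinality under control
PromQL: get comfortable with core functions such as rate, increase, and histogram_quantile
Service discovery: use Kubernetes SD to discover monitoring targets automatically
Alert severity: split alerts into critical and warning, and use inhibition and silences to cut alert noise
Recording Rules: precompute frequently run queries to speed up dashboards
High availability: use Thanos or VictoriaMetrics for a global view and long-term storage
Observability: pair Prometheus with Grafana, Loki, and Tempo to build a complete observability platform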

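Lessons Learned the Hard Way

After our microservices grew past 80 services, we were receiving about 200 alert emails a day, and the on-call engineers had set those emails to auto-archive. One real outage lasted 40 minutes before anyone noticed, and we only found out through user complaints.

Root cause: every alerting rule was a resource threshold such as "CPU > 80%" or "memory > 85%". These carried no business meaning and produced a high false-positive rate. CPU at 82% is not necessarily a problem, but an API error rate above 1% means something has gone wrong.

We redesigned the alerting scheme: only SLO metrics (API error rate > 1%, P99 > 500ms, availability < 99.9%) trigger critical alerts. Every resource alert was downgraded to warning and is shown on dashboards only, with no notification sent. We also added a runbook_url to every alert so the on-call engineer knows what to do when one arrives.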
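
The severity policy described above can be written down as a small decision function. The Go sketch below is only an illustration of that policy, using the thresholds quoted in the text; the `sloSnapshot` type and `severity` function are hypothetical and are not taken from the actual rule set.

```go
package main

import "fmt"

// sloSnapshot holds the three SLO signals named in the text.
type sloSnapshot struct {
	errorRate    float64 // fraction of failed requests; 0.01 means 1%
	p99Seconds   float64 // P99 latency in seconds
	availability float64 // fraction; 0.999 means 99.9%
}

// severity returns "critical" only when a user-facing SLO is breached.
// Resource pressure alone is capped at "warning" (dashboard only, no page).
func severity(s sloSnapshot, cpuUtil, memUtil float64) string {
	if s.errorRate > 0.01 || s.p99Seconds > 0.5 || s.availability < 0.999 {
		return "critical"
	}
	if cpuUtil > 0.80 || memUtil > 0.85 {
		return "warning"
	}
	return "ok"
}

func main() {
	healthy := sloSnapshot{errorRate: 0.002, p99Seconds: 0.2, availability: 0.9995}
	fmt.Println(severity(healthy, 0.82, 0.50)) // busy CPU, healthy SLOs
	degraded := sloSnapshot{errorRate: 0.02, p99Seconds: 0.2, availability: 0.9995}
	fmt.Println(severity(degraded, 0.10, 0.10)) // idle host, failing requests
}
```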

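In the first month after the change, daily alert volume dropped from about 200 emails to 8 or fewer, and the on-call engineers finally started reading alerts carefully.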

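Measured Results

Metric	Before	After
Daily alert volume	200+ emails	< 10 emails
Alert false-positive rate	~85%	< 5%
MTTD (mean time to detect)	40 min (via user complaints)	< 3 min
SLO coverage of core paths	0%	100%
On-call response rate	~20% (most alerts ignored)	~95%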

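My Take

The point of monitoring is to let people make decisions, not to collect every piece of data. Many teams run monitoring systems with huge data volumes that turn out to be useless during an incident, because nobody can find the useful signal: every metric is moving and it is unclear which one to look at.

My current advice: define your SLOs first, then work backwards to the metrics you need, rather than collecting everything and figuring out later how to use it. Only what directly affects users is critical; everything else gets downgraded. Alert fatigue is more dangerous than having no alerts at all, because it buries real incidents in noise.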