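Building an Observability Platform with Grafana

Preface

The three pillars of observability are metrics, logs, and traces. The Grafana ecosystem covers all three in one stack: Grafana for visualization, Prometheus for metrics, Loki for logs, and Tempo for traces. This guide walks through building a complete observability platform from these components.

The Three Pillars of Observability

graph TB
    subgraph Observability["Three Pillars of Observability"]
        Metrics["Metrics<br>Prometheus<br>Time series, numeric"]
        Logs["Logs<br>Loki<br>Event records, text"]
        Traces["Traces<br>Tempo<br>Request paths, distributed"]
    end

    App["Application"] --> Metrics
    App --> Logs
    App --> Traces

    Metrics --> Grafana["Grafana<br>Unified visualization"]
    Logs --> Grafana
    Traces --> Grafana

    Grafana --> Alert["Alerting"]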
    Grafana --> Dashboard["Dashboard"]
    Grafana --> Explore["Explore"]

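Configuring Grafana Data Sources

Prometheus Data Source

# Grafana data source configuration (provisioning)
# /etc/grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus   # fixed UID so that Tempo and alert rules can reference this data source
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: 15s
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo

  - name: Loki
    type: loki
    uid: loki         # fixed UID, referenced by tracesToLogs in the Tempo data source
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          # Matches logfmt-style lines that contain traceID=<id>
          matcherRegex: "traceID=(\\w+)"
          name: TraceID
          # "$$" escapes "$" so provisioning does not treat it as an environment variable
          url: "$${__value.raw}"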
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ['service.name', 'namespace']
        mappedTags: [{ key: 'service.name', value: 'app' }]
        filterByTraceID: true
      tracesToMetrics:
        datasourceUid: prometheus
        tags: [{ key: 'service.name', value: 'service' }]
        queries:
          - name: Request Rate
            query: 'rate(http_requests_total{$$__tags}[5m])'
      nodeGraph:
        enabled: true
      serviceMap:
        datasourceUid: prometheus
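
The `matcherRegex` on the Loki data source decides which part of a log line becomes a clickable trace link. The sketch below applies the same pattern, `traceID=(\w+)`, with Go's `regexp` package to show what it would capture. The function name `extractTraceID` and the sample log lines are illustrative assumptions, and Grafana evaluates the pattern with its own regex engine, so this only approximates the matching behavior.

```go
package main

import (
	"fmt"
	"regexp"
)

// traceIDPattern mirrors the matcherRegex from the Loki data source above.
var traceIDPattern = regexp.MustCompile(`traceID=(\w+)`)

// extractTraceID returns the first captured trace ID in a log line,
// or false when the line does not contain one.
func extractTraceID(line string) (string, bool) {
	m := traceIDPattern.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Hypothetical log lines, used only for illustration.
	lines := []string{
		`level=error msg="payment failed" traceID=4bf92f3577b34da6a3ce929d0e0e4736`,
		`level=info msg="health check ok"`,
	}
	for _, line := range lines {
		if id, ok := extractTraceID(line); ok {
			fmt.Println("trace link for:", id)
		} else {
			fmt.Println("no trace ID in line")
		}
	}
}
```

A JSON log line such as `{"traceID":"abc"}` would not match, because the pattern expects the literal text `traceID=`. If the application logs JSON, the regex would need to be adapted to that shape.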

Correlating Data Sources

graph LR
    subgraph Correlation["Data correlation"]
        M["Metrics<br>(Prometheus)"] --> |"exemplar"| T["Traces<br>(Tempo)"]
        T --> |"tracesToLogs"| L["Logs<br>(Loki)"]
        L --> |"derivedFields"| T
        T --> |"tracesToMetrics"| M
    end

    style M fill:#E65100,color:#fff
    style L fill:#1565C0,color:#fff
    style T fill:#2E7D32,color:#fff

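Dashboard Design

Panel Types

Grafana offers many panel types:

Panel type	Purpose	Typical data
Time Series	Time series line chart	QPS, latency, resource usage
Stat	Single value	Current value, totals
Gauge	Gauge	Percentages, utilization
Bar Chart	Bar chart	Comparison across categories
Table	Table	Detailed data listings
Heatmap	Heatmap	Latency distribution
Logs	Log panel	Loki logs
Node Graph	Topology graph	Service dependencies

Dashboard JSON Model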

{
  "dashboard": {
    "title": "Application Overview",
    "tags": ["production", "application"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "templating": {
      "list": [
        {
          "name": "namespace",
          "type": "query",
          "datasource": "Prometheus",
          "query": "label_values(http_requests_total, namespace)",
          "refresh": 2,
          "multi": false,
          "includeAll": true
        },
        {
          "name": "service",
          "type": "query",
          "datasource": "Prometheus",
          "query": "label_values(http_requests_total{namespace=~\"$namespace\"}, service)",
          "refresh": 2,
          "multi": true
        }
      ]
    },
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "sum by (service) (rate(http_requests_total{namespace=~\"$namespace\", service=~\"$service\"}[5m]))",
            "legendFormat": "{{ service }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps",
            "custom": {
              "drawStyle": "line",
              "lineInterpolation": "smooth",
              "fillOpacity": 10
            }
          }
        }
      }
    ]
  }
}

Dashboard Provisioning

# /etc/grafana/provisioning/dashboards/dashboards.yaml
apiVersion: 1
providers:
  - name: default
    orgId: 1
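    folder: ""   # must stay empty when foldersFromFilesStructure is true
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      # Subdirectories of path (for example Production/) become Grafana folders
      foldersFromFilesStructure: true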

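Variables

Variables make a dashboard interactive and reusable:

# Example variable definitions (illustrative shorthand, not a Grafana file format;
# in a real dashboard they live under templating.list in the JSON model)
variables:
  # Query variable: values fetched dynamically from the data source
  - name: cluster
    type: query
    query: "label_values(up, cluster)"

  # Custom variable: predefined options
  - name: interval
    type: custom
    values: "1m,5m,15m,1h"
    current: "5m"

  # Interval variable: time spans with an automatically computed option.
  # It is named "resolution" here because names starting with "__", such as
  # $__rate_interval, are reserved for Grafana's built-in variables.
  - name: resolution
    type: interval
    auto: true
    auto_min: "1m"

  # Chained variable: cascading selection that depends on $namespace
  - name: pod
    type: query
    query: "label_values(kube_pod_info{namespace='$namespace'}, pod)"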
Log Aggregation with Loki

Loki Architecture

graph TB
    subgraph Sources["Log sources"]
        Promtail["Promtail<br>(DaemonSet)"]
        FluentBit["Fluent Bit"]
        OTel["OpenTelemetry<br>Collector"]
    end

    subgraph Loki["Loki"]
        Distributor["Distributor"]
        Ingester["Ingester"]
        Querier["Querier"]
        QueryFrontend["Query Frontend"]
        Compactor["Compactor"]
    end

    subgraph Storage["Storage"]
        Chunks["Chunks<br>(S3/GCS)"]
        Index["Index<br>(BoltDB/TSDB)"]
    end

    Promtail --> Distributor
    FluentBit --> Distributor
    OTel --> Distributor
    Distributor --> Ingester
    Ingester --> Chunks
    Ingester --> Index
    Querier --> Chunks
    Querier --> Index
    QueryFrontend --> Querier
    Compactor --> Index
    Grafana["Grafana"] --> QueryFrontend

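Promtail Configuration

# promtail-config.yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Collect Kubernetes Pod logs
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      # Parse the Docker JSON log format
      - docker: {}
      # Parse JSON structured logs
      - json:
          expressions:
            level: level
            msg: message
            traceID: traceID
            timestamp: timestamp
      # Promote low-cardinality fields to labels. traceID stays in the log line:
      # as a label it would create a separate stream for nearly every request.
      - labels:
          level:
      # Set the timestamp
      - timestamp:
          source: timestamp
          format: RFC3339Nano
      # Drop health check logs. A match selector sees labels and the log line,
      # not extracted values, so the path is matched with a line filter.
      - match:
          selector: '{app="nginx"} |~ `"/healthz?"`'
          action: drop
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      # Tell Promtail which files to tail for each discovered Pod
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log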
LogQL Queries

# Basic query: label matching
{namespace="production", app="api-server"}

# Line filters
{app="api-server"} |= "error"
{app="api-server"} != "health"
{app="api-server"} |~ "timeout|connection refused"

# JSON parsing
{app="api-server"} | json | level="error" | line_format "{{.message}}"

# Regex parsing
{app="nginx"} | regexp `(?P<ip>\d+\.\d+\.\d+\.\d+) .* "(?P<method>\w+) (?P<path>[^ ]+).*" (?P<status>\d+)`
  | status >= 500

# Metric query: log line count
count_over_time({app="api-server", level="error"}[5m])

# Metric query: log lines per second
rate({app="api-server"}[5m])

# Metric query: grouped by status code
sum by (status) (
  count_over_time(
    {app="nginx"} | regexp `"(?P<method>\w+) (?P<path>[^ ]+).*" (?P<status>\d+)` [5m]
  )
)

# Error rate
sum(rate({app="api-server"} |= "error" [5m]))
/
sum(rate({app="api-server"} [5m]))
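
The line filters `|=`, `!=`, and `|~` behave like substring and regex checks chained in order. The following sketch models them in Go over an in-memory slice of lines and then computes the same kind of ratio as the error rate query. All names (`contains`, `notContains`, `matches`, `apply`) and the sample lines are assumptions for illustration, and Loki's real implementation is considerably more involved.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// filter decides whether a log line is kept.
type filter func(line string) bool

// contains models `|= "text"`.
func contains(s string) filter {
	return func(l string) bool { return strings.Contains(l, s) }
}

// notContains models `!= "text"`.
func notContains(s string) filter {
	return func(l string) bool { return !strings.Contains(l, s) }
}

// matches models `|~ "pattern"`.
func matches(pattern string) filter {
	re := regexp.MustCompile(pattern)
	return func(l string) bool { return re.MatchString(l) }
}

// apply keeps the lines that pass every filter, evaluated left to right.
func apply(lines []string, filters ...filter) []string {
	var out []string
	for _, l := range lines {
		keep := true
		for _, f := range filters {
			if !f(l) {
				keep = false
				break
			}
		}
		if keep {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	lines := []string{
		`level=info msg="health check"`,
		`level=error msg="connection refused"`,
		`level=error msg="timeout calling payments"`,
		`level=info msg="order created"`,
	}
	errors := apply(lines, contains("error"))
	fmt.Println("error lines:", len(errors))
	fmt.Println("without health checks:", len(apply(lines, notContains("health"))))
	fmt.Println("timeouts or refusals:", len(apply(lines, matches("timeout|connection refused"))))
	fmt.Printf("error ratio: %.2f\n", float64(len(errors))/float64(len(lines)))
}
```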

Distributed Tracing with Tempo

Tempo Architecture

graph TB
    subgraph App["Application"]
        SDK["OpenTelemetry SDK"]
    end

    subgraph OTelCol["OpenTelemetry Collector"]
        Receiver["Receiver<br>(OTLP/Jaeger/Zipkin)"]
        Processor["Processor<br>(Batch/Filter)"]
        Exporter["Exporter"]
    end

    subgraph Tempo["Grafana Tempo"]
        Dist["Distributor"]
        Ing["Ingester"]
        QFE["Query Frontend"]
        Que["Querier"]
        Comp["Compactor"]
    end

    subgraph Store["Object Storage"]
        S3["S3/GCS/Azure Blob"]
    end

    SDK --> Receiver
    Receiver --> Processor
    Processor --> Exporter
    Exporter --> Dist
    Dist --> Ing
    Ing --> S3
    QFE --> Que
    Que --> S3
    Comp --> S3
    Grafana["Grafana"] --> QFE

OpenTelemetry Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

connectors:
  # Derive service metrics from spans. The spanmetrics connector replaces
  # the deprecated spanmetrics processor.
  spanmetrics:
    dimensions:
      - name: http.method
      - name: http.status_code

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      # The connector acts as an exporter here and as a receiver below
      exporters: [otlp/tempo, spanmetrics]
    metrics:
      receivers: [otlp, spanmetrics]
      processors: [batch]
      exporters: [prometheus]

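Instrumenting the Application

// Integrating OpenTelemetry into a Go application
package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
    "go.opentelemetry.io/otel/trace"
)

// initTracer builds a tracer provider that exports spans to the Collector over OTLP/gRPC.
func initTracer() (*sdktrace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("api-server"),
            semconv.ServiceVersionKey.String("1.0.0"),
            semconv.DeploymentEnvironmentKey.String("production"),
        )),
        sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)), // sample 10% of traces
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

var tracer = otel.Tracer("api-server")

func handleOrder(ctx context.Context, orderID string) error {
    ctx, span := tracer.Start(ctx, "handleOrder",
        trace.WithAttributes(
            attribute.String("order.id", orderID),
        ),
    )
    defer span.End()

    // Call downstream services
    if err := validateOrder(ctx, orderID); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }

    return processPayment(ctx, orderID)
}

// validateOrder and processPayment are placeholders for downstream calls.
func validateOrder(ctx context.Context, orderID string) error  { return nil }
func processPayment(ctx context.Context, orderID string) error { return nil }

func main() {
    tp, err := initTracer()
    if err != nil {
        log.Fatalf("init tracer: %v", err)
    }
    // Flush buffered spans before the process exits
    defer tp.Shutdown(context.Background())

    if err := handleOrder(context.Background(), "order-123"); err != nil {
        log.Printf("handle order: %v", err)
    }
}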
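
The tracer above keeps roughly 10% of traces by deciding from the trace ID itself, so every service that sees the same trace ID reaches the same decision. The sketch below illustrates that idea with the standard library only. It is written in the spirit of ratio-based sampling and is not claimed to be the SDK's exact algorithm; `shouldSample` and the random trace IDs are assumptions for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/rand"
)

// shouldSample makes a deterministic decision from the trace ID: the last
// 8 bytes are read as an integer and compared with a bound derived from ratio.
func shouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	bound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1 // keep the value below 2^63
	return x < bound
}

func main() {
	const ratio = 0.1
	const total = 10000

	rng := rand.New(rand.NewSource(1))
	sampled := 0
	for i := 0; i < total; i++ {
		var id [16]byte
		binary.BigEndian.PutUint64(id[0:8], rng.Uint64())
		binary.BigEndian.PutUint64(id[8:16], rng.Uint64())
		if shouldSample(id, ratio) {
			sampled++
		}
	}
	fmt.Printf("sampled %d of %d traces\n", sampled, total)
}
```

Because the decision depends only on the trace ID, a trace is either kept in full or dropped in full, which avoids partially recorded request paths.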

Deploying the Grafana Stack (LGTM)

Docker Compose gives a quick way to bring up the complete Grafana Stack:

# docker-compose.yaml
version: "3.8"
services:
  grafana:
    image: grafana/grafana:10.3.0
    ports:
      - "3000:3000"
    environment:
      # Anonymous admin access is convenient for a local demo; do not use it on a shared or public host
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana

  prometheus:
    image: prom/prometheus:v2.50.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
      - --web.enable-remote-write-receiver
      - --enable-feature=exemplar-storage

  loki:
    image: grafana/loki:2.9.4
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/config.yaml

  tempo:
    image: grafana/tempo:2.3.1
    ports:
      - "3200:3200"   # Tempo API
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./tempo-config.yaml:/etc/tempo/config.yaml
      - tempo-data:/var/tempo
    command: -config.file=/etc/tempo/config.yaml

  promtail:
    image: grafana/promtail:2.9.4
    volumes:
      - ./promtail-config.yaml:/etc/promtail/config.yaml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/config.yaml

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.93.0
    ports:
      - "4317"
      - "4318"
      - "8889:8889"
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    command: --config /etc/otel/config.yaml

volumes:
  grafana-data:
  prometheus-data:
  loki-data:
  tempo-data:

graph LR
    subgraph LGTM["Grafana LGTM Stack"]
        L["Loki<br>Logs"]
        G["Grafana<br>Visualization"]
        T["Tempo<br>Traces"]
        M["Mimir/Prometheus<br>Metrics"]
    end

    App["Application"] --> OTel["OpenTelemetry<br>Collector"]
    OTel --> L
    OTel --> T
    OTel --> M
    L --> G
    T --> G
    M --> G

    style G fill:#F46800,color:#fff
    style L fill:#1565C0,color:#fff
    style T fill:#2E7D32,color:#fff
    style M fill:#E65100,color:#fff

Grafana Alerting

# Grafana Alerting (Unified Alerting)
# Alert rules are defined through the UI or through provisioning
apiVersion: 1
groups:
  - orgId: 1
    name: application-alerts
    folder: Production
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            # Query window in seconds before now: the last 10 minutes
            relativeTimeRange:
              from: 600
              to: 0
            model:
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
          - refId: B
            # __expr__ is the built-in expression engine ("-100" is its legacy alias)
            datasourceUid: __expr__
            model:
              type: reduce
              reducer: last
              expression: A
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [0.05]
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 5%"

Dashboard Design Best Practices

USE method (infrastructure): Utilization, Saturation, Errors
RED method (microservices): Rate, Errors, Duration
Four golden signals (SRE): latency, traffic, errors, saturation

graph TB
    subgraph RED["RED method: microservice dashboard"]
        Rate["Rate<br>Request rate (QPS)"]
        Errors["Errors<br>Error rate (%)"]
        Duration["Duration<br>Latency (P50/P95/P99)"]
    end

    subgraph USE["USE method: infrastructure dashboard"]
        Utilization["Utilization<br>CPU/memory/disk utilization"]
        Saturation["Saturation<br>Queue depth/waiters"]
        ErrRate["Errors<br>Hardware/software errors"]
    end

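Summary

A complete observability platform combines the three pillars of metrics, logs, and traces:

Prometheus collects metrics and evaluates alert rules
Loki aggregates logs and relies on labels for efficient queries
Tempo handles distributed tracing and can store traces without sampling
Grafana is the unified visualization layer that links the three data sources
OpenTelemetry is the common standard for collecting telemetry

The key is to link the data together: drill down from a metric anomaly to the related logs, then jump from the TraceID in a log line to the full trace. This end-to-end correlation is what makes it possible to locate and troubleshoot problems quickly.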