DevOps · #observability#grafana#dashboard

Guide to Building an Observability Platform with Grafana

2024.09.04 7 min 3.0k

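Preface

The three pillars of observability are metrics, logs, and traces. The Grafana ecosystem covers all three in one stack: Grafana for visualization, Prometheus for metrics, Loki for logs, and Tempo for traces. This article walks through building a complete observability platform from these components.

The Three Pillars of Observability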

graph TB
    subgraph Observability["Three pillars of observability"]
        Metrics["Metrics<br>Prometheus<br>Time series, numeric"]
        Logs["Logs<br>Loki<br>Event records, text"]
        Traces["Traces<br>Tempo<br>Request paths, distributed"]
    end

    App["Application"] --> Metrics
    App --> Logs
    App --> Traces

    Metrics --> Grafana["Grafana<br>Unified visualization"]
    Logs --> Grafana
    Traces --> Grafana

    Grafana --> Alert["Alerting"]
    Grafana --> Dashboard["Dashboard"]
    Grafana --> Explore["Explore"]

Grafana Data Source Configuration

Prometheus Data Source

# Grafana data source configuration (provisioning)
# /etc/grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus   # explicit UID so the Tempo settings below can reference it
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: 15s
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo

  - name: Loki
    type: loki
    uid: loki         # explicit UID referenced by tracesToLogs below
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: "traceID=(\\w+)"
          name: TraceID
          # "$$" escapes "$" so that provisioning does not interpolate it
          url: "$${__value.raw}"

  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    uid: tempo
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ['service.name', 'namespace']
        mappedTags: [{ key: 'service.name', value: 'app' }]
        filterByTraceID: true
      tracesToMetrics:
        datasourceUid: prometheus
        tags: [{ key: 'service.name', value: 'service' }]
        queries:
          - name: Request Rate
            query: 'rate(http_requests_total{$$__tags}[5m])'
      nodeGraph:
        enabled: true
      serviceMap:
        datasourceUid: prometheus

Data Source Correlation

graph LR
    subgraph Correlation["Data correlation"]
        M["Metrics<br>(Prometheus)"] --> |"exemplar"| T["Traces<br>(Tempo)"]
        T --> |"tracesToLogs"| L["Logs<br>(Loki)"]
        L --> |"derivedFields"| T
        T --> |"tracesToMetrics"| M
    end

    style M fill:#E65100,color:#fff
    style L fill:#1565C0,color:#fff
    style T fill:#2E7D32,color:#fff
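
The logs-to-traces link relies on the `matcherRegex` of the Loki derived field: Grafana runs the expression against each log line and turns the captured group into a link to Tempo. The sketch below applies the same expression in Go to show what does and does not produce a link. The log lines are hypothetical, and Go's `regexp` engine is only a stand-in for the matching Grafana performs in the browser.

```go
package main

import (
	"fmt"
	"regexp"
)

// traceIDPattern is the same expression as matcherRegex in the Loki data source.
var traceIDPattern = regexp.MustCompile(`traceID=(\w+)`)

// extractTraceID returns the first captured trace ID in a log line,
// or ok=false when the line carries no traceID=... token.
func extractTraceID(line string) (id string, ok bool) {
	m := traceIDPattern.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Hypothetical log lines, for illustration only.
	lines := []string{
		`level=error msg="payment failed" traceID=4bf92f3577b34da6a3ce929d0e0e4736`,
		`level=info msg="health check ok"`,
	}
	for _, l := range lines {
		if id, ok := extractTraceID(l); ok {
			fmt.Println("link to Tempo trace:", id)
		} else {
			fmt.Println("no trace link")
		}
	}
}
```

Note that the expression expects the `traceID=value` form. An application that logs JSON (`"traceID":"..."`) would need a different `matcherRegex`.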

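Dashboard Design

Panel Types

Grafana provides a range of panel types:

| Panel type | Purpose | Typical data |
|---|---|---|
| Time Series | Time-series line chart | QPS, latency, resource usage |
| Stat | Single value | Current value, totals |
| Gauge | Gauge | Percentages, utilization |
| Bar Chart | Bar chart | Comparison across categories |
| Table | Table | Detailed data listings |
| Heatmap | Heatmap | Latency distribution |
| Logs | Log panel | Loki logs |
| Node Graph | Topology graph | Service dependencies |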

Dashboard JSON Model

{
  "dashboard": {
    "title": "Application Overview",
    "tags": ["production", "application"],
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "templating": {
      "list": [
        {
          "name": "namespace",
          "type": "query",
          "datasource": "Prometheus",
          "query": "label_values(http_requests_total, namespace)",
          "refresh": 2,
          "multi": false,
          "includeAll": true
        },
        {
          "name": "service",
          "type": "query",
          "datasource": "Prometheus",
          "query": "label_values(http_requests_total{namespace=~\"$namespace\"}, service)",
          "refresh": 2,
          "multi": true
        }
      ]
    },
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "sum by (service) (rate(http_requests_total{namespace=~\"$namespace\", service=~\"$service\"}[5m]))",
            "legendFormat": "{{ service }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps",
            "custom": {
              "drawStyle": "line",
              "lineInterpolation": "smooth",
              "fillOpacity": 10
            }
          }
        }
      }
    ]
  }
}
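
The outer `dashboard` wrapper in this JSON matches the request body that Grafana's dashboard HTTP API accepts at `POST /api/dashboards/db`. As a minimal sketch of how such a payload could be pushed from a deployment script, the Go code below builds the request without sending it. The base URL, the token value, and the helper name `newDashboardRequest` are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// dashboardPayload mirrors the wrapper shown above. "overwrite" is an
// optional field of the same request body.
type dashboardPayload struct {
	Dashboard map[string]interface{} `json:"dashboard"`
	Overwrite bool                   `json:"overwrite"`
}

// newDashboardRequest builds (but does not send) the HTTP request.
// baseURL and token are placeholders supplied by the caller.
func newDashboardRequest(baseURL, token string, p dashboardPayload) (*http.Request, error) {
	body, err := json.Marshal(p)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/dashboards/db", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	p := dashboardPayload{
		Dashboard: map[string]interface{}{"title": "Application Overview", "refresh": "30s"},
		Overwrite: true,
	}
	// "example-token" stands in for a real service account token.
	req, err := newDashboardRequest("http://localhost:3000", "example-token", p)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	// To send it: resp, err := http.DefaultClient.Do(req)
}
```

Keeping request construction separate from sending makes the payload easy to inspect or test before it touches a live Grafana instance.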

Dashboard Provisioning

# /etc/grafana/provisioning/dashboards/dashboards.yaml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: "Production"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      # Note: per the Grafana docs, this option only works when folder/folderUid are unset
      foldersFromFilesStructure: true

Variables

Variables make a dashboard interactive and reusable:

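# Variable definition examples
# (illustrative pseudo-config; in a real dashboard these live under templating.list in the JSON model)
variables:
  # Query variable: values fetched dynamically from the data source
  - name: cluster
    type: query
    query: "label_values(up, cluster)"

  # Custom variable: predefined options
  - name: interval
    type: custom
    values: "1m,5m,15m,1h"
    current: "5m"

  # Interval variable: sampling interval computed automatically
  # (names starting with "__" are reserved for built-ins such as $__rate_interval)
  - name: resolution
    type: interval
    auto: true
    auto_min: "1m"

  # Chained variable: cascading selection
  - name: pod
    type: query
    query: "label_values(kube_pod_info{namespace='$namespace'}, pod)"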
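
When a multi-value variable such as `$service` is used in a Prometheus query, Grafana renders the selected values as a regex alternation, which is why the matcher has to be `=~` rather than `=`. The Go sketch below approximates that expansion. It is a simplified illustration, not Grafana's implementation, and the function names are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// expandMultiValue approximates how a multi-value variable is rendered for a
// regex matcher: each value is escaped, then joined into "(a|b|c)".
// Simplified illustration; Grafana's own escaping differs in detail.
func expandMultiValue(values []string) string {
	escaped := make([]string, 0, len(values))
	for _, v := range values {
		escaped = append(escaped, regexp.QuoteMeta(v))
	}
	return "(" + strings.Join(escaped, "|") + ")"
}

// buildQuery substitutes the expanded selection into a request-rate expression.
func buildQuery(namespace string, services []string) string {
	return fmt.Sprintf(
		`sum by (service) (rate(http_requests_total{namespace=~"%s", service=~"%s"}[5m]))`,
		namespace, expandMultiValue(services),
	)
}

func main() {
	fmt.Println(buildQuery("production", []string{"api-server", "web"}))
}
```

With an exact-match `=` the expanded string `(api-server|web)` would be compared literally and match nothing, which is a common cause of empty panels after enabling multi-select.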

Log Aggregation with Loki

Loki Architecture

graph TB
    subgraph Sources["Log sources"]
        Promtail["Promtail<br>(DaemonSet)"]
        FluentBit["Fluent Bit"]
        OTel["OpenTelemetry<br>Collector"]
    end

    subgraph Loki["Loki"]
        Distributor["Distributor"]
        Ingester["Ingester"]
        Querier["Querier"]
        QueryFrontend["Query Frontend"]
        Compactor["Compactor"]
    end

    subgraph Storage["Storage"]
        Chunks["Chunks<br>(S3/GCS)"]
        Index["Index<br>(BoltDB/TSDB)"]
    end

    Promtail --> Distributor
    FluentBit --> Distributor
    OTel --> Distributor
    Distributor --> Ingester
    Ingester --> Chunks
    Ingester --> Index
    Querier --> Chunks
    Querier --> Index
    QueryFrontend --> Querier
    Compactor --> Index
    Grafana["Grafana"] --> QueryFrontend

Promtail Configuration

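# promtail-config.yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Collect Kubernetes Pod logs
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      # Parse the Docker JSON log format
      - docker: {}
      # Parse JSON structured logs
      - json:
          expressions:
            level: level
            msg: message
            traceID: traceID
            timestamp: timestamp
      # Set labels. traceID is left out of the label set: it is unique per
      # request and would create a very large number of streams in Loki.
      - labels:
          level:
      # Set the timestamp
      - timestamp:
          source: timestamp
          format: RFC3339Nano
      # Drop health-check logs. A match selector only sees labels, so the
      # path is matched with a line filter instead of an extracted field.
      - match:
          selector: '{app="nginx"} |~ `"/health[z]?"`'
          action: drop
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      # Tell Promtail which files to tail (assumes the standard /var/log/pods layout)
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log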
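
To make the pipeline stages concrete, the following Go sketch walks one structured log line through the same steps: parse the JSON, promote `level` to a label, and take the entry time from the `timestamp` field in RFC3339Nano format. It is a simplified model of the idea rather than Promtail's code (Promtail, for instance, keeps a line whose JSON cannot be parsed instead of returning an error), and the type and function names are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// entry holds the fields that the json stage extracts in the config above.
type entry struct {
	Level     string `json:"level"`
	Message   string `json:"message"`
	TraceID   string `json:"traceID"`
	Timestamp string `json:"timestamp"`
}

// processed is what this sketch hands on: labels, entry time, original line.
type processed struct {
	Labels map[string]string
	Time   time.Time
	Line   string
}

// process mimics the json -> labels -> timestamp stages for one line.
func process(line string) (processed, error) {
	var e entry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return processed{}, fmt.Errorf("parse json: %w", err)
	}
	ts, err := time.Parse(time.RFC3339Nano, e.Timestamp)
	if err != nil {
		return processed{}, fmt.Errorf("parse timestamp: %w", err)
	}
	return processed{
		Labels: map[string]string{"level": e.Level}, // low-cardinality label
		Time:   ts,
		Line:   line,
	}, nil
}

func main() {
	// Hypothetical application log line.
	line := `{"level":"error","message":"payment failed","traceID":"abc123","timestamp":"2024-09-04T10:00:00.5Z"}`
	p, err := process(line)
	if err != nil {
		fmt.Println("skip line:", err)
		return
	}
	fmt.Println(p.Labels["level"], p.Time.UTC().Format(time.RFC3339Nano))
}
```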

LogQL Queries

# Basic query: label filtering
{namespace="production", app="api-server"}

# Keyword filtering
{app="api-server"} |= "error"
{app="api-server"} != "health"
{app="api-server"} |~ "timeout|connection refused"

# JSON parsing
{app="api-server"} | json | level="error" | line_format "{{.message}}"

# Regex parsing
{app="nginx"} | regexp `(?P<ip>\d+\.\d+\.\d+\.\d+) .* "(?P<method>\w+) (?P<path>[^ ]+).*" (?P<status>\d+)`
  | status >= 500

# Metric query: log line count
count_over_time({app="api-server", level="error"}[5m])

# Metric query: log lines per second (approximates QPS when each request logs one line)
rate({app="api-server"}[5m])

# Metric query: grouped by status code
sum by (status) (
  count_over_time(
    {app="nginx"} | regexp `"(?P<method>\w+) (?P<path>[^ ]+).*" (?P<status>\d+)` [5m]
  )
)

# Error rate
sum(rate({app="api-server"} |= "error" [5m]))
/
sum(rate({app="api-server"} [5m]))
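
The same LogQL expressions can be run outside Grafana through Loki's HTTP API. As a hedged sketch, the Go function below builds a URL for the range-query endpoint `/loki/api/v1/query_range`, passing `start` and `end` as Unix nanoseconds. The base URL and the helper name are assumptions for the example, and no request is actually sent.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// queryRangeURL builds a request URL for Loki's range-query endpoint.
// startNs and endNs are Unix timestamps in nanoseconds.
func queryRangeURL(base, logQL string, startNs, endNs int64, limit int) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/loki/api/v1/query_range"
	q := url.Values{}
	q.Set("query", logQL)
	q.Set("start", strconv.FormatInt(startNs, 10))
	q.Set("end", strconv.FormatInt(endNs, 10))
	q.Set("limit", strconv.Itoa(limit))
	u.RawQuery = q.Encode() // percent-encodes braces, quotes, and pipes
	return u.String(), nil
}

func main() {
	now := time.Now()
	u, err := queryRangeURL(
		"http://localhost:3100", // assumed address of the Loki container
		`{app="api-server"} |= "error"`,
		now.Add(-time.Hour).UnixNano(), now.UnixNano(), 100,
	)
	if err != nil {
		fmt.Println("build URL:", err)
		return
	}
	fmt.Println(u)
}
```

Building the query string with `url.Values` avoids hand-escaping LogQL, whose braces and quotes are not URL-safe.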

Distributed Tracing with Tempo

Tempo Architecture

graph TB
    subgraph App["Application"]
        SDK["OpenTelemetry SDK"]
    end

    subgraph OTelCol["OpenTelemetry Collector"]
        Receiver["Receiver<br>(OTLP/Jaeger/Zipkin)"]
        Processor["Processor<br>(Batch/Filter)"]
        Exporter["Exporter"]
    end

    subgraph Tempo["Grafana Tempo"]
        Dist["Distributor"]
        Ing["Ingester"]
        QFE["Query Frontend"]
        Que["Querier"]
        Comp["Compactor"]
    end

    subgraph Store["Object Storage"]
        S3["S3/GCS/Azure Blob"]
    end

    SDK --> Receiver
    Receiver --> Processor
    Processor --> Exporter
    Exporter --> Dist
    Dist --> Ing
    Ing --> S3
    QFE --> Que
    Que --> S3
    Comp --> S3
    Grafana["Grafana"] --> QFE

OpenTelemetry Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  # Generate service metrics from spans automatically
  # (recent Collector releases deprecate this processor in favor of the spanmetrics connector)
  spanmetrics:
    metrics_exporter: prometheus
    dimensions:
      - name: http.method
      - name: http.status_code

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, spanmetrics]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Application Integration

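// Integrating OpenTelemetry into a Go application
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
	"go.opentelemetry.io/otel/trace"
)

// initTracer configures an OTLP/gRPC exporter and registers a global TracerProvider.
func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(context.Background(),
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("api-server"),
			semconv.ServiceVersionKey.String("1.0.0"),
			semconv.DeploymentEnvironmentKey.String("production"),
		)),
		sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)), // sample 10% of traces
	)

	otel.SetTracerProvider(tp)
	return tp, nil
}

var tracer = otel.Tracer("api-server")

func handleOrder(ctx context.Context, orderID string) error {
	ctx, span := tracer.Start(ctx, "handleOrder",
		trace.WithAttributes(
			attribute.String("order.id", orderID),
		),
	)
	defer span.End()

	// Call downstream services, passing ctx so child spans join this trace
	if err := validateOrder(ctx, orderID); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}

	return processPayment(ctx, orderID)
}

// Placeholder downstream calls (hypothetical); real implementations would
// start their own child spans from ctx.
func validateOrder(ctx context.Context, orderID string) error  { return nil }
func processPayment(ctx context.Context, orderID string) error { return nil }

func main() {
	tp, err := initTracer()
	if err != nil {
		log.Fatalf("init tracer: %v", err)
	}
	// Flush buffered spans before the process exits.
	defer func() { _ = tp.Shutdown(context.Background()) }()

	if err := handleOrder(context.Background(), "order-123"); err != nil {
		log.Printf("handle order: %v", err)
	}
}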
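
The `TraceIDRatioBased(0.1)` sampler decides from the trace ID itself, so every service that sees the same trace ID reaches the same keep-or-drop decision without coordination. The standard-library sketch below illustrates that idea: it reads the last 8 bytes of the ID as a number and compares it against a threshold derived from the ratio. It is written to resemble the approach taken by the Go SDK, but it is an illustration under that assumption and not the SDK's code.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// shouldSample reports whether a 16-byte trace ID falls inside the sampled
// fraction. The decision depends only on the ID and the ratio.
func shouldSample(traceID [16]byte, ratio float64) bool {
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Map the ratio onto the 63-bit range used for comparison.
	upperBound := uint64(ratio * (1 << 63))
	// Take the low 8 bytes and drop one bit so the value fits in 63 bits.
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < upperBound
}

func main() {
	// Example trace ID in hex, for illustration only.
	raw, err := hex.DecodeString("4bf92f3577b34da6a3ce929d0e0e4736")
	if err != nil || len(raw) != 16 {
		fmt.Println("invalid trace ID")
		return
	}
	var id [16]byte
	copy(id[:], raw)
	fmt.Println("sampled at 10%:", shouldSample(id, 0.1))
}
```

Because the decision is deterministic per trace, a 10% ratio yields complete traces for roughly one request in ten rather than random fragments of every request.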

Deploying the Grafana Stack (LGTM)

Use Docker Compose to bring up a complete Grafana stack quickly:

# docker-compose.yaml
version: "3.8"
services:
  grafana:
    image: grafana/grafana:10.3.0
    ports:
      - "3000:3000"
    environment:
      # Local demo only: anonymous visitors get the Admin role
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    volumes:
      - ./provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana

  prometheus:
    image: prom/prometheus:v2.50.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
      - --web.enable-remote-write-receiver
      - --enable-feature=exemplar-storage

  loki:
    image: grafana/loki:2.9.4
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/config.yaml

  tempo:
    image: grafana/tempo:2.3.1
    ports:
      - "3200:3200"   # Tempo API
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./tempo-config.yaml:/etc/tempo/config.yaml
      - tempo-data:/var/tempo
    command: -config.file=/etc/tempo/config.yaml

  promtail:
    image: grafana/promtail:2.9.4
    volumes:
      - ./promtail-config.yaml:/etc/promtail/config.yaml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    command: -config.file=/etc/promtail/config.yaml

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.93.0
    ports:
      - "4317"        # published on a random host port; 4317/4318 on the host belong to Tempo
      - "4318"
      - "8889:8889"
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    command: --config /etc/otel/config.yaml

volumes:
  grafana-data:
  prometheus-data:
  loki-data:
  tempo-data:

graph LR
    subgraph LGTM["Grafana LGTM Stack"]
        L["Loki<br>Logs"]
        G["Grafana<br>Visualization"]
        T["Tempo<br>Traces"]
        M["Mimir/Prometheus<br>Metrics"]
    end

    App["Application"] --> OTel["OpenTelemetry<br>Collector"]
    OTel --> L
    OTel --> T
    OTel --> M
    L --> G
    T --> G
    M --> G

    style G fill:#F46800,color:#fff
    style L fill:#1565C0,color:#fff
    style T fill:#2E7D32,color:#fff
    style M fill:#E65100,color:#fff
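
After `docker compose up`, it helps to confirm that each backend is answering before opening Grafana. The sketch below probes the health endpoints the four services expose on the ports published in the Compose file (`/api/health` for Grafana, `/-/ready` for Prometheus, `/ready` for Loki and Tempo). The host name and the two-second timeout are assumptions for a local setup.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// readinessURLs maps each service to a health endpoint on its published port.
func readinessURLs(host string) map[string]string {
	return map[string]string{
		"grafana":    "http://" + host + ":3000/api/health",
		"prometheus": "http://" + host + ":9090/-/ready",
		"loki":       "http://" + host + ":3100/ready",
		"tempo":      "http://" + host + ":3200/ready",
	}
}

// isReady reports whether the endpoint answers with HTTP 200.
func isReady(client *http.Client, url string) bool {
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	for name, url := range readinessURLs("localhost") {
		fmt.Printf("%-10s ready=%v\n", name, isReady(client, url))
	}
}
```

Loki and Tempo can report not ready for a short while after startup, so a real smoke test would retry for some seconds before giving up.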

Grafana Alerting

# Grafana Alerting (Unified Alerting)
# Alert rules are defined through the UI or through provisioning
apiVersion: 1
groups:
  - orgId: 1
    name: application-alerts
    folder: Production
    interval: 1m
    rules:
      - uid: high-error-rate
        title: High Error Rate
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            # Query window in seconds relative to now (assumed here: the last 10 minutes)
            relativeTimeRange:
              from: 600
              to: 0
            model:
              expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
          - refId: B
            datasourceUid: "-100"   # built-in expression data source (legacy UID; newer versions also use __expr__)
            model:
              type: reduce
              reducer: last
              expression: A
          - refId: C
            datasourceUid: "-100"
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [0.05]
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate exceeds 5%"
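
The interplay of `interval: 1m`, the `0.05` threshold, and `for: 5m` is easier to see as a small state machine: a rule that crosses the threshold first becomes Pending and only turns Firing once the condition has held for the whole `for` period. The Go sketch below models that behavior for a sequence of error-rate samples. It is a simplified model with illustrative names, not Grafana's evaluator.

```go
package main

import "fmt"

type alertState string

const (
	stateNormal  alertState = "Normal"
	statePending alertState = "Pending"
	stateFiring  alertState = "Firing"
)

// rule tracks how long the condition has been continuously true.
type rule struct {
	threshold       float64
	forSeconds      int // the "for" duration
	intervalSeconds int // the group evaluation interval
	breaching       bool
	elapsedSeconds  int // time since the condition first became true
}

func newRule(threshold float64, forSeconds, intervalSeconds int) *rule {
	return &rule{threshold: threshold, forSeconds: forSeconds, intervalSeconds: intervalSeconds}
}

// evaluate consumes one error-rate sample per evaluation interval.
func (r *rule) evaluate(errorRate float64) alertState {
	if errorRate <= r.threshold {
		r.breaching = false
		r.elapsedSeconds = 0
		return stateNormal
	}
	if !r.breaching {
		r.breaching = true // first evaluation above the threshold
	} else {
		r.elapsedSeconds += r.intervalSeconds
	}
	if r.elapsedSeconds >= r.forSeconds {
		return stateFiring
	}
	return statePending
}

func main() {
	r := newRule(0.05, 300, 60) // threshold 5%, for 5m, interval 1m
	// Hypothetical error-rate samples, one per minute.
	for i, v := range []float64{0.01, 0.08, 0.09, 0.07, 0.06, 0.10, 0.12, 0.02} {
		fmt.Printf("t=%dm rate=%.2f state=%s\n", i, v, r.evaluate(v))
	}
}
```

A single dip below the threshold resets the timer, which is what keeps short spikes from paging anyone.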

Dashboard Design Best Practices

  1. USE method (infrastructure): Utilization, Saturation, Errors
  2. RED method (microservices): Rate, Errors, Duration
  3. Four golden signals (SRE): latency, traffic, errors, saturation

graph TB
    subgraph RED["RED method: microservice dashboards"]
        Rate["Rate<br>Request rate (QPS)"]
        Errors["Errors<br>Error rate (%)"]
        Duration["Duration<br>Latency (P50/P95/P99)"]
    end

    subgraph USE["USE method: infrastructure dashboards"]
        Utilization["Utilization<br>CPU / memory / disk utilization"]
        Saturation["Saturation<br>Queue depth / waiters"]
        ErrRate["Errors<br>Hardware / software errors"]
    end
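
As a worked example of the RED numbers, the Go sketch below derives request rate, error rate, and P50/P95/P99 latency from a window of raw request records. In practice Prometheus computes these from counters and histograms. The code only illustrates the arithmetic, using a nearest-rank percentile and made-up type names.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// request is one observed call within the window.
type request struct {
	durationMs float64
	failed     bool
}

// red holds the three RED signals for a window.
type red struct {
	rate          float64 // requests per second
	errorRate     float64 // fraction between 0 and 1
	p50, p95, p99 float64 // latency in milliseconds
}

// percentile returns the nearest-rank percentile of an ascending slice.
func percentile(sorted []float64, p float64) float64 {
	if len(sorted) == 0 {
		return 0
	}
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// summarize computes the RED signals over a window of the given length.
func summarize(reqs []request, windowSeconds float64) red {
	if len(reqs) == 0 || windowSeconds <= 0 {
		return red{}
	}
	durations := make([]float64, 0, len(reqs))
	failed := 0
	for _, r := range reqs {
		durations = append(durations, r.durationMs)
		if r.failed {
			failed++
		}
	}
	sort.Float64s(durations)
	return red{
		rate:      float64(len(reqs)) / windowSeconds,
		errorRate: float64(failed) / float64(len(reqs)),
		p50:       percentile(durations, 50),
		p95:       percentile(durations, 95),
		p99:       percentile(durations, 99),
	}
}

func main() {
	// Synthetic data: 100 requests in 10 seconds, 5 of them failed.
	reqs := make([]request, 0, 100)
	for i := 1; i <= 100; i++ {
		reqs = append(reqs, request{durationMs: float64(i), failed: i <= 5})
	}
	r := summarize(reqs, 10)
	fmt.Printf("rate=%.1f/s errors=%.1f%% p50=%.0fms p95=%.0fms p99=%.0fms\n",
		r.rate, r.errorRate*100, r.p50, r.p95, r.p99)
}
```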

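Summary

Building a complete observability platform means combining the three pillars (metrics, logs, and traces) into one workflow:

  1. Prometheus collects metrics and evaluates alert rules
  2. Loki aggregates logs and uses labels for efficient queries
  3. Tempo handles distributed tracing and supports storing traces at full volume
  4. Grafana is the unified visualization layer that links the three data sources
  5. OpenTelemetry is the common standard for telemetry collection

The key is to establish links between the data: drill down from a metric anomaly to the related logs, then jump from the TraceID in a log line to the full trace. This end-to-end correlation is what makes locating and diagnosing problems fast.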

Author: zt
Published: 2024-09-04
Length: 3.0k characters · 7 min
License: CC BY-SA 4.0