Database · #clickhouse#olap#analytics

ClickHouse OLAP引擎实战指南

2025.03.26 9 min 3.7k
// 目录 · contents

引言

ClickHouse是由Yandex开发的开源列式OLAP数据库,以极致的查询性能著称。它能够在数十亿行数据上实现亚秒级的分析查询,广泛应用于日志分析、用户行为分析、实时报表、监控告警等场景。ClickHouse的核心设计哲学是:通过列式存储、向量化执行、数据压缩和并行处理,最大化地利用现代硬件的计算能力。

列式存储原理

行存储 vs 列存储

graph TD
    subgraph "行存储(MySQL/PostgreSQL)"
        direction LR
        R1["Row 1: id=1, name=Alice, age=25, city=Beijing"]
        R2["Row 2: id=2, name=Bob, age=30, city=Shanghai"]
        R3["Row 3: id=3, name=Carol, age=28, city=Guangzhou"]
    end

    subgraph "列存储(ClickHouse)"
        C1["id列:    [1, 2, 3, ...]"]
        C2["name列:  [Alice, Bob, Carol, ...]"]
        C3["age列:   [25, 30, 28, ...]"]
        C4["city列:  [Beijing, Shanghai, Guangzhou, ...]"]
    end

列式存储的优势:

特性 行存储 列存储
读取特定列 需要读取整行 只读取需要的列
数据压缩 压缩率一般 同类型数据压缩率极高
聚合查询 需要扫描整行 只扫描参与计算的列
SIMD向量化 不适合 同类型数据适合SIMD
单行查询 高效 需要合并多列
事务写入 高效 低效(不适合频繁单行插入)

数据压缩效果

1
2
3
4
5
6
7
8
典型压缩比示例(LZ4压缩):
- 整数列 (Int32): 原始 400MB -> 压缩后 ~40MB (10:1)
- 时间戳列 (DateTime): 原始 800MB -> 压缩后 ~30MB (27:1)
- 低基数字符串 (status): 原始 500MB -> 压缩后 ~5MB (100:1)
- 高基数字符串 (uuid): 原始 1GB -> 压缩后 ~400MB (2.5:1)

整体典型压缩比: 10:1 到 40:1
100GB原始数据可能只占用3-10GB磁盘空间

MergeTree 引擎族

MergeTree是ClickHouse最核心的表引擎族,几乎所有生产环境的表都使用MergeTree或其变体。

MergeTree 基础

1
2
3
4
5
6
7
8
9
10
11
12
13
CREATE TABLE events (
event_date Date,
event_time DateTime,
user_id UInt64,
event_type LowCardinality(String),
page_url String,
duration_ms UInt32,
properties Map(String, String)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (user_id, event_time)
TTL event_date + INTERVAL 90 DAY
SETTINGS index_granularity = 8192;

MergeTree 存储结构

graph TD
    subgraph "MergeTree表: events"
        subgraph "Partition: 202503"
            P1["Part: all_1_1_0"] --> D1["primary.idx<br/>(稀疏索引)"]
            P1 --> D2["event_date.bin + .mrk2"]
            P1 --> D3["event_time.bin + .mrk2"]
            P1 --> D4["user_id.bin + .mrk2"]
            P1 --> D5["event_type.bin + .mrk2"]
            P1 --> D6["...其他列文件"]
            P1 --> D7["count.txt<br/>checksums.txt<br/>columns.txt"]

            P2["Part: all_2_2_0"]
            P3["Part: all_1_2_1<br/>(merged)"]
        end

        subgraph "Partition: 202504"
            P4["Part: all_3_3_0"]
        end
    end

核心概念:

  • Partition(分区):按分区键将数据物理隔离,便于数据管理(删除旧分区、分区级查询优化)
  • Part(数据片段):每次INSERT生成一个新Part,后台异步合并(Merge)为更大的Part
  • Primary Index(主键索引):稀疏索引,每8192行(index_granularity)记录一个索引条目
  • Column File(列文件):每列独立存储为.bin(压缩数据)+ .mrk2(标记文件,用于定位granule)

稀疏索引工作原理

graph LR
    subgraph "Primary Index (稀疏索引)"
        I1["Mark 0: user_id=100, event_time=2025-03-01"]
        I2["Mark 1: user_id=100, event_time=2025-03-15"]
        I3["Mark 2: user_id=200, event_time=2025-03-01"]
        I4["Mark 3: user_id=300, event_time=2025-03-10"]
    end

    subgraph "Data Granules (每8192行)"
        G0["Granule 0: rows 0-8191"]
        G1["Granule 1: rows 8192-16383"]
        G2["Granule 2: rows 16384-24575"]
        G3["Granule 3: rows 24576-32767"]
    end

    I1 --> G0
    I2 --> G1
    I3 --> G2
    I4 --> G3
1
2
3
4
5
-- 查询 user_id = 200 的数据
-- ClickHouse通过稀疏索引确定:
-- Mark 2 (user_id=200) <= 200 < Mark 3 (user_id=300)
-- 只需要读取 Granule 2,跳过其他所有数据块
SELECT * FROM events WHERE user_id = 200;

MergeTree 引擎变体

graph TD
    MT[MergeTree] --> RMT[ReplacingMergeTree<br/>去重 按主键保留最新版本]
    MT --> SMT[SummingMergeTree<br/>自动聚合 按主键合并SUM]
    MT --> AMT[AggregatingMergeTree<br/>增量聚合 存储中间状态]
    MT --> CMT[CollapsingMergeTree<br/>折叠 通过sign列实现变更]
    MT --> VCMT[VersionedCollapsingMergeTree<br/>带版本的折叠]

    RMT --> RRMT[ReplicatedReplacingMergeTree<br/>复制版本]
    SMT --> RSMT[ReplicatedSummingMergeTree]
    AMT --> RAMT[ReplicatedAggregatingMergeTree]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
-- ReplacingMergeTree: 最终去重(按主键保留ver最大的行)
CREATE TABLE users (
user_id UInt64,
name String,
email String,
ver UInt64
) ENGINE = ReplacingMergeTree(ver)
ORDER BY user_id;

-- 注意: 去重发生在后台Merge时,查询时可能有重复
-- 使用FINAL关键字强制去重(有性能开销)
SELECT * FROM users FINAL WHERE user_id = 12345;

-- SummingMergeTree: 自动累加
CREATE TABLE daily_stats (
date Date,
page_url String,
pv UInt64,
uv UInt64,
total_duration UInt64
) ENGINE = SummingMergeTree((pv, uv, total_duration))
ORDER BY (date, page_url);

-- 相同主键的行在Merge时自动累加pv, uv, total_duration

-- AggregatingMergeTree: 存储聚合中间状态
CREATE TABLE agg_stats (
date Date,
category LowCardinality(String),
cnt AggregateFunction(count, UInt64),
amount_sum AggregateFunction(sum, Decimal(10,2)),
user_uniq AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree()
ORDER BY (date, category);

-- 配合物化视图使用
INSERT INTO agg_stats
SELECT
toDate(event_time) AS date,
category,
countState() AS cnt,
sumState(amount) AS amount_sum,
uniqState(user_id) AS user_uniq
FROM raw_events
GROUP BY date, category;

物化视图

物化视图是ClickHouse中实现实时聚合的核心机制。

flowchart LR
    A[原始数据表<br/>raw_events] -->|INSERT触发| B[物化视图<br/>MV]
    B -->|增量聚合| C[目标表<br/>agg_events]

    D[客户端] -->|INSERT| A
    D -->|查询聚合结果| C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
-- 原始事件表
CREATE TABLE raw_events (
event_time DateTime,
user_id UInt64,
event_type LowCardinality(String),
amount Decimal(10, 2),
properties String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, event_time)
TTL event_time + INTERVAL 30 DAY;

-- 聚合目标表
CREATE TABLE hourly_stats (
hour DateTime,
event_type LowCardinality(String),
event_count SimpleAggregateFunction(sum, UInt64),
unique_users AggregateFunction(uniq, UInt64),
total_amount SimpleAggregateFunction(sum, Decimal(12, 2))
) ENGINE = AggregatingMergeTree()
ORDER BY (event_type, hour);

-- 物化视图(增量更新)
CREATE MATERIALIZED VIEW hourly_stats_mv
TO hourly_stats AS
SELECT
toStartOfHour(event_time) AS hour,
event_type,
count() AS event_count,
uniqState(user_id) AS unique_users,
sum(amount) AS total_amount
FROM raw_events
GROUP BY hour, event_type;

-- 查询时从聚合表读取
SELECT
hour,
event_type,
event_count,
uniqMerge(unique_users) AS uv,
total_amount
FROM hourly_stats
WHERE hour >= '2025-03-15 00:00:00'
GROUP BY hour, event_type
ORDER BY hour;

数据摄入

批量插入的最佳实践

1
2
3
4
5
6
7
8
9
10
11
12
-- 推荐: 大批量插入(每次至少数千行)
INSERT INTO events FORMAT JSONEachRow
{"event_time": "2025-03-15 10:00:00", "user_id": 1001, "event_type": "page_view"}
{"event_time": "2025-03-15 10:00:01", "user_id": 1002, "event_type": "click"}
...

-- 不推荐: 频繁小批量插入(会产生大量小Part)
-- 如果必须实时插入,使用Buffer表缓冲
CREATE TABLE events_buffer AS events
ENGINE = Buffer(currentDatabase(), events, 16, 10, 100, 10000, 100000, 1000000, 10000000);
-- 参数: num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes
-- 数据先写入内存Buffer,满足条件后刷入底层表

从外部系统导入

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
-- 从Kafka消费
CREATE TABLE events_kafka (
event_time DateTime,
user_id UInt64,
event_type String,
amount Decimal(10,2)
) ENGINE = Kafka()
SETTINGS
kafka_broker_list = 'kafka1:9092,kafka2:9092',
kafka_topic_list = 'events',
kafka_group_name = 'clickhouse_consumer',
kafka_format = 'JSONEachRow',
kafka_num_consumers = 4;

-- 创建物化视图将Kafka数据持久化到MergeTree
CREATE MATERIALIZED VIEW events_kafka_mv TO events AS
SELECT * FROM events_kafka;

-- 从MySQL导入
CREATE TABLE mysql_orders
ENGINE = MySQL('mysql_host:3306', 'mydb', 'orders', 'user', 'password');

INSERT INTO local_orders SELECT * FROM mysql_orders WHERE created_at >= '2025-01-01';

-- 从S3/HDFS导入
INSERT INTO events
SELECT * FROM s3(
'https://bucket.s3.amazonaws.com/data/*.parquet',
'access_key',
'secret_key',
'Parquet'
);

查询优化

ORDER BY 键的选择

ORDER BY键的选择直接决定了稀疏索引的效果:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
-- 原则: 将过滤频率最高的列放在前面

-- 场景: 主要按用户+时间查询
CREATE TABLE user_events (...)
ENGINE = MergeTree()
ORDER BY (user_id, event_time);
-- SELECT ... WHERE user_id = 123 AND event_time > '2025-03-01'
-- 可以利用稀疏索引快速定位

-- 场景: 主要按事件类型+时间查询
CREATE TABLE analytics (...)
ENGINE = MergeTree()
ORDER BY (event_type, event_date, user_id);
-- SELECT ... WHERE event_type = 'purchase' AND event_date = '2025-03-15'
-- 同样可以利用稀疏索引

-- 查看查询是否利用了主键索引
SET send_logs_level = 'trace';
SELECT count() FROM events WHERE user_id = 12345;
-- 日志中会显示: "Selected N parts by partition key, M parts by primary key"

跳数索引(Data Skipping Index)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
-- 为非主键列创建跳数索引
CREATE TABLE logs (
timestamp DateTime,
level LowCardinality(String),
service LowCardinality(String),
message String,
trace_id String,
INDEX idx_trace_id trace_id TYPE bloom_filter(0.01) GRANULARITY 4,
INDEX idx_service service TYPE set(100) GRANULARITY 4,
INDEX idx_message message TYPE tokenbf_v1(10240, 3, 0) GRANULARITY 4
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (level, timestamp);

-- bloom_filter: 适合高基数列的等值查询
-- set: 适合低基数列(记录granule内所有不同值)
-- tokenbf_v1: 适合文本的token级布隆过滤器
-- minmax: 记录granule内的最大最小值(适合数值范围查询)

-- 查询可以利用跳数索引
SELECT * FROM logs WHERE trace_id = 'abc-123-def';
-- 通过bloom_filter索引快速排除不包含该trace_id的granule

查询性能调优

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
-- 1. 使用 PREWHERE 替代 WHERE(ClickHouse通常自动优化)
SELECT * FROM events
PREWHERE user_id = 12345 -- 先过滤,减少读取的列数据
WHERE event_type = 'click';

-- 2. 避免 SELECT *,只查需要的列
-- Bad
SELECT * FROM events WHERE user_id = 12345;
-- Good
SELECT event_time, event_type, duration_ms FROM events WHERE user_id = 12345;

-- 3. 使用 LowCardinality 优化低基数字符串
-- 内部使用字典编码,大幅减少内存和IO
ALTER TABLE events MODIFY COLUMN event_type LowCardinality(String);

-- 4. 使用近似计算提升性能
-- 精确去重(慢)
SELECT uniqExact(user_id) FROM events;
-- 近似去重(快,误差 < 2%)
SELECT uniq(user_id) FROM events;
-- HyperLogLog去重
SELECT uniqHLL12(user_id) FROM events;

-- 5. 使用采样(Sampling)
-- 建表时指定采样键
CREATE TABLE events_sampled (
event_time DateTime,
user_id UInt64,
event_type String
) ENGINE = MergeTree()
ORDER BY (intHash32(user_id), event_time)
SAMPLE BY intHash32(user_id);

-- 查询10%的数据(性能提升约10倍)
SELECT event_type, count() * 10 AS estimated_count
FROM events_sampled SAMPLE 0.1
GROUP BY event_type;

集群部署

集群架构

graph TD
    subgraph "ClickHouse Cluster (2 Shards x 2 Replicas)"
        subgraph "Shard 1"
            S1R1["Replica 1-1<br/>Node: ch1"]
            S1R2["Replica 1-2<br/>Node: ch2"]
            S1R1 <-->|ZooKeeper复制| S1R2
        end

        subgraph "Shard 2"
            S2R1["Replica 2-1<br/>Node: ch3"]
            S2R2["Replica 2-2<br/>Node: ch4"]
            S2R1 <-->|ZooKeeper复制| S2R2
        end
    end

    ZK[ZooKeeper / ClickHouse Keeper<br/>协调复制]

    S1R1 --> ZK
    S1R2 --> ZK
    S2R1 --> ZK
    S2R2 --> ZK

分布式表

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
-- 本地表(每个节点上创建)
CREATE TABLE events_local ON CLUSTER my_cluster (
event_time DateTime,
user_id UInt64,
event_type LowCardinality(String),
amount Decimal(10,2)
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(event_time)
ORDER BY (user_id, event_time);

-- 分布式表(逻辑层,路由到各个Shard)
CREATE TABLE events_distributed ON CLUSTER my_cluster AS events_local
ENGINE = Distributed(my_cluster, default, events_local, sipHash64(user_id));
-- 参数: 集群名, 数据库名, 本地表名, 分片键表达式

-- 写入分布式表(自动路由到正确的Shard)
INSERT INTO events_distributed VALUES
('2025-03-15 10:00:00', 1001, 'click', 0.00),
('2025-03-15 10:00:01', 2002, 'purchase', 99.90);

-- 查询分布式表(自动聚合所有Shard的结果)
SELECT
toStartOfHour(event_time) AS hour,
event_type,
count() AS cnt,
uniq(user_id) AS uv,
sum(amount) AS total
FROM events_distributed
WHERE event_time >= '2025-03-15'
GROUP BY hour, event_type
ORDER BY hour;

集群配置

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
<!-- config.xml 集群配置 -->
<clickhouse>
<remote_servers>
<my_cluster>
<shard>
<replica>
<host>ch1</host>
<port>9000</port>
</replica>
<replica>
<host>ch2</host>
<port>9000</port>
</replica>
</shard>
<shard>
<replica>
<host>ch3</host>
<port>9000</port>
</replica>
<replica>
<host>ch4</host>
<port>9000</port>
</replica>
</shard>
</my_cluster>
</remote_servers>

<macros>
<shard>01</shard> <!-- 每个节点不同 -->
<replica>replica_01</replica> <!-- 每个节点不同 -->
</macros>
</clickhouse>

典型应用场景

场景一:实时日志分析

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
-- 日志表
CREATE TABLE access_logs (
timestamp DateTime64(3),
date Date DEFAULT toDate(timestamp),
service LowCardinality(String),
level Enum8('DEBUG'=1, 'INFO'=2, 'WARN'=3, 'ERROR'=4),
method LowCardinality(String),
path String,
status_code UInt16,
duration_ms UInt32,
client_ip IPv4,
user_agent String,
request_id String,
INDEX idx_request_id request_id TYPE bloom_filter(0.01) GRANULARITY 4
) ENGINE = MergeTree()
PARTITION BY date
ORDER BY (service, level, timestamp)
TTL date + INTERVAL 30 DAY DELETE
SETTINGS index_granularity = 8192;

-- 实时错误监控
SELECT
service,
count() AS error_count,
uniq(request_id) AS affected_requests,
avg(duration_ms) AS avg_duration
FROM access_logs
WHERE timestamp >= now() - INTERVAL 5 MINUTE
AND level >= 'ERROR'
GROUP BY service
ORDER BY error_count DESC;

-- P99延迟分析
SELECT
service,
method,
quantile(0.5)(duration_ms) AS p50,
quantile(0.9)(duration_ms) AS p90,
quantile(0.99)(duration_ms) AS p99,
max(duration_ms) AS max_duration
FROM access_logs
WHERE date = today()
GROUP BY service, method
ORDER BY p99 DESC
LIMIT 20;

场景二:用户行为分析

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
-- 漏斗分析
SELECT
windowFunnel(86400)(event_time,
event_type = 'page_view',
event_type = 'add_to_cart',
event_type = 'checkout',
event_type = 'payment'
) AS funnel_step
FROM user_events
WHERE event_date >= '2025-03-01' AND event_date <= '2025-03-31'
GROUP BY user_id
-- 结果: 每个用户在24小时内到达了漏斗的第几步

-- 汇总漏斗
SELECT
step,
count() AS user_count
FROM (
SELECT
windowFunnel(86400)(event_time,
event_type = 'page_view',
event_type = 'add_to_cart',
event_type = 'checkout',
event_type = 'payment'
) AS step
FROM user_events
WHERE event_date BETWEEN '2025-03-01' AND '2025-03-31'
GROUP BY user_id
)
GROUP BY step
ORDER BY step;

-- 留存分析
SELECT
first_day,
dateDiff('day', first_day, return_day) AS day_n,
uniq(user_id) AS retained_users
FROM (
SELECT
user_id,
min(event_date) OVER (PARTITION BY user_id) AS first_day,
event_date AS return_day
FROM user_events
WHERE event_date >= '2025-03-01'
)
WHERE dateDiff('day', first_day, return_day) <= 30
GROUP BY first_day, day_n
ORDER BY first_day, day_n;

总结

ClickHouse是一个专为OLAP场景设计的高性能分析数据库:

  • 列式存储是性能的基石,同列数据的高压缩率和SIMD向量化执行是查询速度快的核心原因
  • MergeTree引擎族提供了丰富的数据管理能力:去重(Replacing)、聚合(Summing/Aggregating)、变更追踪(Collapsing)
  • 稀疏索引以极小的索引体积实现了高效的数据定位,跳数索引进一步补充了非主键列的过滤能力
  • 物化视图是实现实时聚合的利器,可以在数据写入时增量计算聚合结果
  • 数据摄入应以批量写入为主,避免频繁小批量插入产生过多Part
  • 集群部署通过Shard分片和Replica复制实现水平扩展和高可用
  • ClickHouse不适合OLTP场景(高频单行更新、事务需求),应与MySQL/PostgreSQL等OLTP数据库配合使用
作者 · authorzt
发布 · date2025-03-26
篇幅 · length3.7k 字 · 9 min
许可 · licenseCC BY-SA 4.0
$ echo "comments" · 评论