引言
ClickHouse是由Yandex开发的开源列式OLAP数据库,以极致的查询性能著称。它能够在数十亿行数据上实现亚秒级的分析查询,广泛应用于日志分析、用户行为分析、实时报表、监控告警等场景。ClickHouse的核心设计哲学是:通过列式存储、向量化执行、数据压缩和并行处理,最大化地利用现代硬件的计算能力。
列式存储原理
行存储 vs 列存储
graph TD
subgraph "行存储(MySQL/PostgreSQL)"
direction LR
R1["Row 1: id=1, name=Alice, age=25, city=Beijing"]
R2["Row 2: id=2, name=Bob, age=30, city=Shanghai"]
R3["Row 3: id=3, name=Carol, age=28, city=Guangzhou"]
end
subgraph "列存储(ClickHouse)"
C1["id列: [1, 2, 3, ...]"]
C2["name列: [Alice, Bob, Carol, ...]"]
C3["age列: [25, 30, 28, ...]"]
C4["city列: [Beijing, Shanghai, Guangzhou, ...]"]
end
列式存储的优势:
读取特定列
需要读取整行
只读取需要的列
数据压缩
压缩率一般
同类型数据压缩率极高
聚合查询
需要扫描整行
只扫描参与计算的列
SIMD向量化
不适合
同类型数据适合SIMD
单行查询
高效
需要合并多列
事务写入
高效
低效(不适合频繁单行插入)
数据压缩效果
1 2 3 4 5 6 7 8 典型压缩比示例(LZ4压缩): - 整数列 (Int32): 原始 400MB -> 压缩后 ~40MB (10:1) - 时间戳列 (DateTime): 原始 800MB -> 压缩后 ~30MB (27:1) - 低基数字符串 (status): 原始 500MB -> 压缩后 ~5MB (100:1) - 高基数字符串 (uuid): 原始 1GB -> 压缩后 ~400MB (2.5:1) 整体典型压缩比: 10:1 到 40:1 100GB原始数据可能只占用3-10GB磁盘空间
MergeTree 引擎族
MergeTree是ClickHouse最核心的表引擎族,几乎所有生产环境的表都使用MergeTree或其变体。
MergeTree 基础
1 2 3 4 5 6 7 8 9 10 11 12 13 CREATE TABLE events ( event_date Date , event_time DateTime, user_id UInt64, event_type LowCardinality(String), page_url String, duration_ms UInt32, properties Map(String, String) ) ENGINE = MergeTree()PARTITION BY toYYYYMM(event_date)ORDER BY (user_id, event_time) TTL event_date + INTERVAL 90 DAY SETTINGS index_granularity = 8192 ;
MergeTree 存储结构
graph TD
subgraph "MergeTree表: events"
subgraph "Partition: 202503"
P1["Part: all_1_1_0"] --> D1["primary.idx<br/>(稀疏索引)"]
P1 --> D2["event_date.bin + .mrk2"]
P1 --> D3["event_time.bin + .mrk2"]
P1 --> D4["user_id.bin + .mrk2"]
P1 --> D5["event_type.bin + .mrk2"]
P1 --> D6["...其他列文件"]
P1 --> D7["count.txt<br/>checksums.txt<br/>columns.txt"]
P2["Part: all_2_2_0"]
P3["Part: all_1_2_1<br/>(merged)"]
end
subgraph "Partition: 202504"
P4["Part: all_3_3_0"]
end
end
核心概念:
Partition(分区) :按分区键将数据物理隔离,便于数据管理(删除旧分区、分区级查询优化)
Part(数据片段) :每次INSERT生成一个新Part,后台异步合并(Merge)为更大的Part
Primary
Index(主键索引) :稀疏索引,每8192行(index_granularity)记录一个索引条目
Column
File(列文件) :每列独立存储为.bin(压缩数据)+
.mrk2(标记文件,用于定位granule)
稀疏索引工作原理
graph LR
subgraph "Primary Index (稀疏索引)"
I1["Mark 0: user_id=100, event_time=2025-03-01"]
I2["Mark 1: user_id=100, event_time=2025-03-15"]
I3["Mark 2: user_id=200, event_time=2025-03-01"]
I4["Mark 3: user_id=300, event_time=2025-03-10"]
end
subgraph "Data Granules (每8192行)"
G0["Granule 0: rows 0-8191"]
G1["Granule 1: rows 8192-16383"]
G2["Granule 2: rows 16384-24575"]
G3["Granule 3: rows 24576-32767"]
end
I1 --> G0
I2 --> G1
I3 --> G2
I4 --> G3
1 2 3 4 5 SELECT * FROM events WHERE user_id = 200 ;
MergeTree 引擎变体
graph TD
MT[MergeTree] --> RMT[ReplacingMergeTree<br/>去重 按主键保留最新版本]
MT --> SMT[SummingMergeTree<br/>自动聚合 按主键合并SUM]
MT --> AMT[AggregatingMergeTree<br/>增量聚合 存储中间状态]
MT --> CMT[CollapsingMergeTree<br/>折叠 通过sign列实现变更]
MT --> VCMT[VersionedCollapsingMergeTree<br/>带版本的折叠]
RMT --> RRMT[ReplicatedReplacingMergeTree<br/>复制版本]
SMT --> RSMT[ReplicatedSummingMergeTree]
AMT --> RAMT[ReplicatedAggregatingMergeTree]
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 CREATE TABLE users ( user_id UInt64, name String, email String, ver UInt64 ) ENGINE = ReplacingMergeTree(ver)ORDER BY user_id;SELECT * FROM users FINAL WHERE user_id = 12345 ;CREATE TABLE daily_stats ( date Date , page_url String, pv UInt64, uv UInt64, total_duration UInt64 ) ENGINE = SummingMergeTree((pv, uv, total_duration))ORDER BY (date , page_url);CREATE TABLE agg_stats ( date Date , category LowCardinality(String), cnt AggregateFunction(count, UInt64), amount_sum AggregateFunction(sum, Decimal (10 ,2 )), user_uniq AggregateFunction(uniq, UInt64) ) ENGINE = AggregatingMergeTree()ORDER BY (date , category);INSERT INTO agg_statsSELECT toDate(event_time) AS date , category, countState() AS cnt, sumState(amount) AS amount_sum, uniqState(user_id) AS user_uniqFROM raw_eventsGROUP BY date , category;
物化视图
物化视图是ClickHouse中实现实时聚合的核心机制。
flowchart LR
A[原始数据表<br/>raw_events] -->|INSERT触发| B[物化视图<br/>MV]
B -->|增量聚合| C[目标表<br/>agg_events]
D[客户端] -->|INSERT| A
D -->|查询聚合结果| C
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 CREATE TABLE raw_events ( event_time DateTime, user_id UInt64, event_type LowCardinality(String), amount Decimal (10 , 2 ), properties String ) ENGINE = MergeTree()PARTITION BY toYYYYMM(event_time)ORDER BY (event_type, event_time) TTL event_time + INTERVAL 30 DAY ;CREATE TABLE hourly_stats ( hour DateTime, event_type LowCardinality(String), event_count SimpleAggregateFunction(sum, UInt64), unique_users AggregateFunction(uniq, UInt64), total_amount SimpleAggregateFunction(sum, Decimal (12 , 2 )) ) ENGINE = AggregatingMergeTree()ORDER BY (event_type, hour );CREATE MATERIALIZED VIEW hourly_stats_mvTO hourly_stats AS SELECT toStartOfHour(event_time) AS hour , event_type, count () AS event_count, uniqState(user_id) AS unique_users, sum (amount) AS total_amountFROM raw_eventsGROUP BY hour , event_type;SELECT hour , event_type, event_count, uniqMerge(unique_users) AS uv, total_amountFROM hourly_statsWHERE hour >= '2025-03-15 00:00:00' GROUP BY hour , event_typeORDER BY hour ;
数据摄入
批量插入的最佳实践
1 2 3 4 5 6 7 8 9 10 11 12 INSERT INTO events FORMAT JSONEachRow {"event_time": "2025-03-15 10:00:00", "user_id": 1001 , "event_type": "page_view"} {"event_time": "2025-03-15 10:00:01", "user_id": 1002 , "event_type": "click"} ...CREATE TABLE events_buffer AS events ENGINE = Buffer(currentDatabase(), events, 16 , 10 , 100 , 10000 , 100000 , 1000000 , 10000000 );
从外部系统导入
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 CREATE TABLE events_kafka ( event_time DateTime, user_id UInt64, event_type String, amount Decimal (10 ,2 ) ) ENGINE = Kafka() SETTINGS kafka_broker_list = 'kafka1:9092,kafka2:9092' , kafka_topic_list = 'events' , kafka_group_name = 'clickhouse_consumer' , kafka_format = 'JSONEachRow' , kafka_num_consumers = 4 ;CREATE MATERIALIZED VIEW events_kafka_mv TO events AS SELECT * FROM events_kafka;CREATE TABLE mysql_orders ENGINE = MySQL('mysql_host:3306' , 'mydb' , 'orders' , 'user' , 'password' );INSERT INTO local_orders SELECT * FROM mysql_orders WHERE created_at >= '2025-01-01' ;INSERT INTO eventsSELECT * FROM s3( 'https://bucket.s3.amazonaws.com/data/*.parquet' , 'access_key' , 'secret_key' , 'Parquet' );
查询优化
ORDER BY 键的选择
ORDER BY键的选择直接决定了稀疏索引的效果:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 CREATE TABLE user_events (...) ENGINE = MergeTree()ORDER BY (user_id, event_time);CREATE TABLE analytics (...) ENGINE = MergeTree()ORDER BY (event_type, event_date, user_id);SET send_logs_level = 'trace' ;SELECT count () FROM events WHERE user_id = 12345 ;
跳数索引(Data Skipping
Index)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 CREATE TABLE logs ( timestamp DateTime, level LowCardinality(String), service LowCardinality(String), message String, trace_id String, INDEX idx_trace_id trace_id TYPE bloom_filter(0.01 ) GRANULARITY 4 , INDEX idx_service service TYPE set (100 ) GRANULARITY 4 , INDEX idx_message message TYPE tokenbf_v1(10240 , 3 , 0 ) GRANULARITY 4 ) ENGINE = MergeTree()PARTITION BY toYYYYMM(timestamp )ORDER BY (level, timestamp );SELECT * FROM logs WHERE trace_id = 'abc-123-def' ;
查询性能调优
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 SELECT * FROM events PREWHERE user_id = 12345 WHERE event_type = 'click' ;SELECT * FROM events WHERE user_id = 12345 ;SELECT event_time, event_type, duration_ms FROM events WHERE user_id = 12345 ;ALTER TABLE events MODIFY COLUMN event_type LowCardinality(String);SELECT uniqExact(user_id) FROM events;SELECT uniq(user_id) FROM events;SELECT uniqHLL12(user_id) FROM events;CREATE TABLE events_sampled ( event_time DateTime, user_id UInt64, event_type String ) ENGINE = MergeTree()ORDER BY (intHash32(user_id), event_time) SAMPLE BY intHash32(user_id);SELECT event_type, count () * 10 AS estimated_countFROM events_sampled SAMPLE 0.1 GROUP BY event_type;
集群部署
集群架构
graph TD
subgraph "ClickHouse Cluster (2 Shards x 2 Replicas)"
subgraph "Shard 1"
S1R1["Replica 1-1<br/>Node: ch1"]
S1R2["Replica 1-2<br/>Node: ch2"]
S1R1 <-->|ZooKeeper复制| S1R2
end
subgraph "Shard 2"
S2R1["Replica 2-1<br/>Node: ch3"]
S2R2["Replica 2-2<br/>Node: ch4"]
S2R1 <-->|ZooKeeper复制| S2R2
end
end
ZK[ZooKeeper / ClickHouse Keeper<br/>协调复制]
S1R1 --> ZK
S1R2 --> ZK
S2R1 --> ZK
S2R2 --> ZK
分布式表
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 CREATE TABLE events_local ON CLUSTER my_cluster ( event_time DateTime, user_id UInt64, event_type LowCardinality(String), amount Decimal (10 ,2 ) ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events' , '{replica}' )PARTITION BY toYYYYMM(event_time)ORDER BY (user_id, event_time);CREATE TABLE events_distributed ON CLUSTER my_cluster AS events_local ENGINE = Distributed(my_cluster, default , events_local, sipHash64(user_id));INSERT INTO events_distributed VALUES ('2025-03-15 10:00:00' , 1001 , 'click' , 0.00 ), ('2025-03-15 10:00:01' , 2002 , 'purchase' , 99.90 );SELECT toStartOfHour(event_time) AS hour , event_type, count () AS cnt, uniq(user_id) AS uv, sum (amount) AS totalFROM events_distributedWHERE event_time >= '2025-03-15' GROUP BY hour , event_typeORDER BY hour ;
集群配置
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 <clickhouse > <remote_servers > <my_cluster > <shard > <replica > <host > ch1</host > <port > 9000</port > </replica > <replica > <host > ch2</host > <port > 9000</port > </replica > </shard > <shard > <replica > <host > ch3</host > <port > 9000</port > </replica > <replica > <host > ch4</host > <port > 9000</port > </replica > </shard > </my_cluster > </remote_servers > <macros > <shard > 01</shard > <replica > replica_01</replica > </macros > </clickhouse >
典型应用场景
场景一:实时日志分析
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 CREATE TABLE access_logs ( timestamp DateTime64(3 ), date Date DEFAULT toDate(timestamp ), service LowCardinality(String), level Enum8('DEBUG' = 1 , 'INFO' = 2 , 'WARN' = 3 , 'ERROR' = 4 ), method LowCardinality(String), path String, status_code UInt16, duration_ms UInt32, client_ip IPv4, user_agent String, request_id String, INDEX idx_request_id request_id TYPE bloom_filter(0.01 ) GRANULARITY 4 ) ENGINE = MergeTree()PARTITION BY date ORDER BY (service, level, timestamp ) TTL date + INTERVAL 30 DAY DELETE SETTINGS index_granularity = 8192 ;SELECT service, count () AS error_count, uniq(request_id) AS affected_requests, avg (duration_ms) AS avg_durationFROM access_logsWHERE timestamp >= now() - INTERVAL 5 MINUTE AND level >= 'ERROR' GROUP BY serviceORDER BY error_count DESC ;SELECT service, method , quantile(0.5 )(duration_ms) AS p50, quantile(0.9 )(duration_ms) AS p90, quantile(0.99 )(duration_ms) AS p99, max (duration_ms) AS max_durationFROM access_logsWHERE date = today()GROUP BY service, method ORDER BY p99 DESC LIMIT 20 ;
场景二:用户行为分析
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 SELECT windowFunnel(86400 )(event_time, event_type = 'page_view' , event_type = 'add_to_cart' , event_type = 'checkout' , event_type = 'payment' ) AS funnel_stepFROM user_eventsWHERE event_date >= '2025-03-01' AND event_date <= '2025-03-31' GROUP BY user_idSELECT step, count () AS user_countFROM ( SELECT windowFunnel(86400 )(event_time, event_type = 'page_view' , event_type = 'add_to_cart' , event_type = 'checkout' , event_type = 'payment' ) AS step FROM user_events WHERE event_date BETWEEN '2025-03-01' AND '2025-03-31' GROUP BY user_id )GROUP BY stepORDER BY step;SELECT first_day, dateDiff('day' , first_day, return_day) AS day_n, uniq(user_id) AS retained_usersFROM ( SELECT user_id, min (event_date) OVER (PARTITION BY user_id) AS first_day, event_date AS return_day FROM user_events WHERE event_date >= '2025-03-01' )WHERE dateDiff('day' , first_day, return_day) <= 30 GROUP BY first_day, day_nORDER BY first_day, day_n;
总结
ClickHouse是一个专为OLAP场景设计的高性能分析数据库:
列式存储 是性能的基石,同列数据的高压缩率和SIMD向量化执行是查询速度快的核心原因
MergeTree引擎族 提供了丰富的数据管理能力:去重(Replacing)、聚合(Summing/Aggregating)、变更追踪(Collapsing)
稀疏索引 以极小的索引体积实现了高效的数据定位,跳数索引进一步补充了非主键列的过滤能力
物化视图 是实现实时聚合的利器,可以在数据写入时增量计算聚合结果
数据摄入 应以批量写入为主,避免频繁小批量插入产生过多Part
集群部署 通过Shard分片和Replica复制实现水平扩展和高可用
ClickHouse不适合OLTP场景(高频单行更新、事务需求),应与MySQL/PostgreSQL等OLTP数据库配合使用