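Introduction
Elasticsearch (ES for short) is an open-source distributed search and analytics engine built on Apache Lucene. Beyond powerful full-text search, it supports structured search, data analytics, log processing, and other workloads. With the spread of the ELK Stack, Elasticsearch has become a de facto standard for enterprise search and log analysis. This article covers the core of Elasticsearch systematically, from the underlying principles to hands-on tuning.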
Inverted Index Fundamentals
Forward Index vs. Inverted Index
graph LR
subgraph "Forward Index"
D1[Doc1] --> T1["The quick brown fox"]
D2[Doc2] --> T2["The quick dog"]
D3[Doc3] --> T3["The brown fox jumps"]
end
subgraph "Inverted Index"
W1[the] --> P1["Doc1:1, Doc2:1, Doc3:1"]
W2[quick] --> P2["Doc1:2, Doc2:2"]
W3[brown] --> P3["Doc1:3, Doc3:2"]
W4[fox] --> P4["Doc1:4, Doc3:3"]
W5[dog] --> P5["Doc2:3"]
W6[jumps] --> P6["Doc3:4"]
end
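An inverted index has two parts:
1. Term Dictionary: stores every distinct term, sorted lexicographically.
2. Posting List: for each term, the IDs of the documents that contain it, together with position information.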
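To make the two parts concrete, here is a minimal Python sketch that builds an inverted index for the three sample documents in the diagram above. The function name `build_inverted_index` and the lowercase-and-split-on-whitespace analysis are assumptions made for illustration; a real Lucene analyzer does considerably more.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}, using 1-based positions."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive analysis (an assumption): lowercase, then split on whitespace.
        for position, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(position)
    # Sorting the keys mimics the lexicographically ordered term dictionary.
    return dict(sorted(index.items()))

docs = {
    1: "The quick brown fox",
    2: "The quick dog",
    3: "The brown fox jumps",
}
index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```

Looking up `index["quick"]` returns the posting list directly, so a term query never has to scan document text.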
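Lucene's Inverted Index Structure
graph TD
B["Term Index<br/>FST (finite state transducer)<br/>prefix index that speeds up term lookup"] --> A["Term Dictionary<br/>all distinct terms in sorted order"]
A --> C[Posting List<br/>document ID list]
C --> D[Doc ID: Frame of Reference compression]
C --> E[Term Frequency: occurrences of the term]
C --> F[Position: position of the term]
C --> G[Offset: character offsets]
Posting list compression techniques: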
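Original doc IDs: [73, 300, 302, 332, 343, 372]

Frame of Reference (FOR) encoding:
1. Compute deltas: [73, 227, 2, 30, 11, 29]
2. Split into blocks: 128 IDs per block
3. Bit-pack: the largest value in a block determines how many bits each value needs

Roaring Bitmaps (used for filters):
- Partition the doc ID space into chunks of 65536
- Store sparse chunks as sorted arrays
- Store dense chunks as bitmaps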
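The delta and bit-width steps can be sketched in a few lines of Python. This is a simplified illustration, not Lucene's implementation: it treats the whole list as a single block instead of splitting it into 128-ID blocks, and the helper names are made up for the example.

```python
def delta_encode(doc_ids):
    """Replace each sorted doc ID with its gap from the previous one."""
    previous = 0
    deltas = []
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas

def bits_per_value(block):
    """Bits needed to represent the largest value in the block."""
    return max(1, max(block).bit_length())

def delta_decode(deltas):
    """Rebuild the original doc IDs with a running sum."""
    doc_ids, total = [], 0
    for delta in deltas:
        total += delta
        doc_ids.append(total)
    return doc_ids

doc_ids = [73, 300, 302, 332, 343, 372]
deltas = delta_encode(doc_ids)          # gaps between consecutive IDs
width = bits_per_value(deltas)          # 227 is the largest gap and fits in 8 bits
packed_bits = width * len(deltas)       # size after bit-packing
raw_bits = 32 * len(doc_ids)            # size as plain 32-bit integers
print(deltas, width, packed_bits, raw_bits)
```

Because the gaps are much smaller than the IDs themselves, six IDs shrink from 192 bits to 48 bits here.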
Mapping Design
Core Data Types
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "status": { "type": "keyword" },
      "price": { "type": "scaled_float", "scaling_factor": 100 },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
      },
      "location": { "type": "geo_point" },
      "tags": { "type": "keyword" },
      "content": { "type": "text", "analyzer": "ik_max_word" },
      "metadata": { "type": "object", "enabled": false }
    }
  }
}
text vs keyword
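graph TD
A[Choosing a field type] --> B{Need full-text search?}
B -->|Yes| C[text type<br/>analyzed into terms]
B -->|No| D{Need exact matching?}
D -->|Yes| E[keyword type<br/>not analyzed]
D -->|No| F[disabled<br/>not indexed]
C --> G[Use for: article bodies, product descriptions]
E --> H[Use for: status, ID, email, tags]
F --> I[Use for: fields that are stored but never queried]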
Mapping Design Best Practices
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "index.mapping.total_fields.limit": 2000,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["lowercase", "my_synonym"]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "_source": { "enabled": true },
    "properties": {
      "id": { "type": "keyword" },
      "title": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "category": { "type": "keyword" },
      "price": { "type": "double" },
      "created_at": { "type": "date" }
    }
  }
}
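Key design principles:
Set dynamic to strict: keeps unexpected fields from bloating the mapping
Use multi-fields where they help: one field can then serve both full-text search and exact matching
Disable indexing on fields that are never searched: "index": false reduces index size
Pick the narrowest numeric type that fits: prefer integer over long, and scaled_float over double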
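The reason scaled_float can be cheaper than double is that the value is stored as an integer after multiplying by the scaling factor. The sketch below imitates that idea in Python; the function names are hypothetical, and Python's `round()` resolves exact ties differently from Java, so the examples deliberately avoid tie values.

```python
def encode_scaled_float(value, scaling_factor):
    """Store a decimal as an integer: round(value * scaling_factor)."""
    return round(value * scaling_factor)

def decode_scaled_float(stored, scaling_factor):
    """Recover the approximate decimal value from the stored integer."""
    return stored / scaling_factor

# A price with two decimals survives a scaling factor of 100 unchanged.
stored = encode_scaled_float(59.99, 100)
restored = decode_scaled_float(stored, 100)

# Precision beyond the scaling factor is lost: 19.999 comes back as 20.0.
lossy = decode_scaled_float(encode_scaled_float(19.999, 100), 100)
print(stored, restored, lossy)
```

A scaling factor of 100 suits prices in currencies with two decimal places; a field that needs more precision would need a larger factor or a different type.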
Query DSL
Query vs. Filter
graph TD
A[Query Context] --> B[Computes a relevance score<br/>_score]
C[Filter Context] --> D[No scoring<br/>cacheable, faster]
A --> E["match, multi_match<br/>query_string"]
C --> F["term, terms, range<br/>exists, bool.filter"]
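One way to apply this split consistently is to build request bodies through a small helper that sends full-text input to `must` and exact conditions to `filter`. The helper below is an illustrative sketch that only produces a Python dictionary; `build_bool_query` and its parameters are assumptions, not part of any Elasticsearch client.

```python
import json

def build_bool_query(text_query, text_fields, exact_terms=None, ranges=None):
    """Put scored full-text clauses in `must` and unscored conditions in `filter`."""
    must = [{"multi_match": {"query": text_query, "fields": text_fields}}]
    filters = []
    for field, value in (exact_terms or {}).items():
        filters.append({"term": {field: value}})
    for field, bounds in (ranges or {}).items():
        filters.append({"range": {field: bounds}})
    return {"query": {"bool": {"must": must, "filter": filters}}}

body = build_bool_query(
    "苹果手机",
    ["title^3", "description"],
    exact_terms={"status": "on_sale"},
    ranges={"price": {"gte": 3000, "lte": 8000}},
)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Only the `multi_match` clause contributes to `_score`; the status and price conditions simply include or exclude documents.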
Common Query Examples
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "苹果手机",
            "fields": ["title^3", "description"],
            "type": "best_fields"
          }
        }
      ],
      "filter": [
        { "term": { "status": "on_sale" } },
        { "range": { "price": { "gte": 3000, "lte": 8000 } } },
        { "terms": { "category": ["phone", "electronics"] } }
      ],
      "should": [
        { "term": { "brand": { "value": "apple", "boost": 2.0 } } }
      ],
      "must_not": [
        { "term": { "is_deleted": true } }
      ],
      "minimum_should_match": 0
    }
  },
  "sort": [
    { "_score": "desc" },
    { "sales": "desc" },
    { "created_at": "desc" }
  ],
  "from": 0,
  "size": 20,
  "highlight": {
    "pre_tags": ["<em>"],
    "post_tags": ["</em>"],
    "fields": {
      "title": {},
      "description": { "fragment_size": 150 }
    }
  },
  "_source": ["id", "title", "price", "brand", "image_url"]
}
GET /orders/_search
{
  "query": {
    "nested": {
      "path": "items",
      "query": {
        "bool": {
          "must": [
            { "match": { "items.product_name": "iPhone" } },
            { "range": { "items.quantity": { "gte": 2 } } }
          ]
        }
      }
    }
  }
}

GET /restaurants/_search
{
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "3km",
          "location": { "lat": 39.9, "lon": 116.4 }
        }
      }
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": { "lat": 39.9, "lon": 116.4 },
        "order": "asc"
      }
    }
  ]
}

GET /products/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "手机" } },
      "functions": [
        {
          "field_value_factor": {
            "field": "sales",
            "modifier": "log1p",
            "factor": 0.1
          }
        },
        {
          "gauss": {
            "created_at": { "origin": "now", "scale": "30d", "decay": 0.5 }
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}
Aggregations
GET /orders/_search
{
  "size": 0,
  "aggs": {
    "monthly_stats": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "month",
        "format": "yyyy-MM"
      },
      "aggs": {
        "total_amount": { "sum": { "field": "amount" } },
        "avg_amount": { "avg": { "field": "amount" } },
        "category_breakdown": {
          "terms": { "field": "category", "size": 10 },
          "aggs": {
            "revenue": { "sum": { "field": "amount" } }
          }
        },
        "amount_percentiles": {
          "percentiles": {
            "field": "amount",
            "percents": [50, 75, 90, 95, 99]
          }
        }
      }
    },
    "price_ranges": {
      "range": {
        "field": "amount",
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 500 },
          { "from": 500, "to": 1000 },
          { "from": 1000 }
        ]
      }
    }
  }
}

GET /orders/_search
{
  "size": 0,
  "aggs": {
    "monthly_sales": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "month"
      },
      "aggs": {
        "revenue": { "sum": { "field": "amount" } }
      }
    },
    "avg_monthly_revenue": {
      "avg_bucket": { "buckets_path": "monthly_sales>revenue" }
    },
    "max_monthly_revenue": {
      "max_bucket": { "buckets_path": "monthly_sales>revenue" }
    }
  }
}
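To show what the second request computes, the sketch below reproduces its logic over a handful of in-memory orders: bucket by calendar month, sum `amount` per bucket, then derive the average and maximum across buckets, as `avg_bucket` and `max_bucket` do. The sample orders are invented for illustration, and the code only mimics the aggregation semantics rather than calling Elasticsearch.

```python
from collections import defaultdict

# Hypothetical sample data, not taken from any real index.
orders = [
    {"created_at": "2025-01-05", "amount": 120.0},
    {"created_at": "2025-01-20", "amount": 80.0},
    {"created_at": "2025-02-11", "amount": 300.0},
    {"created_at": "2025-03-02", "amount": 50.0},
    {"created_at": "2025-03-28", "amount": 350.0},
]

def monthly_revenue(docs):
    """date_histogram by month with a sum sub-aggregation."""
    buckets = defaultdict(float)
    for doc in docs:
        month = doc["created_at"][:7]   # "yyyy-MM" bucket key
        buckets[month] += doc["amount"]
    return dict(sorted(buckets.items()))

revenue = monthly_revenue(orders)

# Pipeline aggregations run over the buckets, not over the documents.
avg_monthly_revenue = sum(revenue.values()) / len(revenue)
max_month = max(revenue, key=revenue.get)
print(revenue, avg_monthly_revenue, max_month)
```

The pipeline step reads only three bucket values here, which is why pipeline aggregations are cheap compared with the bucketing that feeds them.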
Cluster Architecture
Node Roles
graph TD
subgraph "ES Cluster"
M1["Master Node<br/>cluster management, metadata<br/>node.roles: master"]
M2["Master Node<br/>master-eligible"]
M3["Master Node<br/>master-eligible"]
D1["Data Node<br/>stores data, executes searches<br/>node.roles: data"]
D2["Data Node"]
D3["Data Node"]
D4["Data Node"]
C1["Coordinating Node<br/>routes requests, merges results<br/>node.roles: empty"]
I1["Ingest Node<br/>data pre-processing<br/>node.roles: ingest"]
end
Client --> C1
C1 --> D1
C1 --> D2
C1 --> D3
C1 --> D4
M1 -.->|manages| D1
M1 -.->|manages| D2
M1 -.->|manages| D3
M1 -.->|manages| D4
Sharding Strategy
graph TD
subgraph "Index: products (3 Primary + 1 Replica)"
subgraph "Node 1"
P0[Primary Shard 0]
R2[Replica Shard 2]
end
subgraph "Node 2"
P1[Primary Shard 1]
R0[Replica Shard 0]
end
subgraph "Node 3"
P2[Primary Shard 2]
R1[Replica Shard 1]
end
end
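Choosing the number of shards:
Rule of thumb:
  Target shard size: 10-50 GB (around 30 GB is ideal)
  Primary shards = estimated total data size / target shard size

Example:
  - Estimated data size: 300 GB
  - Primary shards = 300 / 30 = 10
  - With 1 replica: 20 shards in total

Notes:
  - Every shard has memory overhead (roughly tens of MB of heap)
  - Keep each node below roughly 600-800 shards
  - The primary shard count cannot be changed in place after index creation (changing it requires a reindex, or the split/shrink APIs)
  - Avoid over-sharding small indices (1-2 shards is enough)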
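The rule of thumb translates directly into a small calculation. The function below is a sketch of that arithmetic only; the name `plan_shards` and the 30 GB default are taken from the example above rather than from any Elasticsearch API.

```python
import math

def plan_shards(total_data_gb, target_shard_gb=30, replicas=1):
    """Estimate primary and total shard counts from the expected data size."""
    # Round up so that no shard exceeds the target size; never go below 1.
    primaries = max(1, math.ceil(total_data_gb / target_shard_gb))
    return {"primaries": primaries, "total": primaries * (1 + replicas)}

print(plan_shards(300))   # the worked example: 300 GB at 30 GB per shard
print(plan_shards(5))     # a small index still gets a single primary
```

Treat the result as a starting point: growth forecasts and query patterns can justify a different target size.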
Search Execution Flow
sequenceDiagram
participant C as Client
participant Co as Coordinating Node
participant S0 as Shard 0
participant S1 as Shard 1
participant S2 as Shard 2
C->>Co: Search Request
Note over Co: Query Phase (scatter)
Co->>S0: Query (from=0, size=10)
Co->>S1: Query (from=0, size=10)
Co->>S2: Query (from=0, size=10)
S0->>Co: Top 10 doc IDs + scores
S1->>Co: Top 10 doc IDs + scores
S2->>Co: Top 10 doc IDs + scores
Note over Co: Merge and sort, keep the global top 10
Note over Co: Fetch Phase (gather)
Co->>S0: Fetch doc 3, 7
Co->>S1: Fetch doc 1, 5, 8
Co->>S2: Fetch doc 2, 4, 6, 9, 10
S0->>Co: Document data
S1->>Co: Document data
S2->>Co: Document data
Co->>C: Final Results (10 documents)
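The two phases can be imitated with plain Python data structures. In the sketch below each shard is a dictionary of document scores, the query phase returns only IDs and scores, and the fetch phase loads bodies for the global winners only. The shard contents and function names are invented for illustration.

```python
import heapq

def query_phase(shard_scores, size):
    """Return the shard's local top hits as (score, doc_id); no bodies yet."""
    return heapq.nlargest(
        size, ((score, doc_id) for doc_id, score in shard_scores.items())
    )

def search(shards, stored_docs, size):
    # Query phase (scatter): every shard reports its own top `size` candidates.
    candidates = []
    for shard_id, shard_scores in shards.items():
        for score, doc_id in query_phase(shard_scores, size):
            candidates.append((score, shard_id, doc_id))
    # The coordinating node merges the candidates and keeps the global top `size`.
    winners = heapq.nlargest(size, candidates)
    # Fetch phase (gather): load bodies only for the winners, from the owning shard.
    return [
        {"_id": doc_id, "_score": score, "_source": stored_docs[shard_id][doc_id]}
        for score, shard_id, doc_id in winners
    ]

# Hypothetical shard contents: doc_id -> relevance score.
shards = {
    0: {"a": 0.9, "b": 0.2},
    1: {"c": 0.7, "d": 0.8},
    2: {"e": 0.1, "f": 0.95},
}
stored_docs = {
    shard_id: {doc_id: {"title": f"document {doc_id}"} for doc_id in scores}
    for shard_id, scores in shards.items()
}
hits = search(shards, stored_docs, size=3)
print([hit["_id"] for hit in hits])
```

With a non-zero `from`, every shard would have to return `from + size` candidates, which is why deep paging with from/size becomes expensive.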
Performance Tuning
Write Optimization
POST /_bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "title": "iPhone 15", "price": 5999 }
{ "index": { "_index": "products", "_id": "2" } }
{ "title": "Galaxy S24", "price": 4999 }

PUT /products/_settings
{
  "index": {
    "refresh_interval": "30s",
    "number_of_replicas": 0,
    "translog": {
      "durability": "async",
      "sync_interval": "30s"
    }
  }
}
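Key points for write performance:
Use the Bulk API: aim for 5-15 MB per batch
Increase refresh_interval: raise it from the default 1s to 30s or more to create segments less often
Turn off replicas during the initial load: set number_of_replicas: 0, then restore it once the import finishes
Tune translog flushing: asynchronous fsync improves write throughput, at the cost of possibly losing the most recent writes if a node crashes
Size the thread pool sensibly: thread_pool.write.queue_size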
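A Bulk API body is newline-delimited JSON: an action line followed by a source line for each document, with a trailing newline. The sketch below builds such bodies and starts a new batch once a byte budget is reached. It only produces strings; sending them is left to whichever HTTP or Elasticsearch client is in use, and the `bulk_batches` name and 5 MB default are assumptions based on the guidance above.

```python
import json

def bulk_batches(index, docs, max_bytes=5 * 1024 * 1024):
    """Yield NDJSON bodies for POST /_bulk, each within max_bytes when possible."""
    lines, size = [], 0
    for doc_id, source in docs:
        action = json.dumps({"index": {"_index": index, "_id": doc_id}})
        payload = json.dumps(source, ensure_ascii=False)
        entry = f"{action}\n{payload}\n"
        entry_size = len(entry.encode("utf-8"))
        # Flush the current batch before it would exceed the budget.
        if lines and size + entry_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(entry)
        size += entry_size
    if lines:
        yield "".join(lines)

docs = [
    ("1", {"title": "iPhone 15", "price": 5999}),
    ("2", {"title": "Galaxy S24", "price": 4999}),
]
batches = list(bulk_batches("products", docs))
print(batches[0], end="")
```

A single document larger than the budget still goes out as its own batch, so the limit is a target rather than a hard cap.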
Query Optimization
# Avoid: a scored match query on an exact-value field
{ "query": { "match": { "status": "active" } } }

# Prefer: the same condition in filter context (no scoring, cacheable)
{ "query": { "bool": { "filter": { "term": { "status": "active" } } } } }

# Return only the fields you need
{ "_source": ["id", "title", "price"] }

# Custom routing: index and search only the shard that holds this user's data
PUT /orders/_doc/1?routing=user_123
{ "user_id": "user_123", "amount": 99.9 }

GET /orders/_search?routing=user_123
{ "query": { "term": { "user_id": "user_123" } } }

# Deep pagination with search_after instead of from/size
GET /products/_search
{
  "size": 20,
  "sort": [ { "created_at": "desc" }, { "_id": "asc" } ],
  "search_after": ["2025-03-15T10:30:00", "abc123"]
}
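The search_after mechanism is easier to follow as a loop: each page returns the sort values of its last hit, and the next request resumes strictly after that cursor. The sketch below simulates this over an in-memory list, sorted by `created_at` descending with `_id` ascending as the tiebreaker. The documents are invented, and `created_at` is an integer timestamp here purely to keep the comparison simple.

```python
def sort_key(doc):
    # created_at descending, then _id ascending as the tiebreaker.
    return (-doc["created_at"], doc["_id"])

def search_page(docs, size, search_after=None):
    """Return one page plus the cursor to pass as search_after next time."""
    ordered = sorted(docs, key=sort_key)
    if search_after is not None:
        created_at, doc_id = search_after
        cursor = (-created_at, doc_id)
        # Keep only documents that sort strictly after the cursor.
        ordered = [doc for doc in ordered if sort_key(doc) > cursor]
    page = ordered[:size]
    next_cursor = [page[-1]["created_at"], page[-1]["_id"]] if page else None
    return page, next_cursor

# Hypothetical documents; pairs share a timestamp to exercise the tiebreaker.
docs = [{"_id": f"p{i}", "created_at": 1000 + i // 2} for i in range(6)]

cursor, seen = None, []
while True:
    page, cursor = search_page(docs, size=2, search_after=cursor)
    if not page:
        break
    seen.extend(doc["_id"] for doc in page)
print(seen)
```

The tiebreaker matters: without a unique final sort field, documents that share a timestamp could be skipped or repeated across pages.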
JVM and System Configuration
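# jvm.options: set the minimum and maximum heap to the same value
-Xms16g
-Xmx16g

# /etc/sysctl.conf
vm.max_map_count=262144
vm.swappiness=1

# /etc/security/limits.conf
elasticsearch soft nofile 65536
elasticsearch hard nofile 65536

# Disable swap
sudo swapoff -a

# elasticsearch.yml: lock the process memory in RAM
bootstrap.memory_lock: true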
Index Lifecycle Management (ILM)
PUT /_ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "7d",
            "max_docs": 100000000
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "freeze": {},
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
graph LR
A["Hot Phase<br/>0-30 days<br/>SSD storage<br/>frequent reads and writes"] --> B["Warm Phase<br/>30-90 days<br/>merge segments<br/>reduce shard count"]
B --> C["Cold Phase<br/>90-365 days<br/>frozen index<br/>HDD storage"]
C --> D["Delete Phase<br/>365+ days<br/>delete the index"]
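The phase boundaries in the policy reduce to a threshold lookup on index age. The sketch below mirrors the `min_age` values from logs_policy; it is a simplified model, since ILM measures age from rollover when a rollover action is in use and only moves an index forward once the current phase's actions have completed. The names `PHASES` and `phase_for_age` are made up for the example.

```python
# (phase name, min_age in days), checked from the oldest phase down.
PHASES = [("delete", 365), ("cold", 90), ("warm", 30), ("hot", 0)]

def phase_for_age(age_days):
    """Return the phase whose min_age threshold the index has most recently passed."""
    for name, min_age_days in PHASES:
        if age_days >= min_age_days:
            return name
    raise ValueError("age_days must be non-negative")

for age in (0, 29, 30, 90, 400):
    print(age, phase_for_age(age))
```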
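Summary
Elasticsearch is a powerful distributed system, but one that needs careful tuning:
Inverted index: the foundation of full-text search; understanding how it works leads to better mapping design
Mapping design: choose field types and analyzers according to how the data will be queried
Query DSL: prefer filter context where possible, and combine conditions with bool queries
Aggregations: they support complex multi-dimensional statistics, but watch memory consumption
Cluster architecture: the shard count is a critical choice that directly affects query performance and resource utilization
Write optimization: focus on Bulk batching and the refresh_interval setting
Query optimization: the key is sensible use of filter, routing, and search_after pagination
ILM: the recommended way to manage log-style time-series data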