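Elasticsearch in Practice: From Getting Started to Cluster Optimization

Introduction

Elasticsearch (ES for short) is an open-source distributed search and analytics engine built on Apache Lucene. Beyond powerful full-text search, it supports structured search, data analytics, log processing, and other workloads. With the spread of the ELK Stack, Elasticsearch has become the de facto standard for enterprise search and log analytics. This article walks through its core concepts, from the underlying principles to practical tuning.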

Inverted Index Principles

Forward Index vs. Inverted Index

graph LR
    subgraph "Forward Index"
        D1[Doc1] --> T1["The quick brown fox"]
        D2[Doc2] --> T2["The quick dog"]
        D3[Doc3] --> T3["The brown fox jumps"]
    end

    subgraph "Inverted Index"
        W1[the] --> P1["Doc1:1, Doc2:1, Doc3:1"]
        W2[quick] --> P2["Doc1:2, Doc2:2"]
        W3[brown] --> P3["Doc1:3, Doc3:2"]
        W4[fox] --> P4["Doc1:4, Doc3:3"]
        W5[dog] --> P5["Doc2:3"]
        W6[jumps] --> P6["Doc3:4"]
    end

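An inverted index has two parts:
1. Term Dictionary: stores every distinct term, sorted lexicographically
2. Posting List: for each term, the list of IDs of the documents that contain it, along with position information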
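The two parts described above can be sketched in a few lines of Python. This is only an illustration of the data structure, assuming a naive lowercase whitespace tokenizer; the function name `build_inverted_index` is hypothetical and Lucene's real implementation is far more involved.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Naive whitespace tokenizer; positions are 1-based like the diagram.
        for position, token in enumerate(text.lower().split(), start=1):
            index[token].append((doc_id, position))
    # Sort the terms to mimic the ordered term dictionary.
    return dict(sorted(index.items()))


docs = {
    "Doc1": "The quick brown fox",
    "Doc2": "The quick dog",
    "Doc3": "The brown fox jumps",
}
index = build_inverted_index(docs)
print(index["brown"])
```

Looking up a term is then a dictionary access followed by a scan of its postings, which is why term queries do not need to touch documents that lack the term.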

Lucene's Inverted Index Structure

graph TD
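    A[Term Index<br/>FST finite state transducer<br/>in-memory prefix structure<br/>speeds up term lookup] --> B[Term Dictionary<br/>sorted term blocks on disk]
    B --> C[Posting List<br/>list of document IDs]
    C --> D[Doc ID: Frame of Reference compression]
    C --> E[Term Frequency]
    C --> F[Position: position of the term]
    C --> G[Offset: character offsets]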

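Posting list compression techniques:

Original Doc IDs: [73, 300, 302, 332, 343, 372]

Frame of Reference (FOR) encoding:
1. Compute deltas: [73, 227, 2, 30, 11, 29]
2. Split into blocks: 128 IDs per block
3. Bit-pack: the largest value in a block determines how many bits each value needs

Roaring Bitmaps (used for filter caching):
- Split the Doc ID space into chunks of 65536
- Sparse chunks are stored as sorted arrays
- Dense chunks are stored as bitmaps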
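The delta and bit-width steps of FOR encoding can be checked with a small sketch. The helper names below are hypothetical, and the code only models the arithmetic; it does not reproduce Lucene's on-disk block format.

```python
def delta_encode(doc_ids):
    """Turn sorted doc IDs into gaps; the first value is kept as-is."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def delta_decode(deltas):
    """Rebuild the original doc IDs with a running sum."""
    doc_ids, total = [], 0
    for d in deltas:
        total += d
        doc_ids.append(total)
    return doc_ids


def bits_per_value(block):
    """Bits needed to store the largest value in a block (at least 1)."""
    return max(max(block).bit_length(), 1)


doc_ids = [73, 300, 302, 332, 343, 372]
deltas = delta_encode(doc_ids)
width = bits_per_value(deltas)
print(deltas, width, "bits per value")
```

The largest delta here is 227, so every value in this block fits in 8 bits instead of the 32 bits of a raw integer.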

Mapping Design

Core Data Types

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "status": {
        "type": "keyword"
      },
      "price": {
        "type": "scaled_float",
        "scaling_factor": 100
      },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||epoch_millis"
      },
      "location": {
        "type": "geo_point"
      },
      "tags": {
        "type": "keyword"
      },
      "content": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "metadata": {
        "type": "object",
        "enabled": false
      }
    }
  }
}

text vs keyword

graph TD
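    A[Choosing a field type] --> B{Need full-text search?}
    B -->|Yes| C[text type<br/>analyzed]
    B -->|No| D{Need exact matching?}
    D -->|Yes| E[keyword type<br/>not analyzed]
    D -->|No| F[disabled<br/>not indexed]
    C --> G[Use for: article body, product descriptions]
    E --> H[Use for: status, ID, email, tags]
    F --> I[Use for: fields that are stored but never queried]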

Mapping Design Best Practices

{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "index.mapping.total_fields.limit": 2000,
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["lowercase", "my_synonym"]
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      }
    }
  },
  "mappings": {
    "dynamic": "strict",
    "_source": {
      "enabled": true
    },
    "properties": {
      "id": { "type": "keyword" },
      "title": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "category": { "type": "keyword" },
      "price": { "type": "double" },
      "created_at": { "type": "date" }
    }
  }
}

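Key design principles:

Set dynamic to strict: prevents unexpected fields from bloating the mapping
Use multi-fields where appropriate: one field can serve both full-text search and exact matching
Disable indexing on fields that are never searched: "index": false reduces index size
Pick the smallest suitable numeric type: prefer integer over long, and scaled_float over double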
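To see why scaled_float can be cheaper than double, it helps to model what it stores: the value multiplied by scaling_factor and rounded to an integer. The sketch below only mirrors that arithmetic; `to_scaled` and `from_scaled` are hypothetical helper names, not Elasticsearch APIs.

```python
def to_scaled(value, scaling_factor):
    """Encode a decimal as the integer that would be indexed."""
    return round(value * scaling_factor)


def from_scaled(stored, scaling_factor):
    """Decode the stored integer back to a decimal."""
    return stored / scaling_factor


stored = to_scaled(59.99, 100)
print(stored, from_scaled(stored, 100))

# Precision beyond the scaling factor is lost: 59.999 is stored as 6000.
print(to_scaled(59.999, 100))
```

With a scaling_factor of 100, prices keep two decimal places, which is usually enough for currency, and anything finer is rounded away.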

Query DSL

Query vs. Filter Context

graph TD
    A[Query Context] --> B[Computes relevance score<br/>_score]
    C[Filter Context] --> D[No scoring<br/>cacheable, faster]

    A --> E["match, multi_match<br/>query_string"]
    C --> F["term, terms, range<br/>exists, bool.filter"]

Common Query Examples

// Full-text search
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "苹果手机",
            "fields": ["title^3", "description"],
            "type": "best_fields"
          }
        }
      ],
      "filter": [
        { "term": { "status": "on_sale" } },
        { "range": { "price": { "gte": 3000, "lte": 8000 } } },
        { "terms": { "category": ["phone", "electronics"] } }
      ],
      "should": [
        { "term": { "brand": { "value": "apple", "boost": 2.0 } } }
      ],
      "must_not": [
        { "term": { "is_deleted": true } }
      ],
      "minimum_should_match": 0
    }
  },
  "sort": [
    { "_score": "desc" },
    { "sales": "desc" },
    { "created_at": "desc" }
  ],
  "from": 0,
  "size": 20,
  "highlight": {
    "pre_tags": ["<em>"],
    "post_tags": ["</em>"],
    "fields": {
      "title": {},
      "description": { "fragment_size": 150 }
    }
  },
  "_source": ["id", "title", "price", "brand", "image_url"]
}

// Nested object query
GET /orders/_search
{
  "query": {
    "nested": {
      "path": "items",
      "query": {
        "bool": {
          "must": [
            { "match": { "items.product_name": "iPhone" } },
            { "range": { "items.quantity": { "gte": 2 } } }
          ]
        }
      }
    }
  }
}

// Geo-distance query
GET /restaurants/_search
{
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "3km",
          "location": { "lat": 39.9, "lon": 116.4 }
        }
      }
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": { "lat": 39.9, "lon": 116.4 },
        "order": "asc"
      }
    }
  ]
}

// Function Score: custom scoring
GET /products/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "手机" } },
      "functions": [
        {
          "field_value_factor": {
            "field": "sales",
            "modifier": "log1p",
            "factor": 0.1
          }
        },
        {
          "gauss": {
            "created_at": {
              "origin": "now",
              "scale": "30d",
              "decay": 0.5
            }
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}
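How the function_score request above combines its parts can be worked through numerically. The sketch below assumes the documented formulas for the log1p modifier and Gaussian decay, treats document age in days as the distance from origin, and uses hypothetical function names; it is an approximation for intuition, not a reimplementation of Elasticsearch scoring.

```python
import math


def field_value_factor_log1p(value, factor):
    """log1p modifier: common logarithm of (factor * value + 1)."""
    return math.log10(factor * value + 1)


def gauss_decay(distance, scale, decay, offset=0.0):
    """Gaussian decay: returns `decay` when distance - offset equals scale."""
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    d = max(0.0, abs(distance) - offset)
    return math.exp(-(d ** 2) / (2.0 * sigma_sq))


def final_score(query_score, sales, age_days):
    functions = [
        field_value_factor_log1p(sales, factor=0.1),
        gauss_decay(age_days, scale=30.0, decay=0.5),
    ]
    combined = sum(functions)      # score_mode: sum
    return query_score * combined  # boost_mode: multiply


# A 30-day-old product with 90 sales and a text score of 2.0.
print(final_score(query_score=2.0, sales=90, age_days=30))
```

A product exactly one scale (30 days) old gets a decay value of 0.5, and its sales contribute on a logarithmic scale, so a tenfold increase in sales does not produce a tenfold increase in score.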

Aggregations

// Multi-dimensional aggregation
GET /orders/_search
{
  "size": 0,
  "aggs": {
    "monthly_stats": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "month",
        "format": "yyyy-MM"
      },
      "aggs": {
        "total_amount": { "sum": { "field": "amount" } },
        "avg_amount": { "avg": { "field": "amount" } },
        "category_breakdown": {
          "terms": {
            "field": "category",
            "size": 10
          },
          "aggs": {
            "revenue": { "sum": { "field": "amount" } }
          }
        },
        "amount_percentiles": {
          "percentiles": {
            "field": "amount",
            "percents": [50, 75, 90, 95, 99]
          }
        }
      }
    },
    "price_ranges": {
      "range": {
        "field": "amount",
        "ranges": [
          { "to": 100 },
          { "from": 100, "to": 500 },
          { "from": 500, "to": 1000 },
          { "from": 1000 }
        ]
      }
    }
  }
}

// Pipeline aggregations
GET /orders/_search
{
  "size": 0,
  "aggs": {
    "monthly_sales": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "month"
      },
      "aggs": {
        "revenue": { "sum": { "field": "amount" } }
      }
    },
    "avg_monthly_revenue": {
      "avg_bucket": {
        "buckets_path": "monthly_sales>revenue"
      }
    },
    "max_monthly_revenue": {
      "max_bucket": {
        "buckets_path": "monthly_sales>revenue"
      }
    }
  }
}

Cluster Architecture

Node Roles

graph TD
    subgraph "ES Cluster"
        M1["Master Node<br/>cluster management, metadata<br/>node.roles: master"]
        M2["Master Node<br/>master-eligible"]
        M3["Master Node<br/>master-eligible"]

        D1["Data Node<br/>stores data, runs searches<br/>node.roles: data"]
        D2["Data Node"]
        D3["Data Node"]
        D4["Data Node"]

        C1["Coordinating Node<br/>request routing, result merging<br/>node.roles: empty"]

        I1["Ingest Node<br/>data preprocessing<br/>node.roles: ingest"]
    end

    Client --> C1
    C1 --> D1
    C1 --> D2
    C1 --> D3
    C1 --> D4

    M1 -.->|manages| D1
    M1 -.->|manages| D2
    M1 -.->|manages| D3
    M1 -.->|manages| D4

Shard Strategy

graph TD
    subgraph "Index: products (3 Primary + 1 Replica)"
        subgraph "Node 1"
            P0[Primary Shard 0]
            R2[Replica Shard 2]
        end
        subgraph "Node 2"
            P1[Primary Shard 1]
            R0[Replica Shard 0]
        end
        subgraph "Node 3"
            P2[Primary Shard 2]
            R1[Replica Shard 1]
        end
    end

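Choosing the number of shards:

Rule of thumb:
Size per shard: 10-50GB (around 30GB is ideal)
Number of shards = estimated total data size / size per shard

Example:
- Estimated data size: 300GB
- Primary shards = 300 / 30 = 10
- With 1 replica: 20 shards in total

Caveats:
- Every shard carries memory overhead (on the order of tens of MB of heap each)
- Keep each node below roughly 600-800 shards (an older guideline is at most 20 shards per GB of heap, and cluster.max_shards_per_node defaults to 1000)
- The primary shard count is fixed at index creation; changing it requires a reindex or the split/shrink APIs
- Avoid over-sharding small indices (1-2 shards is enough)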
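The rule of thumb above is simple enough to turn into a helper. This is a minimal sketch under the article's own assumptions (a 30GB target shard size and one replica by default); `plan_shards` is a hypothetical name and the result is a starting point, not a substitute for load testing.

```python
import math


def plan_shards(total_gb, target_shard_gb=30, replicas=1):
    """Return (primary_shards, total_shards) for the rule of thumb."""
    primaries = max(1, math.ceil(total_gb / target_shard_gb))
    total = primaries * (1 + replicas)
    return primaries, total


print(plan_shards(300))  # the 300GB example from the text
print(plan_shards(5))    # a small index still gets one primary
```

Rounding up keeps each shard at or below the target size, and the floor of one primary covers the small-index case.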

Search Execution Flow

sequenceDiagram
    participant C as Client
    participant Co as Coordinating Node
    participant S0 as Shard 0
    participant S1 as Shard 1
    participant S2 as Shard 2

    C->>Co: Search Request

    Note over Co: Query Phase (scatter)
    Co->>S0: Query (from=0, size=10)
    Co->>S1: Query (from=0, size=10)
    Co->>S2: Query (from=0, size=10)

    S0->>Co: Top 10 doc IDs + scores
    S1->>Co: Top 10 doc IDs + scores
    S2->>Co: Top 10 doc IDs + scores

    Note over Co: Merge and sort, take the global top 10

    Note over Co: Fetch Phase (gather)
    Co->>S0: Fetch doc 3, 7
    Co->>S1: Fetch doc 1, 5, 8
    Co->>S2: Fetch doc 2, 4, 6, 9, 10

    S0->>Co: Document data
    S1->>Co: Document data
    S2->>Co: Document data

    Co->>C: Final Results (10 documents)
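The two-phase flow in the diagram can be simulated in memory. In the sketch below each shard is just a list of (score, doc_id) tuples, and the names `query_phase` and `coordinate` are hypothetical; it illustrates the scatter, merge, and fetch-planning steps and ignores networking, replicas, and tie-breaking rules.

```python
import heapq


def query_phase(shard_hits, from_, size):
    """Each shard returns its own top (from + size) hits."""
    return heapq.nlargest(from_ + size, shard_hits)


def coordinate(shards, from_=0, size=10):
    candidates = []
    for shard_id, hits in shards.items():
        for score, doc_id in query_phase(hits, from_, size):
            candidates.append((score, shard_id, doc_id))
    # Merge: global sort by score, then cut out the requested page.
    candidates.sort(key=lambda c: c[0], reverse=True)
    page = candidates[from_:from_ + size]
    # Fetch phase: group the winning doc IDs by the shard that owns them.
    fetch_plan = {}
    for _, shard_id, doc_id in page:
        fetch_plan.setdefault(shard_id, []).append(doc_id)
    return page, fetch_plan


shards = {
    "S0": [(0.9, "a"), (0.2, "b")],
    "S1": [(0.8, "c"), (0.7, "d")],
    "S2": [(0.1, "e")],
}
page, fetch_plan = coordinate(shards, size=3)
print(fetch_plan)
```

Only the shards that own a winning document appear in the fetch plan, so full documents are loaded once per result rather than once per candidate.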

Performance Tuning

Write Optimization

// Bulk writes (Bulk API)
POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"title": "iPhone 15", "price": 5999}
{"index": {"_index": "products", "_id": "2"}}
{"title": "Galaxy S24", "price": 4999}

// Write-tuning settings
PUT /products/_settings
{
  "index": {
    "refresh_interval": "30s",
    "number_of_replicas": 0,
    "translog": {
      "durability": "async",
      "sync_interval": "30s"
    }
  }
}

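Key points for write performance:

Use the Bulk API: aim for 5-15MB per batch
Increase refresh_interval: raise it from the default 1s to 30s or more so segments are created less often
Drop replicas during the initial load: set number_of_replicas: 0, then restore it once the import finishes
Tune translog durability: async fsync improves write throughput, but acknowledged writes from the last sync_interval can be lost if a node crashes
Size the write thread pool sensibly: thread_pool.write.queue_size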
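The batch-size advice can be applied when building Bulk request bodies on the client. The sketch below uses only the standard library to produce NDJSON bodies capped at a byte budget; `bulk_batches` is a hypothetical helper, the 5MB default is taken from the range above, and actually sending each body to `/_bulk` is left out.

```python
import json


def bulk_batches(index, docs, max_bytes=5 * 1024 * 1024):
    """Yield NDJSON bulk bodies, each at most about max_bytes in size."""
    batch, batch_size = [], 0
    for doc_id, source in docs:
        action = json.dumps({"index": {"_index": index, "_id": doc_id}})
        line = action + "\n" + json.dumps(source, ensure_ascii=False) + "\n"
        line_size = len(line.encode("utf-8"))
        # Flush the current batch before it would exceed the budget.
        if batch and batch_size + line_size > max_bytes:
            yield "".join(batch)
            batch, batch_size = [], 0
        batch.append(line)
        batch_size += line_size
    if batch:
        yield "".join(batch)


docs = [
    ("1", {"title": "iPhone 15", "price": 5999}),
    ("2", {"title": "Galaxy S24", "price": 4999}),
]
bodies = list(bulk_batches("products", docs))
print(bodies[0], end="")
```

Each body ends with a newline, which the Bulk API requires, and a single document larger than the budget is still sent on its own rather than dropped.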

Query Optimization

// Use filter instead of query (cacheable, no scoring)
// Bad
{ "query": { "match": { "status": "active" } } }
// Good
{ "query": { "bool": { "filter": { "term": { "status": "active" } } } } }

// Limit the returned fields
{ "_source": ["id", "title", "price"] }

// Use routing to reduce the number of shards searched
PUT /orders/_doc/1?routing=user_123
{ "user_id": "user_123", "amount": 99.9 }

GET /orders/_search?routing=user_123
{ "query": { "term": { "user_id": "user_123" } } }

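// Avoid deep pagination; use search_after
// Tiebreaker: sort on the keyword id field, since sorting on _id needs fielddata that 8.x disables by default
GET /products/_search
{
  "size": 20,
  "sort": [
    { "created_at": "desc" },
    { "id": "asc" }
  ],
  "search_after": ["2025-03-15T10:30:00", "abc123"]
}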
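The cost of deep pagination follows from the query phase: every shard has to return from + size hits before the coordinating node can pick a page. The arithmetic below is a rough model with hypothetical function names; it counts candidate hits only and ignores per-hit memory and network overhead.

```python
def hits_collected(from_, size, shards):
    """Hits the coordinating node must merge for a from/size page."""
    return shards * (from_ + size)


def hits_collected_search_after(size, shards):
    """With search_after each shard only returns the next `size` hits."""
    return shards * size


# Page 500 of 20 results on a 3-shard index.
print(hits_collected(from_=9980, size=20, shards=3))
print(hits_collected_search_after(size=20, shards=3))
```

This growth is why from + size is capped by index.max_result_window (10000 by default), while search_after keeps the per-request work constant regardless of how deep the client has paged.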

JVM and System Configuration

# JVM heap size (at most 50% of physical RAM, and below 32GB)
# jvm.options
-Xms16g
-Xmx16g

# System settings
# /etc/sysctl.conf
vm.max_map_count=262144
vm.swappiness=1

# File descriptors
# /etc/security/limits.conf
elasticsearch soft nofile 65536
elasticsearch hard nofile 65536

# Disable swapping
sudo swapoff -a
# Or in elasticsearch.yml
bootstrap.memory_lock: true
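The heap rule in the comment above (half of physical RAM, capped below 32GB) can be expressed as a small calculation. The 31GB cap used here is an assumption chosen to stay under the compressed object pointer threshold, whose exact value varies by JVM; `recommended_heap_gb` is a hypothetical helper.

```python
def recommended_heap_gb(physical_ram_gb, compressed_oops_cap_gb=31):
    """Half of RAM, capped below the roughly 32GB compressed-pointer limit."""
    return max(1, min(physical_ram_gb // 2, compressed_oops_cap_gb))


for ram in (32, 64, 128):
    heap = recommended_heap_gb(ram)
    print(f"{ram} GB RAM -> -Xms{heap}g -Xmx{heap}g")
```

The memory left outside the heap is not wasted: the operating system uses it as filesystem cache for Lucene segments, which is why the rule stops at 50%.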

Index Lifecycle Management (ILM)

// Define an ILM policy
// Note: the freeze action is deprecated since 7.14 and is a no-op on 8.x clusters
PUT /_ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "7d",
            "max_docs": 100000000
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "90d",
        "actions": {
          "freeze": {},
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

graph LR
    A["Hot Phase<br/>0-30 days<br/>SSD storage<br/>frequent reads and writes"] --> B["Warm Phase<br/>30-90 days<br/>merge segments<br/>reduce shard count"]
    B --> C["Cold Phase<br/>90-365 days<br/>frozen index<br/>HDD storage"]
    C --> D["Delete Phase<br/>365+ days<br/>delete index"]
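The phase boundaries in the policy and diagram above amount to a lookup by index age. The sketch below hard-codes the min_age values from logs_policy and uses a hypothetical `ilm_phase` helper; it assumes age is counted in whole days and does not model rollover timing or ILM's periodic polling.

```python
# (phase, min_age in days), ordered from oldest threshold to youngest.
PHASES = [("delete", 365), ("cold", 90), ("warm", 30), ("hot", 0)]


def ilm_phase(age_days):
    """Return the latest phase whose min_age the index has reached."""
    for name, min_age_days in PHASES:
        if age_days >= min_age_days:
            return name
    raise ValueError("age_days must be non-negative")


for age in (0, 45, 90, 400):
    print(age, ilm_phase(age))
```

Keep in mind that when rollover is used, ILM measures min_age from the rollover time rather than from index creation, so real transitions can happen later than this simple age check suggests.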

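Summary

Elasticsearch is a powerful distributed system that requires careful tuning:

The inverted index is the foundation of full-text search; understanding it leads to better mapping design
Mapping design should pick field types and analyzers based on how the data will be queried
In Query DSL, prefer filter context where scoring is not needed, and combine conditions with bool queries
Aggregations enable complex multi-dimensional statistics, but watch their memory consumption
In cluster architecture, the shard count is a critical choice that directly affects query performance and resource utilization
Write optimization centers on Bulk batching and the refresh_interval setting
Query optimization hinges on using filter, routing, and search_after pagination appropriately
ILM is the recommended way to manage time-series data such as logs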