AI · #embedding #nlp #similarity

Embedding Models: Principles and Practical Applications

2024.10.02 7 min 2.6k

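Introduction

An embedding maps discrete data (text, images, audio) into a continuous vector space, and it underpins most modern AI systems. From search engines to recommender systems, from RAG to anomaly detection, embeddings are everywhere. This article starts with the history of Word2Vec, works through the principles behind Sentence Transformers and contrastive learning, looks at how to fine-tune an embedding model, and uses the MTEB benchmark to choose the model that best fits a task.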

The Evolution of Embedding Techniques

graph LR
    A[One-Hot<br/>2000s] --> B[Word2Vec<br/>2013]
    B --> C[GloVe<br/>2014]
    C --> D[ELMo<br/>2018]
    D --> E[BERT<br/>2018]
    E --> F[Sentence-BERT<br/>2019]
    F --> G[E5/BGE<br/>2023]
    G --> H[Multimodal<br/>CLIP/SigLIP]

    style A fill:#95a5a6,color:#fff
    style B fill:#e74c3c,color:#fff
    style E fill:#3498db,color:#fff
    style F fill:#2ecc71,color:#fff
    style G fill:#f39c12,color:#000

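Word2Vec: Where Word Vectors Began

Word2Vec learns word vectors with one of two training objectives. CBOW predicts the center word from its surrounding context, while Skip-gram predicts the surrounding context from the center word: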

graph TB
    subgraph "CBOW (Continuous Bag of Words)"
        A1["context: ['机器', '是', '的', '子集']"] --> B1["predict: '学习'"]
    end

    subgraph "Skip-gram"
        A2["input: '学习'"] --> B2["predict: ['机器', '是', '的', '子集']"]
    end

    style B1 fill:#e74c3c,color:#fff
    style A2 fill:#3498db,color:#fff
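
The two objectives differ only in which side of the (center, context) relationship is the input. Below is a minimal sketch of how training pairs could be generated from one tokenized sentence. The helper `training_pairs` and its `window` parameter are illustrative names, not part of gensim, and real implementations add subsampling and negative sampling on top of this.

```python
def training_pairs(tokens, window=2):
    """Return (skipgram_pairs, cbow_pairs) for one tokenized sentence.

    skipgram_pairs: list of (center, context_word)
    cbow_pairs:     list of (context_words, center)
    """
    skipgram, cbow = [], []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # Skip-gram: one training example per (center, context word)
        skipgram.extend((center, c) for c in context)
        # CBOW: one training example per position, whole context as input
        cbow.append((context, center))
    return skipgram, cbow


tokens = ["机器", "学习", "是", "人工智能", "的", "子集"]
sg, cb = training_pairs(tokens, window=1)
print(sg[:3])  # [('机器', '学习'), ('学习', '机器'), ('学习', '是')]
print(cb[1])   # (['机器', '是'], '学习')
```

Skip-gram yields more examples per sentence (one per context word), which is one reason it tends to cope better with rare words, at the cost of slower training.
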
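from gensim.models import Word2Vec

# Train Word2Vec on a toy corpus of pre-tokenized sentences
sentences = [
    ["机器", "学习", "是", "人工智能", "的", "子集"],
    ["深度", "学习", "使用", "神经网络"],
    ["自然", "语言", "处理", "分析", "文本"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # Embedding dimension
    window=5,         # Context window
    min_count=1,
    sg=1,             # 1=Skip-gram, 0=CBOW
    workers=4,
)

# Word similarity
similar = model.wv.most_similar("学习", topn=3)
# e.g. [('机器', 0.95), ('深度', 0.88), ...]
# (illustrative only: three sentences are far too few for meaningful scores)

# Vector arithmetic: king - man + woman ≈ queen
# This needs a model trained on a large corpus. The toy vocabulary above does
# not contain these words, so guard the lookup to avoid a KeyError.
if all(w in model.wv for w in ("国王", "女人", "男人")):
    result = model.wv.most_similar(
        positive=["国王", "女人"],
        negative=["男人"],
        topn=1,
    )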

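Limitations of Word2Vec:

- Static vectors: a word gets the same vector in every context.
- For example, "苹果" (apple, the fruit) and "苹果" (Apple, the company) share a single vector.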

From Word Vectors to Sentence Vectors

# Naive approach: average word vectors (poor quality)
import numpy as np

def naive_sentence_embedding(sentence, word_model):
    words = sentence.split()
    vectors = [word_model.wv[w] for w in words if w in word_model.wv]
    if not vectors:
        return np.zeros(word_model.vector_size)
    return np.mean(vectors, axis=0)

# Problems:
# 1. Ignores word order ("dog bites man" = "man bites dog")
# 2. Common words dominate the average
# 3. No contextual understanding
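
The first problem is easy to verify by hand. The sketch below uses a made-up three-word lookup table (`toy_table` is hypothetical data, not the output of any trained model) and plain Python lists so it runs without dependencies. Because addition is commutative, any reordering of the same words averages to the same vector.

```python
def mean_vector(tokens, table):
    """Average the vectors of the tokens found in `table` (dict of lists)."""
    vectors = [table[t] for t in tokens if t in table]
    dim = len(next(iter(table.values())))
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) iterates column by column
    return [sum(col) / len(vectors) for col in zip(*vectors)]


# Hypothetical one-hot-like vectors, chosen only to make the result obvious
toy_table = {
    "dog": [1.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0],
    "man": [0.0, 0.0, 1.0],
}

a = mean_vector("dog bites man".split(), toy_table)
b = mean_vector("man bites dog".split(), toy_table)
print(a == b)  # True: the two sentences are indistinguishable
```
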

Sentence Transformers

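Sentence-BERT (SBERT) adds a pooling layer on top of BERT and trains the shared encoder in a siamese setup with a contrastive-style objective, producing semantically rich sentence vectors: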

graph TB
    subgraph "Sentence-BERT Training (Siamese Network)"
        A["Sentence A"] --> B1["BERT Encoder"]
        C["Sentence B"] --> B2["BERT Encoder<br/>(Shared Weights)"]
        B1 --> D1["Mean Pooling"]
        B2 --> D2["Mean Pooling"]
        D1 --> E["u (768-dim)"]
        D2 --> F["v (768-dim)"]
        E --> G["Cosine Similarity"]
        F --> G
        G --> H["Contrastive Loss"]
    end

    style B1 fill:#3498db,color:#fff
    style B2 fill:#3498db,color:#fff
    style H fill:#e74c3c,color:#fff
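
The mean pooling step in the diagram averages token vectors while skipping padding positions. Here is a dependency-free sketch for a single sequence. `mean_pool` is an illustrative helper, and a real implementation would apply the same idea to batched tensors using the tokenizer's attention mask.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors where attention_mask == 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    # max(count, 1) avoids division by zero for an all-padding sequence
    return [t / max(count, 1) for t in total]


token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row is padding
mask = [1, 1, 0]
print(mean_pool(token_vectors, mask))  # [2.0, 3.0]
```

Without the mask, the padding row would pull the average toward `[9.0, 9.0]`, which is why pooling over raw encoder output gives poor sentence vectors for short inputs in a padded batch.
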
from sentence_transformers import SentenceTransformer

# Load pre-trained model
model = SentenceTransformer('BAAI/bge-large-zh-v1.5')

# Encode sentences
sentences = [
    "机器学习是人工智能的一个分支",
    "深度学习使用多层神经网络",
    "今天天气真好适合出去玩",
]

embeddings = model.encode(sentences, normalize_embeddings=True)
print(f"Embedding shape: {embeddings.shape}")  # (3, 1024)

# Compute similarity matrix
from sentence_transformers.util import cos_sim

similarity_matrix = cos_sim(embeddings, embeddings)
print(similarity_matrix)
# Illustrative values; exact numbers depend on the model version:
# [[1.00, 0.82, 0.15],
#  [0.82, 1.00, 0.12],
#  [0.15, 0.12, 1.00]]

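Contrastive Learning

Contrastive learning is the core method for training embedding models: it pulls the vectors of similar samples together and pushes the vectors of dissimilar samples apart.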

InfoNCE Loss

import torch
import torch.nn.functional as F

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """
    InfoNCE contrastive loss.

    query: (batch_size, dim) - anchor embeddings
    positive: (batch_size, dim) - positive pair embeddings
    negatives: (batch_size, num_neg, dim) - negative samples
    """
    # Positive similarity
    pos_sim = F.cosine_similarity(query, positive, dim=-1) / temperature
    # (batch_size,)

    # Negative similarities
    neg_sim = F.cosine_similarity(
        query.unsqueeze(1), negatives, dim=-1
    ) / temperature
    # (batch_size, num_neg)

    # Logits: positive + all negatives
    logits = torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)
    # (batch_size, 1 + num_neg)

    # Labels: positive is always index 0
    labels = torch.zeros(query.size(0), dtype=torch.long, device=query.device)

    loss = F.cross_entropy(logits, labels)
    return loss
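
The `temperature` argument deserves a closer look, since it controls how sharply the softmax separates the positive from the negatives. The sketch below recomputes the same loss for a single query in plain Python. The helper name `info_nce_single` and the similarity values are made up for illustration.

```python
import math


def info_nce_single(pos_sim, neg_sims, temperature):
    """InfoNCE for one query, given raw cosine similarities."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]  # equals -log softmax(positive)


# Hypothetical similarities: one near-miss negative (0.7), two easy ones
pos, negs = 0.8, [0.7, 0.2, 0.1]
for tau in (1.0, 0.05):
    print(f"temperature={tau}: loss={info_nce_single(pos, negs, tau):.4f}")
```

At a temperature of 1.0 all three negatives contribute comparably to the denominator. At 0.05 the similarities are scaled by 20, so the easy negatives become negligible and the loss is driven almost entirely by the gap between the positive and the hardest negative.
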

In-batch Negatives

import torch
import torch.nn.functional as F

def in_batch_negatives_loss(queries, passages, temperature=0.05):
    """
    Use other samples in the batch as negatives.

    queries: (batch_size, dim), L2-normalized
    passages: (batch_size, dim), L2-normalized; passages[i] is the positive for queries[i]
    """
    # All-pairs similarity matrix (dot product equals cosine for unit vectors)
    similarity = torch.mm(queries, passages.T) / temperature
    # (batch_size, batch_size)

    # Diagonal elements are positive pairs
    labels = torch.arange(similarity.size(0), device=similarity.device)

    loss = F.cross_entropy(similarity, labels)
    return loss

# With batch_size=32:
# Each query has 1 positive and 31 in-batch negatives
# Very efficient: O(B^2) comparisons from 2B embeddings
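
One detail is easy to miss: a plain matrix product only yields cosine similarities when both inputs are L2-normalized. The sketch below shows the difference on two toy vectors. The helpers are illustrative, written in plain Python rather than torch so the arithmetic is visible.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


q, p = [3.0, 4.0], [4.0, 3.0]
print(dot(q, p))                                        # 24.0 (raw dot product)
print(round(dot(l2_normalize(q), l2_normalize(p)), 6))  # 0.96
print(round(cosine(q, p), 6))                           # 0.96
```

If unnormalized embeddings were divided by a temperature of 0.05, the logits would scale with vector length rather than direction, so normalizing before the matrix product (or passing `normalize_embeddings=True` when encoding) keeps the loss well behaved.
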

Hard Negative Mining

from sentence_transformers import InputExample, losses

# Hard negatives: similar but incorrect passages
training_examples = [
    InputExample(
        texts=[
            "Python如何读取CSV文件?",  # Query
            "使用pandas的read_csv函数读取CSV",  # Positive
            "Python的csv模块可以写入CSV文件",  # Hard negative (related but wrong)
        ]
    ),
    InputExample(
        texts=[
            "什么是Docker容器?",
            "Docker容器是轻量级的虚拟化技术,打包应用及其依赖",
            "Docker镜像是容器的只读模板",  # Hard negative
        ]
    ),
]

# `model` is the SentenceTransformer loaded earlier. Pick ONE loss:
# Option 1: triplet loss over (query, positive, hard negative)
# train_loss = losses.TripletLoss(model=model)
# Option 2: MultipleNegativesRankingLoss, which treats the third text as an
# explicit hard negative and also adds in-batch negatives
train_loss = losses.MultipleNegativesRankingLoss(model=model)
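
The examples above use hand-written hard negatives. In practice they are often mined automatically: embed the corpus with an existing model, then take the passages that rank highest for a query but are not labelled as its positive. A minimal sketch, assuming precomputed embeddings (the function `mine_hard_negatives` and the toy 2-dim vectors are hypothetical):

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def mine_hard_negatives(query_vec, passage_vecs, positive_idx, k=2):
    """Indices of the k passages most similar to the query,
    excluding the known positive."""
    scored = [
        (cosine(query_vec, vec), idx)
        for idx, vec in enumerate(passage_vecs)
        if idx != positive_idx
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


query = [1.0, 0.0]
passages = [
    [0.9, 0.1],  # 0: labelled positive
    [0.8, 0.6],  # 1: related but wrong -> hard negative
    [0.0, 1.0],  # 2: unrelated -> easy negative
    [0.7, 0.7],  # 3: somewhat related
]
print(mine_hard_negatives(query, passages, positive_idx=0, k=2))  # [1, 3]
```

One caveat: the top-ranked "negatives" are sometimes unlabelled positives, so mined candidates are usually filtered (for example by skipping the very top ranks or by a score ceiling) before training.
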

Fine-tuning an Embedding Model

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# 0. Load the base model to fine-tune
model = SentenceTransformer('BAAI/bge-large-zh-v1.5')

# 1. Prepare training data
train_examples = [
    InputExample(texts=["query1", "relevant_doc1"]),
    InputExample(texts=["query2", "relevant_doc2"]),
    # ... more pairs
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# 2. Choose loss function
train_loss = losses.MultipleNegativesRankingLoss(model)

# 3. Configure training
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    evaluation_steps=500,  # only takes effect when an evaluator is passed
    output_path="./fine-tuned-embedding",
    show_progress_bar=True,
)

Matryoshka Representation Learning

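MRL (Matryoshka Representation Learning) lets a single model produce usable embeddings at several dimensions. The loss is applied to nested prefixes of the vector, so the leading components carry most of the information: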

import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# `model` is assumed to be a SentenceTransformer with 768-dim output
# Train with Matryoshka loss
base_loss = MultipleNegativesRankingLoss(model)
matryoshka_loss = MatryoshkaLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
)

# After training, truncate embeddings for efficiency
full_embedding = model.encode("text")  # 768-dim
truncated = full_embedding[:256]  # Still effective at 256-dim!
truncated = truncated / np.linalg.norm(truncated)  # Re-normalize
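
The re-normalization step matters because a prefix of a unit vector is generally shorter than unit length. The sketch below shows this on a toy 4-dim vector and works out the per-vector storage saving, assuming float32 storage (4 bytes per component). The helper names are illustrative.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def bytes_per_vector(dim, bytes_per_float=4):
    """Storage for one embedding, assuming float32 by default."""
    return dim * bytes_per_float


full = [0.5, 0.5, 0.5, 0.5]  # toy unit vector (norm = 1.0)
short = truncate_and_normalize(full, 2)
print(math.sqrt(sum(x * x for x in full[:2])))  # about 0.707 before rescaling
print(math.sqrt(sum(x * x for x in short)))     # about 1.0 after rescaling
print(bytes_per_vector(768), bytes_per_vector(256))  # 3072 1024
```

Going from 768 to 256 dimensions cuts storage to one third. How much retrieval quality survives depends on the model and should be measured on your own data.
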

Evaluation: the MTEB Benchmark

graph TB
    A[MTEB evaluation tasks] --> B[Classification]
    A --> C[Clustering]
    A --> D[Pair Classification]
    A --> E[Reranking]
    A --> F[Retrieval]
    A --> G[STS<br/>Semantic Textual Similarity]
    A --> H[Summarization]

    style F fill:#e74c3c,color:#fff
    style G fill:#3498db,color:#fff

MTEB Scores of Popular Models

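| Model | Dimensions | Parameters | Chinese retrieval | Chinese STS | Multilingual |
| --- | --- | --- | --- | --- | --- |
| bge-large-zh-v1.5 | 1024 | 326M | 71.5 | 63.4 | Strong on Chinese |
| bge-m3 | 1024 | 568M | 69.8 | 64.2 | Best multilingual |
| text-embedding-3-large | 3072 | - | 68.5 | 62.1 | Good multilingual |
| e5-mistral-7b | 4096 | 7B | 70.2 | 65.8 | Strong multilingual |
| jina-embeddings-v3 | 1024 | 570M | 70.1 | 64.5 | Good multilingual |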

The scores are illustrative; consult the MTEB leaderboard for current figures.
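
To read retrieval scores sensibly it helps to know what they measure. MTEB's retrieval tasks are scored primarily with nDCG@10, which rewards placing relevant documents near the top of the ranking. Below is a simplified sketch. The function names are illustrative, and it computes the ideal ranking from the retrieved list only, whereas a full evaluation uses every relevant document in the collection.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top result divides by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, k=10):
    """ranked_relevances: graded relevance of results, in ranked order."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


print(ndcg_at_k([1, 1, 0, 0]))            # 1.0: relevant docs already on top
print(round(ndcg_at_k([1, 0, 1, 0]), 4))  # 0.9197: one relevant doc ranked third
```
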

Practical Applications

Semantic Search

import torch  # needed for torch.topk below
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('BAAI/bge-large-zh-v1.5')

# Index documents
documents = [
    "Python是一种解释型高级编程语言",
    "JavaScript主要用于Web前端开发",
    "Rust注重内存安全和并发性能",
    "Go语言由Google开发,适合后端服务",
    "TypeScript是JavaScript的超集,添加了静态类型",
]

doc_embeddings = model.encode(documents, normalize_embeddings=True)

# Search
query = "哪种编程语言适合写后端?"
query_embedding = model.encode(query, normalize_embeddings=True)

# Compute similarities
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
top_results = torch.topk(scores, k=3)

for score, idx in zip(top_results.values, top_results.indices):
    print(f"Score: {score:.4f} | {documents[int(idx)]}")

Text Clustering

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate embeddings for clustering
# load_texts() is a placeholder for your own loader; `model` is the
# SentenceTransformer loaded earlier
texts = load_texts()  # List of documents
embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)

# Find optimal k
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    silhouette_scores.append((k, score))
    print(f"k={k}, silhouette_score={score:.4f}")

# Cluster with best k
best_k = max(silhouette_scores, key=lambda x: x[1])[0]
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
labels = kmeans.fit_predict(embeddings)

# Analyze clusters
for cluster_id in range(best_k):
    cluster_texts = [texts[i] for i in range(len(texts)) if labels[i] == cluster_id]
    print(f"\n--- Cluster {cluster_id} ({len(cluster_texts)} docs) ---")
    for text in cluster_texts[:3]:
        print(f" {text[:80]}...")

Anomaly Detection

import numpy as np
from scipy.spatial.distance import cosine

def detect_anomalies(texts, model, threshold=0.3):
    """Detect texts that are semantically different from the majority."""
    embeddings = model.encode(texts, normalize_embeddings=True)

    # Compute centroid
    centroid = np.mean(embeddings, axis=0)
    centroid = centroid / np.linalg.norm(centroid)

    # Cosine distance (1 - cosine similarity) from centroid
    distances = [cosine(emb, centroid) for emb in embeddings]

    anomalies = []
    for i, (text, dist) in enumerate(zip(texts, distances)):
        if dist > threshold:
            anomalies.append({"text": text, "distance": dist, "index": i})

    return sorted(anomalies, key=lambda x: x["distance"], reverse=True)

Cross-lingual Retrieval

from sentence_transformers import SentenceTransformer, util

# BGE-M3 supports 100+ languages
model = SentenceTransformer('BAAI/bge-m3')

# Encode texts in different languages
texts = {
    "zh": "机器学习是人工智能的一个分支",
    "en": "Machine learning is a branch of artificial intelligence",
    "ja": "機械学習は人工知能の一分野です",
    "ko": "머신러닝은 인공지능의 한 분야입니다",
}

embeddings = {lang: model.encode(text) for lang, text in texts.items()}

# Cross-lingual similarity
for lang_a, emb_a in embeddings.items():
    for lang_b, emb_b in embeddings.items():
        sim = float(util.cos_sim(emb_a, emb_b))
        if lang_a != lang_b:
            print(f"{lang_a}-{lang_b}: {sim:.4f}")
# Translations of the same sentence typically score high (around 0.85 or above)
# despite the different languages

Choosing a Model

flowchart TD
    A[Choosing an embedding model] --> B{Language needs?}
    B -->|Chinese only| C[bge-large-zh-v1.5]
    B -->|Multilingual| D{Performance vs. cost?}
    D -->|High performance| E[bge-m3 / e5-mistral-7b]
    D -->|Low-cost API| F[text-embedding-3-small]
    B -->|English only| G[bge-large-en-v1.5]

    A --> H{Deployment constraints?}
    H -->|Local deployment, limited GPU| I[bge-small-zh<br/>small model, about 24M params]
    H -->|Plenty of GPU| J[bge-large or e5-mistral]
    H -->|Cloud API| K[OpenAI / Cohere]

    style C fill:#2ecc71,color:#fff
    style E fill:#3498db,color:#fff
    style F fill:#f39c12,color:#000

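Summary

Embedding technology has come a long way, from Word2Vec word vectors to today's multilingual, multimodal sentence vectors. When choosing an embedding model, consider:

  1. Task fit: for retrieval, prefer retrieval-optimized models (such as BGE or E5); for classification, a general-purpose model is usually enough
  2. Language support: prefer the bge-zh family for Chinese and bge-m3 for multilingual workloads
  3. Dimension trade-off: higher dimensions (1024+) tend to retrieve more accurately but cost more to store and compute
  4. Value of fine-tuning: fine-tuning on domain-specific data typically yields a 5-15% performance gain
  5. Matryoshka: models that support MRL have a clear advantage when storage is constrained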
Author: zt
Published: 2024-10-02
Length: 2.6k characters · 7 min
License: CC BY-SA 4.0