graph LR
A[One-Hot<br/>2000s] --> B[Word2Vec<br/>2013]
B --> C[GloVe<br/>2014]
C --> D[ELMo<br/>2018]
D --> E[BERT<br/>2018]
E --> F[Sentence-BERT<br/>2019]
F --> G[E5/BGE<br/>2023]
G --> H[Multimodal<br/>CLIP/SigLIP]
style A fill:#95a5a6,color:#fff
style B fill:#e74c3c,color:#fff
style E fill:#3498db,color:#fff
style F fill:#2ecc71,color:#fff
style G fill:#f39c12,color:#000
Word2Vec: The Starting Point of Word Vectors
Word2Vec learns word vectors with two training objectives:
graph TB
subgraph "CBOW (Continuous Bag of Words)"
A1["context: ['机器', '是', '的', '子集']"] --> B1["predict: '学习'"]
end
subgraph "Skip-gram"
A2["input: '学习'"] --> B2["predict: ['机器', '是', '的', '子集']"]
end
style B1 fill:#e74c3c,color:#fff
style A2 fill:#3498db,color:#fff
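To make the two objectives concrete, here is a minimal sketch of how training pairs could be generated from a tokenized sentence. The function names, the toy sentence, and the window size of 2 are illustrative assumptions; real Word2Vec training adds subsampling, negative sampling, and other details not shown here.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs: CBOW predicts the target from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs: Skip-gram predicts each context word from the center."""
    pairs = []
    for context, center in cbow_pairs(tokens, window):
        pairs.extend((center, word) for word in context)
    return pairs


# Toy sentence, invented for illustration.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

for context, target in cbow_pairs(tokens):
    print(f"CBOW: {context} -> {target}")

for center, word in skipgram_pairs(tokens)[:4]:
    print(f"Skip-gram: {center} -> {word}")
```

Both objectives see the same windows; they differ only in which side of the pair is the input and which is the prediction target.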
# Naive approach: average word vectors (poor quality)
import numpy as np

def naive_sentence_embedding(sentence, word_model):
    words = sentence.split()
    vectors = [word_model.wv[w] for w in words if w in word_model.wv]
    if not vectors:
        return np.zeros(word_model.vector_size)
    return np.mean(vectors, axis=0)

# Problems:
# 1. Ignores word order ("dog bites man" = "man bites dog")
# 2. Common words dominate the average
# 3. No contextual understanding
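The first problem is easy to verify by hand. The sketch below uses made-up 3-dimensional word vectors (they come from no trained model) and shows that averaging yields the same sentence vector regardless of word order.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical word vectors, invented purely for illustration.
toy_vectors = {
    "dog": [1.0, 0.0, 2.0],
    "bites": [0.0, 3.0, 1.0],
    "man": [2.0, 1.0, 0.0],
}


def embed(sentence):
    """Average the vectors of all known words in a whitespace-tokenized sentence."""
    return mean_vector([toy_vectors[w] for w in sentence.split() if w in toy_vectors])


a = embed("dog bites man")
b = embed("man bites dog")
print(a == b)  # the two sentences are indistinguishable after averaging
```

Because addition is commutative, any permutation of the same words collapses to the same point, which is why averaged word vectors cannot tell who bit whom.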
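from sentence_transformers import InputExample, losses
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Hard negatives: similar but incorrect passages
training_examples = [
    InputExample(
        texts=[
            "Python如何读取CSV文件?",  # Query: "How do I read a CSV file in Python?"
            "使用pandas的read_csv函数读取CSV",  # Positive: "Use pandas' read_csv function to read the CSV"
            "Python的csv模块可以写入CSV文件",  # Hard negative (related but wrong): "Python's csv module can write CSV files"
        ]
    ),
    InputExample(
        texts=[
            "什么是Docker容器?",  # Query: "What is a Docker container?"
            "Docker容器是轻量级的虚拟化技术,打包应用及其依赖",  # Positive: "A Docker container is lightweight virtualization that packages an app and its dependencies"
            "Docker镜像是容器的只读模板",  # Hard negative: "A Docker image is a read-only template for containers"
        ]
    ),
]

# `model` is assumed to be a SentenceTransformer instance loaded earlier.
# Option 1: train with triplet loss
train_loss = losses.TripletLoss(model=model)
# Option 2: use MultipleNegativesRankingLoss for in-batch negatives
# (this assignment replaces option 1; keep only the one you want)
train_loss = losses.MultipleNegativesRankingLoss(model=model)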
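To show what the two losses compute, here is a dependency-free sketch on toy 2-D vectors. It is not the sentence-transformers implementation: the Euclidean distance, the margin of 5, and the similarity scale of 20 are assumptions chosen to resemble common defaults, and all function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def triplet_loss(anchor, positive, negative, margin=5.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin) with Euclidean distance."""
    return max(0.0, math.dist(anchor, positive) - math.dist(anchor, negative) + margin)


def in_batch_negatives_loss(queries, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities.

    positives[i] is the correct passage for queries[i]; every other
    passage in the batch acts as a negative for that query.
    """
    per_query = []
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, passage) for passage in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        per_query.append(log_sum_exp - logits[i])
    return sum(per_query) / len(per_query)


# An easy negative (far from the anchor) contributes nothing;
# a hard negative (close to the anchor) still produces a training signal.
easy = triplet_loss([0.0, 0.0], [1.0, 0.0], [10.0, 0.0])
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [2.0, 0.0])

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(queries, [[0.9, 0.1], [0.1, 0.9]])
shuffled = in_batch_negatives_loss(queries, [[0.1, 0.9], [0.9, 0.1]])
print(easy, hard, aligned < shuffled)
```

The triplet example illustrates why hard negatives matter: negatives that are already far away produce zero loss and therefore no gradient.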
import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

# Train with Matryoshka loss
# (`model` is assumed to be a SentenceTransformer with 768-dim output, loaded earlier)
base_loss = MultipleNegativesRankingLoss(model)
matryoshka_loss = MatryoshkaLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
)

# After training, truncate embeddings for efficiency
full_embedding = model.encode("text")  # 768-dim
truncated = full_embedding[:256]  # Still effective at 256-dim
truncated = truncated / np.linalg.norm(truncated)  # Re-normalize to unit length
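The re-normalization step matters because a prefix of a unit vector is generally no longer unit length, so dot products on truncated vectors would stop being cosine similarities. A minimal sketch with a toy 4-dimensional vector (the helper name and the numbers are illustrative assumptions):

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


full = [3.0, 4.0, 0.0, 12.0]  # toy embedding with norm 13
short = truncate_and_normalize(full, 2)
short_norm = math.sqrt(sum(x * x for x in short))
print(short, short_norm)
```

Truncation only preserves quality when the model was trained so that the leading dimensions carry most of the information, which is what the Matryoshka objective is for.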
Evaluation: MTEB Benchmark
graph TB
A[MTEB Evaluation Tasks] --> B[Classification]
A --> C[Clustering]
A --> D[Pair Classification]
A --> E[Reranking]
A --> F[Retrieval]
A --> G[STS<br/>Semantic Textual Similarity]
A --> H[Summarization]
style F fill:#e74c3c,color:#fff
style G fill:#3498db,color:#fff
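Retrieval tasks in MTEB are scored primarily with nDCG@10. As a rough illustration of what that metric rewards, here is a small sketch; it assumes the input list holds the graded relevance of every judged document in ranked order, and it is not MTEB's own evaluation code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (rank 1 gets discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """DCG normalized by the DCG of the ideal (relevance-sorted) ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


perfect = ndcg_at_k([1, 1, 0], k=3)   # relevant documents ranked first
demoted = ndcg_at_k([0, 1], k=2)      # the only relevant document is at rank 2
print(perfect, demoted)
```

The logarithmic discount means that moving a relevant passage from rank 1 to rank 2 costs far more than moving it from rank 9 to rank 10.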
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate embeddings for clustering
texts = load_texts()  # List of documents
embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)

# Find optimal k
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    silhouette_scores.append((k, score))
    print(f"k={k}, silhouette_score={score:.4f}")

# Cluster with best k
best_k = max(silhouette_scores, key=lambda x: x[1])[0]
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
labels = kmeans.fit_predict(embeddings)

# Analyze clusters
for cluster_id in range(best_k):
    cluster_texts = [texts[i] for i in range(len(texts)) if labels[i] == cluster_id]
    print(f"\n--- Cluster {cluster_id} ({len(cluster_texts)} docs) ---")
    for text in cluster_texts[:3]:
        print(f"  {text[:80]}...")
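For intuition about the score used to pick k, here is a dependency-free sketch of the silhouette coefficient on toy 2-D points. It assumes Euclidean distance, at least two clusters, and points that are not all identical; it is meant for reading, not as a replacement for `silhouette_score`.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) averaged over all points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette(points, [0, 1, 0, 1])   # splits each natural group
print(f"good={good:.3f}, bad={bad:.3f}")
```

Values near 1 indicate tight, well-separated clusters, while negative values suggest points sit closer to a different cluster than to their own.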
Anomaly Detection
from scipy.spatial.distance import cosine

def detect_anomalies(texts, model, threshold=0.3):
    """Detect texts that are semantically different from the majority."""
    embeddings = model.encode(texts, normalize_embeddings=True)
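One common way to finish this kind of detector is to measure each embedding's cosine distance to the centroid and flag anything beyond the threshold. The sketch below works on precomputed toy vectors. The centroid approach is an assumption about the intended method, `flag_outliers` is a hypothetical name, and only the `threshold=0.3` default is taken from the signature above.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def flag_outliers(embeddings, threshold=0.3):
    """Return (index, distance) for embeddings far from the mean vector."""
    n = len(embeddings)
    centroid = [sum(column) / n for column in zip(*embeddings)]
    distances = [cosine_distance(e, centroid) for e in embeddings]
    return [(i, d) for i, d in enumerate(distances) if d > threshold]


# Toy 2-D "embeddings": three point roughly the same way, one does not.
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.95, -0.05],
    [0.0, 1.0],
]
outliers = flag_outliers(toy_embeddings)
print([index for index, _ in outliers])
```

A centroid-based check assumes most texts share one dominant topic; for corpora with several topics, distance to the nearest cluster center is usually a better signal.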
from sentence_transformers import SentenceTransformer, util

# BGE-M3 supports 100+ languages
model = SentenceTransformer('BAAI/bge-m3')

# Encode texts in different languages
texts = {
    "zh": "机器学习是人工智能的一个分支",
    "en": "Machine learning is a branch of artificial intelligence",
    "ja": "機械学習は人工知能の一分野です",
    "ko": "머신러닝은 인공지능의 한 분야입니다",
}

embeddings = {lang: model.encode(text) for lang, text in texts.items()}

# Cross-lingual similarity
for lang_a, emb_a in embeddings.items():
    for lang_b, emb_b in embeddings.items():
        if lang_a != lang_b:
            sim = float(util.cos_sim(emb_a, emb_b))
            print(f"{lang_a}-{lang_b}: {sim:.4f}")
# All pairs show high similarity (>0.85) despite different languages
Model Selection Recommendations
flowchart TD
A[Embedding Model Selection] --> B{Language requirements?}
B -->|Chinese only| C[bge-large-zh-v1.5]
B -->|Multilingual| D{Performance vs. cost?}
D -->|High performance| E[bge-m3 / e5-mistral-7b]
D -->|Low-cost API| F[text-embedding-3-small]
B -->|English only| G[bge-large-en-v1.5]
A --> H{Deployment constraints?}
H -->|Local deployment, limited GPU| I[bge-small-zh<br/>small model, 33M]
H -->|Ample GPU| J[bge-large or e5-mistral]
H -->|Cloud API| K[OpenAI / Cohere]
style C fill:#2ecc71,color:#fff
style E fill:#3498db,color:#fff
style F fill:#f39c12,color:#000
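The flowchart can also be written as a small lookup helper. This is only a sketch of the same decision logic: the function names and the option keys (`"zh"`, `"multilingual"`, `"low_cost_api"`, and so on) are hypothetical, while the model names are the ones shown in the chart.

```python
def recommend_by_language(language, priority=None):
    """Follow the language branch of the flowchart."""
    if language == "zh":
        return ["bge-large-zh-v1.5"]
    if language == "en":
        return ["bge-large-en-v1.5"]
    if language == "multilingual":
        if priority == "performance":
            return ["bge-m3", "e5-mistral-7b"]
        if priority == "low_cost_api":
            return ["text-embedding-3-small"]
        raise ValueError("multilingual needs priority 'performance' or 'low_cost_api'")
    raise ValueError(f"unknown language option: {language!r}")


def recommend_by_deployment(constraint):
    """Follow the deployment branch of the flowchart."""
    options = {
        "local_limited_gpu": ["bge-small-zh"],
        "ample_gpu": ["bge-large", "e5-mistral"],
        "cloud_api": ["OpenAI", "Cohere"],
    }
    if constraint not in options:
        raise ValueError(f"unknown deployment constraint: {constraint!r}")
    return options[constraint]


print(recommend_by_language("multilingual", priority="performance"))
print(recommend_by_deployment("local_limited_gpu"))
```

The two branches are independent in the chart, so a real decision would intersect both answers, for example preferring a small local model when the GPU budget is tight even if a larger model fits the language need.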