AI · #llm#fine-tuning#lora#qlora

LLM Fine-Tuning in Practice: LoRA and QLoRA

2025.10.01 7 min 2.7k

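Introduction

Large language models are impressively capable out of the box, but in specialized domains such as law, medicine, or code they often need fine-tuning to perform at a professional level. Full fine-tuning updates every weight, so its compute and memory cost grows with the parameter count, which makes it expensive for a 7B model and far more so at 70B. LoRA (Low-Rank Adaptation) and QLoRA are parameter-efficient fine-tuning (PEFT) methods that lower this barrier substantially. This article starts from the math and then walks through the whole workflow: data preparation, training configuration, evaluation, and deployment.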

The Fine-Tuning Landscape

graph TB
    A[LLM fine-tuning methods] --> B[Full Fine-tuning]
    A --> C[Parameter-efficient fine-tuning<br/>PEFT]

    B --> B1[Updates all parameters<br/>Highest cost / best quality]

    C --> D[Adapter methods]
    C --> E[Prefix/Prompt methods]
    C --> F[LoRA methods]

    D --> D1[Adapter Layers<br/>Small networks inserted between layers]
    E --> E1[Prefix Tuning<br/>P-Tuning v2]
    F --> F1[LoRA<br/>Low-rank matrix decomposition]
    F --> F2[QLoRA<br/>Quantization + LoRA]
    F --> F3[DoRA<br/>Weight-Decomposed LoRA]

    style B fill:#e74c3c,color:#fff
    style F1 fill:#2ecc71,color:#fff
    style F2 fill:#3498db,color:#fff

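Full Fine-Tuning vs. PEFT

| Dimension | Full fine-tuning | LoRA | QLoRA |
|---|---|---|---|
| Trainable parameters | 100% | ~0.1-1% | ~0.1-1% |
| GPU memory for a 7B model | ~60GB (A100) | ~16GB (V100) | ~6GB (RTX 3060) |
| Training speed | | | |
| Quality | Best | Close to full fine-tuning | Slightly below LoRA |
| Catastrophic forgetting risk | | | |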

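LoRA: The Math

The core idea of LoRA: the weight update learned while adapting a pretrained model has low intrinsic rank, so it can be approximated by the product of two small matrices.

Derivation

For a pretrained weight matrix W_0 of shape d x k:

Full fine-tuning: W = W_0 + ΔW    (ΔW is an unconstrained d x k matrix, up to full rank)

LoRA: W = W_0 + BA    where B: d x r, A: r x k, r << min(d, k)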
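
As a worked form of the decomposition above, the forward pass and the parameter count can be written out as follows. The scaling factor α/r and the zero initialization of B are taken from the LoRALinear sketch in this article; the 4096 x 4096 shape is the q_proj example used in the hyperparameter section. Treat this as a summary under those assumptions, not a full derivation.

```latex
\begin{aligned}
h &= W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
  \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k} \\
B &= 0 \text{ at initialization} \;\Rightarrow\; h = W_0 x \text{ before the first update} \\
\#\text{params}(\Delta W) &= d k, \qquad \#\text{params}(B, A) = r (d + k) \\
d = k = 4096,\; r = 16 &: \quad \frac{d k}{r (d + k)} = \frac{16{,}777{,}216}{131{,}072} = 128
\end{aligned}
```
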
# LoRA implementation concept (a minimal sketch, not the PEFT library's code)
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, original_linear: nn.Linear, rank: int = 8, alpha: float = 16):
        super().__init__()
        self.original = original_linear
        # Freeze the pretrained weight and bias; only the LoRA matrices train
        for param in self.original.parameters():
            param.requires_grad = False

        d_out, d_in = original_linear.weight.shape

        # Low-rank matrices: A starts small and random, B starts at zero,
        # so B @ A = 0 and the wrapped layer initially matches the original
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # r x d_in
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))        # d_out x r

        self.scaling = alpha / rank  # Scaling factor alpha / r

    def forward(self, x):
        # Original output + scaled low-rank adjustment: W_0 x + (alpha / r) B A x
        original_out = self.original(x)
        lora_out = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return original_out + lora_out
graph LR
    subgraph "Original Layer"
        X[Input x] --> W["W₀ (d × k)<br/>Frozen"]
        W --> Y1[Output]
    end

    subgraph "LoRA Branch"
        X --> A["A (r × k)<br/>Trainable"]
        A --> B["B (d × r)<br/>Trainable"]
        B --> |"× α/r"| Y2[LoRA Output]
    end

    Y1 --> PLUS((+))
    Y2 --> PLUS
    PLUS --> Y[Final Output]

    style W fill:#95a5a6,color:#fff
    style A fill:#2ecc71,color:#fff
    style B fill:#2ecc71,color:#fff
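
The two branches in the diagram can be folded into a single weight matrix, which is what makes merged deployment possible. The sketch below checks that equivalence numerically on toy matrices using only the standard library. The sizes, the random values standing in for trained A and B, and the helper names are illustrative assumptions.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, k, r, alpha = 6, 5, 2, 4.0  # toy sizes chosen for illustration
scaling = alpha / r

W0 = rand_matrix(d, k, rng)  # frozen pretrained weight, d x k
B = rand_matrix(d, r, rng)   # stands in for a trained B, d x r
A = rand_matrix(r, k, rng)   # stands in for a trained A, r x k
x = [rng.uniform(-1.0, 1.0) for _ in range(k)]

# Two-branch forward pass: W0 x + (alpha / r) * B (A x)
base_out = matvec(W0, x)
lora_out = matvec(B, matvec(A, x))
branch = [b + scaling * l for b, l in zip(base_out, lora_out)]

# Merged forward pass: (W0 + (alpha / r) * B A) x
BA = matmul(B, A)
W_merged = [
    [w + scaling * u for w, u in zip(w_row, u_row)]
    for w_row, u_row in zip(W0, BA)
]
merged = matvec(W_merged, x)

max_diff = max(abs(p - q) for p, q in zip(branch, merged))
print(f"max difference between branch and merged outputs: {max_diff:.2e}")
```

The difference is at floating-point rounding level, so a merged model behaves like the two-branch model while needing only one matrix multiply per layer.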

Key Hyperparameters

lora_config = {
    "r": 16,          # Rank: higher means more capacity but more parameters
                      # Typical values: 4, 8, 16, 32, 64
    "alpha": 32,      # Scaling factor: commonly set to 2 * r
    "dropout": 0.05,  # LoRA dropout
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
        "gate_proj", "up_proj", "down_proj",     # MLP layers
    ],
}

# Trainable parameters calculation:
# For each target module (d x k):
#   LoRA params = r * (d + k)
# Example: Llama-2-7B attention q_proj (4096 x 4096), r=16:
#   LoRA params = 16 * (4096 + 4096) = 131,072 (vs 16,777,216 original)
#   Compression ratio: 128x
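
The per-module formula r * (d + k) can be summed over a whole model. The sketch below does this for the target modules listed above, assuming the published Llama-2-7B shapes (hidden size 4096, MLP size 11008, 32 decoder layers). The function and constant names are illustrative.

```python
# Llama-2-7B shapes: hidden size 4096, MLP (intermediate) size 11008, 32 layers
HIDDEN, INTERMEDIATE, NUM_LAYERS = 4096, 11008, 32

# (d_out, d_in) of each target module inside one decoder layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (INTERMEDIATE, HIDDEN),
    "up_proj": (INTERMEDIATE, HIDDEN),
    "down_proj": (HIDDEN, INTERMEDIATE),
}

def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in one LoRA pair: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)

def total_lora_params(target_modules, r: int, num_layers: int = NUM_LAYERS) -> int:
    """Sum LoRA parameters over the chosen modules in every decoder layer."""
    per_layer = sum(lora_params(*MODULE_SHAPES[name], r) for name in target_modules)
    return per_layer * num_layers

attention_only = total_lora_params(["q_proj", "k_proj", "v_proj", "o_proj"], r=16)
all_modules = total_lora_params(list(MODULE_SHAPES), r=16)

print(f"attention only: {attention_only:,}")  # 16,777,216
print(f"all 7 modules:  {all_modules:,}")     # 39,976,960
```

With r=16 on all seven projections, the adapter has roughly 40M trainable parameters, about 0.6% of the base model.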

QLoRA: Quantized LoRA

QLoRA builds on LoRA by quantizing the frozen base weights to 4 bits, which cuts memory use further:

graph TB
    A[QLoRA innovations] --> B[4-bit NormalFloat<br/>NF4 quantization]
    A --> C[Double Quantization]
    A --> D[Paged Optimizers]

    B --> B1[4-bit encoding tuned for<br/>normally distributed weights]
    C --> C1[The quantization constants<br/>are themselves quantized]
    D --> D1[Optimizer state pages out to CPU<br/>when GPU memory runs short]

    style A fill:#3498db,color:#fff
    style B fill:#e74c3c,color:#fff
    style C fill:#f39c12,color:#000
    style D fill:#2ecc71,color:#fff
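
To make the NF4 idea more concrete, here is a simplified sketch of block-wise absmax quantization against a codebook built from standard-normal quantiles. It is not the bitsandbytes implementation. The real NF4 table is constructed differently (it includes an exact zero, for instance), and the codebook construction, function names, and block of random weights here are assumptions for illustration.

```python
import random
from statistics import NormalDist

def normal_quantile_codebook(bits: int = 4) -> list[float]:
    """Build 2**bits levels from evenly spaced normal quantiles, scaled to [-1, 1]."""
    n = 2 ** bits
    # Probabilities strictly inside (0, 1) so the inverse CDF stays finite
    probs = [(i + 0.5) / n for i in range(n)]
    raw = [NormalDist().inv_cdf(p) for p in probs]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]

def quantize_block(weights, codebook):
    """Absmax-scale one block and store each weight as its nearest level index."""
    absmax = max(abs(w) for w in weights) or 1.0
    indices = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / absmax))
        for w in weights
    ]
    return indices, absmax

def dequantize_block(indices, absmax, codebook):
    """Recover approximate weights from level indices and the block's absmax."""
    return [codebook[i] * absmax for i in indices]

rng = random.Random(0)
block = [rng.gauss(0.0, 0.02) for _ in range(64)]  # one 64-weight block

codebook = normal_quantile_codebook()
indices, absmax = quantize_block(block, codebook)
restored = dequantize_block(indices, absmax, codebook)

max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"levels: {len(codebook)}, absmax: {absmax:.4f}, max abs error: {max_err:.4f}")
```

Each weight is stored as a 4-bit index plus one shared scale per block. Because the levels are denser near zero, where most normally distributed weights fall, the rounding error is smaller there than with evenly spaced levels.
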
from transformers import BitsAndBytesConfig
import torch

# QLoRA quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization
    bnb_4bit_quant_type="nf4",              # NormalFloat4 (better than FP4)
    bnb_4bit_compute_dtype=torch.bfloat16,  # Computation in bfloat16
    bnb_4bit_use_double_quant=True,         # Double quantization
)

# Memory comparison for Llama-2-7B (weights only):
#   FP32:                    ~28 GB
#   FP16:                    ~14 GB
#   4-bit (QLoRA):           ~3.5 GB
#   + LoRA trainable params: ~0.1 GB
#   Total:                   ~3.6 GB (fits on RTX 3060 12GB!)
# Activations, gradients, and optimizer state come on top of this,
# which is why end-to-end training needs more than the weight total.
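
The weight-only figures above follow from parameter count times bits per parameter. The sketch below reproduces them and adds the overhead of the quantization constants, assuming the block sizes described in the QLoRA paper (64 weights per block, 256 constants per second-level block) and a rounded 7e9 parameter count. It ignores activations, gradients, and optimizer state.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7e9          # rounded parameter count for a "7B" model
BLOCK_SIZE = 64           # weights sharing one absmax constant
SECOND_LEVEL_BLOCK = 256  # absmax constants sharing one second-level constant

# Overhead of the quantization constants, in bits per parameter
single_quant_overhead = 32 / BLOCK_SIZE  # one fp32 absmax per block
double_quant_overhead = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)

for label, bits in [
    ("FP32", 32),
    ("FP16/BF16", 16),
    ("4-bit, no constants", 4),
    ("4-bit + fp32 constants", 4 + single_quant_overhead),
    ("4-bit + double quantization", 4 + double_quant_overhead),
]:
    print(f"{label:<28} {weight_memory_gb(NUM_PARAMS, bits):6.2f} GB")
```

Double quantization brings the constant overhead from 0.5 to about 0.127 bits per parameter, which is a few hundred megabytes on a 7B model.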

Data Preparation

Instruction-Tuning Data Formats

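# Alpaca format
training_data = [
    {
        "instruction": "Translate the following English text into Chinese",
        "input": "Machine learning is a subset of artificial intelligence.",
        "output": "机器学习是人工智能的一个子集。"
    },
    {
        "instruction": "Summarize the key points of the following article",
        "input": "RAG (retrieval-augmented generation) is a technique that combines an external knowledge base with a large language model...",
        "output": "1. RAG combines an LLM with an external knowledge base\n2. Retrieving relevant documents improves generation quality\n3. It costs less than fine-tuning and the data can be updated"
    },
]

# ChatML format (preferred for chat models)
chat_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a professional technical documentation assistant."},
            {"role": "user", "content": "What is Docker?"},
            {"role": "assistant", "content": "Docker is a containerization platform..."}
        ]
    },
]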
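
Datasets often arrive in the Alpaca layout while chat models expect the messages layout. A minimal conversion sketch follows. The function name, the optional system prompt, and the choice to join instruction and input with a blank line are assumptions, not part of either format's specification.

```python
def alpaca_to_messages(example: dict, system_prompt=None) -> dict:
    """Convert one Alpaca-style record into a chat-style record."""
    user_content = example["instruction"]
    if example.get("input"):
        # Assumption: instruction and input are joined by a blank line
        user_content += "\n\n" + example["input"]

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"messages": messages}

sample = {
    "instruction": "Translate to French",
    "input": "Hello",
    "output": "Bonjour",
}
converted = alpaca_to_messages(sample, system_prompt="You are a translator.")
for message in converted["messages"]:
    print(f"{message['role']}: {message['content']!r}")
```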

Ensuring Data Quality

from datasets import Dataset

def validate_and_clean(data: list[dict]) -> list[dict]:
    """Validate, normalize, and deduplicate training data."""
    cleaned = []
    for item in data:
        # Check required fields
        if not item.get("instruction") or not item.get("output"):
            continue

        # Normalize whitespace before measuring length
        instruction = item["instruction"].strip()
        output = item["output"].strip()

        # Filter too short/long samples (length in characters, not tokens)
        if len(output) < 10 or len(output) > 4096:
            continue

        cleaned.append({**item, "instruction": instruction, "output": output})

    # Deduplicate on (instruction, input) so that samples sharing a generic
    # instruction but carrying different inputs are all kept
    seen = set()
    unique = []
    for item in cleaned:
        key = (item["instruction"], (item.get("input") or "").strip())
        if key not in seen:
            seen.add(key)
            unique.append(item)

    return unique

# Format for training
def format_prompt(example):
    """Format an example into the Alpaca-style prompt used for training."""
    parts = [f"### Instruction:\n{example['instruction']}"]
    if example.get("input"):
        parts.append(f"### Input:\n{example['input']}")
    parts.append(f"### Response:\n{example['output']}")
    return {"text": "\n\n".join(parts)}

dataset = Dataset.from_list(validate_and_clean(training_data))
dataset = dataset.map(format_prompt)
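
Exact-match deduplication misses records that differ only in case or spacing. One hedged way to catch those is to hash a normalized form of the instruction and input, as sketched below. The normalization rules and function names are illustrative choices, and this does not detect paraphrases.

```python
import hashlib
import re

def dedup_key(example: dict) -> str:
    """Hash of the case- and whitespace-normalized instruction plus input."""
    text = example["instruction"] + "\n" + (example.get("input") or "")
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def drop_near_duplicates(examples: list) -> list:
    """Keep the first record for each normalized key, preserving order."""
    seen, kept = set(), []
    for example in examples:
        key = dedup_key(example)
        if key not in seen:
            seen.add(key)
            kept.append(example)
    return kept

records = [
    {"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"},
    {"instruction": "translate  to french ", "input": "hello", "output": "Bonjour !"},
    {"instruction": "Translate to French", "input": "Goodbye", "output": "Au revoir"},
]
deduped = drop_near_duplicates(records)
print(f"kept {len(deduped)} of {len(records)} records")
```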

Training in Practice

Using Hugging Face + PEFT

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import torch

# NOTE: this script uses the older transformers/trl argument names.
# Recent releases rename evaluation_strategy to eval_strategy and move
# max_seq_length, dataset_text_field, and packing into trl.SFTConfig,
# so check the versions you have installed.

# 1. Load model with quantization
#    (quantization_config is the BitsAndBytesConfig from the QLoRA section)
model_name = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,  # QLoRA 4-bit
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token

# Hold out 10% of the formatted dataset for evaluation
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset, eval_dataset = split["train"], split["test"]

# 2. Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# 3. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With r=16 on all seven projections, r * (d + k) summed over 32 layers gives
# 39,976,960 trainable parameters (about 0.6% of the base model). The exact
# totals printed depend on the PEFT version and on how 4-bit weights are counted.

# 4. Training arguments
training_args = TrainingArguments(
    output_dir="./output/llama2-7b-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # Effective batch size: 4 * 4 = 16 per device
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",     # eval_strategy in newer transformers
    bf16=True,
    optim="paged_adamw_8bit",        # Memory-efficient optimizer
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    report_to="wandb",               # Requires wandb; use "none" to disable
)

# 5. Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=training_args,
    max_seq_length=2048,
    dataset_text_field="text",
    packing=True,  # Pack multiple samples into one sequence
)

trainer.train()

# 6. Save LoRA adapter (only the adapter weights, a small fraction of the
#    ~13 GB an fp16 copy of the full model would take)
trainer.save_model("./output/llama2-7b-lora/final")
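
It helps to know the effective batch size and the number of optimizer steps before launching a run. The sketch below derives them from the arguments above, assuming a hypothetical dataset of 10,000 samples on one GPU. The Trainer's own step count can differ, for example when packing changes the number of sequences.

```python
import math

def training_schedule(num_samples, per_device_batch_size, grad_accum_steps,
                      num_epochs, warmup_ratio, num_devices=1):
    """Approximate step counts implied by a set of training arguments."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

schedule = training_schedule(
    num_samples=10_000,        # hypothetical dataset size
    per_device_batch_size=4,
    grad_accum_steps=4,
    num_epochs=3,
    warmup_ratio=0.03,
)
for name, value in schedule.items():
    print(f"{name}: {value}")
```

Halving per_device_batch_size while doubling grad_accum_steps keeps effective_batch unchanged, which is why that pair is the usual response to out-of-memory errors.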

Monitoring Training

graph LR
    A[Training metrics to monitor] --> B[Loss Curve<br/>Training/validation loss]
    A --> C[Learning Rate<br/>Schedule over time]
    A --> D[GPU Memory<br/>VRAM usage]
    A --> E[Gradient Norm]

    B --> B1[Overfitting: train↓ val↑]
    B --> B1b[Underfitting: both stay high]
    C --> C1[Warmup + Cosine Decay]
    D --> D1[OOM → reduce batch_size<br/>and raise gradient_accumulation to compensate]

    style B1 fill:#e74c3c,color:#fff
    style C1 fill:#2ecc71,color:#fff
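
The "Warmup + Cosine Decay" curve in the diagram can be sketched as a plain function, which is handy for checking a logged learning rate. This assumes linear warmup from zero and cosine decay to zero. The default peak of 2e-4 matches the training arguments, while the step counts in the demo are hypothetical.

```python
import math

def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               peak_lr: float = 2e-4) -> float:
    """Learning rate with linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS, WARMUP_STEPS = 1000, 30  # hypothetical run length
for step in (0, 15, 30, 515, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step, TOTAL_STEPS, WARMUP_STEPS):.2e}")
```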

Evaluation

import torch
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

# Evaluation using lm-evaluation-harness:
# load the base model and attach the saved LoRA adapter through `peft`
model_wrapper = HFLM(
    pretrained="meta-llama/Llama-2-7b-hf",
    peft="./output/llama2-7b-lora/final",
    batch_size=8,
)

results = evaluator.simple_evaluate(
    model=model_wrapper,
    tasks=["mmlu", "hellaswag", "arc_challenge"],
    num_fewshot=5,
)

print(results["results"])

# Custom evaluation
def evaluate_custom(model, tokenizer, test_data):
    """Evaluate on custom test set."""
    correct = 0
    total = len(test_data)

    for item in test_data:
        # Rebuild the training prompt, cut off the reference answer,
        # and let the model complete the response
        prompt = format_prompt(item)["text"].rsplit("### Response:", 1)[0] + "### Response:\n"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=False,  # Greedy decoding; temperature has no effect here
            )

        # Decode only the newly generated tokens
        response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

        # evaluate_response is task-specific and must be supplied by the caller
        if evaluate_response(response, item["output"]):
            correct += 1

    accuracy = correct / total
    print(f"Accuracy: {accuracy:.2%}")
    return accuracy
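
The custom loop above calls evaluate_response, which the article leaves undefined. One hypothetical way to fill it in is a token-overlap F1 score with a pass threshold, sketched below. The normalization, the 0.8 threshold, and the whitespace tokenization are assumptions. Whitespace tokenization suits space-delimited languages only, and many tasks need a stricter or task-specific check.

```python
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two strings."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate_response(response: str, expected: str, threshold: float = 0.8) -> bool:
    """Count a response as correct when its token F1 reaches the threshold."""
    return token_f1(response, expected) >= threshold

same = evaluate_response("Docker is a containerization platform.",
                         "docker is a containerization platform")
different = evaluate_response("completely unrelated text",
                              "docker is a containerization platform")
print(f"matching answer accepted: {same}, unrelated answer accepted: {different}")
```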

Deployment

Merging LoRA Weights

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "meta-llama/Llama-2-7b-hf"

# Load base model in fp16 (not 4-bit) so the adapter merges into real weights
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, "./output/llama2-7b-lora/final")
merged_model = model.merge_and_unload()

# Save merged model together with its tokenizer
merged_model.save_pretrained("./output/llama2-7b-merged")
tokenizer.save_pretrained("./output/llama2-7b-merged")

# Now deploy as a standard model (e.g., with vLLM)
# vllm serve ./output/llama2-7b-merged

Multi-LoRA Serving

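# vLLM supports serving multiple LoRA adapters on one base model
# Start server:
#   vllm serve meta-llama/Llama-2-7b-hf \
#     --enable-lora \
#     --lora-modules \
#       legal=./adapters/legal \
#       medical=./adapters/medical \
#       code=./adapters/code
# Note: Llama-2-7b-hf is a base model without a chat template, so the chat
# endpoint used below may need --chat-template or a chat-tuned base model.

# Client request with specific adapter:
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # The client requires a key even if the server does not check it
)

# Use legal adapter
response = client.chat.completions.create(
    model="legal",  # LoRA adapter name
    messages=[{"role": "user", "content": "Analyze this contract clause..."}],
)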

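Common Problems and Tuning

| Problem | Likely cause | Fix |
|---|---|---|
| Training does not converge | Learning rate too high or too low | Start from 2e-4 and adjust |
| Overfitting | Too little data or too many epochs | Add data or reduce epochs |
| OOM | batch_size too large | Reduce batch_size and increase gradient_accumulation |
| Poor generation quality | Low-quality data | Clean the data and increase its diversity |
| Catastrophic forgetting | LoRA rank too high | Lower r and add regularization |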

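Summary

LoRA and QLoRA make it practical to fine-tune large language models on consumer GPUs. Key takeaways:

  1. Start with QLoRA: about 6 GB of VRAM is enough to fine-tune a 7B model
  2. Data quality > data quantity: 1,000 high-quality samples beat 10,000 low-quality ones
  3. Start with a small rank: r=8 or r=16 is usually enough; increase it only if results fall short
  4. Choosing target_modules: include at least the attention q/v projections, and add MLP layers as needed
  5. Merge before deploying: an unmerged LoRA adds inference overhead, so merge the weights for production
  6. Multi-LoRA serving: vLLM can switch LoRA adapters dynamically on a single base model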
Author: zt
Published: 2025-10-01
Length: 2.7k characters · 7 min
License: CC BY-SA 4.0