LLM Fine-Tuning in Practice: LoRA and QLoRA

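Introduction

The general capabilities of large language models are impressive, but specialized domains (such as law, medicine, or code) often need fine-tuning to reach expert-level performance. The compute required for full fine-tuning scales with the model's parameter count, which makes it expensive for 7B models, let alone 70B ones. LoRA (Low-Rank Adaptation) and QLoRA lower that barrier substantially through parameter-efficient fine-tuning (PEFT). This article starts from the math and then walks through the full workflow: data preparation, training configuration, evaluation, and deployment.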

Overview of Fine-Tuning Methods

graph TB
    A[LLM Fine-Tuning Methods] --> B[Full Fine-Tuning]
    A --> C[Parameter-Efficient Fine-Tuning<br/>PEFT]

    B --> B1[Updates all parameters<br/>Highest cost / best quality]

    C --> D[Adapter Methods]
    C --> E[Prefix/Prompt Methods]
    C --> F[LoRA Methods]

    D --> D1[Adapter Layers<br/>Small networks inserted between layers]
    E --> E1[Prefix Tuning<br/>P-Tuning v2]
    F --> F1[LoRA<br/>Low-rank matrix decomposition]
    F --> F2[QLoRA<br/>Quantization + LoRA]
    F --> F3[DoRA<br/>Weight-Decomposed LoRA]

    style B fill:#e74c3c,color:#fff
    style F1 fill:#2ecc71,color:#fff
    style F2 fill:#3498db,color:#fff

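Full Fine-Tuning vs. PEFT

Dimension	Full fine-tuning	LoRA	QLoRA
Trainable parameters	100%	~0.1-1%	~0.1-1%
GPU memory for a 7B model (rough)	~60GB (A100)	~16GB (V100)	~6GB (RTX 3060)
Training speed	Slow	Fast	Medium
Quality	Best	Close to full fine-tuning	Slightly below LoRA
Risk of catastrophic forgetting	High	Low	Low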

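The Math Behind LoRA

The core idea of LoRA: the weight update learned during fine-tuning has low intrinsic rank, so it can be approximated by the product of two small matrices.

Derivation

For a pretrained weight matrix W_0 (dimensions d x k):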

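Full fine-tuning: W = W_0 + ΔW    (ΔW is a d x k matrix with no rank constraint)

LoRA:             W = W_0 + BA    where B: d x r,  A: r x k,  r << min(d, k)

In practice the product BA is multiplied by a scaling factor α/r.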
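Written out for a single input, the forward pass of one adapted layer looks as follows. This is a restatement under the usual LoRA conventions rather than anything specific to this article: x is a column vector of size k, h has size d, and B is initialized to zero so that training starts exactly at the pretrained model.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

The number of trainable values drops from d·k for a full update to r·(d + k) for the pair (B, A).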

# LoRA implementation concept
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, original_linear: nn.Linear, rank: int = 8, alpha: float = 16):
        super().__init__()
        self.original = original_linear
        # Freeze all original parameters (weight and, if present, bias)
        for param in self.original.parameters():
            param.requires_grad = False

        d_out, d_in = original_linear.weight.shape

        # Low-rank matrices
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # r x d_in
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))         # d_out x r

        self.scaling = alpha / rank  # Scaling factor

    def forward(self, x):
        # Original output + LoRA adjustment
        original_out = self.original(x)
        lora_out = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return original_out + lora_out

graph LR
    subgraph "Original Layer"
        X[Input x] --> W["W₀ (d × k)<br/>Frozen"]
        W --> Y1[Output]
    end

    subgraph "LoRA Branch"
        X --> A["A (r × k)<br/>Trainable"]
        A --> B["B (d × r)<br/>Trainable"]
        B --> |"× α/r"| Y2[LoRA Output]
    end

    Y1 --> PLUS((+))
    Y2 --> PLUS
    PLUS --> Y[Final Output]

    style W fill:#95a5a6,color:#fff
    style A fill:#2ecc71,color:#fff
    style B fill:#2ecc71,color:#fff

Key Hyperparameters

lora_config = {
    "r": 16,           # Rank: higher means more capacity, but also more parameters
                       # Typical values: 4, 8, 16, 32, 64
    "alpha": 32,       # Scaling factor: commonly set to 2 * r
    "dropout": 0.05,   # LoRA dropout
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
        "gate_proj", "up_proj", "down_proj",      # MLP layers
    ],
}

# Trainable parameters calculation:
# For each target module (d x k):
#   LoRA params = r * (d + k)
# Example: Llama-2-7B attention q_proj (4096 x 4096), r=16:
#   LoRA params = 16 * (4096 + 4096) = 131,072  (vs 16,777,216 original)
#   Compression ratio: 128x
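The per-module formula above extends to a whole model by summing over every targeted layer. The sketch below does that for a Llama-2-7B sized model; the shapes (hidden size 4096, MLP intermediate size 11008, 32 decoder layers) are taken from the published model configuration and are assumptions of this example, and the helper names are made up for illustration.

```python
# Llama-2-7B shapes (assumed from the published config): hidden size 4096,
# MLP intermediate size 11008, 32 decoder layers.
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32

# (d_out, d_in) of each linear layer inside one decoder layer
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (INTERMEDIATE, HIDDEN),
    "up_proj": (INTERMEDIATE, HIDDEN),
    "down_proj": (HIDDEN, INTERMEDIATE),
}


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)


def total_lora_params(target_modules: list[str], r: int) -> int:
    """Sum LoRA parameters over all targeted modules in all layers."""
    per_layer = sum(lora_params(*MODULE_SHAPES[name], r) for name in target_modules)
    return per_layer * NUM_LAYERS


attention_only = total_lora_params(["q_proj", "v_proj"], r=16)
all_linear = total_lora_params(list(MODULE_SHAPES), r=16)
print(f"q/v only:   {attention_only:,}")
print(f"all linear: {all_linear:,}")
```

Targeting only q_proj and v_proj keeps the adapter several times smaller than targeting every linear layer, which is the trade-off behind the choice of target_modules.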

QLoRA: Quantized LoRA

QLoRA builds on LoRA by quantizing the frozen base model to 4 bits, which cuts memory use further:

graph TB
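    A[QLoRA Innovations] --> B[4-bit NormalFloat<br/>NF4 quantization]
    A --> C[Double Quantization]
    A --> D[Paged Optimizers]

    B --> B1[4-bit data type designed for<br/>normally distributed weights]
    C --> C1[The quantization constants<br/>are themselves quantized]
    D --> D1[Optimizer state is paged to CPU<br/>when GPU memory runs short]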

    style A fill:#3498db,color:#fff
    style B fill:#e74c3c,color:#fff
    style C fill:#f39c12,color:#000
    style D fill:#2ecc71,color:#fff
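To see why double quantization matters, it helps to count the bits spent on quantization constants. The sketch below uses the block sizes described in the QLoRA paper (64 weights per block, 32-bit constants, then 8-bit constants grouped in blocks of 256); treat those numbers as assumptions of this example, since library defaults may differ. The function names are hypothetical.

```python
def constant_overhead_bits(block_size: int, const_bits: int) -> float:
    """Extra bits per weight when each block stores one quantization constant."""
    return const_bits / block_size


def double_quant_overhead_bits(
    block_size: int, first_bits: int, second_block_size: int, second_bits: int
) -> float:
    """Extra bits per weight when the constants are themselves quantized.

    Each block keeps a `first_bits` constant, and every `second_block_size`
    of those constants share one `second_bits` constant.
    """
    return first_bits / block_size + second_bits / (block_size * second_block_size)


single = constant_overhead_bits(block_size=64, const_bits=32)
double = double_quant_overhead_bits(
    block_size=64, first_bits=8, second_block_size=256, second_bits=32
)

# Savings for a model with roughly 7 billion weights, in GB (1 GB = 1e9 bytes)
saved_gb = (single - double) * 7e9 / 8 / 1e9
print(f"overhead without double quantization: {single:.3f} bits/weight")
print(f"overhead with double quantization:    {double:.3f} bits/weight")
print(f"approximate savings for 7B weights:   {saved_gb:.2f} GB")
```

The savings are small next to the 4-bit weights themselves, but they come for free once the model is already quantized.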

from transformers import BitsAndBytesConfig
import torch

# QLoRA quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4-bit quantization
    bnb_4bit_quant_type="nf4",           # NormalFloat4 (better than FP4)
    bnb_4bit_compute_dtype=torch.bfloat16, # Computation in bfloat16
    bnb_4bit_use_double_quant=True,      # Double quantization
)

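# Approximate weight memory for Llama-2-7B (~7B parameters):
# FP32: ~28 GB
# FP16: ~14 GB
# 4-bit (QLoRA): ~3.5 GB
# + LoRA trainable params: ~0.1 GB
# Weights total: ~3.6 GB. Activations, gradients, and optimizer state come on top,
# so real training usage is higher, but with short sequences and small batches
# it can still fit on an RTX 3060 12GB.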
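The figures in the comment follow from simple arithmetic on the parameter count. Here is a minimal sketch of that calculation; it assumes a rounded 7 billion parameters, counts 1 GB as 1e9 bytes, and ignores activations. The 40 million figure is the approximate LoRA parameter count for r=16 on all seven projection types of a Llama-2-7B sized model. Both function names are hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def adamw_state_gb(trainable_params: float, bytes_per_value: int = 4) -> float:
    """AdamW keeps two moment estimates per trainable parameter."""
    return trainable_params * 2 * bytes_per_value / 1e9


NUM_PARAMS = 7e9  # rounded parameter count, an assumption for illustration

for label, bits in [("FP32", 32), ("FP16/BF16", 16), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")

# Optimizer state is where full fine-tuning and LoRA differ most
full_ft_optimizer = adamw_state_gb(NUM_PARAMS)  # every weight is trainable
lora_optimizer = adamw_state_gb(40e6)           # only the adapter is trainable
print(f"AdamW state, full fine-tuning: {full_ft_optimizer:.1f} GB")
print(f"AdamW state, LoRA adapter:     {lora_optimizer:.2f} GB")
```

An 8-bit optimizer would shrink the optimizer state further; pass bytes_per_value=1 to approximate that case.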

Data Preparation

Instruction-Tuning Data Formats

# Alpaca format
training_data = [
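    {
        "instruction": "Translate the following English text into Chinese",
        "input": "Machine learning is a subset of artificial intelligence.",
        "output": "机器学习是人工智能的一个子集。"
    },
    {
        "instruction": "Summarize the key points of the following article",
        "input": "RAG (retrieval-augmented generation) is a technique that combines an external knowledge base with a large language model...",
        "output": "1. RAG combines an LLM with an external knowledge base\n2. Retrieving relevant documents improves generation quality\n3. Compared with fine-tuning, it costs less and the data can be updated"
    },
]

# Chat messages format (preferred for chat models; rendered with the model's chat template, e.g. ChatML)
chat_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a professional technical documentation assistant."},
            {"role": "user", "content": "What is Docker?"},
            {"role": "assistant", "content": "Docker is a containerization platform..."}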
        ]
    },
]
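The two formats carry the same information, so an Alpaca-style record can be converted into a messages record mechanically. The sketch below shows one way to do it; joining the instruction and input with a blank line is a convention chosen for this example, and alpaca_to_messages is a hypothetical helper name.

```python
def alpaca_to_messages(example: dict, system_prompt: str = "") -> dict:
    """Convert one Alpaca-style record into a chat messages record."""
    user_content = example["instruction"]
    if example.get("input"):
        # Convention for this sketch: instruction, blank line, then the input
        user_content = f"{user_content}\n\n{example['input']}"

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"messages": messages}


sample = {
    "instruction": "Translate to French",
    "input": "Hello",
    "output": "Bonjour",
}
converted = alpaca_to_messages(sample, system_prompt="You are a translator.")
for message in converted["messages"]:
    print(message["role"], "->", repr(message["content"]))
```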

Data Quality Checks

import json
from datasets import Dataset

def validate_and_clean(data: list[dict]) -> list[dict]:
    """Validate and clean training data."""
    cleaned = []
    for item in data:
        # Normalize whitespace first so the checks below see the cleaned text
        instruction = (item.get("instruction") or "").strip()
        output = (item.get("output") or "").strip()

        # Check required fields
        if not instruction or not output:
            continue

        # Filter samples whose output is too short or too long (in characters)
        if len(output) < 10 or len(output) > 4096:
            continue

        # Store a copy so the caller's data is not mutated
        cleaned.append({**item, "instruction": instruction, "output": output})

    # Deduplicate
    seen = set()
    unique = []
    for item in cleaned:
        key = item["instruction"]
        if key not in seen:
            seen.add(key)
            unique.append(item)

    return unique

# Format for training
def format_prompt(example):
    """Format example into model's expected prompt format."""
    if example.get("input"):
        text = f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    else:
        text = f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""
    return {"text": text}

dataset = Dataset.from_list(training_data)
dataset = dataset.map(format_prompt)
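Training also needs a held-out evaluation set, which the cleaned data does not provide by itself. A minimal sketch of a reproducible split in plain Python follows; the 10% fraction, the seed, and the name train_eval_split are assumptions of this example.

```python
import random


def train_eval_split(
    examples: list[dict], eval_fraction: float = 0.1, seed: int = 42
) -> tuple[list[dict], list[dict]]:
    """Shuffle deterministically and hold out a fraction for evaluation."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    if len(examples) < 2:
        raise ValueError("need at least two examples to split")

    shuffled = list(examples)  # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)

    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


records = [{"instruction": f"question {i}", "output": f"answer {i}"} for i in range(20)]
train_records, eval_records = train_eval_split(records)
print(len(train_records), "training examples,", len(eval_records), "evaluation examples")
```

Splitting after deduplication matters: otherwise the same instruction can land in both sets and make the evaluation loss look better than it is.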

Training in Practice

Using Hugging Face + PEFT

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import torch

# 1. Load model with quantization
model_name = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,  # QLoRA 4-bit
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 2. Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# 3. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
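# With r=16 on all seven projection types, Llama-2-7B has 39,976,960 trainable LoRA
# parameters (about 0.6% of the ~6.7B total). The exact "all params" figure printed
# depends on the PEFT version and on how 4-bit weights are counted.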

# 4. Training arguments
training_args = TrainingArguments(
    output_dir="./output/llama2-7b-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    bf16=True,
    optim="paged_adamw_8bit",  # Memory-efficient optimizer
    gradient_checkpointing=True,
    max_grad_norm=0.3,
    report_to="wandb",
)

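# 5. Train
# Hold out 10% of the formatted dataset for evaluation
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset, eval_dataset = split["train"], split["test"]

# Note: these keyword arguments match older TRL releases; newer ones move
# max_seq_length, dataset_text_field, and packing into SFTConfig.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=training_args,
    max_seq_length=2048,
    dataset_text_field="text",
    packing=True,  # Pack multiple samples into one sequence
)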

trainer.train()

# 6. Save the LoRA adapter only (tens to a few hundred MB, depending on rank and dtype)
trainer.save_model("./output/llama2-7b-lora/final")
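The batch-related arguments above interact, and it is worth working out what they imply before launching a run. The sketch below does the arithmetic for a hypothetical dataset of 10,000 examples on a single GPU without packing; with packing=True the number of sequences, and therefore of steps, will be lower. The rounding is approximate and not guaranteed to match the Trainer exactly, and training_schedule is a made-up helper name.

```python
import math


def training_schedule(
    num_examples: int,
    per_device_batch_size: int,
    grad_accum_steps: int,
    num_epochs: int,
    warmup_ratio: float,
    num_devices: int = 1,
) -> dict:
    """Approximate optimizer-step counts implied by the training arguments."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Same values as the TrainingArguments above; 10_000 examples is hypothetical
plan = training_schedule(
    num_examples=10_000,
    per_device_batch_size=4,
    grad_accum_steps=4,
    num_epochs=3,
    warmup_ratio=0.03,
)
for name, value in plan.items():
    print(f"{name}: {value}")
```

Halving per_device_train_batch_size while doubling gradient_accumulation_steps leaves every number here unchanged, which is why that pair is the usual response to out-of-memory errors.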

Monitoring Training

graph LR
    A[Training Metrics] --> B[Loss Curve<br/>Training/validation loss]
    A --> C[Learning Rate<br/>Schedule over time]
    A --> D[GPU Memory<br/>VRAM usage]
    A --> E[Gradient Norm]

    B --> B1[Overfitting: train↓ val↑]
    B --> B1b[Underfitting: both stay high]
    C --> C1[Warmup + Cosine Decay]
    D --> D1[OOM → reduce batch_size<br/>and increase gradient_accumulation_steps]

    style B1 fill:#e74c3c,color:#fff
    style C1 fill:#2ecc71,color:#fff
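The "Warmup + Cosine Decay" curve in the diagram is easy to reproduce, which helps when checking that a logged learning rate looks right. The function below is a sketch of the common shape (linear warmup from zero, then half a cosine down to zero); it is not copied from any library, and the step counts in the example are arbitrary.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int, peak_lr: float = 2e-4) -> float:
    """Learning rate under linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000   # arbitrary example values
WARMUP_STEPS = 30

for step in (0, 15, 30, 500, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step, TOTAL_STEPS, WARMUP_STEPS):.2e}")
```

If the logged learning rate never reaches the configured peak, the warmup is probably longer than the run itself.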

Evaluation

from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

# Evaluation using lm-evaluation-harness
model_wrapper = HFLM(
    pretrained="meta-llama/Llama-2-7b-hf",   # base model
    peft="./output/llama2-7b-lora/final",    # LoRA adapter saved after training
    batch_size=8,
)

results = evaluator.simple_evaluate(
    model=model_wrapper,
    tasks=["mmlu", "hellaswag", "arc_challenge"],
    num_fewshot=5,
)

print(results["results"])

# Custom evaluation
def evaluate_custom(model, tokenizer, test_data):
    """Evaluate on custom test set."""
    correct = 0
    total = len(test_data)

    for item in test_data:
        prompt = format_prompt(item)["text"].rsplit("### Response:", 1)[0] + "### Response:\n"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=512,
                do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
            )

        response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

        # evaluate_response is a task-specific comparison you supply (e.g. exact match)
        if evaluate_response(response, item["output"]):
            correct += 1

    accuracy = correct / total
    print(f"Accuracy: {accuracy:.2%}")
    return accuracy
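The evaluate_response function called above is not defined in this article, and the right definition depends on the task. As one possible starting point, here is a normalized exact-match sketch; it is a hypothetical implementation, it only strips ASCII punctuation, and it is too strict for open-ended generation, where an overlap metric or a rubric is usually a better fit.

```python
import re
import string


def normalize_text(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def evaluate_response(response: str, reference: str) -> bool:
    """Hypothetical scorer: exact match after normalization."""
    return normalize_text(response) == normalize_text(reference)


print(evaluate_response("  Docker is a container platform. ", "docker is a container platform"))
print(evaluate_response("Docker is a virtual machine.", "docker is a container platform"))
```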

Deployment

Merging LoRA Weights

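import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "meta-llama/Llama-2-7b-hf"

# Load the base model in half precision (the adapter is merged into unquantized weights)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)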

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, "./output/llama2-7b-lora/final")
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./output/llama2-7b-merged")
tokenizer.save_pretrained("./output/llama2-7b-merged")

# Now deploy as a standard model (e.g., with vLLM)
# vllm serve ./output/llama2-7b-merged
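Merging works because the adapter is a linear term: adding the scaled product of B and A into W_0 once gives the same output as computing the LoRA branch on every forward pass. The toy example below checks this with tiny hand-written matrices in plain Python; the values and helper functions are made up for illustration, and real models will show small differences from floating-point rounding.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def scale(a, factor):
    """Multiply every entry of a matrix by a scalar."""
    return [[x * factor for x in row] for row in a]


W0 = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # frozen weight, d=2, k=3
B = [[0.5], [-1.0]]                        # d x r with r=1
A = [[1.0, 0.0, 2.0]]                      # r x k
scaling = 2.0                              # alpha / r
x = [[1.0], [2.0], [3.0]]                  # input as a k x 1 column vector

# Unmerged: base output plus the scaled LoRA branch
unmerged = add(matmul(W0, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged: fold the adapter into the weight once, then do a single matmul
W_merged = add(W0, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print("unmerged:", unmerged)
print("merged:  ", merged)
```

After merging, the model has the same shape and inference cost as the original base model, which is why it can be served like any standard checkpoint.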

Multi-LoRA Serving

# vLLM supports serving multiple LoRA adapters on one base model
# Start server:
# vllm serve meta-llama/Llama-2-7b-hf \
#   --enable-lora \
#   --lora-modules \
#     legal=./adapters/legal \
#     medical=./adapters/medical \
#     code=./adapters/code

# Client request with specific adapter:
import openai

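# The client requires some api_key; vLLM only checks it if the server was started with --api-key
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Use legal adapter
response = client.chat.completions.create(
    model="legal",  # LoRA adapter name
    messages=[{"role": "user", "content": "Analyze this contract clause..."}],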
)

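Common Problems and Tuning

Problem	Likely cause	Fix
Training does not converge	Learning rate too high or too low	Start from 2e-4 and adjust
Overfitting	Too little data / too many epochs	Add data or reduce epochs
OOM	batch_size too large	Reduce batch_size and increase gradient_accumulation_steps
Poor generation quality	Low-quality data	Clean the data and increase its diversity
Catastrophic forgetting	LoRA rank too high	Lower r and add regularization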
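The overfitting row can be turned into an automatic check on the evaluation loss. The sketch below flags a run once the most recent evaluations have all failed to beat the best earlier loss; the patience value and the function name are assumptions of this example, and transformers ships an EarlyStoppingCallback that serves the same purpose inside the Trainer.

```python
def should_stop_early(eval_losses: list[float], patience: int = 2, min_delta: float = 0.0) -> bool:
    """True if the last `patience` evaluations did not improve on the best earlier loss."""
    if len(eval_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(eval_losses[:-patience])
    recent = eval_losses[-patience:]
    return all(loss > best_before - min_delta for loss in recent)


# Hypothetical per-epoch evaluation losses
still_improving = [1.20, 1.00, 0.90, 0.85]
overfitting = [1.20, 1.00, 0.90, 0.95, 1.00]

print("still improving ->", should_stop_early(still_improving))
print("overfitting     ->", should_stop_early(overfitting))
```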

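Summary

LoRA and QLoRA make it possible to fine-tune large language models on consumer GPUs. Key takeaways:

Start with QLoRA: about 6GB of VRAM is enough to fine-tune a 7B model
Data quality beats data quantity: 1,000 high-quality examples are worth more than 10,000 low-quality ones
Start with a small rank: r=8 or r=16 is usually enough; increase it only if results fall short
Choosing target_modules: include at least the attention q/v projections, and add the MLP layers as needed
Merge before deploying: an unmerged LoRA adapter adds inference overhead, so merge the weights for production when serving a single adapter
Multi-LoRA serving: vLLM can switch between LoRA adapters dynamically on a single base model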