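Local LLM Deployment: Advanced vLLM and Ollama

Introduction

With open-source LLMs such as Llama, Qwen, Mistral, and DeepSeek maturing quickly, more teams are choosing to run models locally: data stays under their control, there are no API fees, and the stack can be customized in depth. This article takes a close look at two mainstream options, vLLM (aimed at high-performance production serving) and Ollama (aimed at personal development and prototyping), and covers how PagedAttention works, continuous batching, quantization formats (GGUF/GPTQ/AWQ), GPU memory optimization, and performance benchmarking.

Deployment Landscape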

graph TB
    A[Local LLM Deployment] --> B[Production-Grade]
    A --> C[Development-Grade]
    A --> D[Edge Deployment]

    B --> B1[vLLM<br/>PagedAttention<br/>High throughput]
    B --> B2[TGI<br/>HuggingFace<br/>Text Generation Inference]
    B --> B3[TensorRT-LLM<br/>NVIDIA-optimized]

    C --> C1[Ollama<br/>One-command setup<br/>Simplest]
    C --> C2[llama.cpp<br/>CPU/GPU inference<br/>Quantized models]

    D --> D1[MLC-LLM<br/>Phones/IoT]

    style B1 fill:#e74c3c,color:#fff
    style C1 fill:#2ecc71,color:#fff
    style B3 fill:#76b900,color:#fff

vLLM in Depth

How PagedAttention Works

PagedAttention is vLLM's core innovation. It borrows the virtual-memory and paging mechanisms of operating systems to manage the KV cache:

graph TB
    subgraph "Traditional: contiguous allocation"
        A1["Request 1 KV-Cache<br/>Pre-allocated for max_seq_len"] --> B1[Heavy internal fragmentation<br/>60-80% wasted]
    end

    subgraph "PagedAttention: paged management"
        A2["Request 1"] --> C1["Block 1"]
        A2 --> C2["Block 2"]
        A2 --> C3["Block 3"]
        A3["Request 2"] --> C4["Block 4"]
        A3 --> C5["Block 5"]
        C1 --> D[Block Table<br/>Maps logical blocks to physical blocks]
        C2 --> D
        C4 --> D
        D --> E[GPU Memory<br/>Allocated on demand<br/>Near-zero waste]
    end

    style B1 fill:#e74c3c,color:#fff
    style E fill:#2ecc71,color:#fff

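The problem with the traditional approach: each request pre-allocates KV-cache memory for the maximum sequence length. With max_seq_len=2048 and only 200 tokens actually used, about 90% of that reservation is wasted.

How PagedAttention solves it: the KV cache is split into fixed-size blocks (for example, 16 tokens per block) that are allocated on demand, much like virtual memory pages in an operating system.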

# PagedAttention key concepts
# 1. Block: fixed-size KV-Cache unit (e.g., 16 tokens)
# 2. Block Table: maps logical blocks to physical GPU memory blocks
# 3. Copy-on-Write: for parallel sampling, share blocks until modification

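# Memory savings example:
# Model: Llama-2-7B (32 layers, 32 KV heads, head_dim 128, FP16), max_seq_len=2048, batch_size=32
# Traditional: batch * seq_len * 2 (K and V) * layers * heads * head_dim * 2 bytes
#            = 32 * 2048 * 2 * 32 * 32 * 128 * 2 bytes = 32 GiB
# PagedAttention: only allocate for actual tokens, ~5-8 GB typical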
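The block-table idea can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not vLLM's implementation: `BlockAllocator`, `Sequence`, and `wasted_slots` are hypothetical names, physical blocks are just integers, and copy-on-write is left out. It replays the 200-token request from the text with 16-token blocks.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block, matching the example in the text


@dataclass
class BlockAllocator:
    """Hands out physical block ids from a fixed pool (a stand-in for GPU memory)."""
    num_blocks: int
    free: list[int] = field(init=False)

    def __post_init__(self) -> None:
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV-cache blocks")
        return self.free.pop(0)

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


@dataclass
class Sequence:
    """One request: its token count and its logical-to-physical block table."""
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1


def wasted_slots(seq: Sequence) -> int:
    """Token slots reserved but unused; always less than one block."""
    return len(seq.block_table) * BLOCK_SIZE - seq.num_tokens


allocator = BlockAllocator(num_blocks=256)
seq = Sequence()
for _ in range(200):  # the 200-token request from the text
    seq.append_token(allocator)

print(f"blocks used: {len(seq.block_table)}, wasted slots: {wasted_slots(seq)}")
print(f"waste with a 2048-token pre-allocation: {1 - 200 / 2048:.0%}")
```

With paging, the waste per request is bounded by one partially filled block, regardless of how far the request falls short of max_seq_len.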

Continuous Batching

sequenceDiagram
    participant S as Server
    participant R1 as Request 1 (Short)
    participant R2 as Request 2 (Long)
    participant R3 as Request 3 (New)

    Note over S: Iteration 1
    S->>R1: Generate token
    S->>R2: Generate token

    Note over S: Iteration 2
    S->>R1: Generate token (DONE)
    S->>R2: Generate token

    Note over S: Iteration 3 - R1 finished, R3 fills slot
    S->>R3: Generate token (NEW - immediate)
    S->>R2: Generate token

    Note over S: No waiting for batch to complete!

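Traditional static batching: the server must wait for every request in the batch to finish before it can admit new ones.

Continuous batching: a finished request releases its resources immediately, and a new request joins at the next iteration without waiting.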
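A small simulation makes the difference concrete. This is a sketch under simplifying assumptions: every decoding iteration costs the same, prefill is ignored, and the function names and the workload are hypothetical. It counts how many iterations each policy needs to drain the same queue.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A finished request frees its slot for the next waiting request at once."""
    waiting = list(lengths)
    running: list[int] = []  # remaining output tokens per running request
    steps = 0
    while waiting or running:
        # Fill free slots before every iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        # One decoding iteration: every running request emits one token;
        # requests that just emitted their last token leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
        steps += 1
    return steps


lengths = [2, 10, 3, 3]  # output tokens per request (hypothetical workload)
print(static_batching_steps(lengths, batch_size=2))      # 13
print(continuous_batching_steps(lengths, batch_size=2))  # 10
```

In the static case the short request's slot sits idle for eight iterations while the long one finishes; in the continuous case the two later requests reuse that slot, so the total is bounded by the longest request.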

vLLM Production Deployment

# Install vLLM
pip install vllm

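# Start OpenAI-compatible API server
# (comments cannot follow a trailing backslash, so the flags are explained here)
#   --tensor-parallel-size 2       shard the model across 2 GPUs
#   --max-model-len 8192           maximum context length
#   --gpu-memory-utilization 0.9   let vLLM use up to 90% of each GPU's memory
#   --enable-prefix-caching        reuse the KV cache of shared prompt prefixes
# To serve an AWQ-quantized checkpoint instead, point vllm serve at it and add
# --quantization awq; the flag does not quantize an FP16 checkpoint on the fly.
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --enable-prefix-caching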

# vLLM Python API for batch inference
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    max_model_len=8192,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
    repetition_penalty=1.1,
    stop=["<|endoftext|>"],
)

# Batch inference
prompts = [
    "Explain quantum computing in simple terms.",
    "Write a Python function to sort a list.",
    "What are the benefits of microservices?",
]

# Time the whole batch: the prompts are decoded together, so throughput is an aggregate figure
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

total_tokens = 0
for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Generated: {output.outputs[0].text}")
    print("---")
    total_tokens += len(output.outputs[0].token_ids)

print(f"Aggregate throughput: {total_tokens / elapsed:.1f} tokens/s")

vLLM + OpenAI-Compatible API

# Client code — fully compatible with OpenAI SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM doesn't require a real key
)

# Chat completion
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a professional technical consultant."},
        {"role": "user", "content": "How do I design a highly available microservice architecture?"},
    ],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Advanced Ollama

Basic Usage

# Install Ollama (Linux; on macOS and Windows, use the installer from the Ollama website)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run a model
ollama pull qwen2.5:7b
ollama run qwen2.5:7b "What is a Transformer?"

# List locally downloaded models
ollama list

# Show model details
ollama show qwen2.5:7b

Customizing with a Modelfile

# Modelfile — custom model configuration
FROM qwen2.5:7b

# System prompt
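SYSTEM """You are a Python programming expert focused on backend development.
Your answers should:
1. Include runnable code examples
2. Use Python 3.10+ syntax
3. Follow PEP 8
4. Explain key design decisions"""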

# Parameter customization
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|endoftext|>"

# Template customization (Go template syntax)
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""

# Build custom model
ollama create python-expert -f ./Modelfile

# Run custom model
ollama run python-expert "How do I implement JWT authentication with FastAPI?"

Ollama REST API

import httpx

OLLAMA_BASE = "http://localhost:11434"

# Generate completion
response = httpx.post(
    f"{OLLAMA_BASE}/api/generate",
    json={
        "model": "qwen2.5:7b",
        "prompt": "Explain Docker in 3 sentences.",
        "stream": False,
        "options": {
            "temperature": 0.7,
            "num_ctx": 4096,
        },
    },
    timeout=60.0,
)

print(response.json()["response"])

# Chat API
response = httpx.post(
    f"{OLLAMA_BASE}/api/chat",
    json={
        "model": "qwen2.5:7b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is Kubernetes?"},
        ],
        "stream": False,
    },
    timeout=60.0,
)

print(response.json()["message"]["content"])

# Embeddings
response = httpx.post(
    f"{OLLAMA_BASE}/api/embed",
    json={
        "model": "nomic-embed-text",
        "input": ["Hello world", "Machine learning is great"],
    },
)

embeddings = response.json()["embeddings"]
print(f"Embedding dimension: {len(embeddings[0])}")

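Quantization Techniques Compared

graph TB
    A[Model Quantization] --> B[GGUF<br/>llama.cpp format]
    A --> C[GPTQ<br/>GPU quantization]
    A --> D[AWQ<br/>Activation-aware]
    A --> E[GGML Legacy]

    B --> B1["CPU + GPU hybrid inference<br/>Flexible quantization levels<br/>Q4_K_M / Q5_K_M / Q8_0"]

    C --> C1["GPU-only inference<br/>4-bit quantization<br/>Requires calibration data"]

    D --> D1["GPU-only inference<br/>Activation-based quantization<br/>Typically more accurate than GPTQ"]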

    style B fill:#3498db,color:#fff
    style C fill:#e74c3c,color:#fff
    style D fill:#2ecc71,color:#fff

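Quantization Levels in Detail

Format	Bits per weight	7B model size	Speed	Accuracy loss
FP16	16	~14 GB	Baseline	None
Q8_0	8	~7 GB	Fast	Minimal
Q5_K_M	5.5	~5 GB	Fast	Small
Q4_K_M	4.5	~4.4 GB	Fast	Small to medium
Q4_0	4	~4 GB	Fastest	Medium
Q3_K_M	3.5	~3.3 GB	Fast	Medium to large
Q2_K	2.5	~2.7 GB	Fast	Large
GPTQ 4-bit	4	~4 GB	Fast (GPU)	Small
AWQ 4-bit	4	~4 GB	Fast (GPU)	Minimal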
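The size column follows roughly from parameters times bits per weight divided by 8. The helper below is a back-of-the-envelope sketch (`model_size_gb` is a hypothetical name) that counts exactly 7 billion weights and nothing else, so it will likely come out somewhat below the file sizes in the table: real checkpoints carry metadata, and some tensors are usually kept at higher precision.

```python
def model_size_gb(num_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): params * bits / 8."""
    return num_params_billion * bits_per_weight / 8


for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q5_K_M", 5.5), ("Q4_K_M", 4.5), ("Q4_0", 4)]:
    print(f"{name:8s} {model_size_gb(7, bits):.1f} GB")
```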

GGUF Quantization in Practice

# Using llama.cpp to quantize a model
# 1. Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py ./Qwen2.5-7B --outfile qwen2.5-7b-f16.gguf --outtype f16

# 2. Quantize to Q4_K_M
./llama-quantize qwen2.5-7b-f16.gguf qwen2.5-7b-q4_k_m.gguf Q4_K_M

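# 3. Run with llama.cpp
# (comments cannot follow a trailing backslash, so the flags are explained here)
#   -ngl 35       offload 35 layers to the GPU
#   -c 8192       context length
#   --threads 8   CPU threads
./llama-server \
  -m qwen2.5-7b-q4_k_m.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35 \
  -c 8192 \
  --threads 8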

AWQ Quantization

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-7B-Instruct"
quant_path = "qwen2.5-7b-awq"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",  # or "GEMV" for batch_size=1
}

model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data="pileval",  # Calibration data: AutoAWQ's default set; a list of text samples also works
)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Deploy with vLLM
# vllm serve ./qwen2.5-7b-awq --quantization awq
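To show what `w_bit`, `q_group_size`, and `zero_point` control, here is a toy group-wise quantizer in plain Python. It uses simple round-to-nearest and is not the AWQ algorithm: AWQ additionally rescales weight channels based on activation statistics before rounding, which this sketch omits. All function names are hypothetical, and the weights are synthetic.

```python
def quantize_group(weights: list[float], w_bit: int = 4) -> tuple[list[int], float, int]:
    """Round-to-nearest quantization of one group with a scale and an integer zero point."""
    q_max = 2**w_bit - 1  # 15 for 4-bit codes
    # Include 0 in the range so that zero is exactly representable.
    lo, hi = min(min(weights), 0.0), max(max(weights), 0.0)
    scale = (hi - lo) / q_max or 1.0  # fall back to 1.0 for an all-zero group
    zero_point = round(-lo / scale)
    codes = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def quantize(weights: list[float], q_group_size: int = 128, w_bit: int = 4):
    """Quantize a flat weight list group by group; each group gets its own scale."""
    return [
        quantize_group(weights[i:i + q_group_size], w_bit)
        for i in range(0, len(weights), q_group_size)
    ]


def dequantize(groups) -> list[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zp) * scale for codes, scale, zp in groups for c in codes]


# Synthetic weights in [-0.5, 0.5]; two groups of 128 values.
weights = [((i * 37) % 101 - 50) / 100 for i in range(256)]
groups = quantize(weights, q_group_size=128, w_bit=4)
restored = dequantize(groups)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups: {len(groups)}, max abs error: {max_error:.4f}")
```

Smaller groups mean more scales to store but a tighter range per group, which is the trade-off behind the `q_group_size` setting.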

GPU Memory Optimization

graph TD
    A[GPU Memory Allocation] --> B[Model Weights]
    A --> C[KV Cache<br/>Attention cache]
    A --> D[Activations]

    B --> B1["7B FP16: ~14GB<br/>7B INT4: ~4GB"]
    C --> C1["Depends on batch_size<br/>and seq_len"]
    D --> D1["Allocated temporarily<br/>during the forward pass"]

    style B fill:#e74c3c,color:#fff
    style C fill:#3498db,color:#fff
    style D fill:#f39c12,color:#000

Memory Estimation Formula

def estimate_gpu_memory(
    num_params_billion: float,
    precision_bits: int = 16,
    kv_bits: int = 16,
    batch_size: int = 1,
    seq_len: int = 2048,
    num_layers: int = 32,
    hidden_size: int = 4096,
    num_kv_heads: int = 8,
    head_dim: int = 128,
):
    """Estimate GPU memory requirements for LLM inference."""

    # Model weights
    weight_bytes = num_params_billion * 1e9 * (precision_bits / 8)

    # KV Cache (per token per layer: 2 * num_kv_heads * head_dim * kv_bytes).
    # Weight-only quantization leaves the KV cache at FP16 unless it is
    # quantized separately, so it has its own precision parameter.
    kv_per_token = 2 * num_kv_heads * head_dim * (kv_bits / 8)
    kv_cache_bytes = batch_size * seq_len * num_layers * kv_per_token

    # Activation memory (rough estimate)
    activation_bytes = batch_size * seq_len * hidden_size * 4  # FP32 activations

    total_gb = (weight_bytes + kv_cache_bytes + activation_bytes) / (1024**3)

    print(f"Model weights: {weight_bytes / (1024**3):.1f} GB")
    print(f"KV Cache: {kv_cache_bytes / (1024**3):.1f} GB")
    print(f"Activations: {activation_bytes / (1024**3):.1f} GB")
    print(f"Total estimate: {total_gb:.1f} GB")

    return total_gb

# Example: a 7B model with INT4 weights and an FP16 KV cache.
# The default shape values are illustrative; read the real ones from the model's config.json.
estimate_gpu_memory(
    num_params_billion=7,
    precision_bits=4,
    batch_size=8,
    seq_len=4096,
)
# Model weights: 3.3 GB
# KV Cache: 4.0 GB
# Activations: 0.5 GB
# Total estimate: 7.8 GB
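The same arithmetic can be turned around to ask how many full-length sequences fit in the memory left after the weights are loaded. This is a rough sketch, assuming an FP16 KV cache (2 bytes per value) and ignoring activations and runtime overhead; `max_concurrent_sequences` and its parameters are hypothetical names, and the 24 GiB card is only an example.

```python
def max_concurrent_sequences(
    gpu_memory_gib: float,
    gpu_memory_utilization: float,
    weight_gib: float,
    seq_len: int,
    num_layers: int = 32,
    num_kv_heads: int = 8,
    head_dim: int = 128,
    kv_bytes: int = 2,  # FP16 KV cache
) -> int:
    """How many sequences of length seq_len fit in the KV-cache budget."""
    budget_gib = gpu_memory_gib * gpu_memory_utilization - weight_gib
    # K and V, for every layer and KV head, per token
    kv_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_bytes
    kv_per_seq_gib = seq_len * kv_per_token / 1024**3
    return max(0, int(budget_gib / kv_per_seq_gib))


# A 24 GiB GPU at 90% utilization, 4 GiB of INT4 weights, 8192-token contexts
print(max_concurrent_sequences(24, 0.9, 4, 8192))  # 17
```

Because a paged KV cache only holds the tokens actually generated, a real server can usually admit more concurrent requests than this worst-case figure suggests.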

Tensor Parallelism

# Split model across multiple GPUs
# vLLM: automatic tensor parallelism (--tensor-parallel-size 4 uses 4 GPUs)
vllm serve Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95

graph LR
    subgraph "Tensor Parallelism (4 GPUs)"
        A[Input] --> G1[GPU 0<br/>All layers<br/>Shard 1 of 4]
        A --> G2[GPU 1<br/>All layers<br/>Shard 2 of 4]
        A --> G3[GPU 2<br/>All layers<br/>Shard 3 of 4]
        A --> G4[GPU 3<br/>All layers<br/>Shard 4 of 4]

        G1 --> R[AllReduce<br/>Sync]
        G2 --> R
        G3 --> R
        G4 --> R
        R --> O[Output]
    end

    style G1 fill:#3498db,color:#fff
    style G2 fill:#e74c3c,color:#fff
    style G3 fill:#2ecc71,color:#fff
    style G4 fill:#f39c12,color:#000
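The split-then-AllReduce pattern can be checked with ordinary lists. The sketch below is a toy illustration, not vLLM code: it shards a two-matrix block across 4 simulated GPUs in the commonly used way (first weight matrix split by columns, second by rows), sums the partial outputs as an AllReduce would, and compares against the unsharded result. All names and matrices are hypothetical, and the nonlinearity between the two matrices is omitted.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m: Matrix, parts: int) -> list[Matrix]:
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m: Matrix, parts: int) -> list[Matrix]:
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials: list[Matrix]) -> Matrix:
    """Element-wise sum of every GPU's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1.0, 2.0]]                                         # 1 x 2 input
w1 = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]        # 2 x 4
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]    # 4 x 2

num_gpus = 4
w1_shards = split_columns(w1, num_gpus)  # each GPU holds a column slice of w1
w2_shards = split_rows(w2, num_gpus)     # and the matching row slice of w2

# Each GPU works only on its own shards; no communication until the final sum.
partials = [matmul(matmul(x, a), b) for a, b in zip(w1_shards, w2_shards)]
z_parallel = all_reduce_sum(partials)

print(z_parallel)                       # [[19.0, 6.0]]
print(matmul(matmul(x, w1), w2))        # [[19.0, 6.0]]
```

Every simulated GPU takes part in every layer, which is what distinguishes tensor parallelism from assigning whole layers to different devices.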

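Performance Benchmarking

import time
from concurrent.futures import ThreadPoolExecutor

import httpx

def benchmark_throughput(
    base_url: str,
    model: str,
    num_requests: int = 100,
    concurrency: int = 8,
    prompt_tokens: int = 128,
    max_tokens: int = 256,
):
    """Benchmark LLM serving throughput with `concurrency` requests in flight."""
    # Rough prompt of about `prompt_tokens` tokens; the exact count depends on the tokenizer
    prompt = "Explain the concept of " * (prompt_tokens // 5)

    def send_one(client: httpx.Client) -> tuple[int, float]:
        """Send one request; return (completion tokens, latency in seconds)."""
        t0 = time.perf_counter()
        response = client.post(
            f"{base_url}/v1/completions",
            json={
                "model": model,
                "prompt": prompt,
                "max_tokens": max_tokens,
                "temperature": 0.7,
            },
        )
        response.raise_for_status()
        tokens = response.json()["usage"]["completion_tokens"]
        return tokens, time.perf_counter() - t0

    start = time.perf_counter()
    # Sequential requests never exercise continuous batching, so keep several in flight
    with httpx.Client(timeout=120.0) as client, ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: send_one(client), range(num_requests)))
    elapsed = time.perf_counter() - start

    total_output_tokens = sum(tokens for tokens, _ in results)
    avg_latency = sum(latency for _, latency in results) / num_requests

    print(f"Total time: {elapsed:.1f}s")
    print(f"Requests/s: {num_requests / elapsed:.1f}")
    print(f"Output tokens/s: {total_output_tokens / elapsed:.1f}")
    print(f"Avg latency: {avg_latency * 1000:.0f}ms")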

# Benchmark
benchmark_throughput(
    base_url="http://localhost:8000",
    model="Qwen/Qwen2.5-7B-Instruct",
    num_requests=50,
)
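Throughput is only half of the picture; time to first token (TTFT) has to be measured on a streamed response. The helper below is a minimal sketch with hypothetical names: `measure_stream` times any callable that opens a stream of text chunks, and `fake_stream` stands in for a server so the example runs without one.

```python
import time
from collections.abc import Callable, Iterable


def measure_stream(open_stream: Callable[[], Iterable[str]]) -> dict[str, float]:
    """Time a streamed response: TTFT, total latency, and number of non-empty chunks."""
    start = time.perf_counter()
    first_chunk_at = None
    num_chunks = 0
    for chunk in open_stream():  # the request is issued inside the timed region
        if not chunk:
            continue
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        num_chunks += 1
    total = time.perf_counter() - start
    ttft = first_chunk_at - start if first_chunk_at is not None else float("nan")
    return {"ttft_s": ttft, "total_s": total, "chunks": num_chunks}


def fake_stream(prefill_s: float = 0.03, per_token_s: float = 0.01, n: int = 5):
    """Stand-in for a server: one prefill delay, then n chunks at a steady pace."""
    time.sleep(prefill_s)
    for i in range(n):
        yield f"tok{i} "
        time.sleep(per_token_s)


stats = measure_stream(fake_stream)
print(f"TTFT: {stats['ttft_s'] * 1000:.0f}ms, total: {stats['total_s'] * 1000:.0f}ms, "
      f"chunks: {stats['chunks']}")
```

Against a real server, `open_stream` would wrap a streaming chat-completion call and yield each chunk's text; passing a callable rather than an already opened stream keeps the request itself inside the timed region.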

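Reference Performance by Stack

Stack	7B Q4 throughput	TTFT	Deployment complexity	Best for
vLLM	~3000 tok/s	~50ms	Medium	Production API serving
TGI	~2500 tok/s	~60ms	Medium	Production API serving
Ollama	~80 tok/s	~200ms	Low	Development/personal use
llama.cpp	~100 tok/s	~150ms	Low	CPU inference/edge
TensorRT-LLM	~4000 tok/s	~30ms	High	Maximum NVIDIA optimization

These are reference values for a single A100 80GB GPU; actual results depend on hardware and configuration.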

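Choosing a Stack

flowchart TD
    A[Local LLM deployment needs] --> B{Use case?}

    B -->|Personal dev / prototyping| C[Ollama<br/>One-command install, simplest]
    B -->|Production API service| D{GPU?}
    D -->|NVIDIA| E{Budget?}
    E -->|Multi-GPU A100/H100| F[vLLM + Tensor Parallel<br/>Highest throughput]
    E -->|Single consumer GPU| G[vLLM + AWQ quantization<br/>Best price-performance]
    D -->|No GPU / Apple Silicon| H[Ollama / llama.cpp<br/>CPU + Metal]

    B -->|Maximum performance| I[TensorRT-LLM<br/>Deep NVIDIA optimization]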

    style C fill:#2ecc71,color:#fff
    style F fill:#e74c3c,color:#fff
    style G fill:#3498db,color:#fff
    style H fill:#f39c12,color:#000
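The flowchart can also be written down as a small function, which is handy for a setup script or internal tooling. This is only a transcription of the chart above; `recommend_stack` and the string labels it accepts are hypothetical.

```python
def recommend_stack(use_case: str, gpu: str = "none", multi_gpu_datacenter: bool = False) -> str:
    """Follow the decision flowchart: use case first, then GPU type, then budget."""
    if use_case == "prototype":
        return "Ollama"
    if use_case == "max_performance":
        return "TensorRT-LLM"
    if use_case == "production_api":
        if gpu == "nvidia":
            # Multi-GPU A100/H100 budget vs. a single consumer card
            if multi_gpu_datacenter:
                return "vLLM + tensor parallelism"
            return "vLLM + AWQ quantization"
        # No GPU or Apple Silicon
        return "Ollama / llama.cpp"
    raise ValueError(f"unknown use case: {use_case!r}")


print(recommend_stack("production_api", gpu="nvidia"))
print(recommend_stack("production_api", gpu="apple_silicon"))
```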

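Summary

The right local LLM deployment stack depends on the use case:

Quick start: Ollama installs with one command, supports macOS/Linux/Windows, and suits development and experimentation
Production serving: vLLM's PagedAttention and continuous batching deliver high throughput for API workloads
Quantization choice: AWQ > GPTQ > GGUF Q4_K_M (on GPUs, AWQ typically loses the least accuracy)
Memory optimization: 4-bit quantization shrinks a 7B model from about 14 GB to about 4 GB, so it fits on a single GPU
Multi-GPU: use tensor parallelism to spread 70B+ models across several GPUs
Apple Silicon: Ollama supports Metal acceleration natively, and an M2/M3 Max can run 7B-14B models smoothly

Always benchmark first (TTFT, throughput, memory usage), then choose the stack that meets your workload's latency and QPS requirements.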