AI · #llm#vllm#ollama#deployment

Local LLM Deployment: Advanced vLLM and Ollama

2025.11.12 8 min 3.2k

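Introduction

With open-source LLMs such as Llama, Qwen, Mistral, and DeepSeek maturing quickly, more and more teams are choosing to deploy LLMs locally: data privacy stays under their control, there are no API fees, and the stack can be customized in depth. This article takes a close look at two mainstream deployment options, vLLM (aimed at high-performance production environments) and Ollama (aimed at individual development and prototyping). It covers how PagedAttention works, continuous batching, quantization techniques (GGUF/GPTQ/AWQ), GPU memory optimization, and performance benchmarking.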

Deployment Landscape

graph TB
    A[Local LLM deployment] --> B[Production-grade]
    A --> C[Development-grade]
    A --> D[Edge deployment]

    B --> B1[vLLM<br/>PagedAttention<br/>High throughput]
    B --> B2[TGI<br/>HuggingFace<br/>Text Generation Inference]
    B --> B3[TensorRT-LLM<br/>NVIDIA-optimized]

    C --> C1[Ollama<br/>One-command run<br/>Simplest]
    C --> C2[llama.cpp<br/>CPU/GPU inference<br/>Quantized models]

    D --> D1[MLC-LLM<br/>Phones/IoT]

    style B1 fill:#e74c3c,color:#fff
    style C1 fill:#2ecc71,color:#fff
    style B3 fill:#76b900,color:#fff

vLLM in Depth

How PagedAttention Works

PagedAttention is vLLM's core innovation: it borrows the virtual memory and paging mechanisms of operating systems to manage the KV-Cache:

graph TB
    subgraph "Traditional: contiguous allocation"
        A1["Request 1 KV-Cache<br/>Pre-allocated to max_seq_len"] --> B1[Heavy internal fragmentation<br/>60-80% wasted]
    end

    subgraph "PagedAttention: paged management"
        A2["Request 1"] --> C1["Block 1"]
        A2 --> C2["Block 2"]
        A2 --> C3["Block 3"]
        A3["Request 2"] --> C4["Block 4"]
        A3 --> C5["Block 5"]
        C1 --> D[Block Table<br/>Maps logical blocks to physical blocks]
        C2 --> D
        C3 --> D
        C4 --> D
        C5 --> D
        D --> E[GPU Memory<br/>Allocated on demand<br/>Near-zero waste]
    end

    style B1 fill:#e74c3c,color:#fff
    style E fill:#2ecc71,color:#fff

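The problem with the traditional approach: each request must pre-allocate KV-Cache memory for the maximum sequence length. If max_seq_len=2048 but the request actually uses only 200 tokens, about 90% of that memory is wasted.

How PagedAttention solves it: the KV-Cache is split into fixed-size blocks (for example, 16 tokens per block) that are allocated on demand, like virtual memory pages in an operating system.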
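
The on-demand allocation described above can be sketched with a toy block table. This is a minimal illustration, not vLLM's implementation: `BlockAllocator` and `Sequence` are hypothetical names, and real block management also handles sharing, copy-on-write, and eviction.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (the example size used in the text)


@dataclass
class BlockAllocator:
    """Toy pool of physical KV-cache blocks (hypothetical, not vLLM's class)."""
    num_blocks: int
    free: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


@dataclass
class Sequence:
    """One request: its token count and its logical-to-physical block table."""
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1


allocator = BlockAllocator(num_blocks=8)
seq = Sequence()
for _ in range(40):  # 40 tokens need ceil(40 / 16) = 3 blocks
    seq.append_token(allocator)

print(len(seq.block_table), len(allocator.free))  # 3 5
allocator.release(seq.block_table)  # request finished: blocks go back to the pool
```

The only waste is the unused tail of the last block, at most `BLOCK_SIZE - 1` token slots per request.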

# PagedAttention key concepts
# 1. Block: fixed-size KV-Cache unit (e.g., 16 tokens)
# 2. Block Table: maps logical blocks to physical GPU memory blocks
# 3. Copy-on-Write: for parallel sampling, share blocks until modification

# Memory savings example:
# Model: Llama-2-7B (32 layers, 32 heads, head_dim=128, FP16), max_seq_len=2048, batch_size=32
# Traditional: 32 seqs * 2048 tokens * 2 (K and V) * 32 layers * 32 heads * 128 dims * 2 bytes = ~32 GB
# PagedAttention: only allocate for actual tokens, ~5-8 GB typical
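
The arithmetic in the memory example can be checked with a few lines of Python. The Llama-2-7B shape (32 layers, 32 heads, head dimension 128, FP16) is standard; the average of 400 real tokens per request is a hypothetical value chosen only to show how a figure in the 5-8 GB range can arise.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV-cache bytes for one token: K and V, for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


GIB = 1024 ** 3
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

batch_size, max_seq_len = 32, 2048
avg_real_tokens = 400  # hypothetical average actually used per request

preallocated = batch_size * max_seq_len * per_token   # reserve max_seq_len per request
paged = batch_size * avg_real_tokens * per_token      # allocate only what is used

print(f"{per_token // 1024} KiB per token")           # 512 KiB per token
print(f"pre-allocated: {preallocated / GIB:.1f} GiB") # 32.0 GiB
print(f"paged:         {paged / GIB:.2f} GiB")        # 6.25 GiB
```

The paged figure ignores the partially filled last block of each request, which adds at most 15 token slots per sequence with 16-token blocks.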

Continuous Batching

sequenceDiagram
    participant S as Server
    participant R1 as Request 1 (Short)
    participant R2 as Request 2 (Long)
    participant R3 as Request 3 (New)

    Note over S: Iteration 1
    S->>R1: Generate token
    S->>R2: Generate token

    Note over S: Iteration 2
    S->>R1: Generate token (DONE)
    S->>R2: Generate token

    Note over S: Iteration 3 - R1 finished, R3 fills slot
    S->>R3: Generate token (NEW - immediate)
    S->>R2: Generate token

    Note over S: No waiting for batch to complete!

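Traditional static batching: the server must wait for every request in the batch to finish before it can accept new ones.

Continuous batching: resources are released as soon as a request finishes, so new requests do not have to wait.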
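
A small simulation makes the difference concrete. This is a sketch under simplifying assumptions: every request is reduced to its output length in tokens, each iteration produces one token per running request, and prefill cost is ignored. The function names and request lengths are hypothetical.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode iterations when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Decode iterations when a finished request's slot is refilled immediately."""
    waiting = list(lengths)
    running: list[int] = []  # remaining tokens for each running request
    steps = 0
    while waiting or running:
        # Fill free slots before each iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Each running request emits one token; drop the ones that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [2, 10, 3, 9]  # hypothetical output lengths in tokens
print(static_batching_steps(lengths, batch_size=2))      # 19
print(continuous_batching_steps(lengths, batch_size=2))  # 14
```

With static batching the short requests hold their slot idle until the long one in the same batch finishes; with continuous batching the freed slot is reused on the next iteration.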

vLLM Production Deployment

# Install vLLM
pip install vllm

# Start OpenAI-compatible API server
#   --tensor-parallel-size 2       multi-GPU parallelism
#   --max-model-len 8192           max context length
#   --gpu-memory-utilization 0.9   use 90% of GPU memory
#   --enable-prefix-caching        cache common prompt prefixes
# (Comments cannot follow a trailing backslash, so they are listed here.)
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --enable-prefix-caching

# Add --quantization awq only when serving an AWQ checkpoint
# (for example, the ./qwen2.5-7b-awq directory produced in the AWQ section).

# vLLM Python API for batch inference
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    max_model_len=8192,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
    repetition_penalty=1.1,
    stop=["<|endoftext|>"],
)

# Batch inference
prompts = [
    "Explain quantum computing in simple terms.",
    "Write a Python function to sort a list.",
    "What are the benefits of microservices?",
]

# Time the whole batch; per-request metrics are not populated in every vLLM version
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

total_tokens = 0
for output in outputs:
    total_tokens += len(output.outputs[0].token_ids)
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Generated: {output.outputs[0].text}")
    print("---")

print(f"Tokens/s (whole batch): {total_tokens / elapsed:.1f}")

vLLM + OpenAI-Compatible API

# Client code: fully compatible with the OpenAI SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM doesn't require a real key
)

# Chat completion (streamed)
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a professional technical consultant."},
        {"role": "user", "content": "How do I design a highly available microservice architecture?"},
    ],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)

# Print tokens as they arrive; skip chunks that carry no text
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Advanced Ollama

Basic Usage

# Install Ollama (Linux; on macOS and Windows use the installer from the Ollama website)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run a model
ollama pull qwen2.5:7b
ollama run qwen2.5:7b "What is a Transformer?"

# List available models
ollama list

# Show model details
ollama show qwen2.5:7b

Customizing with a Modelfile

# Modelfile: custom model configuration
FROM qwen2.5:7b

# System prompt
SYSTEM """You are a Python programming expert focused on backend development.
Your answers should:
1. Include runnable code examples
2. Use Python 3.10+ syntax
3. Follow PEP 8
4. Explain key design decisions"""

# Parameter customization
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|endoftext|>"

# Template customization (Go template syntax)
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""

# Build custom model
ollama create python-expert -f ./Modelfile

# Run custom model
ollama run python-expert "How do I implement JWT authentication with FastAPI?"

Ollama REST API

import httpx

OLLAMA_BASE = "http://localhost:11434"

# Generate completion (stream=False returns a single JSON object)
response = httpx.post(
    f"{OLLAMA_BASE}/api/generate",
    json={
        "model": "qwen2.5:7b",
        "prompt": "Explain Docker in 3 sentences.",
        "stream": False,
        "options": {
            "temperature": 0.7,
            "num_ctx": 4096,
        },
    },
    timeout=60.0,
)

print(response.json()["response"])

# Chat API
response = httpx.post(
    f"{OLLAMA_BASE}/api/chat",
    json={
        "model": "qwen2.5:7b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is Kubernetes?"},
        ],
        "stream": False,
    },
    timeout=60.0,
)

print(response.json()["message"]["content"])

# Embeddings (one vector per input string)
response = httpx.post(
    f"{OLLAMA_BASE}/api/embed",
    json={
        "model": "nomic-embed-text",
        "input": ["Hello world", "Machine learning is great"],
    },
    timeout=60.0,
)

embeddings = response.json()["embeddings"]
print(f"Embedding dimension: {len(embeddings[0])}")
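
The calls above set `"stream": False`. When streaming is left on, `/api/generate` returns newline-delimited JSON objects, each carrying a `response` fragment and a `done` flag. The sketch below parses that shape; `iter_generate_text` is a hypothetical helper, and the sample lines are hand-written for illustration rather than captured from a server.

```python
import json
from typing import Iterable, Iterator


def iter_generate_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from streamed /api/generate output (one JSON object per line)."""
    for line in lines:
        if not line.strip():
            continue  # tolerate blank keep-alive lines
        chunk = json.loads(line)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break


# Offline demonstration with hand-written lines shaped like the streamed objects.
sample = [
    '{"model": "qwen2.5:7b", "response": "Docker ", "done": false}',
    '{"model": "qwen2.5:7b", "response": "packages apps.", "done": false}',
    '{"model": "qwen2.5:7b", "response": "", "done": true}',
]
print("".join(iter_generate_text(sample)))  # Docker packages apps.

# Against a live server (assumes httpx and a local Ollama instance):
# with httpx.stream("POST", "http://localhost:11434/api/generate",
#                   json={"model": "qwen2.5:7b", "prompt": "Hi"}, timeout=60.0) as r:
#     for text in iter_generate_text(r.iter_lines()):
#         print(text, end="", flush=True)
```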

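Quantization Techniques Compared

graph TB
    A[Model quantization] --> B[GGUF<br/>llama.cpp format]
    A --> C[GPTQ<br/>GPU quantization]
    A --> D[AWQ<br/>Activation-aware]
    A --> E[GGML Legacy]

    B --> B1["CPU + GPU hybrid inference<br/>Flexible quantization levels<br/>Q4_K_M / Q5_K_M / Q8_0"]

    C --> C1["GPU-only inference<br/>4-bit quantization<br/>Needs calibration data"]

    D --> D1["GPU-only inference<br/>Activation-based quantization<br/>Accuracy often better than GPTQ"]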

    style B fill:#3498db,color:#fff
    style C fill:#e74c3c,color:#fff
    style D fill:#2ecc71,color:#fff

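Quantization Levels in Detail

| Format | Bits per weight | 7B model size | Speed | Accuracy loss |
|---|---|---|---|---|
| FP16 | 16 | ~14 GB | Baseline | |
| Q8_0 | 8 | ~7 GB | | Minimal |
| Q5_K_M | 5.5 | ~5 GB | | |
| Q4_K_M | 4.5 | ~4.4 GB | | Small to medium |
| Q4_0 | 4 | ~4 GB | Fastest | |
| Q3_K_M | 3.5 | ~3.3 GB | | Medium to large |
| Q2_K | 2.5 | ~2.7 GB | | |
| GPTQ 4-bit | 4 | ~4 GB | Fast on GPU | |
| AWQ 4-bit | 4 | ~4 GB | Fast on GPU | Minimal |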
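
The size column follows roughly from parameters times bits per weight. A minimal sketch of that arithmetic, with `model_size_gb` as a hypothetical helper name; real files come out somewhat larger because some tensors stay at higher precision and the file carries metadata.

```python
def model_size_gb(num_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): params * bits / 8."""
    return num_params_billion * bits_per_weight / 8


for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q4_0", 4)]:
    print(f"{name}: ~{model_size_gb(7, bits):.1f} GB")
# FP16: ~14.0 GB
# Q8_0: ~7.0 GB
# Q4_0: ~3.5 GB
```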

GGUF Quantization in Practice

# Using llama.cpp to quantize a model
# 1. Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py ./Qwen2.5-7B --outfile qwen2.5-7b-f16.gguf --outtype f16

# 2. Quantize to Q4_K_M
./llama-quantize qwen2.5-7b-f16.gguf qwen2.5-7b-q4_k_m.gguf Q4_K_M

# 3. Run with llama.cpp
#   -ngl 35       offload 35 layers to GPU
#   -c 8192       context length
#   --threads 8   CPU threads
./llama-server \
    -m qwen2.5-7b-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 8192 \
    --threads 8

AWQ Quantization

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-7B-Instruct"
quant_path = "qwen2.5-7b-awq"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",  # or "GEMV" for batch_size=1
}

model.quantize(
    tokenizer,
    quant_config=quant_config,
    # Calibration dataset: "pileval" is AutoAWQ's default. A bare "wikitext"
    # needs a dataset config name, so pass a list of strings for custom text.
    calib_data="pileval",
)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Deploy with vLLM
# vllm serve ./qwen2.5-7b-awq --quantization awq
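
To show what `w_bit`, `zero_point`, and `q_group_size` control, here is a sketch of plain round-to-nearest group quantization with an integer zero point. It is not the AWQ algorithm itself: AWQ additionally rescales salient weight channels using activation statistics before this step. The function names and the 5-element group are hypothetical (real groups hold `q_group_size` weights, 128 in the config above).

```python
def quantize_group(weights: list[float], w_bit: int = 4) -> tuple[list[int], float, int]:
    """Asymmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** w_bit - 1                       # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0             # fall back to 1.0 for a constant group
    zero_point = round(-lo / scale)             # integer code that represents 0.0
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q: list[int], scale: float, zero_point: int) -> list[float]:
    """Map integer codes back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


weights = [-0.30, -0.12, 0.01, 0.15, 0.45]      # hypothetical weight group
q, scale, zero_point = quantize_group(weights)
restored = dequantize_group(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)                                        # [0, 4, 6, 9, 15]
print(f"scale={scale:.3f} zero_point={zero_point} max_error={max_error:.3f}")
```

Each weight is stored as a 4-bit code; the per-group `scale` and `zero_point` are the only extra state, which is why smaller groups cost more memory but track the weight distribution more closely.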

GPU Memory Optimization

graph TD
    A[GPU memory allocation] --> B[Model weights]
    A --> C[KV Cache<br/>Attention cache]
    A --> D[Activations]

    B --> B1["7B FP16: ~14GB<br/>7B INT4: ~4GB"]
    C --> C1["Depends on batch_size<br/>and seq_len"]
    D --> D1["Allocated temporarily<br/>during the forward pass"]

    style B fill:#e74c3c,color:#fff
    style C fill:#3498db,color:#fff
    style D fill:#f39c12,color:#000

Memory Estimation Formula

def estimate_gpu_memory(
    num_params_billion: float,
    precision_bits: int = 16,
    batch_size: int = 1,
    seq_len: int = 2048,
    num_layers: int = 32,
    hidden_size: int = 4096,
    num_kv_heads: int = 8,
    head_dim: int = 128,
    kv_cache_bits: int = 16,
):
    """Estimate GPU memory requirements for LLM inference."""

    # Model weights
    weight_bytes = num_params_billion * 1e9 * (precision_bits / 8)

    # KV Cache (per token per layer: 2 * num_kv_heads * head_dim * kv_bytes).
    # Weight quantization does not shrink the KV cache, which usually stays FP16,
    # so its precision is a separate argument.
    kv_per_token = 2 * num_kv_heads * head_dim * (kv_cache_bits / 8)
    kv_cache_bytes = batch_size * seq_len * num_layers * kv_per_token

    # Activation memory (rough estimate)
    activation_bytes = batch_size * seq_len * hidden_size * 4  # FP32 activations

    total_gb = (weight_bytes + kv_cache_bytes + activation_bytes) / (1024**3)

    print(f"Model weights: {weight_bytes / (1024**3):.1f} GB")
    print(f"KV Cache: {kv_cache_bytes / (1024**3):.1f} GB")
    print(f"Activations: {activation_bytes / (1024**3):.1f} GB")
    print(f"Total estimate: {total_gb:.1f} GB")

    return total_gb

# Example: a 7B model such as Qwen2.5-7B with INT4 weights
# (the default layer/head values are generic, not the model's exact shape)
estimate_gpu_memory(
    num_params_billion=7,
    precision_bits=4,
    batch_size=8,
    seq_len=4096,
)
# Model weights: 3.3 GB
# KV Cache: 4.0 GB
# Activations: 0.5 GB
# Total estimate: 7.8 GB
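
The same formula can be turned around to ask how many tokens of KV cache fit on a given card once the weights are loaded. This is a rough planning sketch, not how any server computes its budget: `max_kv_tokens` and the 1 GiB `overhead_gb` default are assumptions made for illustration.

```python
def max_kv_tokens(
    gpu_mem_gb: float,
    gpu_memory_utilization: float,
    weight_gb: float,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    kv_dtype_bytes: int = 2,
    overhead_gb: float = 1.0,   # assumed allowance for activations and runtime buffers
) -> int:
    """Tokens of KV cache that fit after weights and overhead (sizes in GiB)."""
    budget_gb = gpu_mem_gb * gpu_memory_utilization - weight_gb - overhead_gb
    per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes
    return max(0, int(budget_gb * 1024**3 // per_token))


# Hypothetical 24 GiB card, 90% utilization, ~3.3 GiB of INT4 weights.
tokens = max_kv_tokens(24, 0.9, 3.3, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"KV budget: {tokens} tokens, or {tokens // 4096} sequences of 4096 tokens")
```

Lowering `seq_len`, using a model with fewer KV heads, or quantizing the KV cache all raise the number of sequences that can run concurrently.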

Tensor Parallelism

# Split model across multiple GPUs
# vLLM: automatic tensor parallelism (--tensor-parallel-size 4 uses 4 GPUs)
vllm serve Qwen/Qwen2.5-72B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.95

graph LR
    subgraph "Tensor Parallelism (4 GPUs)"
        A[Input] --> G1[GPU 0<br/>All layers<br/>Column shard 1 of 4]
        A --> G2[GPU 1<br/>All layers<br/>Column shard 2 of 4]
        A --> G3[GPU 2<br/>All layers<br/>Column shard 3 of 4]
        A --> G4[GPU 3<br/>All layers<br/>Column shard 4 of 4]

        G1 --> R[AllReduce<br/>Sync]
        G2 --> R
        G3 --> R
        G4 --> R
        R --> O[Output]
    end

    style G1 fill:#3498db,color:#fff
    style G2 fill:#e74c3c,color:#fff
    style G3 fill:#2ecc71,color:#fff
    style G4 fill:#f39c12,color:#000
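
The split-then-AllReduce pattern can be demonstrated on a tiny two-layer MLP in pure Python. This is a sketch of the common Megatron-style scheme (first weight split by columns, second by rows, partial outputs summed), with two simulated "GPUs" and made-up matrices; it is not vLLM's code, and all helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix multiply on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def split_columns(m, parts):
    step = len(m[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in m] for i in range(parts)]


def split_rows(m, parts):
    step = len(m) // parts
    return [m[i * step:(i + 1) * step] for i in range(parts)]


def all_reduce_sum(partials):
    """Element-wise sum of the per-GPU partial outputs."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1.0, 2.0]]                                  # 1 x 2 input
w1 = [[1.0, -1.0, 0.5, 2.0],
      [0.0, 3.0, -2.0, 1.0]]                      # 2 x 4, split by columns
w2 = [[1.0], [2.0], [3.0], [4.0]]                 # 4 x 1, split by rows

reference = matmul(relu(matmul(x, w1)), w2)       # single-device result

num_gpus = 2
partials = []
for w1_shard, w2_shard in zip(split_columns(w1, num_gpus), split_rows(w2, num_gpus)):
    hidden = relu(matmul(x, w1_shard))            # each GPU holds a slice of the hidden units
    partials.append(matmul(hidden, w2_shard))     # partial contribution to the output

tensor_parallel = all_reduce_sum(partials)
print(reference, tensor_parallel)                 # [[27.0]] [[27.0]]
```

Every GPU takes part in every layer and holds only a slice of each weight matrix, which is what distinguishes tensor parallelism from assigning whole layers to different GPUs (pipeline parallelism).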

Performance Benchmarking

import time
import httpx

def benchmark_throughput(
    base_url: str,
    model: str,
    num_requests: int = 100,
    prompt_tokens: int = 128,
    max_tokens: int = 256,
):
    """Benchmark LLM serving throughput.

    Requests are sent one at a time, so this measures single-stream
    performance rather than throughput under concurrent load.
    """
    # Rough prompt sizing: the repeated phrase is about 5 tokens long
    prompt = "Explain the concept of " * (prompt_tokens // 5)

    start = time.perf_counter()
    total_output_tokens = 0

    # Reuse one connection for all requests
    with httpx.Client(timeout=120.0) as client:
        for _ in range(num_requests):
            response = client.post(
                f"{base_url}/v1/completions",
                json={
                    "model": model,
                    "prompt": prompt,
                    "max_tokens": max_tokens,
                    "temperature": 0.7,
                },
            )
            response.raise_for_status()
            data = response.json()
            total_output_tokens += data["usage"]["completion_tokens"]

    elapsed = time.perf_counter() - start

    print(f"Total time: {elapsed:.1f}s")
    print(f"Requests/s: {num_requests / elapsed:.1f}")
    print(f"Output tokens/s: {total_output_tokens / elapsed:.1f}")
    print(f"Avg latency: {elapsed / num_requests * 1000:.0f}ms")

# Benchmark
benchmark_throughput(
    base_url="http://localhost:8000",
    model="Qwen/Qwen2.5-7B-Instruct",
    num_requests=50,
)
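
Continuous batching only pays off when several requests are in flight, so a sequential loop understates what a server like vLLM can do. The sketch below shows one way to bound concurrency with `asyncio`; `run_concurrent` and `fake_send` are hypothetical names, and `fake_send` stands in for the real HTTP call so the example runs without a server.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_concurrent(
    send: Callable[[int], Awaitable[int]],
    num_requests: int,
    concurrency: int,
) -> dict[str, float]:
    """Run send(i) for every request with at most `concurrency` in flight.

    `send` must return the number of completion tokens for its request.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(i: int) -> int:
        async with semaphore:
            return await send(i)

    start = time.perf_counter()
    token_counts = await asyncio.gather(*(worker(i) for i in range(num_requests)))
    elapsed = time.perf_counter() - start
    total = sum(token_counts)
    return {
        "requests": num_requests,
        "tokens": total,
        "elapsed_s": elapsed,
        "tokens_per_s": total / elapsed,
    }


async def fake_send(i: int) -> int:
    # Stand-in for an HTTP request; replace with a real async client call.
    await asyncio.sleep(0.01)
    return 256


stats = asyncio.run(run_concurrent(fake_send, num_requests=20, concurrency=5))
print(stats["tokens"])  # 5120
```

In a real run, `send` would post to `/v1/completions` with an async HTTP client and return `usage.completion_tokens`; sweeping `concurrency` then shows where throughput stops scaling.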

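Reference Performance by Solution

| Solution | 7B Q4 throughput | TTFT | Deployment complexity | Best for |
|---|---|---|---|---|
| vLLM | ~3000 tok/s | ~50 ms | | Production API serving |
| TGI | ~2500 tok/s | ~60 ms | | Production API serving |
| Ollama | ~80 tok/s | ~200 ms | | Development / personal use |
| llama.cpp | ~100 tok/s | ~150 ms | | CPU inference / edge |
| TensorRT-LLM | ~4000 tok/s | ~30 ms | | Maximum NVIDIA optimization |

The figures above are reference values for a single A100 80GB GPU; actual results depend on hardware and configuration.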

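Choosing a Solution

flowchart TD
    A[Local LLM deployment needs] --> B{Use case?}

    B -->|Personal dev / prototyping| C[Ollama<br/>One-command install, simplest]
    B -->|Production API service| D{GPU?}
    D -->|NVIDIA| E{Budget?}
    E -->|Multi-GPU A100/H100| F[vLLM + Tensor Parallel<br/>Highest throughput]
    E -->|Single consumer GPU| G[vLLM + AWQ quantization<br/>Best value]
    D -->|No GPU / Apple Silicon| H[Ollama / llama.cpp<br/>CPU + Metal]

    B -->|Maximum performance| I[TensorRT-LLM<br/>Deep NVIDIA optimization]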

    style C fill:#2ecc71,color:#fff
    style F fill:#e74c3c,color:#fff
    style G fill:#3498db,color:#fff
    style H fill:#f39c12,color:#000
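
The flowchart can also be written down as a small function, which is handy when the choice has to be made in provisioning scripts. This is only a transcription of the diagram; `choose_stack` and its argument values are hypothetical names.

```python
def choose_stack(use_case: str, gpu: str = "nvidia", multi_gpu: bool = False) -> str:
    """Return the option the flowchart recommends for the given situation."""
    if use_case == "prototype":
        return "Ollama"
    if use_case == "max_performance":
        return "TensorRT-LLM"
    if use_case == "production_api":
        if gpu != "nvidia":                      # no GPU or Apple Silicon
            return "Ollama / llama.cpp"
        return "vLLM + tensor parallelism" if multi_gpu else "vLLM + AWQ"
    raise ValueError(f"unknown use case: {use_case!r}")


print(choose_stack("production_api", gpu="nvidia", multi_gpu=False))  # vLLM + AWQ
print(choose_stack("production_api", gpu="apple"))                    # Ollama / llama.cpp
```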

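Summary

The right local LLM deployment option depends on the scenario:

  1. Quick start: Ollama installs with one command, supports macOS/Linux/Windows, and suits development and experimentation
  2. Production deployment: vLLM's PagedAttention and continuous batching deliver high throughput for API serving
  3. Quantization choice: AWQ > GPTQ > GGUF Q4_K_M (on GPUs, AWQ typically loses the least accuracy)
  4. Memory optimization: 4-bit quantization shrinks a 7B model from about 14 GB to about 4 GB, so a single GPU is enough
  5. Multi-GPU: use tensor parallelism to spread 70B+ models across several GPUs
  6. Apple Silicon: Ollama supports Metal acceleration natively, and an M2/M3 Max can run 7B-14B models smoothly

Always benchmark first (TTFT, throughput, memory usage), then choose based on the latency and QPS requirements of your actual workload.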

Author: zt
Published: 2025-11-12
Length: 3.2k characters · 8 min
License: CC BY-SA 4.0