# Batch inference prompts = [ "Explain quantum computing in simple terms.", "Write a Python function to sort a list.", "What are the benefits of microservices?", ]
outputs = llm.generate(prompts, sampling_params)
for output in outputs: print(f"Prompt: {output.prompt[:50]}...") print(f"Generated: {output.outputs[0].text}") print(f"Tokens/s: {len(output.outputs[0].token_ids) / output.metrics.finished_time:.1f}") print("---")
graph TB
A[模型量化] --> B[GGUF<br/>llama.cpp 格式]
A --> C[GPTQ<br/>GPU 量化]
A --> D[AWQ<br/>Activation-aware]
A --> E[GGML Legacy]
B --> B1["CPU + GPU 混合推理<br/>灵活的量化级别<br/>Q4_K_M / Q5_K_M / Q8_0"]
C --> C1["纯 GPU 推理<br/>4-bit 量化<br/>需要校准数据"]
D --> D1["纯 GPU 推理<br/>基于激活值的量化<br/>精度优于 GPTQ"]
style B fill:#3498db,color:#fff
style C fill:#e74c3c,color:#fff
style D fill:#2ecc71,color:#fff
量化级别详解
量化格式
每参数 Bits
7B 模型大小
速度
精度损失
FP16
16
~14 GB
基准
无
Q8_0
8
~7 GB
快
极小
Q5_K_M
5.5
~5 GB
快
小
Q4_K_M
4.5
~4.4 GB
快
小-中
Q4_0
4
~4 GB
最快
中
Q3_K_M
3.5
~3.3 GB
快
中-大
Q2_K
2.5
~2.7 GB
快
大
GPTQ 4-bit
4
~4 GB
GPU 快
小
AWQ 4-bit
4
~4 GB
GPU 快
极小
GGUF 量化实践
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# Using llama.cpp to quantize a model # 1. Convert HuggingFace model to GGUF python convert_hf_to_gguf.py ./Qwen2.5-7B --outfile qwen2.5-7b-f16.gguf --outtype f16
# 2. Quantize to Q4_K_M ./llama-quantize qwen2.5-7b-f16.gguf qwen2.5-7b-q4_k_m.gguf Q4_K_M
# 3. Run with llama.cpp ./llama-server \ -m qwen2.5-7b-q4_k_m.gguf \ --host 0.0.0.0 \ --port 8080 \ -ngl 35 \ # Offload 35 layers to GPU -c 8192 \ # Context length --threads 8 # CPU threads
# Save quantized model model.save_quantized(quant_path) tokenizer.save_pretrained(quant_path)
# Deploy with vLLM # vllm serve ./qwen2.5-7b-awq --quantization awq
GPU 内存优化
graph TD
A[GPU 内存分配] --> B[模型权重<br/>Weights]
A --> C[KV Cache<br/>注意力缓存]
A --> D[激活值<br/>Activations]
B --> B1["7B FP16: ~14GB<br/>7B INT4: ~4GB"]
C --> C1["取决于 batch_size<br/>和 seq_len"]
D --> D1["前向推理时<br/>临时分配"]
style B fill:#e74c3c,color:#fff
style C fill:#3498db,color:#fff
style D fill:#f39c12,color:#000
defestimate_gpu_memory( num_params_billion: float, precision_bits: int = 16, batch_size: int = 1, seq_len: int = 2048, num_layers: int = 32, hidden_size: int = 4096, num_kv_heads: int = 8, head_dim: int = 128, ): """Estimate GPU memory requirements for LLM inference."""
flowchart TD
A[本地LLM部署需求] --> B{使用场景?}
B -->|个人开发/原型| C[Ollama<br/>一键安装, 最简单]
B -->|生产API服务| D{GPU?}
D -->|NVIDIA| E{预算?}
E -->|多卡 A100/H100| F[vLLM + Tensor Parallel<br/>最高吞吐]
E -->|单卡 消费级| G[vLLM + AWQ 量化<br/>性价比最高]
D -->|无GPU / Apple Silicon| H[Ollama / llama.cpp<br/>CPU + Metal]
B -->|最高性能| I[TensorRT-LLM<br/>NVIDIA 深度优化]
style C fill:#2ecc71,color:#fff
style F fill:#e74c3c,color:#fff
style G fill:#3498db,color:#fff
style H fill:#f39c12,color:#000