MODEL-ZOO

68 LLMs You Can Deploy on a Personal GPU

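This guide curates 68 open-source models you can run on a personal GPU: LLMs for chat, coding, and reasoning, VLMs for images and documents, and efficient SLMs for on-device tasks.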

admin

Mar 15, 2026 • 20 min read


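Over the past two years, local AI has gone from a niche hacker hobby to a credible alternative to cloud inference. Between open-weight models (LLMs, VLMs, and compact SLMs), smarter quantization (Q4 to Q8 levels, AWQ, GPTQ, GGUF), and high-throughput runtimes (vLLM, llama.cpp, TensorRT-LLM), you can now run a capable assistant, or a whole fleet of specialists, on a single workstation GPU. Privacy improves, latency drops, costs become predictable, and you keep control of your data and prompts.

PewDiePie just showed the mainstream what a "personal AI rack" looks like. In a recent video series he demoed a self-hosted setup ("ChatOS") that runs multiple open-source models and even has them rank and vote on answers, picking the best one in an "AI council" (later a larger "swarm"). That is exactly where many practitioners are heading: ensembles and orchestrated agents that cooperate on your own hardware.

Why does this matter now? First, edge computing is clearly on the rise: you can keep sensitive context local (RAG over your files, codebases, and databases), avoid egress fees, and tailor models to each task. Second, GPU scarcity and price volatility mean that cloud inference is not always the cheapest or most reliable option for steady workloads. If you already own a 12-24 GB card (or two), the economics of running quantized 7B-13B models locally are compelling, and larger MoE or dense models can be scaled up through tensor parallelism or mixed precision when you need them. PewDiePie's rack is a flashy example, but the underlying theme is pragmatic: own your stack, right-size your models, and compose them like tools.

This guide curates 68 open-source models you can run on a personal GPU: LLMs for chat, coding, and reasoning, VLMs for images and documents, and efficient SLMs for on-device tasks. For each model you will find its strengths, realistic VRAM targets by precision, and a "run it now" snippet (vLLM, llama.cpp, or Ollama). Use it to assemble your own "council", or to pick a small daily driver that just works.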

1. Quick Start

Ollama (single command)

# NVIDIA Linux / WSL2
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral:7b
ollama run mistral:7b

llama.cpp (GGUF server)

# Build with CUDA
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j
# Serve the model (-ngl 99 offloads all layers to the GPU)
./build/bin/llama-server -m ./models/mistral-7b.Q4_K_M.gguf -c 8192 -ngl 99 -t 8
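
When the card is too small for `-ngl 99`, a lower value keeps the remaining layers on the CPU. The sketch below is a rough heuristic, not part of llama.cpp: it assumes the GGUF file is split evenly across layers and reserves a fixed amount of VRAM for the KV cache and buffers. The function name and the 1.5 GB reserve are illustrative assumptions.

```python
import math


def layers_that_fit(model_file_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers can be offloaded with -ngl.

    Heuristic only: treats the GGUF file as evenly split across layers
    and keeps reserve_gb free for the KV cache, CUDA context, and
    scratch buffers (the reserve is an assumed value, tune it).
    """
    if n_layers <= 0 or model_file_gb <= 0:
        raise ValueError("model_file_gb and n_layers must be positive")
    per_layer_gb = model_file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# A roughly 4 GB 7B file with 32 layers:
print(layers_that_fit(4.0, 32, vram_gb=8))  # all 32 layers fit
print(layers_that_fit(4.0, 32, vram_gb=4))  # only part of the model fits
```

Treat the result as a starting value for `-ngl` and adjust it while watching actual memory use, since long contexts grow the KV cache well past a fixed reserve.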

vLLM (OpenAI-compatible server)

pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-v0.3 \
  --dtype auto --max-model-len 8192 --tensor-parallel-size 1
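
Once the server is up, any HTTP client can talk to it. The following is a minimal sketch using only the standard library; it assumes the server listens on vLLM's default port 8000 and uses the `/v1/completions` endpoint, which also works for base checkpoints without a chat template. The helper names `build_completion_request` and `complete` are illustrative, not part of vLLM.

```python
import json
import urllib.request

# Assumption: default vLLM address; change it if you pass --host/--port.
BASE_URL = "http://localhost:8000/v1"


def build_completion_request(model, prompt, max_tokens=128, temperature=0.2):
    """Build the JSON body for POST /v1/completions."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(model, prompt, base_url=BASE_URL, timeout=120):
    """Send one prompt to the server and return the generated text."""
    body = json.dumps(build_completion_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["choices"][0]["text"]
```

With the server running, `complete("mistralai/Mistral-7B-v0.3", "Local inference is useful because")` returns the continuation as a string. The `model` value must match the name passed to `--model`.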

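AMD (ROCm) note

Prefer the vLLM ROCm wheels or llama.cpp built with HIP. Some GGUF quantization kernels are also NVIDIA-first, so verify them before committing to a path.

Windows tip

Use WSL2 + CUDA for a simpler setup, or pass the GPU through to a Linux userspace.

2. 68 Open-Source Models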

[1] Llama 3.1 8B Instruct (≈8B)

A solid single-GPU generalist with excellent ecosystem support and stable outputs.

Best for: chat, RAG, light coding

VRAM: ~6-8 GB (4-bit), ~10-12 GB (8-bit), ~16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct

[2] Mistral 7B v0.3 (≈7B)

Compact, fast, and well aligned; easy to fine-tune and deploy.

Best for: general chat and RAG

VRAM: ~5-7 GB (4-bit), ~8-10 GB (8-bit), ~14-16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-v0.3

[3] Qwen 2.5 7B Instruct (≈7B)

Multilingual, with strong JSON/tool output and long-context variants.

Best for: multilingual chat, structured output

VRAM: ~6-8 GB (4-bit), ~9-11 GB (8-bit), ~14-16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct

[4] Gemma 2 9B IT (≈9B)

Carefully instruction-tuned at a small scale.

Best for: general assistant, light coding

VRAM: ~7-9 GB (4-bit), ~12-14 GB (8-bit), ~18 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model google/gemma-2-9b-it

[5] StableLM 2 12B Chat (≈12B)

A capable multilingual assistant; easy to self-host.

Best for: chat, RAG

VRAM: ~9-11 GB (4-bit), ~14-16 GB (8-bit), ~24 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model stabilityai/stablelm-2-12b-chat

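[6] Mixtral 8×7B Instruct (MoE; ≈13B active of ≈47B total)

"70B-class" quality with sparse-activation efficiency.

Best for: higher-quality chat/RAG on a single large card

VRAM: ~24-28 GB (4-bit), ~48-52 GB (8-bit), ~90+ GB (fp16). All experts stay resident, so memory follows the total parameter count, not the active one.

Run: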

python -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1

[7] StarCoder2 15B (≈15B)

A strong code model trained on permissively licensed data.

Best for: local code assistant

VRAM: ~10-12 GB (4-bit), ~18-20 GB (8-bit), ~30 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model bigcode/starcoder2-15b

[8] DeepSeek-Coder V2 Instruct (MoE; 16B/236B)

A modern coding MoE; the 16B (Lite) variant is single-GPU friendly.

Best for: code generation, refactoring, multi-file edits

VRAM (16B): ~10-12 GB (4-bit), ~16-20 GB (8-bit), ~32 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct

[9] Code Llama 13B Instruct (≈13B)

A classic open-source code model; stable and well documented.

Best for: IDE assistant, explanations

VRAM: ~10-12 GB (4-bit), ~16-18 GB (8-bit), ~26 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model codellama/CodeLlama-13b-Instruct-hf

[10] Llama 3.2 3B Instruct (≈3B)

Small and fast with decent reasoning; laptop friendly.

Best for: on-device chat, small RAG

VRAM: ~2.5-3.5 GB (4-bit), ~4-6 GB (8-bit), ~6-7 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.2-3B-Instruct

[11] Llama 3.2 1B Instruct (≈1B)

Truly tiny; a handy router/planner.

Best for: tool routing, classification, prompt rewriting

VRAM: ~1-2 GB (4-bit), ~2-3 GB (8-bit), ~4 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.2-1B-Instruct

[12] OpenELM 3B (≈3B)

A compact SLM from Apple; research friendly.

Best for: edge demos, baselines

VRAM: ~2-3 GB (4-bit), ~4-5 GB (8-bit), ~6-7 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model apple/OpenELM-3B

[13] BTLM-3B-8K (≈3B)

A tiny model tuned for longer context.

Best for: lightweight summarization, assistants

VRAM: ~2-3 GB (4-bit), ~4-5 GB (8-bit), ~6-7 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model cerebras/btlm-3b-8k

[14] TinyLlama 1.1B Chat (≈1.1B)

A lightweight, Llama-compatible chat model.

Best for: routers/filters, tiny chat

VRAM: ~1-1.5 GB (4-bit), ~2-3 GB (8-bit), ~4 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0

[15] Idefics2 8B (VLM; ≈8B)

An open VLM for charts, documents, and screens; simple API.

Best for: single-image captioning, document VQA

VRAM: ~10-12 GB (4-bit), ~16-18 GB (8-bit); varies with image size

Run:

python -m vllm.entrypoints.openai.api_server --model HuggingFaceM4/idefics2-8b

[16] LLaVA-1.6 Mistral 7B (VLM; ≈7B)

The community-default open VLM, with strong screenshot Q&A.

Best for: image Q&A, UI/screenshot reading

VRAM: ~9-12 GB (4-bit), ~14-16 GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model llava-hf/llava-v1.6-mistral-7b-hf

[17] Pixtral 12B (VLM; ≈12B)

Excels at tables/charts and multi-image reasoning.

Best for: document VQA, chart/table understanding

VRAM: ~12-16 GB (4-bit), ~20-24 GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model mistralai/Pixtral-12B-2409 --tokenizer-mode mistral

[18] InternVL2 8B (VLM; ≈8B)

A competitive open VLM widely used in research.

Best for: general VQA, OCR-like scenarios

VRAM: ~10-12 GB (4-bit), ~16-18 GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model OpenGVLab/InternVL2-8B

[19] Qwen 2.5-VL 7B Instruct (VLM; ≈7B)

A modern VLM with multilingual strengths.

Best for: screenshot Q&A, slides/charts

VRAM: ~10-12 GB (4-bit), ~16-18 GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-VL-7B-Instruct

[20] Llama 3.2 11B Vision Instruct (VLM; ≈11B)

An all-in-one Llama for vision and text.

Best for: general image understanding

VRAM: ~12-16 GB (4-bit), ~18-22 GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.2-11B-Vision-Instruct

[21] Qwen2.5-Coder 32B Instruct (≈32B)

Repository-level coding, function calling, long edits.

Best for: heavy-duty coding assistant

VRAM: ~18-24 GB (4-bit), ~32-40 GB (8-bit), ~60+ GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-Coder-32B-Instruct

[22] CodeGemma 7B IT (≈7B)

A compact coder; clean formatting plus function calling.

Best for: everyday coding assistant

VRAM: ~6-8 GB (4-bit), ~10-12 GB (8-bit), ~14-16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model google/codegemma-7b-it

[23] StarCoder2 3B (≈3B)

A tiny multilingual coder for edge devices and laptops.

Best for: linting, quick fixes

VRAM: ~3-4 GB (4-bit), ~5-6 GB (8-bit), ~6-7 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model bigcode/starcoder2-3b

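[24] Codestral 22B (≈22B; research/evaluation license)

A high-accuracy code model; check the license before production use.

Best for: research code generation/evaluation

VRAM: ~12-16 GB (4-bit), ~24-30 GB (8-bit), ~44+ GB (fp16)

Run: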

python -m vllm.entrypoints.openai.api_server --model mistralai/Codestral-22B-v0.1

[25] GLM-4-9B Chat (≈9B)

A balanced generalist with strong math/reasoning.

Best for: daily-driver chat

VRAM: ~6-8 GB (4-bit), ~10-12 GB (8-bit), ~18 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model zai-org/glm-4-9b-chat-hf

[26] CodeGeeX4-ALL 9B (≈9B)

A GLM-based coder for generate-fix-execute loops.

Best for: practical coding with tools

VRAM: ~6-8 GB (4-bit), ~10-12 GB (8-bit), ~18 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model THUDM/codegeex4-all-9b

[27] Mistral-NeMo 12B Instruct (≈12B)

NVIDIA × Mistral; efficient via ONNX/TensorRT.

Best for: fast multilingual chat/coding

VRAM: ~8-10 GB (INT4 TRT), ~20-24 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-Nemo-Instruct-2407

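[28] AI21 Jamba Mini 1.7 (hybrid MoE; ≈12B active of ≈52B total)

Hybrid Transformer-Mamba; long context and high throughput.

Best for: long-document assistants and tools

VRAM: ~28-32 GB (4-bit), ~55-60 GB (8-bit), ~105+ GB (fp16). Memory follows the total parameter count, not the active one.

Run: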

python -m vllm.entrypoints.openai.api_server --model ai21labs/AI21-Jamba-Mini-1.7

[29] MAmmoTH2 8×7B (MoE; ≈12-14B active)

Reasoning-tuned; strong chain-of-thought style.

Best for: math/logic tasks

VRAM: ~24-28 GB (4-bit), ~48+ GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model TIGER-Lab/MAmmoTH2-8x7B

[30] OpenChat 3.5-1210 7B (≈7B)

A community fine-tune that punches above its weight.

Best for: chat + coding

VRAM: ~5-6 GB (4-bit), ~8-10 GB (8-bit), ~14-16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model openchat/openchat-3.5-1210

[31] Nous-Hermes-2 Mistral 7B DPO (≈7B)

Clear instruction following; structured output.

Best for: a helpful assistant persona

VRAM: ~5-6 GB (4-bit), ~8-10 GB (8-bit), ~14-16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model NousResearch/Nous-Hermes-2-Mistral-7B-DPO

[32] OpenHermes-2.5 Mistral 7B (≈7B)

A fast, friendly daily driver.

Best for: general chat on an 8 GB GPU

VRAM: ~5-6 GB (4-bit), ~8-10 GB (8-bit), ~14-16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model teknium/OpenHermes-2.5-Mistral-7B

[33] Replit-Code v1.5 3.3B (≈3.3B)

A lightweight code companion.

Best for: small code tasks, CI

VRAM: ~3-4 GB (4-bit), ~5-6 GB (8-bit), ~6-7 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model replit/replit-code-v1_5-3b

[34] Stable-Code 3B (≈3B)

A small coder trained on a broad mix.

Best for: CI bots, on-device experiments

VRAM: ~3-4 GB (4-bit), ~5-6 GB (8-bit), ~6-7 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model stabilityai/stable-code-3b

[35] Qwen 2.5 72B Instruct (≈72B)

A top-tier dense open model; multi-GPU recommended.

Best for: a flagship-quality local stack

VRAM: ~40-48 GB (4-bit, tight), ~80-96 GB (8-bit), ~140+ GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-72B-Instruct --tensor-parallel-size 2

[36] Granite-3.3 8B Instruct (≈8B)

Apache 2.0; stable outputs; long context across the 3.x line.

Best for: tool-calling backends, enterprise

VRAM: ~6-8 GB (4-bit), ~10-12 GB (8-bit), ~16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model ibm-granite/granite-3.3-8b-instruct

[37] Granite-20B Code Instruct 8K (≈20B)

A large code model for repository Q&A, patch suggestions, and code review.

Best for: heavy-duty code assistant

VRAM: ~12-16 GB (4-bit), ~24-30 GB (8-bit), ~40+ GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model ibm-granite/granite-20b-code-instruct-8k

[38] Phi-3 Medium 128K Instruct (≈14B)

A long-context SLM; well suited to personal RAG.

Best for: long documents, offline assistant

VRAM: ~8-10 GB (4-bit), ~16-20 GB (8-bit), ~28 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model microsoft/Phi-3-medium-128k-instruct

[39] Phi-3.5 Vision Instruct (≈4.2B, VLM)

A lightweight VLM for screenshots and charts.

Best for: simple visual Q&A, UI documentation

VRAM: ~4-6 GB (4-bit), ~8-10 GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model microsoft/Phi-3.5-vision-instruct

[40] MiniCPM-V 2.6 (≈7B, VLM)

An efficient VLM (SigLIP + Qwen2 lineage) for the edge.

Best for: lightweight document/screen Q&A

VRAM: ~5-7 GB (4-bit), ~10-12 GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model openbmb/MiniCPM-V-2_6

[41] Qwen2.5-Coder 32B Instruct (≈32B)

Repository-level coding, tools, long edits.

Best for: heavy IDE/agentic coding

VRAM: ~18-24 GB (4-bit), ~32-40 GB (8-bit), ~60+ GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-Coder-32B-Instruct

[42] CodeGemma 7B IT (≈7B)

Clean formatting plus function calling.

Best for: everyday local coding assistant

VRAM: ~6-8 GB (4-bit), ~10-12 GB (8-bit), ~14-16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model google/codegemma-7b-it

[43] StarCoder2-3B (≈3B)

An edge/laptop coder for linting, completion, and fixes.

Best for: quick code tasks

VRAM: ~3-4 GB (4-bit), ~5-6 GB (8-bit), ~6-7 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model bigcode/starcoder2-3b

[44] Codestral 22B (≈22B; research/evaluation)

High-accuracy code; check the license before production use.

Best for: research code generation/evaluation

VRAM: ~12-16 GB (4-bit), ~24-30 GB (8-bit), ~44+ GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model mistralai/Codestral-22B-v0.1

[45] GLM-4-9B Chat (≈9B)

A generalist with strong math/reasoning.

Best for: everyday chat

VRAM: ~6-8 GB (4-bit), ~10-12 GB (8-bit), ~18 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model zai-org/glm-4-9b-chat-hf

[46] CodeGeeX4-ALL 9B (≈9B)

Generate-fix-execute workflows.

Best for: practical coding with tools

VRAM: ~6-8 GB (4-bit), ~10-12 GB (8-bit), ~18 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model THUDM/codegeex4-all-9b

[47] Mistral-NeMo 12B-Instruct (≈12B)

Efficient via ONNX/TensorRT.

Best for: fast multilingual chat/coding

VRAM: ~8-10 GB (INT4 TRT), ~20-24 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-Nemo-Instruct-2407

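[48] AI21 Jamba Mini 1.7 (hybrid MoE; ≈12B active of ≈52B total)

Long context and high throughput.

Best for: long-document assistants and tools

VRAM: ~28-32 GB (4-bit), ~55-60 GB (8-bit), ~105+ GB (fp16). Memory follows the total parameter count, not the active one.

Run: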

python -m vllm.entrypoints.openai.api_server --model ai21labs/AI21-Jamba-Mini-1.7

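[49] DeepSeek-V2 / V2.5 (MoE)

Fast general chat plus code.

Best for: a high-throughput generalist

VRAM: follows total parameters, not active experts. The 16B V2-Lite variant fits a single GPU at 4-bit; the full 236B model needs a multi-GPU node.

Run:

python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-V2-Lite-Chat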

[50] MAmmoTH2-8×7B (MoE)

CoT-heavy reasoning.

Best for: math/logic

VRAM: ~24-28 GB (4-bit), ~48+ GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model TIGER-Lab/MAmmoTH2-8x7B

[51] OpenChat 3.5-1210 (7B)

Punches above its size; strong on 8 GB.

Best for: chat + coding on an 8 GB GPU

VRAM: ~5-6 GB (4-bit), ~8-10 GB (8-bit), ~14-16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model openchat/openchat-3.5-1210

[52] Nous-Hermes-2-Mistral 7B-DPO (7B)

Structured output; friendly persona.

Best for: a helpful assistant

VRAM: ~5-6 GB (4-bit), ~8-10 GB (8-bit), ~14-16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model NousResearch/Nous-Hermes-2-Mistral-7B-DPO

[53] OpenHermes-2.5-Mistral 7B (7B)

A fast daily driver.

Best for: general chat on small GPUs

VRAM: ~5-6 GB (4-bit), ~8-10 GB (8-bit), ~14-16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model teknium/OpenHermes-2.5-Mistral-7B

[54] Replit-Code v1.5 3.3B (≈3.3B)

A lightweight code companion for CI.

Best for: small code tasks, pipelines

VRAM: ~3-4 GB (4-bit), ~5-6 GB (8-bit), ~6-7 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model replit/replit-code-v1_5-3b

[55] Stable-Code 3B (≈3B)

A small coder; broad language mix.

Best for: CI bots, on-device coding

VRAM: ~3-4 GB (4-bit), ~5-6 GB (8-bit), ~6-7 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model stabilityai/stable-code-3b

[56] CodeGemma 7B IT (≈7B)

A compact coder; clean formatting plus function calling.

Best for: everyday local coding assistant

VRAM: ~6-8 GB (4-bit), ~10-12 GB (8-bit), ~14-16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model google/codegemma-7b-it

[57] Qwen2.5-Coder 32B-Instruct (≈32B)

Repository-level coding, tools, long edits.

Best for: heavy IDE/agentic coding

VRAM: ~18-24 GB (4-bit), ~32-40 GB (8-bit), ~60+ GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-Coder-32B-Instruct

[58] Granite-3.3-8B-Instruct (≈8B)

Apache 2.0; enterprise friendly.

Best for: tool-calling backends

VRAM: ~6-8 GB (4-bit), ~10-12 GB (8-bit), ~16 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model ibm-granite/granite-3.3-8b-instruct

[59] Granite-20B-Code-Instruct 8K (≈20B)

Repository Q&A, patch suggestions, code review.

Best for: heavy-duty code assistant

VRAM: ~12-16 GB (4-bit), ~24-30 GB (8-bit), ~40+ GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model ibm-granite/granite-20b-code-instruct-8k

[60] Phi-3-Medium-128K-Instruct (≈14B)

Long context with solid reasoning.

Best for: long-document RAG on a single GPU

VRAM: ~8-10 GB (4-bit), ~16-20 GB (8-bit), ~28 GB (fp16)

Run:

python -m vllm.entrypoints.openai.api_server --model microsoft/Phi-3-medium-128k-instruct

[61] LLaVA-NeXT (7-8B backbone, VLM)

Well suited to screenshot, chart, and UI Q&A.

Best for: visual Q&A over screens/documents

VRAM: ~9-12 GB (4-bit), ~14-18 GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model llava-hf/llava-v1.6-mistral-7b-hf

[62] LLaVA-OneVision-Qwen2-7B-OV (VLM; ≈7B)

SigLIP vision + Qwen2 text; handles larger images.

Best for: general VQA with higher-resolution inputs

VRAM: ~10-14 GB (4-bit), ~18-22 GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model lmms-lab/llava-onevision-qwen2-7b-ov

[63] Moondream 2 (tiny VLM)

Runs almost anywhere.

Best for: lightweight document VQA on small GPUs/CPUs

VRAM: ~3-6 GB (quantized), ~8 GB (fp16)

Run:

ollama run moondream

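[64] Florence-2-Large (VLM)

A promptable vision foundation model.

Best for: captioning/detection/segmentation pipelines

VRAM: ~6-10 GB (task dependent)

Run:

from transformers import AutoProcessor, AutoModelForCausalLM
# Florence-2 ships custom modeling code, so trust_remote_code is required
p = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
m = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)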

[65] DocOwl 1.5 (VLM; ≈7B text)

Tuned for PDFs, tables, and receipts.

Best for: OCR-like document QA

VRAM: ~9-12 GB (4-bit), ~16-20 GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model mPLUG/DocOwl1.5

[66] VILA 13B (VLM; ≈13B)

Multi-image document reasoning.

Best for: multi-image document VQA

VRAM: ~12-16 GB (4-bit), ~20-24 GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model Efficient-Large-Model/VILA-13b

[67] InternVL2 8B (VLM; ≈8B)

A competitive open VLM widely used in research.

Best for: general VQA, OCR-like scenarios

VRAM: ~10-12 GB (4-bit), ~16-18 GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model OpenGVLab/InternVL2-8B

[68] Qwen 2.5-VL 7B Instruct (VLM; ≈7B)

A modern VLM with multilingual strengths.

Best for: screenshot Q&A, slides/charts

VRAM: ~10-12 GB (4-bit), ~16-18 GB (8-bit)

Run:

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-VL-7B-Instruct

How to choose by VRAM

8 GB GPU: Mistral-7B, Qwen-2.5-7B, Gemma-2-9B (tight), Phi-3-Mini, OpenHermes-2.5-7B, Dolphin-Mistral-7B.
12 GB GPU: StableLM-2-12B, StarCoder2-15B (tight), Code Llama-13B, Idefics2-8B (watch image sizes), Yi-1.5-6B/9B.
24 GB GPU: Mixtral-8×7B (tight), Pixtral-12B (VLM), MPT-30B, Yi-1.5-34B, Granite-20B-Code.
48 GB+/multi-GPU: Llama-3.1-70B, Qwen-2.5-72B, Yi-1.5-72B, large VLMs (InternVL2-26B) with high-resolution images.
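
The tiers above follow from simple arithmetic: weights take roughly parameters × bytes per weight, plus headroom for the KV cache and runtime buffers. The sketch below is a back-of-the-envelope estimator, not a measurement; the 20% overhead factor, the parameter counts in the sample catalog, and both function names are assumptions for illustration.

```python
# Approximate storage cost per weight at each precision.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_vram_gb(total_params_b, precision, overhead_fraction=0.2):
    """Weights-only estimate plus a flat overhead (assumed 20%).

    total_params_b: total parameters in billions. For MoE models use
    the total, not the active count, because every expert stays loaded.
    """
    weights_gb = total_params_b * BYTES_PER_WEIGHT[precision]
    return weights_gb * (1.0 + overhead_fraction)


def models_that_fit(catalog, vram_gb, precision="int4"):
    """Return the names whose estimate fits the budget, largest first."""
    fitting = [
        (params, name)
        for name, params in catalog.items()
        if estimate_vram_gb(params, precision) <= vram_gb
    ]
    return [name for _, name in sorted(fitting, reverse=True)]


# Approximate parameter counts in billions (illustrative values).
catalog = {
    "Llama-3.1-8B": 8.0,
    "Mistral-7B": 7.2,
    "Llama-3.2-3B": 3.2,
    "Qwen2.5-72B": 72.0,
}
print(models_that_fit(catalog, vram_gb=8))
print(models_that_fit(catalog, vram_gb=48))
```

Real usage depends on context length, batch size, and the quantization format, so treat the output as a first filter before testing on the actual card.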

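Run tips

llama.cpp/GGUF → the most VRAM-frugal way to run on a single GPU.
vLLM → the best multi-user throughput (paged KV cache).
TensorRT-LLM/ONNX-TRT → excellent performance on supported NVIDIA stacks.
For VLMs, lower the image resolution and cap the number of images per turn to keep memory comfortable.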
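
Context length is the other half of the memory budget, because the KV cache grows linearly with the number of tokens held (image tokens included for VLMs). A minimal sketch of the standard estimate follows; the example shape (32 layers, 8 KV heads, head dimension 128) is assumed from the published Llama 3.1 8B configuration, and the function names are illustrative.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values: 2 tensors per layer, each n_kv_heads * head_dim wide."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Total KV cache size in GiB for one sequence of context_tokens."""
    per_token = kv_cache_bytes_per_token(
        n_layers, n_kv_heads, head_dim, bytes_per_value
    )
    return context_tokens * per_token / 2**30


# Assumed Llama 3.1 8B shape, fp16 cache, 8192-token context.
print(kv_cache_bytes_per_token(32, 8, 128))  # 131072 bytes (128 KiB) per token
print(kv_cache_gib(8192, 32, 8, 128))        # 1.0 GiB
```

Under these assumptions an 8K context costs about 1 GiB per concurrent sequence on top of the weights, which is why long contexts and many parallel users push a card past its limit well before the weights do.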

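3. Closing Thoughts

Running AI locally is no longer a party trick; it is an architectural choice with real advantages:

Privacy and control: your prompts, embeddings, and documents never leave your box.
Performance: lower tail latency, no cold starts, no throttling.
Cost predictability: a one-time GPU purchase plus electricity vs. a monthly per-token bill.
Composability: orchestrate multiple models (router → specialist → verifier) or even do response ranking, the same pattern popularized by PewDiePie's recent "AI council/swarm" demo.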
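
The response-ranking pattern can be sketched in a few lines. This is a toy illustration, not the setup from the video: `members` and `judge` are hypothetical callables standing in for real model clients, and the majority vote uses exact string matching where a real council would more likely use a judge model.

```python
from collections import Counter


def council_answer(prompt, members, judge):
    """Ask every member model, then let a judge pick the best candidate.

    members: dict mapping a model name to a callable(prompt) -> str.
    judge:   callable(prompt, candidates) -> index of the preferred answer.
    Both are placeholders for real model clients.
    """
    candidates = [(name, ask(prompt)) for name, ask in members.items()]
    best = judge(prompt, [text for _, text in candidates])
    return candidates[best]


def majority_judge(prompt, candidates):
    """Toy judge: prefer the answer most members agree on (exact match)."""
    normalized = [c.strip().lower() for c in candidates]
    winner, _ = Counter(normalized).most_common(1)[0]
    return normalized.index(winner)


# Stub members; in practice each would call a local model server.
members = {
    "model_a": lambda p: "Paris",
    "model_b": lambda p: "paris ",
    "model_c": lambda p: "Lyon",
}
print(council_answer("Capital of France?", members, majority_judge))
```

Keeping the judge as a plain callable makes it easy to swap the vote for a verifier model later without touching the members.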

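Yes, the cloud still wins for bursts and for very large contexts or models, and a managed endpoint can be simpler for teams without MLOps experience. But the trend line is clear: edge AI is getting easier, cheaper, and better. With quantization, paged KV caches, and smarter schedulers, a single consumer GPU can now deliver an experience that felt "cloud only" a year ago. As GPU supply stays tight and prices fluctuate, don't be surprised when more creators and startups use local GPUs for steady workloads and keep the cloud for bursts.

My take: the winning pattern is hybrid. Keep a local core (your daily-driver model + embeddings + retrieval) and add a cloud burst path only when you must (extreme context, giant MoE, or team scale). Start small with one excellent 7B-9B model, add a vision model if you need screenshots or PDFs, and evolve toward a ranked ensemble when quality matters most. Today's open-source models are good enough that your own "AI council" is both affordable and private.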
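
The hybrid policy reduces to a small routing rule. The sketch below is one possible shape, assuming the local model serves an 8192-token window; the function name, the threshold, and the `needs_large_model` flag are all illustrative.

```python
def choose_backend(prompt_tokens, max_new_tokens,
                   local_context_limit=8192, needs_large_model=False):
    """Stay local unless the request exceeds what the local core can serve.

    The request must fit prompt plus generated tokens into the local
    context window; anything larger, or anything flagged as needing a
    bigger model, takes the cloud burst path.
    """
    if needs_large_model:
        return "cloud"
    if prompt_tokens + max_new_tokens > local_context_limit:
        return "cloud"
    return "local"


print(choose_backend(2000, 512))    # local
print(choose_backend(8000, 512))    # cloud
```

Defaulting to local and treating the cloud as the exception keeps sensitive prompts on your own hardware unless a request genuinely cannot be served there.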

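If you build something cool from this list, open-source your recipe. The community improves faster by learning together, and that is how we all get better models, better tooling, and better local AI.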

Original article: 100 Deployable Open Models (LLM/ VLM/ SLM) for Personal GPUs: From Tiny to 70B+

Translated and compiled by 汇智网; please credit the source when reposting.