🔥 Updated March 2026 · 60+ Models · 80+ GPUs

Can My GPU Run This LLM?

The most accurate VRAM calculator for local AI. Supports inference, fine-tuning, multi-GPU, and all major quantization formats.

60+ Models · 80+ GPUs · 4 Modes · Free Forever
[Interactive calculator: select your GPU, a quantization level (FP16, Q8, Q4_K_M, Q4, Q3, Q2), a model family (All, DeepSeek, Llama, Qwen, Mistral, Gemma, Phi), and a sequence length from 1K to 128K, with an optional Flash Attention 2 toggle that reduces KV cache. Matching models are listed as ✅ Fits FP16, ⚠️ Needs Quantization, or ❌ Too Large.]

❓ FAQ
❓ FAQ
Why is my VRAM estimate different from Ollama?
Ollama uses Flash Attention 2 and Q4_K_M quantization by default, which significantly reduces VRAM use compared with raw FP16. Set the framework selector to Ollama and quantization to Q4_K_M for a close match. Also note that Ollama can spill to CPU RAM, so it may "run" a model that technically doesn't fit fully in VRAM.
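As a rough illustration of why the format matters so much, weight memory scales with bits per weight. The bits-per-weight values in this sketch are approximate averages for llama.cpp GGUF formats (real file sizes vary with the tensor mix), so treat it as intuition rather than the calculator's internal model:

```python
# Rough weight-memory comparison across quantization formats.
# Bits-per-weight values are approximate averages for llama.cpp GGUF
# quants; real files vary slightly with the tensor mix.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # approximate, includes block scales
    "Q4_K_M": 4.85,  # approximate
}

def weight_gb(params_billions: float, fmt: str) -> float:
    """Model weight footprint in GB for a given quant format."""
    total_bits = BITS_PER_WEIGHT[fmt] * params_billions * 1e9
    return total_bits / 8 / 1e9  # bits -> bytes -> GB

for fmt in BITS_PER_WEIGHT:
    print(f"8B model, {fmt}: {weight_gb(8, fmt):.1f} GB")
# FP16 ~16.0 GB, Q8_0 ~8.5 GB, Q4_K_M ~4.9 GB -- hence the Ollama gap.
```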
Why do MoE models need so much VRAM?
Mixture-of-Experts models like DeepSeek-V3 or Mixtral have many "expert" sub-networks. Only a fraction activate per token, but ALL must reside in VRAM for fast switching. A 671B MoE model still loads all 671B parameters—you can't selectively load only active experts.
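A sketch of that asymmetry, using illustrative DeepSeek-V3-style numbers (671B total parameters, roughly 37B active per token):

```python
# Why MoE VRAM tracks TOTAL parameters: every expert must be resident
# in VRAM, even though only a few fire per token.
def moe_weight_vram_gb(total_params_b: float,
                       bytes_per_param: float = 2.0) -> float:
    """Weight VRAM: all experts loaded, active or not."""
    return total_params_b * bytes_per_param  # billions * bytes/param = GB

# Illustrative DeepSeek-V3-style shape: 671B total, ~37B active/token.
total_b, active_b = 671, 37
print(f"weights in VRAM: {moe_weight_vram_gb(total_b):.0f} GB at FP16")
print(f"per-token compute: comparable to a dense ~{active_b}B model")
```

This is why MoE models can be fast (compute scales with active parameters) yet still enormous to host (memory scales with total parameters).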
What is the KV Cache and why does it grow with context?
The KV (Key-Value) cache stores attention states for all previous tokens. It grows linearly with sequence length: a 70B model at 128K context adds 12+ GB for KV cache alone on top of model weights. This is why "context length" is as important as model size for VRAM planning.
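The linear growth falls out of the standard KV-cache formula. The configuration below is an illustrative Llama-3-70B-like shape (80 layers, 8 KV heads under grouped-query attention, head dim 128), not the calculator's exact parameters:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * seq_len * batch * bytes_per_element
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1,
                bytes_per_elem: int = 2) -> float:
    return (2 * layers * kv_heads * head_dim
            * seq_len * batch * bytes_per_elem) / 1e9

# Illustrative 70B-class shape: 80 layers, 8 KV heads (GQA), head_dim 128.
print(f"{kv_cache_gb(80, 8, 128, 131072):.1f} GB at 128K context (FP16 cache)")
# GQA already shrinks this vs. full multi-head attention;
# an 8-bit KV cache would halve it again.
```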
How accurate are the token/sec estimates?
Speed estimates are based on memory bandwidth (for small-batch inference) and calibrated against community benchmarks. Expect ±20% accuracy. Actual performance depends on driver version, PCIe bandwidth, thermal throttling, and concurrent system load.
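The bandwidth-bound intuition in one line: each decoded token streams roughly all the model weights from VRAM once, so bandwidth divided by weight bytes gives a ceiling on tokens/sec. The GPU and model numbers below are illustrative:

```python
# Batch-1 decode is memory-bandwidth bound: each generated token reads
# (roughly) all weights from VRAM, so bandwidth / weight bytes gives
# an upper bound on tokens/sec.
def decode_tps_ceiling(bandwidth_gb_s: float, weight_gb: float) -> float:
    return bandwidth_gb_s / weight_gb

# Illustrative: ~1008 GB/s (RTX 4090-class) streaming a ~4.9 GB
# 8B Q4_K_M model. Real numbers land below the ceiling due to
# kernel launches, KV-cache reads, and driver overhead.
print(f"{decode_tps_ceiling(1008, 4.9):.0f} tok/s upper bound")
```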
🕐 Recent Updates
Mar 2026: Add framework-aware Flash Attention overhead modeling. Fix Ollama vs. raw HF discrepancy.
Feb 2026: Add "What Can I Run?" inverse mode. Launch GPU Upgrade Advisor.
Feb 2026: Add Llama 4 Scout and Maverick, the full Qwen3 series, and DeepSeek-V3.1.
Jan 2026: Add RTX 5090/5080/5070. Improve the memory-bandwidth-based TPS formula.
Dec 2025: Fix TTFT prefill calculation. Add MoE active-expert scaling for TPS estimates.

What Can I Run?

Select your GPU and instantly see which LLMs fit in VRAM, with tokens/sec (TPS) and time-to-first-token (TTFT) estimates for each runnable model.

VRAM Calculator

Configure any model, quantization, context length, batch size and framework to get a precise VRAM breakdown — weights, KV cache, activations, overhead.

GPU Upgrade Advisor

Compare two GPUs side-by-side for your target model. See which workloads move from "doesn't fit" to "fits" with a GPU upgrade.

Fine-Tuning Calculator

Plan LoRA, QLoRA and full fine-tune runs. Accounts for optimizer states, gradients, and mixed-precision training overhead.
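For intuition, here is a back-of-envelope version of that accounting, using the standard per-parameter byte counts for Adam with mixed precision. Activation memory and optimizer sharding are ignored, and the LoRA/QLoRA adapter overhead is folded into rough constants, so treat it as a sketch:

```python
# Per-parameter training memory (weights + optimizer state only):
#   full FT, Adam + mixed precision:
#     fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
#     + Adam first/second moments (4 + 4) = 16 bytes/param
#   LoRA:  frozen fp16 base = 2 bytes/param (+ small adapter states)
#   QLoRA: ~4-bit base = ~0.55 bytes/param (+ small adapter states)
BYTES_PER_PARAM = {"full": 16.0, "lora": 2.0, "qlora": 0.55}

def train_vram_gb(params_b: float, mode: str) -> float:
    return params_b * BYTES_PER_PARAM[mode]

for mode in ("full", "lora", "qlora"):
    print(f"8B {mode}: ~{train_vram_gb(8, mode):.0f} GB + activations")
# full ~128 GB, LoRA ~16 GB, QLoRA ~4 GB, before activation memory.
```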