❓ FAQ
Why is my VRAM estimate different from Ollama?
Ollama uses Flash Attention 2 and Q4_K_M quantization by default, which significantly reduces VRAM use compared with raw FP16 weights. For a close match, use the framework selector and set quantization to Q4_K_M. Also note that Ollama can spill to CPU RAM, so it may "run" a model that doesn't fully fit in VRAM.
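As a rough illustration of why quantization changes the estimate so much, weight memory is roughly parameters × bits-per-weight ÷ 8. This is a minimal sketch, not the calculator's actual code; the bits-per-weight values are approximate effective averages for llama.cpp-style formats, not exact specs:

```python
# Approximate effective bits per weight; Q4_K_M mixes block formats
# internally, so ~4.85 is an average, not an exact figure.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}

def weight_vram_gb(params_billion: float, quant: str) -> float:
    """Estimated GB for model weights alone (no KV cache or overhead)."""
    bytes_total = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return bytes_total / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"8B @ {quant}: {weight_vram_gb(8, quant):.1f} GB")
# FP16 ≈ 16.0 GB, Q8_0 ≈ 8.5 GB, Q4_K_M ≈ 4.9 GB
```

The gap between 16 GB and ~5 GB for the same 8B model is the main source of the Ollama-vs-FP16 discrepancy.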
Why do MoE models need so much VRAM?
Mixture-of-Experts models like DeepSeek-V3 or Mixtral contain many "expert" sub-networks. Only a few experts activate for each token, but all of them must reside in VRAM because the active set changes from token to token. A 671B MoE model still loads all 671B parameters; you can't selectively load only the active experts.
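A quick sketch of the distinction, using Mixtral 8x7B's published figures (46.7B total, 12.9B active per token) and assuming FP16 weights:

```python
# Why MoE VRAM tracks TOTAL parameters, not active ones.
total_params_b = 46.7   # all experts + shared layers, resident in VRAM
active_params_b = 12.9  # parameters actually used per token (2 of 8 experts)

bytes_per_param = 2  # FP16
vram_gb = total_params_b * bytes_per_param      # ~93 GB resident
read_gb = active_params_b * bytes_per_param     # ~26 GB read per token

print(f"Weights resident in VRAM: {vram_gb:.0f} GB")
print(f"Weights read per token:   {read_gb:.0f} GB")
```

The ~26 GB read per token is why MoE models decode faster than a dense 47B model; the ~93 GB resident is why they don't fit in less memory.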
What is the KV Cache and why does it grow with context?
The KV (Key-Value) cache stores the attention keys and values for every previous token. It grows linearly with sequence length: a 70B model at 128K context can add 12+ GB for the KV cache alone, on top of the model weights. This is why "context length" is as important as model size for VRAM planning.
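The standard formula for an FP16 KV cache is 2 (keys and values) × layers × KV heads × head dim × sequence length × 2 bytes. The sketch below uses an illustrative Llama-style 70B config with grouped-query attention; the calculator's internal constants may differ:

```python
def kv_cache_gb(seq_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Illustrative 70B config: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
print(f"{kv_cache_gb(131072, 80, 8, 128):.1f} GB at 128K context")  # ~42.9 GB
print(f"{kv_cache_gb(8192, 80, 8, 128):.1f} GB at 8K context")      # ~2.7 GB
```

An 8-bit quantized KV cache halves these numbers, which is one reason real deployments vary.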
How accurate are the token/sec estimates?
Speed estimates are based on memory bandwidth (for small-batch inference) and calibrated against community benchmarks. Expect results within about ±20%. Actual performance depends on driver version, PCIe bandwidth, thermal throttling, and concurrent system load.
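The bandwidth-bound estimate works because small-batch decoding reads all model weights once per token. A minimal sketch of that back-of-envelope math; the efficiency factor, GPU bandwidth, and model size below are illustrative assumptions, not the calculator's calibrated values:

```python
def est_tokens_per_sec(mem_bandwidth_gbs: float, weight_gb: float,
                       efficiency: float = 0.7) -> float:
    """Small-batch decode is memory-bound: every token streams all weights,
    so TPS ~= effective bandwidth / bytes read per token."""
    return mem_bandwidth_gbs * efficiency / weight_gb

# Illustrative: RTX 4090 (~1008 GB/s spec) running an 8B model at Q4_K_M (~4.9 GB).
print(f"{est_tokens_per_sec(1008, 4.9):.0f} tok/s")  # ~144 tok/s upper bound
```

Real-world numbers land below this ceiling, which is where the ±20% calibration against community benchmarks comes in.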
🕐 Recent Updates
Mar 2026: Add framework-aware Flash Attention overhead modeling. Fix Ollama vs raw HF discrepancy.
Feb 2026: Add "What Can I Run?" inverse mode. Launch GPU Upgrade Advisor.
Feb 2026: Add Llama 4 Scout and Maverick, the full Qwen3 series, and DeepSeek-V3.1.
Jan 2026: Add RTX 5090/5080/5070. Improve memory bandwidth-based TPS formula.
Dec 2025: Fix TTFT prefill calculation. Add MoE active-expert scaling for TPS estimates.