❓ FAQ
Why is my VRAM estimate different from Ollama?
Ollama uses Flash Attention 2 and Q4_K_M quantization by default, which significantly reduces VRAM use compared with raw FP16 weights. For a close match, use the framework selector and set quantization to Q4_K_M. Also note that Ollama can spill to CPU RAM, so it may "run" a model that doesn't fully fit in VRAM.
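As a rough illustration of why quantization changes the estimate so much, weight memory is roughly parameters × bits-per-weight ÷ 8. This is a minimal sketch, not the calculator's actual code; the bits-per-weight values are approximate effective averages for llama.cpp-style formats, not exact specs:

```python
# Approximate effective bits per weight; Q4_K_M mixes block formats
# internally, so ~4.85 is an average, not an exact figure.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}

def weight_vram_gb(params_billion: float, quant: str) -> float:
    """Estimated GB for model weights alone (no KV cache or overhead)."""
    bytes_total = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return bytes_total / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"8B @ {quant}: {weight_vram_gb(8, quant):.1f} GB")
# FP16 ≈ 16.0 GB, Q8_0 ≈ 8.5 GB, Q4_K_M ≈ 4.9 GB
```

The gap between 16 GB and ~5 GB for the same 8B model is the main source of the Ollama-vs-FP16 discrepancy.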
Why do MoE models need so much VRAM?
Mixture-of-Experts models like DeepSeek-V3 or Mixtral contain many "expert" sub-networks. Only a few experts activate for each token, but all of them must reside in VRAM because the active set changes from token to token. A 671B MoE model still loads all 671B parameters; you can't selectively load only the active experts.
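A quick sketch of the distinction, using Mixtral 8x7B's published figures (46.7B total, 12.9B active per token) and assuming FP16 weights:

```python
# Why MoE VRAM tracks TOTAL parameters, not active ones.
total_params_b = 46.7   # all experts + shared layers, resident in VRAM
active_params_b = 12.9  # parameters actually used per token (2 of 8 experts)

bytes_per_param = 2  # FP16
vram_gb = total_params_b * bytes_per_param      # ~93 GB resident
read_gb = active_params_b * bytes_per_param     # ~26 GB read per token

print(f"Weights resident in VRAM: {vram_gb:.0f} GB")
print(f"Weights read per token:   {read_gb:.0f} GB")
```

The ~26 GB read per token is why MoE models decode faster than a dense 47B model; the ~93 GB resident is why they don't fit in less memory.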
What is the KV Cache and why does it grow with context?
The KV (Key-Value) cache stores the attention keys and values for every previous token. It grows linearly with sequence length: a 70B model at 128K context can add 12+ GB for the KV cache alone, on top of the model weights. This is why "context length" is as important as model size for VRAM planning.
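The standard formula for an FP16 KV cache is 2 (keys and values) × layers × KV heads × head dim × sequence length × 2 bytes. The sketch below uses an illustrative Llama-style 70B config with grouped-query attention; the calculator's internal constants may differ:

```python
def kv_cache_gb(seq_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Illustrative 70B config: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
print(f"{kv_cache_gb(131072, 80, 8, 128):.1f} GB at 128K context")  # ~42.9 GB
print(f"{kv_cache_gb(8192, 80, 8, 128):.1f} GB at 8K context")      # ~2.7 GB
```

An 8-bit quantized KV cache halves these numbers, which is one reason real deployments vary.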
How accurate are the token/sec estimates?
Speed estimates are based on memory bandwidth (for small-batch inference) and calibrated against community benchmarks. Expect results within about ±20%. Actual performance depends on driver version, PCIe bandwidth, thermal throttling, and concurrent system load.
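The bandwidth-bound estimate works because small-batch decoding reads all model weights once per token. A minimal sketch of that back-of-envelope math; the efficiency factor, GPU bandwidth, and model size below are illustrative assumptions, not the calculator's calibrated values:

```python
def est_tokens_per_sec(mem_bandwidth_gbs: float, weight_gb: float,
                       efficiency: float = 0.7) -> float:
    """Small-batch decode is memory-bound: every token streams all weights,
    so TPS ~= effective bandwidth / bytes read per token."""
    return mem_bandwidth_gbs * efficiency / weight_gb

# Illustrative: RTX 4090 (~1008 GB/s spec) running an 8B model at Q4_K_M (~4.9 GB).
print(f"{est_tokens_per_sec(1008, 4.9):.0f} tok/s")  # ~144 tok/s upper bound
```

Real-world numbers land below this ceiling, which is where the ±20% calibration against community benchmarks comes in.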
🕐 Recent Updates
Mar 2026: Add framework-aware Flash Attention overhead modeling. Fix Ollama vs raw HF discrepancy.
Feb 2026: Add "What Can I Run?" inverse mode. Launch GPU Upgrade Advisor.
Feb 2026: Add Llama 4 Scout and Maverick, the full Qwen3 series, and DeepSeek-V3.1.
Jan 2026: Add RTX 5090/5080/5070. Improve memory bandwidth-based TPS formula.
Dec 2025: Fix TTFT prefill calculation. Add MoE active-expert scaling for TPS estimates.