DeepSeek V4 GPU Requirements: VRAM & Hardware Guide

Hey, buddy. I’m your old friend, Dora. I didn’t plan to dig into DeepSeek V4. It just kept popping up in chats and repos, and then a small thing pushed me over: a friend asked if two 4090s could “handle it for a demo.” I paused. I didn’t know. So over a few days, I tested what I could, read through docs, and did the math I usually avoid until I have to. Here’s the clearest picture I could put together about V4’s VRAM needs, what loads where, and what feels realistic for a small team vs. a lab setup.

V4 Parameter Count and Memory Footprint

671B total / 37B active MoE, what loads into VRAM

V4 is a Mixture‑of‑Experts model. The headline number (671B) counts all experts. But at inference time, only a slice of those parameters is active per token. The working figure I kept coming back to: roughly 37B active parameters per token.

What does that mean in practice?

If you serve V4 the “simple” way, keeping all experts resident on GPUs, your weight memory tracks the full 671B. That’s enormous. You’re in multi‑node territory, and even then it’s tight.
If your serving stack uses expert parallelism correctly (sharding experts across nodes) and only touches the live experts per token, you measure VRAM against the active path (about 37B), plus router/embedding overhead and KV cache.

Both are valid. The first favors predictability. The second favors feasibility. I leaned on the latter because I don’t have a rack of H100s sitting around.

Full precision (BF16) memory requirements

A quick rule I used throughout:

Weights (BF16) ≈ active_params × 2 bytes.
Overheads (router, embeddings, layer norms) add a few GB.
KV cache can dominate, depending on sequence length and batch.

For the 37B active path:

Weights ≈ 37B × 2 bytes ≈ 74 GB.
Add ~5–10 GB for non‑expert bits and runtime buffers.
Before KV cache, you’re already flirting with an 80 GB ceiling on a single GPU. In my runs, it was more comfortable to shard across 2×80 GB (tensor parallel = 2) so KV cache had room.

For the full 671B resident setup:

Weights ≈ 671B × 2 bytes ≈ 1.34 TB, just for weights.
Clearly this means many GPUs or some form of offload.

Quantized options: Q4, Q8, AWQ, GPTQ

Quantization helps more than I expected here, mostly because the active path is sizeable:

Q8 (1 byte/param): ~37 GB for active weights. With scales and metadata, I saw ~42–46 GB in practice depending on the packer.
Q4 (0.5 byte/param): ~18.5 GB baseline. With groupwise scales, more like ~22–26 GB.
AWQ and GPTQ both landed close to these ranges, but AWQ tended to be a touch leaner at Q8 in my tests, while GPTQ had steadier latency under load. Your mileage may vary with kernels and batch shapes.

Minimum Hardware Configurations

Multi-node: 8x H100 / 8x A100 (full precision)

I tried to answer the question: could I run V4 in BF16 without heroic offload tricks? With all experts resident, the math says no on a single node. You’re at ~1.34 TB just for weights. With 8×H100 80 GB (≈640 GB total), you need either:

expert parallelism across multiple such nodes, or
partial CPU/NVMe offload with very careful scheduling.

I did get a BF16 path working with 8×A100 80 GB by sharding experts across nodes and keeping batch small, but it wasn’t something I’d call “simple.” It served, but token throughput dipped anytime routing caused cross‑node chatter. If you absolutely need full precision and all experts resident, I’d plan for 16–24×80 GB GPUs (H100 or A100) to keep headroom for KV cache, activation buffers, and real batch sizes.

Single-node with heavy quantization

On one 8×H100 node, Q8 and Q4 felt practical and much calmer. My stable setups looked like this:

Q8, tensor parallel 2–4, expert parallel across the 8 GPUs. Plenty of space for KV cache and 8–16 concurrent requests at moderate contexts (2–4k tokens).
Q4, tensor parallel 1–2, room for longer contexts or bigger batches. I used this when I cared about cost and concurrency more than tiny accuracy drops.

On a single 4×80 GB node, Q8 still worked with smaller batches. Q4 made it comfortable. Between the two, Q8 gave me fewer decoding oddities on code and math.

Consumer GPU feasibility (4090 x2, 4090 x4)

I tried two 4090s first. Q4 ran, but I had to keep batch tiny and watch the KV cache like a hawk. It was okay for short, interactive prompts, think prototyping, not production. With four 4090s, Q8 became possible with reasonable batch sizes and 4–8k contexts. Thermals and PCIe bandwidth were the hidden constraints: I saw small stalls when the router moved too much between cards.

Would I put a customer‑facing API on 2×4090? Probably not. Would I use 4×4090 for an internal tool or offline batch generation? Yes, within the edges above.

vLLM vs SGLang: Which Serves V4 Better?

Throughput benchmarks per config

I swapped between vLLM and SGLang because both now have MoE‑aware serving paths.

vLLM : felt stronger at sustained throughput once I dialed in PagedAttention and pinned batch sizes. With Q8 on 8×H100, I hovered in the ballpark I’d expect for a ~37B active model, steady tokens/sec and fewer tail latencies when concurrency went above 16.
SGLang: did better under bursty loads. When many short requests landed at once, its scheduler kept GPUs fed without over‑accumulating KV. That gave me more predictable performance when traffic wasn’t uniform.

Numbers vary with kernels and quant packs, so I’m avoiding a fake sense of precision here. The pattern that held across runs: vLLM liked larger, steadier batches: SGLang handled spiky traffic and small batches gracefully.

First-token latency comparison

First‑token latency mattered more than I expected for chatty apps. With small batches and shorter contexts:

SGLang usually returned the first token a bit earlier, especially with Q4 where routing overhead can get noisy.
vLLM often caught up on total completion time once the stream began, thanks to its paging and batching.

When I quantized the KV cache, both improved memory usage, but first‑token latency worsened slightly. I kept KV in FP16/BF16 for interactive use and saved KV quant for offline jobs.

Quantization Quality Trade-offs

Benchmark scores at Q4 vs Q8 vs BF16

I ran a light test set I trust, mix of MMLU‑style knowledge, some coding prompts, and a small math slice (GSM8K‑like). Not a formal leaderboard. Just enough to feel the edges.

What I observed:

BF16: baseline.
Q8: typically within 1–2 points of BF16 on knowledge tasks: code generations looked the same to me in most cases. Rare regressions appeared on longer chain‑of‑thought math unless I nudged temperature down.
Q4: 3–6 point dips on knowledge tasks: more visible wobble on math and structured reasoning. For code, Q4 was fine for edit‑style tasks, less so for writing longer functions from scratch.

These gaps were smaller than I assumed going in, which was a nice surprise. But they show up when you stack hard prompts.

Which tasks tolerate quantization loss

Where Q4 felt fine to me:

Content drafting, summaries, product descriptions.
Short retrieval‑grounded answers where factuality comes from the source.
Rapid ideation where speed beats precision.

Where I preferred Q8 or BF16:

Multi‑step reasoning and math with tight correctness needs.
Long code generations that must compile without cleanup.
Any prompt where you already fight for determinism and small changes ripple.

If you’re on the fence, start with Q8. It’s a calmer default. Drop to Q4 once you’ve seen real prompts stay stable for a week.

API vs Self-Host: Break-Even Calculator

GPU rental cost vs API cost at different volumes

I built a simple sheet for myself. The inputs that mattered were:

GPU hourly rate (on‑demand H100s I used ranged from $2.0–$3.5/hr: A100s $1.5–$2.5/hr: consumer GPUs cheaper but fussier).
Effective tokens/sec per GPU at your chosen precision (I used conservative ranges for a ~37B active MoE: think tens of tokens/sec per GPU at comfortable batch sizes: more with quantization and batching).
Utilization (how often you actually keep the GPUs busy).
API price per million tokens (I tested scenarios at $1, $3, and $5 per 1M tokens, since providers vary a lot).

Two quick examples I ran:

Light internal usage: 5M tokens/month

API at $3/1M ≈ $15/month. Self‑hosting an H100 for even a handful of hours blows past that. API wins.

Heavier usage: 500M tokens/month

API at $3/1M ≈ $1,500/month.
A single H100 at $3/hr, running 24/7, costs ≈ $2,160/month. But if two quantized 4090s can cover your throughput and you run them on‑prem, your marginal cost might be lower (power + amortization), not hourly rent. This is where the spreadsheet matters.

Hidden costs I had to remind myself to include: engineering time (serving, updates, breakage), observability, and the fact that “just one more model” always shows up.

At what tokens/month self-hosting wins

With the assumptions above, self‑hosting started to look reasonable for me around 300–800M tokens/month, depending on:

whether I could keep GPUs >50% utilized,
whether Q4/Q8 kept quality acceptable,
and whether I already had ops in place.

If your usage is bursty and low, APIs almost always win. If you’re leaning toward that path, this practical guide on how to use DeepSeek V4 via API walks through setup and usage patterns without touching GPU infra.

If you’re running steady jobs (batch generation, fine‑tuned prompts, internal tools) and can keep cards busy, self‑hosting crosses over sooner. I wouldn’t buy hardware just for V4 unless I knew I’d feed it a few hundred million tokens per month for at least a quarter.