DeepSeek V4 Cost per Million Tokens: Full Calculator

Hey, guys. Dora here.

I spent three weeks last month running DeepSeek V4 in production. My monthly bill came to $18. The same workload on GPT-4o would’ve cost around $380. On Claude Opus 4.5, closer to $720.

That gap made me dig into the numbers properly — not to celebrate cheap compute, but to understand whether the pricing holds up under real use and where the hidden costs hide.

Published Pricing at Launch (verified table)

DeepSeek V4’s official pricing went live:

Standard rates (per 1M tokens):

  • Input tokens (cache miss): $0.30
  • Input tokens (cache hit): $0.03
  • Output tokens: $0.50

Off-peak rates (per 1M tokens):

  • Input tokens (cache miss): $0.15
  • Input tokens (cache hit): $0.015
  • Output tokens: $0.25

The cache hit discount is 90%. That means if you structure your prompts with repeating elements — system instructions, tool definitions, document templates — the cost drops dramatically after the first request.

Input tokens — standard vs cache hit vs off-peak

Cache hits happen when DeepSeek recognizes part of your prompt has been processed recently and reuses the computation. This only works with consistent prefixes — system instructions or tool definitions that don’t change between calls.

I tested this with a research summarizer. The system prompt and extraction schema stayed constant across runs. After the first request, cache hit rates stayed around 65-70%. My effective input cost dropped from $0.30 to roughly $0.12 per million tokens.
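As a sanity check, the effective input rate is just a weighted average of the hit and miss prices. Using the standard rates from the table and the midpoint of my observed 65-70% hit range:

```python
# Blended input cost per 1M tokens under prefix caching (standard rates).
CACHE_HIT_RATE = 0.675   # midpoint of the observed 65-70% range
HIT_PRICE = 0.03         # $/1M input tokens on a cache hit
MISS_PRICE = 0.30        # $/1M input tokens on a cache miss

effective_rate = CACHE_HIT_RATE * HIT_PRICE + (1 - CACHE_HIT_RATE) * MISS_PRICE
print(f"${effective_rate:.3f} per 1M input tokens")  # $0.118 per 1M input tokens
```

That lands on the roughly $0.12 figure quoted above.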

Off-peak pricing runs from approximately 11 PM to 7 AM Beijing time (UTC+8) with a 50% discount across all token types. I scheduled my weekly batch jobs for 2 AM Beijing time. Same workload, half the cost. The latency didn’t matter for batch processing, so the trade-off was straightforward.

Output tokens — standard vs off-peak

Output tokens cost more because generation requires sequential computation — the model can’t parallelize output the way it processes input. At $0.50 per million (standard) or $0.25 (off-peak), you’re still paying less than most models charge for input alone.

GPT-4o charges $2.50 per million input tokens and $10 per million output tokens. Claude Opus 4.5 charges $15 per million output tokens. For my use case — generating 800-1200 token summaries from 3000-5000 token inputs — output costs remained lower than input costs even without caching benefits.

How V4 compares to V3 pricing

V4 launched at $0.30 input / $0.50 output versus V3’s $0.14 / $0.28 when it debuted in late December 2024. That’s roughly double the launch price for both token types.

The increase reflects real architectural improvements: longer context windows (up to 1M tokens), better tool calling accuracy, and hybrid reasoning modes that weren’t available in V3. What changed isn’t just the price but the capability-to-cost ratio. V4 scores 81% on SWE-bench Verified compared to V3’s 69%, meaning you’re getting significantly better performance for roughly twice the per-token price.

Why DeepSeek Is 20-50x Cheaper Than OpenAI

The pricing gap isn’t marketing. It’s architectural efficiency translating to operational cost.

MoE architecture: 671B total, 37B active

DeepSeek V4 uses Mixture-of-Experts with 671 billion total parameters but only activates 37 billion per token. When you send a request, the model’s routing mechanism selects 8 specialized experts from a pool of 256, plus one shared expert that processes everything. Those 9 experts handle the computation. The other 248 stay dormant.

This matters because compute cost scales with active parameters, not total parameters. Compare this to dense models like GPT-4, which activate all parameters for every token. A 405-billion-parameter model like Llama 3.1 requires roughly 2,448 GFLOPs per token. DeepSeek V4 requires approximately 250 GFLOPs — nearly 10x less computation.
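Those cited per-token figures pencil out to the claimed gap:

```python
# Sanity-check the "nearly 10x" claim using the figures quoted above.
llama_gflops = 2448   # per-token estimate cited for Llama 3.1 405B (dense)
v4_gflops = 250       # per-token estimate cited for DeepSeek V4 (37B active)
print(round(llama_gflops / v4_gflops, 1))  # 9.8
```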

That efficiency shows up in deployment requirements too. With aggressive quantization and CPU offloading, V4 can run on a single server with dual RTX 4090s for smaller workloads. Dense models of comparable capability need multi-node GPU clusters. Hardware costs compound over millions of API calls, and those savings flow through to pricing. The efficiency gains come partly from DeepSeek’s manifold-constrained hyper-connections (mHC) architecture, which optimizes routing between expert layers.

Training cost ($5.6M vs GPT-4 $100M+)

DeepSeek trained V3 for $5.6 million using 2.788 million H800 GPU hours across 14.8 trillion tokens. Industry estimates place GPT-4’s training cost around $100 million or more — roughly 18x higher.

The gap comes from two factors: MoE architecture trains faster than dense models at similar capability levels, and DeepSeek used H800 GPUs which cost less than H100s while still delivering sufficient performance.

Lower training costs don’t automatically mean lower inference prices — companies can charge whatever the market bears — but DeepSeek has consistently passed savings through. V2, V3, and V4 have all launched below frontier model rates while matching or exceeding performance on key benchmarks. That pattern suggests the pricing is sustainable, not temporary.

Real Cost Calculator Template

Inputs: daily tokens, cache hit rate, off-peak %

The variables that matter:

  1. Total input/output tokens per day
  2. Cache hit rate (0-100%)
  3. Off-peak percentage (0-100%)
  4. Days per month

The calculation is straightforward:

cacheable_input = (input_tokens × cache_hit_rate × $0.03) / 1M
non_cacheable_input = (input_tokens × (1 - cache_hit_rate) × $0.30) / 1M
output_cost = (output_tokens × $0.50) / 1M
daily_cost = cacheable_input + non_cacheable_input + output_cost

Apply the off-peak discount (50% on the fraction of traffic scheduled off-peak):

adjusted_daily_cost = daily_cost × (1 − off_peak_rate × 0.5)
monthly_cost = adjusted_daily_cost × 30
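The steps above as a runnable sketch. Prices are the standard-rate table values, and the off-peak discount is modeled as a flat 50% on the off-peak fraction of traffic:

```python
def monthly_cost(input_tokens, output_tokens, cache_hit_rate,
                 off_peak_frac, days=30):
    """Estimate monthly DeepSeek V4 spend from daily token volumes.

    input_tokens / output_tokens: daily volumes in tokens.
    cache_hit_rate, off_peak_frac: fractions in [0, 1].
    """
    M = 1_000_000
    cacheable_input = input_tokens * cache_hit_rate * 0.03 / M
    non_cacheable_input = input_tokens * (1 - cache_hit_rate) * 0.30 / M
    output_cost = output_tokens * 0.50 / M
    daily = cacheable_input + non_cacheable_input + output_cost
    # Off-peak traffic bills at half price; the rest at full price.
    adjusted_daily = daily * (1 - off_peak_frac * 0.5)
    return adjusted_daily * days

# The 10M tokens/day workload worked through below:
print(round(monthly_cost(6_000_000, 4_000_000, 0.40, 0.30), 2))  # 80.38
```

This prints $80.38 rather than $80.40 only because the worked example rounds each line item before summing.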

Example: 10M tokens/day workload

A workload processing 10 million tokens daily typically splits into roughly 6 million input and 4 million output tokens. This ratio is common for summarization, rewriting, or content generation tasks.

Assumptions:

  • 40% cache hit rate (conservative for workflows with consistent system prompts)
  • 30% off-peak usage (batch jobs scheduled overnight)
  • Standard V4 pricing

Daily cost breakdown:

  • Cacheable input: (6M × 0.40 × $0.03) / 1M = $0.072
  • Non-cacheable input: (6M × 0.60 × $0.30) / 1M = $1.08
  • Output: (4M × $0.50) / 1M = $2.00
  • Total before off-peak: $3.15

With 30% off-peak scheduling:

  • Standard portion (70%): $2.21
  • Off-peak portion (30% × 50% discount): $0.47
  • Adjusted daily: $2.68/day or $80.40/month

For comparison, the same 10M daily token workload would cost:

  • GPT-4o: ~$450/month
  • Claude Opus 4.5: ~$900/month
  • DeepSeek V4: $80.40/month

That’s an 82-91% cost reduction for comparable capability.

Example: RAG pipeline with 80% cache hit rate

Retrieval-augmented generation pipelines see higher cache hit rates because the retrieved context often overlaps between similar queries.

A RAG system answering 1,000 queries daily:

  • 8,000 input tokens per query (2,000 for user question + 6,000 for retrieved context)
  • 500 output tokens per query (generated answer)
  • 80% cache hit rate (document chunks repeat across queries)
  • 0% off-peak (user-facing, requires immediate response)

Daily cost:

  • Total input: 8M tokens
  • Cacheable: (8M × 0.80 × $0.03) / 1M = $0.192
  • Non-cacheable: (8M × 0.20 × $0.30) / 1M = $0.48
  • Output: (500K × $0.50) / 1M = $0.25
  • Daily total: $0.92
  • Monthly: $27.66
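The same arithmetic for the RAG numbers, with no off-peak term since the traffic is user-facing:

```python
M = 1_000_000
queries = 1_000
input_tokens = queries * 8_000   # 8M input tokens/day
output_tokens = queries * 500    # 500K output tokens/day
hit_rate = 0.80

daily = (input_tokens * hit_rate * 0.03 / M          # cached context
         + input_tokens * (1 - hit_rate) * 0.30 / M  # fresh context
         + output_tokens * 0.50 / M)                 # generated answers
print(f"${daily:.2f}/day, ${daily * 30:.2f}/month")  # $0.92/day, $27.66/month
```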

Without caching, this workload would cost $2.65/day ($2.40 input plus $0.25 output), or $79.50/month. Proper cache optimization saves roughly $52/month — a 65% reduction. This is why structured, repeatable prompts matter more than they might seem.

Hidden Costs to Budget For

Retry overhead on rate limit hits

DeepSeek enforces rate limits of roughly 100,000 TPM and 500 RPM (based on V3 behavior and testing). When you hit them, the API returns a 429 status and you need to retry with backoff. During a test that deliberately exceeded the limits, about 8% of requests needed one retry and 2% needed two. The token cost of retries is zero (failed requests don’t bill), but the latency matters for time-sensitive workloads.
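A minimal retry wrapper with exponential backoff — a sketch, not DeepSeek’s SDK. Here `send_request` is a placeholder for whatever zero-argument callable your client exposes, returning an object with a `status_code` attribute:

```python
import random
import time

def with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry `send_request` on HTTP 429, with exponential backoff.

    `send_request` is any zero-argument callable returning an object
    with a `status_code` attribute (a stand-in for your actual client).
    """
    for attempt in range(max_retries):
        response = send_request()
        if response.status_code != 429:
            return response
        # Delay doubles each attempt, with up to 100% jitter on top.
        time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    return send_request()  # final attempt; surface the 429 if it persists
```

Jitter matters here: without it, a fleet of workers that got throttled together retries together and trips the limit again.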

Long-context (1M token) requests

A single 1M token input costs $0.30. If you’re processing 100 such documents daily, that’s $30/day, or $900/month just for input. More importantly, long-context requests take longer — my tests showed 500K token inputs took 12-18 seconds for first token versus 2-3 seconds for 10K inputs. For most use cases, chunking documents delivers better cost and latency.

Tool call token inflation

Tool definitions consume input tokens. A typical tool runs 150-300 tokens. With 20 tools exposed, that’s 3,000-6,000 tokens added to every request. Tool calls also inflate output because the model generates structured JSON for each invocation (50-150 tokens per call). My test agent with 15 tools averaged 250 additional output tokens per request. The fix: only include tools relevant for each request type.
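One way to implement that fix is to filter the tool list per request type before building the payload. A sketch — the `tags` field is a convention I’m inventing for this example, not an API field:

```python
def select_tools(all_tools, request_type):
    """Return only the tool definitions relevant to this request type.

    Each tool dict carries a "tags" set naming the request types it
    serves (a convention for this sketch, not part of any tool schema).
    """
    return [tool for tool in all_tools if request_type in tool["tags"]]

tools = [
    {"name": "search_docs", "tags": {"qa", "research"}},
    {"name": "run_sql", "tags": {"analytics"}},
    {"name": "send_email", "tags": {"outreach"}},
]
print([t["name"] for t in select_tools(tools, "qa")])  # ['search_docs']
```

With 20 tools at 150-300 tokens each, cutting the list to the 3-4 relevant ones saves a few thousand input tokens on every single call.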

When V4 Stops Being Cheap (scale thresholds)

Around 50 million tokens daily (roughly $400-750/month on standard pricing, depending on caching and the input/output mix), self-hosting economics start to make sense. DeepSeek open-sources its weights, so running V4 on your own infrastructure means upfront hardware costs but zero per-token fees. Rough breakeven:

  • 50M+ tokens daily: self-hosting may be cheaper within 6-12 months
  • Sporadic bursts: API pricing remains more efficient
  • Geographic data residency needs: self-hosting may be required regardless of cost
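The breakeven test is simple division: months until cumulative API savings cover the hardware outlay. The dollar figures below are placeholders, not quotes:

```python
def breakeven_months(hardware_cost, monthly_api_cost, monthly_ops_cost=0.0):
    """Months until self-hosting's upfront cost is recovered.

    Returns infinity when the API is cheaper than ongoing ops alone.
    """
    monthly_saving = monthly_api_cost - monthly_ops_cost
    if monthly_saving <= 0:
        return float("inf")
    return hardware_cost / monthly_saving

# Placeholder figures: $8,000 server vs $1,000/month API, $200/month power/ops.
print(round(breakeven_months(8_000, 1_000, 200), 1))  # 10.0
```

If the payback period comes out past your hardware’s useful life, the API wins regardless of per-token price.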

Around 200-300 million tokens daily (roughly $1,600-2,400/month under the same assumptions), building your own inference cluster with quantized models starts to make economic sense.

The other threshold is operational complexity. Below 10M tokens daily, managing infrastructure feels like overkill. Above 100M daily, not managing it feels like leaving money on the table.

I’m at 5-7M tokens daily. The API is cheap enough that I never think about the bill, and the operational simplicity — no servers, no scaling decisions, no downtime — is worth the cost. But I track the number.

The calculator I shared is the same one I check every Monday. I don’t watch it obsessively. I just want to know if something changed — if cache hit rates dropped, if off-peak scheduling stopped working.

DeepSeek V4’s pricing feels stable right now. Predictable enough that I can budget three months out without worrying about surprise bills. That steadiness matters more than the absolute number.