DeepSeek V4 takes the top spot on reasoning benchmarks at one-eighth the training cost of GPT-5

The 671-billion-parameter mixture-of-experts model from a quiet Hangzhou lab scored 94.3 on MMLU-Pro and outperformed OpenAI's o3 on math tasks — yet DeepSeek spent only $8.2 million on the final training run, according to internal documents reviewed by DailyTechWire.

Nguyen Hoang Anh

Senior AI correspondent · Singapore

Jiang Kai

From Hangzhou

Published 14:08 ICT · May 25, 2026 · Updated 15:42 ICT · 10 min read

DeepSeek founder Liang Wenfeng demonstrates the V4 model at a closed event in Hangzhou, May 24, 2026. PHOTO · DTW / Jiang Kai

On a humid evening in Hangzhou, a 39-year-old former quant trader named Liang Wenfeng walked onto a small stage in front of perhaps two hundred people. There was no global livestream, no press kit handed to Reuters, no hashtag prepared by a marketing team. He simply opened a laptop, ran a benchmark suite live, and watched a score appear on the projector behind him: 94.3 on MMLU-Pro, the highest any publicly disclosed model has reached. He nodded once. The room exhaled.

That moment, on the night of May 24, marked the public arrival of DeepSeek V4 — and the most consequential cost-curve disruption in frontier AI since the release of Llama 3. According to internal DeepSeek documents reviewed by DailyTechWire, the lab spent $8.2 million on the final training run: roughly one-eighth what OpenAI is believed to have spent on GPT-5, and less than 6% of the figure attributed to Anthropic's Claude Opus 5.

◆ Key numbers · DeepSeek V4

671B · total parameters · 37B active per token

94.3 · MMLU-Pro · #1 of 52 tracked models

$8.2M · final training run · 1/8 cost of GPT-5

14.8T · tokens of pre-training data

$0.27 · per million output tokens · API pricing

2,048 · H800 GPUs used · 67 days

A different bet on architecture

V4 doubles down on the mixture-of-experts design DeepSeek pioneered with V3, but with three changes that, taken together, account for most of the efficiency gain. First, the routing network was rewritten to use what the team calls "auxiliary-loss-free balancing" — keeping experts evenly used without the usual stability penalty. Second, multi-token prediction is now native: every forward pass produces a candidate sequence two tokens deep, not one. Third, FP8 training was extended end-to-end, including the optimizer state.

The result, internal benchmarks show, is a model that activates only 37 billion of its 671 billion parameters per token while matching or exceeding much larger dense models on reasoning. On the GSM8K math benchmark, V4 reached 96.1% — a hair above Claude Opus 5 and four points above GPT-5. On the more punishing AIME 2025 problems, the gap widened: 88.4% to o3's 81.7%.
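
The article does not spell out how V4's "auxiliary-loss-free balancing" works. A minimal sketch of one way such a scheme can operate, assuming a per-expert bias that influences only which experts are selected and is nudged after every batch according to observed load, might look like the following. The function names, the skewed score distribution, the step size, and the expert counts are all illustrative assumptions, not DeepSeek's implementation.

```python
import random


def route_top_k(scores, bias, k):
    """Pick the k experts with the highest (score + bias).

    The bias only changes which experts are selected; it is not a loss term.
    """
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:k]


def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts, raise it for underloaded ones."""
    mean_load = sum(load) / len(load)
    return [
        b - step_size if n > mean_load else b + step_size if n < mean_load else b
        for b, n in zip(bias, load)
    ]


def simulate(num_experts=8, k=2, tokens_per_batch=256, batches=100, step_size=0.01, seed=0):
    rng = random.Random(seed)
    # Hypothetical skew: lower-indexed experts start with systematically higher scores.
    skew = [0.5 - 0.05 * e for e in range(num_experts)]
    bias = [0.0] * num_experts
    first_load = last_load = None
    for _batch in range(batches):
        load = [0] * num_experts
        for _token in range(tokens_per_batch):
            scores = [skew[e] + rng.gauss(0.0, 0.2) for e in range(num_experts)]
            for e in route_top_k(scores, bias, k):
                load[e] += 1
        if first_load is None:
            first_load = load
        last_load = load
        bias = update_bias(bias, load, step_size)
    return first_load, last_load


first, last = simulate()
print("first batch load:", first)
print("last batch load: ", last)
```

In this toy setup the gap between the busiest and the idlest expert should narrow over the run, without any extra term being added to a training loss.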

◆ MMLU-Pro · top 6 models

Source: DTW benchmark notebook · v2026.5

Model	Lab	Date (DD/MM)	MMLU-Pro
DeepSeek V4	DeepSeek	24/05	94.3
Claude Opus 5	Anthropic	12/05	92.8
GPT-5	OpenAI	03/04	91.2
Gemini 3 Ultra	Google	28/03	89.4
Qwen 3 Max	Alibaba	15/05	88.1
Llama 4 405B	Meta	18/04	85.6

52 models tracked · 8 benchmarks · refresh 06h

"The architectural choices are not revolutionary on their own," said Dr. Sarah Chen, a researcher at Stanford's Center for Research on Foundation Models, when reached by phone Sunday night. "What's revolutionary is the engineering discipline to make every layer of the stack — compiler, scheduler, attention kernel, dataloader — actually deliver on those choices. Most labs can write the paper. Very few can ship a 671-billion-parameter model trained for eight million dollars."

The cost-curve question Washington can no longer dodge

For the past 18 months, the policy assumption underwriting US export controls on advanced GPUs has been that frontier models require an ever-larger compute footprint, and that denying Chinese labs access to the latest hardware would slow them down by a year or more per generation. V4 is the third major signal — after DeepSeek V3 and Alibaba's Qwen 3 series — that this assumption may have already broken.

The lab claims V4 was trained on 2,048 H800 GPUs, NVIDIA's deliberately downgraded variant designed to comply with the October 2022 export rules. (The H800 has the same compute as the H100 but with bandwidth between chips throttled by more than half.) That is a fraction of the cluster size used by Anthropic or OpenAI. The training run took 67 days.
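
As a rough sanity check on those figures, the reported cluster size, run length, and cost can be turned into GPU-hours and an implied hourly rate. The sketch below assumes every GPU ran around the clock for the full 67 days and that the whole $8.2 million maps onto those GPU-hours; neither assumption is confirmed by the documents. The bandwidth figures are the ones given in the notes at the end of the article.

```python
GPUS = 2_048             # H800 GPUs, as reported
DAYS = 67                # length of the training run, as reported
COST_USD = 8_200_000     # final training run cost, as reported

# Assumption: every GPU is busy 24 hours a day for the whole run.
gpu_hours = GPUS * DAYS * 24
implied_rate = COST_USD / gpu_hours

# Chip-to-chip bandwidth figures from the article's notes (GB/s).
H800_BANDWIDTH = 400
H100_BANDWIDTH = 900
bandwidth_cut = 1 - H800_BANDWIDTH / H100_BANDWIDTH

print(f"GPU-hours: {gpu_hours:,}")
print(f"Implied cost per GPU-hour: ${implied_rate:.2f}")
print(f"Bandwidth reduction vs. H100: {bandwidth_cut:.0%}")
```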

1.2× FP8 throughput on H800 vs. the team's own V3 baseline, achieved through a rewritten cross-node all-to-all kernel.
Communication overhead reduced from 31% of step time to 9% via a custom expert-parallelism schedule.
14.8 trillion tokens of pre-training data, with what the team describes as "aggressive deduplication" and a heavy focus on synthetic math and code from V3.
An RL stage modeled on the lab's earlier R1 work, but extended to multi-turn agent trajectories using ~1.6 million curated tasks.
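
To get a feel for what the drop in communication share from 31% to 9% of step time could mean, here is a back-of-the-envelope sketch. It assumes the compute portion of each step takes the same wall-clock time before and after the change, which the article does not state; the function name is illustrative.

```python
def step_speedup(comm_share_before, comm_share_after):
    """Speedup of a training step when the share of step time spent waiting
    on communication drops and the compute time per step stays the same."""
    # compute_time = (1 - share) * step_time, held constant across both cases
    return (1.0 - comm_share_after) / (1.0 - comm_share_before)


speedup = step_speedup(0.31, 0.09)
print(f"Step-time speedup under this assumption: {speedup:.2f}x")
```

Under that assumption the scheduling change alone would be worth roughly a third more steps per hour, independent of the separate FP8 throughput gain.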

None of these moves are individually novel; some have been described in public papers from Anthropic, Microsoft Research, and the Tsinghua KEG group. What is new is the integration — and the willingness of DeepSeek's small team (it has fewer than 200 employees in total, only 80 of them on the research side) to ruthlessly cut anything that doesn't compound.

What the API tells us

The pricing DeepSeek announced is, if anything, more disruptive than the benchmark scores. V4 is being offered at $0.27 per million output tokens through the company's own API: a 98.2% discount to Claude Opus 5 ($15.00) and a 97.3% discount to GPT-5 ($10.00) at comparable output rates. Input pricing is even more aggressive at $0.07 per million.
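
The discounts follow directly from the list prices quoted above. A short sketch of the arithmetic, using only the per-million output-token prices from the article (the helper name is illustrative):

```python
def discount_pct(price, reference_price):
    """Percentage by which `price` undercuts `reference_price`."""
    return (1.0 - price / reference_price) * 100.0


V4_OUTPUT_PRICE = 0.27  # USD per million output tokens, as reported
REFERENCE_PRICES = {"Claude Opus 5": 15.00, "GPT-5": 10.00}

discounts = {
    name: discount_pct(V4_OUTPUT_PRICE, ref) for name, ref in REFERENCE_PRICES.items()
}
for name, pct in discounts.items():
    print(f"{name}: V4 is {pct:.1f}% cheaper per output token")
```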

Two veteran inference engineers — one at a US hyperscaler, one at a Singapore-based AI infrastructure startup — both told DTW the pricing implies serving margins north of 60% if the company can achieve sustained throughput above 8,000 tokens/sec per H100-class node. The lab's prior V3 deployment, they say, is already close to that benchmark.
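
The engineers' claim can be restated as a ceiling on serving cost. The sketch below is a simplification: it assumes every token served is billed at the output price, that the node sustains 8,000 tokens per second all hour, and that input-token revenue is ignored. None of these assumptions comes from the sources.

```python
TOKENS_PER_SEC = 8_000        # sustained throughput per node, as quoted
PRICE_PER_MILLION = 0.27      # USD per million output tokens, as reported
TARGET_MARGIN = 0.60          # serving margin cited by the engineers

tokens_per_hour = TOKENS_PER_SEC * 3600
revenue_per_node_hour = tokens_per_hour / 1_000_000 * PRICE_PER_MILLION

# For the margin to hold, all-in serving cost per node-hour must stay below this.
max_cost_per_node_hour = revenue_per_node_hour * (1 - TARGET_MARGIN)

print(f"Revenue per node-hour: ${revenue_per_node_hour:.3f}")
print(f"Cost ceiling for a {TARGET_MARGIN:.0%} margin: ${max_cost_per_node_hour:.3f}")
```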

$ curl https://api.deepseek.com/v4/chat/completions \
    -H "Authorization: Bearer $DEEPSEEK_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model":"deepseek-v4","messages":[{"role":"user","content":"Solve AIME 2025 #14"}]}'

# 1,124 ms first token · 312 tokens/sec sustained · cost <$0.0002

▸ Answer verified · matches solution key

"At this price point, the question for everyone building on the OpenAI API stops being theoretical," said Aakash Patel, CTO of a 40-person agentic-search startup in Bangalore that began porting workloads to V4 over the weekend. "If we don't migrate at least the cheap-to-validate workloads, we're effectively setting a 10× higher COGS than a competitor that does."

The unanswered safety question

What V4 does not yet have is a system card, a red-team report, or any peer-reviewed evaluation of dangerous-capability uplift. The lab has said a system card is "in preparation"; people familiar with the matter said DTW should expect the document within three weeks. The model was, however, evaluated internally on a Chinese-government-derived set of 1,800 prompts covering politically sensitive content, terrorism uplift, and CSAM resistance. DeepSeek says those results will be published in summary form only.

That gap matters. The EU is in the middle of finalising its amended AI Act compute thresholds; both Anthropic and OpenAI have argued, in public submissions, that frontier-model providers should be required to publish red-team results before release, not after. V4 will be the first major test of whether the new European regime treats a Chinese lab's voluntary disclosure as sufficient.

The line that comes next

What V4 is not — and DeepSeek has been careful to say so — is a multimodal model. The team has no public image-, audio-, or video-generation roadmap. That leaves an obvious gap relative to Gemini 3 Ultra and the rumoured GPT-5 Vision tier. Whether the lab can close it on the same cost discipline is the question that will dominate every fundraising deck in Asia for the next quarter.

For now, the leaderboard has been redrawn. The cost curve has been redrawn with it. And in Hangzhou, on a humid Sunday evening, one of the most quietly important benchmarks of the year passed by with no PR department, no preview embargo, and a slide deck that opened with the line: "We ship, then we explain."

Notes & sources

Cost figure of $8.2M reflects the final pre-training and post-training compute. It does not include salaries, prior research compute, or evaluation cost. [Source: DeepSeek internal budget document dated May 18, 2026, reviewed by DTW.]
OpenAI and Anthropic comparison figures are estimates from SemiAnalysis and three independent infrastructure analysts. Both companies declined to comment on training cost.
Benchmark scores are from the DTW benchmark notebook v2026.5, run on the released checkpoint between 02:00 and 09:30 UTC on May 25. CSV available to Pro subscribers.
The H800's bandwidth-per-chip limit is 400 GB/s vs. the H100's 900 GB/s. Compute (FP16 / FP8) is identical.

Nguyen Hoang Anh

Senior AI correspondent · Based in Singapore · Formerly at Reuters, Tech in Asia

Hoang Anh covers frontier model labs, alignment policy and the economics of inference across Asia and the US. She holds a master's in computer science from NUS and writes the AI Weekly newsletter every Wednesday.

Reader responses · 182

Sarah Chen · Researcher, Stanford CRFM · 14:51 ICT

One thing I'd add to the piece: the FP8 optimizer-state work is the part the field will be copying within a month. The MoE routing change is more situational — it works because DeepSeek's data mix is unusually code-heavy.

Jiang Kai · DTW · author · 15:02 ICT

Re: the cost figure — to be very clear, $8.2M is the final training run only. The team estimates that prior research compute (including failed runs) was roughly $4–6M more. Still ~$13M total, which is the number to compare against the $80M+ that frontier dense models reportedly cost.

Aakash Patel · CTO · Bangalore · 15:18 ICT

We migrated three workloads over the weekend. Honest take: the model is genuinely good at code, slightly weaker than Claude on nuance, but at this price the math is not even close.

Ryan Tan · Reader · Singapore · 15:35 ICT

Will be very curious to see the system card. The Chinese-internal eval suite is fine for what it is, but it doesn't substitute for an external red-team on biosec uplift.

DeepSeek V4

DeepSeek · Hangzhou · 671B MoE

MMLU-Pro	94.3
GSM8K	96.1
AIME 2025	88.4
SWE-bench	62.7
Output · $/1M	0.27
License	MIT
