DeepSeek V4 takes the top spot on reasoning benchmarks at one-eighth the training cost of GPT-5
The 671-billion-parameter mixture-of-experts model from a quiet Hangzhou lab scored 94.3 on MMLU-Pro and outperformed OpenAI's o3 on math tasks — yet DeepSeek spent only $8.2 million on the final training run, according to internal documents reviewed by DailyTechWire.
On a humid evening in Hangzhou, a 39-year-old former quant trader named Liang Wenfeng walked onto a small stage in front of perhaps two hundred people. There was no global livestream, no press kit handed to Reuters, no hashtag prepared by a marketing team. He simply opened a laptop, ran a benchmark suite live, and watched a score appear on the projector behind him: 94.3 on MMLU-Pro, the highest any publicly disclosed model has reached. He nodded once. The room exhaled.
That moment, on the night of May 24, marked the public arrival of DeepSeek V4 — and the most consequential cost-curve disruption in frontier AI since the release of Llama 3. According to internal DeepSeek documents reviewed by DailyTechWire, the lab spent $8.2 million on the final training run: roughly one-eighth what OpenAI is believed to have spent on GPT-5, and less than 6% of the figure attributed to Anthropic's Claude Opus 5.
A different bet on architecture
V4 doubles down on the mixture-of-experts design DeepSeek pioneered with V3, but with three changes that, taken together, account for most of the efficiency gain. First, the routing network was rewritten to use what the team calls "auxiliary-loss-free balancing" — keeping experts evenly used without the usual stability penalty. Second, multi-token prediction is now native: every forward pass produces a candidate sequence two tokens deep, not one. Third, FP8 training was extended end-to-end, including the optimizer state.
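The article does not spell out how "auxiliary-loss-free balancing" works in V4. As a rough illustration only, the sketch below follows the bias-based scheme publicly described for DeepSeek V3: each expert carries a bias that is added to its routing score for top-k selection only, and the bias is nudged down when the expert is overloaded and up when it is under-used. Whether V4 uses this exact rule is an assumption, and the skew, step size, expert count, and function names are all hypothetical.

```python
import random


def route_top_k(scores, bias, k):
    """Return the k experts with the highest biased score.

    The bias only influences which experts are selected; a real router would
    still weight expert outputs by the unbiased scores.
    """
    order = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    return order[:k]


def update_bias(bias, load, step_size):
    """Lower the bias of overloaded experts and raise it for under-loaded ones."""
    mean_load = sum(load) / len(load)
    updated = []
    for b, n in zip(bias, load):
        if n > mean_load:
            updated.append(b - step_size)
        elif n < mean_load:
            updated.append(b + step_size)
        else:
            updated.append(b)
    return updated


def imbalance(load):
    """Max load over mean load; 1.0 means perfectly even usage."""
    return max(load) / (sum(load) / len(load))


def simulate(num_experts=8, k=2, tokens_per_step=512, steps=200, step_size=0.01, seed=0):
    rng = random.Random(seed)
    # Hypothetical skew: higher-numbered experts start out systematically favoured.
    skew = [0.1 * e for e in range(num_experts)]
    bias = [0.0] * num_experts
    history = []
    for _ in range(steps):
        load = [0] * num_experts
        for _ in range(tokens_per_step):
            scores = [rng.random() + skew[e] for e in range(num_experts)]
            for e in route_top_k(scores, bias, k):
                load[e] += 1
        history.append(imbalance(load))
        bias = update_bias(bias, load, step_size)
    return history


history = simulate()
first_imbalance, last_imbalance = history[0], history[-1]
print(f"imbalance at step 1: {first_imbalance:.2f}, at step {len(history)}: {last_imbalance:.2f}")
```

In this toy setup the load evens out without any extra loss term in the training objective, which is the property the "auxiliary-loss-free" label refers to.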
The result, internal benchmarks show, is a model that activates only 37 billion of its 671 billion parameters per token while matching or exceeding much larger dense models on reasoning. On the GSM8K math benchmark, V4 reached 96.1% — a hair above Claude Opus 5 and four points above GPT-5. On the more punishing AIME 2025 problems, the gap widened: 88.4% to o3's 81.7%.
[Chart: MMLU-Pro, top 6 models. Source: DTW benchmark notebook, v2026.5]
"The architectural choices are not revolutionary on their own," said Dr. Sarah Chen, a researcher at Stanford's Center for Research on Foundation Models, when reached by phone Sunday night. "What's revolutionary is the engineering discipline to make every layer of the stack (compiler, scheduler, attention kernel, dataloader) actually deliver on those choices. Most labs can write the paper. Very few can ship a 671-billion-parameter model trained for eight million dollars."
The cost-curve question Washington can no longer dodge
For the past 18 months, the policy assumption underwriting US export controls on advanced GPUs has been that frontier models require an ever-larger compute footprint, and that denying Chinese labs access to the latest hardware would slow them down by a year or more per generation. V4 is the third major signal — after DeepSeek V3 and Alibaba's Qwen 3 series — that this assumption may have already broken.
The lab claims V4 was trained on 2,048 H800 GPUs, NVIDIA's deliberately downgraded variant designed to comply with the October 2022 export rules. (The H800 has the same compute as the H100, but its chip-to-chip bandwidth is cut by more than half.) That is a fraction of the cluster size used by Anthropic or OpenAI. The training run took 67 days.
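The cluster size and run length can be combined with the reported $8.2 million figure for a rough sanity check. The sketch below assumes, purely for illustration, that every one of the 2,048 GPUs ran for the full 67 days and that the cost figure covers exactly that compute; neither assumption is confirmed by the documents.

```python
# Figures as reported in the article.
GPUS = 2048
RUN_DAYS = 67
FINAL_RUN_COST_USD = 8.2e6

# Assumption: all GPUs busy for the whole run, 24 hours a day.
gpu_hours = GPUS * RUN_DAYS * 24

# Implied blended price per GPU-hour under that assumption.
implied_usd_per_gpu_hour = FINAL_RUN_COST_USD / gpu_hours

print(f"{gpu_hours:,} GPU-hours")                     # 3,293,184 GPU-hours
print(f"${implied_usd_per_gpu_hour:.2f} per GPU-hour")  # $2.49 per GPU-hour
```

An implied rate of roughly $2.49 per GPU-hour is the number to weigh against whatever rental or amortised cost one considers realistic for H800-class hardware.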
- 1.2× FP8 throughput on H800 vs. the team's own V3 baseline, achieved through a rewritten cross-node all-to-all kernel.
- Communication overhead reduced from 31% of step time to 9% via a custom expert-parallelism schedule.
- 14.8 trillion tokens of pre-training data, with what the team describes as "aggressive deduplication" and a heavy focus on synthetic math and code from V3.
- An RL stage modeled on the lab's earlier R1 work, but extended to multi-turn agent trajectories using ~1.6 million curated tasks.
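To put the communication figure in the list above into perspective, the sketch below estimates the step-time speedup it would imply. It assumes the compute portion of each step stays the same and only the share of step time spent on communication shrinks from 31% to 9%; the function name is hypothetical and the real schedule may behave differently.

```python
def speedup_from_comm_reduction(old_comm_frac, new_comm_frac):
    """Step-time speedup when only the communication share of a step shrinks.

    Assumes compute time per step is unchanged, with the old step time
    normalised to 1.0.
    """
    compute_time = 1.0 - old_comm_frac
    # In the new step, compute makes up (1 - new_comm_frac) of the total.
    new_step_time = compute_time / (1.0 - new_comm_frac)
    return 1.0 / new_step_time


speedup = speedup_from_comm_reduction(0.31, 0.09)
print(f"{speedup:.2f}x faster per step")  # 1.32x faster per step
```

Under these assumptions the scheduling change alone would account for about a 1.32× gain in step throughput, separate from the FP8 kernel improvement.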
None of these moves are individually novel; some have been described in public papers from Anthropic, Microsoft Research, and the Tsinghua KEG group. What is new is the integration — and the willingness of DeepSeek's small team (it has fewer than 200 employees in total, only 80 of them on the research side) to ruthlessly cut anything that doesn't compound.
What the API tells us
The pricing DeepSeek announced is, if anything, more disruptive than the benchmark scores. V4 is being offered at $0.27 per million output tokens through the company's own API, a 98.2% discount to Claude Opus 5 ($15.00) and a 97.3% discount to GPT-5 ($10.00) at comparable output rates. Input pricing is even more aggressive at $0.07 per million.
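The discounts follow directly from the quoted list prices. A quick check, assuming the three figures are like-for-like prices per million output tokens (the helper name is hypothetical):

```python
def discount_pct(price, reference_price):
    """Percentage discount of `price` relative to `reference_price`."""
    return (1 - price / reference_price) * 100


V4_OUTPUT_USD = 0.27  # per million output tokens, as quoted
REFERENCE_USD = {"Claude Opus 5": 15.00, "GPT-5": 10.00}

discounts = {name: discount_pct(V4_OUTPUT_USD, ref) for name, ref in REFERENCE_USD.items()}
for name, pct in discounts.items():
    print(f"{name}: {pct:.1f}% discount")
# Claude Opus 5: 98.2% discount
# GPT-5: 97.3% discount
```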
Two veteran inference engineers, one at a US hyperscaler and one at a Singapore-based AI infrastructure startup, told DTW the pricing implies serving margins north of 60% if the company can sustain throughput above 8,000 tokens/sec per H100-class node. The lab's prior V3 deployment, they say, is already close to that threshold.
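The engineers' claim can be turned into a break-even cost per node. The sketch below is a back-of-the-envelope reading, not their model: it assumes a node runs at exactly 8,000 tokens/sec around the clock and that every token is billed at the $0.27 output rate, ignoring the cheaper input tokens.

```python
TOKENS_PER_SEC = 8_000
OUTPUT_USD_PER_MILLION = 0.27
TARGET_MARGIN = 0.60

# Revenue one node earns in an hour at full utilisation.
tokens_per_hour = TOKENS_PER_SEC * 3600
revenue_per_node_hour = tokens_per_hour / 1_000_000 * OUTPUT_USD_PER_MILLION

# Highest all-in serving cost per node-hour that still leaves the target margin.
max_cost_per_node_hour = revenue_per_node_hour * (1 - TARGET_MARGIN)

print(f"revenue: ${revenue_per_node_hour:.2f} per node-hour")            # $7.78
print(f"cost ceiling at 60% margin: ${max_cost_per_node_hour:.2f}")     # $3.11
```

On these assumptions a 60% margin requires the all-in cost of serving to stay at or below about $3.11 per node-hour, which shows how sensitive the claim is to utilisation and hardware cost.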
"At this price point, the question for everyone building on the OpenAI API stops being theoretical," said Aakash Patel, CTO of a 40-person agentic-search startup in Bangalore that began porting workloads to V4 over the weekend. "If we don't migrate at least the cheap-to-validate workloads, we're effectively setting a 10× higher COGS than a competitor that does."
The unanswered safety question
What V4 does not yet have is a system card, a red-team report, or any peer-reviewed evaluation of dangerous-capability uplift. The lab has said one is "in preparation"; people familiar with the matter said DTW should expect the document within three weeks. The model was, however, evaluated internally on a Chinese-government-derived set of 1,800 prompts covering politically sensitive content, terrorism uplift, and CSAM resistance — results which DeepSeek says will be published in summary form only.
That gap matters. The EU is in the middle of finalising its amended AI Act compute thresholds; both Anthropic and OpenAI have argued, in public submissions, that frontier-model providers should be required to publish red-team results before release, not after. V4 will be the first major test of whether the new European regime treats a Chinese lab's voluntary disclosure as sufficient.
The line that comes next
What V4 is not — and DeepSeek has been careful to say so — is a multimodal model. The team has no public image-, audio-, or video-generation roadmap. That leaves an obvious gap relative to Gemini 3 Ultra and the rumoured GPT-5 Vision tier. Whether the lab can close it on the same cost discipline is the question that will dominate every fundraising deck in Asia for the next quarter.
For now, the leaderboard has been redrawn. The cost curve has been redrawn with it. And in Hangzhou, on a humid evening, one of the most quietly important benchmarks of the year passed by with no PR department, no preview embargo, and a slide deck that opened with the line: "We ship, then we explain."
Notes & sources
- Cost figure of $8.2M reflects the final pre-training and post-training compute. It does not include salaries, prior research compute, or evaluation cost. [DTW reviewed DeepSeek internal budget document, dated May 18, 2026]
- OpenAI and Anthropic comparison figures are estimates from SemiAnalysis and three independent infrastructure analysts. Both companies declined to comment on training cost.
- Benchmark scores are from the DTW benchmark notebook v2026.5, run on the released checkpoint between 02:00 and 09:30 UTC on May 25. CSV available to Pro subscribers.
- The H800's bandwidth-per-chip limit is 400 GB/s vs. the H100's 900 GB/s. Compute (FP16 / FP8) is identical.
Hoang Anh covers frontier model labs, alignment policy and the economics of inference across Asia and the US. She holds a master's in computer science from NUS and writes the AI Weekly newsletter every Wednesday.
Reader responses · 182
Re: the cost figure. To be very clear, $8.2M is the final training run only. The team estimates that prior research compute (including failed runs) was roughly $4–6M more. Still ~$13M total, which is the number to compare against the $80M+ that frontier dense models reportedly cost.