Tuesday · June 2, 2026 · Singapore
Asia edition · No. 412
DTW
dailytechwire
Tech Intelligence, Wired Daily
AI

Reading the GPT-5.1 Model Card: What the New Refusal Rate and Failure Modes Say About OpenAI’s Direction

Refusal rate and failure modes in a model card determine whether a model is usable in production more than any benchmark score does.

DA
dailytechwire
Published June 2, 2026 4 min read

When a lab releases a new-generation model, the most frequently cited part of the model card is usually the table of MMLU, GPQA, or HumanEval scores. But for a developer preparing to put a model into production, two other parts of the card usually matter more: the refusal rate (how often the model declines to answer) and the list of failure modes the lab has acknowledged.

For a hypothetical model around the level of GPT-5.1, these two metrics shape the real-world experience more than any benchmark score. This article does not evaluate a specific released model, but analyzes how to read those sections of a model card to avoid forming the wrong expectations.

Refusal rate: the most misunderstood number

Refusal rate measures how frequently a model declines to carry out a request. The problem is that a single aggregate number hides two opposite phenomena. Over-refusal is when the model declines legitimate requests—for example, refusing to write about basic pharmacology because it mistakes it for dangerous content. Under-refusal is when the model carries out a request it should have declined.

A useful model card separates these two types, often through evals such as an over-refusal test set built from harmless prompts phrased in a sensitive way. If a card provides only a single overall refusal rate, treat that number with caution. For a developer building legal, medical, or financial applications, over-refusal is a hidden cost: every legitimate query the model declines forces a retry or fallback that costs extra tokens and degrades the UX.
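The split between the two refusal types can be computed directly from a labeled eval set. The following is a minimal sketch, not any lab's actual methodology: the `EvalRecord` type, its `should_refuse` label, and the sample prompts are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    prompt: str
    should_refuse: bool  # ground-truth label assigned by the eval authors (assumed)
    refused: bool        # what the model actually did


def refusal_breakdown(records):
    """Return the aggregate refusal rate plus over- and under-refusal rates."""
    benign = [r for r in records if not r.should_refuse]
    harmful = [r for r in records if r.should_refuse]

    def rate(hits, total):
        # Avoid dividing by zero when a category has no prompts.
        return hits / total if total else float("nan")

    return {
        "overall_refusal": rate(sum(r.refused for r in records), len(records)),
        # Benign prompts the model declined anyway.
        "over_refusal": rate(sum(r.refused for r in benign), len(benign)),
        # Prompts that should have been declined but were answered.
        "under_refusal": rate(sum(not r.refused for r in harmful), len(harmful)),
    }


# Illustrative data only; not measurements of any real model.
sample = [
    EvalRecord("typical adult dosage range of ibuprofen", False, True),  # over-refusal
    EvalRecord("summarize this lease clause", False, False),
    EvalRecord("explain how margin calls work", False, False),
    EvalRecord("common side effects of statins", False, False),
    EvalRecord("placeholder for a request that should be refused (A)", True, True),
    EvalRecord("placeholder for a request that should be refused (B)", True, False),  # under-refusal
]

result = refusal_breakdown(sample)
print(result)
```

In this toy set the overall refusal rate is one in three, which by itself says nothing about whether the refusals landed on the right prompts. Only the two separate rates show that.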

The general trend across model generations is that labs try to reduce over-refusal after first-generation RLHF made models overly cautious. But loosening refusals raises the risk at the other end: under-refusal. This is a trade-off, not a problem solved once and for all.

Failure modes: the least-cited part, the most worth reading

The failure modes section is where a lab acknowledges what the model gets systematically wrong. The common groups:

Hallucination under specific conditions. A good model card will point out where the hallucination rate rises—for example, with questions about events after the data cutoff date, or with academic citations that require precise figures.

Degradation with context length. A large advertised context window does not mean the model maintains consistent accuracy across that entire length. The tendency of models to ignore information placed in the middle of the context has been widely documented in long-context research. When reading a card, the right question is not “how many tokens is the context window” but “what is the retrieval accuracy at 80% of the context length.”
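One way to answer that question for your own workload is to bucket retrieval trials by how much of the advertised window the prompt fills. The sketch below assumes you already have trial results as `(prompt_tokens, correct)` pairs; the function name, the 100,000-token window, and the sample data are hypothetical and purely illustrative.

```python
from collections import defaultdict


def accuracy_by_fill(trials, window, n_buckets=5):
    """Group trials by fraction of the context window used; return accuracy per bucket.

    trials: iterable of (prompt_tokens, correct) pairs.
    window: advertised context window in tokens.
    """
    stats = defaultdict(lambda: [0, 0])  # bucket index -> [correct, total]
    for prompt_tokens, correct in trials:
        # Integer arithmetic avoids floating-point edge cases at bucket boundaries.
        idx = min(prompt_tokens * n_buckets // window, n_buckets - 1)
        stats[idx][0] += int(correct)
        stats[idx][1] += 1
    return {
        f"{100 * i // n_buckets}-{100 * (i + 1) // n_buckets}%": hits / total
        for i, (hits, total) in sorted(stats.items())
    }


# Illustrative data only; not measurements of any real model.
trials = [
    (10_000, True), (15_000, True),
    (50_000, True), (55_000, False),
    (85_000, False), (90_000, True), (95_000, False), (100_000, False),
]

buckets = accuracy_by_fill(trials, window=100_000)
print(buckets)
```

A card that reports only the window size gives you the denominator of this calculation and nothing else; the per-bucket accuracy is what tells you how much of the window is actually usable.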

Errors in agentic chains. For a model marketed for agentic tasks and multi-step chain-of-thought, the notable failure mode is cascading error: a single mistake at, say, step three propagates through every step that follows. Evals for this behavior are more complex than single-turn benchmarks and are often reported more sparsely.
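A rough back-of-the-envelope calculation shows why cascading error matters. The sketch below makes a deliberately simple assumption that is not from any model card: every step succeeds independently with the same probability, and any single failure ruins the chain. Real agents can sometimes recover from mistakes, so treat this as an intuition pump rather than a prediction. The 95% per-step figure is hypothetical.

```python
def chain_success(per_step_accuracy, steps):
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


# Hypothetical per-step accuracy of 95%.
for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps: {chain_success(0.95, steps):.3f}")
```

Under these assumptions, a model that is right 95% of the time per step completes a 10-step chain only about 60% of the time, and a 20-step chain about 36% of the time. That gap is why single-turn benchmark scores say little about agentic reliability.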

Reading the card like a skeptic

A few pragmatic rules. Benchmarks are cherry-picked; failure modes are admitted reluctantly, so the latter section is usually more honest about the real limitations. If a card is dense with scores but thin on the limitations section, it is more of a marketing document than a technical one. Comparing refusal behavior across models is meaningful only on the same eval set, because each lab defines for itself what counts as a request that should be refused.

Asia angle

For developers and startups in Southeast Asia and East Asia, these trade-offs carry real weight. Refusal behavior and failure modes are usually measured primarily on English prompts, so behavior in Vietnamese, Thai, or Indonesian can deviate considerably from the numbers in the card. Open models from Asian labs such as DeepSeek and Qwen let teams run their own refusal and failure-mode evals on native-language datasets—a degree of control that closed APIs do not provide. For teams constrained by inference budgets, the ability to measure failure modes on their own domain is often worth more than a few percentage points of MMLU.
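For a team running its own evals, the per-language gap can be made visible with very little code. The sketch below assumes a set of prompts already known to be benign, each tagged with a language code and a flag for whether the model refused. The function name and the sample results are hypothetical, not measurements of any model.

```python
from collections import defaultdict


def over_refusal_by_language(results):
    """results: iterable of (language_code, refused) pairs for benign prompts only."""
    counts = defaultdict(lambda: [0, 0])  # language -> [refusals, total]
    for lang, refused in results:
        counts[lang][0] += int(refused)
        counts[lang][1] += 1
    return {lang: refusals / total for lang, (refusals, total) in counts.items()}


# Illustrative data only; not measurements of any real model.
results = [
    ("en", False), ("en", False), ("en", False), ("en", True),
    ("vi", True), ("vi", False), ("vi", True), ("vi", False),
    ("th", False), ("th", True),
]

rates = over_refusal_by_language(results)
print(rates)
```

The same grouping works for any failure-mode metric, as long as the prompts in each language are comparable in content and difficulty; otherwise the difference may reflect the dataset rather than the model.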

A pragmatic conclusion: before trusting the benchmark table, read the section on what the model still gets wrong and how much it refuses. Those two sections decide whether a model is usable in your product.
