Prediction Markets: Honest Measure of AI Progress
MMLU is saturated above 90% for all frontier models. LMArena scores were inflated by up to 100 points through selective submission. Meta admitted it “cheated a little bit”. AI benchmarks have become marketing tools, not measurement tools. The only evaluation mechanism that cannot be gamed by the entity being measured is one where the evaluators have real money at stake. That mechanism is prediction markets.
Key Takeaways
- MMLU scores are now above 91% for all frontier models. GSM8K above 94%. SuperGLUE saturated within months of release. When every serious model clusters within a few percentage points of the ceiling, the benchmark no longer differentiates them. A vendor citing MMLU in 2026 is citing a number with no decision-relevant information.
- Researchers found selective model submissions to LMArena inflated scores by up to 100 points through cherry-picking. Meta privately tested 27 model variants before its Llama 4 release and published only the best results. This is Goodhart’s Law made concrete: when a measure becomes a target, it ceases to be a good measure.
- Experts in the Longitudinal Expert AI Panel (LEAP Wave 4) predicted SOTA accuracy on LiveCodeBench Pro would reach 14% by 2026. OpenAI’s GPT-5.2 achieved 33% shortly after the survey closed. Expert forecasters systematically underestimated AI progress - even as they complained benchmarks were too easy.
- Prediction markets cannot be gamed by the entity being measured because the market price is set by independent, financially incentivized participants with no stake in producing a favorable result. A benchmark designed and reported by a lab that is also evaluated by the benchmark is structurally compromised. A prediction market is not.
- The practical DuelDuck application: AI capability milestone duels (model releases, benchmark scores, deployment metrics) generate genuine information asymmetry. Developer and researcher communities have domain expertise that general prediction market participants lack. That information advantage, plus creator fee income, is the reward for the expertise.
The Measurement Crisis in AI
In early 2026, you can walk into a pitch meeting for any AI company and hear the same benchmark recital: MMLU 91%, GSM8K 94%, HumanEval 88%. The numbers are real. What they measure is not.
MMLU scores above 90% are now shared by virtually all frontier models. GSM8K is above 94%. SuperGLUE was saturated almost immediately upon release. When the spread between the best and worst model is smaller than the noise in the measurement, the benchmark has stopped differentiating. It is no longer a measurement tool. It is a marketing tool.
The gaming problem compounds the saturation problem. Researchers analyzing 2.8 million model comparison records from LMArena found that selective model submissions inflated scores by up to 100 points through cherry-picking. Major labs ran private tests, submitted only their best variants, and turned evaluation into a competition to optimize the leaderboard rather than to measure genuine capability. Meta acknowledged it “cheated a little bit” when testing Llama 4.
This is not a technical failure. It is Goodhart’s Law at scale: when a measure becomes a target, it ceases to be a good measure. The moment AI labs understood that MMLU scores drove valuations, funding decisions, and competitive positioning, MMLU scores stopped measuring AI capability and started measuring optimization effort directed at MMLU.
How AI Benchmarks Fail - The Four Structural Modes
Mode 1: Saturation
Saturation occurs when models achieve scores so close to the maximum that differences between them become statistically meaningless. When GPT-5.3, Claude Opus 4.6, and Gemini 3.1 all score 88–93% on MMLU, the practical difference between them on a real-world knowledge task cannot be read from the benchmark. The score range has compressed to the point where noise exceeds signal.
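How small is too small a gap? Here is a minimal sketch of the arithmetic, assuming a simple binomial error model and an illustrative 1,000-question eval subset (real benchmarks add prompt-format and decoding variance on top of this):

```python
import math

def score_ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """95% confidence half-width for accuracy p measured on n questions,
    assuming independent questions (a binomial error model)."""
    return z * math.sqrt(p * (1 - p) / n)

def min_detectable_gap(p: float, n: int, z: float = 1.96) -> float:
    """Approximate smallest gap distinguishable from sampling noise when
    comparing two models near accuracy p on the same n questions."""
    return z * math.sqrt(2 * p * (1 - p) / n)

p, n = 0.90, 1_000  # illustrative: a 1,000-question subset at ~90% accuracy
print(f"CI half-width:      +/-{score_ci_halfwidth(p, n):.1%}")  # ~1.9%
print(f"min detectable gap:   {min_detectable_gap(p, n):.1%}")   # ~2.6%
# Two models at 91% and 93% on this subset are inside sampling noise --
# before counting prompt-format or decoding variance at all.
```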
The community’s response to saturation is to create harder benchmarks. MMLU-Pro was designed to address MMLU saturation by adding more complex reasoning questions. As of early 2026, frontier LLMs are approaching 90% on MMLU-Pro - suggesting it may face the same saturation dynamic that motivated its creation. The pattern is recursive: every benchmark created to replace a saturated benchmark eventually saturates itself.
Mode 2: Data Contamination
Contamination means the benchmark's questions were present in the model's training data. StarCoder-7b scored 4.9x higher on leaked versus clean data. When GPT-4 was tested with masked MMLU questions, it correctly inferred the missing answers 57% of the time - far exceeding random chance. The model was not demonstrating reasoning; it was demonstrating memory.
Contamination is hard to detect because most training datasets are not fully disclosed. When a model achieves an unusually high score on an older benchmark, contamination should be the first hypothesis, not the last. But the labs have no incentive to investigate their own contamination - and no external mechanism forces them to.
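For intuition, here is what a masked-answer contamination probe looks like in code. This is a sketch in the spirit of the masked-MMLU test described above, not the study's actual protocol; the `query_model` callable and the question schema (`question`, `choices` fields) are hypothetical:

```python
import random

def contamination_probe(questions: list, query_model, trials: int = 200) -> float:
    """Hide one multiple-choice option and ask the model to reproduce it
    verbatim. Exact-match rates far above chance suggest memorization of
    the benchmark items, not reasoning. `query_model(prompt) -> str` is a
    hypothetical model-API call; the question schema is assumed."""
    hits = 0
    sample = random.sample(questions, min(trials, len(questions)))
    for q in sample:
        hidden_idx = random.randrange(len(q["choices"]))
        hidden = q["choices"][hidden_idx]
        shown = [c for i, c in enumerate(q["choices"]) if i != hidden_idx]
        prompt = (
            f"Question: {q['question']}\n"
            f"Known options: {shown}\n"
            "One option has been removed. Reply with the missing option, verbatim."
        )
        if query_model(prompt).strip().lower() == hidden.strip().lower():
            hits += 1
    # Chance level is near zero for free-form text, so even modest
    # exact-match rates are a red flag.
    return hits / len(sample)
```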
Mode 3: Cherry-Picking
The mechanism: large labs can privately test many model variants, publish only the best result, and retract scores that don’t look favorable. Sara Hooker, Head of Cohere Labs and co-author of the LMArena critique, wrote: “It is critical for scientific integrity that we trust our measure of progress. The Chatbot Arena has become the go-to evaluation for frontier AI systems... We show that coordination among a handful of providers and preferential policies have led to distorted Arena rankings.”
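The inflation from best-of-k submission is straightforward to quantify. A toy simulation with illustrative numbers (not LMArena's actual rating model): if each privately tested variant's measured score is true skill plus rating noise, publishing only the best of k runs inflates the reported number by roughly the expected maximum of k noise draws:

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_k_inflation(k: int, noise_sd: float, trials: int = 100_000) -> float:
    """Expected score boost from privately testing k variants with i.i.d.
    rating noise and publishing only the best, with true skill fixed at 0."""
    scores = rng.normal(0.0, noise_sd, size=(trials, k))
    return scores.max(axis=1).mean()

# Illustrative ~40-point per-submission noise on an Elo-like scale.
for k in (1, 5, 10, 27):  # 27 = the Llama 4 variants Meta tested privately
    print(f"k={k:>2}: +{best_of_k_inflation(k, 40):.0f} points")
# k= 1: +0,  k= 5: ~+46,  k=10: ~+61,  k=27: ~+80 -- leaderboard gains
# with zero capability change, from selective reporting alone.
```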
Mode 4: Metric-Capability Gaps
High scores on coding benchmarks can mask serious quality problems in AI-generated code - such as 4x bug rates - revealing what leaderboards don’t measure: reliability under real constraints. A model that achieves 88% on HumanEval may fail consistently in production workflows. The benchmark measures the ability to pattern-match known problem formats - not the ability to write software that works.
| Benchmark | Frontier Score (2026) | Status | What It Still Measures |
|---|---|---|---|
| MMLU | >91% for all top models | Saturated; no differentiation | Smaller/mid-tier model comparison |
| GSM8K | >94% for all top models | Saturated + contamination risk | Eliminated as signal for frontier models |
| MMLU-Pro | 88–90% for top models | Approaching saturation | Meaningful for 60–90% range models |
| GPQA-Diamond | 81–94% range | Not yet saturated; best current discriminator | Expert-level reasoning at frontier tier |
| SWE-bench Verified | Varies; not saturated | Best production proxy available | Real software engineering tasks |
| LMArena Chatbot Arena | n/a (Elo-style ranking) | Gaming-compromised; up to 100pt inflation from cherry-picking | Directional signal only; treat with skepticism |
The Expert Forecaster Problem - Even Humans Fail
The benchmark crisis would be less significant if expert human forecasting of AI progress were reliable. It is not - and the failure mode is systematic in both directions.
Underestimation: Experts Miss the Pace
The Longitudinal Expert AI Panel (LEAP Wave 4) surveyed 253 experts, 58 superforecasters, and 810 members of the public in November–December 2025. The median expert predicted SOTA accuracy on LiveCodeBench Pro (Hard) would reach 14% in 2026. Shortly after the survey closed, OpenAI’s GPT-5.2 achieved 33% - more than double the median expert prediction.
AI 2027’s grading of its own 2025 predictions found that aggregate AI progress was running at 65% of the pace predicted, with qualitative milestones broadly on track but quantitative benchmarks missed in both directions. AI forecasting at the frontier is genuinely hard - both for the experts writing the reports and for the prediction markets pricing the outcomes.
The Asymmetry Between Forecaster Types
ForecastBench, a dynamic benchmark evaluating ML systems on forecasting questions (ICLR 2025), found that state-of-the-art models - including Claude 3.5 Sonnet and GPT-4 Turbo - perform only roughly as well as a simple median of forecasts from humans with no forecasting experience. The models performed significantly worse than superforecasters.
The implication: neither AI models themselves nor general expert opinion are reliable forecasters of AI progress. What reliably outperforms both, across every empirical study of forecasting accuracy, is the mechanism that combines financial incentives with diverse information from many independent participants. That mechanism is prediction markets.
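A toy simulation (not ForecastBench data) shows why aggregation across many independent participants beats the typical individual: idiosyncratic errors cancel in the median, and the aggregate's Brier score approaches the floor set by genuine uncertainty:

```python
import numpy as np

rng = np.random.default_rng(1)

def brier(forecast: np.ndarray, outcome: np.ndarray) -> float:
    return float(np.mean((forecast - outcome) ** 2))

n_q, n_people = 500, 50
p_true = rng.uniform(0.05, 0.95, n_q)           # true event probabilities
outcomes = (rng.random(n_q) < p_true).astype(float)

# Each forecaster sees the truth through independent noise.
forecasts = np.clip(p_true + rng.normal(0, 0.15, (n_people, n_q)), 0.01, 0.99)

avg_individual = np.mean([brier(f, outcomes) for f in forecasts])
aggregate = brier(np.median(forecasts, axis=0), outcomes)
print(f"average individual Brier: {avg_individual:.3f}")  # higher = worse
print(f"median-aggregate Brier:   {aggregate:.3f}")       # reliably lower
```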
| Forecaster Type | AI Capability Forecast Accuracy | Key Limitation |
|---|---|---|
| AI models forecasting AI | Worse than superforecasters (ForecastBench) | Cannot forecast their own development trajectories |
| General public | Lowest accuracy; systematically underestimates | No domain knowledge; no skin in the game |
| Domain experts | Better, but systematic biases in both directions | Overconfident; reputational herding |
| Superforecasters | Best calibration among named groups | Small population; slow to update on new information |
| Prediction markets | Best continuous calibration at scale | Thin in AI-specific markets; improving rapidly |
Why Prediction Markets Cannot Be Gamed the Same Way
The structural argument for prediction markets as an AI measurement mechanism is not that they are perfect. It is that their failure modes are categorically different from benchmark failure modes - and importantly, they cannot be gamed by the entity being measured.
The Independence Property
A benchmark designed and reported by a lab that is simultaneously evaluated by that benchmark is structurally compromised. The lab has both the means and the incentive to optimize the measure rather than the underlying capability. This is not a flaw in the labs’ integrity - it is a structural consequence of who controls the evaluation.
A prediction market on an AI capability milestone is priced by independent participants who have no relationship to the lab being evaluated. They have financial incentives to be accurate, not to be favorable. If they price a capability milestone too generously, they lose money when the milestone is not achieved. If they price it too conservatively, they lose the opportunity to profit from correct prediction. The incentive structure is aligned with accuracy, not with the interests of the evaluated entity.
The Accountability Property
Benchmark scores do not expire. A 91% MMLU score from 2024 is still being cited in 2026 sales decks with no discount for age, contamination risk, or cherry-picking. Prediction markets expire on resolution. A mispriced prediction market is corrected at resolution: the wrong side loses capital, and the correct side profits. This creates a continuous accountability mechanism that benchmarks lack.
Metaculus, a forecasting platform, has achieved a 0.111 Brier score on its track record of AI-related predictions. This calibration exists because every question resolves, every forecast is scored, and every forecaster’s track record is public. The same transparency does not exist for labs’ benchmark claims.
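For readers unfamiliar with the metric: a Brier score is the mean squared error between probability forecasts and binary outcomes, where 0.0 is perfect and an uninformed constant forecast of 0.5 scores 0.25. A minimal implementation:

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probability forecasts and 0/1 outcomes.
    0.0 is perfect; always answering 0.5 scores 0.25."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Four resolved questions:
print(brier_score([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0]))  # 0.075
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 0.25 (uninformed)
```

Against that 0.25 uninformed baseline, 0.111 over a large set of resolved questions is evidence of genuine calibration.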
The Continuous Update Property
Benchmarks are published on the lab’s timeline. A prediction market updates continuously as new information arrives. When a leaked internal eval, a developer test, or a pre-release paper circulates, prediction market prices reflect that information within minutes. Benchmark reports can take months to update.
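Mechanically, the continuous-update property is easy to see in an automated market maker. Many prediction markets use a design like Hanson's logarithmic market scoring rule (LMSR), sketched below; this is a standard mechanism for illustration, not any specific platform's implementation. Every trade reprices the question instantly:

```python
import math

class LMSRMarket:
    """Hanson's logarithmic market scoring rule for a binary question.
    b controls liquidity: larger b means prices move less per trade."""

    def __init__(self, b: float = 100.0):
        self.b = b
        self.q_yes = 0.0  # net YES shares sold
        self.q_no = 0.0   # net NO shares sold

    def price_yes(self) -> float:
        """Current implied probability of YES."""
        ey = math.exp(self.q_yes / self.b)
        en = math.exp(self.q_no / self.b)
        return ey / (ey + en)

    def cost(self) -> float:
        return self.b * math.log(
            math.exp(self.q_yes / self.b) + math.exp(self.q_no / self.b)
        )

    def buy_yes(self, shares: float) -> float:
        """Buy YES shares; returns the cost. The price updates instantly."""
        before = self.cost()
        self.q_yes += shares
        return self.cost() - before

m = LMSRMarket(b=100)
print(f"{m.price_yes():.2f}")  # 0.50 before any information arrives
m.buy_yes(80)                  # e.g. a leaked eval makes traders buy YES
print(f"{m.price_yes():.2f}")  # ~0.69 -- the news is priced in immediately
```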
AI forecasters at ai2025.org found the forecaster aggregate was about right on benchmarks, underestimated revenue growth, and overestimated public salience. The areas where prediction markets add the most value are precisely the areas where the forecaster aggregate missed most: revenue (harder to game than benchmarks) and public adoption metrics (requires real-world evidence, not lab-controlled tests).
The Practical Prediction Market Signal for AI Progress
If prediction markets are the more honest measurement mechanism for AI progress, the relevant question is: which prediction markets are currently providing the most signal, and where are the gaps?
Where Prediction Markets Currently Have Good Coverage
| AI Event Category | Coverage on Polymarket/Kalshi | Signal Quality | Information Gap (DuelDuck Opportunity) |
|---|---|---|---|
| Company IPO/valuation milestones | High volume; well-covered | Good calibration; liquid markets | Developer/engineer community has earlier signal on company health |
| Major model release timelines | Moderate coverage | Reasonable but thin | Researchers tracking internal roadmap signals see mispricings before the market corrects |
| Named benchmark score thresholds | Low coverage; emerging | Thin markets; high noise | Developer communities tracking benchmark trajectories have a structural information advantage |
| AI agent capability milestones | Very low coverage | High uncertainty; wide spreads | Researchers running agentic evals have domain expertise most market participants lack |
| Enterprise AI adoption metrics | Near zero | Not yet established | Business analysts tracking deployment have strong forward signals |
The DuelDuck Opportunity in AI Prediction
The AI category is the highest-information-asymmetry prediction market domain in 2026. The gap between what domain experts know and what general prediction market participants price is wider in AI than in politics, sports, or crypto - because AI capability development is opaque, fast-moving, and requires genuine technical literacy to evaluate.
This creates a specific DuelDuck use case: community duels on AI capability milestones, designed by creators with domain expertise (developers, researchers, AI engineers), priced at the 50/50 opening ratio while the broader Polymarket consensus reflects the general public’s much less informed view.
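The value of that edge is simple expected-value arithmetic. A minimal sketch, assuming an even-money matched pool and an illustrative flat fee on winnings (DuelDuck's exact pool mechanics and fee schedule may differ):

```python
def duel_expected_value(p_informed: float, stake: float, fee: float = 0.10) -> float:
    """Expected profit from taking one side of a duel priced at 50/50,
    when your informed probability estimate for that side is p_informed.
    Assumes an even-money payout: a win returns the matched stake, with
    `fee` taken from winnings. All parameters are illustrative."""
    win_profit = stake * (1 - fee)  # matched stake, minus fee on winnings
    lose_loss = -stake
    return p_informed * win_profit + (1 - p_informed) * lose_loss

# A researcher who credibly believes a milestone is 70% likely,
# entering a pool the general market prices at 50/50:
print(f"{duel_expected_value(0.70, 100):+.2f}")  # +33.00 per 100 staked
```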
The Benchmark Replacement Argument
The strongest form of the prediction market argument for AI measurement is not just that prediction markets avoid benchmark gaming. It is that prediction markets can replace specific types of benchmark claims that are currently providing false signal.
What Prediction Markets Can Replace
Deployment-based milestones - Did X AI system reach Y daily active users? Did it achieve Z revenue by Q3? These are binary, verifiable, high-stakes questions that the entity being measured cannot control without the market knowing. Prediction markets on deployment metrics are harder to game than benchmarks because the verification is external (app store data, regulatory filings, third-party analytics).
Independent third-party evaluation results - Did a specific model achieve X on a named benchmark as tested by an independent lab (not the releasing company)? Prediction markets on third-party evaluations eliminate the cherry-picking problem because the lab cannot control which variants the evaluator tests.
Real-world task completion rates - Did an AI agent successfully complete X% of tasks in a named independent evaluation (METR, HELM, etc.)? These are harder to game than internal benchmarks because the evaluation methodology is not controlled by the lab.
What Prediction Markets Cannot Replace
Prediction markets cannot replace benchmarks as a tool for evaluating research-stage capabilities that have not yet been publicly deployed or publicly demonstrated. A market that prices an event which has not yet occurred is forecasting, not measuring. The quality of the forecast depends on the information available to market participants - and for pre-publication research capabilities, that information is thin.
The honest framework is: use prediction markets for deployment milestones and third-party evaluation results; use carefully designed internal benchmarks for research-stage capability assessment; and hold both accountable with the same standard of reproducibility and independence.
Conclusion: The Only Evaluation You Cannot Game Is One With Skin in the Game
The benchmark crisis in AI is not going to be solved by designing harder benchmarks. Every new benchmark created to replace a saturated one faces the same contamination, cherry-picking, and gaming vulnerabilities within months of release. The problem is structural: the entity being evaluated controls the evaluation, and the incentive to produce favorable results is enormous.
Prediction markets solve the structural problem by separating the entity being evaluated from the mechanism that prices the evaluation. Independent participants with financial stakes in accuracy cannot be paid to produce favorable results. They can be wrong - and they frequently are - but their errors are unsystematic, financially penalized, and self-correcting. This is categorically different from benchmarks whose errors are systematic, reputationally rewarded, and self-reinforcing.
Stanford HAI faculty predict that 2026 will mark the moment AI confronts its actual utility: the era of AI evangelism is giving way to an era of AI evaluation. That era needs honest evaluation infrastructure. Prediction markets are not a complete replacement for benchmarks. But they are the only evaluation mechanism that cannot be gamed by the entity it evaluates. For the specific class of AI progress claims that matter most - deployment milestones, adoption metrics, real-world performance - they are the most honest signal available.
Start Predicting. Start Earning.
DuelDuck - P2P prediction market on Solana. No vig. No KYC. USDC payouts. Create community duels on AI capability milestones where your domain expertise is the edge - and earn up to 10% creator fee on every pool you design.
Create your first duel today