Prediction Markets: Honest Measure of AI Progress
MMLU is saturated above 90% for all frontier models. LMArena scores were inflated by up to 100 points through selective submission. Meta admitted it “cheated a little bit”. AI benchmarks have become marketing tools, not measurement tools. The only evaluation mechanism that cannot be gamed by the entity being measured is one where the evaluators have real money at stake. That mechanism is prediction markets.
Key Takeaways
- MMLU scores are now above 91% for all frontier models. GSM8K above 94%. SuperGLUE saturated within months of release. When every serious model clusters within a few percentage points of the ceiling, the benchmark no longer differentiates them. A vendor citing MMLU in 2026 is citing a number with no decision-relevant information.
- Researchers found selective model submissions to LMArena inflated scores by up to 100 points through cherry-picking. Meta privately tested 27 model variants before its Llama 4 release and published only the best results. This is Goodhart’s Law made concrete: when a measure becomes a target, it ceases to be a good measure.
- Experts in the Longitudinal Expert AI Panel (LEAP Wave 4) predicted SOTA accuracy on LiveCodeBench Pro would reach 14% by 2026. OpenAI’s GPT-5.2 achieved 33% shortly after the survey closed. Expert forecasters systematically underestimated AI progress - even as they complained benchmarks were too easy.
- Prediction markets cannot be gamed by the entity being measured because the market price is set by independent, financially incentivized participants with no stake in producing a favorable result. A benchmark designed and reported by a lab that is also evaluated by the benchmark is structurally compromised. A prediction market is not.
- The practical DuelDuck application: AI capability milestone duels (model releases, benchmark scores, deployment metrics) generate genuine information asymmetry. Developer and researcher communities have domain expertise that general prediction market participants lack. That information advantage, plus creator fee income, is the reward for the expertise.
The Measurement Crisis in AI
In early 2026, you can walk into a pitch meeting for any AI company and hear the same benchmark recital: MMLU 91%, GSM8K 94%, HumanEval 88%. The numbers are real. What they measure is not.
MMLU scores above 90% are now shared by virtually all frontier models. GSM8K is above 94%. SuperGLUE was saturated almost immediately upon release. When the spread between the best and worst model is smaller than the noise in the measurement, the benchmark has stopped differentiating. It is no longer a measurement tool. It is a marketing tool.
The gaming problem compounds the saturation problem. Researchers analyzing 2.8 million model comparison records from LMArena found that selective model submissions inflated scores by up to 100 points through cherry-picking. Major labs ran private tests, submitted only their best variants, and turned evaluation into a competition to optimize the leaderboard rather than to measure genuine capability. Meta acknowledged it “cheated a little bit” when testing Llama 4.
This is not a technical failure. It is Goodhart’s Law at scale: when a measure becomes a target, it ceases to be a good measure. The moment AI labs understood that MMLU scores drove valuations, funding decisions, and competitive positioning, MMLU scores stopped measuring AI capability and started measuring optimization effort directed at MMLU.
How AI Benchmarks Fail - The Four Structural Modes
Mode 1: Saturation
Saturation occurs when models achieve scores so close to the maximum that differences between them become statistically meaningless. When GPT-5.3, Claude Opus 4.6, and Gemini 3.1 all score 88–93% on MMLU, the practical difference between them on a real-world knowledge task cannot be read from the benchmark. The score range has compressed to the point where noise exceeds signal.
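How small is too small a gap? Here is a minimal sketch of the arithmetic, assuming a simple binomial error model and an illustrative 1,000-question eval subset (real benchmarks add prompt-format and decoding variance on top of this):

```python
import math

def score_ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """95% confidence half-width for accuracy p measured on n questions,
    assuming independent questions (a binomial error model)."""
    return z * math.sqrt(p * (1 - p) / n)

def min_detectable_gap(p: float, n: int, z: float = 1.96) -> float:
    """Approximate smallest gap distinguishable from sampling noise when
    comparing two models near accuracy p on the same n questions."""
    return z * math.sqrt(2 * p * (1 - p) / n)

p, n = 0.90, 1_000  # illustrative: a 1,000-question subset at ~90% accuracy
print(f"CI half-width:      +/-{score_ci_halfwidth(p, n):.1%}")  # ~1.9%
print(f"min detectable gap:   {min_detectable_gap(p, n):.1%}")   # ~2.6%
# Two models at 91% and 93% on this subset are inside sampling noise --
# before counting prompt-format or decoding variance at all.
```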
The community’s response to saturation is to create harder benchmarks. MMLU-Pro was designed to address MMLU saturation by adding more complex reasoning questions. As of early 2026, frontier LLMs are approaching 90% on MMLU-Pro - suggesting it may face the same saturation dynamic that motivated its creation. The pattern is recursive: every benchmark created to replace a saturated benchmark eventually saturates itself.
Mode 2: Data Contamination
Contamination means the benchmark's questions were present in the model's training data. StarCoder-7b scored 4.9x higher on leaked versus clean data. When GPT-4 was tested with masked MMLU questions, it correctly inferred the missing answers 57% of the time - far exceeding random chance. The model was not demonstrating reasoning; it was demonstrating memory.
Contamination is hard to detect because most training datasets are not fully disclosed. When a model achieves an unusually high score on an older benchmark, contamination should be the first hypothesis, not the last. But the labs have no incentive to investigate their own contamination - and no external mechanism forces them to.
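For intuition, here is what a masked-answer contamination probe looks like in code. This is a sketch in the spirit of the masked-MMLU test described above, not the study's actual protocol; the `query_model` callable and the question schema (`question`, `choices` fields) are hypothetical:

```python
import random

def contamination_probe(questions: list, query_model, trials: int = 200) -> float:
    """Hide one multiple-choice option and ask the model to reproduce it
    verbatim. Exact-match rates far above chance suggest memorization of
    the benchmark items, not reasoning. `query_model(prompt) -> str` is a
    hypothetical model-API call; the question schema is assumed."""
    hits = 0
    sample = random.sample(questions, min(trials, len(questions)))
    for q in sample:
        hidden_idx = random.randrange(len(q["choices"]))
        hidden = q["choices"][hidden_idx]
        shown = [c for i, c in enumerate(q["choices"]) if i != hidden_idx]
        prompt = (
            f"Question: {q['question']}\n"
            f"Known options: {shown}\n"
            "One option has been removed. Reply with the missing option, verbatim."
        )
        if query_model(prompt).strip().lower() == hidden.strip().lower():
            hits += 1
    # Chance level is near zero for free-form text, so even modest
    # exact-match rates are a red flag.
    return hits / len(sample)
```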
Mode 3: Cherry-Picking
The mechanism: large labs can privately test many model variants, publish only the best result, and retract scores that don’t look favorable. Sara Hooker, Head of Cohere Labs and co-author of the LMArena critique, wrote: “It is critical for scientific integrity that we trust our measure of progress. The Chatbot Arena has become the go-to evaluation for frontier AI systems... We show that coordination among a handful of providers and preferential policies have led to distorted Arena rankings.”
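The inflation from best-of-k submission is straightforward to quantify. A toy simulation with illustrative numbers (not LMArena's actual rating model): if each privately tested variant's measured score is true skill plus rating noise, publishing only the best of k runs inflates the reported number by roughly the expected maximum of k noise draws:

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_k_inflation(k: int, noise_sd: float, trials: int = 100_000) -> float:
    """Expected score boost from privately testing k variants with i.i.d.
    rating noise and publishing only the best, with true skill fixed at 0."""
    scores = rng.normal(0.0, noise_sd, size=(trials, k))
    return scores.max(axis=1).mean()

# Illustrative ~40-point per-submission noise on an Elo-like scale.
for k in (1, 5, 10, 27):  # 27 = the Llama 4 variants Meta tested privately
    print(f"k={k:>2}: +{best_of_k_inflation(k, 40):.0f} points")
# k= 1: +0,  k= 5: ~+46,  k=10: ~+61,  k=27: ~+80 -- leaderboard gains
# with zero capability change, from selective reporting alone.
```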
Mode 4: Metric-Capability Gaps
High scores on coding benchmarks can mask serious quality problems in AI-generated code - such as 4x bug rates - revealing what leaderboards don’t measure: reliability under real constraints. A model that achieves 88% on HumanEval may fail consistently in production workflows. The benchmark measures the ability to pattern-match known problem formats - not the ability to write software that works.
| Benchmark | Frontier Score (2026) | Status | What It Still Measures |
|---|---|---|---|
| MMLU | >91% for all top models | Saturated; no differentiation | Smaller/mid-tier model comparison |
| GSM8K | >94% for all top models | Saturated + contamination risk | Eliminated as signal for frontier models |
| MMLU-Pro | 88–90% for top models | Approaching saturation | Meaningful for 60–90% range models |
| GPQA-Diamond | 81–94% range | Not yet saturated; best current discriminator | Expert-level reasoning at frontier tier |
| SWE-bench Verified | Varies; not saturated | Best production proxy available | Real software engineering tasks |
| LMArena Chatbot Arena | n/a (Elo-style ranking) | Gaming-compromised; up to 100pt inflation from cherry-picking | Directional signal only; treat with skepticism |
The Expert Forecaster Problem - Even Humans Fail
The benchmark crisis would be less significant if expert human forecasting of AI progress were reliable. It is not - and the failure mode is systematic in both directions.
Underestimation: Experts Miss the Pace
The Longitudinal Expert AI Panel (LEAP Wave 4) surveyed 253 experts, 58 superforecasters, and 810 members of the public in November–December 2025. The median expert predicted SOTA accuracy on LiveCodeBench Pro (Hard) would reach 14% in 2026. Shortly after the survey closed, OpenAI’s GPT-5.2 achieved 33% - more than double the median expert prediction.
AI 2027’s grading of its own 2025 predictions found that aggregate AI progress was running at 65% of the pace predicted, with qualitative milestones broadly on track but quantitative benchmarks missed in both directions. AI forecasting at the frontier is genuinely hard - both for the experts writing the reports and for the prediction markets pricing the outcomes.
The Asymmetry Between Forecaster Types
ForecastBench, a dynamic benchmark evaluating ML systems on forecasting questions (ICLR 2025), found that state-of-the-art models - including Claude 3.5 Sonnet and GPT-4 Turbo - perform only roughly as well as a simple median of forecasts from humans with no forecasting experience. The models performed significantly worse than superforecasters.
The implication: neither AI models themselves nor general expert opinion are reliable forecasters of AI progress. What reliably outperforms both, across every empirical study of forecasting accuracy, is the mechanism that combines financial incentives with diverse information from many independent participants. That mechanism is prediction markets.
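A toy simulation (not ForecastBench data) shows why aggregation across many independent participants beats the typical individual: idiosyncratic errors cancel in the median, and the aggregate's Brier score approaches the floor set by genuine uncertainty:

```python
import numpy as np

rng = np.random.default_rng(1)

def brier(forecast: np.ndarray, outcome: np.ndarray) -> float:
    return float(np.mean((forecast - outcome) ** 2))

n_q, n_people = 500, 50
p_true = rng.uniform(0.05, 0.95, n_q)           # true event probabilities
outcomes = (rng.random(n_q) < p_true).astype(float)

# Each forecaster sees the truth through independent noise.
forecasts = np.clip(p_true + rng.normal(0, 0.15, (n_people, n_q)), 0.01, 0.99)

avg_individual = np.mean([brier(f, outcomes) for f in forecasts])
aggregate = brier(np.median(forecasts, axis=0), outcomes)
print(f"average individual Brier: {avg_individual:.3f}")  # higher = worse
print(f"median-aggregate Brier:   {aggregate:.3f}")       # reliably lower
```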
| Forecaster Type | AI Capability Forecast Accuracy | Key Limitation |
|---|---|---|
| AI models forecasting AI | Worse than superforecasters (ForecastBench) | Cannot forecast their own development trajectories |
| General public | Lowest accuracy; systematically underestimates | No domain knowledge; no skin in the game |
| Domain experts | Better, but systematic biases in both directions | Overconfident; reputational herding |
| Superforecasters | Best calibration among named groups | Small population; slow to update on new information |
| Prediction markets | Best continuous calibration at scale | Thin in AI-specific markets; improving rapidly |
Why Prediction Markets Cannot Be Gamed the Same Way
The structural argument for prediction markets as an AI measurement mechanism is not that they are perfect. It is that their failure modes are categorically different from benchmark failure modes - and importantly, they cannot be gamed by the entity being measured.
The Independence Property
A benchmark designed and reported by a lab that is simultaneously evaluated by that benchmark is structurally compromised. The lab has both the means and the incentive to optimize the measure rather than the underlying capability. This is not a flaw in the labs’ integrity - it is a structural consequence of who controls the evaluation.
A prediction market on an AI capability milestone is priced by independent participants who have no relationship to the lab being evaluated. They have financial incentives to be accurate, not to be favorable. If they price a capability milestone too generously, they lose money when the milestone is not achieved. If they price it too conservatively, they lose the opportunity to profit from correct prediction. The incentive structure is aligned with accuracy, not with the interests of the evaluated entity.
The Accountability Property
Benchmark scores do not expire. A 91% MMLU score from 2024 is still being cited in 2026 sales decks with no discount for age, contamination risk, or cherry-picking. Prediction markets expire on resolution. A mispriced prediction market is corrected at resolution: the wrong side loses capital, and the correct side profits. This creates a continuous accountability mechanism that benchmarks lack.
Metaculus, a forecasting platform, has achieved a 0.111 Brier score on its track record of AI-related predictions. This calibration exists because every question resolves, every forecast is scored, and every forecaster’s track record is public. The same transparency does not exist for labs’ benchmark claims.
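For readers unfamiliar with the metric: a Brier score is the mean squared error between probability forecasts and binary outcomes, where 0.0 is perfect and an uninformed constant forecast of 0.5 scores 0.25. A minimal implementation:

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probability forecasts and 0/1 outcomes.
    0.0 is perfect; always answering 0.5 scores 0.25."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Four resolved questions:
print(brier_score([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0]))  # 0.075
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 0.25 (uninformed)
```

Against that 0.25 uninformed baseline, 0.111 over a large set of resolved questions is evidence of genuine calibration.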
The Continuous Update Property
Benchmarks are published on the lab’s timeline. A prediction market updates continuously as new information arrives. When a leaked internal eval, a developer test, or a pre-release paper circulates, prediction market prices reflect that information within minutes. Benchmark reports can take months to update.
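Mechanically, the continuous-update property is easy to see in an automated market maker. Many prediction markets use a design like Hanson's logarithmic market scoring rule (LMSR), sketched below; this is a standard mechanism for illustration, not any specific platform's implementation. Every trade reprices the question instantly:

```python
import math

class LMSRMarket:
    """Hanson's logarithmic market scoring rule for a binary question.
    b controls liquidity: larger b means prices move less per trade."""

    def __init__(self, b: float = 100.0):
        self.b = b
        self.q_yes = 0.0  # net YES shares sold
        self.q_no = 0.0   # net NO shares sold

    def price_yes(self) -> float:
        """Current implied probability of YES."""
        ey = math.exp(self.q_yes / self.b)
        en = math.exp(self.q_no / self.b)
        return ey / (ey + en)

    def cost(self) -> float:
        return self.b * math.log(
            math.exp(self.q_yes / self.b) + math.exp(self.q_no / self.b)
        )

    def buy_yes(self, shares: float) -> float:
        """Buy YES shares; returns the cost. The price updates instantly."""
        before = self.cost()
        self.q_yes += shares
        return self.cost() - before

m = LMSRMarket(b=100)
print(f"{m.price_yes():.2f}")  # 0.50 before any information arrives
m.buy_yes(80)                  # e.g. a leaked eval makes traders buy YES
print(f"{m.price_yes():.2f}")  # ~0.69 -- the news is priced in immediately
```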
AI forecasters at ai2025.org found the forecaster aggregate was about right on benchmarks, underestimated revenue growth, and overestimated public salience. The areas where prediction markets add the most value are precisely the areas where the forecaster aggregate missed most: revenue (harder to game than benchmarks) and public adoption metrics (requires real-world evidence, not lab-controlled tests).
The Practical Prediction Market Signal for AI Progress
If prediction markets are the more honest measurement mechanism for AI progress, the relevant question is: which prediction markets are currently providing the most signal, and where are the gaps?
Where Prediction Markets Currently Have Good Coverage
| AI Event Category | Coverage on Polymarket/Kalshi | Signal Quality | Information Gap (DuelDuck Opportunity) |
|---|---|---|---|
| Company IPO/valuation milestones | High volume; well-covered | Good calibration; liquid markets | Developer/engineer community has earlier signal on company health |
| Major model release timelines | Moderate coverage | Reasonable but thin | Researchers tracking internal roadmap signals see mispricings before the market corrects |
| Named benchmark score thresholds | Low coverage; emerging | Thin markets; high noise | Developer communities tracking benchmark trajectories have a structural information advantage |
| AI agent capability milestones | Very low coverage | High uncertainty; wide spreads | Researchers running agentic evals have domain expertise most market participants lack |
| Enterprise AI adoption metrics | Near zero | Not yet established | Business analysts tracking deployment have strong forward signals |
The DuelDuck Opportunity in AI Prediction
The AI category is the highest-information-asymmetry prediction market domain in 2026. The gap between what domain experts know and what general prediction market participants price is wider in AI than in politics, sports, or crypto - because AI capability development is opaque, fast-moving, and requires genuine technical literacy to evaluate.
This creates a specific DuelDuck use case: community duels on AI capability milestones, designed by creators with domain expertise (developers, researchers, AI engineers), priced at the 50/50 opening ratio while the broader Polymarket consensus reflects the general public’s much less informed view.
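The value of that edge is simple expected-value arithmetic. A minimal sketch, assuming an even-money matched pool and an illustrative flat fee on winnings (DuelDuck's exact pool mechanics and fee schedule may differ):

```python
def duel_expected_value(p_informed: float, stake: float, fee: float = 0.10) -> float:
    """Expected profit from taking one side of a duel priced at 50/50,
    when your informed probability estimate for that side is p_informed.
    Assumes an even-money payout: a win returns the matched stake, with
    `fee` taken from winnings. All parameters are illustrative."""
    win_profit = stake * (1 - fee)  # matched stake, minus fee on winnings
    lose_loss = -stake
    return p_informed * win_profit + (1 - p_informed) * lose_loss

# A researcher who credibly believes a milestone is 70% likely,
# entering a pool the general market prices at 50/50:
print(f"{duel_expected_value(0.70, 100):+.2f}")  # +33.00 per 100 staked
```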
The Benchmark Replacement Argument
The strongest form of the prediction market argument for AI measurement is not just that prediction markets avoid benchmark gaming. It is that prediction markets can replace specific types of benchmark claims that are currently providing false signal.
What Prediction Markets Can Replace
Deployment-based milestones - Did X AI system reach Y daily active users? Did it achieve Z revenue by Q3? These are binary, verifiable, high-stakes questions that the entity being measured cannot control without the market knowing. Prediction markets on deployment metrics are harder to game than benchmarks because the verification is external (app store data, regulatory filings, third-party analytics).
Independent third-party evaluation results - Did a specific model achieve X on a named benchmark as tested by an independent lab (not the releasing company)? Prediction markets on third-party evaluations eliminate the cherry-picking problem because the lab cannot control which variants the evaluator tests.
Real-world task completion rates - Did an AI agent successfully complete X% of tasks in a named independent evaluation (METR, HELM, etc.)? These are harder to game than internal benchmarks because the evaluation methodology is not controlled by the lab.
What Prediction Markets Cannot Replace
Prediction markets cannot replace benchmarks as a tool for evaluating research-stage capabilities that have not yet been publicly deployed or publicly demonstrated. A market that prices an event which has not yet occurred is forecasting, not measuring. The quality of the forecast depends on the information available to market participants - and for pre-publication research capabilities, that information is thin.
The honest framework is: use prediction markets for deployment milestones and third-party evaluation results; use carefully designed internal benchmarks for research-stage capability assessment; and hold both accountable with the same standard of reproducibility and independence.
Conclusion: The Only Evaluation You Cannot Game Is One With Skin in the Game
The benchmark crisis in AI is not going to be solved by designing harder benchmarks. Every new benchmark created to replace a saturated one faces the same contamination, cherry-picking, and gaming vulnerabilities within months of release. The problem is structural: the entity being evaluated controls the evaluation, and the incentive to produce favorable results is enormous.
Prediction markets solve the structural problem by separating the entity being evaluated from the mechanism that prices the evaluation. Independent participants with financial stakes in accuracy cannot be paid to produce favorable results. They can be wrong - and they frequently are - but their errors are unsystematic, financially penalized, and self-correcting. This is categorically different from benchmarks whose errors are systematic, reputationally rewarded, and self-reinforcing.
Stanford HAI faculty predict that 2026 will mark the moment AI confronts its actual utility: the era of AI evangelism is giving way to an era of AI evaluation. That era needs honest evaluation infrastructure. Prediction markets are not a complete replacement for benchmarks. But they are the only evaluation mechanism that cannot be gamed by the entity it evaluates. For the specific class of AI progress claims that matter most - deployment milestones, adoption metrics, real-world performance - they are the most honest signal available.
Start Predicting. Start Earning.
DuelDuck - P2P prediction market on Solana. No vig. No KYC. USDC payouts. Create community duels on AI capability milestones where your domain expertise is the edge - and earn up to 10% creator fee on every pool you design.
Create your first duel today