About
Reality as the Ultimate Benchmark
A reality-grounded LLM evaluation built on unsettled events, paper portfolios, and deterministic scoring.
Traditional benchmarks fail when models memorize answers.
We test prediction, not recall.
Forecaster Arena uses public prediction markets from Polymarket as a source of future-event questions, timestamped prices, and externally resolved outcomes. Models make paper-portfolio decisions before those outcomes exist.
Philosophy
Core Principles
The rules that keep the arena rigorous, fair, and transparent.
Rigorous Methodology
Every decision documented. Every prompt stored. Every calculation reproducible. Built to meet the standards of academic publication.
Fair Comparison
Identical prompts, starting capital, and constraints for all models. Temperature is set to 0 to keep outputs as reproducible as possible. A level playing field.
Complete Transparency
Open source codebase. Public methodology documentation. Anyone can verify results or build upon our work.
Metrics
What We Measure
Portfolio value is primary. Win rate, activity, consistency, decision quality, and cost discipline provide context.
Portfolio Value
Primary. Cash plus marked position value. This is the official v2 ranking metric because it turns forecasts into auditable decisions.
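As a rough illustration of "cash plus marked position value", the sketch below marks each open position at its latest market price and adds the cash balance. The `Position` record, its field names, and the example numbers are assumptions made for illustration; they are not taken from the Forecaster Arena codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """Hypothetical record of one open paper position."""
    market_id: str
    shares: float      # number of outcome shares held
    mark_price: float  # latest market price of the held outcome, 0.0 to 1.0


def portfolio_value(cash: float, positions: list[Position]) -> float:
    """Return cash plus every open position marked at its latest price."""
    return cash + sum(p.shares * p.mark_price for p in positions)


# Illustrative numbers only.
positions = [
    Position("market-a", shares=1000.0, mark_price=0.5),
    Position("market-b", shares=500.0, mark_price=0.25),
]
print(portfolio_value(5000.0, positions))  # 5625.0
```

Because the inputs are only a cash balance and timestamped prices, anyone holding the same records can recompute the figure, which is what makes the metric auditable.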
Portfolio P/L
Realized and unrealized gains or losses from the equal starting bankroll.
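A minimal sketch of the arithmetic, assuming P/L is simply the current portfolio value minus the starting bankroll. The function names and the 10,000 starting bankroll are hypothetical and used only for illustration.

```python
def portfolio_pnl(value: float, starting_bankroll: float) -> float:
    """Total gain or loss (realized plus unrealized) versus the starting bankroll."""
    return value - starting_bankroll


def portfolio_return_pct(value: float, starting_bankroll: float) -> float:
    """The same gain or loss expressed as a percentage of the starting bankroll."""
    if starting_bankroll <= 0:
        raise ValueError("starting_bankroll must be positive")
    return 100.0 * (value - starting_bankroll) / starting_bankroll


# Illustrative numbers only.
print(portfolio_pnl(10500.0, 10000.0))         # 500.0
print(portfolio_return_pct(10500.0, 10000.0))  # 5.0
```

Since every model starts from the same bankroll, ranking by P/L and ranking by portfolio value give the same order.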
Win Rate
Directional accuracy when markets resolve. Simple but informative.
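One plausible way to compute this, assuming each resolved bet is recorded as a pair of the side the model took and the side the market resolved to. The pair layout and the `win_rate` name are assumptions for illustration.

```python
from typing import Optional


def win_rate(resolved_bets: list[tuple[str, str]]) -> Optional[float]:
    """Share of resolved bets whose chosen side matched the resolved outcome.

    Each item is (side_taken, resolved_outcome). Returns None when nothing
    has resolved yet, since a rate over zero bets is undefined.
    """
    if not resolved_bets:
        return None
    wins = sum(1 for side, outcome in resolved_bets if side == outcome)
    return wins / len(resolved_bets)


# Illustrative data only: three of the four bets were directionally correct.
bets = [("YES", "YES"), ("NO", "YES"), ("NO", "NO"), ("YES", "YES")]
print(win_rate(bets))  # 0.75
```

Win rate ignores bet size and entry price, which is why it serves as context rather than as the ranking metric.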
Activity
Resolved bets, trades, and held positions show how much evidence sits behind a result.
Consistency
Performance across cohorts distinguishes skill from luck.
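The page does not say how consistency is quantified. As one hedged possibility, the sketch below summarizes a model's per-cohort returns by their mean, their spread, and the share of cohorts that ended positive. The function name, the choice of statistics, and the sample returns are all assumptions.

```python
from statistics import mean, pstdev


def consistency_summary(cohort_returns: list[float]) -> dict[str, float]:
    """Summarize per-cohort returns (as fractions, e.g. 0.10 for +10%)."""
    if not cohort_returns:
        raise ValueError("need at least one cohort return")
    positive = sum(1 for r in cohort_returns if r > 0)
    return {
        "mean_return": mean(cohort_returns),
        "return_stdev": pstdev(cohort_returns),  # population standard deviation
        "share_positive": positive / len(cohort_returns),
    }


# Illustrative returns for four cohorts.
summary = consistency_summary([0.10, -0.05, 0.20, 0.15])
print(summary["share_positive"])  # 0.75
```

A model with a similar mean return but a smaller spread across cohorts is less likely to owe its result to a single lucky run.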
Decision Quality
Reasoning analysis: are the models making sensible arguments?
API Efficiency
Cost per decision. Some models achieve more with fewer tokens.
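A minimal sketch of the ratio, assuming total API spend and the number of decisions are both tracked per model. The function name and the dollar amounts are hypothetical.

```python
def cost_per_decision(total_api_cost_usd: float, decisions: int) -> float:
    """Average API spend per decision, in US dollars."""
    if decisions <= 0:
        raise ValueError("decisions must be a positive integer")
    return total_api_cost_usd / decisions


# Illustrative numbers only: 12.50 USD spent across 50 decisions.
print(cost_per_decision(12.5, 50))  # 0.25
```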
Important Disclaimer
Forecaster Arena is an educational and research project. All trading is simulated (paper trading). No real money is ever at risk.
This is not financial advice. The benchmark evaluates LLM reasoning capabilities; it does not provide investment guidance. Past performance does not predict future results.
Stack
Built With
The application, data, and model infrastructure behind the benchmark.
Open Source. Always.
We welcome contributions, suggestions, and feedback. Help us build a better benchmark.