About

Reality as the Ultimate Benchmark

A reality-grounded LLM evaluation built on unsettled events, paper portfolios, and deterministic scoring.

Traditional benchmarks fail when models memorize answers.
We test prediction, not recall.

Forecaster Arena uses public prediction markets from Polymarket as a source of future-event questions, timestamped prices, and externally resolved outcomes. Models make paper-portfolio decisions before those outcomes exist.

Philosophy

Core Principles

The rules that keep the arena rigorous, fair, and transparent.

Rigorous Methodology

Every decision documented. Every prompt stored. Every calculation reproducible. The methodology aims to meet the standards of academic publication.

Fair Comparison

Identical prompts, starting capital, and constraints for all models. Temperature is set to 0 to keep outputs as reproducible as possible. A level playing field.
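As a rough illustration of what "identical for all models" can look like in code, the sketch below builds one request body per model from a single shared configuration. Everything named here (`ArenaConfig`, `SHARED_CONFIG`, `buildRequest`, the starting capital figure, the prompt text, and the model ids) is a hypothetical stand-in, and the request shape assumes an OpenAI-style chat-completions body rather than the project's actual code.

```typescript
// Hypothetical shared settings; names and values are illustrative only.
interface ArenaConfig {
  readonly startingCapital: number; // paper dollars (assumed figure)
  readonly temperature: number;
  readonly systemPrompt: string;
}

const SHARED_CONFIG: ArenaConfig = Object.freeze({
  startingCapital: 10000,
  temperature: 0,
  systemPrompt: "You are a forecaster managing a paper portfolio.",
});

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ChatRequest {
  model: string;
  temperature: number;
  messages: ChatMessage[];
}

// Build the request body for one model. Only the model id varies;
// the prompt text and sampling settings come from the shared config.
function buildRequest(modelId: string, marketPrompt: string): ChatRequest {
  return {
    model: modelId,
    temperature: SHARED_CONFIG.temperature,
    messages: [
      { role: "system", content: SHARED_CONFIG.systemPrompt },
      { role: "user", content: marketPrompt },
    ],
  };
}

const requestA = buildRequest("model-a", "Will event X happen by Friday?");
const requestB = buildRequest("model-b", "Will event X happen by Friday?");
```

Routing every request through one builder makes it harder for a per-model tweak to slip in unnoticed, since any difference between `requestA` and `requestB` other than `model` would be a bug.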

Complete Transparency

Open source codebase. Public methodology documentation. Anyone can verify results or build upon our work.

Metrics

What We Measure

Portfolio value is primary. Win rate, activity, consistency, decision quality, and cost discipline provide context.

Portfolio Value

Primary

Cash plus marked position value. This is the official v2 ranking metric because it turns forecasts into auditable decisions.

Higher is better

Paper only
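The "cash plus marked position value" definition can be sketched in a few lines. The `Position` shape and the `portfolioValue` name are assumptions made for illustration, and marking each position at its current market price is one plausible reading of "marked", not a confirmed detail of the scoring code.

```typescript
// Hypothetical position record; field names are illustrative.
interface Position {
  marketId: string;
  shares: number;    // number of outcome shares held
  markPrice: number; // current market price per share, between 0 and 1
}

// Portfolio value = cash + sum over positions of (shares * current price).
function portfolioValue(cash: number, positions: Position[]): number {
  const marked = positions.reduce((sum, p) => sum + p.shares * p.markPrice, 0);
  return cash + marked;
}

const exampleValue = portfolioValue(4000, [
  { marketId: "market-1", shares: 100, markPrice: 0.5 },
  { marketId: "market-2", shares: 200, markPrice: 0.25 },
]);
// 4000 cash + 50 + 50 in marked positions = 4100
```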

Portfolio P/L

Realized and unrealized gains or losses from the equal starting bankroll.
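One way to split P/L into its realized and unrealized parts is sketched below. It assumes each open position tracks an average cost per share, which is an illustrative bookkeeping choice; `OpenPosition`, `PnL`, and `portfolioPnL` are hypothetical names.

```typescript
// Hypothetical open-position record with a tracked average cost.
interface OpenPosition {
  shares: number;
  avgCost: number;   // average price paid per share
  markPrice: number; // current market price per share
}

interface PnL {
  realized: number;
  unrealized: number;
  total: number;
}

function portfolioPnL(
  startingBankroll: number,
  cash: number,
  open: OpenPosition[],
): PnL {
  const marked = open.reduce((sum, p) => sum + p.shares * p.markPrice, 0);
  // Unrealized: gain or loss on positions still held, relative to cost.
  const unrealized = open.reduce(
    (sum, p) => sum + p.shares * (p.markPrice - p.avgCost),
    0,
  );
  // Total: current portfolio value minus the starting bankroll.
  const total = cash + marked - startingBankroll;
  // Realized: whatever part of the total is not explained by open positions.
  return { realized: total - unrealized, unrealized, total };
}

// Started with 1000, locked in 25 on a closed trade, then paid 50 for
// 100 shares at 0.50 that now trade at 0.75.
const examplePnL = portfolioPnL(1000, 975, [
  { shares: 100, avgCost: 0.5, markPrice: 0.75 },
]);
```

In this example the total is 50: 25 realized from the closed trade and 25 unrealized on the open position.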

Win Rate

Directional accuracy on resolved markets: the share of bets that landed on the correct side. Simple but informative.
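A minimal sketch of win rate, assuming binary YES/NO markets and one record per resolved bet. The `ResolvedBet` type and `winRate` function are hypothetical names, and returning `null` when nothing has resolved is a design choice made here so that "no data" is not confused with a 0% win rate.

```typescript
type Side = "YES" | "NO";

// Hypothetical record for a bet whose market has resolved.
interface ResolvedBet {
  side: Side;    // the side the model bet on
  outcome: Side; // how the market actually resolved
}

// Fraction of resolved bets where the chosen side matched the outcome.
function winRate(bets: ResolvedBet[]): number | null {
  if (bets.length === 0) return null; // nothing resolved yet
  const wins = bets.filter((b) => b.side === b.outcome).length;
  return wins / bets.length;
}

const exampleWinRate = winRate([
  { side: "YES", outcome: "YES" },
  { side: "NO", outcome: "NO" },
  { side: "YES", outcome: "NO" },
  { side: "NO", outcome: "NO" },
]);
// 3 wins out of 4 resolved bets = 0.75
```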

Activity

Resolved bets, trades, and held positions show how much evidence sits behind a result.

Consistency

Performance across cohorts distinguishes skill from luck.
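One simple way to look at consistency is to compare the mean and spread of a model's per-cohort returns. The sketch below is an assumption about how such a summary could be computed, not the project's documented method; `cohortConsistency` is a hypothetical name and the sample standard deviation is just one possible measure of spread.

```typescript
interface ConsistencySummary {
  mean: number;   // average return across cohorts
  stdDev: number; // sample standard deviation of those returns
}

// Summarize per-cohort returns (e.g. 0.05 for +5%). Needs at least
// two cohorts, otherwise the spread is undefined.
function cohortConsistency(returns: number[]): ConsistencySummary | null {
  if (returns.length < 2) return null;
  const mean = returns.reduce((a, b) => a + b, 0) / returns.length;
  const variance =
    returns.reduce((acc, r) => acc + (r - mean) * (r - mean), 0) /
    (returns.length - 1);
  return { mean, stdDev: Math.sqrt(variance) };
}

const steady = cohortConsistency([0.04, 0.05, 0.06]);
const erratic = cohortConsistency([0.45, -0.3, 0.0]);
```

Both example series average 5%, but the second has a far larger spread, which is the pattern that would make a single good cohort look more like luck than skill.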

Decision Quality

Reasoning analysis: are the models making sensible arguments?

API Efficiency

Cost per decision. Some models achieve more with fewer tokens.
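Cost per decision can be approximated from token counts and per-token pricing, as in the sketch below. The `DecisionLog` and `Pricing` shapes, the `costPerDecision` name, and the example prices are all assumptions for illustration; real pricing differs by model and provider.

```typescript
// Hypothetical per-decision usage record.
interface DecisionLog {
  promptTokens: number;
  completionTokens: number;
}

// Hypothetical pricing in USD per million tokens.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Average USD cost per decision, or null when there are no decisions.
function costPerDecision(logs: DecisionLog[], pricing: Pricing): number | null {
  if (logs.length === 0) return null;
  const totalCost = logs.reduce(
    (sum, d) =>
      sum +
      (d.promptTokens / 1_000_000) * pricing.inputPerMillion +
      (d.completionTokens / 1_000_000) * pricing.outputPerMillion,
    0,
  );
  return totalCost / logs.length;
}

const exampleCost = costPerDecision(
  [
    { promptTokens: 1000, completionTokens: 500 },
    { promptTokens: 3000, completionTokens: 500 },
  ],
  { inputPerMillion: 1, outputPerMillion: 2 }, // made-up prices
);
// Roughly 0.003 USD per decision with these made-up numbers.
```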

Important Disclaimer

Forecaster Arena is an educational and research project. All trading is simulated (paper trading). No real money is ever at risk.

This is not financial advice. The benchmark evaluates LLM reasoning capabilities, not investment guidance. Past performance does not predict future results.

Contact

Mert Gulsun

UC Berkeley

Portfolio | LinkedIn | GitHub

Stack

Built With

The application, data, and model infrastructure behind the benchmark.

Next.js 14 (Framework)

TypeScript (Language)

SQLite (Database)

OpenRouter (LLM API)

Polymarket (Market Data)

Tailwind (Styling)

Recharts (Charts)

Open Source. Always.

We welcome contributions, suggestions, and feedback. Help us build a better benchmark.
