Abstract
Forecaster Arena is an LLM evaluation grounded in reality. Each model receives the same bankroll, the same market universe, and the same operating rules. It must convert its view of unsettled future events into paper-trading decisions, and those decisions are scored by the value of the resulting portfolio after real-world outcomes resolve.
Methodology v2 ranks models by portfolio value. Historical v1 cohorts used calibration metrics as a secondary evaluation axis.
1. Evaluation Objective
1.1 What the Benchmark Measures
The benchmark asks whether language models can make useful decisions about reality before reality has settled. The task is not to answer a static test item. It is to act under uncertainty, manage a constrained paper portfolio, and be judged by outcomes that were unknown at decision time.
1.2 Why Future Events
- No answer memorization: resolved answers do not exist when decisions are logged
- External truth: outcomes are settled by real-world events, not subjective grading
- Continuous renewal: new events keep the benchmark from becoming a fixed answer key
1.3 Prediction Markets as Substrate
Prediction markets are not the benchmark itself. They provide a public, timestamped, machine-readable stream of future-event questions, market prices, and resolution criteria. That substrate lets every model be evaluated on the same questions, at the same prices, against the same resolution criteria.
2. Competition Design
2.1 Cohorts
| Parameter | Value |
|---|---|
| Start frequency | Every Sunday 00:00 UTC |
| Models per cohort | 7 |
| Starting capital | $10,000 |
| Decision window | Latest 5 cohort numbers |
| Duration | Tracked until positions resolve or settle |
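The weekly start rule can be illustrated with a small scheduling helper. This is a minimal sketch, not the benchmark's actual scheduler: the function name `next_cohort_start` is hypothetical, and treating a timestamp of exactly Sunday 00:00 UTC as the start of that same cohort is an assumption.

```python
from datetime import datetime, timedelta, timezone


def next_cohort_start(now: datetime) -> datetime:
    """Return the first Sunday 00:00 UTC at or after `now`.

    `now` must be timezone-aware. Hypothetical helper for illustration only.
    """
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # datetime.weekday() returns 0 for Monday and 6 for Sunday.
    days_ahead = (6 - midnight.weekday()) % 7
    candidate = midnight + timedelta(days=days_ahead)
    # If today is Sunday but 00:00 has already passed, roll to next week.
    if candidate < now:
        candidate += timedelta(days=7)
    return candidate
```

For example, a call made on Wednesday 2024-01-03 at 12:00 UTC returns Sunday 2024-01-07 at 00:00 UTC.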
2.2 Participating Models
2.3 Market Universe
Markets are sourced from Polymarket's public API. For each decision run, the benchmark presents the top 500 markets by trading volume. The volume filter keeps the universe liquid enough for meaningful paper execution while preserving a simple rule that applies equally to every model.
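The volume filter amounts to a sort and a cutoff. The sketch below assumes markets have already been fetched into memory; the `Market` fields and the `select_universe` name are hypothetical and do not mirror the Polymarket API schema. Breaking ties by market ID is also an assumption, added only so that the selection is reproducible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Market:
    market_id: str   # hypothetical field names, not the Polymarket schema
    question: str
    volume: float    # trading volume in dollars


def select_universe(markets: list[Market], limit: int = 500) -> list[Market]:
    """Return the `limit` highest-volume markets, highest first."""
    # Sort by volume descending; break ties by market_id for determinism.
    ranked = sorted(markets, key=lambda m: (-m.volume, m.market_id))
    return ranked[:limit]
```

Because the rule depends only on volume, every model in a run sees the same list regardless of its own holdings or history.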
3. Decision Protocol
3.1 Information Provided
Every model in a decision-eligible cohort receives the same closed-book snapshot for the run:
3.2 Action Space
3.3 Constraints
| Constraint | Value | Purpose |
|---|---|---|
| Minimum bet | $50 | Prevents noise from trivial paper positions |
| Maximum bet | 25% of cash | Prevents all-in decisions from dominating a cohort |
| Positions per market | 1 per side | Keeps accounting and attribution auditable |
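The constraints in the table could be checked before a paper trade is accepted, roughly as follows. This is a sketch under stated assumptions: the function and constant names are hypothetical, both bet limits are treated as inclusive, and "1 per side" is read as rejecting a second position on a side that is already open. The source does not say whether adding to an existing position is allowed.

```python
MIN_BET = 50.0            # dollars, from the constraints table
MAX_BET_FRACTION = 0.25   # fraction of available cash, from the constraints table


def validate_bet(
    amount: float,
    cash: float,
    market_id: str,
    side: str,
    open_positions: set[tuple[str, str]],
) -> list[str]:
    """Return the list of violated constraints (empty if the bet is allowed).

    `open_positions` holds (market_id, side) pairs. Hypothetical helper.
    """
    errors = []
    if amount < MIN_BET:
        errors.append("below minimum bet")
    if amount > MAX_BET_FRACTION * cash:
        errors.append("exceeds 25% of cash")
    # Assumption: a side that is already open cannot be opened again.
    if (market_id, side) in open_positions:
        errors.append("position already open on this side")
    return errors
```

With the full $10,000 in cash, the largest allowed bet is $2,500. Under this reading, once cash falls below $200 no amount satisfies both limits, since 25% of cash is then less than $50.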
4. Scoring Methodology
4.1 Primary Ranking
The official v2 ranking is portfolio value. A model wins by growing its paper portfolio under the same market universe, bankroll, and rules as every other participant.
4.2 Returns and Settlement
Returns are measured from the initial $10,000 starting balance. Open positions are marked to current market prices, while resolved positions settle according to the market outcome. The public leaderboard can show realized P/L, unrealized P/L, win rate, and activity, but portfolio value is the primary score.
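The valuation rule can be written down in a few lines. The sketch below is illustrative rather than the benchmark's actual code: the `OpenPosition` fields and the function names are hypothetical, and prices are assumed to be quoted per share on a 0 to 1 scale for the side held.

```python
from dataclasses import dataclass

STARTING_BALANCE = 10_000.0  # from the cohort design


@dataclass
class OpenPosition:
    shares: float
    cost: float           # cash paid to open the position
    current_price: float  # assumed: latest price of the held side, in [0, 1]


def portfolio_value(cash: float, positions: list[OpenPosition]) -> float:
    """Cash plus open positions marked to current market prices."""
    return cash + sum(p.shares * p.current_price for p in positions)


def unrealized_pnl(positions: list[OpenPosition]) -> float:
    """Mark-to-market value of open positions minus what was paid for them."""
    return sum(p.shares * p.current_price - p.cost for p in positions)


def total_return(value: float) -> float:
    """Fractional return measured from the starting balance."""
    return (value - STARTING_BALANCE) / STARTING_BALANCE
```

As a worked case, a model holding $9,000 in cash and 2,000 shares bought for $1,000 that now trade at 0.60 has a portfolio value of $10,200, an unrealized P/L of $200, and a return of 2%.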
4.3 Paper Trading Boundary
Forecaster Arena does not execute real-money trades. Paper portfolios make model decisions concrete, timestamped, comparable, and auditable without creating a consumer trading product.
5. Non-Gameability and Auditability
- Future outcomes: Models cannot memorize outcomes that have not happened yet.
- Decision logs: Prompts, responses, parsed actions, and trades are stored before outcomes resolve.
- Shared snapshots: All competitors receive the same market and portfolio context for a run.
- Deterministic accounting: Cash, shares, positions, settlements, and P/L are computed by fixed rules.
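To make "fixed rules" concrete, here is one way the accounting for a single binary position might look. This is a sketch, not the benchmark's implementation. It assumes that a bet of a given dollar amount buys `amount / price` shares, and that each share pays $1 if its side wins and $0 otherwise. The source specifies neither rule, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    market_id: str
    side: str      # e.g. "YES" or "NO"
    shares: float
    cost: float    # cash spent to open the position


def open_position(
    cash: float, market_id: str, side: str, amount: float, price: float
) -> tuple[float, Position]:
    """Spend `amount` of cash at `price` per share; return (new_cash, position)."""
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1")
    if amount > cash:
        raise ValueError("insufficient cash")
    shares = amount / price  # assumed share-count rule
    return cash - amount, Position(market_id, side, shares, amount)


def settle(cash: float, position: Position, winning_side: str) -> tuple[float, float]:
    """Resolve a position; return (new_cash, realized_pnl)."""
    # Assumed payout rule: $1 per share on the winning side, $0 otherwise.
    payout = position.shares if position.side == winning_side else 0.0
    return cash + payout, payout - position.cost
```

Under these assumptions, betting $1,000 on YES at 0.50 buys 2,000 shares and leaves $9,000 in cash. If YES wins, cash becomes $11,000 with a realized P/L of +$1,000. If NO wins, cash stays at $9,000 with a realized P/L of -$1,000. The same inputs always produce the same ledger, which is what allows results to be recomputed from the stored decision logs.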