Methodology

LLM Evaluation Grounded in Reality

Forecaster Arena evaluates language models on unsettled future events. Prediction markets, paper portfolios, and real-world resolutions make the benchmark verifiable.

Methodology v2

Abstract

Forecaster Arena is an LLM evaluation grounded in reality. Each model receives the same bankroll, the same market universe, and the same operating rules. It must convert its view of unsettled future events into paper-trading decisions, and those decisions are scored by the value of the resulting portfolio after real-world outcomes resolve.

Methodology v2 ranks models by portfolio value. Historical v1 cohorts used calibration metrics as a secondary evaluation axis.

1. Evaluation Objective

1.1 What the Benchmark Measures

The benchmark asks whether language models can make useful decisions about reality before reality has settled. The task is not to answer a static test item. It is to act under uncertainty, manage a constrained paper portfolio, and be judged by outcomes that were unknown at decision time.

1.2 Why Future Events

  • No answer memorization: resolved answers do not exist when decisions are logged
  • External truth: outcomes are settled by real-world events, not subjective grading
  • Continuous renewal: new events keep the benchmark from becoming a fixed answer key

1.3 Prediction Markets as Substrate

Prediction markets are not the benchmark itself. They provide a public, timestamped, machine-readable stream of future-event questions, market prices, and resolution criteria. Because every model sees the same questions, prices, and resolution criteria, their decisions can be compared directly.

2. Competition Design

2.1 Cohorts

Start frequency: Every Sunday 00:00 UTC
Models per cohort: 7
Starting capital: $10,000
Decision window: Latest 5 cohort numbers
Duration: Tracked until positions resolve or settle
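
The weekly start rule can be illustrated with a small date calculation. This is a minimal sketch, not the benchmark's scheduler: the function name `next_cohort_start` is hypothetical, and the choice to treat an exact Sunday 00:00 UTC timestamp as already started (returning the following Sunday) is an assumption.

```python
from datetime import datetime, timedelta, timezone


def next_cohort_start(now: datetime) -> datetime:
    """Return the first Sunday 00:00 UTC strictly after `now`.

    `now` must be timezone-aware; it is converted to UTC first.
    """
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # datetime.weekday(): Monday is 0, Sunday is 6.
    days_ahead = (6 - midnight.weekday()) % 7
    candidate = midnight + timedelta(days=days_ahead)
    # Assumption: a cohort starting exactly at `now` counts as started,
    # so the next start is one week later.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```

For example, a timestamp on Wednesday 3 January 2024 maps to Sunday 7 January 2024 at 00:00 UTC.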

2.2 Participating Models

GPT (OpenAI)
Gemini (Google)
Grok (xAI)
Claude (Anthropic)
DeepSeek (DeepSeek)
Kimi (Moonshot AI)
Qwen (Alibaba)

2.3 Market Universe

Markets are sourced from Polymarket's public API. For each decision run, the benchmark presents the top 500 markets by trading volume. The volume filter keeps the universe liquid enough for meaningful paper execution while preserving a simple rule that applies equally to every model.
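
The volume filter amounts to a sort and a cut. The sketch below works on already-fetched market records and does not call Polymarket's API. The `Market` fields, the function name `select_universe`, and the tie-break on market ID are all assumptions made for illustration.

```python
from typing import List, TypedDict


class Market(TypedDict):
    # Hypothetical record shape; the real API fields may differ.
    market_id: str
    question: str
    volume: float


def select_universe(markets: List[Market], limit: int = 500) -> List[Market]:
    """Return the `limit` highest-volume markets.

    Ties are broken by market_id so every model sees the same ordering
    (the tie-break rule is an assumption, not taken from the source).
    """
    ranked = sorted(markets, key=lambda m: (-m["volume"], m["market_id"]))
    return ranked[:limit]
```

A deterministic tie-break matters here because the same snapshot has to be reproducible for every model in the run.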

3. Decision Protocol

3.1 Information Provided

Every model in a decision-eligible cohort receives the same closed-book snapshot for the run:

MARKET: Question, market ID, current prices, volume, and close date
PORTFOLIO: Current cash, open positions, marked value, and unrealized P/L
RULES: Action schema, position limits, and bankroll constraints

3.2 Action Space

BET: Open or add exposure to a market side or outcome
SELL: Reduce or close an existing position
HOLD: Leave the portfolio unchanged for the week

3.3 Constraints

Minimum bet: $50. Prevents noise from trivial paper positions.
Maximum bet: 25% of cash. Prevents all-in decisions from dominating a cohort.
Positions per market: 1 per side. Keeps accounting and attribution auditable.
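
The three constraints above can be expressed as a small validation step. This is a hedged sketch rather than the benchmark's actual code. The names `Portfolio`, `check_bet`, and `apply_bet` are hypothetical. It assumes the limits are inclusive, that "cash" means cash available at decision time, and that adding to an existing side merges into that side's single position.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

MIN_BET = 50.0           # minimum bet in dollars
MAX_BET_FRACTION = 0.25  # maximum bet as a fraction of current cash


@dataclass
class Portfolio:
    cash: float
    # Keyed by (market_id, side), so each side of a market holds at most
    # one position. Values are share counts.
    positions: Dict[Tuple[str, str], float] = field(default_factory=dict)


def check_bet(portfolio: Portfolio, amount: float) -> Optional[str]:
    """Return a rejection reason, or None if the bet size is allowed."""
    if amount < MIN_BET:
        return "below minimum bet"
    if amount > MAX_BET_FRACTION * portfolio.cash:
        return "above maximum bet"
    return None


def apply_bet(portfolio: Portfolio, market_id: str, side: str,
              amount: float, price: float) -> None:
    """Spend `amount` at `price` per share, merging into any existing position."""
    reason = check_bet(portfolio, amount)
    if reason is not None:
        raise ValueError(reason)
    key = (market_id, side)
    portfolio.cash -= amount
    portfolio.positions[key] = portfolio.positions.get(key, 0.0) + amount / price
```

With a fresh $10,000 bankroll, the largest allowed first bet under these assumptions is $2,500, and a second bet on the same side adds shares to the existing position instead of opening a new one.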

4. Scoring Methodology

4.1 Primary Ranking

The official v2 ranking is portfolio value. A model wins by growing its paper portfolio under the same market universe, bankroll, and rules as every other participant.

portfolio_value = cash + marked_position_value
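
As a rough illustration of the formula, the sketch below marks each open position at its current market price and adds cash. The function name, the `(market_id, side)` keys, and the share-count representation are assumptions made for the example.

```python
from typing import Dict, Tuple

PositionKey = Tuple[str, str]  # (market_id, side), a hypothetical key shape


def portfolio_value(cash: float,
                    positions: Dict[PositionKey, float],
                    prices: Dict[PositionKey, float]) -> float:
    """Compute cash + marked_position_value.

    `positions` maps each open position to its share count, and `prices`
    maps the same keys to the current market price per share.
    """
    marked_position_value = sum(
        shares * prices[key] for key, shares in positions.items()
    )
    return cash + marked_position_value
```

For instance, $9,000 in cash plus 2,000 shares priced at $0.50 gives a portfolio value of $10,000.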

4.2 Returns and Settlement

Returns are measured from the initial $10,000 starting balance. Open positions are marked to current market prices, while resolved positions settle according to the market outcome. The public leaderboard can show realized P/L, unrealized P/L, win rate, and activity, but portfolio value is the primary score.
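
A minimal sketch of the return and settlement arithmetic follows. The function names are hypothetical, and the payout rule (each winning share pays $1, each losing share pays $0) is an assumption based on the usual binary-market convention; the source only says that resolved positions settle according to the market outcome.

```python
STARTING_BALANCE = 10_000.0  # initial paper bankroll per model


def total_return(portfolio_value: float) -> float:
    """Fractional return measured from the starting balance."""
    return (portfolio_value - STARTING_BALANCE) / STARTING_BALANCE


def settlement_value(shares: float, won: bool) -> float:
    """Cash received when a position resolves.

    Assumption: a winning share pays $1.00 and a losing share pays $0.00.
    """
    return shares if won else 0.0


def realized_pnl(shares: float, cost_basis: float, won: bool) -> float:
    """Profit or loss on a resolved position relative to what was paid."""
    return settlement_value(shares, won) - cost_basis
```

Under these assumptions, 2,000 shares bought for $1,000 realize +$1,000 if the side wins and -$1,000 if it loses.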

4.3 Paper Trading Boundary

Forecaster Arena does not execute real-money trades. Paper portfolios make model decisions concrete, timestamped, comparable, and auditable without creating a consumer trading product.

5. Non-Gameability and Auditability

Future outcomes

Models cannot memorize outcomes that have not happened yet.

Decision logs

Prompts, responses, parsed actions, and trades are stored before outcomes resolve.

Shared snapshots

All competitors receive the same market and portfolio context for a run.

Deterministic accounting

Cash, shares, positions, settlements, and P/L are computed by fixed rules.