PokéBench

Competitive Pokémon decision-making benchmark. Scrapes high-level replays, generates realistic battle scenarios, and evaluates LLM actions against human ground truth. Features pairwise ELO rankings, accuracy metrics, and performance analysis.

PokéBench is a competitive Pokémon decision-making benchmark that evaluates language models against real human gameplay. It scrapes high-level Pokémon Showdown replays, generates realistic battle scenarios, and measures how well models can predict the actions that skilled human players actually chose.

How it works

1. Scrape — Fetch recent replays from high-rated players, binned by rating (1700–1800+ ELO), with deduplication and format filtering (e.g., Gen9 VGC 2025).
2. Generate scenarios — Parse battle logs and extract representative decision points across opening, midgame, and endgame. Each scenario captures visible game state and the human player's actual chosen actions.
3. Evaluate models — Present each scenario to a language model and measure exact-match accuracy against the human ground truth. In these doubles formats, a model must predict the actions of both of the player's active Pokémon (moves, switches, etc.) to score a scenario; see the sketch after this list.
4. Rank performance — Compute pairwise ELO ratings by comparing models head-to-head on identical scenarios, plus traditional metrics like accuracy, latency, and cost.
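
To make the scoring rule in step 3 concrete, here is a minimal TypeScript sketch of exact-match scoring. The Action and Scenario shapes are illustrative assumptions, not PokéBench's actual schema:

  // One action per active Pokémon slot in a doubles battle.
  interface Action {
    kind: "move" | "switch";
    detail: string;   // e.g. "Protect", or the name of the incoming Pokémon
    target?: string;  // move target slot, when applicable
  }

  interface Scenario {
    id: string;
    humanActions: [Action, Action]; // ground truth: one action per slot
  }

  function sameAction(a: Action, b: Action): boolean {
    return a.kind === b.kind && a.detail === b.detail && a.target === b.target;
  }

  // A scenario scores 1 only if both slots match the human's choices exactly.
  function scoreScenario(s: Scenario, predicted: [Action, Action]): number {
    return predicted.every((p, i) => sameAction(p, s.humanActions[i])) ? 1 : 0;
  }

There is no partial credit: matching one of the two slots still scores 0 for that scenario.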

Pairwise ELO Rankings

Rather than averaging the ELO ratings of the human games a model predicted correctly, PokéBench uses robust head-to-head comparisons. When two models face the same scenario and exactly one of them gets it right, that model wins the matchup. ELO ratings start at 1000 (matching Pokémon Showdown's base rating) and adjust based on pairwise performance; a sketch of the Bradley-Terry fit follows the list below.

  • Rating: Bradley-Terry model ELO computed from win-loss records across all pairwise matchups (base: 1000)
  • Confidence: Statistical certainty of the rating based on sample size (0-100%)
  • Credible interval: Plausible range for the true ELO rating (±X points)
  • Win rate: Overall percentage of head-to-head victories against other models
  • Sample size: Number of pairwise comparisons the rating is based on
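
The Bradley-Terry rating above can be fit with the standard minorize-maximize update and then mapped onto an ELO-like scale. A minimal sketch, assuming a wins matrix where wins[i][j] counts scenarios that model i got right while model j got wrong; anchoring the mean at 1000 and the conventional 400-point log10 spread are scaling assumptions:

  // Iterative MM fit of Bradley-Terry strengths from pairwise win counts.
  function bradleyTerryElo(wins: number[][], iters = 200): number[] {
    const n = wins.length;
    let p: number[] = new Array(n).fill(1);
    for (let t = 0; t < iters; t++) {
      const next = p.slice();
      for (let i = 0; i < n; i++) {
        let w = 0;
        let denom = 0;
        for (let j = 0; j < n; j++) {
          if (j === i) continue;
          const games = wins[i][j] + wins[j][i];
          if (games === 0) continue;
          w += wins[i][j];
          denom += games / (p[i] + p[j]);
        }
        // Guard so a model with zero wins keeps a tiny positive strength.
        if (denom > 0) next[i] = Math.max(w / denom, 1e-9);
      }
      p = next;
    }
    // Map strengths to an ELO scale: 400-point gap per 10x strength ratio,
    // with the field's mean anchored at the 1000 base rating.
    const logs = p.map((x) => 400 * Math.log10(x));
    const mean = logs.reduce((a, b) => a + b, 0) / n;
    return logs.map((x) => 1000 + x - mean);
  }

Under this model, a 400-point rating gap corresponds to roughly 10:1 odds of the higher-rated model winning a disagreement.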

Additional Metrics

  • Accuracy: Exact-match rate of predicted actions per scenario, broken down by ELO bin
  • Latency: Average response time per API request (milliseconds)
  • Cost: Total USD cost as reported by the model provider
  • Performance per Dollar: Overall accuracy divided by total cost (worked example after this list)
  • Token Usage: Input and output tokens consumed during evaluation
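
As a worked example of the Performance per Dollar metric (all numbers are hypothetical, not real results):

  // Hypothetical run: 42% overall accuracy at a provider-reported $3.10 spend.
  const accuracy = 0.42;
  const totalCostUsd = 3.10;
  const perfPerDollar = accuracy / totalCostUsd; // ≈ 0.135 accuracy per USD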

Usage & Setup

CLI Commands:

  bun run --cwd packages/cli dev:once scrape --format gen9vgc2025 --count-per-bin 120
  bun run --cwd packages/cli dev:once scenarios --source data/replays
  bun run --cwd packages/cli dev:once eval --input data/scenarios/scenarios.json

Model Providers: Models are accessed via OpenRouter. Set your OPENROUTER_API_KEY environment variable.
Data Pipeline: Results are written to runs/<version>/<bin>/*.summary.json for analysis and visualization.
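
Per-run summaries can then be aggregated with Bun's built-in file APIs. A minimal sketch, assuming each summary JSON carries a top-level accuracy field (an assumption about the schema, which may differ):

  import { Glob } from "bun";

  // Walk runs/<version>/<bin>/*.summary.json and average accuracy per bin.
  const glob = new Glob("runs/*/*/*.summary.json");
  const byBin = new Map<string, number[]>();

  for await (const path of glob.scan(".")) {
    const bin = path.split("/")[2]; // runs/<version>/<bin>/...
    const summary = await Bun.file(path).json();
    byBin.set(bin, [...(byBin.get(bin) ?? []), summary.accuracy]);
  }

  for (const [bin, accs] of byBin) {
    const mean = accs.reduce((a, b) => a + b, 0) / accs.length;
    console.log(`${bin}: ${(100 * mean).toFixed(1)}% mean accuracy (${accs.length} summaries)`);
  }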

PokéBench is experimental research software. Results should be interpreted carefully and in context. The benchmark measures a specific type of strategic reasoning under competitive constraints, which may not generalize to other AI capabilities.