How it works
Pairwise ELO Rankings
Rather than averaging the human ELO ratings of the scenarios a model gets right, PokéBench uses robust head-to-head comparisons. When two models face the same scenario and only one gets it right, that model wins the matchup. ELO ratings start at 1000 (matching Pokémon Showdown's base rating) and adjust based on pairwise performance; a sketch of the computation follows the list below.
- Rating: Bradley-Terry model ELO computed from win-loss records across all pairwise matchups (base: 1000)
- Confidence: Statistical certainty of the rating based on sample size (0-100%)
- Credible interval: Plausible range for the true ELO rating (±X points)
- Win rate: Overall percentage of head-to-head victories against other models
- Sample size: Number of pairwise comparisons the rating is based on
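To make the rating computation concrete, here is a minimal sketch, not PokéBench's actual code: it fits Bradley-Terry strengths from pairwise win counts with the standard MM updates and maps them onto an ELO scale anchored at the 1000 base. The model names and counts in the example are hypothetical.

```python
import math
from collections import defaultdict

# Minimal sketch (not PokéBench's implementation): Bradley-Terry fit over
# pairwise win counts, converted to an ELO-like scale anchored at 1000.
# wins[(a, b)] = number of matchups model `a` won against model `b`.
def bradley_terry_elo(wins, base=1000.0, scale=400.0, iters=200):
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    games = defaultdict(float)   # total matchups between each pair of models
    won = defaultdict(float)     # total wins per model
    for (a, b), n in wins.items():
        games[(a, b)] += n
        games[(b, a)] += n
        won[a] += n
    for _ in range(iters):
        new = {}
        for m in models:
            denom = sum(games[(m, o)] / (strength[m] + strength[o])
                        for o in models if o != m and games[(m, o)] > 0)
            new[m] = max(won[m], 1e-9) / denom if denom > 0 else strength[m]
        # renormalize so the geometric mean stays at 1, keeping the scale identifiable
        g = math.exp(sum(math.log(v) for v in new.values()) / len(new))
        strength = {m: v / g for m, v in new.items()}
    # convert multiplicative strengths to the additive ELO scale (400 * log10)
    return {m: base + scale * math.log10(s) for m, s in strength.items()}

# Hypothetical win-loss records: names and counts are illustrative only.
ratings = bradley_terry_elo({("model-a", "model-b"): 60, ("model-b", "model-a"): 40})
```

The confidence and credible-interval figures above would typically be estimated by resampling the matchups (e.g. bootstrapping the fit), which is why they tighten as the sample size grows.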
Additional Metrics
- Accuracy: Exact-match rate of predicted actions per scenario, broken down by ELO bin
- Latency: Average response time per API request (milliseconds)
- Cost: Total USD cost as reported by the model provider
- Performance per Dollar: Overall accuracy divided by total evaluation cost (see the sketch after this list)
- Token Usage: Input and output tokens consumed during evaluation
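As a rough illustration of how these metrics relate, the sketch below aggregates hypothetical per-scenario records. The field names (elo_bin, correct, latency_ms, cost_usd) are assumptions for illustration, not the benchmark's actual schema.

```python
from collections import defaultdict

# Minimal sketch: derive per-bin accuracy, average latency, total cost, and
# performance per dollar from assumed per-scenario result records.
def summarize(records):
    by_bin = defaultdict(lambda: [0, 0])          # elo_bin -> [correct, total]
    total_cost = total_latency = 0.0
    for r in records:
        by_bin[r["elo_bin"]][0] += int(r["correct"])
        by_bin[r["elo_bin"]][1] += 1
        total_cost += r["cost_usd"]
        total_latency += r["latency_ms"]
    n = max(len(records), 1)
    overall_acc = sum(c for c, _ in by_bin.values()) / n
    return {
        "accuracy_by_elo_bin": {b: c / t for b, (c, t) in by_bin.items()},
        "avg_latency_ms": total_latency / n,
        "total_cost_usd": total_cost,
        "performance_per_dollar": overall_acc / total_cost if total_cost else None,
    }
```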
Usage & Setup
Set the OPENROUTER_API_KEY environment variable before running evaluations. Results are written to runs/<version>/<bin>/*.summary.json for analysis and visualization.
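A minimal analysis sketch, assuming only the environment variable and output layout described above; the globbing across versions and bins and the variable names are illustrative, not the benchmark's actual tooling.

```python
import glob
import json
import os

# Sanity check: evaluations require an OpenRouter API key (assumed from the text above).
assert os.environ.get("OPENROUTER_API_KEY"), "set OPENROUTER_API_KEY before running evaluations"

# Load every per-bin summary file under runs/<version>/<bin>/ for downstream analysis.
summaries = []
for path in sorted(glob.glob("runs/*/*/*.summary.json")):
    with open(path) as f:
        summaries.append(json.load(f))
print(f"loaded {len(summaries)} summary files")
```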
PokéBench is experimental research software. Results should be interpreted carefully and in context. The benchmark measures a specific type of strategic reasoning under competitive constraints, which may not generalize to other AI capabilities.