PokéBench

Competitive Pokémon decision-making benchmark. Scrapes high-level replays, generates realistic battle scenarios, and evaluates LLM actions against human ground truth. Features pairwise Elo rankings, accuracy metrics, and performance analysis.
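
To make the two headline metrics concrete, here is a minimal sketch of how accuracy against human ground truth and a pairwise Elo update could be computed. It is not taken from the PokéBench code: the function names, the K-factor of 32, the 400-point scale, and the example action strings are all assumptions chosen for illustration.

```python
from typing import Sequence, Tuple


def action_accuracy(predicted: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of scenarios where the model's action matches the human's.

    Hypothetical helper: assumes one action string per scenario.
    """
    if len(predicted) != len(human) or not human:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == h for p, h in zip(predicted, human))
    return matches / len(human)


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (logistic, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings after one pairwise comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    k=32 is an assumed K-factor, not a value from the project.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Illustrative action labels only.
    print(action_accuracy(["move 1", "switch 3"], ["move 1", "move 2"]))
    print(update_elo(1500.0, 1500.0, 1.0))
```

In this sketch, two models that start at the same rating exchange exactly half of K on a decisive comparison, so the ratings stay centered on the starting value.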