What is BalatroBench?
BalatroBench benchmarks Large Language Models playing Balatro, a strategic card game that combines poker hands with roguelike progression. We track how models perform on key metrics, including rounds reached, decision accuracy, and resource efficiency, to gauge their strategic reasoning abilities. The benchmark focuses specifically on tool calling: how reliably a model executes game actions through API interactions.
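In practice, each decision is one tool-call round trip: the model receives the current game state and a set of action tools, and must answer with a call the game engine can execute. Below is a minimal sketch of that loop using an OpenAI-compatible client; the tool name, schema (play_hand), and prompt are hypothetical illustrations, not BalatroBench's actual definitions.

```python
# Hypothetical sketch of one benchmark decision as a tool-call round trip.
# The tool schema below is illustrative; it is not BalatroBench's real API.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "play_hand",  # hypothetical game action
        "description": "Play the selected cards as a poker hand.",
        "parameters": {
            "type": "object",
            "properties": {
                "card_indices": {
                    "type": "array",
                    "items": {"type": "integer"},
                    "description": "Indices of the cards in hand to play.",
                },
            },
            "required": ["card_indices"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Current game state: ... Pick one action."}],
    tools=tools,
)

msg = response.choices[0].message
if msg.tool_calls:   # the model returned a tool call
    call = msg.tool_calls[0]
    print(call.function.name, call.function.arguments)
else:                # no tool call at all: counts against the model
    print("response contained no tool call")
```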
Metrics Measured
- Main Score:
  - Round: average round reached by playing the game
- Tool Calls:
  - Responses with valid tool calls that can be executed in the current game state
  - Responses with valid tool calls that cannot be executed in the current game state
  - Responses without valid tool calls
- Token Usage:
  - In: average input tokens per tool call
  - Out: average output tokens per tool call (including reasoning tokens)
- Performance:
  - [s]: average time per tool call, in seconds
  - [m$]: average cost per tool call, in milli-dollars
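To make the legend concrete, here is a hedged sketch of how these per-call aggregates could be computed from logged records; the field names are assumptions for the sake of the example, not the benchmark's actual data schema.

```python
# Hedged sketch: computing the legend's aggregates from logged tool-call records.
# Field names are illustrative assumptions, not BalatroBench's real schema.
from dataclasses import dataclass

@dataclass
class ToolCallRecord:
    valid: bool          # response contained a parseable tool call
    executable: bool     # the call was legal in the current game state
    input_tokens: int
    output_tokens: int   # includes reasoning tokens
    seconds: float       # wall-clock time for the round trip
    cost_usd: float      # provider cost for the call

def summarize(records: list[ToolCallRecord]) -> dict[str, float]:
    n = len(records)
    return {
        "valid_executable": sum(r.valid and r.executable for r in records) / n,
        "valid_not_executable": sum(r.valid and not r.executable for r in records) / n,
        "invalid": sum(not r.valid for r in records) / n,
        "in_tokens_per_call": sum(r.input_tokens for r in records) / n,
        "out_tokens_per_call": sum(r.output_tokens for r in records) / n,
        "seconds_per_call": sum(r.seconds for r in records) / n,
        "millidollars_per_call": 1000 * sum(r.cost_usd for r in records) / n,
    }
```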
Leaderboards
Model Leaderboard
Compare performance across different LLMs. Models are ranked by their average final round reached, with detailed statistics on costs, speed, and reliability.
Community Strategies
Explore community-submitted strategies and approaches to playing Balatro with a fixed model. You can contribute your own strategies.
Related Projects
BalatroBot
Python framework for automated Balatro interaction. Core API for game state reading and action execution.
BalatroLLM
LLM-powered bot that generates benchmark data. Uses strategies and game state analysis for decisions.
BalatroBench
This site! Processes and visualizes benchmark data into interactive leaderboards and statistics.