What is BalatroBench?

BalatroBench benchmarks Large Language Models (LLMs) playing Balatro, a strategic card game that combines poker hands with roguelike progression. We track how different models perform across key metrics such as rounds reached, decision accuracy, and resource efficiency to understand their strategic reasoning abilities. The benchmark focuses specifically on tool calling: how reliably a model can execute game actions through API interactions.
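To make the tool-call focus concrete, here is a minimal sketch of one scored interaction, assuming an OpenAI-style tool-calling API. The play_hand tool, its card_indices parameter, and the game-state fields are illustrative assumptions, not BalatroBench's actual schema:

```python
# Sketch of the tool-call loop being measured (assumed schema, not
# BalatroBench's real one): the model is offered game actions as tools,
# and each response is bucketed by whether its call can be executed.
import json

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "play_hand",  # hypothetical tool name
            "description": "Play the selected cards as a poker hand.",
            "parameters": {
                "type": "object",
                "properties": {
                    "card_indices": {
                        "type": "array",
                        "items": {"type": "integer"},
                        "description": "0-based positions of the cards to play.",
                    }
                },
                "required": ["card_indices"],
            },
        },
    },
]

def classify_response(tool_call, game_state) -> str:
    """Bucket a model response the way the Tool Calls metric below does."""
    if tool_call is None:
        return "no_valid_tool_call"      # response contained no usable call
    args = json.loads(tool_call["function"]["arguments"])
    indices = args.get("card_indices", [])
    # Executable only if every index points at a card actually in hand.
    if indices and all(0 <= i < len(game_state["hand"]) for i in indices):
        return "valid_executable"
    return "valid_not_executable"        # well-formed, but illegal right now
```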

Metrics Measured

  • Main Score:
    • Round: Average round reached by playing the game.
  • Tool Calls:
    • Responses with valid tool calls that can be executed in the current game state.
    • Responses with valid tool calls that cannot be executed in the current game state.
    • Responses without valid tool calls.
  • Token Usage:
    • In: Average input tokens per tool call.
    • Out: Average output tokens per tool call (including reasoning tokens).
  • Performance:
    • Time [s]: Average time per tool call in seconds.
    • Cost [m$]: Average cost per tool call in milli-dollars.
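For illustration, a sketch of how these per-call numbers could be aggregated into the figures above, assuming each action is logged as a simple record; the field names (input_tokens, latency_s, cost_usd, ...) are assumptions, not BalatroBench's actual log format:

```python
# Aggregating per-call logs into the reported metrics (assumed log
# format). Cost is converted from dollars to milli-dollars.
from statistics import mean

calls = [
    {"input_tokens": 2100, "output_tokens": 480, "latency_s": 3.2, "cost_usd": 0.0041},
    {"input_tokens": 2350, "output_tokens": 610, "latency_s": 4.1, "cost_usd": 0.0052},
]
rounds_reached = [4, 7, 5]  # final round of each game played

report = {
    "round":        mean(rounds_reached),                       # main score
    "in_per_call":  mean(c["input_tokens"] for c in calls),     # input tokens
    "out_per_call": mean(c["output_tokens"] for c in calls),    # output tokens
    "s_per_call":   mean(c["latency_s"] for c in calls),        # seconds
    "m$_per_call":  mean(c["cost_usd"] for c in calls) * 1000,  # milli-dollars
}
print(report)
```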