What is BalatroBench?

BalatroBench benchmarks Large Language Models (LLMs) playing Balatro, a strategic card game that combines poker hands with roguelike progression. We track how different models perform across key metrics such as rounds reached, decision accuracy, and resource efficiency to understand their strategic reasoning abilities. The benchmark focuses specifically on tool calling: how reliably a model can execute game actions through API interactions.
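To make the tool-call focus concrete, here is a minimal sketch of one scored interaction, assuming an OpenAI-style tool-calling API. The play_hand tool, its card_indices parameter, and the game-state fields are illustrative assumptions, not BalatroBench's actual schema:

```python
# Sketch of the tool-call loop being measured (assumed schema, not
# BalatroBench's real one): the model is offered game actions as tools,
# and each response is bucketed by whether its call can be executed.
import json

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "play_hand",  # hypothetical tool name
            "description": "Play the selected cards as a poker hand.",
            "parameters": {
                "type": "object",
                "properties": {
                    "card_indices": {
                        "type": "array",
                        "items": {"type": "integer"},
                        "description": "0-based positions of the cards to play.",
                    }
                },
                "required": ["card_indices"],
            },
        },
    },
]

def classify_response(tool_call, game_state) -> str:
    """Bucket a model response the way the Tool Calls metric below does."""
    if tool_call is None:
        return "no_valid_tool_call"      # response contained no usable call
    args = json.loads(tool_call["function"]["arguments"])
    indices = args.get("card_indices", [])
    # Executable only if every index points at a card actually in hand.
    if indices and all(0 <= i < len(game_state["hand"]) for i in indices):
        return "valid_executable"
    return "valid_not_executable"        # well-formed, but illegal right now
```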

Metrics Measured

  • Main Score:
    • Round: Average round reached by playing the game.
  • Tool Calls:
    • Responses with valid tool calls that can be executed in the current game state.
    • Responses with valid tool calls that cannot be executed in the current game state.
    • Responses without valid tool calls.
  • Token Usage:
    • In: Average input tokens per tool call.
    • Out: Average output tokens per tool call (including reasoning tokens).
  • Performance:
    • Time [s]: Average time per tool call in seconds.
    • Cost [m$]: Average cost per tool call in milli-dollars.
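For illustration, a sketch of how these per-call numbers could be aggregated into the figures above, assuming each action is logged as a simple record; the field names (input_tokens, latency_s, cost_usd, ...) are assumptions, not BalatroBench's actual log format:

```python
# Aggregating per-call logs into the reported metrics (assumed log
# format). Cost is converted from dollars to milli-dollars.
from statistics import mean

calls = [
    {"input_tokens": 2100, "output_tokens": 480, "latency_s": 3.2, "cost_usd": 0.0041},
    {"input_tokens": 2350, "output_tokens": 610, "latency_s": 4.1, "cost_usd": 0.0052},
]
rounds_reached = [4, 7, 5]  # final round of each game played

report = {
    "round":        mean(rounds_reached),                       # main score
    "in_per_call":  mean(c["input_tokens"] for c in calls),     # input tokens
    "out_per_call": mean(c["output_tokens"] for c in calls),    # output tokens
    "s_per_call":   mean(c["latency_s"] for c in calls),        # seconds
    "m$_per_call":  mean(c["cost_usd"] for c in calls) * 1000,  # milli-dollars
}
print(report)
```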