Thought Pokémon was a tough benchmark for AI? Well, some researchers argue that Super Mario Bros. is even tougher. Hao AI Lab at the University of California San Diego recently put AI to the test in live Super Mario Bros. games. Anthropic’s Claude 3.7 came out on top, with Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggling to keep up.
The game wasn’t exactly the same as the original 1985 release, since it ran in an emulator hooked into the lab’s own framework, GamingAgent. The framework fed each model basic instructions plus in-game screenshots, and the model responded with the inputs that controlled Mario.
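To make that loop concrete, here is a minimal sketch of how such an agent might be wired up. This is not GamingAgent’s actual code: the emulator hooks and the model call are stubbed-out placeholders, and the instruction text and action names are assumptions made purely for illustration.

```python
import base64
import time

# Hypothetical stand-ins for an emulator interface and a model client;
# a real framework like GamingAgent wires these up to an actual emulator
# and to vision-language model APIs.
INSTRUCTIONS = (
    "You are playing Super Mario Bros. "
    "If an obstacle or enemy is near, jump or move to dodge. "
    "Reply with exactly one action: LEFT, RIGHT, JUMP, or RUN."
)

VALID_ACTIONS = {"LEFT", "RIGHT", "JUMP", "RUN"}


def capture_screenshot() -> bytes:
    """Grab the current emulator frame as PNG bytes (stubbed here)."""
    return b""


def ask_model(prompt: str, frame_png: bytes) -> str:
    """Send the instructions plus a base64-encoded frame to a model and
    return its raw text reply (stubbed here)."""
    _ = base64.b64encode(frame_png).decode()
    return "RIGHT"


def press(action: str) -> None:
    """Forward the chosen action to the emulator's controller (stubbed)."""
    print(f"pressing {action}")


def agent_loop(steps: int = 10, poll_interval: float = 0.5) -> None:
    """Screenshot -> model -> button press, repeated while the game runs."""
    for _ in range(steps):
        frame = capture_screenshot()
        reply = ask_model(INSTRUCTIONS, frame).strip().upper()
        action = reply if reply in VALID_ACTIONS else "RIGHT"  # safe default
        press(action)
        time.sleep(poll_interval)  # the game keeps running while we wait


if __name__ == "__main__":
    agent_loop()
```

The key design point is that the game never pauses for the model: every millisecond spent between `capture_screenshot` and `press` is a stretch of gameplay the agent simply misses.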
Interestingly, the lab found that reasoning models, which “think” through problems step by step, performed worse than “non-reasoning” models in the game. The problem is latency: reasoning models take longer to decide on each action, and in a real-time game like Super Mario Bros., even a second of hesitation can mean a missed jump.
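Some back-of-the-envelope arithmetic shows why that hurts. The per-decision latencies below are illustrative assumptions, not figures from the lab’s experiments; the point is simply how many frames slip by while a model deliberates.

```python
# How decision latency translates into "blind" frames.
# NES games render at roughly 60 frames per second; the latencies here
# are made-up examples, not measurements from Hao AI Lab's runs.
FPS = 60

for label, latency_s in [("non-reasoning model", 0.5), ("reasoning model", 5.0)]:
    frames_missed = int(latency_s * FPS)
    print(f"{label}: ~{latency_s:.1f}s per decision -> ~{frames_missed} frames pass before Mario reacts")
```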
While some experts question whether gaming prowess says much about real technological progress, the popularity of flashy benchmarks like this one points to what former OpenAI researcher Andrej Karpathy has called an “evaluation crisis”: it’s no longer clear which metrics to pay attention to when evaluating AI models. But hey, at least we can enjoy watching AI play Mario.
