Four artificial intelligence developers have created a benchmark that pits different large language models (LLMs) against each other in Street Fighter III, The Register reports.

The benchmark, called LLM Colosseum, was created during the Mistral hackathon in San Francisco last month. It describes everything happening in the game to the language model as text, and the model responds with its next move according to the rules of the game.
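In rough terms, the loop looks something like the sketch below. This is illustrative only: the function names, move list, and prompt wording are assumptions for the sake of the example, not the project's actual code.

```python
# Illustrative sketch of the LLM Colosseum idea: each step, the game state is
# summarized as text, the LLM picks a move, and bad answers fall back to blocking.
MOVES = {"move_closer": "forward", "fireball": "down+forward+punch", "block": "back"}

def describe_state(state: dict) -> str:
    """Turn raw game state into a short natural-language prompt."""
    return (
        f"You are playing Street Fighter III. Your health: {state['own_health']}, "
        f"opponent health: {state['opp_health']}, distance: {state['distance']}. "
        f"Choose one move from: {', '.join(MOVES)}."
    )

def choose_move(llm_call, state: dict) -> str:
    """Ask the LLM (any callable that maps prompt -> text) for its next move."""
    answer = llm_call(describe_state(state)).strip().lower()
    return answer if answer in MOVES else "block"
```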

The project is available on GitHub, so anyone can run the benchmark themselves.

According to the official LLM Colosseum leaderboard, based on 342 matches between eight different LLMs, GPT-3.5 Turbo is the clear winner with an Elo rating of 1,776.11, far ahead of several GPT-4 variants, which score in the 1,400 to 1,500 range.
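For readers unfamiliar with Elo, the ratings above come from a standard pairwise update after each match. The sketch below shows the textbook formula; the K-factor and starting rating used by LLM Colosseum are assumptions here, not confirmed values.

```python
# Standard Elo update: the winner takes points from the loser, weighted by how
# surprising the result was.
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 20) -> tuple[float, float]:
    """Return updated ratings; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    expected_b = 1 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - expected_b)
    return new_a, new_b

# Example: a 1,500-rated model beating a 1,700-rated one gains roughly 15 points.
print(elo_update(1500, 1700, score_a=1))
```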

According to Nicolas Oulianov, one of the developers of LLM Colosseum, what makes a model good even at Street Fighter III is a balance of key characteristics: GPT-3.5 Turbo strikes a good balance between speed and capability, while GPT-4 is a bigger and therefore smarter model, but much slower.

The gap between GPT-3.5 and GPT-4 in LLM Colosseum highlights which qualities the latest LLMs prioritize. According to the developer, existing benchmarks focus too heavily on raw capability rather than speed. In a fighting game, fractions of a second matter, so the extra time a larger model needs to respond can lead to a quick loss.

Another experiment with LLM Colosseum was documented by Amazon Web Services developer Banjo Obayomi, who ran the models on Amazon Bedrock. Dozens of different models took part in that tournament, but Claude was clearly ahead of the competition, sweeping first through fourth place, with Claude 3 Haiku on top.
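Calling a Claude model through Bedrock looks roughly like the snippet below. This is a minimal sketch using the boto3 invoke_model API; the prompt, region, and token limit are illustrative assumptions rather than Obayomi's actual tournament setup.

```python
# Rough sketch of querying Claude 3 Haiku via Amazon Bedrock with boto3.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 50,
    "messages": [
        {"role": "user", "content": "You are playing Street Fighter III. "
                                    "The opponent is close. Reply with one move."}
    ],
}

response = client.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps(body),
)
print(json.loads(response["body"].read())["content"][0]["text"])
```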

There were also cases where LLMs simply refused to play. AI models are typically tuned to avoid violence and often decline to respond to prompts they consider too violent. Claude 2.1 was particularly pacifistic, stating that it could not tolerate even fictional fights.

Compared with human players, however, these chatbots are nowhere near professional level. The developer played against the models himself and reported that they could only beat the likes of a 70-year-old or a five-year-old player.