Meta, Google, and Amazon accused of manipulating AI benchmark rankings
Researchers warn of “distorted playing field” in industry-standard Chatbot Arena tests
A group of AI researchers has accused major tech companies—including Meta, Google, and Amazon—of undermining the integrity of one of the most influential AI model benchmarking tools in the industry, calling into question the transparency and scientific credibility of leaderboard rankings on Chatbot Arena.
Sara Hooker, head of non-profit research group Cohere Labs, and her colleagues have published an analysis revealing systemic issues in the way models are submitted and ranked on the popular evaluation platform. Their findings suggest that the system is being gamed by industry giants who selectively release only high-performing model variants, skewing the public perception of AI model performance.
“If you can choose what score to post, we’re not doing science any more,” said Hooker. “This was a very uncomfortable paper to write... it’s a low for AI rigour.”
How Chatbot Arena works—and how it's being gamed
Chatbot Arena has emerged as a widely used benchmark for comparing AI models. It enables users to submit prompts and compare anonymized outputs from two models head-to-head, with the winner chosen by user vote. These outcomes feed into a public leaderboard that ranks the best-performing models.
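For readers unfamiliar with how such rankings are computed, the sketch below shows in simplified Python how pairwise votes can be aggregated into an Elo-style rating. The model names and votes are invented for illustration, and Chatbot Arena's production scoring (which has been described as Bradley-Terry based) is considerably more sophisticated than this toy version.

```python
# Minimal sketch: turning pairwise votes into a leaderboard with an
# Elo-style update. Model names and votes are made up for illustration;
# Chatbot Arena's actual scoring pipeline is more elaborate.
from collections import defaultdict

K = 32  # update step size (a conventional Elo constant, chosen here for illustration)

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo/logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed outcome of one head-to-head vote."""
    e_w = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same score
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]

for winner, loser in votes:
    update(ratings, winner, loser)

# Sort descending by rating to produce the "leaderboard"
for model, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}")
```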
While the platform is meant to provide an objective, community-driven measure of performance, Hooker’s team argues that it allows organizations, particularly tech giants, to test numerous private models behind the scenes. Poorly performing models can be quietly removed, while only high-scoring ones are released to the public.
Data from more than two million head-to-head tests conducted between January 2024 and April 2025 showed that Meta tested 27 private variants of its Llama 4 model before release, while Google tested 10 variants in the lead-up to its Gemma 3 launch. None of these appeared on the public leaderboard, but they were active in head-to-head matchups, allowing the companies to gauge performance before choosing which variant to release.
Meta was also found to have tested another 16 model variants across code generation and vision-specific leaderboards, bringing its total to 43 variants evaluated on the platform.
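To see why publishing only the best of many privately tested variants can inflate a score, consider the simulation sketch below. It is not drawn from the paper: the true-skill value, noise level, and trial count are arbitrary assumptions. It simply illustrates the statistical effect the researchers describe, namely that the maximum of several noisy measurements overstates the underlying performance even when every variant is equally good.

```python
# Hedged simulation (not from the paper) of "submit many private variants,
# publish only the best": even if every variant has the same true skill,
# the maximum of several noisy score estimates is biased upward.
import random
import statistics

random.seed(0)
TRUE_SKILL = 1200.0   # hypothetical underlying rating, identical for all variants
NOISE_SD = 30.0       # hypothetical measurement noise in Arena-style scoring
TRIALS = 10_000

def observed_score() -> float:
    """One noisy leaderboard estimate of the same underlying model."""
    return random.gauss(TRUE_SKILL, NOISE_SD)

def best_of(n: int) -> float:
    """Score that gets published if n private variants are tested and only the top one is kept."""
    return max(observed_score() for _ in range(n))

for n in (1, 10, 27):  # 27 mirrors the number of private Llama 4 variants reported above
    mean_published = statistics.mean(best_of(n) for _ in range(TRIALS))
    print(f"variants tested: {n:>2}  mean published score: {mean_published:.1f}")
```

With these made-up parameters, the published score climbs as more private variants are tested, even though no variant is actually better than another.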
Claims of unfair visibility and sampling bias
The paper further criticizes how models are selected for head-to-head battles. According to the analysis, OpenAI and Google’s models received around 20% each of all test prompts on Chatbot Arena, while 41 fully open-source models combined received less than 9%.
This imbalance gives larger firms more exposure, Hooker argues, marginalizing smaller labs and skewing outcomes in favor of corporate players.
Responding to the claims, Anastasios Angelopoulos, co-creator of Chatbot Arena and a researcher at the University of California, Berkeley, said that private models can be removed to avoid scoring non-public systems, and acknowledged that top and new models are “upsampled” to give users more meaningful comparisons.
However, he denied that big tech firms were being granted special privileges: “Nobody is getting preferential treatment,” he said.
Scientists question credibility of benchmarked AI claims
The controversy has reignited long-standing concerns about the tension between scientific transparency and commercial incentives in AI development.
“I have trouble believing any results that people report about their models,” said Sasha Luccioni, a researcher at Hugging Face. “AI research has become so intertwined with commercialisation and profit that it’s no longer just about showing scientific progress.”
Andrew Rogoyski of the University of Surrey echoed those sentiments, suggesting it was no surprise that companies dedicate resources to optimizing performance on public benchmarks.
To improve fairness, Hooker and her team propose that all models receiving a score on Chatbot Arena must remain publicly listed, with no deletions. They also recommend limiting each provider to testing a maximum of three model variants at a time and adopting new sampling rules so that open-source models receive equal exposure.
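The exposure gap that the sampling recommendation targets can be illustrated with a toy sampler. The model names, boost weights, and battle counts below are hypothetical and are not taken from Chatbot Arena's actual sampling code; the sketch simply contrasts a weighted ("upsampled") pair sampler with a uniform one.

```python
# Toy comparison of an "upsampled" pair sampler versus a uniform one.
# Model names and weights are hypothetical; the Arena's real sampler is not
# reproduced here.
import random
from collections import Counter

random.seed(0)
models = ["big-lab-1", "big-lab-2", "open-1", "open-2", "open-3"]
upsample_weight = {"big-lab-1": 5, "big-lab-2": 5}  # hypothetical boost for "top/new" models

def sample_pair(uniform: bool):
    """Draw two distinct models, either uniformly or with upsampling weights."""
    weights = [1 if uniform else upsample_weight.get(m, 1) for m in models]
    a, b = random.choices(models, weights=weights, k=2)
    while b == a:  # re-draw until the two sides differ
        b = random.choices(models, weights=weights, k=1)[0]
    return a, b

for uniform in (False, True):
    exposure = Counter()
    for _ in range(10_000):
        a, b = sample_pair(uniform)
        exposure[a] += 1
        exposure[b] += 1
    total = sum(exposure.values())
    shares = {m: round(100 * exposure[m] / total, 1) for m in models}
    print("uniform" if uniform else "upsampled", shares)
```

Under the weighted sampler the two "big lab" entries absorb most of the battles, which is the kind of imbalance the researchers say disadvantages smaller and open-source models.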
Chatbot Arena developers respond amid growing scrutiny
In a post on X, the developers behind Chatbot Arena acknowledged some of the recommendations as “reasonable,” but claimed the paper contained “a number of factual errors and misleading statements.” They did not specify which claims they disputed and have yet to respond in detail.
Google, Meta, and Amazon have not issued public statements in response to the allegations or data presented in the study.
As benchmark rankings play an increasingly central role in AI research, funding, and public trust, the debate underscores the urgent need for greater transparency and standardized ethical guidelines in model evaluation.
Stay tuned to The Horizons Times for the latest developments in AI accountability, open science, and the battle between Big Tech and independent research.