- The general conceit of this article, one that many frontier labs seem to be realizing, is that the average human is no longer smart enough to provide sufficient signal to improve AI models.
- > It's past time for LMArena people to sit down and have some thorough reflection on whether it is still worth running at all
They've raised about $250 million, so I don't see that happening anytime soon.
- There's something deeply ironic about this being written by AI. Baitception, even.
- >Would you trust a medical system measured by: which doctor would the average Internet user vote for?
Yes, the system desperately needs this. Many doctors commit malpractice for DECADES.
I would absolutely seek out, damn, even pay good money for, the chance to talk with a doctor's previous patients, particularly if they're going to perform a life-changing procedure on me.
- > They're not reading carefully. They're not fact-checking, or even trying.
That’s not how I do it, and I suspect it’s how many other people do it too. I specifically ask questions about niche subjects that I know perfectly well, where it’s very easy for me to spot mistakes.
The first time I used it, that’s what came naturally to mind. I believe it’s the same for others.
- When GPT-4.5 was released, it was miles ahead of the others in its linguistic skill and insight. Yet it was never at the top of the arena; it felt like not everyone was able to appreciate the edge.
- I have to somewhat agree on the "deceptive" answers part: specifically, Grok 4.1 (#3 currently) is psychopathically manipulative and easily hallucinates things to appear more competent, even when there is nothing to base its generated answer on. Gemini 3 Pro (#1) casually subverts the intent of the prompt and rewrites the question, as if there were a literal genie on the other side mocking you with the power of a thousand language lawyers. If you examine the answers and fact-check everything, you will not like the "fake confidence", and the style reads like a scam artist trying to sound professional.
However, LMArena, despite its flaws (reCAPTCHA in 2026?), is the only "testing ground" where you can examine the entire breadth of internet users. Everything else is an incredibly selective, hamstrung, bureaucratic benchmark of pre-approved QA sessions. It doesn't handle edge cases or out-of-distribution content. LMArena is the source of "out-of-distribution" questions that trigger the corner cases and expose weak spots in processing (like tokenization/parsing bugs) or inference inefficiency (infinite loops, stalling, and various suboptimal paths); it's "idiot-proofing" any future interactions beyond sterile test sets.
- Seems like they just raised $150M at a $1.7B valuation. Crazy.
- True, and what you can read between the lines is something deeper.
LLMs are fallible. Humans are fallible. LLMs improve (and improve fast). Humans do not (overall, i.e. "a group of N experts in X", "N random internet people").
All those "Turing tests" will start flipping.
Today it's "N random internet humans" scoring too low on those benchmarks; tomorrow it'll be "a group of N expert humans in X" scoring too low.
- this argument is also broadly true about the quality and correctness of posts on any vote-based discussion board
> Why is LMArena so easy to game? The answer is structural.
> The system is fully open to the Internet. LMArena is built on unpaid labor from uncontrolled volunteers.
Also, all users' votes count equally, but not all users have equal knowledge.
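A minimal sketch of what knowledge-weighted voting could look like, assuming each rater carries a reliability score in [0, 1] (how that score gets estimated is a separate problem; everything here is illustrative, not LMArena's actual method):

```python
# Hypothetical Elo-style update where a vote's impact is scaled by the
# rater's estimated reliability, instead of every vote counting equally.

def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expectation that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def weighted_update(r_a: float, r_b: float, a_won: bool,
                    reliability: float, k: float = 32.0) -> tuple[float, float]:
    """Apply one vote; reliability in [0, 1] scales how far the ratings move."""
    delta = k * reliability * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta
```

A vote from a demonstrably careless rater (reliability near 0) would barely move the ratings, while a careful rater's vote would carry full weight.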
- Any metric that can be targeted can be gamed
- There is wisdom in the crowd, but yes, agreed.
- When the Meta cheating scandal happened, I was surprised how little of the attention was on this.
Meta "cheated" on LMArena not by using a smarter model but by using one that was more verbose and friendly, with excessive emojis.
- Couldn't "The Wisdom of Crowds" help with this?
Maybe if they started ranking the answers on a 1-10 scale, allowing people to specify gradations of correctness/wrongness, then the crowd would work?
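One way the graded-vote suggestion could plug into the existing ranking, as a rough sketch (the 1-10 scale and the mapping are assumptions, not anything LMArena does):

```python
# Hypothetical margin-aware outcome: raters grade each answer 1-10, and the
# gap between the two grades (not just who won) determines the soft outcome.

def graded_outcome(score_a: int, score_b: int) -> float:
    """Map two 1-10 grades to a soft outcome in [0, 1] for model A."""
    margin = (score_a - score_b) / 9.0  # normalized to -1.0 .. 1.0
    return 0.5 + 0.5 * margin           # 0.0 = decisive loss, 1.0 = decisive win
```

A soft outcome like this can replace the binary win/loss in an Elo or Bradley-Terry update, so a 7-vs-6 vote moves the leaderboard far less than a 10-vs-2 vote.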
- From https://lmarena.ai/how-it-works:
> In battle mode, you'll be served 2 anonymous models. Dig into the responses and decide which answer best fits your needs.
It's not a given that someone's needs are "factual accuracy". Maybe they're after entertainment, or winning an argument.
- > It's like going to the grocery store and buying tabloids, pretending they're scientific journals.
This is pure gold. I've always found this approach of running evals against a moving target via consensus to be broken.
- > Being verbose. Longer responses look more authoritative!
I know we can solve this in ordinary tasks just by prompting, but it's really annoying. Sometimes I just want a yes-or-no answer, and instead I get a PhD thesis on the matter.
- Aside from Meta, is there any reason to think the big AI labs are still using LMArena data for training? The weaknesses are well understood, and with the shift to RL there are so many better ways to design a reward function.
- Is there a reason wrong data isn't considered still valuable in its broader context?
Shouldn't the model effectively 1. learn to complete the incorrect thing and 2. learn the contexts in which it's correct and incorrect? In this case the context being lazy LMArena users, and presumably, in the future, poorly filtered training data.
We seem to be able to read incorrect things and not be corrupted (well, theoretically). It's not ideal, but it seems an important component of intellectual resilience.
It seems like the model knowing the data comes from LMArena, or some other untrusted source, would be sufficient to shift the prior to a reasonable place.
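A rough sketch of that idea, assuming a pretraining pipeline that can prepend metadata tokens (the tag names and sources here are made up for illustration):

```python
# Hypothetical provenance tagging: prefix each training example with a trust
# tag so the model can learn source-conditional behavior.

TRUST_TAGS = {"lmarena": "<untrusted-crowd>", "textbook": "<trusted>"}

def tag_example(text: str, source: str) -> str:
    """Prefix raw training text with a provenance token."""
    return f"{TRUST_TAGS.get(source, '<unknown-source>')} {text}"

print(tag_example("The capital of France is Lyon.", "lmarena"))
# -> "<untrusted-crowd> The capital of France is Lyon."
```

At inference time you would condition on the trusted tag, so the model reproduces the untrusted distribution only when explicitly asked to.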
- AI is a cancer on humanity
- Is there any reason to believe LMArena isn't botted by the people releasing these models?
- We need a service that ranks AI model ranking services. Maybe powered by AI instead of humans?
- maybe it would work if they could encourage end users to be rigorous? (i.e., detect whether they have the capability to rate well, and then reward them when they do, by comparing them against other highly rated raters of the same phenotype)
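A minimal sketch of that rater-scoring idea, iteratively weighting each rater by agreement with the weighted consensus (all names and the fixed-point scheme are illustrative assumptions, roughly in the spirit of truth-discovery algorithms like Dawid-Skene):

```python
# Hypothetical rater-reliability estimation: start with equal weights, compute
# a weighted consensus per battle, re-weight raters by how often they agree
# with it, and repeat.

from collections import defaultdict

def estimate_reliability(votes: list[tuple[str, str, str]],
                         n_iters: int = 10) -> dict[str, float]:
    """votes: (rater_id, battle_id, choice) triples; returns rater -> weight."""
    weight: dict[str, float] = defaultdict(lambda: 1.0)
    for _ in range(n_iters):
        # 1. Weighted consensus choice for each battle.
        tally: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
        for rater, battle, choice in votes:
            tally[battle][choice] += weight[rater]
        consensus = {b: max(c, key=c.get) for b, c in tally.items()}
        # 2. Re-weight each rater by their agreement rate with the consensus.
        agree: dict[str, float] = defaultdict(float)
        total: dict[str, int] = defaultdict(int)
        for rater, battle, choice in votes:
            agree[rater] += choice == consensus[battle]
            total[rater] += 1
        for rater in total:
            weight[rater] = max(agree[rater] / total[rater], 0.05)  # keep a floor
    return dict(weight)
```

Raters who consistently agree with the careful majority end up with high weight, which could then gate rewards or vote influence.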
- Um, yes, that's why you rely on LMArena (core) results only to judge answering style and structure. I thought this was common knowledge.
- Since AI is itself a cancer, maybe this is good? The cancer of my cancer is my chemo.
- > Voilà: bold text, emojis, and plenty of sycophancy – every trick in the LMArena playbook! – to avoid answering the question it was asked.
This is hard to swallow.
I don't believe a single word this article says. Apparently the "real author" (the human being who wrote the original prompt to generate this article) only intends to use it to generate clicks and engagement and doesn't care at all about what's in it.
- Has anyone else noticed that there isn't a single AI karma company?
The idea is simple*: instead of users rating content, AI does it based on fact-checking.
None. Zero products or roadmaps on that.
Worse than that, people don't want this. It might tell them that they are wrong, with no chance to get their buddies to upvote them or to game the system socially. It would probably flop.
Both AI companies and users want control, they want to game stuff. LMArena is ideal for that.
---
* I know it's a simple idea, but hard to achieve, and I'm not underestimating the difficulty. It doesn't matter though: no one is even signaling the intention of solving it. Intentions to solve harder problems have been signaled (protein research, math).
- > What actually happens: random Internet users spend two seconds skimming, then click their favorite.
> They're not reading carefully. They're not fact-checking, or even trying.
Uhhh, how was that established?
- And AI is a cancer on humanity... this article is clearly LLM-written too.
- The average person is dumber than an LLM in terms of their grasp of facts and basic arithmetic.
A voting system open to the public is completely screwed even if somehow its incentives are optimized toward strongly encouraging ideal behavior.
