Intro: The Hidden Challenge in LLM Selection
Choosing the right LLM for your workload isn’t just about picking the latest open-source release or switching to a cheaper closed model. If you’re self-hosting language models - whether for RAG pipelines, agents, or fine-tuned data tasks - knowing how good a model really is (and compared to what) is a critical part of that decision.
Most teams rely on academic benchmarks like MMLU, ARC, or HumanEval. But these don’t always reflect real-world usage. Benchmark scores may go up while actual task performance stays flat.
The only way to evaluate models with complete confidence would be to build an in-house evaluation pipeline tailored to your exact use case. That means defining your task - whether it's data extraction, question answering, or multi-step reasoning - then collecting example documents, crafting queries, running each model in a controlled environment, and comparing results against a gold standard set you’ve manually verified.
This lets you directly compare open and closed-source models on your terms. But there's a catch: it’s incredibly time-consuming, complex, and expensive to do well.
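To make that concrete, here is a minimal sketch of such a harness in Python. Everything in it is illustrative: the exact-match scoring, the GoldExample shape, and the model-calling functions are placeholders you would swap for your own task definition, metric, and inference clients.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldExample:
    query: str     # a query you crafted against your example documents
    expected: str  # the manually verified "gold" answer

def evaluate(model: Callable[[str], str], gold_set: list[GoldExample]) -> float:
    """Run one model over the gold set and return its accuracy.
    Exact-match scoring is a placeholder; real tasks usually need a
    task-specific metric (F1, rubric grading, etc.)."""
    correct = sum(
        1 for ex in gold_set
        if model(ex.query).strip() == ex.expected.strip()
    )
    return correct / len(gold_set)

# Usage sketch: call_open_model / call_closed_model are hypothetical wrappers
# around your inference endpoints.
# scores = {
#     "open-model": evaluate(call_open_model, gold_set),
#     "closed-model": evaluate(call_closed_model, gold_set),
# }
```

Even this toy version hints at the cost: every new task means new gold data, a new metric, and new plumbing.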
So how do you compare models in a way that actually reflects user experience?
Enter LM Arena: a public leaderboard built on human preferences, blind rankings, and model-to-model comparisons. In this blog, we’ll break down how it works, what it tells us, and why it’s one of the most practical tools for evaluating LLM quality, especially when you’re deciding whether to switch from closed-source to open-source models.
Understanding the Landscape: Open vs Closed Source
Closed-source models (like GPT-4 or Claude 3) remain the gold standard - but they’re expensive, not always tunable, and can’t be self-hosted. Open-source models are catching up fast, and newer entrants like Qwen 3, Gemma, and Kimi are surprisingly performant.
But comparing them requires more than benchmark wins. That’s where LM Arena comes in.
What Is LM Arena?
LM Arena is a public evaluation tool from LMSys.org that uses blind, head-to-head comparisons: users see two anonymized model outputs and vote for the one they prefer. Behind the scenes, it calculates an Elo score - a rating method borrowed from chess - to determine which models win more often in direct comparison.
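To see how the rating mechanics work, here is a minimal Elo update in Python after a single blind vote. The K-factor and starting ratings are illustrative defaults, not LM Arena’s exact parameters.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    score_a = 1.0 if a_won else 0.0
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Two models start level; model A wins one blind battle.
print(update_elo(1000, 1000, a_won=True))  # (1016.0, 984.0)
```

The key property is that beating a higher-rated model moves your score more than beating a lower-rated one, so rankings converge toward genuine win rates as votes accumulate.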
Why This Matters:
- It simulates real-world evaluation by real users
- It covers open and closed models
- It’s hard to "hack" without actually improving output quality
It also provides category filters (e.g. code, instruction, math, long-form), so you can evaluate models in domains specific to your use case.
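If you prefer to slice the rankings programmatically rather than in the web UI, something like the following works against a CSV export of the leaderboard. The file name and column names here are hypothetical; adapt them to however you pull the data.

```python
import pandas as pd

# Hypothetical leaderboard export with columns: model, category, elo
df = pd.read_csv("arena_leaderboard.csv")

# Keep only the category that matches your workload, then rank by rating.
coding = df[df["category"] == "coding"].sort_values("elo", ascending=False)
print(coding[["model", "elo"]].head(10))
```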
Pros and Cons of LM Arena as an Evaluation Tool
Pros:
- Human-centered: evaluates what users actually prefer
- Domain-filtered: lets you focus on tasks like coding, math, or dialogue
- Transparent: you can see battle examples and individual win rates
- Good historical data: compare today's models to previous SOTA baselines
Cons:
- Some known biases (longer responses often win)
- Limited to general-purpose tasks (not fine-tuned domains)
When LM Arena is (and isn’t) Enough
LM Arena is a great first stop when you’re exploring open-source alternatives to SOTA proprietary models, sizing up new model releases, or weighing latency-versus-quality tradeoffs in local setups.
But if you're building for narrow domains - e.g., genomics, finance, therapeutics - you’ll eventually need custom evaluation datasets, internal A/B testing frameworks, and user feedback pipelines.
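For the feedback-pipeline piece, the core idea is the same blind pairwise voting LM Arena uses, just pointed at your own prompts and users. Here is a minimal in-memory sketch; the model names, storage, and print-based UI are all placeholders you would replace in a real system.

```python
import random
from collections import defaultdict

votes = defaultdict(lambda: {"wins": 0, "battles": 0})  # per-model tallies

def blind_battle(prompt: str, outputs: dict[str, str]) -> dict[str, str]:
    """Pick two models at random, show their outputs anonymized as A/B,
    and return the label -> model mapping for later scoring."""
    (m_a, out_a), (m_b, out_b) = random.sample(list(outputs.items()), 2)
    print(f"Prompt: {prompt}\n\n[A] {out_a}\n\n[B] {out_b}")
    return {"A": m_a, "B": m_b}

def record_vote(labels: dict[str, str], winner_label: str) -> None:
    """Tally a human vote without ever revealing which model was which."""
    for label, model in labels.items():
        votes[model]["battles"] += 1
        if label == winner_label:
            votes[model]["wins"] += 1

def win_rates() -> dict[str, float]:
    return {m: v["wins"] / v["battles"] for m, v in votes.items() if v["battles"]}
```

Swap the in-memory dict for a database and the print call for your actual UI, and you have the skeleton of an internal A/B pipeline.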
Still, for general-purpose tasks and exploratory evaluation, LM Arena offers the most useful public signals available today.
Conclusion: Let Real Users Decide
Benchmarks are helpful, but they’re no substitute for human judgment. LM Arena is one of the few public tools that reflects how real people experience LLM outputs - and that makes it uniquely valuable for anyone comparing models, especially in self-hosted or hybrid stacks.
If you’re evaluating a switch from closed to open source, or between open models, this should be your first stop.