The Mediocre Anchor: Why Your Strongest Model Makes the Worst Judge
The Evaluation Scalability Crisis
The meteoric rise of large language models has precipitated a parallel crisis in evaluation methodology. Traditional reference-based metrics such as BLEU or ROUGE, designed for the narrow constraints of machine translation or summarization, have proven inadequate for assessing open-ended generation, reasoning, and instruction following. Human evaluation remains the gold standard, but its cost and time scale linearly with the number of outputs assessed, rendering it impractical for the rapid iteration cycles characteristic of modern model development.
In response, the community has widely adopted the "LLM-as-a-judge" paradigm, where a capable model evaluates the outputs of other systems. This approach correlates reasonably well with human preferences and offers the automation necessary for benchmarking at scale. Yet this scalability introduces its own computational bottlenecks. Exhaustive pairwise comparison of M models requires O(M²) evaluations, a quadratic explosion that becomes prohibitive as leaderboards expand to dozens of entries.
The prevailing solution, implemented in prominent benchmarks such as Arena-Hard and AlpacaEval, is anchor-based evaluation. Rather than comparing every model against every other, practitioners designate a single anchor model and compare all candidates exclusively against this fixed reference point. This reduces complexity to O(M), but introduces a critical question that has remained largely unexamined: which model should serve as the anchor?
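The arithmetic behind this trade-off is easy to make concrete. The sketch below (illustrative: the function name is mine, and the default of 750 samples per pair is taken from the Arena-Hard-v2.0 figure cited later in this review) counts judge calls under the two schemes:

```python
from math import comb

def comparisons_needed(num_models: int, samples_per_pair: int = 750) -> dict:
    """Judge calls for exhaustive pairwise vs. anchor-based evaluation."""
    pairwise = comb(num_models, 2) * samples_per_pair  # O(M^2) pairs
    anchored = (num_models - 1) * samples_per_pair     # O(M) vs. one anchor
    return {"pairwise": pairwise, "anchored": anchored}

for m in (10, 30, 100):
    c = comparisons_needed(m)
    print(f"{m} models: pairwise={c['pairwise']:,}  anchored={c['anchored']:,}")
```

At 100 models the exhaustive scheme requires 50× more judge calls than the anchored one, which is precisely why leaderboards collapse everything onto a single reference point.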
In their recent work, "Mediocrity is the key for LLM as a Judge Anchor Selection," Don-Yehiya et al. conduct a systematic investigation that challenges conventional wisdom. Through an extensive empirical study involving over 850,000 pairwise comparisons across 22 different anchor models on the Arena-Hard-v2.0 dataset, the authors demonstrate that anchor selection is not merely an implementation detail but a fundamental determinant of evaluation reliability. Their findings reveal a counterintuitive inverted U-shaped relationship between model capability and anchor quality, with profound implications for how we design benchmarks and conduct production evaluations.
The Inverted U: When Strength Becomes Weakness
The central empirical finding of Don-Yehiya et al. is elegantly captured in Figure 1 of their paper, which plots Kendall's τ correlation between anchor-based rankings and the quadratic "ground truth" ranking against the anchor's position in that ranking. The resulting curve forms a distinct inverted U, with the highest correlation achieved by mid-ranked models and dramatic degradation at the extremes.
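For readers unfamiliar with the metric, Kendall's τ measures rank agreement by counting concordant versus discordant item pairs. A minimal tau-a implementation (ties not handled; the toy rankings are mine, not the paper's data):

```python
from itertools import combinations

def kendall_tau(rank_a: dict, rank_b: dict) -> float:
    """Kendall's tau-a between two rankings of the same items.

    rank_a, rank_b: dicts mapping item -> rank position (1 = best).
    Counts concordant vs. discordant pairs; assumes no tied ranks.
    """
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # A pair is concordant if both rankings order it the same way.
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs

# Identical rankings give tau = 1.0; one swapped adjacent pair lowers it.
truth  = {"m1": 1, "m2": 2, "m3": 3, "m4": 4}
anchor = {"m1": 1, "m2": 3, "m3": 2, "m4": 4}
print(kendall_tau(truth, anchor))
```

In the paper's setting, `rank_a` would be the quadratic "ground truth" ranking and `rank_b` the ranking induced by win rates against a given anchor.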
Strong models make poor anchors. When the authors selected top-tier models such as o3 or GPT-4 as anchors, they observed severely diminished discriminative power. The mechanism is straightforward but insidious: these models win too consistently. In the Arena-Hard-v2.0 benchmark, o3 defeats competitor models in approximately 500 out of 750 samples, rendering two-thirds of the evaluation budget statistically inert. When an anchor dominates nearly all comparisons, the resulting win-rate distribution becomes severely skewed, compressing the dynamic range available for distinguishing between the remaining models. The evaluation effectively degenerates into measuring tail behavior, identifying only which models are competitive with the frontier while losing resolution in the mid-range where most practical differentiation occurs.
Weak models are equally problematic. Conversely, selecting low-performing models as anchors induces the opposite skew, where most candidates win consistently. This creates a floor effect that similarly fails to discriminate between capable systems.
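The compression mechanism behind both failure modes can be illustrated with a Bradley-Terry sketch (my simplification, not the authors' model: win probability is the logistic of the skill difference, and the skill values are arbitrary). Two candidates separated by a fixed skill gap produce a wide win-rate gap against a comparably skilled anchor, but a nearly invisible one against a far-stronger anchor:

```python
import math

def win_rate(skill_model: float, skill_anchor: float) -> float:
    """Bradley-Terry win probability of the model over the anchor."""
    return 1 / (1 + math.exp(skill_anchor - skill_model))

def observed_gap(anchor_skill: float, skill_a: float = 0.0, skill_b: float = 0.5) -> float:
    """Win-rate gap between two models with a fixed 0.5 skill gap,
    each scored against the same anchor."""
    return win_rate(skill_b, anchor_skill) - win_rate(skill_a, anchor_skill)

# A mid-tier anchor near the candidates vs. a frontier anchor far above them.
for anchor in (0.25, 4.0):
    print(f"anchor skill {anchor:+.2f}: observed win-rate gap = {observed_gap(anchor):.3f}")
```

The logistic curve is steepest at 50%, so the same underlying quality difference is most visible when the anchor is evenly matched with the candidates; a symmetric argument applies to anchors far below them.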
The authors quantify this degradation precisely. A poor anchor choice can reduce correlation with human rankings by up to 0.30, and correlation with quadratic rankings by up to 0.19. To put this in context, they demonstrate that the effect size of anchor selection is comparable to the effect size of judge model selection. We obsess over whether to use GPT-4, Claude, or DeepSeek as our evaluator, yet pay little attention to what we compare against; this research suggests we should weigh these decisions equally.
Statistical Power and the Sample Size Crisis
Beyond the correlation analysis, Don-Yehiya et al. investigate the statistical foundations of anchor-based evaluation through power analysis. Their methodology introduces the concept of informativeness rate: the proportion of comparisons against an anchor that actually provide signal about relative model quality, as opposed to inevitable victories or defeats.
Extreme anchors inherently suffer from low informativeness rates. When comparing two competitive mid-tier models against a frontier anchor, both may lose consistently, providing no basis for distinguishing between them. The authors calculate that detecting a small effect size (a 5% performance difference) with standard statistical power requirements demands sample sizes exceeding the size of the current Arena-Hard-v2.0 benchmark when using low-informativeness anchors.
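One plausible operationalization of an informativeness rate (this scoring is my sketch, not necessarily the paper's exact definition) is how balanced a candidate's record against the anchor is, since a lopsided record leaves little room for outcomes to carry signal:

```python
def informativeness_rate(outcomes: list) -> float:
    """Score how balanced a candidate's record vs. the anchor is.

    outcomes: per-sample results against the anchor, "win" / "loss" / "tie".
    Doubles the minority share of decided outcomes, so a 50/50 record
    scores 1.0 and a clean sweep scores 0.0.
    """
    wins = sum(o == "win" for o in outcomes)
    losses = sum(o == "loss" for o in outcomes)
    decided = wins + losses
    if decided == 0:
        return 0.0
    return 2 * min(wins, losses) / decided

# An o3-style anchor winning ~500 of 750 samples vs. a balanced mid-tier anchor.
dominant = ["loss"] * 500 + ["win"] * 250
balanced = ["loss"] * 375 + ["win"] * 375
print(informativeness_rate(dominant), informativeness_rate(balanced))
```

Under this scoring, the o3-style record from the benchmark reaches only about two-thirds of the informativeness of a perfectly balanced one, and the penalty grows sharply as the anchor's dominance increases.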
This finding exposes a structural vulnerability in current benchmarking practices. Standard benchmark sizes, while sufficient for quadratic pairwise evaluation, are statistically inadequate for anchor-based evaluation with extreme anchors. Practitioners conducting A/B tests or leaderboard evaluations with insufficient samples risk Type II errors, failing to detect genuine improvements between competitive models. The paper suggests that if anchor-based evaluation must be used, either the anchor should be selected to maximize informativeness (i.e., a mediocre model), or the benchmark size must increase substantially, potentially by orders of magnitude for extreme anchors.
Beyond the Paper: Implications for Production Systems
While Don-Yehiya et al. focus on public benchmarks, their findings carry significant weight for industrial evaluation pipelines. The prevailing practice in many organizations is to benchmark new model iterations against the current production model, typically their strongest available system. This paper suggests this approach systematically underestimates the value of incremental improvements.
Consider a team iterating on a 70B parameter model. If they evaluate against GPT-4o or Claude 3.5 Sonnet, and their model wins 15% of the time while a previous iteration won 12%, the signal is buried in noise. However, if they benchmark against a mid-tier open model (perhaps a well-tuned Llama 3 70B or Qwen 72B), the win rate might be 45% versus 42%, providing sufficient dynamic range to detect the improvement with statistical confidence.
This insight offers a pathway to cost reduction as well. Frontier model APIs are expensive; evaluating against them consumes budget rapidly while providing poor signal. Evaluating against smaller, locally hosted mid-tier models reduces both monetary costs and informational waste.
The paper also invites comparison to active learning and information theory. The optimal anchor maximizes the entropy of the preference distribution, ensuring each comparison has high expected information gain. Extreme anchors minimize entropy, producing predictable outcomes. This suggests future work might develop dynamic anchor selection algorithms that adapt to the current model distribution, perhaps beginning with a mediocre anchor and shifting only when models approach parity with it.
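The information-theoretic framing can be checked directly: the Shannon entropy of a single win/loss outcome is maximized when the anchor is evenly matched and collapses toward zero at the extremes. A short sketch (the probed win probabilities are illustrative):

```python
from math import log2

def outcome_entropy(p_win: float) -> float:
    """Shannon entropy (bits) of a single binary win/loss comparison outcome."""
    if p_win in (0.0, 1.0):
        return 0.0  # a foregone conclusion carries no information
    return -p_win * log2(p_win) - (1 - p_win) * log2(1 - p_win)

# Entropy peaks at an even match and collapses against dominant/dominated anchors.
for p in (0.05, 0.50, 0.95):
    print(f"p(win) = {p:.2f}: {outcome_entropy(p):.3f} bits")
```

Each comparison against a mediocre anchor is thus worth roughly a full bit, while a comparison against an anchor that wins 95% of the time yields less than a third of a bit.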
Limitations and Open Questions
The authors acknowledge that their analysis assumes transitivity of preferences, an assumption challenged by recent work showing non-transitive behavior in LLM comparisons. While they argue this does not invalidate their findings regarding anchor quality, it remains a simplifying assumption that may not hold in all domains.
Furthermore, the "mediocrity" of an anchor is defined relative to the specific distribution of models being evaluated. A model that is mediocre on Arena-Hard might be strong on a specialized coding benchmark. The generalizability of specific anchor recommendations across domains remains an open empirical question.
The study also relies on correlation with quadratic rankings as a proxy for ground truth, which itself may be imperfect if judge models exhibit systematic biases. However, the consistency of findings across both quadratic and human rankings strengthens confidence in the core conclusion.
Conclusion: Rethinking the Baseline
The research by Don-Yehiya et al. delivers a necessary correction to evaluation orthodoxy. Comparing against the best available model, or against a weak baseline that guarantees easy wins, is intuitively appealing, yet either choice systematically degrades the quality of our measurements. Instead, mediocre anchors maximize correlation with human judgment and statistical power.
For practitioners, the recommendations are clear. When evaluating small model sets (N ≤ 3), avoid anchor-based evaluation entirely in favor of direct pairwise comparison. When scaling to leaderboard settings where anchors are unavoidable, select models from the middle of the capability distribution rather than the tails. Explicitly report the informativeness rate of your chosen anchor to enable validity assessment. Finally, consider whether your benchmark size is adequate for the effect sizes you hope to detect; standard datasets may be insufficient for discriminating between competitive models using anchor-based methods.
As the field matures beyond the phase of comparing frontier models to random baselines, our evaluation infrastructure must evolve to detect increasingly subtle distinctions. Recognizing that the strongest model makes the worst yardstick is a crucial step toward more rigorous, efficient, and reliable assessment of language model capabilities.