The Problem No One Wants to Talk About
Large language models have quietly become competitive machine translation systems. GPT-4, Claude, Gemini — they all translate between dozens of language pairs with fluency that often matches or exceeds that of dedicated MT systems like Google Translate or DeepL. For high-resource pairs like English↔French or English↔Chinese, the gap between LLM output and professional human translation has narrowed to the point where casual readers can't tell the difference.
But here's the uncomfortable truth: we don't actually have reliable ways to measure whether these translations are correct.
The MT evaluation toolkit we've relied on for two decades — BLEU, METEOR, TER, and even newer learned metrics like COMET and BLEURT — was designed for a world where translation systems made obvious, mechanical errors. Dropped words. Garbled syntax. Wrong tense. These metrics work by comparing system output against human reference translations, and they're reasonably good at catching the kinds of mistakes that statistical and early neural MT systems made.
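To make the mechanics concrete, here's a rough sketch of reference-based scoring using the sacrebleu package (assuming it's installed via pip); the point is only that these metrics reward overlap with a human reference and penalize mechanical deviations like a dropped word.

```python
# Minimal sketch of reference-based scoring with sacrebleu
# (assumes `pip install sacrebleu`). Each metric compares the
# hypothesis against one or more human reference translations.
import sacrebleu

references = [["The cat sat on the mat."]]   # one reference stream
hypothesis = ["The cat sat on mat."]         # mechanical error: dropped word

print(sacrebleu.corpus_bleu(hypothesis, references).score)  # n-gram overlap, penalized
print(sacrebleu.corpus_ter(hypothesis, references).score)   # edit distance to the reference
```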
LLM translations fail differently. They fail by being too fluent.
Fluent but Wrong: The New Failure Mode
Consider this example. Given the Japanese sentence:
彼は顔が広い。 (kare wa kao ga hiroi)
A literal translation: "His face is wide." A correct translation: "He knows a lot of people" or "He's well-connected." This is a common Japanese idiom — 顔が広い means to have a wide social network, not a physically wide face.
Older MT systems would sometimes produce the literal translation, and metrics would flag it as wrong because it didn't match the reference. Fine.
But an LLM might produce: "He's a very sociable person with connections everywhere." This is fluent, natural English. It roughly captures the meaning. But it's also an interpretation that adds connotations not present in the source — "sociable" implies personality traits the Japanese doesn't claim, and "everywhere" exaggerates the scope.
BLEU gives this a low score because it shares few n-grams with the reference. COMET might rate it highly because it's fluent and semantically adjacent. Neither metric captures the real issue: the translation subtly shifts the meaning in ways that matter for actual use.
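You can see that divergence for yourself with a hedged sketch like the one below, assuming the sacrebleu and Unbabel comet packages are installed and the public wmt22-comet-da checkpoint can be downloaded; treat the exact numbers as illustrative, not guaranteed.

```python
# Sketch: BLEU vs. COMET on the idiom example.
# Assumes `pip install sacrebleu unbabel-comet` and network access
# to download the COMET checkpoint on first run.
import sacrebleu
from comet import download_model, load_from_checkpoint

src = "彼は顔が広い。"
ref = "He knows a lot of people."
hyp = "He's a very sociable person with connections everywhere."

# BLEU: low, because the hypothesis shares few n-grams with the reference.
print(sacrebleu.corpus_bleu([hyp], [[ref]]).score)

# COMET: typically far more forgiving of fluent, semantically adjacent output.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
out = model.predict([{"src": src, "mt": hyp, "ref": ref}], batch_size=1, gpus=0)
print(out.system_score)
```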
Why Reference-Based Metrics Are Hitting a Ceiling
The fundamental problem is that reference-based metrics assume there's a small set of "correct" translations, and quality is measured by proximity to those references. This was a reasonable assumption when MT systems produced disfluent output — there were only so many ways to say something correctly.
LLMs generate from a much wider distribution of valid English (or target-language) expressions. Two translations can both be perfectly fluent and natural while capturing different facets of the source meaning, using completely different vocabulary and sentence structures. Reference-based metrics penalize valid diversity.
This is especially acute for:
- Literary translation: Where style, tone, and register matter as much as denotative meaning. A novel translated by an LLM might score poorly on BLEU against a human reference while being preferred by readers.
- Low-resource languages: Where we have few reference translations to compare against, making reference-based evaluation statistically unreliable.
- Document-level coherence: BLEU operates sentence-by-sentence. It can't detect that an LLM inconsistently translated a technical term across paragraphs, or shifted the formality register mid-document.
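The sketch below is not an established metric, just an illustration of the kind of document-level check (here, terminology consistency across segments) that a sentence-level metric has no way to express.

```python
# Illustrative only (not an established metric): measure how consistently a
# source term is rendered across the translated segments of one document.
from collections import Counter

def term_consistency(doc_pairs: list[tuple[str, str]], source_term: str,
                     renderings: list[str]) -> float:
    """Return the share of segments containing `source_term` that use the
    document's most frequent rendering of it. 1.0 means fully consistent."""
    seen = []
    for src, tgt in doc_pairs:
        if source_term in src:
            for rendering in renderings:
                if rendering.lower() in tgt.lower():
                    seen.append(rendering.lower())
                    break
    if not seen:
        return 1.0  # term never appeared; nothing to be inconsistent about
    top_count = Counter(seen).most_common(1)[0][1]
    return top_count / len(seen)

# Example: the same German term translated two different ways mid-document.
doc = [
    ("Der Drehmomentschlüssel muss kalibriert werden.",
     "The torque wrench must be calibrated."),
    ("Prüfen Sie den Drehmomentschlüssel vor jedem Einsatz.",
     "Check the torque spanner before each use."),
]
print(term_consistency(doc, "Drehmomentschlüssel",
                       ["torque wrench", "torque spanner"]))  # 0.5
```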
What the Field Is Trying (and Where It Falls Short)
Learned metrics (COMET, BLEURT, UniTE)
These train neural models to predict human quality judgments. They're better than BLEU — significantly so — but they inherit the biases of their training data. Most are trained on human judgments of older MT systems, so they're calibrated for the error distribution of 2018-era NMT, not 2025-era LLMs. They also tend to overweight fluency relative to adequacy, which is exactly the wrong bias when evaluating LLMs.
Reference-free / QE metrics (CometKiwi, etc.)
Quality estimation metrics that compare source and translation directly, without a reference. Promising in theory — they sidestep the reference bottleneck entirely. In practice, they struggle with exactly the subtle meaning shifts that LLMs produce. If the translation is fluent and topically related to the source, QE metrics tend to rate it highly even when important nuances are lost.
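As a hedged sketch, reference-free scoring with the same comet package looks like this; the wmt22-cometkiwi-da checkpoint is gated on Hugging Face, so downloading it may require a login and license acceptance. Note that the input carries no reference at all.

```python
# Sketch of reference-free QE with the Unbabel `comet` package.
# Assumes the wmt22-cometkiwi-da checkpoint is accessible (Hugging Face
# login and license acceptance may be required).
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))
data = [{
    "src": "彼は顔が広い。",
    "mt": "He's a very sociable person with connections everywhere.",
    # No "ref" key: the model scores the source/translation pair directly.
}]
print(model.predict(data, batch_size=1, gpus=0).system_score)
```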
LLM-as-judge (GEMBA, AutoMQM)
Using one LLM to evaluate another LLM's translation. This has shown surprisingly strong correlation with human judgments in some benchmarks. But it has a fundamental circularity problem: LLMs share similar biases about what "good" translation looks like. If GPT-4 and Claude both think a slightly inaccurate but fluent translation is fine, the LLM-judge approach will agree — and they'll all be wrong together.
There's also the practical issue: if you need GPT-4 to evaluate translations, your evaluation pipeline is now as expensive as your translation pipeline.
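For illustration, a GEMBA-style direct-assessment judge can be approximated with a single prompt. The sketch below uses the OpenAI Python client; the prompt is paraphrased rather than the published GEMBA template, and the model name is an assumption.

```python
# Hedged sketch of an LLM-as-judge scorer in the spirit of GEMBA direct
# assessment. The prompt wording is paraphrased, not the published template,
# and "gpt-4o" is an assumed model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_translation(src: str, mt: str, src_lang: str, tgt_lang: str) -> str:
    prompt = (
        f"Score the following translation from {src_lang} to {tgt_lang} on a "
        f"scale from 0 to 100, where 0 means no meaning is preserved and 100 "
        f"means a perfect translation.\n\n"
        f"{src_lang} source: {src}\n"
        f"{tgt_lang} translation: {mt}\n\n"
        f"Respond with the score only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(judge_translation("彼は顔が広い。",
                        "He's a very sociable person with connections everywhere.",
                        "Japanese", "English"))
```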
MQM-style human evaluation
Multidimensional Quality Metrics (MQM) — where trained annotators categorize specific errors by type and severity — remains the gold standard. But it's slow, expensive, and doesn't scale. Annotating 1,000 sentence pairs takes trained linguists days of work. For rapid iteration on MT systems, this isn't practical.
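The scoring side of MQM is easy to mechanize even though the annotation itself isn't. Here's a hedged sketch using one common severity-weighting convention; the exact weights differ across MQM variants.

```python
# Sketch of converting MQM-style annotations into a numeric penalty.
# Severity weights vary by MQM variant; minor=1, major=5, critical=10
# is one common convention, not a universal standard.
from dataclasses import dataclass

SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 10.0}

@dataclass
class MQMError:
    category: str  # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str  # "minor", "major", or "critical"

def mqm_penalty(errors: list[MQMError]) -> float:
    """Sum of severity weights for one segment; 0 means no errors found."""
    return sum(SEVERITY_WEIGHTS[e.severity] for e in errors)

segment_errors = [
    MQMError("accuracy/mistranslation", "major"),  # e.g. the idiom read literally
    MQMError("style/register", "minor"),
]
print(mqm_penalty(segment_errors))  # 6.0
```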
What We Actually Need
I think the field needs to move in three directions simultaneously:
1. Meaning-decomposition metrics. Instead of comparing surface forms, decompose both source and translation into structured meaning representations — predicate-argument structures, entity coreferences, discourse relations — and compare those. This is hard (it's basically the semantic parsing problem), but partial solutions exist. AMR (Abstract Meaning Representation) parsing has gotten good enough to be useful here, and cross-lingual AMR alignment could give us metrics that catch meaning shifts that surface-level metrics miss.
2. Contrastive evaluation sets. Instead of asking "how good is this translation?" ask "can the system distinguish correct translations from specific types of errors?" Build test sets with minimal pairs: one correct translation and one that's fluent but wrong in a specific, controlled way (wrong idiom interpretation, incorrect number, shifted modality, etc.). This tests whether systems understand meaning, not just whether they produce fluent text. A minimal harness for this kind of check is sketched after this list.
3. Task-based evaluation. The ultimate test of translation quality is whether it works for its intended purpose. Can a doctor use the translated medical record to make correct decisions? Can a lawyer rely on the translated contract? Can a reader enjoy the translated novel? We need evaluation frameworks that measure downstream task performance, not just linguistic similarity.
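As a hedged illustration of the contrastive idea in point 2, the harness below checks whether any scoring function (a metric, a QE model, or an LLM judge, passed in as the hypothetical callable `score_fn`) ranks the correct member of each minimal pair above the perturbed one.

```python
# Hedged sketch of a contrastive (minimal-pair) evaluation harness.
# `score_fn` is a hypothetical callable: it takes (source, translation)
# and returns a quality score where higher means better.
from typing import Callable

MinimalPair = tuple[str, str, str]  # (source, correct_mt, fluent_but_wrong_mt)

PAIRS: list[MinimalPair] = [
    ("彼は顔が広い。",
     "He knows a lot of people.",               # correct idiom reading
     "His face is wide."),                      # fluent but literal and wrong
    ("Elle a rendez-vous mardi prochain.",
     "She has an appointment next Tuesday.",    # correct
     "She has an appointment next Thursday."),  # fluent, wrong day
]

def contrastive_accuracy(score_fn: Callable[[str, str], float],
                         pairs: list[MinimalPair]) -> float:
    """Fraction of pairs where the correct translation outscores the wrong one."""
    wins = sum(score_fn(src, good) > score_fn(src, bad)
               for src, good, bad in pairs)
    return wins / len(pairs)

# Usage: plug in any scorer, e.g. a COMET wrapper or an LLM judge.
# contrastive_accuracy(my_scorer, PAIRS)
```

The appeal of this framing is that it produces an accuracy number with a clear interpretation: the share of controlled error types the evaluation method can actually detect.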
None of these are new ideas. But the urgency has increased dramatically now that LLMs have essentially solved the easy problems (fluency, basic adequacy) for high-resource pairs, and our metrics can't reliably distinguish "good enough" from "subtly wrong."
The Stakes Are Real
This isn't an academic exercise. Machine translation is used in legal proceedings, medical settings, immigration cases, and international commerce. When an LLM confidently produces a fluent translation that subtly misrepresents the source, and our automated metrics say it's fine, real harm can follow.
The MT community has been remarkably successful at improving translation quality over the past decade. The evaluation side hasn't kept pace. Until it does, we're flying blind — producing translations that look better than ever while lacking the tools to verify that they actually are better.
We need to fix the thermometer, not just celebrate that the patient looks healthy.