We tested AI translation on 30 idioms. Here is what happened when the models disagreed.

Idioms are the part of a language that no dictionary fully explains. You can know every word in “to bite the bullet” and still miss the point entirely. They carry history, emotion, and cultural weight that sits outside the literal meaning. That is exactly what makes them so valuable to language learners, and exactly what makes them so hard for machines to handle.

There is a well-documented gap in how AI handles figurative language. Research from Appen (2025) found that up to 47% of contextual meaning is lost in standard machine translation, with idioms, humor, and culturally specific expressions flagged as the most common failure points across more than 20 languages. A February 2026 report from Slator confirmed the pattern: even top-performing LLMs frequently leave idioms untranslated altogether, apparently choosing omission over the risk of getting it wrong.

We wanted to know what that actually looks like in practice, so we ran a structured test. The results did not reveal that AI is simply bad at idioms. They revealed something more interesting: the models disagree with each other, often sharply, and they approach figurative language from very different angles depending on how they were trained.

The test: 30 idioms, 8 AI models, one question

We selected 30 English idioms and ran each through eight leading AI translation models, with Spanish, French, German, and Japanese as the target languages. The idioms were drawn from a set of culturally rooted expressions and organized into three categories:

High-frequency idioms (e.g., “bite the bullet,” “spill the beans”) — widely documented and commonly included in AI training datasets
Mid-frequency idioms (e.g., “burn the midnight oil,” “on the same page”) — common in professional writing
Low-frequency, culture-specific idioms (e.g., “the ball is in your court,” “hit the sack”) — less predictable in non-English contexts

Each model output was reviewed for three criteria: accuracy (did it preserve the intended meaning), naturalization (did it adapt the idiom to an equivalent expression in the target language or explain it), and consistency (did the same model produce the same output on repeated runs).
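As a rough illustration of how a review like this could be recorded, here is a minimal Python sketch. The `Review` class, its field names, the model name, and the sample outputs are all hypothetical; accuracy and naturalization stay as human judgments, and only consistency is computed, under the assumption that "same output" means identical text after trimming whitespace and ignoring case.

```python
from dataclasses import dataclass


@dataclass
class Review:
    """One reviewed model output for one idiom in one target language."""
    model: str
    idiom: str
    language: str
    runs: list[str]     # raw outputs from repeated runs of the same prompt
    accurate: bool      # reviewer judgment: intended meaning preserved
    naturalized: bool   # reviewer judgment: adapted to an equivalent, or explained

    @property
    def consistent(self) -> bool:
        # Assumption: consistent means every repeated run produced the same
        # text once case and surrounding whitespace are ignored.
        return len({r.strip().lower() for r in self.runs}) == 1


# Illustrative values only, not data from the test.
review = Review(
    model="model_a",
    idiom="hit the sack",
    language="es",
    runs=["golpear el saco", "Golpear el saco ", "irse a dormir"],
    accurate=False,
    naturalized=False,
)
print(review.consistent)  # False: the third run differs from the first two
```

Exact matching is a deliberately strict choice. Two runs that differ only in an article or a verb tense would count as inconsistent here, so a real review might want a looser comparison.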

What we did not do was ask the models to translate the idiom literally. We asked them to handle it as a fluent speaker would.

Where single models failed

The failure modes fell into three recognizable patterns.

Literal translation. The most common error. “Hit the sack” became “golpear el saco” in Spanish (to hit the bag) rather than any equivalent of “go to sleep.” The idiom was treated as two nouns and a verb rather than as a fixed expression with independent meaning.

Forced naturalization. Some models tried to substitute a target-language equivalent that did not carry the same weight. A German output for “burn the midnight oil” offered a phrase roughly equivalent to “work late,” losing the implication of sustained, solitary effort entirely.

Omission. Consistent with what Slator’s 2026 benchmark study found, several models simply dropped idiomatic passages in Japanese outputs, producing fluent text that skipped the figurative part of the source sentence.

Across the 30 idioms, no single model produced consistently accurate results in all four target languages. The best-performing individual model in our test achieved correct or near-correct naturalization on 71% of expressions in Spanish, but dropped to 48% in Japanese. The gap between languages was as large as the gap between models. Researchers from ICITTBT 2025 reached a parallel conclusion: AI systems consistently struggle with culturally specific idioms and proverbs, in ways that do not improve predictably with model scale.

What model disagreement actually looks like

The more revealing finding was not where individual models failed. It was how much the models disagreed with each other on the same idiom.

Take “bite the bullet” into Spanish. Across eight models, we received five distinct outputs, ranging from a direct calque (“morder la bala,” which means nothing culturally in Spanish) to “aguantar el dolor” (endure the pain) to the native equivalent “armarse de valor” (to gather courage). Three of the eight models chose a different strategy entirely. There was no consensus on what the correct handling should be.

This is the core challenge idiom translation presents for AI. It is not that models perform poorly on average. It is that their errors are non-overlapping. Model A fails on the cultural substitution while Model B gets it right. Model B fails on the Japanese output while Model C handles it cleanly. If you are relying on a single model, the quality you get is a function of which model you chose, and you typically have no way to know in advance which model will get a given idiom right.
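One way to make "non-overlapping errors" concrete is to compare the sets of items each model got wrong. The sketch below is an assumption about how that could be measured, using Jaccard overlap; the model names and failure sets are invented for illustration and are not the test data.

```python
from itertools import combinations

# Hypothetical failure sets: (idiom, target language) pairs each model got wrong.
failures = {
    "model_a": {("bite the bullet", "es"), ("hit the sack", "es")},
    "model_b": {("burn the midnight oil", "de"), ("spill the beans", "ja")},
    "model_c": {("hit the sack", "es"), ("on the same page", "fr")},
}


def jaccard(a: set, b: set) -> float:
    """Shared failures divided by all failures of the pair (0.0 = no overlap)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Score every pair of models; low scores mean the models fail on different items.
overlap = {
    (m1, m2): jaccard(failures[m1], failures[m2])
    for m1, m2 in combinations(sorted(failures), 2)
}
for pair, score in overlap.items():
    print(pair, round(score, 2))
```

Scores near zero across most pairs would match the pattern described above: each model's mistakes are largely its own, which is what makes comparing models informative.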

For language learners and educators, this variability has real consequences. A student relying on an AI-translated example sentence to understand how an idiom functions in context may receive a naturalized equivalent, a literal rendering, or a blank where the idiom should have appeared. Each outcome teaches something different, and only one of them is correct.

What changes when models are compared against each other

One approach to the disagreement problem is to not choose a single model at all. Instead of asking which model performs best and trusting it, you can run the same text through multiple models simultaneously and look for where they converge.

In our test, when models produced the same output on an idiomatic expression, that output was correct or near-correct in 94% of cases. Majority agreement among AI models was a reliable signal that the translation had captured both the meaning and the naturalization strategy. Where models diverged sharply, the chances of any single output being accurate dropped substantially.
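To check whether agreement really predicts correctness, the reviewed items can be split by how strongly the models agreed and scored separately. The following is a minimal sketch under two assumptions: "agreement" means a strict majority of models produced the same normalized text, and a human reviewer has already judged whether the most common output is correct. The function names and the toy data are hypothetical.

```python
from collections import Counter


def top_share(outputs: list[str]) -> float:
    """Share of models that produced the single most common output."""
    counts = Counter(" ".join(o.lower().split()) for o in outputs)
    return counts.most_common(1)[0][1] / len(outputs)


def accuracy_by_agreement(items, threshold: float = 0.5) -> dict:
    """Split items by whether a strict majority agreed, then score each group.

    Each item is (outputs, top_output_correct); the boolean is the human
    reviewer's verdict on the most common output.
    """
    groups = {"majority": [], "no_majority": []}
    for outputs, correct in items:
        key = "majority" if top_share(outputs) > threshold else "no_majority"
        groups[key].append(correct)
    # None marks a group with no items, so it cannot be scored.
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}


# Toy data for eight models per idiom; not the results of the test.
items = [
    (["armarse de valor"] * 5 + ["morder la bala"] * 3, True),
    (["irse a dormir"] * 6 + ["golpear el saco"] * 2, True),
    (["variant 1", "variant 2", "variant 3", "variant 4"] * 2, False),
]
print(accuracy_by_agreement(items))
```

A large gap between the two groups is what would justify treating agreement as a signal; if both groups scored about the same, agreement would tell you nothing.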

This is the principle behind the architecture used by MachineTranslation.com, an AI translator which compares the outputs of 22 AI models and selects the translation that most of them agree on. Applied to idiomatic content specifically, the approach reduces the reliance on any single model’s training on figurative language, letting the point of convergence across 22 models do the selection work. Internal data from MachineTranslation.com shows that critical translation errors, including mistranslations of fixed expressions, are reduced to under 2% through this consensus approach, compared with a 10–18% error baseline on individual LLMs.

The implication for idiom translation is specific: you are not looking for the best AI at handling figurative language. You are looking for the output that the most AI models agree is correct. When the models converge, that convergence is the signal.
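The selection step itself can be sketched as a simple majority vote. This is not MachineTranslation.com's implementation, which the article does not describe in detail; it is an illustration that assumes exact-text matching after normalization and a hypothetical `min_share` threshold, and it returns nothing when agreement is too weak so the item can go to a human. The Spanish outputs and their distribution are invented.

```python
from collections import Counter
from typing import Optional


def pick_consensus(outputs: dict[str, str], min_share: float = 0.5) -> Optional[str]:
    """Return the output most models agree on, or None if agreement is too weak."""
    if not outputs:
        return None
    normalized = [" ".join(t.lower().split()) for t in outputs.values()]
    winner, votes = Counter(normalized).most_common(1)[0]
    if votes / len(normalized) <= min_share:
        return None  # no strict majority: flag for human review instead
    return winner


def as_models(texts: list[str]) -> dict[str, str]:
    """Label a list of outputs with placeholder model names."""
    return {f"model_{i}": t for i, t in enumerate(texts, start=1)}


# Invented distributions across eight models, for illustration only.
bullet = as_models(
    ["armarse de valor"] * 3
    + ["morder la bala"] * 2
    + ["aguantar el dolor", "hacer de tripas corazón", "apretar los dientes"]
)
sack = as_models(["irse a dormir"] * 6 + ["golpear el saco"] * 2)

print(pick_consensus(bullet))  # None: the top output has only 3 of 8 votes
print(pick_consensus(sack))    # irse a dormir
```

Returning `None` instead of the plurality winner is a design choice: when five different outputs compete, the most popular one is not much better supported than the rest.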

What this means for language learners and educators

Idioms are not an edge case in language learning. They are central to fluency. The gap between understanding the words and using idioms naturally is, in many ways, the gap between intermediate and advanced proficiency. If learners are using AI tools to understand how idiomatic expressions work in a target language, the quality of those tools directly affects what they learn.

The test results suggest three practical implications for anyone learning idioms with AI assistance:

Do not treat any single AI output as definitive for culturally loaded expressions. Compare outputs where you can.
Prioritize AI tools that show you multiple translations side by side, not just a single result. The disagreements are informative.
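Apply extra scrutiny to Japanese, Chinese, and Arabic outputs specifically. Of these, our test covered only Japanese, where the best model fell from 71% in Spanish to 48%, but that result fits the broader pattern that figurative language performance degrades most in languages that are structurally and orthographically distant from English.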

Idiom translation is not a problem that will be solved by any one model getting smarter. It is a problem of confidence without a verification mechanism. Any AI can produce a translation that sounds right. The test is whether it is right across languages, contexts, and edge cases. That is a harder bar.

One finding, one implication

The result that surprised us most was not the error rate. It was how consistent the disagreement pattern was. Models failed on different idioms in different languages, but they disagreed with each other at roughly the same rate regardless of the expression type. High-frequency idioms and low-frequency idioms produced similar levels of inter-model variance.
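The article does not say how inter-model variance was measured. One plausible measure, shown here purely as an assumption, is the fraction of model pairs whose outputs differ for a given idiom, averaged within each frequency category. The category labels and placeholder outputs below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_disagreement(outputs: list[str]) -> float:
    """Fraction of model pairs whose outputs differ for one idiom."""
    pairs = list(combinations(outputs, 2))
    return sum(a != b for a, b in pairs) / len(pairs)


def disagreement_by_category(rows) -> dict[str, float]:
    """rows: iterable of (category, outputs); returns the mean rate per category."""
    rates = defaultdict(list)
    for category, outputs in rows:
        rates[category].append(pairwise_disagreement(outputs))
    return {c: sum(r) / len(r) for c, r in rates.items()}


# Placeholder outputs from four hypothetical models per idiom.
rows = [
    ("high_frequency", ["x", "x", "y", "z"]),
    ("low_frequency", ["p", "p", "q", "r"]),
]
print(disagreement_by_category(rows))
```

Similar rates across categories, as in this toy input, are the pattern the finding describes: disagreement that does not shrink for the idioms the models have presumably seen most often.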

That points to an architectural issue, not a data issue. The models are not failing because they have not seen enough idioms. They are failing because they have no mechanism for checking their own output against an independent standard. A single model, however capable, translates with confidence but without verification.

For anyone building tools that help people understand idiomatic language, or anyone learning a language using AI-generated examples, the practical takeaway is the same: the output you can trust most is the one multiple independent models agreed on. Not the most fluent-sounding one. Not the one from the model with the best general benchmark. The one that the majority confirmed.