Translation Isn’t One Problem: Why We Blend LLMs and NMT at Lingopal

Translation isn’t a single technical problem. It’s a spectrum of tradeoffs.
At Lingopal, we’ve translated everything from congressional hearings to dunk contests. Both are “translation,” but the bar - and the risk - are very different.
Hearings demand repeatable accuracy: names, legal phrasing, policy terms - no improvisation. Dunk contests demand timing and vibe: the words have to land on the same beat as the moment.
The mistake teams make is assuming one model can optimize for both precision and performance. In practice, the strongest systems blend two families of models - LLMs and neural machine translation (NMT) - because they’re built for different jobs.
LLMs (large language models) are trained on vast amounts of text and learn to predict the next token given context. That next-token training is why they’re so good at fluent, flexible language: paraphrasing, smoothing, adapting style, and making a translation sound native. But that same flexibility can introduce variability. Even with conservative decoding, real-world LLM serving can produce small shifts in tone or emphasis across runs - a problem when you need strict one-to-one consistency.
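The determinism tradeoff above comes down to how the next token is chosen from the model's scores. A minimal sketch (the logits are made-up numbers, not real model output): greedy decoding always picks the highest-scoring token, while sampling at any temperature above zero can pick a different token on each run.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution.
    Lower temperature sharpens it toward the top token."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next_token(logits, temperature=0.0, rng=random):
    """Greedy (deterministic) when temperature == 0;
    sampled, hence run-to-run variable, when temperature > 0."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical scores for four candidate next tokens.
logits = [2.1, 2.0, 0.5, -1.0]

# Greedy decoding: same input, same token, every run.
assert pick_next_token(logits, temperature=0.0) == 0

# Sampling: with two near-tied candidates, repeated runs
# surface more than one choice - the "small shifts" above.
picks = {pick_next_token(logits, temperature=1.0) for _ in range(200)}
print(picks)
```

Even at low temperatures, near-tied candidates like the first two logits here are exactly where phrasing drifts between runs.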
NMT models, by contrast, are trained specifically for translation on parallel corpora - aligned sentence pairs (often called “bitext”). Modern NMT systems typically use an encoder–decoder Transformer architecture: the encoder creates a contextual representation of the source sentence, and the decoder generates the target sentence conditioned on it. The Transformer architecture itself was introduced in machine translation, where it achieved strong quality while remaining highly parallelizable.
So which is “better” for translation - LLMs or NMT?
It depends on what “good” means in your workflow. Some industry evaluations still find that domain-trained or custom NMT leads on raw adequacy and terminology accuracy, while LLM-based approaches can come close and add value in fluency and stylistic adaptation.
There’s also a reliability wrinkle that matters in production: hallucinations. Research comparing multilingual NMT systems and GPT-style LLMs prompted for translation shows that models can sometimes generate pathological outputs that drift from the source. That drift undermines trust - especially in low-resource language directions or high-risk contexts.
That’s why serious translation isn’t a single model call. It’s a pipeline:
- Draft for fidelity (often NMT) to preserve meaning and terminology.
- Lock what must not change (names, numbers, branded terms) using glossaries and alignment checks.
- Score adequacy and fluency using metrics designed to correlate with human judgment (e.g., COMET), with human review where risk is high.
- Refine for audience (often an LLM) as a style and clarity layer when you want broadcast-ready phrasing - without changing the facts.
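The pipeline above can be sketched in a few lines. This is a toy illustration, not our production code: the `nmt_draft` and `llm_refine` functions are stubs standing in for real model calls, and the glossary and example sentence are invented. The one piece of real logic is the lock step - a simple check that glossary terms and numbers survive every stage.

```python
import re

def nmt_draft(source: str) -> str:
    """Stub for a domain-trained NMT call (fidelity-first draft)."""
    return "El senador Smith citó la Ley 42 en la audiencia."

def llm_refine(draft: str) -> str:
    """Stub for an LLM style pass (fluency only, not facts)."""
    return draft  # a real system would prompt an LLM here

# Illustrative glossary: source term -> required target rendering.
GLOSSARY = {"Smith": "Smith", "Act 42": "Ley 42"}

def locked_terms_ok(source: str, target: str, glossary: dict) -> bool:
    """Verify glossary renderings appear in the target and that
    every number in the source also appears in the target."""
    if any(term not in target for term in glossary.values()):
        return False
    return set(re.findall(r"\d+", source)) <= set(re.findall(r"\d+", target))

def translate(source: str) -> str:
    draft = nmt_draft(source)                          # 1. draft for fidelity
    assert locked_terms_ok(source, draft, GLOSSARY)    # 2. lock names/numbers
    # 3. adequacy/fluency scoring (e.g. COMET) and human review slot in here
    final = llm_refine(draft)                          # 4. refine for audience
    assert locked_terms_ok(source, final, GLOSSARY)    # re-check after rewriting
    return final

print(translate("Senator Smith cited Act 42 at the hearing."))
```

Re-running the lock check after the refine step is the point: the LLM is free to reshape phrasing, but any output that drops a name or number is rejected before it ships.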
Now the dunk contest vs. the hearing becomes a feature, not a headache. For hearings, you bias toward determinism: strict terminology control, minimal rewriting, aggressive checks. For dunk contests, you unlock the LLM’s superpower: expressive language and pacing - while still safeguarding facts and timing.
Maybe in a few short years, translation really will be “one simple call.” But today, the competitive edge comes from understanding what each model is good at, where it fails, and how to orchestrate them so the output is both trustworthy and alive.

