Voice Matters. But Reliability Wins.

The web is still far more English than the market it serves. As of March 2026, English accounts for ~50% of websites whose content language W3Techs can identify. Yet a European Commission summary of a six-country study found that a majority of users - and in some studies up to 90% - prefer browsing in their own language rather than English, with especially strong demand for local-language news, government, health, ecommerce, and financial sites. CSA Research’s 8,709-consumer survey across 29 countries reached a similar conclusion: 76% of online shoppers prefer buying when product information is in their own language, and 40% will not buy from other-language websites at all. Multilingual communication is no longer a courtesy layer. It is part of the commercial plumbing.
That helps explain why expressive dubbing has become the glamorous frontier of AI translation. In Meta’s Seamless research, evaluation does not stop at whether the words are correct; it also tracks expressivity preservation and vocal-style similarity alongside latency. That is a meaningful step forward, especially in entertainment and creator media, where flat synthetic delivery can make even an accurate translation feel dead on arrival.
But once AI translation leaves the lab and enters live operations - earnings calls, regulatory announcements, multilingual investor briefings, global sports coverage - the question changes from “Does it sound human?” to “Can it be trusted under pressure?” A Harvard Business School summary of research analyzing over 11,000 conference-call transcripts from 4,540 firms in 41 countries found that harder-to-understand language was linked to lower trading volume, more muted price reactions, and wider analyst disagreement even after controlling for the underlying earnings news. Markets, in other words, punish ambiguity before they reward atmosphere.
Broadcast regulation reaches a similar conclusion. The FCC’s captioning standards do not define quality in terms of charm or emotional realism. They define it in terms of accuracy, synchronicity, completeness, and placement, and they explicitly note that accurate captions must not substitute other words for proper names and places. The agency’s live-captioning best practices go further, urging vendors to measure those dimensions, run sample audits, maintain failover systems, minimize service interruptions, and avoid covering essential on-screen graphics such as sports information.
Research on terminology-constrained translation shows that adding glossary terms materially improves term accuracy and often overall quality across language pairs. Apple researchers, meanwhile, have demonstrated that targeted fine-tuning approaches can significantly reduce hallucinated translations across multiple languages, including in zero-shot scenarios. Glossaries and guardrails may sound less magical than voice cloning, but they are what keep a sponsor name, product specification, or earnings figure from drifting into something else.
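The guardrail side of this is mundane but checkable. As a minimal sketch (the glossary entries, example strings, and `check_glossary` helper below are all illustrative, not drawn from any cited system), a post-translation pass can verify that protected terms - sponsor names, product codes, approved renderings - actually survived:

```python
def check_glossary(source: str, translation: str,
                   glossary: dict[str, str]) -> list[str]:
    """Return violations: source terms whose required target
    rendering is missing from the translation."""
    violations = []
    for src_term, tgt_term in glossary.items():
        if src_term in source and tgt_term not in translation:
            violations.append(f"{src_term!r} must render as {tgt_term!r}")
    return violations

# Hypothetical glossary: a product name must pass through unchanged,
# a financial term must use the approved German rendering.
glossary = {
    "Acme FlexDrive 500": "Acme FlexDrive 500",   # never translate
    "quarterly revenue": "Quartalsumsatz",        # approved target term
}

src = "Acme FlexDrive 500 quarterly revenue rose 12%."
bad = "Der Quartalsgewinn von Acme FlexDrive stieg um 12%."
print(check_glossary(src, bad, glossary))
# flags both the truncated product name and the unapproved revenue term
```

A real system would match on token boundaries and inflected forms rather than raw substrings, but the operational point stands: a term drifting into "something else" is detectable mechanically, before it reaches an audience.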
Latency is the clearest place where marketing and production reality part ways. In analyses of simultaneous speech translation, systems can show identical average delay on paper and appear equally fast. But their worst-case behavior can differ dramatically, with tail latency diverging by multiples. The result is that average delay hides spikes large enough to desynchronize source audio and translated output. For a live audience, tail latency is the moment when the experience visibly breaks.
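The arithmetic behind that divergence is easy to demonstrate with a toy example (the numbers below are synthetic, not measurements from any real system): two systems with identical mean delay and very different tails.

```python
import statistics

# System A: steady ~2s delay. System B: usually faster, with rare spikes.
system_a = [2.0] * 100
system_b = [1.5] * 95 + [11.5] * 5   # same mean, ugly tail

def p99(samples: list[float]) -> float:
    """Simple nearest-rank 99th percentile."""
    s = sorted(samples)
    return s[int(0.99 * len(s)) - 1]

for name, s in [("A", system_a), ("B", system_b)]:
    print(f"{name}: mean={statistics.mean(s):.2f}s  p99={p99(s):.2f}s")
# Both report a 2.00s mean; B's p99 is more than 5x A's.
```

Averaged over a broadcast, the two systems look identical. For the five-percent tail, system B's translated output lags the source audio by over eleven seconds - exactly the desynchronization a live audience notices.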
Performance is also far less uniform across languages than headline demos imply. In Meta’s English-centered Seamless work, streaming speech-to-text quality dropped far less for high-resource languages than for low-resource languages, about 10.1% versus 21.5% BLEU loss in one comparison, and the paper notes better quality and lower lag for language families closer to English than for more distant ones such as Sinitic and Japonic. “Works in 100 languages” is not the same thing as “works equally well in 100 languages.”
That is one reason serious evaluation is moving beyond a single headline score. A major 2021 study based on 2.3 million sentence-level human judgments across 4,380 systems concluded that relying on BLEU alone led to bad deployment decisions. Newer approaches try to get closer to what operators actually need: COMET was designed to align more closely with human judgments, while MQM gives teams a standardized way to classify and analyze translation failures. For enterprise buyers, the distinction matters operationally: a dip in general fluency calls for a different response than a failure of terminology, locale fit, or style.
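To make that distinction concrete, here is a simplified sketch of MQM-style scoring (the category names follow common MQM usage; the severity weights and example segments are illustrative, and real MQM profiles define their own weights and normalization):

```python
# MQM-style scoring: classify each error by category and severity,
# weight it, and normalize per 100 words. Lower is better.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def mqm_score(errors: list[tuple[str, str]], word_count: int) -> float:
    """errors: (category, severity) pairs annotated by a reviewer."""
    penalty = sum(SEVERITY_WEIGHTS[sev] for _cat, sev in errors)
    return 100 * penalty / word_count

# Two 200-word segments with the same error COUNT but very
# different failure profiles:
fluency_slips = [("fluency", "minor"), ("style", "minor")]
terminology_miss = [("terminology", "critical"), ("accuracy", "major")]

print(mqm_score(fluency_slips, 200))     # 1.0
print(mqm_score(terminology_miss, 200))  # 7.5
```

Two errors apiece, yet one segment is a cosmetic blemish and the other is a mistranslated term plus an accuracy failure. A single aggregate score hides that difference; a categorized one surfaces it.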
Expressive dubbing can make multilingual media feel more native, more intimate, and more watchable. But in high-visibility communication, voice is the finish, not the foundation. The foundation is semantic precision, bounded adaptation, stable latency, strong observability, and predictable behavior under load.
In global communication, voice matters.
But reliability wins.

