Why is human opinion the gold standard for language quality?

AI can now write and translate at a level that, in many cases, feels close to human output. And yet, humans are still expected to review and approve content before it goes live. Why?

Why are humans still the final decision-makers when it comes to language? If AI can already produce fluent, accurate translations, why can't it also judge the output?

In this article, I explore that question. My name is Erika Ockelfelt, and I’m Senior Director of Operations Excellence, Compliance & Application Support at LanguageWire. With over 13 years of experience in the language services industry, I’ve spent much of my time working at the intersection of language, quality, and technology. I hope to do the question justice and to give you a deeper understanding of how to think about human/AI collaboration.

Judging translation quality requires taste. And taste is human

Language is not just a product, nor is it something separate from us. Language is an extension of us as humans. Right now, as you are reading this to yourself, you are interacting with language on your own. If you were to pause here and ask yourself, “What language do I use when thinking?”, you would probably notice that you’re forming sentences, using vocabulary, and even applying grammar rules to these sentences in your mind.

Language is not only how we talk to one another; it is also part of how we understand and “operate” our own sentient minds.

I’m a non-native English speaker. If you are a native speaker of British or American English, you will likely find some of my phrases and words a little awkward or just “out of tune”. That is because my brain is Swedish first, English second. When I reach my English limit, it is Swedish that kicks in. And often, I’m still translating in my head even when I think I’m not, because I’m not a true bilingual speaker. I grew up hearing, practising and learning Swedish during the language-formative years of my life.

Now, why am I rambling on about myself? Well, to prove a point. The production and consumption of language is a highly personal and internal process. Language is not just a means of communication; it is also a tool for expressing our identity.

Understanding language is an important starting point when addressing quality, because it explains why quality is so difficult to define.

Some elements of language are rule-based; however, a big portion of it is not. That portion is shaped by context and by each person’s individual interaction with language. The same English text will be assessed differently by a native Danish speaker who has lived in Spain for more than 20 years than by a native English speaker living in the UK. Language quality and language assessment are a mixture of objective and subjective elements.

And the subjective part is not only influenced by context we can map, such as culture, medium, sender, and receiver. It also involves the hardest subjective element (in my opinion), one that is almost impossible to quantify: taste.

The function of language is to express, imagine, and connect. It’s a person’s subjective taste that decides whether that function is being fulfilled or not.

I cannot think of a more subjective matter – can you?

Why can’t we just set rules to make language less subjective?

To reduce this subjective element of language, we can apply rules for grammar, spelling and phonetics. These rules fix the right way to write or speak a word, and the order in which to use words, so that, when we follow them, we all (in theory) understand the information the same way.

But rules alone are not enough, because language is not just words.

Language is also gestures, body language, volume, and tone. Language doesn't live in a vacuum. It comes alive when it meets a recipient. The way we absorb language, whether through listening, reading, or touch (braille), is influenced by the processor: an individual human brain.

Each human brain comes with its own unique preset conditions. Variation in how neurons fire, natural ability and intellect, prejudice, experience, and cultural lens all contribute to the way we absorb and produce language.

Humans have tried to make up rules and theories that can quantify the complexity of language, trying to understand and maybe control exactly what happens when it is used for communication.

But this is not fully possible, because both sender and receiver bring their own context, which affects the way language is understood and assessed.

Both the personal context of the language users (culture, age, experience) and the context in which the language is used (public, internal, company) play a role. These contexts may bring their own rules or preferences into the assessment of quality.

The better we can map out the context of the language as part of our assessment of it, the closer we can get to saying something about the language quality.

And we can get very close using the rules of language. However, there will always be a subjective element left that needs to be taken into account, no matter how good we get at quantifying quality.

No one, whether AI or human, can decide what good is; instead, we strive to find the quality expectation.

This complicated formula is what we try to tackle in language quality assessment. Quality assessment is not absolute, and I think it will never work if what you strive for is “good quality”. No one, neither we humans nor AI, gets to say what good quality in language is, because language quality is all about the purpose the language serves. Is it meeting the requirements of the author and audience? Is it appropriate for the context? These are questions that we humans can answer, and that AI can assist us in answering.

When building or performing a language quality assessment, you want a judge to rule based on the conditions and relevant parameters for the case in front of you. I believe you can only judge the quality of language in a specific context, or according to a specific group, or possibly by a majority vote (meaning, what is most common).

Language is intrinsically human, and human is not a quality to define as good or bad.

Human is a scale. It is pliable, always changing, and perfectly flawed. We can only measure if language is meeting our needs and expectations.

AI can assist us here. If we feed it parameters and metrics and weigh the different subjective elements appropriately, an AI can relieve us of the more tedious, error-prone, or repetitive tasks. We can then spend more time on the interesting tasks: the human interaction, the ideation, and being creative. AI can be in the loop, even for quality assessments. Humans, however, should be the ones dictating what AI does or does not do, and how. We should be the ones controlling the levers whenever AI provides us with any form of data that can speak to or indicate language quality. And those levers need to be more considered than just checking spelling or looking for what is most common. They need to include the context, purpose and expected outcome of the produced language.

The final judge is you

To conclude, AI is already good at producing language content for us, and it will become even better. The more data we give it, the more we can allow it to customise the subjective aspects of language. The more contextual detail we feed to the AI model, the better we will judge its output to be. And in the end, the harder it will become for you, the human reader, to distinguish between AI-produced and human-produced content. As AI improves at producing language content, it is also fair to assume it will get better at assessing language content (AI-produced or human-created). From my standpoint, the key element to improving and measuring quality is more human control over the process.

Whatever method you use to estimate the quality of language content, with or without AI, the judgment of whether that is the right quality is up to you, the person consuming the content. And that opinion is unique to you.

High or low quality is not an absolute. The best we can strive for is a majority vote from a jury of humans who are the intended audience of the language content.

Additional resources

AI can be an efficient way of localising content and evaluating the quality of a translation, but it can’t stand alone. It needs to be trained and modelled for context, vocabulary, and language pair. Using AI for translation without the right structures around it, including the guidance and governance of human expertise, might work for single projects, but won’t constitute a sufficient solution at scale.

To help guide you through the complex AI localisation landscape, we’ve put together a list of additional resources for you:

Related Content

Solving a validation bottleneck: a real-life example

How to navigate conflicting demands within AI localisation

The hidden work behind successful localisation