Jasmin Nesbigall (JN): How nice that today, as part of our Quality Time, we can talk about a topic close to both our hearts: machine translation. Let’s start with a brief look back and then forward, including for TextShuttle: With the breakthrough of NMT in 2016, the focus was on the very good results, which were the subject of press and media coverage. Now the focus is increasingly on errors, sources of errors and the expectation that machine translation should deliver the same results as a human translator.
Samuel Läubli (SL): Basically, it can be compared to a spelling error: if a native speaker makes one, it is pointed out immediately; people enjoy doing that on social media, too. But if someone who is just learning the language makes a mistake, most people overlook it.
JN: In your view, has the tipping point been reached where MT is no longer considered an aid and additional tool, but where it is directly compared and even equated with humans?
SL: The answer is both yes and no; it depends on who we are talking about. We saw this very early on, in 2018, when we were working for banks and insurance companies. The language specialists and translators there who knew about translation always saw it as a machine; there was never a direct comparison or equation made with a human being. However, among people who are more concerned with productivity and profit and less familiar with language, it began being equated with human translation sooner. So it depends on how involved you are with the topic.
JN: This also means that the more professional the users are, the more attention is paid to the mistakes the system makes. But those who simply integrate it into their everyday life, for example by using translation apps or by clicking on “Translate” on a website, see it as a nice aid and additional function on their browser.
“The more professional the users and use, the more focus there is on the errors of machine translation.”
SL: Exactly. For these users, it’s not even a thing. They don’t go specifically to a provider platform, they simply land on a Japanese website and there the message pops up: “Would you like to translate this page?” But they have little awareness of what’s going on in the background and don’t expect perfect results.
But we also had translators in their first and second semesters at ZHAW who used machine translation to check whether what they had translated really fit. The transition between these uses is therefore fluid, but the status and expectations are definitely different than they were a few years ago. The systems can do a lot, and the question is simple: what do they do differently from humans?
JN: At the same time as saying that the systems can do a lot, we also see that providers like to claim that their system is the best in a certain language or for a certain subject area. The Intento Report regularly concludes that actually every MT system has scored as the best in at least one language direction and at least one domain. But now we also know how different texts and requirements can be in a subject area. So what can you really judge the quality of an MT system by? Is it about which one makes the fewest errors? Or makes the fewest serious errors? Or whose errors are the easiest to post-edit?
SL: We must always ask ourselves what the purpose of MT is, and there are two types: MT for understanding and MT in professional translation. In the first case, outside professional translation, if I simply want to understand a text, the system that makes the fewest serious errors is probably the best. Whether or not I still have the eszett (ß) in a translation for Switzerland, where they don’t use this character, has no bearing on understanding the text.
However, using MT in professional translation, i.e. with a human-in-the-loop, becomes more complex. There’s no linear relationship between errors and time spent post-editing, even if people like to think so. It does provide an indication: if one system makes ten serious errors and another only two, you can probably work more productively with the second system.
However, when we specialise systems at TextShuttle, it’s also about stylistic adjustments. The machine learning community almost laughs at how much effort we put into quotation marks, for example. Of course, these are not obvious errors, but they do take time when post-editing, if every quotation mark has to be corrected. So, to choose the best system, I’d have to measure productivity over a period of time with a certain number of post-editors to decide which system is best to work with.
JN: Productivity is a key word. We see time and again in our evaluations that the severity of an error and the effort required to correct it aren’t directly related. For example, I may have a serious terminology error if a term has been translated with a word that isn’t appropriate for the domain. However, if there’s a specified translation in the terminology database or a standardised, specialised term, then it’s a simple matter of replacing the wrong word, and very little effort is required in post-editing. On the other hand, a sentence may simply not fit stylistically, which is only a minor error severity level, but it must be completely rewritten when correcting it, meaning a relatively high post-editing effort.
SL: Great example! This shows that there may be a correlation, but you can’t conclude what the post-editing effort is from the number of errors. And getting back to the question of the best MT system: I think it has been the case for some time with generic MT systems that the quality is simply converging. The rest, in many cases, is simply about branding: DeepL is simply a household name for many and the cool underdog compared to Google Translate. Just as you might have your favourite brand of jumper, even though, in reality, most would keep you equally warm. However, when we look at what the next wave in the MT sector is going to be, it’s more about features, ease of use and also the ability to integrate it in other systems. I believe that this is where the big differences will show up in the next few years, and not in who makes how many minor or serious errors.
“When we look at what the next wave in the MT sector is going to be, it’s more about features and ease of use.”
JN: Talking about errors also brings us to the subject of quality assessments. If an evaluation metric compares the MT output with the result that a human being would have delivered, this doesn’t do justice to what happens in practice. If two people translate the same sentence, the results will also differ. So why is an MT system judged by how close it is to a human translation? Is this still an up-to-date or practical approach?
SL: The key words here are ‘reference-free evaluation’. Resources are always central to the whole issue of evaluation: how much time do I have, how much money do I have, how many people do I have who can work on it? In system development, of course, I need automated evaluation because I can’t have 2,000 sentences evaluated by humans every hour when training an engine. And for automated evaluation, I need some result to compare the current result of my engine with, so I use a human translation.
The BLEU metric, which is often criticised, was actually intended to compare several reference translations by humans with a machine-generated output, so there was linguistic variance involved. Therefore, the metric isn’t bad at all, but for cost reasons no one uses several reference translations for an automated evaluation.
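The multi-reference design described here can be sketched in a few lines of Python. This is a toy simplification of BLEU (clipped n-gram precision plus a brevity penalty), not the official corpus-level implementation, and the example sentences are invented; it only illustrates how several human references admit linguistic variance:

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, refs, n):
    # Each hypothesis n-gram is counted at most as often as it occurs
    # in the reference where it occurs most ("clipping").
    hyp_counts = Counter(ngrams(hyp, n))
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

def toy_bleu(hyp, refs, max_n=2):
    # Geometric mean of clipped precisions, times a brevity penalty
    # measured against the reference closest in length.
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    closest = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= closest else exp(1 - closest / len(hyp))
    return bp * exp(sum(log(p) for p in precisions) / max_n)

refs = [["the", "cat", "sat", "on", "the", "mat"],
        ["the", "cat", "was", "sitting", "on", "the", "mat"]]

print(toy_bleu(["the", "cat", "sat", "on", "the", "mat"], refs))   # 1.0
print(toy_bleu(["a", "dog", "stood", "on", "a", "rug"], refs))     # 0.0
```

A hypothesis matching either reference scores perfectly; with only one reference, valid alternative phrasings would be penalised.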
JN: So, in your view, what approach makes more sense for assessing the quality of MT output?
SL: When we test systems, we always test with humans when it comes to important decisions. For example, we do A/B tests with a source text and then two versions of the translation. And then we look at how often, on average, the human translation is rated better and how often the machine translation is rated better. Actually, it’s important to move away from rigid metrics. Not because the metrics are bad in themselves, but mathematising quality like that doesn’t do justice to the fact that there would theoretically be an infinite number of correct translations for each sentence and not just one golden solution. But this is of course difficult to model with limited resources.
“When we test systems, we always test with humans when it comes to important decisions.”
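The A/B evaluation described above ultimately reduces to a win-rate tally over human judgments. A minimal sketch with entirely hypothetical rater data:

```python
from collections import Counter

# Hypothetical judgments: for each source sentence, a rater is shown two
# anonymised translations (A/B) and picks the one that reads better.
judgments = ["human", "machine", "human", "tie", "machine",
             "human", "machine", "human", "tie", "human"]

counts = Counter(judgments)
for outcome in ("human", "machine", "tie"):
    print(f"{outcome}: {counts[outcome] / len(judgments):.0%}")
```

In practice the samples are far larger and the two systems are blinded and order-randomised so raters cannot tell which output is which.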
JN: Earlier, we talked about stylistic adjustments in machine translation, for example, not using the eszett (ß) for Switzerland. Our evaluations repeatedly show that adjustments to style guide specifications are one of the main sources of errors in generic MT. So is adapting the MT to these specifications a big element for ensuring higher quality? If we single out gender forms for a moment: can something like that be defined by style guides? So if I use a uniform gender form in the source text, can I teach the machine that, for example, an internal dot should always be used as the gender form in the French target text?
SL: That’s definitely an element with a big impact. The problem is that things that look so simple are sometimes very difficult to implement. Replacing the ß with a double S for Switzerland is one line of code and really not difficult. Gender-sensitive language, however, is at the other end of the spectrum because it’s much more complex. There are no uniform rules, hardly any training data in the languages, and many ambiguities. For example, “Leiter” in German can mean a ladder or a leader. Does it mean a thing or a person? That’s no longer about stylistic adjustments, but about teaching a system to be able to create these kinds of texts.
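The “one line of code” mentioned here really is just a string replacement; a minimal sketch (function name is illustrative):

```python
def de_ch_normalise(text: str) -> str:
    """Swiss Standard German does not use the eszett; replace ß with ss."""
    return text.replace("ß", "ss")

print(de_ch_normalise("Die Straße ist groß."))  # Die Strasse ist gross.
```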
JN: But I only see that problem when I translate from a non-gendered language, for example from English, where the term “teacher” is used for men and women. But if I then have “Lehrer:in” (a term encompassing the masculine and feminine forms of “teacher”) in the source text in German, then the system only needs to be told how this gendered form should be implemented in Spanish or French.
SL: And that, unfortunately, isn’t that easy. Sure, in this scenario, I can give the system a signal: there’s a gender-fair form here and it should be expressed accordingly in the target language. But the reason the systems are so good today is that they’re no longer based on fixed rules, but learn from data. I don’t have to teach the system the genitive, because instead I show it lots of examples, which it uses to learn the genitive. However, there aren’t that many examples for gender-sensitive language. So what happens? Most systems simply ignore the gender markers. That means I can start tinkering and bring these rules into the system, but in doing so I bypass the neural mechanism to some extent because as yet there simply aren’t enough real-world examples.
JN: In the real world, there’s not only too little MT training material for these situations, but sometimes also for entire languages. We Europeans have a very comfortable situation with our languages and relatively good MT results. But what is it like for low-resource languages? How much data needs to be available for training to be worthwhile and to be able to reproduce a language by machine? And is it then more about offering a language in good quality or about being the first provider to offer an under-represented language direction?
SL: This threshold for training does exist, but it’s lower than you think. If I want to train a system for German to English, I can use 100 million translated sentences from easily accessible sources. I don’t have to do much methodologically because I have extensive training material. You can also train relatively robust systems with 100,000 sentences, but then you have to compensate methodologically for many more things. So the question in practice is: is it even worth it for a company to do that? At TextShuttle we did this with Romansh, i.e. for a language with around 30,000 speakers, but there was also a relevant use case for a national broadcaster. In principle, systems can also be trained with little data, but then more AI specialists have to work on it, which is why it’s certainly rarely done in practice.
JN: For this reason, many smaller languages go through relay languages with MT providers. With this approach, there’s always the danger that specificity is lost. If I enter “Lehrerin” (the feminine form of “teacher”) in German, it becomes the gender-neutral term “teacher” in English and then the masculine “maestro” in Spanish. The English “you” can be used in formal and informal contexts, and can refer to one or more people. So if English is used as the relay language, I’m using an intermediary language that’s significantly less specific than German or French, for example. So are relay languages more of a curse because specificity is lost? Or a blessing, because some languages can only be reproduced using a relay language, since there’s not enough material available for the direct language pair?
SL: That’s a seriously fascinating topic! With the very small, i.e. the low-resource languages, it’s probably more of a blessing. There are always other methods, for example pre-training with a related high-resource language and then just fine-tuning with the low-resource language.
In practice, proper pivoting is often done to save hardware resources and, as a result, money, because relay languages can support more languages with fewer engines.
To return to the question: With small languages, relay languages do make sense and the results may be better than without an intermediary language, because English simply provides much more training material. But with languages where this isn’t necessary, you clearly lose out due to the lack of specificity: we also see this time and again in internal evaluations.
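The loss of specificity described above (“Lehrerin” → “teacher” → “maestro”) can be illustrated with toy lookup tables. A real pivot setup chains two neural engines rather than dictionaries, and all entries here are illustrative, but the information loss is the same in kind:

```python
# Toy dictionaries standing in for two MT engines and a direct engine.
de_to_en = {"Lehrer": "teacher", "Lehrerin": "teacher"}  # gender collapses here
en_to_es = {"teacher": "maestro"}                        # English can't recover it
de_to_es = {"Lehrer": "maestro", "Lehrerin": "maestra"}  # direct pair keeps it

def pivot_translate(word: str) -> str:
    """German -> English -> Spanish, as a relay-language system would."""
    return en_to_es[de_to_en[word]]

print(pivot_translate("Lehrerin"))  # maestro: the feminine form is lost
print(de_to_es["Lehrerin"])         # maestra: preserved by the direct pair
```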
JN: Finally, I have five quick questions. The answers should be as concise and as instinctual as possible.
The first question is about post-editing: superfluous at some point or here to stay?
SL: Here to stay and needed in fewer and fewer cases in the future.
JN: Your favourite mistake in a translation system?
SL: I like mistakes on a meta-level. If I translate “This sentence contains exactly 45 characters.” into German with a translation system, the result is “Dieser Satz enthält genau 45 Zeichen.”. Linguistically, it’s a great translation, but its content isn’t great, because the German sentence has only 37 characters. This shows very nicely where translation systems still have limitations today.
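The meta-level mistake is easy to verify by counting the characters of both sentences:

```python
source = "This sentence contains exactly 45 characters."
target = "Dieser Satz enthält genau 45 Zeichen."

print(len(source))  # 45: the English claim is true
print(len(target))  # 37: the translation copies "45" and falsifies itself
```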
JN: If you could master a new language at the touch of a button, which would it be?
SL: A Filipino language with few speakers. We have family in the Philippines and Kapampangan is the native language there. And I don’t stand a chance at the moment! Tagalog can be translated with Google Translate, but unfortunately I’m completely out of the loop when it comes to Kapampangan.
JN: Machine translation for me is…?
SL: …still infinitely exciting! When you see how simply these systems are built, how they learn, how little of that resembles human thinking, and then consider what you can generate with a few million floating-point numbers, it still fascinates me.
JN: I share that view. When you consider how long it takes people to learn a language and thus gain access to texts and content, it’s all the more fascinating that language barriers can now be removed at the touch of a button.
The last quick question is: Where do you see TextShuttle in five years’ time?
SL: Five years from now, TextShuttle will definitely be much more visible than it is today. There aren’t that many MT providers, but if you ask people which ones they know, the answer might be DeepL or Google Translate. I don’t want to claim that just as many people will know TextShuttle in five years, but it will be at least significantly more than just Swiss insurance companies and banks.
JN: Thank you very much for the great conversation!