An evaluation of machine translation systems

28/11/2022

Push the button, get set, go! Checking and comparing popular MT systems

Machine translation (MT) systems make contextual and system-related errors that can be avoided if you know what causes them and, most importantly, if you have the right solutions. Under the guidance of head of department Jasmin Nesbigall, our MTPE team has evaluated several popular MT systems.

As a language service provider and expert in MTPE processes, we repeatedly point out that machine translation is prone to errors. Most MTPE projects involve everything from omissions, additions and missing or non-specialist terminology to serious content-related errors. So it’s interesting for us and our customers to take a look at the details: What errors does an MT system make with a given text, and how many of them? And would a different system make exactly the same errors, or perform significantly better or worse?

We wanted to know more and pitted six MT systems against one another. Some of the results are surprising and clearly show the determining factors and solutions that can be used to optimise machine translation output.

The foundation: Error categories and error types

When evaluating translations, including machine translations, many things need to be considered in advance: What errors may arise and how can they be categorised? How do you turn a pure gut feeling into a measurable assessment? And what performance indicators are needed to obtain a meaningful result?

For our MT evaluations, we currently work with seven error categories, each of which includes several error types. In the “Terminology” category, for example, we distinguish between three error types: failure to follow the specifications in the terminology database, inconsistent use of terms, and the occurrence of terms in the text that do not belong to the subject matter.
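To make such a scheme usable in practice, categories and error types need to be countable. The following is a minimal sketch of how an error taxonomy and a simple tally could be represented; only the Terminology category and its three error types are taken from the description above, everything else is an illustrative placeholder.

from collections import Counter

# Minimal sketch of an error taxonomy for MT evaluation.
# Only the "Terminology" category and its three error types come from the
# description above; the remaining categories are illustrative placeholders.
ERROR_TAXONOMY = {
    "Terminology": [
        "termbase specification not followed",
        "inconsistent use of terms",
        "term outside the subject matter",
    ],
    # ... six further categories such as "Accuracy" and "Style" would follow here
}

def count_errors(annotations):
    """Tally annotated errors per category and per (category, error type) pair."""
    per_category = Counter(a["category"] for a in annotations)
    per_type = Counter((a["category"], a["type"]) for a in annotations)
    return per_category, per_type

# One annotation as a post-editor might record it:
sample = [{"category": "Terminology", "type": "termbase specification not followed"}]
print(count_errors(sample))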

The terminology category is regularly rated very poorly in the feedback from our post-editors after MTPE projects, which is not surprising given the use of generic engines: how are generically trained MT systems supposed to know and implement a company’s exact terminology specifications? An approach that significantly reduces this source of error is now available in the form of glossary functions, and we took this into account in our evaluation project: the two systems offering this functionality were each assessed twice, once with and once without terminology integration.

Let’s set the scene: a source text and terminology specifications

But let’s not get ahead of ourselves! The first step at the project outset was to find a suitable source text: one containing company specifics while also maintaining a balance between general language and technical detail. For our evaluation project, we compiled a text from two existing texts about one of our clients’ product ranges, which presented some challenges: in addition to long and sometimes convoluted sentences, the text contained numerous split compound nouns (for example, “Leistungsstecker und -verbinder”) and company-specific spellings for product and department names.

Based on the source text, a terminology list was created with 16 German–English term pairs, which occurred a total of 37 times in the text. In other words, there were 37 opportunities for a generic, out-of-the-box MT system to implement the desired specialised terminology correctly, or to disregard it.
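To illustrate how such term “slots” can be counted, here is a minimal sketch. The two glossary entries are purely hypothetical examples, not the actual client terminology, and the real list contained 16 pairs.

import re

# Hypothetical excerpt of a German-English glossary; the real list had 16 pairs.
GLOSSARY = {
    "Leistungsstecker": "power plug",
    "Leistungsverbinder": "power connector",
}

def count_term_slots(source_text, glossary):
    """Count how often each specified source term occurs in the text, i.e. how
    many opportunities the MT system has to apply (or miss) the required term."""
    return {
        term: len(re.findall(rf"\b{re.escape(term)}\b", source_text))
        for term in glossary
    }

text = "Die Leistungsstecker und Leistungsverbinder sind kompatibel."
print(count_term_slots(text, GLOSSARY))
# {'Leistungsstecker': 1, 'Leistungsverbinder': 1}

A naive count like this also hints at the split-compound problem discussed further below: a hyphenated second element such as “-verbinder” on its own would not be matched.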

Evaluation performance indicators at a glance

And this brings us to the many key performance indicators the project delivered. Of the 37 places where specified terms were required, the systems got between 35 and 57 per cent wrong, i.e. up to 21 places. After integrating the terminology specifications as a glossary, which was possible with two of the systems, this error rate dropped to as low as 19 per cent.

The text comprised a total of just under 570 words, in which the systems made between 82 and 121 errors. One third (33%) to just under half (46%) of these errors were classified as “severe”, i.e. as having a significant impact on the comprehensibility and usability of the text. However, correcting these errors was only rated as “time-consuming” in a quarter (25%) to a third (33%) of cases. So there are many errors, at least one in three of them severe, but the majority require little effort to correct. The reason is that an error can be severe in terms of content yet quick to fix: a terminology error, for example, may be serious but only requires replacing the wrong term with the correct one. This is why both performance indicators are needed to assess the MT output and the effort required in the post-editing phase.

The six systems differed surprisingly little in the number of segments that had to be post-edited. Of a total of 38 segments, between 34 and 36 had to be adjusted, i.e. around 90 per cent in each case, which is a great deal overall. The adjustments ranged from adding a comma to restructuring a segment or even retranslating it completely. It is therefore interesting to look at the edit distance, meaning the number of character-based changes between the machine output and the post-edited version. An added comma, for example, gives an edit distance of 1, as only one character has been changed. The lower the edit distance, the fewer changes were needed before the machine translation could be adopted.
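A common way of computing this character-based edit distance is the Levenshtein distance. The following minimal sketch shows the idea; the example segments are invented.

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance: minimum number of insertions,
    deletions and substitutions needed to turn string a into string b."""
    # Classic dynamic-programming formulation, row by row.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# An added comma yields an edit distance of 1:
mt_output = "The power connectors are compatible with all devices"
post_edit = "The power connectors are compatible, with all devices"
print(levenshtein(mt_output, post_edit))  # 1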

In the evaluation project, the majority of segments had a low edit distance of 1 to 20 (40 to 50 per cent of segments for almost all systems). In only one system did just 30 per cent of segments fall into this band, meaning a higher rate of change overall. All in all, however, the values show that in the majority of cases only minor changes to the segments were necessary during post-editing, which puts the overall change rate of 90 per cent into perspective.

What is interesting in an evaluation, however, is not only the number of errors and the effort required to correct them, but above all the type of error: does one system produce a particularly large number of grammatical errors while another omits entire parts of sentences? Does one system grasp the meaning of a sentence better than another? And are there particular weaknesses that only become apparent when several systems are compared? The clear “winners” among the error categories were style and accuracy, each accounting for about 40 per cent of the errors. There were also clear frontrunners among the error types within these categories: 90 to 97 per cent of style errors were violations of style guide specifications, and 80 to 97 per cent of accuracy errors were misunderstandings of meaning. The systems also occasionally added information or left out words or parts of sentences, for example.

Terminology integration as a determining factor

And what about terminology? The error rate for terminology was 14 to 18 per cent, depending on the system used. This rate could be reduced to 8 to 12 per cent with the two systems that allow terminology integration in the form of a glossary. But why are there still terminology errors at all, even after glossary specifications have been included? A detailed analysis showed three clear sources of error:

1. Split compound nouns are not recognised as such and are therefore translated separately. The second part of the compound is usually not linked back to the first, so it is not matched as a specified term (see the sketch after this list).
Example: In “Leistungsverbinder und -stecker”, “Leistungsstecker” was not recognised and the specified term from the glossary was therefore not implemented.

2. If terms within the specifications contradict one another, or if a term is to be translated differently on its own than as part of a compound noun, this causes problems for the systems. Sometimes only one of the specifications is implemented; sometimes all of the affected terms are ignored entirely.

3. Special spellings, such as small caps, capitalisation within a product name or punctuation within a term, are completely ignored by the MT systems. This source of error is particularly interesting with regard to gender-appropriate word forms in German, which always contain punctuation within the word.
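The first source of error can, at least in part, be mitigated by expanding split compounds in the source text before machine translation, so that the glossary can match the full second term. The sketch below is a simple heuristic under the assumption that the expanded form can be validated against the glossary; the terms shown are again hypothetical, and real texts would need more robust handling (ideally a morphological analyser).

import re

# Hypothetical glossary of full source-language compounds.
GLOSSARY_TERMS = {"Leistungsstecker", "Leistungsverbinder"}

def expand_split_compounds(text, glossary):
    """Expand split compounds such as 'Leistungsstecker und -verbinder' into
    'Leistungsstecker und Leistungsverbinder' so that the second term can be
    matched against the glossary before machine translation.

    Minimal heuristic: try every split point of the first compound and keep
    the first expansion that yields a known glossary term."""
    pattern = re.compile(r"\b([A-ZÄÖÜ]\w+) (und|oder) -(\w+)\b")

    def repl(match):
        first, conj, tail = match.group(1), match.group(2), match.group(3)
        for i in range(len(first), 0, -1):
            candidate = first[:i] + tail
            if candidate in glossary:
                return f"{first} {conj} {candidate}"
        return match.group(0)  # leave untouched if no glossary term matches

    return pattern.sub(repl, text)

print(expand_split_compounds("Die Leistungsstecker und -verbinder sind robust.", GLOSSARY_TERMS))
# -> "Die Leistungsstecker und Leistungsverbinder sind robust."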

This shows that the terminology errors that still occur are systematic and predictable, and so, in the best case, can also be avoided with more targeted specifications.

Conclusion: Six systems, many opportunities (for error)

A comparative evaluation is always fascinating, even if it is very time-consuming. A direct comparison of the systems reveals not only individual weaknesses and stumbling blocks, but also rankings showing which system performs best for a particular domain and language combination. The biggest sources of error we found were accuracy of meaning and the implementation of extensive style guide specifications. This is somewhat surprising, since, judging by post-editors’ feedback, adherence to terminology would have been expected to be the largest source of error. A growing number of systems allow terminology integration in the form of glossaries, which is a good way of incorporating company-specific specifications into the machine output.

However, depending on the use case, specifically training an MT engine with company data remains the more decisive factor, as this allows the MT output to be adapted to style specifications and specific linguistic requirements. If training is not possible or not desired, certain style guide specifications can also be implemented using scripts or automated checks after machine pre-translation. In this scenario, it must be checked which specifications can be implemented this way and where corrections can be made before post-editing in order to reduce the actual PE effort. Creating a PE style guide that identifies known sources of error and provides guidelines on how to deal with them systematically can also help to optimise the MTPE process.
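As a rough illustration of such a post-processing step, the following sketch applies rule-based corrections to raw MT output and logs which rules fired. The rules themselves are purely hypothetical and would in practice come from the client’s style guide and terminology database.

import re

# Purely illustrative rules: pattern in the raw MT output -> required form.
# Real rules would come from the client's style guide and terminology database.
STYLE_RULES = [
    (re.compile(r"\bpower plug\b", re.IGNORECASE), "PowerPLUG"),  # hypothetical in-word capitalisation
    (re.compile(r"\bE-Mail\b"), "email"),                         # hypothetical style guide rule
]

def apply_style_rules(mt_output):
    """Apply rule-based corrections to raw MT output before post-editing and
    report which rules fired, so the changes remain traceable for the post-editor."""
    applied = []
    for pattern, replacement in STYLE_RULES:
        mt_output, n = pattern.subn(replacement, mt_output)
        if n:
            applied.append(f"{pattern.pattern} -> {replacement} ({n}x)")
    return mt_output, applied

corrected, log = apply_style_rules("Please check the power plug and send an E-Mail to support.")
print(corrected)  # "Please check the PowerPLUG and send an email to support."
print(log)

Logging which rules fired keeps the automated changes traceable and lets the post-editor focus on what genuinely needs human judgement.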

For more information on the interaction between terminology and MT, we recommend our tcworld presentation “Machine translation and terminology – a difficult relationship or a great love affair?” (only available in German) on the oneword YouTube channel.

If you would like to tap into the advantages of machine translation for your company, we would be delighted to run evaluation projects, test the most suitable MT engine or create guidelines for you. Feel free to contact us about this.

