30/09/2025
From data chaos to vocabulary: maintaining translation memories
Translation memories can grow quickly and become confusing. At the same time, they are a true treasure: they can save translation costs, ensure consistency and train AI systems. This makes it all the more crucial that translation memory data is regularly checked, cleaned up and maintained, so that potential savings are identified and your language data remains suitable for use with AI.
An inventory sheds light on the jungle of data
Translation memories (TMs) are made up of pairs of segments, where a source-language segment is assigned to its target-language equivalent. Some companies may be able to give a rough estimate of the number of these segment pairs, also known as translation units, but for most they are a complete mystery. TMs contain valuable language data that is currently in high demand for training or enhancing AI systems. However, the quality of the TMs is crucial for their meaningful use, and often just as little is known about this as about their size.
This is because each translation project adds hundreds or thousands of new segment pairs. And while the volume only ever grows, the data is almost never systematically reviewed or cleaned up. What was built up over years as a shared knowledge base gradually becomes a confusing data dump.
This lack of transparency becomes especially problematic if the language data is to be used for additional processes such as knowledge databases or AI translations. After all, AI systems require clean, consistent training data. Issues that translators can compensate for through experience and context lead to incorrect results in automated processes. A professional assessment of your TMs is therefore not a mere academic exercise but an economic necessity: it lays the foundation for more efficient translation processes and opens up new technological options.
Primary causes of unclean translation memory data
Despite the continual growth of the TM, few companies have established review or clean-up routines, not least because they often have no access at all to TMs managed by translators or service providers. Alongside the large quantities of data and the lack of routines, there are several additional causes of unclean translation memories.
Figure 1: Six primary causes of unclean translation memory data. (Source: oneword GmbH)
Metadata, style or quality can become inconsistent, especially when translations originate from a range of sources such as different service providers. And if there are no standards for system settings, such as how to handle tags and placeables, or if the translations originate from different CAT tools, then uncontrolled growth of the data is inevitable. Correct segmentation also plays an important role; here, both the quality of the source text and the segmentation rules in the CAT tool are crucial. If the layout breaks a sentence apart, for example, the result can be two translation segments that no longer correspond between the source and target languages, as shown in the following example. Such fragments can pose a translation risk or be worthless for reuse.
Figure 2: Incorrect line breaks within a sentence and incorrect allocation of the segment parts between the source and target language. (Source: oneword GmbH)
Stored translations are only truly valuable if they are “clean” not only in form but also in content. This also depends on the correct use of technical terms: newly defined terminology or changes to existing terminology require corresponding corrections in TMs, but this is rarely considered or implemented.
Now that the causes have been outlined, the task is to bring order to the chaos of data. Specific starting points are essential here, because they indicate what can be reviewed and cleaned up in TMs.
Duplicates as cost drivers
Of all the issues in translation memories, duplicates cause the most easily avoidable costs. They often arise gradually and unnoticed, for example when various TM inventories are combined, when multiple teams work on the same database in parallel, or when tool settings create new entries for changes instead of overwriting existing ones. Duplicate entries are not only unnecessary; they can also be a real cost trap.
Depending on the CAT system, duplicates receive a percentage deduction because no unambiguous assignment between source and target segment is possible. If the sentence recurs, it is therefore not counted as a full TM match but as a partial match, a so-called fuzzy match. Here is a sample calculation: translating a ten-word sentence into English for the first time costs €2, for example. In a subsequent project, this sentence could either be locked completely as a 100% match and therefore excluded from the calculation, or, if it needs to be checked again, billed at a discounted price of just 67 cents. However, if the sentence appears in the TM as a duplicate, the system applies a 1% penalty and treats it as a fuzzy match. The price then comes to €1.34, twice as expensive as a reviewed 100% match. Extrapolated to thousands of segments in a translation project, duplicates can therefore cause significant and completely superfluous costs.
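To make the arithmetic concrete, here is a minimal sketch of this kind of match-based pricing in Python. The match bands and discount factors are illustrative assumptions derived from the example above, not the pricing grid of any particular CAT tool:

```python
# Back-of-the-envelope version of the sample calculation above.
# The bands and discount factors are illustrative assumptions.

FULL_PRICE = 2.00  # EUR for translating the ten-word sentence from scratch

def price(match_pct: int, locked: bool = False) -> float:
    """Price a segment by its TM match percentage."""
    if match_pct >= 100:
        # Exact match: locked and free, or rechecked at a small fee
        return 0.0 if locked else round(FULL_PRICE * 0.335, 2)
    if match_pct >= 75:
        # Fuzzy band, assumed here at 67% of the full price
        return round(FULL_PRICE * 0.67, 2)
    return FULL_PRICE  # below the fuzzy threshold: new translation

print(price(100, locked=True))  # 0.0  -> clean TM, exact match locked
print(price(100))               # 0.67 -> exact match, rechecked
print(price(99))                # 1.34 -> duplicate penalty: fuzzy price
```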
Duplicates in TMs can take three forms: fully identical source and target segment, identical target segment for different source texts or different translations for an identical source segment. The latter represents an additional time factor during the translation, since both options must be checked in order to select one. Removing duplicates is therefore an ideal first step in cleaning up the TM, both with regard to its impact on time and costs as well as in terms of the effort involved, because duplicates can be found quickly and the removal can be partially automated.
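Because duplicates are quick to find, a first automated pass can run directly on a TMX export using only Python's standard library. The following is a minimal sketch; the file name and language codes are assumptions, and real data usually needs normalisation (case, whitespace, inline tags, language variants such as en-US) before comparing:

```python
# Minimal sketch: find one source segment with several competing
# translations in a TMX export. Standard library only.
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_units(tmx_path, src_lang, tgt_lang):
    """Yield (source, target) text pairs from a TMX file."""
    tree = ET.parse(tmx_path)
    for tu in tree.iter("tu"):
        texts = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text inside inline tags
                texts[tuv.get(XML_LANG, "").lower()] = "".join(seg.itertext()).strip()
        if src_lang in texts and tgt_lang in texts:
            yield texts[src_lang], texts[tgt_lang]

# Group targets by source text: more than one target per source is the
# costly "different translations for an identical source segment" case.
by_source = defaultdict(set)
for src, tgt in read_units("memory.tmx", "en", "de"):
    by_source[src].add(tgt)

for src, targets in by_source.items():
    if len(targets) > 1:
        print(f"{len(targets)} competing translations for: {src!r}")
```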
Fragments in the database
However, translation units may not only appear in TMs several times; they can also be incomplete. What is saved as a segment largely depends on the (pre-)set segmentation. A punctuation mark such as a full stop or question mark typically indicates the end of a segment, but individual segmentation rules can also be defined for these and all other punctuation marks, such as colons. In principle, a segment should always represent a complete unit so that it can be optimally reused. Many TMs, however, contain fragmented segments, broken around manual line breaks for example.
A sentence that should remain as one is then split across several segments, meaning that, at worst, the parts are attributed incorrectly across the languages (see Figure 2). TM clean-up must therefore also cover fragmented segments, either merging or deleting them. A missing punctuation mark at the end of a sentence, or a comma at the end of a segment, can serve as a starting point for finding such fragments.
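Such fragments can be flagged automatically with a simple heuristic. The following minimal sketch assumes a language-specific set of sentence terminators; in practice, the check would run over a TMX or CSV export of the memory:

```python
# Minimal sketch of a fragment heuristic: a segment ending in a comma, or
# in no end-of-sentence punctuation at all, is a candidate for merging or
# deletion. The terminator set is an assumption; adapt it per language.
SENTENCE_END = (".", "!", "?", ":", "\u2026")  # \u2026 = ellipsis

def looks_fragmented(segment: str) -> bool:
    """Flag segments with a trailing comma or without a terminator."""
    stripped = segment.rstrip()
    return stripped.endswith(",") or not stripped.endswith(SENTENCE_END)

examples = [
    "The pump must be checked",                         # no terminator
    "before each start-up,",                            # trailing comma
    "The pump must be checked before each start-up.",   # complete
]
for seg in examples:
    print(looks_fragmented(seg), "->", seg)
```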
Worthless for reuse
In addition, every database contains segments that are unique and complete but that cannot be reused due to their content, such as one-off press releases, special newsletters or texts about discontinued products. Ideally, such translations should not be added to the TM in the first place. In an existing TM, it can be helpful to search for old product names and use the hits to determine the creation date of a text, if possible with the precise time, because multiple projects are often imported into the memory on the same day. Segments with that specific date can then be filtered out and deleted.
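Once such an import date has been identified, the creationdate attribute defined by the TMX standard can be used to isolate and, after review, remove the affected units. A minimal sketch, with the file name and cut-off date as assumptions:

```python
# Minimal sketch: filter translation units by their TMX creationdate
# attribute (format YYYYMMDDThhmmssZ per the TMX standard).
import xml.etree.ElementTree as ET

TARGET_DAY = "20190312"  # hypothetical import date of the obsolete project

tree = ET.parse("memory.tmx")
body = tree.find("body")
suspects = [
    tu for tu in body.findall("tu")
    if tu.get("creationdate", "").startswith(TARGET_DAY)
]
print(f"{len(suspects)} units created on {TARGET_DAY}")

# After manual review, the same list can be used to drop the units:
for tu in suspects:
    body.remove(tu)
tree.write("memory_cleaned.tmx", encoding="utf-8", xml_declaration=True)
```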
The final steps on the path to vocabulary
As is generally the case with all data, TM segments are easiest to work with when they are clean in terms of both form and content. Examples of formally unclean data include tag errors or segment pairs whose end-of-sentence punctuation differs between the languages. This category also includes an incorrect source or target language that has ended up in the TM accidentally, for example from multilingual instructions. In addition, many TMs contain untranslated segments or effectively empty segments that contain only a full stop or special character. Unclean content, in turn, covers all segments containing linguistic, content-related or terminology errors: spelling and grammar mistakes as well as incorrect or unused technical terms or factual errors in the source or target language. Checking and cleaning up these errors certainly requires significant effort and corresponding language proficiency.
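Some of these formal checks can at least be pre-filtered automatically. The following sketch shows deliberately simplified rules for mismatched end punctuation, possibly untranslated segments and empty targets; the rules are assumptions (identical brand names, for instance, would need a whitelist), and genuine decisions still require human review:

```python
# Minimal sketch of three formal checks named above. The rules are
# simplified assumptions, meant as a pre-filter before human review.
import re

def formal_issues(src: str, tgt: str) -> list[str]:
    issues = []
    # End-of-sentence punctuation differs between the languages
    if src and tgt and src[-1] != tgt[-1] and (src[-1] in ".!?" or tgt[-1] in ".!?"):
        issues.append("end punctuation differs")
    # Source and target are identical: possibly untranslated
    if src.strip() and src.strip() == tgt.strip():
        issues.append("possibly untranslated")
    # Target holds no word characters: effectively an empty segment
    if not re.search(r"\w", tgt):
        issues.append("empty target (punctuation or special characters only)")
    return issues

print(formal_issues("Switch off the device.", "Gerät ausschalten"))  # punctuation
print(formal_issues("Release notes", "Release notes"))               # untranslated
print(formal_issues("See figure 1.", "."))                           # empty target
```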
The right clean-up measures
To keep the effort for each clean-up criterion to a minimum, various methods can be employed to clean up the TM. Within CAT tools, i.e. where the TM segments originated, both various settings and targeted features can be used. In the settings, it is possible to define which metadata (e.g. project name or division) is saved with the segment and whether existing matches are overwritten once edited or duplicates are created instead. TM maintenance features, such as searching for duplicates, and extensive filter options make it easy to start cleaning up the TM. For example, the creation date of a segment, stored as standard, can be used to filter out all segments from a particular project.
What’s more, every CAT tool also provides the option to export the translation memory in formats such as .tmx or .csv. These exports can then be searched, filtered and processed in text editors or spreadsheet programs. For extremely large quantities of data, we recommend scripts that automatically check for and apply the clean-up criteria. However, in addition to language data expertise, these measures require corresponding programming knowledge.
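As an example of this export-based approach, the following sketch flattens a TMX file into a CSV that can then be filtered and sorted in a spreadsheet program. File name and language codes are assumptions; only the standard library is used:

```python
# Minimal sketch: flatten a TMX export to CSV for spreadsheet review.
import csv
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

tree = ET.parse("memory.tmx")
with open("memory.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh)
    writer.writerow(["creationdate", "source (en)", "target (de)"])
    for tu in tree.iter("tu"):
        texts = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                texts[tuv.get(XML_LANG, "").lower()] = "".join(seg.itertext()).strip()
        writer.writerow([tu.get("creationdate", ""),
                         texts.get("en", ""), texts.get("de", "")])
```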
Managing data growth in translation memories and terminology databases
Cleaning up language data made easy: our oneCleanup service combines decades of linguistic and technological expertise to turn your data chaos into valuable vocabulary. oneCleanup is our complete service for monitoring, maintaining and cleaning up TM and terminology databases of all sizes. We analyse your data and provide you with an overview of potential clean-up goals, such as duplicates and incomplete data. You decide which of these are implemented, and whether you would also like us to take care of the clean-up immediately after the analysis!
Clean data for the future of AI
The importance of clean translation memories today extends far beyond traditional translation work, because the quality of language data also plays a decisive role in the success of your AI initiatives. Companies that treat their translation memories as true assets and curate them accordingly gain a sustainable competitive advantage. They not only increase the efficiency of their current translation processes and reduce costs, but also invest in a valuable treasure trove of data that can be used as multilingual training and reference material for Large Language Models and in-house AI solutions.
Do you want to realise the full potential of your translation memories? At oneword, our experts analyse your language data, identify potential savings and develop a customised clean-up strategy together with you. Please feel free to contact us for a no-obligation consultation or to learn more about our TM clean-up services.

