01/12/2025
Language data for LLM use: making AI an expert on your company
Whether text generation, translation or smart chatbots, Large Language Models (LLMs) are currently one of the most exciting AI innovations and are changing the face of corporate communications. They understand, process and generate language and already perform a wide range of tasks and processes. However, in order to deliver not just generic, but company-specific results, the systems must be instructed and trained. We explain the ways in which LLM output can be customised to your company’s own style and the important role your language data from translation memories and terminology databases plays in this.
Options for LLM customisation
There are several ways to adapt AI-generated texts and translations to your company’s tone of voice and specifications.
In Prompt Engineering, the existing model is specifically influenced and controlled by instructions and examples. If the system is given terminology or style specifications within a prompt, it takes these instructions into account and, in the best case, implements them accordingly. Prompt engineering is easy to implement and test, but with complex requirements it reaches its limits and, depending on the model, leads to increased costs per request.
With Retrieval Augmented Generation (RAG), the LLM is connected to the company’s own data sources, which the system can access in a targeted manner. This method leads to more specific answers and has the advantage that data can be queried on a daily basis, ensuring that it is up to date.
As an advanced method, fine-tuning comes into play, in which an LLM is further trained and fine-tuned with company-specific data. Working on the basis of monolingual or bilingual texts, the AI system learns the specialised terminology used and patterns in the text style. The result is more consistent texts that address users in the desired corporate tone. Such fine-tuning is therefore particularly worthwhile if generic models do not achieve good results or implement specific requirements.
Regardless of the method chosen, training the LLMs requires selected and, above all, clean data. And this is where language data comes into play.
Why language data is indispensable for LLMs
Your company already has a treasure trove of data when it comes to monolingual and multilingual communication: language data in the form of translation memories (TMs) and terminology data. Translation memories store all previous translations, while technical terms are stored and defined in terminology databases. Both data sources thus serve as a kind of multilingual “knowledge base”.
From translation memories, an LLM can adopt translation patterns, style guidelines in the source and target language, linguistic features and the correct use of specialised terminology. A look at the TM also helps with prompt engineering, by providing the LLM with suitable translations as examples and thus illustrating the desired result.
The second valuable source is terminology databases. These databases store technical terms and their equivalents in all required foreign languages, from product names to technical terms. Consistent terminology is not only a quality feature, but is often also legally relevant, especially in technical documentation. Since terminology accounts for up to 45% of the post-editing work on AI translations, the integration of terminology can be a real game-changer in the text creation and translation process.
LLMs can therefore obtain from these two data sources exactly what is often missing from generic MT systems: specialised terminology and company specifics. Utilising language data for LLM use has the following advantages:
- Consistency: Technical terms and formulations remain consistent across all channels.
- Correctness: LLM output is more correct and customised to your own company. Especially in direct contact with users, for example through AI chatbots, the potential for error can be reduced by utilising language data.
- Time and cost savings: The texts generated on the basis of existing language data require significantly less post-editing.
- Brand voice: Answers in the company’s tone of voice strengthen authenticity and the customer experience.
- Competitive advantage: Companies who use their own language data stand out from standard solutions and achieve more relevant results.
Requirements of language data for LLMs
Even though many companies already have extensive language data, it will usually have to be checked and cleaned up before being used for LLMs. Because this is a case where quality definitely takes precedence over quantity. Incorrect or contradictory training data leads to false patterns that the LLM would reproduce or even reinforce.
However, language data has often grown over the years and may contain inconsistencies or outdated terminology. It must therefore be checked and cleaned up in order to function as clear and valuable training data. In the terminology database, for example, there should only be one preferred term per language for each concept. This is the only way to create a glossary from the data, from which the AI system can extract the terms and deliver consistent results.
Data protection must also be taken into account with training material. Translation memories often contain personal data that, through fine-tuning, is stored permanently and may be overlooked in the event of deletion requests.
As a general rule, a smaller but high-quality data set delivers better results than large, unfiltered data volumes. In order to create a good AI basis from existing data, a clean-up is therefore usually needed.
The path to clean language data
With our oneCleanup service, we scrutinise your language data set and make it ready for an AI future. Whether terminology or TM data, we analyse the potential for cleaning up and provide you with an overview of the cleaning steps you need to take to achieve a clean database.
Our language experts combine technological expertise with linguistic know-how. We know what data for prompt engineering, RAG and fine-tuning should look like in order to train the systems in a meaningful way.
Would you like to realise the full potential of your language data? With our services for language data clean-up, we quickly make your data AI-ready. Contact us and let our language experts advise you.
8 good reasons to choose oneword.
Learn more about what we do and what sets us apart from traditional translation agencies.
We explain 8 good reasons and more to choose oneword for a successful partnership.