Regardless of whether the task involves creating a terminology pool for the first time, collecting bilingual terms from a translation project, or compiling project-related terms for a specific product, there are generally five sources from which to retrieve terminology:
1. Terminology in its pure form, as it appears in catalogues, product lists and glossaries. The terms can be easily extracted from these sources and prepared as a dataset.
2. Lists and existing datasets that have already been compiled in projects or departments. Companies often have several of these lists circulating that other departments may not even know about.
3. The staff’s implicit knowledge of terminology. Everyone in the company communicates about products and services and names them accordingly.
4. Use of terminology during text creation. Terminology is researched and used when creating the source language text and when translating it. Collecting this terminology again separately can therefore be described as manual extraction from texts.
5. Partially automated extraction from existing documents and datasets. Here, a body of text is checked for the terms used in it and these are extracted as a list.
To capture the terminology used in the company relatively quickly and comprehensively, the last source listed, semi-automated extraction, is a good choice. Two methods are available for software-based terminology extraction: linguistic and statistical.
Linguistic systems are language-dependent because they are based on word formation rules and syntactic algorithms of a specific language. Linguistic extraction software therefore only covers a certain number of languages. It is mostly used for monolingual extractions, but multilingual extractions are also possible as a hybrid form (linguistic/statistical). The result of the linguistic procedures is a list of linguistically correct words, which, however, are not necessarily relevant from a terminology point of view.
Statistical methods, on the other hand, do not depend on a particular language. They determine word frequencies, in other words, how often a word occurs in the body of text, so the language under consideration is irrelevant. The extraction is done monolingually from a set of texts or bilingually from previous translations, for example from a translation memory system. The basic principle is: the larger the text base, the more often individual words occur and the better the result. In the hit list, however, the words appear not only in their base form, but also in plural and inflected forms or as different parts of speech. The result must therefore be cleaned up both linguistically and terminologically.
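The frequency principle can be illustrated with a minimal Python sketch. It is not how any particular extraction tool works; the function name `term_candidates`, the tiny stop-word list and the sample sentences are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be far longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "of", "and", "to", "for", "with"}

def term_candidates(texts, min_count=2):
    """Count how often each word occurs across all texts and return
    (word, count) pairs with at least min_count hits, most frequent first."""
    counts = Counter()
    for text in texts:
        # Letters only, optionally joined by hyphens (e.g. "ink-jet").
        words = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return [(w, n) for w, n in counts.most_common() if n >= min_count]

# Invented sample corpus.
corpus = [
    "The printer is connected to the network. Printers need a driver.",
    "Install the driver for the printer and restart the printer.",
]
for word, count in term_candidates(corpus):
    print(word, count)
```

Note that "printers" is counted separately from "printer" here, which is exactly the kind of plural or inflected form that has to be cleaned up afterwards.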
Neither of the procedures above detects the terminological relevance of the extracted words. Much of what is extracted may therefore not be technical terms at all, but general words. Only the project managers can judge the relevance of individual term candidates, as terminology is very subjective: what is important specialised terminology for one company is irrelevant for the next. Evaluating and validating term candidates therefore always requires manual rework and human decision-making.
Along with a list of relevant terms, semi-automated extraction provides further data that can add value to a company’s terminology work: most tools can directly specify how often a term occurs, the file names in which it occurs, and a matching context sentence from the body of text.
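As a rough sketch of what such additional data could look like, the following Python function collects the frequency, the file names and one context sentence for a given term. The function `describe_term`, the file names and the sample texts are hypothetical, and the sentence splitting is deliberately naive.

```python
import re

def describe_term(term, documents):
    """documents maps a file name to its text. Returns the number of hits,
    the sorted names of the files containing the term, and the first
    sentence in which the term was found."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits, files, context = 0, [], None
    for name, text in documents.items():
        found = pattern.findall(text)
        if not found:
            continue
        hits += len(found)
        files.append(name)
        if context is None:
            # Naive sentence split on ., ! or ? followed by whitespace.
            for sentence in re.split(r"(?<=[.!?])\s+", text):
                if pattern.search(sentence):
                    context = sentence.strip()
                    break
    return {"term": term, "frequency": hits,
            "files": sorted(files), "context": context}

# Invented sample documents.
docs = {
    "manual.txt": "Switch on the printer. The toner cartridge sits behind the flap.",
    "faq.txt": "Replace the toner cartridge when the warning light flashes.",
}
print(describe_term("toner cartridge", docs))
```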
Depending on the software, the tools also recognise different spellings directly and combine them (for example, ink jet printers and inkjet printers), or the variants can be spotted quickly by sorting the list alphabetically. Bilingual extraction can even reveal synonyms in the source language through their shared foreign-language equivalents. For example, if “tie” is used in English for both “Krawatte” and “Schlips” in German, the two German words can be identified as synonyms.
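Both ideas can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method of any specific tool: the function names are hypothetical, spelling variants are assumed to differ only in case, spaces and hyphens, and the bilingual input is assumed to be a clean list of (German, English) term pairs.

```python
from collections import defaultdict

def group_spelling_variants(terms):
    """Group terms that become identical once case, spaces and hyphens
    are ignored; only groups with more than one spelling are returned."""
    groups = defaultdict(list)
    for term in terms:
        key = term.lower().replace("-", "").replace(" ", "")
        groups[key].append(term)
    return [variants for variants in groups.values() if len(variants) > 1]

def synonym_candidates(term_pairs):
    """term_pairs holds (german, english) pairs. German terms that share
    one English equivalent are returned as synonym candidates."""
    by_equivalent = defaultdict(set)
    for german, english in term_pairs:
        by_equivalent[english.lower()].add(german)
    return {en: sorted(de) for en, de in by_equivalent.items() if len(de) > 1}

print(group_spelling_variants(["ink jet printers", "inkjet printers", "toner"]))
print(synonym_candidates([("Krawatte", "tie"), ("Schlips", "tie"), ("Hemd", "shirt")]))
```

The results are only candidates: whether two spellings or two words really denote the same concept still has to be confirmed by a person.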
While terminology work is far from finished once a list of relevant terms has been compiled, extraction is a way to reach an important first milestone. Companies should seek support from a language service provider with terminology expertise, both for the extraction itself and for the linguistic and terminological clean-up and enrichment with additional information. Through our services or in a personal terminology consultation, we will be happy to explain how this works effectively.