The expression “term mining” can mean two things. First, it refers to the initial build-up of a terminology inventory in one or more languages: going back to existing texts, extracting the terminology used in them and processing it for the company for the first time. Second, it covers the continuous search for new terms and the expansion of the existing stock.
The process can be carried out manually by searching for and collecting terminology candidates or automatically by extracting them from documents or translation projects.
There are generally five sources from which to retrieve terminology:
- Terminology in its pure form, as it appears in catalogues, product lists and glossaries. Terms can be easily extracted from these sources and prepared as a dataset.
- Lists and existing datasets that have already been compiled in projects or departments. Companies often have several of these lists circulating that other departments may not even know about.
- Active communication within the workforce. This is often implicit terminology knowledge, as everyone in the company communicates (possibly differently) about products and services.
- Active text creation: Terminology is researched and used when creating the source language text and when translating it. Collecting this terminology again separately can be described as manual extraction from texts.
- Partially automated extraction from existing documents and datasets. Here, a body of text is checked for the terms used in it and these are extracted in list form or directly into a database.
In the latter two forms of extraction, the mining metaphor works well: searching a pile of soil for rocks by hand and sieve, then examining and evaluating them directly, gives an accurate result with a low error rate. The time required is manageable, and certainly less than it would take to develop special tools or machinery for the purpose. As the amount of earth increases, however, there is no way around further aids, up to and including heavy machinery.
The metaphor can also be applied to terminology work: small amounts of text can easily be searched manually for technical terms, which are then extracted, cleaned up and directly enriched with additional information such as a suitable context sentence. For large bodies of text, however, this manual work is too time-consuming and, although each individual decision may be terminologically accurate, prone to errors, because every candidate has to be checked against the terms that have already been recorded.
As the amount of text increases, machines are therefore used in the term mining process to analyse all the source material quickly and comprehensively and to extract potential terminology candidates from the words used in it. Some tools can also estimate the relevance and frequency of each candidate and automatically extract context sentences. However, just as in mining, the machines do not deliver clean end results, but raw data or preliminary stages that subsequently have to be evaluated and cleaned up. Extraction may therefore be partially automated, with the result of the automated step being post-edited manually.
In the “heavy machinery” of software-based terminology extraction, a distinction is made between linguistic and statistical methods.
Linguistic systems are based on word formation rules and syntactic algorithms of a specific language. They are thus always language-dependent. The system can trace back terminology candidates to a basic form (stemming) and recognise the part of speech of the candidate (tagging). The result of this method is a list of linguistically correct words, which, however, are not necessarily relevant from a terminology point of view. The dependence on language-specific rules also means that linguistic extraction systems are only available for a limited number of languages and that the known providers can only offer a handful of languages as the source language for extraction.
Statistical systems, on the other hand, function independently of language, as they determine word frequency, i.e. how often a word occurs in the body of text. The language in which this happens is irrelevant, as it is a pure string match. As a rule, the tool can be configured with the minimum frequency of occurrence at which a term is extracted and the maximum number of words a term may consist of.
However, the extracted words are not classified according to part of speech. Together with the purely statistical matching, this leads to an extraction result that usually contains many common language terms as well as plural and inflectional forms. Stop word lists that exclude certain words (e.g. articles, conjunctions, common verbs) in advance can help to prevent a profusion of common words from being included in the results. However, the result of a statistical extraction must always be cleaned up both linguistically and terminologically.
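The following is a minimal sketch of such a statistical extraction, not the method of any particular tool. The function name `extract_candidates`, the parameters `min_freq` and `max_words`, the tiny stop word list and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop word list; real lists are much longer and language-specific.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "is", "are",
              "to", "in", "on", "for", "with"}


def extract_candidates(text, min_freq=2, max_words=2, stop_words=STOP_WORDS):
    """Return {candidate: frequency} for word sequences of up to max_words words.

    Pure string matching: no part-of-speech tagging and no base forms,
    so plural and inflected forms are counted as separate candidates.
    Sequences are built across sentence boundaries for simplicity.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip candidates that start or end with a stop word.
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue
            counts[" ".join(gram)] += 1
    # Keep only candidates that reach the configured minimum frequency.
    return {term: freq for term, freq in counts.items() if freq >= min_freq}


sample = (
    "The pressure valve is mounted on the compressor. "
    "Check the pressure valve and the pressure valves of the compressor."
)
candidates = extract_candidates(sample)
print(candidates)
```

In this sample, “valve” and “valves” are counted separately and the general word “pressure” appears alongside the compound “pressure valve”, which illustrates why the output still needs linguistic and terminological clean-up.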
A statistical extraction can be monolingual from a source language or bilingual from previous translations. In the second case, the statistically relevant equivalent in the target language is extracted for each candidate term. This approach even makes it possible to recognise synonyms in the source language through their foreign language equivalents. For example, if the German terms “Krawatte” and “Schlips” are both translated as “tie” in English, the shared English equivalent reveals that the two German terms are synonyms.
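The synonym detection can be sketched in a few lines, assuming the bilingual extraction has already produced pairs of source term and target equivalent. The function name `find_synonym_groups` and the sample pairs are hypothetical.

```python
from collections import defaultdict


def find_synonym_groups(term_pairs):
    """Group source terms that share the same target-language equivalent.

    term_pairs: iterable of (source_term, target_term) tuples.
    Returns {target_term: [source terms]} for targets with more than
    one distinct source term, i.e. likely source-language synonyms.
    """
    by_target = defaultdict(set)
    for source_term, target_term in term_pairs:
        by_target[target_term.lower()].add(source_term)
    return {target: sorted(sources)
            for target, sources in by_target.items()
            if len(sources) > 1}


# Hypothetical output of a bilingual German-English extraction.
pairs = [("Krawatte", "tie"), ("Schlips", "tie"), ("Hemd", "shirt")]
groups = find_synonym_groups(pairs)
print(groups)  # {'tie': ['Krawatte', 'Schlips']}
```

A shared equivalent is only a hint: a target word with several meanings can group source terms that are not synonyms at all, so the groups still need human review.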
Linguistic systems usually deliver higher quality results, but are also more complex and therefore more expensive. Moreover, they do not take into account the frequency of words, which often provides useful information about the relevance of technical terms. Statistical systems, on the other hand, can be put to use quickly and capture the frequency of occurrence, but always require linguistic rework because they rely on simple string matching. A solution that meets both requirements is hybrid term extraction, which combines linguistics with statistics by analysing which words in a monolingual body of text are suitable terminology candidates based on both their frequency and their linguistic form. Because this analysis works on a monolingual body of text, however, hybrid systems are only suitable for monolingual terminology extraction.
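The hybrid idea can be illustrated with a toy sketch: a linguistic step reduces each word to a base form, and a statistical step then counts the base forms. The function `naive_base_form` is a deliberately crude stand-in for real stemming or lemmatisation and only handles simple English “-s” endings. All names, the stop word list and the sample text are assumptions.

```python
import re
from collections import Counter

# Hypothetical stop word list for the sample below.
STOP_WORDS = {"the", "a", "and", "is", "both", "each"}


def naive_base_form(word):
    """Toy rule: strip a final "s" (but not "ss") from words longer than 3 letters.

    A real linguistic system would use language-specific stemming or
    lemmatisation; this rule fails on many English words.
    """
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def hybrid_candidates(text, min_freq=2):
    """Count base forms instead of surface forms, then apply a frequency threshold."""
    tokens = [naive_base_form(t) for t in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return {term: freq for term, freq in counts.items() if freq >= min_freq}


sample = ("The valve is open. Check both valves and the hoses. "
          "A hose connects each valve.")
result = hybrid_candidates(sample)
print(result)  # {'valve': 3, 'hose': 2}
```

Unlike a purely statistical count, “valve” and “valves” are merged into one candidate here, so the frequency reflects the term rather than each inflected form.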
In an ideal scenario, the output of hybrid systems is therefore statistically relevant and already linguistically clean. But neither of these necessarily amounts to terminological relevance, because terminology is very subjective: what is an important technical term for one company may be completely irrelevant for another. And just because a word sounds particularly technical doesn’t mean it has to be part of the corporate language. For example, “bellows” is relevant for a company in the compressed air industry, but not for a software company, whereas “merge request” is a technical term from the IT sector that other companies do not need in their terminology list. The relevance of individual term candidates is therefore ultimately decided by those responsible for the project. This also means that manual rework and a human decision are always required to finally evaluate and validate the results of an extraction.
While the creation of a list of relevant terms is only the beginning of systematic terminology work, it can be a first milestone in gradually moving from the raw output to a company’s valuable terminology data.
If you would like to extract terminology from your database or have existing terminology expanded and cleaned up, then feel free to contact us at email@example.com.