Curating translation memories is a long-standing, pervasive problem, and manual cleaning is a laborious process. Clean translation memories lead to better references for linguists and higher-quality machine translation, which is especially relevant for Phrase NextMT given its advanced capabilities for leveraging language assets such as translation memories and glossaries.
To create a dataset for the purpose of using a curated TM in TMS, follow these steps:
-
On the Datasets page, click Clean a translation memory.
The
page opens. -
Provide a name for the dataset.
-
The language selectors allow for various options:
-
To create a general language dataset, select the same source and target languages in the source and target language and locale selectors.
-
To create a locale-specific dataset, select the source and target languages from the first dropdown list then specify the source and target locales from the second dropdown list.
Multiple target locales can also be added to leverage more data sources.
-
To create a dataset with multiple source and target locales, select the source and target languages from the first dropdown list, specify the source and target locales from the second dropdown list (multiple target locales can be added) and click on + Add more locale pairs.
The
window appears. -
-
Click Add translation memories.
The
page opens with search functionality. -
To add a TM to the dataset, click the icon. The TM is added to the
column. Multiple TMs can be added, up to a maximum of 200 TMs and 20 million segments.
Clicking on the TM name will present the selection on the translation memory page.
Click the icon to remove the TM from the
column. -
Click Save.
The
page opens. -
Review the details and, if they are correct, click Continue.
The
page opens. -
Apply required filters and click Create.
The dataset is created and added to the list on the
page with the initial status of and the status of in the column.
Phrase Custom AI allows curating translation memories with the help of AI-powered and rule-based cleaning filters. Default settings are provided which may be suitable for new users.
This process preserves the original TM segment metadata and TM tags, which allows users to maintain TM leverage when using the cleaned TMs in TMS.
The set of available filters includes both rule-based filters and ML-based filters:
-
Rule-based
Filters that operate with clearly defined rules that are easily understandable by humans. This filter category includes
, , , , , , . -
ML-based
Filters that analyze the content of the text itself to make a decision, rather than simply following a fixed set of rules. This filter category includes
, and .
Date range
Both the start date and the end date are included in the range. The date taken into account is the date of last modification of a segment.
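The inclusive comparison can be sketched as follows. This is a minimal illustration, not the product's implementation: the function name and the segment layout are hypothetical, and it assumes the filter keeps segments whose last-modification date falls inside the range.

```python
from datetime import date

def in_date_range(last_modified: date, start: date, end: date) -> bool:
    """True if the last-modification date lies in [start, end], both ends included."""
    return start <= last_modified <= end

# Hypothetical segments; only the last-modification date matters here.
segments = [
    {"source": "Hello", "last_modified": date(2023, 1, 1)},    # equals the start date
    {"source": "World", "last_modified": date(2023, 6, 30)},   # equals the end date
    {"source": "Old", "last_modified": date(2022, 12, 31)},    # one day too early
]

kept = [
    s for s in segments
    if in_date_range(s["last_modified"], date(2023, 1, 1), date(2023, 6, 30))
]
```

Segments dated exactly on the start or end date are kept because both comparisons are inclusive.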
Misaligned source and target
This filter allows users to determine how well the source and target segments match in terms of meaning and semantic similarity, removing the worst-rated segments. The sentence pair alignment is measured using the LASER metric.
An AI engine is used to check whether, and to what degree, the source and target text mean the same thing. The recommended setting discards the 10% worst segments while keeping the 90% best segments.
Advanced settings allow changing the alignment setting or filtering on the raw similarity score, a number between 0 and 1 (1 meaning complete alignment). Caution is advised when using the raw similarity score, as each language pair has a different distribution of scores and what is considered a good score for one language pair may be an unsatisfactory score for another.
Typically, segments below 0.5 are not very good, and segments close to or over 1 are segments that are the same in both languages.
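The difference between the percentage setting and a raw score threshold can be illustrated with a short sketch. The scores below are made up for illustration, the function names are hypothetical, and the rounding of the number of dropped segments is an assumption.

```python
def drop_worst_percent(scores, percent=10):
    """Return the scores that remain after removing the lowest `percent` of them."""
    n_drop = len(scores) * percent // 100            # assumption: round down
    worst = set(sorted(range(len(scores)), key=scores.__getitem__)[:n_drop])
    return [s for i, s in enumerate(scores) if i not in worst]

def share_below(scores, threshold):
    """Fraction of scores that a raw threshold would remove."""
    return sum(s < threshold for s in scores) / len(scores)

# Made-up similarity scores for two language pairs with different distributions.
pair_a = [0.92, 0.88, 0.95, 0.81, 0.79, 0.90, 0.85, 0.97, 0.83, 0.45]
pair_b = [0.62, 0.58, 0.71, 0.55, 0.66, 0.49, 0.68, 0.60, 0.73, 0.30]

# The percentage setting removes the same share from both pairs...
kept_a = drop_worst_percent(pair_a)
kept_b = drop_worst_percent(pair_b)

# ...while one raw threshold removes very different shares.
removed_a = share_below(pair_a, 0.7)
removed_b = share_below(pair_b, 0.7)
```

With these sample values, the same raw threshold removes a small part of the first pair and most of the second, which is why the raw score needs to be tuned per language pair.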
Sentence pair length
This filter removes all segments that are longer than the threshold value set by users.
The total character count includes all characters (letters, white spaces and punctuation) from both the source and target sentences. Take the type of language into consideration (for example, Chinese and English): if the source language is not CJK-like and the target language is CJK (or the other way around), this filter will be ignored.
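A minimal sketch of this check, under stated assumptions: the set of language codes treated as CJK, the function names, and the use of Python's `len` as the character count are all illustrative choices, not details confirmed by the product.

```python
CJK_LANGS = {"zh", "ja", "ko"}  # assumption: codes treated as CJK for this sketch

def is_mixed_cjk(src_lang: str, tgt_lang: str) -> bool:
    """True when exactly one side of the pair is a CJK language."""
    return (src_lang in CJK_LANGS) != (tgt_lang in CJK_LANGS)

def keep_by_pair_length(source, target, src_lang, tgt_lang, max_chars):
    """Keep the segment unless the combined character count exceeds max_chars."""
    if is_mixed_cjk(src_lang, tgt_lang):
        return True  # the filter is ignored for mixed CJK / non-CJK pairs
    total = len(source) + len(target)  # letters, white spaces and punctuation
    return total <= max_chars

src, tgt = "This is a sentence.", "Dies ist ein Satz."
kept_loose = keep_by_pair_length(src, tgt, "en", "de", max_chars=40)
kept_strict = keep_by_pair_length(src, tgt, "en", "de", max_chars=30)
kept_mixed = keep_by_pair_length(src, "这是一个句子。", "en", "zh", max_chars=5)
```

The English and German sentences together contain 37 characters, so the pair passes a threshold of 40 and fails a threshold of 30, while the English and Chinese pair is never removed by this filter.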
Length ratio
This filter identifies segments where the source segment and the target segment differ significantly in length. Some translations increase or decrease in length when translating from a source to a target language, but translations that are too long or too short may indicate low-quality segments.
If the source language is not CJK-like and the target language is CJK (or the other way around), this filter will be ignored.
Some languages are more verbose than others, so 200% is a good default. If the target language is similar to the source language, or more data needs to be filtered out, the value can be lower.
Examples:
One language is CJK; the ratio is 1. It will not be discarded:
{"source": "This is a sentence.", "target": "这是一个句子。", "ratio": 1}
The German translation is of comparable length to the English source and will not be discarded:
{"source": "This is a sentence.", "target": "Dies ist ein Satz.", "ratio": 1.1}
The German translation is a lot longer than the English source and will be discarded:
{"source": "This is a sentence.", "target": "Dies ist ein Satz mit zusätzlichen unnötigen Füllungen.", "ratio": 3.1}
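The decision made in these examples can be sketched as below. The sketch assumes the ratio is the longer side divided by the shorter side, counted in characters, with 200% expressed as 2.0. The exact unit of length used by the product is not stated, so computed values may differ slightly from the ratios shown in the examples. The function names and the CJK language set are hypothetical.

```python
CJK_LANGS = {"zh", "ja", "ko"}  # assumption: codes treated as CJK for this sketch

def length_ratio(source: str, target: str) -> float:
    """Longer length divided by shorter length, in characters (assumed unit)."""
    shorter, longer = sorted((len(source), len(target)))
    return float("inf") if shorter == 0 else longer / shorter

def is_discarded(source, target, src_lang, tgt_lang, max_ratio=2.0):
    """True if the segment would be removed by the length ratio filter."""
    if (src_lang in CJK_LANGS) != (tgt_lang in CJK_LANGS):
        return False  # filter ignored when only one side is CJK
    return length_ratio(source, target) > max_ratio

english = "This is a sentence."
cjk_case = is_discarded(english, "这是一个句子。", "en", "zh")
short_case = is_discarded(english, "Dies ist ein Satz.", "en", "de")
long_case = is_discarded(
    english, "Dies ist ein Satz mit zusätzlichen unnötigen Füllungen.", "en", "de"
)
```

Lowering `max_ratio` makes the filter stricter, which matches the advice to use a lower value for similar languages or when more data needs to be filtered out.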
Non-translatables
Non-translatables are segments where the source and target segments are the same. This filter excludes all non-translatable sentence pairs, where the target text remains unchanged from the source text.
Duplicates
Segments that have the same source sentence are grouped together, and only the best segment from each group is kept. If a segment’s source sentence is unique, the segment is kept automatically. Otherwise, the segment with the highest similarity score is kept.
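The grouping logic can be sketched in a few lines. The segment layout (a dictionary with `source`, `target` and `similarity` keys) and the function name are assumptions made for illustration only.

```python
def keep_best_per_source(segments):
    """Group segments by source sentence and keep the highest-scoring one per group."""
    best = {}
    for seg in segments:
        key = seg["source"]
        # A unique source is kept automatically; otherwise the higher score wins.
        if key not in best or seg["similarity"] > best[key]["similarity"]:
            best[key] = seg
    return list(best.values())

segments = [
    {"source": "Click Save.", "target": "Klicken Sie auf Speichern.", "similarity": 0.91},
    {"source": "Click Save.", "target": "Speichern.", "similarity": 0.74},
    {"source": "Click Cancel.", "target": "Klicken Sie auf Abbrechen.", "similarity": 0.88},
]

deduplicated = keep_best_per_source(segments)
```

In this sample, the two segments sharing the source "Click Save." are reduced to the one with the higher similarity score, and the segment with a unique source is kept unchanged.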
Near-duplicates
When testing for near-duplicates, a slightly cleaned version of the source sentence is produced by normalization: all non-letter characters (for example “ , ? ) ! -) are replaced with a space and all letters are rendered lowercase.
Segments that have the same normalized source sentence are grouped together, and only the best segment from each group is kept. If a segment’s normalized source sentence is unique, the segment is kept automatically. Otherwise, the segment with the highest similarity score is kept.
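The normalization step can be sketched as follows. This follows the description literally: every non-letter character becomes a single space and nothing else is changed. Whether the product also collapses repeated spaces is not stated, so the sketch does not do so. The function name is hypothetical.

```python
def normalize_source(text: str) -> str:
    """Replace each non-letter character with a space, then lowercase the result."""
    return "".join(ch if ch.isalpha() else " " for ch in text).lower()

# Two sources that differ only in case and punctuation share one normalized form.
first = normalize_source("Click Save!")
second = normalize_source("click-save?")
same_group = first == second
```

Segments whose sources normalize to the same string fall into the same group, after which the highest similarity score decides which one is kept.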
Language identification
An AI engine is used to identify the source and target language based on the sentences. A segment is removed only if the engine recognizes a source or target language and that language is different from the expected one. Shorter sentences, for example, are often not enough for the engine to determine a language.
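The removal rule can be sketched independently of any particular detection engine. In the sketch below, `toy_detect` is a deliberately simplistic stand-in that exists only to make the example runnable. It is not the engine used by the product, and all names are hypothetical.

```python
from typing import Callable, Optional

def keep_by_language(text: str, expected: str,
                     detect: Callable[[str], Optional[str]]) -> bool:
    """Keep the text unless a language is recognized and differs from the expected one."""
    detected = detect(text)
    if detected is None:
        return True  # no language recognized, so the text is not removed
    return detected == expected

def toy_detect(text: str) -> Optional[str]:
    """Stand-in detector: gives up on short input, otherwise matches a few marker words."""
    words = text.lower().rstrip(".!?").split()
    if len(words) < 3:
        return None
    if "is" in words or "the" in words:
        return "en"
    if "ist" in words or "der" in words:
        return "de"
    return None

short_kept = keep_by_language("OK", "en", toy_detect)
match_kept = keep_by_language("This is a sentence.", "en", toy_detect)
mismatch_kept = keep_by_language("Dies ist ein Satz.", "en", toy_detect)
```

A segment would be kept only when both its source and its target pass this check against their respective expected languages.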
QPS
The QPS filter makes it possible to remove the lowest-quality sentence pairs in the translation memory to ensure that the resulting segments are of the highest quality.
The QPS filter can be configured in two ways:
-
Removing a specified percentage of sentence pairs with the lowest QPS scores. The recommendation is 10%.
-
Selecting a score threshold. Use the advanced settings to eliminate sentence pairs falling below an adjustable QPS threshold. The recommended starting point is 50.
These two options provide automated translation memory curation to align with users’ quality objectives.
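The two configuration modes can be sketched as one function that accepts exactly one of the two settings. The parameter names, the `qps` key and the scores are illustrative assumptions, and the sketch assumes that a segment scoring exactly at the threshold is kept.

```python
def qps_filter(segments, percent=None, threshold=None):
    """Filter by QPS: drop the lowest `percent` of pairs, or all pairs below `threshold`."""
    if (percent is None) == (threshold is None):
        raise ValueError("Provide exactly one of percent or threshold")
    if threshold is not None:
        return [s for s in segments if s["qps"] >= threshold]
    n_drop = len(segments) * percent // 100          # assumption: round down
    order = sorted(range(len(segments)), key=lambda i: segments[i]["qps"])
    worst = set(order[:n_drop])
    return [s for i, s in enumerate(segments) if i not in worst]

# Made-up QPS values for ten sentence pairs.
segments = [{"id": i, "qps": q}
            for i, q in enumerate([95, 40, 72, 88, 51, 49, 67, 80, 99, 60])]

by_percent = qps_filter(segments, percent=10)      # recommended percentage
by_threshold = qps_filter(segments, threshold=50)  # recommended starting threshold
```

With these sample values, the percentage mode removes the single lowest-scoring pair, while the threshold mode removes every pair scoring below 50.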
The translation memory cleaning process, which may take several hours, must be complete before a curated TM can be used.
To use a curated TM in TMS, follow these steps:
This will trigger a dataset export process that will take only a few minutes. The resulting curated TM in .TMX format can then be uploaded to TMS as a new, curated TM up to 1 GB in size.
If two or more cleaning processes have been performed on the same TM, different versions can be accessed in the
tab.