When it comes to training MT engines, the most important ingredient is not volume alone, but data quality. Cleaning data is a long-standing, pervasive problem, and manual cleaning is a laborious process. Clean data leads to faster training and higher-quality models.
Phrase Custom AI allows adapting translation memories into datasets with the help of AI-powered and rule-based cleaning filters. Default settings are provided which may be suitable for new users.
The set of available filters includes both rule-based and ML-based filters:
- Rule-based: Filters that operate with clearly defined rules that are easily understandable by humans. This filter category includes Date range, Minimum character and letter count, Sentence pair length, Length ratio, Non-translatables, Duplicates, and Near-duplicates.
- ML-based: Filters that analyze the content of the text itself to make a decision, rather than simply following a fixed set of rules. This filter category includes Misaligned source and target, Language identification, and QPS.
All filters are evaluated on cleaned versions of the segments; for example, multiple spaces are reduced to one and Phrase tags are removed.
Date range
Both the start and end dates are inclusive, with the date of a segment's last modification taken into account.
Misaligned source and target
This filter allows users to determine how well the source and target segments match in terms of meaning and semantic similarity, removing the worst-rated segments. Sentence pair alignment is measured using the LASER metric.
An AI engine is used to check whether, and to what degree, the source and target text mean the same thing. The recommended setting discards the 10% worst segments while keeping the 90% best segments.
Advanced settings allow changing the alignment percentage, or filtering on the raw similarity score using a number between 0 and 1 (1 meaning complete alignment). Caution is advised when using the raw similarity score: each language pair has a different distribution of scores, and what is considered a good score for one language pair may be unsatisfactory for another.
Typically, segments scoring below 0.5 are not very good, and segments scoring close to or over 1 are identical in both languages.
Examples:
{"source": "Super.", "target": "Super.", "similarity": 1.05}
{"source": "Hello", "target": "http://wwww.sdsadsa.com", "similarity": 0.3}
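As a rough illustration of the recommended percentage-based setting, the following sketch drops the lowest-scoring fraction of segments. The function name and the dictionary fields (`source`, `target`, `similarity`, mirroring the examples above) are illustrative assumptions, not the product's actual implementation.

```python
def filter_misaligned(segments, discard_fraction=0.10):
    """Keep the (1 - discard_fraction) best-aligned segments,
    ranked by their similarity score."""
    ranked = sorted(segments, key=lambda s: s["similarity"], reverse=True)
    keep_count = round(len(ranked) * (1 - discard_fraction))
    return ranked[:keep_count]

segments = [
    {"source": "Super.", "target": "Super.", "similarity": 1.05},
    {"source": "Hello", "target": "http://wwww.sdsadsa.com", "similarity": 0.3},
]
```

With the default `discard_fraction=0.10`, a dataset of 1,000 segments would keep the 900 best-aligned pairs.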
Minimum character and letter count
Character count includes all characters: letters, white spaces, punctuation, and symbols. For training purposes, it may be useful to discard segments that do not contain any letters.
Letter count counts only letters, including not just the English alphabet but also characters with diacritics and Chinese characters. One Chinese character is counted as one letter, even if it represents a whole word. For character-based languages, the default values are 1; for word-based languages, the defaults are 4 (characters) and 3 (letters). The minimum value is 1 and the maximum value is 500.
If the data contains many short segments that should be kept (for example, acronyms), keep the filter values low.
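The character/letter distinction can be sketched as follows; this is an assumption-based illustration using Python's `str.isalpha()`, which counts accented letters and Chinese characters as letters while excluding spaces, digits, and punctuation.

```python
def character_count(text):
    # Every character counts: letters, white spaces, punctuation, symbols.
    return len(text)

def letter_count(text):
    # Only letters count, including diacritics and CJK characters.
    return sum(ch.isalpha() for ch in text)

def passes_minimums(text, min_chars=4, min_letters=3):
    # Word-based language defaults: 4 characters and 3 letters.
    return character_count(text) >= min_chars and letter_count(text) >= min_letters
```

For example, `"Café!"` has 5 characters but only 4 letters, and a digits-only segment such as `"4711"` fails the letter minimum.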
Sentence pair length
This filter removes all segments longer than the threshold value set by users. The reason for this filter is that most NMT systems will not actually train on segments longer than their internal threshold.
For instance, NextMT’s internal threshold is 200 tokens, which equals approximately 100 - 1,000 words. To train a custom engine on shorter sentences, set this value lower than the default.
The total character count includes all characters (letters, white spaces, and punctuation) from both the source and target sentences. Take the type of language into consideration (for example, Chinese versus English): if the source language is not CJK-like and the target language is CJK (or the other way around), this filter will be ignored.
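A minimal sketch of this check, assuming a simplified CJK heuristic (the real product logic is not specified here): the combined character count of both sides is compared against the user-set threshold, and mixed CJK/non-CJK pairs are skipped.

```python
def is_cjk(text):
    # Crude heuristic: any character in the CJK Unified Ideographs block.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

def passes_length(source, target, max_chars=1000):
    # The filter is ignored when exactly one side is CJK.
    if is_cjk(source) != is_cjk(target):
        return True
    # Total character count covers both source and target.
    return len(source) + len(target) <= max_chars
```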
Length ratio
This filter identifies segments where length is significantly higher when comparing the source segment and the target segment. Some translations increase or decrease in length when translating from a source to a target language. Too long or too short translations may indicate low-quality training data.
If the source language is not CJK-like and the target language is CJK (or the other way around), this filter will be ignored.
Some languages are more verbose than others, so 200% is a good default. If the target language is similar to the source language, or more data needs to be filtered out, the value can be lower.
Examples:
One language is CJK - ratio is 1. It will not be discarded:
{"source": "This is a sentence.", "target": "这是一个句子。", "ratio": 1}
The German translation is of comparable length as the English source and will not be discarded:
{"source": "This is a sentence.", "target": "Dies ist ein Satz.", "ratio": 1.1}
The German translation is a lot longer than the English source and will be discarded:
{"source": "This is a sentence.", "target": "Dies ist ein Satz mit zusätzlichen unnötigen Füllungen.", "ratio": 3.1}
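The examples above can be reproduced with a small sketch. The 200% default corresponds to a maximum ratio of 2.0; the CJK heuristic and function names are illustrative assumptions.

```python
def is_cjk(text):
    # Crude heuristic: any character in the CJK Unified Ideographs block.
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

def length_ratio(source, target):
    # Ratio of the longer side to the shorter side, in characters.
    shorter, longer = sorted((len(source), len(target)))
    return longer / shorter if shorter else float("inf")

def passes_ratio(source, target, max_ratio=2.0):
    # The filter is ignored when exactly one side is CJK.
    if is_cjk(source) != is_cjk(target):
        return True
    return length_ratio(source, target) <= max_ratio
```

Note that character-based ratios will differ somewhat from the ratios shown in the examples, which may be computed on different units.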
Non-translatables
Non-translatables are segments where the source and target texts are identical. This filter excludes all non-translatable sentence pairs where the target text remains unchanged from the source text.
Duplicates
Segments are grouped by identical source sentence. From each group, only the best segment is kept: if a segment's source sentence is unique, it is automatically kept; otherwise, the segment with the highest similarity score is kept.
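The grouping step can be sketched as a single pass that remembers the highest-scoring segment per source sentence; field names follow the earlier examples and are assumptions.

```python
def deduplicate(segments):
    """Keep one segment per distinct source sentence: the one with the
    highest similarity score. Unique sources are kept automatically."""
    best = {}
    for seg in segments:
        key = seg["source"]
        if key not in best or seg["similarity"] > best[key]["similarity"]:
            best[key] = seg
    return list(best.values())
```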
Near-duplicates
When testing for near-duplicates, the (slightly cleaned version of a) source sentence is normalized: all non-letter characters (for example “ , ? ) ! -) are replaced with a space, and all letters are lowercased.
Segments are then grouped by identical normalized source sentence. From each group, only the best segment is kept: if a segment's normalized source sentence is unique, it is automatically kept; otherwise, the segment with the highest similarity score is kept.
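The normalization step can be sketched as follows (an illustrative implementation, not the product's exact rules): lowercase the sentence, replace every non-letter character with a space, and collapse runs of spaces, so near-duplicate sources map to the same grouping key.

```python
import re

def normalize(source):
    """Normalize a source sentence for near-duplicate grouping."""
    lowered = source.lower()
    # Replace every non-letter character (punctuation, digits, symbols)
    # with a space, then collapse whitespace.
    spaced = "".join(ch if ch.isalpha() else " " for ch in lowered)
    return re.sub(r"\s+", " ", spaced).strip()
```

Under this scheme, "Hello, world!" and "HELLO world" both normalize to "hello world" and land in the same group.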
Language identification
An AI engine is used to identify the source and target languages based on the sentences. A segment is only removed if the engine recognizes a (source or target) language and that language differs from the expected one; shorter sentences, for example, are often not enough for the engine to determine a language, so they are kept.
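The decision rule, as opposed to the detector itself, can be sketched as below. The `detect` function here is a deliberately crude stand-in (length cutoff plus a CJK check) labeled as an assumption; a real system would use a trained language identifier.

```python
def detect(text, min_chars=12):
    """Toy stand-in for a language identifier.
    Returns None when the text is too short to judge."""
    if len(text) < min_chars:
        return None  # short sentences often cannot be identified reliably
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "zh"
    return "latin-script"  # placeholder bucket for this toy example

def keep_segment(text, expected):
    # A segment is removed only when a language IS recognized
    # AND it differs from the expected language.
    guess = detect(text)
    return guess is None or guess == expected
```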
QPS
The QPS filter makes it possible to remove the lowest-quality sentence pairs in the dataset to ensure that the resulting AI models are trained on the highest-quality data available. Generally, the higher the quality of training data, the better the customized model performs.
The QPS filter can be configured in two ways:
- Removing a specified percentage of sentence pairs with the lowest QPS scores. The recommendation is 10%.
- Selecting a score threshold. Use the advanced settings to eliminate sentence pairs falling below an adjustable QPS threshold. The recommended starting point is 50.
These two options provide automated dataset curation to align with users’ quality objectives.
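The two configuration modes can be sketched side by side; the field name `qps` and a 0-100 score range are assumptions for illustration.

```python
def filter_by_percentage(segments, discard_pct=10):
    """Drop the discard_pct% of sentence pairs with the lowest QPS scores."""
    ranked = sorted(segments, key=lambda s: s["qps"], reverse=True)
    keep = round(len(ranked) * (100 - discard_pct) / 100)
    return ranked[:keep]

def filter_by_threshold(segments, min_qps=50):
    """Drop every sentence pair scoring below the QPS threshold."""
    return [s for s in segments if s["qps"] >= min_qps]
```

The percentage mode always removes a fixed share of the data regardless of absolute quality, while the threshold mode removes a variable share depending on how the scores are distributed.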