Custom AI

Cleaning Filters

Phrase Custom AI turns translation memories into datasets with the help of AI-powered and rule-based cleaning filters. Default settings are provided and may be suitable for new users.

All filters are evaluated on cleaned versions of the segments. For example, multiple spaces are reduced to one and Phrase tags are removed.
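The cleaning step can be pictured with a minimal sketch. The tag pattern below is an assumption made for illustration (inline markers such as {1}, {1> and <1}), and trimming leading and trailing spaces is also an assumption; the actual cleaning rules are not documented here.

```python
import re

# Hypothetical tag pattern: assumes inline tags look like {1}, {1> or <1}.
TAG_PATTERN = re.compile(r"\{\d+>|<\d+\}|\{\d+\}")

def clean_segment(text: str) -> str:
    """Remove the assumed tag markers, then reduce whitespace runs to one space."""
    without_tags = TAG_PATTERN.sub("", text)
    collapsed = re.sub(r"\s+", " ", without_tags)
    # Trimming the ends is an assumption, not something the article states.
    return collapsed.strip()

print(clean_segment("Click  {1>Save<1}   now"))
```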

Date range

Both the start date and the end date are included in the range. The filter uses the date on which a segment was last modified.
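As a small illustration of the inclusive range, the check might look like the sketch below. The function and parameter names are hypothetical.

```python
from datetime import date

def in_date_range(last_modified: date, start: date, end: date) -> bool:
    """Return True when the last-modified date falls within [start, end], bounds included."""
    return start <= last_modified <= end

# A segment modified exactly on the end date is still kept.
print(in_date_range(date(2024, 3, 31), date(2024, 1, 1), date(2024, 3, 31)))
```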

Misaligned source and target

This filter rates how closely the source and target segments match in meaning and removes the worst-rated segments. Sentence pair alignment is measured using the LASER metric.

An AI engine checks whether, and to what degree, the source and target text mean the same thing. The recommended setting discards the worst-scoring 10% of segments and keeps the best 90%.

Advanced settings allow changing the alignment setting, or filtering on the raw similarity score instead, using a number between 0 and 1 (1 meaning complete alignment). Use the raw similarity score with caution: each language pair has a different distribution of scores, so a good score for one language pair may be an unsatisfactory score for another.
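The difference between the two modes can be sketched as follows, assuming similarity scores have already been computed for each segment. The function names are hypothetical, and whether the raw threshold is inclusive is an assumption.

```python
def keep_best_fraction(scores: list[float], discard_fraction: float = 0.10) -> list[int]:
    """Return the indices kept after discarding the lowest-scoring fraction."""
    n_discard = int(len(scores) * discard_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[n_discard:])

def keep_above_threshold(scores: list[float], threshold: float) -> list[int]:
    """Return the indices whose raw score meets the threshold (inclusive by assumption)."""
    return [i for i, score in enumerate(scores) if score >= threshold]

scores = [0.9, 0.1, 0.8, 0.7, 0.95, 0.6, 0.85, 0.75, 0.65, 0.55]
print(keep_best_fraction(scores))         # relative: adapts to the score distribution
print(keep_above_threshold(scores, 0.8))  # absolute: same cut-off for every language pair
```

The percentage-based mode always removes the same share of segments, whereas a fixed raw threshold can remove very different shares depending on how scores are distributed for a given language pair.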

Minimum character count

Character count includes all characters: letters, white spaces, punctuation, and symbols.

Letter count includes only letters: those of the English alphabet, but also characters with diacritics and Chinese characters. One Chinese character is counted as one letter, even if it represents more than one character.
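The two counts can be contrasted with a short sketch. Treating every Unicode code point in a letter category (L*) as a letter is an assumption used here for illustration, not a statement of how the product classifies characters.

```python
import unicodedata

def character_count(text: str) -> int:
    """Count every character, including spaces, punctuation, and symbols."""
    return len(text)

def letter_count(text: str) -> int:
    """Count only characters whose Unicode category is a letter (assumed definition)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("L"))

sample = "H\u00e9llo, \u4e16\u754c!"  # "Héllo, 世界!"
print(character_count(sample), letter_count(sample))
```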

Sentence pair length

The total character count includes all characters (letters, but also white spaces and punctuation) from both the source and target sentences. Be sure to take the type of language into consideration (for example, Chinese and English). If only one of the source and target languages is CJK, this filter is ignored.
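A minimal sketch of this rule is shown below. The set of CJK language codes, the function names, and the minimum and maximum bounds are all assumptions introduced for illustration.

```python
# Assumption: CJK languages are identified by these primary language codes.
CJK_LANGUAGES = {"zh", "ja", "ko"}

def is_cjk(lang_code: str) -> bool:
    """Check the primary subtag of a language code such as 'zh-CN'."""
    return lang_code.split("-")[0].lower() in CJK_LANGUAGES

def pair_length_ok(source: str, target: str, src_lang: str, tgt_lang: str,
                   min_total: int, max_total: int) -> bool:
    """Apply hypothetical bounds to the combined character count of a sentence pair."""
    if is_cjk(src_lang) != is_cjk(tgt_lang):
        return True  # exactly one side is CJK, so the filter is ignored
    total = len(source) + len(target)
    return min_total <= total <= max_total

print(pair_length_ok("Hello world", "Bonjour le monde", "en", "fr", 10, 100))
```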

Length ratio

This filter identifies segments where the lengths of the source segment and the target segment differ significantly. Some translations naturally become longer or shorter when moving from a source to a target language, but translations that are much too long or much too short may indicate low-quality training data.

If only one of the source and target languages is CJK, this filter is ignored.
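One way to picture a length-ratio check is the sketch below. Dividing the longer length by the shorter one and the default limit of 3.0 are assumptions chosen for illustration; the article does not state the formula or the threshold.

```python
def length_ratio(source: str, target: str) -> float:
    """Ratio of the longer text to the shorter one, in characters."""
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    return float("inf") if shorter == 0 else longer / shorter

def passes_length_ratio(source: str, target: str, max_ratio: float = 3.0) -> bool:
    """Keep the pair when the ratio does not exceed the hypothetical limit."""
    return length_ratio(source, target) <= max_ratio

print(length_ratio("abcd", "ab"))
print(passes_length_ratio("a" * 10, "ab"))
```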

Non-translatables

This filter excludes all non-translatable sentence pairs, that is, pairs where the target text remains unchanged from the source text.

Duplicates

Segments that have the same source sentence are grouped, and only the best segment from each group is kept. If a segment’s source sentence is unique, the segment is kept automatically. Otherwise, the segment with the highest similarity score is kept.
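The grouping logic can be sketched as follows. Representing a segment as a dictionary with "source", "target", and "score" keys is a hypothetical choice, and ties are resolved in favor of the first segment seen, which is an assumption.

```python
def drop_duplicates(segments: list[dict]) -> list[dict]:
    """Keep, for each distinct source sentence, the segment with the highest score."""
    best: dict[str, dict] = {}
    for seg in segments:
        key = seg["source"]
        # A unique source is stored once and never replaced.
        if key not in best or seg["score"] > best[key]["score"]:
            best[key] = seg
    return list(best.values())

segments = [
    {"source": "Save file", "target": "Datei speichern", "score": 0.92},
    {"source": "Save file", "target": "Speichern", "score": 0.71},
    {"source": "Open file", "target": "Datei öffnen", "score": 0.88},
]
for seg in drop_duplicates(segments):
    print(seg["source"], "->", seg["target"])
```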

Near-duplicates

When testing for near-duplicates, a slightly cleaner version of the source sentence is normalized: all non-letter characters (for example “ , ? ) ! -) are replaced with a space and all letters are converted to lowercase.

Segments that have the same normalized source sentence are then grouped, and only the best segment from each group is kept. If a segment’s normalized source sentence is unique, the segment is kept automatically. Otherwise, the segment with the highest similarity score is kept.
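The normalization step can be sketched as below. Collapsing the resulting runs of spaces is an assumption added so that variants differing only in punctuation map to the same key; the function name is hypothetical.

```python
def normalize_source(text: str) -> str:
    """Replace non-letter characters with spaces and lowercase the letters."""
    replaced = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    # Assumption: repeated spaces are collapsed and the ends are trimmed.
    return " ".join(replaced.split())

# Both variants produce the same normalized key, so they fall into one group.
print(normalize_source("Hello, World!"))
print(normalize_source("hello world?"))
```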

Language identification

An AI engine identifies the source and target languages from the sentences. A segment is removed only if the engine recognizes a source or target language and that language differs from the expected one. Shorter sentences, for example, often do not give the engine enough text to determine a language, so such segments are not removed.
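The removal rule can be expressed as a small decision function. The sketch assumes a detector that returns a language code, or None when it cannot decide; the detector itself and all names here are hypothetical.

```python
from typing import Optional

def language_mismatch(detected: Optional[str], expected: str) -> bool:
    """True only when a language was recognized and it differs from the expected one."""
    if detected is None:
        return False  # undetermined language: never a reason to remove
    return detected != expected

def should_remove(detected_source: Optional[str], detected_target: Optional[str],
                  expected_source: str, expected_target: str) -> bool:
    """Remove the segment when either side is recognized as an unexpected language."""
    return (language_mismatch(detected_source, expected_source)
            or language_mismatch(detected_target, expected_target))

print(should_remove("en", None, "en", "de"))  # target undetermined: kept
print(should_remove("en", "fr", "en", "de"))  # target recognized as French: removed
```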
