Automated evaluation metrics play a crucial role in assessing the quality of translations produced by machine translation systems. Unlike human evaluations, which can be subjective and time-consuming, automated metrics provide a quick, objective, and repeatable way to gauge the performance of MT systems.
Phrase Custom AI incorporates several well-established automated metrics to evaluate machine translation quality: BLEU, TER, chrf3, and COMET.
It is advised to deploy customized systems to a production environment if both of the following conditions are met:
- BLEU improvement of at least 5 points (absolute, e.g. 40 vs. 35), or chrf3 improvement of at least 4 points.
- No significant decrease of the COMET score.
In most cases, improvements of this magnitude are easily noticeable for human translators and lead to improved post-editing times.
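The check below is a minimal sketch of this rule in Python. All score values are hypothetical, and the tolerance for what counts as a significant COMET decrease is an assumption that you should set according to your own quality bar.

```python
# Hypothetical evaluation scores for a generic and a customized system,
# measured on the same test set.
generic = {"bleu": 35.0, "chrf3": 52.0, "comet": 0.82}
customized = {"bleu": 41.5, "chrf3": 57.0, "comet": 0.81}

# Assumed tolerance for what counts as a "significant" COMET decrease.
COMET_TOLERANCE = 0.02

# Condition 1: BLEU +5 points or chrf3 +4 points (absolute).
significant_gain = (
    customized["bleu"] - generic["bleu"] >= 5.0
    or customized["chrf3"] - generic["chrf3"] >= 4.0
)
# Condition 2: no significant decrease of the COMET score.
no_comet_regression = generic["comet"] - customized["comet"] <= COMET_TOLERANCE

if significant_gain and no_comet_regression:
    print("Customized system meets the deployment criteria.")
else:
    print("Keep the generic system or investigate further.")
```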
Recommended Approach
In general, the absolute values of the metrics vary depending on the language pair, domain, and other factors. To gauge how successful the customization process was, examine the difference between the scores of the generic and the customized system.
BLEU, chrf3, and TER all measure string overlap between the MT output and the reference translations. Because these metrics reflect how closely the MT output matches the reference, a significant improvement generally translates into less post-editing effort for translators.
COMET measures translation quality in a more general sense. It will not necessarily increase after customization: the customized system may produce translations of similar overall quality, with the difference lying in how well they match the customer’s style, tone of voice, terminology, and so on. However, a significant decrease in COMET may signal a problem with the customized system.
Available Metrics
Phrase Custom AI incorporates four well-established automated metrics to evaluate machine translation quality: BLEU, TER, chrf3, and COMET. Each of these metrics takes a different approach to assessing translations and captures different aspects of quality.
COMET (Cross-lingual Optimized Metric for Evaluation of Translation)
- Overview: COMET is a more recent metric that employs machine learning models to evaluate translations. Unlike traditional metrics, it does not solely rely on surface-level text comparisons.
- Working Mechanism: COMET uses a neural network model trained on large datasets of human judgments. It assesses translations by considering various aspects of translation quality, including fluency, adequacy, and the preservation of meaning (see the example after this list).
- Use Cases: COMET is effective in scenarios where a deeper understanding of translation quality is required. It is particularly useful for evaluating translations where contextual and semantic accuracy are more important than literal word-for-word correspondence.
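For reference, COMET scores can be reproduced outside Phrase with the open-source Unbabel comet package (installed as unbabel-comet). The sketch below is only an illustration: it uses the publicly available Unbabel/wmt22-comet-da checkpoint and placeholder sentences, and is not necessarily the model or setup Phrase Custom AI uses internally.

```python
from comet import download_model, load_from_checkpoint

# Download and load a publicly available COMET checkpoint.
# (Assumption: Phrase may use a different model internally.)
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# COMET scores triplets of source text, MT output, and reference translation.
data = [
    {
        "src": "Der Vertrag tritt am 1. Januar in Kraft.",
        "mt": "The contract comes into force on January 1.",
        "ref": "The agreement enters into force on 1 January.",
    }
]

output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)  # corpus-level score
print(output.scores)        # per-segment scores
```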
BLEU (Bilingual Evaluation Understudy)
- Overview: BLEU, one of the earliest and most widely used metrics, evaluates the quality of machine-translated text by comparing it with one or more high-quality reference translations. BLEU measures the correspondence of phrases between the machine-generated text and the reference texts, focusing on the precision of word matches.
- Working Mechanism: BLEU calculates the n-gram precision for various n-gram lengths (typically 1 to 4 words) and then combines these scores using a geometric mean. It also incorporates a brevity penalty to address the issue of overly short translations (see the example after this list).
- Use Cases: BLEU is particularly effective for evaluating translations where the exact matching of phrases and word order is important. However, its reliance on exact matches can be a limitation in capturing the quality of more fluent or idiomatic translations.
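As an illustration, corpus-level BLEU can be computed with the open-source sacrebleu library. The sentences below are placeholders, and this is one common implementation rather than necessarily the one Phrase uses.

```python
import sacrebleu

# MT outputs and their reference translations (one reference per segment).
hypotheses = [
    "The contract comes into force on January 1.",
    "Please send the signed document by Friday.",
]
references = [
    "The agreement enters into force on 1 January.",
    "Please send the signed document by Friday.",
]

# corpus_bleu takes a list of hypothesis strings and a list of
# reference streams (here, a single reference stream).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)       # BLEU on the 0-100 scale
print(bleu.precisions)  # 1- to 4-gram precisions
print(bleu.bp)          # brevity penalty
```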
TER (Translation Edit Rate)
- Overview: TER is a metric that measures the number of edits required to change a machine-translated text into a reference translation. It is based on the edit distance concept and includes operations like insertions, deletions, and substitutions. Unlike other metrics on this list, a lower TER score signifies a better translation.
- Working Mechanism: TER calculates the minimum number of edits needed to transform the machine translation into one of the reference translations. The score is then normalized by the total number of words in the reference translation (see the example after this list).
- Use Cases: TER is useful for evaluating translations where the focus is on the amount of post-editing work required. It is particularly relevant in scenarios where translations will be post-edited by humans.
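TER can be computed the same way with sacrebleu's TER metric; again, the sentences are placeholders and this is only one common open-source implementation, not necessarily Phrase's own.

```python
from sacrebleu.metrics import TER

hypotheses = ["The contract comes into force on January 1."]
references = ["The agreement enters into force on 1 January."]

# TER counts the edits (insertions, deletions, substitutions, shifts)
# needed to turn each hypothesis into its reference, normalized by
# reference length. Lower is better.
ter = TER()
result = ter.corpus_score(hypotheses, [references])
print(result.score)  # on the 0-100 scale used in the table below
```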
chrf3 (Character n-gram F-score)
- Overview: chrf3, or character n-gram F-score, is a metric that evaluates translations based on character-level n-grams. It considers both precision and recall, providing a balance between the two.
- Working Mechanism: chrf3 calculates the F-score, a weighted harmonic mean of precision and recall, based on the overlap of character n-grams between the machine translation and the reference text. The "3" refers to the beta parameter, which gives recall three times more weight than precision (see the example after this list).
- Use Cases: chrf3 is advantageous for languages where word segmentation is challenging or for morphologically rich languages. It is also less sensitive to word order than BLEU, making it more flexible in evaluating translations with different but acceptable phrasings.
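chrf3 is likewise available in sacrebleu via the CHRF metric with beta set to 3; as before, the sentences are placeholders and this is not necessarily Phrase's internal implementation.

```python
from sacrebleu.metrics import CHRF

hypotheses = ["The contract comes into force on January 1."]
references = ["The agreement enters into force on 1 January."]

# beta=3 weights recall three times more heavily than precision,
# which is what the "3" in chrf3 refers to.
chrf = CHRF(beta=3)
result = chrf.corpus_score(hypotheses, [references])
print(result.score)  # 0-100 scale
```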
Score ranges
Absolute values of the metrics vary significantly depending on the language pair, domain, and other factors. It is therefore difficult to establish general guidelines for interpreting score values; users should primarily decide based on the difference between the generic and the customized system, evaluated on an identical dataset.
The following table is a useful starting point for interpreting the values of the individual metrics:
- Scores below the low-quality MT threshold may be indicative of serious issues and such systems should typically not be deployed without further analysis.
- Scores which exceed the threshold for high-quality MT typically indicate a very well-performing system which produces fluent and adequate translations.
Metric | Range | Low-quality MT threshold | High-quality MT threshold
---|---|---|---
COMET | Typically 0 to 1 | < 0.3 | > 0.8
BLEU | 0 to 100 | < 15 | > 50
TER | 0 to 100 (lower is better) | > 70 | < 30
chrf3 | 0 to 100 | < 20 | > 60
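Purely as an illustration, the thresholds from the table can be encoded in a small helper. The helper and its band labels are hypothetical; the table only defines the two cut-offs, and scores in between should be judged by comparing the customized system against the generic one.

```python
# Thresholds copied from the table above. For COMET, BLEU, and chrf3 higher
# is better; for TER lower is better, so its comparisons are inverted.
THRESHOLDS = {
    "comet": {"low": 0.3, "high": 0.8, "lower_is_better": False},
    "bleu": {"low": 15, "high": 50, "lower_is_better": False},
    "ter": {"low": 70, "high": 30, "lower_is_better": True},
    "chrf3": {"low": 20, "high": 60, "lower_is_better": False},
}

def quality_band(metric: str, score: float) -> str:
    """Roughly classify a score against the table's thresholds."""
    t = THRESHOLDS[metric]
    if t["lower_is_better"]:
        if score > t["low"]:
            return "below low-quality threshold"
        if score < t["high"]:
            return "above high-quality threshold"
    else:
        if score < t["low"]:
            return "below low-quality threshold"
        if score > t["high"]:
            return "above high-quality threshold"
    return "in between; compare against the generic system"

print(quality_band("bleu", 42.0))  # in between; compare against the generic system
print(quality_band("ter", 25.0))   # above high-quality threshold
```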