Automated evaluation metrics play a crucial role in assessing the quality of translations produced by machine translation systems. Unlike human evaluations, which can be subjective and time-consuming, automated metrics provide a quick, objective, and repeatable way to gauge the performance of MT systems.
Phrase Custom AI incorporates several well-established automated metrics to evaluate machine translation quality: BLEU, TER, chrf3, and COMET.
It is advised to deploy customized systems to a production environment if both of the following conditions are met:
- BLEU improvement of at least 5 points (absolute, e.g. 40 vs. 35), or chrf3 improvement of at least 4 points.
- No significant decrease of the COMET score.
In most cases, improvements of this magnitude are easily noticeable for human translators and lead to improved post-editing times.
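The check below is a minimal sketch of this rule in Python. All score values are hypothetical, and the tolerance for what counts as a significant COMET decrease is an assumption that you should set according to your own quality bar.

```python
# Hypothetical evaluation scores for a generic and a customized system,
# measured on the same test set.
generic = {"bleu": 35.0, "chrf3": 52.0, "comet": 0.82}
customized = {"bleu": 41.5, "chrf3": 57.0, "comet": 0.81}

# Assumed tolerance for what counts as a "significant" COMET decrease.
COMET_TOLERANCE = 0.02

# Condition 1: BLEU +5 points or chrf3 +4 points (absolute).
significant_gain = (
    customized["bleu"] - generic["bleu"] >= 5.0
    or customized["chrf3"] - generic["chrf3"] >= 4.0
)
# Condition 2: no significant decrease of the COMET score.
no_comet_regression = generic["comet"] - customized["comet"] <= COMET_TOLERANCE

if significant_gain and no_comet_regression:
    print("Customized system meets the deployment criteria.")
else:
    print("Keep the generic system or investigate further.")
```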
Recommended Approach
In general, the absolute values of the metrics vary depending on the language pair, domain, and other factors. To gauge how successful the customization process was, examine the difference between the scores of the generic and the customized system.
BLEU, chrf3, and TER all measure string overlap between the MT output and the reference translations. Because these metrics reflect how closely the MT output matches the reference, a significant improvement generally translates into less post-editing effort for translators.
COMET measures translation quality in a more general sense. It will not necessarily increase after customization: the customized system may produce translations of similar overall quality, with the difference lying in how well they match the customer’s style, tone of voice, terminology, and so on. However, a significant decrease in COMET may signal a problem with the customized system.
Available Metrics
Phrase Custom AI incorporates four well-established automated metrics to evaluate machine translation quality: BLEU, TER, chrf3, and COMET. Each of these metrics takes a different approach to assessing translations and captures different aspects of quality.
COMET (Cross-lingual Optimized Metric for Evaluation of Translation)
- Overview: COMET is a more recent metric that employs machine learning models to evaluate translations. Unlike traditional metrics, it does not solely rely on surface-level text comparisons.
- Working Mechanism: COMET uses a neural network model trained on large datasets of human judgments. It assesses translations by considering various aspects of translation quality, including fluency, adequacy, and the preservation of meaning (see the example after this list).
- Use Cases: COMET is effective in scenarios where a deeper understanding of translation quality is required. It is particularly useful for evaluating translations where contextual and semantic accuracy are more important than literal word-for-word correspondence.
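For reference, COMET scores can be reproduced outside Phrase with the open-source Unbabel comet package (installed as unbabel-comet). The sketch below is only an illustration: it uses the publicly available Unbabel/wmt22-comet-da checkpoint and placeholder sentences, and is not necessarily the model or setup Phrase Custom AI uses internally.

```python
from comet import download_model, load_from_checkpoint

# Download and load a publicly available COMET checkpoint.
# (Assumption: Phrase may use a different model internally.)
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# COMET scores triplets of source text, MT output, and reference translation.
data = [
    {
        "src": "Der Vertrag tritt am 1. Januar in Kraft.",
        "mt": "The contract comes into force on January 1.",
        "ref": "The agreement enters into force on 1 January.",
    }
]

output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)  # corpus-level score
print(output.scores)        # per-segment scores
```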
BLEU (Bilingual Evaluation Understudy)
- Overview: BLEU, one of the earliest and most widely used metrics, evaluates the quality of machine-translated text by comparing it with one or more high-quality reference translations. BLEU measures the correspondence of phrases between the machine-generated text and the reference texts, focusing on the precision of word matches.
- Working Mechanism: BLEU calculates the n-gram precision for various n-gram lengths (typically 1 to 4 words) and then combines these scores using a geometric mean. It also incorporates a brevity penalty to address the issue of overly short translations (see the example after this list).
- Use Cases: BLEU is particularly effective for evaluating translations where the exact matching of phrases and word order is important. However, its reliance on exact matches can be a limitation in capturing the quality of more fluent or idiomatic translations.
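As an illustration, corpus-level BLEU can be computed with the open-source sacrebleu library. The sentences below are placeholders, and this is one common implementation rather than necessarily the one Phrase uses.

```python
import sacrebleu

# MT outputs and their reference translations (one reference per segment).
hypotheses = [
    "The contract comes into force on January 1.",
    "Please send the signed document by Friday.",
]
references = [
    "The agreement enters into force on 1 January.",
    "Please send the signed document by Friday.",
]

# corpus_bleu takes a list of hypothesis strings and a list of
# reference streams (here, a single reference stream).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)       # BLEU on the 0-100 scale
print(bleu.precisions)  # 1- to 4-gram precisions
print(bleu.bp)          # brevity penalty
```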
TER (Translation Edit Rate)
- Overview: TER is a metric that measures the number of edits required to change a machine-translated text into a reference translation. It is based on the edit distance concept and includes operations like insertions, deletions, and substitutions. Unlike other metrics on this list, a lower TER score signifies a better translation.
- Working Mechanism: TER calculates the minimum number of edits needed to transform the machine translation into one of the reference translations. The score is then normalized by the total number of words in the reference translation (see the example after this list).
- Use Cases: TER is useful for evaluating translations where the focus is on the amount of post-editing work required. It is particularly relevant in scenarios where translations will be post-edited by humans.
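TER can be computed the same way with sacrebleu's TER metric; again, the sentences are placeholders and this is only one common open-source implementation, not necessarily Phrase's own.

```python
from sacrebleu.metrics import TER

hypotheses = ["The contract comes into force on January 1."]
references = ["The agreement enters into force on 1 January."]

# TER counts the edits (insertions, deletions, substitutions, shifts)
# needed to turn each hypothesis into its reference, normalized by
# reference length. Lower is better.
ter = TER()
result = ter.corpus_score(hypotheses, [references])
print(result.score)  # on the 0-100 scale used in the table below
```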
chrf3 (Character n-gram F-score)
- Overview: chrf3, or character n-gram F-score, is a metric that evaluates translations based on character-level n-grams. It considers both precision and recall, providing a balance between the two.
- Working Mechanism: chrf3 calculates the F-score, a weighted harmonic mean of precision and recall, based on the overlap of character n-grams between the machine translation and the reference text. The "3" refers to the beta parameter, which gives recall three times more weight than precision (see the example after this list).
- Use Cases: chrf3 is advantageous for languages where word segmentation is challenging or for morphologically rich languages. It is also less sensitive to word order than BLEU, making it more flexible in evaluating translations with different but acceptable phrasings.
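chrf3 is likewise available in sacrebleu via the CHRF metric with beta set to 3; as before, the sentences are placeholders and this is not necessarily Phrase's internal implementation.

```python
from sacrebleu.metrics import CHRF

hypotheses = ["The contract comes into force on January 1."]
references = ["The agreement enters into force on 1 January."]

# beta=3 weights recall three times more heavily than precision,
# which is what the "3" in chrf3 refers to.
chrf = CHRF(beta=3)
result = chrf.corpus_score(hypotheses, [references])
print(result.score)  # 0-100 scale
```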
Score ranges
Absolute values of the metrics vary significantly depending on the language pair, domain, and other factors. It is therefore difficult to establish general guidelines for interpreting score values; users should primarily decide based on the difference between the generic and the customized system, evaluated on an identical dataset.
The following table is a useful starting point for interpreting the values of the individual metrics:
- Scores below the low-quality MT threshold may be indicative of serious issues and such systems should typically not be deployed without further analysis.
- Scores which exceed the threshold for high-quality MT typically indicate a very well-performing system which produces fluent and adequate translations.
Metric | Range | Low-quality MT threshold | High-quality MT threshold
---|---|---|---
COMET | Typically 0 to 1 | < 0.3 | > 0.8
BLEU | 0 to 100 | < 15 | > 50
TER | 0 to 100 (lower is better) | > 70 | < 30
chrf3 | 0 to 100 | < 20 | > 60
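Purely as an illustration, the thresholds from the table can be encoded in a small helper. The helper and its band labels are hypothetical; the table only defines the two cut-offs, and scores in between should be judged by comparing the customized system against the generic one.

```python
# Thresholds copied from the table above. For COMET, BLEU, and chrf3 higher
# is better; for TER lower is better, so its comparisons are inverted.
THRESHOLDS = {
    "comet": {"low": 0.3, "high": 0.8, "lower_is_better": False},
    "bleu": {"low": 15, "high": 50, "lower_is_better": False},
    "ter": {"low": 70, "high": 30, "lower_is_better": True},
    "chrf3": {"low": 20, "high": 60, "lower_is_better": False},
}

def quality_band(metric: str, score: float) -> str:
    """Roughly classify a score against the table's thresholds."""
    t = THRESHOLDS[metric]
    if t["lower_is_better"]:
        if score > t["low"]:
            return "below low-quality threshold"
        if score < t["high"]:
            return "above high-quality threshold"
    else:
        if score < t["low"]:
            return "below low-quality threshold"
        if score > t["high"]:
            return "above high-quality threshold"
    return "in between; compare against the generic system"

print(quality_band("bleu", 42.0))  # in between; compare against the generic system
print(quality_band("ter", 25.0))   # above high-quality threshold
```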