What is Metric for Evaluation of Translation with Explicit ORdering?

When evaluating machine translation, it is essential to assess how closely the output matches what a human translator would produce. METEOR (Metric for Evaluation of Translation with Explicit ORdering) is an automatic evaluation metric designed to give a more complete picture of translation quality than earlier metrics. Unlike BLEU, which is built on exact n-gram precision, METEOR combines unigram precision and recall with stem and synonym matching and a word-order penalty, which makes its scores more nuanced.

In this article, we’ll delve deeper into METEOR, how it works, and why it’s used to evaluate machine translation quality. Additionally, we’ll provide an easy-to-understand cheat sheet that summarizes its features.

What is METEOR?

METEOR is an automatic evaluation metric designed to assess the quality of machine-generated translations. It was created to address some of the limitations of earlier metrics, such as BLEU, by considering linguistic factors such as synonymy, word stemming, and word order, all key elements in determining the quality of a translation. BLEU counts only exact n-gram matches and has no explicit recall term; METEOR also credits stem and synonym matches, includes recall, and penalizes fragmented word order, which tends to bring its scores closer to human judgment.

Key Features and Example of METEOR Evaluation

Let’s consider an example to better understand how METEOR evaluates a machine translation.

Scenario:

Suppose the source sentence in English is:

“The car is fast.”

And the reference translation (human translation) in Filipino is:

“Ang kotse ay mabilis.”

Now, suppose the machine translation is:

“Ang sasakyan ay mabilis.”

Unigram Precision and Recall
- Precision measures how many words in the machine translation appear in the reference translation.
  - Example: This measures how many words in the machine translation (“Ang sasakyan ay mabilis”) appear in the reference translation (“Ang kotse ay mabilis”).
    - Words in the machine translation: “Ang”, “sasakyan”, “ay”, “mabilis”
    - Words in the reference translation: “Ang”, “kotse”, “ay”, “mabilis”
    - Matching words: “Ang”, “ay”, “mabilis”
    - Precision = Number of matching words in the machine translation / Total number of words in the machine translation = 3/4 = 0.75
- Recall measures how many words in the reference translation appear in the machine translation.
  - Example: This measures how many words in the reference translation (“Ang kotse ay mabilis”) appear in the machine translation (“Ang sasakyan ay mabilis”).
    - Matching words: “Ang”, “ay”, “mabilis”
    - Recall = Number of matching words in the reference translation / Total number of words in the reference translation = 3/4 = 0.75
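
The exact-match counts above can be reproduced with a minimal Python sketch. The function name `unigram_precision_recall` and the whitespace tokenization are assumptions made for illustration; they are not part of any METEOR implementation.

```python
from collections import Counter

def unigram_precision_recall(candidate: str, reference: str) -> tuple[float, float]:
    """Exact-match unigram precision and recall for whitespace-tokenized text."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Clip counts so a repeated candidate word cannot match more often
    # than it occurs in the reference.
    overlap = Counter(cand_tokens) & Counter(ref_tokens)
    matches = sum(overlap.values())
    precision = matches / len(cand_tokens) if cand_tokens else 0.0
    recall = matches / len(ref_tokens) if ref_tokens else 0.0
    return precision, recall

precision, recall = unigram_precision_recall(
    "Ang sasakyan ay mabilis", "Ang kotse ay mabilis"
)
print(precision, recall)  # 0.75 0.75
```
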
Synonym Matching
- METEOR improves over simple word matching by considering synonyms, allowing more flexibility in translation evaluation. For example, if the reference uses “car” and the machine translation uses “automobile,” METEOR would still consider this a valid match.
  - Example: The reference uses “kotse” where the machine translation uses “sasakyan”. Although they are different words, both are acceptable Filipino renderings of “car”, so they can be treated as synonyms.
  - Given a synonym resource that covers this pair, METEOR would count it as a match, raising the number of matched words from three to four. The precision and recall figures above count exact matches only.
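
As a rough illustration of synonym matching, the sketch below uses a hand-written synonym table. `TOY_SYNONYMS`, `words_match`, and `count_matches` are hypothetical names, and the greedy one-to-one matching is a simplification; real METEOR implementations look synonyms up in a resource such as WordNet and align words more carefully.

```python
# Hypothetical synonym table for illustration only; a real implementation
# would consult a lexical resource such as WordNet.
TOY_SYNONYMS = {frozenset({"kotse", "sasakyan"})}

def words_match(cand_word: str, ref_word: str) -> bool:
    """True for an exact match or a pair listed in the toy synonym table."""
    a, b = cand_word.lower(), ref_word.lower()
    return a == b or frozenset({a, b}) in TOY_SYNONYMS

def count_matches(candidate: str, reference: str) -> int:
    """Greedy one-to-one matching: each reference word is used at most once."""
    ref_tokens = reference.split()
    used = [False] * len(ref_tokens)
    matches = 0
    for cand_word in candidate.split():
        for i, ref_word in enumerate(ref_tokens):
            if not used[i] and words_match(cand_word, ref_word):
                used[i] = True
                matches += 1
                break
    return matches

matches = count_matches("Ang sasakyan ay mabilis", "Ang kotse ay mabilis")
print(matches)  # 4
```

With the synonym pair accepted, all four words match, so precision and recall would both rise from 0.75 to 1.0.
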
Stemming
- METEOR can match words that share the same root (stem) even when their surface forms differ.
  - Example: If the machine translation had used a different form of “mabilis” (e.g., “mabilis-mabilis” for emphasis), METEOR would still count it as a match, provided its stemmer reduces both forms to the same root.
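
Stem matching can be sketched with a toy suffix stripper. `toy_stem` is a hypothetical stand-in for a real stemmer and only handles a few English suffixes; English is used here because no Filipino stemmer is assumed to be available.

```python
def toy_stem(word: str) -> str:
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stems_match(cand_word: str, ref_word: str) -> bool:
    """True when both words reduce to the same toy stem."""
    return toy_stem(cand_word) == toy_stem(ref_word)

print(stems_match("jumped", "jumps"))  # True
print(stems_match("jumped", "walks"))  # False
```
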
Word Order
- METEOR considers the order in which matched words appear. A translation receives a higher score when its matched words appear in the same sequence as in the reference.
  - Example: The words in both the reference (“Ang kotse ay mabilis”) and the machine translation (“Ang sasakyan ay mabilis”) are in the same order, which is a positive factor for the score. If the matched words were scattered or reordered, a larger penalty would be applied.
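
Word order is usually quantified by counting chunks, where a chunk is a run of matched words that are adjacent and in the same order in both sentences. The sketch below assumes exact matching and a greedy left-to-right alignment; `align_exact` and `count_chunks` are hypothetical helper names. Note that an unmatched word in the middle of a sentence, such as “sasakyan” under exact matching, also splits the matches into separate chunks even when the order is preserved.

```python
def align_exact(candidate_tokens, reference_tokens):
    """Greedy left-to-right exact alignment; returns (cand_idx, ref_idx) pairs."""
    used = set()
    pairs = []
    for ci, cand_word in enumerate(candidate_tokens):
        for ri, ref_word in enumerate(reference_tokens):
            if ri not in used and cand_word == ref_word:
                used.add(ri)
                pairs.append((ci, ri))
                break
    return pairs

def count_chunks(pairs):
    """Count runs of matches that are adjacent and in order in both sentences."""
    chunks = 0
    prev = None
    for ci, ri in pairs:
        # A new chunk starts whenever either index fails to advance by one.
        if prev is None or ci != prev[0] + 1 or ri != prev[1] + 1:
            chunks += 1
        prev = (ci, ri)
    return chunks

reference = "ang kotse ay mabilis".split()
same_order = "ang sasakyan ay mabilis".split()
scrambled = "mabilis ang ay sasakyan".split()

print(count_chunks(align_exact(same_order, reference)))  # 2
print(count_chunks(align_exact(scrambled, reference)))   # 3
```
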

Penalty for Fragmentation

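METEOR measures fragmentation by grouping the matched words into chunks, where a chunk is a run of matches that are adjacent and in the same order in both sentences. In the original formulation, Penalty = 0.5 × (chunks / matches)³, so fewer and longer chunks mean a smaller penalty.

The machine translation keeps the same word order as the reference, but with exact matching the unmatched word “sasakyan” splits the three matches into two chunks: “Ang” and “ay mabilis”. The penalty is therefore small but not zero. Scrambled words would produce more chunks and a larger penalty, reducing the final score further.

Final METEOR Score Calculation:

Fmean (harmonic mean of precision and recall):
- Balanced form: Fmean = (2 * Precision * Recall) / (Precision + Recall) = (2 * 0.75 * 0.75) / (0.75 + 0.75) = 0.75
- The original METEOR formulation weights recall nine times as heavily as precision: Fmean = (10 * Precision * Recall) / (Recall + 9 * Precision). Because precision and recall are equal here, this also gives 0.75.
Penalty: With 2 chunks and 3 matches, Penalty = 0.5 × (2/3)³ ≈ 0.148.
METEOR Score = Fmean × (1 – Penalty) = 0.75 × (1 – 0.148) ≈ 0.639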
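
The calculation can be sketched in Python as follows. The recall-weighted Fmean and the 0.5 × (chunks / matches)³ penalty follow the original METEOR formulation, which is an assumption here because later versions tune these parameters. The chunk and match counts (2 and 3) assume exact matching only. Setting the penalty to 0 instead leaves the score at Fmean, 0.75.

```python
def fmean_balanced(p: float, r: float) -> float:
    """Plain harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def fmean_recall_weighted(p: float, r: float) -> float:
    """Harmonic mean with recall weighted nine times as heavily as precision."""
    return 10 * p * r / (r + 9 * p) if p + r else 0.0

def fragmentation_penalty(chunks: int, matches: int) -> float:
    """Penalty grows as the matches split into more chunks."""
    return 0.5 * (chunks / matches) ** 3 if matches else 0.0

precision = recall = 3 / 4
chunks, matches = 2, 3  # "Ang" | "ay mabilis" under exact matching

fmean = fmean_recall_weighted(precision, recall)
penalty = fragmentation_penalty(chunks, matches)
score = fmean * (1 - penalty)
print(round(fmean, 3), round(penalty, 3), round(score, 3))  # 0.75 0.148 0.639
```
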

How METEOR Works: The Formula

The METEOR score is calculated using the following formula:

METEOR = Fmean × (1 − Penalty)

Where:

Fmean is the harmonic mean of precision and recall; the original METEOR formulation weights recall more heavily than precision. Precision is the proportion of words in the machine translation that are matched in the reference, and recall is the proportion of words in the reference that are matched in the machine translation.
The penalty grows with the number of non-contiguous chunks of matched words, so it lowers the score when word order is poor.
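
Written out with the parameter values of the original METEOR formulation (an assumption here, since later versions tune them), the components are as follows, where m is the number of matched unigrams, w_t the number of unigrams in the machine translation, w_r the number of unigrams in the reference, and c the number of chunks:

```latex
\begin{align*}
P &= \frac{m}{w_t}, \qquad R = \frac{m}{w_r} \\
F_{\text{mean}} &= \frac{10\,P\,R}{R + 9P} \\
\text{Penalty} &= 0.5 \left( \frac{c}{m} \right)^{3} \\
\text{METEOR} &= F_{\text{mean}} \times (1 - \text{Penalty})
\end{align*}
```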

Advantages of METEOR

Improved Human Correlation: METEOR has been reported to correlate more closely with human evaluations than BLEU, particularly at the sentence level.
Flexible Matching: By matching synonyms and stems, METEOR gives credit for legitimate variation in wording instead of requiring exact word matches.
Penalty for Fragmentation: The penalty rewards translations whose matched words appear in long, contiguous runs, not just a patchwork of correct words.

Limitations of METEOR

Computational Complexity: METEOR requires more processing power than BLEU due to the extra steps (like synonym matching and stemming).
Dependency on External Resources: METEOR relies on resources such as WordNet for synonym matching, which may not cover all words in all languages.
Language-Specific: METEOR may not be as effective for languages that lack well-supported linguistic resources like WordNet.

Advantages

Higher correlation with human judgment
Flexible evaluation (considers synonyms and word order)
Better for evaluating fluent translations

Disadvantages

Computationally expensive
Depends on linguistic resources (e.g., WordNet)
Language-specific challenges

Conclusion

In summary, METEOR improves upon traditional metrics like BLEU by considering synonyms, stemming, and word order, making it more aligned with human judgment. It combines precision and recall and penalizes non-contiguous matches, so its scores reflect both the adequacy and the fluency of a translation. Though computationally more expensive and dependent on resources like WordNet, METEOR’s comprehensive approach makes it a valuable tool for evaluating machine translation, particularly when acceptable translations can differ in wording.


Written by: Ace Kenneth Batacandulo

Ace is AWS Certified, AWS Community Builder, and Cloud Consultant at Tutorials Dojo Pte. Ltd. He is also the Co-Lead Organizer of K8SUG Philippines and a member of the Content Committee for Google Developer Groups Cloud Manila. Ace actively contributes to the tech community through his volunteer work with AWS User Group PH, GDG Cloud Manila, K8SUG Philippines, and Devcon PH. He is deeply passionate about technology and is dedicated to exploring and advancing his expertise in the field.
