How does the METEOR evaluation metric calculate the similarity score?

Evaluation metrics are quantitative measures of machine learning models' performance. They are essential to determining whether our model is performing well or poorly for specific tasks.

What is METEOR?

METEOR (Metric for Evaluation of Translation with Explicit Ordering) is a metric used to measure the quality of candidate text based on the alignment between the candidate text and the reference text. An alignment is a mapping between the unigrams of the candidate and reference summaries.

Calculating the METEOR score

Following are the steps to calculate the METEOR score:

  1. Calculate the unigram precision and recall.

  2. Compute the F-score.

  3. Compute the chunk penalty.

  4. Calculate the METEOR score.

Calculate the unigram precision and recall

We calculate the unigram precision as the ratio between the number of overlapping unigrams between the candidate and reference summaries and the total number of unigrams in the candidate summary. A unigram is a single word or token in a sequence of text. If $m$ is the number of matched unigrams and $w_c$ is the number of unigrams in the candidate summary, then $P = \frac{m}{w_c}$.

The unigram recall is calculated as the ratio between the number of overlapping unigrams between the candidate and reference summaries and the total number of unigrams in the reference summary, $w_r$, that is, $R = \frac{m}{w_r}$.
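
As a quick illustration, here is a minimal Python sketch of unigram precision and recall that counts exact (surface-form) matches only; note that full METEOR also matches stems and WordNet synonyms, which this sketch omits.

from collections import Counter

def unigram_precision_recall(candidate, reference):
    # Count each word at most as often as it appears in both summaries.
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    matches = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return matches / len(candidate), matches / len(reference)

candidate = ['Machine', 'learning', 'is', 'seen', 'as', 'a', 'subset', 'of', 'artificial', 'intelligence']
reference = ['Machine', 'learning', 'is', 'a', 'subset', 'of', 'artificial', 'intelligence']
print(unigram_precision_recall(candidate, reference))  # (0.8, 1.0)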

Compute the F-score

After calculating the unigram precision and recall, we compute the weighted F-score by taking their weighted harmonic mean, with recall weighted higher than precision:

$$F_{\text{mean}} = \frac{P \cdot R}{\alpha \cdot P + (1 - \alpha) \cdot R}$$

where,

  • P: Unigram precision

  • R: Unigram recall

  • α: The relative weight assigned to precision and recall; with the formula above, a larger α puts more weight on recall (NLTK uses α = 0.9 by default).

Note: Recall is weighted higher than precision so that the metric rewards a candidate summary that covers the meaning of the reference rather than one that merely produces precise word-to-word matches.
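
As a worked example, take the sentence pair from the code example below: the candidate has 10 unigrams, the reference has 8, and 8 unigrams match, so P = 8/10 = 0.8 and R = 8/8 = 1.0. With α = 0.9:

$$F_{\text{mean}} = \frac{0.8 \times 1.0}{0.9 \times 0.8 + 0.1 \times 1.0} = \frac{0.8}{0.82} \approx 0.9756$$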

Compute the chunk penalty

A chunk is a set of matched unigrams that are adjacent in the candidate summary and also adjacent, in the same order, in the reference summary. The precision, recall, and $F_{\text{mean}}$ are computed based on single-word matches. To take longer matches and their word order into account, METEOR computes a chunk penalty. The chunk penalty is calculated as follows:

$$\text{Penalty} = \gamma \cdot \left(\frac{c}{m}\right)^{\beta}$$

Where,

γ: It determines the relative weight assigned to the fragmentation fraction $\frac{c}{m}$, the ratio between the number of chunks $c$ and the number of matched unigrams $m$. Its value ranges over $0 \leq \gamma \leq 1$ (NLTK uses γ = 0.5 by default).

β: It determines the functional relation between the fragmentation fraction and the penalty (NLTK uses β = 3 by default).
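
Continuing the worked example, the 8 matched unigrams form 2 chunks ("Machine learning is" and "a subset of artificial intelligence"), since each is contiguous in both summaries. With NLTK's defaults γ = 0.5 and β = 3:

$$\text{Penalty} = 0.5 \times \left(\frac{2}{8}\right)^{3} = 0.5 \times 0.015625 \approx 0.0078$$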

Question

How many chunks would there be if the candidate summary were exactly identical to the reference summary?

Answer: All matched unigrams would be adjacent in both summaries, so they would form a single chunk; that is, the number of chunks would be 1, giving the smallest possible penalty.

Calculate the METEOR score

After computing the F-score and chunk penalty, we are now ready to calculate the METEOR score:

$$\text{METEOR} = F_{\text{mean}} \cdot (1 - \text{Penalty})$$

METEOR scores are given on a scale of 0 to 1, with higher values indicating greater similarity between the candidate and the reference summary.
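
Putting the worked example together:

$$\text{METEOR} = 0.9756 \times (1 - 0.0078) \approx 0.968$$

Since every match in this sentence pair is exact, stemming and synonym matching change nothing here, so this should agree (up to rounding) with what the code example below prints.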

Code example

Now, let’s see how to calculate the METEOR score using Python.

import nltk
nltk.download('wordnet')

reference_summary = [['Machine', 'learning', 'is', 'a', 'subset', 'of', 'artificial', 'intelligence']]
candidate_summary = ['Machine', 'learning', 'is', 'seen', 'as', 'a', 'subset', 'of', 'artificial', 'intelligence']

score = nltk.translate.meteor_score.meteor_score(reference_summary, candidate_summary)
print(score)
Calculating METEOR score

Code explanation

Let’s walk through the code above.

  • Line 1: We import the nltk library, which is used widely in the field of NLP.

  • Line 2: We download the WordNet corpus from the nltk library; METEOR uses WordNet to match synonyms. Depending on your NLTK version, you may also need nltk.download('omw-1.4').

  • Line 4: We define a list named reference_summary containing one tokenized reference, “Machine learning is a subset of artificial intelligence.” The meteor_score() function expects a list of references, which is why the tokens are wrapped in an outer list.

  • Line 5: We define a candidate_summary variable and set its value to the tokenized candidate, “Machine learning is seen as a subset of artificial intelligence.”

  • Line 7: We use the meteor_score() function from the nltk.translate.meteor_score module to calculate the METEOR score of the candidate against the reference(s).

  • Line 8: We print the METEOR score for the provided candidate summary.
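
If your summaries are raw strings rather than token lists, you can tokenize them first. Here is a minimal sketch using nltk.word_tokenize, which requires the punkt tokenizer models (very recent NLTK versions name this resource punkt_tab):

import nltk
nltk.download('wordnet')
nltk.download('punkt')  # tokenizer models; newer NLTK versions may need 'punkt_tab'

from nltk.translate.meteor_score import meteor_score

reference = "Machine learning is a subset of artificial intelligence"
candidate = "Machine learning is seen as a subset of artificial intelligence"

score = meteor_score([nltk.word_tokenize(reference)], nltk.word_tokenize(candidate))
print(score)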
