How the BERTScore evaluation metric evaluates text summarization

Key takeaways:

  • BERTScore uses BERT model embeddings to assess the semantic similarity between a reference and a candidate summary, focusing on meaning over exact word matches.

  • BERTScore calculation involves computing contextual embeddings, cosine similarity, and determining precision, recall, and F1 scores, along with importance weighting and rescaling for clarity.

  • The metric provides precision, recall, and F1 scores, offering insights into the alignment of the generated summary with the reference.

  • A high BERTScore indicates strong semantic alignment, while a low score suggests poor similarity, highlighting content weaknesses.

Evaluation metrics are quantitative measures used to assess the performance of machine learning models. Understanding these metrics is essential for measuring how effective a model is at a particular task.

What is BERTScore?

BERTScore is an evaluation metric that measures the similarity between a reference summary and a candidate summary using contextual embeddings from a pretrained BERT model. Rather than focusing on exact word matches, it assesses how well the meanings of the tokens align. Based on these semantic similarities, BERTScore computes precision, recall, and F1 scores, providing a more nuanced evaluation of text quality.

Calculating BERTScore

Following are the steps to calculate the BERTScore:

  1. Computing contextual embedding

  2. Computing cosine similarity

  3. Computing precision, recall, and F1

  4. Importance weighting

  5. Rescaling

1. Computing contextual embedding

The first step in calculating the BERTScore is to compute the contextual embeddings of the reference summary and the candidate summary using a pretrained BERT model. A contextual embedding represents a token based on its surrounding context, so the same word can receive different embeddings in different sentences. Consider the contextual embedding of the reference summary $r$ as:

$r = \langle r_1, r_2, \dots, r_k \rangle$

Similarly, consider the contextual embedding of the candidate summary $c$ as:

$c = \langle c_1, c_2, \dots, c_l \rangle$
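The following is a minimal sketch of how such contextual embeddings could be obtained with the Hugging Face transformers library; the model name and the two example sentences are illustrative choices and not part of BERTScore itself.

from transformers import AutoTokenizer, AutoModel
import torch

# Load a pretrained BERT model and its tokenizer (illustrative choice).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def contextual_embeddings(text):
    # Tokenize the text and run it through BERT; the last hidden state
    # contains one contextual embedding per token.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # shape: (num_tokens, hidden_size)

ref_emb = contextual_embeddings("Machine learning is a subset of artificial intelligence")
cand_emb = contextual_embeddings("Machine learning is seen as a subset of artificial intelligence")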

2. Computing cosine similarity

After computing the contextual embedding for both summaries, the next step is to find their similarity. For this purpose, we compute the cosine similarity of contextual embedding vectors.

The cosine similarity between a reference summary token $r_i$ and a candidate summary token $c_j$ is given as:

$\text{sim}(r_i, c_j) = \dfrac{r_i^\top c_j}{\lVert r_i \rVert \, \lVert c_j \rVert}$

Here, $\top$ denotes the transpose, so $r_i^\top c_j$ is the inner (dot) product of the two embedding vectors.

Since we use pre-normalized vectors (each embedding has unit length), the cosine similarity reduces to:

$\text{sim}(r_i, c_j) = r_i^\top c_j$
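As a minimal sketch, assuming the ref_emb and cand_emb tensors produced in the embedding sketch above, the full matrix of pairwise token similarities can be computed as follows:

import torch.nn.functional as F

def similarity_matrix(ref_emb, cand_emb):
    # L2-normalize the embeddings so that a dot product equals cosine similarity.
    ref_norm = F.normalize(ref_emb, dim=-1)
    cand_norm = F.normalize(cand_emb, dim=-1)
    # Entry (i, j) holds the similarity between reference token i and candidate token j.
    return ref_norm @ cand_norm.T

sim = similarity_matrix(ref_emb, cand_emb)  # shape: (num reference tokens, num candidate tokens)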

3. Computing precision, recall, and F1

Calculating the BERTScore requires computing recall and precision. The recall is computed by matching each token in the reference summary $r$ to its most similar token in the candidate summary $c$:

$R_{\text{BERT}} = \dfrac{1}{|r|} \sum_{r_i \in r} \max_{c_j \in c} r_i^\top c_j$

The precision is computed by matching each token in the candidate summary $c$ to its most similar token in the reference summary $r$:

$P_{\text{BERT}} = \dfrac{1}{|c|} \sum_{c_j \in c} \max_{r_i \in r} r_i^\top c_j$

The $F_1$ score for the reference and the candidate summary is calculated by combining precision and recall:

$F_{\text{BERT}} = 2 \cdot \dfrac{P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$
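A minimal sketch of this greedy matching, assuming the sim similarity matrix from the previous sketch (reference tokens along the rows, candidate tokens along the columns):

def bert_precision_recall_f1(sim):
    # Recall: match each reference token (row) to its most similar candidate token.
    recall = sim.max(dim=1).values.mean().item()
    # Precision: match each candidate token (column) to its most similar reference token.
    precision = sim.max(dim=0).values.mean().item()
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = bert_precision_recall_f1(sim)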

4. Importance weighting

We apply importance weighting to the BERTScore to emphasize rare, informative words and de-emphasize common ones. Inverse document frequency (IDF) weights, computed over a corpus of reference texts, can be incorporated into the BERTScore equations; for example, the IDF-weighted recall becomes:

$R_{\text{BERT}} = \dfrac{\sum_{r_i \in r} \text{idf}(r_i) \max_{c_j \in c} r_i^\top c_j}{\sum_{r_i \in r} \text{idf}(r_i)}$

The effectiveness of this step depends on both data availability and the specific domain of the text.
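The sketch below illustrates IDF-weighted recall, assuming a small hypothetical corpus of tokenized reference summaries from which the IDF values are computed; real implementations precompute IDF over a much larger reference set.

import math
from collections import Counter

def idf_weights(tokenized_corpus):
    # IDF over a (hypothetical) corpus of tokenized reference summaries:
    # rare tokens receive large weights, common tokens receive small ones.
    n_docs = len(tokenized_corpus)
    doc_freq = Counter(tok for doc in tokenized_corpus for tok in set(doc))
    return {tok: math.log((n_docs + 1) / (freq + 1)) for tok, freq in doc_freq.items()}

def idf_weighted_recall(ref_tokens, sim, idf, default_idf=1.0):
    # Weight each reference token's best-match similarity by that token's IDF.
    weights = [idf.get(tok, default_idf) for tok in ref_tokens]
    best = sim.max(dim=1).values  # best candidate match per reference token
    return float(sum(w * s for w, s in zip(weights, best)) / sum(weights))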

5. Rescaling

The cosine similarity score lies within the range $[-1, 1]$. In practice, however, the computed BERTScore values tend to occupy a much narrower band of this range, which makes the raw scores hard to interpret. To address this, the BERTScore is rescaled with respect to an empirical baseline $b$ to make the score more human-readable. The rescaled recall $\hat{R}$ can be computed as follows:

$\hat{R} = \dfrac{R_{\text{BERT}} - b}{1 - b}$

Here, $b$ is a baseline computed for each language and contextual embedding model by averaging BERTScore over random sentence pairs drawn from the Common Crawl monolingual datasets.

After rescaling, the value typically lies within the range $[0, 1]$. Similarly, we can apply rescaling to the precision $P$ and the $F_1$ score. Note that this process doesn't affect the ranking ability of the metric; it only improves the readability of the BERTScore.
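As a small illustration of the rescaling arithmetic, assuming a made-up baseline value of b = 0.85 (the real baselines are precomputed per language and embedding model):

def rescale(score, baseline=0.85):
    # Map a raw BERTScore onto a wider, more readable range;
    # the baseline value here is illustrative only.
    return (score - baseline) / (1 - baseline)

print(rescale(0.96))  # a raw score of 0.96 becomes roughly 0.73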

Code

Now, let's learn how to calculate the BERTScore using Python:

Note: We use the evaluate package from Hugging Face, which is widely used for evaluating model performance. You can install the required packages using the command pip3 install bert_score evaluate.

from evaluate import load
bertscore = load("bertscore")
reference_summary = ["Machine learning is a subset of artificial intelligence"]
predicted_summary = ["Machine learning is seen as a subset of artificial intelligence"]
results = bertscore.compute(predictions=predicted_summary, references=reference_summary, model_type="distilbert-base-uncased")
print("\nThe BERTScore for the predicted summary is given as: ", results)

Code explanation

The code above is explained in detail below:

  • Line 1: We import the load function from the evaluate package.

  • Line 2: We load the bertscore metric using the load() function.

  • Line 3: We define a reference_summary variable and set its value to "Machine learning is a subset of artificial intelligence".

  • Line 4: We define a predicted_summary variable and set its value to "Machine learning is seen as a subset of artificial intelligence".

  • Line 5: We call the compute() function of the bertscore metric, which takes the predictions, the references, and either the model_type or the lang parameter to compute the BERTScore for the predicted summary.

  • Line 6: We print the results dictionary, which contains the precision, recall, and F1 score for the predicted summary, along with the hash identifying the bertscore library version used.

The final output of the BERTScore calculation consists of three primary metrics: precision, recall, and F1 score. These scores provide a comprehensive overview of how well the generated summary aligns with the reference summary, capturing both the accuracy and relevance of the information presented.

A high BERTScore indicates that the generated summary closely aligns with the reference summary in meaning and context, while a low BERTScore signifies that the generated summary lacks semantic similarity to the reference.


Conclusion

BERTScore represents a significant advancement in evaluating the quality of text summarization models by leveraging contextual embeddings from pretrained BERT models. This metric, based on precision, recall, and F1-score calculations, provides a robust measure of similarity between reference and candidate summaries. By incorporating importance weighting and rescaling techniques, BERTScore enhances its interpretability and relevance in assessing the fidelity of machine-generated summaries to their source texts. Its implementation through libraries like Hugging Face's evaluate package simplifies integration into Python workflows, making it accessible for evaluating and improving natural language processing tasks.

Frequently asked questions



What is the meteor metric for summarization?

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric that scores a candidate text against a reference text based on unigram alignments, matching words by exact form, stem, and synonym, and applying a penalty for fragmented word order.


What are the two types of BERT?

BERT can be categorized into two types: BERT-base and BERT-large.


What does BERT mean?

BERT stands for Bidirectional Encoder Representations from Transformers.

