BERTScore (Bidirectional Encoder Representations from Transformers Score) Cheat Sheet
BERTScore is an effective evaluation metric that looks beyond surface-level word matching to assess the meaning behind the generated text. Instead of counting overlapping words like traditional metrics such as BLEU or ROUGE, BERTScore taps into the power of pre-trained transformer models (like BERT) to compare the semantic similarity between tokens in the generated output and a reference sentence. It does this by calculating the cosine similarity between their contextual embeddings.
Initially proposed by Zhang et al. (2020), BERTScore has quickly become a popular choice in natural language processing tasks where understanding context and meaning matter more than exact word matches. Its strong alignment with human judgment makes it especially useful for evaluating tasks like summarization, translation, and paraphrasing—where the wording may differ, but the intended message should remain intact.
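In practice, BERTScore is rarely computed by hand; the authors distribute a `bert-score` Python package. The snippet below is a minimal usage sketch, assuming the package is installed (`pip install bert-score`) and using its default English model:

```python
# Minimal BERTScore usage sketch with the authors' bert-score package.
from bert_score import score

candidates = ["She packed food at home and took it with her."]
references = ["She brought her lunch from home."]

# lang="en" selects the package's default English model; P, R, F1 are
# per-sentence tensors of precision, recall, and F1.
P, R, F1 = score(candidates, references, lang="en", verbose=True)
print(f"P={P[0].item():.3f}  R={R[0].item():.3f}  F1={F1[0].item():.3f}")
```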
BERTScore Typical Use Cases
BERTScore is widely used in natural language generation tasks where getting the meaning right matters more than using the exact same words. It is especially useful in cases where traditional metrics like BLEU or ROUGE miss the mark because the output has been reworded or paraphrased.
- Text Summarization: Great for scoring abstractive summaries that don’t copy the reference word-for-word but still capture the same core message.
- Paraphrase Generation: Helps check if a paraphrased version stays true to the original meaning, even with different wording.
- Machine Translation (MT): Works alongside BLEU to evaluate translations that may use synonyms or alternate phrasing while preserving meaning.
- Caption Generation: Assesses whether generated captions for images or videos communicate the same idea as those written by humans.
- Dialogue Generation: Useful in chatbots and conversational AI to ensure responses are contextually appropriate and make sense in the conversation.
- Language Model Evaluation: Commonly used to compare how well large language models (LLMs) or fine-tuned transformers produce semantically accurate outputs.
How BERTScore Works
BERTScore evaluates text similarity by comparing how contextually similar each token in the generated output is to the most relevant token in the reference.
- Tokenization: First, the candidate and reference texts are broken down into smaller parts called tokens (usually words or subwords). These tokens are then fed into a pre-trained transformer model like BERT.
- Embedding Generation: The model processes each token in its context and turns it into a vector, also known as a contextual embedding. These embeddings capture not just the word itself, but its meaning based on the surrounding words.
- Cosine Similarity: Once we have the embeddings, BERTScore compares how similar each token in the candidate is to the tokens in the reference by calculating cosine similarity. It checks both directions: how well candidate tokens match reference tokens, and vice versa.
- Measuring the Match: Precision, Recall, and F1
  - Precision looks at the candidate sentence and asks: “How well do its tokens match what’s in the reference?”
  - Recall flips the question to the reference: “How well does the candidate cover the reference tokens?”
  - F1 Score combines precision and recall into a single value. It balances how much the candidate overlaps with the reference and how much it misses. This final score is what we call the BERTScore.
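The steps above can be sketched directly with the Hugging Face `transformers` library. This is a simplified illustration, not the reference implementation: it assumes `bert-base-uncased` (the official package defaults to a RoBERTa model), drops the special `[CLS]`/`[SEP]` tokens, and omits the optional IDF weighting and baseline rescaling.

```python
# Simplified BERTScore sketch (illustration only; the official bert-score
# package uses a different default model and adds IDF weighting options).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return L2-normalized contextual embeddings, excluding [CLS]/[SEP]."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, hidden)
    hidden = hidden[1:-1]                               # drop [CLS] and [SEP]
    return torch.nn.functional.normalize(hidden, dim=-1)

def bertscore(candidate: str, reference: str) -> dict:
    cand, ref = embed(candidate), embed(reference)
    sim = cand @ ref.T  # pairwise cosine similarities, shape (cand_len, ref_len)
    precision = sim.max(dim=1).values.mean().item()  # best reference match per candidate token
    recall = sim.max(dim=0).values.mean().item()     # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

print(bertscore("To beat traffic, they departed ahead of time.",
                "They left early to avoid the traffic."))
```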
Interpreting BERTScore Values
Unlike BLEU, whose scores can fall anywhere from 0 to 100%, BERTScore’s F1 values typically range from 0.70 to 0.95 for reasonable outputs, because even loosely related sentences share some embedding similarity. A higher score means stronger semantic alignment.
| BERTScore F1 Score | Interpretation |
| --- | --- |
| < 0.70 | Weak or unrelated content |
| 0.70 – 0.80 | Some shared meaning, but major gaps exist |
| 0.80 – 0.90 | Good semantic alignment |
| > 0.90 | Very strong match in meaning |
How is BERTScore different from BLEU or ROUGE?
BLEU and ROUGE have been the standard metrics for evaluating generated text, especially when the output closely matches the reference. However, they often struggle when the wording is different but the meaning is the same. This is where BERTScore comes in. It focuses on understanding the meaning of the text using contextual embeddings. The comparison below shows how BERTScore differs and when it can be a more useful choice.
| Feature | BERTScore | BLEU/ROUGE |
| --- | --- | --- |
| Matching Strategy | Contextual semantic similarity | Exact word or n-gram match |
| Handles Synonyms | Yes | No |
| Context Awareness | Yes (uses contextual embeddings from transformer models) | No |
| Sentence-Level Reliability | Yes | Not ideal |
| Score Range | Typically 0.70 – 0.95 (F1) | 0 – 1 (often shown as %) |
| Interpretability | Requires understanding of embedding space | Easy to interpret |
| Recommended For | Flexible, meaning-preserving generation | Tasks with rigid, literal structure |
Sample BERTScore Evaluations
Example 1: Different words but same meaning
Reference: She brought her lunch from home.
Candidate: She packed food at home and took it with her.
- BLEU: Low score due to few direct word matches
- BERTScore: High score (≈ 0.91), captures the intended meaning
- Why? “brought her lunch” ≈ “packed food and took it”; both describe the same action in different words
Example 2: Usage of the same words but with opposite meanings
Reference: Cheska didn’t show up at the barangay meeting.
Candidate: Cheska attended the barangay meeting.
- BLEU: Moderate score (many overlapping words)
- BERTScore: Lower score (≈ 0.60), detects semantic contradiction
- Why? Despite word overlap, the meanings are opposite: “didn’t show up” vs. “attended”
Example 3: Different words and structure, but same intent
Reference: They left early to avoid the traffic.
Candidate: To beat traffic, they departed ahead of time.
- BLEU: Moderate to low score (most of the words differ between the two sentences)
- BERTScore: High score (≈ 0.92), recognizes semantic match
- Why? “left early” ≈ “departed ahead of time”, “avoid traffic” ≈ “beat traffic”
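These comparisons can be reproduced with the Hugging Face `evaluate` library (which wraps the `bert-score` package) together with SacreBLEU. The sketch below uses Example 2; the exact numbers depend on the underlying model and settings, so they may differ from the approximate values quoted above.

```python
# Sketch: contrasting BLEU and BERTScore on Example 2.
# Assumes: pip install evaluate bert-score sacrebleu
import evaluate

bertscore = evaluate.load("bertscore")
sacrebleu = evaluate.load("sacrebleu")

reference = "Cheska didn't show up at the barangay meeting."
candidate = "Cheska attended the barangay meeting."

bs = bertscore.compute(predictions=[candidate], references=[reference], lang="en")
bleu = sacrebleu.compute(predictions=[candidate], references=[[reference]])

# Word overlap can keep BLEU relatively high even though the meanings conflict,
# while BERTScore's contextual embeddings register the contradiction.
print(f"BERTScore F1: {bs['f1'][0]:.3f}")
print(f"BLEU: {bleu['score']:.1f} (0-100 scale)")
```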
References:
https://arxiv.org/abs/1904.09675
https://rumn.medium.com/bert-score-explained-8f384d37bb06
https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-report-programmatic.html
https://huggingface.co/spaces/evaluate-metric/bertscore