
What is BERTScore – Bidirectional Encoder Representations from Transformers Score?


BERTScore (Bidirectional Encoder Representations from Transformers Score) Cheat Sheet

BERTScore is an effective evaluation metric that looks beyond surface-level word matching to assess the meaning behind the generated text. Instead of counting overlapping words like traditional metrics such as BLEU or ROUGE, BERTScore taps into the power of pre-trained transformer models (like BERT) to compare the semantic similarity between tokens in the generated output and a reference sentence. It does this by calculating the cosine similarity between their contextual embeddings.

Initially proposed by Zhang et al. (2020), BERTScore has quickly become a popular choice in natural language processing tasks where understanding context and meaning matter more than exact word matches. Its strong alignment with human judgment makes it especially useful for evaluating tasks like summarization, translation, and paraphrasing—where the wording may differ, but the intended message should remain intact.

BERTScore Typical Use Cases

BERTScore is widely used in natural language generation tasks where getting the meaning right matters more than using the exact same words. It is especially useful where traditional metrics like BLEU or ROUGE miss the mark because of rewording, paraphrasing, or other surface-level differences.

  • Text Summarization: Great for scoring abstractive summaries that don’t copy the reference word-for-word but still capture the same core message.
  • Paraphrase Generation: Helps check if a paraphrased version stays true to the original meaning, even with different wording.
  • Machine Translation (MT): Works alongside BLEU to evaluate translations that may use synonyms or alternate phrasing while preserving meaning.
  • Caption Generation: Assesses whether generated captions for images or videos communicate the same idea as those written by humans.
  • Dialogue Generation: Useful in chatbots and conversational AI to ensure responses are contextually appropriate and make sense in the conversation.
  • Language Model Evaluation: Commonly used to compare how well large language models (LLMs) or fine-tuned transformers produce semantically accurate outputs.

How BERTScore Works

BERTScore evaluates text similarity by matching each token in the candidate to the most similar token in the reference (and vice versa), based on their contextual embeddings.

  • Tokenization
    First, the candidate and reference texts are broken down into smaller parts called tokens (usually words or subwords). These tokens are then fed into a pre-trained transformer model like BERT.
  • Embedding Generation
    The model processes each token in its context and turns it into a vector, also known as a contextual embedding. These embeddings capture not just the word itself, but its meaning based on the surrounding words.
  • Cosine Similarity
    Once we have the embeddings, BERTScore compares how similar each token in the candidate is to the tokens in the reference by calculating cosine similarity. It checks both directions: how well candidate tokens match reference tokens, and vice versa.
  • Measuring the Match: Precision, Recall, and F1
    • Precision
      Looks at the candidate sentence and asks: “How well do its tokens match what’s in the reference?”
    • Recall
      Flips the question to the reference: “How well does the candidate cover the reference tokens?”
    • F1 Score
      Combines precision and recall into a single value, balancing how much the candidate overlaps with the reference against how much it misses. This final score is what we call the BERTScore; the formulas below make it precise.
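
Following the formulation in Zhang et al. (2020), with token embeddings pre-normalized so that an inner product equals cosine similarity, recall matches each reference token to its closest candidate token, precision does the reverse, and F1 combines the two. For reference tokens x and candidate tokens x̂:

```latex
R_{\mathrm{BERT}} = \frac{1}{|x|}\sum_{x_i \in x}\max_{\hat{x}_j \in \hat{x}} x_i^{\top}\hat{x}_j,\qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|}\sum_{\hat{x}_j \in \hat{x}}\max_{x_i \in x} x_i^{\top}\hat{x}_j,\qquad
F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}}\,R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
```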

Interpreting BERTScore values

Unlike BLEU, which runs from 0 to 1 (often reported as a percentage), BERTScore’s F1 values typically fall between 0.70 and 0.95 for reasonable outputs. A higher score means stronger semantic alignment.

BERTScore F1    Interpretation
< 0.70          Weak or unrelated content
0.70 – 0.80     Some shared meaning, but major gaps exist
0.80 – 0.90     Good semantic alignment
> 0.90          Very strong match in meaning
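
As a minimal sketch of computing these numbers in practice, here is one way using the Hugging Face evaluate library listed in the references (the sentence pair is made up for illustration, and exact scores vary with the underlying model):

```python
# pip install evaluate bert-score transformers torch
import evaluate

bertscore = evaluate.load("bertscore")

results = bertscore.compute(
    predictions=["The weather is lovely today."],   # candidate (hypothetical example)
    references=["It is a beautiful day outside."],  # reference (hypothetical example)
    lang="en",  # selects a default English model under the hood
)
# One precision/recall/f1 value per candidate-reference pair
print(results["precision"][0], results["recall"][0], results["f1"][0])
```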

How is BERTScore different from BLEU or ROUGE?

BLEU and ROUGE have been the standard metrics for evaluating generated text, especially when the output closely matches the reference. However, they often struggle when the wording is different but the meaning is the same. This is where BERTScore comes in. It focuses on understanding the meaning of the text using contextual embeddings. The comparison below shows how BERTScore differs and when it can be a more useful choice.

Feature                      BERTScore                                        BLEU/ROUGE
Matching strategy            Contextual semantic similarity                   Exact word or n-gram match
Handles synonyms             Yes                                              No
Context awareness            Yes (contextual embeddings from transformers)    No
Reliable at sentence level   Yes                                              Not ideal
Score range                  Typically 0.70–0.95 (F1)                         0–1 (often shown as %)
Interpretability             Requires understanding of embedding space        Easy to interpret
Recommended for              Flexible, meaning-preserving generation          Tasks with rigid, fixed wording
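
To make the contrast concrete, the sketch below scores one paraphrase pair with both a smoothed sentence-level BLEU (via NLTK) and BERTScore; the sentences are invented for illustration, and the exact numbers depend on the smoothing method and embedding model:

```python
# pip install nltk evaluate bert-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import evaluate

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# BLEU counts exact n-gram overlap; smoothing avoids zero scores on short sentences.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# BERTScore compares contextual embeddings, so a paraphrase can still score high.
bertscore = evaluate.load("bertscore")
bs = bertscore.compute(predictions=[candidate], references=[reference], lang="en")

print(f"BLEU: {bleu:.2f}  vs  BERTScore F1: {bs['f1'][0]:.2f}")
```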

Sample BERTScore Evaluations

Example 1: Different words but same meaning

Reference: She brought her lunch from home.
Candidate: She packed food at home and took it with her.

  • BLEU: Low score due to few direct word matches
  • BERTScore: High score (≈ 0.91), captures the intended meaning
  • Why? “brought her lunch” ≈ “packed food and took it”; both describe the same action using different words

Example 2: Usage of the same words but with opposite meanings

Reference: Cheska didn’t show up at the barangay meeting.
Candidate: Cheska attended the barangay meeting.

  • BLEU: Moderate score (many overlapping words)
  • BERTScore: Lower score (≈ 0.60), detects semantic contradiction
  • Why? Despite word overlap, the meanings are opposite: “didn’t show up” vs. “attended”

Example 3: Different words and structure, but same intent

Reference: They left early to avoid the traffic.
Candidate: To beat traffic, they departed ahead of time.

  • BLEU: Moderate to low score (little direct word overlap)
  • BERTScore: High score (≈ 0.92), recognizes semantic match
  • Why? “left early” ≈ “departed ahead of time”, and “avoid the traffic” ≈ “beat traffic”
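
To try comparisons like these yourself, here is a sketch using the reference bert-score package (the approximate scores quoted above are illustrative only; actual values depend on the model chosen, and enabling baseline rescaling spreads raw scores, which otherwise cluster near 1.0, over a wider range):

```python
# pip install bert-score
from bert_score import score

references = [
    "She brought her lunch from home.",
    "Cheska didn't show up at the barangay meeting.",
    "They left early to avoid the traffic.",
]
candidates = [
    "She packed food at home and took it with her.",
    "Cheska attended the barangay meeting.",
    "To beat traffic, they departed ahead of time.",
]

# lang="en" picks a default English model; rescale_with_baseline=True rescales
# scores against a random-pairing baseline for easier interpretation.
P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
for f1, cand in zip(F1.tolist(), candidates):
    print(f"F1 = {f1:.2f}  |  {cand}")
```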

References:

https://arxiv.org/abs/1904.09675
https://rumn.medium.com/bert-score-explained-8f384d37bb06
https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-report-programmatic.html
https://huggingface.co/spaces/evaluate-metric/bertscore

Written by: Lois Angelo Dar Juan
