BERTScore (Bidirectional Encoder Representations from Transformers Score) Cheat Sheet
BERTScore is an effective evaluation metric that looks beyond surface-level word matching to assess the meaning behind the generated text. Instead of counting overlapping words like traditional metrics such as BLEU or ROUGE, BERTScore taps into the power of pre-trained transformer models (like BERT) to compare the semantic similarity between tokens in the generated output and a reference sentence. It does this by calculating the cosine similarity between their contextual embeddings.
Initially proposed by Zhang et al. (2020), BERTScore has quickly become a popular choice in natural language processing tasks where understanding context and meaning matter more than exact word matches. Its strong alignment with human judgment makes it especially useful for evaluating tasks like summarization, translation, and paraphrasing—where the wording may differ, but the intended message should remain intact.
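In practice, BERTScore is rarely computed by hand; the authors distribute a `bert-score` Python package. The snippet below is a minimal usage sketch, assuming the package is installed (`pip install bert-score`) and using its default English model:

```python
# Minimal BERTScore usage sketch with the authors' bert-score package.
from bert_score import score

candidates = ["She packed food at home and took it with her."]
references = ["She brought her lunch from home."]

# lang="en" selects the package's default English model; P, R, F1 are
# per-sentence tensors of precision, recall, and F1.
P, R, F1 = score(candidates, references, lang="en", verbose=True)
print(f"P={P[0].item():.3f}  R={R[0].item():.3f}  F1={F1[0].item():.3f}")
```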
BERTScore Typical Use Cases
BERTScore is widely used in natural language generation tasks where getting the meaning right matters more than using the exact same words. It is especially useful in cases where traditional metrics like BLEU or ROUGE miss the mark because the output has been reworded or paraphrased.
- Text Summarization: Great for scoring abstractive summaries that don’t copy the reference word-for-word but still capture the same core message.
- Paraphrase Generation: Helps check if a paraphrased version stays true to the original meaning, even with different wording.
- Machine Translation (MT): Works alongside BLEU to evaluate translations that may use synonyms or alternate phrasing while preserving meaning.
- Caption Generation: Assesses whether generated captions for images or videos communicate the same idea as those written by humans.
- Dialogue Generation: Useful in chatbots and conversational AI to ensure responses are contextually appropriate and make sense in the conversation.
- Language Model Evaluation: Commonly used to compare how well large language models (LLMs) or fine-tuned transformers produce semantically accurate outputs.
How BERTScore Works
BERTScore evaluates text similarity by comparing how contextually similar each token in the generated output is to the most relevant token in the reference.
- Tokenization: First, the candidate and reference texts are broken down into smaller parts called tokens (usually words or subwords). These tokens are then fed into a pre-trained transformer model like BERT.
- Embedding Generation: The model processes each token in its context and turns it into a vector, also known as a contextual embedding. These embeddings capture not just the word itself, but its meaning based on the surrounding words.
- Cosine Similarity: Once we have the embeddings, BERTScore compares how similar each token in the candidate is to the tokens in the reference by calculating cosine similarity. It checks both directions: how well candidate tokens match reference tokens, and vice versa.
- Measuring the Match: Precision, Recall, and F1
  - Precision looks at the candidate sentence and asks: “How well do its tokens match what’s in the reference?”
  - Recall flips the question to the reference: “How well does the candidate cover the reference tokens?”
  - F1 Score combines precision and recall into a single value. It balances how much the candidate overlaps with the reference and how much it misses. This final score is what we call the BERTScore.
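The steps above can be sketched directly with the Hugging Face `transformers` library. This is a simplified illustration, not the reference implementation: it assumes `bert-base-uncased` (the official package defaults to a RoBERTa model), drops the special `[CLS]`/`[SEP]` tokens, and omits the optional IDF weighting and baseline rescaling.

```python
# Simplified BERTScore sketch (illustration only; the official bert-score
# package uses a different default model and adds IDF weighting options).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return L2-normalized contextual embeddings, excluding [CLS]/[SEP]."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, hidden)
    hidden = hidden[1:-1]                               # drop [CLS] and [SEP]
    return torch.nn.functional.normalize(hidden, dim=-1)

def bertscore(candidate: str, reference: str) -> dict:
    cand, ref = embed(candidate), embed(reference)
    sim = cand @ ref.T  # pairwise cosine similarities, shape (cand_len, ref_len)
    precision = sim.max(dim=1).values.mean().item()  # best reference match per candidate token
    recall = sim.max(dim=0).values.mean().item()     # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

print(bertscore("To beat traffic, they departed ahead of time.",
                "They left early to avoid the traffic."))
```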
Interpreting BERTScore Values
Unlike BLEU, whose scores can fall anywhere from 0 to 100%, BERTScore’s F1 values typically range from 0.70 to 0.95 for reasonable outputs, because even loosely related sentences share some embedding similarity. A higher score means stronger semantic alignment.
| BERTScore F1 Score | Interpretation |
| --- | --- |
| < 0.70 | Weak or unrelated content |
| 0.70 – 0.80 | Some shared meaning, but major gaps exist |
| 0.80 – 0.90 | Good semantic alignment |
| > 0.90 | Very strong match in meaning |
How is BERTScore different from BLEU or ROUGE?
BLEU and ROUGE have been the standard metrics for evaluating generated text, especially when the output closely matches the reference. However, they often struggle when the wording is different but the meaning is the same. This is where BERTScore comes in. It focuses on understanding the meaning of the text using contextual embeddings. The comparison below shows how BERTScore differs and when it can be a more useful choice.
| Feature | BERTScore | BLEU/ROUGE |
| --- | --- | --- |
| Matching Strategy | Contextual semantic similarity | Exact word or n-gram match |
| Handles Synonyms | Yes | No |
| Context Awareness | Yes (uses contextual embeddings from transformer models) | No |
| Sentence-Level Reliability | Yes | Not ideal |
| Score Range | Typically 0.70 – 0.95 (F1) | 0 – 1 (often shown as %) |
| Interpretability | Requires understanding of embedding space | Easy to interpret |
| Recommended For | Flexible, meaning-preserving generation | Tasks with rigid, literal structure |
Sample BERTScore Evaluations
Example 1: Different words but same meaning
Reference: She brought her lunch from home.
Candidate: She packed food at home and took it with her.
- BLEU: Low score due to few direct word matches
- BERTScore: High score (≈ 0.91), captures the intended meaning
- Why? “brought her lunch” ≈ “packed food and took it”; both describe the same action in different words
Example 2: Usage of the same words but with opposite meanings
Reference: Cheska didn’t show up at the barangay meeting.
Candidate: Cheska attended the barangay meeting.
- BLEU: Moderate score (many overlapping words)
- BERTScore: Lower score (≈ 0.60), detects semantic contradiction
- Why? Despite word overlap, the meanings are opposite: “didn’t show up” vs. “attended”
Example 3: Different words and structure, but same intent
Reference: They left early to avoid the traffic.
Candidate: To beat traffic, they departed ahead of time.
- BLEU: Moderate to low score (most of the words differ between the two sentences)
- BERTScore: High score (≈ 0.92), recognizes semantic match
- Why? “left early” ≈ “departed ahead of time”, “avoid traffic” ≈ “beat traffic”
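These comparisons can be reproduced with the Hugging Face `evaluate` library (which wraps the `bert-score` package) together with SacreBLEU. The sketch below uses Example 2; the exact numbers depend on the underlying model and settings, so they may differ from the approximate values quoted above.

```python
# Sketch: contrasting BLEU and BERTScore on Example 2.
# Assumes: pip install evaluate bert-score sacrebleu
import evaluate

bertscore = evaluate.load("bertscore")
sacrebleu = evaluate.load("sacrebleu")

reference = "Cheska didn't show up at the barangay meeting."
candidate = "Cheska attended the barangay meeting."

bs = bertscore.compute(predictions=[candidate], references=[reference], lang="en")
bleu = sacrebleu.compute(predictions=[candidate], references=[[reference]])

# Word overlap can keep BLEU relatively high even though the meanings conflict,
# while BERTScore's contextual embeddings register the contradiction.
print(f"BERTScore F1: {bs['f1'][0]:.3f}")
print(f"BLEU: {bleu['score']:.1f} (0-100 scale)")
```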
References:
https://arxiv.org/abs/1904.09675
https://rumn.medium.com/bert-score-explained-8f384d37bb06
https://docs.aws.amazon.com/bedrock/latest/userguide/model-evaluation-report-programmatic.html
https://huggingface.co/spaces/evaluate-metric/bertscore