BiLingual Evaluation Understudy (BLEU) Score Cheat Sheet
BLEU (BiLingual Evaluation Understudy) is a corpus-level metric designed to automatically evaluate the quality of machine-generated text, most commonly in machine translation (MT). It compares n-gram overlap between the machine’s output and one or more human reference translations.
Introduced by Papineni et al. (2002), BLEU became the first automated metric to correlate highly with human judgments in large-scale MT evaluations. It remains a widely used baseline for evaluating machine-generated text.
Common Use Cases
The BLEU score is widely used in natural language processing tasks that require comparing machine-generated text to human-written references. Its primary application lies in evaluating translation systems, but its utility extends beyond that into various generative language tasks. Below are the most common scenarios where BLEU is applied:
- Neural Machine Translation (NMT): BLEU is the standard metric for evaluating deep learning-based translation systems like those built with transformers (e.g., MarianNMT, Google’s NMT).
- Statistical Machine Translation (SMT): BLEU was initially designed for SMT systems and remains a key metric in comparing output from statistical models such as phrase-based or hierarchical models.
- Caption Generation: BLEU is frequently used to assess the quality of text generated by models that produce image or video captions, comparing them against human reference descriptions.
- Text Summarization: Although less ideal than ROUGE for this task, BLEU is still used to quantify the overlap between machine-generated and human-written summaries.
- Benchmarking NLP Models: BLEU is often employed to benchmark models in shared tasks and research papers, particularly for tools like OpenNMT, Fairseq, and other sequence-to-sequence systems that produce structured outputs.
Mathematical Expression for BLEU Score
The BLEU score is a metric that ranges from 0 to 1 and reflects how closely a machine-generated translation matches one or more high-quality reference translations. A score of 0 indicates no overlap with the references, suggesting poor translation quality, while a score of 1 signifies perfect overlap, indicating a highly accurate translation.
As a general guideline, the following ranges of BLEU scores—expressed as percentages—can help interpret the quality of a machine-generated translation.
| BLEU Score (%) | Interpretation |
|---|---|
| < 10 | Almost useless |
| 10 – 19 | Hard to get the gist |
| 20 – 29 | The gist is clear, but it has significant grammatical errors |
| 30 – 40 | Understandable to good translations |
| 40 – 50 | High-quality translations |
| 50 – 60 | Very high-quality, adequate, and fluent translations |
| > 60 | Quality is often better than human |
The BLEU score is calculated using the following formula:

$$
\text{BLEU} = \min\left(1,\ \exp\left(1 - \frac{\text{reference-length}}{\text{candidate-length}}\right)\right) \cdot \left(\prod_{i=1}^{4} \text{precision}_i\right)^{1/4}
$$

with the modified precision for each n-gram order defined as

$$
\text{precision}_i = \frac{\sum \min\left(m^{i}_{\text{cand}},\ m^{i}_{\text{ref}}\right)}{w^{i}_{t}}
$$

where:

- $m^{i}_{\text{cand}}$ is the number of i-grams in the candidate translation that also appear in the reference translation,
- $m^{i}_{\text{ref}}$ is the number of i-grams present in the reference translation,
- $w^{i}_{t}$ is the total number of i-grams in the candidate translation.
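As a minimal sketch (illustrative only, using uniform weights of 1/4 per n-gram order, not the implementation of any particular toolkit), the formula can be written directly in Python once the four modified precisions and the corpus lengths are known:

```python
import math

def bleu_from_precisions(precisions, candidate_len, reference_len):
    """Combine modified n-gram precisions (i = 1..4) with the brevity penalty,
    mirroring the formula above. Returns a score between 0 and 1.
    Assumes candidate_len > 0."""
    if any(p == 0 for p in precisions):
        return 0.0  # the geometric mean collapses if any precision is zero
    # Geometric mean of the modified precisions (uniform weights).
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    # Brevity penalty: min(1, exp(1 - reference_len / candidate_len)).
    brevity_penalty = min(1.0, math.exp(1 - reference_len / candidate_len))
    return brevity_penalty * geo_mean
```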
The BLEU score formula comprises two main components: the brevity penalty and the n-gram overlap.
🔹 Brevity Penalty (BP)
The brevity penalty discourages translations that are excessively short compared to the reference. It applies an exponential penalty when the candidate translation is shorter than the closest reference length. This component addresses the lack of a recall term in the BLEU score, ensuring that translations are not overly concise at the cost of completeness.
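A piecewise sketch of the brevity penalty, assuming `candidate_len` and `reference_len` are total token counts (the function name is illustrative, not from any particular library):

```python
import math

def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """No penalty when the candidate is at least as long as the reference;
    otherwise decay exponentially with the length ratio."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

# For the worked example further below: brevity_penalty(12, 13) ≈ 0.92
```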
🔹 N-Gram Overlap
The n-gram overlap measures how many unigrams, bigrams, trigrams, and four-grams (i.e., for i = 1 to 4) in the candidate match those in the reference translations. This acts as a precision metric, where:
- Unigrams capture the adequacy of the translation (i.e., correct word choice),
- Higher-order n-grams reflect the fluency and grammatical coherence.
To prevent overestimation, matched n-grams are clipped to the maximum count of each n-gram found in the reference translations.
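A rough sketch of this clipped (modified) precision using `collections.Counter`; this is illustrative code, not the implementation from any specific toolkit:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[j:j + n]) for j in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in the most generous reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Per n-gram, take the maximum count observed across all references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

With n = 1, this reproduces the 0.6 unigram precision worked out in the example below.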
Examples of BLEU Score Calculation
🔹 Simple Unigram Precision Example
Let’s explore how BLEU handles unigram precision with a basic example.
Reference Sentence:
she enjoys reading science fiction books
Candidate Translation:
she she she books fiction
Begin by counting unigram occurrences and applying clipping, which prevents overcounting repeated words in the candidate that exceed their frequency in the reference.
| Unigram | Count in Candidate | Max Count in Reference | Clipped Count |
|---|---|---|---|
| she | 3 | 1 | 1 |
| books | 1 | 1 | 1 |
| fiction | 1 | 1 | 1 |
| reading | 0 | 1 | 0 |
| science | 0 | 1 | 0 |
| enjoys | 0 | 1 | 0 |
- Total candidate unigrams: 5
- Total clipped matches: 1 (she) + 1 (books) + 1 (fiction) = 3
- Unigram precision: 3 / 5 = 0.6 (double-checked in the snippet below)
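The same clipped counts can be verified with a few lines of Python (assuming simple whitespace tokenization):

```python
from collections import Counter

reference = "she enjoys reading science fiction books".split()
candidate = "she she she books fiction".split()

ref_counts = Counter(reference)
cand_counts = Counter(candidate)

# Clip each candidate count to its count in the reference.
clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
print(clipped, len(candidate), clipped / len(candidate))  # 3 5 0.6
```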
🔹 Full BLEU Score Calculation Example
Let’s now walk through a more complete BLEU score example using up to 4-gram precision and brevity penalty.
New Reference:
The astronaut collected rock samples from the lunar surface during the mission.
Candidate 1:
The astronaut picked up moon rocks from the surface on the mission.
Candidate 2:
The astronaut gathered rock samples during the mission on the moon.
BLEU operates on tokenized text, with punctuation separated into its own tokens.
N-Gram Precisions
| Metric | Candidate 1 | Candidate 2 |
|---|---|---|
| 1-gram | 8 / 12 | 9 / 12 |
| 2-gram | 5 / 11 | 6 / 11 |
| 3-gram | 2 / 10 | 3 / 10 |
| 4-gram | 1 / 9 | 2 / 9 |
Brevity Penalty (BP)
- Reference length: 13 tokens
- Candidate lengths: 12 tokens each
- Since both candidates are shorter than the reference, the brevity penalty is applied:

$$
\text{BP} = \exp\left(1 - \frac{13}{12}\right) \approx 0.92
$$
Final BLEU Scores
| Candidate | BLEU Score |
|---|---|
| Candidate 1 | ~0.20 |
| Candidate 2 | ~0.34 |
Hence, Candidate 2 achieves a higher BLEU score due to better overlap in higher-order n-grams, such as “rock samples” and “during the mission”.
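For reference, libraries such as NLTK can compute scores like these directly. The sketch below is illustrative only; its exact outputs will differ somewhat from the hand-worked numbers above depending on tokenization and smoothing choices:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The astronaut collected rock samples from the lunar surface during the mission .".split()
candidate_1 = "The astronaut picked up moon rocks from the surface on the mission .".split()
candidate_2 = "The astronaut gathered rock samples during the mission on the moon .".split()

# Smoothing avoids zero scores when some higher-order n-grams have no matches.
smooth = SmoothingFunction().method1
print(sentence_bleu([reference], candidate_1, smoothing_function=smooth))
print(sentence_bleu([reference], candidate_2, smoothing_function=smooth))
```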
Properties of the BLEU Score
- Corpus-Level Evaluation: BLEU is designed as a corpus-based metric, meaning it works best when scores are aggregated over many sentences. When applied to individual sentences, BLEU often yields misleadingly low scores, even when the translations are semantically accurate, because short sequences offer limited n-gram statistics. This is why sentence-level BLEU scores can be unreliable, and BLEU is not meant to be decomposed per sentence.
- Equal Weight for All Words: BLEU does not distinguish between content and function words. For example, omitting a small function word like “a” is penalized just as heavily as replacing a critical content word like “NASA” with “ESA.” This lack of semantic weighting can skew evaluation results.
- Limited Grammatical and Semantic Sensitivity: BLEU does not directly assess meaning or grammatical correctness. A single missing word, such as “not,” can reverse the sentence’s intent, yet only slightly affect the BLEU score. Additionally, since BLEU typically considers n-grams up to 4-grams, it misses long-range dependencies, resulting in weak penalties for ungrammatical structures.
- Sensitive to Preprocessing: Before scoring, both candidate and reference translations must undergo normalization and tokenization. These preprocessing steps, such as handling punctuation, case sensitivity, or special tokens, can significantly influence BLEU results. Consistent tokenization is essential for fair comparisons across models or systems (see the sketch below).
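Because of this sensitivity, results are often reported with sacrebleu, which applies its own standardized tokenization so that scores are comparable across systems and papers. A minimal usage sketch, assuming sacrebleu is installed and reusing sentences from the worked example above:

```python
import sacrebleu

# One hypothesis sentence and one reference stream (a list per reference set).
hypotheses = ["The astronaut gathered rock samples during the mission on the moon."]
references = [["The astronaut collected rock samples from the lunar surface during the mission."]]

# corpus_bleu applies sacrebleu's internal tokenization before scoring.
result = sacrebleu.corpus_bleu(hypotheses, references)
print(result.score)  # BLEU on a 0-100 scale
```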