Amazon Bedrock’s LLM-as-a-Judge

Last updated on January 23, 2026

Evaluating your LLM’s quality should not cost you too much money or even weeks of your time.

You’re probably stuck in a limbo of choosing between two options that have their own drawbacks:

Automated metrics like BLEU, ROUGE and accuracy scores? Sure, they are quite fast and cheap, but they ultimately fall short in judging real conversations. Since they simply match word patterns, they’re not a good fit for open ended responses simply because they can’t tell if they actually understood the question or its tone.

As for human reviewers, they do get it right. They captured the context/subtlety and spot tone issues much better than the tools earlier. The problem though is that it’s costly and super time consuming. Imagine the time consumed in testing responses that are over a thousand. Your budget’s wont like it for sure.

So as an alternative, there’s another technique that will make your LLM judging-life easier….

It’s called LLM-as-a-judge!

From the name itself, LLM- as a-judge uses one AI model to assess another’s output. Amazon Bedrock now offers this feature that is up to par with the human preference by 80% for a much cheaper and faster way!

In this guide, you will learn how to set up LLM-as-a-judge using Amazon Nova Lite (the model you’re testing) and Claude (the judge). You’ll automate quality checks, lower cost, and finally scale your testing.

What is LLM-as-a-Judge?

LLM-as-a-judge is a process where one LLM evaluates another LLM’s quality of generated output based on guidelines you’ve created. Instead of the manual approach using human evaluators (which takes a lot of effort and time!), you speed up model evaluation by automating it.

Why Amazon Nova Lite + Claude?

You might be thinking that out of all the available models to use, why choose Nova Lite? Well, simply because it’s an overachiever in so many things! Not only is it cheap, but it’s also good. Like real good.

Amazon Nova Lite (the model being evaluated) has a balanced and consistent performance in all dimensions: problem identification, communication clarity, logical coherence and empathy. This means that you could confidently deploy your AI system with an assurance knowing that Nova Lite is reliable and smart.

Claude 3.5 Sonnet (the judge) is your evaluator. It’s one of the best models at assessing quality, catching nuance, and providing consistent scores. When you need an AI to judge another AI’s work, Claude is the model you want.

With that said, let’s get to work and automate!

Prerequisites

Before we start, I assume you already have an AWS account and access to Bedrock. Let’s start by setting up our resources and IAM configurations.

You’ll need the following:

AWS Identity and Access Management (IAM) policies (3 policies)
Service role for Bedrock
Amazon S3 bucket with CORS (Cross-Origin Resource Sharing)
Model access verification to Amazon Nova Lite and Anthropic Claude

Step 1: Create IAM Policies

Policy #1: Bedrock Model Access

Go to IAM Console → Policies → Create policy
Click JSON tab

Paste this and name it: BedrockEvaluationModelAccessPolicy

 
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockModelInvoke",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:CreateModelInvocationJob",
        "bedrock:StopModelInvocationJob"
      ],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/*",
        "arn:aws:bedrock:*:YOUR_ACCOUNT_ID:inference-profile/*",
        "arn:aws:bedrock:us-east-1:YOUR_ACCOUNT_ID:provisioned-model/*",
        "arn:aws:bedrock:us-east-1:YOUR_ACCOUNT_ID:imported-model/*"
      ]
    }
  ]
}

Note: Replace YOUR_ACCOUNT_ID with your AWS account ID.

Policy #2: Amazon S3 Access

First, create the Amazon S3 bucket:

Go to Amazon S3 Console → Create bucket
Bucket name: bedrock-evaluation-results-YOUR_ACCOUNT_ID
Region: US East (N. Virginia) us-east-1
Keep all other settings as default
Click Create bucket

Secondly, create folders:

This is where you’ll be putting our input dataset and the result of the evaluation of your generator.

Click into your bucket
Click Create folder -> Name: datasets -> Create
Click Create folder -> Name: results -> Create

Thirdly, configure Cross-Origin Resource Sharing (CORS):

Enabling this allows you to upload files to Amazon S3 through the Bedrock console.

Go to Permissions tab
Scroll to Cross-origin resource sharing (CORS)

Click Edit and paste:

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET",
            "PUT",
            "POST",
            "DELETE"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": [
            "Access-Control-Allow-Origin"
        ]
    }
]

Now, we need to create a policy.

1.Go to IAM -> Policies -> Create Policy

Click JSON tab
Paste this and name it: BedrockEvaluationS3AccessPolicy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FetchAndUpdateOutputBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject",
        "s3:GetBucketLocation",
        "s3:AbortMultipartUpload",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": [
        "arn:aws:s3:::bedrock-evaluation-results-YOUR_ACCOUNT_ID",
        "arn:aws:s3:::bedrock-evaluation-results-YOUR_ACCOUNT_ID/*"
      ]
    }
  ]
}

Note: Replace YOUR_ACCOUNT_ID and update bucket name if its different.

Policy #3: Console User Access

Next, we’ll edit our IAM policy to grant access to our models for generator and evaluator.

Go to IAM -> Policies->Create Policy
Click JSON tab

Paste this and name it BedrockConsoleUserPolicy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockConsole",
      "Effect": "Allow",
      "Action": [
        "bedrock:CreateEvaluationJob",
        "bedrock:GetEvaluationJob",
        "bedrock:ListEvaluationJobs",
        "bedrock:StopEvaluationJob",
        "bedrock:GetCustomModel",
        "bedrock:ListCustomModels",
        "bedrock:CreateProvisionedModelThroughput",
        "bedrock:UpdateProvisionedModelThroughput",
        "bedrock:GetProvisionedModelThroughput",
        "bedrock:ListProvisionedModelThroughputs",
        "bedrock:GetImportedModel",
        "bedrock:ListImportedModels",
        "bedrock:ListFoundationModels",
        "bedrock:GetFoundationModel",
        "bedrock:ListTagsForResource",
        "bedrock:UntagResource",
        "bedrock:TagResource"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowConsoleS3Access",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetBucketCORS",
        "s3:ListBucket",
        "s3:ListBucketVersions",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::bedrock-evaluation-results-YOUR_ACCOUNT_ID",
        "arn:aws:s3:::bedrock-evaluation-results-YOUR_ACCOUNT_ID/*"
      ]
    },
    {
      "Sid": "AllowPassingServiceRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::YOUR_ACCOUNT_ID:role/BedrockModelEvaluationServiceRole",
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "bedrock.amazonaws.com"
        }
      }
    }
  ]
}

Note: Replace YOUR_ACCOUNT_ID in all 3 places (S3 bucket ARNs and service role ARN).

Step 2: Create Service Role

Now you need to create a service role where you will attached the 2 policies created earlier (BedrockEvaluationModelAccess and BedrockEvaluationS3AccessPolicy)

Go to IAM Console → Roles → Create role
Select Custom trust policy
Paste this trust policy:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowBedrockToAssumeRole",
    "Effect": "Allow",
    "Principal": {
      "Service": "bedrock.amazonaws.com"
    },
    "Action": "sts:AssumeRole",
    "Condition": {
      "StringEquals": {
        "aws:SourceAccount": "YOUR_ACCOUNT_ID"
      },
      "ArnEquals": {
        "aws:SourceArn": "arn:aws:bedrock:us-east-1:YOUR_ACCOUNT_ID:evaluation-job/*"
      }
    }
  }]
}

Note: Don't forget to replace YOUR_ACCOUNT_ID with your own aws account id!

Click Next
You will be directed to Permission policies
Search the BedrockEvaluationModelAccess and BedrockEvaluationS3AccessPolicy
Name your Service Role as BedrockConsoleUserPolicy

Now you’re done setting up the Prerequisites! Quite lengthy isn't it?

Next we’re moving forward to set access for the models, namely the Claude 3.5 Sonnet and Amazon Nova Lite.

Step 4: Enable Model Access

You have to verify if you can access the Claude Sonnet 3.5 and Amazon Nova Lite

Go to Bedrock -> Left Panel -> Model Catalog

Search for Claude 3.5 Sonnet. Click Open in playground.

Ask it out a question. If you receive a response, it means you can access it. If not, you might need to submit a use case form.
Test the Amazon Nova Lite as well.

Step 5. Prepare your Input Dataset

Moving on, we now create prompts that the Amazon Nova Lite will respond to.

You can create your own dataset by having:

Prompt (required)
referenceResponse: this serves as a reference point
category (optional): Scores reported by category results to a better analysis

Create a file called evaluation-dataset.jsonl:

{"prompt": "Write a professional email declining a meeting request politely.", "referenceResponse": "A professional decline should thank the sender, briefly explain you're unavailable, and suggest an alternative time or delegate if possible. Keep the tone friendly and respectful.", "category": "professional_writing"}

{"prompt": "Explain the difference between machine learning and deep learning.", "referenceResponse": "Machine learning is when computers learn from data to make predictions. Deep learning is a type of machine learning that uses neural networks with many layers, similar to how the human brain processes information.", "category": "technical"}

{"prompt": "What are three tips for improving work-life balance?", "referenceResponse": "Set clear work hours and stick to them. Make time for hobbies and activities you enjoy. Learn to say no to extra commitments when you're already busy.", "category": "productivity"}

Step 6. Upload Dataset to Amazon S3

Go to Amazon S3 Console ->bedrock-evaluation-results-YOUR_ACCOUNT_ID
Open the datasets/ folder
Click Upload and upload your evaluation-dataset.jsonl file

Step 7: Create Your First Evaluation Job

1. Go to Bedrock Console → Evaluations (under Assess section)

2. Click Create.

3. Select Automatic: LLM-as-a-judge

4. Enter evaluation name and fill in description. Select Claude 3.5 Sonnet.

In the inference source, select Amazon Nova Lite to evaluate.

Select the desired metrics for evaluating your model response.

Click Browse S3.

For Prompt Dataset, click your jsonl file while for the evaluation results, click your results folder.

Select the service role you have created during prerequisites and click Create.

Wait for 5-10 minutes for the evaluation job to complete and keep refreshing. Your evaluation job is successful once you see the status Completed.

You may now see the details of your evaluation.

Yay that’s it, you're done! You now learned how to set up an automatic model that used Claude Sonnet 3.5 to judge your Amazon Nova Lite outputs. All free from manual review and costly evaluation.

Still not convinced to use llm-as-a-judge? Check out this AWS documentation showing how LLM-as-a-judge compares to other alternative options.