Ends in
00
days
00
hrs
00
mins
00
secs
ENROLL NOW

💪 25% OFF on ALL Reviewers to Start Your 2026 Strong with our New Year, New Skills Sale!

Amazon Bedrock’s LLM-as-a-Judge: Automate AI Evaluation with Nova Lite + Claude

Home » AWS » Amazon Bedrock’s LLM-as-a-Judge: Automate AI Evaluation with Nova Lite + Claude

Amazon Bedrock’s LLM-as-a-Judge: Automate AI Evaluation with Nova Lite + Claude

 

Evaluating your LLM’s quality should not cost you too much money or even weeks of your time. 

You’re probably  stuck in a limbo of choosing between two options that have their own drawbacks:

Automated metrics like BLEU, ROUGE and accuracy scores? Sure, they are quite fast and cheap, but they ultimately fall short in judging real conversations. Since they simply match word patterns, they’re not a good fit for open ended responses simply because they can’t tell  if they actually understood the question or its tone.

As for human reviewers, they do get it right. They captured the context/subtlety and spot tone issues much better than the tools earlier. The problem though is that it’s costly and super time consuming. Imagine the time consumed in testing responses that are over a thousand. Your budget’s wont like it for sure.

So as an alternative, there’s another technique that will make your LLM judging-life easier….

LLM- as a-judge uses one AI model to assess another’s output.  Amazon Bedrock now offers this feature that is  up to par with the human preference by 80% for a much cheaper and faster way!

In this guide, you will learn how to set up LLM-as-a-judge using Amazon Nova Lite (the model you’re testing) and Claude (the judge). You’ll automate quality checks, lower cost, and finally scale your testing.

What is LLM-as-a-Judge?

LLM-as-a-judge is a process where one LLM evaluates another LLM’s quality of generated output based on guidelines you’ve created. Instead of the manual approach using human evaluators (which takes a lot of effort and time!), you speed up model evaluation by automating it.

Why Amazon Nova Lite + Claude?

Architecture-diagram-LLM-as-aJudge-Amazon-Bedrock

You might be thinking that out of all the available models to use, why choose Nova Lite? Well, simply because it’s an overachiever in so many things! Not only is it cheap, but it’s also good. Like real good.

 Amazon Nova Lite (the model being evaluated) has a balanced and consistent performance in all dimensions: problem identification, communication clarity, logical coherence and empathy. This means that you could confidently deploy your AI system with an assurance knowing that Nova Lite is reliable and smart.

Claude 3.5 Sonnet (the judge) is your evaluator. It’s one of the best models at assessing quality, catching nuance, and providing consistent scores. When you need an AI to judge another AI’s work, Claude is the model you want.

With that said, let’s get to work and automate!

Prerequisites

Before we start, I assume you already have an AWS account and access to Bedrock. Let’s start by setting up our resources and IAM configurations. 

You’ll need the following:

  1. IAM policies (3 policies)
  2. Service role for Bedrock
  3. S3 bucket with CORS (Cross-Origin Resource Sharing)
  4. Model access verification to Amazon Nova Lite and Anthropic Claude

Step 1: Create IAM Policies

Policy  #1: Bedrock Model Access

  1. Go to IAM Console Policies → Create policy
  2. Click JSON tab

Paste this and name it: BedrockEvaluationModelAccessPolicy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockModelInvoke",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:CreateModelInvocationJob",
        "bedrock:StopModelInvocationJob"
      ],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/*",
        "arn:aws:bedrock:*:YOUR_ACCOUNT_ID:inference-profile/*",
        "arn:aws:bedrock:us-east-1:YOUR_ACCOUNT_ID:provisioned-model/*",
        "arn:aws:bedrock:us-east-1:YOUR_ACCOUNT_ID:imported-model/*"
      ]
    }
  ]
}

Note: Replace YOUR_ACCOUNT_ID with your AWS account ID.

Tutorials dojo strip

Policy #2: S3 Access

First, create the S3 bucket:

  • Go to S3 Console → Create bucket
  • Bucket name: bedrock-evaluation-results-YOUR_ACCOUNT_ID
    1. Region: US East (N. Virginia) us-east-1
    2. Keep all other settings as default
  • Click Create bucket

Secondly, create folders:

This is where you’ll be putting our input dataset and the result of the evaluation of your generator.

  1. Click into your bucket
  2. Click Create folder → Name: datasets → Create
  3. Click Create folder → Name: results → Create

Creating S3 bucket to store evaluation datasets for Amazon Bedrock LLM-as-a-Judge with Nova Lite

 

Thirdly, configure CORS:

Enabling this allows you to upload files to S3 through the Bedrock console.

  1. Go to Permissions tab
  2. Scroll to Cross-origin resource sharing (CORS)
  3. Click Edit and paste:

[
    {
        "AllowedHeaders": [
"*"
        ],
        "AllowedMethods": [
"GET",
"PUT",
"POST",
"DELETE"
        ],
        "AllowedOrigins": [
"*"
        ],
        "ExposeHeaders": [
"Access-Control-Allow-Origin"
        ]
    }
]

Now, we need to create a policy. 

1.Go to IAM  → Policies  → Create Policy

Creating IAM policy for Amazon Bedrock model access to enable LLM-as-a-Judge evaluation with Nova Lite

 

  1. Click JSON tab
  2. Paste this and name it: BedrockEvaluationS3AccessPolicy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FetchAndUpdateOutputBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject",
        "s3:GetBucketLocation",
        "s3:AbortMultipartUpload",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": [
        "arn:aws:s3:::bedrock-evaluation-results-YOUR_ACCOUNT_ID",
        "arn:aws:s3:::bedrock-evaluation-results-YOUR_ACCOUNT_ID/*"
      ]
    }
  ]
}

Note: Replace YOUR_ACCOUNT_ID and update bucket name if its different.

Policy #3: Console User Access

Next, we’ll edit our IAM policy to grant access to our models for generator and evaluator.

  1. Go to IAM → Policies→  Create Policy 
  2. Click JSON tab

Paste this and name it BedrockConsoleUserPolicy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockConsole",
      "Effect": "Allow",
      "Action": [
        "bedrock:CreateEvaluationJob",
        "bedrock:GetEvaluationJob",
        "bedrock:ListEvaluationJobs",
        "bedrock:StopEvaluationJob",
        "bedrock:GetCustomModel",
        "bedrock:ListCustomModels",
        "bedrock:CreateProvisionedModelThroughput",
        "bedrock:UpdateProvisionedModelThroughput",
        "bedrock:GetProvisionedModelThroughput",
        "bedrock:ListProvisionedModelThroughputs",
        "bedrock:GetImportedModel",
        "bedrock:ListImportedModels",
        "bedrock:ListFoundationModels",
        "bedrock:GetFoundationModel",
        "bedrock:ListTagsForResource",
        "bedrock:UntagResource",
        "bedrock:TagResource"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowConsoleS3Access",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetBucketCORS",
        "s3:ListBucket",
        "s3:ListBucketVersions",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::bedrock-evaluation-results-YOUR_ACCOUNT_ID",
        "arn:aws:s3:::bedrock-evaluation-results-YOUR_ACCOUNT_ID/*"
      ]
    },
    {
      "Sid": "AllowPassingServiceRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::YOUR_ACCOUNT_ID:role/BedrockModelEvaluationServiceRole",
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "bedrock.amazonaws.com"
        }
      }
    }
  ]
}

Note: Replace YOUR_ACCOUNT_ID in all 3 places (S3 bucket ARNs and service role ARN).

Step 2: Create Service Role

Now you need to create a service role where you will attached the 2 policies created earlier (BedrockEvaluationModelAccess and BedrockEvaluationS3AccessPolicy)

  1. Go to IAM Console → Roles → Create role
  2. Select Custom trust policy
  3. Paste this trust policy:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowBedrockToAssumeRole",
    "Effect": "Allow",
    "Principal": {
      "Service": "bedrock.amazonaws.com"
    },
    "Action": "sts:AssumeRole",
    "Condition": {
      "StringEquals": {
        "aws:SourceAccount": "YOUR_ACCOUNT_ID"
      },
      "ArnEquals": {
        "aws:SourceArn": "arn:aws:bedrock:us-east-1:YOUR_ACCOUNT_ID:evaluation-job/*"
      }
    }
  }]
}

Note: Don’t forget to replace YOUR_ACCOUNT_ID with your own aws account id!

  1. Click Next
  2. You will be directed to Permission policies 
  3.  Search the BedrockEvaluationModelAccess and BedrockEvaluationS3AccessPolicy
  4. Name your Service Role as BedrockConsoleUserPolicy

Now you’re done setting up the Prerequisites! Quite lengthy isn’t it? 

Next we’re moving forward to set access for the models, namely the Claude 3.5 Sonnet and Amazon Nova Lite.

Step 4: Enable Model Access

You have to verify if you can access the Claude and Amazon Nova

  1. Go to Bedrock -> Left Panel -> Model Catalog

Enabling model access for Amazon Nova Lite and Claude on Bedrock for LLM-as-a-Judge evaluation

  1. Search for Claude 3.5 Sonnet. Click Open in playground.

Enabling model access for Amazon Nova Lite and Claude on Bedrock for LLM-as-a-Judge evaluation

  1. Ask it out a question. If you receive a response, it means you can access it. If not, you might need to submit a use case form.
  2. Test the Amazon Nova Lite as well!

Enabling model access for Amazon Nova Lite and Claude on Bedrock for LLM-as-a-Judge evaluation

Step 5. Prepare your Input Dataset

Moving on, we now create prompts that the Amazon Nova Lite will respond to.

 You can create your own dataset by having:

  1. Prompt (required)
  2. Free AWS Courses
  3. referenceResponse: this serves as a reference point
  4. category (optional): Scores reported by category results to a better analysis

Create a file called evaluation-dataset.jsonl:

{"prompt": "Write a professional email declining a meeting request politely.", "referenceResponse": "A professional decline should thank the sender, briefly explain you're unavailable, and suggest an alternative time or delegate if possible. Keep the tone friendly and respectful.", "category": "professional_writing"}

{"prompt": "Explain the difference between machine learning and deep learning.", "referenceResponse": "Machine learning is when computers learn from data to make predictions. Deep learning is a type of machine learning that uses neural networks with many layers, similar to how the human brain processes information.", "category": "technical"}


{"prompt": "What are three tips for improving work-life balance?", "referenceResponse": "Set clear work hours and stick to them. Make time for hobbies and activities you enjoy. Learn to say no to extra commitments when you're already busy.", "category": "productivity"}

Step 6. Upload Dataset to S3

  1. Go to S3 Console ->bedrock-evaluation-results-YOUR_ACCOUNT_ID
  2. Open the datasets/ folder
  3.  Click Upload and upload your `evaluation-dataset.jsonl  file 

Uploading JSONL evaluation dataset to S3 for Amazon Bedrock LLM-as-a-Judge with Nova Lite

Step  7: Create Your First Evaluation Job

1. Go to Bedrock Console → Evaluations (under Assess section)

Creating LLM-as-a-Judge evaluation job on Amazon Bedrock with Nova Lite generator and Claude judge

2. Click Create.

3. Select Automatic: LLM-as-a-judge.

 

Creating LLM-as-a-Judge evaluation job on Amazon Bedrock with Nova Lite generator and Claude judge

4.Enter evaluation name and fill in description. Select Claude 3.5 Sonnet.

Creating LLM-as-a-Judge evaluation job on Amazon Bedrock with Nova Lite generator and Claude judge

  1. In the inference source, select Amazon Nova Lite to evaluate.

Selecting Amazon Nova Lite as generator and Claude as judge for Bedrock LLM-as-a-Judge evaluation

  1. Select the desired metrics for evaluating your model response.

Selecting-desired-metrics-for-amazon-bedrock-evaluation-job

 

  1. Click Browse S3. 

For Prompt Dataset, click your jsonl file while for the evaluation results, click your results folder.

Configuring CORS on S3 bucket for Amazon Bedrock LLM-as-a-Judge console file uploads

  1. Select the service role you have created during prerequisites and click Create.

Selecting-service-role-Amazon-Bedrock

  1. Wait for 5-10 minutes for the evaluation job to complete and keep refreshing. Your evaluation job is successful once you see the status Completed.

Completed Amazon Bedrock LLM-as-a-Judge evaluation job showing successful Nova Lite and Claude assessment

  1. You may now see the details of your evaluation.

Amazon Bedrock LLM-as-a-Judge evaluation results showing Nova Lite quality metrics and scores from Claude

 

Yay that’s it, you’re done! You now learned how to set up an automatic model that used Claude Sonnet 3.5 to judge your Amazon Nova Lite outputs. All free from manual review and costly evaluation.

 

References:

  1. https://aws.amazon.com/blogs/machine-learning/llm-as-a-judge-on-amazon-bedrock-model-evaluation/ 
  2. https://aws.amazon.com/blogs/machine-learning/real-world-reasoning-how-amazon-nova-lite-2-0-handles-complex-customer-support-scenarios/
  3. https://github.com/aws-samples/sample-amazon-nova-reasoning-eval/blob/main/README.md
  4. https://docs.aws.amazon.com/bedrock/latest/userguide/judge-service-roles.html 
  5. https://labelstud.io/blog/llm-evaluations-techniques-challenges-and-best-practices/

💪 25% OFF on ALL Reviewers to Start Your 2026 Strong with our New Year, New Skills Sale!

Tutorials Dojo portal

Learn AWS with our PlayCloud Hands-On Labs

$2.99 AWS and Azure Exam Study Guide eBooks

tutorials dojo study guide eBook

New AWS Generative AI Developer Professional Course AIP-C01

AIP-C01 Exam Guide AIP-C01 examtopics AWS Certified Generative AI Developer Professional Exam Domains AIP-C01

Learn GCP By Doing! Try Our GCP PlayCloud

Learn Azure with our Azure PlayCloud

FREE AI and AWS Digital Courses

FREE AWS, Azure, GCP Practice Test Samplers

Subscribe to our YouTube Channel

Tutorials Dojo YouTube Channel

Follow Us On Linkedin

Written by: Darla Sumanting

Darla Nova Sumanting is an AWS AI Practitioner Certified Computer Science student at Adamson University.

AWS, Azure, and GCP Certifications are consistently among the top-paying IT certifications in the world, considering that most companies have now shifted to the cloud. Earn over $150,000 per year with an AWS, Azure, or GCP certification!

Follow us on LinkedIn, YouTube, Facebook, or join our Slack study group. More importantly, answer as many practice exams as you can to help increase your chances of passing your certification exams on your first try!

View Our AWS, Azure, and GCP Exam Reviewers Check out our FREE courses

Our Community

~98%
passing rate
Around 95-98% of our students pass the AWS Certification exams after training with our courses.
200k+
students
Over 200k enrollees choose Tutorials Dojo in preparing for their AWS Certification exams.
~4.8
ratings
Our courses are highly rated by our enrollees from all over the world.

What our students say about us?