Evaluating your LLM’s quality should not cost you a fortune or weeks of your time. You’re probably stuck choosing between two options, each with its own drawbacks.

Automated metrics like BLEU, ROUGE, and accuracy scores are fast and cheap, but they fall short when judging real conversations. Because they simply match word patterns, they’re a poor fit for open-ended responses: they can’t tell whether a response actually understood the question or struck the right tone.

Human reviewers do get it right. They capture context and subtlety, and they spot tone issues much better than automated metrics. The problem is that human review is costly and time-consuming. Imagine the time it takes to review more than a thousand responses. Your budget won’t like it.

So, as an alternative, there’s another technique that will make your LLM-judging life easier: LLM-as-a-judge, which uses one AI model to assess another’s output. Amazon Bedrock now offers this feature, which agrees with human preferences up to 80% of the time while being much cheaper and faster.

In this guide, you will learn how to set up LLM-as-a-judge using Amazon Nova Lite (the model you’re testing) and Claude (the judge). You’ll automate quality checks, lower costs, and finally scale your testing.

LLM-as-a-judge is a process where one LLM evaluates the quality of another LLM’s generated output based on guidelines you’ve created. Instead of relying on human evaluators (which takes a lot of effort and time!), you speed up model evaluation by automating it.

You might be wondering why, out of all the available models, we chose Nova Lite. Simply put, it’s an overachiever: not only is it cheap, it’s also good. Like, really good. Amazon Nova Lite (the model being evaluated) performs consistently across all evaluation dimensions: problem identification, communication clarity, logical coherence, and empathy.
This means that you can deploy your AI system with confidence, knowing that Nova Lite is reliable and smart.

Claude 3.5 Sonnet (the judge) is your evaluator. It’s one of the best models at assessing quality, catching nuance, and providing consistent scores. When you need an AI to judge another AI’s work, Claude is the model you want. With that said, let’s get to work and automate!

Before we start, I assume you already have an AWS account and access to Bedrock. Let’s start by setting up our resources and IAM configurations. You’ll need the following:

Policy #1: Bedrock Model Access
Paste this and name it: BedrockEvaluationModelAccessPolicy
Note: Replace YOUR_ACCOUNT_ID with your AWS account ID.

Policy #2: S3 Access
First, create the S3 bucket:
Second, create folders: This is where you’ll put your input dataset and the evaluation results for your generator.
Third, configure CORS: Enabling this allows you to upload files to S3 through the Bedrock console.

Now, we need to create a policy.
1. Go to IAM → Policies → Create Policy
Note: Replace YOUR_ACCOUNT_ID and update the bucket name if it’s different.

Next, we’ll create a console user policy that grants access to the generator and evaluator models. Paste this and name it BedrockConsoleUserPolicy
Note: Replace YOUR_ACCOUNT_ID in all 3 places (S3 bucket ARNs and service role ARN).

Now you need to create a service role and attach the 2 policies created earlier (BedrockEvaluationModelAccessPolicy and BedrockEvaluationS3AccessPolicy).
Note: Don’t forget to replace YOUR_ACCOUNT_ID with your own AWS account ID!

Now you’re done setting up the prerequisites! Quite lengthy, isn’t it? Next, we’ll set up access for the models, namely Claude 3.5 Sonnet and Amazon Nova Lite. You have to verify that you can access both Claude and Amazon Nova.

Moving on, we now create the prompts that Amazon Nova Lite will respond to.
You can create your own dataset by having:
Create a file called evaluation-dataset.jsonl:

1. Go to Bedrock Console → Evaluations (under the Assess section).
2. Click Create.
3. Select Automatic: LLM-as-a-judge.
4. Enter an evaluation name and fill in the description. Select Claude 3.5 Sonnet. For Prompt Dataset, select your .jsonl file; for the evaluation results, select your results folder.

Yay, that’s it, you’re done! You’ve now learned how to set up an automated evaluation that uses Claude 3.5 Sonnet to judge your Amazon Nova Lite outputs, free from manual review and costly evaluation.
What is LLM-as-a-Judge?
Why Amazon Nova Lite + Claude?
Prerequisites
Step 1: Create IAM Policies
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockModelInvoke",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:CreateModelInvocationJob",
        "bedrock:StopModelInvocationJob"
      ],
      "Resource": [
        "arn:aws:bedrock:*::foundation-model/*",
        "arn:aws:bedrock:*:YOUR_ACCOUNT_ID:inference-profile/*",
        "arn:aws:bedrock:us-east-1:YOUR_ACCOUNT_ID:provisioned-model/*",
        "arn:aws:bedrock:us-east-1:YOUR_ACCOUNT_ID:imported-model/*"
      ]
    }
  ]
}
[
  {
    "AllowedHeaders": [
      "*"
    ],
    "AllowedMethods": [
      "GET",
      "PUT",
      "POST",
      "DELETE"
    ],
    "AllowedOrigins": [
      "*"
    ],
    "ExposeHeaders": [
      "Access-Control-Allow-Origin"
    ]
  }
]
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FetchAndUpdateOutputBucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject",
        "s3:GetBucketLocation",
        "s3:AbortMultipartUpload",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": [
        "arn:aws:s3:::bedrock-evaluation-results-YOUR_ACCOUNT_ID",
        "arn:aws:s3:::bedrock-evaluation-results-YOUR_ACCOUNT_ID/*"
      ]
    }
  ]
}
Policy #3: Console User Access
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockConsole",
      "Effect": "Allow",
      "Action": [
        "bedrock:CreateEvaluationJob",
        "bedrock:GetEvaluationJob",
        "bedrock:ListEvaluationJobs",
        "bedrock:StopEvaluationJob",
        "bedrock:GetCustomModel",
        "bedrock:ListCustomModels",
        "bedrock:CreateProvisionedModelThroughput",
        "bedrock:UpdateProvisionedModelThroughput",
        "bedrock:GetProvisionedModelThroughput",
        "bedrock:ListProvisionedModelThroughputs",
        "bedrock:GetImportedModel",
        "bedrock:ListImportedModels",
        "bedrock:ListFoundationModels",
        "bedrock:GetFoundationModel",
        "bedrock:ListTagsForResource",
        "bedrock:UntagResource",
        "bedrock:TagResource"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowConsoleS3Access",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetBucketCORS",
        "s3:ListBucket",
        "s3:ListBucketVersions",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::bedrock-evaluation-results-YOUR_ACCOUNT_ID",
        "arn:aws:s3:::bedrock-evaluation-results-YOUR_ACCOUNT_ID/*"
      ]
    },
    {
      "Sid": "AllowPassingServiceRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::YOUR_ACCOUNT_ID:role/BedrockModelEvaluationServiceRole",
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "bedrock.amazonaws.com"
        }
      }
    }
  ]
}
Step 2: Create Service Role
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowBedrockToAssumeRole",
    "Effect": "Allow",
    "Principal": {
      "Service": "bedrock.amazonaws.com"
    },
    "Action": "sts:AssumeRole",
    "Condition": {
      "StringEquals": {
        "aws:SourceAccount": "YOUR_ACCOUNT_ID"
      },
      "ArnEquals": {
        "aws:SourceArn": "arn:aws:bedrock:us-east-1:YOUR_ACCOUNT_ID:evaluation-job/*"
      }
    }
  }]
}
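Every policy in this guide carries the YOUR_ACCOUNT_ID placeholder, and it is easy to miss one when editing by hand. Here’s a minimal sketch of how you might fill it in and confirm the result is still valid JSON before pasting it into the console. It uses only the Python standard library; the helper name fill_account_id and the sample account ID 123456789012 are illustrative assumptions, not part of the original setup.

```python
import json
import re

PLACEHOLDER = "YOUR_ACCOUNT_ID"


def fill_account_id(policy_text, account_id):
    """Replace the placeholder and return the parsed policy document."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits")
    filled = policy_text.replace(PLACEHOLDER, account_id)
    return json.loads(filled)  # raises json.JSONDecodeError if malformed


# The service role trust policy from Step 2, kept as a template string.
trust_policy_template = """
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowBedrockToAssumeRole",
    "Effect": "Allow",
    "Principal": {"Service": "bedrock.amazonaws.com"},
    "Action": "sts:AssumeRole",
    "Condition": {
      "StringEquals": {"aws:SourceAccount": "YOUR_ACCOUNT_ID"},
      "ArnEquals": {
        "aws:SourceArn": "arn:aws:bedrock:us-east-1:YOUR_ACCOUNT_ID:evaluation-job/*"
      }
    }
  }]
}
"""

# 123456789012 is a made-up account ID used only for illustration.
policy = fill_account_id(trust_policy_template, "123456789012")
condition = policy["Statement"][0]["Condition"]
print(condition["ArnEquals"]["aws:SourceArn"])
print(PLACEHOLDER in json.dumps(policy))  # False once every placeholder is replaced
```

The same helper should work for the three IAM policies from Step 1, since they use the same placeholder text.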
Step 4: Enable Model Access
Step 5: Prepare Your Input Dataset
{"prompt": "Write a professional email declining a meeting request politely.", "referenceResponse": "A professional decline should thank the sender, briefly explain you're unavailable, and suggest an alternative time or delegate if possible. Keep the tone friendly and respectful.", "category": "professional_writing"}
{"prompt": "Explain the difference between machine learning and deep learning.", "referenceResponse": "Machine learning is when computers learn from data to make predictions. Deep learning is a type of machine learning that uses neural networks with many layers, similar to how the human brain processes information.", "category": "technical"}
{"prompt": "What are three tips for improving work-life balance?", "referenceResponse": "Set clear work hours and stick to them. Make time for hobbies and activities you enjoy. Learn to say no to extra commitments when you're already busy.", "category": "productivity"}
Step 6: Upload Dataset to S3
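Before uploading, it may be worth confirming that every line of evaluation-dataset.jsonl is a standalone JSON object with a non-empty prompt, since a JSONL file breaks as soon as two records end up on the same line. Here’s a minimal sketch using only the Python standard library. The function name validate_jsonl is hypothetical, and the checks assume only the fields shown in the sample records above.

```python
import json


def validate_jsonl(text):
    """Return a list of (line_number, problem) tuples for a JSONL string."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        prompt = record.get("prompt")
        if not isinstance(prompt, str) or not prompt.strip():
            problems.append((number, "missing or empty 'prompt'"))
    return problems


# Illustrative input: one good record, two fused records, one without a prompt.
sample = "\n".join([
    '{"prompt": "What are three tips for improving work-life balance?", "category": "productivity"}',
    '{"prompt": "A"}{"prompt": "B"}',
    '{"referenceResponse": "no prompt here"}',
])

problems = validate_jsonl(sample)
for number, problem in problems:
    print(f"line {number}: {problem}")
```

To check the real file, you could pass in the result of open("evaluation-dataset.jsonl", encoding="utf-8").read(). An empty list means every line parsed and has a prompt.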
Step 7: Create Your First Evaluation Job
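If you later want to script this step instead of clicking through the console, here’s a rough sketch of what the job request could look like. It only builds and prints a dictionary. The field names follow my understanding of the Bedrock CreateEvaluationJob request shape, and the model IDs, metric names, dataset name, and S3 folder names (datasets/ and results/) are assumptions that you should verify against the current API reference and your own bucket layout.

```python
import json

# Assumed model identifiers; confirm them in the Bedrock console for your region.
GENERATOR_MODEL_ID = "amazon.nova-lite-v1:0"
JUDGE_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"


def build_evaluation_job_request(job_name, role_arn, dataset_s3_uri, output_s3_uri):
    """Build a request dictionary for an LLM-as-a-judge evaluation job."""
    return {
        "jobName": job_name,
        "roleArn": role_arn,
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [
                    {
                        "taskType": "General",
                        "dataset": {
                            "name": "evaluation-dataset",  # hypothetical name
                            "datasetLocation": {"s3Uri": dataset_s3_uri},
                        },
                        # Assumed built-in judge metrics; adjust as needed.
                        "metricNames": ["Builtin.Correctness", "Builtin.Helpfulness"],
                    }
                ],
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{"modelIdentifier": JUDGE_MODEL_ID}]
                },
            }
        },
        "inferenceConfig": {
            "models": [{"bedrockModel": {"modelIdentifier": GENERATOR_MODEL_ID}}]
        },
        "outputDataConfig": {"s3Uri": output_s3_uri},
    }


# 123456789012 and the folder names below are placeholders for illustration.
bucket = "s3://bedrock-evaluation-results-123456789012"
request = build_evaluation_job_request(
    job_name="nova-lite-judged-by-claude",
    role_arn="arn:aws:iam::123456789012:role/BedrockModelEvaluationServiceRole",
    dataset_s3_uri=f"{bucket}/datasets/evaluation-dataset.jsonl",
    output_s3_uri=f"{bucket}/results/",
)
print(json.dumps(request, indent=2))
```

Assuming boto3 is installed and your credentials are configured, a dictionary like this could be passed as boto3.client("bedrock").create_evaluation_job(**request). Treat that as a starting point to check against the documentation, not a verified recipe.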
References:
Amazon Bedrock’s LLM-as-a-Judge: Automate AI Evaluation with Nova Lite + Claude