Last updated on January 21, 2026
I discovered something embarrassing about my LLM development workflow last month.
After spending hours crafting what I thought was the perfect prompt for a customer service chatbot on Amazon Bedrock, I deployed it and called it done. My validation process? I asked it five questions, nodded approvingly at the responses, and moved on.
Sound familiar?
This “vibe-based prompting” approach worked fine until the chatbot confidently told a user that our fictional company offers “24/7 phone support,” a feature that never existed. The model hallucinated, and I had no automated way to catch it.
That experience sent me down a rabbit hole into LLM evaluation, better known as “evals.” What I found changed how I think about building GenAI applications. If you’re a student learning to build with Amazon Bedrock, this might save you from my mistakes.
The Uncomfortable Truth About LLM Development
Here’s what nobody tells you: LLMs are confidently wrong all the time.
They’ll cite nonexistent research papers. They’ll invent API endpoints. And they’ll do it with the same confident tone they use when giving you actually correct information.
This matters more than ever because people now use LLMs for serious stuff:
- Summarizing legal documents and policy changes
- Displaying real-time financial data in customer apps
- Creating educational content for universities
- Explaining medication interactions to patients
- Recommending software libraries that release updates monthly
A model trained on data from 2023 doesn’t know what happened yesterday. It doesn’t know if that SDK version you’re asking about has a critical security vulnerability discovered last week. But it’ll answer anyway.
Traditional software testing doesn’t help here. You can’t just check if the output equals some expected string—LLM responses are non-deterministic by design. The same prompt might give you three different (but valid) phrasings of the same answer.
So how do you test something that’s supposed to be creative and variable?
LLM Evaluation and Why It’s Different
LLM evaluation is like unit testing for prompts, but with a twist.
Instead of checking “does output == expected,” you’re asking questions like:
- Does this response contain the key information it should?
- Is the tone appropriate for a customer service context?
- Did the model stay grounded in the provided context, or did it make stuff up?
- Would a human expert consider this response helpful?
A proper evaluation setup has four components:
- Test inputs: the prompts or questions you give to the model
- Expected behavior: not exact strings, but criteria for what “good” looks like
- Model outputs: what your LLM actually generates
- Scoring logic: how you measure success
The scoring part is where things diverge from traditional testing. You’ve got options:
Deterministic checks work for structured outputs. Does the JSON parse correctly? Does it contain required fields? Is the response under 500 tokens?
Semantic similarity compares meaning rather than exact words. “The capital of France is Paris” should match with “Paris serves as France’s capital city.”
LLM-as-a-judge uses another model to evaluate responses. You basically ask Claude to grade Claude’s homework. Surprisingly effective for subjective qualities like helpfulness and tone.
Human evaluation remains the gold standard for nuanced assessment, but it doesn’t scale.
The best evaluation strategies combine multiple approaches. You might use deterministic checks for format validation, semantic similarity for content accuracy, and LLM-as-a-judge for subjective quality, all on the same test case.
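To make that concrete, here is a rough sketch of a combined check written in the YAML format used by Promptfoo, the tool covered in the next section. The reference answer, threshold, and rubric wording are illustrative, not prescriptive:

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      # Deterministic check: keep the answer short
      - type: javascript
        value: output.length < 300
      # Semantic similarity: compare meaning against a reference answer
      - type: similar
        value: "Paris is the capital of France."
        threshold: 0.8
      # LLM-as-a-judge: grade subjective qualities like accuracy and tone
      - type: llm-rubric
        value: "Answers the question accurately, in a friendly and concise tone"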
Promptfoo: The Testing Framework That Made It Click
After researching options, I landed on Promptfoo, an open-source CLI and library for evaluating and red-teaming LLM apps, with over 9,000 stars on GitHub. Think of it as Jest or Pytest, but for prompts.
What sold me:
It’s actually open source. MIT license, no strings attached. LLM evals should be a commodity, not a premium feature locked behind enterprise pricing.
It speaks YAML. Configuration is declarative and readable. No complex SDKs to learn.
It integrates with everything. Bedrock, OpenAI, Anthropic, local models, custom APIs.
It runs locally. Your prompts and test data never leave your machine unless you want them to.
Promptfoo produces matrix views that let you quickly evaluate outputs across many prompts; the promptfoo docs show side-by-side comparisons of multiple prompts and inputs. It also works on the command line, and it can generate high-level vulnerability and risk reports.
Here’s what a basic Promptfoo config looks like for Amazon Bedrock:
# promptfooconfig.yaml
description: "Customer service chatbot evaluation"

providers:
  - id: bedrock:anthropic.claude-3-sonnet-20240229-v1:0
    config:
      region: us-east-1
      temperature: 0.3

prompts:
  - |
    You are a helpful customer service assistant for TechGadgets Inc.
    Answer the customer's question based only on the following context:
    {{context}}
    Customer question: {{question}}

tests:
  - vars:
      context: "TechGadgets offers email support Monday-Friday 9am-5pm EST. No phone support is available."
      question: "How can I contact support?"
    assert:
      - type: contains
        value: "email"
      - type: not-contains
        value: "phone support"
        transform: output.toLowerCase()
      - type: llm-rubric
        value: "Response should be helpful and not invent features that don't exist"
Run promptfoo eval and you get a report showing which tests passed, which failed, and why.
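You can also target the exact failure from the intro, the invented 24/7 phone support, by adding another entry under tests. The rubric wording here is just a sketch:

  - vars:
      context: "TechGadgets offers email support Monday-Friday 9am-5pm EST. No phone support is available."
      question: "Do you offer 24/7 phone support?"
    assert:
      # A plain string match isn't enough here: a good answer may mention
      # phone support in order to deny it, so use a rubric instead
      - type: llm-rubric
        value: "Clearly states that phone support is not offered and points the customer to email support instead"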
But Wait — Doesn’t Bedrock Already Do This?
Yes, Amazon Bedrock now has native evaluation capabilities.
At re:Invent 2024, AWS announced LLM-as-a-judge and RAG evaluation features, which became generally available in March 2025. These are well-designed tools that let you:
- Evaluate model responses using judge models like Claude or Nova
- Assess RAG pipelines for context relevance and faithfulness
- Run evaluations directly from the Bedrock console
- Get normalized scores with explanations
AWS claims up to 98% cost savings compared to human evaluation, and for certain use cases, that’s compelling.
So why bother with Promptfoo?
Different tools for different jobs.
Bedrock Evaluations excels at:
- Comparing foundation models before you commit to one
- One-time assessments of model quality
- Human evaluation workflows with managed workforces
- Enterprise scenarios where everything must stay within AWS
Promptfoo excels at:
- CI/CD integration (block deployments on failing evals; see the workflow sketch after this list)
- Rapid iteration during development
- Testing across multiple providers simultaneously
- Security testing and red-teaming
- Running locally without AWS credentials during development
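As a sketch of the CI/CD point, here is roughly what a GitHub Actions job could look like. The workflow filename, secret names, and Node version are placeholders; a failing assertion makes the eval step exit non-zero, which is what blocks the merge:

# .github/workflows/prompt-evals.yml (illustrative)
name: prompt-evals

on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run Promptfoo evals against Bedrock
        env:
          # Placeholder secret names; an OIDC role would also work
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: us-east-1
        # Failing assertions cause a non-zero exit code, failing this check
        run: npx promptfoo@latest eval -c promptfooconfig.yaml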
Amazon Bedrock Evaluations helps you pick the right model. Promptfoo helps you ship the right prompts.
Setting Up Promptfoo with Amazon Bedrock
Here’s how to set up Promptfoo for Bedrock development.
Prerequisites
You’ll need:
- Node.js 18 or higher
- AWS credentials configured (via environment variables, credentials file, or IAM role)
- Access to Bedrock models in your AWS account
Installation
npm install -g promptfoo@latest
# or, if you use npx
npx promptfoo init
Configure AWS Credentials
Promptfoo uses the standard AWS SDK, so your existing credentials work:
# Environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-1
Your First Bedrock Evaluation
Create a file called promptfooconfig.yaml:
description: "My first Bedrock evaluation"
defaultTest:
options:
provider: bedrock:anthropic.claude-3-haiku-20240307-v1:0
providers:
- id: bedrock:anthropic.claude-3-haiku-20240307-v1:0
config:
region: us-east-1
- id: bedrock:amazon.nova-lite-v1:0
config:
region: us-east-1
prompts:
- "Explain {{concept}} to a college freshman in 2-3 sentences."
tests:
- vars:
concept: "machine learning"
assert:
- type: contains-any
value: ["data", "pattern", "learn", "algorithm"]
- type: javascript
value: output.length < 500
- vars:
concept: "quantum computing"
assert:
- type: contains-any
value: ["qubit", "superposition", "quantum"]
- type: llm-rubric
value: "Explanation should be accurate and accessible to someone without a physics background"
Run it:
npx promptfoo eval -c promptfooconfig.yaml
You’ll see a table comparing how Claude Haiku and Nova performed on each test case. Maybe Haiku nails the explanations, but runs slower. Maybe Nova is cheaper but misses some key concepts. Now you have data to make decisions.
(Screenshot: Promptfoo CLI showing evaluation results comparing Claude Haiku and Amazon Nova Lite on Amazon Bedrock.)
When To Use What: A Decision Framework
After spending time with both tools, here’s my mental model:
Use Bedrock Evaluations when:
- You’re selecting between foundation models for a new project
- You need human evaluation at scale
- Compliance requires everything to stay within AWS
- You’re doing comprehensive quarterly assessments
Use Promptfoo when:
- You’re actively developing and iterating on prompts
- You need evals in CI/CD pipelines
- You’re testing across multiple providers (Bedrock + OpenAI + local)
- You need security/red-team testing
- You want to run evals locally without cloud dependencies
Use both when you’re building production GenAI systems that need continuous validation:
- Bedrock Evaluations for initial model selection
- Promptfoo for ongoing development and deployment
They’re complementary, not competing.
What I Wish Someone Had Told Me
A few lessons from my journey into LLM evaluation:
Start small. You don’t need 500 test cases on day one. Start with 10 that cover your critical paths. Add more as you find edge cases.
Test for what matters. Don’t eval everything. Focus on behaviors that would cause real problems if wrong — hallucinations, safety issues, brand voice violations.
Invest in good test data. Garbage inputs produce garbage insights. Spend time crafting realistic test cases that reflect actual user behavior.
Evals reveal architecture problems. When evaluation showed our RAG system failing, we discovered the retrieval step was broken — not the generation. Evals are diagnostics, not just grades.
Embrace non-determinism. LLM outputs vary. Write assertions that check for semantic correctness, not exact matches. Use contains-any, llm-rubric, and similarity checks.
Track costs. Running evals costs money (API calls). Promptfoo shows token usage. Keep an eye on it, especially with LLM-as-judge.
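One way to keep an eye on it: Promptfoo also ships cost and latency assertion types that you can append to an existing test’s assert list. The thresholds below are arbitrary examples, so treat this as a sketch:

    assert:
      # Fail if the inference cost of this response exceeds $0.002
      - type: cost
        threshold: 0.002
      # Fail if the response takes more than 5 seconds (value in milliseconds)
      - type: latency
        threshold: 5000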
The documentation at promptfoo.dev covers Bedrock-specific setup in detail.
Resources
- Promptfoo Documentation
- Promptfoo AWS Bedrock Provider
- Amazon Bedrock Evaluations
- AWS Guidance: Evaluating GenAI Applications using OSS
- Promptfoo GitHub
- AWS Builder Center: Building an LLM Evaluation Framework with AWS Bedrock and Promptfoo
- Tutorials Dojo: Amazon Bedrock’s LLM-as-a-Judge with Nova Lite and Claude: https://tutorialsdojo.com/amazon-bedrocks-llm-as-a-judge-automate-ai-evaluation-with-nova-lite-claude/
- Tutorials Dojo: What Are AI Agents?: https://tutorialsdojo.com/what-are-ai-agents/