
Real-time Personally Identifiable Information (PII) Redaction Pipeline with S3 + Lambda + Comprehend


In my previous article, I demonstrated how to use the Amazon Comprehend console to manually detect and redact Personally Identifiable Information (PII) from text files. While this hands-on method is excellent for learning the fundamentals of PII detection, it’s not practical in real-world, high-volume environments where speed and accuracy are essential. In such scenarios, organizations need more than just a simple, one-time approach—they require a robust, fully automated pipeline that sanitizes sensitive data as soon as it enters the system, without the need for manual intervention.

This article will walk you through the creation of an automated workflow that solves this challenge. We’ll design a solution where uploading a text file to Amazon S3 will automatically trigger PII detection and redaction using Amazon Comprehend. The sanitized output will then be stored back in S3, neatly organized under a dedicated folder (prefix) for easy access.

This approach not only ensures that sensitive data is processed securely but also eliminates delays associated with manual processes, enabling real-time data sanitization at scale. By the end of this guide, you’ll have a production-ready pipeline that can handle thousands of files daily while ensuring compliance with data protection regulations.

Why Automate Personally Identifiable Information (PII) Redaction?

  • Compliance: Regulations like GDPR, HIPAA, and PCI DSS require strict handling of PII. Automation reduces human error.
  • Scalability: Manual console jobs don’t scale when you’re processing thousands of files daily.
  • Real-time processing: Event-driven pipelines ensure sensitive data is sanitized before it’s consumed downstream.

Architecture Overview

Here’s the high-level flow:

  1. Upload to S3 → A text file containing raw data is uploaded to the raw/ folder of an S3 bucket.
  2. S3 Event Notification → The bucket is configured to trigger an AWS Lambda function whenever a new object is created in raw/.
  3. Lambda Function → The function retrieves the file, calls Amazon Comprehend’s PII detection API, redacts sensitive entities, and saves the sanitized version.
  4. Output Storage → The redacted file is written back to the same bucket under the redacted/ folder.

Step-by-Step Implementation:

1. Create an S3 Bucket

  • Bucket name: Enter a unique bucket name.
  • Inside this bucket, organize files with prefixes:
    • raw/ → for unprocessed input files
    • redacted/ → for sanitized output files

2. Create the Lambda Function

Lambda retrieves the file from raw/, processes it with Amazon Comprehend, and writes the redacted version back to redacted/.
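The article doesn’t reproduce the function body, but a minimal sketch of the handler described above might look like the following. The raw/ and redacted/ prefixes match this guide; the [ENTITY_TYPE] placeholder format and the English-only LanguageCode are assumptions you can adapt:

```python
import urllib.parse

def redact(text, entities):
    """Replace each detected PII span with its entity type, e.g. [SSN].

    `entities` is the Entities list returned by Comprehend's
    DetectPiiEntities API (dicts with Type, BeginOffset, EndOffset).
    Spans are replaced from the end of the text backwards so earlier
    offsets remain valid while the string is being edited.
    """
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[: e["BeginOffset"]] + f"[{e['Type']}]" + text[e["EndOffset"] :]
    return text

def lambda_handler(event, context):
    import boto3  # included in the Lambda Python runtime

    s3 = boto3.client("s3")
    comprehend = boto3.client("comprehend")

    # S3 event notifications URL-encode object keys.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    # Synchronous detection; the input must stay under the 100 KB API limit.
    resp = comprehend.detect_pii_entities(Text=text, LanguageCode="en")
    redacted = redact(text, resp["Entities"])

    out_key = key.replace("raw/", "redacted/", 1)
    s3.put_object(Bucket=bucket, Key=out_key, Body=redacted.encode("utf-8"))
    return {"statusCode": 200, "redactedKey": out_key}
```

Replacing spans from the highest offset down is the key detail: substituting a shorter placeholder for a longer value shifts everything after it, so working backwards keeps the remaining offsets valid.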

3. IAM Permissions

Ensure that the Lambda function has the necessary permissions to interact with S3 and Amazon Comprehend. You can create an inline policy for the Lambda execution role as follows (for brevity, this example uses "Resource": "*"; in production, scope the S3 actions to your bucket’s ARNs):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "comprehend:DetectPiiEntities"
            ],
            "Resource": "*"
        }
    ]
}

4. Configure S3 Event Notifications

  • On the bucket, go to the Properties tab.
  • Scroll down to Event notifications, then click Create event notification.
  • Set up the event notification:
    • Event name: Enter a name.
    • Prefix filter: raw/ (so only new raw files trigger the workflow)
    • Event type: s3:ObjectCreated:*
    • Destination: AWS Lambda function (select the function you created in step 2)
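Under the hood, these console steps produce a bucket notification configuration like the following (the Lambda ARN below is a placeholder; substitute your own):

```json
{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:YOUR_FUNCTION",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            { "Name": "prefix", "Value": "raw/" }
          ]
        }
      }
    }
  ]
}
```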

5. Test the Workflow

Upload a file to raw/. The Lambda function will automatically generate a redacted copy in redacted/.

Input:

Customer: Juan Dela Cruz
SSN: 123-45-6789
Card: 4111 1111 1111 1111

Output (redacted):

(Screenshot: the redacted output file, with each detected PII value replaced.)

Considerations:

  • File size limits: DetectPiiEntities supports up to 100 kilobytes of UTF-8 encoded characters. For larger files, use asynchronous batch jobs (StartPiiEntitiesDetectionJob).
  • Latency: Synchronous detection is fast for small files; batch jobs scale better for large datasets.
  • Cost: Amazon Comprehend charges based on the volume of text processed; you can monitor usage and costs with AWS Cost Explorer.
  • Bucket vs. Prefix: For simplicity, it’s generally advisable to use one bucket with appropriate prefixes (e.g., raw/ and redacted/). However, if your organization has strict compliance or IAM isolation requirements, you might opt for separate buckets.
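If a file can exceed the synchronous limit, one option beyond batch jobs (a sketch of my own, not from this article) is to split the text into sub-100 KB pieces before calling DetectPiiEntities. Note the caveat: a split can cut an entity across a chunk boundary, so asynchronous batch jobs remain the safer route for large files.

```python
MAX_BYTES = 100_000  # DetectPiiEntities limit: 100 KB of UTF-8 encoded text

def chunk_text(text, max_bytes=MAX_BYTES):
    """Split text into pieces whose UTF-8 encoding fits under max_bytes.

    Prefers to break on line boundaries so entities are less likely to be
    cut mid-value; a single line longer than max_bytes is split hard.
    """
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if len((current + line).encode("utf-8")) <= max_bytes:
            current += line
            continue
        if current:
            chunks.append(current)
            current = ""
        # A single oversized line: back off the cut point until the
        # slice's UTF-8 size fits (multibyte characters inflate bytes).
        while len(line.encode("utf-8")) > max_bytes:
            cut = max_bytes
            while len(line[:cut].encode("utf-8")) > max_bytes:
                cut -= 1
            chunks.append(line[:cut])
            line = line[cut:]
        current = line
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to detect_pii_entities individually, with the redacted pieces concatenated in order.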

Automate File Uploading with Slack

For even more convenience, you can automate file uploads directly from Slack to your S3 bucket using AWS Lambda. This way, whenever a file is uploaded to a designated Slack channel, it can be automatically processed and uploaded to the raw/ folder in your S3 bucket. For more details, refer to this: https://tutorialsdojo.com/automating-file-uploads-from-slack-to-amazon-s3-harnessing-aws-lambda-and-slack-api/

Conclusion:

By combining Amazon S3 event notifications, AWS Lambda, and Amazon Comprehend, organizations can set up a real-time, automated pipeline for Personally Identifiable Information (PII) redaction. This approach enhances compliance, scalability, and security while minimizing the need for manual intervention.

By transforming a console-based workflow into a production-ready architecture, this solution is ideal for organizations handling sensitive customer data at scale.

References:

https://docs.aws.amazon.com/comprehend/latest/APIReference/API_StartPiiEntitiesDetectionJob.html
https://docs.aws.amazon.com/comprehend/latest/dg/realtime-pii-api.html



Written by: Nestor Mayagma Jr.

Nestor is a cloud engineer and content creator at Tutorials Dojo. He's been an active AWS Community Builder since 2022, with a growing interest in multi-cloud technologies across AWS, Azure, and Google Cloud. In his leisure time, he indulges in playing FPS games.
