The Challenge
Data privacy is a top priority for businesses, especially as data regulations expand worldwide. One common challenge is ensuring that sensitive data, such as personally identifiable information (PII), remains protected when data is accessed or transferred.
Imagine you have a bunch of employee profiles stored as CSV files in an S3 bucket. These profiles include sensitive information such as real names, social security numbers, and email addresses, along with non-sensitive data like job titles and office locations. Various teams within your company often need to access these files for their work. For instance, HR may need to understand the distribution of roles, while the operations team may need the names and locations to plan logistics. However, PII, like social security numbers and personal email addresses, should remain confidential and not be accessible to these teams.
Before S3 Object Lambda, you would have to create and maintain two separate versions of your data – one with the sensitive information for authorized personnel and another redacted version for others. This approach not only doubles your storage requirements but also increases the risk of accidental data leakage or mishandling.
Enter S3 Object Lambda
With S3 Object Lambda, you can process data as it is being retrieved from S3 without altering the original stored data. You can modify the data returned by a standard S3 GET request, for example, to redact sensitive information, convert the data format, or compress the data on the fly.
How it works
S3 Object Lambda invokes your custom Lambda function whenever a GET request reaches an S3 Object Lambda Access Point. The function receives details about the request, including a pre-signed URL for reading the original object. It transforms the data and hands the result back to S3 Object Lambda through the WriteGetObjectResponse API; nothing is written back to the bucket. The requester receives the transformed data as the response to their GET request, while the original object in S3 remains intact.
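The flow described above can be sketched as a small Python handler. This is a minimal illustration, not the article's actual function: the `transform` helper is a hypothetical placeholder (it simply upper-cases the bytes), and the event keys `inputS3Url`, `outputRoute`, and `outputToken` are the fields S3 Object Lambda places under `getObjectContext`.

```python
import urllib.request


def transform(body: bytes) -> bytes:
    """Hypothetical placeholder; a real function would redact or reformat here."""
    return body.upper()


def lambda_handler(event, context):
    # boto3 ships with the Lambda Python runtime; it is imported here so the
    # module can also be loaded locally without it.
    import boto3

    ctx = event["getObjectContext"]

    # 1. Read the original object through the pre-signed URL in the event.
    with urllib.request.urlopen(ctx["inputS3Url"]) as resp:
        original = resp.read()

    # 2. Transform the bytes in memory; the stored object is never modified.
    transformed = transform(original)

    # 3. Hand the result to S3 Object Lambda, which returns it to the requester.
    boto3.client("s3").write_get_object_response(
        Body=transformed,
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
    )
    return {"statusCode": 200}
```

Reading the whole object into memory keeps the sketch short; for large files a streaming approach would likely be a better fit.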
Demo
Let’s see this in action. We’ll use Python in our Lambda function to redact the social_security_number and email columns from the following CSV file:
In the transformed data, the sensitive fields are replaced with the word ‘REDACTED’:
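As a rough sketch of the redaction step itself, the helper below masks the two sensitive columns using Python's standard csv module. The function name `redact_csv`, the sample row, and the column names other than social_security_number and email are made up for illustration and are not the article's actual data.

```python
import csv
import io

# Column names taken from the article; everything else below is illustrative.
SENSITIVE_COLUMNS = {"social_security_number", "email"}


def redact_csv(text: str, sensitive=SENSITIVE_COLUMNS, mask: str = "REDACTED") -> str:
    """Return the CSV text with every value in a sensitive column masked."""
    reader = csv.DictReader(io.StringIO(text))
    if not reader.fieldnames:
        return text  # empty input: nothing to redact

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in sensitive:
            if col in row:
                row[col] = mask
        writer.writerow(row)
    return out.getvalue()


# Made-up sample row, not the article's file.
sample = (
    "name,social_security_number,email,job_title,office_location\n"
    "Jane Doe,000-00-0000,jane@example.com,Engineer,Manila\n"
)
print(redact_csv(sample))
```

Matching on column names rather than positions means the function keeps working if columns are reordered, as long as the header row is present.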
Steps
- Creating the Lambda Functions
- Creating an S3 Bucket
- Setting up an S3 Access Point
- Creating an S3 Object Lambda Access Point
Creating the Lambda Functions
First, create two AWS Lambda functions.
- Redact function: This function redacts sensitive information from the original CSV file. Attach the AmazonS3ObjectLambdaExecutionRolePolicy managed policy to the function’s execution role. Raise the timeout to 30 seconds (the Lambda default is 3 seconds) so the function has enough time to fetch and process the file.
- Reader function: We’ll use this function to simulate an end user or application retrieving the CSV file through the Object Lambda Access Point. Raise its timeout to 30 seconds as well.
Attach the policy below to the Reader function’s execution role. For demo purposes, the policy uses a wildcard so that its actions apply to all resources; in a real deployment, restrict it to the specific ARNs involved.
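One possible shape for such a policy is sketched here as a Python dictionary that prints the JSON to paste into the IAM console. The three actions follow the pattern AWS documents for callers of an Object Lambda Access Point; the statement ID and the exact action list are assumptions and may differ from the policy the article originally used.

```python
import json

# Assumed demo policy: wildcard resource, as the article describes.
reader_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowObjectLambdaRead",  # hypothetical statement ID
            "Effect": "Allow",
            "Action": [
                "s3-object-lambda:GetObject",  # read through the Object Lambda Access Point
                "s3:GetObject",                # read through the supporting access point
                "lambda:InvokeFunction",       # let the request invoke the Redact function
            ],
            "Resource": "*",
        }
    ],
}

print(json.dumps(reader_policy, indent=2))
```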
Creating an S3 Bucket
- Create a new S3 bucket or use an existing one.
- Upload the CSV file to your bucket.
Setting up an S3 Access Point
- Navigate to the Amazon S3 console and select your bucket.
- Move to the Access points tab and click on Create access point.
- Provide a name for your access point.
- Select Internet as the Network origin and leave the other settings at their default values.
- Click Create access point.
Creating an S3 Object Lambda Access Point
- Go to the Object Lambda Access Points page and select the region where your bucket is located.
- Provide a name for your Object Lambda Access Point and choose your S3 bucket.
- Select the Access Point that you created in the previous section.
- In the Transformation Configuration section, select GetObject from the S3 APIs and pick your Redact Lambda function.
- Leave the remaining settings at their defaults and click Create Object Lambda Access Point.
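If you would rather script this step than click through the console, the same configuration can be expressed with boto3's s3control client. This is a sketch under assumptions: the helper names `build_olap_configuration` and `create_olap` are hypothetical, and the ARNs, account ID, and region must be supplied from your own setup.

```python
def build_olap_configuration(supporting_access_point_arn: str, redact_function_arn: str) -> dict:
    """Build the configuration that routes GetObject calls through the Redact function."""
    return {
        "SupportingAccessPoint": supporting_access_point_arn,
        "TransformationConfigurations": [
            {
                "Actions": ["GetObject"],
                "ContentTransformation": {
                    "AwsLambda": {"FunctionArn": redact_function_arn}
                },
            }
        ],
    }


def create_olap(account_id: str, name: str, configuration: dict, region: str) -> dict:
    """Create the Object Lambda Access Point; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the configuration helper works without it

    s3control = boto3.client("s3control", region_name=region)
    return s3control.create_access_point_for_object_lambda(
        AccountId=account_id,
        Name=name,
        Configuration=configuration,
    )
```

Keeping the configuration in a separate function makes it easy to inspect or unit-test before any AWS call is made.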
Copy the ARN of your Object Lambda Access Point and use it to update the corresponding value in the Reader function.
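A Reader function along these lines might look like the sketch below. It is an illustration rather than the article's code: `OLAP_ARN` is a placeholder to replace with the ARN you copied, and `OBJECT_KEY` is a hypothetical file name. The key idea is that the Object Lambda Access Point ARN is passed where a bucket name would normally go.

```python
# Placeholder values: replace with your own ARN and object key.
OLAP_ARN = "arn:aws:s3-object-lambda:us-east-1:111122223333:accesspoint/my-olap"
OBJECT_KEY = "employees.csv"  # hypothetical key


def lambda_handler(event, context):
    # boto3 ships with the Lambda Python runtime.
    import boto3

    s3 = boto3.client("s3")

    # An ordinary GetObject call; passing the Object Lambda Access Point ARN
    # as the bucket routes the request through the Redact function.
    response = s3.get_object(Bucket=OLAP_ARN, Key=OBJECT_KEY)
    body = response["Body"].read().decode("utf-8")

    print(body)  # appears in the function's logs
    return {"statusCode": 200, "body": body}
```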
Testing
Now that everything is set up, you can test the system by running the Reader function. You should be able to see the redacted version of the original CSV.