Scalable Data Processing and Transformation using SageMaker Processing (Part 2 of 2)

Last updated on April 28, 2023

In this two-part tutorial, we focus on SageMaker Processing and how we can use it to solve our data processing needs. Our overall goal is to demonstrate how to use SageMaker Processing to perform Min-Max scaling on a dataset in its own dedicated and easily scalable environment. If you are looking for Part 1, you can find it here.

As mentioned in Part I, we have divided this tutorial into 4 sections:

[1] Synthetic Data Generation (found in PART I)

[2] Using MinMaxScaler without the managed infrastructure support of SageMaker Processing (found in PART I)

[3] Using SageMaker Processing with local mode

[4] Using SageMaker Processing with dedicated ML instances 

 

Once you’ve read and followed the steps in the previous tutorial, you can now proceed with Section III.

 

Section III. Using SageMaker Processing with local mode

 

In this section, we will (1) create a custom script that makes use of MinMaxScaler from scikit-learn, and then (2) use the SageMaker Python SDK to run a SageMaker Processing job in local mode. Run the following steps and blocks of code inside the same Jupyter Notebook used in the previous two sections.

1. Inspect the contents of the original DataFrame df

df

 

This should give us a DataFrame similar to what is shown in the following image:


2. Save the original DataFrame of values to a CSV file

!mkdir -p tmp

df.to_csv("tmp/dataset.input.csv", index=False)

 

In the previous block of code, we first used a bash command that creates the tmp directory if it does not exist yet. Note the exclamation point (!) before the command, which allows us to execute bash commands directly in a Jupyter cell.
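If you prefer to keep everything in Python instead of relying on a shell command, the same directory can be created with pathlib (a minimal sketch of an alternative, not a required step):

from pathlib import Path

# Equivalent of mkdir -p: create the tmp directory if it does not exist yet
Path("tmp").mkdir(parents=True, exist_ok=True)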

3. Create a new file called processing.py. This should be in the same directory where our Jupyter Notebook ipynb file is located. To create this file, simply click the New button and select Text File from the list of dropdown options.


Inside processing.py, add the following lines of code:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def main():
    input_path = "/opt/ml/processing/input/dataset.input.csv"
    df = pd.read_csv(input_path)

    scaler = MinMaxScaler()
    df[["x1", "x2", "y"]] = scaler.fit_transform(df[["x1", "x2", "y"]])

    output_path = "/opt/ml/processing/output/output.csv"
    df.to_csv(output_path, index=False)
    print("[DONE]")


if __name__ == "__main__":
    main()

 

This script (1) reads the CSV file stored inside the /opt/ml/processing/input directory into a DataFrame, (2) uses MinMaxScaler to normalize the data stored in the DataFrame, and (3) saves the updated DataFrame as a CSV file inside the /opt/ml/processing/output directory. Later, when the processing job running this custom script has finished, SageMaker automatically uploads the CSV file in the output directory to the specified S3 bucket.
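Note that the steps below do not set an explicit S3 destination for the output, so the SageMaker Python SDK will generally upload the results to the session's default bucket. If you want to check which bucket that is, here is a minimal sketch:

import sagemaker

# The default S3 bucket used by the SageMaker Python SDK for this session
print(sagemaker.Session().default_bucket())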

 

4. Back in our Jupyter Notebook, initialize the SKLearnProcessor object and specify 'local' as the parameter value for instance_type. This allows us to use local mode, which lets us test and verify that our custom script works before using dedicated ML instances.

 

import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor
role = get_execution_role()
processor_local = SKLearnProcessor(framework_version='0.20.0',
                                   role=role,
                                   instance_count=1,
                                   instance_type='local')

 

5. Check the container image URI of the processor_local object

processor_local.image_uri

 

This should give us a string value similar to '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3'

 

6. Prepare the ProcessingInput and ProcessingOutput objects containing the configuration which will be used later when running the SageMaker Processing job.

from sagemaker.processing import ProcessingInput, ProcessingOutput

source = 'tmp/dataset.input.csv'

pinput = ProcessingInput(
    source=source,
    destination='/opt/ml/processing/input')

poutput = ProcessingOutput(source='/opt/ml/processing/output')
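If you want the results uploaded to a specific S3 location instead of the SDK-generated default path, ProcessingOutput also accepts a destination parameter. A minimal sketch (the bucket and prefix below are placeholders, not values from this tutorial):

poutput_explicit = ProcessingOutput(
    source='/opt/ml/processing/output',
    destination='s3://<your-bucket>/sagemaker-processing/output')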

 

 

7. Run the SageMaker Processing job in local mode

 

processor_local.run(
    code='processing.py',
    inputs=[pinput],
    outputs=[poutput] )

 

This should yield a set of logs similar to what is shown in the following image:

What happened here? Since we are using local mode, the container (where our custom script is running) runs inside the local machine. In this case, the local machine is the SageMaker Notebook instance where our Jupyter Notebook is running. After the processing job has been completed, the output file is uploaded to S3.
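Since local mode relies on Docker running on the local machine, you can optionally confirm this yourself (assuming Docker is available, as it is on SageMaker Notebook instances) by listing the pulled images and the recently run containers:

!docker images
!docker ps -a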

 

8. Store the location of the output files inside the s3_dest variable

s3_dest = processor_local.latest_job.outputs[0].destination
s3_dest 

 

This should give us a string value similar to  s3://sagemaker-us-east-1-1234567890/sagemaker-scikit-learn-2021-10-03-04-03-46-069/output/output-1
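Before downloading, you can optionally list the objects under this destination with the AWS CLI (the {s3_dest} expansion works the same way as in the copy command in the next step):

!aws s3 ls "{s3_dest}/"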

 

9. Download the output CSV file using the AWS CLI

!aws s3 cp "{s3_dest}/output.csv" tmp/dataset.output.csv

 

10. Read the contents of the downloaded CSV file using pd.read_csv()

pd.read_csv("tmp/dataset.output.csv")

 

This should give us a DataFrame similar to what is shown in the following image:
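As an optional sanity check (a minimal sketch, not part of the original steps), you can verify that every column of the scaled output now falls within the [0, 1] range:

df_out = pd.read_csv("tmp/dataset.output.csv")

# After Min-Max scaling, each column should have a minimum of 0 and a maximum of 1
print(df_out.min())
print(df_out.max())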

In the next section, we will use a dedicated ML instance with SageMaker Processing using a similar set of steps.

 

Section IV. Using SageMaker Processing with dedicated ML instances

 

In this section, we will stop using local mode and use SageMaker Processing with a dedicated ML instance instead. Run the following steps and blocks of code inside the same Jupyter Notebook.

 

1. Initialize the SKLearnProcessor object and specify 'ml.m5.xlarge' as the parameter value for instance_type.

 

processor = SKLearnProcessor(framework_version='0.20.0',
                             role=role,
                             instance_count=1,
                             instance_type='ml.m5.xlarge')

 

2. Use the run() method to start the SageMaker Processing job. Note that we are going to reuse the processing.py file, and the pinput and poutput variables from the previous section.

 

%%time
processor.run(
    code='processing.py',
    inputs=[pinput],
    outputs=[poutput] )

 


This should give us a set of logs similar to what is shown in the following image:

Since we have specified 'ml.m5.xlarge' as the parameter value for instance_type, the processing job will launch a dedicated instance where the container and the custom script will run.
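If you want to confirm the instance type and the final status of the job programmatically, the latest_job object exposes a describe() method that wraps the DescribeProcessingJob API. A minimal sketch:

details = processor.latest_job.describe()

# Instance type used by the processing job and its final status
print(details["ProcessingResources"]["ClusterConfig"]["InstanceType"])
print(details["ProcessingJobStatus"])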

 

3. Download the CSV output using the AWS CLI. Use pd.read_csv() to read the contents of the downloaded CSV file.

s3_dest = processor.latest_job.outputs[0].destination
!aws s3 cp "{s3_dest}/output.csv" tmp/dataset.output.csv
pd.read_csv("tmp/dataset.output.csv")

 

This should give us a DataFrame similar to what is shown in the following image:

With that, we were able to use SageMaker Processing to run a custom script that makes use of MinMaxScaler from scikit-learn. You may decide to generate significantly more records for the dataset and see how easy it is to handle larger datasets with SageMaker Processing. If we need a more powerful instance, we can simply change the instance_type parameter value to something like ml.m5.2xlarge. If you are worried about the additional cost of larger instance types, note that we can save a lot since (1) the ML instance is deleted right after the processing job is completed, and (2) we only pay for the time the ML instance for the processing job is running (which is usually just a few minutes).
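Here is a minimal sketch of what the larger-instance run could look like (ml.m5.2xlarge is just one example; everything else stays the same):

processor_2xl = SKLearnProcessor(framework_version='0.20.0',
                                 role=role,
                                 instance_count=1,
                                 instance_type='ml.m5.2xlarge')

processor_2xl.run(
    code='processing.py',
    inputs=[pinput],
    outputs=[poutput])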

Once you are done, do not forget to turn off (and optionally delete) the running SageMaker Notebook instance. Make sure to delete all the resources created across the four sections of this tutorial.
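As a rough cleanup sketch (double-check the paths before deleting anything), the local temporary files and the output objects of the most recent processing job can be removed as follows; outputs of earlier jobs and the notebook instance itself still need to be cleaned up separately:

# Remove the local working files and the latest job's output prefix in S3
!rm -rf tmp
!aws s3 rm --recursive "{s3_dest}"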

 

What’s next?

If you want to dig deeper into what Amazon SageMaker can do, feel free to check the 762-page book I’ve written here: https://amzn.to/3CCMf0S. Working on the hands-on solutions in this book will make you an advanced ML practitioner using SageMaker in no time.

In this book, you will also find other ways to configure and use SageMaker Processing:

  • Installing and using libraries within the custom Python script
  • Using custom container images with SageMaker Processing
  • Passing arguments from the Jupyter notebook and reading the arguments within the custom script (see the short sketch after this list)
  • Creating automated workflows with Step Functions and SageMaker Pipelines which utilize SageMaker Processing for the data preparation and transformation step
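To give a rough idea of the passing-arguments item above, here is a minimal sketch (it assumes the run() method's arguments parameter together with argparse inside the script, and is not the book's exact recipe):

# In the notebook:
processor.run(code='processing.py',
              inputs=[pinput],
              outputs=[poutput],
              arguments=['--columns', 'x1,x2,y'])

# Inside processing.py:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--columns', type=str, default='x1,x2,y')
args, _ = parser.parse_known_args()
columns = args.columns.split(',')  # e.g. ['x1', 'x2', 'y']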

You will also find coverage of the other features and capabilities of SageMaker, such as SageMaker Clarify, SageMaker Model Monitor, and SageMaker Debugger.

That’s all for now and stay tuned for more!


Written by: Joshua Arvin Lat

[Guest Post] Joshua Arvin Lat is the Chief Technology Officer (CTO) of NuWorks Interactive Labs, Inc. He previously served as the CTO of three Australian-owned companies and as the Director for Software Development and Engineering for multiple e-commerce startups, which helped him become more effective as a leader. Years ago, he and his team won 1st place in a global cybersecurity competition with their published research paper. He is an AWS Machine Learning Hero and has shared his knowledge at several international conferences, discussing practical strategies on machine learning, engineering, security, and management. He is also the author of the book “Machine Learning with Amazon SageMaker Cookbook: 80 proven recipes for data scientists and developers to perform machine learning experiments and deployments”.
