In this two-part tutorial, we focus on SageMaker Processing and how we can use it to solve our data processing needs. Our overall goal is to demonstrate how to use SageMaker Processing to perform Min-Max scaling on a dataset in its own dedicated and easily scalable environment. If you are looking for Part I, you can find it here.

As mentioned in Part I, we have divided this tutorial into 4 sections:

[1] Synthetic Data Generation (found in PART I)

[2] Using MinMaxScaler without the managed infrastructure support of SageMaker Processing (found in PART I)

[3] Using SageMaker Processing with local mode

[4] Using SageMaker Processing with dedicated ML instances 

 

Once you’ve read and followed the steps in the previous tutorial, you can now proceed with Section III.

 

Section III. Using SageMaker Processing with local mode

 

In this section, we will (1) create a custom script that makes use of MinMaxScaler from scikit-learn, and then (2) use the SageMaker Python SDK to run a SageMaker Processing job using local mode. Run the following steps and blocks of code inside the same Jupyter Notebook as in the previous two sections.

1. Inspect the contents of the original DataFrame df

df

 

This should give us a DataFrame similar to what is shown in the following image:


2. Save the original DataFrame of values to a CSV file

!mkdir -p tmp

df.to_csv("tmp/dataset.input.csv", index=False)

 

In the previous block of code, the first line uses a bash command that creates the tmp directory if it does not exist yet. Note the exclamation point (!) before the bash command, which allows us to execute bash commands directly in a Jupyter cell.
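If you prefer to stay in Python instead of shelling out, the standard library offers the same "create if missing" behavior as mkdir -p (this is just an alternative; the bash command above works fine):

```python
import os

# exist_ok=True mirrors mkdir -p: no error if the directory already exists
os.makedirs("tmp", exist_ok=True)
```

Calling it again on an existing directory is a no-op, so it is safe to re-run the cell.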

3. Create a new file called processing.py. This should be in the same directory where our Jupyter Notebook ipynb file is located. To create this file, simply click the New button and select Text File from the list of dropdown options.

Inside processing.py, add the following lines of code:


import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def main():
    input_path = "/opt/ml/processing/input/dataset.input.csv"
    df = pd.read_csv(input_path)

    scaler = MinMaxScaler()
    df[["x1", "x2", "y"]] = scaler.fit_transform(df[["x1", "x2", "y"]])

    output_path = "/opt/ml/processing/output/output.csv"
    df.to_csv(output_path, index=False)
    print("[DONE]")


if __name__ == "__main__":
    main()

 

This script (1) reads the CSV file stored inside the /opt/ml/processing/input directory into a DataFrame, (2) uses MinMaxScaler to normalize the data stored in the DataFrame, and (3) saves the updated DataFrame as a CSV file inside the /opt/ml/processing/output directory. Later, when the processing job running this custom script has finished, SageMaker automatically uploads the CSV file in the output directory to the specified S3 bucket.
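Under the hood, MinMaxScaler rescales each column to the [0, 1] range using x' = (x − min) / (max − min). A quick plain-Python sketch of the same idea (toy values, not the tutorial's dataset):

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1], the way MinMaxScaler treats one column."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([10, 15, 20]))  # [0.0, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, which is exactly what we will look for when we inspect the output CSV later.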

 

4. Back in our Jupyter Notebook, initialize the SKLearnProcessor object and specify 'local' as the parameter value for instance_type. This enables local mode, which lets us test and verify that our custom script works as expected before using dedicated ML instances.

 

import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

role = get_execution_role()
processor_local = SKLearnProcessor(framework_version='0.20.0',
                                   role=role,
                                   instance_count=1,
                                   instance_type='local')

 

5. Check the container image URI of the processor_local object

processor_local.image_uri

 

This should give us a string value similar to '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3'

 

6. Prepare the ProcessingInput and ProcessingOutput objects containing the configuration which will be used later when running the SageMaker Processing job.

from sagemaker.processing import ProcessingInput, ProcessingOutput

source = 'tmp/dataset.input.csv'

pinput = ProcessingInput(
    source=source,
    destination='/opt/ml/processing/input')

poutput = ProcessingOutput(source='/opt/ml/processing/output')

 

 

7. Run the SageMaker Processing job in local mode

 

processor_local.run(
    code='processing.py',
    inputs=[pinput],
    outputs=[poutput])

 

This should yield a set of logs similar to what is shown in the following image:

What happened here? Since we are using local mode, the container (where our custom script is running) runs inside the local machine. In this case, the local machine is the SageMaker Notebook instance where our Jupyter Notebook is running. After the processing job has been completed, the output file is uploaded to S3.

 

8. Store the location of the output files inside the s3_dest variable

s3_dest = processor_local.latest_job.outputs[0].destination
s3_dest 

 

This should give us a string value similar to s3://sagemaker-us-east-1-1234567890/sagemaker-scikit-learn-2021-10-03-04-03-46-069/output/output-1

 

9. Download the output CSV file using the AWS CLI

!aws s3 cp "{s3_dest}/output.csv" tmp/dataset.output.csv

 

10. Read the contents of the downloaded CSV file using pd.read_csv()

pd.read_csv("tmp/dataset.output.csv")

 

This should give us a DataFrame similar to what is shown in the following image:
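Beyond eyeballing the DataFrame, it helps to verify that every scaled column really spans [0, 1]. Here is a minimal sanity check; the helper name is ours, not part of the SDK, and it assumes the output CSV has been downloaded to tmp/dataset.output.csv as in step 9:

```python
import pandas as pd

def assert_min_max_scaled(df, cols, tol=1e-9):
    """Raise ValueError if any column's values fail to span [0, 1]."""
    for col in cols:
        lo, hi = df[col].min(), df[col].max()
        if not (abs(lo) <= tol and abs(hi - 1) <= tol):
            raise ValueError(f"column {col} spans [{lo}, {hi}], expected [0, 1]")

# Demo on a toy frame that is already min-max scaled; it passes silently:
toy = pd.DataFrame({"x1": [0.0, 0.5, 1.0], "x2": [0.0, 0.25, 1.0]})
assert_min_max_scaled(toy, ["x1", "x2"])

# On the real output:
# df_out = pd.read_csv("tmp/dataset.output.csv")
# assert_min_max_scaled(df_out, ["x1", "x2", "y"])
```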

In the next section, we will use a dedicated ML instance with SageMaker Processing using a similar set of steps.

 

Section IV. Using SageMaker Processing with dedicated ML instances

 

In this section, we will stop using local mode and use SageMaker Processing with a dedicated ML instance instead. Run the following steps and blocks of code inside the same Jupyter Notebook.

 

1. Initialize the SKLearnProcessor object and specify 'ml.m5.xlarge' as the parameter value for instance_type.

 

processor = SKLearnProcessor(framework_version='0.20.0',
                             role=role,
                             instance_count=1,
                             instance_type='ml.m5.xlarge')

 

2. Use the run() method to start the SageMaker Processing job. Note that we are going to reuse the processing.py file, and the pinput and poutput variables from the previous section.

 

%%time
processor.run(
    code='processing.py',
    inputs=[pinput],
    outputs=[poutput])

 

This should give us a set of logs similar to what is shown in the following image:


Since we have specified ‘ml.m5.xlarge’ as the parameter value for instance_type, the processing job will launch a dedicated instance where the container and the custom script will run.

 

3. Download the CSV output using the AWS CLI. Use pd.read_csv() to read the contents of the downloaded CSV file.

s3_dest = processor.latest_job.outputs[0].destination
!aws s3 cp "{s3_dest}/output.csv" tmp/dataset.output.csv
pd.read_csv("tmp/dataset.output.csv")

 

This should give us a DataFrame similar to what is shown in the following image:

With that, we were able to use SageMaker Processing to run a custom script that makes use of MinMaxScaler from scikit-learn. You may decide to generate significantly more records for the dataset and see how easily SageMaker Processing handles larger datasets. If we need a more powerful instance, we can simply change the instance_type parameter value to something like ml.m5.2xlarge. If you are worried about the additional cost of larger instance types, note that costs stay low because (1) the ML instance is deleted right after the processing job completes, and (2) we only pay for the time the ML instance is running (which is usually just a few minutes).

Once you are done, do not forget to turn off (and optionally delete) the running SageMaker Notebook instance. Make sure to delete all resources created in all four sections of this tutorial.

 

What’s next?

If you want to dig deeper into what Amazon SageMaker can do, feel free to check the 762-page book I’ve written here: https://amzn.to/3CCMf0S. Working on the hands-on solutions in this book will make you an advanced ML practitioner using SageMaker in no time.

In this book, you will also find other ways to configure and use SageMaker Processing:

  • Installing and using libraries within the custom Python script
  • Using custom container images with SageMaker Processing
  • Passing arguments from the Jupyter notebook and reading the arguments within the custom script
  • Creating automated workflows with Step Functions and SageMaker Pipelines which utilize SageMaker Processing for the data preparation and transformation step

The book also covers the other features and capabilities of SageMaker, such as SageMaker Clarify, SageMaker Model Monitor, and SageMaker Debugger.

That’s all for now and stay tuned for more!
