Train and Deploy a Scikit-Learn Model in Amazon SageMaker


Introduction

In the ever-evolving world of machine learning (ML), the ability to efficiently train and deploy models is crucial for turning innovative ideas into real-world applications. This is where Amazon SageMaker stands out: a fully managed service that gives every developer and data scientist the ability to build, train, and deploy machine learning models quickly.

Amazon SageMaker streamlines the machine learning workflow, offering a broad set of capabilities that are designed to enable users to focus more on the problem at hand and less on the underlying infrastructure. While SageMaker comes equipped with built-in algorithms and support for various ML frameworks, it notably offers the flexibility to leverage traditional libraries and frameworks. This includes the popular Scikit-Learn library, which is renowned for its simplicity and effectiveness in handling various machine learning tasks.

Overview

In this tutorial, we will demonstrate how to harness the robustness of Scikit-Learn within the SageMaker environment. Although SageMaker provides its own set of built-in models, it gives users the option to work with traditional libraries like Scikit-Learn, offering the best of both worlds – the ease and familiarity of Scikit-Learn combined with the powerful and scalable infrastructure of SageMaker.

We’ll walk you through a straightforward yet comprehensive journey, starting from setting up your environment in SageMaker, moving through training a Scikit-Learn model, to finally deploying it for making predictions. This tutorial is designed as a demonstration aimed at providing a clear and concise guide without diving too deep into the technicalities. It’s perfect for those looking to quickly get up to speed with deploying Scikit-Learn models in SageMaker.

Prerequisites

Before diving into the practical steps of training and deploying a Scikit-Learn model in Amazon SageMaker, it’s essential to ensure that you have everything needed for a smooth and successful journey. This section outlines the key prerequisites required for this tutorial.

  1. Amazon Web Services (AWS) Account

First and foremost, you will need an active AWS account. Amazon SageMaker is a service provided by AWS, and having an account is the gateway to accessing this and other cloud services. If you don’t have an account yet, you can sign up for one on the AWS website. AWS often offers a free tier for new users, which is perfect for getting started without incurring immediate costs.

  2. Basic Understanding of Python

Python is a versatile and widely used programming language in the field of data science and machine learning. For this tutorial, a basic understanding of Python is required, as the code examples and the implementation of the Scikit-Learn model will be in Python. Familiarity with common Python libraries, especially Pandas and NumPy, is also beneficial.

  3. Knowledge of Scikit-Learn

Scikit-Learn is a popular open-source machine learning library for Python. It provides simple and efficient tools for data analysis and modeling. A basic understanding of how to create and use models in Scikit-Learn will help you grasp the concepts covered in this tutorial more easily.

  4. Familiarity with AWS SageMaker

Lastly, a foundational knowledge of AWS SageMaker is recommended. While this tutorial will guide you through using SageMaker for training and deploying models, having prior exposure to its interface and basic functionalities will enhance your learning experience.

With these prerequisites in place, you are well-prepared to embark on the journey of training and deploying a Scikit-Learn model using Amazon SageMaker.

Part 1: Setting Up the Environment

Getting your environment ready is the first step in working with Amazon SageMaker. This section covers the basics of SageMaker, setting up a SageMaker session, understanding the necessary roles and permissions, and preparing your data for model training.

Introduction to SageMaker

Amazon SageMaker is a fully managed service that provides developers and data scientists the tools to build, train, and deploy machine learning models. SageMaker simplifies the machine learning process by automating heavy-lifting tasks such as model tuning, scaling, and deployment. Its key capabilities include:

  • Jupyter Notebook Instances: For easy data exploration and analysis.
  • Built-In Algorithms and Frameworks: Supports popular ML/DL frameworks.
  • Model Training and Tuning: Automated model training and hyperparameter tuning.
  • Model Deployment: Quick and easy deployment of models to production.

Creating a SageMaker Session

To start working with SageMaker, you need to create a session. This session acts as the context in which all your SageMaker operations will be performed. Here’s a simple code snippet to initialize a SageMaker session:

import sagemaker

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name

This code sets up a SageMaker session and identifies the AWS region that the session will communicate with.

Roles and Permissions

To use SageMaker, your AWS account needs the appropriate roles and permissions. These roles grant SageMaker access to AWS resources like S3 buckets. The get_execution_role function fetches the IAM role you set up for your SageMaker session:

role = sagemaker.get_execution_role()

Ensure your role has the necessary permissions for SageMaker operations and accessing S3 resources.
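
Note that get_execution_role() resolves the role automatically only when the code runs inside a SageMaker notebook instance or Studio. If you run the notebook elsewhere, a common fallback is to look up the role ARN through IAM; this is a minimal sketch, and the role name below is a placeholder you would replace with your own:

import boto3
import sagemaker

try:
    # Works inside SageMaker notebook instances and Studio
    role = sagemaker.get_execution_role()
except ValueError:
    # Running outside SageMaker: look up the execution role ARN explicitly.
    # "MySageMakerExecutionRole" is a placeholder role name.
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="MySageMakerExecutionRole")["Role"]["Arn"]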

Data Preparation

Downloading the Iris Dataset

For this tutorial, we’ll use the classic Iris dataset. The dataset can be downloaded from an S3 bucket using the following code snippet:

import boto3
import os

bucket_name = "<your-bucket-name>"
file_key = "datasets/tabular/iris/iris.data"

# Make sure the local target directory exists before downloading
os.makedirs("./data", exist_ok=True)

s3_client = boto3.client("s3")
s3_client.download_file(bucket_name, file_key, "./data/iris.csv")

Replace <your-bucket-name> with your S3 bucket’s name. This code downloads the Iris dataset from an S3 bucket to your local environment.

Preprocessing and Saving the Dataset

Before training, the dataset needs to be preprocessed. This includes mapping categorical values to numerical ones and saving the modified dataset in a format compatible with SageMaker:

import pandas as pd
import numpy as np

df_iris = pd.read_csv("./data/iris.csv", header=None)
df_iris[4] = df_iris[4].map({"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2})
iris = df_iris[[4, 0, 1, 2, 3]].to_numpy()
np.savetxt("./data/iris.csv", iris, delimiter=",", fmt="%1.1f, %1.3f, %1.3f, %1.3f, %1.3f")

This code prepares the Iris dataset by converting string labels to integers and saves it in a format suitable for SageMaker.

With these steps, your environment in Amazon SageMaker is now set up and ready for the next stages of model training and deployment.

Part 2: Writing the Training Script

In this section, we delve into the heart of our machine learning workflow: the training script. The script sklearn_entrypoint.py is a crucial component in the SageMaker training job. It defines how our model is trained and saved, making it a pivotal part of the ML pipeline.

Overview of the Entry Point File

The sklearn_entrypoint.py script serves multiple purposes:

  • Argument Parsing: It interprets command-line arguments passed to it.
  • Data Loading and Preprocessing: It handles the loading and preprocessing of the training data.
  • Model Training: It builds and trains the machine learning model.
  • Model Serialization: It saves the trained model for later use or deployment.

Let’s break down each part of the script.

Argument Parsing

The script uses argparse, a Python standard library for parsing command-line arguments. This allows the script to accept configuration options (like hyperparameters) when the SageMaker training job is initiated. Here’s a snippet:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--max_leaf_nodes', type=int, default=-1)
parser.add_argument("--min_samples_split", type=int, default=2)
parser.add_argument("--min_samples_leaf", type=int, default=1)
# Additional arguments...
args = parser.parse_args()

Each parser.add_argument line defines a different configurable parameter, with default values provided if not specified during runtime.
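
The script also needs to know where SageMaker mounts the training data and where it expects the model artifacts to be written; these locations are conventionally read from environment variables that the SageMaker training container sets. As a sketch of how the arguments used later in the script (args.train and args.model_dir) are typically wired up, these add_argument calls would sit alongside the others, before parse_args() is called:

import os

# Directories injected by the SageMaker training container
parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
parser.add_argument("--output_data_dir", type=str, default=os.environ.get("SM_OUTPUT_DATA_DIR"))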

Data Loading and Preprocessing

The script loads and preprocesses the training data:

import pandas as pd
import os

input_files = [os.path.join(args.train, file) for file in os.listdir(args.train)]
train_data = pd.concat([pd.read_csv(file, header=None, engine="python") for file in input_files])

This code concatenates all CSV files found in the specified training directory, preparing them for the training process.
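
Because the label was moved into the first column during data preparation, the concatenated frame is then split into labels and features before fitting. A minimal sketch of that split, which produces the train_X and train_y used below:

# First column is the label, the remaining columns are the features
train_y = train_data.iloc[:, 0]
train_X = train_data.iloc[:, 1:]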

Model Training

For model training, a Decision Tree Classifier from Scikit-Learn is used:

from sklearn import tree

clf = tree.DecisionTreeClassifier(
    max_leaf_nodes=args.max_leaf_nodes,
    min_samples_split=args.min_samples_split,
    min_samples_leaf=args.min_samples_leaf,
    min_weight_fraction_leaf=args.min_weight_fraction_leaf,
    # Other parameters...
)
clf = clf.fit(train_X, train_y)

The classifier is configured using the parameters parsed earlier and trained on the loaded data.

Cross-validation and Performance Evaluation

The script evaluates the model’s performance using cross-validation:

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(clf, train_X, train_y, cv=5, scoring='accuracy')
mean_cv_accuracy = cv_scores.mean()

This snippet performs 5-fold cross-validation and calculates the mean accuracy, providing an estimate of the model’s performance.
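
For SageMaker to capture this score as a training metric, the script has to write it to the training logs in a format that matches the metric_definitions regex defined in Part 3. A minimal sketch, assuming the "validation:accuracy=..." pattern used later:

# Emit the score so the "validation:accuracy=([0-9\.]+)" regex can capture it from the logs
print(f"validation:accuracy={mean_cv_accuracy}")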

Model Serialization

Finally, the trained model is serialized (saved) using Joblib:

import joblib

joblib.dump(clf, os.path.join(args.model_dir, "model.joblib"))

This step saves the model to a file, making it available for deployment or further analysis.
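
One detail worth noting: when the same entry point is later used for serving, the SageMaker Scikit-Learn container looks for a model_fn function to load the serialized model back into memory. A minimal version of that function:

def model_fn(model_dir):
    """Load the trained model; called by the SageMaker Scikit-Learn serving container."""
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf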

This training script encapsulates the end-to-end process of training a machine learning model in SageMaker, from data preprocessing to model serialization, readying it for deployment.

Part 3: Training the Model in SageMaker

After setting up the environment and preparing the training script, the next step in our journey is to train the model using Amazon SageMaker. This involves creating a SageMaker Estimator for Scikit-Learn and executing the training job.

Creating a SageMaker Estimator

SageMaker’s Estimator is a high-level interface for SageMaker training. It handles the allocation of resources needed for training, such as the type and number of instances. For Scikit-Learn models, SageMaker offers a pre-built SKLearn Estimator. Here’s how to configure it:

from sagemaker.sklearn.estimator import SKLearn

script_path = "sklearn_entrypoint.py"

metric_definitions = [{
    'Name': 'validation:accuracy',
    'Regex': 'validation:accuracy=([0-9\\.]+)'
}]


sklearn = SKLearn(
    entry_point=script_path,
    framework_version="0.20.0",
    metric_definitions=metric_definitions,
    instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session,
    hyperparameters={
        "max_leaf_nodes": 30,
        # Other hyperparameters...
    }
)

In this code snippet, you specify the script to run (script_path), the instance type, the role, and any hyperparameters you wish to set for the training.

Training the Model

With the Estimator set up, you can now train your model. Triggering a training job in SageMaker is as simple as calling the fit method on the Estimator:

train_input = sagemaker_session.upload_data("data", key_prefix="data")

sklearn.fit({"train": train_input})

This code uploads your training data to S3 and then starts the training job with the specified dataset.

Once the training job completes, the training logs and a completion message appear in the notebook output.

Part 4: Hyperparameter Tuning

Machine learning models often require fine-tuning of hyperparameters for optimal performance. SageMaker simplifies this process through its hyperparameter tuning functionality.

Setting Up Hyperparameter Tuning

Hyperparameter tuning in SageMaker involves defining the range of values for each hyperparameter and specifying the metric to optimize. Here’s how to set it up:

from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {
    'max_leaf_nodes': IntegerParameter(20, 100),
    'min_samples_split': IntegerParameter(2, 20),
    # Other parameters...
}

objective_metric_name = 'validation:accuracy'

Note that any hyperparameter you tune must be exposed as an argument in your entry point script.

In this snippet, hyperparameter_ranges defines the ranges within which SageMaker will experiment, and objective_metric_name is the metric that SageMaker will aim to optimize during tuning.
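
The import above also brings in ContinuousParameter, which covers real-valued hyperparameters. For example, assuming the entry point exposes min_weight_fraction_leaf (as passed to the classifier in Part 2), its range could be declared like this:

# Real-valued hyperparameter; Scikit-Learn accepts values between 0.0 and 0.5 for this parameter
hyperparameter_ranges["min_weight_fraction_leaf"] = ContinuousParameter(0.0, 0.5)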

Launching the Tuning Job

To start the tuning job, you create a HyperparameterTuner object and then call its fit method:

tuner = HyperparameterTuner(
    estimator=sklearn,
    objective_metric_name=objective_metric_name,
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    max_jobs=20,
    max_parallel_jobs=3
)

tuner.fit({"train": train_input})

This process automatically runs multiple training jobs with different combinations of hyperparameters, seeking the combination that yields the best result according to the specified metric.

Once the hyperparameter tuning job finishes, the notebook output confirms its completion.

To inspect the best score and best hyperparameters, you can use the following code snippet:

import boto3

# SageMaker client used to query the tuning job results
sagemaker_client = boto3.client("sagemaker")

# Get the name of the tuning job
tuning_job_name = tuner.latest_tuning_job.job_name

# Get details of the tuning job
tuning_job_result = sagemaker_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)

# Extract the best training job
best_training_job = tuning_job_result["BestTrainingJob"]

# Print the name of the best training job
print("Best Training Job Name: ", best_training_job["TrainingJobName"])

# Print the hyperparameters of the best training job
print("Best Hyperparameters: ", best_training_job["TunedHyperParameters"])

# Optionally, get more details about the best training job
best_job_details = sagemaker_client.describe_training_job(
    TrainingJobName=best_training_job["TrainingJobName"]
)

print("Best Model Performance: ", best_training_job["FinalHyperParameterTuningJobObjectiveMetric"])

Running this code snippet prints the details of the best training job, including its name, the tuned hyperparameters, and the final objective metric.

Alternatively, you can inspect the results in the SageMaker console by navigating to the Hyperparameter tuning jobs section.

Through these steps, you have successfully trained a Scikit-Learn model in Amazon SageMaker and fine-tuned its hyperparameters for improved performance.

Part 5: Deploying the Model

Once your model is trained and fine-tuned, the next crucial step is deploying it to make predictions. In SageMaker, this involves creating a model endpoint.

Model Deployment

Deploying a model in Amazon SageMaker is straightforward. First, you deploy the model to a SageMaker endpoint:

predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

This code snippet creates a SageMaker endpoint using the model produced by the best hyperparameter tuning job. The initial_instance_count specifies the number of instances, and instance_type defines the type of machine to use.

Creating an Endpoint

The endpoint created is a live HTTPS URL that can be used to make real-time predictions. This endpoint is the interface through which your application communicates with your model.
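
While the predictor object returned by deploy() is convenient from within the notebook, applications typically call the endpoint through the SageMaker runtime API. A minimal sketch of such a call, assuming the model accepts CSV input (one unlabeled Iris sample below is made up for illustration):

import boto3

runtime = boto3.client("sagemaker-runtime")

# Send one sample as CSV; the endpoint name comes from the deployed predictor
response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType="text/csv",
    Body="5.1, 3.5, 1.4, 0.2",
)
print(response["Body"].read().decode("utf-8"))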

Part 6: Testing the Deployed Model

After deploying the model, it’s important to test it to ensure it’s making accurate predictions.

Preparing Test Data

To test the model, first prepare your test data. Assuming you have a CSV file with test data, you can load it as follows:

import pandas as pd

test_data = pd.read_csv("path/to/test/data.csv")
test_X = test_data.iloc[:, 1:]

Model Prediction

With the test data ready, you can now use the deployed model to make predictions:

predictions = predictor.predict(test_X.values)

This code sends the test data to the model endpoint and receives the predictions.

Evaluating Model Performance

Compare these predictions with the actual values to evaluate the model’s performance:

actual = test_data.iloc[:, 0].values
print("Predictions:", predictions)
print("Actual:", actual)

This comparison gives you a sense of how well your model is performing on unseen data.
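
For a quick quantitative check, you can also compute the accuracy directly, assuming (as above) that the first column of the test file holds the true labels:

from sklearn.metrics import accuracy_score

print("Test accuracy:", accuracy_score(actual, predictions))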

Part 7: Cleanup

After testing your model, it’s important to clean up resources to avoid incurring unnecessary charges.

Deleting the Endpoint

To delete the endpoint, use the following command:

predictor.delete_endpoint()

This command removes the SageMaker endpoint and ensures you are no longer billed for it.
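
If you also want to remove the model resource that was registered during deployment, the SageMaker Python SDK exposes a companion call on the predictor; this is a minimal sketch, and exact cleanup needs can vary by setup:

# Remove the SageMaker model associated with the endpoint
predictor.delete_model()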

Through these steps, you have successfully trained, deployed, tested, and cleaned up a Scikit-Learn model in Amazon SageMaker. This process demonstrates how SageMaker can be an efficient and powerful tool for machine learning workflows.

Final Words

In this tutorial, we’ve navigated through the end-to-end process of training and deploying a Scikit-Learn model using Amazon SageMaker. From setting up the environment, preparing the data, and writing the training script, to training, tuning, deploying, and testing the model, each step demonstrated the power and simplicity of integrating Scikit-Learn with SageMaker.

Key Points

  • Environment Setup: We established a SageMaker session, ensuring all necessary permissions and roles were in place.
  • Training Script: The sklearn_entrypoint.py script was crafted to load data, train a model using Scikit-Learn, and serialize the trained model.
  • Model Training and Tuning: The model was trained, and hyperparameters were fine-tuned in SageMaker, leveraging its managed infrastructure.

  • Deployment and Testing: We deployed the model to a SageMaker endpoint and tested its predictive capabilities, ensuring the model performed as expected.

Next Steps

With the fundamentals covered, the next steps could include:

  • Exploring Advanced Features: Dive deeper into SageMaker’s advanced features such as automatic model tuning and batch transform, or explore different machine learning models.
  • Scalability: Experiment with scaling the model to handle larger datasets or more complex machine learning tasks.
  • Cost Optimization: Learn about managing and optimizing costs related to training and deploying models in SageMaker.

Resources

https://docs.aws.amazon.com/sagemaker/

https://docs.aws.amazon.com/sagemaker/latest/dg/sklearn.html

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html


Written by: John Patrick Laurel

Pats is the Head of Data Science at a European short-stay real estate business group. He boasts a diverse skill set in the realm of data and AI, encompassing Machine Learning Engineering, Data Engineering, and Analytics. Additionally, he serves as a Data Science Mentor at Eskwelabs. Outside of work, he enjoys taking long walks and reading.
