Train and Deploy a Scikit-Learn Model in Amazon SageMaker

Introduction

In the ever-evolving world of machine learning (ML), the ability to efficiently train and deploy models is crucial for turning innovative ideas into real-world applications. This is where Amazon SageMaker stands out: a fully managed service that gives every developer and data scientist the ability to build, train, and deploy machine learning models quickly. SageMaker streamlines the ML workflow with a broad set of capabilities designed to let you focus on the problem at hand rather than on the underlying infrastructure.

While SageMaker comes equipped with built-in algorithms and support for various ML frameworks, it also offers the flexibility to use traditional libraries such as Scikit-Learn, which is renowned for its simplicity and effectiveness across a wide range of machine learning tasks. This combination offers the best of both worlds: the ease and familiarity of Scikit-Learn together with the powerful, scalable infrastructure of SageMaker.
Overview

In this tutorial, we demonstrate how to harness Scikit-Learn within the SageMaker environment. We'll walk through a straightforward yet comprehensive journey: setting up your environment in SageMaker, training a Scikit-Learn model, and finally deploying it to make predictions. The tutorial is designed as a clear and concise demonstration rather than a deep technical dive, making it ideal for anyone looking to get up to speed quickly with deploying Scikit-Learn models in SageMaker.
Prerequisites

Before diving into the practical steps of training and deploying a Scikit-Learn model in Amazon SageMaker, make sure you have everything needed for a smooth and successful journey:

- An active AWS account. Amazon SageMaker is an AWS service, and an account is the gateway to this and other cloud services. If you don't have one yet, you can sign up on the AWS website; AWS often offers a free tier for new users, which is perfect for getting started without incurring immediate costs.
- A basic understanding of Python, since the code examples and the implementation of the Scikit-Learn model are in Python. Familiarity with common Python libraries, especially Pandas and NumPy, is also beneficial.
- A basic understanding of Scikit-Learn, the popular open-source machine learning library for Python that provides simple and efficient tools for data analysis and modeling. Knowing how to create and use models in Scikit-Learn will help you grasp the concepts covered here more easily.
- Foundational knowledge of AWS SageMaker. This tutorial guides you through training and deploying models in SageMaker, but prior exposure to its interface and basic functionality will enhance your learning experience.

With these prerequisites in place, you are well-prepared to train and deploy a Scikit-Learn model using Amazon SageMaker.
Part 1: Setting Up the Environment

Getting your environment ready is the first step in working with Amazon SageMaker. This section covers the basics of SageMaker, setting up a SageMaker session, understanding the necessary roles and permissions, and preparing your data for model training.
Introduction to SageMaker

Amazon SageMaker is a fully managed service that provides developers and data scientists the tools to build, train, and deploy machine learning models. SageMaker simplifies the machine learning process by automating heavy-lifting tasks such as model tuning, scaling, and deployment. Its key capabilities include managed notebook environments, built-in and custom model training, automatic hyperparameter tuning, and deployment to scalable hosted endpoints.
Creating a SageMaker Session

To start working with SageMaker, you need to create a session. This session acts as the context in which all your SageMaker operations will be performed. Here's a simple code snippet to initialize a SageMaker session:
import sagemaker
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name

This code sets up a SageMaker session and identifies the AWS region that the session will communicate with.
Roles and Permissions

To use SageMaker, your AWS account needs the appropriate roles and permissions. These roles grant SageMaker access to AWS resources such as S3 buckets. The get_execution_role function fetches the IAM role you set up for your SageMaker session:
role = sagemaker.get_execution_role()

Ensure your role has the necessary permissions for SageMaker operations and for accessing S3 resources.
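If you are unsure whether the role is sufficient, one quick sanity check (an optional addition; it assumes the notebook is allowed to call IAM) is to list the role's attached managed policies and look for a policy such as AmazonSageMakerFullAccess:

import boto3

# The role ARN looks like arn:aws:iam::<account>:role/<name>; take the name part.
iam = boto3.client("iam")
role_name = role.split("/")[-1]
policies = iam.list_attached_role_policies(RoleName=role_name)["AttachedPolicies"]
print([p["PolicyName"] for p in policies])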
Data Preparation
Downloading the Iris Dataset

For this tutorial, we'll use the classic Iris dataset. The dataset can be downloaded from an S3 bucket using the following code snippet:
import boto3
import os

# Make sure the local target directory exists before downloading.
os.makedirs("./data", exist_ok=True)

bucket_name = "<your-bucket-name>"
file_key = "datasets/tabular/iris/iris.data"
s3_client = boto3.client("s3")
s3_client.download_file(bucket_name, file_key, "./data/iris.csv")

Replace <your-bucket-name> with your S3 bucket's name. This code downloads the Iris dataset from the S3 bucket to your local environment.
Preprocessing and Saving the Dataset

Before training, the dataset needs to be preprocessed. This includes mapping categorical values to numerical ones and saving the modified dataset in a format compatible with SageMaker:
import pandas as pd
import numpy as np

df_iris = pd.read_csv("./data/iris.csv", header=None)
# Map the string class labels to integers.
df_iris[4] = df_iris[4].map({"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2})
# Move the label into the first column, as the training script expects.
iris = df_iris[[4, 0, 1, 2, 3]].to_numpy()
np.savetxt("./data/iris.csv", iris, delimiter=",", fmt="%1.1f, %1.3f, %1.3f, %1.3f, %1.3f")

This code prepares the Iris dataset by converting string labels to integers, placing the label in the first column, and saving the result in a format suitable for SageMaker.
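Part 6 later assumes a separate CSV of labeled test rows. The original doesn't show how that file was produced, so here is one hedged way to hold out a split before saving, keeping the test file out of the directory that gets uploaded for training:

from sklearn.model_selection import train_test_split
import numpy as np

# Hold out 20% of rows, stratified by the label in column 0.
train_rows, test_rows = train_test_split(iris, test_size=0.2, random_state=42, stratify=iris[:, 0])
np.savetxt("./data/iris.csv", train_rows, delimiter=",", fmt="%1.1f, %1.3f, %1.3f, %1.3f, %1.3f")
# Keep the test file outside ./data so it is not uploaded with the training channel.
np.savetxt("./iris_test.csv", test_rows, delimiter=",", fmt="%1.1f, %1.3f, %1.3f, %1.3f, %1.3f")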
With these steps, your environment in Amazon SageMaker is now set up and ready for the next stages of model training and deployment.

Part 2: Writing the Training Script

In this section, we delve into the heart of our machine learning workflow: the training script. The script sklearn_entrypoint.py is a crucial component of the SageMaker training job. It defines how our model is trained and saved, making it a pivotal part of the ML pipeline.
Overview of the Entry Point File

The sklearn_entrypoint.py script serves multiple purposes: it parses the hyperparameters passed to the training job, loads the training data, trains the model, evaluates it with cross-validation, and serializes the trained model. Let's break down each part of the script.
Argument Parsing

The script uses argparse, a Python standard library for parsing command-line arguments. This allows the script to accept configuration options (like hyperparameters) when the SageMaker training job is initiated. Here's a snippet:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--max_leaf_nodes", type=int, default=-1)
parser.add_argument("--min_samples_split", type=int, default=2)
parser.add_argument("--min_samples_leaf", type=int, default=1)
# Additional arguments...
args = parser.parse_args()

Each parser.add_argument line defines a different configurable parameter, with default values used if none are specified at runtime.
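Later snippets reference args.train and args.model_dir, which SageMaker script mode conventionally supplies through environment variables; a sketch of the matching arguments (their presence in the original script is implied but not shown):

import os

# These belong with the other parser.add_argument calls, before parse_args().
# SageMaker injects the model and data locations into the training container.
parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAIN"])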
Data Loading and Preprocessing

The script loads and preprocesses the training data:
import pandas as pd
import os

# Read every CSV file SageMaker placed in the training channel directory.
input_files = [os.path.join(args.train, file) for file in os.listdir(args.train)]
train_data = pd.concat([pd.read_csv(file, header=None, engine="python") for file in input_files])

This code concatenates all CSV files found in the specified training directory, preparing them for the training process.
Model Training

For model training, a Decision Tree Classifier from Scikit-Learn is used:
from sklearn import tree

clf = tree.DecisionTreeClassifier(
    max_leaf_nodes=args.max_leaf_nodes,
    min_samples_split=args.min_samples_split,
    min_samples_leaf=args.min_samples_leaf,
    min_weight_fraction_leaf=args.min_weight_fraction_leaf,
    # Other parameters...
)
clf = clf.fit(train_X, train_y)

The classifier is configured using the parameters parsed earlier and trained on the loaded data.
Cross-validation and Performance Evaluation

The script evaluates the model's performance using cross-validation:
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(clf, train_X, train_y, cv=5, scoring='accuracy')
mean_cv_accuracy = cv_scores.mean()

This snippet performs 5-fold cross-validation and calculates the mean accuracy, providing an estimate of the model's performance.
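One detail the snippet leaves implicit: the metric_definitions regex used in Part 3 can only capture this score if the script prints it in the matching format, so the entry point should also emit a line like the following (a small addition implied by the regex, not shown in the original):

# Emit the score so SageMaker's metric regex ("validation:accuracy=...") can capture it.
print("validation:accuracy={}".format(mean_cv_accuracy))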
Model Serialization

Finally, the trained model is serialized (saved) using Joblib:
import joblib
joblib.dump(clf, os.path.join(args.model_dir, "model.joblib"))

This step saves the model to the model directory, making it available for deployment or further analysis.
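One addition worth noting: to serve this model from an endpoint later (Part 5), the SageMaker Scikit-Learn container expects the entry point to define a model_fn that reloads the serialized model; a minimal sketch:

import os

import joblib

def model_fn(model_dir):
    # Called by the SageMaker Scikit-Learn serving container at inference time.
    return joblib.load(os.path.join(model_dir, "model.joblib"))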
This training script encapsulates the end-to-end process of training a machine learning model in SageMaker, from data preprocessing to model serialization, readying it for deployment.

Part 3: Training the Model in SageMaker

After setting up the environment and preparing the training script, the next step in our journey is to train the model using Amazon SageMaker. This involves creating a SageMaker Estimator for Scikit-Learn and executing the training job.
Creating a SageMaker Estimator

SageMaker's Estimator is a high-level interface for SageMaker training. It handles the allocation of resources needed for training, such as the type and number of instances. For Scikit-Learn models, SageMaker offers a pre-built SKLearn Estimator. Here's how to configure it:
from sagemaker.sklearn.estimator import SKLearn

script_path = "sklearn_entrypoint.py"

metric_definitions = [{
    "Name": "validation:accuracy",
    "Regex": "validation:accuracy=([0-9\\.]+)"
}]

sklearn = SKLearn(
    entry_point=script_path,
    framework_version="0.20.0",
    metric_definitions=metric_definitions,
    instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session,
    hyperparameters={
        "max_leaf_nodes": 30,
        # Other hyperparameters...
    }
)

In this code snippet, you specify the script to run (script_path), the instance type, the IAM role, and any hyperparameters you wish to set for the training.
Training the Model

With the Estimator set up, you can now train your model. Triggering a training job in SageMaker is as simple as calling the fit method on the Estimator:
train_input = sagemaker_session.upload_data("data", key_prefix="data")
sklearn.fit({"train": train_input})

This code uploads your training data to S3 and then starts the training job with the specified dataset. Once the job completes, the training logs and its final status appear in your notebook output.
Part 4: Hyperparameter Tuning

Machine learning models often require fine-tuning of hyperparameters for optimal performance. SageMaker simplifies this process through its hyperparameter tuning functionality.
Setting Up Hyperparameter Tuning

Hyperparameter tuning in SageMaker involves defining the range of values for each hyperparameter and specifying the metric to optimize. Here's how to set it up:
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {
    "max_leaf_nodes": IntegerParameter(20, 100),
    "min_samples_split": IntegerParameter(2, 20),
    # Other parameters...
}
objective_metric_name = "validation:accuracy"

Here, hyperparameter_ranges defines the ranges within which SageMaker will experiment, and objective_metric_name is the metric SageMaker will aim to optimize during the tuning. Note that the hyperparameters you tune must be accepted by your entry point file.
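The import above also brings in ContinuousParameter, for float-valued ranges. As an illustrative addition (not one of the original ranges), you could tune min_weight_fraction_leaf, which the entry point already parses:

# A float-valued range for a parameter the entry point script accepts.
hyperparameter_ranges["min_weight_fraction_leaf"] = ContinuousParameter(0.0, 0.5)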
Launching the Tuning Job

To start the tuning job, you create a HyperparameterTuner object and then call its fit method:
tuner = HyperparameterTuner(
    estimator=sklearn,
    objective_metric_name=objective_metric_name,
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=metric_definitions,
    max_jobs=20,
    max_parallel_jobs=3
)
tuner.fit({"train": train_input})

This process automatically runs multiple training jobs with different combinations of hyperparameters, seeking the combination that yields the best result according to the specified metric. Once the tuning job has finished, you can inspect the best score and best hyperparameters with the following code snippet:
import boto3

sagemaker_client = boto3.client("sagemaker")

# Get the name of the tuning job
tuning_job_name = tuner.latest_tuning_job.job_name

# Get details of the tuning job
tuning_job_result = sagemaker_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)

# Extract the best training job
best_training_job = tuning_job_result["BestTrainingJob"]

# Print the name and hyperparameters of the best training job
print("Best Training Job Name:", best_training_job["TrainingJobName"])
print("Best Hyperparameters:", best_training_job["TunedHyperParameters"])

# Optionally, fetch more details about the best training job
best_job_details = sagemaker_client.describe_training_job(
    TrainingJobName=best_training_job["TrainingJobName"]
)

print("Best Model Performance:", best_training_job["FinalHyperParameterTuningJobObjectiveMetric"])

Running this code snippet shows you the details of the best model. Alternatively, you can inspect the results in the SageMaker console by navigating to the Hyperparameter tuning jobs section. Through these steps, you have successfully trained a Scikit-Learn model in Amazon SageMaker and fine-tuned its hyperparameters for improved performance.
Part 5: Deploying the Model

Once your model is trained and fine-tuned, the next crucial step is deploying it to make predictions. In SageMaker, this involves creating a model endpoint.
Model Deployment

Deploying a model in Amazon SageMaker is straightforward. First, you deploy the model to a SageMaker endpoint:
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

This code snippet creates a SageMaker endpoint using the model produced by the best hyperparameter tuning job. The initial_instance_count specifies the number of instances, and instance_type defines the type of machine to use.
Creating an Endpoint

The endpoint created is a live HTTPS URL that can be used to make real-time predictions. This endpoint is the interface through which your application communicates with your model.
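Because the endpoint is a plain HTTPS API, you can also invoke it outside the SageMaker Python SDK; a minimal sketch using the boto3 runtime client, assuming the container's default CSV handling:

import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,  # the endpoint created above
    ContentType="text/csv",
    Body="5.1,3.5,1.4,0.2",  # one unlabeled row of Iris features
)
print(response["Body"].read())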
Part 6: Testing the Deployed Model

After deploying the model, it's important to test it to ensure it's making accurate predictions.
Preparing Test Data

To test the model, first prepare your test data. Assuming you have a CSV file with test data in the same format as the training set (no header row, label in the first column), you can load it as follows:
import pandas as pd

# Like the training data, the test CSV has no header row and the label in column 0.
test_data = pd.read_csv("path/to/test/data.csv", header=None)
test_X = test_data.iloc[:, 1:]
Model Prediction

With the test data ready, you can now use the deployed model to make predictions:
predictions = predictor.predict(test_X.values)

This code sends the test data to the model endpoint and receives the predictions.
Evaluating Model Performance

Compare these predictions with the actual values to evaluate the model's performance:
actual = test_data.iloc[:, 0].values
print("Predictions:", predictions)
print("Actual:", actual)
Part 7: Cleanup

After testing your model, it's important to clean up resources to avoid incurring unnecessary charges.
Deleting the Endpoint

To delete the endpoint, use the following command:
predictor.delete_endpoint()

This command removes the SageMaker endpoint and ensures you are no longer billed for it.
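Optionally, you can also remove the model object that deployment registered; deleting the endpoint alone stops the hosting charges, so this is just tidy-up:

# Also remove the SageMaker model created during deployment.
predictor.delete_model()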
Through these steps, you have successfully trained, deployed, tested, and cleaned up a Scikit-Learn model in Amazon SageMaker. This process demonstrates how SageMaker can be an efficient and powerful tool for machine learning workflows.

Final Words

In this tutorial, we've navigated through the end-to-end process of training and deploying a Scikit-Learn model using Amazon SageMaker. From setting up the environment, preparing the data, and writing the training script, to training, tuning, deploying, and testing the model, each step demonstrated the power and simplicity of integrating Scikit-Learn with SageMaker.
Key Points

- Environment Setup: We created a SageMaker session, configured the IAM role, and preprocessed the Iris dataset for training.
- Training Script: We wrote an entry point script that parses hyperparameters, trains a Decision Tree Classifier, evaluates it with cross-validation, and serializes the model.
- Training and Tuning: We trained the model with the SKLearn Estimator and used SageMaker's hyperparameter tuning to find the best-performing configuration.
- Deployment and Testing: We deployed the model to a SageMaker endpoint and tested its predictive capabilities, ensuring the model performed as expected.
- Cleanup: We deleted the endpoint to avoid incurring unnecessary charges.
Next Steps

With the fundamentals covered, the next steps could include experimenting with other Scikit-Learn algorithms, applying this workflow to your own datasets, or exploring additional SageMaker capabilities such as batch transform and SageMaker Pipelines.
Resources

https://docs.aws.amazon.com/sagemaker/
https://docs.aws.amazon.com/sagemaker/latest/dg/sklearn.html
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html