Amazon SageMaker Cheat Sheet

Last updated on December 11, 2024

Amazon SageMaker AI Cheat Sheet

A fully managed service that allows data scientists and developers to easily build, train, and deploy machine learning models at scale.
Provides built-in algorithms that you can immediately use for model training.
Also supports custom algorithms through Docker containers.
Supports one-click model deployment.

Concepts

Hyperparameters
- A set of variables that control how a model is trained.
- You can think of them as “volume knobs” that you tune to achieve your model’s objective.
Automatic Model Tuning
- Finds the best version of a model by running multiple training jobs within the hyperparameter ranges that you specify.
Training
- The process where you create a machine learning model.
Inference
- The process of using the trained model to make predictions.
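
To make the idea of tuning within specified ranges concrete, here is a minimal sketch of random search in plain Python. It is not SageMaker's implementation; the objective function, the hyperparameter names (`learning_rate`, `l2`), and the ranges are all hypothetical stand-ins for real training jobs.

```python
import random

def random_search(objective, ranges, n_trials=20, seed=0):
    """Try n_trials random settings inside `ranges` and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        # Sample every hyperparameter uniformly within its allowed range.
        params = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        score = objective(params)  # stands in for one training job's validation loss
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical objective: pretend the loss is lowest at learning_rate=0.1, l2=0.01.
def toy_validation_loss(params):
    return (params["learning_rate"] - 0.1) ** 2 + (params["l2"] - 0.01) ** 2

ranges = {"learning_rate": (0.001, 0.5), "l2": (0.0, 0.1)}
best_params, best_score = random_search(toy_validation_loss, ranges, n_trials=50)
print(best_params, best_score)
```

Each call to `objective` plays the role of one training job, and the ranges are the limits the tuner is not allowed to leave.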

Local Mode
- Allows you to train and deploy estimators on your local machine for testing.
- You must install the Amazon SageMaker Python SDK on your local environment to use local mode.

Common Training Data Formats For Built-in Algorithms

CSV
Protobuf RecordIO
JSON
Libsvm
JPEG
PNG
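
As a small illustration of how two of these text formats relate, the sketch below converts a CSV row into a LIBSVM line. It assumes the label is in the first column and that there is no header row, and it drops zero-valued features because LIBSVM is a sparse format. The function name and sample rows are hypothetical.

```python
def csv_row_to_libsvm(row):
    """Convert 'label,f1,f2,...' into 'label 1:f1 2:f2 ...', skipping zero features."""
    label, *features = row.strip().split(",")
    # LIBSVM feature indices are written here as 1-based "index:value" pairs.
    pairs = [
        f"{index}:{value}"
        for index, value in enumerate(features, start=1)
        if float(value) != 0.0
    ]
    return " ".join([label, *pairs])

rows = ["1,0.5,0,2.0", "0,0,0,1.5"]  # made-up sample data
libsvm_lines = [csv_row_to_libsvm(r) for r in rows]
print(libsvm_lines)
```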

Input modes for transferring training data

File mode
- Downloads data into the SageMaker instance volume before model training commences.
- Slower than Pipe mode.
- Used for incremental training.
Pipe mode
- Directly stream data from Amazon S3 into the training algorithm container.
- There’s no need to procure large volumes to store large datasets.
- Provides shorter startup and training times.
- Provides higher I/O throughput.
- Faster than File mode.
- To take advantage of Pipe mode, your training data must be in protobuf RecordIO format.
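
The input mode is chosen when the training job is created. The sketch below only builds the request dictionary and does not call AWS. The field names follow the SageMaker `CreateTrainingJob` API as exposed by boto3, while the job name, image URI placeholder, role ARN, bucket paths, and instance settings are hypothetical values chosen for illustration.

```python
def build_training_request(job_name, image_uri, role_arn,
                           s3_train_uri, s3_output_uri, input_mode="Pipe"):
    """Build keyword arguments for a CreateTrainingJob call (not sent anywhere)."""
    if input_mode not in ("File", "Pipe"):
        raise ValueError("this sketch only covers the File and Pipe input modes")
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            # "Pipe" streams from S3; "File" downloads the data first.
            "TrainingInputMode": input_mode,
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_train_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
            "ContentType": "application/x-recordio-protobuf",
        }],
        "OutputDataConfig": {"S3OutputPath": s3_output_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # hypothetical choice
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

request = build_training_request(
    job_name="demo-pipe-job",                                     # hypothetical
    image_uri="<algorithm-image-uri>",                            # placeholder
    role_arn="arn:aws:iam::123456789012:role/DemoSageMakerRole",  # hypothetical
    s3_train_uri="s3://demo-bucket/train/",                       # hypothetical
    s3_output_uri="s3://demo-bucket/output/",                     # hypothetical
)
print(request["AlgorithmSpecification"]["TrainingInputMode"])
```

With credentials configured, a dictionary like this would typically be passed as `boto3.client("sagemaker").create_training_job(**request)`.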

Two methods of deploying a model for inference

Amazon SageMaker Hosting Services
- Provides a persistent HTTPS endpoint for getting predictions one at a time.
- Suited for web applications that need sub-second latency response.
Amazon SageMaker Batch Transform
- Doesn’t need a persistent endpoint.
- Gets inferences for an entire dataset.
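
The difference between the two methods shows up in the shape of the requests. The sketch below builds both request dictionaries without calling AWS. Field names follow the boto3 `invoke_endpoint` (client `sagemaker-runtime`) and `create_transform_job` (client `sagemaker`) operations. The endpoint, model, and bucket names and the instance type are hypothetical.

```python
def build_realtime_request(endpoint_name, csv_row):
    """One prediction at a time against a persistent endpoint."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "text/csv",
        "Body": csv_row.encode("utf-8"),
    }

def build_batch_request(job_name, model_name, s3_input_uri, s3_output_uri):
    """Predictions for a whole dataset in S3; no endpoint is involved."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": s3_input_uri}},
            "ContentType": "text/csv",
            "SplitType": "Line",  # treat each line of the input files as one record
        },
        "TransformOutput": {"S3OutputPath": s3_output_uri},
        "TransformResources": {"InstanceType": "ml.m5.large", "InstanceCount": 1},
    }

realtime = build_realtime_request("demo-endpoint", "5.1,3.5,1.4,0.2")
batch = build_batch_request(
    "demo-batch-job", "demo-model",
    "s3://demo-bucket/batch-input/", "s3://demo-bucket/batch-output/",
)
print(sorted(realtime), sorted(batch))
```

Note that the real-time request carries the payload itself, while the batch request only points at S3 locations and the compute to use.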

SageMaker features

SageMaker Autopilot – automates the process of building, tuning, and deploying machine learning models based on a tabular dataset (CSV or Parquet). It automatically explores different solutions to find the best model.
SageMaker Ground Truth – a data labeling service that lets you use human annotators from your own private workforce, Amazon Mechanical Turk, or third-party vendors.
SageMaker Data Wrangler – a visual data preparation and cleaning tool that allows data scientists and engineers to easily clean and prepare data for machine learning.
SageMaker Neo – allows you to optimize machine learning models for deployment on edge devices to run faster with no loss in accuracy.
SageMaker Automatic Model Tuning – automates the process of hyperparameter tuning based on the algorithm and hyperparameter ranges you specify. This can result in saving a significant amount of time for data scientists and engineers.
Amazon SageMaker Debugger – provides real-time insights into the training process of machine learning models, enabling rapid iteration. It allows you to monitor and debug training issues, optimize model performance, and improve accuracy by analyzing various model-related metrics, such as weights, gradients, and biases.
Managed Spot Training – allows data scientists and engineers to save up to 90% on the cost of training machine learning models by using spare compute capacity.
Distributed Training – allows for splitting the data and distributing the workload across multiple instances, improving speed and performance. It supports various distributed training frameworks such as TensorFlow, PyTorch, and MXNet.
SageMaker Studio – A web-based IDE for machine learning. It provides tools for the entire ML lifecycle, including data wrangling, model training, and deployment, all in one unified interface. Helps data scientists and developers quickly build and train models and streamline ML workflows.
SageMaker Notebooks – A fully managed, scalable Jupyter notebook for quick data exploration, model building, and training. It helps you start working on ML models immediately without managing infrastructure.
SageMaker Distributed Data Parallelism (SMDDP) – A feature that enables efficient distributed training of deep learning models by automatically parallelizing data across multiple GPUs and instances. It speeds up the training of large models on massive datasets, improving scalability and reducing training time. It supports frameworks like TensorFlow and PyTorch, making it well suited to large-scale deep learning tasks that require intensive computational resources.
SageMaker Pipelines – A fully managed CI/CD service for automating the end-to-end machine learning workflow, including data preprocessing, model training, and deployment. It helps automate and streamline the ML lifecycle, ensuring consistency and efficiency.
SageMaker Model Monitor – Monitors models in production to detect issues such as data drift or model performance degradation. Ensures that models continue to perform accurately after deployment.
SageMaker Model Registry – A centralized repository for managing ML models, including tracking versions and promoting models for deployment. Ensures proper model version control and governance across teams.
SageMaker Edge Manager – offers model management for edge devices, enabling you to optimize, secure, monitor, and manage machine learning models on various edge device fleets, including smart cameras, robots, PCs, and mobile devices.
SageMaker Feature Store – a fully managed repository designed to store, share, and manage features for machine learning models. It ensures high-quality, standardized features are available for both training and real-time inference, helping teams keep their feature data synchronized and consistent.
SageMaker JumpStart – provides pre-trained foundation models and ready-to-use solutions for common machine learning tasks like text summarization, image generation, and object detection, enabling users to quickly deploy and experiment without deep expertise.
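
As an example of how one of these features is switched on, the sketch below adds Managed Spot Training settings to a training job request. The field names (`EnableManagedSpotTraining`, `StoppingCondition`, `CheckpointConfig`) come from the `CreateTrainingJob` API. The helper function, the base request, and the S3 path are hypothetical, and nothing is sent to AWS.

```python
def add_spot_settings(request, max_runtime_s, max_wait_s, checkpoint_s3_uri):
    """Return a copy of a CreateTrainingJob request with spot training enabled."""
    # The wait limit covers time spent waiting for spot capacity plus the run itself,
    # so it must not be shorter than the runtime limit.
    if max_wait_s < max_runtime_s:
        raise ValueError("MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds")
    updated = dict(request)
    updated["EnableManagedSpotTraining"] = True
    updated["StoppingCondition"] = {
        "MaxRuntimeInSeconds": max_runtime_s,
        "MaxWaitTimeInSeconds": max_wait_s,
    }
    # Checkpoints let an interrupted job resume instead of starting over.
    updated["CheckpointConfig"] = {"S3Uri": checkpoint_s3_uri}
    return updated

base_request = {"TrainingJobName": "demo-spot-job"}  # hypothetical, other fields omitted
spot_request = add_spot_settings(
    base_request,
    max_runtime_s=3600,
    max_wait_s=7200,
    checkpoint_s3_uri="s3://demo-bucket/checkpoints/",  # hypothetical
)
print(spot_request)
```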

Optimization

Convert training data into a protobuf RecordIO format to make use of Pipe mode.
Use Amazon FSx for Lustre to accelerate File mode training jobs.
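
For the FSx for Lustre option, the training channel points at a file system rather than an S3 prefix. The sketch below builds such a channel using the `FileSystemDataSource` fields of the `CreateTrainingJob` API. The file system ID and directory path are hypothetical, and a real job would generally also need a VPC configuration that can reach the file system.

```python
def build_fsx_channel(file_system_id, directory_path, channel_name="train"):
    """Build one InputDataConfig entry that reads from FSx for Lustre in File mode."""
    return {
        "ChannelName": channel_name,
        "DataSource": {"FileSystemDataSource": {
            "FileSystemId": file_system_id,
            "FileSystemType": "FSxLustre",
            "FileSystemAccessMode": "ro",  # read-only is enough for training data
            "DirectoryPath": directory_path,
        }},
        "InputMode": "File",
    }

fsx_channel = build_fsx_channel(
    file_system_id="fs-0123456789abcdef0",  # hypothetical
    directory_path="/fsx/train",            # hypothetical
)
print(fsx_channel)
```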

Amazon SageMaker Monitoring

You can publish SageMaker instance metrics to the CloudWatch dashboard to gain a unified view of CPU utilization, memory utilization, and latency.
You can also send training metrics to the CloudWatch dashboard to monitor model performance in real time.
AWS CloudTrail records SageMaker API calls, which helps you detect unauthorized activity.
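
For custom training containers, training metrics are typically picked up from the job's log output using metric definitions, each a name paired with a regular expression. The sketch below imitates that extraction locally with Python's `re` module. The `Name` and `Regex` keys mirror the `MetricDefinitions` field of the training job API, while the log line format and metric names are hypothetical.

```python
import re

# Each definition names a metric and gives a regex whose first group is the value.
metric_definitions = [
    {"Name": "train:loss", "Regex": r"train_loss=([0-9\.]+)"},
    {"Name": "validation:accuracy", "Regex": r"val_acc=([0-9\.]+)"},
]

def extract_metrics(log_line, definitions):
    """Return {metric name: value} for every definition that matches the line."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found

line = "epoch=3 train_loss=0.412 val_acc=0.87"  # made-up log line
metrics = extract_metrics(line, metric_definitions)
print(metrics)
```

Testing the regular expressions locally like this is a cheap way to catch a pattern that never matches before launching a job.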

Amazon SageMaker Pricing

Building, training, and deploying ML models is billed by the second, with no minimum fees and no upfront commitments.
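
Per-second billing means cost scales with the exact runtime rather than rounding up to whole hours. The sketch below works through the arithmetic. The hourly rate is a hypothetical figure, not an actual SageMaker price.

```python
def training_cost(seconds, hourly_rate_usd, instance_count=1):
    """Cost of a job billed per second at a given hourly instance rate."""
    return round(seconds * instance_count * hourly_rate_usd / 3600, 4)

# A job of 12 minutes 34 seconds on two instances at a made-up $0.23 per hour.
cost = training_cost(seconds=754, hourly_rate_usd=0.23, instance_count=2)

# The same job if billing rounded up to a full hour per instance.
hourly_rounded_cost = training_cost(seconds=3600, hourly_rate_usd=0.23, instance_count=2)
print(cost, hourly_rounded_cost)
```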

Validate Your Knowledge

Question 1

A Machine Learning Specialist has various CSV training datasets stored in an S3 bucket. Previous models trained with similar training data sizes using the Amazon SageMaker Linear learner algorithm have a slow training process. The Specialist wants to decrease the amount of time spent on training the model.

Which combination of steps should be taken by the Specialist? (Select TWO.)

1. Convert the CSV training dataset into Apache Parquet format.
2. Train the model using Amazon SageMaker Pipe mode.
3. Convert the CSV training dataset into Protobuf RecordIO format.
4. Train the model using Amazon SageMaker File mode.
5. Stream the dataset into Amazon SageMaker using Amazon Kinesis Firehose to train the model.

Correct Answer: 2, 3

Most Amazon SageMaker algorithms work best when you use the optimized protobuf RecordIO data format for training. Using this format allows you to take advantage of Pipe mode. In Pipe mode, your training job streams data directly from Amazon Simple Storage Service (Amazon S3).

Streaming can provide faster start times for training jobs and better throughput. This is in contrast to File mode, in which your data from Amazon S3 is stored on the training instance volumes. File mode uses disk space to store both your final model artifacts and your full training dataset. By streaming your data directly from Amazon S3 in Pipe mode, you can reduce the size of the Amazon Elastic Block Store (EBS) volumes attached to your training instances.

Hence, the correct answers are:

– Convert the CSV training dataset into Protobuf RecordIO format.

– Train the model using Amazon SageMaker Pipe mode.

The option that says: Convert the CSV training dataset into Apache Parquet format is incorrect because Amazon SageMaker’s Pipe mode does not support the Apache Parquet data format.

The option that says: Train the model using Amazon SageMaker File mode is incorrect because File mode is the default input mode for Amazon SageMaker and is slower than Pipe mode.

The option that says: Stream the dataset into Amazon SageMaker using Amazon Kinesis Firehose to train the model is incorrect because you can’t use Amazon Kinesis Firehose in this way. It can’t use Amazon S3 as its data source.

References:
https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/

Question 2

A Machine Learning Specialist is using a 100GB EBS volume as a storage disk for an Amazon SageMaker instance. After running a few training jobs, the Specialist realized that he needed a higher I/O throughput and a shorter job startup and execution time.

Which approach will give the MOST satisfactory result based on the requirements?

1. Store the training dataset in Amazon S3 and use the Pipe input mode for training the model.
2. Increase the size of the EBS volume to obtain higher I/O throughput.
3. Upgrade the SageMaker instance to a larger size.
4. Increase the EBS volume to 500GB and use the File mode for training the model.

Correct Answer: 1

With Pipe input mode, your data is fed on-the-fly into the algorithm container without involving any disk I/O. This approach shortens the lengthy download process and dramatically reduces startup time. It also offers generally better read throughput than File input mode. This is because your data is fetched from Amazon S3 by a highly optimized multi-threaded background process. It also allows you to train on datasets that are much larger than the 16 TB Amazon Elastic Block Store (EBS) volume size limit.

Pipe mode enables the following:

– Shorter startup times because the data is being streamed instead of being downloaded to your training instances.

– Higher I/O throughput due to a high-performance streaming agent.

– Virtually limitless data processing capacity.

With Pipe mode, the startup time is reduced significantly from 11.5 minutes to 1.5 minutes in most experiments. Also, the overall I/O throughput is at least twice as fast as that of File mode. Both of these improvements made a positive impact on the total training time, which is reduced by up to 35%.

Hence, the correct answer is: Store the training dataset in Amazon S3 and use the Pipe input mode for training the model.

The option that says: Increase the size of the EBS volume to obtain higher I/O throughput is incorrect. Even if you set the EBS volume to its maximum throughput, training in Pipe mode would still have a greater impact in terms of reducing the job start-up time and execution time.

The option that says: Upgrade the SageMaker instance to a larger size is incorrect. Upgrading the instance alone won’t have as much effect as running the SageMaker instance in Pipe mode.

The option that says: Increase the EBS volume to 500GB and use the File mode for training the model is incorrect. File mode is the default mode for training a model in Amazon SageMaker. A larger volume would increase the throughput, but it is still not the best answer among the given choices.

References:
https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/
https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
