AWS Certified Machine Learning – Specialty MLS-C01 Exam Study Path

Last updated on August 12, 2024


The AWS Machine Learning — Specialty MLS-C01 Certification is intended for individuals who are responsible for developing data science or applied machine learning projects on the AWS Cloud. This specialty certification is quite different from any other AWS exam. If you already have prior experience with other AWS certifications, you’re probably expecting to be heavily tested on AWS services and how they can be architected to build solutions that can solve different business problems. However, this is not the case in the ML-Specialty certification. Aside from Amazon SageMaker, most of the questions that you’ll encounter have nothing to do with AWS services at all.

The exam covers a wide range of general machine learning concepts. You should have at least a high-level understanding of the different stages of a machine learning project, such as data collection, feature engineering, train-test splitting, choosing the correct algorithm for a specific use case, training, tuning, and deploying a model for inference. The exam also expects you to know the common issues that arise during model training (e.g., overfitting, imbalanced datasets, missing values in the dataset) and the methods to fix them (e.g., regularization or early stopping, oversampling or adding noise to data, data imputation).

Machine learning leans more on math concepts than on software engineering. Although not strictly required, a background in statistics or college math (linear algebra, differential calculus) will help you understand how an algorithm works behind the scenes. It would also be best to gain hands-on experience first by building simple models. This will allow you to learn quickly and get used to the jargon in machine learning.

We recommend checking out the following materials

STUDY MATERIALS FOR THE MLS-C01 SPECIALTY EXAM

We also recommend taking this free and highly interactive AWS Exam Readiness digital course for the AWS Certified Machine Learning Specialty MLS-C01 exam:

Other helpful materials

MLS-C01 RELATED AWS SERVICES TO FOCUS ON

Data Engineering

AWS Services

Concepts

Data ingestion techniques (Batch and Stream processing)
Data cleaning
ETL Pipeline
Building a data lake on Amazon S3
Data storage options available for training with Amazon SageMaker
Amazon S3 lifecycle configuration
Amazon S3 data storage options

Exploratory Data Analysis

AWS Services

Concepts

Data Cleaning
Data labeling (for supervised models)
Using RecordIO protobuf format to leverage SageMaker’s Pipe mode for training
Data Visualization and Analysis
- Scatter plot
- Box plots
- Confusion matrix
Feature Engineering
- Normalization
- Scaling
- Data imputation techniques for filling missing values
- Oversampling/Undersampling methods to fix an imbalanced dataset
- Regularization
- Dimensionality Reduction
  - Principal Component Analysis (PCA)
  - t-Distributed Stochastic Neighbor Embedding (t-SNE)
- One-hot encoding
- Label encoding
- Binning
- Test-train splitting with randomization
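
Several of the feature engineering concepts above can be illustrated with a short, standard-library-only Python sketch. The helper names and the toy data are hypothetical and exist only for illustration; in practice you would more likely use a library such as pandas or scikit-learn.

```python
import random
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Encode each label as a 0/1 vector with one slot per category."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows, then split it into train and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: ages with one missing value, plus a color category.
ages = impute_mean([20, None, 40, 60])    # the missing value is filled with 40
scaled_ages = min_max_scale(ages)         # [0.0, 0.5, 0.5, 1.0]
colors = one_hot(["red", "green", "red", "blue"])  # slots: blue, green, red
rows = list(zip(scaled_ages, colors))
train, test = train_test_split(rows, test_fraction=0.25)
```

The split shuffles before slicing, which is the "randomization" part: without it, data that is sorted by class or by time would produce unrepresentative training and test sets.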

Modeling

AWS Services

Amazon SageMaker built-in algorithms

Linear regression
Logistic regression
K-means clustering
Principal component analysis (PCA)
Factorization machines
Neural topic modeling
Latent Dirichlet allocation
XGBoost
Sequence-to-sequence
Time-series forecasting
BlazingText
Object detection
Image classification
Semantic segmentation

Concepts:

Automated hyperparameter tuning
Supervised, unsupervised, and reinforcement learning
Managed Spot Training
Deep Learning
- Convolutional Neural Network (CNN), Recurrent Neural Networks (RNN)
- Weights and biases
- Activation functions
  - Softmax
  - Rectified Linear Unit (ReLU)
  - Tanh
- Network layers (flatten layer, convolutional layer, pooling layer, output layer)
- Dropout regularization
- Model pruning
Solving overfitting and underfitting problems
Training SageMaker models in local mode
Early Stopping
Metrics for confusion matrix (true positives, false positives, false negatives, true negatives)
Model evaluation
- ROC / AUC
- F1 Score
- Precision
- Accuracy
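
The confusion matrix counts and the evaluation metrics listed above are related by simple formulas. The following is a minimal sketch in plain Python; the function names and the sample labels are hypothetical, and recall is included because the F1 score is defined in terms of it.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)   # (tp, fp, fn, tn)
metrics = classification_metrics(*counts)
```

In this toy example the accuracy is noticeably higher than the recall, which shows why accuracy alone can hide a model that misses many positive cases.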

Machine Learning Implementation and Operations

AWS Services

Concepts:

Real-time and batch inference
Monitoring model metrics using CloudWatch
Monitoring SageMaker API logs using CloudTrail

Using Amazon Augmented AI (A2I) to involve human reviewers in a machine learning workflow
Multi-model endpoints
Encrypting data with AWS KMS
Lifecycle configuration script
Optimizing models for edge devices using SageMaker Neo

MLS-C01 Common Exam Scenarios

Scenario	Solution
MLS-C01 Domain 1: Data Engineering
A company wants to automatically convert streaming JSON data into Apache Parquet before storing it in an S3 bucket	Use Amazon Data Firehose (formerly Amazon Kinesis Data Firehose)
A company uses Amazon EMR for its ETL processes. The company is looking for an alternative with a lower operational overhead	Run the ETL jobs using AWS Glue
Which service should you use to deliver streaming data from Amazon MSK to a Redshift cluster with low latency?	Redshift Streaming Ingestion
A data engineer is building a pipeline for streaming data. The data will be fetched from various sources.	Create an application that uses Kinesis Producer Library (KPL) to load streaming data from various sources into a Kinesis Data stream.
A company wants to set up a data lake on Amazon S3. The data will be sourced from S3 buckets located in different AWS accounts. Which service can simplify the implementation of the data lake?	AWS Lake Formation
MLS-C01 Domain 2: Exploratory Data Analysis
An image classifier is getting high accuracy on the validation dataset. However, the accuracy significantly dropped when tested against real data. How can you improve the model’s performance?	Take existing images from the training data. Apply data augmentation techniques (e.g., flipping, rotating, adjusting brightness) to the images and add them to the training data. Retrain the model.
What methods can a machine learning engineer use to reduce the size of a large dataset while retaining only relevant features?	1. Principal Component Analysis (PCA) 2. t-Distributed Stochastic Neighbor Embedding (t-SNE)
A dataset contains a mixture of categorical and numerical features. What feature engineering method should be done to prepare the data for training?	One-hot encoding
X and Y variables have a correlation coefficient of -0.98. What does it indicate?	Very strong negative correlation
A machine learning engineer handles a small dataset with missing values. What should they do to ensure no data points are lost?	Use imputation techniques to fill in missing values
MLS-C01 Domain 3: Modeling
An ML engineer wants to evaluate the performance of a binary classification model visually. What visualization technique should be used?	Confusion matrix
An ML engineer wants to discover topics available within a large text dataset. Which algorithm should the engineer train the model on?	Latent Dirichlet Allocation (LDA) algorithm
A SageMaker Object2Vec model is overfitting on a validation dataset. How do you solve this problem?	Use regularization; in this case, adjust the value of the dropout parameter.
A neural network model is being trained using a large dataset in batches. As the training progresses, the loss function begins to oscillate. Which could be the cause?	The learning rate is too high
What SageMaker built-in algorithm is suitable for predicting click-through rate (CTR) patterns?	Factorization machines
MLS-C01 Domain 4: Machine Learning Implementation and Operations
An ML engineer wants to auto-scale the instances behind a SageMaker endpoint according to the volume of incoming requests. Which metric should this scaling be based on?	`InvocationsPerInstance`
Which AWS service can you use to convert audio into text?	Amazon Transcribe
An ML engineer is training a model on a cluster of SageMaker instances. The traffic between the instances must be encrypted.	Enable inter-container traffic encryption
A company wants to use Amazon SageMaker to deploy various ML models in a cost-effective way.	Use a multi-model endpoint
What AWS service can help you build an AI-powered chatbot that can interact with customers?	Amazon Lex
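
As an illustration of the endpoint auto scaling scenario in Domain 4, the sketch below builds the request parameters for a target tracking policy based on invocations per instance. It assumes the keyword arguments accepted by the boto3 Application Auto Scaling client (`register_scalable_target` and `put_scaling_policy`) and the predefined metric type `SageMakerVariantInvocationsPerInstance`; check these against the current AWS documentation before relying on them. The endpoint name, variant name, policy name, and target value are hypothetical, and no AWS call is made.

```python
def endpoint_scaling_requests(endpoint_name, variant_name, target_invocations,
                              min_instances=1, max_instances=4):
    """Build (but do not send) the two Application Auto Scaling requests."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    # Assumed kwargs for register_scalable_target: defines the scaling range.
    register_kwargs = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }

    # Assumed kwargs for put_scaling_policy: track invocations per instance.
    policy_kwargs = {
        "PolicyName": f"{endpoint_name}-invocations-target",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
        },
    }
    return register_kwargs, policy_kwargs


# Hypothetical endpoint and variant names.
register_kwargs, policy_kwargs = endpoint_scaling_requests(
    "demo-endpoint", "AllTraffic", target_invocations=70)
```

With real credentials, these dictionaries would be passed to a client created with `boto3.client("application-autoscaling")`; that step is left out so the sketch runs without AWS access.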

Validate Your Knowledge For Your MLS-C01 Exam

For high-quality practice exams, you can use our AWS Certified Machine Learning Specialty MLS-C01 Practice Exams. These practice tests will help you boost your preparedness for the real exam. They contain multiple sets of questions that cover almost every area that you can expect from the real certification exam. We have also included detailed explanations and reference links to help you understand why the correct option is better than the rest. This is the value that you will get from our course. Practice exams are a great way to determine which areas you are weak in, and they will also highlight important information that you might have missed during your review.

Sample Practice Test Questions for MLS-C01:

Question 1

A trucking company wants to improve situational awareness for its operations team. Each truck has a GPS device installed to monitor its location.

The company requires the data to be stored in Amazon Redshift to conduct near real-time analytics, which will then be used to generate updated dashboard reports.

Which workflow offers the quickest processing time from ingestion to storage?

1. Use Amazon Kinesis Data Stream to ingest the location data. Load the streaming data into the cluster using Amazon Redshift Streaming ingestion.
2. Use Amazon Managed Streaming for Apache Kafka (MSK) to ingest the location data. Use Amazon Redshift Spectrum to deliver the data in the cluster.
3. Use Amazon Data Firehose to ingest the location data and set the Amazon Redshift cluster as the destination.
4. Use Amazon Data Firehose to ingest the location data. Load the streaming data into the cluster using Amazon Redshift Streaming ingestion.


Correct Answer: 1

The Amazon Redshift Streaming ingestion feature makes it easier to access and analyze data coming from real-time data sources. It simplifies the streaming architecture by providing native integration between Amazon Redshift and the streaming engines in AWS, which are Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK). Streaming data sources like system logs, social media feeds, and IoT streams can continue to push events to the streaming engines, and Amazon Redshift simply becomes another consumer.

Previously, loading data from a stream into Amazon Redshift involved several steps. These included connecting the stream to Amazon Data Firehose and waiting for Data Firehose to stage the data in Amazon S3, using various-sized batches at varying-length buffer intervals. After this, Data Firehose initiated a COPY command to load the data from Amazon S3 to a table in Redshift.

Amazon Redshift Streaming ingestion eliminates all of these extra steps, resulting in faster performance and improved latency.

Hence, the correct answer is: Use Amazon Kinesis Data Stream to ingest the location data. Load the streaming data into the cluster using Amazon Redshift Streaming ingestion.

The option that says: Use Amazon Managed Streaming for Apache Kafka (MSK) to ingest the location data. Use Amazon Redshift Spectrum to deliver the data in the cluster is incorrect. Redshift Spectrum is a Redshift feature that allows you to query data in Amazon S3 without loading it into Redshift tables. Redshift Spectrum is not capable of moving data from S3 to Redshift.

The option that says: Use Amazon Data Firehose to ingest the location data and set the Amazon Redshift cluster as the destination is incorrect. While you can configure Redshift as a destination for Amazon Data Firehose, Data Firehose does not actually load the data directly into Redshift. Under the hood, Data Firehose stages the data first in Amazon S3 and copies it into Redshift using the COPY command.

The option that says: Use Amazon Data Firehose to ingest the location data. Load the streaming data into the cluster using Amazon Redshift Streaming ingestion is incorrect. Amazon Data Firehose is not a valid streaming source for Amazon Redshift Streaming ingestion.

References:
https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion.html
https://aws.amazon.com/blogs/big-data/build-near-real-time-logistics-dashboards-using-amazon-redshift-and-amazon-managed-grafana-for-better-operational-intelligence/
https://aws.amazon.com/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/

Check out this Amazon Redshift Cheat Sheet:
https://tutorialsdojo.com/amazon-redshift/

Question 2

A Machine Learning Specialist is training an XGBoost-based model for detecting fraudulent transactions using Amazon SageMaker. The training data contains 5,000 fraudulent behaviors and 500,000 non-fraudulent behaviors. The model reaches an accuracy of 99.5% during training.

When tested on the validation dataset, the model shows an accuracy of 99.1% but delivers a high false-negative rate of 87.7%. The Specialist needs to bring down the number of false-negative predictions for the model to be acceptable in production.

Which combination of actions must be taken to meet the requirement? (Select TWO.)

1. Increase the model complexity by specifying a larger value for the max_depth hyperparameter.
2. Increase the value of the rate_drop hyperparameter to reduce the overfitting of the model.
3. Adjust the balance of positive and negative weights by configuring the scale_pos_weight hyperparameter.
4. Alter the value of the eval_metric hyperparameter to MAP (Mean Average Precision).
5. Alter the value of the eval_metric hyperparameter to Area Under The Curve (AUC).


Correct Answer: 3, 5

Since the fraud detection model is a binary classifier trained on heavily imbalanced data, accuracy alone is misleading, and we should evaluate it using the Area Under the Curve (AUC) metric. The AUC metric measures how well a binary classification model separates the two classes as its discrimination threshold is varied.

The scale_pos_weight hyperparameter controls the balance of positive and negative weights, which is useful for imbalanced classes. In the scenario, the high false-negative rate (FNR) is likely caused by the heavily imbalanced dataset. Adjusting scale_pos_weight gives the minority (fraudulent) class more weight during training, which can reduce the number of false negatives.
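
The numbers in this scenario can be worked through in a few lines of Python. This is a minimal sketch: the helper names are hypothetical, the ratio of negative to positive instances is the typical starting value suggested in the XGBoost parameter documentation rather than a guaranteed optimum, and the hyperparameter dictionary is deliberately partial.

```python
def majority_class_accuracy(n_positive, n_negative):
    """Accuracy of a trivial model that always predicts the majority class."""
    return max(n_positive, n_negative) / (n_positive + n_negative)


def suggested_scale_pos_weight(n_positive, n_negative):
    """Typical starting value: negative instances divided by positive instances."""
    return n_negative / n_positive


# Class counts from the question: 5,000 fraudulent, 500,000 non-fraudulent.
n_fraud, n_legit = 5_000, 500_000

# A model that never flags fraud would still be about 99% accurate,
# which is why high accuracy says little about the false-negative rate here.
baseline = majority_class_accuracy(n_fraud, n_legit)

weight = suggested_scale_pos_weight(n_fraud, n_legit)

# Partial, illustrative set of XGBoost hyperparameters (not a full training config).
hyperparameters = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": weight,
}
```

The baseline shows that the reported 99.1% validation accuracy is barely better than always predicting "not fraud", which is consistent with the high false-negative rate in the question.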

Hence, the correct answers are:

– Alter the value of the eval_metric hyperparameter to Area Under The Curve (AUC).

– Adjust the balance of positive and negative weights by configuring the scale_pos_weight hyperparameter.

The option that says: Increase the model complexity by specifying a larger value for the max_depth hyperparameter is incorrect. There’s no need to increase the model complexity because the model already performs comparably on both the training and validation datasets.

The option that says: Increase the value of the rate_drop hyperparameter to reduce the overfitting of the model is incorrect because the training accuracy (99.5%) and the validation accuracy (99.1%) are close to each other, which indicates that the model is not overfitting.

The option that says: Alter the value of the eval_metric hyperparameter to MAP (Mean Average Precision) is incorrect because this metric is only useful for evaluating ranking algorithms.

References:
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters


Final Remarks

Machine learning plays a major role in almost all industries. It provides numerous business benefits, such as forecasting sales, predicting medical diagnoses, and simplifying time-consuming data entry tasks. With the proliferation of machine learning and AI applications, it’s not difficult to see how it will impact job demand in the market. The need for machine learning talent to build efficient and effective models at scale will likely continue growing for years to come. Pairing your skills with the AWS Certified Machine Learning Specialty certification can help your resume stand out and boost your earning potential.

We hope that our guide has helped you achieve that goal, and we would love to hear about your exam experience. We wish you the best of results.

Written by: Jon Bonso

Jon Bonso is the co-founder of Tutorials Dojo, an EdTech startup and an AWS Digital Training Partner that provides high-quality educational materials in the cloud computing space. He graduated from Mapúa Institute of Technology in 2007 with a bachelor's degree in Information Technology. Jon holds 10 AWS Certifications and has been an active AWS Community Builder since 2020.
