Amazon SageMaker Data Wrangler

Last updated on October 3, 2024

Amazon SageMaker Data Wrangler Cheat Sheet

Amazon SageMaker Data Wrangler streamlines data preparation and feature engineering for machine learning.
Amazon SageMaker Data Wrangler is a feature in Amazon SageMaker Studio Classic.
It integrates data from various sources, allows you to explore, clean, transform, and visualize data, and automates these steps in your machine-learning workflow.

Amazon SageMaker Data Wrangler Core Functionalities

Data Wrangler provides core functionalities to facilitate data analysis and preparation in machine learning.

Import
- Easily access and import data stored in cloud-based data warehouses and data lakes, such as Amazon S3, Athena, Redshift, Snowflake, and Databricks.
- The dataset you import can contain up to 1000 columns.
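
As a rough illustration of the column limit above, the following sketch checks a CSV header before you attempt an import. It is not part of Data Wrangler; the helper names `count_columns` and `fits_column_limit` are hypothetical, and the check assumes the first row of the file is the header.

```python
import csv
import io

MAX_COLUMNS = 1000  # column limit stated in the Import section above


def count_columns(csv_text: str) -> int:
    """Return the number of columns in the header row of CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields zero columns
    return len(header)


def fits_column_limit(csv_text: str, limit: int = MAX_COLUMNS) -> bool:
    """Return True when the header has no more columns than the limit."""
    return count_columns(csv_text) <= limit


sample = "id,age,city\n1,34,Manila\n2,29,Cebu\n"
print(count_columns(sample), fits_column_limit(sample))
```

A check like this can run locally before uploading a wide dataset, so that an oversized file is caught early.
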
Data Wrangler Flow
- Create a data flow to design a sequence of data preparation steps for machine learning.
- Combine datasets from various sources, specify the necessary transformations, and create a data preparation workflow that can be integrated into an ML pipeline.
- It provides details like the count of missing values and the number of outliers.
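
Conceptually, a data flow is an ordered sequence of preparation steps applied to a dataset. The sketch below models that idea in plain Python. It does not reflect Data Wrangler's internal flow format, and the step helpers (`drop_missing`, `rename_column`, `run_flow`) are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def drop_missing(column: str) -> Step:
    """Build a step that removes rows where `column` is None or absent."""
    def step(rows: List[Row]) -> List[Row]:
        return [r for r in rows if r.get(column) is not None]
    return step


def rename_column(old: str, new: str) -> Step:
    """Build a step that renames `old` to `new` in every row."""
    def step(rows: List[Row]) -> List[Row]:
        return [{(new if k == old else k): v for k, v in r.items()} for r in rows]
    return step


def run_flow(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Apply each step in order and return the prepared rows."""
    for step in steps:
        rows = step(rows)
    return rows


raw = [{"id": 1, "amt": 10.0}, {"id": 2, "amt": None}, {"id": 3, "amt": 7.5}]
flow = [drop_missing("amt"), rename_column("amt", "amount")]
prepared = run_flow(raw, flow)
print(prepared)
```

Because each step is an independent function, the same list of steps can be reused on new data, which is the property that makes a flow suitable for an ML pipeline.
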
Transform:
- Use standard transformation tools, such as string, vector, and numeric data formatting, to clean and transform your data.
- Create new features by applying techniques like text embedding, date/time embedding, and categorical encoding.
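
To make the feature-creation ideas above concrete, here is a minimal sketch of two common techniques: one-hot categorical encoding and extracting calendar components from a timestamp. This is a simplified stand-in, not Data Wrangler's own transform code, and the function names are hypothetical. Note that splitting a timestamp into components is a simpler technique than a learned date/time embedding.

```python
from datetime import datetime
from typing import Dict, List


def one_hot_encode(values: List[str]) -> List[Dict[str, int]]:
    """Encode each value as 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def datetime_features(timestamp: str) -> Dict[str, int]:
    """Split an ISO 8601 timestamp into numeric calendar features."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "weekday": dt.weekday(),  # Monday is 0, Sunday is 6
    }


colors = ["red", "blue", "red"]
encoded = one_hot_encode(colors)
features = datetime_features("2024-10-03T14:30:00")
print(encoded)
print(features)
```
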

Data Insights:
- Automatically check data quality and detect potential issues in your data using Data Wrangler.
- To generate a Data Quality and Insights report for the entire dataset, Data Wrangler uses an Amazon SageMaker processing job.
- The report consists of the following six sections:
  - Summary
    - It summarizes the data, highlighting missing values, invalid entries, feature types, and outlier counts. It may also include high-severity warnings indicating potential data issues, which should be further investigated.
  - Target column
    - Data Wrangler allows for the selection of a target column for prediction. It automatically performs target column analysis and ranks features by their predictive power. You must also specify whether the problem is regression or classification.
  - Quick model
    - Provides an approximation of the expected predictive performance of a model trained on your data.
  - Feature summary
    - When a target column is specified, Data Wrangler ranks features by their predictive power, using an 80/20 training and validation split. Each feature’s predictive performance is measured individually, and scores are normalized between 0 and 1. Higher scores indicate more useful features for predicting the target, while lower scores suggest non-predictive or redundant features. A perfect score of 1 often signals target leakage, where a feature reveals information unavailable during actual predictions.
  - Samples
    - Indicates whether your samples are anomalous or if duplicates exist in your dataset.
  - Definitions
    - Explains the technical terms used in the data insights report.
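
The feature summary idea (score each feature on its own with an 80/20 split, normalize to the range 0 to 1, and treat a score of 1 as a possible sign of target leakage) can be sketched as follows. The scoring rule here is an assumption made for illustration: it uses a per-value majority-class predictor and rescales validation accuracy against a majority-class baseline. The source does not describe how Data Wrangler computes its scores, so treat `feature_score` as a hypothetical stand-in. The rows are assumed to be shuffled already.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, str]


def split_80_20(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Use the first 80% of rows for training and the rest for validation."""
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]


def feature_score(rows: List[Row], feature: str, target: str) -> float:
    """Score one categorical feature between 0 and 1 (assumed scoring rule)."""
    train, valid = split_80_20(rows)
    if not train or not valid:
        raise ValueError("need enough rows for both training and validation")
    # Baseline: always predict the most common target seen in training.
    majority = Counter(r[target] for r in train).most_common(1)[0][0]
    # Single-feature model: most common target for each feature value.
    by_value = defaultdict(Counter)
    for r in train:
        by_value[r[feature]][r[target]] += 1
    lookup = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    accuracy = sum(lookup.get(r[feature], majority) == r[target] for r in valid) / len(valid)
    baseline = sum(majority == r[target] for r in valid) / len(valid)
    if baseline >= 1.0:
        return 0.0  # nothing left to improve on the baseline
    return max(0.0, (accuracy - baseline) / (1.0 - baseline))


targets = ["yes", "no", "yes", "no", "yes", "no", "no", "no", "yes", "no"]
rows = [{"noise": "x", "leak": t, "churn": t} for t in targets]
for name in ("noise", "leak"):
    print(name, feature_score(rows, name, "churn"))
```

In this toy data, `leak` is a copy of the target, so it earns the maximum score, while the constant `noise` column earns the minimum. A feature that scores this high in practice deserves a check for leakage before it is used in training.
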
Analyze:
- Examine dataset features using built-in visualization tools (like scatter plots and histograms) and analysis tools (like target leakage analysis and quick modeling) to understand feature relationships.
- All analyses are performed using 100,000 rows from your dataset.
- View a brief overview of your dataset that displays the number of entries, the minimum and maximum values for numeric data, and the most and least frequent categories for categorical data.
- Use a simple model of the dataset to calculate an importance score for each feature.
- Create a custom visualization using your own code.
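
The dataset overview described above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not Data Wrangler's implementation: the helper names are hypothetical, and random sampling down to 100,000 rows is only one possible way to select the rows used for analysis.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence

SAMPLE_SIZE = 100_000  # number of rows used for analyses, as stated above


def sample_rows(rows: Sequence, size: int = SAMPLE_SIZE, seed: int = 0) -> List:
    """Return all rows if there are at most `size`, else a random sample."""
    if len(rows) <= size:
        return list(rows)
    return random.Random(seed).sample(list(rows), size)


def summarize_numeric(values: Sequence[float]) -> Dict[str, float]:
    """Report the entry count, minimum, and maximum of a numeric column."""
    return {"count": len(values), "min": min(values), "max": max(values)}


def summarize_categorical(values: Sequence[str]) -> Dict[str, object]:
    """Report the entry count and the most and least frequent categories."""
    counts = Counter(values).most_common()  # sorted by frequency, descending
    return {
        "count": len(values),
        "most_frequent": counts[0][0],
        "least_frequent": counts[-1][0],
    }


ages = [34, 29, 41]
cities = ["Manila", "Cebu", "Manila", "Davao", "Cebu", "Manila"]
print(summarize_numeric(sample_rows(ages)))
print(summarize_categorical(sample_rows(cities)))
```
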
Export: Transfer your data preparation workflow to another destination, such as the following:
- Amazon Simple Storage Service (Amazon S3) bucket
- Amazon SageMaker Pipelines: Leverage Pipelines to automate the deployment of models. Transformed data can be directly exported to the pipelines.
- Amazon SageMaker Feature Store: Centralizes storage of features and their data.
- Python script: Saves data and transformations in a Python script for custom workflows.

Amazon SageMaker Data Wrangler Use Cases

Data Wrangler simplifies cleaning, transforming, and preparing datasets for machine learning with built-in tools and integration with multiple data sources.
It allows users to create reusable, automated workflows for consistent data transformation in production environments.
It provides visual tools for exploring data, identifying patterns, and detecting anomalies to improve model understanding.
It detects and helps mitigate data quality issues and bias, which supports fairer machine learning predictions.
It integrates with SageMaker Pipelines to automate end-to-end machine learning workflows, from data preparation to model deployment.


Written by: Irene Bonso

Irene Bonso is a Junior Software Engineer at Tutorials Dojo and an active member of the AWS Community Builder Program. She focuses on gaining knowledge and making it accessible to a broader audience through her contributions and insights.
