Last updated on October 3, 2024
Amazon SageMaker Data Wrangler Cheat Sheet
Amazon SageMaker Data Wrangler streamlines data preparation and feature engineering for machine learning.
- Amazon SageMaker Data Wrangler is a feature in Amazon SageMaker Studio Classic.
- It integrates data from various sources; lets you explore, clean, transform, and visualize that data; and automates these steps in your machine learning workflow.
Amazon SageMaker Data Wrangler Core Functionalities
Data Wrangler provides core functionalities to facilitate data analysis and preparation in machine learning.
- Import
- Easily access and import data stored in cloud-based data warehouses and data lakes, such as Amazon S3, Athena, Redshift, Snowflake, and Databricks.
- The dataset you import can contain up to 1000 columns.
- Data Wrangler Flow
- Create a data flow to design a sequence of data preparation steps for machine learning.
- Combine datasets from various sources, specify the necessary transformations, and create a data preparation workflow that can be integrated into an ML pipeline.
- It provides details like the count of missing values and the number of outliers.
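To make the idea of missing-value and outlier counts concrete, here is a minimal standard-library sketch. It is not Data Wrangler's implementation: the helper names, the sample `ages` column, and the 1.5 x IQR outlier rule are all assumptions chosen for illustration.

```python
import math
import statistics


def is_missing(value):
    """Treat None, empty strings, and NaN as missing (an assumed definition)."""
    return value is None or value == "" or (isinstance(value, float) and math.isnan(value))


def count_missing(values):
    """Count the missing entries in a column."""
    return sum(1 for v in values if is_missing(v))


def count_outliers_iqr(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule is an assumption for this sketch, not the documented
    Data Wrangler outlier definition.
    """
    nums = [v for v in values if not is_missing(v)]
    if len(nums) < 4:
        return 0  # too few points to estimate quartiles meaningfully
    q1, _, q3 = statistics.quantiles(nums, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in nums if v < low or v > high)


# Hypothetical numeric column with two missing entries and one extreme value.
ages = [23, 25, None, 27, 24, 26, 25, float("nan"), 240, 22]
missing = count_missing(ages)
outliers = count_outliers_iqr(ages)
print(f"missing={missing}, outliers={outliers}")
```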
- Transform:
- Use standard transformation tools, such as string, vector, and numeric data formatting, to clean and transform your data.
- Create new features by applying techniques such as text embedding, date/time embedding, and categorical encoding.
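As a rough picture of what categorical encoding and date/time featurization produce, the sketch below one-hot encodes a column and extracts calendar components from a timestamp. It uses plain Python rather than Data Wrangler's transforms; the column names (`plan`, `signup`) and helper functions are hypothetical, and the date/time step is simple component extraction, not Data Wrangler's embedding.

```python
from datetime import datetime


def one_hot(values):
    """One-hot encode a list of category labels into 0/1 indicator dicts."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def datetime_features(timestamp):
    """Split an ISO 8601 timestamp into simple numeric features."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "year": dt.year,
        "month": dt.month,
        "day_of_week": dt.weekday(),  # Monday is 0
        "hour": dt.hour,
    }


# Hypothetical input rows.
rows = [
    {"plan": "basic", "signup": "2024-10-03T09:30:00"},
    {"plan": "pro", "signup": "2024-12-25T18:05:00"},
]

encoded = one_hot([r["plan"] for r in rows])
features = [{**enc, **datetime_features(r["signup"])} for enc, r in zip(encoded, rows)]
for f in features:
    print(f)
```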
- Data Insights:
- Automatically check data quality and detect potential issues in your data using Data Wrangler.
- Generating a Data Quality and Insights report on the entire dataset runs as an Amazon SageMaker Processing job.
- The report consists of six sections:
- Summary
- It summarizes the data, highlighting missing values, invalid entries, feature types, and outlier counts. It may also include high-severity warnings indicating potential data issues, which should be further investigated.
- Target column
- Data Wrangler allows for the selection of a target column for prediction. It automatically performs target column analysis and ranks features by their predictive power. You must also specify whether the problem is regression or classification.
- Quick model
- Provides an approximation of the expected predictive performance of a model trained on your data.
- Feature summary
- When a target column is specified, Data Wrangler ranks features by their predictive power, using an 80/20 training and validation split. Each feature’s predictive performance is measured individually, and scores are normalized between 0 and 1. Higher scores indicate more useful features for predicting the target, while lower scores suggest non-predictive or redundant features. A perfect score of 1 often signals target leakage, where a feature reveals information unavailable during actual predictions.
- Samples: Indicates whether any samples are anomalous or whether duplicates exist in your dataset.
- Definitions: Explains the technical terms used in the data insights report.
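The feature summary's idea of scoring each feature on its own, with an 80/20 split and a score between 0 and 1, can be sketched as follows. This is only an illustration under stated assumptions: the one-feature decision stump, the normalization against a majority-class baseline, and the column names (`leaky_copy`, `noise`, `constant`, `churned`) are all hypothetical and are not the model or formula Data Wrangler uses.

```python
import random


def split_80_20(rows, seed=0):
    """Shuffle and split rows into 80% training and 20% validation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def fit_stump(train, feature, target):
    """Find the threshold and polarity with the best training accuracy."""
    best = None
    for threshold in sorted({r[feature] for r in train}):
        for above, below in ((1, 0), (0, 1)):
            correct = sum(
                1 for r in train
                if (above if r[feature] >= threshold else below) == r[target]
            )
            if best is None or correct > best[0]:
                best = (correct, threshold, above, below)
    return best[1:]


def predictive_score(train, valid, feature, target):
    """Validation accuracy of a one-feature stump, rescaled to [0, 1].

    0 means no better than always predicting the majority class;
    1 means perfect prediction (an assumed normalization).
    """
    threshold, above, below = fit_stump(train, feature, target)
    hits = sum(
        1 for r in valid
        if (above if r[feature] >= threshold else below) == r[target]
    )
    accuracy = hits / len(valid)
    positives = sum(r[target] for r in valid)
    baseline = max(positives, len(valid) - positives) / len(valid)
    if baseline >= 1.0:
        return 0.0  # validation set has a single class; nothing to improve on
    return max(0.0, (accuracy - baseline) / (1.0 - baseline))


# Synthetic binary-classification data with a deliberately leaky feature.
rng = random.Random(42)
rows = []
for i in range(200):
    label = i % 2
    rows.append({
        "leaky_copy": label,    # duplicates the target: target leakage
        "noise": rng.random(),  # unrelated to the target
        "constant": 7,          # carries no information
        "churned": label,
    })

train, valid = split_80_20(rows)
scores = {
    name: predictive_score(train, valid, name, "churned")
    for name in ("leaky_copy", "noise", "constant")
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

In this sketch the leaky feature scores exactly 1, which mirrors the warning above that a perfect score often signals target leakage.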
- Analyze:
- Examine dataset features using built-in visualization tools (like scatter plots and histograms) and analysis tools (like target leakage analysis and quick modeling) to understand feature relationships.
- All analyses are performed using 100,000 rows from your dataset.
- Table summary: A brief overview of your dataset, displaying the number of entries, minimum and maximum values for numeric data, and the most and least frequent categories for categorical data.
- Quick model: A simple model trained on the dataset is used to calculate an importance score for each feature.
- Custom visualization: A visualization created using your own code.
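The table summary described above (entry count, numeric min/max, most and least frequent categories) computed on a capped row sample can be approximated in a few lines. This is a sketch under assumptions: the function names, the random sampling strategy, and the example columns are hypothetical; only the 100,000-row figure comes from the text.

```python
import random
from collections import Counter

SAMPLE_SIZE = 100_000  # row cap taken from the text above


def sample_rows(rows, size=SAMPLE_SIZE, seed=0):
    """Return all rows if they fit under the cap, else a random sample."""
    if len(rows) <= size:
        return list(rows)
    return random.Random(seed).sample(rows, size)


def table_summary(rows, numeric_cols, categorical_cols):
    """Summarize entry count, numeric ranges, and category frequencies.

    Assumes every listed column has at least one non-missing value.
    """
    summary = {"entries": len(rows)}
    for col in numeric_cols:
        values = [r[col] for r in rows if r[col] is not None]
        summary[col] = {"min": min(values), "max": max(values)}
    for col in categorical_cols:
        counts = Counter(r[col] for r in rows if r[col] is not None).most_common()
        summary[col] = {
            "most_frequent": counts[0][0],
            "least_frequent": counts[-1][0],
        }
    return summary


# Hypothetical dataset with one numeric and one categorical column.
rows = [
    {"price": 10.0, "color": "red"},
    {"price": 4.5, "color": "red"},
    {"price": 7.25, "color": "blue"},
    {"price": None, "color": "red"},
]
summary = table_summary(sample_rows(rows), ["price"], ["color"])
print(summary)
```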
- Export: Transfer your data preparation workflow to another destination, such as the following:
- Amazon Simple Storage Service (Amazon S3) bucket
- Amazon SageMaker Pipelines: Use Pipelines to automate model deployment. Transformed data can be exported directly to a pipeline.
- Amazon SageMaker Feature Store: Centralizes storage of features and their data.
- Python script: Saves the data flow and its transformations as a Python script for custom workflows.
Amazon SageMaker Data Wrangler Use Cases
- Data Wrangler simplifies cleaning, transforming, and preparing datasets for machine learning with built-in tools and integration with multiple data sources.
- It allows users to create reusable, automated workflows for consistent data transformation in production environments.
- Provides visual tools for exploring data, identifying patterns, and detecting anomalies to improve model understanding.
- Helps detect and mitigate data quality issues and bias, supporting fairer machine learning predictions.
- Integrates with SageMaker Pipelines, automating end-to-end machine learning workflows for data preparation and model deployment.
Amazon SageMaker Data Wrangler References:
- https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html
- https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-getting-started.html
- https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-data-insights.html
- https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-analyses.html