AWS Glue Data Quality Cheat Sheet
AWS Glue Data Quality is a feature of AWS Glue that lets you monitor and measure the quality of your data. It is built on the open-source Deequ framework.
Use Cases
- Analyzing datasets that are cataloged in the AWS Glue Data Catalog.
- Continuously monitoring the quality of data in a data lake.
- Adding a layer of data quality checks to existing AWS Glue ETL jobs.

Rules are written in a domain-specific language called the Data Quality Definition Language (DQDL).
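A minimal DQDL ruleset sketch, for illustration only (the table and column names such as `order_id` and `status` are hypothetical):

```
Rules = [
    RowCount > 0,
    IsComplete "order_id",
    IsUnique "order_id",
    Completeness "customer_id" >= 0.95,
    ColumnValues "status" in ["PENDING", "SHIPPED", "DELIVERED"]
]
```

Each rule evaluates to pass or fail, and the results roll up into the overall data quality score described below.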
 
Features
- Serverless: No servers to manage; AWS handles the infrastructure.
- Quick Start: Analyzes your data and recommends data quality rules quickly.
- Data Quality Issues Detection: Uses machine learning to identify potential data quality issues.
- Rule Customization: Comes with over 25 pre-defined rule types, and also lets you write your own rules.
- Data Quality Score: Provides a summary score that gives an overview of the overall quality of your data.
- Bad Data Identification: Pinpoints the exact records that are lowering your data quality score.
- Pay as you go: You pay only for what you use, with no upfront costs or long-term commitments.
- No lock-in: Built on the open-source Deequ framework.
- Data Quality Checks: Lets you enforce data quality checks on your AWS Glue ETL pipelines and Data Catalog tables.
 
Pricing
- AWS Glue Data Quality charges are based on the resources used and how long they run.
- Adding data quality checks to ETL jobs may increase runtime and therefore DPU consumption.
- Standard usage is charged at $0.44 per DPU-hour.
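A quick back-of-the-envelope sketch of the incremental cost of adding checks to a job, using the $0.44 per DPU-hour rate above (the DPU count and extra runtime are hypothetical examples, not published figures):

```python
# Estimate the extra cost of data quality checks on a Glue ETL job.
# Rate from this cheat sheet: $0.44 per DPU-hour for standard usage.
DPU_HOUR_RATE = 0.44

def dq_check_cost(dpus: int, extra_minutes: float) -> float:
    """Cost of running `dpus` DPUs for `extra_minutes` of added runtime."""
    return dpus * (extra_minutes / 60.0) * DPU_HOUR_RATE

# Hypothetical example: checks add 10 minutes to a 5-DPU job.
print(round(dq_check_cost(5, 10), 2))  # 5 * (10/60) * 0.44 ≈ 0.37
```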
 
References:
https://docs.aws.amazon.com/glue/latest/dg/glue-data-quality.html