AWS Glue Data Quality Cheat Sheet
AWS Glue Data Quality lets you monitor and measure the quality of your data. It is part of the AWS Glue service and is built on the open-source Deequ framework.
Use Cases
- Analyzing data sets that are cataloged in the AWS Glue Data Catalog.
- Continuously monitoring the quality of data in a data lake.
- Adding a layer of data quality checks to traditional AWS Glue jobs.
AWS Glue Data Quality uses a domain-specific language, the Data Quality Definition Language (DQDL), to define data quality rules.
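A DQDL ruleset is a `Rules = [...]` list of rule expressions. A minimal sketch (the column names and thresholds here are hypothetical):

```
Rules = [
    IsComplete "order_id",
    ColumnValues "quantity" > 0,
    Uniqueness "order_id" > 0.99
]
```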
Features
- Serverless: No servers to manage; AWS handles the infrastructure.
- Quick Start: Analyzes your data and automatically recommends data quality rules to get you started.
- Data Quality Issues Detection: Uses machine learning to identify potential data quality issues.
- Rule Customization: Comes with over 25 pre-defined data quality rules, but also allows you to create your own.
- Data Quality Score: Provides a summary score that gives an overview of the overall quality of your data.
- Bad Data Identification: Identifies the exact records that are causing your data quality scores to decrease.
- Pay as you go: You only pay for what you use, with no upfront costs or long-term commitments.
- No lock-in: Built on the open-source Deequ framework.
- Data Quality Checks: Allows you to enforce data quality checks on your AWS Glue ETL pipelines and Data Catalog.
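The data quality score mentioned above is, at its core, the fraction of rules that pass. A minimal illustrative sketch of that idea (this is not the Glue API; the rule names and outcomes are made up):

```python
# Illustrative only: computes a data-quality score as the
# percentage of passing rules. Rule names/outcomes are hypothetical.
def quality_score(rule_results: dict) -> float:
    """Return (passed rules / total rules) as a percentage."""
    if not rule_results:
        return 0.0
    passed = sum(1 for outcome in rule_results.values() if outcome)
    return 100.0 * passed / len(rule_results)

results = {
    'IsComplete "order_id"': True,
    'ColumnValues "quantity" > 0': True,
    'Uniqueness "order_id" > 0.99': False,
}
print(quality_score(results))  # 2 of 3 rules pass
```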
Pricing
- AWS Glue Data Quality charges are based on the resources used and the duration they are running.
- Adding data quality checks to ETL jobs may increase runtime or DPU consumption.
- Charges are $0.44 per DPU-hour for standard usage.
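As a quick back-of-the-envelope example of the DPU-hour math (the $0.44 rate comes from the note above; the DPU count and runtime are hypothetical):

```python
# Back-of-the-envelope Glue Data Quality cost estimate.
# Rate is from the pricing note above; DPUs and runtime are made up.
DPU_HOUR_RATE = 0.44  # USD per DPU-hour

def estimate_cost(dpus: int, runtime_minutes: float) -> float:
    """Cost = DPUs x runtime in hours x rate per DPU-hour."""
    return dpus * (runtime_minutes / 60.0) * DPU_HOUR_RATE

# e.g. a 5-DPU evaluation running for 30 minutes:
# 5 * 0.5 h * $0.44 = $1.10
print(round(estimate_cost(5, 30), 2))
```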
References
- https://docs.aws.amazon.com/glue/latest/dg/glue-data-quality.html