AWS Entity Resolution Cheat Sheet
- AWS Entity Resolution is a fully managed service that helps organizations match, link, and enhance records across various customer, product, business, or healthcare data sources. It automates data integration by resolving duplicates and merging fragmented records.
- The service offers rule-based, machine learning-powered, and third-party matching techniques, ensuring high-quality, unified datasets. It also supports data encryption for security and compliance.
- Improving data quality enables better decision-making, enhanced customer profiles, and streamlined operations.
Features
- Minimized Data Movement: Reads data where it resides (e.g., S3), reducing unnecessary data transfers.
- Flexible Data Input & Schema Mapping:
  - Supports up to 20 data inputs.
  - Can operate on encrypted datasets.
  - Schema can come from AWS Glue or be custom-defined.
  - Built-in normalization (e.g., trimming, lowercasing) is enabled by default, with advanced options available through the open-source normalization library on GitHub.
- Configurable Matching Workflows:
  - Supports rule-based, ML-powered, or data-provider matching.
  - Easy job monitoring and status updates.
- Matching Techniques:
  - Rule-based: Custom rules evaluated in priority order; matching records are output as tagged groups.
  - ML-powered: Pre-configured model with confidence scores (0.0–1.0).
  - Data Service Provider: Integrates with data service providers such as LiveRamp and UID2 (provider subscription required).
  - Advanced Fuzzy Matching: Supports Levenshtein, Cosine Similarity, and Soundex, with customizable thresholds for handling typos and variations.
  - Near Real-time Lookup: The GetMatchId API hashes incoming PII and returns match IDs, enabling near real-time applications like personalization or fraud detection.
- Built-in Data Protection & Regionalization:
  - Default encryption for data, with optional customer-managed KMS keys.
  - Workflows run in the same AWS Region as the data for compliance and latency.
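The normalization and fuzzy-matching ideas above can be illustrated with a minimal pure-Python sketch. It is not the service's implementation: the function names and the `max_distance` threshold are hypothetical, and the service may express its thresholds differently.

```python
def normalize(value: str) -> str:
    """Trim surrounding whitespace and lowercase the value."""
    return value.strip().lower()


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def is_fuzzy_match(left: str, right: str, max_distance: int = 2) -> bool:
    """Hypothetical rule: two values match if their normalized edit distance is within the threshold."""
    return levenshtein(normalize(left), normalize(right)) <= max_distance
```

For example, `is_fuzzy_match("  Jon Smith", "john smith")` treats the two spellings as the same person because only one edit separates them after normalization.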
Use Cases
- Remove duplicate records to prepare for analytics or AI model use.
- Link disparate customer interactions into unified profiles using flexible matching techniques.
- Translate or append provider IDs like LiveRamp or UID2 for campaign targeting and measurement.
- Near real-time matching enables tailored recommendations, better guest experiences, patient care improvements, or fraud detection.
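A near real-time lookup could be sketched with the boto3 `entityresolution` client and its `get_match_id` call. The workflow name, Region, and record field names below are hypothetical; the record keys are assumed to match the fields defined in your schema mapping.

```python
def build_match_id_request(workflow_name: str, record: dict,
                           apply_normalization: bool = True) -> dict:
    """Build keyword arguments for get_match_id; record values are sent as strings."""
    return {
        "workflowName": workflow_name,
        "record": {key: str(value) for key, value in record.items()},
        "applyNormalization": apply_normalization,
    }


def lookup_match_id(workflow_name: str, record: dict, region: str = "us-east-1"):
    """Call the service and return the match ID, or None if the response has none."""
    import boto3  # imported lazily so the request builder works without AWS access

    client = boto3.client("entityresolution", region_name=region)
    response = client.get_match_id(**build_match_id_request(workflow_name, record))
    return response.get("matchId")
```

A caller would use something like `lookup_match_id("customer-dedup", {"email": "jane@example.com"})`, where `customer-dedup` stands in for an existing matching workflow.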
Security
- Data encryption:
  - Uses AWS-owned encryption keys by default.
  - Supports customer-managed KMS keys via grants for enhanced control.
- VPC & PrivateLink support: You can integrate through interface endpoints, define endpoint policies, and securely call APIs from within your VPC.
- IAM integration:
  - Fine-grained permissions via IAM policies, cross-account support, and resource-level control.
  - Supports service roles (not service-linked roles).
- Logging & auditability:
  - All API actions are captured in AWS CloudTrail.
  - Workflow logging into CloudWatch is available (standard logging costs apply).
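Resource-level control can be illustrated by a policy scoped to a single matching workflow. The sketch below builds such a policy document in Python. The account ID, Region, and workflow name are placeholders, and the ARN format is an assumption that should be checked against the Service Authorization Reference before use.

```python
import json


def build_workflow_policy(region: str, account_id: str, workflow_name: str) -> dict:
    """Return an IAM policy document allowing job runs and lookups on one matching workflow."""
    # Assumed ARN layout for a matching workflow resource.
    workflow_arn = (
        f"arn:aws:entityresolution:{region}:{account_id}:"
        f"matchingworkflow/{workflow_name}"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RunAndReadOneWorkflow",
                "Effect": "Allow",
                "Action": [
                    "entityresolution:StartMatchingJob",
                    "entityresolution:GetMatchingJob",
                    "entityresolution:GetMatchId",
                ],
                "Resource": workflow_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(json.dumps(build_workflow_policy("us-east-1", "123456789012", "customer-dedup"), indent=2))
```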
Pricing
- Rule-based or ML-powered matching: $0.25 per 1,000 records processed.
- Data service provider matching (requires provider license): $0.10 per 1,000 records processed; separate from provider subscription cost.
- No Free Tier available for AWS Entity Resolution.
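A rough cost estimate from the rates above can be sketched as follows. It assumes charges scale proportionally with record count, with no minimum or rounding to whole blocks of 1,000, and it excludes any provider subscription fees. The function and key names are hypothetical.

```python
from decimal import Decimal

# Rates per 1,000 records processed, taken from the list above.
RATES_PER_1000 = {
    "rule_or_ml": Decimal("0.25"),
    "provider": Decimal("0.10"),
}


def estimate_cost(records: int, technique: str) -> Decimal:
    """Estimate the processing charge in USD, rounded to cents."""
    if records < 0:
        raise ValueError("records must be non-negative")
    rate = RATES_PER_1000[technique]
    return (Decimal(records) / Decimal(1000) * rate).quantize(Decimal("0.01"))
```

Under these assumptions, processing 2,000,000 records costs $500.00 with rule-based or ML-powered matching and $200.00 with data service provider matching.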
References:
https://aws.amazon.com/entity-resolution/