Last updated on June 23, 2023
AWS Lake Formation Cheat Sheet
-
A service for managing and building data lakes.
-
It collects and catalogs data from databases and object storage, then moves the data into a new Amazon S3 data lake.
-
You can also use ML algorithms to clean and classify data, and you can secure access to sensitive data with granular controls at the column, row, and cell levels.
How It Works
-
Identify existing data stores, such as S3 or databases, and move the data to your data lake.
-
The data is then crawled, cataloged, and prepared for analytics.
-
Lastly, provide users with data access through their preferred analytics services.
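A minimal boto3 sketch of the crawl-and-catalog step, assuming AWS Glue performs the crawling; the crawler name, IAM role, catalog database, and S3 path below are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

# Define a crawler over the S3 data (names, role, and path are hypothetical).
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",  # Data Catalog database the crawler populates
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/sales/"}]},
)

# Crawl the data so its schema is cataloged and ready for analytics.
glue.start_crawler(Name="sales-data-crawler")
```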
Concepts
-
Data Lake
-
Persistent data stored in Amazon S3, including:
-
Structured and unstructured data
-
Raw data and transformed data
-
When you register an Amazon S3 location, the S3 path and all folders under that path are registered.
-
Data Catalog
-
Persistent metadata store.
-
A repository where various systems can store and find metadata to keep track of data in data silos and use that metadata to query and transform the data.
-
The AWS Glue Data Catalog maintains metadata about data lakes, data sources, transforms, and targets.
-
Metadata about data sources and targets is in the form of databases and tables.
-
Databases – collections of tables.
-
Tables – metadata that describes the data in the data lake.
-
You can control access to databases and tables in the data catalog using permissions.
-
Each AWS account has one Data Catalog per AWS Region.
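A short boto3 sketch that lists the databases and tables in the Data Catalog; because the catalog is regional, the client's Region (a placeholder below) determines which catalog is queried.

```python
import boto3

# One Data Catalog per account per Region, so the Region selects the catalog.
glue = boto3.client("glue", region_name="us-east-1")

for db in glue.get_databases()["DatabaseList"]:
    tables = glue.get_tables(DatabaseName=db["Name"])["TableList"]
    print(db["Name"], [t["Name"] for t in tables])
```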
-
Governed tables are unique to AWS Lake Formation and have the following features (see the sketch after this list):
-
ACID transactions
-
Automatic data compaction
-
Time-travel queries
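A hedged boto3 sketch of the governed-table transaction API (start, then commit or cancel); the write step is left as a placeholder comment.

```python
import boto3

lf = boto3.client("lakeformation")

# Open an ACID transaction against governed tables.
txn_id = lf.start_transaction(TransactionType="READ_AND_WRITE")["TransactionId"]

try:
    # ... write or update governed table objects within the transaction here ...
    lf.commit_transaction(TransactionId=txn_id)
except Exception:
    lf.cancel_transaction(TransactionId=txn_id)
    raise
```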
-
Resource links
-
Links to databases and tables shared from external accounts.
-
They are used for cross-account access to data in the data lake.
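A possible way to create a resource link using the AWS Glue API, assuming a database has been shared from another account; the account ID and database names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Create a resource link that points to a database shared from an external account.
glue.create_database(
    DatabaseInput={
        "Name": "shared_sales_db_link",   # local name of the resource link
        "TargetDatabase": {
            "CatalogId": "111122223333",  # external (owner) account
            "DatabaseName": "sales_db",   # shared database in that account
        },
    }
)
```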
-
Blueprint
-
A data management template to ingest data into a data lake.
-
You can use blueprints to configure the workflow by providing input such as the data source, data target, and schedule.
-
Types of blueprints:
-
Database snapshot
-
Incremental database
-
Log file
-
Workflow
-
Defines the data source and schedule for importing data into the data lake.
-
A container for a collection of AWS Glue jobs, crawlers, and triggers.
-
Uses AWS Glue to orchestrate the loading and updating of data.
-
It can be run on-demand or on a schedule.
-
You can monitor the progress of the workflow through its AWS Glue directed acyclic graph (DAG).
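A small boto3 sketch of running a workflow on demand and checking its progress through AWS Glue; the workflow name is a hypothetical placeholder.

```python
import boto3

glue = boto3.client("glue")

# Start an on-demand run of the workflow (name is hypothetical).
run_id = glue.start_workflow_run(Name="lakeformation-ingest-workflow")["RunId"]

# Inspect the run; IncludeGraph also returns the DAG of jobs, crawlers, and triggers.
run = glue.get_workflow_run(
    Name="lakeformation-ingest-workflow", RunId=run_id, IncludeGraph=True
)
print(run["Run"]["Status"])
```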
AWS Lake Formation Security
-
Encrypt and decrypt data in Amazon S3 using AWS KMS.
-
Use AWS CloudTrail to capture all Lake Formation API calls.
-
Types of permissions:
-
Metadata access – for data catalog resources.
-
Underlying data access – for Amazon S3 locations.
-
You can use the credential vending API to provide temporary credentials for data in registered Amazon S3 locations based on effective permissions, allowing authorized engines to access data on users’ behalf.
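An illustrative, simplified call to the credential vending API; in practice it is typically invoked by integrated query engines on a user's behalf, and the table ARN below is hypothetical.

```python
import boto3

lf = boto3.client("lakeformation")

# Request short-lived credentials scoped to a registered table (ARN is hypothetical).
# The response contains temporary AWS credentials and an expiration time.
creds = lf.get_temporary_glue_table_credentials(
    TableArn="arn:aws:glue:us-east-1:123456789012:table/sales_db/orders",
    Permissions=["SELECT"],
    DurationSeconds=900,
    SupportedPermissionTypes=["COLUMN_PERMISSION"],
)
```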
-
With the querying API, you can retrieve data from Amazon S3, filter the results based on effective permissions, and then return them to query engines.
-
The service-linked role provides necessary permissions to call other AWS services on your behalf.
-
When you need to grant more permissions than the service-linked role provides, use a user-defined role.
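A brief boto3 sketch of both options when registering an S3 location: the service-linked role or a user-defined role; the S3 paths and role ARN are hypothetical.

```python
import boto3

lf = boto3.client("lakeformation")

# Register an S3 path using the Lake Formation service-linked role.
lf.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake-bucket/sales/",
    UseServiceLinkedRole=True,
)

# When the service-linked role is not enough, register with a user-defined role.
lf.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake-bucket/restricted/",
    RoleArn="arn:aws:iam::123456789012:role/MyLakeFormationRegistrationRole",
)
```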
-
You can specify, grant, and revoke permissions on tables in the data catalog.
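A minimal boto3 sketch of granting column-level SELECT on a catalog table; the role ARN, database, table, and column names are hypothetical, and revoke_permissions accepts the same argument shape.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on specific columns of a catalog table to an IAM role.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date", "amount"],
        }
    },
    Permissions=["SELECT"],
)
```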
-
Manage your AWS Glue Data Catalog objects and data locations in Amazon S3 using the Lake Formation permissions model.
-
Use Lake Formation tag-based access control (LF-TBAC) to manage permissions for a large number of data catalog resources.
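A hedged boto3 sketch of LF-TBAC: define an LF-tag, attach it to a database, and grant on a tag expression instead of on individual resources; all names and values are hypothetical.

```python
import boto3

lf = boto3.client("lakeformation")

# Define an LF-tag and attach it to a catalog database.
lf.create_lf_tag(TagKey="classification", TagValues=["public", "confidential"])
lf.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "sales_db"}},
    LFTags=[{"TagKey": "classification", "TagValues": ["public"]}],
)

# Grant on every table whose LF-tags match the expression.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "classification", "TagValues": ["public"]}],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)
```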
-
You can create data filters to restrict access to specific rows and columns in the query results returned to engines.
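A minimal boto3 sketch of creating a data filter with row and column restrictions; the account ID, names, and filter expression are hypothetical.

```python
import boto3

lf = boto3.client("lakeformation")

# Limit query results to selected columns and to rows matching the filter expression.
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "123456789012",  # account that owns the table
        "DatabaseName": "sales_db",
        "TableName": "orders",
        "Name": "us_orders_only",
        "RowFilter": {"FilterExpression": "country = 'US'"},
        "ColumnNames": ["order_id", "order_date", "amount"],
    }
)
```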
AWS Lake Formation Pricing
-
You are charged for transaction requests and metadata storage.
-
You are charged for data filtering based on the number of bytes scanned by the Storage API.
-
You are charged based on the number of bytes processed by the storage optimizer.
AWS Lake Formation Cheat Sheet References:
https://aws.amazon.com/lake-formation/
https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html