AWS Lake Formation Cheat Sheet

Last updated on June 23, 2023

A service for building, securing, and managing data lakes.
It collects and catalogs data from databases and object storage, then moves it into a new Amazon S3 data lake.
You can also use ML algorithms to clean and classify data, and secure access to sensitive data with granular controls at the column, row, and cell levels.

How It Works

Identify existing data stores, such as S3 or databases, and move the data to your data lake.

The data is then crawled, cataloged, and prepared for analytics.
Lastly, provide users with data access through their preferred analytics services.

Concepts

Data Lake
- Persistent data stored in Amazon S3, including:
  - Structured and unstructured data
  - Raw data and transformed data
- When you register an Amazon S3 location, the S3 path and all folders under that path are registered.
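As a minimal sketch of registering an S3 location, the snippet below builds the arguments for the Lake Formation `register_resource` call in boto3. The bucket name and prefix are hypothetical placeholders, and using the service-linked role is an assumption; you may prefer to pass your own role instead.

```python
def build_register_request(bucket, prefix=""):
    """Build keyword arguments for Lake Formation's register_resource call."""
    clean_prefix = prefix.strip("/")
    path = f"{bucket}/{clean_prefix}" if clean_prefix else bucket
    return {
        # Registering this path also covers every folder underneath it.
        "ResourceArn": f"arn:aws:s3:::{path}",
        # Assumption: let Lake Formation use its service-linked role.
        "UseServiceLinkedRole": True,
    }


def register_location(bucket, prefix=""):
    """Register the location. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("lakeformation")
    return client.register_resource(**build_register_request(bucket, prefix))


# Hypothetical bucket and prefix, for illustration only.
request = build_register_request("example-bucket", "raw/sales/")
```

Keeping the request builder separate from the AWS call makes it possible to inspect the arguments before anything is sent.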
Data Catalog
- Persistent metadata store.
- A repository where various systems can store and find metadata to keep track of data in data silos and use that metadata to query and transform the data.
- The AWS Glue Data Catalog maintains metadata about data lakes, data sources, transforms, and targets.
- Metadata about data sources and targets is in the form of databases and tables.
  - Databases – a collection of tables.
  - Tables – information about data in the data lake.
- You can control access to databases and tables in the data catalog using permissions.
- Each AWS account has one Data Catalog per AWS Region.
- Governed tables are unique to AWS Lake Formation and have the following features:
  - ACID transactions
  - Automatic data compaction
  - Time-travel queries
- Resource links
  - Data Catalog objects that link to databases and tables shared from external accounts.
  - Used for cross-account access to data in the data lake.
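A resource link to a shared database can be sketched with the AWS Glue `create_database` call, where `TargetDatabase` points at the owner's catalog. The link name, account ID, and database name below are hypothetical, and the sketch assumes the owning account has already shared the database with you.

```python
def build_resource_link_request(link_name, owner_account_id, shared_database):
    """Build keyword arguments for glue.create_database that create a resource link."""
    return {
        "DatabaseInput": {
            # Local name under which the shared database appears in your catalog.
            "Name": link_name,
            # The database in the external (owner) account that the link targets.
            "TargetDatabase": {
                "CatalogId": owner_account_id,
                "DatabaseName": shared_database,
            },
        }
    }


def create_resource_link(link_name, owner_account_id, shared_database):
    """Create the link. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue")
    return glue.create_database(
        **build_resource_link_request(link_name, owner_account_id, shared_database)
    )


# Hypothetical values, for illustration only.
link_request = build_resource_link_request("sales_link", "111122223333", "sales")
```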
Blueprint
- A data management template to ingest data into a data lake.
- You can use blueprints to configure the workflow by providing input such as the data source, data target, and schedule.
- Types of blueprints:
  - Database snapshot
  - Incremental database
  - Log file
Workflow
- Defines the data source and schedule for importing data into the data lake.
- A container for a collection of AWS Glue jobs, crawlers, and triggers.
- Uses AWS Glue to orchestrate the loading and updating of data.
- It can be run on-demand or on a schedule.
- You can monitor the progress of a workflow in AWS Glue, where it is displayed as a directed acyclic graph (DAG).
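Because a workflow is an AWS Glue workflow, one way to run it on demand and check its progress is through the Glue `start_workflow_run` and `get_workflow_run` calls. The sketch below is illustrative: the helper names and the sample response are made up, and only the `Run`, `Status`, and `Statistics` fields of the response are read.

```python
def summarize_run(response):
    """Reduce a get_workflow_run response to a short progress summary."""
    run = response["Run"]
    stats = run.get("Statistics", {})
    return {
        "status": run["Status"],
        "succeeded": stats.get("SucceededActions", 0),
        "failed": stats.get("FailedActions", 0),
        "total": stats.get("TotalActions", 0),
    }


def start_and_check(workflow_name):
    """Start a run and fetch its state. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so summarize_run works without boto3

    glue = boto3.client("glue")
    run_id = glue.start_workflow_run(Name=workflow_name)["RunId"]
    response = glue.get_workflow_run(
        Name=workflow_name, RunId=run_id, IncludeGraph=True
    )
    return summarize_run(response)


# Hand-written sample shaped like a get_workflow_run response.
sample = {
    "Run": {
        "Status": "RUNNING",
        "Statistics": {"TotalActions": 4, "SucceededActions": 2, "FailedActions": 0},
    }
}
progress = summarize_run(sample)
```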

AWS Lake Formation Security

Encrypt and decrypt data in Amazon S3 using AWS KMS.
Use AWS CloudTrail to capture all Lake Formation API calls.
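As a rough sketch, recent Lake Formation API calls recorded by CloudTrail can be listed with the CloudTrail `lookup_events` call, filtered by event source. The helper names and the sample response below are hypothetical.

```python
LAKE_FORMATION_EVENT_SOURCE = "lakeformation.amazonaws.com"


def build_lookup_request(max_results=50):
    """Build keyword arguments for cloudtrail.lookup_events."""
    return {
        "LookupAttributes": [
            {
                "AttributeKey": "EventSource",
                "AttributeValue": LAKE_FORMATION_EVENT_SOURCE,
            }
        ],
        "MaxResults": max_results,
    }


def summarize_events(response):
    """Return (event name, user name) pairs from a lookup_events response."""
    return [
        (event["EventName"], event.get("Username", "unknown"))
        for event in response.get("Events", [])
    ]


def recent_lake_formation_calls():
    """Query CloudTrail. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helpers work without boto3

    cloudtrail = boto3.client("cloudtrail")
    return summarize_events(cloudtrail.lookup_events(**build_lookup_request()))


# Hand-written sample shaped like a lookup_events response.
sample_events = {"Events": [{"EventName": "GrantPermissions", "Username": "alice"}]}
calls = summarize_events(sample_events)
```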
Types of permissions:
- Metadata access – for data catalog resources.
- Underlying data access – for Amazon S3 locations.
You can use the credential vending API to provide temporary credentials to registered Amazon S3 locations based on effective permissions, allowing authorized engines to access data on users’ behalf.
With the querying API, you can retrieve data from Amazon S3, filter the results based on effective permissions, and return them to query engines.
The service-linked role provides necessary permissions to call other AWS services on your behalf.
When you need to grant more permissions than the service-linked role provides, use a user-defined role.
You can specify, grant, and revoke permissions on tables in the data catalog.
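Granting and revoking table permissions can be sketched with the Lake Formation `grant_permissions` and `revoke_permissions` calls, which accept the same arguments. The principal ARN, database, and table names below are hypothetical.

```python
def build_table_permission_request(principal_arn, database, table, permissions):
    """Build keyword arguments shared by grant_permissions and revoke_permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


def grant_then_revoke(principal_arn, database, table, permissions):
    """Grant and then revoke. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    lf = boto3.client("lakeformation")
    request = build_table_permission_request(
        principal_arn, database, table, permissions
    )
    lf.grant_permissions(**request)
    lf.revoke_permissions(**request)


# Hypothetical principal and table, for illustration only.
table_request = build_table_permission_request(
    "arn:aws:iam::111122223333:role/AnalystRole", "sales", "orders", ["SELECT"]
)
```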
Manage your AWS Glue Data Catalog objects and data locations in Amazon S3 using the Lake Formation permissions model.
Use Lake Formation tag-based access control (LF-TBAC) to manage permissions on a large number of data catalog resources.
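Tag-based access control can be sketched in two steps: define an LF-tag with `create_lf_tag`, then grant permissions on every table matching a tag expression through an `LFTagPolicy` resource. The tag key, tag values, and principal ARN below are hypothetical.

```python
def build_lf_tag(key, values):
    """Build an LF-tag as a key with its allowed values."""
    return {"TagKey": key, "TagValues": list(values)}


def build_tag_policy_grant(principal_arn, key, values, permissions):
    """Build grant_permissions arguments that target tables by LF-tag."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                # Tables carrying a matching tag value are covered by the grant.
                "Expression": [build_lf_tag(key, values)],
            }
        },
        "Permissions": list(permissions),
    }


def apply_tag_policy(principal_arn, key, values, permissions):
    """Create the tag and grant on it. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders work without boto3

    lf = boto3.client("lakeformation")
    # Assumption: the tag key does not exist yet in this catalog.
    lf.create_lf_tag(**build_lf_tag(key, values))
    return lf.grant_permissions(
        **build_tag_policy_grant(principal_arn, key, values, permissions)
    )


# Hypothetical tag and principal, for illustration only.
tag_grant = build_tag_policy_grant(
    "arn:aws:iam::111122223333:role/AnalystRole", "domain", ["sales"], ["SELECT"]
)
```

A single grant of this kind replaces one grant per table, which is why the approach suits catalogs with many resources.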
You can create a data filter to restrict access to specific columns, rows, or cells in query results.
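A data filter can be sketched with the Lake Formation `create_data_cells_filter` call, which combines a row filter expression with a list of visible columns. The catalog ID, table, filter name, expression, and column names below are hypothetical.

```python
def build_data_filter(catalog_id, database, table, filter_name, row_expression, columns):
    """Build keyword arguments for lakeformation.create_data_cells_filter."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            # Row-level restriction: only rows matching the expression are visible.
            "RowFilter": {"FilterExpression": row_expression},
            # Column-level restriction: only these columns are visible.
            "ColumnNames": list(columns),
        }
    }


def create_data_filter(*args):
    """Create the filter. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    lf = boto3.client("lakeformation")
    return lf.create_data_cells_filter(**build_data_filter(*args))


# Hypothetical filter: US rows only, two visible columns.
us_filter = build_data_filter(
    "111122223333", "sales", "orders", "us_orders_only",
    "country = 'US'", ["order_id", "amount"],
)
```

Combining a row expression with a column list in one filter is what yields cell-level restriction.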

AWS Lake Formation Pricing

You are charged for transaction requests and metadata storage.
You are charged for data filtering based on the number of bytes scanned by the Storage API.
You are charged based on the number of bytes processed by the storage optimizer.