Last updated on November 14, 2024
Amazon Athena Cheat Sheet
- An interactive query service that makes it easy to analyze data directly in Amazon S3 and other data sources using SQL.
Features
- Athena is serverless.
- Has a built-in query editor.
- Uses Presto, an open source, distributed SQL query engine optimized for low latency, ad hoc analysis of data.
- Athena supports a wide variety of data formats such as CSV, JSON, ORC, Avro, or Parquet.
- Athena automatically executes queries in parallel, so that you get query results in seconds, even on large datasets.
- Athena uses Amazon S3 as its underlying data store, making your data highly available and durable.
- Athena integrates with Amazon QuickSight for easy data visualization.
- Athena integrates out-of-the-box with AWS Glue.
Athena uses a managed Data Catalog to store information and schemas about the databases and tables that you create for your data stored in S3.
Queries
- You can query geospatial data.
- You can query different kinds of logs as your datasets.
- Athena stores query results in S3.
- Athena retains query history for 45 days.
- Amazon Athena does support User-Defined Functions (UDFs). UDFs in Amazon Athena allow you to create custom functions to process records or groups of records. They are executed with AWS Lambda when used in an Athena query. However, Athena only supports scalar UDFs, which process one row at a time and return a single column value.
- Athena supports both simple data types such as INTEGER, DOUBLE, VARCHAR and complex data types such as MAPS, ARRAY, and STRUCT.
- Athena supports querying data in Amazon S3 Requester Pays buckets.
Athena Federated Queries
- Allows you to query data sources other than S3 buckets using a data connector.
- A data connector is implemented in a Lambda function that uses Athena Query Federation SDK.
- There are pre-built connectors available for some popular data sources, such as:
- MySQL, PostgreSQL, Oracle, SQL Server databases
- Amazon DynamoDB
- Amazon Managed Streaming for Apache Kafka (MSK)
- Amazon RedShift
- Amazon OpenSearch
- Amazon CloudWatch Logs and CloudWatch metrics
- Amazon DocumentDB
- Apache Kafka
- You can write your own data connector using the Athena Query Federation SDK if your data source is not natively supported by Athena.
- You may also customize the pre-built connectors to fit your use case.
Optimizing query performance
- Data partitioning. For instance, partitioning data based on column values such as date, country, and region makes it possible to limit the amount of data that needs to be scanned by a query.
- Converting data format into columnar formats such as Parquet and ORC
- Compressing files
- Making files splittable. Athena can read a splittable file in parallel; thus, the time it takes for a query to complete is faster.
- AVRO, Parquet, and Orc are splittable files regardless of the compression codec used
- Only text files (TSV, CSV, JSON, and custom SerDes for text) compressed with BZIP2 and LZO are splittable.
Cost controls
- You can create workgroups to isolate queries for teams, applications, or different workloads and enforce cost controls.
- There are two types of cost controls available in a workgroup:
- Per-query limit – specifies a threshold for the total amount of data scanned per query. Any query running in a workgroup is canceled once it exceeds the specified limit. Only one per-query limit can be created in a workgroup.
- Per-workgroup limit – this limits the total amount of data scanned by all queries running within a specific time frame. You can establish multiple limits based on hourly or daily data scan totals for queries within the workgroup.
Partition projection with Amazon Athena
Partition projection in Amazon Athena is a feature that helps improve query performance by enabling Athena to infer partition metadata based on predefined configuration instead of querying the AWS Glue Data Catalog. This can significantly reduce the time and cost of loading partition information, especially for large datasets with numerous partitions.
Common use cases of Partition projection in Amazon Athena:
- Queries against extensively partitioned tables experience slower completion times than desired.
- Users can define relative date ranges that adapt to incoming data, facilitating seamless integration.
Partition projection is easiest to set up when your partitions follow a consistent pattern, such as:
- Integers – Any continuous sequence of integers such as [1, 2, 3, 4, …, 1000]
- Dates – Any continuous sequence of dates or datetimes such as [20200101, 20200102, …, 20201231] .
- Enumerated values – A finite set of enumerated values such as airport codes or AWS Regions.
- AWS service logs – AWS service logs typically have a known structure whose partition scheme you can specify in AWS Glue and that Athena can, therefore, use for partition projection.
Partition projection in Athena eliminates the need to manually specify partitions in AWS Glue or an external Hive metastore. When enabled, it ignores partition metadata, projects non-existing partitions, and does not return errors for out-of-range queries. However, if too many partitions are empty, performance may be slower. It only works when querying through Athena, and some limitations apply, such as not being compatible with Lake Formation data filters.
Amazon Athena Security
- Control access to your data by using IAM policies, access control lists, and S3 bucket policies.
- If the files in the target S3 bucket are encrypted, you can perform queries on the encrypted data itself.
Amazon Athena Pricing
- You pay only for the queries that you run. You are charged based on the amount of data scanned by each query.
- You are not charged for failed queries.
- You can get significant cost savings and performance gains by compressing, partitioning, or converting your data to a columnar format because each of those operations reduces the amount of data that Athena needs to scan to execute a query.
AWS Knowledge Center Videos: How do I analyze my Amazon S3 server access logs using Amazon Athena?
Note: If you are studying for the AWS Certified Data Engineer Associate exam, we highly recommend that you take our AWS Certified Data Engineer Associate Practice Exams and read our Data Engineer Associate exam study guide.
Validate Your Knowledge
Question 1
A multinational corporation is using Amazon Athena to analyze the data sets stored in Amazon S3. The Data Analyst needs to implement a solution that will control the maximum amount of data scanned in the S3 bucket and ensure that if the query exceeded the limit, all the succeeding queries will be canceled.
Which of the following approach can be used to fulfill this requirement?
- Set up a workload management (WLM) assignment rule in the primary workgroup.
- Set data limits in the per query data usage control.
- Integrate API Gateway with Amazon Athena. Configure an account-level throttling to control the queries in the S3 bucket.
- Create an IAM policy that will throttle the data limits in the primary workgroup.
Question 2
A company is using Amazon Athena query with Amazon QuickSight to visualize the AWS CloudTrail logs. The Security Administrator created a custom Athena query that reads the CloudTrail logs and checks if there are IAM user accounts or credentials created in the past 29, 30 or 31 days (depending on the current month). However, the Administrator always gets an Insufficient Permissions
error whenever she tries to run the query from Amazon QuickSight.
What is the MOST suitable solution that the Administrator should do to fix this issue?
- Disable the Log File Integrity feature in AWS CloudTrail.
- Enable Cross-Origin Resource Sharing (CORS) in the S3 bucket that is used by Athena.
- Use the AWS Account Root User to run the Athena query from Amazon QuickSight.
- Make sure that Amazon QuickSight can access the S3 buckets used by Athena.
For more AWS practice exam questions with detailed explanations, visit the Tutorials Dojo Portal:
Amazon Athena Cheat Sheet References:
https://docs.aws.amazon.com/athena/latest/ug/
https://aws.amazon.com/athena/features
https://aws.amazon.com/athena/pricing
https://aws.amazon.com/athena/faqs