Amazon Redshift Cheat Sheet

A fully managed, petabyte-scale data warehouse service.
Redshift extends data warehouse queries to your data lake. You can run analytic queries against petabytes of data stored locally in Redshift, and directly against exabytes of data stored in S3.
RedShift is an OLAP type of DB.

Currently, Redshift only supports Single-AZ deployments.

Features
- Redshift uses columnar storage, data compression, and zone maps to reduce the amount of I/O needed to perform queries.
- It uses a massively parallel processing data warehouse architecture to parallelize and distribute SQL operations.
- Redshift uses machine learning to deliver high throughput based on your workloads.
- Redshift uses result caching to deliver sub-second response times for repeat queries.
- Redshift automatically and continuously backs up your data to S3. It can asynchronously replicate your snapshots to S3 in another region for disaster recovery.

Components

- Cluster – a set of nodes, which consists of a leader node and one or more compute nodes.
  - Redshift creates one database when you provision a cluster. This is the database you use to load data and run queries on your data.
  - You can scale the cluster in or out by adding or removing nodes. Additionally, you can scale the cluster up or down by specifying a different node type.
  - Redshift assigns a 30-minute maintenance window at random from an 8-hour block of time per region, occurring on a random day of the week. During these maintenance windows, your cluster is not available for normal operations.
  - Redshift supports both the EC2–VPC and EC2-Classic platforms to launch a cluster. You create a cluster subnet group if you are provisioning your cluster in your VPC, which allows you to specify a set of subnets in your VPC.
- Redshift Nodes
  - The leader node receives queries from client applications, parses the queries, and develops query execution plans. It then coordinates the parallel execution of these plans with the compute nodes and aggregates the intermediate results from these nodes. Finally, it returns the results back to the client applications.
  - Compute nodes execute the query execution plans and transmit data among themselves to serve these queries. The intermediate results are sent to the leader node for aggregation before being sent back to the client applications.
  - Node Type
    - Dense storage (DS) node type – for large data workloads and use hard disk drive (HDD) storage.
    - Dense compute (DC) node types – optimized for performance-intensive workloads. Uses SSD storage.
- Parameter Groups – a group of parameters that apply to all of the databases that you create in the cluster. The default parameter group has preset values for each of its parameters, and it cannot be modified.
Database Querying Options
- Connect to your cluster and run queries on the AWS Management Console with the Query Editor.
- You can use the Query editor with Redshift clusters enabled and with enhanced VPC routing. Leverage AWS Secrets Manager to store your cluster credentials and use that with the Query Editor.
- Connect to your cluster through a SQL client tool using standard ODBC and JDBC connections.
Enhanced VPC Routing
- By using Enhanced VPC Routing, you can use VPC features to manage the flow of data between your cluster and other resources.
- You can also use VPC flow logs to monitor COPY and UNLOAD traffic.
RedShift Spectrum
- Enables you to run queries against exabytes of data in S3 without having to load or transform any data.
- Redshift Spectrum supports Enhanced VPC Routing.
- If you store data in a columnar format, Redshift Spectrum scans only the columns needed by your query, rather than processing entire rows.
- If you compress your data using one of Redshift Spectrum’s supported compression algorithms, less data is scanned.

RedShift Streaming Ingestion
- Allows you to consume and process data directly from a streaming source to a Redshift cluster using SQL.
- Streaming ingestion eliminates the need for staging data in Amazon S3, which gives you a low-latency, high-speed ingestion.
- Valid data source:
  - - Amazon Kinesis Data Streams
    - Amazon Managed Streaming for Apache Kafka (MSK)

Redshift ML
- Allows you to train and deploy machine learning models using the data stored in your Amazon Redshift cluster through a simple CREATE MODEL SQL statement.
- You can make in-database local inferences using SQL, eliminating the need to move data between Redshift and other storage services like Amazon S3.
- Redshift ML uses Amazon SageMaker Autopilot behind the scenes to find the best model based on your input data.

Redshift Data Sharing
- Redshift Data Sharing is a secure way to share live data across Redshift clusters within an AWS account, without the need to copy or move data.
- Data Sharing provides live access to the data so that your users always see the most up-to-date and consistent information as it is updated in the data warehouse.
- Can be used on Redshift RA3 clusters at no additional cost.
Redshift Cross-Database Query
- Redshift Cross-database queries provide the ability to query across databases in a Redshift cluster, regardless of which database you are connected to.
- Available on Redshift RA3 node types at no additional cost.
Cluster Snapshots
- Point-in-time backups of a cluster. There are two types of snapshots: automated and manual. Snapshots are stored in S3 using SSL.
- Redshift periodically takes incremental snapshots of your data every 8 hours or 5 GB per node of data change.
- Redshift provides free storage for snapshots that is equal to the storage capacity of your cluster until you delete the cluster. After you reach the free snapshot storage limit, you are charged for any additional storage at the normal rate.
- Automated snapshots are enabled by default when you create a cluster. These snapshots are deleted at the end of a retention period, which is one day, but you can modify it. You cannot delete an automated snapshot manually.
- By default, manual snapshots are retained indefinitely, even after you delete your cluster.
- You can share an existing manual snapshot with other AWS accounts by authorizing access to the snapshot.
- You can configure Amazon Redshift to automatically copy snapshots (automated or manual) for a cluster to another AWS Region. For automated snapshots, you can also specify the retention period to keep them in the destination AWS Region. The default retention period for copied snapshots is seven days.
- If you store a copy of your snapshots in another AWS Region, you can restore your cluster from recent data if anything affects the primary AWS Region. You can configure your cluster to copy snapshots to only one destination AWS Region at a time.

Amazon Redshift Monitoring

- Use the database audit logging feature to track information about authentication attempts, connections, disconnections, changes to database user definitions, and queries run in the database. The logs are stored in S3 buckets.
- Redshift tracks events and retains information about them for a period of several weeks in your AWS account.
- Redshift provides performance metrics and data so that you can track the health and performance of your clusters and databases. It uses CloudWatch metrics to monitor the physical aspects of the cluster, such as CPU utilization, latency, and throughput.
- Query/Load performance data helps you monitor database activity and performance.
- When you create a cluster, you can optionally configure a CloudWatch alarm to monitor the average percentage of disk space that is used across all of the nodes in your cluster, referred to as the default disk space alarm.

Amazon Redshift Security

- By default, an Amazon Redshift cluster is only accessible to the AWS account that creates the cluster.
- Use IAM to create user accounts and manage permissions for those accounts to control cluster operations.
- If you are using the EC2-Classic platform for your Redshift cluster, you must use Redshift security groups.
- If you are using the EC2-VPC platform for your Redshift cluster, you must use VPC security groups.
- When you provision the cluster, you can optionally choose to encrypt the cluster for additional security. Encryption is an immutable property of the cluster.
- Snapshots created from the encrypted cluster are also encrypted.

Amazon Redshift Pricing

- You pay a per-second billing rate based on the type and number of nodes in your cluster.
- You pay for the number of bytes scanned by RedShift Spectrum
- You can reserve instances by committing to using Redshift for a 1 or 3-year term and save costs.

Deep Dive and Best Practices for Amazon Redshift:

Note: If you are studying for the AWS Certified Database Specialty exam, we highly recommend that you take our AWS Certified Database – Specialty Practice Exams and read our Database Specialty exam study guide.

Validate Your Knowledge

Question 1

A financial services company based in Australia uses Amazon Redshift for its data-warehousing solutions. They have expanded recently in Singapore and is designing a solution that would integrate queries between data from an Amazon RDS for PostgreSQL database in Singapore with data from the Redshift cluster in Sydney.

Which solution will best simplify the integration between these data sources?

Configure cross-regional snapshots with Redshift cluster and restore a new cluster with the latest snapshot in Singapore. Export the PostgreSQL tables to an S3 bucket and load it to the new Amazon Redshift cluster.
Configure cross-regional snapshots with Redshift cluster and restore a new cluster with the latest snapshot in Singapore. Export the PostgreSQL tables to an S3 bucket. Create an external schema and external tables from the S3 files and use Redshift Spectrum to query from S3 and the new Redshift cluster.
Set up connectivity from your Amazon Redshift cluster to your Amazon RDS PostgreSQL cluster. Create an external schema from the PostgreSQL database and use federated queries to access both sources.
Export the RDS PostgreSQL tables to an S3 bucket. Create an external schema and external tables from the S3 files and use Redshift Spectrum to query from S3 and the new Redshift cluster.

Show me the answer!

Correct Answer: 3

Previously, you need to extract data from your PostgreSQL database to Amazon Simple Storage Service (Amazon S3) and load it to Amazon Redshift using COPY, or query it from Amazon S3 with Amazon Redshift Spectrum. Now, Amazon Redshift Federated Query enables you to use the analytic power of Amazon Redshift to directly query data stored in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL databases.

With Federated queries, you can query and analyze data across operational databases, data warehouses, and data lakes. This feature can integrate queries from Amazon Redshift on live data in external databases with queries across your Amazon Redshift and Amazon S3 environments.

In some cases, you might access an Amazon RDS or Aurora database in a different AWS region than Amazon Redshift. In these cases, you typically incur network latency and billing charges for transferring data across AWS regions. However, the scenario asks for a solution that simplifies the integration between the two sources.

Therefore, the correct answer is: Set up connectivity from your Amazon Redshift cluster to your Amazon RDS PostgreSQL cluster. Create an external schema from the PostgreSQL database and use federated queries to access both sources.

The three other options are all incorrect:

– Configure cross-regional snapshots with Redshift cluster and restore a new cluster with the latest snapshot in Singapore. Export the PostgreSQL tables to an S3 bucket and load it to the new Amazon Redshift cluster

– Configure cross-regional snapshots with Redshift cluster and restore a new cluster with the latest snapshot in Singapore. Export the PostgreSQL tables to an S3 bucket. Create an external schema and external tables from the S3 files and use Redshift Spectrum to query from S3 and the new Redshift cluster.

– Export the RDS PostgreSQL tables to an S3 bucket. Create an external schema and external tables from the S3 files and use Redshift Spectrum to query from S3 and the new Redshift cluster.

Although they are working solutions, using Federated Queries is the most simplified solution among the choices.

References:
https://docs.aws.amazon.com/redshift/latest/dg/federated-overview.html
https://aws.amazon.com/blogs/big-data/amazon-redshift-federated-query-best-practices-and-performance-considerations/

Note: This question was extracted from our AWS Certified Database Specialty Practice Exams.

Question 2

A Database Specialist manages an Amazon Redshift cluster for the company’s data warehousing solution. Keeping track of the maintenance tasks that run on the cluster, the Database Manager wants to receive e-mail notifications as soon as the cluster goes into and outside of maintenance mode and is advised if the maintenance was customer-initiated.

How should the Database Specialist meet this requirement with the least operational effort?

Using the Redshift console, create an event subscription that sends a notification for Management events.

Work with Amazon Redshift’s performance data metrics and create an alarm whenever the Maintenance Mode unit value is 1. Use Amazon SNS to send an e-mail.
Write an AWS Lambda function to trigger on AWS CloudTrail API calls. Filter on specific Redshift API calls and create an Amazon SNS topic to send the notifications.
Create an Amazon EventBridge rule with the operations that need to be tracked on the Redshift Cluster. Create an AWS Lambda function to act on these rules and use Amazon SNS to send an e-mail.

Show me the answer!

Correct Answer: 1

Amazon Redshift uses the Amazon Simple Notification Service (Amazon SNS) to communicate notifications of Amazon Redshift events. You enable notifications by creating an Amazon Redshift event subscription. In the Amazon Redshift subscription, you specify a set of filters for Amazon Redshift events and an Amazon SNS topic.

Whenever an event occurs that matches the filter criteria, Amazon Redshift publishes a notification message to the Amazon SNS topic. Amazon SNS then transmits the message to an Amazon SNS consumer with an Amazon SNS subscription to the topic. Amazon SNS can support the messages sent to the Amazon SNS consumers for an AWS Region, such as an e-mail, a text message, or a call to an HTTP endpoint.

When you create an event notification subscription, you specify one or more event filters. Amazon Redshift sends notifications through the subscription at any time an event matches all of the filter criteria.

You can set up event notifications for Amazon Redshift clusters to receive notifications regarding on-going or completed maintenance. After you create the proper event subscription, you receive notifications with the following event codes:

– REDSHIFT-EVENT-2003 for Amazon Redshift maintenance in progress.

– REDSHIFT-EVENT-2004 for Amazon Redshift maintenance that is complete.

Note: Amazon Redshift scheduled maintenance doesn’t trigger an event.

All choices can help notify the Manager whenever the cluster is in maintenance mode. However, the solution that addresses the requirement with the least operational effort is the option that says: Using the Redshift console, create an event subscription that sends a notification for Management events.

The option that says: Create an Amazon EventBridge rule with the operations that need to be tracked on the Redshift Cluster. Create an AWS Lambda function to act on these rules and use Amazon SNS to send an e-mail is incorrect. Although this is a possible solution, it requires more effort to implement the solution.

The option that says: Work with Amazon Redshift’s performance data metrics and create an alarm whenever the Maintenance Mode unit value is 1. Use Amazon SNS to send an e-mail is incorrect. Although the solution will work and does not need too much effort to implement, CloudWatch Alarms cannot identify if the maintenance mode was customer-initiated.

The option that says: Write an AWS Lambda function to trigger on AWS CloudTrail API calls. Filter on specific Redshift API calls and create an Amazon SNS topic to send the notifications is incorrect. Although this is a possible solution, it requires more effort to implement the solution.

Reference:
https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-event-notifications.html
https://aws.amazon.com/premiumsupport/knowledge-center/notification-maintenance-rds-redshift/
https://docs.aws.amazon.com/redshift/latest/mgmt/performance-metrics-perf.html
https://docs.aws.amazon.com/redshift/latest/mgmt/performance-metrics-alarms.html

Note: This question was extracted from our AWS Certified Database Specialty Practice Exams.

For more AWS practice exam questions with detailed explanations, check out the Tutorials Dojo Portal:

Additional Training Materials: Amazon Redshift Video Courses on Udemy