Amazon Redshift

  • A fully managed, petabyte-scale data warehouse service.
  • Redshift extends data warehouse queries to your data lake. You can run analytic queries against petabytes of data stored locally in Redshift, and directly against exabytes of data stored in S3.
  • Redshift is an online analytical processing (OLAP) type of database.
  • Currently, Redshift only supports Single-AZ deployments.
  • Features
    • Redshift uses columnar storage, data compression, and zone maps to reduce the amount of I/O needed to perform queries (see the table-design sketch after this list).
    • It uses a massively parallel processing data warehouse architecture to parallelize and distribute SQL operations.
    • Redshift uses machine learning to deliver high throughput based on your workloads.
    • Redshift uses result caching to deliver sub-second response times for repeat queries.
    • Redshift automatically and continuously backs up your data to S3. It can asynchronously replicate your snapshots to S3 in another region for disaster recovery.
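
To make the columnar storage, compression, and zone map points above concrete, here is a minimal, illustrative table definition with per-column compression encodings and a sort key (zone maps are built on sorted data blocks), submitted through the Redshift Data API with boto3. The cluster identifier, database, user, and table layout are placeholder assumptions, not values taken from the AWS documentation.

    import boto3

    # Placeholder identifiers: replace with your own cluster, database, and user.
    CLUSTER_ID = "analytics-cluster"
    DATABASE = "dev"
    DB_USER = "awsuser"

    # Columnar storage + compression: each column gets its own encoding.
    # The sort key lets Redshift's zone maps skip blocks that cannot match a filter.
    CREATE_SALES = """
    CREATE TABLE IF NOT EXISTS sales (
        sale_id     BIGINT        ENCODE az64,
        sale_date   DATE          ENCODE az64,
        customer_id INT           ENCODE az64,
        amount      DECIMAL(12,2) ENCODE az64,
        region      VARCHAR(32)   ENCODE lzo
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (sale_date);
    """

    data_api = boto3.client("redshift-data")

    # The Data API runs the DDL asynchronously and returns a statement Id.
    response = data_api.execute_statement(
        ClusterIdentifier=CLUSTER_ID,
        Database=DATABASE,
        DbUser=DB_USER,
        Sql=CREATE_SALES,
    )
    print("Submitted statement:", response["Id"])
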
  • Components

    • Cluster – a set of nodes, which consists of a leader node and one or more compute nodes.
      • Redshift creates one database when you provision a cluster. This is the database you use to load data and run queries on your data.
      • You can scale the cluster in or out by adding or removing nodes. Additionally, you can scale the cluster up or down by specifying a different node type (see the provisioning and resize sketch after the Components list).
      • Redshift assigns a 30-minute maintenance window at random from an 8-hour block of time per region, occurring on a random day of the week. During these maintenance windows, your cluster is not available for normal operations.
      • Redshift supports both the EC2-VPC and EC2-Classic platforms to launch a cluster. You create a cluster subnet group if you are provisioning your cluster in your VPC, which allows you to specify a set of subnets in your VPC.
    • Redshift Nodes
      • The leader node receives queries from client applications, parses the queries, and develops query execution plans. It then coordinates the parallel execution of these plans with the compute nodes and aggregates the intermediate results from these nodes. Finally, it returns the results back to the client applications.
      • Compute nodes execute the query execution plans and transmit data among themselves to serve these queries. The intermediate results are sent to the leader node for aggregation before being sent back to the client applications.
      • Node Type
        • Dense storage (DS) node type – for large data workloads; uses hard disk drive (HDD) storage.
        • Dense compute (DC) node type – optimized for performance-intensive workloads; uses SSD storage.
    • Parameter Groups – a group of parameters that apply to all of the databases that you create in the cluster. The default parameter group has preset values for each of its parameters, and it cannot be modified.
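
As a rough illustration of how the components above are provisioned and scaled with boto3: the sketch below creates a small cluster in a VPC subnet group and later resizes it by changing the node count and node type. The cluster identifier, credentials, subnet group name, and node counts are placeholders.

    import boto3

    redshift = boto3.client("redshift")

    # Provision a cluster (leader node + compute nodes).
    redshift.create_cluster(
        ClusterIdentifier="analytics-cluster",
        NodeType="dc2.large",                 # dense compute (SSD) node type
        ClusterType="multi-node",
        NumberOfNodes=2,
        MasterUsername="awsuser",
        MasterUserPassword="ChangeMe123!",    # placeholder; prefer Secrets Manager in practice
        DBName="dev",                         # the single database created with the cluster
        ClusterSubnetGroupName="analytics-subnet-group",  # required for a VPC launch
    )

    # Scale out/up later: change the number of nodes and/or the node type.
    redshift.resize_cluster(
        ClusterIdentifier="analytics-cluster",
        ClusterType="multi-node",
        NodeType="dc2.8xlarge",
        NumberOfNodes=4,
    )
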
  • Database Querying Options

    • Connect to your cluster and run queries on the AWS Management Console with the Query Editor.
    • Connect to your cluster through a SQL client tool using standard ODBC and JDBC connections.
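
Since Redshift is PostgreSQL-compatible at the wire protocol level, a psycopg2 connection (used here as a stand-in for the JDBC/ODBC clients mentioned above, not an option listed in the docs) can also reach the cluster endpoint on port 5439. The endpoint and credentials below are placeholders.

    import psycopg2

    # Placeholder endpoint and credentials: copy the real endpoint from the console.
    conn = psycopg2.connect(
        host="analytics-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
        port=5439,                      # default Redshift port
        dbname="dev",
        user="awsuser",
        password="ChangeMe123!",
    )

    with conn, conn.cursor() as cur:
        cur.execute("SELECT current_database(), version();")
        print(cur.fetchone())

    conn.close()
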
  • Enhanced VPC Routing

    • By using Enhanced VPC Routing, you can use VPC features to manage the flow of data between your cluster and other resources.
    • You can also use VPC flow logs to monitor COPY and UNLOAD traffic.
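
Enhanced VPC Routing is a cluster setting that can be switched on after creation. A minimal boto3 sketch (the cluster identifier is a placeholder):

    import boto3

    redshift = boto3.client("redshift")

    # Force COPY/UNLOAD traffic through the VPC so it shows up in VPC flow logs.
    redshift.modify_cluster(
        ClusterIdentifier="analytics-cluster",   # placeholder
        EnhancedVpcRouting=True,
    )
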
  • Redshift Spectrum

    • Enables you to run queries against exabytes of data in S3 without having to load or transform any data.
    • Redshift Spectrum doesn’t use Enhanced VPC Routing.
    • If you store data in a columnar format, Redshift Spectrum scans only the columns needed by your query, rather than processing entire rows.
    • If you compress your data using one of Redshift Spectrum’s supported compression algorithms, less data is scanned.
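
To sketch the Spectrum points above, the statements below register an external schema backed by the AWS Glue Data Catalog and an external table over Parquet files in S3, so queries scan only the referenced columns. The schema name, Glue database, IAM role ARN, S3 location, and cluster identifiers are placeholder assumptions.

    import boto3

    data_api = boto3.client("redshift-data")

    DDL_SCHEMA = """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG
    DATABASE 'sales_lake'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """

    DDL_TABLE = """
    CREATE EXTERNAL TABLE spectrum.sales_events (
        event_id   BIGINT,
        event_date DATE,
        amount     DECIMAL(12,2)
    )
    STORED AS PARQUET
    LOCATION 's3://example-data-lake/sales_events/';
    """

    # Submit each DDL statement separately through the Data API.
    for sql in (DDL_SCHEMA, DDL_TABLE):
        data_api.execute_statement(
            ClusterIdentifier="analytics-cluster",   # placeholder
            Database="dev",
            DbUser="awsuser",
            Sql=sql,
        )
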
  • Cluster Snapshots

    • Point-in-time backups of a cluster. There are two types of snapshots: automated and manual. Snapshots are stored internally in S3 using an encrypted SSL connection.
    • Redshift periodically takes incremental snapshots that track changes to the cluster since the previous snapshot.
    • Redshift provides free storage for snapshots that is equal to the storage capacity of your cluster until you delete the cluster. After you reach the free snapshot storage limit, you are charged for any additional storage at the normal rate.
    • Automated snapshots are enabled by default when you create a cluster. These snapshots are deleted at the end of a retention period, which is one day by default, but you can modify it. You cannot delete an automated snapshot manually.
    • By default, manual snapshots are retained indefinitely, even after you delete your cluster.
    • You can share an existing manual snapshot with other AWS accounts by authorizing access to the snapshot.
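
A minimal boto3 sketch of the snapshot operations above: taking a manual snapshot, enabling cross-region snapshot copy for disaster recovery, and sharing a manual snapshot with another account. All identifiers and the account ID are placeholders.

    import boto3

    redshift = boto3.client("redshift")

    # Take a manual snapshot (retained until you delete it).
    redshift.create_cluster_snapshot(
        SnapshotIdentifier="analytics-manual-snap-01",   # placeholder
        ClusterIdentifier="analytics-cluster",
    )

    # Asynchronously copy snapshots to another region for DR.
    redshift.enable_snapshot_copy(
        ClusterIdentifier="analytics-cluster",
        DestinationRegion="us-west-2",
        RetentionPeriod=7,        # days to keep copied automated snapshots
    )

    # Share the manual snapshot with another AWS account.
    redshift.authorize_snapshot_access(
        SnapshotIdentifier="analytics-manual-snap-01",
        AccountWithRestoreAccess="111122223333",   # placeholder account ID
    )
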
  • Monitoring

    • Use the database audit logging feature to track information about authentication attempts, connections, disconnections, changes to database user definitions, and queries run in the database. The logs are stored in S3 buckets (see the monitoring sketch after this list).
    • Redshift tracks events and retains information about them for a period of several weeks in your AWS account.
    • Redshift provides performance metrics and data so that you can track the health and performance of your clusters and databases. It uses CloudWatch metrics to monitor the physical aspects of the cluster, such as CPU utilization, latency, and throughput.
    • Query/Load performance data helps you monitor database activity and performance.
    • When you create a cluster, you can optionally configure a CloudWatch alarm to monitor the average percentage of disk space that is used across all of the nodes in your cluster, referred to as the default disk space alarm.
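
A rough sketch of two of the monitoring features above: enabling audit logging to S3 and creating a CloudWatch alarm on average cluster disk usage, similar to the default disk space alarm. The bucket name, cluster identifier, and threshold are placeholder assumptions.

    import boto3

    redshift = boto3.client("redshift")
    cloudwatch = boto3.client("cloudwatch")

    # Ship audit logs (connections, user changes, queries) to an S3 bucket.
    redshift.enable_logging(
        ClusterIdentifier="analytics-cluster",        # placeholder
        BucketName="example-redshift-audit-logs",     # placeholder bucket
        S3KeyPrefix="audit/",
    )

    # Alarm on average disk usage across the cluster nodes.
    cloudwatch.put_metric_alarm(
        AlarmName="analytics-cluster-disk-space",
        Namespace="AWS/Redshift",
        MetricName="PercentageDiskSpaceUsed",
        Dimensions=[{"Name": "ClusterIdentifier", "Value": "analytics-cluster"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
    )
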
  • Security

    • By default, an Amazon Redshift cluster is only accessible to the AWS account that creates the cluster.
    • Use IAM to create user accounts and manage permissions for those accounts to control cluster operations.
    • If you are using the EC2-Classic platform for your Redshift cluster, you must use Redshift security groups.
    • If you are using the EC2-VPC platform for your Redshift cluster, you must use VPC security groups.
    • When you provision the cluster, you can optionally choose to encrypt the cluster for additional security. Encryption is an immutable property of the cluster.
    • Snapshots created from the encrypted cluster are also encrypted.
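
A minimal sketch of provisioning a cluster with the security options above (encryption chosen at creation time, VPC security groups, no public access). The KMS key ARN, security group ID, and credentials are placeholders.

    import boto3

    redshift = boto3.client("redshift")

    # Provision an encrypted cluster inside a VPC (see the encryption bullet above).
    redshift.create_cluster(
        ClusterIdentifier="secure-analytics-cluster",
        NodeType="dc2.large",
        ClusterType="single-node",
        MasterUsername="awsuser",
        MasterUserPassword="ChangeMe123!",   # placeholder
        Encrypted=True,
        KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
        VpcSecurityGroupIds=["sg-0123456789abcdef0"],   # EC2-VPC platform
        PubliclyAccessible=False,
    )
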
  • Pricing

    • You pay an hourly rate based on the type and number of nodes in your cluster.
    • You pay for the number of bytes scanned by Redshift Spectrum.
    • You can save costs by reserving nodes, committing to use Redshift for a 1- or 3-year term.
  • Limits per Region

    Resource               Default Limit
    Nodes per cluster      101
    Nodes                  200
    Reserved Nodes         200
    Snapshots              20
    Parameter Groups       20
    Event Subscriptions    20

    • The maximum number of tables is 9,900 for large and xlarge cluster node types and 20,000 for 8xlarge cluster node types.
    • The number of user-defined databases you can create per cluster is 60.
    • The number of concurrent user connections that can be made to a cluster is 500.
    • The number of AWS accounts you can authorize to restore a snapshot is 20 for each snapshot and 100 for each AWS KMS key.

Deep Dive and Best Practices for Amazon Redshift:

 

Validate Your Knowledge

Question 1

A company is using Redshift for its online analytical processing (OLAP) application, which processes complex queries against large datasets. There is a requirement to define the number of query queues that are available and how queries are routed to those queues for processing.

Which of the following will you use to meet this requirement?

  1. This is not possible with Redshift because it is not intended for OLAP applications but rather for OLTP. Use an RDS database instead.
  2. Create a Lambda function that can accept the number of query queues and use this value to control Redshift.
  3. Use the workload management (WLM) in the parameter group configuration.
  4. This is not possible with Redshift because it is not intended for OLAP applications but rather for OLTP. Use a NoSQL DynamoDB database instead.

Correct Answer: 3

When you create a parameter group, the default WLM configuration contains one queue that can run up to five queries concurrently. You can add additional queues and configure WLM properties in each of them if you want more control over query processing. Each queue that you add has the same default WLM configuration until you configure its properties. When you add additional queues, the last queue in the configuration is the default queue. Unless a query is routed to another queue based on criteria in the WLM configuration, it is processed by the default queue. You cannot specify user groups or query groups for the default queue.

As with other parameters, you cannot modify the WLM configuration in the default parameter group. Clusters associated with the default parameter group always use the default WLM configuration. If you want to modify the WLM configuration, you must create a parameter group and then associate that parameter group with any clusters that require your custom WLM configuration.

Option 3 is correct. In Amazon Redshift, you use workload management (WLM) to define the number of query queues that are available, and how queries are routed to those queues for processing. WLM is part of parameter group configuration. A cluster uses the WLM configuration that is specified in its associated parameter group.

Options 1 and 4 are incorrect. Redshift is a good choice if you want to perform OLAP workloads in the cloud. In contrast, RDS and DynamoDB are more suitable for OLTP applications.

Option 2 is incorrect since it would be too costly and inefficient to use Lambda for this. Workload management (WLM) is a built-in Redshift feature that addresses this requirement directly.
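
For illustration, a rough boto3 sketch of the steps described above: create a custom parameter group, set its wlm_json_configuration parameter to define the queues, and associate the group with the cluster. The group name, queue definitions, and cluster identifier are placeholder assumptions.

    import boto3
    import json

    redshift = boto3.client("redshift")

    # Custom parameter group (the default parameter group cannot be modified).
    redshift.create_cluster_parameter_group(
        ParameterGroupName="custom-wlm-params",      # placeholder
        ParameterGroupFamily="redshift-1.0",
        Description="Custom WLM queues",
    )

    # Two user-defined queues plus the trailing default queue.
    wlm_config = [
        {"query_group": ["reports"], "query_concurrency": 5},
        {"user_group": ["etl_users"], "query_concurrency": 3},
        {"query_concurrency": 5},   # default queue: no user/query groups allowed
    ]

    redshift.modify_cluster_parameter_group(
        ParameterGroupName="custom-wlm-params",
        Parameters=[{
            "ParameterName": "wlm_json_configuration",
            "ParameterValue": json.dumps(wlm_config),
        }],
    )

    # Associate the parameter group with the cluster so the WLM config takes effect
    # (a cluster reboot is typically required for the change to apply).
    redshift.modify_cluster(
        ClusterIdentifier="analytics-cluster",        # placeholder
        ClusterParameterGroupName="custom-wlm-params",
    )
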

Reference:
https://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html

Question 2

You are working as a Systems Administrator for a large audit firm where you have an assignment to tightly manage the flow of data between your Amazon Redshift cluster and your other AWS resources. The IT Security team of your company instructed you to use VPC flow logs to monitor all the COPY and UNLOAD traffic of your Redshift cluster.

How can you implement this solution in AWS?

  1. Enable Audit Logging in your Amazon Redshift cluster.
  2. Enable Enhanced VPC routing on your Amazon Redshift cluster.
  3. Use the Amazon Redshift Spectrum feature.
  4. Create a new flow log that tracks the traffic of your Amazon Redshift cluster.

Correct Answer: 2

When you use Amazon Redshift Enhanced VPC Routing, Amazon Redshift forces all COPY and UNLOAD traffic between your cluster and your data repositories through your Amazon VPC. By using Enhanced VPC Routing, you can use standard VPC features, such as VPC security groups, network access control lists (ACLs), VPC endpoints, VPC endpoint policies, internet gateways, and Domain Name System (DNS) servers. Hence, enabling Enhanced VPC routing on your Amazon Redshift cluster is the correct answer.

You use these features to tightly manage the flow of data between your Amazon Redshift cluster and other resources. When you use Enhanced VPC Routing to route traffic through your VPC, you can also use VPC flow logs to monitor COPY and UNLOAD traffic. If Enhanced VPC Routing is not enabled, Amazon Redshift routes traffic through the Internet, including traffic to other services within the AWS network.

Enabling Audit Logging in your Amazon Redshift cluster is incorrect because the Audit Logging feature is primarily used to get the information about the connection, queries, and user activities in your Redshift cluster.

Using the Amazon Redshift Spectrum feature is incorrect because the Redshift Spectrum is primarily used to run queries against exabytes of unstructured data in Amazon S3, with no loading or ETL required.

Creating a new flow log that tracks the traffic of your Amazon Redshift cluster is incorrect because, by default, you cannot create a flow log for your Amazon Redshift cluster. You have to enable Enhanced VPC Routing and set up the required VPC configuration.
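
A rough sketch of the two steps combined: enable Enhanced VPC Routing on the cluster, then create a VPC flow log so the COPY and UNLOAD traffic becomes visible. The cluster ID, VPC ID, log group name, and IAM role ARN are placeholders.

    import boto3

    redshift = boto3.client("redshift")
    ec2 = boto3.client("ec2")

    # Step 1: route COPY/UNLOAD traffic through the VPC.
    redshift.modify_cluster(
        ClusterIdentifier="analytics-cluster",   # placeholder
        EnhancedVpcRouting=True,
    )

    # Step 2: capture the VPC traffic with a flow log delivered to CloudWatch Logs.
    ec2.create_flow_logs(
        ResourceIds=["vpc-0123456789abcdef0"],   # placeholder VPC
        ResourceType="VPC",
        TrafficType="ALL",
        LogGroupName="redshift-vpc-flow-logs",
        DeliverLogsPermissionArn="arn:aws:iam::123456789012:role/flow-logs-role",  # placeholder
    )
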

Reference: 
https://docs.aws.amazon.com/redshift/latest/mgmt/enhanced-vpc-routing.html

For more AWS practice exam questions with detailed explanations, check this out:

Tutorials Dojo AWS Practice Exams


Additional Training Materials: Amazon Redshift Video Courses on Udemy

  1. Hands-on with Amazon Redshift by Infinite Skills
  2. AWS Serverless Analytics: Glue, Redshift, Athena, QuickSight by Siddharth Mehta
  3. Mastering Amazon Redshift – Development and Administration by Siddharth Mehta

 

Sources:
https://docs.aws.amazon.com/redshift/latest/mgmt/
https://aws.amazon.com/redshift/features/
https://aws.amazon.com/redshift/pricing/
https://aws.amazon.com/redshift/faqs/
