In-Place Querying in AWS

Last updated on December 13, 2023

When you look at the Gartner Magic Quadrant for cloud service providers, AWS is still positioned as a leader in the cloud computing space. It offers an array of services that let companies and organizations manage and analyze massive data sets with great agility. Among them is in-place querying, a technique that AWS supports across several services and that is reshaping how data is processed in the cloud. In this article, we delve into in-place querying in AWS, exploring its mechanisms, its applications, and its impact on data management.

In today’s data-driven world, the ability to quickly and efficiently access and analyze information is crucial. Companies and organizations, no matter how big or small they are, rely on data when it comes to decision-making and strategic planning. AWS’s in-place querying capability further empowers these companies and organizations by offering them a powerful yet cost-effective solution to data handling and business intelligence.

Understanding In-Place Querying

In-place querying, at its core, is a method that allows users to run queries directly on data stored in its native format, without moving or transforming the data beforehand. This differs from the traditional approach, in which data is extracted, transformed, and loaded (ETL) into a separate system before it can be analyzed. In-place querying not only streamlines the process but also significantly reduces the time and resources required for data processing.

Primary Advantages of In-Place Querying

Efficiency — Queries are executed directly on the data source, minimizing the movement of data.
Speed — Rapid query execution allows for near real-time analysis.
Cost-effectiveness — Reduces the costs associated with data transformation and storage.

Key Services Supporting In-Place Querying

Amazon S3 Select — This service enables users to retrieve only the required data from objects in Amazon S3. This is useful for querying large objects, allowing users to avoid the cost and latency of loading entire files.
Amazon S3 Glacier Select — This service allows users to run queries directly on data archived in Amazon S3 Glacier, the long-term archival storage service of AWS. You can use it to analyze archived data without restoring and moving it first, which makes it a cost-effective way to access infrequently used data.
Amazon Athena — Athena is a serverless, interactive query service that lets users analyze data directly in Amazon S3 using standard SQL. It is ideal for ad-hoc querying and simplifies the analysis of large quantities of structured, semi-structured, and unstructured data.
Amazon Redshift Spectrum — This service allows users to run queries across data stored in Amazon Redshift and Amazon S3 data lakes. It is optimal for complex queries across large datasets, offering scalability and performance.

Since Amazon Athena and Amazon Redshift Spectrum share a common data catalog and common data formats, both services can query the same data assets. Athena is best suited to one-off or ad-hoc queries on data in Amazon S3, while Redshift Spectrum is better suited to more complex queries that run on a regular basis against data in Amazon S3, Amazon Redshift, or both.
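To make the shared catalog more concrete, here is a minimal Python sketch that only builds the SQL text involved; it does not connect to Athena or Redshift. It assumes a Glue Data Catalog database and table already exist. The names sales_db, orders, and spectrum_sales, as well as the IAM role ARN, are hypothetical placeholders and not part of the tutorial.

```python
def spectrum_external_schema_sql(schema_name: str, glue_database: str, iam_role_arn: str) -> str:
    """Return a Redshift statement that maps a Glue Data Catalog database
    to an external schema usable by Redshift Spectrum.

    Values are inserted as-is, so only pass trusted identifiers.
    """
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema_name}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{glue_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


def count_rows_sql(namespace: str, table: str) -> str:
    """Return a simple row-count query against namespace.table."""
    return f"SELECT COUNT(*) FROM {namespace}.{table}"


# Hypothetical names, for illustration only.
GLUE_DATABASE = "sales_db"
SPECTRUM_SCHEMA = "spectrum_sales"
ROLE_ARN = "arn:aws:iam::123456789012:role/ExampleSpectrumRole"

# One-time setup statement to run in Redshift.
redshift_setup = spectrum_external_schema_sql(SPECTRUM_SCHEMA, GLUE_DATABASE, ROLE_ARN)

# The same catalog table, addressed from each service.
athena_query = count_rows_sql(GLUE_DATABASE, "orders")      # run in Athena
spectrum_query = count_rows_sql(SPECTRUM_SCHEMA, "orders")  # run in Redshift

print(redshift_setup)
print(athena_query)
print(spectrum_query)
```

In Athena the table is addressed through the Glue database name, while in Redshift it is addressed through the external schema that points at that same database, so both queries read the same underlying files in Amazon S3.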

Each of these services plays a pivotal role in enabling efficient data analysis within the AWS ecosystem, as shown by their adoption in various industries for tasks ranging from log analysis to complex business intelligence.

Step-by-Step Tutorial: Amazon S3 Select vs Amazon Athena

Prerequisites

Set up a development environment or development account in AWS
Download this sample CSV.
Read and perform the tutorial in the previous article about Batch Data Ingestion Simplified in AWS.

Amazon S3 Select

In your AWS console, navigate to Amazon S3 and create a new bucket. Click Create bucket, then provide a globally unique name for your bucket and select a region. Leave the remaining settings as default and click Create bucket.

Open your newly created S3 bucket, then click Upload. Click Add files and select the sample CSV to upload that CSV file to your S3 bucket. Then click Upload.
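If you prefer to script the bucket creation and upload instead of using the console, the steps above can be sketched with boto3 roughly as follows. This is a sketch under assumptions: it presumes boto3 is installed and AWS credentials are configured, and the bucket name, region, and file path you pass in are your own. The helper create_bucket_params is a hypothetical name introduced here for illustration.

```python
from typing import Any, Dict


def create_bucket_params(bucket_name: str, region: str) -> Dict[str, Any]:
    """Build keyword arguments for the S3 CreateBucket call."""
    params: Dict[str, Any] = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a
    # LocationConstraint; every other region needs one.
    if region != "us-east-1":
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return params


def create_bucket_and_upload(bucket_name: str, region: str, local_path: str, key: str) -> None:
    """Create the bucket with default settings, then upload the local CSV."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_params(bucket_name, region))
    s3.upload_file(local_path, bucket_name, key)
```

Calling create_bucket_and_upload with a globally unique bucket name, a region, the path to the downloaded sample CSV, and an object key would mirror the two console steps. Nothing runs against AWS until that function is called.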

Select and open your newly uploaded file and click Object Actions. Then, click Query with S3 Select. Enter your SQL queries in the SQL query section and run them.
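The same kind of query can also be issued programmatically. The sketch below uses the boto3 select_object_content call and assumes the uploaded object is an uncompressed CSV with a header row; the helper names are hypothetical and exist only for this example. The column names of the sample CSV are not assumed, so the usage note sticks to SELECT *.

```python
from typing import Any, Dict, Iterable


def s3_select_params(bucket: str, key: str, expression: str) -> Dict[str, Any]:
    """Build keyword arguments for select_object_content on a CSV object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Assumption: the CSV is uncompressed and its first line is a header.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "NONE",
        },
        "OutputSerialization": {"CSV": {}},
    }


def collect_records(event_stream: Iterable[Dict[str, Any]]) -> str:
    """Join the payload bytes of every Records event and decode them once."""
    chunks = []
    for event in event_stream:
        if "Records" in event:  # other events (Stats, End, ...) are skipped
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")


def run_s3_select(bucket: str, key: str, expression: str) -> str:
    """Run an S3 Select query and return the matching rows as CSV text."""
    import boto3  # imported lazily; requires configured AWS credentials

    s3 = boto3.client("s3")
    response = s3.select_object_content(**s3_select_params(bucket, key, expression))
    return collect_records(response["Payload"])
```

For example, run_s3_select(bucket, key, "SELECT * FROM s3object s LIMIT 5") would return the first five data rows of the object, with only those rows leaving S3 rather than the whole file.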

Amazon Athena

Make sure you have performed the tutorial in the previous article entitled “Batch Data Ingestion Simplified in AWS”. Then, in the AWS console, navigate to Amazon Athena. You must first create an S3 bucket as a query result location.

In the query editor, click Data source and select AwsDataCatalog. For the database, select the AWS Glue Data Catalog database you created in the previous tutorial. You will then see the table that was generated when you ran your crawler in that tutorial. Enter your query, then click Run.
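For readers who want to run the Athena query from code rather than the query editor, here is a hedged boto3 sketch. It assumes the Glue database and table from the previous tutorial exist and that you pass in your own S3 query result location; the helper names are hypothetical. It reads only the first page of results, which is enough for small queries.

```python
import time
from typing import Any, Dict, List


def athena_query_params(query: str, database: str, output_location: str) -> Dict[str, Any]:
    """Build keyword arguments for start_query_execution."""
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Catalog": "AwsDataCatalog", "Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def rows_to_dicts(result_rows: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Convert result rows into dicts, treating the first row as the header.

    Missing (NULL) cells come back without a VarCharValue and map to "".
    """
    values = [[cell.get("VarCharValue", "") for cell in row["Data"]] for row in result_rows]
    if not values:
        return []
    header, body = values[0], values[1:]
    return [dict(zip(header, row)) for row in body]


def run_athena_query(query: str, database: str, output_location: str,
                     poll_seconds: float = 1.0) -> List[Dict[str, str]]:
    """Start a query, poll until it finishes, and return the first page of rows."""
    import boto3  # imported lazily; requires configured AWS credentials

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(
        **athena_query_params(query, database, output_location)
    )["QueryExecutionId"]

    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return rows_to_dicts(rows)
```

The output location plays the same role as the query result bucket configured in the console, for example an s3:// path inside the bucket you created for Athena results.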

Both S3 Select and Athena query data through SQL statements, but they differ in use case. Amazon S3 Select is scoped to simple queries against a single object, whereas Amazon Athena can handle complex queries across multiple files and formats. Athena also offers more advanced query capabilities, including joins and aggregations across multiple data sources.

Final Remarks

While in-place querying offers numerous benefits, it also comes with challenges. One of the primary concerns is query performance when the data sets are extremely large or the queries are complex. Optimizing query performance and managing costs become critical in such scenarios. Additionally, ensuring data security and regulatory compliance is a non-trivial part of deploying in-place querying solutions. If regulations require analytics processing to be done in a separate, isolated, and controlled environment, then in-place querying cannot be used.
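On the cost concern raised above: services such as Athena bill by the amount of data a query scans, so a rough estimate is simple arithmetic. The sketch below is illustrative only. The price of 5.0 USD per TB is a hypothetical figure, and treating a TB as 1024^4 bytes is an assumption; check the current pricing for your region before relying on either.

```python
TIB = 1024 ** 4  # bytes per TB; using the binary unit here is an assumption


def estimate_scan_cost(bytes_scanned: int, price_per_tb_usd: float) -> float:
    """Estimate the cost of a query from the number of bytes it scans."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / TIB * price_per_tb_usd


# Hypothetical price, for illustration only.
ILLUSTRATIVE_PRICE = 5.0

# A query that scans a full 2 TB data set versus one that reads about a
# tenth of it, for example because the data is partitioned or columnar.
full_scan = estimate_scan_cost(2 * TIB, ILLUSTRATIVE_PRICE)
pruned_scan = estimate_scan_cost(2 * TIB // 10, ILLUSTRATIVE_PRICE)

print(f"full scan: ${full_scan:.2f}, pruned scan: ${pruned_scan:.2f}")
```

The point of the comparison is that reducing the bytes scanned reduces the bill in direct proportion, which is why data layout matters as much as the query text.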

The future of in-place querying in AWS looks promising, with continuous advancements expected in areas like query optimization, cost efficiency, and integration with emerging technologies like machine learning and real-time analytics.

In-place querying in AWS represents a paradigm shift in data management and analysis. By enabling efficient, fast, and cost-effective querying of data directly where it is stored, AWS is not only streamlining processes but also opening new avenues for innovation in data analytics. As the technology continues to evolve, it will likely play a crucial role in the future of cloud computing and big data.

References:

https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/in-place-querying.html

https://aws.amazon.com/blogs/aws/s3-glacier-select/

https://docs.aws.amazon.com/athena/latest/ug/what-is.html

https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html

Written by: Iggy Yuson

Iggy is a DevOps engineer in the Philippines with a niche in cloud-native applications in AWS. He possesses extensive skills in developing full-stack solutions for both web and mobile platforms. His area of expertise lies in implementing serverless architectures in AWS. Outside of work, he enjoys playing basketball and competitive gaming.
