Ends in
00
days
00
hrs
00
mins
00
secs
LEARN MORE

72-Hour Flash Sale! Get our AZ-900, AZ-104, and GCP ACE Practice Exams at Super Low Prices

AWS Glue

  • A fully managed service to extract, transform, and load (ETL) your data for analytics.
  • Discover and search across different AWS data sets without moving your data.
  • AWS Glue consists of:
    • Central metadata repository
    • ETL engine
    • Flexible scheduler

Use Cases

  • Run queries against an Amazon S3 data lake
    • You can use AWS Glue to make your data available for analytics without moving your data.
  • Analyze the log data in your data warehouse
    • Create ETL scripts to transform, flatten, and enrich the data from source to target.
  • Create event-driven ETL pipelines
    • As soon as new data becomes available in Amazon S3, you can run an ETL job by invoking AWS Glue ETL jobs using an AWS Lambda function.
  • A unified view of your data across multiple data stores
    • With AWS Glue Data Catalog, you can quickly search and discover all your datasets and maintain the relevant metadata in one central repository.

Concepts

  • AWS Glue Data Catalog
    • A persistent metadata store.
    • The data that is used as sources and targets of your ETL jobs are stored in the data catalog.
    • You can only use one data catalog per region.
    • AWS Glue Data catalog can be used as the Hive metastore.
    • It can contain database and table resource links.
  • IT Certification Category (English)728x90
  • Database
    • A set of associated table definitions, organized into a logical group.
    • A container for tables that define data from different data stores.
    • If the database is deleted from the Data Catalog, all the tables in the database are also deleted.
    • A link to a local or shared database is called a database resource link.
  • Data store, data source, and data target
    • To persistently store your data in a repository, you can use a data store.
      • Data stores: Amazon S3, Amazon RDS, Amazon Redshift, Amazon DynamoDB, JDBC
    • The data source is used as input to a process or transform.
    • A location where the data store process or transform writes to is called a data target.
  • Table
    • The metadata definition that represents your data.
    • You can define tables using JSON, CSV, Parquet, Avro, and XML.
    • You can use the table as the source or target in a job definition.
    • A link to a local or shared table is called a table resource link.
    • To add a table definition:
      • Run a crawler.
      • Create a table manually using the AWS Glue console.
      • Use AWS Glue API CreateTable operation.
      • Use AWS CloudFormation templates.
      • Migrate the Apache Hive metastore
    • A partitioned table describes an AWS Glue table definition of an Amazon S3 folder.
    • Reduce the overall data transfers, processing, and query processing time with PartitionIndexes.
  • Connection
    • It contains the properties that you need to connect to your data.
    • To store connection information for a data store, you can add a connection using:
      • JDBC
      • Amazon RDS
      • Amazon Redshift
      • Amazon DocumentDB
      • MongoDB
      • Kafka
      • Network
    • You can enable SSL connection for JDBC, Amazon RDS, Amazon Redshift, and MongoDB.
  • Crawler
    • You can use crawlers to populate the AWS Glue Data Catalog with tables.
    • Crawlers can crawl file-based and table-based data stores.
      • Data stores: S3, JDBC, DynamoDB, Amazon DocumentDB, and MongoDB
    • It can crawl multiple data stores in a single run.
    • How Crawlers work
      • Determine the format, schema, and associated properties of the raw data by classifying the data – create a custom classifier to configure the results of the classification.
      • Group the data into tables or partitions – you can group the data based on the crawler heuristics.
      • Writes metadata to the AWS Glue Data Catalog – set up how the crawler adds, updates, and deletes tables and partitions.
    • For incremental datasets with a stable table schema, you can use incremental crawls. It only crawls the folders that were added since the last crawler run.
    • You can run a crawler on-demand or based on a schedule.
    • Select the Logs link to view the results of the crawler. The link redirects you to CloudWatch Logs.
  • Classifier
    • It reads the data in the data store.
    • You can use a set of built-in classifiers or create a custom classifier.
    • By adding a classifier, you can determine the schema of your data.
    • Custom classifier types: Grok, XML, JSON and CSV
  • AWS Glue Studio
    • Visually author, run, view, and edit your ETL jobs.
    • Diagnose, debug, and check the status of your ETL jobs.
  • Job
    • To perform ETL works, you need to create a job.
    • When creating a job, you need to provide data sources, targets, and other information. The result will be generated in a PySpark script and store the job definition in the AWS Glue Data Catalog.
    • Job types: Spark, Streaming ETL, and Python shell
    • Job properties:
      • Job bookmarks maintain the state information and prevent the reprocessing of old data.
      • Job metrics allows you to enable or disable the creation of CloudWatch metrics when the job runs.
      • Security configuration helps you define the encryption options of the ETL job.
      • Worker type is the predefined worker that is allocated when a job runs.
        • Standard
        • G.1X (memory-intensive jobs)
        • G.2X (jobs with ML transforms)
      • Max concurrency is the maximum number of concurrent runs that are allowed for the created job. If the threshold is reached, an error will be returned.
      • Job timeout (minutes) is the execution time limit.
      • Delay notification threshold (minutes) is set if a job runs longer than the specified time. AWS Glue will send a delay notification via Amazon CloudWatch.
      • Number of retries allows you to specify the number of times AWS Glue would automatically restart the job if it fails.
      • Job parameters and Non-overrideable Job parameters are a set of key-value pairs.
  • Script
    • A script allows you to extract the data from sources, transform it, and load the data into the targets.
    • You can generate ETL scripts using Scala or PySpark.
    • AWS Glue has a script editor that displays both the script and diagram to help you visualize the flow of your data.
  • Development endpoint
    • An environment that allows you to develop and test your ETL scripts.
    • To create and test AWS Glue scripts, you can connect the development endpoint using:
      • Apache Zeppelin notebook on your local machine
      • Zeppelin notebook server in Amazon EC2 instance
      • SageMaker notebook
      • Terminal window
      • PyCharm Python IDE
    • With SageMaker notebooks, you can share development endpoints among single or multiple users.
      • Single-tenancy Configuration
      • Multi-tenancy Configuration
  • Notebook server
    • A web-based environment to run PySpark statements.
    • You can use a notebook server for interactive development and testing of your ETL scripts on a development endpoint.
      • SageMaker notebooks server
      • Apache Zeppelin notebook server
  • Trigger
    • It allows you to manually or automatically start one or more crawlers or ETL jobs.
    • You can define triggers based on schedule, job events, and on-demand.
    • You can also use triggers to pass job parameters. If a trigger starts multiple jobs, the parameters are passed on each job.
  • Workflows
    • It helps you orchestrate ETL jobs, triggers, and crawlers.
    • Workflows can be created using the AWS Management Console or AWS Glue API.
    • You can visualize the components and the flow of work with a graph using the AWS Management Console.
    • Jobs and crawlers can fire an event trigger within a workflow.
    • By defining the default workflow run properties, you can share and manage state throughout a workflow run.
    • With AWS Glue API, you can retrieve the static and dynamic view of a running workflow.
    • The static view shows the design of the workflow. While the dynamic view includes the latest run information for the jobs and crawlers. Run information shows the success status and error details.
    • You can stop, repair, and resume a workflow run.
    • Workflow restrictions:
      • You can only associate a trigger in one workflow.
      • When setting up a trigger, you can only have one starting trigger (on-demand or schedule).
      • If a crawler or job in a workflow is started by a trigger that is outside the workflow, any triggers within a workflow will not fire if it depends on the crawler or job completion.
      • If a crawler or job in a workflow is started within the workflow, only the triggers within a workflow will fire upon the crawler or job completion.
  • Transform
    • To process your data, you can use AWS Glue built-in transforms. These transforms can be called from your ETL script.
    • Enables you to manipulate your data into different formats.
    • Clean your data using machine learning (ML) transforms.
      • Tune transform:
        • Recall vs. Precision
        • Lower Cost vs. Accuracy
      • With match enforcement, you can force the output to match the labels used in teaching the ML transform.
  • Dynamic Frame
    • A distributed table that supports nested data.
    • A record for self-describing is designed for schema flexibility with semi-structured data.
    • Each record consists of data and schema.
    • You can use dynamic frames to provide a set of advanced transformations for data cleaning and ETL.

Populating the AWS Glue Data Catalog

  1. Select any custom classifiers that will run with a crawler to infer the format and schema of the data. You must provide a code for your custom classifiers and run them in the order that you specify.
  2. To create a schema, a custom classifier must successfully recognize the structure of your data.
  3. The built-in classifiers will try to recognize the data’s schema if no custom classifier matches the data’s schema.
  4. For a crawler to access the data stores, you need to configure the connection properties. This will allow the crawler to connect to a data store, and the inferred schema will be created for your data.
  5. The crawler will write metadata to the AWS Glue Data Catalog. The metadata is stored in a table definition, and the table will be written to a database.

Authoring Jobs

  1. You need to select a data source for your job. Define the table that represents your data source in the AWS Glue Data Catalog. If the source requires a connection, you can reference the connection in your job. You can add multiple data sources by editing the script.
  2. Select the data target of your job or allow the job to create the target tables when it runs.
  3. By providing arguments for your job and generated script, you can customize the job-processing environment.
  4. AWS Glue can generate an initial script, but you can also edit the script if you need to add sources, targets, and transforms.
  5. Configure how your job is invoked. You can select on-demand, time-based schedule, or by an event.
  6. Based on the input, AWS Glue generates a Scala or PySpark script. You can edit the script based on your needs.

Glue DataBrew

  • A visual data preparation tool for cleaning and normalizing data to prepare it for analytics and machine learning.
  • You can choose from over 250 pre-built transformations to automate data preparation tasks. You can automate filtering anomalies, converting data to standard formats, and correcting invalid values, and other tasks. After your data is ready, you can immediately use it for analytics and machine learning projects.
  • When running profile jobs in DataBrew to auto-generate 40+ data quality statistics like column-level cardinality, numerical correlations, unique values, standard deviation, and other statistics, you can configure the size of the dataset you want analyzed.

Monitoring

  • Record the actions taken by the user, role, and AWS service using AWS CloudTrail.
  • You can use Amazon CloudWatch Events with AWS Glue to automate the actions when an event matches a rule.
  • With Amazon CloudWatch Logs, you can monitor, store, and access log files from different sources.
  • You can assign a tag to crawler, job, trigger, development endpoint, and machine learning transform.
  • Monitor and debug ETL jobs and Spark applications using Apache Spark web UI.
  • You could view the real-time logs on the Amazon CloudWatch dashboard if you enabled continuous logging.

Security

  • Security configuration allows you to encrypt your data at rest using SSE-S3 and SSE-KMS.
    • S3 encryption mode
    • CloudWatch logs encryption mode
    • Job bookmark encryption mode
  • With AWS KMS keys, you can encrypt the job bookmarks and the logs generated by crawlers and ETL jobs.
  • AWS Glue only supports symmetric customer master keys (CMKs).
  • For data in transit, AWS provides SSL encryption.
  • Managing access to resources using:
    • Identity-Based Policies
    • Resource-Based Policies
  • You can grant cross-account access in AWS Glue using the Data Catalog resource policy and IAM role.
  • Data Catalog Encryption:
    • Metadata encryption
    • Encrypt connection passwords
  • You can create a policy for the data catalog to define fine-grained access control.

Pricing

  • You are charged at an hourly rate based on the number of DPUs used to run your ETL job.
  • You are charged at an hourly rate based on the number of DPUs used to run your crawler.
  • Data Catalog storage and requests:
    • You will be charged per month if you store more than a million objects.
    • You will be charged per month if you exceed a million requests in a month.

Note: If you are studying for the AWS Certified Data Analytics Specialty exam, we highly recommend that you take our AWS Certified Data Analytics – Specialty Practice Exams and read our Data Analytics Specialty exam study guide.

Tutorials Dojo Study Guide and Cheatsheet

AWS Certified Data Analytics Sepcialty

Validate Your Knowledge

Question 1

A company is using Amazon S3 to store financial data in CSV format. An AWS Glue crawler is used to populate the AWS Glue Data Catalog and create the tables and schema. The Data Analyst launched an AWS Glue job that processes the data from the tables and writes it to Amazon Redshift tables. After running several jobs, the Data Analyst noticed that duplicate records exist in the Amazon Redshift table. The analyst needs to ensure that the Redshift table must not have any duplicates when jobs were rerun.

Which of the following is the best approach to satisfy this requirement?

  1. Set up a staging table in the AWS Glue job. Utilize the DynamicFrameWriter class in AWS Glue to replace the existing rows in the Redshift table before persisting the new data.
  2. In the AWS Glue job, insert the previous data into a MySQL database. Use the upsert operation in MySQL and copy the data to Redshift.
  3. Remove the duplicate records using Apache Spark’s DataFrame dropDuplicates() API before persisting the new data.
  4. Select the latest data using the ResolveChoice built-in transform in AWS Glue to avoid duplicate records.

Correct Answer: 1

AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there’s no infrastructure to set up or manage. Use the AWS Glue console to discover your data, transform it, and make it available for search and querying.

https://media.tutorialsdojo.com/aws-glue-arch.jpg

Amazon Redshift doesn’t support a single merge statement (update or insert, also known as an upsert) to insert and update data from a single data source. However, you can effectively perform a merge operation. To do so, load your data into a staging table and then join the staging table with your target table for an UPDATE statement and an INSERT statement.

In this scenario, you can implement an upsert/merge in Redshift using an AWS Glue job. You can load your data into a staging table first and then join the staging table with your target table using an UPSERT operation. The term ‘UPSERT’ is basically a portmanteau of the ‘UPdate’ and ‘inSERT’ operations. The upsert feature inserts or updates data if the row that is being inserted already exists in the table. By using this operation, the Redshift tables will no longer have duplicate records.

Hence, the correct answer is: Set up a staging table in the AWS Glue job. Utilize the DynamicFrameWriter class in AWS Glue to replace the existing rows in the Redshift table before persisting the new data.

The option that says: In the AWS Glue job, insert the previous data into a MySQL database. Use the upsert operation in MySQL and copy the data to Redshift is incorrect because you can’t use the COPY command to copy data directly from a MySQL database into Amazon Redshift. A workaround for this is to move the MySQL data into Amazon S3 and use AWS Glue as a staging table to perform the upsert operation. Since this method requires more effort, it is not the best approach to solve the problem.

The option that says: Remove the duplicate records using Apache Spark’s DataFrame dropDuplicates() API before persisting the new data is incorrect. This won’t ensure the removal of duplicates in the Amazon Redshift table because you’re just effectively removing duplicates from the staging table. Take note that the current Redshift table contains duplicate records. Even if you manage to join the target table and the staging table, the duplicate records in the Redshift table will still persist.

The option that says: Select the latest data using the ResolveChoice built-in transform in AWS Glue to avoid duplicate records is incorrect because ResolveChoice is just a built-in transform function that just resolves a choice type within a DynamicFrame.

References:

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-writer.html
https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/

Note: This question was extracted from our AWS Certified Data Analytics Specialty Practice Exams.

For more AWS practice exam questions with detailed explanations, visit the Tutorials Dojo Portal:

Tutorials Dojo AWS Practice Tests

References:
https://aws.amazon.com/glue/faqs/
https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

Save More with Our SAA, CDA, and SysOps Triple Bundle Reviewers!

AWS Certified SysOps Administrator Associate Video Course – Early Access Discount Ends Soon!

Pass your AWS, Azure, and Google Cloud Certifications with the Tutorials Dojo Portal

Tutorials Dojo portal

Our Bestselling AWS Certified Solutions Architect Associate Practice Exams

AWS Certified Solutions Architect Associate Practice Exams

Enroll Now – Our AWS Practice Exams with 95% Passing Rate

AWS Practice Exams Tutorials Dojo

FREE AWS Cloud Practitioner Essentials Course!

Enroll Now – Our Azure Certification Exam Reviewers

azure reviewers tutorials dojo

Enroll Now – Our Google Cloud Certification Exam Reviewers

Tutorials Dojo Exam Study Guide eBooks

tutorials dojo study guide eBook

Subscribe to our YouTube Channel

Tutorials Dojo YouTube Channel

FREE Intro to Cloud Computing for Beginners

FREE AWS, Azure, GCP Practice Test Samplers

Browse Other Courses

Generic Category (English)300x250

Recent Posts

AWS, Azure, and GCP Certifications are consistently among the top-paying IT certifications in the world, considering that most companies have now shifted to the cloud. Earn over $150,000 per year with an AWS, Azure, or GCP certification!

Follow us on LinkedIn, YouTube, Facebook, or join our Slack study group. More importantly, answer as many practice exams as you can to help increase your chances of passing your certification exams on your first try!

View Our AWS, Azure, and GCP Exam Reviewers

Our Community

~98%
passing rate
Around 95-98% of our students pass the AWS Certification exams after training with our courses.
200k+
students
Over 200k enrollees choose Tutorials Dojo in preparing for their AWS Certification exams.
~4.8
ratings
Our courses are highly rated by our enrollees from all over the world.

What our students say about us?

error: Content is protected !!