Google Cloud Dataflow Cheat Sheet

Last updated on March 28, 2023

Cloud Dataflow is a fully managed data processing service for executing a wide variety of data processing patterns.

Features

Dataflow templates allow you to easily share your pipelines with team members and across your organization.
You can also take advantage of Google-provided templates to implement useful but simple data processing tasks.
Autoscaling lets Dataflow automatically choose the number of worker instances required to run your job.
You can build a batch or streaming pipeline protected with a customer-managed encryption key (CMEK), or access CMEK-protected data in sources and sinks.
Dataflow integrates with VPC Service Controls, which adds security to data processing environments by helping to mitigate the risk of data exfiltration.
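
As a rough illustration of the autoscaling and CMEK features above, the sketch below assembles the command-line options one might pass to an Apache Beam Python pipeline that runs on Dataflow. The helper name `build_dataflow_options` is hypothetical, and the project, region, bucket, and key values are placeholders, not values from this cheat sheet.

```python
from typing import List, Optional


def build_dataflow_options(
    project: str,
    region: str,
    temp_location: str,
    max_num_workers: int = 10,
    kms_key: Optional[str] = None,
) -> List[str]:
    """Assemble pipeline flags for a Dataflow job (hypothetical helper)."""
    if max_num_workers < 1:
        raise ValueError("max_num_workers must be at least 1")
    options = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        # Let the service scale the worker count with throughput,
        # up to the ceiling set by max_num_workers.
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        f"--max_num_workers={max_num_workers}",
    ]
    if kms_key:
        # Full resource name of a customer-managed encryption key (CMEK).
        options.append(f"--dataflow_kms_key={kms_key}")
    return options


if __name__ == "__main__":
    # Placeholder values for illustration only.
    for flag in build_dataflow_options(
        project="my-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
        kms_key="projects/my-project/locations/us-central1/keyRings/my-ring/cryptoKeys/my-key",
    ):
        print(flag)
```

In a real pipeline, a list like this would typically be handed to Beam's `PipelineOptions`; that step is left out here so the sketch runs without the Beam SDK installed.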

Pricing

Dataflow jobs are billed per second, based on the actual use of Dataflow batch or streaming workers. Additional resources, such as Cloud Storage or Pub/Sub, are billed separately at each service’s own rates.
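
To make per-second billing concrete, here is a minimal sketch that estimates worker cost from the exact number of seconds used. The function name and the flat per-worker-hour rate are assumptions made for illustration; actual Dataflow pricing is broken down by resource type and varies by region, so check the pricing page for real figures.

```python
def estimate_worker_cost(
    worker_count: int,
    runtime_seconds: int,
    rate_per_worker_hour: float,
) -> float:
    """Estimate worker cost under per-second billing (placeholder rate)."""
    if worker_count < 0 or runtime_seconds < 0 or rate_per_worker_hour < 0:
        raise ValueError("inputs must be non-negative")
    # Per-second billing: charge for the exact seconds used,
    # not for rounded-up hours.
    worker_seconds = worker_count * runtime_seconds
    return worker_seconds * rate_per_worker_hour / 3600


if __name__ == "__main__":
    # 4 workers for 15 minutes at a made-up rate of 0.10 per worker-hour.
    print(f"{estimate_worker_cost(4, 900, 0.10):.4f}")  # prints 0.1000
```

Charges for other services the job touches, such as Cloud Storage or Pub/Sub, would be added on top of this figure.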

Validate Your Knowledge

Question 1

Your company has 1 TB of unstructured data in various file formats, securely stored in its on-premises data center. The Data Analytics team needs to perform ETL (Extract, Transform, Load) processes on this data, which will eventually be consumed by a Dataflow SQL job.

What should you do?

1. Use the bq command-line tool in Cloud Shell and upload your on-premises data to Google BigQuery.
2. Use the Google Cloud Console to import the unstructured data by performing a dump into Cloud SQL.
3. Run a Dataflow import job using gcloud to upload the data into Cloud Spanner.
4. Using the gsutil command-line tool in Cloud SDK, move your on-premises data to Cloud Storage.

Correct Answer: 4

Dataflow SQL can query the following sources:

– Pub/Sub topics

– Cloud Storage filesets

– BigQuery tables

BigQuery is a serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility. With serverless data warehousing, Google does all resource provisioning behind the scenes, so you can focus on data and analysis rather than worrying about upgrading, securing, or managing the infrastructure.

Google Cloud Storage is a powerful and cost-effective storage solution for unstructured objects, perfect for everything from hosting live web content to storing data for analytics to archiving and backup.

The scenario states that you need to upload unstructured data to Google Cloud. Among the possible data sources for a Dataflow SQL job, Cloud Storage is the only one that can hold unstructured data in a variety of file formats.

Hence, the correct answer is: Using the gsutil command-line tool in Cloud SDK, move your on-premises data to Cloud Storage.
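
As a sketch of what that upload could look like, the snippet below builds a `gsutil` command that recursively copies a local directory into a bucket. The helper name, local path, bucket name, and prefix are placeholders. It uses `cp` rather than `mv` so the on-premises copy stays in place, and it prints the command instead of running it.

```python
import shlex
from typing import List


def build_gsutil_copy_command(local_dir: str, bucket: str, prefix: str = "") -> List[str]:
    """Build a gsutil command that recursively copies local_dir to a bucket."""
    if not bucket:
        raise ValueError("bucket name is required")
    # Avoid a trailing slash when no prefix is given.
    destination = f"gs://{bucket}/{prefix.strip('/')}".rstrip("/")
    # -m runs the copy in parallel; -r recurses into subdirectories.
    return ["gsutil", "-m", "cp", "-r", local_dir, destination]


if __name__ == "__main__":
    # Placeholder path and bucket name for illustration only.
    cmd = build_gsutil_copy_command("/data/exports", "my-etl-landing-bucket", "raw")
    print(shlex.join(cmd))
```

On a machine with the Cloud SDK installed and authenticated, the returned list could be passed to `subprocess.run(cmd, check=True)` to perform the copy.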

The option that says: Use the bq command-line tool in Cloud Shell and upload your on-premises data to Google BigQuery is incorrect because data loaded into BigQuery must be in a structured format such as JSON or CSV.

The option that says: Use the Google Cloud Console to import the unstructured data by performing a dump into Cloud SQL is incorrect because Cloud SQL is mainly used for storing relational data, which means it is not suitable for storing unstructured data.

The option that says: Run a Dataflow import job using gcloud to upload the data into Cloud Spanner is incorrect because Cloud Spanner is a fully managed relational database service, so it is not suitable for storing unstructured data. You have to use Cloud Storage instead.

References:
https://cloud.google.com/dataflow/docs/guides/sql/data-sources-destinations
https://console.cloud.google.com/getting-started?tutorial=storage_quickstart

Note: This question was extracted from our Google Certified Associate Cloud Engineer Practice Exams.