Last updated on March 28, 2023
Google Cloud Dataflow Cheat Sheet
- Cloud Dataflow is a fully managed data processing service for executing a wide variety of data processing patterns.
Features
- Dataflow templates allow you to easily share your pipelines with team members and across your organization.
- You can also take advantage of Google-provided templates to implement useful but simple data processing tasks.
- Autoscaling lets Dataflow automatically choose the appropriate number of worker instances required to run your job (see the sketch after this list).
- You can build a batch or streaming pipeline protected with a customer-managed encryption key (CMEK), or access CMEK-protected data in sources and sinks.
- Dataflow is integrated with VPC Service Controls, which provide additional security for data processing environments by mitigating the risk of data exfiltration.
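The autoscaling and CMEK options above are configured as pipeline options when you submit a job. Below is a minimal sketch using the Apache Beam Python SDK; the project, bucket, region, and KMS key names are placeholders, not real resources.

```python
# Minimal Apache Beam pipeline sketch (Python SDK) showing Dataflow
# runner options for autoscaling and CMEK. All resource names below
# are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                    # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/tmp",      # placeholder bucket
    # Autoscaling: let Dataflow pick the worker count up to a cap.
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=10,
    # CMEK: encrypt pipeline state with a customer-managed key.
    dataflow_kms_key=(
        "projects/my-project/locations/us-central1/"
        "keyRings/my-ring/cryptoKeys/my-key"
    ),
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.txt")
        | "SplitWords" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "Sum" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}\t{count}")
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts")
    )
```

Google-provided templates can also be launched without writing any pipeline code, for example with `gcloud dataflow jobs run` pointed at a template path under `gs://dataflow-templates/`.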
Pricing
- Dataflow jobs are billed per second, based on the actual use of Dataflow batch or streaming workers. Additional resources, such as Cloud Storage or Pub/Sub, are each billed according to that service's pricing.
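As a worked illustration of per-second worker billing, here is a small cost sketch; the hourly rates are placeholders, not published Dataflow prices, so check the Dataflow pricing page for actual rates.

```python
# Hypothetical cost sketch for per-second Dataflow worker billing.
# The rates below are placeholders, NOT published Dataflow prices.
VCPU_RATE_PER_HOUR = 0.06      # placeholder $/vCPU-hour
MEM_RATE_PER_GB_HOUR = 0.004   # placeholder $/GB-hour

workers = 5           # worker instances the job actually used
vcpus_per_worker = 4
gb_per_worker = 15
job_seconds = 1_800   # a 30-minute job, billed per second

hours = job_seconds / 3600
cost = workers * hours * (
    vcpus_per_worker * VCPU_RATE_PER_HOUR
    + gb_per_worker * MEM_RATE_PER_GB_HOUR
)
print(f"Estimated worker cost: ${cost:.2f}")  # -> $0.75 at these rates
```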
Validate Your Knowledge
Question 1
Your company has 1 TB of unstructured data in various file formats that are securely stored on its on-premises data center. The Data Analytics team needs to perform ETL (Extract, Transform, Load) processes on this data, which will eventually be consumed by a Dataflow SQL job.
What should you do?
- Use the `bq` command-line tool in Cloud Shell and upload your on-premises data to Google BigQuery.
- Use the Google Cloud Console to import the unstructured data by performing a dump into Cloud SQL.
- Run a Dataflow import job using `gcloud` to upload the data into Cloud Spanner.
- Using the `gsutil` command-line tool in Cloud SDK, move your on-premises data to Cloud Storage.
For more Google Cloud practice exam questions with detailed explanations, check out the Tutorials Dojo Portal: