
AWS Data Pipeline

  • A web service for scheduling regular data movement and data-processing activities in the AWS Cloud. Data Pipeline integrates with both on-premises and cloud-based storage systems.
  • A managed ETL (Extract-Transform-Load) service.
  • Native integration with S3, DynamoDB, RDS, EMR, EC2 and Redshift.


  • You can quickly and easily provision pipelines that remove the development and maintenance effort required to manage your daily data operations, letting you focus on generating insights from that data.
  • Data Pipeline provides built-in activities for common actions such as copying data between Amazon S3 and Amazon RDS, or running a query against Amazon S3 log data.
  • Data Pipeline supports JDBC, RDS and Redshift databases.
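Provisioning a pipeline programmatically follows a create/define/activate sequence. A minimal sketch with boto3 (assuming AWS credentials are configured; the pipeline name, unique ID, and definition below are illustrative placeholders, not a complete working pipeline):

```python
def build_pipeline_objects():
    """Return a minimal pipelineObjects list in the shape that
    put_pipeline_definition expects: each object carries an id, a name,
    and a list of {key, stringValue | refValue} fields."""
    return [
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                # "ondemand" pipelines run only when activated, which keeps
                # this sketch free of schedule boilerplate.
                {"key": "scheduleType", "stringValue": "ondemand"},
            ],
        },
    ]


def create_and_activate(name, unique_id):
    """Create a pipeline, upload its definition, and activate it.
    Imports boto3 lazily so the helper above stays usable without it."""
    import boto3

    client = boto3.client("datapipeline")
    pipeline_id = client.create_pipeline(name=name, uniqueId=unique_id)["pipelineId"]
    client.put_pipeline_definition(
        pipelineId=pipeline_id,
        pipelineObjects=build_pipeline_objects(),
    )
    client.activate_pipeline(pipelineId=pipeline_id)
    return pipeline_id
```

In practice you would call `create_and_activate("demo-pipeline", "demo-pipeline-001")` once; `uniqueId` makes the call idempotent, so retries do not create duplicate pipelines.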


  • A pipeline definition specifies the business logic of your data management.
  • A pipeline schedules and runs tasks by creating EC2 instances to perform the defined work activities.
  • Task Runner polls for tasks and then performs those tasks. For example, Task Runner could copy log files to S3 and launch EMR clusters. Task Runner is installed and runs automatically on resources created by your pipeline definitions. You can write a custom task runner application, or you can use the Task Runner application that is provided by Data Pipeline.

Pipeline Definition

  • From your pipeline definition, Data Pipeline determines the tasks, schedules them, and assigns them to task runners.
  • If a task is not completed successfully, Data Pipeline retries the task according to your instructions and, if necessary, reassigns it to another task runner. If the task fails repeatedly, you can configure the pipeline to notify you.
  • A pipeline definition can contain the following types of components:
    • Data Nodes – The location of input data for a task or the location where output data is to be stored.
    • Activities – A definition of work to perform on a schedule using a computational resource and typically input and output data nodes.
    • Preconditions – A conditional statement that must be true before an action can run. There are two types of preconditions:
      • System-managed preconditions are run by the Data Pipeline web service on your behalf and do not require a computational resource.
      • User-managed preconditions only run on the computational resource that you specify using the runsOn or workerGroup fields. The workerGroup resource is derived from the activity that uses the precondition.
    • Scheduling Pipelines – Defines the timing of a scheduled event, such as when an activity runs. There are three types of items associated with a scheduled pipeline:
      • Pipeline Components – Specify the data sources, activities, schedule, and preconditions of the workflow.
      • Instances – Data Pipeline compiles the running pipeline components to create a set of actionable instances. Each instance contains all the information for performing a specific task.
      • Attempts – To provide robust data management, Data Pipeline retries a failed operation. It continues to do so until the task reaches the maximum number of allowed retry attempts.
    • Resources – The computational resource that performs the work that a pipeline defines.
    • Actions – An action that is triggered when specified conditions are met, such as the failure of an activity.
    • Schedules – Define when your pipeline activities run and the frequency with which the service expects your data to be available. All schedules must have a start date and a frequency.
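Putting these components together, a pipeline definition is expressed as JSON objects that reference one another by ID. A hedged sketch (bucket names, paths, IDs, and the instance type are placeholders) wiring a schedule, a precondition, two data nodes, a resource, and a copy activity:

```json
{
  "objects": [
    {
      "id": "DailySchedule",
      "type": "Schedule",
      "period": "1 day",
      "startDateTime": "2023-01-01T00:00:00"
    },
    {
      "id": "InputExists",
      "type": "S3KeyExists",
      "s3Key": "s3://example-bucket/input/data.csv"
    },
    {
      "id": "InputData",
      "type": "S3DataNode",
      "filePath": "s3://example-bucket/input/data.csv",
      "precondition": { "ref": "InputExists" },
      "schedule": { "ref": "DailySchedule" }
    },
    {
      "id": "OutputData",
      "type": "S3DataNode",
      "filePath": "s3://example-bucket/output/data.csv",
      "schedule": { "ref": "DailySchedule" }
    },
    {
      "id": "CopyResource",
      "type": "Ec2Resource",
      "instanceType": "t1.micro",
      "schedule": { "ref": "DailySchedule" }
    },
    {
      "id": "CopyStep",
      "type": "CopyActivity",
      "input": { "ref": "InputData" },
      "output": { "ref": "OutputData" },
      "runsOn": { "ref": "CopyResource" },
      "schedule": { "ref": "DailySchedule" }
    }
  ]
}
```

Here the activity only runs after the S3KeyExists precondition confirms the input object is present, and `runsOn` tells Data Pipeline to launch the EC2 resource that hosts Task Runner for this step.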

Task Runners

  • When Task Runner is installed and configured, it polls Data Pipeline for tasks associated with pipelines that you have activated.
  • When a task is assigned to Task Runner, it performs that task and reports its status back to Data Pipeline.
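A custom task runner follows the same poll/execute/report loop via the Data Pipeline API. A minimal sketch of one polling pass (the worker group name is illustrative, and a real runner would also send periodic ReportTaskProgress heartbeats during long tasks):

```python
def poll_once(client, worker_group="my-worker-group"):
    """Ask Data Pipeline for one task, perform it, and report its status.
    Returns the taskId that was processed, or None if no task was ready.
    The client is injected so it can be a boto3 "datapipeline" client in
    production or a stub in tests."""
    response = client.poll_for_task(workerGroup=worker_group)
    task = response.get("taskObject")
    if not task:
        return None  # nothing assigned to this worker group right now

    task_id = task["taskId"]
    try:
        # The real work (copying files, launching an EMR cluster, ...)
        # would run here, driven by the task's pipeline objects.
        client.set_task_status(taskId=task_id, taskStatus="FINISHED")
    except Exception as err:
        client.set_task_status(
            taskId=task_id,
            taskStatus="FAILED",
            errorMessage=str(err),
        )
    return task_id
```

In production this would sit inside a loop that calls `poll_once(boto3.client("datapipeline"))` repeatedly; `PollForTask` long-polls, so the loop does not need its own sleep.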


AWS Data Pipeline vs Amazon Simple Workflow Service (SWF)

  • Both services provide execution tracking, handling retries and exceptions, and running arbitrary actions.
  • AWS Data Pipeline is specifically designed to facilitate the steps that are common across the majority of data-driven workflows.

Pricing

  • You are billed based on how often your activities and preconditions are scheduled to run and where they run (on AWS or on-premises).

Note: If you are studying for the AWS Certified Data Analytics Specialty exam, we highly recommend that you take our AWS Certified Data Analytics – Specialty Practice Exams and read our Data Analytics Specialty exam study guide.


