Scaling a Mature Data Pipeline—Managing Overhead

Scaling a Mature Data Pipeline—Managing Overhead

  • October 5, 2019
Table of Contents

Scaling a Mature Data Pipeline—Managing Overhead

Before delving into our specifics, I want to take a moment to discuss the technical stack backing our pipeline. Our platform uses a mixture of Spark and Hive jobs. Our core pipeline is primarily implemented in Scala.

However, we leverage Spark SQL in certain contexts. We leverage YARN for job scheduling and resource management, and execute our jobs on Amazon EMR. We use Airflow as our task orchestration system that takes care of the orchestration logic.

For a data pipeline, we define the orchestration logic as the logic that facilitates the execution of your tasks. It includes the logic that you use to define your dependency graph, your configuration system, your Spark job runner, and so on. In other words, anything required to run your pipeline that is not a map-reduce job or other business logic tends to be orchestration logic.

In total, the our pipelines are made up of a little over a thousand tasks.

Source: medium.com

Tags :
Share :
comments powered by Disqus

Related Posts

Accelerating Uber’s Self-Driving Vehicle Development with Data

Accelerating Uber’s Self-Driving Vehicle Development with Data

A key challenge faced by self-driving vehicles comes during interactions with pedestrians. In our development of self-driving vehicles, the Data Engineering and Data Science teams at Uber ATG (Advanced Technologies Group) contribute to the data processing and analysis that help make these interactions safe.

Read More