Scaling a Mature Data Pipeline—Managing Overhead

Before getting into specifics, I want to take a moment to describe the technical stack backing our pipeline. Our platform uses a mixture of Spark and Hive jobs. Our core pipeline is implemented primarily in Scala, although we use Spark SQL in certain contexts.

We use YARN for job scheduling and resource management, and we run our jobs on Amazon EMR. Airflow is our task orchestration system; it handles the orchestration logic.

For a data pipeline, we define the orchestration logic as the logic that facilitates the execution of your tasks. It includes the logic that you use to define your dependency graph, your configuration system, your Spark job runner, and so on. In other words, anything required to run your pipeline that is not a map-reduce job or other business logic tends to be orchestration logic.
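To make the distinction concrete, here is a minimal sketch of orchestration logic in plain Python. It is not our actual Airflow setup: the task names, the configuration keys, the `com.example.pipeline` package, and `pipeline.jar` are all hypothetical, and the "job runner" only builds a `spark-submit` command rather than launching anything. Notice that none of this code touches the data itself.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependency graph: each (hypothetical) task maps to the tasks it depends on.
DEPENDENCIES = {
    "ingest_events": set(),
    "clean_events": {"ingest_events"},
    "aggregate_daily": {"clean_events"},
    "export_report": {"aggregate_daily"},
}

# Configuration system: hypothetical per-task resource overrides.
CONFIG = {
    "clean_events": {"executor_memory": "4g"},
    "aggregate_daily": {"executor_memory": "8g"},
}


def build_command(task, config):
    """Stand-in for a Spark job runner: build the command instead of running it."""
    cmd = ["spark-submit", "--master", "yarn"]
    memory = config.get(task, {}).get("executor_memory")
    if memory:
        cmd += ["--executor-memory", memory]
    # The class and jar names are placeholders, not real artifacts.
    cmd += ["--class", f"com.example.pipeline.{task}", "pipeline.jar"]
    return cmd


def plan(dependencies, config):
    """Return (task, command) pairs in an order that respects the dependencies."""
    order = TopologicalSorter(dependencies).static_order()
    return [(task, build_command(task, config)) for task in order]


if __name__ == "__main__":
    for task, cmd in plan(DEPENDENCIES, CONFIG):
        print(task, "->", " ".join(cmd))
```

Everything in this sketch (the graph, the configuration lookup, the command builder) is orchestration logic; the business logic would live inside the Spark jobs that the commands launch.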

In total, our pipelines are made up of a little over a thousand tasks.