Scaling a Mature Data Pipeline—Managing Overhead


October 5, 2019

Before delving into the specifics, I want to take a moment to describe the technical stack backing our pipeline. Our platform runs a mixture of Spark and Hive jobs, and our core pipeline is implemented primarily in Scala.

However, we use Spark SQL in certain contexts. We rely on YARN for job scheduling and resource management, execute our jobs on Amazon EMR, and use Airflow as our task orchestration system.

For a data pipeline, we define orchestration logic as the logic that facilitates the execution of your tasks. It includes the code that defines your dependency graph, your configuration system, your Spark job runner, and so on. In other words, anything required to run your pipeline that is not a map-reduce job or other business logic tends to be orchestration logic.

In total, our pipelines are made up of a little over a thousand tasks.

Source: medium.com

