Scio 0.7: a deep dive

  • June 8, 2019

Large-scale data processing is a critical component of Spotify’s business model. It drives music recommendations, artist payouts based on stream counts, and insights about how users interact with Spotify. Every day we capture hundreds of terabytes of event data, in addition to database snapshots and derived datasets.

It’s imperative that engineers who want to work with this data can quickly write and execute application-level code without worrying about the low-level semantics of Map/Reduce frameworks, provisioning the right amount of compute power, or writing extensive boilerplate code for every job. To that end, Spotify engineers developed Scio, a Scala API for Apache Beam and Google Cloud Dataflow, similar to frameworks like Spark or Scalding. Scio has been in development for almost four years, and we’re happy to announce the release of Scio 0.7.0!

With thousands of production workflows running Scio, tens of thousands of batch Dataflow jobs a week, and hundreds of concurrently running streaming jobs, we’ve been able to analyze and address common pain points for our users, as well as add optimizations to reduce memory footprint and lower overall cost. Among the Spotify workflows that have upgraded to Scio 0.7, we’ve seen up to a 25% reduction in cost and a 20% reduction in runtime.

Source: spotify.com

