Scio 0.7: a deep dive

June 8, 2019


Large-scale data processing is a critical component of Spotify’s business model. It drives music recommendations, artist payouts based on stream counts, and insights about how users interact with Spotify. Every day we capture hundreds of terabytes of event data, in addition to database snapshots and derived datasets.

It’s imperative that engineers who want to work with this data can quickly write and execute application-level code without worrying about the low-level semantics of Map/Reduce frameworks, provisioning the right amount of compute power, or writing extensive boilerplate code for every job. To that end, Spotify engineers developed Scio, a Scala API for Apache Beam and Google Cloud Dataflow similar to frameworks like Spark or Scalding. Scio has been in development for almost four years, and we’re happy to announce the release of Scio 0.7.0!

With thousands of production workflows running Scio, tens of thousands of batch Dataflow jobs a week, and hundreds of concurrently running streaming jobs, we’ve been able to analyze and address common pain points for our users, as well as add optimizations to reduce memory footprint and lower overall cost. Among the Spotify workflows that have upgraded to Scio 0.7, we’ve seen up to a 25% reduction in cost and a 20% reduction in runtime.

Source: spotify.com

Python Data Visualization 2018: Why So Many Libraries?

This post is the first in a three-part series on the state of Python data visualization tools and the trends that emerged from SciPy 2018. By James A. Bednar. At a special session of SciPy 2018 in Austin, representatives of a wide range of open-source Python visualization tools shared their visions for the future of data visualization in Python. We heard updates on Matplotlib, Plotly, VisPy, and many more. I attended SciPy 2018 as a representative of PyViz, GeoViews, Datashader, Panel, hvPlot and Bokeh, and my Anaconda colleague Jean-Luc Stevens attended representing HoloViews.

An Introduction to Hashing in the Era of Machine Learning

15 Types of Regression you should know

Regression techniques are among the most popular statistical techniques used for predictive modeling and data mining tasks. On average, analytics professionals know only two or three types of regression that are commonly used in the real world, chiefly linear and logistic regression.
