Scio 0.7: a deep dive
Large-scale data processing is a critical component of Spotify’s business model. It drives music recommendations, artist payouts based on stream counts, and insights about how users interact with Spotify. Every day we capture hundreds of terabytes of event data, in addition to database snapshots and derived datasets.
It’s imperative that engineers who want to work with this data can quickly write and execute application-level code without worrying about the low-level semantics of Map/Reduce frameworks, provisioning the right amount of compute power, or writing extensive boilerplate code for every job. To that end, Spotify engineers developed Scio, a Scala API for Apache Beam and Google Cloud Dataflow similar to frameworks like Spark or Scalding. Scio has been in development for almost four years, and we’re happy to announce the release of Scio
0.7.0! With thousands of production workflows running Scio, tens of thousands of batch Dataflow jobs a week, and hundreds of concurrently running streaming jobs, we’ve been able to analyze and address common pain points for our users, as well as add optimizations to reduce memory footprint and lower overall cost. Among the Spotify workflows that have upgraded to Scio 0.7, we’ve seen up to a 25% reduction in cost and a 20% reduction in runtime.