Scaling a Mature Data Pipeline—Managing Overhead

Before delving into our specifics, I want to take a moment to discuss the technical stack backing our pipeline. Our platform uses a mixture of Spark and Hive jobs. Our core pipeline is primarily implemented in Scala, though we use Spark SQL in certain contexts. We leverage YARN for job scheduling and resource management, and […]

Accelerating Uber’s Self-Driving Vehicle Development with Data

A key challenge faced by self-driving vehicles comes during interactions with pedestrians. In our development of self-driving vehicles, the Data Engineering and Data Science teams at Uber ATG (Advanced Technologies Group) contribute to the data processing and analysis that help make these interactions safe. Through data, we can learn the movement of cars and pedestrians […]

Scio 0.7: a deep dive

Large-scale data processing is a critical component of Spotify’s business model. It drives music recommendations, artist payouts based on stream counts, and insights about how users interact with Spotify. Every day we capture hundreds of terabytes of event data, in addition to database snapshots and derived datasets. It’s imperative that engineers who want to work […]

Matplotlib—Making data visualization interesting

Data visualization is a key step in understanding a dataset and drawing inferences from it. While one can always closely inspect the data row by row, cell by cell, doing so is often tedious and does not highlight the big picture. Visuals, on the other hand, define data in a form that is easy to […]

Python Data Visualization 2018: Why So Many Libraries?

This post is the first in a three-part series on the state of Python data visualization tools and the trends that emerged from SciPy 2018. By James A. Bednar. At a special session of SciPy 2018 in Austin, representatives of a wide range of open-source Python visualization tools shared their visions for the future of data visualization […]

An Introduction to Hashing in the Era of Machine Learning

In December 2017, researchers at Google and MIT published a provocative research paper about their work on “learned index structures”. The research is quite exciting, as the authors state in the abstract: Indeed, the results presented by the team of Google and MIT researchers include findings that could signal new competition for the most venerable […]

Altair: Declarative Visualization in Python

With Altair, you can spend more time understanding your data and its meaning. Altair’s API is simple, friendly, and consistent, and is built on top of the powerful Vega-Lite visualization grammar. This elegant simplicity produces beautiful and effective visualizations with a minimal amount of code. Source: GitHub

Exabytes in a Test Tube: The Case for DNA Data Storage

Our ability to sequence, synthesize, and edit DNA has advanced at a previously inconceivable speed. Far from being expensive and impractical, these DNA technologies are the most disruptive in all of biotechnology. It’s now possible to write custom DNA strands for pennies per base pair, at least for short strands. Two companies, GenScript Biotech Corp. […]

An Introduction to Hashing in the Era of Machine Learning

New research is an excellent opportunity to reexamine the fundamentals of a field; and it’s not often that something as fundamental (and well studied) as indexing experiences a breakthrough. This article serves as an introduction to hash tables, an abbreviated examination of what makes them fast and slow, and an intuitive view of the machine […]

15 Types of Regression you should know

Regression techniques are among the most popular statistical techniques used for predictive modeling and data mining tasks. On average, analytics professionals know only 2-3 types of regression that are commonly used in the real world. They are linear and logistic regression. But the fact is there are more than 10 types of regression algorithms designed […]