Automating Datacenter Operations at Dropbox

Switch provisioning at Dropbox is handled by a Pirlo component called the TOR Starter. The TOR Starter is responsible for validating and configuring switches in our datacenter server racks, PoP server racks, and at the different layers of our datacenter fabric that connect racks in the same facility together. Writing the TOR Starter on top […]

Kubernetes Failure Stories

I started to compile a list of public failure/horror stories related to Kubernetes. It should make it easier for people tasked with operations to find outage reports to learn from. Since we started with Kubernetes at Zalando in 2016, we collected many internal postmortems. Docker bugs (daemon unresponsive, process stuck in pipe wait, ..) were […]

The Biggest IT Failures of 2018

This year provedonce againthat IT-related failures “are universally unprejudiced: they happen in every country; to large companies and small; in commercial, nonprofit, and governmental organizations; and without regard to status or reputation.” Below is a review that just scratches the surface of the sundry failures, glitches, and other IT hiccups that made the news in […]

Observability at Scale: Building Uber’s Alerting Ecosystem

Uber’s software architectures consists of thousands of microservices that empower teams to iterate quickly and support our company’s global growth. These microservices support a variety of solutions, such as mobile applications, internal and infrastructure services, and products along with complex configurations that affect these products at city and sub-city levels. To maintain our growth and […]

Stack Overflow: How We Do Monitoring

What is monitoring? As far as I can tell, it means different things to different people. But we more or less agree on the concept. I think. Maybe. Let’s find out! Source: nickcraver

How Uber Beacon Helps Improve Safety for Riders and Drivers

Globally, there are approximately 1.3 million collision-related fatalities on the road every year. Crash fatalities are still the leading cause of death for people between 15-29 years old, impacting families, communities, and cities. Governments around the world are working to reduce the risks, committing more resources towards improving road safety. At Uber, we want to […]

Cape Technical Deep Dive

In this post, we’ll take a deep dive into the design of the Cape framework. First, we’ll discuss Cape’s architecture. Then we’ll look at the core scheduling component of the system. Throughout, we’ll focus the discussion on a few key design decisions. Before we begin, let’s touch on a few of our principles for developing […]

Bye bye Mongo, Hello Postgres

In April the Guardian switched off the Mongo DB cluster used to store our content after completing a migration to PostgreSQL on Amazon RDS. This post covers why and how At the Guardian, the majority of content – including articles, live blogs, galleries and video content – is produced in our in-house CMS tool, Composer. […]

Implementing the Netflix Media Database

In the previous blog posts in this series, we introduced the Netflix Media DataBase (NMDB) and its salient “Media Document” data model. In this post we will provide details of the NMDB system architecture beginning with the system requirements—these will serve as the necessary motivation for the architectural choices we made. A fundamental requirement for […]

Scaling Cash Payments in Uber Eats

This article is the fourth in a series covering how Uber’s mobile engineering team developed the newest version of our driver app, codenamed Carbon, a core component of our ridesharing business. Among other new features, the app lets our population of over three million driver-partners find fares, get directions, and track their earnings. We began […]