Observability at Scale: Building Uber’s Alerting Ecosystem

  • November 21, 2018

Uber’s software architecture consists of thousands of microservices that empower teams to iterate quickly and support our company’s global growth. These microservices support a variety of solutions, such as mobile applications, internal and infrastructure services, and products along with complex configurations that affect these products at city and sub-city levels. To maintain our growth and architecture, Uber’s Observability team built a robust, scalable metrics and alerting pipeline responsible for detecting, mitigating, and notifying engineers of issues with their services as soon as they occur.

Specifically, we built two in-data center alerting systems, called uMonitor and Neris, that flow into the same notification and alerting pipeline. uMonitor is our metrics-based alerting system that runs checks against our metrics database M3, while Neris primarily looks for alerts in host-level infrastructure. Both Neris and uMonitor leverage a common pipeline for sending notifications and deduplication.
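To make the shape of this design concrete, here is a minimal sketch of two independent alert sources feeding one shared notification pipeline that deduplicates before notifying. Nothing here reflects the actual uMonitor, Neris, or M3 interfaces: the `Alert` fields, the `metric_check` and `host_check` functions, the dedup key format, and the time-window suppression rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Alert:
    source: str   # which system raised it (hypothetical labels: "metrics", "host")
    key: str      # hypothetical dedup key, e.g. "<target>:<check name>"
    message: str


class NotificationPipeline:
    """Shared pipeline: suppresses repeats of the same key inside a time window.

    The time-window rule is an assumption; the article does not describe
    how deduplication is actually performed.
    """

    def __init__(self, notify: Callable[[Alert], None], window_seconds: float = 300.0):
        self._notify = notify
        self._window = window_seconds
        self._last_sent: Dict[str, float] = {}

    def submit(self, alert: Alert, now: float) -> bool:
        """Notify unless an alert with the same key was sent recently."""
        last = self._last_sent.get(alert.key)
        if last is not None and now - last < self._window:
            return False  # duplicate within the window: suppressed
        self._last_sent[alert.key] = now
        self._notify(alert)
        return True


def metric_check(service: str, error_rate: float, threshold: float) -> List[Alert]:
    """Metrics-style check: fire when a measured value exceeds a threshold."""
    if error_rate > threshold:
        return [Alert("metrics", f"{service}:error-rate",
                      f"{service} error rate {error_rate:.2%} exceeds {threshold:.2%}")]
    return []


def host_check(host: str, disk_used: float, limit: float) -> List[Alert]:
    """Host-level check: fire when disk usage exceeds a limit."""
    if disk_used > limit:
        return [Alert("host", f"{host}:disk",
                      f"{host} disk usage {disk_used:.0%} exceeds {limit:.0%}")]
    return []


# Both sources submit to the same pipeline instance.
sent: List[Alert] = []
pipeline = NotificationPipeline(sent.append)

for alert in metric_check("trips", 0.07, 0.05) + host_check("host-1", 0.95, 0.90):
    pipeline.submit(alert, now=0.0)

# The same metric alert firing again one minute later is suppressed.
repeat_sent = pipeline.submit(metric_check("trips", 0.08, 0.05)[0], now=60.0)
```

Keeping deduplication in the shared pipeline rather than in each source means both kinds of checks get the same suppression behavior without duplicating that logic.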

We will dive into these systems, along with a discussion of our push towards more mitigation actions, our new alert deduplication platform called Origami, and the challenges in creating alerts with a high signal-to-noise ratio. We also developed a black-box alerting system that detects high-level outages from outside the data center in cases where our internal systems fail or an entire data center goes down. A future blog article will cover this setup.

Source: uber.com
