Using Machine Learning to Ensure the Capacity Safety of Individual Microservices

March 16, 2019

Table of Contents

Reliability engineering teams at Uber build the tools, libraries, and infrastructure that enable engineers to operate our thousands of microservices reliably at scale. At its essence, reliability engineering boils down to actively preventing outages that affect the mean time between failures (MTBF). As Uber’s global mobility platform grows, our global scale and complex network of microservice call patterns have made capacity requirements for individual services difficult to predict.

When we’re unable to predict service-level capacity requirements, capacity-related outages can occur. Given the real-time nature of our operations, capacity-related outages (affecting microservices at the feature level as opposed to directly impacting user experiences) constitute a large chunk of Uber’s overall outages. To help prevent these outages, our reliability engineers abide by a concept called capacity safety, in other words, the ability to support historical peaks or a 90-day forecast of concurrent riders-on-trip from a single data center without causing significant CPU, memory, and general resource starvation.

On the Maps Reliability Engineering team, we built in-house machine learning tooling that forecasts core service metrics—requests per second, latency, and CPU usage—to provide an idea of how these metrics will change throughout the course of a week or month. Using these forecasts, teams can perform accurate capacity safety tests that capture the nuances of each microservice.

Source: uber.com

Tags :

comments powered by Disqus

AI year in review

At Facebook, we think that artificial intelligence that learns in new, more efficient ways – much like humans do – can play an important role in bringing people together. That core belief helps drive our AI strategy, focusing our investments in long-term research related to systems that learn using real-world data, inspiring our engineers to share cutting-edge tools and platforms with the wider AI community, and ultimately demonstrating new ways to use the technology to benefit the world. In 2018, we made important progress in all these areas.

Teaching AI to learn speech the way children do

A collaboration between the Facebook AI Research (FAIR) group and the Paris Sciences & Lettres University, with additional sponsorship from Microsoft Research, to challenge other researchers to teach AI systems to learn speech in a way that more closely resembles how young children learn. The ZeroSpeech 2019 challenge (which builds on previous efforts in 2015 and 2017) asks participants to build a speech synthesizer using only audio input, without any text or phonetic labels. The challenge’s central task is to build an AI system that can discover, in an unknown language, the machine equivalent of text of phonetic labels and use them to re-synthesize a sentence in a given voice.

Using Machine Learning to Ensure the Capacity Safety of Individual Microservices

Tags :

Share :

Related Posts

AI year in review

Teaching AI to learn speech the way children do

Updating Neural Networks to Recognize New Categories, with Minimal Retraining