Operating Apache Kafka Clusters 24/7 Without A Global Ops Team

October 5, 2019


Earlier this year, the Streaming PubSub team at Lyft prepared multiple Apache Kafka clusters to take on load that required 24/7 support. The team’s operational burden for Kafka quickly headed toward burnout territory. On-call rotations became miserable because failing hosts kept waking us up at night.

New business requirements kept arriving, each requiring us to scale the clusters further. The more we scaled, the more often we were woken up.

Source: lyft.com

Captain Obvious Finally Arrives: Ride-sharing Actually Causes Congestion

A recurring theme among ride-hailing executives from the likes of Lyft and Uber is that their platforms will help reduce congestion in the world’s most populous cities. However, anyone actually living in these places will tell you it doesn’t appear to be working. Cities like New York were already clogged with taxi cabs but, instead of seeing all of these drivers buy personal vehicles to enlist as independent contractors for ride-hailing firms, Uber and Lyft brought in new drivers, more vehicles, and fresh competition.

MIT study shows how much driving for Uber or Lyft sucks

Making long-term forecasts at Lyft

At Lyft, as at many other companies, we need to make accurate short- and long-term forecasts. Among the metrics we need to predict accurately are the number of hours drivers provide in different regions (the supply side of our business) and the number of rides taken by riders in different regions (our demand). We have several internal tools that we use to make forecasts.
