Kubernetes Failure Stories

Kubernetes Failure Stories

  • January 20, 2019
Table of Contents

Kubernetes Failure Stories

I started to compile a list of public failure/horror stories related to Kubernetes. It should make it easier for people tasked with operations to find outage reports to learn from. Since we started with Kubernetes at Zalando in 2016, we collected many internal postmortems.

Docker bugs (daemon unresponsive, process stuck in pipe wait, ..) were a major pain point in the beginning, but Docker itself has become more mature and did not bite us recently. The biggest chunk of problems can be attributed to the nature of distributed systems and ‘cascading failures’, e.g. a Kubernetes API server outage should not affect running workloads, but it did, or see our recent CoreDNS incident.

Source: srcco.de

Share :
comments powered by Disqus

Related Posts

A Crash Course For Running Istio

A Crash Course For Running Istio

At Namely we’ve been running with Istio for a year now. Yes, that’s pretty much when it first came out. We had a major performance regression with a Kubernetes cluster, we wanted distributed tracing, and used Istio to bootstrap Jaeger to investigate.

Read More
Envoy Proxy at Reddit

Envoy Proxy at Reddit

Reddit’s engineering team and product complexity has seen significant growth over the last three years. Facilitating that growth has taken a lot of behind-the-scenes evolution of Reddit’s backend infrastructure. One major component has been adopting a service-oriented architecture, and a significant facet of that has been evolving service-to-service discovery and communication.

Read More
The Biggest IT Failures of 2018

The Biggest IT Failures of 2018

This year provedonce againthat IT-related failures “are universally unprejudiced: they happen in every country; to large companies and small; in commercial, nonprofit, and governmental organizations; and without regard to status or reputation.” Below is a review that just scratches the surface of the sundry failures, glitches, and other IT hiccups that made the news in 2018. This year saw a slight reduction in the number of flight cancellations and delays due to computer-related problems as compared with the past three years, especially in the United States.

Read More