How a Production Outage Was Caused Using Kubernetes Pod Priorities

July 25, 2019

On Friday, July 19, Grafana Cloud experienced a ~30-minute outage in our Hosted Prometheus service. To our customers who were affected by the incident, I apologize. It's our job to provide you with the monitoring tools you need, and when they are not available we make your life harder.

We take this outage very seriously. This blog post explains what happened, how we responded to it, and what we're doing to ensure it doesn't happen again. The Grafana Cloud Hosted Prometheus service is based on Cortex, a CNCF project to build a horizontally scalable, highly available, multi-tenant Prometheus service.

The Cortex architecture consists of a series of individual microservices, each handling a different role: replication, storage, querying, etc. Cortex is under very active development, continually adding features and increasing performance. We regularly deploy new releases of Cortex to our clusters so that customers see these benefits, and Cortex is designed to make those rollouts possible without downtime.

To achieve zero-downtime upgrades, Cortex's Ingester service requires an extra Ingester replica during the upgrade process. This allows the old Ingesters to send in-progress data to the new Ingesters one by one. But Ingesters are big: they request 4 cores and 15GB of RAM per Pod, 25% of the CPU and memory of a single machine in our Kubernetes clusters.
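For a concrete sense of what a request that size looks like, here is a minimal, hypothetical Pod spec fragment; the names and image are placeholders, not our actual Cortex manifests. (A request of 4 cores and 15GB being 25% of a machine implies nodes with roughly 16 cores and 60GB of memory.)

```yaml
# Hypothetical fragment illustrating an Ingester-sized resource request.
# Names and image are placeholders, not taken from the real deployment.
apiVersion: v1
kind: Pod
metadata:
  name: ingester-example
spec:
  containers:
    - name: ingester
      image: example.org/cortex-ingester:latest  # placeholder image
      resources:
        requests:
          cpu: "4"        # 4 cores
          memory: 15Gi    # ~15GB
```

The scheduler will only place such a Pod on a node with at least that much unallocated CPU and memory, which is why the spare headroom discussed next matters.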

In aggregate, we typically have much more than 4 cores and 15GB RAM of unused resources available on a cluster to run these extra Ingesters for upgrades.
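As the title suggests, Pod priorities are central to this story: Kubernetes lets a Pod reference a PriorityClass, and the scheduler may preempt (evict) lower-priority Pods from a node to make room for a pending higher-priority Pod that cannot otherwise be scheduled. For background, a PriorityClass looks like the following; this is a generic sketch, not the configuration involved in this incident:

```yaml
# Generic PriorityClass example -- illustrative only, not the
# object involved in the incident described in this post.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority        # placeholder name
value: 1000000               # larger value = higher scheduling priority
globalDefault: false         # Pods must opt in via priorityClassName
description: "For workloads that may preempt lower-priority Pods."
```

A Pod opts in by setting priorityClassName: high-priority in its spec; Pods that reference no class run at priority 0 unless a class with globalDefault: true exists in the cluster.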

Source: grafana.com

