How should pipelines be monitored?

  • August 4, 2019

For online serving systems, it’s fairly well known that you should look at request rate, errors, and duration. But what about offline processing pipelines? For a typical web application, high latency or a high error rate is the sort of thing you want to wake someone up about, as it usually degrades the end user’s experience.

Request rate isn’t something to alert on in and of itself, but it’s important to know: it’s often related to errors and latency, and you’ll want it for capacity planning. An offline processing pipeline typically involves queues (such as Kafka) between various stages of computation. There’s no end user eagerly waiting for a web page to load, but how long it takes for data to get through is still a key metric.
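One way to measure how long data takes to get through is to stamp each record when it enters the pipeline and take the difference when the last stage finishes. The following is a minimal sketch of that idea; the `Record` type, its `ingested_at` field, and the `end_to_end_latency` helper are hypothetical names chosen for illustration and do not come from any particular queueing system.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical record type; the field names are assumptions."""
    payload: str
    ingested_at: float  # Unix timestamp set once, when the record enters the pipeline


def end_to_end_latency(record: Record, now: Optional[float] = None) -> float:
    """Return the seconds elapsed between ingestion and `now`.

    Call this in the final stage, after the record has been fully processed.
    `now` can be passed explicitly to make the function testable.
    """
    if now is None:
        now = time.time()
    return now - record.ingested_at


if __name__ == "__main__":
    record = Record(payload="example", ingested_at=time.time())
    # ... the record would travel through the queues and stages here ...
    print(f"latency: {end_to_end_latency(record):.3f}s")
```

Because the timestamp travels with the record, the measurement covers every queue and stage in between without needing to know how many there are.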

Similarly, if data goes in but an error causes it to be dropped or otherwise not correctly processed, that’s usually something to be concerned about. In addition, there’s how much data is sitting in each queue, how fast data is being added, and how fast data is being removed. Many teams alert on too much data being in a queue, and such alerts tend to be a bit spammy.
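The three queue measurements above can all be derived from two cumulative counters sampled at two points in time. This is a sketch under stated assumptions: `QueueSample` and its counter names are hypothetical, and the depth calculation assumes items leave the queue only by being dequeued (nothing expires or is dropped inside the queue itself).

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """Hypothetical snapshot of one queue's cumulative counters."""
    timestamp: float     # seconds, e.g. a Unix timestamp
    enqueued_total: int  # items ever added to the queue
    dequeued_total: int  # items ever removed from the queue


def queue_stats(prev: QueueSample, curr: QueueSample) -> dict:
    """Compute depth and add/remove rates between two samples."""
    interval = curr.timestamp - prev.timestamp
    if interval <= 0:
        raise ValueError("curr must be sampled after prev")
    return {
        # Items currently sitting in the queue.
        "depth": curr.enqueued_total - curr.dequeued_total,
        # Items added per second over the interval.
        "add_rate": (curr.enqueued_total - prev.enqueued_total) / interval,
        # Items removed per second over the interval.
        "remove_rate": (curr.dequeued_total - prev.dequeued_total) / interval,
    }


if __name__ == "__main__":
    prev = QueueSample(timestamp=0.0, enqueued_total=1000, dequeued_total=900)
    curr = QueueSample(timestamp=60.0, enqueued_total=1600, dequeued_total=1200)
    print(queue_stats(prev, curr))
```

An add rate that stays above the remove rate means the queue is growing, which is useful context when debugging but, as discussed next, not necessarily a reason to page anyone.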

First off, any alert on a fixed threshold tends to get out of date as traffic grows, which is the same reason it’s better to alert on the ratio of HTTP errors to total requests rather than on how many errors happen per second. The more serious issue, however, is that one queue having a certain number of items in it doesn’t mean that the overall pipeline is processing data too slowly, and setting thresholds high enough to avoid such false positives would miss actual problems. This is typical for alerts that work off causes rather than symptoms.
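To illustrate why a ratio ages better than a fixed threshold, the sketch below compares the two on the same inputs. The function names and the threshold values (1% and 10 failures) are arbitrary assumptions picked for the example, not recommendations.

```python
def error_ratio(failed: int, total: int) -> float:
    """Fraction of items that failed; 0.0 when nothing was processed."""
    if total == 0:
        return 0.0
    return failed / total


def ratio_alert(failed: int, total: int, max_ratio: float = 0.01) -> bool:
    """Fire when more than `max_ratio` of items failed (assumed 1% here)."""
    return error_ratio(failed, total) > max_ratio


def fixed_alert(failed: int, max_failed: int = 10) -> bool:
    """Fire when the absolute failure count exceeds a fixed number."""
    return failed > max_failed


if __name__ == "__main__":
    # Low traffic: 5 of 100 items failed (5%).
    print(ratio_alert(5, 100), fixed_alert(5))
    # High traffic: 50 of 100,000 items failed (0.05%).
    print(ratio_alert(50, 100_000), fixed_alert(50))
```

With 5 failures out of 100, the ratio alert fires while the fixed threshold stays silent; with 50 failures out of 100,000, the fixed threshold fires even though only 0.05% of items failed. The fixed threshold has to be retuned as traffic changes, whereas the ratio does not.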

Source: robustperception.io

