How should pipelines be monitored?

  • August 4, 2019

For online serving systems, it’s fairly well known that you should look at request rate, errors, and duration. What about offline processing pipelines, though? For a typical web application, high latency or a high error rate is the sort of thing you want to wake someone up about, as it usually degrades the end user’s experience.

Request rate isn’t something to alert on in and of itself, but it’s important to know: it’s often related to errors and latency, and you’ll want it for capacity planning. An offline processing pipeline typically involves queues (such as Kafka) between the various stages of computation. There’s no end user eagerly waiting for a web page to load, but how long it takes for data to get through the pipeline is still a key metric.
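One way to measure how long data takes to get through is to stamp each record when it enters the pipeline and compare that stamp against the clock once the final stage has finished. The following is a minimal sketch under stated assumptions: the `Record` type, its `ingested_at` field, and the `end_to_end_latency` helper are hypothetical names for illustration, not something the original post prescribes.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """Hypothetical pipeline record; `ingested_at` is set once, on entry."""
    payload: str
    ingested_at: float = field(default_factory=time.time)


def end_to_end_latency(record: Record, now: Optional[float] = None) -> float:
    """Seconds between ingestion and the moment the final stage finished."""
    if now is None:
        now = time.time()
    return now - record.ingested_at
```

Because the timestamp travels with the record, the measurement covers every queue and stage in between without any of them needing its own alert.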

Similarly, if data goes in but an error causes it to be dropped or otherwise processed incorrectly, that’s usually something to be concerned about. In addition, there’s how much data is sitting in each queue, how fast data is being added, and how fast it is being removed. Many will have alerts on too much data being in a queue, and these tend to be a bit spammy.
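The three per-queue numbers above (depth, add rate, remove rate) can all be derived from two ever-increasing counters sampled at two points in time. This is a rough sketch, assuming a hypothetical `QueueSample` snapshot type; a real queue such as Kafka exposes its own metrics, and these names are not taken from it.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """Hypothetical snapshot of one queue's monotonically increasing counters."""
    timestamp: float      # seconds
    enqueued_total: int   # items ever added to the queue
    dequeued_total: int   # items ever removed from the queue


def queue_stats(prev: QueueSample, curr: QueueSample) -> dict:
    """Derive current depth and per-second add/remove rates from two samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        "depth": curr.enqueued_total - curr.dequeued_total,
        "in_rate": (curr.enqueued_total - prev.enqueued_total) / elapsed,
        "out_rate": (curr.dequeued_total - prev.dequeued_total) / elapsed,
    }
```

These values are useful for dashboards and debugging; whether any of them deserves an alert is a separate question.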

First off, any alert on a fixed threshold tends to get out of date as traffic grows, which is the same reason it’s better to alert on the ratio of HTTP errors to total requests than on how many errors happen per second. The more serious issue, however, is that one queue holding a certain number of items doesn’t mean the overall pipeline is processing data too slowly, and setting thresholds high enough to avoid such false positives would miss actual problems. This is typical of alerts that work off causes rather than symptoms.
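The difference between a fixed threshold and a ratio can be shown with a small comparison. This is an illustrative sketch only: the function names and the limits (5 errors per second, 1% of requests) are assumptions chosen for the example, not recommended values.

```python
def fixed_threshold_alert(errors_per_sec: float, limit: float = 5.0) -> bool:
    """Fires when the absolute error rate exceeds a fixed limit."""
    return errors_per_sec > limit


def ratio_alert(errors_per_sec: float, total_per_sec: float,
                max_ratio: float = 0.01) -> bool:
    """Fires when errors exceed a fraction of all requests."""
    if total_per_sec <= 0:
        return False  # no traffic, so no meaningful ratio
    return errors_per_sec / total_per_sec > max_ratio


# Low traffic: 2 errors/s out of 100 requests/s is a 2% failure rate.
low_fixed = fixed_threshold_alert(2.0)          # stays silent
low_ratio = ratio_alert(2.0, 100.0)             # fires

# High traffic: 20 errors/s out of 10,000 requests/s is only 0.2%.
high_fixed = fixed_threshold_alert(20.0)        # fires
high_ratio = ratio_alert(20.0, 10_000.0)        # stays silent
```

The fixed limit misses the real problem at low traffic and pages needlessly at high traffic, while the ratio keeps its meaning as volume changes.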

Source: robustperception.io
