Optimal Shard Placement in a Petabyte Scale Elasticsearch Cluster

Optimal Shard Placement in a Petabyte Scale Elasticsearch Cluster

  • November 9, 2018
Table of Contents

Optimal Shard Placement in a Petabyte Scale Elasticsearch Cluster

The number of shards on each node, and tries to balance the number of shards per node evenly across the clusterThe high and low disk watermarks. Elasticsearch considers the available disk space on a node before deciding whether to allocate new shards to that node or to actively relocate shards away from that node. A nodes that has reached the low watermark (i.e 80% disk used) is not allowed receive any more shards.

A node that has reached the high watermark (i.e 90%) will start to actively move shards away from it. The high and low disk watermarks. Elasticsearch considers the available disk space on a node before deciding whether to allocate new shards to that node or to actively relocate shards away from that node.

A nodes that has reached the low watermark (i.e 80% disk used) is not allowed receive any more shards. A node that has reached the high watermark (i.e 90%) will start to actively move shards away from it. New indices created and old indices dropping.

Disk watermark triggers due to indexing and other shard movements. Elasticsearch randomly deciding that a node has too few/too many shards compared to the cluster average. Hardware and OS-level failures causing new AWS instances to spin up and join the cluster.

With 500+ nodes this happens several times a week on average. New nodes added, almost every week because of normal data growth.

Source: meltwater.com

Share :
comments powered by Disqus

Related Posts

Modernizing your build pipelines

Modernizing your build pipelines

Doing Continuous Integration is a lot easier if you have the right tools. In our project at a german car manufacturer, we were tasked with developing new services and bringing them to the cloud. We had a centralized Jenkins instance, shared by all the teams in the department.

Read More
Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads

Peloton: Uber’s Unified Resource Scheduler for Diverse Cluster Workloads

Cluster management, a common software infrastructure among technology companies, aggregates compute resources from a collection of physical hosts into a shared resource pool, amplifying compute power and allowing for the flexible use of data center hardware. At Uber, cluster management provides an abstraction layer for various workloads. With the increasing scale of our business, the efficient use of cluster resources becomes very important.

Read More
20 Best YouTube channels for AI and machine learning

20 Best YouTube channels for AI and machine learning

What are the most interesting and informative YouTube channels about artificial intelligence (AI) and machine learning? Subscribe to these 20 high-quality channels today to stay up to date with the latest AI and machine learning breakthroughs. Siraj Raval:

Read More