How we 30x’d our Node parallelism

January 23, 2020

Table of Contents

What’s the best way to safely increase parallelism in a production Node service? That’s a question my team needed to answer a couple of months ago. We were running 4,000 Node containers (or ‘workers’) for our bank integration service.

The service was originally designed such that each worker would process only a single request at a time. This design lessened the impact of integrations that accidentally blocked the event loop, and allowed us to ignore the variability in resource usage across different integrations. But since our total capacity was capped at 4,000 concurrent requests, the system did not gracefully scale.

Most requests were network-bound, so we could improve our capacity and costs if we could just figure out how to increase parallelism safely. In our research, we couldn’t find a good playbook for going from ‘no parallelism’ to ‘lots of parallelism’ in a Node service. So we put together our own plan, which relied on careful planning, good tooling and observability, and a healthy dose of debugging.

In the end, we were able to 30x our parallelism, which equated to a cost savings of about $300k annually. This post will outline how we increased the performance and efficiency of our Node workers and describe the lessons that we learned in the process.

Source: plaid.com

Lyft’s Journey through Mobile Networking

In 5 years, the number of endpoints consumed by Lyft’s mobile apps grew to over 500, and the size of our mobile engineering team increased by more than 15x. To scale with this growth, our infrastructure had to evolve dramatically to utilize new advances in modern networking in order to continue to provide benefits for our users. This post describes the journey through the evolution of Lyft’s mobile networking: how it’s changed, what we’ve learned, and why it’s important for us as a growing business.

Automating Datacenter Operations at Dropbox

Switch provisioning at Dropbox is handled by a Pirlo component called the TOR Starter. The TOR Starter is responsible for validating and configuring switches in our datacenter server racks, PoP server racks, and at the different layers of our datacenter fabric that connect racks in the same facility together. Writing the TOR Starter on top of the ClusterOps queue provides us with a basic manager-worker queuing service.

Making the LinkedIn experimentation engine 20x faster

At LinkedIn, we like to say that experimentation is in our blood because no production release at the company happens without experimentation; by “experimentation,” we typically mean “A/B testing.” The company relies on employees to make decisions by analyzing data. Experimentation is a data-driven foundation of the decision-making process, which helps with measuring the precise impact of every change and release, and evaluating whether expectations meet reality.

How we 30x’d our Node parallelism

Tags :

Share :

Related Posts

Lyft’s Journey through Mobile Networking

Automating Datacenter Operations at Dropbox

Making the LinkedIn experimentation engine 20x faster