Over the years, 300 million Pinners have saved more than 200 billion Pins on Pinterest across more than 4 billion boards. To serve this vast user base and content pool, we’ve developed thousands of services, ranging from microservices of a handful CPUs to huge monolithic services that occupy a whole VM fleet. There are also various kinds of batch jobs from all kinds of different frameworks, which can be CPU, memory or I/O intensive. To support these diverse workloads, the infrastructure team at Pinterest is facing multiple challenges: Engineers don’t have a unified experience when launching their workload. Stateless services, stateful services and batch jobs are deployed and managed by totally different tech stacks. This has created a steep learning curve for our engineers, as well as huge maintenance and customer support burdens for the infrastructure team.Engineers managing their own VM fleets is creating a huge maintenance load for the infra team. Simple operations such as an OS or AMI upgrade can take weeks to months. Production workloads are also disturbed during those processes, which are supposed to be transparent to them.It’s hard to build infrastructure governance tools on top of separated management systems. It’s even more difficult for us to determine who owns which machines and if they can be safely recycled.