Borg System Architecture (PDF)
Document Details
Uploaded by EasiestMimosa
Georgia Institute of Technology
2015
Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes
Tags
Summary
This document describes Google's Borg system, a large-scale cluster management system. Borg handles hundreds of thousands of jobs across many clusters, blending admission control, task packing, and machine sharing. Its design facilitates high availability applications with minimal fault recovery time. The paper details the system architecture, user perspective, workload, and design decisions.
Full Transcript
Large-scale cluster management at Google with Borg Abhishek Verma† Luis Pedrosa‡ Madhukar Korupolu David Oppenheimer Eric Tune John Wilkes...
Large-scale cluster management at Google with Borg Abhishek Verma† Luis Pedrosa‡ Madhukar Korupolu David Oppenheimer Eric Tune John Wilkes Google Inc. Abstract config file command-line Google’s Borg system is a cluster manager that runs hun- borgcfg tools web browsers web browsers dreds of thousands of jobs, from many thousands of differ- ent applications, across a number of clusters each with up to Cell BorgMaster BorgMaster tens of thousands of machines. BorgMaster BorgMaster BorgMaster UIshard UI shard UIshard shard read/UI UI It achieves high utilization by combining admission con- shard Scheduler persistent store trol, efficient task-packing, over-commitment, and machine scheduler (Paxos) linkshard shard link sharing with process-level performance isolation. It supports link link link shard shard shard high-availability applications with runtime features that min- imize fault-recovery time, and scheduling policies that re- duce the probability of correlated failures. Borg simplifies Borglet Borglet Borglet Borglet life for its users by offering a declarative job specification language, name service integration, real-time job monitor- ing, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative anal- Figure 1: The high-level architecture of Borg. Only a tiny fraction ysis of some of its policy decisions, and a qualitative ex- of the thousands of worker nodes are shown. amination of lessons learned from a decade of operational experience with it. cluding with a set of qualitative observations we have made from operating Borg in production for more than a decade. 1. Introduction The cluster management system we internally call Borg ad- 2. The user perspective mits, schedules, starts, restarts, and monitors the full range Borg’s users are Google developers and system administra- of applications that Google runs. This paper explains how. tors (site reliability engineers or SREs) that run Google’s Borg provides three main benefits: it (1) hides the details applications and services. Users submit their work to Borg of resource management and failure handling so its users can in the form of jobs, each of which consists of one or more focus on application development instead; (2) operates with tasks that all run the same program (binary). Each job runs very high reliability and availability, and supports applica- in one Borg cell, a set of machines that are managed as a tions that do the same; and (3) lets us run workloads across unit. The remainder of this section describes the main fea- tens of thousands of machines effectively. Borg is not the tures exposed in the user view of Borg. first system to address these issues, but it’s one of the few op- erating at this scale, with this degree of resiliency and com- 2.1 The workload pleteness. This paper is organized around these topics, con- Borg cells run a heterogenous workload with two main parts. † Work done while author was at Google. The first is long-running services that should “never” go ‡ Currently at University of Southern California. down, and handle short-lived latency-sensitive requests (a few µs to a few hundred ms). Such services are used for end-user-facing products such as Gmail, Google Docs, and Permission to make digital or hard copies of part or all of this work for personal or web search, and for internal infrastructure services (e.g., classroom use is granted without fee provided that copies are not made or distributed BigTable). The second is batch jobs that take from a few for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. seconds to a few days to complete; these are much less sen- For all other uses, contact the owner/author(s). sitive to short-term performance fluctuations. The workload EuroSys’15, April 21–24, 2015, Bordeaux, France. mix varies across cells, which run different mixes of applica- Copyright is held by the owner/author(s). ACM 978-1-4503-3238-5/15/04. tions depending on their major tenants (e.g., some cells are http://dx.doi.org/10.1145/2741948.2741964 quite batch-intensive), and also varies over time: batch jobs come and go, and many end-user-facing service jobs see a because we don’t want to pay the cost of virtualization. diurnal usage pattern. Borg is required to handle all these Also, the system was designed at a time when we had a cases equally well. considerable investment in processors with no virtualization A representative Borg workload can be found in a publicly- support in hardware. available month-long trace from May 2011 , which has A task has properties too, such as its resource require- been extensively analyzed (e.g., and [1, 26, 27, 57]). ments and the task’s index within the job. Most task proper- Many application frameworks have been built on top of ties are the same across all tasks in a job, but can be over- Borg over the last few years, including our internal MapRe- ridden – e.g., to provide task-specific command-line flags. duce system , FlumeJava , Millwheel , and Pregel Each resource dimension (CPU cores, RAM, disk space,. Most of these have a controller that submits a master disk access rate, TCP ports,2 etc.) is specified independently job and one or more worker jobs; the first two play a similar at fine granularity; we don’t impose fixed-sized buckets or role to YARN’s application manager. Our distributed slots (§5.4). Borg programs are statically linked to reduce storage systems such as GFS and its successor CFS, dependencies on their runtime environment, and structured Bigtable , and Megastore all run on Borg. as packages of binaries and data files, whose installation is For this paper, we classify higher-priority Borg jobs as orchestrated by Borg. “production” (prod) ones, and the rest as “non-production” Users operate on jobs by issuing remote procedure calls (non-prod). Most long-running server jobs are prod; most (RPCs) to Borg, most commonly from a command-line tool, batch jobs are non-prod. In a representative cell, prod jobs other Borg jobs, or our monitoring systems (§2.6). Most job are allocated about 70% of the total CPU resources and rep- descriptions are written in the declarative configuration lan- resent about 60% of the total CPU usage; they are allocated guage BCL. This is a variant of GCL , which gener- about 55% of the total memory and represent about 85% of ates protobuf files , extended with some Borg-specific the total memory usage. The discrepancies between alloca- keywords. GCL provides lambda functions to allow calcula- tion and usage will prove important in §5.5. tions, and these are used by applications to adjust their con- figurations to their environment; tens of thousands of BCL 2.2 Clusters and cells files are over 1 k lines long, and we have accumulated tens The machines in a cell belong to a single cluster, defined by of millions of lines of BCL. Borg job configurations have the high-performance datacenter-scale network fabric that similarities to Aurora configuration files. connects them. A cluster lives inside a single datacenter Figure 2 illustrates the states that jobs and tasks go building, and a collection of buildings makes up a site.1 through during their lifetime. A cluster usually hosts one large cell and may have a few smaller-scale test or special-purpose cells. We assiduously submit + reject avoid any single point of failure. accept Our median cell size is about 10 k machines after exclud- ing test cells; some are much larger. The machines in a cell Pending update are heterogeneous in many dimensions: sizes (CPU, RAM, schedule evict disk, network), processor type, performance, and capabili- fail, kill, ties such as an external IP address or flash storage. Borg iso- lost Running update lates users from most of these differences by determining finish, fail, kill, lost submit where in a cell to run tasks, allocating their resources, in- stalling their programs and other dependencies, monitoring Dead their health, and restarting them if they fail. 2.3 Jobs and tasks A Borg job’s properties include its name, owner, and the number of tasks it has. Jobs can have constraints to force Figure 2: The state diagram for both jobs and tasks. Users can its tasks to run on machines with particular attributes such as trigger submit, kill, and update transitions. processor architecture, OS version, or an external IP address. A user can change the properties of some or all of the Constraints can be hard or soft; the latter act like preferences tasks in a running job by pushing a new job configuration rather than requirements. The start of a job can be deferred to Borg, and then instructing Borg to update the tasks to until a prior one finishes. A job runs in just one cell. the new specification. This acts as a lightweight, non-atomic Each task maps to a set of Linux processes running in transaction that can easily be undone until it is closed (com- a container on a machine. The vast majority of the mitted). Updates are generally done in a rolling fashion, and Borg workload does not run inside virtual machines (VMs), a limit can be imposed on the number of task disruptions 1 There are a few exceptions for each of these relationships. 2 Borg manages the available ports on a machine and allocates them to tasks. (reschedules or preemptions) an update causes; any changes e.g., MapReduce master tasks run at a slightly higher priority that would cause more disruptions are skipped. than the workers they control, to improve their reliability. Some task updates (e.g., pushing a new binary) will al- Priority expresses relative importance for jobs that are ways require the task to be restarted; some (e.g., increasing running or waiting to run in a cell. Quota is used to decide resource requirements or changing constraints) might make which jobs to admit for scheduling. Quota is expressed as the task no longer fit on the machine, and cause it to be a vector of resource quantities (CPU, RAM, disk, etc.) at a stopped and rescheduled; and some (e.g., changing priority) given priority, for a period of time (typically months). The can always be done without restarting or moving the task. quantities specify the maximum amount of resources that Tasks can ask to be notified via a Unix SIGTERM sig- a user’s job requests can ask for at a time (e.g., “20 TiB nal before they are preempted by a SIGKILL, so they have of RAM at prod priority from now until the end of July time to clean up, save state, finish any currently-executing in cell xx”). Quota-checking is part of admission control, requests, and decline new ones. The actual notice may be not scheduling: jobs with insufficient quota are immediately less if the preemptor sets a delay bound. In practice, a notice rejected upon submission. is delivered about 80% of the time. Higher-priority quota costs more than quota at lower- priority. Production-priority quota is limited to the actual 2.4 Allocs resources available in the cell, so that a user who submits A Borg alloc (short for allocation) is a reserved set of re- a production-priority job that fits in their quota can expect it sources on a machine in which one or more tasks can be to run, modulo fragmentation and constraints. Even though run; the resources remain assigned whether or not they are we encourage users to purchase no more quota than they used. Allocs can be used to set resources aside for future need, many users overbuy because it insulates them against tasks, to retain resources between stopping a task and start- future shortages when their application’s user base grows. ing it again, and to gather tasks from different jobs onto the We respond to this by over-selling quota at lower-priority same machine – e.g., a web server instance and an associ- levels: every user has infinite quota at priority zero, although ated logsaver task that copies the server’s URL logs from this is frequently hard to exercise because resources are over- the local disk to a distributed file system. The resources of subscribed. A low-priority job may be admitted but remain an alloc are treated in a similar way to the resources of a ma- pending (unscheduled) due to insufficient resources. chine; multiple tasks running inside one share its resources. Quota allocation is handled outside of Borg, and is inti- If an alloc must be relocated to another machine, its tasks are mately tied to our physical capacity planning, whose results rescheduled with it. are reflected in the price and availability of quota in differ- An alloc set is like a job: it is a group of allocs that reserve ent datacenters. User jobs are admitted only if they have suf- resources on multiple machines. Once an alloc set has been ficient quota at the required priority. The use of quota re- created, one or more jobs can be submitted to run in it. For duces the need for policies like Dominant Resource Fairness brevity, we will generally use “task” to refer to an alloc or a (DRF) [29, 35, 36, 66]. top-level task (one outside an alloc) and “job” to refer to a Borg has a capability system that gives special privileges job or alloc set. to some users; for example, allowing administrators to delete or modify any job in the cell, or allowing a user to access restricted kernel features or Borg behaviors such as disabling 2.5 Priority, quota, and admission control resource estimation (§5.5) on their jobs. What happens when more work shows up than can be ac- commodated? Our solutions for this are priority and quota. Every job has a priority, a small positive integer. A high- 2.6 Naming and monitoring priority task can obtain resources at the expense of a lower- It’s not enough to create and place tasks: a service’s clients priority one, even if that involves preempting (killing) the and other systems need to be able to find them, even after latter. Borg defines non-overlapping priority bands for dif- they are relocated to a new machine. To enable this, Borg ferent uses, including (in decreasing-priority order): moni- creates a stable “Borg name service” (BNS) name for each toring, production, batch, and best effort (also known as task that includes the cell name, job name, and task number. testing or free). For this paper, prod jobs are the ones in the Borg writes the task’s hostname and port into a consistent, monitoring and production bands. highly-available file in Chubby with this name, which Although a preempted task will often be rescheduled is used by our RPC system to find the task endpoint. The elsewhere in the cell, preemption cascades could occur if BNS name also forms the basis of the task’s DNS name, a high-priority task bumped out a slightly lower-priority so the fiftieth task in job jfoo owned by user ubar in cell one, which bumped out another slightly-lower priority task, cc would be reachable via 50.jfoo.ubar.cc.borg.google.com. and so on. To eliminate most of this, we disallow tasks in Borg also writes job size and task health information into the production priority band to preempt one another. Fine- Chubby whenever it changes, so load balancers can see grained priorities are still useful in other circumstances – where to route requests to. Almost every task run under Borg contains a built-in brought up and whenever the elected master fails; it acquires HTTP server that publishes information about the health of a Chubby lock so other systems can find it. Electing a master the task and thousands of performance metrics (e.g., RPC and failing-over to the new one typically takes about 10 s, but latencies). Borg monitors the health-check URL and restarts can take up to a minute in a big cell because some in-memory tasks that do not respond promptly or return an HTTP er- state has to be reconstructed. When a replica recovers from ror code. Other data is tracked by monitoring tools for dash- an outage, it dynamically re-synchronizes its state from other boards and alerts on service level objective (SLO) violations. Paxos replicas that are up-to-date. A service called Sigma provides a web-based user inter- The Borgmaster’s state at a point in time is called a face (UI) through which a user can examine the state of all checkpoint, and takes the form of a periodic snapshot plus a their jobs, a particular cell, or drill down to individual jobs change log kept in the Paxos store. Checkpoints have many and tasks to examine their resource behavior, detailed logs, uses, including restoring a Borgmaster’s state to an arbitrary execution history, and eventual fate. Our applications gener- point in the past (e.g., just before accepting a request that ate voluminous logs; these are automatically rotated to avoid triggered a software defect in Borg so it can be debugged); running out of disk space, and preserved for a while after the fixing it by hand in extremis; building a persistent log of task’s exit to assist with debugging. If a job is not running events for future queries; and offline simulations. Borg provides a “why pending?” annotation, together with A high-fidelity Borgmaster simulator called Fauxmaster guidance on how to modify the job’s resource requests to can be used to read checkpoint files, and contains a complete better fit the cell. We publish guidelines for “conforming” copy of the production Borgmaster code, with stubbed-out resource shapes that are likely to schedule easily. interfaces to the Borglets. It accepts RPCs to make state ma- Borg records all job submissions and task events, as well chine changes and perform operations, such as “schedule all as detailed per-task resource usage information in Infrastore, pending tasks”, and we use it to debug failures, by interact- a scalable read-only data store with an interactive SQL-like ing with it as if it were a live Borgmaster, with simulated interface via Dremel. This data is used for usage-based Borglets replaying real interactions from the checkpoint file. charging, debugging job and system failures, and long-term A user can step through and observe the changes to the sys- capacity planning. It also provided the data for the Google tem state that actually occurred in the past. Fauxmaster is cluster workload trace. also useful for capacity planning (“how many new jobs of All of these features help users to understand and debug this type would fit?”), as well as sanity checks before mak- the behavior of Borg and their jobs, and help our SREs ing a change to a cell’s configuration (“will this change evict manage a few tens of thousands of machines per person. any important jobs?”). 3. Borg architecture 3.2 Scheduling A Borg cell consists of a set of machines, a logically central- When a job is submitted, the Borgmaster records it persis- ized controller called the Borgmaster, and an agent process tently in the Paxos store and adds the job’s tasks to the pend- called the Borglet that runs on each machine in a cell (see ing queue. This is scanned asynchronously by the scheduler, Figure 1). All components of Borg are written in C++. which assigns tasks to machines if there are sufficient avail- able resources that meet the job’s constraints. (The sched- 3.1 Borgmaster uler primarily operates on tasks, not jobs.) The scan pro- Each cell’s Borgmaster consists of two processes: the main ceeds from high to low priority, modulated by a round-robin Borgmaster process and a separate scheduler (§3.2). The scheme within a priority to ensure fairness across users and main Borgmaster process handles client RPCs that either avoid head-of-line blocking behind a large job. The schedul- mutate state (e.g., create job) or provide read-only access ing algorithm has two parts: feasibility checking, to find ma- to data (e.g., lookup job). It also manages state machines chines on which the task could run, and scoring, which picks for all of the objects in the system (machines, tasks, allocs, one of the feasible machines. etc.), communicates with the Borglets, and offers a web UI In feasibility checking, the scheduler finds a set of ma- as a backup to Sigma. chines that meet the task’s constraints and also have enough The Borgmaster is logically a single process but is ac- “available” resources – which includes resources assigned tually replicated five times. Each replica maintains an in- to lower-priority tasks that can be evicted. In scoring, the memory copy of most of the state of the cell, and this state is scheduler determines the “goodness” of each feasible ma- also recorded in a highly-available, distributed, Paxos-based chine. The score takes into account user-specified prefer- store on the replicas’ local disks. A single elected mas- ences, but is mostly driven by built-in criteria such as mini- ter per cell serves both as the Paxos leader and the state mizing the number and priority of preempted tasks, picking mutator, handling all operations that change the cell’s state, machines that already have a copy of the task’s packages, such as submitting a job or terminating a task on a ma- spreading tasks across power and failure domains, and pack- chine. A master is elected (using Paxos) when the cell is ing quality including putting a mix of high and low priority tasks onto a single machine to allow the high-priority ones The Borgmaster polls each Borglet every few seconds to to expand in a load spike. retrieve the machine’s current state and send it any outstand- Borg originally used a variant of E-PVM for scoring, ing requests. This gives Borgmaster control over the rate of which generates a single cost value across heterogeneous communication, avoids the need for an explicit flow control resources and minimizes the change in cost when placing mechanism, and prevents recovery storms. a task. In practice, E-PVM ends up spreading load across The elected master is responsible for preparing messages all the machines, leaving headroom for load spikes – but at to send to the Borglets and for updating the cell’s state with the expense of increased fragmentation, especially for large their responses. For performance scalability, each Borgmas- tasks that need most of the machine; we sometimes call this ter replica runs a stateless link shard to handle the communi- “worst fit”. cation with some of the Borglets; the partitioning is recalcu- The opposite end of the spectrum is “best fit”, which tries lated whenever a Borgmaster election occurs. For resiliency, to fill machines as tightly as possible. This leaves some ma- the Borglet always reports its full state, but the link shards chines empty of user jobs (they still run storage servers), so aggregate and compress this information by reporting only placing large tasks is straightforward, but the tight packing differences to the state machines, to reduce the update load penalizes any mis-estimations in resource requirements by at the elected master. users or Borg. This hurts applications with bursty loads, and If a Borglet does not respond to several poll messages its is particularly bad for batch jobs which specify low CPU machine is marked as down and any tasks it was running needs so they can schedule easily and try to run opportunis- are rescheduled on other machines. If communication is tically in unused resources: 20% of non-prod tasks request restored the Borgmaster tells the Borglet to kill those tasks less than 0.1 CPU cores. that have been rescheduled, to avoid duplicates. A Borglet Our current scoring model is a hybrid one that tries to continues normal operation even if it loses contact with the reduce the amount of stranded resources – ones that cannot Borgmaster, so currently-running tasks and services stay up be used because another resource on the machine is fully even if all Borgmaster replicas fail. allocated. It provides about 3–5% better packing efficiency (defined in ) than best fit for our workloads. 3.4 Scalability If the machine selected by the scoring phase doesn’t have enough available resources to fit the new task, Borg preempts We are not sure where the ultimate scalability limit to Borg’s (kills) lower-priority tasks, from lowest to highest priority, centralized architecture will come from; so far, every time until it does. We add the preempted tasks to the scheduler’s we have approached a limit, we’ve managed to eliminate it. pending queue, rather than migrate or hibernate them.3 A single Borgmaster can manage many thousands of ma- Task startup latency (the time from job submission to chines in a cell, and several cells have arrival rates above a task running) is an area that has received and continues 10 000 tasks per minute. A busy Borgmaster uses 10–14 to receive significant attention. It is highly variable, with CPU cores and up to 50 GiB RAM. We use several tech- the median typically about 25 s. Package installation takes niques to achieve this scale. about 80% of the total: one of the known bottlenecks is Early versions of Borgmaster had a simple, synchronous contention for the local disk where packages are written to. loop that accepted requests, scheduled tasks, and commu- To reduce task startup time, the scheduler prefers to assign nicated with Borglets. To handle larger cells, we split the tasks to machines that already have the necessary packages scheduler into a separate process so it could operate in paral- (programs and data) installed: most packages are immutable lel with the other Borgmaster functions that are replicated for and so can be shared and cached. (This is the only form of failure tolerance. A scheduler replica operates on a cached data locality supported by the Borg scheduler.) In addition, copy of the cell state. It repeatedly: retrieves state changes Borg distributes packages to machines in parallel using tree- from the elected master (including both assigned and pend- and torrent-like protocols. ing work); updates its local copy; does a scheduling pass Additionally, the scheduler uses several techniques to let to assign tasks; and informs the elected master of those as- it scale up to cells with tens of thousands of machines (§3.4). signments. The master will accept and apply these assign- ments unless they are inappropriate (e.g., based on out of 3.3 Borglet date state), which will cause them to be reconsidered in the The Borglet is a local Borg agent that is present on every scheduler’s next pass. This is quite similar in spirit to the machine in a cell. It starts and stops tasks; restarts them if optimistic concurrency control used in Omega , and in- they fail; manages local resources by manipulating OS ker- deed we recently added the ability for Borg to use different nel settings; rolls over debug logs; and reports the state of the schedulers for different workload types. machine to the Borgmaster and other monitoring systems. To improve response times, we added separate threads to talk to the Borglets and respond to read-only RPCs. For 3 Exception: tasks that provide virtual machines for Google Compute En- greater performance, we sharded (partitioned) these func- gine users are migrated. tions across the five Borgmaster replicas §3.3. Together, preemption other machine failure 100 prod machine shutdown out of resources non-prod 80 0 1 2 3 4 5 6 7 8 Percentage of cells Evictions per task-week 60 Figure 3: Task-eviction rates and causes for production and non- 40 production workloads. Data from August 1st 2013. these keep the 99%ile response time of the UI below 1 s 20 and the 95%ile of the Borglet polling interval below 10 s. 0 Several things make the Borg scheduler more scalable: 65 70 75 80 85 90 95 100 Score caching: Evaluating feasibility and scoring a ma- Compacted size [%] chine is expensive, so Borg caches the scores until the prop- Figure 4: The effects of compaction. A CDF of the percentage of erties of the machine or task change – e.g., a task on the ma- original cell size achieved after compaction, across 15 cells. chine terminates, an attribute is altered, or a task’s require- ments change. Ignoring small changes in resource quantities reduces cache invalidations. during maintenance activities such as OS or machine Equivalence classes: Tasks in a Borg job usually have upgrades; identical requirements and constraints, so rather than deter- uses declarative desired-state representations and idem- mining feasibility for every pending task on every machine, potent mutating operations, so that a failed client can and scoring all the feasible machines, Borg only does fea- harmlessly resubmit any forgotten requests; sibility and scoring for one task per equivalence class – a rate-limits finding new places for tasks from machines group of tasks with identical requirements. that become unreachable, because it cannot distinguish Relaxed randomization: It is wasteful to calculate fea- between large-scale machine failure and a network parti- sibility and scores for all the machines in a large cell, so the tion; scheduler examines machines in a random order until it has avoids repeating task::machine pairings that cause task or found “enough” feasible machines to score, and then selects machine crashes; and the best within that set. This reduces the amount of scoring recovers critical intermediate data written to local disk by and cache invalidations needed when tasks enter and leave repeatedly re-running a logsaver task (§2.4), even if the the system, and speeds up assignment of tasks to machines. alloc it was attached to is terminated or moved to another Relaxed randomization is somewhat akin to the batch sam- machine. Users can set how long the system keeps trying; pling of Sparrow while also handling priorities, preemp- a few days is common. tions, heterogeneity and the costs of package installation. A key design feature in Borg is that already-running tasks In our experiments (§5), scheduling a cell’s entire work- continue to run even if the Borgmaster or a task’s Borglet load from scratch typically took a few hundred seconds, but goes down. But keeping the master up is still important did not finish after more than 3 days when the above tech- because when it is down new jobs cannot be submitted niques were disabled. Normally, though, an online schedul- or existing ones updated, and tasks from failed machines ing pass over the pending queue completes in less than half cannot be rescheduled. a second. Borgmaster uses a combination of techniques that enable it to achieve 99.99% availability in practice: replication for 4. Availability machine failures; admission control to avoid overload; and Failures are the norm in large scale systems [10, 11, 22]. deploying instances using simple, low-level tools to mini- Figure 3 provides a breakdown of task eviction causes in mize external dependencies. Each cell is independent of the 15 sample cells. Applications that run on Borg are expected others to minimize the chance of correlated operator errors to handle such events, using techniques such as replication, and failure propagation. These goals, not scalability limita- storing persistent state in a distributed file system, and (if tions, are the primary argument against larger cells. appropriate) taking occasional checkpoints. Even so, we try to mitigate the impact of these events. For example, Borg: automatically reschedules evicted tasks, on a new ma- 5. Utilization chine if necessary; One of Borg’s primary goals is to make efficient use of reduces correlated failures by spreading tasks of a job Google’s fleet of machines, which represents a significant across failure domains such as machines, racks, and financial investment: increasing utilization by a few percent- power domains; age points can save millions of dollars. This section dis- limits the allowed rate of task disruptions and the number cusses and evaluates some of the policies and techniques that of tasks from a job that can be simultaneously down Borg uses to do so. 100 200 prod non-prod baseline unused 80 Percentage of cells Percentage of cell 150 60 100 40 50 20 0 0 A B C D E -10 0 10 20 30 40 50 60 Cell Overhead from segregation [%] (a) The left column for each cell shows the original size and the (b) CDF of additional machines that would be needed if we combined workload; the right one shows the segregated case. segregated the workload of 15 representative cells. Figure 5: Segregating prod and non-prod work into different cells would need more machines. Both graphs show how many extra machines would be needed if the prod and non-prod workloads were sent to separate cells, expressed as a percentage of the minimum number of machines required to run the workload in a single cell. In this, and subsequent CDF plots, the value shown for each cell is derived from the 90%ile of the different cell sizes our experiment trials produced; the error bars show the complete range of values from the trials. 5.1 Evaluation methodology Our jobs have placement constraints and need to handle rare workload spikes, our machines are heterogenous, and we run batch jobs in resources reclaimed from service jobs. So, to evaluate our policy choices we needed a more sophisti- cated metric than “average utilization”. After much exper- imentation we picked cell compaction: given a workload, we found out how small a cell it could be fitted into by removing machines until the workload no longer fitted, re- peatedly re-packing the workload from scratch to ensure that we didn’t get hung up on an unlucky configuration. This provided clean termination conditions and facilitated auto- Figure 6: Segregating users would need more machines. The total number of cells and the additional machines that would be needed mated comparisons without the pitfalls of synthetic work- if users larger than the threshold shown were given their own load generation and modeling. A quantitative compari- private cells, for 5 different cells. son of evaluation techniques can be found in : the details are surprisingly subtle. It wasn’t possible to perform experiments on live produc- we needed a larger cell than the original we cloned the orig- tion cells, but we used Fauxmaster to obtain high-fidelity inal cell a few times before compaction; if we needed more simulation results, using data from real production cells cells, we just cloned the original. and workloads, including all their constraints, actual lim- Each experiment was repeated 11 times for each cell with its, reservations, and usage data (§5.5). This data came different random-number seeds. In the graphs, we use an er- from Borg checkpoints taken on Wednesday 2014-10-01 ror bar to display the min and max of the number of ma- 14:00 PDT. (Other checkpoints produced similar results.) chines needed, and select the 90%ile value as the “result” – We picked 15 Borg cells to report on by first eliminating the mean or median would not reflect what a system admin- special-purpose, test, and small (< 5000 machines) cells, istrator would do if they wanted to be reasonably sure that and then sampled the remaining population to achieve a the workload would fit. We believe cell compaction provides roughly even spread across the range of sizes. a fair, consistent way to compare scheduling policies, and it To maintain machine heterogeneity in the compacted cell translates directly into a cost/benefit result: better policies we randomly selected machines to remove. To maintain require fewer machines to run the same workload. workload heterogeneity, we kept it all, except for server and Our experiments focused on scheduling (packing) a storage tasks tied to a particular machine (e.g., the Borglets). workload from a point in time, rather than replaying a long- We changed hard constraints to soft ones for jobs larger than term workload trace. This was partly to avoid the difficulties half the original cell size, and allowed up to 0.2% tasks to go of coping with open and closed queueing models [71, 79], pending if they were very “picky” and could only be placed partly because traditional time-to-completion metrics don’t on a handful of machines; extensive experiments showed apply to our environment with its long-running services, that this produced repeatable results with low variance. If partly to provide clean signals for making comparisons, 80 100 cell A cell B 60 cell C 80 Percentage of cells cell D Overhead [%] 40 cell E 60 20 40 0 20 2 subcells 5 subcells 10 subcells -20 0 2 4 6 8 10 -50 0 50 100 150 200 250 Sub-cells Overhead from partitioning [%] (a) Additional machines that would be needed as a function of (b) A CDF of additional machines that would be needed to the number of smaller cells for five different original cells. divide each of 15 different cells into 2, 5 or 10 cells. Figure 7: Subdividing cells into smaller ones would require more machines. The additional machines (as a percentage of the single-cell case) that would be needed if we divided these particular cells into a varying number of smaller cells. partly because we don’t believe the results would be sig- for tasks in different environments running on the same ma- nificantly different, and partly a practical matter: we found chine type with the same clock speed. Under these condi- ourselves consuming 200 000 Borg CPU cores for our ex- tions, CPI values are comparable and can be used as a proxy periments at one point—even at Google’s scale, this is a for performance interference, since a doubling of CPI dou- non-trivial investment. bles the runtime of a CPU-bound program. The data was In production, we deliberately leave significant headroom gathered from ∼ 12 000 randomly selected prod tasks over for workload growth, occasional “black swan” events, load a week, counting cycles and instructions over a 5 minute in- spikes, machine failures, hardware upgrades, and large-scale terval using the hardware profiling infrastructure described partial failures (e.g., a power supply bus duct). Figure 4 in , and weighting samples so that every second of CPU shows how much smaller our real-world cells would be if time is counted equally. The results were not clear-cut. we were to apply cell compaction to them. The baselines in (1) We found that CPI was positively correlated with the graphs that follow use these compacted sizes. two measurements over the same time interval: the overall CPU usage on the machine, and (largely independently) the 5.2 Cell sharing number of tasks on the machine; adding a task to a machine Nearly all of our machines run both prod and non-prod tasks increases the CPI of other tasks by 0.3% (using a linear at the same time: 98% of the machines in shared Borg cells, model fitted to the data); increasing machine CPU usage by 83% across the entire set of machines managed by Borg. (We 10% increases CPI by less than 2%. But even though the have a few dedicated cells for special uses.) correlations are statistically significant, they only explain 5% Since many other organizations run user-facing and batch of the variance we saw in CPI measurements; other factors jobs in separate clusters, we examined what would happen if dominate, such as inherent differences in applications and we did the same. Figure 5 shows that segregating prod and specific interference patterns [24, 83]. non-prod work would need 20–30% more machines in the (2) Comparing the CPIs we sampled from shared cells to median cell to run our workload. That’s because prod jobs ones from a few dedicated cells with less diverse applica- usually reserve resources to handle rare workload spikes, but tions, we saw a mean CPI of 1.58 (σ = 0.35) in shared cells don’t use these resources most of the time. Borg reclaims the and a mean of 1.53 (σ = 0.32) in dedicated cells – i.e., CPU unused resources (§5.5) to run much of the non-prod work, performance is about 3% worse in shared cells. so we need fewer machines overall. (3) To address the concern that applications in different Most Borg cells are shared by thousands of users. Figure cells might have different workloads, or even suffer selection 6 shows why. For this test, we split off a user’s workload bias (maybe programs that are more sensitive to interference into a new cell if they consumed at least 10 TiB of mem- had been moved to dedicated cells), we looked at the CPI of ory (or 100 TiB). Our existing policy looks good: even with the Borglet, which runs on all the machines in both types of the larger threshold, we would need 2–16× as many cells, cell. We found it had a CPI of 1.20 (σ = 0.29) in dedicated and 20–150% additional machines. Once again, pooling re- cells and 1.43 (σ = 0.45) in shared ones, suggesting that sources significantly reduces costs. it runs 1.19× as fast in a dedicated cell as in a shared But perhaps packing unrelated users and job types onto one, although this over-weights the effect of lightly loaded the same machines results in CPU interference, and so we machines, slightly biasing the result in favor of dedicated would need more machines to compensate? To assess this, cells. we looked at how the CPI (cycles per instruction) changed 100 100 80 Percentage of clusters 80 Percentage of tasks 60 60 40 prod CPU 40 non-prod CPU 20 prod memory non-prod memory 20 memory-to-CPU-ratio 0 0.01 0.1 1 10 100 1000 0 0 10 20 30 40 50 Requested limit [cores, GiB, GiB/core] Overhead [%] Figure 8: No bucket sizes fit most of the tasks well. CDF of Figure 10: Resource reclamation is quite effective. A CDF of the requested CPU and memory requests across our sample cells. No additional machines that would be needed if we disabled it for 15 one value stands out, although a few integer CPU core sizes are representative cells. somewhat more popular. 100 100 80 Percentage of tasks 80 60 Percentage of cells 40 60 CPU reservation/limit 20 memory reservation/limit 40 CPU usage/limit memory usage/limit 0 20 0 20 40 60 80 100 120 140 upper bound Ratio [%] lower bound 0 -20 0 20 40 60 80 100 120 Figure 11: Resource estimation is successful at identifying unused Overhead [%] resources. The dotted lines shows CDFs of the ratio of CPU and Figure 9: “Bucketing” resource requirements would need more memory usage to the request (limit) for tasks across 15 cells. Most machines. A CDF of the additional overheads that would result tasks use much less than their limit, although a few use more CPU from rounding up CPU and memory requests to the next nearest than requested. The solid lines show the CDFs of the ratio of CPU powers of 2 across 15 cells. The lower and upper bounds straddle and memory reservations to the limits; these are closer to 100%. the actual values (see the text). The straight lines are artifacts of the resource-estimation process. 5.4 Fine-grained resource requests These experiments confirm that performance compar- Borg users request CPU in units of milli-cores, and memory isons at warehouse-scale are tricky, reinforcing the observa- and disk space in bytes. (A core is a processor hyperthread, tions in , and also suggest that sharing doesn’t drastically normalized for performance across machine types.) Figure 8 increase the cost of running programs. shows that they take advantage of this granularity: there are But even assuming the least-favorable of our results, shar- few obvious “sweet spots” in the amount of memory or CPU ing is still a win: the CPU slowdown is outweighed by the cores requested, and few obvious correlations between these decrease in machines required over several different parti- resources. These distributions are quite similar to the ones tioning schemes, and the sharing advantages apply to all re- presented in , except that we see slightly larger memory sources including memory and disk, not just CPU. requests at the 90%ile and above. Offering a set of fixed-size containers or virtual machines, although common among IaaS (infrastructure-as-a-service) 5.3 Large cells providers [7, 33], would not be a good match to our needs. Google builds large cells, both to allow large computations To show this, we “bucketed” CPU core and memory resource to be run, and to decrease resource fragmentation. We tested limits for prod jobs and allocs (§2.4) by rounding them up to the effects of the latter by partitioning the workload for a cell the next nearest power of two in each resource dimension, across multiple smaller cells – by first randomly permuting starting at 0.5 cores for CPU and 1 GiB for RAM. Figure 9 the jobs and then assigning them in a round-robin manner shows that doing so would require 30–50% more resources among the partitions. Figure 7 confirms that using smaller in the median case. The upper bound comes from allocating cells would require significantly more machines. an entire machine to large tasks that didn’t fit after quadru- 160 capacity limit reservation usage CPU [%] 120 80 40 0 160 OOMs Mem [%] 120 80 40 0 20k 10k 0 Week 1 (baseline) Week 2 (aggressive) Week 3 (medium) Week 4 (baseline) Figure 12: More aggressive resource estimation can reclaim more resources, with little effect on out-of-memory events (OOMs). A timeline (starting on 2013-11-11) for one production cell of usage, reservation and limit averaged over 5-minute windows and cumulative out-of- memory events; the slope of the latter is the aggregate rate of OOMs. Vertical bars separate weeks with different resource estimation settings. pling the original cell before compaction began; the lower 4% bound from allowing these tasks to go pending. (This is less [10ms, inf) [5ms, 10ms) than the roughly 100% overhead reported in because we 3% [1ms, 5ms) supported more than 4 buckets and permitted CPU and RAM capacity to scale independently.) 2% 5.5 Resource reclamation 1% A job can specify a resource limit – an upper bound on the resources that each task should be granted. The limit is used 0% 0−20% 20−40% 40−60% 60−80% 80−100% by Borg to determine if the user has enough quota to admit Machine CPU utilization the job, and to determine if a particular machine has enough free resources to schedule the task. Just as there are users Figure 13: Scheduling delays as a function of load. A plot of how often a runnable thread had to wait longer than 1ms to get access to who buy more quota than they need, there are users who a CPU, as a function of how busy the machine was. In each pair of request more resources than their tasks will use, because bars, latency-sensitive tasks are on the left, batch ones on the right. Borg will normally kill a task that tries to use more RAM In only a few percent of the time did a thread have to wait longer or disk space than it requested, or throttle CPU to what it than 5 ms to access a CPU (the white bars); they almost never had asked for. In addition, some tasks occasionally need to use to wait longer (the darker bars). Data from a representative cell for all their resources (e.g., at peak times of day or while coping the month of December 2013; error bars show day-to-day variance. with a denial-of-service attack), but most of the time do not. Rather than waste allocated resources that are not cur- rently being consumed, we estimate how many resources a less than their limits. If this happens, we kill or throttle non- task will use and reclaim the rest for work that can tolerate prod tasks, never prod ones. lower-quality resources, such as batch jobs. This whole pro- Figure 10 shows that many more machines would be cess is called resource reclamation. The estimate is called required without resource reclamation. About 20% of the the task’s reservation, and is computed by the Borgmas- workload (§6.2) runs in reclaimed resources in a median cell. ter every few seconds, using fine-grained usage (resource- We can see more details in Figure 11, which shows the consumption) information captured by the Borglet. The ini- ratio of reservations and usage to limits. A task that exceeds tial reservation is set equal to the resource request (the limit); its memory limit will be the first to be preempted if resources after 300 s, to allow for startup transients, it decays slowly are needed, regardless of its priority, so it is rare for tasks towards the actual usage plus a safety margin. The reserva- to exceed their memory limit. On the other hand, CPU can tion is rapidly increased if the usage exceeds it. readily be throttled, so short-term spikes can push usage The Borg scheduler uses limits to calculate feasibility above reservation fairly harmlessly. (§3.2) for prod tasks,4 so they never rely on reclaimed re- Figure 11 suggests that resource reclamation may be un- sources and aren’t exposed to resource oversubscription; for necessarily conservative: there is significant area between non-prod tasks, it uses the reservations of existing tasks so the reservation and usage lines. To test this, we picked a live the new tasks can be scheduled into reclaimed resources. production cell and adjusted the parameters of its resource A machine may run out of resources at runtime if the estimation algorithm to an aggressive setting for a week by reservations (predictions) are wrong – even if all tasks use reducing the safety margin, and then to an medium setting that was mid-way between the baseline and aggressive set- 4 To be precise, high-priority latency-sensitive ones – see §6.2. tings for the next week, and then reverted to the baseline. Figure 12 shows what happened. Reservations are clearly To help with overload and overcommitment, Borg tasks closer to usage in the second week, and somewhat less so in have an application class or appclass. The most important the third, with the biggest gaps shown in the baseline weeks distinction is between the latency-sensitive (LS) appclasses (1st and 4th). As anticipated, the rate of out-of-memory and the rest, which we call batch in this paper. LS tasks are (OOM) events increased slightly in weeks 2 and 3.5 After used for user-facing applications and shared infrastructure reviewing these results, we decided that the net gains out- services that require fast response to requests. High-priority weighed the downsides, and deployed the medium resource LS tasks receive the best treatment, and are capable of tem- reclamation parameters to other cells. porarily starving batch tasks for several seconds at a time. A second split is between compressible resources (e.g., 6. Isolation CPU cycles, disk I/O bandwidth) that are rate-based and can be reclaimed from a task by decreasing its quality of 50% of our machines run 9 or more tasks; a 90%ile machine service without killing it; and non-compressible resources has about 25 tasks and will be running about 4500 threads (e.g., memory, disk space) which generally cannot be re-. Although sharing machines between applications in- claimed without killing the task. If a machine runs out of creases utilization, it also requires good mechanisms to pre- non-compressible resources, the Borglet immediately termi- vent tasks from interfering with one another. This applies to nates tasks, from lowest to highest priority, until the remain- both security and performance. ing reservations can be met. If the machine runs out of com- pressible resources, the Borglet throttles usage (favoring LS 6.1 Security isolation tasks) so that short load spikes can be handled without killing We use a Linux chroot jail as the primary security isolation any tasks. If things do not improve, Borgmaster will remove mechanism between multiple tasks on the same machine. To one or more tasks from the machine. allow remote debugging, we used to distribute (and rescind) A user-space control loop in the Borglet assigns mem- ssh keys automatically to give a user access to a machine ory to containers based on predicted future usage (for prod only while it was running tasks for the user. For most users, tasks) or on memory pressure (for non-prod ones); handles this has been replaced by the borgssh command, which col- Out-of-Memory (OOM) events from the kernel; and kills laborates with the Borglet to construct an ssh connection to tasks when they try to allocate beyond their memory lim- a shell that runs in the same chroot and cgroup as the task, its, or when an over-committed machine actually runs out locking down access even more tightly. of memory. Linux’s eager file-caching significantly compli- VMs and security sandboxing techniques are used to run cates the implementation because of the need for accurate external software by Google’s AppEngine (GAE) and memory-accounting. Google Compute Engine (GCE). We run each hosted VM in To improve performance isolation, LS tasks can reserve a KVM process that runs as a Borg task. entire physical CPU cores, which stops other LS tasks from using them. Batch tasks are permitted to run on any core, 6.2 Performance isolation but they are given tiny scheduler shares relative to the LS Early versions of Borglet had relatively primitive resource tasks. The Borglet dynamically adjusts the resource caps of isolation enforcement: post-hoc usage checking of memory, greedy LS tasks in order to ensure that they do not starve disk space and CPU cycles, combined with termination of batch tasks for multiple minutes, selectively applying CFS tasks that used too much memory or disk and aggressive ap- bandwidth control when needed ; shares are insufficient plication of Linux’s CPU priorities to rein in tasks that used because we have multiple priority levels. too much CPU. But it was still too easy for rogue tasks to af- Like Leverich , we found that the standard Linux fect the performance of other tasks on the machine, so some CPU scheduler (CFS) required substantial tuning to support users inflated their resource requests to reduce the number of both low latency and high utilization. To reduce schedul- tasks that Borg could co-schedule with theirs, thus decreas- ing delays, our version of CFS uses extended per-cgroup ing utilization. Resource reclamation could claw back some load history , allows preemption of batch tasks by LS of the surplus, but not all, because of the safety margins in- tasks, and reduces the scheduling quantum when multiple LS volved. In the most extreme cases, users petitioned to use tasks are runnable on a CPU. Fortunately, many of our ap- dedicated machines or cells. plications use a thread-per-request model, which mitigates Now, all Borg tasks run inside a Linux cgroup-based re- the effects of persistent load imbalances. We sparingly use source container [17, 58, 62] and the Borglet manipulates the cpusets to allocate CPU cores to applications with particu- container settings, giving much improved control because larly tight latency requirements. Some results of these efforts the OS kernel is in the loop. Even so, occasional low-level are shown in Figure 13. Work continues in this area, adding resource interference (e.g., memory bandwidth or L3 cache thread placement and CPU management that is NUMA-, pollution) still happens, as in [60, 83]. hyperthreading-, and power-aware (e.g., ), and improv- ing the control fidelity of the Borglet. 5 The anomaly at the end of week 3 is unrelated to this experiment. Tasks are permitted to consume resources up to their , a Borg-like scheduler for long running services that runs limit. Most of them are allowed to go beyond that for com- on top of Mesos, with a configuration language and state pressible resources like CPU, to take advantage of unused machine similar to Borg’s. (slack) resources. Only 5% of LS tasks disable this, presum- The Autopilot system from Microsoft provides “au- ably to get better predictability; fewer than 1% of batch tasks tomating software provisioning and deployment; system do. Using slack memory is disabled by default, because it in- monitoring; and carrying out repair actions to deal with creases the chance of a task being killed, but even so, 10% faulty software and hardware” for Microsoft clusters. The of LS tasks override this, and 79% of batch tasks do so be- Borg ecosystem provides similar features, but space pre- cause it’s a default setting of the MapReduce framework. cludes a discussion here; Isaard outlines many best This complements the results for reclaimed resources (§5.5). practices that we adhere to as well. Batch tasks are willing to exploit unused as well as reclaimed Quincy uses a network flow model to provide fairness- memory opportunistically: most of the time this works, al- and data locality-aware scheduling for data-processing DAGs though the occasional batch task is sacrificed when an LS on clusters of a few hundred nodes. Borg uses quota and pri- task needs resources in a hurry. orities to share resources among users and scales to tens of thousands of machines. Quincy handles execution graphs directly while this is built separately on top of Borg. 7. Related work Cosmos focuses on batch processing, with an em- Resource scheduling has been studied for decades, in con- phasis on ensuring that its users get fair access to resources texts as varied as wide-area HPC supercomputing Grids, net- they have donated to the cluster. It uses a per-job manager to works of workstations, and large-scale server clusters. We acquire resources; few details are publicly available. focus here on only the most relevant work in the context of Microsoft’s Apollo system uses per-job schedulers large-scale server clusters. for short-lived batch jobs to achieve high throughput on clus- Several recent studies have analyzed cluster traces from ters that seem to be comparably-sized to Borg cells. Apollo Yahoo!, Google, and Facebook [20, 52, 63, 68, 70, 80, 82], uses opportunistic execution of lower-priority background and illustrate the challenges of scale and heterogeneity in- work to boost utilization to high levels at the cost of (some- herent in these modern datacenters and workloads. con- times) multi-day queueing delays. Apollo nodes provide a tains a taxonomy of cluster manager architectures. prediction matrix of starting times for tasks as a function Apache Mesos splits the resource management and of size over two resource dimensions, which the schedulers placement functions between a central resource manager combine with estimates of startup costs and remote-data- (somewhat like Borgmaster minus its scheduler) and mul- access to make placement decisions, modulated by random tiple “frameworks” such as Hadoop and Spark delays to reduce collisions. Borg uses a central scheduler for using an offer-based mechanism. Borg mostly centralizes placement decisions based on state about prior allocations, these functions using a request-based mechanism that scales can handle more resource dimensions, and focuses on the quite well. DRF [29, 35, 36, 66] was initially developed for needs of high-availability, long-running applications; Apollo Mesos; Borg uses priorities and admission quotas instead. can probably handle a higher task arrival rate. The Mesos developers have announced ambitions to extend Alibaba’s Fuxi supports data-analysis workloads; Mesos to include speculative resource assignment and recla- it has been running since 2009. Like Borgmaster, a cen- mation, and to fix some of the issues identified in. tral FuxiMaster (replicated for failure-tolerance) gathers YARN is a Hadoop-centric cluster manager. Each ap- resource-availability information from nodes, accepts re- plication has a manager that negotiates for the resources it quests from applications, and matches one to the other. The needs with a central resource manager; this is much the same Fuxi incremental scheduling policy is the inverse of Borg’s scheme that Google MapReduce jobs have used to obtain equivalence classes: instead of matching each task to one resources from Borg since about 2008. YARN’s resource of a suitable set of machines, Fuxi matches newly-available manager only recently became fault tolerant. A related open- resources against a backlog of pending work. Like Mesos, source effort is the Hadoop Capacity Scheduler which Fuxi allows “virtual resource” types to be defined. Only syn- provides multi-tenant support with capacity guarantees, hi- thetic workload results are publicly available. erarchical queues, elastic sharing and fairness. YARN has re- Omega supports multiple parallel, specialized “verti- cently been extended to support multiple resource types, pri- cals” that are each roughly equivalent to a Borgmaster minus orities, preemptions, and advanced admission control. its persistent store and link shards. Omega schedulers use The Tetris research prototype supports makespan-aware optimistic concurrency control to manipulate a shared repre- job packing. sentation of desired and observed cell state stored in a cen- Facebook’s Tupperware , is a Borg-like system for tral persistent store, which is synced to/from the Borglets by scheduling cgroup containers on a cluster; only a few details a separate link component. The Omega architecture was de- have been disclosed, although it seems to provide a form signed to support multiple distinct workloads that have their of resource reclamation. Twitter has open-sourced Aurora own application-specific RPC interface, state machines, and refer to arbitrary subsets of a job, which leads to problems scheduling policies (e.g., long-running servers, batch jobs like inflexible semantics for rolling updates and job resizing. from various frameworks, infrastructure services like clus- To avoid such difficulties, Kubernetes rejects the job no- ter storage systems, virtual machines from the Google Cloud tion and instead organizes its scheduling units (pods) using Platform). On the other hand, Borg offers a “one size fits all” labels – arbitrary key/value pairs that users can attach to any RPC interface, state machine semantics, and scheduler pol- object in the system. The equivalent of a Borg job can be icy, which have grown in size and complexity over time as a achieved by attaching a job:jobname label to a set of pods, result of needing to support many disparate workloads, and but any other useful grouping can be represented too, such scalability has not yet been a problem (§3.4). as the service, tier, or release-type (e.g., production, stag- Google’s open-source Kubernetes system places ing, test). Operations in Kubernetes identify their targets by applications in Docker containers onto multiple host means of a label query that selects the objects that the op- nodes. It runs both on bare metal (like Borg) and on various eration should apply to. This approach gives more flexibility cloud hosting providers, such as Google Compute Engine. than the single fixed grouping of a job. It is under active development by many of the same engi- One IP address per machine complicates things. In neers who built Borg. Google offers a hosted version called Borg, all tasks on a machine use the single IP address of Google Container Engine. We discuss how lessons from their host, and thus share the host’s port space. This causes Borg are being applied to Kubernetes in the next section. a number of difficulties: Borg must schedule ports as a re- The high-performance computing community has a long source; tasks must pre-declare how many ports they need, tradition of work in this area (e.g., Maui, Moab, Platform and be willing to be told which ones to use when they start; LSF [2, 47, 50]); however the requirements of scale, work- the Borglet must enforce port isolation; and the naming and loads and fault tolerance are different from those of Google’s RPC systems must handle ports as well as IP addresses. cells. In general, such systems achieve high utilization by Thanks to the advent of Linux namespaces, VMs, IPv6, having large backlogs (queues) of pending work. and software-defined networking, Kubernetes can take a Virtualization providers such as VMware and data- more user-friendly approach that eliminates these complica- center solution providers such as HP and IBM provide tions: every pod and service gets its own IP address, allowing cluster management solutions that typically scale to O(1000) developers to choose ports rather than requiring their soft- machines. In addition, several research groups have proto- ware to adapt to the ones chosen by the infrastructure, and typed systems that improve the quality of scheduling deci- removes the infrastructure complexity of managing ports. sions in certain ways (e.g., [25, 40, 72, 74]). Optimizing for power users at the expense of casual And finally, as we have indicated, another important part ones. Borg provides a large set of features aimed at “power of managing large scale clusters is automation and “operator users” so they can fine-tune the way their programs are run scaleout”. describes how planning for failures, multi- (the BCL specification lists about 230 parameters): the ini- tenancy, health checking, admission control, and restartabil- tial focus was supporting the largest resource consumers at ity are necessary to achieve high numbers of machines per Google, for whom efficiency gains were paramount. Unfor- operator. Borg’s design philosophy is similar and allows us tunately the richness of this API makes things harder for the to support tens of thousands of machines per operator (SRE). “casual” user, and constrains its evolution. Our solution has been to build automation tools and services that run on top of Borg, and determine appropriate settings from experimenta- 8. Lessons and future work tion. These benefit from the freedom to experiment afforded In this section we recount some of the qualitative lessons by failure-tolerant applications: if the automation makes a we’ve learned from operating Borg in production for more mistake it is a nuisance, not a disaster. than a decade, and describe how these observations have 8.2 Lessons learned: the good been leveraged in designing Kubernetes. On the other hand, a number of Borg’s design features have been remarkably beneficial and have stood the test of time. 8.1 Lessons learned: the bad Allocs are useful. The Borg alloc abstraction spawned We begin with some features of Borg that serve as cautionary the widely-used logsaver pattern (§2.4) and another popular tales, and informed alternative designs in Kubernetes. one in which a simple data-loader task periodically updates Jobs are restrictive as the only grouping mechanism the data used by a web server. Allocs and packages allow for tasks. Borg has no first-class way to manage an entire such helper services to be developed by separate teams. The multi-job service as a single entity, or to refer to related Kubernetes equivalent of an alloc is the pod, which is a instances of a service (e.g., canary and production tracks). resource envelope for one or more containers that are always As a hack, users encode their service topology in the job scheduled onto the same machine and can share resources. name and build higher-level management tools to parse these Kubernetes uses helper containers in the same pod instead names. At the other end of the spectrum, it’s not possible to of tasks in an alloc, but the idea is the same. Cluster management is more than task management. 8.3 Conclusion Although Borg’s primary role is to manage the lifecycles Virtually all of Google’s cluster workloads have switched to of tasks and machines, the applications that run on Borg use Borg over the past decade. We continue to evolve it, and benefit from many other cluster services, including naming have applied the lessons we learned from it to Kubernetes. and load balancing. Kubernetes supports naming and load balancing using the service abstraction: a service has a name and a dynamic set of pods defined by a label selector. Any Acknowledgments container in the cluster can connect to the service using the The authors of this paper performed the evaluations and service name. Under the covers, Kubernetes automatically wrote the paper, but the dozens of engineers who de- load-balances connections to the service among the pods that signed, implemented, and maintained Borg’s components match the label selector, and keeps track of where the pods and ecosystem are the key to its success. We list here just are running as they get rescheduled over time due to failures. those who participated most directly in the design, imple- Introspection is vital. Although Borg almost always mentation, and operation of the Borgmaster and Borglets. “just works,” when something goes wrong, finding the root Our apologies if we missed anybody. cause can be challenging. An important design decision in The initial Borgmaster was primarily designed and im- Borg was to surface debugging information to all users rather plemented by Jeremy Dion and Mark Vandevoorde, with than hiding it: Borg has thousands of users, so “self-help” Ben Smith, Ken Ashcraft, Maricia Scott, Ming-Yee Iu, and has to be the first step in debugging. Although this makes it Monika Henzinger. The initial Borglet was primarily de- harder for us to deprecate features and change internal poli- signed and implemented by Paul Menage. cies that users come to rely on, it is still a win, and we’ve Subsequent contributors include Abhishek Rai, Abhishek found no realistic alternative. To handle the enormous vol- Verma, Andy Zheng, Ashwin Kumar, Beng-Hong Lim, ume of data, we provide several levels of UI and debugging Bin Zhang, Bolu Szewczyk, Brian Budge, Brian Grant, tools, so users can quickly identify anomalous events related Brian Wickman, Chengdu Huang, Cynthia Wong, Daniel to their jobs, and then drill down to detailed event and error Smith, Dave Bort, David Oppenheimer, David Wall, Dawn logs from their applications and the infrastructure itself. Chen, Eric Haugen, Eric Tune, Ethan Solomita, Gaurav Dhi- Kubernetes aims to replicate many of Borg’s introspec- man, Geeta Chaudhry, Greg Roelofs, Grzegorz Czajkowski, tion techniques. For example, it ships with tools such as James Eady, Jarek Kusmierek, Jaroslaw Przybylowicz, Ja- cAdvisor for resource monitoring, and log aggregation son Hickey, Javier Kohen, Jeremy Lau, Jerzy Szczepkowski, based on Elasticsearch/Kibana and Fluentd. The John Wilkes, Jonathan Wilson, Joso Eterovic, Jutta De- master can be queried for a snapshot of its objects’ state. gener, Kai Backman, Kamil Yurtsever, Kenji Kaneda, Ke- Kubernetes has a unified mechanism that all components can van Miller, Kurt Steinkraus, Leo Landa, Liza Fireman, use to record events (e.g., a pod being scheduled, a container Madhukar Korupolu, Mark Logan, Markus Gutschke, Matt failing) that are made available to clients. Sparks, Maya Haridasan, Michael Abd-El-Malek, Michael The master is the kernel of a distributed system. Kenniston, Mukesh Kumar, Nate Calvin, Onufry Wojtaszczyk, Borgmaster was originally designed as a monolithic sys- Patrick Johnson, Pedro Valenzuela, Piotr Witusowski, Praveen tem, but over time, it became more of a kernel sitting at the Kallakuri, Rafal Sokolowski, Richard Gooch, Rishi Gos- heart of an ecosystem of services that cooperate to man- alia, Rob Radez, Robert Hagmann, Robert Jardine, Robert age user jobs. For example, we split off the scheduler and Kennedy, Rohit Jnagal, Roy Bryant, Rune Dahl, Scott Gar- the primary UI (Sigma) into separate processes, and added riss, Scott Johnson, Sean Howarth, Sheena Madan, Smeeta services for admission control, vertical and horizontal auto- Jalan, Stan Chesnutt, Temo Arobelidze, Tim Hockin, Todd scaling, re-packing tasks, periodic job submission (cron), Wang, Tomasz Blaszczyk, Tomasz Wozniak, Tomek Zielonka, workflow management, and archiving system actions for Victor Marmol, Vish Kannan, Vrigo Gokhale, Walfredo off-line querying. Together, these have allowed us to scale Cirne, Walt Drummond, Weiran Liu, Xiaopan Zhang, Xiao up the workload and feature set without sacrificing perfor- Zhang, Ye Zhao, and Zohaib Maya. mance or maintainability. The Borg SRE team has also been crucial, and has in- The Kubernetes architecture goes further: it has an API cluded Adam Rogoyski, Alex Milivojevic, Anil Das, Cody server at its core that is responsible only for processing re- Smith, Cooper Bethea, Folke Behrens, Matt Liggett, James quests and manipulating the underlying state objects. The Sanford, John Millikin, Matt Brown, Miki Habryn, Pe- cluster management logic is built as small, composable ter Dahl, Robert van Gent, Seppi Wilhelmi, Seth Hettich, micro-services that are clients of this API server, such as Torsten Marek, and Viraj Alankar. The Borg configuration the replication controller, which maintains the desired num- language (BCL) and borgcfg tool were originally developed ber of replicas of a pod in the face of failures, and the node by Marcel van Lohuizen and Robert Griesemer. controller, which manages the machine lifecycle. We thank our reviewers (especially Eric Brewer, Malte Schwarzkopf and Tom Rodeheffer), and our shepherd, Chris- tos Kozyrakis, for their feedback on this paper. References CFS per-entity load patches. O. A. Abdul-Rahman and K. Aida. Towards understanding http://lwn.net/Articles/531853, 2013. the usage behavior of Google cloud users: the mice and cgroups. http://en.wikipedia.org/wiki/Cgroups, elephants phenomenon. In Proc. IEEE Int’l Conf. on Cloud 2014. Computing Technology and Science (CloudCom), pages C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, 272–277, Singapore, Dec. 2014. R. Bradshaw, and N. Weizenbaum. FlumeJava: easy, efficient Adaptive Computing Enterprises Inc., Provo, UT. Maui data-parallel pipelines. In Proc. ACM SIGPLAN Conf. on Scheduler Administrator’s Guide, 3.2 edition, 2011. Programming Language Design and Implementation (PLDI), T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, pages 363–375, Toronto, Ontario, Canada, 2010. J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom,