Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds
Document Details
Carnegie Mellon University
2017
Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, Onur Mutlu
Summary
This paper presents Gaia, a geo-distributed machine learning system designed to effectively run machine learning algorithms across multiple data centers. The system aims to minimize communication over WANs and maintain accuracy while enabling a wide range of machine learning algorithms to be run without modification. Key features include an intelligent communication mechanism and a new synchronization model, Approximate Synchronous Parallel (ASP), to efficiently use WAN bandwidth.
Full Transcript
Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, and Phillip B. Gibbons, Carnegie Mellon University; Onur Mutlu, ETH Zurich and Carnegie Mellon University https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/hsieh This paper is included in the Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’17). March 27–29, 2017 Boston, MA, USA ISBN 978-1-931971-37-9 Open access to the Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation is sponsored by USENIX. Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Kevin Hsieh† Aaron Harlap† Nandita Vijaykumar† Dimitris Konomis† Gregory R. Ganger† Phillip B. Gibbons† Onur Mutlu§† † Carnegie Mellon University § ETH Zürich Abstract minimize their service latency to end-users, and store Machine learning (ML) is widely used to derive useful massive quantities of data all over the globe [31, 33, 36, information from large-scale data (such as user activities, 41, 57, 58, 71–73, 76]. pictures, and videos) generated at increasingly rapid rates, A commonly-used approach to run an ML application all over the world. Unfortunately, it is infeasible to move over such rapidly generated data is to centralize all data all this globally-generated data to a centralized data center into one data center over wide-area networks (WANs) before running an ML algorithm over it—moving large before running the ML application [9,12,44,68]. However, amounts of raw data over wide-area networks (WANs) can this approach can be prohibitively difficult because: (1) be extremely slow, and is also subject to the constraints WAN bandwidth is a scarce resource, and hence moving all of privacy and data sovereignty laws. This motivates the data can be extremely slow [12,57]. Furthermore, the fast need for a geo-distributed ML system spanning multiple growing rate of image and video generation will eventually data centers. Unfortunately, communicating over WANs saturate the total WAN bandwidth, whose growth has can significantly degrade ML system performance (by as been decelerating for many years [67, 73]. (2) Privacy much as 53.7× in our study) because the communication and data sovereignty laws in some countries prohibit overwhelms the limited WAN bandwidth. transmission of raw data across national or continental Our goal in this work is to develop a geo-distributed borders [12, 72, 73]. ML system that (1) employs an intelligent communication This motivates the need to distribute an ML system mechanism over WANs to efficiently utilize the scarce across multiple data centers, globally. In such a system, WAN bandwidth, while retaining the accuracy and cor- large amounts of raw data are stored locally in different rectness guarantees of an ML algorithm; and (2) is generic data centers, and the ML algorithms running over the and flexible enough to run a wide range of ML algorithms, distributed data communicate between data centers using without requiring any changes to the algorithms. WANs. Unfortunately, existing large-scale distributed ML To this end, we introduce a new, general geo-distributed systems [5, 13, 45, 47, 50, 77] are suitable only for data ML system, Gaia, that decouples the communication residing within a single data center. 
Our experiments using within a data center from the communication between three state-of-the-art distributed ML systems (Bösen , data centers, enabling different communication and con- IterStore , and GeePS ) show that operating these sistency models for each. We present a new ML syn- systems across as few as two data centers (over WANs) chronization model, Approximate Synchronous Parallel can cause a slowdown of 1.8–53.7× (see Section 2.3 and (ASP), whose key idea is to dynamically eliminate in- Section 6) relative to their performance within a data significant communication between data centers while center (over LANs). Existing systems that do address still guaranteeing the correctness of ML algorithms. Our challenges in geo-distributed data analytics [12, 33, 36, experiments on our prototypes of Gaia running across 41, 57, 58, 71–73] do not consider the broad class of 11 Amazon EC2 global regions and on a cluster that important, sophisticated ML algorithms commonly run emulates EC2 WAN bandwidth show that Gaia provides on ML systems — they focus instead on other types of 1.8–53.5× speedup over two state-of-the-art distributed computation, e.g., map-reduce or SQL. ML systems, and is within 0.94–1.40× of the speed of Our goal in this work is to develop a geo-distributed running the same ML algorithm on machines on a local ML system that (1) minimizes communication over WANs, area network (LAN). so that the system is not bottlenecked by the scarce WAN bandwidth; and (2) is general enough to be applicable to 1. Introduction a wide variety of ML algorithms, without requiring any Machine learning (ML) is very widely used across a changes to the algorithms themselves. variety of domains to extract useful information from To achieve these goals, such a system needs to address large-scale data. It has many classes of applications such two key challenges. First, to efficiently utilize the limited as image or video classification (e.g., [24,39,65]), speech (and heterogeneous) WAN bandwidth, we need to find recognition (e.g., ), and topic modeling (e.g., ). an effective communication model that minimizes com- These applications analyze massive amounts of data from munication over WANs but still retains the correctness user activities, pictures, videos, etc., which are generated guarantee for an ML algorithm. This is difficult because at very rapid rates, all over the world. Many large ML algorithms typically require extensive communication organizations, such as Google , Microsoft , and to exchange updates that keep the global ML model suffi- Amazon , operate tens of data centers globally to ciently consistent across data centers. These updates are USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 629 required to be timely, irrespective of the available network synchronization model, Stale Synchronous Parallel bandwidth, to ensure algorithm correctness. Second, we (SSP) , which bounds how stale (i.e., old) a parameter need to design a general system that effectively handles can be, ASP bounds how inaccurate a parameter can be, WAN communication for ML algorithms without requir- in comparison to the most up-to-date value. Hence, it ing any algorithm changes. This is challenging because provides high flexibility in performing (or not performing) the communication patterns vary significantly across dif- updates, as the server can delay synchronization indefi- ferent ML algorithms [37, 54, 60, 64, 66, 69]. 
Altering nitely as long as the aggregated update is insignificant. the communication across systems can lead to different We build two prototypes of Gaia on top of two state- tradeoffs and consequences for different algorithms. of-the-art parameter server systems, one specialized for In this work, we introduce Gaia, a new general, CPUs and another specialized for GPUs. We geo-distributed ML system that is designed to effi- deploy Gaia across 11 regions on Amazon EC2, and on ciently operate over a collection of data centers. Gaia a local cluster that emulates the WAN bandwidth across builds on the widely used parameter server architecture different Amazon EC2 regions. Our evaluation with three (e.g., [5, 6, 13, 16, 17, 20, 34, 45, 74, 77]) that provides ML popular classes of ML algorithms shows that, compared worker machines with a distributed global shared memory to two state-of-the-art parameter server systems [17, 18] abstraction for the ML model parameters they collectively deployed on WANs, Gaia: (1) significantly improves train until convergence to fit the input data. The key idea performance, by 1.8–53.5×, (2) has performance within of Gaia is to maintain an approximately-correct copy of 0.94–1.40× of running the same ML algorithm on a LAN the global ML model within each data center, and dynam- in a single data center, and (3) significantly reduces the ically eliminate any unnecessary communication between monetary cost of running the same ML algorithm on data centers. Gaia enables this by decoupling the synchro- WANs, by 2.6–59.0×. nization (i.e., communication/consistency) model within We make three major contributions: a data center from the synchronization model between To our knowledge, this is the first work to propose different data centers. This differentiation allows Gaia a general geo-distributed ML system that (1) differ- to run a conventional synchronization model [19, 34, 74] entiates the communication over a LAN from the that maximizes utilization of the more-freely-available communication over WANs to make efficient use of LAN bandwidth within a data center. At the same time, the scarce and heterogeneous WAN bandwidth, and across different data centers, Gaia employs a new synchro- (2) is general and flexible enough to deploy a wide nization model, called Approximate Synchronous Parallel range of ML algorithms while requiring no change (ASP), which makes more efficient use of the scarce and to the ML algorithms themselves. heterogeneous WAN bandwidth. By ensuring that each We propose a new, efficient ML synchronization ML model copy in different data centers is approximately model, Approximate Synchronous Parallel (ASP), for correct based on a precise notion defined by ASP, we communication between parameter servers across data guarantee ML algorithm convergence. centers over WANs. ASP guarantees that each data ASP is based on a key finding that the vast majority of center’s view of the ML model parameters is approx- updates to the global ML model parameters from each ML imately the same as the “fully-consistent” view and worker machine are insignificant. For example, our study ensures that all significant updates are synchronized of three classes of ML algorithms shows that more than in time. We prove that ASP provides a theoretical 95% of the updates produce less than a 1% change to the guarantee on algorithm convergence for a widely used parameter value. With ASP, these insignificant updates ML algorithm, stochastic gradient descent. 
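To make the significance filter concrete, the sketch below (Python; class and method names are illustrative and not taken from the paper's implementation) accumulates each parameter's local updates and only releases them for cross-data-center synchronization once the aggregated change exceeds the current threshold. Following the paper, the threshold starts at a programmer-supplied value v (e.g., 1%) and tightens as v/√t over iterations t.

```python
# Sketch of ASP's significance filter (illustrative, not the authors' code).
# Updates to a parameter are accumulated locally; only when the aggregated
# update changes the parameter value by more than the current threshold is
# it shared with the other data centers as a mirror update.
import math

class SignificanceFilter:
    def __init__(self, initial_threshold=0.01):
        self.v = initial_threshold          # initial hard threshold (e.g., 1%)
        self.accumulated = {}               # parameter key -> aggregated update

    def threshold(self, iteration):
        # Paper: if the original threshold is v, the threshold at iteration t
        # is v / sqrt(t), so the filter tightens over time.
        return self.v / math.sqrt(max(iteration, 1))

    def significance(self, agg_update, current_value):
        # Default significance function: magnitude relative to current value.
        if current_value == 0.0:
            return float("inf")
        return abs(agg_update) / abs(current_value)

    def on_update(self, key, update, current_value, iteration):
        """Accumulate a local update; return the aggregated update if it has
        become significant (and should be mirrored to other data centers)."""
        agg = self.accumulated.get(key, 0.0) + update
        if self.significance(agg, current_value) > self.threshold(iteration):
            self.accumulated[key] = 0.0     # reset after sharing
            return agg                      # caller sends a MIRROR UPDATE
        self.accumulated[key] = agg
        return None                         # insignificant: keep it local

# Example: a 0.3% change is held back at a 1% threshold; the next update
# pushes the aggregate past the threshold and it is shared.
f = SignificanceFilter(initial_threshold=0.01)
assert f.on_update("w[42]", 0.003, current_value=1.0, iteration=1) is None
agg = f.on_update("w[42]", 0.009, current_value=1.0, iteration=1)
assert agg is not None and abs(agg - 0.012) < 1e-9
```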
to the same parameter within a data center are aggregated We build two prototypes of our proposed system on (and thus not communicated to other data centers) until the CPU-based and GPU-based ML systems, and we aggregated updates are significant enough. ASP allows the demonstrate their effectiveness over 11 globally dis- ML programmer to specify the function and the threshold tributed regions with three popular ML algorithms. to determine the significance of updates for each ML We show that our system provides significant perfor- algorithm, while providing default configurations for mance improvements over two state-of-the-art dis- unmodified ML programs. For example, the programmer tributed ML systems [17,18], and significantly reduces can specify that all updates that produce more than a the communication overhead over WANs. 1% change are significant. ASP ensures all significant updates are synchronized across all model copies in a 2. Background and Motivation timely manner. It dynamically adapts communication We first introduce the architectures of widely-used dis- to the available WAN bandwidth between pairs of data tributed ML systems. We then discuss WAN bandwidth centers and uses special selective barrier and mirror clock constraints and study the performance implications of control messages to ensure algorithm convergence even running two state-of-the-art ML systems over WANs. during a period of sudden fall (negative spike) in available 2.1. Distributed Machine Learning Systems WAN bandwidth. While ML algorithms have different types across different In contrast to a state-of-the-art communication-efficient domains, almost all have the same goal—searching for 630 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association the best model (usually a set of parameters) to describe icantly slow down the workers and reduce the benefits or explain the input data. For example, the goal of of parallelism. The trade-off between fresher updates an image classification neural network is to find the pa- and communication overhead leads to three major syn- rameters (of the neural network) that can most accurately chronization models: (1) Bulk Synchronous Parallel classify the input images. Most ML algorithms iteratively (BSP) , which synchronizes all updates after each refine the ML model until it converges to fit the data. The worker goes through its shard of data; all workers need to correctness of an ML algorithm is thus determined by see the most up-to-date model before proceeding to the whether or not the algorithm can accurately converge to next iteration, (2) Stale Synchronous Parallel (SSP) , the best model for its input data. which allows the fastest worker to be ahead of the slowest As the input data to an ML algorithm is usually enor- worker by up to a bounded number of iterations, so the mous, processing all input data on a single machine can fast workers may proceed with a bounded stale (i.e., old) take an unacceptably long time. Hence, the most common model, and (3) Total Asynchronous Parallel (TAP) , strategy to run a large-scale ML algorithm is to distribute which removes the synchronization between workers com- the input data among multiple worker machines, and pletely; all workers keep running based on the results of have each machine work on a shard of the input data best-effort communication (i.e., each sends/receives as in parallel with other machines. The worker machines many updates as possible). 
Both BSP and SSP guarantee communicate with each other periodically to synchronize algorithm convergence [19, 34], while there is no such the updates from other machines. This strategy, called guarantee for TAP. Most state-of-the-art parameter servers data parallelism , is widely used in many popular implement both BSP and SSP (e.g., [5,16–18, 34,45, 77]). ML systems (e.g., [1, 2, 5, 13, 45, 47, 50, 77]). As discussed in Section 1, many ML applications There are many large-scale distributed ML systems, need to analyze geo-distributed data. For instance, an such as ones using the MapReduce abstraction (e.g., image classification system would use pictures located at MLlib and Mahout ), ones using the graph abstrac- different data centers as its input data to keep improving tion (e.g., GraphLab and PowerGraph ), and ones its classification using the pictures generated continuously using the parameter server abstraction (e.g., Petuum all over the world. Figure 1b depicts the straightforward and TensorFlow ). Among them, the parameter server approach to achieve this goal. In this approach, the worker architecture provides a performance advantage1 over other machines in each data center (i.e., within a LAN) handle systems for many ML applications and has been widely the input data stored in the corresponding data center. The adopted in many ML systems. parameter servers are evenly distributed across multiple Figure 1a illustrates the high-level overview of the data centers. Whenever the communication between parameter server (PS) architecture. In such an architecture, a worker machine and a parameter server crosses data each parameter server keeps a shard of the global model centers, it does so on WANs. parameters as a key-value store, and each worker machine 2.2. WAN Network Bandwidth and Cost communicates with the parameter servers to READ and WAN bandwidth is a very scarce resource [42, 58, 73] UPDATE the corresponding parameters. The major benefit relative to LAN bandwidth. Moreover, the high cost of of this architecture is that it allows ML programmers to adding network bandwidth has resulted in a deceleration view all model parameters as a global shared memory, and of WAN bandwidth growth. The Internet capacity growth leave the parameter servers to handle the synchronization. has fallen steadily for many years, and the annual growth Data Center 1 Data Center 2 rates have lately settled into the low-30 percent range. Data 1 Data N Data 1 Data N Worker Worker To quantify the scarcity of WAN bandwidth between Worker Worker …… …… Machine 1 Machine N data centers, we measure the network bandwidth between Machine 1 Machine N LAN WAN LAN all pairs of Amazon EC2 sites in 11 different regions (Virginia, California, Oregon, Ireland, Frankfurt, Tokyo, Parameter Parameter Parameter Parameter Server Server Server Server Seoul, Singapore, Sydney, Mumbai, and São Paulo). We use iperf3 to measure the network bandwidth of Global Model Global Model (a) Basic PS architecture (b) Simple PS on WANs each pair of different regions for five rounds, and then Figure 1: Overview of the parameter server architecture calculate the average bandwidth. Figure 2 shows the average network bandwidth between each pair of different Synchronization among workers in a distributed ML regions. We make two observations. system is a critical operation. 
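As a minimal illustration of the READ/UPDATE interface and the SSP staleness bound described above, the sketch below uses illustrative names (it is not the API of IterStore, Bösen, or any other cited system); setting the staleness s to 0 recovers BSP-like behavior.

```python
# Minimal single-process sketch of a parameter server with an SSP staleness
# check (illustrative only; not the interface of any system cited here).
class ParameterServer:
    def __init__(self, num_workers, staleness=0):
        self.params = {}                    # key -> value (the model shard)
        self.clock = [0] * num_workers      # per-worker iteration counter
        self.staleness = staleness          # s = 0 behaves like BSP

    def update(self, key, delta):
        # UPDATE: workers send additive deltas to the shared model.
        self.params[key] = self.params.get(key, 0.0) + delta

    def read(self, worker_id, key):
        # READ: under SSP, a worker more than `staleness` clocks ahead of the
        # slowest worker must wait (here: raises instead of blocking).
        if self.clock[worker_id] > min(self.clock) + self.staleness:
            raise RuntimeError("worker %d must wait for stragglers" % worker_id)
        return self.params.get(key, 0.0)

    def advance_clock(self, worker_id):
        self.clock[worker_id] += 1

# Usage: with staleness=1, worker 0 may run one iteration ahead of worker 1.
ps = ParameterServer(num_workers=2, staleness=1)
ps.update("w", 0.5)
ps.advance_clock(0)                 # worker 0 finishes iteration 0
print(ps.read(0, "w"))              # allowed: only 1 clock ahead of worker 1
ps.advance_clock(0)                 # worker 0 is now 2 clocks ahead
try:
    ps.read(0, "w")
except RuntimeError as e:
    print(e)                        # blocked until worker 1 catches up
```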
Each worker needs to see First, the WAN bandwidth between data centers is 15× other workers’ updates to the global model to compute smaller than the LAN bandwidth within a data center on more accurate updates using fresh information. However, average, and up to 60× smaller in the worst case (for synchronization is a high-cost operation that can signif- Singapore Ö São Paulo). Second, the WAN bandwidth 1 For example, a state-of-the-art parameter server, IterStore , is varies significantly between different regions. The WAN shown to outperform PowerGraph by 10× for Matrix Factorization. bandwidth between geographically-close regions (e.g., In turn, PowerGraph is shown to match the performance of GraphX , Oregon Ö California or Tokyo Ö Seoul) is up to 12× of a Spark based system. the bandwidth between distant regions (e.g., Singapore USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 631 23.8X IterStore Bӧsen 25 Time until Convergence Normalized Execution 24.2X 26.8X 20 Network Bandwidth (Mb/s) 15 13.7X 1000 10 900 5.9X 800 5 3.7X 3.5X 4.4X 2.7X 4.9X 2.3X 4.3X 700 600 São Paulo 0 500 Mumbai Sydney LAN EC2-ALL V/C WAN S/S WAN LAN EC2-ALL V/C WAN S/S WAN 400 Singapore Seoul 300 Tokyo BSP SSP Frankfurt 200 100 Ireland Oregon California Figure 3: Normalized execution time until ML algo- 0 Virginia rithm convergence when deploying two state-of-the-art dis- tributed ML systems on a LAN and WANs for the given system, e.g., Bösen-BSP on EC2-ALL is Figure 2: Measured network bandwidth between Amazon EC2 sites in 11 different regions 5.9× slower than Bösen-BSP on LAN. As we see, both systems suffer significant performance Ö São Paulo). As Section 2.3 shows, the scarcity and degradation when deployed across multiple data centers. variation of the WAN bandwidth can significantly degrade When using BSP, IterStore is 3.5× to 23.8× slower on the performance of state-of-the-art ML systems. WANs than it is on a LAN, and Bösen is 4.4× to 24.2× Another important challenge imposed by WANs is slower. While using SSP can reduce overall execution the monetary cost of communication. In data centers, times of both systems, both systems still show significant the cost of WANs far exceeds the cost of a LAN and slowdown when run on WANs (2.3× to 13.7× for Iter- makes up a significant fraction of the overall cost. Store, and 4.3× to 26.8× for Bösen). We conclude that Cloud service providers, such as Amazon EC2, charge simply running state-of-the-art distributed ML systems an extra fee for WAN communication while providing on WANs can seriously slow down ML applications, and LAN communication free of charge. The cost of WAN thus we need a new distributed ML system that can be communication can be much higher than the cost of effectively deployed on WANs. the machines themselves. For example, the cost of two machines in Amazon EC2 communicating at the rate of 3. Our Approach: Gaia the average WAN bandwidth between data centers is up to We introduce Gaia, a general ML system that can be effec- 38× of the cost of renting these two machines. These tively deployed on WANs to address the increasing need costs make running ML algorithms on WANs much more to run ML applications directly on geo-distributed data. expensive than running them on a LAN. We identify two key challenges in designing such a system (Section 3.1). We then introduce the system architecture 2.3. 
ML System Performance on WANs of Gaia, which differentiates the communication within We study the performance implications of deploying dis- a data center from the communication between different tributed ML systems on WANs using two state-of-the-art centers (Section 3.2). Our approach is based on the key parameter server systems, IterStore and Bösen. empirical finding that the vast majority of communication Our experiments are conducted on our local 22-node within an ML system results in insignificant changes to cluster that emulates the WAN bandwidth between Ama- the state of the global model (Section 3.3). In light of zon EC2 data centers, the accuracy of which is validated this finding, we design a new ML synchronization model, against a real Amazon EC2 deployment (see Section 5.1 called Approximate Synchronous Parallel (ASP), which for details). We run the same ML application, Matrix can eliminate the insignificant communication while en- Factorization (Section 5.2), on both systems. suring the convergence and accuracy of ML algorithms. For each system, we evaluate both BSP and SSP as the We describe ASP in detail in Section 3.4. Finally, Sec- synchronization model (Section 2.1), with four deploy- tion 3.5 summarizes our theoretical analysis of how ASP ment settings: (1) LAN, deployment within a single data guarantees algorithm convergence for a widely-used ML center, (2) EC2-ALL, deployment across 11 aforemen- algorithm, stochastic gradient descent (SGD) (the full tioned EC2 regions, (3) V/C WAN, deployment across two proof is in Appendix A). data centers that have the same WAN bandwidth as that 3.1. Key Challenges between Virginia and California (Figure 2), representing There are two key challenges in designing a general and a distributed ML setting within a continent, and (4) S/S effective ML system on WANs. WAN, deployment across two data centers that have the Challenge 1. How to effectively communicate over same WAN bandwidth as that between Singapore and São WANs while retaining algorithm convergence and ac- Paulo, representing the lowest WAN bandwidth between curacy? As we see above, state-of-the-art distributed any two Amazon EC2 regions. ML systems can overwhelm the scarce WAN bandwidth, Figure 3 shows the normalized execution time until al- causing significant slowdowns. We need a mechanism gorithm convergence across the four deployment settings. that significantly reduces the communication between All results are normalized to IterStore using BSP on a data centers so that the system can provide competitive LAN. The data label on each bar represents how much performance. However, reducing communication can slower the WAN setting is than its respective LAN setting affect the accuracy of an ML algorithm. A poor choice 632 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association Data Center 1 Data Center 2 of synchronization model in a distributed ML system can prevent the ML algorithm from converging to the optimal Data ❶ Global Model Copy Global Model Copy point (i.e., the best model to explain or fit the input data) Shard Worker Machine that one can achieve when using a proper synchronization Parameter Server Parameter Server … model [11, 59]. Thus, we need a mechanism that can Data Shard Worker reduce communication intensity while ensuring that the Machine communication occurs in a timely manner, even when the Data Parameter Server ❷ ASP Parameter Server … network bandwidth is extremely stringent. 
This mecha- Shard Worker Machine ❸ BSP/SSP nism should provably guarantee algorithm convergence irrespective of the network conditions. Figure 4: Gaia system overview Challenge 2. How to make the system generic and work Section 3.4 describes the details of ASP. On the other for ML algorithms without requiring modification? Devel- hand, worker machines and parameter servers within a oping an effective ML algorithm takes significant effort data center synchronize with each other using the con- and experience, making it a large burden for the ML algo- ventional BSP (Bulk Synchronous Parallel) or SSP (Stale rithm developers to change the algorithm when deploying Synchronous Parallel) models (¸). These models allow it on WANs. Our system should work across a wide variety worker machines to quickly observe fresh updates that of ML algorithms, preferably without any change to the happen within a data center. Furthermore, worker ma- algorithms themselves. This is challenging because differ- chines and parameter servers within a data center can ent ML algorithms have different communication patterns, employ more aggressive communication schemes such and the implication of reducing communication can vary as sending updates early and often [19,74] to fully utilize significantly among them [37, 54, 60, 64, 66, 69, 83]. the abundant (and free) network bandwidth on a LAN. 3.2. Gaia System Overview 3.3. Study of Update Significance We propose a new ML system, Gaia, that addresses the two key challenges in designing a general and effec- As discussed above, Gaia reduces the communication tive ML system on WANs. Gaia is built on top the overhead over WANs by eliminating insignificant com- popular parameter server architecture, which is proven munication. To understand the benefit of our approach, to be effective on a wide variety of ML algorithms we study the significance of the updates sent from worker (e.g., [5, 6, 13, 16, 17, 20, 34, 45, 74, 77]). As discussed machines to parameter servers. We study three classes of in Section 2.1, in the parameter server architecture, all popular ML algorithms: Matrix Factorization (MF) , worker machines synchronize with each other through Topic Modeling (TM) , and Image Classification parameter servers to ensure that the global model state is (IC) (see Section 5.2 for descriptions). We run all up-to-date. While this architecture guarantees algorithm the algorithms until convergence, analyze all the updates convergence, it also requires substantial communication sent from worker machines to parameter servers, and between worker machines and parameter servers. To compare the change they cause on the parameter value make Gaia effective on WANs while fully utilizing the when the servers receive them. We define an update to be abundant LAN bandwidth, we design a new system ar- significant if it causes S% change on the parameter value, chitecture to decouple the synchronization within a data and we vary S, the significance threshold, between 0.01 center (LANs) from the synchronization across different and 10. Figure 5 shows the percentage of insignificant data centers (WANs). updates among all updates, for different values of S. Matrix Factorization Topic Modeling Image Classification Insignificant Updates Figure 4 shows an overview of Gaia. In Gaia, each data 100% center has some worker machines and parameter servers. Percentage of 80% Each worker machine processes a shard of the input 60% data stored in its data center to achieve data parallelism 40% (Section 2.1). 
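The Section 3.3 measurement described above can be sketched in a few lines: given a trace of (update, current parameter value) pairs observed at the servers, count the fraction of updates whose relative change falls below each significance threshold S. The code below is a simplified illustration, not the paper's actual instrumentation.

```python
# Sketch of the Section 3.3 study: what fraction of updates change a
# parameter by less than S% when they reach the servers? (Illustrative only.)
def insignificant_fraction(trace, thresholds):
    """trace: iterable of (update, current_value) pairs seen at the servers.
    thresholds: significance thresholds S, e.g., 0.01 for a 1% change."""
    counts = {s: 0 for s in thresholds}
    total = 0
    for update, value in trace:
        total += 1
        rel_change = abs(update) / abs(value) if value != 0.0 else float("inf")
        for s in thresholds:
            if rel_change < s:
                counts[s] += 1
    return {s: counts[s] / total for s in thresholds}

# Toy trace: most updates are tiny relative to the parameter value.
trace = [(0.001, 1.0)] * 95 + [(0.04, 1.0)] * 5
print(insignificant_fraction(trace, thresholds=[0.01, 0.05]))
# -> {0.01: 0.95, 0.05: 1.0}: at S = 1%, 95% of these toy updates are insignificant.
```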
The parameter servers in each data center 20% collectively maintain a version of the global model copy 0% 10% 5% 1% 0.5% 0.1% 0.05% 0.01% (¶), and each parameter server handles a shard of this Threshold of Significant Updates (S) global model copy. A worker machine only READs and Figure 5: Percentage of insignificant updates UPDATEs the global model copy in its data center. To reduce the communication overhead over WANs, As we see, the vast majority of updates in these al- the global model copy in each data center is only ap- gorithms are insignificant. Assuming the significance proximately correct. This design enables us to eliminate threshold is 1%, 95.2% / 95.6% / 97.0% of all updates the insignificant, and thus unnecessary, communication are insignificant for MF / TM / IC. When we relax the across different data centers. We design a new synchro- significance threshold to 5%, 98.8% / 96.1% / 99.3% of nization model, called Approximate Synchronous Parallel all updates are insignificant. Thus, most of the communi- (ASP ·), between parameter servers across different data cation changes the ML model state only very slightly. centers to ensure that each global model copy is approx- It is worth noting that our finding is consistent with imately correct even with very low WAN bandwidth. the findings of prior work [21, 22, 40, 47, 80] on other USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 633 ML algorithms, such as PageRank and Lasso. These center are aware of the significant updates after a bounded works observe that in these ML algorithms, not all model network latency, and they wait only for these updates. parameters converge to their optimal value within the The worker machines can make progress as long as they same number iterations — a property called non-uniform do not depend on any of these parameters. convergence. Instead of examining the convergence Data Center 1 Data Center 2 Data Center 1 Data Center 2 rate, we quantify the significance of updates with var- ❶ Significant Updates ❷ Barrier ious significance thresholds, which provides a unique ❸ Clock N ❹ Clock N + DS opportunity to reduce the communication over WANs. Parameter Parameter Parameter 3.4. Approximate Synchronous Parallel Parameter Server Server Server Server The goal of our new synchronization model, Approxi- mate Synchronous Parallel (ASP), is to ensure that the (a) ASP selective barrier (b) Mirror clock global model copy in each data center is approximately Figure 6: The synchronization mechanisms of ASP correct. In this model, a parameter server shares only Mirror clock. The ASP select barrier ensures that the significant updates with other data centers, and ASP the latency of the significant updates is no more than ensures that these updates can be seen by all data centers the network latency. However, it assumes that 1) the in a timely fashion. ASP achieves this goal by using three underlying WAN bandwidth and latency are fixed so techniques: (1) the significance filter, (2) ASP selective that the network latency can be bounded, and 2) such barrier, and (3) ASP mirror clock. We describe them in latency is short enough so that other data centers can order. be aware of them in time. In practice, WAN bandwidth The significance filter. ASP takes two inputs from an can fluctuate over time , and the WAN latency can ML programmer to determine whether or not an update be intolerably high for some ML algorithms. We need is significant. 
They are: (1) a significance function and a mechanism to guarantee that the worker machines are (2) an initial significance threshold. The significance aware of the significant updates in time, irrespective of function returns the significance of each update. We the WAN bandwidth or latency. define an update as significant if its significance is larger We use the mirror clock (Figure 6b) to provide this than the threshold. For example, an ML programmer can guarantee. When each parameter server receives all the define the significance function as the update’s magnitude updates from its local worker machines at the end of a relative to the current value (| UValue pdate |), and set the initial clock (e.g., an iteration), it reports its clock to the servers significance threshold to 1%. The significance function that are in charge of the same parameters in the other can be more sophisticated if the impact of parameter data centers. When a server detects its clock is ahead changes to the model is not linear, or the importance of of the slowest server that shares the same parameters parameters is non-uniform (see Section 4.3). A parameter by a predefined threshold DS (data center staleness), the server aggregates updates from the local worker machines server blocks its local worker machines from reading and shares the aggregated updates with other data centers its parameters until the slowest mirror server catches up. when the aggregated updates become significant. To In the example of Figure 6b, the server clock in Data ensure that the algorithm can converge to the optimal point, Center 1 is N, while the server clock in Data Center 2 ASP automatically reduces the significance threshold over is (N + DS). As their difference reaches the predefined time (specifically, if the original threshold is v, then√ the limit, the server in Data Center 2 blocks its local worker threshold at iteration t of the ML algorithm is v/ t). from reading its parameters. This mechanism is similar ASP selective barrier. While we can greatly reduce to the concept of SSP , but we use it only as the last the communication overhead over WANs by sending only resort to guarantee algorithm convergence. the significant updates, the WAN bandwidth might still be insufficient for such updates. In such a case, the 3.5. Summary of Convergence Proof significant updates can arrive too late, and we might In this section, we summarize our proof showing that a not be able to bound the deviation between different popular, broad class of ML algorithms are guaranteed to global model copies. ASP handles this case with the converge under our new ASP synchronization model. The ASP selective barrier (Figure 6a) control message. When class we consider are ML algorithms expressed as convex a parameter server receives the significant updates (¶) optimization problems that are solved using distributed at a rate that is higher than the WAN bandwidth can stochastic gradient descent. support, the parameter server first sends the indexes of The proof follows the outline of prior work on SSP , these significant updates (as opposed to sending both with a new challenge, i.e., our new ASP synchronization the indexes and the update values together) via an ASP model allows the synchronization of insignificant updates selective barrier (·) to the other data centers. The receiver to be delayed indefinitely. 
To prove algorithm conver- of an ASP selective barrier blocks its local worker from gence, our goal is to show that the distributed execution reading the specified parameters until it receives the of an ML algorithm results in a set of parameter values significant updates from the sender of the barrier. This that are very close (practically identical) to the values technique ensures that all worker machines in each data that would be obtained under a serialized execution. 634 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association Let f denote the objective function of an optimization 4.2. System Operations and Communication problem, whose goal is to minimize f. Let x˜t denote the We present a walkthrough of major system operations sequence of noisy (i.e., inaccurate) views of the parameters, and communication. where t = 1, 2,..., T is the index of each view over time. UPDATE from a worker machine. When a local server Let x ∗ denote the value that minimizes f. Intuitively, we (¶) receives a parameter update from a worker machine, would like ft (x˜t ) to approach f (xx∗ ) as t → ∞. We call it updates the parameter in its parameter store (¹), which the difference between ft (x˜t ) and f (xx∗ ) regret. We can maintains the parameter value and its accumulated update. prove ft (x˜t ) approaches f (xx∗ ) as t → ∞ by proving that The local server then invokes the significance filter (º) the average regret, R[X]T → 0 as T → ∞. to determine whether or not the accumulated update of Mathematically, the above intuition is formulated with this parameter is significant. If it is, the significance filter Theorem 1. The details of the proof and the notations sends a MIRROR UPDATE request to the mirror client (¸) are in Appendix A. and resets the accumulated update for this parameter. Messages from the significance filter. The signifi- Theorem 1. (Convergence of SGD under ASP). Suppose cance filter sends out three types of messages. First, as that, in order to compute the minimizer x∗ of a convex T discussed above, it sends a MIRROR UPDATE request to function f (xx) = ∑t=1 ft (xx), with ft ,t = 1, 2,... , T , convex, the mirror client through the data queue (¼). Second, we use stochastic gradient descent on one component when the significance filter detects that the arrival rate of ∇ ft at a time. Suppose also that 1) the algorithm is significant updates is higher than the underlying WAN distributed in D data centers, each of which uses P bandwidth that it monitors at every iteration, it first sends machines, 2) within each data center, the SSP protocol is an ASP Barrier (Section 3.4) to the control queue (») used, with a fixed staleness of s, and 3) a fixed mirror clock before sending the MIRROR UPDATE. The mirror client (¸) difference ∆c is allowed between any two data centers. prioritizes the control queue over the data queue, so that Let ut = −ηt ∇ ft (x˜t ), where the step size ηt decreases the barrier is sent out earlier than the update. Third, to as ηt = √ηt and the significance threshold vt decreases maintain the mirror clock (Section 3.4), the significance as vt = √vt. If we further assume that: k∇ ft (xx)k ≤ L, filter also sends a MIRROR CLOCK request to the control ∀xx ∈ dom( ft ) and max(D(xx, x 0 )) ≤ ∆2 , ∀xx, x 0 ∈ dom( ft ). queue at the end of each clock in the local server. Then, as T → ∞, the regret R[X] = ∑t=1 T ft (x˜t ) − f (xx∗ ) = Operations in the mirror client. 
The mirror client √ R[X] thread wakes up when there is a request from the control O( T ) and therefore limT →∞ T → 0. queue or the data queue. Upon waking up, the mirror client walks through the queues, packs together the messages 4. Implementation to the same destination, and sends them. We introduce the key components of Gaia in Section 4.1, Operations in the mirror server. The mirror server and discuss the operation and design of individual com- handles above messages (MIRROR UPDATE, ASP BARRIER, ponents in the remaining sections. and MIRROR CLOCK) according to our ASP model. For 4.1. Gaia System Key Components MIRROR UPDATE, it applies the update to the correspond- Figure 7 presents the key components of Gaia. All of ing parameter in the parameter store. For ASP BARRIER, the key components are implemented in the parameter it sets a flag in the parameter store to block the corre- servers, and can be transparent to the ML programs and sponding parameter from being read until it receives the the worker machines. As we discuss above, we decouple corresponding MIRROR UPDATE. For MIRROR CLOCK, the the synchronization within a data center (LANs) from mirror server updates its local mirror clock state for each the synchronization across different data centers (WANs). parameter server in other data centers, and enforces the The local server (¶) in each parameter server handles predefined clock difference threshold DS (Section 3.4). the synchronization between the worker machines in the 4.3. Advanced Significance Functions same data center using the conventional BSP or SSP As we discuss in Section 3.4, the significance filter allows models. On the other hand, the mirror server (·) and the the ML programmer to specify a custom significance mirror client (¸) handle the synchronization with other function to calculate the significance of each update. By data centers using our ASP model. Each of these three providing an advanced significance function, Gaia can be components runs as an individual thread. more effective at eliminating the insignificant communica- Data Center Boundary tion. If several parameters are always referenced together Gaia Parameter Server Gaia Parameter Worker ❶ ❹ Server to calculate the next update, the significance function can Local Machine Server Parameter Store take into account the values of all these parameters. For Worker ❷Mirror example, if three parameters a, b, and c are always used Machine ❺ ❻ Control Server … as a · b · c in an ML algorithm, the significance of a, b, Significance Queue Worker Filter Mirror Client and c can be calculated as the change on a · b · c. If one Data Machine Queue ❸ of them is 0, any change in another parameter, however ❼ large it may be, is insignificant. Similar principles can Figure 7: Key components of Gaia be applied to model parameters that are non-linear or USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 635 non-uniform. For unmodified ML programs, the system charge of aggregating all the significant updates within applies default significance functions, such as the relative the group, and sending to the hubs of the other groups. magnitude of an update for each parameter. Similarly, a hub data center broadcasts the aggregated significant updates from other groups to the other data 4.4. Tuning of Significance Thresholds centers within its group. 
Each data center group can The user of Gaia can specify two different goals for Gaia: designate different hubs for communication with different (1) speed up algorithm convergence by fully utilizing the data center groups, so the system can utilize more links available WAN bandwidth and (2) minimize the commu- within a data center group. For example, the data centers nication cost on WANs. In order to achieve either of these in Virginia, California, and Oregon can form a data center goals, the significance filter maintains two significance group and assign the data center in Virginia as the hub thresholds and dynamically tunes these thresholds. The to communicate with the data centers in Europe and the first threshold is the hard significance threshold. The data center in Oregon as the hub to communicate with the purpose of this threshold is to guarantee ML algorithm data centers is Asia. This design allows Gaia to broadcast convergence. As we discuss in our theoretical analysis the significant updates with lower communication cost. (Section 3.5), the initial threshold is provided by the ML programmer or a default system setting, and the signif- 5. Methodology icance filter reduces it over time. Every update whose 5.1. Experiment Platforms significance is above the hard threshold is guaranteed to We use three different platforms for our evaluation. be sent to other data centers. The second threshold is the Amazon-EC2. We deploy Gaia to 22 machines spread soft significance threshold. The purpose of it is to use across 11 EC2 regions as we show in Figure 2. In each underutilized WAN bandwidth to speed up convergence. EC2 region we start two instances of type c4.4xlarge This threshold is tuned based on the arrival rate of the or m4.4xlarge , depending on their availability. Both significant updates and the underlying WAN bandwidth. types of instances have 16 CPU cores and at least 30GB When the user chooses to optimize the first goal (speed RAM, running 64-bit Ubuntu 14.04 LTS (HVM). In all, up algorithm convergence), the system lowers the soft sig- our deployment uses 352 CPU cores and 1204 GB RAM. nificance threshold whenever there is underutilized WAN Emulation-EC2. As the monetary cost of running all bandwidth. The updates whose significance is larger than experiments on EC2 is too high, we run some experiments the soft significance threshold are sent in a best-effort on our local cluster that emulates the computation power manner. On the other hand, if the goal of the system and WAN bandwidth of EC2. We use the same number is to minimize the WAN communication costs, the soft of machines (22) in our local cluster. Each machine is significance threshold is not activated. equipped with a 16-core Intel Xeon CPU (E5-2698), an While the configuration of the initial hard threshold NVIDIA Titan X GPU, 64GB RAM, a 40GbE NIC, and depends on how error tolerant each ML algorithm is, a runs the same OS as above. The computation power and simple and conservative threshold (such as 1%–2%) is the LAN speeds of our machines are higher than the likely to work in most cases. This is because most ML ones we get from EC2, so we slow down the CPU and algorithms initialize their parameters with random values, LAN speeds to match the speeds on EC2. We model and make large changes to their model parameters at the measured EC2 WAN bandwidth (Figure 2) with the early phases. Thus, they are more error tolerant at the Linux Traffic Control tool. As Section 6.1 shows, beginning. 
As Gaia reduces the threshold over time, its our emulation platform gives very similar results to the accuracy loss is limited. An ML expert can choose a results from our real EC2 deployment. more aggressive threshold based on domain knowledge Emulation-Full-Speed. We run some of our experi- of the ML algorithm. ments on our local cluster that emulates the WAN band- 4.5. Overlay Network and Hub width of EC2 at full speed. We use the same settings as Emulation-EC2 except we do not slow down the CPUs While Gaia can eliminate the insignificant updates, each and the LAN. We use this platform to show the results data center needs to broadcast the significant updates to of deployments with more powerful nodes. all the other data centers. This broadcast-based communi- cation could limit the scalability of Gaia when we deploy 5.2. Applications Gaia to many data centers. To make Gaia more scalable We evaluate Gaia with three popular ML applications. with more data centers, we use the concept of overlay Matrix Factorization (MF) is a technique commonly networks. used in recommender systems, e.g., systems that recom- As we discuss in Section 2.2, the WAN bandwidth mend movies to users on Netflix (a.k.a. collaborative between geographically-close regions is much higher filtering). Its goal is to discover latent interactions than that between distant regions. In light of this, Gaia between two entities, such as users and movies, via matrix supports having geographically-close data centers form a factorization. For example, input data can be a partially data center group. Servers in a data center group send filled matrix X, where every entry is a user’s rating for their significant updates only to the other servers in the a movie, each row corresponding to a user, and each same group. Each group has hub data centers that are in column corresponding to a specific movie. Matrix factor- 636 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association ization factorizes X into factor matrices L and R such that changes by less than 2% over the course of 10 iterations, their product approximates X (i.e., X ≈ LR). Like other we declare that the algorithm has converged. In systems [17,32,83], we implement MF using the stochas- order to ensure that each algorithm accurately converges tic gradient descent (SGD) algorithm. Each worker is to the optimal point, we first run each algorithm on our assigned a portion of the known entries in X. The L local cluster until it converges, and we record the absolute matrix is stored locally in each worker, and the R matrix objective value. The execution time of each setting is the is stored in parameter servers. Our experiments use the time it takes to converge to this absolute objective value. Netflix dataset, a 480K-by-18K sparse matrix with 100M The second metric is the cost of algorithm convergence. known entries. They are configured to factor the matrix We calculate the cost based on the cost model of Amazon into the product of two matrices, each with rank 500. EC2 , including the cost of the server time and the Topic Modeling (TM) is an unsupervised method for cost of data transfer on WANs. We provide the details of discovering hidden semantic structures (topics) in an the cost model in Appendix C. unstructured collection of documents, each consisting of a bag (multi-set) of words. TM discovers the topics via 6. Evaluation Results word co-occurrence. 
For example, “policy” is more likely We evaluate the effectiveness of Gaia by evaluating three to co-occur with “government” than “bacteria”, and thus types of systems/deployments: (1) Baseline, two state- “policy” and “government” are categorized to the same of-the-art parameter server systems (IterStore for topic associated with political terms. Further, a document MF and TM, GeePS for IC) that are deployed across with many instances of “policy” would be assigned a topic multiple data centers. Every worker machine handles the distribution that peaks for the politics-related topics. TM data in its data center, while the parameter servers are learns the hidden topics and the documents’ associations distributed evenly across all the data centers; (2) Gaia, with those topics jointly. Common applications for TM our prototype systems based on IterStore and GeePS, include community detection in social networks and news deployed across multiple data centers; and (3) LAN, the categorizations. We implement our TM solver using baseline parameter servers (IterStore and GeePS) that are collapsed Gibbs sampling. We use the Nytimes deployed within a single data center (also on 22 machines) dataset , which has 100M words in 300K documents that already hold all the data, representing the ideal case with a vocabulary size of 100K. Our experiments classify of all communication on a LAN. For each system, we words and documents into 500 topics. evaluate two ML synchronization models: BSP and SSP Image Classification (IC) is a task to classify im- (Section 2.1). For Baseline and LAN, BSP and SSP are ages into categories, and the state-of-the-art approach is used among all worker machines, whereas for Gaia, they using deep learning and convolutional neural networks are used only within each data center. Due to limited (CNNs). Given a set of images with known cate- space, we present the results for BSP in this section and gories (training data), the ML algorithm trains a CNN leave the results for SSP to Appendix B. to learn the relationship between the image features and 6.1. Performance on EC2 Deployment their categories. The trained CNN is then used to predict We first present the performance of Gaia and Baseline the categories of another set of images (test data). We use when they are deployed across 11 EC2 data centers. Fig- GoogLeNet , one of the state-of-the-art CNNs as our ure 8 shows the normalized execution time until conver- model. We train GoogLeNet using stochastic gradient gence for our ML applications, normalized to Baseline descent with back propagation. As training a CNN on EC2. The data label on each bar is the speedup with a large number of images requires substantial compu- over Baseline for the respective deployment. As Sec- tation, doing so on CPUs can take hundreds of machines tion 5.1 discusses, we run only MF on EC2 due to the over a week. Instead, we use distributed GPUs with high monetary cost of WAN data transfer. Thus, we a popular deep learning framework, Caffe , which present the results of MF on all three platforms, while is hosted by a state-of-the-art GPU-specialized param- we show the results of TM and IC only on our emulation eter server system, GeePS. Our experiments use platforms. As Figure 8a shows, our emulation platform the ImageNet Large Scale Visual Recognition Challenge (Emulation-EC2) matches the execution time of our real 2012 (ILSVRC12) dataset, which consists of 1.3M EC2 deployment (Amazon-EC2) very well. We make two training images and 50K test images. 
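Returning to the MF application described above, its core SGD step looks roughly like the following simplified, single-machine sketch (toy sizes rather than the rank-500 Netflix setup; in the paper's setup the L rows stay on the workers while R is held by the parameter servers, so updates to R are what the servers may need to synchronize across data centers).

```python
# Simplified SGD for matrix factorization X ~= L @ R (illustrative only).
import numpy as np

def sgd_mf(ratings, num_users, num_items, rank=10, lr=0.05, reg=0.05, epochs=100):
    """ratings: list of (user, item, value) for the known entries of X."""
    rng = np.random.default_rng(0)
    L = 0.1 * rng.standard_normal((num_users, rank))   # worker-local factor
    R = 0.1 * rng.standard_normal((rank, num_items))   # parameter-server factor
    for _ in range(epochs):
        for u, i, x in ratings:
            err = x - L[u] @ R[:, i]                    # prediction error
            grad_L = -err * R[:, i] + reg * L[u]
            grad_R = -err * L[u] + reg * R[:, i]
            L[u]    -= lr * grad_L                      # applied locally
            R[:, i] -= lr * grad_R                      # sent as an UPDATE to the PS
    return L, R

# Toy usage: factor a 3x3 matrix with 4 known entries.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 2, 2.0)]
L, R = sgd_mf(ratings, num_users=3, num_items=3)
print(round(float(L[0] @ R[:, 0]), 2))  # moves toward 5.0; exact value depends
                                        # on the learning rate and regularization
```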
Each image is major observations. labeled as one of the 1,000 pre-defined categories. First, we find that Gaia significantly improves the 5.3. Performance Metrics and Algorithm Conver- performance of Baseline when deployed globally across gence Criteria many EC2 data centers. For MF, Gaia provides a speedup We use two performance metrics to evaluate the effective- of 2.0× over Baseline. Furthermore, the performance of ness of a globally distributed ML system. The first metric Gaia is very similar to the performance of LAN, indicating is the execution time until algorithm convergence. We that Gaia almost attains the performance upper bound use the following algorithm convergence criterion, based with the given computation resources. For TM, Gaia on guidance from our ML experts: if the value of the delivers a similar speedup (2.0×) and is within 1.25× of objective function (the objective value) in an algorithm the ideal speed of LAN. For IC, Gaia provides a speedup USENIX Association 14th USENIX Symposium on Networked Systems Design and Implementation 637 Nromalized Execution Time Nromalized Execution Time Nromalized Execution Time 1 Amazon-EC2 1 1 Emulation-EC2 Emulation-EC2 0.9 Emulation-EC2 0.9 0.9 0.8 0.8 Emulation-Full-Speed 0.8 Emulation-Full-Speed Emulation-Full-Speed 0.7 0.7 0.7 0.6 2.0X 1.8X 2.0X 1.8X 0.6 2.0X 0.6 0.5 0.5 2.5X 0.5 0.4 0.4 0.4 0.3 0.3 3.7X 0.3 3.8X 3.7X 4.8X 5.6X 6.0X 0.2 0.2 0.2 7.5X 8.5X 0.1 0.1 0.1 0 0 0 Baseline Gaia LAN Baseline Gaia LAN Baseline Gaia LAN (a) Matrix Factorization (MF) (b) Topic Modeling (TM) (c) Image Classification (IC) Figure 8: Normalized execution time until convergence when deployed across 11 EC2 regions and our emulation cluster Baseline Gaia LAN Baseline Gaia LAN Baseline Gaia LAN of 5.6× over Baseline, which is within 1.32× of the 1 1 0.9 1 0.9 Normalized Exec. Time 0.9 LAN speed, indicating that Gaia is also effective on a 0.8 0.8 0.7 0.8 0.7 0.7 0.6 0.6 GPU-based ML system. The gap between Baseline and 0.6 0.5 0.5 0.5 0.4 0.4 0.4 LAN is larger for IC than for the other two applications. 0.3 0.3 0.3 0.2 0.2 14X 17X 0.2 This is because the GPU-based ML system generates 0.1 25X 24X 0.1 0.1 54X 54X 0 0 0 parameter updates at a higher rate than the CPU-based Matrix Factorization Topic Modeling Image Classification one, and therefore the limited WAN bandwidth slows it Figure 10: Normalized execution time until convergence down more significantly. with the WAN bandwidth between Singapore and São Paulo Second, Gaia provides a higher performance gain when Second, Gaia still performs very well when WAN deployed on a more powerful platform. As Figure 8 shows, bandwidth is low (S/S WAN, Figure 10): Gaia provides a the performance gap between Baseline and LAN signifi- speedup of 25.4× for MF, 14.1× for TM, and 53.5× for cantly increases on Emulation-Full-Speed compared to IC, and successfully approaches LAN performance. These the slower platform Emulation-EC2. This is expected results show that our design is robust for both CPU-based because the WAN bandwidth becomes a more critical and GPU-based ML systems, and it can deliver high bottleneck when the computation time reduces and the performance even under scarce WAN bandwidth. LAN bandwidth increases. Gaia successfully mitigates Third, for MF, the performance of Gaia (on WANs) is the WAN bottleneck in this more challenging Emulation- slightly better than LAN performance. 
This is because we Full-Speed setting, and improves the system performance run ASP between different data centers, and the workers by 3.8× for MF, 3.7× for TM, and 6.0× for IC over in each data center need to synchronize only with each Baseline, approaching the speedups provided by LAN. other locally in each iteration. As long as the mirror 6.2. Performance and WAN Bandwidth updates on WANs are timely, each iteration of Gaia can be faster than that of LAN, which needs to synchronize To understand how Gaia performs under different amounts across all workers. While Gaia needs more iterations than of WAN bandwidth, we evaluate two settings where LAN due to the accuracy loss, Gaia can still outperform Baseline and Gaia are deployed across two data centers LAN due to the faster iterations. with two WAN bandwidth configurations: (1) V/C WAN, which emulates the WAN bandwidth between Virginia 6.3. Cost Analysis and California, representing a setting within the same Figure 11 shows the monetary cost of running ML ap- continent; and (2) S/S WAN, which emulates the WAN plications until convergence based on the Amazon EC2 bandwidth between Singapore and São Paulo, representing cost model, normalized to the cost of Baseline on 11 the lowest WAN bandwidth between any two Amazon EC2 regions. Cost is divided into three components: (1) EC2 sites. All the experiments are conducted on our the cost of machine time spent on computation, (2) the emulation platform at full speed. Figures 9 and 10 show cost of machine time spent on waiting for networks, and the results. Three observations are in order. (3) the cost of data transfer across different data centers. Baseline Gaia LAN Baseline Gaia LAN Baseline Gaia LAN 1 1 1 As we discuss in Section 2.2, there is no cost for data Normalized Exec. Time 0.9 0.9 0.9 0.8 0.8 0.8 transfer within a single data center in Amazon EC2. The 0.7 0.7 0.7 0.6 0.6 0.6 data label on each bar shows the factor by which the 0.5 0.5 0.5 0.4 0.4 0.4 cost of Gaia is cheaper than the cost of each respective 3.7X 3.5X 3.7X 3.9X 0.3 0.3 0.3 0.2 0.2 0.2 7.4X 7.4X Baseline. We evaluate all three deployment setups that 0.1 0.1 0.1 0 0 0 we discuss in Sections 6.1 and 6.2. We make two major Matrix Factorization Topic Modeling Image Classification observations. Figure 9: Normalized execution time until convergence with the WAN bandwidth between Virginia and California First, Gaia is very effective in reducing the cost of running a geo-distributed ML application. Across all First, Gaia successfully matches the performance of the evaluated settings, Gaia is 2.6× to 59.0× cheaper LAN when WAN bandwidth is high (V/C WAN). As Fig- than Baseline. Not surprisingly, the major cost saving ure 9 shows, Gaia achieves a speedup of 3.7× for MF, comes from the reduction of data transfer on WANs 3.7× for TM, and 7.4× for IC. For all three ML applica- and the reduction of machine time spent on waiting for tions, the performance of Gaia on WANs is almost the networks. For the S/S WAN setting, the cost of waiting same as LAN performance. 
for networks is a more important factor than the other 638 14th USENIX Symposium on Networked Systems Design and Implementation USENIX Association 2.5 Machine Cost (Compute) 4 Machine Cost (Compute) 2.5 Normliaed Cost Normliaed Cost Normliaed Cost Machine Cost (Compute) 2 3.5 2 Machine Cost (Network) 3 Machine Cost (Network) Machine Cost (Network) Communication Cost 1.5 Communication Cost 2.5 Communication Cost 1.5 2 1 1 1.5 2.6X 1 0.5 4.2X 6.0X 28.5X 0.5 5.7X 18.7X 8.5X 10.7X 59.0X 0.5 0 0 0 Baseline Gaia Baseline Gaia Baseline Gaia Baseline Gaia Baseline Gaia Baseline Gaia Baseline Gaia Baseline Gaia Baseline Gaia EC2-ALL V/C WAN S/S WAN EC2-ALL V/C WAN S/S WAN EC2-ALL V/C WAN S/S WAN (a) Matrix Factorization (MF) (b) Topic Modeling (TM) (c) Image Classification (IC) Figure 11: Normalized monetary cost of Gaia vs. Baseline two settings, because it takes more time to transfer the setting because each data center has only a small fraction same amount of data under low WAN bandwidth. As of the data, and Centralized moves the data from all Gaia significantly improves system performance and data centers in parallel. reduces data communication overhead, it significantly Second, Centralized is more cost-efficient than Gaia, reduces both cost sources. We conclude that Gaia is a but the gap is small in the two data centers setting. This cost-effective system for geo-distributed ML applications. is because the total WAN traffic of Gaia is still larger Second, Gaia reduces data transfer cost much more than the size of the training data, even though Gaia when deployed on a smaller number of data centers. The significantly reduces the communication overhead over reason is that Gaia needs to broadcast the significant Baseline. The cost gap is larger in the setting of 11 updates to all data centers, so communication cost is data centers (3.33–6.14×) than in two data centers (1.00– higher as the number of data centers increases. While 1.92×), because the WAN traffic of Gaia is positively we employ network overlays (Section 4.5) to mitigate correlated with the number of data centers (Section 4.5). this effect, there is still more overhead with more than 6.5. Effect of Synchronization Mechanisms two data centers. Nonetheless, the cost of Gaia is still One of the major design considerations of ASP is to en- much cheaper (4.2×/2.6×/8.5×) than Baseline even sure that the significant updates arrive in a timely manner when deployed across 11 data centers. to guarantee algor