Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds PDF

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, and Phillip B. Gibbons, Carnegie Mellon University; Onur Mutlu, ETH Zurich and Carnegie Mellon University https://www.usenix.org/conferen...

Gaia: Geo-Distributed Approaching LAN Speeds Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Gregory R. Ganger, and Phillip B. Gibbons, Carnegie Onur Mutlu, ETH Zurich and Carnegie Mellon https://www.usenix.org/conference/nsdi17/technical-se This paper is included in the Proceedings 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’17). March 27–29, 2017 Boston, ISBN 978-1-931971-37-9 Open 14th Systems Gaia: Geo-Distributed Machine Learning Kevin Hsieh† Aaron Harlap† Nandita Gregory R. Ganger† Phillip † Carnegie Mellon University Abstract Machine learning (ML) is widely used to derive useful information from large-scale data (such as user activities, pictures, and videos) generated at increasingly rapid rates, all over the world. Unfortunately, it is infeasible to move all this globally-generated data to a centralized data center before running an ML algorithm over it—moving large amounts of raw data over wide-area networks (WANs) can be extremely slow, and is also subject to the constraints of privacy and data sovereignty laws. This motivates the need for a geo-distributed ML system spanning multiple data centers. Unfortunately, communicating over WANs can significantly degrade ML system performance (by as much as 53.7× in our study) because the communication overwhelms the limited WAN bandwidth. Our goal in this work is to develop a geo-distributed ML system that (1) employs an intelligent communication mechanism over WANs to efficiently utilize the scarce WAN bandwidth, while retaining the accuracy and cor- rectness guarantees of an ML algorithm; and (2) is generic and flexible enough to run a wide range of ML algorithms, without requiring any changes to the algorithms. To this end, we introduce a new, general geo-distributed ML system, Gaia, that decouples the communication within a data center from the communication between data centers, enabling different communication and con- sistency models for each. We present a new ML syn- chronization model, Approximate Synchronous Parallel (ASP), whose key idea is to dynamically eliminate in- significant communication between data centers while still guaranteeing the correctness of ML algorithms. Our experiments on our prototypes of Gaia running across 11 Amazon EC2 global regions and on a cluster that emulates EC2 WAN bandwidth show that Gaia provides 1.8–53.5× speedup over two state-of-the-art distributed ML systems, and is within 0.94–1.40× of the speed of running the same ML algorithm on machines on a local area network (LAN). 1. Introduction Machine learning (ML) is very widely used across a variety of domains to extract useful information from large-scale data. It has many classes of applications such as image or video classification (e.g., [24,39,65]), speech recognition (e.g., ), and topic modeling (e.g., ). These applications analyze massive amounts of data from user activities, pictures, videos, etc., which are generated at very rapid rates, all over the world. Many large organizations, such as Google , Microsoft , and Amazon , operate tens of data centers globally to USENIX Association 14th USENIX required to be timely, irrespective of the available network bandwidth, to ensure algorithm correctness. Second, we need to design a general system that effectively handles WAN communication for ML algorithms without requir- ing any algorithm changes. This is challenging because the communication patterns vary significantly across dif- ferent ML algorithms [37, 54, 60, 64, 66, 69]. Altering the communication across systems can lead to different tradeoffs and consequences for different algorithms. In this work, we introduce Gaia, a new general, geo-distributed ML system that is designed to effi- ciently operate over a collection of data centers. Gaia builds on the widely used parameter server architecture (e.g., [5, 6, 13, 16, 17, 20, 34, 45, 74, 77]) that provides ML worker machines with a distributed global shared memory abstraction for the ML model parameters they collectively train until convergence to fit the input data. The key idea of Gaia is to maintain an approximately-correct copy of the global ML model within each data center, and dynam- ically eliminate any unnecessary communication between data centers. Gaia enables this by decoupling the synchro- nization (i.e., communication/consistency) model within a data center from the synchronization model between different data centers. This differentiation allows Gaia to run a conventional synchronization model [19, 34, 74] that maximizes utilization of the more-freely-available LAN bandwidth within a data center. At the same time, across different data centers, Gaia employs a new synchro- nization model, called Approximate Synchronous Parallel (ASP), which makes more efficient use of the scarce and heterogeneous WAN bandwidth. By ensuring that each ML model copy in different data centers is approximately correct based on a precise notion defined by ASP, we guarantee ML algorithm convergence. ASP is based on a key finding that the vast majority of updates to the global ML model parameters from each ML worker machine are insignificant. For example, our study of three classes of ML algorithms shows that more than 95% of the updates produce less than a 1% change to the parameter value. With ASP, these insignificant updates to the same parameter within a data center are aggregated (and thus not communicated to other data centers) until the aggregated updates are significant enough. ASP allows the ML programmer to specify the function and the threshold to determine the significance of updates for each ML algorithm, while providing default configurations for unmodified ML programs. For example, the programmer can specify that all updates that produce more than a 1% change are significant. ASP ensures all significant updates are synchronized across all model copies in a timely manner. It dynamically adapts communication to the available WAN bandwidth between pairs of data centers and uses special selective barrier and mirror clock control messages to ensure algorithm convergence even during a period of sudden fall (negative spike) in available WAN bandwidth. In contrast to a state-of-the-art communication-efficient 630 14th USENIX Symposium on Networked Systems the best model (usually a set of parameters) to describe or explain the input data. For example, the goal of an image classification neural network is to find the pa- rameters (of the neural network) that can most accurately classify the input images. Most ML algorithms iteratively refine the ML model until it converges to fit the data. The correctness of an ML algorithm is thus determined by whether or not the algorithm can accurately converge to the best model for its input data. As the input data to an ML algorithm is usually enor- mous, processing all input data on a single machine can take an unacceptably long time. Hence, the most common strategy to run a large-scale ML algorithm is to distribute the input data among multiple worker machines, and have each machine work on a shard of the input data in parallel with other machines. The worker machines communicate with each other periodically to synchronize the updates from other machines. This strategy, called data parallelism , is widely used in many popular ML systems (e.g., [1, 2, 5, 13, 45, 47, 50, 77]). There are many large-scale distributed ML systems, such as ones using the MapReduce abstraction (e.g., MLlib and Mahout ), ones using the graph abstrac- tion (e.g., GraphLab and PowerGraph ), and ones using the parameter server abstraction (e.g., Petuum and TensorFlow ). Among them, the parameter server architecture provides a performance advantage1 over other systems for many ML applications and has been widely adopted in many ML systems. Figure 1a illustrates the high-level overview of the parameter server (PS) architecture. In such an architecture, each parameter server keeps a shard of the global model parameters as a key-value store, and each worker machine communicates with the parameter servers to READ and UPDATE the corresponding parameters. The major benefit of this architecture is that it allows ML programmers view all model parameters as a global shared leave the parameter servers to handle the synchronizat Data Center 1 Data 1 Data N Data 1 Worker Worker Worker …… Machine 1 Machine 1 Machine N LAN Parameter Parameter Parameter Server Server Server Global Model (a) Basic PS architecture (b) Simple PS on WANs Figure 1: Overview of the parameter server architecture Synchronization among workers in a distributed ML system is a critical operation. Each worker needs to see other workers’ updates to the global model to compute more accurate updates using fresh information. However, synchronization is a high-cost operation that can signif- 1 For example, a state-of-the-art parameter shown to outperform PowerGraph by 10× for Matrix Factorization. In turn, PowerGraph is shown to match the performance of GraphX , a Spark based system. USENIX Association 14th Network Bandwidth (Mb/s) 1000 900 800 700 600 500 400 300 200 100 0 Figure 2: Measured network bandwidth between Amazon EC2 sites in 11 different regions Ö São Paulo). As Section 2.3 shows, the scarcity and variation of the WAN bandwidth can significantly degrade the performance of state-of-the-art ML systems. Another important challenge imposed by WANs is the monetary cost of communication. In data centers, the cost of WANs far exceeds the cost of a LAN and makes up a significant fraction of the overall cost. Cloud service providers, such as Amazon EC2, charge an extra fee for WAN communication while providing LAN communication free of charge. The cost of WAN communication can be much higher than the cost of the machines themselves. For example, the cost of two machines in Amazon EC2 communicating at the rate of the average WAN bandwidth between data centers is up to 38× of the cost of renting these two machines. These costs make running ML algorithms on WANs much more expensive than running them on a LAN. 2.3. ML System Performance on WANs We study the performance implications of deploying dis- tributed ML systems on WANs using two state-of-the-art parameter server systems, IterStore and Bösen. Our experiments are conducted on our local 22-node cluster that emulates the WAN bandwidth between Ama- zon EC2 data centers, the accuracy of which is validated against a real Amazon EC2 deployment (see Section 5.1 for details). We run the same ML application, Matrix Factorization (Section 5.2), on both systems. For each system, we evaluate both BSP and SSP as the synchronization model (Section 2.1), with four deploy- ment settings: (1) LAN, deployment within a single data center, (2) EC2-ALL, deployment across 11 aforemen- tioned EC2 regions, (3) V/C WAN, deployment across two data centers that have the same WAN bandwidth as that between Virginia and California (Figure 2), representing a distributed ML setting within a continent, and (4) S/S WAN, deployment across two data centers that have the same WAN bandwidth as that between Singapore and São Paulo, representing the lowest WAN bandwidth between any two Amazon EC2 regions. Figure 3 shows the normalized execution time until al- gorithm convergence across the four deployment settings. All results are normalized to IterStore using BSP on a LAN. The data label on each bar represents how much slower the WAN setting is than its respective LAN setting 632 14th USENIX of synchronization model in a distributed ML prevent the ML algorithm from converging to the optimal point (i.e., the best model to explain or fit the input data) that one can achieve when using a proper synchronization model [11, 59]. Thus, we need a mechanism that can reduce communication intensity while ensuring that the communication occurs in a timely manner, even when the network bandwidth is extremely stringent. This mecha- nism should provably guarantee algorithm convergence irrespective of the network conditions. Challenge 2. How to make the system generic and work for ML algorithms without requiring modification? Devel- oping an effective ML algorithm takes significant effort and experience, making it a large burden for the ML algo- rithm developers to change the algorithm when deploying it on WANs. Our system should work across a wide variety of ML algorithms, preferably without any change to the algorithms themselves. This is challenging because differ- ent ML algorithms have different communication patterns, and the implication of reducing communication can vary significantly among them [37, 54, 60, 64, 66, 69, 83]. 3.2. Gaia System Overview We propose a new ML system, Gaia, that addresses the two key challenges in designing a general and effec- tive ML system on WANs. Gaia is built on top the popular parameter server architecture, which is proven to be effective on a wide variety of ML algorithms (e.g., [5, 6, 13, 16, 17, 20, 34, 45, 74, 77]). As discussed in Section 2.1, in the parameter server architecture, all worker machines synchronize with each other through parameter servers to ensure that the global model state is up-to-date. While this architecture guarantees algorithm convergence, it also requires substantial communication between worker machines and parameter servers. To make Gaia effective on WANs while fully utilizing the abundant LAN bandwidth, we design a new system ar- chitecture to decouple the synchronization within a data center (LANs) from the synchronization across different data centers (WANs). Figure 4 shows an overview of Gaia. In Gaia, center has some worker machines and parameter servers. Each worker machine processes a shard of the input data stored in its data center to achieve data parallelism (Section 2.1). The parameter servers in each data center collectively maintain a version of the global model copy (¶), and each parameter server handles a shard of this global model copy. A worker machine only READs and UPDATEs the global model copy in its data center. To reduce the communication overhead over WANs, the global model copy in each data center is only ap- proximately correct. This design enables us to eliminate the insignificant, and thus unnecessary, communication across different data centers. We design a new synchro- nization model, called Approximate Synchronous Parallel (ASP ·), between parameter servers across different data centers to ensure that each global model copy is approx- imately correct even with very low WAN bandwidth. USENIX Association 14th USENIX ML algorithms, such as PageRank and Lasso. These works observe that in these ML algorithms, not all model parameters converge to their optimal value within the same number iterations — a property called non-uniform convergence. Instead of examining the convergence rate, we quantify the significance of updates with var- ious significance thresholds, which provides a unique opportunity to reduce the communication over WANs. 3.4. Approximate Synchronous Parallel The goal of our new synchronization model, Approxi- mate Synchronous Parallel (ASP), is to ensure that the global model copy in each data center is approximately correct. In this model, a parameter server shares only the significant updates with other data centers, and ASP ensures that these updates can be seen by all data centers in a timely fashion. ASP achieves this goal by using three techniques: (1) the significance filter, (2) ASP selective barrier, and (3) ASP mirror clock. We describe them in order. The significance filter. ASP takes two inputs from an ML programmer to determine whether or not an update is significant. They are: (1) a significance function and (2) an initial significance threshold. The significance function returns the significance of each update. We define an update as significant if its significance is larger than the threshold. For example, an ML programmer can define the significance function as the update’s magnitude relative to the current value (| UValue pdate |), and set the initial significance threshold to 1%. The significance function can be more sophisticated if the impact of parameter changes to the model is not linear, or the importance of parameters is non-uniform (see Section 4.3). A parameter server aggregates updates from the local worker machines and shares the aggregated updates with other data centers when the aggregated updates become significant. To ensure that the algorithm can converge to the optimal point, ASP automatically reduces the significance threshold over time (specifically, if the original threshold is v, then√ the threshold at iteration t of the ML algorithm is v/ t). ASP selective barrier. While we can greatly reduce the communication overhead over WANs by sending only the significant updates, the WAN bandwidth might still be insufficient for such updates. In such a case, the significant updates can arrive too late, and we might not be able to bound the deviation between different global model copies. ASP handles this case with the ASP selective barrier (Figure 6a) control message. When a parameter server receives the significant updates (¶) at a rate that is higher than the WAN bandwidth can support, the parameter server first sends the indexes of these significant updates (as opposed to sending both the indexes and the update values together) via an ASP selective barrier (·) to the other data centers. The receiver of an ASP selective barrier blocks its local worker from reading the specified parameters until it receives the significant updates from the sender of the barrier. This technique ensures that all worker machines in each data 634 14th USENIX Symposium on Networked Systems Let f denote the objective function of an optimization problem, whose goal is to minimize f. Let x˜t denote the sequence of noisy (i.e., inaccurate) views of the parameters, where t = 1, 2,..., T is the index of each view over time. Let x ∗ denote the value that minimizes f. Intuitively, we would like ft (x˜t ) to approach f (xx∗ ) as t → ∞. We call the difference between ft (x˜t ) and f (xx∗ ) regret. We can prove ft (x˜t ) approaches f (xx∗ ) as t → the average regret, R[X]T → 0 as T → ∞. Mathematically, the above intuition is formulated with Theorem 1. The details of the proof and the notations are in Appendix A. Theorem 1. (Convergence of SGD under ASP). Suppose that, in order to compute the minimizer x∗ T function f (xx) = ∑t=1 ft (xx), with ft ,t we use stochastic gradient descent on one component ∇ ft at a time. Suppose also that 1) the algorithm distributed in D data centers, each of which uses P machines, 2) within each data center, the SSP used, with a fixed staleness of s, and 3) a fixed difference ∆c is allowed between any two data Let ut = −ηt ∇ ft (x˜t ), where the step as ηt = √ηt and the significance threshold as vt = √vt. If we further assume that: k∇ ft (xx)k ≤ L, ∀xx ∈ dom( ft ) and max(D(xx, x 0 )) ≤ Then, as T → ∞, the regret R[X] = ∑t=1 T ft √ R[X] O( T ) and therefore limT →∞ T → 0. 4. Implementation We introduce the key components of Gaia in Section 4.1, and discuss the operation and design of individual com- ponents in the remaining sections. 4.1. Gaia System Key Components Figure 7 presents the key components of Gaia. All of the key components are implemented in the parameter servers, and can be transparent to the ML programs and the worker machines. As we discuss above, we decouple the synchronization within a data center (LANs) from the synchronization across different data centers (WANs). The local server (¶) in each parameter server handles the synchronization between the worker machines in the same data center using the conventional BSP or SSP models. On the other hand, the mirror server (·) and the mirror client (¸) handle the synchronization with other data centers using our ASP model. Each of these three components runs as an individual thread. Gaia Parameter Server Worker ❶ ❹ Local Machine Server Parameter Store Worker ❷Mirror Machine ❺ ❻ Control Server Significance Queue Worker Filter Mirror Client Data Machine Queue ❸ ❼ Figure 7: Key components of Gaia USENIX Association non-uniform. For unmodified ML programs, the system applies default significance functions, such as the relative magnitude of an update for each parameter. 4.4. Tuning of Significance Thresholds The user of Gaia can specify two different goals for Gaia: (1) speed up algorithm convergence by fully utilizing the available WAN bandwidth and (2) minimize the commu- nication cost on WANs. In order to achieve either of these goals, the significance filter maintains two significance thresholds and dynamically tunes these thresholds. The first threshold is the hard significance threshold. The purpose of this threshold is to guarantee ML algorithm convergence. As we discuss in our theoretical analysis (Section 3.5), the initial threshold is provided programmer or a default system setting, and the signif- icance filter reduces it over time. Every update whose significance is above the hard threshold is guaranteed to be sent to other data centers. The second threshold is the soft significance threshold. The purpose of it is to use underutilized WAN bandwidth to speed up convergence. This threshold is tuned based on the arrival rate of the significant updates and the underlying WAN bandwidth. When the user chooses to optimize the first goal (speed up algorithm convergence), the system lowers the soft sig- nificance threshold whenever there is underutilized WAN bandwidth. The updates whose significance is larger than the soft significance threshold are sent in a best-effort manner. On the other hand, if the goal of the system is to minimize the WAN communication costs, the soft significance threshold is not activated. While the configuration of the initial hard threshold depends on how error tolerant each ML algorithm is, a simple and conservative threshold (such as 1%–2%) is likely to work in most cases. This is because most ML algorithms initialize their parameters with random values, and make large changes to their model parameters at early phases. Thus, they are more error tolerant at the beginning. As Gaia reduces the threshold over time, its accuracy loss is limited. An ML expert can choose a more aggressive threshold based on domain knowledge of the ML algorithm. 4.5. Overlay Network and Hub While Gaia can eliminate the insignificant updates, data center needs to broadcast the significant all the other data centers. This broadcast-based cation could limit the scalability of Gaia when we deploy Gaia to many data centers. To make Gaia more scalable with more data centers, we use the concept of overlay networks. As we discuss in Section 2.2, the WAN bandwidth between geographically-close regions is much higher than that between distant regions. In light of this, Gaia supports having geographically-close data centers form a data center group. Servers in a data center group send their significant updates only to the other servers in the same group. Each group has hub data centers that are in 636 14th USENIX Symposium on Networked Systems ization factorizes X into factor matrices L and R such that their product approximates X (i.e., X ≈ LR). Like other systems [17,32,83], we implement MF using the stochas- tic gradient descent (SGD) algorithm. Each worker is assigned a portion of the known entries in X. The L matrix is stored locally in each worker, and the R matrix is stored in parameter servers. Our experiments use the Netflix dataset, a 480K-by-18K sparse matrix with 100M known entries. They are configured to factor the matrix into the product of two matrices, each with rank 500. Topic Modeling (TM) is an unsupervised method for discovering hidden semantic structures (topics) in an unstructured collection of documents, each consisting bag (multi-set) of words. TM discovers the topics via word co-occurrence. For example, “policy” is more likely to co-occur with “government” than “bacteria”, and thus “policy” and “government” are categorized to the same topic associated with political terms. Further, a document with many instances of “policy” would be assigned a topic distribution that peaks for the politics-related topics. TM learns the hidden topics and the documents’ associations with those topics jointly. Common applications for TM include community detection in social networks and news categorizations. We implement our TM solver using collapsed Gibbs sampling. We use the Nytimes dataset , which has 100M words in 300K documents with a vocabulary size of 100K. Our experiments classify words and documents into 500 topics. Image Classification (IC) is a task to classify im- ages into categories, and the state-of-the-art approach is using deep learning and convolutional neural networks (CNNs). Given a set of images with known cate- gories (training data), the ML algorithm trains a CNN to learn the relationship between the image features and their categories. The trained CNN is then used the categories of another set of images (test GoogLeNet , one of the state-of-the-art CNNs as our model. We train GoogLeNet using stochastic gradient descent with back propagation. As training a CNN with a large number of images requires substantial tation, doing so on CPUs can take hundreds of machines over a week. Instead, we use distributed GPUs with a popular deep learning framework, Caffe , which is hosted by a state-of-the-art GPU-specialized param- eter server system, GeePS. Our experiments use the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC12) dataset, which consists of 1.3M training images and 50K test images. Each image is labeled as one of the 1,000 pre-defined categories. 5.3. Performance Metrics and Algorithm Conver- gence Criteria We use two performance metrics to evaluate the effective- ness of a globally distributed ML system. The first metric is the execution time until algorithm convergence. We use the following algorithm convergence criterion, based on guidance from our ML experts: if the value of the objective function (the objective value) in an algorithm USENIX Association 14th USENIX Nromalized Execution Time 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Baseline (a) Matrix Factorization (MF) Figure 8: Normalized execution time until convergence of 5.6× over Baseline, which is within 1.32× of the LAN speed, indicating that Gaia is also effective on a GPU-based ML system. The gap between Baseline and LAN is larger for IC than for the other two applications. This is because the GPU-based ML system generates parameter updates at a higher rate than the CPU-based one, and therefore the limited WAN bandwidth slows it down more significantly. Second, Gaia provides a higher performance gain when deployed on a more powerful platform. As Figure 8 shows, the performance gap between Baseline and LAN signifi- cantly increases on Emulation-Full-Speed compared to the slower platform Emulation-EC2. This is expected because the WAN bandwidth becomes a more critical bottleneck when the computation time reduces and the LAN bandwidth increases. Gaia successfully mitigates the WAN bottleneck in this more challenging Emulation- Full-Speed setting, and improves the system performance by 3.8× for MF, 3.7× for TM, and 6.0× for IC over Baseline, approaching the speedups provided by LAN. 6.2. Performance and WAN Bandwidth To understand how Gaia performs under different of WAN bandwidth, we evaluate two settings where Baseline and Gaia are deployed across two data centers with two WAN bandwidth configurations: (1) V/C WAN, which emulates the WAN bandwidth between Virginia and California, representing a setting within the same continent; and (2) S/S WAN, which emulates the WAN bandwidth between Singapore and São Paulo, representing the lowest WAN bandwidth between any two Amazon EC2 sites. All the experiments are conducted on our emulation platform at full speed. Figures 9 and 10 show the results. Three observations are in order. Baseline Gaia LAN 1 Normalized Exec. Time 0.9 0.8 0.7 0.6 0.5 0.4 3.7X 3.5X 0.3 0.2 0.1 0 Matrix Factorization Figure 9: Normalized execution time until convergence with the WAN bandwidth between Virginia and California First, Gaia successfully matches the performance of LAN when WAN bandwidth is high (V/C WAN). As Fig- ure 9 shows, Gaia achieves a speedup of 3.7× for MF, 3.7× for TM, and 7.4× for IC. For all three ML applica- tions, the performance of Gaia on WANs is almost the same as LAN performance. 638 14th 2.5 Normliaed Cost Machine Cost (Compute) 2 Machine Cost (Network) Communication Cost 1.5 1 0.5 4.2X 0 Baseline Gaia Baseline EC2-ALL V/C WAN (a) Matrix Factorization (MF) two settings, because it takes more time to transfer the same amount of data under low WAN bandwidth. As Gaia significantly improves system performance and reduces data communication overhead, it significantly reduces both cost sources. We conclude that Gaia is a cost-effective system for geo-distributed ML applications. Second, Gaia reduces data transfer cost much more when deployed on a smaller number of data centers. The reason is that Gaia needs to broadcast the significant updates to all data centers, so communication cost is higher as the number of data centers increases. While we employ network overlays (Section 4.5) to mitigate this effect, there is still more overhead with more than two data centers. Nonetheless, the cost of Gaia is still much cheaper (4.2×/2.6×/8.5×) than Baseline even when deployed across 11 data centers. Machine Learning Dimitris Konomis, Mellon University; University ssions/presentation/hsieh of the MA, USA access to the Proceedings of the USENIX Symposium on Networked Design and Implementation is sponsored by USENIX. Approaching LAN Speeds Vijaykumar† Dimitris Konomis† B. Gibbons† Onur Mutlu§† § ETH Zürich minimize their service latency to end-users, and store massive quantities of data all over the globe [31, 33, 36, 41, 57, 58, 71–73, 76]. A commonly-used approach to run an ML application over such rapidly generated data is to centralize all data into one data center over wide-area networks (WANs) before running the ML application [9,12,44,68]. However, this approach can be prohibitively difficult because: (1) WAN bandwidth is a scarce resource, and hence moving all data can be extremely slow [12,57]. Furthermore, the fast growing rate of image and video generation will eventually saturate the total WAN bandwidth, whose growth has been decelerating for many years [67, 73]. (2) Privacy and data sovereignty laws in some countries prohibit transmission of raw data across national or continental borders [12, 72, 73]. This motivates the need to distribute an ML system across multiple data centers, globally. In such a system, large amounts of raw data are stored locally in different data centers, and the ML algorithms running over the distributed data communicate between data centers using WANs. Unfortunately, existing large-scale distributed ML systems [5, 13, 45, 47, 50, 77] are suitable only for data residing within a single data center. Our experiments using three state-of-the-art distributed ML systems (Bösen , IterStore , and GeePS ) show that operating these systems across as few as two data centers (over WANs) can cause a slowdown of 1.8–53.7× (see Section 2.3 and Section 6) relative to their performance within a data center (over LANs). Existing systems that do address challenges in geo-distributed data analytics [12, 33, 36, 41, 57, 58, 71–73] do not consider the broad class of important, sophisticated ML algorithms commonly run on ML systems — they focus instead on other types of computation, e.g., map-reduce or SQL. Our goal in this work is to develop a geo-distributed ML system that (1) minimizes communication over WANs, so that the system is not bottlenecked by the scarce WAN bandwidth; and (2) is general enough to be applicable to a wide variety of ML algorithms, without requiring any changes to the algorithms themselves. To achieve these goals, such a system needs to address two key challenges. First, to efficiently utilize the limited (and heterogeneous) WAN bandwidth, we need to find an effective communication model that minimizes com- munication over WANs but still retains the correctness guarantee for an ML algorithm. This is difficult because ML algorithms typically require extensive communication to exchange updates that keep the global ML model suffi- ciently consistent across data centers. These updates are Symposium on Networked Systems Design and Implementation 629 synchronization model, Stale Synchronous Parallel (SSP) , which bounds how stale (i.e., old) a parameter can be, ASP bounds how inaccurate a parameter can be, in comparison to the most up-to-date value. Hence, it provides high flexibility in performing (or not performing) updates, as the server can delay synchronization indefi- nitely as long as the aggregated update is insignificant. We build two prototypes of Gaia on top of two state- of-the-art parameter server systems, one specialized for CPUs and another specialized for GPUs. We deploy Gaia across 11 regions on Amazon EC2, and on a local cluster that emulates the WAN bandwidth across different Amazon EC2 regions. Our evaluation with three popular classes of ML algorithms shows that, compared to two state-of-the-art parameter server systems [17, 18] deployed on WANs, Gaia: (1) significantly improves performance, by 1.8–53.5×, (2) has performance within 0.94–1.40× of running the same ML algorithm on a LAN in a single data center, and (3) significantly reduces the monetary cost of running the same ML algorithm on WANs, by 2.6–59.0×. We make three major contributions: To our knowledge, this is the first work to propose a general geo-distributed ML system that (1) differ- entiates the communication over a LAN from the communication over WANs to make efficient use of the scarce and heterogeneous WAN bandwidth, and (2) is general and flexible enough to deploy a wide range of ML algorithms while requiring no change to the ML algorithms themselves. We propose a new, efficient ML synchronization model, Approximate Synchronous Parallel (ASP), for communication between parameter servers across data centers over WANs. ASP guarantees that each data center’s view of the ML model parameters is approx- imately the same as the “fully-consistent” view and ensures that all significant updates are synchronized in time. We prove that ASP provides a theoretical guarantee on algorithm convergence for a widely used ML algorithm, stochastic gradient descent. We build two prototypes of our proposed system on CPU-based and GPU-based ML systems, and we demonstrate their effectiveness over 11 globally dis- tributed regions with three popular ML algorithms. We show that our system provides significant perfor- mance improvements over two state-of-the-art dis- tributed ML systems [17,18], and significantly reduces the communication overhead over WANs. 2. Background and Motivation We first introduce the architectures of widely-used dis- tributed ML systems. We then discuss WAN bandwidth constraints and study the performance implications of running two state-of-the-art ML systems over WANs. 2.1. Distributed Machine Learning Systems While ML algorithms have different types across different domains, almost all have the same goal—searching for Design and Implementation USENIX Association icantly slow down the workers and reduce the benefits of parallelism. The trade-off between fresher updates and communication overhead leads to three major syn- chronization models: (1) Bulk Synchronous Parallel (BSP) , which synchronizes all updates after each worker goes through its shard of data; all workers need to see the most up-to-date model before proceeding to the next iteration, (2) Stale Synchronous Parallel (SSP) , which allows the fastest worker to be ahead of the slowest worker by up to a bounded number of iterations, so the fast workers may proceed with a bounded stale (i.e., old) model, and (3) Total Asynchronous Parallel (TAP) , which removes the synchronization between workers com- pletely; all workers keep running based on the results of best-effort communication (i.e., each sends/receives as many updates as possible). Both BSP and SSP guarantee algorithm convergence [19, 34], while there is no such guarantee for TAP. Most state-of-the-art parameter servers implement both BSP and SSP (e.g., [5,16–18, 34,45, 77]). As discussed in Section 1, many ML applications need to analyze geo-distributed data. For instance, an image classification system would use pictures located at different data centers as its input data to keep improving its classification using the pictures generated continuously all over the world. Figure 1b depicts the straightforward approach to achieve this goal. In this approach, the worker machines in each data center (i.e., within a LAN) handle the input data stored in the corresponding data center. The parameter servers are evenly distributed across multiple data centers. Whenever the communication between a worker machine and a parameter server crosses data centers, it does so on WANs. 2.2. WAN Network Bandwidth and Cost WAN bandwidth is a very scarce resource [42, 58, 73] relative to LAN bandwidth. Moreover, the high cost of to adding network bandwidth has resulted in a deceleration memory, and of WAN bandwidth growth. The Internet capacity growth ion. has fallen steadily for many years, and the annual growth Data Center 2 rates have lately settled into the low-30 percent range. Data N Worker To quantify the scarcity of WAN bandwidth between …… Machine N data centers, we measure the network bandwidth between WAN LAN all pairs of Amazon EC2 sites in 11 different regions (Virginia, California, Oregon, Ireland, Frankfurt, Tokyo, Parameter Server Seoul, Singapore, Sydney, Mumbai, and São Paulo). We use iperf3 to measure the network bandwidth of Global Model each pair of different regions for five rounds, and then calculate the average bandwidth. Figure 2 shows the average network bandwidth between each pair of different regions. We make two observations. First, the WAN bandwidth between data centers is 15× smaller than the LAN bandwidth within a data center on average, and up to 60× smaller in the worst case (for Singapore Ö São Paulo). Second, the WAN bandwidth server, IterStore , is varies significantly between different regions. The WAN bandwidth between geographically-close regions (e.g., Oregon Ö California or Tokyo Ö Seoul) is up to 12× of the bandwidth between distant regions (e.g., Singapore USENIX Symposium on Networked Systems Design and Implementation 631 23.8X IterStore Bӧsen 25 Time until Convergence Normalized Execution 24.2X 26.8X 20 15 13.7X 10 5.9X 5 3.7X 3.5X 4.4X 2.7X 4.9X 2.3X 4.3X São Paulo 0 Mumbai Sydney LAN EC2-ALL V/C WAN S/S WAN LAN EC2-ALL V/C WAN S/S WAN Singapore Seoul Tokyo BSP SSP Frankfurt Ireland Oregon California Figure 3: Normalized execution time until ML algo- Virginia rithm convergence when deploying two state-of-the-art dis- tributed ML systems on a LAN and WANs for the given system, e.g., Bösen-BSP on EC2-ALL is 5.9× slower than Bösen-BSP on LAN. As we see, both systems suffer significant performance degradation when deployed across multiple data centers. When using BSP, IterStore is 3.5× to 23.8× slower on WANs than it is on a LAN, and Bösen is 4.4× to 24.2× slower. While using SSP can reduce overall execution times of both systems, both systems still show significant slowdown when run on WANs (2.3× to 13.7× for Iter- Store, and 4.3× to 26.8× for Bösen). We conclude that simply running state-of-the-art distributed ML systems on WANs can seriously slow down ML applications, and thus we need a new distributed ML system that can be effectively deployed on WANs. 3. Our Approach: Gaia We introduce Gaia, a general ML system that can be effec- tively deployed on WANs to address the increasing need to run ML applications directly on geo-distributed data. We identify two key challenges in designing such a system (Section 3.1). We then introduce the system architecture of Gaia, which differentiates the communication within a data center from the communication between different centers (Section 3.2). Our approach is based on the key empirical finding that the vast majority of communication within an ML system results in insignificant changes to the state of the global model (Section 3.3). In light of this finding, we design a new ML synchronization model, called Approximate Synchronous Parallel (ASP), which can eliminate the insignificant communication while en- suring the convergence and accuracy of ML algorithms. We describe ASP in detail in Section 3.4. Finally, Sec- tion 3.5 summarizes our theoretical analysis of how ASP guarantees algorithm convergence for a widely-used ML algorithm, stochastic gradient descent (SGD) (the full proof is in Appendix A). 3.1. Key Challenges There are two key challenges in designing a general and effective ML system on WANs. Challenge 1. How to effectively communicate over WANs while retaining algorithm convergence and ac- curacy? As we see above, state-of-the-art distributed ML systems can overwhelm the scarce WAN bandwidth, causing significant slowdowns. We need a mechanism that significantly reduces the communication between data centers so that the system can provide competitive performance. However, reducing communication can affect the accuracy of an ML algorithm. A poor choice Symposium on Networked Systems Design and Implementation USENIX Association Data Center 1 Data Center 2 system can Data ❶ Global Model Copy Global Model Copy Shard Worker Machine Parameter Server Parameter Server … Data Shard Worker Machine Data Parameter Server ❷ ASP Parameter Server … Shard Worker Machine ❸ BSP/SSP Figure 4: Gaia system overview Section 3.4 describes the details of ASP. On the other hand, worker machines and parameter servers within a data center synchronize with each other using the con- ventional BSP (Bulk Synchronous Parallel) or SSP (Stale Synchronous Parallel) models (¸). These models allow worker machines to quickly observe fresh updates that happen within a data center. Furthermore, worker ma- chines and parameter servers within a data center can employ more aggressive communication schemes such as sending updates early and often [19,74] to fully utilize the abundant (and free) network bandwidth on a LAN. 3.3. Study of Update Significance As discussed above, Gaia reduces the communication overhead over WANs by eliminating insignificant com- munication. To understand the benefit of our approach, we study the significance of the updates sent from worker machines to parameter servers. We study three classes of popular ML algorithms: Matrix Factorization (MF) , Topic Modeling (TM) , and Image Classification (IC) (see Section 5.2 for descriptions). We run all the algorithms until convergence, analyze all the updates sent from worker machines to parameter servers, and compare the change they cause on the parameter value when the servers receive them. We define an update to be significant if it causes S% change on the parameter value, and we vary S, the significance threshold, between 0.01 and 10. Figure 5 shows the percentage of insignificant updates among all updates, for different values of S. Matrix Factorization Topic Modeling Image Classification Insignificant Updates each data 100% Percentage of 80% 60% 40% 20% 0% 10% 5% 1% 0.5% 0.1% 0.05% 0.01% Threshold of Significant Updates (S) Figure 5: Percentage of insignificant updates As we see, the vast majority of updates in these al- gorithms are insignificant. Assuming the significance threshold is 1%, 95.2% / 95.6% / 97.0% of all updates are insignificant for MF / TM / IC. When we relax the significance threshold to 5%, 98.8% / 96.1% / 99.3% of all updates are insignificant. Thus, most of the communi- cation changes the ML model state only very slightly. It is worth noting that our finding is consistent with the findings of prior work [21, 22, 40, 47, 80] on other Symposium on Networked Systems Design and Implementation 633 center are aware of the significant updates after a bounded network latency, and they wait only for these updates. The worker machines can make progress as long as they do not depend on any of these parameters. Data Center 1 Data Center 2 Data Center 1 Data Center 2 ❶ Significant Updates ❷ Barrier ❸ Clock N ❹ Clock N + DS Parameter Parameter Parameter Parameter Server Server Server Server (a) ASP selective barrier (b) Mirror clock Figure 6: The synchronization mechanisms of ASP Mirror clock. The ASP select barrier ensures that the latency of the significant updates is no more than the network latency. However, it assumes that 1) the underlying WAN bandwidth and latency are fixed so that the network latency can be bounded, and 2) such latency is short enough so that other data centers can be aware of them in time. In practice, WAN bandwidth can fluctuate over time , and the WAN latency can be intolerably high for some ML algorithms. We need a mechanism to guarantee that the worker machines are aware of the significant updates in time, irrespective of the WAN bandwidth or latency. We use the mirror clock (Figure 6b) to provide this guarantee. When each parameter server receives all the updates from its local worker machines at the end of a clock (e.g., an iteration), it reports its clock to the servers that are in charge of the same parameters in the other data centers. When a server detects its clock is ahead of the slowest server that shares the same parameters by a predefined threshold DS (data center staleness), the server blocks its local worker machines from reading its parameters until the slowest mirror server catches up. In the example of Figure 6b, the server clock in Data Center 1 is N, while the server clock in Data Center 2 is (N + DS). As their difference reaches the predefined limit, the server in Data Center 2 blocks its local worker from reading its parameters. This mechanism is similar to the concept of SSP , but we use it only as the last resort to guarantee algorithm convergence. 3.5. Summary of Convergence Proof In this section, we summarize our proof showing that a popular, broad class of ML algorithms are guaranteed to converge under our new ASP synchronization model. The class we consider are ML algorithms expressed as convex optimization problems that are solved using distributed stochastic gradient descent. The proof follows the outline of prior work on SSP , with a new challenge, i.e., our new ASP synchronization model allows the synchronization of insignificant updates to be delayed indefinitely. To prove algorithm conver- gence, our goal is to show that the distributed execution of an ML algorithm results in a set of parameter values that are very close (practically identical) to the values that would be obtained under a serialized execution. Design and Implementation USENIX Association 4.2. System Operations and Communication We present a walkthrough of major system operations and communication. UPDATE from a worker machine. When a local server (¶) receives a parameter update from a worker machine, it updates the parameter in its parameter store (¹), which maintains the parameter value and its accumulated update. ∞ by proving that The local server then invokes the significance filter (º) to determine whether or not the accumulated update of this parameter is significant. If it is, the significance filter sends a MIRROR UPDATE request to the mirror client (¸) and resets the accumulated update for this parameter. Messages from the significance filter. The signifi- cance filter sends out three types of messages. First, as of a convex discussed above, it sends a MIRROR UPDATE request to = 1, 2,... , T , convex, the mirror client through the data queue (¼). Second, when the significance filter detects that the arrival rate of is significant updates is higher than the underlying WAN bandwidth that it monitors at every iteration, it first sends protocol is an ASP Barrier (Section 3.4) to the control queue (») mirror clock before sending the MIRROR UPDATE. The mirror client (¸) centers. prioritizes the control queue over the data queue, so that size ηt decreases the barrier is sent out earlier than the update. Third, to vt decreases maintain the mirror clock (Section 3.4), the significance filter also sends a MIRROR CLOCK request to the control ∆2 , ∀xx, x 0 ∈ dom( ft ). queue at the end of each clock in the local server. (x˜t ) − f (xx∗ ) = Operations in the mirror client. The mirror client thread wakes up when there is a request from the control queue or the data queue. Upon waking up, the mirror client walks through the queues, packs together the messages to the same destination, and sends them. Operations in the mirror server. The mirror server handles above messages (MIRROR UPDATE, ASP BARRIER, and MIRROR CLOCK) according to our ASP model. For MIRROR UPDATE, it applies the update to the correspond- ing parameter in the parameter store. For ASP BARRIER, it sets a flag in the parameter store to block the corre- sponding parameter from being read until it receives the corresponding MIRROR UPDATE. For MIRROR CLOCK, the mirror server updates its local mirror clock state for each parameter server in other data centers, and enforces the predefined clock difference threshold DS (Section 3.4). 4.3. Advanced Significance Functions As we discuss in Section 3.4, the significance filter allows the ML programmer to specify a custom significance function to calculate the significance of each update. By providing an advanced significance function, Gaia can be more effective at eliminating the insignificant communica- Data Center Boundary tion. If several parameters are always referenced together Gaia Parameter Server to calculate the next update, the significance function can take into account the values of all these parameters. For example, if three parameters a, b, and c are always used … as a · b · c in an ML algorithm, the significance of a, b, and c can be calculated as the change on a · b · c. If one of them is 0, any change in another parameter, however large it may be, is insignificant. Similar principles can be applied to model parameters that are non-linear or 14th USENIX Symposium on Networked Systems Design and Implementation 635 charge of aggregating all the significant updates within the group, and sending to the hubs of the other groups. Similarly, a hub data center broadcasts the aggregated significant updates from other groups to the other data centers within its group. Each data center group can designate different hubs for communication with different data center groups, so the system can utilize more links within a data center group. For example, the data centers in Virginia, California, and Oregon can form a data center group and assign the data center in Virginia as the hub to communicate with the data centers in Europe and the data center in Oregon as the hub to communicate with the data centers is Asia. This design allows Gaia to broadcast the significant updates with lower communication cost. by the ML 5. Methodology 5.1. Experiment Platforms We use three different platforms for our evaluation. Amazon-EC2. We deploy Gaia to 22 machines spread across 11 EC2 regions as we show in Figure 2. In each EC2 region we start two instances of type c4.4xlarge or m4.4xlarge , depending on their availability. Both types of instances have 16 CPU cores and at least 30GB RAM, running 64-bit Ubuntu 14.04 LTS (HVM). In all, our deployment uses 352 CPU cores and 1204 GB RAM. Emulation-EC2. As the monetary cost of running all experiments on EC2 is too high, we run some experiments on our local cluster that emulates the computation power and WAN bandwidth of EC2. We use the same number of machines (22) in our local cluster. Each machine is equipped with a 16-core Intel Xeon CPU (E5-2698), an NVIDIA Titan X GPU, 64GB RAM, a 40GbE NIC, and runs the same OS as above. The computation power and the LAN speeds of our machines are higher than the ones we get from EC2, so we slow down the CPU and LAN speeds to match the speeds on EC2. We model the measured EC2 WAN bandwidth (Figure 2) with the Linux Traffic Control tool. As Section 6.1 shows, our emulation platform gives very similar results to the results from our real EC2 deployment. Emulation-Full-Speed. We run some of our experi- ments on our local cluster that emulates the WAN band- width of EC2 at full speed. We use the same settings as Emulation-EC2 except we do not slow down the CPUs each and the LAN. We use this platform to show the results updates to of deployments with more powerful nodes. communi- 5.2. Applications We evaluate Gaia with three popular ML applications. Matrix Factorization (MF) is a technique commonly used in recommender systems, e.g., systems that recom- mend movies to users on Netflix (a.k.a. collaborative filtering). Its goal is to discover latent interactions between two entities, such as users and movies, via matrix factorization. For example, input data can be a partially filled matrix X, where every entry is a user’s rating for a movie, each row corresponding to a user, and each column corresponding to a specific movie. Matrix factor- Design and Implementation USENIX Association changes by less than 2% over the course of 10 iterations, we declare that the algorithm has converged. In order to ensure that each algorithm accurately converges to the optimal point, we first run each algorithm on our local cluster until it converges, and we record the absolute objective value. The execution time of each setting is the time it takes to converge to this absolute objective value. The second metric is the cost of algorithm convergence. We calculate the cost based on the cost model of Amazon EC2 , including the cost of the server time and the cost of data transfer on WANs. We provide the details of the cost model in Appendix C. of a 6. Evaluation Results We evaluate the effectiveness of Gaia by evaluating three types of systems/deployments: (1) Baseline, two state- of-the-art parameter server systems (IterStore for MF and TM, GeePS for IC) that are deployed across multiple data centers. Every worker machine handles the data in its data center, while the parameter servers are distributed evenly across all the data centers; (2) Gaia, our prototype systems based on IterStore and GeePS, deployed across multiple data centers; and (3) LAN, the baseline parameter servers (IterStore and GeePS) that are deployed within a single data center (also on 22 machines) that already hold all the data, representing the ideal case of all communication on a LAN. For each system, we evaluate two ML synchronization models: BSP and SSP (Section 2.1). For Baseline and LAN, BSP and SSP are used among all worker machines, whereas for Gaia, they are used only within each data center. Due to limited space, we present the results for BSP in this section and leave the results for SSP to Appendix B. 6.1. Performance on EC2 Deployment to predict We first present the performance of Gaia and Baseline data). We use when they are deployed across 11 EC2 data centers. Fig- ure 8 shows the normalized execution time until conver- gence for our ML applications, normalized to Baseline on EC2. The data label on each bar is the speedup compu- over Baseline for the respective deployment. As Sec- tion 5.1 discusses, we run only MF on EC2 due to the high monetary cost of WAN data transfer. Thus, we present the results of MF on all three platforms, while we show the results of TM and IC only on our emulation platforms. As Figure 8a shows, our emulation platform (Emulation-EC2) matches the execution time of our real EC2 deployment (Amazon-EC2) very well. We make two major observations. First, we find that Gaia significantly improves the performance of Baseline when deployed globally across many EC2 data centers. For MF, Gaia provides a speedup of 2.0× over Baseline. Furthermore, the performance of Gaia is very similar to the performance of LAN, indicating that Gaia almost attains the performance upper bound with the given computation resources. For TM, Gaia delivers a similar speedup (2.0×) and is within 1.25× of the ideal speed of LAN. For IC, Gaia provides a speedup Symposium on Networked Systems Design and Implementation 637 Nromalized Execution Time Nromalized Execution Time Amazon-EC2 1 1 Emulation-EC2 Emulation-EC2 Emulation-EC2 0.9 0.9 0.8 Emulation-Full-Speed 0.8 Emulation-Full-Speed Emulation-Full-Speed 0.7 0.7 2.0X 1.8X 2.0X 1.8X 0.6 2.0X 0.6 0.5 2.5X 0.5 0.4 0.4 0.3 3.7X 0.3 3.8X 3.7X 4.8X 5.6X 6.0X 0.2 0.2 7.5X 8.5X 0.1 0.1 0 0 Gaia LAN Baseline Gaia LAN Baseline Gaia LAN (b) Topic Modeling (TM) (c) Image Classification (IC) when deployed across 11 EC2 regions and our emulation cluster Baseline Gaia LAN Baseline Gaia LAN Baseline Gaia LAN 1 1 0.9 1 0.9 Normalized Exec. Time 0.9 0.8 0.8 0.7 0.8 0.7 0.7 0.6 0.6 0.6 0.5 0.5 0.5 0.4 0.4 0.4 0.3 0.3 0.3 0.2 0.2 14X 17X 0.2 0.1 25X 24X 0.1 0.1 54X 54X 0 0 0 Matrix Factorization Topic Modeling Image Classification Figure 10: Normalized execution time until convergence with the WAN bandwidth between Singapore and São Paulo Second, Gaia still performs very well when WAN bandwidth is low (S/S WAN, Figure 10): Gaia provides a speedup of 25.4× for MF, 14.1× for TM, and 53.5× for IC, and successfully approaches LAN performance. These results show that our design is robust for both CPU-based and GPU-based ML systems, and it can deliver high performance even under scarce WAN bandwidth. Third, for MF, the performance of Gaia (on WANs) is slightly better than LAN performance. This is because we run ASP between different data centers, and the workers in each data center need to synchronize only with each other locally in each iteration. As long as the mirror updates on WANs are timely, each iteration of Gaia can be faster than that of LAN, which needs to synchronize amounts across all workers. While Gaia needs more iterations than LAN due to the accuracy loss, Gaia can still outperform LAN due to the faster iterations. 6.3. Cost Analysis Figure 11 shows the monetary cost of running ML ap- plications until convergence based on the Amazon EC2 cost model, normalized to the cost of Baseline on 11 EC2 regions. Cost is divided into three components: (1) the cost of machine time spent on computation, (2) the cost of machine time spent on waiting for networks, and (3) the cost of data transfer across different data centers. Baseline Gaia LAN Baseline Gaia LAN 1 1 As we discuss in Section 2.2, there is no cost for data 0.9 0.9 0.8 0.8 transfer within a single data center in Amazon EC2. The 0.7 0.7 0.6 0.6 data label on each bar shows the factor by which the 0.5 0.5 0.4 0.4 cost of Gaia is cheaper than the cost of each respective 3.7X 3.9X 0.3 0.3 0.2 0.2 7.4X 7.4X Baseline. We evaluate all three deployment setups that 0.1 0.1 0 0 we discuss in Sections 6.1 and 6.2. We make two major Topic Modeling Image Classification observations. First, Gaia is very effective in reducing the cost of running a geo-distributed ML application. Across all the evaluated settings, Gaia is 2.6× to 59.0× cheaper than Baseline. Not surprisingly, the major cost saving comes from the reduction of data transfer on WANs and the reduction of machine time spent on waiting for networks. For the S/S WAN setting, the cost of waiting for networks is a more important factor than the other USENIX Symposium on Networked Systems Design and Implementation USENIX Association 2.5 Machine Cost (Compute) 4 Machine Cost (Compute) Normliaed Cost Normliaed Cost 2 3.5 Machine Cost (Network) 3 Machine Cost (Network) 1.5 Communication Cost 2.5 Communication Cost 2 1 1.5 2.6X 1 6.0X 28.5X 0.5 5.7X 18.7X 8.5X 10.7X 59.0X 0.5 0 0 Gaia Baseline Gaia Baseline Gaia Baseline Gaia Baseline Gaia Baseline Gaia Baseline Gaia Baseline Gaia S/S WAN EC2-ALL V/C WAN S/S WAN EC2-ALL V/C WAN S/S WAN (b) Topic Modeling (TM) (c) Image Classification (IC) Figure 11: Normalized monetary cost of Gaia vs. Baseline setting because each data center has only a small fraction of the data, and Centralized moves the data from all data centers in parallel. Second, Centralized is more cost-efficient than Gaia, but the gap is small in the two data centers setting. This is because the total WAN traffic of Gaia is still larger than the size of the training data, even though Gaia significantly reduces the communication overhead over Baseline. The cost gap is larger in the setting of 11 data centers (3.33–6.14×) than in two data centers (1.00– 1.92×), because the WAN traffic of Gaia is positively correlated with the number of data centers (Section 4.5). 6.5. Effect of Synchronization Mechanisms One of the major design considerations of ASP is to en- sure that the significant updates arrive in a timely manner to guarantee algor

Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds PDF

Document Details

Tags

Related

Summary

Full Transcript