Cartel: A System for Collaborative Transfer Learning at the Edge (SoCC '19)

Summary

This paper introduces Cartel, a system for collaborative learning in edge clouds. Cartel creates a model-sharing environment in which tailored models at different edge locations can adapt quickly to changes while remaining about as robust and accurate as centralized models. In its evaluation, Cartel adapts to workload changes 4 to 8× faster than isolated learning, and reduces model size, training time, and total data transfer by roughly 3×, 5.7×, and ~1500×, respectively, compared to centralized learning.

Full Transcript


Cartel: A System for Collaborative Transfer Learning at the Edge Harshit Daga Patrick K. Nicholson Georgia Institute of Technology Nokia Bell Labs Ada Gavrilovska Diego Lugones Georgia Institute of Technology Nokia Bell Labs ABSTRACT 1 INTRODUCTION As Multi-access Edge Computing (MEC) and 5G technologies evolve, The proliferation of connected devices is causing a compound an- new applications are emerging with unprecedented capacity and nual growth rate of 47% in network traffic since 2016, i.e., an in- real-time requirements. At the core of such applications there is a crease from 7 to 49 exabytes per month. Service providers need for machine learning (ML) to create value from the data at such as Facebook, Google, Amazon, and Microsoft rely on machine the edge. Current ML systems transfer data from geo-distributed learning (ML) techniques [18, 47, 50] to extract and monetize in- streams to a central datacenter for modeling. The model is then sights from this distributed data. The predominant approach to moved to the edge and used for inference or classification. These learn from the geographically distributed data is to create a cen- systems can be ineffective because they introduce significant de- tralized model (see Figure 1a) by running ML algorithms over the mand for data movement and model transfer in the critical path raw data, or a preprocessed portion of it, collected from different of learning. Furthermore, a full model may not be needed at each data streams [12, 37]. More sophisticated solutions deal with geo- edge location. An alternative is to train and update the models on- distributed data by training models locally in the device, which are line at each edge with local data, in isolation from other edges. Still, later averaged with other user updates in a centralized location – this approach can worsen the accuracy of models due to reduced an approach known as federated learning [21, 33, 61, 64]. data availability, especially in the presence of local data shifts. A centralized model can be very accurate and generic as it in- In this paper we propose Cartel, a system for collaborative learn- corporates diverse data from multiple streams. From a system per- ing in edge clouds, that creates a model-sharing environment in spective, however, there is a challenge in moving all this data, and which tailored models at each edge can quickly adapt to changes, even the resulting model size can be significant, depending on the and can be as robust and accurate as centralized models. Results implementation, algorithm, and feature set size. Concretely, as show that Cartel adapts to workload changes 4 to 8× faster than data sources spread geographically, the network becomes the bot- isolated learning, and reduces model size, training time and total tleneck. In this case, ML algorithms [1, 12, 37], which are efficient data transfer by 3×, 5.7× and ~1500×, respectively, when com- in datacenters, can be slowed down by up to 53× when distributed pared to centralized learning. to the network edge. The emergent Multi-access Edge Computing (MEC) architecture, CCS CONCEPTS as well as 5G connectivity, are conceived to converge telecommu- Computer systems organization → Distributed architectures; nications and cloud services at the access network, and have the n-tier architectures; Computing methodologies → Machine potential to cope with the challenge described above by enabling learning; Transfer learning; Online learning settings. unprecedented computing and networking performance closer to users. 
KEYWORDS In this context, the obvious alternative to centralized learning is to replicate the algorithms at each edge cloud and run them in- Mobile-access Edge Computing (MEC), distributed machine learn- dependently with local data, isolated from other edge clouds, as ing, collaborative learning, transfer learning shown in Figure 1b. Isolated models can be useful in certain cases, ACM Reference Format: e.g., when the data patterns observed by an edge are stationary Harshit Daga, Patrick K. Nicholson, Ada Gavrilovska, and Diego Lugones. and data is not significantly diverse. However, in more challeng- 2019. Cartel: A System for Collaborative Transfer Learning at the Edge. In ing scenarios, where the distribution of input data to the ML model ACM Symposium on Cloud Computing (SoCC ’19), November 20–23, 2019, is non-stationary, or when the application requires more complex Santa Cruz, CA, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/ models – only achievable with more data than the local to a partic- 10.1145/3357223.3362708 ular edge – isolated models can have prohibitively high error rates (cf. Section 6.2). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed Therefore, although MEC and 5G technologies constitute the for profit or commercial advantage and that copies bear this notice and the full cita- infrastructure needed to run distributed machine learning algo- tion on the first page. Copyrights for components of this work owned by others than rithms, we argue that there is a need for a coordination system ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- publish, to post on servers or to redistribute to lists, requires prior specific permission that leverages the edge resources to learn collaboratively from log- and/or a fee. Request permissions from [email protected]. ically similar nodes, reducing training times and excessive model SoCC ’19, November 20–23, 2019, Santa Cruz, CA, USA updates. © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6973-2/19/11…$15.00 In this paper we introduce Cartel, a new system for collabora- https://doi.org/10.1145/3357223.3362708 tive ML at the edge cloud (Figure 1c). The idea behind Cartel is SoCC ’19, November 20–23, 2019, Santa Cruz, CA, USA Harshit Daga, Patrick K. Nicholson, Ada Gavrilovska, and Diego Lugones Edge Data Cloud (a) (b) (c) Figure 1: Machine learning systems with geographically distributed data streams, (a) Centralized learning, either raw data or partial models are shipped to a central datacenter for modeling (b) Isolated learning, models are replicated in edge cloud locations and maintained independently, and (c) Collaborative learning (Cartel), a distributed model-sharing framework to aggregate knowledge from related logical neighbors. that centralized models, although trained on a broader variety of to 3×, and the training time, by 3 to 5.7× for the ML models in data, may not be required in full at each edge node. When changes evaluation. in the environment or variations in workload patterns require the In summary, our contributions are: model to adapt, Cartel provides a jump start, by transferring knowl- A collaborative system to create, distribute and update machine edge from other edge(s) where similar patterns have been observed. learning models across geographically distributed edge clouds. 
This allows for lightweight models, reduced backhaul data transfer, Cartel allows for tailored models to be learned at each edge, but strictly improved model accuracy compared to learning in isola- provides for a quick adaptation to unforeseen variations in the tion, and similar performance to centralized models. statistical properties of the values the model is predicting – e.g., Cartel achieves the above by operating on metadata, as op- concept drift, changes in class priors, etc. posed to on raw data, and uses metadata to decide when an edge- Cartel is designed to address the key challenges in learning at based model needs to be updated, which peer should be used for the the edge (cf. §3) and relies on metadata-based operations to guide model update, and how the knowledge should be transferred from cross-edge node collaborations and reduce data transfer require- one model to another. To support these decisions and the collabo- ments in learning (cf. §4). rative learning they enable, Cartel provides three key mechanisms. We design generic metadata-based operations that underpin the It uses drift detection to determine variability in the workload distri- three key mechanisms in Cartel, along with algorithms and sys- bution observed at the edge (i.e., dataset shift) and in the accuracy tem support enabling their implementation. In particular: i) a of the model. It incorporates functions to identify logical neighbors, distance-based heuristic for detecting similarities with other edge i.e., candidate edges from which models can be transferred in case clouds, that can serve as logical neighbors for knowledge trans- of input or output drift. Finally, it supports interfaces with the ML fer, and; ii) two generic algorithms for knowledge transfer, one modeling framework to support model-specific knowledge transfer of which (based on bagging) can be applied to any machine operations used to update a model instance in one edge, using state learning model. (cf. §5). from a model in another edge stack. We describe the system func- An experimental evaluation using different models, workload tionality required to support these key mechanisms, their concrete patterns, and datasets, illustrates how Cartel supports robust implementation using specific algorithms which operate on model edge-based learning, with significant reductions in data trans- metadata, and evaluate the tradeoffs they afford for different col- fer and resource demands compared to traditional learning ap- laborative learning models. proaches (cf. §6). We use both online random forest (ORF) and online support vector machines (OSVMs) [11, 20] as running examples of learning algorithms. Both ORF and OSVM are well-known online classifi- 2 MOTIVATION cation techniques that can operate in the streaming setting. In this section, we briefly introduce MEC and collaborative learn- Moreover, since they function quite differently, and have differ- ing, and summarize evidence from prior work on opportunities to ent characteristics of the model and update sizes, together they leverage locality of information in MEC, which motivates the de- facilitate a discussion of how Cartel can be used with different ML sign of Cartel. algorithms (see Section 7). Multi-access Edge Computing. Cartel targets the emerging field Cartel is evaluated using several streaming datasets as work- of MEC [6, 16, 23, 41, 46, 57, 58, 60]. 
MEC offers computing, storage loads, which consists of randomized request patterns in the form of and networking resources integrated with the edge of the cellular time series with stationary and non-stationary data distributions network, such as at base stations, with the goal of improving ap- at each edge to cover many use cases. We compare Cartel to iso- plication performance for the end users and devices. The presence lated and centralized approaches. Results show that collaborative of resources a hop away provides low latency for real time applica- learning allows for model updates between 4 to 8× faster than iso- tions, and data offloading and analytics capabilities, while reducing lated learning, and reduces the data transfer from ~200 to ~1500× the backhaul network traffic. 5G and novel use cases demanding compared to a centralized system while achieving similar accuracy. low latency and/or high data rates, such as connected cars, AR/VR, Moreover, at each edge Cartel reduces both the model size, by up etc., are among the primary drivers behind MEC. Cartel: A System for Collaborative Transfer Learning at the Edge SoCC ’19, November 20–23, 2019, Santa Cruz, CA, USA Collaborative Learning. We define collaborative learning to be a Metadata Service (MdS) model where edge nodes learn independently, but selectively col- laborate with logical neighbors by transferring knowledge when needed. Knowledge transfer happens when a target edge execut- 2 ing a model detects an issue, such as sudden high error rates, with Request for nodes 3 Eis register and with similar model its current model. Logical neighbors are selected to assist the tar- send metadata get edge, that is, edge nodes that: i) are most similar to the target in terms of either the data they have observed or their configuration, and; ii) have models that are performing adequately. Knowledge Subset of helpful transfer involves transmitting some part of the model, or models (if neighbors (E3, E4) 1 there is more than one logical neighbor), from the logical neighbor E1 node to the target edge. This framework induces a set of primitive oper- (t) E2 node ations that can be directly used with existing ensemble-based ma- Request chine learning algorithms. Note that the same learning algorithm is Batch Insights 4 used in all edge nodes. For our proof-of-concept, we focus on ORF Insights and OSVM, where the former makes use of bootstrap aggregation (or bagging) , but the latter does not. However, we emphasize that since the primitives can be applied to any learning algorithm utilizing bagging, and that since any machine learning algorithm can make use of bagging, this means that Cartel is not limited to E4 node E3 node the two techniques, but rather can be applied in general to any ma- Edge Node (E) chine learning technique. An interesting future research direction would be to create a general abstraction to apply the set of primi- Figure 2: Cartel overview. A collaborative system consisting of edge nodes tive operations to any machine learning algorithm directly, with- (E), where E i ’s are trained independently and periodically update a MdS out the use of bagging. A more surgical approach based on tech- with metadata information about the node. A subset of logical neighbors niques such as patching may be a good candidate for such an are selected which helps the target edge node (t) to quickly adapt to change. abstraction. 
However, patching raises potential model size issues the edge stack or in the non-stationary statistical distributions of after repeated application of the primitive operations, and thus fur- workloads can decrease model accuracy, thus requiring online re- ther investigation is beyond the scope of this work. training. The challenge is to create a mechanism to quickly detect Locality in MEC. As the MEC compute elements are highly dis- and react to such variations. persed, they operate within contexts which may capture highly lo- C2: Which neighbors to contact? Our hypothesis for collaborative calized information. For instance, a recent study of 650 base sta- learning is that edge nodes running similar machine learning tasks tion logs by AT&T reports consistent daily patterns where can share relevant model portions, thereby achieving more effi- each base station exhibits unique characteristics over time. Simi- cient online adaptation to changes. The goal is to avoid sharing larly, Cellscope highlights differences in the base station char- of raw data between edge nodes, which makes nodes oblivious of acteristics, and demonstrates the change in a model’s performance data trends at other edges. Therefore, the challenge is to discover with changes in data distributions. They also demonstrate the in- appropriate logical neighbors dynamically while coping with vari- effectiveness of a global model due to the unique characteristics of ations in the workload of the multiple edges over time. each base station. In Deepcham , Li et al. demonstrate a deep C3: How to transfer knowledge to the target? In order to update a learning framework that exploits data locality and coordination in model it may be possible and/or appropriate to either merge proper near edge clouds to improve object recognition in mobile devices. model portions from collaborating nodes or to simply replace local Similar observations regarding locality in data patterns observed models with remote ones. The decision depends on various param- at an edge location are leveraged in other contexts, such as gam- eters such as the modeling technique, whether it allows for parti- ing and transportation. tioning and merging of models, as well as the feature set, the pace at which the model needs to be updated, and the efficiency of the 3 CHALLENGES cross-node data transfer paths. Thus, the challenge is to provide support for the model-specific methods for sharing and updating To elaborate on the concepts behind Cartel, we first enumerate the model state. complexities and the key challenges in implementing a distributed collaborative learning system in a general and effective manner. In the subsequent section we introduce our system design, com- 4 OVERVIEW OF CARTEL ponents, and implementation, and explain how each challenge is Goal. The main focus of Cartel is to reduce the dependence on data addressed. movement compared to a purely centralized model that must peri- C1: When to execute the collaborative model transfer among edge odically push out model updates. In contrast, Cartel only performs clouds? Since participants run independent models that can evolve knowledge transfer when a target node actively requests help from differently, each edge must determine when to initiate collabora- logical neighbors. Thus, when no such requests are active, Cartel tion with edge peers in the system. 
Changes in the configuration of only requires nodes to periodically share metadata, which is used SoCC ’19, November 20–23, 2019, Santa Cruz, CA, USA Harshit Daga, Patrick K. Nicholson, Ada Gavrilovska, and Diego Lugones to establish a relationship among nodes: raw training data is never in the figure, the metadata service is only logically centralized, and explicitly transmitted among the nodes, or to a centralized loca- may be realized in a distributed manner depending on the scale of tion. the system. Concepts. To achieve this goal while addressing the challenges Workflow. The operation of Cartel can be summarized as follows. enumerated above, Cartel relies on three key mechanisms: drift Each edge node receives batches of requests 1 over a time period – detection (C1), which allows a node to determine when to send a these requests are the workload on which predictions are made by request to its edge peers for a model transfer; logical neighbors the resident model at the edge node. These batches are later used for (C2), which for each node determines sufficiently similar nodes retraining the model locally. If an edge node is experiencing poor likely to provide the required model transfer; and knowledge trans- model accuracy, we refer to that node as the target (t). We remark fer (C3), which allows the system to decide on how to merge model that this batching model fits well with predictive analytics in the portions provided from peer nodes. streaming setting, e.g., predicting and classifying network resource A common principle underpinning these mechanisms is the use demands based on the current set of users and their usage patterns. of system-level support for operating on the metadata. The prin- The metadata from each node is aggregated by the metadata ser- ciple allows Cartel to achieve its desired goal of supporting learn- vice 2 which receives periodic updates from each edge node, de- ing at the edge with adequate accuracy and reduced data transfer scribed in Section 5.2. Cartel is aimed at scenarios with dynamic costs. As a result, Cartel provides a novel system-level support for workload behaviors. As a result, the neighbor selection cannot be metadata management – to store, aggregate, compare, and act on precomputed and stored, but is performed on-demand, based on it. This functionality is used by Cartel’s key mechanisms, which in dynamically aggregated metadata. When an edge node detects that turn facilitate collaborative learning. its model accuracy has decreased significantly (drift detection), it Metadata can be any information about a node that could po- asks the metadata service for similar nodes from which model up- tentially distinguish it from other nodes. In other words, infor- dates should be requested 3. The metadata service processes the mation about the physical hardware, software configuration, en- metadata on-demand to identify the corresponding logical neigh- abled features, active user distribution by segments, geographic bors. The target node interacts with (one of) its logical neighbors information, estimates of class priors, etc. Some metadata, such as directly to request a model update 4 , and applies the shared model enabled software features, those related to active users, or class state to update its resident model (knowledge transfer). prior estimates, can change over time at a given node. 
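Below is a minimal sketch of the per-batch workflow just described (steps 1 to 4 in Figure 2), written as a single function over pluggable callables. The function and parameter names are assumptions for illustration, not Cartel's actual API.

```python
# Sketch of the Figure 2 workflow at a target edge node; every collaborator
# (model, MdS client, drift detector) is passed in as a plain callable/object.

def process_batch(model, batch, labels, report_metadata, drift_detected,
                  find_neighbors, request_partition, merge):
    preds = model.predict(batch)        # (1) serve predictions for the batch,
    model.partial_fit(batch, labels)    #     then retrain the resident model on it
    report_metadata(labels, preds)      # (2) periodic metadata update to the MdS
    if drift_detected():                # (3) error over the last W batches exceeds L
        for peer in find_neighbors():   #     MdS returns similar, well-performing nodes
            portion = request_partition(peer)  # (4) fetch a model portion and merge it
            merge(model, portion)              #     into the resident model
            break                              # the target typically uses one helpful neighbor
```

Here drift_detected corresponds to the window/threshold check of Section 5.2, and find_neighbors to the on-demand MdS lookup in step 3.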
When such changes occur, this usually leads to a degradation in model perfor- mance, as machine learning techniques are impacted by underly- 5 DESIGN DETAIL ing dataset shifts. Examples of such dataset shifts include: changes Next, we describe the three key mechanisms in Cartel in terms in class priors (probability of certain classes), covariate shift (distri- of their metadata requirements and the algorithms they rely on bution from which input examples are sampled), concept shift/drift for metadata manipulations, and describe the system support that (classes added/removed or changing boundary functions), and other enables their implementation. more general domain shifts (see for a detailed survey). In our system discussion and experiments, we focus on the first type of shift, i.e., changes in class priors. Thus, in our illustrative 5.1 Metadata Storage and Aggregation online classification examples and experiments, metadata refers To support its three building blocks, Cartel maintains and aggre- specifically to empirical estimates for the prior probabilities for gates model metadata in the metadata service, and also stores some each class, as well as overall and per-class error rates. Such meta- metadata locally at each edge node. For each of the previous W ≥ data is available in general, and therefore allows us to concretely 1 batches, up to the current batch, each edge node maintains: i) describe an implementation of each of the three key mechanisms counts for each class observed in the batch; ii) the overall model that can be applied in general. We emphasize, however, that ad- error rate on the batch, and; iii) the error rate per-class for the ditional application specific (or even completely different) choices batch. Here W is user-defined window length parameter, which is for metadata are possible, such as the ones enumerated above, to used to adjust how sensitive the system is to changes in the model address other dataset shifts. metadata. In terms of memory cost, at each edge node Cartel stores Architecture Overview. Figure 2 shows the architecture overview O(CW ) machine words, if C is the total number of classes in the of Cartel. The system is comprised of edge nodes (E) and a metadata classification problem. service (MdS).At a high level, an edge node maintains a tailored To aggregate model metadata, the metadata service relies on pe- model trained using data observed at the node, and the metadata riodic updates reporting the metadata generated at each edge node. associated with that model. The metadata service is responsible for We considered several aggregation policies which trade the data aggregating and acting on the metadata generated by edge nodes, transfer requirements of Cartel for the quality of the selected logi- so as to facilitate the selection of appropriate peer nodes for col- cal neighbors. The most trivial approach is where edge nodes send laborative model updates. In other words, when a collaborative updates after every request batch. This helps in ensuring there is model update is requested, the metadata service is responsible for no stale information about the edge node at any given time. Thus, selecting the subset of edge nodes that can share portions of their O(CN ) machine words are transferred after each batch, where C models, with the node that requests assistance. These edge nodes is the number of classes, and N the number of edge nodes. 
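As a rough illustration, and assuming the class-prior metadata used throughout this section, the per-node metadata window described above (class counts, overall error, and per-class error for each of the last W batches, i.e., O(CW) state) could look as follows. The class and method names are hypothetical.

```python
from collections import Counter, deque

class MetadataWindow:
    """Keeps, for each of the last W batches: (i) class counts, (ii) overall
    error rate, (iii) per-class error rates. State is O(C*W) machine words."""

    def __init__(self, window_W=5):
        self.batches = deque(maxlen=window_W)

    def record(self, y_true, y_pred):
        counts = Counter(y_true)                                    # (i)
        wrong = Counter(t for t, p in zip(y_true, y_pred) if t != p)
        overall_err = sum(wrong.values()) / max(len(y_true), 1)     # (ii)
        per_class_err = {c: wrong[c] / counts[c] for c in counts}   # (iii)
        self.batches.append((counts, overall_err, per_class_err))
        # Under the "regular updates" policy, this tuple (O(C) words) is what
        # gets pushed to the MdS after every batch.
        return self.batches[-1]

    def avg_error(self):
        # Average error over the window; compared against the hard limit L.
        return sum(err for _, err, _ in self.batches) / max(len(self.batches), 1)

    def class_priors(self):
        # Empirical class priors over the window, i.e., the distribution the
        # MdS compares across nodes when picking logical neighbors.
        total = Counter()
        for counts, _, _ in self.batches:
            total.update(counts)
        n = sum(total.values()) or 1
        return {c: total[c] / n for c in total}
```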
We are then responsible for negotiating with the target to determine refer to this operation policy as regular updates. These updates which portions to share. Although shown as a single component can further be sparsified by not sending them for every batch, but Cartel: A System for Collaborative Transfer Learning at the Edge SoCC ’19, November 20–23, 2019, Santa Cruz, CA, USA for every m batches, referred to as interval updates. Interval up- is to find a logical neighbor that has similar class priors to the tar- dates can provide additional reduction in data transfer, but result in get node, as this node has most potential to help. Logical neighbors stale data at the MdS. For instance, an edge node may have model are computed by the metadata service after receiving the request performance that degraded recently, but the MdS is oblivious to for help from the target node. The mechanism relies on the model these changes during logical neighbor selection. One can fix this metadata collected from each of the edge nodes, and on a similarity by adding further validation steps, however, this will delay the col- measure used to compare models based on the metadata. laboration process. An alternative policy is to make use of thresh- Similarity measure. For our example where class priors are under- old updates, where an edge sends an update only when there is a going some shift, the empirical distributions from the target node change in the metadata beyond some user-defined threshold. This can be compared with those from the other nodes to determine reduces the required data transfer, while also attempting to avoid which subset of edge nodes are logical neighbors of the target node. stale information, but requires an additional threshold. In the eval- The metadata service maintains a rolling average of this metadata uation of Cartel we primarily use the regular update policy, as it information provided by the edge nodes, which in memory costs provides an upper bound on how much metadata Cartel must trans- can be defined as O(CN ) machine words, if C is the total number fer. However, in Section 6.3 we also explore the additional benefits of classes in the classification problem and N be the total number that a threshold update policy can provide. of edge nodes in the system. The metadata service is thus responsi- We remark that Gaia also employs a threshold update method- ble for determining which nodes are logical neighbors, and does so ology. However, a fundamental difference between Cartel and Gaia, via a similarity measure. There are many measures that can be used is that the later focuses on a new ML synchronization model that for this purpose, such as Kullback-Leibler divergence (KLD) , improves the communication between the nodes sending the mod- Hellinger distance , Chi-squared statistic and Kolmogorov- els updates. Though beneficial for models with smaller model up- Smirnov distance. After evaluating these techniques empiri- date size, the amount of data transfer will increase if applied to cally, we selected Jensen-Shannon divergence (JSD) (which is models with more memory consumption such as ORF. Further, based on Kullback-Leibler divergence), as a function to determine Gaia’s goal is to build a geo-distributed generalized model, whereas the distance of two discrete probability distributions, Q 1 and Q 2 : Cartel supports tailored model at each node that only seek updates JSD(Q 1 , Q 2 ) = (KLD(Q 1 , Q̃) + KLD(Q 2 , Q̃))/2 where, Q̃ = (Q 1 + when a change is observed. 
∑ Q 2 )/2, and KLD(Q, Q̃) = i Q(i) log2. JSD has convenient Q (i ) Q̃ (i ) properties of symmetry, and normalized values between [0, 1]; this 5.2 Cartel – Three Key Mechanisms is in contrast with KLD which is unbounded. If JSD(Q 1 , Q 2 ) = 0 then the distributions Q 1 and Q 2 are considered identical by this Drift Detection. As discussed, a dataset shift can cause poor pre- measure. On the other hand, as JSD(Q 1 , Q 2 ) → 1 then the distri- dictive performance of ML models. Through a drift detection mech- butions are considered highly distant. anism, our goal is to quickly improve the performance of models Once a list of logical neighbors with high similarity is identified, on nodes where such dataset drift has occurred. Drift detection is the list is pruned to only contain neighbors with low model error a widely studied problem, especially in the area of text mining and rates. Neighbors with high error rate, e.g., those that are also cur- natural language processing [62, 63]. It is important to note that in rently undergoing dataset drift, are filtered from the list. At this prior work on drift detection, there is often an interest in detect- point, the top-k logical neighbors in the list are transferred to the ing both positive and negative drifts. However, for Cartel we only target node, which then negotiates the knowledge transfer. Impor- take action upon a negative drift (i.e., the error rate of the model in- tantly, if the MdS finds no satisfactory logical neighbors (e.g., if creases over time). Any existing drift detection algorithm that can all JSD scores exceed a user-defined threshold), it will return an detect negative changes to model error rates can be used in Cartel. empty result set to the target node. For ease of exposition, we opt for a straightforward threshold- Knowledge Transfer. The final step in Cartel is to take advan- based drift detection mechanism that requires a user-specified hard tage of the logical neighbors’ models. The knowledge transfer con- limit L ∈ [0, 1] on the error-rate of a resident model. Thus, based sists of two abstract steps: partitioning and merging. The knowl- on the two parameters, L and W , drift detection is performed lo- edge transfer process is dependent upon the machine learning tech- cally at each edge node after processing a batch. The average error nique used by the application and is accomplished through model rate of the model is computed on the previous W batches to detect transfer – a machine learning technique for transferring knowl- whether the hard threshold L has been exceeded, indicating a drift. edge from a source domain (i.e, the data observed by the logical Though simplistic, more sophisticated algorithms for drift detec- neighbors) to a target domain (i.e., the problematic data arriving at tion also make use of two (and often more) such thresholds : the target node). The main difference between standard model typically the thresholds are set with respect to statistical tests or en- transfer and this partitioning and merging setting is that there is tropy measures to determine what constitutes a significant change the potential to transfer knowledge from multiple sources, and also (cf. [7, 31]). that there is a resident model already present at the target node. Logical Neighbor. 
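To make these metadata operations concrete, here is a hedged sketch of the threshold drift check (parameters W and L) and of the JSD-based ranking used for logical neighbor selection, which is detailed next. For reference, the divergence is JSD(Q1, Q2) = (KLD(Q1, Q̃) + KLD(Q2, Q̃)) / 2 with Q̃ = (Q1 + Q2) / 2 and KLD taken with log base 2. SciPy's jensenshannon returns the square root of this quantity, so it is squared below; every other name and threshold value is an assumption for illustration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def drift_detected(window_error_rates, hard_limit_L=0.15):
    # Negative drift: average error over the last W batches exceeds L.
    return np.mean(window_error_rates) > hard_limit_L

def rank_logical_neighbors(target_priors, candidates, k=1,
                           max_jsd=0.5, max_error=0.15):
    """candidates maps node_id -> (class_prior_vector, recent_error_rate).
    Returns up to k node ids ordered by similarity, or an empty list if no
    node is similar and healthy enough (the MdS then returns no neighbors)."""
    scored = []
    for node_id, (priors, err) in candidates.items():
        if err > max_error:        # prune neighbors that are themselves drifting
            continue
        jsd = jensenshannon(target_priors, priors, base=2) ** 2  # in [0, 1]
        if jsd <= max_jsd:
            scored.append((jsd, node_id))
    return [node_id for _, node_id in sorted(scored)[:k]]
```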
Although drift detection is useful in determin- As part of the Cartel system, during the knowledge transfer step, ing the need for help (i.e., for knowledge transfer) from an external after the target node receives the logical neighbors, the target node source, we still face the challenge of finding out the node(s) that are attempts to identify those classes that have been most problematic. most similar to the target in terms of their characteristics: either The target node computes this information by examining the per- the data they have observed or their configuration. These nodes class error rates over the last W batches. Any classes that have are also known as logical neighbors. Intuitively, the goal of Cartel SoCC ’19, November 20–23, 2019, Santa Cruz, CA, USA Harshit Daga, Patrick K. Nicholson, Ada Gavrilovska, and Diego Lugones error-rates exceeding a user defined threshold are marked as prob- Edge Node (Target) lematic, and communicated to the logical neighbors as part of the Collaborative Learning request for help. Next, depending on the machine learning algo- Register Predict rithm running on the edge nodes, partitioning and merging pro- Train Data ceeds in different ways. We give two different methods for parti- Analyzer ML Model tioning and merging that can be applied broadly to any classifica- Partition Communicator tion problem. Store Transfer Merge Bagging Approach. For the case of ORF, or any online learning algorithm that uses bootstrap aggregation, knowledge transfer is Find Node Collaborative Learning achieved via the following straightforward technique. Suppose the model at the target node contains M sub-models, each constructed Register Predict on a separate bag: e.g., in the case of ORF, M is the number of on- Metadata Service Train Analyzer ML line decision trees in the forest. By replacing a Z ∈ (0, 1] fraction (MdS) Model Partition of the sub-models in the target ensemble (Z of the trees from the Communicator target in the case of ORF), with sub-models collected from the logi- Transfer Merge cal neighbors, we can intuitively create a hybrid model at the target node, that is somewhere in between the old model and the models Edge Node (Logical Neighbor) from the logical neighbors. In other words, we partition the models Figure 3: Cartel system functions. in the logical neighbors, selecting Z × M sub-models among the logical neighbors, and then merge these with the existing model at from their weight matrices, by simply selecting the rows corre- the target node. sponding to these classes. These rows are transmitted to the target Before discussing how to set Z, it first makes sense to answer node where the node merges these model portions into its OSVM the question of which trees should be replaced in the target en- model by overwriting the corresponding rows with the weights semble, and which trees should be used among those at the logi- from the logical neighbors. We note that the same approach ap- cal neighbors. We employ the following heuristic: replace Z of the plies in general to other one-versus-rest classifiers, but that it is trees having the highest error rate in the target node ORF, with the especially appealing in the case of linear models like OSVM, as the Z trees having the lowest error rate from the logical neighbors. data required to transmit the F + 1 weights is small compared to To achieve partitioning and merging with bagging, Cartel must other approaches. 
therefore additionally maintain, at each edge node, the error rates Crucially, both methods presented have the property that the of each sub-model (e.g., decision tree in the forest). Fortunately, resulting model at the target node does not increase in size. In par- for many libraries implementing ensemble ML algorithms (such ticular, the same number of sub-models is present for the bagging as scikit-learn ), this information is readily accessible from the approach and for OSVM the matrix representing the hyperplanes model APIs. We also experiment with another heuristic that re- is exactly the same size in terms of the total number of entries. This places Z trees with the highest error rate in the target node ORF, means that knowledge transfer can be repeated without gradually with the Z trees that have the lowest per-class error rate among inflating the size of the model. the problematic classes. The focus for Cartel is to provide the system support for parti- The exact setting of Z, in general, can depend on workload dy- tioning and merging operations to be easily integrated for collab- namics, as well as the distance between the target node and logi- orative learning. The specific implementation of partitioning and cal neighbors. For our datasets and workloads we experimentally merging is left to the user, beyond the generic approaches just de- found that Z = 0.3 worked well, and discuss this later in Sec- scribed. Thus, Cartel abstracts this mechanism as a set of APIs for tion 6.3. However, other choices, such as Z in the range [0.3, 0.6] partitioning and merging that can be extended by any machine also behaved similarly. Thus, a precisely engineered value of Z is learning model to be incorporated into the Cartel system, as shown not required to yield similar benefits for the workloads and datasets in Figure 3 and described in the next section. we used. We leave automatic online tuning of Z as an interesting topic for future investigation. 5.3 Cartel Runtime One-versus-Rest Approach. In contrast to the bagging approach The Cartel runtime at each edge node consists of two blocks of above, linear OSVMs using a one-versus-rest (or one-versus-all) functions – Learning and Collaborative – as shown in Figure 3. approach to multi-class classification to construct a set of C hyper- The Learning component depends on the machine learning tech- planes in an n-dimensional space, one for each class, each defining nique used and the type of problem it addresses (e.g., classifica- boundary between its associated class and items not in that class. tion, regression, etc.). It constitutes the learning part of the system This boundary is therefore represented as a row in a C × (F + 1) which makes predictions on the incoming data using the model, weight matrix containing the coefficients – each associated with compares the predicted values to later observations determining one feature plus an additional bias term – representing the hy- the error rate, and subsequently re-trains the model using obser- perplane, where F is the number of features for the model. For vations. It provides four interfaces: predict, (re)train, partition, OSVM, knowledge transfer can be accomplished by updating the and merge. The predict function keeps track of overall and class- weights assigned to the features of problematic classes. 
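The two knowledge-transfer paths described above can be sketched in a few lines. The following is an illustrative approximation under this section's assumptions (an ensemble of M sub-models for the bagging case, and a C × (F + 1) weight matrix for the one-versus-rest case); it is not Cartel's actual implementation, and the helper names are hypothetical.

```python
import numpy as np

def merge_bagging(target_trees, target_errors, neighbor_trees, neighbor_errors, Z=0.3):
    """Replace the Z*M highest-error sub-models (trees) at the target with the
    Z*M lowest-error sub-models from the logical neighbor. The ensemble keeps
    exactly M members, so repeated transfers never inflate the model."""
    M = len(target_trees)
    n = max(1, int(Z * M))
    worst = np.argsort(target_errors)[-n:]    # trees to evict at the target
    best = np.argsort(neighbor_errors)[:n]    # trees to import from the neighbor
    merged = list(target_trees)
    for dst, src in zip(worst, best):
        merged[dst] = neighbor_trees[src]
    return merged

def merge_one_vs_rest(target_weights, neighbor_weights, problem_classes):
    """Overwrite only the rows of the C x (F+1) weight matrix (one hyperplane
    per class: coefficients plus a bias term) that correspond to the
    problematic classes; the matrix shape, and hence model size, is unchanged."""
    merged = target_weights.copy()
    merged[problem_classes, :] = neighbor_weights[problem_classes, :]
    return merged
```

As noted in the text, Z around 0.3 worked well for the evaluated workloads, and because both operations preserve the model's size, knowledge transfer can be repeated without gradually growing the resident model.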
The logical wise results predicted by the model while the (re)train function neighbors then partition the subset of these problematic classes is responsible for training the model on the incoming data feed. Cartel: A System for Collaborative Transfer Learning at the Edge SoCC ’19, November 20–23, 2019, Santa Cruz, CA, USA Moreover, when a model update is requested by a target node, the into feature values, and includes a mix of benign data samples partition function of the logical neighbor helps in finding the por- as well as malicious attacks such as distributed denial-of-service tion of the model that increases the model’s accuracy at the target (DDoS) attacks, Heartbleed, web and infiltration attacks captured edge node. Finally, the merge function of the target node incorpo- over a period of 5 days. We use a subset of the features for two days rates the model update received from the supporting edges and of the dataset consisting of benign data samples, DDoS attacks, and completes a cycle in a collaborative learning process. port scan attacks, consisting of 500k data samples in total. The Collaborative component is independent of the machine learn- Workloads. Edge nodes process requests in discrete batches over ing technique used. It is responsible for drift detection, for trig- time. A batch consists of varying number of requests correspond- gering look-ups and for interacting with logical neighbors. It pro- ing to different classes in a dataset (e.g., corresponding to one of vides four functions – register, analyzer, communicator, and the ten different digits from 0 to 9 in the case of MNIST, or to transfer. With register an edge node joins the Cartel knowledge a different type of attack in case of CICIDS2017). In our experi- sharing pool. The analyzer function analyzes the prediction results ments, the dataset shift we focus on is a change in class priors, i.e., to determine a drift. Upon drift detection, it additionally performs a change in the distribution of the classes arriving at a node. data trend analysis to identify the problematic classes at the target We generate several synthetic request patterns which correspond node. The communicator function interacts with MdS to update the to different types of change patterns in the workload. The results node’s metadata information, based on the metadata aggregation presented in the paper are primarily based on the Introduction and policy. Additionally, if a drift is detected, it also sends a request to Fluctuation patterns. Introduction corresponds to a case where a MdS for logical neighbors. The transfer function opens a commu- new class gets introduced at an edge node abruptly after 25 batches. nication channel between the target node and the selected logical Fluctuation is a distribution pattern where a new class is introduced neighbor(s) to request and receive model portions. at batch 25, but then disappears and re-appears at batches 50 and 75 respectively. This is analogous to the Introduction pattern with 6 EVALUATION periodicity. Other patterns used in our evaluation include Uniform, where all classes are uniformly distributed across all edge nodes, We present the results from the experimental evaluation of Cartel, and Spikes, where several new classes are introduced in succession, and seeks to answer the following questions: each of which does not persist. The results obtained from these lat- 1. How effective is Cartel in reducing data transfer costs, while ter two workloads are similar, so we omit them for brevity. 
providing for more lightweight and accurate models that can We have used emulated nodes and synthetic workloads, primar- quickly adapt to changes? (Section 6.2); ily because of limited availability of real infrastructure and data. 2. What are the costs in the mechanisms of Cartel and the design However, in the following section we successfully demonstrate the choices? (Section 6.3); benefits of Cartel. Moreover, with more edge nodes, we expect the 3. How does Cartel perform with realistic applications? (Section 6.4). savings from transmitting metadata compared to the raw data will persist. 6.1 Experimental Methodology Experimental Testbed. We evaluate Cartel on a testbed repre- senting a distributed deployment of edge infrastructure, with em- ulated clients generating data to nearby edge nodes. The testbed 6.2 Benefits from Cartel consists of five edge nodes and a central node representing a cen- We compare Cartel to centralized and isolated learning, with re- tralized datacenter. All nodes in the system are Intel(R) Xeon(R) spect to the changes observed at the edge in the three systems. A (X5650) with 2 hex-core 2.67GHz CPUs and 48GB RAM. centralized system repeatedly builds a generic model using data Datasets and applications. The first experiment uses an image collected from all the nodes in the system. This model is then dis- classification application where edge nodes participate in classi- tributed among the edge nodes. In such a system there exists a gap fying images into different categories using ORF and OSVM. The between the error bound and the model performance at the edge models are implemented using the ORF library provided by Amir node. This is due to the time difference between the periodical up- el at. and, for OSVM, scikit-learn. We use the MNIST data- date of the model at the edge nodes. In contrast, in an isolated envi- base of handwritten digits , that consists of 70k digit images. ronment, each edge node is trained individually and any change(s) A set of 1000 uniformly randomly selected images (training data), in the workload pattern could impact the predictive performance distributed across each of the edge node, is used for preliminary of the model. model training. The remainder of the dataset is used to generate a We measure the time taken to adapt to changes in the class series of request patterns, following the different distribution pat- priors in the workload, and examine the resource demand (cost) terns described below; batches from these requests are used for in terms of data transferred over the backhaul network, time re- online training. quired to train the online model and model size. In Figures 4 and 5, We also evaluate Cartel with a second use case based on net- we present the results for the image classification application with work monitoring and intrusion detection (Section 6.4) that uses the ORF and OSVM, and the Introduction and Fluctuation patterns. We CICIDS2017 Intrusion Detection evaluation dataset. This use use different workload patterns to assess the impact of change in case further illustrates some of the tradeoffs enabled by Cartel, and request distribution on the performance of the systems. For each helps generalize the evaluation. The CICIDS2017 dataset consists case, with a horizontal dashed red line, we show the error lower of a time series of different network measurements, preprocessed bound, obtained with offline model training. We used window size SoCC ’19, November 20–23, 2019, Santa Cruz, CA, USA Harshit Daga, Patrick K. 
Nicholson, Ada Gavrilovska, and Diego Lugones (a) ORF overall error (b) ORF class error (c) OSVM overall error (d) OSVM class error Figure 4: Performance comparison for introduction workload distribution (lower error rate is better) showcasing overall model error and the introductory class error change. Misclassification of introductory class degrades the model performance. Cartel is able to quickly adapt to the change in distribution (given by the horizontal arrow). The horizontal line (in red) defines the lower bound obtained through an offline training. (a) ORF overall error (b) ORF class error (c) OSVM overall error (d) OSVM class error Figure 5: Model performance comparison for fluctuation workload. Similar to Figure 4 Cartel is able to adapt to the changes in the distribution and behaves close to the centralized system. W = 5 and hard error rate limit of L = 0.15 for Cartel andW = 10 divide the data transferred into two categories: (i) data / commu- for the centralized system to provide the model updates at the edge. nication cost which includes the transfer of raw data or metadata Adaptability to Change in the Workload. We observe that updates, and (ii) model transfer cost which captures the amount the centralized system is more resilient to the change in the dis- of data transferred during model updates to the edge (periodically tribution pattern. This is due to the generic nature of the edge in case of a centralized system or a partial model request from a model which is regularly synchronized with the central node and logical neighbor in Cartel). is built using prior knowledge from the other edge nodes. On the Cartel does not centralize raw data and only transfers models other hand, the isolated system and Cartel experience a spike in the when there is a shift in the predictive performance. This design model error rate for the same change. We define the time taken for helps in reducing both data / communication cost and model trans- the model error rate to return to the baseline (10%) as the adapt- fer cost by ~1700×, and 66 to 200×, respectively, thereby reducing ability of the system to change. This adaptability is denoted with a the overall cost of total data transferred for Cartel by two to three horizontal arrow in Figures 4 and 5. orders of magnitude, compared to the centralized system, as shown Cartel’s drift detection allows the target node to have increased in Table 1. adaptability with respect to the dataset shift (measured as a smaller We note that for ORF the cost of the model updates is the domi- horizontal spread in terms of number of batches) when compared nating factor in the total data transferred, whereas the data / com- to an isolated system. Specifically, when using OSVM and ORF munication dominates for OSVM. As discussed, the data / commu- techniques, Cartel performs 8× and 4× faster, respectively, as com- nication for Cartel is O(CN ) per batch. For a centralized model, pared to an isolated system. The adaptability of Cartel is important the data / communication is O(BF N ) where B is the average num- for both workload patterns. For Fluctuation (Figure 5), Cartel helps ber of data points in a batch. Provided BF ≫ C, we can expect the to bring the system back to an acceptable predictive performance data / communication cost to be much lower for Cartel than a cen- while the isolated system takes a longer time to adapt to the fluc- tralized system. tuation. For applications where dataset shifts are less frequent, we ex- Data Transfer Cost. 
In a centralized system, an edge node proac- pect Cartel will provide better predictive performance in the long tively updates its model from the central server which helps in im- run. We expect the gains of Cartel to persist even when consider- proving the inference at edge nodes. However, this improvement ing federated learning-based approaches to building a centralized comes at a cost of a proactive model transfer between the edge model. Those are reported to reduce communication costs by and the central node. To capture the network backhaul usage we one to two orders of magnitude, but, importantly, they strive to build at each edge a global model, and can miss the opportunity Cartel: A System for Collaborative Transfer Learning at the Edge SoCC ’19, November 20–23, 2019, Santa Cruz, CA, USA Workload ML D/C (×) MU (×) Total (×) ORF Introduction 1745 200 212 Fluctuation 1760 66 71 OSVM Introduction 1763 188 1573 Fluctuation 1763 94 1404 Table 1: Ratio of data transferred in a centralized system versus Cartel. D/C represents data/communication, MU represents model update transfer cost and Total represents the combined cost. (a) Overall model error (b) Introductory class error 25 0.04 Figure 7: Effect of different drift detection policies on overall model and introductory class error rate: “Delayed by X ” implies the drift detection Execution Time (s) 20 0.03 Centralized was delayed by X batches. 15 Isolated 0.02 10 Cartel training time up to by 5.7×, while for OSVM, since the model size 0.01 is constant, the smaller training batch size alone reduces the train- 5 ing time by 3× compared to a centralized system. 0 20 40 60 80 100 0.000 20 40 60 80 100 Impact of Machine Learning Algorithm. We remark that ben- Batch ID Batch ID efits from Cartel are not primarily related to the ML model accu- (a) ORF (b) OSVM racy, but rather to its adaptability during distribution changes. In Figure 6: Time required to train the ORF and SVM model. Similar trends the case of OSVM, although linear SVM exhibits high bias on the are observed for different workload distribution. Combination of global MNIST data, adding more features or using kernel SVMs are un- model and bigger training dataset in a centralized system increases the likely to speed up convergence. As such we expect the benefits of online training time of ML model. Cartel to persist even when other non-linear methods or feature transformations are used. To test this, we ran the MNIST work- to benefit from reduced model sizes or training times, as discussed load as before, but with Random Fourier Features (RFF) [13, 53] to next. improve model accuracy.1 Although these additional features low- Model Size. The model size depends on the machine learning tech- ered the average error rate, we observed similar improvements to nique used. It plays an important role in data transfer during model adaptability, as well as a reduction in data transfer, as observed updates as mentioned above, as well as during retraining of the with linear SVM and ORF. Thus, Cartel can provide similar bene- model. Cartel results in smaller, tailored models for each edge node, fits when used with these additional techniques. leading to faster online training time. Since ORF is an ensemble In summary, Cartel boosts the system’s adaptability by up to 8×. learning method, it builds multiple decision trees, each with vary- It achieves a similar predictive performance compared to a cen- ing number of nodes. 
The size of the ORF model depends on several tralized system while reducing the data transfer cost up to three factors, such as the data and hyperparameters used. From our ex- orders of magnitude. Cartel enables the use of smaller models at periments with two edge nodes, we observe that Cartel results in each edge and faster training. a reduction of the model size by 3× on an average when compared to a centralized system. This reduction is achieved because a tai- 6.3 Effect of Mechanisms lored model in Cartel does not store as much information when compared to a generic model used in a centralized system. This is We next investigate the impact that each of the mechanism in Cartel expected in MEC, because it operates in contexts with highly lo- has on the overall system performance. calized information that can result in fewer classes being active or Drift Detection Timeliness. Timely drift detection is important observed at each edge [48, 52]. Beyond reduction in classes, the for Cartel, since a delay in detection can impact a model’s predic- number of nodes in the ORF grows less quickly in Cartel vs. the tive performance. To demonstrate the impact of slow drift detec- centralized system due to fewer total training examples, further tion, we modify the drift policy to delay the request to the MdS reducing the model size. For ORF, the model resulting from use of for logical neighbors by a variable number of batches. The results Cartel is similar to that of isolated learning, but has faster adapt- in Figure 7, show the impact of drift detection delay on the over- ability (as shown in the above discussion). Since OSVM uses a ma- all model’s performance as well as the misclassification rate of the trix to represent the hyperplane parameters corresponding to each introduced class for ORF; we observe a similar pattern for OSVM. class, there is no difference (without applying further compression) The request for model transfer from a logical neighbor at the ac- in size of an OSVM model trained for subset of classes compared tual time of drift detection stabilizes the system quickly. If the de- to one trained using all classes. lay is too large (e.g., in the figure, a delay of 20 batches), the model Training Time. The online training time for a machine learning transfer does not provide collaboration benefits, as online training model is a function of the training dataset and the model size. In an eventually improves the model. In the current implementation of online system, a smaller model size and/or less data helps in train- Cartel, drift detection triggers immediate requests for logical neigh- ing the model faster. Figure 6 shows the difference in the training bors. We acknowledge that overly sensitive drift detection may times for the ORF and OSVM model during our experiment. The cause short and non-persistent workload fluctuations to trigger smaller model size and smaller local batch size reduces the ORF 1 We used the implementation of Random Fourier Features of Ishikawa. SoCC ’19, November 20–23, 2019, Santa Cruz, CA, USA Harshit Daga, Patrick K. Nicholson, Ada Gavrilovska, and Diego Lugones 15 Model Error % 10 Regular 5 Threshold (2%) Threshold (5%) 0 20 40 60 80 100 Batch ID (a) Data transfer (b) Overall model error (a) Overall model error (b) Data transfer Figure 8: Comparison of MdS metadata aggregation policies. Figure 9: Comparison of adaptability of Cartel and total data transfer for various Z values when using ORF ML model on MNIST dataset. 
unwarranted and frequent transfer requests, increasing the over- all data transferred over the network; detailed sensitivity analysis can be used to develop automated methods for determining effec- 40 Cartel tive ranges for the delay parameter in the future. Merge (Z=0.3) MdS Aggregation Policies. The metadata at MdS can be updated 30 Merge All Model Error % Replace All according to different policies described in Section 5.1. All other ex- periments use the “regular” update policy, but we note that addi- 20 tional data reduction benefits can be obtained through the “thresh- 10 old” update policy when configured appropriately. Figure 8a shows a comparison of the regular update policy with different thresh- 0 20 40 60 80 100 old policies. Here we use a 2% and 5% change in class priors at Batch ID each edge as the threshold parameter. As the threshold increases, (a) ORF overall model error (b) ORF data transfer we observe a reduction in the total metadata transfer (labeled as Figure 10: Comparison of adaptability of the system and total data transfer data/communication cost transfers in the figure). However, too between various knowledge transfer mechanisms, on MNIST dataset using high thresholds could result in an increase in model update costs. ORF as machine learning model, that can be applied in Cartel. Use of a threshold parameter can result in a misrepresentation of the distribution pattern of different edge nodes at the MdS. The problematic class, while Z > 0.6 leads to higher error rate for the corollary to this is the selection of incorrect logical neighbors, shown non-problematic classes at the target node, as shown in Figure 9. in Figure 8b, where the system repeatedly requests for model up- Both result in more time required by Cartel to adapt to the changes. dates and fails to adapt to the changes in the system due to in- Hence, we selected Z = 0.3 as the optimal value, since higher Z correct logical neighbors, negating the benefit of reducing other values, for ORF, result in more bandwidth utilization. communication costs. In addition, there are a few ways to apply a model update. In Finding Logical Neighbors. In a distributed deployment with Figure 10 we evaluate the impact of each of these on the perfor- hundreds of edge nodes, collaboration can be performed by: i) se- mance of Cartel, in terms of its ability to quickly converge to good lecting a node at random, ii) based on geographical proximity, or; models at the target node (with low model error rate) (left hand- iii) based on similarities determined through node metadata infor- side graphs in the figure), and in terms of the data transfer require- mation, for example. We experiment with the impact of Cartel’s ments (right hand-side graphs). For ORF, one can replace the ex- logical neighbor selection, compared to various baseline approaches, isting forest with the logical neighbor’s forest (Replace All); two using the Introduction workload and ORF. The logical neighbor al- or more forests can be merged (taking the union of all the trees) gorithm in MdS is modified to introduce i selection failures (by (Merge All); the best performing Z of trees from the logical neigh- randomly selecting among the nodes not identified as a top match bor can be merged with the target’s forest (Merge (Z = 0.3)); by the MdS algorithm), before finally providing the correct logical or, finally, the worst performing trees in the target model can be neighbor. 
Finding Logical Neighbors. In a distributed deployment with hundreds of edge nodes, collaboration can be performed by: i) selecting a node at random; ii) selecting based on geographical proximity; or iii) selecting based on similarities determined through node metadata information, for example. We experiment with the impact of Cartel's logical neighbor selection, compared to various baseline approaches, using the Introduction workload and ORF. The logical neighbor algorithm in the MdS is modified to introduce i selection failures (by randomly selecting among the nodes not identified as a top match by the MdS algorithm), before finally providing the correct logical neighbor. Depending on the batch size and how long drift detection takes, multiple failures – in our experimental setup, more than two – negate any benefit that knowledge transfer has on the accuracy of the model. Thus, it is critical that the MdS uses timely metadata about edge nodes and an effective similarity measure to identify good logical neighbor candidates.

Knowledge Transfer Balance. As discussed in Section 5.2, knowledge transfer for OSVM involves selecting the coefficients associated with the hyperplane for the problematic classes. However, ORF operates by selecting the Z trees with the lowest error rate for partitioning at a logical neighbor, raising the question of the value of Z. For our experimental setup, we tested the following values of Z: 0.1, 0.2, ..., 1.0. We found that Z < 0.3 fails to help for the problematic class, while Z > 0.6 leads to a higher error rate for the non-problematic classes at the target node, as shown in Figure 9. Both result in more time required by Cartel to adapt to the changes. Hence, we selected Z = 0.3 as the optimal value, since higher Z values for ORF result in more bandwidth utilization.

In addition, there are a few ways to apply a model update. In Figure 10 we evaluate the impact of each of these on the performance of Cartel, in terms of its ability to quickly converge to good models at the target node (with a low model error rate; left-hand-side graphs in the figure), and in terms of the data transfer requirements (right-hand-side graphs). For ORF, one can replace the existing forest with the logical neighbor's forest (Replace All); two or more forests can be merged, taking the union of all the trees (Merge All); the best performing fraction Z of trees from the logical neighbor can be merged with the target's forest (Merge (Z = 0.3)); or, finally, the worst performing trees in the target model can be replaced by the best performing Z trees from the logical neighbor's model. The latter is what is used in Cartel, and is enabled by the use of additional local metadata at each edge, examined during the knowledge transfer request.

Figure 10: Comparison of adaptability of the system and total data transfer between various knowledge transfer mechanisms that can be applied in Cartel, on the MNIST dataset using ORF as the machine learning model. (a) ORF overall model error; (b) ORF data transfer.

Replacing the entire edge model with the neighbor's model might not work because each edge node experiences different distributions, and a blind merge from a logical neighbor would not work if only a few classes were common among the nodes. When possible (i.e., for ORF), merging all or a portion of the model (the ORF trees) seems to be a good solution when considering the error convergence time at the target. However, this increases the overall model size by up to 2× for Merge All, which further results in an increase in training time by an average of 2× for Merge All, or 1.3× for Merge (Z = 0.3). Replacing portions of the target model based on the top performing Z classes in the neighbor's model results in a target model that quickly converges to the model's lower error bound, and keeps model transfer costs low. By enabling replacement of only those model portions which relate to the problematic classes at the target, Cartel achieves both quick adaptation and low transfer costs.
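The tree-replacement variant used by Cartel can be sketched as follows. This is an illustration only: it assumes a forest is held as a list of trees with a parallel list of locally tracked per-tree error estimates, and the function name transfer_trees is ours, not Cartel's.

    import math

    def transfer_trees(target_forest, neighbor_forest,
                       target_errors, neighbor_errors, z=0.3):
        """Replace the worst-performing ceil(z * |forest|) trees at the target
        with the neighbor's best-performing trees, keeping the forest size fixed."""
        k = math.ceil(z * len(target_forest))

        # Indices of the target's worst trees (highest local error estimate).
        worst_at_target = sorted(range(len(target_forest)),
                                 key=lambda i: target_errors[i], reverse=True)[:k]
        # Indices of the neighbor's best trees (lowest error estimate).
        best_at_neighbor = sorted(range(len(neighbor_forest)),
                                  key=lambda i: neighbor_errors[i])[:k]

        for dst, src in zip(worst_at_target, best_at_neighbor):
            target_forest[dst] = neighbor_forest[src]
            target_errors[dst] = neighbor_errors[src]
        return target_forest

In this framing, Merge All would instead correspond to concatenating the two forests (roughly doubling the model, consistent with the 2× size growth noted above), and Replace All to discarding the target forest entirely.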
6.4 Use Case - Network Attack

This scenario is based on data from the CICIDS2017 Intrusion Detection evaluation dataset. The testbed consists of two edge nodes and a central node. One of the nodes (Edge0) experiences a sustained port scan attack, while the other (Edge1) experiences a DDoS attack. After a time period, Edge0 (the target node) experiences a similar DDoS attack. The ML model used to classify the attacks is ORF, with 10 tree predictors at each node, each with a maximum depth of 16. The workload follows the Introduction distribution and consists of 900 batches, each with ~1000 data points of various network metrics (features).

Figure 11: Performance and total data transfer comparison for the network attack dataset using ORF to classify benign requests against DDoS and port scan attack requests.

Figure 12: Comparison of three knowledge transfer mechanisms applied to the network attack dataset using ORF as the classification model.

Result. As shown in Figure 11, collaboration with a logical neighbor helps the target node adapt 5× faster to the introduction of the DDoS attack, which was already observed by Edge1, compared to the isolated system. A centralized system with edge nodes receiving regular model updates from a central node does not require time to adapt; however, even with only two edge nodes, the total data transfer is 90× more, and the time taken for training is 2× that of Cartel.

We performed a more comprehensive evaluation of Cartel using the intrusion detection dataset. Figure 12 demonstrates that the knowledge transfer mechanism in Cartel reduces model transfer cost 2.5× or more compared to the other mechanisms, while keeping a lower overall model error rate. Additionally, the results from the Fluctuation workload exhibit increased adaptability of the system, reduced total data transfer (by 60×) and faster model training time (by 1.65×), compared to centralized learning. These results showcase a similar trend to the ones described for the MNIST-based use case in Section 6.2; the graphs are omitted for brevity.

7 DISCUSSION

We have shown how Cartel performs, in terms of data transfer, training time and model size. Our results demonstrate the potential of Cartel, but also illustrate several opportunities to be further explored.

Collaboration Scope. We evaluate Cartel under different data distribution scenarios. However, the evaluation is performed using synthetic workloads in an attempt to model realistic scenarios, and is limited to a few edge nodes. Given that a centralized model must periodically be pushed to all edge nodes, we expect that the data transfer reduction of Cartel will be larger in deployments with dozens or hundreds of edge nodes. Thus, it would be interesting to evaluate the benefits of Cartel in such scenarios, and for use cases with live traffic and other dataset shifts. In particular, the choice of some of the parameters and threshold values in Cartel is dependent upon the workload and the characteristics of the underlying infrastructure, and further work is needed to establish practical policies for choosing effective values.

Generalize to other machine learning algorithms. We present the benefits of Cartel with ORF and OSVM as the underlying machine learning algorithms, and developed general partitioning and merging algorithms that work in the bagging and one-versus-rest paradigms. Still, we plan to explore other partitioning and merging heuristics that can be used directly (rather than requiring bagging). We are interested in general methods for deep neural networks (DNN), and also in evaluating online regression problems in addition to online classification. As mentioned, one possibility is patching, though further developments are needed to ensure models do not become excessively large over many partitioning and merging operations. Recent work on knowledge transfer through merging of DNNs [4, 14, 67] could be a stepping stone in extending Cartel to support DNN models. Other recent work has been done to partition DNNs across mobile, edge and cloud [22, 26, 29, 32], yet additional advances are needed in the ML algorithms to improve the efficacy of model transfer.

Privacy. While a discussion about privacy is beyond the scope of this paper, we note that edge nodes in Cartel use the raw data to train models, but do not explicitly transmit this data to other nodes. As such, the information sent to the MdS or to logical neighbors is a sketch derived from the raw data, e.g., a histogram of the data distribution at a node. However, we still foresee concerns about this sketched data, and believe such concerns also apply to other techniques such as federated learning. Regarding the possibility and handling of malicious edge nodes, our scope is limited to cooperating nodes that are owned or managed by a single service provider. We leave further exploration of issues, such as trust, as future work.
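To make the preceding point about sketched metadata concrete, the snippet below shows the kind of summary an edge might report instead of raw samples: a normalized class histogram computed over a batch of labels. The function name and dictionary format are illustrative assumptions, not Cartel's actual wire format.

    from collections import Counter

    def class_histogram(labels):
        """Normalized class histogram for one batch; this summary, rather than
        the raw feature vectors, is what an edge would report to the MdS."""
        counts = Counter(labels)
        total = sum(counts.values())
        return {cls: n / total for cls, n in counts.items()}

    # An edge seeing mostly port scan traffic with some benign requests might
    # report {"portscan": 0.7, "benign": 0.3}, which is enough for the MdS to
    # match it with peers that saw a similar mix.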
8 RELATED WORK

Cartel is a system that leverages the unique characteristics of each edge and enables collaboration across nodes to provide a head start in adapting the model to changes observed at the target node. This is in contrast to existing systems [38, 43, 65] where data processing and analysis happen in a single datacenter; however, the excessive communication overhead of distributed machine learning algorithms makes such systems unsuitable in a geo-distributed setting.

Systems such as Gaia, Project Adam, federated learning and others [10, 39, 64] focus on addressing the communication overhead of running machine learning methods, such as parameter servers and large deep neural networks, in a geo-distributed environment. Additionally, the distributed setting involves interaction with a large number of nodes, where some of these nodes can experience failures. MOCHA is a system designed to handle such stragglers. DeepCham, IONN, Neurosurgeon and Splitnet are examples of systems where the machine learning model is partitioned across mobile, edge or cloud resources that work in a collaborative way to train a unified model. These systems do not consider custom models for each node in MEC, where an edge might not require a global model trained on a broad variety of data.

Similarly to Cartel, Cellscope is also aimed at creating better models at edge nodes. Using real data, the authors show evidence that global models can lead to poor accuracy and high variance. However, the focus of that work is on providing a bigger dataset by intelligently combining data from multiple base stations to help in building the local model at edge nodes. In contrast, Cartel avoids data transfer and aims to provide model updates from logical edge nodes only when there exists a data shift.

Finally, there exist many machine learning algorithms [8, 24, 27, 34] for incrementally training machine learning models in an efficient manner, as well as more sophisticated knowledge transfer techniques [17, 49], that Cartel can leverage to further improve learning performance.
9 CONCLUSION

In this paper we introduce Cartel, a system for sharing customized machine learning models between edge datacenters. Cartel incorporates mechanisms to detect changes in the input patterns of the local machine learning model running in a given edge, to dynamically discover logical neighbors that have seen similar trends, and to request knowledge transfer from them. This cre…

REFERENCES

Amazon. 2018. Amazon Machine Learning: System Limits. https://docs.aws.amazon.com/machine-learning/latest/dg/system-limits.html
B. Amento, B. Balasubramanian, R. J. Hall, K. Joshi, G. Jung, and K. H. Purdy. 2016. FocusStack: Orchestrating Edge Clouds Using Location-Based Focus of Attention. In ACM Symposium on Edge Computing (SEC '16).
Shabab Bazrafkan and Peter M. Corcoran. 2018. Pushing the AI envelope: merging deep networks to accelerate edge artificial intelligence in consumer electronics devices and systems. IEEE Consumer Electronics Magazine 7, 2 (2018), 55–61.
Rudolf Beran et al. 1977. Minimum Hellinger distance estimates for parametric models. The Annals of Statistics 5, 3 (1977), 445–463.
Ketan Bhardwaj, Ming-Wei Shih, Pragya Agarwal, Ada Gavrilovska, Taesoo Kim, and Karsten Schwan. 2016. Fast, Scalable and Secure Onloading of Edge Functions using AirBox. In Proceedings of the 1st IEEE/ACM Symposium on Edge Computing (SEC '16).
Albert Bifet and Ricard Gavalda. 2007. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining. SIAM, 443–448.
Léon Bottou. 1998. Online Algorithms and Stochastic Approximations. http://leon.bottou.org/papers/bottou-98x (revised Oct. 2012).
Leo Breiman. 1996. Bagging predictors. Machine Learning 24, 2 (Aug. 1996), 123–140. https://doi.org/10.1007/BF00058655
Ignacio Cano, Markus Weimer, Dhruv Mahajan, Carlo Curino, Giovanni Matteo Fumarola, and Arvind Krishnamurthy. 2017. Towards Geo-Distributed Machine Learning. IEEE Data Eng. Bull. 40, 4 (2017), 41–59. http://sites.computer.org/debull/A17dec/p41.pdf
Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. 2008. Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines. Journal of Machine Learning Research 9 (2008), 1369–1398. https://dl.acm.org/citation.cfm?id=1442778
Trishul M. Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In OSDI, Vol. 14. 571–582.
Radha Chitta, Rong Jin, and Anil K. Jain. 2012. Efficient kernel clustering using random Fourier features. In 2012 IEEE 12th International Conference on Data Mining. IEEE, 161–170.
Yi-Min Chou, Yi-Ming Chan, Jia-Hong Lee, Chih-Yi Chiu, and Chu-Song Chen. 2018. Unifying and Merging Well-trained Deep Neural Networks for Inference Stage. CoRR abs/1805.04980 (2018). arXiv:1805.04980 http://arxiv.org/abs/1805.04980
Cisco. 2017. Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2016–2021 White Paper. https://cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/mobile-white-paper-c11-520862.pdf
Stephane Daeuble. [n.d.]. Small cells and Mobile Edge Computing cover all the bases for Taiwan baseball fans. https://www.nokia.com/blog/small-cells-mobile-edge-computing-cover-bases-taiwan-baseball-fans/
Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. 2007. Boosting for Transfer Learning. (2007), 193–200. https://doi.org/10.1145/1273496.1273521
Facebook. [n.d.]. Applying machine learning science to Facebook products. https://research.fb.com/category/machine-learning/
João Gama, Indre Zliobaite, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. 2014. A survey on concept drift adaptation. ACM Comput. Surv. 46, 4 (2014), 44:1–44:37. https://doi.org/10.1145/2523813
Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. 2008. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML 2008), Helsinki, Finland, June 5–9, 2008 (ACM International Conference Proceeding Series), Vol. 307. ACM, 408–415. https://doi.org/10.1145/1390156.1390208
