Empowering Data Mesh with Federated Learning PDF
Document Details
Uploaded by SlickCotangent
Eindhoven University of Technology, Uppsala University
Haoyuan Li, Salman Toor
Tags
Related
Summary
This paper presents a novel approach for empowering Data Mesh with federated learning, showcasing the potential benefits of a distributed data analysis strategy. It highlights the challenges of centralized data architectures and proposes a decentralized paradigm shift for organizations striving for intelligent decision-making.
Full Transcript
Empowering Data Mesh with Federated Learning Haoyuan Li Salman Toor...
Empowering Data Mesh with Federated Learning Haoyuan Li Salman Toor Department of Industrial Design, Department of Information Technology, Eindhoven University of Technology Uppsala University Eindhoven, Netherlands Scaleout Systems [email protected] Uppsala, Sweden [email protected] [email protected] ABSTRACT In response to these challenges, concepts such as Data Ware- arXiv:2403.17878v2 [cs.LG] 27 Mar 2024 The evolution of data architecture has seen the rise of data lakes, houses and Data Lakes have emerged, providing structured and aiming to solve the bottlenecks of data management and promote unstructured repositories, respectively, that can store, manage, and intelligent decision-making. However, this centralized architecture analyze vast amounts of data [Khine and Wang(2018)]. In this mono- is limited by the proliferation of data sources and the growing de- lithic data paradigm, decision-making relies on a centralized data mand for timely analysis and processing. A new data paradigm, team to process and manage the analytical data. However, many Data Mesh, is proposed to overcome these challenges. Data Mesh organizations are suffering from this centralized pattern when scal- treats domains as a first-class concern by distributing the data own- ing to accommodate large volumes of data [Dehghani(2023b), Bode ership from the central team to each data domain, while keeping et al.(2023)]. The central data team faces considerable challenges the federated governance to monitor domains and their data prod- as it bears the responsibilities associated with managing diverse ucts. Many multi-million dollar organizations like Paypal, Netflix, data types and meeting the escalating demands of geographically and Zalando have already transformed their data analysis pipelines dispersed business units, a predicament further intensified by the based on this new architecture. In this decentralized architecture global expansion of modern enterprises. Consequently, organiza- where data is locally preserved by each domain team, traditional tions often find themselves dedicating substantial time and re- centralized machine learning is incapable of conducting effective sources to resolving data silos issues, hindering timely and effective analysis across multiple domains, especially for security-sensitive data-driven decision-making. organizations. To this end, we introduce a pioneering approach To overcome this bottleneck, a paradigm shift toward decen- that incorporates Federated Learning into Data Mesh. To the best tralized data management has been proposed, known as the Data of our knowledge, this is the first open-source applied work that Mesh architecture [Dehghani(2023b), Dehghani(2023a)]. This novel represents a critical advancement toward the integration of feder- architecture advocates for the distribution of responsibilities and ated learning methods into the Data Mesh paradigm, underscoring ownership of data, enabling organizations to handle large-scale the promising prospects for privacy-preserving and decentralized data management in a more efficient and effective manner. Data data analysis strategies within Data Mesh architecture. Mesh embraces the nature of operational data and analytical data, with a focus on domain-oriented, distributed analytical data and KEYWORDS federated governance. By integrating product thinking into data management, the Data Mesh promotes the concept of distributed Data Mesh, Federated Learning, Domain-Intelligence, Privacy, Data- data products, each owned by a specific domain team within an driven analysis organization, to be consumed by other domains or end-users. As ACM Reference format: a result, data management becomes a fundamental aspect of the Haoyuan Li and Salman Toor. 2024. Empowering Data Mesh with Federated domain teams’ responsibilities, encouraging a culture of ownership Learning. In Proceedings of ACM Knowledge Discovery and Data Mining, and accountability to ensure the quality of data products. Barcelona, Spain, 25th - 29th August, 2024 (Conference acronym KDD), 9 pages. In the context of decentralized data architecture, data is owned https://doi.org/10.1145/nnnnnnn.nnnnnnn by each domain team, with no requirement for sharing with other domain teams or aggregation into global data lakes. These domain- specific model teams are responsible for the production of machine 1 INTRODUCTION learning models to serve as interfaces for consumers. Simultane- The concept of Big Data has experienced significant evolution since ously, when organization-wide decisions need to be made, all do- its introduction, with continuous advancements over the past few main teams are required to collaborate on the project and generate decades. Initially characterized by the 3𝑉 𝑠 model (Volume, Variety, a global model to assist the federated governance team. To achieve Velocity), as proposed by Doug Laney [Beyer and Laney(2012)], this this objective, we incorporate federated learning in our study to model has expanded to encompass Veracity and Value, culminating train the machine learning models in a decentralized manner. in the 5𝑉 𝑠 model to describe the main challenges when handling The topology of federated learning models aligns well with the big data. In today’s data-driven landscape, the rapid growth of Data Mesh structure, as it inherently supports the decentralization data volume and complexity in the modern world poses significant and domain-specific governance principles of the Data Mesh. Feder- challenges for data-driven organizations that are striving to explore ated learning enables each domain team to train machine learning the potential value of their data asset [Dehghani(2023b)]. models without accessing raw data in other domains, respecting Conference acronym KDD, 25th - 29th August, 2024, Barcelona, Spain Haoyuan Li and Salman Toor the principle of data locality and ownership, which is a cornerstone Other data-driven companies, like PayPal [pay(2023)] and In- of the Data Mesh architecture. Due to the domain-oriented archi- tuit [int(2023)], are also working on the implementation of their tecture, federated learning facilitates the creation of more robust strategies for Data Mesh, taking their initial steps on the Data and diverse models by enabling learning from a variety of domain- Mesh journey. As a pioneer in self-service analytics, PayPal decom- specific datasets. In the context of Data Mesh, where each domain poses the architecture of Data Mesh into a set of deployable ele- team owns different types of data, the models trained through feder- ments called data quantum maintained by the specific domain data ated learning can potentially benefit from this diversity, leading to team [Perrin(2023)]. Intuit also provided its own visions for Data more generalized and accurate predictions. The proposed solution Mesh, formulating the future direction of its data-driven systems. is based on an open-source applied work toward the integration of They define a set of strategies from three perspectives that can im- federated learning methods into the Data Mesh paradigm. prove the process of data discovery and organization [Baker(2023)]. The following list is the main contribution of this work: The Data Mesh paradigm is gaining momentum as a promis- ing solution for businesses struggling with the constraints of con- (1) Identifying the main characteristics of machine learning ventional data management approaches. An expanding body of models when conceptualized as data products within a research is devoted to delivering practical guidelines for imple- distributed data architecture. menting a Data Mesh from an enterprise perspective. In their work, (2) Constructing the domain-specific architecture of the split Butte et al. [Butte and Butte(2022)] concentrate on constructing learning model under scenarios of both shared and pre- domain components, addressing interoperability amongst these served labels. entities. They further furnish an abstract cloud service-based archi- (3) Propose two common use cases that demonstrate the ad- tectural blueprint for implementing the Data Mesh. Bode et al. [Bode vantages of domain-oriented data segregation for business et al.(2023)] provide empirical insights into the Data Mesh, derived applications. from 15 semi-structured interviews conducted with industry ex- perts. Machado et al. propose the domain model to represent the ba- 2 RELATED WORK sic components that reside on each independent domain [Machado The Data Mesh concept was initially proposed by Zhamak De- et al.(2021)]. The evolution of the modern data paradigm is iden- hghani in a foundational blog post, where it was positioned as a tified in [Machado et al.(2022)], together with the features and response to the limitations of prevailing monolithic data architec- deployment strategy for Data Mesh. tures. The blog articulated the need for a paradigm shift towards decentralized data structures, laying out the key motivations be- 3 CORE COMPONENTS hind the Data Mesh concept and summarizing the fundamental attributes of the logical components within the data mesh [De- 3.1 Data Mesh hghani(2023b)]. A subsequent post further elaborated on the Data Data Mesh emphasizes the decentralization of data ownership, Mesh, outlining four primary principles: domain ownership, data as domain-oriented data products, self-serve data infrastructure, and a product, a self-serve data platform, and federated computational federated computational governance. This empowers domain teams governance [Dehghani(2023a)]. These principles form the backbone to take ownership of their data while fostering collaboration and of a comprehensive high-level Data Mesh architecture, offering a data sharing across domains. Notably, the principle of the self-serve standard logical model for further investigation. data infrastructure, while essential in understanding the entirety of Building upon this architectural foundation, numerous busi- Data Mesh, falls outside of this research’s scope due to its distinct nesses have reported practical implementations of the Data Mesh, focus on infrastructural considerations that do not directly interact adapting it to align with their specific business strategies. For in- with our exploration of machine learning methods. In the follow- stance, Zalando [zal(2023)], an e-commerce enterprise dealing with ing sections, we present the overview of Data Mesh principles in a myriad of data sources, has engineered a tailored Data Mesh our study, highlighting the essential aspects for training machine framework to alleviate the bottlenecks of a central team and ensure learning models under this decentralized architecture. data quality. This framework incorporates the "Bring Your Own 3.1.1 Domain-Oriented Data Ownership. The core feature of Bucket" (BYOB) mechanism to facilitate decentralized data storage. Data Mesh is decentralized data management. Drawing inspira- Users are able to integrate their data buckets with the centralized tion from Domain-Driven Design (DDD) principles, Data Mesh dis- data lake, allowing them to leverage processing platforms or tech- tributes data ownership across various domains [Dehghani(2023b)]. niques offered by the governance team [Databricks(2023), Machado The centralized analytical data is distributed to domain-oriented et al.(2022)]. Similarly, Netflix [net(2023)] has designed its own Data teams responsible for managing and processing their domain data. Mesh platform to enhance data movements within Netflix Studio. These domain teams can create and maintain data products for end- A primary focus of this platform is to establish a self-service envi- users or other domains. Furthermore, domain teams can collaborate ronment where users can develop their data pipelines and adhere on global activities under the guidance of a global team, enabling to standardized process policies, thereby minimizing redundant efficient and flexible data management across the organization. effort in data pipeline development. The platform is built upon the context of the ETL (Extract-Transform-Load) process in a self- 3.1.2 Data as a Product. The underlying thought of this princi- service manner. Additionally, it applies schemas to all data pipelines ple comes from combining product thinking with analytical data. to guarantee data event quality and simplify data discovery [Net- The domain team, as the data owner, should assume responsibil- flix(2023b), Netflix(2023a), Netflix(2023c)]. ity for generating and sustaining data products for consumers. Empowering Data Mesh with Federated Learning Conference acronym KDD, 25th - 29th August, 2024, Barcelona, Spain Structurally, a well-defined data product comprises four compo- domains, domain owners can exercise control and provide input nents: code, data and metadata, infrastructure, and interfaces [De- into the learning process. This not only potentially improves the hghani(2023a)]. Depending on the source and purpose, data prod- quality and relevance of the models but also aligns with the data- ucts can be categorized into two types: atomic data products and as-product principle, ensuring data is managed and curated within composite data products. Atomic data products originate from its domain context. Moreover, FL addresses privacy and security source data and cater to end-users or downstream data products. concerns that are often inherent in centralized learning. By main- Composite data products are formed by ingesting data from other taining data within its domain during model training, sensitive data data products, encompassing both basic and composite data prod- does not need to be exposed to a central authority, reducing the ucts from upstream sources. These data products are designed to be risk of data breaches and privacy violations. consumed by end-users rather than other domain teams for specific Building on these advantages, we will now delve into three dis- use cases. tinct types of Federated Learning: Horizontal FL, Vertical FL, and Although customized data products are produced for various Split Learning. Each of these Federated Learning methods presents purposes, they all share similar characteristics to ensure their qual- unique characteristics that could potentially be beneficial in dis- ity. In our study, each domain team cooperates to train the split tributed data architecture. In our study, we concentrate on Split neural network, and each domain produces a partial model to serve Learning, highlighting the features of Split Learning that align with the global server. We use eight attributes identified by [Goedegebu- data mesh and the procedural strategies for its effective implemen- ure et al.(2023)] to present an overview of a high-quality product tation within such a framework. generated by federated learning (model as a product). 3.2.1 Horizontal Federated Learning. Horizontal Federated Learn- 3.1.3 Federated Computational Governance. To balance the con- ing (HFL) was first introduced by Google, aiming to train the ma- centration between centralization and decentralization, Data Mesh chine learning models on decentralized data across multiple devices, creates a global-level entity inside its architecture to govern each reducing the need for data transfer and thus enhancing privacy and domain. This federated governance model aims to facilitate effi- efficiency.HFL also known as collaborative learning, is utilized in cient collaboration and coordination among domain teams while the scenario where each client or node in the federated learning maintaining a high level of autonomy. It defines the global stan- setup has data from many users, but the feature space is the same or dardization rules to ensure the interoperability of each domain. By similar across all clients [Yang et al.(2019)], as described in Figure setting a set of governance policies, the global team can monitor 1-(a). the data products produced by domain teams. In a data mesh architecture, data ownership is distributed across Federated governance activities in Data Mesh can be categorized disparate domains, each possessing distinct types and features of into two types: global governance and local governance. Global gov- data. This decentralized data distribution is contrasted by the as- ernance occurs at a higher level within the data mesh and guides do- sumptions of HFL, which posits that all nodes or clients share a main teams in fulfilling their responsibilities for managing domain- common feature space, differing only in samples or users. While specific data. Meanwhile, local governance takes place closer to HFL maintains its merits and applicability in certain contexts, it the data domain and focuses on maintaining the quality of data may not align optimally with the principles and practicalities of a products produced by domain teams. In our study, we identified data mesh environment. Specifically, the data within Data Mesh is five governance activities [Goedegebuure et al.(2023)] in data mesh distributed to each independent domain and shares similar samples when training federated learning models. but contains different feature spaces. 3.2 Federated Learning 3.2.2 Vertical Federated Learning. Vertical Federated Learning (VFL) provides a framework that allows different entities to engage Federated Learning is a machine learning approach that encour- in the collaborative training of machine learning models while ages model training across a broad network of independent, de- maintaining robust data privacy and security safeguards [Yang centralized nodes. These nodes, in the context of the data mesh, et al.(2019)]. As is shown in Figure 1-(b), this approach is specially correspond to the variety of domains where data naturally resides. tailored for scenarios wherein participating entities possess dis- This methodology aligns closely with the fundamental tenets of parate feature spaces for identical samples. Such a structure is the data mesh, offering a host of benefits and making it a suitable particularly suitable when direct data sharing is unfeasible due to choice for machine learning applications within this distributed legal restrictions or privacy considerations. structure. VFL brings many advantages in terms of data privacy and de- A key advantage of Federated Learning is its harmony with the centralized learning, but it does encounter certain challenges when philosophy of decentralized data ownership specific to domains, applied in a Data Mesh environment. VFL necessitates precise data which is a foundational aspect of the data mesh model. In contrast alignment across domains, requiring all domains to maintain iden- to the issues associated with data copying in centralized learn- tical entities, differing only in the features they hold. However, ing, Federated Learning allows data to stay in its original domain in a data mesh setting, this level of synchronization might not throughout the learning phase. This practice effectively minimizes always be feasible or efficient, thus posing a challenge for VFL’s the requirement for data duplication and transfer, addressing the implementation. associated inefficiencies and potential risks related to data integrity. Furthermore, FL enhances the role of domain owners in the ma- 3.2.3 Split Learning. Split Learning (SL) is another novel ap- chine learning process. By training models within their respective proach that allows training deep neural networks on data from Conference acronym KDD, 25th - 29th August, 2024, Barcelona, Spain Haoyuan Li and Salman Toor Figure 1: The difference between horizontal federated learning(a), vertical federated learning(b) and split learning(c) multiple parties in a distributed manner. SL is introduced to re- solve security concerns when training deep neural networks for data-sensitive applications [Gupta and Raskar(2018)]. The key idea behind SL is to perform model training across multiple nodes while minimizing the data communication overhead and preserving data privacy. Consider a neural network composed of 𝐿 layers. In SL, we divide this network at the 𝑘-th layer, this layer is also called the cut layer. The client controls the layers from 1 to 𝑘, whereas the server manages the layers from (𝑘 + 1) to 𝐿. We symbolize the output from the 𝑘-th layer as ℎ𝑘 (𝑥;𝑊𝑘 ), where 𝑥 represents the input data, and 𝑊𝑘 stands for the model’s parameters up to the 𝑘-th layer. During forward propagation, the client conveys ℎ𝑘 (𝑥;𝑊𝑘 ) to the server. The server then employs its section of the model, de- Figure 2: Basic Structure of Split Learning noted as ℎ𝑘+1:𝐿 (ℎ𝑘 (𝑥;𝑊𝑘 );𝑊𝑘+1:𝐿 ), to calculate the output. Then the loss will be calculated based on the labels on clients, or the server depending on where the labels data is preserved. When efficiency and preserving bandwidth. In terms of security, only the backpropagation takes place, the server calculates the gradients intermediate representations, or features extracted from the raw concerning its parameters and the input it received, specifically, data, are shared for the training process. This reduces the exposure ∇𝑊𝑘+1:𝐿 and ∇ℎ𝑘 (𝑥;𝑊𝑘 ). The gradient ∇ℎ𝑘 (𝑥;𝑊𝑘 ) is then transmit- of sensitive raw data, as the shared representations often do not ted back to the client. The client utilizes this gradient to compute carry explicit sensitive information, or they make it substantially and update its parameters using the gradient descent method. The harder to extract [Vepakomma et al.(2018)]. weight update process on the client’s end can be formulated as: 𝑊𝑘 := 𝑊𝑘 − 𝛼 · ∇𝑊𝑘 𝐿𝑜𝑠𝑠 (ℎ𝑘+1:𝐿 (ℎ𝑘 (𝑥;𝑊𝑘 );𝑊𝑘+1:𝐿 )) (1) 4 SYSTEM ARCHITECTURE In our research, we propose two distinct architectures for two sep- The server’s weight update step is represented as: arate scenarios within the geographically distributed data mesh - 𝑊𝑘+1:𝐿 := 𝑊𝑘+1:𝐿 −𝛼 ·∇𝑊𝑘+1:𝐿 𝐿𝑜𝑠𝑠 (ℎ𝑘+1:𝐿 (ℎ𝑘 (𝑥;𝑊𝑘 );𝑊𝑘+1:𝐿 )) (2) label sharing and label preserving. These designs strictly adhere to In SL, the computation of a neural network model is partitioned the domain-oriented principle whereby each domain is recognized into two parts: a client model that processes the initial layers of the as an independent data owner. Each domain adheres to a no-peek network, and a server model that handles the subsequent layers. policy, which limits its access to raw data from other domains. This The raw data is kept within the client side, and only the interme- policy, however, permits the sharing of data products like inter- diate representation generated at the cut layer can share with the mediate model weights or gradients exchanged among domains. server for further processing. From a practical perspective, Split For our experimental setup, we employed a concatenation-based Learning is flexible and adaptable, able to support a wide range aggregation mechanism. While this approach is simple and tar- of network architectures and machine learning tasks. Typically, geted, it is not immune to the issue of stragglers. Nonetheless, there SL can be tailored for both horizontal and vertical cases based on are alternative strategies, such as element-wise sum and average the distribution of data on the connected clients, as is described in pooling, which could be explored in varying contexts [Ceballos Figure 1-(c). et al.(2020)]. Within the architecture of data mesh, Split Learning aligns well with the principle of domain-oriented decentralized data owner- 4.1 Distributed Domain Data with Label Sharing ship. The approach also substantially reduces the amount of data In the first scenario, label data is securely retained by the con- that needs to be transmitted over the network, thereby increasing sumer located on the server side. The loss calculation is executed Empowering Data Mesh with Federated Learning Conference acronym KDD, 25th - 29th August, 2024, Barcelona, Spain server-side, and subsequently, gradients are back-propagated. These gradients follow the sequence layers, being disseminated to each respective domain model at the point of the aggregation layer. Figure 4: Distributed Domain Data without Label Sharing Figure 3: Distributed Domain Data with Label Sharing 4.2 Distributed Domain Data without Label Sharing The second scenario arises when label data is considered sensitive, and its sharing with consumers is not permissible. This label data might either be safeguarded by an autonomous data owner or reside with one of the data domain teams. Under this architectural design, Figure 5: Recommendation System for Retail Industry loss calculation is executed on the client side. Following this, the calculated loss is relayed back to the output layer situated at the final layer of the server model. Notably, the process of back-propagation contributes by generating a partial model based on its proprietary of the gradient in this setup follows a U-shaped trajectory. data. These models then serve as upstream data products that are used by the marketing team to build a comprehensive recommenda- 5 USE CASES tion model for subsequent analyses, such as optimizing marketing To demonstrate the versatility of our proposed solution, we have investment. constructed two prevalent business use cases necessitating the involvement of distributed domains. 5.2 Fraud Detection for Financial Institution Fraud detection presents a security-sensitive concern within finan- 5.1 Recommendation System for Retail cial organizations, often requiring a careful balance of data security Industry and analytical accessibility across multiple domains. In our study, The first use case we implemented in our study is a personalized we utilize an anonymized credit card transaction dataset [Kag- recommendation system in the retail industry, as is shown in Figure gle(2023b)], which is hypothetically partitioned into three distinct 5. The H&M Group [Kaggle(2023a)] provided open-source dataset is domains: finance, cardholder, and security. In this setup, the fraud used for this use case. It is inherently partitioned into three distinct prevention domain is tasked with detecting fraudulent activities, data domains: transaction data (historical purchase records), article despite lacking direct access to raw data. Consequently, each data data (product information), and customer data (user metadata). This domain collaborates by contributing a partial model towards this use case pertains to real-world retail businesses and their challenges effort. associated with large-scale data management and analytics. This use case is closely aligned with organizations managing The overarching task, as proposed by the federated governance online transactions, necessitating an exceptionally high level of team, is to construct a recommendation system. Each domain thus security and the adept handling of sensitive data. The federated Conference acronym KDD, 25th - 29th August, 2024, Barcelona, Spain Haoyuan Li and Salman Toor governance team holds the responsibility of supervising the training Table 1: Samples in H&M Personalized Fashion Recommen- process, in addition to formulating and issuing policies pertaining dations to encryption techniques, thereby ensuring the security of the intermediate models. The high-level architecture of this use case is Dataset Training Testing No. Users No. Items plotted in Figure 6. H&M-small 78,246 10,109 919 1,132 H&M-medium 393,312 34,683 3,153 4,807 H&M-large 1,097,915 246,360 8,212 25,773 6.2 Datasets and required partitions Two public datasets are used in experiments. The first dataset, H&M Personalized Fashion Recommendations, is provided by H&M Group for product recommendation based on previous purchases [Kaggle(2023a)]. In total, the dataset comprises 1.37 million users, 106k products, and 31.8 million transaction records. Three sizes of datasets—small, medium, and large—are generated based on sam- pling ratios of 0.001, 0.003, and 0.01, respectively. The distribution of three datasets is summarized in Table 1. The NCF (Neural Collaborative Filtering) model learns implicit feedback from transaction data, where only the positive class (e.g., Figure 6: Fraud Detection for Financial Institution customer-product interactions) is observed [He et al.(2017)]. To ad- dress the inherent one-class problem, the negative sampling (NEG) technique is adopted [Mikolov et al.(2013)]. NEG balances the im- 6 RESULTS AND DISCUSSION plicit dataset by generating a set of negative sampling from the The following experiments are carried out on the above-mentioned unseen user-item matrix. This prevents model over-fitting on posi- two use cases: recommendation systems for fashion retailing and tive interaction and reduces the computation complexity [Mikolov credit card fraud detection in banks. Each use case possesses a et al.(2013)]. The NEG ratio is set to 5 for small and medium size centralized model and a split learning model. The centralized model of data, and 2 for large size of data, as recommended by [Mikolov represents the scenario when data is stored in a data center and et al.(2013)]. owned by the central data team. This is also the typical architecture The second selected dataset is anonymous credit card transaction for the current enterprise data platform. The split learning model records provided by Worldline and the Machine Learning Group of creates the scenario when the data is distributed to decentralized ULB [Kaggle(2023b)]. Features 𝑉 1 through 𝑉 28 are derived from data domains. In this case, the ownership of domain data belongs Principal Component Analysis (PCA), while transaction time and to the corresponding data team within the data mesh. the amount have not undergone any transformation. A notable as- In the decentralized configuration, data will be loaded by a dis- pect of this dataset is its highly imbalanced nature in terms of fraud tributed DataLoader. To simulate the structure of the data mesh, fea- detection. As is shown in table 2, the imbalance ratio between the tures are split into different domains and sent out to each data owner. negative and positive classes stands at a mere 0.17%. This imbalance Domain teams provide a data product containing all autonomous tends to bias the model towards the majority class and hinders its technical components (e.g., code, metadata, operational API, and ability to discern patterns within the minority class. infrastructure) to serve other domain teams and global decision- In our study, we adopt the Synthetic Minority Oversampling making [Dehghani(2023a), Dehghani(2023b), Machado et al.(2022)]. Technique (SMOTE) [Chawla et al.(2002)] to relieve the effect of bias Here, specifically, is the partial model provided by each data domain in the data. We initially utilized SMOTE to augment the data points team. in the minority class. SMOTE generates synthetic samples in the feature space of the minority class based on the 𝐸𝑢𝑐𝑙𝑖𝑑𝑒𝑎𝑛 distance. 6.1 Environment Setup A large sample ratio may lead to the model overfitting the minority All experiments are conducted on the Linux server, equipped with class; thus, we set the sample ratio as 1.5%. To further optimize an Intel Xeon Gold 6230𝑅 Processor, 128𝐺𝐵 RAM, and an NVIDIA the classifier’s performance across both classes, we combine the RTX A5000 GPU with 24 GB of memory. random under-sampling with SMOTE, a practice endorsed by the To address the needs of the split learning we have used PySyft original authors of the SMOTE algorithm [Chawla et al.(2002)]. The as a base framework [Ziller et al.(2021)]. PySyft is an open-source distribution of the resampled data is visualized in Table 2. library created to facilitate privacy-preserving deep learning. All programs are written in Python 3.7.12 using the PySyft library 6.3 Evaluation Metrics (PySyft 0.2.9) and compatible PyTorch library (PyTorch 1.4.0). Using In the use case of the recommendation system, the accuracy metric JupyterLab for interactive deployment and visualization. The source is not appropriate to measure the quality of the ranking. Here, we code is available in a public GitHub repository 1. select Hit Ratio at 𝐾 (HR@K), and Normalized Discounted Cumula- 1 https://github.com/Haoyuan-L/Fed_DataMesh tive Gain at 𝐾 (𝑁 𝐷𝐶𝐺@𝐾) to evaluate the performance of random Empowering Data Mesh with Federated Learning Conference acronym KDD, 25th - 29th August, 2024, Barcelona, Spain Table 2: Comparison of Class Imbalance Ratio 6.4 Accuracy Analysis Table 3 represents the performance of the recommendation system Dataset Negative Class Positive Class Imbalance Ratio (%) on three sizes of the H&M dataset to test the stability of the split Original 199 020 344 0.17 neural network on the scaling dataset. The experiment is conducted Sampled 149 250 2985 2.00 on three system configurations: Recommendation System on Cen- tralized Neural Network (CRN), Recommendation System on Split Neural Network with label sharing (SRN1), and Recommendation System on Split Neural Network without label sharing (SRN2). recommendations in 𝑘 items. We use Recall at 𝐾 (Recall@K) to In the recall aspect, CRN outperformed across all datasets, sug- assess the ability of the model to find all the relevant cases within gesting its effectiveness in not missing relevant recommendations. the test dataset. Hit rate measures the proportion of cases where Yet, the performance of SRN1 or SRN2 is not drastically inferior. The the true item was among the top 𝐾 items in the ranked list. The potential benefits of decreased data movement in Split Learning definition is given as follows: models remain attractive alternatives, especially in larger, more 𝑁 complex architectures such as Data Mesh. In the context of implicit 1 ∑︁ data, however, with relevance scores being binary (0 for unpur- 𝐻𝑅@𝐾 = 𝑟𝑒𝑙𝑖 (3) 𝑁 𝑖=1 chased items and 1 for purchased), the prediction probability may not directly reflect the user’s latent preferences. This difficulty in 𝑁 is the total number of users, 𝑟𝑒𝑙 is an indicator function. In the accurately predicting a user’s preferences often results in a lower implicit data, 𝑟𝑒𝑙𝑖 is 1 if the actual interacted item is within the top 𝐾 𝑁 𝐷𝐶𝐺 score. This lower score does not necessarily imply poor predicted items for the 𝑖 𝑡ℎ user, and 0 otherwise. In our experiments, model performance but highlights the inherent complexity of im- the length of the recommendation list 𝐾 is set as 10. plicit feedback-based preference prediction. Neural Networks, as a For the evaluation of our ranking model’s performance, we em- parametric method, usually require abundant data points to learn ploy the 𝑁 𝐷𝐶𝐺@𝐾, a robust metric that weighs the position of the underlying pattern within the data. Therefore, we sampled relevant items within the ranked list. 𝑁 𝐷𝐶𝐺 is calculated by di- three sizes of datasets from the original data to see the stability of viding the Discounted Cumulative Gain (𝐷𝐶𝐺) of the presented model performance under the different scales of data. It turns out ranked list by the 𝐷𝐶𝐺 of the ideally ranked list (𝐼𝐷𝐶𝐺). Notably, that our solutions maintain a consistent performance pattern even within the context of our research, each user has only one item training on the small data. This result also aligns with the Data with which they have actually interacted. This item, we propose, Mesh principle of decentralized, scalable data products. should ideally occupy the premier position in the ranking. As a The results in Table 4 explain the performance dynamics of result, the value of 𝐼 𝐷𝐶𝐺𝑖 in our specific context is 1. the Centralized Fraud Neural Network (FraudNN) and the Split Fraud Neural Networks (with and without label sharing) in fraud 𝑁 𝑁 1 ∑︁ 𝐷𝐶𝐺𝑖 1 ∑︁ detection. Despite a slight decrease in precision for the minority 𝑁 𝐷𝐶𝐺@𝐾 = = 𝐷𝐶𝐺𝑖 (4) class, the Split Learning models’ overall performance indicates a 𝑁 𝑖=1 𝐼𝐷𝐶𝐺𝑖 𝑁 𝑖=1 promising direction for machine learning in a Data Mesh frame- where 𝐷𝐶𝐺𝑖 is given by: work, particularly for sensitive and critical applications such as 𝐾 fraud detection. ∑︁ 𝑟𝑒𝑙𝑖 𝑗 Overall, the results illustrate the promising potential of Split 𝐷𝐶𝐺𝑖 = (5) 𝑗=1 𝑙𝑜𝑔 2 ( 𝑗 + 1) Learning in the context of data mesh. It maintains competitive per- formance as data scales, which aligns with Data Mesh’s principle of The range of NDCG is 0 to 1. A higher score signifies a better model. decentralized, scalable data products. Furthermore, the reduced data In our research, we employ 𝑅𝑒𝑐𝑎𝑙𝑙@𝐾 as another performance movement inherent in Split Learning models can offer significant metric. This metric measures the fraction of actual interacted items advantages in Data Mesh. correctly included in the top 𝐾 recommendations provided by the model. This metric proves particularly valuable in the context of 6.5 Diversity Analysis implicit feedback datasets, where the model’s objective is to in- fer users’ interests or preferences, which have not been explicitly The second experiment builds upon the configuration of a Split indicated. The definition is given as: Neural Network with label sharing, with a focus on scaling the number of data domains. The result are shown in Figure 7 and 8. As we incrementally increased the number of data domains involved # of interacted items in top K predicted items in the Split Neural Network model, we consistently observed im- Recall@K = (6) # of interacted items provements in both binary classification metrics (Precision, Recall, In the context of fraud detection, the focus is often on identifying F1-score, and AUC-ROC score for Class 1) and recommendation the minority class - the fraudulent transactions - which typically system metrics (NDCG@10, HR@10, and Recall@10). The results represent a small fraction of the total transactions. Therefore, the infer that the autonomy and diverse characteristics of the domains accuracy is not an ideal metric as it can be misleading. We also within the structure of the data mesh, each contributing unique choose precision, recall, and the F1 score to measure the quality of and valuable information, significantly boost the model’s predictive the model. acuity. Conference acronym KDD, 25th - 29th August, 2024, Barcelona, Spain Haoyuan Li and Salman Toor Table 3: Performance Comparison of Three Models on Three Datasets H&M-small H&M-medium H&M-large Metric CRN SRN1 SRN2 CRN SRN1 SRN2 CRN SRN1 SRN2 NDCG@10 0.408 0.413 0.411 0.415 0.402 0.398 0.401 0.405 0.399 HR@10 0.905 0.910 0.913 0.912 0.903 0.898 0.901 0.897 0.887 Recall@10 0.900 0.868 0.890 0.959 0.943 0.925 0.982 0.956 0.960 Table 4: Performance Metrics of Fraud Detection Models (a) Centralized FraudNN (b) Split FraudNN with Label Sharing (c) Split FraudNN without Label Sharing Metric Class 0 Class 1 Macro Avg Metric Class 0 Class 1 Macro Avg Metric Class 0 Class 1 Macro Avg Precision 1.00 0.82 0.91 Precision 1.00 0.81 0.90 Precision 1.00 0.83 0.91 Recall 1.00 0.87 0.94 Recall 1.00 0.86 0.93 Recall 1.00 0.84 0.92 F1-score 1.00 0.82 0.92 F1-score 1.00 0.82 0.92 F1-score 1.00 0.83 0.92 Support 85295 148 85443 Support 85295 148 85443 Support 85295 148 85443 predictive performance by contributing unique and valuable infor- mation. This is because Data Mesh encourages multiple domain collaborations without revealing raw data. Moreover, the principle of distributed data ownership enables domain teams to fully harness the multifaceted data under the environment of Data Mesh. In con- clusion, this experiment underscores the capability of Split Learning in leveraging the inherent diversity and decentralization of the Data Mesh architecture. Specifically, in data-sensitive scenarios like fraud detection, Split Learning empowers the domain-specific data teams to fully harness the multifaceted data within the Data Mesh. 7 CONCLUSION In our work, we explore the integration of FL methodologies within Figure 7: Model Performance across Multiple Domains in the framework of Data Mesh. We examined different federated Recommendation System learning strategies, assessing their alignment with the architec- tural principles of Data Mesh. Additionally, we designed two con- figurations of split learning to address use cases involving both label sharing and label preservation. Our proposed methodologies empower the data domains within a data mesh to generate data products for consumers without necessitating the sharing of raw data. However, a potential risk lies in the exposure of intermediate data representations during the training phase, which could poten- tially lead to data leakage. Going forward, there should be a keen focus on safeguarding these intermediate data products. A variety of encryption techniques could be adopted to augment the security of the system.This paper aims to serve as an applied groundwork, inspiring further scholarly pursuits to explore the incorporation of federated learning within the Data Mesh paradigm, thereby advanc- ing the development of more secure and robust machine learning applications. Figure 8: Model Performance across Multiple Domains in Fraud Detection ACKNOWLEDGMENTS To the National Academic Infrastructure for Supercomputing in Sweden [NAISS(2023)] for cloud resources, eSSENCE strategic col- As we can see from the figures 7 and 8, the results infer that laboration for support, and Assistant Professor Prashant Singh for the domains within the Data Mesh are able to boost the models’ technical discussions. Empowering Data Mesh with Federated Learning Conference acronym KDD, 25th - 29th August, 2024, Barcelona, Spain REFERENCES [Perrin(2023)] Jean-Georges Perrin. 2022 (November 11, 2023). The Next Genera- [int(2023)] (August 18, 2023). Intuit. https://www.intuit.com/. tion of Data Platforms is the Data Mesh. https://medium.com/paypal-tech/ [net(2023)] (August 18, 2023). Netflix. https://netflix.com. the-next-generation-of-data-platforms-is-the-data-mesh-b7df4b825522. [pay(2023)] (August 18, 2023). PayPal. https://www.paypal.com/. [Vepakomma et al.(2018)] Praneeth Vepakomma, Otkrist Gupta, Tristan Swedish, and [zal(2023)] (August 18, 2023). zalando. https://zalando.com. Ramesh Raskar. 2018. Split learning for health: Distributed deep learning without [Baker(2023)] Tristan Baker. 2022 (September 19, 2023). Intuit’s sharing raw patient data. arXiv preprint arXiv:1812.00564 (2018). Data Mesh Strategy. https://medium.com/intuit-engineering/ [Yang et al.(2019)] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. intuits-data-mesh-strategy-778e3edaa017. Federated machine learning: Concept and applications. ACM Transactions on [Beyer and Laney(2012)] Mark A Beyer and Douglas Laney. 2012. The importance of Intelligent Systems and Technology (TIST) 10, 2 (2019), 1–19. ‘big data’: a definition. Stamford, CT: Gartner (2012), 2014–2018. [Ziller et al.(2021)] Alexander Ziller, Andrew Trask, Antonio Lopardo, Benjamin [Bode et al.(2023)] Jan Bode, Niklas Kühl, Dominik Kreuzberger, and Sebastian Hirschl. Szymkow, Bobby Wagner, Emma Bluemke, Jean-Mickael Nounahon, Jonathan 2023. Data Mesh: Motivational Factors, Challenges, and Best Practices. arXiv Passerat-Palmbach, Kritika Prakash, Nick Rose, et al. 2021. Pysyft: A library for preprint arXiv:2302.01713 (2023). easy federated learning. Federated Learning Systems: Towards Next-Generation AI [Butte and Butte(2022)] Vijay Kumar Butte and Sujata Butte. 2022. Enterprise Data (2021), 111–139. Strategy: A Decentralized Data Mesh Approach. In 2022 International Conference on Data Analytics for Business and Industry (ICDABI). IEEE, 62–66. [Ceballos et al.(2020)] Iker Ceballos, Vivek Sharma, Eduardo Mugica, Abhishek Singh, Alberto Roman, Praneeth Vepakomma, and Ramesh Raskar. 2020. SplitNN-driven Vertical Partitioning. arXiv:2008.04137 [cs.LG] [Chawla et al.(2002)] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research 16 (2002), 321–357. [Databricks(2023)] Databricks. 2020 (August 18, 2023). Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes Beyond the Data Lake. https: //www.youtube.com/watch?v=eiUhV56uVUc. [Dehghani(2023b)] Zhamak Dehghani. 2019 (accessed June 5, 2023)b. How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. https://martinfowler.com/articles/data\protect\discretionary{\char\hyphenchar\ font}{}{}monolith-to-mesh.html. [Dehghani(2023a)] Zhamak Dehghani. 2020 (accessed March 3, 2023)a. Data Mesh Principles and Logical Architecture. https://martinfowler.com/ articles/data\protect\discretionary{\char\hyphenchar\font}{}{}mesh\protect\ discretionary{\char\hyphenchar\font}{}{}principles.html. [Goedegebuure et al.(2023)] Abel Goedegebuure, Indika Kumara, Stefan Driessen, Dario Di Nucci, Geert Monsieur, Willem-jan van den Heuvel, and Damian An- drew Tamburri. 2023. Data Mesh: a Systematic Gray Literature Review. arXiv preprint arXiv:2304.01062 (2023). [Gupta and Raskar(2018)] Otkrist Gupta and Ramesh Raskar. 2018. Distributed learn- ing of deep neural network over multiple agents. Journal of Network and Computer Applications 116 (2018), 1–8. [He et al.(2017)] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web. 173–182. [Kaggle(2023b)] Kaggle. 2018 (accessed March 3, 2023)b. Machine Learning Group - ULB, "Credit Card Fraud Detection. https://www.kaggle.com/datasets/mlg-ulb/ creditcardfraud. [Kaggle(2023a)] Kaggle. 2022 (accessed March 12, 2023)a. H&M Personalized Fashion Recommendations. https://www.kaggle.com/competitions/h\protect\ discretionary{\char\hyphenchar\font}{}{}and\protect\discretionary{\char\ hyphenchar\font}{}{}m\protect\discretionary{\char\hyphenchar\ font}{}{}personalized\protect\discretionary{\char\hyphenchar\font}{}{}fashion\ protect\discretionary{\char\hyphenchar\font}{}{}recommendations/overview. [Khine and Wang(2018)] Pwint Phyu Khine and Zhao Shun Wang. 2018. Data lake: a new ideology in big data era. In ITM web of conferences, Vol. 17. EDP Sciences, 03025. [Machado et al.(2021)] Inês Machado, Carlos Costa, and Maribel Yasmina Santos. 2021. Data-driven information systems: the data mesh paradigm shift. (2021). [Machado et al.(2022)] Inês Araújo Machado, Carlos Costa, and Maribel Yasmina San- tos. 2022. Data Mesh: Concepts and Principles of a Paradigm Shift in Data Ar- chitectures. Procedia Computer Science 196 (2022), 263–271. https://doi.org/10. 1016/j.procs.2021.12.013 International Conference on ENTERprise Information Systems / ProjMAN - International Conference on Project MANagement / HCist - International Conference on Health and Social Care Information Systems and Technologies 2021. [Mikolov et al.(2013)] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems 26 (2013). [NAISS(2023)] NAISS. 2018 (accessed March 3, 2023). National Academic Infrastruc- ture for Supercomputing. https://www.naiss.se/. [Netflix(2023c)] Netflix. 2020 (August 18, 2023)c. Netflix Data Mesh: Composable Data Processing. https://www.youtube.com/watch?v=TO_IiN06jJ4. [Netflix(2023b)] Netflix. 2021 (August 18, 2023)b. Data Movement in Netflix Studio via Data Mesh. https://netflixtechblog.com/ data-movement-in-netflix-studio-via-data-mesh-3fddcceb1059. [Netflix(2023a)] Netflix. 2022 (August 18, 2023)a. Data Mesh: A Data Movement and Processing Platform. https://netflixtechblog.com/ data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873.