Data Analytics Architectures on the Continuum: a performance comparison study
SERGIO LASO, Global Process and Product Improvement S.L., Spain
JAVIER BERROCAL, University of Extremadura, Spain
PABLO FERNÁNDEZ, University of Sevilla, Spain
ANTONIO RUIZ-CORTÉS, University of Sevilla, Spain
JUAN MANUEL MURILLO, University of Extremadura, Spain
SCHAHRAM DUSTDAR, TU Wien, Austria

Over the last few years, we have experienced a growing avalanche of information generated by devices connected to the Internet, derived from the widespread acceptance of technology in society. These devices have enabled the development of IoT applications that facilitate and improve the quality of life of people in different areas such as industry or healthcare. As applications advance, their complexity increases due to the interconnection with more devices and the data they generate. This places greater demands on the architecture responsible for transferring and processing information, which can have a negative impact on performance. This implies that Quality of Service (QoS) has to be managed with precision. Therefore, it is essential to select a suitable architecture for data processing and analysis that meets QoS requirements such as response time or latency. In this paper, we present a set of guidelines for the evaluation and comparison of analytics performance between two architecture alternatives: Cloud Computing and the emerging paradigm of the Computing continuum. To do so, we consider parameters that influence QoS to evaluate and compare these architectures; as validation, we provide a case study where the suitability of each architecture can be observed and detected depending on the application context. Based on the empirical analysis developed, we find that Cloud computing excels in environments with smaller data sets and limited devices; conversely, the Computing continuum proves to be superior in scenarios with larger data sets and numerous end devices.

Additional Key Words and Phrases: Cloud Computing, Computing continuum, Performance comparison, Evaluation, Guidelines

ACM Reference Format: Sergio Laso, Javier Berrocal, Pablo Fernández, Antonio Ruiz-Cortés, Juan Manuel Murillo, and Schahram Dustdar. 2024. Data Analytics Architectures on the Continuum: a performance comparison study. ACM Trans. Internet Technol. X, X, Article X (X 2024), 20 pages. https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION

During the last few years, technology has brought with it an avalanche of data generated by Internet-connected devices, owing to their wide acceptance in society for different purposes, with more than 25 billion of these devices expected by 2025. This avalanche of data has been favored by the development of the Internet of Things (IoT) paradigm, which has facilitated the integration of these devices into the Internet and has increased their computing and storage capabilities, used for different purposes ranging from the control of smart homes to the management and control of our health. All these Internet-connected devices generate a large flow of information, which is expected to increase in the coming years.

Furthermore, as IoT applications evolve in complexity, they need to connect devices from different areas to provide more valuable functions, resulting in greater demands on the architecture to transfer and process information. These applications will require appropriate analytics architectures to constantly monitor the environment, examining and analyzing datasets to discover patterns, identify trends, extract valuable information, and make informed decisions on actions to adapt the environment to users' needs and preferences; such analytics have proved to be useful in innovative scenarios such as efficiently managing a smart city, Industry 4.0, or monitoring people's health. In these scenarios (especially when many IoT devices are involved that generate a lot of information), it is important to manage QoS with precision. For example, a smart city requires a system to control traffic flows with strict, high freshness (i.e., the data is as up to date as possible) so that emergency systems can act as quickly as possible on any problem. If the QoS requirements are not met, the system will be completely inefficient. Consequently, the increase in connected devices, along with the strict QoS of IoT applications, can have a negative impact on the architecture due to the circulation and processing of all the information. Therefore, it is very important to select a suitable analytics architecture for data processing and analysis that meets the QoS requirements.

Currently, we can define mainly two types of analytics architectures to be used for this purpose. On the one hand, we can find a more centralized architecture, typically called Cloud computing [5, 22]. Cloud computing has revolutionized how information is generated and consumed in recent years. This technology allows companies to store and process data in the cloud, reducing infrastructure costs and providing better scalability, fault tolerance, and greater control. However, the centralization of a large volume of data generated by many information sources could negatively affect the quality of service (QoS), as in the previous example, increasing latency and response times. It would also increase operational costs due to the higher volume of data to store or process (among other aspects), which will make it even more challenging to achieve the QoS needed for IoT applications to be beneficial and accepted by end users.
On the other hand, during the last few years, a new paradigm called the Computing continuum has emerged ; this paradigm is an extension of Cloud computing that is propagated from the cloud to everyday devices (such as smartphones or IoT devices) that are closer to people to store and process information. The Fog, Edge, and Mist computing paradigms have brought Cloud computing environments and information processing closer to data sources. By processing information closer to its source, it is possible to reduce the load on the infrastructure and improve the QoS. This alternative might be more appropriate for the above smart city example. Despite the advantages offered by the distributed approach in data analysis architectures, it is also essential to consider the drawbacks that can arise. Firstly, the limitations of the devices used in Fog, Edge, and Mist environments. These devices tend to have more limited resources compared to a cloud node, which can limit the analysis of complex or high-volume data on these nodes. Secondly, managing and maintaining a distributed environment with multiple dispersed nodes can be complicated. Constant monitoring is needed to avoid overload and troubleshoot problems that may arise at different points in the network. Therefore, it is essential to identify which type of architecture is the most suitable to perform data analytics and take advantage of their benefits to offer a good QoS in the different conditions presented by the infrastructure. In some cases, depending on the characteristics of the context, it may be more convenient to use a specific deployment architecture. Manuscript submitted to ACM Data Analytics Architectures on the Continuum: a performance comparison study 3 For this reason, it is essential to analyze both options and detect the limits of each architecture with respect to QoS in order to determine when it is more appropriate to use one or the other. At present, we lack guidelines to analyze when one architecture is more beneficial than another depending on the context. In this paper, we define guidelines to evaluate and compare the performance of two types of architecture for data analytics: Cloud computing and Computing continuum. We propose different parameters that influence QoS in order to determine which architecture is more beneficial depending on the context and application characteristics. We also present a comparative performance study on a proposed case study where Cloud computing is more appropriate for smaller datasets of data and a more limited number of end devices involved, while Computing continuum is more appropriate for larger datasets and a greater number of end devices. The rest of the paper is structured as follows. Section 2 describes the motivations for this work. Section 3 explains the guidelines for performance evaluation between architectures and the parameters involved. Section 4 validates the guidelines through a case study. Section 5 presents the related works. Finally, Section 6 presents the conclusions and future work. 2 MOTIVATION The planet is witnessing massive growth in the amount of information being generated, captured, copied, and consumed due to the popularization of smart devices and the expansion of the Internet of Things. This phenomenon has led to an exponential increase in the amount of data produced, which is known as Big Data. By 2023, the amount of data globally is estimated to reach a staggering 97 zettabytes (ZB), and by 2025 this amount is expected to increase to 181 ZB. 
Big data is transforming society with its impact in several critical sectors, from healthcare or science to finance and business. A significant challenge for researchers and professionals is that this growth rate outpaces their ability to design appropriate computing infrastructures for data analysis and optimize intensive workloads, proposals optimization of resources, etc. Selecting an appropriate infrastructure for data analytics is critical because of the QoS implications of IoT applications and the stringent requirements for many of them. Obtaining useful information from large amounts of data requires an analytics infrastructure to produce timely results. The information may come from different data sources and must be processed and compared with historical information within a certain period. Such data sources may be located in different locations and contain other formats, making integrating multiple sources for analysis a complex task. Cloud computing is one of the most widely used paradigms for performing complex and large-scale computing. The advantages of cloud computing include virtualized resources, parallel processing, security, and integration of data services with scalable data storage. Cloud computing can minimize the cost and constraint of automation and informatization by individuals and enterprises and provide reduced infrastructure maintenance costs and efficient management. In addition, there are different platforms that offer a variety of frameworks and tools to enable simple and intuitive development. As a result of these advantages, many applications have been developed that leverage various cloud platforms, resulting in a tremendous increase in the scale of data generated and consumed by these applications. Nevertheless, the great multitude of Internet-connected devices can significantly impact the infrastructure. In terms of the number of Internet-connected devices to the Internet, is forecast to reach 25.1 billion by 2030, which is more than triple what it was in 2019. This massive growth in devices contributes significantly to this data explosion. This would harm the QoS, leading to increased latency and response times. In addition, it would cause an increase in Manuscript submitted to ACM 4 S. Laso et al. operating expenses due to the higher volume of data to be stored, processed, and so on, which would make it even more challenging to achieve the desired QoS. In order to solve these problems, there are proposals for distributed information processing along the Computing continuum, which extends from the Cloud through Fog, Edge nodes, and end devices, allowing distributed data processing and reducing the amount of data transmitted to the cloud. Thanks to these approaches, Cloud computing environments, and data processing have been brought closer to the data sources; by this approach, the infrastructure load is relieved by distributing it among the different layers, which in turn leads to a significant improvement in the QoS offered. However, this distributed approach can also have drawbacks. Devices in Fog, Edge, and Mist environments often have limited resources, such as storage capacity, processing power, and energy. On the other hand, managing and maintaining a distributed environment with multiple dispersed nodes can be complex. It requires constant monitoring to avoid overload and troubleshooting problems at various points in the network, which can increase operational complexity. 
Therefore, it is necessary to identify when to use one architecture or the other. To better visualize the problems both architectures face when performing data analytics, we illustrate them below with a case study: a smart city has implemented an application to obtain heatmaps of its citizens; this application allows citizens to share their heatmaps through their mobile devices so that different city institutions can analyze the data to support their decision-making processes. The city council wants to analyze the heatmaps to obtain long-term movement patterns to plan future infrastructure investments, for example, using these data to identify areas of the city that are experiencing higher population growth and therefore require greater investment in public transportation and accessibility. On the other hand, the traffic system wants to use this data to identify crowded areas and issue alerts to citizens to avoid those areas and manage the traffic, i.e., limit the number of vehicles. The transport system also wants to know this information to control the flow of passengers at urban bus stops and increase the frequency of a line or add new buses.

In order to identify the right architecture, there are different trade-offs to be considered. On the one hand, suppose that the application is not being heavily used because it is only used by one institution or there are not many active users; in that case, it may be more beneficial to choose a Cloud computing architecture. In such a case, this architectural decision could imply less complexity to manage and lower running expenses, while supporting a medium-low computational load and providing an acceptable QoS. On the other hand, when multiple institutions require analytics very frequently and there are many active users on the infrastructure, centralizing all analytics in the cloud can slow down the system. Hence, a distributed architecture where data analytics is spread across different nodes helps alleviate the cloud, obtaining a better QoS.

These examples show the need to know the limits of each architectural approach in order to choose one of them at the right time. Specifically, we need to determine which parameters are the most influential and carry out a performance study to know which architecture is the most appropriate depending on the context and characteristics of the infrastructure. This paper presents a performance analysis of the proposed case study. To do so, we propose guidelines with different parameters that influence QoS to perform the tests in a simulated testbed, which allows the QoS limits of each architecture to be detected so that the application can later be reconfigured and deployed on the most suitable architecture.

3 A PERFORMANCE DATA ANALYTICS FRAMEWORK

Figure 1 presents the target data analytics scenario, in which different kinds of analytics are launched by consumers on data generated by a set of end devices, such as cell phones or IoT sensors.

Fig. 1. Target data analytics scenario

The flow of data begins with the launch of analytics by consumers. The data analytics may vary in complexity and in the objectives they pursue, as mentioned in the case study (active users and freshness of data). These analytics are sent to the analytics architecture, represented by a green rectangle, which is configured with the appropriate parameters to carry out the analysis tasks. Once the data is collected from the end devices, it enters the analysis process within the architecture, and the results are subsequently sent to the consumers. In this article, the Cloud computing and Computing continuum architectures are integrated into this structure, with the objective of evaluating them and determining at which moments, depending on the set of analytics sent, each of them offers a better QoS. Throughout this section, the different aspects related to the architectures that make up the generic environment and the features with which data analytics can be customized are discussed in depth: the architectures that will later be used for evaluation and comparison are first detailed in Subsections 3.1 and 3.2. Finally, in Subsection 3.3, the different parameters that can be defined in data analytics are presented in detail.

3.1 Cloud computing

The Cloud computing architecture consists of a single node located in the cloud, as shown in Figure 2. This centralized node is responsible for managing connected end devices, such as mobile devices or IoT devices. In this architecture, data processing is performed both on the end devices and on the cloud node. Applied to the case study, all heatmaps generated by mobile devices are sent to a centralized node in the cloud. The cloud node is responsible for aggregating the data from all users and generating the global heatmap of the selected area. This centralized approach offers advantages in terms of simplicity and ease of implementation, as only one node in the cloud is required to perform the aggregation. However, it can also present challenges in terms of scalability and the ability to handle large volumes of data in real time.

3.2 Computing continuum

The Computing continuum architecture is composed of a node in the cloud and a set of fog nodes that can be organized into several layers, as shown in Figure 2. Although the figure shows only three fog nodes in a single layer, in other cases we can consider a more complex topology with multiple layers and fog nodes. The end devices are managed by the fog nodes that are closest or lower in the architecture hierarchy. In this architecture, data processing is performed from the end devices up to the top node in the hierarchy. Each Fog node may perform partial processing of the data before sending it to the cloud node for final processing. For the case study, the aggregation of heatmaps is decentralized. Each mobile device sends its heatmap to harvesting nodes (in this case, the fog nodes) in the infrastructure, which in turn perform partial aggregation of the received data. Higher-level nodes combine and aggregate the data from the lower-level nodes to obtain the overall heatmap. This distributed approach leverages the processing and storage capacity of mobile devices and infrastructure nodes, reducing the need to transfer large volumes of data to a central node. However, it may require greater complexity in the implementation and coordination of the nodes.

Fig. 2. Cloud computing and Computing continuum architectures

3.3 Data Analytics parameters

The execution of data analytics involves carefully considering the relationship between three critical dimensions: data quality, resource consumption, and computational cost. These dimensions are closely interconnected and directly affect the quality of service (QoS).
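Before detailing these parameters, the contrast between the two aggregation styles just described in Subsections 3.1 and 3.2 can be made concrete with a short sketch. The Python snippet below is only an illustration under our own assumptions (heatmaps modeled as dictionaries of per-cell counts, a single flat fog layer); the function names are hypothetical and do not come from the authors' implementation.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Cell = Tuple[int, int]      # (row, col) of a grid cell over the city
Heatmap = Dict[Cell, int]   # visit counts per cell, as produced by one end device

def merge(heatmaps: Iterable[Heatmap]) -> Heatmap:
    """Aggregate several heatmaps by summing the count of each cell."""
    total: Counter = Counter()
    for hm in heatmaps:
        total.update(hm)
    return dict(total)

def cloud_aggregate(device_heatmaps: List[Heatmap]) -> Heatmap:
    """Cloud computing (Section 3.1): every end device sends its heatmap to the
    single cloud node, which performs the whole aggregation."""
    return merge(device_heatmaps)

def continuum_aggregate(devices_per_fog_node: List[List[Heatmap]]) -> Heatmap:
    """Computing continuum (Section 3.2): each fog (harvesting) node partially
    aggregates the heatmaps of the devices it manages; the cloud node then only
    merges the partial results."""
    partial_results = [merge(device_heatmaps) for device_heatmaps in devices_per_fog_node]
    return merge(partial_results)
```

The key difference is what travels to the cloud node: every individual heatmap in the centralized case versus one partial aggregate per fog node in the continuum case, which is the reduction in transferred data volume discussed above.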
Data quality is a fundamental factor that determines the accuracy and reliability of the results of data analytics. A more representative sample will provide more accurate and valuable results for decision-making. However, increased data quality leads to increased consumption of computing resources. As the representative sample or data quality increases, the workload and processing required to perform the analytics also increase. This can negatively affect QoS, as slower response times or increased latency in processing requests may be experienced. Manuscript submitted to ACM Data Analytics Architectures on the Continuum: a performance comparison study 7 Likewise, computational cost is also affected by data quality. As higher data quality is sought, more powerful and sophisticated computing resources are likely to be required to handle the additional workload. This results in an increase in computational cost, as the use of additional resources can mean higher infrastructure, energy, and maintenance costs. To address these interconnections and provide good QoS, three parameters, Alpha, Beta, and Gamma, are proposed, which can be defined in data analytics depending on the objectives and requirements of the analysis to define its data quality. The parameterization of these three aspects will make it possible to find the limits of each architectural approach, finding the right balance to ensure good QoS. 3.3.1 Alpha 𝛼. Alpha is expressed as a percentage and determines the use of a greater or lesser number of harvesting nodes, i.e. the nodes that directly collect data from the end devices. In a Cloud computing architecture, it has no effect since it is only composed of a single node, so it will always be selected. However, a Computing continuum architecture can comprise several layers of nodes to process. The selection of these harvesting nodes allows, on the one hand, to adjust the quality of the final analytical data to be obtained and, on the other hand, to balance the workload in the architecture, avoiding excessive overload on specific nodes and improving processing efficiency. In an architecture with multiple layers, the selection of these nodes, in consequence, selects the nodes that are in the intermediate layers so that the information finally reaches the top node. By adjusting the Alpha parameter, we can control the number of harvesting nodes used concerning the total number of harvesting nodes in the bottom layer (i.e. the one closest to the end devices). A lower Alpha value indicates a lower percentage of harvesting nodes, which implies that fewer harvesting nodes will be used. On the other hand, a higher Alpha value indicates a higher number of harvesting nodes relative to the total number of them. The choice of the Alpha percentage in a Computing continuum architecture affects several factors. A low Alpha value may be more appropriate when it is desired to minimize resources or when the analysis to be performed is very specific, for example when the traffic system wants to monitor traffic in a specific area of the city. On the other hand, a high Alpha value may be preferable when wider coverage and higher accuracy in data collection are needed, for example when the municipality wants to generate movement patterns to study different investments. 3.3.2 Beta 𝛽. Beta is expressed as a percentage and determines the number of end devices (i.e. end devices sample) that will participate in the analysis run. 
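Before describing Beta in more detail, the following sketch shows one way the Alpha and Beta percentages could be turned into a concrete selection of harvesting nodes and end devices. It is a minimal illustration under our own assumptions (a single fog layer with devices already assigned to nodes, random sampling); the helper name select_resources is ours and not part of the paper.

```python
import math
import random
from typing import Dict, List, Sequence

def select_resources(
    harvesting_nodes: Sequence[str],
    devices_per_node: Dict[str, List[str]],
    alpha: float,   # fraction of harvesting nodes involved, e.g. 0.66
    beta: float,    # fraction of end devices involved at each selected node
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Return, for each selected harvesting node, the end devices taking part."""
    rng = random.Random(seed)
    n_nodes = max(1, math.ceil(alpha * len(harvesting_nodes)))  # at least one node
    selected_nodes = rng.sample(list(harvesting_nodes), n_nodes)
    selection: Dict[str, List[str]] = {}
    for node in selected_nodes:
        devices = devices_per_node[node]
        n_devices = max(1, math.ceil(beta * len(devices)))
        selection[node] = rng.sample(devices, n_devices)
    return selection

# In the Cloud computing architecture there is a single harvesting node, so
# Alpha has no effect; in a three-node continuum, alpha=0.66 selects 2 nodes.
fog_nodes = ["fog-1", "fog-2", "fog-3"]
devices = {n: [f"{n}-dev-{i}" for i in range(3)] for n in fog_nodes}
print(select_resources(fog_nodes, devices, alpha=0.66, beta=1.0))
```

With alpha set to 1.0 and beta to 0.5, the same helper reproduces the second of the example scenarios discussed later in this section (all fog nodes, half of the devices at each).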
The Beta parameter controls the number of end devices that are involved in the analysis execution at each harvesting node. By adjusting the Beta parameter, the proportion of end devices that will contribute to data processing and analysis at each node is determined. A lower Beta value indicates that only a small percentage of end devices will participate in the analysis run at each harvesting node, while a higher Beta value indicates that a higher percentage of end devices will participate.

The choice of the Beta parameter affects several factors, such as the processing capacity and current workload of the end devices, the sample of end devices involved, and the objectives of the analysis. A low value may be more appropriate when the workload of each device should be limited because of the load it already has, or when the analytics do not require a large representative sample; for example, when the city council wants to generate the movement patterns in one area of the city, it does not need to obtain them from all the end devices. On the other hand, a high value may be preferable when one wants to take full advantage of the capacity of the end devices or to obtain a fairly representative sample of the data; for example, when the transport system wants to control the flow of passengers in the whole city, it needs a large sample to be accurate in the management of the bus lines.

3.3.3 Gamma 𝛾. Gamma defines the domain-specific filters for the data required. In this context, Gamma could represent one or more parameters that allow the analysis to be customized according to the particular requirements and objectives of each case. For example, if it is desired to analyze the mobility patterns of users in a certain city, a set of parameters can be defined in Gamma that delimit the geographic area of interest and the time period in which the data is collected. These parameters will influence the amount of data to be transferred and analyzed during the process. To illustrate an example of such parameters, we can consider the domain of smart-city mobility, where Gamma could be defined as latitude, longitude, radius (in meters), and period (begin date and end date).

To illustrate Alpha and Beta, two different analysis scenarios are shown below. These scenarios are only intended to illustrate how the combination of the Alpha and Beta parameters can affect the distribution of the analytics and thus the results obtained.

Example Scenario 1: 66% 𝛼 and 100% 𝛽. In this scenario, the analytic uses 66% of the harvesting nodes to obtain the information from the end devices, and 100% of the end devices at each selected harvesting node participate in the analysis execution.
The following shows how the selection of resources is carried out in both architectures, as illustrated in Figure 3.
– 𝛼 = 66% of nodes: 1 node for Cloud computing and 2 nodes for Computing continuum.
– 𝛽 = 100%: select all end devices of the selected nodes.

Fig. 3. Example Scenario 1. 66% 𝛼 and 100% 𝛽.

Example Scenario 2: 100% 𝛼 and 50% 𝛽. In this scenario, the analytic uses all the harvesting nodes to obtain the information from the end devices, and 50% of the end devices at each harvesting node participate in the analysis execution. The following shows how the selection of resources is carried out in both architectures, as illustrated in Figure 4.
– 𝛼 = 100% of nodes: 1 node for Cloud computing and 3 nodes for Computing continuum.
– 𝛽 = 50%: select half of the end devices of the selected nodes.

Fig. 4. Example Scenario 2. 100% 𝛼 and 50% 𝛽.

In the following sections of the article, a thorough evaluation of both architectures is conducted to determine which is more suitable in this particular context. This evaluation helps to better understand the advantages and limitations of each architecture and to determine in which cases it is more appropriate to use one or the other.

4 RESULTS AND DISCUSSION

This section presents the results obtained by comparing the Computing continuum and Cloud computing architectures in the defined case study with the configurations explained above. These results provide a clear view of how the different configurations of Alpha, Beta, and Gamma influence the response times, making it possible to detect with which configurations one architecture is better than the other. Firstly, Subsection 4.1 provides detailed information on the configuration and hardware resources used for the evaluation, and Subsection 4.2 explains the set of tests that will be launched. Secondly, the results obtained from the case study are presented. These results provide a visual representation of how the percentages of harvesting nodes (Alpha) in Subsection 4.3 and participating end devices (Beta) in Subsection 4.4 vary as a function of Gamma and, as a result, the response time obtained for each evaluated configuration. These response times represent the time required to complete the data analysis in each case. Finally, Subsection 4.5 identifies the analytics parameter settings that provide the best results in terms of response time for each of the architectures in the evaluated case study.

Table 1. Scenario setup summary.
              Cloud computing   Computing continuum
Nodes         1                 4
RAM           4GB               1GB
vCPU          2                 1
End Devices   135               135

4.1 Setup

In this section, we describe the characteristics of both architectures. The components and topology of the two architectures and the hardware capacities used, i.e. the computational resources available in each component of the architecture, are specified. As can be observed in Table 1, the Cloud computing architecture consists of a single node deployed in the cloud. This cloud node has specific hardware requirements consisting of 4GB of RAM and 2 vCPUs, in short, an AWS EC2 T2.medium instance. These resources allow the cloud node to have enough capacity to process and store the data sent by the end devices and to perform heatmap generation. On the other hand, the Computing continuum architecture has a different topology.
This architecture includes one node deployed in the Cloud and three nodes deployed in the Fog. Both the node in the Cloud and in the Fog have similar hardware requirements with 1GB of RAM and 1 vCPU, in short, equivalent to an EC2 instance of AWS T2.micro. As for the end devices/uses, in both Cloud computing and Computing continuum architectures, 135 Android virtual devices deployed with the Perses framework are used. These virtual devices have specific hardware requirements of 6GB RAM and 3 CPUs. These resources are necessary for each device to perform its own heatmap computation and send it to the corresponding node for further aggregation. The selection of these hardware requirements is intended to ensure that, in the overall computation, the architectures have a similar load in terms of hardware resources. This allows the tests to be as fair as possible, despite having different topologies. Maintaining fairness in the hardware requirements ensures that the test results obtained more accurately reflect the inherent advantages and disadvantages of each architecture. 4.2 Testbed The testbed to be performed in the case study to evaluate and compare the Cloud computing and Computing continuum architectures is described. First, the characteristics of the Alpha, Beta, and Gamma parameters and their variability for each of the data analytics are detailed: Alpha has three possible values: 33%, 66%, and 100%. These values represent the proportion of harvesting nodes participating in heatmap processing. For example in the case study, a value of 33% in the Computing continuum architecture means that only 1 of the 3 harvesting nodes will contribute to the generation of the heatmap, while the rest are excluded. Beta also has three possible values: 33%, 66%, and 100%. These values represent the proportion of users participating in the generation of the heatmaps. For example in the case study, a value of 33% means that only 33% of the users of each harvesting node will contribute to the generation of the heatmap, while the rest of the users are excluded. Gamma parameter consists of the specific parameters needed to obtain heatmaps of a specific area; latitude, longitude, and radius. With these parameters, we have generated the necessary request to obtain the small (S1), Manuscript submitted to ACM Data Analytics Architectures on the Continuum: a performance comparison study 11 medium (S2), and large (S3) data volumes. Each configuration represents a different dataset of the JSON files with the location of the citizens that generate the heatmaps. The location of the citizens has been generated by The One simulating the movement of citizens in the city of Sevilla (Spain). S1 represents a small volume of data, 60Mb i.e. it takes into account citizens located around a 1000 m radius from the city center. S2 represents a medium volume of data, 320Mb i.e. it takes into account citizens located around a 3000 m radius from the city center. Finally, S3 represents a large volume of data, 540Mb i.e. it takes into account citizens located around a 5000 m radius from the city center. Taking into account the different parameter values, 270 analytics have been generated for the whole experiment covering the different combinations between analytical parameters and divided into nine different runs in order not to overload the infrastructure. Each run is repeated three times. This implies that three repetitions of all analytics are carried out. 
The repetition of the scenarios helps to obtain more consistent results and to evaluate the stability of the results obtained. In each run, ten simultaneous analytics are performed over a period of three minutes. Once the three minutes have elapsed, the ten analytics are completed and the next ten are run. After initial tests, 10 simultaneous analytics was the limit in order to load the processing nodes to the limit and not overload it. The following subsections detail how they are composed and how to interpret the results will be presented. For all graphs presented, the following standard interpretation will be followed: X-axis: Represents the sets provided by Gamma, identified as S1, S2, and S3. Y-axis: Represents the evolution of Alpha or Beta as a function of the graph. Z-axis: Represents response time in milliseconds. Blue plane: Represents the Computing continuum architecture. Red plane: Represents the Cloud computing architecture. 4.3 Alpha evaluation In this subsection, Figures 5, 6 and 7 are presented, where the Alpha parameter is set at 33%, 66%, and 100% respectively. The evolution of Beta is interpreted in terms of the number of end devices that are available to provide data to the architecture. In this context, Beta represents the percentage of end devices involved in the data analytics execution at each harvesting node. The higher the value of Gamma, the higher the number of end devices involved in data analytics. In Figure 5, for S1 and S2 of Gamma, it is observed that initially, with a low value of Beta (i.e., few participating end devices), the Cloud computing architecture achieves better response times compared to the Computing continuum architecture. This can be attributed to the fact that in the Cloud computing architecture, a single node manages and communicates with all end devices centrally, which allows for more efficient processing. However, as the Beta value increases, i.e., more end devices participate in the analysis execution, the Cloud computing architecture starts to degrade and gets worse response times than the Computing continuum architecture. This is because the Computing continuum architecture distributes the processing load among the different Fog nodes, which avoids a significant degradation in response times as the number of end devices involved increases. For S3 Gamma, it is observed that the Computing continuum architecture always obtains better response times in all three Alpha scenarios. This could be attributed to the nature of the dataset in question. Since the S3 Gamma involves a large volume of data, the Cloud computing architecture, which has a single node to manage and process all the data, may face limitations in terms of capacity and performance. On the other hand, the Computing continuum architecture Manuscript submitted to ACM 12 S. Laso et al. Fig. 5. Alpha results at 33% fixed. allows the processing load to be distributed among the different Fog nodes, which gives it an advantage in efficiently processing large volumes of data. Fig. 6. Alpha results at 66% fixed. In Figure 6, by increasing the Alpha value set to 66%, some changes are observed compared to Figure 5. For S1, it can be observed that, in this configuration, the Cloud computing architecture obtains better response times for all Beta Manuscript submitted to ACM Data Analytics Architectures on the Continuum: a performance comparison study 13 values. 
This is due to the increase in Alpha, i.e., each analytic will require processing on more harvesting nodes in the Computing continuum architecture. As a result, nodes may be more loaded and their processing capacity may be more limited, which affects response times. On the other hand, in the Cloud computing architecture, this effect does not occur and their response times remain more stable. For S2, it is observed that for low values of Beta but slightly higher compared to Figure 5, the Cloud computing archi- tecture shows better response times. However, for medium and high values of Beta, the Cloud computing architecture experiences more degradation in response times compared to the Computing continuum architecture. This is due to the distribution of processing across the Fog nodes of the Computing continuum architecture, which allows performance to not degrade as much as the number of end devices involved increases. For S3, as in Figure 5, it is observed that the Computing continuum architecture always obtains better response times for all Beta values. However, the difference between the architectures has decreased compared to Figure 5, which is reflected visually in that the planes are closer due to the increase of Alpha, slightly affecting the performance in the Computing continuum architecture. Fig. 7. Alpha results at 100% fixed. In Figure 7, where the highest possible value for Alpha is set, for S1, as in the previous Figures, it is observed that the Cloud computing architecture obtains better response times for all Beta values. For S2 it can be observed that the Cloud computing architecture shows better performance for low and medium values of Beta. However, as Beta increases, the Cloud computing architecture experiences more degradation in response times compared to the Computing continuum architecture. For S3, as in Figures 5 and 6, the Computing continuum architecture continues to show better response times for all Beta values. However, it can be visually observed that the difference between the architectures has further decreased compared to Figure 6, due to the increase of Alpha to the maximum value. Manuscript submitted to ACM 14 S. Laso et al. 4.4 Beta evaluation In this subsection, Figures 8, 9 and 10 are presented, where the Beta parameter is set to 33%, 66%, and 100% respectively. These plots will allow us to observe the evolution of Alpha on the Y-axis in relation to the Gamma (S1, S2, and S3) on the X-axis, as well as the response times on the Z-axis. The evolution of Alpha is interpreted in terms of the number of harvesting nodes that will require the analytics to be processed. In this context, Alpha represents the percentage of harvesting nodes involved in the execution of the data analytics. Each analytic will require more nodes for processing, so each node will have a larger set of analytics to process. In Cloud computing architecture, it will not involve any change as it has only one node. Fig. 8. Beta results at 33% fixed. In Figure 8, the following can be observed for S1; the Cloud computing architecture shows better response times compared to the Computing continuum architecture throughout the Alpha evolution. However, these differences are minimal and the two architectures offer very similar response times. Since the volume of data is low and the number of end devices is small, the Cloud computing architecture can easily handle analytics requests. Its single cloud node is sufficient to efficiently manage and process the data coming from the end devices. 
For S2, the Cloud computing architecture also shows better response times throughout the Alpha evolution compared to the Computing continuum architecture. This is likely also due to the ability of the Cloud computing architecture to efficiently manage and process data from a low percentage of end devices. For S3, the Computing continuum architecture shows better response times for all Alpha values. This is mainly due to the data volume. With a large volume of data, the Computing continuum architecture benefits from the distribution of the processing load between the different fog nodes. However, we can observe that the difference between the two is very small. It is also observed that despite increasing Alpha, the degradation of the Computing continuum architecture degrades more than Cloud computing but still obtains better values. Manuscript submitted to ACM Data Analytics Architectures on the Continuum: a performance comparison study 15 Fig. 9. Beta results at 66% fixed. In Figure 9, increasing the Beta value set to 66%, for S1, the Cloud computing architecture shows better response times compared to the Computing continuum architecture throughout the Alpha evolution. As in Figure 8, we can observe that the difference between the two architectures is minimal. For S2, for low and medium Alpha values, the Computing continuum architecture performs better. This is mainly due to the increase in data volume and the number of end devices involved in this dataset with respect to Figure 8. With higher data volume and more end devices, the layout of the Computing continuum architecture allows it to better handle this increased workload. However, as Alpha increases, there is a change in the results and the Cloud computing architecture starts to get better response times. This is due to the negative effect of increasing Alpha on the Computing continuum architecture. With higher Alpha values, analytics require processing on more harvesting nodes, which makes each Fog node more loaded. As a result, the performance of the Computing continuum architecture starts to degrade while in the Cloud computing architecture, it has no effect. For S3, as was the case in Figure 8, the Computing continuum architecture shows better response times compared to the Cloud computing architecture for all Alpha values, in this case, a bit more evident. This is mainly due to the volume of data and the number of end devices involved. With a large volume of data and end devices, the Computing continuum architecture benefits from the distribution of the processing load between the different fog nodes. In Figure 10, where a large number of end devices are involved, an interesting behavior is observed. For S1, with a low Alpha value, the Computing continuum architecture obtains times almost identical to the Cloud computing architecture. This may be due to the Computing continuum architecture’s ability to efficiently manage many end devices. However, as the value of Alpha increases, the Computing continuum architecture starts to show worse response times compared Manuscript submitted to ACM 16 S. Laso et al. Fig. 10. Beta results at 100% fixed. to the Cloud computing architecture. This change can be explained by the fact that as Alpha increases, the fog nodes in the Computing continuum architecture can become saturated and experience degraded response times. 
For S2, it can be observed that due to the management of all end devices and a higher volume of data, the Cloud computing architecture suffers a significant degradation in response times, while the Computing continuum architecture remains more stable due to the efficient management thanks to the distribution of the processing load among the different Fog nodes. For S3, as was the case in Figures 8 and 9, the Computing continuum architecture shows better response times compared to the Cloud computing architecture for all Alpha values. In this scenario, we can observe that the degradation of the Cloud computing architecture is quite significant. As mentioned above and more significant in this case, managing a large volume of data and a large number of end devices involved, the Computing continuum architecture benefits from the distribution of the processing load among the different Fog nodes, while in the Cloud computing architecture that has a single node to manage and process all the data, it has limitations in terms of capacity and performance. 4.5 Summary of Results Figure 11 presents a set of heat maps corresponding to different values of Gamma. In each heat map, the X and Y axes represent the Alpha and Beta values, respectively. This format visualizes a gradient of colors ranging from blue to red tones. The red tones indicate that the Cloud Computing architecture achieves higher response times for a specific Alpha and Beta configuration. On the other hand, blue shades indicate that the Computing Continuum architecture achieves more efficient response times for a particular Alpha and Beta configuration. Based on these results, we can make the following general observations for the evaluated case study; Cloud computing is generally best suited for scenarios with small data sets. The centralized nature of Cloud computing allows it to Manuscript submitted to ACM Data Analytics Architectures on the Continuum: a performance comparison study 17 (a) Gamma - S1. (b) Gamma - S2. (c) Gamma - S3. Fig. 11. Gamma. Summary of results. efficiently manage and process small amounts of data from a limited number of end devices. Computing continuum outperforms Cloud computing when dealing with larger data sets or a larger number of end devices. The distributed nature of Computing continuum, with fog nodes spread throughout the infrastructure, allows it to handle the larger workload and efficiently process data from numerous end devices. These results are explained in detail below, where it is specifically determined what the limits of both architectures are depending on the configuration of the Alpha, Beta, and Gamma parameters. Cloud computing architecture achieves better performance when: Gamma - S1: Cloud computing outperforms Computing continuum across the entire Alpha and Beta range as shown in Figure 11a. The small size of the data set allows the Cloud computing node to manage it efficiently. Gamma - S2: Cloud computing performs best when Alpha is above 66% and Beta is below 66% and, when both Alpha and Beta are set at 33% as depicted in Figure 11b with red colors. In these cases, the moderate size of the data set, combined with a limited number of end devices, favors Cloud computing’s centralized approach. Gamma - S3: Cloud computing shows no advantage over Computing continuum for the entire Alpha and Beta range as shown in Figure 11c. This suggests that in scenarios with large data volumes, the Computing continuum architecture has higher performance. 
Computing continuum architecture achieves better performance when: Gamma - S1: Computing continuum shows no significant advantage over Cloud computing for the entire Alpha and Beta range, as shown in Figure 11a. This suggests that in scenarios with small data volumes, the Cloud computing architecture generally performs better. Gamma - S2: Computing continuum outperforms Cloud computing when Alpha is below 66% and Beta is above 66%, and when both Alpha and Beta are set at 100%, as depicted in Figure 11b with blue colors. In these cases, the medium data set size and the larger number of end devices benefit from Computing continuum's distributed processing capabilities. Gamma - S3: Computing continuum outperforms Cloud computing across the entire Alpha and Beta range, as shown in Figure 11c. The Computing continuum architecture is advantageous for handling large volumes of data and managing numerous end devices efficiently.

As we have seen, the choice between the Cloud computing and Computing continuum architectures is clearly influenced by the Alpha, Beta, and Gamma parameters. As a general conclusion, we can determine that for small data sets and a limited number of end devices, Cloud computing is more appropriate, while the Computing continuum stands out in scenarios with larger data sets and a larger number of end devices. For the defined case study, thanks to this evaluation, we have been able to determine the values of each parameter for which one architecture is more beneficial than the other. This would allow the application to perform an intelligent deployment that migrates from one architecture to the other depending on the application context, ensuring optimal performance and QoS.

5 RELATED WORK

In the area of evaluation and comparison of architectures in the Cloud computing and Computing continuum domains, several works have been carried out that have contributed to the understanding and improvement of these architectures. Alwasel et al. present IoTSim-Osmosis, a tool for modeling and simulating multiple distributed systems, allowing the integration of IoT, Edge, and Cloud ecosystems along with SD-WAN networks for evaluation. However, this tool focuses only on modeling the architecture and not on QoS evaluation or on a possible comparison between different topologies or application contexts. The same happens with other simulators that focus on the simulation of a specific kind of architecture, in this case Clouds, so our proposal could be of great help for this type of simulator. Müller et al. present Lambada, a framework for processing data analytics in distributed systems. They address different technical issues and present several examples with large amounts of data where distributed computing offers cost and performance advantages over more traditional solutions. The framework is based on different solutions: tree-based invocation of workers for fast start-up, a design for scan operators that balances the cost and performance of cloud storage, and a purely serverless exchange operator. However, this solution is focused only on distributed systems; it does not take into account a possible migration to a traditional Cloud architecture when performance and cost parameters do not meet expectations. Nor does it take into account data quality parameters for each of the analytics.
The same happens with the work of Rani and Chauhdary, who present a data analytics framework for the implementation of architectures in the smart city scenario with the objective of evaluating and improving QoS. For this purpose, they propose a new protocol called QoS-IoT, which is evaluated in terms of throughput, energy, and transmission time. However, it is only applied to a fixed architecture; it does not compare with another type of architecture that could further improve QoS in certain application contexts, and it does not take quality parameters into account in the analytics. Finally, Ferrari et al. present an experimental methodology to compare Edge-Cloud and Full-Cloud architectures and detect anomalies with deep learning algorithms. The proposal focuses on metrics related to data transfer delays and required network bandwidth. Despite being a promising proposal, the authors focus on comparing the performance of the deep learning algorithm with different configurations in the two architectures.

In light of previous related work, our proposal aims to address existing limitations in the field of architecture evaluation and comparison in the Cloud computing and Computing continuum domains. Our approach distinguishes itself from the lack of integration often found in related works by offering a guide/methodology for the evaluation of both architectures in a wide spectrum of contexts and scenarios, enabling a more comprehensive and consistent evaluation, in addition to supporting the decision of which architecture is the most suitable in each context.

6 CONCLUSIONS AND FUTURE WORK

In recent years, there has been an explosion of data generated by the large number of devices connected to the Internet. This large amount of data, known as Big Data, has impacted different sectors of our society, allowing the creation of new, more complex, and intelligent applications. However, the pace of data growth is exceeding the capacity of IT infrastructures to analyze and process it, affecting the final QoS obtained by end users. There are different architecture alternatives for this processing: Cloud computing and the Computing continuum. However, these architectures are not optimal for all possible contexts (e.g., data volume, devices involved, etc.). It is necessary to exhaustively evaluate the QoS of these architectures and find their limits in order to take advantage of their capabilities at the right time and have a fully optimized infrastructure. In this paper, we present a set of guidelines to evaluate and compare the QoS of these architectures. For this purpose, we have defined different QoS-influencing metrics that provide flexibility for adaptation to various scenarios. Finally, we have presented the evaluation of a case study where we have obtained the limits of each architecture and detected in which cases one architecture or the other is better.

As future work, we are working on the development of an intelligent orchestrator that, based on the results obtained through our evaluation methodology, can autonomously make migration decisions between architectures. This orchestrator will seek to optimize performance and efficiency, adapting dynamically to the changing demands of the environment.
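As a first approximation, the decision logic of such an orchestrator could simply encode the limits identified in Section 4.5 for the evaluated case study. The sketch below is only an illustration of that idea under our own assumptions (the S1/S2/S3 data volumes and the 33/66/100% Alpha and Beta values used in the experiments); it is not the orchestrator under development.

```python
def choose_architecture(gamma_size: str, alpha: int, beta: int) -> str:
    """Suggest a deployment based on the case-study results summarized in Figure 11.

    gamma_size: "S1" (small), "S2" (medium) or "S3" (large) data volume.
    alpha, beta: percentages in {33, 66, 100}.
    """
    if gamma_size == "S1":
        return "cloud"        # Cloud computing was better over the whole Alpha/Beta grid
    if gamma_size == "S3":
        return "continuum"    # the Computing continuum was better over the whole grid
    # S2: mixed region, following the red/blue areas of Figure 11b
    if (alpha > 66 and beta < 66) or (alpha == 33 and beta == 33):
        return "cloud"
    if (alpha < 66 and beta > 66) or (alpha == 100 and beta == 100):
        return "continuum"
    return "either"           # configurations not singled out in the reported results

print(choose_architecture("S2", alpha=100, beta=33))   # -> cloud
print(choose_architecture("S2", alpha=33, beta=100))   # -> continuum
```

A real orchestrator would of course refine these thresholds with live measurements, but the sketch shows how the evaluation results translate directly into a migration rule.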
The integration of an automated decision-making system would represent a step forward in the practical implementation of our research results, enabling more efficient and adaptive management of architectures in dynamic and heterogeneous environments. In addition, we are also working on expanding and improving our evaluation guide, by including new evaluation parameters, such as node sparsity or cost of architectures, enabling a more accurate and complete evaluation of architectures for developers. 7 ACKNOWLEDGMENTS This work has been partially funded by grant DIN2020-011586, funded by MCIN/ AEI/10.13039/501100011033 and by the European Union “Next GenerationEU /PRTR”, by the Ministry of Science, Innovation, and Universities (projects TED2021-130913B-I00, PDC2022-133465-I00, TED2021-131023B-C21, PID2021-126227NB-C22), by the Regional Ministry of Economy, Science and Digital Agenda of the Regional Government of Extremadura (GR21133) and the European Regional Development Fund. REFERENCES Noraini Abdullah, Saiful Adli Ismail, Siti Sophiayati, and Suriani Mohd Sam. 2015. Data quality in big data: a review. Int. J. Advance Soft Compu. Appl 7, 3 (2015), 17–27. Giuseppe Aceto, Valerio Persico, and Antonio Pescapé. 2020. Industry 4.0 and health: Internet of things, big data, and cloud computing for healthcare 4.0. Journal of Industrial Information Integration 18 (2020), 100129. Ejaz Ahmed, Ibrar Yaqoob, Ibrahim Abaker Targio Hashem, Imran Khan, Abdelmuttlib Ibrahim Abdalla Ahmed, Muhammad Imran, and Athanasios V Vasilakos. 2017. The role of big data analytics in Internet of Things. Computer Networks 129 (2017), 459–471. Khaled Alwasel, Devki Nandan Jha, Fawzy Habeeb, Umit Demirbaga, Omer Rana, Thar Baker, Scharam Dustdar, Massimo Villari, Philip James, Ellis Solaiman, et al. 2021. IoTSim-Osmosis: A framework for modeling and simulating IoT applications over an edge-cloud continuum. Journal of Systems Architecture 116 (2021), 101956. Michael Armbrust, Armando Fox, Rean Griffith, Anthony D Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, et al. 2010. A view of cloud computing. Commun. ACM 53, 4 (2010), 50–58. Malika Bendechache, Sergej Svorobej, Patricia Takako Endo, and Theo Lynn. 2020. Simulating resource management across the cloud-to-thing continuum: A survey and future directions. Future Internet 12, 6 (2020), 95. Rajkumar Buyya, Rajiv Ranjan, and Rodrigo N. Calheiros. 2009. Modeling and simulation of scalable Cloud computing environments and the CloudSim toolkit: Challenges and opportunities. In 2009 International Conference on High Performance Computing Simulation. 1–11. https: //doi.org/10.1109/HPCSIM.2009.5192685 Hoang T Dinh, Chonho Lee, Dusit Niyato, and Ping Wang. 2013. A survey of mobile cloud computing: architecture, applications, and approaches. Wireless communications and mobile computing 13, 18 (2013), 1587–1611. P. Ferrari, S. Rinaldi, E. Sisinni, F. Colombo, F. Ghelfi, D. Maffei, and M. Malara. 2019. Performance evaluation of full-cloud and edge-cloud architectures for Industrial IoT anomaly detection based on deep learning. In 2019 II Workshop on Metrology for Industry 4.0 and IoT (MetroInd4.0IoT). Manuscript submitted to ACM 20 S. Laso et al. 420–425. https://doi.org/10.1109/METROI4.2019.8792860 Yosra Hajjaji, Wadii Boulila, Imed Riadh Farah, Imed Romdhani, and Amir Hussain. 2021. Big data and IoT-based applications in smart environments: A systematic review. Computer Science Review 39 (2021), 100318. 
Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar, Abdullah Gani, and Samee Ullah Khan. 2015. The rise of “big data” on cloud computing: Review and open research issues. Information systems 47 (2015), 98–115. Ari Keränen, Jörg Ott, and Teemu Kärkkäinen. 2009. The ONE simulator for DTN protocol evaluation. In Proceedings of the 2nd international conference on simulation tools and techniques. 1–10. Sergio Laso, Javier Berrocal, Pablo Fernandez, José María García, Jose Garcia-Alonso, Juan M Murillo, Antonio Ruiz-Cortés, and Schahram Dustdar. 2022. Elastic Data Analytics for the Cloud-to-Things Continuum. IEEE Internet Computing 26, 6 (2022), 42–49. Sergio Laso, Javier Berrocal, Pablo Fernández, Antonio Ruiz-Cortés, and Juan M Murillo. 2022. Perses: A framework for the continuous evaluation of the QoS of distributed mobile applications. Pervasive and Mobile Computing 84 (2022), 101627. Ingo Müller, Renato Marroquín, and Gustavo Alonso. 2020. Lambada: Interactive data analytics on cold data using serverless cloud infrastructure. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 115–130. Sakshi Painuly, Sachin Sharma, and Priya Matta. 2021. Future trends and challenges in next generation smart application of 5G-IoT. In 2021 5th international conference on computing methodologies and communication (ICCMC). IEEE, 354–357. Shalli Rani and Sajjad Chauhdary. 2018. A Novel Framework and Enhanced QoS Big Data Protocol for Smart City Applications. Sensors 18, 11 (Nov 2018), 3980. https://doi.org/10.3390/s18113980 Manish Saraswat and RC Tripathi. 2020. Cloud computing: Comparison and analysis of cloud service providers-AWs, Microsoft and Google. In 2020 9th international conference system modeling and advancement in research trends (SMART). IEEE, 281–285. Md Shahjalal, Moh Khalid Hasan, Md Mainul Islam, Md Morshed Alam, Md Faisal Ahmed, and Yeong Min Jang. 2020. An overview of AI-enabled remote smart-home monitoring system using LoRa. In 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC). IEEE, 510–513. Statista. 2023. Edge computing market value worldwide. https://www.statista.com/statistics/1175706/worldwide-edge-computing-market-revenue/ https://www.statista.com/statistics/1175706/worldwide-edge-computing-market-revenue/. Accessed October 4, 2023. Jianxin Wang, Ming K Lim, Chao Wang, and Ming-Lang Tseng. 2021. The evolution of the Internet of Things (IoT) over the past 20 years. Computers & Industrial Engineering 155 (2021), 107174. Yi Wei and M Brian Blake. 2010. Service-oriented computing and cloud computing: Challenges and opportunities. IEEE Internet Computing 14, 6 (2010), 72–75. Xiaolong Xu, Qingxiang Liu, Yun Luo, Kai Peng, Xuyun Zhang, Shunmei Meng, and Lianyong Qi. 2019. A computation offloading method over big data for IoT-enabled cloud-edge computing. Future Generation Computer Systems 95 (2019), 522–533. Ashkan Yousefpour, Caleb Fung, Tam Nguyen, Krishna Kadiyala, Fatemeh Jalali, Amirreza Niakanlahiji, Jian Kong, and Jason P Jue. 2019. All one needs to know about fog computing and related edge computing paradigms: A complete survey. Journal of Systems Architecture (2019). Manuscript submitted to ACM