CPE413 Emerging Technologies in Data Science
Pamantasan ng Lungsod ng San Pablo
Dr. Teresa A. Yema, Engr. Mario Jr. G. Brucal
Summary
This document is a module on emerging technologies in computer engineering (CPE), focusing on data science. It covers the definition of data science, data and information, the data processing cycle, and fundamental concepts related to big data.
MODULE: EMERGING TECHNOLOGIES IN CPE (CPE413)
ACADEMIC YEAR 2023-2024
Dr. Teresa A. Yema
Engr. Mario Jr. G. Brucal
Lecturers

Credits: 3 units lecture (3 hours/week)
Pre-Requisite: 3rd Year Standing

Module 5: Data Science
Lesson 5. Data Science

This module provides an in-depth exploration of several topics within the field of data science, including data science itself, the distinction between data and information, different forms of data and their representations, the data value chain, and fundamental concepts related to big data.

In this module, you will be able to:
a) Explain the definition of data science and the function of data scientists.
b) Distinguish between information and data.
c) Describe the life cycle of data processing.
d) Recognize various data kinds from multiple angles.
e) Explain the data value chain in the new big data era.
f) Recognize the fundamentals of big data.

DATA SCIENCE
Data science encompasses a range of disciplines and employs scientific methodologies, algorithms, and systems to derive knowledge and insights from many forms of data, including structured, semi-structured, and unstructured data. Data science covers a broader scope of activities than data analysis alone, and organizations offer diverse positions within it that call for an equally diverse set of competencies.

Data science is continuously developing as both an academic discipline and a profession, and it is widely recognized as a promising and sought-after career trajectory for proficient individuals. Data professionals who aspire to succeed today acknowledge the need to go beyond conventional proficiencies in data analysis, data mining, and programming. To extract valuable insights for their organizations, data scientists must acquire comprehensive proficiency across the entire data science life cycle and demonstrate the adaptability and comprehension needed to optimize the benefits derived from each stage of that process.

Data scientists must possess intense curiosity and a focus on achieving desired outcomes. They should also have extensive expertise in their respective industries and the practical communication skills needed to convey complex technical findings effectively to individuals without technical expertise. They exhibit a robust foundation in statistical analysis and linear algebra, alongside programming expertise oriented specifically towards data warehousing, mining, and modelling, enabling them to construct and evaluate algorithms.

This chapter discusses fundamental definitions of data and information, data kinds and their representation, the data value chain, and fundamental ideas about big data.

DATA AND INFORMATION
Data can be described as a codified representation of factual information, conceptual ideas, or instructional content intended to be communicated, interpreted, or processed by human beings and electronic devices. It is often characterized as "unprocessed facts and figures". Data is commonly represented using characters, including alphabets (A-Z, a-z), numerals (0-9), and special symbols (+, -, /, *, =, etc.).

Information refers to data that has been processed and serves as the foundation for making decisions and taking action. Data is transformed into a format that holds significance for the receiver and possesses actual or perceived worth in the receiver's present or future actions or decisions. Information can therefore be understood as data that has been interpreted, an interpretation that occurs after the data has been organized, structured, and processed within a specific context.
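To make the distinction concrete, the short Python sketch below is illustrative only; the raw daily figures and the sales summary are hypothetical, not taken from the module.

    # Raw data: unprocessed daily sales figures (hypothetical values).
    daily_sales = [1200.50, 980.00, 1430.25, 1100.75]

    # Processing: organize and summarize the data within a context
    # (totals and averages a decision-maker can act on).
    total_sales = sum(daily_sales)
    average_sale = total_sales / len(daily_sales)

    # Information: the interpreted result, ready to support a decision.
    print(f"Total sales: {total_sales:.2f}")
    print(f"Average daily sales: {average_sale:.2f}")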
DATA PROCESSING CYCLE
Data processing refers to rearranging or reorganizing data, by individuals or automated systems, to enhance its utility and provide additional value for a given objective. Data processing encompasses three fundamental stages, namely input, processing, and output; these three parts make up the data processing cycle.

Figure 5.1. Data Processing Cycle (Input → Process → Output)

Input. During this stage the input data is converted into a form suitable for further processing. The specific form depends on the characteristics and capabilities of the processing equipment. When electronic computers are used, the input data can be stored on several types of storage media, including hard disks, CDs, and flash disks.

Processing. During this stage the input data is transformed to generate more meaningful and practical data. For instance, interest can be calculated on funds deposited into a financial institution, or monthly sales can be derived from a compilation of sales orders.

Output. The outcome of the preceding processing step is gathered. The specific format of the output data depends on the intended application of the data; payroll information about employee compensation is one example of output data.

DATA TYPES AND THEIR REPRESENTATION
Data types can be characterized from various viewpoints. In computer science and computer programming, a data type is a characteristic of data that informs the compiler or interpreter about the intended usage of the data.

Data From Programming Perspective
A data type is explicitly incorporated in nearly all programming languages, albeit with varying terminology across languages. Common data types, widely used in many computational systems and programming languages, include:

Integers (int) – used to store whole numbers, mathematically known as integers.
Booleans (bool) – used to represent values restricted to one of two states: true or false.
Characters (char) – used to store a single character.
Floating-point numbers (float) – used to store real numbers.
Alphanumeric strings (string) – used to store a combination of characters and numbers.

A data type defines the set of possible values that an expression, such as a variable or a function, can assume. It also encompasses the operations that apply to the data, the semantics of the data, and the mechanisms for storing values of that particular kind.
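As a concrete illustration, the snippet below shows how these common data types might appear in Python; the variable names are arbitrary, and since Python has no separate char type, a one-character string stands in for it.

    # Common data types as they appear in Python (type hints added for clarity).
    count: int = 42            # integer: whole numbers
    is_valid: bool = True      # boolean: restricted to True or False
    grade: str = "A"           # character: a single character (a 1-char string in Python)
    temperature: float = 36.6  # floating-point: real numbers
    label: str = "Sensor-01"   # alphanumeric string: mix of characters and digits

    # The type determines which operations apply to the value.
    print(type(count), count + 1)      # arithmetic applies to int
    print(type(label), label.upper())  # string operations apply to str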
Data Types From Data Analytics Perspective
From a data analytics perspective, it is crucial to understand that there are three prevalent categories of data types or structures: structured, semi-structured, and unstructured data.

Figure 5.2. Three Types of Data and Metadata

Structured data refers to data conforming to a predetermined data model, which makes it straightforward to analyze. Structured data adheres to a tabular format in which a clear relationship exists between rows and columns. Typical instances of structured data include Excel files and SQL databases, each with a systematic arrangement of rows and columns that can be sorted and ordered.

Semi-structured data is a form of structured data that does not adhere to the formal structure of data models commonly used in relational databases or other data tables, but does include tags or other markers that allow semantic elements to be distinguished and hierarchies to be established within the data. Hence, it is commonly referred to as self-describing. JSON and XML are recognized as representative types of semi-structured data (see the brief JSON sketch at the end of this subsection).

Unstructured data refers to information that lacks a predetermined data model or is not organized in a prescribed fashion. Unstructured information predominantly comprises textual content, although it may also include other forms of data such as dates, numerical values, and factual details. The inconsistencies and ambiguities in such data make it difficult to process with conventional programs, in contrast to data maintained in structured databases. Frequent instances of unstructured data include audio and video files and NoSQL databases.

The final classification pertains to metadata. From a technical perspective it is arguably not an independent data structure, yet it holds significant importance in Big Data analysis and solutions. Metadata is data that describes or provides context for other data; it supplies supplementary details about a particular dataset. For example, the metadata of a collection of photographs can record the precise time and location at which each was captured, in designated fields for dates and places that, taken in isolation, are structured data. For this reason, metadata is commonly employed by Big Data solutions for preliminary analysis.
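The brief sketch referenced above follows here. The JSON record is hypothetical, but it shows how tags (keys) in semi-structured data identify semantic elements and hierarchy, which Python's standard json module can parse without a fixed relational schema.

    import json

    # A hypothetical semi-structured record: the keys act as self-describing tags,
    # and the nested "metadata" object adds hierarchy without a rigid table schema.
    record = '''
    {
      "id": 101,
      "title": "Field photo",
      "metadata": {"captured_at": "2023-08-15T10:30:00", "location": "San Pablo"}
    }
    '''

    data = json.loads(record)
    print(data["title"])                    # access by tag name
    print(data["metadata"]["captured_at"])  # hierarchy via nesting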
DATA VALUE CHAIN
The Data Value Chain describes the flow of information within a large-scale data system as a sequence of stages needed to create value and derive meaningful insights from data. The Big Data Value Chain encompasses a set of fundamental activities of significant importance, which can be categorized as follows.

Data Acquisition
Data acquisition covers the systematic procedures involved in collecting, filtering, and refining data before storing it in a data warehouse or any other suitable storage platform where subsequent data analysis can be conducted. Collecting data is a significant difficulty in the realm of big data, particularly concerning the necessary infrastructure. The infrastructure needed to acquire large-scale data must consistently deliver low latency both in capturing data and in executing queries; it should also handle exceedingly high transaction volumes, often in a distributed setting, while accommodating flexible and dynamic data structures.

Data Analysis
The primary objective of this stage is to ensure that the raw data obtained is suitable for use in decision-making processes and in domain-specific applications. Data analysis is the systematic examination, manipulation, and modelling of data to identify pertinent information, synthesize insights, and extract valuable latent knowledge from a business perspective. Closely associated areas include data mining, business intelligence, and machine learning. (A small illustrative sketch of this stage appears at the end of this section.)

Data Curation
Data curation is the proactive administration of data during its entire lifespan, ensuring that it adheres to the essential criteria for data quality so that it can be used efficiently. The processes involved in data curation span several activities, including content production, selection, classification, transformation, validation, and preservation. Data curation is carried out by proficient curators who are responsible for enhancing the accessibility and quality of data. Data curators, often called scientific curators or data annotators, are tasked with guaranteeing the reliability, findability, accessibility, reusability, and appropriateness of data for its intended use. One prominent trend in big data curation is the use of community and crowdsourcing methodologies.

Data Storage
Data storage concerns the persistence and management of data in a way that accommodates applications requiring efficient and rapid data retrieval. Relational Database Management Systems (RDBMS) have served as the predominant, and virtually exclusive, storage paradigm for approximately four decades. Their ACID properties (Atomicity, Consistency, Isolation, and Durability) assure the integrity of database transactions, but they offer limited adaptability to schema modifications and may not effectively handle the escalating data volumes and complexity associated with big data scenarios; consequently, they are often deemed inadequate for such situations. NoSQL technologies have been specifically developed to achieve scalability, offering diverse solutions founded on alternative data models.

Data Usage
This covers the business activities that rely on data-driven approaches and therefore need access to data, its analysis, and the tools required to integrate data analysis into the business activity. Using data to make business decisions can enhance competitiveness by reducing costs, increasing added value, or improving other measurable parameters relative to existing performance standards.
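As referenced under Data Analysis above, the sketch below is a minimal, hypothetical illustration of the analysis stage using the pandas library; the file name sales.csv and its columns are assumptions, not part of the module.

    import pandas as pd

    # Acquisition: load raw data (hypothetical file; columns assumed: region, amount, date).
    df = pd.read_csv("sales.csv")

    # Curation/cleaning: drop incomplete rows so the data meets basic quality criteria.
    df = df.dropna(subset=["region", "amount"])

    # Analysis: summarize the data to extract insight for decision-making.
    by_region = df.groupby("region")["amount"].sum().sort_values(ascending=False)
    print(by_region.head())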
BASIC CONCEPT OF BIG DATA
The term "big data" encompasses a range of unconventional tactics and technologies necessary for collecting, organizing, processing, and analyzing substantial datasets to derive meaningful insights. The issue of managing data that surpasses the computational capabilities or storage capacity of a single computer is a longstanding one; in recent years, however, the prevalence, magnitude, and significance of this form of computing have increased dramatically. This section provides an overview of big data at a foundational level and elucidates common ideas that one may encounter, along with an overview of the procedures and technologies presently employed in this domain.

BIG DATA
"Big data" refers to a vast and intricate collection of data sets that pose challenges when processed with conventional database management tools and standard data processing software. In the present context, a "large dataset" is one that exceeds the practical processing or storage capabilities of traditional tools or of a single machine. This implies that the standard measurement of large datasets continuously changes and can differ dramatically across organizations.

Big data is commonly defined by three key characteristics, often referred to as the 3V's, and beyond:

Volume – Large amounts of data: zettabytes / massive datasets.
Velocity – Data is live, streaming, or in motion.
Variety – Data comes in many different forms from diverse sources.
Veracity – Can we trust the data? How accurate is it?

Figure 5.3. Characteristics of Big Data

CLUSTERED COMPUTING AND HADOOP ECOSYSTEM
Clustered Computing
Due to the inherent characteristics of big data, individual computer systems frequently prove insufficient for managing the data at its various stages. To better meet the substantial storage and computing requirements of big data, computer clusters are a more suitable solution. Big data clustering software aggregates the computational capabilities of numerous smaller devices in order to offer a range of advantages:

Resource Pooling – Consolidating storage space to accommodate data is undeniably advantageous, but CPU and memory pooling are also significant; processing huge datasets requires substantial quantities of all three of these resources.

High Availability – Clusters can offer varying degrees of fault tolerance and availability guarantees to mitigate the impact of hardware or software failures on data access and processing. This matters all the more as real-time analytics grows in importance.

Easy Scalability – Clusters facilitate horizontal scalability by seamlessly integrating additional machines into the group. The system can therefore respond to fluctuations in resource demands without increasing the physical resources of any given machine.

Using clusters requires a comprehensive approach to managing cluster membership, allocating resources, and scheduling tasks on individual nodes. Software such as Hadoop's YARN (Yet Another Resource Negotiator) can manage cluster membership and resource allocation effectively. Once formed, the computing cluster frequently serves as a fundamental infrastructure that other software communicates with to carry out data processing tasks; it commonly incorporates machines responsible for managing a distributed storage system, a topic addressed below in the discussion of data persistence.
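Cluster schedulers such as YARN are beyond a short snippet, but the single-machine analogy below, using Python's standard multiprocessing module with a made-up worker function and inputs, gives a feel for how pooled workers process pieces of a large job in parallel. It is an analogy only, not real cluster code.

    from multiprocessing import Pool

    def process_chunk(chunk):
        # Stand-in for real work done on one partition of a large dataset.
        return sum(chunk)

    if __name__ == "__main__":
        # Split a "large" dataset into chunks and farm them out to a pool of workers,
        # loosely mirroring how a cluster spreads work across nodes.
        data = list(range(1_000_000))
        chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]
        with Pool(processes=4) as pool:
            partial_sums = pool.map(process_chunk, chunks)
        print(sum(partial_sums))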
HADOOP
Hadoop is an open-source platform designed to facilitate the processing and analysis of large-scale datasets. The framework enables the distributed analysis of extensive datasets across computer clusters through straightforward programming techniques; the inspiration for it stems from a technical publication released by Google. Four fundamental attributes define the Hadoop framework:

Economical. The systems are cost-effective, as they can use standard commodity computers for data processing.
Reliable. The system is considered reliable because it keeps several copies of data on different devices, mitigating the risk of hardware failure.
Scalable. The system has a high degree of scalability, as it can be expanded both horizontally and vertically with ease; adding a few more nodes is enough to expand the framework.
Flexible. The system exhibits a high degree of flexibility, allowing structured and unstructured data to be stored in whatever quantities the user requires, for use at a later time.

The Hadoop ecosystem has grown from its original four core areas of data management, access, processing, and storage, and it continues to expand to meet the demands of Big Data. The following elements, among others, make up the ecosystem:

HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming-based data processing
Spark: In-memory data processing
Pig, Hive: Query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: Machine learning algorithm libraries
Solr, Lucene: Searching and indexing
Zookeeper: Cluster management
Oozie: Job scheduling

Figure 5.4. Hadoop Ecosystem

HDFS: The Hadoop Distributed File System (HDFS) is a fundamental element of the Hadoop ecosystem. Its principal function is the storage of extensive datasets, whether structured or unstructured, across several nodes; it also manages metadata, which is stored in the form of log files. HDFS comprises two fundamental components:
1. Name Node
2. Data Node
The Name Node serves as the central node in the distributed file system, housing metadata about the data stored in the system; this metadata requires fewer resources than the data itself. The Data Nodes, which store that data, run on commodity hardware within the distributed system, which is what makes Hadoop cost-effective. HDFS thus plays a pivotal role by coordinating the cluster and its hardware components.

YARN: The YARN framework is a key component of Apache Hadoop, providing a distributed processing platform. As its name says, YARN (Yet Another Resource Negotiator) helps manage resources across the cluster; in summary, it schedules and allocates resources within the Hadoop system. It comprises three primary components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager distributes resources across the applications in the system, whereas Node Managers manage resources such as CPU, memory, and bandwidth on each machine and subsequently inform the Resource Manager about their allocation. The Application Manager acts as an intermediary between the Resource Manager and the Node Managers, negotiating according to their needs.

MapReduce: MapReduce enables the use of distributed and parallel algorithms, carrying the processing logic to the data and helping developers write applications that transform large data sets into more manageable ones. MapReduce uses two methods, Map() and Reduce(), each responsible for specific tasks:
1. The Map() function sorts and filters the data, dividing it into groups. Map produces its result as key-value pairs, which are subsequently processed by the Reduce() method.
2. The Reduce() function summarises the data by aggregating the mapped output. Concisely, Reduce() accepts the output produced by Map() and combines those tuples into a smaller set.
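The toy word count below mimics the Map()/Reduce() pattern in plain Python; it is a conceptual sketch with made-up input lines, not Hadoop code. The map step emits key-value pairs, and the reduce step aggregates the pairs that share a key.

    from functools import reduce
    from collections import defaultdict

    lines = ["big data needs big clusters", "data drives decisions"]

    # Map: emit (word, 1) pairs for every word in every line.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle/group: collect the values that share the same key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce: aggregate each key's values into a single count.
    word_counts = {word: reduce(lambda a, b: a + b, counts)
                   for word, counts in grouped.items()}
    print(word_counts)  # e.g. {'big': 2, 'data': 2, ...}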
PIG: The Pig platform was primarily developed by Yahoo and operates on a programming language known as Pig Latin, which is query-based and resembles SQL. The platform serves as a framework for organizing the flow of data and for processing and analyzing large-scale datasets. The Pig framework executes the commands and manages the underlying MapReduce activity; after processing, it stores the resulting data in the Hadoop Distributed File System (HDFS). Pig Latin runs on the Pig Runtime in much the same way that Java runs on the Java Virtual Machine (JVM). Pig plays a significant role in the Hadoop ecosystem by making programming more convenient and enabling optimization.

HIVE: Hive uses SQL-like techniques and interfaces to retrieve and store extensive datasets efficiently; its query language is referred to as HQL (Hive Query Language). The system is highly scalable, as it supports both real-time and batch processing, and Hive supports all SQL data types, which facilitates query execution. Like other query-processing frameworks, Hive consists of two components: the JDBC drivers and the Hive command line interface. JDBC, in conjunction with ODBC drivers, establishes data storage permissions and connections, while the Hive command line aids in query processing.

Mahout: Mahout makes it possible to incorporate machine learning capabilities into a system or application. Machine learning, as its name implies, allows a system to develop autonomously by analyzing patterns, user/environment interactions, or algorithmic processes. Mahout offers libraries and features for collaborative filtering, clustering, and classification, which are fundamental ideas in machine learning, and it allows algorithms to be run as required by invoking those libraries.

Apache Spark: The platform effectively manages several computationally intensive activities, including batch processing, interactive or iterative real-time processing, graph transformations, and visualization. Because it processes data in memory, it is faster and more optimized than Hadoop's disk-based MapReduce processing. Spark is particularly well suited to processing real-time data, while Hadoop is better suited to structured data or batch processing; as a result, many firms employ the two technologies together. (A short PySpark sketch appears after these component descriptions.)

Apache HBase: This NoSQL database can handle many data types, making it suitable for managing diverse data within the Hadoop environment. It offers the functionality of Google's BigTable, enabling efficient processing of large-scale datasets. When an instance of a specific element must be located or extracted from a vast database, the query has to be executed promptly and efficiently; in such cases HBase is advantageous, since it provides a resilient way of storing limited amounts of data.
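For a taste of what Spark code looks like in practice, the PySpark sketch below (referenced in the Apache Spark paragraph above) assumes a local Spark installation and a hypothetical sales.csv file; it is illustrative only and not part of the module.

    from pyspark.sql import SparkSession

    # Start a local Spark session (assumes the pyspark package is installed).
    spark = SparkSession.builder.appName("SalesSummary").master("local[*]").getOrCreate()

    # Read a hypothetical CSV into a distributed DataFrame.
    df = spark.read.csv("sales.csv", header=True, inferSchema=True)

    # In-memory, distributed aggregation: total amount per region.
    df.groupBy("region").sum("amount").show()

    spark.stop()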
Other Components: In addition to the components above, supplementary components play a significant role in enabling Hadoop to process extensive datasets effectively. These include the following:

Solr and Lucene are two services that facilitate searching and indexing through Java libraries. Lucene, in particular, is Java-based and offers a spell-check mechanism; Solr, it may be noted, is built on top of Lucene.

Zookeeper addresses a significant challenge in Hadoop: coordinating and synchronizing its resources and components, the lack of which frequently led to inconsistency. Zookeeper tackles these challenges through synchronization, inter-component communication, grouping, and maintenance.

Oozie serves the purpose of a scheduler, managing the scheduling of jobs and consolidating them into a cohesive unit. There are two kinds of Oozie jobs: Oozie workflow jobs and Oozie coordinator jobs. An Oozie workflow consists of tasks that must be executed in a specific sequential order, whereas Oozie coordinator jobs are triggered in response to the availability of specific data or to external stimuli.