Summary

This is chapter two of a textbook on data science prepared by Surafiel H. The chapter covers the fundamentals of data science, including the data processing cycle, data types, analysis, storage, and curation.

Full Transcript


Chapter Two: Data Science. Prepared by: Surafiel H., Department of Computer Science, Addis Ababa University, November 2021.

Overview of Data Science
Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data. Data science is much more than simply analyzing data. It offers a range of roles and requires a range of skills. Data science continues to evolve as one of the most promising and in-demand career paths for skilled professionals. 2

What is Data?
A representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines. Data can be described as unprocessed facts and figures. Alternatively, data are streams of raw facts representing events occurring in an organization or in the physical environment before they have been organized and arranged into a form that people can understand and use. 3
Data can also be defined as groups of non-random symbols in the form of text, images, and voice representing quantities, actions, and objects. Data is represented with the help of characters such as alphabets (A-Z, a-z) or special characters (+, -, /, *, =, etc.). 4

What is Information?
Organized or classified data, which has some meaningful value for the receiver. Processed data on which decisions and actions are based. Plain collected data as raw facts cannot help much in decision-making. Interpreted data; created from organized, structured, and processed data in a particular context. 5

Summary: Data vs. Information
- Data: described as unprocessed or raw facts and figures. Information: described as processed data.
- Data: cannot help in decision-making. Information: can help in decision-making.
- Data: raw material that can be organized, structured, and interpreted to create useful information systems. Information: interpreted data, created from organized, structured, and processed data in a particular context.
- Data: groups of non-random symbols in the form of text, images, and voice representing quantities, actions, and objects. Information: processed data in the form of text, images, and voice representing quantities, actions, and objects. 6

Data vs. Information - Examples
Data:
- The number of cars sold by a dealership in the past month: 100
- The number of customers who visited the dealership in the past month: 500
Information:
- The dealership's sales have increased by 10% in the past month.
- The dealership's conversion rate is 20%.
Data:
- The temperature in Addis Ababa on October 21, 2021, at 6:00 PM was 23 degrees Celsius.
Information:
- The temperature in Addis Ababa on October 21, 2021, at 6:00 PM was above average for that time of year. 7

Data Processing Cycle
It is a sequence of steps or operations for processing raw data into a usable form. It is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose. Simply put, it is the process of converting raw data into information: the transformation of raw data into meaningful information. It is a cyclical process; it starts and ends with data, and the output of one step is the input for the next step. The value of data is often realized when it is processed and turned into actionable information. Data processing can be used for various purposes, such as business intelligence, research, or decision support. 8
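To connect these ideas, here is a minimal sketch (not from the slides) of turning the dealership's raw figures into the information listed above, following the input-processing-output pattern described next; the previous month's figure of 91 cars is an assumed value so that the growth works out to roughly 10%.

```python
# Input: raw, unprocessed facts collected by the dealership.
cars_sold_this_month = 100
cars_sold_last_month = 91   # assumed value, not given on the slide
visitors_this_month = 500

# Processing: transform the raw data into meaningful information.
sales_growth = (cars_sold_this_month - cars_sold_last_month) / cars_sold_last_month
conversion_rate = cars_sold_this_month / visitors_this_month

# Output: information that can support a decision.
print(f"Sales growth: {sales_growth:.0%}")        # roughly 10% increase
print(f"Conversion rate: {conversion_rate:.0%}")  # 20%
```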
The data processing cycle typically consists of four main stages:
- Input,
- Processing,
- Output, and
- Storage. 9

Input:
- The input data is prepared in some convenient form for processing.
- The form will depend on the processing machine.
- For example, when electronic computers are used, the input data can be recorded on any one of several types of input media, such as flash disks, hard disks, and so on. 10

Processing:
- In this step, the input data is changed to produce data in a more useful form (the data is transformed into meaningful information).
- The raw data is processed by a suitable or selected processing method.
- For example, a summary of sales for a month can be calculated from the sales order data. 11

Output:
- At this stage, the result of the processing step is collected.
- Processed data is presented in a human-readable format, including reports, charts, graphs, and dashboards.
- The particular form of the output data depends on the use of the data.
- For example, the output data can be the total sales in a month. 12

Storage:
- Refers to how and where the output of the data processing is stored for future use.
- The processed data can be stored in databases or file systems, and it can be kept on various storage devices such as hard drives, solid-state drives, and cloud storage. 13

Data Types and Their Representation
In computer science and computer programming, a data type is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data. A data type defines:
- The operations that can be done on the data,
- The meaning of the data, and
- The way values of that type can be stored. 14

Common data types in computer programming include:
- Integers (int), used to store whole numbers and the negatives (or opposites) of the natural numbers, mathematically known as integers.
- Booleans (bool), used to represent values restricted to one of two values: true (1) or false (0).
- Characters (char), used to store a single character.
- Floating-point numbers (float), used to store real numbers.
- Alphanumeric strings (strings), used to store a combination of characters and numbers. 15

Data Types from a Data Analytics Perspective
From a data analytics point of view, there are three common types of data types or structures:
1. Structured,
2. Semi-structured, and
3. Unstructured data. 16

Structured Data
- Data that adheres to a predefined data model and is therefore straightforward to analyze.
- Data that resides in a fixed field within a file or record.
- Conforms to a tabular format with relationships between different rows and columns. 17
It depends on the creation of a data model, defining what types of data to include and how to store and process it.
- Data model: a visual representation of a database structure.
- Database: an organized collection of structured data, typically stored in a computer system.
Common examples of structured data are Excel files or SQL databases. 18

Semi-structured Data
A form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, i.e., it does not conform to the formal structure of a data model. However, it contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as a self-describing structure. 19
Examples of semi-structured data: JSON (JavaScript Object Notation) and XML (Extensible Markup Language). 20
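As a small illustration of how such self-describing data looks in practice, the sketch below (an illustrative example with made-up field names such as name, dept, and year) parses equivalent JSON and XML records using only Python's standard library.

```python
import json
import xml.etree.ElementTree as ET

# Semi-structured data: no fixed table schema, but keys/tags describe the fields.
json_text = '{"student": {"name": "Abebe", "dept": "Computer Science", "year": 3}}'
xml_text = "<student><name>Abebe</name><dept>Computer Science</dept><year>3</year></student>"

record = json.loads(json_text)            # parse JSON into nested dictionaries
print(record["student"]["name"])          # -> Abebe

root = ET.fromstring(xml_text)            # parse XML into an element tree
print(root.find("name").text, root.find("year").text)  # -> Abebe 3
```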
Unstructured Data
Data that either does not have a predefined data model or is not organized in a predefined manner. There is no data model; the data is stored in its native format. It is typically text-heavy, but may contain data such as dates, numbers, and facts as well. Common examples:
- Audio and video files, NoSQL data, pictures, PDFs, and so on. 21

Metadata – Data about Data
It provides additional information about a specific set of data. For example:
- The metadata of a photo could describe when and where the photo was taken.
The metadata then provides fields for dates and locations which, by themselves, can be considered structured data. 22

What is Big Data?
Generally speaking, big data refers to:
- Large datasets, and
- The category of computing strategies and technologies that are used to handle large datasets.
A data set is an ordered collection of data. Big data is a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. The common scale of big data sets is constantly shifting and may vary significantly from organization to organization. Big data is mainly characterized by 4 V's. 23 24
Volume: refers to the amount of data.
- Large amounts of data - zettabytes/massive datasets.
Velocity: refers to the speed of data processing.
- Data is live streaming or in motion.
Variety: refers to the number of types of data.
- Data comes in many different forms from diverse sources.
Veracity: which in this context is equivalent to quality.
- Can we trust the data?
- Are the data "clean" and accurate?
- Do they really have something to offer? 25

Other V's of Big Data
Value: refers to the usefulness of the gathered data for the business.
Variability: refers to the number of inconsistencies in the data and the inconsistent speed at which big data is loaded into the database.
Validity: data quality, governance, and master data management at massive scale.
Venue: distributed, heterogeneous data from multiple platforms.
Vocabulary: data models and semantics that describe the data structure. 26
Vulnerability: big data brings new security concerns.
- After all, a data breach with big data is a big breach.
Volatility: due to the velocity and volume of big data, its volatility needs to be carefully considered.
- How long does data need to be kept for?
Visualization: different ways of representing data, such as data clustering or using tree maps, sunbursts, parallel coordinates, circular network diagrams, or cone trees.
Vagueness: confusion over the meaning of big data and the tools used. 27 28

Data Value Chain
Describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data. The Big Data Value Chain identifies the following key high-level activities:
- Data Acquisition,
- Data Analysis,
- Data Curation,
- Data Storage, and
- Data Usage. 29

Data Acquisition
It is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
- A data warehouse is a system that aggregates, stores, and processes information from diverse data sources.
Data acquisition is one of the major big data challenges in terms of infrastructure requirements. 30
The infrastructure required for data acquisition must:
- Deliver low, predictable latency both in capturing data and in executing queries,
- Be able to handle very high transaction volumes, often in a distributed environment, and
- Support flexible and dynamic data structures. 31

Data Analysis
A process of cleaning, transforming, and modeling data to discover useful information for business decision-making. It involves exploring, transforming, and modelling data with the goal of highlighting relevant data, synthesising and extracting useful hidden information with high potential from a business point of view. Related areas include data mining, business intelligence, and machine learning. 32

Data Curation
The active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage. Data curation processes can be categorized into different activities, such as content:
- Creation,
- Selection,
- Classification,
- Transformation,
- Validation, and
- Preservation. 33
Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data. Data curators (also known as scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose. A key trend in the curation of big data is the use of community and crowdsourcing approaches. 34

Data Storage
It is the persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to the data. Relational Database Management Systems (RDBMS) have been the main, and almost only, solution to the storage paradigm for nearly 40 years. 35
Relational databases, which guarantee database transactions, lack flexibility with regard to schema changes, and their performance and fault tolerance suffer as data volumes and complexity grow, making them unsuitable for big data scenarios. NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models. 36

Data Usage
Covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity. In business decision-making, it can enhance competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria. 37

Cluster Computing
Computing refers to the process of using computers to perform various tasks, including calculations, data processing, and problem-solving. It involves the manipulation and transformation of data using software applications and algorithms. Computing can be done on a single computer or distributed across multiple computers connected through a network. 38
Cluster computing is a specific type of computing that involves the use of a cluster. In general, a cluster means a "small group". Cluster computing is a group of interconnected computers or servers working together to perform a task or solve a problem. It refers to multiple computers connected to a network that function as a single entity. It allows for the distribution of the computational load across multiple machines, enabling faster processing and increased computational power. 39
In a cluster computing setup, each computer in the cluster, also known as a node, works in parallel with the other nodes to handle different parts of a larger problem or workload. The nodes are connected through a high-speed network and communicate with each other to coordinate their tasks. Each node performs a dedicated task.
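As a rough, single-machine analogy for splitting a larger workload across nodes (not an actual cluster framework), the sketch below hands chunks of a computation to separate worker processes with Python's multiprocessing module; the chunking scheme and the sum-of-squares task are illustrative choices only.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """Each worker process plays the role of a node handling its part of the workload."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the workload into four chunks, mimicking how work is handed out to nodes.
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)  # chunks are processed in parallel
    print(sum(partial_results))  # combine the partial results into the final answer
```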
Many nodes are each connected to a single node called the head node. Accessing a cluster system typically means accessing a head node or gateway node. A head node or gateway node is set up to be the launching point for jobs running on the cluster and the main point of access to the cluster. 40
A classic cluster essentially allows the nodes to share infrastructure, such as disk space, and to collaborate by sharing program data while those programs are running. Cluster computing offers solutions to complicated problems by:
- Providing faster computational speed, and
- Enhancing data integrity.
- Data integrity refers to the overall accuracy, completeness, and consistency of data. 41

Big Data Cluster System
In big data, individual computers are often inadequate for handling the data at most stages. Therefore, the high storage and computational needs of big data are addressed through computer clusters. A big data cluster system is a specialized type of cluster computing system designed to manage and process large volumes of data. The primary goal of a big data cluster system is to enable scalable and distributed processing of big data across multiple nodes within the cluster. 42
Big data clustering software that combines the resources of many smaller machines offers several benefits. Some examples of big data clustering software/tools include Hadoop's YARN (Yet Another Resource Negotiator), Qubole, HPCC, Cassandra, MongoDB, Apache Storm, CouchDB, and Statwing. 43
Using a big data cluster provides solutions for:
- Managing cluster membership,
- Coordinating resource sharing, and
- Scheduling actual work on individual nodes.
Cluster membership and resource allocation can be handled by software like Hadoop's YARN (Yet Another Resource Negotiator). The assembled computing cluster often acts as a foundation that other software interfaces with to process the data. Additionally, the machines in the computing cluster typically manage a distributed storage system. 44

Benefits of Big Data Clustering Software
1. Resource Pooling:
- Involves combining the available storage space to hold data.
- Encompasses CPU and memory pooling, which are crucial for processing large datasets that require substantial amounts of these two resources.
2. High Availability:
- Clusters offer varying levels of fault tolerance and availability.
- They help guarantee that hardware or software failures do not affect access to data and processing.
- This becomes increasingly important as real-time analytics continue to be emphasized. 45
3. Easy Scalability:
- Clusters facilitate easy horizontal scaling by adding more machines to the group.
- This allows the system to adapt to changes in resource demands without needing to increase the physical resources of individual machines. 46

Hadoop
Hadoop is an open-source framework designed for the distributed storage and processing of large datasets across clusters of computers, simplifying interaction with big data. It was inspired by a technical document published by Google.
- Open-source software allows anyone to inspect, modify, and enhance its source code.
- This development and distribution model provides the public with access to the underlying (source) code of a program or application.
- Source code refers to the part of software that typical computer users don't see; it's the code that programmers can modify to alter how a piece of software, such as a program or application, functions.
- A software framework is an abstraction that provides generic functionality, allowing users to extend it with additional code to create application-specific software. 47

Characteristics of Hadoop
Economical:
- Hadoop systems are highly economical because they can utilize ordinary computers for data processing.
Reliable:
- Hadoop is reliable due to its ability to store data copies on different machines, making it resistant to hardware failures.
Scalable:
- Hadoop is easily scalable both horizontally and vertically, allowing the framework to expand with the addition of extra nodes.
Flexible:
- Hadoop is flexible, enabling storage of both structured and unstructured data for future use as needed. 48

Hadoop Ecosystem
Hadoop has an ecosystem that has evolved from its four core components:
1. Data management: involves handling the storage, organization, and retrieval of data.
2. Data access: enables users to interact with and retrieve data stored in Hadoop.
3. Data processing: enables the execution of computations and analytics on large datasets.
4. Data storage: focuses on efficient and scalable ways to store and manage data.
The Hadoop ecosystem is continuously expanding to meet the needs of big data. These components work together to provide a comprehensive ecosystem for managing, accessing, processing, and storing big data. The ecosystem offers a range of tools and technologies to address different aspects of the data lifecycle and cater to various use cases in big data analytics. 49

Components of the Hadoop Ecosystem
Hadoop Distributed File System (HDFS): A distributed file system that provides reliable and scalable storage for big data across multiple machines.
Yet Another Resource Negotiator (YARN): The resource management framework in Hadoop that manages resources and schedules tasks across the cluster, enabling multiple processing engines to run on Hadoop.
MapReduce: A programming model and processing engine in Hadoop for the parallel processing of large datasets by dividing tasks into map and reduce phases.
Spark: A fast and general-purpose cluster computing system with in-memory processing capabilities. It integrates seamlessly with Hadoop and offers higher performance for certain workloads. 50
Pig: A high-level scripting platform in Hadoop that simplifies data processing tasks using a language called Pig Latin, abstracting away the complexities of MapReduce.
Hive: A data warehouse infrastructure built on Hadoop, providing a high-level query language (HiveQL) for querying and analyzing data stored in Hadoop.
HBase: A distributed and scalable NoSQL database that runs on top of Hadoop. It offers random, real-time access to big data and is suitable for low-latency read/write operations.
Mahout: A library of machine learning algorithms for Hadoop, providing scalable implementations for tasks like clustering, classification, and recommendation. 51
MLlib: A machine learning library in Spark that offers a rich set of algorithms and tools for scalable machine learning tasks, including data preprocessing, feature extraction, model training, and evaluation.
Solr: An open-source search platform based on Apache Lucene, providing powerful search capabilities such as full-text search, faceted search, and real-time indexing.
Lucene: A Java library for full-text search, providing indexing and searching functionalities and serving as the core technology behind search-related applications like Solr and Elasticsearch.
ZooKeeper: A centralized coordination service for distributed systems, offering reliable infrastructure for maintaining configuration information, synchronizing processes, and managing distributed locks.
Oozie: A workflow scheduling system for Hadoop, enabling users to define and manage workflows that coordinate the execution of multiple Hadoop jobs, automating complex data processing pipelines. 52 53

Big Data Life Cycle with Hadoop
1. Ingesting data into the system:
- The first stage of big data processing with Hadoop is ingestion.
- Data is transferred to Hadoop from various sources such as relational databases, systems, or local files.
- Sqoop facilitates data transfer from an RDBMS to HDFS, while Flume handles event data transfer. 54
2. Processing the data in storage:
- The second stage involves processing and storing.
- Data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.
- Data processing is carried out using Spark and MapReduce. 55
3. Computing and analyzing data:
- The third stage is analysis.
- Data is analyzed using processing frameworks like Pig, Hive, and Impala.
- Pig employs map and reduce techniques for data conversion and analysis, while Hive, based on map and reduce programming, is well-suited for structured data. 56
4. Visualizing the results:
- The final stage, access, involves tools such as Hue and Cloudera Search.
- Here, users can access the analyzed data and visualize the results. 57 58
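To make the map and reduce phases mentioned in stage 2 concrete, here is a framework-free sketch of the MapReduce word-count idea in plain Python; a real Hadoop or Spark job would run the same two phases distributed across the cluster nodes, and the sample documents below are made up for illustration.

```python
from collections import defaultdict

documents = ["big data needs big clusters", "hadoop processes big data"]

# Map phase: emit (word, 1) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/group: collect all values that share the same key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. {'big': 3, 'data': 2, ...}
```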