Emerging Chapter 2: Data Science PDF
Document Details
Uploaded by MindBlowingSyntax5078
Addis Ababa University
Summary
This document provides an overview of data science concepts, focusing on definitions of data and information, data processing, data value chains, and the Hadoop ecosystem. It details the different forms of data and how these are used and managed, with a particular focus on big data characteristics and technologies.
Full Transcript
CHAPTER TWO: Data Science

Objectives:
❑ Describe what data science is and the role of data scientists.
❑ Differentiate data and information.
❑ Describe the data processing life cycle.
❑ Describe the data value chain in the emerging era of big data.
❑ Understand the basics of big data.
❑ Describe the purpose of the Hadoop ecosystem components.

An Overview of Data Science
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data. Data scientists must possess a strong quantitative background in statistics and linear algebra as well as programming knowledge, with a focus on data warehousing, mining, and modeling, to build and analyze algorithms.

Definition of data and information
Data: facts and statistics collected together for reference or analysis. Data is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, =, etc.).
Information: organized or classified data that has some meaningful value for the receiver. Information is the processed data on which decisions and actions are based.
Data can be described as unprocessed facts and figures, which on their own cannot help in decision-making. Information is interpreted data, created from organized, structured, and processed data in a particular context. In other words, data is the raw material that is organized, structured, and interpreted to create useful information.
For a decision to be meaningful, the processed data must satisfy the following characteristics:
1. Timely − Information should be available when required.
2. Accuracy − Information should be accurate.
3. Completeness − Information should be complete.

Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose. Data processing passes through three steps:
Input: collecting data from the external world. It may be done in many forms.
Processing: the input data is changed to produce data in a more useful form.
Output: the result of the preceding processing step is collected.

Data types and their representation
In computer programming, a data type is an attribute of data which tells the compiler or interpreter how the programmer intends to use the data. Common data types include:
❖ Integers, e.g. 1, 2, 3
❖ Booleans, e.g. true or false
❖ Characters, e.g. a, b, c, d, A
❖ Floating-point numbers, e.g. 0.35, 1.75
❖ Alphanumeric strings, e.g. ab12cd

Data types from a data analytics perspective (a small illustrative sketch follows this section):
Structured data: data that has been organized into a formatted repository, typically a database, so that its elements can be made addressable for more effective processing and analysis, e.g. Excel files or SQL databases.
Semi-structured data: a form of structured data that does not obey the formal structure of data models associated with relational databases or other forms of data tables, but contains tags or other markers to separate semantic elements and fields within the data, e.g. JSON and XML.
Unstructured data: information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may also contain data such as dates, numbers, and facts. Common examples of unstructured data include audio and video files or NoSQL databases.
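These three forms can be illustrated with a short Python sketch. This is only a minimal, hypothetical example: the records, field names, and values are invented for illustration. Structured data appears as tabular CSV rows with a fixed schema, while semi-structured data appears as a JSON document whose fields are separated by tags (keys) rather than forced into one rigid table layout.

    import csv
    import io
    import json

    # Structured data: tabular rows with a fixed schema (tiny in-memory CSV).
    csv_text = "name,score\nAbebe,85\nSara,92\n"
    rows = list(csv.DictReader(io.StringIO(csv_text)))    # each row maps column name -> value
    print(rows[0]["name"], rows[0]["score"])              # Abebe 85

    # Semi-structured data: a JSON document; keys mark the fields, and records
    # may nest or vary in shape instead of fitting one rigid table.
    record = json.loads('{"name": "Abebe", "score": 85, "courses": ["Math", "Physics"]}')
    print(record["courses"][0])                           # Math

    # The primitive data types listed above, as Python sees them:
    print(type(85), type(0.35), type(True), type("ab12cd"))  # int, float, bool, str

Unstructured data (free text, audio, video) would instead require parsing or feature extraction before it can be addressed in this way.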
Metadata – Data about Data
Metadata is data that describes other data. "Meta" is a prefix that in most information technology usage means "an underlying definition or description." Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier, and it is frequently used by big data solutions for initial analysis. In a set of photographs, for example, metadata could describe when and where the photos were taken.

Data Value Chain
The data value chain describes the process of data creation and use, from first identifying a need for data to its final use and possible reuse. Its main stages are the following (a minimal worked sketch of these stages follows this section):
Data acquisition (DAQ): the process of gathering, filtering, and cleaning data before it is put in a data warehouse. It is one of the major big data challenges in terms of infrastructure requirements.
Data analysis: the process of evaluating data using analytical and statistical tools to discover useful information and aid business decision-making. It involves exploring, transforming, and modeling data with the help of data mining, business intelligence, and machine learning.
Data curation: the active management of data over its life cycle to ensure that data quality requirements are met for its effective usage, e.g. content creation, selection, classification, and validation. Data curators (scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose.
Data storage: the recording (storing) of information (data) in a storage medium. Relational Database Management Systems (RDBMS) have been the main solution to the storage problem; however, the ACID (Atomicity, Consistency, Isolation, and Durability) properties that guarantee database transactions make them unsuitable for big data.
Data usage: covers the data-driven business activities that need access to data, and enhances competitiveness through the reduction of costs.
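The value-chain stages above can be made concrete with a minimal Python sketch. It is only an illustrative example that assumes the pandas library is available; the dataset, column names, and cleaning rules are invented for demonstration.

    import io
    import pandas as pd

    # 1. Acquisition: gather raw data (a tiny in-memory CSV stands in for a real source).
    raw = io.StringIO("region,sales\nNorth,120\nSouth,\nNorth,95\nEast,130\n")
    df = pd.read_csv(raw)

    # 2. Curation / cleaning: enforce quality requirements before analysis.
    df = df.dropna(subset=["sales"])        # discard incomplete records
    df["sales"] = df["sales"].astype(int)   # validate and normalise the type

    # 3. Analysis: explore and transform the data to produce decision-ready information.
    summary = df.groupby("region")["sales"].sum()
    print(summary)

    # 4. Usage: the summarised information feeds a data-driven business decision.
    print("Top region:", summary.idxmax())

In practice each stage is far larger (dedicated ingestion pipelines, curation workflows, and storage systems), but the flow from raw records to decision-ready information is the same.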
Basic concepts of big data
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. "Big data" also refers to the evolution and use of technologies that provide the right user, at the right time, with the right information from a mass of data. Big data is characterized by the 3Vs and more:
Volume: large amounts of data, from massive datasets up to zettabytes.
Velocity: data is live, streaming, or in motion.
Variety: data comes in many different forms from diverse sources.
Veracity: can we trust the data? How accurate is it?

Clustered Computing
Because of the qualities of big data, individual computers are often inadequate for handling the data at most stages. To better address the high storage and computational needs of big data, computer clusters are a better fit. Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
Resource pooling: combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important.
High availability: clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing.
Easy scalability: clusters make it easy to scale horizontally by adding additional machines to the group.
Cluster membership and resource allocation can be handled by software like Hadoop's YARN (Yet Another Resource Negotiator).

Hadoop and its Ecosystem
Hadoop is an open-source framework intended to make interaction with big data easier. It allows the distributed processing of large datasets across clusters of computers. The four key characteristics of Hadoop are:
Economical: its systems are highly economical, as ordinary commodity computers can be used for data processing.
Reliable: it is resistant to hardware failure.
Scalable: it is easily scalable, both horizontally and vertically.
Flexible: you can store as much structured and unstructured data as you need.

Hadoop comprises the following components, among many others (a conceptual sketch of the MapReduce model appears at the end of this transcript):
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: programming-based data processing
Spark: in-memory data processing
Pig, Hive: query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: machine learning algorithm libraries
Solr, Lucene: searching and indexing
ZooKeeper: cluster management
Oozie: job scheduling
Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.

Big Data Life Cycle with Hadoop
1. Ingesting data into the system: data is ingested or transferred to Hadoop from various sources such as relational database systems or local files. Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
2. Processing the data in storage: the data is stored and processed in the distributed file system.
3. Computing and analyzing data: the data is analyzed by processing frameworks such as Pig, Hive, and Impala.
4. Visualizing the results: performed by tools such as Hue and Cloudera Search; in this stage the analyzed data can be accessed by users.

End of Chapter Two
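To make the MapReduce component referenced above concrete, here is a minimal, single-machine Python sketch of the map, shuffle, and reduce phases using the classic word-count example. It is only conceptual: it does not use the Hadoop APIs, the sample documents are invented, and in a real cluster these phases run distributed across many machines over HDFS under YARN.

    from collections import defaultdict

    # Sample input "documents" (in Hadoop these would live as blocks in HDFS).
    documents = ["big data needs big clusters", "hadoop processes big data"]

    # Map phase: emit (key, value) pairs, here (word, 1) for every word.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle phase: group all values that share the same key.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce phase: aggregate the grouped values for each key.
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts)   # {'big': 3, 'data': 2, 'needs': 1, 'clusters': 1, 'hadoop': 1, 'processes': 1}

The same split into independent map tasks and key-grouped reduce tasks is what lets Hadoop scale the computation horizontally across a cluster.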