Chapter 1 Introduction to Data Science
Philadelphia University
Dr. Bilal Al-Ifan
Summary
This document introduces the fundamental concepts of data science, covering what data is, the different types of data, and how to manage and analyze large datasets (big data). It is university course material that focuses on both the concepts and the practical aspects of data science.
Full Transcript
DATA SCIENCE FUNDAMENTALS (074114100)
CHAPTER 1: INTRODUCTION TO DATA SCIENCE
Prepared by Dr. Bilal Al-Ifan for the use of the Data Science Fundamentals course at Philadelphia University. The slides include materials from the Data Science and Big Data Analytics lecture notes (University of Göttingen) and the Data Science Fundamentals lecture notes (Philadelphia University).

OUTLINE
What is Data
What is Big Data
The 3 Vs in Big Data
The 5 Vs in Big Data
Data Types
What to do with data?

WHAT IS DATA?
Wikipedia: Data (singular: datum) are individual units of information. A datum describes a single quality or quantity of some object.
Another definition: Data is a collection of facts, such as numbers, words, measurements, observations or even just descriptions of things.

DATA ALL AROUND
Lots of data is being collected and warehoused:
Web data
Telecom
Bank/credit transactions
Online trading and purchasing
Social networks

Data Sources
The data used for analysis can come from many different sources and be presented in various formats.
Websites can register each click from the user.
Smartphones can register a great deal of information about their users, such as location, purchasing habits, interests, and much more.
Smart watches can register information about the user's status, such as heart rate, sleep patterns, and movement behavior.
Smart cars collect information about routes and drivers' behavior.
Smart homes collect data on many activities performed by householders.
Smart marketers collect purchasing habits.
The internet itself holds a huge amount of data about almost everything, such as sports, music, government statistics, etc.
Financial transactions, bank/credit transactions
Online trading and purchasing
Social networking

Data Warehousing
A data warehouse is constructed by integrating data from multiple heterogeneous sources.

How Much Data We Have?
Google processes 20 PB a day (2008)
Facebook has 60 TB of daily logs
eBay has 6.5 PB of user data + 50 TB/day (2009)
Cost of 1 TB of disk: $35
Time to read 1 TB disk: 3 hours (100 MB/s)

Size Of Data

DATA IS THE NEW OIL
It's valuable, but if unrefined it cannot really be used.
What to do with the collected data?
How to utilize data?

DIGGING FOR DATA: DATAFICATION
According to K. Cukier and V. Mayer-Schoenberger (2013), "The Rise of Big Data", datafication is:
A technological trend / process of turning many aspects of our life into data.
Once we datafy things, we can transform their purpose and turn the information into new forms of value.

DATAFICATION EXAMPLE 1
Social platforms: Facebook
What data can we collect? What benefits can we get?
Data about our actions and friendships is collected and monitored in order to market products and services to us.

DATAFICATION EXAMPLE 2
Banking
What data can we collect? What benefits can we get?
Data such as income, gender, age, etc. can be used to determine the likelihood of a person paying back a loan.

DATAFICATION EXAMPLE 3
Life insurance industry
What data can we collect? What benefits can we get?

1 minute break

Why Big Data Is A Problem

What Is Big Data ??!!
Is this really about size?

Naive Definition
Naive definition: Big data only depends on the data size
1 Gigabyte?
1 Terabyte?
1 Petabyte?
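The sizes named in the naive definition can be related to the disk-read figure quoted earlier (1 TB at 100 MB/s taking about 3 hours) with a few lines of arithmetic. The following is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and a constant sequential throughput; the function name read_time_hours is hypothetical and not part of the course material.

```python
# Decimal (SI) storage units, in bytes.
# Assumption: the slides use decimal units (1 TB = 10**12 bytes).
MB = 10**6
GB = 10**9
TB = 10**12
PB = 10**15


def read_time_hours(size_bytes: int, throughput_bytes_per_s: int = 100 * MB) -> float:
    """Hours needed to read size_bytes sequentially at a constant throughput."""
    seconds = size_bytes / throughput_bytes_per_s
    return seconds / 3600


if __name__ == "__main__":
    for name, size in [("1 GB", GB), ("1 TB", TB), ("1 PB", PB)]:
        print(f"{name}: {read_time_hours(size):.2f} hours at 100 MB/s")
```

Under these assumptions 1 TB takes about 2.8 hours, consistent with the rounded figure of 3 hours, while 1 PB would take roughly 116 days on a single disk. This is one reason why size alone already forces a different way of processing.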
Naive interpretation misses important aspects
Time: Analyzing 1 Gigabyte of data per day is different from analyzing 1 Gigabyte of data per second.
Diversity: Analyzing spreadsheets with numeric data is different from analyzing Web pages that contain a mixture of text and images.
Distribution: Analyzing data from a single source is different from analyzing data from multiple sources.

Definition of Big Data
Following Gartner's IT Glossary: Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
In simple words, big data is any data that is expensive to manage and hard to extract value from.

The three Vs
Volume: The size of the data
Velocity: The latency of data processing relative to the growing demand for interactivity
Variety: The diversity of sources, formats, quality, structures

Some people actually use 10 Vs to define big data! In addition to the three Vs, these are:
Variability
Veracity
Validity
Vulnerability
Volatility
Visualization
Value

The Three Vs in Big Data

The 3 Vs: Volume
Big data is any set of data that is so large that the organization that owns it faces challenges related to storing or processing it.
In practice, the following trends are generating this much information: e-commerce, mobility, social media and the Internet of Things (IoT).

The 3 Vs: Velocity
It refers to the speed at which new data is created.
It refers to the speed at which data must be processed and analysed (often close to real-time).
If your organization is generating new data at a rapid pace and needs to respond in real time, you have the velocity associated with big data.
Most organizations that are involved in e-commerce, social media, or IoT satisfy this criterion for big data.
In the year 2000, Google was receiving 32.8 million searches per day!
As of 2018, Google was receiving 5.6 billion searches per day!
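To make the velocity figures concrete, the daily search counts can be converted into an average rate per second. This is a small sketch of that arithmetic only; it assumes searches are spread evenly over the day, which real traffic is not, and the helper name per_second is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(events_per_day: float) -> float:
    """Average events per second, assuming an even spread over one day."""
    return events_per_day / SECONDS_PER_DAY


searches_2000 = 32.8e6  # searches per day in 2000 (figure quoted in the slides)
searches_2018 = 5.6e9   # searches per day in 2018 (figure quoted in the slides)

if __name__ == "__main__":
    print(f"2000: about {per_second(searches_2000):,.0f} searches per second")
    print(f"2018: about {per_second(searches_2018):,.0f} searches per second")
    print(f"Growth factor: about {searches_2018 / searches_2000:.0f}x")
```

With the quoted figures this works out to roughly 380 searches per second in 2000 against roughly 65,000 per second in 2018, an increase of about 170 times.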
Approximate monthly active users as of 2018:
Facebook: 2.41 billion
Instagram: 1 billion
Twitter: 320 million
LinkedIn: 575 million

The 3 Vs: Variety
It refers to structured, semi-structured and unstructured data, arising from different sources of data generated either by humans or by machines.
In simple words: diversity in data types and data sources.
If your data resides in many different formats, it has the variety associated with big data. For example, big data stores typically include e-mail messages, word processing documents, images, video and presentations, as well as data that resides in structured relational database management systems (RDBMSes).

THE 5 V'S IN BIG DATA
Veracity
Value

VERACITY (reliability)
It refers to the assurance of the following aspects of the data:
Quality
Integrity
Credibility (trusted)
Accuracy (missing values and outliers)
Since the data is collected from multiple sources, we need to check the data for accuracy before using it for business insights.

VALUE
Collecting lots of data is not enough: the data is of no value unless we garner some insight out of it.
Value refers to how useful the data is in decision making. We need to extract the value of big data using proper analytics.

THE 5 V'S
Volume: Raw data
Velocity: Change over time
Variety: Data types / sources
Veracity: Data quality
Value: Information for decision making

Examples For Data Types
Structured: Data with defined data types and structure. Example: comma separated values
Quasi-Structured: Textual data with erratic formats that can be formatted with effort. Example: Clickstream data
Semi-Structured: Textual data with parseable pattern. Example: XML files (files to store and transport data)
Unstructured: Data that has no inherent structure, often with multiple formats. Example: Web site, videos

STRUCTURED DATA
Structured data is organized and easier to work with.
e.g., tables.
This data can be directly analysed. Pre-processing is only required for data cleaning, e.g., the detection of invalid data or outliers.
Structured data is the traditional kind of data: it is organized and conforms to a formal structure. This data can be stored in a relational database.
Example: Bank statement containing date, time, amount, etc.

SEMI-STRUCTURED DATA
Semi-structured data: e.g., XML, JSON files.
The main difference between structured and semi-structured data is that semi-structured data formats are often more flexible. For example, each row in the table of a relational database must have values for exactly the same columns. With XML, the fields are usually similar, but may have structural differences, e.g., due to optional fields.
It is semi-organized data. It does not conform to the formal structure of data.
Example: log files, JSON files, sensor data

UNSTRUCTURED DATA
It is not organized data and does not fit into the rows-and-columns structure of a relational database.
Example: Text files, e-mails, images, videos, voicemails, audio files, etc.

TYPES OF DATA WE HAVE
Relational data (tables / transactions): databases
Text data: Web
Semi-structured data: XML files
Graph data: social networks, Semantic Web (RDF)
Streaming data: real-time data

WHAT TO DO WITH THESE DATA?
1. Aggregation and statistics (data aggregation is the process where raw data is gathered and expressed in a summary form for statistical analysis, in order to draw useful conclusions). Data warehousing and OLAP (Online Analytical Processing)
2. Indexing, searching, and querying (indexing is the way to get an unordered table into an order that will maximize the query's efficiency while searching). Keyword-based search; pattern matching (XML/RDF)
3. Knowledge discovery: data mining, statistical modelling

DATA AGGREGATION
Data aggregation is the compiling of information from sources with the intent to prepare combined datasets for data processing.

DATA INDEXING

KNOWLEDGE DISCOVERY

Any Questions?
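The first two uses of data listed above, aggregation and indexing, can be illustrated on a toy dataset. The sketch below is only an illustration in plain Python: the transaction records, the field names and the helpers aggregate_by_customer and build_index are all made up for this example and do not come from the course material. It also uses a simple dictionary (hash) index rather than the sorted or tree-based indexes that real databases typically use.

```python
from collections import defaultdict

# Hypothetical raw records, e.g. bank transactions (made-up data).
transactions = [
    {"id": 1, "customer": "alice", "amount": 120.0},
    {"id": 2, "customer": "bob", "amount": 75.5},
    {"id": 3, "customer": "alice", "amount": 30.0},
    {"id": 4, "customer": "carol", "amount": 210.0},
    {"id": 5, "customer": "bob", "amount": 24.5},
]


def aggregate_by_customer(records):
    """Aggregation: summarise raw records as count, total and mean per customer."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        totals[rec["customer"]] += rec["amount"]
        counts[rec["customer"]] += 1
    return {
        name: {
            "count": counts[name],
            "total": totals[name],
            "mean": totals[name] / counts[name],
        }
        for name in totals
    }


def build_index(records, key):
    """Indexing: map each key value to the records holding it, so that a
    lookup does not have to scan the whole table."""
    index = defaultdict(list)
    for rec in records:
        index[rec[key]].append(rec)
    return index


if __name__ == "__main__":
    summary = aggregate_by_customer(transactions)
    print(summary["alice"])

    by_customer = build_index(transactions, "customer")
    print([rec["id"] for rec in by_customer["bob"]])
```

The summary turns five raw rows into one line per customer, which is the kind of condensed form that statistical analysis starts from. The index trades some extra memory for lookups that go straight to the matching rows instead of scanning every record.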