Introduction to Data Science PDF
Prof Dr. Christina Abert, Dr. Wael Karam
Summary
This document provides an introduction to data science, covering topics like data formats, types of data, and various forms of data analysis including data mining and visualization. It also covers business intelligence and its relation to data science.
Full Transcript
Introduction to Data Science
PROF DR. CHRISTINA ABERT, DR. WAEL KARAM

Contents
1. Introduction to Data Science
► Data
► Analytics
► Extract, Transform, and Load (ETL) Process
► Data Science
2. Applications of Data Science
3. Data Collection Methods
► Data Collection Strategies
► Characteristics of Good Measures
► Quantitative and Qualitative Data
► Tools for Collecting Data
4. Data Preprocessing
► Data Cleaning
► Data Integration
► Data Reduction
► Data Transformation and Data Discretization
5. Widely Used Techniques in Data Science
► Supervised Learning
► Association Rule
► Unsupervised Learning
► Reinforcement Learning
► Pattern Recognition
► Intro to Deep Learning
6. Text Mining and Sentiment Analysis
► Text Mining
► Text Categorization
► Representation of Document
► Text Document Preprocessing
► Sentiment Analysis
7. Natural Language Processing (NLP)
► Natural Language Processing (NLP)
► Phases of NLP
► Steps of NLP
► Applications of NLP
8. Data Visualization in Data Science
9. Generative AI (ChatGPT)
► Prompt Engineering
10. Data Ethics

Introduction to Data Science
► Midterm 15
► Class work 10
► Project 15
► Final 60

Introduction to Data Science
► Data
► Data Formats
► Analytics
► Big Data
► Types of Data Analytics
► Business Intelligence

Data
► Data is a collection of information.
► This information is either transmitted or stored.
► Data comes in numerous forms.
► Any kind of information, whether numbers, text, or pictures, is termed data.

Types of data
Data comes in different types. Some of the common types of data include:
► Text
► Image
► Video
► Numbers
► Spreadsheets
► Sound

Qualitative vs Quantitative data
► Qualitative: Qualitative data is data that is descriptive information. For example, "What a nice day it is".
► Quantitative: Quantitative data is data that is a piece of numerical information. For example, "1", "3.65", etc.
Quantitative data can be of two types.

Discrete vs Continuous data
► Discrete: Can be expressed as a specific value. For example, "Number of months in a year", "Number of members in a family", etc.
► Continuous: Can be any value in an interval. For example, "The amount of oxygen in the atmosphere", "Age of members in a family".

Data Formats
Data can be categorized into two groups:
► Structured data
► Non-structured data
– Semi-Structured
– Quasi-Structured
– Unstructured

► Structured: Data with defined types and structure. Example: comma separated values.
► Semi-Structured: Textual data with a parseable pattern. Example: XML files with schema.
► Quasi-Structured: Textual data with erratic formats that can be formatted with effort. Example: clickstream data.
► Unstructured: Data that has no inherent structure, often with multiple formats. Example: web sites, videos.

Data Formats
► Structured data is organized and easier to work with.
► Unstructured data is not organized.
► We must organize the data for analysis purposes.

Examples for Data Types (figure panels: Structured, Quasi-Structured, Unstructured, Semi-Structured)

Analytics
► Data is not useful until we extract knowledge from it and use that knowledge to make decisions.
► This is where the data analytics process steps in.
► Data analytics is the process of taking data (in any form), processing it through various tools and techniques, and then extracting useful knowledge from it. This knowledge ultimately helps in decision making.

Analytics
► From basic statistical measures (e.g., means, medians, and variances) to advanced data mining and machine learning techniques, each step transforms the data to extract knowledge.
► This process of data analytics has also opened the door to new ideas.
► e.g., how to mimic the human brain with a computer (artificial neural network).

Analytics + Information (figure)

Big Data
► Distributed Data
► The shift to big, distributed data has impacted all aspects related to data, including storage mechanisms, processing approaches, and knowledge extraction.
► Conventional software applications are not sufficient to cope with the size and nature of big data.
► Security and availability of cloud data
► Hadoop, based on MapReduce, is one of the common platforms for processing and managing such data.
► Data is stored on different systems as needed, and the processed results are then integrated.

Big Data Characteristics
► Volume: We deal with petabytes, exabytes, and zettabytes of data.
► Velocity: Data is generated at an immense rate.
► Veracity: Refers to bias and anomalies in big data.
► Variety: The number of types of data.
► Variability: Refers to the number of inconsistencies in the data.
► Value: Refers to the benefits that big data can provide.
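The MapReduce idea behind Hadoop noted under Big Data (data stored on different systems, processed separately, and the results then integrated) can be sketched on a single machine as follows. This is only an illustrative sketch and does not use Hadoop itself; the partition contents and the function names `map_partition` and `reduce_counts` are hypothetical.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: each inner list stands for records held on a different system.
partitions = [
    ["text", "image", "text"],
    ["video", "text", "image"],
    ["sound", "video"],
]


def map_partition(records):
    """Map step: count record types locally within one partition."""
    return Counter(records)


def reduce_counts(left, right):
    """Reduce step: integrate two partial results into one."""
    return left + right


# Each partition is processed independently, then the partial results are merged.
partial_results = [map_partition(p) for p in partitions]
totals = reduce(reduce_counts, partial_results, Counter())
print(dict(totals))
```

In a real cluster the map step would run on the machines that hold the data, and only the small partial counts would travel over the network to be merged.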
Characteristics of Big Data
► Volume: Massive volumes of data. Challenges in storage and analysis.
► Velocity: Rapidly changing data. Challenges in real-time analysis.
► Variety: Diverse data from numerous sources.
► Variability: Constantly changing meaning of data.
► Veracity: Varying quality and reliability of data.
► Value: Cost-effectiveness and business value.
► Further challenges named on the slide: integrating, transforming, analyzing, and interpreting the data; gathering and trusting the data.

Big Data Vs Small Data

Role of Data Analytics
► A data analyst works with data in various ways.
► The work may include data storage, data cleansing, data mining for knowledge extraction, and finally presenting the knowledge through measures and figures.
► Data mining forms the core of the entire data analytics process.
► It may include extracting data from heterogeneous sources including texts, videos, numbers, and figures.
► The data is extracted from the sources, transformed into a form that can be processed easily, and finally loaded so that the required processing can be performed.

Role of Data Analytics
► Statistics and machine learning help in the analysis and extraction of knowledge from the data.
► The resulting models are later used for analysis and prediction.
► Many tools and libraries are available for this purpose, including R and Python.
► The final phase in data analytics is data presentation.
► Data presentation involves visual representations of the results so that the customer can understand them.

Types of Data Analytics

Descriptive analytics
► Descriptive analytics helps to find out "What happened" or "What is happening".
► In simple words, these techniques take the raw data as input and summarize it in the form of knowledge useful for customers.
► e.g., it may help find out the total time the company spent on each customer, or the total sales made by each region in a certain season.
► Results are generated in visual form for better understanding by the customer.

Diagnostic analytics
► Diagnostic analytics helps in analyzing "Why did it happen?" by performing analysis on historical and current data.
► It explains why a certain event happened at a certain point in time.
► For example, we can find out the reasons for a certain drop in sales over the third quarter of the year.
► Special measures and metrics can be defined for this purpose, e.g., yield per quarter, profit per six months, etc.
► Overall, the process is completed in three steps:
– Data collection
– Anomaly detection
– Data analysis and identification of the reasons

Predictive analytics
► Predictive analytics, as the name indicates, helps in predicting the future. It helps in finding "What may happen".
► Using current and historical data, predictive analytics finds patterns and trends with statistical and machine learning techniques and tries to predict whether the same circumstances may occur in the future.
► Various machine learning techniques, such as artificial neural networks and classification algorithms, may be used.
► The overall process comprises the following steps:
– Data collection
– Anomaly detection
– Application of machine learning techniques to predict patterns
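The three predictive analytics steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended method: the quarterly sales figures are invented, the anomaly rule (z-score above 2) is an assumed threshold, and a least-squares straight line stands in for the "machine learning techniques" step.

```python
from statistics import mean, stdev

# Step 1: data collection. Hypothetical sales per quarter (illustrative values only).
sales = [100.0, 110.0, 120.0, 500.0, 140.0, 150.0, 160.0, 170.0]

# Step 2: anomaly detection. Flag quarters whose z-score exceeds an assumed threshold.
THRESHOLD = 2.0
mu, sigma = mean(sales), stdev(sales)
anomalies = [q for q, s in enumerate(sales, start=1) if abs(s - mu) / sigma > THRESHOLD]
clean = [(q, s) for q, s in enumerate(sales, start=1) if q not in anomalies]

# Step 3: fit a simple trend model (least-squares line) on the clean data.
xs = [q for q, _ in clean]
ys = [s for _, s in clean]
x_bar, y_bar = mean(xs), mean(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

# Use the fitted line to predict the next, unseen quarter.
next_quarter = len(sales) + 1
forecast = intercept + slope * next_quarter

print(f"Anomalous quarters: {anomalies}")
print(f"Forecast for quarter {next_quarter}: {forecast:.1f}")
```

With these made-up numbers, quarter 4 is excluded as an anomaly before fitting, so the single spike does not distort the trend used for the forecast.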
Prescriptive analytics
► Prescriptive analytics, as the name implies, prescribes the necessary actions to take in case of a certain predicted event.
► e.g., what should be done to increase the predicted low yield in the last quarter of the year?
► What measures should be taken to increase sales in the off season?

Challenges of Data Analytics
► Large Volumes of Data
► Processing Real-Time Data
► Visual Representation of Data
► Data from Multiple Sources
► Inaccessible Data
► Poor Quality Data
► Higher Management Pressure
► Lack of Support
► Budget
► Shortage of Skills

Top Tools in Data Analytics
► R programming
► Python
► Tableau
► SAS
► Microsoft Excel
► RapidMiner
► KNIME
► Orange

Business Intelligence (BI)
► Business intelligence deals with analyzing data and presenting the extracted information so that the business can make decisions and act on them.
► It is a process that includes the technical infrastructure to collect, store, and analyze data for different business-related activities.

Objectives of BI
► Effective decision making
► Business process optimization
► Enhanced performance and efficiency
► Increased revenues
► Potential advantages over competitors
► Making effective future policies

Considerations of the Business Intelligence Process
► Accuracy
► Valuable Insight
► Timeliness
► Actionable

Accuracy
► The process depends on the accuracy of both the input data and the produced output.
► Input data may contain missing values, redundant values, and outliers.
► All of these significantly affect the accuracy of the process.
► For this purpose, we need to apply different cleansing techniques as required to ensure the accuracy of the input data.

Valuable Insight
► The process should generate valuable insight from the data.
► The insight generated by the business intelligence process should be aligned with the requirements of the business to help it make effective future policies.
► e.g., information about customers' medical conditions is more valuable to a medical store owner than to a grocery store owner.

Timeliness
► Generating valuable insight is important, but the insight should also be generated at the right time.
► For example, if the system for the medical store discussed above does not generate, ahead of the season, the proportion of people who may be affected by pollen allergy in the upcoming spring, the store may fail to get the full benefit of the process.
► So, generating the right insight at the right time is essential.
► Note that timeliness here refers to both the timely availability of the input data and the timeliness of the insight generated.

Actionable
► The insight provided by the business intelligence process should always consider the organization's context so that it can actually be implemented.
► e.g., although the process may report the maximum quantity of pollen-allergy medicines the medical store should keep on hand, the store's budget and other constraints, such as the possible sales limit, should also be considered for effective decision making.

Business Intelligence Process

Business Intelligence Steps
1. Data Gathering: Collecting the data and cleansing it to convert it into a format suitable for BI processing.
2. Analysis: Processing the data to get insight from it.
3. Action: Acting in accordance with the information analyzed.
4. Measurement: Measuring the results of the actions against the required output.
5. Feedback: Using the measurements to improve the BI process.

Data Analytics VS Data Analysis
Data Analytics
► Data analytics is the process of making decisions from data.
► Data collection and general analysis.
► Tools: Python, R, and TensorFlow.
► Deals with the complete management of data, including collection, organization, and storage.
Data Analysis
► Data analysis is a sub-component of data analytics that analyzes the data to get insight.
► Collecting, cleaning, and transforming the data to get deep insight out of it.
► Tools: RapidMiner and KNIME.
► Deals with examining, transforming, and arranging given data to extract useful information.

Data Analytics Versus Data Visualization
Data Analytics
► Data analytics is the process of making decisions from data.
► Helps organizations increase operational performance, make policies, and take decisions that may provide advantages over business competitors.
► 1) Prescriptive analytics may help organizations find the available prospects and opportunities and consequently make decisions in favor of the business.
► 2) Predictive analytics may help organizations predict future scenarios by looking into the current data and analyzing it.
► Data analytics deals with tools, techniques, and methods to derive deep insight from the data by finding the relationships in it.
Data Visualization
► Deals with presenting the data in a format (mostly graphical) that is easy to understand.
► Helps organization management visually perceive the analytics and concepts present in the data.
► 1) Static visualizations normally provide only the single view the visualization is intended for. Normally the user cannot see beyond the lines and figures.
► 2) Interactive visualizations let the user interact with the visualization and obtain a view that matches their specified requirements.
► Data visualization techniques such as charts and graphs may help reveal the trends and relationships in the data. Visualization is part of the output of the analytics process.

Data Analyst Versus Data Scientist
Data Analyst
► Deals with analysis of data for report generation.
► Normally looks into known information from a new perspective.
► Uses statistics, mathematics, and various data representation and visualization techniques.
► The job includes data analysis and visualization.
► Complex.
► Deals with structured data.
Data Scientist
► Research-oriented job responsible for understanding the data and its relationships.
► Finds the unknown from the data.
► Uses advanced data science programming languages and libraries such as Python, R, and TensorFlow.
► Includes the skills to understand the data and find the relationships for deep insight.
► More complex.
► Deals with structured, unstructured, and hybrid data.

Data Analytics Versus Business Intelligence
Data Analytics
► The process of finding the relationships between data to get deep insight.
► Deals with gathering, cleaning, modeling, and using data as per business needs.
► Looks into the future and tends to answer questions like: When will it happen again? What will be the consequences? How much will sales increase if we take this action?
► Deals with tools and techniques like text mining, data mining, and big data analytics.
Business Intelligence
► A process that supports decision making based on the historical information of the business.
► Helps in decision making for further growth of the business.
► Looks into the past and tends to answer questions like: What happened? When did it happen? How many times?
► Deals with tools and techniques like reporting, dashboards, and scorecards.

Data Analysis Versus Data Mining
Data Analysis
► Analyzes the data to get the required insight.
► Requires a skill set including statistics, mathematics, machine learning, and subject knowledge.
► A data analyst collects, cleans, and transforms the data to get deep insight out of it.
Data Mining
► The process of finding the existing patterns in data.
► Requires skills like mathematics, statistics, and machine learning.
► A data mining person is responsible for mining patterns in the data.