Fundamentals of Data Science and Data Mining, Chapter 1
Document Details
École Supérieure en Sciences et Technologies de l'Informatique et du Numérique
2022
Dr. Chemseddine Berbague
Summary
This document is a chapter from the Fundamentals of Data Science and Data Mining textbook. It provides an introduction to data science, including facets of data science, the data science process, and the big data and data science ecosystem. The content is structured under different headings with brief sub-topic descriptions.
Full Transcript
ÉCOLE SUPÉRIEURE EN SCIENCES ET TECHNOLOGIES DE L'INFORMATIQUE ET DU NUMÉRIQUE
FUNDAMENTALS OF DATA SCIENCE AND DATA MINING
CHAPTER 1: INTRODUCTION TO DATA SCIENCE
Dr. Chemseddine Berbague, 2022-2023

CONTENT
Facets and data types. Data science process. Big data and data science ecosystem.

THE SCIENTIFIC METHOD IN MORE DETAIL

PHASE 1: DISCOVERY

1.1 IDENTIFYING KEY STAKEHOLDERS
It is important to recognize from the beginning the stakeholders, their expectations, and the potential impact of the project results on each role. Interviewing the stakeholders allows the team to gather initial, low-cost knowledge from previous experiences and to identify the pain points on which the data science team should focus. Recognizing the stakeholders also allows the team to plan the project, set deadlines, and communicate to each role the results that concern it.

1.2 RESOURCES INVENTORY
Identify the tools and technologies required for performing the analysis, training, testing, and deployment. Evaluate the quality and the quantity of the initial data available. Approximate the amount of data required to reach the objectives of the project. Ensure that the project team has the necessary diversity and knowledge to proceed with the project. Negotiate to deal with missing resources.

1.3 FRAMING THE PROBLEM
It is important to state the analytical problem in a formal way and communicate it to each stakeholder. The document may differ slightly from one role to another, so that everyone reads the part that concerns them. In parallel, the objectives of the project should be identified and shared with every member concerned by the project. Success criteria and failure criteria should also be well established from the beginning. Teams tend to assume that every project will produce successful results; however, failure should be considered as well, to preserve time and energy.

1.4 LEARNING THE BUSINESS DOMAIN
For the data science process to succeed:
The team should have deep computational and quantitative knowledge that can be applied across many disciplines.
The team should have enough knowledge of the methods, techniques, and ways of applying heuristics to a variety of business and conceptual problems.
The team should build up the knowledge required to develop models.

1.5 INTERVIEWING THE ANALYTICS SPONSORS
"It's critical to keep the sponsor informed and involved. Show them plans, progress, and intermediate successes or failures in terms they can understand. A good way to guarantee project failure is to keep the sponsor in the dark." (source)
The analytics sponsor, or project sponsor, represents the business interest and is the role that decides whether the project is a success or a failure. In some cases, stakeholders only have a description of the problem in mind and no idea of how to solve it. In this case, the data science project team should interview the stakeholders to get, share, explain, and give the necessary answers. The interview with the project sponsor may include: a specific and accurate description of the problem and the expected solution; the available resources, such as data and initial/intuitive insights; the timeline and the project decision-making criteria (success/failure keys, etc.).

1.6 DEVELOPING INITIAL HYPOTHESES
Initial hypotheses can be identified:
By putting forward initial statistics-based hypotheses.
The data science team should be open, creative, and ambitious in raising more questions and putting forward more hypotheses about the data.
By gathering the hypotheses and ideas of the domain experts.
Hypotheses should meet the objectives of the customer and serve the data science process that answers the raised questions.

There are two types of hypotheses:
Null hypothesis: a statement of a well-established fact or real-world phenomenon.
Alternate hypothesis: a claim that has not been proven yet.
The null and alternate hypotheses are mutually exclusive and, at the same time, asymmetric.

1.6 SOME EXAMPLES OF HYPOTHESES FROM STATISTICS
Normal distribution curve: "students' test score distribution". Exponential distribution curve: "product validity (lifetime)".

1.7 IDENTIFYING POTENTIAL DATA SOURCES
Identify the data sources and the aggregation links between each data source/item. Review the data and obtain an initial overview of its quality and limitations. Understand the data structure and the necessary tools. Identify the infrastructure required to deal with the data (disks, cloud services, technologies, etc.).

PHASE 2: DATA PREPARATION

2.1 PREPARING THE ANALYTIC SANDBOX
"The Data Science Sandbox is an environment specifically designed for data science and analytics. It gives data scientists and analysts a protected, shared environment where models can be built and experiments conducted without harm to application databases." (from infogoal.com)
The working team must have a separate, static copy of the real-time data on which it can perform the analysis. This step is necessary to avoid interacting with growing/changing data. In this step, the team should collect all kinds of data (e.g., structured data from database tables, unstructured texts, logs, etc.) so it can test different hypotheses and explore the data for potential conclusions.
In this context, good communication between the data science team and the IT team of the organization is necessary to make the working plan clear, as the two teams have different views on data access rights: IT teams are more conservative about their data, while data science teams are eager to access all of the data. The sandbox may be much bigger than the original data, as specific copies can be created for specific tasks, or additional data may be considered during the analytical process.

2.2 PERFORMING ETLT
In this step, the team should ensure good access to the data sources to avoid any interruption of data access. (a) Data extraction (E), (b) data transformation (T), and (c) data loading (L) are usually performed in different orders during the analytical process (denoted ETL or ELT). The team may transform the data (join, sort, group by, etc.) depending on its objectives and needs. ELT has become more common nowadays due to the appearance of fast, high-performance cloud-based data warehouses such as Microsoft Azure.

        Data quality   Security   Data compliance   Performance                     Cost
ETL     +              +          +                 slow with unstructured data     -
ELT     -              -          -                 fast with unstructured data     +

ETLT combines the best of both ETL and ELT. Data may grow rapidly during ETLT; in this case, big data solutions may be required (so-called Big ETLT). The objective of ETLT is to collect, aggregate, and clean the data obtained from its sources and load it into a more exploitable data store. ETLT adds another transformation operation to integrate other data sources with the already transformed data. Application Programming Interfaces (APIs) are nowadays popular solutions for accessing data online (Facebook, Twitter, etc.). Should we talk about data scraping? Read this for further details and comparisons. A minimal ETL sketch is shown below.
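To make the ETL ordering concrete, here is a minimal, illustrative Python sketch (not part of the original slides): it extracts records from a CSV file, transforms them by cleaning and aggregating, and loads the result into a local SQLite store acting as the analytic sandbox. The file name sales.csv, the column names, the sandbox.db database, and the monthly_sales table are hypothetical placeholders.

```python
# Minimal ETL sketch with pandas and sqlite3 (standard library).
# File name, column names, and table name are illustrative placeholders.
import sqlite3
import pandas as pd

# Extract: read raw data from a CSV source.
raw = pd.read_csv("sales.csv")

# Transform: clean and aggregate before loading (the "ETL" ordering).
clean = raw.dropna(subset=["customer_id", "amount"])
monthly = (
    clean.assign(month=pd.to_datetime(clean["order_date"]).dt.to_period("M").astype(str))
         .groupby(["month", "customer_id"], as_index=False)["amount"].sum()
)

# Load: write the transformed table into the analytic sandbox (here, SQLite).
with sqlite3.connect("sandbox.db") as conn:
    monthly.to_sql("monthly_sales", conn, if_exists="replace", index=False)
```

In an ELT setup, the raw table would be loaded first and the aggregation pushed down to the warehouse.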
2.3 LEARNING ABOUT THE DATA
In this step, the data science team should get familiar with the data at hand in order to recognize: the available data; the accessible data; the data still to collect. Identify anomalies or missing parts that must be present for a successful analysis. This operation can be considered an initial data evaluation step.

2.4 DATA CONDITIONING
Data cleaning, data normalization, and data transformation are performed in coordination between different roles within the data science team. The roles involved in this step are data engineers, database administrators, and data scientists. Data scientists can check data validity (i.e., norms, anomalies, consistency, etc.) and decide which parts are more interesting for the analysis. It is better to keep more data than to clean it away: hidden knowledge can be anywhere! Questions about the data characteristics raised in this step: How much of the data is clean? How much data is missing? What are the most targeted data? How consistent is the data?

2.5 SURVEY AND VISUALIZE
Data visualization allows a quick examination of the data for dirt, missing values, or anomalies. Some guidelines for data visualization can be summarized as follows: Does the data hold the temporal information necessary for the analytical problem? Does the data cover all of the targeted actors (e.g., clients, suppliers, financial partners, etc.)? Is the data well distributed over the whole studied period? If not, what solutions/alternatives exist? Are the measurement scales as desired? Are personal/confidential details removed from the data?

2.5 HOW TO CHOOSE A VISUALIZATION TYPE
There are a few questions we may ask to decide which chart type to use:
1. What story is your data trying to deliver?
2. Who will you present your results to?
3. How big is your data?
4. What is your data type?
5. How do the different elements of your data relate to each other?
Common roles for data visualization: showing change over time; showing a part-to-whole composition; looking at how data is distributed; comparing values between groups; observing relationships between variables; looking at geographical data.

2.5 VISUALIZING EXAMPLE: TEXT ANALYSIS RESULTS WITH WORD CLOUDS

2.6 COMMON TOOLS FOR DATA PREPARATION
We present here a few data preparation tools; more are coming in the next chapters.

2.7 ABOUT DATA PREPARATION
According to towardsdatascience.com: most data scientists do not enjoy cleaning and wrangling data; only 20% of their time is available for meaningful analytics; and data quality issues, if not treated early, will cascade and affect downstream steps.

2.7 QUESTIONS WE MAY ASK DURING DATA PREPARATION
How do I clean up the data? Data cleaning.
How do I provide accurate data? Data transformation.
How do I incorporate and adjust data? Data integration.
How do I unify and scale data? Data normalization.
How do I handle missing data? Missing data imputation.
How do I detect and manage noise? Noise identification.
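These questions translate directly into a few exploratory checks. The sketch below is a minimal pandas example (the file name sandbox_extract.csv and the DataFrame are hypothetical placeholders) showing typical first-pass checks when learning about the data: size, types, missing values, duplicates, and basic statistics.

```python
# Quick first-pass data checks during "learning about the data".
import pandas as pd

df = pd.read_csv("sandbox_extract.csv")   # illustrative file name

print(df.shape)                           # how much data do we have?
print(df.dtypes)                          # what is the type of each column?
print(df.isna().mean().sort_values(ascending=False))  # share of missing values per column
print(df.duplicated().sum())              # how many fully duplicated records?
print(df.describe(include="all"))         # basic statistics, ranges, and counts
```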
2.7 FORMS OF DATA PREPARATION

2.7.1 DATA PREPARATION STEPS: DATA CLEANING
Data cleaning: (a) understand the data (i.e., detect dirty data), (b) remove unnecessary data details, (c) correct bad data, and (d) filter out incorrect data.

2.7.2 DATA PREPARATION STEPS: DATA TRANSFORMATION
Data transformation: (a) smoothing, (b) feature construction, (c) aggregation or (d) summarization of data, (e) normalization, (f) discretization, and (g) generalization. Some transformation functions: (a) linear transformation, (b) quadratic transformation, etc. A simple linear transformation has the form y = a·x + b.

(a) Data smoothing. Data smoothing methods include: bin smoothing, kernel techniques, local weighted regression, and fitting parabolas.

(c) Aggregation and (d) summarization of data are done by summing up the values of a specific feature, or a set of features, over a period of time for summarization or visualization. In data aggregation, simple functions such as count, max, min, and sum can be applied.

(f) Discretization consists of transforming quantitative values into discrete values by defining a number of intervals, then assigning each numerical value to one specific interval. (The slides compare several feature discretization techniques.)

(g) Data generalization is performed by replacing a value with another value that is more general or less precise, for privacy purposes. It is also known as blurring. Classical examples of this transformation are binning methods, such as assigning a specific value to an interval (e.g., age 25 is replaced by the interval [18, 30[). Automated generalization: an algorithm distorts values until we obtain k similar individuals (e.g., k individuals belonging to the same interval); check the definition of k-anonymity. Declarative generalization: the data ranges are fixed up front, so the data scientist decides which generalization level is enough to preserve privacy. In the age example, we can replace a full date of birth by the year and month, or by a decade. Check also data masking.

2.7.3 DATA PREPARATION STEPS: DATA INTEGRATION
Data integration: (a) data merging, (b) conflict detection, (c) variable and domain definition (i.e., identify and unify), and (d) detection of redundancies and inconsistencies (e.g., using distance-based techniques, etc.). Don't you feel curious to know more about conflict detection?

Attribute redundancy: an attribute is redundant when it can be derived from another attribute or from a combination of other attributes. Use the chi-square correlation test for nominal values, and the correlation coefficient or covariance for numerical values.

Instance redundancy: it may appear because of merging data from different sources, errors in indexing the instances, or the use of denormalized data tables. It may lead to overspecialization during model training. Instance redundancy detection is mostly carried out using distance-based techniques, such as the edit distance or the Jaro distance, or using probabilistic, machine learning, or clustering techniques. The main difference between the techniques lies in their ability to detect redundancy when the order changes, when spaces are present, and so on; a minimal edit-distance sketch follows.
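As a minimal illustration of distance-based duplicate detection, the following sketch implements the classic edit (Levenshtein) distance with dynamic programming and compares two invented customer records; the 0.8 similarity threshold is an arbitrary choice for the example, and in practice a dedicated library or a Jaro-based measure could be used instead.

```python
# Edit (Levenshtein) distance between two strings, computed with dynamic programming.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Flag probable instance redundancy between two invented records.
r1, r2 = "john smith, 12 main street", "Jon Smith, 12 Main St."
d = edit_distance(r1.lower(), r2.lower())
similarity = 1 - d / max(len(r1), len(r2))
print(d, similarity, similarity > 0.8)   # high similarity: likely the same customer
```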
2.7.4 DATA PREPARATION STEPS: DATA NORMALIZATION
Data normalization: bring all data to the same unit/scale. Some data normalization techniques: (a) min/max normalization, (b) z-score normalization, (c) decimal scale normalization, etc.

2.7.5 DATA PREPARATION STEPS: MISSING DATA IMPUTATION
Missing data imputation: inserting intuitive values (e.g., average, min, max, most popular value, etc.); or using machine-learning-based methods such as KNNI, WKNNI, KMI, FKMI, SVMI, SVDI, etc.; or using probabilistic methods such as the EM algorithm. More recent techniques have appeared, such as GMC, MLP, ANN, SOM, Bayesian models, etc. Surprisingly, Excel offers some help here as well. Missing value imputation using Weka: weka.filters.unsupervised.attribute.MissingValuesImputation for imputing missing values, and weka.filters.unsupervised.attribute.MissingValuesInjection for injecting missing values.

2.7.6 DATA PREPARATION STEPS: NOISE IDENTIFICATION
Noise identification: noise can be described as the set of unwanted features, records, or data items that do not help to explain the feature itself or its relationship to the target class. Noise takes two forms: (a) noise as an item (record) and (b) noise as a feature. Noise can be handled using: data smoothing or variance detection; robust learners; data polishing methods; noise filters. Note that data noise has different impacts on different data modeling techniques. What do we mean by robust learners?
Noise as a feature can be addressed using greedy algorithms, heuristics, bio-inspired algorithms, etc. Noise as a record (one item) can be detected using k-fold validation. Noise may concern a set of records; in this case, more methods can be used to diagnose data noise: density-based anomaly detection, clustering-based anomaly detection, SVM-based anomaly detection, and auto-encoder-based anomaly detection. Feeling curious to know more? It looks like a good subject for a search! A minimal sketch combining imputation, normalization, and simple noise flagging is shown below.
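The sketch below ties together three of the steps above on a small invented DataFrame: mean imputation of missing values (2.7.5), min/max and z-score normalization (2.7.4), and a simple variance-based flag for noisy records (2.7.6). It is a minimal illustration, not a replacement for the Weka filters or the more advanced imputation and anomaly detection methods cited above; the data and the z-score threshold of 2 are arbitrary.

```python
import numpy as np
import pandas as pd

# Invented numerical data with a missing value and an outlier.
df = pd.DataFrame({"age": [22, 25, np.nan, 31, 29, 95],
                   "income": [1800, 2100, 2500, 2300, 2200, 25000]})

# 2.7.5 Missing data imputation: replace NaN with the column mean.
df_imputed = df.fillna(df.mean(numeric_only=True))

# 2.7.4 Normalization: min/max scaling to [0, 1] and z-score standardization.
min_max = (df_imputed - df_imputed.min()) / (df_imputed.max() - df_imputed.min())
z_score = (df_imputed - df_imputed.mean()) / df_imputed.std()

# 2.7.6 Noise identification: flag records whose z-score exceeds 2 in any column.
noisy = (z_score.abs() > 2).any(axis=1)
print(df_imputed[noisy])   # the last record stands out as a candidate outlier
```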
2.8 DATA REDUCTION
How do I reduce the dimensionality of the data? Feature selection (FS).
How do I remove redundant and/or conflicting examples? Instance selection (IS).
How do I simplify the domain of an attribute? Discretization.
How do I fill in gaps in the data? Feature extraction and/or instance generation.

2.8 DATA REDUCTION: MANY FORMS AND ADVANTAGES
Speed up the processing of the DM algorithm. Improve data quality. Increase the performance of the DM algorithm. Make the results easier to understand.

2.8.1 DATA REDUCTION: FEATURE SELECTION
Feature selection (i.e., data condensation) reduces the number of attributes as much as possible to speed up data processing and enable visualization. This operation is carried out by removing redundant and irrelevant features. Feature selection methods have mainly been used for classification purposes. Search strategies: (a) exhaustive search, (b) heuristic search, (c) non-deterministic search. Selection criteria are based on: (a) information measures, (b) distance measures, (c) dependency measures, (d) accuracy measures, and (e) consistency measures.
Feature selection using information gain (a worked sketch follows the data reduction subsections below):
H(Y) = −∑_i p_i · log2(p_i)
H(Y|X) = ∑_v P(X = v) · H(Y | X = v)
IG(Y|X) = H(Y) − H(Y|X)

2.8.2 DATA REDUCTION: INSTANCE SELECTION
Instance selection: many techniques in this category go under the name of sampling. It consists of selecting, randomly or using some heuristic, a subset of records for the data mining process. This operation should be performed very carefully to avoid bias.

2.8.3 DATA REDUCTION: FEATURE EXTRACTION
Feature extraction / instance generation: in contrast to feature selection, in this operation the original data contribute to the generation of the extracted features through aggregation, merging, or combination following specific functions (e.g., linear combinations, polynomials, etc.). Among these techniques we can cite PCA, factor analysis, etc.
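To make the information gain criterion of 2.8.1 concrete, here is a small, self-contained Python sketch (not from the slides) that computes IG(Y|X) for an invented categorical feature, following the formulas above; higher values indicate a more informative feature.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_i p_i * log2(p_i) over the label distribution."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(Y|X) = H(Y) - sum_v P(X=v) * H(Y | X=v)."""
    total = len(xs)
    cond = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        cond += (len(subset) / total) * entropy(subset)
    return entropy(ys) - cond

# Toy example: does the "weather" feature help predict the class "play"?
weather = ["sunny", "sunny", "rainy", "rainy", "cloudy", "cloudy"]
play    = ["no",    "no",    "yes",   "yes",   "yes",    "no"]
print(information_gain(weather, play))   # higher values mean a more informative feature
```

In practice, libraries expose similar criteria (for example, mutual-information-based feature scoring), but the toy version above follows the formulas as written.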
2.8.4 EVALUATE THE QUALITY OF DATA
According to https://www.explorium.ai/, many questions can be raised about data quality, such as: What do we mean when we talk about "data quality"? What are the key attributes of high-quality data? Why is data quality so important? How is data quality determined? Does this process differ for internal data vs. external data?

2.8.4 DATA QUALITY VALIDATION

2.8.4 POSITIVE IMPACTS OF DATA QUALITY
Decision making: the higher the quality of the data, the more companies and users will trust the outputs produced when making important decisions. This, in turn, lowers the risk of the company making the wrong decision.
Productivity: nobody wants to sit for hours on end fixing data errors. If the correct measures are taken in the initial step, staff can focus on the next steps and other responsibilities.
Targets: quality data can ensure accuracy in the company's current and future goals; for example, the marketing team gains a better understanding of what works and what does not.
Compliance: in many industries, specific guidelines are used to keep data private and safe from breaches or potential attacks. A lack of quality in the finance sector can result in millions of dollars in fines or in money laundering.

2.8.4 NEGATIVE IMPACTS OF BAD DATA QUALITY
Loss of competitiveness: lower data quality allows competitors to get better insights and predictions, and thus more revenue.
Revenue: low data quality in financial contexts affects the company's planning process (budget, investments, etc.) and leads to loss of revenue.
Reputation: the organization's reputation may be strongly affected when many partners are involved in an analytical project based on poor data. Investors tend to trust confident and reliable indicators rather than take risks.

2.8.4 DATA QUALITY EVALUATORS
According to https://analyticsindiamag.com/, data quality is subject to the following evaluators:
Validity: the customer can judge the validity of the data by putting rules and constraints on values (i.e., date formats, value ranges, etc.). This verification is important as it helps to optimize storage requirements.
Accuracy: the handling of actions affects the quality of the data, such as in IoT environments, robotics, or websites, where the recorded values should describe exactly and accurately the effect of the actions (i.e., temperature, number of clicks, etc.).
Completeness: it can be measured by counting the number of missing values against the total number of expected values. With this indicator, the data science team can judge whether the data is enough to perform a reliable analysis and decide which techniques to use to improve the data quality.
Consistency: data is consistent when its values reflect the same actions over time, with no mutual or temporal conflicts between the records.
Uniformity: data should respect the same units, ranges, and norms so it can be merged from different sources without affecting the quality of the analytical results.
Relevance: it is measured according to the objectives of the project and the question we are trying to answer; for this purpose we may be interested in specifying a period range and removing unnecessary details to simplify and speed up the data analysis.

DATA PREPARATION USING WEKA

PHASE 3: MODEL PLANNING

3.1 MODEL PLANNING: OBJECTIVES
Models are selected in this step by intuition, comparison, or testing: classification, clustering, association rules mining, and studying the relationships between the variables/attributes of the data.
Activities involved in this step: select tools and technologies; study the structure of the data; put forward hypotheses and test them; evaluate the model choices (i.e., is one model enough? Should more models be used? How should they be combined?).

3.1 MODEL PLANNING: DATA EXPLORATION AND VARIABLE SELECTION
Statistical methods, visualization, and other techniques are used to study the relationships between the data columns/variables. The objective of this operation is to reduce the data complexity and, as a consequence, accelerate data processing. In this step, the team examines the variables/attributes of the data to check whether they meet the outcomes of its hypotheses. Statistical methods such as PCA, or deep learning techniques such as the auto-encoder, are used to simplify the data structure. (The slides compare feature extraction time using different techniques.)

3.2 MODEL PLANNING: MODEL SELECTION
In this step, the project team selects a technique, or a set of candidate techniques, based on some rules and constraints: classification, association, clustering. Model selection can be performed based on initial indicators obtained from tests on small datasets. Generalization is possible after obtaining positive insights about the tested models.

3.3 COMMON TOOLS FOR MODEL PLANNING
R programming language: offers packages for (a) data analysis, (b) statistical functions, (c) visualization, and (d) access to big data databases.
SQL Analysis Services: SQL queries allow performing direct operations such as selection, filtering, and aggregation.
SAS/ACCESS: includes many data connectors to access different relational databases.

PHASE 4: MODEL BUILDING

4.1 MODEL BUILDING
In this step, the data is divided into different parts in order to build and test the model: a training dataset and a testing dataset. Model parameters and inputs may be adjusted to ensure the best performance. The results of the variable selection (correlations, significance) can be validated empirically. The model parameters and results should be stored and compared for later use. Validation questions such as the following can be raised: Are the results statistically significant on the test set? Are they still significant in the business context? Are they understood by the domain experts? Are the parameters reasonable and logical? Are the results satisfactory, or do they need more improvement? Does the data require any transformation or elimination, or any additional inputs?

4.1 MODEL BUILDING: DATA DIVISION
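As a minimal illustration of the data division step, the sketch below uses scikit-learn's train_test_split (one possible tool; the slides also mention others such as R, SAS, and Weka) on a hypothetical prepared dataset; the file name, the target column, and the 70/30 split are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical prepared dataset: features X and target y.
data = pd.read_csv("prepared_dataset.csv")   # illustrative file name
X = data.drop(columns=["target"])
y = data["target"]

# Hold out 30% of the records for testing; fix the seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(len(X_train), len(X_test))   # the model is built on the training set and evaluated on the test set
```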
4.2 OPEN SOURCE TOOLS FOR MODEL BUILDING

4.2 COMMERCIAL TOOLS FOR DATA MODELING

PHASE 5: COMMUNICATE RESULTS

5.1 COMMUNICATE RESULTS
By the time it reaches this step, the team has gone through one of two possible scenarios: a very robust and rigorous data analysis that tests the initial hypotheses, even when they turn out not to be true; or a very superficial data analysis that is not enough to validate the initial hypotheses. As a result, the team must be rigorous with the data and accept failure not as the state of the project but as the state of the data against the initial hypotheses. The team should also obtain valuable insights about the best models and the data structure, and identify which aspects/data points are in line with the initial hypotheses and which ones are not.

PHASE 6: OPERATIONALIZE

6.1 OPERATIONALIZE
In this step, the built model is put into use, but more testing is required before full exploitation. In this test, a small scope of data from the production environment is taken; if it is approved, the models are applied to the full/big scope of data. Otherwise, adjustments are performed before full deployment. It is also important to first deploy the built models on a partial set of the business application (e.g., a subset of the products, a single line of business, etc.). Part of the engineers' responsibility is to ensure the consistent, coherent integration of the models into the production environment. The integration operation should also consider the following criteria: find ways to retrain the models if needed; monitor the performance of the models and provide the necessary alerts.

REFERENCES
I strongly recommend this series of articles: Data Preprocessing in Data Mining, by Salvador García, Julián Luengo, and Francisco Herrera.