01-Introduction.pdf

Data Mining - Introduction (Fouille de données)
Germain Forestier, PhD
Université de Haute-Alsace
https://germain-forestier.info

Contents
  1. Overview of Data Mining
  2. Data Types and Sources
  3. Data Preprocessing
  4. Exploratory Data Analysis (EDA)
  5. Tools and Software for Data Mining
  6. Basic Concepts in Statistics and Machine Learning
  7. The Data Mining Process
  8. Introduction to Big Data and Scalability
  9. Real-World Examples and Case Studies
  10. Conclusion

1. Overview of Data Mining

Introduction
  Data Mining:
    Discover patterns, relationships, and knowledge from large datasets
    Apply algorithms and statistical methods to analyze and interpret data
    The term "data mining" emerged in the 1990s; the concept has older origins
    Combines techniques from statistics, pattern recognition, and machine learning
    Roots in mathematics, computer science, and information theory

Data Mining: Applications Across Industries
  Business and Finance: Customer behavior, fraud detection, risk management, market segmentation
  Healthcare and Medicine: Disease prediction, patient care, drug discovery
  Engineering: Quality control, failure prediction, process optimization
  Environment and Agriculture: Weather forecasting, crop yield, resource management

Data Mining vs. Data Analysis and Machine Learning
  Data Analysis: Focuses on understanding data and deriving insights; uses statistical methods
  Data Mining: Goes beyond data analysis; uses algorithms to find patterns and predict trends
  Machine Learning: A subset of AI; trains models to learn from data; used within data mining but not synonymous with it
  Data Mining Scope: Includes data preprocessing, exploratory analysis, and result interpretation

Data Mining Intersects with Various Disciplines
  Artificial Intelligence (AI): Algorithms for machine reasoning and decision-making
  Machine Learning: Training models to learn from data
  Big Data: Processing and analyzing large, complex datasets
  Information Retrieval: Accessing and retrieving data from databases
  Pattern Recognition: Identifying patterns and regularities in data
  Text Mining: Extracting information from text using statistical patterns and machine learning
  Database Management: Organizing, storing, and managing large datasets
  Predictive Analytics: Applying data and algorithms to forecast future outcomes
  Natural Language Processing (NLP): Enabling machines to understand and interpret human language
  Computer Vision: Training computers to analyze and make decisions from visual data

Artificial intelligence, machine learning, deep learning
  [Figure; source: nvidia.com]

Map of Data Science
  Mapping the relations between the data science fields
  [Figure: Google Trends data by year, 2004 to 2023, on a 0-100 scale, for machine learning, data science, artificial intelligence, big data, and data mining. Source: KDnuggets, Google Trends data]

Map(s) of Data Science
  [Figure; source: https://matthewlincoln.net/2016/11/23/histories-of-data.html]

Trendy domain
  "Machine Learning" / "Big Data" / "Data Science" / "AI" are used interchangeably in the press and in companies
  Sometimes over-hyped or misused
  Real news articles:
    1. How could big data feed 9 billion humans?
    2. Tomorrow AI will be able to generate a Game of Thrones episode in a second.
    3. Can big data reverse the unemployment curve?
    4. AI to the rescue of alien hunters?
    5. Artificial intelligence, a new weapon against cancer?
    6. Predicting crimes with Big Data?
    7. The use of Big Data can enable academic success.

The "Data Scientist"
  Data specialist, a trendy profession
  Need for cross-disciplinary skills in data science
  [Figure; source: https://www.linkedin.com/pulse/myth-new-apprentice-think-how-data-science-should-yeow-teck-keat/]

Rich Ecosystem
  [Figure: The 2024 MAD (Machine Learning, Artificial Intelligence & Data) Landscape, Version 1.0, March 2024. Main panels: Infrastructure; Analytics; Machine Learning & Artificial Intelligence; Applications (Enterprise, Horizontal, Industry); Open Source Infrastructure; Data Sources & APIs; Data & AI Consulting. © Matt Turck (@mattturck), Aman Kabeer (@AmanKabeer11) & FirstMark (@firstmarkcap). Blog post: mattturck.com/MAD2024. Interactive version: MAD.firstmarkcap.com]

2. Data Types and Sources

Structured vs. Unstructured Data
  Structured Data:
    Organized into a defined format or schema.
    Easily searchable and queryable.
    Typically stored in relational databases.
    Examples: Tables with rows and columns, CSV files.
  Unstructured Data:
    Lacks a specific form or structure.
    Difficult to process and analyze using conventional methods.
    Often requires specialized processing techniques.
    Examples: Text documents, videos, images, social media posts.

Example of a (structured) Data Set
  Input data: width and length of petals and sepals + species
  Objective: predict the species of iris
  [Figure; source: https://rpubs.com/wjholst/322258]

  sepal length  sepal width  petal length  petal width  species
  5.1           3.5          1.4           0.2          setosa
  4.7           3.2          1.3           0.2          setosa
  6.7           3.1          4.7           1.5          versicolor
  5.6           3            4.1           1.3          versicolor
  6.3           2.8          5.1           1.5          virginica
  6.7           3.3          5.7           2.5          virginica
  6.7           3            5.2           2.3          ?
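
The prediction task in the table above can be sketched with a nearest-neighbour rule that uses only the six labelled rows shown on the slide. This is an illustration under stated assumptions, not the method the course prescribes: the choice of 1-nearest neighbour with Euclidean distance and the names `LABELLED` and `predict_species` are hypothetical. It relies on `math.dist`, available in Python 3.8 and later.

```python
import math

# The six labelled rows from the table:
# (sepal length, sepal width, petal length, petal width), species.
LABELLED = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.7, 3.2, 1.3, 0.2), "setosa"),
    ((6.7, 3.1, 4.7, 1.5), "versicolor"),
    ((5.6, 3.0, 4.1, 1.3), "versicolor"),
    ((6.3, 2.8, 5.1, 1.5), "virginica"),
    ((6.7, 3.3, 5.7, 2.5), "virginica"),
]


def predict_species(sample, labelled=LABELLED):
    """Return the species of the labelled row closest to `sample`.

    Hypothetical helper: 1-nearest neighbour with Euclidean distance.
    """
    nearest = min(labelled, key=lambda row: math.dist(row[0], sample))
    return nearest[1]


# The unlabelled last row of the table.
unknown = (6.7, 3.0, 5.2, 2.3)
print(predict_species(unknown))  # prints "virginica"
```

With so few labelled rows the result is only indicative; a real model would be trained and evaluated on the full dataset.
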

Common Data Formats
  CSV (Comma-Separated Values):
    Plain text format that uses commas to separate values.
    Often used for spreadsheets and databases.
  JSON (JavaScript Object Notation):
    Text-based, human-readable format for representing structured data.
    Commonly used in web applications to exchange data.
  XML (eXtensible Markup Language):
    Markup language that defines rules for encoding documents.
    Self-descriptive structure, often used in web services.
  Others:
    YAML: Human-readable data serialization standard.
    HDF5: Hierarchical data format designed to store large datasets.

Data Sources
  Databases:
    Structured repositories of data, often managed by a DBMS.
    Types include relational databases, document databases, key-value stores, etc.
  Web Scraping:
    The process of extracting data from websites.
    Can be automated using various tools and libraries.
  APIs (Application Programming Interfaces):
    Defined interfaces through which software applications interact.
    Often used to retrieve data from web services in a standardized manner.

3. Data Preprocessing

Importance of Clean and Quality Data
  Clean Data:
    Free from errors and inconsistencies.
    Enhances the accuracy of analytical models.
    Facilitates smooth data integration and transformation.
  Quality Data:
    Adheres to a set of standards, such as accuracy, completeness, reliability.
    Supports effective decision-making and insights.
    Directly impacts the success of data-driven projects.
  Consequences of Poor Data Quality:
    Can lead to incorrect conclusions and misguided strategies.
    Increases operational costs due to unnecessary rework and corrections.
    Diminishes trust in data and analytical outcomes.

Data Cleaning Techniques
  Handling Missing Values:
    Removing records with missing values.
    Imputing missing values using means, medians, or modes.
    Using machine learning models to predict missing values.
  Data Transformation:
    Normalizing or scaling numerical values.
    Converting data types and formats.
    Encoding categorical variables.
  Outlier Detection:
    Identifying and handling anomalous values.
    Using statistical or machine learning methods for detection.
  Validation and Verification:
    Ensuring data accuracy and adherence to constraints.
    Cross-checking with external trusted sources.

Missing Data Handling
  Techniques for Handling Missing Data:
    Deletion: Removing instances with missing values.
    Mean Imputation: Replacing missing values with the mean.
    Median Imputation: Replacing with the median for a robust approach.
    Model-Based Imputation: Using algorithms like k-NN or regression.
  Considerations:
    Understanding the underlying cause of the missingness.
    Choosing the method that aligns with the data characteristics.
    Evaluating the impact on the analysis and interpretation.

Outlier Detection
  Definition:
    An outlier is an observation that deviates significantly from other observations.
    Can indicate variability, errors, or interesting phenomena.
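
One common way to make "deviates significantly" concrete is the interquartile-range (IQR) rule: flag any value more than 1.5 times the IQR below the first quartile or above the third. The sketch below is an assumption for illustration, since the slides do not name a specific detection method; the function name `iqr_outliers`, the factor 1.5, and the sample measurements are hypothetical.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR].

    Hypothetical helper illustrating the IQR rule; quartiles come from
    statistics.quantiles with its default method.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]


# Made-up measurements with one suspicious entry (9.9).
measurements = [3.5, 3.2, 3.1, 3.0, 2.8, 3.3, 3.0, 9.9]
print(iqr_outliers(measurements))  # prints [9.9]
```

Whether a flagged value is an error or a genuinely interesting observation is still a judgment call that depends on the domain.
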
  Considerations in Outlier Detection:
    Understanding the context and domain of the data.
    Deciding whether to treat outliers as noise or valuable information.
    Being cautious with removing outliers, as they may contain important insights.
  Challenges:
    Distinguishing between true outliers and natural variability.
    High dimensionality can make outlier detection more complex.

4. Exploratory Data Analysis (EDA)

Purpose of EDA (Exploratory Data Analysis)
  Understanding Data Structure:
    Analyzing the shape, central tendency, and dispersion.
    Visualizing relationships between variables.
  Identifying Patterns and Anomalies:
    Discovering trends, clusters, and outliers.
    Spotting potential errors or inconsistencies in the data.
  Facilitating Communication:
    Using visualizations to make complex data more understandable.
    Bridging the gap between technical experts and stakeholders.

Exploratory Data Analysis: Iris Dataset Example
  Visualization:
    Visualizing the dataset helps in understanding the distribution and relationships between features.
    Color coding each class enables easy identification of patterns and clusters.
  [Figure: Iris dataset, sepal_width plotted against sepal_length, colored by class (setosa, versicolor, virginica)]

Data Visualization Techniques
  Bar Charts: Used to display and compare the frequency or count of items across categories.
  Scatter Plots: Show the relationship between two continuous variables.
  Line Charts: Ideal for visualizing trends over time.
  Histograms: Illustrate the distribution of a continuous variable.
  Box Plots: Provide a view of the central tendency and variability of a dataset.
  Considerations: Selecting the appropriate visualization technique based on data type and the question to be answered.

5. Tools and Software for Data Mining

Introduction to Popular Tools
  Python:
    General-purpose programming language.
    Rich ecosystem for data analysis (e.g., pandas, matplotlib).
  R:
    Language built for statistical computing.
    Offers a vast library of packages for data manipulation and visualization.
  SQL:
    Query language for managing and retrieving data from relational databases.
    Used for complex data manipulation, aggregation, and joining tables.
  Excel:
    Spreadsheet software with built-in tools for data analysis.
    Suitable for managing, filtering, and visualizing smaller datasets.

Libraries and Packages
  Pandas:
    Data manipulation and analysis library.
    Provides data structures for efficiently storing large datasets.
  NumPy:
    Library for numerical computing.
    Supports large arrays and matrices, mathematical functions.
  Matplotlib:
    Visualization library for creating static, interactive, and animated visualizations.
    Highly customizable and extensible.
  Scikit-learn:
    Library for machine learning.
    Includes tools for classification, regression, clustering, and preprocessing.

6. Basic Concepts in Statistics and Machine Learning

Descriptive vs. Inferential Statistics
  Descriptive Statistics:
    Summarizes and describes the main aspects of a dataset.
    Includes measures like mean, median, mode, standard deviation.
    Provides a snapshot of the data.
  Inferential Statistics:
    Makes predictions or inferences about a population from a sample.
    Utilizes hypothesis testing, confidence intervals, regression analysis.
    Goes beyond the data to draw general conclusions.

Supervised vs. Unsupervised Learning
  Supervised Learning:
    Requires a labeled dataset with input-output pairs.
    Goal is to learn a function that maps inputs to outputs.
    Commonly used for classification and regression tasks.
  Unsupervised Learning:
    Works with unlabeled data without specified output.
    Goal is to identify patterns, structures, or relationships in the data.
    Commonly used for clustering, dimensionality reduction, and association rules.

Bias-Variance Tradeoff
  Bias: Error from oversimplifying the model
    High bias leads to underfitting; the model fails to capture the underlying pattern
  Variance: Error from sensitivity to small data fluctuations
    High variance leads to overfitting; the model captures noise instead of the pattern
  Tradeoff: Reducing bias typically increases variance, and reducing variance typically increases bias
    Aim for a balance that minimizes the total error
  [Figure; source: https://pbs.twimg.com/media/CpWDWuSW8AQUuCk.jpg]

7. The Data Mining Process

Steps in the Data Mining Process
  1. Business Understanding: Determine business objectives, define goals.
  2. Data Understanding: Collect, describe, explore, and verify data quality.
  3. Data Preparation: Clean, transform, and integrate data; select and format data.
  4. Modeling: Select techniques, generate test design, build and assess models.
  5. Evaluation: Assess model quality, review process, determine next steps.
  6. Deployment: Plan deployment, maintenance, and monitoring.

Importance of Understanding Business Objectives
  Alignment with Goals: Ensures project matches organizational strategies
  Problem Definition: Clarifies the problem and guides technique selection
  Performance Metrics: Sets metrics that reflect business needs
  Resource Allocation: Optimizes allocation of time, budget, and personnel
  Stakeholder Communication: Frames results in terms of business value
  Impact Assessment: Clarifies project impact on the organization
  Risk Management: Identifies and mitigates potential risks

8. Introduction to Big Data and Scalability

Understanding Big Data
  Volume: Large data sizes (terabytes, petabytes)
  Velocity: Speed of data generation and processing
  Variety: Types of data (structured, semi-structured, unstructured)
  Veracity: Data quality and trustworthiness
  Value: Potential value derived from data
  Challenges: Storage, processing, analysis, security
  Technologies: Tools like Hadoop, Spark, NoSQL databases

Challenges in Handling Big Data
  Storage: Managing large data volumes, distribution, and redundancy
  Processing: Efficient data processing; often requires parallel computing
  Integration: Merging data from diverse sources, formats, and structures
  Quality: Ensuring data accuracy, consistency, and trustworthiness
  Security: Protecting data privacy, integrity, and regulatory compliance
  Analysis: Extracting insights from complex, diverse datasets
  Scalability: Scaling systems to handle data growth without performance loss
  Cost: Managing costs of storage, processing, and analysis vs. value gained

9. Real-World Examples and Case Studies

Successful Applications of Data Mining
  Healthcare: Predicting disease outbreaks, personalizing treatments, and improving patient care.
  Finance: Detecting fraudulent activities, managing risk, and optimizing investment strategies.
  Retail: Recommending products, optimizing pricing, and enhancing customer experiences.
  Manufacturing: Enhancing quality control, optimizing production processes, and reducing costs.
  Transportation: Predicting traffic patterns, optimizing routes, and enhancing safety.
  Energy: Forecasting energy demand, optimizing grid performance, and improving sustainability.
  Entertainment: Personalizing content recommendations, analyzing audience engagement, and optimizing advertising.
  Government: Enhancing public safety, improving service delivery, and informing policy decisions.

Real-Life Examples of Successful Data Mining
  Netflix: Uses algorithms to recommend personalized shows and movies, accounting for over 75% of users’ viewing.
  American Express: Analyzes transaction data to detect and prevent fraud, saving millions of dollars annually.
  Walmart: Utilizes data mining to optimize inventory levels in stores, leading to increased efficiency and reduced costs.
  GE Aviation: Applies predictive analytics to monitor and maintain airplane engines, enhancing safety and reliability.
  Google Maps: Employs real-time traffic analysis to provide optimal driving routes, saving time for millions of commuters.
  IBM Watson in Healthcare: Assists doctors in diagnosing and treating cancer by analyzing medical literature and patient data.
  National Weather Service: Utilizes data mining to improve weather forecasts, aiding in disaster preparation and response.
  LinkedIn: Leverages algorithms to suggest professional connections and job opportunities, enhancing networking and career growth.

10. Conclusion

Summary of the Key Points Covered
  Overview of Data Mining and related fields.
  Common Data Formats and Data Sources.
  Structured vs. Unstructured Data.
  Importance of Clean and Quality Data, including Data Cleaning Techniques.
  Missing Data Handling and Outlier Detection.
  EDA and various Data Visualization Techniques.
  Introduction to Popular Tools, Libraries, and Packages.
  Descriptive vs. Inferential Statistics; Supervised vs. Unsupervised Learning.
  Bias-Variance Tradeoff and Data Mining Process Steps.
  Importance of Understanding Business Objectives; Understanding Big Data.
  Challenges in Handling Big Data.
  Real-life, concrete examples of successful data mining applications.
