Data Mining Review PDF
Summary
This document reviews data mining techniques and concepts. It covers common data mining tasks, reasons for the growing use of data mining, the CRISP-DM lifecycle, and tools such as NumPy and Pandas. It also covers data preprocessing, statistical analysis, learning models, and model evaluation metrics.
Full Transcript
Review: Data Mining

Common tasks of data mining
  Estimation
  Prediction
  Regression
  Classification
  Association
  Clustering

Reasons for data mining usage increase
  Commercialization of products
  Continual technological advancements
  External pressure

The Scientific Method for Analytics
  Cross-Industry Standard Process for Data Mining (CRISP-DM) Lifecycle
  Ref: Discovering Knowledge in Data: An Introduction to Data Mining, Second Edition, by Daniel Larose and Chantal Larose, John Wiley and Sons, Inc., 2014.

Method: Understanding - Business and Data
  [Figure: CRISP-DM lifecycle phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment]
  Define project requirements and objectives
  Translate objectives into a data mining problem definition
  Prepare a preliminary strategy to meet objectives
  Data we have
  Data we need

Method: Data Preparation

Method: Modeling and Evaluation

Method: Deployment and Tools

EDA

Why Do We Preprocess Data?
  Raw data is often unprocessed, incomplete, and noisy
  May contain:
    Obsolete/redundant fields
    Missing values
    Outliers
    Data in a form not suitable for data mining
    Values not consistent with policy or common sense
  Minimize GIGO (Garbage In → Garbage Out)

Data cleaning
  Removing entries
  Removing columns
  Removing a cluster

Arrays in NumPy

Why is NumPy Faster Than Lists?
  NumPy array elements have fixed types.
  NumPy arrays are stored in one contiguous block of memory, unlike lists, so processes can access and manipulate them very efficiently. This behavior is called locality of reference in computer science.
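The two properties named above (fixed element type, contiguous storage) can be inspected directly on an array. The sketch below is only an illustration; the sample values and variable names are assumptions, not taken from the slides.

```python
import numpy as np

# Illustrative values (assumed for this sketch).
arr = np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)

# Fixed type: every element shares one dtype and one size in bytes.
print(arr.dtype)                  # int64
print(arr.itemsize)               # 8 bytes per element

# Contiguous storage: elements sit back to back in one memory block,
# so stepping to the next element moves exactly itemsize bytes.
print(arr.flags["C_CONTIGUOUS"])  # True
print(arr.strides)                # (8,)

# A Python list may mix types; NumPy instead coerces to one common dtype.
coerced = np.array([1, 2.5, 3])
print(coerced.dtype)              # float64
```

A Python list, by contrast, stores references to separately allocated objects, which is why the same traversal tends to be less cache-friendly.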
1-D Array Slicing Using NumPy
2-D Array Slicing Using NumPy
3-D Array Slicing Using NumPy

Pandas

Statistical Analysis
  Univariate Statistical Analysis
  Bivariate Statistical Analysis
  Multivariate Statistical Analysis

Transformations to achieve normality
  Normal distribution
  Right-skewed data
  Left-skewed data

Methods for Identifying Outliers
  Z-score Standardization
  Interquartile Range (IQR)
  Scatterplot (2D)

Confidence Interval Estimate
  Consists of an interval of numbers
  Includes a confidence level
  General form: Point Estimate ± Margin of Error

Box Plot
Frequency
Heatmap
Model complexity
Bias and Variance
Steps in Hypothesis Testing

Learning Models
Correlation coefficient
Linear Regression

Regression Evaluation Metrics
  Mean Absolute Error
  Cost Function
  Mean Squared Error
  Root Mean Squared Error
  R² Score (Coefficient of Determination)

Model Evaluation
  Adjusted R² Score

kNN
SVM
Decision Tree
Random Forest

Classification Model Evaluation Metrics
  Sensitivity and Specificity
  ROC and AUC
  Accuracy, Precision, Recall, F1-Score

k-Means Clustering
Agglomerative and Divisive Hierarchical Clustering

Misc.
  Stats
  Mean
  Median
  Mode
  Skewness
  Normalization
  Standardization
  Z-Score
  Confidence interval
  IQR
  Distance Functions
  Information Gain (IG)
  Entropy

Project 2/Deliverable 2
  Feature Engineering
    Feature Selection
    Feature Scaling
    Handling Missing Values
    Handling Imbalanced Data
    Feature Encoding
  Prediction Techniques
    Supervised: Regression, Classification
    Unsupervised: Clustering
    Choice Justification
  Model Selection
    Performance Comparison
    Hyperparameter Tuning
    Best Parameter Selection
  Evaluation Metrics
    Regression Metrics
    Classification Metrics
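For the two numeric checks listed under Methods for Identifying Outliers (Z-score and IQR), a minimal sketch follows. The sample values, the |z| > 3 cutoff, and the 1.5 × IQR fences are common conventions assumed here for illustration; the slides do not state specific thresholds.

```python
import numpy as np

# Illustrative sample (assumed): 14 ordinary values and one extreme value.
data = np.array([10, 12, 11, 13, 12, 11, 10, 12,
                 11, 13, 12, 11, 10, 12, 95], dtype=float)

# Z-score method: standardize, then flag points far from the mean.
z = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z) > 3]          # assumed cutoff of 3

# IQR method: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = data[(data < lower) | (data > upper)]

print(z_outliers)    # [95.]
print(q1, q3, iqr)   # 11.0 12.0 1.0
print(iqr_outliers)  # [95.]
```

On very small samples the Z-score cutoff of 3 can never be reached, because the extreme value inflates the standard deviation; the IQR fences do not have that limitation.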
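The regression metrics listed above (MAE, MSE, RMSE, R², adjusted R²) can be computed by hand in a few lines. This is a sketch on made-up values; `y_true`, `y_pred`, and the predictor count `p` are assumptions introduced only for illustration.

```python
import numpy as np

# Made-up targets and predictions (assumed for this sketch).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))    # Mean Absolute Error
mse = np.mean(errors ** 2)       # Mean Squared Error
rmse = np.sqrt(mse)              # Root Mean Squared Error

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R^2 penalizes extra predictors; p = 1 is assumed here.
n, p = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(mae, mse, r2, adj_r2)
```

With these values MAE is 0.5, MSE is 0.375, R² is 0.925, and adjusted R² is 0.8875; the adjusted score is lower because it discounts for the number of predictors.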
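The classification metrics listed above (accuracy, precision, recall/sensitivity, specificity, F1-score) all derive from the four confusion-matrix counts. The sketch below uses made-up binary labels, with 1 assumed to be the positive class.

```python
import numpy as np

# Made-up binary labels (assumed); 1 is treated as the positive class.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Confusion-matrix counts.
tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
tn = np.sum((y_true == 0) & (y_pred == 0))   # true negatives
fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)          # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)            # 3 5 1 1
print(accuracy, precision, recall, specificity, f1)
```

Here accuracy is 0.8 while precision, recall, and F1 are all 0.75, which shows how accuracy can look better than the positive-class metrics when negatives dominate the data.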