ANALYTICS REVIEWER (SUMMER)
MODULE 1: INTRODUCTION TO PREDICTIVE ANALYTICS

What is predictive analytics?
- It uses statistical and machine learning methods to forecast future outcomes based on historical data patterns. It helps organizations make informed decisions by anticipating future trends and events.

Related disciplines:
- Artificial Intelligence (AI): to learn from data. Ex: Natural Language Processing (NLP)
- Machine Learning: to recognize patterns and make predictions. Ex: Employee Retention and Turnover
- Data Mining: to dig for treasures hidden in large datasets.
- Statistics: to understand the world.

Common uses of Predictive Analytics in Business
1. Customer segmentation
2. Fraud detection
3. Risk assessment
4. Demand forecasting
5. Operations optimization

Benefits of Predictive Analytics
- Improved Talent Acquisition and Retention
- Enhanced Employee Performance Management
- Optimized Workforce Planning
- Increased Employee Engagement
- Cost Reduction and Efficiency
- Better Decision-Making
- Enhanced Employee Experience

What is DATA MINING?
- Data mining is the process of sorting through large data sets to identify patterns and relationships that can help solve business problems through data analysis.

PROS
- Customer Relationship Management
- Forecasting
- Competitive Advantage
- Attract Customers
- Anomaly Detection

CONS
- Expensive in the initial stage
- Security of the critical data
- Data mining can violate user privacy
- Lack of precision or incorrect information

7 Essential Steps of the Data Mining Process
1. Data Cleaning
2. Data Integration
3. Data Reduction for Data Quality
4. Data Transformation
5. Data Mining
6. Pattern Evaluation
7. Representing Knowledge in Data Mining

Six-Phase Process
I. Business Understanding - defining project objectives and requirements; defining the data mining problem.
II. Data Understanding - data collection and familiarization; identifying data quality problems; data exploration (extract, transform, and load).
III. Data Preparation - table, record, and attribute selection; data transformation and cleaning.
IV. Modeling - selecting and applying modeling techniques; calibrating parameters.
V. Evaluation - assessing whether the solution achieves the business objectives and addresses open issues.
VI. Deployment - deploying the resulting model; implementing a repeatable data process.

DATA MINING TECHNIQUES
- Classification - involves assigning employees to predefined categories based on their attributes.
- Regression - predicting a continuous outcome based on various input features.
- Clustering - involves grouping employees based on various attributes to identify patterns and insights that can inform decision-making.
- Association Rules - discovering interesting relationships or associations between different variables in employee data.
- Prediction - using data to forecast future outcomes.
- Outlier Detection - identifies data points that differ significantly from the rest. Outliers can indicate errors or unusual events.
- Sequential Patterns - finding regular sequences in data, where events follow one another in a specific order.
- Artificial Neural Network - a computer system modeled after the brain, using nodes to recognize patterns and make decisions.

STATISTICAL ANALYSIS
- The foundation for most predictive analytics.
- Statistical methods such as regression, the most common method, attempt to model independent variables (predictors) to predict a dependent variable (outcome).
- The type of model used in an analysis depends on the outcome being predicted.
- Other methodologies for prediction include decision trees, support vector machines, and clustering. Machine learning methodologies such as neural networks are also used for predictive modeling.

Train Model
- A model is "trained" by applying a statistical model to determine the relationship between a predictor and an outcome. Once this relationship is determined, the model can then be tested for accuracy.

Test Model
- Testing a model involves checking the predicted values against the actual values and measuring the accuracy and error.

Output Model Components
- The model components are the relationships generated from the statistical model. The components are also used to output predictions later.

Output Predictions
- With the model components determined, predictions can now be created by inputting new values of the predictor variable into the model.
- Estimating or forecasting the outcome or result of a system, process, or event based on data and models.
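A minimal sketch of this train/test/predict workflow in Python, assuming scikit-learn and using made-up tenure-vs-salary numbers (the variable names and data are illustrative, not from the module):

```python
# Minimal train -> test -> predict workflow with a simple linear regression.
# The data is synthetic; only the workflow mirrors the notes above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.uniform(1, 20, size=(200, 1))                      # predictor, e.g. years of tenure
y = 30_000 + 2_500 * X[:, 0] + rng.normal(0, 4_000, 200)   # outcome, e.g. salary

# Train the model on one part of the data...
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# ...then test it by comparing predicted vs. actual values on held-out data.
y_pred = model.predict(X_test)
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)
print("R^2 :", r2_score(y_test, y_pred))

# Output predictions: feed new predictor values into the fitted components.
print("Predicted outcome at x=10:", model.predict([[10.0]])[0])
```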
Tools in Predictive Analytics
- Spreadsheets - (basic employee database).
- Workflow Interfaces - visual tools that guide users through the process of building and executing analyses; a user-friendly way to create a step-by-step guide for analyzing data and making predictions.
- Programming / Statistical Computing - the backbone of predictive analytics. It involves using programming languages like Python or SAS to manipulate data, build statistical models, and extract meaningful insights. Uses in HR: predicting employee turnover; employee segmentation.

Predictive Analytics Framework key steps:
1. Data Collection
2. Data Preprocessing
3. Data Analysis
4. Model Validation
5. Decision Making
6. Monitoring

MODULE 2: DATA PREPROCESSING

Data quality problems:
- Incomplete Data: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
- Noisy Data: containing errors or outliers.
- Inconsistent Data: containing discrepancies in codes or names.
- Duplicate records.

METHODS OF DATA PREPROCESSING

DATA CLEANING - the process of altering data in a given storage resource to make sure that it is accurate and correct.

Data Cleaning Tasks:
a) Fill in missing values - solutions for handling missing data:
   i. Ignore the tuple.
   ii. Fill in the missing value manually.
   iii. Data imputation: use a global constant to fill in the missing value; use the attribute mean; or use the attribute mean for all samples belonging to the same class.
   iv. Combined computer and human inspection - detecting suspicious values and checking them by human intervention.
b) Identifying outliers - solutions:
   i. Box plot.
c) Cleaning noisy data:
   i. Binning - grouping a number of more or less continuous values into a smaller number of "bins".
   ii. Clustering - grouping data into corresponding clusters and using the cluster average to represent a value.
   iii. Regression - using a simple regression line to estimate a very erratic data set.

DATA INTEGRATION - reorganizes the various raw datasets into a single dataset that contains all the information required for the desired statistical analyses.
GOAL: allows you to consolidate the data to present a unified, singular view of the data for all users.

3 MAJOR ISSUES TO CONSIDER DURING DATA INTEGRATION
1. Schema Integration
2. Redundancy Detection
3. Resolution of Data Value Conflicts

DATA TRANSFORMATION - the process of transforming data from one format to another.

DATA TRANSFORMATION TASKS
1. NORMALIZATION - a way to scale specific variables to fall within a small specified range (see the sketch below).
   I. MIN-MAX NORMALIZATION - transforming values to a new scale so that all attributes fall within a standardized range.
   II. Z-SCORE STANDARDIZATION - transforming a numerical variable to a standard normal distribution.
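A small sketch of the two normalization tasks above, using NumPy on an invented numeric column (the values are illustrative):

```python
# Min-max normalization and z-score standardization on one numeric column.
import numpy as np

ages = np.array([22, 25, 30, 35, 41, 52, 60], dtype=float)

# Min-max: rescale so every value falls in the standard [0, 1] range.
min_max = (ages - ages.min()) / (ages.max() - ages.min())

# Z-score: rescale to mean 0 and standard deviation 1.
z_score = (ages - ages.mean()) / ages.std()

print(min_max.round(2))  # [0.   0.08 0.21 0.34 0.5  0.79 1.  ]
print(z_score.round(2))
```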
2. ENCODING AND BINNING
   (a) ENCODING - the process of transforming categorical variables into binary numerical counterparts.
   (b) BINNING - the process of transforming numerical variables into categorical counterparts.
       1. EQUAL-WIDTH (distance) PARTITIONING - divides the range into N intervals of equal size, forming a uniform grid.
       2. EQUAL-DEPTH (frequency) PARTITIONING - divides the range into N intervals, each containing approximately the same number of samples.
3. ATTRIBUTE CONSTRUCTION - a new set of attributes is created from an existing set to facilitate the mining process.
4. AGGREGATION - summary or aggregation operations are applied to the data.
5. DISCRETIZATION - dividing the range of a continuous attribute into intervals.
6. GENERALIZATION - low-level data attributes are transformed into high-level attributes by using concept hierarchies and creating layers of successive summary data.

DATA REDUCTION - the process of obtaining a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.

B. DATA REDUCTION STRATEGIES
SAMPLING - using a smaller representative sample from the big data set or population that will generalize to the entire population.

TYPES OF SAMPLING
1. SIMPLE RANDOM SAMPLING
2. SAMPLING WITHOUT REPLACEMENT
3. SAMPLING WITH REPLACEMENT
4. STRATIFIED SAMPLING

C. FEATURE SUBSET SELECTION - reduces the dimensionality of data by eliminating redundant and irrelevant features.
FEATURE SUBSET SELECTION TECHNIQUES
1. EMBEDDED APPROACHES
2. FILTER APPROACHES
3. WRAPPER APPROACHES

D. FEATURE CREATION - creating new attributes that can capture the important information in a data set much more efficiently than the original attributes.
FEATURE CREATION METHODOLOGIES
1. FEATURE EXTRACTION
2. MAPPING DATA TO NEW SPACE
3. FEATURE CONSTRUCTION

DATA SMOOTHING - a preprocessing technique used to remove noise from data.
- It highlights underlying trends and patterns, which is essential for predictive analytics.
- Commonly used in time series analysis, financial forecasting, and HR analytics.

DATA SMOOTHING TECHNIQUES
1. Moving Averages
2. Exponential Smoothing
3. Seasonal Smoothing
4. Holt-Winters Method

IMPORTANCE
1. Identifying Trends
2. Removing Noise
3. Handling Outliers
4. Improving Seasonal Forecasts

MODULE 3: SUPERVISED LEARNING

SUPERVISED LEARNING
- Refers to the task of inferring a function from supervised (labeled) training data.
- Uses labeled training data.
- Used to classify data or make predictions.

2 CATEGORIES OF SUPERVISED LEARNING
1. Classification
   - Historical data are used to build a model.
   - The goal is to predict previously unseen records.
2. Regression

SEVERAL CLASSIFICATION ALGORITHMS

1. ZeroR
- Simplest classification methodology; it relies on the target and ignores all predictors.
- Simply predicts the majority class.
- Useful for establishing a baseline performance as a benchmark for other classification methods.

2. OneR
- Short for "One Rule".
- Simple yet accurate.
- Generates one rule for each predictor in the data.

3. Naive Bayes Classifier
- A probabilistic machine learning algorithm used for classification tasks. It is based on Bayes' theorem, which calculates the probability of a class given the features.
- It is called "naive" because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features.
- Advantages: fast and easy, popular for text classification, and good for multi-class predictions.
- Disadvantages: Naive Bayes assumes that all features are independent or unrelated, so it cannot learn relationships between features.

4. DECISION TREES
- A type of machine learning model used primarily for classification and regression tasks.
- Can be used to predict outcomes and make intelligent decisions without relying on human intuition or experience.
- PROS: TRANSPARENT, EFFICIENT, and FLEXIBLE
- CONS: COMPLEX, UNSTABLE, and RISKY

5. NEAREST NEIGHBOURS
- An example of a classification algorithm in which distance or similarity measures are used to provide predictions (see the sketch below).
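As referenced above, a bare-bones sketch of the nearest-neighbours idea: classify a new point by majority vote among the k closest training points by Euclidean distance (the points and labels are made up):

```python
# A bare-bones nearest-neighbour classifier over toy 2-D data.
import numpy as np

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 7.5]])
y_train = np.array(["stay", "stay", "leave", "leave"])

def predict(x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every training sample
    nearest = y_train[np.argsort(dists)[:k]]      # labels of the k closest samples
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]              # majority vote

print(predict(np.array([2.0, 1.5])))  # -> "stay"
```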
6. ARTIFICIAL NEURAL NETWORK (ANN)
- A network of perceptrons or nodes that mimics a biological network of neurons in a brain.
- An ANN attempts to recreate a computational mirror of the biological neural network.
- Each neuron takes many input signals, then, based on an internal weighting system, produces a single output signal that is typically sent as input to another neuron.

WHAT IS THE PERCEPTRON?
- A line that divides the green points and the red points.
- The objective of the perceptron is to find the line that optimally separates those two classes. Whenever a new point comes in, we will know whether that new point is a green or a red point.

PERCEPTRON INDUCTION ISSUE
- Separable data implies there is a plane that classifies the training data perfectly.
- If the data is separable, there are many solutions, depending on the initial weights w.
- If the data is not separable, the perceptron does not converge: it needs an ANN.

7. SUPPORT VECTOR MACHINES
- Support Vector Machine (SVM) is a supervised machine learning algorithm that classifies data by finding an optimal line or hyperplane that maximizes the distance between each class in an N-dimensional space.
- The SVM algorithm is widely used in machine learning as it can handle both linear and non-linear classification tasks.
- When the data is not linearly separable, kernel functions are used to transform the data into a higher-dimensional space to enable linear separation.

TYPES OF SVM CLASSIFIERS
- LINEAR SVMs - used with linearly separable data; the data do not need to undergo any transformations to be separated into different classes.
- NON-LINEAR SVMs - applied when data is not linearly separable. The kernel trick helps map data to a higher dimension where it can be separated linearly.

8. ENSEMBLES
- An example of a classification approach.
- Constructs a set of classifiers from the training data.
- Predicts the class label of previously unseen records by aggregating the predictions made by multiple classifiers.

TYPES OF ENSEMBLES (see the sketch below)
- Parallel Ensemble
- Serial Ensemble
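A minimal sketch of a parallel ensemble, assuming scikit-learn: several classifiers are trained independently and their predictions are aggregated by majority vote (the component models and synthetic data are illustrative choices, not prescribed by the module):

```python
# Parallel ensemble: independent classifiers combined by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier([
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier()),
])  # voting="hard" (majority vote) is the default
print("ensemble accuracy:", ensemble.fit(X_tr, y_tr).score(X_te, y_te))
```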
9. RANDOM FOREST
- Random Forest is a machine learning algorithm that operates by constructing multiple decision trees during training and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

II. REGRESSION
- The data mining task of predicting the value of a target (numerical variable y) by building a model based on one or more predictors (numerical and categorical variables).
- The most widely used statistical technique, but probably also the most widely misused.

USES OF REGRESSION ANALYSIS
- Data description
- Parameter estimation
- Prediction and estimation
- Control

SIMPLE LINEAR REGRESSION MODEL
- A method used to model the linear relationship between a target variable and one predictor variable.

MULTIPLE LINEAR REGRESSION MODEL
- A method used to model the linear relationship between a target variable and more than one predictor variable.

III. INDICATOR VARIABLES
- Also known as "dummy variables"; used to model qualitative variables in regression.
- They assign levels to qualitative variables/categories so that regression analysis can be performed on them.
- They do not have a scale of measurement.

MULTICOLLINEARITY
- The inflation of coefficient estimates due to interdependent regressors.
- Correlations or multiple correlations of sufficient magnitude have the potential to adversely affect regression estimates.
- If all regressors are orthogonal (independent of each other), multicollinearity is not a problem. However, this is a rare situation in regression analysis.

LOGISTIC REGRESSION
- Predicts the probability of an outcome that can only have two values.
- The prediction is based on the use of one or several predictors (numerical and categorical).
- It produces a logistic curve, which is limited to values between 0 and 1.

REGRESSION MODEL EVALUATION
- The process of assessing how well a regression model predicts or explains the relationship between the dependent variable and the independent variables.

EVALUATION METRICS FOR REGRESSION MODELS
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-Squared
- Adjusted R-Squared

CLASSIFICATION MODEL EVALUATION
- Model evaluation is a methodology that helps find the model that best represents our data and indicates how well the chosen model will work in the future.

MODULE 4: UNSUPERVISED LEARNING

- Supervised learning uses labeled data to train models and make predictions; example tasks include classification and regression.
- Unsupervised learning uses unlabeled data to discover patterns or structures; example tasks include clustering and dimensionality reduction.

ASSOCIATION RULE MINING

Market basket analysis
- A data mining technique used to discover interesting relationships or associations among a set of items in transactional data.
- Specifically, given a set of transactions, we find rules that will predict the occurrence of an item based on the occurrences of other items in a particular transaction.

DEFINITION OF TERMS
- Item set - a collection of one or more items. A k-item set is a collection of k items that appear together in a transaction or dataset.
- Support count - how frequently a particular item set (or combination of items) appears in a data set.
- Support - the proportion or percentage of transactions that include a specific item or combination of items.
- Frequent item set - a group of items that frequently occur together in transactions within a dataset.

SOLVING ASSOCIATION

BRUTE-FORCE APPROACH (the simplest way to do it; a small worked example follows below)
- Minimum support (minsup): a threshold you set to decide which itemsets are frequent enough to be considered interesting.

TWO-STEP APPROACH
1. Frequent item set generation
2. Rule generation
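The worked example referenced above: computing support and confidence for one candidate rule by brute force over a toy transaction list (the items are made up):

```python
# Support and confidence for one candidate association rule.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Rule {bread, milk} -> {diapers}
antecedent = {"bread", "milk"}
both = {"bread", "milk", "diapers"}
print("support   :", support(both))                        # 1/5 = 0.2
print("confidence:", support(both) / support(antecedent))  # 0.2 / 0.6 = ~0.33
```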
APRIORI PRINCIPLE
- A fundamental concept in data mining, specifically in the context of association rule learning. It is used to identify frequent item sets and to derive association rules from them.

RULE GENERATION
- A process in association rule learning, a fundamental part of data mining, in which you derive useful relationships from a dataset. In predictive analytics, these rules are used to predict future trends, behaviors, or events based on historical data.

LIFT RATIO
- Lift is a measure of the strength of the association between two items, taking into account the frequency of both items in the dataset. It is calculated as the confidence of the association divided by the support of the second item.

SEQUENTIAL PATTERN MINING
- Concerned with finding statistically relevant patterns within time series data where values are delivered in a sequence.

SUBSEQUENCE
- A sequence that is also contained in another sequence; frequently occurring ordered events.

SPADE ALGORITHM
- SPADE works by breaking the sequences down into smaller pieces and counting how often those pieces appear across the entire list. It uses a special format to store this information efficiently, making it faster to find patterns without having to repeatedly scan all the data.

Application of the SPADE algorithm (HR example)
- Data set: employee ID, job title, department
- Sequential data: analysis, decision-making, etc.
- Benefits: career development, talent management, etc.

HIERARCHICAL CLUSTERING
- Produces a set of nested clusters organized as a hierarchical tree, usually visualized as a dendrogram.

STRENGTHS OF HIERARCHICAL CLUSTERING
- Does not assume any particular number of clusters.
- May correspond to meaningful taxonomies.

2 MAIN TYPES OF HIERARCHICAL CLUSTERING
1. Agglomerative Hierarchical Clustering (Bottom-Up)
   - Start with every data point as its own cluster.
   - Gradually combine the closest clusters together, step by step, until everything is in one big cluster.
2. Divisive Hierarchical Clustering (Top-Down)
   - Start with all data points in one big cluster.
   - Gradually split the big cluster into smaller ones, step by step, until each point is its own cluster.

CLUSTER SIMILARITY: WARD'S METHOD (see the sketch after this section)
- The similarity of two clusters is based on the increase in squared error when the two clusters are merged.
- Less susceptible to noise and outliers.
- Biased towards globular clusters.
- The hierarchical analogue of K-means (can be used to initialize K-means).

HIERARCHICAL CLUSTERING: PROBLEMS AND LIMITATIONS
- Once a decision is made to combine two clusters, it cannot be undone.
- No objective function is directly minimized.
- Different schemes have problems with one or more of the following:
  1. Sensitivity to noise and outliers
  2. Difficulty handling different-sized clusters and convex shapes
  3. Breaking large clusters
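A short sketch of bottom-up (agglomerative) clustering with Ward's method, assuming SciPy (the points are invented):

```python
# Agglomerative clustering with Ward's criterion: linkage() merges the
# closest clusters step by step; the result can be drawn as a dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 1]])
Z = linkage(points, method="ward")               # bottom-up merge tree
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)  # e.g. [1 1 2 2 3]: the two tight pairs plus the lone point
```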
TEXT MINING
- Also known as TEXT ANALYSIS: a procedure for converting unstructured text into a structured form in order to recognize significant patterns and new insights for easy analysis.
- By transforming data into information that machines can learn from, text mining automates the process of classifying texts by sentiment, subject, and intent.
- It uses diverse methodologies for processing text; one of these is NATURAL LANGUAGE PROCESSING (NLP).

NATURAL LANGUAGE PROCESSING (NLP)
- Gives computers the ability to interpret, manipulate, and comprehend human language.
- It helps computers communicate with humans in their own language and scales other language-related tasks.

3 STEPS IN THE TEXT MINING PROCESS
Step 1 - Establish the corpus.
Step 2 - Create the term-document matrix.
Step 3 - Extract knowledge from the term-document matrix.

SOCIAL MEDIA SENTIMENT ANALYSIS
- The process of collecting and analyzing information on the emotions behind how people talk about your brand on social media.

2 MAIN TYPES OF TEXTUAL INFORMATION
- FACTS AND OPINIONS

Sentiment Analysis on Opinions
- Aims to identify the sentiment behind a statement.

Importance of Opinions
- Opinions are important because whenever we need to make a decision, we want to hear others' opinions.

Web and user-generated content in the rise of sentiment analysis
- People have become prolific producers of text data as a result of the proliferation of social media, blogs, forums, and review websites.

Key concepts of social media sentiment analysis
- Social media sentiment analysis involves extracting and understanding the emotions or opinions expressed in social media posts.
- Opinion - the subjective statement or sentiment expressed by a user about a particular subject or entity.
- Target - the specific subject or aspect of the entity that the opinion is directed towards.
- Opinion Holder - the individual or entity who expresses the opinion.
- Sentiment Classification - determining whether the sentiment of a text is positive, negative, or neutral (see the sketch below).
- Emotion Detection - identifying specific emotions like joy, anger, or sadness.

Applications of Social Media Analysis
- Talent Acquisition and Recruitment
- Employee Engagement and Retention
- Crisis Management and Reputation Management
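A toy sketch of sentiment classification using a fixed word lexicon (the word lists and posts are invented; real systems use NLP models rather than hand-made lists):

```python
# Toy lexicon-based sentiment classifier: count positive vs. negative
# words in a post and label it positive, negative, or neutral.
POSITIVE = {"great", "love", "excellent", "happy"}
NEGATIVE = {"poor", "hate", "terrible", "angry"}

def classify(post: str) -> str:
    words = post.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(classify("I love the new benefits, great move"))      # positive
print(classify("Terrible rollout and poor communication"))  # negative
```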