BIF524/CSC463 Fall 2024 Data Mining Statistical Learning PDF
Document Details
Eileen Marie Hanna, PhD
Summary
This document provides lecture notes on data mining and statistical learning. It covers various types of quantitative and qualitative attributes and defines concepts like nominal, binary, and ordinal attributes. The document also discusses how to visualize data for better understanding.
Full Transcript
Fall 2024 BIF524/CSC463 Data Mining Statistical Learning
Eileen Marie Hanna, PhD
03/09/2023

Attribute
- A data field representing a characteristic of a data object.
- Observed values of an attribute are also called observations.
- A set of attributes describing an object is called an attribute vector or feature vector.
- The type of an attribute is determined by its possible values.

Example data (the slides repeat this table for each attribute type below):

patient ID  gender  ...  occupation  insurance coverage  phone number  temperature  pain level  weight  smoker  heart disease history
357         F       ...  dentist     3                   123456        38.2         moderate    62      0       1
358         M       ...  architect   2                   234567        40.3         severe      78      0       0
359         F       ...  designer    1                   345678        37.5         minimal     56      1       1
340         M       ...  manager     3                   456789        41           unbearable  88      1       0
341         F       ...  nurse       1                   567890        38.8         moderate    64      0       1
...         ...     ...  ...         ...                 ...           ...          ...         ...     ...     ...

Qualitative attributes

Nominal attributes (also referred to as categorical):
- can have symbols or names of things as values (e.g., occupation).
- can also be represented as numbers coding for the possible names/categories.
- It makes no sense to compute the mean or median of such attributes; the mode, however, can be calculated.

Binary attributes:
- can take only two possible values.
- Also called Boolean attributes when the states are true (1) or false (0), which typically means that the attribute is present or absent for a given object, respectively (e.g., smoker, heart disease history).

Ordinal attributes:
- have possible values with a meaningful order or ranking among them (e.g., pain level: minimal, moderate, severe, unbearable).
- The magnitude of the difference between successive values is not specified.

Quantitative attributes

Numeric, i.e., a measurable quantity that can be represented by integer or real values.

Interval-scaled attributes:
- are measured on a scale of equal-size units.
- do not have a true zero-point (e.g., a temperature of 0 °C does not mean that there is no temperature).

Ratio-scaled attributes:
- have ordered numeric values with an inherent zero-point, so one value can be expressed as a multiple of another (e.g., weight).
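The attribute type determines which summary statistics are meaningful. As a minimal sketch of that idea, the code below takes the mode of a nominal column, the median of an ordinal column, and the mean of a ratio-scaled column, using the values from the example table. The ranking in `PAIN_ORDER` and the helper names `ordinal_median` and `summarize` are assumptions made for illustration; they do not come from the slides.

```python
from statistics import mean, mode

# Values copied from the example patient table.
gender = ["F", "M", "F", "M", "F"]                                        # nominal (binary)
pain_level = ["moderate", "severe", "minimal", "unbearable", "moderate"]  # ordinal
weight = [62, 78, 56, 88, 64]                                             # ratio-scaled

# Assumed ranking of the pain levels, lowest to highest (illustrative).
PAIN_ORDER = ["minimal", "moderate", "severe", "unbearable"]


def ordinal_median(values, order):
    """Median of ordinal values: sort by rank and take the middle element.

    For an even number of values the lower of the two middle elements is
    returned, because averaging ranks is not meaningful for ordinal data.
    """
    ranked = sorted(values, key=order.index)
    return ranked[(len(ranked) - 1) // 2]


def summarize():
    """One meaningful summary statistic per attribute type."""
    return {
        "gender_mode": mode(gender),                             # mode suits nominal data
        "pain_median": ordinal_median(pain_level, PAIN_ORDER),   # order is enough for a median
        "weight_mean": mean(weight),                             # mean needs numeric data
    }


if __name__ == "__main__":
    print(summarize())
```

The mean is deliberately not computed for `gender` or `pain_level`: the first has no order at all, and the second has an order but no defined distance between successive levels.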
Discrete vs continuous attributes

Another classification of attributes:
- discrete: has a finite (e.g., hair_color) or countably infinite set of values, which may or may not be represented as integers (e.g., ZIP_code, customerID).
- continuous: any attribute that is not discrete, i.e., numeric values typically represented as real numbers.

Boxplots: five-number summary of a distribution

[Figure: boxplot labeled with minimum, Q1, median, Q3, maximum, and potential outliers]

Statistical Learning

Covers tools for understanding data. These tools can be categorized as:
- Supervised: involves building a statistical model to predict or estimate an output, given one or more inputs.
- Unsupervised: involves learning the structure and relationships in given inputs, with no supervising output.

“Wage” dataset

Includes factors believed to be related to the wages of a group of males from the Atlantic region of the US, e.g., age, education level, etc.

[Figure: wage as a function of age, with an estimate of the average wage at each age]
On average, wage increases with age until around 60 years and starts to decline afterwards.

[Figure: wage as a function of year]
A slow and steady (roughly linear) increase in wages, of approx. $10,000 between 2003 and 2009.

[Figure: wage as a function of education level]
Education level ranges from 1, the lowest (no high school diploma), to 5, the highest (advanced graduate degree). On average, wage increases with education level.

Which of those factors can be used to predict the wage of an employee?
Quantitative (or continuous) output -> regression problem

“Smarket” stock market dataset

Daily movements in the S&P 500 stock index over a 5-year period, between 2001 and 2005.
Predict whether the index will increase or decrease based on the percentage change in the index over the past 5 days.
In this case, we are not predicting a numerical value. We are predicting whether a given day's stock performance falls into the Up bucket or the Down bucket. -> classification problem

[Figure: boxplots of the percentage change in the stock index on the previous day, shown separately for days on which the market decreased on the following day and days on which it increased]

Is it enough to base our predictions on the previous day's change only?

Little association between previous days' returns and present returns. This is to be expected: if returns on successive days were strongly correlated, a simple trading strategy could exploit them for profit.
What more can we say through mining techniques? (later)

Gene expression dataset

A case where we only have input variables with no output -> clustering problem.
The dataset consists of the expression values of 6,830 genes for each of 64 cell lines.
Can we group cell lines based on their gene expression measurements?
Knowing that we have thousands of values per cell line, how can we visualize such data?
Principal components summarize data in fewer dimensions.

[Figure: cell lines plotted on the first two principal components]
Here, the first two components $Z_1$ and $Z_2$ summarize the 6,830 expression measurements for each cell line in just two dimensions.
Tradeoff: some information is lost, but efficient visualization is gained.
Groups (clusters) of cell lines can be identified and then further examined for similarities in their cancer types, the relationship between gene expression and cancer, etc.

We also know that the cell lines come from 14 different cancer types, but this information was not used in the previous graph. When added, we get a similar graph, but this time it shows that cell lines from the same cancer type tend to be grouped together -> independent verification of the analysis.
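The idea of summarizing many measurements per cell line with two principal components can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses synthetic data (20 rows, 6 columns, two built-in groups) rather than the gene expression dataset, and it finds the components by power iteration on the covariance matrix so that it needs nothing beyond the standard library. In practice a numerical library would normally be used. All function names are hypothetical.

```python
import random


def center(rows):
    """Subtract each column's mean so that every variable has mean zero."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def covariance(rows):
    """Sample covariance matrix (p x p) of already-centered rows."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(p)]
            for a in range(p)]


def top_component(cov, iters=500, seed=0):
    """Power iteration: largest eigenvalue of cov and its unit eigenvector."""
    p = len(cov)
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(p)]  # arbitrary non-zero start
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    return eigval, v


def first_two_components(rows):
    """Scores (Z1, Z2) of each observation on the first two principal components."""
    x = center(rows)
    cov = covariance(x)
    p = len(cov)
    l1, v1 = top_component(cov)
    # Deflation: remove the variance explained by the first component,
    # so the next power iteration converges to the second component.
    deflated = [[cov[a][b] - l1 * v1[a] * v1[b] for b in range(p)] for a in range(p)]
    _, v2 = top_component(deflated, seed=1)
    return [(sum(r[j] * v1[j] for j in range(p)),
             sum(r[j] * v2[j] for j in range(p))) for r in x]


# Synthetic stand-in for expression data: 20 "cell lines", 6 "genes".
# The first 10 rows are centered at +5, the last 10 at -5 (two groups).
random.seed(0)
data = [[random.gauss(5.0 if i < 10 else -5.0, 1.0) for _ in range(6)]
        for i in range(20)]
scores = first_two_components(data)

if __name__ == "__main__":
    for i, (z1, z2) in enumerate(scores):
        print(f"row {i:2d}: Z1 = {z1:7.2f}, Z2 = {z2:6.2f}")
```

Each row is reduced from six values to two, and in this synthetic setup the sign of the first score alone separates the two groups. The group labels were not used to compute the components, which mirrors the point that cancer type served only as an independent check.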
Notations

- $n$: number of distinct data points (observations) in a sample
- $p$: number of available variables (attributes)
- $x_{ij}$: value of the $j$th variable for the $i$th observation, with $i = 1, \dots, n$ and $j = 1, \dots, p$
- $\mathbf{X}$: $n \times p$ matrix whose $(i, j)$th element is $x_{ij}$
- $x_i$: vector containing the $p$ variables of the $i$th observation, represented as a column by default
- $\mathbf{x}_j$: vector of length $n$ containing the values of the $j$th variable for the $n$ observations
- $y_i$: the $i$th observation of the variable on which we want to make predictions (e.g., wage) -> $\mathbf{y} = (y_1, \dots, y_n)^T$ is the set of all $n$ observations in vector form.

The observed data can be represented as $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where each $x_i$ is a vector of length $p$.

Goal

The relationship between a quantitative response $Y$ and $p$ predictors $X = (X_1, X_2, \dots, X_p)$ can be written as:

$$Y = f(X) + \epsilon$$

where $\epsilon$ is a random error term that is independent of $X$ and has mean zero.

Let's go back to the “Wages” dataset

[Figure: observed points with the true relationship $f$ drawn as a blue curve]
The blue curve represents the true relationship $f$ (which is usually unknown) underlying the observed points.
Vertical lines correspond to the error terms $\epsilon$ (positive if the point lies above the curve); overall, the error mean is approx. zero.

For the “Wages” dataset: more input variables? Which ones?

[Figure: income as a (true) function of years of education and seniority]

Why estimate $f$, generally speaking?
- prediction
- inference

Prediction

Predict $Y$ given a set of inputs $X$. $Y$ can be predicted using:

$$\hat{Y} = \hat{f}(X)$$

where $\hat{f}$ is the resulting estimate for $f$ and $\hat{Y}$ is the prediction for $Y$.

Prediction: example

Let $X_1, \dots, X_p$ be the measured characteristics of a blood sample, and let $Y$ be a variable corresponding to the patient's risk of a severe adverse reaction to a drug.
In such settings, two factors determine the accuracy of $\hat{Y}$ as a prediction for $Y$:
- reducible error: $\hat{f}$ is usually not expected to be a perfect estimate of $f$ -> error. Reducible because we can improve the accuracy (i.e., reduce the error) by using more appropriate learning techniques.
- irreducible error: there will always be an irreducible error introduced by $\epsilon$, because $Y$ is also a function of $\epsilon$, which cannot be predicted by $X$. It is due to unmeasured or unmeasurable factors, e.g., the manufacturing variation of the drug itself or the patient's wellbeing.

Prediction: how?

The goal is to use appropriate learning techniques to minimize the reducible error.
Suppose that we have an estimate $\hat{f}$ and a set of predictors $X$ (both treated as fixed), leading to the prediction $\hat{Y} = \hat{f}(X)$.
The expected value (average) of the squared difference between the predicted and the actual value of $Y$ is given by:

$$E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}$$

where $\mathrm{Var}(\epsilon)$ is the variance of the error term $\epsilon$.

Inference

Understand how $Y$ changes when $X_1, \dots, X_p$ change, i.e., how $Y$ changes as a function of $X_1, \dots, X_p$.
In such cases, $\hat{f}$ cannot be treated as a black box!
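The split of the expected squared prediction error into a reducible and an irreducible part can be checked numerically. The sketch below is an illustration only: the "true" function `f`, the imperfect estimate `f_hat`, the noise level `sigma`, and the chosen input `x` are all made-up assumptions, since in practice the true relationship is unknown. It draws many values of Y at one fixed input and compares the average squared error with the sum of the two terms.

```python
import random


def f(x):
    """Assumed 'true' relationship (hypothetical; unknown in practice)."""
    return 2.0 * x + 1.0


def f_hat(x):
    """Assumed imperfect estimate of f (hypothetical)."""
    return 1.5 * x + 1.0


def simulate(x=4.0, sigma=2.0, n_draws=200_000, seed=0):
    """Average squared prediction error at a fixed input x, plus both error terms."""
    rng = random.Random(seed)
    y_hat = f_hat(x)  # the prediction is fixed once x and f_hat are fixed
    total = 0.0
    for _ in range(n_draws):
        y = f(x) + rng.gauss(0.0, sigma)  # Y = f(X) + eps, with eps of mean zero
        total += (y - y_hat) ** 2
    mse = total / n_draws
    reducible = (f(x) - f_hat(x)) ** 2    # depends on how good f_hat is
    irreducible = sigma ** 2              # Var(eps), independent of f_hat
    return mse, reducible, irreducible


if __name__ == "__main__":
    mse, reducible, irreducible = simulate()
    print(f"simulated E(Y - Y_hat)^2 : {mse:.3f}")
    print(f"reducible + irreducible  : {reducible + irreducible:.3f}")
```

Replacing `f_hat` with a better estimate shrinks only the reducible term. Even with `f_hat` equal to `f`, the simulated error stays close to `sigma ** 2`.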