MISY662 Lecture 8 PDF

EDA: Missing Data and Data Imputation

Missing Data

Missing data (AKA missing values) occur when you don't have data stored for certain variables. Data can go missing due to:
- Incomplete data entry
- Equipment malfunctions
- Lost files
- etc.

In any dataset there are usually missing values. In quantitative research, missing values appear as blank cells in your spreadsheet or as NA in R.

Are Missing Data Problematic?

Yes, because:
- They can prevent you from doing proper data analysis and visualization.
- They can sometimes cause sampling bias. This means results may not be generalizable outside of your study, because your data come from an unrepresentative sample.

Types of missing values

1. Missing completely at random (MCAR): Missing data are randomly distributed across the variable and unrelated to other variables.
   Example: You have a dataset that contains missing values in every row of every variable.
2. Missing at random (MAR): Missing data are not randomly distributed, but they are accounted for by other observed variables.
   Example: Missing data in your dataset is only missing in 3 specific related variables.
3. Missing not at random (MNAR): Missing data systematically differ from the observed values.
   Example: You have a dataset that contains missing values because participants refused to answer.

How to Deal with Missing Data

To tidy up your missing data, your options usually include accepting, removing, or recreating the missing data.
- Acceptance: Leave data as is.
- Deletion: Delete all cases (participants) with missing data from analyses.
- Imputation: Use other data to fill in the missing data.

Functions / Packages used to Explore Missing Data

Explore missing values in a dataset. There are different basic functions we can use to explore missing data in a dataset.

Function                                              Explanation
is.na()                                               Returns TRUE if value is NA, otherwise FALSE
which(is.na())                                        Returns the indices (i.e. place) of missing values
na.rm = TRUE                                          Removes missing values for a specific calculation
na.omit()                                             Removes rows in dataset with missing values; also returns indices (i.e. place) of omitted rows
n_miss()                                              Returns the number of missing values
n_complete()                                          Returns the number of complete values
pct_complete()                                        Returns the percentage of complete values
pct_miss()                                            Returns the percentage of missing values
pct_miss_case()                                       Returns the percentage of rows with missing values
pct_complete_case()                                   Returns the percentage of rows without missing values
vis_miss()                                            Returns a heat map of missing values
gg_miss_var(dataset_name, show_pct = TRUE)            Returns the percentage of missing values in each variable in a dataset, in graph form
gg_miss_fct(dataset_name, categorical_variable)       Plots the number of missing values for each variable by a categorical variable in dataset

Missing Value and Data Imputation Review

Using the mean value to impute the data is simple and may work with small datasets, but it is not the best practice. Other imputation methods, such as pmm (imputation by predictive mean matching), are a better practice for replacing missing values in a numeric variable.

Replacing the missing values with the mean works with numerical variables but NOT with categorical variables. Other imputation methods, such as rf (imputation by random forest), are a better practice for replacing missing values in a categorical variable.

PMM: imputation by Predictive Mean Matching

- PMM is an attractive way to do multiple imputation for missing data, especially for imputing numerical variables that are NOT normally distributed.
- PMM produces imputed values that are much more like real values. If the original variable is bounded by 0 and 100, the imputed values will also be bounded by 0 and 100. And if the real values are discrete (like # of children), the imputed values will also be discrete. That's because the imputed values are real values that are borrowed from individuals with real data.

RF: imputation by Random Forest

- RF is a machine learning technique used to address missing data.
- RF has almost every quality of being the best imputation technique.
- RF can handle non-linearity in data as well as outliers.
- RF can handle missing values in both numerical and categorical variables.
- RF has a built-in feature selection technique.

RF: how does it work?

Step 1: The missing values are filled with the mean of the respective column for continuous data and with the most frequent value for categorical data.
Step 2: The dataset is divided into 2 parts: training data consisting of the observed values, and the missing data, which is used for prediction. These training and prediction sets are then fed to Random Forest, and the predicted data is imputed at the appropriate places. After imputing all values, one iteration is complete.
Step 3: Step 2 is repeated until a stopping condition is reached. The iteration process ensures the algorithm operates on better quality data in subsequent iterations. The process continues until the sum of squared differences between the current and previous imputation increases, or a specific iteration limit is reached. Usually it takes 5-6 iterations to impute the data well.

Using the mice Package to Impute the Data

The mice package provides a good method to impute. It uses more sophisticated prediction techniques for this purpose instead of simply using the mean. For this course we will focus on pmm for numerical variables and rf for categorical variables.

The mice function format:

    mice(data, m = 5, method = c("pmm", "rf", "rf", "pmm"), maxit = 20)

Arguments:
- data: the dataset that contains missing values.
- m: number of multiple imputations. The default is m = 5, meaning we will get 5 different clean and imputed datasets.
- method: specifies the imputation method to be used for each column in the data. Columns that do not need imputing have an empty method ("").
- maxit: a scalar giving the number of iterations. The default is 5; the higher the value, the more accurate the prediction.

The package also provides a visual representation of the missing values in a dataset using md.pattern(dataset_name).

Quiz03 Study Guide

ncol()
    # of columns
nrow()
    # of rows
[function name illegible](dataset_name)
    specification of variables
summary()
    summary statistics
dataset_name$variable_name <- as.factor(dataset_name$variable_name)
    stores categorical variables as factors
dataset_name <- read.csv("dataset_name.csv")
    assigns a dataset to a variable
attach(dataset_name)
    lets R know what we're referring to
var(variable_name)
    variance of variable (use with attach)
library(package_name)
    loads package
options(scipen = 999)
    turns off scientific notation so values print in full
stat.desc(dataset_name)
    summary statistics
describe()
    describe function for summary statistics
describeBy(dataset_name, variable_name)
    summary stats of dataset by categorical variable
data(dataset_name)
    load a new dataset
is.na(dataset_name)
    explore dataset for NA values
is.na(dataset_name$variable_name)
    NA values in variable
which(is.na(dataset_name$variable_name))
    indices of missing values in specific variable
mean(dataset_name$variable_name, na.rm = TRUE)
    mean without missing values
n_miss(dataset_name)
    returns number of missing values
n_complete(dataset_name)
    number of complete values
pct_miss(dataset_name)
    percentage of missing values
pct_complete(dataset_name)
    percentage of complete values
vis_miss(dataset_name)
    heat map of missing values
gg_miss_var(dataset_name, show_pct = TRUE)
    percentage of missing values in a graph
library(ggplot2)
    loads ggplot2
gg_miss_fct(dataset_name, categorical_variable_name)
    plot of missing values for each variable by categorical variable
library(mice)
    loads mice
md.pattern(dataset_name)
    plot of missing values
new_variable_name <- mice(dataset_name, m = ..., method = c("pmm", "rf", "rf", "pmm"), maxit = ...)
    create new variable using mice function (rf for categorical, pmm for numerical)
new_variable_name$imp$variable_name
    retrieve the imputed values for a variable in each predicted dataset
final_new_dataset <- complete(new_variable_name)
    choose most accurate predicted dataset to store the imputed data (an optional second argument selects which of the m datasets; the default is the first)
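
Worked example: exploring missing values. The basic functions from the exploration table above can be tried on a small data frame. This is a minimal sketch that uses base R only; the data frame "survey" and its values are made up for illustration, and the counts and percentages are computed by hand here rather than with helpers such as n_miss() and pct_miss().

```r
# Toy data frame with missing values (hypothetical example data).
survey <- data.frame(
  age    = c(23, NA, 31, 45, NA, 52),
  income = c(40, 55, NA, 72, 61, 80),
  group  = factor(c("a", "b", "a", NA, "b", "a"))
)

# TRUE where a value is missing, FALSE otherwise.
age_missing <- is.na(survey$age)

# Indices (i.e. place) of the missing values in one variable.
age_missing_idx <- which(is.na(survey$age))

# Mean that drops missing values for this one calculation.
age_mean <- mean(survey$age, na.rm = TRUE)

# Number and percentage of missing values in the whole dataset.
total_missing <- sum(is.na(survey))
pct_missing   <- 100 * mean(is.na(survey))

# Deletion: keep only the rows without any missing value.
complete_rows <- na.omit(survey)

# Mean imputation for a numeric variable (simple, but not best practice).
age_mean_imputed <- ifelse(is.na(survey$age), age_mean, survey$age)
```

Note that mean imputation only makes sense for the numeric columns; the factor column "group" would need a different method.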
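
Worked example: the idea behind PMM. The claim that PMM borrows real values from individuals with real data can be illustrated with a short sketch. This is a simplified, single-imputation version written for illustration, not the procedure mice runs: it assumes one complete numeric predictor and a plain linear model, and the function name pmm_sketch, its arguments, and the data frame "scores" are all hypothetical.

```r
# Simplified predictive mean matching for one numeric variable.
# Assumption: the predictor column has no missing values.
pmm_sketch <- function(data, target, predictor, donors = 3) {
  observed <- !is.na(data[[target]])

  # Fit a linear model on the cases where the target is observed.
  fit <- lm(reformulate(predictor, response = target), data = data[observed, ])

  # Predicted mean for every case, observed or missing.
  predicted <- predict(fit, newdata = data)

  imputed <- data[[target]]
  for (i in which(!observed)) {
    # Observed cases whose predicted value is closest to that of case i.
    distance <- abs(predicted[observed] - predicted[i])
    nearest  <- order(distance)[seq_len(min(donors, length(distance)))]

    # Borrow the real value of one randomly chosen donor.
    pick <- nearest[sample.int(length(nearest), 1)]
    imputed[i] <- data[[target]][observed][pick]
  }
  imputed
}

set.seed(1)
scores <- data.frame(
  hours = c(1, 2, 3, 4, 5, 6, 7, 8),
  score = c(10, 25, NA, 48, 55, NA, 90, 100)
)
scores$score_imputed <- pmm_sketch(scores, "score", "hours")
```

Because every imputed value is copied from an observed case, the filled-in scores stay within the range of the real scores and keep their discreteness.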
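
Worked example: the mice workflow. The steps from the study guide (md.pattern, mice, $imp, complete) can be run end to end on the airquality dataset that ships with R. This is a sketch under a few assumptions: the mice package is installed, the seed value is arbitrary, and only "pmm" is used because the two columns with missing values (Ozone and Solar.R) are numeric. A factor column with missing values would take "rf" in its position of the method vector, which relies on an additional random forest package being installed.

```r
library(mice)

# Pattern of missing values: Ozone and Solar.R have NAs, the rest are complete.
md.pattern(airquality)

# One method per column, in column order: "pmm" for the two numeric columns
# with missing values, "" for the columns that need no imputation.
imp_methods <- c("pmm", "pmm", "", "", "", "")

# m = 5 imputed datasets, 20 iterations; seed fixed so the run is repeatable.
imputed <- mice(airquality, m = 5, method = imp_methods, maxit = 20,
                seed = 123, printFlag = FALSE)

# Imputed values for Ozone: one row per missing case, one column per dataset.
head(imputed$imp$Ozone)

# Extract the first of the five completed datasets.
final_data <- complete(imputed, 1)
```

Comparing the columns of imputed$imp$Ozone is one way to judge which of the five datasets to pass to complete().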
