R and Data Mining: Examples and Case Studies PDF

Summary

This book, "R and Data Mining: Examples and Case Studies," by Yanchang Zhao, is a practical guide to data mining using the R programming language. It provides numerous examples and case studies on various data mining techniques, including data import/export, exploration, visualization, decision trees, regression, clustering, and time series analysis. The book also touches on text and social network analysis and covers big data concepts and parallel computing.

Full Transcript


R and Data Mining: Examples and Case Studies
Yanchang Zhao
[email protected]
http://www.RDataMining.com
October 20, 2015
© 2012-2015 Yanchang Zhao. Published by Elsevier in December 2012. All rights reserved.

Messages from the Author

Case studies: The case studies are not included in this online version. They are reserved exclusively for a book version published by Elsevier in December 2012.

Latest version: The latest online version is available at the links below. See the websites also for an R Reference Card for Data Mining.
http://www.rdatamining.com
http://www2.rdatamining.com (for readers having no access to the above website)

R code, data and FAQs: R code, data and FAQs are provided at the links below.
http://www.rdatamining.com/books/rdm
http://www2.rdatamining.com/r-and-data-mining-examples-and-case-studies.html

Chapters/sections to add: topic modelling and stream graph; spatial data analysis; performance evaluation of classification/prediction models (with ROC and AUC); parallel computing and big data. Please let me know if some topics are interesting to you but not yet covered by this book.

Questions and feedback: If you have any questions or comments, or come across any problems with this document or its book version, please feel free to post them to the RDataMining group below or email them to me. Thanks.

Discussion forum: Please join our discussions on R and data mining at the RDataMining group (16,000+ members, as of October 2015) on LinkedIn.

Twitter: Follow @RDataMining on Twitter (2,200+ followers, as of October 2015).

A sister book: See a new edited book titled Data Mining Applications with R at the links below, which features 15 real-world applications of data mining with R.
http://www.rdatamining.com/books/dmar
http://www2.rdatamining.com/data-mining-applications-with-r.html

Contents

List of Figures
List of Abbreviations
1 Introduction
  1.1 Data Mining
  1.2 R
    1.2.1 R Basics
    1.2.2 RStudio
  1.3 Datasets
    1.3.1 The Iris Dataset
    1.3.2 The Bodyfat Dataset
2 Data Import and Export
  2.1 Save and Load R Data
  2.2 Import from and Export to .CSV Files
  2.3 Import Data from SAS
  2.4 Import/Export via ODBC
    2.4.1 Read from Databases
    2.4.2 Output to and Input from EXCEL Files
  2.5 Read and Write EXCEL Files with Package xlsx
  2.6 Further Readings
3 Data Exploration and Visualization
  3.1 Have a Look at Data
  3.2 Explore Individual Variables
  3.3 Explore Multiple Variables
  3.4 More Explorations
  3.5 Save Charts into Files
  3.6 Further Readings
4 Decision Trees and Random Forest
  4.1 Decision Trees with Package party
  4.2 Decision Trees with Package rpart
  4.3 Random Forest
5 Regression
  5.1 Linear Regression
  5.2 Logistic Regression
  5.3 Generalized Linear Regression
  5.4 Non-linear Regression
6 Clustering
  6.1 The k-Means Clustering
  6.2 The k-Medoids Clustering
  6.3 Hierarchical Clustering
  6.4 Density-based Clustering
7 Outlier Detection
  7.1 Univariate Outlier Detection
  7.2 Outlier Detection with LOF
  7.3 Outlier Detection by Clustering
  7.4 Outlier Detection from Time Series
  7.5 Discussions
8 Time Series Analysis and Mining
  8.1 Time Series Data in R
  8.2 Time Series Decomposition
  8.3 Time Series Forecasting
  8.4 Time Series Clustering
    8.4.1 Dynamic Time Warping
    8.4.2 Synthetic Control Chart Time Series Data
    8.4.3 Hierarchical Clustering with Euclidean Distance
    8.4.4 Hierarchical Clustering with DTW Distance
  8.5 Time Series Classification
    8.5.1 Classification with Original Data
    8.5.2 Classification with Extracted Features
    8.5.3 k-NN Classification
  8.6 Discussions
  8.7 Further Readings
9 Association Rules
  9.1 Basics of Association Rules
  9.2 The Titanic Dataset
  9.3 Association Rule Mining
  9.4 Removing Redundancy
  9.5 Interpreting Rules
  9.6 Visualizing Association Rules
  9.7 Further Readings
10 Text Mining
  10.1 Retrieving Text from Twitter
  10.2 Transforming Text
  10.3 Stemming Words
  10.4 Building a Term-Document Matrix
  10.5 Frequent Terms and Associations
  10.6 Word Cloud
  10.7 Clustering Words
  10.8 Clustering Tweets
    10.8.1 Clustering Tweets with the k-means Algorithm
    10.8.2 Clustering Tweets with the k-medoids Algorithm
  10.9 Packages, Further Readings and Discussions
11 Social Network Analysis
  11.1 Network of Terms
  11.2 Network of Tweets
  11.3 Two-Mode Network
  11.4 Discussions and Further Readings
12 Case Study I: Analysis and Forecasting of House Price Indices
13 Case Study II: Customer Response Prediction and Profit Optimization
14 Case Study III: Predictive Modeling of Big Data with Limited Memory
15 Online Resources
  15.1 R Reference Cards
  15.2 R
  15.3 Data Mining
  15.4 Data Mining with R
  15.5 Classification/Prediction with R
  15.6 Time Series Analysis with R
  15.7 Association Rule Mining with R
  15.8 Spatial Data Analysis with R
  15.9 Text Mining with R
  15.10 Social Network Analysis with R
  15.11 Data Cleansing and Transformation with R
  15.12 Big Data and Parallel Computing with R
Bibliography
General Index
Package Index
Function Index
Appendix: Book Promotion - Data Mining Applications with R

List of Figures

1.1 RStudio
3.1 Histogram
3.2 Density
3.3 Pie Chart
3.4 Bar Chart
3.5 Boxplot
3.6 Scatter Plot
3.7 Scatter Plot with Jitter
3.8 Smooth Scatter Plot
3.9 A Matrix of Scatter Plots
3.10 3D Scatter Plot
3.11 Heat Map
3.12 Level Plot
3.13 Contour
3.14 3D Surface
3.15 Parallel Coordinates
3.16 Parallel Coordinates with Package lattice
3.17 Scatter Plot with Package ggplot2
4.1 Decision Tree
4.2 Decision Tree (Simple Style)
4.3 Decision Tree with Package rpart
4.4 Selected Decision Tree
4.5 Prediction Result
4.6 Error Rate of Random Forest
4.7 Variable Importance
4.8 Margin of Predictions
5.1 Australian CPIs in Year 2008 to 2010
5.2 Prediction with Linear Regression Model - 1
5.3 A 3D Plot of the Fitted Model
5.4 Prediction of CPIs in 2011 with Linear Regression Model
5.5 Prediction with Generalized Linear Regression Model
6.1 Results of k-Means Clustering
6.2 Clustering with the k-medoids Algorithm - I
6.3 Clustering with the k-medoids Algorithm - II
6.4 Cluster Dendrogram
6.5 Density-based Clustering - I
6.6 Density-based Clustering - II
6.7 Density-based Clustering - III
6.8 Prediction with Clustering Model
7.1 Univariate Outlier Detection with Boxplot
7.2 Outlier Detection - I
7.3 Outlier Detection - II
7.4 Density of Outlier Factors
7.5 Outliers in a Biplot of First Two Principal Components
7.6 Outliers in a Matrix of Scatter Plots
7.7 Outliers with k-Means Clustering
7.8 Outliers in Time Series Data
8.1 A Time Series of AirPassengers
8.2 Seasonal Component
8.3 Time Series Decomposition
8.4 Time Series Forecast
8.5 Alignment with Dynamic Time Warping
8.6 Six Classes in Synthetic Control Chart Time Series
8.7 Hierarchical Clustering with Euclidean Distance
8.8 Hierarchical Clustering with DTW Distance
8.9 Decision Tree
8.10 Decision Tree with DWT
9.1 A Scatter Plot of Association Rules
9.2 A Balloon Plot of Association Rules
9.3 A Graph of Association Rules
9.4 A Graph of Items
9.5 A Parallel Coordinates Plot of Association Rules
10.1 Frequent Terms
10.2 Word Cloud
10.3 Clustering of Words
10.4 Clusters of Tweets
11.1 A Network of Terms - I
11.2 A Network of Terms - II
11.3 Cohesive Blocks
11.4 Cliques
11.5 Cliques
11.6 Distribution of Degree
11.7 A Network of Tweets - I
11.8 A Network of Tweets - II
11.9 A Network of Tweets - III
11.10 A Two-Mode Network of Terms and Tweets - I
11.11 A Two-Mode Network of Terms and Tweets - II

List of Abbreviations

ARIMA: Autoregressive integrated moving average
ARMA: Autoregressive moving average
AVF: Attribute value frequency
CLARA: Clustering for large applications
CRISP-DM: Cross industry standard process for data mining
DBSCAN: Density-based spatial clustering of applications with noise
DTW: Dynamic time warping
DWT: Discrete wavelet transform
GLM: Generalized linear model
IQR: Interquartile range, i.e., the range between the first and third quartiles
LOF: Local outlier factor
PAM: Partitioning around medoids
PCA: Principal component analysis
STL: Seasonal-trend decomposition based on Loess
TF-IDF: Term frequency-inverse document frequency

Chapter 1 Introduction

This book introduces the use of R for data mining.
It presents many examples of various data mining functionalities in R and three case studies of real-world applications. The intended audience of this book is postgraduate students, researchers, data miners and data scientists who are interested in using R for their data mining research and projects. We assume that readers already have a basic idea of data mining and some basic experience with R. We hope that this book will encourage more and more people to use R for data mining work in their research and applications.

This chapter introduces basic concepts and techniques for data mining, including a data mining process and popular data mining techniques. It also presents R and its packages, functions and task views for data mining. Finally, some datasets used in this book are described.

1.1 Data Mining

Data mining is the process of discovering interesting knowledge from large amounts of data [Han and Kamber, 2000]. It is an interdisciplinary field with contributions from many areas, such as statistics, machine learning, information retrieval, pattern recognition and bioinformatics. Data mining is widely used in many domains, such as retail, finance, telecommunication and social media.

The main techniques for data mining include classification and prediction, clustering, outlier detection, association rules, sequence analysis, time series analysis and text mining, as well as some newer techniques such as social network analysis and sentiment analysis. Detailed introductions to data mining techniques can be found in textbooks on data mining [Han and Kamber, 2000, Hand et al., 2001, Witten and Frank, 2005].

In real-world applications, a data mining process can be broken into six major phases: business understanding, data understanding, data preparation, modeling, evaluation and deployment, as defined by CRISP-DM (Cross Industry Standard Process for Data Mining, http://www.crisp-dm.org/). This book focuses on the modeling phase, with data exploration and model evaluation involved in some chapters. Readers who want more information on data mining are referred to the online resources in Chapter 15.

1.2 R

R (http://www.r-project.org/) [R Core Team, 2015b] is a free software environment for statistical computing and graphics. It provides a wide variety of statistical and graphical techniques. R can be easily extended with 7,324 packages available on CRAN (http://cran.r-project.org/) as of October 20, 2015. In addition, many packages are provided on other websites, such as Bioconductor (http://www.bioconductor.org/), and many more are under development at R-Forge (http://r-forge.r-project.org/) and GitHub (https://github.com/). More details about R are available in An Introduction to R [Venables et al., 2015] (http://cran.r-project.org/doc/manuals/R-intro.pdf) and R Language Definition [R Core Team, 2015d] (http://cran.r-project.org/doc/manuals/R-lang.pdf) at the CRAN website. R is widely used in both academia and industry.

To help users find out which R packages to use, the CRAN Task Views (http://cran.r-project.org/web/views/) are a good guide. They provide collections of packages for different tasks. Some Task Views related to data mining are: Machine Learning & Statistical Learning, Cluster Analysis & Finite Mixture Models, Time Series Analysis, Natural Language Processing, Multivariate Statistics, and Analysis of Spatial Data. Another guide to R for data mining is the R Reference Card for Data Mining, which provides a comprehensive indexing of R packages and functions for data mining, categorized by their functionalities. Its latest version is available at http://www.rdatamining.com/docs and http://www2.rdatamining.com/.
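Most of the techniques in the following chapters rely on add-on packages from CRAN or the Task Views above. As a minimal sketch (not part of the original text; package party is simply one example of a package used later in this book), a package is installed once per machine with install.packages() and loaded into each R session with library():

# install a package from CRAN (needed only once per machine)
install.packages("party")
# load the installed package into the current R session
library(party)
# optionally, bring all installed packages up to date
update.packages()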
Readers who want more information on R are referred to the online resources in Chapter 15.

1.2.1 R Basics

Please refer to An Introduction to R [Venables et al., 2015] for an introduction to the basics of R.

1.2.2 RStudio

RStudio (http://www.rstudio.com/) is an integrated development environment (IDE) for R and runs on various operating systems like Windows, Mac OS X and Linux. It is a very useful and powerful tool for R programming, and readers are therefore encouraged to use RStudio when learning from this book or doing their projects, although all the provided code can run without it. What you normally need is the RStudio Desktop open source edition, which is free of charge.

When RStudio is launched for the first time, you can see a window similar to Figure 1.1. There are four panels:

- Source panel (top left), which shows your R source code. If you cannot see the source panel, you can open one by clicking menu "File", "New File" and then "R Script". You can run a line or a selection of R code by clicking the "Run" button at the top of the source panel, or by pressing "Ctrl + Enter".
- Console panel (bottom left), which shows outputs and system messages displayed in a normal R console.
- Environment/History/Presentation panel (top right), whose three tabs show respectively all objects and functions loaded in R, a history of submitted R code, and presentations generated with R.
- Files/Plots/Packages/Help/Viewer panel (bottom right), whose tabs show respectively a list of files, plots, installed R packages, help documentation and local web content.

Figure 1.1: RStudio

It is always good practice to begin R programming with an RStudio project, which is a folder where you put your R code, data files and figures. To create a new project, click the "Project" button at the top-right corner and then choose "New Project". After that, select "Create project from new directory" and then "Empty Project". After typing a directory name, which will also be your project name, click "Create Project" to create your project folder and files. If you open an existing project, RStudio will automatically set the working directory to the project directory, which is very convenient. After that, create three folders as below:

- code, where to put your R source code;
- data, where to put your datasets; and
- figures, where to put produced diagrams.

In addition to the above three folders, which are useful for most projects, you may create the additional folders below, depending on your project and preference:

- rawdata, where to put all raw data;
- models, where to put all produced analytics models; and
- reports, where to put your analysis reports.

1.3 Datasets

Some datasets used in this book are briefly described in this section.

1.3.1 The Iris Dataset

The iris dataset has been used for classification in many research publications. It consists of 50 samples from each of three classes of iris flowers [Frank and Asuncion, 2010]. One class is linearly separable from the other two, while the latter are not linearly separable from each other. There are five attributes in the dataset: sepal length in cm, sepal width in cm, petal length in cm, petal width in cm, and class: Iris Setosa, Iris Versicolour, and Iris Virginica.
A detailed description of the dataset and the research publications citing it can be found at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris).

Below we have a look at the structure of the dataset with str(). Note that all variable names, package names and function names in R are case sensitive.

> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

From the output, we can see that there are 150 observations (records, or rows) and 5 variables (or columns) in the dataset. The first four variables are numeric. The last one, Species, is categorical (called a "factor" in R) and has three levels of values.

1.3.2 The Bodyfat Dataset

Bodyfat is a dataset available in package TH.data [Hothorn, 2015]. It has 71 rows, and each row contains information about one person. It contains the following 10 numeric columns:

- age: age in years.
- DEXfat: body fat measured by DXA, the response variable.
- waistcirc: waist circumference.
- hipcirc: hip circumference.
- elbowbreadth: breadth of the elbow.
- kneebreadth: breadth of the knee.
- anthro3a: sum of logarithm of three anthropometric measurements.
- anthro3b: sum of logarithm of three anthropometric measurements.
- anthro3c: sum of logarithm of three anthropometric measurements.
- anthro4: sum of logarithm of three anthropometric measurements.

The value of DEXfat is to be predicted by the other variables.

> data("bodyfat", package = "TH.data")
> str(bodyfat)
'data.frame': 71 obs. of 10 variables:
 $ age         : num 57 65 59 58 60 61 56 60 58 62 ...
 $ DEXfat      : num 41.7 43.3 35.4 22.8 36.4 ...
 $ waistcirc   : num 100 99.5 96 72 89.5 83.5 81 89 80 79 ...
 $ hipcirc     : num 112 116.5 108.5 96.5 100.5 ...
 $ elbowbreadth: num 7.1 6.5 6.2 6.1 7.1 6.5 6.9 6.2 6.4 7 ...
 $ kneebreadth : num 9.4 8.9 8.9 9.2 10 8.8 8.9 8.5 8.8 8.8 ...
 $ anthro3a    : num 4.42 4.63 4.12 4.03 4.24 3.55 4.14 4.04 3.91 3.66 ...
 $ anthro3b    : num 4.95 5.01 4.74 4.48 4.68 4.06 4.52 4.7 4.32 4.21 ...
 $ anthro3c    : num 4.5 4.48 4.6 3.91 4.15 3.64 4.31 4.47 3.47 3.6 ...
 $ anthro4     : num 6.13 6.37 5.82 5.66 5.91 5.14 5.69 5.7 5.49 5.25 ...

Chapter 2 Data Import and Export

This chapter shows how to import foreign data into R and export R objects to other formats. At first, examples are given to demonstrate saving R objects to and loading them from .Rdata files. After that, it demonstrates importing data from and exporting data to .CSV files, SAS databases, ODBC databases and EXCEL files.

2.1 Save and Load R Data

Data in R can be saved as .Rdata files with function save(), and .Rdata files can be reloaded into R with load(). With the code below, we first create a new object a as a numeric sequence (1, 2, ..., 10) and a second new object b as a vector of characters ('a', 'b', 'c', 'd', 'e'). Object letters is a built-in vector in R of the 26 English letters, and letters[1:5] returns the first five letters. We then save them to a file and remove them from R with function rm(). After that, we reload both a and b from the file and print their values.
> a <- 1:10
> b <- letters[1:5]
> save(a, b, file="./data/mydatafile.Rdata")
> rm(a, b)
> load("./data/mydatafile.Rdata")
> print(a)
 [1]  1  2  3  4  5  6  7  8  9 10
> print(b)
[1] "a" "b" "c" "d" "e"

An alternative way to save and load R data objects is to use functions saveRDS() and readRDS(). They work in a similar way to save() and load(). The differences are: 1) multiple R objects can be saved into one single file with save(), but only one object can be saved in a file with saveRDS(); and 2) readRDS() enables us to restore the data under a different object name, while load() restores the data under the same object name as when it was saved.

> a <- 1:10
> saveRDS(a, file="./data/mydatafile2.rds")
> a2 <- readRDS("./data/mydatafile2.rds")
> print(a2)
 [1]  1  2  3  4  5  6  7  8  9 10

R also provides function save.image() to save everything in the current workspace into a single file, which is very convenient for saving your current work and resuming it later, if the data loaded into R are not very big.

2.2 Import from and Export to .CSV Files

The data frame is the data format that we mostly deal with in R. A data frame is similar to a table in a database, with each row being an observation (or record) and each column being a variable (or feature). The example below demonstrates saving a data frame into a file and then reloading it into R. At first, we create three vectors, an integer vector, a numeric (real) vector and a character vector, use function data.frame() to build them into data frame df1 and save it into a .CSV file with write.csv(). Function sample(5) produces a random sample of five numbers out of 1 to 5. Column names in the data frame are then set with function names(). After that, we reload the data frame from the file into a new data frame df2 with read.csv(). Note that the very first column printed below contains the row names, created automatically by R.

> var1 <- sample(5)
> var2 <- var1 / 10
> var3 <- c("R", "and", "Data Mining", "Examples", "Case Studies")
> df1 <- data.frame(var1, var2, var3)
> names(df1) <- c("Var.Int", "Var.Num", "Var.Char")
> write.csv(df1, "./data/mydatafile3.csv", row.names = FALSE)
> df2 <- read.csv("./data/mydatafile3.csv")
> print(df2)
  Var.Int Var.Num     Var.Char
1       3     0.3            R
2       4     0.4          and
3       1     0.1  Data Mining
4       2     0.2     Examples
5       5     0.5 Case Studies

2.3 Import Data from SAS

Package foreign [R Core Team, 2015a] provides function read.ssd() for importing SAS datasets (.sas7bdat files) into R. However, the following points are essential to make the import successful.

- SAS must be available on your computer, and read.ssd() will call SAS to read SAS datasets and import them into R.
- The file name of a SAS dataset has to be no longer than eight characters. Otherwise, the import will fail. There is no such limit when importing from a .CSV file.
- During import, variable names longer than eight characters are truncated to eight characters, which often makes it difficult to know the meanings of variables. One way to get around this issue is to import the variable names separately from a .CSV file, which keeps the full names of variables.

An empty .CSV file with variable names can be generated with the following method:

1. Create an empty SAS table dumVariables from dumData as follows.

data work.dumVariables;
   set work.dumData(obs=0);
run;

2. Export table dumVariables as a .CSV file.

The example below demonstrates importing data from a SAS dataset. Assume that there is a SAS data file dumData.sas7bdat and a .CSV file dumVariables.csv in folder "Current working directory/data".
> library(foreign) # for importing SAS data
> # the path of SAS on your computer
> sashome <- "path/to/your/SAS/installation"
> filepath <- "./data"
> # filename should be no more than 8 characters, without extension
> fileName <- "dumData"
> # read data from a SAS dataset
> a <- read.ssd(filepath, fileName, sascmd = file.path(sashome, "sas.exe"))
> print(a)

Note that the variable names above are truncated. The full names can be imported from a .CSV file with the following code.

> # read variable names from a .CSV file
> variableFileName <- "./data/dumVariables.csv"
> myNames <- read.csv(variableFileName)
> names(a) <- names(myNames)
> print(a)

Although one can export a SAS dataset to a .CSV file and then import data from it, there are problems when there are special formats in the data, such as a value of "$100,000" for a numeric variable. In this case, it would be better to import from a .sas7bdat file. However, variable names may need to be imported into R separately, as above.

Another way to import data from a SAS dataset is to use function read.xport() to read a file in SAS Transport (XPORT) format.

2.4 Import/Export via ODBC

Package RODBC provides connections to ODBC databases [Ripley and Lapsley, 2015].

2.4.1 Read from Databases

Below is an example of reading from an ODBC database. Function odbcConnect() sets up a connection to the database, sqlQuery() sends an SQL query to the database, and odbcClose() closes the connection.

> library(RODBC)
> connection <- odbcConnect(dsn="servername", uid="userid", pwd="******")
> query <- "SELECT * FROM tablename"  # an example SQL query
> # or read the query from a file, e.g. with readLines()
> myData <- sqlQuery(connection, query, errors=TRUE)
> odbcClose(connection)

There are also sqlSave() and sqlUpdate() for writing to or updating a table in an ODBC database.

2.4.2 Output to and Input from EXCEL Files

An example of writing data to and reading data from EXCEL files is shown below, where a sheet name needs to be provided in function sqlFetch().

> library(RODBC)
> filename <- "./data/myExcelFile.xls"  # path of the EXCEL file
> xlsFile <- odbcConnectExcel(filename, readOnly = FALSE)
> sqlSave(xlsFile, a, rownames = FALSE)
> b <- sqlFetch(xlsFile, "a")
> odbcClose(xlsFile)

Note that there might be a limit of 65,536 rows when writing to an EXCEL file.

2.5 Read and Write EXCEL Files with Package xlsx

While package RODBC can read and write EXCEL files on Windows, it does not work directly on Mac OS X, because an ODBC driver for EXCEL is not provided by default on Mac. However, package xlsx supports reading and writing Excel 2007 and Excel 97/2000/XP/2003 files [Dragulescu, 2014], with no additional drivers required. It works both on Windows and on Mac OS X.

The example below demonstrates the creation of an EXCEL file iris.xlsx with three sheets. Function library() loads an R package (or library), and table() returns the frequencies of values in a vector. We can see that there are three species, each having 50 observations. Observations of species "setosa" are extracted first with function subset() and then saved into sheet "setosa" in the EXCEL file with function write.xlsx(). Row names are excluded using row.names=F. Then the data of the other two species are saved into the same file, but in different sheets. When writing the second and third sheets, we need to use append=T to add new sheets to the existing file, instead of overwriting it. Finally, we read from sheet "setosa" with function read.xlsx() and show the first six observations with function head().
> library(xlsx)
> table(iris$Species)

    setosa versicolor  virginica
        50         50         50

> setosa <- subset(iris, Species == "setosa")
> write.xlsx(setosa, file="./data/iris.xlsx", sheetName="setosa", row.names=F)
> versicolor <- subset(iris, Species == "versicolor")
> write.xlsx(versicolor, file="./data/iris.xlsx", sheetName="versicolor",
+            row.names=F, append=T)
> virginica <- subset(iris, Species == "virginica")
> write.xlsx(virginica, file="./data/iris.xlsx", sheetName="virginica",
+            row.names=F, append=T)
> a <- read.xlsx("./data/iris.xlsx", sheetName="setosa")
> head(a)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

2.6 Further Readings

For more details on data import and export, please refer to R Data Import/Export [R Core Team, 2015c] (http://cran.r-project.org/doc/manuals/R-data.pdf), which covers importing data from text files, XML files, spreadsheet-like data, various statistical systems, relational databases, binary files, image files, connections and network interfaces.

Chapter 3 Data Exploration and Visualization

This chapter shows examples of data exploration with R. It starts with inspecting the dimensionality, structure and data of an R object, followed by basic statistics and various charts like pie charts and histograms. Exploration of multiple variables is then demonstrated, including grouped distributions, grouped boxplots, scatter plots and pairs plots. After that, examples are presented of level plots, contour plots and 3D plots. It also shows how to save charts into files of various formats.

3.1 Have a Look at Data

The iris data is used in this chapter for demonstration of data exploration with R. See Section 1.3.1 for details of the iris data. We first check the size and structure of the data. In the code below, function dim() returns the dimensionality of the data, which shows that there are 150 observations (or rows or records) and 5 variables (or columns). The names of the variables are returned by names(). Functions str() and attributes() return the structure and attributes of the data.

> dim(iris)
[1] 150   5
> names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> attributes(iris)
$names
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

$row.names
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150

$class
[1] "data.frame"

Next, we have a look at the first five rows of data.
> iris[1:5, ]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

The first or last rows of data can be retrieved with head() or tail(), which by default return the first or last 6 rows. Alternatively, we can get a certain number of rows by setting the second parameter of either function. For example, the first 10 rows will be returned with head(iris, 10).

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> tail(iris)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica

A random sample of the data can be retrieved with function sample() in the code below.

> ## draw a sample of 5 rows
> idx <- sample(1:nrow(iris), 5)
> idx
[1] 142 100 103 128 138
> iris[idx, ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
142          6.9         3.1          5.1         2.3  virginica
100          5.7         2.8          4.1         1.3 versicolor
103          7.1         3.0          5.9         2.1  virginica
128          6.1         3.0          4.9         1.8  virginica
138          6.4         3.1          5.5         1.8  virginica

We can also retrieve the values of a single column. For example, the first 10 values of Sepal.Length can be obtained in three different ways below.

> iris[1:10, "Sepal.Length"]
 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
> iris[1:10, 1]
 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
> iris$Sepal.Length[1:10]
 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9

3.2 Explore Individual Variables

The distribution of every numeric variable can be checked with function summary(), which returns the minimum, maximum, mean, median, and the first (25%) and third (75%) quartiles. Taking Sepal.Length as an example, the result below shows that its minimum value is 4.3 and its maximum 7.9. Its first quartile ("1st Qu.") is 5.1, which means that 25% of all records have Sepal.Length below 5.1. Similarly, a value of 6.4 in the third quartile ("3rd Qu.") indicates that 75% of all records have Sepal.Length below 6.4. It has a median of 5.8, which means that half of the records have Sepal.Length below 5.8. The value of the mean shows that the arithmetic mean (calculated by adding all values together and dividing by the number of values) of Sepal.Length is 5.843. For factors (or categorical variables), summary() shows the frequency of every level. In the result below, we can see that each of the three species, "setosa", "versicolor" and "virginica", has 50 observations.

> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

The mean, median and range can also be obtained with functions mean(), median() and range(). Quartiles and percentiles are supported by function quantile(), as below.

> quantile(iris$Sepal.Length)
  0%  25%  50%  75% 100%
 4.3  5.1  5.8  6.4  7.9
> quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))
 10%  30%  65%
4.80 5.27 6.20

Then we check the variance of Sepal.Length with var(), and also check its distribution with a histogram and a density plot, using functions hist() and density().

> var(iris$Sepal.Length)
[1] 0.6856935
> hist(iris$Sepal.Length)

Figure 3.1: Histogram

> plot(density(iris$Sepal.Length))

Figure 3.2: Density

The frequency of a factor variable can be calculated with function table(), and then plotted as a pie chart with pie() or a bar chart with barplot().

> table(iris$Species)

    setosa versicolor  virginica
        50         50         50

> pie(table(iris$Species))

Figure 3.3: Pie Chart

> barplot(table(iris$Species))

Figure 3.4: Bar Chart

3.3 Explore Multiple Variables

After checking the distributions of individual variables, we then investigate the relationships between two variables. Below we calculate the covariance and correlation between variables with cov() and cor().

> cov(iris$Sepal.Length, iris$Petal.Length)
[1] 1.274315
> cov(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063
> cor(iris$Sepal.Length, iris$Petal.Length)
[1] 0.8717538
> cor(iris[,1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

Correlation shows whether and how strongly a pair of variables are related to each other. It ranges from -1 to +1. The closer the correlation is to +1 (or -1), the more strongly the two variables are positively (or negatively) related. When it is close to zero, there is no relationship between them. From the results above, we can see that the correlation between Sepal.Length and Petal.Length is 0.87, which means that they are positively related to each other. Similarly, Sepal.Length and Petal.Width are highly related, as are Petal.Length and Petal.Width. In contrast, Sepal.Width is weakly negatively related to the other three.

Next, we compute summary statistics of Sepal.Length for every Species with aggregate().

> aggregate(Sepal.Length ~ Species, summary, data=iris)
     Species Sepal.Length.Min. Sepal.Length.1st Qu. Sepal.Length.Median
1     setosa             4.300                4.800               5.000
2 versicolor             4.900                5.600               5.900
3  virginica             4.900                6.225               6.500
  Sepal.Length.Mean Sepal.Length.3rd Qu. Sepal.Length.Max.
1             5.006                5.200             5.800
2             5.936                6.300             7.000
3             6.588                6.900             7.900

We then use function boxplot() to plot a box plot, also known as a box-and-whisker plot, to show the median, first and third quartiles of a distribution (i.e., the 50%, 25% and 75% points in the cumulative distribution), and outliers. The bar in the middle is the median. The box shows the interquartile range (IQR), which is the range between the 75% and 25% observations.
The result shows that the three species have different distributions of Sepal.Length: "virginica" tends to have large Sepal.Length, "setosa" has small Sepal.Length, and "versicolor" sits in between. This suggests that the variable can be used to predict the species of flowers.

> boxplot(Sepal.Length ~ Species, data=iris, xlab="Species", ylab="Sepal.Length")

Figure 3.5: Boxplot

A scatter plot can be drawn for two numeric variables with plot() as below. Using function with(), we don't need to add "iris$" before variable names. In the code below, the colors (col) and symbols (pch) of points are set to Species.

> with(iris, plot(Sepal.Length, Sepal.Width, col=Species, pch=as.numeric(Species)))
> ## same function as above
> # plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species, pch=as.numeric(iris$Species))

Figure 3.6: Scatter Plot

When there are many points, some of them may overlap. We can use jitter() to add a small amount of noise to the data before plotting.

> plot(jitter(iris$Sepal.Length), jitter(iris$Sepal.Width))

Figure 3.7: Scatter Plot with Jitter

A smooth scatter plot can be plotted with function smoothScatter(), which produces a smoothed color density representation of the scatter plot, obtained through a kernel density estimate.

> smoothScatter(iris$Sepal.Length, iris$Sepal.Width)

Figure 3.8: Smooth Scatter Plot

A matrix of scatter plots can be produced with function pairs(), where each sub-figure is the scatter plot of a pair of variables.

> pairs(iris)

Figure 3.9: A Matrix of Scatter Plots
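As a small extension of the pairs plot above (a sketch added here, not taken from the book; the colour and symbol choices are arbitrary), the panels of the scatter-plot matrix can be coloured by species, so that the class structure of the iris data is visible in every pairwise view:

# scatter-plot matrix of the four numeric variables,
# with point colour and symbol determined by species
pairs(iris[, 1:4], col = as.numeric(iris$Species), pch = as.numeric(iris$Species))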
