R_in_Action data analysis.pdf
R in Action
Data analysis and graphics with R
Robert I. Kabacoff
Manning, Shelter Island

Licensed to Mark Jacobson

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road, PO Box 261
Shelter Island, NY 11964
Email: [email protected]

©2011 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Development editor: Sebastian Stirling
Copyeditor: Liz Welch
Typesetter: Composure Graphics
Cover designer: Marija Tudor

ISBN: 9781935182399
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 -- MAL -- 16 15 14 13 12 11

brief contents

Part I Getting started
1 Introduction to R
2 Creating a dataset
3 Getting started with graphs
4 Basic data management
5 Advanced data management

Part II Basic methods
6 Basic graphs
7 Basic statistics

Part III Intermediate methods
8 Regression
9 Analysis of variance
10 Power analysis
11 Intermediate graphs
12 Resampling statistics and bootstrapping

Part IV Advanced methods
13 Generalized linear models
14 Principal components and factor analysis
15 Advanced methods for missing data
16 Advanced graphics

contents

preface
acknowledgments
about this book
about the cover illustration

Part I Getting started

1 Introduction to R
1.1 Why use R?
1.2 Obtaining and installing R
1.3 Working with R: Getting started; Getting help; The workspace; Input and output
1.4 Packages: What are packages?; Installing a package; Loading a package; Learning about a package
1.5 Batch processing
1.6 Using output as input: reusing results
1.7 Working with large datasets
1.8 Working through an example
1.9 Summary

2 Creating a dataset
2.1 Understanding datasets
2.2 Data structures: Vectors; Matrices; Arrays; Data frames; Factors; Lists
2.3 Data input: Entering data from the keyboard; Importing data from a delimited text file; Importing data from Excel; Importing data from XML; Webscraping; Importing data from SPSS; Importing data from SAS; Importing data from Stata; Importing data from netCDF; Importing data from HDF5; Accessing database management systems (DBMSs); Importing data via Stat/Transfer
2.4 Annotating datasets: Variable labels; Value labels
2.5 Useful functions for working with data objects
2.6 Summary

3 Getting started with graphs
3.1 Working with graphs
3.2 A simple example
3.3 Graphical parameters: Symbols and lines; Colors; Text characteristics; Graph and margin dimensions
3.4 Adding text, customized axes, and legends: Titles; Axes; Reference lines; Legend; Text annotations
3.5 Combining graphs: Creating a figure arrangement with fine control
3.6 Summary

4 Basic data management
4.1 A working example
4.2 Creating new variables
4.3 Recoding variables
4.4 Renaming variables
4.5 Missing values: Recoding values to missing; Excluding missing values from analyses
4.6 Date values: Converting dates to character variables; Going further
4.7 Type conversions
4.8 Sorting data
4.9 Merging datasets: Adding columns; Adding rows
4.10 Subsetting datasets: Selecting (keeping) variables; Excluding (dropping) variables; Selecting observations; The subset() function; Random samples
4.11 Using SQL statements to manipulate data frames
4.12 Summary

5 Advanced data management
5.1 A data management challenge
5.2 Numerical and character functions: Mathematical functions; Statistical functions; Probability functions; Character functions; Other useful functions; Applying functions to matrices and data frames
5.3 A solution for our data management challenge
5.4 Control flow: Repetition and looping; Conditional execution
5.5 User-written functions
5.6 Aggregation and restructuring: Transpose; Aggregating data; The reshape package
5.7 Summary

Part II Basic methods

6 Basic graphs
6.1 Bar plots: Simple bar plots; Stacked and grouped bar plots; Mean bar plots; Tweaking bar plots; Spinograms
6.2 Pie charts
6.3 Histograms
6.4 Kernel density plots
6.5 Box plots: Using parallel box plots to compare groups; Violin plots
6.6 Dot plots
6.7 Summary

7 Basic statistics
7.1 Descriptive statistics: A menagerie of methods; Descriptive statistics by group; Visualizing results
7.2 Frequency and contingency tables: Generating frequency tables; Tests of independence; Measures of association; Visualizing results; Converting tables to flat files
7.3 Correlations: Types of correlations; Testing correlations for significance; Visualizing correlations
7.4 t-tests: Independent t-test; Dependent t-test; When there are more than two groups
7.5 Nonparametric tests of group differences: Comparing two groups; Comparing more than two groups
7.6 Visualizing group differences
7.7 Summary

Part III Intermediate methods

8 Regression
8.1 The many faces of regression: Scenarios for using OLS regression; What you need to know
8.2 OLS regression: Fitting regression models with lm(); Simple linear regression; Polynomial regression; Multiple linear regression; Multiple linear regression with interactions
8.3 Regression diagnostics: A typical approach; An enhanced approach; Global validation of linear model assumption; Multicollinearity
8.4 Unusual observations: Outliers; High leverage points; Influential observations
8.5 Corrective measures: Deleting observations; Transforming variables; Adding or deleting variables; Trying a different approach
8.6 Selecting the “best” regression model: Comparing models; Variable selection
8.7 Taking the analysis further: Cross-validation; Relative importance
8.8 Summary

9 Analysis of variance
9.1 A crash course on terminology
9.2 Fitting ANOVA models: The aov() function; The order of formula terms
9.3 One-way ANOVA: Multiple comparisons; Assessing test assumptions
9.4 One-way ANCOVA: Assessing test assumptions; Visualizing the results
9.5 Two-way factorial ANOVA
9.6 Repeated measures ANOVA
9.7 Multivariate analysis of variance (MANOVA): Assessing test assumptions; Robust MANOVA
9.8 ANOVA as regression
9.9 Summary

10 Power analysis
10.1 A quick review of hypothesis testing
10.2 Implementing power analysis with the pwr package: t-tests; ANOVA; Correlations; Linear models; Tests of proportions; Chi-square tests; Choosing an appropriate effect size in novel situations
10.3 Creating power analysis plots
10.4 Other packages
10.5 Summary

11 Intermediate graphs
11.1 Scatter plots: Scatter plot matrices; High-density scatter plots; 3D scatter plots; Bubble plots
11.2 Line charts
11.3 Correlograms
11.4 Mosaic plots
11.5 Summary

12 Resampling statistics and bootstrapping
12.1 Permutation tests
12.2 Permutation test with the coin package: Independent two-sample and k-sample tests; Independence in contingency tables; Independence between numeric variables; Dependent two-sample and k-sample tests; Going further
12.3 Permutation tests with the lmPerm package: Simple and polynomial regression; Multiple regression; One-way ANOVA and ANCOVA; Two-way ANOVA
12.4 Additional comments on permutation tests
12.5 Bootstrapping
12.6 Bootstrapping with the boot package: Bootstrapping a single statistic; Bootstrapping several statistics
12.7 Summary

Part IV Advanced methods

13 Generalized linear models
13.1 Generalized linear models and the glm() function: The glm() function; Supporting functions; Model fit and regression diagnostics
13.2 Logistic regression: Interpreting the model parameters; Assessing the impact of predictors on the probability of an outcome; Overdispersion; Extensions
13.3 Poisson regression: Interpreting the model parameters; Overdispersion; Extensions
13.4 Summary

14 Principal components and factor analysis
14.1 Principal components and factor analysis in R
14.2 Principal components: Selecting the number of components to extract; Extracting principal components; Rotating principal components; Obtaining principal components scores
14.3 Exploratory factor analysis: Deciding how many common factors to extract; Extracting common factors; Rotating factors; Factor scores; Other EFA-related packages
14.4 Other latent variable models
14.5 Summary

15 Advanced methods for missing data
15.1 Steps in dealing with missing data
15.2 Identifying missing values
15.3 Exploring missing values patterns: Tabulating missing values; Exploring missing data visually; Using correlations to explore missing values
15.4 Understanding the sources and impact of missing data
15.5 Rational approaches for dealing with incomplete data
15.6 Complete-case analysis (listwise deletion)
15.7 Multiple imputation
15.8 Other approaches to missing data: Pairwise deletion; Simple (nonstochastic) imputation
15.9 Summary

16 Advanced graphics
16.1 The four graphic systems in R
16.2 The lattice package: Conditioning variables; Panel functions; Grouping variables; Graphic parameters; Page arrangement
16.3 The ggplot2 package
16.4 Interactive graphs: Interacting with graphs: identifying points; playwith; latticist; Interactive graphics with the iplots package; rggobi
16.5 Summary

afterword Into the rabbit hole
appendix A Graphic user interfaces
appendix B Customizing the startup environment
appendix C Exporting data from R
appendix D Creating publication-quality output
appendix E Matrix Algebra in R
appendix F Packages used in this book
appendix G Working with large datasets
appendix H Updating an R installation
references
index

preface

What is the use of a book, without pictures or conversations?
(Alice, Alice in Wonderland)

It’s wondrous, with treasures to satiate desires both subtle and gross; but it’s not for the timid.
(Q, “Q Who?” Star Trek: The Next Generation)

When I began writing this book, I spent quite a bit of time searching for a good quote to start things off. I ended up with two. R is a wonderfully flexible platform and language for exploring, visualizing, and understanding data. I chose the quote from Alice in Wonderland to capture the flavor of statistical analysis today: an interactive process of exploration, visualization, and interpretation. The second quote reflects the generally held notion that R is difficult to learn. What I hope to show you is that it doesn’t have to be.
R is broad and powerful, with so many analytic and graphic functions available (more than 50,000 at last count) that it easily intimidates both novice and experienced users alike. But there is rhyme and reason to the apparent madness. With guidelines and instructions, you can navigate the tremendous resources available, selecting the tools you need to accomplish your work with style, elegance, efficiency, and more than a little coolness.

I first encountered R several years ago, when applying for a new statistical consulting position. The prospective employer asked in the pre-interview material if I was conversant in R. Following the standard advice of recruiters, I immediately said yes, and set off to learn it. I was an experienced statistician and researcher, had 25 years’ experience as an SAS and SPSS programmer, and was fluent in a half dozen programming languages. How hard could it be? Famous last words.

As I tried to learn the language (as fast as possible, with an interview looming), I found either tomes on the underlying structure of the language or dense treatises on specific advanced statistical methods, written by and for subject-matter experts. The online help was written in a Spartan style that was more reference than tutorial. Every time I thought I had a handle on the overall organization and capabilities of R, I found something new that made me feel ignorant and small.

To make sense of it all, I approached R as a data scientist.
I thought about what it takes to successfully process, analyze, and understand data, including
- Accessing the data (getting the data into the application from multiple sources)
- Cleaning the data (coding missing data, fixing or deleting miscoded data, transforming variables into more useful formats)
- Annotating the data (in order to remember what each piece represents)
- Summarizing the data (getting descriptive statistics to help characterize the data)
- Visualizing the data (because a picture really is worth a thousand words)
- Modeling the data (uncovering relationships and testing hypotheses)
- Preparing the results (creating publication-quality tables and graphs)

Then I tried to understand how I could use R to accomplish each of these tasks. Because I learn best by teaching, I eventually created a website (www.statmethods.net) to document what I had learned.

Then, about a year ago, Marjan Bace (the publisher) called and asked if I would like to write a book on R. I had already written 50 journal articles, 4 technical manuals, numerous book chapters, and a book on research methodology, so how hard could it be? At the risk of sounding repetitive: famous last words.

The book you’re holding is the one that I wished I had so many years ago. I have tried to provide you with a guide to R that will allow you to quickly access the power of this great open source endeavor, without all the frustration and angst. I hope you enjoy it.

P.S. I was offered the job but didn’t take it. However, learning R has taken my career in directions that I could never have anticipated. Life can be funny.

acknowledgments

A number of people worked hard to make this a better book. They include
- Marjan Bace, Manning publisher, who asked me to write this book in the first place.
- Sebastian Stirling, development editor, who spent many hours on the phone with me, helping me organize the material, clarify concepts, and generally make the text more interesting. He also helped me through the many steps to publication.
- Karen Tegtmeyer, review editor, who helped obtain reviewers and coordinate the review process.
- Mary Piergies, who helped shepherd this book through the production process, and her team of Liz Welch, Susan Harkins, and Rachel Schroeder.
- Pablo Domínguez Vaselli, technical proofreader, who helped uncover areas of confusion and provided an independent and expert eye for testing code.
- The peer reviewers who spent hours of their own time carefully reading through the material, finding typos and making valuable substantive suggestions: Chris Williams, Charles Malpas, Angela Staples, PhD, Daniel Reis Pereira, Dr. D. H. van Rijn, Dr. Christian Marquardt, Amos Folarin, Stuart Jefferys, Dror Berel, Patrick Breen, Elizabeth Ostrowski, PhD, Atef Ouni, Carles Fenollosa, Ricardo Pietrobon, Samuel McQuillin, Landon Cox, Austin Ziegler, Rick Wagner, Ryan Cox, Sumit Pal, Philipp K. Janert, Deepak Vohra, and Sophie Mormede.
- The many Manning Early Access Program (MEAP) participants who bought the book before it was finished, asked great questions, pointed out errors, and made helpful suggestions.

Each contributor has made this a better and more comprehensive book.

I would also like to acknowledge the many software authors who have contributed to making R such a powerful data-analytic platform. They include not only the core developers, but also the selfless individuals who have created and maintain contributed packages, extending R’s capabilities greatly. Appendix F provides a list of the authors of contributed packages described in this book. In particular, I would like to mention John Fox, Hadley Wickham, Frank E. Harrell, Jr., Deepayan Sarkar, and William Revelle, whose works I greatly admire. I have tried to represent their contributions accurately, and I remain solely responsible for any errors or distortions inadvertently included in this book.
I really should have started this book by thanking my wife and partner, Carol Lynn. Although she has no intrinsic interest in statistics or programming, she read each chapter multiple times and made countless corrections and suggestions. No greater love has any person than to read multivariate statistics for another. Just as important, she suffered the long nights and weekends that I spent writing this book, with grace, support, and affection. There is no logical explanation why I should be this lucky.

There are two other people I would like to thank. One is my father, whose love of science was inspiring and who gave me an appreciation of the value of data. The other is Gary K. Burger, my mentor in graduate school. Gary got me interested in a career in statistics and teaching when I thought I wanted to be a clinician. This is all his fault.

about this book

If you picked up this book, you probably have some data that you need to collect, summarize, transform, explore, model, visualize, or present. If so, then R is for you! R has become the world-wide language for statistics, predictive analytics, and data visualization. It offers the widest range available of methodologies for understanding data, from the most basic to the most complex and bleeding edge.

As an open source project it’s freely available for a range of platforms, including Windows, Mac OS X, and Linux. It’s under constant development, with new procedures added daily. Additionally, R is supported by a large and diverse community of data scientists and programmers who gladly offer their help and advice to users.

Although R is probably best known for its ability to create beautiful and sophisticated graphs, it can handle just about any statistical problem. The base installation provides hundreds of data-management, statistical, and graphical functions out of the box.
But some of its most powerful features come from the thousands of extensions (packages) provided by contributing authors.

This breadth comes at a price. It can be hard for new users to get a handle on what R is and what it can do. Even the most experienced R user is surprised to learn about features they were unaware of.

R in Action provides you with a guided introduction to R, giving you a 2,000-foot view of the platform and its capabilities. It will introduce you to the most important functions in the base installation and more than 90 of the most useful contributed packages. Throughout the book, the goal is practical application: how you can make sense of your data and communicate that understanding to others. When you finish, you should have a good grasp of how R works and what it can do, and where you can go to learn more. You’ll be able to apply a variety of techniques for visualizing data, and you’ll have the skills to tackle both basic and advanced data analytic problems.

Who should read this book

R in Action should appeal to anyone who deals with data. No background in statistical programming or the R language is assumed. Although the book is accessible to novices, there should be enough new and practical material to satisfy even experienced R mavens.

Users without a statistical background who want to use R to manipulate, summarize, and graph data should find chapters 1–6, 11, and 16 easily accessible. Chapters 7 and 10 assume a one-semester course in statistics, and readers of chapters 8, 9, and 12–15 will benefit from two semesters of statistics. But I have tried to write each chapter in such a way that both beginning and expert data analysts will find something interesting and useful.

Roadmap

This book is designed to give you a guided tour of the R platform, with a focus on those methods most immediately applicable for manipulating, visualizing, and understanding data.
There are 16 chapters divided into 4 parts: “Getting started,” “Basic methods,” “Intermediate methods,” and “Advanced methods.” Additional topics are covered in eight appendices.

Chapter 1 begins with an introduction to R and the features that make it so useful as a data-analysis platform. The chapter covers how to obtain the program and how to enhance the basic installation with extensions that are available online. The remainder of the chapter is spent exploring the user interface and learning how to run programs interactively and in batches.

Chapter 2 covers the many methods available for getting data into R. The first half of the chapter introduces the data structures R uses to hold data, and how to enter data from the keyboard. The second half discusses methods for importing data into R from text files, web pages, spreadsheets, statistical packages, and databases.

Many users initially approach R because they want to create graphs, so we jump right into that topic in chapter 3. No waiting required. We review methods of creating graphs, modifying them, and saving them in a variety of formats.

Chapter 4 covers basic data management, including sorting, merging, and subsetting datasets, and transforming, recoding, and deleting variables.

Building on the material in chapter 4, chapter 5 covers the use of functions (mathematical, statistical, character) and control structures (looping, conditional execution) for data management. We then discuss how to write your own R functions and how to aggregate data in various ways.

Chapter 6 demonstrates methods for creating common univariate graphs, such as bar plots, pie charts, histograms, density plots, box plots, and dot plots. Each is useful for understanding the distribution of a single variable.

Chapter 7 starts by showing how to summarize data, including the use of descriptive statistics and cross-tabulations.
We then look at basic methods for understanding relationships between two variables, including correlations, t-tests, chi-square tests, and nonparametric methods.

Chapter 8 introduces regression methods for modeling the relationship between a numeric outcome variable and a set of one or more numeric predictor variables. Methods for fitting these models, evaluating their appropriateness, and interpreting their meaning are discussed in detail.

Chapter 9 considers the analysis of basic experimental designs through the analysis of variance and its variants. Here we are usually interested in how treatment combinations or conditions affect a numerical outcome variable. Methods for assessing the appropriateness of the analyses and visualizing the results are also covered.

A detailed treatment of power analysis is provided in chapter 10. Starting with a discussion of hypothesis testing, the chapter focuses on how to determine the sample size necessary to detect a treatment effect of a given size with a given degree of confidence. This can help you to plan experimental and quasi-experimental studies that are likely to yield useful results.

Chapter 11 expands on the material in chapter 6, covering the creation of graphs that help you to visualize relationships among two or more variables. This includes various types of 2D and 3D scatter plots, scatter-plot matrices, line plots, correlograms, and mosaic plots.

Chapter 12 presents analytic methods that work well in cases where data are sampled from unknown or mixed distributions, where sample sizes are small, where outliers are a problem, or where devising an appropriate test based on a theoretical distribution is too complex and mathematically intractable. They include both resampling and bootstrapping approaches, computer-intensive methods that are easily implemented in R.

Chapter 13 expands on the regression methods in chapter 8 to cover data that are not normally distributed.
The chapter starts with a discussion of generalized linear models and then focuses on cases where you’re trying to predict an outcome variable that is either categorical (logistic regression) or a count (Poisson regression).

One of the challenges of multivariate data problems is simplification. Chapter 14 describes methods of transforming a large number of correlated variables into a smaller set of uncorrelated variables (principal component analysis), as well as methods for uncovering the latent structure underlying a given set of variables (factor analysis). The many steps involved in an appropriate analysis are covered in detail.

In keeping with our attempt to present practical methods for analyzing data, chapter 15 considers modern approaches to the ubiquitous problem of missing data values. R supports a number of elegant approaches for analyzing datasets that are incomplete for various reasons. Several of the best are described here, along with guidance for which ones to use when and which ones to avoid.

Chapter 16 wraps up the discussion of graphics with presentations of some of R’s most advanced and useful approaches to visualizing data. This includes visual representations of very complex data using lattice graphs, an introduction to the new ggplot2 package, and a review of methods for interacting with graphs in real time.

The afterword points you to many of the best internet sites for learning more about R, joining the R community, getting questions answered, and staying current with this rapidly changing product.

Last, but not least, the eight appendices (A through H) extend the text’s coverage to include such useful topics as R graphic user interfaces, customizing and upgrading an R installation, exporting data to other applications, creating publication-quality output, using R for matrix algebra (à la MATLAB), and working with very large datasets.
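To give a feel for the kind of workflow the roadmap describes, here is a minimal sketch that summarizes a variable and cross-tabulates two others (chapter 7 territory) and then fits a simple regression (chapter 8 territory). It is not taken from the book’s listings; the choice of the built-in mtcars dataset and of the variables mpg, wt, cyl, and gear is an assumption made purely for illustration.

```r
# Illustrative sketch only: the dataset and variable choices are assumptions,
# not examples reproduced from the book.
data(mtcars)   # built-in dataset that ships with a standard R installation

# Descriptive statistics for one numeric variable
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Cross-tabulation of two categorical variables
cyl_by_gear <- table(cylinders = mtcars$cyl, gears = mtcars$gear)
print(cyl_by_gear)

# Simple linear regression: fuel economy as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))
```

A real analysis would go further than this, for example by checking whether the regression model is appropriate before interpreting its coefficients.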
The examples

In order to make this book as broadly applicable as possible, I have chosen examples from a range of disciplines, including psychology, sociology, medicine, biology, business, and engineering. None of these examples require a specialized knowledge of that field.

The datasets used in these examples were selected because they pose interesting questions and because they’re small. This allows you to focus on the techniques described and quickly understand the processes involved. When you’re learning new methods, smaller is better.

The datasets are either provided with the base installation of R or available through add-on packages that are available online. The source code for each example is available from www.manning.com/RinAction. To get the most out of this book, I recommend that you try the examples as you read them.

Finally, there is a common maxim that states that if you ask two statisticians how to analyze a dataset, you’ll get three answers. The flip side of this assertion is that each answer will move you closer to an understanding of the data. I make no claim that a given analysis is the best or only approach to a given problem. Using the skills taught in this text, I invite you to play with the data and see what you can learn. R is interactive, and the best way to learn is to experiment.

Code conventions

The following typographical conventions are used throughout this book:

A monospaced font is used for code listings that should be typed as is.

A monospaced font is also used within the general text to denote code words or previously defined objects.

Italics within code listings indicate placeholders. You should replace them with appropriate text and values for the problem at hand. For example, path_to_my_file would be replaced with the actual path to a file on your computer.

R is an interactive language that indicates readiness for the next line of user input with a prompt (> by default).
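The prompt convention can be illustrated with a short sketch. The commented lines show how an interactive listing might appear on the page, and the uncommented lines show what you would actually type; the variable names x and avg are hypothetical.

```r
# How an interactive session might be printed in a listing (prompt included):
#   > x <- c(1, 2, 3)
#   > mean(x)
#   [1] 2
#
# What you actually type at the console (no leading prompt):
x <- c(1, 2, 3)
avg <- mean(x)
print(avg)
```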
Many of the listings in this book capture interactive sessions. When you see code lines that start with >, don’t type the prompt. Code annotations are used in place of inline comments (a common convention in Manning books). Additionally, some annotations appear with numbered bullets like q that refer to explanations appearing later in the text. To save room or make text more legible, the output from interactive sessions may include additional white space or omit text that is extraneous to the point under discussion. Author Online Purchase of R in Action includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and from other users. To access the forum and sub- scribe to it, point your web browser to www.manning.com/RinAction. This page pro- vides information on how to get on the forum once you’re registered, what kind of help is available, and the rules of conduct on the forum. Manning’s commitment to our readers is to provide a venue where a meaningful dialog between individual readers and between readers and the author can take place. It isn’t a commitment to any specific amount of participation on the part of the author, whose contribution to the AO forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions, lest his interest stray! The AO forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print. About the author Dr. Robert Kabacoff is Vice President of Research for Management Research Group, an international organizational development and consulting firm. He has more than 20 years of experience providing research and statistical consultation to organizations in health care, financial services, manufacturing, behavioral sciences, government, and academia. Prior to joining MRG, Dr. 
Kabacoff was a professor of psychology at Nova Southeastern University in Florida, where he taught graduate courses in quantitative methods and statistical programming. For the past two years, he has managed Quick-R, an R tutorial website.

About the cover illustration

The figure on the cover of R in Action is captioned “A man from Zadar.” The illustration is taken from a reproduction of an album of Croatian traditional costumes from the mid-nineteenth century by Nikola Arsenovic, published by the Ethnographic Museum in Split, Croatia, in 2003. The illustrations were obtained from a helpful librarian at the Ethnographic Museum in Split, itself situated in the Roman core of the medieval center of the town: the ruins of Emperor Diocletian’s retirement palace from around AD 304. The book includes finely colored illustrations of figures from different regions of Croatia, accompanied by descriptions of the costumes and of everyday life.

Zadar is an old Roman-era town on the northern Dalmatian coast of Croatia. It’s over 2,000 years old and served for hundreds of years as an important port on the trading route from Constantinople to the West. Situated on a peninsula framed by small Adriatic islands, the city is picturesque and has become a popular tourist destination with its architectural treasures of Roman ruins, moats, and old stone walls. The figure on the cover wears blue woolen trousers and a white linen shirt, over which he dons a blue vest and jacket trimmed with the colorful embroidery typical for this region. A red woolen belt and cap complete the costume.

Dress codes and lifestyles have changed over the last 200 years, and the diversity by region, so rich at the time, has faded away. It’s now hard to tell apart the inhabitants of different continents, let alone of different hamlets or towns separated by only a few miles.
Perhaps we have traded cultural diversity for a more varied personal life, and certainly for a more varied and fast-paced technological life.

Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by illustrations from old books and collections like this one.

Part 1 Getting started

Welcome to R in Action! R is one of the most popular platforms for data analysis and visualization currently available. It is free, open-source software, with versions for Windows, Mac OS X, and Linux operating systems. This book will provide you with the skills needed to master this comprehensive software, and apply it effectively to your own data.

The book is divided into four sections. Part 1 covers the basics of installing the software, learning to navigate the interface, importing data, and massaging it into a useful format for further analysis.

Chapter 1 will familiarize you with the R environment. The chapter begins with an overview of R and the features that make it such a powerful platform for modern data analysis. After briefly describing how to obtain and install the software, the chapter explores the user interface through a series of simple examples. Next, you’ll learn how to enhance the functionality of the basic installation with extensions (called contributed packages) that can be freely downloaded from online repositories. The chapter ends with an example that allows you to test your new skills.

Once you’re familiar with the R interface, the next challenge is to get your data into the program. In today’s information-rich world, data can come from many sources and in many formats. Chapter 2 covers the wide variety of methods available for importing data into R. The first half of the chapter introduces the data structures R uses to hold data and describes how to input data manually.
The second half discusses methods for importing data from text files, web pages, spreadsheets, statistical packages, and databases.

From a workflow point of view, it would probably make sense to discuss data management and data cleaning next. However, many users approach R for the first time out of an interest in its powerful graphics capabilities. Rather than frustrating that interest and keeping you waiting, we dive right into graphics in chapter 3. The chapter reviews methods for creating graphs, customizing them, and saving them in a variety of formats. The chapter describes how to specify the colors, symbols, lines, fonts, axes, titles, labels, and legends used in a graph, and ends with a description of how to combine several graphs into a single plot.

Once you’ve had a chance to try out R’s graphics capabilities, it is time to get back to the business of analyzing data. Data rarely comes in a readily usable format. Significant time must often be spent combining data from different sources, cleaning messy data (miscoded data, mismatched data, missing data), and creating new variables (combined variables, transformed variables, recoded variables) before the questions of interest can be addressed. Chapter 4 covers basic data management tasks in R, including sorting, merging, and subsetting datasets, and transforming, recoding, and deleting variables.

Chapter 5 builds on the material in chapter 4. It covers the use of numeric (arithmetic, trigonometric, and statistical) and character functions (string subsetting, concatenation, and substitution) in data management. A comprehensive example is used throughout this section to illustrate many of the functions described. Next, control structures (looping, conditional execution) are discussed and you will learn how to write your own R functions. Writing custom functions allows you to extend R’s capabilities by encapsulating many programming steps into a single, flexible function call.
Finally, powerful methods for reorganizing (reshaping) and aggregating data are discussed. Reshaping and aggregation are often useful in preparing data for further analyses.

After having completed part 1, you will be thoroughly familiar with programming in the R environment. You will have the skills needed to enter and access data, clean it up, and prepare it for further analyses. You will also have experience creating, customizing, and saving a variety of graphs.

1 Introduction to R

This chapter covers
Installing R
Understanding the R language
Running programs

How we analyze data has changed dramatically in recent years. With the advent of personal computers and the internet, the sheer volume of data we have available has grown enormously. Companies have terabytes of data on the consumers they interact with, and governmental, academic, and private research institutions have extensive archival and survey data on every manner of research topic. Gleaning information (let alone wisdom) from these massive stores of data has become an industry in itself. At the same time, presenting the information in easily accessible and digestible ways has become increasingly challenging.

The science of data analysis (statistics, psychometrics, econometrics, machine learning) has kept pace with this explosion of data. Before personal computers and the internet, new statistical methods were developed by academic researchers who published their results as theoretical papers in professional journals. It could take years for these methods to be adapted by programmers and incorporated into the statistical packages widely available to data analysts. Today, new methodologies appear daily. Statistical researchers publish new and improved methods, along with the code to produce them, on easily accessible websites.
Figure 1.1 Steps in a typical data analysis: import data; prepare, explore, and clean data; fit a statistical model; evaluate the model fit; cross-validate the model; evaluate model prediction on new data; produce report.

The advent of personal computers had another effect on the way we analyze data. When data analysis was carried out on mainframe computers, computer time was precious and difficult to come by. Analysts would carefully set up a computer run with all the parameters and options thought to be needed. When the procedure ran, the resulting output could be dozens or hundreds of pages long. The analyst would sift through this output, extracting useful material and discarding the rest. Many popular statistical packages were originally developed during this period and still follow this approach to some degree.

With the cheap and easy access afforded by personal computers, modern data analysis has shifted to a different paradigm. Rather than setting up a complete data analysis at once, the process has become highly interactive, with the output from each stage serving as the input for the next stage. An example of a typical analysis is shown in figure 1.1. At any point, the cycles may include transforming the data, imputing missing values, adding or deleting variables, and looping back through the whole process again. The process stops when the analyst believes he or she understands the data intimately and has answered all the relevant questions that can be answered.

The advent of personal computers (and especially the availability of high-resolution monitors) has also had an impact on how results are understood and presented. A picture really can be worth a thousand words, and human beings are very adept at extracting useful information from visual presentations. Modern data analysis increasingly relies on graphical presentations to uncover meaning and convey results.
To summarize, today’s data analysts need to be able to access data from a wide range of sources (database management systems, text files, statistical packages, and spreadsheets), merge the pieces of data together, clean and annotate them, analyze them with the latest methods, present the findings in meaningful and graphically appealing ways, and incorporate the results into attractive reports that can be distributed to stakeholders and the public. As you’ll see in the following pages, R is a comprehensive software package that’s ideally suited to accomplish these goals.

1.1 Why use R?

R is a language and environment for statistical computing and graphics, similar to the S language originally developed at Bell Labs. It’s an open source solution to data analysis that’s supported by a large and active worldwide research community. But there are many popular statistical and graphing packages available (such as Microsoft Excel, SAS, IBM SPSS, Stata, and Minitab). Why turn to R?

R has many features to recommend it:

Most commercial statistical software platforms cost thousands, if not tens of thousands of dollars. R is free! If you’re a teacher or a student, the benefits are obvious.
R is a comprehensive statistical platform, offering all manner of data analytic techniques. Just about any type of data analysis can be done in R.
R has state-of-the-art graphics capabilities. If you want to visualize complex data, R has the most comprehensive and powerful feature set available.
R is a powerful platform for interactive data analysis and exploration. From its inception it was designed to support the approach outlined in figure 1.1. For example, the results of any analytic step can easily be saved, manipulated, and used as input for additional analyses.
Getting data into a usable form from multiple sources can be a challenging proposition.
R can easily import data from a wide variety of sources, including text files, database management systems, statistical packages, and specialized data repositories. It can write data out to these systems as well.
R provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner. It’s easily extensible and provides a natural language for quickly programming recently published methods.
R contains advanced statistical routines not yet available in other packages. In fact, new methods become available for download on a weekly basis. If you’re a SAS user, imagine getting a new SAS PROC every few days.
If you don’t want to learn a new language, a variety of graphic user interfaces (GUIs) are available, offering the power of R through menus and dialogs.
R runs on a wide array of platforms, including Windows, Unix, and Mac OS X. It’s likely to run on any computer you might have (I’ve even come across guides for installing R on an iPhone, which is impressive but probably not a good idea).

You can see an example of R’s graphic capabilities in figure 1.2. This graph, created with a single line of code, describes the relationships between income, education, and prestige for blue-collar, white-collar, and professional jobs. Technically, it’s a scatter plot matrix with groups displayed by color and symbol, two types of fit lines (linear and loess), confidence ellipses, and two types of density display (kernel density estimation, and rug plots). Additionally, the largest outlier in each scatter plot has been automatically labeled. If these terms are unfamiliar to you, don’t worry. We’ll cover them in later chapters. For now, trust me that they’re really cool (and that the statisticians reading this are salivating).

Basically, this graph indicates the following:

Education, income, and job prestige are linearly related.
In general, blue-collar jobs involve lower education, income, and prestige, whereas professional jobs involve higher education, income, and prestige. White-collar jobs fall in between.
There are some interesting exceptions. Railroad Engineers have high income and low education. Ministers have high prestige and low income.
Education and (to a lesser extent) prestige are distributed bi-modally, with more scores in the high and low ends than in the middle.

Figure 1.2 Relationships between income, education, and prestige for blue-collar (bc), white-collar (wc), and professional jobs (prof). Source: car package (scatterplotMatrix function) written by John Fox. Graphs like this are difficult to create in other statistical programming languages but can be created with a line or two of code in R.

Chapter 8 will have much more to say about this type of graph. The important point is that R allows you to create elegant, informative, and highly customized graphs in a simple and straightforward fashion. Creating similar plots in other statistical languages would be difficult, time consuming, or impossible.

Unfortunately, R can have a steep learning curve. Because it can do so much, the documentation and help files available are voluminous. Additionally, because much of the functionality comes from optional modules created by independent contributors, this documentation can be scattered and difficult to locate. In fact, getting a handle on all that R can do is a challenge.

The goal of this book is to make access to R quick and easy. We’ll tour the many features of R, covering enough material to get you started on your data, with pointers on where to go when you need to learn more. Let’s begin by installing the program.
1.2 Obtaining and installing R

R is freely available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org. Precompiled binaries are available for Linux, Mac OS X, and Windows. Follow the directions for installing the base product on the platform of your choice. Later we’ll talk about adding functionality through optional modules called packages (also available from CRAN). Appendix H describes how to update an existing R installation to a newer version.

1.3 Working with R

R is a case-sensitive, interpreted language. You can enter commands one at a time at the command prompt (>) or run a set of commands from a source file. There are a wide variety of data types, including vectors, matrices, data frames (similar to datasets), and lists (collections of objects). We’ll discuss each of these data types in chapter 2.

Most functionality is provided through built-in and user-created functions, and all data objects are kept in memory during an interactive session. Basic functions are available by default. Other functions are contained in packages that can be attached to a current session as needed.

Statements consist of functions and assignments. R uses the symbol <- for assignments.

Listing 1.1 works with two vectors, age (in months) and weight (in kilograms), recorded for 10 infants:

    > mean(weight)
    [1] 7.06
    > sd(weight)
    [1] 2.077498
    > cor(age,weight)
    [1] 0.9075655
    > plot(age,weight)
    > q()

You can see from listing 1.1 that the mean weight for these 10 infants is 7.06 kilograms, that the standard deviation is 2.08 kilograms, and that there is a strong linear relationship between age in months and weight in kilograms (correlation = 0.91). The relationship can also be seen in the scatter plot in figure 1.4. Not surprisingly, as infants get older, they tend to weigh more.

The scatter plot in figure 1.4 is informative but somewhat utilitarian and unattractive. In later chapters, you’ll see how to customize graphs to suit your needs.

TIP To get a sense of what R can do graphically, enter demo(graphics) at the command prompt.
A sample of the graphs produced is included in figure 1.5. Other demonstrations include demo(Hershey), demo(persp), and demo(image). To see a complete list of demonstrations, enter demo() without parameters.

Figure 1.4 Scatter plot of infant weight (kg) by age (mo)

Figure 1.5 A sample of the graphs created with the demo() function

1.3.2 Getting help

R provides extensive help facilities, and learning to navigate them will help you significantly in your programming efforts. The built-in help system provides details, references, and examples of any function contained in a currently installed package. Help is obtained using the functions listed in table 1.2.

Table 1.2 R help functions

Function                          Action
help.start()                      General help.
help("foo") or ?foo               Help on function foo (the quotation marks are optional).
help.search("foo") or ??foo       Search the help system for instances of the string foo.
example("foo")                    Examples of function foo (the quotation marks are optional).
RSiteSearch("foo")                Search for the string foo in online help manuals and archived mailing lists.
apropos("foo", mode="function")   List all available functions with foo in their name.
data()                            List all available example datasets contained in currently loaded packages.
vignette()                        List all available vignettes for currently installed packages.
vignette("foo")                   Display specific vignettes for topic foo.

The function help.start() opens a browser window with access to introductory and advanced manuals, FAQs, and reference materials. The RSiteSearch() function searches for a given topic in online help manuals and archives of the R-Help discussion list and returns the results in a browser window. The vignettes returned by the vignette() function are practical introductory articles provided in PDF format. Not all packages will have vignettes.
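A couple of the help functions in table 1.2 can be tried without opening a browser. The following is a minimal sketch that assumes only base R; the variable names hits and is_listed are illustrative and don't come from the book.

```r
# apropos() returns the names of visible objects that match a pattern;
# mode = "function" restricts the matches to functions.
hits <- apropos("mean", mode = "function")
print(hits)

# Base R's mean() should be among the matches.
is_listed <- "mean" %in% hits

# The lookups below open a help page or a browser window, so they are
# left commented out in this sketch:
# help("mean")      # same as ?mean
# example("mean")   # run the examples from the help page for mean()
# vignette()        # list vignettes for installed packages
```

Searching by name fragment with apropos() is often a quick first step when you remember part of a function's name but not the whole thing.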
As you can see, R provides extensive help facilities, and learning to navigate them will definitely aid your programming efforts. It’s a rare session that I don’t use the ? to look up the features (such as options or return values) of some function.

1.3.3 The workspace

The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, functions, data frames, or lists). At the end of an R session, you can save an image of the current workspace that’s automatically reloaded the next time R starts. Commands are entered interactively at the R user prompt. You can use the up and down arrow keys to scroll through your command history. Doing so allows you to select a previous command, edit it if desired, and resubmit it using the Enter key.

The current working directory is the directory R will read files from and save results to by default. You can find out what the current working directory is by using the getwd() function. You can set the current working directory by using the setwd() function. If you need to input a file that isn’t in the current working directory, use the full pathname in the call. Always enclose the names of files and directories from the operating system in quote marks.

Some standard commands for managing your workspace are listed in table 1.3.

Table 1.3 Functions for managing the R workspace

Function                          Action
getwd()                           List the current working directory.
setwd("mydirectory")              Change the current working directory to mydirectory.
ls()                              List the objects in the current workspace.
rm(objectlist)                    Remove (delete) one or more objects.
help(options)                     Learn about available options.
options()                         View or set current options.
history(#)                        Display your last # commands (default = 25).
savehistory("myfile")             Save the command history to myfile (default = .Rhistory).
loadhistory("myfile")             Reload a command history (default = .Rhistory).
save.image("myfile")              Save the workspace to myfile (default = .RData).
save(objectlist, file="myfile")   Save specific objects to a file.
load("myfile")                    Load a workspace into the current session (default = .RData).
q()                               Quit R. You’ll be prompted to save the workspace.

To see these commands in action, take a look at the following listing.

Listing 1.2 An example of commands used to manage the R workspace

    setwd("C:/myprojects/project1")
    options()
    options(digits=3)

    > patientdata[1:2]
      patientID age
    1         1  25
    2         2  34
    3         3  28
    4         4  52
    > patientdata[c("diabetes", "status")]
      diabetes    status
    1    Type1      Poor
    2    Type2  Improved
    3    Type1 Excellent
    4    Type1      Poor
    > patientdata$age                 (1) Indicates age variable in patient data frame
    [1] 25 34 28 52

The $ notation in the third example is new (1). It’s used to indicate a particular variable from a given data frame. For example, if you want to cross tabulate diabetes type by status, you could use the following code:

    > table(patientdata$diabetes, patientdata$status)

            Excellent Improved Poor
      Type1         1        0    2
      Type2         0        1    0

It can get tiresome typing patientdata$ at the beginning of every variable name, so shortcuts are available. You can use either the attach() and detach() or with() functions to simplify your code.

ATTACH, DETACH, AND WITH

The attach() function adds the data frame to the R search path. When a variable name is encountered, data frames in the search path are checked in order to locate the variable. Using the mtcars data frame from chapter 1 as an example, you could use the following code to obtain summary statistics for automobile mileage (mpg), and plot this variable against engine displacement (disp), and weight (wt):

    summary(mtcars$mpg)
    plot(mtcars$mpg, mtcars$disp)
    plot(mtcars$mpg, mtcars$wt)

This could also be written as

    attach(mtcars)
      summary(mpg)
      plot(mpg, disp)
      plot(mpg, wt)
    detach(mtcars)

The detach() function removes the data frame from the search path.
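The selections and the attach()/detach() pair described above can be sketched end to end. The data frame below is an assumed reconstruction of patientdata, built only from the values visible in the printed output; the helper names (ages_by_name, xtab, on_path, and so on) are illustrative.

```r
# Assumed reconstruction of patientdata from the output shown above.
patientdata <- data.frame(
  patientID = 1:4,
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

# Different ways to pull variables out of a data frame.
ages_by_name  <- patientdata$age        # $ notation returns a vector
ages_by_index <- patientdata[["age"]]   # list-style extraction, same vector
first_two     <- patientdata[1:2]       # a data frame holding columns 1 and 2

# Cross-tabulate diabetes type by status.
xtab <- table(patientdata$diabetes, patientdata$status)
print(xtab)

# attach() places the data frame on the search path; detach() removes it
# again and leaves the data frame itself unchanged.
attach(patientdata)
on_path <- "patientdata" %in% search()
detach(patientdata)
off_path <- !("patientdata" %in% search())
```

Checking search() before and after makes it easy to confirm that nothing is left attached by accident.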
Note that detach() does nothing to the data frame itself. The statement is optional but is good programming practice and should be included routinely. (I’ll sometimes ignore this sage advice in later chapters in order to keep code fragments simple and short.)

The limitations with this approach are evident when more than one object can have the same name. Consider the following code:

    > mpg <- c(25, 36, 47)
    > attach(mtcars)
    The following object(s) are masked _by_ ‘.GlobalEnv’: mpg
    > plot(mpg, wt)
    Error in xy.coords(x, y, xlabel, ylabel, log) :
      ‘x’ and ‘y’ lengths differ
    > mpg
    [1] 25 36 47

Here we already have an object named mpg in our environment when the mtcars data frame is attached. In such cases, the original object takes precedence, which isn’t what you want. The plot statement fails because mpg has 3 elements and wt has 32 elements. The attach() and detach() functions are best used when you’re analyzing a single data frame and you’re unlikely to have multiple objects with the same name. In any case, be vigilant for warnings that say that objects are being masked.

An alternative approach is to use the with() function. You could write the previous example as

    with(mtcars, {
      summary(mpg)
      plot(mpg, disp)
      plot(mpg, wt)
    })

In this case, the statements within the {} brackets are evaluated with reference to the mtcars data frame. You don’t have to worry about name conflicts here. If there’s only one statement (for example, summary(mpg)), the {} brackets are optional.

The limitation of the with() function is that assignments will only exist within the function brackets.
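The difference between relying on the search path and using with() can be sketched as follows. The global vector mpg mirrors the masking example above; the names n_global, n_with, mpg_range, lo, and hi are illustrative only.

```r
# A global object that shares its name with a column of mtcars.
mpg <- c(25, 36, 47)

# with() looks in mtcars first, so mpg inside the call is the 32-element
# column rather than the 3-element global vector.
n_global <- length(mpg)
n_with   <- with(mtcars, length(mpg))

# Several statements can be grouped in braces; with() returns the value
# of the last expression.
mpg_range <- with(mtcars, {
  lo <- min(mpg)
  hi <- max(mpg)
  c(lo, hi)
})

# lo and hi were assigned inside with(), so they don't exist out here.
lo_exists <- exists("lo", inherits = FALSE)
```

Returning the values you need as the last expression, as mpg_range does here, is one way to get results out of a with() call without relying on assignments made inside it.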
To see this limitation, consider the following:

    > with(mtcars, {
       stats <- summary(mpg)
       stats
      })
    > stats
    Error: object ‘stats’ not found

If you need to create objects that will exist outside of the with() construct, use the special assignment operator <<-.

In listing 5.4, a is a scalar, b is a vector, and c is a matrix:

    > a <- 5
    > sqrt(a)
    [1] 2.236068
    > round(b)
    [1] 1 6 3
    > c
           [,1]  [,2]  [,3]  [,4]
    [1,] 0.4205 0.355 0.699 0.323
    [2,] 0.0270 0.601 0.181 0.926
    [3,] 0.6682 0.319 0.599 0.215
    > log(c)
           [,1]   [,2]   [,3]   [,4]
    [1,] -0.866 -1.036 -0.358 -1.130
    [2,] -3.614 -0.508 -1.711 -0.077
    [3,] -0.403 -1.144 -0.513 -1.538
    > mean(c)
    [1] 0.444

Notice that the mean of matrix c in listing 5.4 results in a scalar (0.444). The mean() function took the average of all 12 elements in the matrix. But what if you wanted the 3 row means or the 4 column means?

R provides a function, apply(), that allows you to apply an arbitrary function to any dimension of a matrix, array, or data frame. The format for the apply function is

    apply(x, MARGIN, FUN, ...)

where x is the data object, MARGIN is the dimension index, FUN is a function you specify, and ... are any parameters you want to pass to FUN. In a matrix or data frame MARGIN=1 indicates rows and MARGIN=2 indicates columns. Take a look at the examples in listing 5.5.

Listing 5.5 Applying a function to the rows (columns) of a matrix

    > mydata <- matrix(rnorm(30), nrow=6)       (1) Generate data
    > mydata
             [,1]   [,2]    [,3]   [,4]   [,5]
    [1,]  0.71298  1.368 -0.8320 -1.234 -0.790
    [2,] -0.15096 -1.149 -1.0001 -0.725  0.506
    [3,] -1.77770  0.519 -0.6675  0.721 -1.350
    [4,] -0.00132 -0.308  0.9117 -1.391  1.558
    [5,] -0.00543  0.378 -0.0906 -1.485 -0.350
    [6,] -0.52178 -0.539 -1.7347  2.050  1.569
    > apply(mydata, 1, mean)                    (2) Calculate row means
    [1] -0.155 -0.504 -0.511  0.154 -0.310  0.165
    > apply(mydata, 2, mean)                    (3) Calculate column means
    [1] -0.2907  0.0449 -0.5688 -0.3442  0.1906
    > apply(mydata, 2, mean, trim=0.2)          (4) Calculate trimmed column means
    [1] -0.1699  0.0127 -0.6475 -0.6575  0.2312

You start by generating a 6 x 5 matrix containing random normal variates (1).
Then you calculate the 6 row means (2), and 5 column means (3). Finally, you calculate trimmed column means (in this case, means based on the middle 60 percent of the data, with the bottom 20 percent and top 20 percent of values discarded) (4).

Because FUN can be any R function, including a function that you write yourself (see section 5.4), apply() is a powerful mechanism. While apply() applies a function over the margins of an array, lapply() and sapply() apply a function over a list. You’ll see an example of sapply (which is a user-friendly version of lapply) in the next section.

You now have all the tools you need to solve the data challenge in section 5.1, so let’s give it a try.

5.3 A solution for our data management challenge

Your challenge from section 5.1 is to combine subject test scores into a single performance indicator for each student, grade each student from A to F based on their relative standing (top 20 percent, next 20 percent, etc.), and sort the roster by students’ last name, followed by first name. A solution is given in the following listing.

Listing 5.6 A solution to the learning example

    > options(digits=2)
    > Student <- c("John Davis", "Angela Williams", "Bullwinkle Moose",
                   "David Jones", "Janice Markhammer", "Cheryl Cushing",
                   "Reuven Ytzrhak", "Greg Knox", "Joel England",
                   "Mary Rayburn")
    > Math <- c(502, 600, 412, 358, 495, 512, 410, 625, 573, 522)
    > Science <- c(95, 99, 80, 82, 75, 85, 80, 95, 89, 86)
    > English <- c(25, 22, 18, 15, 20, 28, 15, 30, 27, 18)
    > roster <- data.frame(Student, Math, Science, English,
                           stringsAsFactors=FALSE)
    > z <- scale(roster[,2:4])
    > score <- apply(z, 1, mean)
    > roster <- cbind(roster, score)
    > y <- quantile(score, c(.8,.6,.4,.2))
    > roster$grade[score >= y[1]] <- "A"
    > roster$grade[score < y[1] & score >= y[2]] <- "B"
    > roster$grade[score < y[2] & score >= y[3]] <- "C"
    > roster$grade[score < y[3] & score >= y[4]] <- "D"
    > roster$grade[score < y[4]] <- "F"
    > name <- strsplit((roster$Student), " ")
    > Lastname <- sapply(name, "[", 2)
    > Firstname <- sapply(name, "[", 1)
    > roster <- cbind(Firstname, Lastname, roster[,-1])
    > roster <- roster[order(Lastname, Firstname),]
    > roster
        Firstname   Lastname Math Science English score grade
    6      Cheryl    Cushing  512      85      28  0.35     C
    1        John      Davis  502      95      25  0.56     B
    9        Joel    England  573      89      27  0.70     B
    4       David      Jones  358      82      15 -1.16     F
    8        Greg       Knox  625      95      30  1.34     A
    5      Janice Markhammer  495      75      20 -0.63     D
    3  Bullwinkle      Moose  412      80      18 -0.86     D
    10       Mary    Rayburn  522      86      18 -0.18     C
    2      Angela   Williams  600      99      22  0.92     A
    7      Reuven    Ytzrhak  410      80      15 -1.05     F

The code is dense so let’s walk through the solution step by step:

Step 1. The original student roster is given.
The options(digits=2) statement limits the number of significant digits printed and makes the printouts easier to read.

    > options(digits=2)
    > roster
                 Student Math Science English
    1         John Davis  502      95      25
    2    Angela Williams  600      99      22
    3   Bullwinkle Moose  412      80      18
    4        David Jones  358      82      15
    5  Janice Markhammer  495      75      20
    6     Cheryl Cushing  512      85      28
    7     Reuven Ytzrhak  410      80      15
    8          Greg Knox  625      95      30
    9       Joel England  573      89      27
    10      Mary Rayburn  522      86      18

Step 2. Because the Math, Science, and English tests are reported on different scales (with widely differing means and standard deviations), you need to make them comparable before combining them. One way to do this is to standardize the variables so that each test is reported in standard deviation units, rather than in its original scale. You can do this with the scale() function:

    > z <- scale(roster[,2:4])
    > z
            Math Science English
     [1,]  0.013   1.078   0.587
     [2,]  1.143   1.591   0.037
     [3,] -1.026  -0.847  -0.697
     [4,] -1.649  -0.590  -1.247
     [5,] -0.068  -1.489  -0.330
     [6,]  0.128  -0.205   1.137
     [7,] -1.049  -0.847  -1.247
     [8,]  1.432   1.078   1.504
     [9,]  0.832   0.308   0.954
    [10,]  0.243  -0.077  -0.697

Step 3. You can then get a performance score for each student by calculating the row means of z (using apply() with the mean() function) and adding the result to the roster using the cbind() function:

    > score <- apply(z, 1, mean)
    > roster <- cbind(roster, score)
    > roster
                 Student Math Science English  score
    1         John Davis  502      95      25  0.559
    2    Angela Williams  600      99      22  0.924
    3   Bullwinkle Moose  412      80      18 -0.857
    4        David Jones  358      82      15 -1.162
    5  Janice Markhammer  495      75      20 -0.629
    6     Cheryl Cushing  512      85      28  0.353
    7     Reuven Ytzrhak  410      80      15 -1.048
    8          Greg Knox  625      95      30  1.338
    9       Joel England  573      89      27  0.698
    10      Mary Rayburn  522      86      18 -0.177

Step 4. The quantile() function gives you the performance scores that mark the 80th, 60th, 40th, and 20th percentiles, which serve as grade cutoffs. You see that the cutoff for an A is 0.74, for a B is 0.44, and so on.

    > y <- quantile(score, c(.8,.6,.4,.2))
    > y
      80%   60%   40%   20%
     0.74  0.44 -0.36 -0.89

Step 5.
Using logical operators, you can recode students’ performance scores into a new categorical grade variable by comparing each score with the cutoffs in y. This creates the variable grade in the roster data frame.

    > roster$grade[score >= y[1]] <- "A"
    > roster$grade[score < y[1] & score >= y[2]] <- "B"
    > roster$grade[score < y[2] & score >= y[3]] <- "C"
    > roster$grade[score < y[3] & score >= y[4]] <- "D"
    > roster$grade[score < y[4]] <- "F"
    > roster
                 Student Math Science English  score grade
    1         John Davis  502      95      25  0.559     B
    2    Angela Williams  600      99      22  0.924     A
    3   Bullwinkle Moose  412      80      18 -0.857     D
    4        David Jones  358      82      15 -1.162     F
    5  Janice Markhammer  495      75      20 -0.629     D
    6     Cheryl Cushing  512      85      28  0.353     C
    7     Reuven Ytzrhak  410      80      15 -1.048     F
    8          Greg Knox  625      95      30  1.338     A
    9       Joel England  573      89      27  0.698     B
    10      Mary Rayburn  522      86      18 -0.177     C

Step 6. You’ll use the strsplit() function to break student names into first name and last name at the space character. Applying strsplit() to a vector of strings returns a list:

    > name <- strsplit((roster$Student), " ")
    > name
    [[1]]
    [1] "John"  "Davis"
    [[2]]
    [1] "Angela"   "Williams"
    [[3]]
    [1] "Bullwinkle" "Moose"
    [[4]]
    [1] "David" "Jones"
    [[5]]
    [1] "Janice"     "Markhammer"
    [[6]]
    [1] "Cheryl"  "Cushing"
    [[7]]
    [1] "Reuven"  "Ytzrhak"
    [[8]]
    [1] "Greg" "Knox"
    [[9]]
    [1] "Joel"    "England"
    [[10]]
    [1] "Mary"    "Rayburn"

Step 7. You can use the sapply() function to take the first element of each component and put it in a Firstname vector, and the second element of each component and put it in a Lastname vector. "[" is a function that extracts part of an object; here it extracts the first or second element of each component of the list name. You’ll use cbind() to add them to the roster. Because you no longer need the Student variable, you’ll drop it (with the -1 in the roster index).
> Firstname <- sapply(name, "[", 1)
> Lastname <- sapply(name, "[", 2)
> roster <- cbind(Firstname, Lastname, roster[,-1])
> roster
    Firstname   Lastname Math Science English  score grade
1        John      Davis  502      95      25  0.559     B
2      Angela   Williams  600      99      22  0.924     A
3  Bullwinkle      Moose  412      80      18 -0.857     D
4       David      Jones  358      82      15 -1.162     F
5      Janice Markhammer  495      75      20 -0.629     D
6      Cheryl    Cushing  512      85      28  0.353     C
7      Reuven    Ytzrhak  410      80      15 -1.048     F
8        Greg       Knox  625      95      30  1.338     A
9        Joel    England  573      89      27  0.698     B
10       Mary    Rayburn  522      86      18 -0.177     C

Step 8. Finally, you can sort the dataset by last name and then first name using the order() function:

> roster[order(Lastname,Firstname),]
    Firstname   Lastname Math Science English score grade
6      Cheryl    Cushing  512      85      28  0.35     C
1        John      Davis  502      95      25  0.56     B
9        Joel    England  573      89      27  0.70     B
4       David      Jones  358      82      15 -1.16     F
8        Greg       Knox  625      95      30  1.34     A
5      Janice Markhammer  495      75      20 -0.63     D
3  Bullwinkle      Moose  412      80      18 -0.86     D
10       Mary    Rayburn  522      86      18 -0.18     C
2      Angela   Williams  600      99      22  0.92     A
7      Reuven    Ytzrhak  410      80      15 -1.05     F

Voilà! Piece of cake! There are many other ways to accomplish these tasks, but this code helps capture the flavor of these functions. Now it's time to look at control structures and user-written functions.

5.4 Control flow
In the normal course of events, the statements in an R program are executed sequentially from the top of the program to the bottom. But there are times when you'll want to execute some statements repetitively, while executing other statements only if certain conditions are met. This is where control-flow constructs come in.

R has the standard control structures you'd expect to see in a modern programming language. First you'll go through the constructs used for conditional execution, followed by the constructs used for looping.

For the syntax examples throughout this section, keep the following in mind:
statement is a single R statement or a compound statement (a group of R statements enclosed in curly braces { } and separated by semicolons or newlines).
cond is an expression that resolves to true or false.
expr is a statement that evaluates to a number or character string.
seq is a sequence of numbers or character strings.

After we discuss control-flow constructs, you'll learn how to write your own functions.

5.4.1 Repetition and looping
Looping constructs repetitively execute a statement or series of statements until a condition isn't true. These include the for and while structures.

FOR
The for loop executes a statement repetitively until a variable's value is no longer contained in the sequence seq. The syntax is

for (var in seq) statement

In this example

for (i in 1:10) print("Hello")

the word Hello is printed 10 times.

WHILE
A while loop executes a statement repetitively until the condition is no longer true. The syntax is

while (cond) statement

In a second example, the code

i <- 10
while (i > 0) {print("Hello"); i <- i - 1}

once again prints the word Hello 10 times.

[...]

> barplot(means$x, names.arg=means$Group.1)
> title("Mean Illiteracy Rate")      # (2) Title added

Listing 6.3 sorts the means from smallest to largest (1). Also note that use of the title() function (2) is equivalent to adding the main option in the plot call. means$x is the vector containing the heights of the bars, and the option names.arg=means$Group.1 is added to provide labels.

You can take this example further. The bars can be connected with straight line segments using the lines() function. You can also create mean bar plots with superimposed confidence intervals using the barplot2() function in the gplots package. See "barplot2: Enhanced Bar Plots" on the R Graph Gallery website (http://addictedtor.free.fr/graphiques) for an example.

Figure 6.3 Bar plot of mean illiteracy rates for US regions sorted by rate

6.1.4 Tweaking bar plots
There are several ways to tweak the appearance of a bar plot. For example, with many bars, bar labels may start to overlap.
You can decrease the font size using the cex.names option. Specifying values smaller than 1 will shrink the size of the labels. Optionally, the names.arg argument allows you to specify a character vector of names used to label the bars. You can also use graphical parameters to help text spacing. An example is given in the following listing with the output displayed in figure 6.4.

Figure 6.4 Horizontal bar plot with tweaked labels

Listing 6.4 Fitting labels in a bar plot
par(mar=c(5,8,4,2))
par(las=2)
counts [...]

> margin.table(mytable, 1)               # (2) Marginal frequencies
Treatment
Placebo Treated
     43      41
> margin.table(mytable, 2)
Sex
Female   Male
    59     25
> margin.table(mytable, 3)
Improved
  None   Some Marked
    42     14     28
> margin.table(mytable, c(1, 3))         # (3) Treatment x Improved marginal frequencies
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
> ftable(prop.table(mytable, c(1, 2)))   # (4) Improved proportions for Treatment x Sex
                 Improved  None  Some Marked
Treatment Sex
Placebo   Female          0.594 0.219  0.188
          Male            0.909 0.000  0.091
Treated   Female          0.222 0.185  0.593
          Male            0.500 0.143  0.357
> ftable(addmargins(prop.table(mytable, c(1, 2)), 3))
                 Improved  None  Some Marked   Sum
Treatment Sex
Placebo   Female          0.594 0.219  0.188 1.000
          Male            0.909 0.000  0.091 1.000
Treated   Female          0.222 0.185  0.593 1.000
          Male            0.500 0.143  0.357 1.000

The code in (1) produces cell frequencies for the three-way classification. The code also demonstrates how the ftable() function can be used to print a more compact and attractive version of the table.

The code in (2) produces the marginal frequencies for Treatment, Sex, and Improved. Because you created the table with the formula ~Treatment+Sex+Improved, Treatment is referred to by index 1, Sex is referred to by index 2, and Improved is referred to by index 3.
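The link between formula order and dimension index can be checked on any multiway table. The following is a minimal sketch that assumes the built-in mtcars data frame as a stand-in (it is not the data used in the listing above); the names tab, by_cyl, and by_cyl_am are introduced only for this illustration.

```r
# mtcars ships with base R; it is only a stand-in for the listing's data
tab <- xtabs(~ cyl + gear + am, data = mtcars)

# Dimension order follows the formula: index 1 = cyl, 2 = gear, 3 = am
print(names(dimnames(tab)))

# Marginal counts for index 1 (cyl), summed over gear and am
by_cyl <- margin.table(tab, 1)

# Marginal counts for cyl x am, summed over index 2 (gear)
by_cyl_am <- margin.table(tab, c(1, 3))

print(by_cyl)
print(by_cyl_am)

# Every margin accounts for all observations in the data
stopifnot(sum(by_cyl) == nrow(mtcars), sum(by_cyl_am) == nrow(mtcars))
```

Changing the order of the variables in the formula changes which index refers to which variable, so it can help to check names(dimnames()) before calling margin.table() or prop.table().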
The code in (3) produces the marginal frequencies for the Treatment x Improved classification, summed over Sex. The proportion of patients with None, Some, and Marked improvement for each Treatment x Sex combination is provided in (4). Here you see that 36 percent of treated males had marked improvement, compared to 59 percent of treated females. In general, the proportions will add to one over the indices not included in the prop.table() call (the third index, or Improved in this case). You can see this in the last example, where you add a sum margin over the third index.

If you want percentages instead of proportions, you could multiply the resulting table by 100. For example:

ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) * 100

would produce this table:

                 Improved  None  Some Marked   Sum
Treatment Sex
Placebo   Female           59.4  21.9   18.8 100.0
          Male             90.9   0.0    9.1 100.0
Treated   Female           22.2  18.5   59.3 100.0
          Male             50.0  14.3   35.7 100.0

While contingency tables tell you the frequency or proportions of cases for each combination of the variables that comprise the table, you're probably also interested in whether the variables in the table are related or independent. Tests of independence are covered in the next section.

7.2.2 Tests of independence
R provides several methods of testing the independence of categorical variables. The three tests described in this section are the chi-square test of independence, the Fisher exact test, and the Cochran-Mantel-Haenszel test.

CHI-SQUARE TEST OF INDEPENDENCE
You can apply the function chisq.test() to a two-way table in order to produce a chi-square test of independence of the row and column variables. See this next listing for an example.
Listing 7.13 Chi-square test of independence
> library(vcd)
> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> chisq.test(mytable)                    # (1) Treatment and Improved not independent

        Pearson's Chi-squared test

data:  mytable
X-squared = 13.1, df = 2, p-value = 0.001463

> mytable <- xtabs(~Improved+Sex, data=Arthritis)
> chisq.test(mytable)                    # (2) Gender and Improved independent

        Pearson's Chi-squared test

data:  mytable
X-squared = 4.84, df = 2, p-value = 0.0889

Warning message:
In chisq.test(mytable) : Chi-squared approximation may be incorrect

From the results (1), there appears to be a relationship between treatment received and level of improvement (p < .01). But there doesn't appear to be a relationship (2) between patient sex and improvement (p > .05). The p-value is the probability of obtaining results at least as extreme as those sampled, assuming independence of the row and column variables in the population. Because the probability is small for (1), you reject the hypothesis that treatment type and outcome are independent. Because the probability for (2) isn't small, it's not unreasonable to assume that outcome and gender are independent. The warning message in listing 7.13 is produced because one of the six cells in the table (male-some improvement) has an expected value less than five, which may invalidate the chi-square approximation.

FISHER'S EXACT TEST
You can produce a Fisher's exact test via the fisher.test() function. Fisher's exact test evaluates the null hypothesis of independence of rows and columns in a contingency table with fixed marginals. The format is fisher.test(mytable), where mytable is a two-way table. Here's an example:

> mytable <- xtabs(~Treatment+Improved, data=Arthritis)
> fisher.test(mytable)

        Fisher's Exact Test for Count Data

data:  mytable
p-value = 0.001393
alternative hypothesis: two.sided

In contrast to many statistical packages, the fisher.test() function can be applied to any two-way table with two or more rows and columns, not