A First Course in Statistical Programming with R PDF
Document Details
W. John Braun and Duncan J. Murdoch
Tags
Summary
This book provides a first course in statistical programming using the R language. It covers the principles of R, includes programming techniques used to develop complex projects, and provides ample exercises.
Full Transcript
This page intentionally left blank A First Course in Statistical Programming with R This is the only introduction you’ll need to start programming in R, the open- source language that is free to download, and lets you adapt the source code for your own requirements. Co-written by one of the R core...
This page intentionally left blank A First Course in Statistical Programming with R This is the only introduction you’ll need to start programming in R, the open- source language that is free to download, and lets you adapt the source code for your own requirements. Co-written by one of the R core development team, and by an established R author, this book comes with real R code that complies with the standards of the language. Unlike other introductory books on the ground-breaking R system, this book emphasizes programming, including the principles that apply to most computing languages, and the techniques used to develop more complex projects. Learning the language is made easier by the frequent exercises within chapters which enable you to progress conf idently through the book. More substantial exercises at the ends of chapters help to test your understanding. Solutions, datasets, and any errata will be available from the book’s website. W. John Braun is an Associate Professor in the Department of Statistical and Actuarial Sciences at the University of Western Ontario. He is also a co-author, with John Maindonald, of Data Analysis and Graphics Using R. Duncan J. Murdoch is an Associate Professor in the Department of Statistical and Actuarial Sciences at the University of Western Ontario. He was columnist and column editor of the statistical computing column of Chance during 1999–2000. A First Course in Statistical Programming with R W. John Braun and Duncan J. Murdoch CAMBRIDGE UNIVERSITY PRESS Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo Cambridge University Press The Edinburgh Building, Cambridge CB2 8RU, UK Published in the United States of America by Cambridge University Press, New York www.cambridge.org Information on this title: www.cambridge.org/9780521872652 © W. John Braun and Duncan J. Murdoch 2007 This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press. First published in print format 2007 ISBN-13 978-0-511-50614-7 eBook (EBL) ISBN-13 978-0-521-87265-2 hardback ISBN-13 978-0-521-69424-7 paperback Cambridge University Press has no responsibility for the persistence or accuracy of urls for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate. Contents Preface page ix 1 Getting started 1 1.1 What is statistical programming? 1 1.2 Outline of the book 2 1.3 The R package 3 1.4 Why use a command line? 3 1.5 Font conventions 4 1.6 Installation of R 4 2 Introduction to the R language 5 2.1 Starting and quitting R 5 2.1.1 Recording your work 6 2.2 Basic features of R 7 2.2.1 Calculating with R 7 2.2.2 Named storage 7 2.2.3 Functions 9 2.2.4 Exact or approximate? 9 2.2.5 R is case-sensitive 12 2.2.6 Listing the objects in the workspace 12 2.2.7 Vectors 12 2.2.8 Extracting elements from vectors 13 2.2.9 Vector arithmetic 14 2.2.10 Simple patterned vectors 15 2.2.11 Missing values and other special values 16 2.2.12 Character vectors 16 2.2.13 Factors 17 2.2.14 More on extracting elements from vectors 18 2.2.15 Matrices and arrays 18 2.2.16 Data frames 19 2.2.17 Dates and times 21 2.3 Built-in functions and online help 21 2.3.1 Built-in examples 22 2.3.2 Finding help when you don’t know the function name 23 2.3.3 Built-in graphics functions 23 2.3.4 Additional elementary built-in functions 25 2.4 Logical vectors and relational operators 26 2.4.1 Boolean algebra 26 2.4.2 Logical operations in R 27 2.4.3 Relational operators 28 2.5 Data input and output 29 2.5.1 Changing directories 29 vi CONTENTS 2.5.2 dump() and source() 29 2.5.3 Redirecting R output 30 2.5.4 Saving and retrieving image files 31 2.5.5 Data frames and the read.table function 31 2.5.6 Lists 31 Chapter exercises 32 3 Programming statistical graphics 33 3.1 High-level plots 33 3.1.1 Bar charts and dot charts 34 3.1.2 Pie charts 35 3.1.3 Histograms 35 3.1.4 Box plots 36 3.1.5 Scatterplots 38 3.1.6 QQ plots 39 3.2 Choosing a high-level graphic 41 3.3 Low-level graphics functions 42 3.3.1 The plotting region and margins 42 3.3.2 Adding to plots 43 3.3.3 Setting graphical parameters 45 Chapter exercises 46 4 Programming with R 47 4.1 Flow control 47 4.1.1 The for() loop 47 4.1.2 The if() statement 50 4.1.3 The while() loop 54 4.1.4 Newton’s method for root finding 55 4.1.5 The repeat loop, and the break and next statements 57 4.2 Managing complexity through functions 59 4.2.1 What are functions? 59 4.2.2 Scope of variables 62 4.3 Miscellaneous programming tips 63 4.3.1 Using fix() 63 4.3.2 Documentation using # 64 4.4 Some general programming guidelines 65 4.4.1 Top-down design 67 4.5 Debugging and maintenance 72 4.5.1 Recognizing that a bug exists 72 4.5.2 Make the bug reproducible 73 4.5.3 Identify the cause of the bug 73 4.5.4 Fixing errors and testing 75 4.5.5 Look for similar errors elsewhere 75 4.5.6 The browser() and debug() functions 75 4.6 Efficient programming 77 4.6.1 Learn your tools 77 4.6.2 Use efficient algorithms 78 4.6.3 Measure the time your program takes 79 CONTENTS vii 4.6.4 Be willing to use different tools 80 4.6.5 Optimize with care 80 Chapter exercises 80 5 Simulation 82 5.1 Monte Carlo simulation 82 5.2 Generation of pseudorandom numbers 83 5.3 Simulation of other random variables 88 5.3.1 Bernoulli random variables 88 5.3.2 Binomial random variables 89 5.3.3 Poisson random variables 93 5.3.4 Exponential random numbers 97 5.3.5 Normal random variables 99 5.4 Monte Carlo integration 101 5.5 Advanced simulation methods 104 5.5.1 Rejection sampling 104 5.5.2 Importance sampling 107 Chapter exercises 109 6 Computational linear algebra 112 6.1 Vectors and matrices in R 113 6.1.1 Constructing matrix objects 113 6.1.2 Accessing matrix elements; row and column names 115 6.1.3 Matrix properties 117 6.1.4 Triangular matrices 118 6.1.5 Matrix arithmetic 118 6.2 Matrix multiplication and inversion 119 6.2.1 Matrix inversion 120 6.2.2 The LU decomposition 121 6.2.3 Matrix inversion in R 122 6.2.4 Solving linear systems 123 6.3 Eigenvalues and eigenvectors 124 6.4 Advanced topics 125 6.4.1 The singular value decomposition of a matrix 125 6.4.2 The Choleski decomposition of a positive definite matrix 126 6.4.3 The QR decomposition of a matrix 127 6.4.4 The condition number of a matrix 128 6.4.5 Outer products 129 6.4.6 Kronecker products 129 6.4.7 apply() 129 Chapter exercises 130 7 Numerical optimization 132 7.1 The golden section search method 132 7.2 Newton–Raphson 135 7.3 The Nelder–Mead simplex method 138 7.4 Built-in functions 142 viii CONTENTS 7.5 Linear programming 142 7.5.1 Solving linear programming problems in R 145 7.5.2 Maximization and other kinds of constraints 145 7.5.3 Special situations 146 7.5.4 Unrestricted variables 149 7.5.5 Integer programming 150 7.5.6 Alternatives to lp() 151 7.5.7 Quadratic programming 151 Chapter exercises 157 Appendix Review of random variables and distributions 158 Index 161 Preface This text began as notes for a course in statistical computing for sec- ond year actuarial and statistical students at the University of Western Ontario. Both authors are interested in statistical computing, both as sup- port for our other research and for its own sake. However, we have found that our students were not learning the right sort of programming basics before they took our classes. At every level from undergraduate through Ph.D., we found that students were not able to produce simple, reliable programs; that they didn’t understand enough about numeri- cal computation to understand how rounding error could influence their results; and that they didn’t know how to begin a difficult computational project. We looked into service courses from other departments, but we found that they emphasized languages and concepts that our students would not use again. Our students need to be comfortable with simple programming so that they can put together a simulation of a stochastic model; they also need to know enough about numerical analysis so that they can do numerical computations reliably. We were unable to find this mix in an existing course, so we designed our own. We chose to base this text on R. R is an open-source computing package which has seen a huge growth in popularity in the last few years. Being open source, it is easily obtainable by students and economical to install in our computing lab. One of us (Murdoch) is a member of the R core development team, and the other (Braun) is a co-author of a book on data analysis using R. These facts made it easy for us to choose R, but we are both strong believers in the idea that there are certain universals of programming, and in this text we try to emphasize those: it is not a manual about programming in R, it is a course in statistical programming that uses R. Students starting this course are not assumed to have any program- ming experience or advanced statistical knowledge. They should be familiar with university-level calculus, and should have had exposure to a course in introductory probability, though that could be taken concurrently: the probabilistic concepts start in Chapter 5. (We include a concise appendix reviewing the probabilistic material.) We include some advanced topics in x P R E FAC E simulation, linear algebra, and optimization that an instructor may choose to skip in a one-semester course offering. We have a lot of people to thank for their help in writing this book. The students in Statistical Sciences 259b have provided motivation and feedback, Lutong Zhou drafted several figures, and Diana Gillooly of Cambridge University Press, Professor Brian Ripley of Oxford Univer- sity, and some anonymous reviewers all provided helpful suggestions. And of course, this book could not exist without R, and R would be far less valuable without the contributions of the worldwide R community. 1 Getting started Welcome to the world of statistical programming. This book will contain a lot of specific advice about the hows and whys of the subject. We start in this chapter by giving you an idea of what statistical programming is all about. We will also tell you what to expect as you proceed through the rest of the book. The chapter will finish with some instructions about how to download and install R, the software package and language on which we base our programming examples. 1.1 What is statistical programming? Computer programming involves controlling computers, telling them what calculations to do, what to display, etc. Statistical programming is harder to define. One definition might be that it’s the kind of computer programming statisticians do – but statisticians do all sorts of programming. Another would be that it’s the kind of programming one does when one is doing statistics – but again, statistics involves a wide variety of computing tasks. For example, statisticians are concerned with collecting and analyzing data, and some statisticians would be involved in setting up connections between computers and laboratory instruments – but we would not call that statistical programming. Statisticians often oversee data entry from questionnaires, and may set up programs to aid in detecting data entry errors. That is statistical programming, but it is quite specialized, and beyond the scope of this book. Statistical programming involves doing computations to aid in statistical analysis. For example, data must be summarized and displayed. Models must be fit to data, and the results displayed. These tasks can be done in a number of different computer applications: Microsoft Excel, SAS, SPSS, S-PLUS, R, Stata, etc. Using these applications is certainly statistical computing, and usually involves statistical programming, but it is not the focus of this book. In this book our aim is to provide a foundation for an understanding of how these applications work: we describe the calculations they do, and how you could do them yourself. 2 G E T T ING STA RT ED Since graphs play an important role in statistical analysis, drawing graphics of one, two, or higher dimensional data is an aspect of statistical programming. An important part of statistical programming is stochastic simulation. Digital computers are naturally very good at exact, reproducible computa- tions, but the real world is full of randomness. In stochastic simulation we program a computer to act as though it is producing random results, even though if we knew enough, the results would be exactly predictable. Statistical programming is closely related to other forms of numerical programming. It involves optimization and approximation of mathematical functions. There is less emphasis on differential equations than in physics or applied mathematics (though this is slowly changing). We tend to place more of an emphasis on the results and less on the analysis of the algorithms than in computer science. 1.2 Outline of the book This book is an introduction to statistical programming. We will start with basic programming: how to tell a computer what to do. We do this using the open source R statistical package, so we will teach you R, but we will try not to just teach you R. We will emphasize those things that are common to many computing platforms. Statisticians need to display data. We will show you how to construct statistical graphics. In doing this, we will learn a little bit about human vision, and how it motivates our choice of display. In our introduction to programming, we will show how to control the flow of execution of a program. For example, we might wish to do repeated calculations as long as the input consists of positive integers, but then stop when an input value hits 0. Programming a computer requires basic logic, and we will touch on Boolean algebra, a formal way to manipulate logical statements. The best programs are thought through carefully before being implemented, and we will discuss how to break down complex problems into simple parts. When we are discussing programming, we will spend quite a lot of time discussing how to get it right: how to be sure that the computer program is calculating what you want it to calculate. One distinguishing characteristic of statistical programming is that it is concerned with randomness: random errors in data, and models that include stochastic components. We will discuss methods for simulating random values with specified characteristics, and show how random simulations are useful in a variety of problems. Many statistical procedures are based on linear models. While discus- sion of linear regression and other linear models is beyond the scope of this book, we do discuss some of the background linear algebra, and how the computations it involves can be carried out. We also discuss the gen- eral problem of numerical optimization: finding the values which make a function as large or as small as possible. WHY U S E A C O MMAND LINE? 3 Each chapter has a number of exercises which are at varying degrees of difficulty. Solutions to selected exercises can be found on the web at www.stats.uwo.ca/faculty/braun/statprog!. 1.3 The R package This book uses R, which is an open-source package for statistical comput- ing. “Open source” has a number of different meanings; here the important one is that R is freely available, and its users are free to see how it is written, and to improve it. R is based on the computer language S, developed by John Chambers and others at Bell Laboratories in 1976. In 1993 Robert Gentleman and Ross Ihaka at the University of Auckland wanted to exper- iment with the language, so they developed an implementation, and named it R. They made it open source in 1995, and hundreds of people around the world have contributed to its development. S-PLUS is a commercial implementation of the S language. Because both R and S-PLUS are based on the S language, much of what is described in what follows will apply without change to S-PLUS. 1.4 Why use a command line? The R system is mainly command-driven, with the user typing in text and asking R to execute it. Nowadays most programs use interactive graphical user interfaces (menus, etc.) instead. So why did we choose such an old- fashioned way of doing things? Menu-based interfaces are very convenient when applied to a limited set of commands, from a few to one or two hundred. However, a command- line interface is open ended. As we will show in this book, if you want to program a computer to do something that no one has done before, you can easily do it by breaking down the task into the parts that make it up, and then building up a program to carry it out. This may be possible in some menu-driven interfaces, but it is much easier in a command-driven interface. Moreover, learning how to use one command line interface will give you skills that carry over to others, and may even give you some insight into how a menu-driven interface is implemented. As statisticians it is our belief that your goal should be understanding, and learning how to program at a command line will give you that at a fundamental level. Learning to use a menu-based program makes you dependent on the particular organization of that program. There is a fairly rich menu-driven interface to R available in the Rcmdr package.1 After you have worked through this book, if you come upon a statistical task that you don’t know how to start, you may find that the 1 A package is a collection of functions menus in Rcmdr give you an idea of what methods are available. and programs that can be used within R. 4 G E T T ING STA RT ED 1.5 Font conventions This book describes how to do computations in R. As we will see in the next chapter, this requires that the user types input, and R responds with text or graphs as output. To indicate the difference, we have typeset the user input in a slanted typewriter font, and text output in an upright version of the same font. For example, > This was typed by the user This is a response from R In most cases other than this one and certain exercises, we will show the actual response from R.2 There are also situations where the code is purely illustrative and is not meant to be executed. (Many of those are not correct R code at all; others illustrate the syntax of R code in a general way.) In these situations we have typeset the code examples in an upright typewriter font. For example, f( some arguments ) 1.6 Installation of R R can be downloaded from http://cran.r-project.org!. Most users should download and install a binary version. This is a version that has been translated (by compilers) into machine language for execution on a particular type of computer with a particular operating system. R is designed to be very portable: it will run on Microsoft Windows, Linux, Solaris, Mac OSX, and other operating systems, but different binary ver- sions are required for each. In this book most of what we do would be the same on any system, but when we write system-specific instructions, we will assume that readers are using Microsoft Windows. Installation on Microsoft Windows is straightforward. A binary version is available for Windows 98 or above from the web page http://cran.r-project.org/bin/windows/base. Download the “setup program,” a file with a name like R-2.5.1- win32.exe. Clicking on this file will start an almost automatic installa- tion of the R system. Though it is possible to customize the installation, the default responses will lead to a satisfactory installation in most situations, particularly for beginning users. One of the default settings of the installation procedure is to create an 2 We have used the Sweave package R icon on your computer’s desktop. so that R itself is computing the output. Once you have installed R, you will be ready to start statistical The computations in the text were done programming. Let’s learn how. with a pre-release version of R 2.5.0. 2 Introduction to the R language Having installed the R system, you are now ready to begin to learn the art of statistical programming. The first step is to learn the syntax of the language that you will be programming in; you need to know the rules of the language. This chapter will give you an introduction to the syntax of R. 2.1 Starting and quitting R In Microsoft Windows, the R installer will have created a Start Menu item and an icon for R on your desktop. Double clicking on the R icon starts the program.1 The first thing that will happen is that R will open the console, into which the user can type commands. The greater-than sign (>) is the prompt symbol. When this appears, you can begin typing commands. For example, R can be used as a calculator. We can type simple arithmetical expressions at the > prompt: > 5 + 49 Upon pressing the Enter key, the result 54 appears, prefixed by the number 1 in square brackets: > 5 + 49 54 The indicates that this is the first (and in this case only) result from the command. Other commands return multiple values, and each line of results will be labeled to aid the user in deciphering the output. For example, the sequence of integers from 1 to 20 may be displayed as follows: > options(width=40) > 1:20 1 Other systems may install an icon to 1 2 3 4 5 6 7 8 9 10 11 12 click, or may require you to type “R” at 13 14 15 16 17 18 19 20 a command prompt. 6 INT RO D U C T ION TO T HE R LA NG UAG E The first line starts with the first return value, so is labeled ; the second line starts with the 13th, so is labeled.2 Anything that can be computed on a pocket calculator can be computed at the R prompt. Here are some additional examples: > # "*" is the symbol for multiplication. > # Everything following a # sign is assumed to be a > # comment and is ignored by R. > 3 * 5 15 > 3 - 8 -5 > 12 / 4 3 To quit your R session, type > q() If you then hit the Enter key, you will be asked whether to save an image of the current workspace, or not, or to cancel. The workspace image contains a record of the computations you’ve done, and may contain some saved results. Hitting the Cancel option allows you to continue your current R session. We rarely save the current workspace image, but occasionally find it convenient to do so. Note what happens if you omit the parentheses () when attempting to quit: > q function (save = "default", status = 0, runLast = TRUE).Internal(quit(save, status, runLast)) This has happened because q is a function that is used to tell R to quit. Typing q by itself tells R to show us the (not very pleasant-looking) contents of the function q. By typing q(), we are telling R to call the function q. The action of this function is to quit R. Everything that R does is done through calls to functions, though sometimes those calls are hidden (as when we click on menus), or very basic (as when we call the multiplication function to multiply 3 times 5). 2.1.1 Recording your work Rather than saving the workspace, we prefer to keep a record of the com- mands we entered, so that we can reproduce the workspace at a later date. In Windows, the easiest way to do this is to enter commands in R’s script editor, available from the File menu. Commands are executed by high- 2 The position of the line break shown lighting them and hitting Ctrl-R (which stands for “run”). At the end of here depends on the optional setting a session, save the final script for a permanent record of your work. In options(width=40). Other choices other systems a text editor and some form of cut and paste serve the same of line widths would break in different purpose. places. B AS IC FEATU RES O F R 7 2.2 Basic features of R 2.2.1 Calculating with R At its most basic level, R can be viewed as a fancy calculator. We saw in the previous section that it can be used to do scalar arithmetic. The basic operations are + (add), - (subtract), * (multiply), and / (divide). It can also be used to compute powers with the ˆ operator. For example, > 3ˆ4 81 Modular arithmetic is also available. For example, we can compute the remainder after division of 31 by 7, i.e. 31 (mod 7): > 31 %% 7 3 and the integer part of a fraction as > 31 %/% 7 4 We can confirm that 31 is the sum of its remainder plus seven times the integer part of the fraction: > 7 * 4 + 3 31 2.2.2 Named storage R has a workspace known as the global environment that can be used to store the results of calculations, and many other types of objects. For a first example, suppose we would like to store the result of the calculation 1.0025ˆ30 for future use. (This number arises out of a compound interest calculation based on an interest rate of 0.25% per year and a 30-year period.) We will assign this value to an object called interest.30. To this, we type > interest.30 We tell R to make the assignment using an arrow that points to the left, created with the less-than sign ( interest.30 1.077783 8 INT RO D U C T ION TO T HE R LA NG UAG E Think of this as just another calculation: R is calculating the result of the expression interest.30, and printing it. We can also use interest.30 in further calculations if we wish. For example, we can calculate the bank balance after 30 years at 0.25% annual interest, if we start with an initial balance of $3000: > initial.balance final.balance final.balance 3233.35 Example 2.1 An individual wishes to take out a loan, today, of P at a monthly interest rate i. The loan is to be paid back in n monthly installments of size R, beginning one month from now. The problem is to calculate R. Equating the present value P to the future (discounted) value of the n monthly payments R, we have P = R(1 + i)−1 + R(1 + i)−2 + · · · R(1 + i)−n or n P=R (1 + i)−j. j=1 Summing this geometric series and simplifying, we obtain 1 − (1 + i)−n P=R. i This is the formula for the present value of an annuity. We can find R, given P, n and i as i R=P. 1 − (1 + i)−n In R, we define variables as follows: principal to hold the value of P, and intRate to hold the interest rate, and n to hold the number of payments. We will assign the resulting payment value to an object called payment. Of course, we need some numerical values to work with, so we will suppose that the loan amount is $1500, the interest rate is 1% and the number of payments is 10. The required code is then > intRate n principal payment payment 158.3731 For this particular loan, the monthly payments are $158.37. 2.2.3 Functions Most of the work in R is done through functions. For example, we saw that to quit R we type q(). This tells R to call the function named q. The parentheses surround the argument list, which in this case contains nothing: we just want R to quit, and do not need to tell it how. We also saw that q is defined as > q function (save = "default", status = 0, runLast = TRUE).Internal(quit(save, status, runLast)) This shows that q is a function that has three arguments: save, status, and runLast. Each of those has a default value: "default", 0, and TRUE, respectively. What happens when we execute q()is that R calls the q function with the arguments set to their default values. If we want to change the default values, we specify them when we call the function. Arguments are identified in the call by their position, or by specifying the name explicitly. For example, both q("no") q(save = "no") tell R to call q with the first argument set to "no", i.e. to quit without saving the workspace. If we had given two arguments without names, they would apply to save and status. If we want to accept the defaults of the early parameters but change later ones, we give the name when calling the function, e.g. q(runLast = FALSE) or use commas to mark the missing arguments, e.g. q( , , FALSE) It is a good idea to use named arguments when calling a function which has many arguments or when using uncommon arguments, because it reduces the risk of specifying the wrong argument, and makes your code easier to read. 2.2.4 Exact or approximate? One important distinction in computing is between exact and approximate results. Most of what we do in this book is aimed at approximate meth- ods. It is possible in a computer to represent any rational number exactly, but it is more common to use approximate representations: usually floating point representations. These are a binary (base-two) variation on scientific 10 I NT RO D U C T IO N TO T HE R LA NG UAG E notation. For example, we might write a number to four significant digits in scientific notation as 6.926 × 10−4. This representation of a number could represent any true value between 0.000 692 55 and 0.000 692 65. Standard floating point representations on computers are similar, except that a power of 2 would be used rather than a power of 10, and the fraction would be written in binary notation. The number above would be written as 1.0112 × 2−11 if four binary digit precision was used. The subscript 2 in the mantissa 1.0112 indicates that this number is shown in base 2; that is, it represents 1×20 +0×2−1 +1×2−2 +1×2−3 , or 1.375 in decimal notation. However, 6.926×10−4 and 1.0112 ×2−11 are not identical. Four binary digits give less precision than four decimal digits: a range of values from approximately 0.000 641 to 0.000 702 would all get the same representation to four binary digit precision. In fact, 6.926 × 10−4 cannot be represented exactly in binary notation in a finite number of digits. The problem is similar to trying to represent 1/3 as a decimal: 0.3333 is a close approximation, but is not exact. The standard precision in R is 53 binary digits, which is equivalent to about 15 or 16 decimal digits. To illustrate, consider the fractions 5/4 and 4/5. In decimal notation these can be represented exactly as 1.25 and 0.8 respectively. In binary notation 5/4 is 1 + 1/4 = 1.012. How do we determine the binary repre- sentation of 4/5? It is between 0 and 1, so we’d expect something of the form 0.b1 b2 b3 · · · , where each bi represents a “bit,” i.e. a 0 or 1 digit. Multi- plying by 2 moves the all bits left by one, i.e. 2 × 4/5 = 1.6 = b1.b2 b3 · · ·. Thus b1 = 1, and 0.6 = 0.b2 b3 · · ·. We can now multiply by 2 again to find 2 × 0.6 = 1.2 = b2.b3 · · · , so b2 = 1. Repeating twice more yields b3 = b4 = 0. (Try it!) At this point we’ll have the number 0.8 again, so the sequence of 4 bits will repeat indefinitely: in base 2, 4/5 is 0.110 011 001 100 · · ·. Since R only stores 53 bits, it won’t be able to store 0.8 exactly. Some rounding error will occur in the storage. We can observe the rounding error with the following experiment. With exact arithmetic, (5/4) × (4/5) = 1, so (5/4) × (n × 4/5) should be exactly n for any value of n. But if we try this calculation in R, we find > n 1.25 * (n * 0.8) - n 0.000000e+00 0.000000e+00 4.440892e-16 0.000000e+00 0.000000e+00 8.881784e-16 8.881784e-16 0.000000e+00 0.000000e+00 0.000000e+00 i.e. it is equal for some values, but not equal for n = 3, 6, or 7. The errors are very small, but nonzero. Rounding error tends to accumulate in most calculations, so usually a long series of calculations will result in larger errors than a short one. Some operations are particularly prone to rounding error: for example, subtraction of two nearly equal numbers, or (equivalently) addition of two numbers with nearly the same magnitude but opposite signs. Since the leading bits in the binary expansions of nearly equal numbers will match, they will cancel in subtraction, and the result will depend on what is stored in the later bits. B AS IC FEATU RES O F R 11 Example 2.2 Consider the standard formula for the sample variance of a sample x1 ,... , xn : 1 n s2 = (xi − x̄)2 , n−1 i=1 where x̄ is the sample mean, (1/n) xi. In R, s2 is available as var(), and x̄ is mean(). For example: > x mean(x) 6 > var(x) 11 > sum( (x - mean(x))ˆ2 ) / 10 11 Because this formula requires calculation of x̄ first and the sum of squared deviations second, it requires that all xi values be kept in memory. Not too long ago memory was so expensive that it was advantageous to rewrite the formula as n 1 s = 2 xi − nx̄. 2 2 n−1 i=1 This is called the “one-pass formula,” because we evaluate each xi value just once, and accumulate the sums of xi and of xi2. It gives the correct answer, both mathematically and in our example: > ( sum(xˆ2) - 11 * mean(x)ˆ2 ) / 10 11 However, notice what happens if we add a large value A to each xi. n 2 2 2 The sum i=1 xi increases by approximately nA , and so does nx̄. This doesn’t change the variance, but it provides the conditions for a “catastrophic loss of precision” when we take the difference: > A x var(x) 11 > ( sum(xˆ2) - 11 * mean(x)ˆ2 ) / 10 0 Since R gets the right answer, it clearly doesn’t use the one-pass formula, and neither should you. 12 I NT RO D U C T IO N TO T HE R LA NG UAG E 2.2.5 R is case-sensitive See what happens if you type > x MEAN(x) Error: could not find function "MEAN" or > mean(x) 5.5 Now try > MEAN MEAN(x) 5.5 The function mean()is built in to R. R considers MEAN to be a different function, because it is case-sensitive: m is a different letter than M. 2.2.6 Listing the objects in the workspace The calculations in the previous sections led to the creation of several simple R objects. These objects are stored in the current R workspace. A list of all objects in the current workspace can be printed to the screen using the objects()function: > objects() "A" "final.balance" "initial.balance" "interest.30" "intRate" "MEAN" "n" "payment" "principal" "saveopt" "x" A synonym for objects()is ls(). Remember that if we quit our R session without saving the workspace image, then these objects will disappear. If we save the workspace image, then the workspace will be restored at our next R session.3 2.2.7 Vectors A numeric vector is a list of numbers. The c()function is used to collect things together into a vector. We can type > c(0, 7, 8) 0 7 8 Again, we can assign this to a named object: 3 This will always be true if we start R > x x be the case, but users are free to change the folder during a session using the 0 7 8 menus or the setwd() function. B AS IC FEATU RES O F R 13 The : symbol can be used to create sequences of increasing (or decreasing) values. For example, > numbers5to20 numbers5to20 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Vectors can be joined together (i.e. concatenated ) with the c function. For example, note what happens when we type > c(numbers5to20, x) 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 7 8 Here is another example of the use of the c()function: > some.numbers a.mess a.mess 2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 59 67 71 73 79 83 89 97 103 107 109 113 119 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 4 3 2 1 Remember that the numbers in the square brackets give the index of the element immediately to the right. Among other things, this helps us to identify the 22nd element of a.mess as 89. 2.2.8 Extracting elements from vectors A nicer way to display the 22nd element of a.mess is to use square brackets to extract just that element: > a.mess 89 To print the second element of x, type > x 7 We can extract more than one element at a time. For example, > some.numbers[c(3, 6, 7)] 5 13 17 To get the third through seventh elements of numbers5to20, type > numbers5to20[3:7] 7 8 9 10 11 14 I NT RO D U C T IO N TO T HE R LA NG UAG E Negative indices can be used to avoid certain elements. For example, we can select all but the second element of x as follows: > x[-2] 0 8 The third through eleventh elements of some.numbers can be avoided as follows: > some.numbers[-(3:11)] 2 3 37 41 43 47 59 67 71 73 79 83 89 97 103 107 109 113 119 Using a zero index returns nothing. This is not something that one would usually type, but it may be useful in more complicated expressions. > numbers5to20[c(0, 3:7)] 7 8 9 10 11 Do not mix positive and negative indices. To see what happens, consider > x[c(-2, 3)] Error: only 0’s may be mixed with negative subscripts The problem is that it is not clear what is to be extracted: do we want the third element of x before or after removing the second one? 2.2.9 Vector arithmetic Arithmetic can be done on R vectors. For example, we can multiply all elements of x by 3: > x * 3 0 21 24 Note that the computation is performed elementwise. Addition, sub- traction and division by a constant have the same kind of effect. For example, > y y -5 2 3 For another example, consider taking the third power of the elements of x: > xˆ3 0 343 512 The above examples show how a binary arithmetic operator can be used with vectors and constants. In general, the binary operators also work element-by-element when applied to pairs of vectors. For example, we can compute yixi , for i = 1, 2, 3, i.e. (y1x1 , y2x2 , y3x3 ), as follows: > yˆx 1 128 6561 B AS IC FEATU RES O F R 15 When the vectors are different lengths, the shorter one is extended by recycling: values are repeated, starting at the beginning. For example, to see the pattern of remainders of the numbers 1 to 10 modulo 2 and 3, we need only give the 2:3 vector once: > c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, + 6, 6, 7, 7, 8, 8, 9, 9, 10, 10) %% 2:3 1 1 0 2 1 0 0 1 1 2 0 0 1 1 0 2 1 0 0 1 R will give a warning if the length of the longer vector is not a multiple of the length of the smaller one, because that is usually a sign that something is wrong. For example, if we wanted the remainders modulo 2, 3, and 4, this is the wrong way to do it: > c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, + 6, 6, 7, 7, 8, 8, 9, 9, 10, 10) %% 2:4 1 1 2 0 0 3 0 1 1 1 0 2 1 1 0 0 0 1 0 1 Warning message: longer object length is not a multiple of shorter object length in: c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10)%%2:4 (Do you see the error?) 2.2.10 Simple patterned vectors We saw the use of the :operator for producing simple sequences of integers. Patterned vectors can also be produced using the seq()function as well as the rep()function. For example, the sequence of odd numbers less than or equal to 21 can be obtained using seq(1, 21, by=2) Notice the use of by=2 here. The seq()function has several optional parameters, including one named by. If by is not specified, the default value of 1 will be used. Repeated patterns are obtained using rep(). Consider the following examples: > rep(3, 12) # repeat the value 3, 12 times 3 3 3 3 3 3 3 3 3 3 3 3 > rep(seq(2, 20, by=2), 2) # repeat the pattern 2 4... 20, twice 2 4 6 8 10 12 14 16 18 20 2 4 6 8 10 12 14 16 18 20 > rep(c(1, 4), c(3, 2)) # repeat 1, 3 times and 4, twice 1 1 1 4 4 > rep(c(1, 4), each=3) # repeat each value 3 times 1 1 1 4 4 4 > rep(seq(2, 20, 2), rep(2, 10)) # repeat each value twice 2 2 4 4 6 6 8 8 10 10 12 12 14 14 16 16 18 18 20 20 16 I NT RO D U C T IO N TO T HE R LA NG UAG E 2.2.11 Missing values and other special values The missing value symbol is NA. Missing values often arise in real data problems, but they can also arise because of the way calculations are performed. > some.evens some.evens[seq(2, 20, 2)] some.evens NA 2 NA 4 NA 6 NA 8 NA 10 NA 12 NA 14 NA 16 NA 18 NA 20 What happened here is that we assigned values to elements 2, 4,... , 20 but never assigned anything to elements 1, 3,... , 19, so R uses NA to signal that the value is unknown. Recall that x contains the values (0, 7, 8). Consider > x / x NaN 1 1 The NaN symbol denotes a value which is “not a number,” which arises as a result of attempting to compute the indeterminate 0/0. This sym- bol is sometimes used when a calculation does not make sense. In other cases, special values may be shown, or you may get an error or warning message: > 1 / x Inf 0.1428571 0.1250000 Here, R has tried to evaluate 1/0. Always be careful to make sure that vector indices are integers. When fractional values are used, they will be truncated towards 0. Thus 0.4 becomes 0, and we see > x[0.4] numeric(0) The output numeric(0)indicates a numeric vector of length zero. 2.2.12 Character vectors Scalars and vectors can be made up of strings of characters instead of numbers. All elements of a vector must be of the same type. For example, > colors more.colors # this appended some new elements to colors > z more.colors "red" "yellow" "blue" "green" "magenta" "cyan" > z # 1 has been converted to the character "1" "red" "green" "1" B AS IC FEATU RES O F R 17 There are two basic operations you might want to perform on character vectors. To take substrings, use substr(). The former takes arguments substr(x, start, stop), where x is a vector of character strings, and start and stop say which characters to keep. For example, to print the first two letters of each color use > substr(colors, 1, 2) "re" "ye" "bl" The substring()function is similar, but with slightly different definitions of the arguments: see the help page ?substring. The other basic operation is building up strings by concatenation. Use the paste()function for this. For example, > paste(colors, "flowers") "red flowers" "yellow flowers" "blue flowers" There are two optional parameters to paste(). The sep parameter controls what goes between the components being pasted together. We might not want the default space, for example: > paste("several ", colors, "s", sep="") "several reds" "several yellows" "several blues" The collapse parameter to paste()allows all the components of the resulting vector to be collapsed into a single string: > paste("I like", colors, collapse = ", ") "I like red, I like yellow, I like blue" 2.2.13 Factors Factors offer an alternative way of storing character data. For example, a factor with four elements and having the two levels, control and treatment can be created using: > grp grp "control" "treatment" "control" "treatment" > grp grp control treatment control treatment Levels: control treatment Factors are a more efficient way of storing character data when there are repeats among the vector elements. This is because the levels of a factor are internally coded as integers. To see what the codes are for our factor, we can type > as.integer(grp) 1 2 1 2 18 I NT RO D U C T IO N TO T HE R LA NG UAG E The labels for the levels are only stored once each, rather than being repeated. The codes are indices into the vector of levels: > levels(grp) "control" "treatment" > levels(grp)[as.integer(grp)] "control" "treatment" "control" "treatment" 2.2.14 More on extracting elements from vectors As for numeric vectors, square brackets []are used to index factor and character vector elements. For example, the factor grp has four elements, so we can print out the third element by typing > grp control Levels: control treatment We can access the second through fifth elements of more.colors as follows: > more.colors[2:5] "yellow" "blue" "green" "magenta" When there may be missing values, the is.na()function should be used to detect them. For instance, > is.na(some.evens) TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE (The result is a “logical vector”. More on these in Section 2.4 below.) The !symbol means “not”, so we can locate the non-missing values in some.evens as follows: > !is.na(some.evens) FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE We can then display the even numbers only: > some.evens[!is.na(some.evens)] 2 4 6 8 10 12 14 16 18 20 2.2.15 Matrices and arrays To arrange values into a matrix, we use the matrix()function: > m m [,1] [,2] [,3] [1,] 1 3 5 [2,] 2 4 6 B AS IC FEATU RES O F R 19 We can then access elements using two indices. For example, the value in the first row, second column is > m[1, 2] 3 Somewhat confusingly, R also allows a matrix to be indexed as a vector, using just one value: > m 4 Here elements are selected in the order in which they are stored inter- nally: down the first column, then down the second, and so on. This is known as column-major storage order. Some computer languages use row- major storage order, where values are stored in order from left to right across the first row, then left to right across the second, and so on. Whole rows or columns of matrices may be selected by leaving the corresponding index blank: > m[1,] 1 3 5 > m[, 1] 1 2 A more general way to store data is in an array. Arrays have multiple indices, and are created using the array function: > a a , , 1 [,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12 , , 2 [,1] [,2] [,3] [,4] [1,] 13 16 19 22 [2,] 14 17 20 23 [3,] 15 18 21 24 Notice that the dimensions were specified in a vector c(3, 4, 2). When inserting data, the first index varies fastest; when it has run through its full range, the second index changes, etc. 2.2.16 Data frames Most data sets are stored in R as data frames. These are like matrices, but with the columns having their own names. Columns can be of different types from each other. Use the data.frame()function to construct data 20 I NT RO D U C T IO N TO T HE R LA NG UAG E frames from vectors: > colors numbers colors.and.numbers colors.and.numbers colors numbers more.numbers 1 red 1 4 2 yellow 2 5 3 blue 3 6 Exercises 1 Calculate the remainder after dividing 31 079 into 170 166 719. 2 Calculate the monthly payment required for a loan of $200 000, at a monthly interest rate of 0.003, based on 300 monthly payments, starting in one month’s time. 3 Calculate the sum nj=1 r j , where r has been assigned the value 1.08, and compare with (1 − r n+1 )/(1 − r), for n = 10, 20, 30, 40. Repeat for r = 1.06. 4 Referring n to the above question, use the quick formula to compute r j , for r = 1.08, for all values of n between 1 and 100. Store the j=1 100 values in a vector. 5 Calculate the sum nj=1 j and compare with n(n + 1)/2, for n = 100, 200, 400, 800. 6 Referring n to the above question, use the quick formula to compute j=1 j for all values of n between 1 and 100. Store the 100 values in a vector. 7 Calculate the sum nj=1 j 2 and compare with n(n + 1)(2n + 1)/6, for n = 200, 400, 600, 800. 8 Referring n 2 to the above question, use the quick formula to compute j=1 j for all values of n between 1 and 100. Store the 100 values in a vector. 9 Calculate the sum N i=1 1/i, and compare with log(N)+ 0.6, for N = 500, 1000, 2000, 4000, 8000. 10 Can you explain these two results? (Hint: see Section 2.2.4.) > x x[0.9999999999999999] numeric(0) > x[0.99999999999999999] 0 11 Using rep()and seq()as needed, create the vectors 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 BU ILT- IN FU NC TIO NS AND ONLINE HELP 21 and 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 12 Using rep()and seq()as needed, create the vector 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9 13 Use the more.colors vector, rep()and seq()to create the vector "red" "yellow" "blue" "yellow" "blue" "green" "blue" "green" "magenta" "green" "magenta" "cyan" 2.2.17 Dates and times Dates and times are among the most difficult types of data to work with on computers. The standard calendar is very complicated: months of different lengths, leap years every four years (with exceptions for whole centuries) and so on. When looking at dates over historical time periods, changes to the calendar (such as the switch from the Julian calendar to the modern Gregorian calendar that occurred in various countries between 1582 and 1923) affect the interpretation of dates. Times are also messy, because there is often an unstated time zone (which may change for some dates due to daylight savings time), and some years have “leap seconds” added in order to keep standard clocks consistent with the rotation of the earth. There have been several attempts to deal with this in R. The base package has the function strptime()to convert from strings (e.g. "2007-12-25", or "12/25/07") to an internal numerical representa- tion, and format()to convert back for printing. The ISOdate()and ISOdatetime()functions are used when numerical values for the year, month day, etc. are known. Other functions are available in the chron package. These can be difficult functions to use, and a full description is beyond the scope of this book. 2.3 Built-in functions and online help The function q()is an example of a built-in function. There are many functions in R which are designed to do all sorts of things. The online help facility can help you to see what a particular function is supposed to do. There are a number of ways of accessing the help facility. If you know the name of the function that you need help with, the help()function is likely sufficient. For example, for help on the q()function, type > ?q or > help(q) Either of these commands opens a window which will give you a description of the function for quitting R. 22 I NT RO D U C T IO N TO T HE R LA NG UAG E Another commonly used function in R is mean(). Upon typing > help(mean) a new window will appear. The first part of its contents is mean package:base R Documentation Arithmetic Mean Description: Generic function for the (trimmed) arithmetic mean. Usage: mean(x,...) ## Default S3 method: mean(x, trim = 0, na.rm = FALSE,...) Arguments: x: An R object. Currently there are methods for numeric data frames, numeric vectors and dates. A complex vector is allowed for ’trim = 0’, only. trim: the fraction (0 to 0.5) of observations to be trimmed from each end of ’x’ before the mean is computed. (There may be small differences in the display on your system.) This tells us that mean()will compute the ordinary arithmetic average or it will do something called “trimming” if we ask for it. To compute the mean of the values of the x vector created earlier, we simply type > mean(x) 5 2.3.1 Built-in examples A useful alternative to help()is the example()function: > example(mean) mean> x xm c(xm, mean(x, trim = 0.10)) 8.75 5.50 mean> mean(USArrests, trim = 0.2) Murder Assault UrbanPop Rape 7.42 167.60 66.20 20.16 BU ILT- IN FU NC TIO NS AND ONLINE HELP 23 This example shows simple use of the mean()function as well as how to use the trim argument. (When trim=0.1, the highest 10% and lowest 10% of the data are deleted before the average is calculated.) 2.3.2 Finding help when you don’t know the function name It is often convenient to use help.start(). This brings up an Internet browser, such as Internet Explorer or Firefox.4 The browser will show you a menu of several options, including a listing of installed packages. The base package contains many of the routinely used functions. Another function that is often used is help.search(). For example, to see if there are any functions that do optimization (finding minima or maxima), type help.search("optimization") Here is the result of a such a search: Help files with alias or concept or title matching "optimization" using fuzzy matching: lmeScale(nlme) Scale for lme Optimization optimization(OPM) minimize linear function with linear constraints constrOptim(stats) Linearly constrained optimisation nlm(stats) Non-Linear Minimization optim(stats) General-purpose Optimization optimize(stats) One Dimensional Optimization portfolio.optim(tseries) Portfolio Optimization Type "help(FOO, package = PKG)" to inspect entry "FOO(PKG) TITLE". We can then check for specific help on a function like nlm()by typing help(nlm) Web search engines such as Google can be useful for finding help on R. Including “R” as a keyword in such a search will often bring up the relevant R help page.5 The name of the R package that is needed is usually listed at the top of the help page. Another function to note is RSiteSearch()which will do a search in the R-help mailing list and other web resources. For example, to bring up information on the treatment of missing values in R, we can type 4 R depends on your system having a RSiteSearch("missing") properly installed browser. If it doesn’t have one, you may see an error message, or possibly nothing at all. 2.3.3 Built-in graphics functions 5 You may find pages describing Two basic plots are the histogram and the scatterplot. Consider functions that you do not have installed, > x hist(x) packages. 24 I NT RO D U C T IO N TO T HE R LA NG UAG E Histogram of x Fig. 2.1 A simple histogram. 3.0 2.5 2.0 Frequency 1.5 1.0 0.5 0.0 8 10 12 14 16 18 20 x Fig. 2.2 A simple scatterplot. 0 y 2 4 6 8 10 x (see Figure 2.1) and > x y plot(x, y) (see Figure 2.2). Note that the x values are plotted along the horizontal axis. Another useful plotting function is the curve()function for plotting the graph of a univariate mathematical function on an interval. The left and right endpoints of the interval are specified by from and to arguments, respectively. BU ILT- IN FU NC TIO NS AND ONLINE HELP 25 Fig. 2.3 Plotting the sin curve. 1.0 0.5 sin (x) 0.0 0 5 10 15 x A simple example involves plotting the sine function on the interval [0, 6π]: > curve(expr = sin, from = 0, to = 6 * pi) (see Figure 2.3). The expr parameter is either a function (whose output is a numeric vector when the input is a numeric vector) or an expression in terms of x. An example of the latter type of usage is: curve(xˆ2 - 10 * x, from = 1, to = 10) More information on graphics can be found in Chapter 3. 2.3.4 Additional elementary built-in functions The sample median The sample median measures the middle value of a data set. If the data are x ≤ x ≤ · · · ≤ x[n], then the median is x[(n + 1)/2], if n is odd, or {[x[n/2] + x[n/2 + 1]}/2, if n is even. For example, the median of the values: 10, 10, 18, 30, 32 is 18, and the median of 40, 10, 10, 18, 30, 32 is the average of 18 and 30, i.e. 24. This calculation is handled by R as follows: median(x) # computes the median or 50th percentile of the data in x Other summary measures Summary statistics can be calculated for data stored in vectors. In particular, try var(x) # computes the variance of the data in x summary(x) # computes several summary statistics on the data in x Exercises 1 The following are a sample of observations on incoming solar radiation at a greenhouse: 11.1 10.6 6.3 8.8 10.7 11.2 8.9 12.2 26 I NT RO D U C T IO N TO T HE R LA NG UAG E (a) Assign the data to an object called solar.radiation. (b) Find the mean, median and variance of the radiation observations. (c) Add 10 to each observation of solar.radiation, and assign the result to sr10. Find the mean, median, and variance of sr10. Which statistics change, and by how much? (d) Multiply each observation by −2, and assign the result to srm2. Find the mean, median, and variance of srm2. How do the statistics change now? (e ) Plot a histogram of the solar.radiation, sr10, and srm2. ( f ) There are two formulas commonly used for the variance of a set of numbers: (1/n) ni=1 (xi − x̄)2 and [1/(n − 1)] ni=1 (xi − x̄)2. One uses the sample size n in the denominator, and the other uses n − 1. Which formula does the var()function in R use? 2.4 Logical vectors and relational operators We have used the c()function to put numeric vectors together as well as character vectors. R also supports logical vectors. These contain two different elements: TRUE and FALSE. 2.4.1 Boolean algebra To understand how R handles TRUE and FALSE, we need to understand a little “Boolean algebra.” The idea of Boolean algebra is to formalize a mathematical approach to logic. Logic deals with statements that are either true or false. We represent each statement by a letter or variable, e.g. A is the statement that the sky is clear, and B is the statement that it is raining. Depending on the weather where you are, those two statements may both be true (there is a “sun- shower”), A may be true and B false (the usual clear day), A false and B true (the usual rainy day), or both may be false (a cloudy but dry day). Boolean algebra tells us how to evaluate the truth of compound state- ments. For example, “A and B” is the statement that it is both clear and raining. This statement is only true during a sunshower. “A or B” says that it is clear or it is raining, or both: anything but the cloudy dry day. This is sometimes called an inclusive or, to distinguish it from the exclusive or “A xor B,” which says that it is either clear or raining, but not both. There is also the “not A” statement, which says that it is not clear. There is a very important relation between Boolean algebra and set theory. If we interpret A and B as sets, then we can think of “A and B” as the set of elements which are in A and are in B, i.e. the intersection A ∩ B. Similarly “A or B” can be interpreted as the set of elements that are in A or are in B, i.e. the union A∪B. Finally, “not A” is the complement of A, i.e. Ac. Because there are only two possible values (true and false), we can record all Boolean operations in a table. On the first line of Table 2.1 we list the basic Boolean expressions, on the second line the equivalent way to code them in R, and in the body of the table the results of the operations. LOG IC AL VEC TORS AND RELATIO NAL OP ERATO RS 27 Table 2.1. Truth table for Boolean operations Boolean A B not A not B A and B A or B R A B !A !B A & B A | B TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE Exercises 1 More complicated expressions can be constructed from the basic Boolean operations. Write out the truth table for the xor operator, and show how to write it in terms of and, or, and not. 2 Venn diagrams can be used to illustrate set unions and intersections. Draw Venn diagrams that correspond to the and, or, not, and xor operations. 3 DeMorgan’s laws in R notation are !(A & B) == (!A) | (!B) and !(A | B) == (!A) & (!B). Write these out in English using the A and B statements above, and use truth tables to confirm each equality. 2.4.2 Logical operations in R One of the basic types of vector in R holds logical values. For example, a logical vector may be constructed as > a b b[a] 13 2 The elements of b corresponding to TRUE are selected. If we attempt arithmetic on a logical vector, e.g. > sum(a) 2 then the operations are performed after converting FALSE to 0 and TRUE to 1. In this case the result is that we count how many occurrences of TRUE are in the vector. There are two versions of the Boolean operators. The usual versions are &, |, and !, as listed in the previous section. These are all vectorized, so we see for example > !a FALSE TRUE TRUE FALSE 28 I NT RO D U C T IO N TO T HE R LA NG UAG E If we attempt logical operations on a numerical vector, 0 is taken to be FALSE, and any nonzero value is taken to be TRUE: > a & (b - 2) TRUE FALSE FALSE FALSE The operators &&and ||are similar to &and |, but behave differently in two respects. First, they are not vectorized: only one calculation is done. Secondly, they are guaranteed to be evaluated from left to right, with the right-hand operand only evaluated if necessary. For example, if A is FALSE, then A && Bwill be FALSE regardless of the value of B, so B needn’t be evaluated. This can save time if evaluating B would be very slow, and may make calculations easier, for example if evaluating B would cause an error when A was FALSE. This behavior is sometimes called short-circuit evaluation. Exercises 1 Under what circumstances would B need to be evaluated in the expression A || B? 2 Using the values from the previous section, predict the output from each of these expressions, and then try them in R. min(b) min(a) max(b) max(a) length(a) 3 Type b * a 2.4.3 Relational operators It is often necessary to test relations when programming to decide whether they are TRUE or FALSE. R allows for equality and inequality relations to be tested in using the relational operators: , ==, >=, = 4 49) not being equal to 4. DATA INP U T AND O U TP U T 29 To test which elements are not equal to 4, type a != 4 To print the elements of a which are greater than 4, type a[a > 4] Type b pretend.df pretend.df$x 2.5.6 Lists Data frames are actually a special kind of list, or structure. Lists in R can contain any other objects. You won’t often construct these yourself, but 32 I NT RO D U C T IO N TO T HE R LA NG UAG E many functions return complicated results as lists. You can see the names of the objects in a list using the names()function, and extract parts of it: > names(d) # Print the names of the objects in the d data frame. > d$x # Print the x component of d The list()function is one way of organizing multiple pieces of output from functions. For example, > x y list(x = x, y = y) $x 3 2 3 $y 7 7 Exercises 1 Display the row 1, column 3 element of pretend.df. 2 Use two different commands to display the y column of pretend.df. Chapter exercises 1 Assign the data set in rnf6080.dat 8 to a data frame called rain.df. Use the header=FALSE option. (a) Display the row 2, column 4 element of rain.df. (b) What are the names of the columns of rain.df. (c) Display the contents of the second row of the rain dataset. (d) Use the following command to re-label the columns of this data frame: > names(rain.df) 3 8 This data set is available at www.stats.uwo.ca/faculty/ on the interval [0, 6]. braun/data/rnf6080.dat. 3 Programming statistical graphics Users of statistical computing will need to produce graphs of their data and the results of their computations. In this chapter we start with a general overview of how this is done in R, and learn how to draw some basic plots. We then discuss some of the issues involved in choosing a style of plot to draw: it is not always an easy choice, and there are plenty of bad examples in the world to lead us astray. Finally, we will go into some detail about how to customize graphs in R. There are several different graphics systems in R. The oldest one is most directly comparable to the original S graphics, and is now known as base graphics. You can think of base graphics as analogous to drawing with ink on paper. You build up a picture by drawing fixed things on it, and once some- thing is drawn, it is permanent, though you might be able to cover it with something else, or move to a clean sheet of paper. Since the very beginning, base graphics has been designed to allow easy production of good quality scientific plots. In this chapter we will concentrate on base graphics. The grid package provides the basis for a newer graphics system. It also has facilities to produce good quality graphics, but the programmer has access to the individual pieces of a graph, and can modify them: a graph is more like a physical model being built and displayed, rather than just drawn. The lattice and ggplot packages provide functions for high-level plots based on grid graphics. Both base and grid graphics are designed to be “device independent.” Directions are given where to draw and these drawing commands work on any device. The actual look of a graph will vary slightly from one device to another (e.g. on paper versus in a window on your screen), because of different capabilities. There are other more exotic graphics systems available in R as well, providing interactive graphics, 3-D displays, etc. These are beyond the scope of this book. 3.1 High-level plots In this section we will discuss several basic plots. The functions to draw these in R are called “high level” because you don’t need to worry about 34 P RO G R AMMIN G STAT IST IC A L G R A PHICS Death rates in Virginia Fig. 3.1 An example of a bar chart. 80 60 Deaths per 1000 40 20 0 Rural Male Rural Female Urban Male Urban Female the details of where the ink goes; you just describe the plot you want, and R does the drawing. 3.1.1 Bar charts and dot charts The most basic type of graph is one that displays a single set of numbers. Bar charts and dot charts do this by displaying a bar or dot whose length or position corresponds to the number. For example, the VADeaths dataset in R contains death rates (number of deaths per 1000 population per year) in various subpopulations within the state of Virginia in 1940. This may be plotted as a bar chart. > VADeaths Rural Male Rural Female Urban Male Urban Female 50-54 11.7 8.7 15.4 8.4 55-59 18.1 11.7 24.3 13.6 60-64 26.9 20.3 37.0 19.3 65-69 41.0 30.9 54.6 35.1 70-74 66.0 54.3 71.1 50.0 > barplot(VADeaths, beside=TRUE, legend=TRUE, ylim=c(0, 90), + ylab="Deaths per 1000", + main="Death rates in Virginia") Figure 3.1 shows the plot that results. The bars correspond to each num- ber in the matrix. The beside=TRUE argument causes the values in each column to be plotted side-by-side; legend=TRUE causes the legend in the top right to be added. The ylim=c(0, 90) argument modifies the verti- cal scale of the graph to make room for the legend. (We will describe other ways to place the legend in Section 3.3 below.) Finally, the main=argument sets the main title for the plot. HIGH- LEVEL P LOTS 35 Death rates in Virginia Fig. 3.2 An example of a dot Rural male chart. Rural female Urban male Urban female 0 20 40 60 Deaths per 1000 An alternative way to plot the same data is in a dot chart (Figure 3.2). > dotchart(VADeaths, xlim=c(0, 75), B + xlab="Deaths per 1000", + main="Death rates in Virginia") A We set the x-axis limits to run from 0 to 75 so that zero is included, because it is natural to want to compare the total rates in the different groups. F 3.1.2 Pie charts C D Pie charts display a vector of numbers by breaking up a circular disk into pieces whose angle (and hence area) is proportional to each number. For Fig. 3.3 A pie chart showing the example, the letter grades assigned to a class might arise in the proportions distribution of grades in a class. shown in Figure 3.3, which was drawn with the R code > groupsizes labels pie(groupsizes, labels, col=c("grey40", "white", "grey", "black", "grey90")) Pie charts are popular in non-technical publications, but they have fallen out of favour with statisticians. Some of the reasons why will be discussed in Section 3.2. 3.1.3 Histograms A histogram is a special type of bar chart that is used to show the frequency distribution of a collection of numbers. Each bar represents the count of x values that fall in the range indicated by the base of the bar. Usually all bars should be the same width; this is the default in R. In this case the height of each bar is proportional to the number of observations in the corresponding interval. If bars have different widths, then the area of the bar should be proportional to the count; in this way the height represents the density (i.e. the frequency per unit of x). 36 P RO G R AMMIN G STAT IST IC A L G R A PHICS Histogram of x Fig. 3.4 An example of a histogram of the values in a vector x of length 100. We can see for example that the most frequently 20 observed interval is −0.5 to 0, and that 23 values lie therein. 15 Frequency 10 5 0 –2 –1 0 1 2 3 x In R, hist(x,...) is the main way to plot histograms. Here x is a vector consisting of numeric observations, and optional parameters in... are used to control the details of the display. For example, Figure 3.4 shows the result of the following code. > x hist(x) If you have n values of x, R, by default, divides the range into approxi- mately log2 (n)+1 intervals, giving rise to that number of bars. For example, our data set consisted of 100 measurements. Since 100 > 26 = 64 100 < 27 = 128 6 < log2 (100) < 7, it can be seen that R should choose about 7 or 8 bars. In fact, it chose 11, because it also attempts to put the breaks at round numbers (multiples of 0.5 in this case). The above rule (known as the “Sturges” rule) is not always sat- isfactory for very large values of n, giving too few bars. Current research suggests that the number of bars should increase proportionally to n1/3 rather than proportional to log2 (n). The breaks = "Scott" and breaks = "Freedman-Diaconis" options provide variations on this choice. For example, Figure 3.5 shows the results for a 10 000 point dataset using the “Sturges” and “Scott” rules. 3.1.4 Box plots A box plot (or “box-and-whisker plot”) is an alternative to a histogram to give a quick visual display of the main features of a set of data. A rectangular box is drawn, together with lines which protrude from two opposing sides. HIGH- LEVEL P LOTS 37 breaks = "Sturges" breaks = "Scott" Fig. 3.5 Histograms of the 0 500 1000 1500 2000 200 400 600 800 values in a vector x of length Frequency Frequency 10 000, using two different rules for setting the breaks. 0 –4 –2 0 2 4 –2 0 2 4 x x The box gives an indication of the location and spread of the central portion of the data, while the extent of the lines (the “whiskers”) provides an idea outlier of the range of the bulk of the data. In some implementations, outliers (observations that are very different from the rest of the data) are plotted as upper whisker separate points. The basic construction of the box part of the boxplot is as follows: upper quartile 1 A horizontal line is drawn at the median. 2 Split the data into two halves, each containing the median. 3 Calculate the upper and lower quartiles as the medians of each half, and median draw horizontal lines at each of these values. Then connect the lines to form a rectangular box. lower quartile The box thus drawn defines the interquartile range (IQR). This is the difference between the upper quartile and the lower quartile. We use the IQR to give a measure of the amount of variability in the central portion of lower whisker the dataset, since about 50% of the data will lie within the box. The lower whisker is drawn from the lower end of the box to the smallest value that is no smaller than 1.5 IQR below the lower quartile. Similarly, the upper whisker is drawn from the middle of the upper end of the box to outliers the largest value that is no larger than 1.5 IQR above the upper quartile. The rationale for these definitions is that when data are drawn from the normal Fig. 3.6 Construction of a distribution or other distributions with a similar shape, about 99% of the boxplot. observations will fall between the whiskers. An annotated example of a box plot is displayed in Figure 3.6. Box plots are convenient for comparing distributions of data in two or more categories, with a number (say 10 or more) of numerical observations per category. For example, the iris dataset in R is a well-studied dataset of measurements of 50 flowers from each of three species of iris. Figure 3.7, produced by the code > boxplot(Sepal.Length ˜ Species, data = iris, + ylab = "Sepal length (cm)", main = "Iris measurements", + boxwex = 0.5) compares the distributions of the sepal length measurements between the different species. Here we have used R’s formula based interface to the graphics function: the syntax Sepal.Length ˜ Species is read as “Sepal.Length depending on Species,” where both are columns of the data frame specified by data = iris. The boxplot() function draws separate side-by-side box plots for each species. From these, we can see 38 P RO G R AMMIN G STAT IST IC A L G R A PHICS Iris measurements Fig. 3.7 An example of side-by-side boxplots. 8.0 7.5 7.0 Sepal length (cm) 6.5 6.0 5.5 5.0 4.5 setosa versicolor virginica substantial differences between the mean lengths for the species, and that there is one unusually small specimen among the virginica samples. 3.1.5 Scatterplots When doing statistics, most of the interesting problems have to do with the relationships between different measurements. To study this, one of the most commonly used plots is the scatterplot, in which points (xi , yi ), i = 1,... , n are drawn using dots or other symbols. These are drawn to show relationships between the xi and yi values. In R, scatterplots (and many other kinds of plots) are drawn using the plot () function. Its basic usage is plot(x, y,...)where x and y are numeric vectors of the same length holding the data to be plotted. There are many additional optional arguments, and versions of plot designed for non-numerical data as well. One important optional argument is type. The default is type="p", which draws a scatterplot. Line plots (in which line segments join the (xi , yi ) points in order from first to last) are drawn using type="l". Many other types are available, including type="n", to draw nothing: this just sets up the frame around the plot, allowing other functions to be used to draw in it. Some of these other functions will be discussed in Section 3.3. Many other types of graphs can be obtained with this function. We will show how to explore some of the options using some artificial data. Two vectors of numbers will be simulated, one from a standard normal distribution and the other from a Poisson distribution having mean 30. > x y # to y; mean value is 30 > mean(y) # the resulting value should be near 30 30.91 HIGH- LEVEL P LOTS 39 Poisson versus Normal Fig. 3.8 An example of a scatterplot. 50 45 40 35 y 30 25 20 –3 –2 –1 0 1 2 x The main argument sets the main title for the plot. Figure 3.8 shows the result of > plot(x, y, main = "Poisson versus Normal") Other possibilities you should try: > plot(x, y, type="l") # plots a broken line (a dense tangle of line > # segments here) > plot(sort(x), sort(y), type="l") # a plot of the sample "quantiles" There are many more optional arguments to the plot() function, described on the ?plot and ?par help pages. 3.1.6 QQ plots Quantile–quantile plots (otherwise known as QQ plots) are a type of scat- terplot used to compare the distributions of two groups or to compare a sample with a reference distribution. In the case where there are two groups of equal size, the QQ plot is obtained by first sorting the observations in each group: X ≤ · · · ≤ X [n] and Y ≤ · · · ≤ Y [n]. Next, draw a scatterplot of (X [i], Y [i]), for i = 1,... , n. When the groups are of different sizes, some scheme must be used to artificially match them. R reduces the size of the larger group to the size of the smaller one by keeping the minimum and maximum values, and choosing equally spaced quantiles between. For example, if there were five X values but twenty Y values, then the X values would be plotted against the minimum, lower quartile, median, upper quartile and maximum of the Y values. When plotting a single sample against a reference distribution, theoret- ical quantiles are used for one coordinate. R normally puts the theoretical quantiles on the X axis and the data on the Y axis, but some authors make 40 P RO G R AMMIN G STAT IST IC A L G R A PHICS the opposite choice. To avoid biases, quantiles are chosen corresponding to probabilities (i − 1/2)/n: these are centered evenly between zero and one. When the distributions of X and Y match, the points in the QQ plot will lie near the line y = x. We will see a different straight line if one distribution is a linear transformation of the other. On the other hand, if the two distributions are not the same, we will see systematic patterns in the QQ plot. The following code illustrates some common patterns (see Figure 3.9). > X A qqplot(X, A, main="A and X are the same") > B qqplot(X, B, main="B is rescaled X") > C qqplot(X, C, main="C has heavier tails") > D qqplot(X, D, main="D is skewed to the right") Exercises 1 The islands vector contains the areas of 48 land masses. (a) Plot a histogram of these data. (b) Are there advantages to taking logarithms of the areas before plotting the histogram? (c) Compare the histograms that result when using breaks based on Sturges’ and Scott’s rules. Make this comparison on the log-scale and on the original scale. (d) Construct a boxplot for these data on the log-scale as well as the original scale. (d) Construct a dot-chart of the areas. Is a log transformation needed here? (f ) Which form of graphic do you think is most appropriate for displaying these data? 2 The stackloss data frame contains 21 observations on four variables taken at a factory where ammonia is converted to nitric acid. The first three variables are Air.Flow, Water.Temp, and Acid.Conc. The fourth variable is stack.loss, which measures the amount of ammo- nia that escapes before being absorbed. (Read the help file for more Fig. 3.9 Several examples of information about this data frame.) QQ plots. (a) Use scatterplots to explore possible relationships between acid con- centration, water temperature, and air flow and the amount of ammonia which escapes. Do these relationships appear to be linear or nonlinear? (b) Use the pairs() function to obtain all pairwise scatterplots among the four variables. Identify pairs of variables where there might be linear or nonlinear relationships. 3 Consider the pressure data frame. There are two columns: temperature and pressure. C HO O S ING A HIGH- LEVEL GRAP HIC 41 (a) Construct a scatterplot with pressure on the vertical axis and temperature on the horizontal axis. Are the variables related linearly or nonlinearly? (b) The graph of the following function passes through the plotted points reasonably well: y = (0.168 + 0.007x)20/3. The differences between the pressure values predicted by the curve and the observed pressure values are called residuals. Here is a way to calculate them: > residu