Uploaded by FertileIrrational6003

Summary

This document is a syllabus for an R programming course in the fifth semester of the Computer Science with Data Analytics program. It covers R programming concepts including data structures (vectors, matrices, data frames, lists), functions, and modeling. The syllabus gives an overview of the topics by unit.

**Subject Name :** R PROGRAMMING

**Department :** Computer Science With Data Analytics

**Class :** III B.SC CSDA

**Semester :** V

**SYLLABUS**

**UNIT I:** **INTRODUCTION TO R:** Introduction to R -- R Data Structures -- Help Functions in R -- Vectors -- Scalars -- Declarations -- Recycling -- Common Vector Operations -- Using all and any -- Vectorized operations -- NA and NULL values -- Filtering -- Vectorized if-then-else -- Vector Element names. (9)

**UNIT II:** **MATRICES AND OPERATIONS:** Creating matrices -- Matrix Operations -- Applying Functions to Matrix Rows and Columns -- Adding and deleting rows and columns -- Vector/Matrix Distinction -- Avoiding Dimension Reduction -- Higher Dimensional arrays -- Lists -- Creating lists -- General list operations -- Accessing list components and values -- Applying functions to lists -- Recursive lists.

**UNIT III:** **DATA FRAMES:** Creating Data Frames -- Matrix-like operations in frames -- Merging Data frames -- Applying functions to Data Frames -- Factors and Tables -- Factors and levels -- Common Functions used with factors -- Working with tables -- Other factors and table related functions -- Control statements -- Arithmetic and Boolean operators and values -- Default Values for arguments -- Returning Boolean Values -- Functions are objects -- Environment and scope issues -- Writing Upstairs -- Recursion -- Replacement functions -- Tools for Composing function code -- Math and Simulation in R.

**UNIT IV:** **CLASSES AND OBJECTS:** S3 Classes -- S4 Classes -- Managing your objects -- Input/output -- Accessing keyboard and monitor -- Reading and writing files -- Accessing the internet -- String Manipulation -- Graphics -- Creating Graphs -- Customizing Graphs -- Saving Graphs to files -- Creating Three-Dimensional plots.
**UNIT V:** **MODELING IN R:** Interfacing R to other languages -- Parallel R -- Basic Statistics -- Linear Model -- Generalized Linear models -- Non-linear Models -- Time Series and AutoCorrelation -- Clustering.

**TEXTBOOK:**

1. Norman Matloff, "The Art of R Programming: A Tour of Statistical Software Design", No Starch Press, 2011.
2. Jared P. Lander, "R for Everyone: Advanced Analytics and Graphics", Addison-Wesley Data & Analytics Series, 2013.

**REFERENCE BOOKS:**

1. Mark Gardener, "Beginning R: The Statistical Programming Language", Wiley, 2013.
2. Robert Knell, "Introductory R: A Beginner's Guide to Data Visualisation, Statistical Analysis and Programming in R", Amazon Digital South Asia Services Inc, 2013.
3. Richard Cotton, "Learning R", O'Reilly Media, 2013.
4. Garrett Grolemund, "Hands-On Programming with R", O'Reilly Media, 2014.
5. Roger D. Peng, "R Programming for Data Science", Leanpub, 2018.

**UNIT I:** **INTRODUCTION TO R:** Introduction to R -- R Data Structures -- Help Functions in R -- Vectors -- Scalars -- Declarations -- Recycling -- Common Vector Operations -- Using all and any -- Vectorized operations -- NA and NULL values -- Filtering -- Vectorized if-then-else -- Vector Element names.

**[Features of R Programming]**

- Open-source
- Strong Graphical Capabilities
- Highly Active Community
- A Wide Selection of Packages
- Comprehensive Environment
- Can Perform Complex Statistical Calculations
- Distributed Computing
- Running Code Without a Compiler
- Interfacing with Databases
- Data Variety
- Machine Learning
- Data Wrangling
- Cross-platform Support
- Compatible with Other Programming Languages

**[Data Structures in R]**

- A data structure is a particular way of organizing data in a computer so that it can be used effectively. The idea is to reduce the space and time complexities of different tasks. Data structures in R programming are tools for holding multiple values.
- R's base data structures are often organized by their dimensionality (1D, 2D, or nD) and by whether they are homogeneous (all elements must be of the same type) or heterogeneous (the elements may be of different types). This gives rise to the six data structures most frequently used in data analysis.
- The most essential data structures used in R include:
- **Vectors**
- **Lists**
- **Dataframes**
- **Matrices**
- **Arrays**
- **Factors**

**[Vectors]**

A vector is an ordered collection of values of a basic data type, with a given length. The key point is that all the elements of a vector must be of the same data type; that is, vectors are homogeneous data structures. Vectors are one-dimensional data structures.

**[Example]**

**X = c(1, 3, 5, 7, 8)**
**print(X)**

**[Output]**

**\[1\] 1 3 5 7 8**

**[Types of vectors]**

R uses several types of vectors. Following are some of the types of vectors:

**[Example]**

**\# R program to create numeric Vectors**
**\# creation of vectors using c() function.**
**v1 \ c(1,2,4,1,2) + c(6,0,9,20,22)**

Here's a more subtle example:

**\> x**
**\[,1\] \[,2\]**
**\[1,\] 1 4**
**\[2,\] 2 5**
**\[3,\] 3 6**
**\> x+c(1,2)**
**\[,1\] \[,2\]**
**\[1,\] 2 6**
**\[2,\] 4 6**
**\[3,\] 4 8**

Again, keep in mind that matrices are actually long vectors. Here, x, as a 3-by-2 matrix, is also a six-element vector, which in R is stored column by column. In other words, in terms of storage, x is the same as c(1,2,3,4,5,6). We added a two-element vector to this six-element one, so our added vector needed to be repeated until it had six elements. In other words, we were essentially doing this:

**x + c(1,2,1,2,1,2)**

Not only that, but c(1,2,1,2,1,2) was also changed from a vector to a matrix having the same shape as x before the addition took place:

$$\begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 1 & 2 \end{pmatrix}$$

Thus, the net result was to compute the following:

$$\begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix} + \begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 1 & 2 \end{pmatrix}$$

**[Common Vector Operations]**

**[Vector Arithmetic and Logical Operations]**

Remember that R is a functional language. Every operator, including + in the following example, is actually a function.

**\> 2+3**
**\[1\] 5**
**\> \"+\"(2,3)**
**\[1\] 5**

Recall further that scalars are actually one-element vectors. So, we can add vectors, and the + operation will be applied element-wise.

**\> x \<- c(1,2,4)**
**\> x + c(5,0,-1)**
**\[1\] 6 2 3**

If you are familiar with linear algebra, you may be surprised at what happens when we multiply two vectors.

**\> x \* c(5,0,-1)**
**\[1\] 5 0 -4**

But remember, because of the way the \* function is applied, the multiplication is done element by element. The first element of the product (5) is the result of the first element of x (1) being multiplied by the first element of c(5,0,-1) (5), and so on. The same principle applies to the other numeric operators:

**\> x \<- c(1,2,4)**
**\> x / c(5,4,-1)**
**\[1\] 0.2 0.5 -4.0**
**\> x %% c(5,4,-1)**
**\[1\] 1 2 0**

**[Vector Indexing]**

One of the most important and frequently used operations in R is that of *indexing* vectors, in which we form a subvector by picking elements of the given vector for specific indices. The format is vector1\[vector2\], with the result that we select those elements of vector1 whose indices are given in vector2.

**\> y \<- c(1.2,3.9,0.4,0.12)**
**\> y\[c(1,3)\] \# extract elements 1 and 3 of y**
**\[1\] 1.2 0.4**
**\> y\[2:3\]**
**\[1\] 3.9 0.4**
**\> v \<- 3:4**
**\> y\[v\]**
**\[1\] 0.40 0.12**

Note that duplicates are allowed.

**\> x \<- c(4,2,17,5)**
**\> y \<- x\[c(1,1,3)\]**
**\> y**
**\[1\] 4 4 17**

Negative subscripts mean that we want to exclude the given elements in our output.

**\> z \<- c(5,12,13)**
**\> z\[-1\] \# exclude element 1**
**\[1\] 12 13**
**\> z\[-1:-2\] \# exclude elements 1 through 2**
**\[1\] 13**

In such contexts, it is often useful to use the length() function. For instance, suppose we wish to pick up all elements of a vector z except for the last.
The following code will do just that:

**\> z \<- c(5,12,13)**
**\> z\[1:(length(z)-1)\]**
**\[1\] 5 12**

Or more simply:

**\> z\[-length(z)\]**
**\[1\] 5 12**

This is more general than using z\[1:2\]. Our program may need to work for more than just vectors of length 3, and the second approach gives us that generality.

**[Generating Useful Vectors with the : Operator]**

There are a few R operators that are especially useful for creating vectors. Let's start with the colon operator :

**\> 5:8**
**\[1\] 5 6 7 8**
**\> 5:1**
**\[1\] 5 4 3 2 1**

You may recall that it was used earlier in this chapter in a loop context, as follows:

**for (i in 1:length(x)) {**

Beware of operator precedence issues.

**\> i \<- 2**
**\> 1:i-1 \# this means (1:i) - 1, not 1:(i-1)**
**\[1\] 0 1**
**\> 1:(i-1)**
**\[1\] 1**

In the expression 1:i-1, the colon operator takes precedence over the subtraction. So, the expression 1:i is evaluated first, returning 1:2. R then subtracts 1 from that expression. That means subtracting a one-element vector from a two-element one, which is done via recycling. The one-element vector (1) will be extended to (1,1) to be of compatible length with 1:2. Element-wise subtraction then yields the vector (0,1). In the expression 1:(i-1), on the other hand, the parentheses have higher precedence than the colon. Thus, 1 is subtracted from i, resulting in 1:1, as seen in the preceding example.

**[Generating Vector Sequences with seq()]**

A generalization of : is the seq() (or *sequence*) function, which generates a sequence in arithmetic progression. For instance, whereas 3:8 yields the vector (3,4,5,6,7,8), with the elements spaced one unit apart (4 − 3 = 1, 5 − 4 = 1, and so on), we can make them, say, three units apart, as follows:

**\> seq(from=12,to=30,by=3)**
**\[1\] 12 15 18 21 24 27 30**

The spacing can be a non-integer value, too, say 0.1.
**\> seq(from=1.1,to=2,length=10)**
**\[1\] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0**

seq() also helps with loops that begin like this:

**for (i in 1:length(x))**

If x is empty, this loop should not have any iterations, but it actually has two, since 1:length(x) evaluates to (1,0). We could fix this by writing the statement as follows:

**for (i in seq(x))**

To see why this works, let's do a quick test of seq():

**\> x \<- c(5,12,13)**
**\> x**
**\[1\] 5 12 13**
**\> seq(x)**
**\[1\] 1 2 3**
**\> x \<- NULL**
**\> x**
**NULL**
**\> seq(x)**
**integer(0)**

You can see that seq(x) gives us the same result as 1:length(x) if x is not empty, but it correctly evaluates to an empty vector, integer(0), if x is empty, resulting in zero iterations in the above loop. (seq_along(x) is the more robust form, since seq(x) treats a single number n as 1:n.)

**[Repeating Vector Constants with rep()]**

The rep() (or *repeat*) function allows us to conveniently put the same constant into long vectors. The call form is rep(x,times), which creates a vector of *times\*length(x)* elements, that is, times copies of x. Here is an example:

**\> x \<- rep(8,4)**
**\> x**
**\[1\] 8 8 8 8**
**\> rep(c(5,12,13),3)**
**\[1\] 5 12 13 5 12 13 5 12 13**
**\> rep(1:3,2)**
**\[1\] 1 2 3 1 2 3**

There is also a named argument each, with very different behavior, which interleaves the copies of x.

**\> rep(c(5,12,13),each=2)**
**\[1\] 5 5 12 12 13 13**

**[Using all and any]**

The any() and all() functions are handy shortcuts. They report whether any or all of their arguments are TRUE.

**\> x \<- 1:10**
**\> any(x \> 8)**
**\[1\] TRUE**
**\> any(x \> 88)**
**\[1\] FALSE**
**\> all(x \> 88)**
**\[1\] FALSE**
**\> all(x \> 0)**
**\[1\] TRUE**

For example, suppose that R executes the following:

**\> any(x \> 8)**

It first evaluates x \> 8, yielding this:

**(FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE)**

The any() function then reports whether any of those values is TRUE. The all() function works similarly and reports if *all* of the values are TRUE.
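As a brief illustration of any() and all() on data, here is a minimal sketch. The vector scores and its values are assumptions made up for this example; the last two lines also show how a missing value affects the result, a topic the notes return to under NA values.

```r
# Hypothetical sample data: exam scores out of 100
scores <- c(72, 85, 91, 64, 100)

# scores > 90 is a logical vector; any() and all() collapse it to one value
any_above_90 <- any(scores > 90)                   # is at least one score above 90?
all_valid    <- all(scores >= 0 & scores <= 100)   # is every score within range?

# A missing value leaves the answer undecided unless na.rm = TRUE is given
with_na  <- c(scores, NA)
unknown  <- all(with_na >= 0)                 # NA: the missing score could be negative
ignoring <- all(with_na >= 0, na.rm = TRUE)   # TRUE once the NA is dropped
```

Both functions accept the na.rm argument, so the choice between "unknown" and "ignore missing values" is explicit in the call.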
**[Example]**

**findruns \<-**

**[Vectorized operations]**

**\> u \<- c(5,2,8)**
**\> v \<- c(1,3,9)**
**\> u \> v**
**\[1\] TRUE FALSE FALSE**

Here, the \> function was applied to u\[1\] and v\[1\], resulting in TRUE, then to u\[2\] and v\[2\], resulting in FALSE, and so on. A key point is that if an R function uses vectorized operations, it, too, is vectorized, thus enabling a potential speedup. Here is an example:

**\> w \<- function(x) return(x+1)**
**\> w(u)**
**\[1\] 6 3 9**

Here, w() uses +, which is vectorized, so w() is vectorized as well. As you can see, there is an unlimited number of vectorized functions, as complex ones are built up from simpler ones. Note that even the transcendental functions (square roots, logs, trig functions, and so on) are vectorized.

**\> sqrt(1:9)**
**\[1\] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427**
**\[9\] 3.000000**

This applies to many other built-in R functions. For instance, let's apply the function for rounding to the nearest integer to an example vector y:

**\> y \<- c(1.2,3.9,0.4)**
**\> z \<- round(y)**
**\> z**
**\[1\] 1 4 0**

The point is that the round() function is applied individually to each element in the vector y. And remember that scalars are really single-element vectors, so the "ordinary" use of round() on just one number is merely a special case.

**\> round(1.2)**
**\[1\] 1**

Here, we used the built-in function round(), but you can do the same thing with functions that you write yourself. As mentioned earlier, even operators such as + are really functions. For example, consider this code:

**\> y \<- c(12,5,13)**
**\> y+4**
**\[1\] 16 9 17**

The reason element-wise addition of 4 works here is that the + is actually a function! Here it is explicitly:

**\> \'+\'(y,4)**
**\[1\] 16 9 17**

Note, too, that recycling played a key role here, with the 4 recycled into (4,4,4). Since we know that R has no scalars, let's consider vectorized functions that appear to have scalar arguments.
**\> f**
**function(x,c) return((x+c)\^2)**
**\> f(1:3,0)**
**\[1\] 1 4 9**
**\> f(1:3,1)**
**\[1\] 4 9 16**

In our definition of f() here, we clearly intend c to be a scalar, but, of course, it is actually a vector of length 1. Even if we use a single number for c in our call to f(), it will be extended through recycling to a vector for our computation of x+c within f(). So in our call f(1:3,1) in the example, the quantity x+c becomes as follows:

**(1,2,3) + (1,1,1)**

This brings up a question of code safety. There is nothing in f() that keeps us from using an explicit vector for c, such as in this example:

**\> f(1:3,1:3)**
**\[1\] 4 16 36**

You should work through the computation to confirm that (4,16,36) is indeed the expected output. If you really want to restrict c to scalars, you should insert some kind of check, say this one:

**\> f \<- function(x,c) {**
**if (length(c) != 1) stop(\"vector c not allowed\")**
**return((x+c)\^2)**
**}**

**[Vector In, Matrix Out]**

The vectorized functions we've been working with so far have scalar return values. Calling sqrt() on a number gives us a number. If we apply this function to an eight-element vector, we get eight numbers, thus another eight-element vector, as output.

But what if the function itself is vector-valued, as with z12(), which returns c(z,z\^2)? Applying z12() to 5, say, gives us the two-element vector (5,25). If we apply this function to an eight-element vector, it produces 16 numbers:

**\> x \<- 1:8**
**\> z12(x)**
**\[1\] 1 2 3 4 5 6 7 8 1 4 9 16 25 36 49 64**

It might be more natural to have these arranged as an 8-by-2 matrix, which we can do with the matrix function:

**\> matrix(z12(x),ncol=2)**
**\[,1\] \[,2\]**
**\[1,\] 1 1**
**\[2,\] 2 4**
**\[3,\] 3 9**
**\[4,\] 4 16**
**\[5,\] 5 25**
**\[6,\] 6 36**
**\[7,\] 7 49**
**\[8,\] 8 64**

But we can streamline things using sapply() (or *simplify apply*). The call sapply(x,f) applies the function f() to each element of x and then converts the result to a matrix.
Here is an example:

**\> z12 \<- function(z) return(c(z,z\^2))**
**\> sapply(1:8,z12)**
**\[,1\] \[,2\] \[,3\] \[,4\] \[,5\] \[,6\] \[,7\] \[,8\]**
**\[1,\] 1 2 3 4 5 6 7 8**
**\[2,\] 1 4 9 16 25 36 49 64**

We do get a 2-by-8 matrix, not an 8-by-2 one, but it's just as useful this way.

**[NA and NULL Values]**

**[Using NA]**

In many of R's statistical functions, we can instruct the function to skip over any missing values, or NAs. Here is an example:

**\> x \<- c(88,NA,12,168,13)**
**\> x**
**\[1\] 88 NA 12 168 13**
**\> mean(x)**
**\[1\] NA**
**\> mean(x,na.rm=T)**
**\[1\] 70.25**
**\> x \<- c(88,NULL,12,168,13)**
**\> mean(x)**
**\[1\] 70.25**

In the first call, mean() refused to calculate, as one value in x was NA. By setting the optional argument na.rm (*NA remove*) to true (T), we calculated the mean of the remaining elements. In the last call, R automatically skipped over the NULL value, which we'll look at in the next section. There are multiple NA values, one for each mode:

**\> x \<- c(5,NA,12)**
**\> mode(x\[1\])**
**\[1\] \"numeric\"**
**\> mode(x\[2\])**
**\[1\] \"numeric\"**
**\> y \<- c(\"abc\",\"def\",NA)**
**\> mode(y\[2\])**
**\[1\] \"character\"**
**\> mode(y\[3\])**
**\[1\] \"character\"**

**[Using NULL]**

One use of NULL is to build up vectors in loops, in which each iteration adds another element to the vector. In this simple example, we build up a vector of even numbers:

**\# build up a vector of the even numbers in 1:10**
**\> z \<- NULL**
**\> for (i in 1:10) if (i %%2 == 0) z \<- c(z,i)**
**\> z**
**\[1\] 2 4 6 8 10**

Recall that %% is the modulo operator, which gives the remainder of a division. For example, 13 %% 4 is 1, as the remainder of dividing 13 by 4 is 1. Thus the example loop starts with a NULL vector and then adds the element 2 to it, then 4, and so on. This is a very artificial example, of course, and there are much better ways to do this particular task. Here are two more ways to find the even numbers in 1:10:

**\> seq(2,10,2)**
**\[1\] 2 4 6 8 10**
**\> 2\*1:5**
**\[1\] 2 4 6 8 10**

But the point here is to demonstrate the difference between NA and NULL.
If we were to use NA instead of NULL in the preceding example, we would pick up an unwanted NA:

**\> z \<- NA**
**\> for (i in 1:10) if (i %%2 == 0) z \<- c(z,i)**
**\> z**
**\[1\] NA 2 4 6 8 10**

NULL values really are counted as nonexistent, as you can see here:

**\> u \<- NULL**
**\> length(u)**
**\[1\] 0**
**\> v \<- NA**
**\> length(v)**
**\[1\] 1**

NULL is a special R object with no mode.

**[Filtering]**

Another feature reflecting the functional language nature of R is *filtering*. This allows us to extract a vector's elements that satisfy certain conditions. Filtering is one of the most common operations in R, as statistical analyses often focus on data that satisfies conditions of interest.

**[Generating Filtering Indices]**

**\> z \<- c(5,2,-3,8)**
**\> w \<- z\[z\*z \> 8\]**
**\> w**
**\[1\] 5 -3 8**

Looking at this code in an intuitive, "What is our intent?" manner, we see that we asked R to extract from z all its elements whose squares were greater than 8 and then assign that subvector to w. But filtering is such a key operation in R that it's worthwhile to examine the technical details of how R achieves our intent above. Let's look at it done piece by piece:

**\> z \<- c(5,2,-3,8)**
**\> z**
**\[1\] 5 2 -3 8**
**\> z\*z \> 8**
**\[1\] TRUE FALSE TRUE TRUE**

Evaluation of the expression z\*z \> 8 gives us a vector of Boolean values! It's very important that you understand exactly how this comes about. First, in the expression z\*z \> 8, note that *everything* is a vector or vector operator:

- Since z is a vector, that means z\*z will also be a vector (of the same length as z).
- Due to recycling, the number 8 (or vector of length 1) becomes the vector (8,8,8,8) here.
- The operator \>, like +, is actually a function.

Let's look at an example of that last point:

**\> \"\>\"(2,1)**
**\[1\] TRUE**
**\> \"\>\"(2,5)**
**\[1\] FALSE**

Thus, the following:

**z\*z \> 8**

is really this:

**\"\>\"(z\*z,8)**

In other words, we are applying a function to vectors, which is yet another case of vectorization, no different from the others you've seen.
And thus the result is a vector, in this case a vector of Booleans. Then the resulting Boolean values are used to cull out the desired elements of z:

**\> z\[c(TRUE,FALSE,TRUE,TRUE)\]**
**\[1\] 5 -3 8**

**[Filtering with the subset() Function]**

Filtering can also be done with the subset() function. When applied to vectors, the difference between using this function and ordinary filtering lies in the manner in which NA values are handled.

**\> x \<- c(6,1:3,NA,12)**
**\> x**
**\[1\] 6 1 2 3 NA 12**
**\> x\[x \> 5\]**
**\[1\] 6 NA 12**
**\> subset(x,x \> 5)**
**\[1\] 6 12**

With ordinary filtering, R basically said, "Well, x\[5\] is unknown, so it's also unknown whether it is greater than 5." But you may not want NAs in your results. When you wish to exclude NA values, using subset() saves you the trouble of removing the NA values yourself.

**[The Selection Function which()]**

As you've seen, filtering consists of extracting elements of a vector z that satisfy a certain condition. In some cases, though, we may just want to find the positions within z at which the condition occurs. We can do this using which(), as follows:

**\> z \<- c(5,2,-3,8)**
**\> which(z\*z \> 8)**
**\[1\] 1 3 4**

The result says that elements 1, 3, and 4 of z have squares greater than 8. As with filtering, it is important to understand exactly what occurred in the preceding code. The expression

**z\*z \> 8**

is evaluated to (TRUE,FALSE,TRUE,TRUE). The which() function then simply reports which elements of the latter expression are TRUE. One handy (though somewhat wasteful) use of which() is for determining the location within a vector at which the first occurrence of some condition holds. For example, recall our earlier code to find the first 1 value within a vector x:

**first1 \<-**

**[Vectorized if-then-else]**

**\> x \<- c(5,2,9,12)**
**\> ifelse(x \> 6,2\*x,3\*x)**
**\[1\] 15 6 18 24**

We return a vector consisting of the elements of x, either multiplied by 2 or 3, depending on whether the element is greater than 6.
Again, it helps to think through what is really occurring here. The expression x \> 6 is a vector of Booleans. If the *i*th component is true, then the *i*th element of the return value will be set to the *i*th element of 2\*x; otherwise, it will be set to 3\*x\[i\], and so on. The advantage of ifelse() over the standard if-then-else construct is that it is vectorized, thus potentially much faster.

**[Example 1]**

**x \<- 5**
**if (x \> 0){**
**print(\"Positive number\")**
**}**

**[Output]**

**\[1\] \"Positive number\"**

**[Example 2]**

**v = c(14,7,6,9,2)**
**ifelse(v %% 2 == 1,\"odd\",\"even\")**

**[Output]**

**\[1\] \"even\" \"odd\" \"even\" \"odd\" \"even\"**

Internally, the test v %% 2 == 1 produces the logical vector c(FALSE,TRUE,FALSE,TRUE,FALSE). The second argument is recycled to the string vector c(\"odd\",\"odd\",\"odd\",\"odd\",\"odd\"), and the third argument to c(\"even\",\"even\",\"even\",\"even\",\"even\"). Finally, each position where the test is TRUE takes the value \"odd\", and each position where it is FALSE takes the value \"even\".

**[Example 3- Recoding an Abalone Data Set]**

Due to the vector nature of the arguments, you can nest ifelse() operations. In the following example, which involves an abalone data set, gender is coded as M, F, or I (for infant). We wish to recode those characters as 1, 2, or 3. The real data set consists of more than 4,000 observations, but for our example, we'll say we have just a few, stored in g:

**\> g**
**\[1\] \"M\" \"F\" \"F\" \"I\" \"M\" \"M\" \"F\"**
**\> ifelse(g == \"M\",1,ifelse(g == \"F\",2,3))**
**\[1\] 1 2 2 3 1 1 2**

What actually happens in that nested ifelse()? Let's take a careful look. First, for the sake of concreteness, let's find what the formal argument names are in the function ifelse():

**\> args(ifelse)**
**function (test, yes, no)**
**NULL**

Remember, for each element of test that is true, the function evaluates to the corresponding element in yes.
Similarly, if test\[i\] is false, the function evaluates to no\[i\]. All values so generated are returned together in a vector.

In our case here, R will execute the outer ifelse() call first, in which test is g == \"M\", and yes is 1 (recycled); no will (later) be the result of executing ifelse(g==\"F\",2,3). Now since test\[1\] is true, we generate yes\[1\], which is 1. So, the first element of the return value of our outer call will be 1.

Next R will evaluate test\[2\]. That is false, so R needs to find no\[2\]. R now needs to execute the inner ifelse() call. It hasn't done so before, because it hasn't needed it until now. R uses the principle of *lazy evaluation*, meaning that an expression is not computed until it is needed.

R will now evaluate ifelse(g==\"F\",2,3), yielding (3,2,2,3,3,3,2); this is no for the outer ifelse() call, so the latter's second return element will be the second element of (3,2,2,3,3,3,2), which is 2. When the outer ifelse() call gets to test\[4\], it will see that value to be false and thus will return no\[4\]. Since R had already computed no, it has the value needed, which is 3.

Remember that the vectors involved could be columns in matrices, which is a very common scenario. Say our abalone data is stored in the matrix ab, with gender in the first column. Then if we wish to recode as in the preceding example, we could do it this way:

**\> ab\[,1\] \<- ifelse(ab\[,1\] == \"M\",1,ifelse(ab\[,1\] == \"F\",2,3))**

Suppose we wish to form subgroups according to gender. We could use which() to find the element numbers corresponding to M, F, and I:

**\> m \<- which(g == \"M\")**
**\> f \<- which(g == \"F\")**
**\> i \<- which(g == \"I\")**
**\> m**
**\[1\] 1 5 6**
**\> f**
**\[1\] 2 3 7**
**\> i**
**\[1\] 4**

Going one step further, we could save these groups in a list, like this:

**\> grps \<- list()**
**\> for (gen in c(\"M\",\"F\",\"I\")) grps\[\[gen\]\] \<- which(g==gen)**
**\> grps**
**\$M**
**\[1\] 1 5 6**
**\$F**
**\[1\] 2 3 7**
**\$I**
**\[1\] 4**

Note that we take advantage of the fact that R's for() loop has the ability to loop through a vector of strings.
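The grouping loop can be put together as a self-contained sketch, using the same seven gender codes as above. The comparison with base R's split() is an addition made for this example and is not part of the original notes; the name grps2 is hypothetical.

```r
# Gender codes from the abalone example
g <- c("M", "F", "F", "I", "M", "M", "F")

# Build the list of positions one gender at a time, as in the loop above
grps <- list()
for (gen in c("M", "F", "I")) grps[[gen]] <- which(g == gen)

# split() performs the same grouping in one call; its components are
# ordered by the sorted group labels (F, I, M) rather than by loop order
grps2 <- split(seq_along(g), g)
```

The loop version makes the mechanics visible, while split() is shorter when the group labels are not known in advance.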
We might use our recoded data to draw some graphs, exploring the various variables in the abalone data set. Let's summarize the nature of the variables by adding the following header to the file:

**Gender,Length,Diameter,Height,WholeWt,ShuckedWt,ViscWt,ShellWt,Rings**

We could, for instance, plot diameter versus length, with a separate plot for males and females, using the following code:

**aba \<-**

**UNIT II:** **MATRICES AND OPERATIONS:**

**[Creating matrices]**

**\> m \<- matrix(c(1,2,3,4,5,6),nrow=2,byrow=T)**
**\> m**
**\[,1\] \[,2\] \[,3\]**
**\[1,\] 1 2 3**
**\[2,\] 4 5 6**

Note that the matrix is still stored in column-major order. The byrow argument enabled only our *input* to come in row-major form. This may be more convenient if you are reading from a data file organized that way.

**[General Matrix Operations]**

**[Performing Linear Algebra Operations on Matrices]**

You can perform various linear algebra operations on matrices, such as matrix multiplication, matrix scalar multiplication, and matrix addition. Using y from the preceding example, here is how to perform those three operations:

**\> y %\*% y \# mathematical matrix multiplication**
**\[,1\] \[,2\]**
**\[1,\] 7 15**
**\[2,\] 10 22**
**\> 3\*y \# mathematical multiplication of matrix by scalar**
**\[,1\] \[,2\]**
**\[1,\] 3 9**
**\[2,\] 6 12**
**\> y+y \# mathematical matrix addition**
**\[,1\] \[,2\]**
**\[1,\] 2 6**
**\[2,\] 4 8**

**[Matrix Indexing]**

**\> z**
**\[,1\] \[,2\] \[,3\]**
**\[1,\] 1 1 1**
**\[2,\] 2 1 0**
**\[3,\] 3 0 1**
**\[4,\] 4 0 0**
**\> z\[,2:3\]**
**\[,1\] \[,2\]**
**\[1,\] 1 1**
**\[2,\] 1 0**
**\[3,\] 0 1**
**\[4,\] 0 0**

Here, we requested the submatrix of z consisting of all elements with column numbers 2 and 3 and any row number. This extracts the second and third columns.
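The column extraction just shown can be reproduced with the following minimal sketch. The matrix() call used to build z is an assumption, since the notes only show z printed; the values are entered column by column to match that printout.

```r
# Rebuild the 4-by-3 matrix z shown above (values listed column by column)
z <- matrix(c(1, 2, 3, 4,
              1, 1, 0, 0,
              1, 0, 1, 0), nrow = 4)

# Leaving the row index empty selects every row
cols23 <- z[, 2:3]        # columns 2 and 3
no_col1 <- z[, -1]        # the same submatrix, requested by excluding column 1
```

Both forms return a 4-by-2 matrix; the negative subscript is convenient when it is easier to name the columns to drop than the ones to keep.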
Here's an example of extracting rows instead of columns:

**\> y**
**\[,1\] \[,2\]**
**\[1,\] 11 12**
**\[2,\] 21 22**
**\[3,\] 31 32**
**\> y\[2:3,\]**
**\[,1\] \[,2\]**
**\[1,\] 21 22**
**\[2,\] 31 32**
**\> y\[2:3,2\]**
**\[1\] 22 32**

You can also assign values to submatrices:

**\> y**
**\[,1\] \[,2\]**
**\[1,\] 1 4**
**\[2,\] 2 5**
**\[3,\] 3 6**
**\> y\[c(1,3),\] \<- matrix(c(1,1,8,12),nrow=2)**
**\> y**
**\[,1\] \[,2\]**
**\[1,\] 1 8**
**\[2,\] 2 5**
**\[3,\] 1 12**

Here, we assigned new values to the first and third rows of y. And here's another example of assignment to submatrices:

**\> x \<- matrix(nrow=3,ncol=3)**
**\> y \<- matrix(c(4,5,2,3),nrow=2)**
**\> y**
**\[,1\] \[,2\]**
**\[1,\] 4 2**
**\[2,\] 5 3**
**\> x\[2:3,2:3\] \<- y**
**\> x**
**\[,1\] \[,2\] \[,3\]**
**\[1,\] NA NA NA**
**\[2,\] NA 4 2**
**\[3,\] NA 5 3**

Negative subscripts, used with vectors to exclude certain elements, work the same way with matrices:

**\> y**
**\[,1\] \[,2\]**
**\[1,\] 1 4**
**\[2,\] 2 5**
**\[3,\] 3 6**
**\> y\[-2,\]**
**\[,1\] \[,2\]**
**\[1,\] 1 4**
**\[2,\] 3 6**

In the second command, we requested all rows of y except the second.

**[Filtering on Matrices]**

Filtering can be done with matrices, just as with vectors. Let's start with a simple example:

**\> x**
**x**
**\[1,\] 1 2**
**\[2,\] 2 3**
**\[3,\] 3 4**
**\> x\[x\[,2\] \>= 3,\]**
**x**
**\[1,\] 2 3**
**\[2,\] 3 4**

To see how this works, let's compute the filtering condition separately:

**\> j \<- x\[,2\] \>= 3**
**\> j**
**\[1\] FALSE TRUE TRUE**

Here, we look at the vector x\[,2\], which is the second column of x, and determine which of its elements are greater than or equal to 3. The result, assigned to j, is a Boolean vector. Now, use j in x:

**\> x\[j,\]**
**x**
**\[1,\] 2 3**
**\[2,\] 3 4**

Here, we compute x\[j,\], that is, the rows of x specified by the true elements of j, getting the rows corresponding to the elements in column 2 that were at least equal to 3.
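The two-step filtering above can be run as a self-contained sketch. The matrix() call that builds x is an assumption, since the notes only show x printed; the use of which() and sum() on j is added here for illustration.

```r
# Rebuild the 3-by-2 matrix x from the example (values listed column by column)
x <- matrix(c(1, 2, 3,
              2, 3, 4), nrow = 3)

# Step 1: a Boolean vector with one entry per row
j <- x[, 2] >= 3

# Step 2: use it as the row index; the empty column index keeps all columns
kept <- x[j, ]

# The same rows can be requested by position, and counted with sum()
same   <- x[which(j), ]
n_kept <- sum(j)          # TRUE counts as 1, FALSE as 0
```

Keeping j as a separate variable is useful when the same condition is applied to several objects with matching rows.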
For performance purposes, it's worth noting again that the computation of j here is a completely vectorized operation, since all of the following are true:

- The object x\[,2\] is a vector.
- The operator \>= compares two vectors.
- The number 3 was recycled to a vector of 3s.

Also note that even though j was defined in terms of x and then was used to extract from x, it did not need to be that way. The filtering criterion can be based on a variable separate from the one to which the filtering will be applied. Here's an example with the same x as above:

**\> z \<- c(5,12,13)**
**\> x\[z %% 2 == 1,\]**
**\[,1\] \[,2\]**
**\[1,\] 1 2**
**\[2,\] 3 4**

Here, the expression z %% 2 == 1 tests each element of z for being an odd number, thus yielding (TRUE,FALSE,TRUE). As a result, we extracted the first and third rows of x. Here is another example:

**\> m**
**\[,1\] \[,2\]**
**\[1,\] 1 4**
**\[2,\] 2 5**
**\[3,\] 3 6**
**\> m\[m\[,1\] \> 1 & m\[,2\] \> 5,\]**
**\[1\] 3 6**

We're using the same principle here, but with a slightly more complex set of conditions for row extraction. (Column extraction, or more generally, extraction of any submatrix, is similar.) First, the expression m\[,1\] \> 1 compares each element of the first column of m to 1 and returns (FALSE,TRUE,TRUE). The second expression, m\[,2\] \> 5, similarly returns (FALSE,FALSE,TRUE). We then take the logical AND of (FALSE,TRUE,TRUE) and (FALSE,FALSE,TRUE), yielding (FALSE,FALSE,TRUE). Using the latter in the row indices of m, we get the third row of m. Note that we needed to use &, the vector Boolean AND operator, rather than &&, the scalar one that we would use in an if statement.

The alert reader may have noticed an anomaly in the preceding example. Our filtering should have given us a submatrix of size 1 by 2, but instead it gave us a two-element vector. The elements were correct, but the data type was not.
This would cause trouble if we were to then input it to some other matrix function. The solution is to use the drop argument (writing drop=FALSE inside the brackets), which tells R to retain the two-dimensional nature of our data.

Since matrices are vectors, you can also apply vector operations to them. Here's an example:

**\> m**
**\[,1\] \[,2\]**
**\[1,\] 5 -1**
**\[2,\] 2 10**
**\[3,\] 9 11**
**\> which(m \> 2)**
**\[1\] 1 3 5 6**

R informed us here that, from a vector-indexing point of view, elements 1, 3, 5, and 6 of m are larger than 2. For example, element 5 is the element in row 2, column 2 of m, which we see has the value 10, which is indeed greater than 2.

**[Applying Functions to Matrix Rows and Columns]**

One of the most famous and most used features of R is the \*apply() family of functions, such as apply(), tapply(), and lapply(). Here, we'll look at apply(), which instructs R to call a user-specified function on each of the rows or each of the columns of a matrix.

**[Using the apply() Function]**

This is the general form of apply for matrices:

**apply(m,dimcode,f,fargs)**

where the arguments are as follows:

- m is the matrix.
- dimcode is the dimension, equal to 1 if the function applies to rows or 2 for columns.
- f is the function to be applied.
- fargs is an optional set of arguments to be supplied to f.

For example, here we apply the R function mean() to each column of a matrix z:

**\> z**
**\[,1\] \[,2\]**
**\[1,\] 1 4**
**\[2,\] 2 5**
**\[3,\] 3 6**
**\> apply(z,2,mean)**
**\[1\] 2 5**

In this case, we could have used the colMeans() function, but this provides a simple example of using apply(). A function you write yourself is just as legitimate for use in apply() as any R built-in function such as mean(). Here's an example using our own function f:

**\> z**
**\[,1\] \[,2\]**
**\[1,\] 1 4**
**\[2,\] 2 5**
**\[3,\] 3 6**
**\> f \<- function(x) x/c(2,8)**
**\> y \<- apply(z,1,f)**
**\> y**
**\[,1\] \[,2\] \[,3\]**
**\[1,\] 0.5 1.000 1.50**
**\[2,\] 0.5 0.625 0.75**

Our f() function divides a two-element vector by the vector (2,8).
(Recycling would be used if x had a length longer than 2.) The call to apply() asks R to call f() on each of the rows of z. The first such row is (1,4), so in the call to f(), the actual argument corresponding to the formal argument x is (1,4). Thus, R computes the value of (1,4)/(2,8), which in R's element-wise vector arithmetic is (0.5,0.5). The computations for the other two rows are similar.

You may have been surprised that the size of the result here is 2 by 3 rather than 3 by 2. That first computation, (0.5,0.5), ends up in the first column in the output of apply(), not the first row. But this is the behavior of apply(). If the function to be applied returns a vector of *k* components, then the result of apply() will have *k* rows. You can use the matrix transpose function t() to change it if necessary, as follows:

**\> t(apply(z,1,f))**
**\[,1\] \[,2\]**
**\[1,\] 0.5 0.500**
**\[2,\] 1.0 0.625**
**\[3,\] 1.5 0.750**

If the function returns a scalar (which we know is just a one-element vector), the final result will be a vector, not a matrix.

As you can see, the function to be applied needs to take at least one argument. The formal argument here will correspond to an actual argument of one row or column in the matrix, as described previously. In some cases, you will need additional arguments for this function, which you can place following the function name in your call to apply().

For instance, suppose we have a matrix of 1s and 0s and want to create a vector as follows: For each row of the matrix, the corresponding element of the vector will be either 1 or 0, depending on whether the majority of the first d elements in that row is 1 or 0. Here, d will be a parameter that we may wish to vary.
We could do this:

**\> copymaj**
**function(rw,d) {**
**maj \<- sum(rw\[1:d\]) / d**
**return(if(maj \> 0.5) 1 else 0)**
**}**
**\> x**
**\[,1\] \[,2\] \[,3\] \[,4\] \[,5\]**
**\[1,\] 1 0 1 1 0**
**\[2,\] 1 1 1 1 0**
**\[3,\] 1 0 0 1 1**
**\[4,\] 0 1 1 1 0**
**\> apply(x,1,copymaj,3)**
**\[1\] 1 1 0 1**
**\> apply(x,1,copymaj,2)**
**\[1\] 0 1 0 0**

Here, the values 3 and 2 form the actual arguments for the formal argument d in copymaj(). Let's look at what happened in the case of row 1 of x. That row consisted of (1,0,1,1,0), the first d elements of which were (1,0,1). A majority of those three elements were 1s, so copymaj() returned a 1, and thus the first element of the output of apply() was a 1.

Contrary to common opinion, using apply() will generally not speed up your code. The benefits are that it makes for very compact code, which may be easier to read and modify, and you avoid possible bugs in writing code for looping. Moreover, as R moves closer and closer to parallel processing, functions like apply() will become more and more important. For example, the clusterApply() function in the snow package gives R some parallel-processing capability by distributing the submatrix data to various network nodes, with each node basically applying the given function on its submatrix.

**[Adding and deleting rows and columns]**

Technically, matrices are of fixed length and dimensions, so we cannot add or delete rows or columns. However, matrices can be *reassigned*, and thus we can achieve the same effect as if we had directly done additions or deletions.

**[Changing the Size of a Matrix]**

Recall how we reassign vectors to change their size:

**\> x**
**\[1\] 12 5 13 16 8**
**\> x \<- c(x,20)**
**\> x**
**\[1\] 12 5 13 16 8 20**
**\> x \<- c(x\[1:3\],20,x\[4:6\])**
**\> x**
**\[1\] 12 5 13 20 16 8 20**
**\> x \<- x\[-2:-4\]**
**\> x**
**\[1\] 12 16 8 20**

In the first case, x is originally of length 5, which we extend to 6 via concatenation and then reassignment. We didn't literally change the length of x but instead created a new vector from x and then assigned x to that new vector.
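The reassignment steps just described can be sketched as follows. The helper names x_c, x_a, and x_short are introduced only for illustration, and the use of base R's append() as an alternative to slicing with c() is an addition here rather than part of the original walkthrough.

```r
# Start with the five-element vector from the text.
x <- c(12, 5, 13, 16, 8)

# Append 20 at the end: c() builds a NEW vector, which is then bound to x.
x <- c(x, 20)

# Insert 20 after the third element, first with c() and slices...
x_c <- c(x[1:3], 20, x[4:6])
# ...then with base R's append(), which does the same job.
x_a <- append(x, 20, after = 3)

# Drop elements 2 through 4 using negative indices.
x_short <- x_c[-(2:4)]

print(x_c)
print(x_short)
```

In every step the old vector is left untouched and a new one is created, which is why each result has to be assigned to a name to be kept.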
Analogous operations can be used to change the size of a matrix. For instance, the rbind() (row bind) and cbind() (column bind) functions let you add rows or columns to a matrix.

**\> one**
**\[1\] 1 1 1 1**
**\> z**
**\[,1\] \[,2\] \[,3\]**
**\[1,\] 1 1 1**
**\[2,\] 2 1 0**
**\[3,\] 3 0 1**
**\[4,\] 4 0 0**
**\> cbind(one,z)**
**\[1,\] 1 1 1 1**
**\[2,\] 1 2 1 0**
**\[3,\] 1 3 0 1**
**\[4,\] 1 4 0 0**

Here, cbind() creates a new matrix by combining a column of 1s with the columns of z. We chose to get a quick printout, but we could have assigned the result to z (or another variable), as follows:

**z \<- cbind(one,z)**

We could also have relied on recycling:

**\> cbind(1,z)**
**\[,1\] \[,2\] \[,3\] \[,4\]**
**\[1,\] 1 1 1 1**
**\[2,\] 1 2 1 0**
**\[3,\] 1 3 0 1**
**\[4,\] 1 4 0 0**

Here, the 1 value was recycled into a vector of four 1 values.

You can also use the rbind() and cbind() functions as a quick way to create small matrices. Here's an example:

**\> q \<- cbind(c(1,2),c(3,4))**
**\> q**
**\[,1\] \[,2\]**
**\[1,\] 1 3**
**\[2,\] 2 4**

Be careful with rbind() and cbind(), though. Like creating a vector, creating a matrix is time consuming (matrices are vectors, after all). In the following code, cbind() creates a new matrix:

**z \<- cbind(one,z)**

The new matrix happens to be reassigned to z; that is, we gave it the name z, the same name as the original matrix, which is now gone. But the point is that we did incur a time penalty in creating the matrix. If we did this repeatedly inside a loop, the cumulative penalty would be large.

So, if you are adding rows or columns one at a time within a loop, and the matrix will eventually become large, it's better to allocate a large matrix in the first place. It will be empty at first, but you fill in the rows or columns one at a time, rather than doing a time-consuming matrix memory allocation each time.
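The preallocation advice above can be sketched roughly as follows. The row count n and the values stored in each row are hypothetical, chosen only to make the two patterns comparable.

```r
n <- 1000  # hypothetical number of rows, chosen only for illustration

# Slow pattern: rbind() allocates a new, larger matrix on every iteration.
grow <- NULL
for (i in 1:n) {
  grow <- rbind(grow, c(i, i^2))
}

# Preferred pattern: allocate the full matrix once, then fill rows in place.
pre <- matrix(NA_real_, nrow = n, ncol = 2)
for (i in 1:n) {
  pre[i, ] <- c(i, i^2)
}

# Both approaches produce the same values.
stopifnot(all(grow == pre))
```

Wrapping each loop in system.time() is one way to see the difference on your own machine; the gap tends to widen as n grows.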
You can delete rows or columns by reassignment, too:

**\> m \<- matrix(1:6,nrow=3)**
**\> m**
**\[,1\] \[,2\]**
**\[1,\] 1 4**
**\[2,\] 2 5**
**\[3,\] 3 6**
**\> m \<- m\[c(1,3),\]**
**\> m**
**\[,1\] \[,2\]**
**\[1,\] 1 4**
**\[2,\] 3 6**

**[Vector/Matrix Distinction]**

A matrix is just a vector but with two additional attributes: the number of rows and the number of columns. Here, we'll take a closer look at the vector nature of matrices. Consider this example:

**\> z \<- matrix(1:8,nrow=4)**
**\> z**
**\[,1\] \[,2\]**
**\[1,\] 1 5**
**\[2,\] 2 6**
**\[3,\] 3 7**
**\[4,\] 4 8**

As z is still a vector, we can query its length:

**\> length(z)**
**\[1\] 8**

But as a matrix, z is a bit more than a vector:

**\> class(z)**
**\[1\] \"matrix\"**
**\> attributes(z)**
**\$dim**
**\[1\] 4 2**

(In R 4.0.0 and later, class(z) reports both \"matrix\" and \"array\".)

In other words, there actually is a matrix *class*, in the object-oriented programming sense. Most of R consists of S3 classes, whose components are denoted by dollar signs. The matrix class has one attribute, named dim, which is a vector containing the numbers of rows and columns in the matrix.

You can also obtain dim via the dim() function:

**\> dim(z)**
**\[1\] 4 2**

The numbers of rows and columns are obtainable individually via the nrow() and ncol() functions:

**\> nrow(z)**
**\[1\] 4**
**\> ncol(z)**
**\[1\] 2**

These just piggyback on dim(), as you can see by inspecting the code. Recall once again that objects can be printed in interactive mode by simply typing their names:

**\> nrow**
**function (x)**
**dim(x)\[1\]**

These functions are useful when you are writing a general-purpose library function whose argument is a matrix. By being able to determine the number of rows and columns in your code, you relieve the caller of the burden of supplying that information as two additional arguments. This is one of the benefits of object-oriented programming.

**[Avoiding Dimension Reduction]**

In the world of statistics, dimension reduction is a good thing, with many statistical procedures aimed to do it well. If we are working with, say, 10 variables and can reduce that number to 3 that still capture the essence of our data, we are better off.
However, in R, something else might merit the name *dimension reduction* that we may sometimes wish to avoid. Say we have a four-row matrix and extract a row from it:

**\> z**
**\[,1\] \[,2\]**
**\[1,\] 1 5**
**\[2,\] 2 6**
**\[3,\] 3 7**
**\[4,\] 4 8**
**\> r \<- z\[2,\]**
**\> r**
**\[1\] 2 6**

This seems innocuous, but note the format in which R has displayed r. It's a vector format, not a matrix format. In other words, r is a vector of length 2, rather than a 1-by-2 matrix. We can confirm this in a couple of ways:

**\> attributes(z)**
**\$dim**
**\[1\] 4 2**
**\> attributes(r)**
**NULL**
**\> str(z)**
**int \[1:4, 1:2\] 1 2 3 4 5 6 7 8**
**\> str(r)**
**int \[1:2\] 2 6**

Here, R informs us that z has row and column numbers, while r does not. Similarly, str() tells us that z has indices ranging in 1:4 and 1:2, for rows and columns, while r's indices simply range in 1:2. No doubt about it: r is a vector, not a matrix.

This seems natural, but in many cases, it will cause trouble in programs that do a lot of matrix operations. You may find that your code works fine in general but fails in a special case. For instance, suppose that your code extracts a submatrix from a given matrix and then does some matrix operations on the submatrix. If the submatrix has only one row, R will make it a vector, which could ruin your computation.

Fortunately, R has a way to suppress this dimension reduction: the drop argument. Here's an example, using the matrix z from above:

**\> r \<- z\[2,, drop=FALSE\]**
**\> r**
**\[,1\] \[,2\]**
**\[1,\] 2 6**
**\> dim(r)**
**\[1\] 1 2**

Now r is a 1-by-2 matrix, not a two-element vector. For these reasons, you may find it useful to routinely include the drop=FALSE argument in all your matrix code.

Why can we speak of drop as an argument? Because \[ is actually a function, just as is the case for operators like +.
Consider the following code:

**\> z\[3,2\]**
**\[1\] 7**
**\> \"\[\"(z,3,2)**
**\[1\] 7**

If you have a vector that you wish to be treated as a matrix, you can use the as.matrix() function, as follows:

**\> u**
**\[1\] 1 2 3**
**\> v \<- as.matrix(u)**
**\> attributes(u)**
**NULL**
**\> attributes(v)**
**\$dim**
**\[1\] 3 1**

**[Higher-Dimensional Arrays]**

In a statistical context, a typical matrix in R has rows corresponding to observations, say on various people, and columns corresponding to variables, such as weight and blood pressure. The matrix is then a two-dimensional data structure. But suppose we also have data taken at different times, one data point per person per variable per time. Time then becomes the third dimension, in addition to rows and columns. In R, such data sets are called *arrays*.

As a simple example, consider students and test scores. Say each test consists of two parts, so we record two scores for a student for each test. Now suppose that we have two tests, and to keep the example small, assume we have only three students. Here's the data for the first test:

**\> firsttest**
**\[,1\] \[,2\]**
**\[1,\] 46 30**
**\[2,\] 21 25**
**\[3,\] 50 48**

Student 1 had scores of 46 and 30 on the first test, student 2 scored 21 and 25, and so on. Here are the scores for the same students on the second test:

**\> secondtest**
**\[,1\] \[,2\]**
**\[1,\] 46 43**
**\[2,\] 41 35**
**\[3,\] 50 49**

Now let's put both tests into one data structure, which we'll name tests. We'll arrange it to have two "layers" (one layer per test) with three rows and two columns within each layer. We'll store firsttest in the first layer and secondtest in the second.

In layer 1, there will be three rows for the three students' scores on the first test, with two columns per row for the two portions of a test.
We use R's array function to create the data structure:

**\> tests \<- array(data=c(firsttest,secondtest),dim=c(3,2,2))**

In the argument dim=c(3,2,2), we are specifying two layers (this is the second 2), each consisting of three rows and two columns. This then becomes an attribute of the data structure:

**\> attributes(tests)**
**\$dim**
**\[1\] 3 2 2**

Each element of tests now has three subscripts, rather than two as in the matrix case. The first subscript corresponds to the first element in the \$dim vector, the second subscript corresponds to the second element in the vector, and so on. For instance, the score on the second portion of test 1 for student 3 is retrieved as follows:

**\> tests\[3,2,1\]**
**\[1\] 48**

R's print function for arrays displays the data layer by layer:

**\> tests**
**, , 1**
**\[,1\] \[,2\]**
**\[1,\] 46 30**
**\[2,\] 21 25**
**\[3,\] 50 48**
**, , 2**
**\[,1\] \[,2\]**
**\[1,\] 46 43**
**\[2,\] 41 35**
**\[3,\] 50 49**

Just as we built our three-dimensional array by combining two matrices, we can build four-dimensional arrays by combining two or more three-dimensional arrays, and so on. One of the most common uses of arrays is in calculating tables.

**[Lists]**

In contrast to a vector, in which all elements must be of the same mode, R's list structure can combine objects of different types. For those familiar with Python, an R list is similar to a Python dictionary.

**[Creating Lists]**

Technically, a list is a vector. Ordinary vectors are termed *atomic* vectors, since their components cannot be broken down into smaller components. In contrast, lists are referred to as *recursive* vectors.

For our first look at lists, let's consider an employee database. For each employee, we wish to store the name, salary, and a Boolean indicating union membership. Since we have three different modes here (character, numeric, and logical), it's a perfect place for using lists.
Our entire database might then be a list of lists, or some other kind of list such as a data frame, though we won't pursue that here.

We could create a list to represent our employee, Joe, this way:

**\> j \<- list(name=\"Joe\", salary=55000, union=T)**
**\> j**
**\$name**
**\[1\] \"Joe\"**
**\$salary**
**\[1\] 55000**
**\$union**
**\[1\] TRUE**

Actually, the component names (called *tags* in the R literature) such as salary are optional. We could alternatively do this:

**\> jalt \<- list(\"Joe\", 55000, T)**
**\> jalt**
**\[\[1\]\]**
**\[1\] \"Joe\"**
**\[\[2\]\]**
**\[1\] 55000**
**\[\[3\]\]**
**\[1\] TRUE**

However, it is generally considered clearer and less error-prone to use names instead of numeric indices. Names of list components can be abbreviated to whatever extent is possible without causing ambiguity:

**\> j\$sal**
**\[1\] 55000**

Since lists are vectors, they can be created via vector():

**\> z \<- vector(mode=\"list\")**
**\> z\[\[\"abc\"\]\] \<- 3**
**\> z**
**\$abc**
**\[1\] 3**

**[General List Operations]**

Now that you've seen a simple example of creating a list, let's look at how to access and work with lists.

**[List Indexing]**

You can access a list component in several different ways:

**\> j\$salary**
**\[1\] 55000**
**\> j\[\[\"salary\"\]\]**
**\[1\] 55000**
**\> j\[\[2\]\]**
**\[1\] 55000**

We can refer to list components by their numerical indices, treating the list as a vector. However, note that in this case, we use double brackets instead of single ones. So, there are three ways to access an individual component c of a list lst and return it in the data type of c:

- **lst\$c**
- **lst\[\[\"c\"\]\]**
- **lst\[\[i\]\], where i is the index of c within lst**

Each of these is useful in different contexts, as you will see in subsequent examples. But note the qualifying phrase, "return it in the data type of c."
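As a minimal sketch of the three access forms listed above, the following rebuilds the employee list j from this section and checks that each form hands back the component itself (here a number) rather than a list. The names by_dollar, by_name, and by_index are illustrative only.

```r
# The employee record used throughout this section.
j <- list(name = "Joe", salary = 55000, union = TRUE)

by_dollar <- j$salary        # access by tag with $
by_name   <- j[["salary"]]   # access by tag with double brackets
by_index  <- j[[2]]          # access by position with double brackets

# All three return the same numeric value...
print(c(by_dollar, by_name, by_index))
print(class(by_index))

# ...whereas single brackets wrap the component in a one-element list.
print(class(j["salary"]))
```

The double-bracket forms are handy when the tag or position is held in a variable, since $ needs the tag typed literally.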
An alternative to the second and third techniques listed is to use single brackets rather than double brackets:

- **lst\[\"c\"\]**
- **lst\[i\], where i is the index of c within lst**

Both single-bracket and double-bracket indexing access list elements in vector-index fashion. But there is an important difference from ordinary (atomic) vector indexing. If single brackets \[ \] are used, the result is another list, a sublist of the original. For instance, continuing the preceding example, we have this:

**\> j\[1:2\]**
**\$name**
**\[1\] \"Joe\"**
**\$salary**
**\[1\] 55000**
**\> j2 \<- j\[2\]**
**\> j2**
**\$salary**
**\[1\] 55000**
**\> class(j2)**
**\[1\] \"list\"**
**\> str(j2)**
**List of 1**
**\$ salary: num 55000**

The subsetting operation j\[1:2\] returned another list consisting of the first two components of the original list j. Note that the word *returned* makes sense here, since index brackets are functions. This is similar to other cases you've seen for operators that do not at first appear to be functions, such as +.

By contrast, you can use double brackets \[\[ \]\] for referencing only a single component, with the result having the type of that component.

**\> j\[\[1:2\]\]**
**Error in j\[\[1:2\]\] : subscript out of bounds**
**\> j2a \<- j\[\[2\]\]**
**\> j2a**
**\[1\] 55000**
**\> class(j2a)**
**\[1\] \"numeric\"**

**[Adding and Deleting List Elements]**

The operations of adding and deleting list elements arise in a surprising number of contexts. This is especially true for data structures in which lists form the foundation, such as data frames and R classes.
New components can be added after a list is created.

**\> z \<- list(a=\"abc\",b=12)**
**\> z**
**\$a**
**\[1\] \"abc\"**
**\$b**
**\[1\] 12**
**\> z\$c \<- \"sailing\"**
**\> \# did c really get added?**
**\> z**
**\$a**
**\[1\] \"abc\"**
**\$b**
**\[1\] 12**
**\$c**
**\[1\] \"sailing\"**

Adding components can also be done via a vector index:

**\> z\[\[4\]\] \<- 28**
**\> z\[5:7\] \<- c(FALSE,TRUE,TRUE)**
**\> z**
**\$a**
**\[1\] \"abc\"**
**\$b**
**\[1\] 12**
**\$c**
**\[1\] \"sailing\"**
**\[\[4\]\]**
**\[1\] 28**
**\[\[5\]\]**
**\[1\] FALSE**
**\[\[6\]\]**
**\[1\] TRUE**
**\[\[7\]\]**
**\[1\] TRUE**

You can delete a list component by setting it to NULL.

**\> z\$b \<- NULL**
**\> z**
**\$a**
**\[1\] \"abc\"**
**\$c**
**\[1\] \"sailing\"**
**\[\[3\]\]**
**\[1\] 28**
**\[\[4\]\]**
**\[1\] FALSE**
**\[\[5\]\]**
**\[1\] TRUE**
**\[\[6\]\]**
**\[1\] TRUE**

Note that upon deleting z\$b, the indices of the elements after it moved up by 1. For instance, the former z\[\[4\]\] became z\[\[3\]\].

You can also concatenate lists.

**\> c(list(\"Joe\", 55000, T),list(5))**
**\[\[1\]\]**
**\[1\] \"Joe\"**
**\[\[2\]\]**
**\[1\] 55000**
**\[\[3\]\]**
**\[1\] TRUE**
**\[\[4\]\]**
**\[1\] 5**

**[Getting the Size of a List]**

Since a list is a vector, you can obtain the number of components in a list via length().

**\> length(j)**
**\[1\] 3**

**[Accessing List Components and Values]**

If the components of a list have tags, you can obtain them via names():

**\> j \<- list(name=\"Joe\", salary=55000, union=T)**
**\> names(j)**
**\[1\] \"name\" \"salary\" \"union\"**

To obtain the values, use unlist():

**\> ulj \<- unlist(j)**
**\> ulj**
**name salary union**
**\"Joe\" \"55000\" \"TRUE\"**
**\> class(ulj)**
**\[1\] \"character\"**

The return value of unlist() is a vector, in this case a vector of character strings. Note that the element names in this vector come from the components in the original list.

**\> z \<- list(a=5,b=12,c=13)**
**\> y \<- unlist(z)**
**\> class(y)**
**\[1\] \"numeric\"**
**\> y**
**a b c**
**5 12 13**

So the output of unlist() in this case was a numeric vector. What about a mixed case?

**\> w \<- list(a=5,b=\"xyz\")**
**\> wu \<- unlist(w)**
**\> class(wu)**
**\[1\] \"character\"**
**\> wu**
**a b**
**\"5\" \"xyz\"**

Here, R chose the least common denominator: character strings. This sounds like some kind of precedence structure, and it is.
As R's help for unlist() states:

**Where possible the list components are coerced to a common mode during the unlisting, and so the result often ends up as a character vector. Vectors will be coerced to the highest type of the components in the hierarchy NULL \< raw \< logical \< integer \< double \< complex \< character \< list \< expression.**

**[Recursive Lists]**

Lists can be recursive, meaning that you can have lists within lists. Here's an example:

**\> b \<- list(u = 5, v = 12)**
**\> c \<- list(w = 13)**
**\> a \<- list(b,c)**
**\> a**
**\[\[1\]\]**
**\[\[1\]\]\$u**
**\[1\] 5**
**\[\[1\]\]\$v**
**\[1\] 12**
**\[\[2\]\]**
**\[\[2\]\]\$w**
**\[1\] 13**
**\> length(a)**
**\[1\] 2**

This code makes a into a two-component list, with each component itself also being a list.

The concatenate function c() has an optional argument recursive, which controls whether *flattening* occurs when recursive lists are combined.

**\> c(list(a=1,b=2,c=list(d=5,e=9)))**
**\$a**
**\[1\] 1**
**\$b**
**\[1\] 2**
**\$c**
**\$c\$d**
**\[1\] 5**
**\$c\$e**
**\[1\] 9**
**\> c(list(a=1,b=2,c=list(d=5,e=9)),recursive=T)**
**a b c.d c.e**
**1 2 5 9**

In the first case, we accepted the default value of recursive, which is FALSE, and obtained a recursive list, with the c component of the main list itself being another list. In the second call, with recursive set to TRUE, we got a single flattened vector as a result; only the names look recursive. (It's odd that setting recursive to TRUE gives a *nonrecursive* result.)

Recall that our first example of lists consisted of an employee database. I mentioned that since each employee was represented as a list, the entire database would be a list of lists. That is a concrete example of recursive lists.
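The list-of-lists idea for the employee database can be sketched as follows. The second employee, Mary, and her salary are made up for illustration, as are the names db, salaries, and flat.

```r
# Hypothetical employee records; only Joe appears in the original text.
joe  <- list(name = "Joe",  salary = 55000, union = TRUE)
mary <- list(name = "Mary", salary = 61000, union = FALSE)

# A recursive list: each component of db is itself a list.
db <- list(joe, mary)

print(length(db))        # number of employees
print(db[[2]]$salary)    # drill down: second employee, then the salary tag

# sapply() calls the function on each record and collects one field.
salaries <- sapply(db, function(emp) emp$salary)
print(salaries)

# With recursive = TRUE, c() flattens everything into one atomic vector;
# because the modes are mixed, the result is coerced to character.
flat <- c(db, recursive = TRUE)
print(class(flat))
```

Keeping each record as its own list preserves the original modes, whereas the flattened form loses them, which is one reason a data frame is usually the better container for such data.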
**UNIT III:** **DATA FRAMES:** Creating Data Frames -- Matrix-like operations in frames -- merging Data frames -- Applying functions to Data Frames -- Factors and Tables -- Factors and levels -- Common Functions used with factors -- Working with tables -- Other factors and table related functions -- Control statements -- Arithmetic and Boolean operators and values -- Default Values for arguments -- Returning Boolean Values -- Functions are objects -- Environment and scope issues -- Writing Upstairs -- Recursion -- Replacement functions -- Tools for Composing function code -- Math and Simulation in R.

**[Data Frame]**

A *data frame* is like a matrix, with a two-dimensional rows-and-columns structure. However, it differs from a matrix in that each column may have a different mode. For instance, one column may consist of numbers, and another column might have character strings. In this sense, just as lists are the heterogeneous analogs of vectors in one dimension, data frames are the heterogeneous analogs of matrices for two-dimensional data.

**[Creating Data Frames]**

**\> kids \<- c(\"Jack\",\"Jill\")**
**\> ages \<- c(12,10)**
**\> d \<- data.frame(kids,ages,stringsAsFactors=FALSE)**
**\> d \# matrix-like viewpoint**
**kids ages**
**1 Jack 12**
**2 Jill 10**

The first two arguments in the call to data.frame() are clear: We wish to produce a data frame from our two vectors, kids and ages. However, that third argument, stringsAsFactors=FALSE, requires more comment.

In versions of R before 4.0.0, if the named argument stringsAsFactors is not specified, it defaults to TRUE. (In those versions, you can also use options() to arrange the opposite default.) This means that if we create a data frame from a character vector (in this case, kids), R will convert that vector to a *factor*. Since R 4.0.0, the default is FALSE, so character vectors are left as they are. Because our work with character data will typically be with vectors rather than factors, we set stringsAsFactors to FALSE explicitly, which gives the same result in either version.

**[Accessing Data Frames]**

Now that we have a data frame, let's explore a bit.
Since d is a list, we can access it as such via component index values or component names:

**\> d\[\[1\]\]**
**\[1\] \"Jack\" \"Jill\"**
**\> d\$kids**
**\[1\] \"Jack\" \"Jill\"**

But we can treat it in a matrix-like fashion as well. For example, we can view column 1:

**\> d\[,1\]**
**\[1\] \"Jack\" \"Jill\"**

This matrix-like quality is also seen when we take d apart using str():

**\> str(d)**
**\'data.frame\': 2 obs. of 2 variables:**
**\$ kids: chr \"Jack\" \"Jill\"**
**\$ ages: num 12 10**

R tells us here that d consists of two observations (our two rows) that store data on two variables (our two columns).

Consider three ways to access the first column of our data frame above: d\[\[1\]\], d\[,1\], and d\$kids. Of these, the third would generally be considered clearer and, more importantly, safer than the first two. It better identifies the column and makes it less likely that you will reference the wrong column. But in writing general code (say, writing R packages) matrix-like notation d\[,1\] is needed, and it is especially handy if you are extracting subdata frames.

**[Example- Regression Analysis of Exam Grades]**

The first few records in the exam grades file are as follows:

**\"Exam 1\" \"Exam 2\" Quiz**
**2.0 3.3 4.0**
**3.3 2.0 3.7**
**4.0 4.0 4.0**
**2.3 0.0 3.3**
**2.3 1.0 3.3**
**3.3 3.7 4.0**

As you can see, each line contains the three test scores for one student. This is the classic two-dimensional file notion, like that alluded to in the preceding output of str(). Here, each line in our file contains the data for one observation in a statistical data set. The idea of a data frame is to encapsulate such data, along with variable names, into one object.

Notice that we have separated the fields here by spaces. Other delimiters may be specified, notably commas for comma-separated value (CSV) files. The variable names specified in the first record must be separated by the same delimiter as used for the data, which is spaces in this case.
If the names themselves contain embedded spaces, as we have here, they must be quoted. We read in the file as before, but in this case we state that there is a header record:

**\> examsquiz \<- read.table(\"exams\",header=TRUE)**
**\> head(examsquiz)**
**Exam.1 Exam.2 Quiz**
**1 2.0 3.3 4.0**
**2 3.3 2.0 3.7**
**3 4.0 4.0 4.0**
**4 2.3 0.0 3.3**
**5 2.3 1.0 3.3**
**6 3.3 3.7 4.0**

**[Matrix-Like Operations]**

Various matrix operations also apply to data frames. Most notably and usefully, we can do filtering to extract various subdata frames of interest.

**[Extracting Subdata Frames]**

As mentioned, a data frame can be viewed in row-and-column terms. In particular, we can extract subdata frames by rows or columns. Here's an example:

**\> examsquiz\[2:5,\]**
**\> examsquiz\[2:5,2\]**

Note that in that second call, since examsquiz\[2:5,2\] is a vector, R created a vector instead of another data frame.

We can also do filtering. Here's how to extract the subframe of all students whose first exam score was at least 3.8:

**\> examsquiz\[examsquiz\$Exam.1 \>= 3.8,\]**
**Exam.1 Exam.2 Quiz**
**3 4 4.0 4.0**
**9 4 3.3 4.0**
**11 4 4.0 4.0**
**14 4 4.0 4.0**
**16 4 3.7 4.0**
**19 4 4.0 4.0**
**22 4 4.0 4.0**
**25 4 4.0 3.3**
**29 4 3.0 3.7**

**[More on Treatment of NA Values]**

Suppose the second exam score for the first student had been missing. Then we would have typed the following into that line when we were preparing the data file:

**2.0 NA 4.0**

In any subsequent statistical analyses, R would do its best to cope with the missing data. However, in some situations, we need to set the option **na.rm=TRUE**, explicitly telling R to ignore NA values. For instance, with the missing exam score, calculating the mean score on exam 2 by calling R's mean() function with na.rm=TRUE would skip that first student in finding the mean. Otherwise, R would just report NA for the mean. Here's a little example:

**\> x \<- c(2,NA,4)**
**\> mean(x)**
**\[1\] NA**
**\> mean(x,na.rm=TRUE)**
**\[1\] 3**

You can also use the subset() function for row selection in data frames.
The column names are taken in the context of the given data frame. In our example, instead of typing this:

**\> examsquiz\[examsquiz\$Exam.1 \>= 3.8,\]**

we could run this:

**\> subset(examsquiz,Exam.1 \>= 3.8)**

Note that we do not need to write this:

**\> subset(examsquiz,examsquiz\$Exam.1 \>= 3.8)**

In some cases, we may wish to rid our data frame of any observation that has at least one NA value. A handy function for this purpose is complete.cases().

![](media/image9.jpg)

Cases 2 and 4 were incomplete; hence the FALSE values in the output of complete.cases(d4). We then use that output to select the intact rows.

**[Using the rbind() and cbind() Functions and Alternatives]**

You can use cbind() to add a new column that has the same length as the existing columns. In using rbind() to add a row, the added row is typically in the form of another data frame or list.

You can also create new columns from old ones. For instance, we can add a variable that is the difference between exams 1 and 2:

![](media/image11.jpg)

The new name is rather unwieldy: It's long, and it has embedded blanks. We could change it, using the names() function, but it would be better to exploit the list basis of data frames and add a column (of the same length) to the data frame for this result:

**\> examsquiz\$ExamDiff \<- examsquiz\$Exam.1 - examsquiz\$Exam.2**

Since one can add a new component to an already existing list at any time, we did so: We added a component ExamDiff to the list/data frame examsquiz.

We can even exploit recycling to add a column that is of a different length than those in the data frame:

![](media/image16.jpg)

**[Applying apply()]**

You can use apply() on data frames, if the columns are all of the same type. For instance, we can find the maximum grade for each student, as follows:

**\> apply(examsquiz,1,max)**

**[Merging Data Frames]**

In the relational database world, one of the most important operations is that of a *join*, in which two tables can be combined according to the values of a common variable.
In R, two data frames can be similarly combined using the merge() function. The simplest form is as follows:

**merge(x,y)**

This merges data frames x and y. It assumes that the two data frames have one or more columns with names in common. Here's an example:

![](media/image18.jpg)

Here, the two data frames have the variable kids in common. R found the rows in which this variable had the same value in both data frames (the ones for Jack and Jill). It then created a data frame with corresponding rows and with columns taken from both data frames (kids, states, and ages).

The merge() function has named arguments by.x and by.y, which handle cases in which variables have similar information but different names in the two data frames. Here's an example:

Even though our variable was called kids in one data frame and pals in the other, it was meant to store the same information, and thus the merge made sense.

![](media/image20.jpg)

There are two Jills in d2a. There is a Jill in d1 who lives in Massachusetts and another Jill with unknown residence. In our previous example, merge(d1,d2), there was only one Jill, who was presumed to be the same person in both data frames. But here, in the call merge(d1,d2a), it may have been the case that only one of the Jills was a Massachusetts resident. It is clear from this little example that you must choose matching variables with great care.

**[Applying Functions to Data Frames]**

As with lists, you can use the lapply() and sapply() functions with data frames.

**[Using lapply() and sapply() on Data Frames]**

Keep in mind that data frames are special cases of lists, with the list components consisting of the data frame's columns. Thus, if you call lapply() on a data frame with a specified function f(), then f() will be called on each of the frame's columns, with the return values placed in a list.
For instance, with our previous example, we can use lapply() as follows:

**\> dl \<- lapply(d,sort)**
**\> dl**
**\$kids**
**\[1\] \"Jack\" \"Jill\"**
**\$ages**
**\[1\] 10 12**

So, dl is a list consisting of two vectors, the sorted versions of kids and ages. Note that dl is just a list, not a data frame. We could coerce it to a data frame, like this:

**\> as.data.frame(dl)**
**kids ages**
**1 Jack 10**
**2 Jill 12**

But this would make no sense, as the correspondence between names and ages has been lost. Jack, for instance, is now listed as 10 years old instead of 12.

**[Factors and Tables]**

Factors form the basis for many of R's powerful operations, including many of those performed on tabular data. The motivation for factors comes from the notion of *nominal*, or *categorical*, variables in statistics. These values are non-numerical in nature, corresponding to categories such as Democrat, Republican, and Unaffiliated, although they may be coded using numbers.

**[Factors and Levels]**

An R *factor* might be viewed simply as a vector with a bit more information added. That extra information consists of a record of the distinct values in that vector, called *levels*. Here's an example:

**\> x \<- c(5,12,13,12)**
**\> xf \<- factor(x)**
**\> xf**
**\[1\] 5 12 13 12**
**Levels: 5 12 13**

The distinct values in xf (5, 12, and 13) are the levels here. Let's take a look inside:

**\> str(xf)**
**Factor w/ 3 levels \"5\",\"12\",\"13\": 1 2 3 2**
**\> unclass(xf)**
**\[1\] 1 2 3 2**
**attr(,\"levels\")**
**\[1\] \"5\" \"12\" \"13\"**

This is revealing. The core of xf here is not (5,12,13,12) but rather (1,2,3,2). The latter means that our data consists first of a level-1 value, then level-2 and level-3 values, and finally another level-2 value. So the data has been recorded by level. The levels themselves are recorded too, of course, though as characters such as \"5\" rather than 5.
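To make the distinction between level codes and level labels concrete, here is a small sketch using the same xf. The names codes, lv, and orig are introduced only for illustration. The practical point, which follows from the storage scheme just described, is that converting a factor directly to numbers yields the codes, so the original values have to be recovered through the labels.

```r
x  <- c(5, 12, 13, 12)
xf <- factor(x)

# Direct conversion exposes the internal level codes, not the data values.
codes <- as.integer(xf)
print(codes)

# The labels are stored once, as character strings.
lv <- levels(xf)
print(lv)

# Index the labels by the codes, then convert, to get the data back.
orig <- as.numeric(lv[codes])
print(orig)
```

The commonly used shortcut as.numeric(as.character(xf)) gives the same result as the last step.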
The length of a factor is still defined in terms of the length of the data rather than, say, being a count of the number of levels:

**\> length(xf)**
**\[1\] 4**

We can anticipate future new levels, as seen here:

**\> x \<- c(5,12,13,12)**
**\> xff \<- factor(x,levels=c(5,12,13,88))**
**\> xff**
**\[1\] 5 12 13 12**
**Levels: 5 12 13 88**
**\> xff\[2\] \<- 88**
**\> xff**
**\[1\] 5 88 13 12**
**Levels: 5 12 13 88**

Originally, xff did not contain the value 88, but in defining it, we allowed for that future possibility. Later, we did indeed add the value.

By the same token, you cannot sneak in an "illegal" level. Here's what happens when you try to assign 28, which is not among the declared levels:

**\> xff\[2\] \<- 28**

R issues a warning that the factor level is invalid and stores NA in that position instead.

**[Working with Tables]**

The table() function counts the occurrences of each distinct value:

**\> table(c(5,12,13,12,8,5))**
**5 8 12 13**
**2 1 2 1**

Here's an example of a three-dimensional table, involving voters' genders, race (white, black, Asian, and other), and political views (liberal or conservative):

![](media/image31.jpg)

R prints out a three-dimensional table as a series of two-dimensional tables. In this case, it generates a table of gender and race for conservatives and then a corresponding table for liberals. For example, the second two-dimensional table says that there were two white male liberals.

**[Other Factor- and Table-Related Functions]**

R includes a number of other functions that are handy for working with tables and factors. We'll discuss two of them here: aggregate() and cut().

**[The aggregate() Function]**

The aggregate() function in R splits the data into subsets, computes summary statistics for each subset, and returns the result in a group-by form. It is similar to [group by in SQL](https://www.datasciencemadesimple.com/sql-group-by/). The aggregate() function is useful for all the common aggregate operations, such as sum, count, mean, minimum, and maximum.
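As a rough sketch of these group-wise operations, the following uses a small made-up data frame and the formula interface of aggregate(). The names sales, region, and amount are hypothetical and are not part of the course data.

```r
# Hypothetical data, made up for illustration.
sales <- data.frame(
  region = c("East", "West", "East", "West", "East"),
  amount = c(10, 20, 30, 40, 50)
)

# amount ~ region reads as "summarize amount within each region".
group_sum  <- aggregate(amount ~ region, data = sales, FUN = sum)
group_mean <- aggregate(amount ~ region, data = sales, FUN = mean)
group_max  <- aggregate(amount ~ region, data = sales, FUN = max)
group_min  <- aggregate(amount ~ region, data = sales, FUN = min)
group_n    <- aggregate(amount ~ region, data = sales, FUN = length)

print(group_sum)
print(group_n)
```

Each result is itself a data frame with one row per group, so it can be filtered or merged like any other data frame.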
Let's see examples of the following:

- aggregate() computing a group sum
- calculating the group maximum and minimum using aggregate()
- aggregate() computing a group mean
- getting group counts using aggregate()

The first argument, aba\[,-1\], is the entire data frame except for the first column, which is Gender itself. The second argument, which must be a list, is our Gender factor as before. Finally, the third argument tells R to compute the median on each column in each of the data frames generated by the subgrouping corresponding to our factors. There are three such subgroups in our example here and thus three rows in the output of aggregate().

**[The cut() Function]**

A common way to generate factors, especially for tables, is the cut() function. You give it a data vector x and a set of bins defined by a vector b. The function then determines which bin each of the elements of x falls into. The following is the form of the call we'll use here:

Sometimes it is useful to categorize the values of a continuous variable into the levels of a factor. For that purpose, you can use the R cut() function. In the following block of code we show the syntax of the function and a simplified description of the arguments.

**y \<- ...**

**\> body(g)**
**{**
**return(x + 1)**
**}**

Recall that when using R in interactive mode, simply typing the name of an object results in printing that object to the screen. Functions are no exception, since they are objects just like anything else.

**\> g**
**function(x) {**
**return(x+1)**
**}**

This is handy if you're using a function that you wrote but whose details you've forgotten. Printing out a function is also useful if you are not quite sure what an R library function does. By looking at the code, you may understand it better.

**[Scope of Variable in R]**

In R, variables are the containers for storing data values.
They are references, or pointers, to an object in memory, which means that whenever a variable is assigned to an instance, it gets mapped to that instance. A variable in R can store a vector, a group of vectors, or a combination of many R objects.

**[Example]**

**\# R program to demonstrate**
**\# variable assignment**
**\# Assignment using equal operator**
**var1 = c(0, 1, 2, 3)**
**print(var1)**
**\# Assignment using leftward operator**
**var2 \<- ...**

58. **print.check\_Rd\_line\_widths\***
59. **print.check\_Rd\_metadata\***
60. **print.check\_Rd\_xrefs\***
61. **print.check\_RegSym\_calls\***
62. **print.check\_T\_and\_F\***
63. **print.check\_code\_usage\_in\_package\***
64. **print.check\_compiled\_code\***
65. **print.check\_demo\_index\***
66. **print.check\_depdef\***
67. **print.check\_details\***
68. **print.check\_details\_changes\***
69. **print.check\_doi\_db\***
70. **print.check\_dotInternal\***
71. **print.check\_make\_vars\***
72. **print.check\_nonAPI\_calls\***
73. **print.check\_package\_CRAN\_incoming\***
74. **print.check\_package\_code\_assign\_to\_globalenv\***
75. **print.check\_package\_code\_attach\***
76. **print.check\_package\_code\_data\_into\_globalenv\***
77. **print.check\_package\_code\_startup\_functions\***
78. **print.check\_package\_code\_syntax\***
79. **print.check\_package\_code\_unload\_functions\***
80. **print.check\_package\_compact\_datasets\***
81. **print.check\_package\_datasets\***
82. **print.check\_package\_depends\***
83. **print.check\_package\_description\***
84. **print.check\_package\_description\_encoding\***
85. **print.check\_package\_license\***
86. **print.check\_packages\_in\_dir\***
87. **print.check\_packages\_used\***
88. **print.check\_po\_files\***
89. **print.check\_so\_symbols\***
90. **print.check\_url\_db\***
91. **print.check\_vignette\_index\***
92. **print.citation\***
93. **print.codoc\***
94. **print.codocClasses\***
95. **print.codocData\***
96. **print.colorConverter\***
97. **print.compactPDF\***
98. **print.condition**
99. **print.connection**
100. **print.data.frame**
101. **print.default**
102. **print.dendrogram\***
103. **print.density\***
104. **print.difftime**
105. **print.dist\***
106. **print.dummy\_coef\***
107. **print.dummy\_coef\_list\***
108. **print.ecdf\***
109. **print.eigen**
110. **print.factanal\***
111. **print.factor**
112. **print.family\***
113. **print.fileSnapshot\***
114. **print.findLineNumResult\***
115. **print.formula\***
116. **print.fseq\***
117. **print.ftable\***
118. **print.function**
119. **print.getAnywhere\***
120. **print.glm\***
121. **print.hclust\***
122. **print.help\_files\_with\_topic\***
123. **print.hexmode**
124. **print.hsearch\***
125. **print.hsearch\_db\***
126. **print.htest\***
127. **print.html\***
128. **print.html\_dependency\***
129. **print.htmlwidget\***
130. **print.infl\***
131. **print.integrate\***
132. **print.isoreg\***
133. **print.kmeans\***
134. **print.libraryIQR**
135. **print.listof**
136. **print.lm\***
137. **print.loadings\***
138. **print.loess\***
139. **print.logLik\***
140. **print.ls\_str\***
141. **print.medpolish\***
142. **print.mtable\***
143. **print.news\_db\***
144. **print.nls\***
145. **print.noquote**
146. **print.numeric\_version**
147. **print.object\_size\***
148. **print.octmode**
149. **print.packageDescription\***
150. **print.packageIQR\***
151. **print.packageInfo**
152. **print.packageStatus\***
153. **print.pairwise.htest\***
154. **print.pdf\_doc\***
155. **print.pdf\_fonts\***
156. **print.pdf\_info\***
157. **print.person\***
158. **print.power.htest\***
159. **print.ppr\***
160. **print.prcomp\***
161. **print.princomp\***
162. **print.proc\_time**
163. **print.raster\***
164. **print.recordedplot\***
165. **print.restart**
166. **print.rle**
167. **print.roman\***
168. **print.sessionInfo\***
169. **print.shiny.tag\***
170. **print.shiny.tag.list\***
171. **print.simple.list**
172. **print.smooth.spline\***
173. **print.socket\***
174. **print.srcfile**
175. **print.srcref**
176. **print.stepfun\***
177. **print.stl\***
178. **print.subdir\_tests\***
179. **print.summarize\_CRAN\_check\_status\***
180. **print.summary.aov\***
181. **print.summary.aovlist\***
182. **print.summary.ecdf\***
183. **print.summary.glm\***
184. **print.summary.lm\***
185. **print.summary.loess\***
186. **print.summary.manova\***
187. **print.summary.nls\***
188. **print.summary.packageStatus\***
189. **print.summary.ppr\***
190. **print.summary.prcomp\***
191. **print.summary.princomp\***
192. **print.summary.table**
193. **print.summaryDefault**
194. **print.suppress\_viewer\***
195. **print.table**
196. **print.tables\_aov\***
197. **print.terms\***
198. **print.ts\***
199. **print.tskernel\***
200. **print.tukeyline\***
201. **print.tukeysmooth\***
202. **print.undoc\***
203. **print.vignette\***
204. **print.warnings**
205. **print.xgettext\***
206. **print.xngettext\***
207. **print.xtabs\***

The long list above includes important methods such as print.factor(). When we print a factor with print(), the call is automatically dispatched to print.factor(). An object of the class we created, bank, would look for a method named print.bank(); since no such method exists, print.default() is used. Generic functions have a default method that is used when no match is available.

**Creating your own method**

Creating your own method is possible. Once print.bank() has been defined, printing an object of class 'bank' will find this method and use it.

**[Example]**

**x \<- ...**
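A minimal sketch of defining such a method follows. The object acct and its fields name and balance are hypothetical, chosen only for illustration, and are not taken from the course's own example.

```r
# Hypothetical object of class "bank"; the fields are invented for illustration
acct <- structure(list(name = "Sample Customer", balance = 2500),
                  class = "bank")

# A print method for class "bank". Because print() is generic, print(acct)
# dispatches to this function instead of falling back to print.default().
print.bank <- function(x, ...) {
  cat("Account holder:", x$name, "\n")
  cat("Balance:", x$balance, "\n")
  invisible(x)  # print methods conventionally return their argument invisibly
}

print(acct)

# Capture what the method writes, to confirm that dispatch reached print.bank()
out <- capture.output(print(acct))
```

Removing print.bank() again (for example with rm(print.bank)) would make print(acct) fall back to print.default(), which shows the raw list and its class attribute.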
