Lec5.pdf
Document Details
Uploaded by Deleted User
Full Transcript
Data Manipulation (part 1) Subsetting, Sorting, and Converting Ryan Yue Zicklin School of Business Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 1 /...
Data Manipulation (part 1) Subsetting, Sorting, and Converting Ryan Yue Zicklin School of Business Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 1 / 48 Data Subsetting We can subset columns (variables) and rows (cases) of a data.frame the same way we did for a matrix. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 2 / 48 Data Subsetting We can subset columns (variables) and rows (cases) of a data.frame the same way we did for a matrix. Example 1: Create a subset with the 1st, the 30th, and the 50th observations and the first 3 variables in the iris data: data("iris") sub1 = iris[c(1, 30, 50), 1:3] sub1 ## Sepal.Length Sepal.Width Petal.Length ## 1 5.1 3.5 1.4 ## 30 4.7 3.2 1.6 ## 50 5.0 3.3 1.4 Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 2 / 48 Data Subsetting Example 2: Create a subset that includes all but the 1st, the 30th, and the 50th observations: sub2 = iris[-c(1, 30, 50),] head(sub2) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 3 / 48 Data Subsetting We can extract variables using $ followed by their names. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 4 / 48 Data Subsetting We can extract variables using $ followed by their names. Example 3: Create a subset including Sepal.Length and Species variables: sub3 = data.frame(iris$Sepal.Length, iris$Species) head(sub3) ## iris.Sepal.Length iris.Species ## 1 5.1 setosa ## 2 4.9 setosa ## 3 4.7 setosa ## 4 4.6 setosa ## 5 5.0 setosa ## 6 5.4 setosa Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 4 / 48 Subsetting by Conditions We often need the subsets of rows satisfying certain conditions. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 5 / 48 Subsetting by Conditions We often need the subsets of rows satisfying certain conditions. Example 1: Create a subset of irises with Setosa species: cond1 = (iris$Species == 'setosa') str(cond1) ## logi [1:150] TRUE TRUE TRUE TRUE TRUE TRUE... sub4 = iris[cond1,] head(sub4) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 5 / 48 Subsetting by Conditions Example 2: Create a subset of the irises with Sepal.Length greater than 5: cond2 = (iris$Sepal.Length > 5) sub5 = iris[cond2,] head(sub5) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 11 5.4 3.7 1.5 0.2 setosa ## 15 5.8 4.0 1.2 0.2 setosa ## 16 5.7 4.4 1.5 0.4 setosa ## 17 5.4 3.9 1.3 0.4 setosa Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 6 / 48 Subsetting by Conditions Example 3: Create a subset of the irises with non-Setosa species: cond3 = (iris$Species != 'setosa') sub6 = iris[cond3,] head(sub6) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 51 7.0 3.2 4.7 1.4 versicolor ## 52 6.4 3.2 4.5 1.5 versicolor ## 53 6.9 3.1 4.9 1.5 versicolor ## 54 5.5 2.3 4.0 1.3 versicolor ## 55 6.5 2.8 4.6 1.5 versicolor ## 56 5.7 2.8 4.5 1.3 versicolor Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 7 / 48 Subsetting by Logical Conditions Logical conditions can be combined with “and” and “or” operators, which in R are “&” and “|”, respectively. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 8 / 48 Subsetting by Logical Conditions Logical conditions can be combined with “and” and “or” operators, which in R are “&” and “|”, respectively. Example 1: Create a subset of the irises with Setosa species and Sepal.Length greater than 5: cond4 = (iris$Species == 'setosa')&(iris$Sepal.Length > 5) sub7 = iris[cond4,] head(sub7) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 11 5.4 3.7 1.5 0.2 setosa ## 15 5.8 4.0 1.2 0.2 setosa ## 16 5.7 4.4 1.5 0.4 setosa ## 17 5.4 3.9 1.3 0.4 setosa Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 8 / 48 Subsetting by Logical Conditions Example 2: Create a subset of the irises that are not Setosas or whose Sepal.Width are less than equal to 4: cond5 = (iris$Species != 'setosa')|(iris$Sepal.Width % select(1,2,5) str(sub1) ## 'data.frame': 150 obs. of 3 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1. ## $ Species : Factor w/ 3 levels "setosa","versicolor",..: Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 12 / 48 Select by Locations (dplyr) Use select() to choose variables (columns) by locations. Example: Create a subset that contains the 1st, 2nd, and 5th variables: sub1 = iris %>% select(1,2,5) str(sub1) ## 'data.frame': 150 obs. of 3 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1. ## $ Species : Factor w/ 3 levels "setosa","versicolor",..: The %>%, called a “pipe”, is used to separate iris and select. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 12 / 48 Select by Names (dplyr) Use select() to choose variables (columns) by names. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 13 / 48 Select by Names (dplyr) Use select() to choose variables (columns) by names. Example: Create a subset of sepal length, sepal width and species. sub2 = iris %>% select(Sepal.Length, Sepal.Width, Species) head(sub2) ## Sepal.Length Sepal.Width Species ## 1 5.1 3.5 setosa ## 2 4.9 3.0 setosa ## 3 4.7 3.2 setosa ## 4 4.6 3.1 setosa ## 5 5.0 3.6 setosa ## 6 5.4 3.9 setosa Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 13 / 48 Excluding Variables (dplyr) Example 1: Exclude the 1st, 2nd, and 5th variables: sub3 = iris %>% select(-1, -2, -5) str(sub3) ## 'data.frame': 150 obs. of 2 variables: ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1... Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 14 / 48 Excluding Variables (dplyr) Example 1: Exclude the 1st, 2nd, and 5th variables: sub3 = iris %>% select(-1, -2, -5) str(sub3) ## 'data.frame': 150 obs. of 2 variables: ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1... Example 2: Exclude Sepal.Length, Sepal.Width, and Species: sub4 = iris %>% select(-Sepal.Length, -Sepal.Width, -Species) str(sub4) ## 'data.frame': 150 obs. of 2 variables: ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1... Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 14 / 48 Subsetting Rows (dplyr) Use filter() to subset rows with conditions. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 15 / 48 Subsetting Rows (dplyr) Use filter() to subset rows with conditions. Example 1: Create a subset of the irises with Setosas species and their Sepal.Length greater than 5: sub5 = iris %>% filter((Species == 'setosa') & (Sepal.Length > 5)) Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 15 / 48 Subsetting Rows (dplyr) Use filter() to subset rows with conditions. Example 1: Create a subset of the irises with Setosas species and their Sepal.Length greater than 5: sub5 = iris %>% filter((Species == 'setosa') & (Sepal.Length > 5)) Example 2: Create a subset of the irises whose Species are NOT Setosa or whose Sepal.Width are less than equal to 4: sub6 = iris %>% filter((Species != 'setosa') | (Sepal.Width % filter(Species == 'setosa') %>% select(-Species) Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 16 / 48 Combine filter and select (dplyr) Create a subset that only contains Setosas and excludes Species sub7 = iris %>% filter(Species == 'setosa') %>% select(-Species) The order in which filter and select appear matters: iris %>% select(-Species) %>% filter(Species == 'setosa') ## Error in `filter()`: ## i In argument: `Species == "setosa"`. ## Caused by error: ## ! object 'Species' not found Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 16 / 48 Exercise 2 (dplyr) 1 Create a subset of iris that contains the last two variables. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 17 / 48 Exercise 2 (dplyr) 1 Create a subset of iris that contains the last two variables. iris %>% select(dim(iris)-1, dim(iris)) Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 17 / 48 Exercise 2 (dplyr) 1 Create a subset of iris that contains the last two variables. iris %>% select(dim(iris)-1, dim(iris)) 2 Create a subset of iris that contains those with petal length greater than 6. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 17 / 48 Exercise 2 (dplyr) 1 Create a subset of iris that contains the last two variables. iris %>% select(dim(iris)-1, dim(iris)) 2 Create a subset of iris that contains those with petal length greater than 6. iris %>% filter(Petal.Length > 6) Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 17 / 48 Exercise 2 (dplyr) 1 Create a subset of iris that contains the last two variables. iris %>% select(dim(iris)-1, dim(iris)) 2 Create a subset of iris that contains those with petal length greater than 6. iris %>% filter(Petal.Length > 6) 3 Create a subset of iris that contains the two sepal variables and the irises with sepal width greater than 4. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 17 / 48 Exercise 2 (dplyr) 1 Create a subset of iris that contains the last two variables. iris %>% select(dim(iris)-1, dim(iris)) 2 Create a subset of iris that contains those with petal length greater than 6. iris %>% filter(Petal.Length > 6) 3 Create a subset of iris that contains the two sepal variables and the irises with sepal width greater than 4. iris %>% select(Sepal.Length, Sepal.Width) %>% filter(Sepal.Width > 4) # one solution iris %>% filter(Sepal.Width > 4) %>% select(1:2) # another solution Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 17 / 48