Full Transcript

Data Manipulation (part 1) Subsetting, Sorting, and Converting Ryan Yue Zicklin School of Business Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 1 /...

Data Manipulation (part 1) Subsetting, Sorting, and Converting Ryan Yue Zicklin School of Business Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 1 / 48 Data Subsetting We can subset columns (variables) and rows (cases) of a data.frame the same way we did for a matrix. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 2 / 48 Data Subsetting We can subset columns (variables) and rows (cases) of a data.frame the same way we did for a matrix. Example 1: Create a subset with the 1st, the 30th, and the 50th observations and the first 3 variables in the iris data: data("iris") sub1 = iris[c(1, 30, 50), 1:3] sub1 ## Sepal.Length Sepal.Width Petal.Length ## 1 5.1 3.5 1.4 ## 30 4.7 3.2 1.6 ## 50 5.0 3.3 1.4 Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 2 / 48 Data Subsetting Example 2: Create a subset that includes all but the 1st, the 30th, and the 50th observations: sub2 = iris[-c(1, 30, 50),] head(sub2) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 3 / 48 Data Subsetting We can extract variables using $ followed by their names. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 4 / 48 Data Subsetting We can extract variables using $ followed by their names. Example 3: Create a subset including Sepal.Length and Species variables: sub3 = data.frame(iris$Sepal.Length, iris$Species) head(sub3) ## iris.Sepal.Length iris.Species ## 1 5.1 setosa ## 2 4.9 setosa ## 3 4.7 setosa ## 4 4.6 setosa ## 5 5.0 setosa ## 6 5.4 setosa Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 4 / 48 Subsetting by Conditions We often need the subsets of rows satisfying certain conditions. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 5 / 48 Subsetting by Conditions We often need the subsets of rows satisfying certain conditions. Example 1: Create a subset of irises with Setosa species: cond1 = (iris$Species == 'setosa') str(cond1) ## logi [1:150] TRUE TRUE TRUE TRUE TRUE TRUE... sub4 = iris[cond1,] head(sub4) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 5 / 48 Subsetting by Conditions Example 2: Create a subset of the irises with Sepal.Length greater than 5: cond2 = (iris$Sepal.Length > 5) sub5 = iris[cond2,] head(sub5) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 11 5.4 3.7 1.5 0.2 setosa ## 15 5.8 4.0 1.2 0.2 setosa ## 16 5.7 4.4 1.5 0.4 setosa ## 17 5.4 3.9 1.3 0.4 setosa Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 6 / 48 Subsetting by Conditions Example 3: Create a subset of the irises with non-Setosa species: cond3 = (iris$Species != 'setosa') sub6 = iris[cond3,] head(sub6) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 51 7.0 3.2 4.7 1.4 versicolor ## 52 6.4 3.2 4.5 1.5 versicolor ## 53 6.9 3.1 4.9 1.5 versicolor ## 54 5.5 2.3 4.0 1.3 versicolor ## 55 6.5 2.8 4.6 1.5 versicolor ## 56 5.7 2.8 4.5 1.3 versicolor Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 7 / 48 Subsetting by Logical Conditions Logical conditions can be combined with “and” and “or” operators, which in R are “&” and “|”, respectively. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 8 / 48 Subsetting by Logical Conditions Logical conditions can be combined with “and” and “or” operators, which in R are “&” and “|”, respectively. Example 1: Create a subset of the irises with Setosa species and Sepal.Length greater than 5: cond4 = (iris$Species == 'setosa')&(iris$Sepal.Length > 5) sub7 = iris[cond4,] head(sub7) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 11 5.4 3.7 1.5 0.2 setosa ## 15 5.8 4.0 1.2 0.2 setosa ## 16 5.7 4.4 1.5 0.4 setosa ## 17 5.4 3.9 1.3 0.4 setosa Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 8 / 48 Subsetting by Logical Conditions Example 2: Create a subset of the irises that are not Setosas or whose Sepal.Width are less than equal to 4: cond5 = (iris$Species != 'setosa')|(iris$Sepal.Width % select(1,2,5) str(sub1) ## 'data.frame': 150 obs. of 3 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1. ## $ Species : Factor w/ 3 levels "setosa","versicolor",..: Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 12 / 48 Select by Locations (dplyr) Use select() to choose variables (columns) by locations. Example: Create a subset that contains the 1st, 2nd, and 5th variables: sub1 = iris %>% select(1,2,5) str(sub1) ## 'data.frame': 150 obs. of 3 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1. ## $ Species : Factor w/ 3 levels "setosa","versicolor",..: The %>%, called a “pipe”, is used to separate iris and select. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 12 / 48 Select by Names (dplyr) Use select() to choose variables (columns) by names. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 13 / 48 Select by Names (dplyr) Use select() to choose variables (columns) by names. Example: Create a subset of sepal length, sepal width and species. sub2 = iris %>% select(Sepal.Length, Sepal.Width, Species) head(sub2) ## Sepal.Length Sepal.Width Species ## 1 5.1 3.5 setosa ## 2 4.9 3.0 setosa ## 3 4.7 3.2 setosa ## 4 4.6 3.1 setosa ## 5 5.0 3.6 setosa ## 6 5.4 3.9 setosa Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 13 / 48 Excluding Variables (dplyr) Example 1: Exclude the 1st, 2nd, and 5th variables: sub3 = iris %>% select(-1, -2, -5) str(sub3) ## 'data.frame': 150 obs. of 2 variables: ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1... Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 14 / 48 Excluding Variables (dplyr) Example 1: Exclude the 1st, 2nd, and 5th variables: sub3 = iris %>% select(-1, -2, -5) str(sub3) ## 'data.frame': 150 obs. of 2 variables: ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1... Example 2: Exclude Sepal.Length, Sepal.Width, and Species: sub4 = iris %>% select(-Sepal.Length, -Sepal.Width, -Species) str(sub4) ## 'data.frame': 150 obs. of 2 variables: ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1... Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 14 / 48 Subsetting Rows (dplyr) Use filter() to subset rows with conditions. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 15 / 48 Subsetting Rows (dplyr) Use filter() to subset rows with conditions. Example 1: Create a subset of the irises with Setosas species and their Sepal.Length greater than 5: sub5 = iris %>% filter((Species == 'setosa') & (Sepal.Length > 5)) Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 15 / 48 Subsetting Rows (dplyr) Use filter() to subset rows with conditions. Example 1: Create a subset of the irises with Setosas species and their Sepal.Length greater than 5: sub5 = iris %>% filter((Species == 'setosa') & (Sepal.Length > 5)) Example 2: Create a subset of the irises whose Species are NOT Setosa or whose Sepal.Width are less than equal to 4: sub6 = iris %>% filter((Species != 'setosa') | (Sepal.Width % filter(Species == 'setosa') %>% select(-Species) Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 16 / 48 Combine filter and select (dplyr) Create a subset that only contains Setosas and excludes Species sub7 = iris %>% filter(Species == 'setosa') %>% select(-Species) The order in which filter and select appear matters: iris %>% select(-Species) %>% filter(Species == 'setosa') ## Error in `filter()`: ## i In argument: `Species == "setosa"`. ## Caused by error: ## ! object 'Species' not found Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 16 / 48 Exercise 2 (dplyr) 1 Create a subset of iris that contains the last two variables. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 17 / 48 Exercise 2 (dplyr) 1 Create a subset of iris that contains the last two variables. iris %>% select(dim(iris)-1, dim(iris)) Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 17 / 48 Exercise 2 (dplyr) 1 Create a subset of iris that contains the last two variables. iris %>% select(dim(iris)-1, dim(iris)) 2 Create a subset of iris that contains those with petal length greater than 6. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 17 / 48 Exercise 2 (dplyr) 1 Create a subset of iris that contains the last two variables. iris %>% select(dim(iris)-1, dim(iris)) 2 Create a subset of iris that contains those with petal length greater than 6. iris %>% filter(Petal.Length > 6) Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 17 / 48 Exercise 2 (dplyr) 1 Create a subset of iris that contains the last two variables. iris %>% select(dim(iris)-1, dim(iris)) 2 Create a subset of iris that contains those with petal length greater than 6. iris %>% filter(Petal.Length > 6) 3 Create a subset of iris that contains the two sepal variables and the irises with sepal width greater than 4. Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 17 / 48 Exercise 2 (dplyr) 1 Create a subset of iris that contains the last two variables. iris %>% select(dim(iris)-1, dim(iris)) 2 Create a subset of iris that contains those with petal length greater than 6. iris %>% filter(Petal.Length > 6) 3 Create a subset of iris that contains the two sepal variables and the irises with sepal width greater than 4. iris %>% select(Sepal.Length, Sepal.Width) %>% filter(Sepal.Width > 4) # one solution iris %>% filter(Sepal.Width > 4) %>% select(1:2) # another solution Ryan Yue (Zicklin School of Business) Data Manipulation (part 1) 17 / 48

Use Quizgecko on...
Browser
Browser