week03 word.docx
La Trobe University
Week 3 Data Wrangling & R programming

CSE5DEV Syllabus Week-Overview Data Wrangling Basics of R Programming

Subject Syllabus

Learning outcomes:
- Learn about data representation.
- Learn how to convert data from one format to another.
- Learn R programming conditional statements.
- Learn how to use R programming packages.

Data can be in different formats, but a computer program expects your data to be organised in a well-defined structure.

What have we learned so far?
—— Theory ——
Data Collection: working with data
- Data sources: PC, internet, external.
- Data formats: text, CSV, URL, etc.
- Data values: qualitative or quantitative.
- Data categories: experimental or observational.

What have we learned so far?
—— R Programming ——
- Install R and RStudio, create an R Markdown file, write and run basic code, etc.
- Data types and data structures (vector, factor, matrix and data frame).
- View, access, change, etc.
- Import data into the R environment (text files and CSV files).

Example: Consider the country population dataset (data1.csv). The same data can be organised in different representations, as shown in the next slides.

Example: format-1. Example: format-2. Example: format-3. Example: format-4. Example: format-5.

From the previous examples, we have seen that:
- The same data can be organised in different representations or formats.
- Each format shows the same values of four variables: country, year, population and cases.
- Different formats show the values in different representations.

Q: What type of representation will be used in CSE5DEV labs?
A: Tabular representation (observations-by-features).
Figure: Image from R for Data Science

Tabular representation
In CSE5DEV, we use the data frame data structure.
Organising data as observations-by-features is considered the most convenient and standard representation for data analysis.

Tabular data
Types of features/attributes: It is important to recognise the types of values each feature/attribute takes in order to understand which operations make sense for it. This is similar to problems like 6 apples / 4 people = 1.5 apples per person, but 10 people / 4 car seats = 3 cars.

Qualitative vs. quantitative attributes: Attribute values can be split into two types: qualitative and quantitative.

Qualitative: nominal vs. ordinal: Qualitative attributes can be split further into two types: nominal and ordinal. Binary attributes are nominal attributes with only two values (Yes/No or 0/1). They can be symmetric or asymmetric, based on whether or not their two values are equally informative.

Quantitative: interval vs. ratio: Quantitative attributes can also be split into two types: interval and ratio. We can also split quantitative attributes into discrete and continuous ones. All qualitative attributes are considered discrete.

Summary of attribute types: The types of attributes can be regarded via the operations that can be applied to them:
- Comparison (= and ≠): every type
- Ordering (> and <): every type except nominal
- Differences (-) and addition (+): only quantitative
- Division (/) and multiplication (×, ·): only ratio
Other operations (e.g., mean, median, correlation) may also be inapplicable for some types while applicable to others.
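The link between attribute type and permitted operations can be sketched in base R. This is only an illustration: the variable names and values below are hypothetical, and mapping nominal attributes to unordered factors and ordinal attributes to ordered factors is one common convention, not something the slides prescribe.

```r
# Nominal attribute: unordered factor; only == and != are meaningful.
colour <- factor(c("blue", "black", "blue"))
same_colour <- colour[1] == colour[3]

# Ordinal attribute: ordered factor; < and > become meaningful too.
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
size_order <- size[1] < size[2]

# Ratio attribute: numeric; differences and division are meaningful.
weight_kg <- c(2, 4, 8)
weight_ratio <- weight_kg[3] / weight_kg[1]

# Ordering a nominal factor is not meaningful: R warns and returns NA.
nominal_order <- suppressWarnings(colour[1] < colour[2])

print(same_colour)
print(size_order)
print(weight_ratio)
print(nominal_order)
```

Storing a qualitative column as a factor lets R itself reject operations that do not make sense for that attribute type.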
Tabular data
Technical formats: Tabular data can be stored or collected in several standard formats, such as:
- Comma-separated values file (CSV)
- Flat file or delimited text file (e.g., space or tab delimited)
- XML or other log files
- Proprietary formats (e.g., FCS for biological data or MAT files for Matlab data)
- Database tables
Non-tabular data: Transactional data (term matrix, text documents), structured signals, multidimensional signals, nonparametric representations.

Tabular representation
In a tabular representation, we need to make sure that:
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
Figure: Image from R for Data Science

If the data is not in a tabular representation, then we need to perform a couple of processes to convert it into a tabular representation. Examples of the processes are:
- Gathering and spreading.
- Separating and uniting.
- Filtering.
- Grouping.
- Mutating.

Example: Gathering process - gather columns into a new pair of variables.
Figure: Image from R for Data Science
gather(data, key, value, ...)
- data is the data frame you are working with.
- key is the name of the key column to create.
- value is the name of the value column to create.
- ... is a way to specify which columns to gather.

Example: Spreading process - spreading is the opposite of gathering.
Figure: Image from R for Data Science
spread(data, key, value)
- data is your data of interest.
- key is the column whose values will become variable names.
- value is the column whose values will fill in under the new variables created from key.
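A minimal sketch of gathering and spreading, assuming the tidyr package is installed. The data frame `wide` and its values are made up for illustration and are not the lecture's data1.csv.

```r
library(tidyr)

# Hypothetical wide table: one column per year (illustrative values only).
wide <- data.frame(country = c("A", "B"),
                   `1999` = c(10, 30),
                   `2000` = c(20, 40),
                   check.names = FALSE)

# gather(): collect the year columns into a key column ("year")
# and a value column ("cases").
long <- gather(wide, key = "year", value = "cases", `1999`, `2000`)
print(long)

# spread(): the inverse step; the values of "year" become column names again.
wide_again <- spread(long, key = year, value = cases)
print(wide_again)
```

Current tidyr releases mark gather() and spread() as superseded by pivot_longer() and pivot_wider(); the older functions still work and match the signatures shown in the slides.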
Tabular representation
Example: Separating process - pulls apart one column into multiple columns by splitting wherever a separator character appears.
Figure: Image from R for Data Science
separate(data, col, into, sep)
- data is the data frame of interest.
- col is the column that needs to be separated.
- into is a vector of names for the columns the data will be separated into.
- sep is the separator at which the data is split.

Example: Uniting process - the inverse of separate. It combines multiple columns into a single column.
Figure: Image from R for Data Science
unite(data, col, ..., sep)
- data is the data frame of interest.
- col is the name of the new column you wish to add.
- ... is the names of the columns you wish to unite.
- sep is the separator used to join the values from those columns.

Five main verbs
- Select - select variables by their names.
- Filter - choose rows that satisfy some criteria.
- Arrange - reorder the rows.
- Mutate - create transformed or derived variables.
- Summarise - collapse rows down to summaries.

Basics of R Programming
In previous lectures, we have learned:
- How to read data from a file.
- Variables, variable names and data types.
- Data structures: vector, factor, matrix and data frame.
- View, access, change, etc.

    dat <- read.csv("data.csv", header = TRUE, sep = ",")

- names() - shows the names attribute for a data frame.
- head() - shows the first 6 rows.
- tail() - shows the last 6 rows.
- dim() - returns the dimensions of the data frame.
- nrow() - number of rows.
- ncol() - number of columns.
- str() - structure of the data frame: name, type and preview of the data in each column.
- sapply(dataframe, class) - shows the class of each column in the data frame.

In this lecture, we will learn how to write R code for the following tasks:
- Logical conditions to select subsets
- Conditional execution: if statements
- Repetitive execution: for loops, repeat and while
- Packages
- Format transform

Example: read data from file.

    head(dat)
    ##               Model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
    ## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    ## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    ## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    ## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    ## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    ## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We may need to extract data that satisfy certain criteria. For example, we may want to select the rows whose disp value is less than or equal to 160. We can use logical condition operators to select a subset of the data.

Conditional operators
Conditional operators are used to compare values or expressions. They return TRUE (1) or FALSE (0).

Conditional operators
Examples: Conditional operators for two variables: x and y.
    x <- 4
    y <- 15
    x < y
    ## [1] TRUE
    x > y
    ## [1] FALSE
    x <= 5
    ## [1] TRUE
    y >= 20
    ## [1] FALSE
    y == 16
    ## [1] FALSE
    x != 5
    ## [1] TRUE

Conditional operators
Examples: Conditional operators for a vector x

    x <- c(3, 5, 1, 2, 7, 6, 4)
    x < 5   # is x less than 5
    ## [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE
    x <= 5  # is x less than or equal to 5
    ## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
    x > 3   # is x greater than 3
    ## [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
    x >= 3  # is x greater than or equal to 3
    ## [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
    x == 2  # is x equal to 2
    ## [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
    x != 2  # is x not equal to 2
    ## [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

— Conditional operators —
Useful functions: all, any and which
The all and any functions check whether all entries, or at least one entry, of a logical vector are TRUE, respectively. The which function returns the indices of the entries that are TRUE.

Logical Operators
Logical operators can be used to combine two or more conditions. In this subject, we will only use the element-wise operators: !, & and |. All operators compare vectors element by element and then return TRUE (1) or FALSE (0).

Logical Operators
Examples: Logical operators for a vector x

Logical Operators
Consider the following example:

    x <- c(5, 3, 7, 9, 10)

We want to extract the values of the vector x which are greater than 5 (7, 9, 10). There are two methods: Method 1 and Method 2.

— Logical Condition Operators —
We may need to extract data that satisfy certain criteria. For example, we may want to select the rows whose disp value is less than or equal to 160. We can use logical condition operators to select a subset of the data.
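As a minimal sketch on a plain vector first, the helper functions and element-wise operators described above behave as follows. Treating the two extraction methods as logical indexing and which() is an assumption; the transcript names Method 1 and Method 2 without showing them.

```r
x <- c(5, 3, 7, 9, 10)

# all(), any() and which() applied to the logical vector x > 5
all_gt5 <- all(x > 5)    # is every entry greater than 5?
any_gt5 <- any(x > 5)    # is at least one entry greater than 5?
idx     <- which(x > 5)  # positions of the TRUE entries

# Element-wise logical operators: & (and), | (or), ! (not)
between <- x > 3 & x < 10
outside <- x < 4 | x > 9
not_big <- !(x > 5)

# Two ways to extract the values greater than 5
# (assumed to correspond to Method 1 and Method 2).
by_logical <- x[x > 5]         # index with a logical vector
by_index   <- x[which(x > 5)]  # index with integer positions

print(idx)
print(by_logical)
```

Both extraction styles give the same values here. Logical indexing is shorter, while which() is useful when the positions themselves are needed.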
    s <- dat[dat$disp <= 160, ]
    print(s)
    ##             Model  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    ## 1       Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
    ## 2   Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    ## 3      Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    ## 8       Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    ## 9        Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
    ## 18       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
    ## 19    Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
    ## 20 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
    ## 21  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    ## 26      Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
    ## 27  Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
    ## 28   Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
    ## 30   Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
    ## 32     Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

— Logical Condition Operators —
We may need to extract data that satisfy certain criteria. For example, we may want to select the rows whose disp value is less than or equal to 160 AND whose hp is less than 110.

    z <- dat[dat$disp <= 160 & dat$hp < 110, ]
    print(z)
    ##             Model  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    ## 3      Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    ## 8       Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    ## 9        Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
    ## 18       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
    ## 19    Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
    ## 20 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
    ## 21  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    ## 26      Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
    ## 27  Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
    ## 32     Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

— Logical Condition Operators —
We may need to extract data that satisfy certain criteria.
For example, we may want to select data where the disp value is less than or equal to 160 AND hp is less than 110, for the wt column.

    z <- dat[dat$disp <= 160 & dat$hp < 110, ]
    print(z)
    ##             Model  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    ## 3      Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    ## 8       Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    ## 9        Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
    ## 18       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
    ## 19    Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
    ## 20 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
    ## 21  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    ## 26      Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
    ## 27  Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
    ## 32     Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

If statements have this syntax:

    if (condition) {
      expressions 1 if true
    } else {
      expressions 2 otherwise
    }

— If Statement —
We can use an if statement without else. We can also test multiple conditions in sequence using else if.

Examples of R repetitive execution functions are:

for loop: iterates over a vector.

    for (variable in vector) {
      commands
    }

repeat: iterates over a block of code until some condition is met.

    repeat {
      expression
      if (condition) {break}
    }

while: evaluates an expression as long as a stated condition is TRUE.

    while (condition) {
      expression
    }

— Example: for loops —

    ##      [,1] [,2]
    ## [1,]   10   13
    ## [2,]   11   14
    ## [3,]   12   15

    nrr <- nrow(a) # n
    for (i in 1:nrr) {

    ## [1] 13
    ## [1] 14
    ## [1] 15

— Example: repeat loop —

— Example: while loop —

    i <- 1
    j <- 1
    mat <- matrix(0, n
    print (mat)
    ##      [,1] [,2]
    ## [1,]    0    0
    ## [2,]    0    0
    ## [3,]    0    0

    ##      [,1] [,2]
    ## [1,]    3    3
    ## [2,]    5    5
    ## [3,]    7    7

Some packages are installed with R and loaded automatically when RStudio starts. Other packages must be installed before we can use them.
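As a small sketch of that distinction, base R can report which packages are attached in the current session and whether a given package is installed. The package name "tidyr" is only an example here; the result depends on the machine the code runs on.

```r
# Packages attached in the current session ("base" is always among them).
attached <- (.packages())
print(attached)

# Check whether a package is installed without attaching it.
# "tidyr" is only an example name; substitute any package.
pkg <- "tidyr"
is_installed <- requireNamespace(pkg, quietly = TRUE)
print(is_installed)
```

If `is_installed` is FALSE, the package has to be installed once before library() can load it.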
To install a package, run the following ONLY ONCE:

    install.packages("Package name")

To use an installed package, we need to load it using the library function as follows:

    library(Package name)

— Data Wrangling —
Example: Package for the five main verbs
- Select - select variables by their names.
- Filter - choose rows that satisfy some criteria.
- Arrange - reorder the rows.
- Mutate - create transformed or derived variables.
- Summarise - collapse rows down to summaries.

The above processes can be used only if the "tidyr" and/or "dplyr" package has been installed and loaded into R (the five verbs are provided by dplyr; gather, spread, separate and unite are provided by tidyr).

To install a package in R run:

    install.packages("tidyr")

To load a package into R run:

    library(tidyr)

— Data Wrangling —
Step 1: Create a data frame:

    df <- data.frame(color = c("blue", "black", "blue", "blue", "black"), value = 1:5)

Step 2: Perform the following functions: filter(), arrange(), select(), mutate().

End of Week 3
See you next lecture (Week 4): Data Cleaning & Normalisation
Table: CSE5DEV Timetable
Check LMS
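The Step 2 functions from the Data Wrangling example can be sketched as follows, assuming the dplyr package is installed. The particular conditions and the new column name `double` are illustrative choices, not taken from the slides; summarise() with group_by() is added to cover the fifth verb.

```r
library(dplyr)

# Step 1: the data frame from the slides.
df <- data.frame(color = c("blue", "black", "blue", "blue", "black"),
                 value = 1:5)

# Step 2: one call per verb (conditions chosen for illustration).
blue_rows  <- filter(df, color == "blue")     # keep rows where color is "blue"
sorted     <- arrange(df, desc(value))        # reorder rows by decreasing value
only_color <- select(df, color)               # keep only the color column
doubled    <- mutate(df, double = 2 * value)  # add a derived column
totals     <- summarise(group_by(df, color),  # one summary row per color
                        total = sum(value))

print(blue_rows)
print(totals)
```

Each verb takes a data frame as its first argument and returns a new data frame, so the original `df` is left unchanged.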