week02_merged.docx

Uploaded by GenerousChrysoprase
La Trobe University
Section 1: CSE5DEV Syllabus
Section 2: Data Collection
Section 3: Basics of R Programming

Subject Syllabus

What have we learned so far?
Install R and RStudio.
Create an R Markdown file.
Add a chunk of code.
Write and run basic code.

Learning outcomes:
Learn about the sources of data.
Learn about data types.
Learn how to import data into R Markdown.
Learn about R programming.

Data can help us in:
learning more about customers, items, products, etc.
discovering trends in the current system, organisation, etc.
segmenting elements into different groups based on their individual needs.
the decision-making process, to improve the quality of the system.
improving the quality of the product or service based on the feedback obtained.

Data sources: Data can be obtained from various sources such as:
Data format: Data can be stored in different formats such as:

What is data?
Qualitative data: descriptive information (describes something). Example: "The trip was great".
Quantitative data: numerical information (numbers). It can be discrete (e.g., 10) or continuous (e.g., 3.3).

Data values can be:
Numeric:
  Discrete - integer values. Example: number of cars in the car park.
  Continuous - any value in a pre-defined range (float, double). Example: average mark (e.g., 63.4).
Categorical: values are selected from a predefined number of categories.
  Ordinal - categories can be meaningfully ordered. Example: grades (A, B, C, D, E, F).
  Nominal - categories have no order. Example: eye colours (blue, black, honey, etc.).
  Binary - a special case of nominal with only 2 possible categories. Example: binary value (1, 0).
Date: datetime, timestamp. Example: 11.10.2018.
Text
Multidimensional data
Time series: data points indexed in time order.

Data category: data can be one of two main categories: experimental or observational.

Data type: data can be
Numbers
Strings
Relational data
Factors or categorical variables
Dates and times

Description
We can read data from various sources or files. Files can be in any format, such as:
name.CSV
name.DAT
name.TXT
name.XLS
name.HTML
name.json

When we get new data, we often ask:
What is in it?
What is wrong with it?
What should I do with it?

Answer:
Step 1. Import the data into your code.
Step 2. Organise the data in a readable format.
Step 3. ...
...
Step n.

This lecture covers Step 1.
Step 1. Import the data into the R environment.
Reading data: write R code to import data into the RStudio environment.
Viewing data: explore, access and print.

How do RStudio and R work?
Note for CSE5DEV students: you ONLY need to write and run your code in the RStudio interface.
RStudio Interface

In this lecture, we will learn how to write R code for the following tasks:
Import data: reading data from a file.
View data.
Access data.
Check data types.
Export data.

R uses various functions to import data from the working directory into the R environment. We can import data from different formats, such as:
Text files: txt files.
Comma-separated values: CSV files.
Excel files: xls or xlsx files.
Web sites: URLs.
SPSS files.
... etc.

R reading function syntax:
Object name: a variable that can hold different values.
R read function: used to read data from a file based on the file extension.
file_name.ext: the name of the file to read, including its extension and location.
Arguments: options that control how the file is read.

Examples of R reading functions:
read.table for text files
read.csv for CSV files

— Read data from text files —
Example: read data from a text file called Mytext.txt and assign the data to the dat object (or variable).
The read.table function reads the file and saves its contents in the object.
header=TRUE: indicates that the first row in the file holds the header information (column names). In read.table the header argument defaults to FALSE, so set header=TRUE when the file has a header row; in read.csv it defaults to TRUE. If your file does not have a header, use header=FALSE.
sep=" ": indicates that the columns are separated by a space. Other separators, such as a tab ("\t") or a comma (","), can be given instead.
dec=".": the character used in the file for decimal points is a dot.

— Read data from CSV files —
Example: read data from a CSV file called data.csv and assign the data to the dat object (or variable).
read.csv: reads the data from "data.csv", which includes a header row and is separated by commas (,).
By default dat will be a data frame.

We can use the following functions to view/check the data in dat:
names() - shows the names attribute of a data frame, which gives the column names.
head() - shows the first 6 rows.
tail() - shows the last 6 rows.
dim() - returns the dimensions of the data frame (number of rows and number of columns).
nrow() - number of rows.
ncol() - number of columns.
str() - structure of the data frame: name, type and preview of the data in each column.
sapply(dataframe, class) - shows the class of each column in the data frame.

Example of functions for viewing/checking data.
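As a self-contained sketch of the import and inspection steps described above, the following writes a tiny CSV file to a temporary folder and reads it back. The file name `data_sketch.csv` and the object names `dat_csv` and `dat_tab` are assumptions made for illustration; the three rows are copied from the car data shown in this lecture.

```r
# Write a small CSV file to a temporary folder so the example runs anywhere.
# (File name and object names are illustrative assumptions.)
csv_path <- file.path(tempdir(), "data_sketch.csv")
writeLines(c("Model,mpg,cyl",
             "Mazda RX4,21.0,6",
             "Datsun 710,22.8,4",
             "Valiant,18.1,6"), csv_path)

# read.csv assumes a header row and comma separators by default.
dat_csv <- read.csv(csv_path)

# read.table reads the same file once header and sep are stated explicitly.
dat_tab <- read.table(csv_path, header = TRUE, sep = ",", dec = ".")

names(dat_csv)          # column names
dim(dat_csv)            # number of rows and columns
str(dat_csv)            # name, type and preview of each column
sapply(dat_csv, class)  # class of each column
```

Both calls should give the same data frame here; the only difference is which arguments have to be spelled out.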
head(dat)
##   Model              mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2 Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3 Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4 Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6 Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
dim(dat)
## [1] 32 12
nrow(dat)
## [1] 32
ncol(dat)
## [1] 12

We can use the print() function to display the dat data on the screen.
print(dat)
##    Model                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
##  1 Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
##  2 Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
##  3 Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
##  4 Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
##  5 Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
##  6 Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
##  7 Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
##  8 Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
##  9 Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 10 Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 11 Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 12 Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 13 Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 14 Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 15 Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 17 Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 18 Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 19 Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 20 Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 21 Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 22 Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 23 AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 24 Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 25 Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 26 Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 27 Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 28 Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 29 Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 30 Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 31 Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## 32 Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

str(dat) - displays the structure of the data: the type and a preview of the data in each column.
str(dat)
## 'data.frame': 32 obs. of 12 variables:
## $ Model: chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : int 6 6 4 6 8 6 8 4 4 6 ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : int 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : int 0 0 1 1 0 1 0 1 1 1 ...
## $ am : int 1 1 1 0 0 0 0 0 0 0 ...
## $ gear : int 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : int 4 4 1 1 2 1 4 2 2 4 ...
Based on the above, we can see that dat is an object of class data.frame. Its columns are of type character (chr), numeric (num) or integer (int).

Objects or variables are used to save data values that R programs can manipulate.
A valid object name consists of letters, numbers, and the dot or underscore characters.
It should start with a letter, or with a dot that is not followed by a number.

Examples of valid and invalid object names are:
object_name2. valid - contains letters, numbers, dot and underscore.
object_name% invalid - contains the character '%'. Only dot (.) and underscore (_) are allowed.
2object_name invalid - starts with a number.
.object_name, object.name valid - a name can start with a dot (.), but the dot must not be followed by a number.
.2object_name invalid - the dot is followed by a number.
_object_name invalid - starts with an underscore (_), which is not valid.

Object assignment: objects can be assigned values using the <- symbol. For example, x <- 5, y <- 5.2, z <- "CSE5DEV".
Objects are reserved memory locations that store values.
They store data of different types, and different types can do different things.
The types of the stored values are known as R data types.

In R, the data type can be one of the following:
Logical: TRUE, FALSE.
Integer: 21L, 3L, etc. The letter "L" declares the value as an integer.
Numeric: real or decimal (2.1, 2.0, pi).
Character: "a" or "swc".
Complex: 1 + 0i or 1 + 4i.
Date values: "2021-07-26".
We can use the class() or typeof() function to check the data type of an object.

— Examples of R objects assignment and data types —
# numeric
x <- 5.5
class(x)
## [1] "numeric"
# integer
x <- 200L
class(x)
## [1] "integer"
# complex
x <- 6i + 2
class(x)
## [1] "complex"
# character/string
x <- "R CSE5DEV"
class(x)
## [1] "character"
# logical/boolean
x <- TRUE
class(x)
## [1] "logical"

— R Data Type Conversion —
In R, we can convert a value from one type to another using the following functions:
as.numeric()
as.integer()
as.complex()
as.Date()
Examples of data type conversion are:

— R Data Structures —
Data structures are used to store data, keep it organised, and enable easy modification and access.
Data structures store a SET of data values that relate to each other, and allow us to perform operations or functions on these values.
Examples of R data structures are:
Vectors.
Matrices.
Data Frames.
Factors.

— R Data Structures: Vectors —
Vectors store a sequence of items (or values) of the same type.
We use the c() function to declare a vector that consists of a set of values separated by commas.
We can create a vector that combines a set of values as follows:

— R Data Structures: Vectors —
Some useful functions for vectors:
Vector length: length() returns the number of values.
Sort a vector: sort() sorts values alphabetically or numerically.
Access vectors: use [] brackets to access the vector items by index number.
Change an item value: use [index number] to change the value of a specific item.
Repeat vectors: use rep() to repeat vector items.

— R Data Structures: Vectors —
Examples of vector functions.
— R Data Structures: Vectors —
Examples of vector operations.
— R Data Structures: Vectors —
Functions for vectors.

— R Data Structures: Matrices —
A matrix stores data in a two-dimensional rectangular layout with columns and rows.
A column is a vertical representation of data, while a row is a horizontal representation of data.
We use the matrix() function to create a matrix. We also need to specify the nrow and ncol parameters to set the number of rows and columns.
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

— R Data Structures: Matrices —
Some useful functions for matrices:
Access matrix items: use [] brackets to access items using two index numbers: the first for the row and the second for the column.
Access more than one row or column: use [] and c(): [c(1,2), ] or [, c(1,2)].
Add rows and columns: use cbind() to add columns and rbind() to add rows.
Remove rows and columns: use negative indices with c(): [-c(1), -c(1)].
Check if an item exists: use the %in% operator: item %in% matrix.
Matrix size: dim() returns the number of rows and columns.
Matrix length: length() returns the total number of items in a matrix (rows × columns).

— R Data Structures: Matrices —
# Access matrix items
mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
print(mart[1, 2])
## [1] "cherry"
print(mart[2, ])
## [1] "banana" "orange"

# Access more than one row
mart <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
print(mart[c(1, 2), ])
##      [,1]     [,2]     [,3]
## [1,] "apple"  "orange" "pear"
## [2,] "banana" "grape"  "melon"

— R Data Structures: Matrices —
# Access more than one column
print(mart[, c(1, 2)])
##      [,1]     [,2]
## [1,] "apple"  "orange"
## [2,] "banana" "grape"
## [3,] "cherry" "pineapple"

# Add rows and columns
mart <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
print(mart)
##      [,1]     [,2]        [,3]
## [1,] "apple"  "orange"    "pear"
## [2,] "banana" "grape"     "melon"
## [3,] "cherry" "pineapple" "fig"

— R Data Structures: Matrices —
newmatrix <- cbind(mart, c("strawberry", "blueberry", "raspberry"))
print(newmatrix)
##      [,1]     [,2]        [,3]    [,4]
## [1,] "apple"  "orange"    "pear"  "strawberry"
## [2,] "banana" "grape"     "melon" "blueberry"
## [3,] "cherry" "pineapple" "fig"   "raspberry"
newmatrix <- rbind(mart, c("strawberry", "blueberry", "raspberry"))
print(newmatrix)
##      [,1]         [,2]        [,3]
## [1,] "apple"      "orange"    "pear"
## [2,] "banana"     "grape"     "melon"
## [3,] "cherry"     "pineapple" "fig"
## [4,] "strawberry" "blueberry" "raspberry"

— R Data Structures: Matrices —
mart <- matrix(c("apple", "banana", "cherry", "orange", "mango", "pineapple"), nrow = 3, ncol = 2)
# Remove the first row and the first column
mart <- mart[-c(1), -c(1)]
print(mart)
## [1] "mango" "pineapple"

# Check if an item exists
mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
"apple" %in% mart
## [1] TRUE

# Check the number of rows and columns
mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
dim(mart)
## [1] 2 2

# Matrix length
mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
length(mart)
## [1] 4

— R Data Structures: Matrices —
Examples of matrix operations.
— R Data Structures: Matrices —
Functions for matrices.

— R Data Structures: Matrices —
The following functions can be used to check the data type:
is.numeric(): checks if the data is numeric - TRUE or FALSE.
is.integer(): checks if the data is integer - TRUE or FALSE.
is.factor(): checks if the data is a factor - TRUE or FALSE.
is.character(): checks if the data is character - TRUE or FALSE.

— R Data Structures: Factors —
Factors can be used to categorise data and store it as levels.
Factors store both strings and integers: each level is a label, and each item is stored as an integer code pointing to its level.
They are very useful for columns that have a limited number of unique values: Demography {Male, Female}, Music {Rock, Classic, Jazz}, Training {Strength, Stamina}, Logical {True, False}.
We use the factor() function to create a factor and pass a vector c() as an argument.

— R Data Structures: Factors —
Some useful functions for factors:
Levels: use the levels() function to print or set the factor levels.
Factor length: the length() function returns the number of items.
Access factors: use [] brackets to access factor items.
Change item value: use [] and the item index number to change its value.
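To make the "strings and integers" point and the is.*() checks above concrete, here is a minimal sketch. The object name `training` and its values are assumptions chosen to match the Training {Strength, Stamina} category mentioned in the text.

```r
# Illustrative factor; the name 'training' and its values are assumptions.
training <- factor(c("Strength", "Stamina", "Strength", "Stamina"))

levels(training)       # the labels (strings), sorted alphabetically
as.integer(training)   # the integer codes that point at those labels

# Type checks from the list above
is.factor(training)                    # TRUE
is.character(training)                 # FALSE: a factor is not plain text
is.character(as.character(training))   # TRUE after conversion
is.numeric(c(1.5, 2))                  # TRUE
is.integer(c(1.5, 2))                  # FALSE: decimals are stored as doubles
is.integer(2L)                         # TRUE: the L suffix makes an integer
```

The levels are the strings and the codes are the integers, which is why a factor column uses little memory when only a few distinct values occur.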
— R Data Structures: Factors —
Examples of factor functions:
# Print levels
music <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
print(levels(music))
## [1] "Classic" "Jazz" "Pop" "Rock"

# Set the levels
music <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"), levels = c("Classic", "Jazz", "Pop", "Rock", "Other"))
print(levels(music))
## [1] "Classic" "Jazz" "Pop" "Rock" "Other"

# Factor length
length(music)
## [1] 8

# Access factors
print(music[3])
## [1] Classic
## Levels: Classic Jazz Pop Rock Other

# Change item value
music[4] <- "Pop"
print(music[4])
## [1] Pop
## Levels: Classic Jazz Pop Rock Other

— R Data Structures: Data Frames —
A data frame is the most common and practical way of storing data in R, especially in data analysis.
A data frame shows data in a table format.
A data frame can store different types of data. Different columns can have different data types: for example, the first column can be numeric, the second character and the third logical, etc.
However, all values within a single column must have the same data type.
We use the data.frame() function to create a data frame.

— R Data Structures: Data Frames —
Example: create a data frame that consists of 3 columns and 3 rows.

— R Data Structures: Data Frames —
Some useful functions for data frames:
Summarise the data: use the summary() function.
Access items: use single brackets [], double brackets [[]] or $ to access columns.
Add rows and columns: use rbind() to add rows and cbind() to add columns.
Remove rows and columns: use negative indices with c().
Number of rows and columns: use dim(), or ncol() and nrow().
Data frame length: length() returns the number of columns.
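The code that creates the 3-by-3 data frame does not appear in this transcript. A minimal sketch that reproduces the column names and values seen in the printed output later in this lecture (Training, ID, Time) might look like this; treat it as a reconstruction under that assumption rather than the original slide code.

```r
# Reconstructed from the printed output in this lecture; the exact original
# call is not in the transcript, so this is an assumption.
data_frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),  # character column
  ID       = c(10, 11, 13),                      # numeric column
  Time     = c(6.6, 3.2, 4.0)                    # numeric column
)

print(data_frame)
summary(data_frame)   # per-column summary
dim(data_frame)       # rows and columns
length(data_frame)    # number of columns
```

Each argument to data.frame() becomes one column, so every vector must have the same number of items.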
— R Data Structures: Data Frames —
# Access items
data_frame[1]
##   Training
## 1 Strength
## 2  Stamina
## 3    Other
data_frame[["Training"]]
## [1] "Strength" "Stamina" "Other"
data_frame$Training
## [1] "Strength" "Stamina" "Other"

# Add a new row
# (the new row is passed as a one-row data frame so that ID and Time stay
# numeric; a plain c("Strength", 110, 11.0) would turn every column into text)
New_row_DF <- rbind(data_frame, data.frame(Training = "Strength", ID = 110, Time = 11.0))
print(New_row_DF)
##   Training  ID Time
## 1 Strength  10  6.6
## 2  Stamina  11  3.2
## 3    Other  13  4.0
## 4 Strength 110 11.0

— R Data Structures: Data Frames —
# Add a new column
New_col_DF <- cbind(data_frame, Steps = c(1000, 6000, 2000))
print(New_col_DF)
##   Training ID Time Steps
## 1 Strength 10  6.6  1000
## 2  Stamina 11  3.2  6000
## 3    Other 13  4.0  2000

# Remove the first row and column
Data_Frame_New <- data_frame[-c(1), -c(1)]
print(Data_Frame_New)
##   ID Time
## 2 11  3.2
## 3 13  4.0

# Find the number of rows and columns
print(dim(data_frame))
## [1] 3 3

# Data frame length
print(length(data_frame))
## [1] 3

— R Data Structures: Data Frames —
All column names should be non-empty.
All row names should be unique.
The data stored in data frame columns can be numeric, factor or character.
Every column should contain the same number of items, and all items within a column should have the same data type.

— R Data Structures: Data Frames —
Example: create a data frame for five employees that consists of employee ID, name, salary and starting date.
## 1 1 A 611.30 2014-01-01
## 2 2 B 512.20 2015-08-23
## 3 3 C 621.00 2016-10-15
## 4 4 D 722.00 2016-04-11
## 5 5 E 343.21 2016-04-26
Please note: in R versions before 4.0.0, creating a data frame with the data.frame() function converts character strings to factors (distinct groups) by default. Use stringsAsFactors = FALSE to keep them as plain strings. Since R 4.0.0, stringsAsFactors = FALSE is the default.

dat <- read.csv("data.csv", header=TRUE, sep=",")
str(dat)
## 'data.frame': 32 obs. of 12 variables:
## $ Model: chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : int 6 6 4 6 8 6 8 4 4 6 ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : int 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : int 0 0 1 1 0 1 0 1 1 1 ...
## $ am : int 1 1 1 0 0 0 0 0 0 0 ...
## $ gear : int 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : int 4 4 1 1 2 1 4 2 2 4 ...
dim(dat)
## [1] 32 12
class(dat)
## [1] "data.frame"
class(dat$Model)
## [1] "character"
class(dat[[2]])
## [1] "numeric"

# Summary of dat
summary(dat)
##     Model                mpg             cyl             disp
##  Length:32          Min.   :10.40   Min.   :4.000   Min.   : 71.1
##  Class :character   1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8
##  Mode  :character   Median :19.20   Median :6.000   Median :196.3
##                     Mean   :20.09   Mean   :6.188   Mean   :230.7
##                     3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0
##                     Max.   :33.90   Max.   :8.000   Max.   :472.0
##        hp             drat             wt             qsec
##  Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50
##  1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89
##  Median :123.0   Median :3.695   Median :3.325   Median :17.71
##  Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85
##  3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90
##  Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90
##        vs               am              gear            carb
##  Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000

The data stored in objects can be exported and saved as text or CSV files using the following functions:
write.table: exports a text file: write.table(data_to_export, file = "file_name.txt", sep = " ").
write.csv: exports a CSV file: write.csv(data_to_export, file = "file_name.csv"). write.csv always uses a comma as the separator; an attempt to set sep is ignored with a warning.

In this lecture, we have learned
how to import data into the R environment (RStudio -> R Markdown).
how to view data in R.
objects and how to manipulate them.
R data types.
R data structures.
how to export data.

End of Week 2
See you next lecture (Week 3): Data Wrangling & R Programming
Table: CSE5DEV Timetable
Check LMS

Week 3
Data Wrangling & R Programming

CSE5DEV Syllabus
Week-Overview
Data Wrangling
Basics of R Programming

Subject Syllabus

Learning outcomes:
Learn about data representation.
Learn how to convert data from one format to another.
Learn R programming conditional statements.
Learn how to use R programming packages.

Data can be in different formats, but a computer program expects your data to be organised in a well-defined structure.

What have we learned so far?
—— Theory ——
Data collection: working with data.
Data sources: PC, internet, external.
Data formats: text, CSV, URL, etc.
Data values: qualitative or quantitative.
Data categories: experimental or observational.

What have we learned so far?
—— R Programming ——
Install R and RStudio, create an R Markdown file, write and run basic code, etc.
Data types and data structures (vector, factor, matrix and data frame).
View, access, change, etc.
Import data into the R environment (text files and CSV files).

Example: Consider the country population dataset (data1.csv). The same data can be organised in different representations, as shown in the next slides.
Example: format-1.
Example: format-2.
Example: format-3.
Example: format-4.
Example: format-5.

From the previous examples, we have seen that
The same data can be organised in different representations or formats.
Each format shows the same values of four variables: country, year, population and cases.
Different formats show the values in a different representation.

Q: What type of representation will be used in CSE5DEV labs?
A: Tabular representation (observations-by-features).
Figure: Image from R for Data Science

Tabular representation
In CSE5DEV, we use the data frame data structure.
Figure: Image from R for Data Science

Tabular representation
Organising data as observations-by-features is considered the most convenient and standard representation for data analysis.

Tabular data
Types of features/attributes: It is important to recognise the types of values each feature/attribute takes in order to understand which operations make sense for it.
This is similar to problems like 6 apples / 4 people = 1.5 apples per person, but 10 people / 4 car seats = 3 cars.

Tabular data
Qualitative vs. quantitative attributes: Attribute values can be split into two types:

Tabular data
Qualitative: nominal vs. ordinal: Qualitative attributes can be split further into two types:
Binary attributes are nominal attributes with only two values (Yes/No or 0/1). They can be symmetric or asymmetric, based on whether or not their values are equally informative.

Tabular data
Quantitative: interval vs. ratio: Quantitative attributes can also be split into two types:
We can also split attributes into discrete and continuous ones. All qualitative attributes are considered discrete.

Tabular data
Summary of attribute types: The types of attributes can be regarded via the operations that can be applied to them:
Comparison (= and ≠) - every type.
Ordering (> and <) - every type except nominal.
Differences (-) and addition (+) - only quantitative.
Division (/) and multiplication (×, ·) - only ratio.
Other operations (e.g., mean, median, correlation) may also be inapplicable for some types while applicable to others.
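The link between attribute type and permitted operations can be seen directly in R. The sketch below is illustrative only: the object names (`eye`, `grade`, `marks`) are assumptions, and the example values reuse the eye colours, grades and average mark mentioned in the Week 2 material.

```r
# Nominal: eye colours have no order, so only = and != make sense.
eye <- factor(c("blue", "black", "honey"))
eye[1] == eye[2]                        # comparison works: FALSE
suppressWarnings(eye[1] < eye[2])       # ordering is not meaningful: NA

# Ordinal: grades can be ordered once the level order is declared.
grade <- factor(c("B", "A", "D"),
                levels = c("F", "E", "D", "C", "B", "A"),
                ordered = TRUE)
grade[1] < grade[2]                     # TRUE: B is below A
max(grade)                              # highest grade present: A

# Interval: differences between dates are meaningful, ratios are not.
as.numeric(as.Date("2021-07-26") - as.Date("2021-07-20"))   # 6 days

# Ratio: marks support differences, ratios and the mean.
marks <- c(63.4, 80, 40)
marks[2] / marks[3]                     # 2: twice as high
mean(marks)
```

R refuses to order an unordered factor and returns NA with a warning, which mirrors the rule that ordering applies to every type except nominal.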
Tabular data
Technical formats: Tabular data can be stored or collected in several standard formats, such as:
Comma-separated file (CSV).
Flat file or delimited text file (e.g., space or tab delimited).
XML or other log files.
Proprietary formats (e.g., FCS for biological data or MAT files for Matlab data).
Database tables.
Non-tabular data: transactional data (term matrix, text documents), structured signals, multidimensional signals, nonparametric representations.

Tabular representation
In a tabular representation, we need to make sure that
Each variable has its own column.
Each observation has its own row.
Each value has its own cell.
Figure: Image from R for Data Science

Tabular representation
If the data is not in a tabular representation, we need to perform a couple of processes to convert it into one. Examples of the processes are:
Gathering and spreading.
Separating and uniting.
Filtering.
Grouping.
Mutating.

Tabular representation
Example: Gathering process - gather columns into a new pair of variables.
Figure: Image from R for Data Science
gather(data, key, value, ...)
data is the data frame you are working with.
key is the name of the key column to create.
value is the name of the value column to create.
... is a way to specify which columns to gather from.

Tabular representation
Example: Spreading process - spreading is the opposite of gathering.
Figure: Image from R for Data Science
spread(data, key, value)
data is your data of interest.
key is the column whose values will become variable names.
value is the column whose values will fill in under the new variables created from key.
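As a hedged illustration of the gather() and spread() signatures above, the sketch below assumes the tidyr package is installed. The data frame `wide` and its numbers are made up for the example and are not taken from the course dataset.

```r
library(tidyr)  # assumption: tidyr is installed

# Made-up wide table: one column per year (illustrative values only).
wide <- data.frame(country = c("A", "B"),
                   `1999`  = c(10, 20),
                   `2000`  = c(15, 25),
                   check.names = FALSE)  # keep 1999/2000 as column names

# Gathering: the year columns become a key column (year) and a value column (cases).
long <- gather(wide, key = "year", value = "cases", `1999`, `2000`)
print(long)

# Spreading: the opposite step, turning year values back into columns.
back <- spread(long, key = year, value = cases)
print(back)
```

After gathering, each row holds one country-year observation, which is the observations-by-features layout described earlier. Newer tidyr releases recommend pivot_longer() and pivot_wider() for the same job, but gather() and spread() remain available.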
Tabular representation
Example: Separating process - pulls apart one column into multiple columns, by splitting wherever a separator character appears.
Figure: Image from R for Data Science
separate(data, col, into, sep)
data is the data frame of interest.
col is the column that needs to be separated.
into is a vector of names of the columns the data will be separated into.
sep is the value at which you want to separate the data.

Tabular representation
Example: Uniting process - the inverse of separate. It combines multiple columns into a single column.
Figure: Image from R for Data Science
unite(data, col, ..., sep)
data is the data frame of interest.
col is the name of the column you wish to add.
... is the names of the columns you wish to unite.
sep is how you wish to join the data in the columns.

Five main verbs
Select - select variables by their names.
Filter - choose rows that satisfy some criteria.
Arrange - reorder the rows.
Mutate - create transformed or derived variables.
Summarise - collapse rows down to summaries.

CSE5DEV Syllabus
Week-Overview
Data Wrangling
Basics of R Programming

Basics of R Programming
In previous lectures, we have learned
How to read data from a file.
Variables, variable names and data types.
Data structures: vector, factor, matrix and data frame.
View, access, change, etc.

dat <- read.csv("data.csv", header=TRUE, sep=",")
names() - shows the names attribute of a data frame.
head() - shows the first 6 rows.
tail() - shows the last 6 rows.
dim() - returns the dimensions of the data frame.
nrow() - number of rows.
ncol() - number of columns.
str() - structure of the data frame: name, type and preview of the data in each column.
sapply(dataframe, class) - shows the class of each column in the data frame.

In this lecture, we will learn how to write R code for the following tasks:
Logical conditions to select subsets.
Conditional execution: if statements.
Repetitive execution: for loops, repeat and while.
Packages.
Format transformation.

Example: read data from a file.
head(dat)
##   Model              mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2 Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3 Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4 Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6 Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We may need to extract data that satisfy certain criteria.
For example, we may want to select the rows whose disp value is less than or equal to 160.
We can use logical condition operators to select a subset of the data.

— Conditional operators —
Conditional operators are used to compare values or expressions. They return TRUE (1) or FALSE (0).

— Conditional operators —
Examples: Conditional operators for two variables: x and y.
x <- 4
y <- 15
x < y
## [1] TRUE
x > y
## [1] FALSE
x <= 5
## [1] TRUE
y >= 20
## [1] FALSE
y == 16
## [1] FALSE
x != 5
## [1] TRUE

— Conditional operators —
Examples: Conditional operators for a vector x
x <- c(3, 5, 1, 2, 7, 6, 4)
x < 5 # is x less than 5
## [1] TRUE FALSE TRUE TRUE FALSE FALSE TRUE
x <= 5 # is x less than or equal to 5
## [1] TRUE TRUE TRUE TRUE FALSE FALSE TRUE
x > 3 # is x greater than 3
## [1] FALSE TRUE FALSE FALSE TRUE TRUE TRUE
x >= 3 # is x greater than or equal to 3
## [1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE
x == 2 # is x equal to 2
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
x != 2 # is x not equal to 2
## [1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE

— Conditional operators —
Useful functions: all, any and which
The all and any functions check whether all, or at least one, of the entries of a logical vector are TRUE, respectively.
The which function returns the index positions of the TRUE entries.

— Logical Operators —
Logical operators can be used to combine two or more conditions.
In this subject, we will only use the element-wise operators: ! (NOT), & (AND) and | (OR).
These operators compare vectors element by element and return TRUE (1) or FALSE (0) for each element.

— Logical Operators —
Examples: Logical operators for a vector x

— Logical Operators —
Consider the following example: x <- c(5, 3, 7, 9, 10)
We want to extract the values of the vector x which are greater than 5 (7, 9, 10).
There are two methods:
Method 1
Method 2

— Logical Condition Operators —
We may need to extract data that satisfy certain criteria.
For example, we may want to select the rows whose disp value is less than or equal to 160.
We can use logical condition operators to select a subset of the data.
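The slide code for all/any/which, the element-wise logical operators and the two extraction methods is not present in this transcript. Before the data frame example that follows, here is a minimal sketch on the vector x from the text. Which approach the slides call "Method 1" and "Method 2" is an assumption; logical indexing and which() are simply two common ways to do it.

```r
x <- c(5, 3, 7, 9, 10)   # vector from the text

# all / any / which on a logical vector
all(x > 2)     # TRUE: every value is greater than 2
any(x > 9)     # TRUE: at least one value is greater than 9
which(x > 5)   # 3 4 5: positions of the TRUE entries

# Element-wise logical operators
!(x > 5)           # NOT:  TRUE  TRUE FALSE FALSE FALSE
x > 5 & x < 10     # AND: FALSE FALSE  TRUE  TRUE FALSE
x < 5 | x == 10    # OR:  FALSE  TRUE FALSE FALSE  TRUE

# Two ways to extract the values greater than 5 (labels are assumptions)
method1 <- x[x > 5]          # index with the logical vector directly
method2 <- x[which(x > 5)]   # index with the positions from which()
method1
method2
```

Both methods return 7, 9 and 10. The same logical-indexing idea carries over to data frames, where the condition goes in the row position of the brackets.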
    s <- dat[dat$disp <= 160, ]
    print(s)
    ##             Model  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    ## 1       Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
    ## 2   Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    ## 3      Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    ## 8       Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    ## 9        Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
    ## 18       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
    ## 19    Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
    ## 20 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
    ## 21  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    ## 26      Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
    ## 27  Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
    ## 28   Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
    ## 30   Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
    ## 32     Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

— Logical Condition Operators —
We may need to extract data that satisfy certain criteria. For example, we may want to select the rows whose disp value is less than or equal to 160 AND whose hp is less than 110.

    z <- dat[dat$disp <= 160 & dat$hp < 110, ]
    print(z)
    ##             Model  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    ## 3      Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    ## 8       Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    ## 9        Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
    ## 18       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
    ## 19    Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
    ## 20 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
    ## 21  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    ## 26      Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
    ## 27  Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
    ## 32     Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

— Logical Condition Operators —
We may need to extract data that satisfy certain criteria.
For example, we may want to select the rows whose disp value is less than or equal to 160 AND whose hp is less than 110, and then inspect their wt column.

    z <- dat[dat$disp <= 160 & dat$hp < 110, ]
    print(z)
    ##             Model  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    ## 3      Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    ## 8       Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    ## 9        Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
    ## 18       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
    ## 19    Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
    ## 20 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
    ## 21  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    ## 26      Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
    ## 27  Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
    ## 32     Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

— If Statement —
If statements have this syntax:

    if (condition) {
      expressions 1 if true
    } else {
      expressions 2 otherwise
    }

— If Statement —
We can use an if statement without else. We can also chain several alternative branches using else if.

Examples of R repetitive execution functions are:
- for loop: iterates over a vector.

      for (variable in vector) {
        commands
      }

- repeat: repeats a block of code until some condition is met.

      repeat {
        expression
        if (condition) {break}
      }

- while: evaluates an expression as long as a stated condition is TRUE.

      while (condition) {
        expression
      }

— Example: for loops —

    ##      [,1] [,2]
    ## [1,]   10   13
    ## [2,]   11   14
    ## [3,]   12   15
    nrr <- nrow(a) # n
    for (i in 1:nrr) {
    ## [1] 13
    ## [1] 14
    ## [1] 15

— Example: repeat loop —

— Example: while loop —

    i <- 1
    j <- 1
    mat <- matrix(0, n
    print(mat)
    ##      [,1] [,2]
    ## [1,]    0    0
    ## [2,]    0    0
    ## [3,]    0    0
    ##      [,1] [,2]
    ## [1,]    3    3
    ## [2,]    5    5
    ## [3,]    7    7

— Packages —
Some packages are installed with R and loaded automatically when RStudio starts. Other packages must be installed before we can use them.
To install a package, run ONLY ONCE:

    install.packages("Package name")

To use an installed package, we need to load it using the library function as follows:

    library(Package name)

— Data Wrangling —
Example: Package for the five main verbs
- Select - select variables by their names.
- Filter - choose rows that satisfy some criteria.
- Arrange - reorder the rows.
- Mutate - create transformed or derived variables.
- Summarise - collapse rows down to summaries.

These verbs are provided by the "dplyr" package ("tidyr" supplies complementary functions for reshaping data), so dplyr must be installed and loaded into R before they can be used:
To install the package in R run: install.packages("dplyr")
To load the package into R run: library(dplyr)

— Data Wrangling —
Step 1: Create a data frame:

    df <- data.frame(color = c("blue", "black", "blue", "blue", "black"), value = 1:5)

Step 2: Perform the following functions: filter(), arrange(), select(), mutate().

End of Week 3. See you next lecture (Week 4): Data Cleaning & Normalisation.
Table: CSE5DEV Timetable (check LMS).

Week 4: Data Cleaning & Normalisation
Outline: CSE5DEV Syllabus, Week-Overview, Data Cleaning, Data Normalisation

Subject Syllabus

Learning outcomes:
- Learn about data preparation.
- Learn about handling data types.
- Learn about data transformation.
- Learn about data cleaning.
- Learn about data normalisation.

Data can be in different formats, but a computer program expects your data to be organised in a well-defined structure.

What have we learned so far? —— Theory ——
- Data Collection: working with data
- Data Wrangling: correct or change the format of the data

What have we learned so far? —— R Programming ——
- Install R and RStudio, create an Rmarkdown file, write and run basic code, etc.
- Data types and data structures (vector, factor, matrix and data frame)
- View, access, change, etc.
- Import data into the R environment (text files and csv files)
- Correct or change the format of the data to make it tidy

Base R Cheat Sheet

Getting Help: accessing the help files
- ?mean - get help of a particular function.
- help.search('weighted mean') - search the help files for a word or phrase.
- help(package = 'dplyr') - find help for a package.

More about an object
- str(iris) - get a summary of an object's structure.
- class(iris) - find the class an object belongs to.

Using Libraries
- install.packages('dplyr') - download and install a package from CRAN.
- library(dplyr) - load the package into the session, making all its functions available to use.
- dplyr::select - use a particular function from a package.
- data(iris) - load a built-in dataset into the environment.

Working Directory
- getwd() - find the current working directory (where inputs are found and outputs are sent).
- setwd('C://file/path') - change the current working directory.
Use projects in RStudio to set the working directory to the folder you are working in.

Vector Functions
- sort(x) - return x sorted.
- rev(x) - return x reversed.
- table(x) - see counts of values.
- unique(x) - see unique values.

Selecting Vector Elements
By position
- x[4] - the fourth element.
- x[-4] - all but the fourth.
- x[2:4] - elements two to four.
- x[-(2:4)] - all elements except two to four.
- x[c(1, 5)] - elements one and five.
By value
- x[x == 10] - elements which are equal to 10.
- x[x < 0] - all elements less than zero.
- x[x %in% c(1, 2, 5)] - elements in the set 1, 2, 5.
Named vectors
- x['apple'] - element with name 'apple'.

[Cheat-sheet panels whose content was not captured: Creating Vectors; Programming (For Loop, While Loop, If Statements, Functions, with examples); Reading and Writing Data; Conditions.]

RStudio® is a trademark of RStudio, Inc. • CC BY Mhairi McNeill • Updated: 3/15

- m <- matrix(x, nrow = 3, ncol = 3) - create a matrix from x.
- log(x) - natural log.
- sum(x) - sum.
- exp(x) - exponential.
- mean(x) - mean.
- max(x) - largest element.
- median(x) - median.
- min(x) - smallest element.
- quantile(x) - percentage quantiles.
- round(x, n) - round to n decimal places.
- rank(x) - rank of elements.
- signif(x, n) - round to n significant figures.
- var(x) - the variance.
- cor(x, y) - correlation.
- sd(x) - the standard deviation.
- df <- data.frame(x = 1:3, y = c('a', 'b', 'c')) - a data frame: a special case of a list where all elements are the same length.
- List subsetting and matrix subsetting both work on data frames: df[ , 2] (second column), df[2, ] (second row), df[2, 2] (element in row two, column two).
- nrow(df) - number of rows.
- ncol(df) - number of columns.
- dim(df) - number of rows and columns.
- cbind - bind columns.
- rbind - bind rows.
- t.test(x, y) - perform a t-test for the difference between means.
- pairwise.t.test - perform pairwise t-tests between group levels, with correction for multiple testing.
- prop.test - test for a difference between proportions.
- aov - analysis of variance.

Data Cleaning: Data Preparation, Data Transformation, Missing Values

An example of dirty (raw) data.
Data is dirty if it has incomplete, noisy or inconsistent values.
- Incomplete data comes from:
  - data values that were not available when the data was collected.
  - different criteria between the time when the data was collected and when it is analysed.
  - human/hardware/software problems.
- Noisy data comes from:
  - data collection: faulty instruments.
  - data entry: human or computer errors.
  - data transmission.
- Inconsistent (and redundant) data comes from:
  - different data sources, so non-uniform naming conventions/data codes.
  - functional dependency and/or referential integrity violations.

Data cleaning can be one or all of the following processes:

— Data Preparation —
An example of dirty data with incorrect types and formats.
Figure: adopted from Quantdare

— Data Preparation —
Recall that variable values can be:
- Numeric:
  - Discrete - integer values. Example: number of cars in the park.
  - Continuous - any value in a pre-defined range (float, double). Example: average mark.
- Categorical: values are selected from a predefined number of categories.
  - Ordinal - categories can be meaningfully ordered. Example: grades (A, B, C, D, E, F).
  - Nominal - categories have no order. Example: eye colours (blue, black, honey, etc.).
  - Dichotomous/Binary - the special case of nominal, with only 2 possible categories. Example: binary value (1, 0).
- Date: datetime, timestamp. Example: 11.10.2018.
- Text.
- Multidimensional data.
- Time series: data points indexed in time order.

We need to ensure the given data variables are correct (consistent data). Consistent data means that variable values have consistent formats and types.

— Data Preparation —
To convert variable values into correct formats and types, we need to do the following steps:
- Data validation.
- Handling dates.
- Handling data types.

— Data Preparation: Data validation —
The given data should respect the following rules:
- Dates have the same format.
- Integer variables are assigned integer values.
- Categorical data does not contain duplicates caused by white-space or lower/upper case differences.
- Data is in the range of permissible values. Example: numerical variables are in a pre-defined (min, max) range.
- Data integrity checks: title agrees with sex, date of birth agrees with age.
- Historical data has the right chronology: delivery after purchase, bank account opening before the first payment, etc.
- Actions are made by allowed entities: a mortgage can be approved only for people older than 18 years, etc.

— Data Preparation: Data validation —
If we find errors, what should we do?
- Correct them if possible.
- Discard them if they do not have a critical impact.
- Do nothing, but this might impact the next steps.

— Data Preparation: Handling dates —
Different systems save dates in different formats:
- 12.11.2014, 2018-12-01, Jan 14, 2008, etc.
- Unix timestamp.
- Other timestamps.

— Data Preparation: Handling dates —
In R we can use as.Date and format to convert a given date into the correct format.
    as.Date('1994-06-16')
    ## [1] "1994-06-16"
    as.Date('1996/02/17')
    ## [1] "1996-02-17"
    as.Date('1/15/2001', format = '%m/%d/%Y')
    ## [1] "2001-01-15"
    as.Date('April 26, 2001', format = '%B %d, %Y')
    ## [1] "2001-04-26"

Format codes:
- %d: day of the month (decimal number).
- %m: month (decimal number).
- %b: month (abbreviated).
- %B: month (full name).
- %y: year (2 digit).
- %Y: year (4 digit).

— Data Preparation: Handling dates —
Example: convert strings to dates.
When date and time data are imported into R they will often default to character strings, so we need to convert the strings to dates. We may also have multiple strings that we want to merge to create a date variable.

    x <- c("2015-07-01", "2015-08-01", "2015-09-01")
    as.Date(x)
    ## [1] "2015-07-01" "2015-08-01" "2015-09-01"

    # example using "format"
    y <- c("07/01/2017", "07/01/2018", "07/01/2015")
    as.Date(y, format = "%m/%d/%Y")
    ## [1] "2017-07-01" "2018-07-01" "2015-07-01"

— Data Preparation: Handling dates —
Examples: create a date, specify the format, use a different origin and take a difference.

— Data Preparation: Handling dates —
Example: print today's date and calculate the difference between two different dates.

    # print today's date
    today <- Sys.Date()
    format(today, format = "%B %d %Y")
    ## [1] "August 09 2021"

    # alternate method with specified units
    difftime(Sys.Date(), as.Date("1970-01-01"), units = "days")
    ## Time difference of 18846 days

    # see the internal integer representation
    unclass(Sys.Date())
    ## [1] 18846

    mydates <- as.Date(c("2007-06-22", "2004-02-13"))
    days <- mydates[1] - mydates[2]
    days
    ## Time difference of 1225 days

— Data Preparation: Handling dates —
Example: calculate the time between two dates using the "difftime" function.
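A minimal sketch of the difftime idea, assuming two made-up dates (start and end are illustrative names and values, not taken from the lecture data). It shows the units argument and how to turn the result into a plain number.

```r
# Hypothetical dates chosen only for illustration
start <- as.Date("2021-01-15")
end   <- as.Date("2021-03-01")

# Time between the two dates, in days and in weeks
d_days  <- difftime(end, start, units = "days")
d_weeks <- difftime(end, start, units = "weeks")
print(d_days)
## Time difference of 45 days
print(d_weeks)

# Drop the difftime class to get a plain number for further arithmetic
n_days <- as.numeric(d_days, units = "days")
print(n_days)
```

Subtracting two Date objects directly (end - start) gives the same difference in days; difftime is useful when a specific unit such as weeks is wanted.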
— Data Preparation: Handling data types —
Although this seems an easy task, it matters because some models work only with certain data types. Data types are numerical and categorical. Some algorithms, models and visualisations work:
- only with the categorical data type,
- only with the numerical data type,
- with both types.

— Data Preparation: Handling data types —
In R, there are several functions to convert given data values into the correct types. For example:
- as.character(): store a value as a character, or convert a value into the character data type.
- as.numeric(): convert values of other data types into numerical values.
- as.integer(): convert values of other data types into integer values.
- as.logical(): convert a value into TRUE or FALSE.

— Data Preparation: Handling data types —
The following functions can be used to check the data type of each column:
- is.numeric(): check if the data is numeric (TRUE or FALSE).
- is.integer(): check if the data is integer (TRUE or FALSE).
- is.factor(): check if the data is a factor (TRUE or FALSE).
- is.character(): check if the data is character (TRUE or FALSE).

— Data Preparation: Handling data types —
Examples: convert data to character, numeric, integer and logical.

— Data Preparation: Handling data types —
Helpful R functions (all except str() come from the dplyr package):
- filter(): pick rows (observations/samples) based on their values.
- distinct(): remove duplicate rows.
- arrange(): reorder the rows.
- select(): select columns (variables) by their names.
- rename(): rename columns.
- mutate() and transmute(): add/create new variables.
- summarise(): compute statistical summaries (e.g., computing the mean or the sum).
- str(): display the structure of the given object.

— Data Transformation —
Given a categorical variable (A+, A-, B+, B-), it is very hard to use "<" or ">" comparisons on its values. We can convert them into numbers:
- A+: 4.0
- A-: 3.7
- B+: 3.3
- B-: 3.0
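The grade conversion above can be sketched with a named vector used as a lookup table. The grades vector and the names grade_points and grade_num are assumptions made for illustration; only the four grade values come from the slide.

```r
# Illustrative grades (not from the lecture data)
grades <- c("A-", "B+", "A+", "B-", "A+")

# Lookup table built from the grade points listed above
grade_points <- c("A+" = 4.0, "A-" = 3.7, "B+" = 3.3, "B-" = 3.0)

# Index the named vector by grade label; unname() drops the labels
grade_num <- unname(grade_points[grades])
print(grade_num)
## [1] 3.7 3.3 4.0 3.0 4.0

# Numeric comparisons now work as expected
print(grades[grade_num > 3.5])
## [1] "A-" "A+" "A+"
```

A grade that is not in the lookup table would come back as NA, which makes unexpected labels easy to spot.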
Multi-valued, unordered attributes with a small number of values can be converted into numerical values:
- Colour = Red, Orange, Yellow, ..., Violet.
- For each value v, create a binary "flag" variable C_v, which is 1 if Colour = v and 0 otherwise.

— Data Transformation —
We can convert between categorical and numerical variables using the following methods:
- Indicator variables: convert categorical data into boolean values by creating indicator variables. If the variable has n values (n > 2), we have to create n-1 columns.
- Data binning or bucketing: divide the samples into intervals and replace them by categorical values.

— Data Transformation —
Example: Indicator variables: convert categorical data into boolean values by creating indicator variables.
Figure: adopted from Quantdare

— Data Transformation —
Example: Data binning or bucketing: divide the samples into intervals and replace them by categorical values.

— Data Transformation —
Example: convert the colours "blue" and "red" into "1" and "2".

— Data Transformation —
Example: create ordered factor variables.

    ord <- c("low", "middle", "low", "low", "low", "low", "middle", "low", "middle",
             "middle", "middle", "middle", "middle", "high", "high", "low", "middle",
             "middle", "low", "high")
    ord
    ## [1] "low" "middle" "low" "low" "low" "low" "middle" "low" "middle" "middle" "middle" "middle" "middle" "high" "high" "low" "middle" "middle" "low" "high"

    ord.order <- ordered(ord, levels = c("low", "middle", "high"))
    ord.order
    ## [1] low middle low low low low middle low middle middle middle middle middle high high low middle middle low high
    ## Levels: low < middle < high

— Missing Values —
Most real-world data involve missing values because:
- measurements fail.
- there are non-respondents in surveys.
- results get lost.
- measurements do not fulfil some prior knowledge.
- etc.

— Missing Values —
An example of dirty data. Missing values are denoted as NA (null).
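As a small sketch of how R represents missing values, assuming a made-up vector of marks (the name marks and its values are illustrative only): NA is a placeholder inside a vector, whereas R's NULL is an empty object, so the two should not be confused.

```r
# Illustrative vector with two missing entries
marks <- c(63.4, NA, 71.0, 58.5, NA)

# TRUE where a value is missing
print(is.na(marks))

# Number of missing values
n_missing <- sum(is.na(marks))

# Many summary functions return NA unless missing values are skipped
mean_all      <- mean(marks)                # NA
mean_complete <- mean(marks, na.rm = TRUE)  # mean of the three observed marks

# NA occupies a position in a vector; NULL has length zero
print(length(NA))    # 1
print(length(NULL))  # 0
```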
— Missing Values —
Why do we need to handle missing values?
- Missing values are often of great interest because we can replace them with meaningful values.
- Most standard methods can only be applied to complete data.
- Deleting whole columns or rows of data where missing values appear would result in a loss of important available information.

— Missing Values —
If the data has a lot of missing values, what should we do?
- Identify the missing values.
- Handle the missing values:
  - delete missing observations (samples or rows), or
  - delete variables (features or columns) with missing values, or
  - impute the missing values.

— Missing Values —
How can we identify missing values?
- Use a function to count how many values are missing in the whole data, in one column or in one row.
- Use a function to calculate the percentage of missing values.
- Use visualisation methods to compare the missing values with the normal values.

— Missing Values —
Example: Use the summary() function to count how many values are missing in the whole data, in one column or in one row.
Example: Use a function to calculate the percentage of missing values, e.g. percentage = 0.3166667.

— Missing Values —
Example: Use visualisation methods to compare the missing values with the normal values.

— Missing Values —
When can we delete missing observations?
If the total number of missing observations (rows or samples) is very high, we might need to remove them because:
- imputing too many missing observations can lead to bias in the dataset.
- it can lead to poor results.
However, after deleting the observations you need to make sure that:
- you still have sufficient data samples, so the model does not produce poor results.
- you are not introducing bias (non-representation of classes).

— Missing Values —
When can we delete variables (features or columns) with missing values?
- If one or two features contribute most of the missing values, we can delete these features with a high percentage of missing values, or
- delete features which have more than 30% missing values, or
- remove a feature whose removal saves many samples.
However, you need to check the importance of the feature against the number of samples you would otherwise lose.

— Missing Values —
Examples of methods for imputing missing values include:
- mean
- median
- mode
- Predictive Mean Matching
- Machine Learning Algorithms

— Missing Values —
Example before imputing missing values.

— Missing Values —
Example after imputing missing values.

— Missing Values —
R functions to identify and count missing values:
- is.na(): determine which values in a dataset are missing.
- na.omit(): omit NA values.
- sum(is.na()): return the total number of missing values.
- any(is.na()): return TRUE if there are any missing values.
- summary(): produce a result summary of a data frame.
- complete.cases(): check which rows have no missing values.

— Missing Values —
Example: Step-1 - handle the missing values in the "iris.mis" data.

    # Call is.na() on the full iris.mis to spot all NAs
    is.na(iris.mis)
    # Call is.na() on the Species feature to spot all NAs
    is.na(iris.mis$Species)
    # Use the any() function to ask whether there are any NAs in the data
    any(is.na(iris.mis))
    # View a summary() of the dataset
    summary(iris.mis)
    # Replace all empty strings in each column with NA
    iris.mis$Sepal.Length[iris.mis$Sepal.Length == ""] <- NA
    iris.mis$Sepal.Width[iris.mis$Sepal.Width == ""] <- NA
    iris.mis$Petal.Length[iris.mis$Petal.Length == ""] <- NA
    iris.mis$Petal.Width[iris.mis$Petal.Width == ""] <- NA
    iris.mis$Species[iris.mis$Species == ""] <- NA
    summary(iris.mis)

— Missing Values —
Example: Step-2 - handle the missing values in the "iris.mis" data.

    # Use complete.cases() to see which rows are complete (FALSE marks rows with missing values)
    complete.cases(iris.mis)
    # Count missing values for the whole data
    sum(is.na(iris.mis))
    # Count missing values in a feature
    sum(is.na(iris.mis$Sepal.Length))
    # Find indices of NAs in Sepal.Length
    ind <- which(is.na(iris.mis$Sepal.Length))
    # Look at the full rows of missing Sepal.Length values
    iris.mis[ind, ]
    # Set Sepal.Length missing values to 0.5
    iris.mis$Sepal.Length[ind] <- 0.5
    # We can omit all rows with any missing values using na.omit()
    na.omit(iris.mis)

Data Normalisation

The two most common normalisation techniques are:
- Min-Max: the simplest way of scaling the values in a feature; it maps every value into the range [0, 1].
- Z score: converts all indicators to a common scale with an average of zero and a standard deviation of one.

Min-Max technique: each value x of a feature is rescaled using the minimum and maximum of that feature:

$$z = \frac{x - \min(\text{feature})}{\max(\text{feature}) - \min(\text{feature})}$$

Z score normalisation: each value x is centred on the mean and divided by the standard deviation, where µ = mean and α = standard deviation of the feature:

$$z = \frac{x - \mu}{\alpha}$$

Example of data before normalisation.
Example of data after normalisation.
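A minimal sketch of Z score normalisation on the built-in iris data, applying z = (x − µ) / α to each numeric column. The helper name zscore and the data frame name iris_z are assumptions introduced here for illustration; sd() plays the role of α.

```r
# Z score for one column: subtract the mean, divide by the standard deviation
zscore <- function(x) (x - mean(x)) / sd(x)

# Apply it to the four numeric columns and keep the Species label
iris_z <- as.data.frame(lapply(iris[, 1:4], zscore))
iris_z$Species <- iris$Species

# Each normalised column should have mean (about) 0 and standard deviation 1
col_means <- sapply(iris_z[, 1:4], mean)
col_sds   <- sapply(iris_z[, 1:4], sd)
print(round(col_means, 10))
print(col_sds)

# Base R's scale() performs the same centring and scaling and returns a matrix
iris_scaled <- scale(iris[, 1:4])
```

Unlike Min-Max scaling, the result is not confined to [0, 1]; values are expressed as the number of standard deviations above or below the column mean.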
Example: Normalise the data frame by columns or by rows.

    # Normalise the data frame by columns
    # The function to normalise data is (x - min(x)) / (max(x) - min(x))
    # We take only the numerical columns to normalise
    iris_norm <- as.data.frame(apply(iris[, 1:4], 2, function(x) (x - min(x)) / (max(x) - min(x))))
    iris_norm$Species <- iris$Species
    str(iris_norm)

    # Normalise the data frame by rows
    iris_norm <- as.data.frame(t(apply(iris[1:4], 1, function(x) (x - min(x)) / (max(x) - min(x)))))
    # Now Sepal.Length is always 1 because it is the maximum value in every row
    # At the same time, Petal.Width is always 0 because it is the minimum value in every row
    summary(iris_norm)

End of Week 4. See you next lecture (Week 5): Data Visualisation.
Table: CSE5DEV Timetable (check LMS).