week02_merged.pdf
La Trobe University
Week 2
CSE5DEV DATA EXPLORATION AND ANALYSIS

Overview
1 Section 1: CSE5DEV Syllabus
2 Section 2: Data Collection
3 Section 3: Basics of R Programming

Section 1: CSE5DEV Syllabus

Subject Syllabus
Lecture 1 Introduction
Lecture 2 Data Collection & R Programming
Lecture 3 Data Wrangling & R Programming
Lecture 4 Data Cleaning & Normalisation
Lecture 5 Data Visualisation
Lecture 6 Data Exploration 1 Analysis
Lecture 7 Data Exploration 2 Analysis
Lecture 8 Data Exploration 3 Analysis
Lecture 9 Correlation & Pattern Discovery Analysis
Lecture 10 Case Study 1
Lecture 11 Case Study 2
Lecture 12 Revision

What have we learned so far?
Lecture 1 (Introduction):
1 Install R and RStudio.
2 Create an R Markdown file.
3 Add a chunk of code.
4 Write and run basic code.

Data Science Project
Almost all data science and analysis projects go through the same set of stages:
Stage 1 Identify the problem (question)
• What is the goal?
• What do you want to estimate?
• How to track house prices across different areas?
Stage 2 Collect & prepare the data
• Data resources
• Data representation
• Clean and normalise the data
Stage 3 Explore the data
• Descriptive statistics
• Visualisation
• Does the result make sense?
Stage 4 Communicate the results
• What are the findings?
• What did we learn?
• Report the findings

Week 2 Overview: Data Collection & R Programming
This week covers the basics of data collection and R programming.
Learning outcomes:
• Learn about the sources of data.
• Learn about data types.
• Learn how to import data into R Markdown.
• Learn about R programming.

Section 2: Data Collection

Data Collection
Data collection is the process of gathering information from a specific source, which can be used to answer relevant questions and evaluate outcomes.
Data can help us in:
• learning more about customers, items, products, etc.
• discovering trends in the current system, organisation, etc.
• segmenting elements into different groups based on their individual needs.
• making decisions that improve the quality of the system.
• improving the quality of the product or service based on the feedback obtained.
[Diagram labels: Data; R Code; Data Exploration & Analysis Techniques; Knowledge, Conclusions, Actions, etc.]

Data sources: data can be obtained from various sources such as a PC, the Internet or an external PC.

Data format: data can be stored in different formats.
[Figure: examples of data formats]

What is data?
Data is a set of facts such as numbers, words, measurements, observations or descriptions of things.
It is a set of values of qualitative or quantitative variables, collected by a wide range of organisations and institutions, such as businesses and non-governmental organisations.
▶ Qualitative data: descriptive information (describes something).
▶ Quantitative data: numerical information (numbers).

What is data? Qualitative vs quantitative
• Qualitative: "The trip was great"
• Quantitative, discrete: 10
• Quantitative, continuous: 3.3

Data values can be:
▶ Numeric:
• Discrete: integer values. Example: number of cars in the car park.
• Continuous: any value in a pre-defined range (float, double). Example: average mark (e.g., 63.4).
▶ Categorical: values are selected from a predefined number of categories.
• Ordinal: categories can be meaningfully ordered. Example: grades (A, B, C, D, E, F).
• Nominal: categories have no order. Example: eye colours (blue, black, honey, etc.).
• Binary: the special case of nominal with only 2 possible categories. Example: binary value (1, 0).
▶ Date: datetime, timestamp. Example: 11.10.2018.
▶ Text: multidimensional data.
▶ Time series: data points indexed in time order.

Data category: data can be one of two main categories, experimental or observational.
Experimental data
Data collected from strictly controlled/designed experiments, with efforts made to ensure statistical validity.
Examples:
• Medical clinical trials
• Election polls
Observational data
Data collected from 'real-world' settings without control over the captured underlying phenomena.
It is easier to collect and obtain, but results and conclusions from such data may be biased or inconclusive.
Example: almost all data used in data mining, business analytics and data science are observational data.

Data type: data can be
• Numbers
• String
• Relational data
• Factors or categorical variables
• Dates and times
• Description
We can read data from various sources or files. Files can be in any format, such as:
• name.CSV
• name.DAT
• name.TXT
• name.XLS
• name.HTML
• name.json

When we get new data, we often ask:
• What is in it?
• What is wrong with it?
• What should I do with it?
Answer:
• Step 1. Import the data into your code.
• Step 2. Organise the data in a readable format.
• Step 3. ...
• ....
• Step n.

Data importing
Data importing is the process of writing R code to get the data from disk (PC) into the R environment.
This lecture covers Step 1.
• Step 1. Import the data into the R environment.
1 Read the data: write R code to import data into the RStudio environment.
2 View the data: explore, access and print.

col 1      col 2      col 3
Value 1    Value 2    Value 3
Value 4    Value 5    Value 6
Value 7    Value 8    Value 9
Value 10   Value 11   Value 12
[Diagram label: Write R Code]

Section 3: Basics of R Programming

Basics of R Programming
How do RStudio and R work?
[Diagram: the CSE5DEV student writes R code in the RStudio interface; R (the programming software) runs in the background on the computer; the output is shown on the PC monitor.]
Note: you ONLY need to write and run your code in the RStudio interface.

RStudio Interface
[Screenshots of the RStudio interface]

In this lecture, we will learn how to write R code for the following tasks:
• Import data: reading data from a file.
• View data.
• Access data.
• Check data types.
• Export data.

Importing data
R provides various functions to import data from the working directory into the R environment. We can import data from different formats such as:
• Text files: txt files.
• Comma-separated values: CSV files.
• Excel files: xls or xlsx files.
• Websites: URLs.
• SPSS files.
• ... etc.

R reading function syntax:
R Code: read data function format
    Object_name <- R_read_function("file_name.ext", Arguments)
• Object_name: the variable that will hold the imported data.
• R_read_function: the function used to read the file, chosen according to the file extension.
• file_name.ext: the name, extension and location of the file to read.
• Arguments: options that control how the file is read.
Examples of R reading functions:
• read.table for text files
• read.csv for CSV files

Read data from text files
Example: read data from a text file called Mytext.txt and assign the data to the dat object (or variable).
R Code: read data from text files
    dat <- read.table("Mytext.txt", header = TRUE, sep = " ", dec = ".")
• The read.table function reads the file and saves its contents in the dat object.
• header = TRUE: the first row of the file holds the header information (column names). In read.table the header argument defaults to FALSE (in read.csv it defaults to TRUE), so set header = TRUE when the file has a header row and header = FALSE when it does not.
• sep = " ": the columns are separated by a single space. Other common choices are "\t" for tabs and "," for commas; the default, sep = "", treats any run of white space as a separator.
• dec = ".": the character used in the file for decimal points is a dot.

Read data from CSV files
Example: read data from a CSV file called data.csv and assign the data to the dat object (or variable).
R Code: read data from a CSV file
    dat <- read.csv("data.csv", header = TRUE, sep = ",")
• read.csv reads the data from "data.csv", which includes a header row and is separated by commas (,).
• By default dat will be a data frame.

View data
We can use the following functions to view/check the data in dat:
• names(): shows the names attribute of a data frame, which gives the column names.
• head(): shows the first 6 rows.
• tail(): shows the last 6 rows.
• dim(): returns the dimensions of the data frame (number of rows and number of columns).
• nrow(): number of rows.
• ncol(): number of columns.
• str(): structure of the data frame: name, type and preview of the data in each column.
• sapply(dataframe, class): shows the class of each column in the data frame.

View data: example of functions for viewing/checking data.
    dat <- read.csv("data.csv", header = TRUE, sep = ",")
    names(dat)
    ##  [1] "Model" "mpg"   "cyl"   "disp"  "hp"    "drat"  "wt"    "qsec"  "vs"
    ## [10] "am"    "gear"  "carb"
    head(dat)
    ##               Model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
    ## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    ## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    ## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    ## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    ## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    ## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
    dim(dat)
    ## [1] 32 12
    nrow(dat)
    ## [1] 32
    ncol(dat)
    ## [1] 12

View data
We can use the print() function to display the contents of dat on the screen.
    print(dat)
    ##                  Model  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    ## 1            Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
    ## 2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    ## 3           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    ## 4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
    ## 5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
    ## 6              Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
    ## 7           Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
    ## 8            Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    ## 9             Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
    ## 10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
    ## 11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
    ## 12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
    ## 13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
    ## 14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
    ## 15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
    ## 16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
    ## 17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
    ## 18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
    ## 19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
    ## 20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
    ## 21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    ## 22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
    ## 23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
    ## 24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
    ## 25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
    ## 26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
    ## 27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
    ## 28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
    ## 29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
    ## 30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
    ## 31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
    ## 32          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

View data
str(dat) displays the structure of the data: the type of each column and a preview of its values.
    str(dat)
    ## 'data.frame': 32 obs. of 12 variables:
    ##  $ Model: chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
    ##  $ mpg  : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
    ##  $ cyl  : int 6 6 4 6 8 6 8 4 4 6 ...
    ##  $ disp : num 160 160 108 258 360 ...
    ##  $ hp   : int 110 110 93 110 175 105 245 62 95 123 ...
    ##  $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
    ##  $ wt   : num 2.62 2.88 2.32 3.21 3.44 ...
    ##  $ qsec : num 16.5 17 18.6 19.4 17 ...
    ##  $ vs   : int 0 0 1 1 0 1 0 1 1 1 ...
    ##  $ am   : int 1 1 1 0 0 0 0 0 0 0 ...
    ##  $ gear : int 4 4 4 3 3 3 3 4 4 4 ...
    ##  $ carb : int 4 4 1 1 2 1 4 2 2 4 ...
Based on the above, we can see that:
• dat is an object of class data.frame.
• Each column holds character, numeric or integer data.

R object, data type, data structure
▶ Objects or variables are used to save data values that R programs can manipulate.
A valid object name consists of letters, numbers and the dot or underscore characters. It should start with a letter, or with a dot that is not followed by a number.
▶ Examples of valid and invalid object names:
1 object_name2. : valid; contains letters, numbers, dot and underscore.
2 object_name% : invalid; contains the character '%'. Only dot (.) and underscore (_) are allowed.
3 2object_name : invalid; starts with a number.
4 .object_name, object.name : valid; a name can start with a dot (.), but the dot should not be followed by a number.
5 .2object_name : invalid; the dot is followed by a number.
6 _object_name : invalid; starts with an underscore (_), which is not allowed.
▶ Object assignment: objects are assigned values using the <- symbol. For example, x <- 5, y <- 5.2, z <- "CSE5DEV".

R object, data type, data structure
▶ Objects are reserved memory locations that store values. They can store data of different types, and different types support different operations. The kind of value an object stores is known as its R data type.
▶ In R the data type can be one of the following:
1 Logical: TRUE, FALSE.
2 Integer: 21L, 3L, etc. The letter "L" declares the value as an integer.
3 Numeric: real or decimal (2.1, 2.0, pi).
4 Character: "a" or "swc".
5 Complex: 1 + 0i or 1 + 4i.
6 Date values: "2021-07-26" (once converted with as.Date()).
▶ We can use the class() or typeof() function to check the data type of an object.
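The naming rules and type checks above can be tried out with a short sketch. It assumes the spaces in the transcript's example names were underscores on the original slide, and it uses base R's make.names(), which rewrites a string into a syntactically valid name, as a convenient validity test; the variable names candidates, is_valid and d are illustrative only.

```r
# Candidate names from the slide (underscores assumed where the
# transcript shows spaces).
candidates <- c("object_name2.", "object_name%", "2object_name",
                ".object_name", "object.name", ".2object_name",
                "_object_name")

# make.names() returns a syntactically valid version of each string,
# so a candidate is already valid when make.names() leaves it unchanged.
is_valid <- candidates == make.names(candidates)
print(data.frame(name = candidates, valid = is_valid))

# class() reports the high-level type; typeof() reports the storage type.
x <- 5.5
y <- 200L
d <- as.Date("2021-07-26")
print(c(class(x), typeof(x)))  # "numeric" "double"
print(c(class(y), typeof(y)))  # "integer" "integer"
print(c(class(d), typeof(d)))  # "Date"    "double"
```

The last line shows why the two functions are both useful: a Date is stored as a number internally, which typeof() reveals, while class() tells you how R will treat and print it.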
R object, data type, data structure: examples of R object assignment and data types
    # numeric
    x <- 5.5
    class(x)
    ## [1] "numeric"
    # integer
    x <- 200L
    class(x)
    ## [1] "integer"
    # complex
    x <- 6i + 2
    class(x)
    ## [1] "complex"
    # character/string
    x <- "R CSE5DEV"
    class(x)
    ## [1] "character"
    # logical/boolean
    x <- TRUE
    class(x)
    ## [1] "logical"

R object, data type, data structure: R data type conversion
▶ In R, we can convert a value from one type to another using the following functions:
• as.numeric()
• as.integer()
• as.complex()
• as.Date()
▶ Examples of data type conversion:
    x <- 2L # integer
    y <- 4  # numeric
    # convert from integer to numeric:
    a <- as.numeric(x)
    # convert from numeric to integer:
    b <- as.integer(y)
    # print values of x and y
    print(x)
    ## [1] 2
    print(y)
    ## [1] 4
    # print the class name of a and b
    class(a)
    ## [1] "numeric"
    class(b)
    ## [1] "integer"

R object, data type, data structure: R data structures
▶ Data structures are used to store data, keep it organised, and enable easy modification and access.
▶ Data structures store a SET of data values that relate to each other, and allow us to perform operations or functions on these values.
▶ Examples of R data structures are:
1 Vectors.
2 Matrices.
3 Data Frames.
4 Factors.

R object, data type, data structure: R data structures: Vectors
▶ Vectors store an ordered set of items (or values) of the same type.
▶ We use the c() function to declare a vector consisting of a set of values separated by commas.
▶ We can create a vector that combines a set of values as follows:
    # Vector of numerical values
    numbers <- c(1, 2, 3, 4)
    print(numbers)
    ## [1] 1 2 3 4
    # Vector of strings
    fruits <- c("apple", "orange", "banana")
    print(fruits)
    ## [1] "apple" "orange" "banana"
    # We can create a vector using the colon (:) operator
    numbers <- 1:10
    print(numbers)
    ## [1] 1 2 3 4 5 6 7 8 9 10

R object, data type, data structure: R data structures: Vectors
Some useful functions for vectors:
▶ Vector length: length() returns the number of values.
▶ Sort a vector: sort() sorts values alphabetically or numerically.
▶ Access vectors: use [] brackets to access the vector items by index number.
▶ Change an item value: use [index number] to change the value of a specific item.
▶ Repeat vectors: use rep() to repeat vector items.

R object, data type, data structure: R data structures: Vectors
Examples of vector functions.
    # Vector length
    fruits <- c("banana", "apple", "orange")
    length(fruits)
    ## [1] 3
    fruits <- c("banana", "apple", "orange", "mango", "lemon")
    numbers <- c(13, 3, 5, 7, 20, 2)
    # Sort vector
    sort(fruits) # Sort strings
    ## [1] "apple" "banana" "lemon" "mango" "orange"
    sort(numbers) # Sort numbers
    ## [1] 2 3 5 7 13 20
    # Access vectors
    fruits <- c("banana", "apple", "orange")
    # Access the first item (banana)
    fruits[1]
    ## [1] "banana"
    # Access the third item (orange)
    fruits[3]
    ## [1] "orange"
    # Change an item
    fruits <- c("banana", "apple", "orange", "mango", "lemon")
    # Change "apple" to "pear"
    fruits[2] <- "pear"
    print(fruits)
    ## [1] "banana" "pear" "orange" "mango" "lemon"
    # Repeat vector
    repeat_vec <- rep(c(1, 2, 3), each = 3)
    print(repeat_vec)
    ## [1] 1 1 1 2 2 2 3 3 3

R object, data type, data structure: R data structures: Vectors
Examples of vector operations. [Slide figure not reproduced in the transcript]
Functions for vectors. [Slide figure not reproduced in the transcript]

R object, data type, data structure: R data structures: Matrices
▶ A matrix stores data in a two-dimensional rectangular layout with columns and rows.
▶ A column is a vertical representation of data, while a row is a horizontal representation of data.
▶ We use the matrix() function to create a matrix. We also specify the nrow and ncol parameters to set the number of rows and columns.
    # Create a matrix (values are filled column by column)
    matr <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
    print(matr)
    ##      [,1] [,2]
    ## [1,]    1    4
    ## [2,]    2    5
    ## [3,]    3    6

R object, data type, data structure: R data structures: Matrices
Some useful functions for matrices:
▶ Access matrix items: use [] brackets to access items using two index numbers: the first for the row and the second for the column.
▶ Access more than one row or column: use [] and c(): [c(1,2), ] or [, c(1,2)].
▶ Add rows and columns: use cbind() to add columns and rbind() to add rows.
▶ Remove rows and columns: use negative indices with c(): [-c(1), -c(1)].
▶ Check if an item exists: use the %in% operator: item %in% matrix.
▶ Matrix size: dim() returns the number of rows and columns.
▶ Matrix length: length() returns the total number of items in the matrix (rows × columns).

R object, data type, data structure: R data structures: Matrices
    # Access matrix items
    mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
    print(mart[1, 2])
    ## [1] "cherry"
    print(mart[2, ])
    ## [1] "banana" "orange"
    # Access more than one row
    mart <- matrix(c("apple", "banana", "cherry", "orange", "grape",
                     "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
    print(mart[c(1, 2), ])
    ##      [,1]     [,2]     [,3]
    ## [1,] "apple"  "orange" "pear"
    ## [2,] "banana" "grape"  "melon"
    # Access more than one column
    print(mart[, c(1, 2)])
    ##      [,1]     [,2]
    ## [1,] "apple"  "orange"
    ## [2,] "banana" "grape"
    ## [3,] "cherry" "pineapple"
    # Add rows and columns
    mart <- matrix(c("apple", "banana", "cherry", "orange", "grape",
                     "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
    print(mart)
    ##      [,1]     [,2]        [,3]
    ## [1,] "apple"  "orange"    "pear"
    ## [2,] "banana" "grape"     "melon"
    ## [3,] "cherry" "pineapple" "fig"
    # Add a column
    newmatrix <- cbind(mart, c("strawberry", "blueberry", "raspberry"))
    print(newmatrix)
    ##      [,1]     [,2]        [,3]    [,4]
    ## [1,] "apple"  "orange"    "pear"  "strawberry"
    ## [2,] "banana" "grape"     "melon" "blueberry"
    ## [3,] "cherry" "pineapple" "fig"   "raspberry"
    # Add a row
    newmatrix <- rbind(mart, c("strawberry", "blueberry", "raspberry"))
    print(newmatrix)
    ##      [,1]         [,2]        [,3]
    ## [1,] "apple"      "orange"    "pear"
    ## [2,] "banana"     "grape"     "melon"
    ## [3,] "cherry"     "pineapple" "fig"
    ## [4,] "strawberry" "blueberry" "raspberry"

R object, data type, data structure: R data structures: Matrices
    mart <- matrix(c("apple", "banana", "cherry", "orange", "mango",
                     "pineapple"), nrow = 3, ncol = 2)
    # Remove the first row and the first column
    mart <- mart[-c(1), -c(1)]
    print(mart)
    ## [1] "mango" "pineapple"
    # Check if an item exists
    mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
    "apple" %in% mart
    ## [1] TRUE
    # Check the number of rows and columns
    dim(mart)
    ## [1] 2 2
    # Matrix length
    length(mart)
    ## [1] 4

R object, data type, data structure: R data structures: Matrices
Examples of matrix operations. [Slide figure not reproduced in the transcript]
Functions for matrix. [Slide figure not reproduced in the transcript]
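As a minimal sketch of the kind of matrix operations and functions these slides refer to: the particular operations shown here (element-wise arithmetic, transpose, matrix product, row and column summaries) are an assumption, since the slide content itself is not in the transcript. The names a, b and prod_ab are illustrative.

```r
# Two 3 x 2 matrices built from the same values.
# matrix() fills column by column unless byrow = TRUE is given.
a <- matrix(1:6, nrow = 3, ncol = 2)                # columns: 1 2 3 | 4 5 6
b <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)  # rows: 1 2 | 3 4 | 5 6

# Element-wise arithmetic works on matrices of the same size.
print(a + b)
print(a * 2)

# t() transposes: a 3 x 2 matrix becomes 2 x 3.
print(dim(t(a)))  # 2 3

# %*% is the matrix product: (2 x 3) %*% (3 x 2) gives a 2 x 2 result.
prod_ab <- t(a) %*% b
print(prod_ab)

# Row and column summaries.
print(rowSums(a))   # 5 7 9
print(colMeans(a))  # 2 5
```

Note the difference between `*`, which multiplies matching elements, and `%*%`, which performs linear-algebra matrix multiplication and requires the inner dimensions to agree.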
R object, data type, data structure: R data structures: Matrices
The following functions can be used to check the data type of each column:
1 is.numeric(): checks if the data is numeric (TRUE or FALSE).
2 is.integer(): checks if the data is integer (TRUE or FALSE).
3 is.factor(): checks if the data is a factor (TRUE or FALSE).
4 is.character(): checks if the data is character (TRUE or FALSE).

R object, data type, data structure: R data structures: Factors
▶ Factors can be used to categorise data and store it as levels.
▶ Factors store both strings and integers: the level labels are strings, and each item is stored as an integer code pointing to its level.
▶ They are very useful for columns that have a limited number of unique values: Demography {Male, Female}, Music {Rock, Classic, Jazz}, Training {Strength, Stamina}, Logical {True, False}.
▶ We use the factor() function to create a factor, passing a vector created with c() as its argument.
    # Create a factor
    music <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
    print(music)
    ## [1] Jazz Rock Classic Classic Pop Jazz Rock Jazz
    ## Levels: Classic Jazz Pop Rock

R object, data type, data structure: R data structures: Factors
Some useful functions for factors:
▶ Levels: use the levels() function to print or set the factor levels.
▶ Factor length: the length() function returns the number of items.
▶ Access factors: use [] brackets to access factor items.
▶ Change item value: use [] and the item index number to change its value.
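The factor behaviour described above can be sketched as follows. The example reuses the music values from the slide; the "Blues" value and the variable name changed are illustrative assumptions, added to show what happens when an item is set to a value that is not one of the levels.

```r
music <- factor(c("Jazz", "Rock", "Classic", "Classic",
                  "Pop", "Jazz", "Rock", "Jazz"))

# A factor is stored as integer codes plus a table of level labels.
print(levels(music))      # "Classic" "Jazz" "Pop" "Rock"
print(as.integer(music))  # 2 4 1 1 3 2 4 2

# The is.*() checks distinguish a factor from a plain character vector.
print(is.factor(music))     # TRUE
print(is.character(music))  # FALSE

# table() counts how many items fall in each level.
print(table(music))

# Changing an item works only for values that are existing levels.
changed <- music
changed[4] <- "Pop"                      # fine: "Pop" is a level
suppressWarnings(changed[1] <- "Blues")  # not a level: the item becomes NA
print(changed)
```

This is why the slide's later example adds "Other" to the levels up front: a value has to be declared as a level before it can be assigned to an item.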
R object, data type, data structure: R data structures: Factors
Examples of factor functions:
    # Print levels
    music <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
    print(levels(music))
    ## [1] "Classic" "Jazz" "Pop" "Rock"
    # Set the levels
    music <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"),
                    levels = c("Classic", "Jazz", "Pop", "Rock", "Other"))
    print(levels(music))
    ## [1] "Classic" "Jazz" "Pop" "Rock" "Other"
    # Factor length
    length(music)
    ## [1] 8
    # Access factors
    print(music[3])
    ## [1] Classic
    ## Levels: Classic Jazz Pop Rock Other
    # Change item value
    music[4] <- "Pop"
    print(music[4])
    ## [1] Pop
    ## Levels: Classic Jazz Pop Rock Other

R object, data type, data structure: R data structures: Data Frames
The data frame is the most common and practical way of storing data in R, especially in data analysis.
▶ data.frame shows data in a table format.
▶ data.frame stores different types of data inside it.
▶ Different columns can have different data types. For example, the first column can be numeric, the second character and the third logical, etc.
▶ However, all values within a single column must have the same data type.
▶ We use the data.frame() function to create a data frame.
Note: read.csv()
All files that we import using the read.csv() function are stored as data.frame data structures.

R object, data type, data structure: R data structures: Data Frames
Example: create a data frame consisting of 3 columns and 3 rows.
# Create a data frame data_frame <- data.frame ( Training = c("Strength", "Stamina", "Other"), ID = c(10, 11, 13), Time = c(6.6, 3.2, 4.0) ) print (data_frame) ## Training ID Time ## 1 Strength 10 6.6 ## 2 Stamina 11 3.2 ## 3 Other 13 4.0 Section 1: CSE5DEV Syllabus Section 2: Data Collection Section 3: Basics of R Programming R object, data type, data structure — R Data Structures: Data Frames — Some of useful functions for Data Frames: ▶ Summarise the data: use summary() function to summarise the data. ▶ Access items: use single [], double brackets [ [] ] and $ to access columns. ▶ Add rows and columns: use rbind() to add rows and cbind() to add columns. ▶ Remove rows and columns: use c() to remove rows and columns. ▶ Number of rows and columns: use dim() or ncol() & nrow() to find the number of rows and columns. ▶ Data frame length: length() returns the number of columns. Section 1: CSE5DEV Syllabus Section 2: Data Collection Section 3: Basics of R Programming R object, data type, data structure — R Data Structures: Data Frames — # Summarise the Data # Create a data frame data_frame <- data.frame ( Training = c("Strength", "Stamina", "Other"), ID = c(10, 11, 13), Time = c(6.6, 3.2, 4.0) ) print (data_frame) ## Training ID Time ## 1 Strength 10 6.6 ## 2 Stamina 11 3.2 ## 3 Other 13 4.0 summary(data_frame) ## ## ## ## ## ## ## Training Length:3 Class :character Mode :character ID Min. :10.00 1st Qu.:10.50 Median :11.00 Mean :11.33 3rd Qu.:12.00 Max. :13.00 Time Min. :3.2 1st Qu.:3.6 Median :4.0 Mean :4.6 3rd Qu.:5.3 Max. 
:6.6 Section 1: CSE5DEV Syllabus Section 2: Data Collection Section 3: Basics of R Programming R object, data type, data structure — R Data Structures: Data Frames — # Access items data_frame[1] ## Training ## 1 Strength ## 2 Stamina ## 3 Other data_frame[["Training"]] ## [1] "Strength" "Stamina" "Other" data_frame$Training ## [1] "Strength" "Stamina" "Other" # Add a new row New_row_DF <- rbind(data_frame, c("Strength", 110, 11.0)) print (New_row_DF) ## ## ## ## ## Training ID Time 1 Strength 10 6.6 2 Stamina 11 3.2 3 Other 13 4.0 4 Strength 110 11.0 Section 1: CSE5DEV Syllabus Section 2: Data Collection Section 3: Basics of R Programming R object, data type, data structure — R Data Structures: Data Frames — # Add a new column New_col_DF <- cbind(data_frame, Steps = c(1000, 6000, 2000)) print(New_col_DF) ## Training ID Time Steps ## 1 Strength 10 6.6 1000 ## 2 Stamina 11 3.2 6000 ## 3 Other 13 4.0 2000 # Remove the first row and column Data_Frame_New <- data_frame[-c(1), -c(1)] print (Data_Frame_New) ## ID Time ## 2 11 3.2 ## 3 13 4.0 # find the number of rows and columns print (dim(data_frame)) ## [1] 3 3 # Data Frame Length print (length(data_frame)) ## [1] 3 Section 1: CSE5DEV Syllabus Section 2: Data Collection Section 3: Basics of R Programming R object, data type, data structure — R Data Structures: Data Frames — Note: data frame rules In R, all data frames should respect the following rules. ▶ All column names should be non-empty. ▶ All row names should be unique. ▶ The data stored in data frame columns can be of numeric, factor or character. ▶ Each column should contains the same number of items and data type. Section 1: CSE5DEV Syllabus Section 2: Data Collection Section 3: Basics of R Programming R object, data type, data structure — R Data Structures: Data Frames — Example: create data frame for five employees consists of employee ID, name, salary and starting date. # Create employee data frame. 
employee <- data.frame(
  employee_id = c(1:5),
  employee_name = c("A", "B", "C", "D", "E"),
  employee_salary = c(611.3, 512.2, 621.0, 722.0, 343.21),
  start_date = as.Date(c("2014-01-01", "2015-08-23", "2016-10-15", "2016-04-11", "2016-04-26")),
  stringsAsFactors = FALSE
)
print(employee)
##   employee_id employee_name employee_salary start_date
## 1           1             A          611.30 2014-01-01
## 2           2             B          512.20 2015-08-23
## 3           3             C          621.00 2016-10-15
## 4           4             D          722.00 2016-04-11
## 5           5             E          343.21 2016-04-26

Please note:
1 In R versions before 4.0.0, data.frame() converts character strings to factors (distinct groups) by default. From R 4.0.0 onward the default is stringsAsFactors = FALSE.
2 Use stringsAsFactors = FALSE if you are going to change the strings or want to keep them as plain strings.

Example: import and view data
dat <- read.csv("data.csv", header=TRUE, sep =",")
str(dat)
## 'data.frame': 32 obs. of 12 variables:
## $ Model: chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"
## $ mpg  : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl  : int 6 6 4 6 8 6 8 4 4 6 ...
## $ disp : num 160 160 108 258 360 ...
## $ hp   : int 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt   : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs   : int 0 0 1 1 0 1 0 1 1 1 ...
## $ am   : int 1 1 1 0 0 0 0 0 0 0 ...
## $ gear : int 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : int 4 4 1 1 2 1 4 2 2 4 ...
dim(dat)
## [1] 32 12
class(dat)
## [1] "data.frame"
class(dat$Model)
## [1] "character"
class(dat[[2]])
## [1] "numeric"

# Summary of dat
summary(dat)
##     Model                mpg             cyl             disp
##  Length:32          Min.   :10.40   Min.   :4.000   Min.   : 71.1
##  Class :character   1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8
##  Mode  :character   Median :19.20   Median :6.000   Median :196.3
##                     Mean   :20.09   Mean   :6.188   Mean   :230.7
##                     3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0
##                     Max.   :33.90   Max.   :8.000   Max.   :472.0
##        hp             drat             wt             qsec
##  Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50
##  1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89
##  Median :123.0   Median :3.695   Median :3.325   Median :17.71
##  Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85
##  3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90
##  Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90
##        vs               am              gear            carb
##  Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000

Exporting data
The data stored in objects can be exported and saved as text or CSV files using the following functions:
▶ write.table: export a text file: write.table(<data to export>, file = "file name.txt", sep = " ").
▶ write.csv: export a CSV file: write.csv(<data to export>, file = "file name.csv"). write.csv() always uses a comma as the separator, so a sep argument is ignored with a warning.

Basics of R Programming
In this lecture, we have learned
1 how to import data into the R environment (RStudio -> R Markdown).
2 how to view data in R.
3 objects and how to manipulate them.
4 R data types.
5 R data structures.
6 how to export data.
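The export functions covered in this lecture can be sketched as a small round trip. This is a minimal illustration rather than the lab solution: the object name training, the temporary file paths and the choice of row.names = FALSE are assumptions made for the example, and the values reuse the small training data frame from the slides.

```r
# Small example data frame; the name `training` is an assumption for illustration.
training <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Time = c(6.6, 3.2, 4.0)
)

# Write to temporary files so the sketch does not touch the working directory.
txt_path <- tempfile(fileext = ".txt")
csv_path <- tempfile(fileext = ".csv")

# Space-delimited text file; row.names = FALSE omits the row-number column.
write.table(training, file = txt_path, sep = " ", row.names = FALSE)

# Comma-separated file; write.csv() fixes the separator, so sep is not passed.
write.csv(training, file = csv_path, row.names = FALSE)

# Read both files back to check that the exported data can be re-imported.
from_txt <- read.table(txt_path, header = TRUE, sep = " ")
from_csv <- read.csv(csv_path, header = TRUE)
print(from_csv)
```

Reading the files back with read.table() and read.csv() is a quick way to confirm that the separator and header settings used for export match the ones used for import.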
End of Week 2
See you next lecture (Week 3): Data Wrangling & R Programming
Table: CSE5DEV Timetable (check LMS)

CSE5DEV DATA EXPLORATION AND ANALYSIS
Week 3: Data Wrangling & R Programming

Overview
1 CSE5DEV Syllabus
2 Week-Overview
3 Data Wrangling
4 Basics of R Programming

Week 3 Overview
Data Wrangling & R Programming
This week covers the basics of data wrangling and R programming.
Learning outcomes:
• Learn about data representation.
• Learn how to convert data from one format to another.
• Learn R programming conditional statements.
• Learn how to use R programming packages.

What we have learned so far?
Data can come in different formats, but a computer program expects the data to be organised in a well-defined structure.

Theory
• Data Collection: working with data
1 Data sources: PC, internet, external.
2 Data formats: text, CSV, URL, etc.
3 Data values: qualitative or quantitative.
4 Data categories: experimental or observational.

R Programming
1 Install R and RStudio, create an R Markdown file, write and run basic code, etc.
2 Data types and data structures (vector, factor, matrix and data frame).
3 View, access and change data.
4 Import data into the R environment (text and CSV files).

Note
The above steps (reading, viewing, accessing, changing, etc.) are crucial for Lectures 3 to 11. If you DON'T know how to perform them in R, please let us know as soon as possible.

Data Wrangling
Data wrangling can be defined as the process of organising data into a consistent representation or format that can be easily used and presented.
Figure: workflow from a CSV file through R code (import CSV file, view data, data type, data structure, access data) in the RStudio environment, transforming data into a readable format.

Example: Consider the country population dataset (data1.csv).
The same data can be organised in different representations, as shown in the next slides.
Examples: format-1, format-2, format-3, format-4 and format-5 (one figure per slide).

From the previous examples, we have seen that
• The same data can be organised in different representations or formats.
• Each format shows the same values of four variables: country, year, population and cases.
• Different formats show the values in different representations.

Q: What type of representation will be used in CSE5DEV labs?
A: Tabular representation (observations-by-features).
Figure: Image from R for Data Science

Tabular representation
In CSE5DEV, we use the data frame data structure.
Figure: Image from R for Data Science
Organising data as observations-by-features is considered the most convenient and standard representation for data analysis.

Tabular data
Types of features/attributes:
It is important to recognise the types of values each feature/attribute takes in order to understand which operations make sense for it.
Example
• Can we compute an average eye colour?
• How do we compute the difference between phone numbers?
• Can we say today is 'twice as hot/cold' as yesterday?
This is similar to problems like 6 apples / 4 people = 1.5 apples per person, but 10 people / 4 car seats = 3 cars.

Tabular data
Qualitative vs. Quantitative attributes:
Attribute values can be split into two types:
Qualitative attributes
Attributes that take values from a (finite) set of categories are called categorical or qualitative attributes. In some sense, they describe an object/observation, rather than measure its properties.
Quantitative attributes
Attributes that represent quantities are called numerical or quantitative attributes. They provide concrete quantifiable measurements of an object/observation.

Qualitative: Nominal vs. Ordinal:
Qualitative attributes can be split further into two types:
Nominal attributes
Examples: zip codes, eye colour, operating system, gender.
Values of such attributes just specify names without any particular order or relation between them (except for = and ≠). Binary attributes are nominal attributes with only two values (Yes/No or 0/1). They can be symmetric or asymmetric based on whether or not their values are equally informative.
Ordinal attributes
Examples: ratings, grades, street/avenue numbers.
Values of such attributes have some order, even though they don't specify an exact quantity.

Quantitative: Interval vs. Ratio:
Quantitative attributes can also be split into two types:
Interval attributes
Examples: calendar dates, azimuth direction, Fahrenheit temperatures.
Such attributes represent quantities with meaningful differences (or fixed intervals) between their values (but no multiplicative relations).
Ratio attributes
Examples: mass, length, distance, currency, age, electrical current.
Such attributes represent quantities that have meaningful ratios between their values. Unlike interval attributes, ratio ones usually have an 'absolute zero'.
We can also split quantitative attributes into discrete and continuous ones. All qualitative attributes are considered discrete.

Tabular data
Summary of attribute types:
The types of attributes can be characterised by the operations that can be applied to them:
• Comparison (= and ≠): every type
• Ordering (> and <): every type except nominal
• Differences (-) and addition (+): only quantitative
• Division (/) and multiplication (×, ·): only ratio
Other operations (e.g., mean, median, correlation) may also be inapplicable for some types while applicable to others.

Technical formats:
Tabular data can be stored or collected in several standard formats, such as:
• Comma-separated values file (CSV)
• Flat file or delimited text file (e.g., space or tab delimited)
• XML or other log files
• Proprietary formats (e.g., FCS for biological data or MAT files for Matlab data)
• Database tables
Non-tabular data: transactional data (term matrix, text documents), structured signals, multidimensional signals, nonparametric representations.

Tabular representation
In a tabular representation, we need to make sure that
Figure: Image from R for Data Science
• Each variable must have its own column.
• Each observation must have its own row.
• Each value must have its own cell.
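The three rules above can be illustrated with a small data frame that uses the four variables named in the lecture (country, year, cases, population). This is only a sketch: the values are made up for illustration and are not taken from data1.csv, and the names tidy, rate and year_2000 are assumptions.

```r
# Illustrative values only; they are not taken from data1.csv.
tidy <- data.frame(
  country = c("A", "A", "B", "B"),
  year = c(1999, 2000, 1999, 2000),
  cases = c(10, 20, 300, 450),
  population = c(1000, 1100, 20000, 21000)
)

# One column per variable: a derived variable is a single vectorised expression.
tidy$rate <- tidy$cases / tidy$population * 10000

# One row per observation: a logical condition selects whole observations.
year_2000 <- tidy[tidy$year == 2000, ]
print(year_2000)
```

Because every value sits in its own cell, both the derived column and the row filter work without any string splitting or reshaping first.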
Data Wrangling
Tabular representation
If the data is not in a tabular representation, then we need to perform one or more processes to convert it into a tabular representation. Examples of these processes are:
• Gathering and Spreading.
• Separating and Uniting.
• Filtering.
• Grouping.
• Mutating.

Example: Gathering process - gather columns into a new pair of variables
Figure: Image from R for Data Science
• gather(data, key, value, ...)
• data is the data frame you are working with.
• key is the name of the key column to create.
• value is the name of the value column to create.
• ... is a way to specify which columns to gather from.
R Code: gather() function
gather(data, "year", "cases", 2:3)

Example: Spreading process - Spreading is the opposite of gathering.
Figure: Image from R for Data Science
• spread(data, key, value)
• data is your data of interest.
• key is the column whose values will become variable names.
• value is the column whose values will fill in under the new variables created from key.
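The gathering and spreading steps described above can be tried on a small table. This is a minimal sketch, assuming the tidyr package (which provides gather() and spread()) is installed; the data values and the object names wide, long and wide_again are hypothetical and chosen only for illustration.

```r
library(tidyr)  # assumed to be installed; provides gather() and spread()

# Hypothetical wide table: one column per year, values are case counts.
wide <- data.frame(
  country = c("A", "B", "C"),
  `1999` = c(10, 300, 2000),
  `2000` = c(20, 450, 2100),
  check.names = FALSE  # keep "1999" and "2000" as column names
)

# Gather columns 2 and 3 into a key column "year" and a value column "cases".
long <- gather(wide, "year", "cases", 2:3)
print(long)

# Spread the "year" key back out into one column per year.
wide_again <- spread(long, year, cases)
print(wide_again)
```

Newer tidyr releases document pivot_longer() and pivot_wider() as the successors of gather() and spread(); the older functions still work and are the ones used in these slides.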
Data Wrangling
Tabular representation
Example: Spreading process - Spreading is the opposite of gathering.
Figure: Image from R for Data Science
R Code: spread() function
spread(data, key, value)

Example: Separating process - pulls apart one column into multiple columns, by splitting wherever a separator character appears
Figure: Image from R for Data Science
• separate(data, col, into, sep)
• data is the data frame of interest.
• col is the column that needs to be separated.
• into is a vector of names of the columns the data will be separated into.
• sep is the separator at which the data is split.
R Code: separate() function
separate(data, rate, c("cases", "population"), sep="/")

Example: Uniting process - the inverse of separate. It combines multiple columns into a single column.
Figure: Image from R for Data Science
• unite(data, col, ..., sep)
• data is the data frame of interest.
• col is the name of the new column you wish to add.
• ... is the names of the columns you wish to unite.
• sep is how you wish to join the data in the columns.
R Code: unite() function
unite(data, "year", century, year, sep="")

Five main verbs
• Select - select variables by their names.
• Filter - choose rows that satisfy some criteria.
• Arrange - reorder the rows.
• Mutate - create transformed or derived variables.
• Summarise - collapse rows down to summaries.

Basics of R Programming
In previous lectures, we have learned
• How to read data from a file.
• Variables, variable names and data types.
• Data structures: vector, factor, matrix and data frame.
• View, access, change, etc.
dat <- read.csv("data.csv", header=TRUE, sep =",")
• names() - shows the names attribute of a data frame.
• head() - shows the first 6 rows.
• tail() - shows the last 6 rows.
• dim() - returns the dimensions of the data frame.
• nrow() - number of rows.
• ncol() - number of columns.
• str() - structure of the data frame: name, type and preview of the data in each column.
• sapply(dataframe, class) - shows the class of each column in the data frame.

In this lecture, we will learn how to write R code for the following tasks:
• Logical conditions to select subsets
• Conditional execution: if statements
• Repetitive execution: for loops, repeat and while
• Packages
• Format transform

View data
Example: read data from file.
dat <- read.csv("data.csv", header=TRUE, sep =",")
names(dat)
##  [1] "Model" "mpg"   "cyl"   "disp"  "hp"    "drat"  "wt"    "qsec"  "vs"
## [10] "am"    "gear"  "carb"
head(dat)
##               Model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
• We may need to extract data that satisfy certain criteria.
• For example, we may want to select the rows whose disp value is less than or equal to 160.
• We can use logical condition operators to select a subset of the data.

Logical condition operators: Conditional operators
Conditional operators are used to compare values or expressions. They return TRUE (1) or FALSE (0).
Examples: Conditional operators for two variables: x and y.
x <- 4
y <- 15
x < y
## [1] TRUE
x > y
## [1] FALSE
x <= 5
## [1] TRUE
y >= 20
## [1] FALSE
y == 16
## [1] FALSE
x != 5
## [1] TRUE

Examples: Conditional operators for a vector x
x <- c(3, 5, 1, 2, 7, 6, 4)
x < 5 # is x less than 5
## [1] TRUE FALSE TRUE TRUE FALSE FALSE TRUE
x <= 5 # is x less than or equal to 5
## [1] TRUE TRUE TRUE TRUE FALSE FALSE TRUE
x > 3 # is x greater than 3
## [1] FALSE TRUE FALSE FALSE TRUE TRUE TRUE
x >= 3 # is x greater than or equal to 3
## [1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE
x == 2 # is x equal to 2
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE
x != 2 # is x not equal to 2
## [1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE

Useful functions: all, any and which
• The all and any functions check whether all entries, or at least one entry, of a logical vector are TRUE, respectively.
x <- c(3, 5, 1, 2, 7, 6, 4)
any(x == 2)
## [1] TRUE
all(x == 2)
## [1] FALSE
all(x < 10)
## [1] TRUE
• The function which returns the indices of the entries that are TRUE.
x <- c(3, 5, 1, 2, 7, 6, 4)
which(x == 2) # the fourth element of x is equal to 2
## [1] 4
which(x < 3) # the third and fourth elements of x are less than 3
## [1] 3 4
y <- which(x < 3)
print(y)
## [1] 3 4
print(typeof(y))
## [1] "integer"

Logical condition operators: Logical Operators
Logical operators can be used to combine two or more conditions. In this subject, we will only use the element-wise operators: !, & and |. All operators compare vectors element by element and then return TRUE (1) or FALSE (0).
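As a brief sketch of how the element-wise operators described above combine conditions, the following uses a made-up vector. The name scores and its values are hypothetical and serve only to illustrate the operators.

```r
# Hypothetical marks, used only to illustrate the operators.
scores <- c(55, 72, 90, 48, 66)

# & is TRUE only where both conditions hold.
mid_range <- (scores >= 50) & (scores < 80)

# | is TRUE where at least one condition holds.
extreme <- (scores < 50) | (scores >= 80)

# ! flips every entry, so !mid_range matches extreme for this vector.
print(identical(!mid_range, extreme))

# TRUE counts as 1 and FALSE as 0, so sum() counts the matching entries.
print(sum(mid_range))

# which() reports the positions of the TRUE entries.
print(which(extreme))
```

Treating TRUE as 1 and FALSE as 0 is what makes sum() on a logical vector a convenient way to count how many elements satisfy a condition.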
Logical condition operators: Logical Operators
Examples: Logical operators for a vector x
x <- c(3, 5, 1, 2, 7, 6, 4)
(x > 2) & (x <= 6) # is x greater than 2 and less than or equal to 6
## [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE
(x < 2) | (x > 5) # is x less than 2 or greater than 5
## [1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE
!(x > 3) # not (x greater than 3)
## [1] TRUE FALSE TRUE TRUE FALSE FALSE FALSE

Consider the following example:
x <- c(5, 3, 7, 9, 10)
• We want to extract the values of the vector x which are greater than 5 (7, 9, 10). There are two methods:
1 Method 1
x <- c(5, 3, 7, 9, 10)
ind <- x > 5 # is x greater than 5
print(ind)
## [1] FALSE FALSE TRUE TRUE TRUE
print(x[ind])
## [1] 7 9 10
2 Method 2
x <- c(5, 3, 7, 9, 10)
x[x > 5]
## [1] 7 9 10

Logical condition operators: Logical Condition Operators
• We may need to extract data that satisfy certain criteria.
• For example, we may want to select the rows whose disp value is less than or equal to 160.
• We can use logical condition operators to select a subset of the data.
s <- dat[dat$disp<=160, ]
print(s)
##             Model  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1       Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## 2   Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## 3      Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 8       Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 9        Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 18       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 19    Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 20 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 21  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 26      Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 27  Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 28   Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 30   Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 32     Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Logical condition operators: Logical Condition Operators
• We may need to extract data that satisfy certain criteria.
• For example, we may want to select the rows whose disp value is less than or equal to 160 AND whose hp is less than 110.
z <- dat[dat$disp<=160 & dat$hp<110,]
print(z)
##             Model  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 3      Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## 8       Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 9        Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 18       Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 19    Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 20 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 21  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 26      Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 27  Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 32     Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Logical condition operators: Logical Condition Operators
• We may need to extract data that satisfy certain criteria.
• For example, we may want to select only the wt column for the rows whose disp value is less than or equal to 160 AND whose hp is less than 110.
x <- dat[dat$disp<=160 & dat$hp<110, "wt"]
print(x)
## [1] 2.320 3.190 3.150 2.200 1.615 1.835 2.465 1.935 2.140 2.780

Conditional execution: if statements
Conditional execution
A conditional execution (or if statement) executes some code (one or more statements) only if some condition is met.
• If statements h