week02.pdf
La Trobe University
Section 1: CSE5DEV Syllabus Section 2: Data Collection Section 3: Basics of R Programming

Week 2
CSE5DEV DATA EXPLORATION AND ANALYSIS

Overview
1 Section 1: CSE5DEV Syllabus
2 Section 2: Data Collection
3 Section 3: Basics of R Programming

Subject Syllabus
Lecture 1 Introduction
Lecture 2 Data Collection & R Programming
Lecture 3 Data Wrangling & R Programming
Lecture 4 Data Cleaning & Normalisation
Lecture 5 Data Visualisation
Lecture 6 Data Exploration 1 Analysis
Lecture 7 Data Exploration 2 Analysis
Lecture 8 Data Exploration 3 Analysis
Lecture 9 Correlation & Pattern Discovery Analysis
Lecture 10 Case Study 1
Lecture 11 Case Study 2
Lecture 12 Revision

What have we learned so far?
Lecture 1: Introduction
1 Install R and RStudio.
2 Create an R Markdown file.
3 Add a chunk of code.
4 Write and run basic code.

Data Science Project
Almost all data science and analysis projects require the same set of stages to be performed. These are:
Stage 1 Identify the problem (question)
Stage 2 Collect & prepare the data
Stage 3 Explore the data
Stage 4 Communicate the results
Prompts shown with the stages:
• What is the goal?
• What do you want to estimate?
• How to track house prices across different areas?
• Data resources
• Descriptive statistics
• What are the findings?
• Data representation
• Visualisation
• What did we learn?
• Report the findings
• Does the result make sense?
• Clean and normalise the data

Week 2 Overview: Data Collection & R Programming
This week covers the basics of data collection and R programming.
Learning outcomes:
• Learn about the sources of data.
• Learn about data types.
• Learn how to import data into R Markdown.
• Learn about R programming.

Data Collection
Data collection is the process of gathering information from a specific source, which can be used to answer relevant questions and evaluate outcomes.
Data can help us in:
• learning more about customers, items, products, etc.
• discovering trends in the current system, organisation, etc.
• segmenting elements into different groups based on their individual needs.
• making decisions that improve the quality of the system.
• improving the quality of the product or service based on the feedback obtained.
[Figure: R code links the data, the data exploration & analysis techniques, and the resulting knowledge, conclusions, actions, etc.]

Data sources: data can be obtained from various sources.
[Figure: data sources shown are a PC, the Internet and an external PC.]

Data format: data can be stored in different formats.

What is data?
Data is a set of facts such as numbers, words, measurements, observations or descriptions of things.
It is a set of values of qualitative or quantitative variables collected by a wide range of organisations and institutions, such as businesses and non-governmental organisations.
▶ Qualitative data: descriptive information (describes something).
▶ Quantitative data: numerical information (numbers).

What is data? Qualitative vs quantitative
• Qualitative: "The trip was great"
• Quantitative, discrete: 10
• Quantitative, continuous: 3.3

Data values can be:
▶ Numeric:
  • Discrete - integer values. Example: number of cars in the car park.
  • Continuous - any value in a pre-defined range (float, double). Example: average mark (e.g., 63.4).
▶ Categorical: values are selected from a predefined number of categories.
  • Ordinal - categories can be meaningfully ordered. Example: grades (A, B, C, D, E, F).
  • Nominal - categories have no order. Example: eye colours (blue, black, honey, etc.).
  • Binary - a special case of nominal with only 2 possible categories. Example: binary value (1, 0).
▶ Date: datetime, timestamp. Example: 11.10.2018.
▶ Text: Multidimensional data
▶ Time series: data points indexed in time order.

Data category: data can be one of two main categories, experimental or observational.
Experimental data
Data collected from strictly controlled/designed experiments with efforts made to ensure statistical validity.
Examples
• Medical clinical trials
• Election polls

Observational data
Data collected from 'real-world' settings without control over the captured underlying phenomena.
It is easier to collect and obtain, but results and conclusions from such data may be biased or inconclusive.
Example: almost all data used in data mining, business analytics and data science are observational data.

Data type: data can be
• Numbers
• String
• Relational data
• Factors or categorical variables
• Dates and times
• Description
We can read data from various sources or files. Files can be in any format, such as:
• name.CSV
• name.DAT
• name.TXT
• name.XLS
• name.HTML
• name.json

When we get new data, we often ask:
• What is in it?
• What is wrong with it?
• What should I do with it?
Answer:
• Step 1. Import the data into your code.
• Step 2. Organise the data in a readable format.
• Step 3. ...
• ....
• Step n.

Data importing
Data importing can be defined as the process of writing R code to get the data from disk (PC) into the R environment. This lecture will cover Step 1.
• Step 1. Import the data into the R environment.
  1 Reading data: write R code to import data into the RStudio environment.
  2 View the data: explore, access and print.
[Figure: writing R code loads a file into a table with columns col 1 to col 3 holding Value 1 to Value 12.]

Basics of R Programming
How do RStudio and R work?
[Figure: the CSE5DEV student writes R code in the RStudio interface; the R programming software runs in the background on the computer, and the output appears on the PC monitor.]
Note: you ONLY need to write and run your code in the RStudio interface.

RStudio Interface
[Figure: screenshots of the RStudio interface.]

In this lecture, we will learn how to write R code for the following tasks:
• Import data: reading data from a file.
• View data.
• Access data.
• Check data types.
• Export data.

Importing data
R uses various functions to import data from the working directory into the R environment. We can import data from different formats such as:
• Text files: txt files.
• Comma Separated Values: CSV files.
• Excel files: xls or xlsx files.
• Web site: URL files.
• SPSS file
• ... etc.

R reading function syntax:
R Code (read data function format):
Object_name <- R_read_function("file_name.ext", Arguments)
• Object_name: a variable that can hold different values.
• R_read_function: the function used to read data from the file, chosen to match the file format.
• file_name.ext: the name of the file to read, including its extension and location.
• Arguments: options that control how the file is read.
Examples of R reading functions:
• read.table for text files
• read.csv for CSV files

Importing data: read data from text files
Example: read data from a text file called Mytext.txt and assign the data to the dat object (or variable).
R Code (read data from text files):
dat <- read.table("Mytext.txt", header=TRUE, sep=" ", dec=".")
• The read.table function reads the file and saves its contents in the object dat.
• header=TRUE: indicates that the first row in the file holds header information (column names). read.table does not assume a header by default (its signature has header=FALSE), whereas read.csv defaults to header=TRUE, so state it explicitly. If your file does not have a header, use header=FALSE.
• sep=" ": indicates that the columns are separated by a single space. Other separators, such as a tab ("\t") or a comma (","), can be used; the default, sep="", treats any run of white space as a separator.
• dec=".": the character used in the file for decimal points is a dot.

Importing data: read data from CSV files
Example: read data from a CSV file called data.csv and assign the data to the dat object (or variable).
R Code (read data from a CSV file):
dat <- read.csv("data.csv", header=TRUE, sep=",")
• read.csv: reads the data from "data.csv", which includes a header row and is separated by commas (,).
• By default dat will be a data frame.

View data
We can use the following functions to view/check the data in dat:
• names() - shows the names attribute of a data frame, which gives the column names.
• head() - shows the first 6 rows.
• tail() - shows the last 6 rows.
• dim() - returns the dimensions of the data frame (number of rows and number of columns).
• nrow() - number of rows.
• ncol() - number of columns.
• str() - structure of the data frame: name, type and preview of the data in each column.
• sapply(dataframe, class) - shows the class of each column in the data frame.

View data
Example of functions for viewing/checking data.
dat <- read.csv("data.csv", header=TRUE, sep=",")
names(dat)
## [1] "Model" "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
head(dat)
##               Model  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
dim(dat)
## [1] 32 12
nrow(dat)
## [1] 32
ncol(dat)
## [1] 12

View data
We can use the print() function to display dat on the screen.
print(dat)
##                  Model  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
##  1           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
##  2       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
##  3          Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
##  4      Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
##  5   Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
##  6             Valiant 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
##  7          Duster 360 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
##  8           Merc 240D 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
##  9            Merc 230 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## 10            Merc 280 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## 11           Merc 280C 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## 12          Merc 450SE 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 13          Merc 450SL 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 14         Merc 450SLC 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 15  Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 16 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 17   Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 18            Fiat 128 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 19         Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 20      Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 21       Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## 22    Dodge Challenger 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 23         AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 24          Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 25    Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 26           Fiat X1-9 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 27       Porsche 914-2 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 28        Lotus Europa 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## 29      Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 30        Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## 31       Maserati Bora 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## 32          Volvo 142E 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

View data
str(dat) - displays the structure of the data: the type and a preview of the data in each column.
str(dat)
## 'data.frame': 32 obs. of 12 variables:
## $ Model: chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : int 6 6 4 6 8 6 8 4 4 6 ...
## $ disp : num 160 160 108 258 360 ...
## $ hp : int 110 110 93 110 175 105 245 62 95 123 ...
## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec : num 16.5 17 18.6 19.4 17 ...
## $ vs : int 0 0 1 1 0 1 0 1 1 1 ...
## $ am : int 1 1 1 0 0 0 0 0 0 0 ...
## $ gear : int 4 4 4 3 3 3 3 4 4 4 ...
## $ carb : int 4 4 1 1 2 1 4 2 2 4 ...
Based on the above, we can see that
• dat is an object of type data.frame.
• The column data are character, numeric or integer.

R object, data type, data structure
▶ Objects or variables are used to save data values that R programs can manipulate.
A valid object name consists of letters, numbers and the dot or underscore characters. It should start with a letter, or with a dot that is not followed by a number.
▶ Examples of valid and invalid object names are:
1 object_name2. : valid - contains letters, numbers, dot and underscore.
2 object_name% : invalid - contains the character '%'. Only dot (.) and underscore (_) are allowed.
3 2object_name : invalid - starts with a number.
4 .object_name, object.name : valid - can start with a dot (.), but the dot should not be followed by a number.
5 .2object_name : invalid - the dot is followed by a number.
6 _object_name : invalid - starts with _, which is not valid.
▶ Object assignment: objects can be assigned values using the <- symbol. For example, x <- 5, y <- 5.2, z <- "CSE5DEV".

R object, data type, data structure
▶ Objects are reserved memory locations to store values. They store data of different types, and different types can do different things. The kind of value stored is known as its R data type.
▶ In R, the data types can be one of the following:
1 Logical: TRUE, FALSE.
2 Integer: 21L, 3L, etc. The letter "L" declares the value as an integer.
3 Numeric: real or decimal (2.1, 2.0, pi).
4 Character: "a" or "swc".
5 Complex: 1 + 0i or 1 + 4i.
6 Date values: "2021-07-26" (a quoted date is a character string until it is converted with as.Date()).
▶ We can use the class() or typeof() function to check the data type of objects.
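The difference between class() and typeof() can be seen by applying both to one value of each type listed above. This is a minimal sketch: the list `values` and the table `type_table` are illustrative names, not part of the lecture material.

```r
# One illustrative value per data type from the list above.
values <- list(
  logical   = TRUE,
  integer   = 21L,
  numeric   = 2.1,
  character = "swc",
  complex   = 1 + 4i,
  date      = as.Date("2021-07-26")  # converted from a character string
)

# class() reports how R treats the object; typeof() reports how it is stored.
type_table <- data.frame(
  class     = sapply(values, function(v) class(v)[1]),
  typeof    = sapply(values, typeof),
  row.names = names(values)
)
print(type_table)
```

For most types the two functions agree, but a numeric value is stored as "double", and a Date is also stored as "double" even though its class is "Date".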
R object, data type, data structure
Examples of R object assignment and data types:
# numeric
x <- 5.5
class(x)
## [1] "numeric"
# integer
x <- 200L
class(x)
## [1] "integer"
# complex
x <- 6i + 2
class(x)
## [1] "complex"
# character/string
x <- "R CSE5DEV"
class(x)
## [1] "character"
# logical/boolean
x <- TRUE
class(x)
## [1] "logical"

R Data Type Conversion
▶ In R, we can convert a value from one type to another using the following functions:
• as.numeric()
• as.integer()
• as.complex()
• as.Date()
▶ Examples of data type conversion are:
x <- 2L # integer
y <- 4 # numeric
# convert from integer to numeric:
a <- as.numeric(x)
# convert from numeric to integer:
b <- as.integer(y)
# print values of x and y
print(x)
## [1] 2
print(y)
## [1] 4
# print the class name of a and b
class(a)
## [1] "numeric"
class(b)
## [1] "integer"

R Data Structures
▶ Data structures are used to store data, keep it organised, and enable easy modification and access.
▶ Data structures store a SET of data values that relate to each other, and allow us to perform operations or functions on these values.
▶ Examples of R data structures are:
1 Vectors.
2 Matrices.
3 Data Frames.
4 Factors.

R Data Structures: Vectors
▶ Vectors store a list of items (or values) of the same type.
▶ We use the c() function to declare a vector consisting of a set of values separated by commas.
▶ We can create a vector that combines a set of values as follows:
# Vector of numerical values
numbers <- c(1, 2, 3, 4)
print(numbers)
## [1] 1 2 3 4
# Vector of strings
fruits <- c("apple", "orange", "banana")
print(fruits)
## [1] "apple" "orange" "banana"
# We can create a vector using the colon (:) operator
numbers <- 1:10
print(numbers)
## [1] 1 2 3 4 5 6 7 8 9 10

R Data Structures: Vectors
Some useful functions for vectors:
▶ Vector length: length() returns the number of values.
▶ Sort a vector: sort() sorts values alphabetically or numerically.
▶ Access vectors: use [] brackets to access the vector items by index number.
▶ Change an item value: use [index number] to change the value of a specific item.
▶ Repeat vectors: use rep() to repeat vector items.

R Data Structures: Vectors
Examples of vector functions.
# Vector length
fruits <- c("banana", "apple", "orange")
length(fruits)
## [1] 3

fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)
# Sort vector
sort(fruits) # Sort strings
## [1] "apple" "banana" "lemon" "mango" "orange"
sort(numbers) # Sort numbers
## [1] 2 3 5 7 13 20

# Access vectors
fruits <- c("banana", "apple", "orange")
# Access the first item (banana)
fruits[1]
## [1] "banana"
fruits[3]
## [1] "orange"

# Change an item
fruits <- c("banana", "apple", "orange", "mango", "lemon")
# Change "apple" to "pear"
fruits[2] <- "pear"
print(fruits)
## [1] "banana" "pear" "orange" "mango" "lemon"

# Repeat vector
repeat_vec <- rep(c(1,2,3), each = 3)
print(repeat_vec)
## [1] 1 1 1 2 2 2 3 3 3

R Data Structures: Vectors
[Figure: examples of vector operations.]
[Figure: functions for vectors.]

R Data Structures: Matrices
▶ A matrix stores data in a two-dimensional rectangular layout with columns and rows.
▶ A column is a vertical representation of data, while a row is a horizontal representation of data.
▶ We use the matrix() function to create a matrix. We also need to specify the nrow and ncol parameters to set the number of rows and columns.
# Create a matrix (values fill the matrix column by column)
matr <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
print(matr)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

R Data Structures: Matrices
Some useful functions for matrices:
▶ Access matrix items: use [] brackets to access items using two index numbers, the first for the row and the second for the column.
▶ Access more than one row or column: use [] and c(): [c(1,2), ] or [, c(1,2)].
▶ Add rows and columns: use cbind() to add columns and rbind() to add rows.
▶ Remove rows and columns: use negative indices with c(): [-c(1), -c(1)].
▶ Check if an item exists: use the %in% operator: item %in% matrix.
▶ Matrix size: dim() returns the number of rows and columns.
▶ Matrix length: length() returns the total number of items in a matrix (rows × columns).

R Data Structures: Matrices
# Access matrix items
mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
print(mart[1, 2])
## [1] "cherry"
print(mart[2,])
## [1] "banana" "orange"

# Access more than one row
mart <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
print(mart[c(1,2),])
##      [,1]     [,2]     [,3]
## [1,] "apple"  "orange" "pear"
## [2,] "banana" "grape"  "melon"

# Access more than one column
print(mart[, c(1,2)])
##      [,1]     [,2]
## [1,] "apple"  "orange"
## [2,] "banana" "grape"
## [3,] "cherry" "pineapple"

# Add rows and columns
mart <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
print(mart)
##      [,1]     [,2]        [,3]
## [1,] "apple"  "orange"    "pear"
## [2,] "banana" "grape"     "melon"
## [3,] "cherry" "pineapple" "fig"

newmatrix <- cbind(mart, c("strawberry", "blueberry", "raspberry"))
print(newmatrix)
##      [,1]     [,2]        [,3]    [,4]
## [1,] "apple"  "orange"    "pear"  "strawberry"
## [2,] "banana" "grape"     "melon" "blueberry"
## [3,] "cherry" "pineapple" "fig"   "raspberry"

newmatrix <- rbind(mart, c("strawberry", "blueberry", "raspberry"))
print(newmatrix)
##      [,1]         [,2]        [,3]
## [1,] "apple"      "orange"    "pear"
## [2,] "banana"     "grape"     "melon"
## [3,] "cherry"     "pineapple" "fig"
## [4,] "strawberry" "blueberry" "raspberry"

mart <- matrix(c("apple", "banana", "cherry", "orange", "mango", "pineapple"), nrow = 3, ncol = 2)
# Remove the first row and the first column
mart <- mart[-c(1), -c(1)]
print(mart)
## [1] "mango" "pineapple"

# Check if an item exists
mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
"apple" %in% mart
## [1] TRUE

# Check the number of rows and columns
mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
dim(mart)
## [1] 2 2

# Matrix length
mart <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
length(mart)
## [1] 4

R Data Structures: Matrices
Examples of matrix operations.
Functions for matrix.
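The two slide titles above arrive without their examples. As a hedged sketch of what matrix operations and functions typically cover in base R, the block below uses two small numeric matrices. Which operations the original slides showed is an assumption, and all object names are illustrative.

```r
# Two 3 x 2 numeric matrices, filled column by column.
a <- matrix(1:6, nrow = 3, ncol = 2)
b <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3, ncol = 2)

# Operations
element_sum  <- a + b        # element-wise addition
element_prod <- a * b        # element-wise multiplication
transposed   <- t(a)         # transpose: 2 rows, 3 columns
matrix_prod  <- t(a) %*% b   # matrix product: (2 x 3) %*% (3 x 2) gives 2 x 2

# Functions
col_totals <- colSums(a)     # sum of each column
row_means  <- rowMeans(a)    # mean of each row

print(matrix_prod)
print(col_totals)
print(row_means)
```

Note that * multiplies matching items, while %*% performs the matrix product, so the two return results of different shapes here.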
R Data Structures: Matrices
The following functions can be used to check the data type in each column:
1 is.numeric(): check if the data is numeric (TRUE or FALSE).
2 is.integer(): check if the data is integer (TRUE or FALSE).
3 is.factor(): check if the data is a factor (TRUE or FALSE).
4 is.character(): check if the data is character (TRUE or FALSE).

R Data Structures: Factors
▶ Factors can be used to categorise data and store it as levels.
▶ Factors store both strings and integers.
▶ They are very useful for columns which have a limited number of unique values: Demography {Male, Female}, Music {Rock, Classic, Jazz}, Training {Strength, Stamina}, Logical {True, False}.
▶ We use the factor() function to create a factor, passing a vector c() as an argument.
# Create a factor
music <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
print(music)
## [1] Jazz Rock Classic Classic Pop Jazz Rock Jazz
## Levels: Classic Jazz Pop Rock

R Data Structures: Factors
Some useful functions for factors:
▶ Levels: we can use the levels() function to print or set the factor levels.
▶ Factor length: the length() function returns the number of items.
▶ Access factors: use [] brackets to access factor items.
▶ Change item value: use [] and the item index number to change its value.
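The statement that factors store both strings and integers can be checked directly. This is a minimal sketch that reuses the music vector from above; `codes`, `counts` and `music_copy` are illustrative names. It also shows, as an assumption about what readers are likely to run into, what happens when an item is changed to a value that is not one of the levels.

```r
music <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

# Each item is stored as an integer code pointing into levels(music).
codes <- as.integer(music)
print(levels(music))
print(codes)

# table() counts how many items fall in each level.
counts <- table(music)
print(counts)

# Assigning a value that is not an existing level produces NA (with a warning).
music_copy <- music
suppressWarnings(music_copy[1] <- "Blues")
print(is.na(music_copy[1]))
```

To store a new category, it has to be added to the levels first, for example when the factor is created with the levels argument.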
R Data Structures: Factors
Examples of factor functions:
# print levels
music <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
print(levels(music))
## [1] "Classic" "Jazz" "Pop" "Rock"
# set the levels
music <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"), levels = c("Classic", "Jazz", "Pop", "Rock", "Other"))
print(levels(music))
## [1] "Classic" "Jazz" "Pop" "Rock" "Other"
# Factor length
length(music)
## [1] 8
# Access factors
print(music[3])
## [1] Classic
## Levels: Classic Jazz Pop Rock Other
# Change item value
music[4] <- "Pop"
print(music[4])
## [1] Pop
## Levels: Classic Jazz Pop Rock Other

R Data Structures: Data Frames
A data frame is the most common and practical way of storing data in R, especially in data analysis.
▶ data.frame shows data in a table format.
▶ data.frame stores different types of data inside it.
▶ Different columns can have different data types. For example, the first column can be numeric, the second character and the third logical, etc.
▶ However, all values within a single column must have the same data type.
▶ We use the data.frame() function to create a data frame.
Note: read.csv()
All files that we import using the read.csv() function are stored as data.frame data structures.

R Data Structures: Data Frames
Example: create a data frame consisting of 3 columns and 3 rows.
# Create a data frame
data_frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  ID = c(10, 11, 13),
  Time = c(6.6, 3.2, 4.0)
)
print(data_frame)
##   Training ID Time
## 1 Strength 10  6.6
## 2  Stamina 11  3.2
## 3    Other 13  4.0

R Data Structures: Data Frames
Some useful functions for data frames:
▶ Summarise the data: use the summary() function to summarise the data.
▶ Access items: use single brackets [], double brackets [[]] and $ to access columns.
▶ Add rows and columns: use rbind() to add rows and cbind() to add columns.
▶ Remove rows and columns: use negative indices with c() to remove rows and columns.
▶ Number of rows and columns: use dim(), or ncol() and nrow(), to find the number of rows and columns.
▶ Data frame length: length() returns the number of columns.

R Data Structures: Data Frames
# Summarise the data
# Create a data frame
data_frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  ID = c(10, 11, 13),
  Time = c(6.6, 3.2, 4.0)
)
print(data_frame)
##   Training ID Time
## 1 Strength 10  6.6
## 2  Stamina 11  3.2
## 3    Other 13  4.0
summary(data_frame)
##    Training               ID             Time
##  Length:3           Min.   :10.00   Min.   :3.2
##  Class :character   1st Qu.:10.50   1st Qu.:3.6
##  Mode  :character   Median :11.00   Median :4.0
##                     Mean   :11.33   Mean   :4.6
##                     3rd Qu.:12.00   3rd Qu.:5.3
##                     Max.   :13.00   Max.   :6.6

R Data Structures: Data Frames
# Access items
data_frame[1]
##   Training
## 1 Strength
## 2  Stamina
## 3    Other
data_frame[["Training"]]
## [1] "Strength" "Stamina" "Other"
data_frame$Training
## [1] "Strength" "Stamina" "Other"

# Add a new row
# The new row is passed as a one-row data frame so that ID and Time stay numeric;
# rbind(data_frame, c("Strength", 110, 11.0)) would turn both columns into character.
New_row_DF <- rbind(data_frame, data.frame(Training = "Strength", ID = 110, Time = 11.0))
print(New_row_DF)
##   Training  ID Time
## 1 Strength  10  6.6
## 2  Stamina  11  3.2
## 3    Other  13  4.0
## 4 Strength 110 11.0

R Data Structures: Data Frames
# Add a new column
New_col_DF <- cbind(data_frame, Steps = c(1000, 6000, 2000))
print(New_col_DF)
##   Training ID Time Steps
## 1 Strength 10  6.6  1000
## 2  Stamina 11  3.2  6000
## 3    Other 13  4.0  2000

# Remove the first row and column
Data_Frame_New <- data_frame[-c(1), -c(1)]
print(Data_Frame_New)
##   ID Time
## 2 11  3.2
## 3 13  4.0

# Find the number of rows and columns
print(dim(data_frame))
## [1] 3 3

# Data frame length
print(length(data_frame))
## [1] 3

R Data Structures: Data Frames
Note: data frame rules
In R, all data frames should respect the following rules.
▶ All column names should be non-empty.
▶ All row names should be unique.
▶ The data stored in data frame columns can be numeric, factor or character.
▶ Each column should contain the same number of items, and all items within a column share one data type.

R Data Structures: Data Frames
Example: create a data frame for five employees consisting of employee ID, name, salary and starting date.
# Create employee data frame.
employee <- data.frame( employee_id = c (1:5), employee_name = c("A","B","C","D","E"), employee_salary = c(611.3,512.2,621.0,722.0,343.21), start_date = as.Date(c("2014-01-010", "2015-08-23", "2016-10-15", "2016-04-11", "2016-04-26")), stringsAsFactors = FALSE) print(employee) ## ## ## ## ## ## 1 2 3 4 5 employee_id employee_name employee_salary start_date 1 A 611.30 2014-01-01 2 B 512.20 2015-08-23 3 C 621.00 2016-10-15 4 D 722.00 2016-04-11 5 E 343.21 2016-04-26 Please note: 1 Creating data frames using data.frame() function will converted (character) strings to factors (distinct groups). 2 Use stringsAsFactors = FALSE if you are going to change it or making it as plain strings. Section 1: CSE5DEV Syllabus Section 2: Data Collection Section 3: Basics of R Programming Example: import and view data dat <- read.csv("data.csv", header=TRUE, sep =",") str(dat) ## 'data.frame': 32 obs. of 12 variables: ## $ Model: chr "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... ## $ cyl : int 6 6 4 6 8 6 8 4 4 6 ... ## $ disp : num 160 160 108 258 360 ... ## $ hp : int 110 110 93 110 175 105 245 62 95 123 ... ## $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... ## $ wt : num 2.62 2.88 2.32 3.21 3.44 ... ## $ qsec : num 16.5 17 18.6 19.4 17 ... ## $ vs : int 0 0 1 1 0 1 0 1 1 1 ... ## $ am : int 1 1 1 0 0 0 0 0 0 0 ... ## $ gear : int 4 4 4 3 3 3 3 4 4 4 ... ## $ carb : int 4 4 1 1 2 1 4 2 2 4 ... dim(dat) ## [1] 32 12 class(dat) ## [1] "data.frame" class(dat$Model) ## [1] "character" class(dat[[2]]) ## [1] "numeric" Section 1: CSE5DEV Syllabus Section 2: Data Collection Section 3: Basics of R Programming Example: import and view data # summary dat summary(dat) ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## Model Length:32 Class :character Mode :character hp Min. : 52.0 1st Qu.: 96.5 Median :123.0 Mean :146.7 3rd Qu.:180.0 Max. :335.0 vs Min. 
:0.0000 1st Qu.:0.0000 Median :0.0000 Mean :0.4375 3rd Qu.:1.0000 Max. :1.0000 mpg cyl disp Min. :10.40 Min. :4.000 Min. : 71.1 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 Median :19.20 Median :6.000 Median :196.3 Mean :20.09 Mean :6.188 Mean :230.7 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 Max. :33.90 Max. :8.000 Max. :472.0 drat wt qsec Min. :2.760 Min. :1.513 Min. :14.50 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 Median :3.695 Median :3.325 Median :17.71 Mean :3.597 Mean :3.217 Mean :17.85 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 Max. :4.930 Max. :5.424 Max. :22.90 am gear carb Min. :0.0000 Min. :3.000 Min. :1.000 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 Median :0.0000 Median :4.000 Median :2.000 Mean :0.4062 Mean :3.688 Mean :2.812 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000 Max. :1.0000 Max. :5.000 Max. :8.000 Section 1: CSE5DEV Syllabus Section 2: Data Collection Section 3: Basics of R Programming Exporting data The data stored in objects can be exported and saved as text or csv files using the following functions: ▶ write.table: export text file: write.table(data to export, file = ”file name.txt”, sep = ” ”). ▶ write.csv: export csv file: write.csv(data to export, file = ”file name.csv”, sep = ”,”) Section 1: CSE5DEV Syllabus Section 2: Data Collection Section 3: Basics of R Programming Basics of R Programming In this lecture, we have learned 1 how to import data into R environment (RStudio->RMarkdown). 2 how to view data in R. 3 objects and how to manipulate them. 4 R data types. 5 R data structures. 6 how to export data. Section 1: CSE5DEV Syllabus Section 2: Data Collection Section 3: Basics of R Programming End of Week 2 See you Next Lecture (Week 3) Data Wrangling & R Programming Table: CSE5DEV Timetable Check LMS
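As a recap of the exporting and importing steps covered this week, the following is a minimal sketch of a round trip: a small data frame is written with write.csv() and write.table(), then read back with read.csv() and read.table(). The names scores, csv_path and txt_path are hypothetical and chosen only for illustration, and tempfile() is used here (an assumption, not part of the lecture) so that no real file is overwritten.

```r
# Hypothetical example data; names are illustrative only.
scores <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  ID = c(10, 11, 13),
  Time = c(6.6, 3.2, 4.0),
  stringsAsFactors = FALSE  # keep strings as characters on any R version
)

# Throwaway file paths, so the sketch does not touch real files.
csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")

# write.csv() always separates fields with commas.
# row.names = FALSE leaves out the 1, 2, 3 row index column.
write.csv(scores, file = csv_path, row.names = FALSE)

# write.table() lets you pick the separator; here a tab.
write.table(scores, file = txt_path, sep = "\t", row.names = FALSE)

# Read both files back in.
from_csv <- read.csv(csv_path, header = TRUE, sep = ",",
                     stringsAsFactors = FALSE)
from_txt <- read.table(txt_path, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)

# Inspect what came back: structure and dimensions.
str(from_csv)
print(dim(from_txt))
```

Passing row.names = FALSE is a design choice: without it, the row index is written as an extra unnamed column and reappears as a column when the file is read back. Calling str() on the re-imported data is a quick way to check that each column came back with the type you expect.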