Lec06_Data exploration - Part 1_31July2024 (1).R
# ECON 223 - Intro to Statistical Programming
# ===========================================
# Krea SIAS, AY 2024-25
# =====================
# WEEK 03
# =======
# LECTURE 06
# ==========

#--------------------------------------------#
#--------------------------------------------#
#--------------------------------------------#
#--------------------------------------------#
# DATA EXPLORATION
#--------------------------------------------#
#--------------------------------------------#

rm(list=ls())

#---------------#
# Using 'iris' dataset
# Eg. courtesy - Prof. Xiaorui Zhu
#---------------#

?iris               # dataframe with 150 cases (rows) and 5 variables (cols)
data(iris)          # load dataset into current workspace
View(iris)          # uppercase V - view dataframe in a spreadsheet-like format
iris                # prints entire dataframe on to console - not easy to view
head(iris)          # prints first 6 rows of the dataframe
head(iris, n=10)    # prints first 10 rows of the dataframe
tail(iris)          # prints last 6 rows of the dataframe
tail(iris, n=10)    # prints last 10 rows of the dataframe
dim(iris)           # check dimensions -- 150 rows, 5 cols
nrow(iris)          # no. of rows
ncol(iris)          # no. of cols
names(iris)         # variable names or col names
colnames(iris)      # gives the same result
str(iris)           # structure of dataframe
                    # notice: num and Factor vars
class(iris[,1])     # class of 1st column of iris
class(iris[,5])     # class of 5th column of iris

# Simple Summary Statistics
# =========================
?summary
summary(iris)       # summary stats (Min, Max, Mean, Median, Quantiles) for continuous vars (locational stats),
                    # count or frequency for categorical vars

# Let's analyze a particular variable 'Petal length'
# ==================================================
summary(Petal.Length)   # Why does this show an Error?
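# Why the bare name fails: a small sketch, not part of the original lecture code.
# 'Petal.Length' is a column stored inside the data frame 'iris'; it is not an
# object in the workspace, so R cannot find it by that name alone.
# Assumes a fresh session in which 'iris' has not been attached and no object
# called Petal.Length has been created; 'err_msg' is an illustrative name only.
"Petal.Length" %in% names(iris)             # TRUE: it is a column of iris
exists("Petal.Length")                      # FALSE under the assumption above
err_msg <- tryCatch(summary(Petal.Length),  # capture the error instead of stopping the script
                    error = function(e) conditionMessage(e))
err_msg                                     # the text of the error message R raised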
iris$Petal.Length                   # Use $ to extract any column from dataset
                                    # (data_frame_name$vector_name)
summary(iris$Petal.Length)
mean(iris$Petal.Length)             # mean
median(iris$Petal.Length)           # median
var(iris$Petal.Length)              # variance
sd(iris$Petal.Length)               # std deviation
min(iris$Petal.Length)              # min
max(iris$Petal.Length)              # max
max(iris$Petal.Length) - min(iris$Petal.Length)    # range
quantile(iris$Petal.Length)         # important quantiles
quantile(iris$Petal.Length, 0.25)   # quartile 1 (25th percentile)
quantile(iris$Petal.Length, 0.75)   # quartile 3 (75th percentile)
quantile(iris$Petal.Length, c(0.25, 0.5, 0.75))    # different quantiles in one go

# One-way table
# =============
table(iris$Species)                 # frequency table of the 'Species' variable
proportions(table(iris$Species))    # proportions of diff. 'Species'

# Select columns
# ==============
names(iris)
iris[, "Sepal.Length"]                      # select col by name
iris[, c("Sepal.Length", "Sepal.Width")]    # select 2 cols by name
iris[, 3:5]                                 # select columns 3-5

# Select rows
# ===========
iris[1:8, ]      # first 8 rows
iris[15:23, ]    # select rows 15-23

# QUICK TIP - Attaching and Detaching a dataset
# =============================================
# attach
mean(Petal.Length)         # error: iris is not attached, so the bare column name is not found
mean(iris$Petal.Length)    # works: the column is reached through the data frame
attach(iris)
mean(Petal.Length)         # works now: attached columns can be used by name
# detach
detach(iris)
mean(Petal.Length)         # error again after detaching
mean(iris$Petal.Length)    # always works

# Let's draw some charts!
# =======================
summary(iris$Petal.Length)

# Histogram (for Sepal Length)
# ============================
?hist
hist(iris$Sepal.Length)
# change color
hist(iris$Sepal.Length, col = "violet")
# add title
hist(iris$Sepal.Length, col = "violet", main = "Histogram of Sepal Length")
# add label for x-axis
hist(iris$Sepal.Length, col = "violet", main = "Histogram of Sepal Length",
     xlab = "Sepal Length")
# specify the number of bins (using 'breaks')
hist(iris$Sepal.Length, col = "violet", main = "Histogram of Sepal Length",
     xlab = "Sepal Length", breaks = 8)
hist(iris$Sepal.Length, col = "violet", main = "Histogram of Sepal Length",
     xlab = "Sepal Length", breaks = 20)
# specify that you want bins going from 4 to 8 in increments of 0.5 (4-4.5, 4.5-5,...)
hist(iris$Sepal.Length, col = "violet", main = "Histogram of Sepal Length",
     xlab = "Sepal Length", breaks = seq(4,8,0.5))
# specify that you want bins going from 4 to 8 in increments of 0.1
hist(iris$Sepal.Length, col = "violet", main = "Histogram of Sepal Length",
     xlab = "Sepal Length", breaks = seq(4,8,0.1))

# Box plot (for Petal Length)
# ===========================
?boxplot
boxplot(iris$Petal.Length)
summary(iris$Petal.Length)
# change color
boxplot(iris$Petal.Length, col = "blue")
# add title
boxplot(iris$Petal.Length, col = "blue", main = "Box plot of Petal Length")
# draw box plot for 2 variables
boxplot(iris$Petal.Length, iris$Sepal.Length, col = c("blue", "red"),
        main = "Box plot of Petal Length & Sepal Length")
boxplot(iris$Petal.Length, iris$Sepal.Length, col = c("blue", "red"),
        main = "Box plot of Petal Length & Sepal Length",
        names = c("Petal Length", "Sepal Length"))
# horizontal box plot
boxplot(iris$Petal.Length, iris$Sepal.Length, col = c("blue", "red"),
        main = "Box plot of Petal Length & Sepal Length",
        names = c("Petal Length", "Sepal Length"), horizontal = TRUE)

# Scatter plot of Sepal length and Petal length
# =============================================
?plot
plot(iris$Sepal.Length, iris$Petal.Length)    # 1st arg is var in x-axis, 2nd arg is var in y-axis
plot(iris$Sepal.Length, iris$Petal.Length,
     xlab = "Sepal Length", ylab = "Petal Length",
     main = "Scatter plot of Sepal Length and Petal Length")

# Summary by group (skip for now)
# ===============================
?aggregate    # Splits the data into subsets,
              # computes summary statistics for each, and
              # returns the result in a convenient form.
aggregate(. ~ Species, data = iris, FUN = mean)    # group mean - by Species
aggregate(. ~ Species, data = iris, FUN = sd)      # group std dev
aggregate(. ~ Species, data = iris, FUN = quantile)

# Reordering columns and sorting rows (skip for now, in-detail later)
# ===================================================================
# To find the 5 rows in iris with the largest Petal.Length
iris[order(iris$Petal.Length, decreasing = TRUE)[1:5], ]
# To find the 5 rows with the lowest Petal.Length
iris[order(iris$Petal.Length, decreasing = FALSE)[1:5], ]
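
# How the order() idiom above works: a minimal sketch, not part of the original
# lecture code ('idx' and 'top5' are illustrative names only).
# order() returns row positions arranged by the values, not the values themselves;
# those positions are then used as a row index into the data frame.
idx  <- order(iris$Petal.Length, decreasing = TRUE)    # row positions, largest Petal.Length first
top5 <- iris[idx[1:5], ]                                # keep the first 5 of those positions
top5$Petal.Length                                       # the five largest values, in decreasing order
# An equivalent spelling: reorder all rows first, then keep the first 5 with head()
head(iris[idx, ], 5)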