Full Transcript

Week 6: Data Exploration (CSE5DEV)

Outline: Syllabus; Week Overview; Data Exploration; Univariate Analysis: Tabular Exploration; Univariate Analysis: Graphical Exploration; Examples of Data Exploration.

Learning outcomes:
  Develop a high-level understanding of the data.
  Learn about the distribution of variables.
  Understand analysis and summary statistics.

Data can come in different formats, but a computer program expects your data to be organised in a well-defined structure.

What have we learned so far? —— Theory ——
  Collecting and wrangling: working with data; reading and correcting data.
  Cleaning and normalising: converting dirty data into correct data; cleaning and handling missing values; normalising or standardising data.
  Data visualisation: producing scatter plots, boxplots and line plots; plotting with ggplot2.

What have we learned so far? —— R Programming ——
  Install R and RStudio, create an R Markdown file, write and run basic code, etc.
  Data types and data structures (vector, factor, matrix and data frame): view, access, change, etc.
  Import data into the R environment (text files and CSV files).
  Correct or change the format of the data to make it tidy.
  Clean the data.
  Normalise the data.
  Data visualisation using ggplot2.

Base R Cheat Sheet

Getting Help: accessing the help files
  ?mean                          Get help of a particular function.
  help.search('weighted mean')   Search the help files for a word or phrase.
  help(package = 'dplyr')        Find help for a package.

More about an object
  str(iris)      Get a summary of an object's structure.
  class(iris)    Find the class an object belongs to.

Vector Functions
  sort(x)      Return x sorted.
  rev(x)       Return x reversed.
  table(x)     See counts of values.
  unique(x)    See unique values.
Using Libraries
  install.packages('dplyr')    Download and install a package from CRAN.
  library(dplyr)               Load the package into the session, making all its functions available to use.
  dplyr::select                Use a particular function from a package.
  data(iris)                   Load a built-in dataset into the environment.

Working Directory
  getwd()                  Find the current working directory (where inputs are found and outputs are sent).
  setwd('C://file/path')   Change the current working directory.
  Use projects in RStudio to set the working directory to the folder you are working in.

Selecting Vector Elements
  By position:
    x[4]          The fourth element.
    x[-4]         All but the fourth.
    x[2:4]        Elements two to four.
    x[-(2:4)]     All elements except two to four.
    x[c(1, 5)]    Elements one and five.
  By value:
    x[x == 10]              Elements which are equal to 10.
    x[x < 0]                All elements less than zero.
    x[x %in% c(1, 2, 5)]    Elements in the set 1, 2, 5.
  Named vectors:
    x['apple']    Element with name 'apple'.

  m <- matrix(x, nrow = 3, ncol = 3)    Create a matrix from x.

  log(x)         Natural log.
  exp(x)         Exponential.
  sum(x)         Sum.
  mean(x)        Mean.
  median(x)      Median.
  max(x)         Largest element.
  min(x)         Smallest element.
  quantile(x)    Percentage quantiles.
  round(x, n)    Round to n decimal places.
  signif(x, n)   Round to n significant figures.
  rank(x)        Rank of elements.
  var(x)         The variance.
  sd(x)          The standard deviation.
  cor(x, y)      Correlation.

  df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))    A special case of a list where all elements are the same length.

  t.test(x, y)       Perform a t-test for difference between means.
  pairwise.t.test    Perform a t-test for paired data.
  prop.test          Test for a difference between proportions.
  aov                Analysis of variance.

Matrix subsetting
  df[ , 2]    df[2, ]    df[2, 2]
  nrow(df)    Number of rows.
  ncol(df)    Number of columns.
  dim(df)     Number of rows and columns.

RStudio® is a trademark of RStudio, Inc. CC BY Mhairi McNeill. Updated: 3/15.
  cbind    Bind columns.
  rbind    Bind rows.

Recall ....
Data can be one of two main categories: experimental or observational.

Recall ....
Data variable values can be:
  Numeric:
    Discrete: integer values.
    Continuous: any value in a pre-defined range (float, double).
  Categorical: values are selected from a predefined number of categories.
    Ordinal: categories can be meaningfully ordered.
    Nominal: categories do not have any order.
    Binary: the special case of nominal, with only 2 possible categories.
  Date: datetime, timestamp.
  Text.
  Multidimensional data.
  Time series: data points indexed in time order.

Recall ....
When you get a new data set (or a project), you should ask yourself a few questions before you start exploring it:
  What is in it?
  What is wrong with it?
  What should I do with it?
  Who is going to read or implement your analysis (and what do they already know)?

What is Data Exploration?
Data exploration can:
  give you a sense of the distribution of the data.
  help you to check whether there are trends and relationships.
  help you to know the possible values for each characteristic.
  inform you on how to develop or design your model.

Before trying to extract useful insight from the data, we should define the problem (formulate the question) to be solved. The problem definition determines how the data analysis plan is executed. The problem definition tasks include:
  The main objective of the analysis.
  The main deliverable.
  How to perform the analysis.
  How to use or implement the findings.
Based on the problem definition, we can create an execution plan.

Examples of problem definitions (or questions) are:
  Which factors affect house prices across different areas?
  Are hourly travel times on average higher in Melbourne than they are in Sydney?
  Which states in Australia have the highest levels of population distribution?
  What type of variation occurs within my variables?
  Which value occurs more often than others?
  Hypothesis questions.

Data exploration approaches
Data exploration tools can be used to analyse data variables as follows:
  Univariate analysis: the simplest form of data analysis. It analyses and provides summary statistics for each single variable in the data.
  Bivariate analysis: analyses the relationship between two variables of the data.
  Multivariate analysis: analyses the relationship between more than two variables of the data.

Examples of a univariate variable (a single feature, attribute or column), bivariate variables (two variables) and multivariate variables (more than two variables).
[Figure: example tables labelled Univariate, Bivariate and Multivariate]

In this lecture, we will perform univariate analysis.

— Univariate Analysis —
In univariate analysis, we explore each variable separately. A variable can be categorical or numerical. The ultimate aim is to summarise and analyse the pattern present in each variable. We will explore each variable using the following approaches:
  Tabular exploration.
  Graphical exploration.

— Univariate Analysis: Tabular Exploration —
Tabular exploration provides summary (descriptive) statistics for each variable. It can help us to identify various data quality issues such as:
  Precision: the closeness of repeated measurements to one another.
  Bias: a systematic deviation of measurements from the quantity being measured.
  Accuracy: the closeness of measurements to the true value of the quantity being measured.
  Outliers: unusual values that lie beyond the expected range.
— Univariate Analysis: Tabular Exploration —
To understand the benefits of summary statistics, let's print the salaries of 100 different people. As can be seen, it is very difficult to understand the values. Indeed, long series are often not informative: we cannot draw any conclusion or find any pattern in these values.

— Univariate Analysis: Tabular Exploration —
To get more information, let's plot the salaries of the 100 people as two groups (50 people in each group).
[Figure: values plotted against id for groups G1 and G2; a red line marks the mean of each group]
The red line in the figure represents the mean value. As can be seen, the mean is almost the same in both groups. However, although the values seem to be in the same location, it is clear that the distribution (dispersion) is very different in the two groups. For this reason, we need to use summary statistics to analyse both the location measures and the distribution measures of all variables.

— Univariate Analysis: Tabular Exploration —
In summary statistics, location and distribution can be defined as follows:
  Location measures: used to analyse and understand where the data is located. They describe the central tendency and the positions of data values.
  Distribution measures: used to analyse the spread of values for a particular variable.
It is highly recommended to use both location and distribution (dispersion) measures to explore data values and to present insights from both types of measure.

— Univariate Analysis: Tabular Exploration —
Location uses the following measures:
  Minimum and Maximum
  Mean
  Median
  Mode
  Frequency
  First quartile (Q1) and Third quartile (Q3)

— Univariate Analysis: Tabular Exploration —
Location measure: Minimum and Maximum
Minimum and maximum: the minimum (min) and maximum (max) are simply the lowest and highest values of a given data variable.
  Example: 1 22 6 8 10 16 5 (min = 1, max = 22)

— Univariate Analysis: Tabular Exploration —
Location measure: Mean
Mean: the mean (or average) is the most common way to measure the central location of data points. However, it is very sensitive to outliers. A trimmed mean is more robust to outliers because it disregards extreme values. A weighted mean also takes into account a weight for each observation. The mean is calculated by summing all values and dividing this sum by the number of values:
  mean = (sum of all values) / (number of values)
  Example: 1 10 6 8 16 22 5 (mean = 68 / 7 ≈ 9.71)

— Univariate Analysis: Tabular Exploration —
Location measure: Median
Median: the median of a set of sorted values is a value such that half of the observed values are above it and half are below it. It is the middle value for an odd number of observations, or the average (when it makes sense) of the two middle values for an even number of observations.
  Example: 1 10 6 8 16 22 5
  Sort the values and then select the middle one: 1 5 6 8 10 16 22 (median = 8)

— Univariate Analysis: Tabular Exploration —
Location measure: Mode
Mode: the mode is the most frequent value of an attribute, i.e. the value with the highest number of occurrences.

— Univariate Analysis: Tabular Exploration —
Location measure: Frequency
Frequency: the proportion (e.g. percentage) of observations that take each specific value of a categorical or discrete attribute.
  Example: the frequencies below imply a data set of seven values in which 6 occurs twice (for example 1 22 6 8 6 16 5).
  Values:     6      1      5      8      16     22
  Frequency:  0.286  0.143  0.143  0.143  0.143  0.143

— Univariate Analysis: Tabular Exploration —
Location measure: First quartile & Third quartile
First quartile (Q1) and third quartile (Q3): Q1 and Q3 are similar to the median in that they divide the sorted values into two parts, but the parts are of different sizes.
  Q1: the median of the lower half of the sorted values (about 25% of the values lie below it).
  Q3: the median of the upper half of the sorted values (about 75% of the values lie below it).
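The location measures above can be checked in R. The following is a minimal base-R sketch using the seven example values from the slides. The variable names (x, w, y, stat_mode), the weights in w, and the second vector y (in which 6 occurs twice so that a mode exists) are assumptions made for illustration. Note that quantile() interpolates between neighbouring values by default, so its quartiles can differ slightly from the median-of-halves definition given above.

```r
# Example values from the slides
x <- c(1, 22, 6, 8, 10, 16, 5)

min(x)      # lowest value: 1
max(x)      # highest value: 22
mean(x)     # 68 / 7, about 9.71
median(x)   # middle of the sorted values: 8

# Trimmed mean: drop the lowest and highest 20% of values before averaging
trimmed <- mean(x, trim = 0.2)

# Weighted mean: these weights are hypothetical
w <- c(1, 1, 2, 2, 1, 1, 1)
weighted.mean(x, w)

# Quartiles (R's default method interpolates between neighbouring values)
q <- quantile(x, probs = c(0.25, 0.75))
q

# Mode and frequency need repeated values, so use a vector where 6 occurs twice
y <- c(1, 22, 6, 8, 6, 16, 5)
counts <- table(y)            # count of each distinct value
freq <- prop.table(counts)    # proportion of each distinct value
stat_mode <- as.numeric(names(counts)[which.max(counts)])  # most frequent value
stat_mode
round(freq, 3)
```

Base R has no built-in function for the statistical mode (mode() reports the storage type of an object), which is why the sketch derives it from table().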
— Univariate Analysis: Tabular Exploration —
Distribution uses the following measures:
  Range
  Standard deviation
  Variance
  Coefficient of variation
  Interquartile range

— Univariate Analysis: Tabular Exploration —
Distribution measure: Range
Range: the range is the difference between the maximum and minimum observed values of an attribute.
  Example: 1 22 6 8 10 16 5
  Range = 22 - 1 = 21

— Univariate Analysis: Tabular Exploration —
Distribution measure: Standard deviation
Standard deviation: the standard deviation measures the spread of the values. It represents the typical deviation of the values from their mean: the larger the standard deviation, the more scattered the values are. However, like the mean, it is sensitive to outliers.
  std = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - mean)^2 }
  Example: 1 22 6 8 10 16 5
  Std ≈ 6.61

— Univariate Analysis: Tabular Exploration —
Distribution measure: Variance
Variance: the variance is the square of the standard deviation. It is also used to measure how far the values are from their mean.
  Variance = \frac{1}{n} \sum_{i=1}^{n} (x_i - mean)^2
  Example: 1 22 6 8 10 16 5
  Variance = 43.63

— Univariate Analysis: Tabular Exploration —
Distribution measure: Coefficient of variation
Coefficient of variation: the coefficient of variation (CV) is the standard deviation divided by the mean. CV can be used to compare variables with different units or widely different means.
  CV = Std / Mean
  Example: 1 22 6 8 10 16 5
  Std ≈ 6.61, Mean = 68 / 7 ≈ 9.71
  CV ≈ 6.61 / 9.71 ≈ 0.68

— Univariate Analysis: Tabular Exploration —
Distribution measure: Interquartile range
Interquartile range: the interquartile range (IQR) measures the spread of the values using Q1 and Q3.
  IQR = Q3 - Q1

— Univariate Analysis: Tabular Exploration —
To summarise, tabular exploration should analyse values using both location measures and distribution measures.
Location measures: used to analyse and understand where the data is located. They describe the central tendency and the positions of data values:
  Minimum, Maximum, Mean, Median, First quartile (Q1), Third quartile (Q3), Mode.
Distribution measures: used to analyse the spread of values for a particular variable:
  Range, Standard deviation, Variance, Interquartile range, Coefficient of variation.
[Figure: values plotted against id for groups G1 and G2, repeated from the earlier slide]

— Univariate Analysis: Graphical Exploration —
In univariate analysis, we can use plots and charts to visualise variable values as follows:
  Continuous variables (e.g. 1.6, 22, 6.7, 8.7, 10): histogram, boxplot and dot chart.
  Categorical variables (e.g. A, C, Good, D, Blue): bar chart and pie chart.

— Univariate Analysis: Graphical Exploration —
For continuous variables, plots and charts can be used to examine:
  Measures of location
  Measures of spread
  Asymmetry
  Outliers
  Gaps

— Univariate Analysis: Graphical Exploration —
Continuous Variables: Histograms
[Figure: histogram; count axis from 0 to 40]
  Measures of location: all values are between 4.5 and 25.
  Measures of spread: most values are less than 12, and most of those lie between 5 and 7.
  Asymmetry: the distribution is skewed to the right.
  Outliers: there are a few outliers, around value = 25.
  Gaps: there is a gap between 23 and 25.

— Univariate Analysis: Graphical Exploration —
Continuous Variables: Dot Chart
[Figure: dot chart of wage (x axis, 4 to 25), coloured by sex (F, M)]

— Univariate Analysis: Graphical Exploration —
Continuous Variables: Boxplot
  ggplot(dat, aes(x = sex, y = wage, fill = sex)) + geom_boxplot()
[Figure: boxplot of wage (5 to 25) by sex]

— Univariate Analysis: Graphical Exploration —
For categorical variables, plots and charts can be used for the following purposes:
  Count of each category
  Proportion of each category
  Imbalanced categories
  Mislabelled categories

— Univariate Analysis: Graphical Exploration —
Categorical Variables: Bar Chart and Pie Chart
[Figure: bar chart of counts (0 to 100) per sector (clerical, const, manag, manuf, other, prof, sales, service), filled by sex (F, M); pie chart of sector frequencies]

— Univariate Analysis: Graphical Exploration —
Categorical Variables: Bar Chart
[Figure: bar charts of the percentage of observations (0% to 20%) per sector]

Examples of Data Exploration

In this lecture, we will learn:
  How to use R for tabular exploration.
  How to use R for graphical exploration.
To this end, it is assumed that you know how to:
  import data
  organise data
  clean data
  normalise data
  visualise data

In this lecture, we will use the Australian weather data (weatherAUS.csv). The data contains daily weather observations from numerous weather stations. We will do a basic analysis of the following variables:
  Temperature
  Wind direction

Load libraries, read data and print column names.
dat <- read.csv("weatherAUS.csv", header = TRUE, sep = ",")
head(dat)
##         Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir
## 1 2008-12-01   Albury    13.4    22.9      0.6          NA       NA           W
## 2 2008-12-02   Albury     7.4    25.1      0.0          NA       NA         WNW
## 3 2008-12-03   Albury    12.9    25.7      0.0          NA       NA         WSW
##   WindGustSpeed WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am
## 1            44          W        WNW           20           24          71
## 2            44        NNW        WSW            4           22          44
## 3            46          W        WSW           19           26          38
##   Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
## 1          22      1007.7      1007.1        8       NA    16.9    21.8
## 2          25      1010.6      1007.8       NA       NA    17.2    24.3
## 3          30      1007.6      1008.7       NA        2    21.0    23.2
##   RainToday RISK_MM RainTomorrow
## 1        No     0.0           No
## 2        No     0.0           No
## 3        No     0.0           No

Print the structure of the whole data set

str(dat)
## 'data.frame': 142193 obs. of 24 variables:
## $ Date         : chr "2008-12-01" "2008-12-02" "2008-12-03" "2008-12-04" ...
## $ Location     : chr "Albury" "Albury" "Albury" "Albury" ...
## $ MinTemp      : num 13.4 7.4 12.9 9.2 17.5 14.6 14.3 7.7 9.7 13.1 ...
## $ MaxTemp      : num 22.9 25.1 25.7 28 32.3 29.7 25 26.7 31.9 30.1 ...
## $ Rainfall     : num 0.6 0 0 0 1 0.2 0 0 0 1.4 ...
## $ Evaporation  : num NA NA NA NA NA NA NA NA NA NA ...
## $ Sunshine     : num NA NA NA NA NA NA NA NA NA NA ...
## $ WindGustDir  : chr "W" "WNW" "WSW" "NE" ...
## $ WindGustSpeed: int 44 44 46 24 41 56 50 35 80 28 ...
## $ WindDir9am   : chr "W" "NNW" "W" "SE" ...
## $ WindDir3pm   : chr "WNW" "WSW" "WSW" "E" ...
## $ WindSpeed9am : int 20 4 19 11 7 19 20 6 7 15 ...
## $ WindSpeed3pm : int 24 22 26 9 20 24 24 17 28 11 ...
## $ Humidity9am  : int 71 44 38 45 82 55 49 48 42 58 ...
## $ Humidity3pm  : int 22 25 30 16 33 23 19 19 9 27 ...
## $ Pressure9am  : num 1008 1011 1008 1018 1011 ...
## $ Pressure3pm  : num 1007 1008 1009 1013 1006 ...
## $ Cloud9am     : int 8 NA NA NA 7 NA 1 NA NA NA ...
## $ Cloud3pm     : int NA NA 2 NA 8 NA NA NA NA NA ...
## $ Temp9am      : num 16.9 17.2 21 18.1 17.8 20.6 18.1 16.3 18.3 20.1 ...
## $ Temp3pm      : num 21.8 24.3 23.2 26.5 29.7 28.9 24.6 25.5 30.2 28.2 ...
## $ RainToday    : chr "No" "No" "No" "No" ...
## $ RISK_MM      : num 0 0 0 1 0.2 0 0 0 1.4 0 ...
## $ RainTomorrow : chr "No" "No" "No" "No" ...

Check whether there are missing values. Print the number of missing values in each column.

sapply(dat, function(x) sum(is.na(x)))
##          Date      Location       MinTemp       MaxTemp      Rainfall
##             0             0           637           322          1406
##   Evaporation      Sunshine   WindGustDir WindGustSpeed    WindDir9am
##         60843         67816          9330          9270         10013
##    WindDir3pm  WindSpeed9am  WindSpeed3pm   Humidity9am   Humidity3pm
##          3778          1348          2630          1774          3610
##   Pressure9am   Pressure3pm      Cloud9am      Cloud3pm       Temp9am
##         14014         13981         53657         57094           904
##       Temp3pm     RainToday       RISK_MM  RainTomorrow
##          2726          1406             0             0

Delete column(s)

Print summary (not complete due to size limitation)

summary(dat)
##      Date             Location            MinTemp         MaxTemp
##  Length:142193      Length:142193      Min.   :-8.50   Min.   :-4.80
##  Class :character   Class :character   1st Qu.: 7.60   1st Qu.:17.90
##  Mode  :character   Mode  :character   Median :12.00   Median :22.60
##                                        Mean   :12.19   Mean   :23.23
##                                        3rd Qu.:16.80   3rd Qu.:28.20
##                                        Max.   :33.90   Max.   :48.10
##                                        NA's   :637     NA's   :322
##     Rainfall       Evaporation        Sunshine      WindGustDir
##  Min.   :  0.00   Min.   :  0.00   Min.   : 0.00   Length:142193
##  1st Qu.:  0.00   1st Qu.:  2.60   1st Qu.: 4.90   Class :character
##  Median :  0.00   Median :  4.80   Median : 8.50   Mode  :character
##  Mean   :  2.35   Mean   :  5.47   Mean   : 7.62
##  3rd Qu.:  0.80   3rd Qu.:  7.40   3rd Qu.:10.60
##  Max.   :371.00   Max.   :145.00   Max.   :14.50
##  NA's   :1406     NA's   :60843    NA's   :67816
##  WindGustSpeed     WindDir9am         WindDir3pm         WindSpeed9am
##  Min.   :  6.00   Length:142193      Length:142193      Min.   :  0
##  1st Qu.: 31.00   Class :character   Class :character   1st Qu.:  7
##  Median : 39.00   Mode  :character   Mode  :character   Median : 13
##  Mean   : 39.98                                         Mean   : 14
##  3rd Qu.: 48.00                                         3rd Qu.: 19
##  Max.   :135.00                                         Max.   :130
##  NA's   :9270                                           NA's   :1348

— Basic analysis: min temperature variable —
min_temp <- dat$MinTemp
summary(min_temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   -8.50    7.60   12.00   12.19   16.80   33.90     637

Print the standard deviation

Both the mean of 12.19 and the median of 12 indicate where the centre of the data is located and what the typical minimum temperature is: about 12. The standard deviation is 6.04. This means that the minimum temperature typically deviates from the mean by about 6.04, which also indicates that the values are widely scattered.

— Basic analysis: min temperature variable —
Remove NA from min temp
summary(min_temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   -8.50    7.60   12.00   12.19   16.80   33.90     637
min_temp <- na.omit(min_temp)
summary(min_temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   -8.50    7.60   12.00   12.19   16.80   33.90

— Basic analysis: min temperature variable —
Histogram - min temp
[Figure: "Histogram of MinTemp"; x axis MinTemp (0 to 30), count axis 0 to 600]

— Basic analysis: min temperature variable —
Histogram by location - min temp
[Figure: "Histogram of MinTemp by Location"; x axis MinTemp (0 to 30)]
The histogram of these data shows that the typical values are also centred around 12. Since the histogram is slightly skewed in a positive direction, the mean is slightly larger than the median.
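As a minimal sketch of the standard deviation and missing-value steps above: the code that prints the standard deviation is not shown in the transcript, so the call below is an assumption that mirrors the sd(max_temp, na.rm = TRUE) call used for the maximum temperature. To stay self-contained, it uses a small stand-in vector (the first ten MinTemp values from the str() output plus one NA) rather than the full data set, so its numbers differ from the 6.04 reported above. It also shows that R's sd() and var() divide by n - 1, whereas the formula on the slides divides by n.

```r
# Stand-in for dat$MinTemp: first ten values from str(dat) plus one missing value
min_temp <- c(13.4, 7.4, 12.9, 9.2, 17.5, 14.6, 14.3, 7.7, 9.7, 13.1, NA)

sd_with_na <- sd(min_temp)      # NA: a missing value propagates
sd(min_temp, na.rm = TRUE)      # ignore missing values when computing the SD

min_temp <- as.numeric(na.omit(min_temp))   # drop the NA, as in the slides
length(min_temp)                            # 10 values remain

# Distribution measures for the stand-in vector
range_width <- max(min_temp) - min(min_temp)   # range
iqr <- IQR(min_temp)                           # interquartile range, Q3 - Q1
cv <- sd(min_temp) / mean(min_temp)            # coefficient of variation

# Seven-value example from the slides: population (1/n) vs sample (1/(n - 1))
x <- c(1, 22, 6, 8, 10, 16, 5)
n <- length(x)
pop_var <- sum((x - mean(x))^2) / n   # about 43.63, as on the slide
pop_sd <- sqrt(pop_var)               # about 6.61
sample_sd <- sd(x)                    # about 7.13, R's default
pop_cv <- pop_sd / mean(x)            # about 0.68
```

For a data set as large as weatherAUS.csv the difference between dividing by n and by n - 1 is negligible; it only becomes visible on small examples like the seven-value one.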
— Basic analysis: min temperature variable —
Box Plot - min temp
[Figure: box plot of MinTemp; value axis 0 to 30]

— Basic analysis: min temperature variable —
Box Plot - min temp by Location
ggplot(data = dat1, mapping = aes(x = Location, y = MinTemp, fill = Location)) +
  geom_boxplot() +
  coord_flip()
[Figure: horizontal box plots of MinTemp (0 to 30) for each Location: AliceSprings, Brisbane, Cairns, Canberra, Cobar, CoffsHarbour, Darwin, Hobart, Melbourne, MelbourneAirport, Mildura, Moree, MountGambier, NorfolkIsland, Nuriootpa, Perth, PerthAirport, Portland, Sale, Sydney, SydneyAirport, Townsville, WaggaWagga, Watsonia, Williamtown, Woomera]

— Basic analysis: min temperature variable —
Density - min temp
[Figure: density of MinTemp; x axis 0 to 30, density axis 0.00 to 0.04]

— Basic analysis: max temperature variable —
max_temp <- dat$MaxTemp
summary(max_temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   -4.80   17.90   22.60   23.23   28.20   48.10     322

Print SD - max temp
sd(max_temp, na.rm = TRUE)
## [1] 7.117618

Remove NAs
summary(max_temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   -4.80   17.90   22.60   23.23   28.20   48.10     322
max_temp <- na.omit(max_temp)
summary(max_temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   -4.80   17.90   22.60   23.23   28.20   48.10

Both the mean of 23.23 and the median of 22.60 indicate where the centre of the data is located and what the typical maximum temperature is: about 23. The analysis of the maximum temperature is similar to that of the minimum temperature.

— Basic analysis: max temperature variable —
Histogram by location - max temp
[Figure: "Histogram of MaxTemp by Location"; x axis MaxTemp (10 to 50)]
The histogram of these data shows that the typical values are also centred around 23. Since the histogram is slightly skewed in a positive direction, the mean is slightly larger than the median.
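The box plots in this part of the lecture summarise each variable through its quartiles. As a rough illustration of the numbers behind a box plot, the sketch below uses base R's boxplot.stats(), which by default flags points lying more than 1.5 times the box length beyond the box. The vector is a stand-in (the first ten MaxTemp values from the str() output plus one artificial extreme value of 60), not the full data set, and the last lines repeat the mean-versus-median skew check from the text.

```r
# Stand-in for dat$MaxTemp: first ten values from str(dat) plus an artificial extreme
max_temp <- c(22.9, 25.1, 25.7, 28, 32.3, 29.7, 25, 26.7, 31.9, 30.1, 60)

bs <- boxplot.stats(max_temp)   # default coef = 1.5
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points beyond the whiskers, drawn individually on a box plot

# Positive (right) skew pulls the mean above the median
right_skewed <- mean(max_temp) > median(max_temp)
right_skewed
```

Here the artificial value 60 is the only point reported in bs$out; on the real data the same call would list every observation that ggplot2's geom_boxplot() draws as a separate dot, since both use the 1.5 rule by default.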
— Basic analysis: max temperature variable —
Box plot - max temp by location
[Figure: horizontal box plots of MaxTemp (10 to 50) for each Location, AliceSprings to Woomera]

— Basic analysis: temperature variable —
Box plot of max vs min temperature
[Figure: box plot of MinTemp (0 to 30) beside box plot of MaxTemp (10 to 50)]

— Basic analysis: temperature variable —
Density of max vs min temperature
[Figure: density of MaxTemp (10 to 50) and density of MinTemp (0 to 30)]

— Basic analysis: Wind Direction —
Frequency

— Basic analysis: Wind Direction —
Bar charts
[Figure: "Wind Dir 3pm from 2008-2017"; horizontal bars of Count (0 to 4000) for each Direction: E, ENE, ESE, N, NE, NNE, NNW, NW, S, SE, SSE, SSW, SW, W, WNW, WSW]

— Basic analysis: Wind Direction —
Bar charts: Frequency percentage
[Figure: "Wind Dir 3pm 2008-2017"; bars of relative frequency (0.00 to 0.06) for each Direction]

End of Week 6. See you next lecture (Week 7), 06/09/2021.
Table: CSE5DEV Timetable (check LMS).

Week 7: Data Exploration (CSE5DEV)

Outline: Syllabus; Week Overview; Data Exploration; Bivariate Analysis: Tabular Exploration; Bivariate Analysis: Graphical Exploration; Examples of Data Exploration.

Week 7 Overview
Learning outcomes:
  Develop a high-level understanding of the data.
  Understand the relationships between variables.
  Learn how to analyse bivariate and multivariate variables.

What have we learned so far?
Data can come in different formats, but a computer program expects your data to be organised in a well-defined structure.

What have we learned so far? —— Theory ——
  Collecting and wrangling: working with data; reading and correcting data.
  Cleaning and normalising: converting dirty data into correct data; cleaning and handling missing values; normalising or standardising data.
  Data visualisation: scatter plots, boxplots and line plots.
  Data exploration: univariate analysis.

What have we learned so far? —— R Programming ——
  Install R and RStudio, create an R Markdown file, write and run basic code, etc.
  Data types and data structures (vector, factor, matrix and data frame): view, access, change, etc.
  Import data into the R environment (text files and CSV files).
  Correct or change the format of the data to make it tidy.
  Clean the data.
  Normalise the data.
  Data visualisation using ggplot2.
  Data exploration: tabular and graphical exploration.

Base R Cheat Sheet (repeated from Week 6)
Recall ....
Data can be one of two main categories: experimental or observational.

Recall ....
Data variable values can be:
  Numeric:
    Discrete: integer values.
    Continuous: any value in a pre-defined range (float, double).
  Categorical: values are selected from a predefined number of categories.
    Ordinal: categories can be meaningfully ordered.
    Nominal: categories do not have any order.
    Binary: the special case of nominal, with only 2 possible categories.
  Date: datetime, timestamp.
  Text.
  Multidimensional data.
  Time series: data points indexed in time order.

Recall ....
When you get a new data set (or a project), you should ask yourself a few questions before you start exploring it:
  What is in it?
  What is wrong with it?
  What should I do with it?
  Who is going to read or implement your analysis (and what do they already know)?

Recall ....
Data exploration can:
  give you a sense of the distribution of the data.
  help you to check whether there are trends and relationships.
  help you to know the possible values for each characteristic.

Recall ....
Before trying to extract useful insight from the data, we should define the problem (formulate the question) to be solved. The problem definition determines how the data analysis plan is executed. The problem definition tasks include:
  The main objective of the analysis.
  The main deliverable.
  How to perform the analysis.
  How to use or implement the findings.
Based on the problem definition, we can create an execution plan.

Recall: data exploration approaches are: [figure]

Recall: data exploration tools can be used to analyse data variables as follows:
Univariate analysis: the simplest form of data analysis. It analyses and provides summary statistics for each single variable in the data.
Bivariate analysis: used to analyse the relationship between two variables of the data.
Multivariate analysis: used to analyse the relationship between more than two variables of the data.

Recall: a univariate variable is a single feature, attribute or column; bivariate means two variables; multivariate means more than two variables.
Univariate     Explore one variable of the same data type.
Bivariate      Explore two variables or columns of the same or different data types.
Multivariate   Explore three or more variables of the same or different data types.

Recall: in Univariate analysis, we explore each variable separately. A variable can be Categorical or Numerical. The ultimate aim is to summarise and analyse the pattern present in each variable. We can explore each variable using the following approaches:
Tabular Exploration: location measures (min, max, mean, etc.) and distribution measures (standard deviation, range, etc.).
Graphical Exploration: histograms, bar charts, etc.

Why do we need to use graphical methods to explore the data?

What is the range of 4, 6, 2, 9, 3, 7, 10?
Min = 2, Max = 10
Range = Max - Min = 10 - 2 = 8

The range is easy to calculate, but it can sometimes be misleading. For example, what is the range of 4, 8, 12, 6, 9, 7, 6, 3518?
Min = 4, Max = 3518
Range = 3518 - 4 = 3514
A single extreme value (3518) determines almost the entire range.

Mean vs. Median

        Values                              Mean      Median
Row 1   53 55 56 57 58 60 65 66 67 70 72    61.72727  60
Row 2   51 56 56 56 58 58 67 67 67 71 72    61.72727  58
Row 3   51 56 56 56 58 58 70 77 80 71 72    64.09091  58
Row 4   50 50 50 56 58 58 59 59 60 61 62    56.63636  58

From the above example, we can see that the values in Row 1 and Row 2 have the same mean but different medians, while in Row 3 and Row 4 the means are different but the medians are the same. This indicates that neither measure on its own is an effective way to reveal the shape of the distribution. Hence, we can use histograms to understand the centre of the data.

SD, Mean & Median

        Values          SD        Mean   Median
Row 1   5 6 8 10 12     2.863564  8.2    8
Row 2   2 3 9 11 16     5.80517   8.2    9
Row 3   1 1 20 9 3      8.074652  6.8    3
Row 4   0 6 18 12 15    7.224957  10.2   12

From the above example, we can see that the spread of values (SD) differs even when the mean is the same (Row 1 and Row 2). Both the mean and the SD are very sensitive to outliers. This again justifies the need for graphical methods to understand the data distribution.

Graphical methods can be used to understand data distribution. Data distribution can be symmetrical, left skewed or right skewed.
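The summary figures above can be checked in base R. The following is a minimal sketch, assuming the rows of the "SD, Mean & Median" table are typed in by hand; the names `rows`, `describe` and `with_outlier` are illustrative and do not come from the lecture.

```r
# Rows from the "SD, Mean & Median" table, typed in by hand.
rows <- list(
  row1 = c(5, 6, 8, 10, 12),
  row2 = c(2, 3, 9, 11, 16),
  row3 = c(1, 1, 20, 9, 3),
  row4 = c(0, 6, 18, 12, 15)
)

# Hypothetical helper: location and spread measures for one numeric vector.
describe <- function(x) {
  c(min = min(x), max = max(x), range = max(x) - min(x),
    mean = mean(x), median = median(x), sd = sd(x))
}

# One line of summary statistics per input row.
stats <- t(sapply(rows, describe))
print(round(stats, 3))

# A single extreme value dominates the range (Max - Min).
with_outlier <- c(4, 8, 12, 6, 9, 7, 6, 3518)
print(describe(with_outlier)["range"])
```

Note that `sd()` uses the sample standard deviation (divisor n - 1), which is what the SD column in the table reports.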
Graphical methods: a histogram can be used to:
understand the central value.
understand how a difference in mean values shifts the distributions.
understand the spread of values.
identify outliers.
understand and compare distributions (example: Pima Indians diabetes dataset).

Graphical methods: in the previous example there were several outliers, but they are very hard to identify using histograms. Box plots are better for identifying outliers.

— Bivariate Analysis —

In this lecture, we will perform Bivariate and Multivariate analysis.
Bivariate: explores two variables (columns) of the same or different data types.
Multivariate: explores three or more variables (columns) of the same or different data types.

Bivariate analysis can be seen as a part of multivariate analysis. The variables to explore can be:
Numerical vs Numerical.
Numerical vs Categorical.
Categorical vs Categorical.

The variables in bivariate (or multivariate) analysis can be independent or dependent. As the name indicates, the values of the dependent variable rely on the values of the independent variable. For example, let's consider the relationship between student answers and their final mark.
Assume we have an exam that consists of 10 questions. 10 marks are given for each correct answer, and 0 otherwise. If we answer all 10 questions correctly, we get 100 marks; if we answer 9 questions correctly, we get 90 marks, and so on. From the table, we can see that the values of the Mark variable depend completely on the values of the No of answered questions variable.

— Bivariate Analysis —

Similar to Univariate analysis, we will use the following approaches to conduct the Bivariate analysis and Multivariate analysis:
Tabular Exploration.
Graphical Exploration.

Numerical  Numerical  Numerical  Categorical  Categorical  Categorical
1          12         1          Good         A            One
2.3        2          10         A            B            False
6          11         3.5        False        Blue         Red
10         2.3        6          Blue         Good         North

— Bivariate Analysis: Tabular Exploration —

Tabular exploration uses statistical methods to explore the relationship between variables. Statistical methods can help us to draw valid conclusions when exploring variables. Examples are:
Difference: checks whether there is a significant difference or not.
Relationship: checks whether there is a relationship between the variables or not.
Distribution: checks whether the two variables have the same distribution or not.

Bivariate Analysis: Tabular Exploration — Numerical vs Numerical

To determine whether there is a difference between two numerical variables, we can use the paired samples t-test:

$$ t = \frac{m}{sd / \sqrt{n}} $$

m: mean (of the paired differences)
sd: standard deviation (of the paired differences)
n: number of values (pairs)

The paired samples t-test compares the means of two variables and returns a p-value. Based on the p-value, we can conclude whether there is a difference between the two numerical variables or not.
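The formula above can be checked against R's built-in test. This is a minimal sketch under stated assumptions: the vectors `x` and `y` are made-up paired measurements (not lecture data), and the manual calculation is shown only to illustrate that `t.test(..., paired = TRUE)` computes the same statistic.

```r
# Hypothetical paired measurements: the same 8 subjects measured twice.
x <- c(12.1, 11.4, 13.0, 10.8, 12.6, 11.9, 13.3, 12.0)
y <- c(12.9, 11.8, 13.1, 11.5, 13.4, 12.1, 13.9, 12.8)

# Manual computation: t = m / (sd / sqrt(n)) on the paired differences.
d <- x - y
m <- mean(d)
s <- sd(d)
n <- length(d)
t_manual <- m / (s / sqrt(n))

# Built-in paired samples t-test; reports the t statistic and the p-value.
res <- t.test(x, y, paired = TRUE)
print(res)

cat("manual t =", t_manual, " t.test t =", unname(res$statistic), "\n")
```

A small p-value (commonly below 0.05) is usually read as evidence that the two means differ; the threshold is a convention, not something fixed by the test itself.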
Bivariate Analysis: Tabular Exploration — Numerical vs Numerical

Example: Write R code to calculate the paired samples t-test for x and y variables.

Bivariate Analysis: Tabular Exploration — Categorical vs Categorical

Categorical variables can be compared using the Chi-square (χ²) test:

$$ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} $$

O_i: observed value
E_i: expected value

The χ² statistic gives a p-value. Based on the p-value, we can check whether the results are significant or not.

Example: Write R code to calculate the Chi-square (χ²) test for x and y variables.

#    WindGustDir  RainTomorrow
1    W            No
2    WNW          No
3    WSW          No
4    NE           No
5    W            No
6    WNW          No
7    W            No
8    W            No
9    NNW          Yes
10   W            No
11   N            Yes
12   NNE          Yes
13   W            Yes
14   SW           No
15   WNW          No

dat <- read.csv("weatherAUS.csv", header = TRUE)
chisq <- chisq.test(dat$WindGustDir, dat$RainTomorrow)
print(chisq)
##
##  Pearson's Chi-squared test
##
## data:  dat$WindGustDir and dat$RainTomorrow
## X-squared = 1519.9, df = 15, p-value < 2.2e-16

— Bivariate Analysis: Tabular Exploration — Categorical vs Numerical

Categorical versus numerical variables can be compared using the Chi-square (χ²) test, provided the numerical variable is first grouped into categories, because the test operates on counts.

Example: Write R code to calculate the Chi-square (χ²) test for x and y variables.

— Bivariate Analysis: Tabular Exploration —

We could also summarise all variable statistics in one table as follows:

It should be noted that the selection of the statistical test is closely related to the data types, the distribution, and the relationships between variables. Therefore, we often ask the following questions before using a statistical test:
Which variable is dependent and which one is independent?
How many categories are there for categorical data?
What is the distribution of the data: normally distributed or not normally distributed?

— Multivariate Analysis —

This type of analysis is very similar to bivariate analysis; the main difference is that in Multivariate analysis we have more than two variables. Statistical test methods can be applied, but there should be several dependent variables. We can perform multivariate analysis using Cluster Analysis, Correlation Analysis or Principal Component Analysis (PCA).

— Bivariate Analysis: Graphical Exploration —

In Bivariate Analysis, we can use plots and charts to visualise variable values as follows:
Numerical and Numerical variables
Numerical and Categorical variables
Categorical and Categorical variables

— Bivariate Analysis: Graphical Exploration — Numerical and Numerical: Scatter plot

We can use the scatter plot to explore the relationship between variables. From the figure, the relationship between the carat size and the price of a diamond appears to be exponential, but the plot is unclear: as the number of values grows, there is a lot of overplotting. We can reduce this problem by using the alpha aesthetic to add transparency to the plot.

The plot looks much better now. However, as can be seen, transparency does not fully fix the issue, especially if we have a very large number of values. We can address this using the geom_bin2d() function to bin in two dimensions.
The geom_bin2d() function divides the coordinate plane into 2D bins and then uses a fill colour to show how many points fall into each bin.

Bivariate Analysis: Graphical Exploration — Numerical and Numerical: Scatter plot

Relationship: there is a positive (exponential) relationship between the two variables. As the carat size increases, the price also increases.
Outliers: several outliers appear in the carat size direction. The outliers at the far right of the chart suggest that these diamonds are very large but sold for average prices.
Clusters: there is a lot of overplotting. We can see clustering of data points in the centre of the variable relationship. For example, diamonds with a size between 1 and 2 carats appear to have a price ceiling of 5000 or more.

— Bivariate Analysis: Graphical Exploration — Numerical and Categorical

In the numerical and categorical analysis, the numerical values are broken down by the categorical values. For this analysis we can use:
Bar chart: to show summary statistics (e.g., means or medians) of a numerical variable for each level of the categorical variable.
Boxplot: to compare the levels of a categorical variable on the numerical variable.
Density plot: to compare the numerical variable across the levels of the categorical variable.

— Bivariate Analysis: Graphical Exploration — Numerical and Categorical: Bar chart

[Figure: bar chart of Mean Price by cut (Fair, Good, Ideal, Premium, Very Good)]

— Bivariate Analysis: Graphical Exploration — Numerical and Categorical: Boxplot

[Figure: boxplot of price (0 to 15000) by cut (Fair, Good, Ideal, Premium, Very Good)]

From the boxplot figure, we can observe the following:
It gives little information about the shape of the distribution.
Better quality diamonds are cheaper on average! We have to investigate why.
The cut quality is ordered: fair is worse than good, good is worse than very good, and so on.
There are many outliers in all types of cut.
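The plots discussed above can be sketched with ggplot2. This is a hedged sketch rather than the lecture's own code: it assumes the figures are based on the `diamonds` dataset that ships with ggplot2, and the `alpha = 0.1` value and the object names (`p_scatter`, `p_bin`, and so on) are illustrative choices.

```r
library(ggplot2)  # also provides the diamonds dataset

# Numerical vs numerical: scatter plot with transparency to reduce overplotting.
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)

# Same relationship binned in two dimensions; fill colour shows the count per bin.
p_bin <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d()

# Numerical vs categorical: mean price per cut as a bar chart.
mean_price <- aggregate(price ~ cut, data = diamonds, FUN = mean)
p_bar <- ggplot(mean_price, aes(x = cut, y = price)) +
  geom_col() +
  labs(y = "Mean Price")

# Numerical vs categorical: boxplot of price for each cut.
p_box <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()

# Numerical vs categorical: density of price, one curve per cut.
p_density <- ggplot(diamonds, aes(x = price, colour = cut)) +
  geom_density()

# Tabular counterpart of the bar chart.
print(mean_price)
```

Each plot object is drawn by printing it, for example `print(p_box)`. Printing `mean_price` is a quick way to check the "better quality diamonds are cheaper on average" observation numerically.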
— Bivariate Analysis: Graphical Exploration — Numerical and Categorical: Density plot

[Figure: density plot of price (0 to 20000), one curve per cut (Fair, Good, Ideal, Premium, Very Good)]

From the density plot figure we can see that the fair diamonds (the lowest quality) have the highest average price.

— Bivariate Analysis: Graphical Exploration — Categorical and Categorical

We can count the number of observations that occur at each combination of the categorical values. This can be accomplished using the geom_count() function or a grouped bar chart.

[Figure: count plot of color (D to J) against cut; point size shows the count n]
[Figure: grouped bar chart of counts by cut, one bar per color (D to J)]

We could also present the bar charts based on the percentage for each category.

[Figure: bar chart of the percentage (0% to 100%) of each Color (D to J) within each cut]

— Multivariate Analysis: Graphical Exploration —

In multivariate analysis, graphical exploration methods can be used to display the relationships among three or more variables. This can be achieved by using the following methods:
Grouping: the values of the first variable are mapped to the x axis and the values of the second variable are mapped to the y axis. Other variables can be mapped to colour, transparency, size, line type, and shape.
Faceting: displays several separate plots, one for each level of a third variable or combination of variables.
3D plot: displays three variables as a 3D plot.
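The categorical-versus-categorical charts and the grouping and faceting methods listed above can be sketched as follows. As before, this assumes the `diamonds` dataset from ggplot2; the object names, `alpha = 0.3` and `bins = 30` are illustrative assumptions, not values taken from the lecture.

```r
library(ggplot2)

# Categorical vs categorical: point size encodes the number of diamonds
# at each (cut, color) combination.
p_count <- ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()

# Grouped bar chart: one bar per color within each cut.
p_grouped <- ggplot(diamonds, aes(x = cut, fill = color)) +
  geom_bar(position = "dodge")

# Percentage bar chart: the color shares within each cut sum to 100%.
p_percent <- ggplot(diamonds, aes(x = cut, fill = color)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = function(v) paste0(100 * v, "%"))

# Grouping: a third variable (cut) mapped to colour on a carat/price scatter plot.
p_group <- ggplot(diamonds, aes(x = carat, y = price, colour = cut)) +
  geom_point(alpha = 0.3)

# Faceting: one carat histogram per level of cut.
p_facet <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ cut) +
  labs(title = "Carat histograms by cut")

# The counts behind the first three plots, as a plain contingency table.
counts <- table(diamonds$color, diamonds$cut)
print(counts)
```

Grouping keeps everything in one panel, which makes direct comparison easy but can worsen overplotting; faceting trades that for separate panels with shared axes.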
— Multivariate Analysis: Graphical Exploration — Grouping

— Multivariate Analysis: Graphical Exploration — Faceting

[Figure: "Carat histograms by cut", one panel per cut (Fair, Good, Ideal, Premium, Very Good); x axis: carat (0 to 5); y axis: count (0 to 8000)]

— Multivariate Analysis: Graphical Exploration — 3D plot

Examples of Data Exploration

In this lecture, we will learn how to use R for graphical exploration. To this end, it is assumed that you know how to:
import data
organise data
clean data
normalise data
visualise data

In this lecture, we will use the Australian weather data (weatherAUS.csv). The data contains daily weather observations from numerous weather stations. We will do basic Bivariate Analysis to answer a few questions, such as:
The relationship between WindGustSpeed and MinTemp.
The relationship between WindSpeed and Pressure9am.
The relationship between rain Today and Tomorrow.
etc.

Load libraries, read data and print column names.
dat <- read.csv("weatherAUS.csv", header = TRUE, sep = ",")
head(dat)
##         Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir
## 1 2008-12-01   Albury    13.4    22.9      0.6          NA       NA           W
## 2 2008-12-02   Albury     7.4    25.1      0.0          NA       NA         WNW
## 3 2008-12-03   Albury    12.9    25.7      0.0          NA       NA         WSW
##   WindGustSpeed WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am
## 1            44          W        WNW           20           24          71
## 2            44        NNW        WSW            4           22          44
## 3            46          W        WSW           19           26          38
##   Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm
## 1          22      1007.7      1007.1        8       NA    16.9    21.8
## 2          25      1010.6      1007.8       NA       NA    17.2    24.3
## 3          30      1007.6      1008.7       NA        2    21.0    23.2
##   RainToday RISK_MM RainTomorrow
## 1        No     0.0           No
## 2        No     0.0           No
## 3        No     0.0           No

Print the structure of the whole data set.

str(dat)
## 'data.frame': 142193 obs. of 24 variables:
## $ Date         : chr "2008-12-01" "2008-12-02" "2008-12-03" "2008-12-04" ...
## $ Location     : chr "Albury" "Albury" "Albury" "Albury" ...
## $ MinTemp      : num 13.4 7.4 12.9 9.2 17.5 14.6 14.3 7.7 9.7 13.1 ...
## $ MaxTemp      : num 22.9 25.1 25.7 28 32.3 29.7 25 26.7 31.9 30.1 ...
## $ Rainfall     : num 0.6 0 0 0 1 0.2 0 0 0 1.4 ...
## $ Evaporation  : num NA NA NA NA NA NA NA NA NA NA ...
## $ Sunshine     : num NA NA NA NA NA NA NA NA NA NA ...
## $ WindGustDir  : chr "W" "WNW" "WSW" "NE" ...
## $ WindGustSpeed: int 44 44 46 24 41 56 50 35 80 28 ...
## $ WindDir9am   : chr "W" "NNW" "W" "SE" ...
## $ WindDir3pm   : chr "WNW" "WSW" "WSW" "E" ...
## $ WindSpeed9am : int 20 4 19 11 7 19 20 6 7 15 ...
## $ WindSpeed3pm : int 24 22 26 9 20 24 24 17 28 11 ...
## $ Humidity9am  : int 71 44 38 45 82 55 49 48 42 58 ...
## $ Humidity3pm  : int 22 25 30 16 33 23 19 19 9 27 ...
## $ Pressure9am  : num 1008 1011 1008 1018 1011 ...
## $ Pressure3pm  : num 1007 1008 1009 1013 1006 ...
## $ Cloud9am     : int 8 NA NA NA 7 NA 1 NA NA NA ...
## $ Cloud3pm     : int NA NA 2 NA 8 NA NA NA NA NA ...
## $ Temp9am      : num 16.9 17.2 21 18.1 17.8 20.6 18.1 16.3 18.3 20.1 ...
## $ Temp3pm      : num 21.8 24.3 23.2 26.5 29.7 28.9 24.6 25.5 30.2 28.2 ...
## $ RainToday    : chr "No" "No" "No" "No" ...
## $ RISK_MM      : num 0 0 0 1 0.2 0 0 0 1.4 0 ...
## $ RainTomorrow : chr "No" "No" "No" "No" ...

Examples of data exploration

Check if there are missing values. Print the number of missing values in each column.

sapply(dat, function(x) sum(is.na(x)))
##          Date      Location       MinTemp       MaxTemp      Rainfall
##             0             0           637           322          1406
##   Evaporation      Sunshine   WindGustDir WindGustSpeed    WindDir9am
##         60843         67816          9330          9270         10013
##    WindDir3pm  WindSpeed9am  WindSpeed3pm   Humidity9am   Humidity3pm
##          3778          1348          2630          1774          3610
##   Pressure9am   Pressure3pm      Cloud9am      Cloud3pm       Temp9am
##         14014         13981         53657         57094           904
##       Temp3pm     RainToday       RISK_MM  RainTomorrow
##          2726          1406             0             0

Delete column(s)

Print summary (not complete due to size limitation)

summary(dat)
##      Date             Location            MinTemp         MaxTemp
##  Length:142193      Length:142193      Min.   :-8.50   Min.   :-4.80
##  Class :character   Class :character   1st Qu.: 7.60   1st Qu.:17.90
##  Mode  :character   Mode  :character   Median :12.00   Median :22.60
##                                        Mean   :12.19   Mean   :23.23
##                                        3rd Qu.:16.80   3rd Qu.:28.20
##                                        Max.   :33.90   Max.   :48.10
##                                        NA's   :637     NA's   :322
##     Rainfall       Evaporation        Sunshine     WindGustDir
##  Min.   :  0.00   Min.   :  0.00   Min.   : 0.00   Length:142193
##  1st Qu.:  0.00   1st Qu.:  2.60   1st Qu.: 4.90   Class :character
##  Median :  0.00   Median :  4.80   Median : 8.50   Mode  :character
##  Mean   :  2.35   Mean   :  5.47   Mean   : 7.62
##  3rd Qu.:  0.80   3rd Qu.:  7.40   3rd Qu.:10.60
##  Max.   :371.00   Max.   :145.00   Max.   :14.50
##  NA's   :1406     NA's   :60843    NA's   :67816
##  WindGustSpeed     WindDir9am         WindDir3pm         WindSpeed9am
##  Min.   :  6.00   Length:142193      Length:142193      Min.   :  0
##  1st Qu.: 31.00   Class :character   Class :character   1st Qu.:  7
##  Median : 39.00   Mode  :character   Mode  :character   Median : 13
##  Mean   : 39.98                                         Mean   : 14
##  3rd Qu.: 48.00                                         3rd Qu.: 19
##  Max.   :135.00                                         Max.   :130
##  NA's   :9270                                           NA's   :1348

— Basic analysis: The relationship between WindGustSpeed and MinTemp —
Since both variables are numerical, we can use a scatter plot to find the relationship between the two variables.

— Basic analysis: The relationship between WindSpeed and Pressure9am —
Since both variables are numerical, we can use a scatter plot to find the relationship between the two variables.

— Basic analysis: The relationship between rain Today and Tomorrow —
Since both variables are categorical, we can use bar charts to display the relationship between the two variables.

[Figure: bar plot of the distribution of the RainToday variable; x axis: Rained Today (No, Yes, NA); y axis: percentage (0.0% to 65.0%); fill: Rained the next day (No, Yes)]

— Basic analysis: The relationship between the amount of Rainfall and the temperature —
In this analysis we have three variables: Rainfall (numerical), temperature (numerical) and Rain Tomorrow (categorical).

— Basic analysis: The relationship between humidity in the morning and the amount of Rainfall —
In this analysis we have three variables: Rainfall (numerical), humidity (numerical) and Rain Tomorrow (categorical).

— Basic analysis: Wind Direction and Rain Tomorrow —

— Basic analysis: Does Location affect the formation of Rain? —

— Basic analysis: Does Date (Month) affect the formation of Rain? —

— Basic analysis: Which variables affect the formation of rain tomorrow? —
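The first and third basic analyses above can be sketched as follows. Because weatherAUS.csv is not bundled with R, this sketch builds a small made-up data frame with the same column names; the values, the seed and the object names are assumptions for illustration only, and the lecture's own plotting code is not reproduced here.

```r
library(ggplot2)

# Hypothetical stand-in for a few weatherAUS.csv columns; replace with
# dat <- read.csv("weatherAUS.csv", header = TRUE) when the file is available.
set.seed(1)
n <- 200
dat <- data.frame(
  MinTemp       = round(rnorm(n, mean = 12, sd = 6), 1),
  WindGustSpeed = round(runif(n, min = 10, max = 90)),
  RainToday     = sample(c("No", "Yes", NA), n, replace = TRUE,
                         prob = c(0.70, 0.25, 0.05)),
  RainTomorrow  = sample(c("No", "Yes"), n, replace = TRUE,
                         prob = c(0.75, 0.25))
)

# Numerical vs numerical: scatter plot of WindGustSpeed against MinTemp.
p_wind <- ggplot(dat, aes(x = MinTemp, y = WindGustSpeed)) +
  geom_point(alpha = 0.4)

# Categorical vs categorical: percentage of each RainToday/RainTomorrow pair.
tab <- table(RainToday = dat$RainToday, RainTomorrow = dat$RainTomorrow,
             useNA = "ifany")
pct <- round(100 * prop.table(tab), 1)
print(pct)

# The same information as a bar chart; NA in RainToday is kept as its own bar.
p_rain <- ggplot(dat, aes(x = RainToday, fill = RainTomorrow)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  scale_y_continuous(labels = function(v) paste0(100 * v, "%")) +
  labs(x = "Rained Today", y = NULL, fill = "Rained the next day")
```

For the three-variable analyses, one option is to add `colour = RainTomorrow` to the scatter plot's `aes()`, which is the grouping method described earlier.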
— End of Week 7 —
See you next lecture (Week 8): Data Exploration.
Table: CSE5DEV Timetable. Check LMS.

Week 8 Data Exploration

Overview
CSE5DEV Syllabus
Week-Overview
Data Exploration
Examples of Time Series Data Exploration
Diagnostic Analytic
Examples of Diagnostic Analytic

Subject Syllabus

Week 8 Overview
Learning outcomes:
Develop a high-level understanding of the data.
Understand time series data analysis.
Understand diagnostic analytic.

What have we learned so far?
Data can be in different formats, but a computer program expects your data to be organised in a well-defined structure.

What have we learned so far? —— Theory ——
Collecting and Wrangling: working with data.
  Read & correct data.
Cleaning and Normalising: convert dirty data into correct data.
  Cleaning & Handling Missing Values.
  Normalising or Standardising Data.
Data Visualisation.
  Scatter plots, Boxplots, and Line plots.
Data Exploration.
  Univariate Analysis.
  Bivariate (multivariate) Analysis.

What have we learned so far? —— R Programming ——
Install R and RStudio, create an R Markdown file, write and run basic code, etc.
Data types and data structures (vector, factor, matrix and data frame).
View, Access, Change etc.
Import data into the R environment (text files and CSV files).
Correct or change the format of the data to make it tidy.
Clean the data.
Normalise the data.
Data visualisation using ggplot2.
Data Exploration: Tabular and Graphical Explorations.

Base R Cheat Sheet

Getting Help
Accessing the help files
?mean                          Get help of a particular function.
help.search('weighted mean')   Search the help files for a word or phrase.
help(package = 'dplyr')        Find help for a package.
More about an object
str(iris)                      Get a summary of an object's structure.
class(iris)                    Find the class an object belongs to.

Using Libraries
install.packages('dplyr')      Download and install a package from CRAN.
library(dplyr)                 Load the package into the session, making all its functions available to use.
dplyr::select                  Use a particular function from a package.
data(iris)                     Load a built-in dataset into the environment.

Working Directory
getwd()                        Find the current working directory (where inputs are found and outputs are sent).
setwd('C://file/path')         Change the current working directory.
Use projects in RStudio to set the working directory to the folder you are working in.

Vectors
Vector Functions
sort(x)                        Return x sorted.
rev(x)                         Return x reversed.
table(x)                       See counts of values.
unique(x)                      See unique values.

Selecting Vector Elements
By Position
x[4]                           The fourth element.
x[-4]                          All but the fourth.
x[2:4]                         Elements two to four.
x[-(2:4)]                      All elements except two to four.
x[c(1, 5)]                     Elements one and five.
By Value
x[x == 10]                     Elements which are equal to 10.
x[x < 0]                       All elements less than zero.
x[x %in% c(1, 2, 5)]           Elements in the set 1, 2, 5.
Named Vectors
x['apple']                     Element with name 'apple'.

Other sections: Creating Vectors; Programming (For Loop, While Loop, If Statements, Functions); Reading and Writing Data; Conditions.

m <- matrix(x, nrow = 3, ncol = 3)   Create a matrix from x.
log(x)         Natural log.
exp(x)         Exponential.
max(x)         Largest element.
min(x)         Smallest element.
round(x, n)    Round to n decimal places.
signif(x, n)   Round to n significant figures.
cor(x, y)      Correlation.
sum(x)         Sum.
mean(x)        Mean.
median(x)      Median.
quantile(x)    Percentage quantiles.
rank(x)        Rank of elements.
var(x)         The variance.
sd(x)          The standard deviation.

df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))   A special case of a list where all elements are the same len
