La Trobe University
Week 11: Case Study
CSE5DEV

Learning outcomes:
- Formulate the main questions.
- Perform data exploration and analysis.
- Present the results and report findings.

Data can come in different formats, but a computer program expects your data to be organised in a well-defined structure.

What have we learned so far? (Theory)
- Collecting and Wrangling: working with data; read and correct data.
- Cleaning and Normalising: convert dirty data into correct data.
    - Cleaning and handling missing values.
    - Normalising or standardising data.
- Data Visualisation: scatter plots, boxplots, and line plots.
- Data Exploration:
    - Univariate analysis
    - Bivariate (multivariate) analysis
    - Time series data analysis
- Correlation & Pattern Discovery
- Reporting & Data Communication

What have we learned so far? (R Programming)
- Install R and RStudio, create an R Markdown file, run basic code, etc.
- Data types and data structures (vector, factor, matrix and data frame): view, access, change, etc.
- Correct or change the format of the data to make it tidy.
- Clean the data.
- Normalise the data.
- Data visualisation using ggplot2.
- Data Exploration: tabular and graphical explorations.
- Correlation & Pattern Discovery
- Data Communication

Base R Cheat Sheet

Getting Help
Accessing the help files
    ?mean                            Get help of a particular function.
    help.search('weighted mean')     Search the help files for a word or phrase.
    help(package = 'dplyr')          Find help for a package.
More about an object
    str(iris)                        Get a summary of an object's structure.
    class(iris)                      Find the class an object belongs to.

Vector Functions
    sort(x)                          Return x sorted.
    rev(x)                           Return x reversed.
    table(x)                         See counts of values.
    unique(x)                        See unique values.

Also on the sheet: Creating Vectors; Programming (For Loop, While Loop, If Statements, Functions).

Using Libraries
    install.packages('dplyr')        Download and install a package from CRAN.
    library(dplyr)                   Load the package into the session, making all its functions available to use.
    dplyr::select                    Use a particular function from a package.
    data(iris)                       Load a built-in dataset into the environment.

Working Directory
    getwd()                          Find the current working directory (where inputs are found and outputs are sent).
    setwd('C://file/path')           Change the current working directory.
    Use projects in RStudio to set the working directory to the folder you are working in.

Selecting Vector Elements
By Position
    x[4]                             The fourth element.
    x[-4]                            All but the fourth.
    x[2:4]                           Elements two to four.
    x[-(2:4)]                        All elements except two to four.
    x[c(1, 5)]                       Elements one and five.
By Value
    x[x == 10]                       Elements which are equal to 10.
    x[x < 0]                         All elements less than zero.
    x[x %in% c(1, 2, 5)]             Elements in the set 1, 2, 5.
Named Vectors
    x['apple']                       Element with name 'apple'.

Also on the sheet: Reading and Writing Data; Conditions; List subsetting.

RStudio® is a trademark of RStudio, Inc. • CC BY Mhairi McNeill • Learn more at web page or vignette • package version • Updated: 3/15

    m <- matrix(x, nrow = 3, ncol = 3)    Create a matrix from x.

    log(x)          Natural log.
    exp(x)          Exponential.
    max(x)          Largest element.
    min(x)          Smallest element.
    round(x, n)     Round to n decimal places.
    signif(x, n)    Round to n significant figures.
    cor(x, y)       Correlation.
    sum(x)          Sum.
    mean(x)         Mean.
    median(x)       Median.
    quantile(x)     Percentage quantiles.
    rank(x)         Rank of elements.
    var(x)          The variance.
    sd(x)           The standard deviation.

    df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))    A special case of a list where all elements are the same length.
    Matrix subsetting: df[ , 2]    df[2, ]    df[2, 2]
    nrow(df)        Number of rows.
    ncol(df)        Number of columns.
    dim(df)         Number of rows and columns.
    cbind           Bind columns.
    rbind           Bind rows.

    t.test(x, y)        Perform a t-test for difference between means.
    pairwise.t.test     Perform pairwise t-tests between group levels, with correction for multiple testing.
    prop.test           Test for a difference between proportions.
    aov                 Analysis of variance.
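Several of the cheat-sheet entries above can be tried together in one short session. The following is only a minimal sketch: the numbers in x are made up for illustration and do not come from the lecture, and stringsAsFactors = FALSE is passed explicitly so that the result is the same on older R versions.

```r
# Illustrative values only; x is a hypothetical vector, not lecture data.
x <- c(10, -3, 7, 10, 2)

# Selecting vector elements by position and by value
fourth     <- x[4]                # the fourth element
not_fourth <- x[-4]               # all but the fourth
tens       <- x[x == 10]          # elements equal to 10
in_set     <- x[x %in% c(2, 7)]   # elements in the set 2, 7 (kept in the order of x)

# Vector and maths functions
sorted_x <- sort(x)               # x in ascending order
counts   <- table(x)              # how often each distinct value occurs
avg      <- mean(x)               # arithmetic mean
mid      <- median(x)             # middle value of the sorted vector

# Data frame from the cheat sheet, subset like a matrix
df <- data.frame(x = 1:3, y = c('a', 'b', 'c'), stringsAsFactors = FALSE)
second_col <- df[, 2]             # the y column as a vector
second_row <- df[2, ]             # a one-row data frame
cell       <- df[2, 2]            # a single value
shape      <- dim(df)             # number of rows first, then number of columns

print(sorted_x)
print(shape)
```

Storing each result in a named variable, rather than only printing it, makes it easy to inspect the objects afterwards with str() or class() from the Getting Help section.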
Case Study

Which programming language is the best for Data Science: R or Python?

To answer the above question, we need to know:
- How many Data Scientist jobs are listed on job-search websites?
- Which programming languages are mentioned in job ads?
- How many jobs mention the R and Python programming languages?

In the next slides, we present a Data Science example to answer the main question. The example is adapted from https://www.kaggle.com/nomilk/deep-exploration-of-data-science-job-listings/code
Please check the case study file (Case study) in LMS.

End of Week 11
See you next lecture (Week 12): Revision

CSE5DEV Timetable: check LMS.

Week 12: Revision
CSE5DEV

Exam Revision

Date: Wednesday, 8 November 2023
Start Time (AEST): 09:00 AM
Duration: 2 hours (120 minutes)
Venue: Union Hall (Bundoora) | BUS-228 (Bendigo)

The test will be on-site, paper-based and closed-book. It opens at 09:00 AM and closes at 11:00 AM sharp on the same day. Note that no extra time is given if you are late, so please make sure that you are ready and attempt the test on time.

The main aim is to test your understanding. All questions are based on the lectures, labs and assignments. There will be a lot of questions, so you will not really have time to look up the answers. You should study hard so that you have all the knowledge in your mind.

The final exam is an individual on-line test, contributing 50% to the overall subject mark. Students should attempt the test individually. There is ONE attempt for this on-line exam.

The final exam contains:
- 20 Marks: Multiple choice questions (MCQ). Some are single-answer MCQs; others have one or more correct answers.
- 20 Marks: True and False questions.
- 30 Marks: Short answer questions - answer each of the questions in a paragraph.
- 30 Marks: R code questions - write short code.

SAMPLES: Final Exam Questions

Sample: Question 1. Multiple choice questions. Select the right choice.

Quantitative variables take a predefined type which can be:
    a. Discrete
    b. Ordinal
    c. Nominal

Binary attributes are nominal attributes with only two values:
    a. 0 or 1
    b. ratings
    c. grades

Experimental data:
    a. is collected from strictly controlled/designed experiments, with efforts made to ensure statistical validity.
    b. is collected from various resources such as the internet, surveys and external devices.

Data is dirty if it has:
    a. various data types
    b. incomplete, noisy or inconsistent values

Sample: Question 2. True/false questions.

In bivariate exploration, we can use a bar chart to explore the relationship between variables. True / False
Pattern discovery can be used to predict what will happen in the future. True / False
Visual clutter creates excessive cognitive load that can hinder the transmission of our message. True / False

Sample: Question 3. R code questions

What is the output of the following code?
    x <- c(3, NA, NA, 1, 4)
    y <- c(2, NA, 1, 2, 2)
    x + y
Output: 5, NA, NA, 3, 6

What is the type of 'x' in the following code?
    x <- c(1, 2.1, FALSE)
Output: "double"

Write code to read data from file f and print the names of the columns.
    dat <- read.csv('f.csv', header = TRUE, sep = ",")
    names(dat)

Write code to read data from file f1. Use a line chart to plot the x column versus the y column.
    library("ggplot2")
    dat <- read.csv('f1.csv', header = TRUE, sep = ",")
    ggplot(dat, aes(x, y)) + geom_line()

Sample: Question 4. Short answer question

Based on the correlation matrix below, identify strong and weak correlations between variables.
Sample Answer:

The End

The subject has been challenging for some students.
I could make the subject easier, but you would be less prepared for the real world. I hope that after completing the labs and assignment, you are more confident in exploring and analysing industry problems. It has been a pleasure to teach you CSE5DEV this semester!

End of Week 12
No labs for Week 12
Wish you all the best!