STAT362 R for Data Science PDF
Queen's University
Summary
This document provides an introduction to the course STAT362 R for data science. It explains what R and RStudio are, the benefits of R, and the topics that will be covered in the course. The document includes an overview of data visualizations and introduces data wrangling.
Full Transcript
STAT362 R for data science

1 Introduction

1.1 What is R and RStudio?

R
- R is a programming language and environment for statistical computing, analysis, and graphics.
- R is an interpreted language: each expression is read and executed immediately, as soon as the command is entered.
- To download R, go to https://cloud.r-project.org/

- RStudio is an integrated development environment (IDE) for R programming.
- Install R first, then go to https://posit.co/download/rstudio-desktop/ and download RStudio.
- While you can work in R directly, it is recommended to work in RStudio.

1.2 Why R?

- It’s free, open source, and available on every major platform. As a result, if you do your analysis in R, anyone can easily reproduce it.
- A massive set of packages for statistical modelling, machine learning, visualization, and importing and manipulating data.
- Powerful tools for communicating your results. RMarkdown makes it easy to turn your results into HTML files, PDFs, Word documents, PowerPoint presentations, and more. Shiny allows you to make beautiful interactive apps without any knowledge of HTML or JavaScript.
- RStudio provides an integrated development environment.
- Cutting-edge tools. Researchers in statistics and machine learning will often publish an R package to accompany their articles. This means you have access to the latest statistical techniques and implementations.
- The ease with which R can connect to high-performance programming languages like C, Fortran, and C++.

BUT, R is not perfect. One of the challenges is that most R users are not programmers. This means, for example, that much of the R code you will see in the wild is written in haste to solve a pressing problem. As a result, such code is often not very elegant, fast, or easy to understand, and most users do not revise their code to address these shortcomings. R is also not a particularly fast programming language, and poorly written R code can be terribly slow.

1.3 What will you learn in this course?
Note: we do not assume that you already know R or any other programming language.

1.3.1 R and R as a programming language

- operators
- control flow (if..else.., for loop)
- defining a function

1.3.2 Data Wrangling

Data wrangling = the process of tidying and transforming the data

1.3.3 Data Visualization

Graphs are a powerful way to illustrate features of the data. You will learn how to create some basic plots, as well as how to use the package ggplot2 to create more elegant plots.

Consider a dataset about cars.

library(ggplot2)
mpg
## # A tibble: 234 x 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##  1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
##  2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
##  3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
##  4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
##  5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
##  6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
##  7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
##  8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
##  9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
## 10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
## # i 224 more rows

Among the variables in mpg are:

- displ, a car’s engine size, in litres.
- hwy, a car’s fuel efficiency on the highway, in miles per gallon (mpg).

A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
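As a rough sketch (not taken from the course notes), scatterplots of hwy against displ for the mpg data described above could be built with ggplot2 along the following lines. The object names p_basic, p_colour, and p_facet are made up for illustration, and nrow = 2 is just one reasonable layout for the panels.

```r
library(ggplot2)

# Basic scatterplot: engine size (displ) on the x-axis,
# highway fuel efficiency (hwy) on the y-axis
p_basic <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Same plot, with points coloured according to the class variable
p_colour <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()

# One panel per value of class, arranged in two rows
p_facet <- p_basic +
  facet_wrap(~ class, nrow = 2)
```

Typing an object name such as p_basic at the console (or calling print(p_basic) inside a script) draws the corresponding plot.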
[Figure: Scatterplot of hwy against displ]

[Figure: Scatterplot of hwy against displ; points are coloured according to the class variable (2seater, compact, midsize, minivan, pickup, subcompact, suv)]

[Figure: Scatterplots of hwy against displ, one panel per class]

[Figure: Line chart of unemploy against date]

[Figure: Bar chart of count by cut (Fair, Good, Very Good, Premium, Ideal), with bars split by clarity]

[Figure: Another bar chart of count by cut, with bars split by clarity]

[Figure: Boxplot, "Sugar Content by Shelf": Sugar (grams per portion) against Shelf]

Histogram

## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## i Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

[Figure: Histogram of MPG.city, with density on the vertical axis]

1.3.4 Statistical Inference

Many problems in different domains can be formulated as hypothesis testing problems.

- Are university graduates more likely to vote for Candidate A?
- Is a treatment effective in reducing weight?
- Is a drug effective in reducing the mortality rate?

We want to answer these questions in a way that takes the intrinsic variability of the data into account. Formally, we can perform hypothesis tests and compute confidence intervals. These are what you learned in STAT 269. It is OK if you haven’t taken STAT 269; the topics will be briefly reviewed. We will focus on the applications using R.

1.3.5 Some Numerical Methods

- Monte Carlo simulation (estimate probabilities, expectations, integrals)
- numerical optimization methods (e.g. maximizing a multi-parameter likelihood function using optim)

1.3.6 Statistical and Machine Learning

We will illustrate a few basic statistical and machine learning methods using real datasets.

1.3.7 Lastly

It is important to communicate your results to others after performing the data analysis. Therefore, you will do a project with a presentation and a report.

1.4 Let’s Get Started

The best way to learn R is to get started immediately and try the code yourself. We will not discuss every topic in detail at the beginning, which would be neither interesting nor necessary. We shall revisit the topics when we need additional knowledge.

Simple arithmetic expression

# R can be used as a simple calculator
3 + 5
## [1] 8
4 * 2
## [1] 8
10 / 2
## [1] 5

Comment a code: use the hash mark #

# this is a comment, R will not run anything after the #

Function for ‘combining’

c(4, 2, 3) # "c" is to "combine" the numbers
## [1] 4 2 3

Assignment (