STT157 EDA Lecture 1 PDF
MSU-IIT
Summary
This document provides a lecture on exploratory data analysis (EDA) using the dplyr package in R, focusing on data manipulation and transformations. It covers data frames and functions such as select(), filter(), arrange(), and rename(), and explains the practical application of these functions to air pollution data from Chicago. The lecture also introduces the concepts of data tidiness and detrending.
Full Transcript
Welcome to STT157 Exploratory Data Analysis (EDA)
Lecture 1: INTRODUCTION

EXPLORATORY DATA ANALYSIS
Important principle: “It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it.”

The goals are many, but they include
- identifying relationships between variables that are particularly interesting or unexpected,
- checking to see if there is any evidence for or against a stated hypothesis,
- checking for problems with the collected data, such as missing data or measurement error, or
- identifying certain areas where more data need to be collected.

The goal of EDA
The goal of EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

EXPLORATORY DATA ANALYSIS
EDA is an iterative cycle:
1. Generate questions about your data.
2. Search for answers by visualizing, transforming, and modelling your data.
3. Use what you learn to refine your questions and/or generate new questions.

Two types of questions
1. What type of variation occurs within my variables?
2. What type of covariation occurs between my variables?

Managing Data Frames with the dplyr package

Data Frames
- A data frame is a key data structure in statistics and in R.
- There is one observation per row.
- Each column represents a variable, a measure, feature, or characteristic of that observation.

The dplyr Package
- Developed by Hadley Wickham of RStudio, dplyr is an optimized and distilled version of his plyr package.
- One important contribution of the dplyr package is that it provides a “grammar” (in particular, verbs) for data manipulation and for operating on data frames.
- Another useful contribution is that the dplyr functions are very fast.

dplyr Grammar
Some of the key “verbs” provided by the dplyr package are
- select: return a subset of the columns of a data frame, using a flexible notation
- filter: extract a subset of rows from a data frame based on logical conditions
- arrange: reorder rows of a data frame
- rename: rename variables in a data frame
- mutate: add new variables/columns or transform existing variables
- summarise / summarize: generate summary statistics of different variables in the data frame, possibly within strata
- %>%: the “pipe” operator is used to connect multiple verb actions together into a pipeline

Common dplyr Function Properties
The dplyr functions share a few common properties. In particular,
1. The first argument is a data frame.
2. The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names).
3. The return result of a function is a new data frame.
4. Data frames must be properly formatted and annotated for all of this to be useful. In particular, the data must be tidy. In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.

select()
For the examples, we use a dataset containing air pollution and weather variables for the city of Chicago in the U.S. from 1987 to 2005. Load the data in R. You can see some basic characteristics of the dataset with the dim() and str() functions:
- dim() gives you the dimensions of your data
- str() provides an overview of the contents of your data, including the types of variables, without printing the entire object.
The select() function can be used to select columns of a data frame that you want to focus on.

city: The name of the city for which the weather data is recorded (in this case, Chicago).
tmpd: Temperature (daily mean or daily maximum/minimum temperature, depending on context).
dptp: Dew Point Temperature (the temperature at which air becomes saturated with moisture, leading to condensation).
date: The date when the weather data was recorded.
pm25tmean2: Mean concentration of PM2.5 (particulate matter with a diameter of 2.5 micrometers or less) over a specific period.
pm10tmean2: Mean concentration of PM10 (particulate matter with a diameter of 10 micrometers or less) over a specific period.
o3tmean2: Mean concentration of ozone (O3) over a specific period.
no2tmean2: Mean concentration of nitrogen dioxide (NO2) over a specific period.

Suppose we wanted to take the first 3 columns only.
> head(select(chicago, city:dptp))
Note that the : operator normally cannot be used with names or strings, but inside the select() function you can use it to specify a range of variable names.

You can also omit variables using the select() function by using the negative sign.
> select(chicago, -(city:dptp))
This indicates that we should include every variable except the variables city through dptp.
> head(select(chicago, -(city:dptp)))

The select() function also supports a special syntax that lets you specify variable names based on patterns. So, for example, if you wanted to keep every variable that ends with a “2”, you could pass the ends_with("2") helper to select().

filter()
The filter() function is used to extract subsets of rows from a data frame. Suppose we wanted to extract the rows of the chicago data frame where the levels of PM2.5 are greater than 30 (which is a reasonably high level). We could filter on the condition pm25tmean2 > 30.

We can place an arbitrarily complex logical sequence inside filter(). For example, to extract the rows where PM2.5 is greater than 30 and temperature is greater than 80 degrees Fahrenheit, we could filter on the condition pm25tmean2 > 30 & tmpd > 80.

arrange()
The arrange() function is used to reorder rows of a data frame according to one of the variables/columns. Here we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.
> chicago <- arrange(chicago, date)
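
The select(), filter(), and arrange() calls discussed in this lecture can be tried without the Chicago data file. The sketch below is only an illustration: the data frame `toy` and all of its values are invented here, with column names borrowed from the lecture's dataset, so the results say nothing about the real measurements.

```r
library(dplyr)

# Toy stand-in for the chicago data frame. The column names follow the
# lecture, but every value below is made up for illustration.
toy <- data.frame(
  city       = rep("chic", 5),
  tmpd       = c(31.5, 85.0, 72.0, 90.5, 40.0),
  dptp       = c(31.5, 70.2, 60.1, 75.0, 29.0),
  date       = as.Date(c("2005-07-02", "1987-01-01", "1999-06-15",
                         "2003-08-20", "1990-12-31")),
  pm25tmean2 = c(12.0, 35.5, NA, 41.2, 31.0),
  pm10tmean2 = c(30.0, 50.1, 22.3, 60.4, 45.0),
  stringsAsFactors = FALSE
)

dim(toy)  # number of rows and columns
str(toy)  # compact overview of the column types

# select(): a range of columns, the complement of that range, and a pattern
first_three <- select(toy, city:dptp)
the_rest    <- select(toy, -(city:dptp))
ends_in_2   <- select(toy, ends_with("2"))

# filter(): one condition, then two conditions combined with &.
# Rows where the condition evaluates to NA are dropped.
high_pm     <- filter(toy, pm25tmean2 > 30)
high_pm_hot <- filter(toy, pm25tmean2 > 30 & tmpd > 80)

# arrange(): earliest date first; desc() reverses the order
by_date      <- arrange(toy, date)
by_date_desc <- arrange(toy, desc(date))
```

Each call returns a new data frame and leaves `toy` unchanged, which is why the results are assigned to new names. To apply the same calls to the lecture's data, replace `toy` with `chicago`.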