Exploratory Data Analysis (EDA) Lecture Notes PDF

# Exploratory Data Analysis (EDA)

- Was introduced by John Tukey in his book *Exploratory Data Analysis* in 1977.
- It is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:
  - Maximize insight into a data set.
  - Uncover underlying structure.
  - Extract important variables.
  - Detect outliers and anomalies.

## EDA

- Start with a general question or area of interest without specific hypotheses.

## Main Characteristics of EDA:

- Use of (mostly) graphical techniques to understand data properties. It is not only a matter of using graphics, however; it is a different "philosophy" for approaching the problem.
- The broad aim of EDA is to help us formulate and refine hypotheses that will lead to informative analyses or further data collection.

## The Core Objectives of EDA Are:

- To suggest hypotheses about the causes of observed phenomena.
- To guide the selection of appropriate statistical tools and techniques.
- To assess the assumptions on which statistical analysis will be based.
- To provide a foundation for further data collection.

## How Does Exploratory Data Analysis Differ from Classical Data Analysis?

### Classical Data Analysis

- Is a traditional approach to analyzing data that involves a structured, model-driven process:
- **Problem** → **Data** → **Model** → **Analysis** → **Conclusions**

### Exploratory Data Analysis

- Is more flexible and data-driven, focusing on exploring the data to uncover insights and patterns.
- **Problem** → **Data** → **Analysis** → **Model** → **Conclusions**

| CDA | EDA |
|-----|-----|
| More structured and hypothesis-driven, starting with a specific model and testing it. | More flexible and data-driven, focusing on exploring the data to uncover insights and patterns. |
| Focus on the **Model**. | Focus on the **Data**. |
| Technique: Mostly **quantitative**. | Technique: Mostly **graphical**. |

## Some Basic Concepts That Underpin EDA

1. Clarifying different types of data.
2. Distinguishing between populations and samples.

## Statistical Variables and Data

**Variable** is typically used to mean one of two things:

- **In the context of programming**, a variable is a name-value association that we create when we run some code.
- **For statisticians**, a variable is any characteristic or quantity that can be measured, classified or experimentally controlled.

## Examining, Cleaning and Filtering

We will be covering the following topics:

- Reshaping and tidying up missing and erroneous data
- Manipulating and mutating data
- Selecting and filtering data
- Cleaning and manipulating time-series data
- Handling complex textual data

## Erroneous Data

**Incorrect data** - Data that contains errors such as typos, incorrect values or misrecorded information.

## Types of Variables

| Type of Variable | Description |
|------------------|-------------|
| **Categorical** | Have values that describe a characteristic of a data unit. |
| **Numerical** | Have values that describe a measurable quantity as a number, like "how many" or "how much". Also called quantitative variables. |
| **Nominal** (categorical) | Observations can take a value that cannot be organized in a logical sequence. |
| **Ordinal** (categorical) | Observations can take a value that can be logically ordered, from higher to lower. |
| **Discrete** (numerical) | Observations can take a value based on a count from a set of whole numbers/values. |
| **Continuous** (numerical) | Observations can take any value within a certain range of real numbers (i.e., numbers represented by decimals). |

## CDA

- **Problem** - Identify the research question or hypothesis you mean to test.
- **Data** - Collect the data relevant to your problem.
- **Model** - Choose a statistical model that fits your hypothesis.
- **Analysis** - Apply statistical methods to test the hypothesis using the chosen model. This might include hypothesis testing, calculating confidence intervals, etc.
- **Conclusions** - Interpret the results to accept or reject the hypothesis.

## EDA

- **Problem** - Start with a general question or area of interest without a specific hypothesis.
- **Data** - Gather the data you want to explore.
- **Analysis** - Use graphical and descriptive techniques to explore the data. This includes creating histograms, scatter plots, box plots, or calculating summary statistics.
- **Model** - Based on the insights gained from the exploratory analysis, you might develop hypotheses or models for further analysis.
- **Conclusions** - Draw preliminary conclusions and identify patterns, trends, or anomalies.

## Load the Data

```r
setwd()     # set the working directory (pass the folder path)
read.csv()  # read a CSV file into a data frame (pass the file path)
View()      # open a data frame in a spreadsheet-style viewer (capital V in base R)
str()       # compactly display the internal structure of an R object;
            # the R documentation describes it as an alternative to summary()
```

## Reshaping and Tidying up Erroneous Data

**Erroneous Data** - Data that falls outside what is accepted and should be rejected by the system.

In this section, we will focus on two major activities: reshaping and tidying up erroneous data.

1. Install the package: `install.packages("tidyr")`
2. Load it: `library(tidyr)`

Once we have the package in our system, we can proceed with reshaping and tidying up the dataset. This process requires the following functions: `gather()`, `unite()`, `separate()`, `spread()`.

### The `gather()` Function

```r
library(tidyr)  # also re-exports the %>% pipe

# Usage: gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
# data:    Data frame.
# key:     The name of the new key column that will be created. This column
#          will contain the names of the original columns that are being gathered.
# value:   The name of the new value column that will be created. This column
#          will contain the values from the original columns.
# ...:     The columns you want to gather. You can specify them by name,
#          position, or using a range (e.g., year1:year3).
# na.rm:   A logical value indicating whether to remove rows with NA values
#          in the value column.
# convert: A logical value indicating whether to automatically convert the key
#          column to the appropriate type (e.g., numeric, integer, logical).
#          The default is FALSE, meaning the key column is stored as a
#          character vector.

df <- data.frame(
  Name = c("Alice", "Bob"),
  Math = c(90, 85),
  Science = c(15, 80),
  English = c(85, 88)
)

# Gather the three subject columns into subject/score pairs (wide to long).
df_long <- df %>% gather(key = "subject", value = "score", Math:English)
```

### The `unite()` Function

```r
library(tidyr)
library(ggplot2)  # provides the mpg dataset

# Usage: unite(data, col, ..., sep = "_", remove = TRUE)
# data:   Data frame.
# col:    The name of the new column to be added.
# ...:    The columns to combine.
# sep:    Separator to use between the values.
# remove: If TRUE, the input columns are removed from the resulting data frame.

mpg5 <- unite(mpg, "Fuel Efficiency", c("drv", "fl"))
View(mpg5)
```

### The `separate()` Function

```r
# Usage: separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE)
# data:    The data frame you want to modify.
# col:     The name of the column you want to split into multiple columns.
# into:    A vector of new column names that will be created from the original column.
# sep:     The separator used to split the column. The default, "[^[:alnum:]]+",
#          matches any run of non-alphanumeric characters (spaces, punctuation, etc.).
# remove:  A logical value indicating whether to remove the original column
#          after splitting. The default is TRUE.
# convert: A logical value indicating whether to automatically convert the new
#          columns to the appropriate type (e.g., numeric, integer). The default
#          is FALSE, meaning the new columns are stored as character vectors.

# Continues from the unite() example; backticks are needed because the
# column name contains a space.
mpg6 <- mpg5 %>% separate(`Fuel Efficiency`, c("drv", "fl"))
```

## Manipulating and Mutating Data

The `dplyr` package is used here. It helps with manipulating and mutating data. The functions covered are as follows:

- `mutate()`
- `summarize()`
- `group_by()`
- `glimpse()`
- `arrange()`

### `mutate()`

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# mutate() adds new columns to a dataset. It is useful for creating
# attributes that are functions of other attributes in the dataset.
mpg_Mutate <- mpg %>% mutate(nr = cyl + displ)
```

### `summarize()`

```r
# summarize() aggregates many values into a single summary value.
# It is predominantly used together with group_by(), giving one row per group.
mpg_summarize <- mpg %>%
  group_by(year) %>%
  summarize(avg_displ = mean(displ))
```

### `glimpse()`

```r
# glimpse() lists the columns of the dataset and shows as many values of
# each attribute as fit on a single line.
glimpse(mpg)
```
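
### `group_by()` and `arrange()`

`group_by()` and `arrange()` appear in the function list above but have no example of their own. The following is a minimal sketch of how they might be combined with `summarize()`. It assumes a small made-up data frame: `toy_cars` and its values are hypothetical and are not part of the lecture data.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only.
toy_cars <- data.frame(
  model = c("a", "b", "c", "d"),
  year  = c(1999, 1999, 2008, 2008),
  displ = c(1.8, 2.0, 2.8, 3.2)
)

avg_by_year <- toy_cars %>%
  group_by(year) %>%                      # one group per year
  summarize(avg_displ = mean(displ)) %>%  # collapse each group to one row
  arrange(desc(avg_displ))                # sort rows, largest average first

print(avg_by_year)
```

`arrange()` sorts in ascending order by default; wrapping a column in `desc()` reverses the order for that column.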
