Exploratory Data Analysis PDF - Industrial Engineering

Summary

This document is a presentation on Exploratory Data Analysis (EDA), a crucial process in data science. It covers univariate, bivariate, and multivariate analysis, along with descriptive statistics, data visualization, outlier handling, and specialized EDA techniques for uncovering patterns in data.

Full Transcript

Industrial Engineering
Exploratory Data Analysis
Engr. Ranzel Dimaculangan, ECT

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a data analytics process that aims to understand the data in depth and learn its different characteristics, often using visual means.

Why Is Exploratory Data Analysis Important?

It helps you understand the dataset: how many features there are, the type of data in each feature, and how the data is distributed. This helps in choosing the right methods for analysis.
It helps identify hidden patterns and relationships between data points, which supports model building.
It lets you spot errors or unusual data points (outliers) that could affect your results.
The insights obtained from EDA help you decide which features are most important for building models and how to prepare them to improve performance.
By building an understanding of the data, EDA helps in choosing the best modeling techniques and adjusting them for better results.

Types of EDA

1. Univariate Analysis
2. Bivariate Analysis
3. Multivariate Analysis

Univariate Analysis

Univariate analysis studies a single variable to understand its characteristics. It helps describe the data and find patterns within one feature. Summary statistics such as the mean, median, mode, variance, and standard deviation describe the central tendency and spread of the data.

Common methods include histograms to show the data distribution, box plots to detect outliers and understand the data spread, and bar charts for categorical data.

[Figure: Example of a box plot]

ANOVA: Analysis of Variance

The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two values:

1. F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from that assumption, and reports the result as the F-test score.
A larger score means a larger difference between the group means relative to the variation within each group.
2. P-value: The p-value indicates how statistically significant the calculated F-test score is. If the p-value is below a predefined threshold, we conclude that at least one group has a significantly different mean.

Bivariate Analysis

Bivariate analysis explores the relationship between two variables to find connections, correlations, and dependencies. Key techniques include:

1. Scatter plots, which visualize the relationship between two continuous variables.
2. Cross-tabulation, or contingency tables, which show the frequency distribution of two categorical variables and help understand their relationship.
3. Correlation coefficient, which measures how strongly two variables are related, commonly using Pearson’s correlation for linear relationships.

[Figure: Cross-tabulation example]

Regression Plot

A regression plot, as the name suggests, fits a regression line between two variables and helps visualize their linear relationship. The regression line is the best-fit line that predicts the dependent variable from the independent variable.

Correlation vs Causation

Correlation: a measure of the extent of interdependence between variables.
Causation: a cause-and-effect relationship between two variables.
It is important to know the difference between the two: correlation does not imply causation.

Pearson Correlation

The Pearson correlation measures the linear dependence between two variables X and Y. The resulting coefficient is a value between -1 and 1 inclusive, where:

 1: Total positive linear correlation.
 0: No linear correlation. The two variables show no linear relationship, although a nonlinear relationship is still possible.
-1: Total negative linear correlation.

P-value

The p-value is the probability of observing a correlation at least as strong as the one measured if the two variables were in fact uncorrelated. A small p-value indicates that the correlation is statistically significant.
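As a rough illustration of the two tests described above, the sketch below runs a one-way ANOVA and a Pearson correlation with SciPy's `scipy.stats.f_oneway` and `scipy.stats.pearsonr`. The slides do not name a library, so the choice of SciPy is an assumption, and every data value and variable name here is hypothetical.

```python
from scipy import stats

# Hypothetical measurements from three production lines (illustrative data only).
line_a = [20.1, 19.8, 20.5, 20.0, 19.9]
line_b = [20.2, 20.0, 19.7, 20.3, 20.1]
line_c = [22.5, 22.9, 23.1, 22.7, 22.8]

# One-way ANOVA: tests the null hypothesis that all group means are equal.
f_score, anova_p = stats.f_oneway(line_a, line_b, line_c)
print(f"ANOVA: F = {f_score:.2f}, p = {anova_p:.4g}")

# Hypothetical paired observations of two continuous variables.
machine_hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
units_produced = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2]

# Pearson correlation: r measures linear dependence; the p-value tests r = 0.
r, pearson_p = stats.pearsonr(machine_hours, units_produced)
print(f"Pearson: r = {r:.3f}, p = {pearson_p:.4g}")

ALPHA = 0.05  # significance level; a common convention, chosen here as an assumption
print("Group means differ" if anova_p < ALPHA else "No evidence of different means")
print("Correlation is significant" if pearson_p < ALPHA else "No evidence of correlation")
```

Both functions return a test statistic together with a p-value, so the same comparison against a chosen significance level applies to each.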
Normally, we choose a significance level of 0.05 for the p-value, which means we accept a 5% risk of declaring a correlation significant when there is none.

By convention:

p-value < 0.001: strong evidence that the correlation is significant.
p-value < 0.05: moderate evidence that the correlation is significant.
p-value < 0.1: weak evidence that the correlation is significant.
p-value > 0.1: no evidence that the correlation is significant.

Multivariate Analysis

Multivariate analysis examines the relationships among multiple variables in the dataset (typically three or more). It aims to understand how variables interact with one another, which is crucial for most statistical modeling techniques.

It includes techniques such as pair plots, which show the pairwise relationships between multiple variables at once and help reveal how they interact. Another technique is Principal Component Analysis (PCA), which reduces the dimensionality of large datasets by projecting them onto a smaller set of components while retaining most of the variance.

[Figure: Example of a pair plot]
[Figure: Example of a PCA plot]

Specialized EDA

There are specialized EDA techniques tailored to specific types of data or analysis needs:

1. Spatial Analysis: For geographical data, using maps and spatial plotting to understand the geographical distribution of variables.
2. Text Analysis: Involves techniques like word clouds, frequency distributions, and sentiment analysis to explore text data.
3. Time Series Analysis: Applied mainly to datasets that have a temporal component. Time series analysis involves examining and modeling patterns, trends, and seasonality in the data over time. Techniques like line plots, autocorrelation analysis, moving averages, and ARIMA (AutoRegressive Integrated Moving Average) models are commonly used.
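Two of the time series techniques mentioned above, moving averages and autocorrelation analysis, can be sketched with pandas. This is a minimal illustration under stated assumptions: the monthly `demand` series is synthetic (a linear trend plus a 12-month cycle), and the window and lag values are chosen only to match that made-up seasonality.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: linear trend plus a 12-month seasonal cycle.
months = pd.date_range("2020-01-01", periods=48, freq="MS")
t = np.arange(48)
demand = pd.Series(100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12), index=months)

# A 12-month moving average averages out the seasonal cycle and exposes the trend.
# The first 11 values are NaN because a full window is not yet available.
moving_avg = demand.rolling(window=12).mean()

# Autocorrelation: correlation of the series with a lagged copy of itself.
# A peak at lag 12 relative to lag 6 points to a yearly seasonal pattern.
autocorr_6 = demand.autocorr(lag=6)
autocorr_12 = demand.autocorr(lag=12)

print(moving_avg.dropna().head())
print(f"autocorrelation at lag 6:  {autocorr_6:.3f}")
print(f"autocorrelation at lag 12: {autocorr_12:.3f}")
```

On real data the seasonal period is usually not known in advance, so it is common to inspect the autocorrelation over a range of lags before fixing a window size.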
Data Exploration

The next step in EDA is to explore the characteristics of your data by examining the distribution, central tendency, and variability of your variables, and by identifying any outliers or anomalies. This helps in selecting appropriate analysis methods and spotting potential data issues.

For numerical variables, calculate summary statistics such as the mean, median, mode, standard deviation, skewness, and kurtosis. These provide an overview of the data’s distribution and help identify irregular patterns or issues.

Descriptive Statistics

Descriptive statistics is the branch of statistics concerned with summarizing, organizing, and presenting data meaningfully and concisely. It describes and analyzes a dataset's main features and characteristics without making generalizations or inferences about a larger population.

The primary goal of descriptive statistics is to provide a clear and concise summary of the data, enabling researchers or analysts to gain insights and understand patterns, trends, and distributions within the dataset. This summary typically includes measures of central tendency (e.g., mean, median, mode), dispersion (e.g., range, variance, standard deviation), and shape of the distribution (e.g., skewness, kurtosis). Descriptive statistics also includes graphical representation of data through charts, graphs, and tables, which further aids in visualizing and interpreting the information.

Visualize Data Relationship

Visualization is a powerful tool in the EDA process, helping to uncover relationships between variables and identify patterns or trends that may not be obvious from summary statistics alone.

For categorical variables, create frequency tables, bar plots, and pie charts to understand the distribution of categories and identify imbalances or unusual patterns.
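The summary statistics and frequency tables described above can be computed in a few lines. The sketch below uses pandas as an assumed tool (the slides do not prescribe one), and both `cycle_times` and `shifts` are hypothetical example data.

```python
import pandas as pd

# Hypothetical numerical variable: task cycle times in minutes.
cycle_times = pd.Series([4, 5, 5, 6, 7, 5, 8, 6, 5, 9])

summary = {
    "mean": cycle_times.mean(),
    "median": cycle_times.median(),
    "mode": float(cycle_times.mode().iloc[0]),  # first mode if several values tie
    "std": cycle_times.std(),                   # sample standard deviation (ddof=1)
    "skewness": cycle_times.skew(),             # > 0 suggests a longer right tail
    "kurtosis": cycle_times.kurt(),             # excess kurtosis (normal distribution = 0)
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")

# Hypothetical categorical variable: the shift in which each task was recorded.
shifts = pd.Series(["day", "night", "day", "day", "night", "day"])

# Frequency table: how many observations fall into each category.
shift_counts = shifts.value_counts()
print(shift_counts)
```

A mean noticeably above the median, as in this made-up sample, is one quick hint of right skew that a histogram or box plot can then confirm.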
For numerical variables, generate histograms, box plots, violin plots, and density plots to visualize distribution, shape, spread, and potential outliers.

To explore relationships between variables, use scatter plots, correlation matrices, or correlation measures such as Pearson’s correlation coefficient or Spearman’s rank correlation.

Handling Outliers

Outliers are data points that differ significantly from the rest of the data. They are often caused by errors in measurement or data entry, although they can also be genuine extreme values.

Detecting and handling outliers is important because they can skew your analysis and affect model performance. You can identify outliers using methods like the interquartile range (IQR), Z-scores, or domain-specific rules. Once identified, outliers can be removed or adjusted depending on the context. Properly managing outliers helps keep your analysis accurate and reliable.
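For the IQR and Z-score methods named in the Handling Outliers section, a minimal NumPy sketch might look like the following. The 1.5 × IQR fences and the |z| > 3 cutoff are common conventions rather than values given in the slides, and the function names and `data` sample are hypothetical.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    lower, upper = iqr_bounds(values, k)
    return values[(values < lower) | (values > upper)].tolist()

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()  # population standard deviation
    return values[np.abs(z) > threshold].tolist()

# Hypothetical sample: 19 typical readings and one suspicious value.
data = [10, 12, 11, 13, 12, 11, 14, 12, 13, 12,
        11, 13, 12, 10, 14, 12, 13, 11, 12, 95]

print("IQR outliers:    ", iqr_outliers(data))
print("Z-score outliers:", zscore_outliers(data))

# One way to adjust rather than remove: cap values at the IQR fences.
lower, upper = iqr_bounds(data)
capped = np.clip(data, lower, upper)
print("Max after capping:", capped.max())
```

The Z-score rule is unreliable on very small samples, because a single extreme value inflates the standard deviation it is measured against; the IQR rule is less sensitive to that effect.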