Data Exploration and Visualization PDF

Document Details

Uploaded by InviolableSasquatch6690

Karunya Institute of Technology and Sciences

Summary

This document provides an overview of data exploration and visualization techniques using Python. It discusses various chart types (line charts, bar charts, scatter plots, area plots, stacked plots, pie charts, polar plots, histograms, lollipop charts) and their applications, along with technical requirements for creating effective visualizations.

Full Transcript

23DC2028 - Data Exploration and Visualization
Module 2: Visual Aids of EDA

Technical requirements - Line chart - Bar charts - Scatter plot - Area plot and stacked plot - Pie chart - Table chart - Polar chart - Histogram - Lollipop chart - Choosing the best chart.

Text Books:
1. Suresh Kumar Mukhiya and Usman Ahmed (2020). Hands-On Exploratory Data Analysis with Python: Perform EDA techniques to understand, summarize, and investigate your data. Packt. ISBN: 978-1789537253.
2. Dr. Ossama Embarak (2018). Data Analysis and Visualization Using Python: Analyze Data to Create Visualizations for BI Systems. ISBN: 978-1-4842-4108-0.

Reference Books:
1. Sam Lau, Joseph Gonzalez, Deborah Nolan (2023). Learning Data Science: Data Wrangling, Exploration, Visualization, and Modeling with Python. O'Reilly Media. ISBN: 9781098113001.
2. Claus O. Wilke (2019). Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures (1st Edition). Shroff/O'Reilly. ISBN-10: 9352138112.

Technical Requirements for Visual Aids of EDA

Line chart in Matplotlib – Python
- Matplotlib is a data visualization library in Python.
- pyplot, a sublibrary of Matplotlib, is a collection of functions that helps in creating a variety of charts.
- Line charts represent the relationship between two variables, X and Y, each plotted on its own axis.

Example: A simple line chart is generated using NumPy to define the data values. The x-values are evenly spaced points, and the y-values are calculated as twice the corresponding x-values.

Code:

    # importing the required libraries
    import matplotlib.pyplot as plt
    import numpy as np

    # define data values
    x = np.array([1, 2, 3, 4])  # X-axis points
    y = x * 2                   # Y-axis points

    plt.plot(x, y)  # plot the chart
    plt.show()      # display

Bar Chart in Matplotlib – Python

Scatter Plot
- A scatter plot is a graphical technique used to represent data.
- A scatter plot, also called a scatter graph or scatter chart, uses dots to describe two different numeric variables.
- The position of each dot on the horizontal and vertical axes indicates the values for an individual data point.

Example: Find the relationship between the number of matches played and goals scored.

TECHNICAL REQUIREMENTS

The essential tools and libraries are Matplotlib, Seaborn, Pandas, Bokeh, and Plotly. These libraries provide a range of functionality for creating various types of visualizations and handling data efficiently.

A solid understanding of Python programming basics is crucial. This includes knowledge of data manipulation techniques such as filtering, aggregating, and transforming data. Familiarity with the basic plotting functions and methods in the libraries above also helps greatly in creating sophisticated visualizations.

Data preparation is a key step before visualization. It involves cleaning the data by handling missing values, removing duplicates, and correcting any inconsistencies. Organizing the data effectively is also important, which may involve restructuring data frames, normalizing data ranges, and converting data types to appropriate formats.

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves summarizing the main characteristics of a dataset, often using visual methods. EDA helps in understanding the data better, uncovering underlying structures, detecting anomalies, testing hypotheses, and checking assumptions with the help of summary statistics and graphical representations. The key components and techniques used in EDA are:

1. Data Cleaning:
   - Handling Missing Values: Identifying and dealing with missing data points by either removing them or imputing appropriate values.
   - Removing Duplicates: Ensuring no redundant data is present in the dataset.
   - Correcting Inconsistencies: Standardizing data formats and correcting any inconsistencies.
2. Descriptive Statistics:
   - Summary Statistics: Calculating the mean, median, mode, standard deviation, and variance to get an overview of the data distribution.
3. Data Visualization:
   - Histograms: Displaying the frequency distribution of a dataset to understand its underlying pattern.
   - Box Plots: Visualizing the spread of the data, highlighting the median, quartiles, and potential outliers.
   - Scatter Plots: Examining relationships between two continuous variables.
   - Line Charts: Observing trends over time or another continuous variable.
   - Bar Charts: Comparing categorical data.
   - Heatmaps: Representing the magnitude of a phenomenon as color in two dimensions.
4. Identifying Patterns and Relationships:
   - Correlation Analysis: Checking for linear relationships between variables using correlation coefficients.
   - Cross-tabulation: Analyzing categorical data by examining the relationships between variables in a contingency table.
5. Outlier Detection: Identifying and analyzing data points that differ significantly from other observations, which can indicate anomalies or errors in data collection.
6. Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) reduce the number of variables while retaining the essential patterns in the data.

LINE CHART

A line chart is a type of data visualization that displays information as a series of data points called 'markers' connected by straight line segments. This chart is particularly useful for showing trends over time or continuous data.

It is important to check the units of the variables being plotted, such as the phd, service, and salary variables. When all three are plotted together, only the salaries are visible, while the phd and service information is not clearly displayed on the plot.
This is because the salaries are in the hundreds of thousands, while the phd and service values are in much smaller units.

Visualizing Patterns with High Differences in Numerical Units

    dataset[["rank", "discipline", "phd", "service", "sex", "salary"]].plot()

Key Features:
- Axes: Line charts have two axes. The x-axis (horizontal) typically represents time or a continuous variable, and the y-axis (vertical) represents the dependent variable being measured.
- Data Points: Each point on the line represents a data value at a given point in time or condition.
- Lines: Connecting the data points with lines helps visualize the trend or pattern over the specified period or condition.

Use Cases:
- Time Series Analysis: Tracking changes over time, such as stock prices, temperature changes, or monthly sales figures.
- Trend Analysis: Identifying upward, downward, or cyclical trends in data.
- Comparison: Comparing multiple data series to see how different variables change relative to each other over time.

Example: Imagine you want to track the monthly sales revenue of a company over a year. A line chart would clearly show the trend in sales from month to month, helping identify peak periods and any seasonal variations.

    import matplotlib.pyplot as plt

    # Data
    months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    sales = [150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700]

    # Create line chart
    plt.plot(months, sales, marker='o', linestyle='-')

    # Add titles and labels
    plt.title('Monthly Sales Revenue')
    plt.xlabel('Month')
    plt.ylabel('Sales ($)')

    # Show plot
    plt.show()

BAR CHART

A bar chart is a graphical representation of data using rectangular bars. The length or height of each bar is proportional to the value it represents. Bar charts are commonly used to compare different categories or groups, making them ideal for visualizing categorical data.
Key Features:
- Axes: The chart has two axes, the x-axis (horizontal) and the y-axis (vertical). The x-axis typically represents the categories, while the y-axis represents the values or frequencies.
- Bars: Each bar corresponds to a category, and its length or height represents the value of that category. Bars can be displayed either vertically or horizontally.
- Spacing: Bars are usually spaced evenly apart to clearly distinguish between different categories.

Types of Bar Charts:
1. Vertical Bar Chart: The most common type, where bars extend vertically from the x-axis.
2. Horizontal Bar Chart: Useful when category names are long or when comparing many categories.
3. Stacked Bar Chart: Bars are divided into sub-bars to show the composition of each category.
4. Grouped Bar Chart: Groups bars together for each category to compare multiple series of data.

Use Cases:
- Comparison: Ideal for comparing the quantities of different categories.
- Trends Over Time: When categories represent time periods, a bar chart can show trends over time.
- Distribution: Shows how a variable is distributed across different categories.

Example: Imagine you want to compare the sales of different products in a store. You can use a bar chart where each bar represents a product, and the height of the bar represents the sales figures.

    import matplotlib.pyplot as plt

    # Data
    categories = ['Product A', 'Product B', 'Product C', 'Product D']
    values = [50, 30, 40, 70]

    # Create bar chart
    plt.bar(categories, values)

    # Add titles and labels
    plt.title('Sales of Different Products')
    plt.xlabel('Products')
    plt.ylabel('Sales')

    # Show plot
    plt.show()

SCATTER PLOT

A scatter plot is a type of data visualization that uses dots to represent the values obtained for two different variables. Each dot on the scatter plot represents one observation from a dataset, with its position determined by the values of the two variables.

Key Features:
- Axes: The x-axis represents one variable, while the y-axis represents the other variable.
- Dots: Each dot represents an observation from the dataset. The position of the dot is determined by the values of the two variables.
- Patterns: Scatter plots can reveal various types of relationships between variables, such as linear, non-linear, or no relationship.

Use Cases:
- Correlation Analysis: Determining whether there is a relationship between two variables.
- Outlier Detection: Identifying data points that deviate significantly from the rest of the data.
- Cluster Identification: Finding clusters or groups of data points that have similar characteristics.

Example: Imagine you want to examine the relationship between the heights and weights of individuals. A scatter plot would allow you to visualize how weight changes with height, helping to identify any correlation between these two variables.

    import matplotlib.pyplot as plt

    # Data
    height = [150, 160, 170, 180, 190, 200, 210, 220]
    weight = [50, 60, 65, 70, 75, 80, 85, 90]

    # Create scatter plot
    plt.scatter(height, weight)

    # Add titles and labels
    plt.title('Height vs. Weight')
    plt.xlabel('Height (cm)')
    plt.ylabel('Weight (kg)')

    # Show plot
    plt.show()

AREA PLOT

An area plot is a type of line chart where the area between the line and the axis is filled with color or shading. This helps to emphasize the magnitude of change over time.

Key Features:
- Axes: Similar to line charts, with the x-axis representing time or another continuous variable, and the y-axis representing the magnitude of the data.
- Filled Area: The area under the line is filled to highlight the volume of data.
- Single Series: Usually displays one data series.

Use Cases:
- Trend Visualization: Showing how values change over time, emphasizing the volume.
- Comparative Analysis: Comparing the ...

Example: Visualizing the monthly rainfall in a region over a year can be effectively shown with an area plot, highlighting the volume of rainfall each month.

    import matplotlib.pyplot as plt

    # Data
    months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    rainfall = [3.1, 3.3, 3.8, 4.1, 4.5, 4.9, 5.2, 5.0, 4.7, 4.3, 3.9, 3.4]

    # Create area plot: shaded region plus an outline along its top edge
    plt.fill_between(months, rainfall, color="skyblue", alpha=0.4)
    plt.plot(months, rainfall, color="Slateblue", alpha=0.6)

    # Add titles and labels
    plt.title('Monthly Rainfall')
    plt.xlabel('Month')
    plt.ylabel('Rainfall (inches)')

    # Show plot
    plt.show()

STACKED PLOT

A stacked plot, or stacked area plot, is an extension of the area plot that shows multiple data series stacked on top of each other. This type of plot emphasizes the total and the contribution of each part to the whole.

Example: Visualizing the total sales of different product categories over a year can be effectively shown with a stacked plot, highlighting the contribution of each category to the total sales.

    import matplotlib.pyplot as plt

    # Data
    months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    category_A = [3, 4, 2, 5, 7, 6, 8, 7, 6, 5, 4, 3]
    category_B = [2, 3, 4, 3, 2, 3, 4, 5, 4, 3, 2, 1]
    category_C = [1, 2, 3, 2, 1, 2, 3, 4, 3, 2, 1, 1]

    # Cumulative totals form the upper boundary of each stacked layer
    total_AB = [a + b for a, b in zip(category_A, category_B)]
    total_ABC = [ab + c for ab, c in zip(total_AB, category_C)]

    # Create stacked plot: outline each layer along its stacked boundary
    plt.plot(months, category_A, color="skyblue", alpha=0.4, label='Category A')
    plt.plot(months, total_AB, color="olive", alpha=0.4, label='Category B')
    plt.plot(months, total_ABC, color="gold", alpha=0.4, label='Category C')

    # Fill each layer between consecutive boundaries
    plt.fill_between(months, category_A, color="skyblue", alpha=0.4)
    plt.fill_between(months, category_A, total_AB, color="olive", alpha=0.4)
    plt.fill_between(months, total_AB, total_ABC, color="gold", alpha=0.4)

    # Add titles and labels
    plt.title('Monthly Sales by Category')
    plt.xlabel('Month')
    plt.ylabel('Sales')
    plt.legend()

    # Show plot
    plt.show()

AREA PLOT
- Typically represents a single data series.
- Emphasizes the volume and magnitude of change for a single dataset over time.
- Fills the area under a single line.
- Simpler, focusing on one dataset.

STACKED PLOT
- Represents multiple data series stacked on top of each other.
- Emphasizes both the cumulative total and the individual contributions of multiple datasets.
- Fills the area under multiple lines, stacking them to show part-to-whole relationships.
- More complex, showing multiple datasets and their cumulative effect.

PIE CHART

A pie chart is a circular statistical graphic that is divided into slices to illustrate numerical proportions. Each slice represents a category's contribution to the whole.

Pie charts are useful for comparing parts of a whole. They do not show changes over time. Bar graphs are used to compare different groups or to track changes over time, although for tracking change over time they work best when the changes are large. A pie chart works well for a small number of categories, but it falls short when there are many.
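
Because each slice is sized by its category's share of the total, the arithmetic behind a pie chart can be sketched in a few lines. This is only an illustration: the helper name `slice_shares` is hypothetical and the input values are made up.

```python
def slice_shares(values):
    """Return a (percentage, angle in degrees) pair for each value's share of the total."""
    total = sum(values)
    return [(100 * v / total, 360 * v / total) for v in values]


# Hypothetical category values; any non-negative numbers work.
market_share = [30, 20, 25, 25]
for pct, angle in slice_shares(market_share):
    print(f"{pct:.1f}% of the whole -> slice angle of {angle:.1f} degrees")
```

The angles always sum to 360 degrees, so with many categories each slice becomes too thin to compare by eye.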
Key Features:
- Circular Shape: The entire chart represents 100% of the data.
- Slices: Each slice corresponds to a category, and its size is proportional to its percentage of the total.
- Labels: Slices are often labeled with the category name and percentage value for clarity.

Use Cases:
- Proportional Data: Best used to show the relative sizes of parts of a whole.
- Limited Categories: Ideal for datasets with a small number of categories to avoid clutter.

Example: Visualizing the market share of different companies within an industry can be effectively shown with a pie chart.

    import matplotlib.pyplot as plt

    # Data
    labels = ['Company A', 'Company B', 'Company C', 'Company D']
    sizes = [30, 20, 25, 25]
    colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
    explode = (0.1, 0, 0, 0)  # explode 1st slice

    # Create pie chart
    plt.pie(sizes, explode=explode, labels=labels, colors=colors,
            autopct='%1.1f%%', shadow=True, startangle=140)

    # Equal aspect ratio ensures that pie is drawn as a circle.
    plt.axis('equal')

    # Add title
    plt.title('Market Share of Companies')

    # Show plot
    plt.show()

LOLLIPOP CHART

A lollipop chart is a variation of a bar chart that uses a line and a circular marker (dot) to represent data points. It is particularly useful for comparing values across categories while maintaining a clean, minimalistic appearance.
- The line (stick) represents the magnitude of the value.
- The circle (lollipop) at the end of the line highlights the exact data point.

WHEN TO USE LOLLIPOP CHART
- To compare discrete categories with numerical values.
- When a bar chart feels too cluttered.
- To emphasize individual data points.

POLAR CHART

A polar chart (or polar plot) is a circular graph where data points are plotted using angles (θ) and radii (r) instead of x-y coordinates. It is useful for visualizing cyclic data, directional data, or patterns in periodic datasets.
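
The lollipop and polar charts above are described without code. A minimal sketch of both follows, assuming Matplotlib's `vlines` plus a marker-only `plot` for the lollipop and the `polar` projection for the polar chart. The function names `lollipop_chart` and `polar_chart` are hypothetical, the product figures are reused from the bar chart example, and the hourly readings are made up.

```python
import numpy as np
import matplotlib.pyplot as plt


def lollipop_chart(categories, values):
    """Draw one stick (line) and one dot (marker) per category."""
    positions = np.arange(len(categories))
    fig, ax = plt.subplots()
    ax.vlines(positions, 0, values, color="skyblue")    # the sticks
    ax.plot(positions, values, "o", color="slateblue")  # the dots
    ax.set_xticks(positions)
    ax.set_xticklabels(categories)
    ax.set_ylabel("Sales")
    ax.set_title("Sales of Different Products (lollipop)")
    return fig, ax


def polar_chart(theta, r):
    """Plot radius r against angle theta (in radians) on polar axes."""
    fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
    # Repeat the first point so the curve closes the full circle.
    ax.plot(np.append(theta, theta[0]), np.append(r, r[0]), marker="o")
    ax.set_title("Hypothetical 24-hour cycle")
    return fig, ax


# Product figures reused from the bar chart example.
categories = ["Product A", "Product B", "Product C", "Product D"]
values = [50, 30, 40, 70]
lollipop_fig, lollipop_ax = lollipop_chart(categories, values)

# Made-up cyclic data: one reading per hour, mapped onto a full circle.
hours = np.arange(24)
theta = 2 * np.pi * hours / 24
readings = 10 + 5 * np.sin(theta)
polar_fig, polar_ax = polar_chart(theta, readings)
```

Calling `plt.show()` afterwards displays both figures. Numeric positions with explicit tick labels are used for the lollipop so the sketch does not rely on categorical axis handling.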