Data Visualization with Matplotlib
This document provides an introduction to data visualization using Matplotlib in Python. It covers univariate, bivariate, and multivariate data, and explains visualization methods such as histograms, density plots, scatter plots, box plots, pie charts, and hexbin plots.
Data visualization

Data visualization is one of the most important steps in the data science life cycle. It is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. It also gives employees or business owners a clear way to present data to non-technical audiences. Data visualization converts large and small data sets into visuals that are easy for humans to understand and process.

Advantages of Data Visualization: [slide content not preserved in the transcript]

Univariate, Bivariate and Multivariate statistical analysis

Categorical variables: variables that have a finite number of categories or distinct groups. Examples: gender, method of payment, horoscope sign, etc.

Numerical variables: variables that consist of numbers. There are two main kinds of numerical variables.
  Discrete variables: variables that can be counted within a finite time. Examples: the change in your pocket, number of students in a class, numerical grades, etc.
  Continuous variables: variables that can take any value within a range and are usually measured on a scale of some sort. Examples: weight, height, temperature, date and time of a payment, etc.

Univariate data

This type of data consists of only one variable. The analysis of univariate data is the simplest form of analysis, since the information deals with only one quantity that changes. The most common univariate analysis checks the central tendency (mean, median and mode), the range, the maximum and minimum values, and the standard deviation of a variable. A common visual technique for univariate analysis is the histogram, which is a frequency distribution graph. You can also use a box plot, which shows the spread of a variable and gives insight into outliers.

Bivariate data

This type of data involves two different variables.
The analysis of this type of data deals with causes and relationships; its goal is to find out how the two variables are related. The variables may be dependent on or independent of each other. In bivariate data there is always a Y-value for each X-value. The most common visual technique for bivariate analysis is a scatter plot, where one variable is on the x-axis and the other on the y-axis.

Multivariate data

When the data involves three or more variables, it is categorized as multivariate. For three variables, you can create a 3-D model to study the relationship (also known as trivariate analysis). Multivariate refers to multiple variables that together result in one outcome, which means that a majority of our real-world problems are multivariate. Examples:

  Predicting the weather. The season alone is not enough to predict the weather of any given year; several factors play an important role, such as humidity, precipitation, and pollution.
  Determining the value of an apartment. Factors possibly related to the value are the size of the apartment, age of the building, number of bedrooms, number of bathrooms, and location (e.g. floor, view, etc.).

Matplotlib

Matplotlib is a multi-platform data visualization library for Python, built on NumPy arrays. It can be used in Python scripts, the shell, web applications, and other graphical user interface toolkits. John D. Hunter originally conceived Matplotlib in 2002, and its first version was released in 2003. It has an active development community and is distributed under a BSD-style license.

Matplotlib Architecture

Backend layer: contains the implementation of the various functions that are necessary for plotting. There are three essential classes in the backend layer:
  FigureCanvas: the surface on which the figure is drawn
  Renderer: the class that takes care of drawing on the surface
  Event: handles mouse and keyboard events
Artist layer: the second layer in the architecture. It is responsible for the various plotting functions, such as axis, and coordinates how the renderer is used on the figure canvas.

Scripting layer: the topmost layer, on which most of our code runs. The methods in the scripting layer take care of the other layers almost automatically; all we need to care about is the current state (figure and subplot).

Working with Pyplot

matplotlib.pyplot is a collection of command-style functions that make Matplotlib work like MATLAB. Each pyplot function makes some change to a figure: for example, it creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, or decorates the plot with labels. The pyplot module provides the plot() function, which is frequently used to plot a graph.

Functions in pyplot

Plot functions

  Sr.No  Function    Description
  1      bar         Make a bar plot.
  2      barh        Make a horizontal bar plot.
  3      boxplot     Make a box and whisker plot.
  4      hist        Plot a histogram.
  5      hist2d      Make a 2D histogram plot.
  6      pie         Plot a pie chart.
  7      plot        Plot lines and/or markers to the Axes.
  8      polar       Make a polar plot.
  9      scatter     Make a scatter plot of x vs y.
  10     stackplot   Draw a stacked area plot.
  11     stem        Create a stem plot.
  12     step        Make a step plot.
  13     quiver      Plot a 2-D field of arrows.

Axis functions

  Sr.No  Function    Description
  1      axes        Add axes to the figure.
  2      text        Add text to the axes.
  3      title       Set a title of the current axes.
  4      xlabel      Set the x-axis label of the current axes.
  5      xlim        Get or set the x-limits of the current axes.
  6      xscale      Set the scaling of the x-axis.
  7      xticks      Get or set the current tick locations and labels of the x-axis.
  8      ylabel      Set the y-axis label of the current axes.
  9      ylim        Get or set the y-limits of the current axes.
  10     yscale      Set the scaling of the y-axis.
  11     yticks      Get or set the current tick locations and labels of the y-axis.

Figure functions

  Sr.No  Function    Description
  1      figtext     Add text to the figure.
  2      figure      Create a new figure.
  3      show        Display a figure.
  4      savefig     Save the current figure.
  5      close       Close a figure window.

Example:

    from matplotlib import pyplot as plt

    x = ['c', 'c++', 'java', 'se', 'os', 'network']
    y = [40, 50, 45, 40, 78, 45]

    # 'bo' draws blue circle markers only, with no connecting line
    plt.plot(x, y, 'bo')
    plt.title('Line graph')
    plt.ylabel('Y axis')
    plt.xlabel('X axis')
    plt.show()

Formatting the plot

  Example format string   Meaning
  'b'                     Blue marker with the default shape
  'ro'                    Red circle
  '-g'                    Green solid line
  '--'                    Dashed line with the default color
  '^k:'                   Black triangle-up markers connected by a dotted line

Matplotlib supports the following color abbreviations:

  Character   Color
  'b'         Blue
  'g'         Green
  'r'         Red
  'c'         Cyan
  'm'         Magenta
  'y'         Yellow
  'k'         Black
  'w'         White

Line graph

    from matplotlib import pyplot as plt

    x = [1, 2, 3, 4, 5, 6]
    y = [40, 50, 45, 40, 78, 45]

    plt.plot(x, y)
    plt.title('Line graph')
    plt.ylabel('Y axis')
    plt.xlabel('X axis')
    plt.show()

The same graph with a red line, a thicker stroke, a label, and a black grid:

    from matplotlib import pyplot as plt

    x = [1, 2, 3, 4, 5, 6]
    y = [40, 50, 45, 40, 78, 45]

    plt.plot(x, y, 'r', label='line one', linewidth=5)
    plt.title('Line graph')
    plt.ylabel('Y axis')
    plt.xlabel('X axis')
    plt.grid(True, color='k')
    plt.legend()  # needed for the label passed to plot() to be displayed
    plt.show()

Histogram

A histogram is a graphical representation of a grouped frequency distribution with continuous classes. It is an area diagram and can be defined as a set of rectangles with bases along the intervals between class boundaries and with areas proportional to the frequencies in the corresponding classes. In such representations, all the rectangles are adjacent, since the bases cover the intervals between class boundaries.
When the classes have equal width, the heights of the rectangles are proportional to the corresponding frequencies.

  Histogram                                    Bar Graph
  It is a two-dimensional figure.              It is a one-dimensional figure.
  The frequency is shown by the area of        The height shows the frequency and the
  each rectangle.                              width has no significance.
  It shows rectangles touching each other.     It consists of rectangles separated from
                                               each other with equal spaces.

To construct a histogram, follow these steps:
  1. Bin the range of values: divide the entire range of values into a series of intervals.
  2. Count how many values fall into each interval.
The bins are usually specified as consecutive, non-overlapping intervals of a variable.

The matplotlib.pyplot.hist() function plots a histogram. It computes and draws the histogram of x.

Parameters:
  x            array or sequence of arrays
  bins         integer, sequence, or 'auto' (optional)

Optional parameters:
  range        The lower and upper range of the bins.
  density      If True, the first element of the return tuple will be the counts normalized to form a probability density.
  cumulative   If True, a histogram is computed where each bin gives the counts in that bin plus all bins for smaller values.
  histtype     The type of histogram to draw. Default is 'bar'.
                 'bar' is a traditional bar-type histogram. If multiple data sets are given, the bars are arranged side by side.
                 'barstacked' is a bar-type histogram where multiple data sets are stacked on top of each other.
                 'step' generates a line plot that is by default unfilled.
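As a minimal sketch of how the optional parameters above change the result, the following draws one data set three ways and checks the returned values. The `marks` array is randomly generated for illustration and is not data from the slides; the `Agg` backend line is an assumption made only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample: 200 integer marks in the range 0..99
rng = np.random.default_rng(0)
marks = rng.integers(0, 100, size=200)
edges = [0, 20, 40, 60, 80, 100]

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

# Plain histogram: the first return value holds the count in each bin
counts, bin_edges, _ = ax1.hist(marks, bins=edges)
ax1.set_title("counts")

# density=True: bar heights are scaled so the total bar area is 1
dens, _, _ = ax2.hist(marks, bins=edges, density=True)
ax2.set_title("density=True")

# cumulative=True with histtype='step': running total drawn as an unfilled outline
cum, _, _ = ax3.hist(marks, bins=edges, cumulative=True, histtype="step")
ax3.set_title("cumulative=True, histtype='step'")

# Area under the density histogram = sum of height * bin width
total_area = float(np.sum(dens * np.diff(bin_edges)))
print("counts:", counts)
print("area under density histogram:", total_area)
print("cumulative counts:", cum)

fig.tight_layout()
```

The printed area should come out at 1 (up to floating-point rounding), and the last cumulative value should equal the number of samples.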
Example with explicit bin edges:

    import matplotlib.pyplot as plt

    x = [1, 1, 2, 3, 3, 5, 7, 8, 9, 10,
         10, 11, 11, 13, 13, 15, 16, 17, 18, 18,
         18, 19, 20, 21, 21, 23, 24, 24, 25, 25,
         25, 25, 26, 26, 26, 27, 27, 27, 27, 27,
         29, 30, 30, 31, 33, 34, 34, 34, 35, 36,
         36, 37, 37, 38, 38, 39, 40, 41, 41, 42,
         43, 44, 45, 45, 46, 47, 48, 48, 49, 50,
         51, 52, 53, 54, 55, 55, 56, 57, 58, 60,
         61, 63, 64, 65, 66, 68, 70, 71, 72, 74,
         75, 77, 81, 83, 84, 87, 89, 90, 90, 91]

    plt.style.use('ggplot')
    plt.hist(x, bins=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99], facecolor='blue')
    plt.show()

Example using the Axes interface:

    from matplotlib import pyplot as plt
    import numpy as np

    fig, ax = plt.subplots(1, 1)
    a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])

    ax.hist(a, bins=[0, 20, 40, 60, 80, 100], facecolor='blue')
    ax.set_title("histogram of result")
    ax.set_xticks([0, 20, 40, 60, 80, 100])
    ax.set_xlabel('marks')
    ax.set_ylabel('no. of students')
    plt.show()

Density plot

A density plot is a representation of the distribution of a numeric variable. It uses a kernel density estimate to show the probability density function of the variable. It is a smoothed version of the histogram and is used for the same purpose. A density plot can be drawn in several ways in Python:
  Using the scipy.stats module
  Using the Seaborn kdeplot function
  Using the pandas plot function
  Using Seaborn distplot (deprecated in recent Seaborn releases)

Using the scipy.stats module

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde

    data = [2, 3, 3, 4, 2, 1, 5, 6, 4, 3, 3, 3, 6, 4, 5, 4, 3, 2]

    # gaussian_kde is imported directly, so it is called without a module prefix
    density = gaussian_kde(data)
    x = np.linspace(-2, 10, 300)
    y = density(x)

    plt.plot(x, y)
    plt.title("Density Plot of the data")
    plt.show()

Using the pandas plot function

    import pandas as pd
    import matplotlib.pyplot as plt

    data = [2, 3, 3, 4, 2, 1, 5, 6, 4, 3, 3, 3, 6, 4, 5, 4, 3, 2]
    df = pd.DataFrame(data)

    df.plot(kind='density')
    plt.show()

  Age range   Count
  0–5         36
  6–10        19
  11–15       18
  16–20       99
  21–25       139
  26–30       121
  31–35       76
  36–40       74
  41–45       54
  46–50       50
  51–55       26
  56–60       22
  61–65       16
  66–70       3
  71–75       3

Scatter plot

A scatter plot is a data visualization that displays the values of two different variables as points. The data for each point is represented by its horizontal (x) and vertical (y) position on the visualization. Additional variables can be encoded by labels, markers, color, transparency, size (bubbles), and by creating 'small multiples' of scatter plots. Scatter plots are also known as scatterplots, scatter graphs, scatter charts, scattergrams, and scatter diagrams.

The scatter() method in the matplotlib library is used to draw a scatter plot.
The scatter() method takes the following parameters:
  x, y         arrays containing the x-axis and y-axis data
  s            marker size (a scalar, or an array of the same size as x and y)
  c            color or sequence of colors for the markers
  marker       marker style
  cmap         colormap name
  linewidths   width of the marker border
  edgecolors   marker border color
  alpha        blending value, between 0 (transparent) and 1 (opaque)

Example:

    import matplotlib.pyplot as plt

    x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
    y = [99, 86, 87, 88, 100, 86, 103, 87, 94, 78, 77, 85, 86]

    plt.scatter(x, y, c="r")
    plt.show()

Example with two series on the same axes:

    import matplotlib.pyplot as plt

    grade = [20, 30, 40, 50, 60, 70, 80, 90]
    boys = [13, 10, 35, 44, 32, 22, 11, 9]
    girls = [14, 24, 39, 46, 33, 21, 9, 3]

    plt.scatter(grade, boys, c="r")   # boys in red
    plt.scatter(grade, girls, c="b")  # girls in blue
    plt.show()

Box plot

A box plot displays a summary of a set of data containing the minimum, first quartile, median, third quartile, and maximum. In a box plot, we draw a box from the first quartile to the third quartile. A line goes through the box at the median.
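The values a box plot summarizes can be computed directly. The following is a minimal sketch with NumPy on a small made-up sample (not data from the slides). It also assumes Matplotlib's default whisker rule, under which whiskers reach the farthest data points within 1.5 times the interquartile range of the box and anything beyond is drawn as an outlier, so the whisker ends can differ from the raw minimum and maximum.

```python
import numpy as np

# Hypothetical sample with one obviously large value
data = np.array([2, 4, 4, 5, 7, 9, 10, 12, 30])

# Quartiles: the box spans q1..q3 and the line inside it marks the median
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences for the 1.5 * IQR rule (assumed default whisker setting)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points that are still inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points outside the fences would be drawn as individual outlier markers
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("min, max:", data.min(), data.max())
print("q1, median, q3:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

For this sample the maximum (30) lies beyond the upper fence, so it is reported as an outlier and the upper whisker stops at 12.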
Syntax

    pyplot.boxplot(data, notch=None, vert=None, patch_artist=None, widths=None)

  Attribute      Value
  data           array or sequence of arrays to be plotted
  notch          optional parameter, accepts boolean values
  vert           optional parameter, accepts boolean values: False for a horizontal plot and True for a vertical plot
  bootstrap      optional parameter, accepts an int; specifies intervals around notched boxplots
  usermedians    optional parameter, accepts an array or sequence of arrays with dimensions compatible with data
  positions      optional parameter, accepts an array and sets the position of the boxes
  widths         optional parameter, accepts an array and sets the width of the boxes
  patch_artist   optional parameter, accepts boolean values
  labels         sequence of strings; sets a label for each dataset
  meanline       optional parameter, accepts boolean values; tries to render the mean line as the full width of the box
  zorder         optional parameter, sets the drawing order of the boxplot

Example:

    import matplotlib.pyplot as plt

    C = [82, 76, 24, 40, 67, 62, 75, 78, 71, 32, 98, 89, 78, 67, 72, 82, 87, 66, 56, 52]
    CPP = [62, 5, 91, 25, 36, 32, 96, 95, 3, 90, 95, 32, 27, 55, 100, 15, 71, 11, 37, 21]
    Java = [23, 89, 12, 78, 72, 89, 25, 69, 68, 86, 19, 49, 15, 16, 16, 75, 65, 31, 25, 52]
    Python = [59, 73, 70, 16, 81, 61, 88, 98, 10, 87, 29, 72, 16, 23, 72, 88, 78, 99, 75, 30]

    box_plot_data = [C, CPP, Java, Python]

    # One filled box per subject, drawn vertically
    box = plt.boxplot(box_plot_data, vert=True, patch_artist=True,
                      labels=['C', 'CPP', 'Java', 'Python'])
    plt.show()

    # Same data with notched boxes (notch takes a boolean, not the string 'True')
    box = plt.boxplot(box_plot_data, notch=True, vert=True, patch_artist=True,
                      labels=['C', 'CPP', 'Java', 'Python'])
    plt.show()

Pie Chart

A pie chart is a circular statistical plot that can display only one series of data. The area of the whole chart represents the total of the given data, and the area of each slice represents that part's percentage of the total. The slices of the pie are called wedges.

    pyplot.pie(data, explode=None, labels=None, colors=None, autopct=None, shadow=False)

  data (x)   array-like. The wedge sizes.
  labels     list. A sequence of strings providing the labels for each wedge.
  colors     A sequence of Matplotlib color arguments through which the pie chart will cycle. If None, the colors in the currently active cycle are used.
  autopct    string used to label the wedges with their numeric value. The label is placed inside the wedge. The format string is applied as fmt % pct.

Further customizations: start angle, shadow, colors, explode, and legend.

The explode parameter lets you make wedges stand out. To add a list of explanations for each wedge, use the legend() function:

    import matplotlib.pyplot as plt

    my_data = [300, 500, 700]
    my_labels = 'Tasks Pending', 'Tasks Ongoing', 'Tasks Completed'
    my_colors = ['red', 'lightblue', 'silver']
    my_explode = (0, 0.1, 0)  # pull the second wedge out slightly

    plt.pie(my_data, labels=my_labels, autopct='%1.1f%%', startangle=90,
            shadow=True, colors=my_colors, explode=my_explode)
    plt.title('My Tasks')
    plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
    plt.legend()
    plt.show()

Hexbin plot

- A hexbin plot in Matplotlib is a 2D histogram representation of data points using hexagonal bins instead of rectangular bins.
- It is useful for visualizing the distribution and density of data points.
- You can create a hexbin plot in Matplotlib using the hexbin function.

Syntax

    matplotlib.pyplot.hexbin(x, y, C=None, gridsize=100, bins=None, cmap=None,
                             norm=None, vmin=None, vmax=None, alpha=None,
                             linewidths=None, edgecolors='face',
                             reduce_C_function=np.mean, mincnt=None,
                             marginals=False, data=None, **kwargs)

  x: The x-coordinates of the data points.
  y: The y-coordinates of the data points.
  C (optional): The values associated with each data point (e.g., weights). If not specified, each point counts as 1 and the hexagons show point counts.
  gridsize (optional): The number of hexagons in the x-direction. Default is 100.
  bins (optional): How the hexagon values are discretized for coloring: 'log' for a logarithmic color scale, an integer number of bins, or a sequence of lower bin bounds. It does not replace gridsize, which controls the hexagon grid itself.
  cmap (optional): The colormap for coloring the hexagons.
  norm (optional): Normalizes data values into the [0, 1] range for colormap scaling.
  vmin and vmax (optional): The minimum and maximum values for colormap scaling.
  alpha (optional): The transparency of the hexagons.
  linewidths (optional): The width of the hexagon edges.
  edgecolors (optional): The color of the hexagon edges. You can set it to 'face' to use the same color as the hexagon face.
  reduce_C_function (optional): A function to reduce the C values within each hexagon. Default is np.mean.
  mincnt (optional): Ignore bins with fewer than mincnt counts.
  marginals (optional): Whether to add marginal histograms (default is False).

The hexbin function returns an object that represents the hexbin plot, which you can use to customize the plot further or add a colorbar.

Example:

    import numpy as np
    import matplotlib.pyplot as plt

    # Generate some random data
    np.random.seed(0)
    x = np.random.normal(0, 1, 1000)
    y = np.random.normal(0, 1, 1000)

    # Create a hexbin plot
    plt.hexbin(x, y, gridsize=30, cmap='viridis')

    # Add a colorbar for reference
    plt.colorbar(label='Count in Bin')

    # Set labels and title
    plt.xlabel('X-axis')
    plt.ylabel('Y-axis')
    plt.title('Hexbin Plot Example')

    # Show the plot
    plt.show()

In this example:
  1. We generate some random data points x and y.
  2. We use the plt.hexbin function to create the hexbin plot.
  3. You can adjust the gridsize parameter to control the size of the hexagonal bins.
  4. The cmap parameter specifies the colormap for coloring the hexagons.
  5. We add a colorbar to indicate the count of points in each bin.
  6. Labels and a title are set for the plot.
  7. Finally, we display the plot using plt.show().

You can replace the x and y arrays with your own data to create a hexbin plot for your specific dataset.
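The returned object and the C / reduce_C_function parameters described above can be explored with a short sketch. The per-point value `z` is a made-up quantity for illustration, and the `Agg` backend line is an assumption made only so the code runs without a display. `get_array()` is used here on the assumption that the returned collection exposes the per-hexagon values that way, which is how Matplotlib's colormapped collections generally behave.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view plots interactively
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)
y = rng.normal(0, 1, 1000)
z = x + y  # hypothetical value attached to every point

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left: plain counts per hexagon; mincnt=1 hides hexagons that contain no points
hb_counts = ax1.hexbin(x, y, gridsize=30, cmap="viridis", mincnt=1)
fig.colorbar(hb_counts, ax=ax1, label="Count in bin")
ax1.set_title("Point counts")

# Right: C plus reduce_C_function colors each hexagon by the mean of z inside it
hb_mean = ax2.hexbin(x, y, C=z, gridsize=30, cmap="viridis",
                     reduce_C_function=np.mean)
fig.colorbar(hb_mean, ax=ax2, label="Mean of z in bin")
ax2.set_title("Mean of z")

# The returned collection carries one value per drawn hexagon
counts = np.asarray(hb_counts.get_array())
means = np.asarray(hb_mean.get_array())
print("drawn hexagons:", len(counts))
print("largest count in one hexagon:", counts.max())

fig.tight_layout()
```

Passing a different callable such as np.max or np.sum as reduce_C_function changes how the values inside each hexagon are combined.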