Data Visualization in Python (PDF)

Chapter 4 Data visualization

4.1 Introduction to data visualization

Data visualization refers to the representation of data through the use of graphics. Making informative visualizations is an important task in data analysis, whether as part of the exploratory process or as a way of generating ideas for models. Python has many add-on libraries for making static or dynamic visualizations. In this chapter, we will introduce a library called matplotlib, whose name reflects its origin as a MATLAB-style plotting library, and the techniques for making various graphs to present statistical data and the results of artificial intelligence. matplotlib is a desktop plotting package designed for creating publication-quality plots. It contains many modules, including pyplot. We usually import it by: import matplotlib.pyplot as plt

4.2 Frequency plot

Consider a set of one-dimensional data. The values can be either numerical (e.g. marks of students) or non-numerical (e.g. letter grades of students). A frequency plot is based on the number of occurrences of each unique value.

Let's use an example to illustrate different plots. The file "ama1234.csv" contains the marks and letter grades of 100 students. We can first read the file as a single DataFrame, then extract the two columns as two Series. Based on the grades, we can directly plot a histogram using hist() to show the number of students obtaining each grade. The show() function displays the plot.

In the previous chapter, we introduced the value_counts() method in Pandas, which gives a frequency table of a set of data by counting the frequency of each unique value. The result is a Series whose index holds the unique values and whose values are the corresponding frequencies. We might first store these as two arrays, x (the index) and y (the values). Using these two arrays, we can plot a bar chart or a pie chart to illustrate the frequencies in the data. For a bar chart, two arrays are required, one for the x-axis and one for the y-axis.
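The steps from frequency table to bar chart can be sketched as follows. This is a minimal illustration, assuming a short made-up Series of grades in place of the grade column of "ama1234.csv" (whose column names are not given here); the names grades, freq, x and y are chosen for illustration only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up letter grades standing in for the grade column of "ama1234.csv".
grades = pd.Series(["A", "B", "B", "C", "A", "F", "B", "C", "D", "B"])

# Frequency table: the index holds the unique grades, the values their counts.
# sort_index() orders the grades alphabetically for the x-axis.
freq = grades.value_counts().sort_index()

x = list(freq.index)    # unique grades, used for the x-axis
y = list(freq.values)   # frequency of each grade, used for the y-axis

plt.figure()
bars = plt.bar(x, y)    # one bar per unique grade
plt.xlabel("Grade")
plt.ylabel("Number of students")
plt.title("Frequency of letter grades")
```

Calling plt.show() afterwards displays the figure. sort_index() is used here only so that the grades appear in alphabetical order; without it, value_counts() lists the most frequent value first.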
For a pie chart, only an array of values is necessary. We might also optionally add labels and automatic percentages (the labels and autopct parameters of pie()).

4.3 Data binning

If we look at the Series of marks in the example problem, we can see that the entries are floating-point values between 0 and 100, rounded to one decimal place. It would be unrealistic to make a frequency plot of each unique mark. Instead, we might group similar marks together. In statistics, data binning is a way to group more-or-less continuous values into a smaller number of "bins". After binning, we might plot a histogram or a density plot based on the frequency of the binned values.

In pyplot, when we plot a histogram of an array of values from a continuous variable, the values are binned automatically. We can also specify the binning either by the number of equal-width intervals or by the end points of the intervals, using the following syntax respectively:

plt.hist(data_set, bins=number_of_bins)
plt.hist(data_set, bins=[point_0,point_1,...,point_n])

In the example, when no bins parameter is given, the data is automatically put into 10 bins of equal width between the smallest and the largest value. The end points of the intervals and the frequency of data in each interval are returned as arrays by hist(). Suppose we would like to use 20 bins with narrower intervals; simply set bins=20.

Suppose the rubric of ama1234 is as follows:

A: [85, 100], B: [70, 85), C: [60, 70), D: [50, 60), F: [0, 50)

We may form an array from the threshold marks together with 0 and the full mark (100) to serve as the end points of the bins. Notice that the intervals do not have equal width. This is reflected in the widths of the rectangles on the histogram.

4.4 Line graph

Some data sets depend on a continuous variable. For example, a stock price depends on time. To present such data, we may use a line graph, which is good for showing the trend and local extreme values.
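As a minimal sketch of such a graph, the following plots a made-up series against a day counter and marks its highest value. The data are synthetic, not real prices, and the names day, value and peak_day are chosen for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: an upward trend with an oscillation on top (not real prices).
day = np.arange(60)
value = 100 + 0.5 * day + 10 * np.sin(day / 5)

plt.figure()
(line,) = plt.plot(day, value)      # line graph of value against day

# Mark the highest point of the series with a red star.
peak_day = day[np.argmax(value)]
plt.plot(peak_day, value.max(), "r*")

plt.xlabel("Day")
plt.ylabel("Value")
plt.title("Synthetic series with its maximum marked")
```

Calling plt.show() afterwards displays the figure; the overall rise and the individual peaks and troughs are both visible along the line.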
As an example, we will study the stock prices of two companies, Apple (AAPL) and Microsoft (MSFT). The files "AAPL.csv" and "MSFT.csv" contain five years of historical data. The column of adjusted close prices is stored as a Series.

In pyplot, we can directly use plot() to make a line graph in two ways: (i) with a Series or array; or (ii) by selecting two columns from a DataFrame. Moreover, the colour and style of the line can be adjusted by the commands passed inside the brackets.

command   colour     command   style
'k'       black      '--'      dashed line
'g'       green      ':'       dotted line
'r'       red        '*'       points with stars
'b'       blue       'o'       points with circles
'y'       yellow     '+'       points with plus signs

Method (i): Directly apply the plot function with a Series as input.

Method (ii): Choose the date as the x-axis and the adjusted price as the y-axis. For better display we can rotate the date labels (optional).

However, plotting each stock on its own figure makes comparison hard. In matplotlib, we can plot several lines on the same figure. To describe the figure and distinguish the lines, we can also add a title, axis labels and a legend. This applies not only to line graphs but also to the graphs introduced earlier. When the show() command is executed, all the graphs and information added before it are displayed on the same figure.

However, it is still hard to compare the performance of the two stocks, as they have different scales. A solution is to divide each daily adjusted close price by the first entry of its series, so that both lines start at 1.

Such data can be found and downloaded as csv files from the internet. For example, Yahoo Finance is a useful source of financial data. https://finance.yahoo.com/

4.5 Scatter plot

A dataset might consist of more than one variable under the same index. For example, in a set of health data of a class of students, height (m) and weight (kg) are two variables. We call such a dataset bivariate data.
In statistics and data science, we are interested in studying whether there is any relationship between these two variables, using methods like correlation and regression. More will be discussed in a later chapter.

Bivariate data can be visualized by a scatter plot. A simple example is the set of health data. Suppose we are going to study the relation between height and weight. After reading the csv file "health.csv" into a DataFrame, extract the columns representing the two variables (height and weight) into two Series. For each data index, a point whose x-coordinate is the height and whose y-coordinate is the weight is plotted on the figure. As a result, a scatter plot contains as many points as the number of rows of the original DataFrame.

The matplotlib.pyplot library already provides a function for making a scatter plot. The syntax is as follows, where the two variables can be two Series with the same index, or two columns from a DataFrame.

plt.scatter(variable1, variable2)

In case the data points can be classified into different categories (e.g. male and female), we might assign a colour to each point with c=[colour0,colour1,...]. If the data points are weighted, we might show each point with a different size using s=[size0,size1,...].
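A scatter plot with per-point colours and sizes might look like the following sketch. It assumes a small made-up DataFrame in place of "health.csv"; the column names height, weight and sex, and the list of point sizes, are hypothetical and chosen only for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up records standing in for "health.csv" (column names are hypothetical).
health = pd.DataFrame({
    "height": [1.62, 1.75, 1.68, 1.80, 1.55, 1.71],   # metres
    "weight": [55.0, 72.5, 60.2, 81.0, 48.3, 66.4],   # kilograms
    "sex":    ["F", "M", "F", "M", "F", "M"],
})

# One colour per point, chosen by category: red for "F", blue for "M".
colours = health["sex"].map({"F": "r", "M": "b"}).tolist()

# One size per point (hypothetical weights; larger means a bigger marker).
sizes = [30, 60, 30, 90, 30, 60]

plt.figure()
points = plt.scatter(health["height"], health["weight"], c=colours, s=sizes)
plt.xlabel("Height (m)")
plt.ylabel("Weight (kg)")
plt.title("Height against weight")
```

Calling plt.show() afterwards displays the figure. The lists passed to c and s must have one entry per row of the DataFrame, in the same order as the rows.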
