Chapter 3 Data analytics

3.1 Introduction to Python data analytics

We have introduced descriptive statistics for describing the characteristics of a set of data. However, the computation can be tedious when the data size is large. Computer programming can be used as a tool for making such computations. In this chapter, we will introduce useful techniques for data analytics using Python with Pandas.

Pandas stands for Python Data Analysis. It is a library that contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. Pandas adopts array-based computing, which allows data processing without for loops, and it is designed for working with tabular or heterogeneous data. Before using the functions in the Pandas library, we have to import it first, conventionally with import pandas as pd. When we call functions in Pandas, we can then start with "pd.".

In Pandas, data can be stored in two forms, Series and DataFrame. Notice that the letters 'S', 'D' and 'F' are capital. A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index. We can simply convert a list of numerical values into a Series by applying pd.Series(). For example, we can convert the list of the first 5 prime numbers into a Series. The index by default is 0 to 4. However, we may also rename the index as another list of strings or numbers.

One useful feature of Series is vectorized computation. If we apply an arithmetic operation between two Series, the entries of the corresponding indexes will be calculated, resulting in a new Series. If we apply an arithmetic operation between a Series and a number, the operation will be applied to each entry of the Series. We can verify this result in Python. With vectorized computation, we can apply standardization to the whole Series easily.

Sometimes the values in a Series are not sorted in order. We can use the sort_values function in Pandas for sorting. The default order is ascending, but it can be changed to descending. Notice that after sorting, each index is still attached to its corresponding value in the Series. The sort_values function results in a new Series without changing the original Series.

3.2 Descriptive statistics and arithmetic in Pandas

Pandas Series are equipped with a set of common mathematical and statistical methods. The table below shows some common descriptive statistics methods in Pandas. To apply these methods, put a dot "." after a Series and then call the function.

    Statistical meaning              Function
    sum                              sum()
    sample size                      count()
    arithmetic mean                  mean()
    mean absolute deviation          mad()
    median                           median()
    mode                             mode()
    maximum                          max()
    minimum                          min()
    population variance              var(ddof=0)
    sample variance                  var(ddof=1)
    population standard deviation    std(ddof=0)
    sample standard deviation        std(ddof=1)

Using these functions, we can verify the results in the previous chapter. Instead of showing each particular item, we can also generate a summarized result of the descriptive statistics of a Series using the method describe(). Notice that the std here refers to the sample standard deviation.

The numbers resulting from the describe method are calculated based on the data values. However, some methods return indirect statistics. For example, we know the maximum value in Series A is 32, but we might not know where this value is. The functions idxmax() and idxmin() return the index of the maximum/minimum entry in a Series. In Series B, we find that there are multiple entries with the same value. If we want to check the unique values and the frequency of each unique value in a Series, we can use the function value_counts(), which returns a new Series with its index being the unique values and its entries being the frequency of each unique value. This is what we call a frequency table. A sketch of these Series operations is given below.
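The original code listings for this chapter did not survive the transcript. The following is a minimal sketch of the Series operations described in Sections 3.1 and 3.2; the textbook's Series A and B are not shown in full, so the prime-number data here stands in for them.

```python
import pandas as pd

# The first 5 prime numbers as a Series; the default index is 0 to 4
primes = pd.Series([2, 3, 5, 7, 11])

# The index can be renamed with another list of strings or numbers
primes.index = ["a", "b", "c", "d", "e"]

# Vectorized computation: Series-with-Series arithmetic is applied
# entry by entry on matching indexes; Series-with-number arithmetic
# is applied to every entry
total = primes + primes
scaled = primes * 10

# Standardization of the whole Series in one expression
z = (primes - primes.mean()) / primes.std(ddof=0)

# sort_values returns a new Series and leaves the original unchanged;
# each value keeps its original index after sorting
ascending = primes.sort_values()
descending = primes.sort_values(ascending=False)

# Descriptive statistics
primes.sum(), primes.count(), primes.mean(), primes.median()
primes.var(ddof=0), primes.var(ddof=1)   # population vs sample variance
primes.std(ddof=0), primes.std(ddof=1)   # population vs sample std
primes.describe()                        # summary; std is the sample std
primes.idxmax(), primes.idxmin()         # index of the max/min entry
primes.value_counts()                    # frequency table of unique values
```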
A Series is only a one-dimensional array-like data structure, suitable only for storing a data set of a single variable. However, in real life a data file might contain multiple variables. For example, the health record of a class of students might contain their gender (string), height (float) and weight (float). Each of these three variables can be stored as a Series, and we can combine these Series together into a two-dimensional tabular form called a DataFrame. Each Series is regarded as one column in the DataFrame. In order to distinguish them, we can also give names to these columns. In the example below, we first define the three Series with values and names. Then we combine the Series together into a DataFrame using the concat method. The syntax is:

    df_name = pd.concat([Series1, Series2, ...], axis=1)

The DataFrame is displayed in tabular form tidily as shown below. Since a DataFrame is a two-dimensional data structure, we can call out either a row, a column or a single entry from it. To call a single column, directly use the column name. To call a single row, apply the loc method and use the index of the row. To call a single entry, apply the loc method and include both the row index and the column name of the target entry.

For a DataFrame, we might want to add new columns based on some arithmetic on the existing columns. Recall that since a column of a DataFrame is a Series, it also supports vectorized computation. If we make arithmetic operations between two columns, the result is a Series with the same dimension, and we can create a new column in the DataFrame from such a result. The syntax is:

    DataFrame_name["new_col_name"] = Series_name

For example, we would like to create a new column of weight in pounds, which equals the weight in kilograms multiplied by 2.2. We would also like to create another column of BMI (body mass index), which is the weight in kg divided by the square of the height in m. Refer to the coding below:
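This is a minimal sketch consistent with the description above; the five students' values are invented for illustration.

```python
import pandas as pd

# Three named Series, one per variable (illustrative values)
sex = pd.Series(["M", "F", "M", "F", "M"], name="sex")
height = pd.Series([1.75, 1.62, 1.84, 1.58, 1.70], name="height")
weight = pd.Series([68.0, 52.5, 75.2, 48.9, 63.3], name="weight")

# Combine the Series into a DataFrame; each Series becomes one column
df1 = pd.concat([sex, height, weight], axis=1)

df1["height"]         # a single column (a Series)
df1.loc[0]            # a single row, by its index
df1.loc[0, "height"]  # a single entry, by row index and column name

# New columns from vectorized arithmetic on existing columns
df1["weight_lb"] = df1["weight"] * 2.2          # kg to pounds
df1["bmi"] = df1["weight"] / df1["height"]**2   # kg over metres squared
```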
3.3 Data loading

In the previous sections, the data, whether in Series or DataFrame form, were input one by one on our own. In reality, this would be impossible due to the volume of big data. Pandas provides various methods to read data from various file formats or sources into a DataFrame for analytics.

One common type of data file is the comma-separated values (csv) file. It is a delimited text file that uses a comma to separate values. Each line is regarded as a data record. However, the first line is usually used for column titles, indicating the meaning of the values stored in each column. When opened in Excel, the values are automatically arranged by rows and columns without showing the commas. When opened in Notepad, each record is stored on a line with its values separated by commas.

In the example below, we would like to read data from the csv file "health1.csv", which already reserves the first row as the header, providing information about each column. The header row will be converted to the column names of the DataFrame and not regarded as data values. The syntax is:

    df_name = pd.read_csv("file_path")

If the data file doesn't contain a header, as in "health2.csv", we need to specify this by putting header=None. The column names in the DataFrame created will by default be 0, 1, 2, ... In case we would like to add column names to a file without a header, we can use names=.

By default, the DataFrame created from reading a data file will be assigned the index 0, 1, 2, ... and so on. In some data files, one of the columns might contain the index of this set of data. If you wish to set a particular column from the data file to be the index column, put the parameter index_col within the brackets of read_csv. In the file "health3.csv", a column called "id" contains the student identity number. We can set it to be the index column, as this value can uniquely distinguish different rows (students).

In case the data file is not a csv file, we can still read it using the read_table method, but we need to specify the delimiting character, which separates the values. In the example below, the file "health.txt" uses whitespace as the delimiting character. Notice that when you read a data file, you need to ensure that the file is in the same location as your Python file.

Suppose we have performed data processing and updated a DataFrame. We can export it into a data file for storage and sharing using the to_csv method. The newly created csv file will be put in the same location as your Python file by default. Each of these reading and writing variants is sketched below.
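A sketch of each loading variant described above, using the chapter's file names; the column names passed to names= are assumptions for illustration.

```python
import pandas as pd

# "health1.csv" reserves its first row as the header
df_a = pd.read_csv("health1.csv")

# "health2.csv" has no header row; columns are named 0, 1, 2, ...
df_b = pd.read_csv("health2.csv", header=None)

# ...or we can supply the column names ourselves
df_b = pd.read_csv("health2.csv", header=None,
                   names=["sex", "height", "weight"])

# "health3.csv" contains an "id" column that uniquely identifies rows;
# use it as the index column
df_c = pd.read_csv("health3.csv", index_col="id")

# "health.txt" is not a csv file; read it with read_table and specify
# the delimiting character (here, whitespace)
df_d = pd.read_table("health.txt", sep=r"\s+")

# Export an updated DataFrame back to a csv file, which is placed in
# the same location as the Python file by default
df_a.to_csv("health_updated.csv")
```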
3.4 Data preparation and data cleaning

We have learnt data loading and simple data analytics, but in reality you will find a gap between these two steps. Due to the process of data collection, the original data file might contain problematic entries and hence not be ready for carrying out data analytics. To fill this gap, we need a process called data preparation. In data preparation, the most important process is data cleaning. It is a process to fix or remove incorrect, corrupted or missing data.

In the example below, two entries in the csv file are replaced by the word "unknown" and an empty cell. When this file is read as a DataFrame, they are not regarded as numerical values. The "unknown" is read as a string and is not valid for statistical measures such as mean() or sum(); an error message will occur. On the other hand, the empty cell is read as NaN, which means Not-a-Number. The statistical measures can still be evaluated, but this entry will be ignored. Conversely, if you want to empty the value in a cell, you can enter None.

To remedy these problematic entries, we would not want to edit them one by one (as there might be thousands of them in a set of big data!). Instead, we can use some existing methods in Pandas. Applying fillna(0) to a Series, DataFrame or a particular column of a DataFrame replaces all the NaN entries by 0. This 0 can be changed to other values. As a more general way, the replace() method allows you to replace any old value in the Series/DataFrame with a new value. The old and new values have to be specified inside the brackets, separated by a comma. Notice that for either of these two methods, the result is another object (Series or DataFrame) without changing the original one. To update the original object, assign the result back to it.

In data preparation, another useful technique is data filtering, which refers to selecting desirable samples from the dataset under certain criteria. In a Pandas DataFrame, such criteria can be based on the values of its columns. The syntax is as follows:

    new_df = old_df[old_df["col_name"] (relation) (number)]

For example, let's consider the original DataFrame df1 containing the health data of 5 students in the previous section. Suppose we would like to create two new DataFrames by separating df1 into the two gender groups. We can check whether the value in df1["sex"] is equal to "M" or "F". Notice that for equality we use double equal signs ==. The criterion can also be an inequality. For example, we can filter the data based on the height being greater than or equal to 1.8 m, or the weight being lower than 60 kg.

We have learnt that data in a Series can be sorted in ascending or descending order. For a DataFrame, data can also be sorted according to the values of a designated variable (column). These steps are sketched below.
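A minimal sketch of the cleaning, filtering and sorting steps in this section. The DataFrame is constructed directly here so the example is self-contained, and pd.to_numeric is an extra step, not from the chapter, to restore a numeric dtype after the string "unknown" is replaced.

```python
import pandas as pd
import numpy as np

# Illustrative health data with one missing value (read as NaN)
# and one bad entry (read as the string "unknown")
df1 = pd.DataFrame({
    "sex":    ["M", "F", "M", "F", "M"],
    "height": [1.75, 1.62, np.nan, 1.58, 1.84],
    "weight": [68.0, "unknown", 75.2, 48.9, 63.3],
})

# fillna(0) replaces every NaN by 0 (any other value can be used);
# the result is a new object, so assign it back to update the original
df1["height"] = df1["height"].fillna(0)

# replace() swaps any old value for a new one
df1["weight"] = df1["weight"].replace("unknown", 0)
df1["weight"] = pd.to_numeric(df1["weight"])  # added step: numeric dtype

# Filtering: keep the rows satisfying a criterion on a column
males   = df1[df1["sex"] == "M"]     # equality uses ==
females = df1[df1["sex"] == "F"]
tall    = df1[df1["height"] >= 1.8]  # inequalities also work
light   = df1[df1["weight"] < 60]

# A DataFrame can be sorted by the values of a designated column
df1_sorted = df1.sort_values("height", ascending=False)
```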