Python Pandas Lecture 9 PDF
Dr. Faten Khalifa
Summary
This document is a lecture on the Python programming library Pandas, focusing on data structures like Series and DataFrames, with accompanying code examples. The lecture is suitable for undergraduate-level students in data science or a similar field.
Full Transcript
Programming for Data Science
Lecture 9
Dr. Faten Khalifa

Pandas

The name Pandas is derived from "Panel Data" and "Python Data Analysis". It reflects both the library's ability to handle panel data (multidimensional structured datasets) and its focus on data manipulation, cleaning, and analysis.

Pandas adopts many coding idioms from NumPy. NumPy is best suited for working with homogeneous numerical array data, whereas Pandas is designed for working with tabular or heterogeneous data.

The main data structures are Series and DataFrame.
Series: a one-dimensional labeled array holding data of any type, such as integers, strings, or Python objects.
DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

Pandas - Series

A Series is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy types) and an associated array of data labels, called its index. If no index is given, a default integer index from 0 to N-1 is created.

    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame

    obj = pd.Series([4, 7, -5, 3])
    print(obj)
    print("---------------------")
    print(obj.values)
    print("---------------------")
    print(obj.index)

An index can also be supplied explicitly, and its labels can then be used to select or assign values.

    obj2 = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])
    print(obj2)
    print("---------------------")
    print(obj2.values)
    print("---------------------")
    print(obj2.index)

    print(obj2['a'])               # select a single value by label
    obj2['d'] = 66                 # assign by label
    print(obj2)
    print(obj2[['a', 'b', 'c']])   # select several values with a list of labels
    print(obj2)

We can use NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions. The link between index labels and values is preserved.

    print(obj2[obj2 > 5])
    print(obj2 * 2)
    print(np.power(obj2, 2))

Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a mapping of index values to data values.

    print('b' in obj2)
    print('e' in obj2)

You can create a Series from a dictionary. When you pass only a dict, the index of the resulting Series follows the order in which the keys were inserted into the dict (older pandas versions sorted the keys instead).
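The key-ordering behaviour just described can be checked with a small sketch. The dictionary below is hypothetical data, not from the lecture, chosen so that insertion order and sorted order differ. It assumes a reasonably recent pandas (0.23 or later) on Python 3.7 or later.

```python
import pandas as pd

# Hypothetical data: keys are deliberately not in alphabetical order.
scores = {"zeta": 3, "alpha": 1, "mu": 2}

s = pd.Series(scores)
print(list(s.index))         # ['zeta', 'alpha', 'mu'] (insertion order)

# sort_index returns a new Series ordered by label; s itself is unchanged.
s_sorted = s.sort_index()
print(list(s_sorted.index))  # ['alpha', 'mu', 'zeta']
```

If sorted labels are wanted, calling sort_index explicitly is safer than relying on the constructor's behaviour, which has differed between pandas versions.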
You can override the default key order by passing an index that lists the dict keys in the order you want them to appear in the resulting Series. Labels in the index that have no matching key (here 'California') get a missing value, and keys that are not listed in the index (here 'Utah') are left out.

    sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
    obj3 = pd.Series(sdata)
    print(obj3)
    print("---------------")
    states = ['California', 'Ohio', 'Oregon', 'Texas']
    obj4 = pd.Series(sdata, index=states)
    print(obj4)

The isnull and notnull functions in pandas can be used to detect missing data, shown as NaN (Not a Number) or NA (Not Available).

    print(pd.isnull(obj4))   # or obj4.isnull()
    print("---------------")
    print(pd.notnull(obj4))

One of the powerful features of a pandas Series is its ability to automatically align data by index labels during arithmetic operations. When performing operations on two Series objects, pandas matches the data by their index labels rather than by their positions. Labels present in only one of the two operands produce NaN in the result.

    print(obj3 + obj4)

Both the Series object itself and its index have a name attribute.

    obj4.name = 'population'
    obj4.index.name = 'state'
    print(obj4)

A Series's index can be altered in place by assignment.

    print(obj)
    obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
    print(obj)

The count method returns the number of non-missing values, and fillna returns a copy with missing values replaced by the given value.

    print(obj4.count())   # counts non-NA/null values
    obj5 = obj4.fillna(0.5)
    print(obj5)

Pandas - DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays.

    data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
            'year': [2000, 2001, 2002, 2001, 2002, 2003],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
    frame = pd.DataFrame(data)
    print(frame)

For large DataFrames, the head method selects only the first five rows by default.

    frame.head()   # or frame.head(n) for the first n rows

If you specify a sequence of columns, the DataFrame's columns will be arranged in that order.

    pd.DataFrame(data, columns=['year', 'state', 'pop'])

If you pass a column that isn't contained in the dict, it will appear with missing values in the result:

    frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                          index=['one', 'two', 'three', 'four', 'five', 'six'])
    print(frame2)
    print("---------------")
    print(frame2.columns)

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute. Attribute access works only when the column name is a valid Python identifier and does not clash with a DataFrame method name (frame2.pop, for example, refers to the pop method, not the 'pop' column).

    print(frame2['state'])
    print(frame2.year)

Rows can be retrieved by label with the loc attribute.

    frame2.loc['three']

Columns can be modified by assignment, either with a scalar value or with an array of values.

    frame2['debt'] = 16.5
    print(frame2)

    frame2['debt'] = np.arange(6.)
    print(frame2)

When you are assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series instead, its labels are aligned to the DataFrame's index, and missing values are inserted in any holes.

    val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
    frame2['debt'] = val
    print(frame2)

The del keyword will delete columns as with a dict.
    frame2['eastern'] = frame2.state == 'Ohio'   # add a boolean column
    print(frame2)

    del frame2['eastern']                        # remove it again
    print(frame2)

DataFrame Creation

From a dictionary of lists:

    import pandas as pd

    dic_list = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
                'year': [2000, 2001, 2002, 2001, 2002, 2003],
                'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
    My_data = pd.DataFrame(dic_list)

From a dictionary of Series:

    import pandas as pd

    state = pd.Series(['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'])
    year = pd.Series([2000, 2001, 2002, 2001, 2002, 2003])
    pop = pd.Series([1.5, 1.7, 3.6, 2.4, 2.9, 3.2])
    dic_series = {"state": state, "year": year, "pop": pop}
    pd.DataFrame(dic_series)

    print(My_data.values)
    print("-----------")
    print(My_data.index)

From a nested dictionary (the outer keys become the columns and the inner keys become the row index):

    import pandas as pd

    pop = {'Nevada': {2001: 2.4, 2002: 2.9},
           'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
    data = pd.DataFrame(pop)
    print(data)
    print(data.values)
    print("-----------")
    print(data.index)

DataFrame operations

Transposing your data:

    data.T

Explicit indexing:

    pd.DataFrame(pop, index=[2001, 2002, 2003])

Setting the name attributes:

    data.columns.name = "state"
    data.index.name = "year"
    data

Membership:

    import pandas as pd

    pop = {'Nevada': {2001: 2.4, 2002: 2.9},
           'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
    data = pd.DataFrame(pop)
    data
    print("Ohio" in data.columns)
    print(2005 in data.index)

Duplicated indices:

    import pandas as pd

    dic_list = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
                'year': [2000, 2001, 2002, 2001, 2002, 2003],
                'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
    data = pd.DataFrame(dic_list, index=[0, 0, 0, 0, 1, 1])
    print(data)

Selections with duplicate labels:

    data[data.index == 1]

Index methods and properties:

    import numpy as np

    index1 = pd.Index([2, 4, 8, 50])
    index2 = pd.Index(np.arange(10))
    index3 = index1.append(index2)
    print(index3)
    print(index3.is_unique)

Reindexing

    import pandas as pd

    dic_list = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
                'year': [2003, 2001, 2002, 2001, 2002, 2000],
                'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
    data = pd.DataFrame(dic_list, index=["d", "c", "b", "f", "e", "a"])
    print(data)
    data2 = data.reindex(["a", "b", "c", "k", "e", "f", "g"])
    print(data2)

The reindex() method allows both row and column reordering or alignment in one operation. Labels that were not present in the original object are filled with NaN.

DataFrame from a NumPy array:

    import pandas as pd
    import numpy as np

    frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                         index=['a', 'c', 'd'],
                         columns=['Ohio', 'Texas', 'California'])
    print(frame)
    print("------------------")
    states = ["California", "Ohio", 'Nevada', "Texas"]
    frame2 = frame.reindex(["d", "c", "b", "a"], columns=states)
    print(frame2)
    print(frame2.loc[['a', 'd', 'b'], states])

Reindexing with a fill method:

    import pandas as pd
    import numpy as np

    obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[2, 4, 5])
    print(obj3)
    obj3 = obj3.reindex(np.arange(9), method="ffill")   # or method="bfill"
    print(obj3)

Dropping Entries from an Axis

Dropping in a Series:

    import pandas as pd
    import numpy as np

    obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
    print(obj)
    obj = obj.drop("c")
    obj = obj.drop(["a", "e"])
    print(obj)

Dropping in a DataFrame:

    import pandas as pd
    import numpy as np

    data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                        index=['Ohio', 'Colorado', 'Utah', 'New York'],
                        columns=['one', 'two', 'three', 'four'])
    print(data)
    data = data.drop(["Colorado", "Utah"])
    print(data)

Dropping columns:

    data = data.drop(["one", "three"], axis="columns")
    print(data)

When inplace=True is used, the drop method modifies the original DataFrame (data) directly and does not return a new DataFrame.
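The difference between the two calling styles can be sketched as follows, reusing the lecture's 4x4 frame. The names trimmed and result are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Default behaviour: drop returns a new DataFrame and leaves data untouched.
trimmed = data.drop(['Colorado', 'Utah'])
print(list(trimmed.index))   # ['Ohio', 'New York']
print(list(data.index))      # ['Ohio', 'Colorado', 'Utah', 'New York']

# With inplace=True the rows are removed from data itself and the call returns None.
result = data.drop(['Colorado', 'Utah'], inplace=True)
print(result)                # None
print(list(data.index))      # ['Ohio', 'New York']
```

A practical consequence is that writing data = data.drop(..., inplace=True) replaces data with None, so the two styles should not be combined.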
Dropping with inplace=True:

    data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                        index=['Ohio', 'Colorado', 'Utah', 'New York'],
                        columns=['one', 'two', 'three', 'four'])
    data.drop(["Colorado", "Utah"], inplace=True)
    print(data)

Indexing, Selection, Filtering

    import pandas as pd
    import numpy as np

    obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
    print(obj)
    print("-------------------")
    print(obj[obj [...] 5]
    #data < 5
    #data[data [...]

    # integer-location-based selection
    #data.iloc[2, [3, 0, 1]]
    #data.iloc
    #data.iloc[["two","three"]]     # iloc accepts integer positions, not labels
    #data.iloc[[1,3],]
    #data.iloc[[1, 2], [3, 0, 1]]
    #data.loc[:'Utah', 'two']
    #data.loc[:'Utah', ['two','one']]
    #data.iloc[:,:3][data.three> 5]
    #data.at["Colorado","two"]
    #data.iat[3,2]
    #data._get_value("New York","three")
    #data._set_value("Utah","four",100)
    # data
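Several of the selection calls listed above can be tried on the full 4x4 frame used earlier in the lecture. This is a minimal sketch, not the lecture's exact code: the variable names are introduced here for illustration, and the frame is rebuilt so that all four rows are available.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# loc selects by label; a label slice includes its end point ('Utah').
by_label = data.loc[:'Utah', 'two']
print(by_label.tolist())        # [1, 5, 9]

# iloc selects by integer position: row 2 ('Utah'), columns 3, 0, 1.
by_position = data.iloc[2, [3, 0, 1]]
print(by_position.tolist())     # [11, 8, 9]

# A boolean mask keeps only the rows where the condition holds.
filtered = data.iloc[:, :3][data.three > 5]
print(list(filtered.index))     # ['Colorado', 'Utah', 'New York']

# at and iat access a single cell by label and by position respectively.
print(data.at['Colorado', 'two'])   # 5
print(data.iat[3, 2])               # 14
```

For single-cell access, at and iat are the public accessors; _get_value and _set_value are private methods and may change between pandas versions.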