Python Data Science 2025 Past Paper PDF

Document Details

University of Petra

2025

Dr. Hossam M. Mustafa

Tags

pandas python data science data analysis

Summary

This is a past paper from the University of Petra for a Programming for Data Science course, 2024-2025, covering structured data manipulation using Pandas. The document details the pandas library in Python, including its data structures (Series and DataFrame) and various functions.

Full Transcript

University of Petra
Faculty of Information Technology
Department of Data Science and Artificial Intelligence

Programming for Data Science
Part3-A: Structured or tabular data manipulation using Pandas
606315 (2024-2025 First Semester)
Dr. Hossam M. Mustafa

Contents

1. Introduction
2. Data Structures
3. Series
4. Series Basic Functions
5. DataFrame
6. DataFrame Basic Functions
7. Descriptive Statistics
8. Iteration
9. Sorting
10. Indexing and Selecting Data
11. Statistical Functions
12. Working with Text Data

Introduction

Pandas is an open-source Python library that offers high-performance, easy-to-use data structures and data analysis tools. The name Pandas is derived from the term "panel data". Wes McKinney started developing pandas in 2008.

Pandas supports the typical steps in processing and analyzing data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze.

Features of Pandas:
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High-performance merging and joining of data.

Data Structures

Pandas deals with the following three data structures:
- Series
- DataFrame
- Panel

These data structures are built on top of the NumPy array, which means they are fast.

Dimension and Description

A higher-dimensional data structure is a container of its lower-dimensional data structure.

    Data Structure   Dimensions   Description
    Series           1            1D labeled homogeneous array, size-immutable.
    DataFrame        2            General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
    Panel            3            General 3D labeled, size-mutable array.

Data Structures: Series

Series is a one-dimensional array-like structure with homogeneous data.
For example, the following series is a collection of integers 11, 33, 66, …:

    11  33  66  27  52  51  83  92  24  62

Properties:
- Homogeneous data
- Size immutable
- Values of data mutable

Mutability

All Pandas data structures are value-mutable (their values can be changed). All are size-mutable except Series, which is size-immutable.

Data Structures: DataFrame

DataFrame is a two-dimensional array with heterogeneous data.

    Name    Age   Gender   GPA
    Rama    22    Female   3.45
    Fadi    21    Male     3.95
    Tamer   20    Male     3.9
    Sami    21    Male     2.78

Data type of columns:

    Column   Type
    Name     String
    Age      Integer
    Gender   String
    GPA      Float

The data is represented in rows and columns. Each column represents an attribute, and each row represents a student.

Properties:
- Heterogeneous data
- Size mutable
- Data mutable

Data Structures: Panel

Panel is a three-dimensional data structure with heterogeneous data. It is hard to represent a panel graphically, but it can be illustrated as a container of DataFrames. The Panel is not covered in this course. (Panel was removed from pandas in version 0.25, so it is not available in current releases.)

Properties:
- Heterogeneous data
- Size mutable
- Data mutable

Series

Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index.

Create a Series

A Series can be created using the following constructor:

    pandas.Series(data, index, dtype, copy)

A Series can be created from various inputs:
- Array
- Dict
- Scalar value or constant

Parameters and descriptions:

1. data: Takes various forms such as ndarray, list, or constants.
2. index: Index values must be hashable and have the same length as data (non-unique values are allowed). Defaults to np.arange(n) if no index is passed.
3. dtype: The data type. If None, the data type is inferred.
4. copy: Copy the data. Default False.

Import Pandas

Pandas is usually imported under the pd alias:

    import pandas as pd
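The constructor parameters described above (data, index, dtype) can be sketched as follows. The sample values and the labels "x", "y", "z" are hypothetical and chosen only for illustration; this is a minimal sketch, not a complete tour of the constructor.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, used only for illustration.
values = [11, 33, 66]

# No index passed: pandas supplies the default integer index 0..n-1.
default_index = pd.Series(values)

# Explicit index labels (same length as the data) and an explicit dtype.
labeled = pd.Series(values, index=["x", "y", "z"], dtype="float64")

print(default_index.index.tolist())
print(labeled.dtype)
print(labeled["y"])
```

When dtype is omitted, pandas infers it from the data, which is why default_index holds integers while labeled holds floats.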
    import numpy as np
    import pandas as pd

    data = np.array(['a', 'b', 'c', 'd'])
    s = pd.Series(data, index=[10, 20, 30, 40])
    print(s)

Creating an Empty Series:

    import pandas as pd

    ser = pd.Series()
    print(ser)

Creating a Series from ndarray:

The index passed must be of the same length as the data. If no index is given, the default index is range(n), where n is the array length.

    import pandas as pd
    import numpy as np

    data = np.array(['a', 'b', 'c', 'd'])
    ser = pd.Series(data)
    print(ser)

Creating a Series from dictionary:

The dictionary keys are used to construct the index. Current pandas versions keep the keys in insertion order rather than sorting them.

    import pandas as pd
    import numpy as np

    data = {'A': 10.0, 'B': 20.0, 'C': 30.0}
    ser = pd.Series(data)
    print(ser)

    # With an explicit index, values are matched by key; 'D' has no key, so it becomes NaN
    ser2 = pd.Series(data, index=['B', 'C', 'A', 'D'])
    print(ser2)

Creating a Series from Scalar:

    import pandas as pd
    import numpy as np

    # The scalar is repeated to match the length of the index
    ser = pd.Series(10, index=[1, 2, 3, 4, 5])
    print(ser)

Series

Accessing Data from Series:

Data in the Series can be accessed in the same way as data in a NumPy ndarray. Counting starts from zero, which means the first element is stored at position zero, and so on. To retrieve several elements, pass a slice instead of a single position: [start:end].

    import pandas as pd
    import numpy as np

    data = np.array(['a', 'b', 'c', 'd'])
    s = pd.Series(data)
    print(s)

    # Retrieve the first two elements
    print(s[0:2])

    # Retrieve the last two elements
    print(s[-2:])

Accessing Data using Label (Index):

A Series is like a fixed-size dictionary in that you can get and set values by index label. Multiple elements can be retrieved using a list of index label values.

    # Accessing data using a label
    data2 = np.array([10, 20, 30, 40])
    s2 = pd.Series(data2, index=['a', 'b', 'c', 'd'])
    print(s2['b'])

    # Multiple elements
    print(s2[['b', 'd']])

Series Basic Functions

Attribute or method, description, and example:

1. axes: Returns a list of the row axis labels.
       print(ser.axes)
2. dtype: Returns the dtype of the object.
       print(ser.dtype)
3. empty: Returns True if the Series is empty.
       print(ser.empty)
4. ndim: Returns the number of dimensions of the underlying data, by definition 1.
       print(ser.ndim)
5. size: Returns the number of elements in the underlying data.
       print(ser.size)
6. values: Returns the Series as an ndarray.
       print(ser.values)
7. head(): Returns the first n rows.
       print(ser.head(2))
8. tail(): Returns the last n rows.
       print(ser.tail(2))

DataFrame

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

Create a DataFrame

A DataFrame can be created using the following constructor:

    pandas.DataFrame(data, index, columns, dtype, copy)

A pandas DataFrame can be created from various inputs:
- Lists
- dict
- Series
- NumPy ndarrays
- Another DataFrame

Parameters and descriptions:

1. data: Takes various forms such as ndarray, Series, map, lists, dict, constants, and also another DataFrame.
2. index: The row labels to use for the resulting frame. Optional; defaults to np.arange(n) if no index is given.
3. columns: The column labels. Optional; defaults to np.arange(n) if no column labels are given.
4. dtype: Data type of each column.
5. copy: Whether to copy the input data. Default False.

DataFrame

Import Pandas:

Pandas is usually imported under the pd alias.

    import pandas as pd

Creating an Empty DataFrame:

    import pandas as pd

    df = pd.DataFrame()
    print(df)

Creating a DataFrame from Lists:

The DataFrame can be created using a single list or a list of lists.

    import pandas as pd

    data = [1, 2, 3, 4, 5]
    df = pd.DataFrame(data)
    print(df)

    data2 = [['Rama', 10], ['Ali', 12], ['Hiba', 13]]
    df2 = pd.DataFrame(data2, columns=['Name', 'Age'])
    print(df2)

Creating a DataFrame from a Dictionary of ndarrays or Lists:

    import pandas as pd

    data = {'Name': ['Raghad', 'Mohd', 'Mustafa', 'Nirmeen'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data)
    print(df)

Indexed DataFrame using arrays:

    import pandas as pd
    import numpy as np

    data = np.array([["Raghad", 20], ["Rahaf", 21], ["Shahad", 20], ["Ashod", 20], ["Fahid", 21]])
    df1 = pd.DataFrame(data, columns=["Name", "Age"],
                       index=["r1", "r2", "r3", "r4", "r5"])
    print(df1)

Creating a DataFrame from Lists of Dictionaries:

    import pandas as pd

    data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    df = pd.DataFrame(data, index=['first', 'second'])
    print(df)

DataFrame

Creating a DataFrame from Lists of Dictionaries:

The following example shows how to create a DataFrame with a list of dictionaries, row indices, and column indices.

    import pandas as pd

    data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

    # Two column labels that both match dictionary keys
    df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

    # Two column labels, one of which ('b1') matches no dictionary key, so it is filled with NaN
    df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])

    print(df1)
    print(df2)

Create a DataFrame from a Dictionary of Series:

A dictionary of Series can be passed to form a DataFrame. The resultant index is the union of all the Series indexes passed.

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df)

DataFrame

Column Selection:

To select a column from the DataFrame, use the ['column_name'] syntax.

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)

    # Column selection
    print(df['one'])

Column Addition:

To add a new column to an existing DataFrame, use ['column_name'] to define the new column, then fill it either with a new pd.Series or with values computed from existing columns.

    # Adding a new column to an existing DataFrame
    df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
    print(df)

    # Adding a column computed from existing DataFrame columns
    df['four'] = df['one'] + df['three']
    print(df)

Column Deletion:

Columns can be deleted with the del statement or removed with the pop function.

    # Using del
    del df['one']
    print(df)

    # Using pop
    df.pop('two')
    print(df)

DataFrame

Row Selection:

Rows can be selected by passing a row label to the loc indexer.

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3], index=['a', 'b', 'c'])}
    df = pd.DataFrame(d)

    # Selection by label
    print(df.loc['a'])

Selection by integer location:

Rows can be selected by passing an integer location to the iloc indexer.

    # Selection by integer location (position 0 is chosen here for illustration)
    print(df.iloc[0])

Slice Rows:

Multiple rows can be selected using the ':' operator.

    # Slice rows
    print(df[0:2])

Addition of Rows:

Add new rows to a DataFrame by concatenating it with another DataFrame using pd.concat; the new rows are appended at the end. (The older DataFrame.append method was removed in pandas 2.0.)

    # Addition of rows
    df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
    df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

    df = pd.concat([df, df2], ignore_index=True)
    # or, instead of ignore_index=True: df.reset_index(drop=True, inplace=True)
    print(df)

Deletion of Rows:

Use the index label to delete or drop rows from a DataFrame. If the label is duplicated, then multiple rows will be dropped.

    # Drop the row with label 0
    df = df.drop(0)

    # Drop the rows labelled 0 and 1 from df2, in place
    df2.drop([0, 1], inplace=True)
    print(df)

DataFrame Basic Functions

Attribute or method, description, and example:

1. T: Transposes rows and columns.
       print(df.T)
2. axes: Returns a list with the row axis labels and column axis labels as the only members.
       print(df.axes)
3. dtypes: Returns the dtypes in this object.
       print(df.dtypes)
4. empty: True if the NDFrame is entirely empty (no items), i.e., if any of the axes are of length 0.
       print(df.empty)
5. ndim: Number of axes / array dimensions.
       print(df.ndim)
6. shape: Returns a tuple representing the dimensionality of the DataFrame.
       print(df.shape)
7. size: Number of elements in the NDFrame.
       print(df.size)
8. values: NumPy representation of the NDFrame.
       print(df.values)
9. head(): Returns the first n rows.
       print(df.head(2))
10. tail(): Returns the last n rows.
       print(df.tail(2))

Descriptive Statistics

A large number of methods collectively compute descriptive statistics and other related operations on a DataFrame, and most of these are aggregations.

    Function     Description
    1  count()    Number of non-null observations
    2  sum()      Sum of values
    3  mean()     Mean of values
    4  median()   Median of values
    5  mode()     Mode of values
    6  std()      Standard deviation of the values
    7  min()      Minimum value
    8  max()      Maximum value
    9  abs()      Absolute value
    10 prod()     Product of values
    11 cumsum()   Cumulative sum
    12 cumprod()  Cumulative product

    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Ashod', 'Shaheen', 'Rahaf', 'Bashar', 'Raghad', 'Basma', 'Fahid',
                            'Mohammad', 'Waleed', 'Abdullah', 'Shahid', 'Hossam']),
         'Age': pd.Series([21, 17, 22, 23, 20, 19, 23, 22, 20, 20, 21, 22]),
         'GPA': pd.Series([3.23, 3.24, 3.98, 2.56, 3.20, 3.6, 3.8, 3.78, 2.98, 3.80, 3.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)

    # numeric_only=True skips the non-numeric Name column
    print(df.sum(numeric_only=True))
    print(df.sum(axis=1, numeric_only=True))
    print(df.mean(numeric_only=True))
    print(df.std(numeric_only=True))
    print(df.median(numeric_only=True))
    print(df.min())
    print(df.max())
    print(df[['Age', 'GPA']].cumsum())
    print(df.Age.value_counts())

Descriptive Statistics

Summarizing Data

The describe() function computes a summary of statistics pertaining to the DataFrame columns. It reports values such as the mean, the standard deviation, and the quartiles. By default, the function excludes the character columns and summarizes only the numeric columns.

The 'include' argument specifies which columns should be considered for summarizing.

    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Ashod', 'Shaheen', 'Rahaf', 'Bashar', 'Raghad', 'Basma', 'Fahid',
                            'Mohammad', 'Waleed', 'Abdullah', 'Shahid', 'Hossam']),
         'Age': pd.Series([21, 17, 22, 23, 20, 19, 23, 22, 20, 20, 21, 22]),
         'GPA': pd.Series([3.23, 3.24, 3.98, 2.56, 3.20, 3.6, 3.8, 3.78, 2.98, 3.80, 3.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)
    print(df.describe(include=['object']))

Iteration

When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. A DataFrame follows the dictionary-like convention of iterating over the keys of the object.

Basic iteration produces:
- Series: values
- DataFrame: column labels

Iterating a DataFrame

To iterate over the rows of the DataFrame, we can use the following functions:
- iterrows(): iterate over the rows as (index, Series) pairs
- itertuples(): iterate over the rows as namedtuples

    import pandas as pd
    import numpy as np

    N = 20
    df = pd.DataFrame({
        'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
        'x': np.arange(0, N),
        'y': np.random.rand(N),
        'C': np.random.choice(['Low', 'Medium', 'High'], N),
        'D': np.random.randint(10, 100, size=(N))
    })
    print(df)

    # Basic iteration yields the column labels
    for col in df:
        print(col)

    # Using iterrows
    for row_index, row in df.iterrows():
        print(row_index, row)

    # Using itertuples
    for row in df.itertuples():
        print(row)

Sorting

There are two kinds of sorting available in Pandas: by label or by actual value.

Sort by Label

Use the sort_index() method, passing the axis argument and the order of sorting. By default, sorting is done on row labels in ascending order. The order can be changed by passing a Boolean value to the ascending parameter.

    import pandas as pd
    import numpy as np

    N = 20
    df = pd.DataFrame({
        'A': pd.date_range(start='2016-01-01', periods=N, freq='D'),
        'x': np.arange(0, N),
        'y': np.random.rand(N),
        'C': np.random.choice(['Low', 'Medium', 'High'], N),
        'D': np.random.randint(10, 100, size=(N))
    })
    # Sort by index
    df_sorted_index = df.sort_index()
    print(df_sorted_index)

Passing the axis argument with a value of 1 sorts by column labels instead of row labels (axis=0, the default).

    # Sort the columns by label, in descending order
    df_sorted_index = df.sort_index(axis=1, ascending=False)
    print(df_sorted_index)

Sort by Value

Like index sorting, sort_values() is the method for sorting by values. It accepts a 'by' argument, which takes the name of the column (or a list of column names) to sort by.

    # Sort by the values of one column
    df_sorted_value = df.sort_values(by=['C'])
    print(df_sorted_value)

    # Sort by the values of two columns: first 'C', then 'y'
    df_sorted_value = df.sort_values(by=['C', 'y'])
    print(df_sorted_value)

Sorting Algorithm

sort_values() lets you choose the algorithm (kind argument) from mergesort, heapsort, and quicksort. Note that mergesort is the only stable algorithm of the three.

    # Sorting values using a specified sorting algorithm
    df_sorted_value = df.sort_values(by=['C'], kind='mergesort')
    print(df_sorted_value)

Indexing and Selecting Data

Pandas supports the following types of multi-axes indexing:

1. loc: Label based
2. iloc: Integer based

loc has multiple access methods:
- A single scalar label
- A list of labels
- A slice object
- A Boolean array

loc takes two single/list/range operators separated by ','. The first one indicates the rows and the second one indicates the columns.

    import pandas as pd
    import numpy as np

    # Create a dictionary of Series
    d = {'Name': pd.Series(['Ashod', 'Shaheen', 'Rahaf', 'Bashar', 'Raghad', 'Basma',
                            'Fahid', 'Mohammad', 'Waleed', 'Abdullah', 'Shahid', 'Hossam']),
         'Age': pd.Series([21, 17, 22, 23, 20, 19, 23, 22, 20, 20, 21, 22]),
         'GPA': pd.Series([3.23, 3.24, 3.98, 2.56, 3.20, 3.6, 3.8, 3.78, 2.98, 3.80, 3.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)

    # Select all rows for one column
    print(df.loc[:, ['Name']])

    # Select the first three rows for multiple columns (label slices include the end label)
    print(df.loc[0:2, ['Name', 'Age']])
    # Getting values with a Boolean array: rows where Age is greater than 20
    print(df.loc[df['Age'] > 20])

Indexing and Selecting Data

iloc is used for purely integer-based indexing (0-based indexing).

The various access methods are as follows:
- An integer
- A list of integers
- A range of values

    import pandas as pd
    import numpy as np

    d = {'Name': pd.Series(['Ashod', 'Shaheen', 'Rahaf', 'Bashar', 'Raghad', 'Basma', 'Fahid',
                            'Mohammad', 'Waleed', 'Abdullah', 'Shahid', 'Hossam']),
         'Age': pd.Series([21, 17, 22, 23, 20, 19, 23, 22, 20, 20, 21, 22]),
         'GPA': pd.Series([3.23, 3.24, 3.98, 2.56, 3.20, 3.6, 3.8, 3.78, 2.98, 3.80, 3.10, 3.65])}

    # Create a DataFrame
    df = pd.DataFrame(d)

    # Select the first four rows (all columns)
    print(df.iloc[:4])

    # Integer slicing: rows 1 to 4, columns from position 2 onward
    # (the frame has only three columns, so only GPA is returned)
    print(df.iloc[1:5, 2:4])

Statistical Functions

Statistical methods help in understanding and analyzing the behavior of data.

Covariance

Covariance is applied on Series data. The Series object has a method cov to compute the covariance between two Series objects. Calling cov on a DataFrame computes the pairwise covariance between its columns.

Correlation

Correlation shows the linear relationship between any two arrays of values (Series). There are multiple methods to compute the correlation: pearson (default), spearman, and kendall.

Data Ranking

Data ranking produces a ranking for each element in the array of elements. In case of ties, it assigns the mean rank by default.

    import pandas as pd
    import numpy as np

    # Create a DataFrame of random values
    df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])

    print(df.cov())
    print(df['a'].cov(df['b']))
    print(df.corr())
    print(df['a'].corr(df['b']))
    print(df.rank())

Working with Text Data

Pandas provides a set of string functions which make it easy to operate on string data. Most importantly, these functions ignore missing/NaN values. They are accessed through the .str accessor of a Series or Index.

- lower(): Converts strings in the Series/Index to lower case.
- upper(): Converts strings in the Series/Index to upper case.
- len(): Computes the string length.
- strip(): Strips whitespace (including newlines) from both sides of each string in the Series/Index.
- split(' '): Splits each string with the given pattern.
- cat(sep=' '): Concatenates the Series/Index elements with the given separator.
- repeat(value): Repeats each element the specified number of times.
- swapcase(): Swaps lower and upper case.
- replace(a, b): Replaces the value a with the value b.
- contains(pattern): Returns True for each element that contains the pattern, else False.
- count(pattern): Returns the number of appearances of the pattern in each element.
- find(pattern): Returns the position of the first occurrence of the pattern.
- islower(): Checks whether all characters in each string are lower case. Returns Boolean.
- isupper(): Checks whether all characters in each string are upper case. Returns Boolean.
- isnumeric(): Checks whether all characters in each string in the Series/Index are numeric. Returns Boolean.

    import pandas as pd
    import numpy as np

    d = {'Name': pd.Series(['Ashod', 'Shaheen', 'Rahaf', 'Bashar', 'Raghad', 'Basma', 'Fahid',
                            'Mohammad', 'Waleed', 'Abdullah', 'Shahid', 'Hossam']),
         'Age': pd.Series([21, 17, 22, 23, 20, 19, 23, 22, 20, 20, 21, 22])}
    df = pd.DataFrame(d)

    print(df['Name'].str.lower())
    print(df['Name'].str.upper())
    print(df['Name'].str.len())
    print(df['Name'].str.strip())
    print(df['Name'].str.split('a'))
    print(df['Name'].str.cat(sep='_'))
    print(df['Name'].str.replace('a', 'A'))
    print(df['Name'].str.repeat(2))
    print(df['Name'].str.count('a'))
    print(df['Name'].str.findall('a'))
    print(df['Name'].str.islower())
    print(df['Name'].str.isnumeric())
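As a small illustration of the statement that the string functions ignore missing/NaN values, the sketch below applies a few .str methods to a Series that contains a NaN. The sample names and the choice of na=False for contains are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second entry is missing.
names = pd.Series(["Ashod", np.nan, "Rahaf"])

# The missing entry is skipped and stays missing in the result.
lowered = names.str.lower()
lengths = names.str.len()

# contains() can be told what to return for missing entries via na=...
has_a = names.str.contains("a", na=False)

print(lowered)
print(lengths)
print(has_a)
```

The missing entry propagates through lower() and len() instead of raising an error, while na=False makes contains() report False for it.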
