Pandas Dataframe Handling PDF
Document Details
Kartik Thakkar
Tags
Summary
This document provides a tutorial on using the Pandas library in Python for data manipulation and analysis, focusing on DataFrame handling. It covers reading CSV and Excel files, and exporting to these formats.
Full Transcript
Dataframe Handling using Panda Reading CSV File using Dataframe pandas is a library specialized for data analysis, so you expect that it is mainly focused on calculation and data processing. Moreover, even the process of writing and reading data from/to external files can be consider...
Dataframe Handling using Panda Reading CSV File using Dataframe pandas is a library specialized for data analysis, so you expect that it is mainly focused on calculation and data processing. Moreover, even the process of writing and reading data from/to external files can be considered a part of the data processing. In fact, you will see how, even at this stage, you can perform some operations in order to prepare the incoming data to further manipulations. Thus, this part is very important for data analysis and therefore a specific tool for this purpose must be present in the library pandas: a set of functions called I/O API. These functions are divided into two main categories, completely symmetrical to each other: readers and writers. Reading CSV File using Pandas To access data from the CSV file, we require a function read_csv() from Pandas that retrieves data in the form of the data frame. Created By: Kartik Thakkar # Import pandas import pandas as pd # reading csv file df = pd.read_csv("people.csv") print(df.head()) Pandas DataFrame to a CSV file by using the Pandas to_csv() method. By default, the to csv() method exports DataFrame to a CSV file with row index as the first column and comma as the delimiter. # importing pandas as pd import pandas as pd # list of name, degree, score nme = ["aparna", "pankaj", "sudhir", "Geeku"] deg = ["MBA", "BCA", "M.Tech", "MBA"] scr = [90, 40, 80, 98] # dictionary of lists Created By: Kartik Thakkar dict = {'name': nme, 'degree': deg, 'score': scr} df = pd.DataFrame(dict) print(df) # saving the dataframe df.to_csv('file1.csv') # saving the dataframe df.to_csv('file2.csv', header=False, index=False) Reading Excel File using Pandas To read Excel files using the Pandas library in Python, you can use the pandas.read_excel() function. Pandas provides a powerful and flexible way to read, manipulate, and analyze tabular data, including Excel files. import pandas as pd # Read the Excel file data = pd.read_excel("data.xlsx") # Display the data print(data) By default, read_excel() reads the first sheet in the Excel file. If your Excel file has multiple sheets and you want to read a specific sheet, you can specify the sheet name or index using the sheet_name parameter: data = pd.read_excel("data.xlsx", sheet_name="Sheet2") Other Parameters 1. sheet_name: Specifies the sheet to read. It can be the sheet name (string) or sheet index (int). If not provided, the default is to read the first sheet. 2. header: Specifies the row index (0-based) to use as the column names. If None, no header will be used. 3. skiprows: Specifies the number of rows to skip at the beginning of the file. 4. usecols: Specifies which columns to read from the file. You can provide column names or indices. 5. nrows: Specifies the number of rows to read from the file. Created By: Kartik Thakkar 6. index_col: Specifies which column to use as the index of the DataFrame. 7. parse_dates: Specifies which columns to parse as dates. 8. dtype: Specifies the data types of columns as a dictionary. 9. na_values: Specifies a list of values to recognize as NaN (missing) values. Write DataFrame to Excel File The to_excel() function in Pandas is used to write a Pandas DataFrame to an Excel file. This function allows you to export your data for further analysis or sharing. Here's how you can use the to_excel() function: import pandas as pd # Create a sample DataFrame data = { "Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 28], "Country": ["USA", "Canada", "UK"] } df = pd.DataFrame(data) # Write the DataFrame to an Excel file df.to_excel("output.xlsx", index=False) The to_excel() function provides various parameters that allow you to customize the export: § sheet_name: Specifies the name of the Excel sheet. Default is "Sheet1". § startrow and startcol: Specifies the starting row and column for writing data. § index: Specifies whether to include the DataFrame index. Default is True. § header: Specifies whether to include the column headers. Default is True. § columns: Specifies which columns to include in the Excel file. § na_rep: Specifies the representation of NaN (missing) values. § float_format: Specifies a format for floating-point numbers. Created By: Kartik Thakkar Central Tendency measures: Pandas provides a set of functions to perform common statistical calculations on data stored in a Pandas DataFrame or Series. Here are the functions for calculating mean, median, mode, and standard deviation: Mean: The mean() function calculates the arithmetic mean (average) of the data. import pandas as pd data = [10, 20, 30, 40, 50] series = pd.Series(data) mean_value = series.mean() print("Mean:", mean_value) Median (median()): The median() function calculates the middle value of the data when it's sorted. median_value = series.median() print("Median:", median_value) Mode (mode()): The mode() function calculates the most frequently occurring value(s) in the data. data = [10, 20, 30, 40, 20, 50, 20] series_with_mode = pd.Series(data) mode_value = series_with_mode.mode() print("Mode:", mode_value) Note that the mode() function can return multiple mode values if there are multiple values with the same highest frequency. Standard Deviation (std()): The std() function calculates the standard deviation, which measures the dispersion of values around the mean. std_deviation = series.std() Created By: Kartik Thakkar print("Standard Deviation:", std_deviation) Keep in mind that these functions work on both Pandas Series and DataFrames. If you have a DataFrame, you can apply these functions to specific columns by selecting those columns from the DataFrame. Pandas makes it easy to perform statistical calculations and analysis on your data, making it a powerful tool for data manipulation and exploration. Pandas DataFrame describe() Pandas describe() is used to view some basic statistical details like percentile, mean, std, etc. of a data frame or a series of numeric values. When this method is applied to a series of strings, it returns a different output which is shown in the examples below. Syntax: DataFrame.describe(percentiles=None, include=None, exclude=None) Parameters: percentile: list like data type of numbers between 0-1 to return the respective percentile include: List of data types to be included while describing dataframe. Default is None exclude: List of data types to be Excluded while describing dataframe. Default is None Example: # importing pandas module import pandas as pd # importing regex module import re # making data frame data = pd.read_csv(myData.csv") # removing null values to avoid errors data.dropna(inplace=True) # percentile list perc = [.20,.40,.60,.80] Created By: Kartik Thakkar # list of dtypes to include include = ['object', 'float', 'int'] # calling describe method desc = data.describe(percentiles=perc, include=include) # display Print(desc) Variance (var()): The var() function calculates the variance of the data. import pandas as pd data = [10, 20, 30, 40, 50] series = pd.Series(data) variance_value = series.var() print("Variance:", variance_value) DataFrame FuncBons (1) Head Pandas head() method is used to return top n (5 by default) rows of a data frame or series. # importing pandas module import pandas as pd # making data frame data = pd.read_csv("data.csv") # calling head() method # storing in new variable dataRows = data.head() # display Print(dataRows) Example 2 Created By: Kartik Thakkar # importing pandas module import pandas as pd # making data frame data = pd.read_csv("data.csv") # calling head() method # storing in new variable MyCol = data[‘Name’].head(10) # display Print(dataRows) (2) Tail Pandas tail() method is used to return bottom n (5 by default) rows of a data frame or series. data_bottom = data.tail() Dealing with Dataframe Selec8ng Column / Columns In Order to select a column in Pandas DataFrame, we can either access the columns by calling them by their columns name. # Import pandas package import pandas as pd # Define a dictionary containing employee data data = { 'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age':[27, 24, 22, 32], 'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'], 'Qualification':['Msc', 'MA', 'MCA', 'Phd'] } # Convert the dictionary into DataFrame df = pd.DataFrame(data) # select two columns print(df[['Name', 'Qualification']]) Created By: Kartik Thakkar loc and iloc func8ons In Pandas, loc and iloc are powerful indexing methods used to select rows and columns from a DataFrame. They provide different ways to access data based on labels (loc) or integer-based positions (iloc). Here's how they work: loc: The loc indexer is used to select data by labels, including index labels and column labels. You can use loc to select specific rows and columns based on their labels. import pandas as pd # Create a sample DataFrame data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28], 'Country': ['USA', 'Canada', 'UK']} df = pd.DataFrame(data, index=['row1', 'row2', 'row3']) # Select a single row by label row_bob = df.loc['row2'] # Select multiple rows by labels selected_rows = df.loc[['row1', 'row3']] # Select specific columns for a row by label age_charlie = df.loc['row3', 'Age'] # Select rows based on conditions young_people = df.loc[df['Age'] < 30] print(row_bob) print(selected_rows) print(age_charlie) print(young_people) Created By: Kartik Thakkar iloc: The iloc indexer is used to select data by integer-based positions. It allows you to access data based on row and column indices. import pandas as pd # Create a sample DataFrame data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28], 'Country': ['USA', 'Canada', 'UK']} df = pd.DataFrame(data) # Select a single row by position row_bob = df.iloc # Select multiple rows by positions selected_rows = df.iloc[[0, 2]] # Select specific columns for a row by position age_charlie = df.iloc[2, 1] # Select rows and columns by position ranges subset = df.iloc[0:2, 0:2] print(row_bob) print(selected_rows) print(age_charlie) print(subset) Both loc and iloc provide a flexible and efficient way to select data from a DataFrame. The key difference is that loc uses labels for indexing, while iloc uses integer-based positions. to_numpy() The to_numpy() function in Pandas is used to convert a DataFrame or Series into a NumPy array. This function provides a convenient way to extract the underlying data from a Pandas object and work with it using NumPy's array functionality. Here's how you can use the to_numpy() function: Created By: Kartik Thakkar import pandas as pd data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28], 'Country': ['USA', 'Canada', 'UK']} df = pd.DataFrame(data) numpy_array = df.to_numpy() print(numpy_array) Keep in mind a few important points: § The to_numpy() function returns a new NumPy array. Any modifications you make to the NumPy array won't affect the original Pandas object. § The resulting NumPy array will have the same data type as the data within the Pandas object, unless the data types within the column are mixed. In this case, the array will have an "object" dtype. § Be cautious when converting large DataFrames to NumPy arrays, as it can consume a significant amount of memory. § For Series, the to_numpy() function returns a one-dimensional array, while for DataFrames, it returns a two-dimensional array. § If the DataFrame contains different data types, the NumPy array will promote the data to a common data type to accommodate all values. Created By: Kartik Thakkar