DS PANDAS.pdf
PANDAS

Introduction to Pandas

Pandas is a powerful open-source library in Python for data manipulation and analysis. It provides data structures and functions to efficiently handle structured data, including tabular data such as spreadsheets and SQL tables.

Key Features of Pandas:
1. Data Structures: Pandas provides two primary data structures:
   - Series: A one-dimensional labeled array of values.
   - DataFrame: A two-dimensional labeled data structure with columns of potentially different types.
2. Data Manipulation: Pandas offers various methods for filtering, sorting, grouping, merging, and reshaping data.
3. Data Analysis: Pandas integrates well with other popular data analysis libraries in Python, such as NumPy, Matplotlib, and Scikit-learn.
4. Data Input/Output: Pandas supports reading and writing data from various file formats, including CSV, Excel, JSON, and SQL databases.

Basic Pandas Data Structures:

Series
A Series is a one-dimensional labeled array of values.

    import pandas as pd
    # Create a Series
    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
    print(s)

Output:
    a    1
    b    2
    c    3
    d    4
    e    5
    dtype: int64

DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

    import pandas as pd
    # Create a DataFrame
    data = {'Name': ['John', 'Mary', 'David'], 'Age': [25, 31, 42]}
    df = pd.DataFrame(data)
    print(df)

Output:
        Name  Age
    0   John   25
    1   Mary   31
    2  David   42

Common Pandas Operations:
1. Filtering: df[df['Age'] > 30]
2. Sorting: df.sort_values(by='Age')
3. Grouping: df.groupby('Name')
4. Merging: pd.merge(df1, df2, on='Name')

Real-World Applications of Pandas:
1. Data Science
2. Business Intelligence
3. Web Scraping
4. Data Visualization
5. Machine Learning

Creating a Pandas DataFrame from an Excel File

You can create a Pandas DataFrame from an Excel file using the read_excel function from the Pandas library.
Example:

    import pandas as pd
    # Load the Excel file
    df = pd.read_excel('example.xlsx')
    # Print the DataFrame
    print(df)

Parameters:
- io: The path to the Excel file (the first positional argument).
- sheet_name: The name or index of the sheet to read (default is 0, the first sheet).
- header: The row to use as the column names (default is 0).
- na_values: Additional values to recognize as missing/NaN.
- parse_dates: The columns to parse as dates.

Example with Parameters:

    import pandas as pd
    # Load the Excel file
    df = pd.read_excel('example.xlsx', sheet_name='Sheet1', header=0, na_values=['NA'], parse_dates=['Date'])
    # Print the DataFrame
    print(df)

Supported Excel File Formats:
- .xls
- .xlsx
- .xlsm
- .xlsb
- .odf
- .ods

Error Handling:
- If the file is not found, a FileNotFoundError is raised.
- If the sheet is not found, a ValueError is raised.

Real-World Example:
Suppose you have an Excel file sales.xlsx with the following data in a sheet named Sales:

| Date       | Product | Sales |
|------------|---------|-------|
| 2022-01-01 | A       | 100   |
| 2022-01-02 | B       | 200   |
| 2022-01-03 | A       | 150   |

You can create a DataFrame from this file using:

    import pandas as pd
    df = pd.read_excel('sales.xlsx', sheet_name='Sales', header=0, parse_dates=['Date'])
    print(df)

Output:
            Date Product  Sales
    0 2022-01-01       A    100
    1 2022-01-02       B    200
    2 2022-01-03       A    150

Creating a Pandas DataFrame from a .csv File

You can create a Pandas DataFrame from a .csv file using the read_csv function from the Pandas library.

Example:

    import pandas as pd
    # Load the csv file
    df = pd.read_csv('data.csv')
    # Print the DataFrame
    print(df)

Parameters:
- filepath_or_buffer: The path to the csv file (the first positional argument).
- sep: The separator used in the file (default is ,).
- header: The row to use as the column names (default is 0).
- na_values: Additional values to recognize as missing/NaN.
- parse_dates: The columns to parse as dates.
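The parameters listed above can be tried without a file on disk. The following is a minimal sketch, assuming a small in-memory CSV passed to read_csv through io.StringIO. The csv_text content and the 'missing' marker are made up for illustration and do not come from any real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
# It uses ';' as the separator and 'missing' as a custom NaN marker.
csv_text = (
    "Date;Product;Sales\n"
    "2022-01-01;A;100\n"
    "2022-01-02;B;missing\n"
    "2022-01-03;A;150\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),   # a file-like object is accepted in place of a path
    sep=";",                 # separator used in the text above
    header=0,                # first row holds the column names
    na_values=["missing"],   # treat this marker as NaN
    parse_dates=["Date"],    # convert the Date column to datetime
)

print(df)
print(df.dtypes)
print("Missing Sales values:", df["Sales"].isna().sum())
```

Because one Sales value is missing, pandas stores that column as floating point rather than integer, which is worth checking with df.dtypes after loading real data.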
Example with Parameters:

    import pandas as pd
    # Load the csv file
    df = pd.read_csv('data.csv', sep=';', header=0, na_values=['NA'], parse_dates=['Date'])
    # Print the DataFrame
    print(df)

Error Handling:
- If the file is not found, a FileNotFoundError is raised.
- If the file is empty, an EmptyDataError (pandas.errors.EmptyDataError) is raised.

Real-World Example:
Suppose you have a csv file sales.csv with the following data:

    Date,Sales,Product
    2022-01-01,100,A
    2022-01-02,200,B
    2022-01-03,150,A

You can create a DataFrame from this file using:

    import pandas as pd
    df = pd.read_csv('sales.csv', sep=',', header=0, parse_dates=['Date'])
    print(df)

Output:
            Date  Sales Product
    0 2022-01-01    100       A
    1 2022-01-02    200       B
    2 2022-01-03    150       A

Python dictionary

A dictionary in Python is a collection of key-value pairs. Since Python 3.7 it preserves insertion order. It is mutable, meaning it can be modified after creation.

Creating a Dictionary

    # Using curly brackets {}
    my_dict = {"name": "John", "age": 30}
    # Using dict() function
    my_dict = dict(name="John", age=30)
    # Using dictionary comprehension
    my_dict = {key: value for key, value in [("name", "John"), ("age", 30)]}

Dictionary Operations

Accessing Elements

    my_dict = {"name": "John", "age": 30}
    print(my_dict["name"])  # Output: John

Updating Elements

    my_dict = {"name": "John", "age": 30}
    my_dict["age"] = 31
    print(my_dict)  # Output: {'name': 'John', 'age': 31}

Adding Elements

    my_dict = {"name": "John", "age": 30}
    my_dict["city"] = "New York"
    print(my_dict)  # Output: {'name': 'John', 'age': 30, 'city': 'New York'}

Removing Elements

    my_dict = {"name": "John", "age": 30}
    del my_dict["age"]
    print(my_dict)  # Output: {'name': 'John'}

Checking Existence

    my_dict = {"name": "John", "age": 30}
    print("name" in my_dict)  # Output: True

Dictionary Methods

keys()

    my_dict = {"name": "John", "age": 30}
    print(my_dict.keys())  # Output: dict_keys(['name', 'age'])

values()

    my_dict = {"name": "John", "age": 30}
    print(my_dict.values())  # Output: dict_values(['John', 30])

items()

    my_dict = {"name": "John", "age": 30}
    print(my_dict.items())  # Output: dict_items([('name', 'John'), ('age', 30)])

get()

    my_dict = {"name": "John", "age": 30}
    print(my_dict.get("name"))  # Output: John

update()

    my_dict = {"name": "John", "age": 30}
    my_dict.update({"city": "New York"})
    print(my_dict)  # Output: {'name': 'John', 'age': 30, 'city': 'New York'}

pop()

    my_dict = {"name": "John", "age": 30}
    print(my_dict.pop("age"))  # Output: 30

Dictionary Iteration

    my_dict = {"name": "John", "age": 30}
    for key, value in my_dict.items():
        print(f"{key}: {value}")

Output:
    name: John
    age: 30

Python list

A list in Python is a collection of items that can be of any data type, including strings, integers, floats, and other lists. Lists are denoted by square brackets [] and are mutable, meaning they can be modified after creation.

Creating a List

    # Empty list
    my_list = []
    # List with values
    my_list = [1, 2, 3, 4, 5]
    # List with mixed data types
    my_list = ['apple', 2, 3.5, True]

List Methods

1. Append
Adds an element to the end of the list.

    my_list = [1, 2, 3]
    my_list.append(4)
    print(my_list)  # Output: [1, 2, 3, 4]

2. Extend
Adds multiple elements to the end of the list.

    my_list = [1, 2, 3]
    my_list.extend([4, 5, 6])
    print(my_list)  # Output: [1, 2, 3, 4, 5, 6]

3. Insert
Inserts an element at a specified position.

    my_list = [1, 2, 3]
    my_list.insert(1, 4)
    print(my_list)  # Output: [1, 4, 2, 3]

4. Remove
Removes the first occurrence of an element.

    my_list = [1, 2, 3, 2, 4]
    my_list.remove(2)
    print(my_list)  # Output: [1, 3, 2, 4]

5. Pop
Removes and returns an element at a specified position.

    my_list = [1, 2, 3]
    popped_element = my_list.pop(1)
    print(my_list)  # Output: [1, 3]
    print(popped_element)  # Output: 2

6. Index
Returns the index of the first occurrence of an element.

    my_list = [1, 2, 3]
    index = my_list.index(2)
    print(index)  # Output: 1

7. Count
Returns the number of occurrences of an element.

    my_list = [1, 2, 2, 3]
    count = my_list.count(2)
    print(count)  # Output: 2

8. Sort
Sorts the list in-place.

    my_list = [3, 2, 1]
    my_list.sort()
    print(my_list)  # Output: [1, 2, 3]

9. Reverse
Reverses the list in-place.
    my_list = [1, 2, 3]
    my_list.reverse()
    print(my_list)  # Output: [3, 2, 1]

List Slicing

    my_list = [1, 2, 3, 4, 5]
    # Get the first three elements
    print(my_list[:3])  # Output: [1, 2, 3]
    # Get the last two elements
    print(my_list[-2:])  # Output: [4, 5]
    # Get the middle element
    print(my_list[2])  # Output: 3

List Comprehensions

    # Create a list of squares
    squares = [x**2 for x in range(5)]
    print(squares)  # Output: [0, 1, 4, 9, 16]

Python tuples

Tuples are immutable, ordered collections of values in Python.

Creating Tuples

    # Using parentheses ()
    my_tuple = ("apple", "banana", "cherry")
    # Using tuple() function
    my_tuple = tuple(["apple", "banana", "cherry"])
    # Using tuple() with a generator expression (Python has no tuple comprehension syntax)
    my_tuple = tuple(fruit for fruit in ["apple", "banana", "cherry"])

Tuple Operations

Indexing

    my_tuple = ("apple", "banana", "cherry")
    print(my_tuple[0])  # Output: apple

Slicing

    my_tuple = ("apple", "banana", "cherry")
    print(my_tuple[1:2])  # Output: ('banana',)

Concatenation

    my_tuple1 = ("apple", "banana")
    my_tuple2 = ("cherry", "date")
    print(my_tuple1 + my_tuple2)

Output:
    ('apple', 'banana', 'cherry', 'date')

Repetition

    my_tuple = ("apple",)
    print(my_tuple * 3)  # Output: ('apple', 'apple', 'apple')

Checking Existence

    my_tuple = ("apple", "banana", "cherry")
    print("banana" in my_tuple)  # Output: True

Tuple Methods

index()

    my_tuple = ("apple", "banana", "cherry")
    print(my_tuple.index("banana"))  # Output: 1

count()

    my_tuple = ("apple", "banana", "banana")
    print(my_tuple.count("banana"))  # Output: 2

Tuple Iteration

    my_tuple = ("apple", "banana", "cherry")
    for fruit in my_tuple:
        print(fruit)

Output:
    apple
    banana
    cherry

Advantages of Tuples:
1. Immutable, ensuring data integrity.
2. Typically slightly cheaper to create and more memory-efficient than lists.
3. Can be used as dictionary keys, provided all their elements are hashable.

Disadvantages of Tuples:
1. Immutable, limiting flexibility.
2. No in-place insertions or deletions; any change requires building a new tuple.

Real-World Applications:
1. Data storage and retrieval.
2. Function arguments and return values.
3. Database query results.

Operations on dataframes

1. Filtering

    import pandas as pd
    # Create a DataFrame
    data = {'Name': ['John', 'Mary', 'David'], 'Age': [25, 31, 42]}
    df = pd.DataFrame(data)
    # Filter rows where Age > 30
    filtered_df = df[df['Age'] > 30]
    print(filtered_df)

2. Sorting

    import pandas as pd
    # Create a DataFrame
    data = {'Name': ['John', 'Mary', 'David'], 'Age': [25, 31, 42]}
    df = pd.DataFrame(data)
    # Sort by Age in ascending order
    sorted_df = df.sort_values(by='Age')
    print(sorted_df)

3. Grouping

    import pandas as pd
    # Create a DataFrame
    data = {'Name': ['John', 'Mary', 'David', 'John', 'Mary'], 'Age': [25, 31, 42, 25, 31], 'Score': [90, 85, 95, 90, 85]}
    df = pd.DataFrame(data)
    # Group by Name and calculate mean Score
    grouped_df = df.groupby('Name')['Score'].mean()
    print(grouped_df)

4. Merging

    import pandas as pd
    # Create two DataFrames
    data1 = {'Name': ['John', 'Mary', 'David'], 'Age': [25, 31, 42]}
    df1 = pd.DataFrame(data1)
    data2 = {'Name': ['John', 'Mary', 'David'], 'Score': [90, 85, 95]}
    df2 = pd.DataFrame(data2)
    # Merge df1 and df2 on Name
    merged_df = pd.merge(df1, df2, on='Name')
    print(merged_df)

5. Pivoting

    import pandas as pd
    # Create a DataFrame
    data = {'Name': ['John', 'Mary', 'David', 'John', 'Mary'], 'Year': [2020, 2020, 2020, 2021, 2021], 'Score': [90, 85, 95, 92, 88]}
    df = pd.DataFrame(data)
    # Pivot by Name and Year (missing Name/Year combinations become NaN)
    pivoted_df = df.pivot_table(values='Score', index='Name', columns='Year')
    print(pivoted_df)

6. Melting

    import pandas as pd
    # Create a DataFrame
    data = {'Name': ['John', 'Mary', 'David'], 'Math': [90, 85, 95], 'Science': [92, 88, 96]}
    df = pd.DataFrame(data)
    # Melt df from wide to long format
    melted_df = pd.melt(df, id_vars='Name', var_name='Subject', value_name='Score')
    print(melted_df)

7. Handling Missing Values

    import pandas as pd
    import numpy as np
    # Create a DataFrame
    data = {'Name': ['John', 'Mary', np.nan], 'Age': [25, 31, 42]}
    df = pd.DataFrame(data)
    # Drop rows with missing values (dropna returns a new DataFrame; df itself is unchanged)
    cleaned_df = df.dropna()
    print(cleaned_df)

8. Dataframe Concatenation

    import pandas as pd
    # Create two DataFrames
    data1 = {'Name': ['John', 'Mary'], 'Age': [25, 31]}
    df1 = pd.DataFrame(data1)
    data2 = {'Name': ['David', 'Emma'], 'Age': [42, 35]}
    df2 = pd.DataFrame(data2)
    # Concatenate df1 and df2 (original index labels 0, 1, 0, 1 are kept;
    # pass ignore_index=True to renumber them)
    concatenated_df = pd.concat([df1, df2])
    print(concatenated_df)

9. Dataframe Iteration

    import pandas as pd
    # Create a DataFrame
    data = {'Name': ['John', 'Mary', 'David'], 'Age': [25, 31, 42]}
    df = pd.DataFrame(data)
    # Iterate over rows
    for index, row in df.iterrows():
        print(row['Name'], row['Age'])

10. Dataframe Statistics

    import pandas as pd
    # Create a DataFrame
    data = {'Name': ['John', 'Mary', 'David'], 'Age': [25, 31, 42]}
    df = pd.DataFrame(data)
    # Calculate mean Age
    mean_age = df['Age'].mean()
    print(mean_age)

Function application and mapping in data science using python

Function Application
Function application involves applying a function to each element of a dataset. Python provides several ways to achieve this:

1. map(): Applies a function to each element of an iterable.

    def square(x):
        return x**2
    numbers = [1, 2, 3, 4, 5]
    squared_numbers = list(map(square, numbers))
    print(squared_numbers)  # [1, 4, 9, 16, 25]

2. Lambda Functions: Anonymous functions, often used with map(), filter(), and functools.reduce().

    numbers = [1, 2, 3, 4, 5]
    squared_numbers = list(map(lambda x: x**2, numbers))
    print(squared_numbers)  # [1, 4, 9, 16, 25]

3. List Comprehensions: Concise way to create lists by applying functions.

    numbers = [1, 2, 3, 4, 5]
    squared_numbers = [x**2 for x in numbers]
    print(squared_numbers)  # [1, 4, 9, 16, 25]

Mapping:
Mapping involves applying a function to each element of a dataset and returning a new dataset with the transformed values.

1. Pandas apply(): On a DataFrame, applies a function to each row or column; on a Series (a single column, as below), applies it to each value.

    import pandas as pd
    data = {'Name': ['John', 'Mary', 'David'], 'Age': [25, 31, 42]}
    df = pd.DataFrame(data)
    def double_age(age):
        return age * 2
    df['Double Age'] = df['Age'].apply(double_age)
    print(df)

2. NumPy vectorize(): Wraps a Python function so that it can be called elementwise on NumPy arrays. It is a convenience wrapper rather than a true compiled ufunc, so it does not make the function faster.

    import numpy as np
    def square(x):
        return x**2
    vectorized_square = np.vectorize(square)
    numbers = np.array([1, 2, 3, 4, 5])
    squared_numbers = vectorized_square(numbers)
    print(squared_numbers)  # [ 1  4  9 16 25]

Summarizing and computing descriptive statistics

Descriptive statistics provide an overview of the basic features of a dataset.

Measures of Central Tendency:
1. Mean: Average value of a dataset.
2. Median: Middle value of a dataset.
3. Mode: Most frequently occurring value.

Measures of Variability:
1. Range: Difference between maximum and minimum values.
2. Variance: Average squared difference from the mean.
3. Standard Deviation: Square root of variance.

Other Descriptive Statistics:
1. Interquartile Range (IQR): Difference between 75th and 25th percentiles.
2. Skewness: Measure of asymmetry.
3. Kurtosis: Measure of tail heaviness.

Python Libraries
1. NumPy: numpy.mean(), numpy.median(), numpy.std()
2. Pandas: df.mean(), df.median(), df.std()
3. SciPy: scipy.stats.mode(), scipy.stats.skew(), scipy.stats.kurtosis()

Note that df.std() computes the sample standard deviation (dividing by n - 1), while numpy.std() computes the population standard deviation (dividing by n) by default.

Example Code:

    import pandas as pd
    import numpy as np
    # Create a sample dataset
    data = {'Score': [90, 85, 95, 92, 88]}
    df = pd.DataFrame(data)
    # Calculate descriptive statistics
    mean_score = df['Score'].mean()
    median_score = df['Score'].median()
    std_dev = df['Score'].std()
    print(f"Mean: {mean_score}")
    print(f"Median: {median_score}")
    print(f"Standard Deviation: {std_dev}")

Output (standard deviation rounded to four decimal places; it equals the square root of 58 / 4 = 14.5):
    Mean: 90.0
    Median: 90.0
    Standard Deviation: 3.8079

Visualization:
1. Histograms: Visualize distribution of values.
2. Box Plots: Visualize median, quartiles, and outliers.

Reading and writing data in text format in Python

Reading Text Data

1. open() with read(): Reads a whole text file into one string.

    with open('file.txt', 'r') as file:
        data = file.read()
    print(data)

2. readlines(): Reads a text file into a list of lines.

    with open('file.txt', 'r') as file:
        lines = file.readlines()
    print(lines)

Writing Text Data

1. write(): Writes a string to a text file.

    with open('file.txt', 'w') as file:
        file.write('Hello, World!')

2. writelines(): Writes a list of strings to a text file. It does not add line endings, so each string must include its own '\n'.

    lines = ['Line 1\n', 'Line 2\n', 'Line 3\n']
    with open('file.txt', 'w') as file:
        file.writelines(lines)

CSV File

1. csv Module: Reads and writes CSV files. Files should be opened with newline='' so that the module handles line endings itself.

    import csv
    with open('data.csv', 'r', newline='') as file:
        reader = csv.reader(file)
        data = list(reader)
    print(data)

    import csv
    data = [['Name', 'Age'], ['John', 25], ['Mary', 31]]
    with open('data.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerows(data)

JSON Files

1. json Module: Reads and writes JSON files.

    import json
    with open('data.json', 'r') as file:
        data = json.load(file)
    print(data)

    import json
    data = {'name': 'John', 'age': 25}
    with open('data.json', 'w') as file:
        json.dump(data, file)
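The CSV and JSON snippets above can be combined into one write-then-read round trip. The following is a minimal sketch, assuming a temporary directory so that no existing files are touched. The file names and sample values are illustrative only.

```python
import csv
import json
import tempfile
from pathlib import Path

rows = [["Name", "Age"], ["John", 25], ["Mary", 31]]
record = {"name": "John", "age": 25}

# Work inside a temporary directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "data.csv"
    json_path = Path(tmp_dir) / "data.json"

    # CSV: write the rows, then read them back.
    with open(csv_path, "w", newline="") as file:
        csv.writer(file).writerows(rows)
    with open(csv_path, "r", newline="") as file:
        csv_rows = list(csv.reader(file))

    # JSON: write the dictionary, then read it back.
    with open(json_path, "w") as file:
        json.dump(record, file)
    with open(json_path, "r") as file:
        json_record = json.load(file)

print(csv_rows)     # every CSV field comes back as a string
print(json_record)  # JSON keeps the integer type of "age"
```

The round trip highlights a practical difference between the two formats: csv.reader returns every field as text, so numeric columns need an explicit conversion such as int(), while JSON preserves basic types like numbers and strings.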