Document Details

Uploaded by TopNotchSeattle5088

Suleyman Demirel University

2024

Meraryslan Meraliyev

Tags

pandas library, python programming, data analysis, data manipulation

Summary

This document is an introduction to the pandas library for data analysis. It covers data structures, data manipulation, and handling missing values, among other things. The document is dated September 22, 2024.

Full Transcript

Week 3: Introduction to pandas

Meraryslan Meraliyev
Suleyman Demirel University
September 22, 2024

Table of Contents

1. Introduction to pandas
2. Core Data Structures
3. Data Inspection
4. Indexing and Selecting Data
5. Data Manipulation
6. Handling Missing Data
7. Merging and Concatenation
8. Data Aggregation
9. Data Visualization
10. Advanced Topics
11. Use Cases

Introduction to pandas

- pandas is a Python library for data manipulation and analysis.
- It provides data structures like Series and DataFrame.
- pandas is widely used for data wrangling, analysis, and preprocessing.
- pandas is built on top of NumPy and integrates seamlessly with it.

Why Use pandas?

- Efficient handling of large in-memory datasets.
- Powerful data filtering, grouping, and aggregation.
- Built-in handling for missing data.
- Seamless integration with other libraries like NumPy, Matplotlib, and SciPy.

Core Data Structures in pandas

- Series: 1D labeled array, similar to a column in a spreadsheet.
- DataFrame: 2D labeled data structure, similar to a table.
- Index: The labels for the rows and columns.

Creating a Series in pandas

    import pandas as pd

    # Creating a Series with custom labels
    s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
    print(s)

Practical Task 1: Series Creation

Task:
1. Create a pandas Series with 5 elements and custom labels.
2. Select values based on labels.
3. Perform arithmetic operations (e.g., add 10 to each element).

Objective: Understand Series creation, indexing, and vectorized operations.

Creating a DataFrame in pandas

    import pandas as pd

    # Creating a DataFrame from a dictionary of columns
    data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
            'Age': [23, 24, 25, 26],
            'City': ['New York', 'Paris', 'London', 'Berlin']}

    df = pd.DataFrame(data)
    print(df)

Practical Task 2: DataFrame Creation

Task:
1. Create a DataFrame from a dictionary with columns for Name, Age, and Salary.
2. Display the DataFrame.
3. Select only the 'Name' and 'Salary' columns.

Objective: Practice DataFrame creation and column selection.

DataFrame Attributes

- shape: Dimensions of the DataFrame, as (rows, columns).
- columns: Column names.
- dtypes: Data types of each column.
- index: The row labels.

Example: DataFrame Attributes

    # Display DataFrame attributes
    print(df.shape)    # Dimensions of the DataFrame
    print(df.columns)  # Column labels
    print(df.index)    # Row labels
    print(df.dtypes)   # Data types of each column

Indexing and Selecting Data in pandas

- Use loc[] for label-based indexing.
- Use iloc[] for position-based indexing.
- Access columns by name, and rows by index.
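The difference between label-based and position-based indexing is easiest to see when the row labels are not integers. The following is a minimal sketch, assuming a small hypothetical DataFrame named `people` with string row labels; the names and values are illustrative and do not come from the slides.

```python
import pandas as pd

# Hypothetical data: string row labels make the loc/iloc difference visible.
people = pd.DataFrame(
    {"Age": [23, 24, 25], "City": ["New York", "Paris", "London"]},
    index=["john", "anna", "peter"],
)

# loc selects by label: the row labelled "anna", column "Age".
age_by_label = people.loc["anna", "Age"]

# iloc selects by integer position: second row, first column.
age_by_position = people.iloc[1, 0]

# Label slices include the end label; position slices exclude the end position.
label_slice = people.loc["john":"anna"]   # two rows
position_slice = people.iloc[0:1]         # one row

print(age_by_label, age_by_position)
print(len(label_slice), len(position_slice))
```

Both lookups return the same cell here, but the two slices differ in length because only the label slice includes its end point.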

Selecting Data Example

    # Selecting a single column
    print(df['Name'])

    # Selecting multiple columns
    print(df[['Name', 'Age']])

    # Selecting rows with .loc and .iloc
    print(df.loc[0])   # Label-based: the row whose index label is 0
    print(df.iloc[0])  # Position-based: the first row

Practical Task 3: Selecting Data

Task:
1. Select the 'Age' column using both loc[] and iloc[].
2. Select the first two rows of the DataFrame.
3. Retrieve the value of 'Age' for 'Peter'.

Objective: Master data selection techniques.

Modifying DataFrames

- Adding new columns with calculated values.
- Updating values directly or using conditions.
- Dropping rows or columns using drop().

Example: Modifying DataFrames

Adding New Columns

    # Adding a new column 'Bonus' which is 10% of 'Salary'
    # (assumes df has a 'Salary' column, as in Practical Task 2)
    df['Bonus'] = df['Salary'] * 0.1
    print(df)

Updating Values

    # Raising the 'Salary' of employees who are over 25 years old by 5000
    df.loc[df['Age'] > 25, 'Salary'] += 5000
    print(df)

Dropping Columns or Rows

    # Dropping the 'City' column
    df = df.drop(columns=['City'])

    # Dropping the first row (label 0 under the default integer index)
    df = df.drop(index=0)

Practical Task 4: Modify DataFrame

Task:
1. Add a new column called Department with values of your choice.
2. Update the 'Salary' of 'John' to 52000.
3. Drop the 'Age' column from the DataFrame.

Objective: Practice adding, updating, and dropping DataFrame columns.
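The modification examples above rely on a Salary column that the `df` created earlier in the slides does not contain. The sketch below is one self-contained way to run the same three steps end to end, assuming a hypothetical `staff` DataFrame whose salary figures are made up for illustration.

```python
import pandas as pd

# Hypothetical employee data; the salary values are illustrative only.
staff = pd.DataFrame({
    "Name": ["John", "Anna", "Peter", "Linda"],
    "Age": [23, 24, 25, 26],
    "City": ["New York", "Paris", "London", "Berlin"],
    "Salary": [50000, 60000, 55000, 65000],
})

# Add a column computed from an existing one (10% of Salary).
staff["Bonus"] = staff["Salary"] * 0.1

# Conditional update: only rows where Age > 25 receive the raise.
staff.loc[staff["Age"] > 25, "Salary"] += 5000

# drop() returns a new DataFrame, so the result is assigned back.
staff = staff.drop(columns=["City"])  # remove a column
staff = staff.drop(index=0)           # remove the row labelled 0

print(staff)
```

Note that Bonus is computed before the raise is applied, so it reflects the original salaries; reordering the two steps would change the result.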

Handling Missing Data

- Detect missing values using isnull().
- Fill missing values using fillna().
- Remove missing values using dropna().

Example: Handling Missing Data

    import numpy as np

    # Introducing missing data
    df.loc[1, 'Age'] = np.nan

    # Detecting missing values
    print(df.isnull())

    # Filling missing values with the mean
    # (the result is assigned back; calling fillna(..., inplace=True) on a
    # selected column is chained assignment, which newer pandas warns about)
    df['Age'] = df['Age'].fillna(df['Age'].mean())

    print(df)

Practical Task 5: Handle Missing Data

Task:
1. Introduce missing values into the Salary column.
2. Fill the missing Salary values with the median salary.
3. Remove any rows that still contain missing values.

Objective: Learn to handle missing data effectively.

Merging DataFrames in pandas

- Combine DataFrames using keys, similar to SQL joins.
- Types of merges: inner, outer, left, right.

Merging DataFrames Example

    # Creating another DataFrame
    data2 = {'Name': ['John', 'Anna', 'Mike'],
             'Department': ['HR', 'IT', 'Finance']}

    df2 = pd.DataFrame(data2)

    # Merging on 'Name'; an inner join keeps only names present in both DataFrames
    merged_df = pd.merge(df, df2, on='Name', how='inner')
    print(merged_df)

Practical Task 6: Merge DataFrames

Task:
1. Create two DataFrames: one with Name, Age, and Salary, and another with Name and Department.
2. Merge them on the Name column using an outer join.

Objective: Practice merging DataFrames using different join types.

Data Aggregation in pandas

- Group data by columns using groupby().
- Common aggregation functions: mean(), sum(), count(), min(), max().

Grouping and Aggregation Example

    # Grouping by 'Department' and calculating mean Salary
    # (assumes df has 'Department' and 'Salary' columns)
    grouped = df.groupby('Department')['Salary'].mean()
    print(grouped)

Practical Task 7: Grouping and Aggregation

Task:
1. Group the DataFrame by Department.
2. Calculate the total and average salary for each department.

Objective: Understand how to perform group-wise operations.

Data Visualization with pandas

- pandas integrates with Matplotlib for easy plotting.
- Use plot() to create line, bar, scatter, or histogram plots.

Example: Data Visualization

    import matplotlib.pyplot as plt

    # Creating a line plot for the 'Age' column
    df['Age'].plot(kind='line', title='Age Over Entries')
    plt.xlabel('Index')
    plt.ylabel('Age')
    plt.show()

    # Creating a bar plot for Salary by Department
    df.plot(kind='bar', x='Department', y='Salary', title='Salary by Department')
    plt.xlabel('Department')
    plt.ylabel('Salary')
    plt.show()

Practical Task 8: Data Visualization

Task:
1. Create a scatter plot showing the relationship between Age and Salary.
2. Generate a histogram for the Age distribution.

Objective: Learn to create various types of plots using pandas.

Merging DataFrames (Advanced)

- Merging DataFrames with multiple keys.
- Merging DataFrames with different indexes.
- Using concat() for joining DataFrames along the row or column axis.

Practical Task 9: Advanced Merging

Task:
1. Merge two DataFrames on multiple columns.
2. Concatenate two DataFrames along rows and columns.

Objective: Understand advanced merging and concatenation.

Working with Time Series in pandas

- pandas provides built-in support for time series data.
- Resampling, shifting, and rolling windows can be easily applied.
- Use DatetimeIndex for time-based indexing.

Example: Time Series Data

    import pandas as pd
    import numpy as np

    # Creating a date range of 6 consecutive days
    dates = pd.date_range('20230101', periods=6)

    # Creating a DataFrame with a DatetimeIndex
    df_time = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
    print(df_time)

Practical Task 10: Time Series

Task:
1. Create a time series DataFrame with daily dates and random data for two columns.
2. Resample the data to a weekly frequency and calculate the sum.

Objective: Practice time series operations in pandas.

Pivot Tables in pandas

- pivot_table() is used for summarizing data.
- Useful for multi-level data aggregation.
- Supports various aggregation functions (e.g., mean, sum).

Example: Pivot Tables

    # Creating a pivot table
    # (assumes df has 'Department' and 'Salary' columns)
    pivot_df = df.pivot_table(values='Salary', index='Department', aggfunc='mean')
    print(pivot_df)

Practical Task 11: Pivot Tables

Task:
1. Create a pivot table to find the average salary per department.
2. Use multi-level indexing to group by Department and City.

Objective: Understand pivot tables and multi-level aggregation.

Use Case 1: Financial Analysis

- Analyze stock prices and calculate indicators.
- Calculate moving averages and volatility.
- Resample time series data for time-based analysis.

Example: Stock Price Analysis

    import pandas as pd

    # Simulating stock prices over 10 days
    dates = pd.date_range('20230101', periods=10)
    stock_prices = pd.Series([100, 102, 101, 103, 102, 104, 105, 106, 107, 108], index=dates)

    # Calculate rolling 3-day average
    # (the first two values are NaN because the window is not yet full)
    rolling_avg = stock_prices.rolling(window=3).mean()
    print(rolling_avg)

Use Case 2: Healthcare Data Analysis

- Analyze patient records, treatment outcomes, and medical data.
- Perform time series analysis on medical data like heart rate or blood pressure.
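The time series analysis mentioned for heart-rate data could look like the following sketch. It assumes hypothetical hourly readings for a single patient (the variable names and values are invented for illustration) and applies the rolling, resampling, and shifting operations listed in the time series slide.

```python
import pandas as pd

# Hypothetical hourly heart-rate readings (beats per minute) for one patient.
timestamps = pd.date_range("2024-01-01 00:00", periods=8, freq="h")
heart_rate = pd.Series([72, 75, 71, 80, 78, 85, 83, 79], index=timestamps)

# Rolling mean over a 3-reading window smooths short-term fluctuations.
rolling_hr = heart_rate.rolling(window=3).mean()

# Resample to 4-hour bins and average the readings in each bin.
four_hourly = heart_rate.resample("4h").mean()

# shift(1) moves values one step later, giving the change between readings.
change = heart_rate - heart_rate.shift(1)

print(rolling_hr)
print(four_hourly)
print(change)
```

The first two rolling values and the first change value are NaN, since there are not enough earlier readings to fill the window or to compare against.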

Example: Healthcare Data Analysis

    import pandas as pd

    # Sample patient data
    data = {'Patient': ['A', 'B', 'C', 'D'],
            'Heart Rate': [72, 80, 78, 85],
            'Blood Pressure': [120, 125, 118, 130]}

    df_health = pd.DataFrame(data)

    # Calculate mean Heart Rate and Blood Pressure
    mean_hr = df_health['Heart Rate'].mean()
    mean_bp = df_health['Blood Pressure'].mean()

    print("Mean Heart Rate:", mean_hr)
    print("Mean Blood Pressure:", mean_bp)

Use Case 3: E-commerce Analytics

- Analyze customer behavior and sales trends.
- Perform customer segmentation and purchase pattern analysis.
- Monitor inventory and predict future demand.

Example: E-commerce Sales Data

    import pandas as pd

    # Sample sales data
    sales_data = {'Product': ['Laptop', 'Smartphone', 'Tablet'],
                  'Units Sold': [100, 200, 150],
                  'Revenue': [100000, 120000, 75000]}

    df_sales = pd.DataFrame(sales_data)

    # Calculate total revenue
    total_revenue = df_sales['Revenue'].sum()
    print("Total Revenue:", total_revenue)

Use Case 4: Machine Learning

- Preprocess data for machine learning models.
- Perform feature engineering and selection.
- Normalize and clean data for model training.

Example: Data Preprocessing for ML

    import pandas as pd

    # Sample dataset for ML
    data = {'Feature1': [2.5, 3.1, 4.2, 5.0, 2.9],
            'Feature2': [1.1, 0.9, 1.2, 1.0, 1.3],
            'Label': [0, 1, 1, 0, 1]}

    df_ml = pd.DataFrame(data)

    # Normalize features with a z-score: subtract the mean, then divide by
    # the standard deviation (pandas std() uses the sample formula, ddof=1)
    df_ml['Feature1'] = (df_ml['Feature1'] - df_ml['Feature1'].mean()) / df_ml['Feature1'].std()
    df_ml['Feature2'] = (df_ml['Feature2'] - df_ml['Feature2'].mean()) / df_ml['Feature2'].std()

    print(df_ml)

Use Case 5: Social Media Analytics

- Analyze social media engagement, hashtags, and user trends.
- Perform sentiment analysis on user posts or comments.
- Track user activity over time (e.g., likes, shares, comments).

Example: Social Media Engagement Data

    import pandas as pd

    # Sample social media engagement data
    engagement_data = {'Post': ['Post1', 'Post2', 'Post3'],
                       'Likes': [150, 200, 300],
                       'Comments': [20, 35, 40]}

    df_engagement = pd.DataFrame(engagement_data)

    # Calculate total engagement (Likes + Comments)
    df_engagement['Total Engagement'] = df_engagement['Likes'] + df_engagement['Comments']
    print(df_engagement)

Use Case 6: Inventory Management

- Manage stock levels and reorder points.
- Track inventory and predict future needs.
- Optimize reorder processes to minimize costs.

Example: Inventory Tracking

    import pandas as pd

    # Sample inventory data
    inventory_data = {'Item': ['Item1', 'Item2', 'Item3'],
                      'Stock': [50, 20, 30],
                      'Reorder Level': [20, 10, 15]}

    df_inventory = pd.DataFrame(inventory_data)

    # Identify items below reorder level
    # (with this sample data no item qualifies, so the result is empty)
    low_stock = df_inventory[df_inventory['Stock'] < df_inventory['Reorder Level']]
    print(low_stock)

Practical Task 12: Use Cases

Task:
1. Choose one of the use cases discussed (e.g., Financial Analysis, Healthcare Data Analysis).
2. Implement a small project or analysis using pandas related to your chosen use case.
3. Present your findings in a concise report or visualization.

Objective: Apply pandas to real-world scenarios and showcase its versatility.

Conclusion

- pandas simplifies data analysis and manipulation.
- Use it for data cleaning, aggregation, and visualization.
- Practice with real-world datasets to master key functionalities.
- Feel free to explore more advanced topics like time series, pivot tables, and multi-indexing.
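As a starting point for the multi-indexing topic mentioned in the conclusion, here is a small sketch of a pivot table with two row levels. The `employees` DataFrame and its figures are hypothetical and serve only to show how a MultiIndex arises and how it can be queried.

```python
import pandas as pd

# Hypothetical employee records; names and figures are illustrative only.
employees = pd.DataFrame({
    "Department": ["HR", "HR", "IT", "IT", "IT"],
    "City": ["Paris", "London", "Paris", "Paris", "Berlin"],
    "Salary": [50000, 54000, 60000, 64000, 70000],
})

# Two index columns produce a pivot table whose rows carry a MultiIndex.
pivot = employees.pivot_table(
    values="Salary", index=["Department", "City"], aggfunc="mean"
)
print(pivot)

# A tuple of labels selects one row of the MultiIndex.
it_paris_mean = pivot.loc[("IT", "Paris"), "Salary"]

# A single outer label selects every row under that department.
it_rows = pivot.loc["IT"]

print(it_paris_mean)
print(it_rows)
```

Selecting by the outer label alone returns a smaller table indexed by City, which is often the most convenient way to drill into one group.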
