Python Programming: Data Analysis Part 1 PDF
Dr. Ahmed
Summary
This lecture covers the basics of data analysis using Python. It introduces the key concepts, techniques, and popular libraries (e.g., NumPy, Pandas, Matplotlib) for data analysis. The document also highlights Python's role in advanced tasks such as machine learning, and its versatile ecosystem.
Full Transcript
PYTHON PROGRAMMING
PYTHON FOR DATA ANALYSIS
Unit 9: Python for Data Analysis
Python Programming : Lecture Dr. Ahmed

Outline of This Lecture
9.1 Objective
9.2 Introduction to Data Analysis with Python
    9.2.1 Understanding the role of Python in data analysis
    9.2.2 Overview of popular Python libraries for data analysis (e.g., NumPy, Pandas, Matplotlib, Seaborn)
    9.2.3 Data Manipulation with NumPy
9.3 Reading Data; Selecting and Filtering the Data; Data manipulation, sorting, grouping, rearranging
9.4 Plotting the data
9.5 Descriptive statistics
9.6 Inferential statistics
9.7 Summary
9.8 Exercise
9.9 Reference

Lecture Objectives
After reading through this chapter, you will be able to:
1. Understand the fundamental concepts and techniques of data analysis using Python.
2. Gain proficiency in using key Python libraries for data analysis, such as NumPy, Pandas, Matplotlib, and Seaborn.
3. Learn how to manipulate, clean, and preprocess data using Pandas, including handling missing data and data cleaning techniques.
4. Develop skills in visualizing data effectively using Matplotlib and Seaborn, including creating various types of plots and customizing visualizations.

Role of Python in Data Analysis
Python plays a crucial role in data analysis due to its wide range of powerful libraries and tools specifically designed for working with data. Here are some key aspects of Python's role in data analysis:
- Data Manipulation: Python provides libraries like NumPy and Pandas, which offer efficient data structures and functions for handling and manipulating large datasets. These libraries enable tasks such as data cleaning, filtering, sorting, merging, reshaping, and aggregation.
- Data Visualization: Python offers libraries such as Matplotlib and Seaborn that allow users to create a variety of high-quality visualizations, including line plots, scatter plots, bar plots, histograms, heatmaps, and more.
  These libraries provide customization options for creating visually appealing and informative plots.
- Statistical Analysis: Python provides libraries like SciPy and Statsmodels that offer a wide range of statistical functions, probability distributions, hypothesis tests, and regression models. These libraries enable users to perform statistical analysis and make data-driven decisions.
- Machine Learning: Python has become a popular language for machine learning due to libraries like Scikit-learn, TensorFlow, and PyTorch. These libraries provide implementations of various machine learning algorithms and tools for tasks such as classification, regression, clustering, and model evaluation.
- Integration and Ecosystem: Python has a vast ecosystem of libraries and tools that integrate well with each other. This allows data analysts to use specialized libraries for specific tasks, such as natural language processing (NLTK), network analysis (NetworkX), and geospatial analysis (GeoPandas). Additionally, Python integrates well with other data-related technologies and databases, making it a versatile choice for data analysis workflows.
- Accessibility and Community Support: Python is known for its simplicity and readability, making it accessible to beginners and experienced programmers alike. It has a large and active community of users who contribute to its development and provide support through forums, tutorials, and open-source projects, so abundant resources are available for learning and problem-solving.

Python Libraries for Data Analysis
Many popular Python toolboxes/libraries:
- NumPy
- SciPy
- Pandas
- SciKit-Learn
Visualization libraries:
- matplotlib
- Seaborn
and many more …
All these libraries are installed on YOUR computer.

NumPy:
- introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects
- provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
- is a fundamental library for numerical computing in Python and the foundation for many other data analysis libraries
Link: http://www.numpy.org/

SciPy:
- collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more
- part of the SciPy Stack
- built on NumPy
- provides additional functionality for scientific computing and advanced mathematics, with modules for optimization, integration, interpolation, signal processing, linear algebra, and more
- complements NumPy and provides a rich set of tools for scientific analysis and modeling
Link: https://www.scipy.org/scipylib/

Pandas:
- adds data structures and tools designed to work with table-like data (similar to data frames in R)
- introduces two primary data structures: Series and DataFrame
- provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation, etc.
- provides functions and methods for data cleaning and transformation
- allows handling of missing data
Link: http://pandas.pydata.org/

SciKit-Learn:
- provides machine learning algorithms: classification, regression, clustering, model validation, etc.
- built on NumPy, SciPy and matplotlib
- provides a consistent API and supports various data formats, making it easy to apply machine learning techniques to real-world datasets
Link: http://scikit-learn.org/

matplotlib:
- a versatile plotting library that enables the creation of a wide variety of static, animated, and interactive visualizations
- a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats
- provides a MATLAB-like interface for creating plots and charts, allowing customization of colors, markers, labels, and other visual elements
- supports line plots, scatter plots, bar charts, histograms, pie charts, etc.
Link: https://matplotlib.org/

Seaborn:
- a statistical data visualization library built on top of Matplotlib
- provides a high-level interface for creating attractive and informative statistical graphics
- simplifies the process of creating complex visualizations such as distribution plots, categorical plots, correlation matrices, and time series plots
- offers additional features like color palettes, themes, and advanced statistical plotting capabilities
- similar (in style) to the popular ggplot2 library in R
Link: https://seaborn.pydata.org/

TensorFlow and PyTorch:
TensorFlow and PyTorch are powerful deep learning libraries in Python. They provide tools for building and training neural networks, with support for high-performance GPU computing.
These libraries are widely used in tasks such as image recognition, natural language processing, and recommender systems.
Link: https://www.tensorflow.org/

Start Jupyter notebook
# On your computer, type:
jupyter notebook

Loading Python Libraries
In [ ]: #Import Python Libraries
        import numpy as np
        import scipy as sp
        import pandas as pd
        import matplotlib as mpl
        import seaborn as sns
Press Shift+Enter to execute the Jupyter cell.

Reading data using pandas
In [ ]: #Read csv file
        df = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/Salaries.csv")
Note: The above command has many optional arguments to fine-tune the data import process.
There are a number of pandas commands to read other data formats:
    pd.read_excel('myfile.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
    pd.read_stata('myfile.dta')
    pd.read_sas('myfile.sas7bdat')
    pd.read_hdf('myfile.h5', 'df')

Exploring data frames
In [ ]: #List first 5 records
        df.head()

Data Frame data types
In [ ]: #Check a particular column type
        df['salary'].dtype
Out[ ]: dtype('int64')
In [ ]: #Check types for all the columns
        df.dtypes
Out[ ]: rank          object
        discipline    object
        phd            int64
        service        int64
        sex           object
        salary         int64
        dtype: object

Data Frames attributes
Python objects have attributes and methods.
df.attribute   description
dtypes         list the types of the columns
columns        list the column names
axes           list the row labels and column names
ndim           number of dimensions
size           number of elements
shape          return a tuple representing the dimensionality
values         numpy representation of the data

Data Frames methods
Unlike attributes, Python methods are called with parentheses.
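The attribute/method distinction can be illustrated with a minimal sketch. It assumes a small made-up data frame: the rows below are hypothetical stand-ins that only reuse the column names of Salaries.csv (they are not the real data), so the cell runs without a network connection.

```python
import pandas as pd

# Hypothetical rows for illustration only; column names follow Salaries.csv
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AssocProf"],
    "discipline": ["A", "B", "B", "A"],
    "phd": [30, 5, 22, 12],
    "service": [25, 3, 18, 9],
    "sex": ["Male", "Female", "Female", "Male"],
    "salary": [150000, 80000, 135000, 98000],
})

# Attributes are read without parentheses
n_rows, n_cols = df.shape          # (4, 6)
n_dims = df.ndim                   # 2
n_elements = df.size               # 24 = 4 rows * 6 columns
column_names = list(df.columns)

# Methods are called with parentheses (and may take arguments)
first_two = df.head(2)             # first 2 rows as a new data frame
mean_salary = df["salary"].mean()  # mean of a single numeric column

print(n_rows, n_cols, n_dims, n_elements)
print(column_names)
print(first_two)
print(mean_salary)
```

With the real data set, the same calls apply to the df returned by pd.read_csv; only the values change.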
All attributes and methods can be listed with the dir() function: dir(df)
df.method()              description
head([n]), tail([n])     first/last n rows
describe()               generate descriptive statistics (for numeric columns only)
max(), min()             return max/min values for all numeric columns
mean(), median()         return mean/median values for all numeric columns
std()                    standard deviation
sample([n])              returns a random sample of the data frame
dropna()                 drop all the records with missing values

Selecting a column in a Data Frame
Method 1: Subset the data frame using the column name:
    df['sex']
Method 2: Use the column name as an attribute:
    df.sex
Note: pandas data frames already have a method named rank, so to select a column named "rank" we should use Method 1.

Data Frames groupby method
Using the "group by" method we can:
- Split the data into groups based on some criteria
- Calculate statistics (or apply a function) to each group
Similar to the dplyr package in R.
In [ ]: #Group data using rank
        df_rank = df.groupby(['rank'])
In [ ]: #Calculate mean value for each numeric column per each group
        #(numeric_only=True skips text columns, which recent pandas versions refuse to average)
        df_rank.mean(numeric_only=True)

Once the groupby object is created we can calculate various statistics for each group:
In [ ]: #Calculate mean salary for each professor rank:
        df.groupby('rank')[['salary']].mean()
Note: If single brackets are used to specify the column (e.g. salary), then the output is a Pandas Series object. When double brackets are used, the output is a Data Frame.

groupby performance notes:
- no grouping/splitting occurs until it's needed. Creating the groupby object only verifies that you have passed a valid mapping
- by default the group keys are sorted during the groupby operation.
  You may want to pass sort=False for potential speedup:
In [ ]: #Calculate mean salary for each professor rank:
        df.groupby(['rank'], sort=False)[['salary']].mean()

Data Frame: filtering
To subset the data we can apply Boolean indexing. This indexing is commonly known as a filter. For example, if we want to subset the rows in which the salary value is greater than $120K:
In [ ]: #Select the rows with salary greater than 120000:
        df_sub = df[ df['salary'] > 120000 ]
Any Boolean operator can be used to subset the data: > greater; >= greater or equal; < less;