Advanced Programming Pandas in Python PDF
Document Details
Tags
Summary
This document provides examples and explanations of advanced Python programming concepts focusing on Pandas for data manipulation. It covers data structures, CSV file operations, and creating DataFrames, showcasing how to load, manipulate, save, and handle data using Pandas.
Full Transcript
# Advanced Programming ## برمجة متقدمة pandas ## CSV File operations in Python - يمكن استيراد الملفات بتنسيق CSV وتصديرها من البرامج التي تخزن البيانات في الجداول، مثل OpenOfficeCalc. أو Microsoft Excel - CSV stands for "comma separated values". - يُعرف كل سطر في الملف بالبيانات السجل. يتكون كل س...
# Advanced Programming ## برمجة متقدمة pandas ## CSV File operations in Python - يمكن استيراد الملفات بتنسيق CSV وتصديرها من البرامج التي تخزن البيانات في الجداول، مثل OpenOfficeCalc. أو Microsoft Excel - CSV stands for "comma separated values". - يُعرف كل سطر في الملف بالبيانات السجل. يتكون كل سجل من حقل واحد أو أكثر، مفصولة بفواصل (المعروفة أيضًا باسم المحددات)، أي أن كل سجل هو أيضًا جزء من هذا الملف. - استخدام المكتبة CSV من خلال `import CSV` الامر ##### Spreadsheet ``` 00 X ``` ##### CSV ``` DATA ``` ##### Database ``` ``` ## Advanced Programming ## CSV File operations in Python ```python import csv f = open("D:\\datafile\\Student.csv", "r") csv_reader = csv.reader(f) #reader() functic #csv_reader is the csv reader object print("Content of student file are: ") #Reading the records from the file for row in csv_reader: print(row) f.close() ``` ##### output:1 ``` Content of student file are: ['Name', 'class', 'marks'] ['Mohan', '11', '56'] ['Vijay', '12', '35'] ['Mala', '11', '87'] ['Kavita', '11', '48'] ['Aman', '12', '55'] ['Tinku', '12', '63'] ['Anjum', '11', '75'] ``` ## Advanced Programming ## CSV File operations in Python ```python import csv with open("D:\\datafile\\Student.csv", "r") as csvf: records = csv.reader(csvf) #reader() function reads the file #records is the csv reader object print("Content of student file are: ") rows = [] #Reading the records from the file for rec in records: rows.append(rec) print(rows) ``` ##### output:1 ``` Content of student file are: [['Name', 'class', 'marks'), ('Mohan', 11', '56'], ['Vijay', '12', '35'], ['Mal a', '11', '87'], ['Kavita', '11', '48'], ['Aman', '12', '55'], ['Tinku', '12', '6 3'], ['Anjum', '11', '75']] ``` ## Advanced Programming ## CSV File operations in Python ```python import csv csv_columns = ['id', 'Column1', 'Column2', 'Column3', 'Column4', 'Column5'] dict_data = { 'id': ['1', '2', '3'], 'Column1': [33, 25, 56], 'Column2': [35, 30, 30], 'Column3': [21, 40, 55], 'Column4': [71, 25, 55], 'Column5': [10, 10, 40], } csv_file = "temp.csv" try: with open(csv_file, 'w') as csvfile: writer = csv.DictWriter(csvfile, fieldnames=csv_columns) writer.writeheader() for data in dict_data: writer.writerow(dict_data) except IOError: print("I/O error") data = csv. DictReader (open(csv_file)) print("CSV file as a dictionary:\n") for row in data: print(row) ``` ##### output:1 ``` ['id': "['1', '2', '3']", 'Column1': '[33, 25, 56]', 'Column2': '[35, 30, 30]', 'Column3': '[21, 40, 55]', 'Column4': '[71, 25, 55]', 'Column5': '[10, 10, 40]'} ['id': "['1', '2', '3']", 'Column1': ' [33, 25, 56]', 'Column2': '[35, 30, 30]', 'Column3': '[21, 40, 55]', 'Column4': '[71, 25, 55]', 'Column5': '[10, 10, 40]'} ('id': "['1', '2', '3']", 'Column1': '[33, 25, 56]', 'Column2': '[35, 30, 30]', 'Column3': '[21, 40, 55]', 'Column4': '[71, 25, 55]', 'Column5': '[10, 10, 40]'} ('id': "['1', '2', '3']", 'Column1': '[33, 25, 56]', 'Column2': '[35, 30, 30]', 'Column3': '[21, 40, 55]', 'Column4': '[71, 25, 55]', 'Column5': '[10, 10, 40]'} ['id': "['1', '2', '3']", 'Column1': '[33, 25, 56]', 'Column2': '[35, 30, 30]', 'Column3': '[21, 40, 55]', 'Column4': '[71, 25, 55]', 'Column5': '[10, 10, 40]'} ['id': "['1', '2', '3']", 'Column1': '[33, 25, 56]', 'Column2': '[35, 30, 30]', 'Column3': '[21, 40, 55]', 'Column4': '[71, 25, 55]', 'Column5': '[10, 10, 40]'} ``` ## Advanced Programming ## CSV File operations in Python ```python import csv data = [[10, 'a1', 1], [12, 'a2', 3], [14, 'a3', 5], [16, 'a4', 7], [18, 'a5', 9] with open("temp.csv", "w", newline="") as f: writer = csv.writer(f) writer.writerows(data) with open('temp.csv', newline='') as csvfile: data = csv.reader(csvfile, delimiter=' ') for row in data: print(', '.join(row)) ``` ##### output:1 ``` 10, a1,1 12,a2,3 14,a3,5 16,a4,7 18,a5,9 ``` ## Advanced Programming ## Pandas in Python pandas عبارة عن حزمة Python توفر هياكل بيانات سريعة ومرنة ومعبرة مصممة لجعل العمل مع البيانات " a relation" أو "المصنفة" أمرًا سهلاً وبديهيًا. ويهدف إلى أن يكون لبنة البناء الأساسية عالية المستوى لإجراء تحليل عملي وحقيقي لبيانات العالم في بايثون. ### data structures: Series and DataFrame. #### Start by importing these Python modules ```python import pandas as pd from pandas import DataFrame, Series ``` ##### Series السلسلة عبارة عن كائن يشبه المصفوفة أحادي البعد يحتوي على مصفوفة من البيانات ، جميع البيانات الموجودة في السلسلة هي من نفس نوع البيانات. يتم توجيه حساب السلسلة بعد محاذاة فهرس السلسلة أولاً لكل من المعاملات. ## Advanced Programming ## Pandas in Python ```python import pandas as pd from pandas import DataFrame, Series s1 = Series(range(0,4)) s2 = Series(range(1,5)) s3 = s1 + s2 s4 = Series(['a','b'])*3 print(s1) print(s2) print(s3) print(s4) ``` ##### output:1 ``` 0, 1, 2, 3 1, 2, 3, 4 1, 3, 5, 7 'aaa','bbb' ``` ##### output:2 ```python import pandas as pd ds = pd.Series([2, 4, 6, 8, 10]) print("Pandas Series and type") print(ds) print(type(ds)) print("Convert Pandas Series to Python list") print(ds.tolist()) print(type(ds.tolist())) ``` ##### Pandas Series and type ``` Pandas Series and type 0 2 1 4 2 6 3 8 4 10 dtype: int64 <class 'pandas.core.series.Series'> Convert Pandas Series to Python list [2, 4, 6, 8, 10] <class 'list'> ``` ## Advanced Programming ## Pandas in Python ### DataFrame object: The pandas DataFrame is a two-dimensional table of data with column and row indexes. The columns are made up of pandas Series objects. ##### Row index (df.index) ##### Column index (df.columns) ``` Series of data Series of data Series of data Series of data Series of data Series of data Series of data ``` ### الشكل العام ```python class pandas.DataFrame(data=None, index=None, columns=None, dty pe=None, copy=None) ``` #### Parameters: - **Data:**ndarray (structured or homogeneous), Iterable, dict, or DataFrame - **Index:** Index or array-like - **Columns:**Index or array-like - **Dtype:** dtype, default None - **Copy:**bool or None, default None ## Advanced Programming ## Pandas in Python ```python df = pd.DataFrame( {"a": [4,5,6], "b" : [7, 8, 9], "c" : [10, 11, 12]}, index = [1, 2, 3]) ``` ```python df = pd.DataFrame( {"a" : [4,5,6], "b" : [7, 8, 9], "c" : [10, 11, 12]}, index = pd.MultiIndex.from_tuples( [('d',1),('d',2),('e',2)], names=['n','v']))) ``` ```python df = pd.DataFrame( [[4, 7, 10], [5, 8, 11], [6, 9, 12]], index=[1, 2, 3], columns=['a', 'b', 'c']) ``` ##### output: | | a | b | c | |---|---|---|---| | n | v | | | | d | 1 | 4 | 7 | 10 | | d | 2 | 5 | 8 | 11 | | e | 2 | 6 | 9 | 12 | ## Advanced Programming ## Pandas in Python ### Load a DataFrame from a CSV file ```python df = pd.read_csv('datacsv.csv') print(df.values) print(' -") df = pd.read_csv('datacsv.csv', header = 0, index_col= 0, quotechar='"', sep= ':', na_values = ['na', '-', '.', '']) print(df.values) ``` ##### output: ``` [[ 1. 0.1147793 0.272886 1.4136375 1.4136375 ... 1.58297578] [2. 0.2447283 0.332881 1.4035425 1.4035425 ... 1.54354811] [3. 0.3583363 0.402835 1.38297578 1.38297578 ... 1.5136375 ] ... [ nan nan nan ... nan nan nan] [ nan nan nan nan nan ... nan] [ nan nan nan ... nan -0.22077031 nan]] [] ``` ### كنشاط البحث عن البارمترات التي تستخدم مع بانداز ## Advanced Programming ## Pandas in Python ### Load DataFrames from a Microsoft Excel file ```python import pandas as pd workbook = pd.read_excel('./data1.xlsx') dictionary = {} for sheet_name in workbook. sheet_names: df = workbook.parse(sheet_name) dictionary[sheet_name] = df ``` ### Data in Series then combine into a DataFrame ```python import pandas as pd from pandas import DataFrame, Series s1 = Series(range(6)) s2 = s1 * s1 s2.index = s2.index + 2 # misalign indexes df = pd.concat([s1, s2], axis= 1) print(df) ``` ##### pd.concat([df1,df2]) Append rows of DataFrames ``` 0 0.0 NaN 1 1.0 NaN 2 2.0 0.0 3 3.0 1.0 4 4.0 4.0 5 5.0 9.0 6 NaN 16.0 7 NaN 25.0 ``` ## Advanced Programming ## Pandas in Python ### Data in Series then combine into a DataFrame ```python import pandas as pd from pandas import DataFrame, Series s3 = Series({'Tom': 1, 'Dick': 4, 'Har':9}) s4 = Series({'Tom': 3, 'Dick': 2, 'Mar': 5}) df = pd.concat({'A': s3, 'B': s4 }, axis= 1) print(df) ``` ```python import pandas as pd from pandas import DataFrame, Series s3 = Series({'Tom': 1, 'Dick': 4, 'Har':9}) s4 = Series({'Tom':3, 'Dick': 2, 'Mar':5}) df = pd.concat({'A': s3, 'B': s4 }, axis= 0) print(df) ``` ##### output: | | A | B | |---|---|---| | Tom | 1.0 | 3.0 | | Dick | 4.0 | 2.0 | | Har | 9.0 | NaN | | Mar | NaN | 5.0 | | | Tom | |---|---| | A | 1 | | A | Dick | 4 | | A | Har | 9 | | B | Tom | 3 | | B | Dick | 2 | | B | Mar | 5 | | | dtype: int64 | ## Advanced Programming ## Pandas in Python ### Get a DataFrame from data in a Python dictionary ```python import pandas as pd from pandas import DataFrame, Series df = DataFrame({ 'col0' : [1.0, 2.0, 3.0, 4.0], # data by col 'col1': [100, 200, 300, 400] }) Print(df) ``` ```python import pandas as pd from pandas import DataFrame, Series df = DataFrame.from_dict({ 'row0' : {'col0': 0, 'col1': 'A'}, # data by row 'row1': {'col0':1, 'col1': 'B'} }, orient= 'index') Print(df) ``` ##### output: | | col0 | col1 | |---|---|---| | 0 | 1.0 | 100 | | 1 | 2.0 | 200 | | 2 | 3.0 | 300 | | 3 | 4.0 | 400 | | | col0 | col1 | |---|---|---| | row0 | 0 | A | | row1 | 1 | B | ## Advanced Programming ## Pandas in Python ### الحصول على تجمع للبيانات من بيانات قاموس وفقا للعمود # Import pandas package ```python import pandas as pd ``` # Define a dictionary containing employee data ```python data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'], 'Age': [27, 24, 22, 32], 'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'], 'Qualification': ['Msc', 'MA', 'MCA', 'Phd']} ``` # Convert the dictionary into DataFrame ```python df = pd.DataFrame(data) ``` # select two columns ```python print(df[['Name', 'Qualification')]]) ``` | | Name | Age | Address | Qualification | |---|---|---|---|---| | 0 | Jai | 27 | Delhi | Msc | | 1 | Princi | 24 | Kanpur | MA | | 2 | Gaurav | 22 | Allahabad | MCA | | 3 | Anuj | 32 | Kannauj | Phd | ## Advanced Programming ## Pandas in Python ### الحصول على تجمع للبيانات من بيانات قاموس Get a DataFrame from data in a Python dictionary ```python import pandas as pd from pandas import DataFrame, Series df = DataFrame.from_dict({ 'row0': [1, 1+1j, 'A'], 'row1': [2, 2+2j, 'B'] }, orient='index') Print(df) ``` ##### output: | | 0 | 1 | 2 | |---|---|---|---| | row0 | 1 | 1.0+1.0j | A | | rowl | 2 | 2.0+2.0j | B | ### الحصول على تجمع للبيانات باستخدام دالة التوليد العشوائي Create play/fake data (useful for testing) ```python import pandas as pd import numpy as np from pandas import DataFrame, Series df = DataFrame(np.random.rand(50,5)) Print(df) ``` ## Advanced Programming ## Pandas in Python ### الحصول على تجمع للبيانات باستخدام الدالة مع الفهرسة Create play/fake data (useful for testing) ```python import pandas as pd import numpy as np from pandas import DataFrame, Series df = DataFrame(np.random.rand(50,5)) df.index = pd.date_range('1/1/2023', periods= len(df), freq= 'M') print(df) ``` ### Saving a DataFrame - Saving a DataFrame to a CSV file ```python df.to_csv('name.csv', encoding='utf-8') ``` - Saving a DataFrame to a Python dictionary ```python dictionary = df.to_dict() ``` - Saving a DataFrame to a Python string ```python string = df.to_string() ``` **Note:** sometimes may be useful for debugging ##### output: | | 0 | 1 | 2 | 3 | 4 | |---|---|---|---|---|---| | 2023-01-31 | 0.126339 | 0.487015 | 0.164965 | 0.346717 | 0.475812 | | 2023-02-28 | 0.299348 | 0.552461 | 0.466173 | 0.911177 | 0.552601 | | 2023-03-31 | 0.828491 | 0.224908 | 0.606144 | 0.381507 | 0.355177 | | 2023-04-30 | 0.57155 | 0.200999 | 0.288864 | 0.805461 | 0.546776 | | 2023-05-31 | 0.337062 | 0.184931 | 0.590253 | 0.001906 | 0.055903 | | 2023-06-30 | 0.598078 | 0.712961 | 0.275347 | 0.75346 | 0.17987 | | 2023-07-31 | 0.795879 | 0.970955 | 0.750632 | 0.290019 | 0.219603 | | 2023-08-31 | 0.937335 | 0.929332 | 0.898281 | 0.624556 | 0.601289 | | 2023-09-30 | 0.417001 | 0.752348 | 0.488996 | 0.54058 | 0.493881 | | 2023-10-31 | 0.331164 | 0.000456 | 0.330012 | 0.929573 | 0.8947 | | 2023-11-30 | 0.710397 | 0.41203 | 0.80884 | 0.338079 | 0.079029 | | 2023-12-31 | 0.258642 | 0.989037 | 0.075493 | 0.550387 | 0.066442 | | 2024-01-31 | 0.967857 | 0.124993 | 0.849789 | 0.125824 | 0.101143 | | 2024-02-29 | 0.080603 | 0.394417 | 0.870483 | 0.953631 | 0.169348 | | 2024-03-31 | 0.253436 | 0.072892 | 0.797173 | 0.693718 | 0.598168 | | 2024-04-30 | 0.078086 | 0.975894 | 0.587404 | 0.221121 | 0.954267 | | 2024-05-31 | 0.802831 | 0.463103 | 0.899811 | 0.683405 | 0.686487 | | 2024-06-30 | 0.815208 | 0.960528 | 0.880242 | 0.829289 | 0.75831 | | 2024-07-31 | 0.940361 | 0.47543 | 0.736305 | 0.96989 | 0.77564 | | 2024-08-31 | 0.581873 | 0.059003 | 0.303877 | 0.582788 | 0.505404 | | 2024-09-30 | 0.833381 | 0.670322 | 0.880136 | 0.94539 | 0.476044 | | 2024-10-31 | 0.802844 | 0.114463 | 0.964082 | 0.158168 | 0.75319 | | 2024-11-30 | 0.565662 | 0.573383 | 0.062366 | 0.546118 | 0.735715 | ## Advanced Programming ## Pandas in Python ### العمل مع كل تجميع البيانات ### Working with the whole DataFrame #### Peek at the DataFrame contents ```python df.info() ``` #### # index & data types ```python n = 4 ``` #### # get first n rows ```python dfh = df.head(n) ``` #### # get last n rows ```python dft = df.tail(n) ``` #### # summary stats cols ```python dfs = df.describe() ``` ```python top_left_corner_df = df.iloc[:5, :5] ``` #### output: ``` <class 'pandas.core.frame.DataFrame'> DatetimeIndex: 50 entries, 2023-01-31 to 2027-02-28 Freq: M Data columns (total 5 columns): # Column Non-Null Count Dtype 0 50 non-null float64 1 50 non-null float64 2 50 non-null float64 3 50 non-null float64 4 50 non-null float64 dtypes: float64(5) memory usage: 2.3 KB ``` ```python print('dfh=',dfh) print('dft=',dft) print('dfs=',dfs) print(top_left_corner_df) ``` #### dfh= | | 0 | 1 | 2 | 3 | 4 | |---|---|---|---|---|---| | 2023-01-31 | 0.979733 | 0.065809 | 0.130391 | 0.746232 | 0.025002 | | 2023-02-28 | 0.509621 | 0.6818 | 0.183054 | 0.331511 | 0.878733 | | 2023-03-31 | 0.550654 | 0.375979 | 0.047027 | 0.890103 | 0.697349 | | 2023-04-30 | 0.578231 | 0.228297 | 0.516871 | 0.593178 | 0.193102 | #### dft= | | 0 | 1 | 2 | 3 | 4 | |---|---|---|---|---|---| | 2026-11-30 | 0.600341 | 0.516335 | 0.619204 | 0.298736 | 0.028776 | | 2026-12-31 | 0.57043 | 0.881442 | 0.144381 | 0.889185 | 0.693941 | | 2027-01-31 | 0.619109 | 0.686661 | 0.211526 | 0.948044 | 0.561165 | | 2027-02-28 | 0.489892 | 0.607875 | 0.226021 | 0.162858 | 0.426632 | #### dfs= | | 0 | 1 | 2 | 3 | 4 | |---|---|---|---|---|---| | count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | | mean | 0.503658 | 0.508282 | 0.470172 | 0.516284 | 0.506745 | | std | 0.30605 | 0.28233 | 0.290161 | 0.270978 | 0.29106 | | min | 0.002331 | 0.065809 | 0.02523 | 0.049404 | 0.021257 | | 25% | 0.209623 | 0.261014 | 0.202731 | 0.30693 | 0.259059 | | 50% | 0.560542 | 0.533317 | 0.479759 | 0.539808 | 0.493503 | | 75% | 0.764188 | 0.717967 | 0.695986 | 0.737363 | 0.756249 | | max | 0.994563 | 0.965376 | 0.958582 | 0.948044 | 0.990105 | ## Advanced Programming ## Pandas in Python ### التعامل مع التكرار في تجمع البيانات ### DataFrame iteration methods - df.iteritems()#2 (col-index, Series) pairs - df.iterrows() # (row-index, Series) pairs #### # example ... iterating over columns ```python for (name, series) in df.iteritems(): print('Col name: + str(name)) print('First value: ' + str(series.iat[0]) + '\n') ``` #### DataFrame non-indexing attributes - dfT = df.T # transpose rows and cols - 1 = df.axes # list row and col indexes - s = df.dtypes # Series column data types - b = df.empty # True for empty DataFrame - i = df.ndim # number of axes (2) - t = df.shape # (row-count, column-count) - i = df.size # row-count * column-count - a = df.values # get a numpy array for df ##### output: - First value: 0.4469566419073925 Col name: 1 - First value: 0.027193130421518386 Col name: 2 - First value: 0.18912306012374291 Col name: 3 - First value: 0.43137736522311554 Col name: 4 - First value: 0.03952328477224265 ## Advanced Programming ## Pandas in Python ### تجمع البيانات مع الدوال الرياضية ### Maths on the whole DataFrame (not a complete list) - df = df.abs() # absolute values - df = df.add(o) #2 add df, Series or value - s = df.count() #23 non NA/null values - df = df.cummax() # (cols default axis) - df = df.cummin(2) #2 (cols default axis) - df = df.cumsum() # (cols default axis) - df = df.cumprod() #2 (cols default axis) - df = df.diff() # 1st diff (col def axis) - df = df.div(0) #2 div by df, Series, value - df = df.dot(o) # matrix dot product - s = df.max() # max of axis (col def) - s = df.mean() #2 mean (col default axis) - s = df.median()#2 median (col default) - df = df.min() # min of axis (col def) - df = df.mul(0) # mul by df Series val - s = df.sum() # sum axis (cols default) **Note:** The methods that return a series default to working on columns. ##### output: ## Advanced Programming ## Pandas in Python ### تجمع البيانات مع الدوال الرياضية ### Maths on the whole DataFrame (not a complete list) ```python import numpy as np import pandas as pd from pandas import DataFrame, Series df = pd.read_