Pandas DataFrame Cheat Sheet PDF
Document Details
Uploaded by Deleted User
Tags
Summary
This document provides a cheat sheet for working with pandas DataFrames. It explains how to import necessary libraries, create and manipulate DataFrames, and load data from various sources like CSV files and Excel spreadsheets. The use of pandas for data analysis tasks in Python is the focal point.
Full Transcript
Cheat Sheet: The pandas DataFrame Object Preliminaries Get your data into a DataFrame St...
Cheat Sheet: The pandas DataFrame Object Preliminaries Get your data into a DataFrame Start by importing these Python modules Load a DataFrame from a CSV file import numpy as np df = pd.read_csv('file.csv')# often works import matplotlib.pyplot as plt df = pd.read_csv(‘file.csv’, header=0, import pandas as pd index_col=0, quotechar=’”’,sep=’:’, from pandas import DataFrame, Series na_values = [‘na’, ‘-‘, ‘.’, ‘’]) Note: these are the recommended import aliases Note: refer to pandas docs for all arguments From inline CSV text to a DataFrame from StringIO import StringIO # python2.7 The conceptual model #from io import StringIO # python 3 data = """, Animal, Cuteness, Desirable row-1, dog, 8.7, True DataFrame object: The pandas DataFrame is a two- row-2, bat, 2.6, False""" dimensional table of data with column and row indexes. df = pd.read_csv(StringIO(data), The columns are made up of pandas Series objects. header=0, index_col=0, skipinitialspace=True) Column index (df.columns) Note: skipinitialspace=True allows a pretty layout Load DataFrames from a Microsoft Excel file Series of data Series of data Series of data Series of data Series of data Series of data Series of data # Each Excel sheet in a Python dictionary (df.index) Row index workbook = pd.ExcelFile('file.xlsx') dictionary = {} for sheet_name in workbook.sheet_names: df = workbook.parse(sheet_name) dictionary[sheet_name] = df Note: the parse() method takes many arguments like read_csv() above. Refer to the pandas documentation. Series object: an ordered, one-dimensional array of Load a DataFrame from a MySQL database data with an index. All the data in a Series is of the import pymysql same data type. Series arithmetic is vectorised after first from sqlalchemy import create_engine aligning the Series index for each of the operands. engine = create_engine('mysql+pymysql://' s1 = Series(range(0,4)) # -> 0, 1, 2, 3 +'USER:PASSWORD@localhost/DATABASE') s2 = Series(range(1,5)) # -> 1, 2, 3, 4 df = pd.read_sql_table('table', engine) s3 = s1 + s2 # -> 1, 3, 5, 7 s4 = Series(['a','b'])*3 # -> 'aaa','bbb' Data in Series then combine into a DataFrame # Example 1... The index object: The pandas Index provides the axis s1 = Series(range(6)) labels for the Series and DataFrame objects. It can only s2 = s1 * s1 contain hashable objects. A pandas Series has one s2.index = s2.index + 2# misalign indexes Index; and a DataFrame has two Indexes. df = pd.concat([s1, s2], axis=1) # --- get Index from Series and DataFrame idx = s.index # Example 2... idx = df.columns # the column index s3 = Series({'Tom':1, 'Dick':4, 'Har':9}) idx = df.index # the row index s4 = Series({'Tom':3, 'Dick':2, 'Mar':5}) df = pd.concat({'A':s3, 'B':s4 }, axis=1) # --- some Index attributes Note: 1st method has in integer column labels b = idx.is_monotonic_decreasing Note: 2nd method does not guarantee col order b = idx.is_monotonic_increasing Note: index alignment on DataFrame creation b = idx.has_duplicates i = idx.nlevels # multi-level indexes Get a DataFrame from data in a Python dictionary # default --- assume data is in columns # --- some Index methods df = DataFrame({ a = idx.values() # get as numpy array 'col0' : [1.0, 2.0, 3.0, 4.0], l = idx.tolist() # get as a python list 'col1' : [100, 200, 300, 400] idx = idx.astype(dtype)# change data type }) b = idx.equals(o) # check for equality idx = idx.union(o) # union of two indexes i = idx.nunique() # number unique labels label = idx.min() # minimum label label = idx.max() # maximum label Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter] 1 Get a DataFrame from data in a Python dictionary # --- use helper method for data in rows Working with the whole DataFrame df = DataFrame.from_dict({ # data by row 'row0' : {'col0':0, 'col1':'A'}, Peek at the DataFrame contents 'row1' : {'col0':1, 'col1':'B'} df.info() # index & data types }, orient='index') n = 4 dfh = df.head(n) # get first n rows df = DataFrame.from_dict({ # data by row dft = df.tail(n) # get last n rows 'row0' : [1, 1+1j, 'A'], dfs = df.describe() # summary stats cols 'row1' : [2, 2+2j, 'B'] top_left_corner_df = df.iloc[:5, :5] }, orient='index') DataFrame non-indexing attributes Create play/fake data (useful for testing) dfT = df.T # transpose rows and cols # --- simple l = df.axes # list row and col indexes df = DataFrame(np.random.rand(50,5)) (r, c) = df.axes # from above s = df.dtypes # Series column data types # --- with a time-stamp row index: b = df.empty # True for empty DataFrame df = DataFrame(np.random.rand(500,5)) i = df.ndim # number of axes (2) df.index = pd.date_range('1/1/2006', t = df.shape # (row-count, column-count) periods=len(df), freq='M') (r, c) = df.shape # from above i = df.size # row-count * column-count # --- with alphabetic row and col indexes a = df.values # get a numpy array for df import string import random r = 52 # note: min r is 1; max r is 52 DataFrame utility methods c = 5 dfc = df.copy() # copy a DataFrame df = DataFrame(np.random.randn(r, c), dfr = df.rank() # rank each col (default) columns = ['col'+str(i) for i in dfs = df.sort() # sort each col (default) range(c)], dfc = df.astype(dtype) # type conversion index = list((string.uppercase + string.lowercase)[0:r])) DataFrame iteration methods df['group'] = list( df.iteritems()# (col-index, Series) pairs ''.join(random.choice('abcd') df.iterrows() # (row-index, Series) pairs for _ in range(r)) ) # example... iterating over columns for (name, series) in df.iteritems(): print('Col name: ' + str(name)) print('First value: ' + Saving a DataFrame str(series.iat) + '\n') Saving a DataFrame to a CSV file Maths on the whole DataFrame (not a complete list) df.to_csv('name.csv', encoding='utf-8') df = df.abs() # absolute values df = df.add(o) # add df, Series or value s = df.count() # non NA/null values Saving DataFrames to an Excel Workbook df = df.cummax() # (cols default axis) from pandas import ExcelWriter df = df.cummin() # (cols default axis) writer = ExcelWriter('filename.xlsx') df = df.cumsum() # (cols default axis) df1.to_excel(writer,'Sheet1') df = df.cumprod() # (cols default axis) df2.to_excel(writer,'Sheet2') df = df.diff() # 1st diff (col def axis) writer.save() df = df.div(o) # div by df, Series, value df = df.dot(o) # matrix dot product Saving a DataFrame to MySQL s = df.max() # max of axis (col def) import pymysql s = df.mean() # mean (col default axis) from sqlalchemy import create_engine s = df.median()# median (col default) e = create_engine('mysql+pymysql://' + s = df.min() # min of axis (col def) 'USER:PASSWORD@localhost/DATABASE') df = df.mul(o) # mul by df Series val df.to_sql('TABLE',e, if_exists='replace') s = df.sum() # sum axis (cols default) Note: if_exists ! 'fail', 'replace', 'append' Note: The methods that return a series default to working on columns. Saving a DataFrame to a Python dictionary dictionary = df.to_dict() DataFrame filter/select rows or cols on label info df = df.filter(items=['a', 'b']) # by col Saving a DataFrame to a Python string df = df.filter(items=, axis=0) #by row string = df.to_string() df = df.filter(like='x') # keep x in col df = df.filter(regex='x') # regex in col Note: sometimes may be useful for debugging df = df.select(crit=(lambda x:not x%5))#r Note: select takes a Boolean function, for cols: axis=1 Note: filter defaults to cols; select defaults to rows Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter] 2 Columns value set based on criteria Working with Columns df['b']=df['a'].where(df['a']>0,other=0) df['d']=df['a'].where(df.b!=0,other=df.c) A DataFrame column is a pandas Series object Note: where other can be a Series or a scalar Data type conversions Get column index and labels s = df['col'].astype(str) # Series dtype idx = df.columns # get col index na = df['col'].values # numpy array label = df.columns # 1st col label pl = df['col'].tolist() # python list lst = df.columns.tolist() # get as a list Note: useful dtypes for Series conversion: int, float, str Trap: index lost in conversion from Series to array or list Change column labels df.rename(columns={'old':'new'}, Common column-wide methods/attributes inplace=True) df = df.rename(columns={'a':1,'b':'x'}) value = df['col'].dtype # type of data value = df['col'].size # col dimensions value = df['col'].count()# non-NA count Selecting columns value = df['col'].sum() s = df['colName'] # select col to Series value = df['col'].prod() df = df[['colName']] # select col to df value = df['col'].min() df = df[['a','b']] # select 2 or more value = df['col'].max() df = df[['c','a','b']]# change order value = df['col'].mean() s = df[df.columns] # select by number value = df['col'].median() df = df[df.columns[[0, 3, 4]] # by number value = df['col'].cov(df['col2']) s = df.pop('c') # get col & drop from df s = df['col'].describe() s = df['col'].value_counts() Selecting columns with Python attributes s = df.a # same as s = df['a'] Find index label for min/max values in column # cannot create new columns by attribute label = df['col1'].idxmin() df.existing_col = df.a / df.b label = df['col1'].idxmax() df['new_col'] = df.a / df.b Trap: column names must be valid identifiers. Common column element-wise methods s = df['col'].isnull() Adding new columns to a DataFrame s = df['col'].notnull() # not isnull() df['new_col'] = range(len(df)) s = df['col'].astype(float) df['new_col'] = np.repeat(np.nan,len(df)) s = df['col'].round(decimals=0) df['random'] = np.random.rand(len(df)) s = df['col'].diff(periods=1) df['index_as_col'] = df.index s = df['col'].shift(periods=1) df1[['b','c']] = df2[['e','f']] s = df['col'].to_datetime() df3 = df1.append(other=df2) s = df['col'].fillna(0) # replace NaN w 0 Trap: When adding an indexed pandas object as a new s = df['col'].cumsum() column, only items from the new series that have a s = df['col'].cumprod() corresponding index in the DataFrame will be added. s = df['col'].pct_change(periods=4) The receiving DataFrame is not extended to s = df['col'].rolling_sum(periods=4, accommodate the new series. To merge, see below. window=4) Trap: when adding a python list or numpy array, the Note: also rolling_min(), rolling_max(), and many more. column will be added by integer position. Append a column of row sums to a DataFrame Swap column contents – change column order df['Total'] = df.sum(axis=1) df[['B', 'A']] = df[['A', 'B']] Note: also means, mins, maxs, etc. Dropping columns (mostly by label) Multiply every column in DataFrame by Series df = df.drop('col1', axis=1) df = df.mul(s, axis=0) # on matched rows df.drop('col1', axis=1, inplace=True) Note: also add, sub, div, etc. df = df.drop(['col1','col2'], axis=1) s = df.pop('col') # drops from frame Selecting columns with.loc,.iloc and.ix del df['col'] # even classic python works df = df.loc[:, 'col1':'col2'] # inclusive df.drop(df.columns, inplace=True) df = df.iloc[:, 0:2] # exclusive Vectorised arithmetic on columns Get the integer position of a column index label df['proportion']=df['count']/df['total'] j = df.columns.get_loc('col_name') df['percent'] = df['proportion'] * 100.0 Test if column index values are unique/monotonic Apply numpy mathematical functions to columns if df.columns.is_unique: pass #... df['log_data'] = np.log(df['col1']) b = df.columns.is_monotonic_increasing df['rounded'] = np.round(df['col2'], 2) b = df.columns.is_monotonic_decreasing Note: Many more mathematical functions Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter] 3 Select a slice of rows by label/index Working with rows [inclusive-from : inclusive–to [ : step]] df = df['a':'c'] # rows 'a' through 'c' Get the row index and labels Trap: doesn't work on integer labelled rows idx = df.index # get row index label = df.index # 1st row label Append a row of column totals to a DataFrame lst = df.index.tolist() # get as a list # Option 1: use dictionary comprehension sums = {col: df[col].sum() for col in df} Change the (row) index sums_df = DataFrame(sums,index=['Total']) df.index = idx # new ad hoc index df = df.append(sums_df) df.index = range(len(df)) # set with list df = df.reset_index() # replace old w new # Option 2: All done with pandas # note: old index stored as a col in df df = df.append(DataFrame(df.sum(), df = df.reindex(index=range(len(df))) columns=['Total']).T) df = df.set_index(keys=['r1','r2','etc']) df.rename(index={'old':'new'}, Iterating over DataFrame rows inplace=True) for (index, row) in df.iterrows(): # pass Trap: row data type may be coerced. Adding rows df = original_df.append(more_rows_in_df) Sorting DataFrame rows values Hint: convert to a DataFrame and then append. Both df = df.sort(df.columns, DataFrames should have same column labels. ascending=False) df.sort(['col1', 'col2'], inplace=True) Dropping rows (by name) df = df.drop('row_label') Random selection of rows df = df.drop(['row1','row2']) # multi-row import random as r k = 20 # pick a number Boolean row selection by values in a column selection = r.sample(range(len(df)), k) df_sample = df.iloc[selection, :] df = df[df['col2'] >= 0.0] df = df[(df['col3']>=1.0) | Note: this sample is not sorted (df['col1']= 2].index match. If not a unique match, may return a slice or print(df.ix[idx]) mask. Select a slice of rows by integer position Get integer position of rows that meet condition [inclusive-from : exclusive-to [: step]] a = np.where(df['col'] >= 2) #numpy array default start is 0; default end is len(df) df = df[:] # copy DataFrame Test if the row index values are unique/monotonic df = df[0:2] # rows 0 and 1 df = df[-1:] # the last row if df.index.is_unique: pass #... df = df[2:3] # row 2 (the third row) b = df.index.is_monotonic_increasing df = df[:-1] # all but the last row b = df.index.is_monotonic_decreasing df = df[::2] # every 2nd row (0 2..) Trap: a single integer without a colon is a column label for integer numbered columns. Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter] 4 Working with cells In summary: indexes and addresses Selecting a cell by row and column labels In the main, these notes focus on the simple, single value = df.at['row', 'col'] level Indexes. Pandas also has a hierarchical or value = df.loc['row', 'col'] multi-level Indexes (aka the MultiIndex). value = df['col'].at['row'] # tricky Note:.at[] fastest label based scalar lookup A DataFrame has two Indexes Typically, the column index (df.columns) is a list of Setting a cell by row and column labels strings (observed variable names) or (less df.at['row, 'col'] = value commonly) integers (the default is numbered from 0 df.loc['row, 'col'] = value to length-1) df['col'].at['row'] = value # tricky Typically, the row index (df.index) might be: o Integers - for case or row numbers (default is Selecting and slicing on labels numbered from 0 to length-1); df = df.loc['row1':'row3', 'col1':'col3'] o Strings – for case names; or Note: the "to" on this slice is inclusive. o DatetimeIndex or PeriodIndex – for time series data (more below) Setting a cross-section by labels df.loc['A':'C', 'col1':'col3'] = np.nan Indexing df.loc[1:2,'col1':'col2']=np.zeros((2,2)) # --- selecting columns df.loc[1:2,'A':'C']=othr.loc[1:2,'A':'C'] s = df['col_label'] # scalar Remember: inclusive "to" in the slice df = df[['col_label']] # one item list df = df[['L1', 'L2']] # many item list Selecting a cell by integer position df = df[index] # pandas Index df = df[s] # pandas Series value = df.iat[9, 3] # [row, col] value = df.iloc[0, 0] # [row, col] # --- selecting rows value = df.iloc[len(df)-1, df = df['from':'inc_to']# label slice len(df.columns)-1] df = df[3:7] # integer slice df = df[df['col'] > 0.5]# Boolean Series Selecting a range of cells by int position df = df.iloc[2:4, 2:4] # subset of the df df = df.loc['label'] # single label df = df.iloc[:5, :5] # top left corner df = df.loc[container] # lab list/Series s = df.iloc[5, :] # returns row as Series df = df.loc['from':'to']# inclusive slice df = df.iloc[5:6, :] # returns row as row df = df.loc[bs] # Boolean Series Note: exclusive "to" – same as python list slicing. df = df.iloc # single integer df = df.iloc[container] # int list/Series Setting cell by integer position df = df.iloc[0:5] # exclusive slice df.iloc[0, 0] = value # [row, col] df = df.ix[x] # loc then iloc df.iat[7, 8] = value # --- select DataFrame cross-section # r and c can be scalar, list, slice Setting cell range by integer position df.loc[r, c] # label accessor (row, col) df.iloc[0:3, 0:5] = value df.iloc[r, c]# integer accessor df.iloc[1:3, 1:4] = np.ones((2, 3)) df.ix[r, c] # label access int fallback df.iloc[1:3, 1:4] = np.zeros((2, 3)) df[c].iloc[r]# chained – also for.loc df.iloc[1:3, 1:4] = np.array([[1, 1, 1], [2, 2, 2]]) # --- select cell Remember: exclusive-to in the slice # r and c must be label or integer df.at[r, c] # fast scalar label accessor.ix for mixed label and integer position indexing df.iat[r, c] # fast scalar int accessor value = df.ix[5, 'col1'] df[c].iat[r] # chained – also for.at df = df.ix[1:5, 'col1':'col3'] # --- indexing methods Views and copies v = df.get_value(r, c) # get by row, col From the manual: Setting a copy can cause subtle df = df.set_value(r,c,v)# set by row, col errors. The rules about when a view on the data is df = df.xs(key, axis) # get cross-section df = df.filter(items, like, regex, axis) returned are dependent on NumPy. Whenever an array df = df.select(crit, axis) of labels or a Boolean vector are involved in the indexing operation, the result will be a copy. Note: the indexing attributes (.loc,.iloc,.ix,.at.iat) can be used to get and set values in the DataFrame. Note: the.loc, iloc and.ix indexing attributes can accept python slice objects. But.at and.iat do not. Note:.loc can also accept Boolean Series arguments Avoid: chaining in the form df[col_indexer][row_indexer] Trap: label slices are inclusive, integer slices exclusive. Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter] 5 Joining/Combining DataFrames Groupby: Split-Apply-Combine Three ways to join two DataFrames: The pandas "groupby" mechanism allows us to split the merge (a database/SQL-like join operation) data into groups, apply a function to each group concat (stack side by side or one on top of the other) independently and then combine the results. combine_first (splice the two together, choosing values from one over the other) Grouping gb = df.groupby('cat') # by one columns Merge on indexes gb = df.groupby(['c1','c2']) # by 2 cols df_new = pd.merge(left=df1, right=df2, gb = df.groupby(level=0) # multi-index gb how='outer', left_index=True, gb = df.groupby(level=['a','b']) # mi gb right_index=True) print(gb.groups) How: 'left', 'right', 'outer', 'inner' Note: groupby() returns a pandas groupby object How: outer=union/all; inner=intersection Note: the groupby object attribute.groups contains a dictionary mapping of the groups. Merge on columns Trap: NaN values in the group key are automatically df_new = pd.merge(left=df1, right=df2, dropped – there will never be a NA group. how='left', left_on='col1', right_on='col2') Iterating groups – usually not needed Trap: When joining on columns, the indexes on the for name, group in gb: passed DataFrames are ignored. print (name) Trap: many-to-many merges on a column can result in print (group) an explosion of associated data. Selecting a group Join on indexes (another way of merging) dfa = df.groupby('cat').get_group('a') df_new = df1.join(other=df2, on='col1', dfb = df.groupby('cat').get_group('b') how='outer') df_new = df1.join(other=df2,on=['a','b'], Applying an aggregating function how='outer') # apply to a column... Note: DataFrame.join() joins on indexes by default. s = df.groupby('cat')['col1'].sum() DataFrame.merge() joins on common columns by s = df.groupby('cat')['col1'].agg(np.sum) default. # apply to the every column in DataFrame s = df.groupby('cat').agg(np.sum) Simple concatenation is often the best df_summary = df.groupby('cat').describe() df=pd.concat([df1,df2],axis=0)#top/bottom df_row_1s = df.groupby('cat').head(1) df = df1.append([df2, df3]) #top/bottom Note: aggregating functions reduce the dimension by df=pd.concat([df1,df2],axis=1)#left/right one – they include: mean, sum, size, count, std, var, Trap: can end up with duplicate rows or cols sem, describe, first, last, min, max Note: concat has an ignore_index parameter Applying multiple aggregating functions Combine_first gb = df.groupby('cat') df = df1.combine_first(other=df2) # apply multiple functions to one column # multi-combine with python reduce() dfx = gb['col2'].agg([np.sum, np.mean]) df = reduce(lambda x, y: # apply to multiple fns to multiple cols x.combine_first(y), dfy = gb.agg({ [df1, df2, df3, df4, df5]) 'cat': np.count_nonzero, 'col1': [np.sum, np.mean, np.std], Uses the non-null values from df1. The index of the 'col2': [np.min, np.max] combined DataFrame will be the union of the indexes }) from df1 and df2. Note: gb['col2'] above is shorthand for df.groupby('cat')['col2'], without the need for regrouping. Transforming functions # transform to group z-scores, which have # a group mean of 0, and a std dev of 1. zscore = lambda x: (x-x.mean())/x.std() dfz = df.groupby('cat').transform(zscore) # replace missing data with group mean mean_r = lambda x: x.fillna(x.mean()) dfm = df.groupby('cat').transform(mean_r) Note: can apply multiple transforming functions in a manner similar to multiple aggregating functions above, Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter] 6 Applying filtering functions Filtering functions allow you to make selections based Working with dates, times and their indexes on whether each group meets specified criteria # select groups with more than 10 members Dates and time – points and spans eleven = lambda x: (len(x['col1']) >= 11) With its focus on time-series data, pandas has a suite of df11 = df.groupby('cat').filter(eleven) tools for managing dates and time: either as a point in time (a Timestamp) or as a span of time (a Period). Group by a row index (non-hierarchical index) t = pd.Timestamp('2013-01-01') df = df.set_index(keys='cat') t = pd.Timestamp('2013-01-01 21:15:06') s = df.groupby(level=0)['col1'].sum() t = pd.Timestamp('2013-01-01 21:15:06.7') dfg = df.groupby(level=0).sum() p = pd.Period('2013-01-01', freq='M') Note: Timestamps should be in range 1678 and 2261 years. (Check Timestamp.max and Timestamp.min). Pivot Tables A Series of Timestamps or Periods ts = ['2015-04-01 13:17:27', Pivot '2014-04-02 13:17:29'] Pivot tables move from long format to wide format data df = DataFrame(np.random.rand(100,1)) # Series of Timestamps (good) df.columns = ['data'] # rename col s = pd.to_datetime(pd.Series(ts)) df.index = pd.period_range('3/3/2014', periods=len(df), freq='M') # Series of Periods (often not so good) df['year'] = df.index.year s = pd.Series( [pd.Period(x, freq='M') df['month'] = df.index.month for x in ts] ) s = pd.Series( # pivot to wide format pd.PeriodIndex(ts,freq='S')) df = df.pivot(index='year', Note: While Periods make a very useful index; they may columns='month', values='data') be less useful in a Series. # melt to long format From non-standard strings to Timestamps dfm = df t = ['09:08:55.7654-JAN092002', dfm['year'] = dfm.index '15:42:02.6589-FEB082016'] dfm = pd.melt(df, id_vars=['year'], s = pd.Series(pd.to_datetime(t, var_name='month', value_name='data') format="%H:%M:%S.%f-%b%d%Y")) Also: %B = full month name; %m = numeric month; # unstack to long format %y = year without century; and more … # reset index to remove multi-level index dfu=df.unstack().reset_index(name='data') Dates and time – stamps and spans as indexes An index of Timestamps is a DatetimeIndex. Value counts An index of Periods is a PeriodIndex. s = df['col1'].value_counts() date_strs = ['2014-01-01', '2014-04-01', '2014-07-01', '2014-10-01'] dti = pd.DatetimeIndex(date_strs) pid = pd.PeriodIndex(date_strs, freq='D') pim = pd.PeriodIndex(date_strs, freq='M') piq = pd.PeriodIndex(date_strs, freq='Q') print (pid - pid) # 90 days print (pim - pim) # 3 months print (piq - piq) # 1 quarter time_strs = ['2015-01-01 02:10:40.12345', '2015-01-01 02:10:50.67890'] pis = pd.PeriodIndex(time_strs, freq='U') df.index = pd.period_range('2015-01', periods=len(df), freq='M') dti = pd.to_datetime(['04-01-2012'], dayfirst=True) # Australian date format pi = pd.period_range('1960-01-01', '2015-12-31', freq='M') Hint: unless you are working in less than seconds, prefer PeriodIndex over DateTimeImdex. Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter] 7 Period frequency constants (not a complete list) Upsampling and downsampling Name Description # upsample from quarterly to monthly U Microsecond pi = pd.period_range('1960Q1', L Millisecond periods=220, freq='Q') S Second df = DataFrame(np.random.rand(len(pi),5), index=pi) T Minute dfm = df.resample('M', convention='end') H Hour # use ffill or bfill to fill with values D Calendar day B Business day # downsample from monthly to quarterly W-{MON, TUE, …} Week ending on … dfq = dfm.resample('Q', how='sum') MS Calendar start of month M Calendar end of month Time zones QS-{JAN, FEB, …} Quarter start with year starting t = ['2015-06-30 00:00:00', (QS – December) '2015-12-31 00:00:00'] Q-{JAN, FEB, …} Quarter end with year ending (Q dti = pd.to_datetime(t – December) ).tz_localize('Australia/Canberra') AS-{JAN, FEB, …} Year start (AS - December) dti = dti.tz_convert('UTC') ts = pd.Timestamp('now', A-{JAN, FEB, …} Year end (A - December) tz='Europe/London') From DatetimeIndex to Python datetime objects # get a list of all time zones dti = pd.DatetimeIndex(pd.date_range( import pyzt start='1/1/2011', periods=4, freq='M')) for tz in pytz.all_timezones: s = Series([1,2,3,4], index=dti) print tz na = dti.to_pydatetime() #numpy array Note: by default, Timestamps are created without time na = s.index.to_pydatetime() #numpy array zone information. Frome Timestamps to Python dates or times Row selection with a time-series index df['date'] = [x.date() for x in df['TS']] # start with the play data above df['time'] = [x.time() for x in df['TS']] idx = pd.period_range('2015-01', Note: converts to datatime.date or datetime.time. But periods=len(df), freq='M') does not convert to datetime.datetime. df.index = idx From DatetimeIndex to PeriodIndex and back february_selector = (df.index.month == 2) df = DataFrame(np.random.randn(20,3)) february_data = df[february_selector] df.index = pd.date_range('2015-01-01', periods=len(df), freq='M') q1_data = df[(df.index.month >= 1) & dfp = df.to_period(freq='M') (df.index.month True print(s.dtype) # datetime64[ns] print(s.dt.quarter) # 2, 2, 2 The tail of a time-series DataFrame print(s.dt.year) # 2012, 2011, 2010 df = df.last("5M") # the last five months Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter] 8 Working with missing and non-finite data Working with Categorical Data Working with missing data Categorical data Pandas uses the not-a-number construct (np.nan and The pandas Series has an R factors-like data type for float('nan')) to indicate missing data. The Python None encoding categorical data. can arise in data as well. It is also treated as missing s = Series(['a','b','a','c','b','d','a'], data; as is the pandas not-a-time construct dtype='category') (pandas.NaT). df['B'] = df['A'].astype('category') Note: the key here is to specify the "category" data type. Missing data in a Series Note: categories will be ordered on creation if they are s = Series( [8,None,float('nan'),np.nan]) sortable. This can be turned off. See ordering below. #[8, NaN, NaN, NaN] s.isnull() #[False, True, True, True] Convert back to the original data type s.notnull()#[True, False, False, False] s = Series(['a','b','a','c','b','d','a'], s.fillna(0)#[8, 0, 0, 0] dtype='category') s = s.astype('string') Missing data in a DataFrame df = df.dropna() # drop all rows with NaN Ordering, reordering and sorting df = df.dropna(axis=1) # same for cols s = Series(list('abc'), dtype='category') df=df.dropna(how='all') #drop all NaN row print (s.cat.ordered) df=df.dropna(thresh=2) # drop 2+ NaN in r s=s.cat.reorder_categories(['b','c','a']) # only drop row if NaN in a specified col s = s.sort() df = df.dropna(df['col'].notnull()) s.cat.ordered = False Trap: category must be ordered for it to be sorted Recoding missing data df.fillna(0, inplace=True) # np.nan ! 0 Renaming categories s = df['col'].fillna(0) # np.nan ! 0 s = Series(list('abc'), dtype='category') df = df.replace(r'\s+', np.nan, s.cat.categories = [1, 2, 3] # in place regex=True) # white space ! np.nan s = s.cat.rename_categories([4,5,6]) # using a comprehension... Non-finite numbers s.cat.categories = ['Group ' + str(i) With floating point numbers, pandas provides for for i in s.cat.categories] positive and negative infinity. Trap: categories must be uniquely named s = Series([float('inf'), float('-inf'), np.inf, -np.inf]) Adding new categories Pandas treats integer comparisons with plus or minus s = s.cat.add_categories() infinity as expected. Removing categories Testing for finite numbers s = s.cat.remove_categories() (using the data from the previous example) s.cat.remove_unused_categories() #inplace b = np.isfinite(s) Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter] 9 Working with strings Basic Statistics Working with strings Summary statistics # assume that df['col'] is series of s = df['col1'].describe() strings df1 = df.describe() s = df['col'].str.lower() s = df['col'].str.upper() DataFrame – key stats methods s = df['col'].str.len() df.corr() # pairwise correlation cols df.cov() # pairwise covariance cols # the next set work like Python df.kurt() # kurtosis over cols (def) df['col'] += 'suffix' # append df.mad() # mean absolute deviation df['col'] *= 2 # duplicate df.sem() # standard error of mean s = df['col1'] + df['col2'] # concatenate df.var() # variance over cols (def) Most python string functions are replicated in the pandas DataFrame and Series objects. Value counts s = df['col1'].value_counts() Regular expressions s = df['col'].str.contains('regex') Cross-tabulation (frequency count) s = df['col'].str.startswith('regex') s = df['col'].str.endswith('regex') ct = pd.crosstab(index=df['a'], s = df['col'].str.replace('old', 'new') cols=df['b']) df['b'] = df.a.str.extract('(pattern)') Note: pandas has many more regex methods. Quantiles and ranking quants = [0.05, 0.25, 0.5, 0.75, 0.95] q = df.quantile(quants) r = df.rank() Histogram binning count, bins = np.histogram(df['col1']) count, bins = np.histogram(df['col'], bins=5) count, bins = np.histogram(df['col1'], bins=[-3,-2,-1,0,1,2,3,4]) Regression import statsmodels.formula.api as sm result = sm.ols(formula="col1 ~ col2 + col3", data=df).fit() print (result.params) print (result.summary()) Smoothing example using rolling_apply k3x5 = np.array([1,2,3,3,3,2,1]) / 15.0 s = pd.rolling_apply(df['col1'], window=7, func=lambda x: (x * k3x5).sum(), min_periods=7, center=True) Cautionary note This cheat sheet was cobbled together by bots roaming the dark recesses of the Internet seeking ursine and pythonic myths. There is no guarantee the narratives were captured and transcribed accurately. You use these notes at your own risk. You have been warned. Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter] 10