Chapter 8.pptx
Document Details
Uploaded by ExultantExtraterrestrial
2021
Tags
Full Transcript
Chapter 8 How to analyze the data © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 1 Objectives Applied 1. Melt the data in two or more columns. 2. Group and aggregate...
Chapter 8 How to analyze the data © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 1 Objectives Applied 1. Melt the data in two or more columns. 2. Group and aggregate the data in a DataFrame. 3. Pivot the data in a DataFrame or create a pivot table from the data. 4. Bin the data for a column. 5. Select the rows with the n largest values from a DataFrame, calculate the percent change for the rows in a column of a DataFrame, or rank the rows in a DataFrame by largest or smallest values. © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 2 Objectives (continued) Knowledge 1. Distinguish between the pivot() and the pivot_table() methods. 2. Explain why binning the data in a column is useful. 3. Describe the way you can find other methods for specific analysis tasks. © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 3 The melt() method Method Description melt(params) Melts the data in two or more columns into two columns. Parameters of the melt() method Parameter Description id_vars The column or columns that won’t be melted. value_vars The columns to melt. If none are specified, all will be melted. var_name The name of the column that will contain the melted column names, or “variable” by default. value_name The name of the column that will contain the melted column values, or “value” by default. © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 4 The cars DataFrame cars.head() © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 5 How to use the melt() method cars_melted = pd.melt(cars, id_vars='price', value_vars=['enginesize','curbweight'], var_name='feature', value_name='featureValue') cars_melted © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 6 How to plot melted data with the hue parameter sns.relplot(data=cars_melted, x='featureValue', y='price', hue='feature') © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 7 How to plot melted data with the col parameter sns.relplot(data=cars_melted, x='featureValue', y='price', col='feature', facet_kws={'sharex':False}) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 8 Some of the aggregate methods that are optimized for grouping sum() mean() median() count() std() min() max() © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 9 The fires DataFrame before the data is grouped fires.head(3) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 10 How to get the average for each numeric column in each state fires.groupby('state').mean().head(3) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 11 How to get the maximum value for each month in each state fires.groupby(['state','fire_year','fire_month']) \.max().dropna().head(3) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 12 The groupby() method Method Description groupby(params) Returns a GroupBy object that supports aggregate methods such as sum(). Parameters of the groupby() method Parameter Description by The column or list of columns to group by. as_index If False, doesn’t create an index based on the groupby columns. If True (the default), it does. © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 13 The fires DataFrame fires.head(3) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 14 A GroupBy object with the fire_year column as the index yearly_group = fires.groupby('fire_year') yearly_sums = yearly_group.sum() yearly_sums.head(3) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 15 A GroupBy object without indexes yearly_group = fires.groupby( 'fire_year', as_index=False).sum() yearly_sums = yearly_group.sum() yearly_sums.head(3) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 16 The agg() method Method Description agg() Applies an aggregate method or list of methods to a Series or DataFrame object. © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 17 The GroupBy object monthly_group = fires.groupby( ['state','fire_year','fire_month']) How to apply aggregate methods to all numeric columns monthly_group.agg(['sum','count','mean']).dropna().head( 3) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 18 How to apply aggregate methods to a single column monthly_group.days_burning.agg( ['sum','count','mean']).dropna().head(3) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 19 How to apply varied aggregate methods to numeric columns df = monthly_group.agg({'acres_burned': ['sum','max','min'], 'days_burning':['sum','mean'], 'fire_name':'count'}).dropna() df.head(3) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 20 The pivot() method Method Description pivot(params) Pivots the data based on the index, columns, and values parameters. Parameters of the pivot() method Parameter Description index The column or list of columns to use as the row index (no duplicates). columns The column or list of columns to use as the column index. values The column or list of columns to use to populate the new DataFrame. By default, all remaining columns are used. © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 21 The top_states DataFrame states = ['AK','CA','ID','TX'] top_states = fires.groupby(['state','fire_year'], as_index=False).sum() top_states = top_states.query('state in @states') top_states.head(2) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 22 How to pivot the data top_states.pivot(index='fire_year', columns='state', values='acres_burned').head(2) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 23 How to plot the data with the Pandas plot() method top_states.pivot(index='fire_year', columns='state', values='acres_burned').plot() © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 24 The pivot_table() method Method Description pivot_table(params) Produces a pivot table with an applied aggregate method. Parameters of the pivot_table() method Parameter Description index The column or list of columns to use as the row index (allows duplicates). columns The column or list of columns to use as the column index. values The column or list of columns that contain the values to be aggregated. By default, all non-nuisance columns are aggregated. © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 25 Parameters of the pivot_table() method (cont.) Parameter Description aggfunc The aggregate method or list of methods to be applied to each column in the values parameter. fill_value The value to replace any missing values with in the resulting pivot table. © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 26 How to use the pivot_table() method to create a DataFrame states = ['AK','CA','ID','TX'] fires_top_4 = fires.query('state in @states') fires_top_4 = fires_top_4.pivot_table(index='fire_year', columns='state', values='acres_burned', aggfunc='sum') fires_top_4.head(2) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 27 How to plot the data with the Pandas plot() method fires_top_4.plot() © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 28 The cut() method Method Description cut(params) Bins the data into equal-sized bins. Parameters of the cut() method Parameter Description x The column that contains the data to be binned. bins The number of bins to create, or a list of values for the bin edges. labels The labels to use for the bins. right If set to False, the right edges are not included in the bins. © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 29 The fires_filtered DataFrame fires_filtered = fires.query( 'fire_year == 2010 and days_burning > 0').dropna() How to create four bins for the data pd.cut(fires_filtered.acres_burned, bins=4) =========================================== 1145154 (-296.103, 76535.75] 1145175 (-296.103, 76535.75]... 1879725 (-296.103, 76535.75] 1880370 (-296.103, 76535.75] Name: acres_burned, Length: 1858, dtype: category Categories (4, interval[float64]): [(-296.103, 76535.75] < (76535.75, 153061.5] < (153061.5, 229587.25] < (229587.25, 306113.0]] © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 30 How to add labels to the bins pd.cut(fires_filtered.acres_burned, bins=[0,100000,200000,300000,400000], labels=['small','medium','large','very large']) =================================================== 1145154 small 1145175 small... 1879725 small 1880370 small Name: acres_burned, Length: 1858, dtype: category Categories (4, object): ['small' < 'medium' < 'large' < 'very large'] © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 31 The distribution of values in the bins pd.cut(fires_filtered.acres_burned, bins=[0,100000,200000,300000,400000], labels=['small','medium','large','very large']).value_counts() ================================================================ == small 1855 medium 2 very large 1 large 0 Name: acres_burned, dtype: int64 © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 32 The qcut() method Method Description qcut(params) Bins the data into quantiles with the same number of unique values in each bin. The number of rows in each bin will be skewed if there are duplicates. Parameters of the qcut() method Parameter Description x The column that contains the data to be binned. q The number of quantiles to create. labels The labels to use for the bins. duplicates What to do with bins that have the same edges. The default is raise, which raises a ValueError. If set to drop, the non-unique bins are dropped. © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 33 How to use four quantiles to bin the data pd.qcut(fires_filtered.acres_burned, q=4, labels=['small','medium','large','very large']) ======================================================= 1145154 medium 1145175 very large... 1880209 small 1880370 large Name: acres_burned, Length: 4882, dtype: category Categories (4, object): ['small' < 'medium' < 'large' < 'very large'] © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 34 The distribution of the values in each bin pd.qcut(fires_filtered.acres_burned, q=4, labels=['small','medium','large','very large']).value_counts() ================================================================= ===== medium 1227 small 1221 very large 1220 large 1214 Name: acres_burned, dtype: int64 © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 35 How to assign bin labels to a new column fires_filtered['fire_size'] = pd.qcut( fires_filtered.acres_burned, q=4, labels=['small','medium','large','very large']) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 36 A qcut() method that drops duplicate bins pd.qcut(fires_filtered.days_burning, q=4, labels=['short','medium','long'], duplicates='drop').value_counts() ========================================= short 1018 long 433 medium 407 Name: days_burning, dtype: int64 © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 37 The DataFrame with a fire_size column that bins the data fires_filtered.head() © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 38 How to plot the binned data sns.catplot(data=fires_filtered, kind='count', x='fire_month', hue='fire_size') © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 39 The nlargest() method Method Description nlargest(params) Returns the first n rows with the largest values in the specified columns. Parameters of the nlargest() method Parameter Description n The number of rows to return. columns The columns that determine which rows to keep. keep The rows to keep in the event of a tie. Possible values: first, last, and all. The default is first. © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 40 How to use the nlargest() method cars.nlargest(n=6, columns='enginesize') © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 41 Another nlargest() example cars.nlargest(n=6, columns=['enginesize','price']) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 42 The pct_change() method Method Description pct_change() Calculates the percent change from the previous row to the current row for a DataFrame or Series object. © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 43 The fires data df = fires[['state','fire_year','acres_burned']] \.groupby(['state','fire_year']).sum() df.head() © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 44 How to use the pct_change() method df.pct_change() © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 45 The rank() method Method Description rank(params) Computes numerical data ranks (1 through n) along an axis. Parameters of the rank() method Parameter Description ascending If False, ranks in descending order. If True (the default), ranks in ascending order. method How to rank the group of records that have ties. Possible values include average (the default), min, max, first, and dense. pct If True, displays the rankings in percentile form. False is the default. © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 46 The state totals df = fires.groupby('state').sum() \ [['acres_burned','fire_year','days_burning']] df.head(3) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 47 How to add an acres_rank column based on acres burned df[acres_rank'] = df.acres_burned.rank(ascending=False) df.head(3) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 48 How to add a days_rank column based on days burning df['days_rank'] = df.days_burning.rank(method='max') df.sort_values('days_burning').head(4) © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 49 A Google search for a way to count the unique values in a column © 2021, Mike Murach & Associates, Inc. Murach's Python for Data Analysis C8, Slide 50