Sports Analytics Python Notes.pdf



Sports Analytics Course 1
1) Review Python basics (Module 2)
2) Pythagorean expectation
3) Pythagorean Expectation Predictor
4) Regression Analysis

Course 2
1) Hakes and Sauer Table

Course 3
1) Predictive modeling, basics of forecasting (last lab, not first two)
2) NHL forecasting model (Module 4… it's OK to not know everything perfectly)

Course 1 Review Python Basics Lesson 1

Importing Libraries
1) import pandas as pd
2) import numpy as np
pandas (pd): Used for data manipulation and analysis.
numpy (np): Used for numerical operations.

Reading, Shaping and Displaying NBA Teams Data
3) NBA_Teams = pd.read_csv("../../Data/Week 2/nba_teams.csv")
4) display(NBA_Teams)
5) print(NBA_Teams.shape)
pd.read_csv(): Loads data from a CSV file into a DataFrame.
display(): Displays the DataFrame in a readable format.
.shape: Returns the dimensions of the DataFrame (rows, columns).

Renaming Columns in NBA_Teams DataFrame
6) NBA_Teams.rename(columns={'Unnamed: 0': 'TEAM_NUMBER', 'ID': 'TEAM_ID'}, inplace=True)
.rename(columns={}, inplace=True): Renames columns. inplace=True modifies the DataFrame directly without needing to reassign it.

Dropping a Column
7) NBA_Teams.drop(['TEAM_NUMBER'], axis=1, inplace=True)
.drop(columns, axis, inplace=True): Removes specified columns. axis=1 specifies columns, and inplace=True applies changes directly to the DataFrame.

Reading and Displaying Basketball Games Data
8) Games = pd.read_csv("../../Data/Week 2/basketball_games.csv")
9) display(Games.head())
10) Games.drop([0], axis=0, inplace=True)
11) Games = Games[Games.TEAM_NAME != "Las Vegas Aces"]
Games.drop([0], axis=0, inplace=True): Drops the first row (index 0).
Games[Games.TEAM_NAME != "Las Vegas Aces"]: Filters out rows where the team name is "Las Vegas Aces".

Merging DataFrames
12) NBA_Games = pd.merge(NBA_Teams, Games, on=['TEAM_ID', 'TEAM_NAME'])
13) display(NBA_Games.head())
pd.merge(): Merges two DataFrames on specified columns.

Dropping Unnecessary Columns
14) NBA_Games.drop(['ABBREVIATION'], axis=1, inplace=True, errors='ignore')
errors='ignore': Ignores errors if the column does not exist.

Sorting Values and Displaying Top Rows
15) NBA_Games.sort_values(by=['GAME_ID'], ascending=[False], inplace=True)
.sort_values(by, ascending, inplace=True): Sorts the DataFrame by specified columns. ascending=[False] sorts in descending order.

Filtering Non-null Rows
16) NBA_Games = NBA_Games[pd.notnull(NBA_Games["FG_PCT"])]
17) NBA_Games.shape
pd.notnull(): Filters out rows with null values in the specified column.

Filling NaN Values with Column Mean
18) NBA_Games = NBA_Games.fillna(NBA_Games.mean())
19) NBA_Games.info()
.fillna(value, inplace=True): Fills null values with specified values; here NBA_Games.mean() fills each column with its mean.

Adding a New Column for Total Made Shots and Game Results
20) NBA_Games['GM'] = NBA_Games['FGM'] + NBA_Games['FG3M'] + NBA_Games['FTM']
21) NBA_Games['RESULT'] = np.where(NBA_Games['PLUS_MINUS'] > 0, 'W', 'L')
['GM']: Creates a new column in the DataFrame.
np.where(condition, value_if_true, value_if_false): Creates a new column based on a condition.
Sorting Values and Adding Point Difference Column
22) NBA_Games["POINT_DIFF"] = NBA_Games.groupby(["GAME_ID"])["PTS"].diff()
23) NBA_Games['POINT_DIFF'] = NBA_Games['POINT_DIFF'].fillna(NBA_Games.groupby('GAME_ID')['POINT_DIFF'].transform('mean'))
.groupby(): Groups data by specified columns.
.diff(): Calculates the difference between consecutive rows.
.transform('mean'): Applies the mean function within each group.

Grouping and Summing Statistics by Team and Season
24) NBA_Team_Stats = NBA_Games.groupby(['TEAM_ID', 'SEASON_ID']).sum().reset_index()
25) NBA_Game_Count = NBA_Games.groupby(['TEAM_ID', 'SEASON_ID']).size().reset_index(name='GAME_COUNT')
.sum(): Sums up the values for each group.
.reset_index(): Resets the index to convert the grouped data back into a DataFrame.
.size(): Counts the number of occurrences in each group.

Saving the Cleaned Data to CSV
26) NBA_Games.to_csv("../../Data/Week 2/NBA_Games.csv", index=False)
.to_csv(): Saves the DataFrame to a CSV file. index=False excludes the index from the file.

Summary:
Reading Data: Loaded NBA teams and basketball games data from CSV files.
Cleaning and Preparing Data: Renamed columns, dropped unnecessary columns and rows, and filtered data.
Merging Data: Combined NBA teams and games data into a single DataFrame.
Data Transformation: Created new columns for analysis, filled missing values, and dropped rows with NaN values.
Grouping and Aggregation: Calculated summary statistics and game counts for each team and season.
Saving Data: Exported the cleaned and transformed data to a new CSV file.

Key Points:
1. Inplace Operations: inplace=True modifies the DataFrame directly, saving memory and potentially improving performance.
2. Data Filtering and Cleaning:
○ Dropping rows and columns (.drop()).
○ Filling missing values (.fillna()).
○ Filtering out null values (pd.notnull()).
3. Merging and Grouping:
○ Combining DataFrames (pd.merge()).
○ Grouping data to perform aggregate calculations (.groupby()).
4. Creating New Columns:
○ Using arithmetic operations to create new columns.
○ Using conditional statements (np.where()).
5. Sorting and Saving Data:
○ Sorting values (.sort_values()).
○ Exporting cleaned data to CSV (.to_csv()).

When Code is Used:
Data Loading and Display: pd.read_csv(), display(), .head(), .info()
Data Cleaning and Transformation: .rename(), .drop(), .fillna(), .dropna(), pd.notnull(), .columns, .sort_values()
Data Merging and Grouping: pd.merge(), .groupby(), .sum(), .diff(), .size()
Data Exporting: .to_csv()
axis=1: Specifies operations on columns (use axis=0 for rows).
errors='ignore': Prevents errors if the operation does not apply (e.g., dropping a non-existent column).

Course 1 Pythagorean Expectation Lesson 2
Pythagorean Expectation in MLB / NBA:
Definition - The Pythagorean expectation is a formula used in sports analytics to estimate the winning percentage of a team based on the number of runs (or points) they score and allow. It is based on the idea that the ratio of runs scored to runs allowed predicts a team's success better than the simple win-loss record. Because it accounts for the underlying strength of a team's offense and defense, the Pythagorean expectation often predicts performance more accurately than the win-loss record alone. Analysts use it to identify teams that may have been lucky or unlucky relative to their actual record, to project future performance, and to evaluate the impact of potential trades or roster changes.
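As a quick reference, the formula itself can be written as a small helper (a minimal sketch; the function name and example numbers below are illustrative, not from the course data):

# Pythagorean expectation: expected win fraction from runs scored (R) and runs allowed (RA)
def pythagorean_expectation(runs_scored: float, runs_allowed: float, exponent: float = 2.0) -> float:
    return runs_scored**exponent / (runs_scored**exponent + runs_allowed**exponent)

# Example: a team that scores 800 runs and allows 700 over a season
print(round(pythagorean_expectation(800, 700), 3))  # ~0.566 expected winning percentage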
Steps in Order when applying the Pythagorean Expectation:
1) Importing Libraries: We imported the necessary libraries (pandas, numpy, statsmodels, matplotlib.pyplot, seaborn) to handle data, perform statistical analysis, and visualize results.
2) Loading Data: We loaded game log data from an Excel file using pd.read_excel() into a pandas DataFrame (MLB). We also printed the list of column names to understand the structure of the dataset.
3) Data Selection and Renaming: From the loaded DataFrame (MLB), we selected the columns relevant to the analysis (VisitingTeam, HomeTeam, VisitorRunsScored, HomeRunsScore, Date). We renamed columns (VisitorRunsScored to VisR, HomeRunsScore to HomR) for simplicity.
4) Adding Variables: We added new columns (hwin, awin, count) to the MLB18 DataFrame using np.where():
○ hwin: Indicates whether the home team won (1 if HomR > VisR, otherwise 0).
○ awin: Indicates whether the away team won (1 if HomR < VisR, otherwise 0).
○ count: Counts each game (set to 1 for each row).
5) Grouping Data: We grouped MLB18 by HomeTeam and VisitingTeam separately using .groupby() and calculated sums (hwin, HomR, VisR, count) for each team. We used .reset_index() to reset the index after grouping.
6) Merging DataFrames: We merged the grouped DataFrames (MLBhome and MLBaway) on the column 'team' to create a comprehensive DataFrame (MLB18) containing aggregated statistics for each team.
7) Calculating Additional Statistics: We calculated additional statistics (W for total wins, G for total games played, R for total runs scored, RA for total runs allowed) from the grouped and merged data.
8) Calculating Win Percentage and Pythagorean Expectation: We computed win percentage (wpc) as W / G and Pythagorean Expectation (pyth) as R^2 / (R^2 + RA^2).
9) Data Visualization: We visualized the relationship between win percentage (wpc) and Pythagorean Expectation (pyth) using a scatter plot (sns.relplot() from Seaborn).
10) Regression Analysis: We performed a linear regression analysis (smf.ols()) to explore the relationship between wpc and pyth, fitting the model and summarizing the results (pyth_lm.summary()).

Step 1: Importing Necessary Libraries
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
Pandas: For data manipulation and analysis. NumPy: For numerical operations. Statsmodels: For statistical modeling. Matplotlib: For plotting. Seaborn: For enhanced data visualization.

Step 2: Loading Data
MLB = pd.read_excel('../../Data/Week 1/Retrosheet MLB game log 2018.xlsx')
print(MLB.columns.tolist())
pd.read_excel(): Reads an Excel file into a pandas DataFrame (MLB).
print(MLB.columns.tolist()): Prints the list of column names in the DataFrame MLB.

Step 3: Data Selection and Renaming
MLB18 = MLB[['VisitingTeam','HomeTeam','VisitorRunsScored','HomeRunsScore','Date']]
MLB18 = MLB18.rename(columns={'VisitorRunsScored':'VisR','HomeRunsScore':'HomR'})
MLB[['VisitingTeam','HomeTeam','VisitorRunsScored','HomeRunsScore','Date']]: Selects specific columns from MLB.
MLB18.rename(): Renames columns for easier referencing (VisitorRunsScored to VisR, HomeRunsScore to HomR).

Step 4: Adding Variables
MLB18['hwin'] = np.where(MLB18['HomR'] > MLB18['VisR'], 1, 0)
MLB18['awin'] = np.where(MLB18['HomR'] < MLB18['VisR'], 1, 0)
MLB18['count'] = 1
np.where(condition, 1, 0): HomR > VisR assigns 1 to hwin, otherwise 0 (and the reverse for awin); count is set to 1 for every game.
Step 5: Grouping Data
MLBhome = MLB18.groupby('HomeTeam')[['hwin','HomR','VisR','count']].sum().reset_index()
MLBaway = MLB18.groupby('VisitingTeam')[['awin','HomR','VisR','count']].sum().reset_index()
MLBhome = MLBhome.rename(columns={'HomeTeam':'team','HomR':'HomRh','VisR':'VisRh','count':'Gh'})
MLBaway = MLBaway.rename(columns={'VisitingTeam':'team','HomR':'HomRa','VisR':'VisRa','count':'Ga'})
groupby(...).sum(): Groups data by HomeTeam (or VisitingTeam) and sums hwin (or awin), HomR, VisR, and count for each team.
reset_index(): Resets the index to the default integer index after grouping.
The renames give both DataFrames a common 'team' key and the distinct home/away column names (Gh/Ga, HomRh/HomRa, VisRh/VisRa) used in Step 7.

Step 6: Merging DataFrames
MLB18 = pd.merge(MLBhome, MLBaway, on='team')
pd.merge: Merges the MLBhome and MLBaway DataFrames on the column 'team'.

Step 7: Calculating Additional Statistics
MLB18['W'] = MLB18['hwin'] + MLB18['awin']
MLB18['G'] = MLB18['Gh'] + MLB18['Ga']
MLB18['R'] = MLB18['HomRh'] + MLB18['VisRa']
MLB18['RA'] = MLB18['VisRh'] + MLB18['HomRa']
Calculates total wins (W), total games played (G), total runs scored (R), and total runs allowed (RA).

Step 8: Calculating Win Percentage and Pythagorean Expectation
MLB18['wpc'] = MLB18['W']/MLB18['G']
MLB18['pyth'] = MLB18['R']**2/(MLB18['R']**2 + MLB18['RA']**2)
wpc: Win percentage (W divided by G).
pyth: Pythagorean Expectation, R^2 / (R^2 + RA^2).

Step 9: Data Visualization
sns.relplot(x="pyth", y="W", data=MLB18)
sns.relplot: Generates a scatter plot using Seaborn, with pyth on the x-axis and W on the y-axis.

Step 10: Regression Analysis
pyth_lm = smf.ols(formula='wpc ~ pyth', data=MLB18).fit()
pyth_lm.summary()
smf.ols: Fits a linear regression model with wpc as the dependent variable and pyth as the independent variable.
.fit(): Fits the model to the data.
.summary(): Prints the summary of the regression results.

Course 1 Pythagorean Expectation Predictor Lesson 3
The Pythagorean predictor analysis in baseball estimates the number of wins a team should have based on the runs they score and allow. It is based on the Pythagorean Expectation formula, wpc = R^2 / (R^2 + RA^2), where R is runs scored and RA is runs allowed. This formula can help predict a team's performance and compare it with their actual win-loss record.
Step 1: Import Libraries
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Read Data
MLB = pd.read_excel('../../Data/Week 1/Retrosheet MLB game log 2018.xlsx')
print(MLB.columns.tolist())

Step 3: Create a DataFrame with Necessary Variables
MLB18 = MLB[['VisitingTeam','HomeTeam','VisitorRunsScored','HomeRunsScore','Date']]
MLB18 = MLB18.rename(columns={'VisitorRunsScored':'VisR','HomeRunsScore':'HomR'})
MLB18['count'] = 1

Step 4: Separate Home and Away Team Performances
Home Team:
MLBhome = MLB18[['HomeTeam','HomR','VisR','count','Date']].copy()
MLBhome['home'] = 1
MLBhome = MLBhome.rename(columns={'HomeTeam':'team','VisR':'RA','HomR':'R'})
Away Team:
MLBaway = MLB18[['VisitingTeam','VisR','HomR','count','Date']].copy()
MLBaway['home'] = 0
MLBaway = MLBaway.rename(columns={'VisitingTeam':'team','VisR':'R','HomR':'RA'})

Step 5: Concatenate Home and Away Performances
MLB18 = pd.concat([MLBhome, MLBaway])

Step 6: Define Wins
MLB18['win'] = np.where(MLB18['R'] > MLB18['RA'], 1, 0)

Step 7: Split Data into First and Second Half of the Season
First Half of Season:
Half1 = MLB18[MLB18.Date < 20180717]
Second Half of Season:
Half2 = MLB18[MLB18.Date > 20180717]

Step 8: Summarize Performance for Each Half
First Half:
Half1perf = Half1.groupby('team')[['count','win','R','RA']].sum().reset_index()
Half1perf = Half1perf.rename(columns={'count':'count1','win':'win1','R':'R1','RA':'RA1'})
Second Half:
Half2perf = Half2.groupby('team')[['count','win','R','RA']].sum().reset_index()
Half2perf = Half2perf.rename(columns={'count':'count2','win':'win2','R':'R2','RA':'RA2'})

Step 9: Calculate Win Percentage and Pythagorean Expectation
First Half:
Half1perf['wpc1'] = Half1perf['win1'] / Half1perf['count1']
Half1perf['pyth1'] = Half1perf['R1']**2 / (Half1perf['R1']**2 + Half1perf['RA1']**2)
Second Half:
Half2perf['wpc2'] = Half2perf['win2'] / Half2perf['count2']
Half2perf['pyth2'] = Half2perf['R2']**2 / (Half2perf['R2']**2 + Half2perf['RA2']**2)

Step 10: Merge DataFrames and Plot Data
Half2predictor = pd.merge(Half1perf, Half2perf, on='team')
Plotting:
sns.relplot(x="pyth1", y="wpc2", data=Half2predictor)
sns.relplot(x="wpc1", y="wpc2", data=Half2predictor)

Step 11: Correlation Analysis
keyvars = Half2predictor[['team','wpc2','wpc1','pyth1','pyth2']]
keyvars.corr()

Step 12: Sorting and Displaying Results
keyvars = keyvars.sort_values(by=['wpc2'], ascending=False)
keyvars

Course 1 Regression Analysis Lesson 4
Basics of Linear Regression
Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables. With one independent variable it is called simple linear regression; with multiple independent variables it is called multiple linear regression.
Key concepts:
1. Dependent variable (Y): The variable we are trying to predict or explain.
2. Independent variable(s) (X): The variable(s) we are using to make predictions.
3. Regression line: The best-fit line through the data points.
4. Intercept (b0): The value of Y when all X are zero.
5. Slope (b1): The change in Y for a one-unit change in X.

Important Terms to Understand:
R-squared
Definition: R-squared (R²) is a statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variable(s) in a regression model.
Importance: It shows how well the data points fit the regression model.
An R² value close to 1 indicates a good fit, meaning the model explains a lot of the variability in the outcome.

Adjusted R-squared
Definition: Adjusted R-squared adjusts the R² value for the number of predictors in the model. Unlike R², it can decrease if additional predictors do not improve the model.
Importance: It provides a more accurate measure of goodness-of-fit, especially when multiple predictors are involved, by accounting for the model's complexity.

F-statistic and Prob (p-value)
Definition: The F-statistic is used to determine whether there is a significant relationship between the dependent variable and the independent variables in the model. The Prob (p-value) associated with the F-statistic indicates the probability of observing such a fit if the null hypothesis (no relationship) were true.
Importance: A significant F-statistic (usually p-value < 0.05) means that the regression model provides a better fit to the data than a model with no predictors.

Coefficients
Definition: Coefficients are the values that multiply the predictor variables in the regression equation. They represent the change in the dependent variable for a one-unit change in the predictor variable.
Importance: They show the direction and magnitude of the relationship between each predictor and the dependent variable. Positive coefficients indicate a positive relationship, while negative coefficients indicate a negative relationship.

Standard Error
Definition: The standard error estimates the standard deviation of a coefficient estimate, i.e., how precisely the sample estimates the underlying population value.
Importance: It provides a measure of the precision of the coefficient estimates. Smaller standard errors indicate more precise estimates.

t-statistic and P-value
Definition: The t-statistic measures how many standard errors the coefficient is away from zero. The P-value indicates the probability of obtaining a coefficient at least this far from zero purely by chance.
Importance: A significant t-statistic (usually p-value < 0.05) means that the corresponding predictor is a significant contributor to the model.

Confidence Interval
Definition: A confidence interval gives a range of values within which the true coefficient value is expected to fall, with a certain level of confidence (usually 95%).
Importance: It provides an estimate of the uncertainty around the coefficient estimate. Narrower intervals indicate more precise estimates.
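To connect these terms to the regression output used throughout the course, here is a minimal sketch (with made-up data, not the course files) showing where each quantity can be read from a fitted statsmodels OLS result:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data: winning percentage loosely driven by goal difference
rng = np.random.default_rng(0)
df = pd.DataFrame({'goal_diff': rng.normal(0, 20, 100)})
df['win_pct'] = 0.5 + 0.004 * df['goal_diff'] + rng.normal(0, 0.05, 100)

model = smf.ols(formula='win_pct ~ goal_diff', data=df).fit()
print(model.summary())               # full table: R-squared, adj. R-squared, F-statistic, coefficients, etc.
print(model.rsquared)                # R-squared
print(model.rsquared_adj)            # adjusted R-squared
print(model.fvalue, model.f_pvalue)  # F-statistic and its p-value
print(model.params)                  # intercept (b0) and slope (b1)
print(model.bse)                     # standard errors of the coefficients
print(model.tvalues, model.pvalues)  # t-statistics and p-values
print(model.conf_int())              # 95% confidence intervals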
Step 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm

Step 2: Load Data
NHL_Team_Stats = pd.read_csv("../../Data/Week 4/NHL_Team_Stats.csv")
NHL_Team_R_Stats = pd.read_csv("../../Data/Week 4/NHL_Team_R_Stats.csv")

Step 3: Simple Linear Regression (Goals For and Winning Percentage)
reg1 = sm.ols(formula='win_pct ~ goals_for', data=NHL_Team_R_Stats).fit()

Step 4: Visualizing and Correlating Goals Against with Winning Percentage
sns.lmplot(x='goals_against', y='win_pct', data=NHL_Team_R_Stats)
plt.xlabel('Total Goals against')
plt.ylabel('Winning Percentage')
plt.title("Relationship between Goals against and Winning Percentage", fontsize=20)
NHL_Team_R_Stats['goals_against'].corr(NHL_Team_R_Stats['win_pct'])
reg2 = sm.ols(formula='win_pct ~ goals_against', data=NHL_Team_R_Stats).fit()
print(reg2.summary())

Step 5: Visualizing and Analyzing Average Goals For per Game
sns.lmplot(x='avg_gf', y='win_pct', data=NHL_Team_R_Stats)
plt.xlabel('Average Goals for per Game')
plt.ylabel('Winning Percentage')
plt.title("Relationship between Average Goals for and Winning Percentage", fontsize=20)
reg3 = sm.ols(formula='win_pct ~ avg_gf', data=NHL_Team_R_Stats).fit()
print(reg3.summary())

Step 6: Multiple Linear Regression with Type
NHL_Team_Stats['type'] = NHL_Team_Stats['type'].astype(object)
reg5 = sm.ols(formula='win_pct ~ avg_gf + type', data=NHL_Team_Stats).fit()
print(reg5.summary())

Step 7: Multiple Linear Regression with Additional Variables
reg6 = sm.ols(formula='win_pct ~ avg_gf + avg_ga + competition_name', data=NHL_Team_Stats).fit()
print(reg6.summary())

Step 8: Interaction Term in Regression
reg7 = sm.ols(formula='win_pct ~ avg_gf + type + avg_gf*type', data=NHL_Team_Stats).fit()
print(reg7.summary())

Step 9: Pythagorean Winning Percentage
NHL_Team_Stats['pyth_pct'] = NHL_Team_Stats['goals_for']**2 / (NHL_Team_Stats['goals_for']**2 + NHL_Team_Stats['goals_against']**2)
sns.lmplot(x='pyth_pct', y='win_pct', data=NHL_Team_Stats)
plt.xlabel('Pythagorean Winning Percentage')
plt.ylabel('Winning Percentage')
plt.title("Relationship between Pythagorean Winning Percentage and Winning Percentage", fontsize=20)
reg8 = sm.ols(formula='win_pct ~ pyth_pct', data=NHL_Team_Stats).fit()
print(reg8.summary())

Step 10: Regression with Competition Name
sns.lmplot(x='pyth_pct', y='win_pct', hue='competition_name', data=NHL_Team_Stats)
plt.xlabel('Pythagorean Winning Percentage')
plt.ylabel('Winning Percentage')
plt.title("Relationship between Pythagorean Winning Percentage and Winning Percentage", fontsize=20)
reg9 = sm.ols(formula='win_pct ~ pyth_pct + competition_name', data=NHL_Team_Stats).fit()
print(reg9.summary())
reg10 = sm.ols(formula='win_pct ~ pyth_pct + competition_name + pyth_pct*competition_name', data=NHL_Team_Stats).fit()
print(reg10.summary())

Course 2 Hakes and Sauer Table Lesson 1
This script aggregates baseball game data, calculates relevant offensive and defensive metrics, and performs regression analysis to explore the relationships between these metrics and team performance. It is a comprehensive application of the Hakes and Sauer method for analyzing team win percentages in baseball based on OBP and SLG metrics. Adjustments may be needed depending on specific data formats and additional statistical considerations.
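For reference, these are the standard OBP and SLG definitions that the OBPFOR/OBPAGN and SLGFOR/SLGAGN columns below implement (a minimal sketch with illustrative scalar inputs, not the course DataFrames):

# On-base percentage: times on base divided by the plate appearances counted in OBP
def obp(h, bb, hbp, ab, sf):
    return (h + bb + hbp) / (ab + bb + hbp + sf)

# Slugging percentage: total bases divided by at-bats
# singles = h - doubles - triples - hr, so total bases = singles + 2*doubles + 3*triples + 4*hr
def slg(h, doubles, triples, hr, ab):
    total_bases = (h - doubles - triples - hr) + 2 * doubles + 3 * triples + 4 * hr
    return total_bases / ab

# Example season totals: 1400 H, 500 BB, 60 HBP, 5500 AB, 45 SF; 280 2B, 25 3B, 180 HR
print(round(obp(1400, 500, 60, 5500, 45), 3))   # ≈ 0.321
print(round(slg(1400, 280, 25, 180, 5500), 3))  # ≈ 0.413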
Step 1: Data Loading and Preparation
import pandas as pd
import numpy as np

# Load the data from an Excel file (adjust the path as needed)
Teams = pd.read_excel("../Data/Game logs 1999-2003.xlsx")

# Create binary indicators for home and away wins
Teams['hwin'] = np.where(Teams['home_score'] > Teams['visitor_score'], 1, 0)
Teams['awin'] = np.where(Teams['home_score'] < Teams['visitor_score'], 1, 0)

# Extract year from the date column
Teams['year'] = Teams['date'].astype(str).str[0:4]

Step 2: Data Aggregation
# Aggregate home team statistics
Teamshome = Teams.groupby(['home', 'year']).sum().reset_index()

# Aggregate away team statistics
Teamsaway = Teams.groupby(['visitor', 'year']).sum().reset_index()

# Merge home and away statistics
Teams2 = pd.merge(Teamshome, Teamsaway, left_on=['home', 'year'], right_on=['visitor', 'year'])

# Calculate total wins for each team
Teams2['wins'] = Teams2['hwin_x'] + Teams2['awin_y']

Step 3: Calculating Offensive and Defensive Metrics (OBP and SLG)
Teams2['OBPFOR'] = (Teams2['home_h_x'] + Teams2['visitor_h_y'] + Teams2['home_bb_x'] + Teams2['visitor_bb_y'] + Teams2['home_hbp_x'] + Teams2['visitor_hbp_y']) / \
    (Teams2['home_ab_x'] + Teams2['visitor_ab_y'] + Teams2['home_bb_x'] + Teams2['visitor_bb_y'] + Teams2['home_hbp_x'] + Teams2['visitor_hbp_y'] + Teams2['home_sf_x'] + Teams2['visitor_sf_y'])

Teams2['OBPAGN'] = (Teams2['home_h_y'] + Teams2['visitor_h_x'] + Teams2['home_bb_y'] + Teams2['visitor_bb_x'] + Teams2['home_hbp_y'] + Teams2['visitor_hbp_x']) / \
    (Teams2['home_ab_y'] + Teams2['visitor_ab_x'] + Teams2['home_bb_y'] + Teams2['visitor_bb_x'] + Teams2['home_hbp_y'] + Teams2['visitor_hbp_x'] + Teams2['home_sf_y'] + Teams2['visitor_sf_x'])

Teams2['SLGFOR'] = ((Teams2['home_h_x'] + Teams2['visitor_h_y'] - (Teams2['home_2b_x'] + Teams2['visitor_2b_y']) - (Teams2['home_3b_x'] + Teams2['visitor_3b_y']) - (Teams2['home_hr_x'] + Teams2['visitor_hr_y']) + 2 * (Teams2['home_2b_x'] + Teams2['visitor_2b_y']) + 3 * (Teams2['home_3b_x'] + Teams2['visitor_3b_y']) + 4 * (Teams2['home_hr_x'] + Teams2['visitor_hr_y'])) / (Teams2['home_ab_x'] + Teams2['visitor_ab_y']))

Teams2['SLGAGN'] = ((Teams2['home_h_y'] + Teams2['visitor_h_x'] - (Teams2['home_2b_y'] + Teams2['visitor_2b_x']) - (Teams2['home_3b_y'] + Teams2['visitor_3b_x']) - (Teams2['home_hr_y'] + Teams2['visitor_hr_x']) + 2 * (Teams2['home_2b_y'] + Teams2['visitor_2b_x']) + 3 * (Teams2['home_3b_y'] + Teams2['visitor_3b_x']) + 4 * (Teams2['home_hr_y'] + Teams2['visitor_hr_x'])) / (Teams2['home_ab_y'] + Teams2['visitor_ab_x']))

Step 4: Further Data Processing and Regression Analysis
# Additional aggregation and calculation of win percentages
TeamsGh = Teams.groupby(['year', 'home']).size().reset_index(name='hwin')
TeamsGa = Teams.groupby(['year', 'visitor']).size().reset_index(name='awin')
TeamsG = pd.merge(TeamsGh, TeamsGa, left_on=['year', 'home'], right_on=['year', 'visitor'])
TeamsG['Games'] = TeamsG['hwin'] + TeamsG['awin']
Teams3 = pd.merge(Teams2[['year', 'home', 'wins', 'OBPFOR', 'OBPAGN', 'SLGFOR', 'SLGAGN']], TeamsG, left_on=['year', 'home'], right_on=['year', 'home'])
Teams3['wpc'] = Teams3['wins'] / Teams3['Games']

# Regression using statsmodels
import statsmodels.formula.api as smf
WinOBP_lm = smf.ols(formula='wpc ~ OBPFOR + OBPAGN', data=Teams3).fit()
WinSLG_lm = smf.ols(formula='wpc ~ SLGFOR + SLGAGN', data=Teams3).fit()
WinOBPSLG_lm = smf.ols(formula='wpc ~ OBPFOR + OBPAGN + SLGFOR + SLGAGN', data=Teams3).fit()
WinOBPSLGR_lm = smf.ols(formula='wpc ~ I(OBPFOR - OBPAGN) + I(SLGFOR - SLGAGN)', data=Teams3).fit()
# Display summary statistics
from statsmodels.iolib.summary2 import summary_col
Header = ['1', '2', '3', '4']
Table_1 = summary_col([WinOBP_lm, WinSLG_lm, WinOBPSLG_lm, WinOBPSLGR_lm], regressor_order=['Intercept', 'OBPFOR', 'OBPAGN', 'SLGFOR', 'SLGAGN'], model_names=Header)
print(Table_1)

Course 3 NHL Forecasting Model Lesson 1
Why it's important for learning about forecasting:
Team Performance Evaluation
Game Outcome Predictions
Management and Strategy
Fan Engagement
Data-Driven Decision Making
Performance Optimization
Competitive Advantage

Step 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
from IPython.display import display, HTML

display(HTML(data="""
<style>
div#notebook-container { width: 95%; }
div#menubar-container { width: 65%; }
div#maintoolbar-container { width: 99%; }
</style>
"""))
Pandas: For data manipulation. NumPy: For numerical operations. Matplotlib: For plotting. Seaborn: For data visualization. Statsmodels: For statistical modeling. IPython.display: To display HTML content in Jupyter Notebooks (here, CSS that widens the notebook containers).

Step 2: Import Data
NHL_game = pd.read_csv("../../Data/Week 4/NHL_game2.csv")
salary = pd.read_csv("../../Data/Week 4/nhl_avg_salary_2016.csv")
pd.read_csv(): Loads CSV files into pandas DataFrames (NHL_game and salary).

Step 3: Display Data
display(NHL_game[0:10])
display(salary[0:10])
display(): Displays the first 10 rows of each DataFrame.

Step 4: Data Exploration
salary.shape
salary.describe().round(decimals=0)
NHL_game.columns
salary.columns
.shape: Returns the dimensions of the DataFrame.
.describe(): Generates descriptive statistics.
.round(): Rounds the statistics to 0 decimals.
.columns: Lists the column names of the DataFrame.

Step 5: Filter and Clean Data
NHL16 = NHL_game[NHL_game.year == 2016].copy()
NHL16.drop(['competition_name', 'type', 'year', 'comp_id', 'team_name', 'date', 'goals_against', 'goals_for'], axis=1, inplace=True)
salary.rename(columns={'Team': 'tricode', 'average': 'salary'}, inplace=True)
DataFrame[condition]: Filters rows based on a condition.
.copy(): Creates a copy of the DataFrame.
.drop(columns, axis=1, inplace=True): Drops specified columns.
.rename(columns={'old_name': 'new_name'}, inplace=True): Renames columns.

Step 6: Aggregate and Merge Data
team_sal = salary.groupby(['tricode']).sum().reset_index()
NHL16 = pd.merge(NHL16, team_sal, on=['tricode'])
.groupby('column').sum(): Groups data by a column and calculates the sum; .reset_index() turns tricode back into a column so it can be used as a merge key.
pd.merge(df1, df2, on='column'): Merges two DataFrames on a specified column.

Step 7: Extract and Clean Home and Away Data
NHL16home = NHL16[NHL16.home_away == 'home'].copy()
NHL16away = NHL16[NHL16.home_away == 'away'].copy()
NHL16away.drop(['hgd', 'win', 'win.ord'], axis=1, inplace=True)
NHL16 = pd.merge(NHL16home, NHL16away, on=['gid'])
DataFrame[condition]: Filters rows based on a condition.
.drop(columns, axis=1, inplace=True): Drops specified columns.
pd.merge(df1, df2, on='column'): Merges two DataFrames on a specified column.

Step 8: Rename Columns and Calculate Log Salaries
NHL16.rename(columns={'tid_x': 'tid_home', 'tid_y': 'tid_away', 'tricode_x': 'tricode_home', 'tricode_y': 'tricode_away', 'salary_x': 'salary_home', 'salary_y': 'salary_away'}, inplace=True)
NHL16['lg_home_sal'] = np.log(NHL16['salary_home'])
NHL16['lg_away_sal'] = np.log(NHL16['salary_away'])
.rename(columns={'old_name': 'new_name'}, inplace=True): Renames columns.
np.log(): Calculates the natural logarithm of a column.
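A side note on the log transformation: the difference of log salaries equals the log of the salary ratio, which is exactly the covariate built in Step 9 below. A minimal sketch with made-up payroll figures (illustrative only):

import numpy as np

home_payroll = 72_000_000   # hypothetical home team payroll
away_payroll = 54_000_000   # hypothetical away team payroll

# log(home) - log(away) is identical to log(home / away)
diff_of_logs = np.log(home_payroll) - np.log(away_payroll)
log_of_ratio = np.log(home_payroll / away_payroll)
print(round(diff_of_logs, 4), round(log_of_ratio, 4))  # both ≈ 0.2877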
Step 9: Calculate Salary Ratio and Perform Regression
NHL16.drop(['home_away_x', 'tid_home', 'salary_home', 'home_away_y', 'tid_away', 'salary_away'], axis=1, inplace=True)
NHL16['lghtsal_ratio'] = NHL16['lg_home_sal'] - NHL16['lg_away_sal']
GDreg_lm = smf.ols(formula='hgd ~ lghtsal_ratio', data=NHL16).fit()
GDreg_lm.summary()
.drop(columns, axis=1, inplace=True): Drops specified columns.
DataFrame['new_column'] = expression: Creates a new column from an expression.
smf.ols(formula, data).fit(): Fits a linear regression model.
.summary(): Prints the summary of the regression model.

Step 10: Visualize Data and Make Predictions
sns.relplot(x='lghtsal_ratio', y='hgd', data=NHL16)
sns.regplot(x='lghtsal_ratio', y='hgd', data=NHL16, scatter_kws={'s': 20})
NHL16['GDpred'] = GDreg_lm.predict()
sns.relplot(): Creates a scatter plot.
sns.regplot(): Creates a scatter plot with a regression line.
.predict(): Makes predictions based on the regression model.

Step 11: Evaluate Model Accuracy
NHL16['resGDpred'] = np.where(NHL16['GDpred'] > 0, 2, 0)
NHL16['GDcorrect'] = np.where(NHL16['resGDpred'] == NHL16['win.ord'], 1, 0)
accuracy = sum(NHL16['GDcorrect']) / 1296
np.where(condition, x, y): Evaluates a condition element-wise and assigns values based on the result.
sum(): Sums up the elements of a column; dividing by the 1296 games gives the share of correctly predicted outcomes.

Step 12: Fit Ordered Logistic Regression Model
from bevel.linear_ordinal_regression import OrderedLogit
ol = OrderedLogit()
ol.fit(NHL16['lghtsal_ratio'], NHL16['win.ord'])
ol.print_summary()
OrderedLogit(): Initializes an ordered logistic regression model.
.fit(X, y): Fits the model.
.print_summary(): Prints the summary of the model.

Step 13: Calculate and Evaluate Probabilities
NHL16['predL'] = 1 / (1 + np.exp(-(ol.coef_ - ol.coef_ * NHL16['lghtsal_ratio'])))
NHL16['predD'] = 1 / (1 + np.exp(-(ol.coef_ - ol.coef_ * NHL16['lghtsal_ratio']))) - NHL16['predL']
NHL16['predW'] = 1 - NHL16['predL'] - NHL16['predD']
NHL16.loc[(NHL16.predL > NHL16.predW) & (NHL16.predL > NHL16.predD), 'fitted'] = 0
NHL16.loc[(NHL16.predW > NHL16.predL) & (NHL16.predW > NHL16.predD), 'fitted'] = 2
NHL16.loc[(NHL16.predD > NHL16.predW) & (NHL16.predD > NHL16.predL), 'fitted'] = 1
NHL16['TRUE'] = np.where(NHL16['fitted'] == NHL16['win.ord'], 1, 0)
total_correct = NHL16['TRUE'].sum()
accuracy = total_correct / 1296
Here predL, predD, and predW are the estimated probabilities of the three ordered outcomes in win.ord (0, 1, and 2, respectively), and the ol.coef_ terms are the fitted cut-point and slope coefficients from the ordered logit.
np.exp(): Calculates the exponential of all elements in the input array.
.loc[condition, 'column'] = value: Updates a column based on a condition.
np.where(condition, x, y): Evaluates a condition element-wise and assigns values based on the result.

Course 3 MLB Forecasting Model Lesson 2
Main Steps for Analysis:
1) Defining Run Differentials and Game Results:
# Define run differentials and assign them to 'H_run_dff'
mlb['H_run_dff'] = mlb['home_score'] - mlb['visitor_score']
# Define the game result from the run differential (1: win, 0: loss)
mlb['H_win'] = mlb['H_run_dff'].apply(lambda x: 1 if x > 0 else 0)
.apply(): Applies a function along an axis of the DataFrame.

2) Merging Salary Data for Home and Visitor Teams Separately:
# Merge salary data for home teams
mlb = pd.merge(mlb, salary.rename(columns={'team': 'home'}), on='home')
# Now change the 'home' column to 'visitor' as the matching column
salary.rename(columns={'home': 'visitor'}, inplace=True)
# Merge salary data for visitors
mlb = pd.merge(mlb, salary, on='visitor')
pd.merge(): Merges DataFrame or named Series objects with a database-style join.
.rename(): Renames the labels of a DataFrame.
3) Renaming Columns After Merging:
# Change the column names properly
mlb.rename(columns={'Payroll_x': 'hm_sal'}, inplace=True)
mlb.rename(columns={'Payroll_y': 'vst_sal'}, inplace=True)
.rename(): Renames the labels of a DataFrame.

4) Feature Engineering for Log Salaries and Salary Ratios:
# Take the log of salary to be used as an independent variable in the regression
mlb['lg_hm_sal'] = np.log(mlb['hm_sal'])
mlb['lg_vst_sal'] = np.log(mlb['vst_sal'])
mlb['lg_ratio'] = mlb['lg_hm_sal'] - mlb['lg_vst_sal']
np.log(): Computes the natural logarithm of each element in the input array.

5) Linear Regression and Predictions:
# Forecasting with linear regression
RDreg = smf.ols(formula='H_run_dff ~ lg_ratio', data=mlb).fit()
RDreg.summary()
# Obtain the fitted results
mlb['RDpred'] = RDreg.predict()
# If the fitted run differential > 0, we predict a home win (1), otherwise a visitor win (0)
mlb['res_RDpred'] = np.where(mlb['RDpred'] > 0, 1, 0)
# Obtain the correct predictions and the success rate
mlb['RDcorrect'] = np.where(mlb['res_RDpred'] == mlb['H_win'], 1, 0)
sum(mlb['RDcorrect']) / 2429
smf.ols(): Ordinary Least Squares regression.
.fit(): Fits the model to the data.
.predict(): Returns fitted values.

6) Logistic Regression:
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Specify the model and then run the logistic regression
H_Win_Lg = 'H_win ~ lg_ratio'
model = smf.glm(formula=H_Win_Lg, data=mlb, family=sm.families.Binomial())
result = model.fit()
# Print the result
print(result.summary())
# Obtain the fitted probability of winning each game from the logit model
fittedProbs = result.predict()
print(fittedProbs[0:10])
# Create a binary winning variable from the fitted probabilities
fittedWin = [1 if x > 0.5 else 0 for x in fittedProbs]
print(fittedWin[0:10])
smf.glm(): Generalized Linear Models.
sm.families.Binomial(): Binomial family for logistic regression.
.fit(): Fits the model to the data.
.predict(): Returns fitted values (here, probabilities).

7) Model Evaluation with Confusion Matrix and Classification Report:
from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(mlb['H_win'], fittedWin)
(391 + 948) / 2429  # Accuracy calculation: correct predictions divided by all games
print(classification_report(mlb['H_win'], fittedWin, digits=3))
confusion_matrix(): Computes the confusion matrix to evaluate the accuracy of a classification.
classification_report(): Builds a text report showing the main classification metrics.

8) Predicting Home Win Probabilities:
# Win probability using the estimated intercept (0.119124) and lg_ratio coefficient (0.430343)
mlb['pred_Home_W'] = 1 / (1 + np.exp(-(0.119124 + 0.430343 * mlb['lg_ratio'])))
mlb['pred_Home_L'] = 1 - mlb['pred_Home_W']
# Create a fitted binary outcome based on the probabilities
mlb.loc[mlb.pred_Home_L > mlb.pred_Home_W, 'fitted'] = 0
mlb.loc[mlb.pred_Home_W > mlb.pred_Home_L, 'fitted'] = 1
mlb['TRUE'] = np.where(mlb['fitted'] == mlb['H_win'], 1, 0)
display(mlb[0:10])
# Obtain the success rate of the model in predicting outcomes
Total = mlb['TRUE'].sum()
print(Total / 2429)
np.exp(): Computes the exponential of all elements in the input array.
.loc[]: Accesses a group of rows and columns by labels or a boolean array.
np.where(condition, x, y): Evaluates a condition element-wise and assigns values based on the result.
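The hard-coded coefficients and game counts above come from one particular fitted model and dataset. A small sketch of how the same probabilities and accuracy could be computed directly from the fitted GLM object (variable names follow the code above; this is a suggested variation, not part of the course script):

import numpy as np

# Pull the fitted intercept and slope from the statsmodels GLM result instead of typing them in
b0 = result.params['Intercept']
b1 = result.params['lg_ratio']

# Same logistic transformation as above, but driven by the stored parameters
mlb['pred_Home_W'] = 1 / (1 + np.exp(-(b0 + b1 * mlb['lg_ratio'])))
mlb['pred_Home_L'] = 1 - mlb['pred_Home_W']

# Predicted outcome and accuracy without a hard-coded number of games
mlb['fitted'] = np.where(mlb['pred_Home_W'] > mlb['pred_Home_L'], 1, 0)
accuracy = (mlb['fitted'] == mlb['H_win']).mean()
print(round(accuracy, 3))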
