Business Analytics Linear Regression PDF
Uploaded by BriskMinneapolis
2021
Summary
This presentation covers Chapter 7, Linear Regression, of Business Analytics. It introduces the simple linear regression model and the least squares method, shows how to assess the fit of a model, and then extends to multiple regression, inference, categorical independent variables, and nonlinear relationships, using the Butler Trucking Company data as a running example.
Full Transcript
Business Analytics
Linear Regression
Chapter 7
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Introduction (Slide 1 of 2)
Managerial decisions are often based on the relationship between two or more variables.
Example: After considering the relationship between advertising expenditures and sales, a marketing manager might attempt to predict sales for a given level of advertising expenditures.
Sometimes a manager will rely on intuition to judge how two variables are related.
If data can be obtained, a statistical procedure called regression analysis can be used to develop an equation showing how the variables are related.

Introduction (Slide 2 of 2)
Dependent variable or response: The variable being predicted.
Independent variables or predictor variables: The variables used to predict the value of the dependent variable.
Simple linear regression: A regression analysis involving one independent variable, x, in which any one-unit change in x is assumed to result in the same change in the dependent variable, y.
Multiple linear regression: A regression analysis involving two or more independent variables.

Simple Linear Regression Model
Regression Model
Estimated Regression Equation

Simple Linear Regression Model (Slide 1 of 5)
Regression Model: The equation that describes how y is related to x and an error term.
$y = \beta_0 + \beta_1 x + \varepsilon$
The error term $\varepsilon$ accounts for the variability in y that cannot be explained by the linear relationship between x and y.

Simple Linear Regression Model (Slide 2 of 5)
Estimated Regression Equation: The parameter values $\beta_0$ and $\beta_1$ are usually not known and must be estimated using sample data.
Substituting the sample statistics $b_0$ and $b_1$ for $\beta_0$ and $\beta_1$ in the regression equation and dropping the error term, we obtain the estimated regression equation for simple linear regression:
$\hat{y} = b_0 + b_1 x$

Simple Linear Regression Model (Slide 3 of 5)
In the estimated simple linear regression equation, $\hat{y}$ is the estimate of the mean value of y for a given value of x, $b_0$ is the estimated y-intercept, and $b_1$ is the estimated slope.
The graph of the estimated simple linear regression equation is called the estimated regression line.

Simple Linear Regression Model (Slide 4 of 5)
Figure 7.1: The Estimation Process in Simple Linear Regression

Simple Linear Regression Model (Slide 5 of 5)
Figure 7.2: Possible Regression Lines in Simple Linear Regression

Least Squares Method
Least Squares Estimates of the Regression Parameters
Using Excel’s Chart Tools to Compute the Estimated Regression Equation

Least Squares Method (Slide 1 of 15)
Least squares method: A procedure for using sample data to find the estimated regression equation.

Least Squares Method (Slide 2 of 15)
Table 7.1: Miles Traveled and Travel Time for 10 Butler Trucking Company Driving Assignments

Driving Assignment i    x = Miles Traveled    y = Travel Time (hours)
1                       100                   9.3
2                       50                    4.8
3                       100                   8.9
4                       100                   6.5
5                       50                    4.2
6                       80                    6.2
7                       75                    7.4
8                       65                    6.0
9                       90                    7.6
10                      90                    6.1

Least Squares Method (Slide 3 of 15)
Figure 7.3: Scatter Chart of Miles Traveled and Travel Time for Sample of 10 Butler Trucking Company Driving Assignments

Least Squares Method (Slide 4 of 15)

Least Squares Method (Slide 5 of 15)
Hence, we are finding the regression line that minimizes the sum of squared errors:
$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$

Least Squares Method (Slide 6 of 15)
Least Squares Estimates of the Regression Parameters:
For the Butler Trucking Company data in Table 7.1: $b_0 = 1.2739$ and $b_1 = 0.0678$.
The estimated simple linear regression model: $\hat{y} = 1.2739 + 0.0678x$
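
The least squares estimates for the Butler Trucking data can be checked outside Excel. The following is a minimal Python sketch, not the presentation's own method; the function name `least_squares` is an illustrative choice. It assumes assignment 3 covers 100 miles, which is the value under which the reported estimates of 1.2739 and 0.0678 are reproduced.

```python
# Table 7.1: miles traveled (x) and travel time in hours (y).
# Assumption: assignment 3 is 100 miles, the value consistent with
# the estimates reported on the slides.
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]


def least_squares(x, y):
    """Return (b0, b1) that minimize the sum of squared errors for y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx         # estimated slope
    b0 = y_bar - b1 * x_bar  # estimated y-intercept
    return b0, b1


b0, b1 = least_squares(miles, hours)
print(f"y-hat = {b0:.4f} + {b1:.4f} x")
```

Rounded to four decimals, the result agrees with the estimated regression equation on Slide 6 of 15.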

Least Squares Method (Slide 7 of 15)
Interpretation of the slope: If the length of a driving assignment were 1 unit (1 mile) longer, the mean travel time for that driving assignment would be 0.0678 units (0.0678 hours, or approximately 4 minutes) longer.
Interpretation of the y-intercept: If the driving distance for a driving assignment were 0 units (0 miles), the mean travel time would be 1.2739 units (1.2739 hours, or approximately 76 minutes).

Least Squares Method (Slide 8 of 15)
Experimental region: The range of values of the independent variables in the data used to estimate the model. The regression model is valid only over this region.
Extrapolation: Prediction of the value of the dependent variable outside the experimental region. Extrapolation is risky because the estimated relationship may not hold there. In Table 7.1 the miles traveled range from 50 to 100, so the y-intercept interpretation at 0 miles is itself an extrapolation.

Least Squares Method (Slide 9 of 15)
Butler Trucking Company example: Use the estimated model and the known values for miles traveled for a driving assignment (x) to estimate mean travel time in hours.
For example, the first driving assignment in Table 7.1 has a value for miles traveled of $x_1 = 100$.
The mean travel time in hours for this driving assignment is estimated to be: $\hat{y}_1 = 1.2739 + 0.0678(100) = 8.0539$
The resulting residual of the estimate is: $e_1 = y_1 - \hat{y}_1 = 9.3 - 8.0539 = 1.2461$

Least Squares Method (Slide 10 of 15)
Table 7.2: Predicted Travel Time and Residuals for 10 Butler Trucking Company Driving Assignments
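
The predicted values and residuals described on Slides 9 and 10 can be tabulated with a short script. This is a hedged sketch rather than a reproduction of Table 7.2: it uses the rounded estimates 1.2739 and 0.0678, assumes assignment 3 is 100 miles, and the helper `in_experimental_region` is a hypothetical name added to illustrate the extrapolation warning.

```python
# Table 7.1 data (assignment 3 assumed to be 100 miles).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

b0, b1 = 1.2739, 0.0678  # rounded estimates reported on the slides

# Predicted mean travel time and residual (observed minus predicted).
predicted = [b0 + b1 * x for x in miles]
residuals = [y - y_hat for y, y_hat in zip(hours, predicted)]


def in_experimental_region(x):
    """True when x lies inside the range of miles used to estimate the model."""
    return min(miles) <= x <= max(miles)


print("i   miles  hours  predicted  residual")
for i, (x, y, y_hat, e) in enumerate(zip(miles, hours, predicted, residuals), start=1):
    print(f"{i:<3} {x:>5}  {y:>5.1f}  {y_hat:>9.4f}  {e:>8.4f}")

print("Is 0 miles inside the experimental region?", in_experimental_region(0))
```

Because the coefficients are rounded, the residuals sum to a value close to, but not exactly, zero.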

Least Squares Method (Slide 11 of 15)
Figure 7.4: Scatter Chart of Miles Traveled and Travel Time for Butler Trucking Company Driving Assignments with Regression Line Superimposed

Least Squares Method (Slide 12 of 15)
Figure 7.5: A Geometric Interpretation of the Least Squares Method

Least Squares Method (Slide 13 of 15)
Using Excel’s Chart Tools to Compute the Estimated Regression Equation:
After constructing a scatter chart with Excel’s chart tools:
1. Right-click on any data point and select Add Trendline.
2. When the Format Trendline task pane appears:
   Select Linear in the Trendline Options area.
   Select Display Equation on chart in the Trendline Options area.

Least Squares Method (Slide 14 of 15)
Figure 7.6: Scatter Chart and Estimated Regression Line for Butler Trucking Company

Least Squares Method (Slide 15 of 15)
Slope Equation: $b_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
y-Intercept Equation: $b_0 = \bar{y} - b_1 \bar{x}$

Assessing the Fit of the Simple Linear Regression Model
The Sums of Squares
The Coefficient of Determination
Using Excel’s Chart Tools to Compute the Coefficient of Determination

Assessing the Fit of the Simple Linear Regression Model (Slide 1 of 10)
The Sums of Squares:
Sum of squares due to error: $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
The value of SSE is a measure of the error in using the estimated regression equation to predict the values of the dependent variable in the sample.
From Table 7.2, SSE is approximately 8.03.

Assessing the Fit of the Simple Linear Regression Model (Slide 2 of 10)

Assessing the Fit of the Simple Linear Regression Model (Slide 3 of 10)
Butler Trucking Example: Table 7.3 shows the sum of squared deviations obtained by using the sample mean $\bar{y} = 6.7$ to predict travel time for each driving assignment in the sample.
For the ith driving assignment in the sample, the difference $y_i - \bar{y}$ measures the error involved in using $\bar{y}$ to predict travel time.

Assessing the Fit of the Simple Linear Regression Model (Slide 4 of 10)
The corresponding sum of squares is called the total sum of squares (SST): $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$

Assessing the Fit of the Simple Linear Regression Model (Slide 5 of 10)
Table 7.3: Calculations for the Sum of Squares Total for the Butler Trucking Simple Linear Regression

Assessing the Fit of the Simple Linear Regression Model (Slide 6 of 10)

Assessing the Fit of the Simple Linear Regression Model (Slide 7 of 10)
Relation between SST, SSR, and SSE: $SST = SSR + SSE$
where
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error

Assessing the Fit of the Simple Linear Regression Model (Slide 8 of 10)
The Coefficient of Determination:
The ratio SSR/SST is used to evaluate the goodness of fit of the estimated regression equation; this ratio is called the coefficient of determination and is denoted by $r^2$: $r^2 = SSR / SST$
It takes values between zero and one.
It is interpreted as the percentage of the total sum of squares that can be explained by using the estimated regression equation.

Assessing the Fit of the Simple Linear Regression Model (Slide 9 of 10)
Using Excel’s Chart Tools to Compute the Coefficient of Determination:
To compute the coefficient of determination using the scatter chart in Figure 7.3:
1. Right-click on any data point in the scatter chart and select Add Trendline…
2. When the Format Trendline task pane appears:
   Select Display R-squared value on chart in the Trendline Options area.

Assessing the Fit of the Simple Linear Regression Model (Slide 10 of 10)
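
The three sums of squares and the coefficient of determination can also be computed directly from the data. The sketch below is illustrative only and assumes the Table 7.1 values with assignment 3 at 100 miles; the variable names are not from the presentation.

```python
# Table 7.1 data (assignment 3 assumed to be 100 miles).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

n = len(miles)
x_bar = sum(miles) / n
y_bar = sum(hours) / n

# Least squares estimates of the slope and y-intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, hours))
      / sum((x - x_bar) ** 2 for x in miles))
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in miles]

sse = sum((y - p) ** 2 for y, p in zip(hours, predicted))  # error
sst = sum((y - y_bar) ** 2 for y in hours)                 # total
ssr = sum((p - y_bar) ** 2 for p in predicted)             # regression

r_squared = ssr / sst

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
print(f"r^2 = {r_squared:.4f}")
```

With these data, SST equals SSR plus SSE up to floating-point rounding, and the coefficient of determination rounds to 0.6641.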

The Multiple Regression Model
Regression Model
Estimated Multiple Regression Equation
Least Squares Method and Multiple Regression
Butler Trucking Company and Multiple Regression
Using Excel’s Regression Tool to Develop the Estimated Multiple Regression Equation

The Multiple Regression Model (Slide 1 of 11)
Regression Model:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q + \varepsilon$

The Multiple Regression Model (Slide 2 of 11)
Regression Model (cont.):
$\beta_j$ represents the change in the mean value of the dependent variable y that corresponds to a one-unit increase in the independent variable $x_j$, holding the values of all other independent variables in the model constant.
The multiple regression equation that describes how the mean value of y is related to $x_1, x_2, \ldots, x_q$ is:
$E(y \mid x_1, x_2, \ldots, x_q) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q$

The Multiple Regression Model (Slide 3 of 11)
Estimated Multiple Regression Equation:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_q x_q$

The Multiple Regression Model (Slide 4 of 11)
Least Squares Method and Multiple Regression:
The least squares method is used to develop the estimated multiple regression equation.
It uses sample data to provide the values of $b_0, b_1, b_2, \ldots, b_q$ that minimize the sum of squared residuals: $\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

The Multiple Regression Model (Slide 5 of 11)
Figure 7.10: The Estimation Process for Multiple Regression

The Multiple Regression Model (Slide 6 of 11)
Butler Trucking Company and Multiple Regression:
The estimated simple linear regression equation is $\hat{y} = 1.2739 + 0.0678x$.
The linear effect of the number of miles traveled explains 66.41% of the variability in the sample travel times.
This implies that 33.59% of the variability in sample travel times remains unexplained.
The managers might want to consider adding one or more independent variables, such as number of deliveries, to the model to explain some of the remaining variability in the dependent variable.

The Multiple Regression Model (Slide 7 of 11)
Butler Trucking Company and Multiple Regression (cont.):
Estimated multiple linear regression with two independent variables:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$
where $\hat{y}$ is the estimated mean travel time, $x_1$ is the distance traveled, and $x_2$ is the number of deliveries.

The Multiple Regression Model (Slide 8 of 11)
Using Excel’s Regression Tool to Develop the Estimated Multiple Regression Equation:
Figure 7.11: Data Analysis Tools Box

The Multiple Regression Model (Slide 9 of 11)
Figure 7.12: Regression Dialog Box

The Multiple Regression Model (Slide 10 of 11)
Figure 7.13: Excel Regression Output for the Butler Trucking Company with Miles and Deliveries as Independent Variables
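
As a rough illustration of what a regression tool does with two independent variables, the sketch below solves the least squares normal equations in plain Python. Everything in it is an assumption for illustration: the miles, deliveries, and hours values are invented (they are not the Butler Trucking data set behind Figure 7.13), and `solve` and `fit_least_squares` are hypothetical helper names.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bq] minimizing the sum of squared residuals."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Invented example data: (miles, deliveries) and travel time in hours.
rows = [(100, 3), (50, 1), (100, 4), (100, 2), (50, 1),
        (80, 2), (75, 3), (65, 2), (90, 3), (90, 2)]
hours = [9.1, 4.5, 9.8, 7.9, 4.4, 6.6, 7.2, 5.9, 8.0, 7.1]

b0, b1, b2 = fit_least_squares(rows, hours)
residuals = [y - (b0 + b1 * x1 + b2 * x2) for (x1, x2), y in zip(rows, hours)]
print(f"y-hat = {b0:.4f} + {b1:.4f} x1 + {b2:.4f} x2")
```

A useful check on any least squares fit with an intercept is that the residuals sum to zero and are uncorrelated with each independent variable, which is what the normal equations enforce.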

The Multiple Regression Model (Slide 11 of 11)
Figure 7.14: Graph of the Regression Equation for Multiple Regression Analysis with Two Independent Variables

Inference and Regression
Conditions Necessary for Valid Inference in the Least Squares Regression Model
Testing Individual Regression Parameters
Addressing Nonsignificant Independent Variables
Multicollinearity

Inference and Regression (Slide 1 of 17)
Statistical inference: The process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through the analysis of sample data drawn from the population.
In regression, inference is commonly used to estimate and draw conclusions about:
The regression parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$
The mean value and/or the predicted value of the dependent variable y for specific values of the independent variables
Consider both hypothesis testing and interval estimation.

Inference and Regression (Slide 2 of 17)
Conditions Necessary for Valid Inference in the Least Squares Regression Model:
For any given combination of values of the independent variables $x_1, x_2, \ldots, x_q$, the population of potential error terms $\varepsilon$ is normally distributed with a mean of 0 and a constant variance.

Inference and Regression (Slide 3 of 17)
Figure 7.15: Illustration of the Conditions for Valid Inference in Regression

Inference and Regression (Slide 4 of 17)
Figure 7.16: Example of a Random Error Pattern in a Scatter Chart of Residuals and Predicted Values of the Dependent Variable

Inference and Regression (Slide 5 of 17)
Figure 7.17: Examples of Diagnostic Scatter Charts of Residuals from Four Regressions

Inference and Regression (Slide 6 of 17)
Figure 7.18: Excel Residual Plots for the Butler Trucking Company Multiple Regression

Inference and Regression (Slide 7 of 17)

Inference and Regression (Slide 8 of 17)

Inference and Regression (Slide 9 of 17)
Testing Individual Regression Parameters:
To determine whether statistically significant relationships exist between the dependent variable y and each of the independent variables $x_1, x_2, \ldots, x_q$, a t test is conducted for each parameter: $H_0: \beta_j = 0$ against $H_a: \beta_j \neq 0$, with test statistic $t = b_j / s_{b_j}$, where $s_{b_j}$ is the estimated standard deviation of $b_j$.
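
For the simple linear regression on miles traveled, the test of an individual parameter can be sketched in a few lines. This is an illustrative calculation under stated assumptions: it uses the Table 7.1 data with assignment 3 at 100 miles, the usual standard error formula for the slope in simple linear regression, and a critical value of 2.306 taken from a t table for 8 degrees of freedom at the 95% confidence level.

```python
import math

# Table 7.1 data (assignment 3 assumed to be 100 miles).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

n = len(miles)
x_bar = sum(miles) / n
y_bar = sum(hours) / n
s_xx = sum((x - x_bar) ** 2 for x in miles)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, hours)) / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(miles, hours))
s = math.sqrt(sse / (n - 2))   # standard error of the estimate
se_b1 = s / math.sqrt(s_xx)    # estimated standard deviation of b1
t_stat = b1 / se_b1            # test statistic for H0: beta1 = 0

T_CRIT = 2.306  # t table value, n - 2 = 8 degrees of freedom, 95% confidence
lower = b1 - T_CRIT * se_b1
upper = b1 + T_CRIT * se_b1

print(f"b1 = {b1:.4f}, s_b1 = {se_b1:.4f}, t = {t_stat:.4f}")
print(f"95% confidence interval for beta1: ({lower:.4f}, {upper:.4f})")
print("Reject H0: beta1 = 0" if abs(t_stat) > T_CRIT else "Do not reject H0")
```

The two views agree: the hypothesis that the slope is zero is rejected exactly when the confidence interval excludes zero.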

Inference and Regression (Slide 10 of 17)
Testing Individual Regression Parameters (cont.):
As the magnitude of t increases (as t deviates from zero in either direction), we are more likely to reject the hypothesis that the regression parameter $\beta_j$ is zero.

Inference and Regression (Slide 11 of 17)
Testing Individual Regression Parameters (cont.):
A confidence interval can be used to test whether each of the regression parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$ is equal to zero.
Confidence interval: An estimate of a population parameter that provides an interval believed to contain the value of the parameter at some level of confidence.
Confidence level: Indicates how frequently interval estimates based on samples of the same size taken from the same population using identical sampling techniques will contain the true value of the parameter we are estimating.

Inference and Regression (Slide 12 of 17)
Addressing Nonsignificant Independent Variables:
If practical experience dictates that the nonsignificant independent variable has a relationship with the dependent variable, the independent variable should be left in the model.
If the model sufficiently explains the dependent variable without the nonsignificant independent variable, then consider rerunning the regression without the nonsignificant independent variable.
The appropriate treatment of the inclusion or exclusion of the y-intercept when $b_0$ is not statistically significant may require special consideration.

Inference and Regression (Slide 13 of 17)
Multicollinearity:
Multicollinearity refers to the correlation among the independent variables in multiple regression analysis.
In t tests for the significance of individual parameters, the difficulty caused by multicollinearity is that it is possible to conclude that a parameter associated with one of the multicollinear independent variables is not significantly different from zero when the independent variable actually has a strong relationship with the dependent variable.
This problem is avoided when there is little correlation among the independent variables.

Inference and Regression (Slide 14 of 17)
Figure 7.21: Excel Regression Output for the Butler Trucking Company with Miles and Gasoline Consumption as Independent Variables

Inference and Regression (Slide 15 of 17)
Figure 7.22: Scatter Chart of Miles and Gasoline Consumed for Butler Trucking Company

Inference and Regression (Slide 16 of 17)
Multicollinearity (cont.):
Testing for an overall regression relationship:
Use an F test based on the F probability distribution.
If the F test leads us to reject the hypothesis that the values of $\beta_1, \beta_2, \ldots, \beta_q$ are all zero, conclude that there is an overall regression relationship.
Otherwise, conclude that there is no overall regression relationship.

Inference and Regression (Slide 17 of 17)
Multicollinearity (cont.):
Testing for an overall regression relationship (cont.):
The test statistic generated by the sample data for this test is:
$F = \dfrac{SSR / q}{SSE / (n - q - 1)}$
where
SSR = Sum of squares due to regression.
SSE = Sum of squares due to error.
q = the number of independent variables in the regression model.
n = the number of observations in the sample.
Larger values of F provide stronger evidence of an overall regression relationship.

Categorical Independent Variables
Butler Trucking Company and Rush Hour
Interpreting the Parameters
More Complex Categorical Variables

Categorical Independent Variables (Slide 1 of 10)
Butler Trucking Company and Rush Hour:
Dependent variable, y: Travel time.
Independent variables: miles traveled ($x_1$) and number of deliveries ($x_2$).
Dummy variable, $x_3$:
$x_3 = 0$ if an assignment did not include travel on the congested segment of highway during afternoon rush hour.
$x_3 = 1$ if an assignment included travel on the congested segment of highway during afternoon rush hour.

Categorical Independent Variables (Slide 2 of 10)
Figure 7.23: Histograms of the Residuals for Driving Assignments That Included Travel on a Congested Segment of a Highway During the Afternoon Rush Hour and Residuals for Driving Assignments That Did Not

Categorical Independent Variables (Slide 3 of 10)

Categorical Independent Variables (Slide 4 of 10)
Interpreting the Parameters:
The model estimates that travel time increases by:
0.0672 hours (about 4 minutes) for every increase of 1 mile traveled, holding constant the number of deliveries and whether the driving assignment route requires the driver to travel on the congested segment of a highway during the afternoon rush hour period.
0.6735 hours (about 40 minutes) for every delivery, holding constant the number of miles traveled and whether the driving assignment route requires the driver to travel on the congested segment of a highway during the afternoon rush hour period.

Categorical Independent Variables (Slide 5 of 10)
Interpreting the Parameters (cont.):
The model estimates that travel time increases by (cont.):
0.9980 hours (about 60 minutes) if the driving assignment route requires the driver to travel on the congested segment of a highway during the afternoon rush hour period, holding constant the number of miles traveled and the number of deliveries.

Categorical Independent Variables (Slide 6 of 10)
Interpreting the Parameters (cont.):

Categorical Independent Variables (Slide 7 of 10)
More Complex Categorical Variables:
If a categorical variable has k levels, k minus 1 dummy variables are required, with each dummy variable corresponding to one of the levels of the categorical variable and coded as 0 or 1.
Example: Suppose a manufacturer of vending machines organized the sales territories for a particular state into three regions: A, B, and C.
The managers want to use regression analysis to help predict the number of vending machines sold per week.
Suppose the managers believe sales region is one of the important factors in predicting the number of units sold.

Categorical Independent Variables (Slide 8 of 10)
More Complex Categorical Variables (cont.):
Example (cont.):
Sales region: categorical variable with three levels (A, B, and C), so two dummy variables are required.
Each variable can be coded 0 or 1 as:
$x_1 = 1$ if Sales Region B, 0 otherwise
$x_2 = 1$ if Sales Region C, 0 otherwise

Categorical Independent Variables (Slide 9 of 10)
More Complex Categorical Variables (cont.):
Example (cont.):
The regression equation relating the mean number of units sold to the dummy variables is written as:
$E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
Observations corresponding to Region A are coded $x_1 = 0$, $x_2 = 0$, so the mean number of units sold in Region A is:
$E(y \mid \text{Region A}) = \beta_0 + \beta_1(0) + \beta_2(0) = \beta_0$

Categorical Independent Variables (Slide 10 of 10)
More Complex Categorical Variables (cont.):
Example (cont.):
Observations corresponding to Region B are coded $x_1 = 1$, $x_2 = 0$, so the mean number of units sold in Region B is:
$E(y \mid \text{Region B}) = \beta_0 + \beta_1(1) + \beta_2(0) = \beta_0 + \beta_1$
Observations corresponding to Region C are coded $x_1 = 0$, $x_2 = 1$, so the mean number of units sold in Region C is:
$E(y \mid \text{Region C}) = \beta_0 + \beta_1(0) + \beta_2(1) = \beta_0 + \beta_2$

Modeling Nonlinear Relationships
Quadratic Regression Models
Piecewise Linear Regression Models
Interaction Between Independent Variables
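
Returning to the sales region example, the rule that a categorical variable with k levels needs k minus 1 dummy variables can be sketched as follows. The function names and the coefficient values are hypothetical and chosen only for illustration; Region A is assumed to be the baseline level that receives all zeros.

```python
def dummy_code(level, levels, baseline):
    """Return the k-1 dummy values (0 or 1) for one observation of a k-level variable."""
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    # One dummy per non-baseline level; the baseline is coded as all zeros.
    return [1 if level == other else 0 for other in levels if other != baseline]


regions = ["A", "B", "C"]


def mean_units_sold(region, b0, b1, b2):
    """Mean units sold for a region under y = b0 + b1*x1 + b2*x2."""
    x1, x2 = dummy_code(region, regions, baseline="A")
    return b0 + b1 * x1 + b2 * x2


for region in regions:
    print(region, dummy_code(region, regions, baseline="A"))

# Hypothetical coefficients, for illustration only.
for region in regions:
    print(region, mean_units_sold(region, b0=20.0, b1=5.0, b2=-3.0))
```

With this coding, the intercept is the mean for the baseline region, and each dummy coefficient is the difference between that region's mean and the baseline mean.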

Modeling Nonlinear Relationships (Slide 1 of 16)
Figure 7.25: Scatter Chart for the Reynolds Example

Modeling Nonlinear Relationships (Slide 2 of 16)
Figure 7.26: Excel Regression Output for the Reynolds Example

Modeling Nonlinear Relationships (Slide 3 of 16)
Figure 7.27: Scatter Chart of the Residuals and Predicted Values of the Dependent Variable for the Reynolds Simple Linear Regression

Modeling Nonlinear Relationships (Slide 4 of 16)
Quadratic Regression Models:
In the Reynolds example, to account for the curvilinear relationship between months employed and scales sold, we could include the square of the number of months the salesperson has been employed as a second independent variable:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_1^2$ (7.18)
Equation (7.18) corresponds to a quadratic regression model.

Modeling Nonlinear Relationships (Slide 5 of 16)
Figure 7.28: Relationships That Can Be Fit with a Quadratic Regression Model

Modeling Nonlinear Relationships (Slide 6 of 16)
Figure 7.29: Excel Data for the Reynolds Quadratic Regression Model
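
A quadratic regression model is still fit by least squares; the squared term is simply treated as a second independent variable. The sketch below illustrates that idea in plain Python. The months and sales values are invented (the Reynolds data are not reproduced in this transcript), and `solve` and `fit_least_squares` are hypothetical helper names.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bq] minimizing the sum of squared residuals."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Invented data: sales rise and then fall as months employed increases.
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sales = [12, 19, 25, 28, 31, 31, 30, 27, 22, 16]

# The squared term is added as a second column, as in the quadratic model.
rows = [(x, x * x) for x in months]
b0, b1, b2 = fit_least_squares(rows, sales)
print(f"y-hat = {b0:.4f} + {b1:.4f} x + {b2:.4f} x^2")
```

A negative coefficient on the squared term corresponds to a curve that rises and then falls, which is the shape built into this invented data.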
Modeling Nonlinear Relationships (Slide 7 of 16)
Figure 7.30: Excel Output for the Reynolds Quadratic Regression Model

Modeling Nonlinear Relationships (Slide 8 of 16)
Figure 7.31: Scatter Chart of the Residuals and Predicted Values of the Dependent Variable for the Reynolds Quadratic Regression Model

Modeling Nonlinear Relationships (Slide 9 of 16)
Piecewise Linear Regression Models:
For the Reynolds data, as an alternative to a quadratic regression model, recognize that below some value of Months Employed the relationship between Months Employed and Sales appears to be positive and linear, whereas for the remaining observations the relationship appears to be negative and linear.
Piecewise linear regression model: This model allows us to fit these relationships as two linear regressions that are joined at the value of Months Employed at which the relationship between Months Employed and Sales changes.

Modeling Nonlinear Relationships (Slide 10 of 16)
Piecewise Linear Regression Models (cont.):
Knot: The value of the independent variable at which the relationship between the dependent variable and the independent variable changes; also called the breakpoint.
For the Reynolds data, the knot is the value of the independent variable Months Employed at which the relationship between Months Employed and Sales changes.
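One common way to fit two joined line segments is to add a term that switches on only above the knot. The sketch below assumes a knot of 90 months and uses noise-free, made-up data. Both are illustrative assumptions, not values taken from the Reynolds data.

```python
import numpy as np

knot = 90.0  # hypothetical knot, assumed known in advance

# Hypothetical noise-free data: slope +3 up to the knot, slope -2 after it.
months = np.arange(10, 171, 10, dtype=float)  # 10, 20, ..., 170
sales = 100 + 3.0 * months - 5.0 * np.maximum(months - knot, 0.0)

# Dummy variable: 0 at or below the knot, 1 above it.
x_k = (months > knot).astype(float)

# Model: y-hat = b0 + b1 * x1 + b2 * (x1 - knot) * x_k
X = np.column_stack([np.ones_like(months), months, (months - knot) * x_k])
b0, b1, b2 = np.linalg.lstsq(X, sales, rcond=None)[0]

slope_below = b1        # slope for months at or below the knot
slope_above = b1 + b2   # slope for months above the knot
print(slope_below, slope_above)
```

Because the extra term equals zero at the knot, the two segments meet there and the fitted line has no jump.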
Modeling Nonlinear Relationships (Slide 12 of 16)
Piecewise Linear Regression Models (cont.):
Define a dummy variable, where $x_1$ is Months Employed and $x^{(k)}$ is the knot:
$x_k = 0$ if $x_1 \le x^{(k)}$, and $x_k = 1$ if $x_1 > x^{(k)}$
Then fit the following estimated regression equation:
$\hat{y} = b_0 + b_1 x_1 + b_2 (x_1 - x^{(k)}) x_k$

Modeling Nonlinear Relationships (Slide 13 of 16)
Figure 7.33: Data and Excel Output for the Reynolds Piecewise Linear Regression Model

Modeling Nonlinear Relationships (Slide 14 of 16)
Interaction Between Independent Variables:
Interaction: This occurs when the relationship between the dependent variable and one independent variable is different at various values of a second independent variable.
The estimated multiple linear regression equation is given as:
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$
where $x_1$ and $x_2$ are the two independent variables and $x_1 x_2$ is their interaction term.

Modeling Nonlinear Relationships (Slide 15 of 16)
Figure 7.34: Mean Unit Sales (1,000s) as a Function of Selling Price and Advertising Expenditures

Modeling Nonlinear Relationships (Slide 16 of 16)
Figure 7.35: Excel Output for the Tyler Personal Care Linear Regression Model with Interaction

Model Fitting
Variable Selection Procedures
Overfitting
Model Fitting (Slide 1 of 10)
Variable Selection Procedures:
Special procedures are sometimes employed to select the independent variables to include in the regression model.
Iterative procedures: At each step of the procedure, a single independent variable is added or removed and the new model is evaluated. Iterative procedures include:
Backward elimination.
Forward selection.
Stepwise selection.
Best subsets procedure: Evaluates regression models involving different subsets of the independent variables.

Model Fitting (Slide 2 of 10)
Variable Selection Procedures (cont.):
Backward elimination procedure:
Begins with the regression model that includes all of the independent variables under consideration.
At each step, backward elimination considers the removal of an independent variable according to some criterion.
Stops when all independent variables in the model are significant at a specified level of significance.

Model Fitting (Slide 3 of 10)
Variable Selection Procedures (cont.):
Forward selection procedure:
Begins with none of the independent variables under consideration included in the regression model.
At each step, forward selection considers the addition of an independent variable according to some criterion.
Stops when there are no independent variables not currently in the model that meet the criterion for being added to the regression model.
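The forward selection steps above can be sketched as a short loop. The slides leave the criterion open, so this sketch assumes one: a variable enters only if its partial F-statistic exceeds a threshold of 4.0, a rough stand-in for a 5% significance test. The function names and the synthetic data are illustrative.

```python
import numpy as np

def sse(X, y):
    """Sum of squared errors from an OLS fit (X already includes the intercept)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return float(resid @ resid)

def forward_selection(X, y, f_to_enter=4.0):
    """Return column indices of X in the order they enter the model."""
    n, p = X.shape
    selected = []
    current = np.ones((n, 1))  # start with the intercept only
    current_sse = sse(current, y)
    while len(selected) < p:
        best_j, best_f, best_sse = None, f_to_enter, None
        for j in range(p):
            if j in selected:
                continue
            trial = np.column_stack([current, X[:, j]])
            trial_sse = sse(trial, y)
            df_resid = n - trial.shape[1]
            # Partial F-statistic for adding column j to the current model.
            f_stat = (current_sse - trial_sse) / (trial_sse / df_resid)
            if f_stat > best_f:
                best_j, best_f, best_sse = j, f_stat, trial_sse
        if best_j is None:  # no remaining candidate meets the entry criterion
            break
        selected.append(best_j)
        current = np.column_stack([current, X[:, best_j]])
        current_sse = best_sse
    return selected

# Synthetic data: only columns 0 and 2 truly affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
selected = forward_selection(X, y)
print(selected)
```

Backward elimination follows the same pattern in reverse: start with every column and repeatedly drop the weakest one until all that remain meet the criterion.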
Model Fitting (Slide 4 of 10)
Variable Selection Procedures (cont.):
Stepwise selection procedure:
Begins with none of the independent variables under consideration included in the regression model.
The analyst establishes both a criterion for allowing independent variables to enter the model and a criterion for allowing independent variables to remain in the model.
To initiate the procedure, the most significant independent variable is added to the empty model if its level of significance satisfies the entering threshold.

Model Fitting (Slide 5 of 10)
Variable Selection Procedures (cont.):
Stepwise selection procedure (cont.):
Each subsequent step involves two intermediate steps:
First, the remaining independent variables not in the current model are evaluated, and the most significant one is added to the model if it satisfies the entering criterion.
Then the independent variables in the current model are evaluated, and the least significant one is removed if it fails the criterion for remaining.
Stops when no independent variable outside the model satisfies the criterion for entering and no independent variable in the model fails the criterion for remaining.

Model Fitting (Slide 6 of 10)
Variable Selection Procedures (cont.):
Best subsets procedure:
Simple linear regressions for each of the independent variables under consideration are generated, then the multiple regressions with all combinations of two independent variables under consideration are generated, and so on.
Once a regression model has been generated for every possible subset of the independent variables under consideration, the entire collection of regression models can be compared and evaluated.
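The best subsets procedure can be sketched by enumerating every non-empty subset of candidate variables and scoring each fitted model. The slides do not name a comparison measure, so adjusted R-squared is used here as an assumption. The function names and synthetic data are illustrative.

```python
import itertools
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R-squared for an OLS fit (X already includes the intercept)."""
    n, k = X.shape  # k counts the intercept column
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    return 1 - (sse / (n - k)) / (sst / (n - 1))

def best_subsets(X, y):
    """Fit every non-empty subset of columns; return (subset, score), best first."""
    n, p = X.shape
    results = []
    for size in range(1, p + 1):
        for subset in itertools.combinations(range(p), size):
            design = np.column_stack([np.ones(n), X[:, list(subset)]])
            results.append((subset, adjusted_r2(design, y)))
    return sorted(results, key=lambda r: r[1], reverse=True)

# Synthetic data: only columns 0 and 2 truly affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
ranked = best_subsets(X, y)
print(ranked[:3])
```

With p candidate variables there are 2^p - 1 models to fit, so the approach becomes expensive quickly as p grows.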
Model Fitting (Slide 7 of 10)
Overfitting:
Overfitting generally results from creating an overly complex model to explain idiosyncrasies in the sample data.
In regression analysis, this often results from the use of complex functional forms or independent variables that do not have meaningful relationships with the dependent variable.
If a model is overfit to the sample data, it will perform better on the sample data used to fit the model than it will on other data from the population.
Thus, an overfit model can be misleading about its predictive capability and its interpretation.

Model Fitting (Slide 8 of 10)
Overfitting (cont.):
How does one avoid overfitting a model?
Use only independent variables that you expect to have real and meaningful relationships with the dependent variable.
Use complex models, such as quadratic models and piecewise linear regression models, only when you have a reasonable expectation that such complexity provides a more accurate depiction of what you are modeling.
Do not let software dictate your model; use iterative modeling procedures, such as the stepwise and best-subsets procedures, only for guidance and not to generate your final model.

Model Fitting (Slide 9 of 10)
Overfitting (cont.):
How does one avoid overfitting a model? (cont.):
If you have access to a sufficient quantity of data, assess your model on data other than the sample data that were used to generate the model (this is referred to as cross-validation).
One possible way to execute cross-validation is the holdout method.
Model Fitting (Slide 10 of 10)
Overfitting (cont.):
Holdout method: The sample data are randomly divided into mutually exclusive and collectively exhaustive training and validation sets.
Training set: The data set used to build the candidate models that appear to make practical sense.
Validation set: The set of data used to compare model performances and ultimately select a model for predicting values of the dependent variable.

Big Data and Regression
Inference and Very Large Samples
Model Selection

Big Data and Regression (Slide 1 of 6)
Inference and Very Large Samples:
Virtually all relationships between independent variables and the dependent variable will be statistically significant if the sample is sufficiently large.
That is, if the sample size is very large, there will be little difference in the estimated regression coefficients from one random sample to the next.

Big Data and Regression (Slide 2 of 6)
Figure 7.36: Excel Regression Output for Credit Card Company Example

Big Data and Regression (Slide 3 of 6)
Table 7.4: Regression Parameter Estimates and the Corresponding p values for 10 Multiple Regression Models, Each Estimated on 50 Observations from the LargeCredit Data
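The effect of sample size on the stability of regression estimates can be illustrated with a small simulation. The sketch below uses synthetic data (not the LargeCredit data) and a hypothetical helper, `slope_estimates`, to compare slope estimates from 10 samples of 50 observations with slope estimates from 10 samples of 10,000 observations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "population" with a true slope of 0.5 (illustrative only).
N = 100_000
x = rng.normal(size=N)
y = 2.0 + 0.5 * x + rng.normal(size=N)

def slope_estimates(sample_size, n_samples=10):
    """Estimate the slope on several random samples of the given size."""
    slopes = []
    for _ in range(n_samples):
        idx = rng.choice(N, size=sample_size, replace=False)
        X = np.column_stack([np.ones(sample_size), x[idx]])
        b, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
        slopes.append(b[1])
    return np.array(slopes)

small = slope_estimates(50)
large = slope_estimates(10_000)

# Sample-to-sample spread of the slope estimate at each sample size.
spread_small = small.std(ddof=1)
spread_large = large.std(ddof=1)
print(spread_small, spread_large)
```

The estimates from the small samples scatter widely around the true slope, while those from the large samples barely differ from one another. This is why even very weak relationships tend to test as statistically significant in very large samples.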
Big Data and Regression (Slide 4 of 6)
Figure 7.37: Excel Regression Output for Credit Card Company Example after Adding Number of Hours per Week Spent Watching Television

Big Data and Regression (Slide 5 of 6)
Model Selection:
When dealing with large samples, it is often more difficult to discern the most appropriate model.
If developing a regression model for explanatory purposes, the practical significance of the estimated regression coefficients should be considered when interpreting the model and considering which variables to keep in the model.
If developing a regression model to make future predictions, the selection of the independent variables to include in the regression model should be based on the predictive accuracy on observations that have not been used to train the model.

Big Data and Regression (Slide 6 of 6)
Figure 7.38: Predictive Accuracy on LargeCredit Validation Set

Prediction with Regression

Prediction with Regression (Slide 1 of 2)
In addition to the point estimate, there are two types of interval estimates associated with the regression equation:
A confidence interval is an interval estimate of the mean y value given values of the independent variables.
A prediction interval is an interval estimate of an individual y value given values of the independent variables.

Prediction with Regression (Slide 2 of 2)
Table 7.5: Predicted Values and 95% Confidence Intervals and Prediction Intervals for 10 New Butler Trucking Routes

End of Chapter 7