Chapter 6 (Part I) Trendlines and Regression Analysis PDF

CHAPTER 6 (PART I) Trendlines and Regression Analysis Prepared by: Nur Liyana Mohamed Yousop CHAPTER OUTLINES Introduction Simple Linear Regression Multiple Linear Regression Regression with Categorical Independent Variable - categorical a non Ex : variable numerical is variable male/female INTRODUCTION SCOPE OF BUSINESS ANALYTICS CHAPTER 4 CHAPTER 6 AND 7 CHAPTER 8 PREDICTIVE ANALYTICAL MODELS Predictive analytics are executed by processing historical data to forecast future happenings. Linear function Logarithmic function Polynomial function Predictive analytical model is developed by using mathematical functions as following: Power function Exponential function PREDICTIVE ANALYTICAL MODELS Linear function Logarithmic function Polynomial function Power function Exponential function y=a+bx y=In(x) y=ax2+bx+c y=axb y=abx Linear functions show steady increases or decreases over the range of x. This is the simplest type of function used in predictive models. E.g. Demand function (price and quantity) Logarithmic functions are used when the rate of change in a variable increases or decreases quickly. E.g. Richter scale used to measure earthquake Polynomial functions are functions that have a quadratic, a cubic, a quartic and other properties (all functions plus, minus, multiplication), taking just non-negative integer power of x. E.g. Business people use polynomials to see how rising of a goods will affect its sales Power functions are defined by single monomials (includes number and variables that are multiplied together, e.g. 3xy) where a≠0 and b>0 Exponential functions come with a property where y increases or decreases at constantly increasing rate. E.g. Continuosly compounding interest (PV and FV) TYPES OF DATA Time series Cross-sectional Pooled @ Panel data Location Time Example 1 n Malaysia GDP from 2009-2019 1 India GDP in 2019 China GDP in 2019 Japan GDP in 2019 n Malaysia GDP from 2009-2019 Japan GDP from 2009-2019 Kore GDP from 2009-2019 n n TYPES OF DATA TIME SERIES CROSS-SECTIONAL DATA POOLED / PANEL DATA Our syllabus MODE L I NG R E L ATI ONSHI P S AND T R E NDS I N DATA For time series data, use a line chart. Create charts to better understand data sets. For crosssectional data, use a scatter chart. MODELING A PRICE-DEMAND FUNCTION Linear demand function: Demand = 20,512 - 9.5116(Price) EXCEL TRENDLINE TOOL ORDINARY LEAST SQUARE REGRESSION TYPES OF VARIABLES Dependent Variable • The variable that depends on other variable/s that is/are measured Independent Variable • The variable that is stable and unaffected by the other variables you are trying to measure. LEAST-SQUARES REGRESSION Ordinary Least Squares regression (OLS) is more commonly named linear regression (simple or multiple depending on the number of explanatory variables). Regression is a powerful analysis that can analyze multiple variables simultaneously to answer complex research questions. However, if OLS assumptions are not satisfied, then results cannot be trusted. LEAST-SQUARES REGRESSION To obtain the best-fitted line, minimize the distance between the actual values and the predicted values through Ordinary Least-Squares method (OLS). Formula OLS_SLR: y = ß0 +ß1x Where; y : Dependent variable ß0 : Intercept (often labeled the constant) is the expected mean value of y when all x=0 ß1 : Slope represents the rate of change in y as x changes. x : Independent variable REGRESSION ANALYSIS Regression analysis Simple linear regression Multiple regression A tool for building mathematical and statistical models that characterize relationships between a dependent (ratio) variable and one or more independent, or explanatory variables (ratio or categorical), all of which are numerical. Involves a single independent variable. Involves two or more independent variables. POPULATION & SAMPLE REGRESSION MODELS Random Sample Population Unknown Relationship 𝒀𝒊 = 𝜷𝟎 + 𝜷𝟏 𝑿𝒊 + 𝜺𝒊 ☺ ☺ ☺ ☺ ☺ ෡𝟎 + 𝜷 ෡ 𝟏 𝑿𝟏 ෡𝒊 = 𝜷 𝒀 ☺ ☺ ☺ The fitted line is best on estimation. There is a difference between actual value ( 𝑦 ) and the predicted value (𝑦). ො 𝑦- 𝑦ො = 𝜀𝑖 Observed error / Residuals TEXTBOOK PAGE: 70 SIMPLE LINEAR REGRESSION SIMPLE LINEAR REGRESSION • Simple linear regression (SLR) is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables: • One variable, denoted x, is regarded as the predictor, explanatory, or independent variable. • The other variable, denoted y, is regarded as the response, outcome, or dependent variable. Independent Variable Dependent Variable SIMPLE LINEAR REGRESSION Finds a linear relationship between: • one independent variable X and; • one dependent variable Y First prepare a scatter plot to verify the data has a linear trend. Use alternative approaches if the data is not linear. OLS_SLR: STEP BY STEP DATA: HOME MARKET VALUE STEP 1: Determine Y (DV) and X (IV) $130,000.00 $110,000.00 Size of a house is typically related to its market value. • Y = market value ($) → DV • X = square feet → IV $100,000.00 STEP 2: Plot the Scatter Plot MARKET VALUES $120,000.00 The scatter plot of the full data set (42 homes) indicates a linear trend. $90,000.00 $80,000.00 $70,000.00 $60,000.00 1,400 1,600 1,800 2,000 SQUARE FEET 2,200 2,400 2,600 How to adjust scale? o Select axis o Right click → Format axis o Axis option → Change scale FINDING THE BEST-FITTING REGRESSION LINE SLR Formula: Market value = ß0 + ß(Square feet) RELATIONSHIP BETWEEN HOME MARKET VALUES AND SIZE OF THE HOUSE (SQUARE FEET) $130,000.00 y = 35.036x + 32673 R² = 0.5347 $120,000.00 Click Chart → Add Chart Elements → Trendline → Linear MARKET VALUES $110,000.00 $100,000.00 Y $90,000.00 Linear (Y) $80,000.00 $70,000.00 $60,000.00 1,400 STEP 3: Find the Best-Fit Regression Line 1,600 1,800 2,000 SQUARE FEET 2,200 2,400 2,600 FINDING THE BEST-FITTING REGRESSION LINE SLR Formula: Market value = ß0 + ß(Square feet) RELATIONSHIP BETWEEN HOME MARKET VALUES AND SIZE OF THE HOUSE (SQUARE FEET) $130,000.00 y = 35.036x + 32673 R² = 0.5347 $120,000.00 MARKET VALUES $110,000.00 $100,000.00 Y $90,000.00 Linear (Y) $80,000.00 • The regression model explains variation in market value due to size of the home. • It provides better estimates of market value than simply using the average. • Market value = 32,673 + $35.036 (Square feet) • The estimated market value of a home with 2,200 square feet would be: $70,000.00 $60,000.00 1,400 STEP 4: Determine the Best Regression Line 1,600 1,800 2,000 SQUARE FEET 2,200 2,400 2,600 • Market value = $32,673 + $35.036(2,200) = $109,752 SIMPLE LINEAR REGRESSION WITH EXCEL Data → Data Analysis → Regression HOME MARKET VALUE REGRESSION RESULTS HOME MARKET VALUE REGRESSION RESULTS REGRESSION STATISTICS Y= bo + b1X Analysis Details Interpretation Multiple R Value range -1 to 1 o Value > 0 : +ve correlation o Value < 0 : -ve correlation o Value = 0 : no correlation 0.7313 > 0, +ve correlation Y (R2) R-Squared / Coefficient of determination X Variation in the DV explained by IV o Value between 0 and 1 o Closer to 1, better fit R2 = 0.5347 53.47% of variation in market values is explained by the size of the house (square feet) HOME MARKET VALUE REGRESSION RESULTS REGRESSION STATISTICS Analysis Adjusted RSquared Standard Error Details Interpretation> do Modified Adjusted R2 = 0.5231 can be ignored o Will be beneficial when the present model is compared with other models that incorporate more explanatory variables. - R2 Standard error of the estimate is the difference between the observed (ACTUAL) and ESTIMATED values. SE will be small if the data is close to regression line. The SE will be big if the data is dispersed widely from the if None have more In than if have 1 IU only HOME MARKET VALUE REGRESSION RESULTS ANALYSIS OF VARIANCE Analysis Details ANOVA is used to test for significance of regression: ANOVA H0: population slope coefficient (𝛽1 ) = 0 H1: population slope coefficient (𝛽1 ) ≠ 0 Interpretation H0: 𝛽1 = 0 (IV has no effect on the DV) H1: 𝛽1 ≠ 0 (IV explains variation in DV) F-test = 3.8E-08 < 0.05 H rejected The significance of F-value given in the 0 The slope is not equal to zero. Using a ANOVA table is the p-value for the F-test. linear relationship, home size (square feet) is a significant variable in explaining If F < the level of significance (normally 5%), variation in market value H0 rejected HOME MARKET VALUE REGRESSION RESULTS ANALYSIS OF VARIANCE Positive slope • y-value in increases as x-values increase Negative • y-value decreases as x-values increase slope Zero slope • y-value stays constant as x-values increase HOME MARKET VALUE REGRESSION RESULTS TESTING HYPOTHESIS FOR REGRESSION COEFFICIENT Analysis Details Interpretation An alternate method for testing whether a slope or intercept is zero p-values = 0.0000 < α=5% is to use a t-test: H0: population slope coefficient (𝛽1 ) = 0 H1: population slope coefficient (𝛽1 ) ≠ 0 T-TEST The test can be computed by using: Regression T-Test ෡ −𝛽1 𝛽 1 = 𝑆tandard 𝐸𝑟𝑟𝑜𝑟 𝑜𝑓 𝑆𝑙𝑜𝑝𝑒 Excel provides the p-values for tests on the slope and intercept. H0 rejected We can conclude that coefficient is statistically not equal to zero. Meaning that home size (square feet) has a significant relationship with market values. HOW TO INTERPRET COEFFICIENT? In our example X is House Size (Square Feet) and Y is Home Value, thus, ෝ = 𝟑𝟐, 𝟔𝟕𝟑. 𝟐𝟏𝟗𝟗 + 𝟑𝟓. 𝟎𝟑𝟔𝟒𝑿 𝒚 For coefficient: If the house size increases by 1 square feet, the home value increases by $35.0364. For intercept: If there is no change in house size, thus, the home value will be $32,673.2199 HOME MARKET VALUE REGRESSION RESULTS CONFIDENCE INTERVAL FOR REGRESSION COEFFICIENT Analysis CONFIDENCE INTERVAL Details Confidence intervals (Lower 95% and Upper 95% values in the output) provide information about the unknown values of the true regression coefficients, accounting for sampling error. Interpretation For the Home Market Value data, it can be concluded that the true intercept and slopes lies between [14,823, 50,523] and [24.59, 45.48] respectively at α=5% level of significance. Although we estimated that a house with 1,750 square feet has a market value of 32,673 + 35.036(1,750) =$93,986, if the true population parameters are at the extremes of Narrower confidence intervals the confidence intervals, the estimate might be as low as provide more accuracy in our 14,823 + 24.59(1,750) = $57,855 or as high as 50,523 + predictions. 45.48(1,750) = $130,113. RESIDUALS • Residuals are the observed errors associated with estimating the value of the dependent variable using the regression line: 𝜀𝑖 = 𝑦𝑖 − 𝑦ො𝑖 RESIDUAL ANALYSIS AND REGRESSION ASSUMPTIONS • Residual (ε) = Actual (Observed) Y value − Predicted Y value • Standard residual = Residual / Standard deviation • Rule of thumb: Standard residuals outside of ±2 or ±3 are potential outliers. • Excel provides a table and a plot of residuals. This point has a standard residual of 4.5336 CHECKING ASSUMPTIONS Assumption Verification Details Linearity • Examine scatter diagram (should appear linear) • Examine residual plot (should appear random) If assumption is met: o Residuals randomly scattered about zero o Do not exhibit a specific pattern • View a histogram of standard residuals • Formal Goodness of Fit Test (e.g. Pearson, Chi-square, Jacque-Bera and others) If assumption is met: o Bell-shaped distribution Linear relationship between IV and DV Normality of Errors Errors of all IVs are normally distributed, mean=0 CHECKING ASSUMPTIONS Assumption Verification Homoscedasticity • Examine the residual plot Constant variance Variance around the regression line is similar for all the IVs Details If assumption is met: o There will not be dramatic differences in the spread of the data for different values of the IVs Independence of Errors • Durbin Watson Statistics If assumption is met: (Autocorrelation) o No autocorrelation, if 1.5 ≤ D ≤ The error term for all IVs should 2.5 not be correlated with one • d takes on values between 0 and 4. A value of d = 2 another. If the do, then the means there is no autocorrelation. A value problem of autocorrelation substantially below 2 (and especially a value less than 1) means that the data is positively persists. autocorrelated. A value of d substantially above 2 means that the data is negatively autocorrelated CHECKING REGRESSION ASSUMPTIONS FOR THE HOME MARKET VALUE DATA • Linearity • linear trend in scatterplot • no pattern in residual plot CONTINUED… • Normality of Errors • Residual histogram appears slightly skewed but is not a serious departure • Data→ Data Analysis → Histogram • Homoscedasticity • Residual plot shows no serious difference in the spread of the data for different X values. Square Feet Residual Plot Histogram-Standard Residual 35000 30 30000 25 25000 15 Frequency 10 5 Residuals Frequency 20000 20 15000 10000 5000 0 -50001,300 1,500 1,700 1,900 -10000 -15000 0 -3 -2 -1 0 1 BIN 2 3 More Square Feet 2,100 2,300 2,500 CONTINUED… • Homoscedasticity - Sama -not met CONTINUED… • Autocorrelation

Chapter 6 (Part I) Trendlines and Regression Analysis PDF

Document Details

Tags

Related

Summary

Full Transcript