Introduction to Linear Regression
Summary
Chapter 7 provides an introduction to linear regression, covering linear models, the line of best fit, and variables in a linear model. It explains how to describe and predict the relationship between variables, even if the data isn't perfect. The document also discusses ordinary least squares (OLS) and conditions for using linear regression.
Chapter 7: Introduction to Linear Regression

The Linear Model
- Linear model: a straight line that represents the relationship between two variables.
- Line of best fit (linear model):
  - A line drawn through scattered data points to show the general direction or trend.
  - Helps predict values for levels not observed in the data (e.g., predicting sales based on advertising).
- Purpose: to describe and predict the relationship between variables, even if the data isn't perfect.

Variables in a Linear Model
- Dependent variable (response): the variable we want to predict or explain (e.g., income, sales).
- Independent variable (explanatory): the variable we use to make predictions (e.g., education, advertising, interest rate).

What Does "Best Fit" Mean?
- The best-fitting line minimizes the differences (errors, or residuals) between the actual data points and the predicted values.
- Goal: draw the line as close as possible to all the data points by minimizing these errors.

Correlation and the Line
- The linear regression equation: ŷ = b₀ + b₁x (similar in form to y = mx + b)
  - ŷ: predicted value
  - b₀: intercept (the value of y when x = 0)
  - b₁: slope (how much y changes for each one-unit change in x)

Ordinary Least Squares (OLS)
- OLS method: the technique used to find the line of best fit (a worked sketch appears at the end of this section).
- Least squares line: the line that minimizes the sum of squared residuals (differences between observed and predicted values).
- Focus: minimize the total squared distance from the data points to the regression line.

Why Not Just Add the Residuals?
- Residuals can be positive or negative:
  - Positive: actual > predicted.
  - Negative: actual < predicted.
- Adding raw residuals might give zero (they cancel out), even if the fit is bad.
- Solution: square the residuals before summing; this avoids cancellation and reflects the actual error more accurately (a numerical check appears at the end of this section).

SSE (Sum of Squared Errors)
- SSE = Σ(y − ŷ)²
- Measures the total squared difference between actual values and predicted values.
- Smaller SSE = better model fit.

Regression Lines and Correlation
- Regression line: the straight line obtained using the least squares method.
- Shows the general relationship between variables, just as correlation shows association.
- The line helps quantify the relationship and make predictions based on it.

Conditions for Using Linear Regression
1. Quantitative variables condition:
   - Both variables must be quantitative (numeric).
   - E.g., valid: age vs. height; invalid: color vs. height.
2. Linearity condition:
   - The relationship must be linear.
   - The data should form a straight-line trend in a scatterplot.
   - If the relationship is curved (e.g., levels off), linear regression won't work well.
3. Outlier condition:
   - Outliers can distort the regression line.
   - One extreme value (e.g., a month with very high ad spending) can skew the model.
   - It is important to identify outliers and assess their impact before using the model.
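
Worked sketch: fitting the least-squares line. The following is a minimal illustration in Python of the OLS ideas above, using the standard closed-form formulas for the slope and intercept of a simple linear regression; the small advertising/sales numbers are made up purely for illustration and are not from the chapter.

```python
# Minimal ordinary least squares (OLS) sketch for simple linear regression.
# The advertising/sales numbers below are made up purely for illustration.

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # independent (explanatory) variable, e.g., ad spending
y = [2.1, 3.9, 6.2, 8.0, 9.8]   # dependent (response) variable, e.g., sales

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Standard closed-form least-squares estimates:
#   b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

print(f"fitted line: y_hat = {b0:.3f} + {b1:.3f} * x")

# Predict a value for a level of x not observed in the data.
x_new = 6.0
y_hat_new = b0 + b1 * x_new
print(f"predicted y at x = {x_new}: {y_hat_new:.3f}")
```

The intercept b₀ and slope b₁ printed here play exactly the roles described in the equation ŷ = b₀ + b₁x: the prediction at x = 6 simply plugs a new x value into the fitted line.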
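
Numerical check: why residuals are squared. The sketch below (again Python, with the same made-up data) fits the least-squares line and then compares the raw sum of residuals, which cancels to essentially zero, with the sum of squared errors (SSE), which reflects the actual size of the errors.

```python
# Why raw residuals are not used to judge fit: positive and negative errors cancel.
# Same made-up data and closed-form least-squares fit as in the sketch above.

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Residual = actual value minus predicted value.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sum_residuals = sum(residuals)         # positive and negative residuals cancel (about 0)
sse = sum(r ** 2 for r in residuals)   # squared residuals do not cancel

print(f"sum of raw residuals: {sum_residuals:.6f}")   # essentially zero
print(f"SSE (sum of squared errors): {sse:.6f}")
```

For a least-squares line with an intercept, the raw residuals sum to zero (up to floating-point rounding), which is exactly why SSE, not the raw sum, is used to measure fit.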