Chapter 7 Regression PDF

Regression
Chapter 7

Learning Objectives
Understand what regression is
How to perform regression in Excel
How to improve regression model prediction
What is logistic regression?
Advantages and disadvantages of regression
Hands-on exercise to perform regression in Excel

What is Regression?
It is a well-known statistical technique for modeling the relationship between several independent variables and one dependent variable
Supervised learning technique
Finds the best-fitting curve for a dependent variable in a multi-dimensional space
The best-fitting curve can be linear (a straight line) or non-linear
The quality of fit is measured by the coefficient of correlation (r)
R² is the proportion of variance explained by the curve, and r is the square root of R²

How much to produce?
You have a friend who owns a pizza shop. You are discussing how much dough will be needed each day.
You both agree that the cooler the weather is, the more pizza will be sold.
You add that this is not the only factor to take into account; other variables can also affect the number of sales.
You decide to run a small experiment by recording the mean temperature during the shop's opening hours and the amount of pizza sold.
You do this every day for the entire summer season.
The summer turns out to be particularly rainy, and the temperature variation is high, which helps you achieve a good range for the variables.
What do you do with the data?

Key Steps for Regression
List all the variables available for making the model.
Establish a Dependent Variable (DV) of interest.
Examine visually (if possible) the relationships between variables of interest.
Find a way to predict the DV using the other variables.

Case Study: Data Driven Prediction Markets
Who is Nate Silver?
New breed of data-based political forecaster, predicting election results using big data and advanced analytics
He predicted that Obama would win the 2012 Presidential election with 291 electoral votes, compared to 247 for Mitt Romney.
His predictions were correct in all 50 states, including the swing states
His predictions were also correct in 31 of the 33 Senate races
Forecasting political elections as a scientific discipline:
Develop hypothesis
Gather all available information
Analyze the data

Correlations and Relationships
Categorize variables that have a relationship with each other, and categorize variables that are distinct and unrelated to other variables
Correlation is a measure of the strength of a relationship
The strength of a correlation is a quantitative measure on a normalized range between 0 (zero) and 1.
A correlation of 1 indicates a perfect relationship, where the two variables are in perfect sync.
A correlation of 0 indicates that there is no relationship between the variables
A relationship can be positive or negative (inverse)
r is the correlation coefficient
The value of r ranges from -1 to +1: the sign gives the direction of the relationship, and 0 means no relationship

Visual Look at Relationships
A scatter plot is a simple diagram for plotting all data points between 2 variables on a 2-dimensional graph
It provides a visual layout of where all the data points are placed in that two-dimensional space.
The scatter plot can be useful for graphically intuiting the relationship between two variables

[Figure: Scatter Plots]

Regression Exercise
The regression model is described as a linear equation: y = β0 + β1 x + ε
y is the dependent variable, the variable being predicted.
x is the independent variable, or the predictor variable.
β0 is the intercept, β1 is the slope, and ε is the error term.
There could be many predictor variables (such as x1, x2, ...) in a regression equation.
There can be only one dependent variable (y) in the regression equation.
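
As a minimal sketch of the ideas above (not part of the original slides), the following Python code computes the correlation coefficient r and the least-squares estimates of β0 and β1 for a single predictor. The data values and the function names `pearson_r` and `fit_line` are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Correlation coefficient r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
b0, b1 = fit_line(x, y)
print(f"r = {r:.4f}, R^2 = {r * r:.4f}")
print(f"y = {b0:.2f} + {b1:.2f} * x")
```

With a single predictor, squaring r gives the R² of the fitted line, which is why the two quantities are reported together in the regression output.
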
A simple example of a regression equation would be to predict a house price from the size of the house.
Let's look at sample house price data.

House Data
House price ($) vs size (sqft)
The two dimensions: House Price (outcome) and Size (predictor)
Plot this using a scatter plot
You can see a positive correlation between House Price and Size (sqft). However, the relationship is not perfect.
Running a regression model between the two variables will provide further details
[Figure: scatter plot of House Price ($0 to $350,000) against Size (Sq ft) (1400 to 2600)]

Correlation and Regression

               Size          House Price
Size           1
House Price    0.891130555   1

The coefficient of correlation is 0.891.
R², the measure of total variance explained by the equation, is 0.794, or 79%.
That means the two variables are moderately and positively correlated.
Regression coefficients help create the following equation for predicting house prices.
House Price ($) = 139.48 * Size (sqft) - 54191
This equation explains only 79% of the variance in house prices.
If other predictor variables are made available, such as the number of rooms in the house, they might help improve the regression model.

House Data (Correlation and Regression)
House Price (outcome), Size (predictor), #Rooms (predictor)

               Size        House Price   #Rooms
Size           1
House Price    0.891131    1
#Rooms         0.748611    0.9442544     1

House price has a strong correlation with the number of rooms (0.944)
Adding this variable to the regression model will add to the strength of the model.
Running a regression model between these three variables produces the following output

Regression Statistics
Multiple R     0.98429955
R Square       0.968845605

               Coefficients
Intercept      12923.73074
#Rooms         23613.1396
Size           65.60625952

It shows that the coefficient of correlation of this regression model is 0.984.
R², the total variance explained by the equation, is 0.969, or 97%.
The variables are positively and very strongly correlated.
Adding a new relevant variable has helped improve the strength of the regression model.

Predict the house price
The regression coefficients help create the following equation for predicting house prices.
House Price ($) = 65.6 * Size (sqft) + 23613 * Rooms + 12924
This predictive equation can be used for future transactions.
Predict the price of a 2000 sqft house with 3 rooms:
House Price ($) = 65.6 * 2000 (sqft) + 23613 * 3 + 12924 = $214,963
The predicted values should be compared with the actual values to see how closely the model predicts the actual value.
As new data points become available, there are opportunities to fine-tune and improve the model.

Non-linear Regression Exercise
The relationship between the variables may also be curvilinear.
For example, given past data on electricity consumption (KwH) and temperature (temp), predict the electricity consumption from the temperature value.
It is visually clear that the line does not fit the data well.
The relationship between temperature and Kwatts follows a curvilinear model, where it hits bottom at a certain value of temperature.
The regression model confirms the weak linear fit, since R is only 0.77 and R² is only 60%.

Non-linear Regression Exercise
The regression model can then be enhanced using a Temp² variable in the equation.
The second line is the relationship between KWH and Temp².
The scatter plot shows that energy consumption has a strong linear relationship with the quadratic Temp² variable.
Running the regression model after adding the quadratic variable leads to the following results:

Regression Statistics
Multiple R     0.992305907
R Square       0.984671012

               Coefficients
Intercept      67245.16853
Temp.sq        15.87004186
Temp           -1911.038841

Predict Energy Consumption
The coefficient of correlation of the regression model is now 0.99.
R², the total variance explained by the equation, is 0.985, or 98.5%.
That means the variables are very strongly and positively correlated.
The regression coefficients help create the following equation for Energy Consumption:
Energy Consumption = 15.87 * Temp² - 1911 * Temp + 67245
Predict the Kwatts value when the temperature is 72 degrees.
Energy Consumption = (15.87 * 72 * 72) - (1911 * 72) + 67245 = 11923 Kwatts

Logistic Regression
Regression models traditionally work with continuous numeric data for dependent and independent variables.
Logistic regression models can work with dependent variables that have binary values, such as whether a loan is approved (yes or no).
Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables.
For example, logistic regression might be used to predict whether a patient has a given disease (e.g. diabetes), based on observed characteristics of the patient (age, gender, body mass index, results of blood tests, etc.).

Logistic Regression
Logistic regression models use probability scores as the predicted values of the dependent variable.
It takes the natural logarithm of the odds of the dependent variable being a case (the logit) to create a continuous criterion (a transformed version of the dependent variable), which is used as the dependent variable.
The dependent variable in logistic regression is binary (i.e. categorical with only two possible values); the logit is the continuous function upon which linear regression is conducted.
[Figure: Example of a general logistic function, with the independent variable on the horizontal axis and the logit dependent variable on the vertical axis]

Advantages of Regression Models
Regression models are easy to understand as they are built upon basic statistical principles such as correlation and least square error.
Regression models provide simple algebraic equations that are easy to understand and use.
The strength (or the goodness of fit) of the regression model is measured in terms of the correlation coefficients and other related statistical parameters that are well understood.
Regression models can match and beat the predictive power of other modeling techniques.
Regression models can include all the variables that one wants to include in the model.
Regression modeling tools are pervasive. They are found in statistical packages as well as data mining packages. MS Excel spreadsheets can also provide simple regression modeling capabilities.

Disadvantages of Regression Models
Regression models cannot compensate for poor data quality. If the data is not prepared well to remove missing values, or is not well-behaved in terms of a normal distribution, the validity of the model suffers.
Regression models suffer from collinearity problems (meaning strong linear correlations among some independent variables). If the independent variables have strong correlations among themselves, they will eat into each other's predictive power and the regression coefficients will lose their robustness. Regression models will not automatically choose between highly collinear variables, although some packages attempt to do that.
Regression models can be unwieldy and unreliable if a large number of variables are included in the model. All variables entered into the model will be reflected in the regression equation, irrespective of their contribution to the predictive power of the model. There is no concept of automatic pruning of the regression model.
Regression models do not automatically take care of non-linearity. The user needs to imagine the kind of additional terms that might need to be added to the regression model to improve its fit.
Regression models work only with numeric data and not with categorical data.

Which Technique to Use?
The choice depends on the type of target variable:
Regression models (continuous target variables)
Classification models (discrete target variables)
For example, if we want to predict a real number or an integer, we use regression, whereas if we are trying to predict a category with a finite number of options, we use classification.

In Class Exercise
Create a regression model to predict the Test2 score from the Test1 score
Predict the score for someone who got a 46 in Test 1
Create a regression model for the liberty data
What is the dependent variable?
What are the independent variables?
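
The two worked predictions from the house price and energy consumption examples earlier in this chapter can be checked with a few lines of Python. This is a minimal sketch rather than part of the original slides: the coefficients are the rounded values quoted in the chapter's equations, while the function names `predict_house_price` and `predict_energy` are hypothetical.

```python
def predict_house_price(size_sqft, rooms):
    """House Price ($) = 65.6 * Size (sqft) + 23613 * Rooms + 12924."""
    return 65.6 * size_sqft + 23613 * rooms + 12924


def predict_energy(temp):
    """Energy Consumption = 15.87 * Temp^2 - 1911 * Temp + 67245."""
    return 15.87 * temp ** 2 - 1911 * temp + 67245


# Inputs taken from the chapter: a 2000 sqft house with 3 rooms,
# and a temperature of 72 degrees.
house = predict_house_price(2000, 3)
energy = predict_energy(72)

print(f"Predicted house price: ${house:,.0f}")
print(f"Predicted energy consumption: {energy:,.0f} Kwatts")
```

Because the equations use rounded coefficients, results computed from the full-precision coefficients in the regression output would differ slightly.
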
