Topic 2c - Correlation & Linear Regression (Student Notes)

BLO1001 Statistics for Business

2c | Data Organisation, Analysis & Interpretation – Correlation and Linear Regression

Describing relationships between variables using correlation analysis and linear regression

Intended Learning Outcomes:
❖ Identify the variables in a linear regression using a spreadsheet application.
❖ Explain the strength of the association between two variables using correlation analysis.
❖ Analyse and interpret the outcomes of the specific variable using linear regression.

Have you ever wondered…

What is a correlation?

When you think about these questions, you’re thinking about how one variable is related to the other. In statistical terms, we call this correlation.

Correlation is used to measure and describe the relationship between 2 variables. It is a very useful tool because it lets us use one variable to predict the other, and it measures how much of the variation in one variable is matched by variation in the other.

To analyse a correlation, we require 2 values from each observation, 1 value for each variable. The variables are usually termed X and Y.

X is the independent variable. This is the variable you know, or can control or manipulate, so it is the predictor or explanatory variable.

Y is the dependent variable. This is the variable you are trying to predict or find out.

Let’s use an example and combine it with a scatterplot diagram to help us understand correlation.

Example 4.1
Suppose a teacher, Mr O’Connor, wants to find out the relationship between 2 variables: students’ family income and students’ average grade. He randomly asked 14 students and collected the responses in Figure 4.1 below.

Figure 4.1: Responses from 14 random students in Year 8
Figure 4.2: Scatterplot diagram from the students’ responses
Source: Gravetter and Wallnau (2017)

In this example, a student’s family income is the independent variable X. This is the variable that is known and is used to predict Y.
The student’s average grade is the dependent variable Y, as it is the variable that we are trying to find out.

Figure 4.2 is a scatterplot diagram. Scatterplots are a great way to check quickly for correlation between pairs of data. At a glance, you can already see that there is a correlation or pattern between family income (X) and the student’s average grade (Y). However, scatterplot diagrams are visual and do not allow us to quantify the strength of the relationship so that comparisons can be made between data sets; hence, we need to look at statistical techniques like correlation and regression.

Coefficient of Correlation

The coefficient of correlation is a statistical measure of the strength of the linear relationship between 2 variables. It is denoted by r. With r, we look at 2 characteristics of the relationship between the variables X and Y:

1 Direction of the relationship
The sign of the correlation, positive or negative, describes the direction of the relationship.

In a positive correlation, the two variables tend to change in the same direction: as the value of the X variable increases from one individual to another, the Y variable also tends to increase; when the X variable decreases, the Y variable also tends to decrease. Some may also term this a direct relationship.
E.g., Increase in temperature = Increase in ice-cream sales

In a negative correlation, the two variables tend to go in opposite directions: as the X variable increases, the Y variable tends to decrease. That is, it is an inverse relationship.
E.g., Increase in days absent from school = Decrease in school grades

2 Strength of the relationship
In a perfect linear relationship, the data points fit exactly on a straight line: every time X increases by one point, the value of Y changes by a consistent amount. As the strength of the relationship increases, the r values get closer to ±1.00.
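To make the direction and strength of r more concrete, here is a minimal Python sketch. It assumes that r is the usual Pearson coefficient of correlation; the notes do not give the formula, so the calculation below is the standard textbook one rather than something taken from these notes. The function name pearson_r and all the data values are hypothetical and are not the values in Figure 4.1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the coefficient of correlation r for paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations: how X and Y vary together.
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations: how much X and Y each vary on their own.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sp / sqrt(ss_x * ss_y)


# Hypothetical observations, made up purely for illustration.
temperature = [24, 26, 28, 30, 32, 34]
ice_cream_sales = [110, 135, 150, 180, 205, 220]
days_absent = [0, 1, 2, 4, 6, 8]
grades = [88, 85, 80, 72, 65, 61]

r_pos = pearson_r(temperature, ice_cream_sales)
r_neg = pearson_r(days_absent, grades)
print(f"temperature vs ice-cream sales: r = {r_pos:+.2f}")
print(f"days absent vs grades:          r = {r_neg:+.2f}")
```

The sign of each result gives the direction (positive for the first pair, negative for the second), and how close the value is to ±1.00 gives the strength.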
Figure 4.3: Scatterplot diagrams depicting strength of correlation
Perfect positive correlation: r is +1.00.
Perfect negative correlation: r is -1.00.
Strong positive correlation: data points are very close to the straight line; r is close to +1.00.
No linear relationship: r is zero (0).

The range of r is from -1.00 to +1.00. The ± sign reflects the direction of the relationship. Based only on absolute values (ignoring the sign), the following table tells us about the strength of the correlation between independent variable X and dependent variable Y:

Range of absolute values    Strength of relationship
0 to 0.1                    No correlation
0.1 to 0.4                  Weak correlation
0.4 to 0.7                  Moderate correlation
More than 0.7               Strong correlation

Considerations when using and interpreting the coefficient of correlation (r)

When you look at the coefficient of correlation, there are some considerations that you need to bear in mind.

1. Correlation ≠ Causation
Correlation simply describes a relationship between two variables. It does not explain why the two variables are related. Correlation is NOT proof of a cause-and-effect relationship between the two variables X and Y. For example, even if the consumption of chocolate (X) and the number of pimples (Y) have a strong correlation, this does not indicate that an increase in the consumption of chocolate caused an increase in pimples.

2. Correlation is affected by range of values in the data set
If the data collected is in a restricted range, this can affect the correlation of the two variables. What do we mean? For example, the correlation between the consumption of chocolate (X) and pimples (Y) can be significantly different if we restrict the data collection to only teenagers between 13 and 17 years old. For the correlation to provide a more accurate description of the population, there should be a wider age range in the data collection phase.

3.
Correlation is affected by outliers
In Topic 2a, we saw how outliers can affect the mean. Similarly, extreme data points can also have a dramatic effect on the value of a correlation. In the example below, without the extreme value, we would state that there is no correlation between variables X and Y. However, when the extreme value is included, it suddenly becomes a strong correlation, as r changes from 0.08 to 0.85.

Figure 4.4: How one extreme value (outlier) can affect the value of the coefficient of correlation
Source: Gravetter and Wallnau (2017)

Coefficient of Determination

When judging how “good” a relationship is, it is tempting for us to focus on the numerical value of the coefficient of correlation (r). For example, r of +0.5 may seem to indicate a moderate relationship since it is halfway between 0 and +1.00. However, a correlation of +0.5 does not imply 50% predictability or 50% prediction accuracy.

To understand how well one variable predicts another (whether it is a good fit), we need to square the correlation coefficient. When we square the coefficient of correlation, this gives us the coefficient of determination (r²).

r² measures the proportion of variability in the dependent variable (Y) that is accounted for by the variation in the independent variable (X). r² is never negative and ranges from 0 to 1.

So, r of +0.5 means that independent variable X has a positive and moderate association with dependent variable Y, but r² = 0.25 means that only about 25% of the variation in the dependent variable (Y) can be accounted for by the variation in the independent variable (X). In other words, the predictable portion is only r² (25%) of the total variability.
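The step from r to r² can be sketched in a few lines of Python. The function name coefficient_of_determination is a hypothetical helper chosen for this illustration; the only calculation involved is squaring r, as described above.

```python
def coefficient_of_determination(r):
    """Square r to get r^2, the share of variation in Y accounted for by X."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1.00 and +1.00")
    return r ** 2


# A few r values; note that the sign of r disappears once it is squared.
for r in (0.5, -0.5, 0.6, 1.0):
    r2 = coefficient_of_determination(r)
    print(f"r = {r:+.2f} -> r^2 = {r2:.2f} ({r2:.0%} of the variation in Y)")
```

Both r = +0.5 and r = -0.5 give r² = 0.25: the direction of the relationship does not change how much of the variation in Y is accounted for.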
Figure 4.5: Different data sets depicting 3 different linear relationships
(a) Independent variable (X) = Shoe size; Dependent variable (Y) = IQ; r = 0, r² = 0
(b) Independent variable (X) = IQ; Dependent variable (Y) = College GPA; r = +0.60, r² = 0.36
(c) Independent variable (X) = Monthly salary; Dependent variable (Y) = Annual salary; r = +1.00, r² = 1.00
Source: Gravetter and Wallnau (2017)

In Figure 4.5a, with r = 0, there is no correlation between a person’s shoe size and IQ level. In addition, we cannot predict one’s IQ using the shoe size because r² = 0 means 0% of the variation in dependent variable Y (IQ) can be explained by the variation in X (shoe size).

Figure 4.5b shows there is a positive and moderate correlation between a student’s IQ and his/her college GPA. From this correlation, it is possible to predict a student’s GPA based on the IQ level, but such a prediction is not perfect. A student with a high IQ will not always have a high GPA. A coefficient of determination (r²) of 0.36 means 36% of the variation in GPA can be explained by differences in IQ levels. Simply put, a student’s IQ helps explain about 36% of why his/her GPA is higher or lower.

The example in Figure 4.5c shows a perfect linear relationship between 2 variables, monthly salary and annual salary, for a group of graduates. An r of +1.00 indicates a perfect correlation between monthly and annual salary. Correspondingly, r² is 1.00, meaning 100% of the variation in annual salary can be explained by the variation in the graduate’s monthly salary. There is 100% predictability, so if we know the graduate’s monthly salary, we will be able to predict the annual salary with perfect accuracy. And if two graduates have different annual salaries, the difference can be completely explained (100%) by the difference in their monthly salaries.
Simple Linear Regression

Understanding the relationship between the variables is fundamental for linear regression. Linear regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. In the context of our syllabus, we focus on simple linear regression, looking at two variables: one independent variable and one dependent variable.

The linear regression equation is represented by:
Y' = a + bX
Where:
Y' is the predicted (average) value of Y for a given X.
a is the Y-intercept, or the estimated Y value when X = 0.
b is the slope of the line, or the amount by which Y' changes for every one-unit change in X.
The least squares principle is used to obtain a and b.

To interpret the linear regression equation Y' = a + bX:
a is the intercept value: when independent variable X is 0, the predicted value of dependent variable Y is a.
b is the slope value: with every additional unit of independent variable X, dependent variable Y will increase (if b is positive) or decrease (if b is negative) by the size of b.

Assume that a 24-hour gym charges a one-time membership fee of $50 and an extra monthly fee of $15 for unlimited use of the facility. With this information, we can use the linear regression equation to calculate the total cost and interpret the relationship between the total cost and the number of months.

Step 1
Let’s first identify the independent and dependent variables.
X = Number of months
Y = Total cost

Step 2
The linear regression equation is Y' = a + bX. So, for this gym, the linear regression equation is:
Y' = 50 + 15X
Where:
a = $50, because when the number of months (independent variable X) is 0, the total cost (dependent variable Y) is $50.
b = $15, because every additional month of using the gym (X) will increase the total cost (Y) by $15.
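As a rough check on the values of a and b in Step 2, the sketch below applies the least squares principle to points generated from the gym’s pricing rule. The notes do not spell out the least squares formulas, so the ones used here (slope = sum of products of deviations divided by the sum of squared deviations of X, intercept = mean of Y minus slope times mean of X) are the standard textbook versions and should be read as an assumption. The function name least_squares_fit is hypothetical.

```python
def least_squares_fit(xs, ys):
    """Return (a, b) for the line Y' = a + bX that minimises squared errors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and sum of squared deviations of X.
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    if ss_x == 0:
        raise ValueError("the slope is undefined when X has no variation")
    b = sp / ss_x            # slope: change in Y' per one-unit change in X
    a = mean_y - b * mean_x  # intercept: value of Y' when X = 0
    return a, b


# Total cost for 0 to 12 months, generated from the gym's pricing rule
# ($50 one-time fee plus $15 per month); these are not survey data.
months = list(range(0, 13))
total_cost = [50 + 15 * m for m in months]

a, b = least_squares_fit(months, total_cost)
print(f"Y' = {a:.2f} + {b:.2f}X")
```

Because the gym’s pricing is an exact straight line, the fitted intercept and slope match the $50 fee and the $15 monthly charge. With real survey data the points would scatter around the line, and a and b would be estimates.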
Step 3
With this linear regression equation, we are able to predict the total amount that a person would have to pay for gym membership based on the number of months that he/she wishes to use the facility.

Suppose you want to know the total cost to use the gym for 9 months:
Y' = 50 + 15X
   = 50 + 15(9)
   = 50 + 135
   = $185
The total amount that you would have to pay is $185.

We have learnt earlier that correlation is affected by the range of values used in the data set. The linear regression equation is likewise only reliable within the range of X values it was built from, because the relationship can change outside that range. This is why we do not substitute an X value beyond the values in our data set.

Suppose now the gym specifies that the current membership rates are guaranteed for 12 months. After this period, the gym may offer a one-time renewal discount and reduce the monthly fees for its loyal customers, but this is not confirmed. Using the linear regression equation Y' = 50 + 15X to predict the total cost for an 18-month membership would then be inappropriate. The equation holds only for the guaranteed 12-month period; applying it to 18 months goes beyond this range, making the prediction unreliable.

Where and Why Correlation and Regression are used

Correlation and regression are very useful statistical concepts that help us to understand how two things are related to each other. If you look closely, you’ll realise that correlation is applied not just in many business situations but in our daily lives too.

Prediction
We use relationships to make predictions. If two variables are known to be related in some systematic way, it is possible to use one of the variables to make accurate predictions about the other.

Validity
Through the direction and strength of the correlation, we can verify the validity of the test/question that we are asking.

Reliability
Correlation can be used to determine reliability.
One way to evaluate reliability is to use correlations to determine the relationship between two sets of measurements. When reliability is high, the correlation between two measurements should be strong.

Linear Regression and the COPAI process

In the data COPAI process, correlation analysis helps in the analysis phase by identifying and quantifying the relationships between variables. This step not only aids in understanding the connections but also helps in deciding the direction of further analysis. Linear regression then comes into play, using the relationships between the variables to build models that predict outcomes. These predictions are then interpreted to inform decision-making processes, thereby integrating statistical insights into practical business strategies and operations.

During the tutorial for this Topic, our Tutors will conduct a demonstration to show how we can use MS Excel to generate Regression output for analysis.

End of Topic 2c
