Correlation in Statistics PDF
Document Details
Uploaded by PreciseParabola
KMU IPMS Peshawar
Summary
This document provides an introduction to correlation in statistics. It explains the concept of correlation coefficients and how to interpret them. It also includes examples of positive and negative correlations, and details how a scatter diagram is used to visualize the relationship between variables.
Full Transcript
Correlation in Statistics

This section shows how to calculate and interpret correlation coefficients for ordinal and interval level scales. Methods of correlation summarise the relationship between two variables in a single number called the correlation coefficient. The correlation coefficient is usually represented by the symbol r, and it ranges from -1 to +1. A correlation coefficient close to 0, whether positive or negative, implies little or no relationship between the two variables. A correlation coefficient close to +1 indicates a positive relationship, with increases in one variable associated with increases in the other. A correlation coefficient close to -1 indicates a negative relationship, with increases in one variable associated with decreases in the other.

A correlation coefficient can be produced for ordinal, interval or ratio level variables, but has little meaning for variables measured on a scale that is no more than nominal. For ordinal scales, the correlation coefficient can be calculated using Spearman's rho. For interval or ratio level scales, the most commonly used correlation coefficient is Pearson's r, ordinarily referred to simply as the correlation coefficient.

What Does Correlation Measure?

In statistics, correlation studies and measures the direction and strength of the relationship between variables. Correlation measures co-variation, not causation, so we should never interpret a correlation as implying a cause-and-effect relationship. For example, if a correlation exists between two variables X and Y, then as the value of one variable changes, the value of the other variable changes either in the same direction (a positive change) or in the opposite direction (a negative change).
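As a quick illustration (not part of the original text), both coefficients mentioned above can be computed in plain Python. The data are made up, and the ranking helper below is a simplified one that ignores ties.

```python
import math

def pearson_r(x, y):
    """Pearson's r: covariance of x and y, scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of the data.

    Simplified ranking with no tie handling, for illustration only.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: hours studied vs. test score
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 68, 70]
print(round(pearson_r(hours, score), 3))   # 0.986: strong positive
print(spearman_rho(hours, score))          # 1.0: perfectly monotonic
```

Because the scores rise strictly with the hours, the rank-based rho is exactly 1 even though the relationship is not perfectly linear.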
Furthermore, if the correlation exists, it is linear, i.e. we can represent the relative movement of the two variables by drawing a straight line on graph paper.

Correlation Coefficient

The correlation coefficient, r, is a summary measure that describes the extent of the statistical relationship between two interval or ratio level variables. It is scaled so that it always lies between -1 and +1. When r is close to 0, there is little relationship between the variables; the farther r is from 0, in either the positive or negative direction, the stronger the relationship. The two variables are often given the symbols X and Y. To illustrate how the two variables are related, the combinations of values of X and Y are pictured in a scatter diagram. The scatter diagram is presented first, and then the method of determining Pearson's r. In the following examples, relatively small sample sizes are used; later, data from larger samples are given.

Scatter Diagram

A scatter diagram shows the values of two variables X and Y, along with the way these two variables relate to each other. The values of X are given along the horizontal axis, and the values of Y along the vertical axis. Later, when the regression model is used, one of the variables is defined as an independent variable and the other as a dependent variable. In regression, the independent variable X is considered to have some effect or influence on the dependent variable Y. Correlation methods are symmetric with respect to the two variables, with no indication of causation or direction of influence being part of the statistical consideration. A scatter diagram is given in the following example; the same example is later used to determine the correlation coefficient.
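Scatter diagrams are normally drawn with plotting software, but a minimal text-mode sketch (invented here purely for illustration) conveys the idea: X along the horizontal axis, Y along the vertical, one mark per observation.

```python
def ascii_scatter(x, y, width=30, height=10):
    """Text scatter diagram: X on the horizontal axis, Y on the vertical.

    Assumes x and y each span a non-zero range.
    """
    xmin, xmax = min(x), max(x)
    ymin, ymax = min(y), max(y)
    grid = [[" "] * width for _ in range(height)]
    for a, b in zip(x, y):
        col = round((a - xmin) / (xmax - xmin) * (width - 1))
        row = round((b - ymin) / (ymax - ymin) * (height - 1))
        grid[height - 1 - row][col] = "*"   # flip so larger Y sits higher
    return "\n".join("".join(r) for r in grid)

# Hypothetical positively related data: the marks climb from lower left
# to upper right.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 7, 9, 11]
print(ascii_scatter(x, y))
```

A positive relationship shows up as marks running from the lower left to the upper right; a negative one runs from upper left to lower right.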
Types of Correlation

The scatter plot shows the correlation between the two attributes or variables, i.e. how closely the two variables are connected. There are three possible situations:

Positive Correlation – the values of the two variables move in the same direction, so that an increase/decrease in the value of one variable is accompanied by an increase/decrease in the value of the other.
Negative Correlation – the values of the two variables move in opposite directions, so that an increase/decrease in the value of one variable is accompanied by a decrease/increase in the value of the other.
No Correlation – there is no linear dependence or relation between the two variables.

Here are some examples of positive correlations:
1. The more time you spend on a project, the more effort you'll have put in.
2. The more money you make, the more taxes you will owe.
3. The nicer you are to employees, the more they'll respect you.
4. The more education you receive, the more knowledgeable you'll be.
5. The more overtime you work, the more money you'll earn.

Here are some examples of negative correlations:
1. The more payments you make on a loan, the less money you'll owe.
2. As the number of your employees decreases, the more job positions you'll have open.
3. The more you work in the office, the less time you'll spend at home.
4. The more employees you hire, the fewer funds you'll have.
5. The more time you spend on one project, the less time you'll have for others.

Here are some examples of variable pairs with zero (or near-zero) correlation:
1. How nicely you treat your employees and the size of their paychecks.
2. How smart you are and how late you arrive at work.
3. How wealthy you are and how happy you are.
4. How early you arrive at work and how many supplies you need.
5. How much money you invest in your business and how early employees leave work.
To compare two datasets, we use correlation formulas.

Pearson Correlation Coefficient Formula

The most common formula is the Pearson correlation coefficient, used for linear dependence between data sets. The value of the coefficient lies between -1 and +1. When the coefficient is zero, the data are considered unrelated; a value of +1 indicates perfect positive correlation, and -1 indicates perfect negative correlation.

r = [n(Σxy) − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}

Where
n = number of data pairs
Σx = sum of the first variable's values
Σy = sum of the second variable's values
Σxy = sum of the products of paired values
Σx² = sum of the squares of the first variable's values
Σy² = sum of the squares of the second variable's values

Linear Correlation Coefficient Formula

The linear correlation coefficient can equivalently be written in deviation form:

r = Σ[(xi − x̄)(yi − ȳ)] / √{Σ(xi − x̄)² · Σ(yi − ȳ)²}

Sample Correlation Coefficient Formula

The formula is given by:

rxy = Sxy / (Sx Sy)

where Sx and Sy are the sample standard deviations and Sxy is the sample covariance.

Population Correlation Coefficient Formula

The population correlation coefficient uses σx and σy as the population standard deviations and σxy as the population covariance:

rxy = σxy / (σx σy)

Regression Analysis

Regression analysis assesses the relationship between an outcome variable and one or more other variables. The outcome variable is known as the dependent or response variable, and the risk factors and confounders are known as predictors or independent variables. The dependent variable is denoted by "y" and the independent variables by "x" in regression analysis.

Simple Linear Regression Equation

As we know, linear regression is used to model the relationship between two variables.
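The sum-based Pearson formula can be applied directly, Σ-quantity by Σ-quantity. The numbers here are invented just to show the arithmetic:

```python
import math

# A small made-up sample.
x = [2, 4, 6, 8, 10]
y = [3, 7, 5, 10, 12]

# The Σ-quantities named in the formula.
n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))   # Σxy
sum_x2 = sum(a * a for a in x)              # Σx²
sum_y2 = sum(b * b for b in y)              # Σy²

# r = [nΣxy − (Σx)(Σy)] / sqrt([nΣx² − (Σx)²][nΣy² − (Σy)²])
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 3))  # 0.91
```

Working through it by hand: n = 5, Σx = 30, Σy = 37, Σxy = 264, Σx² = 220, Σy² = 327, so r = (1320 − 1110) / √(200 · 266) ≈ 0.910.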
Thus, a simple linear regression equation can be written as:

Y = a + bX

Where
Y = dependent variable
X = independent variable
a = [(Σy)(Σx²) − (Σx)(Σxy)] / [n(Σx²) − (Σx)²]
b = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]

Regression Coefficient

For the linear regression line, the equation is given by:

Y = b0 + b1X

Here b0 is a constant and b1 is the regression coefficient. The formula for the regression coefficient is:

b1 = Σ[(xi − x̄)(yi − ȳ)] / Σ[(xi − x̄)²]

The observed data are given by xi and yi; x̄ and ȳ are the means of the respective variables.

There are two regression equations and hence two coefficients of regression. The regression coefficient of y on x is:

byx = r(σy/σx)

The regression coefficient of x on y is:

bxy = r(σx/σy)

Where
σx = standard deviation of x
σy = standard deviation of y

Some properties of regression coefficients are listed below:
1. The regression coefficient is denoted by b.
2. The regression coefficient of y on x is written byx; the regression coefficient of x on y is written bxy.
3. If one of the regression coefficients is greater than 1 in absolute value, the other must be less than 1, since their product equals r² ≤ 1.
4. They are not independent of a change of scale: the regression coefficients change if x and y are multiplied by constants.
5. The arithmetic mean of the two regression coefficients is greater than or equal to the correlation coefficient (when the coefficients are positive).
6. The geometric mean of the two regression coefficients equals the correlation coefficient in absolute value.
7. If bxy is positive, then byx is also positive, and vice versa.
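The relationships stated above between the two regression coefficients and r can be verified on a toy dataset (values made up for the example):

```python
import math

# Made-up sample.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Sums of squared deviations and cross-products.
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

byx = sxy / sxx                     # regression coefficient of y on x
bxy = sxy / syy                     # regression coefficient of x on y
r = sxy / math.sqrt(sxx * syy)      # correlation coefficient

# Intercept of Y = a + bX, using b = byx:
a = my - byx * mx

# Product of the two coefficients equals r squared:
print(round(byx * bxy, 4), round(r ** 2, 4))  # 0.7273 0.7273
```

Since byx · bxy = r², their geometric mean √(byx · bxy) equals |r|, and if either coefficient exceeds 1 in absolute value the other must fall below 1.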