Correlation and Regression PDF
Uploaded by WellBalancedSakura
Summary
This document provides an introduction to correlation and regression analysis. It covers different types of correlation (positive, negative, zero, linear, and non-linear) and discusses the uses of correlation in various fields, including economics and business studies. The document also introduces regression analysis as a statistical tool to study relationships between variables.
Full Transcript
BBA – 3rd Semester Data Analytics

CORRELATION

Introduction: In today's business world we come across many activities that depend on each other. Business problems often involve two or more variables, and identifying these variables and their dependency helps in resolving many such problems. Often two variables move in the same direction, both increasing or both decreasing. At other times an increase in one variable is accompanied by a decline in another. Examples include family income and expenditure, the price of a product and its demand, and advertisement expenditure and sales volume. If two quantities vary in such a way that movements in one are accompanied by movements in the other, these quantities are said to be correlated.

Meaning: Correlation is a statistical technique to ascertain the association or relationship between two or more variables. Correlation analysis studies the degree and direction of that relationship. A correlation coefficient is a statistical measure of the degree to which changes in the value of one variable predict changes in the value of another. When the fluctuation of one variable reliably predicts a similar fluctuation in another, there is often a tendency to think that the change in one causes the change in the other; correlation by itself, however, does not establish causation.

Uses of correlation:
1. Correlation analysis helps in deriving precisely the degree and the direction of such a relationship.
2. The effect of correlation is to reduce the range of uncertainty of our prediction. A prediction based on correlation analysis will be more reliable and nearer to reality.
3.
Correlation analysis contributes to the understanding of economic behaviour, aids in locating the critically important variables on which others depend, may reveal to the economist the connections by which disturbances spread, and may suggest the paths through which stabilizing forces may become effective.
4. Economic theory and business studies show relationships between variables such as price and quantity demanded, or advertising expenditure and sales promotion measures.
5. The coefficient of correlation is a relative measure of change.

Types of Correlation: Correlation is classified in several different ways. Two of the most important are:
I. Positive, Negative and Zero
II. Linear and Non-linear

I. Positive, Negative and Zero Correlation: Whether correlation is positive (direct) or negative (inverse) depends upon the direction of change of the variables.

Positive Correlation: If both variables vary in the same direction, correlation is said to be positive. That is, if one variable increases the other on average also increases, or if one variable decreases the other on average also decreases. For example, the correlation between the heights and weights of a group of persons is positive.

Height (cm), X:  158  160  163  166  168  171  174  176
Weight (kg), Y:   60   62   64   65   67   69   71   72

Negative Correlation: If the variables vary in opposite directions, the correlation is said to be negative. That is, if one variable increases the other decreases, or if one variable decreases the other increases. For example, the correlation between the price of a product and its demand is negative.

Price of Product (Rs.
Per Unit), X:            6    5    4    3    2    1
Demand (in Units), Y:   75  120  175  250  215  400

Zero Correlation: Strictly speaking this is not a type of correlation, but it is still called zero or no correlation. When we find no relationship between the variables, the correlation is said to be zero: a change in the value of one variable is not accompanied by any systematic change in the value of the other. For example, the correlation between the weight of a person and intelligence is zero.

II. Linear and Non-linear Correlation: Depending upon the constancy of the ratio of change between the variables, the correlation may be linear or non-linear.

Linear Correlation: If the amount of change in one variable bears a constant ratio to the amount of change in the other variable, the correlation is said to be linear. If such variables are plotted on graph paper, all the plotted points fall on a straight line. For example, if producing 2 units of finished product requires 10 units of raw material, then producing 4 units requires 20 units of raw material, and so on in the same ratio.

Raw material, X:       10  20  30  40  50  60
Finished Product, Y:    2   4   6   8  10  12

Non-linear Correlation: If the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable, the correlation is said to be non-linear. If such variables are plotted on a graph, the points fall on a curve and not on a straight line. For example, if we double the amount of advertisement expenditure, sales volume would not necessarily double.
Advertisement Expenses, X:  10  20  30  40  50  60
Sales Volume, Y:             2   4   6   8  10  12

Illustration 01: State in each case whether there is (a) Positive Correlation (b) Negative Correlation (c) No Correlation.

Sl No  Particulars                                     Solution
1      Price of commodity and its demand               Negative
2      Yield of crop and amount of rainfall            Positive
3      No. of fruits eaten and hunger of a person      Negative
4      No. of units produced and fixed cost per unit   Negative
5      No. of girls in the class and marks of boys     No Correlation
6      Ages of husbands and wives                      Positive
7      Temperature and sale of woollen garments        Negative
8      Number of cows and milk produced                Positive
9      Weight of person and intelligence               No Correlation
10     Advertisement expenditure and sales volume      Positive

Methods of measurement of correlation: Quantifying the relationship between variables is essential to benefit from the study of correlation. The methods of measurement can be grouped as follows:

Graphic Method
1. Scatter Diagram

Algebraic Method
1. Karl Pearson's Coefficient of Correlation
2. Spearman's Rank Coefficient of Correlation

Scatter Diagram: This is the graphic method of measuring correlation. It is a diagrammatic representation of bivariate data used to ascertain the relationship between two variables. Under this method the given data are plotted on graph paper as dots, i.e. for each pair of X and Y values we put one dot and thus obtain as many points as there are observations. Usually the independent variable is shown on the X-axis and the dependent variable on the Y-axis. Once the values are plotted, the graph reveals the type of correlation between X and Y. A scatter diagram reveals whether the movements in one series are associated with those in the other series.
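The plotting step described above can be sketched in a few lines of Python. This is only a rough illustration, not part of the original notes: the function name text_scatter and the grid size are assumptions made here, and the data are the height and weight figures from the positive correlation example. It assumes the values in each series are not all equal.

```python
def text_scatter(xs, ys, width=40, height=12):
    """Return a crude text scatter diagram with one '*' per (x, y) pair.

    Hypothetical helper for illustration; assumes xs and ys each contain
    at least two distinct values so the scaling below does not divide by zero.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column / row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid[0] is the top of the plot
    return "\n".join("".join(line).rstrip() for line in grid)


heights = [158, 160, 163, 166, 168, 171, 174, 176]  # X, in cm
weights = [60, 62, 64, 65, 67, 69, 71, 72]          # Y, in kg
plot = text_scatter(heights, weights)
print(plot)
```

With this data the stars climb from the lower left to the upper right of the printed grid, which is the pattern the notes describe for positive correlation.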
Perfect Positive Correlation: The points lie on a straight line rising from the lower left-hand corner to the upper right-hand corner.

Perfect Negative Correlation: The points lie on a straight line falling from the upper left-hand corner to the lower right-hand corner.

High Degree of Positive Correlation: The plotted points fall in a narrow band and show a rising tendency from the lower left-hand corner to the upper right-hand corner.

High Degree of Negative Correlation: The plotted points fall in a narrow band and show a declining tendency from the upper left-hand corner to the lower right-hand corner.

Low Degree of Positive Correlation: The points are widely scattered over the diagram but rise from the lower left-hand corner to the upper right-hand corner.

Low Degree of Negative Correlation: The points are widely scattered over the diagram but decline from the upper left-hand corner to the lower right-hand corner.

Zero (No) Correlation: When the plotted points are scattered over the graph haphazardly, it indicates that there is no correlation between the two variables.

[Diagram I to Diagram VII: scatter diagrams]

Illustration 02: Given the following pairs of values:

Capital Employed (Rs. in Crore):  1  2  3  4  5  7   8   9  11  12
Profit (Rs. in Lakhs):            3  5  4  7  9  8  10  11  12  14

(a) Draw a scatter diagram.
(b) Do you think that there is any correlation between profits and capital employed? Is it positive or negative? Is it high or low?

Solution: From the scatter diagram we can say that the variables are positively correlated. The points trend upward from the lower left-hand corner to the upper right-hand corner, hence the correlation is positive.
The plotted points lie in a narrow band, which indicates a high degree of positive correlation.

[Figure: scatter diagram of Profit (Rs. in Lakhs) on the vertical axis against Capital Employed (Rs. in Crore) on the horizontal axis]

Karl Pearson's Coefficient of Correlation: Karl Pearson's method of calculating the coefficient of correlation is based on the covariance of the two variables in a series. This method is widely used in practice and the coefficient of correlation is denoted by the symbol "r". If the two variables under study are X and Y, the following formula suggested by Karl Pearson can be used for measuring the degree of relationship:

$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}}, \qquad x = X - \bar{X}, \quad y = Y - \bar{Y} $$

Illustration 03: From the following information find the correlation coefficient between advertisement expenses and sales volume using Karl Pearson's coefficient of correlation method.

Firm:                                1   2   3   4   5   6   7   8   9  10
Advertisement Exp. (Rs. in Lakhs):  11  13  14  16  16  15  15  14  13  13
Sales Volume (Rs. in Lakhs):        50  50  55  60  65  65  65  60  60  50

Solution: Let advertisement expenses be variable X and sales volume be variable Y.

Calculation of Karl Pearson's coefficient of correlation

Firm     X     Y   x = X - Ẋ    x²   y = Y - Ẏ    y²    xy
1       11    50       -3        9       -8       64    24
2       13    50       -1        1       -8       64     8
3       14    55        0        0       -3        9     0
4       16    60        2        4        2        4     4
5       16    65        2        4        7       49    14
6       15    65        1        1        7       49     7
7       15    65        1        1        7       49     7
8       14    60        0        0        2        4     0
9       13    60       -1        1        2        4    -2
10      13    50       -1        1       -8       64     8
Total  140   580                22               360    70
        ∑X    ∑Y               ∑x²              ∑y²   ∑xy

Ẋ = ∑X / n = 140 / 10 = 14
Ẏ = ∑Y / n = 580 / 10 = 58

$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \, \sum y^2}} = \frac{70}{\sqrt{22 \times 360}} = \frac{70}{88.9944} = 0.7866 $$

Interpretation: The calculation shows a high degree of positive correlation, r = 0.7866, between the two variables, i.e. higher advertisement expenses are associated with higher sales volume.

Illustration 04: Find the correlation coefficient between age and playing habits of the following students using Karl Pearson's coefficient of correlation method.
Age:                  15   16   17   18   19   20
Number of students:  250  200  150  120  100   80
Regular Players:     200  150   90   48   30   12

Solution: To find the correlation between age and playing habits of the students, we first compute the percentage of students who have the playing habit.

Percentage of playing habits = (No. of Regular Players / Total No. of Students) × 100

Now let the ages of the students be variable X and the percentages of playing habits be variable Y.

Calculation of Karl Pearson's coefficient of correlation

Age (X)  No. of Students  Regular Players  % Playing Habits (Y)  X - Ẋ  (X - Ẋ)²  Y - Ẏ  (Y - Ẏ)²  (X - Ẋ)(Y - Ẏ)
15            250              200                 80            -2.5     6.25      30      900        -75
16            200              150                 75            -1.5     2.25      25      625        -37.5
17            150               90                 60            -0.5     0.25      10      100         -5
18            120               48                 40             0.5     0.25     -10      100         -5
19            100               30                 30             1.5     2.25     -20      400        -30
20             80               12                 15             2.5     6.25     -35     1225        -87.5
Total 105                                         300                     17.5             3350       -240
      ∑X                                           ∑Y                     ∑x²              ∑y²        ∑xy

Ẋ = ∑X / n = 105 / 6 = 17.5
Ẏ = ∑Y / n = 300 / 6 = 50

$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} = \frac{-240}{\sqrt{17.5 \times 3350}} = \frac{-240}{242.126} = -0.9912 $$

Interpretation: The calculation shows a high degree of negative correlation, r = -0.9912, between age and playing habits, i.e. playing habits among students decrease as their age increases.

Question: Find Karl Pearson's coefficient of correlation between capital employed and profit obtained from the following data.

Capital Employed (Rs. in Crore):  10  20  30  40  50  60  70  80  90  100
Profit (Rs. in Crore):             2   4   8   5  10  15  14  20  22   50

Question: A computer, while calculating the correlation coefficient between the variables X and Y, obtained the following results:

N = 30; ∑X = 120; ∑X² = 600; ∑Y = 90; ∑Y² = 250; ∑XY = 335

It was, however, later discovered at the time of checking that it had copied down two pairs of observations as:
(X, Y): (8, 10) (12, 7)
while the correct values were:
(X, Y): (8, 12) (10, 8)
Obtain the correct value of the correlation coefficient between X and Y.
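The deviation-from-mean calculation used in Illustrations 03 and 04 can be checked with a short Python sketch. The function name pearson_r and the variable names are assumptions chosen here for illustration; the data are the figures from the two illustrations.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Karl Pearson's r from deviations about the arithmetic means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # x = X - mean of X
    dy = [y - mean_y for y in ys]          # y = Y - mean of Y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)


# Illustration 03: advertisement expenses (X) and sales volume (Y)
advertisement = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]
sales = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]
r_adv = pearson_r(advertisement, sales)

# Illustration 04: age (X) and percentage of regular players (Y)
ages = [15, 16, 17, 18, 19, 20]
students = [250, 200, 150, 120, 100, 80]
players = [200, 150, 90, 48, 30, 12]
percent_playing = [100 * p / s for p, s in zip(players, students)]
r_age = pearson_r(ages, percent_playing)

print(round(r_adv, 4), round(r_age, 4))
```

Rounded to four decimal places, the two results agree with the hand calculations above (0.7866 and -0.9912).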
Spearman's Rank Coefficient of Correlation: When quantification of variables is difficult, as with beauty, leadership ability or a person's knowledge, the method of rank correlation is useful. It was developed by the British psychologist Charles Edward Spearman in 1904. In this method ranks are allotted to each element in either ascending or descending order. The correlation coefficient between the two series of allotted ranks is popularly called "Spearman's Rank Correlation" and is denoted by "R". To find the correlation under this method, the following formula is used:

$$ R = 1 - \frac{6 \sum D^2}{N^3 - N} $$

where D = difference of the ranks between paired items in the two series, and N = number of pairs of ranks.

Illustration 09: Find Spearman's coefficient of correlation between the two kinds of assessment of graduate students' performance in a college.

Name of students:  A   B   C   D   E   F   G   H   I
Internal Exam:    51  68  73  46  50  65  47  38  60
External Exam:    49  72  74  44  58  66  50  30  35

Solution: Calculation of Spearman's Rank Coefficient of Correlation

Name  Internal Exam  Ranks (R1)  External Exam  Ranks (R2)  D = R1 - R2   D²
A          51            5            49            6           -1         1
B          68            2            72            2            0         0
C          73            1            74            1            0         0
D          46            8            44            7            1         1
E          50            6            58            4            2         4
F          65            3            66            3            0         0
G          47            7            50            5            2         4
H          38            9            30            9            0         0
I          60            4            35            8           -4        16
                                                            ∑D² =         26

$$ R = 1 - \frac{6 \sum D^2}{N^3 - N} = 1 - \frac{6 \times 26}{9^3 - 9} = 1 - \frac{156}{720} = 1 - 0.2167 = 0.7833 $$

Interpretation: The calculation shows a high degree of positive correlation, R = 0.7833, between the internal exam and external exam results of the students.

Illustration 10: The coefficient of rank correlation of the marks obtained by 10 students in statistics and accountancy was found to be 0.8. It was later discovered that the difference in ranks in the two subjects obtained by one of the students was wrongly taken as 7 instead of 9. Find the correct coefficient of rank correlation.
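As an aside before the solution, the ranking step and the rank correlation formula from Illustration 09 can be sketched in Python. The function names ranks_descending and spearman_r are assumptions made here, and the sketch assumes there are no tied values, which is true of the Illustration 09 marks.

```python
def ranks_descending(values):
    """Give rank 1 to the largest value; assumes no tied values."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def spearman_r(first, second):
    """Spearman's R = 1 - 6 * sum(D^2) / (N^3 - N) for untied data."""
    r1 = ranks_descending(first)
    r2 = ranks_descending(second)
    n = len(first)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Illustration 09: internal and external exam marks of students A to I
internal = [51, 68, 73, 46, 50, 65, 47, 38, 60]
external = [49, 72, 74, 44, 58, 66, 50, 30, 35]
rank_r = spearman_r(internal, external)
print(ranks_descending(internal), ranks_descending(external), round(rank_r, 4))
```

The printed ranks match the R1 and R2 columns of the Illustration 09 table, and the coefficient rounds to 0.7833.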
Solution:

$$ R = 1 - \frac{6 \sum D^2}{N^3 - N} \;\Rightarrow\; 0.8 = 1 - \frac{6 \sum D^2}{10^3 - 10} = 1 - \frac{6 \sum D^2}{990} $$

$$ \frac{6 \sum D^2}{990} = 1 - 0.8 = 0.2 \;\Rightarrow\; 6 \sum D^2 = 0.2 \times 990 = 198 \;\Rightarrow\; \sum D^2 = \frac{198}{6} = 33 $$

This ∑D² is not correct, so we need to compute the correct value:

Correct ∑D² = 33 - 7² + 9² = 33 - 49 + 81 = 65

Hence, the correct value of the rank coefficient of correlation is:

$$ R = 1 - \frac{6 \sum D^2}{N^3 - N} = 1 - \frac{6 \times 65}{990} = 1 - \frac{390}{990} = 1 - 0.394 = 0.606 $$

Illustration 11: Ten competitors in a beauty contest are ranked by three judges in the following order:

1st Judge:  1   6   5  10   3   2   4   9   7   8
2nd Judge:  3   5   8   4   7  10   2   1   6   9
3rd Judge:  6   4   9   8   1   2   3  10   5   7

Use the rank correlation coefficient to determine which pair of judges has the nearest approach to common tastes in beauty.

Solution: To find which pair of judges has the nearest approach to common tastes in beauty, we compare the rank correlation between the judgements of:
1. 1st Judge and 2nd Judge
2. 2nd Judge and 3rd Judge
3. 1st Judge and 3rd Judge

Calculation of Spearman's Rank Coefficient of Correlation

Rank by 1st   Rank by 2nd   Rank by 3rd
Judge (R1)    Judge (R2)    Judge (R3)    D² = (R1-R2)²   D² = (R2-R3)²   D² = (R1-R3)²
     1             3             6               4               9              25
     6             5             4               1               1               4
     5             8             9               9               1              16
    10             4             8              36              16               4
     3             7             1              16              36               4
     2            10             2              64              64               0
     4             2             3               4               1               1
     9             1            10              64              81               1
     7             6             5               1               1               4
     8             9             7               1               4               1
  N = 10        N = 10        N = 10        ∑D² = 200       ∑D² = 214       ∑D² = 60

1. 1st Judge and 2nd Judge:
$$ R = 1 - \frac{6 \sum D^2}{N^3 - N} = 1 - \frac{6 \times 200}{10^3 - 10} = 1 - \frac{1200}{990} = 1 - 1.2121 = -0.2121 $$

2. 2nd Judge and 3rd Judge:
$$ R = 1 - \frac{6 \sum D^2}{N^3 - N} = 1 - \frac{6 \times 214}{10^3 - 10} = 1 - \frac{1284}{990} = 1 - 1.297 = -0.297 $$

3. 1st Judge and 3rd Judge:
$$ R = 1 - \frac{6 \sum D^2}{N^3 - N} = 1 - \frac{6 \times 60}{10^3 - 10} = 1 - \frac{360}{990} = 1 - 0.3636 = 0.6364 $$

Interpretation: The coefficient of correlation is positive only for the judgements of the first and third judges. Therefore, it can be concluded that the first and third judges have the nearest approach to common tastes in beauty.

Properties of Coefficient of Correlation:
1.
The coefficient of correlation always lies between -1 and +1; symbolically, -1 ≤ r ≤ 1.
2. The coefficient of correlation is independent of change of origin and scale.
3. The coefficient of correlation is a pure number and is independent of the units of measurement. For example, if X represents height in inches and Y represents weight in kg, the correlation coefficient is expressed neither in inches nor in kg but as a pure number.
4. The coefficient of correlation is the geometric mean of the two regression coefficients; symbolically, $r = \pm\sqrt{b_{xy} \cdot b_{yx}}$, with r taking the sign of the regression coefficients.
5. If X and Y are independent variables, the coefficient of correlation is zero.

REGRESSION

Meaning: The study of measuring the relationship between associated variables, wherein one variable is dependent on another independent variable, is called regression. It was developed by Sir Francis Galton in 1877 to measure the relationship between the heights of parents and their children. Regression analysis is a statistical tool to study the nature and extent of the functional relationship between two or more variables and to estimate (or predict) the unknown values of the dependent variable from the known values of the independent variable.

The variable that forms the basis for predicting another variable is known as the independent variable, and the variable that is predicted is known as the dependent variable. For example, if we know that two variables, price (X) and demand (Y), are closely related, we can find the most probable value of X for a given value of Y or the most probable value of Y for a given value of X. Similarly, if we know that the amount of tax and the rise in the price of a commodity are closely related, we can find the expected price for a certain amount of tax levy.

Uses of Regression Analysis:
1. It provides estimates of values of the dependent variable from values of the independent variable.
2.
It is used to obtain a measure of the error involved in using the regression line as a basis for estimation.
3. With the help of regression analysis, we can obtain a measure of the degree of association or correlation that exists between the two variables.
4. It is a highly valuable tool in economics and business research, since most problems of economic analysis are based on cause and effect relationships.

Distinction between Correlation and Regression

1. Correlation: It measures the degree and direction of relationship between the variables.
   Regression: It measures the nature and extent of the average relationship between two or more variables in terms of the original units of the data.
2. Correlation: It is a relative measure showing association between the variables.
   Regression: It is an absolute measure of relationship.
3. Correlation: The correlation coefficient is independent of change of both origin and scale.
   Regression: The regression coefficient is independent of change of origin but not of scale.
4. Correlation: The correlation coefficient is independent of units of measurement.
   Regression: The regression coefficient is not independent of units of measurement.
5. Correlation: The expression of the relationship between the variables ranges from -1 to +1.
   Regression: The expression of the relationship between the variables may be in any of the forms like Y = a + bX or Y = a + bX + cX².
6. Correlation: It is not a forecasting device.
   Regression: It is a forecasting device which can be used to predict the value of the dependent variable from the given value of the independent variable.
7. Correlation: There may be zero correlation, such as between weight of wife and income of husband.
   Regression: There is nothing like zero regression.

Regression Lines and Regression Equation: The terms regression line and regression equation are used synonymously; regression equations are the algebraic expressions of the regression lines. Let us consider two variables, X and Y. If Y depends on X, the result takes the form of simple regression.
For two variables X and Y we have two regression lines: the regression line of X on Y and the regression line of Y on X. The regression line of Y on X gives the most probable value of Y for a given value of X, and the regression line of X on Y gives the most probable value of X for a given value of Y. However, when there is either perfect positive or perfect negative correlation between the two variables, the two regression lines coincide, i.e. we have one line. If the variables are independent, r is zero and the lines of regression are at right angles, i.e. parallel to the X axis and the Y axis.

Therefore, with the simple linear regression model we have the following two regression lines:

1. Regression line of Y on X: This line gives the probable value of Y (dependent variable) for any given value of X (independent variable).

$$ Y - \bar{Y} = b_{yx}\,(X - \bar{X}) \quad \text{or} \quad Y = a + bX $$

2. Regression line of X on Y: This line gives the probable value of X (dependent variable) for any given value of Y (independent variable).

$$ X - \bar{X} = b_{xy}\,(Y - \bar{Y}) \quad \text{or} \quad X = a + bY $$

In the above regression equations there are two regression parameters, "a" and "b". Here "a" is an unknown constant, and "b", also denoted "byx" or "bxy", is another unknown constant popularly called the regression coefficient. These two constants (fixed numerical values) determine the position of the line completely; if the value of either or both of them is changed, another line is determined.

The parameter "a" determines the level of the fitted line (i.e. the distance of the line directly above or below the origin). The parameter "b" determines the slope of the line (i.e. the change in Y for a unit change in X).

If the values of the constants "a" and "b" are obtained, the line is completely determined.
But the question is how to obtain these values. The answer is provided by the method of least squares. With a little algebra and differential calculus, it can be shown that the following two normal equations, if solved simultaneously, yield the values of the parameters "a" and "b".

Two normal equations:

X on Y:
$$ \sum X = Na + b \sum Y, \qquad \sum XY = a \sum Y + b \sum Y^2 $$

Y on X:
$$ \sum Y = Na + b \sum X, \qquad \sum XY = a \sum X + b \sum X^2 $$

This method is popularly known as the direct method. It becomes quite cumbersome when the values of X and Y are large. The work can be simplified if, instead of dealing with the actual values of X and Y, we take the deviations of the X and Y series from their respective means. In that case:

Regression equation Y on X: Y = a + bX changes to (Y - Ẏ) = byx (X - Ẋ)
Regression equation X on Y: X = a + bY changes to (X - Ẋ) = bxy (Y - Ẏ)

In this new form of the regression equation we need to compute only one parameter, "b". This "b", denoted either "byx" or "bxy", is called the regression coefficient.

Regression Coefficient: The quantity "b" in the regression equation is called the regression coefficient or slope coefficient. Since there are two regression equations, we have two regression coefficients:
1. Regression coefficient of X on Y, symbolically written as "bxy"
2. Regression coefficient of Y on X, symbolically written as "byx"

Different formulas used to compute regression coefficients:

Using the correlation coefficient (r) and standard deviations (σ):
$$ b_{xy} = r \, \frac{\sigma_x}{\sigma_y}, \qquad b_{yx} = r \, \frac{\sigma_y}{\sigma_x} $$

Direct method, using sums of X and Y:
$$ b_{xy} = \frac{N \sum XY - \sum X \sum Y}{N \sum Y^2 - (\sum Y)^2}, \qquad b_{yx} = \frac{N \sum XY - \sum X \sum Y}{N \sum X^2 - (\sum X)^2} $$

When deviations are taken from the arithmetic mean, where x = X - Ẋ and y = Y - Ẏ:
$$ b_{xy} = \frac{\sum xy}{\sum y^2}, \qquad b_{yx} = \frac{\sum xy}{\sum x^2} $$

Properties of Regression Coefficients:
1. The coefficient of correlation is the geometric mean of the two regression coefficients. Symbolically, $r = \pm\sqrt{b_{xy} \cdot b_{yx}}$.
2.
If one of the regression coefficients is greater than unity, the other must be less than unity, since the value of the coefficient of correlation cannot exceed unity. For example, if bxy = 1.2 and byx = 1.4, "r" would be √(1.2 × 1.4) = 1.296, which is not possible.
3. Both regression coefficients have the same sign, i.e. they are either both positive or both negative. It is not possible for one regression coefficient to have a minus sign and the other a plus sign.
4. The coefficient of correlation has the same sign as the regression coefficients, i.e. if the regression coefficients are negative, "r" is negative, and if they are positive, "r" is positive. For example, if bxy = -0.2 and byx = -0.8, then r = -√(0.2 × 0.8) = -0.4.
5. The average of the two regression coefficients is greater than or equal to the coefficient of correlation. In symbols, (bxy + byx) / 2 ≥ r. For example, if bxy = 0.8 and byx = 0.4, the average of the two values is (0.8 + 0.4) / 2 = 0.6 and r = √(0.8 × 0.4) = 0.566, which is less than 0.6.
6. Regression coefficients are independent of change of origin but not of scale.

Illustration 01: Find the two regression equations, X on Y and Y on X, from the following data:

X:  10  12  16  11  15  14  20  22
Y:  15  18  23  14  20  17  25  28

Solution: Calculation of Regression Equation

   X     Y     X²      Y²     XY
  10    15    100     225    150
  12    18    144     324    216
  16    23    256     529    368
  11    14    121     196    154
  15    20    225     400    300
  14    17    196     289    238
  20    25    400     625    500
  22    28    484     784    616
 120   160  1,926   3,372  2,542
  ∑X    ∑Y    ∑X²     ∑Y²    ∑XY

Here N = number of elements in either series X or series Y = 8.

We now compute the regression equations using the normal equations.

Regression equation of X on Y: X = a + bY

The two normal equations are:
∑X = Na + b∑Y
∑XY = a∑Y + b∑Y²

Substituting the values in the above normal equations, we get
120 = 8a + 160b .... (i)
2542 = 160a + 3372b ....
(ii)

Let us solve equations (i) and (ii) simultaneously. Multiplying equation (i) by 20 we get 2400 = 160a + 3200b. Subtracting equation (ii) from this:

 2400 = 160a + 3200b
 2542 = 160a + 3372b
(-)    (-)     (-)
 -142 =        -172b

Therefore 172b = 142, so b = 142 / 172 = 0.8256 (rounded off).

Substituting the value of b in equation (i), we get
120 = 8a + (160 × 0.8256)
120 = 8a + 132 (rounded off)
8a = 120 - 132
8a = -12
a = -12 / 8
a = -1.5

Thus a = -1.5 and b = 0.8256.
Hence the required regression equation of X on Y: X = a + bY => X = -1.5 + 0.8256Y

Regression equation of Y on X: Y = a + bX

The two normal equations are:
∑Y = Na + b∑X
∑XY = a∑X + b∑X²

Substituting the values in the above normal equations, we get
160 = 8a + 120b .... (iii)
2542 = 120a + 1926b .... (iv)

Let us solve equations (iii) and (iv) simultaneously. Multiplying equation (iii) by 15 we get 2400 = 120a + 1800b. Subtracting equation (iv) from this:

 2400 = 120a + 1800b
 2542 = 120a + 1926b
(-)    (-)     (-)
 -142 =        -126b

Therefore 126b = 142, so b = 142 / 126 = 1.127 (rounded off).

Substituting the value of b in equation (iii), we get
160 = 8a + (120 × 1.127)
160 = 8a + 135.24
8a = 160 - 135.24
8a = 24.76
a = 24.76 / 8
a = 3.095

Thus a = 3.095 and b = 1.127.
Hence the required regression equation of Y on X: Y = a + bX => Y = 3.095 + 1.127X

Illustration 02: After investigation it has been found that the demand for automobiles in a city depends mainly, if not entirely, upon the number of families residing in that city. Below are the figures for the sales of automobiles in five cities for the year 2019 and the number of families residing in those cities.

City No.
of Families (in lakhs): X   Sale of automobiles (in '000): Y
Belagavi      70   25.2
Bangalore     75   28.6
Hubli         80   30.2
Kalaburagi    60   22.3
Mangalore     90   35.4

Fit a linear regression equation of Y on X by the least squares method and estimate the sales for the year 2020 for the city of Belagavi, which is estimated to have 100 lakh families, assuming that the same relationship holds true.

Solution: Calculation of Regression Equation

City           X      Y       X²       XY
Belagavi      70    25.2    4900     1764
Bangalore     75    28.6    5625     2145
Hubli         80    30.2    6400     2416
Kalaburagi    60    22.3    3600     1338
Mangalore     90    35.4    8100     3186
Total        375   141.7  28,625   10,849
              ∑X     ∑Y      ∑X²      ∑XY

Regression equation of Y on X: Y = a + bX

The two normal equations are:
∑Y = Na + b∑X
∑XY = a∑X + b∑X²

Substituting the values in the above normal equations, we get
141.7 = 5a + 375b .... (i)
10849 = 375a + 28625b .... (ii)

Let us solve equations (i) and (ii) simultaneously. Multiplying equation (i) by 75 we get 10627.5 = 375a + 28125b. Subtracting equation (ii) from this:

 10627.5 = 375a + 28125b
 10849   = 375a + 28625b
(-)       (-)     (-)
 -221.5  =        -500b

Therefore 500b = 221.5, so b = 221.5 / 500 = 0.443.

Substituting the value of b in equation (i), we get
141.7 = 5a + (375 × 0.443)
141.7 = 5a + 166.125
5a = 141.7 - 166.125
5a = -24.425
a = -24.425 / 5
a = -4.885

Thus a = -4.885 and b = 0.443.
Hence the required regression equation of Y on X: Y = a + bX => Y = -4.885 + 0.443X

Estimated sales of automobiles (Y) in the city of Belagavi for the year 2020, where the number of families (X) is 100 (in lakhs):
Y = -4.885 + 0.443X
Y = -4.885 + (0.443 × 100)
Y = -4.885 + 44.3
Y = 39.415 ('000)

That is, estimated sales of automobiles would be 39,415 when the number of families is 100 lakh (1,00,00,000).

Illustration 03: From the following data obtain the two regression lines:

Capital Employed (Rs. in lakh): 7 8 5 9 12 9 10 15
Sales Volume (Rs.
in lakh): 4 5 2 6 9 5 7 12

Solution: Calculation of Regression Equation

   X    Y    X²    Y²    XY
   7    4    49    16    28
   8    5    64    25    40
   5    2    25     4    10
   9    6    81    36    54
  12    9   144    81   108
   9    5    81    25    45
  10    7   100    49    70
  15   12   225   144   180
  75   50   769   380   535
  ∑X   ∑Y   ∑X²   ∑Y²   ∑XY

Ẋ = ∑X / n = 75 / 8 = 9.375
Ẏ = ∑Y / n = 50 / 8 = 6.25

Regression line/equation of X on Y: (X - Ẋ) = bxy (Y - Ẏ)

Regression coefficient of X on Y:
$$ b_{xy} = \frac{n \sum XY - \sum X \sum Y}{n \sum Y^2 - (\sum Y)^2} = \frac{(8 \times 535) - (75 \times 50)}{(8 \times 380) - (50)^2} = \frac{4280 - 3750}{3040 - 2500} = \frac{530}{540} = 0.9815 $$

X - 9.375 = 0.9815 (Y - 6.25)
X - 9.375 = 0.9815Y - 6.1344
X = 9.375 - 6.1344 + 0.9815Y
X = 3.2406 + 0.9815Y

Regression line/equation of Y on X: (Y - Ẏ) = byx (X - Ẋ)

Regression coefficient of Y on X:
$$ b_{yx} = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \frac{(8 \times 535) - (75 \times 50)}{(8 \times 769) - (75)^2} = \frac{4280 - 3750}{6152 - 5625} = \frac{530}{527} = 1.0057 $$

Y - 6.25 = 1.0057 (X - 9.375)
Y - 6.25 = 1.0057X - 9.4284
Y = 6.25 - 9.4284 + 1.0057X
Y = -3.1784 + 1.0057X

Illustration 04: From the following information find the regression equations and estimate the production when the capacity utilisation is 70%.

                              Average (Mean)   Standard Deviation
Production (in lakh units)          42               12.5
Capacity Utilisation (%)            88                8.5
Correlation Coefficient (r)        0.72

Solution: Let production be variable X and capacity utilisation be variable Y.
The regression equation of production based on capacity utilisation is given by X on Y, and the regression equation of capacity utilisation based on production is given by Y on X. They can be computed as follows.

Given information: Ẋ = 42, Ẏ = 88, σx = 12.5, σy = 8.5, r = 0.72

Regression coefficient of X on Y:
$$ b_{xy} = r \, \frac{\sigma_x}{\sigma_y} = 0.72 \times \frac{12.5}{8.5} = 1.0588 $$

Regression coefficient of Y on X:
$$ b_{yx} = r \, \frac{\sigma_y}{\sigma_x} = 0.72 \times \frac{8.5}{12.5} = 0.4896 $$

Regression equation of X on Y: (X - Ẋ) = bxy (Y - Ẏ)
X - 42 = 1.0588 (Y - 88)
X = 42 - 93.1744 + 1.0588Y
X = -51.1744 + 1.0588Y

Regression equation of Y on X: (Y - Ẏ) = byx (X - Ẋ)
Y - 88 = 0.4896 (X - 42)
Y = 88 - 20.5632 + 0.4896X
Y = 67.4368 + 0.4896X

The production when the capacity utilisation is 70% is estimated from the regression equation of X on Y with Y = 70:
X = -51.1744 + 1.0588Y
  = -51.1744 + (1.0588 × 70)
  = -51.1744 + 74.116
  = 22.9416

Therefore, the estimated production would be 22.9416 lakh units (22,94,160 units) when capacity utilisation is 70%.

Illustration 05: The following data give the age and blood pressure (BP) of 10 sports persons.

Name:      A    B    C    D    E    F    G    H    I    J
Age (X):  42   36   55   58   35   65   60   50   48   51
BP (Y):   98   93  110   85  105  108   82  102  118   99

i. Find the regression equations of Y on X and X on Y (use the method of deviations from the arithmetic mean).
ii. Find the correlation coefficient (r) using the regression coefficients.
iii. Estimate the blood pressure of a sports person whose age is 45.
Solution: Calculation of Regression Equation

Name   Age (X)   BP (Y)   x = X - Ẋ    y = Y - Ẏ     x²      y²     xy
                          (= X - 50)   (= Y - 100)
A         42       98        -8           -2         64       4     16
B         36       93       -14           -7        196      49     98
C         55      110         5           10         25     100     50
D         58       85         8          -15         64     225   -120
E         35      105       -15            5        225      25    -75
F         65      108        15            8        225      64    120
G         60       82        10          -18        100     324   -180
H         50      102         0            2          0       4      0
I         48      118        -2           18          4     324    -36
J         51       99         1           -1          1       1     -1
Total    500    1,000         0            0        904   1,120   -128
          ∑X       ∑Y        ∑x           ∑y        ∑x²     ∑y²    ∑xy

Ẋ = ∑X / n = 500 / 10 = 50
Ẏ = ∑Y / n = 1000 / 10 = 100

Regression coefficients can be computed using the following formulas, where x = X - Ẋ and y = Y - Ẏ:

$$ b_{xy} = \frac{\sum xy}{\sum y^2}, \qquad b_{yx} = \frac{\sum xy}{\sum x^2} $$

Regression coefficient of X on Y:
$$ b_{xy} = \frac{\sum xy}{\sum y^2} = \frac{-128}{1120} = -0.1143 $$

Regression coefficient of Y on X:
$$ b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{-128}{904} = -0.1416 $$

Regression equation of X on Y: (X - Ẋ) = bxy (Y - Ẏ)
X - 50 = -0.1143 (Y - 100)
X - 50 = -0.1143Y + 11.43
X = 50 + 11.43 - 0.1143Y
X = 61.43 - 0.1143Y

Regression equation of Y on X: (Y - Ẏ) = byx (X - Ẋ)
Y - 100 = -0.1416 (X - 50)
Y - 100 = -0.1416X + 7.08
Y = 100 + 7.08 - 0.1416X
Y = 107.08 - 0.1416X

Computation of the coefficient of correlation using the regression coefficients (r takes the negative sign of the regression coefficients):

$$ r = -\sqrt{b_{xy} \cdot b_{yx}} = -\sqrt{0.1143 \times 0.1416} = -\sqrt{0.01618488} = -0.1272 $$

Therefore, there is a low degree of negative correlation between the age and blood pressure of the sports persons.

The blood pressure (Y) of a sports person whose age is X = 45 can be estimated using the regression equation of Y on X:
Y = 107.08 - 0.1416X
  = 107.08 - (0.1416 × 45)
  = 107.08 - 6.372
  = 100.708

That is, the estimated blood pressure of a sports person whose age is 45 is 101 (rounded off).

*****
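The deviation method used in Illustration 05 can be verified with a short Python sketch. The function names regression_coefficients and predict_bp are assumptions introduced here for illustration; the data are the age and blood pressure figures from the illustration.

```python
from math import sqrt


def regression_coefficients(xs, ys):
    """Return (mean_x, mean_y, b_yx, b_xy) using deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # x = X - mean of X
    dy = [y - mean_y for y in ys]          # y = Y - mean of Y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    b_yx = sum_xy / sum(a * a for a in dx)  # slope of Y on X
    b_xy = sum_xy / sum(b * b for b in dy)  # slope of X on Y
    return mean_x, mean_y, b_yx, b_xy


ages = [42, 36, 55, 58, 35, 65, 60, 50, 48, 51]
blood_pressure = [98, 93, 110, 85, 105, 108, 82, 102, 118, 99]
mean_x, mean_y, b_yx, b_xy = regression_coefficients(ages, blood_pressure)

# r is the geometric mean of the two coefficients and carries their sign.
sign = -1 if b_yx < 0 else 1
r = sign * sqrt(b_yx * b_xy)


def predict_bp(age):
    """Regression line of Y on X: Y - mean_y = b_yx * (X - mean_x)."""
    return mean_y + b_yx * (age - mean_x)


print(round(b_xy, 4), round(b_yx, 4), round(r, 4), round(predict_bp(45), 3))
```

The rounded outputs agree with the hand calculation: bxy = -0.1143, byx = -0.1416, r = -0.1272 and an estimated blood pressure of 100.708 at age 45.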