Unit V Statistics PDF
This document provides an introduction to statistical concepts including moments, skewness, and kurtosis. It details the theoretical underpinnings and applications, including formulas used in calculations. It also explains how raw moments relate to central moments. The document likely serves as a learning resource for an undergraduate-level statistics course.
Department of Mathematics
UNIT-V STATISTICS

Topic Learning Objectives:
Upon completion of this unit, students will be able to:
- Expand their knowledge of statistical concepts and build the skills needed for statistical data analysis.
- Understand central moments, skewness and kurtosis.
- Describe and evaluate correlation and regression coefficients.
- Investigate the strength and direction of a relationship between two variables by collecting measurements and using appropriate statistical analysis.
- Model a linear relationship between a dependent variable and two or more independent variables.

Introduction:
In many fields of applied mathematics and engineering we meet problems and experiments involving two variables. This chapter considers the mathematical theory of statistics through an elementary treatment of central moments, mean, variance, coefficients of skewness and kurtosis in terms of moments, curve fitting, correlation and regression.

In mathematics, a moment is a specific quantitative measure of the shape of a function. It is used in both mechanics and statistics. If the function represents physical density, then the zeroth moment is the total mass, the first moment divided by the total mass is the center of mass, and the second moment is the rotational inertia. If the function is a probability distribution, then the zeroth moment is the total probability (i.e. one), the first moment is the mean, the second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis.

Moments:
In mechanics, a moment refers to the turning or rotating effect of a force, whereas in statistics it is used to describe the characteristics of a frequency distribution. We can measure the central tendency of a set of observations by using moments.
Moments also help in measuring the scatter, asymmetry and peakedness of the curve of a distribution. A moment is the average of the deviations from the mean, or from some other value, raised to a certain power. The arithmetic means of the various powers of the deviations from the mean are called the moments of the distribution about the mean. Moments about the mean are the ones generally used in statistics.

Third Semester, Statistics (MA231TA)

Moments for ungrouped data:
The rth moment about the origin is denoted by $\mu_r'$ and defined by
$$\mu_r' = \frac{1}{n}\sum_{i=1}^{n} x_i^{\,r}, \quad r = 1, 2, 3, \dots \tag{1}$$
Here $\mu_r'$ is the rth moment of the n observations $x_1, x_2, \dots, x_n$. For r = 1, 2, 3 and 4 we get the first four raw moments about the origin:
$$\mu_1' = \frac{1}{n}\sum x_i, \quad \mu_2' = \frac{1}{n}\sum x_i^2, \quad \mu_3' = \frac{1}{n}\sum x_i^3, \quad \mu_4' = \frac{1}{n}\sum x_i^4.$$
Similarly, the rth moment about the arithmetic mean $\bar{x}$, also called the rth central moment, is denoted by $\mu_r$ and defined by
$$\mu_r = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r, \quad r = 1, 2, 3, \dots \tag{2}$$
For r = 1 we get the first central moment, $\mu_1 = \frac{1}{n}\sum (x_i - \bar{x}) = 0$. For r = 2 we get the second central moment, $\mu_2 = \frac{1}{n}\sum (x_i - \bar{x})^2$, which is equal to the variance.

Moments for grouped data:
Suppose $x_1, x_2, \dots, x_n$ are the mid-points of the class intervals and $f_1, f_2, \dots, f_n$ are the corresponding frequencies. Then the rth moment about the origin is
$$\mu_r' = \frac{1}{N}\sum f_i x_i^{\,r}, \quad r = 1, 2, 3, \dots, \quad \text{where } N = \sum f_i. \tag{3}$$
Similarly, the rth moment about the arithmetic mean is
$$\mu_r = \frac{1}{N}\sum f_i (x_i - \bar{x})^r, \quad r = 1, 2, 3, \dots \tag{4}$$
Also, the rth moment about any point A is again written $\mu_r'$ (the origin is the case A = 0) and defined by
$$\mu_r' = \frac{1}{N}\sum f_i (x_i - A)^r, \quad r = 1, 2, 3, \dots \tag{5}$$
Note: If $d_i = \dfrac{x_i - A}{h}$ or $d_i = \dfrac{x_i - \bar{x}}{h}$ for a common scale factor h (typically the class width), then the rth order moments about an arbitrary point A and about the mean $\bar{x}$ are, respectively,
$$\mu_r' = \frac{h^r}{N}\sum f_i d_i^{\,r}, \qquad \mu_r = \frac{h^r}{N}\sum f_i d_i^{\,r}, \quad r = 1, 2, 3, \dots$$

Relation between raw moments (moments about the origin or any point) and central moments:
The central moments can be expressed in terms of raw moments and vice versa.
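The definitions (3) to (5) translate almost directly into code. The following is a minimal sketch, not part of the original notes: the function names `moment_about` and `central_moment` are illustrative, and the frequency table is the one used in Example 2 of this unit.

```python
def moment_about(xs, fs, r, about=0.0):
    """r-th moment of grouped data about the point `about` (equations 3 and 5)."""
    total = sum(fs)  # N = sum of the frequencies
    return sum(f * (x - about) ** r for x, f in zip(xs, fs)) / total


def central_moment(xs, fs, r):
    """r-th moment about the arithmetic mean (equation 4)."""
    mean = moment_about(xs, fs, 1)  # first raw moment about the origin
    return moment_about(xs, fs, r, about=mean)


# Mid-values and frequencies (for ungrouped data, set every frequency to 1).
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
fs = [1, 8, 28, 56, 70, 56, 28, 8, 1]

mean = moment_about(xs, fs, 1)
mu = [central_moment(xs, fs, r) for r in (1, 2, 3, 4)]
print(mean, mu)  # 4.0 [0.0, 2.0, 0.0, 11.0]
```

Passing a different `about` value gives the moments about an arbitrary point A, which is the quantity the conversion formulas that follow start from.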
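The general relation expressing the moments about the mean in terms of the moments about any point is
$$\mu_r = \mu_r' - \binom{r}{1}\mu_{r-1}'\,\mu_1' + \binom{r}{2}\mu_{r-2}'\,(\mu_1')^2 - \dots + (-1)^r(\mu_1')^r, \quad r = 1, 2, \dots \tag{6}$$
In particular, on putting r = 2, 3 and 4 in equation (6), we get
$$\mu_2 = \mu_2' - (\mu_1')^2,$$
$$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3,$$
$$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4.$$
Conversely,
$$\mu_r' = \mu_r + \binom{r}{1}\mu_{r-1}\,\mu_1' + \binom{r}{2}\mu_{r-2}\,(\mu_1')^2 + \dots + (\mu_1')^r, \quad r = 1, 2, 3, \dots \tag{7}$$
with $\mu_0 = 1$ and $\mu_1 = 0$. In particular, on putting r = 2, 3 and 4 in equation (7), we get
$$\mu_2' = \mu_2 + (\mu_1')^2,$$
$$\mu_3' = \mu_3 + 3\mu_2\mu_1' + (\mu_1')^3,$$
$$\mu_4' = \mu_4 + 4\mu_3\mu_1' + 6\mu_2(\mu_1')^2 + (\mu_1')^4.$$

Example 1: The first four moments of a distribution about the value 4 of the variable are -1.5, 17, -30 and 108. Find the moments about the mean.
Solution: Given A = 4, $\mu_1' = -1.5$, $\mu_2' = 17$, $\mu_3' = -30$, $\mu_4' = 108$. The mean is $A + \mu_1' = 2.5$.
Moments about the mean:
$$\mu_1 = 0,$$
$$\mu_2 = 17 - (-1.5)^2 = 14.75,$$
$$\mu_3 = -30 - 3(17)(-1.5) + 2(-1.5)^3 = 39.75,$$
$$\mu_4 = 108 - 4(-30)(-1.5) + 6(17)(-1.5)^2 - 3(-1.5)^4 = 142.3125.$$

Example 2: Calculate the first four moments of the following distribution about the mean.

| x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| f | 1 | 8 | 28 | 56 | 70 | 56 | 28 | 8 | 1 |

Solution: Here $N = \sum f = 256$ and $\bar{x} = \sum fx / N = 1024/256 = 4$.

| x | f | d = x - x̄ | fd | fd² | fd³ | fd⁴ |
| 0 | 1 | -4 | -4 | 16 | -64 | 256 |
| 1 | 8 | -3 | -24 | 72 | -216 | 648 |
| 2 | 28 | -2 | -56 | 112 | -224 | 448 |
| 3 | 56 | -1 | -56 | 56 | -56 | 56 |
| 4 | 70 | 0 | 0 | 0 | 0 | 0 |
| 5 | 56 | 1 | 56 | 56 | 56 | 56 |
| 6 | 28 | 2 | 56 | 112 | 224 | 448 |
| 7 | 8 | 3 | 24 | 72 | 216 | 648 |
| 8 | 1 | 4 | 4 | 16 | 64 | 256 |
| Total | 256 | | 0 | 512 | 0 | 2816 |

Moments about the mean $\bar{x} = 4$ are
$$\mu_1 = \frac{\sum fd}{N} = 0, \quad \mu_2 = \frac{\sum fd^2}{N} = \frac{512}{256} = 2, \quad \mu_3 = \frac{\sum fd^3}{N} = 0, \quad \mu_4 = \frac{\sum fd^4}{N} = \frac{2816}{256} = 11.$$

Example 3: Wages of workers are given in the following table:

| Wages | 1.5 - 2.5 | 2.5 - 3.5 | 3.5 - 4.5 | 4.5 - 5.5 | 5.5 - 6.5 |
| Workers | 1 | 3 | 7 | 3 | 4 |

Calculate the first four central moments of the distribution.
Solution: Take deviations d = x - 4 from the assumed point A = 4.

| Class | Mid value x | f | fx | d = x - 4 | fd | fd² | fd³ | fd⁴ |
| 1.5-2.5 | 2 | 1 | 2 | -2 | -2 | 4 | -8 | 16 |
| 2.5-3.5 | 3 | 3 | 9 | -1 | -3 | 3 | -3 | 3 |
| 3.5-4.5 | 4 | 7 | 28 | 0 | 0 | 0 | 0 | 0 |
| 4.5-5.5 | 5 | 3 | 15 | 1 | 3 | 3 | 3 | 3 |
| 5.5-6.5 | 6 | 4 | 24 | 2 | 8 | 16 | 32 | 64 |
| Total | | 18 | 78 | | 6 | 26 | 24 | 86 |

Mean of the x values: $\bar{x} = \sum fx / \sum f = 78/18 = 4.3333$.
Because the mean is not 4, the column sums give the moments about A = 4:
$$\mu_1' = \frac{6}{18} = 0.3333, \quad \mu_2' = \frac{26}{18} = 1.4444, \quad \mu_3' = \frac{24}{18} = 1.3333, \quad \mu_4' = \frac{86}{18} = 4.7778.$$
Applying the relations obtained from equation (6):
First central moment $\mu_1 = 0$
Second central moment $\mu_2 = 1.4444 - (0.3333)^2 = 1.3333$
Third central moment $\mu_3 = 1.3333 - 3(1.4444)(0.3333) + 2(0.3333)^3 = -0.0370$
Fourth central moment $\mu_4 = 4.7778 - 4(1.3333)(0.3333) + 6(1.4444)(0.3333)^2 - 3(0.3333)^4 = 3.9259$

Example 4: Wages of workers are given in the following table:

| Wages | 1.5 - 2.5 | 2.5 - 3.5 | 3.5 - 4.5 | 4.5 - 5.5 | 5.5 - 6.5 |
| Workers | 1 | 3 | 7 | 3 | 3 |

Calculate the first four central moments of the distribution.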
Mean of the x values: $\bar{x} = \sum fx / \sum f = 72/17 = 4.2353$
First central moment $\mu_1 = 0$
Second central moment $\mu_2 = 1.2388$
Third central moment $\mu_3 = 0.0537$
Fourth central moment $\mu_4 = 3.6525$

Skewness and Kurtosis:
Averages tell us about the central value of a distribution, and measures of dispersion tell us about the concentration of the items around that central value. These measures do not reveal whether the spread of values on either side of the average is symmetrical. If the observations are arranged symmetrically around a measure of central tendency, we get a symmetrical distribution; otherwise we get an asymmetrical distribution.

Measures of skewness and kurtosis, like measures of central tendency and dispersion, describe the characteristics of a frequency distribution. Skewness measures the degree and direction of departure from symmetry. A symmetrical distribution gives a "symmetrical curve", in which the mean, median and mode are exactly equal. In an asymmetrical distribution, the mean, median and mode are not equal. When two or more symmetrical distributions are compared, the difference between them is studied through kurtosis. When two or more asymmetrical distributions are compared, they show different degrees of skewness. In this elementary treatment the two measures are therefore applied separately: kurtosis to compare the peakedness of symmetrical curves, and skewness to describe departure from symmetry.

Measures of Kurtosis:
Kurtosis gives an idea of the flatness or peakedness of the curve. It is measured by the Karl Pearson coefficient $\beta_2$, given by
$$\beta_2 = \frac{\mu_4}{\mu_2^2}.$$
Kurtosis studies the concentration of the items in the central part of a series. In the following figure, all three curves A, B and C are symmetrical about the mean.
A curve of type "A", which is neither flat nor peaked, is called the normal curve or MESOKURTIC curve ($\beta_2 = 3$). If the items concentrate heavily at the center (more peaked than the normal curve), the curve of type "C" is a LEPTOKURTIC curve ($\beta_2 > 3$). If the concentration at the center is comparatively low (flatter than the normal curve), the curve of type "B" is a PLATYKURTIC curve ($\beta_2 < 3$).

Measures of Skewness:
Literally, skewness means "lack of symmetry". A distribution is said to be skewed if
(i) the mean, median and mode fall at different points;
(ii) the curve drawn from the given data is not symmetrical but stretched more to one side than to the other.

Karl Pearson's coefficient of skewness:
This method is the one most frequently used for measuring skewness. The coefficient of skewness is
$$S_k = \frac{\text{Mean} - \text{Mode}}{\sigma},$$
where σ is the standard deviation of the distribution.
Based upon moments, the coefficient of skewness is defined as
$$S_k = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}, \quad \text{where } \beta_1 = \frac{\mu_3^2}{\mu_2^3} \text{ and } \beta_2 = \frac{\mu_4}{\mu_2^2}.$$

Nature of Skewness:
Skewness can be positive, negative or zero. The direction of skewness is determined by observing whether the mean is greater than the mode (positive skewness) or less than the mode (negative skewness).
(i) When the values of mean, median and mode are equal, there is no skewness.
(ii) When mean > median > mode, skewness is positive.
(iii) When mean < median < mode, skewness is negative.

Characteristics of a good measure of skewness:
1. It should be a pure number, in the sense that its value should be independent of the unit of the series and also of the degree of variation in the series.
2. It should have zero value when the distribution is symmetrical.
3. It should have a meaningful scale of measurement so that the measured value can be interpreted easily.
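As an illustration of the moment-based coefficients above, the sketch below computes $\beta_1$ and $\beta_2$ from given central moments and classifies the curve by its kurtosis. It is a hedged example, not part of the original notes: the function names are hypothetical, the signed value $\mu_3 / \mu_2^{3/2}$ is included only as the usual way of attaching a direction to $\sqrt{\beta_1}$, and the tolerance `tol` used to decide "equal to 3" is an assumption.

```python
import math


def shape_coefficients(mu2, mu3, mu4):
    """Return (beta1, beta2, signed sqrt(beta1)) from central moments."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    signed_skew = math.copysign(math.sqrt(beta1), mu3)  # takes the sign of mu3
    return beta1, beta2, signed_skew


def kurtosis_class(beta2, tol=1e-9):
    """Classify a curve by comparing beta2 with the normal-curve value 3."""
    if abs(beta2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if beta2 > 3 else "platykurtic"


# Central moments of the distribution in Example 2: mu2 = 2, mu3 = 0, mu4 = 11.
beta1, beta2, signed_skew = shape_coefficients(2, 0, 11)
print(beta1, beta2, kurtosis_class(beta2))  # 0.0 2.75 platykurtic
```

For the Example 2 distribution the result says the curve is symmetrical ($\beta_1 = 0$) and slightly flatter than the normal curve.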
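Note: From
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} \tag{*}$$
we observe the following:
- $\mu_3^2$ is never negative, whether $\mu_3$ is positive or negative.
- $\mu_2^3$ is always positive, as $\mu_2$ is the variance.
- $\beta_1$ from (*) is therefore never negative, although skewness itself may be negative.
To overcome this, the measure of skewness is defined by
$$\gamma_1 = \sqrt{\beta_1} = \frac{\mu_3}{\mu_2^{3/2}}.$$
Here the sign of $\gamma_1$ depends on the sign of $\mu_3$. Similarly, the measure of kurtosis is defined by
$$\gamma_2 = \beta_2 - 3.$$

Example 5: Wages of workers are given in the following table:

| Wages | 10-12 | 12-14 | 14-16 | 16-18 | 18-20 | 20-22 | 22-24 |
| Workers | 1 | 3 | 7 | 12 | 12 | 4 | 3 |

Calculate the first four central moments of the distribution. Also compute $\beta_1$ and $\beta_2$.
Solution: Take A = 17 and h = 2.

| Wages | f | Mid-point x | d = (x - 17)/2 | fd | fd² | fd³ | fd⁴ |
| 10-12 | 1 | 11 | -3 | -3 | 9 | -27 | 81 |
| 12-14 | 3 | 13 | -2 | -6 | 12 | -24 | 48 |
| 14-16 | 7 | 15 | -1 | -7 | 7 | -7 | 7 |
| 16-18 | 12 | 17 | 0 | 0 | 0 | 0 | 0 |
| 18-20 | 12 | 19 | 1 | 12 | 12 | 12 | 12 |
| 20-22 | 4 | 21 | 2 | 8 | 16 | 32 | 64 |
| 22-24 | 3 | 23 | 3 | 9 | 27 | 81 | 243 |
| Total | 42 | | | 13 | 83 | 67 | 455 |

Moments about A = 17:
$$\mu_1' = \frac{h\sum fd}{N} = \frac{2(13)}{42} = 0.6190, \quad \mu_2' = \frac{h^2\sum fd^2}{N} = \frac{4(83)}{42} = 7.9048,$$
$$\mu_3' = \frac{h^3\sum fd^3}{N} = \frac{8(67)}{42} = 12.7619, \quad \mu_4' = \frac{h^4\sum fd^4}{N} = \frac{16(455)}{42} = 173.3333.$$
Moments about the mean:
$$\mu_1 = 0,$$
$$\mu_2 = \mu_2' - (\mu_1')^2 = 7.9048 - (0.6190)^2 = 7.5215,$$
$$\mu_3 = \mu_3' - 3\mu_2'\mu_1' + 2(\mu_1')^3 = 12.7619 - 3(7.9048)(0.6190) + 2(0.6190)^3 = -1.4439,$$
$$\mu_4 = \mu_4' - 4\mu_3'\mu_1' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4 = 159.4674.$$
So we have
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = 0.0049, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2} = 2.8188.$$

Exercise:
1. The first four raw moments of a distribution are 2, 136, 320 and 40,000. Find the coefficients of skewness and kurtosis.
2. Find the second, third and fourth central moments of the frequency distribution given below. Hence, find (i) a measure of skewness and (ii) a measure of kurtosis.

| Class limits | Frequency |
| 110 - 115 | 5 |
| 115 - 120 | 15 |
| 120 - 125 | 20 |
| 125 - 130 | 35 |
| 130 - 135 | 10 |
| 135 - 140 | 10 |
| 140 - 145 | 5 |

3. Find the second, third and fourth central moments of the frequency distribution given below. Hence, find (i) a measure of skewness and (ii) a measure of kurtosis.

| x | 5 | 10 | 15 | 20 | 25 | 30 | 35 |
| f | 4 | 10 | 20 | 36 | 16 | 12 | 2 |

4. Compute the first four moments about the mean from the following data. Hence, find (i) a measure of skewness and (ii) a measure of kurtosis.

| Class intervals | 0 - 10 | 10 - 20 | 20 - 30 | 30 - 40 |
| Frequency | 1 | 3 | 4 | 2 |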
Correlation and Regression:
The word correlation is used in everyday life to denote some form of association. In statistical terms we use correlation to denote association between two quantitative variables. We also assume that the association is linear, that is, one variable increases or decreases by a fixed amount for a unit increase or decrease in the other. The other technique often used in these circumstances is regression, which involves estimating the best straight line to summarize the association.

Correlation:
Correlation simply means a relation between two or more variables. Two variables are said to be correlated if a change in one variable results in a corresponding change in the other.
Ex: 1. x: supply, y: price. 2. x: demand, y: price.

Positive correlation:
If an increase (or decrease) in one variable corresponds to an increase (or decrease) in the other, the correlation is said to be positive, or direct.
Ex: 1. Demand and price of a commodity. 2. Income and expenditure.

Negative correlation:
If an increase (or decrease) in one variable corresponds to a decrease (or increase) in the other, the correlation is said to be negative, or inverse.
Ex: 1. Supply and price of a commodity. 2. Volume and pressure of a perfect gas.

No correlation:
If there exists no relationship between two variables, they are said to be uncorrelated.

Scatter diagram:
To obtain a measure of the relationship between two variables x and y, we plot their corresponding values in the xy-plane. The resulting diagram showing the collection of dots is called the dot diagram or scatter diagram.

Correlation Coefficient (Karl Pearson correlation coefficient):
The degree of association is measured by a correlation coefficient, denoted by r. It is sometimes called Karl Pearson's correlation coefficient and is a measure of linear association.
If a curved line is needed to express the relationship, other and more complicated measures of correlation must be used.

Let $x_1, x_2, \dots, x_n$ be n values of x and $y_1, y_2, \dots, y_n$ the corresponding n values of y. Then the coefficient of correlation between x and y is
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n\,\sigma_x \sigma_y},$$
where
$\sigma_x^2 = \frac{1}{n}\sum (x_i - \bar{x})^2$ is the variance of the x series,
$\sigma_y^2 = \frac{1}{n}\sum (y_i - \bar{y})^2$ is the variance of the y series,
$\bar{x} = \frac{1}{n}\sum x_i$ is the mean of the x series, and
$\bar{y} = \frac{1}{n}\sum y_i$ is the mean of the y series.
For computation purposes we can use the formula
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}.$$

Limits for the correlation coefficient:
The coefficient of correlation does not exceed unity numerically ($-1 \le r \le 1$).
Proof: We have
$$r = \frac{\frac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n}\sum (x_i - \bar{x})^2}\,\sqrt{\frac{1}{n}\sum (y_i - \bar{y})^2}}, \quad i = 1, 2, \dots, n.$$
Taking $X_i = x_i - \bar{x}$ and $Y_i = y_i - \bar{y}$,
$$r = \frac{\sum X_i Y_i}{\sqrt{\sum X_i^2}\,\sqrt{\sum Y_i^2}}, \quad \text{so that} \quad r^2 = \frac{\left(\sum X_i Y_i\right)^2}{\sum X_i^2 \sum Y_i^2}. \tag{1}$$
The Schwarz inequality states that if $a_i, b_i$, i = 1, 2, …, n, are real quantities then
$$\left(\sum a_i b_i\right)^2 \le \sum a_i^2 \sum b_i^2,$$
with equality holding if and only if $\frac{a_1}{b_1} = \frac{a_2}{b_2} = \dots = \frac{a_n}{b_n}$. Using this, equation (1) gives $r^2 \le 1$, that is, $|r| \le 1$. Hence the correlation coefficient cannot exceed unity numerically.

Figure 1.1: Correlation illustrated.

Note:
1. If r = -1, there is a perfect negative correlation.
2. If r = 1, there is a perfect positive correlation.
3. If r = 0, the variables are uncorrelated.

RANK CORRELATION
In many practical situations, characteristics are not measurable. They are qualitative, and individuals or items can only be ranked in order of merit. This situation occurs when we deal with qualitative attributes such as honesty, beauty, voice, etc. For example, the contestants of a singing competition may be ranked by a judge according to their performance. In another example, students may be ranked in different subjects according to their performance in tests. The arrangement of individuals or items in order of merit or proficiency in the possession of a certain characteristic is called ranking, and the number indicating the position of an individual or item is known as its rank.
If the ranks of individuals or items are available for two characteristics, then the correlation between the ranks of these two characteristics is known as rank correlation. With the help of rank correlation, we find the association between two qualitative characteristics. Karl Pearson's correlation coefficient gives the intensity of the linear relationship between two variables, whereas Spearman's rank correlation coefficient measures the strength of association between two ranked variables. The formula for Spearman's rank correlation coefficient is given in the following section.

RANK CORRELATION COEFFICIENT FORMULA
Suppose we have a group of n individuals and let $x_1, x_2, \dots, x_n$ and $y_1, y_2, \dots, y_n$ be the ranks of the n individuals in characteristics A and B respectively. Then the rank correlation coefficient is given by
$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}.$$
Here $d_i = x_i - y_i$ is the difference between the ranks assigned in characteristics A and B, and n is the number of pairs of data. This formula was given by Spearman and hence is known as Spearman's rank correlation coefficient formula.

Note 1: When two or more observations have equal values (a tie), it is difficult to assign ranks to them. In such cases, each tied observation is given the average of the ranks the tied observations would otherwise have received, and a different formula is used to calculate the rank correlation coefficient. Spearman's correlation coefficient for tied ranks is
$$r_s = 1 - \frac{6\left[\sum d_i^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \dots\right]}{n(n^2 - 1)},$$
where $m_1, m_2, \dots$ are the numbers of repetitions of the tied ranks and the terms $\frac{1}{12}(m_k^3 - m_k)$ are the corresponding correction factors.
Note 2: $r_s$ lies between -1 and 1.

Examples:
1. If r is the correlation coefficient between x and y, and z = ax + by, show that
$$r = \frac{\sigma_z^2 - (a^2\sigma_x^2 + b^2\sigma_y^2)}{2ab\,\sigma_x\sigma_y}.$$
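Solution: Let z = ax + by. Summing over the n observations and dividing by n gives $\bar{z} = a\bar{x} + b\bar{y}$, so
$$z - \bar{z} = a(x - \bar{x}) + b(y - \bar{y}),$$
$$\sum (z - \bar{z})^2 = a^2\sum (x - \bar{x})^2 + b^2\sum (y - \bar{y})^2 + 2ab\sum (x - \bar{x})(y - \bar{y}).$$
Dividing by n and using $\sum (x - \bar{x})(y - \bar{y}) = n\,r\,\sigma_x\sigma_y$,
$$\sigma_z^2 = a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\,r\,\sigma_x\sigma_y, \quad \text{hence} \quad r = \frac{\sigma_z^2 - (a^2\sigma_x^2 + b^2\sigma_y^2)}{2ab\,\sigma_x\sigma_y}.$$

2. While calculating the correlation coefficient between x and y from 25 pairs of observations, a person obtained the following values: $\sum x = 125$, $\sum y = 100$, $\sum x^2 = 650$, $\sum y^2 = 460$, $\sum xy = 508$. It was later discovered that he had copied down the pairs (8, 12) and (6, 8) as (6, 12) and (8, 6) respectively. Obtain the correct value of the correlation coefficient.
Solution: To get the correct values, we subtract the incorrect values and add the corresponding correct values. The corrected sums are
$\sum x = 125 - 6 - 8 + 8 + 6 = 125$,
$\sum y = 100 - 12 - 6 + 12 + 8 = 102$,
$\sum x^2 = 650 - 6^2 - 8^2 + 8^2 + 6^2 = 650$,
$\sum y^2 = 460 - 12^2 - 6^2 + 12^2 + 8^2 = 488$, and
$\sum xy = 508 - (6)(12) - (8)(6) + (8)(12) + (6)(8) = 532$.
Given n = 25,
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{25(532) - (125)(102)}{\sqrt{\left[25(650) - 125^2\right]\left[25(488) - 102^2\right]}} = \frac{550}{\sqrt{(625)(1796)}} = 0.51912.$$

3. The following table gives the ages (in years) of 10 married couples. Calculate the coefficient of correlation between these ages.

| Age of husband (x) | 23 | 27 | 28 | 29 | 30 | 31 | 33 | 35 | 36 | 39 |
| Age of wife (y) | 18 | 22 | 23 | 24 | 25 | 26 | 28 | 29 | 30 | 32 |

Solution: Here n = 10, $\bar{x} = \frac{1}{n}\sum x_i = 31.1$ and $\bar{y} = \frac{1}{n}\sum y_i = 25.7$. Let $X_i = x_i - \bar{x}$ and $Y_i = y_i - \bar{y}$.

| x | y | X | X² | Y | Y² | XY |
| 23 | 18 | -8.1 | 65.61 | -7.7 | 59.29 | 62.37 |
| 27 | 22 | -4.1 | 16.81 | -3.7 | 13.69 | 15.17 |
| 28 | 23 | -3.1 | 9.61 | -2.7 | 7.29 | 8.37 |
| 29 | 24 | -2.1 | 4.41 | -1.7 | 2.89 | 3.57 |
| 30 | 25 | -1.1 | 1.21 | -0.7 | 0.49 | 0.77 |
| 31 | 26 | -0.1 | 0.01 | 0.3 | 0.09 | -0.03 |
| 33 | 28 | 1.9 | 3.61 | 2.3 | 5.29 | 4.37 |
| 35 | 29 | 3.9 | 15.21 | 3.3 | 10.89 | 12.87 |
| 36 | 30 | 4.9 | 24.01 | 4.3 | 18.49 | 21.07 |
| 39 | 32 | 7.9 | 62.41 | 6.3 | 39.69 | 49.77 |
| Total | | | 202.9 | | 158.1 | 178.3 |

$$r = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} = \frac{178.3}{\sqrt{(202.9)(158.1)}} = 0.9955 \approx 1,$$
i.e., the ages of husbands and wives are almost perfectly correlated.

4. Suppose we have the ranks of 8 B.Sc. students in Statistics and Mathematics. On the basis of these ranks we would like to know to what extent the students' knowledge of Statistics and Mathematics is related.

| Rank in Statistics | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| Rank in Mathematics | 2 | 4 | 1 | 5 | 3 | 8 | 7 | 6 |

Solution: Spearman's rank correlation coefficient formula is
$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}.$$
Let us denote the rank of the students in Statistics by $R_x$ and the rank in Mathematics by $R_y$.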
For the calculation of the rank correlation coefficient, we have to find $\sum d_i^2$, which is obtained from the following table:

| Rank in Statistics (Rx) | Rank in Mathematics (Ry) | Difference of ranks d = Rx - Ry | d² |
| 1 | 2 | -1 | 1 |
| 2 | 4 | -2 | 4 |
| 3 | 1 | 2 | 4 |
| 4 | 5 | -1 | 1 |
| 5 | 3 | 2 | 4 |
| 6 | 8 | -2 | 4 |
| 7 | 7 | 0 | 0 |
| 8 | 6 | 2 | 4 |
| Total | | | 22 |

Here n = number of paired observations = 8, so
$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(22)}{8(63)} = 0.7381.$$
Thus, there is a positive association between the ranks in Statistics and Mathematics.

5. Suppose we have the ranks of 5 students in three subjects, Computer, Physics and Statistics, and we want to test which two subjects have the same trend.

| Rank in Computer | 2 | 4 | 5 | 1 | 3 |
| Rank in Physics | 5 | 1 | 2 | 3 | 4 |
| Rank in Statistics | 2 | 3 | 5 | 4 | 1 |

Solution: We want to see which two subjects have the same trend, i.e., which two subjects have a positive rank correlation coefficient. We have to calculate three rank correlation coefficients:
$r_{12}$ = rank correlation coefficient between the ranks in Computer and Physics,
$r_{23}$ = rank correlation coefficient between the ranks in Physics and Statistics,
$r_{13}$ = rank correlation coefficient between the ranks in Computer and Statistics.
Let $R_1$, $R_2$ and $R_3$ be the ranks of the students in Computer, Physics and Statistics respectively.

| Rank in Computer (R1) | Rank in Physics (R2) | Rank in Statistics (R3) | d12 = R1 - R2 | d12² | d23 = R2 - R3 | d23² | d13 = R1 - R3 | d13² |
| 2 | 5 | 2 | -3 | 9 | 3 | 9 | 0 | 0 |
| 4 | 1 | 3 | 3 | 9 | -2 | 4 | 1 | 1 |
| 5 | 2 | 5 | 3 | 9 | -3 | 9 | 0 | 0 |
| 1 | 3 | 4 | -2 | 4 | -1 | 1 | -3 | 9 |
| 3 | 4 | 1 | -1 | 1 | 3 | 9 | 2 | 4 |
| Total | | | | 32 | | 32 | | 14 |

Thus $\sum d_{12}^2 = 32$, $\sum d_{23}^2 = 32$ and $\sum d_{13}^2 = 14$. Now, with n = 5,
$$r_{12} = 1 - \frac{6(32)}{5(24)} = -0.6, \quad r_{23} = 1 - \frac{6(32)}{5(24)} = -0.6, \quad r_{13} = 1 - \frac{6(14)}{5(24)} = 0.3.$$
$r_{12}$ is negative, which indicates that Computer and Physics have opposite trends. Similarly, the negative rank correlation $r_{23}$ shows opposite trends in Physics and Statistics. $r_{13} = 0.3$ indicates that Computer and Statistics have the same trend.

Sometimes we do not have ranks, but the actual values of the variables are available. If we are interested in the rank correlation coefficient, we first find the ranks from the given values. The next problems illustrate this case.
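The rank correlation computations above, together with the conversion from actual values to ranks described in the last paragraph, can be sketched in a few lines. This is an illustrative sketch rather than part of the original notes: the function names are hypothetical, ranks are assigned in descending order (rank 1 for the largest value) with tied values sharing the average of their ranks, and `spearman` implements only the basic formula without the tie correction terms.

```python
def descending_ranks(values):
    """Rank 1 = largest value; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + 1 + end + 1) / 2  # mean of the 1-based positions
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks


def spearman(rank_x, rank_y):
    """Basic Spearman formula 1 - 6*sum(d^2) / (n(n^2 - 1)), no tie correction."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Ranks of the 8 students in Statistics and Mathematics (example 4 above).
statistics_rank = [1, 2, 3, 4, 5, 6, 7, 8]
mathematics_rank = [2, 4, 1, 5, 3, 8, 7, 6]
print(round(spearman(statistics_rank, mathematics_rank), 4))  # 0.7381

# Turning actual values into ranks before applying the formula.
print(descending_ranks([78, 89, 97, 69, 59, 79, 68]))
```

When ties are present, the correction factors of Note 1 have to be added to the sum of squared differences before applying the formula.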
Example 6: Calculate the rank correlation coefficient from the following data:

| x | 78 | 89 | 97 | 69 | 59 | 79 | 68 |
| y | 125 | 137 | 156 | 112 | 107 | 136 | 124 |

Solution: Assign ranks to the x-series and the y-series in descending order of the given numbers: in the x-series the maximum is 97, so it gets rank 1, the next number 89 gets rank 2, and so on. Ranks are assigned to the y-series in the same way.

| x | y | Rank of x (Rx) | Rank of y (Ry) | d = Rx - Ry | d² |
| 78 | 125 | 4 | 4 | 0 | 0 |
| 89 | 137 | 2 | 2 | 0 | 0 |
| 97 | 156 | 1 | 1 | 0 | 0 |
| 69 | 112 | 5 | 6 | -1 | 1 |
| 59 | 107 | 7 | 7 | 0 | 0 |
| 79 | 136 | 3 | 3 | 0 | 0 |
| 68 | 124 | 6 | 5 | 1 | 1 |
| Total | | | | | 2 |

Spearman's rank correlation formula gives
$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(2)}{7(48)} = 0.9643.$$

Example 7: Calculate the rank correlation coefficient from the following data:

| x | 81 | 78 | 73 | 73 | 69 | 68 | 62 | 58 |
| y | 10 | 12 | 18 | 18 | 18 | 22 | 20 | 24 |

Solution: Ranks are assigned in descending order, with tied values sharing the average of their ranks.

| x | y | Rank of x (Rx) | Rank of y (Ry) | d = Rx - Ry | d² |
| 81 | 10 | 1 | 8 | -7 | 49 |
| 78 | 12 | 2 | 7 | -5 | 25 |
| 73 | 18 | 3.5 | 5 | -1.5 | 2.25 |
| 73 | 18 | 3.5 | 5 | -1.5 | 2.25 |
| 69 | 18 | 5 | 5 | 0 | 0 |
| 68 | 22 | 6 | 2 | 4 | 16 |
| 62 | 20 | 7 | 3 | 4 | 16 |
| 58 | 24 | 8 | 1 | 7 | 49 |
| Total | | | | | 159.5 |

Spearman's rank correlation formula for tied ranks is
$$r_s = 1 - \frac{6\left[\sum d^2 + \frac{1}{12}\left\{(m_1^3 - m_1) + (m_2^3 - m_2)\right\}\right]}{n(n^2 - 1)},$$
where $m_1 = 2$ (two items of x have the equal value 73) and $m_2 = 3$ (three items of y have the value 18). Hence
$$r_s = 1 - \frac{6\left[159.5 + \frac{1}{12}\left\{(2^3 - 2) + (3^3 - 3)\right\}\right]}{8(63)} = 1 - \frac{6(162)}{504} = -0.9286.$$

Regression:
Correlation describes the strength of an association between two variables and is completely symmetrical: the correlation between A and B is the same as the correlation between B and A. However, if the two variables are related, then when one changes by a certain amount the other changes, on average, by a certain amount. The relationship can be represented by a simple equation called the regression equation. In this context "regression" (the term is a historical anomaly) simply means that the average value of y is a "function" of x, that is, it changes with x.
Regression analysis is a mathematical measure of the average relationship between two or more variables in terms of the original units of the data.

Line of regression:
The line of regression is the line which gives the best estimate of the value of one variable for any specific value of the other variable. So the line of regression is the line of best fit.

Method of Least Squares:
Suppose we are given n values $x_1, x_2, \dots, x_n$ of an independent variable x and the corresponding values $y_1, y_2, \dots, y_n$ of a variable y depending on x. Then the pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ give us n points in the xy-plane. Generally, it is not possible to find the actual curve y = f(x) that passes through all these points. Hence, we try to find a curve that serves as the best approximation to y = f(x). Such a curve is referred to as the curve of best fit, and the process of determining it is called curve fitting. One method for finding the curve of best fit is the method of least squares, which requires the curve to pass as closely as possible to all the points.

Let y = f(x) be an approximate relation that fits the data $(x_i, y_i)$. Then the $y_i$ are called the observed values and $Y_i = f(x_i)$ are called the expected values. The differences $E_i = y_i - Y_i$ are called the estimated errors or residuals. The method of least squares provides a relationship y = f(x) such that the sum of the squares of the residuals is least. Such a curve is known as the least squares curve.

Regression line of y on x:
Let the regression line of y on x be y = a + bx. The normal equations given by the method of least squares are
$$\sum y = na + b\sum x, \qquad \sum xy = a\sum x + b\sum x^2.$$
Dividing the first equation by n gives $\bar{y} = a + b\bar{x}$, so the regression line passes through $(\bar{x}, \bar{y})$. Shifting the origin to $(\bar{x}, \bar{y})$ with $X = x - \bar{x}$ and $Y = y - \bar{y}$, we get
$$b = \frac{\sum XY}{\sum X^2} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = r\frac{\sigma_y}{\sigma_x}.$$
Hence
$$y - \bar{y} = r\frac{\sigma_y}{\sigma_x}(x - \bar{x}), \quad \text{i.e. } Y = b_{yx}X,$$
is the regression line of y on x.

Regression line of x on y:
$$x - \bar{x} = r\frac{\sigma_x}{\sigma_y}(y - \bar{y}).$$

Note:
1. Regression coefficient of y on x:
$$b_{yx} = r\frac{\sigma_y}{\sigma_x} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}.$$
2. Regression coefficient of x on y:
$$b_{xy} = r\frac{\sigma_x}{\sigma_y} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2}.$$

Properties of Lines of Regression (Linear Regression):
1. The two regression lines, x on y and y on x, always intersect at the means $(\bar{x}, \bar{y})$.
2. If θ is the angle between the two regression lines, then
$$\tan\theta = \frac{1 - r^2}{r}\cdot\frac{\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2}.$$
(i) When r = 0 (the variables are uncorrelated), $\tan\theta = \infty$, so $\theta = \pi/2$. The two lines of regression are then perpendicular to each other.
(ii) When $r = \pm 1$ (the variables are perfectly correlated), $\tan\theta = 0$, so $\theta = 0$ or $\pi$. The lines of regression then coincide.
3. The coefficient of correlation is the geometric mean of the coefficients of regression, i.e. $r = \pm\sqrt{b_{xy}\,b_{yx}}$.
Note: If $b_{yx}$ and $b_{xy}$ are both positive then r is positive. Similarly, if $b_{yx}$ and $b_{xy}$ are both negative then r is negative.

Examples:
1. If the two regression equations of the variables x and y are x = 19.13 - 0.87y and y = 11.6 - 0.5x, find (a) the mean of x, (b) the mean of y, (c) the correlation coefficient between x and y.
Soln: Since $(\bar{x}, \bar{y})$ lies on both regression lines,
$$\bar{x} = 19.13 - 0.87\bar{y}, \qquad \bar{y} = 11.6 - 0.5\bar{x}.$$
Solving, $\bar{x} \approx 16.00$ and $\bar{y} \approx 3.60$. Here $b_{xy} = -0.87$ and $b_{yx} = -0.5$, so
$$r = -\sqrt{(0.87)(0.5)} = -0.66$$
(since $b_{xy}$ and $b_{yx}$ have negative sign).

2. The following table shows the test scores made by salesmen on an intelligence test and their weekly sales.

| Test scores (x) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| Sales (y) | 2.5 | 6 | 4.5 | 5 | 4.5 | 2 | 5.5 | 3 | 4.5 | 3 |

Calculate the regression line of sales on test scores and estimate the most probable weekly sales volume if a salesman scores 70.
Soln: With $\bar{x}$ and $\bar{y}$ the means of the two series ($\bar{y} = 40.5/10 = 4.05$), the regression line of y on x is $y - \bar{y} = b_{yx}(x - \bar{x})$, which gives y = 0.06x + 0.45. When x = 70, y = 4.65.

3. In a partially destroyed laboratory record of an analysis of correlation data, only the following results are legible: variance of x = 9; regression equations 8x - 10y + 66 = 0 and 40x - 18y = 214. What are (i) the mean values of x and y, (ii) the correlation coefficient between x and y, (iii) the standard deviation of y?
Soln: (i) Since both lines of regression pass through the point $(\bar{x}, \bar{y})$,
$$8\bar{x} - 10\bar{y} + 66 = 0, \qquad 40\bar{x} - 18\bar{y} - 214 = 0.$$
Solving these equations, we get $\bar{x} = 13$, $\bar{y} = 17$.
(ii) Let 8x - 10y + 66 = 0 and 40x - 18y = 214 be the lines of regression of y on x and of x on y respectively. (The opposite assignment would give $b_{yx}b_{xy} > 1$, which is impossible.) Hence
$$b_{yx} = \frac{8}{10} = 0.8, \qquad b_{xy} = \frac{18}{40} = 0.45, \qquad r^2 = b_{yx}\,b_{xy} = 0.36.$$
Since both regression coefficients are positive, we take r = 0.6.
(iii) From $b_{yx} = r\,\sigma_y/\sigma_x$ with $\sigma_x = 3$, the standard deviation of y is $\sigma_y = \dfrac{(0.8)(3)}{0.6} = 4$.

4. The following table gives the stopping distance y (in meters) of a motor bike moving at a speed of x km/hour when the brakes are applied.

| x | 16 | 24 | 32 | 40 | 48 | 56 |
| y | 0.39 | 0.75 | 1.23 | 1.91 | 2.77 | 3.81 |

Find the correlation coefficient between the speed and the stopping distance, and the equations of the regression lines. Hence estimate the maximum speed at which the motor bike could be driven if the stopping distance is not to exceed 5 meters.
Soln: Here $\bar{x} = 36$, $\bar{y} = 1.81$, $\sigma_x \approx 13.66$, $\sigma_y \approx 1.18$, and
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} \approx 0.98.$$
The equation of the line of regression of y on x is
y = 0.0851x - 1.2536 (i)
and the equation of the line of regression of x on y is
x = 11.352y + 15.453. (ii)
For y = 5, equation (ii) gives x = 72.213. Accordingly, for the stopping distance not to exceed 5 meters, the speed must not exceed 72 km/hour.

Multivariate Regression Analysis using least squares estimation of the parameters
When several independent variables are used to estimate the value of the dependent variable, it is called multiple regression. The multiple linear regression model is just an extension of the simple linear regression model. In simple linear regression, we used "x" to represent the explanatory variable. In multiple linear regression, we have more than one explanatory variable. Let an experiment be conducted n times, with the data obtained as follows:

| Observation number | Response Y | X1 | X2 | … | Xk |
| 1 | y1 | x11 | x12 | … | x1k |
| 2 | y2 | x21 | x22 | … | x2k |
| ⁝ | ⁝ | ⁝ | ⁝ | ⋱ | ⁝ |
| n | yn | xn1 | xn2 | … | xnk |

Assume that the model is
$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k,$$
where y is an observed value of the response variable for a particular observation in the population, and $\beta_0, \beta_1, \dots, \beta_k$ are parameters which are to be determined.
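The n-tuples of observations are also assumed to follow the same model. Thus they satisfy
$$y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \dots + \beta_k x_{1k}$$
$$y_2 = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \dots + \beta_k x_{2k}$$
$$\vdots$$
$$y_n = \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \dots + \beta_k x_{nk}.$$
These n equations can be written as
$$y_i = \beta_0 + \sum_{j=1}^{k}\beta_j x_{ij}, \quad i = 1, 2, \dots, n.$$
Using the least squares principle, we get the following normal equations:
$$\sum y = n\beta_0 + \beta_1\sum x_1 + \beta_2\sum x_2 + \dots + \beta_k\sum x_k$$
$$\sum x_1 y = \beta_0\sum x_1 + \beta_1\sum x_1^2 + \beta_2\sum x_1 x_2 + \dots + \beta_k\sum x_1 x_k$$
$$\vdots$$
$$\sum x_k y = \beta_0\sum x_k + \beta_1\sum x_k x_1 + \beta_2\sum x_k x_2 + \dots + \beta_k\sum x_k^2$$
Solving the above normal equations, we get the values of $\beta_0, \beta_1, \dots, \beta_k$.

Example:
1. A company produces two different items A and B. The data below show the sales of these items in one day and the profit made by the company on that day.

| Sales of item A ($x_1$) | 8 | 11 | 9 | 8 | 6 | 10 | 7 |
| Sales of item B ($x_2$) | 6 | 4 | 5 | 7 | 1 | 1 | 0 |
| Profit (y) | 93.26 | 89.76 | 60.78 | 79.34 | 28.23 | 75.83 | 32.74 |

Fit the best multilinear model that represents the relationship between the sales of A and B and the profit.
Solution: The normal equations corresponding to the regression equation $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ are:
$$\sum y = n\beta_0 + \beta_1\sum x_1 + \beta_2\sum x_2$$
$$\sum x_1 y = \beta_0\sum x_1 + \beta_1\sum x_1^2 + \beta_2\sum x_1 x_2$$
$$\sum x_2 y = \beta_0\sum x_2 + \beta_1\sum x_1 x_2 + \beta_2\sum x_2^2$$

| x1 | x2 | y | x1² | x2² | x1·y | x2·y | x1·x2 |
| 8 | 6 | 93.26 | 64 | 36 | 746.08 | 559.56 | 48 |
| 11 | 4 | 89.76 | 121 | 16 | 987.36 | 359.04 | 44 |
| 9 | 5 | 60.78 | 81 | 25 | 547.02 | 303.9 | 45 |
| 8 | 7 | 79.34 | 64 | 49 | 634.72 | 555.38 | 56 |
| 6 | 1 | 28.23 | 36 | 1 | 169.38 | 28.23 | 6 |
| 10 | 1 | 75.83 | 100 | 1 | 758.3 | 75.83 | 10 |
| 7 | 0 | 32.74 | 49 | 0 | 229.18 | 0 | 0 |
| ∑ = 59 | 24 | 459.94 | 515 | 128 | 4072.04 | 1881.94 | 209 |

Substituting these sums with n = 7 gives
$$7\beta_0 + 59\beta_1 + 24\beta_2 = 459.94,$$
$$59\beta_0 + 515\beta_1 + 209\beta_2 = 4072.04,$$
$$24\beta_0 + 209\beta_1 + 128\beta_2 = 1881.94,$$
whose solution is approximately $\beta_0 = -28.52$, $\beta_1 = 9.00$, $\beta_2 = 5.35$, i.e. $y \approx -28.52 + 9.00x_1 + 5.35x_2$.

2. A set of experimental runs was made to determine a way of predicting cooking time at various values of oven width and flue temperature.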
The coded data were recorded as follows:

| Cooking time (y) | 6.40 | 15.05 | 18.75 | 30.25 | 44.85 | 48.94 | 51.55 | 61.50 | 100.44 | 111.42 |
| Oven width ($x_1$) | 1.32 | 2.69 | 3.56 | 4.41 | 5.35 | 6.20 | 7.12 | 8.87 | 9.80 | 10.65 |
| Flue temperature ($x_2$) | 1.15 | 3.40 | 4.10 | 8.75 | 14.82 | 15.15 | 15.32 | 18.18 | 35.19 | 40.40 |

Estimate the multiple linear regression equation $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.
Solution: The normal equations corresponding to the regression equation $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ are:
$$\sum y = n\beta_0 + \beta_1\sum x_1 + \beta_2\sum x_2$$
$$\sum x_1 y = \beta_0\sum x_1 + \beta_1\sum x_1^2 + \beta_2\sum x_1 x_2$$
$$\sum x_2 y = \beta_0\sum x_2 + \beta_1\sum x_1 x_2 + \beta_2\sum x_2^2$$
For the given data (n = 10), compute the sums $\sum x_1$, $\sum x_2$, $\sum y$, $\sum x_1^2$, $\sum x_2^2$, $\sum x_1 x_2$, $\sum x_1 y$ and $\sum x_2 y$. Substituting these values in the normal equations and solving gives $\beta_0$, $\beta_1$ and $\beta_2$, and hence the required multiple linear regression equation.

Exercise:
1. If the coefficient of correlation between the variables x and y is 0.5 and the acute angle between their lines of regression is $\tan^{-1}(\ldots)$, find the ratio of the standard deviations of x and y.
2. Prove the following formulas for the coefficient of correlation r (in the usual notation): a) …
3. Find the rank correlation coefficient for the following data:

| x | 56 | 42 | 72 | 36 | 63 | 47 | 55 | 49 | 38 | 42 | 68 | 60 |
| y | 147 | 125 | 160 | 118 | 149 | 128 | 150 | 145 | 115 | 140 | 152 | 155 |

4. Ten participants in a contest are ranked by two judges as follows:

| X | 1 | 6 | 5 | 10 | 3 | 2 | 4 | 9 | 7 | 8 |
| Y | 6 | 4 | 9 | 8 | 1 | 2 | 3 | 10 | 5 | 7 |

Calculate the rank correlation coefficient.
5. The following table shows the ages x and the systolic blood pressures y of 12 persons.

| Age (x) | 56 | 42 | 72 | 36 | 63 | 47 | 55 | 49 | 38 | 42 | 68 | 60 |
| Blood pressure (y) | 147 | 125 | 160 | 118 | 149 | 128 | 150 | 145 | 115 | 140 | 152 | 155 |

Calculate the coefficient of correlation between x and y. Estimate the blood pressure of a person whose age is 45 years.
Ans. r = 0.8961, y = 80.78 + 1.138x; when x = 45, y = 132.
6. The heights (inches) and weights (pounds) of baseball players are given below: (76, 212), (76, 224), (72, 180), (74, 210), (75, 215), (71, 200), (77, 235), (78, 235), (77, 194), (76, 185).
(i) Estimate the coefficient of correlation between the weight and height of the baseball players.
(ii) Find the regression line between weight and height. Use the regression equation to find the weight of a baseball player who is 68 inches tall.
Ans. r = 0.5529, y = 4.737x - 147.227, x = 0.064y + 61.712; when x = 68, y = 4.737(68) - 147.227 = 174.89.
7. The equations of the regression lines of two variables x and y are 4x - 5y + 33 = 0 and 20x - 9y = 107. Find the correlation coefficient and the means of x and y.
Ans. r = 0.6, mean of x = 13 and mean of y = 17.
8. If the tangent of the angle between the lines of regression of y on x and x on y is 0.6 and the standard deviation of y is twice the standard deviation of x, find the coefficient of correlation between x and y.
Ans. r = 0.5.
9. The chemistry grade, intelligence test score and number of classes missed for 12 students are given below.

| Chemistry grade (y) | 85 | 74 | 76 | 90 | 85 | 87 | 94 | 98 | 81 | 91 | 76 | 74 |
| Test score ($x_1$) | 65 | 50 | 55 | 65 | 55 | 70 | 65 | 70 | 55 | 70 | 50 | 55 |
| Classes missed ($x_2$) | 1 | 7 | 5 | 2 | 6 | 3 | 2 | 5 | 4 | 3 | 1 | 4 |

a) Fit the best multilinear model that represents the relationship of the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.
b) Estimate the chemistry grade for a student who has an intelligence test score of 60 and missed 4 classes.
10. An experiment was conducted to determine whether the weight of an animal can be predicted after a given period of time on the basis of the initial weight of the animal and the amount of feed that was eaten. The following data, measured in kilograms, were recorded:

| Final weight (y) | 95 | 77 | 80 | 100 | 97 | 70 | 50 | 80 | 92 | 84 |
| Initial weight ($x_1$) | 42 | 33 | 33 | 45 | 39 | 36 | 32 | 41 | 40 | 38 |
| Feed weight ($x_2$) | 272 | 226 | 259 | 292 | 311 | 183 | 173 | 236 | 230 | 235 |

a) Fit the best multilinear model that represents the relationship of the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.
b) Predict the final weight of an animal having an initial weight of 35 kilograms that is given 250 kilograms of feed.

Resources:
1. https://nptel.ac.in/courses/111105042/
2. http://www.nptelvideos.in/2012/12/regression-analysis.html
3. https://nptel.ac.in/courses/111104074/