Measures of Central Tendency PDF
Document Details
Uploaded by NourishingRoseQuartz
Summary
This document provides a detailed explanation of measures of central tendency. It covers mathematical means, including arithmetic mean (direct and short-cut methods), geometric mean, harmonic mean, median, and mode.
Full Transcript
Measures of Central Tendency

2.1.1 Introduction.
An average or a central value of a statistical series is the value of the variable which describes the characteristics of the entire distribution.
The following are the five measures of central tendency.
(1) Arithmetic mean (2) Geometric mean (3) Harmonic mean (4) Median (5) Mode

2.1.2 Arithmetic Mean.
Arithmetic mean is the most important among the mathematical means. According to Horace Secrist, "The arithmetic mean is the amount secured by dividing the sum of values of the items in a series by their number."
(1) Simple arithmetic mean in individual series (Ungrouped data)
(i) Direct method : If the series in this case be $x_1, x_2, x_3, \dots, x_n$, then the arithmetic mean $\bar{x}$ is given by
$\bar{x} = \dfrac{\text{Sum of the series}}{\text{Number of terms}}$, i.e., $\bar{x} = \dfrac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
(ii) Short cut method : Arithmetic mean $\bar{x} = A + \dfrac{\sum d}{n}$,
where A = assumed mean, d = deviation from assumed mean = x - A, where x is the individual item, $\sum d$ = sum of deviations and n = number of items.
(2) Simple arithmetic mean in continuous series (Grouped data)
(i) Direct method : If the terms of the given series be $x_1, x_2, \dots, x_n$ and the corresponding frequencies be $f_1, f_2, \dots, f_n$, then the arithmetic mean $\bar{x}$ is given by
$\bar{x} = \dfrac{f_1 x_1 + f_2 x_2 + \dots + f_n x_n}{f_1 + f_2 + \dots + f_n} = \dfrac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$.
(ii) Short cut method : Arithmetic mean $\bar{x} = A + \dfrac{\sum f(x - A)}{\sum f}$,
where A = assumed mean, f = frequency and x - A = deviation of each item from the assumed mean.
(3) Properties of arithmetic mean
(i) Algebraic sum of the deviations of a set of values from their arithmetic mean is zero. If $x_i / f_i$, i = 1, 2, ..., n is the frequency distribution, then $\sum_{i=1}^{n} f_i (x_i - \bar{x}) = 0$, $\bar{x}$ being the mean of the distribution.
(ii) The sum of the squares of the deviations of a set of values is minimum when taken about the mean.
(iii) Mean of the composite series : If $\bar{x}_i$ $(i = 1, 2, \dots, k)$ are the means of k component series of sizes $n_i$ $(i = 1, 2, \dots, k)$ respectively, then the mean $\bar{x}$ of the composite series obtained on combining the component series is given by the formula
$\bar{x} = \dfrac{n_1\bar{x}_1 + n_2\bar{x}_2 + \dots + n_k\bar{x}_k}{n_1 + n_2 + \dots + n_k} = \dfrac{\sum_{i=1}^{k} n_i \bar{x}_i}{\sum_{i=1}^{k} n_i}$.

2.1.3 Geometric Mean.
If $x_1, x_2, x_3, \dots, x_n$ are n values of a variate x, none of them being zero, then the geometric mean (G.M.) is given by
G.M. $= (x_1 \cdot x_2 \cdot x_3 \cdots x_n)^{1/n}$, i.e., $\log(\text{G.M.}) = \dfrac{1}{n}(\log x_1 + \log x_2 + \dots + \log x_n)$.
In case of frequency distribution, G.M. of n values $x_1, x_2, \dots, x_n$ of a variate x occurring with frequency $f_1, f_2, \dots, f_n$ is given by
G.M. $= (x_1^{f_1} \cdot x_2^{f_2} \cdots x_n^{f_n})^{1/N}$, where $N = f_1 + f_2 + \dots + f_n$.

2.1.4 Harmonic Mean.
The harmonic mean of n items $x_1, x_2, \dots, x_n$ is defined as
H.M. $= \dfrac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \dots + \dfrac{1}{x_n}}$.
If the frequency distribution is $f_1, f_2, f_3, \dots, f_n$ respectively, then
H.M. $= \dfrac{f_1 + f_2 + f_3 + \dots + f_n}{\dfrac{f_1}{x_1} + \dfrac{f_2}{x_2} + \dots + \dfrac{f_n}{x_n}}$.
Note : A.M. gives more weightage to larger values whereas G.M. and H.M. give more weightage to smaller values.

Example: 1 If the mean of the distribution is 2.6, then the value of y is [Kurukshetra CEE 2001]
Variate x : 1, 2, 3, 4, 5
Frequency f of x : 4, 5, y, 1, 2
(a) 24 (b) 13 (c) 8 (d) 3
Solution: (c) We know that Mean $= \dfrac{\sum f_i x_i}{\sum f_i}$,
i.e. $2.6 = \dfrac{1 \times 4 + 2 \times 5 + 3 \times y + 4 \times 1 + 5 \times 2}{4 + 5 + y + 1 + 2}$ or $31.2 + 2.6y = 28 + 3y$ or $0.4y = 3.2$, so $y = 8$.

Example: 2 In a class of 100 students there are 70 boys whose average marks in a subject are 75. If the average marks of the complete class are 72, then what are the average marks of the girls [AIEEE 2002]
(a) 73 (b) 65 (c) 68 (d) 74
Solution: (b) Let the average marks of the girl students be x (number of girls = 100 - 70 = 30). Then
$72 = \dfrac{70 \times 75 + 30 \times x}{100}$, i.e., $x = \dfrac{7200 - 5250}{30} = 65$.

Example: 3 If the mean of the set of numbers $x_1, x_2, x_3, \dots, x_n$ is $\bar{x}$, then the mean of the numbers $x_i + 2i$, $1 \le i \le n$ is [Pb. CET 1988]
(a) $\bar{x} + 2n$ (b) $\bar{x} + n + 1$ (c) $\bar{x} + 2$ (d) $\bar{x} + n$
Solution: (b) We know that $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$, i.e., $\sum_{i=1}^{n} x_i = n\bar{x}$. Hence
$\dfrac{1}{n}\sum_{i=1}^{n}(x_i + 2i) = \dfrac{1}{n}\left[\sum_{i=1}^{n} x_i + 2\sum_{i=1}^{n} i\right] = \dfrac{1}{n}\left[n\bar{x} + 2 \cdot \dfrac{n(n+1)}{2}\right] = \bar{x} + (n + 1)$.

Example: 4 The harmonic mean of 4, 8, 16 is [AMU 1995]
(a) 6.4 (b) 6.7 (c) 6.85 (d) 7.8
Solution: (c) H.M. of 4, 8, 16 $= \dfrac{3}{\dfrac{1}{4} + \dfrac{1}{8} + \dfrac{1}{16}} = \dfrac{48}{7} \approx 6.85$.

Example: 5 The average of n numbers $x_1, x_2, x_3, \dots, x_n$ is M. If $x_n$ is replaced by $x'$, then the new average is [DCE 2000]
(a) $M - x_n + x'$ (b) $\dfrac{nM - x_n + x'}{n}$ (c) $\dfrac{(n-1)M + x'}{n}$ (d) $\dfrac{M - x_n + x'}{n}$
Solution: (b) $M = \dfrac{x_1 + x_2 + x_3 + \dots + x_n}{n}$, i.e., $nM = x_1 + x_2 + \dots + x_{n-1} + x_n$, so $nM - x_n = x_1 + x_2 + \dots + x_{n-1}$.
Therefore new average $= \dfrac{x_1 + x_2 + \dots + x_{n-1} + x'}{n} = \dfrac{nM - x_n + x'}{n}$.

Example: 6 Mean of 100 items is 49. It was discovered that three items which should have been 60, 70, 80 were wrongly read as 40, 20, 50 respectively. The correct mean is [Kurukshetra CEE 1994]
(a) 48 (b) $82\tfrac{1}{2}$ (c) 50 (d) 80
Solution: (c) Sum of 100 items = 49 × 100 = 4900
Sum of items added = 60 + 70 + 80 = 210
Sum of items replaced = 40 + 20 + 50 = 110
New sum = 4900 + 210 - 110 = 5000
Correct mean $= \dfrac{5000}{100} = 50$.

2.1.5 Median.
Median is defined as the value of an item or observation above and below which lie an equal number of observations, i.e., the median is the central value of the set of observations provided all the observations are arranged in ascending or descending order.
(1) Calculation of median
(i) Individual series : If the data is raw, arrange it in ascending or descending order. Let n be the number of observations.
If n is odd, Median = value of $\left(\dfrac{n+1}{2}\right)^{th}$ item.
If n is even, Median $= \dfrac{1}{2}\left[\text{value of } \left(\dfrac{n}{2}\right)^{th} \text{ item} + \text{value of } \left(\dfrac{n}{2} + 1\right)^{th} \text{ item}\right]$.
(ii) Discrete series : In this case, we first find the cumulative frequencies of the variables arranged in ascending or descending order and the median is given by
Median $= \left(\dfrac{n+1}{2}\right)^{th}$ observation, where n is the total (final cumulative) frequency.
(iii) For grouped or continuous distributions : In this case, the following formula can be used.
(a) For series in ascending order, Median $= l + \dfrac{\left(\dfrac{N}{2} - C\right)}{f} \times i$
where l = Lower limit of the median class
f = Frequency of the median class
N = The sum of all frequencies
i = The width of the median class
C = The cumulative frequency of the class preceding the median class.
(b) For series in descending order,
Median $= u - \dfrac{\left(\dfrac{N}{2} - C\right)}{f} \times i$, where u = upper limit of the median class and $N = \sum_{i=1}^{n} f_i$.
As the median divides a distribution into two equal parts, similarly the quartiles, quintiles, deciles and percentiles divide the distribution respectively into 4, 5, 10 and 100 equal parts. The $j^{th}$ quartile is given by
$Q_j = l + \dfrac{\left(j\dfrac{N}{4} - C\right)}{f} \times i$; j = 1, 2, 3.
$Q_1$ is the lower quartile, $Q_2$ is the median and $Q_3$ is called the upper quartile.
(2) Lower quartile
(i) Discrete series : $Q_1$ = size of $\left(\dfrac{n+1}{4}\right)^{th}$ item
(ii) Continuous series : $Q_1 = l + \dfrac{\left(\dfrac{N}{4} - C\right)}{f} \times i$
(3) Upper quartile
(i) Discrete series : $Q_3$ = size of $\left[\dfrac{3(n+1)}{4}\right]^{th}$ item
(ii) Continuous series : $Q_3 = l + \dfrac{\left(\dfrac{3N}{4} - C\right)}{f} \times i$
(4) Decile : Deciles divide the total frequency N into ten equal parts.
$D_j = l + \dfrac{\left(\dfrac{N \times j}{10} - C\right)}{f} \times i$ [j = 1, 2, 3, 4, 5, 6, 7, 8, 9]
If j = 5, then $D_5 = l + \dfrac{\left(\dfrac{N}{2} - C\right)}{f} \times i$. Hence $D_5$ is also known as the median.
(5) Percentile : Percentiles divide the total frequency N into hundred equal parts.
$P_k = l + \dfrac{\left(\dfrac{N \times k}{100} - C\right)}{f} \times i$, where k = 1, 2, 3, 4, 5, ..., 99.

Example: 7 The following data gives the distribution of height of students
Height (in cm) : 160, 150, 152, 161, 156, 154, 155
Number of students : 12, 8, 4, 4, 3, 3, 7
The median of the distribution is
(a) 154 (b) 155 (c) 160 (d) 161
Solution: (b) Arranging the data in ascending order of magnitude, we obtain
Height (in cm) : 150, 152, 154, 155, 156, 160, 161
Number of students : 8, 4, 3, 7, 3, 12, 4
Cumulative frequency : 8, 12, 15, 22, 25, 37, 41
Here, the total number of items is 41, i.e. an odd number. Hence, the median is the $\left(\dfrac{41+1}{2}\right)^{th}$, i.e. 21st, item.
From the cumulative frequency table, we find that the median, i.e. the 21st item, is 155.
(All items from the 16th to the 22nd are equal, each = 155.)

Example: 8 The median of a set of 9 distinct observations is 20.5. If each of the largest 4 observations of the set is increased by 2, then the median of the new set [AIEEE 2003]
(a) Is increased by 2 (b) Is decreased by 2 (c) Is two times the original median (d) Remains the same as that of the original set
Solution: (d) n = 9, so the median term is the $\left(\dfrac{9+1}{2}\right)^{th}$ = 5th term. Only the last four observations are increased by 2, so the 5th observation, which is the median, remains unchanged. There will be no change in the median.

Example: 9 Compute the median from the following table
Marks obtained : 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80
No. of students : 2, 18, 30, 45, 35, 20, 6, 3
(a) 36.55 (b) 35.55 (c) 40.05 (d) None of these
Solution: (a)
Marks obtained | No. of students | Cumulative frequency
0-10 | 2 | 2
10-20 | 18 | 20
20-30 | 30 | 50
30-40 | 45 | 95
40-50 | 35 | 130
50-60 | 20 | 150
60-70 | 6 | 156
70-80 | 3 | 159
$n = \sum f = 159$
Here n = 159, which is odd.
Median number $= \dfrac{1}{2}(n + 1) = \dfrac{1}{2}(159 + 1) = 80$, which is in the class 30-40 (see the row of cumulative frequency 95, which contains 80).
Hence the median class is 30-40.
We have l = Lower limit of median class = 30
f = Frequency of median class = 45
C = Total of all frequencies preceding median class = 50
i = Width of class interval of median class = 10
Required median $= l + \dfrac{\dfrac{N}{2} - C}{f} \times i = 30 + \dfrac{\dfrac{159}{2} - 50}{45} \times 10 = 30 + \dfrac{295}{45} = 36.55$.

2.1.6 Mode.
Mode : The mode or modal value of a distribution is that value of the variable for which the frequency is maximum. For continuous series, mode is calculated as
Mode $= l_1 + \dfrac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$
where $l_1$ = The lower limit of the modal class
$f_1$ = The frequency of the modal class
$f_0$ = The frequency of the class preceding the modal class
$f_2$ = The frequency of the class succeeding the modal class
i = The size of the modal class.
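The grouped median and mode formulas above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `grouped_median` and `grouped_mode` are hypothetical, the classes are assumed to be of equal width and in ascending order, and the median class is located as the first class whose cumulative frequency reaches N/2. The data are the marks table of Example 9.

```python
def grouped_median(lower_limits, width, freqs):
    """Median = l + ((N/2 - C) / f) * i for equal-width classes in ascending order."""
    total = sum(freqs)          # N, the sum of all frequencies
    cumulative = 0              # C, cumulative frequency before the current class
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= total / 2:
            return lower + (total / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("empty distribution")


def grouped_mode(lower_limits, width, freqs):
    """Mode = l1 + (f1 - f0) / (2*f1 - f0 - f2) * i for the class of highest frequency.

    Assumes a single modal class with 2*f1 - f0 - f2 != 0.
    """
    k = max(range(len(freqs)), key=lambda j: freqs[j])   # index of the modal class
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                    # preceding class
    f2 = freqs[k + 1] if k + 1 < len(freqs) else 0       # succeeding class
    return lower_limits[k] + (f1 - f0) / (2 * f1 - f0 - f2) * width


# Marks table of Example 9: classes 0-10, 10-20, ..., 70-80
lowers = [0, 10, 20, 30, 40, 50, 60, 70]
freqs = [2, 18, 30, 45, 35, 20, 6, 3]

print(round(grouped_median(lowers, 10, freqs), 2))
print(round(grouped_mode(lowers, 10, freqs), 2))
```

For this table the median class and the modal class are both 30-40, so the two formulas share the same lower limit l = 30 and differ only in the fraction of the class width that is added.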
Symmetric distribution : A distribution is a symmetric distribution if the values of mean, mode and median coincide. In a symmetric distribution frequencies are symmetrically distributed on both sides of the centre point of the frequency curve.
[Figure: frequency curves for a symmetric distribution (Mean = Median = Mode) and for skewed distributions, with the positions of mean, median and mode marked.]
A distribution which is not symmetric is called a skewed distribution. In a moderately asymmetric distribution the interval between the mean and the median is approximately one-third of the interval between the mean and the mode, i.e. we have the following empirical relation between them
Mean - Mode = 3(Mean - Median), i.e., Mode = 3 Median - 2 Mean. It is known as the empirical relation.

Example: 10 The mode of the distribution [AMU 1988]
Marks : 4, 5, 6, 7, 8
No. of students : 6, 7, 10, 8, 3
(a) 5 (b) 6 (c) 8 (d) 10
Solution: (b) Since the frequency is maximum for 6, Mode = 6.

Example: 11 Consider the following statements [AIEEE 2004]
(1) Mode can be computed from histogram
(2) Median is not independent of change of scale
(3) Variance is independent of change of origin and scale
Which of these is/are correct
(a) (1), (2) and (3) (b) Only (2) (c) Only (1) and (2) (d) Only (1)
Solution: (c) Statement (1) is true, since the mode can be located from the tallest rectangle of a histogram. Statement (2) is true, since multiplying every observation by a constant multiplies the median by the same constant. Statement (3) is false, since variance is independent of change of origin but not of change of scale.

Important Tips
Some points about arithmetic mean
Of all types of averages the arithmetic mean is the most commonly used average.
It is based upon all observations.
If the number of observations is very large, it is a more accurate and more reliable basis for comparison.
Some points about geometric mean
It is based on all items of the series.
It is most suitable for constructing index numbers, average ratios, percentages etc.
G.M. cannot be calculated if the size of any of the items is zero or negative.
Some points about H.M.
This is useful in problems related with rates, ratios, time etc.
A.M. ≥ G.M. ≥ H.M. for positive values, and also, for two positive numbers, (G.M.)² = (A.M.)(H.M.).
Some points about median
It is an appropriate average in dealing with qualitative data, like intelligence, wealth etc.
The sum of the deviations of the items from the median, ignoring algebraic signs, is less than the sum from any other point.
It is a positional average: it depends on the order of the items rather than on the magnitude of every item of the series.
Some points about mode
It is not based on all items of the series.
As compared to other averages, mode is affected to a large extent by fluctuations of sampling.
It is not suitable in a case where the relative importance of items has to be considered.

2.1.7 Pie Chart (Pie Diagram).
Here a circle is divided into a number of segments equal to the number of components in the corresponding table. The entire diagram looks like a pie and the components appear like slices cut from the pie. In this diagram each item has a sector whose area has the same percentage of the total area of the circle as this item has of the total of such items. For example, if N be the total and $n_1$ is one of the components of the figure corresponding to a particular item, then the angle of the sector for this item $= \left(\dfrac{n_1}{N}\right) \times 360°$, as the total number of degrees in the angle subtended by the whole circular arc at its centre is 360°.

Example: 12 If for a slightly asymmetric distribution, mean and median are 5 and 6 respectively. What is its mode [DCE 199
(a) 5 (b) 6 (c) 7 (d) 8
Solution: (d) We know that Mode = 3 Median - 2 Mean = 3(6) - 2(5) = 8.

Example: 13 A pie chart is to be drawn for representing the following data
Items of expenditure | Number of families
Education | 150
Food and clothing | 400
House rent | 40
Electricity | 250
Miscellaneous | 160
The value of the central angle for food and clothing would be [NDA 1998]
(a) 90° (b) 2.8° (c) 150° (d) 144°
Solution: (d) Required angle for food and clothing $= \dfrac{400}{1000} \times 360° = 144°$.

2.1.8 Measure of Dispersion.
The degree to which numerical data tend to spread about an average value is called the dispersion of the data.
The four measures of dispersion are
(1) Range (2) Mean deviation (3) Standard deviation (4) Square deviation
(1) Range : It is the difference between the values of extreme items in a series. Range $= x_{max} - x_{min}$.
The coefficient of range (scatter) $= \dfrac{x_{max} - x_{min}}{x_{max} + x_{min}}$.
Range is not the measure of central tendency. Range is widely used in statistical series relating to quality control in production.
(i) Inter-quartile range : We know that quartiles are the magnitudes of the items which divide the distribution into four equal parts. The inter-quartile range is found by taking the difference between the third and first quartiles and is given by the formula
Inter-quartile range $= Q_3 - Q_1$
where $Q_1$ = First quartile or lower quartile and $Q_3$ = Third quartile or upper quartile.
(ii) Percentile range : This is measured by the following formula
Percentile range $= P_{90} - P_{10}$
where $P_{90}$ = 90th percentile and $P_{10}$ = 10th percentile.
Percentile range is considered better than range as well as inter-quartile range.
(iii) Quartile deviation or semi inter-quartile range : It is one-half of the difference between the third quartile and first quartile, i.e., Q.D. $= \dfrac{Q_3 - Q_1}{2}$, and coefficient of quartile deviation $= \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$,
where $Q_3$ is the third or upper quartile and $Q_1$ is the lowest or first quartile.
(2) Mean deviation : The arithmetic average of the deviations (all taken positive) from the mean, median or mode is known as mean deviation.
(i) Mean deviation from ungrouped data (or individual series)
Mean deviation $= \dfrac{\sum |x - M|}{n}$
where |x - M| means the modulus of the deviation of the variate from the average M (mean, median or mode) and n is the number of terms.
(ii) Mean deviation from continuous series : Here first of all we find the mean from which the deviation is to be taken. Then we find the deviation $dM = |x - M|$ of each variate from the mean M so obtained.
Next we multiply these deviations by the corresponding frequency and find the product f·dM and then the sum $\sum f\,dM$ of these products.
Lastly we use the formula, mean deviation $= \dfrac{\sum f|x - M|}{n} = \dfrac{\sum f\,dM}{n}$, where $n = \sum f$.

Important Tips
Mean coefficient of dispersion $= \dfrac{\text{Mean deviation from the mean}}{\text{Mean}}$
Median coefficient of dispersion $= \dfrac{\text{Mean deviation from the median}}{\text{Median}}$
Mode coefficient of dispersion $= \dfrac{\text{Mean deviation from the mode}}{\text{Mode}}$
In general, mean deviation (M.D.) always stands for mean deviation about the median.

(3) Standard deviation : Standard deviation (or S.D.) is the square root of the arithmetic mean of the squares of deviations of various values from their arithmetic mean and is generally denoted by $\sigma$, read as sigma.
(i) Coefficient of standard deviation : To compare the dispersion of two frequency distributions the relative measure of standard deviation is computed, which is known as the coefficient of standard deviation and is given by
Coefficient of S.D. $= \dfrac{\sigma}{\bar{x}}$, where $\bar{x}$ is the A.M.
(ii) Standard deviation from individual series
$\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{N}}$
where $\bar{x}$ = The arithmetic mean of the series
N = The total frequency.
(iii) Standard deviation from continuous series
$\sigma = \sqrt{\dfrac{\sum f_i (x_i - \bar{x})^2}{N}}$
where $\bar{x}$ = Arithmetic mean of the series
$x_i$ = Mid value of the class
$f_i$ = Frequency of the corresponding $x_i$
$N = \sum f$ = The total frequency
Short cut method
(i) $\sigma = \sqrt{\dfrac{\sum d^2}{N} - \left(\dfrac{\sum d}{N}\right)^2}$ (ii) $\sigma = \sqrt{\dfrac{\sum f d^2}{N} - \left(\dfrac{\sum f d}{N}\right)^2}$
where d = x - A = Deviation from the assumed mean A
f = Frequency of the item
$N = \sum f$ = Sum of frequencies
(4) Square deviation
(i) Root mean square deviation
$S = \sqrt{\dfrac{1}{N}\sum_{i=1}^{n} f_i (x_i - A)^2}$
where A is any arbitrary number; $S^2$ is called the mean square deviation and S the root mean square deviation.
(ii) Relation between S.D. and root mean square deviation : If $\sigma$ be the standard deviation and S be the root mean square deviation, then $S^2 = \sigma^2 + d^2$, where $d = \bar{x} - A$.
Obviously, $S^2$ will be least when d = 0, i.e. $\bar{x} = A$.
Hence, mean square deviation and consequently root mean square deviation is least, if the deviations are taken from the mean.
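The agreement between the direct and short-cut formulas for the standard deviation, and the relation $S^2 = \sigma^2 + d^2$, can be checked numerically. The sketch below is illustrative only: the function names and the eight data values are assumptions chosen for convenience (they do not come from the text), and the assumed mean A = 4 is arbitrary.

```python
import math


def std_direct(values):
    """sigma = sqrt(sum((x - mean)^2) / N), the direct method."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


def std_shortcut(values, assumed_mean):
    """sigma = sqrt(sum(d^2)/N - (sum(d)/N)^2) with d = x - A, the short-cut method."""
    n = len(values)
    d = [x - assumed_mean for x in values]
    return math.sqrt(sum(v * v for v in d) / n - (sum(d) / n) ** 2)


def mean_square_deviation(values, a):
    """S^2 = (1/N) * sum((x - A)^2), the mean square deviation about A."""
    return sum((x - a) ** 2 for x in values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample, mean = 5
A = 4                              # arbitrary assumed mean

sigma = std_direct(data)
mean = sum(data) / len(data)
d = mean - A

print(sigma, std_shortcut(data, A))                  # both methods give the same sigma
print(mean_square_deviation(data, A), sigma**2 + d**2)  # S^2 equals sigma^2 + d^2
```

Changing A changes $S^2$ but leaves the short-cut value of $\sigma$ unchanged, which is why the assumed mean can be picked purely for arithmetic convenience.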
2.1.9 Variance.
The square of standard deviation is called the variance.
Coefficient of standard deviation and variance : The coefficient of standard deviation is the ratio of the S.D. to A.M., i.e., $\dfrac{\sigma}{\bar{x}}$. Coefficient of variance = coefficient of S.D. × 100 $= \dfrac{\sigma}{\bar{x}} \times 100$.
Variance of the combined series : If $n_1, n_2$ are the sizes, $\bar{x}_1, \bar{x}_2$ the means and $\sigma_1, \sigma_2$ the standard deviations of two series, then
$\sigma^2 = \dfrac{1}{n_1 + n_2}\left[n_1(\sigma_1^2 + d_1^2) + n_2(\sigma_2^2 + d_2^2)\right]$
where $d_1 = \bar{x}_1 - \bar{x}$, $d_2 = \bar{x}_2 - \bar{x}$ and $\bar{x} = \dfrac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}$.

Important Tips
Range is widely used in statistical series relating to quality control in production.
Standard deviation ≤ Range, i.e., variance ≤ (Range)².
Empirical relations between measures of dispersion
Mean deviation $= \dfrac{4}{5}$ (standard deviation)
Semi interquartile range $= \dfrac{2}{3}$ (standard deviation)
Semi interquartile range $= \dfrac{5}{6}$ (mean deviation)
For a symmetrical (normal) distribution, the following area relationship holds good
$\bar{X} \pm \sigma$ covers 68.27% items
$\bar{X} \pm 2\sigma$ covers 95.45% items
$\bar{X} \pm 3\sigma$ covers 99.73% items
S.D. of first n natural numbers is $\sqrt{\dfrac{n^2 - 1}{12}}$.
Range is not the measure of central tendency.

2.1.10 Skewness.
"Skewness" measures the lack of symmetry. It is measured by
$\gamma_1 = \dfrac{\frac{1}{n}\sum (x_i - \mu)^3}{\left\{\frac{1}{n}\sum (x_i - \mu)^2\right\}^{3/2}}$
and is denoted by $\gamma_1$.
The distribution is skewed if,
(i) Mean ≠ Median ≠ Mode
(ii) Quartiles are not equidistant from the median and
(iii) The frequency curve is stretched more to one side than to the other.
(1) Distribution : There are three types of distributions
(i) Normal distribution : When $\gamma_1 = 0$, the distribution is said to be normal. In this case Mean = Median = Mode
(ii) Positively skewed distribution : When $\gamma_1 > 0$, the distribution is said to be positively skewed. In this case Mean > Median > Mode
(iii) Negatively skewed distribution : When $\gamma_1 < 0$, the distribution is said to be negatively skewed.
In this case Mean < Median < Mode
(2) Measures of skewness
(i) Absolute measures of skewness : Various measures of skewness are
(a) $S_K = M - M_d$ (b) $S_K = M - M_o$ (c) $S_K = Q_3 + Q_1 - 2M_d$
where $M_d$ = median, $M_o$ = mode, M = mean.
Absolute measures of skewness are not useful to compare two series, therefore relative measures of skewness are used, as they are pure numbers.
(3) Relative measures of skewness
(i) Karl Pearson's coefficient of skewness : $S_k = \dfrac{M - M_o}{\sigma} = \dfrac{3(M - M_d)}{\sigma}$, $-3 \le S_k \le 3$, where $\sigma$ is the standard deviation.
(ii) Bowley's coefficient of skewness : $S_k = \dfrac{Q_3 + Q_1 - 2M_d}{Q_3 - Q_1}$
Bowley's coefficient of skewness lies between -1 and 1.
(iii) Kelly's coefficient of skewness : $S_K = \dfrac{P_{10} + P_{90} - 2M_d}{P_{90} - P_{10}} = \dfrac{D_1 + D_9 - 2M_d}{D_9 - D_1}$

Example: 14 A batsman scores runs in 10 innings 38, 70, 48, 34, 42, 55, 63, 46, 54, 44, then the mean deviation is [Kerala En
(a) 8.6 (b) 6.4 (c) 10.6 (d) 9.6
Solution: (a) Arranging the given data in ascending order, we have 34, 38, 42, 44, 46, 48, 54, 55, 63, 70.
Here median $M = \dfrac{46 + 48}{2} = 47$ (since n = 10, the median is the mean of the 5th and 6th items).
Mean deviation $= \dfrac{\sum |x_i - M|}{n} = \dfrac{\sum |x_i - 47|}{10} = \dfrac{13 + 9 + 5 + 3 + 1 + 1 + 7 + 8 + 16 + 23}{10} = 8.6$.

Example: 15 S.D. of data is 6. When each observation is increased by 1, then the S.D. of new data is [Pb. CET 1994]
(a) 5 (b) 7 (c) 6 (d) 8
Solution: (c) S.D. and variance of data are not changed when each observation is increased (or decreased) by the same constant.

Example: 16 In a series of 2n observations, half of them equal a and the remaining half equal -a. If the standard deviation of the observations is 2, then |a| equals [AIEEE 2004]
(a) $\dfrac{\sqrt{2}}{n}$ (b) $\sqrt{2}$ (c) 2 (d) $\dfrac{1}{n}$
Solution: (c) Let the observations be a, a, ... (n times) and -a, -a, ... (n times), i.e. mean = 0 and
S.D. $= \sqrt{\dfrac{n(a - 0)^2 + n(-a - 0)^2}{2n}} = \sqrt{\dfrac{na^2 + na^2}{2n}} = \sqrt{a^2} = |a|$. Hence |a| = 2.

Example: 17 If $\mu$ is the mean of distribution $(y_i, f_i)$, then $\sum f_i (y_i - \mu)$ = [Kerala PET 2001]
(a) M.D. (b) S.D. (c) 0 (d) Relative frequency
Solution: (c) We have $\sum f_i (y_i - \mu) = \sum f_i y_i - \mu \sum f_i = \mu \sum f_i - \mu \sum f_i = 0$, since $\mu = \dfrac{\sum f_i y_i}{\sum f_i}$.

Example: 18 What is the standard deviation of the following series [DCE 1996]
Measurements : 0-10, 10-20, 20-30, 30-40
Frequency : 1, 3, 4, 2
(a) 81 (b) 7.6 (c) 9 (d) 2.26
Solution: (c)
Class | Frequency $f_i$ | $y_i$ | $u_i = \dfrac{y_i - A}{10}$, A = 25 | $f_i u_i$ | $f_i u_i^2$
0-10 | 1 | 5 | -2 | -2 | 4
10-20 | 3 | 15 | -1 | -3 | 3
20-30 | 4 | 25 | 0 | 0 | 0
30-40 | 2 | 35 | 1 | 2 | 2
Total | 10 | | | -3 | 9
$\sigma^2 = c^2\left[\dfrac{\sum f_i u_i^2}{\sum f_i} - \left(\dfrac{\sum f_i u_i}{\sum f_i}\right)^2\right] = 10^2\left[\dfrac{9}{10} - \left(\dfrac{-3}{10}\right)^2\right] = 90 - 9 = 81$, so $\sigma = 9$.

Example: 19 In an experiment with 15 observations on x, the following results were available: $\sum x^2 = 2830$, $\sum x = 170$. One observation that was 20 was found to be wrong and was replaced by the correct value 30. Then the corrected variance is [AIEEE 2003]
(a) 78.00 (b) 188.66 (c) 177.33 (d) 8.33
Solution: (a) $\sum x = 170$, $\sum x^2 = 2830$
Increase in $\sum x$ = 10, then $\sum x' = 170 + 10 = 180$
Increase in $\sum x^2$ = 900 - 400 = 500, then $\sum x'^2 = 2830 + 500 = 3330$
Variance $= \dfrac{1}{n}\sum x'^2 - \left(\dfrac{\sum x'}{n}\right)^2 = \dfrac{3330}{15} - \left(\dfrac{180}{15}\right)^2 = 222 - 144 = 78$.

Example: 20 The quartile deviation of daily wages (in Rs.) of 7 persons given below 12, 7, 15, 10, 17, 19, 25 is [Pb. CET 1991, 96; Kurukshetra CEE 1997]
(a) 14.5 (b) 5 (c) 9 (d) 4.5
Solution: (d) The given data in ascending order of magnitude is 7, 10, 12, 15, 17, 19, 25.
Here $Q_1$ = size of $\left(\dfrac{n+1}{4}\right)^{th}$ item = size of 2nd item = 10
$Q_3$ = size of $\left[\dfrac{3(n+1)}{4}\right]^{th}$ item = size of 6th item = 19
Then Q.D. $= \dfrac{Q_3 - Q_1}{2} = \dfrac{19 - 10}{2} = 4.5$.

Example: 21 Karl Pearson's coefficient of skewness of a distribution is 0.32. Its S.D. is 6.5 and mean 39.6. Then the median of the distribution is given by [Kurukshetra CEE 1991]
(a) 28.61 (b) 38.81 (c) 29.13 (d) 28.31
Solution: (b) We know that $S_k = \dfrac{M - M_o}{\sigma}$, where M = Mean, $M_o$ = Mode, $\sigma$ = S.D.,
i.e. $0.32 = \dfrac{39.6 - M_o}{6.5}$, so $M_o = 37.52$. We also know that $M_o$ = 3 Median - 2 Mean,
so 37.52 = 3(Median) - 2(39.6) and Median $= \dfrac{116.72}{3} \approx 38.91$, which is closest to option (b).

Example: 22 The S.D. of a variate x is $\sigma$. The S.D. of the variate $\dfrac{ax + b}{c}$, where a, b, c are constants, is [Pb. CET 1996]
(a) $\left(\dfrac{a}{c}\right)\sigma$ (b) $\left|\dfrac{a}{c}\right|\sigma$ (c) $\left(\dfrac{a^2}{c^2}\right)\sigma$ (d) None of these
Solution: (b) Let $y = \dfrac{ax + b}{c}$, i.e., $y = \dfrac{a}{c}x + \dfrac{b}{c}$, i.e. $y = Ax + B$, where $A = \dfrac{a}{c}$, $B = \dfrac{b}{c}$.
Then $\bar{y} = A\bar{x} + B$, so $y - \bar{y} = A(x - \bar{x})$ and $(y - \bar{y})^2 = A^2(x - \bar{x})^2$.
Summing, $\sum (y - \bar{y})^2 = A^2 \sum (x - \bar{x})^2$, i.e. $n\sigma_y^2 = A^2 \cdot n\sigma_x^2$, so $\sigma_y^2 = A^2\sigma_x^2$ and $\sigma_y = |A|\sigma_x = \left|\dfrac{a}{c}\right|\sigma$.
Thus, new S.D. $= \left|\dfrac{a}{c}\right|\sigma$.

Correlation and Regression

2.2.1 Introduction.
"If it is proved true that in a large number of instances two variables tend always to fluctuate in the same or in opposite directions, we consider that the fact is established and that a relationship exists. This relationship is called correlation."
(1) Univariate distribution : These are the distributions in which there is only one variable, such as the heights of the students of a class.
(2) Bivariate distribution : A distribution involving two discrete variables is called a bivariate distribution. For example, the heights and the weights of the students of a class in a school.
(3) Bivariate frequency distribution : Let x and y be two variables. Suppose x takes the values $x_1, x_2, \dots, x_n$ and y takes the values $y_1, y_2, \dots, y_n$, then we record our observations in the form of ordered pairs $(x_i, y_j)$, where $1 \le i \le n$, $1 \le j \le n$. If a certain pair occurs $f_{ij}$ times, we say that its frequency is $f_{ij}$.
The function which assigns the frequencies $f_{ij}$'s to the pairs $(x_i, y_j)$ is known as a bivariate frequency distribution.

Example: 1 The following table shows the frequency distribution of age (x) and weight (y) of a group of 60 individuals
y \ x (yrs) | 40-45 | 45-50 | 50-55 | 55-60 | 60-65
45-50 | 2 | 5 | 8 | 3 | 0
50-55 | 1 | 3 | 6 | 10 | 2
55-60 | 0 | 2 | 5 | 12 | 1
Then find the marginal frequency distribution for x and y.
Solution: Marginal frequency distribution for x
x : 40-45, 45-50, 50-55, 55-60, 60-65
f : 3, 10, 19, 25, 3
Marginal frequency distribution for y
y : 45-50, 50-55, 55-60
f : 18, 22, 20

2.2.2 Covariance.
Let $(x_i, y_i)$; i = 1, 2, ..., n be a bivariate distribution, where $x_1, x_2, \dots, x_n$ are the values of variable x and $y_1, y_2, \dots, y_n$ those of y.
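Then the covariance Cov(x, y) between x and y is given by
$Cov(x, y) = \dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ or $Cov(x, y) = \dfrac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}$,
where $\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \dfrac{1}{n}\sum_{i=1}^{n} y_i$ are the means of variables x and y respectively.
Covariance is not affected by the change of origin, but it is affected by the change of scale.

Example: 2 Covariance Cov(x, y) between x and y, if $\sum x = 15$, $\sum y = 40$, $\sum x \cdot y = 110$, n = 5 is [DCE 2000]
(a) 22 (b) 2 (c) -2 (d) None of these
Solution: (c) Given, $\sum x = 15$, $\sum y = 40$, $\sum x \cdot y = 110$, n = 5.
We know that $Cov(x, y) = \dfrac{1}{n}\sum_{i=1}^{n} x_i y_i - \left(\dfrac{1}{n}\sum_{i=1}^{n} x_i\right)\left(\dfrac{1}{n}\sum_{i=1}^{n} y_i\right) = \dfrac{1}{5}(110) - \left(\dfrac{15}{5}\right)\left(\dfrac{40}{5}\right) = 22 - 3 \times 8 = -2$.

2.2.3 Correlation.
The relationship between two variables such that a change in one variable results in a positive or negative change in the other variable is known as correlation.
(1) Types of correlation
(i) Perfect correlation : If the two variables vary in such a manner that their ratio is always constant, then the correlation is said to be perfect.
(ii) Positive or direct correlation : If an increase or decrease in one variable corresponds to an increase or decrease in the other, the correlation is said to be positive.
(iii) Negative or indirect correlation : If an increase or decrease in one variable corresponds to a decrease or increase in the other, the correlation is said to be negative.
(2) Karl Pearson's coefficient of correlation : The correlation coefficient r(x, y) between two variables x and y is given by
$r(x, y) = \dfrac{Cov(x, y)}{\sqrt{Var(x)}\,\sqrt{Var(y)}} = \dfrac{Cov(x, y)}{\sigma_x \sigma_y}$, or equivalently
$r(x, y) = \dfrac{n\sum_{i=1}^{n} x_i y_i - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}$.
Also $r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \dfrac{\sum dx\,dy}{\sqrt{\sum dx^2}\,\sqrt{\sum dy^2}}$, where $dx = x - \bar{x}$; $dy = y - \bar{y}$.
(3) Modified formula : $r = \dfrac{\sum dx\,dy - \dfrac{\sum dx \cdot \sum dy}{n}}{\sqrt{\sum dx^2 - \dfrac{\left(\sum dx\right)^2}{n}}\,\sqrt{\sum dy^2 - \dfrac{\left(\sum dy\right)^2}{n}}}$,
where dx = x - A and dy = y - B are deviations from assumed means A and B. When $A = \bar{x}$ and $B = \bar{y}$, the correction terms vanish and $Cov(x, y) = \dfrac{\sum dx\,dy}{n}$, $\sigma_x = \sqrt{\dfrac{\sum dx^2}{n}}$, $\sigma_y = \sqrt{\dfrac{\sum dy^2}{n}}$.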
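The covariance and correlation formulas above can be compared numerically. The following is a minimal sketch, not a reference implementation: the function names, the parameters `a` and `b` (the assumed means of the modified formula) and the five data pairs are illustrative assumptions.

```python
import math


def covariance(xs, ys):
    """Cov(x, y) = (1/n) * sum((x_i - mean_x) * (y_i - mean_y))."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def pearson_r(xs, ys, a=0.0, b=0.0):
    """Modified formula with deviations u = x - a, v = y - b from assumed means a, b."""
    n = len(xs)
    u = [x - a for x in xs]
    v = [y - b for y in ys]
    num = sum(ui * vi for ui, vi in zip(u, v)) - sum(u) * sum(v) / n
    den_x = math.sqrt(sum(ui * ui for ui in u) - sum(u) ** 2 / n)
    den_y = math.sqrt(sum(vi * vi for vi in v) - sum(v) ** 2 / n)
    return num / (den_x * den_y)


xs = [4, 7, 8, 3, 4]   # illustrative data
ys = [5, 8, 6, 3, 5]

sigma_x = math.sqrt(covariance(xs, xs))   # Var(x) is Cov(x, x)
sigma_y = math.sqrt(covariance(ys, ys))

r_definition = covariance(xs, ys) / (sigma_x * sigma_y)
r_modified = pearson_r(xs, ys, a=5, b=5)

print(r_definition, r_modified)   # the two values agree up to rounding
```

The value of r does not depend on the choice of `a` and `b`, which reflects the statement that the correlation coefficient is unaffected by a change of origin.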
Example: 3 For the data
x : 4, 7, 8, 3, 4
y : 5, 8, 6, 3, 5
The Karl Pearson's coefficient is [Kerala (Engg.) 2002]
(a) $\dfrac{63}{\sqrt{94 \times 66}}$ (b) 63 (c) $\dfrac{63}{\sqrt{94}}$ (d) $\dfrac{63}{\sqrt{66}}$
Solution: (a) Take A = 5, B = 5.
$x_i$ | $y_i$ | $u_i = x_i - 5$ | $v_i = y_i - 5$ | $u_i^2$ | $v_i^2$ | $u_i v_i$
4 | 5 | -1 | 0 | 1 | 0 | 0
7 | 8 | 2 | 3 | 4 | 9 | 6
8 | 6 | 3 | 1 | 9 | 1 | 3
3 | 3 | -2 | -2 | 4 | 4 | 4
4 | 5 | -1 | 0 | 1 | 0 | 0
Total | | $\sum u_i = 1$ | $\sum v_i = 2$ | $\sum u_i^2 = 19$ | $\sum v_i^2 = 14$ | $\sum u_i v_i = 13$
$r(x, y) = \dfrac{\sum u_i v_i - \dfrac{1}{n}\sum u_i \sum v_i}{\sqrt{\sum u_i^2 - \dfrac{1}{n}\left(\sum u_i\right)^2}\,\sqrt{\sum v_i^2 - \dfrac{1}{n}\left(\sum v_i\right)^2}} = \dfrac{13 - \dfrac{1 \times 2}{5}}{\sqrt{19 - \dfrac{1^2}{5}}\,\sqrt{14 - \dfrac{2^2}{5}}} = \dfrac{63}{\sqrt{94}\,\sqrt{66}}$.

Example: 4 Coefficient of correlation between observations (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1) is [Pb. CET 1997; Him. CET 2001; DCE 2002]
(a) 1 (b) -1 (c) 0 (d) None of these
Solution: (b) Since there is a linear relationship between x and y, i.e. x + y = 7, with y decreasing as x increases, the coefficient of correlation = -1.

Example: 5 The value of co-variance of two variables x and y is $-\dfrac{148}{3}$ and the variance of x is $\dfrac{272}{3}$ and the variance of y is $\dfrac{131}{3}$. The coefficient of correlation is
(a) 0.48 (b) 0.78 (c) 0.87 (d) None of these
Solution: (d) We know that coefficient of correlation $= \dfrac{Cov(x, y)}{\sigma_x \cdot \sigma_y}$.
Since the covariance is negative, the correlation coefficient must be negative. Hence (d) is the correct answer.

Example: 6 The coefficient of correlation between two variables x and y is 0.5, their covariance is 16. If the S.D. of x is 4, then the S.D. of y is equal to [AMU 1988, 89, 90]
(a) 4 (b) 8 (c) 16 (d) 64
Solution: (b) We have $r_{xy} = 0.5$, Cov(x, y) = 16, S.D. of x, i.e. $\sigma_x = 4$, $\sigma_y = ?$
We know that $r(x, y) = \dfrac{Cov(x, y)}{\sigma_x \cdot \sigma_y}$, so $0.5 = \dfrac{16}{4 \cdot \sigma_y}$; $\sigma_y = 8$.

Example: 7 For a bivariate distribution (x, y) if $\sum x = 50$, $\sum y = 60$, $\sum xy = 350$, $\bar{x} = 5$, $\bar{y} = 6$, variance of x is 4, variance of y is 9, then r(x, y) is [AMU 1991; Pb. CET 1998; DCE 1998]
(a) 5/6 (b) 5/36 (c) 11/3 (d) 11/18
Solution: (a) $\bar{x} = \dfrac{\sum x}{n}$, so $5 = \dfrac{50}{n}$ and n = 10.
$Cov(x, y) = \dfrac{\sum xy}{n} - \bar{x} \cdot \bar{y} = \dfrac{350}{10} - (5)(6) = 5$.
$r(x, y) = \dfrac{Cov(x, y)}{\sigma_x \cdot \sigma_y} = \dfrac{5}{\sqrt{4} \cdot \sqrt{9}} = \dfrac{5}{6}$.

Example: 8 A, B, C, D are non-zero constants, such that
(i) both A and C are negative.
(ii) A and C are of opposite sign.
If coefficient of correlation between x and y is r, then that between AX + B and CY + D is
(a) r (b) -r (c) $\dfrac{A}{C}r$ (d) $-\dfrac{A}{C}r$
Solution: (a, b) (i) Both A and C are negative.
Now Cov(AX + B, CY + D) = AC · Cov(X, Y), $\sigma_{AX+B} = |A|\sigma_x$ and $\sigma_{CY+D} = |C|\sigma_y$.
Hence $\rho(AX + B, CY + D) = \dfrac{AC \cdot Cov(X, Y)}{(|A|\sigma_x)(|C|\sigma_y)} = \dfrac{AC}{|AC|}\rho(X, Y) = \rho(X, Y) = r$, (since AC > 0).
(ii) $\rho(AX + B, CY + D) = \dfrac{AC}{|AC|}\rho(X, Y) = \dfrac{AC}{-AC}\rho(X, Y) = -\rho(X, Y) = -r$, (since AC < 0).

2.2.4 Rank Correlation.
Let us suppose that a group of n individuals is arranged in order of merit or proficiency in possession of two characteristics A and B. These ranks in the two characteristics will, in general, be different. For example, if we consider the relation between intelligence and beauty, it is not necessary that a beautiful individual is intelligent also.
Rank correlation : $\rho = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)}$, which is Spearman's formula for the rank correlation coefficient,
where $\sum d^2$ = sum of the squares of the differences of two ranks and n is the number of pairs of observations.
Note : We always have $\sum d_i = \sum (x_i - y_i) = \sum x_i - \sum y_i = n\bar{x} - n\bar{y} = 0$, (since $\bar{x} = \bar{y}$).
If all d's are zero, then r = 1, which shows that there is perfect rank correlation between the variables and which is the maximum value of r.
If however some values of $x_i$ are equal, then the coefficient of rank correlation is given by
$r = 1 - \dfrac{6\left[\sum d^2 + \dfrac{1}{12}\sum (m^3 - m)\right]}{n(n^2 - 1)}$
where m is the number of times a particular $x_i$ is repeated.
Positive and Negative rank correlation coefficients
Let r be the rank correlation coefficient then, if
r > 0, it means that if the rank of one characteristic is high, then that of the other is also high or if the rank of one characteristic is low, then that of the other is also low. e.g., if the two characteristics be height and weight of persons, then r > 0 means that the tall persons are also heavy in weight.
r = 1, it means that there is perfect correlation in the two characteristics, i.e., every individual is getting the same ranks in the two characteristics. Here the ranks are of the type (1, 1), (2, 2), ..., (n, n).
r < 0, it means that if the rank of one characteristic is high, then that of the other is low or if the rank of one characteristic is low, then that of the other is high. e.g., if the two characteristics be richness and slimness in persons, then r < 0 means that the rich persons are not slim.
r = -1, it means that there is perfect negative correlation in the two characteristics, i.e., an individual getting the highest rank in one characteristic is getting the lowest rank in the second characteristic. Here the ranks in the two characteristics in a group of n individuals are of the type (1, n), (2, n - 1), ..., (n, 1).
r = 0, it means that no relation can be established between the two characteristics.

Important Tips
If r = 0, the variables x and y are said to be uncorrelated (this alone does not guarantee that they are independent).
If r = -1, the correlation is said to be negative and perfect.
If r = +1, the correlation is said to be positive and perfect.
Correlation is a pure number and hence unitless.
Correlation coefficient is not affected by change of origin and scale.
If two variates are connected by the linear relation x + y = K, then x, y are in perfect indirect correlation. Here r = -1.
If x, y are two independent variables, then $\rho(x + y, x - y) = \dfrac{\sigma_x^2 - \sigma_y^2}{\sigma_x^2 + \sigma_y^2}$.
$r(x, y) = \dfrac{\sum u_i v_i - \dfrac{1}{n}\sum u_i \cdot \sum v_i}{\sqrt{\sum u_i^2 - \dfrac{1}{n}\left(\sum u_i\right)^2}\,\sqrt{\sum v_i^2 - \dfrac{1}{n}\left(\sum v_i\right)^2}}$, where $u_i = x_i - A$, $v_i = y_i - B$.

Example: 9 Two numbers within the bracket denote the ranks of 10 students of a class in two subjects (1, 10), (2, 9), (3, 8), (4, 7), (5, 6), (6, 5), (7, 4), (8, 3), (9, 2), (10, 1). The rank correlation coefficient is [MP PET 1996]
(a) 0 (b) -1 (c) 1 (d) 0.5
Solution: (b) Rank correlation coefficient is $r = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)}$, where d = y - x for pair (x, y).
$\sum d^2 = 9^2 + 7^2 + 5^2 + 3^2 + 1^2 + (-1)^2 + (-3)^2 + (-5)^2 + (-7)^2 + (-9)^2 = 330$
Also n = 10; $r = 1 - \dfrac{6 \times 330}{10(100 - 1)} = -1$.

Example: 10 Let $x_1, x_2, x_3, \dots, x_n$ be the ranks of n individuals according to character A and $y_1, y_2, \dots, y_n$ the ranks of the same individuals according to another character B such that $x_i + y_i = n + 1$ for i = 1, 2, 3, ..., n. Then the coefficient of rank correlation between the characters A and B is
(a) 1 (b) 0 (c) -1 (d) None of these
Solution: (c) $x_i + y_i = n + 1$ for all i = 1, 2, 3, ..., n.
Let $x_i - y_i = d_i$. Then $2x_i = n + 1 + d_i$, so $d_i = 2x_i - (n + 1)$.
$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} [2x_i - (n + 1)]^2 = \sum_{i=1}^{n} [4x_i^2 + (n + 1)^2 - 4x_i(n + 1)]$
$= 4\sum_{i=1}^{n} x_i^2 + n(n + 1)^2 - 4(n + 1)\sum_{i=1}^{n} x_i = 4\,\dfrac{n(n + 1)(2n + 1)}{6} + n(n + 1)^2 - 4(n + 1)\dfrac{n(n + 1)}{2} = \dfrac{n(n^2 - 1)}{3}$.
Hence $r = 1 - \dfrac{6\sum d_i^2}{n(n^2 - 1)} = 1 - \dfrac{6\,n(n^2 - 1)}{3\,n(n^2 - 1)}$, i.e., r = -1.

Regression

2.2.5 Linear Regression.
If a relation between two variates x and y exists, then the dots of the scatter diagram will more or less be concentrated around a curve which is called the curve of regression. If this curve be a straight line, then it is known as the line of regression and the regression is called linear regression.
Line of regression : The line of regression is the straight line which in the least square sense gives the best fit to the given frequency.

2.2.6 Equations of lines of Regression.
(1) Regression line of y on x : If the value of x is known, then the value of y can be found as
$y - \bar{y} = \dfrac{Cov(x, y)}{\sigma_x^2}(x - \bar{x})$ or $y - \bar{y} = r\dfrac{\sigma_y}{\sigma_x}(x - \bar{x})$
(2) Regression line of x on y : It estimates x for the given value of y as
$x - \bar{x} = \dfrac{Cov(x, y)}{\sigma_y^2}(y - \bar{y})$ or $x - \bar{x} = r\dfrac{\sigma_x}{\sigma_y}(y - \bar{y})$
(3) Regression coefficient : (i) Regression coefficient of y on x is $b_{yx} = \dfrac{r\sigma_y}{\sigma_x} = \dfrac{Cov(x, y)}{\sigma_x^2}$
(ii) Regression coefficient of x on y is $b_{xy} = \dfrac{r\sigma_x}{\sigma_y} = \dfrac{Cov(x, y)}{\sigma_y^2}$.

2.2.7 Angle between Two lines of Regression.
D YG Equation of the two lines of regression are y y byx (x x ) and x x b xy (y y ) We have, m 1 slope of the line of regression of y on x = b yx r. m 2 Slope of line of regression of x on y = y x 1 y b xy r. x U y r y ( y r 2 y ) x (1 r 2 ) x y m 2 m1 r x x tan =. r y y 1 m 1m 2 r x2 r y2 r( x2 y2 ) 1. x r x ST Here the positive sign gives the acute angle , because r 2 1 and x , y are positive. tan Note 1 r 2 x y. 2 r x y2.....(i) : If r 0 , from (i) we conclude tan or / 2 i.e., two regression lines are at right angels. If r 1 , tan 0 i.e., 0 , since is acute i.e., two regression lines coincide. 2.2.8 Important points about Regression coefficients bxy and byx. (1) r byx.bxy i.e. the coefficient of correlation is the geometric mean of the coefficient of regression. (2) If b yx 1 , then b xy 1 i.e. if one of the regression coefficient is greater than unity, the other will be less than unity. Correlation and Regression 71 (3) If the correlation between the variable is not perfect, then the regression lines intersect at (x , y ). (4) b yx is called the slope of regression line y on x and 1 is called the slope of regression line x on y. b xy (5) byx b xy 2 byx b xy or b yx b xy 2r , i.e. the arithmetic mean of the regression coefficient is greater than the (7) The product of lines of regression’s gradients is given by y2 x2 60 correlation coefficient. (6) Regression coefficients are independent of change of origin but not of scale.. E3 (8) If both the lines of regression coincide, then correlation will be perfect linear. (9) If both b yx and b xy are positive, the r will be positive and if both b yx and b xy are negative, the r will be negative. Important Tips . Thus the regression lines are perpendicular. 2 If r 0 , then tan is not defined i.e. If r 1 or 1 , then tan = 0 i.e. = 0. Thus the regression lines are coincident. 
If regression lines are $y = ax + b$ and $x = cy + d$, then $\bar{x} = \frac{bc + d}{1 - ac}$ and $\bar{y} = \frac{ad + b}{1 - ac}$.
If $b_{yx}$, $b_{xy}$ and $r \ge 0$ then $\frac{1}{2}(b_{xy} + b_{yx}) \ge r$ and if $b_{xy}$, $b_{yx}$ and $r \le 0$ then $\frac{1}{2}(b_{xy} + b_{yx}) \le r$.
Correlation measures the relationship between variables while regression measures only the cause and effect of relationship between the variables.
If the line of regression of y on x makes an angle $\alpha$ with the +ive direction of X-axis, then $\tan\alpha = b_{yx}$.
If the line of regression of x on y makes an angle $\beta$ with the +ive direction of X-axis, then $\cot\beta = b_{xy}$.

Example: 11 The two lines of regression are $2x - 7y + 6 = 0$ and $7x - 2y + 1 = 0$. The correlation coefficient between x and y is [DCE 1999]
(a) – 2/7 (b) 2/7 (c) 4/49 (d) None of these

Solution: (b) The two lines of regression are $2x - 7y + 6 = 0$ .....(i) and $7x - 2y + 1 = 0$ .....(ii)

If (i) is the regression equation of y on x, then (ii) is the regression equation of x on y.

We write these as $y = \frac{2}{7}x + \frac{6}{7}$ and $x = \frac{2}{7}y - \frac{1}{7}$.

$\therefore b_{yx} = \frac{2}{7}$, $b_{xy} = \frac{2}{7}$; $b_{yx} \cdot b_{xy} = \frac{4}{49} < 1$. So our choice is valid.

$\therefore r^2 = \frac{4}{49} \Rightarrow r = \frac{2}{7}$. [$\because b_{yx} > 0,\ b_{xy} > 0$]

Example: 12 Given that the regression coefficients are – 1.5 and 0.5, the value of the square of the correlation coefficient is [Kurukshetra CEE 2002]
(a) 0.75 (b) 0.7 (c) – 0.75 (d) – 0.5

Solution: (c) Correlation coefficient is given by $r^2 = b_{yx} \cdot b_{xy} = (-1.5)(0.5) = -0.75$.
Note that for real data both regression coefficients carry the same sign, so $r^2 = b_{yx} \cdot b_{xy}$ cannot be negative; the keyed answer follows only from formally multiplying the given values.

Example: 13 In a bivariate data $\sum x = 30$, $\sum y = 400$, $\sum x^2 = 196$, $\sum xy = 850$ and $n = 10$. The regression coefficient of y on x is [Kerala (Engg.) 2002]
(a) – 3.1 (b) – 3.2 (c) – 3.3 (d) – 3.4

Solution: (c) $\mathrm{Cov}(x, y) = \frac{1}{n}\sum xy - \frac{1}{n^2}\sum x \cdot \sum y = \frac{1}{10}(850) - \frac{1}{100}(30)(400) = -35$

$\mathrm{Var}(x) = \sigma_x^2 = \frac{1}{n}\sum x^2 - \left(\frac{\sum x}{n}\right)^2 = \frac{196}{10} - \left(\frac{30}{10}\right)^2 = 10.6$

$b_{yx} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} = \frac{-35}{10.6} = -3.3$.

Example: 14 If two lines of regression are $8x - 10y + 66 = 0$ and $40x - 18y = 214$, then $(\bar{x}, \bar{y})$ is [AMU 1994; DCE 1994]
(a) (17, 13) (b) (13, 17) (c) (– 17, 13) (d) (– 13, – 17)

Solution: (b) Since the lines of regression pass through $(\bar{x}, \bar{y})$, the equations will be $8\bar{x} - 10\bar{y} + 66 = 0$ and $40\bar{x} - 18\bar{y} = 214$.
On solving the above equations, we get the required answer $\bar{x} = 13$, $\bar{y} = 17$.

Example: 15 The regression coefficient of y on x is $\frac{2}{3}$ and of x on y is $\frac{4}{3}$. If the acute angle between the regression lines is $\theta$, then $\tan\theta$ =
(a) $\frac{1}{18}$ (b) $\frac{1}{9}$ (c) $\frac{2}{9}$ (d) None of these

Solution: (a) $b_{yx} = \frac{2}{3}$, $b_{xy} = \frac{4}{3}$. Therefore, $\tan\theta = \left|\frac{b_{xy} - \frac{1}{b_{yx}}}{1 + \frac{b_{xy}}{b_{yx}}}\right| = \left|\frac{\frac{4}{3} - \frac{3}{2}}{1 + \frac{4/3}{2/3}}\right| = \frac{1}{18}$.

Example: 16 If the lines of regression of y on x and x on y make angles $30^\circ$ and $60^\circ$ respectively with the positive direction of X-axis, then the correlation coefficient between x and y is [MP PET 2002]
(a) $\frac{1}{\sqrt{2}}$ (b) $\frac{1}{2}$ (c) $\frac{1}{\sqrt{3}}$ (d) $\frac{1}{3}$

Solution: (c) Slope of the regression line of y on x = $b_{yx} = \tan 30^\circ = \frac{1}{\sqrt{3}}$.
Slope of the regression line of x on y = $\frac{1}{b_{xy}} = \tan 60^\circ = \sqrt{3} \Rightarrow b_{xy} = \frac{1}{\sqrt{3}}$.
Hence, $r = \sqrt{b_{xy} \cdot b_{yx}} = \sqrt{\frac{1}{\sqrt{3}} \cdot \frac{1}{\sqrt{3}}} = \frac{1}{\sqrt{3}}$.

Example: 17 If two random variables x and y are connected by the relationship $2x + y = 3$, then $r_{xy}$ = [AMU 1991]
(a) 1 (b) – 1 (c) – 2 (d) 3

Solution: (b) Since $2x + y = 3$, $\therefore 2\bar{x} + \bar{y} = 3$; $\therefore y - \bar{y} = -2(x - \bar{x})$. So, $b_{yx} = -2$.
Also $x - \bar{x} = -\frac{1}{2}(y - \bar{y})$, $\therefore b_{xy} = -\frac{1}{2}$.
$\therefore r_{xy}^2 = b_{yx} \cdot b_{xy} = (-2)\left(-\frac{1}{2}\right) = 1 \Rightarrow r_{xy} = -1$. ($\because$ both $b_{yx}$, $b_{xy}$ are –ive)

2.2.9 Standard error and Probable error.

(1) Standard error of prediction : The deviation of the predicted value from the observed value is known as the standard error of prediction and is defined as

$S_y = \sqrt{\frac{\sum (y - y_p)^2}{n}}$

where y is the actual value and $y_p$ is the predicted value.

In relation to the coefficient of correlation, it is given by
(i) Standard error of estimate of x is $S_x = \sigma_x\sqrt{1 - r^2}$
(ii) Standard error of estimate of y is $S_y = \sigma_y\sqrt{1 - r^2}$.

(2) Relation between probable error and standard error : If r is the correlation coefficient in a sample of n pairs of observations, then its standard error is S.E.$(r) = \frac{1 - r^2}{\sqrt{n}}$ and its probable error is P.E.$(r)$ = 0.6745 (S.E.) = $0.6745\,\frac{1 - r^2}{\sqrt{n}}$. The probable error or the standard error is used for interpreting the coefficient of correlation:
ID (i) If r P.E.(r) , there is no evidence of correlation. (ii) If r 6 P.E.(r) , the existence of correlation is certain. Example: 18 If Var(x ) 21 and Var(y ) 21 and r 1 , then standard error of y is 4 (b) 1 2 D YG (a) 0 S y y 1 r2 y 1 1 0. ST U Solution: (a) U The square of the coefficient of correlation for a bivariate distribution is known as the “Coefficient of determination”. (c) 1 4 (d) 1