Descriptive Statistics for Univariate Data PDF

USOlMISTAOl DESCRIPTIVE STATISTICS FOR UNIVARIATE DATA

Unit-I: ANALYSIS OF QUANTITATIVE DATA - I

Meaning/Definition:
(i) Statistics is a science which deals with the collection, presentation, analysis and interpretation of numerical data.
(ii) Statistics is a method of decision making in the face of uncertainty on the basis of numerical data and at calculated risk.

Types of Data:

What is a Scale?
A scale is a device or an object used to measure or quantify any event or another object.

Levels of Measurement
There are four different scales of measurement, and any data can be classified as belonging to one of them. The four types of scales are:
• Nominal Scale
• Ordinal Scale
• Interval Scale
• Ratio Scale

Nominal Scale
A nominal scale is the 1st level of measurement scale, in which numbers serve as “tags” or “labels” to classify or identify objects. A nominal scale usually deals with non-numeric variables or with numbers that do not have any value.

Characteristics of Nominal Scale
• A nominal scale variable is classified into two or more categories. In this measurement mechanism, every answer should fall into one of the classes.
• It is qualitative. The numbers are used here only to identify the objects.
• The numbers don’t define the object characteristics. The only permissible operation on numbers in the nominal scale is “counting.”

Example: An example of a nominal scale measurement is given below:
What is your gender?
M - Male
F - Female
Here, the letters are used as tags, and the answer to this question should be either M or F.

Ordinal Scale
The ordinal scale is the 2nd level of measurement. It reports the ordering and ranking of data without establishing the degree of variation between them. Ordinal represents the “order.” Ordinal data is known as qualitative data or categorical data. It can be grouped, named and also ranked.
Characteristics of the Ordinal Scale
• The ordinal scale shows the relative ranking of the variables
• It identifies and describes the magnitude of a variable
• Along with the information provided by the nominal scale, ordinal scales give the rankings of those variables
• The interval properties are not known
• The surveyors can quickly analyse the degree of agreement concerning the identified order of variables

Example:
• Ranking of school students – 1st, 2nd, 3rd, etc.
• Ratings in restaurants
• Evaluating the frequency of occurrences: Very often, Often, Not often, Not at all
• Assessing the degree of agreement: Totally agree, Agree, Neutral, Disagree, Totally disagree

Interval Scale
The interval scale is the 3rd level of measurement scale. It is defined as a quantitative measurement scale in which the difference between two values is meaningful. In other words, the variables are measured exactly rather than only relative to one another, but the position of zero is arbitrary.

Characteristics of Interval Scale:
• The interval scale is quantitative, as it can quantify the difference between the values
• It allows calculating the mean and median of the variables
• To understand the difference between two variables, you can subtract one value from the other
• The interval scale is the preferred scale in Statistics, as it helps to assign numerical values to arbitrary assessments such as feelings, calendar types, etc.

Example:
• Likert Scale
• Net Promoter Score (NPS)
• Bipolar Matrix Table

Ratio Scale
The ratio scale is the 4th level of measurement scale, and it is quantitative. It allows researchers to compare differences or intervals, and it has one unique feature: it possesses a true origin, or zero point.
Characteristics of Ratio Scale:
• The ratio scale has the feature of an absolute zero
• It doesn’t have negative numbers, because of its zero-point feature
• It affords unique opportunities for statistical analysis. The values can be meaningfully added, subtracted, multiplied and divided. Mean, median and mode can be calculated using the ratio scale.
• The ratio scale has unique and useful properties. One such feature is that it allows unit conversions like kilogram – calories, gram – calories, etc.

Example: An example of a ratio scale is:
What is your weight in kgs?
• Less than 55 kgs
• 55 – 75 kgs
• 76 – 85 kgs
• 86 – 95 kgs
• More than 95 kgs

Measures of Central Tendency
To understand the concept, let us consider the following example:
A study is conducted to determine if dieting plus exercise is more effective in producing weight loss than dieting alone. Twelve pairs of matched subjects are run in the study. Subjects are matched on initial weight, initial level of exercise, age, and gender. One member of each pair is put on a diet for 3 months. The other member receives the same diet but in addition is put on a moderate exercise regime. The following scores indicate the weight loss in pounds over the 3-month period for each subject:

Pair              1   2   3   4   5   6   7   8   9  10  11  12
Diet + Exercise  24  20  22  15  23  21  16  17  19  25  24  13
Diet alone       16  18  19  16  18  18  17  19  13  18  19  14

(i) Identify the objective of the above problem.
(ii) Which statistical measure do you calculate? Why?

Objective: To compare two different methods of producing weight loss. One way to achieve this objective is to calculate an average (mean) for each method.

Measures of Central Tendency are measures which condense a huge set of numerical data into single numerical values that are representative of the entire data set (distribution). They give us an idea about the concentration of the values in the central part of the distribution.
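As a small illustration of the objective stated above, the mean weight loss of the two groups in the study can be computed directly. This is only a sketch: the variable names are chosen here for illustration, and comparing two means is a first descriptive step, not a formal test.

```python
from statistics import mean

# Weight loss in pounds for the 12 matched pairs (data from the example above)
diet_plus_exercise = [24, 20, 22, 15, 23, 21, 16, 17, 19, 25, 24, 13]
diet_alone = [16, 18, 19, 16, 18, 18, 17, 19, 13, 18, 19, 14]

# Arithmetic mean = sum of observations / number of observations
mean_exercise = mean(diet_plus_exercise)   # 239 / 12
mean_diet = mean(diet_alone)               # 205 / 12

print(f"Diet + exercise: {mean_exercise:.2f} lb")
print(f"Diet alone:      {mean_diet:.2f} lb")
```

The group on diet plus exercise shows the larger average loss (about 19.92 lb against about 17.08 lb), which is the kind of single-number comparison a measure of central tendency is meant to provide.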
In brief, a Measure of Central Tendency of statistical data is the value of the variable which is representative of the entire data set (distribution). Two series of observations are hard to compare directly because of the unsystematic variations generally present in the series (sets of numbers), but summary constants such as averages make it possible to compare the series easily.

Measures of Central Tendency are very useful for
(i) Describing the distribution in a concise manner.
(ii) Comparative studies of different distributions.
(iii) Computing various other statistical measures such as dispersion (variation), skewness (lack of symmetry), kurtosis, etc.

The various measures of central tendency are
(i) Mean or Arithmetic mean (A.M)
(ii) Median
(iii) Mode
(iv) Geometric mean (G.M)
(v) Harmonic mean (H.M)

Requisites of a good (ideal) measure of central tendency:
There are various measures of central tendency. The difficulty lies in choosing among them, as no hard and fast rules have been laid down for selecting one. However, some norms have been set which work as a guideline for choosing a particular measure of central tendency. A measure of central tendency is good or satisfactory if it possesses the following characteristics:
(1) It should be rigidly defined. That is, the definition should be clear and unambiguous, so that different persons arrive at one and only one interpretation.
(2) It should be easy to calculate and understand.
(3) It should be based on all the observations.
(4) It should be least affected by extreme observations.
(5) It should be stable with regard to sampling. That is, if a number of samples of the same size are drawn from a population, the measure should show the minimum variation among the values calculated from the different samples.

(i) Mean or Arithmetic Mean:
The mean of a given set of observations is their sum divided by the number of observations. It is the most common and useful measure of central tendency.
For ungrouped (raw) data:
Let $X_i$, $i = 1, 2, \dots, n$ be the given $n$ observations. Then their mean is denoted by $\bar{X}$ and is defined as

$$\bar{X} = \frac{\text{Sum of all observations}}{\text{No. of observations}} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

For grouped data:
For simple (discrete) frequency distribution: Let $(X_i, f_i)$, $i = 1, 2, \dots, n$ be the given frequency distribution. Then its mean is denoted by $\bar{X}$ and is defined as

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{n} f_i X_i, \quad \text{where } N = \sum_{i=1}^{n} f_i$$

For grouped frequency distribution: In the case of a grouped frequency distribution, the $X_i$'s are the mid-values of the respective classes.

Merits and Demerits of Mean
Merits:
(1) It is rigidly defined.
(2) It is easy to calculate and understand.
(3) It is based on all the observations.
(4) Of all the averages, the mean is the most stable with regard to sampling.
Demerits:
(1) It is very much affected by extreme observations.
(2) It cannot be used in the case of open-end classes.
(3) It cannot be determined graphically.
(4) It may lead to wrong conclusions if the details of the data from which it is calculated are not available.

Deviation about any arbitrary value A: If $X_i$, $i = 1, 2, \dots, n$ are $n$ observations and $A$ is any arbitrary value, then $X_i - A$ is called the deviation of the $i$th observation about the value $A$.

Deviation about the mean: If $X_i$, $i = 1, 2, \dots, n$ are $n$ observations and $\bar{X}$ is their mean, then $X_i - \bar{X}$ is called the deviation of the $i$th observation about the mean.

Properties of Mean:
(1) The algebraic sum of the deviations of the observations from their mean is always zero. Mathematically,

$$\sum_{i=1}^{n} (X_i - \bar{X}) = 0 \quad \text{or} \quad \sum_{i=1}^{n} f_i (X_i - \bar{X}) = 0$$

Proof:

$$\sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - \bar{X}\sum_{i=1}^{n} 1 = n\bar{X} - n\bar{X} = 0 \quad \because \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

OR

$$\sum_{i=1}^{n} (X_i - A) = 0 \quad \text{if } A = \bar{X}$$

Proof: Setting the sum of deviations about $A$ equal to zero,

$$\sum_{i=1}^{n} (X_i - A) = \sum_{i=1}^{n} X_i - A\sum_{i=1}^{n} 1 = 0 \implies \sum_{i=1}^{n} X_i = nA \implies A = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$$

(2) The sum of the squares of the deviations of a given set of observations is minimum when the deviations are taken from the mean.
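Properties (1) and (2) can be checked numerically before they are proved. The sketch below uses a small made-up data set (the values and the helper name `sum_sq_dev` are hypothetical, chosen only for illustration): the deviations about the mean sum to zero, and the sum of squared deviations about any nearby value $A \neq \bar{X}$ is larger than about $\bar{X}$ itself.

```python
def sum_sq_dev(values, a):
    """Sum of squared deviations of the observations about the value a."""
    return sum((x - a) ** 2 for x in values)

# Hypothetical observations; the sum is 25, so the mean is exactly 5.0
data = [2, 4, 4, 6, 9]
x_bar = sum(data) / len(data)

# Property (1): deviations about the mean sum to zero
sum_dev = sum(x - x_bar for x in data)

# Property (2): S(A) is smallest at A = mean; try a few shifted values of A
s_at_mean = sum_sq_dev(data, x_bar)
s_shifted = {d: sum_sq_dev(data, x_bar + d) for d in (-1.0, -0.1, 0.1, 1.0)}

print("mean:", x_bar)
print("sum of deviations about the mean:", sum_dev)
print("S at mean:", s_at_mean)
for shift, s in s_shifted.items():
    print(f"S at mean {shift:+}: {s}")
```

A numerical check of this kind does not replace the proof, but it shows the pattern the calculus argument establishes: moving $A$ away from $\bar{X}$ by $d$ increases $S$ by $n d^2$.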
Mathematically,

$$S = \sum_{i=1}^{n} (X_i - A)^2 \text{ is minimum when } A = \bar{X}$$

or, for a frequency distribution,

$$S = \sum_{i=1}^{n} f_i (X_i - A)^2 \text{ is minimum when } A = \bar{X}, \quad \text{where } \bar{X} = \frac{1}{N}\sum_{i=1}^{n} f_i X_i, \; N = \sum_{i=1}^{n} f_i$$

i.e. $\sum_{i=1}^{n} (X_i - \bar{X})^2$ or $\sum_{i=1}^{n} f_i (X_i - \bar{X})^2$ is the minimum value.

Proof: Here we apply the principle of maxima and minima from differential calculus. For $S$ to be minimum, we should have

$$\frac{\partial S}{\partial A} = 0 \quad \text{and} \quad \frac{\partial^2 S}{\partial A^2} > 0$$

We have

$$S = \sum_{i=1}^{n} f_i (X_i - A)^2 \qquad (1)$$

Differentiating (1) with respect to $A$, we get

$$\frac{\partial S}{\partial A} = \sum_{i=1}^{n} 2 f_i (X_i - A)(-1) = -2\sum_{i=1}^{n} f_i (X_i - A) \qquad (2)$$

Now

$$\frac{\partial S}{\partial A} = 0 \implies -2\sum_{i=1}^{n} f_i (X_i - A) = 0 \implies \sum_{i=1}^{n} f_i X_i - A\sum_{i=1}^{n} f_i = 0$$

$$\implies \sum_{i=1}^{n} f_i X_i - NA = 0 \quad \because N = \sum_{i=1}^{n} f_i$$

$$\implies A = \frac{1}{N}\sum_{i=1}^{n} f_i X_i = \bar{X}$$

Differentiating (2) with respect to $A$, we get

$$\frac{\partial^2 S}{\partial A^2} = -2\sum_{i=1}^{n} f_i (-1) = 2\sum_{i=1}^{n} f_i = 2N > 0$$

Hence $S$ is minimum at the point $A = \bar{X}$.

(3) Mean depends on change of origin as well as scale.
Proof: Let $X_i$, $i = 1, 2, \dots, n$ be $n$ observations. Then their mean is denoted by $\bar{X}$ and is given by

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Let us define a new variable $u_i$ as

$$u_i = \frac{X_i - A}{C}, \quad i = 1, 2, \dots, n$$

where $A$ is the new origin and $C$ is the new scale. From the above, we have $X_i = A + C u_i$, $i = 1, 2, \dots, n$. Taking the summation over $i$ from 1 to $n$, we get

$$\sum_{i=1}^{n} X_i = \sum_{i=1}^{n} A + C\sum_{i=1}^{n} u_i = nA + C\sum_{i=1}^{n} u_i$$

Dividing both sides by $n$, we get

$$\bar{X} = A + C\bar{u}$$

which shows that the mean depends on change of origin and scale.

(4) Combined Mean: If $\bar{X}_1, \bar{X}_2, \dots, \bar{X}_k$ are the means of $k$ groups (series) with $n_1, n_2, \dots, n_k$ observations respectively, then the mean of the combined group (all the observations) with $n = n_1 + n_2 + \dots + n_i + \dots + n_k$ observations is given by

$$\bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2 + \dots + n_i\bar{X}_i + \dots + n_k\bar{X}_k}{n_1 + n_2 + \dots + n_i + \dots + n_k}$$

Proof: Let $(X_{i1}, X_{i2}, \dots, X_{ij}, \dots, X_{in_i})$, $i = 1, 2, \dots, k$, $j = 1, 2, \dots, n_i$ be the observations in the $k$ groups respectively.
Now

$$\bar{X}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}, \quad i = 1, 2, \dots, k$$

$$\therefore \sum_{j=1}^{n_i} X_{ij} = n_i\bar{X}_i, \quad i = 1, 2, \dots, k$$

Now $\bar{X}$ = combined mean = mean of all $n$ observations

$$= \frac{\text{Sum of all } n \text{ observations}}{\text{Total no. of observations}} = \frac{1}{n}\left[\sum_{j=1}^{n_1} X_{1j} + \sum_{j=1}^{n_2} X_{2j} + \dots + \sum_{j=1}^{n_i} X_{ij} + \dots + \sum_{j=1}^{n_k} X_{kj}\right]$$

where $n = n_1 + n_2 + \dots + n_i + \dots + n_k$.

$$\therefore \bar{X} = \frac{1}{n}\left[n_1\bar{X}_1 + n_2\bar{X}_2 + \dots + n_i\bar{X}_i + \dots + n_k\bar{X}_k\right]$$

Hence the result.

In particular, if $\bar{X}_1$ and $\bar{X}_2$ are the means of two groups with $n_1$ and $n_2$ observations respectively, then the mean $\bar{X}$ of the combined group with $n_1 + n_2$ observations is given by

$$\bar{X} = \frac{n_1\bar{X}_1 + n_2\bar{X}_2}{n_1 + n_2}$$

If $n_1 = n_2 = \dots = n_k$, i.e. the number of observations in each group is the same, then

$$\bar{X} = \frac{\bar{X}_1 + \bar{X}_2 + \dots + \bar{X}_i + \dots + \bar{X}_k}{k}$$

i.e. the mean of the combined group is the mean of all the group means.

(5) Weighted Mean: Let $X_1, X_2, \dots, X_n$ be $n$ observations and $W_1, W_2, \dots, W_n$ be the corresponding weights. Then the weighted mean is given by

$$\bar{X}_w = \frac{W_1 X_1 + W_2 X_2 + \dots + W_i X_i + \dots + W_n X_n}{W_1 + W_2 + \dots + W_i + \dots + W_n} = \frac{\sum W_i X_i}{\sum W_i}$$

If $W_i = W$ for all $i = 1, 2, \dots, n$, then

$$\bar{X}_w = \frac{W\sum X_i}{W\sum 1} = \frac{1}{n}\sum X_i = \bar{X}$$

i.e. when every observation has equal weightage, the weighted mean is the same as the mean.

(ii) Median:
The median is that value of the variable which divides the data (set of observations) into two equal parts, so that the number of observations below the median equals the number above it. Thus, in contrast to the mean, which is based on all the observations, the median is a positional average, i.e. its value depends on the middle position (term).

For ungrouped (raw) data: Let $X_i$, $i = 1, 2, \dots, n$ be the given $n$ observations.
Steps:
(1) Arrange the data in either ascending or descending order.
(2) The median is the middle term if the number of observations is odd, or the mean of the two middle terms if the number of observations is even.
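The two steps above can be sketched in Python as follows. The function name `median_raw` and the sample lists are illustrative assumptions, not part of the notes; Python's standard library also offers `statistics.median`, which follows the same rule.

```python
def median_raw(values):
    """Median of ungrouped data: middle term (n odd) or mean of two middle terms (n even)."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)          # Step (1): arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # n odd: the ((n + 1) / 2)-th term
        return ordered[mid]
    # n even: mean of the (n / 2)-th and (n / 2 + 1)-th terms
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median_raw([7, 1, 5, 3, 9])   # sorted: 1 3 5 7 9
even_case = median_raw([4, 1, 3, 2])     # sorted: 1 2 3 4
print(odd_case, even_case)
```

Note that the list indices are zero-based, so the $\left(\frac{n+1}{2}\right)$th term of the ordered data sits at index `n // 2` when $n$ is odd.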
In symbols,

$$\text{Median} = \begin{cases} \text{value of the } \left(\dfrac{n+1}{2}\right)^{th} \text{ observation}, & \text{if } n \text{ is odd} \\[2ex] \text{mean of the } \left(\dfrac{n}{2}\right)^{th} \text{ and } \left(\dfrac{n}{2}+1\right)^{th} \text{ observations}, & \text{if } n \text{ is even} \end{cases}$$

For grouped data:
For simple frequency distribution: Let $(X_i, f_i)$, $i = 1, 2, \dots, n$ be the given frequency distribution.
Steps:
(1) Calculate the cumulative frequency of the less than type.
(2) Calculate $\frac{N+1}{2}$.
(3) Select the cumulative frequency just greater than (or equal to) $\frac{N+1}{2}$.
(4) The value of the variable corresponding to the selected cumulative frequency is the median.

For grouped frequency distribution: Let $(X_i - X_{i+1}, f_i)$, $i = 1, 2, \dots, n$ be the given grouped frequency distribution.
Steps:
(1) Calculate the cumulative frequency of the less than type.
(2) Calculate $\frac{N}{2}$.
(3) Select the cumulative frequency just greater than (or equal to) $\frac{N}{2}$.
(4) The class corresponding to the selected cumulative frequency is called the median class, and the median is calculated by the following formula:

$$\text{Median} = l + \left(\frac{\frac{N}{2} - F_<}{f}\right) \times c$$

where
$l$ = lower limit of the median class
$F_<$ = cumulative frequency of the class previous to the median class
$f$ = frequency of the median class
$c$ = class-width of the median class
Remark: The classes must be continuous.

Merits and Demerits of Median:
Merits:
(1) It is rigidly defined.
(2) It is easy to calculate and understand.
(3) It is not affected by extreme observations, and hence it is very useful in the case of open-end classes.
(4) It can be determined graphically.
Demerits:
(1) It is not based on all the observations.
Remark: The sum of the absolute deviations of a given set of observations is minimum when the deviations are taken from the median.

(iii) Mode:
The mode is the value of the variable which occurs most frequently (the maximum number of times) in the given data (set of observations). The mode is a measure which represents the common or typical value of the data.
Uses:
(i) The average size of shoe sold in a shop is 8.
(ii) The average size of shirt sold in a readymade garment shop is 90 (XL).
(iii) The average student in a hostel spends Rs. 1500 per month.
In all the above cases, the average referred to is the mode.

For ungrouped (raw) data: Let $X_i$, $i = 1, 2, \dots, n$ be the given $n$ observations. From the given data, select the value which occurs the maximum number of times (most often).

For simple frequency distribution: Let $(X_i, f_i)$, $i = 1, 2, \dots, n$ be the given frequency distribution.
Steps:
(1) Select the maximum frequency.
(2) The value of the variable corresponding to the selected frequency is the mode.

For grouped frequency distribution: Let $(X_i - X_{i+1}, f_i)$, $i = 1, 2, \dots, n$ be the given grouped frequency distribution.
Steps:
(1) Select the maximum frequency.
(2) The class corresponding to the selected frequency is called the modal class.
(3) The mode is determined by the following formula:

$$\text{Mode} = l + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times c$$

where
$l$ = lower limit of the modal class
$f_1$ = frequency of the modal class
$f_0$ = frequency of the class previous to the modal class
$f_2$ = frequency of the class next to the modal class
$c$ = class-width of the modal class
Remark: The classes must be continuous.

Merits and Demerits of Mode:
Merits:
(1) It is easy to calculate.
(2) It can be determined graphically.
(3) It is not at all affected by extreme observations.
Demerits:
(1) The mode is not rigidly defined. It is ill-defined if
(a) the maximum frequency is repeated,
(b) the maximum frequency occurs either at the very beginning or at the end, or
(c) the given distribution is irregular.
(2) It is not based on all the observations.

(iv) Geometric Mean (G.M):
The geometric mean of a set of $n$ observations is the $n$th root of their product.

For ungrouped (raw) data: Let $X_i$, $i = 1, 2, \dots, n$ be the given $n$ observations. Then their geometric mean is defined as

$$G.M = \left(\prod_{i=1}^{n} X_i\right)^{\frac{1}{n}} = n\text{th root of their product}$$

In particular, if $n = 2$ (i.e. with two observations $X_1$ and $X_2$), the geometric mean can be computed by taking the square root of their product. If $n > 2$, the computation of the $n$th root is very tedious. In such cases the calculation is simplified by the use of logarithms.
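The logarithmic shortcut mentioned above can be sketched in Python. This is an illustrative sketch only: the function name `geometric_mean_log` is hypothetical, natural logarithms are used in place of log tables (any base gives the same result), and the function assumes non-negative observations, returning zero when any observation is zero because the product is then zero.

```python
import math

def geometric_mean_log(values):
    """Geometric mean computed as antilog of the mean of the logarithms."""
    if not values:
        raise ValueError("geometric mean of an empty data set is undefined")
    if any(x < 0 for x in values):
        raise ValueError("this sketch assumes non-negative observations")
    if any(x == 0 for x in values):
        return 0.0                     # product is zero, so its n-th root is zero
    mean_log = sum(math.log(x) for x in values) / len(values)
    return math.exp(mean_log)          # "antilog" for natural logarithms

gm_two = geometric_mean_log([2, 8])        # square root of 16
gm_three = geometric_mean_log([1, 3, 9])   # cube root of 27
print(gm_two, gm_three)
```

Working with the sum of logarithms rather than the raw product also avoids overflow when many large observations are multiplied together, which is the practical reason the same approach is still used in software.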
Taking the logarithm on both sides, we get

$$\log(G.M) = \frac{1}{n}\log\left(\prod_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} \log(X_i)$$

$$\therefore G.M = \text{Antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log(X_i)\right)$$

Thus we see that the logarithm of the G.M is the mean of the logarithms of the observations.

For grouped data:
For simple frequency distribution: Let $(X_i, f_i)$, $i = 1, 2, \dots, n$ be the given frequency distribution. Then the geometric mean is given by

$$G.M = \text{Antilog}\left(\frac{1}{N}\sum_{i=1}^{n} f_i \log(X_i)\right), \quad \text{where } N = \sum_{i=1}^{n} f_i$$

For grouped frequency distribution: In the case of a grouped frequency distribution, the $X_i$'s are the mid-values of the respective classes.
Remark: If one of the numbers (observations) is zero, the G.M is zero.

(v) Harmonic Mean (H.M):
The harmonic mean is the reciprocal of the mean of the reciprocals of the given observations.

For ungrouped (raw) data: Let $X_i$, $i = 1, 2, \dots, n$ be the given $n$ observations. Then their harmonic mean is given by

$$H.M = \frac{1}{\frac{1}{n}\sum_{i=1}^{n} \frac{1}{X_i}}$$

i.e. the H.M of $n$ observations is the reciprocal of the mean of their reciprocals.

For grouped data:
For simple frequency distribution: Let $(X_i, f_i)$, $i = 1, 2, \dots, n$ be the given frequency distribution. Then the harmonic mean is given by

$$H.M = \frac{1}{\frac{1}{N}\sum_{i=1}^{n} \frac{f_i}{X_i}} = \frac{N}{\sum_{i=1}^{n} \frac{f_i}{X_i}}, \quad \text{where } N = \sum_{i=1}^{n} f_i$$

For grouped frequency distribution: In the case of a grouped frequency distribution, the $X_i$'s are the mid-values of the respective classes.
Remark: The H.M cannot be calculated if one of the numbers (observations) is zero.

Relationship between A.M, G.M and H.M

$$A.M \ge G.M \ge H.M$$

The sign of equality holds if and only if all the $n$ numbers (observations) are equal.

Proof: We shall establish the result for two numbers only, although the result holds true for $n$ observations. Let $a$ and $b$ be two positive real numbers, i.e. $a > 0$, $b > 0$. Then

$$A.M = \frac{a+b}{2}, \qquad G.M = \sqrt{ab}, \qquad H.M = \frac{2}{\frac{1}{a} + \frac{1}{b}} = \frac{2ab}{a+b}$$

Consider

$$A.M - G.M = \frac{a+b}{2} - \sqrt{ab} = \frac{a + b - 2\sqrt{ab}}{2} = \frac{(\sqrt{a} - \sqrt{b})^2}{2} \ge 0$$

$$\therefore A.M \ge G.M \qquad (1)$$

The sign of equality holds only if $\frac{(\sqrt{a} - \sqrt{b})^2}{2} = 0$, i.e. $\sqrt{a} - \sqrt{b} = 0 \implies \sqrt{a} = \sqrt{b} \implies a = b$, i.e. if and only if the two numbers are equal.

Also consider

$$G.M - H.M = \sqrt{ab} - \frac{2ab}{a+b} = \sqrt{ab} - \frac{2\sqrt{ab}\sqrt{ab}}{a+b} = \sqrt{ab}\left(1 - \frac{2\sqrt{ab}}{a+b}\right) = \frac{\sqrt{ab}}{a+b}\left(a + b - 2\sqrt{ab}\right) = \frac{\sqrt{ab}}{a+b}(\sqrt{a} - \sqrt{b})^2 \ge 0$$

$$\therefore G.M - H.M \ge 0 \qquad (2)$$

The sign of equality holds only if $\frac{\sqrt{ab}}{a+b}(\sqrt{a} - \sqrt{b})^2 = 0$, i.e. $\sqrt{a} - \sqrt{b} = 0 \implies \sqrt{a} = \sqrt{b} \implies a = b$, i.e. if and only if the two numbers are equal.

From (1) and (2),

$$A.M \ge G.M \ge H.M$$

The sign of equality holds if and only if the two numbers (observations) are equal.

Remark: (i) For two numbers, $G^2 = AH$, where $A$, $G$, $H$ are the A.M, G.M and H.M respectively.
Proof: Let $a > 0$ and $b > 0$ be two positive numbers. Then

$$A \times H = \frac{a+b}{2} \times \frac{2ab}{a+b} = ab = G^2$$

For more than two observations, the result $G^2 = AH$ holds if the numbers (observations) are in geometric progression (G.P).

Quantiles (Partition values):
Quantiles are the values which divide the entire data (set of numbers or observations) into some number of equal parts. The number of parts may be two, four, eight, ten or a hundred.

Quartiles: Quartiles are the values which divide the entire data (set of numbers or observations) into four equal parts. They are 3 in number, namely $Q_1, Q_2, Q_3$.
The $i$th quartile $Q_i$ is the value of $X$ (variable) corresponding to the cumulative frequency just greater than (or equal to) $\frac{i \times N}{4}$, $i = 1, 2, 3$.
For a continuous frequency distribution, the class corresponding to the cumulative frequency just greater than (or equal to) $\frac{i \times N}{4}$ is called the $i$th quartile class, and the quartile is given by

$$Q_i = l + \left(\frac{\frac{iN}{4} - F_<}{f}\right) \times C, \quad i = 1, 2, 3$$

where $l$, $F_<$, $f$ and $C$ are the lower limit, the cumulative frequency of the previous class, the frequency and the class-width of the $i$th quartile class.

Octiles: Octiles are the values which divide the entire data (set of numbers or observations) into eight equal parts. They are 7 in number, namely $O_1, O_2, \dots, O_7$.
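The interpolation formula for quartiles has the same shape for every partition value: only the target cumulative frequency ($\frac{iN}{4}$, $\frac{jN}{8}$, $\frac{kN}{10}$ or $\frac{mN}{100}$) changes. The sketch below is one possible way to code it in Python; the function name `grouped_quantile` and its parameters are hypothetical. It is applied to the monthly-income distribution of 600 families used in the worked example of this unit, under the assumption that the open class "Below 75" runs from 0 to 75 and that all classes have width 75.

```python
def grouped_quantile(lower_limits, width, freqs, k, parts):
    """k-th of `parts` partition values for continuous classes of equal width.

    Implements l + ((k*N/parts - F) / f) * c, where the quantile class is the
    first class whose less-than cumulative frequency reaches k*N/parts.
    """
    if not 0 < k < parts:
        raise ValueError("k must satisfy 0 < k < parts")
    total = sum(freqs)                 # N
    target = k * total / parts         # k*N/parts
    cum_before = 0                     # F: cumulative frequency of previous classes
    for lower, f in zip(lower_limits, freqs):
        if cum_before + f >= target:   # this is the quantile class
            return lower + (target - cum_before) / f * width
        cum_before += f
    raise ValueError("no class reaches the target cumulative frequency")

# Monthly income ('000 Rs) classes; first lower limit 0 is an assumption
lowers = [0, 75, 150, 225, 300, 375, 450]
families = [60, 170, 200, 60, 50, 40, 20]

q1 = grouped_quantile(lowers, 75, families, 1, 4)
med = grouped_quantile(lowers, 75, families, 1, 2)
q3 = grouped_quantile(lowers, 75, families, 3, 4)
p85 = grouped_quantile(lowers, 75, families, 85, 100)
print(round(q1, 1), round(med, 2), round(q3, 1), round(p85, 1))
```

Under these assumptions the interpolated values come out close to the graphical readings reported in that example (about 115, 176, 250 and 330), which is a useful cross-check on values read off an ogive.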
The $j$th octile $O_j$ is the value of $X$ (variable) corresponding to the cumulative frequency just greater than (or equal to) $\frac{j \times N}{8}$, $j = 1, 2, \dots, 7$.
For a continuous frequency distribution, the class corresponding to the cumulative frequency just greater than (or equal to) $\frac{j \times N}{8}$ is called the $j$th octile class, and the octile is given by

$$O_j = l + \left(\frac{\frac{jN}{8} - F_<}{f}\right) \times C, \quad j = 1, 2, \dots, 7$$

Deciles: Deciles are the values which divide the entire data (set of numbers or observations) into ten equal parts. They are 9 in number, namely $D_1, D_2, \dots, D_9$.
The $k$th decile $D_k$ is the value of $X$ (variable) corresponding to the cumulative frequency just greater than (or equal to) $\frac{k \times N}{10}$, $k = 1, 2, \dots, 9$.
For a continuous frequency distribution, the class corresponding to the cumulative frequency just greater than (or equal to) $\frac{k \times N}{10}$ is called the $k$th decile class, and the decile is given by

$$D_k = l + \left(\frac{\frac{kN}{10} - F_<}{f}\right) \times C, \quad k = 1, 2, \dots, 9$$

Percentiles: Percentiles are the values which divide the entire data (set of numbers or observations) into a hundred equal parts. They are 99 in number, namely $P_1, P_2, \dots, P_{99}$.
The $m$th percentile $P_m$ is the value of $X$ (variable) corresponding to the cumulative frequency just greater than (or equal to) $\frac{m \times N}{100}$, $m = 1, 2, \dots, 99$.
For a continuous frequency distribution, the class corresponding to the cumulative frequency just greater than (or equal to) $\frac{m \times N}{100}$ is called the $m$th percentile class, and the percentile is given by

$$P_m = l + \left(\frac{\frac{mN}{100} - F_<}{f}\right) \times C, \quad m = 1, 2, \dots, 99$$

Graphs of Frequency Distributions:
Frequency graphs are designed to reveal the characteristic features of frequency data. Such graphs are more appealing to the eye than tabulated data and are readily perceptible to the mind. The most commonly used graphs are
1) Histogram
2) Frequency Polygon
3) Frequency Curve
4) Ogive curve

Ogives:
1) The ‘less than’ ogive is obtained by plotting the ‘less than’ c.f. against the upper limit of the corresponding class and joining the points so obtained by a smooth freehand curve.
2) The ‘more than’ ogive is obtained by plotting the ‘more than’ c.f. against the lower limit of the corresponding class and joining the points so obtained by a smooth freehand curve.
3) Ogives are particularly useful for the graphical computation of partition values such as the median, quartiles, octiles, deciles and percentiles.
4) The ‘less than’ and ‘more than’ ogives intersect at a point. A perpendicular drawn from their point of intersection to the X-axis gives the value of the median.

Ex: The following table gives the frequency distribution of the monthly income of 600 families in a certain city.

Monthly Income ('000 Rs)   No. of families
Below 75                    60
75-150                     170
150-225                    200
225-300                     60
300-375                     50
375-450                     40
450 and above               20

(a) Draw the less than and more than ogives and determine graphically (i) the Median, Q1, Q3, D5 and P85, and comment on them; (ii) the Quartile Deviation (Q.D).
(b) Draw a Histogram and determine the value of the Mode.

Solution: Let X: monthly income, f: number of families.

Monthly Income ('000 Rs)   No. of families   Cumulative frequency
                                             Less than   More than
Below 75                    60                60          600
75-150                     170               230          540
150-225                    200               430          370
225-300                     60               490          170
300-375                     50               540          110
375-450                     40               580           60
450 and above               20               600           20

(a) From the graph we can say that
(i) Median = Md = 176 (approximately)
First quartile = Q1 = 115 (approximately)
Third quartile = Q3 = 250 (approximately)
5th decile = D5 = 170 (approximately; by definition D5 coincides with the median, so the difference between the two readings reflects graph-reading error)
85th percentile = P85 = 330 (approximately)
(ii) Quartile Deviation = (Q3 - Q1)/2 = (250 - 115)/2 = 67.5

Ex: The following table gives the frequency distribution of the marks of 800 candidates in an examination.

Marks             0-20   20-40   40-60   60-80   80-100
No. of students    50     220     300     170      60

Draw the less than and more than ogives and determine (a) (i) the Median, Q1, Q3, P85, and comment on them.

Solution: Let X: marks, f: number of students.

Marks     No. of students   Cumulative frequency
                            Less than   More than
0-20       50                50          800
20-40     220               270          750
40-60     300               570          530
60-80     170               740          230
80-100     60               800           60

From the graph
(i) Median = Q2 = D5 = O4 = P50 = 49 marks
Q1 = P25 = O2 = 34 marks
P85 = 72 marks
(ii) Q.D. = (Q3 - Q1)/2 = 29 marks
(iii) Results (%) = % of students who pass in the examination = % of students who got marks ≥ 40
Now, no. of students getting marks
