Data Analytics (CS4061D) Descriptive Statistics
Dr. Debasis Samanta
Summary
This document covers a lecture on descriptive statistics in data analytics, examining concepts like data summarization and measurement of location and dispersion. The lecture introduces different measures like mean, median, mode, midrange, and graphical methods like box plots, focusing on various types of data and their analysis. Examples illustrate how data is used, along with defining population and sampling.
Full Transcript
Data Analytics (CS4061D)
Descriptive Statistics
Dr. Debasis Samanta, Associate Professor, Department of Computer Science & Engineering

Today's discussion…
- Introduction
- Data summarization
- Measurement of location: mean, median, mode, midrange, etc.
- Measures of dispersion: range, variance, standard deviation, etc.
- Other measures: MAD, AAD, percentile, IQR, etc.
- Graphical summarization: box plot

TRP: An example
Television rating point (TRP) is a tool provided to judge which programmes are viewed the most. This gives us an index of the choice of the people and also the popularity of a particular channel.
For calculation purposes, a device is attached to the TV sets in a few thousand viewers' houses in different geographic and demographic sectors. The device is called a People's Meter. It reads the time and the programme that a viewer watches on a particular day for a certain period.
An average is taken, for example, over a 30-day period.
The above can further be augmented with a personal interview survey (PIS), which becomes the basis for many studies and decisions.
Essentially, we are to analyze data for TRP estimation.

Defining Data
Definition 3.1: Data
A set of data is a collection of observed values representing one or more characteristics of some objects or units.
Example: For TRP, the data collection consists of the following attributes.
- Age: A viewer's age in years
- Sex: A viewer's gender, coded 1 for male and 0 for female
- Happy: A viewer's general happiness, coded NH for not too happy, PH for pretty happy, VH for very happy
- TVHours: The average number of hours a respondent watched TV during a day

Defining Data

| Viewer# | Age | Sex | Happy | TVHours |
|---|---|---|---|---|
| … | … | … | … | … |
| 55 | 34 | F | VH | 5 |
| … | … | … | … | … |

Note:
- A data set is composed of information from a set of units.
- Information from a unit is known as an observation.
- An observation consists of one or more pieces of information about a unit; these are called variables.

Defining Population
Definition 3.2: Population
A population is a data set representing all the entities of interest.
Example: All TV viewers in the country/world.
Note:
1. The set of all people in the country/world is not the population here, since not everyone is a TV viewer.
2. For different surveys, the population may be completely different.
3. For statistical learning, it is important to define very carefully the population that we intend to study.

Defining Sample
Definition 3.3: Sample
A sample is a data set consisting of a portion of a population.
Example: All students studying in Class XII form a population, whereas those students who belong to a given school form a sample.
Note: Normally a sample is obtained in such a way as to be representative of the population.

Defining Statistics
Definition 3.4: Statistic
A statistic is a quantity calculated from data that describes a particular characteristic of a sample.
Example: The sample mean (denoted by $\bar{y}$) is the arithmetic mean of a variable over all the observations of a sample.

Defining Statistical Inference
Definition 3.5: Statistical inference
Statistical inference is the process of using sample statistics to make decisions about the population.
Example: In the context of TRP
- Overall frequency of the various levels of happiness.
- Is there a relationship between the age of a viewer and his/her general happiness?
- Is there a relationship between the age of the viewer and the number of TV hours watched?

Data Summarization
- To identify the typical characteristics of data (i.e., to have an overall picture).
- To identify which data should be treated as noise or outliers.
The data summarization techniques can be classified into two broad categories:
- Measures of location
- Measures of dispersion

Measurement of location
It is also called measuring the central tendency.
A function of the sample values that summarizes the location information into a single number is known as a measure of location.
The most popular measures of location are
- Mean
- Median
- Mode
- Midrange
These can be measured in three ways
- Distributive measure
- Algebraic measure
- Holistic measure

Distributive measure
It is a measure (i.e., function) that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results in order to arrive at the measure's value for the original (i.e., entire) data set.
Example: sum(), count()

Algebraic measure
It is a measure that can be computed by applying an algebraic function to one or more distributive measures.
Example: average = sum() / count()

Holistic measure
It is a measure that must be computed on the entire data set as a whole.
Example: Calculating the median. What about the mode?

Mean of a sample
The mean of a sample is denoted as $\bar{x}$.
The different kinds of mean in use are:
- Simple mean
- Weighted mean
- Trimmed mean
In the next few slides, we shall learn how to calculate the mean of a sample. We assume that $x_1, x_2, x_3, \ldots, x_n$ are the sample values.

Simple mean of a sample
The simple mean is also called the arithmetic mean or average, abbreviated as AM.
Definition 3.6: Simple mean
If $x_1, x_2, x_3, \ldots, x_n$ are the sample values, the simple mean is defined as
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Weighted mean of a sample
The weighted mean is also called the weighted arithmetic mean or weighted average.
Definition 3.7: Weighted mean
When each sample value $x_i$ is associated with a weight $w_i$, for $i = 1, 2, \ldots, n$, the weighted mean is defined as
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Note: When all weights are equal, the weighted mean reduces to the simple mean.

Trimmed mean of a sample
If there are extreme values (also called outliers) in a sample, then the mean is influenced greatly by those values.
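As a small illustration of Definitions 3.6 and 3.7, and of how a single extreme value shifts the mean, the following sketch may help. The function names and the TV-hours figures are made up for this example and do not come from the lecture.

```python
def simple_mean(values):
    """Simple (arithmetic) mean: (1/n) * sum of x_i, as in Definition 3.6."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i), as in Definition 3.7."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Hypothetical TV hours for five viewers (illustrative data only).
tv_hours = [2, 3, 3, 4, 3]
print(simple_mean(tv_hours))                     # 3.0
print(weighted_mean(tv_hours, [1, 1, 1, 1, 1]))  # equal weights give 3.0 again

# The mean as an algebraic measure: merge per-partition sum() and count().
parts = [tv_hours[:2], tv_hours[2:]]
total = sum(sum(p) for p in parts)
count = sum(len(p) for p in parts)
print(total / count)                             # 3.0

# A single extreme value pulls the mean away from the typical viewer.
with_outlier = tv_hours + [21]
print(simple_mean(with_outlier))                 # 6.0
```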
To offset the effect caused by those extreme values, we can use the concept of the trimmed mean.
Definition 3.8: Trimmed mean
The trimmed mean is defined as the mean obtained after chopping off values at the high and low extremes.

Properties of mean
Lemma 3.1
If $\bar{x}_i$, $i = 1, 2, \ldots, m$, are the means of m samples of sizes $n_1, n_2, \ldots, n_m$ respectively, then the mean of the combined sample is given by
$$\bar{x} = \frac{\sum_{i=1}^{m} n_i \bar{x}_i}{\sum_{i=1}^{m} n_i}$$
(Distributive measure)
Lemma 3.2
If a new observation $x_k$ is added to a sample of size n with mean $\bar{x}$, the new mean is given by
$$\bar{x}' = \frac{n\bar{x} + x_k}{n + 1}$$
Lemma 3.3
If an existing observation $x_k$ is removed from a sample of size n with mean $\bar{x}$, the new mean is given by
$$\bar{x}' = \frac{n\bar{x} - x_k}{n - 1}$$
Lemma 3.4
If m observations with mean $\bar{x}_m$ are added to (removed from) a sample of size n with mean $\bar{x}_n$, then the new mean is given by
$$\bar{x} = \frac{n\bar{x}_n \pm m\bar{x}_m}{n \pm m}$$
Lemma 3.5
If a constant c is subtracted from (or added to) each sample value, then the mean of the transformed variable is linearly displaced by c. That is,
$$\bar{x}' = \bar{x} \mp c$$
Lemma 3.6
If each observation is scaled by multiplying (dividing) it by a non-zero constant c, then the altered mean is given by
$$\bar{x}' = \bar{x} * c$$
where * is the × (multiplication) or ÷ (division) operator.

Mean with grouped data
Sometimes data is given in the form of classes and a frequency for each class.

| Class | $x_1 - x_2$ | $x_2 - x_3$ | … | $x_i - x_{i+1}$ | … | $x_n - x_{n+1}$ |
|---|---|---|---|---|---|---|
| Frequency | $f_1$ | $f_2$ | … | $f_i$ | … | $f_n$ |

There are three methods to calculate the mean of such grouped data.
- Direct method
- Assumed mean method
- Step deviation method

Direct method
$$\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}$$
where, in this formula, $x_i$ = ½ (lower limit + upper limit) of the ith class, i.e., the midpoint of the class limits $x_i$ and $x_{i+1}$ of the table (also called the class mark), and $f_i$ is the frequency of the ith class.
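The direct method can be sketched as follows. This is a minimal illustration, assuming each class is given as a (lower limit, upper limit) pair; the three classes and their frequencies are hypothetical and not taken from the lecture.

```python
def class_marks(classes):
    """Midpoint (class mark) of each (lower, upper) class."""
    return [(lower + upper) / 2 for lower, upper in classes]


def grouped_mean_direct(classes, freqs):
    """Direct method: sum(f_i * x_i) / sum(f_i), with x_i the class mark."""
    marks = class_marks(classes)
    return sum(f * x for f, x in zip(freqs, marks)) / sum(freqs)


# Hypothetical grouped data: three equal-width classes with their frequencies.
classes = [(0, 10), (10, 20), (20, 30)]
freqs = [2, 5, 3]

mean = grouped_mean_direct(classes, freqs)
print(mean)  # 16.0

# Frequency-weighted deviations of the class marks from the mean cancel out.
residual = sum(f * (x - mean) for f, x in zip(freqs, class_marks(classes)))
print(residual)  # 0.0
```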
Note (direct method): $\sum f_i (x_i - \bar{x}) = 0$

Assumed mean method
$$\bar{x} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i}$$
where A is the assumed mean (usually a class mark chosen in the middle of the groups) and $d_i = x_i - A$ for each i.

Step deviation method
$$\bar{x} = A + \left(\frac{\sum_{i=1}^{n} f_i u_i}{\sum_{i=1}^{n} f_i}\right) h$$
where
- A = assumed mean
- h = class size (i.e., $x_{i+1} - x_i$ for the ith class)
- $u_i = \frac{x_i - A}{h}$

Mean for a group of data
For the above methods, we can assume that…
- All classes are equal sized
- Groups are given as exclusive (continuous) classes, i.e., the lower limit of a class is the same as the upper limit of the previous class
Data with inclusive classes: 10 - 19, 20 - 29, 30 - 39, 40 - 49
The same data with exclusive classes: 9.5 - 19.5, 19.5 - 29.5, 29.5 - 39.5, 39.5 - 49.5

Ogive: Graphical method to find the median
Ogive (pronounced O-Jive) is a cumulative frequency polygon graph.
When cumulative frequencies are plotted against the upper (lower) class limit, the plot resembles one side of an Arabesque or ogival architecture, hence the name.
There are two types of Ogive plots
- Less-than (upper class limit vs. cumulative frequency)
- More-than (lower class limit vs. cumulative frequency)
Example: Suppose there is data on the marks obtained by 200 students in an examination
444, 412, 478, 467, 432, 450, 410, 465, 435, 454, 479, …….
(Further, suppose it is observed that the minimum and maximum marks are 410 and 479, respectively.)

Ogive: Cumulative frequency table
Step 1: Draw a cumulative frequency table

| Marks (x) | Conversion into exclusive series | No. of students (f) | Cumulative frequency (C.M) |
|---|---|---|---|
| 410-419 | 409.5-419.5 | 14 | 14 |
| 420-429 | 419.5-429.5 | 20 | 34 |
| 430-439 | 429.5-439.5 | 42 | 76 |
| 440-449 | 439.5-449.5 | 54 | 130 |
| 450-459 | 449.5-459.5 | 45 | 175 |
| 460-469 | 459.5-469.5 | 18 | 193 |
| 470-479 | 469.5-479.5 | 7 | 200 |

Step 2: Less-than Ogive graph

| Upper class limit | Cumulative frequency |
|---|---|
| Less than 419.5 | 14 |
| Less than 429.5 | 34 |
| Less than 439.5 | 76 |
| Less than 449.5 | 130 |
| Less than 459.5 | 175 |
| Less than 469.5 | 193 |
| Less than 479.5 | 200 |

Step 3: More-than Ogive graph

| Lower class limit | Cumulative frequency |
|---|---|
| More than 409.5 | 200 |
| More than 419.5 | 186 |
| More than 429.5 | 166 |
| More than 439.5 | 124 |
| More than 449.5 | 70 |
| More than 459.5 | 25 |
| More than 469.5 | 7 |

Information from Ogive
[Figures: median read from the less-than Ogive; median read from the more-than Ogive]
A relative cumulative frequency of 0.65 for the fourth class, 439.5 - 449.5, means that 65% of all scores are found in this class or below.

Information from Ogive
Less-than and more-than Ogive approach
The crossing point of the two Ogive plots gives the median of the sample, since both cumulative frequencies equal N/2 there.

Some other measures of mean
There are three mean measures of location:
- Arithmetic mean (AM)
- Geometric mean (GM)
- Harmonic mean (HM)

Some other measures of mean
For a sample of two values, S: $x_1, x_2$:
Arithmetic mean (AM)
$$\bar{x} = \frac{x_1 + x_2}{2}, \quad \text{equivalently} \quad \bar{x} - x_1 = x_2 - \bar{x}$$
Geometric mean (GM)
$$\tilde{x} = \sqrt{x_1 \cdot x_2}, \quad \text{equivalently} \quad \frac{x_1}{\tilde{x}} = \frac{\tilde{x}}{x_2}$$
Harmonic mean (HM)
$$\hat{x} = \frac{2}{\frac{1}{x_1} + \frac{1}{x_2}}, \quad \text{equivalently} \quad \frac{2}{\hat{x}} = \frac{1}{x_1} + \frac{1}{x_2}$$

???
- Is there any generalization of the AM ($\bar{x}$), GM ($\tilde{x}$) and HM ($\hat{x}$) calculations for a sample of size ≥ 2?
- In which situation is a particular mean applicable?
- Is there any interrelationship among them?
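For a two-value sample, the three formulas and their equivalent forms can be checked numerically. The sketch below uses a made-up pair of values; the variable names `am`, `gm` and `hm` are chosen only for this illustration.

```python
import math

# Hypothetical two-value sample S: x1, x2 (illustrative numbers only).
x1, x2 = 4.0, 9.0

am = (x1 + x2) / 2           # arithmetic mean
gm = math.sqrt(x1 * x2)      # geometric mean
hm = 2 / (1 / x1 + 1 / x2)   # harmonic mean

print(am, gm, round(hm, 4))

# Each mean sits "in the middle" of x1 and x2 in its own sense:
print(am - x1 == x2 - am)                      # equal differences
print(math.isclose(x1 / gm, gm / x2))          # equal ratios
print(math.isclose(2 / hm, 1 / x1 + 1 / x2))   # equal reciprocal sums
```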

Geometric mean
Definition 3.9: Geometric mean
The geometric mean of n observations (none of which are zero) is defined as
$$\tilde{x} = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$$
where n ≠ 0.
Note
GM is the arithmetic mean in "log space". This is because, alternatively,
$$\log \tilde{x} = \frac{1}{n}\sum_{i=1}^{n} \log x_i$$
This summary measure is meaningful only when all observations are > 0.
- If at least one observation is zero, the product will itself be zero!
- With negative values, the root may not be real.

Harmonic mean
Definition 3.10: Harmonic mean
If all observations are non-zero, the reciprocal of the arithmetic mean of the reciprocals of the observations is known as the harmonic mean.
For ungrouped data
$$\hat{x} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
For grouped data
$$\hat{x} = \frac{\sum_{i=1}^{n} f_i}{\sum_{i=1}^{n} \frac{f_i}{x_i}}$$
where $f_i$ is the frequency of the ith class with $x_i$ as the center value of the ith class.

Significance of different mean calculations
There are two things involved when we consider a sample
- Observation
- Range
Example: Rainfall data

| Rainfall (in mm) | $r_1$ | $r_2$ | … | $r_n$ |
|---|---|---|---|---|
| Days (in number) | $d_1$ | $d_2$ | … | $d_n$ |

Here, rainfall is the observation and day is the range for each element in the sample.
Here, we are to measure the mean "rate of rainfall" as the measure of location.

Significance of different mean calculations
Case 1: The range remains the same for each observation
Example: Having data about the amount of rainfall per week, say.

| Rainfall (in mm) | 35 | 18 | … | 22 |
|---|---|---|---|---|
| Days (in number) | 7 | 7 | … | 7 |

Significance of different mean calculations
Case 2: The ranges are different, but the observation remains the same
Example: The same amount of rainfall in different numbers of days, say.

| Rainfall (in mm) | 50 | 50 | … | 50 |
|---|---|---|---|---|
| Days (in number) | 1 | 2 | … | 7 |

Significance of different mean calculations
Case 3: The ranges are different, as well as the observations
Example: Different amounts of rainfall in different numbers of days, say.

| Rainfall (in mm) | 21 | 34 | … | 18 |
|---|---|---|---|---|
| Days (in number) | 5 | 3 | … | 7 |

Rule of thumb for means
AM: When the range remains the same for each observation
Example: Case 1
$$\bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i$$
HM: When the range is different but each observation is the same
Example: Case 2
$$\hat{r} = \frac{n}{\sum_{i=1}^{n} \frac{1}{r_i}}$$
GM: When the ranges are different as well as the observations
Example: Case 3
$$\tilde{r} = \left(\prod_{i=1}^{n} r_i\right)^{1/n}$$

Rule of thumb for means
The important thing to recognize is that all three means are simply arithmetic means in disguise!
Each mean follows the "additive structure".
Suppose we are given some abstract quantities $\{x_1, x_2, \ldots, x_n\}$.
Each of the three means can be obtained with the following steps
1. Transform each $x_i$ into some $y_i$
2. Take the arithmetic mean of all $y_i$'s
3. Transform the result back to the original scale of measurement

Rule of thumb for means
For the arithmetic mean
- Use the transformation $y_i = x_i$
- Take the arithmetic mean of all $y_i$'s to get $\bar{y}$
- Finally, $\bar{x} = \bar{y}$
For the geometric mean
- Use the transformation $y_i = \log x_i$
- Take the arithmetic mean of all $y_i$'s to get $\bar{y}$
- Finally, $\tilde{x} = e^{\bar{y}}$
For the harmonic mean
- Use the transformation $y_i = \frac{1}{x_i}$
- Take the arithmetic mean of all $y_i$'s to get $\bar{y}$
- Finally, $\hat{x} = \frac{1}{\bar{y}}$

Relationship among means
A simple inequality relates the three means (for positive observations):
AM ≥ GM ≥ HM

Median of a sample
Definition 3.12: Median of a sample
The median of a sample is the middle value when the data are arranged in increasing (or decreasing) order.
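A minimal sketch of this definition follows, assuming the usual convention that an even-sized sample takes the average of its two middle values. The function name `median_of` and the sample values are made up for illustration; the result is compared with `statistics.median` from the Python standard library.

```python
import statistics


def median_of(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [7, 1, 5]        # sorted: 1, 5, 7
even_sample = [7, 1, 5, 3]    # sorted: 1, 3, 5, 7

print(median_of(odd_sample))   # 5
print(median_of(even_sample))  # 4.0
print(median_of(even_sample) == statistics.median(even_sample))  # True
```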
Symbolically, with the values numbered in sorted order,
$$\text{Median} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right) & \text{if } n \text{ is even} \end{cases}$$

Median of a sample
Definition 3.12: Median of grouped data
The median of grouped data is given by
$$\text{Median} = l + \left(\frac{\frac{N}{2} - cf}{f}\right) h$$
where
- l = lower limit of the median class
- h = width of the median class
- f = frequency of the median class
- cf = cumulative frequency of the class preceding the median class
- N = $\sum_{i=1}^{n} f_i$ = the total number of observations, where $f_i$ is the frequency of the ith class and n is the total number of groups
Note
A class is called the median class if its cumulative frequency is just greater than N/2.

Mode of a sample
The mode is defined as the observation which occurs most frequently.
For example, the numbers of wickets obtained by a bowler in 10 test matches are as follows.
1 2 0 3 2 4 1 1 2 2
In other words, the above data can be represented as:

| # of wickets | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| # of matches | 1 | 3 | 4 | 1 | 1 |

Clearly, the mode here is "2".

Mode of grouped data
Definition 3.13: Mode of grouped data
Select the modal class (it is the class with the highest frequency). Then the mode is given by:
$$\text{Mode} = l + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) h$$
where
- h is the class width
- $\Delta_1$ is the difference between the frequency of the modal class and the frequency of the class just before the modal class
- $\Delta_2$ is the difference between the frequency of the modal class and the frequency of the class just after the modal class
- l is the lower boundary of the modal class
Note
If each data value occurs only once, then there is no mode!

Relation between mean, median and mode
A given set of data can be categorized into three categories:
- Symmetric data
- Positively skewed data
- Negatively skewed data
To understand the above three categories, let us consider the following.
Given a set of m objects, where any object can take one of the values $v_1, v_2, \ldots, v_k$.
Then, the frequency of a value $v_i$ is defined as
$$\text{Frequency}(v_i) = \frac{\text{Number of objects with value } v_i}{m}$$
for i = 1, 2, …, k.

Symmetric data
For symmetric data, the mean, median and mode all lie at the same point.

Positively skewed data
Here, the mode occurs at a value smaller than the median.

Negatively skewed data
Here, the mode occurs at a value greater than the median.

Empirical Relation!
There is an empirical relation, valid for moderately skewed data:
Mean − Mode = 3 × (Mean − Median)

Midrange
It is the average of the largest and smallest values in the set.
Steps (trimmed mean)
1. A percentage p between 0 and 100 is specified.
2. The top and bottom (p/2)% of the data are thrown out.
3. The mean is then calculated in the normal way.
Thus, the median is the trimmed mean with p = 100%, while the traditional mean corresponds to p = 0%.
Note
These steps define the trimmed mean (Definition 3.8); the midrange, in contrast, uses only the two extreme values.

Measures of dispersion
Location measures alone are insufficient to understand data.
Another set of commonly used summary statistics for continuous data are those that measure the dispersion.
A dispersion measure quantifies the extent of spread of the observations in a sample.
Some important measures of dispersion are:
- Range
- Variance and Standard Deviation
- Median Absolute Deviation (MAD)
- Absolute Average Deviation (AAD)
- Interquartile Range (IQR)

Measures of dispersion
Example
Suppose there are two samples of fruit juice bottles from two companies A and B. The content of each bottle is measured in litres.

| Sample A | 0.97 | 1.00 | 0.94 | 1.03 | 1.06 |
|---|---|---|---|---|---|
| Sample B | 1.06 | 1.01 | 0.88 | 0.91 | 1.14 |

Both samples have the same mean. However, the bottles from company A have more uniform content than those from company B.
We say that the dispersion (or variability) of the observations from the average is less for sample A than for sample B.
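The claim about the two juice samples can be checked with a few lines. This is only a sketch: the helper `max_deviation` (the largest absolute distance from the mean) is a made-up, informal indicator of spread used here for illustration, not one of the dispersion measures defined in the lecture.

```python
sample_a = [0.97, 1.00, 0.94, 1.03, 1.06]
sample_b = [1.06, 1.01, 0.88, 0.91, 1.14]


def mean(values):
    return sum(values) / len(values)


def max_deviation(values):
    """Largest absolute distance of any observation from the sample mean."""
    m = mean(values)
    return max(abs(x - m) for x in values)


for name, sample in (("A", sample_a), ("B", sample_b)):
    # Rounded for display, since the inputs are decimal fractions.
    print(name, round(mean(sample), 2), round(max_deviation(sample), 2))
# A: mean 1.0, largest deviation 0.06; B: mean 1.0, largest deviation 0.14
```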
The variability in a sample should display how the observations spread out from the average.
In buying juice, a customer should feel more confident buying it from A than from B.

Range of a sample
Definition 3.14: Range of a sample
Let X = $x_1, x_2, x_3, \ldots, x_n$ be n sample values that are arranged in increasing order.
The range R of these samples is then defined as:
R = max(X) − min(X) = $x_n - x_1$
The range identifies the maximum spread, but it can be misleading if most of the values are concentrated in a narrow band while a relatively small number of values are more extreme.
The variance is another measure of dispersion to deal with such a situation.

Variance and Standard Deviation
Definition 3.15: Variance and Standard Deviation
Let X = $\{x_1, x_2, x_3, \ldots, x_n\}$ be the values of a sample of n observations. Then, the variance, denoted as σ², is defined as
$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
where $\bar{x}$ denotes the mean of the sample.
The standard deviation, σ, of the sample is the square root of the variance σ².

Coefficient of variation
Basic properties
- σ measures spread about the mean and should be chosen only when the mean is chosen as the measure of central tendency
- σ = 0 only when there is no spread, that is, when all observations have the same value; otherwise σ > 0
Definition 3.16: Coefficient of variation
A related measure is the coefficient of variation CV, which is defined as follows
$$CV = \frac{\sigma}{\bar{x}} \times 100$$
This gives a ratio measure of spread.

Variance and Standard Deviation
Lemma 3.8
If data are transformed as $x' = \frac{x - a}{c}$, the variance is transformed as
$$\sigma'^2 = \frac{1}{c^2}\sigma^2$$
Proof
The new mean is $\bar{x}' = \frac{\bar{x} - a}{c}$. Hence
$$\begin{aligned}
\sigma'^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i' - \bar{x}')^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i - a}{c} - \frac{\bar{x} - a}{c}\right)^2 \\
&= \frac{1}{c^2 (n-1)}\sum_{i=1}^{n} \left((x_i - a) - (\bar{x} - a)\right)^2 \\
&= \frac{1}{c^2 (n-1)}\sum_{i=1}^{n} (x_i - \bar{x})^2 \\
&= \frac{1}{c^2}\sigma^2
\end{aligned}$$
[PROVED]

Median Absolute Deviation (MAD)
Since the mean can be distorted by outliers, and the variance is computed using the mean, the variance is also sensitive to outliers. To avoid the effect of outliers, there are two more robust measures of dispersion.
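Definitions 3.15 and 3.16 and Lemma 3.8 can be tried on the two juice samples. The sketch below is illustrative only: the function names are made up, the shift a = 1.0 and scale c = 0.01 are arbitrary choices, and the hand-written variance is compared with `statistics.variance`, which also divides by n − 1.

```python
import math
import statistics

sample_a = [0.97, 1.00, 0.94, 1.03, 1.06]
sample_b = [1.06, 1.01, 0.88, 0.91, 1.14]


def sample_variance(values):
    """Variance with (n - 1) in the denominator, as in Definition 3.15."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


def coefficient_of_variation(values):
    """CV = (standard deviation / mean) * 100, as in Definition 3.16."""
    m = sum(values) / len(values)
    return math.sqrt(sample_variance(values)) / m * 100


for name, sample in (("A", sample_a), ("B", sample_b)):
    var = sample_variance(sample)
    print(name, round(var, 5), round(math.sqrt(var), 4),
          round(coefficient_of_variation(sample), 2))

# Agreement with the standard library's sample variance.
print(math.isclose(sample_variance(sample_a), statistics.variance(sample_a)))

# Lemma 3.8: x' = (x - a) / c divides the variance by c**2 (a, c chosen arbitrarily).
a, c = 1.0, 0.01
transformed = [(x - a) / c for x in sample_a]
print(math.isclose(sample_variance(transformed), sample_variance(sample_a) / c ** 2))
```

Both the variance and the standard deviation are built on the mean, which is why the robust alternatives mentioned above are introduced next.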
The two robust measures are:
Median Absolute Deviation (MAD)
$$\text{MAD}(X) = \text{median}\left(|x_1 - \bar{x}|, \ldots, |x_n - \bar{x}|\right)$$
Absolute Average Deviation (AAD)
$$\text{AAD}(X) = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$$
where X = $\{x_1, x_2, \ldots, x_n\}$ are the sample values of n observations.

Interquartile Range
Like MAD and AAD, there is another robust measure of dispersion, called the interquartile range and denoted as IQR.
To understand IQR, let us first define percentile and quartile.
Percentile
The percentile of a set of ordered data can be defined as follows:
- Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile $x_p$ is a value of x such that p% of the observed values of x are less than $x_p$.
- Example: The 50th percentile is the value $x_{50\%}$ such that 50% of all values of x are less than $x_{50\%}$.
Note: The median is the 50th percentile.

Interquartile Range
Quartile
The most commonly used percentiles are quartiles.
- The first quartile, denoted by $Q_1$, is the 25th percentile.
- The third quartile, denoted by $Q_3$, is the 75th percentile.
- The median, $Q_2$, is the 50th percentile.
The quartiles, including the median, give some indication of the center, spread and shape of a distribution.
The distance between $Q_1$ and $Q_3$ is a simple measure of spread that gives the range covered by the middle half of the data.
This distance is called the interquartile range (IQR) and is defined as
IQR = $Q_3 - Q_1$

Application of IQR
Outlier detection using the five-number summary
A common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5 × IQR above $Q_3$ or below $Q_1$.
In other words, observations lying within 1.5 × IQR of the quartiles are not treated as outliers.

Application of IQR
Five-Number Summary
Since $Q_1$, $Q_2$ and $Q_3$ together contain no information about the endpoints of the data, a more complete summary of the shape of a distribution can be obtained by providing the lowest and highest data values as well.
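The robust measures above (MAD, AAD, quartiles, IQR and the 1.5 × IQR rule), together with the lowest and highest values, can be sketched as follows. The lecture does not fix an interpolation rule for quartiles, so the use of `statistics.quantiles` (Python 3.8 or later) with `method="inclusive"` is an assumption of this sketch; the function names and the TV-hours data are made up.

```python
import statistics


def mad(values):
    """Median of the absolute deviations from the mean, as defined above."""
    m = sum(values) / len(values)
    return statistics.median(abs(x - m) for x in values)


def aad(values):
    """Average of the absolute deviations from the mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)


def five_numbers(values):
    """Minimum, Q1, median (Q2), Q3, maximum (quartile rule is an assumption)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def suspected_outliers(values):
    """Values at least 1.5 * IQR above Q3 or below Q1."""
    _, q1, _, q3, _ = five_numbers(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x <= low or x >= high]


# Hypothetical TV hours with one extreme viewer (illustrative data only).
hours = [2, 3, 3, 4, 4, 5, 5, 6, 21]

print(five_numbers(hours))        # (2, 3.0, 4.0, 5.0, 21)
print(suspected_outliers(hours))  # [21]
print(round(mad(hours), 3), round(aad(hours), 3))
```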
The three quartiles together with the lowest and highest values are known as the five-number summary.
The five-number summary of a distribution consists of:
- The median $Q_2$
- The first quartile $Q_1$
- The third quartile $Q_3$
- The smallest observation
- The largest observation
Written in order, these give the five-number summary:
Minimum, $Q_1$, Median ($Q_2$), $Q_3$, Maximum

Box plot
Graphical view of the five-number summary
[Figure: box plot marked, from top to bottom, with Maximum, Q3, Median, Q1, Minimum]

Reference
The detailed material related to this lecture can be found in
Probability and Statistics for Engineers and Scientists (8th Ed.) by Ronald E. Walpole, Sharon L. Myers, Keying Ye (Pearson), 2013.

Questions of the day…
1. Which of the following measures of central tendency can be computed as a distributive, algebraic or holistic measure?
   - mean
   - median
   - mode
   Which measure may be faster to compute than the others? Why?
2. Give three situations where AM, GM and HM, respectively, are the right measure of central tendency.
3. Given a sample of data, how do we decide whether it is
   a) Symmetric?
   b) Skew-symmetric (positive or negative)?
   c) Uniformly increasing (or decreasing)?
   d) In-variate?
4. How will the box plots look for the following types of samples?
   a) Symmetric
   b) Positively skew-symmetric
   c) Negatively skew-symmetric
   d) In-variate
5. Draw the curves for the following types of distributions and clearly mark the likely locations of mean, median and mode in each of them.
   a) Symmetric
   b) Positively skew-symmetric
   c) Negatively skew-symmetric
6. The variance σ² of a sample X = $\{x_1, x_2, x_3, \ldots, x_n\}$ of n data is defined as follows.
   $$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
   where $\bar{x}$ denotes the mean of the sample.
   Why is (n − 1) in the denominator instead of n?
7. What are the degrees of freedom in each of the following cases?
   a) A sample with a single data value
   b) A sample with n data values
   c) A sample of tabular data with n rows and m columns