Descriptive Statistics PDF

Summary

This document introduces descriptive statistics. It covers populations, individuals and samples, the different types of variables, frequency tables, graphical summaries, and numerical summaries (central tendency, position, dispersion and shape), with basic examples for a better understanding.

Full Transcript

Chapter 1. Descriptive Statistics

The purpose of descriptive statistics is to provide tools for the study of quantifiable phenomena. For example: to provide information on the average weight of a group of people, or to get an idea of the level of 2CP students in mathematics.

1) Introductory and basic concepts:

- The set on which the observations are based is called the population.
- An element of the population is called an individual.
- A sample is a representative (well-chosen) subset of the population (see the course on survey techniques).

For example,
- The students of ESI enrolled in the current year. The population is all students of ESI enrolled in the current year. The individual is a "student of ESI enrolled in the current year".
- Cars registered this year in Algeria. The population is all cars registered this year in Algeria. The individual is a "car".
- Job applications received by a company. The individual is a "job application".

To describe a population, we identify its individuals and group those having the same characteristic into subsets. These characteristics are the different possible situations of a mathematical entity called a variable (because it varies, as the following examples show). The concept of a variable is of central practical importance to descriptive statistics and to what follows in our PRST course.

For example,
- The population of Algeria may be described by such variables as: age, gender, nationality, number of children, etc.
- Car production may be described by: color, value, engine power, number of seats, etc.

Possible situations of some variables:
- The variable "age" can take an infinite number of positive values between any two of its possible values (we will later call it a continuous variable).
- The variable "number of children" can take only a finite number of positive values between any two of its possible values (we will later call it a discrete variable).
- The variable "gender" has two possibilities: female and male (2 non-quantifiable characteristics; we will later call it a categorical variable).
- The variable "marital status" has four possibilities: married, single, widowed and divorced (4 non-quantifiable characteristics; it is also a categorical variable).
- The variable "color" has many possibilities; after consultation with the people requesting the statistical study, one can retain only a certain number of the main possibilities, for example: red, blue, green, white, other color. There are then five possibilities, and the variable "color" is also a categorical variable.

Remark 1: The different situations of a variable considered in a statistical study must be incompatible (mutually exclusive), exhaustive and unambiguous.

Remark 2: We write $X$ (or $Y$, $Z$, ...) for the variable and $x_i$ (or $y_i$, $z_i$, ...) for its value in the $i$-th possible situation.

Different types of variables:

Categorical variables (also called qualitative variables): [Despite its practical importance, this case will not be the subject of our course.] A categorical variable refers to a characteristic that cannot be quantified, such as gender, behavior, nationality, etc. Categorical variables can be either nominal or ordinal.
- Nominal variables: A nominal variable describes a name, label or category without natural order, such as gender or nationality.
- Ordinal variables: An ordinal variable is one whose values are defined by an order relation between the different categories, such as the variable "behavior": {"excellent", "very good", "good", ...}.

Numerical variables (also called quantitative variables): A numerical variable is a quantifiable characteristic whose values are numbers, such as age, number of children, etc. Numerical variables may be continuous or discrete.
- Discrete variables: A discrete variable can assume only a finite number of real values within a given interval.
Then, for a discrete variable, the observations $x_1, x_2, x_3, \ldots$ are isolated numbers. For example, the values of the variable "number of children" may be: 0, 1, 2, 3, 4, 5 and more.
- Continuous variables: A variable is said to be continuous if it can assume an infinite number of real values, such as age, weight, height, etc. The possible values $x_1, x_2, x_3, \ldots$ of the variable $X$ are then presented as half-open intervals $[e_1, e_2[,\ [e_2, e_3[,\ [e_3, e_4[,\ \ldots$, taking into account, of course, Remark 1 above. We call these intervals class intervals or bins.
  - $c_i = \dfrac{e_i + e_{i+1}}{2}$ is called the class midpoint, or the midpoint of the class interval $[e_i, e_{i+1}[$.
  - $a_i = e_{i+1} - e_i$, the difference between the upper endpoint and the lower endpoint of $[e_i, e_{i+1}[$, is called the $i$-th class width.

Remark 3: A discrete variable can take so many values that they have to be grouped into classes; it can then be treated as continuous.

Presentation of data:

Data list: This is the first format for presenting data sets. It is an explicit listing of all measurements (the observations of interest to us) made on all individuals in the sample (or in the population, if that is possible).

Data frequency table: This is a table listing each distinct value $x_i$ in the first column and its frequency $n_i$ in the second column.
- $n_i$ is the number of times the value $x_i$ appears in the data set.
- $\sum_i n_i = n$ is called the sample size (or the population size if the entire population is considered in the study).
- In the data frequency table, the $x_i$ are presented in the first column and the $n_i$ in the second column, not forgetting a title that specifies the population, the variable and any other information deemed useful.
- Instead of $n_i$, we can put in the second column of the data frequency table the relative frequencies $f_i$ or the percentages $f_i\%$, where
\[ f_i = \frac{n_i}{n} \quad\text{and}\quad f_i\% = f_i \times 100 = \frac{n_i}{n} \times 100, \]
  with $\sum_i f_i = 1$ and $\sum_i f_i\% = 100$.
- The set of pairs $(x_i, n_i)_{i=1,\ldots,k}$, $(x_i, f_i)_{i=1,\ldots,k}$ or $(x_i, f_i\%)_{i=1,\ldots,k}$, where $k$ is the number of possible values of the variable $X$, is called the frequency distribution.

Example 1:

(a) Example of a case: data list (see exercises 2 and 3 from TD).

(b) Example of a case: discrete quantitative data. Distribution of households by number of children ($X$)

x_i | 0 | 1 | 2 | 3 | 4 | 5 | Total
n_i | 50 | 60 | 40 | 20 | 5 | 5 | 180 = n

(c) Example of a case: continuous quantitative data. Distribution of taxis by the number of kilometres travelled ($X$)

x_i | [10, 20[ | [20, 30[ | [30, 40[ | [40, 50[ | [50, 60[ | [60, 70[ | Total
n_i | 9 | 13 | 22 | 10 | 7 | 4 | 65

(d) Example of a case: categorical data. Distribution of company staff by marital status ($X$)

x_i | Married | Single | Widowed | Divorced | Total
n_i | 35 | 75 | 4 | 6 | 120

Remark 4: [Objectives of the PRST course and some more useful general vocabulary] Collecting data on all individuals in a large population is usually not realistic. For example, to calculate the average weight of the children of Algeria, the population can contain hundreds of thousands of children. The best we can do is to estimate the average by looking at a subset of the population of interest; this subset, called a sample, is often chosen by simple random sampling from the population (see the course on survey techniques). When we conclude that the average weight calculated in the sample approximates the average weight of all children of Algeria, we say that we have drawn an inference about the population based on information obtained from the sample.
Throughout this procedure (drawing inference about the population from a sample), a specific vocabulary is generally used. Some of these terms have already been presented; they are restated below as general definitions, together with others.

- Population: a population is any specific collection of subjects of interest (all objects of interest).
- Sample: a sample is any subset or subcollection of the population (the objects examined), including the case where the sample consists of the whole population.
- Measurement: a measurement is a number or attribute computed for each member of a population or a sample.
- Sample data: the measurements of the sample elements, taken together.
- Parameter: a parameter is a number that summarizes some aspect of the population as a whole.
- Statistic: a statistic is a number computed from the sample data.
- Statistics: a collection of methods for collecting, displaying, analyzing and drawing conclusions from data.
- Descriptive statistics: the branch of statistics that involves organizing, displaying and describing data.
- Inferential statistics: the branch of statistics that involves drawing conclusions about a population based on information contained in a sample taken from that population.
- Qualitative data: measurements for which there is no natural numerical scale, but which consist of attributes, labels or other non-numerical characteristics.
- Quantitative data: numerical measurements that arise from a natural numerical scale.

2) Statistical summaries:

From now on, the frequency distribution $(x_i, n_i)_{i=1,\ldots,k}$ is assumed to be given as a list or table.

Remark 5: In this introductory chapter of the PRST course, we are only interested in the case where a single numerical variable (discrete or continuous) describes the population. The case of two or more variables will be treated theoretically in chapter 6 of the same course.

2.1) Graphical summaries:

The aim is to visually synthesise the information contained in the frequency distribution.

2.1.1) The bar graph: used for a discrete variable. The values $x_i$ are arranged in order on the horizontal axis, and for each $x_i$ a vertical bar of length $n_i$ is drawn.

2.1.2) Histogram: used for a continuous variable. It divides the range of possible values in a data set into classes. For each class $[e_i, e_{i+1}[$, a rectangle is constructed with a base of length $a_i = e_{i+1} - e_i$, equal to the width of that class, and a height equal to the number $n_i$ of observations falling into that class.

Remark 6: In contrast to the bar graph, where each value $x_i$ is given a vertical segment of length proportional to $n_i$, in a histogram each class is given the area of a rectangle with the class width $a_i$ as its base and $n_i$ as its height.

Remark 7: The histogram is the area formed by all the rectangles. In relative frequency, this area is equal to 1.

Following Remark 7, we consider two cases when drawing a histogram:
1. The class widths $a_i$ are all equal: the height of the $i$-th rectangle is $n_i$.
2. The class widths $a_i$ are not all equal: to keep the areas proportional to the frequencies, the heights of the rectangles must be corrected as follows. A unit width $a$ is chosen (in practice often, but not always, the smallest of the $a_i$, or their greatest common divisor, or sometimes even the value 1). The height $h_i$ of the $i$-th class (called the corrected frequency) is calculated by
\[ h_i = n_i \cdot \frac{a}{a_i}. \]
We present both columns $(a_i)$ and $(h_i)$ in the same data frequency table, then we draw the histogram as in case 1, with $h_i$ in place of $n_i$.
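As a minimal sketch of case 2, the snippet below computes the class widths $a_i$ and the corrected frequencies $h_i = n_i \cdot a / a_i$. The class boundaries and counts are illustrative (they coincide with the unequal-width taxi table of Example 2, case (e)), the unit width is assumed to be the smallest class width, and the function name `corrected_frequencies` is hypothetical.

```python
def corrected_frequencies(edges, counts, unit=None):
    """Return class widths a_i and corrected heights h_i = n_i * a / a_i."""
    widths = [upper - lower for lower, upper in zip(edges[:-1], edges[1:])]
    if len(widths) != len(counts):
        raise ValueError("need exactly one count per class")
    # Assumption: the unit width a defaults to the smallest class width.
    a = min(widths) if unit is None else unit
    heights = [n * a / w for n, w in zip(counts, widths)]
    return widths, heights


# Illustrative classes [10,20[, [20,30[, [30,40[, [40,50[, [50,70[ and counts.
edges = [10, 20, 30, 40, 50, 70]
counts = [9, 13, 22, 10, 11]
widths, heights = corrected_frequencies(edges, counts)

for lower, upper, n, w, h in zip(edges[:-1], edges[1:], counts, widths, heights):
    print(f"[{lower}, {upper}[  n_i={n}  a_i={w}  h_i={h}")

# With corrected heights, the total area is a * n (unit width times sample size).
total_area = sum(h * w for h, w in zip(heights, widths))
```

Only the last class, which is twice as wide as the others, has its height changed (11 becomes 5.5), so that its rectangle keeps an area proportional to its 11 observations.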
2.1.3) The frequency polygon and the frequency curve: To obtain a graphical representation that is less heavy to read, we can plot:
- The frequency polygon: it is obtained by joining the midpoints of the tops of the rectangles of the histogram, treating every class as if its width were equal to the unit width $a$ (this matters, of course, only if the widths are not equal). Two false classes of zero frequency are added at both ends of the frequency distribution in order to keep the area equal to 1 in relative frequency. (When the variable is discrete, we join the tops of the vertical segments of the bar graph.)
- The frequency curve: it is the smooth graphical fit of the frequency polygon. It represents an estimate of the probability function that $X$ is assumed to follow (the random variable $X$; see chapters 3, 4, 5, 6). The frequency curve looks almost like the frequency polygon, except that the polygon joins the midpoints of the rectangle tops with straight segments, whereas the frequency curve is smooth.

Example 2: For each of cases (b) and (c) in Example 1, draw the appropriate graphs, and do the same for the following case.

Case (e): Distribution of taxis by the number of kilometres travelled ($X$)

x_i | [10, 20[ | [20, 30[ | [30, 40[ | [40, 50[ | [50, 70[ | Total
n_i | 9 | 13 | 22 | 10 | 11 | 65

Remark 8: Some possible descriptions of histograms:
- Symmetric
- Skewed (asymmetric, long tail to one side):
  - right tail stretched out: positive skew
  - left tail stretched out: negative skew
- Unimodal (one peak)
- Bimodal (two peaks)
- Bell-shaped
- Uniformly distributed (flat)

Remark 9: If we use $f_i$ (or percentages) instead of $n_i$, it is useful to specify this in the title, for example: relative frequency (or percentage) histogram and relative frequency (or percentage) curve.
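The construction of the frequency polygon described in 2.1.3 can be sketched as follows for the equal-width taxi distribution of Example 1, case (c). This is an illustration only: the helper names `polygon_vertices` and `area_under` are hypothetical, and the two added end classes are assumed to have the same width as the real classes and a frequency of zero.

```python
def polygon_vertices(edges, heights):
    """Vertices (class midpoint, height) of a frequency polygon, equal widths only."""
    width = edges[1] - edges[0]
    if any(upper - lower != width for lower, upper in zip(edges[:-1], edges[1:])):
        raise ValueError("this sketch assumes equal class widths")
    midpoints = [(lower + upper) / 2 for lower, upper in zip(edges[:-1], edges[1:])]
    # Two empty "false" classes bring the polygon back down to the x-axis.
    first = (edges[0] - width / 2, 0)
    last = (edges[-1] + width / 2, 0)
    return [first] + list(zip(midpoints, heights)) + [last]


def area_under(vertices):
    """Area between the polygon and the x-axis (trapezoid rule on the segments)."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(vertices[:-1], vertices[1:]))


edges = [10, 20, 30, 40, 50, 60, 70]
counts = [9, 13, 22, 10, 7, 4]
vertices = polygon_vertices(edges, counts)

histogram_area = sum(n * (upper - lower)
                     for n, lower, upper in zip(counts, edges[:-1], edges[1:]))
polygon_area = area_under(vertices)
```

In this example the polygon and the histogram enclose the same area, which is the reason given in the text for adding the two false classes.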
2.1.4) The cumulative frequency curve: It is plotted to answer questions such as "how many individuals have $X$ less than ...?" or "how many individuals have $X$ greater than ...?". For example: "how many households have fewer than 2 children?" or "more than 4 children?", "how many taxis have travelled less than 30000 km?" or "how many taxis have travelled more than 32650 km?". Note that the value 32650, unlike the others, is not an endpoint of a class interval and therefore does not appear in the frequency table; we will see that the cumulative frequency curve, unlike the frequency table, can give us an approximate answer (by projection onto the x-axis).

To construct the cumulative frequency curve, the relative frequencies $f_i$ (or $n_i$ or $f_i\%$, respectively) are first summed in a new column of the frequency table, in one of the following two ways.

Less-than cumulative frequencies (or increasing cumulative frequencies):
- They are obtained by summing from top to bottom.
- The values increase when read from top to bottom.
- They are denoted $f_i^{cc}$ or $F_i$ (and $n_i^{cc}$ or $N_i$ when working with the $n_i$).
- They correspond to the notion "less than".

More-than (greater-than) cumulative frequencies (or decreasing cumulative frequencies):
- They are obtained by summing from bottom to top.
- The values decrease when read from top to bottom.
- They are denoted $f_i^{cd}$ (and $n_i^{cd}$ when working with the $n_i$).
- They correspond to the notion "more than" or "greater than".

The cumulative frequency graph can be plotted in two ways:

1) Cumulative frequency curve of the less-than type; this is the one simply called the cumulative frequency curve. It is the curve representing the increasing cumulative frequencies. Each point on the graph has coordinates $(x_i, F_i)$, $(x_i, N_i)$ or $(x_i, F_i\%)$, depending on whether we worked with $(x_i, f_i)_{i=1,\ldots,k}$, $(x_i, n_i)_{i=1,\ldots,k}$ or $(x_i, f_i\%)_{i=1,\ldots,k}$.
2) Cumulative frequency curve of the more-than type: it is the curve representing the decreasing cumulative frequencies.

Remark:
1) The function represented by the frequency curve will be denoted $f_X$ (in the continuous case) and $P_X$ (in the discrete case); it is the probability function of the random variable $X$. The function represented by the cumulative frequency curve, after smoothing, will be denoted $F_X$ and called the distribution function of the random variable $X$.
2) We will express it mathematically (see chapter 4) as
\[ F(x) = \sum_{h=1}^{i} f_h, \quad \text{where } x \text{ is such that } x_i \le x < x_{i+1}, \]
with $F(-\infty) = 0$ and $F(+\infty) = 1$.
3) In the particular case of a continuous variable, we will see in chapter 4 that
\[ F(x) = \int_{0}^{x} f(t)\,dt. \]
We interpret $F(x)$ as the area to the left of the value $x$ in the histogram.

Example 3: For each of cases (b), (c) and (e) given above, draw the cumulative frequency curve and the cumulative frequency curve of the more-than type.

2.2) Numerical summaries:

After summarising the frequency distribution in the form of a data frequency table and the associated graphical summaries, the following parts of this chapter focus on some statistical characteristics called indicators of the frequency distribution, or simply numerical summaries. These indicators characterize the frequency distribution by numerical values describing:
- The value of the variable at the "centre" of the frequency distribution: the central tendency.
- A position indicator linked to a given rank.
- The variation in the values.
- The shape of the frequency distribution.

The indicators we will define are those that satisfy as many as possible of Yule's six conditions, since no statistical indicator satisfies all six simultaneously.

Yule's conditions: A statistical indicator should be a typical value that is:
1. Defined objectively and therefore independent of the observer,
2. Dependent on all observations,
3. Of concrete significance, so that it can be understood by non-specialists,
4. Simple to calculate,
5. Not very sensitive to "sampling fluctuations",
6. Easily amenable to mathematical operations.

2.2.1) Central tendency indicators:

2.2.1.1) The mode ($M_o$ or $M_o(X)$): The mode of a frequency distribution is the value of the variable that corresponds to the highest frequency.

Example: For the data list {5, 6, 7, 7, 8, 8, 8, 9, 11}, $M_o = 8$ (the most frequent, or dominant, value).

Calculation of the mode:
- Discrete case:
  - From the frequency table of $(x_i, f_i)_{i=1,\ldots,k}$, $(x_i, n_i)_{i=1,\ldots,k}$ or $(x_i, f_i\%)_{i=1,\ldots,k}$: the mode is the $x_i$ for which $f_i$, $n_i$ or $f_i\%$ is highest.
  - From the bar graph of the same distribution: the mode is the $x_i$ corresponding to the tallest bar.
- Continuous case (we define the modal class): the graph is the histogram. The modal class is the class in the frequency table or histogram that corresponds to the maximum frequency, after correction of the frequencies when the class widths are unequal.

Remark: In the discrete case, the mode satisfies Yule's conditions 1, 3, 4 and 5. In the continuous case, the modal class is not precisely determined, because it depends on the chosen division into classes; it is because of this imprecision that the mode is used less than the two other indicators defined below.

2.2.1.2) The median ($M_e$ or $M_e(X)$): This is the value of the variable that divides the frequency distribution into two subsets of equal size, the individuals being arranged in order.

Example: For the data list {12, 28, 6, 3, 32, 15, 21}, the ordered data list is 3, 6, 12, 15, 21, 28, 32, and $M_e = 15$.
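The two small examples above (mode and median of a data list) can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are illustrative, and the median is taken as the middle observation of the ordered list, which is valid here because the number of observations is odd.

```python
from collections import Counter
import statistics

# Mode: build the frequency table x_i -> n_i and keep the most frequent value.
data = [5, 6, 7, 7, 8, 8, 8, 9, 11]
table = Counter(data)
mode_value, mode_count = table.most_common(1)[0]

# Median: order the observations and take the middle one (n is odd here).
values = [12, 28, 6, 3, 32, 15, 21]
ordered = sorted(values)
n = len(ordered)
median_value = ordered[(n + 1) // 2 - 1]

print(mode_value, mode_count)   # value with the highest frequency, and that frequency
print(ordered, median_value)

# Cross-check against the standard library implementations.
assert statistics.mode(data) == mode_value
assert statistics.median(values) == median_value
```

The `Counter` object plays the role of the data frequency table, so the same code also shows how the table is obtained from a raw data list.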
Calculation of the median:
- Discrete variable:
  - If the number of observations $n$ (in the data list) is odd, as in the previous example, $M_e$ is the $\left(\frac{n+1}{2}\right)$-th observation of the ordered data list.
  - If the number of observations $n$ is even, we define a median interval: with $n = 2k$, $M_e$ is often approximated by the average of the $k$-th and $(k+1)$-th observations of the ordered data list.
    Example: For the ordered data list {3, 6, 12, 15, 21, 28, 32, 38}, the median interval is [15, 21[, and $M_e$ is approximated by $(15 + 21)/2 = 18$.
  - In the case of grouped data, i.e. $(x_i, f_i)_{i=1,\ldots,k}$, $(x_i, n_i)_{i=1,\ldots,k}$ or $(x_i, f_i\%)_{i=1,\ldots,k}$, the median is read from the column of cumulative frequencies in the frequency table or from the y-axis of the cumulative frequency curve. The value 0.5 is located in that column or on the curve. If the value 0.5 falls between two rows of the frequency table, the median is the value corresponding to the lower of these two rows, i.e. the first value whose cumulative frequency exceeds 0.5 (paying attention, of course, to whether the table shows $F_i$, $N_i$ or $F_i\%$, and using $n/2$ or 50 in place of 0.5 accordingly). Otherwise, i.e. if the value 0.5 itself appears in the column, $M_e$ is an exact value.
- Continuous variable:
  - Determination of the median class: 0.5 is located in the column of cumulative frequencies or on the y-axis of the cumulative frequency curve. If the value 0.5 corresponds to an endpoint of a class, the median is an exact value. If the value 0.5 falls between the two endpoints of a class, we have a median class, and the median is approximated by linear interpolation.
  - Approximation by the linear interpolation method:
\[ M_e = e_i^{(M_e)} + a_i^{(M_e)} \, \frac{0.5 - F_i^{(M_e)}}{f_i^{(M_e)}} \]
    where $[e_i^{(M_e)}, e_{i+1}^{(M_e)}[$ is the median class, $a_i^{(M_e)}$ is its width, $f_i^{(M_e)}$ is its relative frequency, and $F_i^{(M_e)}$ is the cumulative relative frequency reached at its lower endpoint $e_i^{(M_e)}$ (i.e. the cumulative relative frequency of all the classes preceding the median class).

Remark 7: The above formula is written according to the frequencies used:
\[ M_e = e_i^{(M_e)} + \frac{a_i^{(M_e)}}{n_i^{(M_e)}} \left( \frac{n}{2} - N_i^{(M_e)} \right) \quad\text{or}\quad M_e = e_i^{(M_e)} + \frac{a_i^{(M_e)}}{(f_i\%)^{(M_e)}} \left( 50 - (F_i\%)^{(M_e)} \right). \]

Remark 8: The median satisfies Yule's conditions 1, 3, 4 and 5. It is insensitive to extreme values: it does not change even if the observations on either side of it become very high or very low.

2.2.1.3) The arithmetic mean ($\bar{X}$ or $\bar{x}$):

- For a data list: for a data list of size $n$, where each value $x_i$ corresponds to a single observation, the mean is
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \]
  Example: The marks (out of 20) of eight students are 3, 5, 7, 9, 10, 11, 12 and 18. Then
\[ \bar{x} = \tfrac{1}{8}(3 + 5 + 7 + 9 + 10 + 11 + 12 + 18) = \tfrac{75}{8} = 9.375. \]
- For a frequency distribution $(x_i, n_i)_{i=1,\ldots,k}$: each value $x_i$ has $n_i$ observations, so
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{k} n_i x_i \quad\text{with}\quad n = \sum_{i=1}^{k} n_i. \]
  Example: The data list 3, 3, 3, 5, 9, 9, 11, 11 can be summarized as follows:

  x_i | 3 | 5 | 9 | 11 | Total
  n_i | 3 | 1 | 2 | 2 | 8

\[ \bar{x} = \tfrac{1}{8}\left[(3 \times 3) + (5 \times 1) + (9 \times 2) + (11 \times 2)\right] = \tfrac{54}{8} = 6.75. \]
  Remark 9: Since $f_i = \dfrac{n_i}{n}$, we also write $\bar{x} = \sum_{i=1}^{k} f_i x_i$.
- Calculation of $\bar{x}$:
  - Discrete variable: add the column "$x_i \times n_i$" to the table of $(x_i, n_i)_{i=1,\ldots,k}$, sum this column, and divide by $n$.
  - Continuous variable: $x_i$ represents the class interval $[e_i, e_{i+1}[$; we replace this class interval by its midpoint $c_i = \dfrac{e_i + e_{i+1}}{2}$ and calculate as in the discrete case.
- Properties of the arithmetic mean. It is easy to show that:
  - The arithmetic mean satisfies all of Yule's conditions except the fifth: an observation that is too high or too low can strongly influence the arithmetic mean.
  - $\sum_{i=1}^{k} f_i (x_i - \bar{x}) = 0$, i.e. $\overline{X - \bar{x}} = 0$.
  - $\sum_{i=1}^{k} f_i (x_i - a)^2 = \overline{(X - a)^2}$ is minimal for $a = \bar{x}$.
  - If $X = aX' + b$ with $(a, b) \in \mathbb{R}^* \times \mathbb{R}$, then $\bar{X} = a\,\overline{X'} + b$ (widely used in calculations).
  - Let $P$ be a population of size $n$ divided into two sub-populations $P_1$ and $P_2$ of sizes $N_1$ and $N_2$ respectively. Let $X$ be a variable on $P$, and let $\bar{x}$, $\bar{x}_1$ and $\bar{x}_2$ be the means observed on $P$, $P_1$ and $P_2$ respectively. Then
\[ \bar{x} = \frac{1}{n}\left(N_1 \bar{x}_1 + N_2 \bar{x}_2\right) = \frac{1}{n} \sum_{i=1}^{2} N_i \bar{x}_i. \]
    This can be generalised to a partition into $r$ sub-populations ($r \ge 2$):
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{r} N_i \bar{x}_i. \]

Comparison of mean, median and mode: In the case of unimodal distributions, the median frequently lies between the mean and the mode, and closer to the mean than to the mode. The three figures below show this, together with the conclusion we can draw from each about the shape of the frequency curve.

[Figure: three frequency polygons.
- Symmetric: $M_o = M_e = \bar{x}$.
- Positively skewed (right skewed): $M_o < M_e < \bar{x}$.
- Negatively skewed (left skewed): $M_o > M_e > \bar{x}$.]

The geometric mean and the harmonic mean:

- The geometric mean $G$: this is the average applied to measurements of quantities that grow geometrically or exponentially. Its formula is given by:
  - simple case:
\[ G = \left( \prod_{i=1}^{n} x_i \right)^{1/n}, \quad\text{i.e.}\quad \log G = \frac{1}{n} \sum_{i=1}^{n} \log(x_i); \]
  - weighted case:
\[ G = \left( \prod_{i=1}^{k} x_i^{\,n_i} \right)^{1/n}, \quad\text{i.e.}\quad \log G = \frac{1}{n} \sum_{i=1}^{k} n_i \log(x_i) = \sum_{i=1}^{k} f_i \log(x_i) = \overline{\log X}. \]
  So, with the change of variable $Y = \log(X)$, we calculate $\bar{y}$ (adding a column for $Y$ to the statistical table, as we did for the affine change of variable) and then deduce the value of $G$ from the relation $\log G = \bar{y}$.
- The harmonic mean $H$: this is used when a real meaning can be assigned to the inverses of the data, as with speeds (m/s, km/h, ...).
Its formula is given by:
  - simple case:
\[ H = \frac{n}{\sum_{i=1}^{n} \dfrac{1}{x_i}} = \frac{1}{\overline{1/X}}; \]
  - weighted case:
\[ H = \frac{n}{\sum_{i=1}^{k} \dfrac{n_i}{x_i}} = \frac{1}{\sum_{i=1}^{k} \dfrac{f_i}{x_i}} = \frac{1}{\overline{1/X}}. \]
  In the same way as for $G$, with the change of variable $Y = \dfrac{1}{X}$, we calculate $\bar{y}$ and then deduce the value of $H$ from the relation $H = \dfrac{1}{\bar{y}}$.

2.2.2) Position indicators (generalization of the median):

In addition to measures of central tendency, there are measures of position, generally called quantiles. The median, considered above as a central tendency parameter, is also a position indicator; to better understand quantiles and calculate them, we generalise the median as shown below. Quantiles are position measures used to indicate the position of an individual in a group by dividing the distribution into $l$ groups of equal frequency; for this we need to determine $(l - 1)$ positions (quantiles).

- If $l = 4$, the quantiles are called quartiles; there are three quartiles, denoted $Q_1$, $Q_2$ and $Q_3$.

  [Figure: the ordered distribution split by $Q_1$, $Q_2$ and $Q_3$ into four groups of 25% each, with 50% of the observations on either side of $Q_2$.]

  Remark 10:
  - For example, we say that 25% of the values taken by the variable are less than $Q_1$.
  - $Q_2 = M_e$.
  - The interval $[Q_1, Q_3[$ contains 50% of the observations; its length $Q_3 - Q_1$ is called the interquartile range.
- If $l = 10$, the quantiles are called deciles, denoted $D_1, D_2, \ldots, D_8, D_9$. 10% of the observations are less than $D_1$, 20% are less than $D_2$, 50% are less than $D_5 = M_e$, and 80% are less than $D_8$.
- If $l = 100$, the quantiles are called centiles, denoted $C_1, C_2, C_3, \ldots, C_{98}, C_{99}$. For example, 68% of the observations are less than $C_{68}$, and we note that $C_{50} = M_e$.

Remark 11: Quantiles are determined from the increasing cumulative frequency curve (drawn accurately, of course). For the continuous case, and only for the continuous case, we can approximate the values of the quantiles, as we have already done for the median, by using the linear interpolation formula
\[ Q_p = e_I + a_I \, \frac{p\,n - F_I}{f_I} \]
where
- $Q_p$: the required quantile.
- $e_I$: the lower bound of the class $I$ containing the desired quantile.
- $p$: the percentage (in decimal form) of observations to which the quantile corresponds.
- $n$: the total number of observations in the distribution (the population size).
- $f_I$: the frequency (number of observations) of the class $I$.
- $F_I$: the cumulative frequency (number of observations) of all the classes preceding class $I$.
- $a_I$: the width of the class $I$.
(With relative frequencies $f_I$ and $F_I$, $p$ is used in place of $p\,n$.)

2.2.3) Dispersion indicators:

Let us take the following two data lists of hourly wages for employees in two companies $E_1$ and $E_2$.

E1: 140, 150, 180, 250, 300, 350 and 380.
E2: 180, 200, 220, 250, 290, 300 and 310.

Note that, despite $M_{e1} = M_{e2}$ and $\bar{x}_1 = \bar{x}_2$ (all equal to 250), we cannot draw the same conclusions about the two data lists. To compare them, we need measures of dispersion, which highlight the deviations; the most important are listed below.

I. The range (of a sample or of a data set): $E = x_k - x_1$, where $x_1$ and $x_k$ are the smallest and the largest values of the quantitative variable $X$.

II. The average (or mean) absolute deviation: The average absolute deviation is the average of the absolute deviations from a specified measure of central tendency (usually the mean, the median or the mode). When it is computed from the mean, it is commonly called the mean absolute deviation. Let $T$ be a measure of central tendency, so $T \in \{\bar{x}, M_e, M_o\}$. The average absolute deviation is given by the following formulas:
- simple case:
\[ E_T = \frac{1}{n} \sum_{i=1}^{n} |x_i - T|; \]
- weighted case:
\[ E_T = \frac{1}{n} \sum_{i=1}^{k} n_i |x_i - T| = \sum_{i=1}^{k} f_i |x_i - T|, \quad n = \sum_{i=1}^{k} n_i. \]

III. Standard deviation and variance (see the third property of the arithmetic mean):

A. Standard deviation:
- simple case:
\[ \sigma(x) = \sqrt{\overline{(X - \bar{X})^2}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}; \]
- weighted case:
\[ \sigma(x) = \sqrt{\overline{(X - \bar{X})^2}} = \sqrt{\frac{1}{n} \sum_{i=1}^{k} n_i (x_i - \bar{x})^2} = \sqrt{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}. \]

B. Variance:
\[ V(x) = \overline{(X - \bar{X})^2} = \sigma^2(x). \]

C. Properties of variance:

C1- The developed formula of the variance (also called the König formula):
\[ V(x) = \overline{(X - \bar{X})^2} = \overline{X^2} - \bar{X}^2 = \begin{cases} \displaystyle \sum_{i=1}^{k} f_i x_i^2 - \left( \sum_{i=1}^{k} f_i x_i \right)^2 & \text{weighted case,} \\[2ex] \displaystyle \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2 & \text{simple case.} \end{cases} \]

C2- If $X = aX' + b$ with $a \ne 0$, then $V(X) = a^2\, V(X')$ and $\sigma(X) = |a|\, \sigma(X')$.

C3- Because of the squaring in its calculation, the standard deviation is more sensitive to sampling fluctuations and to extreme values (high or low), but it still satisfies Yule's conditions 1, 2 and 6.

C4- Let $P$ be a population of size $n$ divided into two sub-populations $P_1$ and $P_2$ of sizes $N_1$ and $N_2$ respectively. Let $X$ be a variable on $P$; let $\bar{x}$, $\bar{x}_1$ and $\bar{x}_2$ be the arithmetic means, and $V(X)$, $V(X_1)$ and $V(X_2)$ the variances, observed on $P$, $P_1$ and $P_2$ respectively. Then
\[ V(X) = \frac{1}{n} \left[ N_1 V(X_1) + N_2 V(X_2) + N_1 (\bar{x}_1 - \bar{x})^2 + N_2 (\bar{x}_2 - \bar{x})^2 \right], \]
which can be generalized to a partition into $r$ sub-populations ($r \ge 2$):
\[ V(X) = \frac{1}{n} \left[ \sum_{i=1}^{r} N_i V(X_i) + \sum_{i=1}^{r} N_i (\bar{x}_i - \bar{x})^2 \right]. \]

IV. Coefficient of variation: Whenever two samples have the same unit of measure, their variances and standard deviations can be compared directly. When the units are different, it is necessary to use the coefficient of variation to compare the standard deviations. It is expressed by
\[ CV = \frac{\sigma(x)}{\bar{x}}. \]
The coefficient of variation is often given as a percentage, and it does not depend on the unit of measure.

V. Inter-quantile ranges: The most commonly used intervals are as follows:

Designation | Range | Percentage of observations contained | Interval
Inter-quartile interval | Q3 - Q1 | 50% | [Q1 ; Q3[
Inter-decile interval | D9 - D1 | 80% | [D1 ; D9[

2.2.4) Shape indicators:

The graphical representations of a frequency distribution show us its shape. In this part of the chapter we check, and if necessary correct, our reading of the shape of the frequency curve by measuring its skewness and kurtosis. For a symmetric frequency distribution $(x_i, n_i)_{i=1,\ldots,k}$, we have
\[ \bar{x} = x_{1/2} = M_e = M_O. \quad (*) \]
If (*) does not hold, then $(x_i, n_i)_{i=1,\ldots,k}$ is not symmetric, and to study its shape we calculate two coefficients, denoted $\gamma_1$ and $\gamma_2$, called Fisher's coefficient of asymmetry and Fisher's kurtosis coefficient respectively (first and second Fisher coefficients). These two coefficients depend on the centred moments, defined as follows.

Centred moments $\mu_r$, $r \ge 1$:
- simple case:
\[ \mu_r = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r; \]
- weighted case:
\[ \mu_r = \frac{1}{n} \sum_{i=1}^{k} n_i (x_i - \bar{x})^r = \sum_{i=1}^{k} f_i (x_i - \bar{x})^r. \]

Note that:
- $\mu_r = \overline{(X - \bar{x})^r}$, which is the mean of the variable $Y = (X - \bar{x})^r$. So, to calculate $\mu_r$, we add a column for $Y$ to the frequency table of $(x_i, n_i)_{i=1,\ldots,k}$ and calculate its arithmetic mean, since $\bar{y} = \mu_r$.
- $\mu_1 = \overline{X - \bar{x}} = 0$ (see the properties of the arithmetic mean) and $\mu_2 = \overline{(X - \bar{x})^2} = V(X) = \overline{X^2} - \bar{X}^2$.
- For an affine change of variable $Y = aX + b$, we have $\mu_r(Y) = a^r \mu_r(X)$; $a$ and $b$ should be chosen to simplify the calculation of $\mu_r(Y)$, from which we deduce $\mu_r(X) = \dfrac{\mu_r(Y)}{a^r}$.
- The odd-order centred moments are zero for a symmetric frequency distribution, negative for a left-skewed unimodal distribution and positive for a right-skewed unimodal distribution.
(This remark will help us to interpret the coefficient $\gamma_1$ defined below.)

Fisher's coefficient of asymmetry:
\[ \gamma_1 = \frac{\mu_3}{(\mu_2)^{3/2}}, \quad \mu_2 \ne 0. \]
$\gamma_1 = 0$ for a symmetric frequency distribution, $\gamma_1 < 0$ for a left-skewed unimodal distribution and $\gamma_1 > 0$ for a right-skewed unimodal distribution (see the graphs in the section comparing $M_O$, $M_e$ and $\bar{x}$).

Fisher's coefficient of kurtosis:
\[ \gamma_2 = \frac{\mu_4}{(\mu_2)^2} - 3, \quad \mu_2 \ne 0. \]
- $\gamma_2 > 0$: "positive kurtosis"; we say that the distribution is "leptokurtic". It indicates that the distribution is peaked and possesses thick tails when compared with the normal distribution (the normal distribution is a mesokurtic distribution, its $\gamma_2$ is equal to zero; we will look at it in detail in chapter 6).
- $\gamma_2 < 0$: "negative kurtosis"; we say that the distribution is "platykurtic". It indicates that the distribution is flatter (less peaked) when compared with the normal distribution.

To summarise, see the figure below:

[Figure: three frequency curves. Mesokurtic: $\gamma_2 = 0$. Leptokurtic: $\gamma_2 > 0$. Platykurtic: $\gamma_2 < 0$.]

Remark 12:
- $\gamma_1$ and $\gamma_2$ are invariant under any affine change of variable $Y = aX + b$ with $a > 0$ (for $a < 0$, $\gamma_2$ is unchanged and $\gamma_1$ changes sign, since $\mu_r(Y) = a^r \mu_r(X)$).
- $\gamma_1$ and $\gamma_2$ are independent of the unit of measurement of the variable under study.
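As a rough illustration of the shape indicators, the sketch below computes the centred moments and the two Fisher coefficients for a weighted frequency distribution. It reuses the household data of Example 1, case (b); the function names `centred_moment` and `fisher_coefficients` are hypothetical, and the code simply transcribes the weighted-case definitions rather than any particular library's implementation.

```python
def centred_moment(xs, ns, r):
    """Weighted centred moment: (1/n) * sum of n_i * (x_i - mean)^r."""
    n = sum(ns)
    mean = sum(k * x for x, k in zip(xs, ns)) / n
    return sum(k * (x - mean) ** r for x, k in zip(xs, ns)) / n


def fisher_coefficients(xs, ns):
    """Return (asymmetry, kurtosis): mu_3 / mu_2^(3/2) and mu_4 / mu_2^2 - 3."""
    mu2 = centred_moment(xs, ns, 2)
    if mu2 == 0:
        raise ValueError("mu_2 = 0: the coefficients are undefined")
    gamma1 = centred_moment(xs, ns, 3) / mu2 ** 1.5
    gamma2 = centred_moment(xs, ns, 4) / mu2 ** 2 - 3
    return gamma1, gamma2


# Households by number of children (Example 1, case (b)).
children = [0, 1, 2, 3, 4, 5]
households = [50, 60, 40, 20, 5, 5]
g1, g2 = fisher_coefficients(children, households)
print(f"asymmetry = {g1:.3f}, kurtosis = {g2:.3f}")

# A small symmetric distribution, for comparison: its asymmetry is zero.
g1_sym, _ = fisher_coefficients([1, 2, 3], [1, 2, 1])

# Affine change of variable Y = 2X + 3 (a > 0) leaves both coefficients unchanged.
g1_aff, g2_aff = fisher_coefficients([2 * x + 3 for x in children], households)
```

For the household data the asymmetry coefficient comes out positive, which agrees with the long right tail visible in the table (few households with 4 or 5 children).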
