Main Textbook - One-Dimensional Frequency Distributions PDF
W.K. Härdle et al
Summary
This chapter details one-dimensional frequency distributions, exploring various graphical methods like bar graphs, pie charts, and pictographs for visualizing discrete data. It analyzes examples such as job proportions in Germany and the evolution of household sizes.
Chapter 2  One-Dimensional Frequency Distributions

2.1 One-Dimensional Distribution

The collection of information about class boundaries and relative or absolute frequencies constitutes the frequency distribution. For a single variable (e.g., height) we have a one-dimensional frequency distribution. If more than one variable is measured for each statistical unit (e.g., height and weight), we may define a two-dimensional frequency distribution. We use the notation X to denote the observed variable.

2.1.1 Frequency Distributions for Discrete Data

Suppose the variable X can take on k distinct values x_j, j = 1, ..., k. Note that we index these distinct values or classes using the subscript j. We will denote n observations on the random variable by x_i, i = 1, ..., n. The context will usually make it clear whether we are referring to the k distinct values or the n observations. We will assume that n > k.

Frequency Table

For a discrete variable X, the frequency table displays the distribution of frequencies over the given categories. From now on we will speak of discrete variables to encompass both categorical variables and discrete metric variables with few possible observations. Note that the sum of the absolute frequencies across the various categories equals the number of observations, i.e., ∑_{j=1}^{k} h(x_j) = n (Table 2.1).

© Springer International Publishing Switzerland 2015
W.K. Härdle et al., Introduction to Statistics, DOI 10.1007/978-3-319-17704-5_2

Table 2.1  A frequency table

  Values   Absolute frequencies   Relative frequencies
  x_1      h(x_1)                 f(x_1)
  x_2      h(x_2)                 f(x_2)
  ...      ...                    ...
  x_j      h(x_j)                 f(x_j)
  ...      ...                    ...
  x_k      h(x_k)                 f(x_k)
  Total    n                      1

[Fig. 2.1  Example of a bar graph]

2.1.2 Graphical Presentation

Several graph types exist for displaying frequency distributions of discrete data.
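Such a frequency table can be tabulated directly from raw observations. The following minimal Python sketch (the function name and the data are illustrative, not from the text) computes the absolute frequencies h(x_j) and relative frequencies f(x_j):

```python
from collections import Counter

def frequency_table(observations):
    """Return (value, absolute frequency, relative frequency) rows,
    one row per distinct value, ordered by value."""
    n = len(observations)
    counts = Counter(observations)           # absolute frequencies h(x_j)
    return [(x, h, h / n) for x, h in sorted(counts.items())]

# Illustrative data: n = 10 observations on a discrete variable
data = [1, 2, 2, 3, 3, 3, 1, 2, 3, 4]
table = frequency_table(data)
# The absolute frequencies must sum to n, as noted above.
assert sum(h for _, h, _ in table) == len(data)
```

The relative frequencies in the last column then sum to 1 by construction.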
Bar Graph

In a bar graph, frequencies are represented by the height of bars drawn vertically over the categories depicted on the horizontal axis. Since the categories do not represent intervals, as in the case of grouped continuous data, the width of the bars cannot be interpreted meaningfully. Consequently, the bars are drawn with equal width (Fig. 2.1).

Stacked Bar Chart

Sometimes one wants to compare relative frequencies in different samples (different samples may arise at different points in time or from different populations). This can be done by drawing one bar graph for each sample. An alternative is the stacked bar chart. It consists of as many segmented bars as there are samples. Each segment of a bar represents a relative frequency (Fig. 2.2).

[Fig. 2.2  Example of a stacked bar chart]

Pie Chart

In pie charts, frequencies are displayed as segments of a pie. The area of each segment is proportional to the corresponding relative frequency (Fig. 2.3).

[Fig. 2.3  Example of a pie chart]

Pictograph

In a pictograph, the size or number of pictorial symbols is proportional to the observed frequencies (Fig. 2.4).

[Fig. 2.4  Two examples of pictographs]

Statistical Map

Different relative frequencies in different areas are visualized by different colors, shadings, or patterns (Fig. 2.5).

[Fig. 2.5  Example of a statistical map, with a shading legend from 0–10 % up to 50–60 %]

Table 2.2  Frequency table on the employed population in Germany

  j   Status x_j         h(x_j) (in 1000s)   f(x_j)
  1   Wage-earners       14568               0.389
  2   Salaried           16808               0.449
  3   Civil servants      2511               0.067
  4   Self-employed       3037               0.081
  5   Family employed      522               0.014
      Total              37466               1.000

[Fig. 2.6  Pie chart and bar graph on the employed population in Germany]

Explained: Job Proportions in Germany

In April 1991, Germany's employed population was surveyed with respect to type of employment. Table 2.2 summarizes the data. Visualizing the proportions helps us to analyze the data. In Fig. 2.6 you can clearly see the high proportion of wage-earners and salaried employees in contrast to the other categories.

Enhanced: Evolution of Household Sizes

The evolution of household sizes over the twentieth century can be studied using data compiled at various points in time.

Statistical elements: households
Statistical variable: size of household (metric, discrete)

Table 2.3 contains relative frequencies measured in percent for various years. The structural shift in the pattern of household sizes towards the end of the century becomes visible if we draw bar charts for each year. The graphics in Fig. 2.7 display a clear shift towards smaller families during the twentieth century.
Table 2.3  Frequency table on the evolution of household sizes over the twentieth century (relative frequencies in percent)

  Household size X   1900    1925    1950    1990
  1                   7.1     6.7    19.4    35.0
  2                  14.7    17.7    25.3    30.2
  3                  17.0    22.5    23.0    16.7
  4                  16.8    19.7    16.2    12.8
  ≥5                 44.4    33.3    16.1     5.3
  Total             100     100     100     100

[Fig. 2.7  Histograms on the evolution of household sizes over the twentieth century]

2.2 Frequency Distribution for Continuous Data

Given a sample x_1, x_2, ..., x_n on a continuous variable X, we may group the data into k classes with class boundaries denoted by x_1^l, x_1^u = x_2^l, x_2^u = x_3^l, ..., x_k^u, and class widths Δx_j = x_j^u − x_j^l (j = 1, ..., k). Note that the upper boundary for a given class is equal to the lower boundary of the succeeding class. An observation x_i belongs to class j if x_j^l ≤ x_i < x_j^u. Since within a class there is a range of possible values, we will focus on the midpoint and denote it by x_j. (Contrast this with the discrete data case, where x_j denotes the value of the category.) Once again the subscript j corresponds to classes x_j, j = 1, ..., k, and the subscript i denotes observations x_i, i = 1, ..., n.

Frequency Table

A frequency table for continuous data provides the distribution of frequencies over the given classes.

Table 2.4  Structure of a frequency table

  Class #   Classes             Absolute frequencies   Relative frequencies
  1         x_1^l ≤ X < x_1^u   h(x_1)                 f(x_1)
  2         x_2^l ≤ X < x_2^u   h(x_2)                 f(x_2)
  ...       ...                 ...                    ...
  j         x_j^l ≤ X < x_j^u   h(x_j)                 f(x_j)
  ...       ...                 ...                    ...
  k         x_k^l ≤ X < x_k^u   h(x_k)                 f(x_k)
  Total                         n                      1
The structure of a frequency table is shown in Table 2.4.

Graphical Presentation

Histogram

In a histogram, continuous data that have been grouped into classes are represented by rectangles. Class boundaries are marked on the horizontal axis. As the classes can be of varying width, we cannot simply represent frequencies by the heights of bars as we did for bar graphs. Rather, we must correct for class widths. The rectangles are constructed so that their areas are equal to the corresponding absolute or relative frequencies:

  ĥ(x_j) = h(x_j) / Δx_j = h(x_j) / (x_j^u − x_j^l)

or

  f̂(x_j) = f(x_j) / Δx_j = f(x_j) / (x_j^u − x_j^l)

[Fig. 2.8  Example of histogram: 716 observations on monthly income (Euro)]

If the class widths are identical, then the frequencies are also proportional to the heights of the rectangles. The rectangles are drawn contiguous to each other, reflecting the common class boundaries x_j^u = x_{j+1}^l (Fig. 2.8).

Stem-and-Leaf Display

In stem-and-leaf displays (plots), the data are not summarized using geometric objects. Rather, the actual values are arranged to give a rough picture of the data structure. The principle is similar to that of the bar chart, but values belonging to a particular class are recorded horizontally rather than being represented by vertical bars. Classes are set up by splitting the numerical observations into two parts: one or more of the leading digits make up the stem, the remaining (trailing) digits are called leaves. All observations with the same leading digits, i.e., the same stem, belong to one class. Typically, class frequencies are proportional to the lengths of the lines. The principle is best understood by applying it to real data. Consider the following collection of observations:

  32, 32, 35, 36, 40, 44, 47, 48, 53, 57, 57, 100, 105

The "stems" consist of the following "leading digits": 3, 4, 5, 10.
They correspond to the number of times that "ten" divides into the observation. The resulting stem-and-leaf diagram is displayed below.

  Frequency   Stems   Leaves
  4           3       2256
  4           4       0478
  3           5       377
  2           10      05

Displaying data graphically (or, as is the case here, quasi-graphically), we can extract more relevant information than we could otherwise. (The human brain is comparatively efficient at storing and comparing visual patterns.)

The above stem-and-leaf plot appears quite simple. We can refine it by splitting the lines belonging to one stem in two, the first one for the trailing digits in the range zero to four, the second for five to nine. We label the first group with l for low, the second with h for high. In the resulting stem-and-leaf plot the data appear approximately evenly distributed:

  Frequency   Stems   Leaves
  2           3 l     22
  2           3 h     56
  2           4 l     04
  2           4 h     78
  1           5 l     3
  2           5 h     77
  1           10 l    0
  1           10 h    5

Yet there is an apparent gap between stems 5 and 10. It is indeed one of the advantages of stem-and-leaf plots that they are helpful both in giving insights into the concentration of data in specific regions and in spotting extraordinary or extreme observations. By labeling 100 and 105 as outliers we obtain a useful enhancement of the stem-and-leaf plot:

  Frequency   Stems   Leaves
  2           3 l     22
  2           3 h     56
  2           4 l     04
  2           4 h     78
  1           5 l     3
  2           5 h     77
  2           Extremes: 100, 105

For an example with data conveying a richer structure of concentration and a more detailed stem structure, have a look at the following examples for grouped continuous data.

Dotplots

Dotplots are used to graphically display small datasets. For each observation, a "dot" (a point, a circle, or any other symbol) is plotted. Some data will take on the same values. Such ties would result in "overplotting" and thus would distort the display of the frequencies. The dots are therefore spread out in the vertical dimension in a random fashion: the y-axis contains uniformly spread random numbers over the [0, 1] interval. Provided the size of each symbol is sufficiently small for a given sample size, the dots are then unlikely to overlap each other.

[Fig. 2.9  Example of dotplot: student salaries in the USA]

Example

The data in Fig. 2.9 consist of 150 observations on student salaries in the USA. In the upper panel, we display a dotplot for all 150 observations. In the lower panel, we use color to distinguish the gender of the students. Since the random perturbations in the vertical dimension are different for the two panels, the points are located in slightly different positions.

Explained: Petrol Consumption of Cars

Petrol consumption of 74 cars has been measured in miles per gallon (MPG). The measurements are displayed in the frequency table shown in Table 2.5.

Table 2.5  Petrol consumption of 74 cars in miles per gallon (MPG)

  X: Petrol consumption (MPG)   Absolute frequencies h(x_j)   Relative frequencies f(x_j)
  12 ≤ X < 15                    8                            0.108
  15 ≤ X < 18                   10                            0.135
  18 ≤ X < 21                   20                            0.270
  21 ≤ X < 24                   13                            0.176
  24 ≤ X < 27                   12                            0.162
  27 ≤ X < 30                    4                            0.054
  30 ≤ X < 33                    3                            0.041
  33 ≤ X < 36                    3                            0.041
  36 ≤ X < 39                    0                            0.000
  39 ≤ X < 42                    1                            0.013
  Total                         74                            1.000

Using a constant class width of 3 MPG, the frequency distribution is displayed in a histogram in Fig. 2.10.

[Fig. 2.10  Histogram for petrol consumption of 74 cars in miles per gallon (MPG)]

As is evident from both the frequency table and the histogram, the largest proportion of cars lies in the category 18–21 MPG.
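The basic stem-and-leaf construction described above can be sketched in a few lines of Python (a minimal illustration; the refinement into l/h half-stems and the outlier handling are omitted):

```python
def stem_and_leaf(observations, stem_width=10):
    """Group observations by stem (leading digits) and collect the
    trailing digit of each value as a leaf, in sorted order."""
    stems = {}
    for x in sorted(observations):
        stem, leaf = divmod(x, stem_width)   # e.g. 47 -> stem 4, leaf 7
        stems.setdefault(stem, []).append(leaf)
    return stems

data = [32, 32, 35, 36, 40, 44, 47, 48, 53, 57, 57, 100, 105]
plot = stem_and_leaf(data)
# Reproduces the diagram in the text: stem 3 carries leaves 2, 2, 5, 6, etc.
assert plot[3] == [2, 2, 5, 6]
assert plot[10] == [0, 5]
```

Printing each stem with its concatenated leaves yields the display shown earlier.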
Explained: Net Income of German Nationals

Data

Statistical elements: German nationals, residing in private households, minimum age 18
Statistical variable: monthly net income
Sample size: n = 716

[Fig. 2.11  Histograms of monthly net income in Euro for different bandwidths (800, 500, 250, and 100 Euro)]

Histogram

In the histograms shown in Fig. 2.11, the classes are income brackets of equal width. Reducing the common class width (and hence increasing the number of classes) yields a more detailed picture of the income distribution. Observe how the absolute frequencies decline as the class widths become narrower. Furthermore, increasing the number of classes decreases the smoothness of the graph. Additional gaps become visible as more information about the actual data is displayed. In choosing a class width we are striking a balance between two criteria: the essential information about the population, which might be more strikingly conveyed in a smoother graph, and the greater detail contained in a histogram with a larger number of classes.

We can also separate histograms by gender, using a bin width of 500 Euro, as shown in Fig. 2.12.

[Fig. 2.12  Histograms of monthly net income in Euro for males (n = 986) and females (n = 1014)]

Stem-and-Leaf Display

The stem-and-leaf plot provided in Table 2.6 displays all 716 income figures.
It is more detailed than the stem-and-leaf plots we have previously drawn. The stems, specified by the first leading digit, are divided into five subclasses corresponding to different values of the first trailing, i.e., leaf, digit: the first line of each stem, denoted by *, lists all leaves starting with 0 or 1, the second (t) those starting with 2 or 3, and so on. As the stem width is specified to be 1000, the first leaf digit counts the hundreds. To condense the exposition, every two observations belonging to the same class (i.e., having the same leaf) are represented by just one number (leaf). For example, six of the 716 surveyed persons earn between 2400 and 2500 Euros, denoted by "444" in the "2 f" line.

The ampersand (&) denotes pairs of observations covering both leaves represented by one line. For example, 4 persons earn between 4200 and 4400 Euros. Following the convention of each leaf representing two cases, there are two persons with net earnings in the interval [4200, 4300). The other two persons, symbolized by &, would be displayed by the sequence "23" if one leaf represented one observation. Thus, one of the two persons belongs to the income bracket [4200, 4300), the other to the [4300, 4400) bracket. Observe that the 17 "extreme" values are displayed separately to highlight their distance from the other, more heavily populated classes.

Table 2.6  Stem-and-leaf plot

  Frequency   Stem and Leaf
   2          0 *   1
  21          0 t   2233333333
  35          0 f   44444444555555555
  47          0 s   66666666666666667777777
  41          0 .   88888888888899999999
  45          1 *   0000000000000000111111
  38          1 t   2222222222222233333
  63          1 f   4444444444455555555555555555555
  45          1 s   6666666666667777777777
  72          1 .   88888888888888888888888889999999999
  78          2 *   00000000000000000000000000000001111111
  46          2 t   22222222222222333333333
  32          2 f   444555555555555
  28          2 s   66666667777777
  23          2 .   88888889999
  28          3 *   00000000000011
  10          3 t   2233
  16          3 f   44555555
   8          3 s   6677
   5          3 .   88
  12          4 *   00000&
   4          4 t   2&
  10          Extremes: (4400), (4500), (5000), (5500), (5600), (5900), (6400), (6500), (7000), (15000)

  Stem width: 1000
  Each leaf: 2 case(s), & denotes fractional leaves

2.3 Empirical Distribution Function

Empirical distribution functions can be constructed for data that have a natural numerical ordering. If h(x_j) is the absolute frequency of observations on a discrete variable, then the absolute frequency (or number) of observations not exceeding that value is called the absolute cumulated frequency:

  H(x_j) = ∑_{s=1}^{j} h(x_s),   j = 1, ..., k

The relative cumulative frequency is calculated as:

  F(x_j) = H(x_j) / n = ∑_{s=1}^{j} f(x_s),   j = 1, ..., k

If the variable is continuous and the data are grouped into k classes, then the above definitions apply, except that we interpret H(x_j) as the frequency of observations not exceeding the upper boundary of the j-th class.

2.3.1 Empirical Distribution Function for Discrete Data

For the relative cumulative frequency we have

  F(x) = 0          if x < x_1
  F(x) = F(x_j)     if x_j ≤ x < x_{j+1}, j = 1, ..., k−1
  F(x) = 1          if x ≥ x_k

[Figure: step-function empirical distribution function for the household-size data]

2.3.2 Empirical Distribution Function for Grouped Continuous Data

As for discrete data, the empirical distribution function for grouped continuous data is a function of relative cumulative frequencies. But in this case, rather than using a step function, one plots the cumulative frequencies against the upper boundaries of each class and then joins the points with straight lines. Mathematically, the empirical distribution function may be written as:

  F(x) = 0                                        if x < x_1^l
  F(x) = F(x_j^l) + (f(x_j)/Δx_j)(x − x_j^l)      if x_j^l ≤ x < x_j^u
  F(x) = 1                                        if x ≥ x_k^u

2.4 Numerical Description of One-Dimensional Frequency Distributions

Quantiles

Let x_(1) ≤ x_(2) ≤ ... ≤ x_(n) denote the ordered observations. For 0 < p < 1, the p-quantile x_p divides the ordered data so that approximately a proportion p of the observations lies below it. If np is not an integer, let k be the smallest integer with k > np; then we define x_p = x_(k). The quantile is thus the observation with rank k, x_(k). If k = np is an integer, we will take x_p to be the midpoint between x_(k) and x_(k+1).
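The quantile rule just stated can be sketched as follows (a minimal illustration for raw, ungrouped data, assuming 0 < p < 1 so that the required ranks exist):

```python
import math

def empirical_quantile(data, p):
    """p-quantile of ungrouped data: x_(k) with k the smallest integer
    greater than n*p, or the midpoint of x_(k) and x_(k+1) when n*p
    is an integer."""
    xs = sorted(data)
    n = len(xs)
    np_ = n * p
    if abs(np_ - round(np_)) < 1e-9:     # n*p is an integer
        k = int(round(np_))
        return (xs[k - 1] + xs[k]) / 2   # midpoint of x_(k) and x_(k+1)
    return xs[math.ceil(np_) - 1]        # the observation with rank k

obs = [32, 32, 35, 36, 40, 44, 47, 48, 53, 57, 57, 100, 105]
# n = 13, p = 0.5: n*p = 6.5 is not an integer, so k = 7 and x_p = x_(7) = 47
assert empirical_quantile(obs, 0.5) == 47
```

The tolerance check guards against floating-point values of n·p such as 5.000000001.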
Quantiles for Grouped Data

For data that are grouped in classes, we carry out interpolations between class boundaries to obtain a p-quantile:

  x_p = x_j^l + ((p − F(x_j^l)) / f(x_j)) · (x_j^u − x_j^l)

Here, x_j^l, x_j^u, and f(x_j) are the lower boundary, upper boundary, and the relative frequency of the class containing the p-th quantile. The cumulative relative frequency up to and including the class preceding the quantile class is denoted by F(x_j^l). The quantile x_p can thus be defined using interpolation. The principle of interpolation for the quantity p = F(x_p) can be easily understood from Fig. 2.18.

Some special quantiles:

deciles (tenths): the ordered observations are divided into ten equal parts; p = s/10, s = 1, ..., 9; deciles x_{0.1}, x_{0.2}, ..., x_{0.9}
quintiles: the ordered observations are divided into five equal parts; p = r/5, r = 1, 2, 3, 4; quintiles x_{0.2}, x_{0.4}, x_{0.6}, x_{0.8}
quartiles: the ordered observations are divided into four equal parts; p = q/4, q = 1, 2, 3; quartiles x_{0.25}, x_{0.5}, x_{0.75}

Median (Central Value)

The value which divides the ordered observations into two equal parts is called the median x_z = x_{0.5}. The median is much less sensitive to outlying or extreme observations than other measures, such as the mean, which we study below. The median x_z corresponds to the second quartile x_{0.5}.

Median for Ungrouped Data

For n odd: x_{0.5} = x_((n+1)/2)
For n even: x_{0.5} = (x_(n/2) + x_(n/2+1)) / 2. This is simply the midpoint of the two center-most observations.

Median for Grouped Variables

The median for grouped data is defined as the mid-point of the class which contains the central portion of the data.

[Fig. 2.18  Quantiles of grouped data]

Formally, let x_j^l and x_j^u be the lower and upper boundaries of the class for which F(x_{j−1}^u) = F(x_j^l) < 0.5 and F(x_j^u) ≥ 0.5.
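The interpolation formula can be sketched as follows (an illustrative function name; the class list is taken from the monthly-income example of Table 2.12):

```python
def grouped_quantile(classes, p):
    """Interpolated p-quantile for classed data.
    `classes` is a list of (lower boundary, upper boundary, relative
    frequency) tuples in increasing order."""
    F = 0.0                              # cumulative frequency F(x_j^l)
    for lower, upper, f in classes:
        if F + f >= p:                   # the quantile lies in this class
            return lower + (p - F) / f * (upper - lower)
        F += f
    raise ValueError("p must lie in (0, 1)")

# Monthly net income of households (Table 2.12), relative frequencies
income = [(1, 800, 0.044), (800, 1400, 0.166), (1400, 3000, 0.471),
          (3000, 5000, 0.243), (5000, 25000, 0.076)]
assert round(grouped_quantile(income, 0.25), 2) == 1535.88
```

This reproduces the quartile values computed by hand later in this section.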
Then,

  x_{0.5} = x_j^l + ((0.5 − F(x_j^l)) / f(x_j)) · (x_j^u − x_j^l)

The median can be easily determined from the graph of the distribution function, since F(x_{0.5}) = 0.5, see Fig. 2.19.

[Fig. 2.19  Median for grouped continuous data]

Properties of the Median (of Numerical Variables)

Optimality:

  ∑_{i=1}^{n} |x_i − x_{0.5}| = ∑_{j=1}^{k} |x_j − x_{0.5}| h(x_j) → min

The median is optimal in the sense that it minimizes the sum of absolute deviations of the observations from a point that lies in the midst of the data.

Linear transformation:

  y_i = a + b·x_i  →  y_{0.5} = a + b·x_{0.5}

If the data are transformed linearly, then the median is shifted by that same linear transformation.

Calculation of Quartiles

The empirical distribution function (third column of Table 2.12) implies that both the first quartile x_{0.25} (p = 0.25) and the second quartile x_{0.5} (p = 0.50) belong to the third group (1400–3000 EUR). By interpolation we find the following (Fig. 2.20).

Table 2.12  Example: monthly net income of households (up to 25000 EUR)

  Income range (EUR)   Proportion of households f(x)   Empirical distribution function F(x)
  1–800                0.044                           0.044
  800–1400             0.166                           0.210
  1400–3000            0.471                           0.681
  3000–5000            0.243                           0.924
  5000–25000           0.076                           1.000

[Fig. 2.20  Graph of the empirical distribution function and quartiles]

  x_{0.25} = 1400 + ((0.25 − 0.21) / 0.471) · 1600 = 1535.88 EUR
  x_{0.50} = 1400 + ((0.50 − 0.21) / 0.471) · 1600 = 2385.14 EUR
  x_{0.75} = 3000 + ((0.75 − 0.681) / 0.243) · 2000 = 3567.90 EUR

The Interpretation

25 % of the households have net monthly income not exceeding 1535.88 EUR, and 75 % of the households have income higher than 1535.88 EUR (first quartile). 50 % of the households have income smaller than 2385.14 EUR, and 50 % of the households have income higher than 2385.14 EUR (second quartile).
75 % of the households have income less than 3567.90 EUR, and 25 % of the households have income exceeding 3567.90 EUR (third quartile). The above also implies that 50 % of the households have net income between 1535.88 EUR and 3567.90 EUR.

Arithmetic Mean

The arithmetic mean or average, denoted x̄, is obtained by summing all observations and dividing by n. The arithmetic mean is sensitive to outliers. In particular, an extreme value tends to "pull" the arithmetic mean in its direction. The mean can be calculated in various ways: using the original data, using the frequency distribution, and using the relative frequency distribution. For discrete data, each method yields a numerically identical answer.

Calculation using the original data:

  x̄ = (1/n) ∑_{i=1}^{n} x_i

Calculation using the frequency and relative frequency distribution:

  x̄ = (1/n) ∑_{j=1}^{k} x_j h(x_j) = ∑_{j=1}^{k} x_j f(x_j)

Properties of the Arithmetic Mean

Center of gravity: the sum of the deviations of the data from the arithmetic mean is equal to zero.

  ∑_{i=1}^{n} (x_i − x̄) = 0  ⟺  ∑_{j=1}^{k} (x_j − x̄) h(x_j) = 0

Minimum sum of squares: the sum of squares of the deviations of the data from the arithmetic mean is smaller than the sum of squares of deviations from any other value c ≠ x̄.

  ∑_{i=1}^{n} (x_i − x̄)² < ∑_{i=1}^{n} (x_i − c)²
  ∑_{j=1}^{k} (x_j − x̄)² h(x_j) < ∑_{j=1}^{k} (x_j − c)² h(x_j)

Pooled data: assume that the observed data are in disjoint sets D_1, D_2, ..., D_r, and that the arithmetic mean x̄_p for each of the sets is known. Then the arithmetic mean of all observed values (considered as one set) can be calculated using the formula

  x̄ = (1/n) ∑_{p=1}^{r} x̄_p n_p,   n = ∑_{p=1}^{r} n_p

where n_p denotes the number of observations in the p-th group (p = 1, ..., r).
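The pooled-data property can be sketched as follows (the function name and the two groups are illustrative):

```python
def pooled_mean(group_means, group_sizes):
    """Overall arithmetic mean computed from per-group means and sizes."""
    n = sum(group_sizes)
    return sum(m * n_p for m, n_p in zip(group_means, group_sizes)) / n

# Two illustrative groups: D1 = [1, 2, 3] (mean 2), D2 = [10, 20] (mean 15)
overall = pooled_mean([2, 15], [3, 2])
# Same result as averaging all five observations directly
assert overall == sum([1, 2, 3, 10, 20]) / 5
```

Only the group means and group sizes are needed; the raw observations can be discarded.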
Table 2.13  Example 1: monthly income of households (MIH)

  MIH (EUR)    Proportion of households f(x)   Cumulative distribution function F(x)
  1–800        0.044                           0.044
  800–1400     0.166                           0.210
  1400–3000    0.471                           0.681
  3000–5000    0.243                           0.924
  5000–25000   0.076                           1.000

Table 2.14  Example 2: monthly income of 716 people

  x̄ = 1881.40 EUR
  x_{0.25} = 1092.50 EUR
  x_{0.50} = 1800.00 EUR
  x_{0.75} = 2400.00 EUR
  'mode' = 2000.00 EUR

Linear transformation:

  y_i = a + b·x_i  →  ȳ = a + b·x̄

Sum:

  z_i = x_i + y_i  →  z̄ = x̄ + ȳ

From the data of Example 1 given in Table 2.13 we can calculate the arithmetic mean using the mid-points of the groups:

  x̄ = 400·0.044 + 1100·0.166 + 2200·0.471 + 4000·0.243 + 15000·0.076
    = 17.6 + 182.6 + 1036.2 + 972 + 1140 = 3348.4 EUR

The arithmetic mean 3348.4 EUR is higher than the median calculated above (2385.14 EUR). This can be explained by the fact that the arithmetic mean is more sensitive to the relatively small number of large incomes. The high values shift the arithmetic mean but do not influence the median (Table 2.14).

Explained: Average Prices of Cars

This dataset contains prices (in USD) of 74 cars. The distribution of prices is displayed using a dotplot below. The price variable is on the horizontal axis. The data are randomly scattered in the vertical direction for better visualization. In Fig. 2.21, the median is displayed in red and the arithmetic mean in magenta. As can be seen, the two values almost coincide.

[Fig. 2.21  Prices for 74 cars (USD): arithmetic mean 4896.417 (magenta) and median 4672.000 (red)]

For symmetric distributions, the median and arithmetic mean are identical. This is almost true for our example. However, during a check of the data, it was discovered that one value had not been entered correctly: the value 15962 USD had incorrectly been entered as 5962 USD. Figure 2.22 contains the corrected values.

[Fig. 2.22  Corrected prices for 74 cars (USD): arithmetic mean 5063.083 (magenta) and median 4672.000 (red)]

The median (because it is robust) did not change. On the other hand, the arithmetic mean has increased significantly, as it is sensitive to extreme values. The miscoded observation took on a value well outside the main body of the data. The measurements were repeated after some time, with the results shown in Fig. 2.23.

[Fig. 2.23  Repeated measurements of car prices: arithmetic mean 5063.083 (magenta) and median 5006.500 (red)]

Now there are a number of relatively more expensive cars. The distribution of prices is now skewed to the right. These more extreme observations pull the mean to the right much more so than the median. Thus, for right-skewed distributions, the arithmetic mean is larger than the median.

Interactive: Dotplot with Location Parameters

The interactive example includes a number of sidebar panels. You can access the panels by setting a mark at the corresponding check box on the upper right.

Please select a dotplot type, e.g., jitter, if you would like the mean and median to be included in the plot. The last two panels allow you to choose a dataset or variable and to change the font size. For a detailed explanation of datasets and variables, please refer to Appendix A.

Output

The interactive example allows us to display a one-dimensional frequency distribution in the form of a dotplot for a variety of variables. Possible values are displayed along the horizontal axis. For easier visualization, the observations may be randomly shifted (jitter) in the vertical direction.

[Fig. 2.24  Screenshot of the interactive example, available at http://u.hu-berlin.de/men_dot1]
The median and the arithmetic mean can be displayed graphically and numerically (Fig. 2.24).

Interactive: Simple Histogram

The interactive example includes a number of sidebar panels. You can access the panels by setting a mark at the corresponding check box on the upper right.

Please select the number of bins if you would like the observations to be shown. The last two panels allow you to choose a dataset or variable and to change the font size. For a detailed explanation of datasets and variables, please refer to Appendix A.

Output

The graphic displays all observations of a variable summarized in a histogram (Fig. 2.25).

[Fig. 2.25  Screenshot of the interactive example, available at http://u.hu-berlin.de/men_hist]

2.5 Location Parameters: Mean Values, Harmonic Mean, Geometric Mean

If the observed variables are ratios, then the arithmetic mean may not be appropriate.

Harmonic Average

The harmonic average, denoted x̄_H, is useful for variables which are ratios. We assume that all data points are nonzero, i.e., x_i ≠ 0 and, as a consequence, x_j ≠ 0.

  x̄_H = n / ∑_{i=1}^{n} (1/x_i)

  x̄_H = (∑_{j=1}^{k} g_j) / (∑_{j=1}^{k} g_j / x_j),   j = 1, ..., k

In the latter formula, g_j provides additional information which will become clear in the example below.

Example 1

  Part of the road j     1    2    3    4
  Distance g_j in km     2    4    3    8
  Speed x_j in km/h     40   50   80  100

We would like to calculate the average speed of the car during the period of travel. It is inappropriate to simply average the speeds, since they are measured over differing periods of time. In the table, g_j is the distance traveled in each segment.
Using the above formula we calculate:

  Total time: ∑_{j=1}^{k} g_j / x_j = 2/40 + 4/50 + 3/80 + 8/100 = 0.2475 h
  Total distance: ∑_{j=1}^{k} g_j = 17 km
  Average: x̄_H = (2 + 4 + 3 + 8) / (2/40 + 4/50 + 3/80 + 8/100) = 17 / 0.2475 = 68.687 km/h

The arithmetic mean would lead to an incorrect result, 67.5 km/h, because it does not account for the varying lengths of the various parts of the road. Correct use of the arithmetic mean would involve calculating the time spent along each segment. In the above example, these times are given by h_j = g_j / x_j for each segment:

  h_1 = g_1/x_1 = 0.05;  h_2 = g_2/x_2 = 0.08;  h_3 = g_3/x_3 = 0.0375;  h_4 = g_4/x_4 = 0.08

  x̄ = (40·0.05 + 50·0.08 + 80·0.0375 + 100·0.08) / (0.05 + 0.08 + 0.0375 + 0.08) = 68.687 km/h

Thus, in order to calculate the average of ratios using additional information on the numerator (in our case x_j with the additional information g_j), we use the harmonic average. In order to calculate the average of ratios using additional information on the denominator, we choose the arithmetic average.

Example 2

Four students, who have part-time jobs, have the hourly (respectively weekly) salaries given in Table 2.15. We are supposed to find the average hourly salary. This calculation cannot be done using only the arithmetic average of the hourly salaries, because that would not take into account the different times spent on the job. The variable of interest is a ratio (Euro/h), and the additional information (weekly salary in Euro) is related to the numerator of this ratio. Hence, we will use the harmonic average.

Table 2.15  Hourly and weekly salary of four students

  Student   Euro/h   Weekly salary in Euro
  A         18       180
  B         20       300
  C         15       270
  D         19       380

Table 2.16  Hourly salary and working hours of four students

  Student   Euro/h   Working hours
  A         18       10
  B         20       15
  C         15       18
  D         19       20

  x̄_H = ∑_j g_j / ∑_j (g_j / x_j) = (180 + 300 + 270 + 380) / (180/18 + 300/20 + 270/15 + 380/19) = 1130 / 63 = 17.94

These four students earn on average 17.94 Euro/h (Table 2.15).
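A weighted harmonic mean along these lines (the function name is ours) reproduces both examples:

```python
def weighted_harmonic_mean(ratios, numerator_weights):
    """Harmonic average of ratios x_j, weighted by quantities g_j
    that refer to the numerator of the ratio."""
    return sum(numerator_weights) / sum(
        g / x for g, x in zip(numerator_weights, ratios))

# Example 1: average speed over road segments of different lengths
speed = weighted_harmonic_mean([40, 50, 80, 100], [2, 4, 3, 8])
assert round(speed, 3) == 68.687

# Example 2: average hourly wage, weighted by weekly salary
wage = weighted_harmonic_mean([18, 20, 15, 19], [180, 300, 270, 380])
assert round(wage, 2) == 17.94
```

With all weights equal, the function reduces to the plain harmonic average n / Σ(1/x_i).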
The situation changes if we are given the number of hours worked per week (instead of the weekly salary). Now, the additional information (weekly working hours) is related to the denominator of the ratio. Hence, we can use an arithmetic average, in this case the weighted arithmetic average. 18 10 C 20 15 C 15 18 C 19 20 1130 xN D D D 17:94 10 C 15 C 18 C 20 63 The average salary is again 17.94 Euro/h. Geometric Average The geometric mean, denoted xN G , is used to calculate the mean value of variables which are positive, are ratios (e.g., rate of growth) and are multiplicatively related. p xN G D n x1 x2 xn 2.5 Location Parameters: Mean Values—Harmonic Mean, Geometric Mean 53 The logarithm of the geometric average is equal to the arithmetic average of the logarithms of the observations: 1X log xN G D n log xi n iD1 Mean Growth Rate and Forecast Let x0 ; x1 ; : : : ; xn be the measurements ordered according to the time of observation from 0 to n. The growth rates can be calculated as it D xt =xt1 i1 i2 in D xn =x0 The product of all growth rates is equal to the total growth from time 0 to n. The average growth rate will be obtained as a geometric average of the growth rates in distinct time periods: p p xn {Ng D n i1 i2 in D n x0 Knowing the mean growth rate and the value in time n, we can forecast the value in time n C T. x?nCT D xn .N{G /T Solving this equation with respect to T, we obtain a formula for the time which is necessary to reach the given value: log.xnCT / log.xn / TD log.N{G / Example 1 Now we calculate: mean value (geometric average) forecast for 1990 time (year), when GDP reaches the value 2500. r 8 1971:8 {NG D D 1:0162 1733:8 x?1990 D 1971:8 1:01622 D 2036:2 bn DM log.2500/ log.1971:8/ TD D 14:77 years. 
Table 2.17 Gross domestic product (GDP) for Germany in 1985 prices (bn DM)

Year   t   GDP x_t   i_t
1980   0   1733.8    –
1981   1   1735.7    1.0011
1982   2   1716.5    0.9889
1983   3   1748.4    1.0186
1984   4   1802.0    1.0307
1985   5   1834.5    1.0180
1986   6   1874.4    1.0217
1987   7   1902.3    1.0149
1988   8   1971.8    1.0365

The value of GDP of 2500 is forecasted for the year 1988 + 15 = 2003 (Table 2.17).

Table 2.18 German stock index (DAX) during the period 1990–1997

Year                1989   1990     1991     1992    1993     1994    1995    1996     1997
DAX (end of year)   1791   1399     1579     1546    2268     2107    2254    2889     4250
DAX (change)        –      −21.9%   +12.9%   −2.1%   +46.7%   −7.1%   +7.0%   +28.2%   +47.1%

Example 2

The German stock index (DAX) changed during the period 1990–1997 as shown in Table 2.18. We want to find the average yearly change in the DAX over this period. Use of the arithmetic average leads to an incorrect result, as illustrated below:

x̄ = ((−21.9) + 12.9 + (−2.1) + 46.7 + (−7.1) + 7.0 + 28.2 + 47.1) / 8 = 110.8/8 = 13.85%

Starting in the year 1989 and using this "average change of the DAX" to calculate the value of the DAX through 1997, one obtains:

1990: 1791 · 1.1385 = 2039
1991: 2039 · 1.1385 = 2321
...
1997: 4440 · 1.1385 = 5055

The result 5055 is much higher than the actual value of the DAX in 1997, which was 4250. The correct mean value is, in this case, the geometric mean, because it measures growth over a certain period. The value of the DAX in 1990 can be calculated from the value in 1989 and the relative change as follows:

DAX_1990 = (1 + (−0.219)) · DAX_1989 = 0.781 · 1791 = 1399

Analogously, we can "forecast" the value for 1991 from the relative change and the value of the DAX in 1990:

DAX_1991 = (1 + 0.129) · DAX_1990 = 1.129 · 1399 = 1579

The values are multiplicatively related.
The geometric mean yields the following:

x̄_G = ⁸√(0.781 · 1.129 · 0.979 · 1.467 · 0.929 · 1.070 · 1.282 · 1.471) = 1.1141

The average growth rate per year of the DAX over the period 1990–1997 was 11.41%. Using this geometric mean and the value of the DAX in 1989 to predict the value of the DAX in 1997, we obtain the correct result:

1990: 1791 · 1.1141 = 1995
1991: 1995 · 1.1141 = 2223
...
1997: 3815 · 1.1141 = 4250

The average growth rate of the DAX in 1990–1997 can also be used to forecast the value at the end of the year 1999. We obtain the prediction:

DAX_1999 = DAX_1997 · 1.1141² = 4250 · 1.1141² = 5275

2.6 Measures of Scale or Variation

The various measures of location outlined in the previous sections are not sufficient for a good description of one-dimensional data. An illustration of this follows.

Monthly expenditures for free time and holidays (in EUR):

Data from 10 two-person households: 210, 250, 340, 360, 400, 430, 440, 450, 530, 630

Data from 10 four-person households: 340, 350, 360, 380, 390, 410, 420, 440, 460, 490

The arithmetic average x̄ is in both cases equal to 404 EUR, but plots of the data show visible differences between the two distributions. For households with four people the values are more concentrated around the center (in this case the mean) than for households with two people, i.e., the spread or variation is smaller.

Measures of scale measure the variability of data. Together with measures of location (such as means, medians, and modes) they provide a reasonable description of one-dimensional data. Intuitively, one would want measures of dispersion to have the property that if the same constant is added to each of the data points, the measure is unaffected. A second property is that if the data are spread further apart, for example through multiplication by a constant greater than one, the measure should increase.
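The household example and the two intuitive requirements on dispersion measures can be checked numerically; this sketch uses the standard deviation (introduced formally below) as the dispersion measure:

```python
from statistics import mean, pstdev

# Monthly free-time expenditures (EUR) for the two groups of households
two_person  = [210, 250, 340, 360, 400, 430, 440, 450, 530, 630]
four_person = [340, 350, 360, 380, 390, 410, 420, 440, 460, 490]

# Identical location, clearly different spread
print(mean(two_person), mean(four_person))   # both 404
print(round(pstdev(two_person), 1))          # 117.7
print(round(pstdev(four_person), 1))         # 46.7

# Shift invariance: adding a constant does not change the spread
shifted = [x + 100 for x in two_person]
assert abs(pstdev(shifted) - pstdev(two_person)) < 1e-9

# Scale equivariance: multiplying by 3 triples the spread
scaled = [3 * x for x in two_person]
assert abs(pstdev(scaled) - 3 * pstdev(two_person)) < 1e-9
```

`pstdev` is the population standard deviation (division by n), matching the chapter's definition of s.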
Range

The range is the simplest measure of scale.

Range for Ungrouped Data

The range, denoted R, is defined as the difference between the largest and the smallest observed value:

R = x_max − x_min = x_(n) − x_(1)

where x_(1), ..., x_(n) are the ordered data, i.e., the order statistics.

Range for Grouped Data

For grouped data, the range R is defined as the difference between the upper bound of the last (highest) class, x^u_k, and the lower bound of the first (lowest) class, x^l_1:

R = x^u_k − x^l_1

Properties

For a linear transformation we have:

y_i = a + b·x_i ⟹ R_y = |b|·R_x

Note that addition of the constant a, which merely shifts the data, does not affect the measure of variability.

Interquartile Range

The interquartile range is the difference between the third quartile x_0.75 and the first quartile x_0.25:

QA = x_0.75 − x_0.25

The interquartile range is the width of the central region which captures 50% of the observed data. The interquartile range relative to the median is defined as QA_r = QA/x_0.5.

Properties

Robust towards extreme values (outliers).

Linear transformation: y_i = a + b·x_i ⟹ QA_y = |b|·QA_x

Again, addition of the constant a does not affect the measure of variability.

Mean Absolute Deviation

The mean of the absolute deviations of the observed values from a fixed point c is called the mean absolute deviation (MAD) and is denoted by d. The fixed point c can be any value. Usually, it is chosen to be one of the measures of location, typically the mean x̄ or the median x_0.5. As with the range and the interquartile range, adding the same constant to all the data leaves the measure unchanged; multiplication by a constant rescales the measure by the absolute value of that same constant. The first formula below may be used for ungrouped data.
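The range and interquartile range can be computed directly. Note that quartile conventions differ between textbooks and software, so the quartiles below (from Python's `statistics.quantiles` with the "inclusive" rule) may deviate slightly from hand computations that use another rule:

```python
from statistics import quantiles

data = [2, 5, 9, 20, 22, 23, 29]

# Range: difference of the extreme order statistics
r = max(data) - min(data)
print(r)  # 27

# Quartiles and interquartile range under the "inclusive" interpolation rule
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Whatever quartile rule is used, the middle quantile q2 agrees with the median, and adding a constant to every observation leaves both r and iqr unchanged.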
If the data have been grouped, then one would use the second formula, where the x_j are the mid-points of the classes and h(x_j) and f(x_j) are the absolute and relative frequencies:

d = (1/n) Σ_{i=1}^{n} |x_i − c|

d = (1/n) Σ_{j=1}^{k} |x_j − c|·h(x_j) = Σ_{j=1}^{k} |x_j − c|·f(x_j)

Properties

The optimality property of the median implies that the median is the value which minimizes the mean absolute deviation. Thus any other value substituted for c above would yield a larger value of this measure. For a linear transformation of the data:

y_i = a + b·x_i ⟹ d_y = |b|·d_x

Example

Observed values: 2, 5, 9, 20, 22, 23, 29

x_0.5 = 20, d(x_0.5) = 8.29
x̄ = 15.71, d(x̄) = 8.90

The Variance and the Standard Deviation

The mean of the squared deviations of the observed values from a certain fixed point c is called the mean squared error (MSE) or the mean squared deviation. The point c can be chosen ad libitum:

MQ(c) = (1/n) Σ_{i=1}^{n} (x_i − c)²

MQ(c) = (1/n) Σ_{j=1}^{k} (x_j − c)²·h(x_j) = Σ_{j=1}^{k} (x_j − c)²·f(x_j)

The Variance

If we choose the point c to be the mean x̄, then the MSE is called the variance. The variance of the observed values is denoted by s² and may be computed as follows:

s² = (1/n) Σ_{i=1}^{n} (x_i − x̄)² = (1/n) Σ_{i=1}^{n} x_i² − x̄²

s² = (1/n) Σ_{j=1}^{k} (x_j − x̄)²·h(x_j) = Σ_{j=1}^{k} (x_j − x̄)²·f(x_j)

Standard Deviation

The standard deviation s is defined as the square root of the variance:

s = √s² = √[(1/n) Σ_{i=1}^{n} (x_i − x̄)²]

s = √[(1/n) Σ_{j=1}^{k} (x_j − x̄)²·h(x_j)] = √[Σ_{j=1}^{k} (x_j − x̄)²·f(x_j)]

The variance s² (and therefore also the standard deviation s) is always greater than or equal to 0. Zero variance implies that the observed data are all identical and consequently do not have any spread.

Properties

The mean squared error with respect to x̄ (the variance) is smaller than the mean squared error with respect to any other point c.
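The MAD example above, and the optimality of the median, can be verified with a few lines (the helper name `mad` is ours):

```python
from statistics import mean, median

data = [2, 5, 9, 20, 22, 23, 29]

def mad(xs, c):
    """Mean absolute deviation of xs around a fixed point c."""
    return sum(abs(x - c) for x in xs) / len(xs)

m, xbar = median(data), mean(data)
print(round(mad(data, m), 2))     # 8.29, MAD around the median
print(round(mad(data, xbar), 2))  # 8.9,  MAD around the mean

# The median minimizes the MAD, so no other candidate point does better
assert all(mad(data, m) <= mad(data, c) + 1e-12 for c in data)
```

As the text states, the MAD around the median (8.29) is smaller than the MAD around any other point, including the mean (8.90).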
This result can be proved as follows:

MSE(c) = (1/n) Σ_{i=1}^{n} (x_i − c)² = (1/n) Σ_{i=1}^{n} (x_i − x̄ + x̄ − c)²
       = (1/n) [ Σ_{i=1}^{n} (x_i − x̄)² + 2(x̄ − c) Σ_{i=1}^{n} (x_i − x̄) + n(x̄ − c)² ]
       = (1/n) Σ_{i=1}^{n} (x_i − x̄)² + (x̄ − c)² = s² + (x̄ − c)²

The middle term of the middle line vanishes since Σ_{i=1}^{n} (x_i − x̄) = 0. These formulas imply that the mean squared error MSE(c) is always greater than or equal to the variance. Obviously, equality holds only if c = x̄.

For linear transformations we have:

y_i = a + b·x_i ⟹ s²_y = b²·s²_x, s_y = |b|·s_x

Standardization: by subtracting the mean and dividing by the standard deviation one creates a new dataset for which the mean is zero and the variance is one. Let z_i = a + b·x_i with a = −x̄/s_x and b = 1/s_x; then

z_i = (x_i − x̄)/s_x ⟹ z̄ = 0, s²_z = 1

Example

Observed values: 2, 5, 9, 20, 22, 23, 29

x_0.5 = 20, MSE(x_0.5) = 109.14
x̄ = 15.71, MSE(x̄) = variance = 90.78

Theorem (pooling)

Let us assume that the observed values (data) are divided into r groups with n_i (i = 1, ..., r) observations. Assume also that the means and variances in these groups are known. To obtain the variance s² of the pooled data we may use:

s² = Σ_{i=1}^{r} (n_i/n)·s_i² + Σ_{i=1}^{r} (n_i/n)·(x̄_i − x̄)²

where x̄_1, ..., x̄_r are the arithmetic averages in the groups, s²_1, ..., s²_r are the variances in the groups, and n_1, ..., n_r are the numbers of observations in the groups, with n = n_1 + ... + n_r.

Variance Decomposition

The above formula illustrates that the variance can be decomposed into two parts:

Total variance = variance within the groups + variance between the groups.

Coefficient of Variation

In order to compare the standard deviations for different distributions, we introduce a relative measure of scale (relative to the mean), the so-called coefficient of variation.
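Both the decomposition MSE(c) = s² + (x̄ − c)² and the pooling theorem can be checked numerically on the example data; the split of the data into two groups below is our own arbitrary choice for illustration:

```python
from statistics import mean, pvariance

data = [2, 5, 9, 20, 22, 23, 29]
xbar = mean(data)

def mse(xs, c):
    """Mean squared deviation of xs around a fixed point c."""
    return sum((x - c) ** 2 for x in xs) / len(xs)

# MSE(c) = s^2 + (xbar - c)^2; check it for c = median = 20
s2 = pvariance(data)  # population variance, division by n
assert abs(mse(data, 20) - (s2 + (xbar - 20) ** 2)) < 1e-9
print(round(mse(data, 20), 2), round(s2, 2))  # 109.14 90.78

# Pooling: within-group part + between-group part = total variance
g1, g2 = [2, 5, 9], [20, 22, 23, 29]
n1, n2, n = len(g1), len(g2), len(data)
pooled = ((n1 / n) * pvariance(g1) + (n2 / n) * pvariance(g2)
          + (n1 / n) * (mean(g1) - xbar) ** 2
          + (n2 / n) * (mean(g2) - xbar) ** 2)
assert abs(pooled - s2) < 1e-9
```

Note that `pvariance` divides by n, matching the chapter's definition of s².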
The coefficient of variation expresses variation as a fraction of the mean:

v = s/x̄, for x̄ > 0

Example

The mean values and the standard deviations of two sets of observations are:

x̄_1 = 250, s_1 = 10
x̄_2 = 750, s_2 = 30

By comparing the standard deviations, we would conclude that the variation in the second dataset is three times higher than the variation in the first. But in this case it is more appropriate to compare the coefficients of variation, since the data have very different means:

v_1 = 10/250 = 0.04
v_2 = 30/750 = 0.04

The relative spread of both datasets is the same.

Explained: Variations of Pizza Prices

The price (in EUR) of Dr. Oetker pizza was collected in 20 supermarkets in Berlin (Fig. 2.26):

3.99, 4.50, 4.99, 4.79, 5.29, 5.00, 4.19, 4.90, 4.99, 4.79, 4.90, 4.69, 4.89, 4.49, 5.09, 4.89, 4.99, 4.29, 4.49, 4.19

The average price for a pizza in these 20 supermarkets is 4.72 Euro (= mean). The median price is 4.84 Euro (= median). The difference between the highest and the smallest price is 1.30 Euro (= range). The MAD calculated around the mean is 0.29 Euro; calculated around the median it is 0.28 Euro (= MAD). 50% of all prices lie in the interval between 4.49 Euro (quartile x_0.25) and 4.99 Euro (quartile x_0.75); this interval is of width 0.50 Euro (= interquartile range). The mean square error around the mean is 0.12241 Euro² (= variance); the square root of the variance is 0.34987 Euro (= standard deviation).

Fig. 2.26 Prices for pizza in 20 supermarkets—parameters of scale (upper panel: arithmetic mean, standard deviation, range; lower panel: median, interquartile range, range)

Enhanced: Parameters of Scale for Cars

The prices of 74 types of cars in USD were collected in 1985. The data are displayed in Fig. 2.27. The upper panel displays the range (green), the arithmetic average (black), and the standard deviation (red).
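The pizza-price summary statistics can be recomputed from the raw data (all values rounded to two decimals, as in the text):

```python
from statistics import mean, median

# Pizza prices (EUR) in 20 Berlin supermarkets
prices = [3.99, 4.50, 4.99, 4.79, 5.29, 5.00, 4.19, 4.90, 4.99, 4.79,
          4.90, 4.69, 4.89, 4.49, 5.09, 4.89, 4.99, 4.29, 4.49, 4.19]

xbar = mean(prices)
print(round(xbar, 2))                        # 4.72 (mean)
print(round(median(prices), 2))              # 4.84 (median)
print(round(max(prices) - min(prices), 2))   # 1.3  (range)

# MAD around the mean
mad_mean = sum(abs(p - xbar) for p in prices) / len(prices)
print(round(mad_mean, 2))                    # 0.29
```

This kind of sanity check is useful precisely because single-digit typos in summary tables are hard to spot by eye.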
The lower panel displays the range (green), the median (mint green), and the interquartile range (magenta).

Arithmetic average: 4896.417
Median: 4672
Range: 4536
Interquartile range: 1554.75
Standard deviation: 991.2394

During a check of the data, it was discovered that there was an input error: the correct value of 15962 USD had been recorded as 5962 USD. Figure 2.28 contains the corrected results.

Arithmetic average: 5063.083
Median: 4672
Range: 12508
Interquartile range: 1554.75
Standard deviation: 1719.064

It is clear that the range increased, because it is a function of the extreme values. The value of the interquartile range did not change, since no prices within this range were altered. The standard deviation increased significantly. The reason is that the standard deviation is calculated from all observed prices and involves the squares of deviations, which makes it particularly sensitive to extreme values (outliers).

Fig. 2.27 Prices of 74 cars in USD—upper panel: range (green), arithmetic average (black), and the standard deviation (red); lower panel: range (green), median (mint green), and the interquartile range (magenta)

The investigation was repeated after some time. The results are presented in Fig. 2.29.

Arithmetic average: 6165.257
Median: 5006.5
Range: 12615
Interquartile range: 2112
Standard deviation: 2949.496

Now there are a number of expensive vehicles whose prices differ substantially from those of the lower-priced cars. Thus the prices are skewed to the right. For skewed distributions, the standard deviation is typically higher than the interquartile range. This feature is demonstrated in the above example.

Interactive: Dotplot with Scale Parameters

The interactive example includes a number of sidebar panels. You can access the panels by setting a mark at the corresponding check box on the upper right.
Fig. 2.28 Corrected prices of 74 cars in USD—upper panel: range (green), arithmetic average (black), and the standard deviation (red); lower panel: range (green), median (mint green), and the interquartile range (magenta)

Please select

a dotplot type, e.g., jitter
if you like the mean, median, range, or interquartile range to be included in the plot

The last two panels allow you to choose a dataset or variable and to change the font size. For a detailed explanation of datasets and variables, please refer to Appendix A.

Output

The interactive example in Fig. 2.30 allows us to display a one-dimensional frequency distribution in the form of a dotplot for a variety of variables. Possible values are displayed along the horizontal axis. For easier visualization, the observations may be randomly shifted (jittered) in the vertical direction. Furthermore, the median, the arithmetic mean, the range, and the interquartile range can be included.

Fig. 2.29 Repeated investigation of prices of 74 cars in USD—upper panel: range (green), arithmetic average (black), and the standard deviation (red); lower panel: range (green), median (mint green), and the interquartile range (magenta)

Fig. 2.30 Screenshot of the interactive example, available at http://u.hu-berlin.de/men_dot2

2.7 Graphical Display of the Location and Scale Parameters

Boxplot (Box-Whisker-Plot)

Unlike the stem-and-leaf diagram, the boxplot does not contain information about all observed values. It displays only the most important information about the frequency distribution. Specifically, the boxplot contains the smallest and the largest observed values x_(1) and x_(n) and the three quartiles x_0.25, x_0.5, and x_0.75. The second quartile x_0.5 is of course the median (Fig. 2.31).

Fig. 2.31 The structure of a boxplot: box from x_0.25 to x_0.75 with the median inside; fences at x_0.25 − 1.5·QA and x_0.75 + 1.5·QA; extreme marks at x_0.25 − 3·QA and x_0.75 + 3·QA

The quartiles are denoted by lines, and the first and third quartiles are connected so that we obtain a box. The line inside this box denotes the median. The height of this box is the interquartile range, i.e., the difference between the third and the first quartile, x_0.75 − x_0.25. Inside this box, one finds the central 50% of all observed values. The whiskers show the smallest and largest values within a 1.5 multiple of the interquartile range, calculated from the boundary of the box. The bounds x_0.25 − 1.5·QA and x_0.75 + 1.5·QA are called the lower and upper fence, respectively. The values lying outside the fences are marked as outliers with a different symbol. Usually, the boxplot also displays the mean as a dashed line. The boxplot provides quick insight into the location, scale, shape, and structure of the data.

Example—boxplot of student salaries (wage in USD)

Example—boxplot of student salaries (wage in USD); males and females separated

Explained: Boxplot of Car Prices

The prices of 74 types of cars were obtained in 1983. The results are displayed in Fig. 2.32. The upper panels of the graphs contain dotplots. The lower panels show boxplots. The values lying outside a 1.5 multiple (resp. 3 multiple) of the interquartile range are denoted as extreme (outlying) observations. These outlying observations produce a large difference between the median (solid line) and the mean (dashed line).
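The fence computation behind a boxplot can be sketched as follows. The function name `five_number_summary` is ours, and the "inclusive" quartile rule is an assumption; software packages differ in their quartile conventions, so fence positions can vary slightly:

```python
from statistics import quantiles

def five_number_summary(xs):
    """Quartiles, boxplot fences, and the resulting outliers."""
    q1, q2, q3 = quantiles(sorted(xs), n=4, method="inclusive")
    qa = q3 - q1                                   # interquartile range
    lower_fence, upper_fence = q1 - 1.5 * qa, q3 + 1.5 * qa
    outliers = [x for x in xs if x < lower_fence or x > upper_fence]
    return q1, q2, q3, lower_fence, upper_fence, outliers

data = [2, 5, 9, 20, 22, 23, 29, 80]   # 80 is an artificial outlier
q1, q2, q3, lo, hi, out = five_number_summary(data)
print(out)  # [80]
```

Only observations beyond the fences are drawn individually; the whiskers stop at the most extreme observations still inside the fences.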
Table 2.19 Example—Student salaries in USD

           Total      Men        Women
x_min      1          1          1.74997
x_max      44.5005    26.2903    44.5005
R          43.5005    25.2903    42.7505
x_0.25     5.24985    6.00024    4.74979
x_0.5      7.77801    8.92985    6.79985
x_0.75     11.2504    12.9994    10.0001
QA         6.00065    6.99916    5.25031
x̄          9.02395    9.99479    7.87874
s²         26.408     27.9377    22.2774
s          5.13887    5.28562    4.7199
v          0.57       0.53       0.60

Fig. 2.32 Boxplot of prices of 74 cars (extreme values marked beyond the fences)

Interactive: Visualization of One-Dimensional Distributions

The interactive example includes a number of sidebar panels. You can access the panels by setting a mark at the corresponding check box on the upper right. Please choose

a dotplot type, e.g., jitter
the number of bins for the histogram
if you like the mean and median to be included in the plots

The last two panels allow you to choose a dataset or variable and to change the font size. For a detailed explanation of datasets and variables, please refer to Appendix A.

Output

The interactive example in Fig. 2.33 allows us to display a one-dimensional frequency distribution in the form of a dotplot, a histogram, a boxplot, and a cumulative distribution function for a variety of variables. Possible values are displayed along the horizontal axis. For easier visualization, the observations may be randomly shifted (jittered) in the vertical direction. Furthermore, the median and the arithmetic mean can be included. You also receive a table showing the numerical values of certain parameters.

Fig. 2.33 Screenshot of the interactive example, available at http://u.hu-berlin.de/men_vis

Chapter 3 Probability Theory

3.1 The Sample Space, Events, and Probabilities

Probability theory is concerned with the outcomes of random experiments.
These can be either real-world processes or thought experiments. In both cases, the experiment has to be infinitely repeatable and there has to be a well-defined set of outcomes. The set of all possible outcomes of an experiment is called the sample space, which we will denote by S.

Consider the process of rolling a die. The set of possible outcomes is the set S = {1, 2, 3, 4, 5, 6}. Each element of S is a basic outcome. However, one might be interested in whether the number thrown is even, or whether it is greater than 3, and so on. Thus we need to be able to speak of various combinations of basic outcomes, that is, subsets of S. An event is defined to be a subset of the set of possible outcomes S. We will denote an event using the symbol E. Events which consist of only one element, such as "a two was thrown," are called simple events or elementary events. Simple events are by definition not divisible into more basic events, as each of them includes one and only one possible outcome.

Example

Rolling a single die once results in the occurrence of one of the simple events {1}, {2}, {3}, {4}, {5}, {6}. As we have indicated, the sample space S is {1, 2, 3, 4, 5, 6}.

Example

For tossing a coin twice we have the sample space S = {TT, TH, HT, HH} and the simple events {TT}, {TH}, {HT}, {HH}, where T = Tail and H = Head. This specification also holds if two coins are tossed once.

It will be convenient to be able to combine events in various ways, in order to make statements such as "one of these two events happened" or "both events occurred." For example, one might want to say that "either a 2 or 4 was thrown."

© Springer International Publishing Switzerland 2015
W.K. Härdle et al., Introduction to Statistics, DOI 10.1007/978-3-319-17704-5_3

Fig.
3.1 A simple Venn diagram: the sample space S, an event A, and its complement Ā

One might equally want to say that "an even number larger than 3 was thrown." Since events are sets (in particular, subsets of the set S), we may draw upon the conventional tools of set theory.

Venn Diagram

A common graphical representation of events as subsets of the sample space is the Venn diagram (Fig. 3.1). It can be used to visualize various combinations of events such as intersections and unions.

3.2 Event Relations and Operations

In the last section, we defined events as subsets of the sample space S. In interpreting events as sets, we can apply the same operations and relations to events that we know from basic set theory. We shall now recapitulate some of the most important concepts of set theory.

Subsets and Complements

"A is a subset of B" is denoted by A ⊆ B. Thus if event A occurs, B occurs as well (Fig. 3.2). A and B are equivalent events if and only if (abbreviated as "iff") A ⊆ B and B ⊆ A. Any event A is obviously a subset of S, A ⊆ S. We define the complement of A, denoted by Ā, to be the set of points in S that are not in A.

Fig. 3.2 Venn diagram for the event relation "A is a subset of B," A ⊆ B

Union of Sets

The set of points belonging to either the set A or the set B is called the union of the sets A and B, and is denoted by A ∪ B. Thus if the event "A or B" has occurred, then a basic outcome in the set A ∪ B has taken place (Fig. 3.3). Set unions can be extended to n sets and hence to n events A_1, A_2, ..., A_n, in which case we have

A_1 ∪ A_2 ∪ ... ∪ A_n = ∪_{i=1}^{n} A_i

Example

Rolling a die once: define A = {1, 2} and B = {2, 4, 6}. Then A ∪ B = {1, 2, 4, 6}.

General Results

A ∪ A = A
A ∪ S = S, where S is the sample space
A ∪ ∅ = A, where ∅ is the null set, the set with no elements in it
A ∪ Ā = S

Intersection of Sets

The set of points common to the sets A and B is known as the intersection of A and B, A ∩ B. Thus if the event "A and B" has occurred, then a basic outcome in the set A ∩ B has taken place (Fig. 3.4).
Set intersections can be extended to n sets and hence to n events A_1, A_2, ..., A_n:

A_1 ∩ A_2 ∩ ... ∩ A_n = ∩_{i=1}^{n} A_i

Fig. 3.3 Venn diagram for the union of two sets, A ∪ B

Fig. 3.4 Venn diagram for the intersection of two sets, A ∩ B

Fig. 3.5 Venn diagram of disjoint events A and B

Example

Rolling a die once: define A = {1, 2} and B = {2, 4, 6}. Then A ∩ B = {2}.

General Results

A ∩ A = A
A ∩ S = A
A ∩ ∅ = ∅
A ∩ Ā = ∅
∅ ∩ S = ∅

Disjoint Events

Two sets or events are said to be disjoint (or mutually exclusive) if their intersection is the empty set: A ∩ B = ∅. Interpretation: events A and B cannot occur simultaneously (Fig. 3.5). By definition, A and Ā are mutually exclusive. The reverse does not hold, i.e., disjoint events are not necessarily complements of each other.

Example

Rolling a die once: define A = {1, 3, 5} and B = {2, 4, 6}. Then B = Ā and A = B̄.

⟹ A ∩ B = A ∩ Ā = ∅

Interpretation: events A and B are disjoint and complementary.

Define C = {1, 3} and D = {2, 4}.

⟹ C ∩ D = ∅

Interpretation: events C and D are disjoint but not complementary.

Logical Difference of Sets or Events

The set or event C is the logical difference of events A and B if it represents the event "A has occurred but B has not occurred," i.e., it consists of the outcomes in A that are not in B:

A∖B = C = A ∩ B̄ (Fig. 3.6)

Fig. 3.6 Venn diagram for the logical difference of two events, A∖B

Example

Rolling a six-sided die once: define A = {1, 2, 3} and B = {3, 4}. Then A∖B = C = {1, 2} and B∖A = {4}.
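Since events are just sets, the die-rolling examples above map directly onto set operations (Python sets used for illustration):

```python
# Events from the die-rolling examples, represented as sets
S = {1, 2, 3, 4, 5, 6}
A = {1, 2}
B = {2, 4, 6}

print(A | B)   # union A ∪ B: {1, 2, 4, 6}
print(A & B)   # intersection A ∩ B: {2}
print(A - B)   # logical difference A\B: {1}
print(S - A)   # complement of A within S: {3, 4, 5, 6}

# Disjoint is weaker than complementary
C, D = {1, 3}, {2, 4}
assert C.isdisjoint(D)               # disjoint ...
assert C | D != S                    # ... but not complementary
assert {1, 3, 5} | {2, 4, 6} == S    # whereas these two are complementary
```

Note that the complement must always be taken relative to the sample space S, which is why it is written here as `S - A`.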
Disjoint Decomposition of the Sample Space

A set of events A_1, A_2, ..., A_n is called a disjoint decomposition of S if the following conditions hold:

A_i ≠ ∅ (i = 1, 2, ..., n)
A_i ∩ A_k = ∅ (i ≠ k; i, k = 1, 2, ..., n)
A_1 ∪ A_2 ∪ ... ∪ A_n = S

One can think of such a decomposition as a partition of the sample space where each basic outcome falls into exactly one set or event. Sharing a birthday cake results in a disjoint decomposition or partition of the cake.

Example

Rolling a six-sided die. Sample space: S = {1, 2, 3, 4, 5, 6}. Define A_1 = {1}, A_2 = {3, 4}, A_3 = {1, 3, 4}, A_4 = {5, 6}, A_5 = {2, 5}, A_6 = {6}.

Claim: one possible disjoint decomposition is given by A_1, A_2, A_5, A_6.

Proof: A_1 ∩ A_2 = ∅, A_1 ∩ A_5 = ∅, A_1 ∩ A_6 = ∅, A_2 ∩ A_5 = ∅, A_2 ∩ A_6 = ∅, A_5 ∩ A_6 = ∅, and A_1 ∪ A_2 ∪ A_5 ∪ A_6 = S.

Table 3.1 Summary of event relations

Verbal                                     Technical                          Algebraic
If A occurs, then B occurs also            A is a subset of B                 A ⊆ B
B and A always occur together              A and B are equivalent events      A = B
A and B cannot occur together              A and B are disjoint events        A ∩ B = ∅
A occurs if and only if B does not occur   A and B are complementary events   B = Ā
A occurs iff at least one A_i occurs       A is the union of the A_i          A = ∪_i A_i
A occurs iff all A_i occur                 A is the intersection of all A_i   A = ∩_i A_i

Some Set-Theoretic Laws

De Morgan's laws: the complement of A ∩ B equals Ā ∪ B̄; the complement of A ∪ B equals Ā ∩ B̄.

Associative laws:
(A ∩ B) ∩ C = A ∩ (B ∩ C)
(A ∪ B) ∪ C = A ∪ (B ∪ C)

Commutative laws:
A ∩ B = B ∩ A
A ∪ B = B ∪ A

Distributive laws:
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)

3.3 Probability Concepts

Probability is a measure P(·) which quantifies the degree of (un)certainty associated with an event. We will discuss three common approaches to probability.

Classical Probability

Laplace's classical definition of probability is based on equally likely outcomes.
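The three partition conditions can be checked mechanically; this sketch (the function name is ours) verifies the claim from the die example:

```python
from itertools import combinations

def is_disjoint_decomposition(parts, S):
    """Check the three conditions: non-empty parts, pairwise disjoint, cover S."""
    nonempty = all(len(p) > 0 for p in parts)
    pairwise_disjoint = all(a.isdisjoint(b) for a, b in combinations(parts, 2))
    covers = set().union(*parts) == S
    return nonempty and pairwise_disjoint and covers

S = {1, 2, 3, 4, 5, 6}
A1, A2, A3, A4, A5, A6 = {1}, {3, 4}, {1, 3, 4}, {5, 6}, {2, 5}, {6}

print(is_disjoint_decomposition([A1, A2, A5, A6], S))  # True, as claimed
print(is_disjoint_decomposition([A1, A3, A4], S))      # False: A1 and A3 overlap
```

Checking pairwise disjointness requires all n(n−1)/2 pairs, which is exactly what the proof in the text enumerates for n = 4.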
He postulates the following properties of events:

the sample space is composed of a finite number of basic outcomes
the random process generates exactly one basic outcome and hence one elementary event
the elementary events are equally likely, i.e., occur with the same probability

Accepting these assumptions, the probability of any event A (a subset of the sample space) can be computed as

P(A) = #(basic outcomes in A) / #(basic outcomes in S) = #(elementary events comprising A) / #(elementary events comprising S)

Properties

0 ≤ P(A) ≤ 1
P(∅) = 0
P(S) = 1

Example

Rolling a six-sided die. Sample space: S = {1, 2, 3, 4, 5, 6}. Define event A = "any even number". Elementary events in A: {2}, {4}, {6}.

P(A) = 3/6 = 0.5

Statistical Probability

Richard von Mises originated the relative frequency approach to probability: the probability P(A) of an event A is defined as the limit of the relative frequency of A, i.e., the value the relative frequency converges to if the experiment is repeated an infinite number of times. It is assumed that the replications are independent of each other.

Let h_n(A) denote the absolute frequency of A occurring in n repetitions. The relative frequency of A is then defined as

f_n(A) = h_n(A) / n

According to the statistical concept of probability we have

P(A) = lim_{n→∞} f_n(A)

Since 0 ≤ f_n(A) ≤ 1, it follows that 0 ≤ P(A) ≤ 1.

Example

Flipping a coin. Denote by A the event "head appears." Absolute and relative frequencies of A after n trials are listed in Table 3.2. This particular sample displays a non-monotonic convergence to 0.5, the theoretical probability of a head occurring in repeated flips of a "fair" coin. Visualizing the sequence of relative frequencies f_n(A) as a function of sample size, as done in Fig. 3.7, provides some intuition into the character of the convergence.

A central objective of statistics is to estimate or approximate probabilities of events using observed data.
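A coin-flip experiment like the one in the table can be simulated; the exact relative frequencies depend on the random seed, but they should hover around the classical probability 0.5 and tighten as n grows:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def relative_frequency(n):
    """f_n(A) for A = 'head appears' in n simulated fair-coin flips."""
    heads = sum(random.random() < 0.5 for _ in range(n))
    return heads / n

for n in (10, 100, 1000, 100000):
    print(n, relative_frequency(n))

# The classical (Laplace) probability of an even die roll, for comparison:
A = {2, 4, 6}
S = {1, 2, 3, 4, 5, 6}
print(len(A) / len(S))  # 0.5
```

The frequentist definition takes the limit of such relative frequencies; the simulation can only illustrate, not prove, the convergence.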
These estimates can then be used to make probabilistic statements about the process generating the data (e.g., confidence intervals, which we will study later), to test propositions about the process, and to predict the likelihood of future events.

Table 3.2 Flipping of a coin

n      h_n(A)   f_n(A)
10     7        0.700
20     11       0.550
40     17       0.425
60     24       0.400
80     34       0.425
100    47       0.470
200    92       0.460
400    204      0.510
600    348      0.580
800    404      0.505
1000   492      0.492
2000   1010     0.505
3000   1530     0.510
4000   2032     0.508
5000   2515     0.503

Fig. 3.7 Relative frequencies of A = "head appears" as a function of sample size n

Axiomatic Foundation of Probability

P is a probability measure. It is a function which assigns a number P(A) to each event A of the sample space S.

Axiom 1: P(A) is real-valued with P(A) ≥ 0.
Axiom 2: P(S) = 1.
Axiom 3: If two events A and B are mutually exclusive (A ∩ B = ∅), then P(A ∪ B) = P(A) + P(B).

Properties

Let A, B, A_1, A_2, ... ⊆ S be events and P(·) a probability measure. Then the following properties follow from the above three axioms:

1. P(A) ≤ 1
2. P(Ā) = 1 − P(A)
3. P(∅) = 1 − P(S) = 0
4. (A ∩ B = ∅) ⟹ P(A ∩ B) = P(∅) = 0
5. If A ⊆ B, then P(A) ≤ P(B)
6. If A_i ∩ A_j = ∅ for i ≠ j, then P(A_1 ∪ A_2 ∪ ...) = P(A_1) + P(A_2) + ...
7. P(A∖B) = P(A) − P(A ∩ B)

Addition Rule of Probability

Let A and B be any two events (Fig. 3.8). Then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Extension to three events A, B, C:

P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C)

Fig. 3.8 Addition rule of probability
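For a finite sample space with equally likely outcomes, the addition rule can be verified exactly by counting. The events A and B below are our own illustrative choices; exact rational arithmetic avoids any floating-point noise:

```python
from fractions import Fraction

# Equally likely outcomes of one die roll; P(E) = |E| / |S| (Laplace)
S = {1, 2, 3, 4, 5, 6}

def p(event):
    return Fraction(len(event & S), len(S))

A = {1, 2, 3}   # "at most three"
B = {2, 4, 6}   # "an even number"

# Addition rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
lhs = p(A | B)
rhs = p(A) + p(B) - p(A & B)
print(lhs, rhs)  # 5/6 5/6
assert lhs == rhs
```

Omitting the correction term P(A ∩ B) would double-count the outcome 2, which lies in both events; that is exactly what the subtraction in the addition rule repairs.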