IntroSTAT Notes PDF

INTROSTAT

Suleman Patel, Les Underhill and Dave Bradfield
Department of Statistical Sciences
University of Cape Town
December 2015

Introduction

IntroSTAT apart, there seem to be two kinds of introductory Statistics textbooks. There are those that assume no mathematics at all, and get themselves tied up in all kinds of knots trying to explain the intricacies of Statistics to students who know no calculus. There are those that assume lots of mathematics, and get themselves tied up in the knots of mathematical statistics. IntroSTAT assumes that students have a basic understanding of differentiation and integration. The book was designed to meet the needs of students, primarily those in business, commerce and management, for a course in applied statistics.

IntroSTAT is designed as a lecture-book. One of our aims is to maximize the time spent in explaining concepts and doing examples. It is for this reason that three types of examples are included in the chapters. Those labeled A are used to motivate concepts, and often contain explanations of methods within them. They are for use in lectures. The B examples are worked examples. They shouldn’t be used in lectures, because there is nothing more deadly dull than lecturing through worked examples. Students should use the B examples for private study. The C examples contain problem statements. A selection of these can be tackled in lectures without the need to waste time by the lecturer writing up descriptions of examples, and by the students copying them down.

There are probably more exercises at the end of most chapters than necessary. A selection has been marked with asterisks (∗); these should be seen as a minimum set to give experience with all the types of exercises.

Acknowledgements...

We are grateful to our colleagues who used editions 1 to 4 of IntroSTAT and made suggestions for changes and improvements. We have also appreciated comments from students.
We will continue to welcome their ideas and hope that they will continue to point out the deficiencies. Mrs Tib Cousins undertook the enormous task of turning edition 4 into TeX files, which were the basis upon which this revision was undertaken. Mrs Margaret Zaborowski helped us proofread the text, but any remaining errors are our responsibility.

This volume is essentially the 1996 edition of Introstat with some minor corrections of errors, and was reset in LaTeX from the original plain TeX version.

Contents

Introduction
1 EXPLORING DATA
2 SET THEORY
3 PROBABILITY THEORY
4 RANDOM VARIABLES
5 PROBABILITY DISTRIBUTIONS I

Chapter 1

EXPLORING DATA

KEYWORDS: Data summary and display, qualitative and quantitative data, pie charts, bar graphs, histograms, symmetric and skew distributions, stem-and-leaf plots, median, quartiles, extremes, five-number summary, box-and-whisker plots, outliers and strays, measures of spread and location, sample mean, sample variance and standard deviation, summary statistics, exploratory data analysis.

Facing up to uncertainty...

We live in an uncertain world. But we still have to take decisions. Making good decisions depends on how well informed we are. Of course, being well informed means that we have useful information to assist us. So, having useful information is one of the keys to good decision making.

Almost instinctively, most people gather information and process it to help them take decisions. For example, if you have several applicants for a vacant post, you would not draw a number out of a hat to decide which one to employ. Almost without thinking about it, you would attempt to gather as much relevant information as you can about them to help you compare the applicants. You might make a short-list of applicants to interview, and prepare appropriate questions to put to each of them. Finally you come to an informed decision.
Sometimes the available information is such that we feel it is easy to make a good decision. But at other times, so much confusion and uncertainty cloud the situation that we are inclined to go by “gut-feeling” or even by guessing. But we can do better than this. This book aims to equip you with some of the necessary skills to “outguess” the competition. Or, putting it less brashly, to help you to make consistently sound decisions.

As the world becomes more technologically advanced, people realize more and more that information is valuable. Obtaining the information they need might just require a phone call, or maybe a quick visit to the library. Sometimes, they might need to expend more energy and extract some information out of a database. Or worse, they might have to design an experiment and gather some data of their own. On other occasions, the information might be hidden in historical records.

Usually, data contains information that is not self-evident. The message cannot be extracted by simply eye-balling the data. Ironically, the more valuable the information, the more deeply it usually lies buried within the data. In these instances, statistical tools are needed to extract the information from the data. Herein lies the focus of this book.

For example, consider the record of share prices on the Johannesburg Stock Exchange. Hidden in this data lies a wealth of information: whether or not a share is risky, or if it is over- or underpriced. This data even contains traces of our own emotions (whether our sentiments are mawkish or positive, risk prone or risk averse) and our preferences (for higher dividends, for smaller companies, for blue-chip shares). Little wonder that there is a multitude of financial analysts out there trying to analyse share price data hoping that they might unearth valuable information that will deliver the promise of better profits.
Just as the financial analysts have an insatiable appetite for information on which to base better investment decisions, so in every field of human endeavour, people are analysing information with the objective of improving the decisions they take. One of the essential skills needed to extract information from data, to interpret this information, and to take decisions based on it, is Statistics.

Not everyone is willing, or has the foresight, to master a course in the science of Statistics. We are fortunate that this is true; otherwise statisticians would not hold the monopoly on superior decision-making! You have already made at least one good decision: the decision to do a course in Statistics.

What Statistics is (and is not!)...

Most people seem to think that what statisticians do all day is to count and to add. The two kinds of “statisticians” that most frequently impinge on the general public are really parodies of statisticians: the “sports” statistician and the “official” statistician. Statistics is not what you see at the bottom of the television screen during the French Open Tennis Championships: statistique, followed by a count of the number of double faults and aces the players have produced in the match so far! Nor is statistics about adding up dreary columns of figures, and coming to the conclusion, for example, that there were 30 777 000 sheep in South Africa in 1975. That sort of count is enough to put anyone to sleep!

If statisticians don’t count in the 1–2–3 sense, in what sense do they count? What is statistics? We define statistics as the science of decision making in the face of uncertainty. The emphasis is not on the collection of data (although the statistician has an important role in advising on the data collection process), but on taking things one step further: interpreting the data. Statistics may be thought of as data-based decision making. Perhaps it is a pity that our discipline is called Statistics.
A far better name would have been Decision Science. Statistics really comes into its own when the decisions to be made are not clear-cut and obvious, and there is uncertainty (even after the data has been gathered) about which of several alternative decisions is the best one to choose.

For example, the decision about which card to play in a game of bridge to maximize your chance of winning, or the decision about where to locate a factory so as to maximize the likelihood that your company’s share of the market will reach a target value are not simple decisions. In both situations, you can gather as much data as you can (the cards in your hand, and those already played in the first case, proximity to raw materials and to markets in the second), and take a best possible decision on the basis of this data, but there is still no guarantee of success. In both cases, your opposition may react in unexpected ways, and you risk defeat.

In the above sentences, the words “uncertainty”, “chance”, “likelihood” and “risk” have appeared. All these are qualitative terms. Before the statistician can get down to his or her real job (of taking decisions in the face of uncertainty), this nebulous concept of uncertainty has to be put onto a firm footing. Probability Theory is the branch of mathematics that achieves this quantification of uncertainty. Therefore, before you can become a statistician, you have to learn a hefty chunk of Probability Theory. This is contained in chapters 2 to 7. Chapters 8 to 12 deal with the science of data-based decision making. However, in the remainder of this chapter, we aim to give you insight into what is to come in the later chapters, to give you a feeling for data, and to do “data-based decision making” using intuitive concepts.

Display, summarize and interpret...
Before getting deeply involved with tackling any situation or problem in daily life, it is wise to take a step back and take a glimpse at “the big picture”, and so it is with Statistics. As a starting point, statisticians make a “quick and dirty” summary of the data they are about to analyse in order to get a “feel” for what they are dealing with. The initial overview usually involves: constructing a visual display of the data; summarizing the data with a few pertinent “key” numbers; and gaining insight into the “potential” of the data.

What do we mean by data?...

Data is information. There are data drips and data floods, and statisticians have to learn to deal with both. Usually, there is either too little or too much data! When data comes in floods, the problem is to extract the salient features. When data comes in drips, the problem is to know what are valid interpretations.

Besides the amount of data, there are different types of data. For the moment, we need to distinguish between qualitative and quantitative data. Qualitative data is usually non-numerical, and arises when we classify objects using labels or names as categories: for example, make of car, colour of eyes, gender, nationality, profession, cause of death, etc. Sometimes the categories are semi-numerical: for example, size of companies categorized as small, medium or large.

Quantitative data, on the other hand, is always numerical, and data points can be ranked or ordered. Quantitative data usually arises from measuring or counting: for example, flying time between airports, number of rooms in a house, salary of an accountant, cost of building a school, volume of water in a dam, number of new car sales in a month, the size of the AIDS epidemic, etc.

Visual displays of qualitative data...

Two neat ways of displaying qualitative data are the pie chart and the bar chart.

Example 1A: Table 1.1 contains data on a class of 81 Master of Business Administration (MBA) students.
The table shows each student’s faculty for their first degree, in either Arts, Commerce, Engineering, Medicine, Science or “other”. Also given are their test scores for an entrance examination known as the GMAT, a test commonly used by business schools worldwide as part of the information to assist in the selection process. Our brief is to construct a visual summary of the distribution of students having the various first-degree categories in the table.

Firstly, we decide that first-degree category, the data that we are being asked to display, is qualitative data. Appropriate display techniques are the pie chart and the bar graph.

Secondly, we find the frequency distribution of the qualitative data by counting the number of students falling into each category. At the same time, we calculate relative frequencies by dividing the frequency in each category by the total number of observations:

First degree    Frequency    Relative frequency
Engineering        28              0.35
Science            16              0.20
Arts               16              0.20
Commerce           10              0.12
Medicine            5              0.06
Other               6              0.07

Thirdly, we plot the pie chart and the bar graph.

Pie chart: the actual construction of a pie chart is straightforward! We have arranged the segments in anti-clockwise order, starting at “three o’clock”, but there is no hard-and-fast rule about this. The pie chart communicates most effectively if the relative frequencies are arranged in decreasing order of size.

Table 1.1: MBA Student Data

      First        GMAT         First        GMAT         First        GMAT
      degree       score        degree       score        degree       score
  1.  Engineering   610    28.  Engineering   710    55.  Arts          500
  2.  Engineering   510    29.  Science       600    56.  Arts          620
  3.  Engineering   610    30.  Science       550    57.  Arts          550
  4.  Engineering   580    31.  Science       540    58.  Arts          600
  5.  Engineering   720    32.  Science       620    59.  Arts          520
  6.  Engineering   620    33.  Science       650    60.  Arts          520
  7.  Engineering   540    34.  Science       500    61.  Commerce      550
  8.  Engineering   500    35.  Science       590    62.  Commerce      520
  9.  Engineering   750    36.  Science       630    63.  Commerce      560
 10.  Engineering   640    37.  Science       660    64.  Commerce      560
 11.  Engineering   550    38.  Science       570    65.  Commerce      600
 12.  Engineering   650    39.  Science       600    66.  Commerce      540
 13.  Engineering   600    40.  Science       630    67.  Commerce      550
 14.  Engineering   600    41.  Science       500    68.  Commerce      650
 15.  Engineering   510    42.  Science       580    69.  Commerce      510
 16.  Engineering   570    43.  Science       560    70.  Commerce      560
 17.  Engineering   620    44.  Science       550    71.  Medicine      590
 18.  Engineering   590    45.  Arts          560    72.  Medicine      700
 19.  Engineering   660    46.  Arts          550    73.  Medicine      640
 20.  Engineering   550    47.  Arts          500    74.  Medicine      680
 21.  Engineering   560    48.  Arts          510    75.  Medicine      580
 22.  Engineering   630    49.  Arts          570    76.  Other         550
 23.  Engineering   540    50.  Arts          510    77.  Other         680
 24.  Engineering   560    51.  Arts          660    78.  Other         540
 25.  Engineering   650    52.  Arts          500    79.  Other         640
 26.  Engineering   540    53.  Arts          710    80.  Other         620
 27.  Engineering   680    54.  Arts          510    81.  Other         450

[Figure: Pie chart showing proportions of M.B.A. students. Segments: Engineering, Science, Arts, Commerce, Other, Medicine.]

Bar graph: Notice that there is no quantitative scale along the vertical axis of the bar graph, that the “bars” are not connected, and that the widths of the bars have no particular relevance. Because there is no quantitative ordering of the categories, we are free to arrange them as we please. As for the pie chart, it is generally best to arrange the bars in decreasing order of relative frequency; this makes comparison easier, and also tends to highlight the important features of the data. Relative frequencies could also have been used in the construction of the bar graph.

[Figure: Bar graph of first-degree categories. Bars, from top to bottom: Engineering 28, Science 16, Arts 16, Commerce 10, Other 6, Medicine 5. Horizontal axis: Number of MBA students (0 to 30).]

The visually striking impact of both the pie chart and the bar graph is that engineers form the largest proportion of this class of MBA students. Next, we might ask for a reason for this. A plausible explanation is that engineers are not exposed to much management and administrative training during their undergraduate years, and that they make up for this by doing an MBA. A second explanation is that the data was extracted during recessionary times, when engineers were not in demand; perhaps they were “investing in themselves” by doing an MBA while projects were scarce. How would you set about investigating whether this latter explanation is correct?

Our plots have provided some insight into this data. Sometimes, all we achieve is to demonstrate the obvious. At other times, our plots will reveal completely unexpected phenomena. Careful interpretation is then needed, frequently with the help of the “experts” from the discipline that the data comes from.

Example 2C: Table 1.2 gives the composition of the “All-Share Index” of the Johannesburg Stock Exchange as at 2 January 1990. The breakdown of the All-Share Index reflects that it is composed of seven “major sector” indices, namely Coal, Diamonds, All-Gold, Metals & Minerals, Mining Financial, Financial, and Industrial Indices.

(a) Construct a bar chart showing the number of shares in each of the seven major sector indices that contribute to the All-Share Index.

(b) Now construct a bar chart showing the relative weightings of each of the seven major sectors as a percentage of the All-Share Index, and comment on any differences you find from (a) above.

Visual displays of quantitative data I : histograms...
Histograms are a time-honoured and familiar way of displaying quantitative data. We demonstrate the construction procedure by means of an example.

Example 3A: Referring back to the data on GMAT scores in Example 1A, draw a histogram for the distribution of the GMAT scores for the students.

We recommend a four-step histogram procedure:

1. Determine the size of the sample [1], i.e. the number of observations. We have n = 81 students, and throughout this book we reserve use of the symbol n for the concept of sample size, the number of numbers we are dealing with! Find the smallest and largest numbers in the sample. Call these $x_{\min}$ and $x_{\max}$, respectively. The smallest GMAT score was from student 81, who scored 450, and the largest was the 750 achieved by student 9:

$$x_{\min} = 450 \qquad x_{\max} = 750$$

[1] We consider a sample to be a small number of observations taken from the population of interest. We hope that the sample is representative of the population as a whole, so that conclusions drawn from the sample will be valid for the population. We consider methods of obtaining a representative sample in Chapter 11.

Table 1.2: Composition of the All Share Index on the JSE

                                   No. of shares    Percentage
                                   in subsidiary    weighting in       Major sector
Subsidiary sector index            index            All-Share Index    index
Coal                                     2               0.82          Coal
Diamonds                                 1               8.27          Diamonds
Gold - Rand and others                   5               1.39          All-Gold
Gold - Evander                           2               0.89          All-Gold
Gold - Klerksdorp                        3               5.45          All-Gold
Gold - OFS                               4               3.99          All-Gold
Gold - West Wits                         3               7.27          All-Gold
Copper                                   1               0.55          Metals & minerals
Manganese                                1               0.91          Metals & minerals
Platinum                                 2               5.00          Metals & minerals
Tin                                      1               0.01          Metals & minerals
Other metals & minerals                  3               0.26          Metals & minerals
Mining houses                            3              16.79          Mining financial
Mining holding                           3               7.13          Mining financial
Banks & financial services               5               2.80          Financial
Insurance                                4               2.15          Financial
Investment trusts                        3               1.02          Financial
Property                                12               0.47          Financial
Property trusts                         11               0.97          Financial
Industrial holdings                      6               9.94          Industrial
Beverages & hotels                       2               3.43          Industrial
Building & construction                  6               0.92          Industrial
Chemicals                                2               3.46          Industrial
Clothing, footwear & textiles           10               0.62          Industrial
Electronics, electr. & battery           7               1.39          Industrial
Engineering                              7               0.95          Industrial
Fishing                                  1               0.08          Industrial
Food                                     4               2.14          Industrial
Furniture & household goods              5               0.30          Industrial
Motors                                   6               0.43          Industrial
Paper & packaging                        3               2.26          Industrial
Pharmaceutical & medical                 3               0.48          Industrial
Printing & publishing                    3               0.18          Industrial
Steel & allied                           2               1.99          Industrial
Stores                                  10               2.16          Industrial
Sugar                                    1               0.43          Industrial
Tobacco & match                          1               2.48          Industrial
Transportation                           3               0.22          Industrial

2. Choose class intervals that cover the range from $x_{\min}$ to $x_{\max}$. Here are two guidelines that help determine the length L of the class intervals: the first is due to Mr Sturge, the second is used by the computer package GENSTAT. If the class intervals are made too narrow, the histogram looks “spiky”, and if too wide, the histogram is “blurred”.

Sturge says: use class intervals of length

$$L = \frac{x_{\max} - x_{\min}}{1 + \log_2 n} \approx \frac{x_{\max} - x_{\min}}{1 + 1.44 \log_e n}$$

GENSTAT says: use class intervals of length

$$L = \frac{x_{\max} - x_{\min}}{\sqrt{n}}$$

For our data, Sturge says

$$L = \frac{x_{\max} - x_{\min}}{1 + 1.44 \log_e n} = \frac{750 - 450}{1 + 1.44 \log_e 81} = 40.94,$$

while GENSTAT says

$$L = \frac{x_{\max} - x_{\min}}{\sqrt{n}} = \frac{750 - 450}{\sqrt{81}} = 33.33.$$

As a general rule, avoid choosing class intervals which are of awkward lengths. Multiples of 2, 5 and 10 are most frequently used. Feel free to choose intervals between half and double those suggested by the guidelines. All the class intervals should be the same width. Resist the temptation to make the class intervals wider over that part of the range where the data is sparser; this has the effect of destroying the visual message of the histogram.

For this example, L = 50 is a sensible choice for the width of the class interval. It is convenient to start our class intervals at 450, and carry on in steps of 50 as far as is necessary, so that the boundaries of the class intervals are at 450, 500, 550, 600, 650, 700, 750, and 800. We also need to agree that scores that fall on the boundaries will be allocated to the higher of the two class intervals, so strictly the class intervals are 450–499, etc., as shown in the frequency distribution table below.

3.
Count the number of GMAT scores falling into each class interval. The most convenient way to do this is to set up a tick sheet, and to make one pass through the data allocating each score to its class interval. This sets up a frequency distribution:

Class interval    Frequency
450–499                1
500–549               20
550–599               24
600–649               21
650–699               10
700–749                4
750–799                1
Total                 81

4. Plot the histogram, choosing suitable scales for each axis:

[Figure: Histogram of GMAT scores. Horizontal axis: GMAT scores (400 to 800). Vertical axis: Number of MBA students (0 to 25).]

The striking feature of the histogram is that it is not symmetric but is skewed to the right, which means that it has a long tail stretching off to the right. The terms “symmetric” and “skewed to the right” are technical, jargon terms, but their meanings are obvious.

A seasoned statistician would expect a distribution of test scores (or examination results) to have a tail at both ends of the frequency distribution. In the above display, there has been a truncation of the distribution at 500 (apart from a single score of 450).
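The arithmetic in step 2 of the histogram procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`sturge_length`, `genstat_length`, `class_interval`) are hypothetical and not part of the text, the first guideline is coded with the 1.44 log_e n form used in the worked figures, and scores are assumed to be whole numbers so that an interval of width 50 starting at 450 can be written 450–499.

```python
import math

def sturge_length(x_min, x_max, n):
    """First guideline, using the 1 + 1.44 * ln(n) form quoted in the text."""
    return (x_max - x_min) / (1 + 1.44 * math.log(n))

def genstat_length(x_min, x_max, n):
    """Second guideline: range divided by the square root of n."""
    return (x_max - x_min) / math.sqrt(n)

def class_interval(x, start, width):
    """Return (lower, upper) for the class interval containing x.

    A value on a boundary is allocated to the higher interval.
    Assumes integer values, so the upper limit is lower + width - 1.
    """
    lower = start + ((x - start) // width) * width
    return lower, lower + width - 1

# Values taken from Example 3A: n = 81, x_min = 450, x_max = 750.
n, x_min, x_max = 81, 450, 750
print(round(sturge_length(x_min, x_max, n), 2))
print(round(genstat_length(x_min, x_max, n), 2))

# Allocate a few scores to intervals of width 50 starting at 450.
for score in (450, 500, 640, 750):
    print(score, class_interval(score, start=450, width=50))
```

Using floor division for the allocation is one simple way to honour the convention that boundary scores go to the higher interval; a tick sheet made by hand follows the same rule.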
We would infer that the acceptance criterion on the MBA programme is a GMAT score of 500 or more. In reality there is a tail on the left, but it is suppressed by the fact that applicants who achieved these scores were not accepted. In the light of this, a statistician would also query the score of 450. Is it an error in the data? Maybe it should be 540, and there has been a transcription error. But a more plausible explanation is that the student was outstanding in some other aspect of the selection process; maybe the personal interview was very impressive!

Example 4B: The risks taken by investors when they invest in the stock exchange are of considerable interest to financial analysts. Investors associate the risks of investments with how volatile (or variable) the price changes are. Analysts measure volatility of price changes using the “standard deviation”, a statistical measure of variability that we will learn about later in this chapter. The table below contains the standard deviations (or risks) of a sample of 75 shares listed on the Johannesburg Stock Exchange. The units are per cent per month. Construct a suitable histogram of the data.

23 22 17 18 21 25 23 25 12 23 27 14 28  9 23
19 23 11 16 11 15 15 12 12 12 21 13 11 13 13
27 20 17  8 13 28 14  9 13 11 23 23 10 12 12
26 25 11 12 20 22 21  9 13 19 19 13 14 15 17
17 10 25 26 11 12 25 22 12 11 22 20 14 10 23

1. The sample size is n = 75, the extreme values are $x_{\min} = 8$, $x_{\max} = 28$.

2. Sturge says: $L = (28 - 8)/(1 + 1.44 \log_e 75) = 2.8$, while GENSTAT says: $L = (28 - 8)/\sqrt{75} = 2.3$. So a sensible length for the class interval is 2, and we use class interval boundaries at 8, 10, 12, ..., 28. Effectively, the class intervals are 8–9, 10–11, ..., 28–29.

3. Count the number of shares falling into each class:

Class interval    Frequency
8–9                    4
10–11                 10
12–13                 16
14–15                  7
16–17                  5
18–19                  4
20–21                  6
22–23                 12
24–25                  5
26–27                  4
28–29                  2
Total                 75

Finally, we plot the histogram:

[Figure: Histogram of risk. Horizontal axis: Risk (10 to 30). Vertical axis: Number of shares (0 to 20).]

The striking feature of this histogram is that it has two clear peaks. In statistical jargon, it is said to be bimodal. The visual display of this information has thus revealed information which was not at all obvious from even a careful search through the 75 values in the table of data.

The financial analyst now needs an explanation for the bimodality. Further investigation revealed that “gold shares” were predominantly responsible for the peak on the right while “industrial shares” were found to be responsible for the peak on the left. The histogram reveals that gold shares generally have a substantially higher risk than industrial shares.
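The tally in step 3 of Example 4B, and the two peaks it reveals, can be reproduced with a short Python sketch. The helper names `tally` and `peaks` are hypothetical, and picking out classes that are taller than both neighbours is only a crude informal device assumed here for illustration; it is not a formal test of bimodality.

```python
from collections import Counter

# The 75 risk values from Example 4B (per cent per month).
risks = [
    23, 22, 17, 18, 21, 25, 23, 25, 12, 23, 27, 14, 28, 9, 23,
    19, 23, 11, 16, 11, 15, 15, 12, 12, 12, 21, 13, 11, 13, 13,
    27, 20, 17, 8, 13, 28, 14, 9, 13, 11, 23, 23, 10, 12, 12,
    26, 25, 11, 12, 20, 22, 21, 9, 13, 19, 19, 13, 14, 15, 17,
    17, 10, 25, 26, 11, 12, 25, 22, 12, 11, 22, 20, 14, 10, 23,
]

def tally(values, start, width):
    """Frequency of values in each class [start + k*width, start + (k+1)*width)."""
    counts = Counter((v - start) // width for v in values)
    n_classes = (max(values) - start) // width + 1
    return [counts.get(k, 0) for k in range(n_classes)]

def peaks(freqs):
    """Indices of classes whose frequency exceeds that of both neighbours."""
    padded = [0] + freqs + [0]
    return [i for i in range(len(freqs))
            if padded[i + 1] > padded[i] and padded[i + 1] > padded[i + 2]]

freqs = tally(risks, start=8, width=2)

# A sideways text histogram: one asterisk per share.
for k, f in enumerate(freqs):
    lower = 8 + 2 * k
    print(f"{lower:>2}-{lower + 1:<2} {'*' * f} {f}")

print("peaks in classes starting at:", [8 + 2 * k for k in peaks(freqs)])
```

The two classes flagged are 12–13 and 22–23, which correspond to the left and right peaks discussed in the example.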
In layman’s terms, we conclude that gold shares are generally more “volatile” than industrial shares.

Example 5C: Plot a histogram to display the examination marks of 25 students and comment on the shape of the histogram:

68 72 39 50 69 52 51 50 41 52 65 37 45
78 48 55 53 61 71 42 57 34 57 66 87

Example 6C: A company that produces timber is interested in the distribution of the heights of their pine trees. Construct a histogram to display the heights, in metres, of the following sample of 30 trees:

18.3 19.1 17.3 19.4 17.6 20.1 19.9 20.0 19.5 19.3
17.7 19.1 17.4 19.3 18.7 18.2 20.0 17.7 20.0 17.5
18.5 17.8 20.1 19.4 20.5 16.8 18.8 19.7 18.4 20.4

Visual displays of quantitative data II : stem-and-leaf plots...

Stem-and-leaf plots are a relatively new display technique. The visual effect is very similar to that of the histogram; however, they have the advantage that additional information is represented: the original data values can be extracted from the display. Thus stem-and-leaf plots can be used as a means of data storage. We learn how to construct a stem-and-leaf plot by means of an example, using somewhat historic data. The procedure is simple.

Example 7A: At the end of the 1983/84 English football season, the points scored by each club were as follows:

Arsenal              63    Nottingham Forest      71
Aston Villa          60    Notts County           41
Birmingham City      48    Queens Park Rangers    73
Coventry City        50    Southampton            71
Everton              59    Stoke City             50
Ipswich Town         53    Sunderland             52
Leicester City       51    Tottenham Hotspur      61
Liverpool            79    Watford                57
Luton Town           51    West Bromwich          51
Manchester United    74    West Ham United        60
Norwich City         50    Wolverhampton          29

To produce the stem-and-leaf plot for the points scored by each team, we split each number into a “stem” and a “leaf”. In this example, the natural split is to use the “tens” as stems, and the “units” as leaves. Because the numbers range from the 20s to the 70s, our stems run from 2 to 7.
We write them in a column:

    stems   leaves
      2
      3
      4
      5
      6
      7

We now make one pass through the data. We split each number into its “stem” and its “leaf”, and write the “leaf” on the appropriate “stem”. The first number, the 63 points scored by Arsenal, has stem “6” and leaf “3”. We write a “3” as a leaf on stem “6”:

    stems   leaves
      2
      3
      4
      5
      6     3
      7

Aston Villa’s 60 points become leaf “0” on stem “6”. Birmingham City’s 48 points are entered as leaf “8” on stem “4”. After the first six scores have been entered, we have:

    stems   leaves
      2
      3
      4     8
      5     093
      6     30
      7

Continue until all 22 numbers have been entered:

    stems   leaves        count
      2     9               1
      3                     0
      4     81              2
      5     0931100271     10
      6     3010            4
      7     94131           5
                           22

In the last column, we enter the count of the number of leaves on each stem, add them up, and check that we have entered the right number of leaves! The final step is to sort the leaves on each stem from smallest to largest, and to add a cumulative count column:

            sorted                 cum.
    stems   leaves        count   count
      2     9               1       1
      3                     0       1
      4     18              2       3
      5     0001112379     10      13
      6     0013            4      17
      7     11349           5      22

What have we created? Essentially, we have a histogram on its side, with class intervals of length 10. But in addition, we have retained all the original information. In a histogram, we would only have known that five teams scored between 70 and 79 points; now we know that there were scores of 71 (two teams), 73, 74, and Liverpool’s league-winning 79!

Example 8A: For the data of example 1A, produce and compare stem-and-leaf plots of the GMAT scores of students with Engineering and Arts backgrounds.

All the GMAT scores ended in a zero, so the final digit contains no useful information; therefore we use the hundreds as “stems” and the tens as “leaves”. For both categories of students, we would then have only three stems, “5”, “6” and “7”. Looking back at the histogram display of GMAT scores (example 3A), we note that we used class intervals of width 50 units.
We can create class intervals of this width in the stem-and-leaf plot as demonstrated below.

Engineering:

    stems   leaves      count
     5·     140144        6
     5*     8579566       7
     6·     11240023      8
     6*     5658          4
     7·     21            2
     7*     5             1
                         28

Arts:

    stems   leaves      count
     5·     01101022      8
     5*     6575          4
     6·     20            2
     6*     6             1
     7·     1             1
     7*                   0
                         16

In this approach, we split each 100 into two stems; the first is labelled “·” and encompasses the leaves from 0 to 4, the second is labelled “*” and includes the leaves from 5 to 9. The final step is to sort the leaves for each stem.

    ENGINEERING                      ARTS
            sorted      cum.                 sorted      cum.
    stems   leaves      count        stems   leaves      count
     5·     011444        6           5·     00011122      8
     5*     5566789      13           5*     5567         12
     6·     00112234     21           6·     02           14
     6*     5568         25           6*     6            15
     7·     12           27           7·     1            16
     7*     5            28           7*                  16

Again, a skewness to the right is evident in both displays. Striking, too, is the observation that students with an engineering background tend to have GMAT scores in the upper 500s and lower 600s, whereas the majority of arts background students have scores in the 500s. Although the sample sizes are small, this pattern seems marked enough to suggest that engineers perform better, on average, than the arts students.

If splitting “stems” into two parts seems inadequate for the data set on hand, here is a system for splitting them into five!

Example 9B: Produce a stem-and-leaf plot for the risk data of Example 4B.

As for the histogram, it would be sensible to use stems of width 2. Each stem is therefore split into five: 0 and 1 are denoted “·”, 2 and 3 are denoted “t”, 4 and 5 are denoted “f”, 6 and 7 are denoted “s”, and 8 and 9 are denoted “*”. Notice the convenient mnemonics (“t” for two and three, “f” for four and five, “s” for six and seven): English is a marvellous language!

            sorted                       cum.
    stems   leaves              count   count
     0*     8999                  4       4
     1·     0001111111           10      14
     1t     2222222223333333     16      30
     1f     4444555               7      37
     1s     67777                 5      42
     1*     8999                  4      46
     2·     000111                6      52
     2t     222233333333         12      64
     2f     55555                 5      69
     2s     6677                  4      73
     2*     88                    2      75

Note that we have presented the stem-and-leaf plot with the leaves already sorted.
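The construction used in Example 7A (split each value into a stem and a leaf, collect the leaves on their stems, sort them, and keep a running count) can be sketched in a few lines of Python. This is only an illustrative sketch under simple assumptions: the function name `stem_and_leaf` is hypothetical, the data are whole numbers, and the tens are used as stems; the split stems of Examples 8A and 9B are not handled.

```python
def stem_and_leaf(values):
    """Return one row per stem: (stem, sorted leaves, count, cumulative count).

    Assumes non-negative whole numbers, with tens as stems and units as leaves.
    """
    leaves_by_stem = {}
    for value in values:
        stem, leaf = divmod(value, 10)          # e.g. 63 -> stem 6, leaf 3
        leaves_by_stem.setdefault(stem, []).append(leaf)

    rows = []
    cumulative = 0
    # Walk through every stem in the range, so that empty stems still appear.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = sorted(leaves_by_stem.get(stem, []))
        cumulative += len(leaves)
        leaf_text = "".join(str(leaf) for leaf in leaves)
        rows.append((stem, leaf_text, len(leaves), cumulative))
    return rows


# Points of the 22 clubs in Example 7A, read down the two columns of the table.
points = [63, 60, 48, 50, 59, 53, 51, 79, 51, 74, 50,
          71, 41, 73, 71, 50, 52, 61, 57, 51, 60, 29]

for stem, leaf_text, count, cumulative in stem_and_leaf(points):
    print(f"{stem:>2} | {leaf_text:<10} {count:>3} {cumulative:>3}")
```

The rows produced for the football data should match the sorted plot of Example 7A, including the empty stem 3 and the final cumulative count of 22.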
Example 10C: Produce a stem-and-leaf plot for the examination marks of another group of 25 students.

    50 79 53 85 50 53 65 58 43 45 48 51 54
    72 71 61 51 72 53 39 67 27 43 69 53

Example 11C: The maximum temperatures (°C) at 20 towns in southern Africa one summer’s day are given in the following table. Produce a stem-and-leaf plot.

    Pietersburg     30    Windhoek            32
    Pretoria        20    Cape Town           22
    Johannesburg    28    George              21
    Nelspruit       30    Port Elizabeth      18
    Mmabatho        33    East London         17
    Bethlehem       30    Beaufort West       23
    Bloemfontein    31    Queenstown          22
    Kimberley       31    Durban              26
    Upington        28    Pietermaritzburg    27
    Keetmanshoop    28    Ladysmith           30

Example 12C: In order to assess the prices of the television repair industry, a faulty television set was taken to 37 TV repair shops for a quote. The data below represent the quoted prices in rands. Construct a stem-and-leaf plot and comment on its features.

    60  55 158  38  48 120  85 245  90  60
    49  38  98 185 200 150 140  75 125 125
    125 145 200 145  94 165 105  75  75 120
    36 150 120 176  60  78  28

Five-number data summaries: median, lower and upper quartiles, extremes...

At the beginning of this chapter, we said that statisticians gained a feel for the data they were about to analyse in two ways. We have now dealt with the first way, that of constructing a visual display. Now we move on to the second way, by computing a few “key” numbers which summarize the data. Our aim now is to reduce a large batch of data to just a few numbers which we can grasp simultaneously, and thus help us to understand the important features of the data set as a whole.

It is useful now to introduce the concept of rank. In a sample of size $n$ sorted from smallest to largest, the smallest number is said to have rank 1, the second smallest rank 2, ..., the largest rank $n$. We call the smallest number $x_{(1)}$ (so $x_{(1)} = x_{\min}$), the second smallest $x_{(2)}$, ..., and the largest is $x_{(n)}$ (so $x_{(n)} = x_{\max}$). We use $x_{(r)}$ for the number with rank $r$.
The cumulative count column of a stem-and-leaf plot makes it easy to find the observation with any given rank. We use $x_{(r+\frac{1}{2})}$ to denote the number half-way between the numbers with rank $r$ and rank $r+1$:

\[ x_{(r+\frac{1}{2})} = \frac{x_{(r)} + x_{(r+1)}}{2}. \]

We say that $x_{(r+\frac{1}{2})}$ is the number with rank $r + \frac{1}{2}$. Such numbers are called half-ranks.

We define the median of a batch of $n$ numbers as the number which has rank $(n+1)/2$. We use $x_{(m)}$ to denote the median. If $n$ is an odd number, then the rank of the median will be a whole number, and the median will be the “middle number” in the data set. But if $n$ is even, then the rank will be a half-rank, and the median will be the average of the “two middle numbers” in the data set.

The lower quartile is defined to be the number with rank $l = ([m] + 1)/2$, where $m$ is the rank of the median. The notation $[m]$ means that if $m$ is something-and-a-half, we drop the half! The alternative to doing this is having to define “quarter-ranks”! The upper quartile has rank $u = n - l + 1$. The lower and upper quartiles are denoted $x_{(l)}$ and $x_{(u)}$, respectively.

The extremes, the smallest and largest values in a data set, have ranks 1 and $n$, and we agreed earlier to call them $x_{(1)}$ and $x_{(n)}$, respectively.

These five numbers provide a useful summary of the batch of data, called, with complete lack of imagination, the five-number summary. We write them from smallest to largest:

\[ (x_{(1)},\ x_{(l)},\ x_{(m)},\ x_{(u)},\ x_{(n)}). \]

Example 13A: Find the five-number summary for the end-of-season football points of Example 7A.

An easy way to find the five-number summary is to use the stem-and-leaf plot.

            sorted                 cum.
    stems   leaves        count   count
      2     9               1       1
      3                     0       1
      4     18              2       3
      5     0001112379     10      13
      6     0013            4      17
      7     11349           5      22

Because $n = 22$, the median has rank $m = (n+1)/2 = (22+1)/2 = 11\frac{1}{2}$. We need to average the numbers with ranks 11 and 12. From the cumulative count, we see that the last leaf on stem 4 has rank 3, and the last leaf on stem 5 has rank 13.
Counting along stem 5, we find that 53 is the number with rank 11 and 57 has rank 12. Thus the median is the average of these two numbers, $(53 + 57)/2 = 55$; we write $x_{(m)} = 55$. Half the teams scored below 55 points, half scored above 55 points.

The lower quartile has rank $l = ([11\frac{1}{2}] + 1)/2 = (11 + 1)/2 = 6$. The observation with rank 6 is 50, thus $x_{(l)} = 50$. The upper quartile has rank $u = n - l + 1 = 22 - 6 + 1 = 17$. The observation with rank 17 is 63, thus $x_{(u)} = 63$. The extremes are $x_{(1)} = 29$ and $x_{(n)} = 79$.

The five-number summary is: (29, 50, 55, 63, 79). Why is this a big deal? Because it tells us that...

1. Half the teams scored below 55 points, half scored above 55 points, because 55 is the median.
2. Half the teams scored between 50 and 63 points, because these two numbers are the lower and upper quartiles.
3. A quarter of the scores lay between 29 and 50, a quarter between 50 and 55, a quarter between 55 and 63, and a quarter between 63 and 79.
4. All the scores lay between 29 and 79.

Example 14B: Find the five-number summaries for GMAT scores of both engineering and arts students.

Use the stem-and-leaf plot of example 8A.

    ENGINEERING                              ARTS
            sorted              cum.                 sorted              cum.
    stems   leaves      count   count        stems   leaves      count   count
     5·     011444        6       6           5·     00011122      8       8
     5*     5566789       7      13           5*     5567          4      12
     6·     00112234      8      21           6·     02            2      14
     6*     5568          4      25           6*     6             1      15
     7·     12            2      27           7·     1             1      16
     7*     5             1      28           7*                   0      16

For the engineers, the median has rank $m = (28 + 1)/2 = 14\frac{1}{2}$. Thus $x_{(m)} = (600 + 600)/2 = 600$. The lower quartile has rank $l = ([m] + 1)/2 = ([14\frac{1}{2}] + 1)/2 = 7\frac{1}{2}$, and the upper quartile rank $n - l + 1 = 28 - 7\frac{1}{2} + 1 = 21\frac{1}{2}$. So $x_{(l)} = (550 + 550)/2 = 550$, and $x_{(u)} = (640 + 650)/2 = 645$. The five-number summary is (500, 550, 600, 645, 750).

For the arts students, the median has rank $m = (16 + 1)/2 = 8\frac{1}{2}$. Thus $x_{(m)} = (520 + 550)/2 = 535$. The lower quartile has rank $l = ([m] + 1)/2 = ([8\frac{1}{2}] + 1)/2 = 4\frac{1}{2}$, and the
upper quartile rank $n - l + 1 = 16 - 4\frac{1}{2} + 1 = 12\frac{1}{2}$. So $x_{(l)} = (510 + 510)/2 = 510$, and $x_{(u)} = (570 + 600)/2 = 585$. The five-number summary is (500, 510, 535, 585, 710).

For the engineers, the median GMAT score was 600; by contrast, for arts students, it was only 535. The central 50% of engineers obtained scores in the interval from 550 to 645, while the central 50% of arts students were in a downwards-shifted interval, 510 to 585. This reinforces our earlier interpretation that engineers tend to have higher GMAT scores than arts students.

Example 15C: Find the five-number summaries for the data of (a) Example 10C, (b) Example 11C, and (c) Example 12C.

Visual displays of quantitative data III : box-and-whisker plots...

Five-number summaries can be displayed graphically by means of box-and-whisker plots. (This ridiculous name was invented by the American statistician who invented the method, John Tukey, who also invented both the name “stem-and-leaf plot” and the plot itself! John Tukey was not only an inventor of crazy names; he also made an enormous impact on the theory and practice of the discipline Statistics.) Once again, we will use an example to describe how to produce a box-and-whisker plot.

Example 16A: Produce a box-and-whisker plot for the football team points of Example 7A, using the five-number summary (29, 50, 55, 63, 79) computed in Example 13A.

The procedure is simple:

1. Draw a vertical axis which covers at least the range of the data.
2. Draw a “box” from the lower to the upper quartile.
3. Draw a line across the box at the median.
4. Draw “whiskers” from the box out to the extremes.

Applied to the five-number summary (29, 50, 55, 63, 79), this procedure yields the box-and-whisker plot in Figure 1.1.

Box-and-whisker plots are especially useful when we wish to compare two or more sets of data. To achieve this, we construct the plots side-by-side. It is essential to use the same vertical scale for all the plots.
[Figure 1.1: Box-and-whisker plot of the football points (vertical axis: Points, 0 to 100), marking the lower extreme (29), lower quartile (50), median (55), upper quartile (63) and upper extreme (79).]

Example 17B: Draw a series of box-and-whisker plots to compare the GMAT scores of each category of MBA students.

We computed the five-number summaries of the GMAT scores for engineering and arts students in Example 14B. The five-number summaries for all the categories are:

    Engineering   (500, 550, 600, 645, 750)
    Science       (500, 550, 585, 625, 660)
    Arts          (500, 510, 535, 585, 710)
    Commerce      (510, 540, 555, 600, 700)
    Medicine      (580, 585, 640, 690, 700)
    Other         (460, 540, 585, 640, 680)

The box-and-whisker plots, shown side-by-side, reveal the differences between the various categories of students.

[Box-and-whisker plots of GMAT scores (vertical axis, 400 to 800) for the categories ENG., SCI., ARTS, COM., MED. and OTH., drawn side by side on a common scale.]

We see from a comparison of the box-and-whisker plots that the students in this class with a medical background had the highest median GMAT score, followed by engineers, with arts students having the lowest median. The skewness to the right (now shown as a long whisker pointing upwards!) which we commented on earlier for the class as a whole, is also evident for engineering, science, arts and commerce students, the categories for which the sample sizes were large.

Outliers and strays...

In many data sets, there are one or more values that appear to be very different from the bulk of the observations. Intuitively, we recognize these values because they are a long way from the median of the data set as a whole. We can make our box-and-whisker plots more informative by plotting and labelling some of the outlying values in such a way that they are highlighted and our attention immediately drawn to them.
These outlying values might well represent errors that have crept into the data, either when the observation was made, or when the numbers were transcribed from one sheet of paper to another, or when they were entered into a computer, or even when they were being transferred from one computer to another. On the other hand, outlying values might represent genuine observations, and be of special interest and importance. In any event, these observations need to be checked, and either confirmed or rejected. It is useful to have rules that help us to identify such observations.

Outliers are those observations which are

    greater than  $x_{(m)} + 6(x_{(u)} - x_{(m)})$   or   less than  $x_{(m)} - 6(x_{(m)} - x_{(l)})$

and we label them boldly on the box-and-whisker plot. Less outlying values called strays are those observations which are not outliers but are

    greater than  $x_{(m)} + 3(x_{(u)} - x_{(m)})$   or   less than  $x_{(m)} - 3(x_{(m)} - x_{(l)})$

and we label them less boldly on the plot. The largest and smallest observations which are not strays are called the fences (more Tukeyisms!). When outliers and strays are being portrayed in a box-and-whisker plot, the convention is to take the whiskers out as far as the fences, not the extremes. This helps to isolate and highlight the outlying values. In any event, it is sometimes helpful to identify a few values of special interest or importance in a box-and-whisker plot.

Example 18A: The university computing service provides data on the amount of computer usage (hours) by each of 30 students in a course:

    Student no.  Usage    Student no.  Usage    Student no.  Usage
    AD483          53     AM044           2     AS677          36
    CI144           7     CS572          25     EK817          20
    FV246          38     GM337          36     GR803          33
    HN050          48     JK314          84     JR894         154
    JV670          31     KM232          35     LJ419          44
    LW032          48     MA276          69     MJ076          95
    PH544           4     PS279          60     RR676          18
    SA831          51     SC186          47     SS154          37
    TB864          11     VO822          41     WG794          34
    WB909          73     YG007          38     ZP559         125

Is the lecturer justified in claiming that certain students appear to be making excessive use of the computer (playing games?)
while the usage of others is so low that she is suspicious that they are not doing the work themselves?

The stem-and-leaf plot is

            sorted       cum.
    stems   leaves       count
      0     247            3
      1     18             5
      2     05             7
      3     134566788     16
      4     14788         21
      5     13            23
      6     09            25
      7     3             26
      8     4             27
      9     5             28
     10                   28
     11                   28
     12     5             29
     13                   29
     14                   29
     15     4             30

The five-number summary is (2, 31, 38, 53, 154). The outliers were those observations

    greater than  $x_{(m)} + 6(x_{(u)} - x_{(m)}) = 38 + 6(53 - 38) = 128$   or   less than  $x_{(m)} - 6(x_{(m)} - x_{(l)}) = 38 - 6(38 - 31) = -4$.

There was only one outlier, the usage of 154 hours by student JR894. The strays were those observations which were not identified as outliers but were

    greater than  $x_{(m)} + 3(x_{(u)} - x_{(m)}) = 38 + 3(53 - 38) = 83$   or   less than  $x_{(m)} - 3(x_{(m)} - x_{(l)}) = 38 - 3(38 - 31) = 17$.

There are seven strays: four students (AM044 (2 hours), PH544 (4 hours), CI144 (7 hours), TB864 (11 hours)) are at the low usage end, and three (JK314 (84 hours), MJ076 (95 hours) and ZP559 (125 hours)) are at the high usage end. The fences are the outermost observations that were not strays, and were 18 hours and 73 hours. The box-and-whisker plot, with the outlier labelled boldly, and strays merely labelled, looks like this:

[Box-and-whisker plot of computer usage (vertical axis: Hours, 0 to 150): box from the lower quartile (31) to the upper quartile (53) with the median (38) marked, whiskers out to the fences, the outlier JR894 labelled boldly, and the strays ZP559, MJ076, JK314, TB864, CI144, PH544 and AM044 labelled.]

The lecturer now has a list of students whose computer utilization appears to be suspicious.

Example 19C: A company that produces breakfast cereals is interested in the protein content of wheat, its basic raw material. The protein content of 29 samples of wheat (percentages) was recorded as follows:

    9.2   8.0  10.9  11.6  10.4   9.5   8.5   7.7   8.0  11.3
    10.0  12.8   8.2  10.5  10.2  11.9   8.1  12.6   8.4   9.6
    11.3   9.7  10.8  83    10.8  11.5  21.5   9.4   9.7

Confirm the statistician’s conclusion that the values 83 and 21.5 are outliers. The statistician asked that these values should be investigated.
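The rules for outliers, strays and fences can be sketched in code. The Python sketch below is illustrative only: the function name `classify` is hypothetical, and the lower quartile, median and upper quartile are simply taken from the five-number summary (2, 31, 38, 53, 154) found in Example 18A rather than recomputed.

```python
def classify(values, lower_quartile, median, upper_quartile):
    """Split values into outliers, strays and fences using the chapter's rules."""
    outlier_high = median + 6 * (upper_quartile - median)
    outlier_low = median - 6 * (median - lower_quartile)
    stray_high = median + 3 * (upper_quartile - median)
    stray_low = median - 3 * (median - lower_quartile)

    outliers = sorted(v for v in values if v > outlier_high or v < outlier_low)
    # Strays lie beyond the stray limits but are not already outliers.
    strays = sorted(v for v in values
                    if v not in outliers and (v > stray_high or v < stray_low))
    # Fences are the smallest and largest of the remaining observations.
    remaining = [v for v in values if v not in outliers and v not in strays]
    fences = (min(remaining), max(remaining))
    return outliers, strays, fences


# Computer usage (hours) of the 30 students in Example 18A.
usage = [53, 2, 36, 7, 25, 20, 38, 36, 33, 48, 84, 154, 31, 35, 44,
         48, 69, 95, 4, 60, 18, 51, 47, 37, 11, 41, 34, 73, 38, 125]

outliers, strays, fences = classify(usage, lower_quartile=31, median=38,
                                    upper_quartile=53)
print(outliers, strays, fences)
```

Run on the usage data, this should reproduce the single outlier (154 hours), the seven strays and the fences of 18 and 73 hours reported in Example 18A. The same function could be applied to the wheat protein data of Example 19C once its median and quartiles have been found.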
Checking back to the original data, it was discovered that 83 should have been 8.3, and 21.5 should have been 12.5. Transposed digits and misplaced decimal points are two of the most frequent types of error that occur when data are entered into a computer.

Example 20C: A winery is concerned about the possible impact of “global warming” on the grape crop. It was able to obtain some interesting historical rainfall data going back to 1884 from a wine-producing region. The rainfall (mm) in successive Januaries at Paarl for the 22-year period 1884–1905 was recorded as follows:

    Year    Rain      Year    Rain      Year    Rain
    1884     2.6      1892    37.8      1900      3.0
    1885     4.9      1893     0.0      1901    145.1
    1886    16.3      1894     0.0      1902     39.7
    1887    21.6      1895    52.3      1903    105.9
    1888     6.1      1896     4.1      1904     17.8
    1889     0.0      1897     6.4      1905     10.6
    1890     0.0      1898    15.8
    1891     1.1      1899    27.7

(a) Produce a stem-and-leaf plot.
(b) Find the five-number summary.
(c) Draw the box-and-whisker plot, showing outliers and strays, if any.

“Statistics” in Statistics...

Within the discipline Statistics, we give a precise technical definition to the concept, a statistic. A statistic is any quantity determined from a sample. Thus the median is “a statistic”, and so are the other four numbers that make up a five-number summary. These are examples of summary statistics, because they endeavour to summarize certain aspects of the information contained in the sample. We now learn about a further bunch of “statistics”.

Measures of location and spread...

We use the term measure of location to describe any statistic that purports to locate the “middle”, in some sense, of the data set. For example, confronted by a collection of data on house prices, we would use a measure of location to answer the question: What is the typical price of a house? The next questions might be: How much variability is there in house prices? What is the difference between the price of a cheap house and that of an expensive house?
Measures of spread are designed to provide answers to these two questions. In the next few sections we consider a few of the most important measures of location, and then some measures of spread.

The sample median

The median, which we denoted $x_{(m)}$, locates the “middle” of the data in the sense that half the observations are smaller than the median and half are larger than the median. To find the median it is necessary to sort or rank the data from the smallest value to the largest. Remember that if the sample size $n$ is an odd number, the median is the “middle” observation, but if $n$ is even, it is the average of the “two middle” observations.

The sample mean...

The sample mean is, with good justification, the most important measure of location. It is found by adding together all the values in the sample, and dividing this total by $n$, the sample size.

We introduce a subscript notation to describe a sample of size $n$. Call the first observation we make $x_1$, the second $x_2$, ..., the $n$th $x_n$. Then the sample mean, almost universally denoted $\bar{x}$ (pronounced “x bar”), is defined to be

\[ \bar{x} = (x_1 + x_2 + \cdots + x_n)/n = \frac{1}{n}\sum_{i=1}^{n} x_i. \]

The sample mean locates the “middle” of the batch of data values in a special way. It is equivalent to hanging a 1 kg mass at points $x_1, x_2, \ldots, x_n$ along a ruler (of zero mass), and then $\bar{x}$ is the point at which the ruler balances. (The masses needn’t be 1 kg, but they must all be equal!)

The mean is much easier to calculate than the median. The mean requires a single pass through the data, adding up the values. In contrast, the data needs to be sorted before the median can be computed, an operation which requires several passes through the data.

Example 21A: Find the sample mean of the dividend yields of 15 shares in the paper and packaging sector of the Johannesburg Stock Exchange. Also find the median. Compare the mean and the median. The yields are expressed as percentages.

    Copi       3.3    E. Haddon    7.6    Pr. Paper    6.7
    Caricar    8.4    Kohler       7.1    Prs. Sup     2.9
    Coates    10.7    Metal Box    6.6    Sappi        7.5
    Consol     6.0    Metaclo      8.6    Trio Rand    8.2
    DRG        9.6    Nampak       5.8    Xactics      3.0

We sum the 15 dividend yields and divide by 15:

\[ \bar{x} = (3.3 + 8.4 + 10.7 + \cdots + 8.2 + 3.0)/15 = 6.80 \]

The stem-and-leaf plot for these data is shown below:

            sorted             cum.
    stems   leaves    count   count
      2     9           1       1
      3     03          2       3
      4                 0       3
      5     8           1       4
      6     067         3       7
      7     156         3      10
      8     246         3      13
      9     6           1      14
     10     7           1      15

The median has rank $m = (15 + 1)/2 = 8$, and thus $x_{(m)} = 7.1$.

In this example, there is little difference between the two measures of location. But this is not always the case...

Example 22A: Find the mean and the median of the weekly volume of the same 15 shares as in Example 21A. The weekly volume is the number of shares traded in a week.

    Copi       2 300    E. Haddon          0    Pr. Paper       700
    Caricar    2 100    Kohler           100    Prs. Sup          0
    Coates     3 100    Metal Box    111 400    Sappi        40 600
    Consol     1 200    Metaclo          700    Trio Rand    84 100
    DRG       31 800    Nampak           100    Xactics      45 900

The sample mean is

\[ \bar{x} = (2\,300 + 2\,100 + \cdots + 84\,100 + 45\,900)/15 = 21\,607. \]

Sort the data, locate the middle (8th) value, and find that the median is $x_{(m)} = 2\,100$.

The mean is just over 10 times larger than the median. What has gone wrong? Nothing, it’s just that the mean and median locate the “middle” of the data according to a different set of rules! In this example, the mean has been dragged upwards by a few large values, so that only five of the fifteen numbers are larger than the mean. But even if a million Metal Box shares had been traded during the week, the median would have remained the same!

The median is an example of a measure of location which is said, in statistical jargon, to be robust. The mean is not robust, being sensitive to outlying values in the data set. Because the mean is not robust, it is important to be aware of possible outliers in any sample of data for which the mean is being computed.
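The contrast between the mean and the median in Example 22A can be checked with a short Python sketch using the standard `statistics` module. The variable names are illustrative, and the “million Metal Box shares” scenario is the hypothetical one mentioned in the text, not real data.

```python
from statistics import mean, median

# Weekly volumes of the 15 shares in Example 22A, read down the columns.
volumes = [2_300, 2_100, 3_100, 1_200, 31_800,
           0, 100, 111_400, 700, 100,
           700, 0, 40_600, 84_100, 45_900]

print(round(mean(volumes)), median(volumes))

# How many observations exceed the mean?
above_mean = sum(1 for v in volumes if v > mean(volumes))
print(above_mean)

# Hypothetical scenario from the text: a million Metal Box shares are traded.
inflated = [1_000_000 if v == 111_400 else v for v in volumes]
print(round(mean(inflated)), median(inflated))
```

The first line of output should show the mean of about 21 607 against the median of 2 100. In the inflated data the mean rises sharply while the median stays at 2 100, which is what robustness means in practice.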
The mean and the median tend to be close when the distribution of the values is symmetric and there are no outlying observations. The mean and median differ increasingly as the distribution of the data becomes more and more skew. The observations in the long tail of a skew distribution drag the mean in the direction of the tail. The sample mean of a very skew distribution might give a totally misleading impression of the “middle” of the data set.

There are no hard-and-fast rules which state when to use the sample mean and when to use the median as a measure of location. In general terms, the median is good for most sets of data. The sample mean is most useful when the data has a symmetric distribution. Data with a long tail to the right can be made more symmetric by taking logarithms or taking square roots of all the data values. Such manipulations to the original data values are called transformations.

The sample mean has mathematical advantages over the median. It is a far easier statistic for the mathematical statisticians to do algebraic manipulations with than the median. A vast amount of statistical theory has been developed for the sample mean, and for this reason it is the predominant measure of location used in sophisticated statistical methods.

Measures of spread...

Measures of spread give insight into the variability of a set of data. Two measures of spread can be defined in an obvious way from the five-number summary. They are: the range $R$, defined as

\[ R = x_{(n)} - x_{(1)}, \]

and the interquartile range $I$, defined as

\[ I = x_{(u)} - x_{(l)}. \]

The range is unreliable as a measure of spread because it depends only on the smallest and largest values in the sample, and is thus as sensitive as it can possibly be to outlying values in the sample. It is the ultimate example of a non-robust statistic!
On the other hand, the interquartile range is the length of the interval covering the central half of the data values in the sample, and it is not sensitive to outliers in the data. The interquartile range is a robust measure of spread.

The sample variance and its square root, the sample standard deviation, have the same advantage, easier algebraic manipulation, over the range and interquartile range that the mean had over the median. Therefore the sample variance is frequently the only measure of spread calculated for a set of data.

The sample variance, denoted by $s^2$, is defined by the formula

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2. \]

In words, it is the sum of the squared differences between each data value and the sample mean, with this sum being divided by one less than the number of terms in the sum. The sample standard deviation, denoted by $s$, is the square root of the sample variance.

It is a nuisance to have these two measures of spread, $s$ and $s^2$, one of which is simply the square root of the other. Why have both? The standard deviation is the easier of the two measures of spread to get an intuitive feeling for, largely because it is measured in the same units as the original data. The variance is measured in “squared units”, an awkward quantity to visualize. For example, if data consists of prices measured in rands, the sample variance has units “squared rands” (whatever that means!), but the standard deviation is in “rands”. Even worse, if the data consists of percentages, the variance has units “$\%^2$”, whereas the standard deviation has the intelligible units “%”. But mathematical statisticians prefer to work with the variance: not having to deal with a square root in the algebra makes their lives simpler and neater. So the two equivalent measures of spread co-exist side by side, and we just have to come to terms with both of them.
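The definition of the sample variance translates almost word for word into code. The Python sketch below is a minimal illustration, with a hypothetical function name `sample_variance`; it uses the 15 dividend yields of Example 21A and compares the result with `statistics.variance` from the standard library, which also divides by n − 1.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared differences from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    squared_differences = [(x - x_bar) ** 2 for x in values]
    return sum(squared_differences) / (n - 1)


# Dividend yields (%) of the 15 shares in Example 21A, read down the columns.
yields = [3.3, 8.4, 10.7, 6.0, 9.6, 7.6, 7.1, 6.6, 8.6, 5.8,
          6.7, 2.9, 7.5, 8.2, 3.0]

s2 = sample_variance(yields)   # in "squared percent"
s = math.sqrt(s2)              # back in percent, the units of the data
print(f"{s2:.2f} {s:.2f}")     # prints "5.40 2.32"
```

Taking the square root at the end is what returns the measure of spread to the units of the original data.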
Example 23A: Compute the sample variance $s^2$ and the standard deviation $s$ for the dividend yields of the 15 shares of Example 21A.

We have computed $\bar{x} = 6.8$. So

\begin{align*}
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \\
    &= \frac{1}{15-1}\left[(3.3-6.8)^2 + (8.4-6.8)^2 + (10.7-6.8)^2 + \cdots + (8.2-6.8)^2 + (3.0-6.8)^2\right] \\
    &= \frac{1}{14}\left[(-3.5)^2 + (1.6)^2 + (3.9)^2 + \cdots + (1.4)^2 + (-3.8)^2\right] \\
    &= \frac{1}{14}\left[12.25 + 2.56 + 15.21 + \cdots + 1.96 + 14.44\right] \\
    &= \frac{1}{14} \times 75.62 \\
    &= 5.40
\end{align*}

The standard deviation is $s = \sqrt{5.40} = 2.32$.

The variance and the standard deviation can never be negative. This is guaranteed, because all the terms in the sum are squared, which makes them non-negative, even though some of the individual differences are negative. (They are zero only when all the data values are equal.)

The variance can be calculated more efficiently by a short-cut formula. The “short cut” involves reducing the number of subtractions needed to calculate the variance from $n$ to 1. Examine the following steps carefully:

\begin{align*}
(n-1)s^2 &= \sum_{i=1}^{n}(x_i - \bar{x})^2 \\
         &= \sum_{i=1}^{n}(x_i^2 - 2\bar{x}\,x_i + \bar{x}^2) \\
         &= \sum_{i=1}^{n}x_i^2 - \sum_{i=1}^{n}2\bar{x}\,x_i + \sum_{i=1}^{n}\bar{x}^2
\end{align*}

The third term involves adding $\bar{x}^2$ to itself $n$ times. So it is equal to $n\bar{x}^2$. But $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$, so

\[ n\bar{x}^2 = \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2. \]

The second term in the sum above can also be rewritten:

\begin{align*}
\sum_{i=1}^{n}2\bar{x}\,x_i &= 2\bar{x}\sum_{i=1}^{n}x_i \\
                            &= \frac{2}{n}\sum_{i=1}^{n}x_i\sum_{i=1}^{n}x_i \\
                            &= \frac{2}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2
\end{align*}

Substituting these expressions for the second and third terms yields

\begin{align*}
(n-1)s^2 &= \sum_{i=1}^{n}x_i^2 - \frac{2}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2 + \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2 \\
         &= \sum_{i=1}^{n}x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2.
\end{align*}

Thus the short-cut formula for $s^2$ is

\[ s^2 = \frac{1}{n-1}\Big[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2\Big]. \]

Look carefully at this formula. There is now only one subtraction, whereas the original formula involved $n$ subtractions.

Example 24A: Calculate the sample variance of the dividend yields again, this time using the short-cut formula.

We need $\sum_{i=1}^{n}x_i$, the sum of the data values, given by

\[ \sum_{i=1}^{n}x_i = 3.3 + 8.4 + \cdots + 3.0 = 102.0; \]

and $\sum_{i=1}^{n}x_i^2$, the sum of squares of the data values, i.e.
square them first, then add them, like this:

\[ \sum_{i=1}^{n}x_i^2 = 3.3^2 + 8.4^2 + \cdots + 3.0^2 = 769.22 \]

Then

\begin{align*}
s^2 &= \frac{1}{n-1}\Big[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2\Big] \\
    &= \frac{1}{14}\Big[769.22 - \frac{1}{15}(102.0)^2\Big] = 5.40
\end{align*}

as before.

If the data has a symmetric distribution with no outliers, then the standard deviation has the following approximate interpretation. The interval from one standard deviation below the sample mean to one standard deviation above it, $(\bar{x} - s,\ \bar{x} + s)$, should contain about two-thirds of the observations.

Thus the sample mean and the sample standard deviation together provide a “two-number summary” of the data set. Many data sets are summarized by these two statistics: the sample mean provides a measure of location and the sample standard deviation a measure of spread.

However, the sample variance and the sample standard deviation have the disadvantage that, like the mean, they are sensitive to outliers. They are sensitive in two ways. First of all, the outlier distorts the mean, so all the differences $(x_i - \bar{x})$ are misleading. Secondly, if $x_j$, the $j$th data value, is an outlier, then the term $(x_j - \bar{x})$ will be large relative to the other differences, and, once it is squared, it can make a disproportionately large contribution to the sum of squared differences.

Note that the intervals $(x_{(1)}, x_{(n)})$, $(\bar{x} - s,\ \bar{x} + s)$, and $(x_{(l)}, x_{(u)})$ cover 100%, about 68% (roughly two-thirds), and exactly 50% of the observations, respectively. But it is not possible to make direct comparisons between the range, the standard deviation and the interquartile range.

Example 25B: Calculate the sample means, sample standard deviations, medians, interquartile ranges and ranges of the GMAT scores for each faculty category of Example 1A. Comment on the results.

For the sample mean and variance, we need the quantities $\sum_{i=1}^{n}x_i$ and $\sum_{i=1}^{n}x_i^2$.
For the category “Engineering”, we have

\begin{align*}
\sum_{i=1}^{n}x_i   &= 610 + 510 + \cdots + 710 = 16\,850 \\
\sum_{i=1}^{n}x_i^2 &= 610^2 + 510^2 + \cdots + 710^2 = 10\,254\,100
\end{align*}

Then $\bar{x} = 16\,850/28 = 601.8$ and

\[ s^2 = \frac{1}{27}\big(10\,254\,100 - (16\,850)^2/28\big) = 4\,222.6. \]

The standard deviation is $s = \sqrt{4\,222.6} = 65.0$.

The remaining measures of spread and location can readily be obtained from the five-number summaries of Example 17B. For example, for “Engineering”, the lower and upper quartiles were $x_{(l)} = 550$ and $x_{(u)} = 645$, and the interquartile range is $I = 645 - 550 = 95$. The extremes were $x_{(1)} = 500$ and $x_{(n)} = 750$, so the range is $R = 750 - 500 = 250$.

Calculation of these summary statistics for the remaining five categories yields the table:

                        Location               Spread
    First degree     $\bar{x}$  $x_{(m)}$     $s$     $I$    $R$
    Engineering        601.8      600         65.0     95    250
    Science            583.1      585         48.5     75    160
    Arts               555.6      535         62.8     75    210
    Commerce           567.0      555         45.7     60    140
    Medicine           638.0      640         53.1    105    120
    Other              581.7      585         80.1    100    220

In commenting on this table, we look first at the measures of location. The sample means show that students with first degrees in medicine had the highest mean GMAT score (638.0), followed by engineering students (601.8), and then science students (583.1). The lowest mean was recorded for arts students (555.6). The medians follow the same pattern, and apart from arts, the sample means and medians are relatively close. In the box-and-whisker plots in Example 17B, we saw that the distribution of GMAT scores for arts appeared to be strongly skewed to the right. Hence the difference between the sample mean (555.6) and the median (535) for this category of students is consistent with the earlier evidence of skewness.

For the measures of spread, it is evident that the category “Other” has the largest standard deviation (80.1), followed by engineering (65.0). The smallest standard deviation was for commerce (45.7). The interquartile ranges ($I$) and ranges ($R$) follow a broadly similar pattern.
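The short-cut formula used in Examples 24A and 25B needs only three quantities: the sample size, the sum of the values and the sum of their squares. A minimal Python sketch, with the hypothetical function name `summary_from_sums`, applied to the Engineering totals quoted in Example 25B:

```python
import math


def summary_from_sums(n, total, total_of_squares):
    """Mean, variance and standard deviation via the short-cut formula."""
    mean = total / n
    variance = (total_of_squares - total ** 2 / n) / (n - 1)
    return mean, variance, math.sqrt(variance)


# Engineering category of Example 25B: n = 28, sum = 16 850, sum of squares = 10 254 100.
mean, variance, sd = summary_from_sums(28, 16_850, 10_254_100)
print(f"{mean:.1f} {variance:.1f} {sd:.1f}")   # prints "601.8 4222.6 65.0"
```

A design caution, offered as a general remark rather than something stated in the text: in floating-point arithmetic the single subtraction in the short-cut formula can lose accuracy when the mean is very large compared with the spread, so software often prefers the definitional formula. For hand calculation with data like these, the short cut remains convenient.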
A plausible explanation as to why the category “Other” should have the largest standard deviation (and the second largest interquartile range and range) is that it encompasses a wide diversity of students, not falling into any of the single faculty categories.

The conclusions reached here provide a partial description of this MBA class of 81 students. If they were “representative” of all MBA students at all universities, we might be able to generalize the statements. Another worry that we would have before we could generalize the results relates to issues of sample size. Could the differences in the measures of location and spread we observed here occur just because we got an unusually bright group of, for example, medical students in this MBA class? We will defer further consideration of these statistical issues until chapter 8! In order to prepare ourselves for taking that kind of decision we have to learn some probability theory.

Example 26C: Calculate the sample mean and standard deviation, the median, the range and interquartile range of Paarl rainfall data (Example 20C).

Example 27C:

(a) Suppose that the sample mean and standard deviation of the $n$ numbers $x_1, x_2, \ldots, x_n$ are $\bar{x}$ and $s$. An additional observation $x_{n+1}$ becomes available. Show that the updated mean $\bar{x}^{*}$ is

\[ \bar{x}^{*} = \frac{n\bar{x} + x_{n+1}}{n+1} \]

and the updated standard deviation $s^{*}$ is

\[ s^{*} = \sqrt{\frac{1}{n}\left[(n-1)s^2 + n(\bar{x} - \bar{x}^{*})^2 + (x_{n+1} - \bar{x}^{*})^2\right]}. \]

(b) The sample mean of nine numbers is 4.8 with standard deviation 3.0. A 10th observation is made. It is 6.8. Update the mean and the standard deviation.

Exploratory data analysis...

The techniques we have learnt in this chapter have largely been aiming at getting a feel for a sample of data, a process somewhat grandly called exploratory data analysis. In the age of instant arithmetic and the personal computer, the temptation is to use the statistical methods of chapters 8 to 12 and beyond, and to accept the answers uncritically.
We have seen in this chapter that the presence of one or more outliers in a sample can have a pretty devastating effect on the sample mean and the sample standard deviation, the most frequently used summary statistics of all. Likewise, we have seen how skewness affects these statistics. Most statistical methods make a variety of assumptions; many of these can be checked out, visually at least, by the exploratory data analysis techniques described in this chapter. Many of these techniques have become part of the data analysis software of statistical packages. You are strongly encouraged to use them before you do more complex statistical analyses.

Solutions to examples...

2C The frequencies and relative frequencies, from which the bar graphs are constructed, are given in the table.

Major sector      (a) Frequency      (b) Percentage of All-Share Index
Coal              2                  0.82%
Diamonds          1                  8.27%
All-Gold
