IntroStat June 2014 PDF
University of Cape Town
2014
Les Underhill and Dave Bradfield
Summary
This is an introductory textbook on statistics. It assumes a basic understanding of differentiation and integration. The book is designed for students in business, commerce, and management who are taking a course in applied statistics. It is intended as a lecture-book: examples of three types are provided, some for use in lectures and some for independent study.
INTROSTAT

Les Underhill and Dave Bradfield
Department of Statistical Sciences
University of Cape Town
June 2014

Introduction

IntroSTAT apart, there seem to be two kinds of introductory Statistics textbooks. There are those that assume no mathematics at all, and get themselves tied up in all kinds of knots trying to explain the intricacies of Statistics to students who know no calculus. There are those that assume lots of mathematics, and get themselves tied up in the knots of mathematical statistics.

IntroSTAT assumes that students have a basic understanding of differentiation and integration. The book was designed to meet the needs of students, primarily those in business, commerce and management, for a course in applied statistics.

IntroSTAT is designed as a lecture-book. One of our aims is to maximize the time spent in explaining concepts and doing examples. It is for this reason that three types of examples are included in the chapters. Those labeled A are used to motivate concepts, and often contain explanations of methods within them. They are for use in lectures. The B examples are worked examples; they shouldn’t be used in lectures, because there is nothing more deadly dull than lecturing through worked examples. Students should use the B examples for private study. The C examples contain problem statements. A selection of these can be tackled in lectures without the need to waste time by the lecturer writing up descriptions of examples, and by the students copying them down.

There are probably more exercises at the end of most chapters than necessary. A selection has been marked with asterisks (∗); these should be seen as a minimum set to give experience with all the types of exercises.

Acknowledgements...

We are grateful to our colleagues who used previous versions of IntroSTAT and made suggestions for changes and improvements. We have also appreciated comments from students.
We will continue to welcome their ideas and hope that they will continue to point out the deficiencies. Mrs Tib Cousins undertook the enormous task of turning edition 4 into TeX files, which were the basis upon which this revision was undertaken. Mrs Margaret Spicer helped us proofread the text; Dr Derek Chalton and Mr Tim Low suggested the corrections and improvements that are incorporated into the second edition. We are grateful to have remaining errors pointed out to us.

This volume is essentially the 1996 edition of Introstat with some minor and major corrections of errors, and was reset in LaTeX from the original plain TeX version.

Contents

Introduction
1 EXPLORING DATA
2 SET THEORY
3 PROBABILITY THEORY
4 RANDOM VARIABLES
5 PROBABILITY DISTRIBUTIONS I
6 MORE ABOUT RANDOM VARIABLES
7 PROBABILITY DISTRIBUTIONS II
8 MORE ABOUT MEANS
9 THE t- AND F-DISTRIBUTIONS
10 THE CHI-SQUARED DISTRIBUTION
11 PROPORTIONS AND SAMPLE SURVEYS
12 REGRESSION AND CORRELATION
SUMMARY OF THE PROBABILITY DISTRIBUTIONS
TABLES

Chapter 1
EXPLORING DATA

KEYWORDS: Data summary and display, qualitative and quantitative data, pie charts, bar graphs, histograms, symmetric and skew distributions, stem-and-leaf plots, median, quartiles, extremes, five-number summary, box-and-whisker plots, outliers and strays, measures of spread and location, sample mean, sample variance and standard deviation, summary statistics, scatter plots, contingency tables, exploratory data analysis.

Facing up to uncertainty...

We live in an uncertain world. But we still have to take decisions. Making good decisions depends on how well informed we are. Of course, being well informed means that we have useful information to assist us. So, having useful information is one of the keys to good decision making. Almost instinctively, most people gather information and process it to help them take decisions.
For example, if you have several applicants for a vacant post, you would not draw a number out of a hat to decide which one to employ. Almost without thinking about it, you would attempt to gather as much relevant information as you can about them to help you compare the applicants. You might make a short-list of applicants to interview, and prepare appropriate questions to put to each of them. Finally you come to an informed decision.

Sometimes the available information is such that we feel it is easy to make a good decision. But at other times, so much confusion and uncertainty cloud the situation that we are inclined to go by “gut-feeling” or even by guessing. But we can do better than relying on our instinct and guesswork. This book aims to equip you with some of the necessary skills to “outguess” the competition. Or, putting it less brashly, to help you to make consistently sound decisions.

As the world becomes more technologically advanced, people realize more and more that information is valuable. Obtaining the information they need might just require a phone call, or maybe a quick visit to the library. Sometimes, they might need to expend more energy and extract some information out of a database. Or worse, they might have to design an experiment and gather some data of their own. On other occasions, the information might be hidden in historical records.

Usually, data contains information that is not self-evident. The message cannot be extracted by simply eye-balling the data. Ironically, the more valuable the information, the more deeply it usually lies buried within the data. In these instances, statistical tools are needed to extract the information from the data. Herein lies the focus of this book.

For example, consider the record of share prices on the Johannesburg Stock Exchange. Hidden in this data lies a wealth of information: whether or not a share is risky, or if it is over- or underpriced.
This data even contains traces of our own emotions (whether our sentiments are mawkish or positive, risk prone or risk averse) and our preferences (for higher dividends, for smaller companies, for blue-chip shares). Little wonder that there is a multitude of financial analysts out there trying to analyse share price data hoping that they might unearth valuable information that will deliver the promise of better profits.

Just as the financial analysts have an insatiable appetite for information on which to base better investment decisions, so in every field of human endeavour, people are analysing information with the objective of improving the decisions they take. One of the essential sets of ideas and skills needed to extract information from data, to interpret this information, and to take decisions based on it, is the subject called Statistics.

Not everyone is willing, or has the foresight, to master a course in the science of Statistics. We are fortunate that this is true; otherwise statisticians would not hold the monopoly on superior decision-making! You have already made at least one good decision: the decision to do a course in Statistics.

What Statistics is (and is not!)...

Most people seem to think that what statisticians do all day is to count, to add and to average. The two kinds of “statisticians” that most frequently impinge on the general public are really parodies of statisticians: the “sports” statistician and the “official” statistician. Statistics is not what you see at the bottom of the television screen during the French Open Tennis Championships: statistique, followed by a count of the number of double faults and aces the players have produced in the match so far! Nor is statistics about adding up dreary columns of figures, and coming to the conclusion, for example, that there were 30 777 000 sheep in South Africa in 1975. That sort of count is enough to put anyone to sleep!
If statisticians do not count in the 1–2–3 sense, in what sense do they count? What is statistics? We define statistics as the science of decision making in the face of uncertainty. The emphasis is not on the collection of data (although the statistician has an important role in advising on the data collection process), but on taking matters one step further: interpreting the data. Statistics may be thought of as data-based decision making. Perhaps it is a pity that our discipline is called Statistics. A far better name would have been Decision Science.

Statistics really comes into its own when the decisions to be made are not clear-cut and obvious, and there is uncertainty (even after the data has been gathered) about which of several alternative decisions is the best one to choose. For example, the decision about which card to play in a game of bridge to maximize your chance of winning, or the decision about where to locate a factory so as to maximize the likelihood that your company’s share of the market will reach a target value, are not simple decisions. In both situations, you can gather as much data as you can (the cards in your hand, and those already played in the first case; proximity to raw materials and to markets in the second), and take a best possible decision on the basis of this data, but there is still no guarantee of success. In both cases, your opposition may react in unexpected ways, and you risk defeat.

In the above sentences, the words “uncertainty”, “chance”, “likelihood” and “risk” have appeared. All these terms are qualitative and open to many interpretations. Before the statistician can get down to his or her real job (of taking decisions in the face of uncertainty), this nebulous concept of uncertainty has to be put onto a firm footing by being quantified in an objective way. Probability Theory is the branch of mathematics that achieves this quantification of uncertainty.
Therefore, before you can become a statistician, you have to learn a hefty chunk of Probability Theory. This material is included in chapters 2 to 7. Chapters 8 to 12 deal with the science of data-based decision making. However, in the remainder of this chapter, we aim to give you insight into what is to come in the later chapters, to give you a feeling for data, and to explore “data-based decision making” using intuitive concepts.

Display, summarize and interpret...

Before getting deeply involved with tackling any situation or problem in daily life, it is wise to take a step back and take a glimpse at “the big picture”, and so it is with Statistics. As a starting point, statisticians make a “quick and dirty” summary of the data they are about to analyse in order to get a “feel” for what they are dealing with. The initial overview usually involves: constructing a visual display of the data in a graph or perhaps a table; summarizing the data with a few pertinent “key” numbers; and gaining insight into the “potential” of the data.

What do we mean by data?...

Data is information. There are data drips and data floods, and statisticians have to learn to deal with both. Usually, there is either too little or too much data! When data comes in floods, the problem is to extract the salient features. When data comes in drips, the problem is to know what are valid interpretations.

Besides the various amounts of data, there are different types of data. For the moment, we need to distinguish between qualitative and quantitative data. Qualitative data is usually non-numerical (nominal), and arises when we classify objects using labels or names as categories: for example, make of car, colour of eyes, gender, nationality, profession, cause of death, etc. Sometimes the categories are semi-numerical (ordinal): for example, size of companies categorized as small, medium or large.
Quantitative data, on the other hand, is always numerical, and these data values can be ranked or ordered. Quantitative data usually arises from counting or measuring: for example, flying time between airports, number of rooms in a house, salary of an accountant, cost of building a school, volume of water in a dam, number of new car sales in a month, the size of the AIDS epidemic, etc.

Visual displays of qualitative data...

Two efficient ways of displaying qualitative data are the pie chart and the bar chart.

Table 1.1: MBA Student Data

      First        GMAT         First        GMAT         First      GMAT
      degree       score        degree       score        degree     score
  1.  Engineering  610     28.  Engineering  710     55.  Arts       500
  2.  Engineering  510     29.  Science      600     56.  Arts       620
  3.  Engineering  610     30.  Science      550     57.  Arts       550
  4.  Engineering  580     31.  Science      540     58.  Arts       600
  5.  Engineering  720     32.  Science      620     59.  Arts       520
  6.  Engineering  620     33.  Science      650     60.  Arts       520
  7.  Engineering  540     34.  Science      500     61.  Commerce   550
  8.  Engineering  500     35.  Science      590     62.  Commerce   520
  9.  Engineering  750     36.  Science      630     63.  Commerce   560
 10.  Engineering  640     37.  Science      660     64.  Commerce   560
 11.  Engineering  550     38.  Science      570     65.  Commerce   600
 12.  Engineering  650     39.  Science      600     66.  Commerce   540
 13.  Engineering  600     40.  Science      630     67.  Commerce   550
 14.  Engineering  600     41.  Science      500     68.  Commerce   650
 15.  Engineering  510     42.  Science      580     69.  Commerce   510
 16.  Engineering  570     43.  Science      560     70.  Commerce   560
 17.  Engineering  620     44.  Science      550     71.  Medicine   590
 18.  Engineering  590     45.  Arts         560     72.  Medicine   700
 19.  Engineering  660     46.  Arts         550     73.  Medicine   640
 20.  Engineering  550     47.  Arts         500     74.  Medicine   680
 21.  Engineering  560     48.  Arts         510     75.  Medicine   580
 22.  Engineering  630     49.  Arts         570     76.  Other      550
 23.  Engineering  540     50.  Arts         510     77.  Other      680
 24.  Engineering  560     51.  Arts         660     78.  Other      540
 25.  Engineering  650     52.  Arts         500     79.  Other      640
 26.  Engineering  540     53.  Arts         710     80.  Other      620
 27.  Engineering  680     54.  Arts         510     81.  Other      450

Example 1A: Table 1.1 lists data on a class of 81 Master of Business Administration (MBA) students. The table shows each student’s faculty for their first degree, in either Arts, Commerce, Engineering, Medicine, Science or “other”. Also given are their test scores for an entrance examination known as the GMAT, a test commonly used by business schools worldwide as part of the information to assist in the selection process. Our brief is to construct a visual summary of the distribution of students within the various first-degree categories in the table.

Firstly, we note that first-degree category, the data that we are being asked to display, is qualitative data. Appropriate display techniques include the frequency table, the pie chart and the bar graph.

Secondly, we find the frequency distribution of the qualitative data by counting the number of students within each category. At the same time, we calculate relative frequencies by dividing the frequency in each category by the total number of observations. We report the relative frequencies to a convenient small number of decimal places.

First degree    Frequency    Relative frequency
Engineering         28             0.35
Science             16             0.20
Arts                16             0.20
Commerce            10             0.12
Medicine             5             0.06
Other                6             0.07
Total               81             1.00

Thirdly, we plot the pie chart and the bar graph.

Pie chart: the actual construction of a pie chart is straightforward! We have arranged the segments in anti-clockwise order, starting at “three o’clock”, but there is no hard-and-fast rule about where to start and which direction to choose. The pie chart communicates most effectively if the relative frequencies are arranged in decreasing order of size. Rotate the textbook through 90° and 180° and note that the pie chart may appear different to its original form, even though it is actually the same graphical figure.

[Figure: Pie chart showing proportions of M.B.A. students. Segments: Engineering, Science, Arts, Commerce, Medicine, Other.]

Bar graph: We may use horizontal or vertical bars to convey the relative frequencies of categories. Our first example uses horizontal bars. Notice that there is no quantitative scale along the vertical axis of the bar graph, that the “bars” are not connected, and that the widths of the bars have no particular relevance. Because there is no quantitative ordering of the nominal categories, we are free to arrange them as we please. As for the pie chart, it is generally most effective to arrange the bars for nominal categories in decreasing order of relative frequency; this choice of ordering makes comparison easier, and also tends to highlight the important features of the data. Relative frequencies could also have been used in the construction of the bar graph. We would obtain similar bars but the horizontal axis would reflect proportions.
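The counting and dividing behind the frequency table of Example 1A can be sketched in a few lines of Python. This is only an illustrative sketch: the names `first_degrees` and `frequency_table` are our own, and the category list is built from the counts in Table 1.1 rather than read from a file.

```python
from collections import Counter

# First-degree category of each of the 81 MBA students, built from the
# category counts in Table 1.1 (hypothetical variable name).
first_degrees = (
    ["Engineering"] * 28 + ["Science"] * 16 + ["Arts"] * 16
    + ["Commerce"] * 10 + ["Medicine"] * 5 + ["Other"] * 6
)


def frequency_table(values):
    """Return (category, frequency, relative frequency) rows, largest first."""
    counts = Counter(values)
    n = len(values)  # total number of observations
    return [(cat, freq, round(freq / n, 2)) for cat, freq in counts.most_common()]


table = frequency_table(first_degrees)
for category, freq, rel in table:
    print(f"{category:<12}{freq:>4}{rel:>7.2f}")
```

Sorting the rows by decreasing frequency gives the ordering recommended above for both the pie chart and the bar graph.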
[Figure: Horizontal bar graph of the number of MBA students by first degree: Engineering 28, Science 16, Arts 16, Commerce 10, Other 6, Medicine 5. Horizontal axis: Number of MBA students (0 to 30).]

Visually the most striking impact of both the pie chart and the bar graph is that engineers form the largest proportion of this class of MBA students. Next, we might ask for a reason for this feature. A plausible explanation is that engineers are not exposed to much management and administrative training during their undergraduate years, and that they make up for this fact by doing an MBA.
A second explanation is that the data was extracted during recessionary times, when engineers were not in demand; perhaps they were “investing in themselves” by doing an MBA while projects were scarce. How would you set about investigating whether this latter explanation is correct?

Our diagrams have provided some insight into this data. Sometimes, all we achieve is to demonstrate the obvious. At other times, our charts and diagrams will reveal completely unexpected phenomena. Careful interpretation is then needed, frequently with the help of the “experts” from the discipline from which the data comes.

Example 2C: Table 1.2 gives the composition of the “All-Share Index” of the Johannesburg Stock Exchange as at 2 January 1990. The breakdown of the All-Share Index reflects that it is composed of seven “major sector” indices, namely Coal, Diamonds, All-Gold, Metals & Minerals, Mining Financial, Financial, and Industrial Indices.

(a) Construct a bar chart showing the number of shares in each of the seven major sector indices that contribute to the All-Share Index.

(b) Now construct a bar chart showing the relative weightings of each of the seven major sectors as a percentage of the All-Share Index, and comment on any differences you find from (a) above.

Table 1.2: Composition of the All Share Index on the JSE

Subsidiary sector index          No. of shares in    Percentage weighting    Major sector
                                 subsidiary index    in All-Share Index      index
Coal                                    2                  0.82              Coal
Diamonds                                1                  8.27              Diamonds
Gold - Rand and others                  5                  1.39              All-Gold
Gold - Evander                          2                  0.89              All-Gold
Gold - Klerksdorp                       3                  5.45              All-Gold
Gold - OFS                              4                  3.99              All-Gold
Gold - West Wits                        3                  7.27              All-Gold
Copper                                  1                  0.55              Metal & minerals
Manganese                               1                  0.91              Metal & minerals
Platinum                                2                  5.00              Metal & minerals
Tin                                     1                  0.01              Metal & minerals
Other metals & minerals                 3                  0.26              Metal & minerals
Mining houses                           3                 16.79              Mining financial
Mining holding                          3                  7.13              Mining financial
Banks & financial services              5                  2.80              Financial
Insurance                               4                  2.15              Financial
Investment trusts                       3                  1.02              Financial
Property                               12                  0.47              Financial
Property trusts                        11                  0.97              Financial
Industrial holdings                     6                  9.94              Industrial
Beverages & hotels                      2                  3.43              Industrial
Building & construction                 6                  0.92              Industrial
Chemicals                               2                  3.46              Industrial
Clothing, footwear & textiles          10                  0.62              Industrial
Electronics, electr. & battery          7                  1.39              Industrial
Engineering                             7                  0.95              Industrial
Fishing                                 1                  0.08              Industrial
Food                                    4                  2.14              Industrial
Furniture & household goods             5                  0.30              Industrial
Motors                                  6                  0.43              Industrial
Paper & packaging                       3                  2.26              Industrial
Pharmaceutical & medical                3                  0.48              Industrial
Printing & publishing                   3                  0.18              Industrial
Steel & allied                          2                  1.99              Industrial
Stores                                 10                  2.16              Industrial
Sugar                                   1                  0.43              Industrial
Tobacco & match                         1                  2.48              Industrial
Transportation                          3                  0.22              Industrial

Visual displays of quantitative data I: histograms...

Histograms are a time-honoured and familiar way of displaying quantitative data. A histogram differs from a bar chart in two ways. The histogram has bars that are not separated from each other; that is, there is no gap between adjacent bars. The bars within a histogram do not correspond to named categories, as in the bar chart. In the histogram the bars correspond to intervals on the number line. These intervals are constructed so that they are all of equal length. The length of the interval is selected so that it is easy to construct the intervals on the number line, but also to ensure we have a suitable number of intervals. We demonstrate the construction procedure by means of an example.

Example 3A: Referring back to the data on GMAT scores in Example 1A, draw a histogram for the distribution of the GMAT scores for the students.
We recommend a four-step histogram procedure for quantitative data:

1. Determine the size of the sample [1], i.e. the number of observations. We have n = 81 students, and throughout this book we reserve use of the symbol n for the concept of sample size, the number of observations we are dealing with! Find the smallest and largest numbers in the sample. Call these $x_{\min}$ and $x_{\max}$, respectively. The smallest GMAT score was from student 81, who scored 450, and the largest was the 750 achieved by student 9:

   $$x_{\min} = 450 \qquad x_{\max} = 750$$

2. Choose class intervals that cover the range from $x_{\min}$ to $x_{\max}$. Here are two guidelines that help determine an approximate length L of the class intervals: the first is due to Mr Sturge, the second is used by the computer package GENSTAT. If the class intervals are made too narrow, the histogram looks “spikey”, and if intervals are too wide, the histogram is “blurred”.

   Sturge says: use class intervals of approximate length L where

   $$L = \frac{x_{\max} - x_{\min}}{1 + \log_2 n} = \frac{x_{\max} - x_{\min}}{1 + 1.44 \log_e n}$$

   GENSTAT says: use class intervals of approximate length L where

   $$L = \frac{x_{\max} - x_{\min}}{\sqrt{n}}.$$

   For our data, Sturge says

   $$L = \frac{x_{\max} - x_{\min}}{1 + 1.44 \log_e n} = \frac{750 - 450}{1 + 1.44 \log_e 81} = 40.94,$$

   while GENSTAT says

   $$L = \frac{x_{\max} - x_{\min}}{\sqrt{n}} = \frac{750 - 450}{\sqrt{81}} = 33.33.$$

   [Footnote 1: We consider a sample to be a small number of elements taken from the population of interest. Each element in the sample provides an observed value for the numerical variable of interest; these values become our dataset. We hope that the sample is representative of the population as a whole, so that the sample data represent the population data. Then conclusions drawn from the sample data will be valid also for the complete population data. We consider methods of obtaining a representative sample in Chapter 11.]

   As a general rule, avoid choosing class intervals which are of awkward lengths. Multiples of 2, 5 and 10 are most frequently used.
   Feel free to choose intervals between half and double those suggested by the Sturge or GENSTAT guidelines. All the class intervals should be the same width. Resist the temptation to make the class intervals wider over that part of the range where the data is sparser; unequal widths have the effect of destroying the visual message of the histogram.

   For this example, L = 50 is a sensible choice for the width of the class interval. It is convenient to start our class intervals at 450, and carry on in steps of 50 as far as is necessary to include the maximum observed value. Thus the boundaries of the class intervals are at 450, 500, 550, 600, 650, 700, 750, and 800. We also need to agree that scores that fall on any boundary will be allocated to the higher of the two class intervals, so strictly the class intervals are 450–499, etc., as shown in the frequency distribution table below.

3. Count the number of GMAT scores within each class interval. The most convenient way to do this counting is to set up a tick sheet, and to make one pass through the data allocating each score to its class interval. This process sets up a frequency distribution:

   Class interval    Frequency
   450–499               1
   500–549              21
   550–599              25
   600–649              19
   650–699              10
   700–749               4
   750–799               1
   Total                81

   In some textbooks and computer programmes the class intervals are called bins and interval length L is called the bin width. Thus the class frequency is called the bin frequency.

4. Plot the histogram, choosing suitable scales for each axis:

[Figure: Histogram of GMAT scores. Vertical axis: Number of MBA students (0 to 25). Horizontal axis: GMAT scores (400 to 800).]

[Figure: Histogram of GMAT scores. Vertical axis: Relative frequency of MBA students as a percentage (%) (0 to 30). Horizontal axis: GMAT scores (400 to 800).]

[Figure: Cumulative frequency plot of GMAT scores, with the 1st quartile, median and 3rd quartile marked. Vertical axis: Cumulative frequency (0 to 80). Horizontal axis: GMAT scores (450 to 800).]

The striking feature of the histogram for the GMAT data above is that it is not symmetric but is skewed to the right, which means that it has a long tail stretching off to the right. The terms “symmetric”, “skewed to the right” and “tail” are technical jargon, but their meanings are obvious.

A seasoned statistician would expect a distribution of test scores (or examination results) to have a tail at both ends of the frequency distribution. In the above display, there has been a truncation of the distribution at 500 (apart from a single score of 450). We would infer that the acceptance criterion on the MBA programme is a GMAT score of 500 or more. In reality there is a tail on the left, but it is suppressed by the fact that applicants who achieved these lower scores were not accepted into the MBA programme. In the light of this information, a statistician would also query the score of 450. Is it an error in the data? Maybe it should be 540, and there has been a transcription error. But a more plausible explanation is that the student was outstanding in some other aspect of the selection process; maybe the personal interview was very impressive!

Example 4B: The risks taken by investors when they invest in the stock exchange are of considerable interest to financial analysts. Investors associate the risks of investments with how volatile (or varying) the price changes are. Analysts measure volatility of price changes using the “standard deviation”, a statistical measure of variability that we will learn about later in this chapter. The table below reports the standard deviations (or riskiness) of a sample of 75 shares listed on the Johannesburg Stock Exchange. The units of the data values are per cent per month. Construct a suitable histogram of the data.
23 22 17 18 21 25 23 25 12 23 27 14 28  9 23
19 23 11 16 11 15 15 12 12 12 21 13 11 13 13
27 20 17  8 13 28 14  9 13 11 23 23 10 12 12
26 25 11 12 20 22 21  9 13 19 19 13 14 15 17
17 10 25 26 11 12 25 22 12 11 22 20 14 10 23

1. The sample size is n = 75, and the extreme values are xmin = 8 and xmax = 28.

2. Sturges' rule says L = (28 − 8)/(1 + 1.44 ln 75) = 2.8, while GENSTAT says L = (28 − 8)/√75 = 2.3. So a sensible length for the class interval is 2, and we use class interval boundaries at 8, 10, 12, ..., 28. Effectively, the class intervals are 8–9, 10–11, ..., 28–29.

3. Count the number of share standard deviations falling into each class:

Class interval   Frequency
 8–9                 4
10–11               10
12–13               16
14–15                7
16–17                5
18–19                4
20–21                6
22–23               12
24–25                5
26–27                4
28–29                2
Total               75

Finally, we plot the histogram:

[Figure: histogram of risk. Horizontal axis: Risk (10 to 30). Vertical axis: number of shares.]

[Figure: relative frequency histogram of risk. Horizontal axis: Risk (10 to 30). Vertical axis: relative frequency of shares as a percentage (%).]

[Figure: cumulative frequency plot of risk (10 to 30), with the 1st quartile, median and 3rd quartile marked.]

The striking feature of this histogram is that it has two clear peaks. In statistical jargon, it is said to be bimodal. The visual display of this information has thus revealed information which was not at all obvious from even a careful search through the 75 values in the table of data.

The financial analyst now needs an explanation for the bimodality. Further investigation revealed that “gold shares” were predominantly responsible for the peak on the right, while “industrial shares” were found to be responsible for the peak on the left. The histogram reveals that gold shares generally have a substantially higher risk than industrial shares. In layman's terms, we conclude that gold shares are generally more “volatile” than industrial shares.
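The arithmetic in steps 1 to 3 of Example 4B can be checked with a short Python sketch. The variable names (`risk`, `width`, `frequencies`) are illustrative choices rather than anything defined in the text, and the class width of 2 is simply the value settled on in step 2.

```python
import math

# Standard deviations (% per month) of the 75 shares in Example 4B.
risk = [
    23, 22, 17, 18, 21, 25, 23, 25, 12, 23, 27, 14, 28, 9, 23,
    19, 23, 11, 16, 11, 15, 15, 12, 12, 12, 21, 13, 11, 13, 13,
    27, 20, 17, 8, 13, 28, 14, 9, 13, 11, 23, 23, 10, 12, 12,
    26, 25, 11, 12, 20, 22, 21, 9, 13, 19, 19, 13, 14, 15, 17,
    17, 10, 25, 26, 11, 12, 25, 22, 12, 11, 22, 20, 14, 10, 23,
]

n = len(risk)
x_min, x_max = min(risk), max(risk)

# Two rules of thumb for the class-interval length L (step 2).
sturges_length = (x_max - x_min) / (1 + 1.44 * math.log(n))
genstat_length = (x_max - x_min) / math.sqrt(n)

# Class width chosen in the example: classes 8-9, 10-11, ..., 28-29.
width = 2
num_classes = (x_max - x_min) // width + 1
frequencies = [0] * num_classes
for x in risk:
    frequencies[(x - x_min) // width] += 1

print(f"n = {n}, Sturges L = {sturges_length:.1f}, GENSTAT L = {genstat_length:.1f}")
for k, count in enumerate(frequencies):
    lower = x_min + k * width
    # A row of asterisks gives a rough histogram on its side.
    print(f"{lower:2d}-{lower + width - 1:2d}  {count:2d}  {'*' * count}")
```

The two peaks discussed above show up in the printed rows for the classes 12–13 and 22–23.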
Example 5C: Plot a histogram to display the examination marks (as percentages) of 25 students and comment on the shape of the histogram:

68 72 39 50 69 52 51 50 41 52 65 37 45 78 48 55 53 61 71 42 57 34 57 66 87

Example 6C: A company that produces timber is interested in the distribution of the heights of their pine trees. Construct a histogram to display the heights, in metres, of the following sample of 30 trees:

18.3 19.1 17.3 19.4 17.6 20.1 19.9 20.0 19.5 19.3
17.7 19.1 17.4 19.3 18.7 18.2 20.0 17.7 20.0 17.5
18.5 17.8 20.1 19.4 20.5 16.8 18.8 19.7 18.4 20.4

Visual displays of quantitative data II: stem-and-leaf plots

Stem-and-leaf plots are a relatively new display technique. The visual effect is very similar to that of the histogram; however, they have the advantage that additional information is represented: the original data values can be extracted from the display. Thus stem-and-leaf plots can be used as a means of data storage.

We learn how to construct a stem-and-leaf plot by means of an example, using an old dataset. The procedure is simple.

Example 7A: At the end of the 1983/84 English football season, the points scored by each club were as follows:

Arsenal              63     Nottingham Forest     71
Aston Villa          60     Notts County          41
Birmingham City      48     Queens Park Rangers   73
Coventry City        50     Southampton           71
Everton              59     Stoke City            50
Ipswich Town         53     Sunderland            52
Leicester City       51     Tottenham Hotspur     61
Liverpool            79     Watford               57
Luton Town           51     West Bromwich         51
Manchester United    74     West Ham United       60
Norwich City         50     Wolverhampton         29

To produce the stem-and-leaf plot for the points scored by all the teams, we split each number into a “stem” and a “leaf”. In this example, the natural split is to use the “tens” as stems, and the “units” as leaves. Because the numbers range from the 20s to the 70s, our stems run from 2 to 7. We write the stems in a column:

stems   leaves
2
3
4
5
6
7

We now make one pass through the data.
We split each number into its “stem” and its “leaf”, and write the “leaf” on the appropriate “stem”. The first number, the 63 points scored by Arsenal, has stem “6” and leaf “3”. We write a “3” as a leaf on stem “6”:

stems   leaves
2
3
4
5
6       3
7

Aston Villa's 60 points become leaf “0” on stem “6”. Birmingham City's 48 points are entered as leaf “8” on stem “4”. After the first six scores have been entered, we have:

stems   leaves
2
3
4       8
5       093
6       30
7

Continue until leaves have been entered for all 22 numbers:

stems   leaves        count
2       9               1
3                       0
4       81              2
5       0931100271     10
6       3010            4
7       94131           5
                       22

We append a third column in which we enter the count of the number of leaves on each stem, add up the counts, and check that we have entered the right number of leaves!

The final step is to sort the leaves on each stem from smallest to largest, and to add a cumulative count column:

        sorted                 cum.
stems   leaves        count    count
2       9               1        1
3                       0        1
4       18              2        3
5       0001112379     10       13
6       0013            4       17
7       11349           5       22

What have we created? Essentially, we have a histogram on its side, with class intervals of length 10. But in addition, we have retained all the original information. In a histogram, we would only have known that five teams scored between 70 and 79 points; now we know that there were scores of 71 (two teams), 73, 74, and Liverpool's league-winning 79!

Example 8A: For the data of example 1A, produce and compare stem-and-leaf plots of the GMAT scores of students with Engineering and Arts backgrounds.

All the GMAT scores ended in a zero, so this common feature gives us no useful information about variability of the scores; therefore we may use the hundreds as “stems” and the tens as “leaves”. For both categories of students, we would then have only three stems, “5”, “6” and “7”. Looking back at the histogram display of GMAT scores (example 3A), we note that we used class intervals of width 50 units.
We can create class intervals of this width in the stem-and-leaf plot, as demonstrated below.

Engineering:

stems   leaves      count
5·      140144        6
5!      8579566       7
6·      11240023      8
6!      5658          4
7·      21            2
7!      5             1
                     28

Arts:

stems   leaves      count
5·      01101022      8
5!      6575          4
6·      20            2
6!      6             1
7·      1             1
7!                    0
                     16

In this approach, we split each 100 into two stems; the first is labelled “·” and encompasses the leaves from 0 to 4, the second is labelled “!” and includes the leaves from 5 to 9. The final step is to sort the leaves for each stem.

ENGINEERING

        sorted      cum.
stems   leaves      count
5·      011444        6
5!      5566789      13
6·      00112234     21
6!      5568         25
7·      12           27
7!      5            28

ARTS

        sorted      cum.
stems   leaves      count
5·      00011122      8
5!      5567         12
6·      02           14
6!      6            15
7·      1            16
7!                   16

Again, a skewness to the right is evident in both displays. Striking, too, is the observation that students with an engineering background tend to have GMAT scores in the upper 500s and lower 600s, whereas the majority of arts background students have scores in the 500s. Although the sample sizes are small, this contrast in pattern seems marked enough to suggest that amongst MBA students, the engineers perform better on the GMAT test, on average, than the arts students.

If splitting “stems” into two parts seems inadequate for the data set on hand, here is a system for splitting them into five!

Example 9B: Produce a stem-and-leaf plot for the risk data of Example 4B.

As for the histogram, it would be sensible to use stems of width 2. Each stem is therefore split into five: 0 and 1 are denoted “·”, 2 and 3 are denoted “t”, 4 and 5 are denoted “f”, 6 and 7 are denoted “s”, and 8 and 9 are denoted “!”. Notice the convenient mnemonics: English is a marvellous language!

        sorted                        cum.
stems   leaves              count     count
0!      8999                  4         4
1·      0001111111           10        14
1t      2222222223333333     16        30
1f      4444555               7        37
1s      67777                 5        42
1!      8999                  4        46
2·      000111                6        52
2t      222233333333         12        64
2f      55555                 5        69
2s      6677                  4        73
2!      88                    2        75

Note that we have presented the stem-and-leaf plot with the leaves already sorted.
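The basic construction of Example 7A (tens as stems, units as leaves, then sorting and adding counts) can be sketched in a few lines of Python. The function name `stem_and_leaf` and its return format are hypothetical choices made for this illustration, and the sketch does not attempt the split stems used in Examples 8A and 9B.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return rows (stem, sorted leaves, count, cumulative count).

    Assumes non-negative whole numbers, with the tens as stems
    and the units as leaves, as in Example 7A.
    """
    leaves_on_stem = defaultdict(list)
    # One pass through the data: split each number and file the leaf.
    for v in values:
        stem, leaf = divmod(v, 10)
        leaves_on_stem[stem].append(leaf)

    rows = []
    cumulative = 0
    # Include empty stems so the display keeps its histogram shape.
    for stem in range(min(leaves_on_stem), max(leaves_on_stem) + 1):
        leaves = sorted(leaves_on_stem.get(stem, []))
        cumulative += len(leaves)
        rows.append((stem, "".join(str(d) for d in leaves), len(leaves), cumulative))
    return rows


# Points scored in the 1983/84 season (Example 7A), read down the two columns.
points = [63, 60, 48, 50, 59, 53, 51, 79, 51, 74, 50,
          71, 41, 73, 71, 50, 52, 61, 57, 51, 60, 29]

rows = stem_and_leaf(points)
for stem, leaves, count, cumulative in rows:
    print(f"{stem:>2} | {leaves:<10} {count:>2} {cumulative:>3}")
```

Because the leaves are kept as individual digits, the original data values can be read back off each row, which is the advantage over the histogram noted earlier.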
Example 10C: Produce a stem-and-leaf plot for the examination marks of another group of 25 students.

50 79 53 85 50 53 65 58 43 45 48 51 54 72 71 61 51 72 53 39 67 27 43 69 53

Example 11C: The maximum temperatures (°C) at 20 towns in southern Africa one summer's day are given in the following table. Produce a stem-and-leaf plot.

Pietersburg     30     Windhoek           32
Pretoria        20     Cape Town          22
Johannesburg    28     George             21
Nelspruit       30     Port Elizabeth     18
Mmabatho        33     East London        17
Bethlehem       30     Beaufort West      23
Bloemfontein    31     Queenstown         22
Kimberley       31     Durban             26
Upington        28     Pietermaritzburg   27
Keetmanshoop    28     Ladysmith          30

Example 12C: In order to assess the prices of the television repair industry, a faulty television set was taken to 37 TV repair shops for a quote. The data below represents the quoted prices in rands. Construct a stem-and-leaf plot and comment on its features.

 60  55 158  38  48 120  85 245  90  60
 49  38  98 185 200 150 140  75 125 125
125 145 200 145  94 165 105  75  75 120
 36 150 120 176  60  78  28

Five-number data summaries: median, lower and upper quartiles, extremes

At the beginning of this chapter, we said that statisticians obtained a feel for the data they were about to analyse in two ways. We have now dealt with the first way, that of constructing a visual display. Now we move on to the second way, by computing a few “key” numbers which summarize the data. Our aim now is to reduce a large batch of data to just a few numbers which we can grasp simultaneously, and thus help us to understand the important features of the data set as a whole.

It is useful now to introduce the concept of rank. In a numerical dataset of size n sorted from smallest to largest, the smallest number is said to have rank 1, the second smallest rank 2, ..., the largest rank n. We call the smallest number x(1) (so x(1) = xmin), the second smallest x(2), ..., and the largest is x(n) (so x(n) = xmax). We use x(r) for the number with rank r.
The cumulative count column of a stem-and-leaf plot makes it easy to find the observation with any given rank.

We use x(r+½) to denote the number half-way between the numbers with rank r and rank r + 1:

$$x_{(r+\frac{1}{2})} = \frac{x_{(r)} + x_{(r+1)}}{2}.$$

We say that x(r+½) is the number with rank r + ½. Such numbers are called half-ranks.

We define the median of a batch of n numbers as the number which has rank (n + 1)/2. We use x(m) to denote the median. If n is an odd number, then the rank of the median will be a whole number, and the median will be the “middle number” in the data set. But if n is even, then the rank will be a half-rank, and the median will be the average of the “two middle numbers” in the data set.

The lower quartile is defined to be the number with rank l = ([m] + 1)/2, where m is the rank of the median. The notation [m] means that if m is something-and-a-half, we drop the half! The alternative to doing this is having to define “quarter ranks”! The upper quartile has rank u = n − l + 1. The lower and upper quartiles are denoted x(l) and x(u), respectively.

The extremes, the smallest and largest values in a data set, have ranks 1 and n, and we agreed earlier to call them x(1) and x(n), respectively.

These five numbers provide a useful summary of the batch of data, called, with complete lack of imagination, the five-number summary. We write them from smallest to largest:

(x(1), x(l), x(m), x(u), x(n)).

Example 13A: Find the five-number summary for the end-of-season football points of Example 7A.

An easy way to find the five-number summary is to use the stem-and-leaf plot.

        sorted                 cum.
stems   leaves        count    count
2       9               1        1
3                       0        1
4       18              2        3
5       0001112379     10       13
6       0013            4       17
7       11349           5       22

Because n = 22, the median has rank m = (n + 1)/2 = (22 + 1)/2 = 11½. We need to average the numbers with ranks 11 and 12. From the cumulative count, we see that the last leaf on stem 4 has rank 3, and the last leaf on stem 5 has rank 13.
Counting along stem 5, we find that 53 is the number with rank 11 and 57 has rank 12. Thus the median is the average of these two numbers, (53 + 57)/2 = 55; we write x(m) = 55. Half the teams scored below 55 points, half scored above 55 points.

The lower quartile has rank l = ([m] + 1)/2 = ([11½] + 1)/2 = (11 + 1)/2 = 6. The observation with rank 6 is 50, thus x(l) = 50.

The upper quartile has rank u = n − l + 1 = 22 − 6 + 1 = 17. The observation with rank 17 is 63, thus x(u) = 63.

The five-number summary is: (29, 50, 55, 63, 79).

Why is this a big deal? Because it tells us that...

1. Half the teams scored below 55 points, half scored above 55 points, because 55 is the median.

2. Half the teams scored between 50 and 63 points, because these two numbers are the lower and upper quartiles.

3. A quarter of the scores lay between 29 and 50, a quarter between 50 and 55, a quarter between 55 and 63, and a quarter between 63 and 79.

4. All the scores lay between 29 and 79.

Example 14B: Find the five-number summaries for GMAT scores of both engineering and arts students. Use the stem-and-leaf plot of example 8A.

ENGINEERING

        sorted               cum.
stems   leaves      count    count
5·      011444        6        6
5!      5566789       7       13
6·      00112234      8       21
6!      5568          4       25
7·      12            2       27
7!      5             1       28

ARTS

        sorted               cum.
stems   leaves      count    count
5·      00011122      8        8
5!      5567          4       12
6·      02            2       14
6!      6             1       15
7·      1             1       16
7!                    0       16

For the engineers, the median has rank m = (28 + 1)/2 = 14½. Thus x(m) = (600 + 600)/2 = 600. The lower quartile has rank l = ([m] + 1)/2 = ([14½] + 1)/2 = 7½, and the upper quartile has rank u = n − l + 1 = 28 − 7½ + 1 = 21½. So x(l) = (550 + 550)/2 = 550, and x(u) = (640 + 650)/2 = 645. The five-number summary is (500, 550, 600, 645, 750).

For the arts students, the median has rank m = (16 + 1)/2 = 8½. Thus x(m) = (520 + 550)/2 = 535. The lower quartile has rank l = ([m] + 1)/2 = ([8½] + 1)/2 = 4½, and the upper quartile has rank u = n − l + 1 = 16 − 4½ + 1 = 12½. So x(l) = 510, and x(u) = 585.
The five-number summary is (500, 510, 535, 585, 710).

For the engineers, the median GMAT score was 600; by contrast, for arts students, it was only 535. The central 50% of engineers obtained scores in the interval from 550 to 645, while the central 50% of arts students were in a downwards-shifted interval, 510 to 585. This comparison of the central 50% ranges reinforces our earlier interpretation that engineers tend to have higher GMAT scores than arts students.

Example 15C: Find the five-number summaries for the data of (a) Example 10C, (b) Example 11C, and (c) Example 12C.

Visual displays of quantitative data III: box-and-whisker plots

Five-number summaries can be displayed graphically by means of box-and-whisker plots. (This ridiculous name was invented by the American statistician who invented the method, John Tukey, who also invented both the name “stem-and-leaf plot” and the plot itself! John Tukey was not only an inventor of crazy names; he also made an enormous impact on the theory and practice of the discipline Statistics.)

Once again, we will use an example to describe how to produce a box-and-whisker plot.

Example 16A: Produce a box-and-whisker plot for the football team points of Example 7A, using the five-number summary (29, 50, 55, 63, 79) computed in Example 13A.

The procedure is simple:

1. Draw a vertical axis which covers at least the complete range of all the data values.

2. Draw a “box” from the lower to the upper quartile.

3. Draw a line across the box at the median.

4. Draw “whiskers” from the box out to the extremes.

Applied to the five-number summary (29, 50, 55, 63, 79), this procedure yields the box-and-whisker plot in Figure 1.1.

[Figure 1.1: Football team points (n = 22). Box-and-whisker plot on a vertical axis labelled “Points” (0 to 100), with the lower extreme (29), lower quartile (50), median (55), upper quartile (63) and upper extreme (79) marked.]

Box-and-whisker plots are especially useful when we wish to compare two or more sets of data.
To achieve this comparison, we construct the plots side-by-side. It is essential to use the same vertical scale for all the plots that are to be compared.

Example 17B: Draw a series of box-and-whisker plots to compare the GMAT scores of each category of MBA students.

We computed the five-number summaries of the GMAT scores for engineering and arts students in Example 14B. The five-number summaries for all the categories of the data in example 1A are:

Engineering   (500, 550, 600, 645, 750)   n = 28
Science       (500, 550, 585, 625, 660)   n = 16
Arts          (500, 510, 535, 585, 710)   n = 16
Commerce      (510, 540, 555, 600, 700)   n = 10
Medicine      (580, 585, 640, 690, 700)   n = 6
Other         (460, 540, 585, 640, 680)   n = 5

The box-and-whisker plots, shown side-by-side, reveal the differences between the various categories of students.

[Figure: side-by-side box-and-whisker plots of GMAT scores (vertical axis 400 to 800) for the categories ENG., SCI., ARTS, COM., MED. and OTH.]

We see from a comparison of the box-and-whisker plots that the students in this class with a medical background had the highest median GMAT score, followed by engineers, with arts students having the lowest median. The skewness to the right (now shown as a long whisker pointing upwards!) which we commented on earlier for the class as a whole, is also evident for engineering, science, arts and commerce students, the categories for which the sample sizes were large.

Outliers and strays

In many data sets, there are one or more values that appear to be very different to the bulk of the observations. Intuitively, we recognize these values because they are a long way from the median of the data set as a whole. We can make our box-and-whisker plots more informative by plotting and labelling some of the outlying values in such a way that they are highlighted and our attention immediately drawn to them.
These outlying values may be valid but they could well represent errors that have crept into the data, either when the observation was made, or when the numbers were transcribed from one sheet of paper to another, or when they were entered into a computer, or even when they were being transferred from one computer to another. On the other hand, if outlying values represent genuine observations, they may be of special interest and importance. In any event, these observations need to be checked, and either confirmed or rejected.

It is useful to have rules that will aid us to identify such observations. Outliers are those observations which are greater than

x(m) + 6(x(u) − x(m))

or less than

x(m) − 6(x(m) − x(l))

and we label them boldly on the box-and-whisker plot. Less outlying values called strays are those observations which are not outliers but are greater than

x(m) + 3(x(u) − x(m))

or less than

x(m) − 3(x(m) − x(l))

and we label them less boldly on the plot. The largest and smallest observations which are not strays are called the fences (more Tukeyisms!). When outliers and strays are being portrayed in a box-and-whisker plot, the convention is to take the whiskers out as far as the fences, not the extremes. This strategy helps to isolate and highlight the outlying values.

In any event, it is sometimes helpful to identify a few values of special interest or importance in a box-and-whisker plot.

Example 18A: The university computing service provides data on the amount of computer usage (hours) by each of 30 students in a course:

Student no.  Usage     Student no.  Usage     Student no.  Usage
AD483          53      AM044           2      AS677          36
CI144           7      CS572          25      EK817          20
FV246          38      GM337          36      GR803          33
HN050          48      JK314          84      JR894         154
JV670          31      KM232          35      LJ419          44
LW032          48      MA276          69      MJ076          95
PH544           4      PS279          60      RR676          18
SA831          51      SC186          47      SS154          37
TB864          11      VO822          41      WG794          34
WB909          73      YG007          38      ZP559         125

Is the lecturer justified in claiming that particular students appear to be making excessive use of the computer (playing games?) while the usage of others is so low that she is suspicious that they are not doing the computing work themselves?

The stem-and-leaf plot is

        sorted       cum.
stems   leaves       count
 0      247            3
 1      18             5
 2      05             7
 3      134566788     16
 4      14788         21
 5      13            23
 6      09            25
 7      3             26
 8      4             27
 9      5             28
10                    28
11                    28
12      5             29
13                    29
14                    29
15      4             30

The five-number summary is (2, 31, 38, 53, 154).

The outliers were those observations greater than

x(m) + 6(x(u) − x(m)) = 38 + 6(53 − 38) = 128

or less than

x(m) − 6(x(m) − x(l)) = 38 − 6(38 − 31) = −4.

There was only one outlier, the usage of 154 hours by student JR894.

The strays were those observations which were not identified as outliers but were greater than

x(m) + 3(x(u) − x(m)) = 38 + 3(53 − 38) = 83

or less than

x(m) − 3(x(m) − x(l)) = 38 − 3(38 − 31) = 17.

There are seven strays: four students (AM044 (2 hours), PH544 (4 hours), CI144 (7 hours) and TB864 (11 hours)) are at the low usage end, and three (JK314 (84 hours), MJ076 (95 hours) and ZP559 (125 hours)) are at the high usage end.

The fences are the outermost observations that were not strays, and are 18 hours and 73 hours.

The box-and-whisker plot, with the outlier labelled boldly and the strays merely labelled, is:

[Figure: box-and-whisker plot of computer usage. Vertical axis: Hours (0 to 150). The box shows the lower quartile (31), median (38) and upper quartile (53). The outlier JR894 is marked boldly; the strays ZP559, MJ076 and JK314 (high usage) and TB864, CI144, PH544 and AM044 (low usage) are marked with open circles.]

The lecturer now has a list of students whose computer utilization appears to be suspicious.
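The rank rules for the five-number summary and the cut-offs for outliers and strays can be collected into a small Python sketch and run on the data of Example 18A. The function names (`value_at_rank`, `five_number_summary`, `classify`) are hypothetical, and the sketch follows the definitions used in this chapter (quartile rank l = ([m] + 1)/2, multipliers 6 and 3), which differ from the quartile and outlier conventions built into many software packages.

```python
def value_at_rank(sorted_x, r):
    """Value with 1-based rank r; r may be a whole rank or a half-rank."""
    whole = int(r)  # drops the half, like [m] in the text
    if r == whole:
        return sorted_x[whole - 1]
    # Half-rank: average the two neighbouring observations.
    return (sorted_x[whole - 1] + sorted_x[whole]) / 2


def five_number_summary(data):
    x = sorted(data)
    n = len(x)
    m = (n + 1) / 2        # rank of the median
    l = (int(m) + 1) / 2   # rank of the lower quartile
    u = n - l + 1          # rank of the upper quartile
    return (x[0], value_at_rank(x, l), value_at_rank(x, m),
            value_at_rank(x, u), x[-1])


def classify(data):
    """Return (outliers, strays, fences) using the chapter's rules."""
    _, q_low, med, q_up, _ = five_number_summary(data)
    out_low, out_high = med - 6 * (med - q_low), med + 6 * (q_up - med)
    stray_low, stray_high = med - 3 * (med - q_low), med + 3 * (q_up - med)
    outliers = sorted(v for v in data if v < out_low or v > out_high)
    strays = sorted(v for v in data
                    if (v < stray_low or v > stray_high) and v not in outliers)
    inside = [v for v in data if stray_low <= v <= stray_high]
    return outliers, strays, (min(inside), max(inside))


# Computer usage in hours for the 30 students of Example 18A.
usage = [53, 2, 36, 7, 25, 20, 38, 36, 33, 48, 84, 154, 31, 35, 44,
         48, 69, 95, 4, 60, 18, 51, 47, 37, 11, 41, 34, 73, 38, 125]

summary = five_number_summary(usage)
outliers, strays, fences = classify(usage)
print("five-number summary:", summary)
print("outliers:", outliers)
print("strays:", strays)
print("fences:", fences)
```

Keeping the rank arithmetic in `value_at_rank` means the same helper serves the median and both quartiles, whether the rank is whole or a half-rank.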
Example 19C: A company that produces breakfast cereals is interested in the protein content of wheat, the basic raw material of its products. The protein content (percentage of mass) of 29 samples of wheat was recorded as follows:

 9.2  8.0 10.9 11.6 10.4  9.5  8.5  7.7  8.0 11.3
10.0 12.8  8.2 10.5 10.2 11.9  8.1 12.6  8.4  9.6
11.3  9.7 10.8 83   10.8 11.5 21.5  9.4  9.7

Confirm the statistician's conclusion that the values 83 and 21.5 are outliers.

The statistician asked that these values should be investigated. Checking back to the original data, it was discovered that 83 should have been 8.3, and 21.5 should have been 12.5. Transposed digits and misplaced decimal points are two of the most frequent types of error that occur when data is entered into a computer.

Example 20C: A winery is concerned about the possible impact of “global warming” on the grape crop. It was able to obtain some interesting historical rainfall data going back to 1884 from a wine-producing region. The rainfall (mm) in successive Januaries at Paarl for the 22-year period 1884–1905 was recorded as follows:

Year   Rain     Year   Rain     Year   Rain
1884    2.6     1892   37.8     1900    3.0
1885    4.9     1893    0.0     1901  145.1
1886   16.3     1894    0.0     1902   39.7
1887   21.6     1895   52.3     1903  105.9
1888    6.1     1896    4.1     1904   17.8
1889    0.0     1897    6.4     1905   10.6
1890    0.0     1898   15.8
1891    1.1     1899   27.7

(a) Produce a stem-and-leaf plot.
(b) Find the five-number summary.
(c) Draw the box-and-whisker plot, showing outliers and strays, if any.

“Statistics” in Statistics

Within the discipline Statistics, we give a precise technical definition to the concept of a statistic. A statistic is any quantity determined from the data values of a sample. Thus the median is “a statistic”, and so are the other four numbers that make up a five-number summary. These quantities are examples of summary statistics, because they endeavour to summarize specific aspects of the information concealed within the sample data.
We now learn about a further bunch of “statistics”.

Measures of location and spread

We use the term measure of location to describe any statistic that purports to locate the “middle”, in some sense, of the data set. For example, confronted by a collection of data on house prices, we would use a measure of location to answer the question: What is the typical price of a house? The next questions might be: How much variability is there in house prices? What is the difference between the price of a cheap house and that of an expensive house? Measures of spread are designed to provide answers to these two questions.

In the next few sections we consider a few of the most important measures of location, and then some measures of spread.

The sample median

The median, which we denoted x(m), locates the “middle” of the data in the sense that half the observations from the sample are smaller than the median and half are larger than the median. To find the median it is necessary to sort or rank the data values from the smallest value to the largest. Remember that if the sample size n is an odd number, the median is the “middle” observation, but if n is even, the sample median is the average of the “two middle” observations.

The sample mean

The sample mean is, with good justification, the most important measure of location. It is found by adding together all the values of a variable in the sample data, and dividing this total by n, the sample size.

We introduce a subscript notation to describe a sample of size n. We denote the first observation we make on the variable X as x1, the second x2, ..., the nth xn. Then the sample mean of the X values, almost universally denoted x̄ (pronounced “x bar”), is defined to be

$$\bar{x} = (x_1 + x_2 + \cdots + x_n)/n = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The sample mean locates the “middle” of the batch of data values for the variable X in a special way. It is equivalent to hanging a 1 kg mass at points x1, x2, ...,
xn along a ruler (of zero mass), and then x̄ is the point at which the ruler balances. (The masses need not be 1 kg, but they must all be equal!)

The mean is much easier to calculate than the median. The mean requires a single pass through the data, adding up the values. In contrast, the data needs to be sorted before the median can be computed, an operation which often requires several passes through the data.

Example 21A: Find the sample mean of the dividend yields of 15 shares in the paper and packaging sector of the Johannesburg Stock Exchange. Also find the median. Compare the mean and the median. The yields are expressed as percentages.

Copi       3.3     E. Haddon    7.6     Pr. Paper    6.7
Caricar    8.4     Kohler       7.1     Prs. Sup     2.9
Coates    10.7     Metal Box    6.6     Sappi        7.5
Consol     6.0     Metaclo      8.6     Trio Rand    8.2
DRG        9.6     Nampak       5.8     Xactics      3.0

We sum the 15 dividend yields and divide by 15:

x̄ = (3.3 + 8.4 + 10.7 + · · · + 8.2 + 3.0)/15 = 6.80 (%)

The stem-and-leaf plot for these 15 data values is shown below:

        sorted              cum.
stems   leaves     count    count
 2      9            1        1
 3      03           2        3
 4                   0        3
 5      8            1        4
 6      067          3        7
 7      156          3       10
 8      246          3       13
 9      6            1       14
10      7            1       15

The median has rank m = (15 + 1)/2 = 8, and thus x(m) = 7.1%.

In this example, there is little difference between the two measures of location. But this is not always the case...

Example 22A: Find the mean and the median of the weekly volume of the same 15 shares as in Example 21A. The weekly volume is the number of shares traded in a week.

Copi       2 300     E. Haddon         0     Pr. Paper       700
Caricar    2 100     Kohler          100     Prs. Sup          0
Coates     3 100     Metal Box   111 400     Sappi        40 600
Consol     1 200     Metaclo         700     Trio Rand    84 100
DRG       31 800     Nampak          100     Xactics      45 900

The sample mean is

x̄ = (2 300 + 2 100 + · · · + 84 100 + 45 900)/15 = 21 607 (shares traded per week).

Sort the data, locate the middle (8th) value, and find that the median is

x(m) = 2 100 (shares traded per week).

The mean is just over 10 times larger than the median. What has gone wrong?
Nothing; it is simply that the mean and the median locate the “middle” of the data according to different sets of rules! In this example, the mean has been dragged upwards by a few large values, so that only five of the fifteen numbers are larger than the mean. But even if a million Metal Box shares had been traded during the week, the median would have remained the same!

The median is an example of a measure of location which is said, in statistical jargon, to be robust. The mean is not robust, being sensitive to outlying values in the data set. Because the mean is not robust, it is important to be aware of possible outliers in any sample of data for which the mean is being computed.

The mean and the median tend to be close to each other when the distribution of the values is symmetric and there are no outlying observations. The mean and median differ increasingly as the distribution of the data becomes more and more skew. The observations in the long tail of a skew distribution drag the mean in the direction of the tail. The sample mean of a very skew distribution might give a totally misleading impression of the “middle” of the data set.

There are no hard-and-fast rules which state when to use the sample mean and when to use the median as a measure of location. In general terms, the median is good for most sets of data. The sample mean is most useful when the data has a symmetric distribution. Data with a long tail to the right can be made more symmetric by taking logarithms or taking square roots of all the data values. Such manipulations to the original data values are called transformations.

The sample mean has mathematical advantages over the median. The mean is a FAR easier statistic than the median for mathematical statisticians to use in algebraic manipulations. A vast amount of statistical theory has been developed for the sample mean, and for this reason it is the predominant measure of location used in sophisticated statistical methods.
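The contrast between the mean and the median in Example 22A, and the “million Metal Box shares” thought experiment, can be reproduced with a few lines of Python. The helper names `mean` and `median` and the list `inflated` are illustrative only; they are written out by hand here so that the single pass for the mean and the sort for the median are visible.

```python
def mean(values):
    # A single pass through the data: add up and divide by n.
    return sum(values) / len(values)


def median(values):
    # The data must be sorted before the middle can be located.
    x = sorted(values)
    n = len(x)
    mid = n // 2
    if n % 2 == 1:
        return x[mid]
    return (x[mid - 1] + x[mid]) / 2


# Weekly volumes of the 15 shares in Example 22A.
volumes = [2300, 0, 700, 2100, 100, 0, 3100, 111400, 40600,
           1200, 700, 84100, 31800, 100, 45900]

# Thought experiment: a million Metal Box shares traded instead of 111 400.
inflated = [1_000_000 if v == 111400 else v for v in volumes]

print("mean:", round(mean(volumes)), " median:", median(volumes))
print("after inflating Metal Box -> mean:", round(mean(inflated)),
      " median:", median(inflated))
print("values above the mean:", sum(v > mean(volumes) for v in volumes))
```

Only the mean responds to the inflated value; the median stays where it was, which is what is meant by calling it robust.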
Measures of spread...

Measures of spread give insight into the variability of a set of data. Two measures of spread can be defined in an obvious way from the five-number summary. They are: the range R, defined as R = x(n) − x(1), and the interquartile range I, defined as I = x(u) − x(l).

The range is unreliable as a measure of spread because it depends only on the smallest and largest values in the sample, and is thus as sensitive as it can possibly be to outlying values in the sample. It is the ultimate example of a non-robust statistic! On the other hand, the interquartile range is the length of the interval covering the central half of the data values in the sample, and it is not sensitive to outliers in the data. The interquartile range is a robust measure of spread.

The sample variance and its square root, the sample standard deviation, have the same advantage over the range and interquartile range that the mean had over the median: easier algebraic manipulation. Therefore the sample variance is frequently the only measure of spread calculated for a set of data.

The sample variance, denoted by s², is defined by the formula

\[
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 .
\]

In words, it is the sum of the squared differences between each data value and the sample mean, with this sum being divided by one less than the number of terms in the sum. The sample standard deviation, denoted by s, is the square root of the sample variance.

It is a nuisance to have these two measures of spread, s and s², one of which is simply the square root of the other. Why have both? The standard deviation is the easier of the two measures of spread to use intuitively, largely because it is measured in the same units as the original data. The variance is measured in “squared units”, an awkward quantity to visualize or keep in mind.
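As a rough sketch, the four measures of spread can be computed in Python for the dividend yields of Example 21A. The quartile rule is an assumption: `statistics.quantiles` with `method="inclusive"` is used here for convenience, and it need not agree with the rule used for the five-number summary. The variable names are ours.

```python
import math
import statistics

# Dividend yields (%) of the 15 shares in Example 21A.
yields = [3.3, 7.6, 6.7, 8.4, 7.1, 2.9, 10.7, 6.6, 7.5,
          6.0, 8.6, 8.2, 9.6, 5.8, 3.0]

n = len(yields)
x_bar = sum(yields) / n

# Range R: largest value minus smallest value.
data_range = max(yields) - min(yields)

# Interquartile range I: upper quartile minus lower quartile.
# ASSUMPTION: the "inclusive" quartile rule of the standard library,
# which may differ from the rule behind the five-number summary.
lower_q, _, upper_q = statistics.quantiles(yields, n=4, method="inclusive")
iqr = upper_q - lower_q

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in yields) / (n - 1)

# Sample standard deviation: square root of the sample variance.
s = math.sqrt(s2)

print(f"range = {data_range:.1f}, interquartile range = {iqr:.1f}")
print(f"variance = {s2:.2f}, standard deviation = {s:.2f}")
```

The hand-written variance agrees with the library function `statistics.variance`, which also divides by n − 1. In the output, s² is in squared percentage points, whereas R, I and s are in the same units as the yields themselves.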
For example, if the data consists of prices measured in rands, the sample variance has units “squared rands” (whatever that means!), but the standard deviation is in “rands”. Even worse, if the data consists of percentages, the sample variance has units “%²”, whereas the standard deviation has the intelligible units “%”. But mathematical statisticians prefer to work with the variance: not having to deal with a square root in the algebra makes their lives simpler and neater. So the two equivalent measures of spread co-exist side by side, and we just have to come to terms with both of them.

Example 23A: Compute the sample variance s² and the standard deviation s for the dividend yields of the 15 shares of Example 21A.

We have computed x̄ = 6.8%. So

\[
\begin{aligned}
s^2 &= \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
    &= \frac{1}{15-1} \left[ (3.3 - 6.8)^2 + (8.4 - 6.8)^2 + (10.7 - 6.8)^2 + \cdots + (8.2 - 6.8)^2 + (3.0 - 6.8)^2 \right] \\
    &= \frac{1}{14} \left[ (-3.5)^2 + (1.6)^2 + (3.9)^2 + \cdots + (1.4)^2 + (-3.8)^2 \right] \\
    &= \frac{1}{14} \left[ 12.25 + 2.56 + 15.21 + \cdots + 1.96 + 14.44 \right] \\
    &= \frac{1}{14} \left[ 75.62 \right] \\
    &= 5.40 \ (\%^2)
\end{aligned}
\]

The standard deviation is s = √5.40 = 2.32%.

The variance and the standard deviation are never negative. This fact is guaranteed, because all the terms in the sum are squared, so that none of them can be negative, even though some of the individual differences are negative. (Both measures are zero only when all the data values are equal.)

The variance can be calculated more efficiently by a short-cut formula. The “short cut” involves reducing the number of subtractions needed to calculate the variance, from n to just 1 subtraction. Examine the following steps carefully:

\[
\begin{aligned}
(n-1)s^2 &= \sum_{i=1}^{n} (x_i - \bar{x})^2 \\
         &= \sum_{i=1}^{n} (x_i^2 - 2\bar{x}\,x_i + \bar{x}^2) \\
         &= \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} 2\bar{x}\,x_i + \sum_{i=1}^{n} \bar{x}^2
\end{aligned}
\]

The third term involves adding x̄² to itself n times. So it is equal to n x̄². But

\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i , \quad \text{so} \quad n\,\bar{x}^2 = \frac{1}{n} \Big( \sum_{i=1}^{n} x_i \Big)^2 .
\]

The second term in the sum above can also be rewritten:

\[
\begin{aligned}
\sum_{i=1}^{n} 2\bar{x}\,x_i &= 2\bar{x} \sum_{i=1}^{n} x_i \\
                             &= \frac{2}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i
\end{aligned}
\]

n