Introduction to Statistics PDF
University of Lagos
Summary
This textbook introduces basic statistical concepts, including data collection methods, descriptive and inferential statistics, and various data types. The book covers topics like populations and samples, variables, and data classification. It also demonstrates using frequency distributions and grouping data for analysis. Exercises are included to reinforce understanding.
CHAPTER ONE
INTRODUCTION TO STATISTICS

1.1 STATISTICS

Statistics is the science of collecting, classifying, presenting, and interpreting data. Our society has developed into one where science and technology affect everything around us, and statistics is one of the most important of these scientific tools. Virtually all facets of our lives are affected by statistics. Statistics has become a necessary element in most academic fields, including the sciences, engineering, business, political science, economics, psychology, sociology, education, medicine, nursing, and other health-related areas. Statistics is the universal language of the sciences.

Statistics is more than just a “kit of tools”. As potential users of statistics, we need to master the “art” of using these tools correctly. Careful use of statistical methods enables us to:
(1) Accurately describe the findings of scientific research
(2) Make decisions
(3) Make estimations

The field of statistics can be roughly subdivided into two areas: descriptive statistics and inferential statistics. Descriptive statistics is what most people think of when they hear the word statistics. It includes the collection, presentation, and description of data. The term inferential statistics refers to the technique of interpreting the values resulting from the descriptive techniques and then using them to make decisions and draw conclusions about the population.

Understanding Basic Statistics

1.2 INTRODUCTION OF BASIC TERMS

Some basic terms that will be used throughout this book are presented below.

Population: A collection, or set, of individuals or objects whose properties are to be analyzed. The concept of a population is the most fundamental idea in statistics. The population of concern must be carefully defined and is considered fully defined only when its membership list of elements is specified. The set of “all students who take the Algebra course in year two” is an example of a well-defined population.
Another is the set of “all lecturers in the University of Lagos”. A population need not consist of people; it may also be a collection of animals, manufactured objects, or anything else. There are two types of population, finite and infinite. When the members of a population can be physically counted, the population is said to be finite. It is infinite when its members cannot be counted. The set of students in the University of Lagos is a finite population. The set of all registered voters in Nigeria is a very large finite population. On the other hand, the population of all stars in the sky and the population of all grains of sand at the seashores of the world are regarded as infinite.

Sample: A sample is a subset of a population. It consists of the individuals, objects, or measurements selected by the sample collector from the population. For example, the set of male students in the Department of Mathematics is a subset of the set of students in the department. The set of Toyota cars parked at the faculty car park is a subset of the cars parked at the faculty car park.

Variable: A variable is a characteristic of interest about each individual element of a population or sample. A student’s name, matriculation number, year, and department are all variables.

Data: Data are raw facts or unprocessed information. There are basically two types of data:
(1) Data obtained from numeric or quantitative information
(2) Data obtained from non-numeric or qualitative information
These classifications are given below:

Data
    Numeric or Quantitative
        Discrete
        Continuous
    Non-Numeric or Qualitative
        Ordinal
        Categorical

Non-numeric data are values that cannot be quantified, for example matriculation number, tribe, country, etc. Data in this form are either categorical or ordinal. Examples of ordinal non-numeric data are students’ height group and age group, while sex, country, tribe, etc. are examples of categorical non-numeric data.
Numeric data can be subdivided into two classifications: (1) discrete numeric data and (2) continuous numeric data. Counts will always yield discrete numeric data, e.g. the number of students in a school. A measure of a quantity will usually be continuous, e.g. the weight of weight lifters.

Statistic: A statistic is a quantity whose numerical value can be obtained from data. A statistic is a value that describes a sample. Most sample statistics are found with the aid of formulas. Examples are the mean, median, and mode.

Experiment: A planned activity whose results yield a set of data is known as an experiment.

1.3 DATA COLLECTION

One of the first problems a statistician faces is obtaining data. Data can be collected directly from respondents or from an established data bank. Data collected directly from the source or respondents are known as primary data, and those from an established data bank are known as secondary data. Primary data collection for statistical analysis is an involved process and includes the following important steps:
1. Defining the objectives of the survey or experiment. Example: estimating the average height of female students in UNILAG.
2. Defining the target population.
3. Defining the strategy and method to be used for data collection and data measuring.
4. Ascertaining the appropriate descriptive or inferential data analysis to employ.

There are two methods used to collect data: experiments and surveys. In an experiment, the investigator controls or modifies the environment and observes the effect on the response variable. This is common in laboratories. In a survey, data are obtained by sampling some population of interest. Various methods that might be used to obtain sample data from surveys are presented below. When selecting a sample for a survey, it is necessary to construct a sampling frame.
A sampling frame is a list of the elements that belong to the population from which the sample is drawn. An example is a list of all students in year one, Mathematics department.

1.4 EXERCISES

1.4.1 Select fifty students currently enrolled in your department and collect data for these three variables:
1. number of courses enrolled in
2. total cost of textbooks
3. method of payment used for textbooks
a) What is the population?
b) Is the population finite or infinite?
c) What is the sample?
d) Classify the responses for each of the three variables as non-numeric data, discrete data, or continuous data.

1.4.2 Identify each of the following as examples of (1) non-numeric, (2) discrete, or (3) continuous variables:
a) The hair colour of people in a concert show.
b) The number of hours required to heal a patient of a disease.
c) The length of time required to answer a telephone call at a certain business center.
d) The number of pages per job coming off a computer printer.
e) The kind of trees used as Christmas trees.
f) The number of voters in a community.
g) Whether a statement is true or false.
h) The number of books in a library.

1.4.3 Define and explain the following terms:
a) Population
b) Sample
c) Statistic
d) Statistics
e) Variable
f) Data
g) Experiment

CHAPTER TWO
SUMMARY AND DISPLAY OF DATA

2.1 FREQUENCY DISTRIBUTION

Listing a large set of data does not present much of a picture to the reader. Sometimes we want to condense the data into a more manageable form. This can be accomplished with the aid of a frequency distribution. Let us demonstrate the concept of a frequency distribution by using the following set of data:

1 5 3 4 1 3 2 5 2 4 1 3
2 0 1 2 1 2 0 2 1 4 5 3

Let x represent these data values. We can use a frequency distribution to represent this set of data by listing the x values with their frequencies in Table 2.1.
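The counting step behind such a frequency distribution can be sketched in Python. This is only an illustration and not part of the original text: it assumes the standard-library `collections.Counter`, and the names `data` and `frequency` are chosen here for convenience.

```python
from collections import Counter

# The 24 observations listed above
data = [1, 5, 3, 4, 1, 3, 2, 5, 2, 4, 1, 3,
        2, 0, 1, 2, 1, 2, 0, 2, 1, 4, 5, 3]

# Counter maps each distinct value x to the number of times it occurs (f)
frequency = Counter(data)

# List the x values in ascending order with their frequencies
for x in sorted(frequency):
    print(x, frequency[x])
```

For x = 0 to 5 the counts come out as 2, 6, 6, 4, 3, 3, and they add up to the 24 observations.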
Table 2.1: Frequency distribution
x    0    1    2    3    4    5
f    2    6    6    4    3    3

Where there are many different entries for x and several low frequencies, it often makes sense to combine the data in groups or classes. Let us demonstrate this with the following example:

55 60 61 35 41 43 50 78 72 83
45 70 76 31 49 65 79 83 41 86
53 62 52 47 38 57 64 78 47 54
43 73 85 48 66 48 85 86 82 48
56 84 37 57 57 45 95 45 73 39

The following guidelines and terminology will be used to group continuous-type data into classes of equal length. These guidelines can also be used for sets of discrete data that have a large range.
1. Determine the largest (maximum) and smallest (minimum) observations. The range is the difference, R = maximum – minimum.
2. A frequency distribution should have a minimum of 5 classes and a maximum of 20. For small data sets, use between 5 and 10 classes. For large data sets, use up to 20 classes.
3. Each data entry must fall into one and only one class.
4. There should be no gaps. Moreover, if there are no entries for a particular class, that class must still be included with a frequency of 0.
5. The first interval should begin about as much below the smallest value as the last interval ends above the largest.
6. The intervals are called class intervals and the boundaries are called class boundaries.
7. The class limits are the smallest and largest possible observed values in a class.
8. The class mark is the midpoint of a class.

We set up the following classes for the above data: 30 – 39, 40 – 49, 50 – 59, etc. We now create a summary table below in Table 2.2.
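The grouping guidelines above can be sketched in Python. This is a minimal illustration under stated assumptions, not the book's own procedure: the names `marks`, `class_width` and `first_lower_limit` are chosen here, and the width of 10 starting at 30 simply matches the classes 30 – 39, 40 – 49, and so on.

```python
# The 50 observations listed above
marks = [55, 60, 61, 35, 41, 43, 50, 78, 72, 83,
         45, 70, 76, 31, 49, 65, 79, 83, 41, 86,
         53, 62, 52, 47, 38, 57, 64, 78, 47, 54,
         43, 73, 85, 48, 66, 48, 85, 86, 82, 48,
         56, 84, 37, 57, 57, 45, 95, 45, 73, 39]

class_width = 10         # assumed width, matching 30-39, 40-49, ...
first_lower_limit = 30   # assumed lower limit of the first class

# Guideline 1: range = maximum - minimum
data_range = max(marks) - min(marks)

# Build consecutive class limits until the largest observation is covered
classes = []
lower = first_lower_limit
while lower <= max(marks):
    classes.append((lower, lower + class_width - 1))
    lower += class_width

# Guideline 3: each observation is counted in exactly one class
frequencies = [sum(lo <= m <= hi for m in marks) for lo, hi in classes]

for (lo, hi), f in zip(classes, frequencies):
    print(f"{lo} - {hi}: {f}")
```

With these choices the range is 95 – 31 = 64 and the seven classes from 30 – 39 to 90 – 99 receive frequencies 5, 13, 9, 6, 8, 8 and 1, which account for all 50 observations.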
Table 2.2: Frequency distribution
Class   Class limits   Tally            Frequency   Class mark   Relative frequency
1       30 – 39        IIII             5           34.5         10%
2       40 – 49        IIII IIII III    13          44.5         26%
3       50 – 59        IIII IIII        9           54.5         18%
4       60 – 69        IIII I           6           64.5         12%
5       70 – 79        IIII III         8           74.5         16%
6       80 – 89        IIII III         8           84.5         16%
7       90 – 99        I                1           94.5         2%
Total                                   50                       100%

Tables like this show us how the data are spread out or distributed; we call this a frequency distribution table or simply a frequency distribution. The relative frequency for a class is the number of entries in the class divided by the total number of entries. For example, the relative frequency of class 50 – 59 is

\[ \frac{9}{50} \times 100\% = 18\% \]

The next type of tabular display is known as a cumulative frequency distribution, which (as its name suggests) contains a column for the running cumulative total of frequencies for all classes. The cumulative frequency of a class is the total of all class frequencies up to and including the present class. The cumulative frequency distribution of the example given above is as follows:

Table 2.3: Cumulative frequency table
Class   Class limits   Frequency   Cumulative frequency   Relative cum. frequency
1       30 – 39        5           5                      10%
2       40 – 49        13          18                     36%
3       50 – 59        9           27                     54%
4       60 – 69        6           33                     66%
5       70 – 79        8           41                     82%
6       80 – 89        8           49                     98%
7       90 – 99        1           50                     100%

Relative cumulative frequency is also called percentage cumulative frequency. For example, the relative cumulative frequency for class 60 – 69 is

\[ \frac{33}{50} \times 100\% = 66\% \]

2.2 GRAPHIC PRESENTATION OF DATA

One of the most helpful ways to become acquainted with a set of data is to use an initial exploratory technique that results in a pictorial representation of the data. The displays visually reveal patterns of behaviour of the variable being studied. There are several graphic (pictorial) ways to describe data. The type of data and the idea to be presented determine the method used.
Data can be presented graphically in many ways, such as the line graph, dot plot display, bar chart, pie chart, histogram, cumulative frequency curve (Ogive), and stem-and-leaf display.

2.2.1 Dot Plot Display

Dot plots display the data of a sample by representing each piece of data with a dot positioned along a scale. This scale can be either horizontal or vertical. The frequency of the values is represented along the other scale. Dot plots are usually used to represent the frequency distribution of a discrete variable. The dot plot display is a convenient technique to use as you first begin to analyze the data. It gives a picture of the data and also sorts the data into numerical order.

Example 2.1: The weights, in kilograms, of a random sample of 20 children were taken in a hospital and are presented below:

23 22 26 28 22 29 30 25 26 27
21 23 27 26 25 29 30 26 25 28

Construct a dot plot of these data.

Solution

[Figure 2.1: Dot plot of weights of children. Horizontal axis: weight (20 to 30); vertical axis: frequency.]

Example 2.2: Use Table 2.2 to construct a dot plot display.

Solution

[Figure 2.2: Dot plot display. Horizontal axis: mark (classes 30-39 to 90-99); vertical axis: frequency (2 to 14).]

2.2.2 Bar Chart

To construct a bar chart, we start with horizontal and vertical axes. We label the quantity being studied horizontally from left to right. The markings along the horizontal axis should correspond to the limits of the classes in the frequency distribution. The corresponding frequency in each class is measured vertically upward. A vertical bar is then drawn across each class interval with height equal to the frequency for that class. We could also draw a bar chart by using the relative frequencies instead of the frequencies for each class. The relative frequencies are measured along the vertical axis as percentages.

Example 2.3: Use Table 2.2 to construct a frequency bar chart and a bar chart.
Solution

[Figure 2.3: Frequency bar chart. Horizontal axis: mark (classes 30-39 to 90-99); vertical axis: frequency (2 to 14).]

[Figure 2.4: Bar chart. Horizontal axis: class limits (30, 39, 49, ..., 99); vertical axis: frequency (2 to 14).]

Example 2.4: A computer anxiety questionnaire was given to 300 children in a computer course. One of the questions was “I enjoy using computer.” The responses to this particular question are given in Table 2.4. Draw a bar chart of the responses.

Table 2.4
Response             Number
Strongly Agree       60
Agree                85
Slightly Agree       40
Slightly Disagree    50
Disagree             35
Strongly Disagree    30

Solution

[Figure 2.5: Bar chart of responses. Horizontal axis: response; vertical axis: frequency (20 to 100).]

Example 2.5: The following table shows the intake through JAMB by the Faculty of Science of a certain University in three consecutive years.

Table 2.5
Department          2002   2003   2004
Botany              43     40     35
Chemistry           28     35     42
Zoology             45     40     35
Computer Science    33     25     28
Physics             40     35     38
Mathematics         35     42     45
Biology             37     40     42
Total               261    257    265

Draw (i) a component bar chart and (ii) a multiple bar chart, department by department, for the three years.

Solution

(i) [Figure 2.6: Component bar chart for JAMB Admission. Horizontal axis: department; vertical axis: frequency (0 to 140); components: 2002, 2003, 2004.]

(ii) [Figure 2.7: Multiple bar chart for JAMB Admission. Horizontal axis: department; vertical axis: frequency (0 to 50); bars: 2002, 2003, 2004.]

2.2.3 Pie Chart

The pie chart (circle graph) is used to display relative frequencies rather than actual frequencies for the data. We draw a circle and then divide it into a series of wedges or slices to represent each class in the relative frequency distribution. The size of each slice is proportional to the percentage of the data that fall into the corresponding class.
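The proportionality rule for pie slices can be sketched in Python using the responses of Table 2.4. This is an illustrative sketch, not part of the original text: the dictionary `responses` and the names `angles` and `percentages` are assumptions made here, and each slice is simply given the same share of 360 degrees as its class has of the data.

```python
# Responses to "I enjoy using computer" from Table 2.4
responses = {
    "Strongly Agree": 60,
    "Agree": 85,
    "Slightly Agree": 40,
    "Slightly Disagree": 50,
    "Disagree": 35,
    "Strongly Disagree": 30,
}

total = sum(responses.values())  # 300 children in all

# Slice angle in degrees: (number in class / total) x 360
angles = {name: count * 360 / total for name, count in responses.items()}

# The same shares expressed as percentages of the whole circle
percentages = {name: count * 100 / total for name, count in responses.items()}

for name in responses:
    print(f"{name}: {angles[name]:.0f} degrees, {percentages[name]:.0f}%")
```

The angles always add up to 360 degrees, which is a convenient check on the arithmetic before drawing the chart.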
Example 2.6: Represent the responses in Example 2.4 in a pie chart.

Solution

\[ \text{Angle for each class} = \frac{\text{Number in the class}}{\text{Total number of observations}} \times 360^\circ \]

[Figure 2.8: Pie chart for response. Slices: Strongly Agree 20%, Agree 28%, Slightly Agree 13%, Slightly Disagree 17%, Disagree 12%, Strongly Disagree 10%.]

Response             Number   Angle
Strongly Agree       60       72°
Agree                85       102°
Slightly Agree       40       48°
Slightly Disagree    50       60°
Disagree             35       42°
Strongly Disagree    30       36°
Total                300      360°

2.2.4 Histogram

The histogram is a type of bar chart representing an entire set of data. A histogram is made up of the following components:
a. A title, which identifies the population of concern.
b. A vertical scale, which identifies the frequencies in the various classes.
c. A horizontal scale, which identifies the variable. Values for the class boundaries, class limits, or class marks may be labeled along the x-axis. Use whichever one of these sets of class numbers best represents the variable.

Using Table 2.2, we draw the histogram.

Table 2.6
Class   Class limits   Frequency   Class boundaries   Class center
1       30 – 39        5           29.5 – 39.5        34.5
2       40 – 49        13          39.5 – 49.5        44.5
3       50 – 59        9           49.5 – 59.5        54.5
4       60 – 69        6           59.5 – 69.5        64.5
5       70 – 79        8           69.5 – 79.5        74.5
6       80 – 89        8           79.5 – 89.5        84.5
7       90 – 99        1           89.5 – 99.5        94.5

[Figure 2.9: Histogram of Marks. Horizontal axis: marks (class boundaries 29.5, 49.5, 69.5, 89.5); vertical axis: frequency (2 to 14).]

In the histogram, a single vertical line between the first two boxes replaces the gap between 39 and 40. However, it is not clear what the vertical line should represent: is it 39 or 40 or something else? To resolve this ambiguity, we agree that the vertical line represents 39.5, which is the class boundary between the two classes. In the same way, the next vertical line represents 49.5 and so forth.

Another type of graphical display is the frequency polygon. To construct this type of graph, we first determine the measurement corresponding to the midpoint of each class.
This value is called the class mark, class center, or class midpoint and is given by

\[ \text{Class center} = \frac{\text{lower limit} + \text{upper limit}}{2} \]

For example, in the class 29.5 to 39.5, the class center is

\[ \text{Class center} = \frac{29.5 + 39.5}{2} = 34.5 \]

[Figure 2.10: Frequency polygon. Horizontal axis: marks (0 to 100); vertical axis: frequency (0 to 14).]

The mode is the value of the piece of data that occurs with the greatest frequency. From Figure 2.9, the mode is 46. To obtain this, use a ruler to connect both the right and left top edges of the tallest bar to the bars on either side of it, then locate the point of intersection of these two lines and trace it down to the horizontal axis using a vertical broken line. Where this line meets the x-axis is the mode. The modal class is the class with the highest frequency. A data set with two modes is called bimodal. A data set with three modes is called trimodal; if there are more than three modes, it is called multimodal.

We now present a relative frequency histogram. Note that the total area of this histogram is equal to one. Its shape is the same as that of the histogram. The relative frequency histogram g(x) is

\[ g(x) = \frac{\text{Number in a class}}{(\text{Total number of observations}) \times (\text{class interval})} \]

Example 2.7: The following 30 gains of some private entrepreneurs were recorded to the nearest 1 million Naira.

1 1 1 1 1 1 1 1 1 1
3 3 3 3 4 4 6 8 9 10
12 12 13 14 28 32 34 36 39 40

Construct the relative frequency histogram.

Solution

Let c0, c1, c2, c3, and c4 represent the class boundaries. Let c0 = 0.5 and c1 = 3.5 with 14 observations in between; c2 = 10.5 with 6 observations; c3 = 29.5 with 5 observations; and c4 = 40.5 with 5 observations.
This yields the following relative frequency histogram:

\[
g(x) =
\begin{cases}
\dfrac{14}{(30)(3)}, & 0.5 < x \le 3.5 \\[1ex]
\dfrac{6}{(30)(7)}, & 3.5 < x \le 10.5 \\[1ex]
\dfrac{5}{(30)(19)}, & 10.5 < x \le 29.5 \\[1ex]
\dfrac{5}{(30)(11)}, & 29.5 < x \le 40.5
\end{cases}
\]

It is important to note that, when class intervals have unequal lengths, the areas of the rectangles, not their heights, are proportional to the frequencies.

[Figure 2.11: Relative frequency histogram. Horizontal axis: gains (5 to 40); vertical axis: g(x) (0.02 to 0.16).]

2.2.5 Cumulative Frequency Curve (Ogive)

A frequency distribution can easily be converted to a cumulative frequency distribution by replacing the frequencies with cumulative frequencies. This was shown in Table 2.3. The same information can be presented by using a relative cumulative frequency distribution (see Table 2.3), which combines the cumulative frequency idea and the relative frequency idea.

The vertical scale represents either the cumulative frequencies or the relative cumulative frequencies. The horizontal scale represents the upper class boundaries. Until the upper class boundary of a class has been reached, you cannot be sure you have accumulated all the data in that class. Therefore, the horizontal scale for an Ogive is always based on the upper class boundaries. Every Ogive starts on the left with a relative cumulative frequency of zero at the lower class boundary of the first class and ends on the right with a relative cumulative frequency of 100% at the upper class boundary of the last class.

Example 2.8: Prepare an Ogive from Table 2.3.
a. Give the estimates of the quartiles.
b. Find the median.
c. Estimate the 30th and 70th percentiles.
d. Obtain the range, interquartile range, and semi-interquartile range.
e. What number of students scored marks between 60% and 80%?
f. What will be the pass mark if 60% of the students failed?
Solution

[Figure 2.12: Cumulative frequency curve. Horizontal axis: marks (0 to 120); vertical axis: cumulative frequency (0 to 60).]

a) Quartiles
Q1 = 25th percentile = 46.5
Q2 = 50th percentile = 59.5
Q3 = 75th percentile = 72
b) The median is the 50th percentile and it is equal to 59.5.
c) 30th percentile = 49.5
70th percentile = 68
d) i. Range = Highest mark – Lowest mark = 95 – 31 = 64, from the raw data in Section 2.1
ii. Interquartile range = Q3 – Q1 = 72 – 46.5 = 25.5
iii. Semi-interquartile range:

\[ \frac{Q_3 - Q_1}{2} = \frac{72 - 46.5}{2} = \frac{25.5}{2} = 12.75 \]

e) The 60% mark meets the curve at a cumulative frequency of 25 students, and the 80% mark meets the curve at a cumulative frequency of 43. Therefore, the number of students that scored between 60% and 80% is 43 – 25 = 18 students.
f) If 60% of the students failed, the pass mark is the 60th percentile mark. Trace this to the curve and the pass mark will be 67.

2.2.6 Stem-and-Leaf

The stem-and-leaf display combines the visual impact of the histogram or bar chart with the detail of the original list of data entries. This technique, very simple to create and use, is a combination of a graphic technique and a sorting technique. The data values themselves are used to do this sorting. The stem is the leading digit(s) of the data, and the leaf is the trailing digit(s). For example, the numerical data value 325 might be split 32 – 5 as shown:

Leading digits   Trailing digits
32               5

Example 2.9: Construct the stem-and-leaf display of the following sets of data.

i.
52 33 44 48 49 36 50 61 65 72
68 55 60 53 33 41 68 70 82 85
48 51 37 45 58 65 43 45 61 81

ii.
1.6 1.9 3.5 4.9 8.2 7.5 3.3 3.8 4.5 5.2
2.7 4.8 5.7 6.2 7.8 3.4 5.7 8.3 4.1 1.6
2.7 3.1 2.4 1.8 4.5 7.1 3.3 2.5 5.6 1.8

Solution

i. In a stem-and-leaf plot, we consider all entries.
Let’s look at the groups of entries:

30s: 33 33 36 37
40s: 41 44 48 49 48 45 43 45
50s: 52 50 55 53 51 58
60s: 61 65 68 60 68 65 61
70s: 72 70
80s: 82 85 81

We separate the last digit of each entry from the primary numbers 30, 40, 50, 60, 70, 80 and we display the results in ascending order of the values:

Stem | Leaf
3    | 3 3 6 7
4    | 1 3 4 5 5 8 8 9
5    | 0 1 2 3 5 8
6    | 0 1 1 5 5 8 8
7    | 0 2
8    | 1 2 5

ii. For this data set, the stem is the whole-number part and the leaf is the trailing decimal digit. The grouped entries are:

1: 1.6 1.9 1.6 1.8 1.8
2: 2.7 2.7 2.4 2.5
3: 3.5 3.8 3.3 3.4 3.1 3.3
4: 4.9 4.5 4.8 4.1 4.5
5: 5.2 5.7 5.7 5.6
6: 6.2
7: 7.5 7.8 7.1
8: 8.2 8.3

The corresponding stem-and-leaf display is:

Stem | Leaf
1    | 6 6 8 8 9
2    | 4 5 7 7
3    | 1 3 3 4 5 8
4    | 1 5 5 8 9
5    | 2 6 7 7
6    | 2
7    | 1 5 8
8    | 2 3

Unfortunately, not all sets of data can be organized into a stem-and-leaf plot. First, there should not be too much spread in the data, e.g. from 1 to 10000. Similarly, there should not be too little spread. Further, the numbers in the data should not be extremely large. For example, if the values were in hundreds of thousands, such as 345,005 and 582,281, then just separating the last digit would be meaningless.

2.2.7 Line graph

Line graphs are diagrammatic representations of the relation between two variables x and y. The co-ordinate points of these variables are joined together to form the line graph.
Example 2.10: Draw a line graph to represent the information below:

Before   14   20   21   24   22   25   26
After    16   24   23   25   30   27   34

Solution

[Figure 2.13: Line graph. Horizontal axis: Before (0 to 30); vertical axis: After (0 to 40).]

2.3 EXERCISES

2.3.1 A police constable, using radar, checked the speed of cars as they were traveling down a street:

38 50 60 40 38 40 55 40 50 55 40
38 55 60 55 40 50 38 55 38 60

Construct a dot plot of these data.

2.3.2 On the first day of last semester, 30 students were asked for their one-way travel times from home to the University (to the nearest 5 minutes). The resulting data were as follows:

30 30 35 25 25 15 10 20 25 35
40 5 10 25 30 40 5 15 30 40
15 5 20 30 25 45 40 40 35 20

Construct a stem-and-leaf display.

2.3.3 The following 45 amounts are the fees that Fast Delivery charged for delivering small freight items last Monday morning:

2.57 4.21 1.05 3.06 4.50 5.05 3.45 2.15 0.92
3.12 2.67 0.76 4.13 5.93 4.15 2.03 0.57 1.85
4.10 3.41 1.86 2.53 1.46 3.85 5.12 3.24 1.89
2.51 0.95 1.24 2.21 5.86 3.57 2.18 4.29 3.50
0.91 0.82 1.47 4.25 3.81 2.48 1.27 5.35 3.33

i. Classify these data into a grouped frequency distribution by using classes of 0.01 – 1.00, 1.01 – 2.00, ..., 5.01 – 6.00.
ii. Find the class width.
iii. For the class 4.01 – 5.00, name the value of: (a) the class center, (b) the class limits, (c) the class boundaries.
iv. Construct a relative frequency histogram of these data.

2.3.4 The incomes of 80 employees of a company are recorded as follows in N’000 per annum.

430 650 730 450 357 370 680 880 720 500
555 600 710 375 481 639 700 850 650 400
885 730 650 480 537 390 495 755 800 450
633 741 839 395 485 631 737 810 561 492
453 439 810 750 653 495 849 675 800 795
385 411 865 721 846 666 713 874 815 873
555 414 312 481 672 411 813 817 361 845
315 481 618 535 621 781 432 537 615 811

Use an appropriate class interval to construct the frequency distribution.
Draw a histogram to represent the data and a frequency polygon on your histogram. Estimate the mode from your histogram.

2.3.5 The following table shows the frequency distribution of marks of 200 students in a mathematics examination.

Mark        10-19   20-29   30-39   40-49   50-59   60-69   70-79   80-89
Frequency   4       26      52      16      36      34      20      12

i) Draw a cumulative frequency curve and estimate the quartiles.
ii) Calculate the interquartile and semi-interquartile range from your graph.
iii) Find the pass mark if only 20% of the students should pass.
iv) How many of the students scored between 60% and 85%?

2.3.6 Use the table in Exercise 2.3.5 to answer the following questions:
i. Draw a bar chart of the frequency distribution.
ii. Draw a pie chart of the frequency distribution.
iii. Draw a histogram of the frequency distribution.

2.3.7 The following table shows the number admitted into the postgraduate programme for 2 years.

Department         2003   2004
Chemical Engn.     41     37
Electrical Engn.   40     48
Surveying          35     25
Geography          45     48
Mechanical Engn.   50     45
Geology            35     45

Draw a component bar chart and a multiple bar chart for these data.

2.3.8 The year, x, and the birthrate, y, for 1980 – 2000 were as follows:

Year (x)   Birthrate (y)
1980       25,004
1981       25,100
1982       24,345
1983       24,850
1984       23,563
1985       23,236
1986       24,450
1987       18,053
1988       19,245
1989       18,348
1990       15,434
1991       13,347
1992       14,111
1993       15,243
1994       16,172
1995       18,815
1996       17,345
1997       16,457
1998       18,413
1999       19,400
2000       18,721

i. Construct a line graph of these birthrates.
ii. Interpret your output.

2.3.9 A country’s foreign reserves for some period of months are given below in ‘000,000 of dollars:

1 1 2 3 3.5 3.6 4 4.1 4.1 4.1
5 5.2 5.3 5.4 6 7 7 7 7 8
8 8 8 8 8 8.5 8.5 8.7 8.9 9
14 20 22 24 28 32 40 41 41 42
100 120 150 200 250 300 350 400 450 500

i. Group these data into six to eight unequal classes.
ii. Construct a relative frequency histogram.
iii.
Describe the distribution of the reserve.

2.4 STUDENT PROJECTS

Collect a random sample consisting of at least 30 numbers on one of the following topics:
1. Go to your department and randomly select CGPA for year one students.
2. Collect the prices on a random set of items in Yem-yem stores. (For instance, you might want to consider clothing, appliances, beverages, stationeries.)
3. Select a random set of stock prices (closing prices) for companies listed in the current stock market reports.
4. Go to the Arts Faculty Car Park to get the number of eight brands of cars parked there (Toyota, Mercedes, etc.).
5. Collect data from a random selection of patients at the Medical Center on a single quantity such as weight, blood pressure, height, or temperature.
6. Collect data from a random selection of students’ scores for two courses in a particular semester.
7. Collect the prices of a random set of books in the bookshop.
8. Get the net profit/loss of a bank for 30 years.

Once you have collected this set of data, you should then:
a) Construct an appropriate relative frequency distribution.
b) Construct the associated bar chart, pie chart, histogram, and frequency polygon.
c) Draw an Ogive and a stem-and-leaf plot.

Finally, you should incorporate all these items into a formal statistical project report. The report should include:
i. A statement of the topic studied.
ii. A statement of the specific population being investigated.
iii. The source of the data.
iv. A discussion of how the data were collected and why you do or do not believe it is a random sample.
v. The tables and diagrams displaying the data.
vi. A discussion of any surprises you may have noted in connection with collecting the data or organizing them.

CHAPTER THREE
DESCRIPTIVE ANALYSIS

Descriptive analysis is of two parts, namely:
i. Measures of location or central tendency
ii.
Measures of dispersion, variation or spread

Measures of central tendency are numerical values that tend to locate, in some sense, the middle of a set of data. The term average is often associated with these measures. Each of the several measures of central tendency can be called the average value. They are the mean, median, and mode. Once the middle of a set of data has been determined, our search for information immediately turns to the measures of dispersion (spread). The measures of dispersion include the range, variance, and standard deviation. These numerical values describe the amount of spread, or variability, that is found among the data.

3.1 THE SIGMA ($\Sigma$) NOTATION

The Greek capital letter sigma ($\Sigma$) is used in Mathematics to indicate the summation of a set of addends. Each of these addends must be of the form of the variable following $\Sigma$. For example,
i. $\sum x$ means sum the variable x
ii. $\sum (x - 3)$ means sum the set of addends that are 3 less than the values of each x

When large quantities of data are collected, it is usually convenient to index the responses so that at a future time their source will be known. This indexing is shown in the notation by using i (or j or k) and affixing the index of the first and last addend at the bottom and top of the $\Sigma$. For example,

\[ \sum_{i=1}^{4} x_i^2 \]

means sum all consecutive values of the squares of the x’s, starting with source number 1 and proceeding to source number 4.

Example 3.1: Find (i) $\sum x$ (ii) $\sum x^2$ (iii) $(\sum x)^2$

x     1   2   4    6    5    7    3
x^2   1   4   16   36   25   49   9

Solution

\[ \sum x = 1 + 2 + 4 + 6 + 5 + 7 + 3 = 28 \]
\[ \sum x^2 = 1 + 4 + 16 + 36 + 25 + 49 + 9 = 140 \]
\[ \left(\sum x\right)^2 = (28)^2 = 784 \]

Example 3.2: Simplify $\sum_{i=1}^{3} (3x_i + 1)$ and find its value when $x_1 = x_2 = x_3 = 1$.

Solution

\[ \sum_{i=1}^{3} (3x_i + 1) = (3x_1 + 1) + (3x_2 + 1) + (3x_3 + 1) = (3x_1 + 3x_2 + 3x_3) + (1 + 1 + 1) = 3\sum_{i=1}^{3} x_i + 3 \]
\[ = 3(1 + 1 + 1) + 3 = 9 + 3 = 12 \]

3.2 MEAN

To find the mean, $\bar{x}$ (read “x bar”), you add all the values of the variable x and divide by the number of these values, n.
We express this in formula form as

Sample mean = $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$   (3.1)

Example 3.3: Find the mean of the following numbers: 2, 3, 4, 2, 3, 2, 4, 8

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{2 + 3 + 4 + 2 + 3 + 2 + 4 + 8}{8} = \frac{28}{8} = 3.5$

When the sample data has the form of a frequency distribution, we need to make a slight adaptation in order to find the mean. Consider the frequency distribution of Table 3.1.

Table 3.1: Ungrouped frequency distribution
x    1   2   3   4   5
f    4   8   5   4   7

To calculate the mean $\bar{x}$ using the above formula, we have
$\sum x = 1 + 1 + 1 + 1 + 2 + 2 + \dots + 2 + 3 + \dots + 3 + 4 + \dots + 4 + 5 + \dots + 5$
$\sum x = 4(1) + 8(2) + 5(3) + 4(4) + 7(5) = 86 = \sum fx$

Therefore, the mean of a frequency distribution may be found by dividing the sum of the data, $\sum fx$, by the sample size, $\sum f$. We can rewrite formula (3.1) for use with a frequency distribution as:

$\bar{x} = \frac{\sum xf}{\sum f}$   (3.2)

Table 3.2
x        f     xf
1        4      4
2        8     16
3        5     15
4        4     16
5        7     35
Total   28     86

$\bar{x} = \frac{\sum xf}{\sum f} = \frac{86}{28} = 3.07$

3.2.1 Mean of Grouped Data

For grouped data, the class centers (class marks) are used as representative values for the observed data.

Example 3.4: What is the mean of this distribution?

Table 3.3
Mark       Frequency (f)   Class center (x)   fx
30 – 39     5              34.5               172.5
40 – 49    10              44.5               445
50 – 59    15              54.5               817.5
60 – 69    10              64.5               645
70 – 79     5              74.5               372.5
Total      Σf = 45                            Σfx = 2452.5

Mean = $\bar{x} = \frac{\sum fx}{\sum f} = \frac{2452.5}{45} = 54.5$

3.2.2 Using Assumed Mean

Working with an assumed mean keeps the numbers small and so makes calculations with large values easier. For ungrouped data, we use

$\bar{x} = A + \frac{\sum d_i}{N} = A + \frac{\sum d}{N}$   (3.3)

and for grouped data, we use

$\bar{x} = A + \frac{\sum f_i d_i}{\sum f_i} = A + \frac{\sum fd}{\sum f}$   (3.4)

where A is the assumed mean and $d_i = x_i - A$ is the deviation of $x_i$ from A.
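Formulas (3.2) and (3.4) can be checked with a few lines of code. The following is a minimal sketch in Python, using the data of Table 3.3; the function names `frequency_mean` and `assumed_mean` are illustrative choices, not part of the text.

```python
def frequency_mean(values, freqs):
    """Mean of a frequency distribution, formula (3.2): sum(f*x) / sum(f)."""
    total_fx = sum(f * x for x, f in zip(values, freqs))
    return total_fx / sum(freqs)


def assumed_mean(values, freqs, A):
    """Mean via an assumed mean A, formula (3.4): A + sum(f*d) / sum(f)."""
    total_fd = sum(f * (x - A) for x, f in zip(values, freqs))
    return A + total_fd / sum(freqs)


# Class centers and frequencies from Table 3.3
centers = [34.5, 44.5, 54.5, 64.5, 74.5]
freqs = [5, 10, 15, 10, 5]

print(frequency_mean(centers, freqs))      # 54.5
print(assumed_mean(centers, freqs, 44.5))  # 54.5
```

Any value of A gives the same mean; choosing A close to the centre of the data simply keeps the deviations d small.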
Example 3.5: Using the data in Example 3.4, let A = 44.5

Table 3.4
Mark       Frequency (f)   Class centre (x)   d = x – A   fd
30 – 39     5              34.5               -10         -50
40 – 49    10              44.5                 0           0
50 – 59    15              54.5                10         150
60 – 69    10              64.5                20         200
70 – 79     5              74.5                30         150
Total      Σf = 45                                        Σfd = 450

$\bar{x} = A + \frac{\sum fd}{\sum f} = 44.5 + \frac{450}{45} = 44.5 + 10 = 54.5$

which is the same as in the previous example.

3.2.3 Harmonic Mean

This is the reciprocal of the average of the reciprocals. It is usually represented by $\bar{x}_H$ and defined by

$\bar{x}_H = \frac{1}{\frac{1}{N}\sum_{j=1}^{N} \frac{1}{X_j}} = \frac{N}{\sum_{j=1}^{N} \frac{1}{X_j}}$   (3.5)

Example 3.6: Find the harmonic mean for the following data: 2, 5, 3, 6, 7.

Solution
$\bar{x}_H = \frac{5}{\frac{1}{2} + \frac{1}{5} + \frac{1}{3} + \frac{1}{6} + \frac{1}{7}} = \frac{5}{0.5 + 0.2 + 0.333 + 0.167 + 0.143} = \frac{5}{1.343} = 3.72$

3.2.4 Geometric Mean

This is the nth root of the product of the n numbers in a data set. It is usually represented by $\bar{x}_G$ and defined by

$\bar{x}_G = \sqrt[n]{X_1 \times X_2 \times \dots \times X_n} = (X_1 \times X_2 \times \dots \times X_n)^{1/n}$   (3.6)

Example 3.7: Find the geometric mean for the data in Example 3.6.

$\bar{x}_G = \sqrt[5]{2 \times 5 \times 3 \times 6 \times 7} = \sqrt[5]{1260} = 4.17$

3.2.5 Arithmetic Mean

This has been dealt with earlier. It is represented by $\bar{x}_A$ and defined as stated in (3.1). The arithmetic mean of the sample above is

$\bar{x} = \frac{\sum x}{n} = \frac{23}{5} = 4.6$

Note that the relation

$\bar{x}_H \le \bar{x}_G \le \bar{x}_A$   (3.7)

is true for any set of positive data. Here, 3.72 ≤ 4.17 ≤ 4.6.

3.3 MEDIAN

The median is the value of the data that occupies the middle position when the data are ranked in order according to size. The depth (number of positions from either end), or position, of the median is determined by the formula

Depth of median = $\frac{n + 1}{2}$   (3.8)

If the number of measurements n is odd, the median is the middle value. If the number of measurements n is even, the median is the average of the middle two values.

For example, let us find the median of these numbers: 2, 4, 6, 8, 9.
In our example, n = 5, and therefore the depth of the median is

depth = $\frac{5 + 1}{2} = 3$

That is, the median is the third number from either end in the ranked data, i.e. the median is 6.

Now consider the data 4, 6, 7, 8, 10, 12. Here n = 6, and therefore the median depth is

depth = $\frac{6 + 1}{2} = 3.5$

This is to say that the median is halfway between the third and fourth pieces of data. To find the number halfway between any two values, add the two values together and divide by 2. In this case, add 7 and 8, then divide by 2. The median is 7.5.

For grouped data, the median is obtained by interpolation and is given by

Median = $L_1 + \left(\frac{\frac{N}{2} - F_b}{F_m}\right) C$   (3.9)

where
$L_1$ is the lower class boundary of the median class,
C is the size (width) of the median class interval,
N is the total frequency,
$F_b$ is the sum of the frequencies of all classes below the median class,
$F_m$ is the frequency of the median class.

Example 3.8: Find the median mark in the table below:

Marks       30-39   40-49   50-59   60-69   70-79
Frequency   5       10      15      10      5

Solution

Table 3.5
Mark       Class Boundaries   Frequency   Cumulative Frequency
30 – 39    29.5 – 39.5         5           5
40 – 49    39.5 – 49.5        10          15
50 – 59    49.5 – 59.5        15          30
60 – 69    59.5 – 69.5        10          40
70 – 79    69.5 – 79.5         5          45
Total                         Σf = 45

The median class is 50 – 59, so $L_1$ = 49.5, N = 45, C = 59.5 – 49.5 = 10, $F_b$ = 15 and $F_m$ = 15.

Median = $L_1 + \left(\frac{\frac{N}{2} - F_b}{F_m}\right) C = 49.5 + \left(\frac{\frac{45}{2} - 15}{15}\right) \times 10$
= $49.5 + \frac{22.5 - 15}{15} \times 10$
= 49.5 + 0.5 × 10
= 49.5 + 5 = 54.5

Note: The mean of these data is the same as the median because the distribution is symmetric.

3.4 MODE

The mode of a set of data is the value that occurs most frequently.

Example 3.9: Find the modes of the following data.
1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4

Solution
The values with the highest number of occurrences are 2 and 4. They both have a frequency of 4. That is, we have a bimodal case.
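The median and mode rules above translate directly into code. The sketch below is an illustration in Python rather than a prescribed method: the helper names are hypothetical, and `grouped_median` assumes equal class widths and takes the median class to be the first class whose cumulative frequency reaches N/2.

```python
from collections import Counter


def median(data):
    """Median of raw data: middle value, or average of the two middle values."""
    ranked = sorted(data)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


def grouped_median(lower_boundaries, width, freqs):
    """Grouped median by interpolation, formula (3.9)."""
    N = sum(freqs)
    cumulative = 0  # F_b: total frequency below the current class
    for L1, f in zip(lower_boundaries, freqs):
        if cumulative + f >= N / 2:  # this is the median class
            return L1 + (N / 2 - cumulative) / f * width
        cumulative += f


def modes(data):
    """All values that share the highest frequency."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(x for x, c in counts.items() if c == top)


print(median([2, 4, 6, 8, 9]))       # 6
print(median([4, 6, 7, 8, 10, 12]))  # 7.5
# Table 3.5: lower class boundaries, class width 10, frequencies
print(grouped_median([29.5, 39.5, 49.5, 59.5, 69.5], 10, [5, 10, 15, 10, 5]))  # 54.5
print(modes([1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4]))  # [2, 4]
```

Returning a list from `modes` makes the bimodal case of Example 3.9 explicit instead of silently picking one value.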
For grouped data, the mode is obtained by

Mode = $L_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2}(L_2 - L_1)$   (3.10)

where
$f_0$ = the frequency of the group before the group that appears most often,
$f_1$ = the frequency of the group that appears most often,
$f_2$ = the frequency of the group after the group that appears most often,
$L_1$ = the lower limit of the group with $f_1$, and
$L_2$ = the upper limit of the group with $f_1$

OR

Mode = $L_1 + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) C$   (3.11)

where
$L_1$ = lower class boundary of the modal class,
$\Delta_1$ = excess of the modal frequency over that of the next lower class,
$\Delta_2$ = excess of the modal frequency over that of the next higher class, and
C = size of the modal class interval.

Since $\Delta_1 = f_1 - f_0$ and $\Delta_2 = f_1 - f_2$, the two formulas are equivalent.

Example 3.10: Find the mode of the data given in Example 3.8, using the two methods given above.

Solution
Method I, using (3.10)
Mode = $49.5 + \frac{15 - 10}{2(15) - 10 - 10}(59.5 - 49.5)$
= $49.5 + \frac{5}{10} \times 10$
= 49.5 + 5 = 54.5

Method II, using (3.11)
$L_1$ = 49.5, $\Delta_1$ = 15 – 10 = 5, $\Delta_2$ = 15 – 10 = 5, C = 59.5 – 49.5 = 10

Mode = $49.5 + \frac{5}{5 + 5} \times 10$
= $49.5 + \frac{5}{10} \times 10$
= 49.5 + 5 = 54.5

Note: Mean = Mode = Median for the data considered above because the distribution is symmetric. Also, for moderately skewed distributions the following empirical relation holds approximately:

Mean – Mode = 3 (Mean – Median)   (3.12)

3.5 DECILES, PERCENTILES, QUARTILES

When observations are ordered from smallest to largest, the resulting ordered data are called the order statistics of the sample. Consider the following data:

24 31 31 40 45 47 48 48 48 49 50 50
50 50 50 50 51 53 53 56 60 70 71 76

We give ranks to these order statistics and use the rank as the subscript on x. The first order statistic $x_1$ = 24 has rank 1; the second order statistic $x_2$ = 31 has rank 2; the third order statistic $x_3$ = 31 has rank 3; …; and the 24th order statistic $x_{24}$ = 76 has rank 24. It is clear that $x_1 \le x_2 \le \dots \le x_{24}$.

From these order statistics, it is rather easy to find the sample percentiles. If 0 < p < 1, the (100p)th sample percentile has approximately np sample observations less than it and n(1 – p) sample observations greater than it.
One way of achieving this is to take the (100p)th sample percentile as the (n+1)pth order statistic, provided that (n+1)p is an integer. If (n+1)p is not an integer but is equal to r plus some proper fraction, say a/b, use a weighted average of the rth and the (r+1)st order statistics. That is, define the (100p)th sample percentile as

$\tilde{\pi}_p = x_r + (a/b)(x_{r+1} - x_r) = (1 - a/b)\,x_r + (a/b)\,x_{r+1}$   (3.13)

Note that this is simply a linear interpolation between $x_r$ and $x_{r+1}$.

For illustration, consider the 24 ordered examination scores. With p = ½, we find the 50th percentile by averaging the 12th and 13th order statistics, since (n+1)p = 25/2 = 12.5:

$\tilde{\pi}_{0.50} = (½)x_{12} + (½)x_{13} = (50 + 50)/2 = 50$

With p = ¼, we have (n+1)p = 25/4 = 6.25, and thus the 25th sample percentile is

$\tilde{\pi}_{0.25} = (1 - 0.25)x_6 + 0.25x_7 = (0.75)(47) + (0.25)(48) = 35.25 + 12 = 47.25$

With p = ¾, so that (n+1)p = (25)(3/4) = 18.75, the 75th sample percentile is

$\tilde{\pi}_{0.75} = (1 - 0.75)x_{18} + (0.75)x_{19} = (0.25)(53) + (0.75)(53) = 13.25 + 39.75 = 53$

Note that approximately 50%, 25% and 75% of the sample observations are less than 50, 47.25 and 53, respectively.

As already discussed in chapter two, the 50th percentile is the median of the sample. The 25th, 50th, and 75th percentiles are the first, second, and third quartiles of the sample, denoted by Q1, Q2, and Q3, respectively. The 10th, 20th, 30th, …, 90th percentiles are the deciles of the sample. So the 50th percentile is also the median, the second quartile, and the fifth decile.

For example, the 2nd and 9th deciles would be calculated as follows: (n+1)p = (25)(2/10) = 5 for the second decile and (n+1)p = (25)(9/10) = 22.5 for the ninth decile.

$\tilde{\pi}_{0.20} = (1 - 0)x_5 + 0x_6 = x_5 = 45$
and
$\tilde{\pi}_{0.90} = (1 - 0.5)x_{22} + 0.5x_{23} = (0.5)(70) + (0.5)(71) = 35 + 35.5 = 70.5$

3.6 BOX-AND-WHISKER DIAGRAM

A box-and-whisker diagram, or more simply a box plot, is a graphical means of displaying the five-number summary of a set of data (the smallest value, the first quartile, the median or second quartile, the third quartile and the largest value). The three values used, Q1, Q2 and Q3, are sometimes called hinges.

[Figure 3.1: Box plot, marking the minimum, Q1, Q2 = median, Q3 and the maximum]

To construct a horizontal box-and-whisker diagram, draw a horizontal axis that is scaled to the data. Above the axis draw a rectangular box with the left and right sides drawn at Q1 and Q3, and with a vertical line segment drawn at the median, Q2. A left whisker is drawn as a horizontal line segment from the minimum to the midpoint of the left side of the box, and a right whisker is drawn as a horizontal line segment from the midpoint of the right side of the box to the maximum. Note that the length of the box is equal to the interquartile range (Q3 – Q1). The left and right whiskers contain the first and fourth quarters of the data.

Example 3.11: Draw the box plot of the data in Section 3.5.

Solution
The five-number summary is: minimum = 24, Q1 = 47.25, Q2 = 50, Q3 = 53, and maximum = 76.

[Figure 3.2: Box plot of the data in Section 3.5, on an axis scaled from 20 to 80]

Example 3.12: Let x denote the concentration of acid in milligrams per liter. Twenty observations of x are:

115 116 117 118 118 118 119 121 122 125
126 128 129 129 130 131 131 133 133 134

(a) Find the midrange, interquartile range and median.
(b) Draw a box-and-whisker diagram.
Solution
a) Midrange = average of the extremes = $\frac{x_1 + x_n}{2} = \frac{115 + 134}{2} = \frac{249}{2} = 124.5$

With p = ¼, we have (n + 1)p = 21/4 = 5.25 and the 25th sample percentile is
Q1 = $\tilde{\pi}_{0.25} = (1 - 0.25)x_5 + (0.25)x_6 = (0.75)(118) + (0.25)(118) = 118$

With p = ½, we have (n + 1)p = 21/2 = 10.5 and the 50th sample percentile is
Q2 = $\tilde{\pi}_{0.50} = (1 - 0.5)x_{10} + 0.5x_{11} = (0.5)(125) + (0.5)(126) = 62.5 + 63 = 125.5$

With p = ¾, we have (n + 1)p = 21 × ¾ = 15.75 and the 75th sample percentile is
Q3 = $\tilde{\pi}_{0.75} = (1 - 0.75)x_{15} + 0.75x_{16} = (0.25)(130) + (0.75)(131) = 32.5 + 98.25 = 130.75$

Interquartile range = Q3 – Q1 = 130.75 – 118 = 12.75
Median = Q2 = 125.5

b) [Figure 3.3: Box plot of the twenty observations, on an axis scaled from 110 to 135]

Tukey suggested a method for defining outliers that is resistant to the effect of one or two extreme values and makes use of the interquartile range. In a box-and-whisker diagram, construct inner fences to the left and right of the box at a distance of 1.5 times the interquartile range. Outer fences are constructed in the same way at a distance of 3 times the interquartile range. Observations that lie between the inner and outer fences are called suspected outliers. Observations that lie beyond the outer fences are called outliers.

3.7 MEAN ABSOLUTE DEVIATION (MAD)

This is the average amount by which the values in a distribution differ from the mean.

Mean absolute deviation for ungrouped data:

MAD = $\frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$   (3.14)

Mean absolute deviation for ungrouped data with frequencies and for grouped data:

MAD = $\frac{\sum_{i=1}^{n} f\,|x_i - \bar{x}|}{\sum f}$   (3.15)

Example 3.13: Find the mean absolute deviation for the following data: 3 4 5 8 15

First, we find the mean of the data:

Mean = $\bar{x} = \frac{\sum x}{n} = \frac{3 + 4 + 5 + 8 + 15}{5} = \frac{35}{5} = 7$

x                3   4   5   8   15
|x – $\bar{x}$|  4   3   2   1   8

MAD = $\frac{\sum |x - \bar{x}|}{n} = \frac{4 + 3 + 2 + 1 + 8}{5} = \frac{18}{5} = 3.6$

This means that, on average, the data values lie 3.6 units from the mean.

Example 3.14: Find the mean absolute deviation of the following data.
Mark        2   4   5   7   8   9
Frequency   4   6   8   1   4   2

Solution

Table 3.6
Mark (x)   Frequency (f)   fx     x – $\bar{x}$   |x – $\bar{x}$|   f|x – $\bar{x}$|
2          4                8     -3.16           3.16              12.64
4          6               24     -1.16           1.16               6.96
5          8               40     -0.16           0.16               1.28
7          1                7      1.84           1.84               1.84
8          4               32      2.84           2.84              11.36
9          2               18      3.84           3.84               7.68
Total      Σf = 25         Σfx = 129                                41.76

Mean = $\frac{\sum fx}{\sum f} = \frac{129}{25} = 5.16$

MAD = $\frac{\sum f|x - \bar{x}|}{\sum f} = \frac{41.76}{25} = 1.6704$

Example 3.15: The following distribution of commuting distances was obtained for a sample of employees.

Table 3.7
Distance (kilometers)   Frequency
 1.0 –  2.9              2
 3.0 –  4.9              6
 5.0 –  6.9             12
 7.0 –  8.9             50
 9.0 – 10.9             35
11.0 – 12.9             15
13.0 – 14.9              5

Find the mean absolute deviation of the commuting distances.

Solution

Table 3.8
Distance (km)   f     Class center (x)   fx       x – $\bar{x}$   |x – $\bar{x}$|   f|x – $\bar{x}$|
 1.0 –  2.9      2     1.95                3.9    -6.8            6.8               13.6
 3.0 –  4.9      6     3.95               23.7    -4.8            4.8               28.8
 5.0 –  6.9     12     5.95               71.4    -2.8            2.8               33.6
 7.0 –  8.9     50     7.95              397.5    -0.8            0.8               40
 9.0 – 10.9     35     9.95              348.25    1.2            1.2               42
11.0 – 12.9     15    11.95              179.25    3.2            3.2               48
13.0 – 14.9      5    13.95               69.75    5.2            5.2               26
Total           Σf = 125                 Σfx = 1093.75                             Σf|x – $\bar{x}$| = 232

Mean = $\frac{\sum fx}{\sum f} = \frac{1093.75}{125} = 8.75$

MAD = $\frac{\sum f|x - \bar{x}|}{\sum f} = \frac{232}{125} = 1.856$

3.8 VARIANCE AND STANDARD DEVIATION

Variance is a useful measure of the spread of the original values about the mean. When we are concerned with a population, the variance is written in terms of the Greek letter σ (lower case sigma) and is denoted by σ². It is given by the following formula:

Population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$ OR $\frac{N(\sum x^2) - (\sum x)^2}{N^2}$   (3.16)

where N is the size of the population and μ is the population mean.

However, a far more useful measure of the spread or variability in a set of data is the standard deviation, which is defined as the square root of the variance.

Standard Deviation (SD) = $\sqrt{\text{Variance}}$   (3.17)

Since the standard deviation is the square root of the variance σ², the standard deviation is denoted by σ and is found from the formula
Population standard deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$ OR $\sqrt{\frac{N(\sum x^2) - (\sum x)^2}{N^2}}$   (3.18)

One special advantage of working with the standard deviation is that it is measured in the same units as the original data. Thus, if the original set of numbers represents weights of a certain type of item, then both the mean and the standard deviation are measured in units of weight. The larger σ is for a set of numbers, the greater the spread or variability among those numbers. The smaller the value of σ, the smaller the amount of variation in the data.

All the above ideas for the variance and standard deviation were developed in the context of a population. Very similar ideas exist for the variance and standard deviation of a sample drawn from a population, with one significant difference. When we deal with a sample, we do not average the sum of the squared deviations, $\sum (x - \bar{x})^2$, over the entire set of n data points. Instead, it is necessary to make the following modification:

Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ OR $\frac{n(\sum x^2) - (\sum x)^2}{n(n - 1)}$   (3.19)

and

Sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$ OR $\sqrt{\frac{n(\sum x^2) - (\sum x)^2}{n(n - 1)}}$   (3.20)

That is, instead of dividing by the n data points, we divide by n – 1. Just as σ² and σ represent the variance and standard deviation of a population, respectively, we use the symbols s² and s to stand for the variance and standard deviation, respectively, of a sample.

The variance for data with frequency counts and for grouped data is

$\sigma^2 = \frac{\sum f(x - \bar{x})^2}{\sum f}$ OR $\frac{\sum fx^2 - \frac{(\sum fx)^2}{\sum f}}{\sum f}$   (3.21)

and

$s^2 = \frac{\sum f(x - \bar{x})^2}{\sum f - 1}$ OR $\frac{\sum fx^2 - \frac{(\sum fx)^2}{\sum f}}{\sum f - 1}$   (3.22)

The standard deviations σ and s are the square roots of (3.21) and (3.22), respectively.

The variance using an assumed mean A, where d = x – A, is given by

$\sigma^2 = \frac{\sum fd^2 - \frac{(\sum fd)^2}{\sum f}}{\sum f}$   (3.23)

$s^2 = \frac{\sum fd^2 - \frac{(\sum fd)^2}{\sum f}}{\sum f - 1}$   (3.24)

The standard deviations σ and s are the square roots of (3.23) and (3.24), respectively.
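As a check on formulas (3.21) to (3.24), the sketch below computes the variance of a frequency distribution both directly and through an assumed mean. It is a minimal Python illustration using the class centers and frequencies of Table 3.3; the function names and the `sample` flag are assumptions made for this example.

```python
import math


def grouped_variance(values, freqs, sample=False):
    """Variance from frequencies, shortcut form of (3.21) and (3.22)."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))
    divisor = n - 1 if sample else n
    return (sum_fx2 - sum_fx ** 2 / n) / divisor


def grouped_variance_assumed(values, freqs, A, sample=False):
    """Same variance using deviations d = x - A, formulas (3.23) and (3.24)."""
    n = sum(freqs)
    sum_fd = sum(f * (x - A) for x, f in zip(values, freqs))
    sum_fd2 = sum(f * (x - A) ** 2 for x, f in zip(values, freqs))
    divisor = n - 1 if sample else n
    return (sum_fd2 - sum_fd ** 2 / n) / divisor


# Class centers and frequencies from Table 3.3
centers = [34.5, 44.5, 54.5, 64.5, 74.5]
freqs = [5, 10, 15, 10, 5]

pop_var = grouped_variance(centers, freqs)
samp_var = grouped_variance(centers, freqs, sample=True)
print(round(pop_var, 2), round(math.sqrt(pop_var), 2))
print(round(samp_var, 2), round(math.sqrt(samp_var), 2))
print(round(grouped_variance_assumed(centers, freqs, A=44.5), 2))
```

Shifting every value by the same constant A leaves the spread unchanged, which is why the assumed-mean version returns the same variance as the direct one.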
Variance and Standard Deviation using Assumed Mean and Scaling Factor

The foregoing calculations can be made simpler by further scaling d down to h = d/c, where c is the regular increment in the x values. The formulas are given below:

$\bar{x} = A + \frac{\sum fh}{\sum f} \cdot c$   (3.25)

$\sigma^2 = \left[\frac{\sum fh^2 - \frac{(\sum fh)^2}{\sum f}}{\sum f}\right] \times c^2$   (3.26)

and

$s^2 = \left[\frac{\sum fh^2 - \frac{(\sum fh)^2}{\sum f}}{\sum f - 1}\right] \times c^2$   (3.27)

The standard deviations σ and s are the square roots of (3.26) and (3.27), respectively.

Example 3.16: Find the mean and standard deviation σ of the data: 4 7 8 9 10 12

Solution
Using $\sigma^2 = \frac{\sum (x - \bar{x})^2}{N}$

Table 3.9
x          x – $\bar{x}$   (x – $\bar{x}$)²
4          -4.33           18.75
7          -1.33            1.77
8          -0.33            0.11
9           0.67            0.45
10          1.67            2.79
12          3.67           13.47
Σx = 50                    Σ(x – $\bar{x}$)² = 37.34

Mean = $\frac{\sum x}{N} = \frac{50}{6} = 8.33$

Variance = $\frac{\sum (x - \bar{x})^2}{N} = \frac{37.34}{6} = 6.22$

S.D. = $\sqrt{\text{Variance}} = \sqrt{6.22} = 2.494$

Using $\sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$

Table 3.10
x          x²
4           16
7           49
8           64
9           81
10         100
12         144
Σx = 50    Σx² = 454

Mean = $\frac{\sum x}{N} = \frac{50}{6} = 8.33$

Variance = $\frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} = \frac{454 - \frac{50^2}{6}}{6} = \frac{454 - 416.67}{6} = \frac{37.33}{6} = 6.22$

S.D. = $\sqrt{\text{Variance}} = \sqrt{6.22} = 2.494$

Example 3.17: Find the mean and standard deviation (σ) for the following grouped frequency distribution.

Table 3.11
Class limits   f
 2 –  5         7
 6 –  9        15
10 – 13        22
14 – 17        14
18 – 21         2

Solution
Method I: Using $\sigma^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{\sum f}}{\sum f}$

Table 3.12
Class limits   Class center (x)   f     fx      fx²
 2 –  5         3.5                7     24.5     85.75
 6 –  9         7.5               15    112.5    843.75
10 – 13        11.5               22    253     2909.5
14 – 17        15.5               14    217     3363.5
18 – 21        19.5                2     39      760.5
Total                             Σf = 60   Σfx = 646   Σfx² = 7963

Mean = $\frac{\sum fx}{\sum f} = \frac{646}{60} = 10.77$

Variance = $\frac{\sum fx^2 - \frac{(\sum fx)^2}{\sum f}}{\sum f} = \frac{7963 - \frac{646^2}{60}}{60} = \frac{7963 - 6955.27}{60} = \frac{1007.73}{60} = 16.8$

S.D. = $\sqrt{16.8} = 4.099$

Method II: Using assumed mean
Let A = 11.5

Table 3.13
Class limits   Class centre (x)   Frequency (f)   d = x – A   fd     fd²
 2 –  5         3.5                7              -8          -56    448
 6 –  9         7.5               15              -4          -60    240
10 – 13        11.5               22               0            0      0
14 – 17        15.5               14               4           56    224
18 – 21        19.5                2               8           16    128
Total                             Σf = 60                     Σfd = -44   Σfd² = 1040

Mean = $\bar{x} = A + \frac{\sum fd}{\sum f} = 11.5 + \frac{-44}{60} = 11.5 - 0.73 = 10.77$

Variance = $\sigma^2 = \frac{\sum fd^2 - \frac{(\sum fd)^2}{\sum f}}{\sum f} = \frac{1040 - \frac{(-44)^2}{60}}{60} = \frac{1040 - \frac{1936}{60}}{60} = \frac{1040 - 32.27}{60} = \frac{1007.73}{60} = 16.8$

S.D. = $\sqrt{16.8} = 4.099$

Method III: Using assumed mean and scaling factor

Table 3.14
Class limits   Class centre (x)   Frequency (f)   d = x – A   h    fh     fh²
 2 –  5         3.5                7              -8          -2   -14    28
 6 –  9         7.5               15              -4          -1   -15    15
10 – 13        11.5               22               0           0     0     0
14 – 17        15.5               14               4           1    14    14
18 – 21        19.5                2               8           2     4     8
Total                             Σf = 60                          Σfh = -11   Σfh² = 65

Let A = 11.5 and h = d/c, where c is the regular increment in the x values from one class to the next, e.g. 7.5 – 3.5 = 4, 11.5 – 7.5 = 4, and so on.

Mean = $A + \frac{\sum fh}{\sum f} \cdot c = 11.5 + \frac{-11}{60} \times 4 = 11.5 - \frac{44}{60} = 11.5 - 0.73 = 10.77$

Variance = $\sigma^2 = \left[\frac{\sum fh^2 - \frac{(\sum fh)^2}{\sum f}}{\sum f}\right] c^2 = \left[\frac{65 - \frac{(-11)^2}{60}}{60}\right] \times 4^2 = \left[\frac{65 - 2.02}{60}\right] \times 4^2 = 1.05 \times 4^2 = 16.8$

S.D. = $\sqrt{16.8} = 4.099$

3.9 COEFFICIENT OF VARIATION

In order to compare the relative amounts of variation in populations having different means, the coefficient of variation, symbolized by CV, has been developed. This is simply the standard deviation expressed as a percentage of the mean. Its formula is

C.V. = $\frac{\text{S.D.}}{\text{Mean}} \times 100\%$   (3.28)

When several variables are compared, the one with the smallest C.V. shows the least variation relative to its mean and is usually preferred.

3.10 SKEWNESS AND KURTOSIS

When an observed frequency distribution departs from symmetry, it is useful to have statistics that measure the nature and amount of the departure. One is skewness; the other is kurtosis. When a distribution is symmetrical about the mean, the skewness is equal to zero. If the probability histogram has a longer "tail" to the right than to the left, the measure of skewness is positive, and we say that the distribution is skewed positively or to the right.
If the probability histogram has a longer tail to the left than to the right, the measure of skewness is negative, and we say that the distribution is skewed negatively or to the left.

[Figure 3.4: Skewness. Two curves, one skewed positively (to the right) and one skewed negatively (to the left)]

Skewness = $\frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} = \frac{\bar{x} - \text{mode}}{s}$   (3.29)

This is also known as Pearson's coefficient of relative skewness. Using Mean – Mode = 3(Mean – Median), it can also be defined as:

Skewness = $\frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}} = \frac{3(\bar{x} - \text{median})}{s}$   (3.30)

Kurtosis is the degree of peakedness of a distribution relative to a normal (symmetric) distribution. A leptokurtic curve has more items near the mean and at the tails, with fewer items in the intermediate regions, relative to a normal distribution with the same mean and variance. A platykurtic curve has fewer items at the mean and at the tails than the normal curve but has more items in the intermediate regions. A bimodal distribution is an extreme platykurtic distribution. The measure of kurtosis is based on both quartiles and percentiles and is given by

K = $\frac{\text{Interquartile Range}}{\tilde{\pi}_{0.90} - \tilde{\pi}_{0.10}} = \frac{Q_3 - Q_1}{\tilde{\pi}_{0.90} - \tilde{\pi}_{0.10}}$   (3.31)

where $\tilde{\pi}_{0.90}$ and $\tilde{\pi}_{0.10}$ are the 90th and 10th percentiles. This is also known as the percentile coefficient of kurtosis.

[Figure 3.5: Kurtosis. A platykurtic curve and a leptokurtic curve]

Example 3.18: Use Example 3.17 to find the coefficient of variation and the skewness.

Solution
C.V. = $\frac{\text{S.D.}}{\text{Mean}} \times 100\% = \frac{4.099}{10.77} \times 100\% = 38.06\%$

Taking the mode as 11.5, the center of the modal class,

Skewness = $\frac{\text{Mean} - \text{Mode}}{\text{S.D.}} = \frac{10.77 - 11.5}{4.099} = \frac{-0.73}{4.099} = -0.18$

3.11 EXERCISES

3.11.1 Let x denote the concentration in milligrams per liter. Twenty-five observations of x are:

140.1 140.6 150.8 147.5 149.2 145.3 144.3 148.6 149.3
147.5 146.3 145.5 148.3 143.2 144.5 147.8 149.3 147.5
142.1 146.3 145.5 148.1 149.5 150.4 150.5

a) Construct an ordered stem-and-leaf display, using stems 140, 141, 142, …, 150.
b) Find the midrange, range, interquartile range, median, sample mean, and sample variance.
c) Draw a box-and-whisker diagram.
3.11.2 The weights (in grams) of 30 indicator housings used on gauges are as follows:

11.3 12.1 10.8 13.3 13.8 14.5 15.8 17.5 17.8 16.1
19.5 19.4 18.5 15.5 15.6 14.8 10.8 10.5 11.2 13.3
14.6 18.5 17.9 16.3 14.5 17.9 13.3 12.1 13.8 14.6

a) Construct an ordered stem-and-leaf display using integers as the stems and tenths as the leaves.
b) Find the five-number summary of the data and draw a box plot.
c) Are there any suspected outliers or outliers?
d) What type of skewness is indicated by your display? Calculate it.

3.11.3 Find (a) $\sum x^2$, (b) $(\sum x)^2$, (c) $\sum x \sum y$, (d) $\sum y^2$, (e) $(\sum y)^2$ for the data shown below:

x   3   4   5   6    7
y   7   8   9   10   11

3.11.4 Show that $\sum_{i=1}^{4} (3x_i + 2) = 3\sum_{i=1}^{4} x_i + 8$

3.11.5 Show that $\sum_{i=1}^{n} (x_i + 2y_i) = \sum_{i=1}^{n} x_i + 2\sum_{i=1}^{n} y_i$

3.11.6 The weights, in pounds, of a group of people signing up at a hotel are:

125 141 141 132 155 160 185 165 172 148
131 154 162 148 135 181 172 133 141 135

Find
(i) the mean, median and mode of the weights
(ii) the quartiles, skewness and kurtosis

3.11.7 Samples of two brands of bulbs are selected and tested to see how many hours they last before burning out.

Brand A   1134   1157   1811   1858   1958
Brand B   1456   1787   1611   1872   1853

a) Calculate the mean and standard deviation.
b) Calculate the coefficient of variation.
c) Which of the brands is better?

3.11.8 Estimate the mean, standard deviation and median for the following set of data:

Class boundaries   Frequency
151 – 160          50
161 – 170          60
171 – 180          30
181 – 190          35
191 – 200          25
201 – 210          17
211 – 220          13

3.11.9 The speeds, to the nearest kilometer per hour, of a group of 300 cars on a road are as follows:

Class boundaries   Frequency
20.5 – 25.5         10
25.5 – 30.5         40
30.5 – 35.5         30
35.5 – 40.5         55
40.5 – 45.5         15
45.5 – 50.5        120
50.5 – 55.5         30

Use the three methods in Section 3.8 to find the mean and standard deviation.
3.11.10 Use the table below to find
a) the mean, median and mode
b) the coefficient of skewness for the data, and comment on the degree of symmetry of the data
c) the 10th and 90th percentiles
d) the 4th and 8th deciles

Profit      0-10   10-20   20-30   30-40   40-50   50-60   60-70
Frequency   15     13      24      17      5       16      15

3.11.11 Consider the following frequency table:

X           12   14   20   25   27   40   45
Frequency   3    5    10   12   15   4    6

a) Compute the mean, variance, and the coefficient of variation.
b) Compute the measure of skewness of the data and comment on the shape of the distribution of the data.

3.11.12 For the data below:
(a) Draw the histogram and the frequency polygon.
(b) Compute the mean and variance of the data.
(c) Compute the measure of skewness and comment on the degree of symmetry of the data.

Class limits   15-25   25-35   35-45   45-55   55-65   65-75
Frequency      12      14      1       14      15      25

3.12 STUDENT PROJECTS

Continue the investigation you began in the student projects at the end of chapter two by calculating the various numerical descriptions for your set of data. For the data you previously collected,
1. Calculate the mean, median and mode
2. Calculate the sample variance and standard deviation
3. Locate the quartiles for the data
4. Locate the deciles for the data
5. Construct the box plot for the data
6. Calculate the coefficient of variation and measure of skewness

Finally, incorporate all these items into a formal statistical project report. The report should include:
i. A statement of the topic being studied and a specific statement of the population being investigated.
ii. The source of the data.
iii. A discussion of how the data were collected and why you believe that it is a random sample.
iv. A list of the data.
v. All the statistical results you calculated and the box plot.
vi. Some of the tables and diagrams you constructed for the first project.
vii. A discussion of any surprises you may have noted in connection with collecting the data or organizing them.
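For the quartile, decile and box plot items of the project, the (n+1)p rule of Section 3.5 and Tukey's fences of Section 3.6 can be applied with a short script. The sketch below is one possible Python illustration using the 24 ordered scores of Section 3.5. The function names are hypothetical, and clamping to the smallest or largest observation when (n+1)p falls outside 1..n is an assumption that the text does not address.

```python
def sample_percentile(ordered, p):
    """(100p)th sample percentile by the (n+1)p rule, formula (3.13)."""
    n = len(ordered)
    position = (n + 1) * p
    r = int(position)          # whole part: rank of the lower order statistic
    fraction = position - r    # proper fraction a/b used for interpolation
    if r < 1:                  # assumption: clamp to the minimum
        return ordered[0]
    if r >= n:                 # assumption: clamp to the maximum
        return ordered[-1]
    return ordered[r - 1] + fraction * (ordered[r] - ordered[r - 1])


def tukey_fences(q1, q3):
    """Inner fences at 1.5 x IQR and outer fences at 3 x IQR from the box."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer


scores = [24, 31, 31, 40, 45, 47, 48, 48, 48, 49, 50, 50,
          50, 50, 50, 50, 51, 53, 53, 56, 60, 70, 71, 76]

q1 = sample_percentile(scores, 0.25)
q2 = sample_percentile(scores, 0.50)
q3 = sample_percentile(scores, 0.75)
inner, outer = tukey_fences(q1, q3)

# Between the inner and outer fences: suspected outliers
suspected = [x for x in scores
             if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
# Beyond the outer fences: outliers
outliers = [x for x in scores if x < outer[0] or x > outer[1]]

print("five-number summary:", scores[0], q1, q2, q3, scores[-1])
print("inner fences:", inner, "outer fences:", outer)
print("suspected outliers:", suspected, "outliers:", outliers)
```

The five-number summary printed here is what the box plot displays; the two lists flag the observations that would be examined as possible outliers.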
CHAPTER FOUR
INTRODUCTION TO PROBABILITY

The application of probability is evident in most areas of human endeavour. For example, the chance of an accident occurring on a road, the probability of getting a head when a coin is tossed, and the chance of a top politician winning an election are all examples of probability. Therefore, we must be able to assess the degree of uncertainty in any given situation, and this is done mathematically by using probability.

4.1 PROBABILITY OF EVENTS

We begin by defining some terminology that is used in this chapter and in subsequent ones.

Experiment: Any process that yields a result or an observation.
Outcome: A particular result of an experiment.
Sample space: The set of all possible outcomes of an experiment.
Sample point: An individual outcome in a sample space.
Event: Any subset of the sample space. If A is an event, then n(A) is the number of sample points that belong to event A.

The probability of an event is a measure of the likelihood of that event occurring. If an experiment has a finite number of outcomes which are equally likely, then the probability that an event A will occur is given by

P(A) = $\frac{\text{number of ways A can occur}}{\text{total number of possible outcomes}}$   (4.1)

Example 4.1: A die is tossed once and the outcome could be any of its six faces. The sample space is S = {1, 2, 3, 4, 5, 6}

Example 4.2: Let us toss a coin twice and record the outcome of each toss. The sample space is shown here in two different ways.

Diagram Representation
1st Toss   2nd Toss   Outcome
H          H          HH
H          T          HT
T          H          TH
T          T          TT
Figure 4.1: Tree Diagram

Listing
S = {HH, HT, TH, TT}
n(S) = 4

Example 4.3: Let us toss a coin thrice and record the outcome of each toss.

1st Toss   2nd Toss   3rd Toss   Outcome
H          H          H          HHH
H          H          T          HHT
H          T          H          HTH
H          T          T          HTT
T          H          H          THH
T          H          T          THT
T          T          H          TTH
T          T          T          TTT
Figure 4.2: Tree Diagram

S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}
n(S) = 8

Example 4.4: Two dice are rolled and the sum of the numbers appearing is observed.

Table 4.1
+   1   2   3   4    5    6
1   2   3   4   5    6    7
2   3   4   5   6    7    8
3   4   5   6   7    8    9
4   5   6   7   8    9   10
5   6   7   8   9   10   11
6   7   8   9  10   11   12

The sample space is S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

x      2   3   4   5   6   7   8   9   10   11   12   Total
n(x)   1   2   3   4   5   6   5   4   3    2    1    36

where n(x) is the number of ways in which the sum x can occur, out of a total of 36 equally likely points.

4.2 PERMUTATIONS AND COMBINATIONS

4.2.1 Permutation

A permutation is an arrangement of a group of objects in some order. Any other arrangement of the same objects is a different permutation. The key words for permutation are order and arrangement.

For example, let us arrange n people in order. There are n possible choices for the first position, n – 1 remaining choices for the second position, n – 2 remaining choices for the third position, etc. That is,

The number of possible arrangements = n × (n – 1) × (n – 2) × … × 1 = n! (n factorial)

Example 4.5
0! = 1
1! = 1
2! = 2 × 1 = 2
3! = 3 × 2 × 1 = 6
4! = 4 × 3 × 2 × 1 = 24
5! = 5 × 4 × 3 × 2 × 1 = 120
6! = 6 × 5 × 4 × 3 × 2 × 1 = 720

$^nP_r = \frac{n!}{(n - r)!}$

This is the number of permutations of n objects taken r at a time.

Example 4.6: In how many ways can three people be seated on 6 seats in a row?

Solution
Arranging 3 people on 6 seats = $^6P_3$

$^6P_3 = \frac{6!}{(6 - 3)!} = \frac{6!}{3!} = \frac{6 \times 5 \times 4 \times 3!}{3!} = 6 \times 5 \times 4 = 120$ ways

Example 4.7: How many distinct arrangements can be made using all the letters of the word Economics?

Solution
In the word Economics, o occurs 2 times, c occurs 2 times, and the total number of letters is 9.

Total arrangements = $\frac{(\text{number of letters})!}{\text{product of } (\text{frequency of each repeated letter})!} = \frac{9!}{2!\,2!} = \frac{9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2}{2 \times 2} = 90720$

Example 4.8: How many different six-digit numbers can be formed using the digits 4, 4, 6, 6, 6, 6?

Solution
Total digits (n) = 6
4 has frequency 2
6 has frequency 4

Total numbers that can be formed = $\frac{6!}{2!\,4!} = \frac{6 \times 5 \times 4!}{2 \times 1 \times 4!} = 15$

Example 4.9: A plate number is to be made so that it contains four letters and four digits. Two letters begin the plate number and two letters end it. In how many ways can this number be made so that the first digit is not zero when
i. Both letters and digits cannot be repeated
ii. Both letters and digi