Lesson 1: Introduction to Statistics

WHAT IS STATISTICS?

Statistics is a scientific body of knowledge that deals with the collection, organization or presentation, analysis, and interpretation of data.

Data are facts, or a set of information or observations, under study.
Collection refers to the gathering of information or data.
Organization or presentation involves summarizing data or information in textual, graphical, or tabular form.
Analysis involves describing the data by using statistical methods and procedures.
Interpretation refers to the process of drawing conclusions from the analyzed data.

TWO CATEGORIES OF STATISTICS

a. Descriptive Statistics is a statistical procedure concerned with describing the characteristics and properties of a group of persons, places, or things.

Example: We may describe a collection of persons by stating how many are poor and how many are rich, how many are literate and how many are illiterate, or how many fall into various categories of age, height, civil status, IQ, and so on. We may also describe a particular hospital in terms of the number of patients it has, the number of clinical units, the number of doctors, or the number of nurses.

b. Inferential Statistics is a statistical procedure used to draw inferences about the properties or characteristics of a large group of people, places, or things on the basis of information obtained from a small portion of that group. It is concerned with reaching conclusions. At times the information available is incomplete, and generalizations are reached based on the data available.

Example: As a result of the increase in the number of patients in a hospital this week because of meningococcemia, it is expected that the number of patients will double next week.

Example: Suppose we want to know the favorite brand of toothpaste in a certain barangay, but we do not have enough time and money to interview all the residents of that barangay. We may instead ask selected residents.
With the data obtained from the interviews, we then draw a conclusion about the barangay's favorite brand of toothpaste.

HISTORY OF STATISTICS

*Please read the article at this link: https://www.encyclopedia.com/science/encyclopedias-almanacs-transcripts-and-maps/statistics-history-interpretation-and-application

TERMINOLOGIES

1. Population refers to a collection of objects, persons, places, or things. To illustrate, suppose a researcher wants to determine the average income of the residents of a certain barangay, and there are 1500 residents in the barangay. All these residents comprise the population. The size of a population is usually denoted by N. Hence, in this case, N = 1500.

2. Sample is a small portion or part of a population. It can also be defined as a subgroup, subset, or representative of a population. For instance, suppose the above-mentioned researcher does not have enough time and money to conduct the study using the whole population and wants to use only 200 residents. These 200 residents comprise the sample. The size of a sample is usually denoted by n; thus n = 200.

3. Parameter is any numerical or nominal characteristic of a population. It is a value or measurement obtained from a population, and it is usually referred to as the true or actual value. If the researcher in the preceding illustration uses the whole population (N = 1500), then the average income obtained is called a parameter.

4. Statistic is an estimate of a parameter. It is any value or measurement obtained from a sample. If the researcher in the preceding illustration makes use of the sample (n = 200), then the average income obtained is called a statistic.

5. Data (singular form: datum) are facts, or a set of information or observations, under study. More specifically, data are gathered by the researcher from a population or from a sample.

Two categories of data

a. Qualitative data are data whose values describe attributes or categories.
These are sometimes called categorical data. Data falling in this category cannot be subjected to meaningful arithmetic operations: they cannot be added, subtracted, or divided. Gender, nationality, marital status, educational level, and race are qualitative data.

b. Quantitative data are data which are numerical in nature. These are data obtained from counting or measuring. Meaningful arithmetic operations can be done with this type of data. Test scores, height, weight, and blood pressure are quantitative data.

Types of data

a. Raw data are data in their original form and structure.
b. Grouped data are data placed in tabular form, characterized by class intervals with their corresponding frequencies.
c. Primary data are measured and gathered by the researcher who published them. The term refers to information gathered directly from an original source or based on direct, firsthand experience. Examples: first-person accounts, autobiographies, diaries.
d. Secondary data are republished by another researcher or agency. The term refers to information taken from published or unpublished data previously gathered by other individuals or agencies. Examples: published books, newspapers, magazines, biographies, business reports, and the like.

6. Constant is a property or characteristic of a population or sample which makes the members of the group similar to each other. For example, if a class is composed entirely of boys, then gender is a constant.

7. Variable refers to a characteristic or property of a population or sample which makes the members different from each other. If a class consists of boys and girls, then gender is a variable in this class. Height is also a variable because different people have different heights.

Classification of Variables

A. According to Functional Relationship

a. Dependent Variable is a variable which is affected or influenced by another variable.
b. Independent Variable is one which affects or influences the dependent variable.

B. According to Continuity of Values

a. Discrete Variable is one that can assume only a finite or countable number of values. In other words, it can assume specific values only. The values of a discrete variable are obtained through counting. The number of patients in a clinical unit is a discrete variable: if there are 40 patients, it cannot be reported that there are 40.2 patients or 40.5 patients.

b. Continuous Variable is one that can assume infinitely many values within a specified interval. The values of a continuous variable are obtained through measuring. Continuous variables are those that fall into the category of "measured to the nearest". Data measured in decimal fractions, but recorded to the nearest whole number, are still continuous data. For example, height is a continuous variable. If one person reports that the height of a building is 15 m, another person may report that the height of the same building is 15.1 m or 15.12 m, depending on the precision of the measuring device used. In other words, the height of the building can assume several values. Age is continuous in the same way, even though it is recorded in whole years: a person two months away from their 22nd birthday is actually closer to age 22 than to age 21, but in most instances that person would be considered to be age 21 until their actual 22nd birthday.

C. According to Scale of Measurement

Statistics deals mostly with measurements. We define measurement as the assignment of symbols or numerals to objects or events according to some rules. Different rules for assigning symbols yield different scales of measurement.

1. Nominal Scale - this is the most primitive level of measurement. The nominal level of measurement is used when we want to distinguish one object from another for identification purposes. At this level, we can only say that one object is different from another; the amount of difference between them cannot be determined.
We cannot say that one is better or worse than the other. Telephone numbers, zip codes, credit card numbers, gender, nationality, and civil status are on the nominal scale.

2. Ordinal Scale - data are arranged in some specific order or rank. When objects are measured at this level, we can say that one is better or greater than the other, but we cannot tell how much more or how much less of the characteristic one object has than the other. The ranking of contestants in a beauty contest, the birth order of siblings in a family, and the honor rankings of students in a class are on the ordinal scale.

3. Interval Scale - if data are measured at the interval level, we can say not only that one object is greater or less than another, but we can also specify the amount of difference. The scores in an examination are on the interval scale of measurement. To illustrate, suppose Maria got 50 in a Math examination while Martha got 40. We can say that Maria scored higher than Martha by 10 points.

4. Ratio Scale - this level of measurement is like the interval level. The only difference is that the ratio level always starts from an absolute or true zero point. In addition, at the ratio level there is always a unit of measure. If data are measured at this level, we can say that one object is so many times as large or as small as the other. For example, suppose Mrs. Reyes weighs 50 kg, while her daughter weighs 25 kg. We can say that Mrs. Reyes is twice as heavy as her daughter. Thus, weight is an example of data measured on the ratio scale.

EXERCISES

A. Indicate whether the data represented in each of the following is a population or a sample.
1. Twenty-five cases of TB have been reported in the past year, and a patient care evaluation study is to be carried out using data from all 25 cases.
2. A total of 388 chest x-rays were performed during the past month. A quality control review is to be carried out on 10% of the group.

B. Tell whether the following situations will make use of descriptive or inferential statistics.
1. A teacher computes the average grade of her students and determines the top ten students.
2. The CEO of a hospital predicts the decrease in the number of patients that will be admitted next year based on the data collected this year.
3. A school administrator forecasts the future expansion of a school.
4. A researcher determines the total number of patients in ZCMC.
5. A researcher investigates the effectiveness of a beauty product.

C. Indicate whether the following represent qualitative or quantitative data.
1. Place of birth
2. Type of insurance
3. Condition of the patient at time of discharge
4. Number of hospital admissions

D. Indicate whether each of the following is a discrete or continuous variable.
1. Birth weight
2. Number of times a patient sees her physician during the year
3. Minutes needed to walk a mile
4. Number of possible outcomes in throwing a die
5. Data obtained in decimal form

E. Determine the scale of measurement of the following.
1. Weight
2. Educational level
3. License plate number
4. Examination scores
5. Placement in the 100-meter dash
6. Acceleration of a vehicle
7. Number of patients of a hospital in a day
8. Civil status

DATA GATHERING TECHNIQUES

I. Collecting Data

Two Sources of Data

1. Primary sources of data are government institutions, business agencies, and other organizations.
Examples:
1. Data gathered from the National Statistics Office (NSO).
2. Information derived from personal interviews.

2. Secondary sources are books, encyclopedias, journals, magazines, and research or studies conducted by other individuals.

Various Ways of Collecting Data

1. The Direct or Interview Method - in this method, the researcher or interviewer has direct contact with the interviewee. The researcher obtains the information needed by asking the interviewee questions. This method is used in most research.
With this method the researcher can get more accurate answers or responses, since the interviewer can clarify a question when the respondent does not understand it. However, this method is costly and time consuming.
Examples:
a. A business firm interviews residents of a certain barangay regarding their favorite brand of toothpaste, soap, or shoes.
b. A nurse interviews patients regarding their birthdates, residence, etc.

2. The Indirect or Questionnaire Method - this method makes use of a questionnaire. The researcher distributes the questionnaire to the respondents either by personal delivery or by mail. Using this method, the researcher can save a lot of time and money in gathering the information needed, because questionnaires can be given to a large number of respondents at the same time. However, the researcher cannot expect that all distributed questionnaires will be answered, because some respondents simply ignore them. In addition, a respondent who does not understand a question cannot ask for clarification.

3. The Registration Method - this method of gathering data is enforced by certain laws.
Examples: registration of births, registration of deaths, registration of vehicles, registration of marriages, registration of licenses.

4. The Experimental Method - this method is usually used to find out the cause-and-effect relationship of certain phenomena under controlled conditions. Scientific researchers often use this method.
Example: Agriculturists would like to know the effect of a new brand of fertilizer on the growth of plants. The new fertilizer will be applied to ten sets of plants, while another ten sets of plants will be given the ordinary fertilizer. The growth of the plants will then be compared to determine which fertilizer is better.

II. Sampling Techniques

A sampling technique is a procedure used to determine the individuals or members of a sample.
Example: Suppose the guidance counselor of a certain school wants to determine the average weekly allowance of the students. If there are 2000 students in the school and the guidance counselor decides to use only 100 students as a sample, who will be included in the sample?

*Sampling techniques are used to answer the question of who will be included in the sample.

Two Types of Sampling Techniques

1. Probability Sampling - a sampling technique wherein each member or element of the population has a known chance (in simple random sampling, an equal chance) of being selected as a member of the sample.

Several Probability Sampling Techniques

A. Simple Random Sampling - this is the basic type of probability sampling. Using this technique, each individual in the population has an equal chance of being drawn into the sample. Selecting the members of the sample using this technique can be done in two ways, namely, the lottery method and the use of a table of random numbers. Remember that when we use these methods we should have a complete list of the members of the population.

A.1 Lottery Method

Suppose Mrs. Cruz wants to send five students to attend a 2-day training or seminar in basic computer programming. To avoid bias in selecting these five students from her 40 students, she can use the lottery method. This is done by assigning a number to each student and then writing these numbers on pieces of paper. We can simply assign 1 to the first student, 2 to the second student, 3 to the third student, and so on. The pieces of paper are then rolled or folded and placed in a box called a lottery box. The lottery box should be thoroughly shaken, and then five pieces of paper are drawn from the box. The students who were assigned the numbers drawn will be sent to the training. In this way, the selection of the students is done without bias.

A.2 Table of Random Numbers

We can also use a table of random numbers to select or draw the members of the sample.
Below is a portion of a table of random numbers.

31871   60770   59235   41702
87134   32839   17850   37359
06728   16314   81076   42172
95646   67486   05167   07819
44085   87246   47378   98338

Let us illustrate how these random numbers are used to select the members of the sample. Consider the preceding example wherein Mrs. Cruz wants to select 5 students from her 40 students. Again, we assign a number to each student, say from 1 to 40. Since there are 40 students, we use two-digit numbers from the table of random numbers when selecting the members of the sample. This is because the students have been assigned the numbers 01, 02, 03, 04, ... up to 40.

Looking at the first column of the table above, we see that the number formed by the first two digits is 31; hence the student assigned number 31 is chosen as a member of the sample. If we proceed down the column, the next number formed is 87, which cannot be used because we have only 40 members. In a similar manner, the third number is 06, so the student assigned number 6 is chosen. The next two numbers from the table are 95 and 44, which we cannot use for the same reason as before. When we get to the bottom of the column, we return to the top of the column and shift one digit to the right for the next random number. Thus, we have 18 as our next number. This is one of many alternatives; we can use other ways of reading the table until we complete the 5 students.

B. Systematic Sampling

If we are to select the members of the sample from a large population, the simple random technique is a long and difficult process. An easier alternative is the systematic sampling technique. To draw the members of the sample using this method, we select a random starting point and then draw successive elements from the population at a fixed interval. In other words, we pick every i-th element of the population as a member of the sample, where i is the sampling interval.
Let us use the example wherein Mrs. Cruz wants to select 5 students from her 40 students. First, we compute the sampling interval by dividing the number of members in the population by the number of members in the sample. In our case, i = 40 / 5 = 8. Next, we select a random starting point: we write the numbers 1, 2, 3, 4, 5, 6, 7, and 8 on pieces of paper and draw one number by lottery. If we draw 5, we start with the 5th student and then select every 8th student after that. Therefore, the 5th, 13th, 21st, 29th, and 37th students are the members of the sample. If, instead, we draw the number 6, then the members of the sample are the 6th, 14th, 22nd, 30th, and 38th students.

C. Stratified Random Sampling

There are instances in which the members of the population do not belong to the same category, class, or group. To illustrate, suppose that we want to determine the average income of the families in a certain community or barangay. In a typical barangay, different families belong to different income brackets. If we select the members of the sample using simple random sampling, there is a chance that none, or a disproportionate number, of the families from the low-income, average-income, or high-income group will be included in the sample. For instance, if most of the families drawn happen to come from the high-income group, the study would wrongly conclude that the average income of the families living in this barangay is high. This suggests that the sample should be drawn proportionally from each group or category: the high-income, the average-income, and the low-income families. To do this, we use stratified random sampling. The word stratified comes from the root word strata, which means groups or categories (the singular form is stratum). When we use this method, we are actually dividing the elements of a population into different categories or subpopulations.
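The idea of drawing proportionally from each stratum can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `proportional_allocation` and the stratum labels are hypothetical, and the figures (5000 families in three income groups, sample of 200) are illustrative inputs.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size across strata in proportion to each stratum's size.

    strata_sizes: mapping of stratum label -> number of members in that stratum.
    Returns a mapping of stratum label -> number of members to draw.
    Note: rounding is to the nearest whole number, so for awkward proportions
    the parts may not add up exactly to sample_size and need manual adjustment.
    """
    total = sum(strata_sizes.values())  # population size N
    return {
        label: round(sample_size * size / total)  # (stratum share) x n
        for label, size in strata_sizes.items()
    }


# Illustrative figures (assumed): N = 5000 families, n = 200.
strata = {"high": 1000, "average": 2500, "low": 1500}
allocation = proportional_allocation(strata, 200)
print(allocation)  # {'high': 40, 'average': 100, 'low': 60}
```

Once the number to draw from each stratum is known, the members within each stratum are still selected at random, for example by the lottery method.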
Let us consider the following example.

Example: Suppose a community consists of 5000 families belonging to different income brackets. We will draw 200 families as our sample using stratified random sampling. Below are the subpopulations and the corresponding number of families belonging to each subpopulation or stratum.

Strata                     Number of Families
High-income families       1000
Average-income families    2500
Low-income families        1500
                           N = 5000

Solution: The first step is to find the percentage of each stratum. This is done by dividing the number of families in each stratum by the total number of families. Then, we multiply each percentage by the desired number of families in the sample. The table below shows how it is done.

Strata     Number of Families   Percentage                 Number of Families in the Sample
High       1000                 1000/5000 = 0.2 or 20%     0.2 x 200 = 40
Average    2500                 2500/5000 = 0.5 or 50%     0.5 x 200 = 100
Low        1500                 1500/5000 = 0.3 or 30%     0.3 x 200 = 60
           N = 5000                                        n = 200

From the above table, we see that if we are going to draw 200 members from a population of 5000, we should draw 40 families from the high-income group, 100 from the average-income group, and 60 from the low-income group. Observe that the number of families drawn from each stratum is proportional to the number of families in that stratum in the population.

Sometimes the population is so large that the use of simple random sampling will prove tedious and difficult. Under this condition, we can use cluster sampling.

D. Cluster Sampling

Cluster sampling is sampling wherein groups or clusters, instead of individuals, are randomly chosen. Recall that in simple random sampling we select the members of the sample individually. In cluster sampling, we first select groups at random, and then we randomly select a sample of elements from each chosen cluster or group. Cluster sampling is sometimes called area sampling because it is usually applied when the population is large.
To illustrate the use of this sampling method, suppose that we want to determine the average income of the families in Manila. Let us assume there are 250 barangays in Manila. We can draw a random sample of 20 barangays using simple random sampling, and then a certain number of families from each of the 20 barangays may be chosen.

E. Multi-stage Sampling

Multi-stage sampling is a combination of several of the sampling techniques we have discussed. This method is usually used by researchers who are interested in studying a very large population, say the whole island of Luzon, or even the Philippines. It is done by starting the selection of the members of the sample using cluster sampling and then dividing each cluster or group into strata. Then, from each stratum, individuals are drawn randomly using simple random sampling.

2. Non-Probability Sampling - a sampling technique wherein the members of the sample are drawn from the population based on the judgment of the researchers. The results of a study using this sampling technique are relatively biased. This technique lacks objectivity of selection; hence, it is sometimes called subjective sampling. Inferences based on a sample obtained using these techniques are not very reliable. Researchers nevertheless use non-probability sampling techniques because they are convenient, inexpensive, and easy to conduct. Under this technique, there are several methods which can be used to select the members of the sample.

A. Convenience Sampling - as the name implies, convenience sampling is used because of the convenience it offers to the researcher. For example, a researcher who wishes to investigate the most popular noontime show may just interview the respondents by telephone. The result of this interview will be biased because the opinions of those without a telephone will not be included.
Although convenience sampling may be used occasionally, we cannot depend on it in making inferences about a population.

B. Quota Sampling - in this type of sampling, the proportions of the various subgroups in the population are determined, and the sample is drawn to have the same percentages in it. This is very similar to the stratified random sampling discussed above. The only difference is that the selection of the members of the sample using quota sampling is not done randomly. To illustrate, suppose that we want to determine teenagers' favorite brand of T-shirt. If there are 1000 female and 1000 male teenagers in the population and we want to draw 150 members for our sample, we can select 75 female and 75 male teenagers from the population without using randomization. This is quota sampling.

C. Purposive Sampling - this is another method of drawing members of the sample using non-probability sampling. Suppose that we want to predict the candidate who will win in the upcoming election. We can conduct the survey or interview in places or precincts where people voted for the winner in a series of past elections, because we believe that they will again vote for the winner in the upcoming election. Also, suppose that the target is to find out the effectiveness of a certain kind of shampoo. Of course, bald fellows will not be included in the sample.

III. Determining Sample Size

To determine the sample size from a given population, Slovin's formula is used.

Slovin's formula: $n = \frac{N}{1 + Ne^2}$

where
n = sample size
N = population size
e = margin of error

To illustrate the margin of error, suppose we want to find the average age of the students in Manila. However, due to insufficient time, only the students in three particular schools are used to estimate the average age. Obviously, the result is not the actual average but just an estimate; thus, there is usually an error when we use the sample instead of the population.
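Slovin's formula can be checked quickly with a short Python sketch. This is a minimal illustration rather than a standard library routine: the function name `slovin_sample_size` and the input values (a population of 10000 with margins of error of 10% and 5%) are assumptions chosen for demonstration.

```python
def slovin_sample_size(population_size, margin_of_error):
    """Return the unrounded sample size n = N / (1 + N * e^2) (Slovin's formula)."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be a proportion between 0 and 1")
    return population_size / (1 + population_size * margin_of_error ** 2)


# Illustrative inputs (assumed): N = 10000 with e = 0.10 and e = 0.05.
for e in (0.10, 0.05):
    n_raw = slovin_sample_size(10_000, e)
    # round() gives the nearest whole number of respondents.
    print(f"e = {e:.2f}: n = {n_raw:.2f}, rounded to {round(n_raw)}")
# e = 0.10: n = 99.01, rounded to 99
# e = 0.05: n = 384.62, rounded to 385
```

Note how halving the margin of error roughly quadruples the required sample size, because e enters the formula squared.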
Example 1: A group of researchers will conduct a survey to find out the opinion of residents of a particular community regarding the oil price hike. If there are 10000 residents in the community and the researchers plan to use a sample with a 10% margin of error, what should the sample size be?

Solution: Here, N = 10000 and e = 10% or 0.10. Substituting the given values in the formula, we have:

$n = \frac{N}{1 + Ne^2} = \frac{10000}{1 + (10000)(0.10)^2} = \frac{10000}{1 + (10000)(0.01)} = \frac{10000}{101}$

n = 99.01 or 99

Hence, the researchers will conduct the survey using just 99 residents. A 10% margin of error means that the researchers accept that the result obtained from the sample may differ from the result they would have obtained from the whole population by as much as 10%.

Example 2: Suppose that in Example 1 the researchers would like to use a 5% margin of error. What should the sample size be?

Solution: Here N = 10000 and e = 5% or 0.05. Substituting the given values in the formula, we have:

$n = \frac{N}{1 + Ne^2} = \frac{10000}{1 + (10000)(0.05)^2} = \frac{10000}{1 + (10000)(0.0025)} = \frac{10000}{1 + 25}$

n = 384.62 or 385

Summation Notation

In our study of statistics, we shall be using mathematical symbols. These symbols are especially useful in writing formulas. The most common notation used in statistics is the summation notation, or simply summation ($\sum$). Recall that variables are represented by capital letters. If our variable is age, then we can represent it by X. Hence, if there are 40 students in a class, we can represent the age of the first student by $X_1$, the age of the second student by $X_2$, the age of the third student by $X_3$, and so on. If we want to find the sum of these ages, we can write the sum in this way:

$X_1 + X_2 + X_3 + X_4 + \cdots + X_{40}$

To write the sum of n values or measurements in a simpler way, the summation notation, represented by the Greek capital letter $\Sigma$ (sigma), is used.
To write the preceding example in summation notation, we have

$\sum_{i=1}^{40} X_i$ (read as "the summation of X sub i, from i = 1 to i = 40")

Here i is the index of summation, and its value ranges from 1, the lower limit, to 40, the upper limit. Observe also that when we write a sum of values in summation notation, we replace the subscript of the summed variable by an arbitrary subscript i and indicate the range of the summation in the index.

More examples of writing the summation notation are shown below:

1. $X_1 + X_2 + X_3 + X_4 + \cdots + X_{100} = \sum_{i=1}^{100} X_i$
2. $(Y_4 + 5) + (Y_5 + 5) + (Y_6 + 5) + \cdots + (Y_{20} + 5) = \sum_{i=4}^{20} (Y_i + 5)$

Sometimes, instead of writing a given sum in summation notation, we are asked to expand a given summation.

Example 1
Expand the following:

1. $\sum_{i=7}^{10} X_i^3$
2. $\sum_{i=2}^{5} (A_i + B_i)$

Solution: Applying the definition of summation, we have

1. $\sum_{i=7}^{10} X_i^3 = X_7^3 + X_8^3 + X_9^3 + X_{10}^3$
2. $\sum_{i=2}^{5} (A_i + B_i) = (A_2 + B_2) + (A_3 + B_3) + (A_4 + B_4) + (A_5 + B_5)$

If the values of a variable are given and we are asked to evaluate a summation, we simply substitute the values in the expanded form of the summation. Examples of how to evaluate summation notation are as follows:

Example 2
Given the following:

$X_1 = 2$, $X_2 = 4$, $X_3 = 5$
$Y_1 = 1$, $Y_2 = 3$, $Y_3 = 7$

Evaluate:

1. $\sum_{i=1}^{3} X_i$
2. $\sum_{i=1}^{3} (X_i + Y_i)$
3. $\sum_{i=1}^{3} 2X_i^2$
4. $\sum_{i=1}^{2} (X_i - i)$
5. $\sum_{i=2}^{3} (X_i - i)^2$
6. $\sum_{i=1}^{3} X_i Y_i$
7. $\left(\sum_{i=2}^{3} X_i\right)\left(\sum_{i=2}^{3} Y_i^2\right)$

Solution: Using the summation notation, we have:

1. $\sum_{i=1}^{3} X_i = X_1 + X_2 + X_3 = 2 + 4 + 5 = 11$
2. $\sum_{i=1}^{3} (X_i + Y_i) = (X_1 + Y_1) + (X_2 + Y_2) + (X_3 + Y_3) = (2 + 1) + (4 + 3) + (5 + 7) = 3 + 7 + 12 = 22$

PRESENTING AND DESCRIBING DATA

After data have been gathered and checked for possible errors, the next logical step is to present the data in a manner that is easy to understand. The presentation should also readily convey the relevant information and the important results at a glance.
Ungrouped Data - data that are not organized or, if arranged, are only ordered from highest to lowest or lowest to highest.
Grouped Data - data that are organized and arranged into different classes or categories.

Three methods of presenting data
1. Textual
2. Tabular
3. Graphical

TEXTUAL

In textual form, the presentation is in narrative or paragraph form. The data are within the text of the paragraph. This involves enumerating the important characteristics, giving emphasis to significant figures, and identifying important features of the data. This form may not get the immediate interest of the reader. However, it can present a more comprehensive picture of the data because of the further written explanation of its nature.

Examples:
1. Nominally, the peso improved by 1.4 percent as of April 14, 2003 compared to its level in 2002, followed by the Thai baht, which gained 0.86 percent; Indonesian rupiah, 0.68 percent; and Taiwan dollar, 0.2 percent. Other currencies, on the other hand, depreciated during the same period. The Singapore dollar fell 2.33 percent. The South Korean won slid 2.14 percent while the Japanese yen dropped 0.61 percent. (Phil Daily Inquirer, April 17, 2003, p. B2)
2. Here is the list of scores for the math exam of the top 10 students in the 4th-year class: 95 95 94 93 94 93 90 91 91 95.

TABULAR

Sometimes we can hardly grasp information from a textual presentation of data. Thus, we may present data by using tables. By organizing data in tables, important features of the data can be readily understood and comparisons can be easily made. Thus, a table shows complete information regarding the data.

A table has the following parts:
1. Heading: it includes the following:
   a. Table number: This is for easy reference to the table.
   b. Table title: It briefly explains the content of the table.
2. Box head/Column header: It describes the data in each column.
3. Stubs/Row classifier: It shows the classes or categories.
4. Body: This is the main part of the table.
5. Footnote/Source note: This is placed below the table only when the data are not original; that is, it indicates the source of the data.

A. CROSS TABULATION TABLE

Table 1.1 Distribution of Religious Affiliation by Sex for Barangay Tibanga

                         SEX
RELIGION             MALE     FEMALE    TOTAL
ROMAN CATHOLIC       2,758    2,693     5,451
ISLAM                  113      126       239
IGLESIA NI CRISTO       82       79       161
OTHERS                 231      275       506
TOTAL                3,184    3,173     6,357

Source: 1994 Iligan Census Summary Report

B. FREQUENCY DISTRIBUTION

A frequency distribution is a grouping of all the observations into intervals or classes, together with a count of the number of observations that fall in each interval or class.

STEPS in constructing a frequency distribution
1. Find the range (R): R = highest value - lowest value
2. Estimate the number of classes, k, using either
   a. $k = \sqrt{n}$
   b. $k = 1 + 3.322 \log_{10} n$
3. Estimate the width c of the interval ($c = R/k$). Round off this estimate to the same number of decimal places as the original set of data.
4. List the lower and upper class limits of the first class interval. This interval should contain the lowest observation in the data set.
5. List all the class limits by adding the class width to the limits of the previous interval. The highest class should contain the largest observation in the data set.
6. Tally the frequency for each class.
7. Compute the class marks and the class boundaries.

The class mark, or class midpoint, is the midpoint of an interval, where $L_i$ and $U_i$ are the lower and upper class limits of the i-th class:

$x_i = \frac{L_i + U_i}{2}$

To find the class boundaries, it is important to determine the unit of accuracy.
$\text{Lower class boundary} = \text{lower class limit} - \tfrac{1}{2}(\text{unit})$
$\text{Upper class boundary} = \text{upper class limit} + \tfrac{1}{2}(\text{unit})$

Example: The following scores represent the final examination grades for an elementary statistics course:

23 60 79 32 57 74 52 70 82 36
80 77 81 95 41 65 92 85 55 76
52 10 64 75 78 25 80 98 81 67
41 71 83 54 64 72 88 62 74 43
60 78 89 76 84 48 84 90 15 79
34 67 17 82 69 74 63 80 85 61

C. STEM AND LEAF PLOT
For a small data set, data may still be grouped into intervals without any loss of information. A stem-and-leaf plot is a table consisting of stems and leaves.

GRAPHICAL

A. BAR CHART AND HISTOGRAM
A bar chart is a graph in which the different classes are represented by rectangles or bars. The width of each rectangle is the width of the interval given by the class limits on the horizontal axis, or the category for nominal data. The length of the rectangle, which represents the frequency, is drawn along the vertical axis. A histogram is a graph that closely resembles the bar chart. A histogram uses the class boundaries on the horizontal axis.

B. FREQUENCY POLYGON
A frequency polygon is constructed by plotting the class marks against the frequencies. To complete the polygon, which is mathematically defined as a closed figure, an additional class mark is added at the beginning and at the end of the distribution.

C. FREQUENCY OGIVE
A cumulative frequency distribution can be represented graphically by a frequency ogive. An ogive is obtained by plotting the upper class boundaries on the horizontal scale and the cumulative frequencies less than the upper class boundaries on the vertical scale.

D. PIE CHART AND PICTOGRAPH

SEATWORK: Construct a frequency table and a cumulative frequency distribution, and present the data using a frequency polygon, a histogram, and a frequency ogive for the table below.
Table 1.2 Weights (kg) of 30 newborn hounds in Hugo's Farm

6.0 7.5 6.2 7.9 8.4 7.6 8.3 6.7 7.0 6.4
9.0 6.6 6.1 8.8 9.0 8.1 7.3 7.4 8.3 6.5
7.9 6.0 8.4 8.8 7.1 8.6 7.4 7.0 6.9 6.4

ASSIGNMENT: The following data represent the charges (in dollars) billed by a plumber for his last 30 home service calls:

39.12 61.74 37.29 44.35 57.29 64.10 48.25 67.25 58.95 39.95
38.42 55.80 44.35 38.75 63.91 5.50 40.15 60.29 41.26 49.32
36.07 46.01 41.13 67.29 45.68 63.55 62.12 36.85 45.97 42.89

MEASURES OF CENTRAL LOCATION
A measure of central location summarizes a set of observations into a single value, and that value may be used to represent the entire population. It is a single value about which the set of observations tends to cluster.

Three Types of Measure

1. Mean
The arithmetic mean, or simply the mean, is the sum of the measurements divided by the number of measurements in the set.

Population Mean: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} X_i}{n}$

Sometimes we want to attach more significance to some observations, and weights are used to emphasize them. If we denote the weights of the n observations by $w_1, w_2, w_3, \ldots, w_n$ respectively, a sample mean that gives relative importance to the individual observations is the weighted mean, computed using the formula:

$\bar{x} = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$

CHARACTERISTICS OF THE MEAN
i. The mean reflects the magnitude of every observation, since every observation contributes to the value of the mean.
ii. It is easily affected by the presence of extreme values, and hence it is not a good measure of central tendency when extreme observations occur.

Example:
1. The number of hours spent by ten students in studying their lessons per day was recorded as follows: 5, 8, 4, 2, 2.5, 2, 2, 5, 3, and 4. Find the arithmetic mean.
2. The MSU – SASE scores of a sample of 9 students who joined the university in the second semester of A.Y. 1995 – 96 were found to be 67, 92, 90, 79, 69, 62, 59, 69, and 95. Compute the mean SASE score.
3. The student's final grades in Math 17, Philo 1, English Communication 1, and Bio 1 are 2.0, 2.25, 1.75, and 2.0, respectively. If the respective credits for these subjects are 6, 3, 3, and 4, determine the student's GPA or weighted average grade.
4. What is the arithmetic mean for the sample of birthweights in Table 2.1?

2. Median
A measure of central location that is not affected by the presence of abnormally large or abnormally small observations is the median. The median is the middle value of the set of observations arranged in increasing or decreasing order of magnitude. It is the middle value when the number of observations is odd, or the arithmetic mean of the two middle values when the number of observations is even; i.e., it is the value such that half of the observations fall above it and half below it.

PROPERTIES OF THE MEDIAN
i. It is a positional value and hence is not affected by the presence of extreme values, unlike the arithmetic mean.
ii. It is the appropriate measure for at least an ordinal type of data.

Example:
1. Refer to Example 2 of the mean and determine the median of the MSU – SASE scores.
2. Refer to Example 1 of the mean and find the median of the number of hours the students study daily.
3. Refer to Example 4 of the mean and find the median of the birthweights in Table 2.1.

3. Mode
The mode of a set of observations is the value that occurs the greatest number of times, that is, the value with the greatest frequency.

PROPERTIES OF THE MODE
i. The mode is determined by the frequency and not by the values of the observations.
ii. The mode may be defined for qualitative or quantitative variables.
iii. It is the appropriate measure for a nominal type of data.

Example:
1. Find the mode of each of the following sets of numbers:
   a. 4, 6, 7, 3, 5, 7, 6, 8, 9, 6
   b. 90, 50, 45, 30, 26, 37, 28, 86, 59, 75
   c. 14, 10, 11, 14, 12, 19, 11, 20, 25, 18
2. Refer to Examples 1 and 2 of the mean and determine the mode.
3. The religions of a sample of 15 students in the College of Business Administration are: RC, RC, RC, INC, RC, RC, Baptist, Islam, Born-Again, Baptist, RC, and Born-Again. Find the mode.
4. Find the mode of the data given in Table 2.3 below.

Ungrouped/Raw Data – data that are not organized or, if arranged, are ordered only from highest to lowest or lowest to highest.
Grouped Data – data that are organized and arranged into different classes or categories.

A frequency distribution, a table that lists observed events and the frequency of occurrence of each observed event, is often used to organize raw data.

Example:
Table 1. Numbers of Laptop Computers per Household
Table 2. A frequency distribution for Table 1.

Example: Find the mean of the data displayed in Table 1.

For grouped data:

Mean: $\bar{X} = \frac{\sum f X_m}{N}$
where X = measurement or score
f = frequency
$X_m$ = class mark
N = total frequency

Median: $\tilde{X} = l + \left( \frac{\frac{N}{2} - cf}{f_m} \right) c$
where $\tilde{X}$ = median
l = lower class boundary of the median class
N = total frequency
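
The steps for constructing a frequency distribution, together with the grouped-data mean above, can be sketched in a few lines of Python. This is only an illustrative sketch that uses the 60 final examination grades from the frequency distribution example in this lesson. Several choices in it are assumptions not fixed by the lesson: the variable names are hypothetical, k is taken from rule (b) and rounded to the nearest whole number, the first class starts at the lowest observation, and the unit of accuracy is 1 because the scores are whole numbers.

```python
import math

# Final examination grades from the frequency distribution example.
scores = [
    23, 60, 79, 32, 57, 74, 52, 70, 82, 36,
    80, 77, 81, 95, 41, 65, 92, 85, 55, 76,
    52, 10, 64, 75, 78, 25, 80, 98, 81, 67,
    41, 71, 83, 54, 64, 72, 88, 62, 74, 43,
    60, 78, 89, 76, 84, 48, 84, 90, 15, 79,
    34, 67, 17, 82, 69, 74, 63, 80, 85, 61,
]

n = len(scores)

# Step 1: range R = highest value - lowest value.
data_range = max(scores) - min(scores)

# Step 2: number of classes, rule (b); rounding to a whole number is an assumption.
k = round(1 + 3.322 * math.log10(n))

# Step 3: class width c = R / k, rounded to whole numbers like the data.
c = round(data_range / k)

# Unit of accuracy (assumed to be 1 because the scores are integers).
unit = 1

# Steps 4-5: class limits, starting the first class at the lowest observation (assumption).
lower_limits = [min(scores) + i * c for i in range(k)]
classes = [(lo, lo + c - unit) for lo in lower_limits]

# Step 6: tally the frequency of each class.
frequencies = [sum(lo <= x <= hi for x in scores) for lo, hi in classes]

# Step 7: class marks and class boundaries.
class_marks = [(lo + hi) / 2 for lo, hi in classes]
boundaries = [(lo - unit / 2, hi + unit / 2) for lo, hi in classes]

# Grouped-data mean: sum of f * Xm divided by the total frequency N.
grouped_mean = sum(f * xm for f, xm in zip(frequencies, class_marks)) / n

print("Class limits   Boundaries      Mark   Frequency")
for (lo, hi), (lb, ub), xm, f in zip(classes, boundaries, class_marks, frequencies):
    print(f"{lo:3d} - {hi:3d}      {lb:5.1f} - {ub:5.1f}   {xm:5.1f}   {f:3d}")
print("Grouped mean:", round(grouped_mean, 2))
```

With these assumptions the highest class still contains the largest observation, as step 5 requires. A different rounding of k or c would need the same check before tallying.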
