STATISTICSHandout.pdf


PRIMER IN STATISTICS

Meaning of Statistics

We have come into the age of computerization and are becoming rich in information at a very fast rate. However, the data we gather will not make sense unless we know how to use the available information to make good decisions. Statistics helps with this problem because it is a science that deals with the collection of data, the presentation of the collected data, and the interpretation of the results so as to yield meaningful information. Proper interpretation of the results helps us make better decisions for the future.

We practice statistical thinking in our everyday lives. For example, a student budgets his or her allowance based on what he or she knows about past prices and about the projects he or she needs to do. Parents do the same. Most young children nowadays can predict other people’s behavior based on what they have experienced in the past. During an election period, we try to predict the outcome based on what we hear or observe around us. We are used to making decisions or predictions based on previous observations. In the professional world, statistics is widely used since decisions are usually based on past data or on data collected through experiments and surveys.

Population and Sample

Before proceeding further, it is important to understand the basic concepts about the set of observations to be gathered in a study. The totality of the observations with which a study is concerned is called the population of the study. Population can refer to the subjects of the study or to the observations themselves. For example, if a study is conducted to determine the opinion of MSU-Marawi students on a possible tuition fee increase, and there are 16,000 students, we say that we have a finite population of size 16,000.
If one is interested in finding out the choice among the senatorial candidates in a coming election, the population of the study is the set of registered voters in that election.

A study that takes data from the entire population is called a census. Taking the whole population into the study is costly, laborious, time-consuming and sometimes impossible. Suppose your study deals with the opinion of the recipients of the 4P’s in the different barangays of Marawi City on whether the program greatly helps their livelihood or not. This study is too costly if you conduct a census, since you will need to go to each recipient in each barangay and ask for their opinion. It is also very laborious and will take a long time to finish. It may even be impossible to get the opinion of all recipients, because some of them might be away on vacation when you conduct the study. Thus, there arises a need to study only a part of the population, which we call a sample. Ideally, the sample must be taken in such a way that it represents the population very well.

POPULATION: It refers to the totality of the observations with which the study is concerned.
SAMPLE: It refers to a part or subset of a population.

Importance of Statistics

Based on the definition of Statistics in 4.1, we can derive some uses of statistics that we will also encounter in this course, and they are as follows:
(1) helps in the proper and efficient planning of a statistical inquiry in any field of study, for example by aiding in the choice of the method of collecting data, the sample size and the process of selecting the sample. It also helps in deciding the appropriate data to be collected.
(2) aids in presenting complex data in a suitable tabular and graphical form for easy and clear comprehension of the data.
(3) provides tools that can be used in understanding the nature and pattern of variability of a phenomenon or a set of observations.
(4) can help us make reliable inferences about the population even when the data come only from a sample.
(5) can help you understand and discover the relationships between variables.
(6) helps you obtain reliable forecasts.
(7) can aid in making decisions on how to improve processes.

Math 31/Stat 32, Second Semester, A.Y. 2014 – 2015 Mathematics Dep’t, MSU, Marawi City 2

Variables and Types of Data

In order to gain information about seemingly haphazard events, statisticians collect data for variables used to describe the event. A variable is a characteristic or attribute that changes or varies over time and/or for different individuals or objects under consideration. An experimental unit is the individual or object on which a variable is measured. A single measurement or data value results when a variable is actually measured on an experimental unit. Data are the values (measurements or observations) that the variables can assume. Variables whose values are determined by chance are called random variables.

For example, suppose that an insurance company studies its records (called data) about the number of automobiles involved in car accidents. From its records, the company observed that, on the average, 3 out of every 100 automobiles were involved in an accident every year. Although there is no way to predict the specific automobiles that will be involved in an accident (a random occurrence), the company can adjust its rates accordingly since it knows the general pattern over the long run. That is, on the average, 3% of the insured automobiles will be involved in an accident each year.

A collection of data values forms a data set. Each value in the data set is called a data value or datum.

Variables can be classified into two categories: qualitative or quantitative.

Qualitative variables are variables that measure a quality or characteristic on each experimental unit.
They produce data that can be categorized according to similarities or differences in kind, often called categorical data. Examples are gender, political affiliation and religious preference.

Quantitative variables are variables that measure a numerical quantity or amount on each experimental unit. Their values can be ordered or ranked. Examples are heights, weights, volumes and body temperatures. Quantitative variables are classified into two groups: discrete and continuous.

1. Discrete variables
A discrete variable can assume only a finite or countable number of values. It can be assigned values such as 0, 1, 2, 3, and is said to be countable.
Examples: number of children in a family, number of students in a classroom, number of calls received by a switchboard operator each day for one month.

2. Continuous variables
A continuous variable can assume an infinite number of values between any two specific values. Its values are obtained by measuring and often include fractions and decimals.
Examples:
(a) Temperature is a continuous variable since it can assume all values between any two given temperatures.
(b) Heights might be rounded to the nearest inch, weights to the nearest ounce, etc. Hence, a recorded height of 73 inches could mean any measure from 72.5 inches up to but not including 73.5 inches. Thus, the boundary of this measure is given as 72.5-73.5 inches.

Data can be used in different ways. Depending on how they are used, the body of knowledge on statistical methods is divided into two main areas or branches, namely Descriptive Statistics and Inferential Statistics.

Two Major Areas in Statistics

(1) Descriptive Statistics. It comprises those statistical tools that deal with the presentation of observed data in various forms such as tables, graphs and diagrams, or with describing the data through the computation of measures that summarize the characteristics of the data set.
Example: Consider the national census conducted by the National Statistics Office (NSO) every 10 years. Results of this census give the average age, income, and other characteristics of the Philippine population. The NSO also presents the data in some meaningful form such as charts, graphs, or tables.

(2) Inferential Statistics. It consists of those statistical tools concerned with generalizing results from random samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions.
Example: Suppose we want to know the percentage of illiterates in our country. We take a random sample from the population and find the proportion of illiterates in the sample. With the aid of other statistical measures and probability, we make some inferences or general statements about the population proportion of illiterates.

Exercises 1

In each statement that follows, identify whether descriptive or inferential statistics is used.
(a) The average price of a unit of Camella homes sold in Cagayan de Oro during the week of April 22-28, 2012 was PhP 1,051,053.
(b) According to the Provincial Statistics Office, 85% of the workers in their province get to work by public utility vehicle.
(c) The National Eye Institute has halted a clinical trial on a type of eye surgery, calling it ineffective and possibly harmful to a person’s vision.
(d) “Allergy therapy may make bees go away” (Prevention, April 1995).
(e) Drinking decaffeinated coffee can raise cholesterol levels by 7%.
(f) It is predicted that the average number of automobiles each household owns will increase next year.
(h) The average number of students in a class in Mindanao State University is 22.6.
(i) Last year’s total attendance of Azkal’s football games was 50,000.
(j) According to the Court of Justice of the Philippines, 14% of trial-ready civil actions and equity cases during 1993 were decided in less than six months.

II. Sampling Procedures

In sampling, only a relatively small number of respondents or experimental units are involved; thus, it is commonly used in practice. We examine some of its advantages.

Advantages of Sampling
1. It entails less cost and less effort, and it is less time-consuming.
a. Since the sample is small compared to the population, the time, cost and effort involved in a sample study are much less than in a study of the whole population. A study of the population requires a large fund because of the resources to be used, which may include more manpower and materials.
b. It also takes a much shorter period of time to gather data from a sample than from a population. Thus, sampling can lead to more timely results as well.
2. It is less cumbersome and more practical to administer. It is easier to handle and manage, and less of a burden on your part, if you take data only from a smaller number of respondents.
3. Some experiments are destructive, so it is not possible to involve the whole population. For example, a car manufacturer might want to test the durability of the cars being produced. Obviously, not every car can be crash-tested to determine its durability, or else the company would have nothing left to sell. To overcome this problem, samples are taken from populations, and estimates are made about the total population based on information derived from the sample.

Sampling also has disadvantages, the biggest of which is that the sample may not truly reflect the characteristics of the population, which would lead to wrong conclusions. Hence, care must be taken in choosing a sample. Also, a sample must be large enough to give a good representation of the population, but small enough to be manageable.
Types of Sampling Procedures

The method of drawing a sample has a big impact on the validity and reliability of the results of the study. It can also influence the kind of inferences that can be made about the population. Samples can be drawn either randomly (called probability or random sampling) or by non-random procedures.

A. Probability Sampling or Random Sampling

In probability sampling, each element has a known probability of selection, and a chance method such as “draw lots” or using numbers from a random number table is used in selecting the specific units to be included in the sample.

(1) Simple Random Sampling (SRS)
This is the simplest form of random sampling, where every subset of size n of the population has an equal chance of being selected. A simple random sample can be drawn using the “fishbowl” method, the “draw lots” method or random numbers. In drawing a simple random sample, the researcher is in effect mixing up the units in the population before a sample of n units is selected.

Steps in Simple Random Sampling (SRS):
(1) Assign a number to each element of the population using the numbers from 1 to N.
(2) Select n numbers from 1 to N using a random process such as the fishbowl method or drawing lots, or use random numbers generated by a scientific calculator or taken from a table of random numbers (see Appendix A).

Steps in obtaining a random number from a scientific calculator:
(NOTE: This can be used for a population of fewer than 1000 elements, since most scientific calculators produce random numbers with only 3 decimal places.)
i. Press INV or Shift, then press Ran#. A number between 0 and 1 with three decimal places will appear. (NOTE: Your results will differ from those of your classmates, and every time you press these buttons, a different number will appear.)
ii.
Multiply this random number by the population size and round it off to a whole number. The result corresponds to the number of an element in the list.
iii. Repeat (i) and (ii) until you get the desired sample size with distinct elements.

Examples:
(a) Each name in a telephone book could be numbered sequentially. If the sample is to include 1,000 people, then 1,000 numbers could be randomly generated by computer, or picked using a random process such as drawing lots. These numbers could then be matched to names in the telephone book, thereby providing a list of 1,000 people. (Note: We cannot use a scientific calculator to generate the random numbers because the population is greater than 1,000.)
(b) Choose a random sample of five (5) students from the following 30 students using the random number function of your calculator.
Suzette, Audrey, Jamilah, Chad, Mary, Marielle, Charmie, Cris, Dawn, Saliha, Norjehan, Edon, Johairy, Rose, Khert, Christian, Eliezer, Geneveve, San, Endera, Emil, Jan, Amsari, Yhenz, Yhan, Melai, April, Carima, Annabelle, Elma

A disadvantage of simple random sampling is that we can never be assured that all sectors or groups are represented in the sample. For instance, in Example (b) above, there is a possibility that all elements drawn will be girls or all will be boys. To avoid this possibility, we can employ another sampling procedure that can lead to a more representative sample, one in which the sample units are spread evenly over the entire population. This sampling procedure is called systematic random sampling.

(2) Systematic Random Sampling
This is also called interval sampling. It means that there is a gap or interval between successive selections. Researchers obtain systematic samples by numbering each subject of the population and then selecting every kth element, with the first unit chosen at random.
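As a rough illustration of interval sampling, the sketch below numbers a population from 1 to N, picks a random start, and then takes every kth unit. It assumes the interval k is the population size divided by the sample size, rounded to a whole number. The function name `systematic_sample`, its optional `start` argument and the population figures are hypothetical, introduced only for this example.

```python
import random


def systematic_sample(N, n, start=None):
    """Return the 1-based positions of a systematic sample of size n.

    N     -- population size (units are assumed to be numbered 1..N)
    n     -- desired sample size
    start -- optional fixed start r (1 <= r <= k); chosen at random if omitted
    """
    # Sampling interval, rounded to the nearest whole number (at least 1).
    k = max(1, int(N / n + 0.5))
    # Random start r between 1 and k, unless the caller supplies one.
    r = random.randint(1, k) if start is None else start
    if not 1 <= r <= k:
        raise ValueError("start must satisfy 1 <= start <= k")
    # Positions r, r + k, r + 2k, ...; any position beyond N is dropped,
    # which can happen when k was rounded up.
    return [r + i * k for i in range(n) if r + i * k <= N]


# Hypothetical population of 60 units, sample of 5, start fixed at 4.
print(systematic_sample(60, 5, start=4))  # prints [4, 16, 28, 40, 52]
```

Leaving out `start` lets the function draw the starting point itself, so repeated calls will generally return different samples.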
Steps in taking a Systematic Random Sample:
(1) Assign a number to each element of the population using the numbers from 1 to N.
(2) Determine the sampling interval k: k = N/n, where N = population size and n = sample size.
NOTE: If k is not a whole number, then it is rounded to the nearest whole number. For example, if N = 400 and n = 15, then k = 400/15 = 26.67, which is rounded off to the nearest whole number, 27.
(3) Select a random start r where 1 ≤ r ≤ k. The sample will include the rth element, the (r + k)th, the (r + 2k)th, the (r + 3k)th and so on until you reach the desired sample size.

Examples:
(a) If a systematic sample of 6 students were to be selected in a class with an enrolled population of 48, the sampling interval would be k = N/n = 48/6 = 8. All students would be assigned sequential numbers. The starting point would be chosen by selecting a random number between 1 and 8. If this number is 5, then the sample will consist of the rth = 5th student, the (r + k = 5 + 8)th = 13th student, the (r + 2k = 5 + 2(8))th = 21st student, the (r + 3k = 5 + 3(8))th = 29th student, the (r + 4k = 5 + 4(8))th = 37th student and the (r + 5k = 5 + 5(8))th = 45th student.
(b) In a population of 1200 individuals, choose a systematic random sample of size 9.
Solution: k = N/n = 1200/9 = 133.33, and since it is not a whole number, we round it off to the nearest whole number, 133. Since k = 133, r can be 1, 2, 3, …, or 133. If we choose r = 3, the sample points will be the 3rd person, the 136th person, the 269th person, the 402nd person, the 535th person, the 668th person, the 801st person, the 934th person and the 1067th person.

NOTE: The list from which the systematic sample is drawn should be examined to make sure it has no periodic pattern, because a periodic pattern could result in a biased sample.
For example, if our list is the monthly sales from 2005 to 2008 arranged in chronological order, then there are 48 months to choose from. If we select a systematic sample of 12, then k = 48/12 = 4. If we start from the 3rd element, then the next elements will be the 7th, 11th, 15th, 19th, 23rd, 27th, 31st, 35th, 39th, 43rd and 47th. These correspond to the sales of March 2005, July 2005, November 2005, March 2006, July 2006, November 2006, March 2007 and so on, with November 2008 as the last element. This sample is biased since it represents only the months of March, July and November. In a case like this, where the list has a periodic pattern, the appropriate procedure is Simple Random Sampling (SRS), since its result does not follow a periodic pattern. For example, if you use the fishbowl method, draw lots or random numbers, the first element may be March 2005, then July 2006, and so on until you get the desired sample size.

Having seen the real danger in systematic random sampling when the sampling interval corresponds to a periodicity in the list, we now study another sampling procedure. It can be much more efficient than simple random sampling, and it is carried out by dividing the population into homogeneous subpopulations and then selecting a simple random sample from each subpopulation. This sampling procedure is called stratified random sampling.

(3) Stratified Random Sampling
In this sampling procedure, the population of N units is first divided into homogeneous subpopulations called strata (homogeneous with respect to the characteristics of interest), and then a sample is drawn from each stratum. This type of sampling assures that all groups or strata are represented in the sample. Some stratification variables commonly used by the Social Weather Station (SWS) survey are location, age and sex. Other stratification variables may be religion, academic ability or marital status.
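Once the strata are defined, the sample has to be divided among them. One common rule, proportional allocation, gives each stratum a share of the sample equal to its share of the population. The sketch below illustrates that rule with made-up stratum sizes. The function name `proportional_allocation`, the stratum labels and the choice to give the last stratum whatever remains (so that the parts add up to n) are assumptions made for illustration.

```python
import math


def proportional_allocation(stratum_sizes, n):
    """Split a total sample size n across strata in proportion to their sizes.

    stratum_sizes -- dict mapping a stratum name to its population size
    n             -- total sample size
    """
    N = sum(stratum_sizes.values())
    names = list(stratum_sizes)
    allocation = {}
    # Round every stratum except the last to the nearest whole number
    # (halves are rounded up).
    for name in names[:-1]:
        allocation[name] = math.floor(stratum_sizes[name] / N * n + 0.5)
    # Assumption: the last stratum takes the remainder so the total is exactly n.
    allocation[names[-1]] = n - sum(allocation.values())
    return allocation


# Hypothetical strata: three year levels with 520, 310 and 170 students.
print(proportional_allocation({"first": 520, "second": 310, "third": 170}, 60))
# prints {'first': 31, 'second': 19, 'third': 10}
```

Assigning the remainder to the last stratum is only one way to keep the total at n; rounding every stratum independently is simpler but can leave the parts summing to slightly more or less than n.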
Steps in taking a Stratified Random Sample:
(1) Classify the population into at least two homogeneous strata. The basis for classification must be closely related to the variable of interest. For example, if we are interested in determining the students’ opinion on the tuition fee increase, it may be logical to subdivide the population of students by income of parents, by college, by year level, by tribe, or by a combination of these.
(2) Draw a sample from each stratum by simple or systematic random sampling.

How many shall we take from each stratum? The most commonly used formula is proportional allocation. In proportional allocation, the number of units to be taken from each stratum is proportional to the size of the subpopulation; that is, between two strata of different sizes, a bigger sample will be taken from the bigger stratum.

Proportional Allocation. If a population of size N is divided into k homogeneous subpopulations or strata of sizes N1, N2, …, Nk, then the sample size to be taken from stratum i is obtained using the formula

ni = (Ni / N) × n, for i = 1, 2, …, k

NOTE: If ni is not a whole number, then it is rounded off to the nearest whole number.

Example:
(a) The manager of a girls’ dormitory wants to learn how the students feel about the dorm’s services. The students were classified according to the following scheme:

CLASSIFICATION    NUMBER OF STUDENTS
Freshmen          220
Sophomore         195
Junior            163
Senior            150

If we use proportional allocation to select a stratified random sample of size n = 40, how large a sample must be taken from each stratum?
Solution: Since n = 40 and N = 220 + 195 + 163 + 150 = 728, then
n1 = (N1/N) × n = (220/728) × 40 = 12.088 ≈ 12
n2 = (N2/N) × n = (195/728) × 40 = 10.714 ≈ 11
n3 = (N3/N) × n = (163/728) × 40 = 8.956 ≈ 9
n4 = 40 - 12 - 11 - 9 = 8

(b) In an election survey in Makati City, registered voters are classified according to the following scheme:

Economic Status                     Number of People
A (Upper Class or Rich People)      725
B (Middle Class)                    3489
C (Lower Class or Poor People)      2146

If one uses proportional allocation to select a stratified random sample of size n = 345, how large a sample must be taken from each stratum?
Solution: Since n = 345 and N = 725 + 3489 + 2146 = 6360, then
n1 = (N1/N) × n = (725/6360) × 345 = 39.328 ≈ 39
n2 = (N2/N) × n = (3489/6360) × 345 = 189.262 ≈ 189
n3 = 345 - 39 - 189 = 117

The advantage of stratified sampling lies not only in the assurance that all strata are represented; it can also lead to better estimates of the population parameters compared to Simple Random Sampling.

In many statistical studies, we can reduce the cost of sampling, compared with simple random sampling, by randomly selecting groups of elements from a population and then sampling some or all of the elements within the selected groups. This sampling procedure is usually used when the population is widely distributed geographically or occurs in natural clusters such as households, schools or business establishments. If the population is the set of workers of NGOs (non-government organizations) in the Philippines, it is much cheaper to sample NGOs and interview every worker in the selected NGOs than to interview a Simple Random Sample (SRS) of NGO workers, because with SRS you might need to travel to an NGO office just to interview one worker. Thus, it is usually cheaper to sample in clusters than by SRS or stratified sampling. This sampling procedure is known as cluster sampling.

(4) Cluster Sampling
Cluster sampling assumes that the population is naturally separated into groups or clusters.
A number of clusters are selected randomly, and then all or part of the units within the selected clusters are included in the sample. No units from the non-selected clusters are included in the sample. It differs from stratified sampling because, in the latter, sample units are selected from every group. You may be able to save considerable resources with cluster sampling compared to SRS or Stratified Random Sampling, but cluster sampling leads to less precise estimates. This is because units within the same cluster tend to give similar information, which may differ from the information in clusters that were not selected.

Steps in taking a Cluster Sample:
(1) Divide the population area into clusters.
(2) Randomly select a few of these clusters.
(3) Choose all the elements from the selected clusters, or select only a portion of them.

Examples:
(a) Suppose the population of a study is the residents of condominiums in a large city. If there are 10 condominium buildings in this city, the researcher can select two buildings randomly from the 10 and interview all (or a subsample) of the residents of these buildings.
(b) Suppose an organization wishes to find out which sports senior students in the Philippines are participating in. It would be too costly and would take too long to survey every student, or even some students from every school. Instead, 100 schools are randomly selected from all over the Philippines. These schools are considered to be clusters. Then every senior student in these 100 schools is surveyed. In effect, the students in the sample of 100 schools represent all senior students in the Philippines.

The advantages of cluster sampling are reduced costs, simpler fieldwork and more convenient administration. Instead of having a sample scattered over the entire coverage area, the sample is more localized in relatively few centers.
However, it often gives less accurate results due to higher sampling error than simple random sampling with the same sample size. In the above example, you might expect to get more accurate estimates from randomly selecting students across all schools than from randomly selecting 100 schools and taking every student in those chosen schools.

Exercises 2

I. Classify each sampling procedure as simple random sampling, systematic, stratified, or cluster.
(1) In a large school district, all teachers from two buildings are interviewed to determine whether they believe the students have less homework to do now than in previous years.
(2) Nursing supervisors are selected using random numbers in order to determine average annual salaries.
(3) Every hundredth hamburger manufactured is checked to determine its fat content.
(4) Mail carriers of a large city are divided into four groups according to gender (male or female) and according to whether they walk or ride on their routes. Then 10 are selected from each group and interviewed to determine whether they have been bitten by a dog in the last year.
(5) There are 2000 subjects in the population and a sample of 50 is needed. So, every 40th subject is selected at random.
(6) A city’s telephone book lists 100,000 people. Suppose the telephone book is the frame for a study and Suzette wants to interview every 200th person.

II. Answer as indicated.
(1) A population of 70 cities is numbered from 1 to 70. Select a systematic random sample of 15 cities. Choose your own starting sample point.
(2) Pulse Asia is conducting an Exit Poll on the recently concluded National Election. A certain barangay has been considered for the survey. Of the 150 households (numbered 1 to 150) in the said barangay, only 10 are to be included in the survey.
(a) Use systematic random sampling to select the 10 households using your selected random start.
(b) Use the method of simple random sampling to choose the 10 households.
(3) At a university, students are classified according to the following scheme:

HOUSING TYPE         NUMBER OF STUDENTS
Campus dormitory     2100
Lodging house        720
Private residence    680

Use proportional allocation to determine how many students should be taken from each classification if a stratified random sample of size 200 is to be chosen.

B. Non-probability Sampling

In non-probability sampling, individuals or items are chosen in a manner that does not involve a random selection process. It is usually used when the size of the population is unknown or when its members cannot be individually identified. Here, personal preferences are applied. Because chance is not used to select items, the techniques are called non-probability techniques. They are not desirable for gathering data to be analyzed by the methods of statistical inference, because the reliability of the measures cannot be determined objectively.

(1) Convenience Sampling
The elements in convenience sampling are selected for the convenience of the researcher. Usually the researcher chooses those that are readily available, nearby, or willing to participate. Such a study usually yields observations that are less varied than the population, because in many environments the extreme elements of the population are not readily available.
Examples:
(a) A convenience sample of homes for door-to-door interviews might include houses where people are at home, houses with no dogs, houses near the street, first-floor apartments, and houses with friendly people.
(b) If the research firm is located in a mall, a convenience sample might be selected by interviewing only shoppers who pass the shop and look friendly.

(2) Quota Sampling
Quota sampling appears to be similar to stratified random sampling in that certain population subclasses, such as age group, gender, or geographic region, are used as strata.
However, instead of using random sampling from each stratum, the researcher uses a nonrandom sampling method to gather data from a stratum until the desired quota of samples is filled. The quota is often filled by using available, recent, or applicable elements. Quota sampling is less expensive than most random sampling techniques because it is essentially a technique of convenience. Data gathering is also faster, since the researcher does not have to call back or send out a second questionnaire if he does not receive a response; rather, he just moves on to the next element.
Examples:
(a) Instead of randomly interviewing people to obtain a quota of Italian Americans, the researcher would go to the Italian area of the city and interview there until enough responses are obtained to fill the quota.
(b) Suppose the researcher wants to stratify the population into owners of different types of cars but fails to find any list of Toyota van owners. Through quota sampling, the researcher would proceed by interviewing all car owners and casting out non-Toyota van owners until the quota of Toyota van owners is filled.

(3) Judgment or Purposive Sampling
The elements selected for the sample are chosen by the judgment of the researcher. Researchers often believe they can obtain a representative sample by using sound judgment, or they purposely choose those who can provide the best information to achieve the objectives of the study, which saves time and money. The researcher goes only to people who, in his or her opinion, are likely to have the required information and are willing to share it. This is important when you want to construct a historical reality, describe a phenomenon or develop something about which only a little is known.
Example: A student conducted a study on the history of CNSM. To get proper information, he interviewed past deans, chairmen and pioneering faculty and staff of the college.
(4) Snowball Sampling
In snowball sampling, survey subjects are selected based on referrals from other survey respondents; that is, the sample is selected using networks. The researcher identifies a person who fits the profile of subjects for the study, and that person becomes part of the sample. The researcher then asks this person for the names and locations of others who would also fit the profile. This process continues until the required number of subjects, in terms of the information being sought, has been reached. Through these referrals, survey subjects can be identified cheaply and efficiently, which is particularly useful when survey subjects are difficult to locate. This sampling technique is useful if you know little about the group or organization you wish to study, as you only need to make contact with a few individuals, who can then direct you to the other members of the group.
Example: A researcher wanted to study the factors why some students occasionally use prohibited drugs. He intended to get 50 students, but he knew only 5 students who used them. By getting the cooperation of these 5 students, he was referred to other drug users, who in turn also provided additional contacts. In this way, he was able to get the number of students he needed.

NOTE: Probability samples can be further analyzed using the methods of statistical inference, but this is not valid for non-probability samples.

Exercises 3

Determine what type of non-random sampling procedure is used in each of the following:
(1) A researcher wanted to study the participation of Meranao female students in sports. To get a sample of 50 students, he went to every college and asked female students whether they are Meranao or not. Those who are were asked to participate in the study by answering the questionnaire.
(2) A researcher wanted to study the sufficiency of the facilities of the CNSM library for the needs of CNSM students in their CNSM subjects. He planned to get a sample of 100 students. Due to time constraints and the difficulty of getting a random sample, he decided to get his sample during the CNSM orientation. There he asked 100 students to participate in his study.
(3) A researcher wanted to understand the attitudes of minority managers toward a system for assessing management performance. To get the proper information, they interviewed managers who are members of a minority group and who work in medium-scale to large-scale firms.
Math 31/Stat 32, Second Semester, A.Y. 2014 – 2015 Mathematics Dep’t, MSU, Marawi City 9
(4) A sociologist conducted a study on the opinions of employed adult women about government funding for day care. She went around an area knocking on doors during the weekend, when women are likely to be at home. She asked to speak to the woman of the house. Her first question was whether the woman is employed or not. The interview was conducted if the respondent is employed.
(5) Manufacturers and advertising agencies wanted to know about the habits of consumers and the effectiveness of ads. They needed to interview a sample of 1,000 consumers. To get the required sample, they went to some shopping malls and interviewed consumers until they obtained the required number.
(6) Suzanne wanted to know how to manage a good quality business. To have better information for her study, she interviewed different managers at different business stations.
(7) Physical Education students wanted to know the different views on the good effect on health of having Physical Education 4. To get information from 100 students, they stand in the grandstand and ask each of the students there whether they had their Physical Education 4 in the previous semester.
If so, they then ask that student about his or her views on having Physical Education 4.
(8) Emil needs information about the history of his place. He interviewed his ancestors and also the past officials of the place to get the information he wanted.
(9) The marketing unit of a company wants to test a new personal computer. They need to interview 50 users. To get the desired number of users, they went to different internet cafés and interviewed those people who looked friendly.
So far, we have only discussed how to select the sample for a study, which is also called the sampling design. Sampling design is critical to the interpretation of observational studies, that is, studies in which the researcher merely "observes" the study units, making one or more measurements on each. A study in which the researcher intervenes ("experiments") in some way to affect the manner in which the study units (now called "experimental units") respond is called an experimental study. This is the topic of the next section.
III. Design of Experiments
The sampling procedures above are applicable to survey research. But for research that involves experiments in a laboratory or in an agricultural field, or the application of different teaching methods, the question is "How will you assign the different treatments to the experimental units?" We call this the design of experiments or experimental design. By an experimental design, we mean a plan used to collect the data relevant to the problem under study in such a way as to provide a basis for valid and objective inference about the stated problem. The plan usually consists of the selection of treatments whose effects are to be studied, the specification of the experimental layout, the assignment of treatments to the experimental units, and the collection of observations for analysis. All these steps are accomplished before any experiment is performed.
Two of the most common designs are the Completely Randomized Design (CRD) and the Randomized Complete Block Design (RCBD). Interested students may read books that discuss experimental design, such as Black (2004) and Walpole (1982), or search the internet.
IV. Methods of Collecting Data
In the planning stage of a study, one of the critical things to be decided upon is the method to be used in collecting the data. Five methods of data collection are discussed below, and each of them has its own strengths and weaknesses. The choice will depend upon the availability of time and resources, the appropriateness of the method, the type of sample units to be studied, and others.
A. Interview Method
This is a person-to-person encounter between the one soliciting information (also known as the interviewer) and the one supplying the information (also known as the interviewee). It can be conducted in person or through a telephone conversation.
Advantages:
(1) Questions can be repeated, rephrased, or modified for better understanding.
(2) Answers may be clarified, thus ensuring more precise information.
(3) Information can be evaluated since the interviewer can observe the facial expression of the interviewee.
Disadvantages:
(1) It is costly because you might need to spend a lot for transportation, aside from other incidental expenses.
(2) It can cover only a limited number of individuals in a given period of time. Hence, you need a longer time to finish the data collection.
(3) Interviewees may feel pressured to give on-the-spot responses.
(4) People may give different answers to different interviewers.
(5) People may say what they think an interviewer wants to hear or what they think will impress the interviewer.
(6) A particular interviewer may affect the accuracy of the response by misreading questions, recording responses inaccurately, or antagonizing the respondent.
B.
Questionnaire Method
The questionnaire could be mailed or hand-carried (delivered in person).
Advantages:
(1) It is less expensive and has a greater scope than the interview method.
(2) Respondents have enough time to formulate appropriate responses.
Disadvantages:
(1) Low return rate. Only a few would care to mail back the questionnaire.
(2) People do not always understand the questions, and certain words sometimes mean different things to different people. There is no way for respondents to ask for clarification before they answer the questionnaire.
C. Observation Method
This is appropriate for obtaining data pertaining to the behavior of an individual or a group of individuals at the time a given situation occurs. Subjects may be observed individually or collectively.
Examples:
(a) If your study deals with how many people are involved in a fight in a particular place and the reason why they are fighting, then you must be there before the conflict ends so that you can witness what is happening and get reliable reasons behind it. Do not be late, because you cannot look back at the conflict, and it is impossible to replay what happened unless you secretly take a video of it.
(b) Suppose you want to study the boxing games of Manny Pacquiao. To have an unbiased study, you need to be in the arena before the game starts.
Limitation: Observation can be made only at the time of occurrence of the appropriate event/s.
D. Experimentation Method
This can be applied in obtaining data from an experiment.
Advantage: The experiment can be repeated.
Disadvantage: It takes a long time and great effort to wait for the result, especially when the first experiment fails, because in that case you must repeat the experiment to get a good outcome or result.
E. Use of Existing Data
The data come from:
(a) documents (books and magazines, hospital records, public files, registrations, etc.)
(b) from the internet.
Advantages:
(1) Provide information about the incidence (the number of new cases), prevalence (the number of existing cases), and rate (the proportion of a population with the particular concern) in a population (Rossi and Freeman, 1993).
(2) Aid in the definition and selection of the target population.
(3) Help improve the planning and design of a new study.
Disadvantages:
(1) If you are using agency records, your information will apply only to those individuals participating in that program. Agency records exclude data on individuals who are not participating.
(2) If you are using published reports or data collected by outside sources, you will not have enough information about the individuals involved in the specific study you are evaluating.
(3) It cannot give precise information about a geographic area unless the data were collected specifically in that area. For example, if you access the records on teen pregnancy rates, those rates may not accurately reflect the pregnancy rates in your own community.
(4) Published reports will not let you determine the impact of your study on its actual participants.
V. Types of Level of Measurement
Another way of classifying data is according to their level of measurement. You may notice that one can measure the exact difference between two persons with heights of 5 feet and 4 feet. But you cannot measure the exact difference between two persons when the opinion of one on a certain issue is "strongly agree" and that of the other is "agree." However, you know that one has a higher level of agreement than the other. And if you compare their gender, you can only tell whether they belong to the same category or not, but you can never tell which of them has the "stronger" gender. This property of data leads to four (4) classifications or levels, namely: nominal, ordinal, interval and ratio.
A.
Nominal Level
This is the lowest level of measurement. The values of data at this level fall into unordered categories or classes. Nominal data are used to distinguish different categories of qualitative variables and can be used as measures of identity. When data are processed using computer packages, the encoder gives the same number to members of the same category and different numbers to members of different categories. In other words, the numbers here are essentially "dummy codes": the data can be coded, but the codes have neither an ordering property nor mathematical significance.
Examples:
(a) A sample of college instructors classified according to subject taught, such as English, History, Psychology, or Mathematics.
(b) Classifying respondents as male or female.
(c) Classifying residents according to zip codes; there is no meaningful order or ranking.
(d) Political party, such as Liberal, United Nationalist Alliance, Independent.
(e) Religion, such as Lutheran, Jewish, Catholic, Methodist, etc.
(f) Marital status, such as married, divorced, widowed, separated.
(g) Blood type: 1-type A, 2-type B, 3-type AB, 4-type O
The numbers 1, 2, 3, 4 in Example (g) above have no inherent mathematical properties; that is, assigning 4 to type O and 1 to type A does not mean that type O is better than type A. Moreover, the assignment of codes is not unique. For instance, 0 may be assigned to type A, 1 to type B, and so on. Since the codes have no mathematical significance, we cannot add, subtract, or otherwise do arithmetic on the data. If we compute 4 - 1 = 3, we might arrive at a wrong interpretation, such as "a person with blood type O minus a person with blood type A gives a person with blood type AB," which is clearly absurd. The numbers are used only to facilitate data analysis using the computer.
B. Ordinal Level
More often, ordinal data are categorical data whose categories can be ranked; however, precise differences between the categories do not exist.
Sometimes, however, ordinal data use numbers. In this case, the numbers indicate position in an ordered series of categories, but they do not indicate how much of a difference exists between successive positions on the scale. In other words, ordinal data may be arranged in some order, but the difference between data values either cannot be determined or is meaningless. When ordinal data are encoded in the computer for analysis, they are converted to numbers such that the numbers indicate positions in an ordered series.
Examples:
(a) Rank of students in a graduating class (1-valedictorian, 2-salutatorian, and so on). A rank of 5 is better than a rank of 10. The difference of 5 between the 5th and 10th ranks is meaningless, i.e., the difference of 5 between ranks 5 and 10 is not necessarily the same as the difference between ranks 20 and 25.
(b) Military rank, position in the office, opinion on an issue (strongly disagree, disagree, neutral, agree, strongly agree), final grade in Math 1 (1.00, 1.25, 1.50, etc.)
(c) Speakers might be ranked as superior, average, or poor.
(d) Floats in a homecoming parade might be ranked as first place, second place, etc.
(e) Letter grades (such as A, B, C, D, E, F) or numerical grades (such as 1.0, 1.25, 1.50, etc.). These are ordinal because 1.0 corresponds to a score of 90-100, 1.25 corresponds to a score of 80-90, and so on; hence we can say that getting a grade of 1.0 is better than getting a grade of 1.25.
C. Interval Level
Interval-level data are numerical, hence they can be ranked, and precise differences between units of measure do exist. However, there is no absolute zero, that is, no inherent zero starting point (absolute zero means the total absence of the characteristic being measured). The starting point is arbitrary.
Examples:
(a) Temperature in degrees Fahrenheit or degrees Celsius. The freezing point of water is 0° in Celsius, while in Fahrenheit it is 32°.
Moreover, 30° Celsius is hotter than 15° Celsius, but it is wrong to conclude that 30° is twice as hot as 15°, since 0° is not an absolute zero point: 0° does not mean the total absence of heat. In fact, some countries have negative temperatures during winter, such as -10°C.
(b) IQ is an example of an interval scale. There is a meaningful difference of one point between an IQ of 109 and an IQ of 110. But we cannot say that a person who has an IQ of 100 is twice as intelligent as one with an IQ of 50. An IQ of zero (0) does not mean that the person who took the IQ test has no intelligence.
D. Ratio Level
This possesses all the characteristics of interval measurement, and there exists a true zero, meaning it has an inherent zero starting point. As in the interval scale, differences are meaningful. The ratio of two measures is also meaningful. For example, a person who is 4 feet tall is twice as tall as a person who is 2 feet tall, since the true starting point is zero. This is the highest level of measurement.
Examples:
(a) Monthly salary: Php 0 means no salary.
(b) Other examples of ratio scales are those used to measure height, weight, area, number of phone calls received, number of children, etc.
Exercises 4
1. Classify each variable as categorical or numerical.
(a) Colors of jackets in a men's clothing store.
(b) Number of seats in classrooms.
(c) Classification of children in a day care center (infant, toddler, preschool).
(d) Length of fish caught in a certain stream.
(e) Number of students who fail their first statistics examination.
2. Classify each variable as discrete or continuous.
(a) Number of loaves of bread baked each day at a local bakery.
(b) Water temperature of the saunas (steam bath) at a given health spa.
(c) Income of single parents who attend a community college.
(d) Lifetimes of a certain type of batteries in a tape recorder.
(e) Weights of newborn infants at a certain hospital.
3. Classify each as nominal, ordinal, interval, or ratio level.
(a) Horsepower of motorcycle engines.
(b) Ratings of newscasts in the Philippines (poor, fair, good, excellent).
(c) Temperature of automatic popcorn poppers.
(d) Time required by drivers to complete a course.
(e) Salaries of cashiers of Day-Night grocery stores.
(f) Marital status of respondents to a survey on savings accounts.
(g) Ages of students enrolled in a martial arts course.
(h) Weights of beef cattle fed a special diet.
(i) Rankings of weight lifters.
(j) Number of exams given in a statistics course.
(k) Ratings of word-processing programs as user-friendly.
(l) Temperatures of a sample of automobile tires tested at 55 miles per hour for six minutes.
(m) Weights of suitcases on a selected commercial airline flight.
(n) Classification of students according to major field.
(o) Data classified according to color.
(p) Years of service in a company.
For the researcher to describe results, draw conclusions, or make inferences about events after collecting the data needed for the study, the data must be organized and presented in some meaningful way. The next section shows the different methods of presenting or organizing data in a meaningful way.
VI. Methods of Presenting Data
This section shows how to organize data and how to construct appropriate graphs to represent the data in a concise, easy-to-understand form. There are three methods of presenting data: textual presentation, tabular presentation and graphical presentation.
A. Textual Presentation
The first method of presenting data is textual presentation. The data that were collected are presented in sentence form.
Example: Twenty of the respondents are male and thirty of the respondents are female.
B.
Tabular Presentation
A tabular presentation is an arrangement of statistical data in rows and columns. Rows are horizontal arrangements whereas columns are vertical arrangements.
Example: The table below shows the average weight of respondents grouped according to gender.

GENDER    AVERAGE WEIGHT (KILOS)
Male      60
Female    52

A special type of table that is important in statistical analysis is called a frequency distribution table.
Definition: A frequency distribution is a summary of the data presented in the form of class intervals and frequencies.
The data can be presented in a one-way or two-way frequency distribution table.
(1) One-way frequency distribution
The data are tabulated according to a single variable.
Example: Frequency Distribution of Respondents According to Year Level

Year Level    Number of Students
First         35
Second        50
Third         48
Fourth        24

(2) Two-way frequency distribution
The data are tabulated according to two variables. It is also called a cross-tabulation or contingency table.
Example:

                    GENDER
TRIBE          MALE    FEMALE    TOTAL
Meranao         29       55        84
Non-Meranao     12        4        16
Total           41       59       100

For numerical data with a wide range of values, it is more practical to group the observations into classes, as in the example below.

AGE (years)    NUMBER OF STUDENTS
15-16           40
17-18           56
19-20           42
21-22           30
23-24           15
TOTAL          183

This frequency table summarizes the data into 5 classes or categories. The class interval 15-16 has a lower class limit of 15 and an upper class limit of 16. The interval 15-16 actually includes ages ranging from 14.5 to 16.5, because an age between 14.5 and 16.5, when rounded to a whole number, becomes 15 or 16.
When data are organized into a frequency distribution, they are called grouped data. If they have not been summarized in any way, they are called raw data or ungrouped data.
You might ask how a frequency distribution is constructed. The following is your guide.
Construction of Frequency Distribution
The following steps are involved in the construction of a frequency distribution.
(1) Find the range (R) of the raw data: The range is the difference between the largest and the smallest values. That is,
R = (highest value) - (lowest value)
(2) Decide on the number of class intervals (or simply classes), k: There are no hard rules for the number of classes. Walpole (1982) recommended that there should be not less than 5 and not more than 20 classes. Having too few classes leads to wider class intervals, thereby losing much information. On the other hand, having too many classes fails to aggregate the data enough to be useful. Others, like H.A. Sturges (1976), have given a formula for determining the number of classes:
k = 1 + 3.322 log10 N, where N = number of observations
(Note: Round off k to the nearest whole number.)
Example: If the total number of observations is 50, the number of classes would be
k = 1 + 3.322 log10 N
k = 1 + 3.322 log10 50
k = 1 + 3.322 (1.69897)
k = 1 + 5.644
k = 6.644 or 7 classes approximately
(3) Determine the class size or class width, c: This is obtained by dividing the range of the raw data by the number of classes. The result is rounded up to the nearest higher value whose precision is the same as that of the raw data.
$c > \frac{R}{k}$, where R is the range and k is the number of classes
Examples:
(a) Suppose a set of data has 100 observations, with a lowest observation of 20 and a highest observation of 85. Estimate the number of classes k and the class width c.
Solution:
k = 1 + 3.322 log10 N
k = 1 + 3.322 log10 (100)
k = 7.644 or 8 classes approximately
range R = 85 - 20 = 65
c > 65 / 8 = 8.125 ≈ 9 (rounded up with the same precision as the given data)
Thus, there will be 8 classes and the class width is 9.
(b) Suppose the lowest blood potassium level (in milliequivalents per liter) obtained in a study of 40 men is 3.2 and the highest blood potassium level is 5.8. Compute the number of classes k and the class width c.
Solution:
k = 1 + 3.322 log10 N
k = 1 + 3.322 log10 (40)
k = 6.322 or 6 classes approximately
range R = 5.8 - 3.2 = 2.6
c > 2.6 / 6 = 0.43 ≈ 0.5 (rounded up with the same precision as the given data)
Thus, 6 classes or categories of blood potassium levels can be made with a class width of 0.5.
(c) Suppose thirty automobiles were tested for fuel efficiency, in miles per gallon (mpg), and the lowest mpg is 7.55 and the highest mpg is 32.67. Compute the number of classes k and the class width c.
Solution:
k = 1 + 3.322 log10 N
k = 1 + 3.322 log10 (30)
k = 5.91 or 6 classes approximately
range R = 32.67 - 7.55 = 25.12
c > 25.12 / 6 = 4.186 ≈ 4.19 (rounded up with the same precision as the given data)
Thus, 6 classes can be made with a class width of 4.19.
(4) Determine the class limits of the k classes: The starting class limit must be equal to or lower than the lowest value in the raw data. When the lowest class limit has been decided, add the class size to it to get the lower limit of the next class. The remaining lower class limits are determined by adding the class size repeatedly until you reach k classes. The appropriate upper class limits are determined next. The upper limit (UL) of the first class is obtained by subtracting one unit of measure from the lower limit of the next class. The upper limits of the rest of the classes are obtained in a similar fashion, or by adding c to the upper limit of the preceding class. Finally, we check whether the highest observation is contained in the last class. If not, we simply add another class interval.
(5) Tally the observations in the frequency column.
After determining the class limits of the k classes, tally or count the number of observations in each class.
Example: A random sample of 40 Math 31 students was selected and their weights (in kilograms) were recorded as shown below:

Weights (in kg) of Math 31 Students
63 59 43 60 41 53 56 81 50 66
62 52 49 48 52 40 64 64 47 53
47 54 62 56 58 53 50 47 79 70
45 47 46 58 56 55 56 45 73 49

Step 1. Compute the range: R = 81 - 40 = 41
Step 2. Compute the number of classes: k = 1 + 3.322 log10 40 = 6.322. Rounding 6.322 to the nearest whole number gives 6.
Step 3. Compute the class width: c > 41 ÷ 6 = 6.833. Rounding 6.833 up according to the rule for c gives 7, since the data are whole numbers. Thus, 6 classes can be made with a class width of 7 for the given set of data. They are 40-46, 47-53, 54-60, 61-67, 68-74 and 75-81. We then check that the highest observation, 81, is contained in the last class.
Step 4. Tally the number of observations in each class and write them in the frequency column.

Class Limits              Tally                  Frequency
(Weights in kilogram)                            (No. of Observations)
40-46                     |||| - |               6
47-53                     |||| - |||| - ||||     14
54-60                     |||| - ||||            10
61-67                     |||| - |               6
68-74                     ||                     2
75-81                     ||                     2

The class interval 40-46 actually contains all weights ranging from 39.5 to 46.5. Likewise, the interval 47-53 contains the weights from 46.5 to 53.5. These true class limits are called the class boundaries. The class boundaries are 39.5-46.5, 46.5-53.5, 53.5-60.5, 60.5-67.5, 67.5-74.5 and 74.5-81.5. It is important to note that the upper class boundary of a class coincides with the lower class boundary of the next class.
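The construction steps for the weights data can also be traced in a short Python sketch. This is only an illustration under stated assumptions: the data are whole numbers (one unit of measure = 1), the first class starts at the lowest observation, k comes from Sturges' formula, and c is taken as the smallest whole number greater than R/k. The function name `frequency_table` and the dictionary keys are hypothetical choices made for this example, not part of the handout.

```python
import math

# Weights (in kg) of the 40 Math 31 students from the example above.
weights = [
    63, 59, 43, 60, 41, 53, 56, 81, 50, 66,
    62, 52, 49, 48, 52, 40, 64, 64, 47, 53,
    47, 54, 62, 56, 58, 53, 50, 47, 79, 70,
    45, 47, 46, 58, 56, 55, 56, 45, 73, 49,
]

def frequency_table(data):
    """Build a frequency distribution for whole-number data (assumed unit = 1)."""
    r = max(data) - min(data)                     # Step 1: range R
    k = round(1 + 3.322 * math.log10(len(data)))  # Step 2: Sturges' formula
    c = r // k + 1                                # Step 3: smallest whole number > R/k
    rows = []
    lower = min(data)                             # Step 4: start at the lowest value
    for _ in range(k):
        upper = lower + c - 1                     # one unit below the next lower limit
        freq = sum(lower <= x <= upper for x in data)  # Step 5: tally
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),  # half a unit on each side
            "frequency": freq,
        })
        lower += c
    return r, k, c, rows

r, k, c, rows = frequency_table(weights)
print("R =", r, " k =", k, " c =", c)
for row in rows:
    print(row["limits"], row["boundaries"], row["frequency"])
```

Because c is strictly greater than R/k, the k classes built this way always contain the highest observation when the data are whole numbers, so no extra class is needed in this sketch.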
We can compute class boundaries using the following formulas:
Lower Class Boundary (LCB) = LL - ½ * (one unit of measure)
Upper Class Boundary (UCB) = UL + ½ * (one unit of measure)
Examples:

Class Interval    Class Boundaries
50-55             49.5-55.5
56-61             55.5-61.5
(1/2) * (one unit of measure) = 1/2 (1) = 0.5

Class Interval    Class Boundaries
19.6-20.0         19.55-20.05
20.1-20.5         20.05-20.55
(1/2) * (one unit of measure) = 1/2 (0.1) = 0.05

Class Interval    Class Boundaries
1.56-1.65         1.555-1.655
1.66-1.75         1.655-1.755
(1/2) * (one unit of measure) = 1/2 (0.01) = 0.005

Class Mark or Midpoint
The class mark or midpoint is the mean of the lower and upper class limits (or class boundaries), so it divides the class into two equal parts. It is obtained by dividing the sum of the lower and upper class limits (or class boundaries) of a class by 2. That is,
$\bar{x}_i = \frac{LL_i + UL_i}{2}$ or $\bar{x}_i = \frac{LCB_i + UCB_i}{2}$
Example: The class mark or midpoint of the class interval 60-69 is $\frac{60 + 69}{2} = \frac{129}{2} = 64.5$. If we use the class boundaries, $\frac{59.5 + 69.5}{2} = \frac{129}{2} = 64.5$.
Relative Frequency ($Rf_i$)
This is the frequency of a class expressed as a proportion of the total number of observations: $Rf_i = \frac{\text{frequency}}{n}$
Cumulative Frequency ($F_i$)
It is the accumulated frequency of a class, that is, the total number of observations whose values do not exceed the upper limit or boundary of the class.
Example: Frequency Distribution Table of Weights (in kg) of Math 31 Students

Class      Class          Frequency    Class Mark    Relative     Cumulative
Limits     Boundaries                                Frequency    Frequency, Fi
40-46      39.5-46.5       6           43            0.15          6
47-53      46.5-53.5      14           50            0.35         20
54-60      53.5-60.5      10           57            0.25         30
61-67      60.5-67.5       6           64            0.15         36
68-74      67.5-74.5       2           71            0.05         38
75-81      74.5-81.5       2           78            0.05         40

Example: Construct a frequency distribution of the scores of 50 students in a Prelim Exam in Math 1.
Their scores are given below:
23, 50, 38, 42, 63, 75, 12, 33, 26, 39, 35, 47, 43, 52, 56, 59, 64, 77, 15, 21, 51, 54, 72, 68, 36, 65, 52, 60, 27, 34, 47, 48, 55, 58, 59, 62, 51, 48, 50, 41, 57, 65, 54, 43, 56, 44, 30, 46, 67, 53
Solution:
Step 1: Compute the range: R = (highest value) - (lowest value) = 77 - 12 = 65.
Step 2: Estimate the number of classes:
k = 1 + 3.322 log10 N
k = 1 + 3.322 log10 50
k = 1 + 3.322 (1.69897)
k = 1 + 5.644
k = 6.644 or 7 classes approximately
Step 3: Compute the class width:
$c > \frac{R}{k} = \frac{65}{7} = 9.286$, so c ≈ 10
Taking 10 as the lowest class limit, the classes are 10-19, 20-29, 30-39, 40-49, 50-59, 60-69 and 70-79.

Frequency Distribution Table of the Scores of 50 Students in Prelim Exam in Math 1

CLASS    TALLY                      FREQ    CLASS          CLASS MARK           REL. FREQ.      CUMULATIVE
                                            BOUNDARIES                                          FREQUENCY
10-19    II                          2      9.5-19.5       (10+19)/2 = 14.5     2/50 = 0.04     2
20-29    IIII                        4      19.5-29.5      (20+29)/2 = 24.5     4/50 = 0.08     2 + 4 = 6
30-39    IIII-II                     7      29.5-39.5      (30+39)/2 = 34.5     7/50 = 0.14     6 + 7 = 13
40-49    IIII-IIII                  10      39.5-49.5      (40+49)/2 = 44.5     10/50 = 0.20    13 + 10 = 23
50-59    IIII-IIII-IIII-I           16      49.5-59.5      (50+59)/2 = 54.5     16/50 = 0.32    23 + 16 = 39
60-69    IIII-III                    8      59.5-69.5      (60+69)/2 = 64.5     8/50 = 0.16     39 + 8 = 47
70-79    III                         3      69.5-79.5      (70+79)/2 = 74.5     3/50 = 0.06     47 + 3 = 50
TOTAL                               50

C. Graphical Presentation
After the data have been organized into a frequency distribution, they can be presented in graphical form. The purpose of a graph in statistics is to convey the data in pictorial form. It is easier to detect trends, and low and high points, in graphs than in frequency tables. Graphs are also useful for getting the reader's attention in a publication or a presentation. They can be used to discuss an issue, reinforce a critical point, or summarize a data set.
(a) Bar Chart
This is a graph where the different classes are represented by rectangles or bars. The width of the rectangle is the length of the interval, represented by the class limits on the horizontal axis, or by the categories for nominal data.
The length of the rectangle, corresponding to the class frequency, is measured along the vertical axis. For the data on weights, the bar chart is shown below.
[Figure: Bar Chart. Vertical axis: Frequency (0 to 16). Horizontal axis: Weights (in kg) of Math 31 Students, with class limits 40-46, 47-53, 54-60, 61-67, 68-74, 75-81.]
(b) Histogram
This closely resembles the bar chart, with the basic difference that a bar chart uses the class limits for the horizontal axis while the histogram employs the class boundaries. Using the class boundaries eliminates the spaces between rectangles, thus giving the graph a solid appearance.
[Figure: Histogram. Vertical axis: Frequency (0 to 16). Horizontal axis: Weights (in kg) of Math 31 Students, with class boundaries 39.5-46.5, 46.5-53.5, 53.5-60.5, 60.5-67.5, 67.5-74.5, 74.5-81.5.]
(c) Frequency Polygon
It is constructed by plotting the class marks against the frequencies. Straight lines then connect the points formed by the class marks and their corresponding frequencies, together with additional class marks at the beginning and at the end of the distribution.
[Figure: Frequency Polygon. Vertical axis: Frequency (0 to 16). Horizontal axis: Weights (in kg) of Math 31 Students, with class marks 36, 43, 50, 57, 64, 71, 78, 85.]
(d) Frequency Ogive
It represents a cumulative frequency distribution. It is constructed by plotting the class boundaries on the horizontal scale and the cumulative frequency less than the upper class boundary on the vertical scale.
[Figure: Frequency Ogive. Vertical axis: Frequency (0 to 45). Horizontal axis: Weights (in kg) of Math 31 Students, with class boundaries 39.5, 46.5, 53.5, 60.5, 67.5, 74.5, 81.5.]
(e) Pie Chart
This is a circle divided into pie-shaped sections, which look like slices of a pizza. The angle of each sector is proportional to the frequency or relative frequency it represents.
Angle of a sector = Rf_i × 360°
Solution on getting the angle of a sector:
Rf = 5% or 0.05: Angle of a sector = Rf × 360° = 0.05 × 360° = 18°
Rf = 15% or 0.15: Angle of a sector = Rf × 360° = 0.15 × 360° = 54°
Rf = 25% or 0.25: Angle of a sector = Rf × 360° = 0.25 × 360° = 90°
Rf = 35% or 0.35: Angle of a sector = Rf × 360° = 0.35 × 360° = 126°
Exercises 5
1. Find the class boundaries, midpoints, and width for each class interval given.
a. 11-15    b. 17-39    c. 293-353    d. 11.8-14.7    e. 3.13-3.93
   16-20       40-57       354-414       14.8-17.7       3.94-4.74
2. The ages of the signers of the Declaration of Independence of the USA are shown below. Construct a frequency distribution for the data using seven classes. (Source: John W. Wright, ed., The Universal Almanac, Andrews and McMeel, 1994, p.53)
41 39 42 31 53 35 30 34 27 50 50
34 44 60 50 55 50 37 69 42 63 49
39 52 44 48 46 42 45 38 33 70 33
36 60 32 35 45 42 45 62 35 40 54
50 52 27 63 43 34 46 46 39 47 40
3. In a study of 40 women, the following data of blood potassium levels, in milliequivalents per liter, were obtained. Construct a frequency distribution.
3.2 5.8 6.0 4.5 4.2 4.3 2.7 5.1 4.9 5.3
4.7 5.0 3.9 4.9 4.0 5.2 3.8 4.6 4.3 4.4
3.8 5.6 4.2 4.7 3.6 5.1 4.2 5.8 3.7 3.7
4.3 4.5 3.4 3.6 4.4 3.9 4.2 4.1 4.3 4.9
4. For 108 randomly selected college students, the following IQ frequency distribution was obtained. Construct a histogram, frequency polygon, and ogive for the data.

Class limits    Frequency
90-98            6
99-107          22
108-116         43
117-125         28
126-134          9

5. The weights (to the nearest tenth of a kilogram) of 35 students were measured and recorded as follows:
59.2 60.4 58.4 61.4 59.0 61.9 61.9 59.8 61.2 60.2
60.0 61.4 61.2 61.1 61.6 61.5 58.9 62.2 58.4 60.2
65.7 61.7 62.1 60.7 56.3 59.3 60.9 62.4 60.8 62.7
a. Construct a frequency distribution table.
b. Construct a frequency ogive.
6. Thirty AA size batteries were tested to determine how long they lasted.
The results, to the nearest hundredth, were recorded as follows (unit of measurement is 0.01):
   4.23 3.71 4.31 4.0 3.96 3.69 3.77 4.01 3.81 3.72
   3.87 3.89 3.63 3.99 4.10 4.11 4.09 3.91 4.15 4.19
   3.93 3.92 4.05 4.28 3.86 3.94 4.08 3.82 4.22 3.90
   a. Construct a frequency distribution table.
   b. Construct its frequency bar chart and frequency histogram.

7. If the class marks of a frequency distribution of weights of miniature poodles are 5.0, 6.5, 8.0, 9.5 and 11.0 kilograms, find
   a. the class width
   b. the class boundaries
   c. the class limits

8. Thirty automobiles were tested for fuel efficiency, in miles per gallon (mpg). The following frequency distribution was obtained. Construct a histogram, frequency polygon, and bar chart for the data.
   Class boundaries    Frequency
   7.5-12.5            3
   12.5-17.5           5
   17.5-22.5           15
   22.5-27.5           5
   27.5-32.5           2

9. In an insurance company study of the causes of 1000 deaths, the following data were obtained. Construct a pie graph to represent the data.
   Cause of death    Number of deaths
   Heart disease     432
   Cancer            227
   Stroke            93
   Accidents         24
   Other             224

10. Given the frequency ogive:

[Figure: Frequency ogive. Vertical axis: cumulative frequency (0 to 50); horizontal axis: class boundaries 69.5, 74.5, 79.5, 84.5, 89.5, 94.5, 99.5, 104.5, 109.5, 114.5, 119.5, 124.5, 129.5. Point labels appearing in the figure, in increasing order: 0, 1, 4, 6, 10, 14, 20, 30, 38, 44, 47, 49, 50.]

   a. Reconstruct a frequency distribution table.
   b. What is the class width?
   c. What is the total frequency?
   d. Based on the ogive,
      • how many observations were below 99.5?
      • how many were above 99.5?
      • what is the total number of observations?

11. Fill in the missing values in the table.
   Class Interval    Class Boundaries    Class Mark    Frequency    Relative Frequency    Cum. Freq.
   3.5-___           ____ - ____         ____          5            ____                  5
   4.5-___           ____ - ____         ____          9            ____                  ____
   ____ - ____       ____ - ____         ____          15           ____                  ____
   ____ - ____       ____ - ____         ____          6            ____                  ____
   ____ - ____       ____ - ____         ____          3            ____                  38
VII. Summation Notation

Many of the computations in statistics involve a summation of the observed data. In this section, we discuss the notation used and its basic properties.

The summation notation, $\sum_{i=1}^{n} x_i$, read as “the sum of the $x_i$’s where $i$ ranges from 1 to $n$,” is defined as follows:

$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$,

where $i$ is called the index of summation, 1 is the lower limit and $n$ is the upper limit of the summation.

Examples:
a. $\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5$
b. $\sum_{i=1}^{3} (x_i + y_i) = (x_1 + y_1) + (x_2 + y_2) + (x_3 + y_3)$
c. $\sum_{i=1}^{2} 2x_i = 2x_1 + 2x_2$
d. $\sum_{i=1}^{4} 3 = 3 + 3 + 3 + 3 = 3(4) = 12$

Rules of Summation:
a. $\sum_{i=1}^{n} (x_i + y_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i$
b. $\sum_{i=1}^{n} a x_i = a \sum_{i=1}^{n} x_i$, where $a$ is any constant.
c. $\sum_{i=1}^{n} a = na$, where $a$ is any constant.

Example: Given $x_1 = 3$, $x_2 = 4$, $x_3 = 8$, $x_4 = -2$, $y_1 = -6$, $y_2 = -1$, $y_3 = 5$ and $y_4 = 0$, find the value of the following:
a. $\sum_{i=1}^{4} x_i^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 = 3^2 + 4^2 + 8^2 + (-2)^2 = 9 + 16 + 64 + 4 = 93$
b. $\left(\sum_{i=1}^{4} x_i\right)^2 = (x_1 + x_2 + x_3 + x_4)^2 = (3 + 4 + 8 - 2)^2 = (13)^2 = 169$
c. $\sum_{i=1}^{2} x_i y_i = x_1 y_1 + x_2 y_2 = (3)(-6) + (4)(-1) = -18 - 4 = -22$
d. $\sum_{i=3}^{4} (3x_i + y_i) = (3x_3 + y_3) + (3x_4 + y_4) = ((3)(8) + 5) + ((3)(-2) + 0) = 29 - 6 = 23$
   or
   $\sum_{i=3}^{4} (3x_i + y_i) = 3\sum_{i=3}^{4} x_i + \sum_{i=3}^{4} y_i = 3(x_3 + x_4) + (y_3 + y_4) = 3(8 + (-2)) + (5 + 0) = 3(6) + 5 = 18 + 5 = 23$
e. $\sum_{i=1}^{3} (3x_i + 2) = (3x_1 + 2) + (3x_2 + 2) + (3x_3 + 2) = (3(3) + 2) + (3(4) + 2) + (3(8) + 2) = 11 + 14 + 26 = 51$
   or
   $\sum_{i=1}^{3} (3x_i + 2) = 3\sum_{i=1}^{3} x_i + 2(3) = 3(x_1 + x_2 + x_3) + 2(3) = 3(3 + 4 + 8) + 6 = 3(15) + 6 = 45 + 6 = 51$

Exercises 6

Given $x_1 = 3$, $x_2 = 4$, $x_3 = 8$, $x_4 = -2$, $y_1 = -6$, $y_2 = 1$, $y_3 = 5$ and $y_4 = 0$, find the value of the following (□ marks an operator that is illegible in the source):
1. $\sum_{i=1}^{4} x_i^2$
2. $\left(\sum_{i=1}^{4} x_i\right)^2$
3. $\sum_{i=1}^{3} x_i y_i$
4. $\sum_{i=1}^{3} (3x_i \,\square\, y_i)$
5. $\sum_{i=1}^{4} (2x_i \,\square\, y_i)^2$
6. $\left(\sum_{i=1}^{4} x_i^2\right)\left(\sum_{i=1}^{4} y_i\right)$
7. $\sum_{i=1}^{3} (5x_i \,\square\, 10)$
8. $\left(\sum_{i=2}^{4} x_i\right)\left(\sum_{i=2}^{4} y_i\right)$
9. $\left(\sum_{i=1}^{4} y_i\right) + 5$

VIII. Statistical Description of Data

The previous discussion showed how one can gain useful information from raw data by organizing it into a frequency distribution and then presenting the data using various graphs. This chapter shows other statistical methods that can be used to summarize the data.

In this section, we will examine different statistical measures that are computed from a given set of data. Some of these measures are applicable to both numerical and non-numerical (categorical) data, but many are applicable only to numerical data.

Recall that statistical measures can be computed from a sample or from a population. A measure computed from the whole population is called a parameter, while a measure computed from a sample is called a statistic.

A statistic is a characteristic or measure obtained by using the data values from a sample.
A parameter is a characteristic or measure obtained by using all the data values for a specific population.

Computing Statistical Measures of Data

Measures of Central Location (Measures of Average)

The measures of central location describe the center or middle part of a group of data. Here, we will consider the mean, median, mode and weighted mean.

A. The Arithmetic Mean

The mean is the sum of the values divided by the total number of values. This is commonly called the average in layman’s terms. In statistics, all measures of center are also called averages.

For a sample, the mean is denoted by $\bar{x}$ and this statistic is computed as:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$, where $n$ represents the total number of values in the sample.

For a population, the mean is denoted by the Greek letter $\mu$ (read as “mu”) and the parameter is given as:

$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + x_3 + \cdots + x_N}{N}$, where $N$ represents the total number of values in the population.

Examples:
(a) The ages in weeks of six kittens at an animal shelter are 3, 8, 5, 12, 14 and 12. Find the mean.
Solution:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{3 + 8 + 5 + 12 + 14 + 12}{6} = \frac{54}{6} = 9$ weeks.
Thus, the mean age of the kittens is 9 weeks.

(b) The fat contents in grams for one serving of 11 brands of packaged foods, as determined by the U.S. Department of Agriculture, are given as follows. Find the mean.
6.5, 6.5, 9.5, 8.0, 14.0, 8.5, 3.0, 7.5, 16.5, 7.0, 8.0
Solution:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{6.5 + 6.5 + 9.5 + 8.0 + 14.0 + 8.5 + 3.0 + 7.5 + 16.5 + 7.0 + 8.0}{11} = \frac{95}{11} = 8.64$ grams
Thus, the mean fat content for one serving of the 11 brands of packaged foods is 8.64 grams.

Properties of the Mean:
(a) It is unique, meaning it has only one value.
(b) It can be computed for numerical data only, that is, interval or ratio level data.
(c) It is easily affected by extreme values in the data. Thus, one should be cautious in using the mean when there are extreme observations or outliers. If the outlier is extremely low, it pulls the mean down. If the outlier is a very large value, it pulls the mean up. If the mean is greatly affected, then our summary description of the data is distorted.

Example: Suppose we change the values of the data set in Example (b) above. That is, 6.5, 6.5, 9.5, 8.0, 14.0, 8.5, 3.0, 7.5, 0.1, 7.0, 0.2. Then the mean is ∑𝑵 𝒊=𝟏 𝒙𝒊𝟔.𝟓 + 𝟔.𝟓+ 𝟗.𝟓+ 𝟖.
