SOR2250 Sampling I Chapter 1 PDF
University of Malta
Dr Fiona Sammut
Summary
This chapter introduces the problem of sampling in research, focusing on defining the problem and exploring different contexts within a sampling design. It highlights the importance of clear problem definitions and the role of statistical consideration in sampling methodologies.
SOR2250 – Sampling I
Dr Fiona Sammut

CHAPTER 1 THE SAMPLING PROBLEM

1.1 DESCRIPTION OF THE PROBLEM

Consider the five problems listed below:

- Computing the average weekly expenditure of a Maltese family during October.
- Calculating the average velocity of oxygen molecules in a 10 cc spherical glass container kept at a temperature of 0 °C and a pressure of 2 hPa.
- Working out the proportion of Maltese adults who make use of the computer.
- Estimating the total yield in kg from potatoes planted in Mgarr in autumn.
- Getting accurate estimates of the number of Maltese children suffering from asthma.

Solving the problems above is not impossible but is more problematic than it may seem at first. On closer examination, it may in fact be noticed that there is a lack of detail in the description of the situations of interest.

For example, in the first situation, on which type of family will the study focus? Will it be a family with children, without children, with a particular socioeconomic status, or living in a specific area of Malta? Apart from that, in which October will the study be carried out: October of this year or of the next?

With regard to the other four situations:

- Which glass container shall be used?
- Are any specific age groups, socioeconomic statuses or levels of education to be considered when choosing the adults who will be interviewed?
- Is a particular type of potato, grown using particular fertilizers and a particular type of soil, of interest?
- Which children?

So, before getting to the stage of how to solve a problem, we should focus on properly defining the problem. The definition of a problem may also involve time. A question which should be posed is: should we limit ourselves to a particular point in time in our study, or would it be more fruitful to consider a longitudinal study (taking observations over a period of time)?

Apart from the above, to solve a problem we will also have to consider how we are going to make the individual measurements on the objects/subjects of interest. Clearly, we cannot always treat subjects/objects exhaustively, so that each time a survey is carried out in, say, the Maltese Islands, each and every Maltese person is interviewed. The expense and time would be prohibitive in such a case. Aside from that, we cannot always expect to be able to access each and every individual.

In some cases, like the second problem listed, exhaustive treatment is not even possible, since we will be dealing with theoretically infinite worlds. In general, we are concerned with the problem below:

Given a large finite population or an infinite theoretical population, how can we make measurements, inferences and decisions based on operations which will take finite time and physical resources to achieve?

Sampling is the answer to the problem. By reducing measurements to a finite number of controlled, carefully chosen cases, we may provide fairly accurate approximate answers. We will have to:

- make sure that a sample is selected such that it is as representative as possible of the population
- design our measuring methods properly and hence use them efficiently and effectively
- perform the correct analyses
- present the results in a suitable and precise form to appeal to the appropriate audience

We can have no foolproof guarantees of the quality of the answers we obtain, but by following the above points, we will try to get the best solutions under the given circumstances.

With the abundance of poorly done surveys, it is not surprising that some persons might still be skeptical of surveys. However, one thing which might not be so obvious is that estimates based on samples are often more accurate than those based on a census.
This happens because a census often requires a large administrative organization and involves many people in the data collection. With the administrative complexity and the pressure to produce timely estimates, errors can easily occur. When using a sample, especially if the personnel involved are well trained, more attention can be given to data quality, for example by following up on non-respondents.

1.2 FROM DISTRUST TO WIDE USE

Because of military, political, economic and insurance-related problems, interest in population censuses and projections, as well as in risk estimation with reference to human life and property, goes back to Biblical times. To mention a few censuses and records which were carried out even before the year 1700:

- The emperor Yao had taken a census of the population in China in the year 2238 B.C.
- In the year 762 Charlemagne asked for detailed descriptions of church-owned properties.
- Around 1086, William the Conqueror ordered the writing of the Domesday Book, a record of the ownership, extent and value of the lands of England - England’s first statistical abstract.
- Because of Henry VIII’s fear of the plague, England began to register its dead in 1532.
- Around the same time, French law required the clergy to register baptisms, deaths and marriages.
- During an outbreak of the plague in the late 1500s, the English government started publishing weekly death statistics. This practice continued and by 1632 these Bills of Mortality listed births and deaths by gender.

The distrust in sampling by politicians, administrators and statisticians was however deep and common up to the beginning of the 20th century. It is only since the 1940s and 1950s that sampling moved from acceptability to respectability.
A strong theoretical basis, developments in sampling methodology, the availability of software, and the reduction in costs and collection time that sampling brings about without serious loss of accuracy have turned sampling into a big, flourishing and reliable business. Sampling is also a respectable instrument of measurement in circumstances which can be difficult, unmanageable or theoretically infinite.

Surveys may be either descriptive or analytic. The descriptive survey gives a number of cross-sectional pictures of various aspects through the display of basic descriptive statistics, usually involving percentages, measures of location and measures of variation. The analytical survey tends to be more investigative in nature and encompasses sets of exercises involving the formulation of statistical hypotheses whose validity is checked statistically.

Nowadays, consumers of and contributors to sampling include, amongst others, people working in market research, within industry, in government agencies and departments, academics and media researchers.

In market research, for example, sampling is carried out regularly by many businesses who would like to know how their products are viewed by their customers, as well as to estimate the viability of certain products. In industry, sampling techniques are used for quality control and for monitoring the efficiency of internal operations.

Government agencies like our National Statistics Office (NSO) conduct surveys on health, lifestyle, transport, environment, education, agriculture, employment, etc. Local governments are increasingly commissioning sample surveys to help them gather information about current problems and future plans.

Sampling is also used extensively in television and radio audience audits, which help the media and advertising firms assess potential customers for their products.

A further application of sampling lies in political opinion polls. Statisticians can predict the outcome of an election hours before the long process of exhaustive counting of votes has ended.

Sampling has been directly responsible for generating enough interest to help sustain a steady growth in statistical theory. The theoretical and practical contributions from the academic world, involving areas like sociology, media studies, economics, psychology, political science and the earth sciences, have been instrumental in improving existing techniques and making them more popular. National statistics offices have also contributed their fair share through considerable, worldwide efforts.

1.3 ROLE OF STATISTICAL THEORY IN SAMPLING

The typical sampling problem is that of selecting $n$ elements from a population of $N$ elements. This selection can be considered in 3 different contexts: without replacement with order important, without replacement with order not important, and with replacement. The number of samples which may be obtained in each of the 3 mentioned contexts is, respectively, ${}^{N}P_{n}$, ${}^{N}C_{n}$ and $N^{n}$.

If we were to consider a sample of size 200 being selected from a population of size 300,000, then there would be:

- $2.49 \times 10^{1095}$ samples when sampling is carried out without replacement and order is considered to be important
- $3.15 \times 10^{720}$ samples when sampling is carried out without replacement and order is considered not to be important
- $2.66 \times 10^{1095}$ samples when sampling is carried out with replacement

You may notice how large each of these numbers is. It is important to understand that once we take a sample from our population of interest, it will be just one sample out of the many possible samples which we might have selected under the particular context. Obviously, we would like our sample to be a good reflection of the population being considered. How can statistical theory help us here?
When statistical theory is applied to the sampling problem:

- It gives us an appreciation of the size of our major domain of interest, that of all possible samples. This puts emphasis on imposing clarity in the formulation of the problem.
- By giving probabilities, it allows us to decide on the appropriate sampling design to use (for example, one method may give each person the same probability of selection, while another may give each person a probability of selection that depends on some characteristic; more detail on sampling designs is given later in the notes).
- It enables us to quantify accuracy in measurement.
- It helps us in bringing about efficiency in estimation.

1.4 SELECTION PROBABILITIES

The procedure by which a sample of units is selected from a finite population is called the sampling design. With most well-known sampling designs, the design is determined by assigning a probability $P(S_i)$ to each sample $S_i$, where $P(S_i)$ is the probability that the particular sample $S_i$ has of being selected. So, if $S$ is the set of all possible samples, the sampling design chosen defines a function $P : S \to [0, 1]$.

The simplest design to use is Simple Random Sampling, where for a population of size $N$ and a sample of size $n$, the probability of selection of each sample is the same and is equal to $\frac{1}{N^{n}}$, $\frac{1}{{}^{N}P_{n}}$ or $\frac{1}{{}^{N}C_{n}}$, depending on whether sampling is carried out with or without replacement and whether order is important or not.

Each possible sample to be selected involves the selection of $n$ subjects/objects from the population. By assigning a probability of selection to each sample, the procedure also gives every element in the population a nonzero probability of selection. This latter probability of selection is called the inclusion probability and, for the $i$th element, shall be denoted by $\pi_i$.
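
The sample counts behind these selection probabilities can be checked numerically. The sketch below is only an illustration, assuming Python 3.8 or later (for `math.perm` and `math.comb`); the helper `sci` is a hypothetical name introduced here to print very large integers in scientific notation. It uses the population size 300,000 and sample size 200 from Section 1.3.

```python
import math

N, n = 300_000, 200  # population and sample sizes from the example in Section 1.3

# Number of possible samples under the three contexts.
counts = {
    "with replacement": N ** n,
    "without replacement, order important": math.perm(N, n),
    "without replacement, order not important": math.comb(N, n),
}


def sci(x: int) -> str:
    """Format a huge positive integer as 'm.mm x 10^e' using its decimal digits."""
    digits = str(x)
    exponent = len(digits) - 1
    mantissa = int(digits[:6]) / 10 ** 5  # leading six digits as d.ddddd
    return f"{mantissa:.2f} x 10^{exponent}"


for context, count in counts.items():
    # Under simple random sampling every sample has probability 1 / count.
    print(f"{context}: {sci(count)} possible samples")
```

Each sample is then selected with probability one over the corresponding count, which shows how small the chance of drawing any one particular sample is.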

Example: Suppose that a population is made up of elements A, B and C and that a sample of size 2 is to be selected using simple random sampling.

Possible samples that may be selected if replacement is allowed are: AA, AB, AC, BA, BB, BC, CA, CB, CC.

Possible samples that may be selected if replacement is not allowed and order is considered important are: AB, AC, BA, BC, CA, CB.

Possible samples that may be selected if replacement is not allowed and order is considered not to be important are: AB, AC, BC.

So the probabilities of choosing one sample of size 2 from the population made up of elements A, B, C are respectively $\frac{1}{9}$, $\frac{1}{6}$ and $\frac{1}{3}$.

Suppose now that we wish to find the inclusion probability of A, that is, the probability of choosing element A in our sample of size 2.

Under sampling with replacement, the element A shows up if sample AA, AB, AC, BA or CA is selected, and it shows up twice in AA. Counting AA once for each of its two positions gives $\frac{6}{9} = \frac{2}{3}$, which is the expected number of times A is drawn. The probability that A appears at least once is $\frac{5}{9}$.

Under sampling without replacement with order important, the element A shows up if sample AB, AC, BA or CA is selected. So the probability of selecting element A is $\frac{4}{6} = \frac{2}{3}$.

Under sampling without replacement with order not important, the element A shows up if sample AB or AC is selected. So the probability of selecting element A is $\frac{2}{3}$.

It may be noted that, under simple random sampling without replacement, the resulting inclusion probability is the same whether order is considered to be important or not, and that the same value is obtained under sampling with replacement for the expected number of times the element is drawn.

The probability that an individual coming from a population of size $N$ will be included in a sample of size $n$, given that the sampling design being used is simple random sampling without replacement, is in fact always equal to $\frac{n}{N}$. So $\pi_i = \frac{n}{N}$.

On considering inclusion probabilities, it is important to understand that the element of interest can be obtained at any position in the sample.
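
The three-element example can be verified by brute-force enumeration. The following is a minimal sketch using Python's standard `itertools` module; the function names `inclusion_probability` and `expected_appearances` are hypothetical helpers introduced for illustration. It treats every sample within a context as equally likely and distinguishes between an element appearing at least once and the average number of times it appears, which differ only when replacement is allowed (sample AA contains A twice).

```python
from fractions import Fraction
from itertools import combinations, permutations, product

population = ["A", "B", "C"]
n = 2  # sample size

# All possible samples under the three contexts.
designs = {
    "with replacement": list(product(population, repeat=n)),
    "without replacement, order important": list(permutations(population, n)),
    "without replacement, order not important": list(combinations(population, n)),
}


def inclusion_probability(samples, element):
    """Share of equally likely samples containing the element at least once."""
    hits = sum(1 for s in samples if element in s)
    return Fraction(hits, len(samples))


def expected_appearances(samples, element):
    """Average number of times the element occurs per equally likely sample."""
    total = sum(s.count(element) for s in samples)
    return Fraction(total, len(samples))


for name, samples in designs.items():
    print(name, len(samples),
          inclusion_probability(samples, "A"),
          expected_appearances(samples, "A"))
```

Without replacement both quantities equal 2/3, in agreement with $\frac{n}{N}$. With replacement the expected number of appearances is 2/3, while the share of samples containing A at least once is 5/9.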

For simple random sampling with replacement, the probability that a given element results in position 1 is equal to $\frac{1}{N}$, the probability that it results in position 2 is also equal to $\frac{1}{N}$, ..., and the probability that it results in position $n$ (the size of the sample) is also equal to $\frac{1}{N}$. Adding over the positions gives $\frac{1}{N} + \dots + \frac{1}{N} = \frac{n}{N}$. Since with replacement the same element may occupy more than one position, these events are not mutually exclusive, so this sum is the expected number of times the element appears in the sample; the probability that it appears at least once is $1 - \left(1 - \frac{1}{N}\right)^{n}$, which is close to $\frac{n}{N}$ when $n$ is small relative to $N$.

Under simple random sampling without replacement, the probability of selecting one particular element in the sample can be calculated by counting the ways of choosing the remaining $n - 1$ elements of the sample from the other $N - 1$ elements of the population, that is:

Order important (the element may occupy any of the $n$ positions):

$$\frac{n \cdot {}^{N-1}P_{n-1}}{{}^{N}P_{n}} = \frac{n \, \dfrac{(N-1)!}{(N-n)!}}{\dfrac{N!}{(N-n)!}} = \frac{n}{N}.$$

Order not important:

$$\frac{{}^{N-1}C_{n-1}}{{}^{N}C_{n}} = \frac{\dfrac{(N-1)!}{(N-n)!\,(n-1)!}}{\dfrac{N!}{(N-n)!\,n!}} = \frac{n}{N}.$$

Example: In a similar manner, the probability that two individuals coming from a population of size $N$ will both be included in a sample of size $n$, given that the sampling design being used is simple random sampling without replacement, may be found to be

$$\frac{{}^{N-2}C_{n-2}}{{}^{N}C_{n}} = \frac{n(n-1) \cdot {}^{N-2}P_{n-2}}{{}^{N}P_{n}} = \frac{n(n-1)}{N(N-1)}.$$

More detail on simple random sampling will be given in Chapter 2.

1.5 ESTIMATORS AND THEIR PROPERTIES

If we were to obtain observations on all the $N$ objects/subjects of the population, the population characteristic of interest would be known exactly.

Having to resort to sampled data, the usual inference problem in sampling is that of trying to obtain estimates of summary characteristics of the population, like means, variances, medians, ratios and totals, based on sampled observations. After all, the purpose of taking a sample is to obtain information about the population.

The formulas used to compute estimates are called statistics. Random variables used in such formulas are called the estimators.
Examples of estimators that are used under simple random sampling are:

- the statistic $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$, where the formula for $\bar{X}$ works out the sample mean, the latter being an estimator of the population mean.
- the statistic $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2$, where the formula for $S^2$ works out the sample variance, the latter being an estimator of the population variance.

Gathering observations on only part of the population gives rise to uncertainty associated with the sample being selected. So, one would like to be able to assess the accuracy or confidence associated with the estimates obtained, usually by means of a confidence interval.

If, for every possible sample, the estimates are quite close to one another and to the true value of the population characteristic, there is little uncertainty associated with the sampling design and estimation method being used. Such a strategy is desirable. If, however, the estimates obtained from the different samples vary greatly from one another and from the true value of the population characteristic, considerable uncertainty is associated with the implemented strategy.

It is only through a careful selection of a sampling design and estimation method that one can obtain desirable estimates.

The major (and desirable) properties of an estimator are:

- unbiasedness (the expected value over all possible samples that might be selected according to some design equals the actual population value)
- efficiency (in the sense of minimal variance)
- consistency (as the sample size increases, the probability that the estimator will give values far away from the true value goes to zero)
- sufficiency (utilization of all relevant information)
- robustness (departures from model assumptions do not adversely affect the performance of the estimator)

Deriving results of use within this area is mathematically not easy. Not only is the setting abstract, but the expressions and formulas involved are usually algebraically complex.
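
Unbiasedness can be illustrated without heavy algebra by enumerating every possible sample from a very small population. The sketch below is an illustration only: the population values are hypothetical, sampling is simple random sampling without replacement with order not important, and the population variance is assumed to be defined with divisor $N - 1$. Exact fractions are used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population values, chosen only for illustration.
population = [Fraction(v) for v in (2, 4, 6, 9)]
n = 2  # sample size


def mean(xs):
    """Arithmetic mean of a sequence of numbers."""
    return sum(xs) / len(xs)


def variance(xs):
    """Sum of squared deviations from the mean, divided by (length - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


# Every possible sample of size n, each equally likely under the design.
samples = list(combinations(population, n))

# Expected value of each estimator = average over all equally likely samples.
avg_of_sample_means = mean([mean(s) for s in samples])
avg_of_sample_variances = mean([variance(s) for s in samples])

print(avg_of_sample_means, mean(population))
print(avg_of_sample_variances, variance(population))
```

In this toy setting the average of the sample means over all samples coincides with the population mean, and the average of the sample variances coincides with the population variance computed with divisor $N - 1$.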

Simplification is obtained by assuming that:

- the population involved is infinite (to allow for the assumption of independence between the observations obtained from the sampled persons)
- the underlying population distribution is normal

1.6 GENERAL SAMPLING NOTATION

Some general sampling notation follows:

Observation unit: A subject/object on which a measurement is taken, also referred to as an element. In studying human populations, the observation units are individuals.

Target population: The complete collection of objects/subjects under investigation, on whom/which we would like to obtain information and parameter estimates. It may be a group of individuals, a harvest of potatoes, a batch of light bulbs, households in a particular area, factories producing high tech products, etc.

Sample: A subset of a population of interest.

Sampled population: The collection of objects/subjects from which the sample is taken. Ideally, the sampled population is identical to the target population.

Sampling unit: The unit we can actually sample. We may, for example, wish to study individuals but the list of all individuals in the target population is not available. Instead, the list of households might be available. The households will then serve as our sampling units and the observation units are the individuals living inside the households.

Sampling frame: The list of sampling units to which the sampling selection scheme is applied, to gain observational access to the finite population of interest.
The sampling frame might be a list of:

- all residential telephone numbers in the city, for a telephone survey
- street addresses, for personal interviews
- all farms, or a map of the area containing farms, for an agricultural survey
- university students’ email addresses, for an online questionnaire delivered to university students
- the electoral register, for a survey to be carried out amongst adults

The sampled population is usually smaller than the target population, leading to undercoverage. In particular, focusing on surveys carried out amongst individuals, lack of resources or accessibility may lead to some persons in the target population being left out of the sampling frame. Apart from that, a number of persons will not respond to the survey (they refuse to respond by choice, are not capable of responding, or are not reachable).

On the other hand, should persons not in the target population be included in the sampled population, we will have overcoverage. If a person in the target population is listed in the frame more than once, we will have a duplicate listing.

Undercoverage, overcoverage and duplicate listings are undesirable and cause frame imperfection.

1.7 SELECTION BIAS

Selection bias occurs when some part of the target population is not in the sampled population. Undercoverage is one source of selection bias. Other sources of selection bias are:

- Deliberately or purposefully selecting a sample that will confirm a prior opinion. For example, we would like to estimate the average amount of money a 4-person Maltese family spends on groceries per week and we sample shoppers who look like they have spent an average amount.
- Misspecifying the target population.
- Substituting a designated unit in the sample (which might not be readily available or accessible) with a more convenient unit of the population.
- Failing to obtain responses from all units in the chosen sample. Nonresponse may even distort the results of surveys that are carefully planned to minimize other sources of selection bias.
- Allowing the sample to consist of volunteers. Such is the case with television polls. Some persons may phone more than once and there may be some organization which will monopolize the calls.

1.8 MEASUREMENT BIAS

Measurement bias occurs when the information collected for use as a study variable is inaccurate. In particular, focusing on surveys carried out amongst individuals, these inaccuracies may arise because:

- a person may not always tell the truth
- a person may not always understand the questions
- a person may forget
- a person may say what he/she thinks that the interviewer wants to hear
- a person may say what he/she thinks will impress the interviewer
- a particular interviewer may affect the response given by misreading questions, recording responses inaccurately or leading the respondent to answer in a specific way
- certain words mean different things to different people
- question wording and order have a large effect on the responses obtained

In some cases, accuracy may be increased by using a questionnaire which has been carefully designed.

1.9 QUESTIONNAIRE DESIGN

Some general guidelines which should be followed when designing a questionnaire are:

- Decide what you want to find out
- Keep it simple and clear
- Use specific questions instead of general ones
- Relate your questions to the concept of interest
- Avoid questions that prompt or motivate the respondent to say what you would like to hear
- Decide whether to use open or closed questions (in a closed question the respondent has to choose from a set of categories; in an open question the respondent is not prompted with any categories for response)
- Pay attention to question-order effects
- Ask about only one concept in each question
- Always test your questions before taking a survey

1.10 THE SAMPLE SURVEY

The various stages of work involved in conducting a survey are:

1. The objectives of the study are clearly defined and communicated to the persons involved.
2. The target population or subpopulations are clearly defined and a precise distinction is made as to whether the population of interest is finite, infinite, fixed in time, dynamic, potential or theoretical.
3. Resources are assessed (which external resources will be needed, which in-house resources can be made use of).
4. A decision is taken as to whether the survey should be limited to purely descriptive tasks or whether deeper analysis will be required after the data has been collected.
5. The specification of variables, characteristics and parameters, as well as their method of measurement and calibration, is agreed upon.
6. The manner in which the data is to be collected is decided (home visits, telephone, email, etc.).
7. The level of confidence and the confidence interval are established so that the sample size necessary to obtain the desired precision is determined.
8. The sampling frame is constructed or else acquired.
9. The questionnaire is prepared.
10. The field work is organized; interviewers are selected and interviewer assignments are delineated.
11. A pretest is conducted and the results are evaluated. Should it be necessary, adjustments have to be made at this stage.
12. The organization of the field work is put into action whilst the sample is being selected, with active supervision and swift intervention if required.
13. The required number of valid responses is obtained and handed to the data entry persons. Coding and data entry are carried out.
14. Data is made ready for analysis: the observed data is cleaned. Contact with respondents is renewed to get clarification if necessary. Imputation of the data (substitution of good artificial values for missing values) is performed.
15. Summary statistics, together with point estimates of the major variables of interest, supplemented with the degree of precision of the estimates, are calculated using statistical software packages and communicated to the relevant persons.
16. If required, in-depth analyses like comparisons between subgroups of a population, correlation and regression are carried out.
17. Preferably two or three reports, intended for different audiences with increasing levels of statistical knowledge, are completed. A general declaration of the conditions under which the survey was carried out should be included.
18. Results are presented in the appropriate format and at the right time.
19. Archiving of the accumulated data is effected once negotiation with the clients asking for the survey has been successfully terminated.

1.11 OTHER POINTS REGARDING SURVEYS

Surveys can vary tremendously in size, complexity, purpose and means of access. This suggests caution when one tries to generalize.

The purpose of a large number of surveys is usually to obtain summary statistics, but this is not always the case. Surveys may also be used to help in decision making, to test hypotheses and to understand certain situations better.