Sampling and Bias 2020-2020 PDF
Summary
This document provides an introduction to sampling techniques and different types of bias that can occur in data collection. It discusses how to identify sampling frames and samples and methods to acquire good samples for improved data analysis. Examples of different sampling methods and bias are included.
Full Transcript
SAMPLING AND BIAS

Learning Outcomes:
• Define a sample.
• Given a description of a data collection, identify the sampling frame, the sample, the parameter of interest, and the sampling design.
• Explain the difference between bias and random/sampling error.
• Given a description of a data collection, identify the source of bias and explain its likely direction, i.e., its effect on the estimation of the parameter of interest.
• Distinguish between sampling designs, and explain the advantages and disadvantages of each.

A sampling frame, also called the target population, is the set of elements from which the sample is actually taken. Ideally, the sampling frame should be identical to the population of interest. In practice, however, this is not always possible.

Examples:

1. You need to select a sample of 100 Vancouver residents. To do this you pick one hundred names from the telephone directory. Here, the population of interest is all Vancouver residents, and you are using the phone directory as a sampling frame. Note that the two are not identical. Some residents may not be in the directory: people with unlisted numbers, the homeless, the incarcerated, etc. Conversely, some names listed in the directory are not part of the population of interest, e.g., people who have moved away.

2. Now, instead of using the phone directory, suppose you plan to obtain your sample of 100 Vancouver residents by standing at the corner of Main & 49th Ave., approaching everyone who passes by, and selecting the first 100 persons willing to participate in your study. Here, the sampling frame consists of all persons who walk by where you will be standing. Again, the sampling frame is not identical to the population of interest. Many Vancouver residents do not frequent the corner of Main and 49th. Conversely, some who frequent the area may not be residents (they could be tourists).

3. The college administration wishes to survey 200 currently registered students.
The sample is obtained by selecting 200 names from the registrar’s list of students. Here, the population of interest is all registered students, and the sampling frame is the registrar’s list of students. In this case the two are identical.

When we cannot find a sampling frame that is identical to the population of interest, we should try to find one that is as close to it, or as ‘representative’, as possible.

Sampling is the process by which we select, from the sampling frame, the set of elements on which we will actually gather data (i.e., the sample).

Good Samples And Bad Samples:

A good sample is one that is representative of the entire population, i.e., one that is similar to the population of interest in terms of the characteristics of interest. E.g., if we want to select a sample of Langara students as part of a study on academic performance, then we should strive to select a sample of students that is similar to the student body as a whole in terms of academic performance.

A bad sample is one that is biased. A sample is biased if it systematically over- or under-represents a segment of the population with a distinct characteristic of interest.

Examples:

1. In order to estimate per-capita spending on food, a student decides to survey ten of her closest friends. Here, the population of interest is all adults in Vancouver (or BC, or any other region). The resulting sample will be biased because it is most likely made up exclusively of students. It will systematically underrepresent working people, seniors, and stay-at-home parents, all of whom will probably spend more on food than students. It is important to note that the groups that are underrepresented differ from those that are overrepresented in terms of the characteristic of interest (per-capita spending on food).

2. A city engineer wishes to conduct a waste audit. He plans to choose the location by standing before a map of the city, closing his eyes, and haphazardly touching a spot on the map with his finger.
Wherever his finger lands, that’s where he will conduct the waste audit. In situations like this, it is well documented that people are more likely to point to a spot near the centre of the map. The sample obtained will systematically underrepresent areas near the perimeter of the city, which may have a different quantity and kind of waste compared to inner-city areas.

3. Suppose you want to find out the average GPA of the students in your class. You hand out questionnaires to everyone, but out of 30 students only 20 return them. The 20 students who make up your sample are likely to have higher GPAs than the 10 who chose not to respond. The sensitive nature of the question results in a systematic underrepresentation of students with lower GPAs. This type of bias is called non-response bias. It occurs in surveys when those who respond tend to differ from those who do not in terms of the characteristics of interest (in this example, GPA). Not all non-response results in bias.

The above are examples of selection bias: bias that occurs because the sample is not representative of the population in terms of the characteristics of interest. Selection bias may occur due to an improper sampling frame (example 1 above), a non-random sampling design (example 2 above), or non-response (example 3 above).

Response bias:

A response bias occurs in a survey when inaccurate responses tend to favour a certain outcome. Example: suppose that in the GPA example above you make it mandatory to respond. Everyone has to fill in and return the questionnaire. The result may still be biased if those who have a low GPA tend to report a higher one. Note that inaccurate responses do not result in bias if they are equally likely to be too high or too low. Bias occurs when the inaccuracy systematically favours one outcome.
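
The GPA examples above can be illustrated with a small simulation. The Python sketch below is only an illustration under stated assumptions: the 30 GPAs are made up, and the mechanisms (the 10 lowest-GPA students not responding, students below 2.5 overstating their GPA by 0.5, and symmetric reporting errors of 0.3) are hypothetical choices, not taken from any real survey.

```python
import random
import statistics


def make_class(n_students=30, seed=42):
    """Return made-up GPAs (hypothetical data) for one class."""
    rng = random.Random(seed)
    return [round(rng.uniform(1.5, 4.0), 2) for _ in range(n_students)]


def nonresponse_mean(gpas, n_missing=10):
    """Assumed mechanism: the n_missing lowest-GPA students do not respond."""
    respondents = sorted(gpas)[n_missing:]
    return statistics.mean(respondents)


def inflated_mean(gpas, cutoff=2.5, bump=0.5):
    """Assumed mechanism: everyone responds, but students below the
    cutoff report a GPA that is `bump` points too high."""
    return statistics.mean([g + bump if g < cutoff else g for g in gpas])


def noisy_mean(gpas, error=0.3, seed=7):
    """Inaccurate but unbiased responses: each report is equally likely
    to be `error` too high or `error` too low."""
    rng = random.Random(seed)
    return statistics.mean([g + rng.choice([-error, error]) for g in gpas])


if __name__ == "__main__":
    gpas = make_class()
    print(f"true class mean         : {statistics.mean(gpas):.2f}")
    print(f"non-response (20 of 30) : {nonresponse_mean(gpas):.2f}")
    print(f"response bias (inflated): {inflated_mean(gpas):.2f}")
    print(f"symmetric errors        : {noisy_mean(gpas):.2f}")
```

Under these assumptions the first two estimates overshoot the true class mean, because the inaccuracy always pushes in the same direction. The symmetric errors partly cancel, so that estimate can land on either side of the true mean.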
To summarize, the different sources of bias are:
• Selection bias:
   o Due to a biased sampling frame
   o Due to non-random sampling
• Non-response bias
• Response bias

Note that response and non-response bias occur only when conducting a survey (i.e., when the data are solicited from people). When dealing with non-human subjects, we do not usually have to worry about non-response or response bias.

SAMPLING DESIGN (PLAN)

Random Samples:

A sample is said to be random if it is chosen in such a way that every element in the sampling frame has an equal chance of being selected.

The four basic random samples:
• Simple random sample (SRS)
• Stratified random sample
• Cluster (multistage) random sample
• Systematic random sample

Simple Random Sample (SRS):

A sample is said to be an SRS if it is chosen in such a way that every subset of the same size of the sampling frame has an equal chance of being the sample.

Example: In Lotto 6/49 the six winning numbers are randomly selected from the numbers 1 to 49. Here we can think of the numbers 1 to 49 as both the population of interest and the sampling frame. The winning combination is selected in such a way that it can be any subset of size 6 (e.g., {1, 2, 3, 4, 5, 6} or {4, 7, 18, 40, 41, 49}) with equal likelihood.

Note that every SRS is a random sample, but not every random sample is an SRS.

Stratified Random Sample:

If the target population (sampling frame) can be partitioned (or already is), we can select a stratified random sample by first taking an SRS in each partition (called a stratum) and then combining all selected elements into one sample.

Example: An accounting firm has two types of clients, individual and corporate. The firm can take a stratified random sample of its clients by selecting an SRS from among its individual clients and an SRS from among its corporate clients, and then combining the two SRSs. Here, type of client (individual or corporate) is used as a stratifying variable.
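
The SRS and stratified designs described above can be sketched in Python as follows. This is a minimal sketch, not a prescribed procedure: the client IDs, the list sizes (80 individual and 20 corporate clients) and the per-stratum sample sizes (8 and 2) are hypothetical, since the text does not say how many clients to draw from each stratum.

```python
import random


def simple_random_sample(frame, n, rng):
    """Draw n distinct elements so that every subset of size n
    is equally likely to be chosen."""
    return rng.sample(frame, n)


def stratified_random_sample(strata, sizes, rng):
    """Take an SRS inside every stratum, then combine the results.

    strata: dict mapping stratum name -> list of elements
    sizes:  dict mapping stratum name -> SRS size for that stratum
    """
    sample = []
    for name, members in strata.items():
        sample.extend(simple_random_sample(members, sizes[name], rng))
    return sample


if __name__ == "__main__":
    rng = random.Random(2020)  # fixed seed, only so the run is repeatable

    # Lotto 6/49: the sampling frame is the numbers 1 to 49.
    print(sorted(simple_random_sample(list(range(1, 50)), 6, rng)))

    # Hypothetical client lists for the accounting firm.
    strata = {
        "individual": [f"I{i:03d}" for i in range(1, 81)],
        "corporate": [f"C{i:03d}" for i in range(1, 21)],
    }
    sizes = {"individual": 8, "corporate": 2}  # assumed allocation
    print(stratified_random_sample(strata, sizes, rng))
```

Whatever the random draw, the stratified sample always contains exactly 8 individual and 2 corporate clients. A single SRS of 10 from the combined list would not guarantee that split.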
The sampling frame must first be partitioned into the two types of clients.

Cluster (Multistage) Random Sample:

Like a stratified random sample, a cluster random sample can be selected if the sampling frame can be partitioned into two or more segments (called clusters). The sample is then selected by first randomly selecting one or more clusters, and then taking an SRS or a census in the selected clusters. For this reason, the cluster random sample is also called a multistage sample: first we select the clusters, and then the elements within the selected clusters.

Example: You want to select a sample of 100 elementary school students in Vancouver. A cluster sample may be selected as follows: randomly select one elementary school in Vancouver, and then select an SRS of 100 from that school. (If the school has fewer than 100 students, you can randomly select two schools, and then take an SRS of 50 from each school.) Each school constitutes a cluster.

Note: contrast this with a stratified random sample. In both designs, the sampling frame is partitioned into two or more segments. But with a stratified random sample we take an SRS in each segment (called a stratum); with a cluster sample we take an SRS only in the segments (called clusters) that have first been randomly selected.

Systematic Random Sample:

If the elements in the sampling frame can be arranged in sequence (from first to last), a systematic random sample, known as a 1-in-k sample, can be selected by taking an SRS of size 1 from the first k elements, and then selecting every kth element that follows.

Example: An outdoor club has 500 members. A members’ list (in alphabetical order) is available. You wish to take a random sample of 50 members (i.e., 1 in 10). Using the members’ list as a sampling frame, you can select a systematic random sample as follows: randomly select 1 member from the first 10 on the list. Say you get member no. 7. The sample will then consist of every 10th member starting from member no. 7, i.e., members no.
7, 17, 27, …, and 497.

The systematic random sample is very convenient in many situations. It is particularly useful when sampling a moving or dynamic population, for example, people streaming out of a movie theater.

The systematic random sample should not be used when the sampling frame is cyclical in terms of the characteristic of interest. Example: We wish to estimate monthly sales. We have records of 120 monthly sales figures, from January 1996 to December 2005. A systematic random sample of size 10 is a 1-in-12 sample, so we will end up with sales data from the same month in each of the 10 years (all Januarys, or all Februarys, etc.).

When to use which sampling design:
• The stratified random sample is the most costly. It is most appropriate when elements in different strata have different characteristics of interest. (Why?)
• The cluster random sample is the least expensive. It is most appropriate when elements in different clusters have similar characteristics of interest. If this is not the case, a cluster sample can lead to larger error.
• When elements in the sampling frame can be arranged in a sequence, a systematic random sample is often desirable, provided that the sequence is not cyclical with respect to the characteristics of interest.
• When the sampling frame cannot be partitioned or is not in sequence, an SRS is often the best choice.

Combination of Designs:

In practice, a combination of more than one sampling design is often used. Example: An accounting firm has two lists of clients: individual and corporate. To sample 40 clients, it takes a systematic random sample of 20 from the list of corporate clients, and another systematic random sample of 20 from the list of individual clients. The 40 clients selected in this way constitute a systematic-within-stratified random sample.

Non-random sampling:

In practice, a random sample is not always possible. When this is the case, researchers must take non-random samples, such as a judgment or convenience sample.
A judgment sample is one that is subjectively judged to be representative. A convenience sample is one that consists of the elements from which it is most convenient to obtain data. Although non-random sampling is not scientific, and its results cannot be generalized to the population of interest, it is often useful in the exploratory stage of scientific research.

Sampling Error:

Even when we have no bias, the sample statistic obtained will most likely be different from the population parameter we are trying to estimate. This discrepancy is called sampling error. It is different from bias in that it is random: it is equally likely to result in an overestimation or an underestimation. Sampling error tends to decrease as sample size increases. (Bias, on the other hand, cannot be decreased by increasing sample size.)
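
The contrast between sampling error and bias can be checked with a small simulation. The Python sketch below rests on assumptions made only for illustration: a made-up population of 10,000 values, and a hypothetical biased sampling frame that leaves out the 2,000 largest values. For each sample size it draws 200 simple random samples from the full population and from the biased frame.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def sample_means(frame, n, reps, rng):
    """Means of `reps` independent simple random samples of size n."""
    return [mean(rng.sample(frame, n)) for _ in range(reps)]


def run_demo(sizes=(25, 100, 400, 1600), reps=200, seed=1):
    rng = random.Random(seed)

    # Hypothetical population of interest (made-up values).
    population = [rng.gauss(50, 10) for _ in range(10_000)]
    true_mean = mean(population)

    # Assumed biased frame: the 2,000 largest values can never be selected.
    biased_frame = sorted(population)[:8_000]

    rows = []
    for n in sizes:
        fair = sample_means(population, n, reps, rng)
        biased = sample_means(biased_frame, n, reps, rng)
        rows.append({
            "n": n,
            # average size of the random error when there is no bias
            "typical_error": mean([abs(m - true_mean) for m in fair]),
            # average over- or under-estimation caused by the biased frame
            "bias": mean(biased) - true_mean,
        })
    return rows


if __name__ == "__main__":
    for row in run_demo():
        print(f"n={row['n']:5d}  typical sampling error={row['typical_error']:.2f}"
              f"  bias from bad frame={row['bias']:.2f}")
```

In this sketch the typical sampling error shrinks as n grows, while the bias from the incomplete frame stays at about the same negative value for every sample size.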