Data Collection and Confidence Intervals PDF

Summary

This document provides an overview of data collection methods and types, including qualitative and quantitative data. It also covers criteria for data quality, such as validity and reliability, and various data analysis techniques.

Full Transcript

Data Collection

DATA COLLECTION, DATA PROCESSING, CREATION, VALIDATION AND ANALYZING

Data collection

- Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.
- Data collection methods in research methodology refer to the techniques and procedures used to gather information or data for research purposes. These methods are crucial for obtaining reliable and valid data to address research questions or hypotheses.

Types of data: primary versus secondary data

Primary data does not actually exist until and unless it is generated through the research process as part of the consultancy, dissertation, or project. As we shall see, primary data is closely related to, and has implications for, the methods and techniques of data collection. For example, primary data will often be collected through techniques such as experimentation, interviewing, observation, and surveys.

Secondary data, on the other hand, is information which already exists in some form or other but which was not primarily collected, at least initially, for the purpose of the consultancy exercise at hand.

Types of data: qualitative data

- Description and Interpretation: Qualitative data provide descriptive information about qualities, characteristics, meanings, and interpretations rather than numerical measurements.
- Measurement Scale: Qualitative data are not measured on a numerical scale but are categorized or described based on themes, patterns, or concepts.
- Subjectivity: Qualitative data are subjective, reflecting the perspectives, experiences, and interpretations of the participants or researchers.
- Richness and Depth: Qualitative data provide rich and detailed insights into complex phenomena, allowing for a deeper understanding of context, culture, and human experiences.
- Contextual Understanding: Qualitative data are often collected and analyzed within their natural context, emphasizing the importance of understanding the social, cultural, and environmental factors influencing the phenomenon under study.

Methods listed for qualitative data: Interviews, Focus Groups, Observation, Field Notes, Document Analysis, Narrative Inquiry, Diaries or Journals.

Types of data: quantitative data

- Measurement Scale: Quantitative data are measured on a numerical scale. This scale can be continuous (e.g., height, weight) or discrete (e.g., number of siblings).
- Precision: Quantitative data provide precise measurements, allowing for statistical analysis and numerical comparisons.
- Statistical Analysis: Quantitative data lend themselves well to statistical techniques such as means, standard deviations, correlations, and regression analysis.
- Objective Measurement: Quantitative data are typically objective and can be measured independently of the observer's biases.
- Units of Measurement: Quantitative data are often expressed in units, such as meters, kilograms, dollars, or percentages.

Methods listed for quantitative data: Surveys, Experiments, Observations, Tests and Assessments, Biometric Measurements.

Criteria for effective data: data quality

- Validity: Validity relates to the extent to which the data collection method or research method describes or measures what it is supposed to describe or measure.
- Reliability: Reliability relates to the extent to which a particular data collection approach will yield the same results on different occasions.
- Generalizability: Generalizability is essentially another dimension of validity and relates to the extent to which results from data can be generalized to other situations.

Choosing between data collection methods

- Objectives/purpose of research
- Researchers' skills and expertise
- Cost/budgets
- Time availability

Data and Statistics

Data consists of information coming from observations, counts, measurements, or responses.

Statistics is the science (art) of collecting, organizing, analyzing, and interpreting data in order to make decisions.

A population is the collection of all outcomes, responses, measurements, or counts that are of interest. A sample is a subset of a population.

Example: In a recent survey, 250 college students at Union College were asked if they smoked cigarettes regularly. 35 of the students said yes. Identify the population and the sample.

- Population: the responses of all students at Union College.
- Sample: the responses of the students in the survey.

Parameters & Statistics

A parameter is a numerical description of a population characteristic. A statistic is a numerical description of a sample characteristic. In short: parameter goes with population, statistic goes with sample.

Example: Decide whether the numerical value describes a population parameter or a sample statistic.

a.) A recent survey of a sample of 450 college students reported that the average weekly income for students is $325. Because the average of $325 is based on a sample, this is a sample statistic.

b.) The average weekly income for all students is $405. Because the average of $405 is based on a population, this is a population parameter.

DATA CLASSIFICATION

Types of Data

Data sets can consist of two types of data: qualitative data and quantitative data.

- Qualitative data: consists of attributes, labels, or nonnumerical entries.
- Quantitative data: consists of numerical measurements or counts.

Qualitative and Quantitative Data

Example: The grade point averages of five students are listed in the table. Which data are qualitative data and which are quantitative data?

Student   GPA
Sally     3.22
Bob       3.98
Cindy     2.75
Mark      2.24
Kathy     3.84

The student names are qualitative data; the GPAs are quantitative data.

Levels of Measurement

The level of measurement determines which statistical calculations are meaningful.
The four levels of measurement are, from lowest to highest: nominal, ordinal, interval, and ratio.

Nominal Level of Measurement

Data at the nominal level of measurement are qualitative only. Data are categorized using names, labels, or qualities. No mathematical computations can be made at this level.

Examples: colors in the US flag; names of students in your class; textbooks you are using this semester.

Ordinal Level of Measurement

Data at the ordinal level of measurement are qualitative or quantitative. Data are arranged in order, but differences between data entries are not meaningful.

Examples: class standings (freshman, sophomore, junior, senior); numbers on the back of each player's shirt; top 50 songs played on the radio.

Interval Level of Measurement

Data at the interval level of measurement are quantitative. A zero entry simply represents a position on a scale; the entry is not an inherent zero. Data are arranged in order, and the differences between data entries can be calculated.

Examples: temperatures; years on a timeline; Atlanta Braves World Series victories.

Ratio Level of Measurement

Data at the ratio level of measurement are similar to the interval level, but a zero entry is meaningful. A ratio of two data values can be formed, so one data value can be expressed as a multiple of another.

Examples: ages; grade point averages; weights.

Summary of Levels of Measurement

Level of       Put data in   Arrange data   Subtract       Determine if one data value
measurement    categories    in order       data values    is a multiple of another
Nominal        Yes           No             No             No
Ordinal        Yes           Yes            No             No
Interval       Yes           Yes            Yes            No
Ratio          Yes           Yes            Yes            Yes

EXPERIMENTAL DESIGN

Designing a Statistical Study

GUIDELINES

1. Identify the variable(s) of interest (the focus) and the population of the study.
2. Develop a detailed plan for collecting data. If you use a sample, make sure the sample is representative of the population.
3. Collect the data.
4. Describe the data.
5. Interpret the data and make decisions about the population using inferential statistics.
6. Identify any possible errors.

Methods of Data Collection

- In an observational study, a researcher observes and measures characteristics of interest of part of a population.
- In an experiment, a treatment is applied to part of a population, and responses are observed.
- A simulation is the use of a mathematical or physical model to reproduce the conditions of a situation or process.
- A survey is an investigation of one or more characteristics of a population.
- A census is a measurement of an entire population.
- A sampling is a measurement of part of a population.

Random sampling

In random sampling, each member of the population has an equal chance of being selected for the sample. This helps avoid selection bias and tends to produce a representative sample.

- When to select: Random sampling is ideal when you want every member of the population to have an equal chance of being selected for the sample.
- Advantages: Provides a representative sample, minimizes selection bias.
- Disadvantages: May be impractical for large populations, requires a complete list of the population, and can be time-consuming and costly.

Stratified Samples

A stratified sample has members from each segment of a population. This ensures that each segment of the population is represented.

[Figure: a student population divided into freshmen, sophomores, juniors, and seniors.]

- When to select: Stratified sampling is suitable when the population can be divided into meaningful subgroups (strata) based on certain characteristics, and you want to ensure representation from each subgroup.
- Advantages: Ensures representation from all subgroups, reduces sampling variability, allows for comparisons between subgroups.
- Disadvantages: Requires knowledge of population characteristics for proper stratification, may be more complex and time-consuming than simple random sampling.

Cluster Samples

A cluster sample has all members from randomly selected segments of a population. This is used when the population falls into naturally occurring subgroups. All members in each selected group are used.

[Figure: the city of Clarksville divided into city blocks.]

- When to select: Cluster sampling is appropriate when it is impractical or too costly to sample individuals directly, and the population naturally falls into groups or clusters.
- Advantages: More cost-effective than simple random sampling for large populations, can be easier to administer, useful for geographically dispersed populations.
- Disadvantages: May not be as precise as other methods, requires careful consideration of cluster size and selection.

Systematic Samples

A systematic sample is a sample in which each member of the population is assigned a number. A starting number is randomly selected and sample members are selected at regular intervals.

[Figure: every fourth member is chosen.]

- When to select: Systematic sampling is useful when the population is organized in a particular order, such as alphabetically or by time, and you want a random sample without having to list the entire population.
- Advantages: Simple and easy to implement, suitable for large populations, systematic selection may still provide a representative sample.
- Disadvantages: Vulnerable to periodicity if there is a pattern in the population order, may introduce bias if there is a systematic pattern in the list.

Convenience Samples

A convenience sample consists only of available members of the population.

- When to select: Convenience sampling is chosen when quick and easy access to participants is necessary, or when the population of interest is difficult to reach.
- Advantages: Easy and inexpensive to implement, convenient for small-scale studies or preliminary research.
- Disadvantages: Likely to introduce selection bias, may not provide a representative sample.

Identifying the Sampling Technique

Example: You are doing a study to determine the number of years of education each teacher at your college has. Identify the sampling technique used if you select the samples listed.

1.) You randomly select two different departments and survey each teacher in those departments.
2.) You select only the teachers you currently have this semester.
3.) You divide the teachers up according to their department and then choose and survey some teachers in each department.

Answers:

1.) This is a cluster sample because each department is a naturally occurring subdivision.
2.) This is a convenience sample because you are using the teachers that are readily available to you.
3.) This is a stratified sample because the teachers are divided by department and some from each department are randomly selected.

Confidence Intervals

CONFIDENCE INTERVALS FOR THE MEAN (LARGE SAMPLES)

Point Estimate for Population μ

A point estimate is a single value estimate for a population parameter. The most unbiased point estimate of the population mean, \(\mu\), is the sample mean, \(\bar{x}\).

Example: A random sample of 32 textbook prices (rounded to the nearest dollar) is taken from a local college bookstore. Find a point estimate for the population mean, \(\mu\).

34   34   38   45   45   45   45   54
56   65   65   66   67   67   68   74
79   86   87   87   87   88   90   90
94   95   96   98   98   101  110  121

\[ \bar{x} \approx 74.22 \]

The point estimate for the population mean of textbooks in the bookstore is $74.22.

Interval Estimate

An interval estimate is an interval, or range of values, used to estimate a population parameter.
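The point estimate in the textbook example can be checked with a few lines of Python. This is only a minimal sketch: the list `prices` is the 32 values from the example, while the function name `point_estimate` is an illustrative choice and not part of the slides.

```python
from statistics import mean

# The 32 textbook prices (rounded to the nearest dollar) from the example.
prices = [34, 34, 38, 45, 45, 45, 45, 54,
          56, 65, 65, 66, 67, 67, 68, 74,
          79, 86, 87, 87, 87, 88, 90, 90,
          94, 95, 96, 98, 98, 101, 110, 121]


def point_estimate(sample):
    """Return the sample mean, used as the point estimate of the population mean."""
    return mean(sample)


x_bar = point_estimate(prices)
print(f"n = {len(prices)}, point estimate = {x_bar:.2f}")  # n = 32, point estimate = 74.22
```

The sample mean is a single number, so on its own it says nothing about how far it may lie from the population mean; that is what the interval estimate addresses.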
[Figure: number line showing the point estimate 74.22 for the textbook prices inside an interval estimate.]

How confident do we want to be that the interval estimate contains the population mean, \(\mu\)?

Level of Confidence

The level of confidence \(c\) is the probability that the interval estimate contains the population parameter. \(c\) is the area beneath the normal curve between the critical values \(-z_c\) and \(z_c\). The remaining area in the tails is \(1 - c\), so each tail has area \(\tfrac{1}{2}(1 - c)\). Use the Standard Normal Table to find the corresponding z-scores.

Common Levels of Confidence

If the level of confidence is 90%, this means that we are 90% confident that the interval contains the population mean, \(\mu\). The area in each tail is 0.05, and the corresponding z-scores are \(\pm 1.645\).

If the level of confidence is 95%, this means that we are 95% confident that the interval contains the population mean, \(\mu\). The area in each tail is 0.025, and the corresponding z-scores are \(\pm 1.96\).

If the level of confidence is 99%, this means that we are 99% confident that the interval contains the population mean, \(\mu\). The area in each tail is 0.005, and the corresponding z-scores are \(\pm 2.575\).

Margin of Error

The difference between the point estimate and the actual population parameter value is called the sampling error. When \(\mu\) is estimated, the sampling error is the difference \(\mu - \bar{x}\). Since \(\mu\) is usually unknown, the maximum value for the error can be calculated using the level of confidence.

Given a level of confidence, the margin of error (sometimes called the maximum error of estimate or error tolerance) \(E\) is the greatest possible distance between the point estimate and the value of the parameter it is estimating.

\[ E = z_c \sigma_{\bar{x}} = z_c \frac{\sigma}{\sqrt{n}} \]

When \(n \ge 30\), the sample standard deviation, \(s\), can be used for \(\sigma\).

Example: A random sample of 32 textbook prices is taken from a local college bookstore.
The mean of the sample is \(\bar{x} = 74.22\), and the sample standard deviation is \(s = 23.44\). Use a 95% confidence level and find the margin of error for the mean price of all textbooks in the bookstore.

Since \(n \ge 30\), \(s\) can be substituted for \(\sigma\).

\[ E = z_c \frac{\sigma}{\sqrt{n}} \approx 1.96 \cdot \frac{23.44}{\sqrt{32}} \approx 8.12 \]

At the 95% confidence level, the margin of error for the population mean (all the textbooks in the bookstore) is about $8.12.

Confidence Intervals for μ

A c-confidence interval for the population mean \(\mu\) is

\[ \bar{x} - E < \mu < \bar{x} + E \]

The probability that the confidence interval contains \(\mu\) is \(c\).

Example: A random sample of 32 textbook prices is taken from a local college bookstore. The mean of the sample is \(\bar{x} = 74.22\), the sample standard deviation is \(s = 23.44\), and the margin of error is \(E = 8.12\). Construct a 95% confidence interval for the mean price of all textbooks in the bookstore.

Left endpoint: \(\bar{x} - E = 74.22 - 8.12 = 66.10\)
Right endpoint: \(\bar{x} + E = 74.22 + 8.12 = 82.34\)

With 95% confidence we can say that the mean cost of all textbooks in the bookstore is between $66.10 and $82.34.

Finding Confidence Intervals for μ

Finding a Confidence Interval for a Population Mean (\(n \ge 30\), or \(\sigma\) known with a normally distributed population)

1. Find the sample statistics \(n\) and \(\bar{x}\), where \(\bar{x} = \frac{\sum x}{n}\).
2. Specify \(\sigma\), if known. Otherwise, if \(n \ge 30\), find the sample standard deviation \(s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}\) and use it as an estimate for \(\sigma\).
3. Find the critical value \(z_c\) that corresponds to the given level of confidence. Use the Standard Normal Table.
4. Find the margin of error \(E = z_c \frac{\sigma}{\sqrt{n}}\).
5. Find the left endpoint \(\bar{x} - E\) and the right endpoint \(\bar{x} + E\), and form the confidence interval.
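The steps for a large-sample interval can be sketched in Python as follows. This is an illustration under stated assumptions rather than the slides' own procedure: the function name `z_interval` is hypothetical, and the critical value comes from `statistics.NormalDist().inv_cdf` instead of the Standard Normal Table, so it uses 1.95996 rather than the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(x_bar, sigma, n, c):
    """Return (E, left, right) for a c-confidence interval for the mean.

    sigma is the population standard deviation; when n >= 30 the sample
    standard deviation s may be passed in its place.
    """
    # Critical value z_c: leaves area (1 - c) / 2 in each tail.
    z_c = NormalDist().inv_cdf((1 + c) / 2)
    E = z_c * sigma / sqrt(n)  # margin of error
    return E, x_bar - E, x_bar + E


# Textbook example: x_bar = 74.22, s = 23.44, n = 32, c = 0.95.
E, left, right = z_interval(74.22, 23.44, 32, 0.95)
print(f"E = {E:.2f}, interval = ({left:.2f}, {right:.2f})")
# E = 8.12, interval = (66.10, 82.34)
```

Passing `c` rather than a hard-coded z-score keeps the sketch usable for the 90% and 99% levels as well.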
The resulting interval is \(\bar{x} - E < \mu < \bar{x} + E\).

Confidence Intervals for μ (σ Known)

Example: A random sample of 25 students had a grade point average with a mean of 2.86. Past studies have shown that the standard deviation is 0.15 and the population is normally distributed. Construct a 90% confidence interval for the population mean grade point average.

\(n = 25\), \(\bar{x} = 2.86\), \(\sigma = 0.15\), \(z_c = 1.645\)

\[ E = z_c \frac{\sigma}{\sqrt{n}} = 1.645 \cdot \frac{0.15}{\sqrt{25}} \approx 0.05 \]

\[ \bar{x} \pm E = 2.86 \pm 0.05, \qquad 2.81 < \mu < 2.91 \]

With 90% confidence we can say that the mean grade point average for all students in the population is between 2.81 and 2.91.

Sample Size

Given a c-confidence level and a maximum error of estimate, \(E\), the minimum sample size \(n\) needed to estimate \(\mu\), the population mean, is

\[ n = \left( \frac{z_c \sigma}{E} \right)^2 \]

If \(\sigma\) is unknown, you can estimate it using \(s\), provided you have a preliminary sample with at least 30 members.

Example: You want to estimate the mean price of all the textbooks in the college bookstore. How many books must be included in your sample if you want to be 99% confident that the sample mean is within $5 of the population mean?

\(\bar{x} = 74.22\), \(\sigma \approx s = 23.44\), \(z_c = 2.575\)

\[ n = \left( \frac{z_c \sigma}{E} \right)^2 \approx \left( \frac{2.575 \cdot 23.44}{5} \right)^2 \approx 145.7 \]

(Always round up.) You should include at least 146 books in your sample.

CONFIDENCE INTERVALS FOR THE MEAN (SMALL SAMPLES)

The t-Distribution

When a sample size is less than 30, and the random variable \(x\) is approximately normally distributed, it follows a t-distribution.

\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]

Properties of the t-distribution

1. The t-distribution is bell shaped and symmetric about the mean.
2. The t-distribution is a family of curves, each determined by a parameter called the degrees of freedom.
The degrees of freedom are the number of free choices left after a sample statistic such as \(\bar{x}\) is calculated. When you use a t-distribution to estimate a population mean, the degrees of freedom are equal to one less than the sample size: \(\text{d.f.} = n - 1\).
3. The total area under a t-curve is 1 or 100%.
4. The mean, median, and mode of the t-distribution are equal to zero.
5. As the degrees of freedom increase, the t-distribution approaches the normal distribution. After 30 d.f., the t-distribution is very close to the standard normal z-distribution. The tails in the t-distribution are "thicker" than those in the standard normal distribution.

[Figure: t-curves for d.f. = 2 and d.f. = 5 compared with the standard normal curve, all centered at t = 0.]

Critical Values of t

Example: Find the critical value \(t_c\) for a 95% confidence level when the sample size is 5.

Appendix B: Table 5: t-Distribution

Level of confidence, c    0.50     0.80     0.90     0.95     0.98
One tail, α               0.25     0.10     0.05     0.025    0.01
Two tails, α              0.50     0.20     0.10     0.05     0.02
d.f. = 1                  1.000    3.078    6.314    12.706   31.821
d.f. = 2                  0.816    1.886    2.920    4.303    6.965
d.f. = 3                  0.765    1.638    2.353    3.182    4.541
d.f. = 4                  0.741    1.533    2.132    2.776    3.747
d.f. = 5                  0.727    1.476    2.015    2.571    3.365

\(\text{d.f.} = n - 1 = 5 - 1 = 4\), \(c = 0.95\), \(t_c = 2.776\)

95% of the area under the t-distribution curve with 4 degrees of freedom lies between \(t = \pm 2.776\).

Confidence Intervals and t-Distributions

Constructing a Confidence Interval for the Mean: t-Distribution

1. Identify the sample statistics \(n\), \(\bar{x}\), and \(s\), where \(\bar{x} = \frac{\sum x}{n}\) and \(s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}\).
2. Identify the degrees of freedom \(\text{d.f.} = n - 1\), the level of confidence \(c\), and the critical value \(t_c\).
3. Find the margin of error \(E = t_c \frac{s}{\sqrt{n}}\).
4. Find the left and right endpoints and form the confidence interval.
The left endpoint is \(\bar{x} - E\), the right endpoint is \(\bar{x} + E\), and the interval is \(\bar{x} - E < \mu < \bar{x} + E\).

Constructing a Confidence Interval

Example: In a random sample of 20 customers at a local fast food restaurant, the mean waiting time to order is 95 seconds, and the standard deviation is 21 seconds. Assume the wait times are normally distributed and construct a 90% confidence interval for the mean wait time of all customers.

\(n = 20\), \(\bar{x} = 95\), \(s = 21\), \(\text{d.f.} = 19\), \(t_c = 1.729\)

\[ E = t_c \frac{s}{\sqrt{n}} = 1.729 \cdot \frac{21}{\sqrt{20}} \approx 8.1 \]

\[ \bar{x} \pm E = 95 \pm 8.1, \qquad 86.9 < \mu < 103.1 \]

We are 90% confident that the mean wait time for all customers is between 86.9 and 103.1 seconds.

Normal or t-Distribution?

Example: Determine whether to use the normal distribution, the t-distribution, or neither.

a.) \(n = 50\), the distribution is skewed, \(s = 2.5\).
The normal distribution would be used because the sample size is 50.

b.) \(n = 25\), the distribution is skewed, \(s = 52.9\).
Neither distribution would be used because \(n < 30\) and the distribution is skewed.

c.) \(n = 25\), the distribution is normal, \(\sigma = 4.12\).
The normal distribution would be used because although \(n < 30\), the population standard deviation is known.

CONFIDENCE INTERVALS FOR POPULATION PROPORTIONS

Point Estimate for Population p

The probability of success in a single trial of a binomial experiment is \(p\). This probability is a population proportion. The point estimate for \(p\), the population proportion of successes, is given by the proportion of successes in a sample and is denoted by

\[ \hat{p} = \frac{x}{n} \]

where \(x\) is the number of successes in the sample and \(n\) is the number in the sample. The point estimate for the proportion of failures is \(\hat{q} = 1 - \hat{p}\). The symbols \(\hat{p}\) and \(\hat{q}\) are read as "p hat" and "q hat."

Example: In a survey of 1250 US adults, 450 of them said that their favorite sport to watch is baseball. Find a point estimate for the population proportion of US adults who say their favorite sport to watch is baseball.
\(n = 1250\), \(x = 450\)

\[ \hat{p} = \frac{x}{n} = \frac{450}{1250} = 0.36 \]

The point estimate for the proportion of US adults who say baseball is their favorite sport to watch is 0.36, or 36%.

Confidence Intervals for p

A c-confidence interval for the population proportion \(p\) is

\[ \hat{p} - E < p < \hat{p} + E, \qquad \text{where } E = z_c \sqrt{\frac{\hat{p}\hat{q}}{n}}. \]

The probability that the confidence interval contains \(p\) is \(c\).

Example: Construct a 90% confidence interval for the proportion of US adults who say baseball is their favorite sport to watch.

\(n = 1250\), \(x = 450\), \(\hat{p} = 0.36\), \(\hat{q} = 0.64\)

\[ E = z_c \sqrt{\frac{\hat{p}\hat{q}}{n}} = 1.645 \sqrt{\frac{(0.36)(0.64)}{1250}} \approx 0.022 \]

Left endpoint: \(\hat{p} - E = 0.36 - 0.022 = 0.338\)
Right endpoint: \(\hat{p} + E = 0.36 + 0.022 = 0.382\)

With 90% confidence we can say that the proportion of all US adults who say baseball is their favorite sport to watch is between 33.8% and 38.2%.

Finding Confidence Intervals for p

Constructing a Confidence Interval for a Population Proportion

1. Identify the sample statistics \(n\) and \(x\).
2. Find the point estimate \(\hat{p} = \frac{x}{n}\).
3. Verify that the sampling distribution can be approximated by the normal distribution: \(n\hat{p} \ge 5\) and \(n\hat{q} \ge 5\).
4. Find the critical value \(z_c\) that corresponds to the given level of confidence. Use the Standard Normal Table.
5. Find the margin of error \(E = z_c \sqrt{\frac{\hat{p}\hat{q}}{n}}\).
6. Find the left endpoint \(\hat{p} - E\) and the right endpoint \(\hat{p} + E\), and form the confidence interval \(\hat{p} - E < p < \hat{p} + E\).

Sample Size

Given a c-confidence level and a margin of error, \(E\), the minimum sample size \(n\) needed to estimate \(p\) is

\[ n = \hat{p}\hat{q} \left( \frac{z_c}{E} \right)^2 \]

This formula assumes you have an estimate for \(\hat{p}\) and \(\hat{q}\). If not, use \(\hat{p} = 0.5\) and \(\hat{q} = 0.5\).

Example: You wish to find out, with 95% confidence and within 2% of the true population proportion, the proportion of US adults who say that baseball is their favorite sport to watch.
With \(\hat{p} = 0.36\) and \(\hat{q} = 0.64\) from the earlier sample (\(n = 1250\), \(x = 450\)), \(z_c = 1.96\), and \(E = 0.02\):

\[ n = \hat{p}\hat{q} \left( \frac{z_c}{E} \right)^2 = (0.36)(0.64) \left( \frac{1.96}{0.02} \right)^2 \approx 2212.8 \]

(Always round up.) You should sample at least 2213 adults to be 95% confident.
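The proportion formulas above can be sketched in Python to check the baseball example. The function names (`critical_z`, `proportion_interval`, `proportion_sample_size`) are hypothetical, chosen only for this illustration. The critical value is computed with `statistics.NormalDist().inv_cdf` rather than read from the Standard Normal Table, so it is slightly more precise than the rounded 1.645 and 1.96.

```python
from math import ceil, sqrt
from statistics import NormalDist


def critical_z(c):
    """Critical value z_c that leaves area (1 - c) / 2 in each tail."""
    return NormalDist().inv_cdf((1 + c) / 2)


def proportion_interval(x, n, c):
    """Return (left, right) endpoints of a c-confidence interval for p."""
    p_hat = x / n
    q_hat = 1 - p_hat
    # Normal approximation check from step 3 of the guidelines.
    if n * p_hat < 5 or n * q_hat < 5:
        raise ValueError("normal approximation requires n*p_hat >= 5 and n*q_hat >= 5")
    E = critical_z(c) * sqrt(p_hat * q_hat / n)  # margin of error
    return p_hat - E, p_hat + E


def proportion_sample_size(E, c, p_hat=0.5):
    """Minimum n to estimate p within E; p_hat defaults to 0.5 when unknown."""
    q_hat = 1 - p_hat
    return ceil(p_hat * q_hat * (critical_z(c) / E) ** 2)  # always round up


left, right = proportion_interval(450, 1250, 0.90)
n_min = proportion_sample_size(0.02, 0.95, p_hat=0.36)
print(f"90% interval: ({left:.3f}, {right:.3f})")  # 90% interval: (0.338, 0.382)
print(f"minimum sample size: {n_min}")             # minimum sample size: 2213
```

With no preliminary estimate, calling `proportion_sample_size(0.02, 0.95)` falls back to \(\hat{p} = \hat{q} = 0.5\), which gives the largest (most conservative) sample size for that margin of error.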
