Statistics and Probability Notes PDF

Summary

This document provides notes on statistics and probability. It covers variables and types of data, levels of measurement, sampling techniques and data collection, descriptive measures, probability and probability distributions, estimation and hypothesis testing, correlation and regression, chi-square tests, analysis of variance, and non-parametric tests.

Full Transcript

Statistics and Probability Notes

The Nature of Probability and Statistics
A variable is a characteristic or attribute that can assume different values. Data are the values (measurements or observations) that the variables can assume. Variables whose values are determined by chance are called random variables. A collection of data values forms a data set. Each value in the data set is called a data value or a datum. A population consists of all subjects (humans or otherwise) that are being studied. A sample is a group of subjects selected from a population.

Two Areas of Statistics
Descriptive statistics consists of the collection, organization, summarization, and presentation of data. Inferential statistics consists of generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions. The area of inferential statistics called hypothesis testing is a decision-making process for evaluating claims about a population, based on information obtained from samples.

Variables and Types of Data
Variables can be classified as qualitative or quantitative. Qualitative variables are variables that have distinct categories according to some characteristic or attribute. Quantitative variables are variables that can be counted or measured.
o Discrete variables assume values that can be counted.
o Continuous variables can assume an infinite number of values between any two specific values. They are obtained by measuring. They often include fractions and decimals.

Classification of Variables according to Level of Measurement
The nominal level of measurement classifies data into mutually exclusive (nonoverlapping) categories in which no order or ranking can be imposed on the data. The ordinal level of measurement classifies data into categories that can be ranked; however, precise differences between the ranks do not exist. The interval level of measurement ranks data, and precise differences between units of measure do exist; however, there is no meaningful zero. The ratio level of measurement possesses all the characteristics of interval measurement, and in addition there exists a true zero.

Sampling Techniques
Probability Sampling - procedure wherein every element of the population is given a nonzero chance of being selected in the sample. Nonprobability Sampling - procedure wherein not all the elements in the population are given a chance of being included in the sample.

Types of Probability Sampling
A random sample is a sample in which all members of the population have an equal chance of being selected. This can be done by random selection methods, such as drawing names from a hat or using random number generators. Systematic sampling - subjects are selected by using every kth individual from a population. The first individual selected corresponds to a random number between 1 and k. For example, every 10th person on a list is selected, starting from a randomly chosen point. A stratified sample is a sample obtained by dividing the population into subgroups or strata according to some characteristic relevant to the study. (There can be several subgroups.) A random sample is then taken from each stratum, usually in proportion to the size of the stratum within the population. A cluster sample is obtained by dividing the population into sections or clusters, then selecting one or more clusters and using all members in the cluster(s) as the members of the sample.
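The probability sampling methods above can be sketched in a few lines of code. A minimal Python illustration, assuming a hypothetical population stored as a pandas DataFrame; the column names, cluster scheme, and sample sizes are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical population of 1,000 subjects with an illustrative "year_level" stratum.
population = pd.DataFrame({
    "id": np.arange(1000),
    "year_level": rng.choice(["1st", "2nd", "3rd", "4th"], size=1000),
})

# Simple random sample: every member has an equal chance of selection.
srs = population.sample(n=50, random_state=42)

# Systematic sample: pick a random start, then take every kth subject.
k = len(population) // 50
start = rng.integers(0, k)
systematic = population.iloc[start::k]

# Stratified sample: sample from each stratum in proportion to its size.
stratified = (
    population.groupby("year_level", group_keys=False)
    .apply(lambda g: g.sample(frac=0.05, random_state=42))
)

# Cluster sample: divide the population into clusters, select whole clusters at random.
population["cluster"] = population["id"] % 20          # 20 artificial clusters
chosen_clusters = rng.choice(population["cluster"].unique(), size=3, replace=False)
cluster_sample = population[population["cluster"].isin(chosen_clusters)]

print(len(srs), len(systematic), len(stratified), len(cluster_sample))
```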
Types of Nonprobability Sampling
Purposive sampling is a method where the participants are selected by the researcher subjectively. The researcher uses their judgment to select participants who they believe are most suitable for the study. This method is often used when studying a specific group or characteristic. Convenience sampling is a method of data collection that uses population members who are readily available. Samples are chosen based on what is easiest or most convenient for the researcher, for example, selecting individuals who are readily available or easy to access. Quota sampling is the nonprobability equivalent of stratified sampling: the population is first divided into groups, and then convenience or judgment sampling is used to select the required number of subjects from each group. The researcher ensures that certain characteristics of the population are represented in the sample by setting quotas. For example, a certain number of males and females are chosen to ensure gender representation, though selection within those groups is non-random. Snowball sampling is a special nonprobability method used when the desired sample characteristic is rare. Snowball sampling relies on referrals from initial subjects to generate additional subjects and is used when it is difficult to find participants. Existing participants recruit future participants from among their acquaintances, creating a "snowball" effect. This method is often used in studies involving hard-to-reach populations, like drug users or individuals in marginalized groups.

Methods of Data Collection
A survey is a data collection method that involves asking a set of standardized questions to a specific group of people. Surveys can be administered in various forms, such as questionnaires or interviews, and are designed to collect information on opinions, behaviors, experiences, or characteristics.
o Examples: Online surveys, telephone interviews, face-to-face interviews, or mailed questionnaires.
Observation involves collecting data by directly watching and recording the behavior or characteristics of individuals or phenomena in their natural or controlled environment. It can be structured (with specific guidelines) or unstructured (open-ended, free-form).
o Examples: Observing children in a classroom, monitoring traffic patterns, or studying animal behavior in the wild.
Use of existing records (secondary data) involves using already available data collected by other organizations, researchers, or institutions for a different purpose. Researchers analyze this secondary data to draw new conclusions or answer new research questions.
o Examples: Analyzing census data, reviewing academic research papers, using government reports, or referencing company financial records.
Simulation involves using computer models or mathematical algorithms to replicate real-world processes or systems in order to study their behavior under different scenarios or conditions. It is a method of data collection that generates data by mimicking reality.
o Examples: Simulating climate change, market conditions, or the spread of a disease in epidemiology.
An experiment is a controlled study where the researcher manipulates one or more variables (independent variables) to observe the effect on another variable (dependent variable), while controlling for other influencing factors. It is used to establish cause-and-effect relationships.
o Examples: Clinical trials in medicine, testing the effectiveness of a new teaching method, or a laboratory experiment to observe chemical reactions.
Types of Study
In an observational study, the researcher merely observes what is happening or what has happened in the past and tries to draw conclusions based on these observations. In an experimental study, the researcher manipulates one of the variables and tries to determine how the manipulation influences other variables.
o The independent variable in an experimental study is the one that is being manipulated by the researcher. The independent variable is also called the explanatory variable. The resultant variable is called the dependent variable or the outcome variable.

Elements of a Well-designed Experiment
1. Control Group. A group in the experiment that does not receive the treatment or experimental condition. It serves as a baseline to compare against the experimental group. The control group helps to isolate the effects of the independent variable.
2. Randomization. The process of randomly assigning participants to either the control group or the experimental group. Randomization helps to reduce bias and ensures that any differences between groups are due to the treatment and not other factors. (A random-assignment sketch appears at the end of this section.)
3. Replication. The use of a sufficient number of experimental units, so that randomization creates groups that resemble each other closely and so that the chances of detecting any differences among the treatments are increased.

Factors that Could Affect an Experiment
1. Confounding Variables. Confounding variables are factors other than the independent variable that could affect the dependent variable. A well-designed experiment controls for these variables, either by holding them constant or accounting for their influence through statistical methods. Confounding occurs when an experimenter cannot separate the effects of different factors on a variable.
2. Placebo Effect. The placebo effect occurs when a subject reacts favorably to a placebo when in fact the subject has been given no medicated treatment at all. To help control or minimize the placebo effect, a technique called blinding can be used. Blinding is a technique used to prevent bias. In a single-blind experiment, participants are unaware of whether they are in the experimental or control group. In a double-blind experiment, both the participants and the researchers conducting the study are unaware of which group participants are in. Blinding helps eliminate placebo effects and researcher bias.
3. Hawthorne Effect. The Hawthorne effect refers to the phenomenon where individuals modify their behavior in response to being observed or knowing they are part of an experiment. Named after a series of studies conducted in the 1920s and 1930s at the Hawthorne Works plant in Chicago, the effect suggests that participants may improve their performance or change their actions simply because they are aware that they are being studied, not necessarily because of any experimental manipulation. Example: In a workplace study, if employees know their performance is being monitored, they may work harder or follow procedures more closely than they would under normal conditions, skewing the results of the study. To minimize the Hawthorne effect, blinding may also be used, where participants are unaware of being observed or unaware of the exact focus of the study.
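As referenced above, random assignment of participants to groups can be sketched in a few lines. A minimal Python illustration; the participant IDs and group sizes are hypothetical:

```python
import random

random.seed(7)

# Hypothetical participant IDs.
participants = [f"P{i:02d}" for i in range(1, 21)]

# Randomly shuffle, then split into control and experimental groups of equal size.
random.shuffle(participants)
half = len(participants) // 2
control_group = participants[:half]
experimental_group = participants[half:]

print("Control:     ", sorted(control_group))
print("Experimental:", sorted(experimental_group))
```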
Types of Graphs/Charts
The histogram is a graph that displays the data by using contiguous vertical bars (unless the frequency of a class is 0) of various heights to represent the frequencies of the classes. The frequency polygon is a graph that displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes. The frequencies are represented by the heights of the points. The ogive is a graph that represents the cumulative frequencies for the classes in a frequency distribution. A bar graph represents the data by using vertical or horizontal bars whose heights or lengths represent the frequencies of the data. A Pareto chart is used to represent a frequency distribution for a categorical variable, and the frequencies are displayed by the heights of vertical bars, which are arranged in order from highest to lowest. A time series graph represents data that occur over a specific period of time. A pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution. A dotplot is a statistical graph in which each data value is plotted as a point (dot) above the horizontal axis. A stem and leaf plot is a data plot that uses part of the data value as the stem and part of the data value as the leaf to form groups or classes.

Summarizing Data
A statistic is a characteristic or measure obtained by using the data values from a sample. A parameter is a characteristic or measure obtained by using all the data values from a specific population.

Measures of Central Tendency
The mean is the sum of the data values, divided by the total number of values. The median is the midpoint of the data array. The symbol for the median is MD. The value that occurs most often in a data set is called the mode. The midrange is defined as the sum of the lowest and highest values in the data set, divided by 2.

Properties and Uses of Central Tendency
The Mean
1. The mean is found by using all the values of the data.
2. The mean varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples.
3. The mean is used in computing other statistics, such as the variance.
4. The mean for the data set is unique and not necessarily one of the data values.
5. The mean cannot be computed for the data in a frequency distribution that has an open-ended class.
6. The mean is affected by extremely high or low values, called outliers, and may not be the appropriate average to use in these situations.
The Median
1. The median is used to find the center or middle value of a data set.
2. The median is used when it is necessary to find out whether the data values fall into the upper half or lower half of the distribution.
3. The median is used for an open-ended distribution.
4. The median is affected less than the mean by extremely high or extremely low values.
The Mode
1. The mode is used when the most typical case is desired.
2. The mode is the easiest average to compute.
3. The mode can be used when the data are nominal or categorical, such as religious preference, gender, or political affiliation.
4. The mode is not always unique. A data set can have more than one mode, or the mode may not exist for a data set.
The Midrange
The midrange is easy to compute. The midrange gives the midpoint. The midrange is affected by extremely high or low values in a data set.

Other Types of Means
Harmonic Mean. The harmonic mean (HM) is defined as the number of values divided by the sum of the reciprocals of each value.
The formula is HM = n / (1/X1 + 1/X2 + ... + 1/Xn), that is, the number of values divided by the sum of the reciprocals of the values. The harmonic mean is useful for finding the average speed.

Geometric Mean. The geometric mean is the average value or mean that, by taking the root of the product of the values, displays the central tendency of a set of numbers or data. The geometric mean (GM) is defined as the nth root of the product of n values. The formula is GM = (X1 · X2 · ... · Xn)^(1/n). The geometric mean is useful in finding the average of percentages, ratios, indexes, or growth rates. The geometric mean formula for grouped data is GM = (X1^f1 · X2^f2 · ... · Xk^fk)^(1/n) or, equivalently, GM = antilog[(Σ f · log X) / n], where n = Σ f.
The following are the properties of the geometric mean:
1. The geometric mean for a given data set is always less than or equal to the arithmetic mean for that data set.
2. The ratio of the corresponding observations of the geometric mean in two series is equivalent to the ratio of their geometric means.
3. The product of the corresponding observations of the geometric mean in two series is equivalent to the product of their geometric means.
4. If the geometric mean replaces each observation in the given data set, then the product of the observations does not change.
Algebraic Properties of the Geometric Mean
The geometric mean is less than the arithmetic mean for any set of positive numbers; when all of a series' values are equal, however, the G.M. equals the arithmetic mean. If any value in a series is 0, the geometric mean is 0 (and it cannot be computed by using logarithms), which makes it unsuitable. If the number of negative values is odd, it cannot be calculated, because the product of the values will be negative and we will be unable to determine the real root of a negative product. The product of the values equals the geometric mean raised to the nth power. Any sets of numbers with the same n and the same product have the same geometric mean. The product of the ratios of the values above the geometric mean to the G.M. equals the product of the ratios of the G.M. to the values below it. Even when each number in a series is replaced by its geometric mean, the series' product remains the same. The sum of the deviations of the logarithms of the original values above the logarithm of the G.M. equals the sum of the deviations below it.

Quadratic Mean. A useful mean in the physical sciences (such as for voltage) is the quadratic mean (QM), which is found by taking the square root of the average of the squares of each value. The formula is QM = sqrt[(X1² + X2² + ... + Xn²) / n].

Median for Grouped Data
An approximate median can be found for data that have been grouped into a frequency distribution. First it is necessary to find the median class, the class that contains the median value, that is, the (n/2)th data value. Then it is assumed that the data values are evenly distributed throughout the median class. The formula is MD = LB + ((n/2 − cf) / f) · i, where LB is the lower boundary of the median class, cf is the cumulative frequency before the median class, f is the frequency of the median class, i is the class width, and n is the total frequency.
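A short sketch tying together the measures of central tendency above; the data values are made up for illustration, and the harmonic and geometric means assume all values are positive:

```python
import math
from statistics import mean, median, mode

# Hypothetical data set (illustrative values only).
data = [4, 6, 6, 7, 9, 12, 15]
n = len(data)

arithmetic_mean = mean(data)                       # sum of values / number of values
md = median(data)                                  # midpoint of the data array
mo = mode(data)                                    # most frequent value
midrange = (min(data) + max(data)) / 2             # (lowest + highest) / 2

harmonic = n / sum(1 / x for x in data)            # HM = n / sum of reciprocals
geometric = math.prod(data) ** (1 / n)             # GM = nth root of the product
quadratic = math.sqrt(sum(x ** 2 for x in data) / n)   # QM = sqrt(mean of squares)

print(f"mean={arithmetic_mean:.2f} median={md} mode={mo} midrange={midrange}")
print(f"HM={harmonic:.2f} GM={geometric:.2f} QM={quadratic:.2f}")
```

For positive values the printed results follow the expected ordering HM ≤ GM ≤ arithmetic mean ≤ QM.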
Measures of Variation
Population Variance and Standard Deviation: σ² = Σ(X − μ)² / N, and the population standard deviation is σ = √σ².
Sample Variance and Standard Deviation: s² = Σ(X − X̄)² / (n − 1), and the sample standard deviation is s = √s².
Variance and Standard Deviation for Grouped Data: s² = [n · Σ(f · Xm²) − (Σ f · Xm)²] / [n(n − 1)], where Xm is the class midpoint and f is the class frequency.

Uses of the Variance and Standard Deviation
1. As previously stated, variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. This information is useful in comparing two (or more) data sets to determine which is more (most) variable.
2. The measures of variance and standard deviation are used to determine the consistency of a variable. For example, in the manufacture of fittings, such as nuts and bolts, the variation in the diameters must be small, or else the parts will not fit together.
3. The variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution. For example, Chebyshev’s theorem shows that, for any distribution, at least 75% of the data values will fall within 2 standard deviations of the mean.
4. Finally, the variance and standard deviation are used quite often in inferential statistics.

Coefficient of Variation
The coefficient of variation expresses the standard deviation as a percentage of the mean: CVar = (s / X̄) · 100%.

Chebyshev’s Theorem
Chebyshev’s theorem, developed by the Russian mathematician Chebyshev (1821–1894), specifies the proportions of the spread in terms of the standard deviation: at least 1 − 1/k² of the data values fall within k standard deviations of the mean (for k > 1).

The Empirical (Normal) Rule
Chebyshev’s theorem applies to any distribution regardless of its shape. However, when a distribution is bell-shaped (or what is called normal), the following statements, which make up the empirical rule, are true. Approximately 68% of the data values will fall within 1 standard deviation of the mean. Approximately 95% of the data values will fall within 2 standard deviations of the mean. Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.

Measures of Position
Standard Scores
A standard score (z score) is z = (X − X̄) / s for a sample, or z = (X − μ) / σ for a population.
Quartiles
Quartiles divide the distribution into four equal groups, denoted by Q1, Q2, Q3.
Quartile formula for ungrouped data: Qk = the [k(n + 1)/4]th value in the data set when arranged from lowest to highest.
Quartile formula for grouped data: Qk = LB + ((kn/4 − cf) / f) · i, where LB is the lower boundary of the kth quartile class, f is the frequency of the kth quartile class, cf is the cumulative frequency before the kth quartile class, i is the class width, and n is the total frequency/sample size.
The interquartile range (IQR) is the difference between the third and first quartiles. This measure of variability, which uses quartiles, is the range of the middle 50% of the data values: IQR = Q3 − Q1.
The quartile deviation (QD), or semi-interquartile range, is half of the interquartile range: QD = (Q3 − Q1) / 2. The midquartile is the numerical value halfway between Q1 and Q3: (Q1 + Q3) / 2.
Deciles
Deciles divide the distribution into 10 groups. They are denoted by D1, D2, etc.
Decile formula for ungrouped data: Dk = the [k(n + 1)/10]th value in the data set when arranged from lowest to highest.
Decile formula for grouped data: Dk = LB + ((kn/10 − cf) / f) · i, where LB is the lower boundary of the kth decile class, f is the frequency of the kth decile class, cf is the cumulative frequency before the kth decile class, i is the class width, and n is the total frequency/sample size.
Percentiles
Percentiles divide the data set into 100 equal groups. Percentiles are symbolized by P1, P2, P3, ..., P99.
Percentile formula for ungrouped data: Pk = the [k(n + 1)/100]th value in the data set when arranged from lowest to highest.
Percentile formula for grouped data: Pk = LB + ((kn/100 − cf) / f) · i, where LB is the lower boundary of the kth percentile class, f is the frequency of the kth percentile class, cf is the cumulative frequency before the kth percentile class, i is the class width, and n is the total frequency/sample size.

The Five-Number Summary and Boxplots
A boxplot can be used to graphically represent the data set. These plots involve five specific values:
1. The lowest value of the data set (i.e., minimum)
2. Q1
3. The median
4. Q3
5. The highest value of the data set (i.e., maximum)
These values are called a five-number summary of the data set. A boxplot is a graph of a data set obtained by drawing a horizontal line from the minimum data value to Q1, drawing a horizontal line from Q3 to the maximum data value, and drawing a box whose vertical sides pass through Q1 and Q3 with a vertical line inside the box passing through the median or Q2.
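A minimal sketch of the measures of position above, using numpy; the data are made up, and numpy's default percentile interpolation can differ slightly from the [k(n + 1)/4]th-value rule given above:

```python
import numpy as np

# Hypothetical data set (illustrative values only).
data = np.array([5, 7, 8, 9, 11, 12, 14, 15, 18, 22, 25])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                      # range of the middle 50% of the values
qd = iqr / 2                       # quartile deviation (semi-interquartile range)
midquartile = (q1 + q3) / 2        # value halfway between Q1 and Q3

five_number_summary = (data.min(), q1, q2, q3, data.max())
z_scores = (data - data.mean()) / data.std(ddof=1)   # standard scores

print("Five-number summary:", five_number_summary)
print(f"IQR={iqr}, QD={qd}, midquartile={midquartile}")
print("z-score of the largest value:", round(z_scores.max(), 2))
```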
Information Obtained from a Boxplot
1. a. If the median is near the center of the box, the distribution is approximately symmetric.
   b. If the median falls to the left of the center of the box, the distribution is positively skewed.
   c. If the median falls to the right of the center, the distribution is negatively skewed.
2. a. If the lines (whiskers) are about the same length, the distribution is approximately symmetric.
   b. If the right line is longer than the left line, the distribution is positively skewed.
   c. If the left line is longer than the right line, the distribution is negatively skewed.

Skewness and Kurtosis
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. Skewness can be checked by using the Pearson coefficient (PC) of skewness, also called Pearson’s index of skewness. The formula is PC = 3(X̄ − MD) / s, where X̄ is the mean, MD is the median, and s is the standard deviation. Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or a lack of outliers. A uniform distribution would be the extreme case.

Probability
A probability experiment is a chance process that leads to well-defined results called outcomes. An outcome is the result of a single trial of a probability experiment. A sample space is the set of all possible outcomes of a probability experiment. A tree diagram is a device consisting of line segments emanating from a starting point and also from the outcome points. It is used to determine all possible outcomes of a probability experiment. An event consists of a set of outcomes of a probability experiment. An event with one outcome is called a simple event. The event of getting an odd number when a die is rolled is called a compound event, since it consists of three outcomes or three simple events. In general, a compound event consists of two or more outcomes or simple events.
There are three basic interpretations of probability:
1. Classical probability
2. Empirical or relative frequency probability
3. Subjective probability

Classical Probability
Classical probability uses sample spaces to determine the numerical probability that an event will happen. You do not actually have to perform the experiment to determine that probability. Classical probability is so named because it was the first type of probability studied formally by mathematicians in the 17th and 18th centuries. Equally likely events are events that have the same probability of occurring.

Complementary Events
The complement of an event E is the set of outcomes in the sample space that are not included in the outcomes of event E. The complement of E is denoted by Ē (read “E bar”).

Empirical Probability
The difference between classical and empirical probability is that classical probability assumes that certain outcomes are equally likely (such as the outcomes when a die is rolled), while empirical probability relies on actual experience to determine the likelihood of outcomes.

Law of Large Numbers
When a coin is tossed one time, it is common knowledge that the probability of getting a head is 1/2. But what happens when the coin is tossed 50 times? Will it come up heads 25 times? Not all the time. You should expect about 25 heads if the coin is fair. But due to chance variation, 25 heads will not occur most of the time. If the empirical probability of getting a head is computed by using a small number of trials, it is usually not exactly 1/2. However, as the number of trials increases, the empirical probability of getting a head will approach the theoretical probability of 1/2, if in fact the coin is fair (i.e., balanced). This phenomenon is an example of the law of large numbers.
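A quick simulation of the law of large numbers just described; a minimal Python sketch in which the numbers of tosses are arbitrary:

```python
import random

random.seed(1)

def empirical_prob_of_heads(n_tosses: int) -> float:
    """Toss a fair coin n_tosses times and return the relative frequency of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The empirical probability drifts toward the theoretical value 0.5 as trials increase.
for n in (10, 100, 1_000, 100_000):
    print(f"{n:>7} tosses -> P(head) = {empirical_prob_of_heads(n):.3f}")
```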
Subjective Probability
The third type of probability is called subjective probability. Subjective probability uses a probability value based on an educated guess or estimate, employing opinions and inexact information. In subjective probability, a person or group makes an educated guess at the chance that an event will occur. This guess is based on the person’s experience and evaluation of a solution. For example, a sportswriter may say that there is a 70% probability that the Pirates will win the pennant next year.

The Addition Rules for Probability
Two events are mutually exclusive events or disjoint events if they cannot occur at the same time (i.e., they have no outcomes in common).

The Multiplication Rules and Conditional Probability
The Multiplication Rules. The multiplication rules can be used to find the probability of two or more events that occur in sequence. For example, if you toss a coin and then roll a die, you can find the probability of getting a head on the coin and a 4 on the die. These two events are said to be independent since the outcome of the first event (tossing a coin) does not affect the probability outcome of the second event (rolling a die). Two events A and B are independent events if the fact that A occurs does not affect the probability of B occurring.
Dependent Events. When the outcome or occurrence of the first event affects the outcome or occurrence of the second event in such a way that the probability is changed, the events are said to be dependent events. Here are some examples of dependent events: drawing a card from a deck, not replacing it, and then drawing a second card; selecting a ball from an urn, not replacing it, and then selecting a second ball.
Conditional Probability. The event of getting a king on the second draw given that an ace was drawn the first time is called a conditional probability. The conditional probability of an event B in relationship to an event A is the probability that event B occurs after event A has already occurred. The notation for conditional probability is P(B|A). This notation does not mean that B is divided by A; rather, it means the probability that event B occurs given that event A has already occurred.
Probabilities for “At Least”. The multiplication rules can be used with the complementary event rule to simplify solving probability problems involving “at least.”

The Fundamental Counting Rule
In a sequence of n events in which the first one has k1 possibilities, the second one has k2 possibilities, and so forth, the total number of possibilities of the sequence is k1 · k2 · ... · kn.
Factorial Notation
n! = n(n − 1)(n − 2) · · · 1, and 0! = 1.
Permutations
A permutation is an arrangement of n objects in a specific order. The number of permutations of n objects taken r at a time is nPr = n! / (n − r)!.
Circular Permutation
If clockwise and anticlockwise orders are different, the total number of circular permutations is Pn = (n − 1)!. If clockwise and anticlockwise orders are taken as not different, the total number of circular permutations is Pn = (n − 1)!/2.
Combinations
A selection of distinct objects without regard to order is called a combination. The number of combinations of r objects selected from n objects is nCr = n! / [(n − r)! r!].

Probability and Counting Rules
The counting rules can be combined with the probability rules in this chapter to solve many types of probability problems, such as the committee selection illustrated in the sketch below.
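As a sketch of combining the counting rules with probability, here is the committee example mentioned in the next paragraph (3 women and 2 men chosen at random from 10 women and 10 men), worked with Python's math.comb:

```python
from math import comb

# P(committee of 3 women and 2 men) when 5 people are chosen at random
# from a club of 10 women and 10 men.
favorable = comb(10, 3) * comb(10, 2)   # ways to pick 3 of 10 women and 2 of 10 men
total = comb(20, 5)                     # ways to pick any 5 of the 20 members
print(favorable / total)                # about 0.348
```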
By using the fundamental counting rule, the permutation rules, and the combination rule, you can compute the probability of outcomes of many experiments, such as getting a full house when 5 cards are dealt or selecting a committee of 3 women and 2 men from a club consisting of 10 women and 10 men. Discrete Probability Distributions Probability Distributions A random variable is a variable whose values are determined by chance. Discrete variables have a finite number of possible values or an infinite number of values that can be counted. The word counted means that they can be enumerated using the numbers 1, 2, 3, etc. For example, the number of joggers in Riverview Park each day and the number of phone calls received after a TV commercial airs are examples of discrete variables, since they can be counted. A discrete probability distribution consists of the values a random variable can assume and the corresponding probabilities of the values. The probabilities are determined theoretically or by observation. Mean, Variance, Standard Deviation, and Expectation Expectation Another concept related to the mean for a probability distribution is that of expected value or expectation. Expected value is used in various types of games of chance, in insurance, and in other areas, such as decision theory. The Binomial Distribution Many types of probability problems have only two outcomes or can be reduced to two outcomes. For example, when a coin is tossed, it can land heads or tails. When a baby is born, it will be either male or female. In a basketball game, a team either wins or loses. A true/false item can be answered in only two ways, true or false. Other situations can be reduced to two outcomes. The outcomes of a binomial experiment and the corresponding probabilities of these outcomes are called a binomial distribution. The Multinomial Distribution Recall that for an experiment to be binomial, two outcomes are required for each trial. But if each trial in an experiment has more than two outcomes, a distribution called the multinomial distribution must be used. A multinomial experiment is a probability experiment that satisfies the following four requirements: 1. There must be a fixed number of trials. 2. Each trial has a specific—but not necessarily the same—number of outcomes. 3. The trials are independent. 4. The probability of a particular outcome remains the same. The Poisson Distribution A discrete probability distribution that is useful when n is large and p is small and when the independent variables occur over a period of time is called the Poisson distribution. A Poisson experiment is a probability experiment that satisfies the following requirements: 1. The random variable X is the number of occurrences of an event over some interval (i.e., length, area, volume, period of time, etc.). 2. The occurrences occur randomly. 3. The occurrences are independent of one another. 4. The average number of occurrences over an interval is known. The Hypergeometric Distribution When sampling is done without replacement, the binomial distribution does not give exact probabilities, since the trials are not independent. The smaller the size of the population, the less accurate the binomial probabilities will be. A hypergeometric experiment is a probability experiment that satisfies the following requirements: 1. There are a fixed number of trials. 2. There are two outcomes, and they can be classified as success or failure. 3. 
The sample is selected without replacement.

The Geometric Distribution
Another useful distribution is called the geometric distribution. This distribution can be used when we have an experiment that has two outcomes and is repeated until a successful outcome is obtained. For example, we could flip a coin until a head is obtained, or we could roll a die until we get a 6. In these cases, the success would come on the nth trial. The geometric probability distribution tells us when the success is likely to occur. A geometric experiment is a probability experiment if it satisfies the following requirements:
1. Each trial has two outcomes that can be either success or failure.
2. The outcomes are independent of each other.
3. The probability of a success is the same for each trial.
4. The experiment continues until a success is obtained.

Bayes’ Theorem
Given two dependent events A and B, the previous formulas for conditional probability allow you to find P(A and B), or P(B|A). Related to these formulas is a rule developed by the English Presbyterian minister Thomas Bayes (1702–1761). The rule is known as Bayes’ theorem.
Example: On a game show, a contestant can select one of four boxes. Box 1 contains one $100 bill and nine $1 bills. Box 2 contains two $100 bills and eight $1 bills. Box 3 contains three $100 bills and seven $1 bills. Box 4 contains five $100 bills and five $1 bills. The contestant selects a box at random and selects a bill from the box at random. If a $100 bill is selected, find the probability that it came from box 4.
Solution:
STEP 1 Select the proper notation. Let B1, B2, B3, and B4 represent the boxes and 100 and 1 represent the values of the bills in the boxes.
STEP 2 Draw a tree diagram and find the corresponding probabilities. The probability of selecting each box is 1/4, or 0.25. The probabilities of selecting the $100 bill from each box, respectively, are 1/10 = 0.1, 2/10 = 0.2, 3/10 = 0.3, and 5/10 = 0.5.
STEP 3 Using Bayes’ theorem, write the corresponding formula. Since the example asks for the probability that box 4 was selected, given that $100 was obtained, the corresponding formula is
P(B4 | 100) = [P(B4) · P(100 | B4)] / [P(B1) · P(100 | B1) + P(B2) · P(100 | B2) + P(B3) · P(100 | B3) + P(B4) · P(100 | B4)]
= (0.25)(0.5) / [(0.25)(0.1) + (0.25)(0.2) + (0.25)(0.3) + (0.25)(0.5)] = 0.125 / 0.275 ≈ 0.455.

The Normal Distribution
If a random variable has a probability distribution whose graph is continuous, bell-shaped, and symmetric, it is called a normal distribution. The graph is called a normal distribution curve.
Summary of the Properties of the Theoretical Normal Distribution
1. A normal distribution curve is bell-shaped.
2. The mean, median, and mode are equal and are located at the center of the distribution.
3. A normal distribution curve is unimodal (i.e., it has only one mode).
4. The curve is symmetric about the mean, which is equivalent to saying that its shape is the same on both sides of a vertical line passing through the center.
5. The curve is continuous; that is, there are no gaps or holes. For each value of X, there is a corresponding value of Y.
6. The curve never touches the x axis. Theoretically, no matter how far in either direction the curve extends, it never meets the x axis, but it gets increasingly close.
7. The total area under a normal distribution curve is equal to 1.00, or 100%. This fact may seem unusual, since the curve never touches the x axis, but one can prove it mathematically by using calculus. (The proof is beyond the scope of this text.)
8. The area under the part of a normal curve that lies within 1 standard deviation of the mean is approximately 0.68, or 68%; within 2 standard deviations, about 0.95, or 95%; and within 3 standard deviations, about 0.997, or 99.7%.

The Standard Normal Distribution
The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. All normally distributed variables can be transformed into the standard normally distributed variable by using the formula for the standard score: z = (X − μ) / σ.

The Central Limit Theorem
Distribution of Sample Means
Suppose a researcher selects a sample of 30 adult males and finds the mean of the measure of the triglyceride levels for the sample subjects to be 187 milligrams/deciliter. Then suppose a second sample is selected, and the mean of that sample is found to be 192 milligrams/deciliter. Continue the process for 100 samples. What happens then is that the mean becomes a random variable, and the sample means 187, 192, 184, ..., 196 constitute a sampling distribution of sample means. A sampling distribution of sample means is a distribution using the means computed from all possible random samples of a specific size taken from a population. Sampling error is the difference between the sample measure and the corresponding population measure due to the fact that the sample is not a perfect representation of the population.
Properties of the Distribution of Sample Means
1. The mean of the sample means will be the same as the population mean.
2. The standard deviation of the sample means will be smaller than the standard deviation of the population, and it will be equal to the population standard deviation divided by the square root of the sample size.
If the sample size is sufficiently large, the central limit theorem can be used to answer questions about sample means in the same manner that a normal distribution can be used to answer questions about individual values. The only difference is that a new formula must be used for the z values. It is z = (X̄ − μ) / (σ / √n).
It’s important to remember two things when you use the central limit theorem:
1. When the original variable is normally distributed, the distribution of the sample means will be normally distributed, for any sample size n.
2. When the distribution of the original variable is not normal, a sample size of 30 or more is needed to use a normal distribution to approximate the distribution of the sample means. The larger the sample, the better the approximation will be.
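A small simulation of the sampling distribution of sample means described above; a sketch with numpy that uses an arbitrary skewed (exponential) population and repeated random samples rather than all possible samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# A deliberately non-normal population (exponential, mean 10, standard deviation 10).
population = rng.exponential(scale=10, size=100_000)
mu, sigma = population.mean(), population.std()

n = 30                                                  # sample size
sample_means = rng.choice(population, size=(5_000, n)).mean(axis=1)

print("population mean:", round(mu, 2),
      "| mean of sample means:", round(sample_means.mean(), 2))
print("sigma / sqrt(n):", round(sigma / np.sqrt(n), 2),
      "| sd of sample means:", round(sample_means.std(), 2))
```

The two printed pairs should nearly match, illustrating the two properties of the distribution of sample means listed above.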
Confidence Intervals and Sample Size

Confidence Intervals for the Mean When σ Is Known
Suppose a college president wishes to estimate the average age of students attending classes this semester. The president could select a random sample of 100 students and find the average age of these students, say, 22.3 years. From the sample mean, the president could infer that the average age of all the students is 22.3 years. This type of estimate is called a point estimate. A point estimate is a specific numerical value estimate of a parameter. The best point estimate of the population mean μ is the sample mean X̄.

Confidence Intervals
An interval estimate of a parameter is an interval or a range of values used to estimate the parameter. This estimate may or may not contain the value of the parameter being estimated. The confidence level of an interval estimate of a parameter is the probability that the interval estimate will contain the parameter, assuming that a large number of samples are selected and that the estimation process on the same parameter is repeated. A confidence interval is a specific interval estimate of a parameter determined by using data obtained from a sample and by using the specific confidence level of the estimate. The margin of error, also called the maximum error of the estimate, is the maximum likely difference between the point estimate of a parameter and the actual value of the parameter. When σ is known, the confidence interval for the mean is X̄ − z(α/2) · (σ/√n) < μ < X̄ + z(α/2) · (σ/√n).

Sample Size
The minimum sample size needed to estimate μ within a margin of error E is n = [z(α/2) · σ / E]², rounded up.

Confidence Intervals for the Mean When σ Is Unknown
When σ is known and the sample size is 30 or more, or when the population is normally distributed and the sample size is less than 30, the confidence interval for the mean can be found by using the z distribution. However, most of the time, the value of σ is not known, so it must be estimated by using s, namely, the standard deviation of the sample. When s is used, especially when the sample size is small, critical values greater than the values for z(α/2) are used in confidence intervals in order to keep the interval at a given level, such as 95%. These values are taken from the Student t distribution, most often called the t distribution. To use this method, the samples must be simple random samples, and the population from which the samples were taken must be normally or approximately normally distributed, or the sample size must be 30 or more.

Confidence Intervals and Sample Size for Proportions
To construct a confidence interval about a proportion, you must use the margin of error, which is E = z(α/2) · √(p̂q̂ / n). Confidence intervals about proportions must meet the criteria that np̂ ≥ 5 and nq̂ ≥ 5.

Sample Size for Proportions
The minimum sample size needed to estimate a proportion within a margin of error E is n = p̂q̂ · [z(α/2) / E]², rounded up.

Hypothesis Testing
A statistical hypothesis is a conjecture about a population parameter. This conjecture may or may not be true. In the hypothesis-testing situation, there are four possible outcomes. In reality, the null hypothesis may or may not be true, and a decision is made to reject or not reject it on the basis of the data obtained from a sample. Notice that there are two possibilities for a correct decision and two possibilities for an incorrect decision. The four possibilities are as follows:
1. We reject the null hypothesis when it is true. This would be an incorrect decision and would result in a type I error.
2. We reject the null hypothesis when it is false. This would be a correct decision.
3. We do not reject the null hypothesis when it is true. This would be a correct decision.
4. We do not reject the null hypothesis when it is false. This would be an incorrect decision and would result in a type II error.

Differences Between Alpha, Beta, and Power of a Test:
Alpha (α): Alpha is the significance level, set by the researcher before conducting the study; it represents the maximum acceptable probability of rejecting a true null hypothesis (the Type I error rate). Commonly chosen levels for alpha include 0.05 and 0.01, but it can vary based on the study and field of research.
Beta (β): Beta is the probability of failing to reject a false null hypothesis (the Type II error rate). Beta is related to power in that 1 − β is the power of the study, which represents the ability to detect a true effect.
Power of a test: Power is the probability of correctly rejecting a false null hypothesis, representing the ability to detect a true effect. Power is not set directly but is determined by the study design, including sample size, effect size, and other factors. It is the complement of the Type II error rate (β), so Power = 1 − β. While there is a mathematical relationship between alpha and power, conceptually they serve different purposes: alpha (α) controls the rate of Type I errors, determining the threshold for statistical significance, while power measures the ability to detect true effects, controlling the rate of Type II errors. In practical terms, researchers set alpha based on their study requirements, and then they aim to design the study to achieve a satisfactory level of power, which is influenced by factors such as sample size, effect size, and variability. Balancing these factors is crucial for a well-designed and meaningful statistical study.
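Before the test-specific sections below, here is a minimal sketch of the decision process: a one-sample t test using the P-value method, with made-up data and scipy (the claimed mean of 8 hours and the 0.05 level are assumptions for illustration):

```python
from scipy import stats

# Hypothetical sample: hours of sleep for 12 students.
sample = [6.5, 7.0, 7.2, 8.1, 6.8, 7.5, 7.9, 6.9, 7.3, 7.0, 8.0, 6.6]

alpha = 0.05                                               # maximum acceptable Type I error rate
t_stat, p_value = stats.ttest_1samp(sample, popmean=8.0)   # H0: mu = 8, H1: mu != 8

print(f"t = {t_stat:.2f}, P-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the mean appears to differ from 8 hours.")
else:
    print("Do not reject H0: not enough evidence of a difference.")
```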
z Test for One Sample
P-Value Method for Hypothesis Testing
t Test for a One-Sample Mean
z Test for a One-Sample Proportion

Testing the Difference Between Two Means: Using the z Test
Suppose a researcher wishes to determine whether there is a difference in the average age of nursing students who enroll in a nursing program at a community college and those who enroll in a nursing program at a university. In this case, the researcher is not interested in the average age of all beginning nursing students; instead, he is interested in comparing the means of the two groups. His research question is, Does the mean age of nursing students who enroll at a community college differ from the mean age of nursing students who enroll at a university? Here, the hypotheses are H0: μ1 = μ2 and H1: μ1 ≠ μ2, where μ1 is the mean age of all beginning nursing students at a community college and μ2 is the mean age of all beginning nursing students at a university. Another way of stating the hypotheses for this situation is H0: μ1 − μ2 = 0 and H1: μ1 − μ2 ≠ 0.

Testing the Difference Between Two Means of Independent Samples: Using the t Test
Testing the Difference Between Two Means: Dependent Samples
Testing the Difference Between Two Variances

Correlation and Regression
In simple correlation and regression studies, the researcher collects data on two numerical or quantitative variables to see whether a relationship exists between the variables. For example, if a researcher wishes to see whether there is a relationship between number of hours of study and test scores on an exam, she must select a random sample of students, determine the number of hours each studied, and obtain their grades on the exam.

Correlation Coefficient
Statisticians use a measure called the correlation coefficient to determine the strength of the linear relationship between two variables. There are several types of correlation coefficients. When two variables are highly correlated, there exists a possibility that the correlation is due to a third variable. If this is the case and the third variable is unknown to the researcher or not accounted for in the study, it is called a lurking variable. An attempt should be made by the researcher to identify such variables and to use methods to control their influence.

Regression
If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line, which is the data’s line of best fit.
The difference between the actual value y and the predicted value (that is, the vertical distance) is called a residual or a predicted error. Residuals are used to determine the line that best describes the relationship between the two variables. The method used for making the residuals as small as possible is called the method of least squares. As a result of this method, the regression line is also called the least squares regression line. Coefficient of Determination Standard Error of the Estimate Multiple Regression In multiple regression, there are several independent variables and one dependent variable. Chi Square Test for Goodness of Fit Recall the characteristics of the chi-square distribution: 1. The chi-square distribution is a family of curves based on the degrees of freedom. 2. The chi-square distributions are positively skewed. 3. All chi-square values are greater than or equal to zero. 4. The total area under each chi-square distribution is equal to 1. When you are testing to see whether a frequency distribution fits a specific pattern, you can use the chi-square goodness-of-fit test. For example, suppose as a market analyst you wished to see whether consumers have any preference among five flavors of a new fruit soda. A sample of 100 people provided these data: Since the frequencies for each flavor were obtained from a sample, these actual frequencies are called the observed frequencies. The frequencies obtained by calculation (as if there were no preference) are called the expected frequencies. To calculate the expected frequencies, there are two rules to follow. 1. If all the expected frequencies are equal, the expected frequency E can be calculated by using 𝐸 = 𝑛/𝑘, where n is the total number of observations and k is the number of categories. 2. If all the expected frequencies are not equal, then the expected frequency E can be calculated by 𝐸 = 𝑛 ∙ 𝑝, where n is the total number of observations and p is the probability for that category. Looking at the new fruit flavors example, if there were no preference, you would expect each flavor to be selected with equal frequency. In this case, the equal frequency is 100/5 = 20. That is, approximately 20 people would select each flavor. A completed table for the test is shown. The observed frequencies will almost always differ from the expected frequencies due to sampling error; that is, the values differ from sample to sample. But the question is: Are these differences significant (a preference exists), or are they due to chance? The chi-square goodness-of-fit test will enable the researcher to determine the answer. Before computing the test value, you must state the hypotheses. The null hypothesis should be a statement indicating that there is no difference or no change. For this example, the hypotheses are as follows: H0: Consumers show no preference for flavors of the fruit soda. H1: Consumers show a preference. Next, we need a measure of discrepancy between the observed values O and the expected values E, so we use the test statistic for the chi-square goodness-of-fit test. Chi-Square Test for Independence The chi-square independence test is used to test whether two variables are independent of each other. The null hypotheses for the chi-square independence test are generally, with some variations, stated as follows: H0: The variables are independent of each other. H1: The variables are dependent upon each other. The data for the two variables are placed in a contingency table. 
One variable is called the row variable, and the other variable is called the column variable. The table is called an R × C table, where R is the number of rows and C is the number of columns. (Remember, rows go across or horizontally, and columns go up and down or vertically.) For example, a 2 × 3 contingency table has two rows and three columns. Each value in the table is called a cell value. For example, the cell value C2,3 is the value in the second row (2) and third column (3). The observed values are obtained from the sample data. (That is, they are given in the problem.) The expected values are computed from the observed values, and they are based on the assumption that the two variables are independent.

Analysis of Variance

One-Way Analysis of Variance
When an F test is used to test a hypothesis concerning the means of three or more populations, the technique is called analysis of variance (commonly abbreviated as ANOVA). Recall that the characteristics of the F distribution are as follows:
1. The values of F cannot be negative, because variances are always positive or zero.
2. The distribution is positively skewed.
3. The mean value of F is approximately equal to 1.
4. The F distribution is a family of curves based on the degrees of freedom of the variance of the numerator and the degrees of freedom of the variance of the denominator.
One-way analysis of variance (ANOVA) is used to evaluate mean differences between two or more treatments. It uses sample data as a basis for drawing general conclusions about populations. Its clear advantage over a t test is that it can be used to compare more than two treatments at the same time.
The total variability is partitioned into between-treatments variance and within-treatments variance.
▪ Between-groups (treatments) variance
o Variability that results from general differences between the treatment conditions
o Variance between treatments measures differences among sample means
▪ Within-groups (treatments) variance
o Variability within each sample
o Individual scores are not the same within each sample
Statistical Hypotheses for ANOVA
▪ Null hypothesis: the level or value of the factor does not affect the dependent variable
o H0: The means of the groups do not differ from each other.
o H0: μ1 = μ2 = μ3
▪ Alternative hypothesis: the level or value of the factor affects the dependent variable
o H1: There is at least one mean difference among the groups.
o H1: not all of μ1, μ2, μ3 are equal
▪ The F-ratio is based on variance instead of sample mean differences:
F = (variance between groups) / (variance within groups)
The sums of squares are SSB = Σ[(ΣX)² / n] − [Σ(ΣX)]² / N for between groups and SSW = Σ(ΣX²) − Σ[(ΣX)² / n] for within groups, where ΣX and n are the sum and size of each group and N is the total number of scores.
Example: A random sample of the students in each row of a classroom was taken. The score for those students on the second exam was recorded.
Front: 7, 8, 7, 10, 5, 7, 9
Middle: 8, 7, 6, 6, 7, 8, 10
Back: 5, 10, 5, 7, 8, 7, 6
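The classroom-row example above can be run directly; a sketch using scipy's one-way ANOVA, where the 0.05 significance level is an assumed choice:

```python
from scipy import stats

# Exam scores by classroom row, from the example above.
front  = [7, 8, 7, 10, 5, 7, 9]
middle = [8, 7, 6, 6, 7, 8, 10]
back   = [5, 10, 5, 7, 8, 7, 6]

f_stat, p_value = stats.f_oneway(front, middle, back)
print(f"F = {f_stat:.3f}, P-value = {p_value:.3f}")

# With alpha = 0.05, a P-value above 0.05 means we do not reject
# H0: mu_front = mu_middle = mu_back.
```

For these scores the F statistic is small and the P-value is well above 0.05, so the null hypothesis of equal row means is not rejected.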
Non-Parametric Tests
Non-parametric tests, also known as distribution-free tests, are statistical procedures used to analyze data when the assumptions of parametric tests are not met or when dealing with data that do not follow a specific distribution. These tests make minimal or no assumptions about the underlying population distribution. Here is a summary of non-parametric tests:
Assumptions:
o Non-parametric tests require fewer assumptions compared to parametric tests. They don't assume a particular shape for the population distribution, normality, or homogeneity of variances.
Type of Data:
o Non-parametric tests are suitable for analyzing ordinal, nominal, or continuous data that may not follow a specific distribution.
o They are often used with data that have outliers or are not normally distributed.
Examples of Non-Parametric Tests:
o Wilcoxon Signed-Rank Test: Compares two related samples (paired samples) to assess whether their distributions differ significantly.
o Mann-Whitney U Test: Compares two independent samples to determine if there is a significant difference between their distributions.
o Kruskal-Wallis Test: An extension of the Mann-Whitney U Test for comparing more than two independent groups.
o Spearman's Rank Correlation: Determines the strength and direction of a monotonic relationship between two variables.
o Chi-Square Test: Evaluates the association between categorical variables in a contingency table.
o Friedman Test: Compares more than two related samples (matched groups) to see if they have significantly different distributions.
Advantages:
o Non-parametric tests are robust against outliers and deviations from normality, making them suitable for a wide range of data types.
o They provide valid statistical inference when the assumptions of parametric tests are violated.
o Having no distributional assumptions makes them versatile and applicable in various situations.
Disadvantages:
o Non-parametric tests may have less statistical power (lower efficiency) compared to their parametric counterparts when data meet the assumptions of parametric tests.
o They often require larger sample sizes to achieve comparable power to parametric tests under certain conditions.
Use Cases:
o Non-parametric tests are commonly used in behavioral sciences, social sciences, medicine, and other fields where assumptions of parametric tests may not hold or when dealing with non-normal data.
Non-parametric tests are valuable tools in statistical analysis, providing reliable options for making inferences when parametric assumptions cannot be met. Researchers should choose the appropriate non-parametric test based on the data type, research design, and specific research question.
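A minimal sketch of one of the tests listed above, the Mann-Whitney U test, with made-up samples and scipy:

```python
from scipy import stats

# Hypothetical reaction times (seconds) for two independent groups.
group_a = [1.9, 2.4, 2.1, 3.0, 2.8, 2.2, 2.5]
group_b = [2.9, 3.4, 3.1, 3.8, 2.7, 3.3, 3.6]

u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat}, P-value = {p_value:.4f}")
# A small P-value suggests the two distributions differ; no normality assumption is needed.
```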
