Statistics and Probability PDF

Summary

The document appears to be a set of lecture notes or study materials on statistics and probability, covering various topics like probability distributions, sampling techniques, and hypothesis testing.

Full Transcript

Statistics and Probability
Made by MATRIX

Table of Contents

Posted Coverage
Probability
    Probability of an Event
    Random Variable
    Discrete Random Variable
    Continuous Random Variable
    Probability Distribution
    Properties
    Probability Histogram
Measures of Central Tendency
    Sample Mean
    Population Mean
    Mean of a Probability Distribution
Measures of Dispersion
    Sample Variance and Standard Deviation
    Variance and Standard Deviation of a Probability Distribution
Binomial Distributions
    Bernoulli Trial
    Binomial Experiment
    Binomial Distribution
    Characteristics
    In Excel
Hypergeometric Distributions
    Variables
Poisson Distributions
    Poisson Experiment
    Characteristics
    Poisson Distribution
Normal Distributions
    Normal Random Variable
    Normal Distribution
    Properties
    The Standard Normal Distribution
    Percentile
Sampling Techniques
    Basic Concepts
    Sample Size (Slovin's Formula)
    Probability Sampling
        Simple Random Sampling
        Systematic Random Sampling
        Stratified Random Sampling
        Cluster Sampling
        Multistage Sampling
    Nonprobability Sampling
        Convenience Sampling
        Purposive Sampling
        Snowball Sampling
        Quota Sampling
        Volunteer Sampling
Estimating the Population Mean
    Point Estimate for a Parameter
    Interval Estimate for a Parameter
    Level of Confidence
    Properties of Good Estimator
    Estimating Mean with Known SD
        Confidence Coefficient
        Confidence Interval: Five Steps
    Estimating Mean with Unknown SD
        Degree of Freedom
    Determining Sample Size
Population Proportion and Correlation
    Confidence Interval Procedure For Estimation for the Population Proportion
    Examples
Language of Hypothesis Testing
    Statistical Hypothesis Test
    Null Hypothesis
    Alternative Hypothesis
    Errors in Decisions
    Test Statistic
    Probability Value or p-value
    Conclusion
Hypothesis Testing for Mean
    Known SD
        Two-Tailed Situation
        One-Tailed Situation
    Unknown SD
        One-tailed Situation
        Two-tailed Situation
    Summary
Linear Correlation
    Bivariate Data
    The Scatter Plot
    Linear Correlation Coefficient r
    Examples

Posted Coverage

Sir Luke posted coverage on Ranger:
1. Definition of Random Variables, Mean for Probability Distribution
2. Properties of Probability Distribution
3. Probability Mass Function (Table)
4. Summation of P(x)
5. Measures of Central Tendency
6. Sample Mean
7. Population Mean
8. The Measures of Dispersion (Variance, Standard Deviation)
9. Properties of a Normal Curve
10. Empirical Rule
11. If the required area is between two z-values, subtract the area for the lower z-value from the area for the higher z-value.
    If the required area is to the right of a z-value, subtract the table area for that z-value from 1.
    If the required area is to the left of a z-value, read the area directly from the z-table.
    If X is not given, use the formula X = Z⋅σ + μ.
12. Review the types of Non-Probability Sampling and Probability Sampling and their definitions
13. For Systematic Sampling (it uses random numbers)
    Hypothesis being tested (Null)
    Population Parameter being tested (Mean of Known SD, Mean of Unknown SD, Proportion)
14. Keywords and Symbols for Null and Alternative Hypothesis
15. The Decision Rule
16. Formulating a Hypothesis
    Critical and Acceptance Region
    One Tailed Test
17. Zero, Negative and Positive Correlation
    Definition
    Linear Correlation Coefficient r
    Table of Interpretation
Others
18. Familiarize yourself with the formulas being used:
    Population or Sample Variance, Population or Sample Standard Deviation, Population Mean, Sample Mean
    Probability Mass Function of Binomial, Hypergeometric, and Poisson
    Mean, Variance, and Standard Deviation of Binomial Distribution
    Z-Value Formula
    Test Statistic (Z-Formula)
    Critical Values and Level of Significance Table
    Familiarize and Remember the Scatter Plots and Correlation Examples
    Pearson's r

Probability

In statistics, probability is primarily concerned with predicting chances, especially the occurrence of an event.
In an experiment, there is an outcome. The set of all possible outcomes of an experiment is called the sample space.

[Figure: Coins]

Probability of an Event

For an event E and a sample space S, the probability of the event, P(E), is determined by:

$$P(E) = \frac{n(E)}{n(S)}$$

where n(E) is the number of outcomes in E and n(S) is the number of outcomes in S.

The probability of an event can be represented as a fraction, a decimal, or a percentage. Its value ranges from 0 to 1, or from 0% to 100%.

Random Variable

This is a rule that assigns a numerical value or characteristic to an outcome of an experiment. For example:
○ The number of heads obtained by tossing two coins
○ The average grade of the student in his report card
○ The amount of money in a one-year bank statement
○ The measurement of water in making a cake
○ The number of gadgets per household
○ The average number of votes in the election

Discrete Random Variable

These are random variables whose values are obtained by counting, and therefore cannot have decimal values.
○ The number of heads obtained by tossing two coins
○ The amount of money in a one-year bank statement

Continuous Random Variable

These are random variables whose values are obtained by measuring.
○ The average grade of the student in his report card
○ The measurement of water for making a cake

Probability Distribution

This is also known as the probability mass function. It refers to the table that lists the probability values along with their associated values in the range space of a discrete random variable.

Properties

Each probability value p ranges from 0 to 1 inclusive, that is, p is in [0, 1].
The sum of all the individual probability values is equal to 1. That is:

$$\sum P(x) = 1$$

Probability Histogram

A probability distribution can be expressed through a histogram, which is similar to a bar graph.

Measures of Central Tendency

These are numerical values that locate, in some sense, the center of a set of data.
The term average is often associated with all of the measures of central tendency:
○ Mean (x̄)
○ Median (x̃)
○ Mode (x̂)

Sample Mean

The mean of an ungrouped set of data is obtained by getting the sum of all values of the variable and dividing it by the number of values.

Population Mean

Mean of a Probability Distribution

The mean of a probability distribution of a discrete random variable x is found by getting the sum of the products of each possible value in the range space and its probability.

Measures of Dispersion

These are the numerical values that describe the amount of spread, or variability, that is found among the data. Closely grouped data have relatively small values. More widely spread-out data have larger values.

Sample Variance and Standard Deviation

Variance and Standard Deviation of a Probability Distribution

For a random variable X and the mean μ of its probability distribution, the variance σ² of the probability distribution is:

$$\sigma^2 = \sum (x - \mu)^2 P(x)$$

The standard deviation σ is the square root of the variance.

Binomial Distributions

Bernoulli Trial

A Bernoulli trial is an experiment that results in one of only two possible outcomes: success or failure. For example, the number of heads after flipping a coin once is either 1 (success) or 0 (failure). Thus, it is an example of a Bernoulli trial.

Binomial Experiment

A binomial experiment consists of performing a sequence of n independent and identical Bernoulli trials and then observing the number of successes. The random variable X, which signifies the number of successes in a binomial experiment, is called a binomial random variable.

Binomial Distribution

It is a probability distribution that summarizes the likelihood of a given number of successes in a fixed number of independent experiments (or trials), each of which has exactly two possible outcomes: success or failure.

Characteristics

1. It consists of n repeated trials.
2. The trials are independent of one another.
3. Each trial is a Bernoulli trial.
4. The probability of success p is constant, and the probability of failure q = 1 - p is also constant.

In Excel

To get the probability of the values of the random variable, we can use the command:

Hypergeometric Distributions

This is a probability distribution used in situations where success or failure is possible on each trial but where there is no independence from trial to trial. It applies when there are N items, of which k are classified as successes and N - k are classified as failures. A sample of size n (n ≤ N) is selected from the N items without replacement, and X is defined as the number of successes in the n items selected.

Variables

N = the total number of objects (population size)
n = the number of objects selected from the N objects (sample size)
k = the total number of successes among the N objects in the population
X = the random variable (the number of successes among the n selected objects)

Poisson Distributions

Poisson Experiment

This is a process that examines the number of times an event will occur over a specified time interval or region of space. The number of occurrences of such an event within a specified time interval or region of space is independent of its occurrence in another time interval or region of space.

Characteristics

1. The number of occurrences of such an event within a specified time interval or region of space is independent of its occurrence in another time interval or region of space.
2. The number of occurrences is proportional to the length of the time interval or size of the region.
3. The probability of two or more occurrences within a very short time or small region is negligible.

Poisson Distribution

Normal Distributions

The previous distributions discussed were for discrete random variables. But what about continuous random variables?
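Before turning to continuous random variables, the three discrete distributions named above can be sketched in code. This is a minimal illustration that assumes the standard textbook forms of the binomial, hypergeometric, and Poisson probability mass functions (the notes list them under the formulas to review but do not reproduce them). The function names, the rate symbol lam, and all input values are hypothetical.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """P(X = x): x successes in n independent trials, success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def hypergeometric_pmf(x, N, k, n):
    """P(X = x): x successes when n items are drawn without replacement
    from N items, k of which are successes."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)


def poisson_pmf(x, lam):
    """P(X = x): x occurrences when lam occurrences are expected per interval."""
    return lam**x * exp(-lam) / factorial(x)


# Hypothetical inputs, chosen only for illustration.
p_binom = binomial_pmf(2, n=4, p=0.5)            # 2 heads in 4 fair coin tosses
p_hyper = hypergeometric_pmf(1, N=10, k=3, n=2)  # 1 success in 2 draws
p_pois = poisson_pmf(0, lam=2.0)                 # no occurrences when 2 are expected

# The probabilities of a distribution must sum to 1 (checked here for the binomial).
total = sum(binomial_pmf(x, n=4, p=0.5) for x in range(5))

print(p_binom, p_hyper, p_pois, total)
```

The binomial function assumes a constant p across independent trials, while the hypergeometric function models sampling without replacement, which is the distinction the text draws between the two.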
Continuous random variables are random variables obtained by measuring:
○ Average grade of a student in his report card
○ Weight or height of a person

Normal Random Variable

A random variable is considered normal if the majority of its values are close to the expected value (or the mean) with only very few values that are extremely small and extremely large.

Normal Distribution

Properties

1. The graph is continuous with the domain (−∞, ∞).
2. The graph is asymptotic to the x-axis.
3. The maximum point of the graph lies at the mean (expected value).
4. The graph is symmetrical about the mean (expected value).
5. All measures of central tendency (mean, median, mode) lie at the center of the graph (they are all equal).
6. The area under the curve bounded by the x-axis is equal to 100% or 1.
7. The mean (expected value) divides the graph into halves. Thus, from the mean to the right (or left), the area is 50% or 0.5.
8. The standard deviation affects the width and height of the normal distribution.
    1. A smaller standard deviation leads to a narrower graph with greater height.
    2. A larger standard deviation leads to a wider graph with a lower height.
9. The standard deviation precisely describes the spread of the normal distribution.
10. Specific standard deviation constants:
    1. Area within one standard deviation of the mean (to the left and right): 68.3% or 0.683
    2. Two standard deviations: 95.4% or 0.954
    3. Three standard deviations: 99.7% or 0.997

The Standard Normal Distribution

By setting the mean to 0 and standard deviation to 1, we can standardize the normal distribution:

$$z = \frac{x - \mu}{\sigma}$$

The area under the curve to the left of a z-score can be found using a z-table instead of manually calculating it yourself.

Why do we find areas under the normal curve?
Areas under the normal curve can represent any of the following in a normally distributed random variable:
○ Probabilities of values in a given interval
○ Proportion of the whole population
○ Percentage of the values in a given percentile

Percentile

Sampling Techniques

Sampling techniques are methods of gathering samples from a population.

Basic Concepts

1. Census
    1. It is the systematic recording of information of each element of the population.
    2. Examples:
        1. Parent/Guardian Census: gathering information from all parents about contact details, involvement in school activities, feedback on school programs, etc.
2. Population
    1. This refers to the set of individuals, objects, or events whose properties are to be analyzed.
    2. This refers to the totality of observations or elements from a universe.
    3. Examples:
        1. All students enrolled in LSGH
        2. Every resident of Mandaluyong City (425,758)
3. Sample
    1. This refers to the subset (from the population) of individuals, objects, or events whose properties are to be analyzed.
    2. This refers to one or more elements taken from the population for a specific purpose.
    3. Examples:
        1. A group of 100 ADCI students randomly selected from LSGH
        2. A survey of 500 residents from Mandaluyong City
4. Population Size
    1. This refers to the number of elements in the population, represented by N.

Sample Size (Slovin's Formula)

This refers to the number of elements in the given sample, represented by n. One way to determine a reliable sample size given a margin of error e is Slovin's Formula, which uses the population size in relation to the margin of error:

$$n = \frac{N}{1 + N e^2}$$

Probability Sampling

It is based on the fact that every member (population element) has a known (non-zero) chance of being selected. It involves random selection, giving every sample an equal chance of being selected, allowing you to make strong statistical inferences about the whole group. Probability sampling should be used to avoid bias.
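As a quick illustration of the Slovin's Formula sample-size calculation described under Sample Size, here is a minimal sketch. It assumes the usual form n = N / (1 + N·e²) and assumes the result is rounded up to a whole element, which is a common convention rather than something the notes state. The function name and the 5% margin of error are hypothetical; the population figure is the one quoted in the notes for Mandaluyong City.

```python
from math import ceil


def slovin_sample_size(population_size, margin_of_error):
    """Slovin's formula n = N / (1 + N * e^2), rounded up (assumed convention)."""
    n = population_size / (1 + population_size * margin_of_error**2)
    return ceil(n)


# Population size from the notes; the 0.05 margin of error is an assumed value.
sample_size = slovin_sample_size(425_758, 0.05)
print(sample_size)
```

A smaller margin of error increases the denominator more slowly relative to N, so the required sample size grows as e shrinks.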
Making statistical inferences is the process of using data analysis to infer properties of an underlying probability distribution.

Simple Random Sampling

In this technique, each member in the population has an equal chance to be selected as a participant. This is also known as the lottery method.
1. Write on several pieces of paper the names or the numbers assigned to each member in the population.
2. Put these in a single container.
3. Randomize the entries by shaking the container.
4. Let one person pick one name or number.
5. Repeat steps 3 and 4 until the desired sample size is obtained.

Example 1: A teacher wants to select 10 students from a class of 50 for a group project. She writes the names of all 50 students on separate pieces of paper, places them in a box, and randomly draws 10 names.

Example 2: A company has 1,000 employees and wants to survey 100 employees about job satisfaction. They use a random number generator to pick 100 employees from the list of all 1,000 employees.

Systematic Random Sampling

This refers to the sampling technique that includes every kth element of the population in the sample, starting from a random starting point selected from the first k members.
1. Randomly arrange the elements in the population.
2. Assign a number to each element in the population.
3. Find the sampling interval k.
4. Randomly select a number from the whole numbers between 0 and k + 1 (that is, from 1 to k). This is called the random start.
5. Using the random start as the first element, select every kth element from the list.

Stratified Random Sampling

This is done when the population is divided into strata (homogeneous partitions). The samples are selected based on the total number of members in each stratum.
1. Make sure the population is divided into partitions. No member belongs to more than one stratum.
2. Determine the total members for each stratum.
3. Determine the population size.
4. Determine the desired sample size.
5. To determine the sample size for each stratum, divide the number of members in the stratum by the population size, then multiply the quotient by the desired sample size. Do this for all strata.
6. Make sure the stratum sample sizes add up to the desired sample size. If not, adjust them accordingly.

Cluster Sampling

This is similar to stratified random sampling, but the population is partitioned into clusters (heterogeneous partitions). The sample is taken through a random selection of clusters, and all members of the chosen clusters will be part of the sample.
1. Make sure the population is divided into clusters.
2. Determine the desired number of clusters needed in the sample.
3. Select by simple random sampling the clusters for the sample.

Example 1: The Department of Tourism wants to evaluate visitor satisfaction in major tourist destinations across the Philippines. They create clusters by dividing the country into tourist areas (e.g., Palawan, Boracay, Cebu, etc.) and randomly select a few areas. All tourists visiting these areas are surveyed about their experience.

Example 2: A study is conducted to understand the level of student participation in extracurricular activities (e.g., sports, clubs) in Mandaluyong schools. Schools are grouped into clusters based on public and private school categories. A random selection of clusters is made, and all students from the chosen schools are surveyed about their participation in extracurricular activities.

Multistage Sampling

It is a complex form of cluster sampling in which large populations are divided into smaller, more manageable groups (stages), and sampling is conducted in multiple phases or stages. This method is often used when dealing with large, geographically dispersed populations that are difficult or expensive to sample from directly.
1. Divide the population into clusters: The population is divided into clusters or groups, which can be geographical regions, schools, districts, etc.
2. Randomly select clusters: A random selection of clusters is made in the first stage.
3. Further divide the selected clusters: In subsequent stages, smaller subgroups within the chosen clusters are randomly selected. This step may be repeated multiple times, moving from larger units to smaller units.
4. Sample within the final subgroup: In the final stage, individual elements (e.g., people, households) are randomly selected from the final clusters for the sample.

Example: National Health Survey
First Stage (Cluster selection): The Department of Health wants to conduct a survey on public health. They first divide the country into its 17 regions (clusters) and randomly select a few regions.
Second Stage (Sub-cluster selection): Within the selected regions, provinces are randomly chosen.
Third Stage (Unit selection): Within each chosen province, barangays (small communities) are selected randomly.
Final Stage: Households or individuals within those barangays are randomly selected for the health survey.

Nonprobability Sampling

It is a type of sampling where the chance of any member (population element) being selected for a sample cannot be calculated (unknown) or can be zero. It involves non-random selection based on convenience or other criteria, allowing you to easily collect data.

Convenience Sampling

Also known as haphazard sampling, this technique takes samples that are readily available to participate in the study.

Example: A teacher wants to survey high school students about their study habits. Instead of randomly selecting students, she surveys her own class because they are easily accessible.

Purposive Sampling

This sampling technique is also called judgment sampling or selective sampling. The samples taken are chosen based on the judgment of the researcher. The goal of this sampling is to carefully choose the members of the population that are best suited to answer the research questions.
Example: A researcher studying the effects of a new medication for diabetes specifically selects diabetic patients from a hospital to participate in the study, excluding those with other conditions.

Snowball Sampling

This sampling technique is also called chain-referral sampling. The researcher chooses a possible respondent for the study at hand, then each respondent is asked to give recommendations or referrals to other possible respondents. Like how a snowball increases in size as it goes down a snowy hill, the sample size rapidly increases and grows.

Example: A researcher interested in studying LGBTQ+ experiences might begin with a few participants they know or have been introduced to. Those participants then help the researcher connect with more members of the community, expanding the sample through personal networks.

Quota Sampling

In this technique, the researcher starts by identifying quotas, which are predefined control categories such as age, gender, education, or religion. Then, the population is divided into several categories according to the control category. Lastly, the researcher collects the sample, which has the same proportions as the given population.

Example: A university researcher is studying student performance and wants to ensure representation from different academic majors. The researcher sets a quota to survey 200 students: 50 from Mathematics, 50 from Humanities, 50 from Business, and 50 from Arts. They recruit participants until all major categories meet the specified quota.

Volunteer Sampling

In this technique, the sample units are those who volunteered (on their own initiative) to participate in the study. This is usually utilized in qualitative studies.

Example: A university professor posts an advertisement for volunteers to participate in a study on sleep patterns. Interested students sign up on their own initiative.

Estimating the Population Mean

A parameter is any numerical measure that describes the whole population.
Since parameters are numerical descriptions of the population, calculating a parameter requires gathering data for the whole population. This is often infeasible, especially if the population size is quite large (such as a whole community). Parameters are therefore usually estimated using sample statistics, assuming that the sample is randomly selected.

There are two types of estimates: the point estimate and the interval estimate.

Point Estimate for a Parameter

This refers to a single value that best determines the proposed parameter value of the population. It is a single number designed to estimate a quantitative parameter for the population, usually the value of the corresponding sample statistic.

Interval Estimate for a Parameter

This gives the range of values within which the parameter value possibly falls. It is an interval bounded by two values and used to estimate the value of a population parameter. The values that bound this interval are statistics calculated from the sample that is being used as the basis of estimation.

Level of Confidence

This is denoted by 1 − α, where α represents the level of significance. It is the proportion of all interval estimates that include the parameter being estimated. Common choices for confidence levels are 90%, 95%, and 99%.

Properties of Good Estimator

Estimators are sample measures that are used to estimate population parameters.
1. Unbiasedness
    ○ Any parameter estimate can be considered a random variable since its value may change depending on the selection of the members of the sample.
    ○ An estimate is said to be unbiased if the expectation of all the estimates taken from samples is shown to be equal to the parameter being estimated.
2. Consistency
    ○ Consistency of an estimator is achieved when the estimate produces a relatively small standard error (standard deviation).
    ○ This may be done by increasing the size of the sample used to estimate the population parameter.
    ○ By the Central Limit Theorem (CLT), "as the sample size gets larger, the sampling distribution will closely resemble the normal distribution."
3. Efficiency
    ○ Among all unbiased estimators of the population parameter, the efficient estimator is the one that gives the smallest variance.

Estimating Mean with Known SD

The assumption for estimating the mean with a known standard deviation is that the sampling distribution of sample means has a normal distribution. The confidence interval is:

$$\bar{x} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) \quad \text{to} \quad \bar{x} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$

$\bar{x}$ is the point estimate for the population mean, and $z_{\alpha/2}$ is the confidence coefficient.
$\frac{\sigma}{\sqrt{n}}$ is the standard error of the mean.
$z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$ is the maximum error of estimate.
The left side is the lower confidence limit, while the right side is the upper confidence limit.

Confidence Coefficient

It is the number of multiples of the standard error needed to formulate an interval estimate of the correct width to have a level of confidence of 1 − α.

Confidence Interval: Five Steps

1. The Set-Up
    a. Describe the population parameter of interest.
2. The Confidence Interval Criteria
    a. Check the assumptions.
    b. Identify the probability distribution to be used.
    c. State the level of confidence, 1 − α.
3. The Sample Evidence
    a. Collect the sample information.
4. The Confidence Interval
    a. Determine the confidence coefficient.
    b. Find the maximum error of estimate.
    c. Find the lower and upper confidence limits.
5. The Results
    a. State the confidence interval.

Estimating Mean with Unknown SD

The assumption for estimating the mean with an unknown standard deviation is that the sampled population is normally distributed. In this case, the Student's t-distribution is used, which has the following properties:
t is distributed with a mean of zero.
t is distributed symmetrically about its mean.
t is distributed so as to form a family of distributions, a separate distribution for each different number of degrees of freedom.
The t-distribution approaches the standard normal distribution as the degree of freedom increases.
t is distributed with a variance greater than 1, but as the degree of freedom increases, the variance approaches 1.
t is distributed so as to be less peaked at the mean and thicker at the tails than is the normal distribution.

Degree of Freedom

This is a parameter that identifies each different distribution in the family of Student's t-distributions. It is computed as the sample size n minus 1. That is, df = n − 1.

Determining Sample Size

To determine the sample size needed given the population standard deviation σ and the maximum error of estimate E:

$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$$

where $z_{\alpha/2}$ is the confidence coefficient.

Population Proportion and Correlation

Suppose you want to determine the percentage of brown-eyed Filipinos as of 2015. Instead of getting data for all Filipinos, you consider only 100 randomly selected Filipinos and determine the number of brown-eyed people among them. Suppose 28 out of the 100 people are brown-eyed. You can "estimate" that 28% of all Filipinos are brown-eyed.

Confidence Interval Procedure For Estimation for the Population Proportion

Examples

Language of Hypothesis Testing

Statistical Hypothesis Test

This refers to the process by which a decision is made between two opposing hypotheses:
1. The two opposing hypotheses are formulated so that each hypothesis is the negation of the other. (That way one of them is always true, and the other one is always false.)
2. Then one hypothesis is tested in hopes that it can be shown to be a very improbable occurrence, thereby implying the other hypothesis is likely the truth.

Null Hypothesis

This refers to the hypothesis being tested. Generally, this is a statement that a population parameter has a specific value. The null hypothesis is so named because it is the starting point for the investigation. The phrase "there is no difference" is often used in its interpretation.

Alternative Hypothesis

This refers to a statement about the same population parameter that is used in the null hypothesis.
Generally, this is a statement that specifies that the population parameter has a value different from the value given in the null hypothesis. The rejection of the null hypothesis will imply the likely truth of this alternative hypothesis.

Example: A researcher is testing a new design for airbags used in automobiles, and he is concerned that they might NOT open properly.
Ho: The airbags open properly.
Ha: The airbags do not open properly.

Errors in Decisions

1. Type I error
    ○ This refers to the error committed when a true null hypothesis is rejected. That is, the null hypothesis is true but was decided against.
2. Type II error
    ○ This refers to the error committed when a false null hypothesis is accepted. That is, the null hypothesis is false but was decided in favor of.

The level of significance α refers to the probability of committing the type I error. It is usually set to 0.01 for experiments and medical research, and 0.05 for surveys.

Test Statistic

This is a random variable whose value is calculated from the sample data and is used in making the decision "accept Ho" or "reject Ho".

Probability Value or p-value

This is the probability that the test statistic could be the value it is or a more extreme value (in the direction of the alternative hypothesis) when the null hypothesis is true.

Conclusion

Hypothesis Testing for Mean

Also known as the z-test for mean, this hypothesis testing procedure is used if the standard deviation of the population is given or the sample size is large enough (more than 30 samples).

P-value approach
Classical approach

Known SD

The assumption is that the sampling distribution of sample means has a normal distribution.

Two-Tailed Situation

One-Tailed Situation

Unknown SD

Also known as the t-test for mean, this hypothesis testing procedure is used if the standard deviation of the population is not given or the sample size is NOT large enough (100 samples and below). The assumption is that the sampled population is normally distributed.
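The two procedures above differ mainly in how the test statistic is computed. The sketch below assumes the standard textbook forms, z = (x̄ − μ₀) / (σ/√n) when the population SD is known and t = (x̄ − μ₀) / (s/√n) with df = n − 1 when it is not; the notes name these formulas but do not reproduce them. The function names, the symbol mu0 for the hypothesized mean, and all data are hypothetical. Only the z-based p-value is computed, since the Python standard library has no t-distribution.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_statistic(sample_mean, mu0, sigma, n):
    """Test statistic when the population standard deviation sigma is known."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


def t_statistic(sample, mu0):
    """Test statistic and degrees of freedom when sigma is unknown;
    the sample standard deviation s is used in its place."""
    n = len(sample)
    t = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    return t, n - 1


def two_tailed_p_value_z(z):
    """Area in both tails of the standard normal curve beyond |z|."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical known-SD case: sample mean 52, hypothesized mean 50, sigma 6, n = 36.
z = z_statistic(sample_mean=52.0, mu0=50.0, sigma=6.0, n=36)
p_value = two_tailed_p_value_z(z)

# P-value approach: reject Ho when the p-value does not exceed alpha.
alpha = 0.05
reject_null = p_value <= alpha

# Hypothetical unknown-SD case with a small sample.
t, df = t_statistic([9.0, 10.0, 11.0, 12.0, 13.0], mu0=10.0)

print(z, p_value, reject_null, t, df)
```

For the t case, the statistic would be compared against a critical value from a t-table at the computed degrees of freedom.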
One-tailed Situation

Two-tailed Situation

Summary

Linear Correlation

Bivariate Data

These are the values of two different variables that are obtained from the same population element.
Examples:
    ○ Gender and Strand in SHS (both variables are qualitative/attribute)
    ○ School and Their Academic Performance (one qualitative/attribute and one quantitative/numerical)
    ○ Weekly Allowance of the Students and their average grade (both variables are quantitative/numerical)

The Scatter Plot

The scatter plot is one way to graph bivariate data.

Linear Correlation Coefficient r

This represents the numerical measure of the strength of the relationship between two variables. The coefficient always has a value between -1 and 1.
Perfect negative correlation: r = −1
No correlation: r = 0
Perfect positive correlation: r = 1

The primary purpose of linear correlation is to measure the strength of a linear relationship between two random variables. In linear correlation, there is the input variable x and the output variable y.
No correlation means that if the input variable x increases, there is no definite shift in the values of the output variable y.
Positive correlation means that if the input variable x increases, the values of the output variable y tend to increase.
Negative correlation means that if the input variable x increases, the values of the output variable y tend to decrease.

Examples

Two variables:
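To make the linear correlation coefficient r concrete, here is a minimal sketch that assumes the standard Pearson formula, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). The function name and both data sets are hypothetical and are not the examples from the original notes.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's linear correlation coefficient for paired values xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return ss_xy / sqrt(ss_x * ss_y)


# Hypothetical bivariate data: weekly allowance (x) and average grade (y).
allowance = [100, 200, 300, 400, 500]
grade = [80, 82, 85, 87, 91]
r_positive = pearson_r(allowance, grade)

# Hypothetical data lying exactly on a falling line: perfect negative correlation.
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])

print(r_positive, r_negative)
```

A value of r near 1 for the first data set reflects y tending to increase with x, while the second data set gives r = −1 because every point lies on the same decreasing line.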
