Week 2 - Statistical Concepts - BT22203 Econometrics PDF
Universiti Malaya
Shariff Umar bin Shariff Abd. Kadir
Summary
These lecture notes cover Week 2 of BT22203 Econometrics, focusing on statistical concepts like central tendency (mean, median, mode), variability (range, variance, standard deviation), and probability distributions (including normal). The notes also explore the importance of descriptive statistics in data analysis and how to use sampling distributions. The content includes various examples and tables.
Full Transcript
BT22203 ECONOMETRICS
Week 2: Review of Statistical Concepts
Shariff Umar bin Shariff Abd. Kadir, PhD

Learning Objectives
In this chapter, you will learn several statistical concepts, e.g.:
1. Descriptive statistics: mean, variance, skewness
2. Probability and sampling distributions
3. Hypothesis testing and confidence intervals

What is Descriptive Statistics?

Descriptive Statistics Types
Descriptive statistics can be classified into two types:
1. Measures of Central Tendency
2. Measures of Variability (or Dispersion)

Measures of Central Tendency
These provide a summary statistic that represents the center point or typical value of a dataset. The most common measures of central tendency are the mean (average), median (middle value), and mode (most frequent value).
Mean: The average value of the dataset, obtained by adding all the data points and dividing by the number of data points N (μ = Σx / N).
Median: The middle value of the dataset, obtained by ordering all data points and picking out the one in the middle (or the average of the two middle numbers if the dataset has an even number of observations).
Mode: The most frequently occurring value in the dataset.

Measures of Variability (or Dispersion)
These measures describe the spread or variability of the data points in the dataset. There are three main types:
1. Range: The difference between the largest and smallest values in the dataset (Range = max − min).
2. Variance: The average of the squared differences from the mean (σ² = Σ(x − μ)² / N).
3. Standard Deviation: The square root of the variance, giving a measure of dispersion that is in the same units as the original dataset (σ = √σ²).

Importance of Descriptive Statistics
Data Summarization: Descriptive statistics provide simple summaries about the measures and samples you have collected. With a large dataset, it is often difficult to identify patterns or tendencies just by looking at the raw data. Descriptive statistics provide numerical and graphical summaries that can highlight important aspects of the data.
Identification of Patterns and Trends: Descriptive statistics can help you identify patterns and trends in the data, providing valuable insights. Measures like the mean and median can tell you about the central tendency of your data, while measures like the range and standard deviation tell you about the dispersion.
Data Comparison: By summarizing data into measures such as the mean and standard deviation, it is easier to compare different datasets or different groups within a dataset.

Probability Distributions
No doubt you are familiar with terms such as probability, chance, and likelihood. They are often used interchangeably. A probability is frequently expressed as a decimal, such as 0.70, 0.27, or 0.50. However, it may be given as a fraction, such as 7/10, 27/100, or 1/2. It can assume any number from 0 to 1. Expressed as a percentage, the range is between 0% and 100%. The closer a probability is to 0, the more improbable it is that the event will happen. The closer the probability is to 1, the more likely it is to happen.
The important characteristics of a probability distribution are:
1. The probability of a particular outcome is between 0 and 1 inclusive.
2. The sum of the probabilities of the outcomes is equal to 1.

Normal Probability Distributions
The normal probability distribution has the following major characteristics:
1. It is bell-shaped and has a single peak at the centre of the distribution. The arithmetic mean, median, and mode are equal and located in the centre of the distribution. The total area under the curve is 1.00. Half the area under the curve is to the right of this centre point, and the other half is to the left of it.
2. It is symmetric about the mean. If we cut the normal curve vertically at the central value, the two halves will be mirror images. The area of each half is 0.5.
3. It falls off smoothly in either direction from the central value. Also, the distribution is asymptotic, meaning that the curve gets closer and closer to the X-axis but never actually touches it. To put it another way, the tails of the curve extend indefinitely in both directions.

[Figure: Normal probability distributions with equal means but different standard deviations. As the standard deviation gets smaller, the distribution becomes narrower and more "peaked."]

The Standard Normal Probability Distribution
The number of normal distributions is unlimited, each having a different mean (μ), standard deviation (σ), or both. One member of the family can be used to determine the probabilities for all normal probability distributions. It is called the standard normal probability distribution, and it is unique because it has a mean of 0 and a standard deviation of 1.
Any normal probability distribution can be converted into a standard normal distribution by subtracting the mean from each observation and dividing this difference by the standard deviation. The results are called z values or z scores. So, a z value is the distance from the mean, measured in units of the standard deviation.
In terms of a formula:
STANDARD NORMAL VALUE: z = (x − μ) / σ
where:
x is the value of any particular observation or measurement.
μ is the mean of the distribution.
σ is the standard deviation of the distribution.

The Empirical Rule
If a random variable is normally distributed, then:
1. About 68% of the observations will lie within plus and minus one standard deviation of the mean. This can be written as μ ± 1σ.
2. About 95% of the observations will lie within plus and minus two standard deviations of the mean, written as μ ± 2σ.
3. Practically all, or 99.7% of the observations, will lie within plus or minus three standard deviations of the mean, written as μ ± 3σ.
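The z value formula and the Empirical Rule above can be checked numerically. The following is a minimal sketch, not part of the original slides: it relies on Python's standard-library statistics.NormalDist, the helper names z_value and area_within are made up for illustration, and the observation, mean, and standard deviation passed to z_value are hypothetical numbers.

```python
from statistics import NormalDist


def z_value(x, mu, sigma):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mu) / sigma


def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)


# Hypothetical observation, mean, and standard deviation (not from the notes).
z = z_value(110.0, 100.0, 5.0)
print(f"z = {z:.2f}")

# Share of observations within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(f"within mu +/- {k} sigma: {area_within(k):.4f}")
```

The three areas come out at roughly 0.6827, 0.9545, and 0.9973, which is where the "about 68%", "about 95%", and "99.7%" figures in the Empirical Rule come from.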
Example
As part of its quality control program, the Autolite Battery Company conducts tests on battery life. For a particular D cell alkaline battery, the mean life is 19 hours. The useful life of the battery follows a normal distribution with a standard deviation of 1.2 hours. Answer the following questions:
1. About 68% of the batteries failed between what two values?
2. About 95% of the batteries failed between what two values?
3. Nearly all of the batteries failed between what two values?

Solution
We can use the Empirical Rule to answer these questions.
1. About 68% of the batteries will fail between 17.8 and 20.2 hours, found by: 19.0 ± 1(1.2) hours.
2. About 95% of the batteries will fail between 16.6 and 21.4 hours, found by: 19.0 ± 2(1.2) hours.
3. Nearly all will fail between 15.4 and 22.6 hours, found by: 19.0 ± 3(1.2) hours.

Sampling Methods
What is sampling?
A sample is a portion or part of the population of interest. Sampling is the process of selecting items from a population so that we can use this information to make judgments or inferences about the population.

Sampling Distribution of the Sample Mean
When we use the sample mean to estimate the population mean, how can we determine how accurate the estimate is?
Example: Can a researcher make an accurate prediction about the impact of inflation on consumer demand based on a sample of 2000 urban people out of a total population of millions?
To answer these questions, we first develop a sampling distribution of the sample mean (a probability distribution of all possible sample means of a given sample size).

Example: Hourly Earnings of the Production Employees of Schauer Industries

Employee    Hourly Earnings ($)
Joe         19
Sam         19
Haley       18
Matt        18
Monida      18
Colton      17
Tom         13

1. What is the population mean?
2. What is the sampling distribution of the sample mean for samples of size 2?
3. What is the mean of the sampling distribution?
4. What observations can be made about the population and the sampling distribution?

1. The population mean is $17.43, found by:
μ = ($19 + $19 + $18 + $18 + $18 + $17 + $13) / 7 = $122 / 7 = $17.43
We identify the population mean with the Greek letter μ.
2. To arrive at the sampling distribution of the sample mean, we need to select all possible samples of 2 without replacement from the population, then compute the mean of each sample. There are 21 possible samples, found by using the combination formula: 7! / (2! × 5!) = 21.
3. The mean of the sampling distribution of the sample mean is obtained by summing all sample means and dividing the sum by the number of samples. The mean of all the sample means is usually written μx̄. The μ reminds us that it is a population value because we have considered all possible samples. The subscript x̄ indicates that it is the sampling distribution of the sample mean.

Confidence intervals
A point estimate is a single value (point) derived from a sample and used to estimate a population value. For example, suppose that we select a sample of 50 junior executives and ask how many hours they worked last week. Then we compute the mean of this sample of 50 and use the value of the sample mean as a point estimate of the unknown population mean.
But a point estimate is a single value. A more informative approach is to present a range of values in which we expect the population parameter to occur. Such a range of values is called a confidence interval.
Common examples are the 95% confidence interval and the 99% confidence interval. The 95% and 99% are the levels of confidence and refer to the percentage of similarly constructed intervals that would include the parameter being estimated, in this case μ, the population mean. For a 95% interval, the middle 95% of the sampling distribution is used, so the remaining 5% is divided equally between the two tails.

Hypothesis testing
What is a Hypothesis?
A hypothesis is a statement about a population. Data are then used to check the reasonableness of the statement.
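Returning briefly to the Schauer Industries example in the sampling distribution section above: the enumeration of all samples of size 2 can be reproduced with a short script. This is an illustrative sketch only, not the author's worked table. It uses Python's standard library, and the variable names are chosen here for readability.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

# Hourly earnings ($) of the seven production employees, as listed in the notes.
earnings = [19, 19, 18, 18, 18, 17, 13]

population_mean = mean(earnings)

# Every possible sample of 2 employees, selected without replacement.
samples = list(combinations(earnings, 2))
sample_means = [sum(s) / 2 for s in samples]

# Sampling distribution: each distinct sample mean and how often it occurs.
distribution = Counter(sample_means)
mean_of_sample_means = mean(sample_means)

print(f"number of samples: {len(samples)}")
print(f"population mean: {population_mean:.2f}")
print(f"mean of the sample means: {mean_of_sample_means:.2f}")
for value in sorted(distribution):
    probability = distribution[value] / len(samples)
    print(f"sample mean {value:5.2f}: {distribution[value]} samples, probability {probability:.4f}")
```

The script finds 21 samples, and the mean of the 21 sample means equals the population mean of $17.43. The sample means range only from 15.0 to 19.0, a narrower spread than the population values (13 to 19).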
Example: A hypothesis suggested by the model of demand and supply: an increase in the price of gasoline will reduce the quantity of gasoline consumers demand.
In statistical analysis, we make a claim (that is, state a hypothesis), collect data, then use the data to test the claim. Thus, the question is: how do we test the hypothesis?

Five-Step Procedure for Testing a Hypothesis

Step 1: State the Null Hypothesis (H0) and the Alternative Hypothesis (H1)
It is important to remember that no matter how the problem is stated, the null hypothesis will always contain the equal sign. The equal sign (=) will never appear in the alternative hypothesis. Why? Because the null hypothesis is the statement being tested, and we need a specific value to include in our calculations. We turn to the alternative hypothesis only if the data suggest that the null hypothesis is not true.
Example: Refer back to the model of demand and supply, which states that an increase in the price of gasoline will reduce the quantity of gasoline consumers demand.
H0: An increase in the price of gasoline will reduce the quantity of gasoline consumers demand.
H1: An increase in the price of gasoline will not reduce the quantity of gasoline consumers demand.
Take note: If the null hypothesis is not rejected on the basis of the sample data, we cannot say that the null hypothesis is true. To put it another way, the null hypothesis not being rejected does not prove that H0 is true; it means that we have failed to disprove H0. To prove without any doubt that the null hypothesis is true, the population parameter would have to be known. To actually determine it, we would have to test, survey, or count every item in the population. This is usually not feasible. The alternative is to take a sample from the population.

Step 2: Select a Level of Significance
The level of significance is designated α, the Greek letter alpha. It is also sometimes called the level of risk. This may be a more appropriate term because it is the risk you take of rejecting the null hypothesis when it is really true.
There is no one level of significance that is applied to all tests. The 0.05 level (often stated as the 5% level), the 0.01 level, and the 0.10 level are the most common levels of significance. Traditionally, the 0.05 level is selected for consumer research projects, 0.01 for quality control, and 0.10 for political polling. You, the researcher, must decide on the level of significance before formulating a decision rule.

Step 3: Select the Test Statistic
There are many test statistics, such as the z test statistic, the t test statistic, the F test statistic, and χ² (chi-square).

Step 4: Formulate the Decision Rule
A decision rule is a statement of the specific conditions for which the null hypothesis is rejected and of the conditions under which it is not rejected.
The region or area of rejection defines the location of all those values that are so large or so small that the probability of their occurrence for a true null hypothesis is rather remote.
Example: Sampling Distribution of the Statistic z, a Right-Tailed Test, 0.05 Level of Significance.
Note in the chart that:
1. The area where the null hypothesis is not rejected is equal to or to the left of 1.645. We will explain how to get the 1.645 value shortly.
2. The area of rejection is to the right of 1.645.
3. A one-tailed test is being applied. (This will also be explained later.)
4. The 0.05 level of significance was chosen.
5. The sampling distribution of the statistic z follows the normal probability distribution.
6. The value 1.645 separates the regions where the null hypothesis is rejected and where it is not rejected.
7. The value 1.645 is the critical value.

Step 5: Make a Decision and Interpret the Result
The fifth and final step in hypothesis testing is to compute the test statistic, compare it with the critical value, make a decision to reject or not reject the null hypothesis, and interpret the results.
For the right-tailed test above, the decision rule is:
Computed z ≤ critical value: Do not reject H0 (we fail to reject H0; there is not enough evidence to reject H0).
Computed z > critical value: Reject H0.

Summary of the Steps in Hypothesis Testing
1. Establish the null hypothesis (H0) and the alternative hypothesis (H1).
2. Select the level of significance, that is, α.
3. Select an appropriate test statistic.
4. Formulate a decision rule based on steps 1, 2, and 3 above.
5. Make a decision regarding the null hypothesis based on the sample information. Interpret the results of the test.

One-Tailed and Two-Tailed Tests of Significance

Test            H0          H1
Right-tailed    μ ≤ 450     μ > 450
Left-tailed     μ ≥ 450     μ < 450
Two-tailed      μ = 450     μ ≠ 450

Note that the inequality sign in the alternative hypothesis points to the region of rejection. For example, > points to the region of rejection in the upper tail. Thus, for H1:
< = left tail
> = right tail
≠ = two tails

Thank you
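The five-step procedure and the tail rules above can be sketched in code. This is a hedged illustration rather than anything taken from the slides: it assumes a z test on a population mean with a known population standard deviation, using the usual statistic z = (x̄ − μ0) / (σ / √n), which the notes do not spell out. The function names critical_values and z_test, and the sample figures (sample mean 453, σ = 10, n = 50), are hypothetical; only the hypothesised mean of 450 and the 0.05 level come from the notes.

```python
from math import sqrt
from statistics import NormalDist


def critical_values(alpha, tail):
    """Critical z value(s) for a 'left', 'right', or 'two' tailed test."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "right":
        return (std_normal.inv_cdf(1 - alpha),)
    if tail == "left":
        return (std_normal.inv_cdf(alpha),)
    if tail == "two":
        upper = std_normal.inv_cdf(1 - alpha / 2)
        return (-upper, upper)
    raise ValueError("tail must be 'left', 'right', or 'two'")


def z_test(sample_mean, mu0, sigma, n, alpha, tail):
    """Return (computed z, reject H0?), assuming sigma is known."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    crit = critical_values(alpha, tail)
    if tail == "right":
        reject = z > crit[0]
    elif tail == "left":
        reject = z < crit[0]
    else:
        reject = z < crit[0] or z > crit[1]
    return z, reject


# Steps 1 and 2: H0: mu <= 450, H1: mu > 450, alpha = 0.05 (right-tailed).
# Step 3: z statistic. Step 4: reject H0 if z exceeds the critical value.
critical = critical_values(0.05, "right")[0]

# Step 5: hypothetical sample information (not from the notes).
z, reject = z_test(sample_mean=453, mu0=450, sigma=10, n=50, alpha=0.05, tail="right")
print(f"critical value: {critical:.3f}")
print(f"computed z: {z:.2f}")
print("Reject H0" if reject else "Do not reject H0")
```

With these made-up numbers the computed z is about 2.12, which lies to the right of the critical value 1.645, so H0 would be rejected. For the two-tailed case at the same level, the 5% is split between the tails and the critical values are about ±1.96.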