Stats Exam Notes
Summary
These notes cover fundamental statistical concepts, including types of data, measures of centre and spread (mean, median, mode, range, variance, and standard deviation), skewness, random variables and probability distributions, the Normal distribution, correlation and regression, hypothesis testing, time series, and net present value.
**STATS EXAM NOTES**

**TYPES OF DATA**

Numerical or Quantitative Data: data that takes numeric values.

Categorical or Qualitative Data: data that does not take numeric values, but can be classified into distinct categories.

Independence: two events are independent if the probability of one event occurring is unaffected by the outcome of the other event.

Nominal/Actual/Current Prices: the $ value of a series as actually measured at each point in time.

Real/Constant Prices: the value a series would have taken if prices had remained fixed at some point in history - the 'base' period (March 1990 in Figure 2.2).

Price Index (e.g. Consumer Price Index (CPI)): a weighted average of prices of goods and services, indexed to 100 in the 'base' period (1990 in Figure 2.2).

**MEASURES OF SPREAD**

Mean (often referred to as the 'average'): calculated as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Median: another measure of the central tendency of the data. The median is the number such that 50% of values are equal to or higher than it, and 50% are equal to or lower. In other words, the median is the 'middle' or 'typical' value.

Mode: the most commonly occurring value. Note that if values do not repeat very often, the mode is often not very interesting: a couple of repeated values may be the mode yet lie nowhere near the centre of the data.

**SKEWNESS**

[Diagram: a symmetric (normal) distribution compared with skewed distributions.]

**Range:** the difference between the maximum and the minimum value.
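As a quick sketch, the measures above can be computed with Python's standard library (the data values here are made up for illustration):

```python
# Mean, median, mode and range for a small, hypothetical dataset.
import statistics

data = [2, 4, 4, 5, 7, 9]

mean = sum(data) / len(data)          # the 'average'
median = statistics.median(data)      # the 'middle' value
mode = statistics.mode(data)          # the most common value
value_range = max(data) - min(data)   # maximum minus minimum

print(mean, median, mode, value_range)
```

With an even number of values, `statistics.median` returns the average of the two middle values, matching the "50% above, 50% below" definition.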
**The sample variance**

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Interpreting the standard deviation is important. In simple terms, we may say that the standard deviation measures "the average amount that the values vary above and below the mean". More precisely, it is "the square root of the average of the squared deviations from the mean". This is a bit complicated for everyday use, so we often explain the standard deviation using the simple interpretation.

**The Interquartile Range is the measure of spread associated with the quartiles: take Q3 - Q1, and you have it. It measures the spread of the "middle 50% of the data".**

The probability of an event (a particular outcome in some process or phenomenon, such as drawing a household that earns between $30,001 and $40,000 p.a.) is usually understood as the proportion of times that event would occur if the process were repeated many times.

**A random variable is a characteristic or quantity of interest that can take a range of values. Its probability distribution is the set of all possible values of that variable together with the probability of each of those values occurring.**

**To be a proper probability distribution, the outcomes/events listed should have two characteristics:**

1. Mutually exclusive: no two outcomes in the list can be true at the same time. e.g. If we get a head, we can't get a tail. This is true in the example above, as the income ranges/categories do not overlap.

2. Exhaustive: the list includes all possible outcomes. e.g. We must get either a head or a tail; there are no other possibilities. This is true in the example above, because the ranges/categories cover all possible values. This is why the sum of probabilities in a probability distribution always equals one.

The Normal Distribution: the total area under the curve equals 1. For the standard Normal distribution, the mean is 0 and the variance is 1.
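The sample variance, standard deviation, and interquartile range described earlier in this section can be sketched as follows (hypothetical data):

```python
# Sample variance (n - 1 denominator), standard deviation, and IQR.
import statistics

data = [2, 4, 4, 5, 7, 9]
n = len(data)
xbar = sum(data) / n

# Sample variance: sum of squared deviations from the mean, over n - 1
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
s = s2 ** 0.5                          # standard deviation

# Quartile cut points; the IQR is Q3 - Q1, the spread of the middle 50%
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(s2, s, iqr)
```

Note that `statistics.quantiles` defaults to the "exclusive" interpolation method, so small datasets may give slightly different quartiles than other conventions.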
What this curve illustrates is the relative likelihood of the random variable X taking different values, shown on the x-axis. To standardise a value, compute the z-score:

$$z = \frac{X - \mu}{\sigma}$$

A scatter plot is a graph that allows us to see visually how two numerical variables relate to each other.

The covariance is the average of the products of paired deviations from the mean for the two variables, X and Y:

$$s_{XY} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

- Summing mostly positive products gives a covariance that is positive.
- The covariance indicates both the strength and direction of the linear association between two variables.
- A positive covariance indicates that when X is big, it is more likely that Y will be big; similarly, when X is small there is a higher probability of Y being small. i.e. X and Y tend to move together, and this is captured by a positive covariance.

The correlation is a standardised measure: its values range from -1 (perfect negative correlation) to +1 (perfect positive correlation). It is not influenced by the scale of the variables (i.e. it will not change if we start measuring income in thousands of dollars):

$$r = \frac{s_{XY}}{s_X s_Y}$$

We refer to the line capturing the relationship between X and Y as a Regression Line. The simple linear regression model is a mathematical function which describes how one variable (Y) changes in response to another (X):

$$Y_i = \beta_0 + \beta_1 X_i + e_i$$

The key components of the model are: Y is called the dependent variable (the thing you want to influence); X is called the independent or explanatory variable (the thing you can control).
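The covariance and correlation formulas above can be sketched directly from their definitions (the paired data here is hypothetical):

```python
# Covariance and correlation for two small, made-up variables X and Y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Covariance: average product of paired deviations (n - 1 denominator)
cov = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

# Correlation: covariance standardised by the two standard deviations
sx = (sum((xi - xbar) ** 2 for xi in x) / (n - 1)) ** 0.5
sy = (sum((yi - ybar) ** 2 for yi in y) / (n - 1)) ** 0.5
r = cov / (sx * sy)

print(cov, r)
```

Here the covariance is positive and the correlation is close to +1, reflecting that X and Y tend to move together.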
β0 and β1 are the population parameters which we are most interested in and want to estimate. e_i is the error - how far the observed value of Y_i is from the regression line.

Confidence intervals: "We are 95% confident that the actual average income is between $41,000 and $49,000."

**THE POPULATION AND THE SAMPLE**

A Representative Sample is Key: we need to ensure that our sample is representative of the population. When a sample produces results which are not representative of the population, we say that the sample is biased. In particular, there are two important sources of bias in sample design.

Selection Bias: relying on voluntary participation, as in the COVID-testing example above, is an example of selection bias. Voluntary participation is a common sampling method, and is almost always a cause of large selection bias: e.g. student evaluations of teaching at university, product feedback from your online purchases, etc. Those who respond are usually not a typical cross-section of the wider population.

How do we ensure that our sample is representative? The key is randomness: selecting the sample in some kind of random way. This reduces the possibility that the sample may not represent the population well. It means relying less on voluntary participation, and targeting a random sample of people from the population.

Expected value of the sample mean: $E(\bar{X}) = \mu$.

How do we construct this confidence interval? We have already seen that we can work out probabilities about $\bar{X}$ if we know μ and σ². We can also write down probability statements like the following:

$$P\left(\mu - 1.96\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

Rearranging gives the 95% confidence interval $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$.

**HYPOTHESIS TESTING**

Step 1: Formulate the null (H0) and alternative (H1) hypotheses.

Step 2: Decide a significance level - we will talk more about this later.
Essentially, the significance level is the level of tolerance for how much evidence we need before we will reject the current/conservative view.

Step 3: Use software to calculate a p-value. More about this later too, but essentially this is a probability that we use to decide whether to reject the null view in favour of the alternative.

Step 4: Make a decision by comparing the p-value with the significance level. If the p-value is less than the significance level, we reject H0 and conclude in favour of the alternative hypothesis.

p-value: the probability of getting a value of the sample statistic as far or further from H0 than the one we observe in our sample, if H0 were true.

Type I Error: rejecting H0 when H0 is in fact true.
Type II Error: not rejecting H0 when H0 is actually false.

- Consider what happens as we decrease the significance level. This seems like a good thing to do, as it will decrease the chance of rejecting H0 when it is true (a Type I error). However, there is a downside: as we reduce the probability of a Type I error, we increase the chance of a Type II error. Why?
- A smaller α means we need a smaller p-value before we have enough evidence to reject H0, i.e. we need more convincing evidence against H0 before we will reject it. So it makes sense that if H0 is actually false, we are less likely to correctly reject H0 in favour of H1, i.e. we are more likely to make a Type II error.

**MULTIPLE LINEAR REGRESSION**

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + e_i$$

β̂0 is known as the intercept - the estimated value of Y when X1 = 0 and X2 = 0. β̂1 and β̂2 are the slopes of Y with respect to X1 and X2 - they estimate the change in Y for a 1-unit change in each of the variables.
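A minimal sketch of estimating such a model by least squares, using hypothetical data. The two predictors below are deliberately constructed to be uncorrelated after centering, so each multiple-regression slope reduces to the simple-regression formula; real data generally needs a linear-algebra solve instead:

```python
# Least-squares estimates for Yi = b0 + b1*Xi1 + b2*Xi2 (made-up data).
x1 = [1, 2, 3, 4]
x2 = [1, -1, -1, 1]          # uncorrelated with x1 after centering
y = [10, 3, 6, 19]           # generated exactly as y = 2 + 3*x1 + 5*x2

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    # Simple-regression slope: sum of deviation products over sum of
    # squared deviations; valid per-predictor here because x1 and x2
    # are orthogonal once centred.
    xb, yb = mean(x), mean(y)
    num = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    den = sum((xi - xb) ** 2 for xi in x)
    return num / den

b1 = slope(x1, y)
b2 = slope(x2, y)
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)   # intercept
print(b0, b1, b2)
```

Because the y values were generated without noise, the estimates recover the true coefficients exactly.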
R² equals the square of the sample correlation coefficient between the actual Y values and those predicted by the model, Ŷ.

- R² will be a value between 0 and 1. It measures the proportion of the total variation in Y that the model has been able to explain. A value of R² close to 1 indicates that the model has explained a large proportion of the variation in Y, and hence is a very good model; a value of R² close to zero indicates a poor model - not much of Y has been explained.

The standard error provides another measure of how good our model is. It is simply the standard deviation of the error term in the model.

- Positive errors (where the model predicts a Y value smaller than what actually happened) cancel out with the negative errors, so the average error is zero. This means the error standard deviation is just the square root of the average of the squared errors. More loosely, the standard error gives us an estimate of the magnitude of the average error the model will produce.

**TIME SERIES**

Data over time is called a time series: a series of observations on some variable of interest over a sequence of time periods.

- Long Term Trend: a trend is a persistent, long-term upward or downward pattern of movement, usually lasting several years. The source of such a trend might be gradual, ongoing changes in technology, population, wealth, etc.
- Short Term Seasonal: a seasonal pattern is a regular pattern of fluctuations that occurs within each year and tends to repeat year after year. More generally, data can follow a regular repeated pattern over some defined period, like a year, a month, a week or a day.

- Random/Irregular: this component represents whatever is 'left over' after identifying the other systematic components. It represents the random, unpredictable fluctuations in the data; there is no pattern to the irregular component.

**NET PRESENT VALUE**

Net Present Value: the discounted value of money, or a cashflow, back to some base period:

$$NPV = \sum_{t=0}^{T} \frac{CF_t}{(1+r)^t}$$

- r is the interest rate used to discount future cashflows back to present values.
- If the NPV is positive, it indicates that the project can generate value.
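The NPV formula above can be sketched as a short function. The cashflows and discount rate here are hypothetical; the period-0 entry is the initial outlay, entered as a negative number:

```python
# Net present value: discount each cashflow CF_t by (1 + r)**t and sum.
def npv(rate, cashflows):
    """NPV of cashflows, where cashflows[t] occurs at period t."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

project = [-1000, 400, 400, 400]   # outlay now, then three annual inflows
value = npv(0.05, project)         # discounting at 5% per period
print(round(value, 2))
```

At a 5% discount rate the NPV is positive, so under this model the project generates value; at a high enough rate the same cashflows would give a negative NPV.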