DS100 Final Exam 3
Summary
This document contains lecture notes on data science topics. It covers various concepts in data analysis, including if-statements, prediction, sampling, and distributions. The document also outlines hypothesis testing and the use of statistical models.
1. If-Statements and For Loops
If-statements run a block of code only when a condition is true. For example:

    x = 10
    if x > 5:
        print("x is greater than 5")

For loops repeat a block of code once for each element of a sequence. For example:

    for i in range(5):
        print(i)

In simulations, loops are often used to repeat experiments or generate many samples to model a random process (see the sketch after section 6).

2. Prediction
Example: predicting children's heights from the heights of their parents. We make predictions using statistical models, such as linear regression, which use the relationship between variables (like parental height) to predict another variable (child's height). A prediction is based on a model or past data; we use patterns in the data to estimate future outcomes.

3. Samples
Deterministic sample: every outcome is predictable and follows a fixed pattern (e.g., a sequence where the next value is always 1 greater than the previous one).
Random sample: each outcome has a certain probability of occurring, so there is uncertainty (e.g., drawing a card from a shuffled deck).
Convenience sample: a sample that is easy to collect, often non-random (e.g., surveying people you encounter at a mall).
tbl.sample(n) → a table of n rows picked at random, with replacement by default.
tbl.sample(n, with_replacement=False) → n rows drawn without replacement.
tbl.sample(with_replacement=False) → all rows drawn without replacement; can be used for shuffling!

4. Distributions
Probability distribution: shows how likely the different outcomes of a random situation are; it gives the chance of each possible outcome. For example, when rolling a fair die, the probability distribution says that each face (1, 2, 3, 4, 5, or 6) has the same chance of showing up: 1 out of 6.
Empirical distribution: shows how often the different outcomes actually occurred; it is based on real data or observations, not just theory. For example, if you roll a die 60 times and record how many times each number shows up, the empirical distribution shows the actual results of those 60 rolls, which might not be perfectly even (unlike the probability distribution) but gives a sense of how often each number appeared.
Creating an empirical distribution: collect data from real-world observations or experiments, then count how often each outcome occurs.

5. Laws
The Law of Large Numbers states that as the size of a random sample increases, the sample mean approaches the population mean. With more data, empirical distributions better approximate probability distributions. Example: if you flip a coin many times, the proportion of heads approaches 50% as the number of flips increases.
Law of Averages: the mistaken belief that if something hasn't happened for a while, it is more likely to happen soon.
Gambler's Fallacy: a misconception about chance in which past independent events are believed to affect future events; this is not true for random, independent trials.

6. Inference
Statistical inference: drawing conclusions about a population based on data from random samples.
Population: the entire set of individuals or items you are interested in.
Parameter: a number associated with the population (e.g., the average height of the US population).
Sample: a subset of the population, used to estimate the parameter.
Statistic: a number calculated from the sample (e.g., the average height of 10,000 sampled people). A statistic can be used as an estimate of a parameter.
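To tie sections 1, 4, and 5 together, here is a minimal simulation sketch (the variable names and roll count are illustrative, not from the course): a for loop rolls a fair die many times, and the resulting empirical distribution is compared against the probability distribution of 1/6 per face.

    import numpy as np

    np.random.seed(42)  # for reproducibility

    faces = np.arange(1, 7)   # the six faces of a fair die
    num_rolls = 6000

    # Repeat the experiment with a for loop, as in section 1.
    rolls = []
    for _ in range(num_rolls):
        rolls.append(np.random.choice(faces))
    rolls = np.array(rolls)

    # Empirical distribution: observed proportion of each face.
    # By the Law of Large Numbers, each should be close to 1/6 ≈ 0.167.
    for face in faces:
        print(face, round(np.mean(rolls == face), 3))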
7. Distribution of a Statistic
Probability distribution of a statistic: a theoretical distribution showing how a statistic (like the average of a sample) is expected to behave under a random process.
Empirical distribution of a statistic: based on the real data you get from actually doing the sampling; it shows what happened when you took many samples and calculated the statistic (like the sample mean) each time.

8. Models
A model is a set of assumptions about the data. In data science, many models involve assumptions about processes that involve randomness.
Steps in assessing a model (see the sketch after section 14):
1. Choose a statistic to measure the discrepancy between the model and the data.
2. Simulate the statistic under the model's assumptions.
3. Compare the data to the model's predictions: draw a histogram of the simulated values of the statistic, and compute the observed statistic from the real sample.
4. If the observed statistic is far from the histogram, that is evidence against the model.

9. Hypotheses
Null hypothesis (H₀): the hypothesis that there is no effect or difference.
Alternative hypothesis (H₁): the hypothesis that there is an effect or difference.
For example, testing whether a new drug has an effect on blood pressure:
  H₀: The drug has no effect.
  H₁: The drug lowers blood pressure.
We test these using data to decide whether to reject H₀ in favor of H₁.

10. Test Statistics
Test statistic: a numerical value used in hypothesis testing to decide whether to reject the null hypothesis (e.g., a t-statistic or z-statistic). It quantifies the difference between the observed data and what we would expect under the null hypothesis.
If the p-value is less than or equal to 0.05, reject the null hypothesis.
If the p-value is greater than 0.05, fail to reject the null hypothesis.
Total variation distance = sum(abs(differences)) / 2
Directional test: the hypothesis predicts a specific direction (e.g., "the drug lowers blood pressure").
Non-directional test: the hypothesis does not predict a direction (e.g., "the drug affects blood pressure").

11. Permutation Tests
In a permutation test, we shuffle (randomly rearrange) the data points to simulate the null-hypothesis scenario and calculate the statistic for each shuffle. Shuffling means reordering the data randomly and recalculating the statistic for each new permutation. This helps determine how likely the observed statistic is under the null hypothesis (see the sketch after section 14).

12. Sampling
tbl.sample: draws a random sample of rows from the table tbl; the output is a table consisting of the sampled rows.
np.random.choice: draws a random sample from a population whose elements are in an array; the output is an array consisting of the sampled elements.
sample_proportions: draws from a categorical distribution whose proportions are in an array; the output is an array consisting of the sampled proportions in all the categories.
(A comparison of these three calls appears after section 14.)

13. P-values
P-value: the probability of obtaining results as extreme as, or more extreme than, the observed results, assuming the null hypothesis is true.
P-value cutoff: a threshold (typically 0.05) that determines whether we reject or fail to reject the null hypothesis. A small p-value (less than 0.05) indicates strong evidence against the null hypothesis.

14. Percentiles
The pth percentile of a sorted array is the smallest element that is at least as large as p% of the elements. Example: for the 20th percentile of (1, 8, 10, 16, 20), 0.2 × 5 = 1, so take the 1st element, which is 1.
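The model-assessment steps in section 8 and the total variation distance in section 10 can be combined into one simulation. This is a minimal sketch assuming the datascience package's sample_proportions; the proportions and panel size are made-up illustrative numbers, not real data.

    import numpy as np
    from datascience import sample_proportions

    # Model: panelists are drawn at random from a population with these
    # category proportions (illustrative numbers only).
    model_proportions = np.array([0.26, 0.74])
    observed_proportions = np.array([0.08, 0.92])
    panel_size = 100

    def tvd(dist1, dist2):
        # Total variation distance between two categorical distributions.
        return np.sum(np.abs(dist1 - dist2)) / 2

    observed_tvd = tvd(observed_proportions, model_proportions)

    # Step 2: simulate the statistic under the model's assumptions.
    simulated_tvds = np.array([
        tvd(sample_proportions(panel_size, model_proportions), model_proportions)
        for _ in range(10_000)
    ])

    # Steps 3-4: if the observed TVD is far beyond the simulated values,
    # that is evidence against the model.
    p_value = np.mean(simulated_tvds >= observed_tvd)
    print(observed_tvd, p_value)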
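A minimal permutation-test sketch for section 11, using only numpy; the two groups and the difference-of-means test statistic are illustrative assumptions, not course data.

    import numpy as np

    np.random.seed(0)

    # Illustrative data: outcomes for a treatment group and a control group.
    treatment = np.array([6.2, 7.1, 6.8, 7.4, 6.9])
    control = np.array([5.9, 6.0, 6.3, 5.8, 6.1])

    def difference_of_means(values, group_size):
        return np.mean(values[:group_size]) - np.mean(values[group_size:])

    pooled = np.concatenate([treatment, control])
    observed = difference_of_means(pooled, len(treatment))

    # Shuffle the pooled data and recompute the statistic for each permutation.
    simulated = []
    for _ in range(10_000):
        shuffled = np.random.permutation(pooled)
        simulated.append(difference_of_means(shuffled, len(treatment)))

    # p-value: how often a shuffle produced a difference at least as large
    # as the observed one (a directional test).
    p_value = np.mean(np.array(simulated) >= observed)
    print(observed, p_value)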
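For section 12, a quick side-by-side of the three sampling calls, assuming the datascience Table library; the table contents are made up.

    import numpy as np
    from datascience import Table, sample_proportions

    tbl = Table().with_columns('value', np.array([1, 2, 3, 4, 5]))

    print(tbl.sample(3))                          # 3 rows, sampled with replacement
    print(tbl.sample(3, with_replacement=False))  # 3 distinct rows
    print(tbl.sample(with_replacement=False))     # all rows in shuffled order

    print(np.random.choice(np.array([1, 2, 3, 4, 5]), 10))  # array of sampled elements
    print(sample_proportions(100, np.array([0.5, 0.5])))    # proportions per category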
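The percentile rule in section 14 can also be written directly in code. The helper below is a hypothetical illustration of the rule, not a course-provided function.

    import numpy as np

    def percentile_smallest_at_least(p, values):
        # pth percentile: the smallest element that is at least as large
        # as p% of the elements.
        sorted_values = np.sort(np.array(values))
        k = int(np.ceil(p / 100 * len(sorted_values)))
        return sorted_values[k - 1]

    print(percentile_smallest_at_least(20, [1, 8, 10, 16, 20]))  # 1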
15. Randomized Controlled Experiments and Causality
A randomized controlled experiment randomly assigns participants to a treatment group (actual treatment) or a control group (placebo). This helps eliminate bias.
Causality: we can safely infer causality when the experiment is well controlled, because random assignment eliminates bias.
Bootstrapping: a technique for simulating repeated random sampling from a population using a single sample (see the sketch at the end of this section). Example: estimate the mean of a dataset by creating many resamples and averaging each one.
Confidence Intervals: a range where we expect the true population value to lie. To narrow a confidence interval, we can either lower the confidence level or increase the original sample size. Example: a 95% confidence interval for the mean could be [2, 4].
Center and Spread: "center" refers to the middle value (e.g., the mean), and "spread" refers to how much the data vary (e.g., the standard deviation).
Mean: the balance point of the histogram; need not be a value in the set; doesn't have to be an integer even if the data are integers; lies between the min and max but not necessarily halfway; has the same units as the data.
Median: the halfway point, with 50% of the data on each side of the histogram.
Mean = median only when the distribution is symmetric.
Example: the mean of [2, 4, 6] is 4, and the standard deviation is √(8/3) ≈ 1.63.
Standard Deviation: measures how far the data typically are from their average; it is the square root of the variance:
    SD = √( Σ (xᵢ − x̄)² / n )
Worked example for [1, 2, 3], where x̄ = 2:
    SD = √( ((1 − 2)² + (2 − 2)² + (3 − 2)²) / 3 ) = √(2/3) ≈ 0.816
Example: in [1, 2, 3], the standard deviation is about 0.82.
Variability of the Sample Average: the SD of the distribution of the sample average is
    SD of sample average = population SD / √(sample size)
Standard Units: values converted to have a mean of 0 and a standard deviation of 1 (z-scores). Example: with a mean of 160 cm and an SD of 10 cm, a height of 170 cm becomes z = (170 − 160) / 10 = 1.
Chebyshev's Inequality: for any distribution, at least 1 − 1/k² of the data lies within k standard deviations of the mean. For example, at least 75% of the data is within 2 standard deviations.
Central Limit Theorem (CLT): random sample averages are connected to the normal distribution. If the sample is large and drawn at random with replacement, then, no matter the shape of the population distribution, the probability distribution of the sample average (or sum) is roughly normal; the distribution of the sample average is roughly bell-shaped. The CLT applies only to the mean, sum, and proportion; for other statistics, like the median, we still have to bootstrap. Example: take many samples of size 30 from a non-normal population; the means of these samples will be roughly normal (see the sketch at the end of this section).
Normal Distribution: a bell-shaped distribution in which most values are near the mean.
Standard Normal Distribution: a normal distribution with a mean of 0 and a standard deviation of 1 (z-scores form a standard normal distribution).
    z = (value − mean) / SD
Normal Proportions: the percentage of data within a certain range of a normal distribution (about 68% of the data is within 1 standard deviation of the mean).
Distribution of the Sample Average: the distribution of the averages of all possible samples, which tends to be normal as the sample size increases; we can approximate it by an empirical distribution. Example: the average height of 100 random people will be approximately normal.
Prediction: estimating future values based on data; guessing the future (e.g., predicting someone's weight from their height using regression).
Association: a relationship or trend between two variables. Example: taller people tend to weigh more.
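A minimal bootstrap sketch for the Bootstrapping and Confidence Intervals entries above, using numpy; the sample values are made up.

    import numpy as np

    np.random.seed(1)

    # The one real sample (illustrative values).
    sample = np.array([2.1, 3.4, 2.8, 3.9, 3.1, 2.5, 3.7, 2.9, 3.3, 3.0])

    # Resample with replacement from the sample itself, many times.
    boot_means = np.array([
        np.mean(np.random.choice(sample, size=len(sample), replace=True))
        for _ in range(10_000)
    ])

    # An approximate 95% confidence interval: the middle 95% of bootstrap means.
    lower, upper = np.percentile(boot_means, [2.5, 97.5])
    print(lower, upper)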
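The standard-deviation and standard-units arithmetic above can be checked in code; this sketch reuses the numbers from the worked examples.

    import numpy as np

    data = np.array([1, 2, 3])
    mean = np.mean(data)                       # 2.0
    sd = np.sqrt(np.mean((data - mean) ** 2))  # sqrt(2/3) ≈ 0.816
    print(sd, np.std(data))                    # np.std uses the same divide-by-n formula

    # Standard units (z-scores): mean 0, SD 1.
    z = (data - mean) / sd
    print(z)

    # The height example: value 170, mean 160, SD 10.
    print((170 - 160) / 10)   # z = 1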
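A quick illustration of the CLT entry above: sample averages drawn from a clearly non-normal (exponential) population come out roughly bell-shaped, with center and spread matching the formulas above. The population parameters here are arbitrary assumptions.

    import numpy as np

    np.random.seed(2)

    # A skewed, non-normal population.
    population = np.random.exponential(scale=2.0, size=100_000)

    # Distribution of the sample average for samples of size 30.
    sample_means = np.array([
        np.mean(np.random.choice(population, size=30, replace=True))
        for _ in range(10_000)
    ])

    # CLT predictions: center ≈ population mean,
    # spread ≈ population SD / sqrt(sample size).
    print(np.mean(sample_means), np.mean(population))
    print(np.std(sample_means), np.std(population) / np.sqrt(30))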
Correlation: a number that measures the strength and direction of the linear relationship between two variables. Example: the number of hours spent studying and exam scores may have a positive correlation.
The correlation coefficient r: convert both variables to standard units and average the products; r measures how tightly the points cluster around a straight line.
r = 1 → the scatter is a perfect straight line sloping up.
r = −1 → the scatter is a perfect straight line sloping down.
r = 0 → no linear association.
The further r is from 0, the stronger the linear association.
Nearest Neighbor Regression: predicting a value from the nearest similar data points. Example: predict a person's income by averaging the incomes of similar individuals.
Linear Regression: the linear relationship measured by r can be used for prediction: predicted y = slope × x + intercept. Example: predicting salary from years of experience. (See the sketches at the end of these notes.)
Residuals (prediction error): the differences between observed and predicted values.
Least Squares: a method for finding the best-fitting line by minimizing the sum of squared residuals (in linear regression, the line that minimizes the squared differences between predicted and actual values).
Minimization: the process of finding the minimum value of a function, often used to improve model fit.
Multiple Regression: models the relationship between one dependent variable and multiple independent variables, instead of predicting y from a single variable x. Example: predicting house prices using size, location, and number of rooms.
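A minimal sketch connecting the correlation coefficient, the regression line, and residuals, using the standard-units definition of r and the usual least-squares formulas (slope = r × SD of y / SD of x, intercept = mean of y − slope × mean of x); the experience/salary data are made up.

    import numpy as np

    # Illustrative data: years of experience vs. salary (made-up numbers).
    x = np.array([1, 2, 3, 4, 5, 6])
    y = np.array([40, 46, 50, 57, 60, 68])

    def standard_units(values):
        return (values - np.mean(values)) / np.std(values)

    # r: the average of the products of the variables in standard units.
    r = np.mean(standard_units(x) * standard_units(y))

    # Least-squares line from r.
    slope = r * np.std(y) / np.std(x)
    intercept = np.mean(y) - slope * np.mean(x)

    predicted = slope * x + intercept
    residuals = y - predicted   # observed minus predicted values

    print(r, slope, intercept)
    print(residuals)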
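To make the Minimization entry concrete: a numerical minimizer applied to the mean squared error recovers the same line as the formulas above. This sketch assumes scipy is available; any minimizer would do.

    import numpy as np
    from scipy.optimize import minimize

    x = np.array([1, 2, 3, 4, 5, 6])
    y = np.array([40, 46, 50, 57, 60, 68])

    def mse(params):
        slope, intercept = params
        return np.mean((y - (slope * x + intercept)) ** 2)

    # Start from an arbitrary guess; the minimizer searches for the best fit.
    result = minimize(mse, x0=np.array([0.0, 0.0]))
    print(result.x)   # should match the slope and intercept from the formulas above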