Module 2 Data Science PDF
Summary
These notes cover Module 2 of a data science course. They cover statistical hypothesis testing, including p-values, confidence intervals, and the two types of testing errors, and introduce Gradient Descent, a fundamental optimization algorithm used in machine learning, as well as the basics of getting, cleaning, and manipulating data in Python.
Full Transcript
Module 2: Statistical Hypothesis Testing. Topics: Example: Flipping a Coin, p-Values, Confidence Intervals, p-Hacking, Example: Running an A/B Test, Bayesian Inference, Gradient Descent, The Idea Behind Gradient Descent, Estimating the Gradient, Using the Gradient, Choosing the Right Step Size, Using Gradient Descent to Fit Models, Minibatch and Stochastic Gradient Descent, Getting Data, stdin and stdout, Reading Files, Scraping the Web, Using APIs, Example: Using the Twitter APIs, Working with Data, Exploring Your Data, Using Named Tuples, Dataclasses, Cleaning and Munging, Manipulating Data, Rescaling, An Aside: tqdm, Dimensionality Reduction. (Chapters 7, 8, 9, and 10.)

Statistical Hypothesis Testing

Statistical hypothesis testing is a method used in statistics to make inferences or draw conclusions about a population based on sample data. A hypothesis is a premise or claim that we want to test. Often, as data scientists, we'll want to test whether a certain hypothesis is likely to be true. For our purposes, hypotheses are assertions like "this coin is fair" or "data scientists prefer Python to R." Under various assumptions, the statistics we compute from sample data can be thought of as observations of random variables from known distributions, which allows us to make statements about how likely those assumptions are to hold.

There are two types of hypothesis: the null hypothesis (H0) and the alternative hypothesis (H1 or Ha).

Null hypothesis: represents some default position. It is the hypothesis that the test seeks to disprove.
Alternative hypothesis: a statement that there is an effect, a difference, or a relationship. It is what the researcher wants to prove.

We use statistics to decide whether or not we can reject H0 as false.

Example: Flipping a Coin

Imagine we have a coin and we want to test whether it's fair. We'll make the assumption that the coin has some probability p of landing heads, and so our null hypothesis is that the coin is fair; that is, that p = 0.5. We'll test this against the alternative hypothesis p ≠ 0.5.

Example 2

Suppose a researcher wants to test whether a new drug is more effective than the current standard treatment. The hypotheses might be:

H₀: The new drug is no more effective than the standard treatment (mean difference = 0).
H₁: The new drug is more effective than the standard treatment (mean difference > 0).

The researcher collects data, performs a t-test, and finds a p-value of 0.03. If the significance level is set at 0.05, the p-value is less than α, so the null hypothesis is rejected, indicating that the new drug is statistically significantly more effective. Statistical hypothesis testing is a fundamental tool in research, allowing scientists to make data-driven decisions and determine the validity of their hypotheses.

In hypothesis testing, Type 1 and Type 2 errors are the two kinds of mistakes that can occur when making decisions based on sample data. Here's a breakdown:

Type 1 Error (False Positive): occurs when you reject the null hypothesis when it is actually true. It is essentially a false alarm, concluding there is an effect or difference when there isn't. The probability of committing a Type 1 error is denoted by α (alpha), also known as the significance level (commonly 0.05 or 5%). Example: a test concludes that a new drug is effective when, in reality, it isn't.

Type 2 Error (False Negative): occurs when you fail to reject the null hypothesis when it is actually false. This is a missed detection, meaning you fail to detect an effect or difference when one actually exists. The probability of committing a Type 2 error is denoted by β (beta). Example: a test fails to detect that a new drug is effective when in fact it is.
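To make the two error types concrete for the coin-flipping example, here is a minimal simulation sketch. It is my own illustration rather than code from the notes: the helper names run_experiment and reject_fair_coin, the 1.96-standard-deviation acceptance region, and the p = 0.55 alternative are all illustrative assumptions.

import math
import random

random.seed(0)

def run_experiment(p: float, n: int = 1000) -> int:
    """Flip a coin with heads probability p a total of n times; return the number of heads."""
    return sum(1 for _ in range(n) if random.random() < p)

def reject_fair_coin(num_heads: int, n: int = 1000) -> bool:
    """Reject H0: p = 0.5 when the head count falls outside an approximate
    95% acceptance region (mean +/- 1.96 standard deviations)."""
    mu = 0.5 * n
    sigma = math.sqrt(n * 0.5 * 0.5)          # about 15.8 for n = 1000
    return abs(num_heads - mu) > 1.96 * sigma

trials = 1000

# Type 1 error rate: how often a truly fair coin (p = 0.5) gets rejected -- should land near 0.05
type_1_rate = sum(reject_fair_coin(run_experiment(0.50)) for _ in range(trials)) / trials

# Type 2 error rate: how often a biased coin (p = 0.55) fails to be rejected -- roughly 0.1 here
type_2_rate = sum(not reject_fair_coin(run_experiment(0.55)) for _ in range(trials)) / trials

print(type_1_rate, type_2_rate)

With only 1,000 simulated experiments the estimates are noisy, but they should land near α = 0.05 and, for this particular alternative (p = 0.55), a β of roughly 0.1.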
Power of the Test

The probability of correctly rejecting a false null hypothesis is called the power of the test, and it is equal to 1 - β. A higher power means a lower probability of a Type 2 error. There is often a trade-off between Type 1 and Type 2 errors: reducing the chance of one typically increases the chance of the other, unless you increase the sample size.

p-Values

A p-value is a measure used in statistical hypothesis testing to determine the significance of the results. It represents the probability of obtaining test results at least as extreme as the ones we actually observed, assuming that the null hypothesis is true; in other words, it quantifies the evidence against the null hypothesis. For our two-sided test of whether the coin is fair, the p-value is the probability of seeing a head count at least as far from the expected value as the count we actually observed, assuming the coin really is fair.

Small p-value (≤ α): if the p-value is less than or equal to the chosen significance level (α, usually 0.05), it suggests that the observed data is unlikely under the null hypothesis. Therefore, you reject the null hypothesis. Example: a p-value of 0.03 suggests that there is only a 3% chance of observing the data (or something more extreme) if the null hypothesis were true, leading to a rejection of H₀.

Large p-value (> α): if the p-value is greater than the significance level, there isn't enough evidence to reject the null hypothesis. Thus, you fail to reject the null hypothesis. Example: a p-value of 0.2 indicates that there is a 20% chance of observing the data (or something more extreme) if the null hypothesis were true, so you do not reject H₀.

Significance Level (α): the significance level is a threshold chosen by the researcher before the analysis begins. It represents the probability of rejecting the null hypothesis when it is actually true (a Type 1 error). Common α values are 0.05, 0.01, and 0.10.

Example of p-value interpretation: suppose you are testing whether a new drug is more effective than the current standard. You conduct a study, and your statistical test results in a p-value of 0.02. If your significance level (α) is 0.05, the p-value (0.02) is less than α, meaning there is strong evidence against the null hypothesis. Therefore, you reject the null hypothesis and conclude that the new drug is statistically significantly more effective than the standard treatment.
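As a concrete illustration of the coin example, here is a minimal sketch of computing a two-sided p-value under the normal approximation. The two_sided_p_value helper, the use of statistics.NormalDist, the 0.5 continuity correction, and the 1,000-flip / 525-heads numbers (chosen to match the confidence-interval example below) are illustrative assumptions, not code from the notes.

import math
from statistics import NormalDist

def two_sided_p_value(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Probability of seeing a value at least as extreme as x (in either tail)
    under a Normal(mu, sigma) approximation."""
    if x >= mu:
        return 2 * (1 - NormalDist(mu, sigma).cdf(x))
    return 2 * NormalDist(mu, sigma).cdf(x)

# 1,000 flips of a fair coin: the head count is approximately Normal(500, 15.8)
mu = 0.5 * 1000
sigma = math.sqrt(1000 * 0.5 * 0.5)

# Suppose we observe 525 heads (using 524.5 as a continuity correction)
print(two_sided_p_value(524.5, mu, sigma))   # roughly 0.12, so we do not reject H0 at alpha = 0.05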
Confidence Intervals

We've been testing hypotheses about the value of the heads probability p, which is a parameter of the unknown "heads" distribution. When this is the case, a third approach is to construct a confidence interval around the observed value of the parameter. For example, we can estimate the probability of the unfair coin by looking at the average value of the Bernoulli variables corresponding to each flip (1 if heads, 0 if tails). If we observe 525 heads out of 1,000 flips, then we estimate p equals 0.525. How confident can we be about this estimate?

Well, if we knew the exact value of p, the central limit theorem (recall "The Central Limit Theorem") tells us that the average of those Bernoulli variables should be approximately normal, with mean p and standard deviation:

import math

math.sqrt(p * (1 - p) / 1000)

Here we don't know p, so instead we use our estimate:

p_hat = 525 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)   # 0.0158

This is not entirely justified, but people seem to do it anyway. Using the normal approximation, we conclude that we are "95% confident" that the following interval contains the true parameter p:

normal_two_sided_bounds(0.95, mu, sigma)        # [0.4940, 0.5560]

(normal_two_sided_bounds, from earlier in the course, returns the symmetric interval around the mean that contains the given probability.) In particular, we do not conclude that the coin is unfair, since 0.5 falls within our confidence interval. If instead we'd seen 540 heads, then we'd have:

p_hat = 540 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1 - p_hat) / 1000)   # 0.0158
normal_two_sided_bounds(0.95, mu, sigma)        # [0.5091, 0.5709]

Here, "fair coin" doesn't lie in the confidence interval. (The "fair coin" hypothesis doesn't pass a test that you'd expect it to pass 95% of the time if it were true.)

p-Hacking

P-hacking, also known as data dredging or data fishing, refers to the manipulation of statistical analyses to produce a desired result, typically one that is statistically significant (usually p < 0.05). It involves selectively reporting, adjusting, or analyzing data in ways that increase the likelihood of obtaining a significant p-value, even if the underlying hypothesis is not actually supported by the data. P-hacking undermines the integrity of scientific research by artificially inflating the significance of results. It's important for researchers, reviewers, and journals to adopt practices that minimize the potential for p-hacking, to ensure that published findings are reliable and reproducible. If you want to do good science, you should determine your hypotheses before looking at the data, you should clean your data without the hypotheses in mind, and you should keep in mind that p-values are not substitutes for common sense.

Example: Running an A/B Test

A/B testing, also known as split testing, is a controlled experiment used to compare two versions of a variable, typically to determine which one performs better. This type of test is commonly used in marketing, product development, and website optimization to make data-driven decisions. The goal is to use statistical analysis to identify changes that improve a given outcome. In an A/B test, two versions (A and B) are shown to users at random, and statistical analysis is used to determine which version performs better. Version A is often the current experience (the control), while version B includes a modification that you want to test (the treatment).

One of your advertisers has developed a new energy drink targeted at data scientists, and the VP of Advertisements wants your help choosing between advertisement A ("tastes great!") and advertisement B ("less bias!"). Being a scientist, you decide to run an experiment, randomly showing site visitors one of the two advertisements and tracking how many people click on each one. If 990 out of 1,000 A-viewers click their ad, while only 10 out of 1,000 B-viewers click their ad, you can be pretty confident that A is the better ad. But what if the differences are not so stark? Here's where you'd use statistical inference.

Let's say that N_A people see ad A, and that n_A of them click it. We can think of each ad view as a Bernoulli trial where p_A is the probability that someone clicks ad A.
Then (if N_A is large, which it is here) we know that n_A/N_A is approximately a normal random variable with mean p_A and standard deviation sigma_A = sqrt(p_A * (1 - p_A) / N_A). Similarly, n_B/N_B is approximately a normal random variable with mean p_B and standard deviation sigma_B = sqrt(p_B * (1 - p_B) / N_B). In code:

def estimated_parameters(N, n):
    p = n / N
    sigma = math.sqrt(p * (1 - p) / N)
    return p, sigma

If we assume those two normals are independent, then their difference should also be normal, with mean p_B - p_A and standard deviation sqrt(sigma_A ** 2 + sigma_B ** 2). This means we can test the null hypothesis that p_A and p_B are the same (that is, that p_A - p_B is 0) by using the statistic:

def a_b_test_statistic(N_A, n_A, N_B, n_B):
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)

which should approximately be a standard normal. For example, if "tastes great" gets 200 clicks out of 1,000 views and "less bias" gets 180 clicks out of 1,000 views, the statistic equals:

z = a_b_test_statistic(1000, 200, 1000, 180)   # -1.14

The probability of seeing such a large difference if the means were actually equal would be:

two_sided_p_value(z)                           # 0.254

which is large enough that we can't conclude there's much of a difference. On the other hand, if "less bias" only got 150 clicks, we'd have:

z = a_b_test_statistic(1000, 200, 1000, 150)   # -2.94
two_sided_p_value(z)                           # 0.003

which means there's only a 0.003 probability that we'd see such a large difference if the ads were equally effective.

Bayesian Inference

An alternative approach to inference involves treating the unknown parameters themselves as random variables. The analyst starts with a prior distribution for the parameters and then uses the observed data and Bayes's theorem to get an updated posterior distribution for the parameters. Rather than making probability judgments about the tests, you make probability judgments about the parameters.

Bayesian inference is a method of statistical inference that uses Bayes' theorem to update the probability of a hypothesis as more evidence or data becomes available. It incorporates prior beliefs and provides a flexible framework for decision-making under uncertainty.

Bayes' theorem: P(H|D) = P(D|H) · P(H) / P(D)

P(H|D): posterior probability, the probability of the hypothesis H given the observed data D.
P(D|H): likelihood, the probability of observing the data D given that the hypothesis H is true.
P(H): prior probability, the initial belief about the probability of the hypothesis before seeing the data.
P(D): marginal likelihood or evidence, the total probability of observing the data under all possible hypotheses.

For example, when the unknown parameter is a probability (as in our coin-flipping example), we often use a prior from the Beta distribution, which puts all its probability between 0 and 1:

def B(alpha, beta):
    """A normalizing constant so that the total probability is 1"""
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(x, alpha, beta):
    if x < 0 or x > 1:   # no weight outside of [0, 1]
        return 0
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B(alpha, beta)

Beta distribution: the Beta distribution is a continuous probability distribution defined on the interval [0, 1]. It is often used as a prior distribution for modeling probabilities in Bayesian inference.

Shape parameters α and β: these parameters control the shape of the distribution. For instance:
If α = β = 1, the Beta distribution is uniform.
If α > 1 and β > 1, the distribution is bell-shaped.
If α is small and β is large, the distribution is skewed towards 0, and vice versa.
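Given such a Beta prior, updating beliefs after seeing coin flips is simple, because the Beta distribution is conjugate to the binomial likelihood: a Beta(α, β) prior combined with h observed heads and t observed tails yields a Beta(α + h, β + t) posterior. The following minimal sketch is my own; the helper name is made up, and the numbers are chosen to match the earlier 525-heads example.

def update_beta_prior(alpha: float, beta: float, heads: int, tails: int):
    """Beta(alpha, beta) prior + observed coin flips -> parameters of the Beta posterior."""
    return alpha + heads, beta + tails

# Start from a uniform prior, Beta(1, 1), and observe 525 heads out of 1,000 flips
alpha_post, beta_post = update_beta_prior(1, 1, 525, 475)

# The posterior mean alpha / (alpha + beta) is our updated point estimate of p
print(alpha_post / (alpha_post + beta_post))   # about 0.525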
Gradient Descent

Frequently when doing data science, we'll be trying to find the best model for a certain situation. And usually "best" will mean something like "minimizes the error of the model" or "maximizes the likelihood of the data." In other words, it will represent the solution to some sort of optimization problem.

Gradient descent is an optimization algorithm used to minimize a function by iteratively moving towards the minimum value of that function. It is commonly used in machine learning to optimize cost functions and adjust model parameters, such as in linear regression, logistic regression, and neural networks. The key idea is to adjust parameters in the opposite direction of the gradient of the function with respect to those parameters. The gradient is the vector of partial derivatives, and it points in the direction of steepest ascent. Thus, moving in the opposite direction of the gradient leads to the steepest descent, i.e., toward the minimum.

Gradient descent: minimization optimization that follows the negative of the gradient to the minimum of the target function.
Gradient ascent: maximization optimization that follows the gradient to the maximum of the target function.

Types of Gradient Descent
1. Batch gradient descent: computes the gradient of the cost function with respect to the parameters over the entire training dataset.
2. Stochastic gradient descent: computes the gradient using a single training point chosen at random at each step.
3. Mini-batch gradient descent: divides the dataset into mini-batches, and the gradient is calculated for each mini-batch.

Steps in Gradient Descent:
1. Initialize parameters (e.g., weights): start with some initial values for the parameters.
2. Compute the gradient: calculate the gradient of the cost function with respect to each parameter.
3. Update the parameters: adjust the parameters by moving in the direction opposite to the gradient, scaled by a learning rate.
4. Repeat: continue iterating until the parameters converge to values where the gradient is near zero (or stops changing significantly).

The Idea Behind Gradient Descent

Suppose we have some function f that takes as input a vector of real numbers and outputs a single real number. One simple such function is:

from scratch.linear_algebra import Vector, dot

def sum_of_squares(v: Vector) -> float:
    """Computes the sum of squared elements in v"""
    return dot(v, v)

We'll frequently need to maximize (or minimize) such functions. That is, we need to find the input v that produces the largest (or smallest) possible value. The gradient gives the input direction in which the function most quickly increases. Accordingly, one approach to maximizing a function is to pick a random starting point, compute the gradient, take a small step in the direction of the gradient (i.e., the direction that causes the function to increase the most), and repeat with the new starting point. Similarly, you can try to minimize a function by taking small steps in the opposite direction of the gradient.
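Here is a minimal sketch of that idea applied to sum_of_squares, whose gradient at v is simply the vector 2·v. It uses plain Python lists rather than the course's Vector helpers, and the function names are mine.

import random

def sum_of_squares_gradient(v):
    """Gradient of sum_of_squares: the partial derivative with respect to v[i] is 2 * v[i]."""
    return [2 * v_i for v_i in v]

def gradient_step(v, gradient, step_size):
    """Move step_size times the gradient away from v (a negative step_size moves downhill)."""
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]

# Pick a random starting point and repeatedly take small steps against the gradient
v = [random.uniform(-10, 10) for _ in range(3)]
for _ in range(1000):
    grad = sum_of_squares_gradient(v)
    v = gradient_step(v, grad, -0.01)   # minus sign: move in the direction of steepest descent

print(v)   # should be very close to [0, 0, 0], the minimum of sum_of_squares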
Estimating the Gradient

If f is a function of one variable, its derivative at a point x measures how f(x) changes when we make a very small change to x. It is defined as the limit of the difference quotients:

from typing import Callable

# Function to compute the difference quotient
def difference_quotient(f: Callable[[float], float], x: float, h: float) -> float:
    return (f(x + h) - f(x)) / h

For example, the square function:

def square(x: float) -> float:
    return x * x

has the derivative:

def derivative(x: float) -> float:
    return 2 * x

The derivative is the slope of the tangent line at (x, f(x)), while the difference quotient is the slope of the not-quite-tangent line that runs through (x + h, f(x + h)). As h gets smaller and smaller, the not-quite-tangent line gets closer and closer to the tangent line (Figure 8-2).

Choosing the Right Step Size

Although the rationale for moving against the gradient is clear, how far to move is not. Indeed, choosing the right step size is more of an art than a science. Popular options include:
1. Using a fixed step size
2. Gradually shrinking the step size over time
3. At each step, choosing the step size that minimizes the value of the objective function

The last approach sounds great but is, in practice, a costly computation. To keep things simple, we'll mostly just use a fixed step size. The step size that "works" depends on the problem: too small, and your gradient descent will take forever; too big, and you'll take giant steps that might make the function you care about get larger or even be undefined. So we'll need to experiment.

Using Gradient Descent to Fit Models

If we think of our data as being fixed, then our loss function tells us how good or bad any particular model parameters are. This means we can use gradient descent to find the model parameters that make the loss as small as possible. Example:

# x ranges from -50 to 49, y is always 20 * x + 5
inputs = [(x, 20 * x + 5) for x in range(-50, 50)]

We'll use gradient descent to find the slope and intercept that minimize the average squared error. We'll start off with a function that determines the gradient based on the error from a single data point:

def linear_gradient(x: float, y: float, theta: Vector) -> Vector:
    slope, intercept = theta
    predicted = slope * x + intercept    # The prediction of the model.
    error = (predicted - y)              # error is (predicted - actual).
    squared_error = error ** 2           # We'll minimize squared error
    grad = [2 * error * x, 2 * error]    # using its gradient.
    return grad

The fitting procedure, sketched below, is then:
1. Start with a random value for theta.
2. Compute the mean of the gradients over the data.
3. Adjust theta in the opposite direction of the gradient.
4. Repeat.
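A sketch of that loop, fitting the slope and intercept to the inputs defined above: it reuses the linear_gradient function from the notes, while vector_mean, gradient_step, the learning rate of 0.001, and the 5,000 epochs are my own assumptions.

import random

def vector_mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def gradient_step(v, gradient, step_size):
    """Move step_size times the gradient away from v."""
    return [v_i + step_size * g_i for v_i, g_i in zip(v, gradient)]

inputs = [(x, 20 * x + 5) for x in range(-50, 50)]

theta = [random.uniform(-1, 1), random.uniform(-1, 1)]   # 1. start with random slope and intercept
learning_rate = 0.001

for epoch in range(5000):
    # 2. compute the mean of the gradients over the whole dataset (batch gradient descent)
    grad = vector_mean([linear_gradient(x, y, theta) for x, y in inputs])
    # 3. adjust theta in the opposite direction of the gradient
    theta = gradient_step(theta, grad, -learning_rate)
    # 4. repeat

print(theta)   # the slope should end up close to 20 and the intercept close to 5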
Getting Data

In order to be a data scientist you need data. In fact, as a data scientist you will spend an embarrassingly large fraction of your time acquiring, cleaning, and transforming data. In a pinch, you can always type the data in yourself, but usually this is not a good use of your time. In this chapter, we'll look at different ways of getting data into Python and into the right formats.

stdin and stdout

If you run your Python scripts at the command line, you can pipe data through them using sys.stdin and sys.stdout. For example, here is a script that reads in lines of text and spits back out the ones that match a regular expression:

# egrep.py
import sys, re

# sys.argv is the list of command-line arguments
# sys.argv[0] is the name of the program itself
# sys.argv[1] will be the regex specified at the command line
regex = sys.argv[1]

# for every line passed into the script
for line in sys.stdin:
    # if it matches the regex, write it to stdout
    if re.search(regex, line):
        sys.stdout.write(line)

And here is a script that counts the lines it receives and then writes out the count:

# line_count.py
import sys

count = 0
for line in sys.stdin:
    count += 1

# print goes to sys.stdout
print(count)

Cleaning Data with pandas: Common Functions

Filling missing values: pandas provides methods for automatically dealing with missing values in a dataset, be it by replacing missing values with a "default" value using the df.fillna() method, or by removing any rows or columns containing missing values through the df.dropna() method.

Removing duplicated instances: automatically remove duplicate entries (rows) in a dataset with the df.drop_duplicates() method, which allows the removal of extra instances when either a specific attribute value or the entire instance values are duplicated to another entry.

Manipulating strings: some pandas functions are useful for making the format of string attributes uniform. For instance, if there is a mix of lowercase, sentence case, and uppercase values for a 'column' attribute and we want them all to be lowercase, the df['column'].str.lower() method does the job. For removing accidentally introduced leading and trailing whitespace, try the df['column'].str.strip() method.

Manipulating date and time: pd.to_datetime(df['column']) converts string columns containing date-time information, e.g. in the dd/mm/yyyy format, into Python datetime objects, thereby easing their further manipulation.

Column renaming: automating the process of renaming columns can be particularly useful when there are multiple datasets segregated by city, region, project, etc., and we want to add prefixes or suffixes to all or some of their columns to ease their identification. The df.rename(columns={old_name: new_name}) method makes this possible.

Manipulating Data

One of the most important skills of a data scientist is manipulating data. It involves modifying, processing, or transforming data to make it usable for analysis and machine learning tasks. Common data manipulation techniques:
1. Filtering: selecting specific rows or columns based on conditions (e.g., removing null values, filtering by range).
2. Sorting: organizing the data by one or more columns.
3. Aggregation: summarizing data through functions like mean, sum, or count.
4. Joining/Merging: combining multiple datasets on common fields (e.g., SQL joins).
5. Pivoting and Unpivoting: transforming data structure, for example, turning rows into columns or vice versa.
6. Encoding Categorical Data: converting categorical data to numerical format (e.g., one-hot encoding).
7. Handling Missing Values: imputing missing data or dropping missing rows/columns.
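Several of these techniques (handling missing values, filtering, and sorting) are illustrated in the notes' own pandas example below. As a supplement, here is a hedged sketch of two techniques that example does not cover, joining/merging and one-hot encoding; the DataFrames and column names are made up for illustration.

import pandas as pd

employees = pd.DataFrame({'emp_id': [1, 2, 3],
                          'dept': ['Eng', 'HR', 'Eng']})
salaries = pd.DataFrame({'emp_id': [1, 2, 3],
                         'salary': [70000, 55000, 65000]})

# Joining/Merging: combine the two tables on the shared emp_id column
merged = pd.merge(employees, salaries, on='emp_id', how='inner')

# Encoding categorical data: one-hot encode the dept column
encoded = pd.get_dummies(merged, columns=['dept'])
print(encoded)

The notes' own example, which fills missing values and then filters and sorts the result, follows.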
import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, None, 35, 40],
        'Salary': [50000, 60000, None, 70000]}
df = pd.DataFrame(data)
print(df)

print("after filling missing values")
df.loc[df['Age'].isnull(), 'Age'] = df['Age'].mean()             # Fill missing age with the mean
df.loc[df['Salary'].isnull(), 'Salary'] = df['Salary'].median()  # Fill missing salary with the median
print(df)

print("salary > 60000")
filtered_df = df[df['Salary'] > 60000]   # Filter rows where Salary > 60000
print(filtered_df)

print("sort by age")
sorted_df = filtered_df.sort_values(by='Age')   # Sort by Age
print(sorted_df)

Rescaling

Many techniques are sensitive to the scale of your data. For example, imagine that you have a dataset consisting of the heights and weights of hundreds of data scientists, and that you are trying to identify clusters of body sizes.

Person   Height (inches)   Height (cm)   Weight (pounds)
A        63                160           150
B        67                170.2         160
C        70                177.8         171

If we measure height in inches, then B's nearest neighbor is A:

from scratch.linear_algebra import distance

a_to_b = distance([63, 150], [67, 160])         # 10.77
a_to_c = distance([63, 150], [70, 171])         # 22.14
b_to_c = distance([67, 160], [70, 171])         # 11.40

However, if we measure height in centimeters, then B's nearest neighbor is instead C:

a_to_b = distance([160, 150], [170.2, 160])     # 14.28
a_to_c = distance([160, 150], [177.8, 171])     # 27.53
b_to_c = distance([170.2, 160], [177.8, 171])   # 13.37

Obviously it's a problem if changing units can change results like this. For this reason, when dimensions aren't comparable with one another, we will sometimes rescale our data so that each dimension has mean 0 and standard deviation 1. This effectively gets rid of the units, converting each dimension to "standard deviations from the mean."

Another common approach is min-max scaling. MinMaxScaler from the sklearn.preprocessing module scales the values according to the formula

X_scaled = (X - X_min) / (X_max - X_min)

This method scales the values to the range between 0 and 1, which is useful when normalizing numerical features for machine learning.

An Aside: tqdm

Frequently we'll end up doing computations that take a long time. When you're doing such work, you'd like to know that you're making progress and how long you should expect to wait. One way of doing this is with the tqdm library, which generates custom progress bars. We'll use it some throughout the rest of the book, so let's take this chance to learn how it works. There are only a few features you need to know about. The first is that an iterable wrapped in tqdm.tqdm will produce a progress bar:

import random
import tqdm

for i in tqdm.tqdm(range(100)):
    # do something slow
    _ = [random.random() for _ in range(1000000)]

which produces an output that looks like this:

56%|████████████████████ | 56/100 [00:08
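tqdm can also update the text shown next to the bar while the loop runs, via set_description. Here is a sketch in that spirit; tqdm.trange (a tqdm-wrapped range) and set_description are standard tqdm features, while the prime-counting loop is just a stand-in for slow work.

import tqdm

def primes_up_to(n: int) -> list:
    """Deliberately slow computation so the progress bar has something to track."""
    primes = [2]
    with tqdm.trange(3, n) as t:
        for i in t:
            # i is prime if no smaller prime divides it
            if all(i % p != 0 for p in primes):
                primes.append(i)
            # update the label shown next to the progress bar
            t.set_description(f"{len(primes)} primes found")
    return primes

my_primes = primes_up_to(100_000)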