Statistics Concepts

Questions and Answers

Which of the following best exemplifies a descriptive analysis?

  • Predicting the likelihood of a customer defaulting on a loan based on their credit score.
  • Estimating future stock prices based on historical trends.
  • Determining whether a new advertising campaign caused an increase in sales.
  • Summarizing customer demographics using mean, median, and standard deviation. (correct)

A dataset has a high standard deviation. What does this indicate about the data?

  • The data points are clustered closely around the mean.
  • The data points are widely dispersed from the mean. (correct)
  • The mean is not a reliable measure of central tendency.
  • The dataset contains a large number of outliers.

In the context of data analysis, what does a high coefficient of variation suggest?

  • The mean is close to zero.
  • Low variability relative to the mean.
  • High variability relative to the mean. (correct)
  • The data follows a normal distribution.

What information does the 90/10 percentile ratio provide about a distribution?

The width or spread of the distribution. (D)

In a right-skewed distribution, how do the mean and median typically relate to each other?

The mean is greater than the median. (A)

What does the area of a bar in a histogram represent?

The frequency of observations within that bin. (B)

Which of the following is an example of inferring causation from correlation?

Observing a strong positive correlation between education level and income and concluding that higher education leads to higher income. (B)

What is the main purpose of a kernel density function?

To estimate the probability density of a continuous random variable. (D)

Why are control groups used when trying to estimate the effect of a treatment?

To approximate what would have happened to the treated group had they not received the treatment, estimating the counterfactual. (D)

How does randomization help to eliminate selection bias in treatment and control groups?

By ensuring the treatment and control groups are statistically equivalent on average before treatment, with similar distributions of characteristics. (B)

A researcher finds a p-value of 0.02 when testing the effect of a new drug. What does this p-value indicate?

There is strong evidence against the null hypothesis; the drug has a statistically significant effect. (B)

What does the Central Limit Theorem state about the sampling distribution when dealing with a large number of independent variables?

It approximates a Normal distribution. (C)

When a study reports 'standard errors,' what variability is being summarized?

Variability in the treatment effect due to random sampling. (A)

In the context of treatment effect analysis, what does a 'point estimate' represent?

The single best guess for the approximation of a parameter, derived from the sample data. (D)

What does selection bias primarily result from in the context of estimating treatment effects?

Using an invalid control group, leading to an incorrect estimate of the counterfactual. (C)

If researchers run a simulation of placebo treatments and observe large, randomly occurring differences between groups, what does this indicate?

Random splits can lead to large differences due to chance alone. (C)

How does increasing the bandwidth (h) affect kernel density estimation?

It averages over more data around each point, leading to a smoother estimate but potentially missing finer details. (A)

In the context of kernel density estimation, what is the primary role of the kernel function?

To weight each observation within a bandwidth. (D)

Given a standard normal distribution, if $F_X(-1) = 0.159$ and $F_X(0) = 0.5$, what does $F_X(-1) = 0.159$ represent?

The area under the standard normal curve to the left of -1 is 0.159. (C)

Why might the Average Treatment Effect (ATE) differ from the Average Treatment Effect for the Treated (ATT)?

The treated and not treated groups may have systematic differences. (B)

A government is considering a new policy that would affect the entire population. Which treatment effect would be most relevant in this scenario?

Average Treatment Effect (ATE) (D)

A job training program is offered, but only some individuals enroll. To assess the program's impact on those who participated, which treatment effect is most appropriate?

Average Treatment Effect for the Treated (ATT) (C)

You suspect that the effect of a mentorship program on job placement is different for participants compared to non-participants. Which treatment effect(s) should you examine?

Both the Average Treatment Effect (ATE) and the Average Treatment Effect for the Treated (ATT). (D)

Which of the following approaches to constructing counterfactuals is considered the least reliable and should generally be avoided?

Unsubstantiated Guess (D)

In the context of hypothesis testing, what does the t-statistic primarily indicate?

How many standard errors the estimated coefficient is away from zero, assuming the null hypothesis is true. (C)

What is the interpretation of a 95% confidence interval?

If we were to repeat the experiment many times, 95% of the calculated confidence intervals would contain the true population parameter. (D)

What is the relationship between the standard error, sample size, and confidence interval width?

As the standard error decreases, the confidence interval width decreases. (C)

In hypothesis testing, what is a Type II error?

Failing to reject the null hypothesis when it is actually false. (B)

How does increasing the sample size affect the likelihood of Type I and Type II errors?

It does not change the likelihood of a Type I error, but it decreases the likelihood of a Type II error. (C)

Assume a study finds a statistically significant result with a small sample size. What is a potential concern regarding this finding?

The result may be a false positive with an exaggerated effect size. (B)

What does 'statistical power' refer to in the context of hypothesis testing?

The probability of correctly rejecting a false null hypothesis. (C)

A researcher sets a very stringent significance level (e.g., $\alpha = 0.01$) for a hypothesis test. What is a likely consequence of this choice?

Decreased risk of a Type I error but increased risk of a Type II error. (A)

Which of the following practices helps to mitigate the multiple comparison problem in research?

Pre-registering the study design, hypotheses, and analysis plan. (B)

How does publication bias affect the overall body of research?

It leads to an overrepresentation of statistically significant results, potentially exaggerating true effects. (D)

A researcher conducts a study with a small sample size and fails to find a statistically significant effect, despite a true effect existing. What type of error has likely occurred?

Type II error (false negative). (B)

In the context of research methodology, what is the primary benefit of using larger sample sizes?

To reduce the variability of estimates and increase the precision of findings. (A)

What is the purpose of pre-registration in randomized controlled trials (RCTs)?

To publicly document the study design, hypotheses, and analysis plan before data collection to prevent p-hacking. (A)

Which factor does NOT directly influence the power of a statistical test?

The researcher's personal bias. (D)

In the 'Moving to Opportunity' study, what was the primary intervention used to assess the impact of neighborhood conditions on low-income families?

Offering housing vouchers for families to move to lower-poverty neighborhoods. (D)

Which of the following best describes the 'file-drawer effect' in research?

The phenomenon where researchers do not finish papers with statistically insignificant results because they are unlikely to be published. (C)

Flashcards

Descriptive Analysis

Summarizing data to establish facts.

Causal Analysis

Understanding cause-and-effect relationships (how X affects Y).

Predictive Analysis

Estimating how one variable (X) predicts another (Y).

Mean

Average value in a dataset.

Percentile

Value below which a given percentage of observations fall.

Standard Deviation

Average distance of values from the mean.

Variance

Measure of data spread around the mean.

90/10 Ratio

The 90th percentile divided by the 10th percentile, Q(0.90)/Q(0.10); a measure of the spread of a distribution.

Bandwidth (h)

Amount of data considered around a point 'x' in kernel density estimation.

Kernel Function

A function defining how much weight each observation has within the bandwidth.

Cumulative Density Function (CDF)

Fraction of observations with values less than or equal to a specified value.

Standard Normal Distribution

A normal distribution with a mean of 0 and a standard deviation of 1.

Z-score formula

Transforms a data point into a standard normal distribution.

Average Treatment Effect (ATE)

The expected impact of a treatment on the entire population.

Average Treatment Effect for the Treated (ATT)

The impact of a treatment specifically on those who received the treatment.

Control Group

A group used for comparison that doesn't receive the treatment.

Selection Bias

Arises when an invalid control group gives an incorrect estimate of the counterfactual, biasing the estimated average treatment effect on the treated (ATT).

Point Estimate

The single best guess for a parameter, derived from sample data.

Simulating a test distribution

A distribution of placebo treatments used to see how large randomly occurring differences can be.

P-value

The probability of obtaining a result at least as extreme as the one observed under the null hypothesis (treatment is zero).

Central Limit Theorem

The sampling distribution of the sample mean of a large number of independent variables is approximately Normal.

Standard Errors

Summarize the variability in the estimated treatment effect due to random sampling.

Randomization (in experiments)

Ensures treatment and control groups are statistically equivalent, with similar distribution of characteristics and confounders equally distributed.

False Negative

A statistical error where an effect exists but isn't detected (p > 0.05).

Experiment Precision Factors

Variability in the outcome variable, and the experiment's size.

T-statistic

Number of standard errors an estimated coefficient is from zero, indicating if the effect is large relative to data variability.

Multiple Comparison Problem

The increased chance of false positives when many comparisons are made without adjusting significance levels.

P-Hacking

Researchers run many specifications and report only those with p < 0.05.

Publication Bias

Academic journals are more likely to publish studies with significant results than insignificant ones.

Critical value

A point in the test distribution corresponding to a specific p-value, determining statistical significance.

File-Drawer Effect

Researchers don't finish papers with statistically insignificant results because they are unlikely to be published.

Confidence interval

Range of values where we expect the true population parameter to lie, given a certain confidence level.

Type 1 error

Claiming an effect exists when it doesn't (false positive).

Pre-registration (RCTs)

Publicly document study design, hypothesis, and analysis plan before collecting or analysing data.

Power (Statistical)

How likely we are to conclude the treatment has an impact, when it truly does.

Type 2 error

Failing to detect an effect that actually exists (false negative).

Statistical power

The probability of correctly identifying an effect when it exists.

Replication Files

Posting code and data so other researchers can analyse the robustness of the results.

Study Notes

  • Descriptive analysis summarizes data and establishes facts.
  • Causal analysis examines how one variable (X) affects another (Y).
  • Predictive analysis estimates how one variable (X) predicts another (Y).
  • Confusing correlation with causation leads to many false claims.
  • Descriptive statistics aims to reduce the number of numbers while retaining as much information as possible.
  • Mean is the average value.
  • Percentile is a value Q(p) such that a fraction (p) of observations are at most Q(p).
  • Standard deviation indicates the average distance of observations from the mean.
  • Variance measures how spread out the observations are.
  • Coefficient of variation is a relative measure of dispersion (standard deviation divided by the mean) used to compare variability across different datasets; see the sketch below.
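
A minimal Python sketch (assuming numpy; the data are hypothetical) of the summary statistics defined above:

```python
# Minimal sketch of the descriptive statistics above (hypothetical data).
import numpy as np

x = np.array([12, 15, 9, 30, 22, 18, 11, 25])  # hypothetical sample

mean = x.mean()              # average value
p90 = np.percentile(x, 90)   # 90th percentile: ~90% of observations are at most this
sd = x.std(ddof=1)           # average distance of observations from the mean
var = x.var(ddof=1)          # spread of observations around the mean
cv = sd / mean               # coefficient of variation: dispersion relative to the mean

print(mean, p90, sd, var, cv)
```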

Lecture 2

  • Quantiles are values Q(p) such that a fraction (p) of observations are at most Q(p).
  • Percentile ratios indicate the width of a distribution.
  • 90/10 ratio is calculated as Q(0.90) / Q(0.10).
  • A discrete density function gives the probability that random variable X takes a specific value x; each such probability is at least 0 and at most 1.
  • Histograms display the distribution of observations.
  • The width of histogram bars represents the range of observations.
  • The area of a histogram bar represents the frequency of observations in its bin; with equal-width bins, the bar height is proportional to that frequency.
  • A normal distribution is symmetrically distributed.
  • Right/positive skew has most values concentrated on the left and a long tail on the right.
  • Left/negative skew has most values concentrated on the right and a long tail on the left.
  • Skewness implies that the mean and median are not close, while symmetry suggests they are close.
  • Wide histogram bars indicate more spread-out data.
  • Narrow histogram bars indicate data concentrated around the center value.
  • The peak of a histogram indicates where most data points are concentrated.
  • A density function gives the probability that random variable X takes a value within a set A.
  • Kernel density function is an estimate of the likelihood of values across a range, computed as a local weighted average around each value x.
  • Bandwidth (h) in kernel density estimation determines how much data around x is used: larger bandwidths average over more data and give smoother estimates (possibly hiding finer detail), while smaller bandwidths create noisier estimates (see the sketch below).
  • Peak in kernel density estimation indicates the concentration of data.
  • P(X > 40,000) = 1 − CDF(40,000) gives the fraction of values above 40,000.
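
A minimal Python/numpy sketch of kernel density estimation with a Gaussian kernel (data and bandwidths are hypothetical), showing how the bandwidth h trades smoothness against detail:

```python
# Minimal sketch of Gaussian-kernel density estimation (hypothetical data).
import numpy as np

def kde(x_grid, data, h):
    # f(x) = (1 / (n*h)) * sum_i K((x - x_i) / h): a local weighted average
    # where the kernel K gives less weight to observations far from x.
    u = (x_grid[:, None] - data[None, :]) / h      # scaled distances
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel weights
    return k.mean(axis=1) / h

rng = np.random.default_rng(0)
data = rng.normal(50, 10, size=500)
grid = np.linspace(10, 90, 200)

smooth = kde(grid, data, h=8.0)  # larger h: averages more data, smoother curve
noisy = kde(grid, data, h=0.5)   # smaller h: follows the sample closely, noisier
```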

Cumulative Density Function (CDF)

  • Indicates the fraction of observations with values less than or equal to x.
  • Z = (X − mean) / SD(X) transforms X into a standard normal variable (mean 0, SD 1).
  • The empirical CDF is then created by plotting the fraction of observations at or below each value, as sketched below.
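
A minimal Python sketch (assuming numpy and scipy; the income figures are hypothetical) tying together the empirical CDF, the z-score transformation, and the standard normal CDF value F_X(-1) ≈ 0.159 from the quiz above:

```python
# Minimal sketch: empirical CDF, z-scores, and the standard normal CDF.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(40_000, 12_000, size=1_000)  # hypothetical incomes

z = (x - x.mean()) / x.std(ddof=1)          # z-score transformation

cdf_40k = np.mean(x <= 40_000)              # fraction of observations <= 40,000
print("P(X > 40,000) =", 1 - cdf_40k)       # complement rule from the notes

print(norm.cdf(-1))                         # ~0.159: area left of -1 under N(0, 1)
```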

Sampling

  • Population is the entire group to draw conclusions about.
  • Sample is a specific group selected from the population and collected from, used to make inference of the population.
  • Sampling bias happens when the sample does not represent the population.
  • Random sampling removes bias by ensuring each object in the population has an equal chance of being selected.
  • Sampling error is the chance deviation of a sample statistic from its population value, arising because only part of the population is observed.
  • As the sample size increases, the sample average tends to approach the population mean.
  • Sample averages are distributed relatively symmetrically around the population mean.
  • Law of large numbers states that the larger the sample size, the closer the sample average is to the population mean.
  • Central limit theorem states that sample averages are distributed relatively symmetrically (approximately Normal) around the population mean; both results are illustrated in the simulation sketch below.
  • Joint distributions of two variables can be displayed for small datasets in a cross-tabulation.
  • Rows represent the values Y can take and columns represent the values X can take.
  • A cross-tabulation can also be used to present shares rather than counts.
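
A minimal simulation sketch (Python/numpy, with a hypothetical skewed population) of the law of large numbers and the central limit theorem:

```python
# Minimal simulation of the law of large numbers and central limit theorem.
import numpy as np

rng = np.random.default_rng(2)
population = rng.exponential(scale=2.0, size=100_000)  # skewed; mean = 2

# Law of large numbers: larger samples give averages closer to the mean.
for n in (10, 100, 10_000):
    print(n, rng.choice(population, size=n).mean())

# Central limit theorem: the distribution of many sample averages is
# roughly symmetric/Normal around the population mean despite the skew.
means = [rng.choice(population, size=50).mean() for _ in range(2_000)]
print(np.mean(means), np.std(means))  # near 2.0, SD roughly 2 / sqrt(50)
```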

Joint density

  • Means the probability that random variable X takes value x and random variable Y takes value y.

Lecture 3

  • Conditional expectation: expectation of random variable Y when another random variable X takes a value x.
  • Marginal distribution of Y means the probability of Y not taking X into account.
  • Marginal distribution of X means the probability of X not taking Y into account.
  • Conditional distribution means the probability that Y takes y conditional on the fact that X takes x.
  • Conditional expectation represents the population average of Y when X is fixed.
  • Income distribution: the horizontal distance between CDFs from different years represents the dollar change at each percentile per year; however, these may not be the same people, since we are comparing percentiles of CDFs, not individuals.
  • Scatter plots are used to see associations between variables.
  • Covariance measures the direction of a relationship between two variables (positive, negative, or zero).
  • Correlation measures the linear dependence between two variables and is bounded between -1 and 1 (see the sketch below).
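
A minimal Python/numpy sketch (hypothetical data) of covariance and correlation:

```python
# Minimal sketch of covariance and correlation (hypothetical data).
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(scale=0.8, size=200)  # positively related to x

print(np.cov(x, y)[0, 1])       # covariance: direction of the relationship
print(np.corrcoef(x, y)[0, 1])  # correlation: linear dependence, in [-1, 1]
```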

Regression Model

  • Y = β₀ + β₁X + ε
  • Y is the outcome or dependent variable.
  • X is the regressor or independent variable.
  • β₁ = Cov(X, Y) / Var(X)
  • ε is the error term, representing relevant unobserved factors.
  • The betas are set by ordinary least squares (OLS): choosing the values of β₀ and β₁ that minimize the sum of squared differences between the observed data and the regression model's predictions (as sketched below).
  • The slope formula is closely related to the Pearson correlation coefficient, as both are built from Cov(X, Y).
  • Intergenerational mobility concerns whether everyone has equal opportunities regardless of family background.
  • Children's position in the income distribution relative to their parents' position is an incomplete but powerful tool for measuring inequality.
  • If the conditional expectation of Y is linear in X, linear regression recovers the true relationship exactly; even if it is not linear, the regression provides the best linear approximation.
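
A minimal Python/numpy sketch of the OLS formulas above, on hypothetical data with known true parameters:

```python
# Minimal sketch of OLS via the formulas above (hypothetical data).
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=300)
y = 1.0 + 2.0 * x + rng.normal(size=300)  # true beta0 = 1, beta1 = 2

beta1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # beta1 = Cov(X, Y) / Var(X)
beta0 = y.mean() - beta1 * x.mean()             # intercept from the OLS solution
print(beta0, beta1)                             # recovers roughly 1 and 2
```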

Steps to setup

  1. Create a scatter plot of the data (adding some jitter/noise can make overlapping points more visible).
  2. Measure linear dependence: find the correlation and estimate a linear regression to get the parameter estimates.
  3. Check whether any summary statistics are helpful.
  4. Compare sample averages of the outcome by values of x.
  5. Extend to a multivariate regression if it fits the data better; higher-order polynomials can fit the sample more closely, but be careful with overfitting (see the sketch after this list).
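
A minimal Python/numpy sketch of the overfitting caution in step 5 (hypothetical data): higher-degree polynomials always fit the sample at least as well, even when the extra terms only chase noise:

```python
# Minimal sketch: in-sample fit improves with polynomial degree even
# though the true relationship here is linear -- the overfitting risk.
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 30)
y = 2.0 * x + rng.normal(scale=0.3, size=30)   # truly linear relationship

for degree in (1, 3, 9):
    coefs = np.polyfit(x, y, degree)               # fit polynomial of this degree
    sse = np.sum((y - np.polyval(coefs, x)) ** 2)  # in-sample squared error
    print(degree, sse)                             # falls as degree rises
```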

Lecture 4

  • R²: measures the share of variation in the dependent variable explained by the independent variable or fitted model; if R² = 1, the regression model perfectly predicts the data.
  • Causal questions aim to compare counterfactual states of the world, specifically the impact of X on Y (how would Y change if we changed X, where Y is the outcome and X is the treatment?).
  • Ceteris paribus means everything is the same except the treatment.
  • Counterfactual asks what would have happened to the treated group in the absence of treatment.
  1. Impact of...treatment
  2. In comparison to...counterfactual
  3. Impact On... outcome
  4. Impact for... population
  • Example: the impact of Jakarta's high-occupancy vehicle restriction (treatment), in comparison to unrestricted road travel (counterfactual), on drivers' (population) travel time (outcome).
  • Binary treatments: each individual either receives the treatment or not, recorded by a treatment status indicator.
  • Yᵢ(1) − Yᵢ(0), in potential outcomes notation, is the treatment effect for an individual i, where Y is the outcome.
  • Average treatment effect (ATE): the expected impact of the treatment for the whole population.
  • External validity: whether the ATE or ATT applies to other populations; treatment effects may be heterogeneous, so the ATT may not be a good predictor of what would happen if the treatment were expanded to a larger population.
  • Counterfactuals can be constructed with structural models, which are quantitative models of alternative states of the world, or with:
  • Control groups: comparing the treatment group to an otherwise similar control group, where everything affects both groups similarly except the treatment.
  • Control groups are used to approximate what would have happened in the absence of treatment.
  • Invalid control groups lead to selection bias.
  • Randomization eliminates selection bias by ensuring T and C are on average statistically equivalent.
  • Similar distributions of characteristics, with confounders equally distributed across T and C (see the simulation sketch below).
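
A minimal simulation sketch (Python/numpy; all numbers hypothetical) of why a self-selected, invalid control group produces selection bias while randomization recovers the true effect:

```python
# Minimal simulation: self-selection biases the naive comparison, while
# randomization balances the confounder and recovers the true ATE.
import numpy as np

rng = np.random.default_rng(6)
n = 10_000
ability = rng.normal(size=n)        # confounder
y0 = ability + rng.normal(size=n)   # outcome without treatment
y1 = y0 + 1.0                       # outcome with treatment (true ATE = 1)

# Invalid control group: high-ability individuals opt into treatment.
opted_in = ability > 0
naive = y1[opted_in].mean() - y0[~opted_in].mean()  # selection bias

# Randomization: treatment status is independent of ability.
treated = rng.random(n) < 0.5
randomized = y1[treated].mean() - y0[~treated].mean()

print(naive, randomized)  # naive is biased upward; randomized is close to 1
```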

Lecture 5

  • Columns 1 and 2 show the averages and Column 3 shows the difference in averages (The point estimate).
  • Row 2 shows the standard errors.
  • Point Estimate: the single best guess for the value of a parameter.
  • We need to determine how likely we would be to get that point estimate if the true effect were zero.
  • If that probability is less than 0.05 (5%), the result is statistically significant.
  • Simulating a test distribution: create a distribution of placebo treatments to see how large randomly occurring differences can be (as sketched below).
  • Differences across random splits can be large due to chance alone.
  • P-value: probability of obtaining a result at least as extreme as the one observed under the null hypothesis (treatment is zero).
  • P < 0.05: strong evidence against null hypothesis -> the effect is statistically significant.
  • Experiments yield more precise evidence when the outcome variable has less variation and when the experiments are larger.
  • T-statistic: how many standard errors the estimated coefficient is away from zero (the null hypothesis; treatment is 0).
  • P > 0.05: insufficient evidence against the null hypothesis -> we cannot reject the null; the effect is not statistically significant.
  • Confidence intervals represent the range of values consistent with the observed data; values inside vary in their compatibility, with the point estimate being the most compatible.
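
A minimal Python/numpy sketch of simulating a test distribution with placebo treatments (the data and effect size are hypothetical):

```python
# Minimal sketch of a placebo (permutation) test distribution.
import numpy as np

rng = np.random.default_rng(7)
outcomes = rng.normal(size=200)
treated = np.zeros(200, dtype=bool)
treated[:100] = True
outcomes[treated] += 0.4  # hypothetical treatment effect

observed = outcomes[treated].mean() - outcomes[~treated].mean()  # point estimate

placebo = []
for _ in range(5_000):
    fake = rng.permutation(treated)  # a random placebo split of the labels
    placebo.append(outcomes[fake].mean() - outcomes[~fake].mean())

p_value = np.mean(np.abs(placebo) >= abs(observed))  # two-sided p-value
print(observed, p_value)  # small p-value -> statistically significant
```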

Lecture 6

  • Type 1 Error (false positive): claiming an effect when it doesn't exist.
  • Type 2 Error (false negative): failing to detect an effect that actually exists.
  • Power: The probability of finding an effect when it exists.
  • There is a trade-off between type 1 and type 2 errors.
  • The likelihood of false positives does not vary with sample size; the probability of a false positive is set by the significance level.
  • Small samples have more variability, so false positives tend to show exaggerated effects; in large samples the random fluctuations balance out.
  • Problems that follow from underpowered studies:
  • They can lead to bad policy choices and draw media attention to big effects rather than the truth.
  • Publication bias: academic journals are more likely to publish studies with significant results than insignificant ones.
  • File-drawer effect: researchers don't finish papers with statistically insignificant results because they are unlikely to be published.
  • P-hacking: researchers run many specifications and report only those with p < 0.05.
  • If too many tests are run there is an increased likelihood of false positives (the multiple comparison problem).

Approaches to Human Mistakes

  1. Pre-registration of randomized controlled trials (RCTs): publicly document the study design, hypotheses, and analysis plan before collecting or analysing data.
  2. Replication files: post code and data so other researchers can analyse the robustness of the results.
  3. Running larger experiments, which yields lower variability and more precise estimates.
  • Power depends on (illustrated in the simulation sketch at the end of these notes):
  1. True effect size
  2. Sample size
  3. Variability of the outcome variable
  4. Statistical significance level.
  • The Moving to Opportunity study:
  • Housing vouchers were offered to randomly selected low-income families in high-poverty areas, with three groups:
    • Experimental: vouchers with restriction (move to low-poverty neighbourhoods)
    • Section 8: unrestricted vouchers
    • Control: continue public housing with no vouchers.
  • The study found positive effects.
  • Not everyone who was offered a voucher used it -> non-compliance, making it hard to estimate the true treatment effect.
  • Self-selection bias arises when those who take up the treatment differ from the full treatment group.
  • Comparing only those who used the voucher with the entire control group would therefore produce a biased estimate.
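
A minimal power simulation sketch (Python with numpy and scipy; effect size, outcome variability, and sample sizes are hypothetical), showing how power rises with sample size, with failures to detect the true effect being Type II errors:

```python
# Minimal power simulation: share of experiments detecting a true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
true_effect, sd = 0.3, 1.0  # hypothetical effect size and outcome variability

for n in (20, 100, 500):
    rejections = 0
    for _ in range(1_000):
        control = rng.normal(0.0, sd, size=n)
        treated = rng.normal(true_effect, sd, size=n)
        _, p = stats.ttest_ind(treated, control)
        rejections += p < 0.05       # reject the null at the 5% level
    print(n, rejections / 1_000)     # estimated power; the rest are Type II errors
```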

Description

Questions cover descriptive statistics, data interpretation, distributions, and causal inference. Topics include standard deviation, coefficient of variation, percentile ratios, and the Central Limit Theorem. Focus on understanding statistical measures and their implications.
