Questions and Answers

In a histogram, what does the height of each bar typically represent?

  • The cumulative count of observations up to that value.
  • The total number of observations in the dataset.
  • The frequency or proportion of observations falling within that bin. (correct)
  • The precise value of each observation.

What is the primary purpose of dividing observations into bins when constructing a histogram?

  • To represent a true density function of a continuous variable.
  • To simplify the data by grouping observations into intervals. (correct)
  • To ensure that each observation has a unique representation.
  • To allow for easy calculation of the mean and median.

If you increase the number of bins in a histogram, what effect does this have on the display, assuming the data remains the same?

  • The overall shape of the distribution remains unchanged.
  • The bars become narrower, potentially revealing more detail but also more noise. (correct)
  • The histogram becomes a scatterplot.
  • The bars become wider, smoothing out the distribution.

Why is it important that each observation in a dataset is allocated to a single bin when creating a histogram?

  • To prevent overlaps in the data representation and maintain accurate counts. (correct)

A researcher wants to visualize the distribution of income in a city. Which of the following would be the most appropriate way to present this data to show income density?

  • A histogram with income ranges on the x-axis and density on the y-axis. (correct)

How does changing the number of bins in a histogram affect the representation of data?

  • It can reveal different patterns or trends in the same data by grouping values differently. (correct)

For a continuous random variable X with a density function $f_X(x)$, what does the integral $\int_{a}^{b} f_X(x) dx$ represent?

  • The probability that X falls within the interval [a, b]. (correct)

Why is the probability of a continuous random variable $X$ taking a specific value $x$ (i.e., $P(X = x)$) considered to be zero?

  • Because the integral of the density function over a single point is always zero. (correct)

Given a small variation $h > 0$, how is the density function $f_X(x)$ for a continuous stochastic variable $X$ related to probability?

  • $f_X(x) \approx P(X = x \pm h/2) / h$ (correct)

The approximation $f_X(x) \approx P(X = x \pm h/2) / h$ is the basis for the definition of which statistical method?

  • Kernel Density Estimation (correct)

What is the fundamental question that the Cumulative Distribution Function (CDF) $F_X(t)$ for a continuous variable answers?

  • What fraction of the observations have values of x below t? (correct)

Given a standardized normal distribution, what does $F_X(0) = 0.5$ imply?

  • 50% of the observations have values below 0. (correct)

If $F_X(-1) = 0.159$ for a standardized normal distribution, what percentage of observations have values of x greater than -1?

  • 84.1% (correct)

The cumulative distribution function (CDF) is defined as $F_X(t) = \int_{-\infty}^{t} f_X(s) ds$. What does $f_X(s)$ represent in this equation?

  • The probability density at the value s. (correct)

Which of the following is NOT a property of a Cumulative Distribution Function (CDF)?

  • It gives the probability that a variable is equal to a specific value. (correct)

Consider two values, a and b, where a < b. How can you use the CDF, $F_X(x)$, to find the probability that a random variable X falls between a and b?

  • $F_X(b) - F_X(a)$ (correct)

Suppose you have a random variable Y and its CDF, $F_Y(t)$. If $F_Y(5) = 0.8$, which of the following statements is the most accurate interpretation?

  • 80% of the values of Y are less than or equal to 5. (correct)

Based on the kernel density plot (kdensity earn, bw(1000)), what can be inferred about the distribution of incomes in the FLEED teaching data?

  • The distribution is bimodal, with peaks roughly around 20,000 and 60,000. (correct)

If $F_X(t)$ represents the cumulative distribution function (CDF) for a continuous variable X, how would you interpret $F_X(50000)$?

  • The fraction of observations that are less than or equal to 50,000. (correct)

In the context of kernel density estimation, what is the primary role of the bandwidth parameter?

  • To control the spread or smoothness of the estimated density. (correct)

Given the kernel density plot, what would be the most accurate method to estimate the proportion of the sample with incomes between $20,000 and $60,000?

  • Calculate the area under the curve between $20,000 and $60,000. (correct)

Suppose you want to compare income distributions across two different years using kernel density estimates. What is the most important consideration when interpreting any observed differences?

  • Using the same bandwidth for both years to allow for a fair comparison of the shapes of the distributions. (correct)

How does the Epanechnikov kernel differ from other kernel functions (e.g., Gaussian) in kernel density estimation?

  • It has a bounded support, meaning it assigns zero weight to data points outside a certain range. (correct)

Which of the following is a potential drawback of using a very small bandwidth in kernel density estimation?

  • It can lead to a density estimate that is too sensitive to individual data points, resulting in a noisy estimate. (correct)

What is the relationship between the kernel density estimator and the histogram?

  • The kernel density estimator is generally smoother than a histogram because each data point contributes a small amount to the density estimate around it, whereas a histogram counts observations within discrete bins. (correct)

Flashcards

Histogram

A visualization of a variable's distribution: observations are grouped into bins and each bin is drawn as a bar.

Histogram Bar Height

The height represents the proportion of observations falling into that category.

Histogram Bin

A range of values grouped together in a histogram.

Exclusive Bins

Each piece of data goes into only one group.

Exhaustive Bins

All pieces of data must be assigned to one of the bins.

Bin Adjustment

Changing the number of intervals in a histogram to reveal different patterns in the data.

P(X ∈ A)

For a continuous variable X, the probability of X falling within a set A.

Density Function

A function describing the relative likelihood of a continuous random variable taking a specific value.

P(X = x) = 0

For continuous variables, the probability of observing any one specific value is exactly zero.

Income Density Plot

A graphical representation showing the distribution of income within a sample.

Kernel Density Estimator

A non-parametric way to estimate the probability density function of a random variable.

Epanechnikov Kernel

A kernel function with bounded support: it assigns more weight to points closer to the center and zero weight to points outside a certain range.

Bandwidth in KDE

A parameter controlling the smoothness of the density estimate.

Cumulative Distribution Function (CDF)

A function that gives the probability that a variable X is less than or equal to a certain value.

CDF Definition

The integral of the probability density function from negative infinity to t.

CDF Purpose

Determines the proportion of observations with values below a specified threshold.

CDF Visual

The area under the density curve up to a certain point.

What is a Cumulative Distribution Function (CDF)?

A function that gives the probability that a random variable X is less than or equal to a certain value t.

CDF formula for continuous variables

For a continuous variable, the CDF, denoted $F_X(t)$, is the integral of the probability density function $f_X(s)$ from negative infinity to t.

What does the CDF tell us?

The CDF indicates the proportion (or percentage) of observations in a dataset that fall below a specific value t.

CDF Example: $F_X(-1)$ for standard normal distribution

For a standardized normal distribution, approximately 15.9% of the observations have values below -1.

CDF Example: $F_X(0)$ for standard normal distribution

For a standardized normal distribution, 50% of the observations have values below 0. This makes sense because the standard normal distribution is symmetric around 0.

What does the y-axis of a CDF represent?

The y-axis of a CDF represents the cumulative probability, indicating the fraction (or percentage) of data points falling below a given x-value.

What does the x-axis of a CDF represent?

It represents the independent or input variable for which we are calculating the cumulative probability.

Study Notes

Logistics

  • Bring name placards to class
  • Pre-class assignment 1 was due 15 minutes before the class
  • Students are allowed up to 2 skips without penalty
  • Assignments are graded pass/fail based on effort
  • There is an in-class worksheet to complete during the class
  • The worksheet must be picked up from the front of the class
  • Students can take a photo or scan of the completed worksheet
  • The worksheet must be submitted on MyCourses before the next class
  • The worksheet is graded pass/fail based on accuracy

Today's Learning Objectives

  • Descriptive Statistics
    • Including mean, variance, and standard deviation
    • Including median and quantiles
    • Including density functions
    • Including joint distributions
  • Sample and Population
    • Including representativeness
    • Including sampling error

Descriptive Statistics

  • Summarisation of information to make data understandable
  • The objective is to reduce the amount of numbers while losing as little information as possible
  • Stata's summarize command reports summary statistics, and its detail option adds quantiles and further measures; both can be abbreviated (e.g. sum earn, d)
  • Descriptive statistics typically report:
    • The sample mean, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
    • Single number measures of variation
    • Selected quantiles

Measures of Variation

  • Variance: $\mathrm{Var}(x) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  • Standard deviation: $\mathrm{SD}(x) = \sqrt{\mathrm{Var}(x)}$
  • Coefficient of variation: the standard deviation normalised by the mean, so that variation can be compared across variables:
    • $\mathrm{CV}(x) = \mathrm{SD}(x) / \bar{x}$
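
As a rough illustration of the formulas above, the following Python sketch computes the mean, variance, standard deviation and coefficient of variation by hand. The `describe` helper and the `earn` values are made up for this example and are not from the lecture data; the variance divides by n, as in the notes.

```python
import math

def describe(values):
    """Return mean, variance (1/n form), standard deviation and coefficient of variation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n  # divides by n, as in the notes
    sd = math.sqrt(variance)
    cv = sd / mean  # not defined when the mean is zero
    return mean, variance, sd, cv

# Hypothetical earnings, for illustration only
earn = [10.0, 20.0, 30.0, 40.0]
mean, variance, sd, cv = describe(earn)
print(mean, variance, sd, cv)
```

Because the coefficient of variation divides by the mean, it is unit-free, which is what makes it comparable across variables measured on different scales.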

Quantiles

  • Quantile Q(p) represents the value below which a fraction p of observations fall
  • Some quantiles have specific names, like the median
  • Q(.5) is the median, where 50% of observations lie below this value
  • Quartiles include Q(.25), Q(.5), and Q(.75)
  • Deciles include Q(.1), Q(.2), ..., Q(.9)
  • Percentiles include Q(.01), Q(.02), ..., Q(.99)
  • The width of a distribution can be characterised using percentile ratios, for example:
    • 90/10 ratio: Q(.9)/Q(.1) = 15
    • 90/50 ratio: Q(.9)/Q(.5) = 2.1
    • 50/10 ratio: Q(.5)/Q(.1) = 7
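
A minimal Python sketch of quantiles and percentile ratios, assuming the simple definition that Q(p) is the smallest observed value with at least a fraction p of observations at or below it (statistical packages may interpolate differently). The `quantile` helper and the data are hypothetical.

```python
import math

def quantile(values, p):
    """Smallest observed value with at least a fraction p of observations at or below it."""
    ordered = sorted(values)
    # round() guards against floating-point noise in p * n before taking the ceiling
    rank = max(math.ceil(round(p * len(ordered), 9)), 1)
    return ordered[rank - 1]

earn = list(range(1, 101))  # hypothetical data: 1, 2, ..., 100

q10 = quantile(earn, 0.1)
q50 = quantile(earn, 0.5)  # the median
q90 = quantile(earn, 0.9)

print("90/10 ratio:", q90 / q10)
print("90/50 ratio:", q90 / q50)
print("50/10 ratio:", q50 / q10)
```

The three ratios are linked: the 90/10 ratio is the product of the 90/50 and 50/10 ratios, which is consistent with the figures above (2.1 × 7 ≈ 15).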

Density Functions (Discrete Variables)

  • If the distribution of a random variable X is discrete, its density function is $f_X(x) = P(X = x)$
  • The function gives the probability that X takes a specific value x
  • Conditions:
    • $f_X(x) \geq 0$ for all x
    • $\sum_{x} f_X(x) = 1$
  • The probability that X takes a value within a set A is $P(X \in A) = \sum_{x \in A} f_X(x)$
  • A histogram represents the empirical counterpart of a density function for a discrete variable
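
To make the discrete case concrete, here is a small Python sketch of an empirical density function: the share of observations at each value. The helper name `empirical_density` and the data are assumptions made for illustration.

```python
from collections import Counter

def empirical_density(values):
    """Share of observations taking each distinct value."""
    n = len(values)
    return {x: count / n for x, count in Counter(values).items()}

# Hypothetical discrete variable, e.g. number of children per household
children = [0, 1, 1, 2, 2, 2, 3, 0]
f = empirical_density(children)

# P(X in A) is the sum of the density over the values in A
A = {1, 2}
prob_A = sum(f[x] for x in A)

print(f)       # shares are non-negative and sum to 1
print(prob_A)
```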

Histograms

  • The height of each bar in a histogram represents the fraction of observations falling in that bin
  • To draw a histogram, the observations are divided into bins
    • Each observation is allocated to exactly one bin, and every observation is assigned to some bin
    • The width of the bin represents the range of values that observations within that bin can take
    • Changing the number of bins gives different perspectives on the same data
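
The binning rules above can be sketched in a few lines of Python. This is an illustrative implementation with equal-width bins, not how any particular package draws histograms; `histogram` and the data are hypothetical, and the sketch assumes the values are not all identical.

```python
def histogram(values, n_bins):
    """Fraction of observations in each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for x in values:
        # The maximum goes into the last bin, so every observation
        # lands in exactly one bin
        i = min(int((x - lo) / width), n_bins - 1)
        counts[i] += 1
    return [c / len(values) for c in counts]

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # hypothetical observations

coarse = histogram(data, 2)  # two wide bins
fine = histogram(data, 4)    # four narrower bins

print(coarse)
print(fine)
```

With two bins the outlying value is lumped together with its neighbour; with four bins it shows up in a bar of its own, which is the kind of change in perspective that the number of bins controls.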

Density Functions (Continuous Variables)

  • If X is continuous, the probability that X takes a value within a set A is $P(X \in A) = \int_A f_X(x)\,dx$
  • A continuous variable can take infinitely many values, so the probability that X takes any one specific value is zero: $P(X = x) = \int_x^x f_X(s)\,ds = 0$
  • For a small $h > 0$:
    • $f_X(x) \approx P(X = x \pm h/2) / h$
    • $(X = x \pm h/2)$ means $(x - h/2 \leq X \leq x + h/2)$
    • This is the basis for the definition of a kernel density estimator
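
A quick numerical check of this idea, assuming a standard normal variable purely for illustration: the probability of a narrow interval around x, divided by the interval width h, is close to the density at x. The helper name `density_from_interval` is hypothetical.

```python
from statistics import NormalDist

X = NormalDist(mu=0.0, sigma=1.0)  # standard normal, chosen for illustration

def density_from_interval(x, h):
    """P(x - h/2 <= X <= x + h/2) divided by the interval width h."""
    return (X.cdf(x + h / 2) - X.cdf(x - h / 2)) / h

approx = density_from_interval(1.0, 0.01)
exact = X.pdf(1.0)
point_probability = X.cdf(1.0) - X.cdf(1.0)  # an interval of width zero

print(approx, exact)
print(point_probability)
```

The interval probability itself shrinks to zero as h shrinks, which is why P(X = x) is zero, while the ratio to h settles at the density.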

Kernel Density Estimation

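  • A kernel density estimator is a local average around a value x: $\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i)$, where $K_h(u) = \frac{1}{h} K(u/h)$
  • Bandwidth (h): determines how much data around x is used
  • Kernel function ($K_h$): weights the observations within the bandwidth
  • Stata chooses an “optimal” bandwidth by default
    • A larger bandwidth uses more data around x and gives a smoother estimate; a smaller bandwidth disregards more data and gives a noisier one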
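
A minimal sketch of a kernel density estimator in Python, written with the bandwidth explicit as (1 / (n h)) Σ K((x − xᵢ) / h). It uses the textbook form of the Epanechnikov kernel, which may be scaled differently in a given software package; the function names and data are assumptions for illustration.

```python
def epanechnikov(u):
    """Textbook Epanechnikov kernel: positive on (-1, 1), zero outside."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def kde(x, data, h):
    """Kernel density estimate at x: (1 / (n h)) * sum of K((x - x_i) / h)."""
    return sum(epanechnikov((x - xi) / h) for xi in data) / (len(data) * h)

data = [1.0, 2.0, 2.5, 4.0]  # hypothetical observations

narrow = kde(2.0, data, 0.5)  # only observations within 0.5 of x get weight
wide = kde(2.0, data, 2.0)    # observations within 2.0 of x get weight

print(narrow, wide)
```

With h = 0.5 only the observation at 2.0 contributes to the estimate at x = 2; with h = 2.0 three of the four observations do, which is the sense in which a larger bandwidth averages over more data.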

Cumulative Distribution Function (CDF)

  • The CDF of a continuous variable gives the fraction of observations with values of x below t: $F_X(t) = \int_{-\infty}^{t} f_X(s)\,ds$
  • For instance, in a standardized normal distribution:
    • $F_X(-1) = 0.159$
    • $F_X(0) = 0.5$
  • Plotting $F_X(t)$ for all values of t traces out the entire function
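
The standard normal values above can be reproduced with Python's standard library. The sketch also shows two common uses of the CDF: the share above a threshold and the probability of an interval. The variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

below_minus_one = Z.cdf(-1.0)            # about 0.159
below_zero = Z.cdf(0.0)                  # 0.5
above_minus_one = 1.0 - below_minus_one  # share of observations above -1
between = Z.cdf(1.0) - Z.cdf(-1.0)       # P(-1 <= X <= 1) = F(1) - F(-1)

print(round(below_minus_one, 3), below_zero)
print(round(above_minus_one, 3), round(between, 3))
```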

Population and Sample

  • Population: the entire group of N units about which conclusions are drawn
  • Sample: a specific group of n units selected from the population to collect data
    • The aim is inference about the population
    • Inference means drawing conclusions from evidence and reasoning rather than from explicit statements
  • Sampling bias: arises when the sample does not represent the population
  • Sampling error: exceptional observations may be sampled by chance

Sampling Bias: 1936 US Presidential Election Polls

  • Literary Digest sent 10 million “straw” ballots, asking who people intended to vote for in the upcoming election
    • 2.4 million ballots were returned
    • 57% indicated they would vote for Landon
    • 43% indicated they would vote for Roosevelt
  • Fact: Roosevelt won the election with 62% of the vote
  • George Gallup also conducted a poll
    • Sample size was just 50,000
    • Prediction: 56% for Roosevelt

Sampling Error

  • Random sampling removes bias
    • Each unit in the population has the same probability of being selected into the sample
  • Sampling error remains: the difference between a sample statistic and the population parameter that arises by chance

Sampling Error Example

  • The population mean income among 15-64 year olds in Finland in 2010 (N ≈ 3.5M) was €26,144: $\mu_x = \frac{1}{N} \sum_{i=1}^{N} x_i$
  • Sample average: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  • The larger the sample size, the closer the sample average tends to be to the population mean
  • Sample averages are distributed relatively symmetrically around the population mean
  • These properties are known, respectively, as:
    • The Law of Large Numbers
    • The Central Limit Theorem
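
Sampling error can be explored with a small simulation. The sketch below draws repeated random samples from a made-up, right-skewed population (it is not the Finnish income data) and compares how much the sample averages vary for small and large samples. All names and parameter values are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed "income" population with a mean near 26,000
population = [rng.expovariate(1 / 26000) for _ in range(100_000)]
mu = statistics.fmean(population)

def sample_means(n, reps=500):
    """Averages of `reps` simple random samples of size n, drawn without replacement."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(reps)]

spread_small = statistics.pstdev(sample_means(10))
spread_large = statistics.pstdev(sample_means(1000))

print("population mean:", round(mu))
print("spread of sample averages, n=10:  ", round(spread_small))
print("spread of sample averages, n=1000:", round(spread_large))
```

The averages from the larger samples cluster much more tightly around the population mean, which is the pattern described above.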

Cross Tabulation (Joint Distribution)

  • A cross tabulation is a table that displays the joint distribution of two variables
  • The table has:
    • Rows = values of Y
    • Columns = values of X
    • Cells = observations with value (y, x)
  • Cross tabulation cells report the share of observations with value (y, x)
  • This is the empirical counterpart of the joint density function $f_{XY}(x, y) = P(X = x, Y = y)$
  • $P(X = x, Y = y)$ is the probability that the random variable X takes the value x and the random variable Y takes the value y
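
A cross tabulation of shares can be sketched in Python as follows. The `cross_tab` helper and the observations are hypothetical; each cell holds the share of observations with that (y, x) combination.

```python
from collections import Counter

def cross_tab(pairs):
    """Share of observations for each (y, x) combination."""
    n = len(pairs)
    return {cell: count / n for cell, count in Counter(pairs).items()}

# Hypothetical (y, x) observations
obs = [
    ("employed", "female"),
    ("employed", "male"),
    ("employed", "male"),
    ("not employed", "female"),
]
shares = cross_tab(obs)

for (y, x), share in sorted(shares.items()):
    print(y, x, share)
```

Like a density function for a single variable, the cell shares are non-negative and sum to one.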

Summary Of Key Concepts

  • Concepts
    • density function, including CDF
    • joint distributions
  • Things to worry about when using samples
    • representativeness
    • sampling error

The next lecture features:

  • Conditional descriptive statistics
    • and applications to make sense of recent research on inequality

Upcoming Assignments

  • In-class worksheet 1 is due on MyCourses before the next lecture
    • You may upload a photo/scan (preferred), or
    • turn in a paper copy in person at the beginning of the next lecture
  • Pre-class assignment 2 is due 15 minutes before the next lecture
  • Homework 1 is due Jan 15
    • Attend exercise session 1 for practical tools
    • You already have the conceptual tools to get started

Description

Explore the basics of histograms, including the meaning of bar heights and how they are constructed. Understand the effects of changing the number of bins and the importance of proper data allocation. Learn about representing income distribution and probability in continuous random variables.