Descriptive Statistics and Deceptive Descriptions

Questions and Answers

What primary benefit do Jupyter Notebooks offer to data scientists?

  • Enabling the creation of a data-driven narrative through code, documentation, and output. (correct)
  • Facilitating complex statistical computations.
  • Enhancing the security of data science projects.
  • Automating data ingestion from various online sources.

Which of the following descriptive statistics is most susceptible to being skewed by outliers?

  • Percentile
  • Mean (correct)
  • Median
  • Standard Deviation

In assessing the economic health of the middle class, which metric would provide a more accurate representation than per capita income alone?

  • Nominal GDP growth rate.
  • The unemployment rate among college graduates.
  • Change in median wages, adjusted for inflation. (correct)
  • The stock market's performance over the same period.

In an analysis of printer quality based on warranty claims, what does a significant difference between the mean and median number of quality problems suggest?

There are outliers in the data. (B)

When are percentages most likely to be misleading?

When presented without the underlying base number. (B)

A patient's HCb2 blood chemical count is 134. Upon researching online, the patient finds the average for their age group is 122. What additional statistical measure would be most helpful in determining the patient's relative position within the population?

The median and standard deviation of HCb2 counts. (B)

In the context of data analysis, what does 'provenance' refer to?

The original source and history of the data. (D)

What was the primary flaw of the 1936 Literary Digest poll that led to its incorrect prediction of the presidential election outcome?

The poll used a sample of voters that was not representative of the entire electorate. (B)

Which dataset is collected as part of routine organizational bookkeeping activities?

An administrative dataset. (A)

How does stratified sampling differ from simple random sampling?

Stratified sampling divides the population into subgroups (strata) and samples randomly from each, while simple random sampling selects randomly from the entire population. (C)

Which of the following is a risk associated with cluster sampling?

Bias if members within clusters are similar. (D)

In the context of statistical analysis, why is it important to 'pay attention to the unit of analysis'?

To properly interpret the results and avoid making misleading claims. (B)

If a study finds that 80% of Harvard students are first-born children, what additional information is most crucial to determine whether being a first-born child gives an advantage in getting into Harvard?

The percentage of the general population that are first-born children. (A)

Sandel measures $P(F \mid H)$ in his analysis of Harvard students and birth order. What quantity would be more helpful in determining if being a firstborn conveys an advantage over not being firstborn?

$P(H \mid F) / P(H \mid F^c)$ (C)

When analyzing economic health, which approach helps to avoid being misled by descriptive statistics?

Considering median wages and examining changes across different percentile ranges. (A)

What is the most accurate way to describe the relationship between the mean, median, and outliers?

Outliers affect the mean more than the median. (C)

The average price of homes sold in a city increased by 50% over the past year. Which detail would be most crucial in interpreting whether this reflects a genuine increase in housing values?

The median price increase and whether the prices are adjusted for inflation. (B)

What is the standard deviation?

A measure of the spread or dispersion of data points around the mean. (B)

Which scenario best demonstrates the deceptive use of statistics through unclear definitions?

Claiming a decline in the number of car accidents without specifying the change in miles driven. (C)

Imagine you want to get the most accurate data possible regarding the population. What action must you take?

Conduct a census. (B)

What describes a sample drawn from a population at random without replacement?

A simple random sample. (B)

Why is data provenance important?

It provides information on the origin, history, and reliability of the data. (C)

When is cluster sampling most appropriate?

When the population naturally exists in groupings, and surveying entire groups is feasible. (B)

What is the main purpose of using stratified sampling?

To ensure proportional representation of different subgroups within the sample. (A)

Politician A claims, "Our schools are getting worse! Sixty percent of our schools had lower test scores this year than last year." Politician B claims, "Our schools are getting better! Eighty percent of our students had higher test scores this year than last year." What is the key difference between the two messages?

Politician A looks at schools. Politician B looks at individual students. (C)

What is one problem with relying on a single metric such as someone's GPA?

Data can vary depending on the courses taken and grading systems. (A)

Which action would be best suited to assess the economic health of the middle class in a country?

Compare the change in median wages adjusted for inflation. (D)

You are watching a presentation about warranty claims for your company's printers. The average number of quality problems per printer sold is 9.1 for your firm and 2.8 for the competitor. However, after further analysis you find that the median number of quality problems for your firm is 1. What most likely explains the large difference between the median and the average?

Outliers in the data make the mean number of problems higher than the median. (C)

If you invest in a company and receive a letter indicating that profits were 46% higher than the year before, what is the most important thing to be aware of?

The base number of the previous year's annual profits. (C)

How would you describe a normal distribution?

Data clustered around a midpoint, with evenly dispersed tails. (A)

What is a major problem with claiming AT&T covers 97% of the population?

They may not have covered the geographic area. (C)

What action should you take to define the health of the manufacturing industry of a county?

See if manufacturing output and manufacturing employment are up. (D)

What does a 2012 article use to assess Sandel's argument?

$P(F \mid H)$ (B)

Using Bayes' rule, what does $P(H \mid F)$ equal?

$P(H) P(F \mid H) / P(F)$ (B)

In what situation does data quality suffer?

When there may be issues with data provenance. (B)

To measure how dispersed the data are from the mean, what quantity would you use?

The standard deviation. (C)

When looking at examples of normally distributed data, it is extremely helpful if you know what?

If it lies within two or three standard deviations. (C)

According to the slides, what makes statistics deceptive?

Describing data in a variety of ways. (C)

Why are there two schools of thought over first borns?

Because they have the magic touch. (D)

What are you likely to encounter in a self-selected sample?

Bias. (A)

Which statement describes a judgment sample?

The subset is whoever the researcher deliberately selects. (B)

What does a 1988 study say about the flaws of The Literary Digest, 1936?

The poll focused too much on the rich as opposed to the poor. (B)

Flashcards

Jupyter Notebooks

A tool for data scientists to create narratives with code, documentation, and output.

Descriptive Statistics

Summarizes data in an accessible way, simplifying complex information.

Problem 1 with Economic Health Example

Figures are not adjusted for inflation.

Median

Median divides a data distribution into two equal halves.

Further Dividing Distribution Data

Divide into quartiles, deciles or percentiles.

Appropriate Metrics for Economic Health

Inflation adjusted median wages or wages at identified percentiles.

Standard Deviation

Dispersion of data from its mean.

Key to avoiding deceptive descriptions

Understanding the unit of analysis.

Provenance

The origin of the data.

Simple Random Sample (SRS)

A sample where all individuals have an equal chance of being selected.

Cluster Sampling

Divide population into clusters then randomly select clusters.

Stratified Sampling

Divide population into strata then sample from each.

P(H | F)

Chance of Harvard given you're a first-born

P(H | Fc)

Chance you are a Harvard student if not first born

Study Notes

Announcements

  • Lab 0 has been released on LMS; complete the setup before the next lecture on January 27.
  • A Python tutorial, led by Shizza, is scheduled on Zoom.

Notebooks

  • Notebooks enable the telling of a narrative.
  • Notebooks combine code, documentation of the thought process (written in a markdown language), and output.

Descriptive Statistics

  • Descriptive statistics present meaningful information in an easily accessible way, as succinct metrics.
  • For example, a batting average summarizes a career and the Gini index summarizes an income distribution.
  • Over-reliance on descriptive statistics can lead to misleading conclusions.

Deceptive Descriptions

  • Deceptive descriptions occur when statistics are used to describe a complex phenomenon that they do not capture exactly.
  • An example is assessing the economic health of the middle class of a country.
  • The per capita income could be analyzed over 30 years.
  • For example, the average income in the US climbed from $7,787 in 1980 to $26,487 in 2010.
  • Technically this statement is correct, but it is misleading because the figures are not adjusted for inflation.
  • The average income in America is not equal to the income of the average American.
  • A better metric is the change in median wages, adjusted for inflation.

Printer Quality Example

  • A boss shares two data files; one has warranty information on the 57,334 printers your firm sold last year.
  • The other file has information on the 994,773 laser printers sold by a competitor.
  • Each record documents the number of quality problems a printer had during the warranty period.
  • The average number of quality problems per printer sold is 2.8 for the competitor, but 9.1 for your firm.
  • However, the median number of quality problems per printer sold is 2 for the competitor, but 1 for your firm.
  • Outliers inflate the mean but not the median.
  • A small fraction of your firm's printers have a huge number of quality complaints.
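The gap between mean and median in the printer example can be reproduced with a small sketch. The problem counts below are hypothetical, chosen only to show the effect; they are not the warranty data from the example.

```python
from statistics import mean, median

# HYPOTHETICAL problem counts for 100 printers (not the real warranty data):
# most printers have 0-2 problems, but 10 "lemons" have 80 problems each.
problems = [0] * 30 + [1] * 40 + [2] * 20 + [80] * 10

avg = mean(problems)    # pulled far upward by the 10 outliers
mid = median(problems)  # depends only on the middle of the sorted data

print(f"mean = {avg}, median = {mid}")
```

Changing the 80s to 800s would raise the mean tenfold again while leaving the median where it is.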

Median, Quartiles, Deciles, and Percentiles

  • The median divides the distribution in half.
  • The median is the point where half the observations are above, and the other half are below.
  • Distributions can be divided into quartiles, deciles, or percentiles.
  • The first quartile is the bottom 25% of observations.
  • The first decile is the bottom 10% of observations.
  • The first percentile is the bottom 1% of observations.
  • They describe where an observation lies compared with everyone else.
  • This is the difference between "absolute" scores and "relative" scores.
  • Example: a reading comprehension score in the 3rd percentile is a relative score; it says how the score ranks against everyone else.
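A rough sketch of turning an absolute score into a relative one follows. The scores are hypothetical, and `percentile_rank` is an illustrative helper defined here, not a standard library function; it uses the "percentage of observations strictly below" convention, which is one of several in use.

```python
from statistics import median, quantiles

def percentile_rank(scores, value):
    """Percentage of observations strictly below `value` (one common convention)."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)

scores = list(range(1, 101))            # HYPOTHETICAL test scores 1..100
quartile_cuts = quantiles(scores, n=4)  # three cut points splitting the data into quarters
rank = percentile_rank(scores, 4)       # relative standing of an absolute score of 4

print(f"median = {median(scores)}, quartile cuts = {quartile_cuts}, rank = {rank}")
```

The absolute score (4) means little on its own; the rank says how it compares with everyone else.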

Percentages

  • Be wary of percentages that are given without their base numbers.
  • Example: you receive a letter claiming a company's profits were 46% higher than the year before.
  • That is far less impressive if the firm earned 27 cents last year and about 39 cents this year (still a 46% increase).
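The point about base numbers can be checked with a couple of lines of arithmetic. The 27-cent base comes from the example above; the $27 million base is a hypothetical added here only for contrast.

```python
def after_increase(base, pct):
    """Value after growing `base` by `pct` percent."""
    return base * (1 + pct / 100)

small_base = 0.27        # dollars, from the example above
large_base = 27_000_000  # dollars, HYPOTHETICAL contrast case

small_now = after_increase(small_base, 46)
large_now = after_increase(large_base, 46)

print(f"{small_base:.2f} -> {small_now:.2f}")
print(f"{large_base:,} -> {large_now:,.0f}")
```

The same 46% describes a gain of about 12 cents in one case and about $12.4 million in the other.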

The Standard Deviation

  • The standard deviation indicates how dispersed the data are from their mean, i.e. how spread apart the observations are.
  • Example: data are collected on the weights of 250 people on an airplane headed for Boston and the weights of 250 qualifiers for the Boston Marathon.
  • Even if the mean weight for both groups is 155 pounds, the airplane passengers will have a greater standard deviation, because their weights are more spread out.
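As a minimal sketch of the airplane versus marathon comparison, the two tiny samples below are hypothetical weights constructed to share a mean of 155 pounds; they stand in for the 250-person groups in the example.

```python
from statistics import mean, pstdev

# HYPOTHETICAL weights in pounds; both samples are built to average 155.
airline_passengers = [95, 125, 155, 185, 215]    # widely spread
marathon_qualifiers = [145, 150, 155, 160, 165]  # tightly clustered

for name, weights in [("passengers", airline_passengers),
                      ("qualifiers", marathon_qualifiers)]:
    # pstdev treats the list as the whole population of interest
    print(f"{name}: mean = {mean(weights)}, std dev = {pstdev(weights):.1f}")
```

The mean alone cannot tell the two groups apart; the standard deviation can.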

Deceptive Descriptions

  • The field of statistics is rooted in mathematics, and mathematics is exact.
  • However, the use of statistics to describe complex phenomena is not exact.
  • Be sure to pay attention to the unit of analysis.
  • There can be a lack of clarity over what one is defining, describing, or explaining.
  • For example, is the "health" of manufacturing defined by manufacturing output or by manufacturing employment?

Data Designs and Sampling Strategies

  • Provenance: "the place of origin of something."
  • Literary Digest correctly predicted the winner of every presidential race from 1916 to 1932.
  • In 1936, about 40 million voters were expected.
  • Literary Digest sent out 10 million mock ballots, of which it received 2.4 million back.
  • The prediction was that Landon had 57% and Roosevelt 43% of the votes.
  • The actual result was that Landon had 38% and Roosevelt 62% of the votes.
  • A 1988 study showed that the 10 million voters who received mock ballots were not representative.
  • They were found to be drawn from wealthier voters.
  • The 2.4 million voters who responded were also found to be not representative.
  • The most passionate were more likely to respond.
  • The fundamental problem was poor data provenance: big data was wrongly equated with good data.
  • Census: Information on all subjects.
  • Subset or sample: Information on some subjects.
  • Self-selected sample: Subset is whoever chooses to answer.
  • Convenience sample: Subset is whoever is convenient for the researcher.
  • Judgment sample: Subset is whoever the researcher deliberately selects.
  • Probability sample: Subset involves some probabilistic selection.
  • Administrative dataset: A dataset collected as part of administrative work (e.g. social security names, restaurant safety ratings, etc.)
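To see why a huge but unrepresentative mailing list can mislead, here is a simple weighted-average sketch. All group shares and support rates are hypothetical, picked only so that the electorate-wide figure matches the 62% Roosevelt result quoted above; they are not taken from the 1988 study, and the sketch models only the biased mailing list, not the additional non-response bias.

```python
# HYPOTHETICAL numbers, chosen so the electorate total matches the actual 62%.
support = {"wealthier": 0.45, "other": 0.70}       # Roosevelt support within each group
electorate = {"wealthier": 0.32, "other": 0.68}    # group shares among all voters
mailing_list = {"wealthier": 0.90, "other": 0.10}  # group shares among those polled

def estimate(shares, support_by_group):
    """Support a poll reports when groups are sampled in the given shares."""
    return sum(shares[g] * support_by_group[g] for g in shares)

true_support = estimate(electorate, support)
polled_support = estimate(mailing_list, support)

print(f"electorate: {true_support:.1%}, mailing list: {polled_support:.1%}")
```

Sending more ballots to the same skewed list does not move the estimate toward the truth; only changing who is sampled does.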

Simple Random Sample (SRS)

  • A "sample drawn from a population at random without replacement."
  • If we have 6 people (A, B, C, D, E, F) and want to sample 2 of them, the 15 possible samples are AB, AC, AD, AE, AF, BC, BD, BE, BF, CD, CE, CF, DE, DF, and EF.
  • Then P(AB) = P(AC) = ... = P(EF) = 1/15.
  • And the probability that A is in the sample is P(A) = 5/15 = 1/3.
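The counts in the six-person example can be verified by enumerating every possible sample of size 2; this is only a check of the arithmetic, using Python's standard library.

```python
from fractions import Fraction
from itertools import combinations

people = "ABCDEF"
# Every unordered pair is one possible simple random sample of size 2.
samples = ["".join(pair) for pair in combinations(people, 2)]

p_each = Fraction(1, len(samples))  # each sample is equally likely
p_contains_a = Fraction(sum("A" in s for s in samples), len(samples))

print(samples)
print(f"P(each sample) = {p_each}, P(A is sampled) = {p_contains_a}")
```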

Cluster Sampling

  • Divide the population into clusters, then use SRS to choose whole clusters instead of individuals (example: everyone in a town).
  • With 6 people (A, B, C, D, E, F), first divide them into clusters, arbitrarily chosen here as {A, B}, {C, D}, {E, F}. Then use SRS to pick one cluster, for example {A, B}.
  • There is a risk of bias if members of a cluster are similar.
  • It can be cheaper, for example when polling everyone at an address.

Stratified Sampling

  • Divide the population into strata, then draw an SRS from each stratum.
  • Gender identification (Female, Male, Other) is an example of a stratifying variable.
  • With 6 people, first choose the strata, arbitrarily here {A, B, C, D} and {E, F}.
  • Then use SRS to select one person from each stratum.
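The difference between cluster and stratified sampling can be sketched with the same six people. The groupings follow the arbitrary ones used in the notes; the fixed seed is an assumption added here so the run is repeatable.

```python
import random

rng = random.Random(0)  # fixed seed (an assumption) so the sketch is repeatable

# Cluster sampling: randomly pick ONE whole cluster and keep everyone in it.
clusters = [["A", "B"], ["C", "D"], ["E", "F"]]
cluster_sample = rng.choice(clusters)

# Stratified sampling: randomly pick one person from EVERY stratum.
strata = [["A", "B", "C", "D"], ["E", "F"]]
stratified_sample = [rng.choice(stratum) for stratum in strata]

print("cluster sample:", cluster_sample)
print("stratified sample:", stratified_sample)
```

The cluster sample leaves whole groups out, while the stratified sample is guaranteed to include someone from each stratum.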

First Borns and Harvard

  • Harvard philosopher Michael Sandel argues that birth order has a significant impact on work ethic.
  • He asked his classes what fraction of students were first-born and consistently found 75-80%, i.e. P(F|H) = 75-80%.
  • US census data for mothers with children in college is:
  • 1 child: 21%, 2 children: 43%, 3 children: 23%, 4 children: 13%.
  • So the percentage of first-borns is 100 / (21 + 43*2 + 23*3 + 13*4) = 100/228 ≈ 44%.
  • Measuring P(F|H) = 80% and comparing it to 44% is insufficient.
  • One must also consider how many children the mothers of Harvard students tend to have.
  • Harvard mothers tend to have very few children, which by itself raises the share of first-borns.
  • The 2012 study walks through the math of Sandel's argument in great detail. Sandel measures P(F|H).
  • But what he cares about is a different quantity.
  • Sandel wants to show that being first-born conveys an advantage over not being first-born.
  • The quantity he needs is the ratio r = P(H|F) / P(H|Fc).
  • P(H|F): probability of getting into Harvard if first-born.
  • P(H|Fc): probability of getting into Harvard if not first-born. The higher r is, the better for the first-born.
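The 44% figure follows from the census percentages because every mother has exactly one first-born, however many children she has. A short check, treating the "4 children" category as exactly four (an assumption):

```python
# Percent of mothers with children in college, by number of children (from the notes).
family_size_pct = {1: 21, 2: 43, 3: 23, 4: 13}

mothers = sum(family_size_pct.values())  # per 100 mothers, one first-born each
children = sum(size * pct for size, pct in family_size_pct.items())

share_first_born = mothers / children
print(f"{mothers} first-borns among {children} children = {share_first_born:.0%}")
```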

Applying Bayes' Rule to Harvard First Borns: P(F|H)

  • Let's say r = 3, meaning that being first-born makes you 3 times as likely to get into Harvard.
  • If r > 1, it would imply that being first-born gives you a "magic touch" (smart, assistance in demons...
  • Find the chance for a first-born to be at Harvard.
  • We want the chance that you are a Harvard student (H) given that you are first-born (F), P(H|F).
  • Compare it to the chance that you are a Harvard student given that you are not first-born, P(H|Fc).
  • In the population, the fraction of first-borns is 1/N, where N is the fertility rate.
  • Using the US value for 1994, we get r = 3.9.
  • But for the most elite schools in the US, r = 3.2.
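One way to connect these numbers is to apply Bayes' rule to both conditional probabilities. Since P(H|F) = P(H)P(F|H)/P(F) and P(H|Fc) = P(H)P(Fc|H)/P(Fc), the P(H) terms cancel and r = [P(F|H) / (1 - P(F|H))] * [(1 - P(F)) / P(F)]. Assuming P(F) = 1/N for a fertility rate N, this simplifies to r = [P(F|H) / (1 - P(F|H))] * (N - 1). The sketch below takes P(F|H) = 0.80 from Sandel's classroom figure. The values of N are not given in the notes; they are back-solved here so that the formula reproduces the quoted r values, and so are assumptions.

```python
def advantage_ratio(p_f_given_h, fertility_rate):
    """r = P(H|F) / P(H|Fc), assuming P(F) = 1 / fertility_rate."""
    odds_first_born_at_harvard = p_f_given_h / (1 - p_f_given_h)
    return odds_first_born_at_harvard * (fertility_rate - 1)

# N values are ASSUMPTIONS back-solved from the r values quoted in the notes.
r_us = advantage_ratio(0.80, 1.975)
r_elite = advantage_ratio(0.80, 1.8)

print(f"r (N = 1.975) = {r_us:.1f}, r (N = 1.8) = {r_elite:.1f}")
```

With P(F|H) fixed at 80%, a lower fertility rate among the families in question directly lowers r, which is why the number of children Harvard mothers tend to have matters.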

Description

Notebooks in data science allow for narrative presentation with code, documentation, and markdown. Descriptive statistics present data meaningfully but can be misleading. Deceptive descriptions occur when statistics don't fully represent complex phenomena, such as using per capita income to assess the economic health of the middle class.
