Questions and Answers
What primary benefit do Jupyter Notebooks offer to data scientists?
- Enabling the creation of a data-driven narrative through code, documentation, and output. (correct)
- Facilitating complex statistical computations.
- Enhancing the security of data science projects.
- Automating data ingestion from various online sources.
Which of the following descriptive statistics is most susceptible to being skewed by outliers?
- Percentile
- Mean (correct)
- Median
- Standard Deviation
In assessing the economic health of the middle class, which metric would provide a more accurate representation than per capita income alone?
- Nominal GDP growth rate.
- The unemployment rate among college graduates.
- Change in median wages, adjusted for inflation. (correct)
- The stock market's performance over the same period.
In an analysis of printer quality based on warranty claims, what does a significant difference between the mean and median number of quality problems suggest?
When are percentages most likely to be misleading?
A patient's HCb2 blood chemical count is 134. Upon researching online, the patient finds the average for their age group is 122. What additional statistical measure would be most helpful in determining the patient's relative position within the population?
In the context of data analysis, what does 'provenance' refer to?
What was the primary flaw of the 1936 Literary Digest poll that led to its incorrect prediction of the presidential election outcome?
Which dataset is collected as part of routine organizational bookkeeping activities?
How does stratified sampling differ from simple random sampling?
Which of the following is a risk associated with cluster sampling?
In the context of statistical analysis, why is it important to 'pay attention to the unit of analysis'?
If a study finds that 80% of Harvard students are first-born children, what additional information is most crucial to determine whether being a first-born child gives an advantage in getting into Harvard?
Sandel measures $P(F \mid H)$ in his analysis of Harvard students and birth order. What quantity would be more helpful in determining if being a firstborn conveys an advantage over not being firstborn?
When analyzing economic health, which approach helps to avoid being misled by descriptive statistics?
What is the most accurate way to describe the relationship between the mean, median, and outliers?
The average price of homes sold in a city increased by 50% over the past year. Which detail would be most crucial in interpreting whether this reflects a genuine increase in housing values?
What is the standard deviation?
Which scenario best demonstrates the deceptive use of statistics through unclear definitions?
Imagine you want to get the most accurate data possible regarding the population. What action must you take?
What describes a sample drawn from a population at random without replacement?
Why is data provenance important?
When is cluster sampling most appropriate?
What is the main purpose of using stratified sampling?
Politician A claims, "Our schools are getting worse! Sixty percent of our schools had lower test scores this year than last year". Politician B claims, "Our schools are getting better! Eighty percent of our students had higher test scores this year than last year". What is the key difference between the two messages?
What is one problem with relying on a single metric such as someone's GPA?
Which action would be best suited to assess the economic health of the middle class in a country?
You are watching a presentation about warranty claims for your company's printers. The average number of quality problems per printer sold by your firm is 9.1, while the competitor's is 2.8. However, after further analysis you find that the median number of quality problems for your firm is 1. What most likely explains the large difference between the median and the average?
If you invest in a company and receive a letter indicating that profits were 46% higher than the year before, what is the most important thing to be aware of?
How would you describe a normal distribution?
What is a major problem with claiming AT&T covers 97% of the population?
What action should you take to define the health of the manufacturing industry of a county?
What does a 2012 article use to assess Sandel's argument?
Using Bayes' rule, P(H | F) implies what?
In what situation does data quality suffer?
To measure how dispersed data are from the mean, what quantity would you use?
When looking at examples of normally distributed data, it is extremely helpful if you know what?
According to the slides, what makes statistics deceptive?
Why are there two schools of thought over first-borns?
What are you likely to encounter in a self-selected sample?
Which statement describes a judgment sample?
What does a 1988 study say about the flaws of The Literary Digest, 1936?
Flashcards
Jupyter Notebooks
A tool for data scientists to create narratives with code, documentation, and output.
Descriptive Statistics
Summarizes data in an accessible way, simplifying complex information.
Problem 1 with Economic Health Example
Figures are not adjusted for inflation.
Median
Further Dividing Distribution Data
Appropriate Metrics for Economic Health
Standard Deviation
Key to avoiding deceptive descriptions
Provenance
Simple Random Sample (SRS)
Cluster Sampling
Stratified Sampling
P(H | F)
P(H | Fc)
Study Notes
Announcements
- Lab 0 has been released on LMS; complete the setup before the next lecture on January 27.
- A Python tutorial, led by Shizza, is scheduled for Zoom.
Notebooks
- Notebooks enable the telling of a narrative.
- Notebooks combine code, documentation of the thought process (written in a markdown language), and output.
Descriptive Statistics
- Descriptive statistics are useful for presenting meaningful information in an easily accessible way, as succinct metrics.
- For example, a batting average summarizes a baseball career, and the Gini index summarizes an income distribution.
- Over-reliance on descriptive statistics can lead to misleading conclusions.
Deceptive Descriptions
- Deceptive descriptions occur when statistics are used to describe a complex phenomenon inexactly.
- An example is assessing the economic health of a country's middle class.
- The per capita income could be analyzed over 30 years.
- For example, the average income in the US climbed from $7,787 in 1980 to $26,487 in 2010.
- Technically this statement is correct, but the figures are not adjusted for inflation.
- The average income in America is not equal to the income of the average American.
- It is more informative to compare the change in median wages, adjusted for inflation.
Printer Quality Example
- A boss shares two files of data: one with warranty information on the 57,334 printers your firm sold last year.
- The other file has information on the 994,773 laser printers sold by a competitor.
- For each printer, the file records the number of quality problems reported during the warranty period.
- The average number of quality problems per printer sold by the competitor is 2.8, but 9.1 for your firm.
- However, the median number of quality problems per printer sold is 2 for the competitor, but 1 for your firm.
- Outliers inflate the mean but not the median.
- A fraction of printers have a huge number of quality complaints.
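The gap between mean and median can be reproduced with a tiny made-up dataset. The sketch below is only an illustration: the ten values in `problems` are hypothetical (they are not the firm's data), chosen so that a single extreme printer pulls the mean up to 9.1 while the median stays at 1.

```python
from statistics import mean, median

# Hypothetical warranty data (NOT from the source): quality problems per printer.
# Nine ordinary printers plus one "lemon" with a huge number of complaints.
problems = [0, 1, 1, 1, 2, 1, 0, 1, 2, 82]

avg = mean(problems)    # dragged upward by the single outlier
mid = median(problems)  # middle of the sorted values, barely affected

print(f"mean   = {avg}")
print(f"median = {mid}")
```

Dropping the last value from the list brings the mean down to 1.0, which shows how much of the average is driven by one observation.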
Median, Quartiles, Deciles, and Percentiles
- The median divides the distribution in half.
- The median is the point where half the observations are above, and the other half are below.
- Distributions can be divided into quartiles, deciles, or percentiles.
- The first quartile is the bottom 25% of observations.
- The first decile is the bottom 10% of observations.
- The first percentile is the bottom 1% of observations.
- They describe where an observation lies compared with everyone else.
- This is the difference between "absolute" scores and "relative" scores.
- Example: Reading comprehension score in the 3rd percentile.
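As a rough sketch of how these cut points and relative positions can be computed, the example below uses Python's standard `statistics.quantiles`. The `scores` list and the `percentile_rank` helper are hypothetical names introduced here for illustration; the helper counts the share of observations strictly below a value, which is one of several common conventions.

```python
from statistics import quantiles

# Hypothetical test scores (NOT from the source), already sorted for readability.
scores = [52, 61, 64, 70, 73, 75, 78, 81, 85, 88, 90, 97]

# n=4 returns the three cut points that split the data into quartiles;
# n=10 would give deciles and n=100 percentiles.
q1, q2, q3 = quantiles(scores, n=4)

def percentile_rank(value, data):
    """Percent of observations strictly below `value` (a relative score)."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

print(q1, q2, q3)
print(percentile_rank(78, scores))  # share of scores below an absolute score of 78
```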
Percentages
- When given percentages, be wary of figures reported without base numbers.
- Example: receiving a letter claiming a company's profits were 46% higher than the year before.
- Consider if the firm earned 27 cents last year, which turned into 39 cents this year (the 46% increase amounts to only 12 cents).
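To see why the base number matters, the sketch below applies the same 46% increase to two different bases. The function name `absolute_change` and the 27-million-dollar firm are assumptions made up for the comparison; only the 27-cent base and the 46% figure come from the notes.

```python
def absolute_change(base, percent_increase):
    """Absolute gain implied by a percentage increase on a given base."""
    return base * percent_increase / 100

# Same percentage, very different meaning depending on the base.
tiny_firm = absolute_change(0.27, 46)         # base of 27 cents (from the notes)
big_firm = absolute_change(27_000_000, 46)    # hypothetical base of $27 million

print(f"27-cent base:    gain of ${tiny_firm:.2f}")
print(f"$27 million base: gain of ${big_firm:,.0f}")
```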
The Standard Deviation
- Indicates how dispersed the data are around their mean, i.e. how spread apart the observations are.
- Example: Data was collected on the weights of 250 people on an airplane headed for Boston and the weights of 250 qualifiers for the Boston Marathon.
- Suppose the mean weight for both groups is 155 pounds; the airplane passengers will still have the greater standard deviation.
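A minimal sketch of the same idea, using two tiny hypothetical groups rather than 250 people each: both lists below are invented so that their means are exactly 155 pounds, but one is far more spread out. `pstdev` from the standard library computes the population standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical weights in pounds (NOT from the source); both groups average 155.
passengers = [100, 120, 155, 190, 210]   # wide mix of body types
runners = [145, 150, 155, 160, 165]      # tightly clustered around the mean

for name, group in [("passengers", passengers), ("runners", runners)]:
    print(f"{name}: mean={mean(group)}, std dev={pstdev(group):.1f}")
```

The means are identical, so the mean alone cannot distinguish the groups; the standard deviation can.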
Deceptive Descriptions
- The field of statistics is rooted in mathematics, and mathematics is exact.
- However, the use of statistics to describe complex phenomena is not exact.
- Be sure to pay attention to the unit of analysis.
- There can be a lack of clarity over what one is defining, describing, or explaining.
- For example, is the health of manufacturing defined by "manufacturing output" or by "employment"?
Data Designs and Sampling Strategies
- Provenance: "the place of origin of something."
- Literary Digest correctly predicted the winner of every presidential race from 1916 to 1932.
- In 1936, about 40 million voters were expected.
- Literary Digest sent out 10 million mock ballots, of which it received 2.4 million back.
- The prediction was that Landon had 57% and Roosevelt 43% of the votes.
- The actual result was that Landon had 38% and Roosevelt 62% of the votes.
- A 1988 study showed that the 10 million voters who received mock ballots were not representative.
- They were found to be drawn from wealthier voters.
- The 2.4 million voters that responded were also found to be not representative.
- The most passionate were more likely to respond.
- The fundamental problem is poor data provenance: big data was equated with good data.
- Census: Information on all subjects.
- Subset or sample: Information on some subjects.
- Self-selected sample: Subset is whoever chooses to answer.
- Convenience sample: Subset is whoever is convenient for the researcher.
- Judgement sample: Subset is whoever the researcher deliberately selects.
- Probability sample: Subset involves some probabilistic selection.
- Administrative dataset: A dataset collected as part of administrative work (e.g. social security names, restaurant safety ratings, etc.)
Simple Random Sample (SRS)
- A "sample drawn from a population at random without replacement."
- If we have 6 people (A, B, C, D, E, F), the 15 possible samples of size 2 are AB, AC, AD, AE, AF, BC, BD, BE, BF, CD, CE, CF, DE, DF, and EF.
- Each sample is equally likely: P(AB) = P(AC) = ... = P(EF) = 1/15.
- The probability that A is in the sample is P(A) = 5/15 = 1/3.
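The counting in this example can be checked by enumerating every size-2 sample directly. This is a small sketch using `itertools.combinations`; the variable names are introduced here for illustration.

```python
from fractions import Fraction
from itertools import combinations

people = "ABCDEF"

# Every unordered pair drawn without replacement: AB, AC, ..., EF.
samples = ["".join(pair) for pair in combinations(people, 2)]

# Under SRS each sample is equally likely.
p_each = Fraction(1, len(samples))

# Probability that A appears in the sample: count samples containing A.
p_a = Fraction(sum("A" in s for s in samples), len(samples))

print(len(samples), samples)
print(p_each, p_a)
```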
Cluster Sampling
- Divide the population into clusters, then use SRS to choose whole clusters instead of individuals (example: everyone in a town).
- With 6 people (A, B, C, D, E, F), first form the clusters arbitrarily, e.g. {A, B}, {C, D}, {E, F}. Then SRS picks one cluster, e.g. {A, B}.
- Risk of bias if members of clusters are similar.
- Could be cheaper to use when polling everyone at an address.
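A minimal sketch of cluster sampling on the six-person example, assuming the same arbitrary clusters as above. The function name `cluster_sample` and its parameters are hypothetical; the point is only that the random draw happens at the cluster level and every member of a chosen cluster is included.

```python
import random

# Clusters from the six-person example, formed arbitrarily.
clusters = [["A", "B"], ["C", "D"], ["E", "F"]]

def cluster_sample(clusters, k=1, rng=random):
    """Pick k whole clusters by SRS and return all of their members."""
    chosen = rng.sample(clusters, k)
    return [member for cluster in chosen for member in cluster]

sample = cluster_sample(clusters, k=1, rng=random.Random(0))
print(sample)  # both members of one randomly chosen cluster
```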
Stratified Sampling
- Divide the population into strata, then draw an SRS within each stratum.
- Gender identification (Female, Male, Other) is an example of a stratifying variable. With 6 people, first choose the strata, e.g. {A, B, C, D} and {E, F}.
- Then use SRS to select one person from each stratum.
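For contrast with cluster sampling, here is a comparable sketch of stratified sampling on the same six people. The stratum labels and the `stratified_sample` helper are assumptions for illustration; the key difference is that every stratum contributes to the sample, with an SRS drawn inside each one.

```python
import random

# Strata from the six-person example; the labels are hypothetical.
strata = {"stratum_1": ["A", "B", "C", "D"], "stratum_2": ["E", "F"]}

def stratified_sample(strata, per_stratum=1, rng=random):
    """Draw an SRS of size per_stratum from every stratum."""
    return {
        name: rng.sample(members, per_stratum)
        for name, members in strata.items()
    }

sample = stratified_sample(strata, per_stratum=1, rng=random.Random(0))
print(sample)  # one person from each stratum
```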
First Borns and Harvard
- Harvard Philosopher Michael Sandel argues that birth order has a significant impact on work ethic.
- He polled his classes on what fraction of students were first-born and consistently found 75-80%, i.e. $P(F \mid H)$ = 75-80%.
- US census data for mothers with children in college is:
- 1 Child: 21%, 2 Children: 43%, 3 Children: 23%, 4 Children: 13%
- The percentage of children who are first-born is then 100 / (21 + 43 × 2 + 23 × 3 + 13 × 4) = 100 / 228 ≈ 44%.
- Measuring $P(F \mid H)$ = 80% and comparing it to 44% is insufficient.
- One must also consider how many children the mothers of Harvard students tend to have.
- Harvard mothers tend to have very few children, which by itself raises the share of first-borns.
- A 2012 article walks through the math of Sandel's argument in great detail. Sandel measures $P(F \mid H)$.
- But what he cares about is a different quantity.
- Sandel wants to show that being first-born conveys an advantage over not being first-born.
- The quantity he needs is the ratio r = $P(H \mid F)$ / $P(H \mid F^c)$.
- $P(H \mid F)$: Probability of getting into Harvard if first-born.
- $P(H \mid F^c)$: Probability of getting into Harvard if not first-born. The higher r is, the better for the first-born.
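The 44% baseline can be recomputed from the census shares quoted in the notes. In this sketch every family contributes exactly one first-born, so the first-born share is the number of families divided by the number of children; the variable names are introduced here for illustration.

```python
# Share of mothers (in percent) by number of children, as quoted in the notes.
family_share = {1: 21, 2: 43, 3: 23, 4: 13}

# Each family has exactly one first-born child.
first_borns = sum(family_share.values())                       # 100
children = sum(n * share for n, share in family_share.items()) # 21 + 86 + 69 + 52

pct_first_born = 100 * first_borns / children
print(f"{pct_first_born:.1f}% of children are first-born")
```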
Applying Bayes' Rule for Harvard First Borns
- Let's say r = 3, meaning that being first-born makes you 3 times as likely to get into Harvard.
- If r > 1: it'd imply that being first born gives you "magic touch ( smart, assistance in demons...
- Find the chance for a first-born to be at Harvard.
- We want the chance that you are a Harvard student (H) given that you are first-born.
- Compare it to the chance given that you are not first-born.
- The population of students are N first born where it is the fertility rate.
- Using the US value for 1994, we get r = 3.9.
- But for the most elite schools in the US, r = 3.2.
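The link between what Sandel measures, $P(F \mid H)$, and the ratio r can be written out with Bayes' rule. The derivation below is a sketch under the notation used in these notes; it is not necessarily the exact route taken by the 2012 article, whose figures depend on the population share $P(F)$ it assumes.

```latex
% Bayes' rule applied to each conditional probability:
P(H \mid F)   = \frac{P(F \mid H)\, P(H)}{P(F)}, \qquad
P(H \mid F^c) = \frac{P(F^c \mid H)\, P(H)}{P(F^c)}

% Taking the ratio, P(H) cancels:
r = \frac{P(H \mid F)}{P(H \mid F^c)}
  = \frac{P(F \mid H)}{P(F^c \mid H)} \cdot \frac{P(F^c)}{P(F)}
  = \frac{P(F \mid H)\,\bigl(1 - P(F)\bigr)}{\bigl(1 - P(F \mid H)\bigr)\, P(F)}
```

So r depends both on the classroom share of first-borns and on the share of first-borns in the relevant population, which is why the family-size distribution matters.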
Description
Notebooks in data science allow for narrative presentation with code, documentation, and markdown. Descriptive statistics present data meaningfully but can be misleading. Deceptive descriptions occur when statistics don't fully represent complex phenomena, such as using per capita income to assess the economic health of the middle class.